CN116301874A - Code compiling method, electronic device and storage medium - Google Patents

Code compiling method, electronic device and storage medium

Info

Publication number
CN116301874A
Authority
CN
China
Prior art keywords
code
code blocks
task
pipeline
code block
Prior art date
Legal status
Pending
Application number
CN202111576033.7A
Other languages
Chinese (zh)
Inventor
Name withheld at the inventor's request
Current Assignee
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd filed Critical Shanghai Cambricon Information Technology Co Ltd
Priority to CN202111576033.7A
Publication of CN116301874A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00: Arrangements for software engineering
    • G06F8/40: Transformation of program code
    • G06F8/41: Compilation
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The embodiments of the application disclose a code compiling method, an electronic device and a storage medium. The method comprises the following steps: acquiring N code blocks, wherein each code block comprises a state identifier for determining the running sequence of the code block, and N is an integer greater than or equal to 2; determining the running sequence of the N code blocks according to the state identifier encapsulated in each code block; and generating a pipeline according to the N code blocks and their running sequence. The embodiments of the application are beneficial to improving pipeline generation efficiency.

Description

Code compiling method, electronic device and storage medium
Technical Field
The present disclosure relates to the field of computer software technologies, and in particular, to a code compiling method, an electronic device, and a storage medium.
Background
With the development of technology and the improvement of processors' computing capabilities, ever higher demands are placed on processor operation speed. At present, pipelined processing is a fairly common acceleration technique. One common pipelined implementation uses a double-buffer mechanism to achieve parallelism between different task flows, but the double-buffer mechanism is difficult to adapt to complex task flows. As the number of tasks in a pipeline grows, the relationships among the tasks become complex, code analysis becomes more and more cumbersome, and building a pipeline based on code analysis becomes difficult. Therefore, how to efficiently and quickly construct a pipeline containing multiple tasks is a technical problem to be solved.
Disclosure of Invention
The embodiments of the application provide a code compiling method, an electronic device and a storage medium, which improve the construction efficiency of a pipeline by setting a state identifier in each code block.
In a first aspect, an embodiment of the present application provides a code compiling method, including:
acquiring N code blocks, wherein each code block comprises a state identifier for determining the running sequence of the code block, and N is an integer greater than or equal to 2;
determining the running sequence of the N code blocks according to the state identifier encapsulated in each code block;
and generating a pipeline according to the N code blocks and their running sequence.
In one embodiment of the present application, the state identifier is used to indicate the task state of the task implemented by the code block, and determining the running sequence of the N code blocks according to the state identifier encapsulated in each code block includes:
determining the dependency relationships among the N code blocks according to the task states indicated by the state identifiers in the N code blocks;
and determining the running sequence of the N code blocks according to the dependency relationships among the N code blocks.
In one embodiment of the present application, the task states of the tasks implemented by the N code blocks include a data loading task, a data operation task, and a data storage task;
in one pipeline, the code block implementing the data loading task, the code block implementing the data operation task and the code block implementing the data storage task are executed in sequence.
In one embodiment of the present application, the pipeline further comprises synchronization code blocks;
a synchronization code block is included between any two adjacent code blocks in the pipeline;
the synchronization code block is used to indicate that the (i+1)-th code block runs after the i-th code block in the pipeline has finished, where i is a positive integer greater than or equal to 1, and i is less than or equal to N.
In one embodiment of the present application, each code block further includes a position identifier, where the position identifier is used to identify the starting position of the code block in the original code; the acquiring N code blocks includes:
taking the code between the j-th position identifier and the (j+1)-th position identifier in the original code, together with the j-th position identifier, as the j-th code block, where j is an integer from 1 to N.
In one embodiment of the present application, before acquiring the N code blocks, the method further includes:
acquiring the value of a preset flag bit;
and when the value of the preset flag bit is determined to be a preset value, executing the operation of acquiring N code blocks.
In one embodiment of the present application, the number of pipelines is at least two, and the method further comprises:
according to the state identifiers of the code blocks in the at least two pipelines, code blocks in different task states in different pipelines are determined as parallel code block groups, and the code blocks in a parallel code block group can be executed in parallel within the same time unit.
In one embodiment of the present application, the determining of code blocks in different task states in different pipelines as parallel code block groups according to the state identifiers of the code blocks in the at least two pipelines includes:
executing the k-th code block in a first pipeline in parallel with the (k-1)-th code block in a second pipeline, the task state of the (k-1)-th code block being different from the task state of the k-th code block in the first pipeline;
the first pipeline is any one of the at least two pipelines, the second pipeline is the next-stage pipeline of the first pipeline among the at least two pipelines, k is a positive integer greater than or equal to 2, and k is less than or equal to N.
In one embodiment of the present application, the method further comprises:
and allocating a memory for the parallel code block group, wherein the size of the memory is the product of the number of the code blocks in the parallel code block group and the size of a preset storage space.
In one embodiment of the present application, the method further comprises:
and compiling the original codes formed by the N code blocks into target codes.
In a second aspect, an embodiment of the present application provides a pipeline generating apparatus, including:
an obtaining unit, configured to obtain N code blocks, where each code block includes a state identifier for determining the running sequence of the code block, and N is an integer greater than or equal to 2;
a processing unit, configured to determine the running sequence of the N code blocks according to the state identifier encapsulated in each code block;
and to generate a pipeline according to the N code blocks and their running sequence.
In one embodiment of the present application, the state identifier is used to indicate the task state of the task implemented by the code block, and the processing unit is specifically configured to:
determine the dependency relationships among the N code blocks according to the task states indicated by the state identifiers in the N code blocks;
and determine the running sequence of the N code blocks according to the dependency relationships among the N code blocks.
In one embodiment of the present application, the task states of the tasks implemented by the N code blocks include a data loading task, a data operation task, and a data storage task;
in one pipeline, the code block implementing the data loading task, the code block implementing the data operation task and the code block implementing the data storage task are executed in sequence.
In one embodiment of the present application, the pipeline further comprises synchronization code blocks;
a synchronization code block is included between any two adjacent code blocks in the pipeline;
the synchronization code block is used to indicate that the (i+1)-th code block runs after the i-th code block in the pipeline has finished, where i is a positive integer greater than or equal to 1, and i is less than or equal to N.
In one embodiment of the present application, each code block further includes a position identifier, where the position identifier is used to identify the starting position of the code block in the original code; in acquiring the N code blocks, the acquiring unit is specifically configured to:
take the code between the j-th position identifier and the (j+1)-th position identifier in the original code, together with the j-th position identifier, as the j-th code block, where j is an integer from 1 to N.
In one embodiment of the present application, before acquiring the N code blocks, the acquiring unit is further configured to acquire a value of a preset flag bit; and the processing unit is also used for executing the operation of acquiring N code blocks when the value of the preset flag bit is determined to be the preset value.
In one embodiment of the present application, the number of pipelines is at least two, and the processing unit is further configured to:
according to the state identifiers of the code blocks in the at least two pipelines, code blocks in different task states in different pipelines are determined as parallel code block groups, and the code blocks in a parallel code block group can be executed in parallel within the same time unit.
In one embodiment of the present application, in determining code blocks in different task states in different pipelines as parallel code block groups according to the state identifiers of the code blocks in the at least two pipelines, the processing unit is specifically configured to:
execute the k-th code block in a first pipeline in parallel with the (k-1)-th code block in a second pipeline, the task state of the (k-1)-th code block being different from the task state of the k-th code block in the first pipeline;
the first pipeline is any one of the at least two pipelines, the second pipeline is the next-stage pipeline of the first pipeline among the at least two pipelines, k is a positive integer greater than or equal to 2, and k is less than or equal to N.
In an embodiment of the present application, the processing unit is further configured to allocate a memory for the parallel code block group, where a size of the memory is a product of a number of code blocks in the parallel code block group and a preset storage space size.
In one embodiment of the present application, the processing unit is further configured to compile the original code formed by the N code blocks into object code.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor and a memory, the processor being connected to the memory; the memory is used to store a computer program, and the processor is configured to execute the computer program stored in the memory, so that the electronic device performs the method according to the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program, the computer program causing a computer to perform the method according to the first aspect.
In a fifth aspect, embodiments of the present application provide a computer program product comprising a non-transitory computer-readable storage medium storing a computer program, the computer program being operable to cause a computer to perform the method according to the first aspect.
The implementation of the embodiment of the application has the following beneficial effects:
it can be seen that in the embodiment of the present application, the code blocks are provided with the status identifiers, and the running sequence of the code blocks of each task can be directly determined according to the status identifiers, so that the running sequence of the code blocks of each task is not required to be determined by analyzing the functions of the code and the data flow directions between the code blocks, thereby realizing rapid and efficient construction of the pipeline. Further, since the state identifier is encapsulated in each code block, according to the above implementation manner, the present application may generate a corresponding pipeline for each task flow, and the manner of generating the pipeline is not limited by hardware resources (such as the number of storage partitions), so as to support the scenario of the complex task flow.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic diagram of an implementation process of a dual cache mechanism according to an embodiment of the present application;
fig. 2 is a schematic flow chart of a code compiling method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of acquiring code blocks according to an embodiment of the present application;
FIG. 4 is a schematic diagram of composing a pipeline according to an embodiment of the present application;
FIG. 5 is a schematic diagram of composing another pipeline according to an embodiment of the present application;
FIG. 6 is a schematic diagram of composing yet another pipeline according to an embodiment of the present application;
FIG. 7 is a functional block diagram of a pipeline generating device according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
The terms "first," "second," "third," and "fourth" and the like in the description and in the claims of this application and in the drawings, are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
In order to execute different tasks efficiently and make full use of hardware computing resources, tasks with interdependencies are often formed into a pipeline, so that tasks without interdependencies in different pipelines can execute in parallel, improving task execution efficiency. Task pipelines can be generated and implemented in many different ways for different processing scenarios; the purpose of the present disclosure is to provide a method, an apparatus and related products for generating a multi-stage pipeline for complex task streams.
One common pipeline implementation uses memory partitioning, in which data in different memory blocks is sent in turn to the operation unit for computation; however, this implementation can only support simple task flows of at most two pipelines and is difficult to apply to complex task flow scenarios. For example, fig. 1 shows a task pipeline implemented with a double-buffer mechanism; the current implementation of the double-buffer mechanism is described below in conjunction with fig. 1.
As shown in fig. 1, to reduce the latency of the operation unit, the storage space of the unified cache may be divided into two parts, for example by splitting the unified cache into a first cache and a second cache. While the operation unit reads and computes the data in the first cache, the storage unit can write the next piece of data into the second cache in parallel. When the operation unit switches to reading and computing the data in the second cache, the computation result held in the first cache can be written back to the storage unit in parallel, or the storage unit can write the next piece of data into the first cache in parallel. In this way, the data access tasks and the computation tasks of the operation unit execute in parallel, which effectively alleviates idling of the operation unit, raises its utilization, and improves task execution efficiency. However, the double-buffer mechanism is implemented for each task by parsing the underlying code to find the code content of each task and then executing the task based on that content. For example, for a computation task of the operation unit in a pipeline, the underlying code must be parsed to find the code content that implements the computation task, and the task is carried out by executing that content. As the number of tasks in the pipeline increases, this parsing becomes increasingly cumbersome and complex. Therefore, how to efficiently construct a pipeline supporting multiple tasks is a technical problem to be solved.
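For illustration only, the following minimal Python sketch models the double-buffer schedule described above. The function name, the tuple encoding of work items, and the tile granularity are assumptions made for this sketch rather than details taken from the disclosure:

```python
# A minimal sketch of the double-buffer schedule, assuming hypothetical
# "load" and "compute_store" work items over numbered tiles; entries
# within one step run in parallel, mirroring the overlap of fig. 1.
def double_buffer_schedule(num_tiles):
    steps = []
    for t in range(num_tiles + 1):
        step = []
        if t < num_tiles:
            # storage unit writes tile t into buffer t % 2
            step.append(("load", t, t % 2))
        if t > 0:
            # operation unit processes tile t - 1 from buffer (t - 1) % 2,
            # overlapping with the load above
            step.append(("compute_store", t - 1, (t - 1) % 2))
        steps.append(step)
    return steps

# Step 1 loads tile 1 into buffer 1 while tile 0 is computed from buffer 0.
print(double_buffer_schedule(3))
```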
Referring to fig. 2, fig. 2 is a flowchart of a code compiling method according to an embodiment of the present application. The method may be implemented by a compiler running on a processor, such as a general-purpose processor (CPU). The method may include the following steps:
201: n code blocks are acquired, wherein N is an integer greater than or equal to 2.
Each code block serves to implement one task; running a code block as described herein is essentially executing a task, and the two are treated as equivalent without distinction. Furthermore, different code blocks have different functions, i.e., different code blocks are used to implement different types of tasks. In this application, the tasks implemented by the N code blocks are mainly exemplified by a data loading task (data_load), a data operation task (data_compute) and a data storage task (data_store); other task states are possible in practical applications and are not detailed here.
The application may first obtain the original code and obtain the N code blocks by parsing it. Illustratively, at least one code block identifier, namely a position identifier, is preset in the original code; when the original code is parsed, the code between one position identifier and the next can then be taken as one code block, yielding the N code blocks.
In particular, a position identifier may be placed in each code block to identify the starting position of the code block in the original code; thus, for the N code blocks, N position identifiers are set in the original code. When the original code is parsed, the code between the j-th position identifier and the (j+1)-th position identifier is taken as the j-th code block, where j is an integer from 1 to N.
Optionally, a block-level scope is used to encapsulate the code block corresponding to each task, yielding the original code. The position identifier referred to in this application may be represented by a block-level identifier, for example ir.block. If an ir.block identifier is recognized while parsing the original code, that ir.block identifier and the code up to the next ir.block identifier can be taken as one code block, yielding the N code blocks. The IR in the position identifier denotes the code level of the original code. In one implementation, the code levels of the original code include, but are not limited to, high-level Python code, low-level Python code, and lower-level target code, which may be code written in a C-like language (e.g., CUDA C). The low-level Python code may be code built on tensor computation primitives constructed in the Python language, for example TCP (Tensor Compute Primitive) or TIK (Tensor Iterator Kernel).
For example, for the three tasks to be implemented in the present application, the "with tcp.block" statement may be used as the position identifier. As shown in fig. 3, the code between the first "with tcp.block" and the second "with tcp.block" is taken as the first code block, the code between the second "with tcp.block" and the third "with tcp.block" as the second code block, and the code between the third "with tcp.block" and the end of the code as the third code block.
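As a hedged illustration of step 201, the Python sketch below splits source text into code blocks at each position identifier. The "with tcp.block" marker string is taken from the fig. 3 example above; the parser itself is an assumption, not the disclosure's actual implementation:

```python
import re

# Split original source into code blocks at each "with tcp.block" position
# identifier (marker string assumed from the example above).
def split_into_code_blocks(source: str):
    starts = [m.start() for m in re.finditer(r"with tcp\.block", source)]
    blocks = []
    for j, start in enumerate(starts):
        # j-th block: from the j-th identifier up to the (j+1)-th identifier,
        # or to the end of the code for the last block
        end = starts[j + 1] if j + 1 < len(starts) else len(source)
        blocks.append(source[start:end])
    return blocks
```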
202: the running sequence of the N code blocks is determined according to the state identifier encapsulated in each code block.
Each code block comprises a state identifier for determining the running sequence of the code block.
For example, the state identifier in each code block may be used to indicate the task state of the task the code block implements. Alternatively, the state identifier may be embodied in the function of each code block itself, i.e., the function of each code block indicates the task state of the task it implements.
For example, as shown in fig. 3, for the first code block, which implements the data loading task, the state identifier may be "stage_scope=load"; when the state identifier of a code block is parsed as "stage_scope=load", the task implemented by that code block is determined to be a data loading task. For the second code block, which implements the data operation task, the state identifier may be "stage_scope=compute"; when the state identifier is parsed as "stage_scope=compute", the task is determined to be a data operation task. For the third code block, which implements the data storage task, the state identifier may be "stage_scope=store"; when the state identifier is parsed as "stage_scope=store", the task is determined to be a data storage task. Through the above steps, the application obtains three code blocks that respectively implement a data loading task, a data operation task and a data storage task, and can then determine the running sequence of the three code blocks according to the dependency relationships among the tasks they implement.
Further, the dependency relationships of the N code blocks are determined according to the task states indicated by the state identifiers in the N code blocks. A dependency relationship reflects the flow of data between code blocks, so the running sequence of the N code blocks can be determined from the dependency relationships among them.
For example, for a data loading task, a data operation task and a data storage task, the data operation task depends on the data loading task, and the data storage task depends on the data operation task; the dependency relationships between the corresponding code blocks are therefore: the code block implementing the data operation task depends on the code block implementing the data loading task, and the code block implementing the data storage task depends on the code block implementing the data operation task. Thus, in the pipeline, the code block for the data loading task, the code block for the data operation task and the code block for the data storage task need to be executed in sequence. It should be clear that here the data loaded by the data loading task is the input of the data operation task, and the data written by the data storage task is the output of the data operation task, which is why a data dependency exists among the three tasks; in other scenarios there is not necessarily a data dependency among them.
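Under the load/compute/store dependency chain just described, determining the running sequence reduces to ordering blocks by task state. The sketch below assumes the stage_scope values from the fig. 3 example and is one possible realization, not the claimed algorithm itself:

```python
# Order code blocks by the task state named in their state identifiers,
# using the load -> compute -> store chain described above.
TASK_ORDER = {"load": 0, "compute": 1, "store": 2}

def order_code_blocks(blocks):
    # blocks: list of (state_identifier, code_text) pairs
    return sorted(blocks, key=lambda b: TASK_ORDER[b[0]])

blocks = [("store", "..."), ("load", "..."), ("compute", "...")]
assert [s for s, _ in order_code_blocks(blocks)] == ["load", "compute", "store"]
```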
In one possible implementation, the running order of the code blocks may be encapsulated directly in each code block, e.g., the status identification of each code block may be set directly to the running order of the code block. For example, for a code block for implementing a data loading task, the status identifier of the code block may be set to "1", so that after the code block is acquired, the running sequence of the code block may be directly determined to be the first based on the status identifier "1" of the code block, without determining the dependency relationship of N code blocks according to the task status of the task implemented by each code block, and determining the running sequence of N code blocks according to the dependency relationship, thereby improving the construction efficiency of the pipeline.
203: a pipeline is generated according to the N code blocks and their running sequence.
Illustratively, the N code blocks are automatically combined according to their running sequence to generate a pipeline.
For example, the three code blocks of the present application may be combined, in their running sequence, into the pipeline shown in fig. 4.
It can be seen that, in the embodiment of the present application, the status identifier is set in the code blocks, and the running sequence of the code blocks of each task can be directly determined according to the status identifier, without analyzing the function of the code and the data flow direction between the code blocks to determine the running sequence of the code blocks of each task, thereby implementing fast and efficient construction of the pipeline. Further, since the state identifier is encapsulated in each code block, according to the above implementation manner, the present application may generate a corresponding pipeline for each task flow, and the manner of generating the pipeline is not limited by hardware resources (such as the number of storage partitions), so as to support the scenario of the complex task flow.
In one embodiment of the present application, a synchronization code block is included between any two adjacent code blocks in the pipeline; the synchronization code block indicates that the (i+1)-th code block runs after the i-th code block in the pipeline has finished, where i is a positive integer greater than or equal to 1 and i is less than or equal to N. Although the N code blocks are combined in running sequence, some code blocks may take relatively long to execute. To fully guarantee that the N code blocks execute in order, a synchronization code block is inserted between every two adjacent code blocks, so that the latter of two adjacent code blocks executes only after the former has finished. From the task execution perspective, each synchronization code block implements one synchronization task: a synchronization task is inserted between the two adjacent tasks implemented by any two adjacent code blocks, so that in the pipeline the next task executes only after the previous one has completed, thereby ensuring sequential execution of the N tasks in the pipeline. Inserting a synchronization code block between any two adjacent code blocks in the pipeline produces a pipeline with synchronization code blocks inserted, as shown in fig. 5.
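A minimal sketch of this insertion step follows; "sync" stands in for the synchronization primitive, whose concrete form the disclosure does not specify:

```python
# Insert a synchronization code block between every two adjacent code
# blocks, so that block i + 1 may run only after block i has finished.
def insert_sync_blocks(ordered_blocks, sync="sync"):
    out = []
    for i, block in enumerate(ordered_blocks):
        out.append(block)
        if i < len(ordered_blocks) - 1:
            out.append(sync)
    return out

assert insert_sync_blocks(["load", "compute", "store"]) == \
       ["load", "sync", "compute", "sync", "store"]
```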
In one embodiment of the present application, the number of generated pipelines is at least two, i.e., the N code blocks may be combined into two or more pipelines. For each pipeline, the N code blocks are combined according to their running sequence, which is not repeated here. For example, the three code blocks of the present application may form the three pipelines shown in fig. 6 according to the order between the code blocks, with a synchronization code block inserted between any two adjacent code blocks in each pipeline to ensure that the three code blocks of each pipeline execute in sequence.
As shown in fig. 2 or fig. 6, the pipelines described above may be serial pipelines formed along the timeline. The pipelines of the present application may also include parallel pipelines formed based on different task states. Further, according to the state identifiers of the code blocks in the at least two pipelines, code blocks in different task states in different pipelines are determined as parallel code block groups, so that the code blocks in a parallel code block group execute in parallel within the same time unit; the parallel code block groups can thus form a parallel pipeline.
In particular, the at least two pipelines may have priorities. For example, in the pipelines of fig. 6, the uppermost pipeline has the highest priority, and the priorities of the other pipelines decrease in order. The priorities of the at least two pipelines in the present application may be determined by different iteration cycles; for example, the at least two pipelines may correspond to different iteration cycles of the same loop. Along the timeline, a pipeline executed earlier has a higher priority than a pipeline executed later, the earlier-executed pipeline corresponding to an earlier iteration cycle than the later-executed one. As shown in fig. 6, one loop may contain 3 iteration cycles, which correspond by priority to the three pipelines in fig. 6. In the present application, to increase task parallelism between different pipelines and improve task execution efficiency, code blocks in different task states in different pipelines can be determined as parallel code block groups, and the code blocks in a parallel code block group are then executed in parallel within the same time unit; this pipelines the tasks implemented by the different code blocks and improves operation efficiency and hardware utilization.
Because the N code blocks respectively implement different serial tasks, and two adjacent tasks invoke different hardware resources (for example, two adjacent tasks respectively invoke storage and computation resources), the code block groups for the at least two pipelines may be formed as follows:
executing the k-th code block in the first pipeline in parallel with the (k-1)-th code block in the second pipeline, i.e., taking the k-th code block of the first pipeline and the (k-1)-th code block of the second pipeline as one code block group; the first pipeline is any one of the at least two pipelines, the second pipeline is the next pipeline after the first pipeline among the at least two pipelines, and k is an integer from 2 to N.
In particular, hardware resources are limited, and two tasks in the same task state are generally not executed simultaneously within one time unit; thus, for a task in a given task state, the corresponding hardware resource is allocated to execute that task within one time unit. For the three tasks of the present application, memory-access resources may be allocated to the data loading task of the first pipeline in the first time unit, thereby executing the first code block of the first pipeline. In the second time unit, since the first code block of the first pipeline has finished, computing resources can be allocated to the data operation task of the first pipeline; and since the memory-access resources are now idle, the second code block of the first pipeline and the first code block of the second pipeline can form a code block group, so that the two code blocks in the group are executed in parallel in the second time unit using the memory-access and computing resources. By analogy, in the third time unit, the third code block of the first pipeline, the second code block of the second pipeline, and the first code block of the third pipeline can form a code block group and be executed in parallel using the memory-access and computing resources. Accordingly, when each of the pipelines contains N code blocks, N code blocks can form a code block group in the N-th time unit and be executed in parallel, thereby achieving parallel execution of the N tasks.
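The diagonal schedule just described can be sketched as follows; the 0-based indexing and the grouping function are assumptions inferred from the description of fig. 6:

```python
# Group code blocks across pipelines into parallel code block groups: in
# time unit t, pipeline p (0-based, in priority order) runs its block
# t - p, so each group holds blocks in different task states.
def parallel_groups(num_pipelines, blocks_per_pipeline):
    groups = []
    for t in range(num_pipelines + blocks_per_pipeline - 1):
        group = [(p, t - p)  # (pipeline index, block index)
                 for p in range(num_pipelines)
                 if 0 <= t - p < blocks_per_pipeline]
        groups.append(group)
    return groups

# With 3 pipelines of 3 blocks each, time unit 2 yields
# [(0, 2), (1, 1), (2, 0)]: a store, a compute and a load from three
# different pipelines executing in parallel.
print(parallel_groups(3, 3))
```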
Further, the present application also inserts synchronization tasks between different pipelines to ensure that the code blocks can be executed in parallel. That is, a synchronization code block is inserted between adjacent code blocks of consecutive iteration cycles to implement a synchronization task. For example, a synchronization code block is inserted between the code block completing the data storage task of the first iteration cycle and the code block completing the data loading task of the second iteration cycle.
In one embodiment of the present application, the method may further include allocating memory for the set of code blocks.
Optionally, the size of the memory is the product of the number of code blocks in the parallel code block group and a preset storage space size; that is, the memory size may correspond to the maximum number of code blocks a parallel code block group can hold. The preset storage space size is related to the size of the data to be processed. It can be understood that, for each code block group, all code blocks in the group need to execute in parallel within the same time unit, and each code block needs corresponding memory to cache its running result; allocating memory for all the code blocks at once according to the number of code blocks in the group, rather than allocating memory separately for each code block, therefore improves memory allocation efficiency.
For example, if the data to be processed is two-dimensional, its size can be expressed as m × n, where m denotes the size of the data in the first dimension and n its size in the second dimension. If the original code needs m iteration cycles to finish processing the data, each iteration cycle processes a slice of size n. In each iteration cycle, at least the three code blocks of one pipeline along the timeline in fig. 6 are executed, i.e., the data loading task, the data operation task and the data storage task are completed in sequence. For the data processed in each iteration cycle, a preset storage space must be allocated to handle the slice of size n. In this embodiment, to support a multi-stage pipeline, memory can be allocated for the entire parallel code block group at once, its size being the number of parallel code blocks multiplied by the preset storage space size n required in a single iteration cycle. As shown in fig. 6, the memory size may be 3n.
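The allocation rule itself is a single multiplication, sketched below; the per-iteration size used in the assertion is purely illustrative:

```python
# One-shot memory allocation for a parallel code block group: total size
# equals the number of code blocks in the group times the preset storage
# space size needed by one iteration cycle.
def group_memory_size(num_parallel_blocks: int, preset_size: int) -> int:
    return num_parallel_blocks * preset_size

# The 3n case of fig. 6: three parallel code blocks, each needing n bytes.
n = 1024  # illustrative value only
assert group_memory_size(3, n) == 3 * n
```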
In one embodiment of the present application, an automatic pipeline generation mode and a normal mode may be provided. The automatic pipeline generation mode automatically generates a pipeline according to the state identifiers in the code blocks, as in the method steps shown above; the normal mode generates a pipeline through conventional code analysis. To support switching between the two modes, implementations of the present disclosure may further set a preset flag bit in the original code for mode switching. Specifically, during parsing of the original code, when the value of the preset flag bit is determined to be a preset value, the automatic pipeline generation mode is enabled; when it is not the preset value, the normal mode is enabled. The preset value may be 1 or another value.
In this embodiment, a flag bit is preset. When its value is read as the preset value, it is determined that the automatic pipeline generation mode should be enabled; N code blocks are then acquired and formed into pipelines in the manner shown in fig. 2 or fig. 6. Specifically, if the automatic pipeline generation mode is enabled, the memory allocation operation described above may be completed automatically. If the value of the preset flag bit is not the preset value, the normal mode is enabled; in that case the N code blocks need not be acquired, and the pipeline is generated directly according to the writing order of the original code using conventional code analysis.
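A minimal sketch of the mode switch follows; the flag name and the preset value 1 are assumptions, since the disclosure states only that a preset value enables the automatic pipeline generation mode:

```python
# Select the compilation mode from the preset flag bit.
PRESET_VALUE = 1  # assumed; the disclosure allows 1 or other values

def select_compile_mode(flag_bit: int) -> str:
    if flag_bit == PRESET_VALUE:
        return "automatic_pipeline"  # acquire N code blocks, build pipelines
    return "normal"                  # conventional code-analysis compilation

assert select_compile_mode(1) == "automatic_pipeline"
assert select_compile_mode(0) == "normal"
```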
In one embodiment of the present application, the method may further include:
compiling the original code formed by the N code blocks into target code, where the target code may be code expressed in a C-like language (such as CUDA C). In the embodiments of the application, automatic pipeline optimization is performed on the original code, and the pipeline-optimized original code can then be compiled into the target code. Further, the present application may also compile the target code into binary instructions executable by the hardware platform. The hardware platform includes, but is not limited to, a processing unit and a storage unit, and the processing unit and the storage unit may perform the corresponding operations in the pipelined manner shown in fig. 6.
Referring to fig. 7, fig. 7 is a functional unit block diagram of a pipeline generating device according to an embodiment of the present application. The pipeline generating apparatus 700 includes:
an obtaining unit 701, configured to obtain N code blocks, where each code block includes a state identifier for determining the running sequence of the code block, and N is an integer greater than or equal to 2;
a processing unit 702, configured to determine the running sequence of the N code blocks according to the state identifier encapsulated in each code block;
and to generate a pipeline according to the N code blocks and their running sequence.
In one embodiment of the present application, the state identifier is used to indicate the task state of the task implemented by the code block, and in determining the running sequence of the N code blocks according to the state identifier encapsulated in each code block, the processing unit 702 is specifically configured to:
determine the dependency relationships among the N code blocks according to the task states indicated by the state identifiers in the N code blocks;
and determine the running sequence of the N code blocks according to the dependency relationships among the N code blocks.
In one embodiment of the present application, the task states of the tasks implemented by the N code blocks include a data loading task, a data operation task, and a data storage task;
in one pipeline, the code block implementing the data loading task, the code block implementing the data operation task and the code block implementing the data storage task are executed in sequence.
In one embodiment of the present application, the pipeline further comprises synchronization code blocks;
a synchronization code block is included between any two adjacent code blocks in the pipeline;
the synchronization code block is used to indicate that the (i+1)-th code block runs after the i-th code block in the pipeline has finished, where i is a positive integer greater than or equal to 1, and i is less than or equal to N.
In one embodiment of the present application, each code block further includes a position identifier, where the position identifier is used to identify the starting position of the code block in the original code; in acquiring the N code blocks, the acquiring unit 701 is specifically configured to:
take the code between the j-th position identifier and the (j+1)-th position identifier in the original code, together with the j-th position identifier, as the j-th code block, where j is an integer from 1 to N.
In one embodiment of the present application, before acquiring N code blocks, the acquiring unit 701 is further configured to acquire a value of a preset flag bit; the processing unit 702 is further configured to perform an operation of acquiring N code blocks when determining that the value of the preset flag bit is a preset value.
In one embodiment of the present application, the number of pipelines is at least two, and the processing unit 702 is further configured to:
according to the state identifiers of the code blocks in the at least two pipelines, code blocks in different task states in different pipelines are determined as parallel code block groups, and the code blocks in a parallel code block group can be executed in parallel within the same time unit.
In one embodiment of the present application, in determining code blocks in different task states in different pipelines as parallel code block groups according to the state identifiers of the code blocks in the at least two pipelines, the processing unit 702 is specifically configured to:
execute the k-th code block in a first pipeline in parallel with the (k-1)-th code block in a second pipeline, the task state of the (k-1)-th code block being different from the task state of the k-th code block in the first pipeline;
the first pipeline is any one of the at least two pipelines, the second pipeline is the next-stage pipeline of the first pipeline among the at least two pipelines, k is a positive integer greater than or equal to 2, and k is less than or equal to N.
In one embodiment of the present application, the processing unit 702 is further configured to allocate a memory for the parallel code block group, where a size of the memory is a product of a number of code blocks in the parallel code block group and a preset storage space size.
In one embodiment of the present application, the processing unit 702 is further configured to compile the original code formed by the N code blocks into object code.
Referring to fig. 8, fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 8, the electronic device 800 includes a transceiver 801, a processor 802, and a memory 803. Which are connected by a bus 804. The memory 803 is used to store computer programs and data, and the data stored in the memory 803 can be transferred to the processor 802.
The processor 802 is configured to read a computer program in the memory 803 to perform the following operations:
acquiring N code blocks, wherein each code block comprises a state identifier for determining the running sequence of the code block, and N is an integer greater than or equal to 2;
determining the running sequence of the N code blocks according to the state identifier encapsulated in each code block;
and generating a pipeline according to the N code blocks and their running sequence.
The specific functions of the processor 802 may refer to the specific functions of the processing unit 702 and the acquiring unit 701, which are not described herein.
The present application also provides a computer-readable storage medium storing a computer program that is executed by a processor to implement some or all of the steps of any one of the code compiling methods described in the above method embodiments.
Embodiments of the present application also provide a computer program product comprising a non-transitory computer-readable storage medium storing a computer program operable to cause a computer to perform some or all of the steps of any one of the code compiling methods described in the method embodiments above.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations, but it should be understood by those skilled in the art that the present application is not limited by the order of actions described, as some steps may be performed in other order or simultaneously in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all alternative embodiments, and that the acts and modules referred to are not necessarily required in the present application.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, such as the division of the units, merely a logical function division, and there may be additional manners of dividing the actual implementation, such as multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, or may be in electrical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units described above may be implemented either in hardware or in software program modules.
The integrated units, if implemented in the form of software program modules and sold or used as a stand-alone product, may be stored in a computer-readable memory. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a memory, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method described in the embodiments of the present application. The aforementioned memory includes various media capable of storing program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program instructing associated hardware, and the program may be stored in a computer-readable memory, which may include a flash disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The foregoing has described the embodiments of the present application in detail. Specific examples are used herein to illustrate the principles and implementations of the present application, and the above description of the embodiments is only intended to help understand the method of the present application and its core ideas. Meanwhile, those skilled in the art may make modifications to the specific implementations and application scope according to the ideas of the present application; in view of the above, the content of this specification should not be construed as limiting the present application.

Claims (11)

1. A code compiling method, comprising:
acquiring N code blocks, wherein each code block comprises a state identifier for determining the running sequence of the code block, and N is an integer greater than or equal to 2;
determining the running sequence of the N code blocks according to the state identifier encapsulated in each code block;
and generating a pipeline according to the N code blocks and their running sequence.
2. The method of claim 1, wherein the state identifier is used to indicate the task state of the task implemented by the code block, and wherein determining the running sequence of the N code blocks according to the state identifier encapsulated in each code block comprises:
determining the dependency relationships among the N code blocks according to the task states indicated by the state identifiers in the N code blocks;
and determining the running sequence of the N code blocks according to the dependency relationships among the N code blocks.
3. The method according to claim 1 or 2, wherein the task states of the tasks implemented by the N code blocks include a data loading task, a data operation task and a data storage task;
in one pipeline, the code block implementing the data loading task, the code block implementing the data operation task and the code block implementing the data storage task are executed in sequence.
4. The method according to any one of claims 1-3, wherein each of said code blocks further comprises a position identifier for identifying the starting position of the code block in the original code; the acquiring N code blocks comprises:
taking the code between the j-th position identifier and the (j+1)-th position identifier in the original code, together with the j-th position identifier, as the j-th code block, where j is an integer from 1 to N.
5. The method of any of claims 1-4, wherein the number of pipelines is at least two, the method further comprising:
according to the state identifiers of the code blocks in the at least two pipelines, code blocks in different task states in different pipelines are determined as parallel code block groups, and the code blocks in a parallel code block group can be executed in parallel within the same time unit.
6. The method of claim 5, wherein determining code blocks in different task states in different pipelines as parallel code block groups based on the state identifiers of the code blocks in the at least two pipelines comprises:
executing the k-th code block in a first pipeline in parallel with the (k-1)-th code block in a second pipeline, the task state of the (k-1)-th code block being different from the task state of the k-th code block in the first pipeline;
the first pipeline is any one of the at least two pipelines, the second pipeline is the next-stage pipeline of the first pipeline among the at least two pipelines, k is a positive integer greater than or equal to 2, and k is less than or equal to N.
7. The method according to claim 5 or 6, characterized in that the method further comprises:
and allocating a memory for the parallel code block group, wherein the size of the memory is the product of the number of the code blocks in the parallel code block group and the size of a preset storage space.
8. The method according to any one of claims 1-7, further comprising:
acquiring the value of a preset flag bit;
and when the value of the preset flag bit is determined to be a preset value, entering an automatic pipeline generation mode to automatically generate a pipeline according to the state identifier encapsulated in each code block.
9. The method according to any one of claims 1-8, further comprising:
and compiling the original codes formed by the N code blocks into target codes.
10. An electronic device, comprising: a processor and a memory, the processor being connected to the memory, the memory being for storing a computer program, the processor being for executing the computer program stored in the memory to cause the electronic device to perform the method of any one of claims 1-9.
11. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program, which is executed by a processor to implement the method of any one of claims 1-9.
CN202111576033.7A 2021-12-21 2021-12-21 Code compiling method, electronic device and storage medium Pending CN116301874A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111576033.7A CN116301874A (en) 2021-12-21 2021-12-21 Code compiling method, electronic device and storage medium


Publications (1)

Publication Number Publication Date
CN116301874A (en) 2023-06-23

Family

ID=86813620

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111576033.7A Pending CN116301874A (en) 2021-12-21 2021-12-21 Code compiling method, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN116301874A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination