CN110990151A

CN110990151A - Service processing method based on heterogeneous computing platform

Info

Publication number: CN110990151A
Application number: CN201911161201.9A
Authority: CN
Inventors: 赵雅倩; 朱效民
Original assignee: Inspur Electronic Information Industry Co Ltd
Current assignee: Inspur Electronic Information Industry Co Ltd
Priority date: 2019-11-24
Filing date: 2019-11-24
Publication date: 2020-04-10
Also published as: WO2021098257A1

Abstract

The scheme can start and end threads only once in the service processing process under a unified computing framework, avoids the overhead caused by frequent thread starting and closing and improves the efficiency of heterogeneous computing; and moreover, the communication of the calculation progress between the master core and the slave core is realized by utilizing a master-slave core communication framework and a synchronization mechanism, so that the master core is ensured to perform the next operation after the slave core is calculated, the slave core is ensured to start corresponding slave core calculation modules at different moments to execute corresponding calculation operations, and the calculation accuracy is ensured.

Description

Service processing method based on heterogeneous computing platform

Technical Field

The application relates to the technical field of computers, in particular to a heterogeneous computing platform and a service processing method, device and main core thereof.

Background

The simulation and simulation by using a large-scale supercomputer is an important and even irreplaceable method for supporting the development of scientific research work, and the calculation simulation and simulation become a third model of scientific research.

In recent years, with the higher demands on computing speed for many applications, accelerators with stronger floating point computing power have become an important component for building supercomputers. The computational speed of an application is increased by offloading the computationally intensive portions that were originally run on a conventional CPU to an accelerator specifically designed to increase computational speed. Typical accelerators include GPUs, domestic SW26010, and the like.

In order to fully utilize the computing performance of these acceleration components, it is generally necessary to port the traditional compute-intensive part running on the CPU to a many-core architecture, where the most efficient methods are programming, compiling, running with accelerator-supported libraries, such as CUDA supported by GPU, Athread supported by SW26010, and so on. The current programming method accomplishes the goal of offloading the computed part to the accelerator by calling the accelerator function at the corresponding compute-intensive module location. After the computation is completed, i.e., after the function called returns, the main core CPU continues to perform non-computation-intensive parts, such as communication and the like.

However, in general, the computation-intensive modules involved in an application are not single, and the computation modules are not continuous, and a CPU is required to perform transactions such as communication between the computation modules, which results in that each module needs to perform thread starting, and then after the slave core performs computation, the thread is finished and returns to the master process. The overheads such as start-stop related to the thread are not always negligible, and especially in the case of a small amount of data and a low calculation access ratio (i.e. a small calculation amount caused by a single data), the extra benefit brought by the calculation is not significant, and the calculation performance improvement caused by the overheads related to the thread is not significant, so it is necessary to optimize the thread overheads.

Based on the analysis of the thread-related overhead and the actual performance test and analysis in the process of actually transplanting and optimizing the ROMS, it can be seen that the thread overhead can cause that the acceleration of heterogeneous computation is not ideal. At present, most application migration only considers optimization at a computing level, and no relevant public visible method exists for the overhead at the system level.

Therefore, how to avoid the thread overhead of the heterogeneous computing acceleration system in the service processing process and improve the system performance is a problem to be solved by those skilled in the art.

Disclosure of Invention

The application aims to provide a heterogeneous computing platform and a service processing method, a device and a main core thereof, which are used for solving the problem that the traditional heterogeneous computing acceleration system is influenced by thread overhead in the service processing process, so that the system performance is low. The specific scheme is as follows:

in a first aspect, the present application provides a service processing method based on a heterogeneous computing platform, which is applied to a primary core, and includes:

when the current task is started to be executed, controlling the slave core to start a target thread;

when the preset execution state of the current task is reached, generating scheduling information so that the slave core can execute corresponding calculation operation by using the target thread according to the scheduling information and generate calculation progress information;

and when the calculation progress information is calculation completion, continuing to execute the current task until the slave core is controlled to close the target thread when the current task is executed.

Preferably, before generating the scheduling information when the preset execution state of the current task is reached, the method further includes:

and allocating a shared memory in a shared storage space to store the scheduling information and the calculation progress information in the current task execution process.

Preferably, after the allocating the shared memory in the shared memory space, the method further includes:

setting a first shared variable and a second shared variable in the shared memory, and initializing the first shared variable and the second shared variable, wherein the first shared variable is used for storing the scheduling information, and the second shared variable is used for storing the calculation progress information.

Preferably, the generating scheduling information so that the slave core performs a corresponding computing operation by using the target thread according to the scheduling information includes:

and assigning the first shared variable as identification information of a target computing module to serve as scheduling information, so that the slave core utilizes the target thread to call the target computing module to execute corresponding computing operation according to the scheduling information.

Preferably, when the calculation progress information is calculation completion, continuing to execute the current task includes:

and inquiring the second shared variable every other preset time length until the calculation progress information is that the calculation is completed, and continuously executing the current task.

Preferably, after the scheduling information is generated when the preset execution state of the current task is reached, the method further includes:

and transmitting the scheduling information to the slave core in a display communication mode.

Preferably, the slave core includes multiple task types, and the generating the scheduling information when the preset execution state of the current task is reached includes:

and when the preset execution state of the current task is reached, generating scheduling information of the target slave core according to the task type.

In a second aspect, the present application provides a service processing apparatus based on a heterogeneous computing platform, which is applied to a primary core, and includes:

a thread starting module: the slave core is controlled to start the target thread when the current task is started to be executed;

a scheduling module: the slave core is used for generating scheduling information when the preset execution state of the current task is reached, so that the slave core utilizes the target thread to execute corresponding calculation operation according to the scheduling information and generate calculation progress information;

a thread closing module: and the slave core is used for continuously executing the current task when the calculation progress information is that the calculation is completed, and controlling the slave core to close the target thread when the execution of the current task is finished.

In a third aspect, the present application provides a primary core of a heterogeneous computing platform, including:

a memory: for storing a computer program;

a processor: for executing the computer program to implement the steps of the business processing method based on the heterogeneous computing platform as described above.

In a fourth aspect, the present application provides a heterogeneous computing platform comprising: a master core and a slave core;

the main core is used for controlling the secondary core to start a target thread when the current task is started to be executed; when the preset execution state of the current task is reached, generating scheduling information;

the slave core is used for executing corresponding calculation operation according to the scheduling information by using the target thread and generating calculation progress information;

and the master core is used for continuously executing the current task when the calculation progress information is that the calculation is completed, and controlling the slave core to close the target thread when the current task is finished.

The application provides a business processing method based on a heterogeneous computing platform, which is applied to a main core and comprises the following steps: when the current task is started to be executed, controlling the slave core to start a target thread; when the preset execution state of the current task is reached, generating scheduling information so that the slave core utilizes the target thread to execute corresponding calculation operation according to the scheduling information and generate calculation progress information; and when the calculation progress information is that the calculation is completed, continuing to execute the current task until the current task is executed, and controlling the slave core to close the target thread.

Therefore, the method can start and end the threads only once under a uniform computing frame, thereby avoiding the overhead caused by frequent thread starting and closing and improving the efficiency of heterogeneous computing; moreover, the communication of the calculation progress between the master core and the slave core is realized by designing a master-slave core communication framework and a synchronization mechanism, so that the master core is ensured to perform the next operation after the slave core is calculated, the slave core is ensured to start the corresponding slave core calculation module at different moments to execute the corresponding calculation operation, and the calculation accuracy is also ensured.

In addition, the application also provides a business processing device based on the heterogeneous computing platform, a main core and the heterogeneous computing platform, and the technical effect of the business processing device, the main core and the heterogeneous computing platform correspond to the technical effect of the method, and the details are not repeated here.

Drawings

For a clearer explanation of the embodiments or technical solutions of the prior art of the present application, the drawings needed for the description of the embodiments or prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

Fig. 1 is a flowchart illustrating a first implementation of a service processing method based on a heterogeneous computing platform according to an embodiment of the present disclosure;

fig. 2 is a flowchart illustrating implementation of a second embodiment of a service processing method based on a heterogeneous computing platform according to the present application;

fig. 3 is a functional block diagram of an embodiment of a service processing apparatus based on a heterogeneous computing platform according to the present application.

Detailed Description

In order that those skilled in the art will better understand the disclosure, the following detailed description will be given with reference to the accompanying drawings. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

To ensure computational performance, the primary core accomplishes the goal of offloading the compute portion to the accelerator with calls to the accelerator functions at compute-intensive module locations (typically for loops at the code layer); after the computation is completed, i.e., after the function is returned from the call, the main core CPU continues with the non-computation-intensive part. However, in general, the computation-intensive modules involved in the application are not single (e.g., more than 50 such for-loop modules in step2d module of ocean mode ROMS), and these computation modules are not continuous, so frequent starting and stopping of threads is required, and the computation performance is affected.

In order to solve the above problems, the present application provides a heterogeneous computing platform, a service processing method, an apparatus, and a main core thereof, in a scenario where a plurality of discontinuous modules in heterogeneous computing are unloaded into an accelerator, the scheme may start and end a thread only once in a unified computing framework, thereby avoiding overhead caused by frequent thread start and close, and improving efficiency of heterogeneous computing.

Referring to fig. 1, a first embodiment of a service processing method based on a heterogeneous computing platform provided in the present application is described below, where the first embodiment is applied to a primary core, and includes:

s101, when the current task is started to be executed, controlling a slave core to start a target thread;

s102, when the preset execution state of the current task is reached, scheduling information is generated, so that the slave core can execute corresponding calculation operation according to the scheduling information by using the target thread, and calculation progress information is generated;

the preset execution state may be specifically executed to a certain computation module, or reached to a certain time node.

S103, when the calculation progress information is that the calculation is completed, the current task is continuously executed until the slave core is controlled to close the target thread when the current task is executed.

In this embodiment, as for a system framework of a heterogeneous computing platform, a master-slave heterogeneous computing mode is adopted. Specifically, the master core completes the corresponding computation partition in the computation-intensive part by calling the slave core, the slave core completes the distributed computation task return, and then the master core continues to execute the non-intensive computation part which the slave core does not need to participate in, such as transactions of communication and the like. And performing phase state query instead of stopping after the kernel calculation is finished, and starting the next calculation when the next phase calculation needs to be started. According to the mode, the loop is repeated until all the calculation needing to be unloaded to the slave core is completed, namely the slave core is informed to stop waiting for the next calculation, the slave core returns, and the calculation phase is completed.

It should be noted that, in this embodiment, the function call of the master core to the slave core is not explicit, but is implemented by communication and sharing between the master core and the slave core. The master core monitors the status of the slave core, and the slave core implements corresponding computations with libraries supported by the slave core, such as CUDA (computer Unified device architecture) supported by GPU, Athread (thread library supported by SW26010 processor) supported by SW26010 (Shenwei 26010), and the like, for all computation-intensive parts.

In order to enable the master core to acquire the slave core calculation progress message so as to determine whether to start the next calculation and transaction processing, and the slave core to acquire the scheduling information of the master core so as to determine to start the corresponding calculation, a communication mechanism between the master core and the slave core needs to be designed.

In this embodiment, the master-slave core communication mechanism of the heterogeneous computing platform is used. Communication between the master and slave cores may be accomplished in two ways. The method is implicit communication, namely, shared memories which can be directly accessed exist between a master core and a slave core, and communication is achieved through assignment and value operation of shared variables. It should be noted that when communication is implemented in this manner, variables need to be set to prevent operations in the cache, and instead, variables are directly read and written. And the second mode is explicit communication, namely no memory which can be shared exists between the master core and the slave core, or the data volume needing communication is large, and the message exchange between the master core and the slave core can be realized through the explicit communication between the accessible storage areas of the master core and the slave core.

In this embodiment, for the mechanism for ensuring synchronization of the master core of the heterogeneous computing platform, when the master core calls the slave cores to perform computation, it is necessary to ensure that all the slave cores have completed computation, and then the next process is performed. On the basis of the aforementioned master-slave core communication mechanism, the master check can be performed to inquire the variables corresponding to the computation states of all the slave cores, that is, after all the variables are set by the slave cores, the master core can perform the next process. After the current module is calculated, the slave core needs to set a corresponding variable belonging to the current slave core. Moreover, when the main core acquires that all variables corresponding to all the slave cores are set, and then the next process is performed, all the variables need to be restored, so that the next calculation module can multiplex the variables to update the state.

In this embodiment, the slave core starting computing mechanism of the heterogeneous computing platform is implemented based on the foregoing master-slave core communication mechanism. Specifically, each time the slave core completes one computation module, the slave core does not immediately start the next computation module, but waits for the update of the computation state of the master core, that is, the master core performs the computation of the next module only when the slave core needs to start the corresponding computation. The specific implementation is that the slave core uses a communication or sharing mechanism to realize the inquiry of the calculation state of the master core, and after the corresponding calculation module variable is set, the calculation of the corresponding module is carried out.

In summary, the service processing method based on the heterogeneous computing platform provided by this embodiment is applied to the main core, and the method can start and end the threads only once in a unified computing framework, thereby avoiding the overhead caused by frequent thread start and close, and improving the efficiency of heterogeneous computing; moreover, the communication of the calculation progress between the master core and the slave core is realized by designing a master-slave core communication framework and a synchronization mechanism, so that the master core is ensured to perform the next operation after the slave core is calculated, the slave core is ensured to start the corresponding slave core calculation module at different moments to execute the corresponding calculation operation, and the calculation accuracy is also ensured.

The second embodiment of the service processing method based on the heterogeneous computing platform provided by the present application is described in detail below, and is implemented based on the first embodiment, and is expanded to a certain extent on the basis of the first embodiment.

Referring to fig. 2, the second embodiment is specifically applied to the primary core, and includes:

s201, distributing a shared memory in a shared storage space to store the scheduling information and the calculation progress information in the current task execution process;

s202, setting a first shared variable and a second shared variable in the shared memory, and initializing the first shared variable and the second shared variable, wherein the first shared variable is used for storing the scheduling information, and the second shared variable is used for storing the calculation progress information;

s203, controlling the slave core to start a target thread when the current task is started to be executed;

s204, assigning the first shared variable as identification information of a target computing module to serve as scheduling information, so that the slave core utilizes the target thread to call the target computing module to execute corresponding computing operation according to the scheduling information and generate computing progress information;

specifically, the heterogeneous computing platform of this embodiment includes slave cores of multiple task types, and when the preset execution state of the current task is reached, the master core generates scheduling information for the target slave core according to the specific task type.

S205, when the calculation progress information is that the calculation is completed, the current task is continuously executed until the slave core is controlled to close the target thread when the current task is executed.

It should be noted that, in the embodiment, a case that the heterogeneous computing platform supports the shared storage is taken as an example for description, and when the system supports the shared memory, the main core opens up and manages the shared memory space. In an actual application scenario, if shared storage is not supported, a space is opened up in respective storage spaces, and subsequent data access is not direct access but explicit data transmission.

The following respectively describes a master-slave core heterogeneous computation framework, a master-slave core communication mechanism, a master core computation scheduling mechanism, and a slave core computation synchronization mechanism of the heterogeneous computation platform in this embodiment.

For a master-slave core heterogeneous computing framework, the method mainly comprises the following steps: the main core end is realized by calling the slave core function in the intensive calculation part, the slave core actively detects the execution state of the main core to automatically start calculation without explicit function calling, corresponding calculation progress information is set after the calculation is finished, and the main core starts the next calculation according to the calculation progress information of all the slave cores. And the slave cores respectively complete corresponding calculation tasks aiming at different calculation intensive modules. In the aspect of secondary core coding implementation, the secondary core codes of the corresponding computing modules are respectively implemented for each computation-intensive sub-module by using a library supported by the secondary core. For the encoding implementation of the master core, for the compute-intensive parts needed to be unloaded to the slave core, the corresponding parts are deleted in the execution code of the master core in advance.

For a master-slave core communication mechanism, the method mainly comprises the following steps: and opening a space in the shared storage space, and respectively storing the computing state of the master core and the computing state of each slave core. Specifically, the computing state of the master core may represent the computing module to be started by an integer, and the master core performs setting and the slave core performs query; the calculation state of the slave core is used for indicating whether each slave core completes the calculation required by the master core, the slave core is used for setting, and the master core is used for inquiring.

For a master core computation scheduling mechanism, the method mainly comprises the following steps:

and initializing a storage space. For the distributed main core calculation state space, initializing to be-1, namely inquiring to be-1 by the slave core, and not starting any calculation module; for the distributed slave core to calculate the state space, the initialization is set to-1, namely the slave core does not perform any calculation;

and (5) setting data. When the main core code reaches the calculation intensive module, setting the value as the ID (different integers) of the corresponding calculation module for the inquiry of the slave core;

and (6) querying data. After the master core sets the module ID for the slave core to access, the calculation state set by the slave core is inquired, so as to acquire whether the slave core completes the calculation module or not, and if the slave core completes the calculation module, the code execution is continued; otherwise, the loop waits until all the slave cores have completed the corresponding computation.

For the slave core computation synchronization mechanism, the method mainly comprises the following steps:

and adding a start-stop code. I.e. before all the computation modules from the core code, the thread is started. Adding a code for stopping the thread at the tail of the last calculation module, so that the thread is started and stopped only once in the whole calculation process;

from a core data query. The slave core directly inquires the state of the shared storage (or initiates data transmission), and after the ID set by the master core is inquired, the slave core calculation corresponding to the ID is started;

and setting from the core data. After the slave core completes the calculation, the state of the storage location corresponding to the slave core is set, so that the master core can query the state to acquire that the calculation of the slave core is completed.

In summary, the service processing method based on the heterogeneous computing platform provided in this embodiment can implement communication between the master core and the slave core in a shared memory or explicit communication manner, and implement setting and query of the computing states of the master core and the slave core, so that the computing task can be started at a corresponding time, and after the computing task is completed, the flows such as communication and the like are started, thereby implementing synchronization of the overall computing and ensuring the correctness of the computing; and the cost of starting and stopping the thread for multiple times is avoided, so that the data communication with lower cost can be realized, the additional overhead of starting and stopping the thread for multiple times in the traditional heterogeneous computing mode is avoided, and the speed-up ratio and the efficiency of heterogeneous computing are improved.

Therefore, in the embodiment, a heterogeneous computation framework is designed for heterogeneous computation, such as GPU, SW26010, and the like, through a communication mechanism of a master core and a slave core, it is ensured that each computation module is sequentially performed at a master core end, and it is ensured that a corresponding computation submodule is started at a corresponding time at an accelerator end, so that correctness of the heterogeneous computation is ensured, and efficiency of the heterogeneous computation is improved.

In the following, a business processing apparatus based on a heterogeneous computing platform provided in an embodiment of the present application is introduced, and a business processing apparatus based on a heterogeneous computing platform described below and a business processing method based on a heterogeneous computing platform described above may be referred to correspondingly.

As shown in fig. 3, the service processing apparatus of this embodiment, applied to a primary core, includes:

the thread start module 301: the slave core is controlled to start the target thread when the current task is started to be executed;

the scheduling module 302: the slave core is used for generating scheduling information when the preset execution state of the current task is reached, so that the slave core utilizes the target thread to execute corresponding calculation operation according to the scheduling information and generate calculation progress information;

the thread shutdown module 303: and the slave core is used for continuously executing the current task when the calculation progress information is that the calculation is completed, and controlling the slave core to close the target thread when the execution of the current task is finished.

A service processing apparatus based on a heterogeneous computing platform of this embodiment is used to implement the aforementioned service processing method based on a heterogeneous computing platform, and therefore a specific implementation manner of the apparatus can be seen in the foregoing embodiment parts of the service processing method based on a heterogeneous computing platform, for example, the thread starting module 301, the scheduling module 302, and the thread closing module 303 are respectively used to implement steps S101, S102, and S103 in the aforementioned service processing method based on a heterogeneous computing platform. Therefore, specific embodiments thereof may be referred to in the description of the corresponding respective partial embodiments, and will not be described herein.

In addition, since the service processing apparatus based on the heterogeneous computing platform in this embodiment is used to implement the service processing method based on the heterogeneous computing platform, the role of the service processing apparatus corresponds to that of the method described above, and details are not described here.

In addition, the present application also provides a primary core of a heterogeneous computing platform, including:

a memory: for storing a computer program;

a processor: for executing the computer program to implement the steps of a business processing method based on heterogeneous computing platforms as described above.

Finally, the present application provides a heterogeneous computing platform comprising: a master core and a slave core;

The embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The above detailed descriptions of the solutions provided in the present application, and the specific examples applied herein are set forth to explain the principles and implementations of the present application, and the above descriptions of the examples are only used to help understand the method and its core ideas of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims

1. A business processing method based on a heterogeneous computing platform is applied to a main core and comprises the following steps:

2. The method of claim 1, wherein before generating scheduling information upon reaching a preset execution state of the current task, further comprising:

3. The method of claim 2, wherein after said allocating shared memory in the shared memory space, further comprising:

4. The method of claim 3, wherein the generating scheduling information to facilitate the slave core to perform a corresponding computing operation with the target thread according to the scheduling information comprises:

5. The method of claim 3, wherein continuing to execute the current task when the computation progress information is computation completion comprises:

6. The method of claim 1, wherein after generating scheduling information upon reaching a preset execution state of the current task, further comprising:

7. The method of any one of claims 1-6, comprising a plurality of task type slave cores, wherein generating scheduling information upon reaching a preset execution state of the current task comprises:

8. A business processing device based on a heterogeneous computing platform is applied to a main core and comprises the following components:

9. A primary core of a heterogeneous computing platform, comprising:

a memory: for storing a computer program;

a processor: for executing the computer program to implement the steps of a heterogeneous computing platform based business processing method according to any of claims 1 to 7.

10. A heterogeneous computing platform, comprising: a master core and a slave core;