WO2021098257A1 - Service processing method based on heterogeneous computing platform - Google Patents

Service processing method based on heterogeneous computing platform

Info

Publication number
WO2021098257A1
WO2021098257A1 (PCT application PCT/CN2020/103650)
Authority
WO
WIPO (PCT)
Prior art keywords
calculation
core
current task
scheduling information
slave
Application number
PCT/CN2020/103650
Other languages
English (en)
French (fr)
Inventor
赵雅倩 (ZHAO Yaqian)
朱效民 (ZHU Xiaomin)
Original Assignee
浪潮电子信息产业股份有限公司 (Inspur Electronic Information Industry Co., Ltd.)
Application filed by 浪潮电子信息产业股份有限公司 (Inspur Electronic Information Industry Co., Ltd.)
Publication of WO2021098257A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46: Multiprogramming arrangements
    • G06F9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005: to service a request
    • G06F9/5011: the resources being hardware resources other than CPUs, servers and terminals
    • G06F9/5016: the resource being the memory
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46: Multiprogramming arrangements
    • G06F9/54: Interprogram communication
    • G06F9/546: Message passing systems or structures, e.g. queues

Definitions

  • This application relates to the field of computer technology, and in particular to a heterogeneous computing platform and a service processing method, apparatus and master core based thereon.
  • accelerators with stronger floating-point computing capabilities have become important components for building supercomputers.
  • Typical accelerators include GPU, domestically produced SW26010, etc.
  • the most efficient method is to use the library supported by the accelerator to program, compile, and run, such as CUDA supported by GPU, Athread supported by SW26010, etc.
  • the current programming approach is to call accelerator functions at the positions of the corresponding computation-intensive modules, thereby offloading the computation to the accelerator; once the calculation completes, i.e., after the called function returns, the master-core CPU continues with the non-computation-intensive parts, such as communication.
  • in the general case, an application involves more than one computation-intensive module, and these calculation modules are not contiguous.
  • between the calculation modules the CPU must handle transactions such as communication, which means each module requires its own thread start.
  • a thread is started for each module; after the slave-core calculation, the thread ends and control returns to the main process. Thread start and stop overheads are not always negligible, especially when the data volume is small and the compute-to-memory-access ratio is low (i.e., each datum incurs little computation): in that case the extra benefit of offloading is modest and is eroded by thread-related overhead, so the thread overhead needs to be optimized.
  • thread overhead will lead to unsatisfactory acceleration of heterogeneous computing.
  • most application ports only consider optimization at the computation level, and no publicly available method addresses this system-level overhead.
  • the purpose of this application is to provide a heterogeneous computing platform and a service processing method, apparatus and master core based thereon, so as to solve the problem that a traditional heterogeneous computing acceleration system suffers from thread overhead during service processing, resulting in low system performance.
  • the specific solution is as follows:
  • this application provides a service processing method based on a heterogeneous computing platform, applied to the master core, including:
  • the method further includes:
  • the shared memory is allocated in the shared storage space to store the scheduling information and the calculation progress information during the execution of the current task.
  • the method further includes:
  • the generating scheduling information so that the slave core uses the target thread to perform corresponding calculation operations according to the scheduling information includes:
  • the slave core uses the target thread to call the target calculation module to perform a corresponding calculation operation according to the scheduling information.
  • the continuing to execute the current task when the calculation progress information is that the calculation is completed includes:
  • the second shared variable is queried at preset intervals, and execution of the current task continues once the calculation progress information indicates that the calculation is complete.
  • the method further includes:
  • the scheduling information is delivered to the slave core by means of explicit communication.
  • generating scheduling information includes:
  • scheduling information for the target slave core is generated according to the task type.
  • this application provides a service processing apparatus based on a heterogeneous computing platform, applied to the master core, including:
  • Thread starting module: used to control the slave core to start the target thread when starting to execute the current task;
  • Scheduling module: used to generate scheduling information when the preset execution state of the current task is reached, so that the slave core uses the target thread to perform the corresponding calculation operation according to the scheduling information and generates calculation progress information;
  • Thread closing module: used to continue to execute the current task when the calculation progress information indicates that the calculation is complete, and to control the slave core to close the target thread when execution of the current task ends.
  • this application provides a master core of a heterogeneous computing platform, including:
  • Memory: used to store a computer program;
  • Processor: used to execute the computer program to implement the steps of the service processing method based on a heterogeneous computing platform described above.
  • this application provides a heterogeneous computing platform, including: a master core and a slave core;
  • the master core is used to control the slave core to start the target thread when starting to execute the current task; when the preset execution state of the current task is reached, scheduling information is generated;
  • the slave core is configured to use the target thread to perform a corresponding calculation operation according to the scheduling information, and to generate calculation progress information;
  • the master core is configured to continue to execute the current task when the calculation progress information indicates that the calculation is completed, and control the slave core to close the target thread when the execution of the current task ends.
  • the service processing method based on a heterogeneous computing platform provided by this application is applied to the master core and includes: when starting to execute the current task, controlling the slave core to start the target thread; when the preset execution state of the current task is reached, generating scheduling information, so that the slave core uses the target thread to perform the corresponding calculation operations according to the scheduling information and generates calculation progress information; when the calculation progress information indicates that the calculation is complete, continuing to execute the current task until execution of the current task ends.
  • this method starts and ends threads only once under a unified computing framework, avoiding the overhead caused by frequent thread startup and shutdown and improving the efficiency of heterogeneous computing; moreover, by designing a master-slave core communication framework and synchronization mechanism, communication of the calculation progress between the master and slave cores is achieved.
  • it is thereby ensured that the master core does not perform the next operation until the slave-core calculation is complete, and that the slave cores start the corresponding slave-core calculation modules at the corresponding times to perform the corresponding calculation operations, guaranteeing the correctness of the calculation.
  • this application also provides a service processing apparatus based on a heterogeneous computing platform, a master core, and a heterogeneous computing platform, whose technical effects correspond to those of the foregoing method and are not repeated here.
  • FIG. 1 is an implementation flowchart of Embodiment 1 of the service processing method based on a heterogeneous computing platform provided by this application;
  • FIG. 2 is an implementation flowchart of Embodiment 2 of the service processing method based on a heterogeneous computing platform provided by this application;
  • FIG. 3 is a functional block diagram of an embodiment of the service processing apparatus based on a heterogeneous computing platform provided by this application.
  • the master core calls accelerator functions at the positions of computation-intensive modules (generally for loops at the code level), thereby offloading the computation to the accelerator; once the calculation completes, i.e., after the called function returns, the master-core CPU continues with the non-computation-intensive parts.
  • the computation-intensive modules involved in an application are not single (for example, there are more than 50 such for-loop modules in the step2d module of the ocean model ROMS), and these modules are not contiguous, so threads must be started and closed frequently, which hurts computing performance.
  • this application provides a heterogeneous computing platform and a service processing method, apparatus and master core based thereon. Facing the scenario in heterogeneous computing where multiple non-contiguous modules are offloaded to the accelerator, the solution starts and ends threads only once under a unified computing framework, avoiding the overhead caused by frequent thread startup and shutdown and improving the efficiency of heterogeneous computing.
  • the first embodiment is applied to the master core and includes:
  • the foregoing preset execution state may specifically be execution reaching a certain calculation module, or reaching a certain time node.
  • the master-slave heterogeneous computing mode is adopted. Specifically, in the computation-intensive parts, the master core completes the corresponding division of computation by calling the slave cores, and each slave core completes its assigned computing task and returns; the master core then continues to execute the non-intensive parts in which the slave cores need not participate, such as communication. After a slave-core calculation ends, the slave core does not stop, but performs a phased status query; when the query shows that the next stage of calculation should start, the next calculation is started. This repeats in a loop until all the calculations that need to be offloaded to the slave cores are complete, at which point the slave cores are notified to stop waiting for the next calculation; the slave cores return, and the calculation phase is finished.
  • the master core's function calls to the slave cores are not explicit, but are implemented through communication and sharing between the master and slave cores.
  • the master core monitors the status of the slave cores, and for all the computation-intensive parts the slave cores implement the corresponding calculations using the library they support, such as CUDA (Compute Unified Device Architecture) supported by GPUs, or Athread (the thread library supported by the SW26010 processor) supported by the SW26010 (Shenwei 26010).
  • CUDA: Compute Unified Device Architecture
  • SW26010: Shenwei 26010
  • Athread: thread library supported by the SW26010 processor
  • for the master core to obtain calculation progress information from the slave cores, so as to decide whether to start the next calculation and transaction processing, and for the slave cores to obtain scheduling information from the master core, so as to decide to start the corresponding calculation, a communication mechanism between the master and slave cores must be designed.
  • the communication between the master and slave cores can be achieved in two ways.
  • one is implicit communication, i.e., there is shared memory that the master and slave cores can access directly, and communication is achieved by assigning and reading shared variables. Note that with this approach the variables must be configured so that they are not operated on in cache, but are read and written directly.
  • the second is explicit communication, i.e., there is no memory that can be shared between the master and slave cores, or the amount of data to be communicated is large; message exchange between the master and slave cores can then be achieved by explicit communication between the storage areas accessible to each.
  • when the master core invokes the slave cores for calculation, it must be ensured that all the slave cores have finished computing before proceeding to the next step.
  • this can be achieved by the master core querying the variables corresponding to the calculation states of all the slave cores: only when all the variables have been set by the slave cores may the master core proceed to the next step. After finishing the calculation of the current module, each slave core must set the variable belonging to it. Moreover, once the master core observes that the variables corresponding to all slave cores have been set and proceeds to the next step, all the variables must be restored, so that the next calculation module can reuse them to update its state.
  • the slave-core calculation start mechanism of the heterogeneous computing platform is implemented on the basis of the aforementioned master-slave core communication mechanism. Specifically, whenever a slave core completes a calculation module, it does not immediately start the next one, but waits for the master core to update its calculation state; that is, the calculation of the next module is performed only when the master core requires the slave core to start the corresponding calculation.
  • the slave core uses the communication or sharing mechanism to query the master core's calculation state, and performs the calculation of the corresponding module once the corresponding calculation-module variable has been set.
  • this embodiment provides a service processing method based on a heterogeneous computing platform, applied to the master core.
  • this method starts and ends threads only once under a unified computing framework, avoiding the overhead caused by frequent thread startup and shutdown and improving the efficiency of heterogeneous computing;
  • through the master-slave core communication framework and synchronization mechanism, communication of the calculation progress between the master and slave cores is achieved, ensuring that the master core proceeds to the next step only after the slave-core calculation is complete.
  • the second embodiment of a service processing method based on a heterogeneous computing platform provided by the present application will be introduced in detail below.
  • the second embodiment is implemented based on the aforementioned first embodiment, and is expanded to a certain extent on the basis of the first embodiment.
  • the second embodiment is specifically applied to the master core, including:
  • S201: Allocate shared memory in the shared storage space to store the scheduling information and the calculation progress information during execution of the current task;
  • the heterogeneous computing platform of this embodiment includes slave cores of multiple task types.
  • when the preset execution state of the current task is reached, the master core generates scheduling information for the target slave core according to the specific task type.
  • this embodiment takes the case where the heterogeneous computing platform supports shared storage as an example.
  • the master core allocates and manages the shared memory space.
  • if shared storage is not supported, space is allocated in each side's own storage, and subsequent data access is performed not by direct access but by explicit data transfer.
  • the following respectively describes the master-slave core heterogeneous computing framework, master-slave core communication mechanism, master core computing scheduling mechanism, and slave core computing synchronization mechanism of the heterogeneous computing platform in this embodiment.
  • the master-slave core heterogeneous computing framework mainly includes: the computation-intensive parts on the master-core side are implemented by calling slave-core functions, but no explicit function calls are made; instead, the slave cores actively probe the master core's execution state and start the calculation autonomously, setting the corresponding calculation progress information when the calculation completes, and the master core starts the next calculation based on the calculation progress information set by all the slave cores. The slave cores complete the corresponding calculation tasks for the different computation-intensive modules. On the slave-core coding side, the library supported by the slave core is used to implement the slave-core code of the corresponding calculation module for each computation-intensive sub-module. On the master-core coding side, for the computation-intensive parts to be offloaded to the slave cores, the corresponding parts are deleted from the master-core execution code in advance.
  • the master-slave core communication mechanism mainly includes: opening up space in the shared storage space to store the calculation status of the master core and the calculation status of each slave core respectively.
  • the master core's calculation state may be an integer representing the calculation module about to start, set by the master core and queried by the slave cores; each slave core's calculation state indicates whether that slave core has completed the calculation required by the master core, and is set by the slave core and queried by the master core.
  • the master-core calculation scheduling mechanism mainly includes:
  • the storage space is initialized.
  • the allocated master-core calculation state space is initialized to -1, meaning that a slave core querying it reads -1 and starts no calculation module; the allocated slave-core calculation state space is initialized to -1, meaning that the slave core has not yet performed any calculation;
  • after the master core has set the module ID for the slave cores to access, it queries the calculation states set by the slave cores to determine whether the slave cores have completed the calculation module; if so, code execution continues; otherwise, it loop-waits until all the slave cores have completed the corresponding calculation.
  • the slave-core calculation synchronization mechanism mainly includes:
  • start-stop code is added: the thread is started before all the calculation modules in the slave-core code, and thread-stop code is added at the end of the last calculation module, so that the thread starts and stops only once in the whole calculation process;
  • the slave core directly queries the status in shared storage (or initiates a data transfer), and after reading the ID set by the master core, starts the slave-core calculation for the corresponding ID;
  • the service processing method based on a heterogeneous computing platform can realize communication between the master and slave cores in shared-memory or explicit-communication mode, and realize the setting and querying of the master-core and slave-core calculation states.
  • the calculation task can thus be started at the corresponding moment, and communication and other processing started after the calculation task completes, achieving synchronization of the overall calculation and ensuring correct results; and this is not at the cost of starting and stopping threads multiple times, so data communication is achieved at small cost, the additional overhead of repeatedly starting and stopping threads in the traditional heterogeneous computing model is avoided, and the speedup and efficiency of heterogeneous computing are improved.
  • this embodiment is oriented to heterogeneous computing, such as GPU, SW26010, etc., and designs a heterogeneous computing framework.
  • the order of the calculation modules is guaranteed on the master-core side, and the accelerator side guarantees that the corresponding calculation sub-module is started at the corresponding time, ensuring the correctness of the heterogeneous calculation and improving its efficiency.
  • a service processing apparatus based on a heterogeneous computing platform provided by an embodiment of the present application is described below.
  • the service processing apparatus described below and the service processing method described above may be referred to in correspondence with each other.
  • the service processing apparatus of this embodiment, applied to the master core, includes:
  • Thread starting module 301: used to control the slave core to start the target thread when starting to execute the current task;
  • Scheduling module 302: used to generate scheduling information when the preset execution state of the current task is reached, so that the slave core uses the target thread to perform the corresponding calculation operation according to the scheduling information and generates calculation progress information;
  • Thread closing module 303: used to continue to execute the current task when the calculation progress information indicates that the calculation is complete, and to control the slave core to close the target thread when execution of the current task ends.
  • the service processing apparatus based on a heterogeneous computing platform of this embodiment is used to implement the foregoing service processing method based on a heterogeneous computing platform, so for its specific implementation reference may be made to the embodiments of the method above; for example, the thread starting module 301, the scheduling module 302 and the thread closing module 303 are used to implement steps S101, S102 and S103 of the above service processing method respectively, and the descriptions of the corresponding embodiment parts apply and are not repeated here.
  • the functions of this apparatus correspond to those of the above method and are not repeated here.
  • this application also provides a master core of a heterogeneous computing platform, including:
  • Memory: used to store a computer program;
  • Processor: used to execute the computer program to implement the steps of the service processing method based on a heterogeneous computing platform described above.
  • this application provides a heterogeneous computing platform, including: a master core and a slave core;
  • the master core is used to control the slave core to start the target thread when starting to execute the current task; when the preset execution state of the current task is reached, scheduling information is generated;
  • the slave core is configured to use the target thread to perform a corresponding calculation operation according to the scheduling information, and to generate calculation progress information;
  • the master core is configured to continue to execute the current task when the calculation progress information indicates that the calculation is completed, and control the slave core to close the target thread when the execution of the current task ends.
  • the steps of the method or algorithm described in combination with the embodiments disclosed in this document can be directly implemented by hardware, a software module executed by a processor, or a combination of the two.
  • the software module can reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multi Processors (AREA)
  • Stored Programmes (AREA)

Abstract

A heterogeneous computing platform and a service processing method, apparatus and master core based thereon. Under a unified computing framework, the solution starts and ends threads only once during service processing, avoiding the overhead caused by frequently starting and stopping threads and improving the efficiency of heterogeneous computing. Moreover, by means of a master-slave core communication framework and synchronization mechanism, communication of calculation progress between the master and slave cores is achieved, which ensures that the master core performs the next operation only after the slave-core calculation is complete, and that the slave cores start the corresponding slave-core calculation modules at different times to perform the corresponding calculation operations, guaranteeing the correctness of the calculation.

Description

Service processing method based on heterogeneous computing platform
This application claims priority to Chinese patent application No. 201911161201.9, entitled "Service processing method based on heterogeneous computing platform" and filed with the Chinese Patent Office on November 24, 2019, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of computer technology, and in particular to a heterogeneous computing platform and a service processing method, apparatus and master core based thereon.
Background
Simulation and modeling on large supercomputers is an important, even irreplaceable, means of supporting scientific research; computational modeling and simulation has become the third paradigm of scientific research.
In recent years, as many applications demand ever higher computing speed, accelerators with stronger floating-point computing capability have become important components for building supercomputers. The computation-intensive parts that originally ran on traditional CPUs are offloaded to accelerators designed specifically for speed, thereby increasing the computing speed of applications. Typical accelerators include the GPU, the domestically produced SW26010, and so on.
To make full use of the computing performance of these accelerator components, the computation-intensive parts that traditionally run on the CPU generally need to be ported to the many-core architecture. The most efficient way to do this is to program, compile and run with the library supported by the accelerator, such as CUDA supported by GPUs or Athread supported by the SW26010. The current programming approach is to call accelerator functions at the positions of the corresponding computation-intensive modules, thereby offloading the computation to the accelerator. Once the computation completes, i.e., after the called function returns, the master-core CPU continues with the non-computation-intensive parts, such as communication.
In the general case, however, an application involves more than one computation-intensive module, and these modules are not contiguous: between them the CPU must handle transactions such as communication. As a result, a thread must be started for every module, and after the slave-core computation the thread ends and control returns to the main process. Thread start and stop overheads are not always negligible. In particular, when the data volume is small and the compute-to-memory-access ratio is low (i.e., each datum incurs little computation), the extra benefit of offloading is modest and is eroded by thread-related overhead, so the thread overhead needs to be optimized.
Based on the above analysis of thread-related overhead, and on actual performance tests and analysis during the porting and optimization of ROMS, it can be seen that thread overhead leads to unsatisfactory acceleration of heterogeneous computing. At present, most application ports only consider optimization at the computation level, and no publicly available method addresses this system-level overhead.
It can be seen that how to avoid the thread overhead of a heterogeneous computing acceleration system during service processing and improve system performance is a problem urgently to be solved by those skilled in the art.
Summary
The purpose of this application is to provide a heterogeneous computing platform and a service processing method, apparatus and master core based thereon, so as to solve the problem that a traditional heterogeneous computing acceleration system suffers from thread overhead during service processing, resulting in low system performance. The specific solution is as follows:
In a first aspect, this application provides a service processing method based on a heterogeneous computing platform, applied to a master core, including:
when starting to execute a current task, controlling a slave core to start a target thread;
when a preset execution state of the current task is reached, generating scheduling information, so that the slave core uses the target thread to perform a corresponding calculation operation according to the scheduling information, and generates calculation progress information;
when the calculation progress information indicates that the calculation is complete, continuing to execute the current task, and controlling the slave core to close the target thread when execution of the current task ends.
Preferably, before the generating scheduling information when the preset execution state of the current task is reached, the method further includes:
allocating shared memory in a shared storage space to store the scheduling information and the calculation progress information during execution of the current task.
Preferably, after the allocating shared memory in a shared storage space, the method further includes:
setting a first shared variable and a second shared variable in the shared memory, and initializing the first shared variable and the second shared variable, where the first shared variable is used to store the scheduling information and the second shared variable is used to store the calculation progress information.
Preferably, the generating scheduling information so that the slave core uses the target thread to perform a corresponding calculation operation according to the scheduling information includes:
assigning to the first shared variable the identification information of a target calculation module as the scheduling information, so that the slave core uses the target thread to call the target calculation module to perform the corresponding calculation operation according to the scheduling information.
Preferably, the continuing to execute the current task when the calculation progress information indicates that the calculation is complete includes:
querying the second shared variable at preset intervals, and continuing to execute the current task once the calculation progress information indicates that the calculation is complete.
Preferably, after the generating scheduling information when the preset execution state of the current task is reached, the method further includes:
delivering the scheduling information to the slave core by means of explicit communication.
Preferably, where slave cores of multiple task types are included, the generating scheduling information when the preset execution state of the current task is reached includes:
when the preset execution state of the current task is reached, generating scheduling information for a target slave core according to the task type.
In a second aspect, this application provides a service processing apparatus based on a heterogeneous computing platform, applied to a master core, including:
a thread starting module, used to control a slave core to start a target thread when starting to execute a current task;
a scheduling module, used to generate scheduling information when a preset execution state of the current task is reached, so that the slave core uses the target thread to perform a corresponding calculation operation according to the scheduling information and generates calculation progress information;
a thread closing module, used to continue to execute the current task when the calculation progress information indicates that the calculation is complete, and to control the slave core to close the target thread when execution of the current task ends.
In a third aspect, this application provides a master core of a heterogeneous computing platform, including:
a memory, used to store a computer program;
a processor, used to execute the computer program to implement the steps of the service processing method based on a heterogeneous computing platform described above.
In a fourth aspect, this application provides a heterogeneous computing platform, including a master core and a slave core;
the master core is used to control the slave core to start a target thread when starting to execute a current task, and to generate scheduling information when a preset execution state of the current task is reached;
the slave core is used to use the target thread to perform a corresponding calculation operation according to the scheduling information, and to generate calculation progress information;
the master core is used to continue to execute the current task when the calculation progress information indicates that the calculation is complete, and to control the slave core to close the target thread when execution of the current task ends.
The service processing method based on a heterogeneous computing platform provided by this application is applied to a master core and includes: when starting to execute a current task, controlling a slave core to start a target thread; when a preset execution state of the current task is reached, generating scheduling information, so that the slave core uses the target thread to perform a corresponding calculation operation according to the scheduling information and generates calculation progress information; when the calculation progress information indicates that the calculation is complete, continuing to execute the current task, and controlling the slave core to close the target thread when execution of the current task ends.
It can be seen that, under a unified computing framework, this method starts and ends threads only once, avoiding the overhead caused by frequently starting and stopping threads and improving the efficiency of heterogeneous computing. Moreover, by designing a master-slave core communication framework and synchronization mechanism, communication of calculation progress between the master and slave cores is achieved, which ensures that the master core performs the next operation only after the slave-core calculation is complete, and that the slave cores start the corresponding slave-core calculation modules at different times to perform the corresponding calculation operations, thereby guaranteeing the correctness of the calculation.
In addition, this application further provides a service processing apparatus based on a heterogeneous computing platform, a master core, and a heterogeneous computing platform, whose technical effects correspond to those of the above method and are not repeated here.
Brief Description of the Drawings
To describe the technical solutions of the embodiments of this application or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are merely some embodiments of this application; for a person of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is an implementation flowchart of Embodiment 1 of the service processing method based on a heterogeneous computing platform provided by this application;
FIG. 2 is an implementation flowchart of Embodiment 2 of the service processing method based on a heterogeneous computing platform provided by this application;
FIG. 3 is a functional block diagram of an embodiment of the service processing apparatus based on a heterogeneous computing platform provided by this application.
Detailed Description
To enable those skilled in the art to better understand the solution of this application, this application is further described in detail below with reference to the drawings and specific embodiments. Obviously, the described embodiments are merely some rather than all of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application without creative effort fall within the protection scope of this application.
To guarantee computing performance, the master core calls accelerator functions at the positions of computation-intensive modules (generally for loops at the code level), thereby offloading the computation to the accelerator; once the computation completes, i.e., after the called function returns, the master-core CPU continues with the non-computation-intensive parts. In the general case, however, an application involves more than one computation-intensive module (for example, the step2d module of the ocean model ROMS contains more than 50 such for-loop modules), and these modules are not contiguous, so threads must be started and stopped frequently, which hurts computing performance.
In view of the above problem, this application provides a heterogeneous computing platform and a service processing method, apparatus and master core based thereon. Facing the scenario in heterogeneous computing where multiple non-contiguous modules are offloaded to an accelerator, the solution starts and ends threads only once under a unified computing framework, avoiding the overhead caused by frequently starting and stopping threads and improving the efficiency of heterogeneous computing.
Embodiment 1 of the service processing method based on a heterogeneous computing platform provided by this application is introduced below. Referring to FIG. 1, Embodiment 1 is applied to a master core and includes:
S101: when starting to execute a current task, controlling a slave core to start a target thread;
S102: when a preset execution state of the current task is reached, generating scheduling information, so that the slave core uses the target thread to perform a corresponding calculation operation according to the scheduling information and generates calculation progress information;
The above preset execution state may specifically be execution reaching a certain calculation module, or reaching a certain time node.
S103: when the calculation progress information indicates that the calculation is complete, continuing to execute the current task, and controlling the slave core to close the target thread when execution of the current task ends.
In this embodiment, in terms of the system framework of the heterogeneous computing platform, a master-slave heterogeneous computing mode is adopted. Specifically, in the computation-intensive parts, the master core completes the corresponding division of computation by calling the slave cores; each slave core completes its assigned computing task and returns, after which the master core continues with the non-intensive parts in which the slave cores need not participate, such as communication and other transactions. After a slave-core calculation ends, the slave core does not stop; instead it performs a phased status query, and when the query shows that the next stage of calculation should start, it starts the next calculation. This repeats in a loop until all calculations that need to be offloaded to the slave cores are complete, at which point the slave cores are notified to stop waiting for further calculations; the slave cores then return, and the calculation phase is finished.
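The lifecycle described in the preceding paragraph (the slave-core thread is started once, loops through phased status queries, and returns only when told that no further calculation is needed) can be sketched with an ordinary thread standing in for the slave core. This is a minimal illustrative sketch: the class name, the module table and the -1/-2 sentinel values are assumptions of the sketch, not identifiers from the patent.

```python
import threading
import time

IDLE, STOP = -1, -2   # sentinel values for the master's command variable

class SlaveCore:
    """A persistent worker thread: started once, runs many modules, stopped once."""
    def __init__(self, modules):
        self.modules = modules    # module ID -> callable
        self.command = IDLE       # set by the master, polled by the slave
        self.done = False         # set by the slave, polled by the master
        self.results = []
        self.thread = threading.Thread(target=self._run)

    def start(self):              # the thread starts exactly once
        self.thread.start()

    def _run(self):
        while True:
            cmd = self.command
            if cmd == STOP:
                return            # the thread ends exactly once
            if cmd != IDLE and not self.done:
                self.results.append(self.modules[cmd]())
                self.done = True  # report calculation progress to the master
            time.sleep(0.001)     # phased status query

    def dispatch(self, module_id):
        """Master side: schedule one module and wait for its completion."""
        self.done = False
        self.command = module_id
        while not self.done:      # master proceeds only after the slave finishes
            time.sleep(0.001)
        self.command = IDLE

    def stop(self):
        self.command = STOP       # notify the slave to stop waiting
        self.thread.join()

slave = SlaveCore({0: lambda: "module0", 1: lambda: "module1"})
slave.start()
slave.dispatch(0)
# ... master performs non-compute work (e.g. communication) here ...
slave.dispatch(1)
slave.stop()
print(slave.results)  # ['module0', 'module1']
```

Each `dispatch` reuses the same thread, so thread start/stop costs are paid once per task rather than once per calculation module.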
It should be noted that in this embodiment the master core's function calls to the slave cores are not explicit, but are implemented through communication and sharing between the master and slave cores. The master core monitors the status of the slave cores, while on the slave-core side all the computation-intensive parts are implemented with the library supported by the slave core, such as CUDA (Compute Unified Device Architecture) supported by GPUs, or Athread (the thread library supported by the SW26010 processor) supported by the SW26010 (Shenwei 26010).
For the master core to obtain calculation progress messages from the slave cores, so as to decide whether to start the next calculation and transaction processing, and for the slave cores to obtain scheduling information from the master core, so as to decide to start the corresponding calculation, a communication mechanism between the master and slave cores must be designed.
In this embodiment, regarding the master-slave core communication mechanism of the heterogeneous computing platform, communication between the master and slave cores can be implemented in two ways. The first is implicit communication: shared memory that can be accessed directly exists between the master and slave cores, and communication is achieved by assigning and reading shared variables. Note that with this approach the variables must be configured so that they are not operated on in cache, but are read and written directly. The second is explicit communication: when no memory can be shared between the master and slave cores, or the amount of data to be communicated is large, message exchange between the master and slave cores can be achieved by explicit communication between the storage areas accessible to each.
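As a minimal model of the two communication modes just described, the sketch below uses plain Python objects to stand in for the shared storage and for the explicit channel. All names are illustrative assumptions; on a real master-slave platform the implicitly shared variables would additionally need to be configured so that reads and writes bypass the cache (for example `volatile` in C), which this Python sketch does not model.

```python
import queue

# Implicit communication: a directly shared memory area, assigned by one side
# and read by the other.
class SharedVariables:
    def __init__(self, num_slaves):
        self.master_state = -1                # module ID to start; set by master
        self.slave_state = [-1] * num_slaves  # completion flags; set by slaves

# Explicit communication: no shared memory, so scheduling information and
# progress messages are exchanged through an explicit channel between the
# storage areas (a queue stands in for the transfer here).
class MessageChannel:
    def __init__(self):
        self._q = queue.Queue()
    def send(self, msg):
        self._q.put(msg)
    def receive(self):
        return self._q.get()

shared = SharedVariables(num_slaves=2)
shared.master_state = 3              # implicit: assign the shared variable
assert shared.master_state == 3      # the other side reads the same location

chan = MessageChannel()
chan.send(("start_module", 3))       # explicit: transmit the scheduling info
print(chan.receive())                # ('start_module', 3)
```

The implicit form suits platforms with directly addressable shared memory; the explicit form suits platforms without it, or communications with larger payloads.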
In terms of the master-core synchronization guarantee mechanism of the heterogeneous computing platform in this embodiment, when the master core invokes the slave cores for computation, it must ensure that all the slave cores have finished computing before proceeding to the next step of the flow. On the basis of the communication mechanism described above, this can be realized by having the master core query the variables corresponding to the computation states of all the slave cores: once all the variables have been set by the slave cores, the master core may proceed to the next step. After finishing the current module, each slave core must set the variable belonging to it. Moreover, when the master core finds that the variables corresponding to all the slave cores have been set and moves on to the next step, it must reset all the variables so that the next computation module can reuse them for state updates.
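A minimal sketch of this master-side check, assuming one status word per slave core held in shared memory; the function names and the list-of-integers representation are illustrative, not from the patent:

```python
# Hypothetical slave status words: -1 means "not finished yet";
# any other value means that slave has set (completed) its module.
def all_slaves_done(status):
    """Master-side query: true once every slave has set its variable."""
    return all(s != -1 for s in status)

def reset_status(status):
    """Master-side reset so the next module can reuse the variables."""
    for i in range(len(status)):
        status[i] = -1

status = [-1, -1, -1, -1]          # four slave cores, none finished
assert not all_slaves_done(status)

for core in range(4):              # each slave sets its own entry when done
    status[core] = 1
assert all_slaves_done(status)     # master may now proceed to the next step

reset_status(status)               # restore before the next computation module
assert status == [-1, -1, -1, -1]
```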
In this embodiment, the slave-core computation start mechanism of the heterogeneous computing platform is implemented on top of the aforementioned master-slave communication mechanism. Specifically, whenever a slave core finishes a computation module, it does not immediately start the next one; instead it waits for an update of the master core's computation state, i.e., it starts the computation of the next module only when the master core requires it to start the corresponding computation. Concretely, the slave core uses the communication or sharing mechanism to query the master core's computation state, and once the variable for the corresponding computation module has been set, it performs the computation of that module.
In summary, this embodiment provides a service processing method based on a heterogeneous computing platform, applied to a master core. Within a single unified computing framework, the method starts and stops the thread only once, avoiding the overhead caused by frequent thread startup and shutdown and improving the efficiency of heterogeneous computing. Moreover, by designing a master-slave communication framework and synchronization mechanism, it realizes communication of computation progress between the master and slave cores, ensuring that the master core proceeds to the next step only after the slave cores have finished computing, and that each slave core starts the corresponding slave-core computation module at the corresponding moment, which in turn guarantees the correctness of the computation.
Embodiment 2 of the service processing method based on a heterogeneous computing platform provided by the present application is introduced in detail below. Embodiment 2 is implemented on the basis of Embodiment 1 and extends it to a certain degree.
Referring to Fig. 2, Embodiment 2 is specifically applied to a master core and includes:
S201. Allocate shared memory in a shared storage space to store the scheduling information and the computation progress information during execution of the current task;
S202. Set a first shared variable and a second shared variable in the shared memory and initialize the first shared variable and the second shared variable, where the first shared variable is used to store the scheduling information and the second shared variable is used to store the computation progress information;
S203. When execution of the current task begins, control a slave core to start a target thread;
S204. Assign the identification information of a target computation module to the first shared variable as the scheduling information, so that the slave core uses the target thread to invoke the target computation module according to the scheduling information to perform the corresponding computation, and generates computation progress information;
Specifically, the heterogeneous computing platform of this embodiment includes slave cores of multiple task types. When the preset execution state of the current task is reached, the master core generates scheduling information for the target slave core according to the specific task type.
S205. When the computation progress information indicates that the computation is complete, continue executing the current task until the current task finishes, at which point control the slave core to close the target thread.
It should be noted that this embodiment is described taking the case where the heterogeneous computing platform supports shared storage as an example: when the system supports shared memory, the master core opens up and manages the shared memory space. In practical application scenarios where shared storage is not supported, space is opened in each side's own storage, and subsequent data access is not direct access but explicit data transfer.
The master-slave heterogeneous computing framework, master-slave communication mechanism, master-core computation scheduling mechanism, and slave-core computation synchronization mechanism of the heterogeneous computing platform in this embodiment are described below in turn.
The master-slave heterogeneous computing framework mainly includes the following. On the master-core side, the compute-intensive parts are realized by invoking slave-core functions, but without explicit function calls; instead the slave cores actively probe the master core's execution state and start computing autonomously, setting the corresponding computation progress information upon completion, and the master core starts the next computation only after the computation progress information of all the slave cores has been set. The slave cores each complete the computation tasks corresponding to the different compute-intensive modules. In slave-core coding, the slave-core code of each compute-intensive sub-module is implemented separately using the library supported by the slave core. In master-core coding, the compute-intensive parts that need to be offloaded to the slave cores are deleted from the master-core execution code in advance.
The master-slave communication mechanism mainly includes: opening space in the shared storage space to store the computation state of the master core and of each slave core respectively. Specifically, the master core's computation state may be an integer representing the computation module about to start, set by the master core and queried by the slave cores; each slave core's computation state indicates whether that slave core has finished the computation required by the master core, set by the slave core and queried by the master core.
The master-core computation scheduling mechanism mainly includes:
Storage space initialization. The allocated master-core computation state space is initialized to -1, i.e., a slave core that queries -1 starts no computation module; the allocated slave-core computation state space is initialized to -1, i.e., the slave core has not yet performed any computation;
Data setting. When the master-core code reaches a compute-intensive module, the value is set to the ID of the corresponding computation module (a distinct integer per module) for the slave cores to query;
Data querying. After the master core has set the module ID for the slave cores to access, it queries the computation states set by the slave cores to learn whether the slave cores have finished that computation module; if so, code execution continues; otherwise the master core loops and waits until all the slave cores have completed the corresponding computation.
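The set-then-poll sequence on the master side can be sketched as follows, with a host thread standing in for a slave core. The sleep-based polling interval, the module ID `7`, and the dictionary names are all hypothetical:

```python
import threading
import time

master_state = {"module": -1}   # -1 after initialization: no module scheduled
slave_state = {"done": -1}      # -1 after initialization: nothing computed yet

def slave_core():
    # Slave side: wait until the master publishes a module ID.
    while master_state["module"] == -1:
        time.sleep(0.001)
    # ... perform the computation for master_state["module"] here ...
    slave_state["done"] = master_state["module"]   # mark completion

t = threading.Thread(target=slave_core)
t.start()

master_state["module"] = 7        # master reaches a hot module: set its ID
while slave_state["done"] != 7:   # query at a preset interval until complete
    time.sleep(0.001)
t.join()
print("module", slave_state["done"], "finished")
```

The `time.sleep` in the master's loop corresponds to querying the progress variable at a preset interval rather than spinning at full speed, a trade-off between latency and wasted cycles.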
The slave-core computation synchronization mechanism mainly includes:
Adding start/stop code. The thread is started before all the computation modules in the slave-core code, and thread-stop code is added at the end of the last computation module, so that over the whole computation process the thread is started and stopped only once;
Slave-core data querying. The slave core directly queries the state in shared storage (or initiates a data transfer); after reading the ID set by the master core, it starts the slave-core computation corresponding to that ID;
Slave-core data setting. After finishing its computation, the slave core sets the state at its own corresponding storage location, so that the master core can query that state and learn that the slave core's computation is complete.
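The three slave-side steps can be combined into one loop that dispatches on the published module ID. The dispatch table, the two toy modules, and the shared dictionary are illustrative assumptions; a real slave core would run accelerator kernels instead:

```python
import threading

shared = {"master_module": -1, "slave_done": -1}  # illustrative shared words

def module_a():
    return "A ran"

def module_b():
    return "B ran"

MODULES = {0: module_a, 1: module_b}  # hypothetical ID -> compute module map
log = []

def slave_loop():
    finished = set()
    while len(finished) < len(MODULES):
        mid = shared["master_module"]
        if mid in MODULES and mid not in finished:
            log.append(MODULES[mid]())    # start the module matching the ID
            finished.add(mid)
            shared["slave_done"] = mid    # set own state for the master

t = threading.Thread(target=slave_loop)
t.start()                                 # thread started once, up front

for mid in (0, 1):                        # master publishes IDs in order
    shared["slave_done"] = -1
    shared["master_module"] = mid
    while shared["slave_done"] != mid:    # master waits before proceeding
        pass
t.join()                                  # thread stopped once, at the end
print(log)                                # → ['A ran', 'B ran']
```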
In summary, the service processing method based on a heterogeneous computing platform provided by this embodiment can realize master-slave communication by way of shared memory or explicit communication, enabling the setting and querying of the master and slave computation states, so that computation tasks can be started at the corresponding moments and flows such as communication can be started after a computation task completes, achieving synchronization of the overall computation and guaranteeing its correctness. None of this comes at the cost of starting and stopping threads repeatedly: with relatively low-cost data communication, the extra overhead of repeated thread startup and shutdown in the traditional heterogeneous computing mode is avoided, improving the speedup ratio and efficiency of heterogeneous computing.
It can be seen that this embodiment targets heterogeneous computing, such as GPU and SW26010, and designs a heterogeneous computing framework in which, through the master-slave communication mechanism, the master-core side ensures that the computation modules proceed in order and the accelerator side ensures that the corresponding computation sub-modules start at the corresponding moments, guaranteeing the correctness of heterogeneous computing and improving its efficiency.
The service processing apparatus based on a heterogeneous computing platform provided by an embodiment of the present application is introduced below; the apparatus described below and the service processing method based on a heterogeneous computing platform described above may be cross-referenced.
As shown in Fig. 3, the service processing apparatus of this embodiment is applied to a master core and includes:
a thread starting module 301, configured to control a slave core to start a target thread when execution of a current task begins;
a scheduling module 302, configured to generate scheduling information when a preset execution state of the current task is reached, so that the slave core uses the target thread to perform the corresponding computation according to the scheduling information and generates computation progress information; and
a thread closing module 303, configured to continue executing the current task when the computation progress information indicates that the computation is complete, until the current task finishes, at which point the slave core is controlled to close the target thread.
The service processing apparatus based on a heterogeneous computing platform of this embodiment is used to implement the aforementioned service processing method based on a heterogeneous computing platform, so the specific implementations of the apparatus can be found in the method embodiments above; for example, the thread starting module 301, the scheduling module 302, and the thread closing module 303 are respectively used to implement steps S101, S102, and S103 of the method. Their specific implementations may therefore refer to the descriptions of the corresponding embodiments and are not elaborated here.
In addition, since the service processing apparatus of this embodiment is used to implement the aforementioned service processing method, its effects correspond to those of the method above and are not repeated here.
Furthermore, the present application also provides a master core of a heterogeneous computing platform, including:
a memory, configured to store a computer program; and
a processor, configured to execute the computer program to implement the steps of the service processing method based on a heterogeneous computing platform described above.
Finally, the present application provides a heterogeneous computing platform, including a master core and a slave core, where:
the master core is configured to control the slave core to start a target thread when execution of a current task begins, and to generate scheduling information when a preset execution state of the current task is reached;
the slave core is configured to use the target thread to perform the corresponding computation according to the scheduling information and to generate computation progress information; and
the master core is configured to continue executing the current task when the computation progress information indicates that the computation is complete, until the current task finishes, at which point the slave core is controlled to close the target thread.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be cross-referenced. Since the apparatus disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively brief, and the relevant parts may refer to the method description.
The steps of the methods or algorithms described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The solution provided by the present application has been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the descriptions of the above embodiments are only intended to help understand the method of the present application and its core idea. Meanwhile, for a person of ordinary skill in the art, changes may be made to the specific implementations and application scope according to the idea of the present application. In summary, the contents of this specification should not be construed as limiting the present application.

Claims (10)

  1. A service processing method based on a heterogeneous computing platform, applied to a master core and comprising:
    when execution of a current task begins, controlling a slave core to start a target thread;
    when a preset execution state of the current task is reached, generating scheduling information, so that the slave core uses the target thread to perform a corresponding computation according to the scheduling information and generates computation progress information; and
    when the computation progress information indicates that the computation is complete, continuing to execute the current task until the current task finishes, at which point controlling the slave core to close the target thread.
  2. The method according to claim 1, wherein before generating the scheduling information when the preset execution state of the current task is reached, the method further comprises:
    allocating shared memory in a shared storage space to store the scheduling information and the computation progress information during execution of the current task.
  3. The method according to claim 2, wherein after allocating the shared memory in the shared storage space, the method further comprises:
    setting a first shared variable and a second shared variable in the shared memory and initializing the first shared variable and the second shared variable, wherein the first shared variable is used to store the scheduling information and the second shared variable is used to store the computation progress information.
  4. The method according to claim 3, wherein generating the scheduling information so that the slave core uses the target thread to perform the corresponding computation according to the scheduling information comprises:
    assigning identification information of a target computation module to the first shared variable as the scheduling information, so that the slave core uses the target thread to invoke the target computation module according to the scheduling information to perform the corresponding computation.
  5. The method according to claim 3, wherein continuing to execute the current task when the computation progress information indicates that the computation is complete comprises:
    querying the second shared variable at preset intervals until the computation progress information indicates that the computation is complete, and then continuing to execute the current task.
  6. The method according to claim 1, wherein after generating the scheduling information when the preset execution state of the current task is reached, the method further comprises:
    transmitting the scheduling information to the slave core by means of explicit communication.
  7. The method according to any one of claims 1 to 6, comprising slave cores of multiple task types, wherein generating the scheduling information when the preset execution state of the current task is reached comprises:
    when the preset execution state of the current task is reached, generating scheduling information for a target slave core according to the task type.
  8. A service processing apparatus based on a heterogeneous computing platform, applied to a master core and comprising:
    a thread starting module, configured to control a slave core to start a target thread when execution of a current task begins;
    a scheduling module, configured to generate scheduling information when a preset execution state of the current task is reached, so that the slave core uses the target thread to perform a corresponding computation according to the scheduling information and generates computation progress information; and
    a thread closing module, configured to continue executing the current task when the computation progress information indicates that the computation is complete, until the current task finishes, at which point the slave core is controlled to close the target thread.
  9. A master core of a heterogeneous computing platform, comprising:
    a memory, configured to store a computer program; and
    a processor, configured to execute the computer program to implement the steps of the service processing method based on a heterogeneous computing platform according to any one of claims 1 to 7.
  10. A heterogeneous computing platform, comprising a master core and a slave core, wherein:
    the master core is configured to control the slave core to start a target thread when execution of a current task begins, and to generate scheduling information when a preset execution state of the current task is reached;
    the slave core is configured to use the target thread to perform a corresponding computation according to the scheduling information and to generate computation progress information; and
    the master core is configured to continue executing the current task when the computation progress information indicates that the computation is complete, until the current task finishes, at which point the slave core is controlled to close the target thread.
PCT/CN2020/103650 2019-11-24 2020-07-23 Service processing method based on heterogeneous computing platform WO2021098257A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911161201.9A CN110990151A (zh) 2019-11-24 2019-11-24 Service processing method based on heterogeneous computing platform
CN201911161201.9 2019-11-24

Publications (1)

Publication Number Publication Date
WO2021098257A1 true WO2021098257A1 (zh) 2021-05-27

Family

ID=70086139

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/103650 WO2021098257A1 (zh) 2019-11-24 2020-07-23 一种基于异构计算平台的业务处理方法

Country Status (2)

Country Link
CN (1) CN110990151A (zh)
WO (1) WO2021098257A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110990151A (zh) * 2019-11-24 2020-04-10 Inspur Electronic Information Industry Co., Ltd. Service processing method based on heterogeneous computing platform
CN111459647B (zh) * 2020-06-17 2020-09-25 Beijing Electro-Mechanical Engineering Institute DSP multi-core processor parallel computing method and apparatus based on an embedded operating system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120304184A1 (en) * 2010-02-23 2012-11-29 Fujitsu Limited Multi-core processor system, computer product, and control method
US20140130021A1 (en) * 2012-11-05 2014-05-08 Nvidia Corporation System and method for translating program functions for correct handling of local-scope variables and computing system incorporating the same
CN104794006A (zh) * 2010-02-23 2015-07-22 Fujitsu Limited Multi-core processor system, interrupt program, and interrupt method
CN105242962A (zh) * 2015-11-24 2016-01-13 Wuxi Jiangnan Institute of Computing Technology Fast triggering method for lightweight threads based on heterogeneous many-core architecture
CN110990151A (zh) * 2019-11-24 2020-04-10 Inspur Electronic Information Industry Co., Ltd. Service processing method based on heterogeneous computing platform

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7533382B2 (en) * 2002-10-30 2009-05-12 Stmicroelectronics, Inc. Hyperprocessor
CN101551761A (zh) * 2009-04-30 2009-10-07 Inspur Electronic Information Industry Co., Ltd. Method for sharing stream memory in a heterogeneous multiprocessor
CN103902387A (zh) * 2014-04-29 2014-07-02 Inspur Electronic Information Industry Co., Ltd. Dynamic load balancing method for CPU+GPU cooperative parallel computing
CN104869398B (zh) * 2015-05-21 2017-08-22 Dalian University of Technology Parallel method for implementing CABAC in HEVC based on a CPU+GPU heterogeneous platform
CN104899089A (zh) * 2015-05-25 2015-09-09 Changzhou PKU-Unity Network Computer Co., Ltd. Task scheduling method for heterogeneous multi-core systems
CN106358003B (zh) * 2016-08-31 2019-02-19 Huazhong University of Science and Technology Video analysis acceleration method based on thread-level pipelining
CN108319510A (zh) * 2017-12-28 2018-07-24 Datang Software Technologies Co., Ltd. Heterogeneous processing method and apparatus
CN108416433B (zh) * 2018-01-22 2020-11-24 Shanghai Yizhi Electronic Technology Co., Ltd. Neural network heterogeneous acceleration method and system based on asynchronous events
KR102533241B1 (ko) * 2018-01-25 2023-05-16 Samsung Electronics Co., Ltd. Heterogeneous computing system configured to adaptively control cache coherency
CN110135584B (zh) * 2019-03-30 2022-11-18 South China University of Technology Large-scale symbolic regression method and system based on an adaptive parallel genetic algorithm


Also Published As

Publication number Publication date
CN110990151A (zh) 2020-04-10

Similar Documents

Publication Publication Date Title
CN105074666B (zh) Operating system executing on processors with different instruction set architectures
US8276145B2 (en) Protected mode scheduling of operations
JP6010540B2 (ja) Runtime-agnostic representation of user code for execution by a selected execution runtime
JP5295228B2 (ja) System comprising a plurality of processors and method of operating the same
US9063783B2 (en) Coordinating parallel execution of processes using agents
US8869162B2 (en) Stream processing on heterogeneous hardware devices
US7802252B2 (en) Method and apparatus for selecting the architecture level to which a processor appears to conform
US20180121240A1 (en) Job Scheduling Method, Device, and Distributed System
CN103793255B (zh) Startup method for a configurable multi-master-mode multi-OS-kernel real-time operating system architecture
US9063805B2 (en) Method and system for enabling access to functionality provided by resources outside of an operating system environment
US20110219373A1 (en) Virtual machine management apparatus and virtualization method for virtualization-supporting terminal platform
WO2021098257A1 (zh) Service processing method based on heterogeneous computing platform
WO2007020739A1 (ja) Scheduling method and scheduling apparatus
CN110990154B (zh) Big data application optimization method, apparatus, and storage medium
CN109740765A (zh) Method for building a machine learning system based on Amazon web servers
CN111666210A (zh) Chip verification method and apparatus
US20200272488A1 (en) Managing containers across multiple operating systems
CN111078412B (zh) Method for GPU resource management via API interception
EP3719645B1 (en) Extension application mechanisms through intra-process operation systems
WO2022166480A1 (zh) Task scheduling method, apparatus, and system
US20190310874A1 (en) Driver management method and host
EP3401784A1 (en) Multicore processing system
CN111459573A (zh) Method and apparatus for starting a smart contract execution environment
CN109522111 (zh) Invocation method and apparatus for a heterogeneous ecosystem, electronic device, and storage medium
WO2022048191A1 (en) Method and apparatus for reusable and relative indexed register resource allocation in function calls

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20890719; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 20890719; Country of ref document: EP; Kind code of ref document: A1)