CN101551761A - Method for sharing stream memory of heterogeneous multi-processor - Google Patents
- Publication number: CN101551761A
- Authority
- CN
- China
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention provides a method for sharing the stream memory of a heterogeneous multiprocessor. The method comprises the following steps: an application program running on a host processor makes a first API call, which compiles one or more executable programs from source code containing local variables for a plurality of processor units equipped with stream memory; a second API call then loads the one or more executable programs onto the plurality of processor units, where a plurality of threads execute in parallel. During loading, local storage units are allocated from a processor's local storage, and a first stream storage unit is allocated from the stream memory; when a processing unit executes a plurality of threads simultaneously, the threads access the values of the variables through the storage units of the stream memory. For source programs containing stream variables, the method further comprises: a third API call that allocates a second stream storage unit for the stream variables in the stream memory; based on the second stream storage unit, the plurality of processor units access the values of the stream variables.
Description
Technical field
The present invention relates to data-parallel computing techniques, and in particular to a method by which heterogeneous multiprocessors (CPUs and GPUs) share stream memory while performing data-parallel computation.
Background technology
As GPUs have gradually been adopted as high-performance parallel computing devices, more and more application programs have been developed to perform data-parallel computation on GPUs as general-purpose computing devices. Today, such applications are designed against the proprietary interfaces and proprietary GPU devices of a particular vendor; as a result, even when a CPU and a GPU are used together in a data processing system, the load cannot be shared between them, and an application may run only on a particular vendor's GPU.
However, as more and more multi-core CPUs are used for data-parallel computation, a growing number of data processing tasks can be carried out by CPUs and GPUs together (here "CPUs" and "GPUs" abbreviate processors combining multiple CPUs or GPUs). Traditionally, GPUs and CPUs are programmed in separate programming environments, so interoperability between CPU and GPU is poor, and it is very difficult for an application to exploit CPU and GPU processing resources at the same time. A new data processing system is therefore needed to overcome these difficulties, so that applications can make full use of the various processing resources of CPUs and GPUs.
Summary of the invention
The purpose of this invention is to provide a method for sharing stream memory in a heterogeneous multiprocessor.
The object of the invention is achieved as follows. The system comprises a host processor and compute processors. An application program running on the host processor invokes an API through the host processor, loads an executable program from the host processor onto a compute processor, configures storage capacity for the compute processor, and allocates memory for the variables accessed by threads on the compute processor; a compute processor is a GPU or a CPU.
The steps are as follows: the application program running on the host processor makes a first API call, compiling one or more executable programs from source code containing local variables for a plurality of processor units equipped with stream memory. A second API call then loads the one or more executable programs onto the plurality of processor units; a plurality of threads execute in parallel, and during loading a local store unit is allocated from a processor's local storage. Also during loading, a first stream storage unit is allocated from the stream memory; a processing unit executes a plurality of threads simultaneously, and these threads access the values of variables through the storage units of the stream memory. For source programs containing stream variables, the method further comprises: a third API call that allocates a second stream storage unit for the stream variables in the stream memory; based on the second stream storage unit, the values of the stream variables are accessed from the plurality of processor units.
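The sequence of three API calls described above can be sketched in a minimal, hypothetical Python model. The names (`StreamMemory`, `ProcessorUnit`, `compile_programs`, `load_programs`, `alloc_stream_unit`) are illustrative assumptions, since the patent describes the calls but does not name its API:

```python
class StreamMemory:
    """Stream memory shared by the host processor and all processor units."""
    def __init__(self):
        self.units = {}            # stream storage unit name -> value storage

    def alloc(self, name):
        self.units[name] = {}      # allocate a new stream storage unit
        return name

class ProcessorUnit:
    """A compute processor (CPU or GPU) with its own local storage."""
    def __init__(self, uid, stream_mem):
        self.uid = uid
        self.local = {}            # local storage for local variables
        self.stream_mem = stream_mem
        self.programs = []

def compile_programs(source):                        # first API call
    """Compile one or more executables from source containing local variables."""
    return [{"code": source, "locals": ["x"]}]

def load_programs(units, executables, stream_mem):   # second API call
    """Load executables; allocate local store units and the first stream unit."""
    first_unit = stream_mem.alloc("first")
    for u in units:
        u.programs.extend(executables)
        for var in executables[0]["locals"]:
            u.local[var] = 0                         # local store unit per variable
    return first_unit

def alloc_stream_unit(stream_mem):                   # third API call
    """Allocate a second stream storage unit for stream variables."""
    return stream_mem.alloc("second")

stream_mem = StreamMemory()
units = [ProcessorUnit(i, stream_mem) for i in range(2)]
exes = compile_programs("kernel source")
first = load_programs(units, exes, stream_mem)
second = alloc_stream_unit(stream_mem)

# All processor units can now access the stream variable via the shared unit.
stream_mem.units[second]["s"] = 42
assert all(u.stream_mem.units[second]["s"] == 42 for u in units)
```

The key design point the sketch illustrates is that only the second stream storage unit is visible to every processor unit, while local store units are private to each processor.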
The beneficial effect of the present invention is that it allows an application program to make good use of CPU and GPU processing resources at the same time, improving the application's ability to process large volumes of data.
Description of drawings
Fig. 1 is a layout diagram of computing devices that perform data-parallel computation;
Fig. 2 is a schematic diagram of parallel multiprocessors executing multiple threads with shared stream memory;
Fig. 3 is a schematic diagram of the process by which API calls complete memory allocation.
Embodiment
The method of the present invention for sharing stream memory in a heterogeneous multiprocessor is explained below with reference to the accompanying drawings.
In the present invention, an application running on the host processor configures the storage capacity of the compute processors, which may be CPUs or GPUs, and allocates memory units for the variables accessed by the group of threads executing an executable program in a computation. The values of the variables accessed by this group of threads come either from the local memory of a compute processor or from the stream memory shared by the host processor and the compute processors. Allocation and configuration of memory are completed through API calls. A first API call compiles one or more executable programs from source code for a plurality of processing units equipped with stream memory. A second API call then loads these executable programs onto the plurality of processing units and executes a plurality of threads simultaneously. During loading, a local store unit is allocated from a processor's local storage; this storage unit holds the local variables of the source code. Also during loading, a first stream storage unit is allocated from the stream memory; a processing unit executes a plurality of threads simultaneously, and the threads access the values of local variables through the storage units of the stream memory. For source programs containing stream variables, the method further comprises: a third API call that allocates a second stream storage unit for the stream variables in the stream memory; based on the second stream storage unit, the stream variables can be accessed from a plurality of processor units. In the stream cache, buffer units are allocated for variables, and a buffer unit holds the value of the variable kept in the stream storage unit.
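The buffer-unit behavior of the stream cache described above can be sketched as follows. The `StreamCache` class and its `read`/`write` methods are hypothetical names, and the write-through policy is an assumption; the patent states only that a buffer unit holds the value of the variable in the stream storage unit:

```python
class StreamCache:
    """Caches stream storage unit values in per-variable buffer units."""
    def __init__(self, stream_memory):
        self.stream_memory = stream_memory  # name -> value (stream storage units)
        self.buffers = {}                   # name -> cached value (buffer units)

    def read(self, name):
        # Allocate a buffer unit on first access; serve later reads from it.
        if name not in self.buffers:
            self.buffers[name] = self.stream_memory[name]
        return self.buffers[name]

    def write(self, name, value):
        # Keep the buffer unit and the stream storage unit consistent.
        self.buffers[name] = value
        self.stream_memory[name] = value

stream_memory = {"s": 1}
cache = StreamCache(stream_memory)
assert cache.read("s") == 1   # first read fills the buffer unit
cache.write("s", 2)           # write updates buffer and stream storage unit
assert stream_memory["s"] == 2
```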
Embodiment
Fig. 1 is a layout diagram of the computing devices that carry out data-parallel processing for an application. The computing devices comprise central processors (CPUs) and graphics processors (GPUs). A host processor resides in the host processing system; it can upload and download data and check results over a network, and it connects to the heterogeneous CPUs and GPUs through a data bus. A CPU may be multi-core, and a GPU is hardware that supports graphics processing and double-precision floating-point arithmetic. A function library holds source code and executable programs; a compile layer is responsible for compiling source code; applications load executable programs into the execution layer through API calls; the execution layer manages the allocation of computing resources and the execution of processing tasks; and a compute platform layer is responsible for identifying the physical computing devices. Compiled executable programs are loaded into the execution layer through API calls; at run time, the execution layer interacts with the compile layer according to the processors' data files, compiling source code in real time to generate new executable programs. Through the compute platform layer, the execution layer assigns qualified executable programs to computing resources.
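The layered flow of Fig. 1 (function library, compile layer, execution layer, compute platform layer) can be sketched as a hypothetical Python model; all class and method names here are illustrative, since the patent describes the layers but not a concrete interface:

```python
class FunctionLibrary:
    """Holds source code and compiled executable programs."""
    def __init__(self):
        self.source = {}       # name -> source code
        self.executables = {}  # name -> compiled program

class CompileLayer:
    """Responsible for compiling source code, possibly at run time."""
    def compile(self, source):
        return {"binary": f"compiled({source})"}

class ComputePlatformLayer:
    """Identifies the physical computing devices (CPUs and GPUs)."""
    def __init__(self, devices):
        self.devices = devices
    def pick(self, required):
        return [d for d in self.devices if d["kind"] in required]

class ExecutionLayer:
    """Allocates computing resources and manages task execution."""
    def __init__(self, compiler, platform):
        self.compiler = compiler
        self.platform = platform
    def run(self, library, name, required=("cpu", "gpu")):
        # Interact with the compile layer at run time if no executable exists.
        if name not in library.executables:
            library.executables[name] = self.compiler.compile(library.source[name])
        targets = self.platform.pick(required)
        # Assign the qualified executable to each matching computing resource.
        return [(d["id"], library.executables[name]["binary"]) for d in targets]

lib = FunctionLibrary()
lib.source["kernel"] = "stream kernel"
platform = ComputePlatformLayer([{"id": 0, "kind": "cpu"}, {"id": 1, "kind": "gpu"}])
execution = ExecutionLayer(CompileLayer(), platform)
assignments = execution.run(lib, "kernel")
# Each qualified device receives the compiled executable.
```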
Fig. 2 is a schematic diagram of parallel multiprocessors executing multiple threads with shared stream memory. Here the application program loads an executable program from the host processor onto the compute processors through API calls. The executable program runs as a plurality of threads executing in parallel on a processing unit. As the figure shows, compute processor 1 runs threads 1 through M, and compute processor L runs threads 1 through N. During computation, each thread accesses the values of its local variables through its private memory; multiple threads in a computation access variable values through local shared memory; and threads across multiple processors access the values of stream variables through the storage units of the stream memory. For example, private memory 1 in compute processor 1 stores the local variable values to be processed by thread 1; the local shared memory stores the variable values that threads 1 through M need to process; and thread M of compute processor 1 and thread N of compute processor L access the values of stream variables through the stream cache. The local shared memory is also a storage unit based on the stream memory.
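The three memory tiers of Fig. 2 (per-thread private memory, per-processor local shared memory, and the stream cache shared across processors) can be sketched as follows; the class names and dictionary-based storage are illustrative assumptions:

```python
class Thread:
    def __init__(self, tid):
        self.tid = tid
        self.private = {}          # only this thread sees these values

class ComputeProcessor:
    def __init__(self, pid, stream_cache, n_threads):
        self.pid = pid
        self.local_shared = {}             # visible to all threads on this processor
        self.stream_cache = stream_cache   # visible across processors
        self.threads = [Thread(t) for t in range(n_threads)]

stream_cache = {}                          # backed by stream memory storage units
p1 = ComputeProcessor(1, stream_cache, n_threads=4)
pL = ComputeProcessor(2, stream_cache, n_threads=4)

# Thread 1 of processor 1 keeps a local variable in its private memory.
p1.threads[0].private["x"] = 10
# Threads 1..M of processor 1 share a value through local shared memory.
p1.local_shared["y"] = 20
# Thread M of processor 1 and thread N of processor L share a stream variable.
p1.stream_cache["s"] = 30
assert pL.stream_cache["s"] == 30          # visible on the other processor
assert "x" not in pL.threads[0].private    # private memory is not shared
```

The sketch makes the scoping explicit: only values placed in the stream cache cross the processor boundary, matching the figure's division of private, local shared, and stream storage.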
Fig. 3 is a schematic diagram of the process by which API calls complete memory allocation. The application program first completes the compilation of source code through an API call, generating one or more executable programs. It then calls the API again to load the executable programs onto the processing units; during loading, memory is allocated for the local variables in the executable programs, based on the processors' local storage capacity, and the first stream storage unit is allocated at the same time, for the variables accessed simultaneously by the processors' threads. Finally, a third API call allocates a second stream storage unit for the stream variables in the stream memory, so that the stream variables can be accessed by a plurality of processor units.
Claims (4)
1. share the method for stream memory in the heterogeneous multi-processor, comprise primary processor and computation processor, it is characterized in that, operate in the application program in the primary processor, based on main processor invokes API, executable program is loaded into computation processor from primary processor, and is computation processor configuration store ability, be certain variable storage allocation of the thread accesses in the computation processor, computation processor is GPU or CPU;
Step is as follows: application program operates in the primary processor calls API for the first time, for a plurality of processor units of being furnished with stream memory from the one or more executable programs of the compilation of source code that comprises local variable; For the second time call API then, remove to load one or more executable programs in a plurality of processor units, a plurality of threads of executed in parallel during loading, distribute LSU local store unit from the local storage of a processor; And when loading, from stream memory, distribute first stream storage unit, in a processing unit, carry out a plurality of threads simultaneously, these threads are based on the value of the memory unit access variable of stream memory, further comprise for the source program that comprises flow variables: call API for the third time, in stream memory, for flow variables is distributed second stream storage unit; Based on second stream storage unit, from the variate-value of a plurality of processor unit access stream variablees.
2. The method according to claim 1, characterized in that a storage unit is local storage provided on a processing unit, or stream memory; a stream storage unit is allocated by the application running on the host processor unit; the storage capacity of the stream memory does not include the support of local storage; and in the stream cache, buffer units are allocated for variables, a buffer unit holding the value of the variable kept in the stream storage unit.
3. The method according to claim 1, characterized in that the heterogeneous multiprocessor comprises a host processor, one or more processor units, and an API library; the host processor and the processor units are equipped with shared stream memory; the API library contains source code and executable programs; at least one of the processing units has local storage, and the memory allocation of the local variables in an executable program is based on the storage capacity of this local storage.
4. The method according to claim 1, characterized in that a processor unit comprises at least one CPU or one GPU.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNA2009100149388A CN101551761A (en) | 2009-04-30 | 2009-04-30 | Method for sharing stream memory of heterogeneous multi-processor |
Publications (1)
Publication Number | Publication Date |
---|---|
CN101551761A true CN101551761A (en) | 2009-10-07 |
Family
ID=41156010
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CNA2009100149388A Pending CN101551761A (en) | 2009-04-30 | 2009-04-30 | Method for sharing stream memory of heterogeneous multi-processor |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101551761A (en) |
Cited By (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102870096A (en) * | 2010-05-20 | 2013-01-09 | 苹果公司 | Subbuffer objects |
US9691346B2 (en) | 2010-05-20 | 2017-06-27 | Apple Inc. | Subbuffer objects |
CN102870096B (en) * | 2010-05-20 | 2016-01-13 | 苹果公司 | Sub-impact damper object |
CN102314670B (en) * | 2010-06-29 | 2016-04-27 | 技嘉科技股份有限公司 | There is the processing module of painting processor, operating system and disposal route |
CN102314670A (en) * | 2010-06-29 | 2012-01-11 | 技嘉科技股份有限公司 | Processing module, operating system and processing method |
CN102323917A (en) * | 2011-09-06 | 2012-01-18 | 中国人民解放军国防科学技术大学 | Shared memory based method for realizing multiprocess GPU (Graphics Processing Unit) sharing |
CN102323917B (en) * | 2011-09-06 | 2013-05-15 | 中国人民解放军国防科学技术大学 | Shared memory based method for realizing multiprocess GPU (Graphics Processing Unit) sharing |
CN102902654A (en) * | 2012-09-03 | 2013-01-30 | 东软集团股份有限公司 | Method and device for linking data among heterogeneous platforms |
US9250986B2 (en) | 2012-09-03 | 2016-02-02 | Neusoft Corporation | Method and apparatus for data linkage between heterogeneous platforms |
CN103412823A (en) * | 2013-08-07 | 2013-11-27 | 格科微电子(上海)有限公司 | Chip architecture based on ultra-wide buses and data access method of chip architecture |
WO2015018237A1 (en) * | 2013-08-07 | 2015-02-12 | 格科微电子(上海)有限公司 | Superwide bus-based chip architecture and data access method therefor |
CN103412823B (en) * | 2013-08-07 | 2017-03-01 | 格科微电子(上海)有限公司 | Chip architecture based on ultra-wide bus and its data access method |
CN103559078A (en) * | 2013-11-08 | 2014-02-05 | 华为技术有限公司 | GPU (Graphics Processing Unit) virtualization realization method as well as vertex data caching method and related device |
CN103559078B (en) * | 2013-11-08 | 2017-04-26 | 华为技术有限公司 | GPU (Graphics Processing Unit) virtualization realization method as well as vertex data caching method and related device |
CN105900065A (en) * | 2014-01-13 | 2016-08-24 | 华为技术有限公司 | Method for pattern processing |
CN104836970A (en) * | 2015-03-27 | 2015-08-12 | 北京联合大学 | Multi-projector fusion method based on GPU real-time video processing, and multi-projector fusion system based on GPU real-time video processing |
CN104836970B (en) * | 2015-03-27 | 2018-06-15 | 北京联合大学 | More projection fusion methods and system based on GPU real time video processings |
CN105427236A (en) * | 2015-12-18 | 2016-03-23 | 魅族科技(中国)有限公司 | Method and device for image rendering |
CN107180010A (en) * | 2016-03-09 | 2017-09-19 | 联发科技股份有限公司 | Heterogeneous computing system and method |
CN109471673A (en) * | 2017-09-07 | 2019-03-15 | 智微科技股份有限公司 | For carrying out the method and electronic device of hardware resource management in electronic device |
CN109471673B (en) * | 2017-09-07 | 2022-02-01 | 智微科技股份有限公司 | Method for hardware resource management in electronic device and electronic device |
WO2020134833A1 (en) * | 2018-12-29 | 2020-07-02 | 深圳云天励飞技术有限公司 | Data sharing method, device, equipment and system |
CN109921895A (en) * | 2019-02-26 | 2019-06-21 | 成都国科微电子有限公司 | A kind of calculation method and system of data hash value |
CN110704362A (en) * | 2019-09-12 | 2020-01-17 | 无锡江南计算技术研究所 | Processor array local storage hybrid management technology |
CN110704362B (en) * | 2019-09-12 | 2021-03-12 | 无锡江南计算技术研究所 | Processor array local storage hybrid management method |
CN110990151A (en) * | 2019-11-24 | 2020-04-10 | 浪潮电子信息产业股份有限公司 | Service processing method based on heterogeneous computing platform |
CN111625330A (en) * | 2020-05-18 | 2020-09-04 | 北京达佳互联信息技术有限公司 | Cross-thread task processing method and device, server and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |
Open date: 20091007 |