CN101551761A - Method for sharing stream memory of heterogeneous multi-processor - Google Patents
- Publication number: CN101551761A
- Authority
- CN
- China
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention provides a method for sharing the stream memory of a heterogeneous multiprocessor. The method comprises the following steps: an application program running on a host processor makes a first API call, which compiles one or more executable programs from source code containing local variables for a plurality of processor units equipped with stream memory; a second API call then loads the one or more executable programs onto the plurality of processor units, where a plurality of threads execute in parallel. During loading, local storage units are allocated from a processor's local storage, and a first stream storage unit is allocated from the stream memory; when a processing unit executes a plurality of threads simultaneously, the threads access the values of the variables through the storage units of the stream memory. For source programs containing stream variables, the method further comprises: a third API call that allocates a second stream storage unit for the stream variables in the stream memory; based on the second stream storage unit, the plurality of processor units access the values of the stream variables.
Description
Technical field
The present invention relates to data-parallel computing techniques, and in particular to a method by which heterogeneous multiprocessors (CPUs and GPUs) share stream memory while performing data-parallel computation.
Background technology
As GPUs have gradually been adopted as high-performance parallel computing devices, more and more application programs have been developed to perform data-parallel computation on GPUs as general-purpose computing devices. Today, such applications are designed against the proprietary interfaces and proprietary GPU devices of a particular vendor; as a result, even when a CPU and a GPU are used together in a data processing system, the load cannot be shared between them, and an application may run only on a particular vendor's GPU.
However, as more and more multi-core CPUs are used for data-parallel computation, a growing number of data processing tasks can be carried out by CPUs and GPUs together (here "CPUs" and "GPUs" abbreviate processors combining multiple CPUs or GPUs). Traditionally, GPUs and CPUs are programmed in separate programming environments, so interoperability between CPU and GPU is poor, and it is very difficult for an application to exploit CPU and GPU processing resources at the same time. A new data processing system is therefore needed to overcome these difficulties, so that applications can make full use of the various processing resources of CPUs and GPUs.
Summary of the invention
The purpose of this invention is to provide a method for sharing stream memory in a heterogeneous multiprocessor.
The object of the invention is achieved as follows. The system comprises a host processor and compute processors. An application program running on the host processor invokes an API through the host processor, loads an executable program from the host processor onto a compute processor, configures storage capacity for the compute processor, and allocates memory for the variables accessed by threads on the compute processor; a compute processor is a GPU or a CPU.
The steps are as follows: the application program running on the host processor makes a first API call, compiling one or more executable programs from source code containing local variables for a plurality of processor units equipped with stream memory. A second API call then loads the one or more executable programs onto the plurality of processor units; a plurality of threads execute in parallel, and during loading a local store unit is allocated from a processor's local storage. Also during loading, a first stream storage unit is allocated from the stream memory; a processing unit executes a plurality of threads simultaneously, and these threads access the values of variables through the storage units of the stream memory. For source programs containing stream variables, the method further comprises: a third API call that allocates a second stream storage unit for the stream variables in the stream memory; based on the second stream storage unit, the values of the stream variables are accessed from the plurality of processor units.
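The sequence of three API calls described above can be sketched in a minimal, hypothetical Python model. The names (`StreamMemory`, `ProcessorUnit`, `compile_programs`, `load_programs`, `alloc_stream_unit`) are illustrative assumptions, since the patent describes the calls but does not name its API:

```python
class StreamMemory:
    """Stream memory shared by the host processor and all processor units."""
    def __init__(self):
        self.units = {}            # stream storage unit name -> value storage

    def alloc(self, name):
        self.units[name] = {}      # allocate a new stream storage unit
        return name

class ProcessorUnit:
    """A compute processor (CPU or GPU) with its own local storage."""
    def __init__(self, uid, stream_mem):
        self.uid = uid
        self.local = {}            # local storage for local variables
        self.stream_mem = stream_mem
        self.programs = []

def compile_programs(source):                        # first API call
    """Compile one or more executables from source containing local variables."""
    return [{"code": source, "locals": ["x"]}]

def load_programs(units, executables, stream_mem):   # second API call
    """Load executables; allocate local store units and the first stream unit."""
    first_unit = stream_mem.alloc("first")
    for u in units:
        u.programs.extend(executables)
        for var in executables[0]["locals"]:
            u.local[var] = 0                         # local store unit per variable
    return first_unit

def alloc_stream_unit(stream_mem):                   # third API call
    """Allocate a second stream storage unit for stream variables."""
    return stream_mem.alloc("second")

stream_mem = StreamMemory()
units = [ProcessorUnit(i, stream_mem) for i in range(2)]
exes = compile_programs("kernel source")
first = load_programs(units, exes, stream_mem)
second = alloc_stream_unit(stream_mem)

# All processor units can now access the stream variable via the shared unit.
stream_mem.units[second]["s"] = 42
assert all(u.stream_mem.units[second]["s"] == 42 for u in units)
```

The key design point the sketch illustrates is that only the second stream storage unit is visible to every processor unit, while local store units are private to each processor.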
The beneficial effect of the present invention is that it allows an application program to make good use of CPU and GPU processing resources at the same time, improving the application's ability to process large volumes of data.
Description of drawings
Fig. 1 is a layout diagram of computing devices that perform data-parallel computation;
Fig. 2 is a schematic diagram of parallel multiprocessors executing multiple threads with shared stream memory;
Fig. 3 is a schematic diagram of the process by which API calls complete memory allocation.
Embodiment
The method of the present invention for sharing stream memory in a heterogeneous multiprocessor is explained below with reference to the accompanying drawings.
In the present invention, an application running on the host processor configures the storage capacity of the compute processors, which may be CPUs or GPUs, and allocates memory units for the variables accessed by the group of threads executing an executable program in a computation. The values of the variables accessed by this group of threads come either from the local memory of a compute processor or from the stream memory shared by the host processor and the compute processors. Allocation and configuration of memory are completed through API calls. A first API call compiles one or more executable programs from source code for a plurality of processing units equipped with stream memory. A second API call then loads these executable programs onto the plurality of processing units and executes a plurality of threads simultaneously. During loading, a local store unit is allocated from a processor's local storage; this storage unit holds the local variables of the source code. Also during loading, a first stream storage unit is allocated from the stream memory; a processing unit executes a plurality of threads simultaneously, and the threads access the values of local variables through the storage units of the stream memory. For source programs containing stream variables, the method further comprises: a third API call that allocates a second stream storage unit for the stream variables in the stream memory; based on the second stream storage unit, the stream variables can be accessed from a plurality of processor units. In the stream cache, buffer units are allocated for variables, and a buffer unit holds the value of the variable kept in the stream storage unit.
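The buffer-unit behavior of the stream cache described above can be sketched as follows. The `StreamCache` class and its `read`/`write` methods are hypothetical names, and the write-through policy is an assumption; the patent states only that a buffer unit holds the value of the variable in the stream storage unit:

```python
class StreamCache:
    """Caches stream storage unit values in per-variable buffer units."""
    def __init__(self, stream_memory):
        self.stream_memory = stream_memory  # name -> value (stream storage units)
        self.buffers = {}                   # name -> cached value (buffer units)

    def read(self, name):
        # Allocate a buffer unit on first access; serve later reads from it.
        if name not in self.buffers:
            self.buffers[name] = self.stream_memory[name]
        return self.buffers[name]

    def write(self, name, value):
        # Keep the buffer unit and the stream storage unit consistent.
        self.buffers[name] = value
        self.stream_memory[name] = value

stream_memory = {"s": 1}
cache = StreamCache(stream_memory)
assert cache.read("s") == 1   # first read fills the buffer unit
cache.write("s", 2)           # write updates buffer and stream storage unit
assert stream_memory["s"] == 2
```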
Embodiment
Fig. 1 is a layout diagram of the computing devices that carry out data-parallel processing for an application. The computing devices comprise central processors (CPUs) and graphics processors (GPUs). A host processor resides in the host processing system; it can upload and download data and check results over a network, and it connects to the heterogeneous CPUs and GPUs through a data bus. A CPU may be multi-core, and a GPU is hardware that supports graphics processing and double-precision floating-point arithmetic. A function library holds source code and executable programs; a compile layer is responsible for compiling source code; applications load executable programs into the execution layer through API calls; the execution layer manages the allocation of computing resources and the execution of processing tasks; and a compute platform layer is responsible for identifying the physical computing devices. Compiled executable programs are loaded into the execution layer through API calls; at run time, the execution layer interacts with the compile layer according to the processors' data files, compiling source code in real time to generate new executable programs. Through the compute platform layer, the execution layer assigns qualified executable programs to computing resources.
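The layered flow of Fig. 1 (function library, compile layer, execution layer, compute platform layer) can be sketched as a hypothetical Python model; all class and method names here are illustrative, since the patent describes the layers but not a concrete interface:

```python
class FunctionLibrary:
    """Holds source code and compiled executable programs."""
    def __init__(self):
        self.source = {}       # name -> source code
        self.executables = {}  # name -> compiled program

class CompileLayer:
    """Responsible for compiling source code, possibly at run time."""
    def compile(self, source):
        return {"binary": f"compiled({source})"}

class ComputePlatformLayer:
    """Identifies the physical computing devices (CPUs and GPUs)."""
    def __init__(self, devices):
        self.devices = devices
    def pick(self, required):
        return [d for d in self.devices if d["kind"] in required]

class ExecutionLayer:
    """Allocates computing resources and manages task execution."""
    def __init__(self, compiler, platform):
        self.compiler = compiler
        self.platform = platform
    def run(self, library, name, required=("cpu", "gpu")):
        # Interact with the compile layer at run time if no executable exists.
        if name not in library.executables:
            library.executables[name] = self.compiler.compile(library.source[name])
        targets = self.platform.pick(required)
        # Assign the qualified executable to each matching computing resource.
        return [(d["id"], library.executables[name]["binary"]) for d in targets]

lib = FunctionLibrary()
lib.source["kernel"] = "stream kernel"
platform = ComputePlatformLayer([{"id": 0, "kind": "cpu"}, {"id": 1, "kind": "gpu"}])
execution = ExecutionLayer(CompileLayer(), platform)
assignments = execution.run(lib, "kernel")
# Each qualified device receives the compiled executable.
```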
Fig. 2 is a schematic diagram of parallel multiprocessors executing multiple threads with shared stream memory. Here the application program loads an executable program from the host processor onto the compute processors through API calls. The executable program runs as a plurality of threads executing in parallel on a processing unit. As the figure shows, compute processor 1 runs threads 1 through M, and compute processor L runs threads 1 through N. During computation, each thread accesses the values of its local variables through its private memory; multiple threads in a computation access variable values through local shared memory; and threads across multiple processors access the values of stream variables through the storage units of the stream memory. For example, private memory 1 in compute processor 1 stores the local variable values to be processed by thread 1; the local shared memory stores the variable values that threads 1 through M need to process; and thread M of compute processor 1 and thread N of compute processor L access the values of stream variables through the stream cache. The local shared memory is also a storage unit based on the stream memory.
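The three memory tiers of Fig. 2 (per-thread private memory, per-processor local shared memory, and the stream cache shared across processors) can be sketched as follows; the class names and dictionary-based storage are illustrative assumptions:

```python
class Thread:
    def __init__(self, tid):
        self.tid = tid
        self.private = {}          # only this thread sees these values

class ComputeProcessor:
    def __init__(self, pid, stream_cache, n_threads):
        self.pid = pid
        self.local_shared = {}             # visible to all threads on this processor
        self.stream_cache = stream_cache   # visible across processors
        self.threads = [Thread(t) for t in range(n_threads)]

stream_cache = {}                          # backed by stream memory storage units
p1 = ComputeProcessor(1, stream_cache, n_threads=4)
pL = ComputeProcessor(2, stream_cache, n_threads=4)

# Thread 1 of processor 1 keeps a local variable in its private memory.
p1.threads[0].private["x"] = 10
# Threads 1..M of processor 1 share a value through local shared memory.
p1.local_shared["y"] = 20
# Thread M of processor 1 and thread N of processor L share a stream variable.
p1.stream_cache["s"] = 30
assert pL.stream_cache["s"] == 30          # visible on the other processor
assert "x" not in pL.threads[0].private    # private memory is not shared
```

The sketch makes the scoping explicit: only values placed in the stream cache cross the processor boundary, matching the figure's division of private, local shared, and stream storage.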
Fig. 3 is a schematic diagram of the process by which API calls complete memory allocation. The application program first completes the compilation of source code through an API call, generating one or more executable programs. It then calls the API again to load the executable programs onto the processing units; during loading, memory is allocated for the local variables in the executable programs, based on the processors' local storage capacity, and the first stream storage unit is allocated at the same time, for the variables accessed simultaneously by the processors' threads. Finally, a third API call allocates a second stream storage unit for the stream variables in the stream memory, so that the stream variables can be accessed by a plurality of processor units.
Claims (4)
1. share the method for stream memory in the heterogeneous multi-processor, comprise primary processor and computation processor, it is characterized in that, operate in the application program in the primary processor, based on main processor invokes API, executable program is loaded into computation processor from primary processor, and is computation processor configuration store ability, be certain variable storage allocation of the thread accesses in the computation processor, computation processor is GPU or CPU;
Step is as follows: application program operates in the primary processor calls API for the first time, for a plurality of processor units of being furnished with stream memory from the one or more executable programs of the compilation of source code that comprises local variable; For the second time call API then, remove to load one or more executable programs in a plurality of processor units, a plurality of threads of executed in parallel during loading, distribute LSU local store unit from the local storage of a processor; And when loading, from stream memory, distribute first stream storage unit, in a processing unit, carry out a plurality of threads simultaneously, these threads are based on the value of the memory unit access variable of stream memory, further comprise for the source program that comprises flow variables: call API for the third time, in stream memory, for flow variables is distributed second stream storage unit; Based on second stream storage unit, from the variate-value of a plurality of processor unit access stream variablees.
2. The method according to claim 1, characterized in that a storage unit is local storage provided on a processing unit, or stream memory; a stream storage unit is allocated by the application running on the host processor unit; the storage capacity of the stream memory does not include the support of local storage; and in the stream cache, buffer units are allocated for variables, a buffer unit holding the value of the variable kept in the stream storage unit.
3. The method according to claim 1, characterized in that the heterogeneous multiprocessor comprises a host processor, one or more processor units, and an API library; the host processor and the processor units are equipped with shared stream memory; the API library contains source code and executable programs; at least one of the processing units has local storage, and the memory allocation of the local variables in an executable program is based on the storage capacity of this local storage.
4. The method according to claim 1, characterized in that a processor unit comprises at least one CPU or one GPU.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNA2009100149388A CN101551761A (en) | 2009-04-30 | 2009-04-30 | Method for sharing stream memory of heterogeneous multi-processor |
Publications (1)
Publication Number | Publication Date |
---|---|
CN101551761A true CN101551761A (en) | 2009-10-07 |
Family
ID=41156010
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CNA2009100149388A Pending CN101551761A (en) | 2009-04-30 | 2009-04-30 | Method for sharing stream memory of heterogeneous multi-processor |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101551761A (en) |
Cited By (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102870096A (en) * | 2010-05-20 | 2013-01-09 | 苹果公司 | Subbuffer objects |
US9691346B2 (en) | 2010-05-20 | 2017-06-27 | Apple Inc. | Subbuffer objects |
CN102870096B (en) * | 2010-05-20 | 2016-01-13 | 苹果公司 | Sub-impact damper object |
CN102314670B (en) * | 2010-06-29 | 2016-04-27 | 技嘉科技股份有限公司 | There is the processing module of painting processor, operating system and disposal route |
CN102314670A (en) * | 2010-06-29 | 2012-01-11 | 技嘉科技股份有限公司 | Processing module, operating system and processing method |
CN102323917A (en) * | 2011-09-06 | 2012-01-18 | 中国人民解放军国防科学技术大学 | Shared memory based method for realizing multiprocess GPU (Graphics Processing Unit) sharing |
CN102323917B (en) * | 2011-09-06 | 2013-05-15 | 中国人民解放军国防科学技术大学 | Shared memory based method for realizing multiprocess GPU (Graphics Processing Unit) sharing |
CN102902654A (en) * | 2012-09-03 | 2013-01-30 | 东软集团股份有限公司 | Method and device for linking data among heterogeneous platforms |
US9250986B2 (en) | 2012-09-03 | 2016-02-02 | Neusoft Corporation | Method and apparatus for data linkage between heterogeneous platforms |
CN103412823A (en) * | 2013-08-07 | 2013-11-27 | 格科微电子(上海)有限公司 | Chip architecture based on ultra-wide buses and data access method of chip architecture |
WO2015018237A1 (en) * | 2013-08-07 | 2015-02-12 | 格科微电子(上海)有限公司 | Superwide bus-based chip architecture and data access method therefor |
CN103412823B (en) * | 2013-08-07 | 2017-03-01 | 格科微电子(上海)有限公司 | Chip architecture based on ultra-wide bus and its data access method |
CN103559078A (en) * | 2013-11-08 | 2014-02-05 | 华为技术有限公司 | GPU (Graphics Processing Unit) virtualization realization method as well as vertex data caching method and related device |
CN103559078B (en) * | 2013-11-08 | 2017-04-26 | 华为技术有限公司 | GPU (Graphics Processing Unit) virtualization realization method as well as vertex data caching method and related device |
CN105900065A (en) * | 2014-01-13 | 2016-08-24 | 华为技术有限公司 | Method for pattern processing |
CN104836970A (en) * | 2015-03-27 | 2015-08-12 | 北京联合大学 | Multi-projector fusion method based on GPU real-time video processing, and multi-projector fusion system based on GPU real-time video processing |
CN104836970B (en) * | 2015-03-27 | 2018-06-15 | 北京联合大学 | More projection fusion methods and system based on GPU real time video processings |
CN105427236A (en) * | 2015-12-18 | 2016-03-23 | 魅族科技(中国)有限公司 | Method and device for image rendering |
CN107180010A (en) * | 2016-03-09 | 2017-09-19 | 联发科技股份有限公司 | Heterogeneous computing system and method |
CN109471673A (en) * | 2017-09-07 | 2019-03-15 | 智微科技股份有限公司 | For carrying out the method and electronic device of hardware resource management in electronic device |
CN109471673B (en) * | 2017-09-07 | 2022-02-01 | 智微科技股份有限公司 | Method for hardware resource management in electronic device and electronic device |
WO2020134833A1 (en) * | 2018-12-29 | 2020-07-02 | 深圳云天励飞技术有限公司 | Data sharing method, device, equipment and system |
CN109921895A (en) * | 2019-02-26 | 2019-06-21 | 成都国科微电子有限公司 | A kind of calculation method and system of data hash value |
CN110704362A (en) * | 2019-09-12 | 2020-01-17 | 无锡江南计算技术研究所 | Processor array local storage hybrid management technology |
CN110704362B (en) * | 2019-09-12 | 2021-03-12 | 无锡江南计算技术研究所 | Processor array local storage hybrid management method |
CN110990151A (en) * | 2019-11-24 | 2020-04-10 | 浪潮电子信息产业股份有限公司 | Service processing method based on heterogeneous computing platform |
CN111625330A (en) * | 2020-05-18 | 2020-09-04 | 北京达佳互联信息技术有限公司 | Cross-thread task processing method and device, server and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |
Open date: 20091007 |