CN101833438A - General data processing method based on multiple parallel - Google Patents

General data processing method based on multiple parallel

Info

Publication number
CN101833438A
CN101833438A
Authority
CN
China
Legal status
Pending
Application number
CN201010150549A
Other languages
Chinese (zh)
Inventor
许端清
杨鑫
赵磊
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN201010150549A priority Critical patent/CN101833438A/en
Publication of CN101833438A publication Critical patent/CN101833438A/en
Pending legal-status Critical Current


Abstract

The invention discloses a general data processing method based on multiple forms of parallelism, comprising the steps of: (1) dividing an application program that performs data processing into a number of execution behaviors; (2) dividing all of the execution behaviors into a number of tasks according to the basic operations the behaviors perform on data; (3) dividing the data to be processed by the application program into static data and dynamic data; and (4) running computation tasks on the GPU (Graphics Processing Unit) and logic-judgement tasks on the CPU (Central Processing Unit) until execution of the application program finishes. The method is specially optimized for complex algorithms with dynamic execution behavior and irregular data structures. During processing it manages data dynamically according to the principle of memory locality and the SIMD (Single Instruction, Multiple Data) execution mechanism, so that the application program makes the fullest possible use of the hardware's computing and memory resources.

Description

A general data processing method based on multiple forms of parallelism
Technical field
The present invention relates to the field of parallel computing, and in particular to a general data-parallel processing method based on a heterogeneous multi-core architecture.
Background technology
With the rapid development of science and technology, high-performance computing has become a research method of strategic significance. Together with traditional theoretical study and laboratory experiment, it forms a set of complementary, interrelated research methods in modern science and engineering, often called the three "pillars" of 21st-century scientific research. High-performance computers are applied mainly in scientific research and development, telecommunications, finance and government, and their contribution to national development has been substantial; to accelerate the pace of informatization, a growing number of fields are adopting high-performance computing. High-performance computing greatly speeds up calculation and shortens development and production cycles; its application broadens research capability and promotes the development of modern science and engineering. Accelerating the development of high-performance computing is of great strategic importance for promoting China's capability for independent innovation in science and technology, enhancing national competitiveness, safeguarding national security, promoting economic development, and building an innovation-oriented country.
In the evolution of high-performance computing, minicomputers based on the RISC architecture once dominated the market; with the development of the x86 architecture, x86 machines, which held an overwhelming price advantage, eventually replaced minicomputers in the form of clusters. Although part of the demand for massive computation can be met by building distributed systems, distributed systems suffer from large communication overhead, high failure rates, complex and expensive data access structures, and data security and confidentiality that are difficult to control. With the rapid improvement in the computing power of processors, particularly the GPU (Graphics Processing Unit), and their low price, high-performance computing is gradually entering the desktop (low-end) field, so that every researcher, scientist and engineer can own a personal supercomputer, solve problems faster and accelerate the pace of scientific development. A current GPU contains hundreds of processing units, delivers about 1 TFLOPS of single-precision and over 80 GFLOPS of double-precision floating-point performance, and can have 4 GB of video memory with more than 100 GB/s of bandwidth. Although the GPU was originally a processor designed specifically for graphics computation, its powerful computing performance, low energy consumption, low price and small footprint make it particularly suitable for large-scale parallel computation, and it has rapidly appeared in many non-graphics areas of high-performance computing. Nowadays many important scientific and engineering projects are trying to add GPU computing power to their codes, and software engineers eagerly expect their work to gain remarkable performance from the GPU.
However, most existing application programs cannot obtain an immediate performance improvement when ported directly to the GPU, and may even suffer performance degradation. This is mainly because these programs and their structures were not designed for the architectural characteristics of the GPU and cannot exploit its full computing power. Using concurrent application programs to perform data processing efficiently is usually complex and time-consuming work.
Summary of the invention
The invention provides a data processing method that combines data parallelism, task parallelism and pipeline parallelism, so that an application program can use the computing and storage resources of the hardware as effectively as possible while processing data.
A general data processing method based on multiple forms of parallelism, carried out in a computer with a GPU and a CPU:
(1) The application program that performs data processing is divided into a number of execution behaviors.
Each execution behavior performs at least one basic operation on data, for example a data access or a data store, or a computation.
(2) According to the basic operation type and computation type of the execution behaviors, all execution behaviors are divided into several tasks; that is, execution behaviors that treat data similarly are placed in the same computation task.
Similar execution behaviors are those with identical computing operations or similar storage operations; a similar storage operation means that the accesses to data remain within a local range of the storage area.
This division satisfies the SIMD (Single Instruction, Multiple Data) execution characteristic of the hardware and the locality characteristic of storage access.
Each task completes a specified computation and should be as small and single-purposed as possible when divided; depending on the circumstances, tasks may execute in parallel or serially.
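Step (2) above amounts to grouping execution behaviors that perform the same kind of basic operation into one task. The following is a minimal illustrative sketch of that grouping; the behavior representation and the operation-type names (`compute`, `load`, `store`) are assumptions for illustration, not taken from the patent:

```python
from collections import defaultdict

def group_into_tasks(behaviors):
    """Group execution behaviors that share the same basic operation
    type into one task, as in step (2).  Each behavior is modeled as a
    (op_type, payload) pair."""
    tasks = defaultdict(list)
    for op_type, payload in behaviors:
        tasks[op_type].append(payload)
    return dict(tasks)

behaviors = [("compute", "a*b"), ("load", "x"),
             ("compute", "c+d"), ("store", "y")]
tasks = group_into_tasks(behaviors)
# tasks now holds one task per operation type, each short and single-purposed
```

Behaviors in one group can then run under a single SIMD kernel, while different groups may run in parallel or serially, as the text describes.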
(3) The data that the application program needs to process is divided into static data and dynamic data. A storage space (storage pool) is allocated in the video memory of the computer that runs the application program, and within it separate storage areas are allocated for static data and dynamic data: one part of the space stores static data and the remaining space stores dynamic data.
Static data is data that does not change during execution of the application program, and dynamic data is the new data produced during execution. All of this information is recorded in advance in a configuration file.
(4) According to the way they process data, the tasks of step (2) are divided into computation-type tasks and logic-judgement-type tasks; computation-type tasks run on the GPU and logic-judgement-type tasks run on the CPU. The invention adopts a multiple-parallel execution mode based on pipeline parallelism, data parallelism and task parallelism until execution of the application program is complete.
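The routing rule of step (4) can be stated in a few lines. This is an illustrative sketch only; the task representation and the `kind` field are hypothetical names, not from the patent:

```python
def dispatch(task):
    """Route a task to a processor by its kind, per step (4):
    computation-type tasks go to the GPU, logic-judgement-type
    tasks go to the CPU."""
    return "GPU" if task["kind"] == "compute" else "CPU"

placement = [dispatch(t) for t in (
    {"kind": "compute"}, {"kind": "logic"}, {"kind": "compute"})]
```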
A pipeline is a producer-consumer execution pattern that suits the computation flow of most application programs; by regrouping data, a pipeline balances the workload effectively and prevents one stage from producing too much output and unbalancing the load of the whole computation. The data-parallel execution model, such as CUDA (Compute Unified Device Architecture), the mainstream programming model, can flexibly exploit the SIMD characteristic of the hardware on large homogeneous data sets and hide memory-access latency. The task execution model is an extensible execution pattern that can express the dependence relations between the units of a program and their dynamic execution behaviors. To exploit the characteristics of the heterogeneous multi-core architecture fully and use hardware resources reasonably, the invention adopts a pipeline-parallel data processing mode: the application execution pipeline, which involves a large amount of computation, runs on the GPU, while the data/task scheduling pipeline, which involves a large amount of logic judgement, runs on the CPU. The two pipelines execute asynchronously in parallel, with the data/task scheduling pipeline running ahead of the application execution pipeline. This pipeline-parallel execution scheme guarantees the independence and concurrency of the computation while avoiding expensive synchronization operations such as atomics and locks.
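The two-pipeline scheme described above (a scheduling pipeline that sorts and packs work, running ahead of an execution pipeline that consumes it) can be illustrated with an ordinary producer-consumer sketch. The bounded queue and the toy "compute task" below are illustrative stand-ins, not the patent's implementation:

```python
import queue
import threading

work = queue.Queue(maxsize=4)   # bounded queue balances the load between stages
results = []

def scheduling_pipeline(items):
    """CPU-side pipeline: regroups (here, sorts) each batch before handing
    it to the execution pipeline, then sends a sentinel."""
    for batch in items:
        work.put(sorted(batch))
    work.put(None)

def execution_pipeline():
    """Stand-in for the GPU-side pipeline: consumes batches asynchronously."""
    while (batch := work.get()) is not None:
        results.append(sum(batch))   # placeholder for a computation task

sched = threading.Thread(target=scheduling_pipeline, args=([[3, 1], [2, 4]],))
execu = threading.Thread(target=execution_pipeline)
sched.start(); execu.start()
sched.join(); execu.join()
```

Because the queue decouples the two threads, the producer can run ahead of the consumer without atomics or explicit locks, mirroring the asynchronous-parallel behavior the text claims.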
Within the application execution pipeline, the execution behaviors inside one computation task execute in data-parallel fashion, while the execution behaviors of different computation tasks execute asynchronously in task-parallel fashion.
Some programs may produce unpredictable volumes of data and unbalance the load of the whole execution pipeline. Under the data-parallel execution pattern a task is likely to produce a large amount of new data that is difficult to store and use at the same time; the new data may also rapidly consume the limited video memory, and because its generation is random and unpredictable it complicates data management. Therefore, before a task executes in step (4), the method judges whether the size of the new data the task will produce exceeds the remaining space of the current storage pool; if it does, the data the task needs to process is grouped and handled in batches. This greatly reduces the burden that algorithms involving massive data place on system storage and bandwidth, ensures that the data in video memory is exactly the data the threads need for computation, further strengthens the parallel efficiency of the threads, and improves the effective use of the hardware.
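The pre-execution check and batching just described can be sketched as a simple planning function. The byte-counting model is an illustrative simplification (the patent notes the concrete estimation method depends on the application):

```python
def plan_batches(n_items, item_bytes, free_bytes):
    """Before running a task, estimate its output size; if the whole
    input would overflow the storage pool's free space, split it into
    batches that fit.  Returns the size of each batch."""
    per_batch = max(1, free_bytes // item_bytes)  # at least one item per batch
    return [min(per_batch, n_items - i) for i in range(0, n_items, per_batch)]

batches = plan_batches(10, 4, 16)   # 10 items of 4 bytes, 16 bytes free
```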
Because unpredictable new tasks may appear while tasks are running, the invention adopts dynamic priority-based scheduling and at the same time manages the transfer of the data corresponding to the scheduled tasks.
Every task (both the tasks of step (2) and new tasks that appear at run time) is given a priority state; when a new task appears, the task with the highest priority is selected to run according to the priority states of all tasks.
The priority of a task is measured mainly by the position of its required data in the memory hierarchy, the type of processor it requires, and the size of its required data set. Task scheduling is performed in a data-driven manner, according to the types of idle processors and the tasks corresponding to the static data currently in the storage pool. Specifically, priority is assigned from high to low according to the following principles:
(1) the task's execution does not require static data;
(2) the required data is in cache;
(3) the task possesses enough similar data, or the data it produces can help raise the execution priority of other tasks, or several tasks execute similarly;
(4) the required data is in GPU video memory;
(5) the required data is in CPU main memory;
(6) the required data must be transferred from hard disk to main memory;
(7) the required data set is too small to make full use of the hardware's computing power.
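One plausible encoding of the seven principles is an ordered rule list where a lower matching rule index means a higher priority. The task fields below (`needs_static_data`, `data_in`, and so on) are hypothetical names invented for this sketch:

```python
# Ordered rules: the first rule a task matches determines its rank
# (rank 0 = highest priority, following principles (1)-(7) above).
RULES = [
    lambda t: not t.get("needs_static_data", True),   # (1) no static data needed
    lambda t: t.get("data_in") == "cache",            # (2) data in cache
    lambda t: t.get("similar_data", False),           # (3) enough similar data
    lambda t: t.get("data_in") == "gpu_memory",       # (4) data in GPU memory
    lambda t: t.get("data_in") == "cpu_memory",       # (5) data in CPU memory
    lambda t: t.get("data_in") == "disk",             # (6) data on disk
    lambda t: t.get("dataset_too_small", False),      # (7) data set too small
]

def priority_rank(task):
    for rank, rule in enumerate(RULES):
        if rule(task):
            return rank
    return len(RULES)

def pick_next(tasks):
    """Select the highest-priority (lowest-rank) pending task."""
    return min(tasks, key=priority_rank)

tasks = [
    {"needs_static_data": True, "data_in": "disk"},
    {"needs_static_data": True, "data_in": "cache"},
]
best = pick_next(tasks)   # the cache-resident task wins
```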
The method is implemented on mature heterogeneous multi-core architectures, such as the Fermi architecture recently released by NVIDIA or the Larrabee architecture announced by Intel; such architectures generally offer more than 1 TFLOPS of floating-point capability, more than 20 cores, hundreds of hardware threads and a complex memory hierarchy.
The data processing method of the invention is specially optimized for complex algorithms with dynamic execution behavior and irregular data structures. During data processing it manages data dynamically according to the principle of memory locality and the SIMD execution mechanism, so that the application program uses the computing and storage resources of the hardware as effectively as possible. With this method, high-performance parallel application programs can be developed quickly and easily, which undoubtedly speeds up program development and saves development cost.
Description of drawings
Fig. 1 shows the performance of CUDA and of the model of the invention as scene complexity increases.
Embodiment
The feasibility of the invention was verified on a PC equipped with a 4-core Intel Xeon 3.7 GHz CPU and an Nvidia GTX285 (1 GB video memory). A DLL (dynamic link library) implementing the above method was realized on the basis of the PTX instruction set; a ray tracing algorithm from graphics, which exhibits a large amount of dynamic, irregular behavior, was redesigned and rewritten according to the proposed method, and the results were compared and analysed against code written with Nvidia's CUDA programming model, as follows.
The application program is divided into a number of computation tasks. To satisfy the SIMD/SIMT operation and local memory-access characteristics of the hardware, computations with similar execution behaviors or similar memory-access behaviors are encapsulated in one computation task; each computation task is as small and single-purposed as possible, and depending on the circumstances computation tasks may execute in parallel or serially. Inside a computation task the computation proceeds in data-parallel fashion, while between computation tasks it proceeds asynchronously in task-parallel fashion. Each computation task is given a state, in order to handle execution between computation tasks that may depend on one another.
According to the characteristics of the computation in the ray tracing algorithm, six computation tasks were created in the application execution pipeline, performing ray generation, acceleration-structure traversal, primitive intersection, shading and shadow computation, while ray sorting and ray-packet assembly are performed in the data/task scheduling pipeline. These tasks all parallelize well, i.e. they offer a wide SIMD execution width, but the recursive nature of rays is likely to make SIMD utilization drop sharply as recursion proceeds. In addition, a deferred-computation technique further improves SIMD utilization: if the shading task cannot produce enough rays after its computation to form a complete ray packet, the intersection computation is deferred until a complete ray packet has formed; similarly, if the intersection task cannot produce enough rays for shading, the shading computation is also deferred.
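The deferred-computation idea above — postpone a stage until a full ray packet is available so the SIMD lanes stay busy — can be sketched with a small buffer. `PACKET_SIZE` and the ray representation are illustrative assumptions, not values from the patent:

```python
PACKET_SIZE = 4   # hypothetical SIMD packet width

class PacketBuffer:
    """Collect rays emitted by one stage; dispatch downstream work only
    when a complete packet is available (the deferred-computation idea)."""
    def __init__(self):
        self.pending = []      # rays waiting for a full packet
        self.dispatched = []   # complete packets handed to the next stage

    def emit(self, ray):
        self.pending.append(ray)
        if len(self.pending) >= PACKET_SIZE:
            self.dispatched.append(self.pending[:PACKET_SIZE])
            del self.pending[:PACKET_SIZE]

buf = PacketBuffer()
for r in range(6):   # shading emits six rays; only one full packet forms
    buf.emit(r)
```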
Data is divided into static data and dynamic data: static data does not change during execution of the application program, while dynamic data is the continuously changing new data produced during execution. At initialization a storage pool is set up; according to the concrete application, a certain space in video memory is allocated for static data and the remaining space is occupied by dynamic data. All of this information is recorded in a configuration file.
The static data required by some application programs may exceed the size of the video memory, so static data may have to be scheduled dynamically during execution; since the size of each imported block is not necessarily identical to the previous one, fragments may arise in the static and dynamic data regions. To avoid fragmentation and use video memory effectively, a bidirectional allocation method is adopted in video memory: static data is stored at the low-address end of the storage pool and dynamic data at the high-address end.
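The bidirectional allocation scheme can be sketched as a pool with two bump pointers growing toward each other, so no fragment forms between the two regions. This is a hypothetical simplification (sizes in abstract units, no data scheduling or eviction):

```python
class StoragePool:
    """Bidirectional allocator: static data grows from the low end,
    dynamic data from the high end; the free space is always the single
    contiguous gap between the two pointers."""
    def __init__(self, size):
        self.low = 0       # next free low address (static data)
        self.high = size   # one past the last free high address (dynamic data)

    def alloc_static(self, n):
        if self.low + n > self.high:
            raise MemoryError("pool exhausted")
        addr, self.low = self.low, self.low + n
        return addr

    def alloc_dynamic(self, n):
        if self.high - n < self.low:
            raise MemoryError("pool exhausted")
        self.high -= n
        return self.high
```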
As mentioned above, to exploit the heterogeneous multi-core architecture fully and use hardware resources reasonably, the invention designs a pipeline-parallel execution scheme that combines the application execution pipeline with the data/task scheduling pipeline: the application execution pipeline, which involves a large amount of computation, runs on the GPU, and the data/task scheduling pipeline, which involves a large amount of logic judgement, runs on the CPU; the two pipelines execute asynchronously in parallel, with the scheduling pipeline running ahead of the application execution pipeline. This scheme guarantees the independence and concurrency of the computation while avoiding expensive synchronization operations such as atomics and locks.
In the implementation, the data/task scheduling pipeline is designed according to the following three principles: 1. accesses to static data should be kept, as far as possible, in the fastest layer of the hardware storage hierarchy (cache, shared memory, etc.), and at the same time each data access should be postponed as long as possible until it becomes unavoidable; 2. priority is given to tasks that possess enough similar data, whose produced data can help raise the execution priority of other tasks, or whose execution is similar to that of several other tasks; 3. task scheduling is data-driven, according to the types of idle processors and the tasks corresponding to the static data currently in the storage pool.
1) A data analyser is designed to control the use of data dynamically, to solve the problem that some programs may produce unpredictable volumes of data and unbalance the load of the whole execution pipeline. Under the data-parallel execution pattern a computation task is likely to produce a large amount of new data that is difficult to store and use at the same time; the new data may also rapidly consume the limited video memory, and because its generation is random and unpredictable it complicates data management. The data analyser therefore judges, before every execution of a computation task, whether the size of the new data to be produced exceeds the currently remaining video memory (the concrete estimation method depends on the application); once the judgement shows that the remaining capacity would be exceeded, the input data is grouped and processed in batches. This greatly reduces the burden that algorithms involving massive data place on system storage and bandwidth, ensures that the data in video memory is exactly the data the threads need for computation, further strengthens the parallel efficiency of the threads, and improves the effective use of the hardware.
A data buffer is established for every computation task to manage the data the task produces or consumes on each execution. Because data production and consumption in some complex algorithms is dynamic and irregular, these data must be reorganized so that, to satisfy the locality principle and the SIMD operating characteristic, computation concentrates as far as possible on local data sets and can continue to execute effectively on the hardware. The powerful bandwidth of current hardware and the powerful logic-processing capability of the CPU make dynamic data reorganization entirely feasible.
2) A task scheduler is designed to schedule the unpredictable task execution sequence dynamically on the basis of priority, managing at the same time the transfer of the data corresponding to the scheduled tasks. On-demand scheduling is adopted: when a processor becomes available, 1. a semaphore is set to pin the scheduler; 2. the whole pending task sequence is scanned, and the task with the highest priority is selected and marked; 3. the scheduler is released.
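The three-step dispatch just described (pin the scheduler with a semaphore, scan and mark the highest-priority task, release) can be sketched as follows; the task format and field names are hypothetical:

```python
import threading

sched_lock = threading.Semaphore(1)   # 1. the semaphore that pins the scheduler

def acquire_next(pending):
    """Scan the pending sequence, mark and return the highest-priority
    unmarked task; the semaphore is released automatically on exit."""
    with sched_lock:                                        # 1. pin
        ready = [t for t in pending if not t["marked"]]
        if not ready:
            return None
        best = max(ready, key=lambda t: t["priority"])      # 2. scan
        best["marked"] = True                               #    and mark
        return best                                         # 3. release

pending = [{"priority": 1, "marked": False},
           {"priority": 5, "marked": False}]
first = acquire_next(pending)   # the priority-5 task is selected first
```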
Priority determination is the core of this scheduler. Considering the characteristics of the mixed processing resources, the priority of a task is measured mainly by the position of its required data in the memory hierarchy, the type of processor it requires and the size of its required data set. Specifically, priority is assigned from high to low according to the following principles: (1) the task's execution does not require static data; (2) the required data is in cache; (3) the task possesses enough similar data, or the data it produces can help raise the execution priority of other tasks, or several tasks execute similarly; (4) the required data is in GPU video memory; (5) the required data is in CPU main memory; (6) the required data must be transferred from hard disk to main memory; (7) the required data set is too small to make full use of the hardware's computing power.
Test scenes of different geometric complexity were selected, with Bunny, Fairy and BART Kitchen as the test model files; Fairy is a dynamic scene with two reflection passes, and the drawing resolution is 1024*1024. Each scene was tested both with the method of the invention and with the CUDA programming model; the results are shown in Table 1, and the method of the invention obtains better performance than CUDA. The pipeline-parallel mechanism uses the hardware's computing and storage resources reasonably and balances the load of the processing cores through priority-based task scheduling.
Table 1

                CUDA    The inventive method
Bunny           9.3     11.1
Fairy           4.3     5.6
BART Kitchen    3.8     5.1

Table 1: frames drawn per second for the scenes Bunny, Fairy and BART Kitchen at 1024*1024 resolution, using CUDA and using this model respectively.
To verify how well the method uses the hardware in parallel, the utilization of the scalar processors was tested; this directly reflects whether the scheduling and organization of data and tasks can exploit the parallel execution capability of the algorithm on the hardware to the greatest extent. Note that ALU occupancy was not used as the test standard, because even when thread slots are occupied the ALUs may not be fully used, owing to memory-access latency or poor SIMD efficiency. As shown in Table 2, compared with the CUDA programming model, the method of the invention uses the computing resources of the GPU more effectively.
Table 2

                CUDA    The inventive method
Bunny           90%     90%
Fairy           72%     85%
BART Kitchen    69%     80%

Table 2: comparison of GPU utilization under CUDA and under this model.
To show that the method can organize data and schedule tasks dynamically in scenes of different complexity without load imbalance, Fig. 1 shows that the CUDA programming model exhibits obvious load imbalance and poor resource utilization when scene complexity increases sharply, finally causing performance degradation, while the method of the invention maintains comparatively stable performance throughout.

Claims (6)

1. A general data processing method based on multiple forms of parallelism, characterized in that, in a computer with a GPU and a CPU:
(1) the application program that performs data processing is divided into a number of execution behaviors;
(2) according to the similarity of the execution behaviors with respect to data or computation, all execution behaviors are divided into several tasks;
(3) the data that the application program needs to process is divided into static data and dynamic data; a storage space is allocated in the video memory of the computer that runs the application program, and within this storage space separate storage areas are allocated for static data and dynamic data;
(4) execution pipelines are established on the GPU and the CPU respectively; according to the way they process data, the tasks of step (2) are divided into computation-type tasks and logic-judgement-type tasks; computation-type tasks run on the GPU and logic-judgement-type tasks run on the CPU until execution of the application program is complete, wherein in the GPU pipeline the interior of each task runs in data-parallel fashion and the tasks run with respect to one another in task-parallel fashion.
2. The general data processing method according to claim 1, characterized in that each execution behavior performs at least one basic operation or one computing operation on data.
3. The data-parallel processing method according to claim 1, characterized in that the static data is data that does not change during execution of the application program, and the dynamic data is data produced during execution of the application program.
4. The data-parallel processing method according to claim 1, characterized in that execution pipelines are established on the GPU and the CPU respectively; according to the way they process data, the tasks are divided into computation-type tasks and logic-judgement-type tasks; computation-type tasks run on the GPU and logic-judgement-type tasks run on the CPU, and the two pipelines execute in parallel.
5. The data-parallel processing method according to claim 1, characterized in that, in step (4), the execution behaviors inside one task run in data-parallel fashion, and the execution behaviors of different tasks run asynchronously in parallel.
6. The data-parallel processing method according to claim 1, characterized in that every task is given a priority state, and when a new task appears the task with the highest priority is selected to run according to the priority states of all tasks.
CN201010150549A 2010-04-19 2010-04-19 General data processing method based on multiple parallel Pending CN101833438A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201010150549A CN101833438A (en) 2010-04-19 2010-04-19 General data processing method based on multiple parallel

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201010150549A CN101833438A (en) 2010-04-19 2010-04-19 General data processing method based on multiple parallel

Publications (1)

Publication Number Publication Date
CN101833438A true CN101833438A (en) 2010-09-15

Family

ID=42717518

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201010150549A Pending CN101833438A (en) 2010-04-19 2010-04-19 General data processing method based on multiple parallel

Country Status (1)

Country Link
CN (1) CN101833438A (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102567084A (en) * 2010-12-31 2012-07-11 新奥特(北京)视频技术有限公司 Multi-task parallel scheduling mechanism
CN103197976A (en) * 2013-04-11 2013-07-10 华为技术有限公司 Method and device for processing tasks of heterogeneous system
CN104040500A (en) * 2011-11-15 2014-09-10 英特尔公司 Scheduling thread execution based on thread affinity
CN104102476A (en) * 2014-08-04 2014-10-15 浪潮(北京)电子信息产业有限公司 High-dimensional data stream canonical correlation parallel computation method and high-dimensional data stream canonical correlation parallel computation device in irregular steam
CN104331271A (en) * 2014-11-18 2015-02-04 李桦 Parallel computing method and system for CFD (Computational Fluid Dynamics)
CN104699461A (en) * 2013-12-10 2015-06-10 Arm有限公司 Configuring thread scheduling on a multi-threaded data processing apparatus
CN102567084B (en) * 2010-12-31 2016-12-14 新奥特(北京)视频技术有限公司 A multi-task parallel scheduling mechanism
CN106537863A (en) * 2013-10-17 2017-03-22 马维尔国际贸易有限公司 Processing concurrency in a network device
CN106886503A (en) * 2017-02-08 2017-06-23 无锡十月中宸科技有限公司 heterogeneous system, data processing method and device
CN106941522A (en) * 2017-03-13 2017-07-11 广州五舟科技股份有限公司 Lightweight distributed computing platform and its data processing method
CN108595211A (en) * 2018-01-05 2018-09-28 百度在线网络技术(北京)有限公司 Method and apparatus for output data
CN110334049A (en) * 2019-07-02 2019-10-15 上海联影医疗科技有限公司 Data processing method, device, computer equipment and storage medium
CN110580527A (en) * 2018-06-08 2019-12-17 上海寒武纪信息科技有限公司 method and device for generating universal machine learning model and storage medium
CN110688327A (en) * 2019-09-30 2020-01-14 百度在线网络技术(北京)有限公司 Video memory management method and device, electronic equipment and computer readable storage medium
US11726754B2 (en) 2018-06-08 2023-08-15 Shanghai Cambricon Information Technology Co., Ltd. General machine learning model, and model file generation and parsing method


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1538296A (en) * 2003-02-18 2004-10-20 Multithreaded kernel for graphics processing unit
CN101091175A (en) * 2004-09-16 2007-12-19 NVIDIA Corporation Load balancing
CN101354780A (en) * 2007-07-26 2009-01-28 LG Electronics Inc. Graphic data processing apparatus and method
CN101526934A (en) * 2009-04-21 2009-09-09 Inspur Electronic Information Industry Co., Ltd. Construction method of GPU and CPU combined processor

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
TANG MING: "Processors: Expecting a Bumper Year", Computer World (《计算机世界》), 2010-01-11 *
LEI WANG, et al.: "Task Scheduling of Parallel Processing in CPU-GPU Collaborative Environment", International Conference on Computer Science and Information Technology 2008 *
ANONYMOUS: "CPU+GPU: Hybrid Processors Improve Internal Performance and Interconnect Efficiency", New PC (《新电脑》) *
TANG MING: "Processors: Expecting a Bumper Year", Computer World (《计算机世界》) *
QIAN YUE: "Applied Research on the CUDA Programming Model of Graphics Processors", Computer & Digital Engineering (《计算机与数字工程》) *

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102567084B (en) * 2010-12-31 2016-12-14 Xin'aote (Beijing) Video Technology Co., Ltd. Multi-task parallel scheduling mechanism
CN102567084A (en) * 2010-12-31 2012-07-11 Xin'aote (Beijing) Video Technology Co., Ltd. Multi-task parallel scheduling mechanism
CN104040500A (en) * 2011-11-15 2014-09-10 Intel Corporation Scheduling thread execution based on thread affinity
CN104040500B (en) * 2011-11-15 2018-03-30 Intel Corporation Scheduling thread execution based on thread affinity
CN103197976A (en) * 2013-04-11 2013-07-10 Huawei Technologies Co., Ltd. Method and device for processing tasks of heterogeneous system
CN106537863A (en) * 2013-10-17 2017-03-22 Marvell International Trade Ltd. Processing concurrency in a network device
CN104699461A (en) * 2013-12-10 2015-06-10 Arm Limited Configuring thread scheduling on a multi-threaded data processing apparatus
CN104699461B (en) * 2013-12-10 2019-04-05 Arm Limited Configuring thread scheduling on a multi-threaded data processing apparatus
US10733012B2 (en) 2013-12-10 2020-08-04 Arm Limited Configuring thread scheduling on a multi-threaded data processing apparatus
CN104102476A (en) * 2014-08-04 2014-10-15 Inspur (Beijing) Electronic Information Industry Co., Ltd. High-dimensional data stream canonical correlation parallel computation method and device for irregular streams
CN104331271A (en) * 2014-11-18 2015-02-04 Li Hua Parallel computing method and system for CFD (Computational Fluid Dynamics)
CN106886503A (en) * 2017-02-08 2017-06-23 Wuxi Shiyue Zhongchen Technology Co., Ltd. Heterogeneous system, data processing method and device
CN106941522A (en) * 2017-03-13 2017-07-11 Guangzhou Wuzhou Technology Co., Ltd. Lightweight distributed computing platform and data processing method thereof
CN106941522B (en) * 2017-03-13 2019-12-10 Guangzhou Wuzhou Technology Co., Ltd. Lightweight distributed computing platform and data processing method thereof
CN108595211A (en) * 2018-01-05 2018-09-28 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for outputting data
CN108595211B (en) * 2018-01-05 2021-11-26 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for outputting data
CN110580527B (en) * 2018-06-08 2022-12-02 Shanghai Cambricon Information Technology Co., Ltd. Method and device for generating universal machine learning model and storage medium
CN110580527A (en) * 2018-06-08 2019-12-17 Shanghai Cambricon Information Technology Co., Ltd. Method and device for generating universal machine learning model and storage medium
US11726754B2 (en) 2018-06-08 2023-08-15 Shanghai Cambricon Information Technology Co., Ltd. General machine learning model, and model file generation and parsing method
CN110334049A (en) * 2019-07-02 2019-10-15 Shanghai United Imaging Healthcare Co., Ltd. Data processing method, device, computer equipment and storage medium
CN110688327A (en) * 2019-09-30 2020-01-14 Baidu Online Network Technology (Beijing) Co., Ltd. Video memory management method and device, electronic equipment and computer readable storage medium

Similar Documents

Publication Publication Date Title
CN101833438A (en) General data processing method based on multiple parallel
CN102902512B (en) Multi-threaded parallel processing method based on multi-thread programming and message queues
US8990827B2 (en) Optimizing data warehousing applications for GPUs using dynamic stream scheduling and dispatch of fused and split kernels
CN102981807B (en) Graphics processing unit (GPU) program optimization method based on compute unified device architecture (CUDA) parallel environment
CN103279445A (en) Computing method and supercomputing system for computing tasks
Li et al. Performance modeling in CUDA streams—A means for high-throughput data processing
CN101777007B (en) Parallel function simulation system for on-chip multi-core processor and method thereof
CN105487838A (en) Task-level parallel scheduling method and system for dynamically reconfigurable processor
CN102193830A (en) Many-core environment-oriented division mapping/reduction parallel programming model
CN101655828B (en) Design method for high efficiency super computing system based on task data flow drive
CN110297661A (en) Parallel computing method, system and medium based on AMP framework DSP operating system
CN101840329A (en) Data parallel processing method based on graph topological structure
Tan et al. Optimizing the LINPACK algorithm for large-scale PCIe-based CPU-GPU heterogeneous systems
Zhang et al. Comparison and analysis of GPGPU and parallel computing on multi-core CPU
Żurek et al. The comparison of parallel sorting algorithms implemented on different hardware platforms
Wang et al. Task scheduling of parallel processing in CPU-GPU collaborative environment
CN103810041A (en) Parallel computing method supporting dynamic scaling (compand)
Du et al. Feature-aware task scheduling on CPU-FPGA heterogeneous platforms
Chen et al. Integrated research of parallel computing: Status and future
CN111177979A (en) Fluid dynamics software GASFLOW optimization method based on OpenMP
CN112559032B (en) Many-core program reconstruction method based on circulation segment
CN102902511A (en) Parallel information processing system
Rashid A GPU accelerated parallel heuristic for the 2D knapsack problem with rectangular pieces
Zhong et al. Parallel multisets sorting using aperiodic multi-round distribution strategy on heterogeneous multi-core clusters
Maurya et al. An approach to parallel sorting using ternary search

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C53 Correction of patent of invention or patent application
CB03 Change of inventor or designer information

Inventor after: Xu Duanqing
Inventor after: Yang Xin
Inventor after: Zhao Lei
Inventor after: Fang Yingming
Inventor before: Xu Duanqing
Inventor before: Yang Xin
Inventor before: Zhao Lei

COR Change of bibliographic data

Free format text: CORRECT: INVENTOR; FROM: XU DUANQING YANG XIN ZHAO LEI TO: XU DUANQING YANG XIN ZHAO LEI FANG YINGMING

C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20100915