CN109739559A - Data processing method and equipment in CUDA heterogeneous platform - Google Patents
Abstract
Embodiments of the present invention provide a data processing method and device in a CUDA heterogeneous platform. The method includes: for several pieces of sub-data of data stored in a host memory module, according to the duration for the host to send data to the device, the kernel-function execution duration, and the duration for the device to send data back to the host, transmitting a first piece of sub-data with a first CUDA stream and a second piece of sub-data with a second CUDA stream in a cross-transfer (interleaved) fashion, thereby completing the processing of the first and second pieces of sub-data; and, for the remaining pieces of sub-data, completing the processing of the data in the same cross-transfer manner. The data processing method and device in a CUDA heterogeneous platform provided by embodiments of the present invention can increase the running speed of data processing.
Description
Technical field
Embodiments of the present invention relate to the field of ultrasonic imaging, and in particular to a data processing method and device in a CUDA heterogeneous platform.
Background
Ultrasonic imaging transmits ultrasonic waves into an examined object through an ultrasonic transducer, receives the echoes returned from the object, and uses the physical characteristics of ultrasound together with the differences in the acoustic properties of the object under examination to draw out morphological information from inside the object. Three-dimensional (3D) ultrasonic imaging can display the features of an object intuitively, providing effective assistance for physicians to make accurate diagnoses and reasonable treatment plans in clinical practice. However, 3D ultrasonic imaging must process an entire three-dimensional volume, and both its computation and its data volume far exceed those of two-dimensional ultrasonic imaging. Traditional CPU-based computation yields slow imaging speed and poor real-time performance, which has seriously hindered the development of 3D ultrasonic imaging systems.
In recent years, with ever-increasing demands on computing performance, GPU-based parallel computing has developed rapidly. General-purpose GPU computing usually adopts a CPU+GPU heterogeneous mode: the CPU, acting as the host, executes complex logic and transaction processing that are unsuitable for data-parallel computation, while the GPU, acting as the device, performs large-scale, compute-intensive data-parallel computation. Because the GPU holds a clear advantage over the CPU in processing capacity and memory bandwidth, it can compensate for the CPU's performance shortcomings and fully exploit the potential performance of the computer. In particular, the appearance of the CUDA architecture in 2007 caused general-purpose GPU computing to advance by leaps and bounds. CUDA exposes the GPU's parallel architecture through a C-like language, so any developer familiar with C can easily transition to CUDA development, greatly lowering the barrier to entry. A heterogeneous computing system composed of a multi-core CPU and a many-core GPU can therefore not only accelerate and optimize traditional techniques but also help drive innovation in high-performance computing. Almost all current mid- and high-end ultrasonic diagnostic instruments on the market use such a heterogeneous computing system to realize 3D ultrasonic imaging.
Since a heterogeneous computing system involves the cooperative work of the host and the device, program optimization must improve not only the execution efficiency of the device-side code but also the efficiency with which the host and the device cooperate. However, when performing acceleration and optimization, most researchers pay attention only to the execution efficiency of the device-side code and ignore the optimization of host-device cooperation. Therefore, in view of the shortcomings of current heterogeneous-platform 3D ultrasonic imaging data processing, finding a method that lets the host and the device work concurrently, accelerates the data exchange between them so as to optimize their cooperative efficiency, and thus improves the running efficiency of the whole system has become a problem of wide concern in the industry.
Summary of the invention
In view of the above problems in the prior art, embodiments of the present invention provide a data processing method and device in a CUDA heterogeneous platform.
In a first aspect, embodiments of the present invention provide a data processing method in a CUDA heterogeneous platform, comprising: for several pieces of sub-data of data stored in a host memory module, according to the duration for the host to send data to the device, the kernel-function execution duration, and the duration for the device to send data back to the host, transmitting a first piece of sub-data with a first CUDA stream and a second piece of sub-data with a second CUDA stream in a cross-transfer fashion, thereby completing the processing of the first and second pieces of sub-data; and, for the remaining pieces of sub-data, completing the processing of the data in the same cross-transfer manner; wherein the data are divided into the several pieces of sub-data.
Further, the several pieces of sub-data are stored in page-locked (pinned) memory of the host memory module.
Further, transmitting the first piece of sub-data with the first CUDA stream and the second piece of sub-data with the second CUDA stream in a cross-transfer fashion, according to the three durations above, comprises: comparing the duration for the host to send data to the device with the duration for the device to send data to the host to obtain the smaller of the two; and, if the kernel-function execution duration is less than that smaller value, transmitting the first piece of sub-data to the device in the first CUDA stream; while the second CUDA stream transmits the second piece of sub-data to the device, having the device execute the kernel function on the first piece of sub-data; while the first CUDA stream returns the processed first piece of sub-data to the host, having the device execute the kernel function on the second piece of sub-data; and finally returning the processed second piece of sub-data to the host in the second CUDA stream.
Further, transmitting the first piece of sub-data with the first CUDA stream and the second piece of sub-data with the second CUDA stream in a cross-transfer fashion, according to the three durations above, comprises: if the kernel-function execution duration is greater than or equal to the duration for the host to send data to the device and less than or equal to the duration for the device to send data to the host, transmitting the first piece of sub-data to the device in the first CUDA stream; while the device executes the kernel function on the first piece of sub-data, transmitting the second piece of sub-data to the device in the second CUDA stream; while the first CUDA stream returns the processed first piece of sub-data to the host, having the device execute the kernel function on the second piece of sub-data; and finally returning the processed second piece of sub-data to the host in the second CUDA stream.
Further, transmitting the first piece of sub-data with the first CUDA stream and the second piece of sub-data with the second CUDA stream in a cross-transfer fashion, according to the three durations above, comprises: if the kernel-function execution duration is greater than or equal to the duration for the device to send data to the host and less than or equal to the duration for the host to send data to the device, transmitting the first piece of sub-data to the device in the first CUDA stream; while the second CUDA stream transmits the second piece of sub-data to the device, having the device execute the kernel function on the first piece of sub-data; while the device executes the kernel function on the second piece of sub-data, returning the processed first piece of sub-data to the host in the first CUDA stream; and finally returning the processed second piece of sub-data to the host in the second CUDA stream.
Further, transmitting the first piece of sub-data with the first CUDA stream and the second piece of sub-data with the second CUDA stream in a cross-transfer fashion, according to the three durations above, comprises: comparing the duration for the host to send data to the device with the duration for the device to send data to the host to obtain the larger of the two; and, if the kernel-function execution duration is greater than that larger value, transmitting the first piece of sub-data to the device in the first CUDA stream; while the device executes the kernel function on the first piece of sub-data, transmitting the second piece of sub-data to the device in the second CUDA stream; while the device executes the kernel function on the second piece of sub-data, returning the processed first piece of sub-data to the host in the first CUDA stream; and finally returning the processed second piece of sub-data to the host in the second CUDA stream.
Further, the data processing method in the CUDA heterogeneous platform further includes: copying the several pieces of sub-data from the page-locked memory to a device memory module.
In a second aspect, embodiments of the present invention provide a data processing device in a CUDA heterogeneous platform, comprising:
a sub-data processing module, configured to, for several pieces of sub-data of data stored in a host memory module, according to the duration for the host to send data to the device, the kernel-function execution duration, and the duration for the device to send data back to the host, transmit a first piece of sub-data with a first CUDA stream and a second piece of sub-data with a second CUDA stream in a cross-transfer fashion, completing the processing of the first and second pieces of sub-data; and
a whole-data processing module, configured to, for the remaining pieces of sub-data, complete the processing of the data in the same cross-transfer manner;
wherein the data are divided into the several pieces of sub-data.
In a third aspect, embodiments of the present invention provide an electronic apparatus, comprising:
at least one host; and
at least one memory communicatively connected to the host, and at least one device, wherein:
the memory stores program instructions executable by the host, and by calling these program instructions the host can execute the data processing method in a CUDA heterogeneous platform provided by any possible implementation of the first aspect.
In a fourth aspect, embodiments of the present invention provide a non-transitory computer-readable storage medium storing computer instructions that cause a computer to execute the data processing method in a CUDA heterogeneous platform provided by any possible implementation of the first aspect.
The data processing method and device in a CUDA heterogeneous platform provided by embodiments of the present invention use CUDA-stream asynchronous processing to realize parallelism at two levels: concurrent work between the host and the device, and overlap of data transfers with kernel computation on the device. They can thus increase the running speed of data processing.
Description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of the data processing method in a CUDA heterogeneous platform provided by an embodiment of the present invention;
Fig. 2 is a schematic flow diagram of dual-CUDA-stream data transfer and kernel-function processing provided by an embodiment of the present invention;
Fig. 3 is a schematic flow diagram of another dual-CUDA-stream data transfer and kernel-function processing provided by an embodiment of the present invention;
Fig. 4 is a schematic flow diagram of another dual-CUDA-stream data transfer and kernel-function processing provided by an embodiment of the present invention;
Fig. 5 is a schematic flow diagram of another dual-CUDA-stream data transfer and kernel-function processing provided by an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of the data processing device in a CUDA heterogeneous platform provided by an embodiment of the present invention;
Fig. 7 is a schematic diagram of the physical structure of an electronic apparatus provided by an embodiment of the present invention.
Specific embodiment
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only a part, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on these embodiments without creative work fall within the protection scope of the present invention. In addition, the technical features of the individual embodiments provided by the present invention may be combined with one another arbitrarily to form feasible technical solutions, provided such a combination can be realized by those of ordinary skill in the art; when a combination of technical solutions is contradictory or cannot be realized, it shall be deemed that such a combination does not exist and does not fall within the claimed protection scope of the present invention.
Embodiments of the present invention provide a data processing method in a CUDA heterogeneous platform. Referring to Fig. 1, the method includes:
101. For several pieces of sub-data of data stored in a host memory module, according to the duration for the host to send data to the device, the kernel-function execution duration, and the duration for the device to send data back to the host, transmit a first piece of sub-data with a first CUDA stream and a second piece of sub-data with a second CUDA stream in a cross-transfer fashion, completing the processing of the first and second pieces of sub-data.
102. For the remaining pieces of sub-data, complete the processing of the data in the same cross-transfer manner.
Here, the data are divided into the several pieces of sub-data.
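Steps 101 and 102 can be sketched in plain Python (not CUDA code): the data set is split into sub-data chunks, and chunks are then handled in pairs, one chunk per stream, mirroring the cross transfer. The helper names `split_into_subdata` and `cross_transfer_process`, the chunk size, and the doubling "kernel" are illustrative assumptions, not part of the invention.

```python
# Plain-Python model of steps 101/102: split the data into sub-data
# pieces, then process them two at a time (chunk 2k on stream 1,
# chunk 2k+1 on stream 2) so transfers and kernels could be interleaved.

def split_into_subdata(data, chunk_size):
    """Divide the full data set into sub-data pieces ('the data are
    divided into the several pieces of sub-data')."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def cross_transfer_process(data, chunk_size, kernel):
    """Apply the kernel to sub-data pairs; a leftover odd chunk is
    handled on the first stream alone."""
    chunks = split_into_subdata(data, chunk_size)
    result = []
    for pair_start in range(0, len(chunks), 2):
        pair = chunks[pair_start:pair_start + 2]   # one chunk per stream
        # On real hardware the two chunks' copies and kernels overlap;
        # here we simply apply the kernel to each chunk in order.
        result.extend(kernel(c) for c in pair)
    return [x for chunk in result for x in chunk]

volume = list(range(10))                       # stand-in for 3D volume data
processed = cross_transfer_process(volume, 3, lambda c: [2 * x for x in c])
```

The model only shows the bookkeeping (chunking and pairwise scheduling); the actual overlap of transfers and kernels is what the CUDA streams provide on hardware.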
Building on the above embodiment, in the data processing method in a CUDA heterogeneous platform provided by this embodiment of the present invention, the several pieces of sub-data are stored in page-locked memory of the host memory module.
Building on the above embodiments, in the method provided by this embodiment of the present invention, performing the cross transfer according to the three durations includes: the host-to-device transfer duration is compared with the device-to-host transfer duration to obtain the smaller value. If the kernel-function execution duration is less than this smaller value, the first CUDA stream transmits the first piece of sub-data to the device; while the second CUDA stream transmits the second piece of sub-data to the device, the device executes the kernel function on the first piece of sub-data; while the first CUDA stream returns the processed first piece of sub-data to the host, the device executes the kernel function on the second piece of sub-data; and finally the second CUDA stream returns the processed second piece of sub-data to the host. See Fig. 2, which shows: segment 201, the first CUDA stream transmits the first piece of sub-data to the device; segment 202, the device executes the kernel function on the first piece of sub-data; segment 203, the first CUDA stream returns the processed first piece of sub-data to the host; segment 204, the second CUDA stream transmits the second piece of sub-data to the device; segment 205, the device executes the kernel function on the second piece of sub-data; segment 206, the second CUDA stream returns the processed second piece of sub-data to the host; T1, the duration for the host to send data to the device; and T2, the duration for the device to send data to the host. The specific transfer process is as described above and is not repeated here. In this case the kernel function is fairly simple, or has already been heavily optimized, so its execution duration is already less than the data-transfer duration; data transfer therefore becomes the bottleneck of overall program performance, as shown in Fig. 2. It can be seen from Fig. 2 that, because the kernel-execution engine and the memory-copy engine work simultaneously, kernel execution overlaps with data exchange, but the two copies do not overlap each other. Since the kernel function consumes less time than the memory copies, its execution duration is completely hidden by the data-exchange duration. The final total duration is: T = 2T1 + 2T2.
Building on the above embodiments, in the method provided by this embodiment of the present invention, performing the cross transfer according to the three durations includes: if the kernel-function execution duration is greater than or equal to the host-to-device transfer duration and less than or equal to the device-to-host transfer duration, the first CUDA stream transmits the first piece of sub-data to the device; while the device executes the kernel function on the first piece of sub-data, the second CUDA stream transmits the second piece of sub-data to the device; while the first CUDA stream returns the processed first piece of sub-data to the host, the device executes the kernel function on the second piece of sub-data; and finally the second CUDA stream returns the processed second piece of sub-data to the host. See Fig. 3, which shows: segment 301, the first CUDA stream transmits the first piece of sub-data to the device; segment 302, the device executes the kernel function on the first piece of sub-data; segment 303, the first CUDA stream returns the processed first piece of sub-data to the host; segment 304, the second CUDA stream transmits the second piece of sub-data to the device; segment 305, the device executes the kernel function on the second piece of sub-data; segment 306, the second CUDA stream returns the processed second piece of sub-data to the host; T1, the host-to-device transfer duration; T2, the device-to-host transfer duration; and Tk, the kernel-function execution duration. The specific transfer process is as described above and is not repeated here. This case corresponds to transferring a small amount of data from the host to the device while the data volume grows substantially after the kernel computation: the time needed to transfer the results back even exceeds the kernel-execution time, so the data-transfer time cannot completely hide the kernel-execution time, as shown in Fig. 3. It can be seen from Fig. 3 that segment 302 of the first CUDA stream is hidden by segment 304 of the second CUDA stream, and segment 303 of the first CUDA stream hides segment 305 of the second CUDA stream. The final total duration is: T = T1 + Tk + 2T2.
Building on the above embodiments, in the method provided by this embodiment of the present invention, performing the cross transfer according to the three durations includes: if the kernel-function execution duration is greater than or equal to the device-to-host transfer duration and less than or equal to the host-to-device transfer duration, the first CUDA stream transmits the first piece of sub-data to the device; while the second CUDA stream transmits the second piece of sub-data to the device, the device executes the kernel function on the first piece of sub-data; while the device executes the kernel function on the second piece of sub-data, the first CUDA stream returns the processed first piece of sub-data to the host; and finally the second CUDA stream returns the processed second piece of sub-data to the host. See Fig. 4, which shows: segment 401, the first CUDA stream transmits the first piece of sub-data to the device; segment 402, the device executes the kernel function on the first piece of sub-data; segment 403, the first CUDA stream returns the processed first piece of sub-data to the host; segment 404, the second CUDA stream transmits the second piece of sub-data to the device; segment 405, the device executes the kernel function on the second piece of sub-data; segment 406, the second CUDA stream returns the processed second piece of sub-data to the host; T1, the host-to-device transfer duration; T2, the device-to-host transfer duration; and Tk, the kernel-function execution duration. The specific transfer process is as described above and is not repeated here. This case corresponds to a large input data volume from the host to the device, so the transfer time exceeds the kernel-execution time, but the kernel computation returns a smaller data volume to the host, as shown in Fig. 4. Here, segment 402 of the first CUDA stream is hidden by segment 404 of the second CUDA stream, and segment 405 of the second CUDA stream covers segment 403 of the first CUDA stream. The final total duration is: T = 2T1 + Tk + T2.
Building on the above embodiments, in the method provided by this embodiment of the present invention, performing the cross transfer according to the three durations includes: the host-to-device transfer duration is compared with the device-to-host transfer duration to obtain the larger value. If the kernel-function execution duration is greater than this larger value, the first CUDA stream transmits the first piece of sub-data to the device; while the device executes the kernel function on the first piece of sub-data, the second CUDA stream transmits the second piece of sub-data to the device; while the device executes the kernel function on the second piece of sub-data, the first CUDA stream returns the processed first piece of sub-data to the host; and finally the second CUDA stream returns the processed second piece of sub-data to the host. See Fig. 5, which shows: segment 501, the first CUDA stream transmits the first piece of sub-data to the device; segment 502, the device executes the kernel function on the first piece of sub-data; segment 503, the first CUDA stream returns the processed first piece of sub-data to the host; segment 504, the second CUDA stream transmits the second piece of sub-data to the device; segment 505, the device executes the kernel function on the second piece of sub-data; segment 506, the second CUDA stream returns the processed second piece of sub-data to the host; T1, the host-to-device transfer duration; T2, the device-to-host transfer duration; and Tk, the kernel-function execution duration. The specific transfer process is as described above and is not repeated here. In this case the kernel-execution time is longer than both transfer times. This is the most common situation; most programs behave this way, and the execution is as shown in Fig. 5. The final total duration is: T = T1 + 2Tk + T2.
Building on the above embodiments, the execution-time functions of a program using two CUDA streams are summarized. In every case, CUDA streams hide the transfer time of part of the data and can thus increase the running speed of the whole system. Since all operations in the present invention are asynchronous, the CPU can perform a small amount of other computation while the device transfers data and executes kernels, so this CPU time is completely hidden by the device's computation time, further improving the system's running speed.
Here, T1 is the duration for the host to send data to the device, T2 is the duration for the device to send data to the host, and Tk is the kernel-function execution duration.
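The four totals above can be reproduced by a small Python timing model under the assumptions implied by Figs. 2 through 5: one copy engine shared by both transfer directions (so the two copies of Fig. 2 cannot overlap), one kernel engine, and the breadth-first issue order H2D1, H2D2, K1, K2, D2H1, D2H2. The function name `total_duration` and the numeric values are illustrative, not from the patent.

```python
# Minimal timing model for the two-stream pipeline of Figs. 2-5.
# Assumes one shared copy engine and one kernel engine; each operation
# starts when its engine is free and its dependency has finished.

def total_duration(t1, t2, tk):
    copy_free = 0.0    # time at which the copy engine becomes free
    kern_free = 0.0    # time at which the kernel engine becomes free
    h2d_done, k_done = [], []
    for _ in range(2):                      # H2D copies, one per stream
        copy_free += t1
        h2d_done.append(copy_free)
    for s in range(2):                      # each kernel waits for its H2D
        start = max(kern_free, h2d_done[s])
        kern_free = start + tk
        k_done.append(kern_free)
    for s in range(2):                      # each D2H waits for its kernel
        copy_free = max(copy_free, k_done[s]) + t2
    return copy_free

# The four cases of the embodiments:
assert total_duration(4, 5, 2) == 2*4 + 2*5          # Tk < min(T1, T2)
assert total_duration(3, 6, 4) == 3 + 4 + 2*6        # T1 <= Tk <= T2
assert total_duration(6, 3, 4) == 2*6 + 4 + 3        # T2 <= Tk <= T1
assert total_duration(3, 4, 7) == 3 + 2*7 + 4        # Tk > max(T1, T2)
```

Each assertion matches one of the closed-form totals T = 2T1 + 2T2, T1 + Tk + 2T2, 2T1 + Tk + T2, and T1 + 2Tk + T2 given in the corresponding embodiments.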
Building on the above embodiments, the data processing method in a CUDA heterogeneous platform provided by this embodiment of the present invention further includes: copying the several pieces of sub-data from the page-locked memory to the device memory module.
The data processing method in a CUDA heterogeneous platform provided by embodiments of the present invention uses CUDA-stream asynchronous processing to realize parallelism at two levels: concurrent work between the host and the device, and overlap of data transfers with kernel computation on the device. It can thus increase the running speed of data processing.
It should be noted that the data processing method in a CUDA heterogeneous platform provided by the embodiments of the present invention can be applied to volume-data exchange and processing in a heterogeneous-platform 3D ultrasonic imaging system. This further embodiment merely applies the technical essence of the solutions provided by the embodiments of the present invention and does not limit the protection scope of the present invention; all technical solutions conforming to the spirit of the present invention fall within the protection scope of this patent. The specific steps are as follows:
Step 1: The probe acquires one complete three-dimensional volume of data. This step follows conventional methods and is not repeated here.
Step 2: Page-locked memory is allocated in the host memory module to hold the three-dimensional data. When page-locked memory is used, the device knows the physical address of the host memory, so the CUDA driver can transfer data between host and device by direct memory access (DMA) without host intervention during the copy. With pageable memory, by contrast, the host may move the pageable data, which would disrupt the DMA operation; therefore, when copying from pageable memory, the CUDA driver still transfers the data to the device by DMA, but the copy is executed twice: the data are first copied from pageable memory into a temporary page-locked staging buffer, and then copied from that page-locked buffer to the device memory module. Every copy from pageable memory is thus limited by the lower of the PCIe transfer speed and the system front-side-bus speed. As a result, when transferring data between the host memory module and the device memory module, this difference makes page-locked host memory deliver roughly twice the performance of standard pageable memory. Because CUDA streams require high-speed data transfer, the present invention allocates page-locked memory for all data to be transferred through CUDA streams.
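The double-copy penalty described above can be put into a rough arithmetic sketch. The bandwidth figures and the helper names (`pageable_copy_time`, `pinned_copy_time`) are assumptions for illustration, not measured values; the point is only that when the host-RAM staging copy and the PCIe DMA run at comparable speeds, the pageable path takes about twice as long.

```python
# Back-of-the-envelope model of the staging copy: a pageable transfer
# is (host-RAM copy into a page-locked staging buffer) + (DMA over
# PCIe), while a pinned transfer is the DMA pass only.

def transfer_time(nbytes, bandwidth_gbs):
    return nbytes / (bandwidth_gbs * 1e9)      # seconds

def pageable_copy_time(nbytes, ram_bw_gbs, pcie_bw_gbs):
    # two passes: stage into the pinned buffer, then DMA to the device
    return transfer_time(nbytes, ram_bw_gbs) + transfer_time(nbytes, pcie_bw_gbs)

def pinned_copy_time(nbytes, pcie_bw_gbs):
    return transfer_time(nbytes, pcie_bw_gbs)  # single DMA pass

size = 256 * 1024 * 1024                       # e.g. a 256 MB sub-volume
slow = pageable_copy_time(size, ram_bw_gbs=6.0, pcie_bw_gbs=6.0)
fast = pinned_copy_time(size, pcie_bw_gbs=6.0)
speedup = slow / fast                          # ~2x with comparable bandwidths
```

This matches the roughly 2x advantage of page-locked memory stated in the text; real speedups depend on the actual RAM and PCIe bandwidths of the platform.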
Step 3: using CUDA streams, transfer the three-dimensional data from the host memory module to the device memory module through the connection module. For ease of understanding, the description creates 2 CUDA streams, referred to as the first CUDA stream and the second CUDA stream; in practice, more CUDA streams can be created as needed. The framework computing module divides the three-dimensional ultrasonic volume data collected by the probe into several sub-data blocks of suitable size and stores them in the page-locked memory of the host memory module. In this way, while the first CUDA stream executes the kernel function, the second CUDA stream copies a sub-data block from the host memory module to the device memory module; then, while the first CUDA stream copies its result back to the host, the second CUDA stream begins executing the kernel function. If all operations of one stream are enqueued at once, they can inadvertently block the copy operations or kernel execution of the other stream; therefore, the present invention enqueues operations into the streams breadth-first rather than depth-first. That is, instead of first adding all three operations of the first CUDA stream (transferring a sub-data block from the host memory module to the device memory module, executing the kernel function, and transferring the result from the device memory module back to the host memory module) and then adding all three operations of the second CUDA stream, the operations of the two streams are added in interleaved fashion. First, a host-to-device transfer of a sub-data block is added to the first CUDA stream, then a host-to-device transfer is added to the second CUDA stream. Next, the kernel call is added to the first CUDA stream, then the same operation to the second CUDA stream. Finally, the device-to-host transfer of the result is added to the first CUDA stream, then the same operation to the second CUDA stream. This cycle repeats until all data collected by the probe has been transferred to the device, computed, and returned to the host.
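The breadth-first enqueue order of Step 3 can be sketched as follows. This is an illustrative reconstruction, not code from the patent; the kernel body, `chunkElems`, and the buffer names are assumptions (the device buffers must hold two chunks, one slot per stream, and the host buffers must be page-locked):

```cuda
#include <cuda_runtime.h>

__global__ void kernel(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;  // placeholder for the real computation
}

// Breadth-first scheduling: each stage (H2D copy, kernel, D2H copy) is
// enqueued on stream 0 and then stream 1, instead of queueing one
// stream's entire pipeline first. This lets stream 1's copy overlap
// stream 0's kernel, and stream 0's copy-back overlap stream 1's kernel.
void processChunks(float *h_in, float *h_out, float *d_in, float *d_out,
                   int nChunks, int chunkElems) {
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);
    size_t bytes = chunkElems * sizeof(float);

    for (int c = 0; c < nChunks; c += 2) {
        for (int k = 0; k < 2 && c + k < nChunks; ++k)  // H2D, both streams
            cudaMemcpyAsync(d_in + k * chunkElems,
                            h_in + (c + k) * (size_t)chunkElems,
                            bytes, cudaMemcpyHostToDevice, s[k]);
        for (int k = 0; k < 2 && c + k < nChunks; ++k)  // kernel, both streams
            kernel<<<(chunkElems + 255) / 256, 256, 0, s[k]>>>(
                d_in + k * chunkElems, d_out + k * chunkElems, chunkElems);
        for (int k = 0; k < 2 && c + k < nChunks; ++k)  // D2H, both streams
            cudaMemcpyAsync(h_out + (c + k) * (size_t)chunkElems,
                            d_out + k * chunkElems,
                            bytes, cudaMemcpyDeviceToHost, s[k]);
    }
    cudaStreamSynchronize(s[0]);
    cudaStreamSynchronize(s[1]);
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
}
```

Reusing device slot `k` across loop iterations is safe because operations within a single stream execute in order: the next H2D on stream `k` cannot begin until that stream's previous D2H has completed.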
Step 4: the device computing module reads the data from the device memory module and performs large-scale parallel computation. This step is the same as the conventional method and is not repeated here.
Step 5: the device computing module stores the computed result in the device memory module. This step follows conventional methods and is not repeated here.
Step 6: using CUDA streams, the volume data is transferred back from the device memory module to the host memory module through the connection module. This step can be regarded as the inverse of Step 3 and uses exactly the same technique, so it is likewise not repeated.
Step 7: the framework computing module displays the computed result through the display module. This step follows conventional methods and is not repeated here.
The carrier of each embodiment of the present invention is a programmed process executed by a device with host functionality. Therefore, in engineering practice, the technical solutions and functions of the embodiments can be packaged into various modules. On this basis, and building on the above embodiments, the embodiments of the present invention provide a data processing device in a CUDA heterogeneous platform, which is used to execute the data processing method in a CUDA heterogeneous platform of the above method embodiments. Referring to Fig. 6, the device includes:
a sub-data processing module 601, configured to, for several sub-data blocks of the data stored in the host memory module, perform a cross transfer of a first sub-data block of the several sub-data blocks through a first CUDA stream and of a second sub-data block of the several sub-data blocks through a second CUDA stream, according to the duration for the host to send data to the device, the kernel execution duration, and the duration for the device to send data to the host, thereby completing the processing of the first sub-data block and the second sub-data block;
a complete-data processing module 602, configured to complete the processing of the data for the remaining sub-data blocks of the several sub-data blocks in the same cross-transfer manner;
wherein the data is divided into the several sub-data blocks.
The data processing device in a CUDA heterogeneous platform provided by the embodiments of the present invention uses the sub-data processing module and the complete-data processing module to realize, through the asynchronous processing mode of CUDA streams, parallel data processing on two levels: data transfer working concurrently between the host and the device, and data transfer within the device working concurrently with kernel computation. This can increase the running speed of data processing.
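The per-stage durations that the sub-data processing module compares (host-to-device copy time, kernel execution time, device-to-host copy time) can be measured with CUDA events. The following is an illustrative sketch, not part of the original disclosure; `op` stands for any one of the three stages:

```cuda
#include <cuda_runtime.h>

// Measure the elapsed time of one asynchronous operation on a stream
// using CUDA events. The events are recorded on the same stream as the
// operation, so the measured interval covers exactly that stage.
float timeStage(cudaStream_t stream, void (*op)(cudaStream_t)) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, stream);
    op(stream);                  // e.g. issue cudaMemcpyAsync or a kernel
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);  // block until the stage has finished
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;                   // duration in milliseconds
}
```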
The method of the embodiments of the present invention relies on an electronic device, so a brief introduction to the relevant electronic device is necessary. For this purpose, the embodiments of the present invention provide an electronic device. As shown in Fig. 7, the electronic device includes: at least one host (Host) 701, at least one memory (Memory) 702, at least one device (Device) 705, a communication interface (Communications Interface) 704, and a communication bus (Communications Bus) 703. The at least one host 701, the at least one device 705, the communication interface 704, and the at least one memory 702 communicate with each other through the communication bus 703. The at least one host 701 and the at least one device 705 can transmit data to each other for data interaction processing, and can also call logical instructions in the at least one memory 702 to execute the following method: for several sub-data blocks of the data stored in the host memory module, according to the duration for the host to send data to the device, the kernel execution duration, and the duration for the device to send data to the host, performing a cross transfer of a first sub-data block of the several sub-data blocks through a first CUDA stream and of a second sub-data block of the several sub-data blocks through a second CUDA stream, thereby completing the processing of the first sub-data block and the second sub-data block; for the remaining sub-data blocks of the several sub-data blocks, completing the processing of the data in the same cross-transfer manner; wherein the data is divided into the several sub-data blocks.
In addition, the logical instructions in the above at least one memory 702 can be implemented in the form of software functional units and, when sold or used as an independent product, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence the part that contributes over the prior art, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods of the embodiments of the present invention. For example: for several sub-data blocks of the data stored in the host memory module, according to the duration for the host to send data to the device, the kernel execution duration, and the duration for the device to send data to the host, performing a cross transfer of a first sub-data block of the several sub-data blocks through a first CUDA stream and of a second sub-data block of the several sub-data blocks through a second CUDA stream, thereby completing the processing of the first sub-data block and the second sub-data block; for the remaining sub-data blocks of the several sub-data blocks, completing the processing of the data in the same cross-transfer manner; wherein the data is divided into the several sub-data blocks. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separated, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without creative labor.
Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus a necessary general hardware platform, and certainly also by hardware. Based on this understanding, the above technical solution, in essence the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the method described in each embodiment or in certain parts of an embodiment.
Finally, it should be noted that the above embodiments are merely illustrative of the technical solutions of the present invention and are not restrictive. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some of the technical features can be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. A data processing method in a CUDA heterogeneous platform, characterized by comprising:
for several sub-data blocks of data stored in a host memory module, according to the duration for the host to send data to the device, the kernel execution duration, and the duration for the device to send data to the host, performing a cross transfer of a first sub-data block of the several sub-data blocks through a first CUDA stream and of a second sub-data block of the several sub-data blocks through a second CUDA stream, thereby completing the processing of the first sub-data block and the second sub-data block;
for the remaining sub-data blocks of the several sub-data blocks, completing the processing of the data in the same cross-transfer manner;
wherein the data is divided into the several sub-data blocks.
2. The data processing method in a CUDA heterogeneous platform according to claim 1, characterized in that the several sub-data blocks are stored in page-locked memory of the host memory module.
3. The data processing method in a CUDA heterogeneous platform according to claim 1, characterized in that performing the cross transfer of the first sub-data block through the first CUDA stream and of the second sub-data block through the second CUDA stream, according to the duration for the host to send data to the device, the kernel execution duration, and the duration for the device to send data to the host, thereby completing the processing of the first sub-data block and the second sub-data block, comprises:
comparing the duration for the host to send data to the device with the duration for the device to send data to the host to obtain the smaller duration; if the kernel execution duration is less than the smaller duration, the first CUDA stream transfers the first sub-data block to the device; while the second CUDA stream transfers the second sub-data block to the device, the device executes the kernel function on the first sub-data block; while the first CUDA stream returns the kernel-processed first sub-data block to the host, the device executes the kernel function on the second sub-data block; finally, the second CUDA stream returns the kernel-processed second sub-data block to the host.
4. The data processing method in a CUDA heterogeneous platform according to claim 1, characterized in that performing the cross transfer of the first sub-data block through the first CUDA stream and of the second sub-data block through the second CUDA stream, according to the duration for the host to send data to the device, the kernel execution duration, and the duration for the device to send data to the host, thereby completing the processing of the first sub-data block and the second sub-data block, comprises:
if the kernel execution duration is greater than or equal to the duration for the host to send data to the device and less than or equal to the duration for the device to send data to the host, the first CUDA stream transfers the first sub-data block to the device; while the device executes the kernel function on the first sub-data block, the second CUDA stream transfers the second sub-data block to the device; while the first CUDA stream returns the kernel-processed first sub-data block to the host, the device executes the kernel function on the second sub-data block; finally, the second CUDA stream returns the kernel-processed second sub-data block to the host.
5. The data processing method in a CUDA heterogeneous platform according to claim 1, characterized in that performing the cross transfer of the first sub-data block through the first CUDA stream and of the second sub-data block through the second CUDA stream, according to the duration for the host to send data to the device, the kernel execution duration, and the duration for the device to send data to the host, thereby completing the processing of the first sub-data block and the second sub-data block, comprises:
if the kernel execution duration is greater than or equal to the duration for the device to send data to the host and less than or equal to the duration for the host to send data to the device, the first CUDA stream transfers the first sub-data block to the device; while the second CUDA stream transfers the second sub-data block to the device, the device executes the kernel function on the first sub-data block; while the device executes the kernel function on the second sub-data block, the first CUDA stream returns the kernel-processed first sub-data block to the host; finally, the second CUDA stream returns the kernel-processed second sub-data block to the host.
6. The data processing method in a CUDA heterogeneous platform according to claim 1, characterized in that performing the cross transfer of the first sub-data block through the first CUDA stream and of the second sub-data block through the second CUDA stream, according to the duration for the host to send data to the device, the kernel execution duration, and the duration for the device to send data to the host, thereby completing the processing of the first sub-data block and the second sub-data block, comprises:
comparing the duration for the host to send data to the device with the duration for the device to send data to the host to obtain the larger duration; if the kernel execution duration is greater than the larger duration, the first CUDA stream transfers the first sub-data block to the device; while the device executes the kernel function on the first sub-data block, the second CUDA stream transfers the second sub-data block to the device; while the device executes the kernel function on the second sub-data block, the first CUDA stream returns the kernel-processed first sub-data block to the host; finally, the second CUDA stream returns the kernel-processed second sub-data block to the host.
7. The data processing method in a CUDA heterogeneous platform according to claim 2, characterized by further comprising:
copying the several sub-data blocks from the page-locked memory to the device memory module.
8. A data processing device in a CUDA heterogeneous platform, characterized by comprising:
a sub-data processing module, configured to, for several sub-data blocks of data stored in a host memory module, perform a cross transfer of a first sub-data block of the several sub-data blocks through a first CUDA stream and of a second sub-data block of the several sub-data blocks through a second CUDA stream, according to the duration for the host to send data to the device, the kernel execution duration, and the duration for the device to send data to the host, thereby completing the processing of the first sub-data block and the second sub-data block;
a complete-data processing module, configured to complete the processing of the data for the remaining sub-data blocks of the several sub-data blocks in the same cross-transfer manner;
wherein the data is divided into the several sub-data blocks.
9. An electronic device, characterized by comprising:
at least one host, at least one memory, at least one device, a communication interface, and a bus; wherein,
the host, the memory, the device, and the communication interface complete mutual communication through the bus;
the memory stores program instructions executable by the host, and the host calls the program instructions to execute the method according to any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores computer instructions, and the computer instructions cause a computer to execute the method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910017177.5A CN109739559A (en) | 2019-01-08 | 2019-01-08 | Data processing method and equipment in CUDA heterogeneous platform |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109739559A true CN109739559A (en) | 2019-05-10 |
Family
ID=66363927
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910017177.5A Pending CN109739559A (en) | 2019-01-08 | 2019-01-08 | Data processing method and equipment in CUDA heterogeneous platform |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109739559A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2192780A1 (en) * | 2008-11-28 | 2010-06-02 | Thomson Licensing | Method for video decoding supported by Graphics Processing Unit |
CN102662641A (en) * | 2012-04-16 | 2012-09-12 | 浙江工业大学 | Parallel acquisition method for seed distribution data based on CUDA |
CN103310484A (en) * | 2013-07-03 | 2013-09-18 | 西安电子科技大学 | Computed tomography (CT) image rebuilding accelerating method based on compute unified device architecture (CUDA) |
CN106358003A (en) * | 2016-08-31 | 2017-01-25 | 华中科技大学 | Video analysis and accelerating method based on thread level flow line |
Non-Patent Citations (1)
Title |
---|
TONYSHENGTAN: "【CUDA 基础】6.1 流和事件概述", 《HTTPS://WWW.CNBLOGS.COM/FACE2AI/P/9756606.HTML》 * |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20190510 |