CN109739559A - Data processing method and equipment in CUDA heterogeneous platform - Google Patents

Data processing method and equipment in CUDA heterogeneous platform

Info

Publication number
CN109739559A
CN109739559A (application CN201910017177.5A)
Authority
CN
China
Prior art keywords
subdata
equipment
cuda
host
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910017177.5A
Other languages
Chinese (zh)
Inventor
刘俞辰 (Liu Yuchen)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WUHAN ZONCARE BIO-MEDICAL ELECTRONICS Co Ltd
Original Assignee
WUHAN ZONCARE BIO-MEDICAL ELECTRONICS Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WUHAN ZONCARE BIO-MEDICAL ELECTRONICS Co Ltd
Priority to CN201910017177.5A
Publication of CN109739559A
Legal status: Pending


Abstract

An embodiment of the invention provides a data processing method and device for a CUDA heterogeneous platform. The method includes: for the several sub-data blocks of the data stored in the host memory module, according to the duration of host-to-device data transfer, the kernel execution duration, and the duration of device-to-host data transfer, transferring a first sub-data block of the several sub-data blocks with a first CUDA stream and a second sub-data block with a second CUDA stream in an interleaved (cross-transfer) manner, thereby completing the processing of the first and second sub-data blocks; and, for the remaining sub-data blocks, completing the processing of the data in the same interleaved manner. The data processing method and device for a CUDA heterogeneous platform provided by embodiments of the present invention can increase the running speed of data processing.

Description

Data processing method and equipment in CUDA heterogeneous platform
Technical field
Embodiments of the present invention relate to the field of ultrasonic imaging technology, and in particular to a data processing method and device for a CUDA heterogeneous platform.
Background art
Ultrasonic imaging emits ultrasonic waves toward an examined object through an ultrasonic transducer and receives the echoes returned from it; by exploiting the physical characteristics of ultrasound and the differences in the acoustic properties of the examined object, it depicts morphological information inside the object. Three-dimensional (3-D) ultrasonic imaging can display the features of an object intuitively, providing effective assistance for doctors in making accurate clinical diagnoses and designing reasonable treatment plans. However, 3-D ultrasonic imaging processes an entire three-dimensional volume, and both its computational load and its data volume far exceed those of two-dimensional ultrasonic imaging. Traditional CPU-based computation yields slow imaging speed and poor real-time performance, which has seriously hindered the development of 3-D ultrasonic imaging systems.
In recent years, with ever-increasing demands on computing performance, GPU-based parallel computing has developed rapidly. General-purpose GPU computing usually adopts a CPU+GPU heterogeneous model: the CPU, acting as the host, executes complex logic and transaction processing that is unsuitable for data-parallel computation, while the GPU, acting as the device, performs large-scale compute-intensive data-parallel computation. Because the GPU has clear advantages over the CPU in processing capacity and memory bandwidth, it can compensate for the CPU's performance shortcomings and fully exploit the potential performance of the computer. In particular, the appearance of the CUDA architecture in 2007 advanced general-purpose GPU computing by leaps and bounds: it exposes the GPU's parallel architecture through a C-like language, so any developer familiar with C can easily transition to CUDA development, which greatly lowers the entry barrier. Heterogeneous computing systems composed of multi-core CPUs and many-core GPUs therefore not only accelerate and optimize traditional techniques but also help drive innovation in high-performance computing. Almost all current mid- and high-end ultrasonic diagnostic instruments on the market use such heterogeneous computing systems to implement 3-D ultrasonic imaging.
Since a heterogeneous computing system involves the cooperation of host and device, program optimization must improve not only the execution efficiency of device-side code but also the efficiency of host-device cooperation. However, when accelerating and optimizing, most researchers focus only on the execution efficiency of device-side code and neglect the optimization of host-device cooperation. Therefore, given the shortcomings of current heterogeneous-platform 3-D ultrasonic imaging data processing, finding a method that lets host and device work concurrently, and that improves their cooperation by accelerating data exchange between them so as to raise the running efficiency of the whole system, has become a problem of wide concern in the industry.
Summary of the invention
In view of the above problems in the prior art, embodiments of the present invention provide a data processing method and device for a CUDA heterogeneous platform.
In a first aspect, an embodiment of the invention provides a data processing method for a CUDA heterogeneous platform, comprising: for the several sub-data blocks of the data stored in the host memory module, according to the duration of host-to-device data transfer, the kernel execution duration, and the duration of device-to-host data transfer, transferring a first sub-data block of the several sub-data blocks with a first CUDA stream and a second sub-data block with a second CUDA stream in an interleaved (cross-transfer) manner, thereby completing the processing of the first and second sub-data blocks; and, for the remaining sub-data blocks, completing the processing of the data in the same interleaved manner; wherein the data is divided into the several sub-data blocks.
Further, the several sub-data blocks are stored in page-locked (pinned) memory pages of the host memory module.
Further, performing the interleaved transfer according to the host-to-device transfer duration, the kernel execution duration, and the device-to-host transfer duration comprises: comparing the host-to-device transfer duration with the device-to-host transfer duration to obtain the smaller of the two; if the kernel execution duration is less than this smaller value, the first CUDA stream transfers the first sub-data block to the device while the second CUDA stream transfers the second sub-data block to the device; the device then executes the kernel on the first sub-data block; while the first CUDA stream returns the processed first sub-data block to the host, the device executes the kernel on the second sub-data block; finally, the second CUDA stream returns the processed second sub-data block to the host.
Further, performing the interleaved transfer according to the host-to-device transfer duration, the kernel execution duration, and the device-to-host transfer duration comprises: if the kernel execution duration is greater than or equal to the host-to-device transfer duration and less than or equal to the device-to-host transfer duration, the first CUDA stream transfers the first sub-data block to the device; while the device executes the kernel on the first sub-data block, the second CUDA stream transfers the second sub-data block to the device; while the first CUDA stream returns the processed first sub-data block to the host, the device executes the kernel on the second sub-data block; finally, the second CUDA stream returns the processed second sub-data block to the host.
Further, performing the interleaved transfer according to the host-to-device transfer duration, the kernel execution duration, and the device-to-host transfer duration comprises: if the kernel execution duration is greater than or equal to the device-to-host transfer duration and less than or equal to the host-to-device transfer duration, the first CUDA stream transfers the first sub-data block to the device; while the second CUDA stream transfers the second sub-data block to the device, the device executes the kernel on the first sub-data block; while the device executes the kernel on the second sub-data block, the first CUDA stream returns the processed first sub-data block to the host; finally, the second CUDA stream returns the processed second sub-data block to the host.
Further, performing the interleaved transfer according to the host-to-device transfer duration, the kernel execution duration, and the device-to-host transfer duration comprises: comparing the host-to-device transfer duration with the device-to-host transfer duration to obtain the larger of the two; if the kernel execution duration is greater than this larger value, the first CUDA stream transfers the first sub-data block to the device; while the device executes the kernel on the first sub-data block, the second CUDA stream transfers the second sub-data block to the device; while the device executes the kernel on the second sub-data block, the first CUDA stream returns the processed first sub-data block to the host; finally, the second CUDA stream returns the processed second sub-data block to the host.
Further, the data processing method for a CUDA heterogeneous platform also includes: copying the several sub-data blocks from the page-locked memory pages to the device memory module.
In a second aspect, an embodiment of the invention provides a data processing device for a CUDA heterogeneous platform, comprising:
a sub-data processing module, configured to, for the several sub-data blocks of the data stored in the host memory module, according to the host-to-device transfer duration, the kernel execution duration, and the device-to-host transfer duration, transfer a first sub-data block of the several sub-data blocks with a first CUDA stream and a second sub-data block with a second CUDA stream in an interleaved manner, completing the processing of the first and second sub-data blocks;
a complete-data processing module, configured to, for the remaining sub-data blocks, complete the processing of the data in the same interleaved manner;
wherein the data is divided into the several sub-data blocks.
In a third aspect, an embodiment of the invention provides an electronic apparatus, comprising:
at least one host; and
at least one memory communicatively connected to the host, and at least one device, wherein:
the memory stores program instructions executable by the host; by invoking these instructions, the host is able to perform the data processing method for a CUDA heterogeneous platform provided by any possible implementation of the first aspect.
In a fourth aspect, an embodiment of the invention provides a non-transitory computer-readable storage medium storing computer instructions; the computer instructions cause a computer to perform the data processing method for a CUDA heterogeneous platform provided by any possible implementation of the first aspect.
The data processing method and device for a CUDA heterogeneous platform provided by embodiments of the present invention realize, through CUDA-stream asynchronous processing, a data processing method that is parallel at two levels: concurrent work between host and device, and, on the device, overlap of data transfer with kernel computation. This can increase the running speed of data processing.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below illustrate only some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow chart of the data processing method for a CUDA heterogeneous platform provided by an embodiment of the present invention;
Fig. 2 is a flow diagram of dual-CUDA-stream data transfer and kernel processing provided by an embodiment of the present invention;
Fig. 3 is another flow diagram of dual-CUDA-stream data transfer and kernel processing provided by an embodiment of the present invention;
Fig. 4 is another flow diagram of dual-CUDA-stream data transfer and kernel processing provided by an embodiment of the present invention;
Fig. 5 is another flow diagram of dual-CUDA-stream data transfer and kernel processing provided by an embodiment of the present invention;
Fig. 6 is a structural diagram of the data processing device for a CUDA heterogeneous platform provided by an embodiment of the present invention;
Fig. 7 is a schematic diagram of the physical structure of an electronic apparatus provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present invention. In addition, the technical features in the individual embodiments provided by the present invention may be combined with one another arbitrarily to form feasible technical solutions, provided the combination can be realized by those of ordinary skill in the art; when a combination of technical solutions is self-contradictory or cannot be realized, it should be considered nonexistent and outside the protection scope claimed by the present invention.
An embodiment of the present invention provides a data processing method for a CUDA heterogeneous platform. Referring to Fig. 1, the method includes:
101. For the several sub-data blocks of the data stored in the host memory module, according to the host-to-device transfer duration, the kernel execution duration, and the device-to-host transfer duration, transfer a first sub-data block of the several sub-data blocks with a first CUDA stream and a second sub-data block with a second CUDA stream in an interleaved manner, completing the processing of the first and second sub-data blocks;
102. For the remaining sub-data blocks, complete the processing of the data in the same interleaved manner.
Wherein, the data is divided into the several sub-data blocks.
On the basis of the above embodiment, in the data processing method for a CUDA heterogeneous platform provided by this embodiment of the present invention, the several sub-data blocks are stored in page-locked memory pages of the host memory module.
On the basis of the above embodiments, in the data processing method for a CUDA heterogeneous platform provided by this embodiment of the present invention, performing the interleaved transfer according to the host-to-device transfer duration, the kernel execution duration, and the device-to-host transfer duration comprises: the host-to-device transfer duration and the device-to-host transfer duration are compared to obtain the smaller value; if the kernel execution duration is less than this smaller value, the first CUDA stream transfers the first sub-data block to the device while the second CUDA stream transfers the second sub-data block to the device; the device then executes the kernel on the first sub-data block; while the first CUDA stream returns the processed first sub-data block to the host, the device executes the kernel on the second sub-data block; finally, the second CUDA stream returns the processed second sub-data block to the host. Refer specifically to Fig. 2, which includes: the first CUDA stream transferring the first sub-data block to the device (section 201); the device executing the kernel on the first sub-data block (section 202); the first CUDA stream returning the processed first sub-data block to the host (section 203); the second CUDA stream transferring the second sub-data block to the device (section 204); the device executing the kernel on the second sub-data block (section 205); the second CUDA stream returning the processed second sub-data block to the host (section 206); the host-to-device transfer duration T1; and the device-to-host transfer duration T2. The specific transfer process is as described above and is not repeated here. In this case the kernel is fairly simple, or has already undergone extensive optimization, so its execution duration is less than the data transfer duration. Data transfer therefore becomes the bottleneck of overall program performance; the execution situation is as shown in Fig. 2. As can be seen from Fig. 2, because the kernel execution engine and the memory copy engine work simultaneously, kernel execution overlaps with data exchange, but the two copies do not overlap each other. Since the kernel consumes less time than the memory copies, its execution duration is completely hidden by the data exchange duration. The final total duration is: T = 2T1 + 2T2.
On the basis of the above embodiments, in the data processing method for a CUDA heterogeneous platform provided by this embodiment of the present invention, performing the interleaved transfer according to the host-to-device transfer duration, the kernel execution duration, and the device-to-host transfer duration comprises: if the kernel execution duration is greater than or equal to the host-to-device transfer duration and less than or equal to the device-to-host transfer duration, the first CUDA stream transfers the first sub-data block to the device; while the device executes the kernel on the first sub-data block, the second CUDA stream transfers the second sub-data block to the device; while the first CUDA stream returns the processed first sub-data block to the host, the device executes the kernel on the second sub-data block; finally, the second CUDA stream returns the processed second sub-data block to the host. Refer specifically to Fig. 3, which includes: the first CUDA stream transferring the first sub-data block to the device (section 301); the device executing the kernel on the first sub-data block (section 302); the first CUDA stream returning the processed first sub-data block to the host (section 303); the second CUDA stream transferring the second sub-data block to the device (section 304); the device executing the kernel on the second sub-data block (section 305); the second CUDA stream returning the processed second sub-data block to the host (section 306); the host-to-device transfer duration T1; the device-to-host transfer duration T2; and the kernel execution duration Tk. The specific transfer process is as described above and is not repeated here. This case corresponds to transferring a small amount of data from the host to the device while the data volume increases greatly after the kernel computation: the time needed to transfer these data back even exceeds the kernel execution time, so the data transfer time cannot be completely hidden by the kernel execution time; the execution situation is as shown in Fig. 3. As can be seen from Fig. 3, the first stream's kernel execution on the first sub-data block (section 302) hides the second stream's transfer of the second sub-data block to the device (section 304), and the first stream's return of the processed first sub-data block to the host (section 303) hides the second stream's kernel execution on the second sub-data block (section 305). The final total duration is: T = T1 + Tk + 2T2.
On the basis of the above embodiments, in the data processing method for a CUDA heterogeneous platform provided by this embodiment of the present invention, performing the interleaved transfer according to the host-to-device transfer duration, the kernel execution duration, and the device-to-host transfer duration comprises: if the kernel execution duration is greater than or equal to the device-to-host transfer duration and less than or equal to the host-to-device transfer duration, the first CUDA stream transfers the first sub-data block to the device; while the second CUDA stream transfers the second sub-data block to the device, the device executes the kernel on the first sub-data block; while the device executes the kernel on the second sub-data block, the first CUDA stream returns the processed first sub-data block to the host; finally, the second CUDA stream returns the processed second sub-data block to the host. Refer specifically to Fig. 4, which includes: the first CUDA stream transferring the first sub-data block to the device (section 401); the device executing the kernel on the first sub-data block (section 402); the first CUDA stream returning the processed first sub-data block to the host (section 403); the second CUDA stream transferring the second sub-data block to the device (section 404); the device executing the kernel on the second sub-data block (section 405); the second CUDA stream returning the processed second sub-data block to the host (section 406); the host-to-device transfer duration T1; the device-to-host transfer duration T2; and the kernel execution duration Tk. The specific transfer process is as described above and is not repeated here. This case corresponds to a large input data volume from the host: the transfer time exceeds the kernel execution time, but the kernel computation passes a smaller data volume back to the host; the execution situation is as shown in Fig. 4. Here, the first stream's kernel execution on the first sub-data block (section 402) is hidden by the second stream's transfer of the second sub-data block to the device (section 404), and the second stream's kernel execution on the second sub-data block (section 405) covers the first stream's return of the processed first sub-data block to the host (section 403). The final total duration is: T = 2T1 + Tk + T2.
On the basis of the above embodiments, in the data processing method for a CUDA heterogeneous platform provided by this embodiment of the present invention, performing the interleaved transfer according to the host-to-device transfer duration, the kernel execution duration, and the device-to-host transfer duration comprises: the host-to-device transfer duration and the device-to-host transfer duration are compared to obtain the larger value; if the kernel execution duration is greater than this larger value, the first CUDA stream transfers the first sub-data block to the device; while the device executes the kernel on the first sub-data block, the second CUDA stream transfers the second sub-data block to the device; while the device executes the kernel on the second sub-data block, the first CUDA stream returns the processed first sub-data block to the host; finally, the second CUDA stream returns the processed second sub-data block to the host. Refer specifically to Fig. 5, which includes: the first CUDA stream transferring the first sub-data block to the device (section 501); the device executing the kernel on the first sub-data block (section 502); the first CUDA stream returning the processed first sub-data block to the host (section 503); the second CUDA stream transferring the second sub-data block to the device (section 504); the device executing the kernel on the second sub-data block (section 505); the second CUDA stream returning the processed second sub-data block to the host (section 506); the host-to-device transfer duration T1; the device-to-host transfer duration T2; and the kernel execution duration Tk. The specific transfer process is as described above and is not repeated here. In this case the kernel execution time is longer than both transfer times; this is the most common situation, and most programs behave this way. The execution situation is as shown in Fig. 5. The final total duration is: T = T1 + 2Tk + T2.
On the basis of the above embodiments of the present invention, the execution-time functions of the program when using two CUDA streams have been summarized for each case. Whichever case applies, using CUDA streams hides the transfer time of part of the data and can improve the running speed of the whole system. Since all operations of the present invention are asynchronous, the CPU can perform a small amount of other computation while the device transfers data and executes kernels; the time of this part can therefore be completely hidden by the device's computation time, further improving the running speed of the system.
Wherein, T1 is the duration of host-to-device data transfer, T2 is the duration of device-to-host data transfer, and Tk is the kernel execution duration.
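The four cases above are selected by comparing T1, Tk, and T2. The patent does not give code for obtaining these durations; as an illustration only, they could be measured on a single stream with CUDA events before choosing a schedule. The kernel, buffer names, and data size below are placeholders, not taken from the patent:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder standing in for the patent's unspecified processing kernel.
__global__ void scaleKernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

int main() {
    const int n = 1 << 20;                      // assumed sub-data size
    float *h_in, *h_out, *d_in, *d_out;
    cudaHostAlloc(&h_in,  n * sizeof(float), cudaHostAllocDefault);  // pinned
    cudaHostAlloc(&h_out, n * sizeof(float), cudaHostAllocDefault);
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    cudaEvent_t start, afterH2D, afterKernel, afterD2H;
    cudaEventCreate(&start);       cudaEventCreate(&afterH2D);
    cudaEventCreate(&afterKernel); cudaEventCreate(&afterD2H);

    // Record an event after each stage of one transfer-compute-transfer pass.
    cudaEventRecord(start);
    cudaMemcpyAsync(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaEventRecord(afterH2D);
    scaleKernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaEventRecord(afterKernel);
    cudaMemcpyAsync(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaEventRecord(afterD2H);
    cudaEventSynchronize(afterD2H);

    float t1, tk, t2;                                   // all in milliseconds
    cudaEventElapsedTime(&t1, start, afterH2D);         // T1: host -> device
    cudaEventElapsedTime(&tk, afterH2D, afterKernel);   // Tk: kernel execution
    cudaEventElapsedTime(&t2, afterKernel, afterD2H);   // T2: device -> host
    printf("T1 = %.3f ms, Tk = %.3f ms, T2 = %.3f ms\n", t1, tk, t2);

    cudaFreeHost(h_in); cudaFreeHost(h_out);
    cudaFree(d_in);     cudaFree(d_out);
    return 0;
}
```

Comparing the three printed values then indicates which of Figs. 2 to 5 describes the resulting overlap.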
On the basis of the above embodiments, the data processing method for a CUDA heterogeneous platform provided by this embodiment of the present invention further includes: copying the several sub-data blocks from the page-locked memory pages to the device memory module.
The data processing method for a CUDA heterogeneous platform provided by embodiments of the present invention realizes, through CUDA-stream asynchronous processing, a data processing method that is parallel at two levels: concurrent work between host and device, and overlap of data transfer with kernel computation on the device. It can thus increase the running speed of data processing.
It should be noted that the data processing method for a CUDA heterogeneous platform provided by the embodiments of the present invention can be applied to volume-data exchange and processing in a 3-D ultrasonic imaging system based on a heterogeneous platform. This further embodiment merely illustrates the technical essence of the solution provided by the embodiments of the present invention and does not limit the protection scope of the present invention; all technical solutions conforming to the spirit of the present invention fall within the protection scope of this patent. The specific steps are as follows:
Step 1: The probe acquires one complete set of three-dimensional data. This step is a conventional method and is not repeated here.
Step 2: Allocate page-locked memory pages in the host memory module to store the three-dimensional data. Because the device knows the physical address of page-locked host memory, the CUDA driver can transfer data between host and device by direct memory access (DMA). DMA requires no host intervention while the copy executes, but the host may move pageable data at any time, which would disrupt a DMA operation. Hence, when copying from pageable memory, the CUDA driver still transfers the data to the device by DMA, but the copy must be performed twice: a first pass copies the data from pageable memory into a "temporary" page-locked buffer, and a second pass copies it from that page-locked buffer to the device memory module. Every copy from pageable memory is therefore limited by the lower of the PCIe transfer speed and the system front-side bus speed. When transferring data between the host memory module and the device memory module, this difference can make page-locked host memory perform roughly twice as well as standard pageable memory. Since CUDA streams require high-speed data transfer, the present invention allocates page-locked memory for all data to be transferred through CUDA streams.
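Step 2 corresponds to the CUDA runtime's page-locked allocation API. The following is a minimal sketch; the volume size is an assumption for illustration, not a value from the patent:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    // Assumed volume size, for illustration only.
    const size_t volumeBytes = 256ull * 256 * 256 * sizeof(float);

    // Standard pageable allocation: the driver must stage each copy
    // through a temporary pinned buffer, so the copy runs twice.
    float* pageable = (float*)malloc(volumeBytes);

    // Page-locked ("pinned") allocation: the OS cannot move these pages,
    // so the driver can DMA directly between host and device, and
    // cudaMemcpyAsync in a CUDA stream can overlap with kernel execution.
    float* pinned = nullptr;
    if (cudaHostAlloc(&pinned, volumeBytes, cudaHostAllocDefault) != cudaSuccess) {
        fprintf(stderr, "page-locked allocation failed\n");
        return 1;
    }

    // ... fill `pinned` with the probe's volume data and hand it to the streams ...

    cudaFreeHost(pinned);   // pinned memory must be freed with cudaFreeHost
    free(pageable);
    return 0;
}
```

Pinned memory is a limited resource, so in practice only the buffers actually used for stream transfers would be allocated this way.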
Step 3: transfer the three-dimensional data from the host memory module to the device memory module through the connection module, using CUDA streams. For ease of understanding, the description uses 2 CUDA streams, denoted the first CUDA stream and the second CUDA stream; in practice more CUDA streams can be created as needed. The framework computing module divides the three-dimensional ultrasonic volume data collected by the probe into several sub-data blocks of suitable size and stores them in the page-locked memory of the host memory module. The goal is that while the first CUDA stream executes the kernel function, the second CUDA stream copies a sub-data block from the host memory module to the device memory module; and while the first CUDA stream copies its computation result back to the host, the second CUDA stream begins executing the kernel function. If all operations of one stream are enqueued at once, it is easy to inadvertently block the copy operation or the kernel execution of the other stream; therefore the present invention enqueues operations into the streams in a breadth-first manner rather than a depth-first manner. That is, rather than first adding all three operations of the first CUDA stream (transferring the sub-data block from the host memory module to the device memory module, executing the kernel function, and transferring the computation result from the device memory module back to the host memory module) and then adding all three operations of the second CUDA stream, the operations of the two streams are interleaved. Specifically, the transfer of a sub-data block from the host memory module to the device memory module is first added to the first CUDA stream, and then such a transfer is added to the second CUDA stream. Next, the kernel function call is added to the first CUDA stream, and the same operation is added to the second CUDA stream. Then the transfer of the computation result from the device memory module to the host memory module is added to the first CUDA stream, and the same operation is added to the second CUDA stream. This cycle repeats until all data collected by the probe have been transferred to the device, computed, and returned to the host.
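The benefit of the breadth-first enqueueing order described in Step 3 can be sketched with a small discrete-event model. This simulation is hypothetical and not the claimed implementation: it assumes one copy engine shared by both transfer directions and one compute engine, each draining its queue in the order operations were issued, which is exactly the blocking behavior the breadth-first order is designed to avoid:

```python
def makespan(issue_order, t_h2d=2.0, t_kernel=2.0, t_d2h=2.0):
    """Finish time of the last operation for a given enqueueing order.

    Model assumptions (hypothetical): one copy engine serves both H2D and
    D2H transfers FIFO in issue order; one compute engine serves kernels
    FIFO; within each stream the order is H2D -> kernel -> D2H.
    """
    duration = {"h2d": t_h2d, "kernel": t_kernel, "d2h": t_d2h}
    engine = {"h2d": "copy", "kernel": "exec", "d2h": "copy"}
    depends = {"h2d": None, "kernel": "h2d", "d2h": "kernel"}
    engine_free = {"copy": 0.0, "exec": 0.0}
    finished = {}
    for stream, op in issue_order:
        dep_done = finished[(stream, depends[op])] if depends[op] else 0.0
        start = max(engine_free[engine[op]], dep_done)
        finished[(stream, op)] = start + duration[op]
        engine_free[engine[op]] = finished[(stream, op)]
    return max(finished.values())

# Depth-first: all three operations of stream 0, then all of stream 1.
depth_first = [(0, "h2d"), (0, "kernel"), (0, "d2h"),
               (1, "h2d"), (1, "kernel"), (1, "d2h")]
# Breadth-first: the two streams' operations interleaved, as in Step 3.
breadth_first = [(0, "h2d"), (1, "h2d"), (0, "kernel"),
                 (1, "kernel"), (0, "d2h"), (1, "d2h")]
```

Under these assumptions, with equal durations of 2 time units per operation, the interleaved order finishes in 8 time units versus 12 for the depth-first order, because stream 0's result copy no longer sits in the copy queue ahead of stream 1's input transfer.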
Step 4: the device computing module reads data from the device memory module and performs large-scale parallel computation. This step is the same as the conventional method and is not repeated here.
Step 5: the device computing module stores the computation result into the device memory module. This step is a conventional method and is not repeated here.
Step 6: transfer the volume data back from the device memory module to the host memory module through the connection module, using CUDA streams. This step can be regarded as the inverse operation of Step 3; the techniques used are identical to those of Step 3 and are likewise not repeated.
Step 7: the framework computing module displays the computation result through the display module. This step is a conventional method and is not repeated here.
The embodiments of the present invention are, at bottom, realized by programmed processing performed by a device having a host function. Therefore, in engineering practice, the technical solutions and functions of the embodiments of the present invention can be packaged into various modules. Based on this reality, on the basis of the above embodiments, an embodiment of the present invention provides a data processing device in a CUDA heterogeneous platform, which is used to execute the data processing method in the CUDA heterogeneous platform of the above method embodiments. Referring to Fig. 6, the device includes:
a sub-data processing module 601, configured to, for several sub-data blocks of data stored in the host memory module, according to the duration for the host to send data to the device, the duration for the kernel function to execute, and the duration for the device to send data to the host, perform cross transfer of a first sub-data block of the several sub-data blocks transferred by a first CUDA stream and a second sub-data block of the several sub-data blocks transferred by a second CUDA stream, to complete processing of the first sub-data block and the second sub-data block;
a complete-data processing module 602, configured to, for the remaining sub-data blocks of the several sub-data blocks, complete processing of the data in the cross-transfer manner;
wherein the data are divided into the several sub-data blocks.
The data processing device in the CUDA heterogeneous platform provided by the embodiments of the present invention, by means of the sub-data processing module, the complete-data processing module, and the asynchronous processing mode of CUDA streams, realizes a parallel data processing method with two levels of concurrency: concurrent working of data between the host and the device, and concurrent working of data transfer and kernel computation within the device. It can thereby improve the running speed of data processing.
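The duration comparison that drives the cross transfer (set out in detail in claims 3 to 6 below) can be summarized as a small selector. This is only one illustrative reading of those claims; the function name and the returned labels are hypothetical:

```python
def pick_cross_transfer_pattern(t_h2d, t_kernel, t_d2h):
    """Select which claimed overlap pattern applies, given the measured
    durations of the host-to-device copy, the kernel execution, and the
    device-to-host copy (labels per claims 3-6 as read in this sketch)."""
    if t_kernel < min(t_h2d, t_d2h):
        return "claim 3"  # kernel shorter than both copy directions
    if t_kernel > max(t_h2d, t_d2h):
        return "claim 6"  # kernel longer than both copy directions
    if t_h2d <= t_kernel <= t_d2h:
        return "claim 4"  # kernel between H2D and D2H durations
    return "claim 5"      # kernel between D2H and H2D durations
```

For example, when both copy directions take 3 time units and the kernel takes 1, the kernel is shorter than both copies and the pattern of claim 3 applies.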
The method of the embodiments of the present invention relies on an electronic device, so it is necessary to briefly introduce the relevant electronic device. For this purpose, an embodiment of the present invention provides an electronic device. As shown in Fig. 7, the electronic device includes: at least one host (Host) 701, at least one memory (Memory) 702, a communication bus (Communications Bus) 703, a communication interface (Communications Interface) 704, and at least one device (Device) 705. The at least one host 701, the at least one device 705, the communication interface 704, and the at least one memory 702 communicate with each other through the communication bus 703. The at least one host 701 and the at least one device 705 can transmit data to each other for data interaction processing, and can also call logical instructions in the at least one memory 702 to execute the following method: for several sub-data blocks of data stored in the host memory module, according to the duration for the host to send data to the device, the duration for the kernel function to execute, and the duration for the device to send data to the host, performing cross transfer of a first sub-data block of the several sub-data blocks transferred by a first CUDA stream and a second sub-data block of the several sub-data blocks transferred by a second CUDA stream, to complete processing of the first sub-data block and the second sub-data block; for the remaining sub-data blocks of the several sub-data blocks, completing processing of the data in the cross-transfer manner; wherein the data are divided into the several sub-data blocks.
In addition, the logical instructions in the above at least one memory 702 can be implemented in the form of software functional units and, when sold or used as an independent product, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods of the embodiments of the present invention, for example: for several sub-data blocks of data stored in the host memory module, according to the duration for the host to send data to the device, the duration for the kernel function to execute, and the duration for the device to send data to the host, performing cross transfer of a first sub-data block of the several sub-data blocks transferred by a first CUDA stream and a second sub-data block of the several sub-data blocks transferred by a second CUDA stream, to complete processing of the first sub-data block and the second sub-data block; for the remaining sub-data blocks of the several sub-data blocks, completing processing of the data in the cross-transfer manner; wherein the data are divided into the several sub-data blocks. The aforementioned storage medium includes various media that can store program code, such as a USB flash disk, a removable hard disk, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a magnetic disk, or an optical disk.
The apparatus embodiments described above are merely exemplary. The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. Those of ordinary skill in the art can understand and implement the embodiments without creative labor.
Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be realized by means of software plus a necessary general hardware platform, and of course also by hardware. Based on this understanding, the above technical solution, in essence, or the part contributing to the prior art, can be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disk, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the method described in each embodiment or in certain parts of the embodiments.
Finally, it should be noted that the above embodiments are merely illustrative of the technical solutions of the present invention, rather than limiting them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some of the technical features can be equivalently replaced; and such modifications or replacements do not make the essence of the corresponding technical solutions depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A data processing method in a CUDA heterogeneous platform, characterized by comprising:
for several sub-data blocks of data stored in a host memory module, according to a duration for a host to send data to a device, a duration for a kernel function to execute, and a duration for the device to send data to the host, performing cross transfer of a first sub-data block of the several sub-data blocks transferred by a first CUDA stream and a second sub-data block of the several sub-data blocks transferred by a second CUDA stream, to complete processing of the first sub-data block and the second sub-data block;
for remaining sub-data blocks of the several sub-data blocks, completing processing of the data in the cross-transfer manner;
wherein the data are divided into the several sub-data blocks.
2. The data processing method in a CUDA heterogeneous platform according to claim 1, characterized in that the several sub-data blocks are stored in page-locked memory of the host memory module.
3. The data processing method in a CUDA heterogeneous platform according to claim 1, characterized in that the performing, according to the duration for the host to send data to the device, the duration for the kernel function to execute, and the duration for the device to send data to the host, cross transfer of the first sub-data block of the several sub-data blocks transferred by the first CUDA stream and the second sub-data block of the several sub-data blocks transferred by the second CUDA stream, to complete processing of the first sub-data block and the second sub-data block, comprises:
comparing the duration for the host to send data to the device with the duration for the device to send data to the host, to obtain the smaller duration; if the duration for the kernel function to execute is less than the smaller duration, the first CUDA stream transfers the first sub-data block to the device; while the second CUDA stream transfers the second sub-data block to the device, the device executes the kernel function on the first sub-data block; while the first CUDA stream returns to the host the first sub-data block on which the kernel function has been executed, the device executes the kernel function on the second sub-data block; and finally the second CUDA stream returns to the host the second sub-data block on which the kernel function has been executed.
4. The data processing method in a CUDA heterogeneous platform according to claim 1, characterized in that the performing, according to the duration for the host to send data to the device, the duration for the kernel function to execute, and the duration for the device to send data to the host, cross transfer of the first sub-data block of the several sub-data blocks transferred by the first CUDA stream and the second sub-data block of the several sub-data blocks transferred by the second CUDA stream, to complete processing of the first sub-data block and the second sub-data block, comprises:
if the duration for the kernel function to execute is greater than or equal to the duration for the host to send data to the device and less than or equal to the duration for the device to send data to the host, the first CUDA stream transfers the first sub-data block to the device; while the device executes the kernel function on the first sub-data block, the second CUDA stream transfers the second sub-data block to the device; while the first CUDA stream returns to the host the first sub-data block on which the kernel function has been executed, the device executes the kernel function on the second sub-data block; and finally the second CUDA stream returns to the host the second sub-data block on which the kernel function has been executed.
5. The data processing method in a CUDA heterogeneous platform according to claim 1, characterized in that the performing, according to the duration for the host to send data to the device, the duration for the kernel function to execute, and the duration for the device to send data to the host, cross transfer of the first sub-data block of the several sub-data blocks transferred by the first CUDA stream and the second sub-data block of the several sub-data blocks transferred by the second CUDA stream, to complete processing of the first sub-data block and the second sub-data block, comprises:
if the duration for the kernel function to execute is greater than or equal to the duration for the device to send data to the host and less than or equal to the duration for the host to send data to the device, the first CUDA stream transfers the first sub-data block to the device; while the second CUDA stream transfers the second sub-data block to the device, the device executes the kernel function on the first sub-data block; while the device executes the kernel function on the second sub-data block, the first CUDA stream returns to the host the first sub-data block on which the kernel function has been executed; and finally the second CUDA stream returns to the host the second sub-data block on which the kernel function has been executed.
6. The data processing method in a CUDA heterogeneous platform according to claim 1, characterized in that the performing, according to the duration for the host to send data to the device, the duration for the kernel function to execute, and the duration for the device to send data to the host, cross transfer of the first sub-data block of the several sub-data blocks transferred by the first CUDA stream and the second sub-data block of the several sub-data blocks transferred by the second CUDA stream, to complete processing of the first sub-data block and the second sub-data block, comprises:
comparing the duration for the host to send data to the device with the duration for the device to send data to the host, to obtain the larger duration; if the duration for the kernel function to execute is greater than the larger duration, the first CUDA stream transfers the first sub-data block to the device; while the device executes the kernel function on the first sub-data block, the second CUDA stream transfers the second sub-data block to the device; while the device executes the kernel function on the second sub-data block, the first CUDA stream returns to the host the first sub-data block on which the kernel function has been executed; and finally the second CUDA stream returns to the host the second sub-data block on which the kernel function has been executed.
7. The data processing method in a CUDA heterogeneous platform according to claim 2, characterized by further comprising:
copying the several sub-data blocks from the page-locked memory to a device memory module.
8. A data processing device in a CUDA heterogeneous platform, characterized by comprising:
a sub-data processing module, configured to, for several sub-data blocks of data stored in a host memory module, according to a duration for a host to send data to a device, a duration for a kernel function to execute, and a duration for the device to send data to the host, perform cross transfer of a first sub-data block of the several sub-data blocks transferred by a first CUDA stream and a second sub-data block of the several sub-data blocks transferred by a second CUDA stream, to complete processing of the first sub-data block and the second sub-data block;
a complete-data processing module, configured to, for remaining sub-data blocks of the several sub-data blocks, complete processing of the data in the cross-transfer manner;
wherein the data are divided into the several sub-data blocks.
9. An electronic device, characterized by comprising:
at least one host, at least one memory, at least one device, a communication interface, and a bus; wherein,
the host, the memory, the device, and the communication interface communicate with each other through the bus;
the memory stores program instructions executable by the host, and the host calls the program instructions to execute the method according to any one of claims 1 to 7.
10. A non-transient computer-readable storage medium, characterized in that the non-transient computer-readable storage medium stores computer instructions, and the computer instructions cause a computer to execute the method according to any one of claims 1 to 7.
CN201910017177.5A 2019-01-08 2019-01-08 Data processing method and equipment in CUDA heterogeneous platform Pending CN109739559A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910017177.5A CN109739559A (en) 2019-01-08 2019-01-08 Data processing method and equipment in CUDA heterogeneous platform


Publications (1)

Publication Number Publication Date
CN109739559A true CN109739559A (en) 2019-05-10

Family

ID=66363927

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910017177.5A Pending CN109739559A (en) 2019-01-08 2019-01-08 Data processing method and equipment in CUDA heterogeneous platform

Country Status (1)

Country Link
CN (1) CN109739559A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2192780A1 (en) * 2008-11-28 2010-06-02 Thomson Licensing Method for video decoding supported by Graphics Processing Unit
CN102662641A (en) * 2012-04-16 2012-09-12 浙江工业大学 Parallel acquisition method for seed distribution data based on CUDA
CN103310484A (en) * 2013-07-03 2013-09-18 西安电子科技大学 Computed tomography (CT) image rebuilding accelerating method based on compute unified device architecture (CUDA)
CN106358003A (en) * 2016-08-31 2017-01-25 华中科技大学 Video analysis and accelerating method based on thread level flow line


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TONYSHENGTAN: "[CUDA Basics] 6.1 Overview of Streams and Events", 《HTTPS://WWW.CNBLOGS.COM/FACE2AI/P/9756606.HTML》 *

Similar Documents

Publication Publication Date Title
CN106779060B (en) A kind of calculation method for the depth convolutional neural networks realized suitable for hardware design
EP3612942B1 (en) Queue management for direct memory access
CN106875012B (en) A kind of streamlined acceleration system of the depth convolutional neural networks based on FPGA
CN103049241B (en) A kind of method improving CPU+GPU isomery device calculated performance
Carvalho et al. Heuristics for dynamic task mapping in NoC-based heterogeneous MPSoCs
CN103763173B (en) Data transmission method and calculate node
CN102591702B (en) Virtualization processing method, related device and computer system
CN112346833B (en) Task processing method and processor for privacy computation and heterogeneous processing system
CN104317768A (en) Matrix multiplication accelerating method for CPU+DSP (Central Processing Unit + Digital Signal Processor) heterogeneous system
CN107122244A (en) A kind of diagram data processing system and method based on many GPU
US20200250525A1 (en) Lightweight, highspeed and energy efficient asynchronous and file system-based ai processing interface framework
US20220342712A1 (en) Method for Processing Task, Processor, Device and Readable Storage Medium
CN105573850B (en) Multi-process exchange method, system and server
JP2014206979A (en) Apparatus and method of parallel processing execution
WO2020163315A1 (en) Systems and methods for artificial intelligence with a flexible hardware processing framework
Veeravalli et al. Scheduling divisible loads on heterogeneous linear daisy chain networks with arbitrary processor release times
CN102299843A (en) Network data processing method based on graphic processing unit (GPU) and buffer area, and system thereof
CN102306139A (en) Heterogeneous multi-core digital signal processor for orthogonal frequency division multiplexing (OFDM) wireless communication system
KR101869939B1 (en) Method and apparatus for graphic processing using multi-threading
CN107070709A (en) A kind of NFV implementation methods based on bottom NUMA aware
EP3983950A1 (en) Neural network training in a distributed system
CN104200508B (en) Ray tracing accelerated method based on Intel many-core framework ad-hoc mode
CN109739559A (en) Data processing method and equipment in CUDA heterogeneous platform
CN110222410B (en) Electromagnetic environment simulation method based on Hadoop MapReduce
CN104156332B (en) High-performance parallel computing method based on external PCI-E connection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190510
