CN103810670A - DVH (dose volume histogram) parallel statistical method based on CUDA (compute unified device architecture) stream and shared memory

- Publication number: CN103810670A
- Application number: CN201410033988.1A
- Authority: CN (China)
- Classifications: Image Processing; Image Generation
- Legal status: Granted (the legal status is an assumption and not a legal conclusion; no legal analysis has been performed)
Abstract
The invention discloses a parallel DVH (dose volume histogram) statistical method based on CUDA (compute unified device architecture) streams and shared memory. The method comprises the following steps: first, sampling each organ on the host, transferring the sample-point positions to the device, and processing the dose statistics of each organ in its own stream; second, loading the dose matrix through texture memory; third, performing a texture fetch at the position point assigned to each thread, with the texture filter mode set to linear interpolation, so that the eight voxels of the three-dimensional texture surrounding the point are linearly interpolated according to distance and the interpolated value is returned; fourth, storing the statistical results in shared memory. By opening up N sub dose boxes in shared memory, the bank-conflict problem of shared memory is avoided and the statistical speed is increased.
Description
Technical field
The present invention relates to the field of medical image data processing, and in particular to a parallel DVH statistical method based on CUDA streams and shared memory.
Background art
The dose-volume histogram (Dose Volume Histogram, DVH) is a powerful tool for evaluating the quality of a radiotherapy treatment plan. Inverse treatment-planning systems such as intensity-modulated radiation therapy place high demands on the speed of DVH computation.
A DVH summarizes the three-dimensional dose distribution of a radiotherapy plan in a two-dimensional graph, showing how much of the volume of a region of interest, such as the tumor target or an organ at risk (OAR), receives at least a given dose. Because it intuitively and effectively reflects the dose distribution of the plan and thus its quality, it has become the main basis for assessing radiotherapy treatment plans. To compute a DVH, the dose values at the spatial points of an organ are accumulated according to the positional relationship between the contour lines of the target or critical organ and the three-dimensional dose data. The organ volume is usually sampled, the dose at each sample point is obtained by trilinear interpolation of the dose data, and the result is recorded in discrete dose bins. As the number of patient CT slices grows and intensity-modulated radiation therapy becomes more demanding, the statistics require a large number of trilinear interpolation operations, and the computing speed of a CPU can no longer meet real-time requirements; the present invention therefore uses parallel processing to achieve fast DVH computation.
In the field of parallel processing, the main options are multicore CPU (Multicore Central Processing Unit) parallelism and GPU (Graphics Processing Unit) parallelism. Because the CPU is responsible for transaction processing with strong logical dependencies and is not well suited to dense computation, the NVIDIA GPU, a high-performance computing platform designed for compute-intensive, highly parallel applications, is chosen as the platform for parallel DVH computation. NVIDIA GPUs have clear advantages over CPUs in arithmetic capability and memory bandwidth. Through CUDA (Compute Unified Device Architecture), the GPU can exploit its powerful computing capability under the single-instruction multiple-thread (SIMT) programming model.
There are two difficulties in implementing DVH statistics in parallel on a GPU. First, deciding whether a sample point lies within the contour lines of each organ requires a large number of if-else statements; this is a trivial task on a CPU, but on a GPU such conditional statements may serialize the execution of parallel threads, greatly reducing performance inside a streaming multiprocessor. Second, during accumulation the sample results are written into 100 bins; especially for dose distributions with particular regularities, such as those of heavy-ion radiotherapy, this causes a large number of write conflicts and creates a computational bottleneck.
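The binning and cumulative-curve construction described above can be sketched in host-side C++. This is an illustrative sketch only, not the patent's implementation: the names `doseToBin`, `differentialDVH`, and `cumulativeDVH` are assumptions, and the 100-bin layout follows the count mentioned later in the description.

```cpp
#include <array>
#include <cassert>
#include <vector>

// Map a dose value to one of NBINS bins covering [0, maxDose].
constexpr int NBINS = 100;

int doseToBin(double dose, double maxDose) {
    int b = static_cast<int>(dose / maxDose * NBINS);
    return b < 0 ? 0 : (b >= NBINS ? NBINS - 1 : b);
}

// Differential histogram: number of sample points falling in each dose bin.
std::array<long, NBINS> differentialDVH(const std::vector<double>& doses,
                                        double maxDose) {
    std::array<long, NBINS> bins{};
    for (double d : doses) bins[doseToBin(d, maxDose)]++;
    return bins;
}

// Cumulative DVH: fraction of the volume receiving at least each bin's dose,
// obtained by summing the differential histogram from the high-dose end down.
std::array<double, NBINS> cumulativeDVH(const std::array<long, NBINS>& bins,
                                        long total) {
    std::array<double, NBINS> cum{};
    long running = 0;
    for (int i = NBINS - 1; i >= 0; --i) {
        running += bins[i];
        cum[i] = static_cast<double>(running) / total;
    }
    return cum;
}
```

Each sample point represents a fixed volume, so counting points per bin and normalizing by the total yields the volume fractions the DVH curve plots.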
Summary of the invention
The object of the present invention is to address the above problems by proposing a parallel DVH statistical method based on CUDA streams and shared memory, so as to improve computing speed and avoid memory write conflicts.
To achieve the above object, the technical solution adopted by the present invention is:
A parallel DVH statistical method based on CUDA streams and shared memory, comprising the following steps:
Step 1: sample the organs on the host side and transfer the sample-point positions to the device.
When computing the DVH it is necessary to decide whether each sample point lies within the contour lines of the organ; the CPU performs this test, obtains all sample-point positions of each organ, and stores them in arrays, each position being represented by a three-dimensional vector pos = (x, y, z);
Using the CUDA stream mechanism, the resulting position arrays are transferred into the GPU for computation; different streams are mutually independent during execution, and each stream processes the position array of one organ;
Step 2: load the dose matrix through texture memory:
The dose matrix data are copied from the host side to the device side and stored in texture memory;
Step 3: for the position point assigned to each thread, perform a texture fetch; the texture filter mode is set to linear interpolation, so the eight voxels of the three-dimensional texture surrounding the point are linearly interpolated according to distance and the interpolated value is returned;
Step 4: store the statistical results in shared memory; by opening up N sub dose boxes in shared memory, the bank-conflict problem of shared memory is avoided and the statistical speed is increased.
According to a preferred embodiment of the invention, the use of shared memory to store the statistical results in step 4 mainly consists of opening up N sub dose boxes in shared memory so that corresponding bins of different sub dose boxes do not lie in the same bank, and adjacent threads within the same warp each update their own sub dose box.
According to a preferred embodiment of the invention, the value of the above N lies within a certain range (the formula is omitted in the source text).
According to a preferred embodiment of the invention, N = 8.
The technical solution of the present invention has the following beneficial effects:
It adopts the heterogeneous programming model of CUDA and the CUDA stream mechanism to transfer the sample points of each organ into the GPU separately for computation, realizing parallelism at two levels, namely parallelism between organs and parallelism between interpolation operations, thereby improving computing speed. Shared memory is used to store the data, avoiding write conflicts in memory.
The technical solution of the present invention is described in further detail below with reference to the drawings and embodiments.
Brief description of the drawings
Fig. 1 is a flowchart of the parallel DVH statistical method based on CUDA streams and shared memory according to an embodiment of the present invention;
Fig. 2a and Fig. 2b show the relationship between an organ and the dose matrix;
Fig. 3 is a flowchart of single-threaded DVH statistics;
Fig. 4 is a schematic diagram of CUDA stream operation;
Fig. 5 is a schematic diagram of multithreaded voting updates during DVH statistics;
Fig. 6 is a schematic diagram of bank conflicts in shared memory;
Fig. 7a is a schematic diagram of the 1% to 100% dose distribution;
Fig. 7b is a schematic diagram of the 90% to 100% dose distribution;
Fig. 8a is the DVH generated by the CPU;
Fig. 8b is the DVH generated by the GPU.
Embodiment
The preferred embodiments of the present invention are described below with reference to the drawings. It should be understood that the preferred embodiments described here are only used to illustrate and explain the present invention and are not intended to limit it.
As shown in Fig. 1, a parallel DVH statistical method based on CUDA streams and shared memory comprises the following steps:
Step 1: sample the organs on the host side and transfer the sample-point positions to the device.
When computing the DVH it is necessary to decide whether each sample point lies within the contour lines of the organ; the CPU performs this test, obtains all sample-point positions of each organ, and stores them in arrays, each position being represented by a three-dimensional vector pos = (x, y, z);
Using the CUDA stream mechanism, the resulting position arrays are transferred into the GPU for computation; different streams are mutually independent during execution, and each stream processes the position array of one organ;
Step 2: load the dose matrix through texture memory:
The dose matrix data are copied from the host side to the device side and stored in texture memory;
Step 3: for the position point assigned to each thread, perform a texture fetch; the texture filter mode is set to linear interpolation, so the eight voxels of the three-dimensional texture surrounding the point are linearly interpolated according to distance and the interpolated value is returned;
Step 4: store the statistical results in shared memory; by opening up N sub dose boxes in shared memory, the bank-conflict problem of shared memory is avoided and the statistical speed is increased.
In step 4, the use of shared memory to store the statistical results mainly consists of opening up N sub dose boxes in shared memory so that corresponding bins of different sub dose boxes do not lie in the same bank, and adjacent threads within the same warp each update their own sub dose box. The value of N lies within a certain range (the formula is omitted in the source text). Because the sub dose boxes must afterwards be summed together, and shared memory is limited, a larger N is not always better; experiments show the best result at N = 8, and for N > 8 the statistics time becomes longer instead.
Concrete treatment step is as follows:
1.1 for isomery programming model
CUDA is not simple GPU language, and it has coordinated the concurrent of CPU and two kinds of processing units of GPU, CUDA framework towards be a Heterogeneous Computing network being formed by CPU and GPU.CPU has the complex logic that branch prediction, program stack and loop optimization etc. are taked for control, and the fairly simple structure of GPU makes it be applicable to the statement of order, single, few circulation, few redirect.CPU end moves the program of serial and controls the startup of GPU (new kepler GK110 framework support DYNAMIC PARALLELISM mechanism, can not return to the data of host CPU and dynamic creation new thread by application, be that any kernel can start another kernel), GPU end moves task that can be parallel, in general, GPU needs control and the coordination of CPU, so GPU is commonly called equipment (device), CPU is called main frame (host).
1.2 Threads
In CUDA, NVIDIA organizes threads with a hierarchical programming model. The user creates a number of threads as needed and determines the mapping between threads and data. The computing threads are divided into thread blocks (blocks); the hardware unit corresponding to a thread block is a streaming multiprocessor (SM), also called a multiprocessor. An SM contains several scalar processors, also called CUDA cores, each of which executes a concrete thread. All blocks together form a grid; one grid corresponds to one graphics card.
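The block-and-grid mapping above can be sketched by the standard index arithmetic that assigns one sample point to each thread. This is a CPU-side illustration of the arithmetic only; `blockIdx`, `blockDim`, and `threadIdx` are the CUDA built-ins, rendered here as plain parameters, and the function names are illustrative.

```cpp
#include <cassert>

// Global index of a thread: its block's offset plus its position in the block.
// In a kernel this would be blockIdx.x * blockDim.x + threadIdx.x.
int globalThreadIndex(int blockIdx, int blockDim, int threadIdx) {
    return blockIdx * blockDim + threadIdx;
}

// Number of blocks needed to cover n sample points with blockDim threads each
// (ceiling division, so a partially filled last block is still launched).
int blocksFor(int n, int blockDim) {
    return (n + blockDim - 1) / blockDim;
}
```

Threads whose global index exceeds the sample count would simply return without doing work, a common guard in such kernels.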
1.3 Memory
CUDA memory is organized as a hierarchy: global memory, constant memory, and texture memory, accessible by all threads in a program; shared memory, accessible by all threads in a block; and registers, accessible by a single thread.
(1) Registers: store automatic variables, that is, variables declared without a qualifier in global and device functions. Registers are allocated independently on each scalar processor and provide private storage for each thread; the capacity is small and the access speed is fast.
(2) Shared memory: located on each multiprocessor and belonging to on-chip memory; its access speed is similar to that of registers, and reading or writing 4 B takes about two clock cycles. Threads on the same multiprocessor can access the same piece of shared memory.
(3) Global memory: belongs to off-chip memory; one direct access takes 400 to 800 clock cycles. Multiprocessors of compute capability 3.x have an on-chip cache and are much faster than those of compute capability 1.x.
(4) Texture memory: belongs to off-chip memory. Texture is a concept from graphics; on earlier graphics cards texture memory was the core memory for image display, so it has strong hardware support: it has an on-chip cache; addressing calculations are performed by dedicated hardware units and do not consume kernel computing time; accesses need not follow a sequential pattern; adjacent data exhibit larger bandwidth; and hardware-accelerated interpolation is provided, although the interpolation weights have only 9-bit precision.
2 DVH statistics
2.1 Existing DVH statistical algorithm
The inputs for DVH statistics are the dose data and the contour data of the organs; their relationship is shown in Fig. 2a and Fig. 2b.
In Fig. 2a the box represents the three-dimensional dose matrix and the solid part intuitively represents the extent of an organ; in Fig. 2b the curves represent the contour lines of the organ, each contour line is assumed to have a certain thickness, and during statistics each sample point represents a certain volume. In single-threaded DVH statistics the organs are processed serially in order; the general flow is shown in Fig. 3. After obtaining the dose data and contour data, all relevant organs and the tumor target must be sampled. The organs are processed in order; within each organ, sampling and interpolation are performed according to the relationship between its contour lines and the dose data, and finally the statistics of each organ are output as an array of size 100.
3. DVH parallel computation based on CUDA
For the parallel implementation of the DVH, the coordinates of all sample points within the organ volumes are first transferred into GPU global memory and the dose matrix is placed in texture memory; interpolation is then performed according to the sample coordinates, the statistical results are stored in shared memory, and finally the results are summed and transferred back to the host. Specifically:
Step 1: sample the organs on the host side and transfer the sample-point positions to the device.
When computing the DVH it is necessary to decide whether each sample point lies within the contour lines of the organ; this task requires a large number of conditional statements and is suited to the CPU side. The CPU obtains all sample-point positions of each organ and stores them in arrays, each position represented by a three-dimensional vector pos = (x, y, z).
Directly copying the resulting position arrays to the device with the cudaMemcpy() function is relatively inefficient, because the kernel can only begin processing after all data have been copied to the device. This is where the CUDA stream mechanism plays an important role: it allows GPU computation to overlap with CPU/GPU memory transfers.
As shown in Fig. 4, a CUDA program can generally be divided into three phases: host-to-device data transfer, kernel execution, and device-to-host data transfer. In the figure the kernel execution time and the data transfer time of each stream are assumed equal. Different phases of different streams may overlap on the time axis; the overlapping regions in the figure represent the sections executed in parallel between streams.
Because different streams are mutually independent during execution, the present invention processes the statistics of each organ in its own stream during DVH computation; the core CUDA code is excerpted as follows:
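A back-of-envelope model makes the overlap gain of Fig. 4 concrete. Assume, as the figure does, that each stream's three phases take one equal time slot, and assume the copy engines and the compute engine can each work on a different stream concurrently; the pipeline then finishes after S + 2 slots instead of 3S. These assumptions are mine, for illustration.

```cpp
#include <cassert>

// Time in phase slots if the S streams run strictly one after another,
// three non-overlapping phases each.
int serialSlots(int streams) { return 3 * streams; }

// Time in phase slots with a filled pipeline: the first stream needs all
// three slots, and each further stream adds only one slot of latency.
int pipelinedSlots(int streams) { return streams + 2; }
```

With four organs (S = 4) the model predicts 6 slots instead of 12, and the advantage grows with the number of streams.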
// Create one stream per organ and target volume
cudaStream_t stream[ORGAN_COUNT];
for (int i = 0; i < ORGAN_COUNT; i++)
    cudaStreamCreate(&stream[i]);
// Use cudaMemcpyAsync() for asynchronous transfer in each stream
// (d_pos, h_pos and byteCount are illustrative names)
for (int i = 0; i < ORGAN_COUNT; i++)
    cudaMemcpyAsync(d_pos[i], h_pos[i], byteCount[i],
                    cudaMemcpyHostToDevice, stream[i]);
Step 2: load the dose matrix through texture memory.
The dose matrix data must be copied from the host side to the device side, and there are two choices: they can be placed in global memory or in texture memory. Because the three-dimensional dose matrix is subsequently interpolated frequently according to the sample positions, and each access usually touches eight adjacent points of the matrix, texture memory has a clear advantage over global memory, as described above; texture memory is therefore chosen to store the dose matrix data.
Step 3: implement the kernel.
In the kernel, each thread performs a texture fetch at the position point assigned to it. Because the texture filter mode has been set to linear interpolation, the eight voxels of the three-dimensional texture surrounding the point are linearly interpolated according to distance, and the interpolated value is returned.
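A CPU reference for the trilinear interpolation the texture unit performs in hardware can be sketched as follows. It is an illustration of the standard formula, not the patent's device code; the flat x-fastest storage order and the names are assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Trilinear interpolation of a grid stored x-fastest in a flat vector:
// the eight voxels surrounding (x, y, z) are blended by the fractional
// distances fx, fy, fz along each axis.
double trilinear(const std::vector<double>& v, int nx, int ny, int nz,
                 double x, double y, double z) {
    int x0 = (int)std::floor(x), y0 = (int)std::floor(y), z0 = (int)std::floor(z);
    double fx = x - x0, fy = y - y0, fz = z - z0;
    auto at = [&](int i, int j, int k) {
        return v[(std::size_t)k * ny * nx + (std::size_t)j * nx + i];
    };
    // Interpolate along x, then y, then z.
    double c00 = at(x0, y0,     z0    ) * (1 - fx) + at(x0 + 1, y0,     z0    ) * fx;
    double c10 = at(x0, y0 + 1, z0    ) * (1 - fx) + at(x0 + 1, y0 + 1, z0    ) * fx;
    double c01 = at(x0, y0,     z0 + 1) * (1 - fx) + at(x0 + 1, y0,     z0 + 1) * fx;
    double c11 = at(x0, y0 + 1, z0 + 1) * (1 - fx) + at(x0 + 1, y0 + 1, z0 + 1) * fx;
    double c0 = c00 * (1 - fy) + c10 * fy;
    double c1 = c01 * (1 - fy) + c11 * fy;
    return c0 * (1 - fz) + c1 * fz;
}
```

On the GPU this whole computation is replaced by a single hardware-filtered texture fetch, which is the source of the speedup in step 3.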
Step 4: store the statistical results in shared memory.
If global memory were used to store the voting updates of the threads, only an array of size 100 would need to be opened in global memory, and each thread's vote would directly update the corresponding array element. But if several threads vote for the same bin simultaneously, a write conflict occurs. Suppose N parallel threads accumulate one organ and then write their results into the 100 bins; the case N = 7 is shown in Fig. 5.
CUDA cannot guarantee correctness when several threads write the same address simultaneously; reliable access is only possible through the atomic operations provided by CUDA, whose principle is to lock the device bus, which inevitably slows the statistics severely.
Combining the characteristics of the dose distribution in radiotherapy plans, the present technical solution proposes a method that greatly reduces the delay caused by bank conflicts. Concretely, N sub dose boxes are opened in shared memory so that corresponding bins of different sub dose boxes do not lie in the same bank, and adjacent threads within the same warp each update their own sub dose box. With N = 4 there are 4 sub dose boxes in shared memory; if the voting data of four adjacent threads of a warp are 98, 97, 98, 97, the algorithm places these four numbers into four different banks, avoiding access conflicts, as shown in Fig. 6.
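One way to realize this layout, sketched here as a CPU-side check rather than the patent's kernel, is to interleave the sub dose boxes so that bin b of sub-box s lands at 4-byte word b * N + s. The 32-bank mapping (bank = word mod 32) is the usual shared-memory layout on compute capability 2.x/3.x hardware and is an assumption here.

```cpp
#include <cassert>
#include <set>
#include <vector>

constexpr int BANKS = 32;  // assumed 32 banks of 4-byte words

// Shared-memory word holding bin `bin` of sub dose box `subBox`, with the
// N sub-boxes interleaved bin-by-bin.
int wordOf(int bin, int subBox, int n) { return bin * n + subBox; }
int bankOf(int word)                   { return word % BANKS; }

// True if adjacent threads (thread t uses sub-box t % n) updating the
// given bins would all hit pairwise distinct banks.
bool conflictFree(const std::vector<int>& bins, int n) {
    std::set<int> seen;
    for (std::size_t t = 0; t < bins.size(); ++t)
        if (!seen.insert(bankOf(wordOf(bins[t], (int)t % n, n))).second)
            return false;
    return true;
}
```

For the 98, 97, 98, 97 example with N = 4 the four updates land in banks 8, 5, 10, 7, all distinct, whereas a single histogram (N = 1) would put the two 98s on the same word.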
The present invention is illustrated with the statistics of heavy-ion radiotherapy dose data as an example.
Fig. 7a and Fig. 7b are schematic diagrams of the layered heavy-ion irradiation dose distribution.
Because heavy ions exhibit a Bragg peak, the high-dose region can be focused on the tumor target, as shown in Fig. 7a. During radiotherapy a layered irradiation method is adopted; visualizing the high-dose region yields Fig. 7b, where the data show a corrugated, nearly degenerate distribution that causes severe bank conflicts.
The experiments are based on C++ and compiled under the VS2010 environment. The experimental platform uses an Intel Celeron E3300 @ 2.5 GHz CPU, an NVIDIA GeForce GTX 650 Ti GPU, Microsoft Windows XP Professional Service Pack 3, and 2 GB of memory.
The experiments mainly compare the traditional CPU-based statistics with the acceleration obtained by the present CUDA-stream-based GPU method for DVH statistics. The experimental data comprise 4 organs (skin, left lung, right lung, and the tumor target) with 262 contour lines in total. The dose data form a 116*60*84 three-dimensional matrix with 0.4 cm spacing along the x, y, and z axes. The CPU statistics were first run 10 times, giving an average running time of 436.56 ms, as shown in Table 1. The GPU version based on CUDA streams, texture memory, and shared memory was then run, with and without the bank-conflict-avoidance improvement, 10 times each; the speedups relative to the traditional CPU method are given in Table 2. As shown in Fig. 8a and Fig. 8b, the DVHs produced by the two methods are identical.
Table 1: CPU DVH statistics times (table data not reproduced in the source text).
Table 2: acceleration comparison for CUDA-stream-based DVH statistics (table data not reproduced in the source text).
Table 2 shows that using the CUDA stream mechanism, texture memory, and shared memory yields a good speedup, and the bank-conflict-avoidance algorithm proposed by the present invention provides a further small improvement on that basis.
Finally, it should be noted that the above are only the preferred embodiments of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions recorded in the foregoing embodiments or replace some of their technical features by equivalents. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included within its scope of protection.
Claims (4)
1. A parallel DVH statistical method based on CUDA streams and shared memory, characterized by comprising the following steps:
Step 1: sample the organs on the host side and transfer the sample-point positions to the device;
when computing the DVH it is necessary to decide whether each sample point lies within the contour lines of the organ; the CPU performs this test, obtains all sample-point positions of each organ, and stores them in arrays, each position being represented by a three-dimensional vector pos = (x, y, z);
using the CUDA stream mechanism, the resulting position arrays are transferred into the GPU for computation; different streams are mutually independent during execution, and each stream processes the position array of one organ;
Step 2: load the dose matrix through texture memory:
the dose matrix data are copied from the host side to the device side and stored in texture memory;
Step 3: for the position point assigned to each thread, perform a texture fetch; the texture filter mode is set to linear interpolation, so the eight voxels of the three-dimensional texture surrounding the point are linearly interpolated according to distance and the interpolated value is returned;
Step 4: store the statistical results in shared memory; by opening up N sub dose boxes in shared memory, the bank-conflict problem of shared memory is avoided and the statistical speed is increased.
2. The parallel DVH statistical method based on CUDA streams and shared memory according to claim 1, characterized in that the use of shared memory to store the statistical results in step 4 mainly consists of opening up N sub dose boxes in shared memory so that corresponding bins of different sub dose boxes do not lie in the same bank, and adjacent threads within the same warp each update their own sub dose box.
4. The parallel DVH statistical method based on CUDA streams and shared memory according to claim 3, characterized in that N = 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410033988.1A CN103810670B (en) | 2014-01-24 | 2014-01-24 | DVH (dose volume histogram) parallel statistical method based on CUDA (compute unified device architecture) stream and shared memory |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103810670A true CN103810670A (en) | 2014-05-21 |
CN103810670B CN103810670B (en) | 2017-01-18 |
Legal Events
- Publication (codes C06, PB01)
- Entry into force of request for substantive examination (codes C10, SE01)
- Grant of patent or utility model (codes C14, GR01)
- Termination of patent right due to non-payment of annual fee (code CF01); granted publication date: 2017-01-18; termination date: 2018-01-24