CN107194864A

CN107194864A - Method and device for accelerating CT three-dimensional reconstruction based on a heterogeneous platform

Info

Publication number: CN107194864A
Application number: CN201710270520.8A
Authority: CN
Inventors: 闫镔; 李磊; 王林元; 孙艳敏; 路万里; 蔡爱龙; 张瀚铭; 张文昆
Original assignee: PLA Information Engineering University
Current assignee: PLA Information Engineering University
Priority date: 2017-04-24
Filing date: 2017-04-24
Publication date: 2017-09-22

Abstract

The present invention relates to a method and device for accelerating CT three-dimensional reconstruction based on a heterogeneous platform. The heterogeneous platform comprises a host and heterogeneous OpenCL computing devices. The acceleration method comprises: decomposing the FDK reconstruction algorithm into computation grains and analyzing the parallel computation flow of each grain; and accelerating and optimizing each grain on the host and the heterogeneous OpenCL computing devices of the platform. The invention thoroughly exploits the parallelism of CT reconstruction algorithms. It adopts a GPU+FPGA heterogeneous computing model, builds the computing system from computing units with different instruction sets and architectures, matches the algorithm to the heterogeneous architecture as closely as possible, and makes full use of the performance of the different acceleration components. It also designs storage and communication schemes suited to efficient execution of the reconstruction algorithm. The system supports PCI-E/Ethernet interconnection, allows multiple processing boards to be interconnected for efficient multiprocessor parallel processing, implements a synchronous or asynchronous coprocessor system, and increases reconstruction speed while keeping the loss of precision as small as possible.

Description

Method and device for accelerating CT three-dimensional reconstruction based on a heterogeneous platform

Technical field

The invention belongs to the technical field of X-ray CT, and more particularly relates to a method and device for accelerating CT three-dimensional reconstruction based on a heterogeneous platform.

Background art

X-ray computed tomography (CT) is a technique that recovers the attenuation distribution of an object from its X-ray projections, and it draws on several disciplines, including nuclear physics, mathematics, computer science and precision instrumentation. Because CT can obtain high-precision three-dimensional structural information about the interior of an object in a non-contact, non-destructive manner, it has been widely used in non-destructive testing, medical diagnosis, material analysis and other fields since Hounsfield successfully developed the first CT scanner.

In practical applications, high-resolution three-dimensional cone-beam CT reconstruction requires very large computing and storage resources. As the reconstruction scale grows, the storage demand and the amount of computation increase sharply, and in many cases they can no longer meet practical requirements. Take the back projection common to reconstruction algorithms as an example: if each dimension of the three-dimensional image to be reconstructed has size N, the computational complexity of the corresponding back projection reaches O(N^4). Reconstructing a three-dimensional image with a resolution of 1024^3 requires about 1.0995 × 10^12 loop iterations. Completing this much computation on an ordinary PC is very time-consuming and cannot meet the requirements of practical applications. Accelerating the cone-beam CT reconstruction process is therefore an urgent problem for engineers. Designing an acceleration platform and acceleration strategy for CT reconstruction algorithms has important practical significance, and it is a difficult problem that industrial CT urgently needs to solve in practice.
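The iteration count quoted above can be checked with a few lines of arithmetic. The sketch below assumes, as the O(N^4) estimate implies, that the number of projections scales with N; the function name and the optional `num_projections` parameter are illustrative and do not come from the patent.

```python
def backprojection_iterations(n, num_projections=None):
    """Inner-loop iterations: one per voxel (n**3) per projection."""
    if num_projections is None:
        num_projections = n  # assumption: projection count scales with n
    return n ** 3 * num_projections


iterations = backprojection_iterations(1024)
print(iterations)            # 1099511627776
print(f"{iterations:.4e}")   # 1.0995e+12
```

With a fixed projection count, for example `backprojection_iterations(1024, 360)`, the count drops accordingly, but it still grows with the cube of the volume size.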

Summary of the invention

In view of the deficiencies of the prior art, the present invention provides a method and device for accelerating CT three-dimensional reconstruction based on a heterogeneous platform. Targeting the characteristics of CT reconstruction algorithms, it combines the capabilities of general-purpose acceleration devices such as FPGAs and GPUs and is implemented on a heterogeneous acceleration platform. It increases reconstruction speed while keeping the loss of precision as small as possible, with stable performance and a good acceleration effect.

According to the design provided by the present invention, in a method for accelerating CT three-dimensional reconstruction based on a heterogeneous platform, the heterogeneous platform comprises a host and heterogeneous OpenCL computing devices, and the acceleration method comprises: decomposing the FDK reconstruction algorithm into computation grains and analyzing the parallel computation flow of each grain; and accelerating and optimizing each grain on the host and the heterogeneous OpenCL computing devices of the platform.

In the above, the host is the CPU that runs the main program, and the OpenCL computing devices comprise a heterogeneous GPU and FPGA that run the kernel programs. The CPU, GPU and FPGA communicate over the PCI-E bus, and the main program manages the execution of the kernel programs by defining a context.

Preferably, decomposing the FDK reconstruction algorithm into computation grains comprises decomposing it, according to the content of the FDK algorithm, into: a projection-weighting grain for weighting the projection data; a filtering grain for filtering the weighted projection data; a back-projection grain for back-projecting the filtered projection data onto the reconstructed object; and a reduction grain for reducing the back-projection results.

Preferably, the FDK reconstruction formula (not reproduced in this text) is split into its integrals and discretized, which divides it into the following grains:

The projection-weighting grain, whose expression (not reproduced in this text) defines p'(θ, u, v), the data obtained by weighting the projection data at rotation index θ with the weighting coefficient;

The filtering grain, whose expression (not reproduced in this text) defines d_f(θ, u, v), the filtered data, where h(u) is the unit impulse response of the filter operator and [-u_m, u_m] spans the 2m samples the detector acquires in each row;

The back-projection grain, whose expression (not reproduced in this text) defines f(x, y, z, θ), the contribution of the projection at rotation index θ to the reconstructed point f(x, y, z);

The reduction grain, whose expression (not reproduced in this text) reduces these contributions, where φ_max is the number of discrete projection indices acquired while the reconstructed object rotates through one full turn.
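The expressions for the four grains are not reproduced in this text. For orientation only, one common textbook statement of the FDK algorithm for a circular cone-beam scan can be written with the symbols defined above. The source-to-rotation-axis distance D, the rotated coordinates t and s, the angular step Δθ, the index k and the choice of a virtual detector passing through the rotation axis are assumptions introduced here; the patent's own notation and sign conventions may differ.

```latex
\begin{aligned}
p'(\theta,u,v) &= \frac{D}{\sqrt{D^{2}+u^{2}+v^{2}}}\;p(\theta,u,v)
  && \text{(projection weighting)}\\[4pt]
d_f(\theta,u,v) &= \int_{-u_m}^{u_m} p'(\theta,u',v)\,h(u-u')\,\mathrm{d}u'
  && \text{(row-wise filtering)}\\[4pt]
f(x,y,z,\theta) &= \frac{D^{2}}{(D-s)^{2}}\;
  d_f\!\left(\theta,\ \frac{D\,t}{D-s},\ \frac{D\,z}{D-s}\right)
  && \text{(back projection)}\\[4pt]
&\qquad t = x\cos\theta + y\sin\theta,\quad s = -x\sin\theta + y\cos\theta\\[4pt]
f(x,y,z) &\approx \frac{\Delta\theta}{2}\sum_{k=1}^{\varphi_{\max}} f(x,y,z,\theta_k),
  \quad \Delta\theta = \frac{2\pi}{\varphi_{\max}}
  && \text{(reduction)}
\end{aligned}
```

In this form each of the four lines corresponds to one grain: the first acts on each detector pixel independently, the second on each detector row, the third on each voxel, and the fourth accumulates the per-angle contributions.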

In the above, accelerating and optimizing each grain comprises: processing the projection-weighting grain in parallel on the FPGA and transferring the data asynchronously to the GPU, with the filtering grain processed concurrently during the transfer; and, exploiting the data parallelism across voxels during back projection, computing the back-projection grain on the GPU with multiple parallel threads, voxel by voxel.

Preferably, a block-wise reconstruction strategy is adopted, based on the projection correspondence in the FDK reconstruction algorithm between each layer of the reconstruction region along the rotation axis and each row of the detector projection data along the vertical axis: the region to be reconstructed is divided into several blocks along the rotation axis, and when one block is reconstructed, the corresponding projection data are fetched from external memory for the reconstruction.

Preferably, processing the projection-weighting grain in parallel on the FPGA comprises: dividing the global memory into 2 banks and balancing accesses to the random access memory through load distribution; and storing in constant memory the intermediate variables that would otherwise be computed repeatedly.

Preferably, computing the back-projection grain on the GPU with multiple parallel threads, voxel by voxel, comprises: adopting a voxel-driven approach, in which GPU tasks are divided according to the output reconstructed volume data; separating and merging the variables in the computation that are independent of the voxel, computing them before back projection and storing them in the GPU constant memory, and reading them directly from constant memory during back projection; and optimizing the number of projections back-projected per invocation of the kernel program.

Further, separating and merging the variables in the computation that are independent of the voxel comprises the following. Any point (x, y, z) in the volume data projects to the point (u, v) on the detector when the projection angle is θ, and the projection point (u, v) is computed as:

u = (x - vCenter) × cos(θ) + (y - vCenter) × sin(θ) + pCenter

dis = (u - pCenter) × a

v = (z - (s₀ + θ × h) - γ × h / a) × w + γ × h / a + pCenter

After the variables are separated and merged, this becomes:

u = x × A[0] + y × A[1] + A[2] + pCenter

dis = (u - pCenter) × a

v = (z - A[4] - γ × A[5]) × w + γ × A[5] + pCenter

where vCenter is the center of the volume data, pCenter is the center of the projection data, a is the voxel size, θ is the projection angle, r is the rotation radius of the ray source, h is the pitch, and γ is the angle between the projection of the beam on the central plane and the central beam.

A device for accelerating CT three-dimensional reconstruction based on a heterogeneous platform, wherein the heterogeneous platform uses PCI-Express as the interconnect bus for transmitting data signals and control signals, and uses Ethernet as an additional bus for networked control and data transfer with the outside. The framework of the heterogeneous platform comprises: an application layer that provides the functional modules; a component layer that provides the application layer with the basic component library of each functional module and the interface specifications the reconstruction algorithm needs on the different processors; and a support layer that provides services to the component layer and the application layer. The support layer comprises a CPU that executes the main program and multiple OpenCL computing devices that execute the kernel programs; the CPU is communicatively connected to the OpenCL computing devices, and the OpenCL computing devices comprise a GPU and an FPGA. Based on this heterogeneous platform framework, the CT three-dimensional reconstruction acceleration device comprises:

A grain decomposition module, configured to split and discretize the algorithm into computation grains according to the content of the FDK reconstruction algorithm, decomposing it into a projection-weighting grain for correcting the projection data, a filtering grain for filtering the weighted projection data, a back-projection grain for back-projecting the filtered projection data onto the reconstructed object, and a reduction grain for reducing the back-projection results;

An acceleration processing module, configured to transfer the projection data over the additional bus into the computing node of the heterogeneous platform, coordinate processing over the interconnect bus according to user settings and a reconstruction performance evaluation, accelerate the corresponding grains on the GPU and FPGA acceleration components respectively, and store the reconstructed data in real time and feed them back to the user.

Beneficial effects of the present invention:

The invention thoroughly exploits the parallelism of CT reconstruction algorithms. It adopts a GPU+FPGA heterogeneous computing model, builds the computing system from computing units with different instruction sets and architectures, matches the algorithm to the heterogeneous architecture as closely as possible, and makes full use of the performance of the different acceleration components. It also provides storage and communication schemes for efficient execution of the reconstruction algorithm, supports PCI-E/Ethernet interconnection, allows multiple processing boards to be interconnected for efficient multiprocessor parallel processing, and implements a synchronous or asynchronous coprocessor system. It increases reconstruction speed while keeping the loss of precision as small as possible and meets user requirements.

Brief description of the drawings:

Fig. 1 is a schematic flow chart of the method of the present invention;

Fig. 2 is a schematic diagram of the geometry of the FDK algorithm;

Fig. 3 is the overall computation-grain flow chart of the FDK algorithm;

Fig. 4 is the heterogeneous platform framework model;

Fig. 5 is the software block diagram of the heterogeneous platform;

Fig. 6 is a schematic diagram of the device of the present invention;

Fig. 7 is a three-dimensional cutaway view of the reconstruction result for an oil core sample.

Detailed description of the embodiments:

To make the objectives, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here merely illustrate the present invention and do not limit it.

As shown in Fig. 1, one embodiment provides a method for accelerating CT three-dimensional reconstruction based on a heterogeneous platform. In this embodiment, the heterogeneous platform comprises a host and heterogeneous OpenCL computing devices. The host is the CPU that runs the main program, and the OpenCL computing devices comprise a heterogeneous GPU and FPGA that run the kernel programs. The CPU, GPU and FPGA communicate over the PCI-E bus, and the main program manages the execution of the kernel programs by defining a context.

The FPGA uses a logic cell array (LCA), which internally comprises three parts: configurable logic blocks (CLB), input/output blocks (IOB) and the interconnect. A field-programmable gate array (FPGA) is a programmable device whose structure differs from that of conventional logic circuits and gate arrays (such as PAL, GAL and CPLD devices). An FPGA implements combinational logic with small look-up tables (16×1 RAM). Each look-up table is connected to the input of a D flip-flop, and the flip-flop in turn drives other logic circuits or the I/O, forming basic logic cells that can implement both combinational and sequential logic functions. These cells are connected to one another or to the I/O modules by metal wiring. The FPGA logic is implemented by loading programming data into internal static memory cells. The values stored in the memory cells determine the logic function of each logic cell and the connections between modules or between modules and the I/O, and thus ultimately the function the FPGA implements. An FPGA can be reprogrammed an unlimited number of times.

The GPU architecture consists of an array of highly threaded streaming multiprocessors (SM). Two SMs form one building block, and the number of SMs per building block may differ from one generation of CUDA-based GPUs to the next. Each SM in turn contains multiple streaming processors (SP), which share control logic and an instruction cache. Each GPU carries several gigabytes (GB) of graphics double data rate (GDDR) DRAM, referred to as global memory. The GDDR DRAM on the GPU is entirely different from the system DRAM on the motherboard of a CPU system; it serves mainly as the frame buffer memory for graphics processing. In graphics applications it holds video images and the texture information used for 3D rendering, while for computing it can be used as high-bandwidth off-chip memory. Although its latency is longer than that of typical system memory, massively parallel applications usually make up for the latency with the high bandwidth.

Transferring large volumes of data is both the difficulty and the focus of reconstruction acceleration. PCI-E is a high-performance, general-purpose I/O interconnect bus whose per-lane transfer rate can reach 2.5 Gbit/s. The present invention therefore uses the PCI-E interface as the main data transfer bus. Commercial PCI-E drivers already exist for the GPU communication interface, so the design of high-speed data transfer between the FPGA and the PCI-E bus is the key to the platform design. There are generally 2 schemes for implementing PCI-E transfers on an FPGA: one uses a dedicated bridge chip, and the other uses a specific FPGA that can implement the PCI-E physical interface itself. Because current bridge chips support relatively few PCI-E lanes and relatively low transfer rates, this project adopts the latter scheme. Data transfer can be divided into 2 modes: normal mode and DMA (Direct Memory Access) mode. Normal mode mainly carries command communication between the host and the device. DMA mode is mainly for transferring large blocks of data: the transfer does not pass through the CPU, the data are sent directly from the device to memory, the transfer is fast, and the PCI-E bandwidth can be fully used. The FPGA data are therefore transferred in DMA mode.

The implementation process of CT image reconstruction in this embodiment is described in detail below:

S1. Decompose the FDK reconstruction algorithm into computation grains and analyze the parallel computation flow of each grain:

CT image reconstruction algorithms fall mainly into two classes: analytic algorithms and iterative algorithms. Compared with iterative algorithms, analytic algorithms have a simple mathematical form, reconstruct quickly and are easy to implement, and they are the mainstream algorithms in practical CT systems. Among the various analytic reconstruction algorithms based on filtered back projection, the FDK algorithm is computationally efficient and easy to implement, and it achieves good reconstruction quality when the cone angle is small, so it is the most widely used in practice. The geometry of the FDK algorithm is shown in Fig. 2, where S denotes the ray source, f(x, y, z) denotes a point of the reconstructed object, that is, a voxel, and P(θ, u, v) denotes the projection data of the point f(x, y, z) collected by the detector when the object rotation index is θ. According to its principle, the FDK algorithm mainly comprises 3 steps: projection weighting, one-dimensional filtering and back projection. Analysis of the algorithm shows that the weighting of each projection point is mutually independent; the projection data at each index are filtered row by row with an identical filtering process; and the back projection of a voxel at each index depends only on the coordinates to which that point maps in the detector coordinate system, so it is also independent. That is, all three steps of the FDK algorithm have good parallelism and are well suited to acceleration by parallel processing.

In one embodiment, as shown in Fig. 3, the algorithm is decomposed according to the content of the FDK algorithm into: a projection-weighting grain for weighting the projection data; a filtering grain for filtering the weighted projection data; a back-projection grain for back-projecting the filtered projection data onto the reconstructed object; and a reduction grain for reducing the back-projection results. By analyzing the storage volume, computation volume and communication volume of each grain, the algorithm can be implemented efficiently on the heterogeneous platform.

In some embodiments of the present invention, the FDK reconstruction formula (not reproduced in this text) has the following properties.

The weighting coefficient depends only on the position of the projection ray relative to the detector plane and on the distance from the ray source to the origin, so the weighting of the projection data is independent for each projection point. According to the frequency-domain filtering formula for the projection data, the projection data at each index must be Fourier-transformed before filtering; the Fourier transform at each index depends only on the horizontal coordinate Y of the projection data on the flat-panel detector, and the frequency-domain filter function is the same throughout, so the frequency-domain filtering of the projection data is independent for each row. The back projection of a voxel at each index depends only on the coordinates to which that voxel maps in the detector coordinate system, so the back-projection computation is independent for each voxel.

In summary, all three steps of the FDK algorithm have good parallelism and are well suited to acceleration by parallel processing. Under fine-grained decomposition, the overall computation-grain flow of the FDK algorithm is as shown in Fig. 3.
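The independence of the first two grains can be illustrated with a short serial sketch. It rests on several assumptions that are not in the patent text: the weight is taken to be the standard FDK cosine weight D / sqrt(D² + u² + v²) on a virtual detector through the rotation axis, the names `weight_projection`, `filter_rows`, `source_dist`, `du` and `dv` are illustrative, and the filtering is written as a direct spatial-domain convolution rather than the frequency-domain filtering the description mentions.

```python
import math


def weight_projection(proj, source_dist, du=1.0, dv=1.0):
    """Cosine-weight one projection; every pixel is computed independently."""
    rows, cols = len(proj), len(proj[0])
    cu, cv = (cols - 1) / 2.0, (rows - 1) / 2.0   # detector center (assumed)
    out = []
    for j in range(rows):
        v = (j - cv) * dv
        out.append([
            proj[j][i] * source_dist
            / math.sqrt(source_dist ** 2 + ((i - cu) * du) ** 2 + v ** 2)
            for i in range(cols)
        ])
    return out


def filter_rows(proj, kernel):
    """Convolve every detector row with the same odd-length 1-D kernel."""
    half = len(kernel) // 2
    out = []
    for row in proj:                               # rows never interact
        n = len(row)
        out.append([
            sum(kernel[k] * row[i - (k - half)]
                for k in range(len(kernel))
                if 0 <= i - (k - half) < n)
            for i in range(n)
        ])
    return out
```

Because no pixel depends on another in `weight_projection` and no row depends on another in `filter_rows`, both loops could be distributed across pipeline stages or threads without synchronization, which is the property the analysis above relies on.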

S2. Accelerate and optimize each computation grain on the host and the heterogeneous OpenCL computing devices of the platform:

The heterogeneous platform framework model is shown in Fig. 4. Given the data transfer rate that CT reconstruction requires, and in order to distribute and schedule reconstruction tasks efficiently across the heterogeneous processors, the heterogeneous platform uses PCI-Express as the interconnect bus for transmitting data signals and control signals. In addition, the heterogeneous platform acts as one computing node of the overall platform and uses Ethernet as an additional bus for networked control or data transfer with the outside. Based on the granularity, load distribution and parallelization characteristics of CT reconstruction algorithms, the present invention uses the GPU and the FPGA as the main acceleration components. The projection data acquired by the CT system are transferred over the Ethernet bus into the acceleration platform node. According to user settings and a reconstruction performance evaluation, the system coordinates processing over the PCI-E bus, accelerates the corresponding grains on the GPU and FPGA acceleration components respectively, and stores the reconstructed data in real time and feeds them back to the user.

In one embodiment, the projection-weighting grain is processed in parallel on the FPGA and the data are transferred asynchronously to the GPU, with the filtering grain processed concurrently during the transfer. Exploiting the data parallelism across voxels during back projection, the back-projection grain is computed on the GPU with multiple parallel threads, voxel by voxel.

Further, the projection data of the CT image require data correction for the geometry, and, according to the principle of the FDK algorithm, the algorithm can be divided into 3 steps: weighting, filtering and back projection. The resources consumed by the three steps must therefore be allocated sensibly to achieve the best acceleration. On the other hand, FPGA devices are reconfigurable and flexible and are well suited to parallel pipelined processing, while GPU devices have a large number of high-performance stream processors and are suited to data-parallel operation; the design should make full use of the strengths of each acceleration device. The 3 grains of data weighting correction, filtering and back projection are each optimized separately. Data weighting correction is a pre-processing of the projection data; because this grain merely performs pipelined operations on the rows of the projection data, it is suited to pipelined processing inside the FPGA. The weighting correction is therefore processed in parallel on the FPGA with asynchronous transfer, so that pipelined weighting and filtering are carried out concurrently during the transfer, which reduces processing latency and exploits the FPGA's pipelining capability. Weighted filtering and back projection involve large amounts of data computation and storage. Exploiting the data parallelism across voxels during back projection, the implementation performs multi-threaded parallel back projection on the GPU voxel by voxel, which improves resource utilization and reconstruction speed.

Further, a block-wise reconstruction strategy is adopted, based on the projection correspondence in the FDK reconstruction algorithm between each layer of the reconstruction region along the rotation axis and each row of the detector projection data along the vertical axis: the region to be reconstructed is divided into several blocks along the rotation axis, and when one block is reconstructed, the corresponding projection data are fetched from external memory for the reconstruction.

When the reconstruction scale is very large, the memory on the GPU/FPGA boards may be insufficient to complete the reconstruction in one pass, so a block-wise reconstruction strategy is needed. According to the FDK formula, each layer of the reconstruction region along the rotation axis has a strict projection correspondence with each row of the detector projection data along the vertical axis. The region to be reconstructed can therefore be divided into several blocks along the rotation axis, and reconstructing one block requires fetching only the corresponding projection data from external memory.
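How a slab of the volume maps to a band of detector rows can be sketched as follows. This is a minimal illustration under assumed geometry that the patent text does not spell out: a circular cone-beam scan, a virtual detector through the rotation axis with row pitch `dv`, and a cylindrical reconstruction region of radius `radius` smaller than the source distance `source_dist`. All names are hypothetical.

```python
import math


def detector_rows_for_slab(z_min, z_max, radius, source_dist, dv, num_rows):
    """Detector rows that voxels with z in [z_min, z_max] can project onto."""
    # Cone-beam magnification on the virtual detector is D / (D - s), where
    # s in [-radius, radius] is the voxel offset toward the source.
    mag_near = source_dist / (source_dist - radius)   # closest to the source
    mag_far = source_dist / (source_dist + radius)    # farthest from the source
    candidates = [z * m for z in (z_min, z_max) for m in (mag_near, mag_far)]
    v_lo, v_hi = min(candidates), max(candidates)
    center = (num_rows - 1) / 2.0
    first = max(0, math.floor(v_lo / dv + center))
    last = min(num_rows - 1, math.ceil(v_hi / dv + center))
    return first, last


# A 10-unit-thick slab just above the mid-plane needs only a narrow row band.
print(detector_rows_for_slab(0.0, 10.0, 50.0, 500.0, 1.0, 101))
```

Only the returned row range would have to be fetched from external memory for that slab, so the on-board memory requirement scales with the slab thickness rather than with the full detector height.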

Further, taking the hardware characteristics of the FPGA into account, the projection-weighting grain is processed in parallel on the FPGA. The global memory is divided into 2 banks, and load distribution balances accesses across the 2 DDR2 SDRAM devices, which raises the memory access bandwidth. Constant memory holds the intermediate variables that would otherwise be computed repeatedly during the computation, which saves computing resources. The number of projections back-projected per invocation of the kernel function is optimized, which raises the access bandwidth to the projection data storage while reducing accesses to the reconstruction data storage; tuning this number optimizes global memory access.

Further, the back-projection grain is computed on the GPU with multiple parallel threads, voxel by voxel. A voxel-driven approach is adopted, in which GPU tasks are divided according to the output reconstructed volume data. The variables in the computation that are independent of the voxel are separated and merged, computed before back projection and stored in the GPU constant memory, and read directly from constant memory during back projection. The number of projections back-projected per invocation of the kernel program is optimized.

GPU tasks can be divided by input or by output. For the back-projection algorithm, the input is the projection data and the output is the reconstructed volume data, and the two ways of assigning GPU tasks correspond to two different back-projection implementations: ray-driven and voxel-driven. The ray-driven approach divides tasks by projection data: one or several threads complete the back projection of all voxels along one ray. During back projection, all voxels the current ray passes through are computed first, and the projection value of the current ray is then assigned to those voxels. Because threads for different rays, or different threads for the same ray, may write to the same voxel at the same time, this approach suffers from write contention. The voxel-driven approach divides tasks by volume data: one thread completes the back projection of one or several voxels. During back projection, the forward-projection position of the current voxel is computed first, and the projection value at that position is then assigned to the current voxel. Voxel-driven back projection has no write contention and needs no extra reduction step, so this way of assigning tasks is better suited to GPU acceleration.
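The voxel-driven pattern can be illustrated with a small serial sketch in which each output pixel gathers its own samples and is written exactly once. It is a simplified two-dimensional illustration, not the patent's kernel: it uses the in-plane mapping u = (x - vCenter) × cos(θ) + (y - vCenter) × sin(θ) + pCenter from the description, while the nearest-neighbor sampling and all function and variable names are assumptions.

```python
import math


def backproject_voxel_driven(sinogram, angles, n):
    """Each output pixel gathers its own samples: one writer per pixel."""
    num_bins = len(sinogram[0])
    v_center, p_center = (n - 1) / 2.0, (num_bins - 1) / 2.0
    image = [[0.0] * n for _ in range(n)]
    for y in range(n):              # on a GPU, each (x, y) would be one thread
        for x in range(n):
            acc = 0.0
            for row, theta in zip(sinogram, angles):
                u = ((x - v_center) * math.cos(theta)
                     + (y - v_center) * math.sin(theta) + p_center)
                i = int(round(u))   # nearest-neighbor sampling (assumption)
                if 0 <= i < num_bins:
                    acc += row[i]   # read-only access to the projections
            image[y][x] = acc       # single write, so no write contention
    return image
```

A ray-driven version would instead loop over detector bins and add into every pixel along the ray, so several rays could update the same pixel concurrently and would need atomic updates or a reduction step.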

This embodiment adopts the voxel-driven approach, that is, tasks are divided by output. An important principle when assigning threads is that the SM occupancy must not be too low. SM occupancy is the ratio of the number of active warps on each SM to the maximum number of active warps allowed. The GPU hides long-latency operations (global memory accesses, inter-thread synchronization and so on) by switching between threads: when the threads of one warp perform a long-latency operation, the threads of another active warp automatically start computing, which hides part of the latency. This does not mean, however, that higher SM occupancy is always better. The higher the SM occupancy, the fewer GPU resources each thread has and the less computation each thread performs, while the maximum number of active warps per SM is fixed. Even when the number of active warps in an SM reaches the maximum, the computation per thread in each warp may be so small that the threads of all active warps enter long-latency operations at the same time, and the latency cannot be fully hidden. A balance must therefore be struck between SM occupancy and the computation each thread performs for the GPU to reach its best performance. After many experiments, the following thread assignment can be used: if the scale of the volume data is N^3, the block size is fixed at (16, 16, 2), and the grid size varies with N as (N/16, N/16, 1).
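The launch configuration above implies a fixed amount of work per thread, which the following sketch works out. It assumes that N is a multiple of 16 and that every voxel is handled by exactly one thread; the second assumption is an inference from the voxel-driven description rather than something the text states, and the function name is illustrative.

```python
def launch_config(n, block=(16, 16, 2)):
    """Grid size and per-thread workload for an n**3 volume."""
    if n % block[0] or n % block[1]:
        raise ValueError("n must be a multiple of the block width and height")
    grid = (n // block[0], n // block[1], 1)
    threads_per_block = block[0] * block[1] * block[2]
    total_threads = grid[0] * grid[1] * grid[2] * threads_per_block
    # Assumption: the n**3 voxels are shared evenly among all threads.
    voxels_per_thread = n ** 3 // total_threads
    return grid, threads_per_block, total_threads, voxels_per_thread


print(launch_config(512))   # ((32, 32, 1), 512, 524288, 256)
```

Under these assumptions each thread handles N/2 voxels, so the per-thread workload grows with the volume size while the block shape, and with it the occupancy-related resource usage per block, stays fixed.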

Back projection must compute, for any point (x, y, z) in the volume data, the point (u, v) onto which it projects on the detector when the projection angle is θ. This computation repeatedly evaluates trigonometric functions and other intermediate variables that depend on the projection angle and the geometry. These variables are identical for every voxel and need only be computed once per angle; if the scale of the volume data is N³ and they are computed per voxel, their values are recomputed N³ times, which wastes a great deal of system resources. To address this problem, in one embodiment the variables in the computation that are independent of the voxel (x, y, z) are separated and merged, computed before back projection and stored in the GPU constant memory; during back projection they are read directly from constant memory.

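Further, separating and merging the voxel-independent variables comprises the following. For any point (x, y, z) in the volume data that projects to the point (u, v) on the detector at projection angle θ, the projected point (u, v) is computed as:

$$u = (x - \mathrm{vCenter})\cos\theta + (y - \mathrm{vCenter})\sin\theta + \mathrm{pCenter}$$

$$\mathrm{dis} = (u - \mathrm{pCenter}) \times a$$

$$w = \frac{\sqrt{r^2 - \mathrm{dis}^2}}{\sqrt{r^2 - \mathrm{dis}^2} + a \times \left(-(x - \mathrm{vCenter})\sin\theta + (y - \mathrm{vCenter})\cos\theta\right)}$$

$$v = \left(z - (s_0 + \theta \times h) - \gamma \times h/a\right) \times w + \gamma \times h/a + \mathrm{pCenter}$$

After the variables are separated and merged, these become:

$$u = x \times A[0] + y \times A[1] + A[2] + \mathrm{pCenter}$$

$$\mathrm{dis} = (u - \mathrm{pCenter}) \times a$$

$$w = \frac{\sqrt{r^2 - \mathrm{dis}^2}}{\sqrt{r^2 - \mathrm{dis}^2} + a \times \left(-x \times A[1] + y \times A[0] + A[3]\right)}$$

$$v = \left(z - A[4] - \gamma \times A[5]\right) \times w + \gamma \times A[5] + \mathrm{pCenter}$$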
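The source does not list the six coefficients explicitly. Matching terms between the original and the merged expressions suggests the following reading, which should be taken as an inference rather than the authors' stated definition. The coefficient A[3] enters through the weight w, whose denominator contains the term −(x − vCenter) sin θ + (y − vCenter) cos θ, rewritten as −x × A[1] + y × A[0] + A[3].

$$
\begin{aligned}
A[0] &= \cos\theta, & A[1] &= \sin\theta,\\
A[2] &= -\,\mathrm{vCenter}\,(\cos\theta + \sin\theta), & A[3] &= \mathrm{vCenter}\,(\sin\theta - \cos\theta),\\
A[4] &= s_0 + \theta \times h, & A[5] &= h / a .
\end{aligned}
$$

Each of these depends only on the projection angle and the scan geometry, which is why it can be computed once per angle rather than once per voxel.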

After separation and merging, the back projection of each angle yields 6 intermediate variables that are independent of the voxel (x, y, z). Assuming 360 projections, the whole back projection has 2160 variables (single-precision floating point) that must be computed before back projection and stored in the GPU constant memory. Constant memory is a read-only memory space specific to the GPU. It is cached, and when threads of the same half-Warp access the same data in constant memory and the cache hits, the data is obtained in a single cycle. Constant memory is generally small, for example only 64 KB on the Tesla C1060, but this is fully sufficient for the present embodiment. During back projection, the final back-projection parameters are obtained simply by reading the values from constant memory and applying ordinary multiply-add operations. This acceleration strategy therefore not only avoids evaluating expensive trigonometric functions on the GPU but also avoids redundant computation, and it is effective in improving back-projection efficiency.
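The precomputation described above can be sketched on the host side as follows. This is a minimal illustration under stated assumptions: the assignment of A[0] to A[5] is inferred by matching terms between the two sets of formulas, θ is taken in radians, γ is passed in as a given value, and all function names and geometry values are hypothetical. The sketch checks that the table fits the 64 KB budget cited in the text and that the merged formulas reproduce the direct ones.

```python
import math
import struct

N_PROJ = 360                 # number of projections, as in the text
COEFFS_PER_ANGLE = 6         # A[0] .. A[5]
CONST_MEM_BYTES = 64 * 1024  # constant memory size cited for the Tesla C1060


def angle_coeffs(theta, v_center, s0, h, a):
    """Voxel-independent terms for one angle.

    The assignment of A[0..5] is inferred by matching terms between the
    original and merged formulas; theta in radians is an assumption.
    """
    c, s = math.cos(theta), math.sin(theta)
    return (c, s, -v_center * (c + s), v_center * (s - c), s0 + theta * h, h / a)


def project_direct(x, y, z, theta, gamma, v_center, p_center, s0, h, a, r):
    """(u, v) from the original formulas, evaluating cos/sin per voxel."""
    c, s = math.cos(theta), math.sin(theta)
    u = (x - v_center) * c + (y - v_center) * s + p_center
    dis = (u - p_center) * a
    root = math.sqrt(r * r - dis * dis)
    w = root / (root + a * (-(x - v_center) * s + (y - v_center) * c))
    v = (z - (s0 + theta * h) - gamma * h / a) * w + gamma * h / a + p_center
    return u, v


def project_table(x, y, z, A, gamma, p_center, a, r):
    """(u, v) from the merged formulas, using the precomputed row A."""
    u = x * A[0] + y * A[1] + A[2] + p_center
    dis = (u - p_center) * a
    root = math.sqrt(r * r - dis * dis)
    w = root / (root + a * (-x * A[1] + y * A[0] + A[3]))
    v = (z - A[4] - gamma * A[5]) * w + gamma * A[5] + p_center
    return u, v


# Illustrative geometry only; none of these values come from the source.
GEOM = dict(v_center=64.0, s0=0.0, h=2.0, a=0.5)

# One row of six coefficients per projection angle (1 degree apart here).
TABLE = [angle_coeffs(math.radians(k), **GEOM) for k in range(N_PROJ)]

# Size of the table if stored as single-precision floats.
TABLE_BYTES = len(TABLE) * COEFFS_PER_ANGLE * struct.calcsize("f")
```

With 360 angles the table holds 2160 single-precision values, which is well below the 64 KB limit, so the whole table can stay resident in constant memory for the duration of the back projection.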

The reconstructed volume data is usually stored in GPU global memory. Global memory occupies most of the video memory and can hold large-scale data, but it is not cached; although coalesced access can greatly increase access speed, an access latency of several hundred cycles usually remains. Research shows that the bottleneck of GPU-based high-performance computing is not computation but memory access. Reducing the time spent accessing global memory is therefore the key to GPU acceleration. In the present embodiment, an acceleration strategy is devised in which m projection images are back-projected simultaneously in one Kernel.

Normally, one Kernel processes only 1 projection image, so for 360 projections the whole back-projection process reads/writes global memory 360 × N³ times. In this embodiment, one Kernel completes the back projection of the projection images of m angles. Each Kernel must compute the back-projection parameters for m angles but reads and writes global memory only N³ times, so the number of global memory reads/writes falls to 1/m of the original. While reducing global memory traffic, this approach increases the computational load of each thread in the Kernel. Simply increasing m inevitably reduces the number of active Blocks and active Warps across the GPU, which in turn weakens the GPU's ability to hide long-latency operations (global memory accesses). Finding a moderate m is therefore particularly important. Repeated tests show that when 3 images are back-projected simultaneously in one Kernel, the two factors reach a balance and the acceleration is satisfactory.
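The effect of m on global memory traffic can be illustrated with a toy host-side model. This is a sketch, not the kernel itself: `backproject` and `global_rw_passes` are hypothetical names, and `contributions[k][i]` merely stands in for the value that angle k adds to voxel i, which a real kernel would compute from the filtered projection. Each voxel of the volume is read once and written once per pass, regardless of how many angles are accumulated in that pass.

```python
import math


def global_rw_passes(n_proj, m):
    """Number of kernel launches, i.e. read/write passes over the volume."""
    return math.ceil(n_proj / m)


def backproject(volume, contributions, m):
    """Accumulate per-angle contributions into `volume`, m angles per pass.

    Returns how many times a voxel of the volume was read and written back.
    """
    accesses = 0
    for start in range(0, len(contributions), m):
        group = contributions[start:start + m]
        for i in range(len(volume)):
            acc = volume[i]        # one read of the volume per pass
            for contrib in group:  # m angles accumulated locally
                acc += contrib[i]
            volume[i] = acc        # one write of the volume per pass
            accesses += 1
    return accesses
```

The model captures only the traffic side of the trade-off. The cost of a larger m, namely more registers and work per thread and therefore fewer active Warps, has to be measured on the device, which is how the text arrives at m = 3.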

In another embodiment, as shown in Fig. 6, a CT three-dimensional reconstruction acceleration device based on a heterogeneous platform is provided. The heterogeneous platform uses PCI-Express as the interconnect bus for transmitting data signals and control signals, and uses Ethernet as an auxiliary bus for networked control and data transfer with the outside. The architecture of the heterogeneous platform comprises: an application layer that provides functional modules; a component layer that provides, for each functional module of the application layer, component libraries based on different processors and the interface specifications required by the reconstruction algorithm; and a support layer that provides services to the component layer and the application layer. The support layer comprises a CPU that executes the main program and multiple OpenCL computing devices that execute kernel programs; the CPU and the OpenCL computing devices are communicatively connected, and the OpenCL computing devices include a GPU and an FPGA. The CT three-dimensional reconstruction acceleration device based on this architecture comprises the following:

A computing grain decomposition module 201, configured to split and discretize the FDK reconstruction algorithm into computing grains according to its content, decomposing it into: a projection weighting computing grain for performing data correction on the projection data; a filtering computing grain for filtering the weighted projection data; a back-projection computing grain for back-projecting the filtered projection data onto the reconstructed object; and a reduction computing grain for performing reduction on the back-projection results;

An acceleration processing module 202, configured to transfer the projection data to the compute nodes of the heterogeneous platform over the auxiliary bus, perform coordinated processing over the interconnect bus according to user settings and a reconstruction performance evaluation, complete the computing grain acceleration on the GPU and FPGA acceleration components respectively, store the reconstructed data in real time, and feed it back to the user.

It should be noted that part of the implementation of the CT three-dimensional reconstruction acceleration device in the embodiments of the present invention is the same as that of the CT three-dimensional reconstruction acceleration method; for details, refer to the method embodiments, which are not repeated here.

The software design block diagram of the heterogeneous platform is shown in Fig. 5. The software architecture is divided into three layers. The application layer consists mainly of the functional modules of the user software, namely four modules: projection data weighting correction, projection data filtering, three-dimensional back-projection reconstruction, and reconstructed image visualization. The component layer is the application-oriented component library based on different processors, consisting mainly of the function code libraries of the computing grains of the reconstruction algorithm and the corresponding interface specifications. The last layer is the support layer, composed of containers such as the CPU, GPU, and FPGA, which provides services and support for the components and the application software.

The component layer is designed to be developed with the OpenCL framework. OpenCL stands for Open Computing Language; it was first proposed by Apple and is a computing application programming interface (API). Its main purpose is to provide a cross-platform, unified standard language for general-purpose computing. The heterogeneous platforms it supports can be composed of multi-core CPUs, GPUs, DSPs, Cell/B.E. processors, or other types of processors. OpenCL provides a non-proprietary, cross-vendor software solution for developing parallel programs, which gives programs good portability; at the same time, a cross-platform heterogeneous framework helps exploit the performance potential of the various devices in the system simultaneously. The OpenCL platform model consists of a host (Host) and one or more OpenCL compute devices (Compute Device) connected to it. Each compute device consists of one or more compute units (Compute Unit), and each compute unit can be further divided into one or more processing elements (Processing Element); all computation is carried out in the processing elements. The host manages all computing resources on the platform. An application sends compute commands from the host to the processing elements of each OpenCL device, and all processing elements within one compute unit execute the same instruction stream. In the parallel programming implementation of the heterogeneous platform, the CPU serves as the host, and the GPU and FPGA serve as OpenCL devices. The OpenCL programming model can be divided into two parts: the main program (Host program) executed on the CPU, and the kernel functions (Kernel) executed on the Device. The main program manages the execution of kernel programs on the Device by defining a context (Context). In OpenCL programming, before the host submits a kernel function for execution, an index space (Index Space) must first be defined for it. The index space can be one-, two-, or three-dimensional, and the kernel function is executed at every node (Work Item) of the index space. The index of a work-item in each dimension is defined as its global ID (Global ID) in that dimension. All work-items execute the same kernel function; each work-item is equivalent to a separate execution thread, and acceleration is achieved by executing the program in parallel across a large number of work-items. In addition to the global index, OpenCL provides a finer-grained work-group (work group) space within the index space. Each work-group has a unique work-group ID (Work Group ID), and the position index of a work-item relative to its work-group is called its local ID (Local ID). A kernel function can be designed either for parallelism among work-items alone or for a two-level parallel scheme, i.e., parallelism among the work-groups of the index space together with parallelism among the work-items inside each work-group. This both makes full use of the computing resources of the underlying GPU and FPGA hardware and increases programming flexibility.
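The relationship between global ID, work-group ID, and local ID described above can be sketched by enumerating a small index space on the host. This is an illustration only: it assumes a zero global offset and a global size that is an exact multiple of the work-group size in every dimension, and the function names are hypothetical rather than part of the OpenCL API.

```python
from itertools import product


def enumerate_work_items(global_size, local_size):
    """Yield (global_id, group_id, local_id) for every work-item.

    Assumes a zero global offset and a global size that is a multiple of
    the work-group size in each dimension.
    """
    ranges = [range(g) for g in global_size]
    for gid in product(*ranges):
        group = tuple(g // l for g, l in zip(gid, local_size))
        local = tuple(g % l for g, l in zip(gid, local_size))
        yield gid, group, local


def global_from(group, local, local_size):
    """Recover the global ID from the work-group ID and the local ID."""
    return tuple(w * l + i for w, i, l in zip(group, local, local_size))


if __name__ == "__main__":
    # A 4 x 4 index space split into 2 x 2 work-groups.
    for gid, group, local in enumerate_work_items((4, 4), (2, 2)):
        print(gid, group, local)
```

In each dimension the global ID equals the work-group ID times the work-group size plus the local ID, which is what lets a kernel address its data either globally or relative to its own work-group.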

In the multiprocessor heterogeneous OpenCL programming environment, the CPU host side is programmed in standard C/C++, and the GPU and FPGA coprocessors are programmed in the description language defined by the OpenCL specification. The abstraction level of the OpenCL programming language is far higher than that of hardware description languages such as VHDL and Verilog. The traditional approach requires describing the FPGA's low-level hardware units cycle by cycle; executing a complex algorithm requires designing state machines to control the data path while also handling interface constraints and timing synchronization at every level. Programming is difficult and time-consuming, and program maintenance and upgrades are complex, which greatly hinders rapid deployment in real products. With OpenCL, there is no need to attend to timing-level hardware design: C-like code in a high-level language can be written according to the back-projection algorithm, and the OpenCL compiler automatically converts the OpenCL code into hardware description language and the configuration program.

To further verify the effectiveness of the present invention, a specific experiment is described below:

As shown in Fig. 7, a petroleum rock core is selected as the test object. 360 projections of the object are acquired over the full angular range and used for reconstruction, with a reconstruction size of 500³. The central sections of the reconstruction result are shown in Fig. 7: a), b), and c) are the sections of the reconstructed three-dimensional image on each central plane, i.e., the sections on the three planes x = 0, y = 0, and z = 0 of the world coordinate system.

Reconstruction tests are carried out at different reconstruction scales in two modes, "CPU" and "GPU+FPGA". Each test is repeated 10 times, and the average reconstruction times are as follows:

Table 1. Reconstruction time test results

The results show that GPU+FPGA acceleration significantly increases computing speed over the traditional CPU, so that the user can obtain the three-dimensional reconstruction result quickly. The reconstructed images also show that the result clearly reveals the three-dimensional internal information and meets the needs of practical three-dimensional reconstruction applications.

Each of the embodiments above describes only the implementation of the corresponding steps. Provided there is no logical contradiction, the above embodiments can be combined with one another to form new technical solutions, and such new technical solutions still fall within the scope of disclosure of the present embodiments.

From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus the necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is carried on a non-volatile computer-readable storage medium (such as a ROM, magnetic disk, optical disc, or server cloud space) and includes several instructions for causing a terminal device (which may be a mobile phone, computer, server, network device, etc.) to perform the methods described in the embodiments of the present invention.

The technical features of the embodiments described above can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.

Claims

1. A CT three-dimensional reconstruction acceleration method based on a heterogeneous platform, characterized in that the heterogeneous platform comprises a host and heterogeneous OpenCL computing devices, and the acceleration method comprises: decomposing the FDK reconstruction algorithm into computing grains and analyzing the parallel computation flow of each computing grain; and performing acceleration optimization on each computing grain by means of the host and the heterogeneous OpenCL computing devices in the heterogeneous platform.

2. The CT three-dimensional reconstruction acceleration method based on a heterogeneous platform according to claim 1, characterized in that the host is a CPU running the main program; the OpenCL computing devices comprise the heterogeneous containers GPU and FPGA running kernel programs; the CPU, GPU, and FPGA communicate over a PCI-E bus; and the main program manages the execution of the kernel programs by defining a context.

3. The CT three-dimensional reconstruction acceleration method based on a heterogeneous platform according to claim 2, characterized in that decomposing the FDK reconstruction algorithm into computing grains comprises decomposing it, according to the content of the FDK algorithm, into: a projection weighting computing grain for weighting the projection data; a filtering computing grain for filtering the weighted projection data; a back-projection computing grain for back-projecting the filtered projection data onto the reconstructed object; and a reduction computing grain for performing reduction on the back-projection results.

4. The CT three-dimensional reconstruction acceleration method based on a heterogeneous platform according to claim 3, characterized in that, according to the FDK reconstruction formula [equation omitted in source]:

the projection weighting computing grain is expressed as [equation omitted in source], where p'(θ, u, v) denotes the projection data after weighting at rotation angle θ, and [symbol omitted in source] is the weighting coefficient;

the filtering computing grain is expressed as [equation omitted in source], where d_f(θ, u, v) is the filtered data, h(u) is the unit impulse response of the filter operator, and [-u_m, u_m] denotes the 2m data points acquired per detector row;

the reduction computing grain is expressed as [equation omitted in source], where φ_max is the number of discretely sampled projections acquired as the reconstructed object rotates through one full revolution.

5. The CT three-dimensional reconstruction acceleration method based on a heterogeneous platform according to claim 3, characterized in that performing acceleration optimization on each computing grain comprises: processing the projection weighting computing grain in parallel on the FPGA and transferring the result asynchronously to the GPU, with the filtering computing grain processed concurrently during the transfer; and, exploiting the data parallelism of the individual voxels in back projection, performing multi-threaded parallel back projection for the back-projection computing grain voxel by voxel on the GPU.

6. The CT three-dimensional reconstruction acceleration method based on a heterogeneous platform according to claim 5, characterized in that, based on the correspondence in the FDK reconstruction algorithm between each layer of the reconstruction region along the rotation axis and the rows of projection data along the vertical axis of the detector, a block-wise reconstruction strategy is used: the region to be reconstructed is divided into several blocks along the rotation axis, and when a block is reconstructed, the corresponding projection data is fetched from external memory for the reconstruction operation.

7. The CT three-dimensional reconstruction acceleration method based on a heterogeneous platform according to claim 5, characterized in that processing the projection weighting computing grain in parallel on the FPGA comprises: dividing global memory into 2 banks and balancing random memory accesses through load distribution; and storing in constant memory the intermediate variables that would otherwise be computed repeatedly.

8. The CT three-dimensional reconstruction acceleration method based on a heterogeneous platform according to claim 5, characterized in that performing multi-threaded parallel back projection for the back-projection computing grain voxel by voxel on the GPU comprises: using a voxel-driven approach, partitioning the GPU task by the output reconstructed volume data; separating and merging the voxel-independent variables in the computation, computing them before back projection and storing them in the GPU constant memory, and reading them directly from constant memory during back projection; and optimizing the number of projections back-projected in a single kernel program.

9. The CT three-dimensional reconstruction acceleration method based on a heterogeneous platform according to claim 8, characterized in that separating and merging the voxel-independent variables in the computation comprises the following: for any point (x, y, z) in the volume data that projects to the point (u, v) on the detector at projection angle θ, the projected point (u, v) is computed as:

$$u = (x - \mathrm{vCenter})\cos\theta + (y - \mathrm{vCenter})\sin\theta + \mathrm{pCenter}$$

$$\mathrm{dis} = (u - \mathrm{pCenter}) \times a$$

$$w = \frac{\sqrt{r^2 - \mathrm{dis}^2}}{\sqrt{r^2 - \mathrm{dis}^2} + a \times \left(-(x - \mathrm{vCenter})\sin\theta + (y - \mathrm{vCenter})\cos\theta\right)}$$

$$v = \left(z - (s_0 + \theta \times h) - \gamma \times h/a\right) \times w + \gamma \times h/a + \mathrm{pCenter}$$

After the variables are separated and merged, these become:

$$u = x \times A[0] + y \times A[1] + A[2] + \mathrm{pCenter}$$

$$\mathrm{dis} = (u - \mathrm{pCenter}) \times a$$

$$w = \frac{\sqrt{r^2 - \mathrm{dis}^2}}{\sqrt{r^2 - \mathrm{dis}^2} + a \times \left(-x \times A[1] + y \times A[0] + A[3]\right)}$$

$$v = \left(z - A[4] - \gamma \times A[5]\right) \times w + \gamma \times A[5] + \mathrm{pCenter}$$

where vCenter denotes the volume data center, pCenter is the projection data center, a is the voxel size, θ is the projection angle, r is the rotation radius of the ray source, h is the pitch, and γ is the angle between the projection of the ray onto the central plane and the central ray.

10. A CT three-dimensional reconstruction acceleration device based on a heterogeneous platform, characterized in that the heterogeneous platform uses PCI-Express as the interconnect bus for transmitting data signals and control signals, and uses Ethernet as an auxiliary bus for networked control and data transfer with the outside; the architecture of the heterogeneous platform comprises an application layer that provides functional modules, a component layer that provides, for each functional module of the application layer, component libraries based on different processors and the interface specifications required by the reconstruction algorithm, and a support layer that provides services to the component layer and the application layer; the support layer comprises a CPU that executes the main program and multiple OpenCL computing devices that execute kernel programs, the CPU and the OpenCL computing devices are communicatively connected, and the OpenCL computing devices include a GPU and an FPGA; the CT three-dimensional reconstruction acceleration device based on the architecture of the heterogeneous platform comprises the following:

a computing grain decomposition module, configured to split and discretize the FDK reconstruction algorithm into computing grains according to its content, decomposing it into a projection weighting computing grain for performing data correction on the projection data, a filtering computing grain for filtering the weighted projection data, a back-projection computing grain for back-projecting the filtered projection data onto the reconstructed object, and a reduction computing grain for performing reduction on the back-projection results;

an acceleration processing module, configured to transfer the projection data to the compute nodes of the heterogeneous platform over the auxiliary bus, perform coordinated processing over the interconnect bus according to user settings and a reconstruction performance evaluation, complete the computing grain acceleration on the GPU and FPGA acceleration components respectively, store the reconstructed data in real time, and feed it back to the user.