CN104090993B - Very long baseline interferometry (VLBI) correlation processing implementation method - Google Patents

Very long baseline interferometry (VLBI) correlation processing implementation method

Info

Publication number
CN104090993B
CN104090993B
Authority
CN
China
Prior art keywords
baseline
vlbi
signal
stations
station
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410240777.5A
Other languages
Chinese (zh)
Other versions
CN104090993A (en)
Inventor
陈蓉
王静温
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Aerospace Long March Launch Vehicle Technology Co Ltd
Beijing Institute of Telemetry Technology
Original Assignee
Aerospace Long March Launch Vehicle Technology Co Ltd
Beijing Institute of Telemetry Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Aerospace Long March Launch Vehicle Technology Co Ltd, Beijing Institute of Telemetry Technology filed Critical Aerospace Long March Launch Vehicle Technology Co Ltd
Priority to CN201410240777.5A priority Critical patent/CN104090993B/en
Publication of CN104090993A publication Critical patent/CN104090993A/en
Application granted granted Critical
Publication of CN104090993B publication Critical patent/CN104090993B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a very long baseline interferometry (VLBI) correlation processing implementation method. The VLBI correlation processing procedure is implemented in a baseline-parallel manner on a platform consisting of a CPU (central processing unit) and a GPU (graphics processing unit) coprocessor, using a hybrid parallel mode based on MPI (message passing interface) and CUDA (compute unified device architecture), thereby supporting the application of the efficient MPI+CUDA computing model to the field of VLBI correlation processing. The baseline-parallel scheme effectively accelerates the VLBI correlation processing procedure in parallel and makes full use of the efficient computing capability of the GPU and the task-distribution and scheduling capability of the multi-core CPU; the running efficiency of the VLBI correlation processing procedure is therefore improved, while the heterogeneous platform and the hybrid parallel mode guarantee the flexibility and extensibility of the implementation method.

Description

A very long baseline interferometry (VLBI) correlation processing implementation method
Technical field
The present invention relates to a very long baseline interferometry (VLBI) correlation processing implementation method, in particular to a VLBI correlation processing method based on an MPI and CUDA hybrid parallel mode, and belongs to the field of very long baseline interferometry.
Background art
Very long baseline interferometry (VLBI) is an important radio interferometry technique developed since the late 1960s. By performing correlation operations on the observation data of multiple radio telescopes, it synthesizes these telescopes into an equivalent telescope whose aperture equals the baseline length. VLBI uses highly stable atomic clocks as independent local-oscillator and timing references for each station, which removes the restriction on baseline length and achieves high spatial and temporal resolution; it is therefore widely used in fields such as astronomy, geodesy and deep-space tracking.
Correlation processing is the core of VLBI data processing and is both data-intensive and computation-intensive. Implementation techniques mainly include hardware correlation processing based on application-specific integrated circuits (ASIC) or field-programmable gate arrays (FPGA), and software correlation processing based on general-purpose computer platforms. Implementing VLBI correlation processing with high-performance FPGAs requires developing dedicated hardware boards; the implementation is complex, the resources are limited, and the extensibility is poor when the number of baselines must be increased. Although implementing VLBI correlation processing on a general-purpose computer platform reduces the implementation difficulty and improves extensibility, the parallel processing capability of such a platform is limited and can hardly cope with the intensive computation of VLBI correlation processing.
Since the late 1990s, with the popularization of commercial high-performance computer systems, VLBI software correlation processing based on modern high-performance PC or server platforms has received great attention from research institutions at home and abroad and has become a new research hotspot in the VLBI field. By combining high-performance PCs or servers into a cluster and configuring a message passing interface (MPI) environment for the cluster, higher computing performance can be obtained. Meanwhile, the compute unified device architecture (CUDA) opens the door to general-purpose computing with the powerful computing capability of the GPU (graphics processing unit), making an efficient computing model based on a CPU+GPU heterogeneous platform and an MPI+CUDA hybrid parallel environment possible.
Content of the invention
The technical problem solved by the present invention is: to overcome the deficiencies of the prior art and to provide a very long baseline interferometry (VLBI) correlation processing implementation method that uses a small-scale CPU+GPU cluster as the platform and is implemented based on an MPI and CUDA hybrid parallel mode, thereby supporting the application of the efficient MPI+CUDA computing model to the field of VLBI correlation processing.
The technical solution of the present invention is: a very long baseline interferometry (VLBI) correlation processing implementation method, comprising the following steps:
(1) build a development platform using a GPU and a CPU, configure the compute unified device architecture (CUDA) environment on the platform, and configure the message passing interface (MPI) environment on the CPU;
(2) the CPU determines the required number of MPI concurrent processes according to the number of VLBI baselines it has to process, and creates the MPI concurrent processes;
(3) the CPU assigns a corresponding VLBI baseline to each MPI process and starts all MPI processes;
(4) each MPI process obtains the data files and parameter files of the two stations of its corresponding VLBI baseline, and from them obtains the signal sample data of the two stations of the baseline, the delay values for integer-bit delay correction of the two stations, the delay values for fringe rotation, the delay values for fractional-bit delay correction, and the carrier frequency information;
(5) each MPI process, according to the signal sample data of the two stations of its corresponding VLBI baseline and the integer-bit delay correction values of the two stations, uses the CUDA environment on the GPU to perform the integer-bit delay correction of the two station signals of the baseline in parallel, and mixes the two station signals of the baseline with the down-conversion local-oscillator signal respectively, obtaining the signals of the two stations of the baseline after integer-bit delay correction and down-conversion; the down-conversion local-oscillator signal is calculated from the down-conversion local-oscillator frequency in the carrier frequency information;
(6) each MPI process, according to the fringe-rotation delay values of the two stations of the baseline, uses the CUDA environment on the GPU to apply fringe rotation in parallel to the two station signals of the baseline obtained in step (5), bringing the two station signals close to each other and obtaining the signals of the two stations of the baseline after fringe rotation;
(7) each MPI process uses the CUDA environment to perform a parallel fast Fourier transform on the signals of the two stations of the baseline obtained in step (6), transforming the signals of the two stations of the baseline from the time domain to the frequency domain;
(8) each MPI process, according to the signals of the two stations of the baseline obtained in step (7) and the fractional-bit delay correction values, uses the CUDA environment on the GPU to perform the fractional-bit delay correction of the two station signals of the baseline in parallel, obtaining the signals of the two stations of the baseline after fractional-bit delay correction;
(9) each MPI process uses the CUDA environment on the GPU to cross-multiply, in parallel, the corresponding sample points of the two station signals of the baseline processed in step (8), and sums the cross-multiplication results with a parallel reduction algorithm, completing the cross-correlation of the two station signals of the baseline; the VLBI cross-correlation result is obtained and output, and this result is the phase fringe data of the two station signals of the baseline after VLBI correlation processing;
In steps (5) to (9), each MPI process is responsible only for processing the signals of the two stations of the VLBI baseline corresponding to that MPI process.
In step (1), the development platform built with GPUs and CPUs is either a single heterogeneous platform or a small-scale cluster formed by multiple heterogeneous platforms, where one GPU combined with one CPU constitutes a heterogeneous platform.
When fringe rotation is applied in parallel in step (6) to the two station signals of the baseline obtained in step (5), either only the sample points of one station signal of the baseline are rotated, or the sample points of both station signals of the baseline are rotated simultaneously, so that the two station signals are brought close to each other.
The CPU is a multi-core CPU server.
Compared with the prior art, the present invention has the following advantages: according to the VLBI correlation processing algorithm, the present invention proposes a VLBI correlation processing implementation method that uses a small-scale CPU+GPU cluster as the platform and is based on an MPI and CUDA hybrid parallel mode, supporting the application of the efficient MPI+CUDA computing model to the field of VLBI correlation processing. The main advantages of the method are as follows:
(1) Easy to implement: hardware correlation processing based on FPGAs requires developing dedicated hardware boards and firmware, which is difficult to implement and has a long development cycle, and the debugging tools and means for hardware are limited. The development platform of the present invention, based on a small-scale CPU+GPU cluster, is easy to obtain; with the CUDA architecture the GPU can be invoked from a C-language environment, coding is easy and the debugging process is simple.
(2) Fast running speed: software correlation processing using only a general-purpose computer platform is also easy to implement and debug, but it can only exploit CPU parallelism, the maximum number of concurrent processes is comparable to the number of CPUs, and its parallel computing capability is far lower than that of a GPU. The present invention combines multi-core CPUs with GPU coprocessors to build a heterogeneous platform, making full use of the efficient floating-point processing capability of the GPU and the good task-distribution and scheduling capability of the multi-core CPU, and exploiting the parallel processing capability of a single compute node to the greatest extent.
(3) Strong extensibility: hardware correlation processing based on FPGAs is constrained by the characteristics of the hardware itself and its resources are limited; if the number of baselines increases and more resources are needed, the FPGA devices must be replaced and the hardware boards redesigned, so the extensibility is poor. The heterogeneous platform built by combining multi-core CPUs with GPU coprocessors and the MPI and CUDA hybrid parallel mode adopted by the present invention both have strong extensibility; when the number of baselines needs to be increased, it is only necessary to add compute nodes to the existing platform.
Brief description of the drawings
Fig. 1 is a schematic diagram of the heterogeneous platform built by combining multi-core CPUs with GPU coprocessors adopted by the present invention;
Fig. 2 is a schematic flow chart of a VLBI correlation processing implementation method of the present invention;
Fig. 3 is a schematic diagram of the multi-threaded parallel processing procedure realized by the method proposed by the present invention.
Specific embodiment
Since the late 1990s, with the popularization of commercial high-performance computer systems, VLBI software correlation processing based on modern high-performance PC or server platforms has received great attention from research institutions at home and abroad and has become a new research hotspot in the VLBI field. By combining high-performance PCs or servers into a cluster and configuring a message passing interface (MPI) environment for the cluster, higher computing performance can be obtained. Meanwhile, the compute unified device architecture (CUDA) opens the door to general-purpose computing with the powerful computing capability of the GPU (graphics processing unit), making an efficient computing model based on a CPU+GPU heterogeneous platform and an MPI+CUDA hybrid parallel environment possible.
According to the VLBI correlation processing algorithm, the present invention proposes a VLBI correlation processing implementation method that uses a small-scale CPU+GPU cluster as the platform and is based on an MPI and CUDA hybrid parallel mode. The parallel implementation schemes of VLBI correlation processing generally include baseline-parallel, station-parallel, channel-parallel and time-parallel structures; the present invention adopts the baseline-parallel structure.
MPI is one of the standards for message-passing parallel programming and is a set of application programming interfaces (API) for parallel computing; this embodiment adopts the MPI-2 standard. MPI supports both the Fortran language and the C language; this embodiment adopts the C language.
CUDA is a parallel computing architecture as well as a programming model. CUDA enables the GPU to solve complex computational problems and allows the CPU and the GPU to cooperate, each using its own advantages, to complete parallel computing applications.
The present invention is executed by entering commands and parameters from the command line. The embodiment uses a heterogeneous platform built by combining multi-core CPUs with GPU coprocessors, as illustrated in Fig. 1, and runs on new-generation GPUs based on the Kepler architecture; the maximum number of threads that each Kepler-architecture GPU can support reaches 32768.
A schematic flow chart of the VLBI correlation processing implementation method of the present invention is shown in Fig. 2.
Fig. 3 is a schematic diagram of the multi-threaded parallel processing procedure in the embodiment. The embodiment of the present invention comprises the following steps:
(1) GPUs and multi-core CPU servers are combined to build a heterogeneous platform; one multi-core CPU server equipped with a GPU coprocessor constitutes one heterogeneous platform. In this embodiment, two multi-core CPU servers equipped with GPU coprocessors (i.e. two heterogeneous platforms) are combined into a small-scale cluster as the development platform. On this development platform a CUDA environment is configured for each heterogeneous platform, an MPI environment is configured on each multi-core CPU server, node numbers are assigned to the multi-core CPU servers in the cluster, and one CPU is selected as the host node. In this embodiment the two multi-core CPU servers equipped with GPU coprocessors are two nodes, numbered 0 and 1, of which node 0 is the host node.
(2) The host-node server in the cluster reads the start command from the command line; the command contains the number of baselines, the relative path of the data files, the relative path of the parameter files and the relative path of the output files.
According to the number of VLBI baselines, the host CPU node in the cluster assigns the number of baselines to be processed to each CPU node (including the host node). Each CPU node initializes its MPI environment, determines the required number of MPI concurrent processes according to the number of baselines it has to process, and creates the MPI concurrent processes.
In this embodiment the number of baselines is 2, each of the two CPU nodes in the cluster corresponds to one baseline, and each CPU node completes the initialization of its MPI environment and creates one MPI concurrent process.
(3) Each CPU node assigns the corresponding VLBI baseline to its MPI process, obtains the process rank of its MPI process in the given communicator, and looks up the baseline list according to the process rank. The baseline list is an array of structures; each array element contains the baseline number, the names of the two stations of the baseline and the code names of the two stations of the baseline. At the same time, each CPU node starts the corresponding MPI process function.
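As a concrete illustration of steps (2)-(3), the following minimal C sketch shows how each MPI process could obtain its rank in the communicator and look up its baseline in an array-of-structures baseline list. The structure fields, the station names and the two-entry table are illustrative assumptions; the patent does not specify them.

```c
/* Minimal sketch of steps (2)-(3): one MPI process per baseline looks up its
 * entry in the baseline list by rank. Field names, station names and the
 * two-entry table are illustrative assumptions, not taken from the patent. */
#include <mpi.h>
#include <stdio.h>

typedef struct {
    int  baseline_id;           /* baseline number                */
    char station_name[2][32];   /* names of the two stations      */
    char station_code[2][8];    /* code names of the two stations */
} BaselineEntry;

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* process rank in the given communicator */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* equals the number of baselines         */

    /* baseline list: an array of structures, one element per baseline */
    BaselineEntry baselines[2] = {
        { 0, { "StationA", "StationB" }, { "SA", "SB" } },
        { 1, { "StationA", "StationC" }, { "SA", "SC" } },
    };
    const BaselineEntry *mine = &baselines[rank];

    printf("rank %d of %d handles baseline %d (%s-%s)\n",
           rank, size, mine->baseline_id,
           mine->station_name[0], mine->station_name[1]);

    /* steps (4)-(9) for this baseline would run here */

    MPI_Finalize();
    return 0;
}
```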
(4) Each MPI process combines the names and code names of the two stations of its VLBI baseline, obtained from the baseline list, with the relative paths of the data files, parameter files and output files obtained from the command line, to form the full paths of the data files, parameter files and output files of the two stations of the baseline. It then reads the data files and parameter files of the two stations of its baseline and obtains the signal sample data of the two stations of the VLBI baseline and the parameter information corresponding to the two station data sets: the delay values for integer-bit delay correction, the delay values for fringe rotation, the delay values for fractional-bit delay correction, and the carrier frequencies (including the down-conversion local-oscillator frequency and the radio-frequency carrier frequency).
Each MPI process copies the signal data of the two stations of its baseline and the parameter information from the memory of the CPU on which it runs into the global memory of the GPU corresponding to that CPU.
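A minimal CUDA sketch of this host-to-device copy is given below; the buffer names, element type and sizes are illustrative assumptions.

```c
/* Sketch of the step (4) host-to-device transfer: the two station signals and
 * the parameter table are copied into GPU global memory. Names and sizes are
 * illustrative assumptions. */
#include <cuda_runtime.h>

void upload_baseline_data(const float *h_sig1, const float *h_sig2,
                          const float *h_params,
                          size_t n_samples, size_t n_params,
                          float **d_sig1, float **d_sig2, float **d_params)
{
    cudaMalloc((void **)d_sig1,   n_samples * sizeof(float));
    cudaMalloc((void **)d_sig2,   n_samples * sizeof(float));
    cudaMalloc((void **)d_params, n_params  * sizeof(float));

    /* copy station 1 signal, station 2 signal and the parameter information */
    cudaMemcpy(*d_sig1,   h_sig1,   n_samples * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(*d_sig2,   h_sig2,   n_samples * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(*d_params, h_params, n_params  * sizeof(float), cudaMemcpyHostToDevice);
}
```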
(5) Each MPI process launches a CUDA thread block on the GPU that performs the integer-bit delay correction. Each thread in the CUDA thread block, according to the signal sample data of the two stations of the VLBI baseline corresponding to this MPI process and the integer-bit delay values of the two stations, calculates the number of sample points by which the starting sample instants of the two stations are offset, then, according to the result, takes a pair of delay-aligned sample points of the two stations of the baseline after integer-bit delay correction, and computes the down-conversion local-oscillator sample point corresponding to each of the two sample points.
For example, if in a specific implementation station 1 lags station 2 by an integer-bit delay of 0.5 ms and the sampling rate of the two station signals is 50 kHz, then the integer-bit delay offset between the signals of station 1 and station 2 at the starting instant is 0.5*50000/1000 = 25 sample points. Thread 0 in the CUDA thread block then takes the 26th sample point of station 1 and the 1st sample point of station 2, computes the down-conversion local-oscillator sample point for the station 1 sample from the down-conversion local-oscillator frequency in GPU global memory and the index 26, and computes the down-conversion local-oscillator sample point for the station 2 sample from the down-conversion local-oscillator frequency in GPU global memory and the index 1; thread 1 takes the 27th sample point of station 1 and the 2nd sample point of station 2 and computes the corresponding down-conversion local-oscillator sample points from the down-conversion local-oscillator frequency and the indices 27 and 2, and so on.
Each thread then mixes its two sample points with the down-conversion local-oscillator sample points it has produced, and writes the results back to the corresponding positions in GPU global memory. In a specific implementation, thread 0 in the thread block operates on the storage location at offset 0 of the station 1 signal data block and the storage location at offset 0 of the station 2 signal data block in GPU global memory; thread 1 operates on the storage locations at offset 1 of the station 1 and station 2 signal data blocks, and so on.
All threads in the CUDA thread block execute the same operation simultaneously, so the coarse delay correction of all valid sample points of the two station signals of the baseline is realized in a multi-threaded parallel manner and each signal is mixed with the down-conversion local-oscillator signal, yielding the signals of the two stations of the baseline after integer-bit delay correction and down-conversion.
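The kernel below is a minimal sketch of this step: each thread takes one delay-aligned sample of each station, generates the corresponding down-conversion local-oscillator sample from the LO frequency and the sample index, mixes, and writes the result back. The kernel name, the real-input/complex-output layout and the parameter names are illustrative assumptions, not the patent's exact implementation.

```c
/* Sketch of the step (5) kernel: integer-bit (coarse) delay correction plus
 * down-conversion mixing. Buffer layout and parameter names are illustrative
 * assumptions. */
#include <cuComplex.h>

__global__ void coarse_delay_and_mix(const float *sig1, const float *sig2,
                                     cuFloatComplex *out1, cuFloatComplex *out2,
                                     int offset1, int offset2, /* integer-bit delays in samples */
                                     float f_lo, float f_s,    /* LO frequency, sampling rate   */
                                     int n_valid)
{
    const float TWO_PI = 6.283185307f;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_valid) return;

    /* delay-aligned sample of each station (e.g. thread 0 takes sample 26 of
       station 1 and sample 1 of station 2 when offset1 - offset2 = 25) */
    float s1 = sig1[i + offset1];
    float s2 = sig2[i + offset2];

    /* down-conversion local-oscillator sample for each station */
    float ph1 = -TWO_PI * f_lo * (float)(i + offset1) / f_s;
    float ph2 = -TWO_PI * f_lo * (float)(i + offset2) / f_s;

    /* mix and write back at the same offset in the output blocks */
    out1[i] = make_cuFloatComplex(s1 * cosf(ph1), s1 * sinf(ph1));
    out2[i] = make_cuFloatComplex(s2 * cosf(ph2), s2 * sinf(ph2));
}
```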
(6) On the basis of step (5), each MPI process launches a CUDA thread block on the GPU that performs the fringe rotation. This embodiment rotates only the signal frequency axis of one station of the baseline towards the frequency axis of the other station (in practical applications the frequency axes of both station signals of the baseline may also be rotated simultaneously so that the two frequency axes approach each other).
In a specific implementation, station 2 is kept unchanged and only the signal frequency of station 1 is rotated. Thread 0 in the thread block takes the 1st sample point of station 1, thread 1 takes the 2nd sample point of station 1, and so on. Each thread then reads from GPU global memory the fringe-rotation delay value corresponding to the station 1 sample point it handles, computes the phase delay value, uses the phase delay value to compute the fringe-rotation radio-frequency carrier sample point, mixes the station 1 sample point it handles with that carrier sample point, and writes the result back to the corresponding position in GPU global memory. Specifically, thread 0 in the thread block operates on the storage location at offset 0 of the station 1 signal data block in GPU global memory, thread 1 operates on the storage location at offset 1, and so on.
All threads in the CUDA thread block execute the same operation simultaneously, so the fringe rotation of the station 1 sample points of the baseline is realized in a multi-threaded parallel manner, bringing the signal frequency of station 1 close to that of station 2 and yielding the signals of the two stations of the baseline after fringe rotation.
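A minimal sketch of this kernel is shown below: each thread reads the fringe-rotation delay of its station 1 sample, turns it into a phase, forms the rotating carrier sample and multiplies it into the signal in place. The kernel name, the complex in-place layout and the use of a single RF carrier frequency are illustrative assumptions.

```c
/* Sketch of the step (6) kernel: fringe rotation of the station 1 signal only;
 * station 2 is left unchanged. Names and conventions are illustrative
 * assumptions. */
#include <cuComplex.h>

__global__ void fringe_rotate(cuFloatComplex *sig1,       /* station 1, rotated in place          */
                              const float *fringe_delay,  /* per-sample fringe-rotation delay [s] */
                              float f_rf,                 /* radio-frequency carrier [Hz]         */
                              int n_valid)
{
    const float TWO_PI = 6.283185307f;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_valid) return;

    /* phase delay value for this sample, then the fringe-rotation carrier sample */
    float phase = -TWO_PI * f_rf * fringe_delay[i];
    cuFloatComplex carrier = make_cuFloatComplex(cosf(phase), sinf(phase));

    /* mix and write back to the same offset of the station 1 data block */
    sig1[i] = cuCmulf(sig1[i], carrier);
}
```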
(7) Each MPI process performs a parallel fast Fourier transform on the signals of the two stations of its VLBI baseline obtained in step (6) through the cuFFT library of CUDA, transforming the signals of the two stations of the VLBI baseline from the time domain to the frequency domain.
In a specific implementation, the GPU-based function library cuFFT provided by NVIDIA is used: a one-dimensional cuFFT handle is first created with the cufftPlan1d() function, and then a parallel FFT is performed on the signals of the two stations of the VLBI baseline with the cufftExecC2C() function.
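A minimal sketch of this cuFFT usage is given below; the FFT length, in-place execution and single-batch plan are illustrative assumptions.

```c
/* Sketch of step (7): transform both station signals to the frequency domain
 * with cuFFT. FFT length and in-place layout are illustrative assumptions. */
#include <cufft.h>

void fft_station_signals(cufftComplex *d_sig1, cufftComplex *d_sig2, int fft_len)
{
    cufftHandle plan;
    cufftPlan1d(&plan, fft_len, CUFFT_C2C, 1);         /* one-dimensional C2C handle */

    cufftExecC2C(plan, d_sig1, d_sig1, CUFFT_FORWARD); /* station 1, in place */
    cufftExecC2C(plan, d_sig2, d_sig2, CUFFT_FORWARD); /* station 2, in place */

    cufftDestroy(plan);
}
```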
(8) Each MPI process launches a CUDA thread block on the GPU that performs the fractional-bit delay correction. Each thread of the CUDA thread block takes two frequency-domain sample points, one from each of the two station signals of the VLBI baseline corresponding to this MPI process, reads from GPU global memory the fractional-bit delay correction values corresponding to those two frequency-domain sample points, applies the delay correction smaller than one sampling interval to the two frequency-domain sample points, and finally writes the results back to the corresponding positions in GPU global memory.
In a specific implementation, thread 0 in the thread block operates on the storage location at offset 0 of the station 1 signal data block and the storage location at offset 0 of the station 2 signal data block in GPU global memory; thread 1 operates on the storage locations at offset 1 of the station 1 and station 2 signal data blocks, and so on.
All threads in the CUDA thread block execute the same operation simultaneously, so the delay correction smaller than one sampling interval is applied to every frequency-domain sample point of the two station signals of the baseline in a multi-threaded parallel manner, yielding the signals of the two stations of the baseline after fractional-bit delay correction.
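The kernel below sketches this step as a per-bin phase correction: each thread reads the fractional delay value of its frequency bin for each station and applies the corresponding phase rotation. The kernel name, the bin-frequency convention and the per-bin delay arrays are illustrative assumptions.

```c
/* Sketch of the step (8) kernel: fractional-bit (sub-sample) delay correction
 * applied in the frequency domain as a phase slope. Names and conventions are
 * illustrative assumptions. */
#include <cuComplex.h>

__global__ void fractional_delay(cuFloatComplex *spec1, cuFloatComplex *spec2,
                                 const float *tau1, const float *tau2, /* per-bin residual delays [s] */
                                 float f_s, int fft_len)
{
    const float TWO_PI = 6.283185307f;
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= fft_len) return;

    float f_k = (float)k * f_s / (float)fft_len;  /* frequency of bin k */

    float ph1 = -TWO_PI * f_k * tau1[k];
    float ph2 = -TWO_PI * f_k * tau2[k];

    /* correct each station's frequency-domain sample and write back in place */
    spec1[k] = cuCmulf(spec1[k], make_cuFloatComplex(cosf(ph1), sinf(ph1)));
    spec2[k] = cuCmulf(spec2[k], make_cuFloatComplex(cosf(ph2), sinf(ph2)));
}
```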
(9) Each MPI process launches a CUDA thread block on the GPU that performs the cross-correlation: it cross-multiplies the two station signals of the baseline processed in step (8) and sums the cross-multiplication results with a parallel reduction algorithm.
In a specific implementation, each thread in the CUDA thread block takes two frequency-domain sample points, one from each of the two station signals of the VLBI baseline corresponding to this MPI process, and multiplies them; the products are then summed with a parallel reduction algorithm, and the imaginary part of the summed complex value is divided by its real part.
All threads in the CUDA thread block execute the same operation simultaneously, completing the cross-correlation of the two station signals of the VLBI baseline in a multi-threaded parallel manner; the result is copied from GPU global memory to CPU memory and output, and this result is exactly the phase fringe data of the two station signals of the baseline after VLBI correlation processing.
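The kernel below sketches the cross-multiplication and the parallel reduction summation: each thread multiplies one pair of frequency-domain samples and a shared-memory tree reduction accumulates the products; the per-block partial sums are then added (and the imaginary part divided by the real part) on the host or in a second kernel. The block size, the conjugation convention and the two-pass reduction are illustrative assumptions.

```c
/* Sketch of the step (9) kernel: cross-multiplication of the two station
 * spectra followed by a parallel (tree) reduction within each thread block.
 * Names, block size and conventions are illustrative assumptions. */
#include <cuComplex.h>

#define BLOCK 256  /* launch the kernel with blockDim.x == BLOCK */

__global__ void cross_correlate(const cuFloatComplex *spec1,
                                const cuFloatComplex *spec2,
                                cuFloatComplex *block_sums, int fft_len)
{
    __shared__ cuFloatComplex partial[BLOCK];

    int tid = threadIdx.x;
    int k   = blockIdx.x * blockDim.x + tid;

    /* cross-multiply the corresponding frequency-domain samples */
    cuFloatComplex prod = make_cuFloatComplex(0.0f, 0.0f);
    if (k < fft_len)
        prod = cuCmulf(spec1[k], cuConjf(spec2[k]));
    partial[tid] = prod;
    __syncthreads();

    /* parallel reduction summation of the products */
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            partial[tid] = cuCaddf(partial[tid], partial[tid + stride]);
        __syncthreads();
    }

    /* one partial sum per block; the final sum and the imag/real division
       that yields the phase-fringe value are done afterwards */
    if (tid == 0)
        block_sums[blockIdx.x] = partial[0];
}
```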
With the baseline-parallel scheme, the present invention effectively achieves parallel acceleration of the VLBI correlation processing procedure, makes full use of the efficient computing capability of the GPU and the good task-distribution and scheduling capability of the multi-core CPU, improves the running efficiency of VLBI correlation processing, and guarantees the flexibility and extensibility of the implementation method through the heterogeneous platform and the hybrid parallel mode.
Parts of the present invention that are not described in detail belong to common knowledge well known to those skilled in the art.

Claims (4)

1. A very long baseline interferometry (VLBI) correlation processing implementation method, characterized in that it comprises the following steps:
(1) building a development platform using a GPU and a CPU, configuring a compute unified device architecture (CUDA) environment on the platform, and configuring a message passing interface (MPI) environment on the CPU;
(2) the CPU determining the required number of MPI concurrent processes according to the number of VLBI baselines it has to process, and creating the MPI concurrent processes;
(3) the CPU assigning a corresponding VLBI baseline to each MPI process and starting all MPI processes;
(4) each MPI process obtaining the data files and parameter files of the two stations of its corresponding VLBI baseline, and obtaining the signal sample data of the two stations of the baseline, the delay values for integer-bit delay correction of the two stations, the delay values for fringe rotation, the delay values for fractional-bit delay correction, and the carrier frequency information;
(5) each MPI process, according to the signal sample data of the two stations of its corresponding VLBI baseline and the integer-bit delay correction values of the two stations, using the CUDA environment on the GPU to perform, in parallel, the integer-bit delay correction of the two station signals of the baseline, and mixing the two station signals of the baseline with the down-conversion local-oscillator signal respectively, to obtain the signals of the two stations of the baseline after integer-bit delay correction and down-conversion, the down-conversion local-oscillator signal being calculated from the down-conversion local-oscillator frequency in the carrier frequency information;
(6) each MPI process, according to the fringe-rotation delay values of the two stations of the baseline, using the CUDA environment on the GPU to apply fringe rotation in parallel to the two station signals of the baseline obtained in step (5), bringing the two station signals close to each other and obtaining the signals of the two stations of the baseline after fringe rotation;
(7) each MPI process using the CUDA environment to perform a parallel fast Fourier transform on the signals of the two stations of the baseline obtained in step (6), transforming the signals of the two stations of the baseline from the time domain to the frequency domain;
(8) each MPI process, according to the signals of the two stations of the baseline obtained in step (7) and the fractional-bit delay correction values, using the CUDA environment on the GPU to perform, in parallel, the fractional-bit delay correction of the two station signals of the baseline, obtaining the signals of the two stations of the baseline after fractional-bit delay correction;
(9) each MPI process using the CUDA environment on the GPU to cross-multiply, in parallel, the corresponding sample points of the two station signals of the baseline processed in step (8), and summing the cross-multiplication results with a parallel reduction algorithm, completing the cross-correlation of the two station signals of the baseline, obtaining and outputting the VLBI cross-correlation result, the result being the phase fringe data of the two station signals of the baseline after VLBI correlation processing;
wherein in steps (5) to (9) each MPI process is responsible only for processing the signals of the two stations of the VLBI baseline corresponding to that MPI process.
2. The very long baseline interferometry correlation processing implementation method according to claim 1, characterized in that: the development platform built with GPUs and CPUs in step (1) is a single heterogeneous platform or a small-scale cluster formed by multiple heterogeneous platforms, where one GPU combined with one CPU constitutes a heterogeneous platform.
3. The very long baseline interferometry correlation processing implementation method according to claim 1, characterized in that: when fringe rotation is applied in parallel in step (6) to the two station signals of the baseline obtained in step (5), either only the sample points of one station signal of the baseline are rotated or the sample points of both station signals of the baseline are rotated simultaneously, so that the two station signals are brought close to each other.
4. The very long baseline interferometry correlation processing implementation method according to claim 1, characterized in that: the CPU is a multi-core CPU server.
CN201410240777.5A 2014-05-30 2014-05-30 Very long baseline interferometry correlation processing implementation method Active CN104090993B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410240777.5A CN104090993B (en) 2014-05-30 2014-05-30 Very long baseline interferometry correlation processing implementation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410240777.5A CN104090993B (en) 2014-05-30 2014-05-30 Very long baseline interferometry correlation processing implementation method

Publications (2)

Publication Number Publication Date
CN104090993A CN104090993A (en) 2014-10-08
CN104090993B true CN104090993B (en) 2017-01-25

Family

ID=51638709

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410240777.5A Active CN104090993B (en) 2014-05-30 2014-05-30 Very long baseline interferometry correlation processing implementation method

Country Status (1)

Country Link
CN (1) CN104090993B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105300437B (en) * 2015-11-05 2017-11-03 中国科学院上海天文台 A kind of VLBI baseband signals decimal time delay simulation method
CN105719231B (en) * 2016-01-19 2019-05-07 南京理工大学 A kind of interference data Fast Fourier Transform (FFT) method calculated based on GPU
CN107766291B (en) * 2017-09-15 2020-11-06 中国人民解放军63920部队 Method and computer equipment for obtaining residual time delay in very long baseline interferometry

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4444011B2 (en) * 2004-06-04 2010-03-31 株式会社 沖情報システムズ Remote control system
CN102201992A (en) * 2011-05-25 2011-09-28 上海理工大学 Stream processor parallel environment-oriented data stream communication system and method
CN102880785A (en) * 2012-08-01 2013-01-16 北京大学 Method for estimating transmission energy consumption of source code grade data directed towards GPU program

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
The Use of Very Long Baseline Interferometry for Time and Frequency Metrology; Yasuhiro Koyama; MAPAN; 2012-01-31; Vol. 27; pp. 1893-1899 *
Comparative analysis of VLBI parallel processing schemes (VLBI并行处理方式比较分析); Han Songtao (韩松涛) et al.; Journal of Telemetry, Tracking and Command (遥测遥控); 2013-01-15; No. 1; pp. 29-33 *
Design strategy of a cluster-based deep-space TT&C system (基于集群的深空测控系统设计策略); Cai Jiping (蔡季萍) et al.; Radio Engineering (无线电工程); 2014-05-05; No. 5; pp. 44-47, 55 *
Overview of deep-space VLBI technology and its status and development in China (深空探测VLBI技术综述及我国的现状和发展); Zhu Xinying (朱新颖) et al.; Journal of Astronautics (宇航学报); 2010-08-30; No. 8; pp. 23-30 *

Also Published As

Publication number Publication date
CN104090993A (en) 2014-10-08

Similar Documents

Publication Publication Date Title
CN202041640U (en) Satellite navigation software receiver based on GPU
CN103278829B (en) A kind of parallel navigation method for tracing satellite signal based on GPU and system thereof
CN104090993B (en) Very long baseline interferometry correlation processing implementation method
CN104850866A (en) SoC-FPGA-based self-reconstruction K-means cluster technology realization method
CN104570081A (en) Pre-stack reverse time migration seismic data processing method and system by integral method
CN103969627A (en) Ground penetrating radar large-scale three-dimensional forward modeling method based on FDTD
CN105227259B (en) A kind of parallel production method of M sequence and device
Wang et al. Reconfigurable hardware accelerators: Opportunities, trends, and challenges
CN105911532A (en) Synthetic aperture radar echo parallel simulation method based on depth cooperation
Gailing et al. Germany’s Energiewende and the spatial reconfiguration of an energy system
CN116436012B (en) FPGA-based power flow calculation system and method
CN103914428A (en) Efficient communication method of structural analysis under multi-core distributed computing environment
CN102902590A (en) Parallel digital terrain analysis-oriented massive DEM (Digital Elevation Model) deploying and scheduling method
CN112446471B (en) Convolution acceleration method based on heterogeneous many-core processor
CN113672380B (en) Phase interferometer direction-finding system for realizing FX cross-correlation phase discrimination by GPU and phase discrimination method thereof
CN106093884A (en) A kind of manifold relevant treatment implementation method of based on FPGA of improvement
CN103837878A (en) Method for acquiring GNSS satellite signal
CN103176949A (en) Circuit and method for achieving fast Fourier transform (FFT) / inverse fast Fourier transform (IFFT)
CN103400354B (en) Based on the remotely sensing image geometric correction method for parallel processing of OpenMP
CN107423030A (en) Markov Monte carlo algorithm accelerated method based on FPGA heterogeneous platforms
Chen et al. Domain decomposition approach for parallel improvement of tetrahedral meshes
CN112559197A (en) Convolution calculation data reuse method based on heterogeneous many-core processor
Svensson Occam‐pi for Programming of Massively Parallel Reconfigurable Architectures
CN113672541A (en) PCM/FM telemetering signal incoherent demodulation implementation method based on GPU
CN102608600A (en) FPGA (field-programmable gate array)-based step frequency image splicing implementation method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant