CN107908477A

CN107908477A - Data processing method and device for radio astronomy data

Info

Publication number: CN107908477A
Application number: CN201711148902.XA
Authority: CN
Inventors: 王超
Original assignee: Zhengzhou Yunhai Information Technology Co Ltd
Current assignee: Inspur Beijing Electronic Information Industry Co Ltd
Priority date: 2017-11-17
Filing date: 2017-11-17
Publication date: 2018-04-13

Abstract

The present invention provides a data processing method and device for radio astronomy data. The data processing comprises an outermost loop, a middle loop and an innermost loop, and the method further comprises the following steps: distributing the computation of each iteration of the outermost loop to different threads; and having each thread use vectorization instructions. Embodiments of the invention distribute the computation of each loop iteration by multithreaded task scheduling (schedule), which improves the balance of the computational load across threads, effectively optimizes deGridding, and greatly improves performance.

Description

A data processing method and device for radio astronomy data

Technical field

The invention belongs to the field of computing, and in particular relates to a data processing method and device for radio astronomy data.

Background art

The Square Kilometer Array (SKA) is an international astronomy project that aims to build the world's largest aperture-synthesis radio telescope. It will comprise 3,000 parabolic dish antennas, each 15 meters in diameter, and 250 groups of mid-frequency and low-frequency aperture arrays, spread over more than 3,000 kilometers, with a total collecting area of about 1 square kilometer. Its sensitivity is expected to be about 50 times that of the largest current radio telescope array (JVLA) and about 10,000 times that of China's largest current single-dish radio telescope (FAST). According to the plan, SKA will collect more than 12 Tb of data per second, and processing this volume would require nearly the combined performance of all TOP500 supercomputers.

DeGridding is the data processing step in SKA with the most complex computation and the longest run time; nearly 30% of the data in the whole project must be processed by this software. The deGridding computational kernel consists of three nested loops. The outermost loop is the dind loop, whose iteration count is nChan × nSamples, where nSamples is the number of data samples and nChan is the number of spectral channels. The middle loop is the suppv loop, whose iteration count is the length of the X (or Y) axis of the convolution kernel. The innermost loop is the suppu loop, whose iteration count is the length of the Y (or X) axis of the convolution kernel. At present, the serial version of deGridding is not fast enough; if deGridding can be optimized effectively, the investment in computing platforms for SKA data processing can be greatly reduced.
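The three-loop structure described above can be sketched as a serial reference kernel. This is a minimal illustration only: the loop indices dind, suppv and suppu come from the text, while the function name, the array layout, and the parameters gSize, sSize, iu and iv are assumptions made for the example and are not taken from the patent.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Serial degridding sketch (hypothetical layout, for illustration only).
   grid   : gSize x gSize complex image grid, row-major
   C      : sSize x sSize complex convolution kernel, row-major
   iu, iv : top-left grid coordinates of the kernel window for each sample
   data   : output, one complex value per (channel, sample) pair
   nTotal : nChan * nSamples */
void degrid_serial(const float complex *grid, size_t gSize,
                   const float complex *C, size_t sSize,
                   const size_t *iu, const size_t *iv,
                   float complex *data, size_t nTotal)
{
    for (size_t dind = 0; dind < nTotal; ++dind) {            /* outermost loop */
        /* the kernel window must lie inside the grid */
        assert(iu[dind] + sSize <= gSize && iv[dind] + sSize <= gSize);
        float complex acc = 0.0f;
        for (size_t suppv = 0; suppv < sSize; ++suppv) {       /* middle loop */
            const float complex *grow = grid + (iv[dind] + suppv) * gSize + iu[dind];
            const float complex *crow = C + suppv * sSize;
            for (size_t suppu = 0; suppu < sSize; ++suppu)     /* innermost loop */
                acc += grow[suppu] * crow[suppu];              /* multiply-add */
        }
        data[dind] = acc;
    }
}
```

Each outer iteration reads from grid and C and writes only data[dind], which is why the outermost loop is a natural place to split the work among threads.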

Summary of the invention

Embodiments of the present invention provide a data processing method and device for radio astronomy data to solve the above problem.

An embodiment of the present invention provides a data processing method for radio astronomy data. The data processing method comprises an outermost loop, a middle loop and an innermost loop, and further comprises the following steps: distributing the computation of each iteration of the outermost loop to different threads; and having each thread use vectorization instructions.

An embodiment of the present invention also provides a data processing device for radio astronomy data, used for data processing of radio astronomy data. The data processing comprises an outermost loop, a middle loop and an innermost loop, and the data processing device comprises:

a data allocation unit, configured to distribute the computation of each iteration of the outermost loop to different threads in a computing unit; and the computing unit, in which each thread uses vectorization instructions when computing.

Embodiments of the present invention distribute the computation of each loop iteration by multithreaded task scheduling (schedule), improving the balance of the computational load across threads. The simd directive and the _mm_prefetch instruction vectorize the core computation and load the data involved in the computation into cache in advance, improving the efficiency of memory reads and writes. The AVX512 instruction set and MCDRAM in cache mode significantly increase the computing capability of the deGridding program. DeGridding is thus effectively optimized, with greatly improved performance, greater practicality and wider applicability.

Brief description of the drawings

The drawings described here provide a further understanding of the present invention and form part of this application. The illustrative embodiments of the invention and their description serve to explain the invention and do not unduly limit it. In the drawings:

Fig. 1 is a flow chart of the data processing method for radio astronomy data according to Embodiment 1 of the present invention;

Fig. 2 is a schematic abstract representation of the vectorized processing according to Embodiment 1 of the present invention;

Fig. 3 is a schematic diagram of a concrete implementation of the vectorized operation according to Embodiment 1 of the present invention;

Fig. 4 is a structural diagram of the data processing device for radio astronomy data according to Embodiment 2 of the present invention.

Detailed description of the embodiments

The present invention is described in detail below with reference to the drawings and in conjunction with the embodiments. It should be noted that, where no conflict arises, the embodiments of this application and the features within them may be combined with one another.

Fig. 1 is a flow chart of the data processing method for radio astronomy data according to Embodiment 1 of the present invention. The data processing method comprises an outermost loop, a middle loop and an innermost loop, and further comprises the following steps:

Step 102: distribute the computation of each iteration of the outermost loop to different threads;

Step 104: each thread uses vectorization instructions.

In step 102 above, the computation is distributed to different threads using the schedule clause of the OpenMP parallel construct, and the iterations are scheduled dynamically using the dynamic scheduling kind of schedule.

Specifically, the schedule clause of the OpenMP parallel construct distributes the nChan × nSamples iterations to different threads. Because the computational load within the loop is unbalanced, and to avoid threads waiting on one another, the dynamic scheduling kind of schedule is used to schedule iterations dynamically according to run-time state and system resources. Distributing the computation of each loop iteration by multithreaded task scheduling (schedule) effectively avoids data dependencies between threads and improves the balance of the computational load across threads. When tuning performance, a trade-off must be made between optimizing memory usage and optimizing load balance, and the best setting is found by measuring performance. Dynamic scheduling uses an internal queue: when a thread becomes available, it is assigned the number of loop iterations given by the chunk size. A single node contains 64 cores with 1 thread per core, so without hyper-threading the maximum number of threads is set to 64. With np=8 and OMP_NUM_THREADS=8, a workload of nChan × nSamples = 800,000 data samples is divided into 64 blocks (800000/64 = 12500 per thread).
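The scheduling step can be illustrated with a small sketch. It shows the schedule(dynamic, chunk) clause on an outer loop whose iterations have deliberately unequal cost, plus the block-size arithmetic from the paragraph above. The function names, the work array and the toy inner computation are hypothetical; only the schedule clause and the dind index come from the text.

```c
#include <assert.h>
#include <stddef.h>

/* Block size if the iterations are split evenly across all threads,
   e.g. 800000 iterations over 64 threads gives 12500 per thread. */
size_t even_chunk(size_t nIter, size_t nThreads)
{
    assert(nThreads > 0);
    return (nIter + nThreads - 1) / nThreads;   /* round up */
}

/* Outer loop shared among threads with dynamic scheduling.
   work[dind] is a hypothetical per-iteration cost, so the load is unbalanced.
   Idle threads pull the next chunk of iterations from the runtime's queue.
   Each iteration writes only out[dind], so threads never share an element. */
void process_dynamic(const int *work, double *out, long nIter, int chunk)
{
    assert(chunk > 0);
    #pragma omp parallel for schedule(dynamic, chunk)
    for (long dind = 0; dind < nIter; ++dind) {
        double acc = 0.0;
        for (int k = 0; k < work[dind]; ++k)
            acc += (double)(k + 1);             /* stand-in for real work */
        out[dind] = acc;
    }
}
```

With GCC or Clang the pragma takes effect when compiling with -fopenmp; without that flag it is ignored and the loop runs serially with the same result. A smaller chunk generally balances load better at the price of more scheduling overhead, which is the trade-off the text says must be settled by measurement.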

In step 104 above, provided that the dependences between the variables being vectorized are handled correctly, #pragma simd is used to vectorize the loop effectively. On machines that support the 512-bit vector instruction set extension, the compiler generates the corresponding instructions to vectorize the loop body of the program. Fig. 2 is an abstract representation of vectorized processing, in which a single operation processes a whole vector, providing more efficient data parallelism than scalar execution. VL in the figure denotes the vector length; a vector contains multiple scalars of the same data type (such as int or float). Fig. 3 shows a concrete implementation of the vectorized operation. When vectorlength(8) is specified, one vector loop iteration is theoretically equivalent to 8 scalar loop iterations. Because each value has a real part and an imaginary part (8 bytes per single-precision complex value), each vector operation is (4 × 8) × 16 = 512 bit long, and the theoretical number of vector loop iterations is sSize/16. A private array of size 16 must be created to hold the values of grid after mapping through the convolution kernel, that is, the values data obtained by the multiply-add of grid and C. The program can then make full use of the vectorization unit to speed up the computation, and the results are stored in data_local. Note that unless the directive #pragma nounroll is added to prevent the compiler's loop unrolling optimization, the compiler may unroll the loop, so the actual number of loop iterations may be smaller.
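As a rough illustration of the vectorized multiply-add, the sketch below accumulates the complex product of one grid row and one kernel row. It uses the standard OpenMP directive #pragma omp simd as a portable stand-in for the compiler-specific #pragma simd named in the text; that substitution, the function name, and the interleaved (re, im) array layout are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Complex multiply-accumulate of one grid row with one kernel row.
   g and c hold interleaved (re, im) single-precision pairs, so n complex
   values occupy 2*n floats. A 512-bit register holds 16 floats
   (16 * 32 bit = 512 bit), i.e. 8 complex values. */
void row_mac_simd(const float *g, const float *c, size_t n,
                  float *out_re, float *out_im)
{
    assert(out_re != NULL && out_im != NULL);
    float re = 0.0f, im = 0.0f;
    /* the reduction clause tells the compiler the two sums carry no other
       loop dependence, so iterations may be executed in vector lanes */
    #pragma omp simd reduction(+:re, im)
    for (size_t i = 0; i < n; ++i) {
        float gr = g[2 * i], gi = g[2 * i + 1];
        float cr = c[2 * i], ci = c[2 * i + 1];
        re += gr * cr - gi * ci;    /* real part of g * c */
        im += gr * ci + gi * cr;    /* imaginary part of g * c */
    }
    *out_re = re;
    *out_im = im;
}
```

Whether the compiler actually emits 512-bit instructions depends on the target flags and hardware; the directive only asserts that vectorizing the loop is safe.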

Further, the OpenMP simd directive is used to group the computation in the loop among threads;

each thread is scheduled to execute several data blocks and applies the simd directive to the loop that follows.

That is, the OpenMP simd directive groups the computation in the for loop among threads: each thread executes several data blocks as scheduled by the OpenMP runtime and applies the simd directive to the loop that follows, so that each thread accelerates the loop with vectorization instructions.
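One way to picture the combination of thread-level scheduling and per-thread vectorization is the sketch below, in which threads take chunks of outer iterations and each thread vectorizes its own inner loop. A plain matrix-vector product is used as a hypothetical stand-in for the deGridding kernel; the function and parameter names are illustrative and not from the patent.

```c
#include <assert.h>
#include <stddef.h>

/* Threads take chunks of rows; each thread vectorizes its own inner loop.
   a : nRows x nCols matrix, row-major
   x : vector of length nCols
   y : output of length nRows, y[r] = sum over c of a[r][c] * x[c] */
void rows_parallel_simd(const float *a, const float *x, float *y,
                        long nRows, long nCols, int chunk)
{
    assert(chunk > 0);
    #pragma omp parallel for schedule(dynamic, chunk)
    for (long r = 0; r < nRows; ++r) {
        const float *row = a + r * nCols;
        float acc = 0.0f;
        #pragma omp simd reduction(+:acc)
        for (long c = 0; c < nCols; ++c)
            acc += row[c] * x[c];
        y[r] = acc;     /* each row is written by exactly one thread */
    }
}
```

Keeping the thread-level directive on the outer loop and the simd directive on the inner loop means each thread's work item is a whole row, so thread start-up cost is paid once while the vector unit is used inside every row.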

Further, the data processing method for radio astronomy data may also include:

inserting prefetch instructions through the compiler to load the data involved in the computation into cache in advance.

Specifically, the SSE intrinsic _mm_prefetch is used to optimize memory copies by reading data into cache before it is actually accessed. The function prototype is void _mm_prefetch(char const *a, int sel). It corresponds to the PREFETCH instructions and tells the processor to load the cache line at address a into a faster cache level; sel selects the type of prefetch operation. The prefetch instructions and their hint types are listed in Table 1. NTA denotes a non-temporal prefetch, which reduces cache pollution; T0 prefetches the data into all cache levels; T1 prefetches into the L2 and L3 caches but not L1; T2 prefetches into the L3 cache only. Because the program writes to the cache line or accesses it multiple times, the data is prefetched into all cache levels. In the implementation, grid and C are each prefetched with _mm_prefetch. Since grid and C can be converted to a 2D storage layout, several rows of data are loaded into the faster cache, and all prefetched elements are traversed during the corresponding multiplication. The values of PF3 and PF4 passed to _mm_prefetch were chosen by the experiment shown in Table 2: with PF4=2 and PF3=1 the single-threaded processing time is shortest, that is, prefetching 2 rows of grid and 1 row of C at a time gives the best performance.

Prefetch instruction	Hint type
PREFETCHNTA	_MM_HINT_NTA
PREFETCHT0	_MM_HINT_T0
PREFETCHT1	_MM_HINT_T1
PREFETCHT2	_MM_HINT_T2

Table 1

Table 2
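The row-wise prefetching of grid and C can be sketched as follows. This is an assumption-laden illustration: the function names, the parameters pf_grid and pf_c (loosely modeled on the PF3/PF4 row counts mentioned above), and the 64-byte cache line size are all hypothetical, and the fallback macro exists only so the sketch also builds on non-x86 targets.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
/* T0 hint: prefetch into all cache levels */
#define PREFETCH_T0(p) _mm_prefetch((const char *)(p), _MM_HINT_T0)
#else
#define PREFETCH_T0(p) ((void)(p))      /* no-op where SSE is unavailable */
#endif

/* Issue one prefetch per assumed 64-byte cache line (16 floats) of a row. */
static void prefetch_row(const float *row, size_t nCols)
{
    for (size_t c = 0; c < nCols; c += 16)
        PREFETCH_T0(row + c);
}

/* Element-wise multiply-add over two nRows x nCols float matrices.
   Before working on row r, request a later row of each matrix so that it
   is likely to be cached by the time the loop reaches it. */
float dot_rows_prefetch(const float *grid, const float *C,
                        size_t nRows, size_t nCols,
                        size_t pf_grid, size_t pf_c)
{
    assert(nCols > 0);
    float acc = 0.0f;
    for (size_t r = 0; r < nRows; ++r) {
        if (r + pf_grid < nRows)
            prefetch_row(grid + (r + pf_grid) * nCols, nCols);
        if (r + pf_c < nRows)
            prefetch_row(C + (r + pf_c) * nCols, nCols);
        for (size_t c = 0; c < nCols; ++c)
            acc += grid[r * nCols + c] * C[r * nCols + c];
    }
    return acc;
}
```

A call such as dot_rows_prefetch(grid, C, nRows, nCols, 2, 1) would mirror the 2-row and 1-row setting reported as fastest in the text. Prefetches are only hints, so the numerical result is the same for any choice of distances; only the timing changes.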

Further, MCDRAM is configured in cache mode, so that the MCDRAM serves as a last-level cache between the L2 cache and DDR4 memory.

In addition, embodiments of the present invention are compiled with Intel's AVX512 instruction set, which greatly extends the existing simd instruction sets, to improve the computational performance of the program; the VPU in Intel Xeon Phi supports the 512-bit vector instruction set extension.

Therefore, embodiments of the present invention distribute the computation of each loop iteration by multithreaded task scheduling (schedule), avoiding data dependencies between threads and improving the balance of each thread's computational load when processing astronomical sample data. The deGridding computational kernel is parallelized with OpenMP: thread fork and join are placed at the outermost loop, the total data volume is divided evenly according to the number of OpenMP threads, and each thread writes to its own memory region. simd is used to vectorize the loop effectively. On machines that support the 512-bit vector instruction set extension, the 512-bit line width of the Xeon Phi processor is taken into account, and MCDRAM is fully used according to the length occupied by a single array, speeding up reads and writes. The prefetcher can support multiple independent data streams at once: when an array is accessed through the expression a[j], the compiler inserts a software prefetch instruction that loads a[j+d] into cache, bringing the cache line at that address into a faster cache level. This improves the computational performance of the program and greatly reduces the investment in computing platforms for SKA data processing.

As shown in Fig. 4, a data processing device for radio astronomy data according to an embodiment of the present invention is used for data processing of radio astronomy data. The data processing comprises an outermost loop, a middle loop and an innermost loop, and the data processing device comprises:

a data allocation unit 402, configured to distribute the computation of each iteration of the outermost loop to different threads in a computing unit;

a computing unit 404, in which each thread uses vectorization instructions when computing.

Further, the data allocation unit 402 distributes the computation to different threads using the schedule clause of the OpenMP parallel construct, and schedules the iterations dynamically using the dynamic scheduling kind of schedule.

Further, the computing unit 404 uses the OpenMP simd directive to group the computation in the loop among threads; each thread is scheduled to execute several data blocks and applies the simd directive to the loop that follows.

Further, the data processing device for radio astronomy data may also include a prefetch unit 406, configured to insert prefetch instructions through the compiler so that the data involved in the computation is loaded into cache in advance.

The prefetch unit 406 is also configured to set MCDRAM to cache mode, so that the MCDRAM serves as a last-level cache between the L2 cache and DDR4 memory.

The above is only a preferred embodiment of the present invention and is not intended to limit it. Those skilled in the art may make various modifications and variations. Any modification, equivalent substitution or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims

1. A data processing method for radio astronomy data, characterized in that the data processing method comprises an outermost loop, a middle loop and an innermost loop, and further comprises the following steps:

distributing the computation of each iteration of the outermost loop to different threads;

each thread using vectorization instructions.

2. The method according to claim 1, characterized in that the computation is distributed to different threads using the schedule clause of the OpenMP parallel construct;

the iterations are scheduled dynamically using the dynamic scheduling kind of schedule.

3. The method according to claim 2, characterized in that the OpenMP simd directive is used to group the computation in the loop among threads;

4. The method according to any one of claims 1 to 3, characterized by further comprising:

5. The method according to claim 4, characterized in that MCDRAM is configured in cache mode, so that the MCDRAM serves as a last-level cache between the L2 cache and DDR4 memory.

6. A data processing device for radio astronomy data, characterized in that it is used for data processing of radio astronomy data, the data processing comprising an outermost loop, a middle loop and an innermost loop, and the data processing device comprising:

a data allocation unit, configured to distribute the computation of each iteration of the outermost loop to different threads in a computing unit;

the computing unit, in which each thread uses vectorization instructions when computing.

7. The device according to claim 6, characterized in that the data allocation unit distributes the computation to different threads using the schedule clause of the OpenMP parallel construct, and schedules the iterations dynamically using the dynamic scheduling kind of schedule.

8. The device according to claim 7, characterized in that the computing unit uses the OpenMP simd directive to group the computation in the loop among threads, and each thread is scheduled to execute several data blocks and applies the simd directive to the loop that follows.

9. The device according to any one of claims 6 to 8, characterized by further comprising a prefetch unit, configured to insert prefetch instructions through the compiler so that the data involved in the computation is loaded into cache in advance.

10. The device according to claim 9, characterized in that the prefetch unit is also configured to set MCDRAM to cache mode, so that the MCDRAM serves as a last-level cache between the L2 cache and DDR4 memory.