CN107908477A - A data processing method and device for radio astronomy data - Google Patents
- Publication number
- CN107908477A (application number CN201711148902.XA)
- Authority
- CN
- China
- Prior art keywords
- data
- thread
- data processing
- instruction
- calculating
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Devices For Executing Special Programs (AREA)
Abstract
The present invention provides a data processing method and device for radio astronomy data. The data processing method comprises an outermost loop, an intermediate loop and an innermost loop, and further comprises the following steps: distributing the computational load of each iteration of the outermost loop to different threads; and having each thread use vectorized instructions. Embodiments of the invention allocate the computational load of each loop iteration through multithreaded task scheduling (schedule), which improves the balance of computational load across threads, effectively optimizes deGridding and greatly improves its performance.
Description
Technical field
The invention belongs to the field of computing, and in particular relates to a data processing method and device for radio astronomy data.
Background
The Square Kilometre Array (SKA) is an international astronomy project that aims to build the world's largest aperture-synthesis radio telescope: 3,000 parabolic dish antennas 15 meters in diameter and 250 mid- and low-frequency aperture arrays, distributed over more than 3,000 kilometers, with a total collecting area of about 1 square kilometer. Its sensitivity is expected to be about 50 times that of the current largest radio telescope array (JVLA), and about 10,000 times that of China's current largest single-aperture radio telescope (FAST). According to plan, SKA will collect more than 12 Tb of data per second; processing this data volume would require nearly the combined performance of all supercomputers on the TOP500 list.
deGridding is the most computationally complex and most time-consuming data-processing step in SKA; nearly 30% of the project's data must be processed by this software. The degridding kernel consists of three nested loops. The outermost loop is the dind loop, whose computational load is nChan × nSamples, where nSamples is the number of data samples and nChan is the number of spectral channels. The intermediate loop is the suppv loop, whose load is the length of the X (or Y) axis of the convolution kernel. The innermost loop is the suppu loop, whose load is the length of the Y (or X) axis of the convolution kernel. At present the serial version of deGridding does not reach the desired speed; effectively optimizing deGridding would therefore greatly reduce the investment in computing platforms for SKA data processing.
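To make the three-loop structure concrete, here is a minimal serial sketch in C of a degridding kernel with an outermost dind loop, an intermediate suppv loop and an innermost suppu loop. It is not the SKA code: the row-major grid of width gSize, the footprint corner (iu, iv), the kernel offset cOffset and the function name degrid_serial are hypothetical, chosen only to show where the nChan × nSamples and kernel-size workloads come from.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Serial sketch of a three-loop degridding kernel (hypothetical layout):
 *   grid    : gSize x gSize complex grid, row-major
 *   C       : convolution kernels, sSize x sSize taps each, row-major
 *   iu, iv  : top-left corner of sample dind's footprint on the grid
 *   cOffset : index in C where sample dind's kernel starts
 *   nTotal  : nChan * nSamples, the outermost iteration count        */
void degrid_serial(size_t nTotal, size_t gSize, size_t sSize,
                   const float complex *grid, const float complex *C,
                   const size_t *iu, const size_t *iv,
                   const size_t *cOffset, float complex *data)
{
    assert(grid != NULL && C != NULL && data != NULL);
    for (size_t dind = 0; dind < nTotal; ++dind) {             /* outermost: dind     */
        float complex sum = 0.0f;
        size_t gind = iu[dind] + gSize * iv[dind];
        size_t cind = cOffset[dind];
        for (size_t suppv = 0; suppv < sSize; ++suppv) {       /* intermediate: suppv */
            for (size_t suppu = 0; suppu < sSize; ++suppu)     /* innermost: suppu    */
                sum += grid[gind + suppu] * C[cind + suppu];
            gind += gSize;   /* move to the next grid row   */
            cind += sSize;   /* move to the next kernel row */
        }
        data[dind] = sum;
    }
}
```

Each sample costs sSize × sSize complex multiply-adds, so the two inner loops are the natural vectorization target, while the outer loop is the natural place to split work across threads.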
Summary of the invention
Embodiments of the present invention provide a data processing method and device for radio astronomy data to solve the above problem.
The embodiment of the present invention provides a kind of data processing method for radio astronomy data.The data processing method bag
Outermost loop processing procedure, intermediate layer circulating treatment procedure and innermost loop processing procedure are included, it is further comprising the steps of:Will
The calculation amount of each iteration distributes to different threads in the outermost loop processing procedure;Each thread uses vector
Change instruction.
Embodiments of the present invention also provide a data processing device for radio astronomy data, used for processing radio astronomy data, where the processing comprises an outermost loop, an intermediate loop and an innermost loop. The data processing device comprises:
a data allocation unit, configured to distribute the computational load of each iteration of the outermost loop to different threads in the computing unit; and a computing unit, in which each thread uses vectorized instructions during computation.
Embodiments of the present invention allocate the computational load of each loop iteration through multithreaded task scheduling (schedule), which improves the balance of computational load across threads. The simd directive and the _mm_prefetch intrinsic vectorize the core computation and place the data participating in the calculation in cache ahead of time, which improves the efficiency of reads and writes from memory. The AVX-512 instruction set and MCDRAM cache mode significantly increase the computing capability of the deGridding program. Together, these measures effectively optimize deGridding and greatly improve its performance, with strong practicality and a wide range of application.
Brief description of the drawings
The accompanying drawings described herein provide a further understanding of the invention and form part of this application. The illustrative embodiments of the invention and their description serve to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a flowchart of the data processing method for radio astronomy data according to Embodiment 1 of the invention;
Fig. 2 is an abstract representation of the vectorization process according to Embodiment 1 of the invention;
Fig. 3 is a schematic diagram of a concrete implementation of the vectorized operation according to Embodiment 1 of the invention;
Fig. 4 is a structural diagram of the data processing device for radio astronomy data according to Embodiment 2 of the invention.
Detailed description of the embodiments
The invention is described in detail below with reference to the accompanying drawings and in combination with the embodiments. It should be noted that, where no conflict arises, the embodiments of this application and the features in those embodiments may be combined with one another.
Fig. 1 shows the processing flow of the data processing method for radio astronomy data according to Embodiment 1 of the invention. The data processing method comprises an outermost loop, an intermediate loop and an innermost loop, and further comprises the following steps:
Step 102: distribute the computational load of each iteration of the outermost loop to different threads;
Step 104: have each thread use vectorized instructions.
In step 102, the computational load is distributed to different threads using the schedule clause of the OpenMP parallel construct, and the iterations are scheduled dynamically using the dynamic schedule type of the schedule clause.
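A minimal sketch of step 102 with OpenMP, under a hypothetical data layout described in the code comment (the function name degrid_omp_dynamic is illustrative): the outermost dind loop is annotated with `#pragma omp parallel for schedule(dynamic, chunk)`, so an idle thread takes the next chunk of iterations from the runtime's queue. The chunk size is an assumed tuning parameter, to be set by measurement.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Sketch: the outermost (dind) loop is shared among OpenMP threads with
 * schedule(dynamic, chunk). Hypothetical layout: grid is gSize x gSize
 * row-major, (iu[dind], iv[dind]) is the top-left corner of the sample's
 * footprint, and its sSize x sSize kernel starts at C[cOffset[dind]].
 * Each iteration writes only data[dind], so no two threads write the
 * same element. Build with OpenMP enabled (e.g. -fopenmp); without it
 * the pragma is ignored and the loop runs serially with the same result. */
void degrid_omp_dynamic(size_t nTotal, size_t gSize, size_t sSize,
                        const float complex *grid, const float complex *C,
                        const size_t *iu, const size_t *iv,
                        const size_t *cOffset, float complex *data, int chunk)
{
    assert(chunk > 0);
    #pragma omp parallel for schedule(dynamic, chunk)
    for (long dind = 0; dind < (long)nTotal; ++dind) {
        float complex sum = 0.0f;
        size_t gind = iu[dind] + gSize * iv[dind];
        size_t cind = cOffset[dind];
        for (size_t suppv = 0; suppv < sSize; ++suppv) {
            for (size_t suppu = 0; suppu < sSize; ++suppu)
                sum += grid[gind + suppu] * C[cind + suppu];
            gind += gSize;
            cind += sSize;
        }
        data[dind] = sum;
    }
}
```

Dynamic scheduling pays a small queue cost per chunk; larger chunks reduce that overhead but make the tail of the loop less balanced.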
Specifically, the schedule clause of the OpenMP parallel construct distributes the nChan × nSamples iterations to different threads. Because the computational load of the loop iterations is unbalanced, and to keep threads from waiting on one another, the dynamic schedule type is used so that iterations are assigned according to run-time state and available system resources. Allocating the load of each iteration through multithreaded task scheduling (schedule) avoids data dependencies between threads and improves the balance of computational load across threads. Performance tuning requires a trade-off between optimizing memory use and optimizing load balance; the best setting is found by measuring performance. The dynamic schedule uses an internal queue: whenever a thread becomes available, it is assigned the number of loop iterations specified by the chunk size. A single node contains 64 cores with 1 thread per core, and hyper-threading is not used, so the maximum number of threads is set to 64. With np = 8 and OMP_NUM_THREADS = 8 (8 × 8 = 64 threads), the nChan × nSamples = 800,000 data samples are divided into 64 blocks (800,000 / 64 = 12,500 per thread).
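The 12,500-per-thread figure is plain block partitioning. The sketch below (block_range is a hypothetical helper) computes each thread's contiguous [begin, end) range; it approximates what OpenMP's schedule(static) without a chunk size does, although the exact split there is implementation-defined.

```c
#include <assert.h>
#include <stddef.h>

/* Even block split of nTotal = nChan * nSamples iterations over nThreads
 * threads. Thread t gets [*begin, *end). If nTotal is not a multiple of
 * nThreads, the first (nTotal % nThreads) threads get one extra iteration. */
void block_range(size_t nTotal, size_t nThreads, size_t t,
                 size_t *begin, size_t *end)
{
    assert(nThreads > 0 && t < nThreads);
    size_t base  = nTotal / nThreads;
    size_t extra = nTotal % nThreads;
    *begin = t * base + (t < extra ? t : extra);
    *end   = *begin + base + (t < extra ? 1 : 0);
}
```

For example, block_range(800000, 64, t, ...) gives every thread t exactly 12,500 iterations, matching the figure above.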
In step 104, provided that the dependences among the vectorized variables remain correct, #pragma simd is used to vectorize the loop effectively. On machines that support the 512-bit vector instruction extensions, the compiler generates the corresponding instructions to execute the loop in vectorized form. Fig. 2 is an abstract representation of the vectorization process: a single operation processes a whole vector, which provides data parallelism more efficiently than scalar code. VL in the figure denotes the vector length, i.e. the number of scalars of the same data type (such as int or float) that a vector contains. Fig. 3 shows a concrete implementation of the vectorized operation. When vectorlength(8) is specified, one iteration of the vector loop theoretically corresponds to 8 iterations of the scalar loop. Each value has a real part and an imaginary part, and sizeof(float) = 4 bytes, so one complex value occupies 8 bytes and each vector operation is (4 × 8 bits) × 16 = 512 bits wide, i.e. 16 floats or 8 complex values. The theoretical number of vector-loop iterations is sSize/16. A private array of size 16 is created to hold the values of grid after mapping by the convolution kernel, i.e. the results of the multiply-add operations between grid and C; this lets the program make full use of the vector units to accelerate the computation, and the result is stored in data_local. In addition, unless #pragma nounroll is added to disable loop unrolling, the compiler may unroll the loop, so the actual number of loop iterations may be smaller.
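The 16-wide private accumulator can be sketched as follows, assuming single-precision data and a 512-bit vector unit (16 float lanes). Two assumptions differ from the text: grid and C are stored as separate real and imaginary float arrays (structure of arrays), and the portable OpenMP 4.0 `#pragma omp simd` stands in for Intel's `#pragma simd`. The name row_dot_lanes is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 16  /* 512-bit vector / 32-bit float = 16 lanes (assumed AVX-512 target) */

/* Complex dot product of one kernel row with split real/imag arrays.
 * Partial sums live in private 16-element arrays (in the spirit of
 * data_local), so each lane accumulates independently. */
void row_dot_lanes(size_t n,
                   const float *g_re, const float *g_im,
                   const float *c_re, const float *c_im,
                   float *out_re, float *out_im)
{
    assert(out_re != NULL && out_im != NULL);
    float acc_re[LANES] = {0.0f}, acc_im[LANES] = {0.0f};
    size_t u = 0;
    for (; u + LANES <= n; u += LANES) {        /* about n / 16 vector iterations */
        #pragma omp simd
        for (int l = 0; l < LANES; ++l) {
            /* (a + bi)(c + di) = (ac - bd) + (ad + bc)i */
            acc_re[l] += g_re[u + l] * c_re[u + l] - g_im[u + l] * c_im[u + l];
            acc_im[l] += g_re[u + l] * c_im[u + l] + g_im[u + l] * c_re[u + l];
        }
    }
    float re = 0.0f, im = 0.0f;
    for (int l = 0; l < LANES; ++l) {           /* horizontal reduction of the lanes */
        re += acc_re[l];
        im += acc_im[l];
    }
    for (; u < n; ++u) {                         /* scalar tail when n % 16 != 0 */
        re += g_re[u] * c_re[u] - g_im[u] * c_im[u];
        im += g_re[u] * c_im[u] + g_im[u] * c_re[u];
    }
    *out_re = re;
    *out_im = im;
}
```

Keeping 16 independent partial sums removes the loop-carried dependence on a single accumulator, which lets the compiler map the lane loop onto one vector register. Because the order of additions changes, results on non-integer data may differ from a scalar loop in the last bits.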
Further, the OpenMP simd directive is used to group the calculation operations in the loop among threads;
each thread is scheduled to execute several data blocks and uses simd instructions to execute the loop that follows.
That is, the calculation operations in the for loop are grouped among threads with the OpenMP simd directive: each thread executes several data blocks according to the OpenMP runtime schedule and executes the following loop with simd instructions, so every thread can use vectorized instructions to accelerate the loop.
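Combining thread grouping with per-thread vectorization corresponds to OpenMP's composite `parallel for simd` construct. The sketch below applies it to a hypothetical element-wise complex multiply-accumulate (cmac_parallel_simd is an illustrative name, not from the patent): chunks of `chunk` iterations are handed to threads dynamically, and each thread vectorizes the chunk it receives.

```c
#include <assert.h>
#include <stddef.h>

/* out += a * b for complex values stored as split real/imag arrays.
 * The arrays must not overlap: omp simd asserts that vectorizing is safe. */
void cmac_parallel_simd(size_t n, int chunk,
                        const float *a_re, const float *a_im,
                        const float *b_re, const float *b_im,
                        float *out_re, float *out_im)
{
    assert(chunk > 0);
    #pragma omp parallel for simd schedule(dynamic, chunk)
    for (long i = 0; i < (long)n; ++i) {
        out_re[i] += a_re[i] * b_re[i] - a_im[i] * b_im[i];
        out_im[i] += a_re[i] * b_im[i] + a_im[i] * b_re[i];
    }
}
```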
Further, the data processing method for radio astronomy data may also comprise:
prefetching the data participating in the calculation into cache through prefetch instructions inserted by the compiler.
Specifically, the SSE intrinsic _mm_prefetch is used to optimize memory copies by reading data into cache before they are actually accessed. Its signature is void _mm_prefetch(char const *a, int sel); it corresponds to the PREFETCH instructions and tells the processor to load the cache line containing address a into a faster cache level, with sel giving the type of prefetch. The prefetch instructions and their hint types are listed in Table 1. NTA denotes a non-temporal prefetch, which reduces cache pollution; T0 fetches the data into all cache levels; T1 prefetches into the L2 and L3 caches but not L1; T2 fetches the data only into the L3 cache. Because the program writes to these cache lines or accesses them multiple times, the data are fetched into all cache levels (T0). In the implementation, grid and C are each prefetched with _mm_prefetch. Since grid and C can be treated as 2D arrays, several rows of data are loaded into the faster cache, and all prefetched elements are traversed during the corresponding multiplications. The values of PF3 and PF4 used with _mm_prefetch were chosen by the experiment in Table 2: with PF4 = 2 and PF3 = 1 the single-thread processing time is shortest, i.e. prefetching 2 rows of grid and 1 row of C at a time gives the best performance.
Instruction | Hint (sel) |
---|---|
PREFETCHNTA | _MM_HINT_NTA |
PREFETCHT0 | _MM_HINT_T0 |
PREFETCHT1 | _MM_HINT_T1 |
PREFETCHT2 | _MM_HINT_T2 |
Table 1: Prefetch instructions and corresponding hint types
Table 2
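A sketch of row-ahead prefetching with _mm_prefetch and the T0 hint, using the PF4 = 2 (grid rows) and PF3 = 1 (kernel rows) values reported for the single-thread experiment. Assumptions: a 64-byte cache line, a hypothetical row-major layout where gind and cind are the start indices of one sample's grid footprint and kernel, the illustrative name degrid_one_prefetch, and a fallback to the GCC/Clang __builtin_prefetch on non-x86 targets. Prefetching is only a hint, so the result is identical to the version without it; only timing can change.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#define PREFETCH_T0(p) _mm_prefetch((const char *)(p), _MM_HINT_T0)
#else  /* non-x86 fallback (GCC/Clang builtin): read access, keep in all levels */
#define PREFETCH_T0(p) __builtin_prefetch((p), 0, 3)
#endif

#define PF4 2          /* grid rows prefetched ahead   */
#define PF3 1          /* kernel rows prefetched ahead */
#define LINE_BYTES 64  /* assumed cache-line size      */

/* Issue one prefetch per cache line of a row of len complex values. */
static void prefetch_row(const float complex *row, size_t len)
{
    const char *p = (const char *)row;
    for (size_t b = 0; b < len * sizeof *row; b += LINE_BYTES)
        PREFETCH_T0(p + b);
}

/* Degrid one sample: grid footprint starts at gind (row stride gSize),
 * its sSize x sSize kernel starts at C[cind]. */
float complex degrid_one_prefetch(size_t gSize, size_t sSize,
                                  const float complex *grid, size_t gind,
                                  const float complex *C, size_t cind)
{
    assert(grid != NULL && C != NULL);
    float complex sum = 0.0f;
    for (size_t v = 0; v < sSize; ++v) {
        /* Request the next PF4 grid rows and PF3 kernel rows (bounds-checked)
           while row v is being computed. */
        for (size_t k = 1; k <= PF4 && v + k < sSize; ++k)
            prefetch_row(&grid[gind + (v + k) * gSize], sSize);
        for (size_t k = 1; k <= PF3 && v + k < sSize; ++k)
            prefetch_row(&C[cind + (v + k) * sSize], sSize);
        for (size_t u = 0; u < sSize; ++u)
            sum += grid[gind + v * gSize + u] * C[cind + v * sSize + u];
    }
    return sum;
}
```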
Further, MCDRAM is configured in cache mode, so that it serves as a last-level cache between the L2 cache and DDR4 memory.
In addition, embodiments of the invention are compiled for Intel's AVX-512 instruction set, which greatly extends the existing SIMD instruction sets, to improve the computational performance of the program; the vector processing units (VPUs) of Intel Xeon Phi support the 512-bit vector instruction extensions.
Thus, embodiments of the invention allocate the computational load of each loop iteration through multithreaded task scheduling (schedule), which avoids data dependencies between threads and improves the balance of each thread's computational load when processing astronomical sample data. The deGridding kernel is parallelized with OpenMP, with thread fork and join placed at the outermost loop; the total data volume is divided evenly by the number of OpenMP threads, and each thread writes to its own memory space. simd is used to vectorize loops effectively on machines that support the 512-bit vector instruction extensions, taking the 512-bit vector width of the Xeon Phi processor into account, and MCDRAM is fully used according to the space occupied by individual arrays, which speeds up reads and writes. Exploiting the processor's ability to prefetch several independent data streams at once, when an array is accessed as a[j] the compiler inserts a software prefetch instruction that loads a[j+d] into cache, i.e. the cache line containing that address is brought into a faster cache level. This improves the computational performance of the program and greatly reduces the investment in computing platforms for SKA data processing.
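The a[j] / a[j+d] pattern is what a compiler-inserted software prefetch amounts to; written by hand it looks like the sketch below. The prefetch distance d is an assumed tuning parameter (it has to cover memory latency, so it is measured per machine), sum_with_prefetch is a hypothetical name, and the bounds check keeps the prefetched address inside the array.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#define PREFETCH_T0(p) _mm_prefetch((const char *)(p), _MM_HINT_T0)
#else  /* non-x86 fallback (GCC/Clang builtin) */
#define PREFETCH_T0(p) __builtin_prefetch((p), 0, 3)
#endif

/* While a[j] is consumed, a[j + d] is requested into cache. */
float sum_with_prefetch(const float *a, size_t n, size_t d)
{
    assert(a != NULL || n == 0);
    float s = 0.0f;
    for (size_t j = 0; j < n; ++j) {
        if (j + d < n)
            PREFETCH_T0(&a[j + d]);
        s += a[j];
    }
    return s;
}
```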
Fig. 4 is a structural diagram of the data processing device for radio astronomy data according to Embodiment 2 of the invention.
As shown in Fig. 4, the data processing device for radio astronomy data according to an embodiment of the invention is used for processing radio astronomy data, where the processing comprises an outermost loop, an intermediate loop and an innermost loop. The data processing device comprises:
Data allocation unit 402, configured to distribute the computational load of each iteration of the outermost loop to different threads in the computing unit;
Computing unit 404, in which each thread uses vectorized instructions during computation.
Further, the data allocation unit 402 distributes the computational load to different threads using the schedule clause of the OpenMP parallel construct, and schedules the iterations dynamically using the dynamic schedule type of the schedule clause.
Further, the computing unit 404 uses the OpenMP simd directive to group the calculation operations in the loop among threads; each thread is scheduled to execute several data blocks and uses simd instructions to execute the loop that follows.
Further, the data processing device for radio astronomy data may also comprise a prefetch unit 406, configured to prefetch the data participating in the calculation into cache through prefetch instructions inserted by the compiler.
The prefetch unit 406 is further configured to set MCDRAM to cache mode, so that MCDRAM serves as a last-level cache between the L2 cache and DDR4 memory.
The foregoing describes only preferred embodiments of the invention and is not intended to limit it; for those skilled in the art, the invention may be modified and varied in many ways. Any modification, equivalent substitution, improvement, etc. made within the spirit and principles of the invention shall fall within its scope of protection.
Claims (10)
1. A data processing method for radio astronomy data, characterized in that the data processing method comprises an outermost loop process, an intermediate loop process and an innermost loop process, and further comprises the following steps:
distributing the computational load of each iteration of the outermost loop process to different threads;
each thread using vectorized instructions.
2. The method according to claim 1, characterized in that the computational load is distributed to different threads using the schedule clause of the OpenMP parallel construct;
and the iterations are scheduled dynamically using the dynamic schedule type of the schedule clause.
3. The method according to claim 2, characterized in that the OpenMP simd directive is used to group the calculation operations in the loop among threads;
each thread is scheduled to execute several data blocks and uses simd instructions to execute the loop that follows.
4. The method according to any one of claims 1 to 3, characterized by further comprising:
prefetching the data participating in the calculation into cache through prefetch instructions inserted by the compiler.
5. The method according to claim 4, characterized in that MCDRAM is configured in cache mode, so that the MCDRAM serves as a last-level cache between the L2 cache and DDR4 memory.
6. A data processing device for radio astronomy data, characterized in that it is used for processing radio astronomy data, the processing comprising an outermost loop process, an intermediate loop process and an innermost loop process, and the data processing device comprises:
a data allocation unit, configured to distribute the computational load of each iteration of the outermost loop process to different threads in a computing unit;
the computing unit, in which each thread uses vectorized instructions during computation.
7. The device according to claim 6, characterized in that the data allocation unit distributes the computational load to different threads using the schedule clause of the OpenMP parallel construct, and schedules the iterations dynamically using the dynamic schedule type of the schedule clause.
8. The device according to claim 7, characterized in that the computing unit uses the OpenMP simd directive to group the calculation operations in the loop among threads, each thread being scheduled to execute several data blocks and using simd instructions to execute the loop that follows.
9. The device according to any one of claims 6 to 8, characterized by further comprising: a prefetch unit, configured to prefetch the data participating in the calculation into cache through prefetch instructions inserted by the compiler.
10. The device according to claim 9, characterized in that the prefetch unit is further configured to set MCDRAM to cache mode, so that the MCDRAM serves as a last-level cache between the L2 cache and DDR4 memory.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711148902.XA CN107908477A (en) | 2017-11-17 | 2017-11-17 | A kind of data processing method and device for radio astronomy data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711148902.XA CN107908477A (en) | 2017-11-17 | 2017-11-17 | A kind of data processing method and device for radio astronomy data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107908477A true CN107908477A (en) | 2018-04-13 |
Family ID: 61846296
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711148902.XA Pending CN107908477A (en) | 2017-11-17 | 2017-11-17 | A kind of data processing method and device for radio astronomy data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107908477A (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090307655A1 (en) * | 2008-06-10 | 2009-12-10 | Keshav Kumar Pingali | Programming Model and Software System for Exploiting Parallelism in Irregular Programs |
CN104375838A (en) * | 2014-11-27 | 2015-02-25 | 浪潮电子信息产业股份有限公司 | OpenMP (open mesh point protocol) -based astronomy software Griding optimization method |
CN104504257A (en) * | 2014-12-12 | 2015-04-08 | 国家电网公司 | Double parallel computing-based on-line Prony analysis method |
CN105260175A (en) * | 2015-09-16 | 2016-01-20 | 浪潮(北京)电子信息产业有限公司 | Method for processing Gridding in astronomy software based on OpenMP |
CN106020773A (en) * | 2016-05-13 | 2016-10-12 | 中国人民解放军信息工程大学 | Optimization Method of Finite Difference Algorithm in Heterogeneous Many-Core Architecture |
CN106383961A (en) * | 2016-09-29 | 2017-02-08 | 中国南方电网有限责任公司电网技术研究中心 | Large vortex simulation algorithm optimization processing method under CPU + MIC heterogeneous platform |
CN106598552A (en) * | 2016-12-22 | 2017-04-26 | 郑州云海信息技术有限公司 | Data point conversion method and device based on Gridding module |
CN106897131A (en) * | 2017-02-22 | 2017-06-27 | 郑州云海信息技术有限公司 | A kind of parallel calculating method and its device for astronomical software Gridding |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108509279A (en) * | 2018-04-16 | 2018-09-07 | 郑州云海信息技术有限公司 | A kind of processing method, device and storage medium for radio astronomy data |
CN108874547A (en) * | 2018-06-27 | 2018-11-23 | 郑州云海信息技术有限公司 | A kind of data processing method and device of astronomy software Gridding |
WO2020077565A1 (en) * | 2018-10-17 | 2020-04-23 | 北京比特大陆科技有限公司 | Data processing method and apparatus, electronic device, and computer readable storage medium |
CN112740174A (en) * | 2018-10-17 | 2021-04-30 | 北京比特大陆科技有限公司 | Data processing method and device, electronic equipment and computer readable storage medium |
CN112740174B (en) * | 2018-10-17 | 2024-02-06 | 北京比特大陆科技有限公司 | Data processing method, device, electronic equipment and computer readable storage medium |
CN114661637A (en) * | 2022-02-28 | 2022-06-24 | 中国科学院上海天文台 | Data processing system and method for radio astronomical data intensive scientific operation |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
TA01 | Transfer of patent application right | Effective date of registration: 20190227. Address after: 100085 Beijing Haidian District Shangdi Information Road 2-1 C Building 1 Floor. Applicant after: INSPUR (BEIJING) ELECTRONIC INFORMATION INDUSTRY Co.,Ltd. Address before: Room 1601, floor 16, 278 Xinyi Road, Zhengdong New District, Zhengzhou City, Henan Province. Applicant before: ZHENGZHOU YUNHAI INFORMATION TECHNOLOGY Co.,Ltd. |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180413 |