CN106528054A - GPU (Graphics Processing Unit) accelerated dense vector addition computing method - Google Patents


Info

Publication number
CN106528054A
CN106528054A (application CN201610955722.1A)
Authority
CN
China
Prior art keywords
GPU
threads
vector
addition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610955722.1A
Other languages
Chinese (zh)
Inventor
邹风华
琚天鹏
李鸣
李一鸣
郝韶航
周赣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN201610955722.1A priority Critical patent/CN106528054A/en
Publication of CN106528054A publication Critical patent/CN106528054A/en
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a GPU (Graphics Processing Unit) accelerated dense vector addition computing method. The method is applied to accelerating the addition operation A + B = C for dense vectors, where A and B denote the vectors to be added and C denotes the result vector. The specific implementation steps are: the CPU generates the data required for computing the vector addition on the GPU; the CPU transfers the data to the GPU; the element-wise addition task of vectors A and B is allocated to GPU threads; and the vector addition kernel function is executed on the GPU. In this method, the main tasks of the CPU are generating and transferring the data and scheduling the main program, while the vector addition itself is completed by the GPU kernel function. By exploiting the GPU's highly concurrent hardware, the speed of dense vector addition is greatly improved.

Description

GPU-accelerated dense vector addition computing method
Technical field
The invention belongs to the field of high-performance computing applications in power systems, and more particularly relates to a GPU-accelerated dense vector addition computing method.
Background art
A GPU (Graphics Processing Unit) is a many-core parallel processor whose number of processing units far exceeds that of a CPU. Traditionally the GPU was responsible only for graphics rendering, with most other processing handed to the CPU. Today's GPU is a multi-core, multi-threaded programmable processor with powerful computing capability and high memory bandwidth. Under the general-purpose computing model, the GPU serves as a coprocessor of the CPU, and high-performance computing is achieved by reasonably distributing and decomposing tasks between the two.
Dense vector computation is inherently parallel: the numerical operations at corresponding positions of the vectors are independent of one another, with no data dependencies, so they can naturally be processed in parallel and are well suited to GPU acceleration.
This class of operations can be completed through reasonable scheduling between the CPU and the GPU.
Summary of the invention
Objective of the invention: To overcome the deficiencies of the prior art, the present invention provides a GPU-accelerated dense vector addition computing method, which addresses the technical defect that dense vector addition is time-consuming to compute.
Technical solution: To achieve the above objective, the technical solution of the invention is as follows:
A GPU-accelerated dense vector addition computing method, the method being suited to accelerating the addition operation of dense vectors A + B = C, where A and B denote the vectors to be added and C denotes the result vector, the method comprising:
(1) allocating, on the CPU, the data space required for the GPU computation, and transferring the data required for the GPU computation to the GPU;
(2) assigning the task of adding the corresponding elements of vectors A and B to a large number of threads on the GPU;
(3) executing the vector addition kernel function Kernel_plus on the GPU.
In step (1), the data required by the GPU kernel function are prepared on the CPU as follows: vectors A, B, and C are stored in array format; the data include the vector element count n, the vector arrays DataA and DataB, and the result array DataC.
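The data preparation and transfer of step (1) can be sketched with the CUDA runtime API roughly as follows. This is an illustrative sketch, not the patent's actual code: the function name prepare_gpu_data and the host/device pointer names (h_A, d_A, and so on) are assumptions introduced here.

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

/* Allocate device buffers for DataA, DataB, DataC and copy the two input
 * vectors from host (CPU) memory to device (GPU) memory.
 * Returns 0 on success, nonzero on any CUDA error. */
int prepare_gpu_data(const float *h_A, const float *h_B, int n,
                     float **d_A, float **d_B, float **d_C)
{
    size_t bytes = (size_t)n * sizeof(float);
    if (cudaMalloc((void **)d_A, bytes) != cudaSuccess) return 1;
    if (cudaMalloc((void **)d_B, bytes) != cudaSuccess) return 1;
    if (cudaMalloc((void **)d_C, bytes) != cudaSuccess) return 1;
    /* Transfer the inputs; the result buffer d_C needs no initial copy. */
    if (cudaMemcpy(*d_A, h_A, bytes, cudaMemcpyHostToDevice) != cudaSuccess) return 1;
    if (cudaMemcpy(*d_B, h_B, bytes, cudaMemcpyHostToDevice) != cudaSuccess) return 1;
    return 0;
}
```

The device pointers returned here would then be passed to the kernel launch of step (3).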
In step (2), the task of adding the n elements of vectors A and B is assigned to a large number of threads on the GPU; that is, one kernel thread is responsible for adding one pair of corresponding elements of vectors A and B, e.g. DataA[1] + DataB[1] = DataC[1].
In step (3), the kernel function that completes the vector addition is defined as Kernel_plus<Nblocks, Nthreads>. The thread block size is fixed at 128, i.e. Nthreads = 128; the number of thread blocks is Nblocks = n/128, so the total number of threads is n. The kernel function Kernel_plus<Nblocks, Nthreads> is called to compute the vector addition.
The computation flow of kernel function Kernel_plus<Nblocks, Nthreads> is:
(4.1) CUDA automatically assigns each thread a thread-block index blockID and a within-block thread index threadID;
(4.2) blockID is assigned to variable tid and threadID is assigned to variable k; thereafter, tid and k index the k-th thread in the tid-th thread block;
(4.3) variable t = tid*128 + k;
(4.4) the k-th thread in the tid-th thread block is responsible for adding the t-th elements of vectors A and B: Ct = At + Bt, where At, Bt, and Ct are the t-th elements of vectors A, B, and C respectively.
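Steps (4.1)–(4.4), together with the launch configuration of step (3), correspond to a kernel along the following lines. This is a sketch under stated assumptions, not the patent's actual code: the guard t < n and the rounded-up block count are additions so the kernel stays correct when n is not an exact multiple of 128 (the patent's Nblocks = n/128 implicitly assumes it is), and run_vector_add is an illustrative round-trip harness introduced here.

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

#define NTHREADS 128  /* thread block size fixed at 128 (Nthreads = 128) */

/* Steps (4.1)-(4.4): one thread per element.  The CUDA built-ins
 * blockIdx.x and threadIdx.x play the roles of blockID and threadID. */
__global__ void Kernel_plus(const float *DataA, const float *DataB,
                            float *DataC, int n)
{
    int tid = blockIdx.x;           /* (4.2) block index  -> tid */
    int k   = threadIdx.x;          /* (4.2) thread index -> k   */
    int t   = tid * NTHREADS + k;   /* (4.3) t = tid*128 + k     */
    if (t < n)                      /* guard for n not a multiple of 128 */
        DataC[t] = DataA[t] + DataB[t];  /* (4.4) Ct = At + Bt */
}

/* Full round trip: allocate, transfer, launch with Nblocks blocks of 128
 * threads, copy back, and check DataC[t] == DataA[t] + DataB[t] for all t.
 * Returns 0 on success. */
int run_vector_add(int n)
{
    size_t bytes = (size_t)n * sizeof(float);
    float *h_A = (float *)malloc(bytes), *h_B = (float *)malloc(bytes);
    float *h_C = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_A[i] = (float)i; h_B[i] = (float)(n - i); }

    float *d_A, *d_B, *d_C;
    cudaMalloc((void **)&d_A, bytes);
    cudaMalloc((void **)&d_B, bytes);
    cudaMalloc((void **)&d_C, bytes);
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    int Nblocks = (n + NTHREADS - 1) / NTHREADS;  /* n/128, rounded up */
    Kernel_plus<<<Nblocks, NTHREADS>>>(d_A, d_B, d_C, n);
    cudaDeviceSynchronize();
    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);

    int bad = 0;
    for (int t = 0; t < n; ++t)
        if (h_C[t] != h_A[t] + h_B[t]) { bad = 1; break; }

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return bad;
}
```

Because each thread touches exactly one element and consecutive threads touch consecutive elements, the memory accesses are fully coalesced, which is what makes this trivially parallel pattern fast on a GPU.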
Beneficial effects: Compared with the prior art, the data required for the vector addition computed on the GPU are generated on the CPU; the CPU then transfers the data to the GPU; the element-wise addition task of the vectors A + B = C is distributed to GPU threads; and the vector addition kernel function is then executed on the GPU. This significantly reduces the computation time of dense vector addition.
Description of the drawings
Figure 1 is a flow diagram of the present invention.
Specific embodiment
The present invention is further described below with reference to the accompanying drawing.
As shown in Figure 1, the present invention is a GPU-accelerated dense vector addition computing method. The method is suited to accelerating the addition operation of dense vectors A + B = C, where A and B denote the vectors to be added and C denotes the result vector. The method is characterized in that it comprises:
(1) the data required for the vector addition computed on the GPU are generated on the CPU; the data space required for the GPU computation is allocated, and the data are transferred to the GPU;
(2) the task of adding the corresponding elements of vectors A and B is assigned to a large number of threads on the GPU;
(3) the vector addition kernel function Kernel_plus is executed on the GPU.
In step (1), the data required by the GPU kernel function are prepared on the CPU as follows: vectors A, B, and C are stored in array format; the data include the vector element count n, the vector arrays DataA and DataB, and the result array DataC.
In step (2), one kernel thread is responsible for adding one pair of corresponding elements of vectors A and B, e.g. DataA[1] + DataB[1] = DataC[1].
In step (3), the kernel function that completes the vector addition is defined as Kernel_plus<Nblocks, Nthreads>. The thread block size is fixed at 128, i.e. Nthreads = 128; the number of thread blocks is Nblocks = n/128, so the total number of threads is n. The kernel function Kernel_plus<Nblocks, Nthreads> is called to compute the vector addition. The computation flow of kernel function Kernel_plus<Nblocks, Nthreads> is:
(4.1) CUDA automatically assigns each thread a thread-block index blockID and a within-block thread index threadID;
(4.2) blockID is assigned to variable tid and threadID is assigned to variable k; thereafter, tid and k index the k-th thread in the tid-th thread block;
(4.3) variable t = tid*128 + k;
(4.4) the k-th thread in the tid-th thread block is responsible for adding the t-th elements of vectors A and B: Ct = At + Bt, where At, Bt, and Ct are the t-th elements of vectors A, B, and C respectively.
The above is only a preferred embodiment of the present invention. It should be pointed out that those of ordinary skill in the art can make several improvements and modifications without departing from the principles of the invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (4)

  1. A GPU-accelerated dense vector addition computing method, the method being suited to accelerating the addition operation of dense vectors A + B = C, where A and B denote the vectors to be added and C denotes the result vector, characterized in that the method comprises:
    (1) allocating, on the CPU, the data space required for the GPU computation, and transferring the data required for the GPU computation to the GPU;
    (2) assigning the task of adding the corresponding elements of vectors A and B to a large number of threads on the GPU;
    (3) executing the vector addition kernel function Kernel_plus on the GPU.
  2. The GPU-accelerated dense vector addition computing method according to claim 1, characterized in that: in step (1), the data required by the GPU kernel function are prepared on the CPU as follows: vectors A, B, and C are stored in array format; the data include the vector element count n, the vector arrays DataA and DataB, and the result array DataC.
  3. The GPU-accelerated dense vector addition computing method according to claim 1, characterized in that: in step (2), the task of adding the n elements of vectors A and B is assigned to a large number of threads on the GPU; that is, one kernel thread is responsible for adding one pair of corresponding elements of vectors A and B, e.g. DataA[1] + DataB[1] = DataC[1].
  4. The GPU-accelerated dense vector addition computing method according to claim 1, characterized in that: in step (3), the kernel function that completes the vector addition is defined as Kernel_plus<Nblocks, Nthreads>; the thread block size is fixed at 128, i.e. Nthreads = 128; the number of thread blocks is Nblocks = n/128, and the total number of threads is n; the kernel function Kernel_plus<Nblocks, Nthreads> is called to compute the vector addition;
    The computation flow of kernel function Kernel_plus<Nblocks, Nthreads> is:
    (4.1) CUDA automatically assigns each thread a thread-block index blockID and a within-block thread index threadID;
    (4.2) blockID is assigned to variable tid and threadID is assigned to variable k; thereafter, tid and k index the k-th thread in the tid-th thread block;
    (4.3) variable t = tid*128 + k;
    (4.4) the k-th thread in the tid-th thread block is responsible for adding the t-th elements of vectors A and B: Ct = At + Bt, where At, Bt, and Ct are the t-th elements of vectors A, B, and C respectively.
CN201610955722.1A 2016-11-03 2016-11-03 GPU (Graphics Processing Unit) accelerated dense vector addition computing method Pending CN106528054A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610955722.1A CN106528054A (en) 2016-11-03 2016-11-03 GPU (Graphics Processing Unit) accelerated dense vector addition computing method


Publications (1)

Publication Number Publication Date
CN106528054A (en) 2017-03-22

Family

ID=58325470

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610955722.1A Pending CN106528054A (en) 2016-11-03 2016-11-03 GPU (Graphics Processing Unit) accelerated dense vector addition computing method

Country Status (1)

Country Link
CN (1) CN106528054A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080208942A1 (en) * 2007-02-23 2008-08-28 Nara Won Parallel Architecture for Matrix Transposition
CN101937425A (en) * 2009-07-02 2011-01-05 北京理工大学 Matrix parallel transposition method based on GPU multi-core platform
CN103543989A (en) * 2013-11-11 2014-01-29 镇江中安通信科技有限公司 Adaptive parallel processing method aiming at variable length characteristic extraction for big data
US8984043B2 (en) * 2009-12-23 2015-03-17 Intel Corporation Multiplying and adding matrices
CN105574809A (en) * 2015-12-16 2016-05-11 天津大学 Matrix exponent-based parallel calculation method for electromagnetic transient simulation graphic processor


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
我是郭俊辰: "CUDA编程入门：向量加法和矩阵乘法" (CUDA Programming Primer: Vector Addition and Matrix Multiplication), https://blog.csdn.net/u014030117/article/details/45952971 *

Similar Documents

Publication Publication Date Title
CN103049241B (en) A kind of method improving CPU+GPU isomery device calculated performance
CN104834561B (en) A kind of data processing method and device
CN103617150A (en) GPU (graphic processing unit) based parallel power flow calculation system and method for large-scale power system
Xiong et al. Efficient parallel implementation of the lattice Boltzmann method on large clusters of graphic processing units
CN109191364A (en) Accelerate the hardware structure of artificial intelligence process device
Kelly GPU computing for atmospheric modeling
CN105183562B (en) A method of rasterizing data are carried out based on CUDA technologies to take out rank
CN101833438A (en) General data processing method based on multiple parallel
Behrens et al. Efficient SIMD Vectorization for Hashing in OpenCL.
Shahbahrami et al. Parallel implementation of Gray Level Co-occurrence Matrices and Haralick texture features on cell architecture
CN105373517A (en) Spark-based distributed matrix inversion parallel operation method
CN102298567A (en) Mobile processor architecture integrating central operation and graphic acceleration
CN106776466A (en) A kind of FPGA isomeries speed-up computation apparatus and system
CN106026107B (en) A kind of QR decomposition method for the direction of energy Jacobian matrix that GPU accelerates
CN110781446A (en) Method for rapidly calculating average vorticity deviation of ocean mesoscale vortex Lagrange
CN103413273A (en) Method for rapidly achieving image restoration processing based on GPU
CN103713938A (en) Multi-graphics-processing-unit (GPU) cooperative computing method based on Open MP under virtual environment
CN102841881A (en) Multiple integral computing method based on many-core processor
CN104572588B (en) Matrix inversion process method and apparatus
CN107256203A (en) The implementation method and device of a kind of matrix-vector multiplication
CN106528054A (en) GPU (Graphics Processing Unit) accelerated dense vector addition computing method
CN106934757A (en) Monitor video foreground extraction accelerated method based on CUDA
CN103577160A (en) Characteristic extraction parallel-processing method for big data
CN109408148A (en) A kind of production domesticization computing platform and its apply accelerated method
CN104463940A (en) Hybrid tree parallel construction method based on GPU

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170322