CN106528054A - GPU (Graphics Processing Unit) accelerated dense vector addition computing method - Google Patents
- Publication number
- CN106528054A, application CN201610955722.1A
- Authority
- CN
- China
- Prior art keywords
- gpu
- threads
- vector
- addition
- vectorial
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3885—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Complex Calculations (AREA)
Abstract
The invention discloses a GPU (Graphics Processing Unit) accelerated dense vector addition computing method. The method is applied to accelerating the addition operation A + B = C for dense vectors, where A and B denote the vectors to be added and C denotes the result vector. The specific implementation steps are: the CPU generates the data required for computing the vector addition on the GPU; the CPU transmits the data to the GPU; the task of computing A + B is allocated to GPU threads; and the vector-addition kernel function is executed on the GPU. In this method, the main tasks of the CPU are generating and transmitting the data and scheduling the main program, while the vector addition itself is completed by the GPU kernel function. By exploiting the high degree of hardware concurrency of the GPU, the speed of dense vector addition is greatly improved.
Description
Technical field
The invention belongs to the field of high-performance computing for power system applications, and more particularly relates to a GPU-accelerated dense vector addition computing method.
Background technology
A graphics processing unit (English: Graphics Processing Unit, abbreviation: GPU) is a many-core parallel processor whose number of processing units far exceeds that of a CPU. Traditionally the GPU was responsible only for graphics rendering, and most other processing was handed to the CPU. Today's GPU is a multi-core, multi-threaded, programmable processor with powerful computing capability and high memory bandwidth. Under the general-purpose computing model, the GPU works as a coprocessor of the CPU, and high-performance computing is achieved by reasonably distributing and decomposing tasks between them.
The computation of dense vectors is inherently parallel. Because the numerical computations at corresponding positions of the vectors are mutually independent, with no dependency relations, they are naturally suited to parallel processing and therefore to GPU acceleration.
This class of operations can be completed through reasonable scheduling between the CPU and the GPU.
The content of the invention
Goal of the invention: In order to overcome the deficiencies of the prior art, the present invention provides a GPU-accelerated dense vector addition computing method, which solves the technical deficiency that dense vector addition is time-consuming.
Technical scheme: To achieve the above object, the technical scheme of the invention is as follows:
The GPU-accelerated dense vector addition computing method is suited to accelerating the addition operation of dense vectors: A + B = C, where A and B denote the vectors to be added and C denotes the result vector. The method comprises:
(1) allocating, on the CPU, the data space needed for the GPU computation, and transferring the data required by the GPU computation to the GPU;
(2) assigning the task of adding each pair of corresponding elements of vectors A and B to a large number of threads on the GPU;
(3) executing the vector-addition kernel function Kernel_plus on the GPU.
In step (1), the data needed by the GPU kernel function are prepared on the CPU, specifically as follows: vectors A, B and C are stored in array format; the data comprise the number of vector elements n, the vectors DataA and DataB, and the result vector DataC.
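The host-side preparation described above might be sketched in CUDA C as follows. The buffer names DataA, DataB, DataC and the element count n come from the text; the element type double, the function name prepare_device_data, and the omission of error handling are assumptions made for illustration:

```cuda
#include <cuda_runtime.h>
#include <stddef.h>

// Sketch of step (1): allocate device buffers for the three vectors
// and copy the two input vectors from the CPU to the GPU.
void prepare_device_data(const double *DataA, const double *DataB,
                         int n, double **dA, double **dB, double **dC)
{
    size_t bytes = (size_t)n * sizeof(double);
    cudaMalloc((void **)dA, bytes);   // space for vector A on the GPU
    cudaMalloc((void **)dB, bytes);   // space for vector B
    cudaMalloc((void **)dC, bytes);   // space for the result vector C
    cudaMemcpy(*dA, DataA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(*dB, DataB, bytes, cudaMemcpyHostToDevice);
}
```

After the kernel finishes, the result would be copied back with a corresponding cudaMemcpy in the cudaMemcpyDeviceToHost direction.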
In step (2), the task of adding the n elements of vectors A and B is assigned to a large number of threads on the GPU; that is, each thread of the kernel function is responsible for the addition of one pair of corresponding elements of A and B, e.g. DataA[1] + DataB[1] = DataC[1].
In step (3), the kernel function Kernel_plus that completes the vector addition operation is defined as Kernel_plus<Nblocks, Nthreads>. The thread block size is fixed at 128, i.e. Nthreads = 128; the number of thread blocks Nblocks is n/128, and the total number of threads is n. The kernel function Kernel_plus<Nblocks, Nthreads> is called to compute the vector addition operation.
The computation flow of the kernel function Kernel_plus<Nblocks, Nthreads> is:
(4.1) CUDA automatically assigns each thread a thread-block index blockID and an in-block thread index threadID;
(4.2) blockID is assigned to the variable tid and threadID to the variable k, after which tid and k are used to index the k-th thread in the tid-th thread block;
(4.3) the variable t = tid*128 + k;
(4.4) the k-th thread in the tid-th thread block is responsible for the addition of the t-th elements of vector A and vector B: C_t = A_t + B_t, where A_t, B_t and C_t are the t-th elements of vectors A, B and C respectively.
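Steps (4.1)–(4.4) above can be sketched as a CUDA kernel. The kernel name Kernel_plus, the fixed block size 128, and the index arithmetic t = tid*128 + k come from the text; the element type double and the wrapper function launch_plus are assumptions, and n is taken to be a multiple of 128 as the text implies:

```cuda
#include <cuda_runtime.h>

// Sketch of steps (4.1)-(4.4): each thread reads its block index (tid)
// and in-block thread index (k), forms the global element index
// t = tid*128 + k, and adds one pair of elements.
__global__ void Kernel_plus(const double *DataA, const double *DataB,
                            double *DataC)
{
    int tid = blockIdx.x;             // (4.1)/(4.2): thread-block index
    int k   = threadIdx.x;            // (4.1)/(4.2): in-block thread index
    int t   = tid * 128 + k;          // (4.3): global element index
    DataC[t] = DataA[t] + DataB[t];   // (4.4): C_t = A_t + B_t
}

// Launch configuration as stated in the text: Nthreads = 128,
// Nblocks = n/128, so the total number of threads is n.
void launch_plus(const double *dA, const double *dB, double *dC, int n)
{
    int Nthreads = 128;
    int Nblocks  = n / 128;
    Kernel_plus<<<Nblocks, Nthreads>>>(dA, dB, dC);
}
```

Note that fixing the total thread count to exactly n works only when n is divisible by 128; a production kernel would round the block count up and guard the index with `if (t < n)`.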
Beneficial effects: Compared with the prior art, the data required for computing the vector addition on the GPU are generated on the CPU; the CPU then transmits the data to the GPU; the vector addition task A + B = C is distributed to GPU threads; and the vector-addition kernel function is then executed on the GPU. This significantly reduces the computing time of dense vector addition.
Description of the drawings
Fig. 1 is a schematic flowchart of the present invention.
Specific embodiment
The present invention is further described below with reference to the accompanying drawing.
As shown in Fig. 1, the present invention is a GPU-accelerated dense vector addition computing method, suited to accelerating the addition operation of dense vectors: A + B = C, where A and B denote the vectors to be added and C denotes the result vector. The method is characterized in that it comprises:
(1) allocating, on the CPU, the data space needed for the GPU computation, and transferring the data required by the GPU computation to the GPU; that is, the data required for computing the vector addition on the GPU are generated on the CPU and transmitted to the GPU;
(2) assigning the task of adding each pair of corresponding elements of vectors A and B to a large number of threads on the GPU;
(3) executing the vector-addition kernel function Kernel_plus on the GPU.
In the step (1), the data needed by the GPU kernel function are prepared on the CPU, specifically as follows: vectors A, B and C are stored in array format; the data comprise the number of vector elements n, the vectors DataA and DataB, and the result vector DataC.
In the step (2), each thread of the kernel function is responsible for the addition of one pair of corresponding elements of A and B, e.g. DataA[1] + DataB[1] = DataC[1].
In the step (3), the kernel function Kernel_plus that completes the vector addition operation is defined as Kernel_plus<Nblocks, Nthreads>. The thread block size is fixed at 128, i.e. Nthreads = 128; the number of thread blocks Nblocks is n/128, and the total number of threads is n. The kernel function Kernel_plus<Nblocks, Nthreads> is called to compute the vector addition operation. The computation flow of the kernel function Kernel_plus<Nblocks, Nthreads> is:
(4.1) CUDA automatically assigns each thread a thread-block index blockID and an in-block thread index threadID;
(4.2) blockID is assigned to the variable tid and threadID to the variable k, after which tid and k are used to index the k-th thread in the tid-th thread block;
(4.3) the variable t = tid*128 + k;
(4.4) the k-th thread in the tid-th thread block is responsible for the addition of the t-th elements of vector A and vector B: C_t = A_t + B_t, where A_t, B_t and C_t are the t-th elements of vectors A, B and C respectively.
The above is only a preferred embodiment of the present invention. It should be pointed out that, for those of ordinary skill in the art, several improvements and modifications can be made without departing from the principles of the invention, and such improvements and modifications should also be regarded as falling within the protection scope of the present invention.
Claims (4)
- 1. A GPU-accelerated dense vector addition computing method, suited to accelerating the addition operation of dense vectors: A + B = C, where A and B denote the vectors to be added and C denotes the result vector, characterized in that the method comprises: (1) allocating, on the CPU, the data space needed for the GPU computation, and transferring the data required by the GPU computation to the GPU; (2) assigning the task of adding each pair of corresponding elements of vectors A and B to a large number of threads on the GPU; (3) executing the vector-addition kernel function Kernel_plus on the GPU.
- 2. The GPU-accelerated dense vector addition computing method according to claim 1, characterized in that: in step (1), the data needed by the GPU kernel function are prepared on the CPU, specifically as follows: vectors A, B and C are stored in array format; the data comprise the number of vector elements n, the vectors DataA and DataB, and the result vector DataC.
- 3. The GPU-accelerated dense vector addition computing method according to claim 1, characterized in that: in step (2), the task of adding the n elements of vectors A and B is assigned to a large number of threads on the GPU; that is, each thread of the kernel function is responsible for the addition of one pair of corresponding elements of A and B, e.g. DataA[1] + DataB[1] = DataC[1].
- 4. The GPU-accelerated dense vector addition computing method according to claim 1, characterized in that: in step (3), the kernel function Kernel_plus that completes the vector addition operation is defined as Kernel_plus<Nblocks, Nthreads>; the thread block size is fixed at 128, i.e. Nthreads = 128, the number of thread blocks Nblocks is n/128, and the total number of threads is n; the kernel function Kernel_plus<Nblocks, Nthreads> is called to compute the vector addition operation; the computation flow of the kernel function Kernel_plus<Nblocks, Nthreads> is: (4.1) CUDA automatically assigns each thread a thread-block index blockID and an in-block thread index threadID; (4.2) blockID is assigned to the variable tid and threadID to the variable k, after which tid and k are used to index the k-th thread in the tid-th thread block; (4.3) the variable t = tid*128 + k; (4.4) the k-th thread in the tid-th thread block is responsible for the addition of the t-th elements of vector A and vector B: C_t = A_t + B_t, where A_t, B_t and C_t are the t-th elements of vectors A, B and C respectively.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610955722.1A CN106528054A (en) | 2016-11-03 | 2016-11-03 | GPU (Graphics Processing Unit) accelerated dense vector addition computing method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610955722.1A CN106528054A (en) | 2016-11-03 | 2016-11-03 | GPU (Graphics Processing Unit) accelerated dense vector addition computing method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106528054A true CN106528054A (en) | 2017-03-22 |
Family
ID=58325470
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610955722.1A Pending CN106528054A (en) | 2016-11-03 | 2016-11-03 | GPU (Graphics Processing Unit) accelerated dense vector addition computing method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106528054A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080208942A1 (en) * | 2007-02-23 | 2008-08-28 | Nara Won | Parallel Architecture for Matrix Transposition |
CN101937425A (en) * | 2009-07-02 | 2011-01-05 | 北京理工大学 | Matrix parallel transposition method based on GPU multi-core platform |
CN103543989A (en) * | 2013-11-11 | 2014-01-29 | 镇江中安通信科技有限公司 | Adaptive parallel processing method aiming at variable length characteristic extraction for big data |
US8984043B2 (en) * | 2009-12-23 | 2015-03-17 | Intel Corporation | Multiplying and adding matrices |
CN105574809A (en) * | 2015-12-16 | 2016-05-11 | 天津大学 | Matrix exponent-based parallel calculation method for electromagnetic transient simulation graphic processor |
-
2016
- 2016-11-03 CN CN201610955722.1A patent/CN106528054A/en active Pending
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080208942A1 (en) * | 2007-02-23 | 2008-08-28 | Nara Won | Parallel Architecture for Matrix Transposition |
CN101937425A (en) * | 2009-07-02 | 2011-01-05 | 北京理工大学 | Matrix parallel transposition method based on GPU multi-core platform |
US8984043B2 (en) * | 2009-12-23 | 2015-03-17 | Intel Corporation | Multiplying and adding matrices |
CN103543989A (en) * | 2013-11-11 | 2014-01-29 | 镇江中安通信科技有限公司 | Adaptive parallel processing method aiming at variable length characteristic extraction for big data |
CN105574809A (en) * | 2015-12-16 | 2016-05-11 | 天津大学 | Matrix exponent-based parallel calculation method for electromagnetic transient simulation graphic processor |
Non-Patent Citations (1)
Title |
---|
我是郭俊辰: "Introduction to CUDA Programming: Vector Addition and Matrix Multiplication" (CUDA编程入门:向量加法和矩阵乘法), 《HTTPS://BLOG.CSDN.NET/U014030117/ARTICLE/DETAILS/45952971》 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103049241B (en) | A kind of method improving CPU+GPU isomery device calculated performance | |
CN104834561B (en) | A kind of data processing method and device | |
CN103617150A (en) | GPU (graphic processing unit) based parallel power flow calculation system and method for large-scale power system | |
Xiong et al. | Efficient parallel implementation of the lattice Boltzmann method on large clusters of graphic processing units | |
CN109191364A (en) | Accelerate the hardware structure of artificial intelligence process device | |
Kelly | GPU computing for atmospheric modeling | |
CN105183562B (en) | A method of rasterizing data are carried out based on CUDA technologies to take out rank | |
CN101833438A (en) | General data processing method based on multiple parallel | |
Behrens et al. | Efficient SIMD Vectorization for Hashing in OpenCL. | |
Shahbahrami et al. | Parallel implementation of Gray Level Co-occurrence Matrices and Haralick texture features on cell architecture | |
CN105373517A (en) | Spark-based distributed matrix inversion parallel operation method | |
CN102298567A (en) | Mobile processor architecture integrating central operation and graphic acceleration | |
CN106776466A (en) | A kind of FPGA isomeries speed-up computation apparatus and system | |
CN106026107B (en) | A kind of QR decomposition method for the direction of energy Jacobian matrix that GPU accelerates | |
CN110781446A (en) | Method for rapidly calculating average vorticity deviation of ocean mesoscale vortex Lagrange | |
CN103413273A (en) | Method for rapidly achieving image restoration processing based on GPU | |
CN103713938A (en) | Multi-graphics-processing-unit (GPU) cooperative computing method based on Open MP under virtual environment | |
CN102841881A (en) | Multiple integral computing method based on many-core processor | |
CN104572588B (en) | Matrix inversion process method and apparatus | |
CN107256203A (en) | The implementation method and device of a kind of matrix-vector multiplication | |
CN106528054A (en) | GPU (Graphics Processing Unit) accelerated dense vector addition computing method | |
CN106934757A (en) | Monitor video foreground extraction accelerated method based on CUDA | |
CN103577160A (en) | Characteristic extraction parallel-processing method for big data | |
CN109408148A (en) | A kind of production domesticization computing platform and its apply accelerated method | |
CN104463940A (en) | Hybrid tree parallel construction method based on GPU |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20170322 |