CN106528054A - GPU (Graphics Processing Unit) accelerated dense vector addition computing method - Google Patents
- Publication number
- CN106528054A, application CN201610955722.1A
- Authority
- CN
- China
- Prior art keywords
- gpu
- threads
- vector
- addition
- vectorial
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3885—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Complex Calculations (AREA)
Abstract
The invention discloses a GPU (Graphics Processing Unit) accelerated dense vector addition computing method. The method is applied to accelerating the addition operation A + B = C for dense vectors, where A and B denote the vectors to be added and C denotes the result vector. The specific implementation steps are: the CPU generates the data required for computing the vector addition on the GPU; the CPU transmits the data to the GPU; the task of computing A + B is allocated to GPU threads; and the vector-addition kernel function is executed on the GPU. In this method, the main tasks of the CPU are generating and transmitting the data and scheduling the main program, while the vector addition itself is completed by the GPU kernel function. By exploiting the high degree of hardware concurrency of the GPU, the speed of dense vector addition is greatly improved.
Description
Technical field
The invention belongs to the field of high-performance computing for power system applications, and more particularly relates to a GPU-accelerated dense vector addition computing method.
Background technology
A graphics processing unit (English: Graphics Processing Unit, abbreviation: GPU) is a many-core parallel processor whose number of processing units far exceeds that of a CPU. Traditionally the GPU was responsible only for graphics rendering, and most other processing was handed to the CPU. Today's GPU is a multi-core, multi-threaded, programmable processor with powerful computing capability and high memory bandwidth. Under the general-purpose computing model, the GPU works as a coprocessor of the CPU, and high-performance computing is achieved by reasonably distributing and decomposing tasks between them.
The computation of dense vectors is inherently parallel. Because the numerical computations at corresponding positions of the vectors are mutually independent, with no dependency relations, they are naturally suited to parallel processing and therefore to GPU acceleration.
This class of operations can be completed through reasonable scheduling between the CPU and the GPU.
The content of the invention
Goal of the invention: In order to overcome the deficiencies of the prior art, the present invention provides a GPU-accelerated dense vector addition computing method, which solves the technical deficiency that dense vector addition is time-consuming.
Technical scheme: To achieve the above object, the technical scheme of the invention is as follows:
The GPU-accelerated dense vector addition computing method is suited to accelerating the addition operation of dense vectors: A + B = C, where A and B denote the vectors to be added and C denotes the result vector. The method comprises:
(1) allocating, on the CPU, the data space needed for the GPU computation, and transferring the data required by the GPU computation to the GPU;
(2) assigning the task of adding each pair of corresponding elements of vectors A and B to a large number of threads on the GPU;
(3) executing the vector-addition kernel function Kernel_plus on the GPU.
In step (1), the data needed by the GPU kernel function are prepared on the CPU, specifically as follows: vectors A, B and C are stored in array format; the data comprise the number of vector elements n, the vectors DataA and DataB, and the result vector DataC.
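The host-side preparation described above might be sketched in CUDA C as follows. The buffer names DataA, DataB, DataC and the element count n come from the text; the element type double, the function name prepare_device_data, and the omission of error handling are assumptions made for illustration:

```cuda
#include <cuda_runtime.h>
#include <stddef.h>

// Sketch of step (1): allocate device buffers for the three vectors
// and copy the two input vectors from the CPU to the GPU.
void prepare_device_data(const double *DataA, const double *DataB,
                         int n, double **dA, double **dB, double **dC)
{
    size_t bytes = (size_t)n * sizeof(double);
    cudaMalloc((void **)dA, bytes);   // space for vector A on the GPU
    cudaMalloc((void **)dB, bytes);   // space for vector B
    cudaMalloc((void **)dC, bytes);   // space for the result vector C
    cudaMemcpy(*dA, DataA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(*dB, DataB, bytes, cudaMemcpyHostToDevice);
}
```

After the kernel finishes, the result would be copied back with a corresponding cudaMemcpy in the cudaMemcpyDeviceToHost direction.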
In step (2), the task of adding the n elements of vectors A and B is assigned to a large number of threads on the GPU; that is, each thread of the kernel function is responsible for the addition of one pair of corresponding elements of A and B, e.g. DataA[1] + DataB[1] = DataC[1].
In step (3), the kernel function Kernel_plus that completes the vector addition operation is defined as Kernel_plus<Nblocks, Nthreads>. The thread block size is fixed at 128, i.e. Nthreads = 128; the number of thread blocks Nblocks is n/128, and the total number of threads is n. The kernel function Kernel_plus<Nblocks, Nthreads> is called to compute the vector addition operation.
The computation flow of the kernel function Kernel_plus<Nblocks, Nthreads> is:
(4.1) CUDA automatically assigns each thread a thread-block index blockID and an in-block thread index threadID;
(4.2) blockID is assigned to the variable tid and threadID to the variable k, after which tid and k are used to index the k-th thread in the tid-th thread block;
(4.3) the variable t = tid*128 + k;
(4.4) the k-th thread in the tid-th thread block is responsible for the addition of the t-th elements of vector A and vector B: C_t = A_t + B_t, where A_t, B_t and C_t are the t-th elements of vectors A, B and C respectively.
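Steps (4.1)–(4.4) above can be sketched as a CUDA kernel. The kernel name Kernel_plus, the fixed block size 128, and the index arithmetic t = tid*128 + k come from the text; the element type double and the wrapper function launch_plus are assumptions, and n is taken to be a multiple of 128 as the text implies:

```cuda
#include <cuda_runtime.h>

// Sketch of steps (4.1)-(4.4): each thread reads its block index (tid)
// and in-block thread index (k), forms the global element index
// t = tid*128 + k, and adds one pair of elements.
__global__ void Kernel_plus(const double *DataA, const double *DataB,
                            double *DataC)
{
    int tid = blockIdx.x;             // (4.1)/(4.2): thread-block index
    int k   = threadIdx.x;            // (4.1)/(4.2): in-block thread index
    int t   = tid * 128 + k;          // (4.3): global element index
    DataC[t] = DataA[t] + DataB[t];   // (4.4): C_t = A_t + B_t
}

// Launch configuration as stated in the text: Nthreads = 128,
// Nblocks = n/128, so the total number of threads is n.
void launch_plus(const double *dA, const double *dB, double *dC, int n)
{
    int Nthreads = 128;
    int Nblocks  = n / 128;
    Kernel_plus<<<Nblocks, Nthreads>>>(dA, dB, dC);
}
```

Note that fixing the total thread count to exactly n works only when n is divisible by 128; a production kernel would round the block count up and guard the index with `if (t < n)`.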
Beneficial effects: Compared with the prior art, the data required for computing the vector addition on the GPU are generated on the CPU; the CPU then transmits the data to the GPU; the vector addition task A + B = C is distributed to GPU threads; and the vector-addition kernel function is then executed on the GPU. This significantly reduces the computing time of dense vector addition.
Description of the drawings
Fig. 1 is a schematic flowchart of the present invention.
Specific embodiment
The present invention is further described below with reference to the accompanying drawing.
As shown in Fig. 1, the present invention is a GPU-accelerated dense vector addition computing method, suited to accelerating the addition operation of dense vectors: A + B = C, where A and B denote the vectors to be added and C denotes the result vector. The method is characterized in that it comprises:
(1) allocating, on the CPU, the data space needed for the GPU computation, and transferring the data required by the GPU computation to the GPU; that is, the data required for computing the vector addition on the GPU are generated on the CPU and transmitted to the GPU;
(2) assigning the task of adding each pair of corresponding elements of vectors A and B to a large number of threads on the GPU;
(3) executing the vector-addition kernel function Kernel_plus on the GPU.
In the step (1), the data needed by the GPU kernel function are prepared on the CPU, specifically as follows: vectors A, B and C are stored in array format; the data comprise the number of vector elements n, the vectors DataA and DataB, and the result vector DataC.
In the step (2), each thread of the kernel function is responsible for the addition of one pair of corresponding elements of A and B, e.g. DataA[1] + DataB[1] = DataC[1].
In the step (3), the kernel function Kernel_plus that completes the vector addition operation is defined as Kernel_plus<Nblocks, Nthreads>. The thread block size is fixed at 128, i.e. Nthreads = 128; the number of thread blocks Nblocks is n/128, and the total number of threads is n. The kernel function Kernel_plus<Nblocks, Nthreads> is called to compute the vector addition operation. The computation flow of the kernel function Kernel_plus<Nblocks, Nthreads> is:
(4.1) CUDA automatically assigns each thread a thread-block index blockID and an in-block thread index threadID;
(4.2) blockID is assigned to the variable tid and threadID to the variable k, after which tid and k are used to index the k-th thread in the tid-th thread block;
(4.3) the variable t = tid*128 + k;
(4.4) the k-th thread in the tid-th thread block is responsible for the addition of the t-th elements of vector A and vector B: C_t = A_t + B_t, where A_t, B_t and C_t are the t-th elements of vectors A, B and C respectively.
The above is only a preferred embodiment of the present invention. It should be pointed out that, for those of ordinary skill in the art, several improvements and modifications can be made without departing from the principles of the invention, and such improvements and modifications should also be regarded as falling within the protection scope of the present invention.
Claims (4)
- 1. A GPU-accelerated dense vector addition computing method, suited to accelerating the addition operation of dense vectors: A + B = C, where A and B denote the vectors to be added and C denotes the result vector, characterized in that the method comprises: (1) allocating, on the CPU, the data space needed for the GPU computation, and transferring the data required by the GPU computation to the GPU; (2) assigning the task of adding each pair of corresponding elements of vectors A and B to a large number of threads on the GPU; (3) executing the vector-addition kernel function Kernel_plus on the GPU.
- 2. The GPU-accelerated dense vector addition computing method according to claim 1, characterized in that: in step (1), the data needed by the GPU kernel function are prepared on the CPU, specifically as follows: vectors A, B and C are stored in array format; the data comprise the number of vector elements n, the vectors DataA and DataB, and the result vector DataC.
- 3. The GPU-accelerated dense vector addition computing method according to claim 1, characterized in that: in step (2), the task of adding the n elements of vectors A and B is assigned to a large number of threads on the GPU; that is, each thread of the kernel function is responsible for the addition of one pair of corresponding elements of A and B, e.g. DataA[1] + DataB[1] = DataC[1].
- 4. The GPU-accelerated dense vector addition computing method according to claim 1, characterized in that: in step (3), the kernel function Kernel_plus that completes the vector addition operation is defined as Kernel_plus<Nblocks, Nthreads>; the thread block size is fixed at 128, i.e. Nthreads = 128, the number of thread blocks Nblocks is n/128, and the total number of threads is n; the kernel function Kernel_plus<Nblocks, Nthreads> is called to compute the vector addition operation; the computation flow of the kernel function Kernel_plus<Nblocks, Nthreads> is: (4.1) CUDA automatically assigns each thread a thread-block index blockID and an in-block thread index threadID; (4.2) blockID is assigned to the variable tid and threadID to the variable k, after which tid and k are used to index the k-th thread in the tid-th thread block; (4.3) the variable t = tid*128 + k; (4.4) the k-th thread in the tid-th thread block is responsible for the addition of the t-th elements of vector A and vector B: C_t = A_t + B_t, where A_t, B_t and C_t are the t-th elements of vectors A, B and C respectively.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610955722.1A CN106528054A (en) | 2016-11-03 | 2016-11-03 | GPU (Graphics Processing Unit) accelerated dense vector addition computing method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610955722.1A CN106528054A (en) | 2016-11-03 | 2016-11-03 | GPU (Graphics Processing Unit) accelerated dense vector addition computing method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106528054A true CN106528054A (en) | 2017-03-22 |
Family
ID=58325470
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610955722.1A Pending CN106528054A (en) | 2016-11-03 | 2016-11-03 | GPU (Graphics Processing Unit) accelerated dense vector addition computing method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106528054A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080208942A1 (en) * | 2007-02-23 | 2008-08-28 | Nara Won | Parallel Architecture for Matrix Transposition |
CN101937425A (en) * | 2009-07-02 | 2011-01-05 | 北京理工大学 | Matrix parallel transposition method based on GPU multi-core platform |
CN103543989A (en) * | 2013-11-11 | 2014-01-29 | 镇江中安通信科技有限公司 | Adaptive parallel processing method aiming at variable length characteristic extraction for big data |
US8984043B2 (en) * | 2009-12-23 | 2015-03-17 | Intel Corporation | Multiplying and adding matrices |
CN105574809A (en) * | 2015-12-16 | 2016-05-11 | 天津大学 | Matrix exponent-based parallel calculation method for electromagnetic transient simulation graphic processor |
-
2016
- 2016-11-03 CN CN201610955722.1A patent/CN106528054A/en active Pending
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080208942A1 (en) * | 2007-02-23 | 2008-08-28 | Nara Won | Parallel Architecture for Matrix Transposition |
CN101937425A (en) * | 2009-07-02 | 2011-01-05 | 北京理工大学 | Matrix parallel transposition method based on GPU multi-core platform |
US8984043B2 (en) * | 2009-12-23 | 2015-03-17 | Intel Corporation | Multiplying and adding matrices |
CN103543989A (en) * | 2013-11-11 | 2014-01-29 | 镇江中安通信科技有限公司 | Adaptive parallel processing method aiming at variable length characteristic extraction for big data |
CN105574809A (en) * | 2015-12-16 | 2016-05-11 | 天津大学 | Matrix exponent-based parallel calculation method for electromagnetic transient simulation graphic processor |
Non-Patent Citations (1)
Title |
---|
我是郭俊辰: "Introduction to CUDA Programming: Vector Addition and Matrix Multiplication" (CUDA编程入门:向量加法和矩阵乘法), 《HTTPS://BLOG.CSDN.NET/U014030117/ARTICLE/DETAILS/45952971》 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103049241B (en) | A kind of method improving CPU+GPU isomery device calculated performance | |
CN104834561B (en) | A kind of data processing method and device | |
CN103617150A (en) | GPU (graphic processing unit) based parallel power flow calculation system and method for large-scale power system | |
Xiong et al. | Efficient parallel implementation of the lattice Boltzmann method on large clusters of graphic processing units | |
CN109191364A (en) | Accelerate the hardware structure of artificial intelligence process device | |
Kelly | GPU computing for atmospheric modeling | |
CN105183562B (en) | A method of rasterizing data are carried out based on CUDA technologies to take out rank | |
CN101833438A (en) | General data processing method based on multiple parallel | |
Behrens et al. | Efficient SIMD Vectorization for Hashing in OpenCL. | |
Shahbahrami et al. | Parallel implementation of Gray Level Co-occurrence Matrices and Haralick texture features on cell architecture | |
CN105373517A (en) | Spark-based distributed matrix inversion parallel operation method | |
CN102298567A (en) | Mobile processor architecture integrating central operation and graphic acceleration | |
CN106776466A (en) | A kind of FPGA isomeries speed-up computation apparatus and system | |
CN106026107B (en) | A kind of QR decomposition method for the direction of energy Jacobian matrix that GPU accelerates | |
CN110781446A (en) | Method for rapidly calculating average vorticity deviation of ocean mesoscale vortex Lagrange | |
CN103413273A (en) | Method for rapidly achieving image restoration processing based on GPU | |
CN103713938A (en) | Multi-graphics-processing-unit (GPU) cooperative computing method based on Open MP under virtual environment | |
CN102841881A (en) | Multiple integral computing method based on many-core processor | |
CN104572588B (en) | Matrix inversion process method and apparatus | |
CN107256203A (en) | The implementation method and device of a kind of matrix-vector multiplication | |
CN106528054A (en) | GPU (Graphics Processing Unit) accelerated dense vector addition computing method | |
CN106934757A (en) | Monitor video foreground extraction accelerated method based on CUDA | |
CN103577160A (en) | Characteristic extraction parallel-processing method for big data | |
CN109408148A (en) | A kind of production domesticization computing platform and its apply accelerated method | |
CN104463940A (en) | Hybrid tree parallel construction method based on GPU |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20170322 |