CN104102513B - A transparent CUDA runtime-parameter optimization method based on the Kepler architecture - Google Patents

A transparent CUDA runtime-parameter optimization method based on the Kepler architecture

Info

Publication number
CN104102513B
CN104102513B (application CN201410341238.0A)
Authority
CN
China
Prior art keywords
thread
modified
kernel function
size
background server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410341238.0A
Other languages
Chinese (zh)
Other versions
CN104102513A (en)
Inventor
Yang Gang (杨刚)
Wang Yan (王严)
Du Sansheng (杜三盛)
Zhang Ce (张策)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University
Priority to CN201410341238.0A
Publication of CN104102513A
Application granted
Publication of CN104102513B
Legal status: Active


Landscapes

  • Debugging And Monitoring (AREA)
  • Image Generation (AREA)

Abstract

An embodiment of the invention provides a transparent CUDA runtime-parameter optimization method based on the Kepler architecture, in the field of CUDA programming technology, which saves the time otherwise spent searching for a performance-optimal launch configuration for a kernel function. The method includes: a background server unseals the encapsulated call request sent by the intercepting front end and obtains the kernel function's runtime parameter information; the background server computes the total number of threads the kernel requires from this information and thereby determines the kernel's thread-count grade; the thread-block size is then modified according to the determined grade, and the modified thread-block count and modified shared-memory size are computed from it; finally, the background server sends the modified kernel runtime parameters and the kernel's executable portion to its CUDA runtime layer for execution.

Description

A transparent CUDA runtime-parameter optimization method based on the Kepler architecture
Technical field
The present invention relates to the field of CUDA (Compute Unified Device Architecture) programming technology, and in particular to a transparent CUDA runtime-parameter optimization method based on the Kepler architecture.
Background technology
At present, CUDA kernel runtime parameters are chosen and set by programmers according to their own experience, and a performance-optimal configuration can only be found through repeated testing, so optimizing the performance of a CUDA program requires a large amount of trial time.
The content of the invention
Embodiments of the invention provide a transparent CUDA kernel runtime-parameter optimization method based on the Kepler architecture, which saves the time needed to obtain a performance-optimal configuration.
To achieve the above objective, embodiments of the invention adopt the following technical scheme:
A transparent CUDA runtime-parameter optimization method, including:
an intercepting front end captures the call request issued by a CUDA application to the CUDA runtime, encapsulates the intercepted call request, and passes it to a background server;
the background server unseals the encapsulated call request sent by the intercepting front end and obtains the kernel function's runtime parameter information, where the runtime parameter information includes the thread-block count, the thread-block size, and the shared-memory size;
based on the hardware characteristics of Kepler-architecture GPUs, the background server divides the total thread count into four grades from small to large, computes the total number of threads the kernel requires from the kernel's runtime parameter information, and thereby determines the kernel's thread-count grade;
the background server modifies the thread-block size according to the determined thread-count grade, then computes the modified thread-block count and the modified shared-memory size from the modified thread-block size;
the background server sends the modified kernel runtime parameters and the kernel's executable portion to the background server's CUDA runtime layer for execution, where the modified kernel runtime parameters include the modified thread-block size, the modified thread-block count, and the modified shared-memory size.
The transparent CUDA runtime-parameter optimization method provided by the above technical scheme couples the selection of kernel runtime parameters with the underlying GPU architecture. By intercepting the upper-layer application's settings of the kernel's parameters and resetting those parameters according to the characteristics of Kepler-architecture GPUs, the method optimizes program performance transparently to the user. Runtime-parameter optimization thus becomes invisible to CUDA application developers, shortening the time needed to optimize CUDA applications and lightening the developer's burden; by reducing development and tuning time, it also reduces the energy consumed during development.
Brief description of the drawings
Fig. 1 is a schematic flow chart of a transparent CUDA runtime-parameter optimization method based on the Kepler architecture provided by an embodiment of the present invention;
Fig. 2 is a system architecture diagram of the implementation.
Specific embodiment
The technical scheme in the embodiments of the present invention is described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the invention without creative effort fall within the protection scope of the invention.
An embodiment of the invention provides a transparent CUDA kernel runtime-parameter optimization method based on the Kepler architecture. As shown in Fig. 1 and Fig. 2, the method comprises the following steps:
101. The intercepting front end captures the call request issued by the CUDA application to the CUDA runtime, encapsulates the intercepted call request, and passes it to the background server.
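The patent does not name a concrete interception mechanism. As an illustration only, one common way to realize step 101 on Linux is an LD_PRELOAD shim that interposes the CUDA runtime's launch-configuration entry point; the sketch below assumes that mechanism. cudaConfigureCall is the launch-configuration call emitted by nvcc-generated host code in Kepler-era CUDA versions; sendToBackgroundServer is a hypothetical helper standing in for the encapsulate-and-forward step.

    // Minimal interposition sketch (assumption: an LD_PRELOAD shim; the patent
    // only says that the front end "intercepts" the CUDA runtime call).
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <cuda_runtime.h>

    // Hypothetical helper: encapsulates the captured launch parameters and
    // forwards them to the background server (transport not specified here).
    void sendToBackgroundServer(dim3 grid, dim3 block, size_t sharedMem);

    extern "C" cudaError_t cudaConfigureCall(dim3 gridDim, dim3 blockDim,
                                             size_t sharedMem, cudaStream_t stream) {
        using Fn = cudaError_t (*)(dim3, dim3, size_t, cudaStream_t);
        static Fn real = (Fn)dlsym(RTLD_NEXT, "cudaConfigureCall"); // real entry point
        sendToBackgroundServer(gridDim, blockDim, sharedMem);       // step 101
        return real(gridDim, blockDim, sharedMem, stream);
    }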
102. The background server unseals the encapsulated call request sent by the intercepting front end and obtains the kernel function's runtime parameter information.
The kernel's runtime parameter information includes information such as the thread-block count, the thread-block size, and the shared-memory size.
Multiplying the obtained thread-block count by the thread-block size gives the total number of threads the kernel requires; multiplying the obtained thread-block count by the per-block shared-memory size gives the total amount of shared memory the kernel requires.
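A minimal sketch of these two products, with hypothetical field names for the unsealed parameter information:

    #include <cstddef>

    // Hypothetical representation of the kernel's intercepted runtime parameters.
    struct KernelLaunchParams {
        long long   blockCount;        // thread-block count (grid size)
        long long   blockSize;         // threads per block
        std::size_t sharedMemPerBlock; // dynamic shared memory per block, in bytes
    };

    long long totalThreads(const KernelLaunchParams& p) {
        return p.blockCount * p.blockSize;                      // blocks x threads per block
    }

    std::size_t totalSharedMem(const KernelLaunchParams& p) {
        return (std::size_t)p.blockCount * p.sharedMemPerBlock; // blocks x shmem per block
    }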
103. Based on the hardware characteristics of Kepler-architecture GPUs, the background server divides the total thread count into four grades from small to large, computes the total number of threads the kernel requires from the kernel's runtime parameter information, and thereby determines the kernel's thread-count grade.
The hardware characteristics of a Kepler-architecture GPU (Graphics Processing Unit) are as follows: each SM (Streaming Multiprocessor) can run at most 2048 threads in parallel and at most 16 thread blocks in parallel; each thread block can be configured with at most 1024 threads; and the SM schedules execution in units of one warp (32 threads).
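The SM count SN used in the grading below can be queried at run time through the standard CUDA device-properties API; a brief sketch:

    #include <cuda_runtime.h>

    // Queries the number of SMs (the SN above) on device 0 via the CUDA runtime API.
    int streamingMultiprocessorCount() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        return prop.multiProcessorCount;
    }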
According to the hardware characteristics of the Kepler-architecture GPU and the intercepted runtime parameter information, the total thread count is divided into four grades from small to large. Let SN denote the number of SMs on the GPU; optionally, the grades from small to large are: the first grade (total thread count from 0 to 1536*SN), the second grade (1536*SN to 2048*SN), the third grade (2048*SN to 3072*SN), and the fourth grade (more than 3072*SN).
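The four thresholds transcribe directly into code; assigning boundary values to the lower grade is an assumption, since the text leaves the boundaries open:

    // Returns the thread-count grade (1-4) for a kernel's total thread count,
    // given sn, the number of SMs on the Kepler GPU.
    int threadCountGrade(long long total, int sn) {
        if (total <= 1536LL * sn) return 1;  // first grade:  0 .. 1536*SN
        if (total <= 2048LL * sn) return 2;  // second grade: .. 2048*SN
        if (total <= 3072LL * sn) return 3;  // third grade:  .. 3072*SN
        return 4;                            // fourth grade: > 3072*SN
    }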
104. The background server selects the optimal thread-block size according to the determined thread-count grade, then computes the modified thread-block count and the modified shared-memory size from the modified thread-block size.
Experimental results show that when device occupancy is high, kernels perform best with thread-block sizes of 96, 128, 192, and 256. Therefore, when the total thread count falls within the first grade, the thread-block size is changed to 96; within the second grade, to 128; within the third grade, to 192; and within the fourth grade, to 256.
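As a sketch, the resulting grade-to-block-size mapping:

    // Maps the thread-count grade to the experimentally chosen block size.
    int blockSizeForGrade(int grade) {
        switch (grade) {
            case 1:  return 96;
            case 2:  return 128;
            case 3:  return 192;
            default: return 256;  // fourth grade
        }
    }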
The computed total number of threads the kernel requires is divided by the modified thread-block size, and the result is taken as the modified thread-block count. The kernel's original thread-block count parameter is multiplied by the original shared-memory parameter, and the product is the total amount of shared memory required; this total is then divided by the modified thread-block count, and the result is taken as the modified shared-memory parameter.
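Putting the pieces together, a sketch of step 104 that reuses the helpers above; the text says only "divided by", so ceiling division is assumed here so that the modified grid still covers every required thread:

    // Sketch of step 104: recompute the block count and the per-block shared
    // memory for the new block size, keeping the total thread count and the
    // total shared memory (approximately) constant.
    KernelLaunchParams remapLaunchParams(const KernelLaunchParams& orig, int sn) {
        long long total   = totalThreads(orig);
        int grade         = threadCountGrade(total, sn);
        long long bs      = blockSizeForGrade(grade);
        long long blocks  = (total + bs - 1) / bs;  // ceiling division (assumed)
        std::size_t shmem = totalSharedMem(orig) / (std::size_t)blocks;
        return { blocks, bs, shmem };
    }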
105. The background server sends the modified kernel runtime parameters and the kernel's executable portion to the background server's CUDA runtime layer for execution.
The modified kernel runtime parameters include the modified thread-block size, the modified thread-block count, and the modified shared-memory size.
As shown in Fig. 2, the background server's GPU driver layer executes according to the optimized information obtained from the runtime layer. After the background server obtains the result of the optimized execution, it sends the result to the front end, and the front end delivers the result to the corresponding CUDA application.
The transparent CUDA runtime-parameter optimization method provided by this application couples the selection of kernel runtime parameters with the underlying GPU architecture. By intercepting the upper-layer application's settings of the kernel's parameters and resetting those parameters according to the characteristics of Kepler-architecture GPUs, the method optimizes program performance transparently to the user. Runtime-parameter optimization thus becomes invisible to CUDA application developers, shortening the time needed to optimize CUDA applications and lightening the developer's burden; by reducing development and tuning time, it also reduces the energy consumed during development.
The above are only specific embodiments of the present invention, but the protection scope of the invention is not limited thereto. Any change or substitution readily conceivable by a person familiar with the technical field within the technical scope disclosed by the invention shall be covered by the protection scope of the invention. Therefore, the protection scope of the invention shall be defined by the scope of the claims.

Claims (3)

1. A transparent CUDA runtime-parameter optimization method based on the Kepler architecture, characterized by comprising:
an intercepting front end capturing the call request issued by a CUDA application to the CUDA runtime, encapsulating the intercepted call request, and passing it to a background server;
the background server unsealing the encapsulated call request sent by the intercepting front end and obtaining the kernel function's runtime parameter information, wherein the runtime parameter information includes the thread-block count, the thread-block size, and the shared-memory size;
the background server, based on the hardware characteristics of Kepler-architecture GPUs, dividing the total thread count into four thread-count grades from small to large, computing the total number of threads the kernel requires from the kernel's runtime parameter information, and thereby determining the kernel's thread-count grade;
the background server modifying the thread-block size according to the determined thread-count grade, then computing the modified thread-block count and the modified shared-memory size from the modified thread-block size;
the background server sending the modified kernel runtime parameters and the kernel's executable portion to the background server's CUDA runtime layer for execution, wherein the modified kernel runtime parameters include the modified thread-block size, the modified thread-block count, and the modified shared-memory size.
2. The method according to claim 1, characterized in that the background server dividing the total thread count into four thread-count grades from small to large based on the hardware characteristics of the Kepler-architecture GPU comprises:
letting SN denote the number of streaming multiprocessors (SMs) on the graphics processor (GPU), the thread-count grades from small to large are: the first grade, with a total thread count of 0 to 1536*SN; the second grade, with a total thread count of 1536*SN to 2048*SN; the third grade, with a total thread count of 2048*SN to 3072*SN; and the fourth grade, with a total thread count of more than 3072*SN.
3. The method according to claim 1, characterized in that the background server modifying the thread-block size according to the determined thread-count grade and then computing the modified thread-block count and the modified shared-memory size from the modified thread-block size comprises:
when the determined thread-count grade is the first grade, the modified thread-block size is 96; when the determined thread-count grade is the second grade, the modified thread-block size is 128; when the determined thread-count grade is the third grade, the modified thread-block size is 192; and when the determined thread-count grade is the fourth grade, the modified thread-block size is 256;
dividing the computed total number of threads the kernel requires by the modified thread-block size, and taking the result as the modified thread-block count; multiplying the kernel's original thread-block count by the original shared-memory size, and taking the product as the total amount of shared memory required; then dividing the total shared memory by the modified thread-block count, and taking the result as the modified shared-memory size.
CN201410341238.0A 2014-07-18 2014-07-18 A transparent CUDA runtime-parameter optimization method based on the Kepler architecture Active CN104102513B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410341238.0A CN104102513B (en) 2014-07-18 2014-07-18 A transparent CUDA runtime-parameter optimization method based on the Kepler architecture

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410341238.0A CN104102513B (en) 2014-07-18 2014-07-18 A transparent CUDA runtime-parameter optimization method based on the Kepler architecture

Publications (2)

Publication Number Publication Date
CN104102513A (en) 2014-10-15
CN104102513B (en) 2017-06-16

Family

ID=51670686

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410341238.0A Active CN104102513B (en) 2014-07-18 2014-07-18 A transparent CUDA runtime-parameter optimization method based on the Kepler architecture

Country Status (1)

Country Link
CN (1) CN104102513B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104834628A (en) * 2015-04-26 2015-08-12 西北工业大学 Polymorphic computing platform and construction method thereof
CN106681694A (en) * 2016-12-30 2017-05-17 Institute of Computing Technology, Chinese Academy of Sciences Single-precision matrix multiplication optimization method and system based on NVIDIA Kepler GPU assembly instructions
CN109840877B (en) * 2017-11-24 2023-08-22 华为技术有限公司 Graphics processor and resource scheduling method and device thereof
CN110308982B (en) * 2018-03-20 2021-11-19 华为技术有限公司 Shared memory multiplexing method and device
CN109634830B (en) * 2018-12-19 2022-06-07 哈尔滨工业大学 CUDA program integration performance prediction method based on multi-feature coupling
CN113553057B (en) * 2021-07-22 2022-09-09 中国电子科技集团公司第十五研究所 Optimization system for parallel computing of GPUs with different architectures
CN116089050B (en) * 2023-04-13 2023-06-27 湖南大学 Heterogeneous adaptive task scheduling method


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102214356A (en) * 2011-06-07 2011-10-12 内蒙古大学 NVIDIA graphics processing unit (GPU) platform-based best neighborhood matching (BNM) parallel image recovering method
CN102567206A (en) * 2012-01-06 2012-07-11 华中科技大学 Method for analyzing CUDA (compute unified device architecture) program behavior
CN102547289A (en) * 2012-01-17 2012-07-04 西安电子科技大学 Fast motion estimation method realized based on GPU (Graphics Processing Unit) parallel
CN102708009A (en) * 2012-04-19 2012-10-03 华为技术有限公司 Method for sharing GPU (graphics processing unit) by multiple tasks based on CUDA (compute unified device architecture)

Also Published As

Publication number Publication date
CN104102513A (en) 2014-10-15

Similar Documents

Publication Publication Date Title
CN104102513B (en) A transparent CUDA runtime-parameter optimization method based on the Kepler architecture
CN106339351B (en) SGD algorithm optimization system and method
Qian et al. Extending mobile device's battery life by offloading computation to cloud
CN106951926A (en) Deep learning system method and device based on a hybrid architecture
CN103533032B (en) Bandwidth adjustment device and method
CN105912403B (en) Resource management method and device for Docker containers
CN110413391A (en) Container-cluster-based quality-of-service guarantee method and system for deep learning tasks
CN104038392A (en) Method for evaluating service quality of cloud computing resources
CN106209482A (en) Data center monitoring method and system
US20130198758A1 (en) Task distribution method and apparatus for multi-core system
CN107168779A (en) Task management method and system
CN107025236A (en) Data processing method between settlement systems and settlement data system
CN102591709B (en) Shapefile master-slave parallel writing method based on OGR
CN104820616B (en) Task scheduling method and device
CN104391696B (en) Automatic task processing method and device
CN107291550A (en) Dynamic Spark platform resource allocation method and system for iterative applications
EP3118784A1 (en) Method and system for enabling dynamic capacity planning
CN106383764A (en) Data acquisition method and device
CN106845746A (en) Cloud workflow management system supporting large-scale instance-intensive applications
CN106293947B (en) GPU-CPU mixed resource allocation system and method in a virtualized cloud environment
CN103023980A (en) Method and system for processing user service requests on a cloud platform
CN107833051A (en) Data statistics method and system
CN106897147A (en) Container resource scheduling method and device for an application container engine
CN108549935A (en) Device and method for realizing a neural network model
CN103067450B (en) Application control method and system for a cloud environment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
DD01 Delivery of document by public notice

Addressee: Shi Jiaming

Document name: payment instructions