CN110727413A - Monte Carlo parallel algorithm based on GPU (Graphics Processing Unit) programming model

Info

Publication number: CN110727413A
Application number: CN201910848788.4A
Authority: CN
Inventors: 宗慧; 华任锋; 赵建洋; 洪龙
Original assignee: Huaiyin Institute of Technology
Current assignee: Huaiyin Institute of Technology
Priority date: 2019-09-09
Filing date: 2019-09-09
Publication date: 2020-01-24

Abstract

The invention discloses a Monte Carlo parallel algorithm based on a GPU programming model, comprising the following steps: selecting a random number generator; adopting the simplest integral formula; parallelizing the integration code, quantifying the very large number of computations, optimizing the parallel threads according to the GPU model, and solving for the value of pi with the Monte Carlo integration algorithm; and solving for the value of e with the Monte Carlo integration algorithm. The aim of the invention is to provide a Monte Carlo parallel algorithm based on the CUDA programming model that makes full use of the parallel computing capability of the GPU and reduces numerical computation time. The high-quality Mersenne Twister random number algorithm is selected, pi and e are taken as examples for the Monte Carlo integration algorithm, and, after on the order of a trillion point-casting operations, the computed value of pi has 8 significant digits and that of e has 7 significant digits.

Description

Monte Carlo parallel algorithm based on GPU (graphics processing Unit) programming model

Technical Field

The invention relates to the technical field of supervision algorithms and text classification, in particular to a Monte Carlo parallel algorithm based on a GPU programming model.

Background

In the mid-1940s, S. M. Ulam and J. von Neumann proposed a new probability-based approach to problem solving. It is known as the Monte Carlo method, named after Monte Carlo, the famous gambling city of Monaco, to reflect its randomness. The calculation principle rests on probability and statistics: random cast points are sampled and counted, and the counts are then processed to obtain an estimate. The development of electronic digital computers provided an effective tool for such large-scale random experiments, so the Monte Carlo method is widely applied in nuclear science, artificial intelligence, medicine, management science, and other fields. Recently reported AI algorithms such as AlphaGo Zero and AlphaZero search successfully with a Monte Carlo method, and the Monte Carlo method plays an irreplaceable role in numerical calculations in nuclear physics.

With the continuous advance of science and technology, parallel computing has been applied in many fields, and multi-core technology in particular has brought parallel computing to a wide audience. Meanwhile, the computing performance of the GPU has improved rapidly. NVIDIA introduced compute-oriented graphics cards, originally dedicated to improving the speed and quality of computer graphics processing, and parallel algorithms designed for the GPU have become a research hotspot in recent years.

The computing capability of the GPU is much greater than that of the CPU because the GPU has more arithmetic logic units (ALUs). The CPU is designed for multi-task management and scheduling, so its interior consists mainly of control logic and cache, and only a small part is ALUs; the GPU is dedicated to data processing, so its interior consists mainly of ALUs, as shown in FIG. 1.

At present, CPU chip technology is developing more slowly than Moore's law predicts and Intel is moving toward low power consumption, whereas GPU technology is developing faster than Moore's law, so the GPU has very good computing prospects. In addition, the GPU has remarkable computing capability and speed and is well suited to the single instruction, multiple data (SIMD) computation of random sampling, so the Monte Carlo method can solve a problem much faster on the GPU than on the CPU.

Disclosure of Invention

The aim of the invention is to provide a Monte Carlo parallel algorithm based on the CUDA programming model that makes full use of the parallel computing capability of the GPU and reduces numerical computation time. The high-quality Mersenne Twister random number algorithm is selected, pi and e are taken as examples for the Monte Carlo integration algorithm, and, after on the order of a trillion point-casting operations, the computed value of pi has 8 significant digits and that of e has 7 significant digits.

The invention is realized by the following technical scheme:

A Monte Carlo parallel algorithm based on a GPU programming model comprises the following steps:

step 1, selecting a random number generator;

step 2, adopting the simplest integral formula: the Monte Carlo method quantifies the area of a region by casting points with equal probability, so that the number of points falling in the region represents its size; with the point count as the intermediary, the target value is obtained through a formula, and the region with the smaller area is selected as the statistical region, which reduces the computation time;

step 3, parallelizing the integration code, quantifying the very large number of computations, optimizing the parallel threads according to the GPU model, and solving for the value of pi with the Monte Carlo integration algorithm;

step 4, solving for the value of e with the Monte Carlo integration algorithm.
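Step 3 speaks of quantifying the computation and tuning the threads to the GPU. One plausible reading, sketched here in C++ as an assumption rather than as the patent's own procedure, is a work-partitioning calculation in which the total number of cast points is split evenly over all launched threads. The helper name `points_for_thread` and the launch figures in the note below are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from the patent): number of points that thread
// `tid` handles when `total_points` are spread over
// `blocks * threads_per_block` threads. The first `total_points % threads`
// threads take one extra point, so the per-thread counts add up exactly
// to `total_points`.
std::uint64_t points_for_thread(std::uint64_t total_points,
                                std::uint32_t blocks,
                                std::uint32_t threads_per_block,
                                std::uint64_t tid) {
    const std::uint64_t threads =
        static_cast<std::uint64_t>(blocks) * threads_per_block;
    assert(threads > 0 && tid < threads);
    const std::uint64_t base = total_points / threads;   // even share
    const std::uint64_t extra = total_points % threads;  // leftover points
    return base + (tid < extra ? 1 : 0);
}
```

Under these assumed figures, 10^12 points spread over 200 blocks of 256 threads (51,200 threads) come to 19,531,250 points per thread with no remainder.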

The invention further adopts the technical improvement scheme that:

Step 1 adopts an MT19937 random number generator, and the process is as follows: obtaining a random number seed; passing the seed to the MT19937 algorithm; setting the random number type and range; and generating the random numbers.

The invention further adopts the technical improvement scheme that:

In step 3, a CUDA-based Monte Carlo parallel integration algorithm for computing pi is adopted: two arrays first receive the random numbers generated by the MT19937 random number generator; video memory is then allocated for the arrays at the DEVICE end, and the arrays storing the random numbers are copied to the DEVICE-end arrays; a kernel function is called for the computation; finally, the numbers of points meeting the condition returned by each thread are summed and the result value is computed.

The invention further adopts the technical improvement scheme that:

In step 3, the GPU threads are optimized and the computation of e is implemented with a CUDA-based Monte Carlo parallel integration algorithm for computing e. The random numbers are generated at the HOST end and then transferred to the DEVICE end, where they are used for the condition test. The random number generator is the MT19937 algorithm; video memory is allocated for the array at the DEVICE end, the array storing the random numbers is copied to the DEVICE-end array, a CUDA kernel function is called for the computation, and finally the numbers of points meeting the condition on each thread are summed and the result value is computed.

The invention further adopts the technical improvement scheme that:

Step 1 adopts an MTGP32 random number generator: the seed and the data set are passed to the kernel function from the host side, and the device side then uses the curand function to generate random numbers. The process is as follows: on the host side, the predefined parameter set is reformatted into kernel format and the kernel parameters are copied into device memory; the state of each thread block is initialized and stored in an array; the device side calls the curand function and generates random numbers according to the seed value of each thread. Since the MTGP32 random number generator only generates unsigned integer data, unsigned integer data are used in the computation when solving for pi and e.

The invention further adopts the technical improvement scheme that:

In step 3, a CUDA-based Monte Carlo parallel integration algorithm for computing pi is adopted: first, the random number seed value needed at the DEVICE end, the array for storing the cast point counts, and the condition value are declared at the HOST end for the MTGP32 random number generator; a kernel function is then called for the computation, and each thread generates its random numbers in each of its computations; finally, the numbers of points meeting the condition returned by each thread are summed and the result value is computed.

The invention further adopts the technical improvement scheme that:

In step 3, the GPU threads are optimized and the computation of e is implemented with a CUDA-based Monte Carlo parallel integration algorithm for computing e. The random numbers are generated directly at the DEVICE end, each one immediately before the condition test, and after the test it is overwritten by a new random number. The random number generator is the MTGP32 algorithm; the CUDA kernel function is called from the HOST for the computation, and finally the numbers of points meeting the condition returned by each thread are summed and the result value is computed.

Compared with the prior art, the invention has the following obvious advantages:

First, the method improves both the efficiency and the computational precision of the Monte Carlo integration algorithm. The computation time of the Monte Carlo integration algorithm grows multiplicatively as the required precision increases, and the Monte Carlo parallel algorithm based on the GPU programming model is designed to improve computing performance.

Second, the high-quality Mersenne Twister random number algorithm is selected; following the characteristics of GPU programming, random numbers are generated at the host end and at the device end respectively and are applied to the design and implementation of the parallel algorithm.

Third, the GPU-based Monte Carlo integration algorithm uses the GPU to carry out a large number of point-casting operations quickly and thereby obtains more accurate values of pi and e; it completes a very large amount of computation quickly to obtain a higher-precision result, and it has considerable reference value for other similar scientific computing tasks.

Drawings

FIG. 1 shows the preliminary stage of algorithm design, in which a better-quality random number generator is selected by comparing the quality of different random number generators;

FIG. 2 shows the coding of the designed integral formula, where the simplest formula code is used to obtain the fastest speed;

FIG. 3 shows the parallelization of the integration code and the quantification of the computation, the optimization of the threads according to the GPU, and the experimental computation of the value of pi;

FIG. 4 shows the parallelization of the integration code and the quantification of the computation, the optimization of the threads according to the GPU, and the experimental computation of the value of e.

Detailed Description

The technical scheme of the invention is further described by combining the attached drawings 1-4:

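The invention comprises the following steps:

step 1, selecting a random number generator;

step 2, adopting the simplest integral formula: the Monte Carlo method quantifies the area of a region by casting points with equal probability, so that the number of points falling in the region represents its size; with the point count as the intermediary, the target value is obtained through a formula, and the region with the smaller area is selected as the statistical region, which reduces the computation time;

step 3, parallelizing the integration code, quantifying the very large number of computations, optimizing the parallel threads according to the GPU model, and solving for the value of pi with the Monte Carlo integration algorithm;

step 4, solving for the value of e with the Monte Carlo integration algorithm.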

Embodiment 1

Step 1, to obtain a good-quality result, a suitable random number generation method is determined first, and a large number of random points are then cast on that basis.

The MT19937 random number generator is suitable for generating random numbers at the host end. The C++ standard library includes an MT19937 implementation (in the <random> header), so random numbers can be generated simply by calling it, without setting the generator parameters one by one. The short code is as follows:

std::random_device seed;  // obtain a random number seed

std::mt19937 gen(seed());  // pass the seed to the MT19937 algorithm

std::uniform_int_distribution<int> rand(min, max);  // set the random number type (int) and range [min, max]

rand(gen);  // generate a random number

Step 2, taking pi as an example, a relatively accurate value is obtained with the Monte Carlo integration algorithm. An inscribed circle is drawn in a square with side length R, and points are then cast at random into the square. Because every cast is random, every point in the square is equally likely to be hit. Assuming that 100 million cast points cover the square, the ratio of the number of points inside the circle to the total number of points approximates the ratio of the area of the circle to the area of the square. Serial and parallel experiments were performed based on this method.
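The area ratio behind this estimate can be written out explicitly. With N points cast uniformly into the square and M of them landing inside the inscribed circle (the radius R/2 follows from the circle being inscribed in a square of side R):

```latex
\frac{M}{N} \approx \frac{\pi (R/2)^2}{R^2} = \frac{\pi}{4}
\quad\Longrightarrow\quad
\pi \approx \frac{4M}{N}
```

The same ratio of pi/4 holds for a quarter circle of radius R inside a square of side R, which is the form that applies when only the first quadrant is sampled.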

First, a serial implementation is made at the host end. A question arises when the code is written: when the cast points are counted, either the points falling inside the shaded area (the 1/4 circle) or the points falling outside it can be counted. Since both approaches are feasible, both are coded:

When the generated random numbers are used to simulate a cast point, two random numbers must be generated at the same time and used as the x coordinate and the y coordinate. To count the points falling inside the shaded area, only the distance between the coordinate (x, y) and the origin (0, 0) has to be computed: if the distance from the randomly generated point to the origin is smaller than the radius R, the condition is met; otherwise it is not. The randomly generated points are restricted to the first quadrant, and neither x nor y may exceed the radius R. Because the serial implementation only serves as the baseline for the code optimization below, it uses only the rand function.

Counting the number of cast points inside the shaded area:

A for loop simulates the repeated manual casts. The rand function generates the x coordinate and the y coordinate, the distance between the point and the origin is tested and the point is counted if it meets the condition, and finally the number of points meeting the condition is divided by the total number of casts and multiplied by 4 to obtain the value of pi. The implementation code is as follows:

N represents the total number of points cast, and M is the number of points falling within the shaded area.
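A minimal C++ sketch of this serial count, offered as an illustration and not as the patent's own listing. It makes two assumptions: the radius is normalized to R = 1, and `std::mt19937` with a fixed seed stands in for the rand function so that runs are reproducible. The function name `estimate_pi_inside` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Estimate pi by casting n points into the square [0,1) x [0,1) and
// counting those that fall inside the quarter circle of radius 1.
double estimate_pi_inside(std::uint64_t n, std::uint32_t seed) {
    assert(n > 0);
    std::mt19937 gen(seed);  // assumption: MT19937 replaces rand()
    std::uniform_real_distribution<double> coord(0.0, 1.0);
    std::uint64_t m = 0;     // M: points inside the shaded area
    for (std::uint64_t i = 0; i < n; ++i) {
        const double x = coord(gen);
        const double y = coord(gen);
        // Comparing squared distances avoids a square root per point.
        if (x * x + y * y < 1.0) ++m;
    }
    return 4.0 * static_cast<double>(m) / static_cast<double>(n);
}
```

Because the statistical error of such an estimate shrinks only with the square root of N, each additional significant digit needs roughly 100 times as many points, which is what motivates the parallel version.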

The implementation code for counting the cast points outside the shaded area is as follows:
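A minimal C++ sketch of the complementary count, again an illustration under the assumptions R = 1 and `std::mt19937` in place of rand, with a hypothetical function name. Only points outside the quarter circle are counted, and the inside count is recovered as N minus that number.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Estimate pi by counting the points that fall OUTSIDE the quarter circle.
// The outside region covers 1 - pi/4 (about 21.5%) of the square, so the
// counter is incremented far less often than when counting inside points.
double estimate_pi_outside(std::uint64_t n, std::uint32_t seed) {
    assert(n > 0);
    std::mt19937 gen(seed);  // assumption: MT19937 replaces rand()
    std::uniform_real_distribution<double> coord(0.0, 1.0);
    std::uint64_t outside = 0;
    for (std::uint64_t i = 0; i < n; ++i) {
        const double x = coord(gen);
        const double y = coord(gen);
        if (x * x + y * y >= 1.0) ++outside;
    }
    // Inside count is n - outside; pi is 4 times the inside fraction.
    return 4.0 * static_cast<double>(n - outside) / static_cast<double>(n);
}
```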

For the CUDA-based parallel implementation, two arrays first receive the random numbers generated by the MT19937 random number generator; video memory is then allocated for the arrays at the DEVICE end, and the arrays storing the random numbers are copied to the DEVICE-end arrays; a kernel function is called for the computation; finally, the numbers of points meeting the condition returned by each thread are summed and the result value is computed. The implementation code is as follows:
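The data flow of this variant can be sketched on the host alone. The C++ code below is an emulation, not a CUDA kernel and not the patent's listing: a plain loop over "virtual threads" stands in for the kernel launch, the device allocation and copy are only indicated in a comment, and all function names are hypothetical. It assumes R = 1 and a strided assignment of array elements to threads.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Stand-in for the kernel body: "thread" tid visits indices
// tid, tid + stride, ... and returns its own count of inside points.
std::uint64_t count_for_thread(const std::vector<float>& xs,
                               const std::vector<float>& ys,
                               std::size_t tid, std::size_t stride) {
    std::uint64_t hits = 0;
    for (std::size_t i = tid; i < xs.size(); i += stride) {
        if (xs[i] * xs[i] + ys[i] * ys[i] < 1.0f) ++hits;
    }
    return hits;
}

double estimate_pi_host_arrays(std::size_t n, std::size_t threads,
                               std::uint32_t seed) {
    assert(n > 0 && threads > 0);
    // 1. Two host arrays receive the MT19937 random numbers.
    std::mt19937 gen(seed);
    std::uniform_real_distribution<float> coord(0.0f, 1.0f);
    std::vector<float> xs(n), ys(n);
    for (std::size_t i = 0; i < n; ++i) {
        xs[i] = coord(gen);
        ys[i] = coord(gen);
    }
    // 2. On a GPU, device memory would be allocated and xs/ys copied
    //    over at this point; the emulation reads the host arrays directly.
    // 3. Every virtual thread produces its own partial count.
    std::vector<std::uint64_t> partial(threads);
    for (std::size_t t = 0; t < threads; ++t) {
        partial[t] = count_for_thread(xs, ys, t, threads);
    }
    // 4. The host sums the per-thread counts and forms the estimate.
    const std::uint64_t m =
        std::accumulate(partial.begin(), partial.end(), std::uint64_t{0});
    return 4.0 * static_cast<double>(m) / static_cast<double>(n);
}
```

Since every point is visited exactly once, the total is the same for any thread count; the two coordinate arrays, however, must hold all 2N values at once, which is the memory cost of generating the random numbers on the host.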

Step 3, taking e as an example, a relatively accurate value is obtained with the Monte Carlo integration algorithm, and serial and parallel experiments are then performed.

First, a serial implementation is made at the CPU end. When the pi example was coded, the question arose of whether to count the points falling inside the shaded area or those falling outside it. Because tests showed that counting the points falling outside the shaded area is faster, that counting approach is chosen directly for the coding here:

The implementation uses the MT19937 random number generator, and the implementation code is as follows:
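The text does not state which integral is used for e, so the C++ sketch below rests on an assumption chosen for illustration: the area under y = 1/x between x = 1 and x = 2 equals ln 2, and e follows as 2^(1/ln 2). Points are cast into the rectangle [1,2) x [0,1) and those above the curve (outside the shaded area) are counted. The function name is hypothetical and this is not the patent's listing.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <random>

// Assumed integral: area under y = 1/x on [1, 2] is ln 2, so e = 2^(1/ln 2).
double estimate_e_outside(std::uint64_t n, std::uint32_t seed) {
    assert(n > 0);
    std::mt19937 gen(seed);  // MT19937 generator, fixed seed for repeatability
    std::uniform_real_distribution<double> xs(1.0, 2.0);
    std::uniform_real_distribution<double> ys(0.0, 1.0);
    std::uint64_t outside = 0;  // points above the curve
    for (std::uint64_t i = 0; i < n; ++i) {
        const double x = xs(gen);
        const double y = ys(gen);
        // y >= 1/x is tested as x * y >= 1 to avoid a division per point.
        if (x * y >= 1.0) ++outside;
    }
    assert(outside < n);  // at least one point must lie under the curve
    // The rectangle has area 1, so the inside fraction estimates ln 2.
    const double ln2 =
        1.0 - static_cast<double>(outside) / static_cast<double>(n);
    return std::pow(2.0, 1.0 / ln2);
}
```

Under this assumed integral the outside region covers 1 - ln 2, about 31% of the rectangle, so it is again the smaller of the two regions.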

Implementation one of the CUDA-based Monte Carlo parallel integration algorithm for computing e uses the first random number generation mode: the random numbers are generated at the HOST end and then transferred to the DEVICE end, where they are used for the condition test. The random number generator is the MT19937 algorithm. Video memory is allocated for the array at the DEVICE end, the array storing the random numbers is copied to the DEVICE-end array, a CUDA kernel function is called for the computation, and finally the numbers of points meeting the condition returned by each thread are summed and the result value is computed. The code is as follows:

Embodiment 2

MTGP32 stands for Mersenne Twister for Graphic Processors and is a random number generator designed specifically for graphics processors. MTGP32 is also integrated in the cuRAND library, so the seed and the data set only need to be passed to the kernel function from the host side, after which the device side can use the curand function to generate random numbers. The short code is as follows:

On the host side, the predefined parameter set is reformatted into kernel format and the kernel parameters are copied into device memory:

cudaMalloc((void**)&devKernelParams, sizeof(mtgp32_kernel_params));

curandMakeMTGP32Constants(mtgp32dc_params_fast_11213, devKernelParams);

The state of each thread block is initialized and stored in an array:

curandMakeMTGP32KernelState(devMTGPStates, mtgp32dc_params_fast_11213, devKernelParams, BLOCK, seed());

The device side calls the curand function, and random numbers are generated according to the seed value of each thread:

curand(&state[Id]);

Since the MTGP32 random number generator only generates unsigned integer data (unsigned int: 0 to 4294967295), unsigned integer data are used in the computation when solving for π and e.
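One way to run the inside-the-circle test directly on unsigned integers, sketched in C++ under stated assumptions: `std::mt19937`, which also yields 32-bit unsigned values, stands in for the device-side generator, and each value is shifted right by one bit so that the squared coordinates and their sum fit in 64 bits. The scaling and the function names are illustrative choices, not taken from the patent.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Hit test on raw 32-bit outputs. After the shift each coordinate is below
// 2^31, so x*x + y*y stays below 2^63 and cannot overflow 64 bits.
// The radius is then R = 2^31, that is R^2 = 2^62.
bool inside_quarter_circle(std::uint32_t u, std::uint32_t v) {
    const std::uint64_t x = u >> 1;
    const std::uint64_t y = v >> 1;
    const std::uint64_t r2 = std::uint64_t{1} << 62;
    return x * x + y * y < r2;
}

double estimate_pi_unsigned(std::uint64_t n, std::uint32_t seed) {
    assert(n > 0);
    std::mt19937 gen(seed);  // assumption: stands in for the GPU generator
    std::uint64_t m = 0;
    for (std::uint64_t i = 0; i < n; ++i) {
        const std::uint32_t u = static_cast<std::uint32_t>(gen());
        const std::uint32_t v = static_cast<std::uint32_t>(gen());
        if (inside_quarter_circle(u, v)) ++m;
    }
    return 4.0 * static_cast<double>(m) / static_cast<double>(n);
}
```

Keeping the test in integer arithmetic avoids converting every random number to floating point before the comparison.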

Counting the number of cast points inside the shaded area:

First, the random number seed value needed at the DEVICE end, the array for storing the cast point counts, and the condition value are declared at the HOST end for the MTGP32 random number generator; a kernel function is then called for the computation, and each thread generates its random numbers in each of its computations; finally, the numbers of points meeting the condition returned by each thread are summed and the result value is computed. The implementation code is as follows:

Implementation two of the CUDA-based Monte Carlo parallel integration algorithm for computing e uses the second random number generation mode: the random numbers are generated directly at the DEVICE end, each one immediately before the condition test, and after the test it is overwritten by a new random number, so a large amount of video memory is not consumed. The random number generator is the MTGP32 algorithm. The CUDA kernel function is called from the HOST for the computation, the numbers of points meeting the condition returned by each thread are summed, and the result value is computed. The implementation code is as follows:
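The generate-test-overwrite pattern can be sketched on the host in C++. This is an emulation under several assumptions, not the patent's kernel: each virtual thread owns a `std::mt19937` seeded with `base_seed + t` in place of an MTGP32 state, e is obtained from the assumed integral of 1/x on [1, 2] (area ln 2, so e = 2^(1/ln 2)), and coordinates are 31-bit unsigned integers scaled by R = 2^31. All names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <random>

// One emulated device thread: it draws its numbers on the fly and keeps only
// the current pair (x, y), which the next iteration overwrites. No array of
// random numbers is ever stored.
std::uint64_t count_outside_for_thread(std::uint64_t points,
                                       std::uint32_t seed) {
    std::mt19937 gen(seed);  // assumption: stands in for an MTGP32 state
    const std::uint64_t r = std::uint64_t{1} << 31;   // R = 2^31
    std::uint64_t outside = 0;
    for (std::uint64_t i = 0; i < points; ++i) {
        const std::uint64_t x = r + (gen() >> 1);     // x in [R, 2R)
        const std::uint64_t y = gen() >> 1;           // y in [0, R)
        // Scaled form of y >= 1/x; x * y < 2^63, so no overflow occurs.
        if (x * y >= r * r) ++outside;
    }
    return outside;
}

double estimate_e_on_the_fly(std::uint32_t threads,
                             std::uint64_t points_per_thread,
                             std::uint32_t base_seed) {
    assert(threads > 0 && points_per_thread > 0);
    std::uint64_t outside = 0;
    for (std::uint32_t t = 0; t < threads; ++t) {     // emulated launch
        outside += count_outside_for_thread(points_per_thread, base_seed + t);
    }
    const double n = static_cast<double>(threads) *
                     static_cast<double>(points_per_thread);
    assert(static_cast<double>(outside) < n);
    const double ln2 = 1.0 - static_cast<double>(outside) / n;
    return std::pow(2.0, 1.0 / ln2);
}
```

The memory footprint per thread is a generator state plus a few scalars regardless of how many points are cast, which matches the stated benefit of not consuming a large amount of video memory.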

The technical means disclosed in the scheme of the invention are not limited to those disclosed in the above embodiments, but also include technical schemes formed by any combination of the above technical features. It should be noted that those skilled in the art can make improvements and refinements without departing from the principle of the invention, and these are also considered to be within the scope of protection of the invention.

Claims

1. A Monte Carlo parallel algorithm based on a GPU programming model is characterized in that: the method comprises the following steps:

step 1, selecting a random number generator;

step 2, adopting a simplest integral formula: the Monte Carlo method is adopted to quantify the area of a certain area in an equal probability casting manner, namely the number of casting points falling in the area can represent the size of the area, then the number of casting points is used as a medium, the size of a target value is obtained through a formula, and the area with a smaller area is selected as a statistical area, so that the calculation time can be reduced;

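step 3, parallelizing the integration code, quantifying the very large number of computations, optimizing the parallel threads according to the GPU model, and solving for the value of pi with the Monte Carlo integration algorithm;

step 4, solving for the value of e with the Monte Carlo integration algorithm.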

2. The Monte Carlo parallel algorithm based on GPU programming model according to claim 1, characterized in that: the step 1 adopts an MT19937 random number generator, and the process is as follows: obtaining a random number seed; adding random number seeds to the MT19937 algorithm; setting random number type and range; random number generation is performed.

3. A Monte Carlo parallel algorithm based on GPU programming model according to claim 2, wherein: in the step 3, a Monte Carlo parallel integration algorithm for calculating pi based on CUDA is adopted, firstly two arrays are used for receiving random numbers generated by using an MT19937 random number generator, then a video memory is allocated to the array of the DEVICE end, the array for storing the random numbers is copied and transmitted to the array of the DEVICE end, a kernel function is called for calculation, and finally points meeting the conditions on each returned thread are summed and a result value is calculated.

4. A Monte Carlo parallel algorithm based on GPU programming model according to claim 3, wherein: in the step 3, a computational method for optimizing GPU threads and realizing e is adopted, a Monte Carlo parallel integration algorithm based on CUDA computing e is adopted, generation is carried out at a HOST terminal according to a random number generation mode, and then the generated random number is transmitted to a DEVICE terminal for use judgment; the random number generator selects an MT19937 algorithm, then allocates a video memory for the array of the DEVICE end, copies the array for storing the random number and transmits the array to the array of the DEVICE end for use, then calls a CUDA kernel function for calculation, and finally sums the point numbers which meet the conditions on each thread and calculates the result value.

5. The Monte Carlo parallel algorithm based on GPU programming model according to claim 1, characterized in that: step 1 adopts an MTGP32 random number generator, the seed and the data set are passed to a kernel function from the host side, and the device side then generates random numbers by using the curand function, the process being as follows: on the host side, the predefined parameter set is reformatted into kernel format and the kernel parameters are copied into device memory; the state of each thread block is initialized and stored in an array; the device side calls the curand function and generates random numbers according to the seed value of each thread; since the MTGP32 random number generator only generates unsigned integer data, unsigned integer data are used in the computation when solving for pi and e.

6. The Monte Carlo parallel algorithm based on GPU programming model according to claim 5, characterized in that: in the step 3, a calculation method for optimizing GPU threads and realizing pi is adopted, a Monte Carlo parallel integration algorithm for calculating pi based on CUDA is adopted, firstly, a random number seed value required to be used at a DEVICE end, an array for storing a throw point number and a judgment condition value are declared for an MTGP32 random number generator at a HOST end, then a kernel function is called for calculation, and a random number is generated in each calculation of each thread; and finally summing the number of the points which meet the conditions on each returned thread and calculating a result value.

7. The Monte Carlo parallel algorithm based on GPU programming model according to claim 6, characterized in that: in the step 3, a GPU thread is optimized and a calculation method of e is realized, a Monte Carlo parallel integration algorithm based on CUDA calculation e is adopted, and the calculation method is generated in a random number generation mode, namely directly at a DEVICE end, and then generated before condition judgment, and covered by a new random number after the condition judgment; and the random number generator selects an MTGP32 algorithm, calls a CUDA kernel function in HOST for calculation, and finally sums the number of points which meet the conditions on each thread and calculates the result value.