CN110554988A - high-dimensional multi-target domination method based on CPU + GPU heterogeneous computation - Google Patents
- Publication number
- CN110554988A (application CN201810560114.XA)
- Authority
- CN
- China
- Prior art keywords
- cpu
- gpu
- target
- data
- domination
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/16—Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3885—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Computer Hardware Design (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Multi Processors (AREA)
Abstract
The invention discloses a high-dimensional multi-target dominance method based on CPU + GPU heterogeneous computation, belonging to the fields of operations research optimization, multi-objective optimization, and high-performance computing. The method is designed around three aspects: the task-partitioning scheme, the memory type used for data, and the amount of data exchanged between the CPU and the GPU, improving both the computational efficiency and the practicality of the dominance algorithm. The method comprises five core parts: optimal solution decision, unbounded solution processing, basis variable processing, normalization, and elimination. Compared with the serial high-dimensional multi-target LWM dominance method, the proposed method overcomes the serial method's low computational efficiency while improving the efficiency and practicality of the dominance algorithm.
Description
Technical Field
The invention relates to a high-dimensional multi-target dominance method based on CPU + GPU heterogeneous computation, belonging to the fields of operations research optimization, multi-objective optimization, and high-performance computing.
Background
For high-dimensional multi-objective optimization problems, as the number of objectives increases, most of the solution set becomes non-dominated under the Pareto dominance relation, leading to a loss of selection pressure. Determining whether a solution is LWM non-dominated amounts to solving a linear programming problem: the solution is LWM non-dominated if that linear program has an unbounded solution, or an optimal value greater than some small positive number. However, as the number of objectives and the size of the candidate solution set grow, solving these linear programs becomes very time-consuming. To make the LWM dominance relation practical for high-dimensional multi-objective optimization, a parallel LWM dominance algorithm based on CPU + GPU heterogeneous computation is proposed, which fully exploits the parallel processing capability of the GPU to improve computational efficiency.
Disclosure of Invention
The invention aims to provide a high-dimensional multi-target dominance method based on CPU + GPU heterogeneous computation. The method is designed around three aspects: the task-partitioning scheme, the memory type used for data, and the amount of data exchanged between the CPU and the GPU, improving both the computational efficiency and the practicality of the dominance algorithm.
To achieve this aim, the scheme adopted by the invention is a high-dimensional multi-target dominance method based on CPU + GPU heterogeneous computation, applying heterogeneous CPU + GPU computation to solving high-dimensional multi-objective optimization problems. The method comprises five core parts: optimal solution decision, unbounded solution processing, basis variable processing, normalization, and elimination:
First, optimal solution decision: an optimal-solution test is performed on the linear programming problem, converting the CPU-side decision logic into a GPU-side parallel computing strategy;
Second, unbounded solution processing: an unbounded-solution test is performed on the linear programming problem, converting the CPU-side decision logic into a GPU-side parallel computing strategy and accelerating it with a reduction strategy;
Third, basis variable processing: the smallest index associated with the basis variable is obtained with a reduction strategy;
Fourth, normalization: the row of data corresponding to the basis variable is normalized;
Fifth, elimination: Gaussian elimination is performed so that the algorithm approaches the optimal solution.
Each of these five parts is described in detail below.
First, optimal solution decision: an optimal-solution test is performed on the linear programming problem, converting the CPU-side decision logic into a GPU-side parallel computing strategy.
In terms of task partitioning, the decision logic is converted into a maximum-value computation and carried out with a reduction strategy. Thread allocation per Block is one-dimensional, with 256 threads. In terms of data transfer and storage, the input data volume is n × (m + n), where m is the number of objectives and n is the size of the solution set. Because the transferred data volume is large, the data is stored in global memory.
Second, unbounded solution processing: an unbounded-solution test is performed on the linear programming problem, converting the CPU-side decision logic into a GPU-side parallel computing strategy and accelerating it with a reduction strategy.
In terms of task partitioning, the serial time complexity of this stage is O(n + m); it can be converted into a maximum-value computation and carried out with a reduction strategy. Thread allocation per Block is one-dimensional, with 256 threads. In terms of data transfer and storage, the input data volume is n × (m + n), where m is the number of objectives and n is the size of the solution set. Because the transferred data volume is large, the data is stored in global memory.
Third, basis variable processing: the smallest index associated with the basis variable is obtained with a reduction strategy.
In terms of task partitioning, the serial time complexity of this stage is O(n + m), and the computation is carried out with a reduction strategy. Exploiting the weak correlation between the matrix elements involved, thread allocation per Block is one-dimensional with 256 threads, and the theoretical time complexity of the parallel computation is O(1). In terms of data transfer and storage, the input data volume is n × (m + n), where m is the number of objectives and n is the size of the solution set. Because the transferred data volume is large, the data is stored in global memory.
Fourth, normalization: the row of data corresponding to the basis variable is normalized.
The serial time complexity of the normalization stage is O(n + m). Exploiting the weak correlation between the matrix elements involved, each Thread in a Block processes one element, the optimal number of threads per Block is 256, and the theoretical time complexity of the parallel computation is O(1). Since the data volume to be processed is n + m, shared memory is used for storage to ensure efficiency.
Fifth, elimination: Gaussian elimination is performed so that the algorithm approaches the optimal solution.
The serial time complexity of the elimination stage is O(n × (n + m)). No data sharing is needed between Blocks; within a Block a two-dimensional (16, 16) thread allocation is reasonable, and the theoretical time complexity of the parallel computation is O(n + m). The intermediate data volume required by each Block is (m + n), which is stored in shared memory to ensure efficiency.
Drawings
FIG. 1 is a flow chart of an embodiment of the method of the present invention.
Detailed Description
The invention aims to provide a high-dimensional multi-target dominance method based on CPU + GPU heterogeneous computation. The method is designed around three aspects: the task-partitioning scheme, the memory type used for data, and the amount of data exchanged between the CPU and the GPU, improving both the computational efficiency and the practicality of the dominance algorithm. The method comprises five core parts: optimal solution decision, unbounded solution processing, basis variable processing, normalization, and elimination:
First, optimal solution decision: an optimal-solution test is performed on the linear programming problem, converting the CPU-side decision logic into a GPU-side parallel computing strategy;
Second, unbounded solution processing: an unbounded-solution test is performed on the linear programming problem, converting the CPU-side decision logic into a GPU-side parallel computing strategy and accelerating it with a reduction strategy;
Third, basis variable processing: the smallest index associated with the basis variable is obtained with a reduction strategy;
Fourth, normalization: the row of data corresponding to the basis variable is normalized;
Fifth, elimination: Gaussian elimination is performed so that the algorithm approaches the optimal solution.
Specifically, the implementation follows the flowchart of FIG. 1, in which the left side shows task allocation on the CPU side and the right side shows task allocation on the GPU side. The overall task is divided according to the five steps above: the logic-decision parts are assigned to the CPU, and the compute-intensive parts execute on the GPU. To avoid transferring large amounts of data, the computation tasks of the optimal solution decision, unbounded solution processing, and basis variable processing stages are placed on the GPU, while the corresponding logic decisions remain on the CPU. Normalization and elimination are the two most computation-heavy stages of the method, and such compute-intensive tasks are well suited to parallel execution on the GPU, so they are also assigned to the GPU. The data transfers and workload partitioning between the CPU and the GPU are detailed in FIG. 1. Only a single flag value is used to decide whether the linear program has an optimal or an unbounded solution; the full data is never transferred for this decision.
First, optimal solution decision: if every coefficient of the non-basic variables is no greater than 0, an optimal solution must exist. In the parallel algorithm, the maximum coefficient among the non-basic variables is computed on the GPU, and the CPU only checks whether that maximum exceeds 0. Likewise, only the index of the basis variable is transferred, so the whole data set never moves between CPU and GPU. The maximum is computed on the GPU using the reduction strategy of CUDA programming.
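The optimality test above can be sketched serially as follows. This is an illustrative reference only: the tableau layout (objective row last, right-hand side in the last column) and the function name are assumptions, not details given in the patent, and the `max` here stands in for the GPU-side parallel reduction whose scalar result is shipped back to the CPU.

```python
import numpy as np

def is_optimal(tableau, basis):
    """Optimality test for a maximization simplex tableau: an optimum is
    reached when no non-basic variable has a positive coefficient in the
    objective row. The max is what the patent computes on the GPU with a
    reduction; only that scalar crosses back to the CPU."""
    n_cols = tableau.shape[1] - 1                 # last column holds the RHS
    nonbasic = [j for j in range(n_cols) if j not in basis]
    return tableau[-1, nonbasic].max() <= 0
```

If the returned flag is true, the CPU stops iterating; otherwise the algorithm proceeds to the unbounded-solution check.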
Second, unbounded solution processing: if the current iteration has no optimal solution, the index of the entering variable is obtained on the CPU and transferred to the GPU. If the column corresponding to that index contains no value greater than 0, an unbounded solution exists. Performing this check on the CPU would require transferring the GPU's intermediate results to the CPU, which is time-consuming. Instead, as with the optimal-solution test, the check is done on the GPU: the maximum of the column values corresponding to the entering variable is computed, transferred to the CPU, and compared against 0 there. This both avoids the time cost of transferring the data and lets the maximum be computed across multiple threads with the CUDA Reduce strategy. GPU-side thread allocation is one-dimensional; experiments showed 256 threads per Block to be optimal.
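The unboundedness criterion can likewise be sketched serially. As before, the tableau layout and the function name are illustrative assumptions; the column maximum is the single scalar that, in the patent's design, a GPU reduction produces and sends to the CPU for comparison.

```python
import numpy as np

def is_unbounded(tableau, entering):
    """Unboundedness test: after the entering variable is chosen, the LP
    is unbounded when its column has no positive entry in the constraint
    rows, i.e. nothing limits the increase of the entering variable."""
    return tableau[:-1, entering].max() <= 0  # exclude the objective row
```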
Third, basis variable processing: to obtain the basis variable, the smallest index associated with it must be found. Doing this purely on the CPU would require transferring the GPU's intermediate data, reducing the efficiency of the whole algorithm. Instead, the minimum is obtained on the GPU with the Reduce strategy and only the resulting index is transferred to the CPU. GPU-side thread allocation is again one-dimensional, with 256 threads per Block.
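In the standard simplex method this selection is the minimum-ratio test with smallest-index tie-breaking; the patent does not spell out the criterion, so the sketch below assumes that standard rule. The serial loop stands in for the GPU Reduce that returns only the winning row index.

```python
import numpy as np

def leaving_row(tableau, entering):
    """Minimum-ratio test sketch: among constraint rows whose entry in
    the entering column is positive, choose the row minimizing
    RHS / entry, breaking ties by the smallest row index. Returns -1
    when no row qualifies (the unbounded case)."""
    col = tableau[:-1, entering]
    rhs = tableau[:-1, -1]
    best_ratio, best_row = np.inf, -1
    for i in range(len(col)):
        if col[i] > 0:
            ratio = rhs[i] / col[i]
            if ratio < best_ratio:   # strict '<' keeps the smallest index on ties
                best_ratio, best_row = ratio, i
    return best_row
```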
Fourth, normalization: before the algorithm starts to iterate, Gaussian elimination is applied across the entire matrix. In our implementation we first normalize the row of data corresponding to the basis variable. To speed up data access, that row is stored in GPU shared memory, and each thread processes one element. GPU-side thread allocation is one-dimensional, with 256 threads per Block.
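The normalization step amounts to dividing the pivot row by the pivot element. A serial sketch (function name illustrative) is shown below; the single vector operation stands in for the patent's thread-per-element mapping over the n + m row entries held in shared memory.

```python
import numpy as np

def normalize_pivot_row(tableau, row, entering):
    """Scale the pivot row in place so the entering variable's
    coefficient becomes 1; every other entry of the row is divided
    by the same pivot element."""
    tableau[row, :] /= tableau[row, entering]
    return tableau
```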
Fifth, elimination: after normalization, the row of data corresponding to the basis variable is kept in shared memory and the threads are reallocated. Matching the two-dimensional structure of the matrix, and to make full use of the GPU's computing efficiency, each Block is given a two-dimensional (16, 16) thread layout so that each thread handles one element of its corresponding sub-matrix, effectively utilizing GPU computing resources.
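The elimination stage can be sketched as the classic pivot update: subtract the right multiple of the (already normalized) pivot row from every other row so the entering column becomes a unit vector. Each updated element depends only on its own row, its column's pivot-row entry, and the row's entering-column entry, which is why the patent can map elements one-to-one onto a (16, 16) thread grid. The serial loop below is an illustrative stand-in for that grid.

```python
import numpy as np

def eliminate(tableau, pivot_row, entering):
    """Gaussian elimination of the pivot column: zero the entering
    column in every row except the pivot row. Assumes the pivot row
    was normalized first, so its entering-column entry is 1."""
    for i in range(tableau.shape[0]):
        if i != pivot_row:
            tableau[i, :] -= tableau[i, entering] * tableau[pivot_row, :]
    return tableau
```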
Claims (6)
1. A high-dimensional multi-target dominance method based on CPU + GPU heterogeneous computation, characterized in that: for high-dimensional multi-objective optimization problems, as the number of objectives increases, most of the solution set becomes non-dominated under the Pareto dominance algorithm, causing a loss of selection pressure. A high-dimensional multi-target dominance method based on CPU + GPU heterogeneous computation can address both the selection-pressure problem and the time efficiency of the algorithm, and plays an important role in multi-objective planning; the heterogeneous-computation dominance method comprises five core parts: optimal solution decision, unbounded solution processing, basis variable processing, normalization, and elimination:
First, optimal solution decision: an optimal-solution test is performed on the linear programming problem, converting the CPU-side decision logic into a GPU-side parallel computing strategy;
Second, unbounded solution processing: an unbounded-solution test is performed on the linear programming problem, converting the CPU-side decision logic into a GPU-side parallel computing strategy and accelerating it with a reduction strategy;
Third, basis variable processing: the smallest index associated with the basis variable is obtained with a reduction strategy;
Fourth, normalization: the row of data corresponding to the basis variable is normalized;
Fifth, elimination: Gaussian elimination is performed so that the algorithm approaches the optimal solution.
2. The high-dimensional multi-target dominance method based on CPU + GPU heterogeneous computation of claim 1, characterized in that: the decision logic is converted into a maximum-value computation and carried out with a reduction strategy. Thread allocation per Block is one-dimensional with 256 threads, and the data is stored in global memory.
3. The high-dimensional multi-target dominance method based on CPU + GPU heterogeneous computation of claim 1, characterized in that: the decision logic is converted into a maximum-value computation and carried out with a reduction strategy. Thread allocation per Block is one-dimensional with 256 threads, and the theoretical time complexity of the parallel computation is O(1). The input data volume is n × (m + n), where m is the number of objectives and n is the size of the solution set. The data is stored in global memory.
4. The high-dimensional multi-target dominance method based on CPU + GPU heterogeneous computation of claim 1, characterized in that: the time complexity of the basis variable stage is O(n + m), and the GPU performs the computation with a reduction strategy. Exploiting the weak correlation between the matrix elements involved, each Thread in a Block processes one element, the optimal number of threads per Block is 256, and the parallel computation complexity is O(1). Since the input data is large, it is stored in the GPU's global memory.
5. The high-dimensional multi-target dominance method based on CPU + GPU heterogeneous computation of claim 1, characterized in that: the serial time complexity of the normalization stage is O(n + m). Exploiting the weak correlation between the matrix elements involved, each Thread in a Block processes one element, the optimal number of threads per Block is 256, and the theoretical time complexity of the parallel computation is O(1). Since the data volume to be processed is n + m, shared memory is used for storage to ensure efficiency.
6. The high-dimensional multi-target dominance method based on CPU + GPU heterogeneous computation of claim 1, characterized in that: the serial time complexity of the elimination stage is O(n × (n + m)). No data sharing is needed between Blocks; within a Block a two-dimensional (16, 16) thread allocation is reasonable, and the theoretical time complexity of the parallel computation is O(n + m). The intermediate data volume required by each Block is (m + n), which is stored in shared memory to ensure efficiency.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810560114.XA CN110554988A (en) | 2018-06-03 | 2018-06-03 | high-dimensional multi-target domination method based on CPU + GPU heterogeneous computation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810560114.XA CN110554988A (en) | 2018-06-03 | 2018-06-03 | high-dimensional multi-target domination method based on CPU + GPU heterogeneous computation |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110554988A true CN110554988A (en) | 2019-12-10 |
Family
ID=68735373
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810560114.XA Pending CN110554988A (en) | 2018-06-03 | 2018-06-03 | high-dimensional multi-target domination method based on CPU + GPU heterogeneous computation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110554988A (en) |
- 2018-06-03: CN application CN201810560114.XA published as CN110554988A; status: active, Pending
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111651273A (en) * | 2020-05-29 | 2020-09-11 | 中国人民解放军国防科技大学 | GPU-based large-capacity short burst signal receiver design |
CN111651273B (en) * | 2020-05-29 | 2023-05-05 | 中国人民解放军国防科技大学 | High-capacity short burst signal receiver design based on GPU |
CN111970112A (en) * | 2020-08-10 | 2020-11-20 | 山东大学 | Ethereum deployment method and system based on ZYNQ heterogeneous computing platform |
CN111970112B (en) * | 2020-08-10 | 2022-01-21 | 山东大学 | Ethereum deployment method and system based on ZYNQ heterogeneous computing platform |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106970896B (en) | Vector processor-oriented vectorization implementation method for two-dimensional matrix convolution | |
CN109543830B (en) | Splitting accumulator for convolutional neural network accelerator | |
CN103984527B (en) | Optimization Sparse Matrix-Vector multiplies the method for lifting incompressible pipe flow field simulation efficiency | |
US9886418B2 (en) | Matrix operands for linear algebra operations | |
CN108805266A (en) | A kind of restructural CNN high concurrents convolution accelerator | |
KR20190091858A (en) | Heterogenous Processor Architecture to Integrate CNN and RNN Neural Networks on a Single Chip | |
CN111898733B (en) | Deep separable convolutional neural network accelerator architecture | |
CN110390383A (en) | A kind of deep neural network hardware accelerator based on power exponent quantization | |
CN109840154A (en) | A kind of computation migration method that task based access control relies under mobile cloud environment | |
CN111858465B (en) | Large-scale matrix QR decomposition parallel computing system | |
CN110554988A (en) | high-dimensional multi-target domination method based on CPU + GPU heterogeneous computation | |
CN112257844B (en) | Convolutional neural network accelerator based on mixed precision configuration and implementation method thereof | |
CN102110079A (en) | Tuning calculation method of distributed conjugate gradient method based on MPI | |
CN105373845A (en) | Hybrid intelligent scheduling optimization method of manufacturing enterprise workshop | |
US20240119114A1 (en) | Matrix Multiplier and Matrix Multiplier Control Method | |
CN109993293A (en) | A kind of deep learning accelerator suitable for stack hourglass network | |
CN105229608A (en) | Based on the database processing towards array of coprocessor | |
CN109740619B (en) | Neural network terminal operation method and device for target recognition | |
CN113313252B (en) | Depth separable convolution implementation method based on pulse array | |
CN107133332A (en) | The distribution method and device of a kind of query task | |
CN104933110B (en) | A kind of data prefetching method based on MapReduce | |
CN106648901A (en) | Multichannel signal correlation analyzing method and system | |
US11886347B2 (en) | Large-scale data processing computer architecture | |
CN108984470A (en) | A kind of FPGA mine machine calculates the lifting system and method for power | |
CN114546652A (en) | Parameter estimation method and device and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20191210 |