CN109684061A - Many-core coarse-grained parallelization method for unstructured grids - Google Patents
- Publication number
- CN109684061A
- Authority
- CN
- China
- Prior art keywords
- core
- level
- many
- unstructured grid
- calculating
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5061—Partitioning or combining of resources
- G06F9/5066—Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5083—Techniques for rebalancing the load in a distributed system
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2209/00—Indexing scheme relating to G06F9/00
- G06F2209/50—Indexing scheme relating to G06F9/50
- G06F2209/5018—Thread allocation
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The present invention discloses a many-core coarse-grained parallelization method for unstructured grids. On top of the first-level domain decomposition of the unstructured grid, the method adds a second, thread-level domain decomposition in which each slave core solves its own independent subdomain. This guarantees the data hit rate of the slave cores' kernel computations and achieves coarse-grained parallelism both at the MPI process level and at the slave-core thread level. The invention addresses the adaptability of general unstructured-grid applications to polymorphic heterogeneous processors: according to the data scale of the unstructured grid, it automatically performs the two-level load balancing and the coarse-grained many-core parallelization of the computational kernels, improving the computational efficiency and parallel efficiency of unstructured-grid numerical simulation on heterogeneous architectures.
Description
Technical field
The present invention relates to the field of unstructured-grid flow-field computation, and in particular to a many-core coarse-grained parallel computing method for unstructured grids.
Background art
In recent years, to raise system computing capability, multi-core and many-core processors have become the main components of high-performance computers. Supercomputer systems are trending toward polymorphic, heterogeneous architectures of extremely large parallel scale, yet efficient parallel algorithms matched to these architectures are lacking. As research deepens, the computational domains and topologies of practical numerical-simulation problems become increasingly complex and finely resolved. Because of their flexibility and strong adaptability to complex geometries, unstructured grids are used more and more widely in engineering computation software. On the other hand, because unstructured grids abandon the structural constraints on mesh nodes, data are stored in an irregular order, the memory-access bottleneck becomes more pronounced, and it is difficult to exploit the very high computing power of many-core processors. Efficient parallel numerical methods for unstructured grids on many-core processors are therefore needed.
At present, most parallel implementations for heterogeneous computer systems use a two-level parallel model: coarse-grained parallelism at the MPI process level plus fine-grained parallelism at the many-core thread level. The MPI process level uses the message-passing programming model and achieves parallel computation through coarse-grained load balancing such as domain decomposition and message communication; its parallel granularity is relatively large. The many-core thread level uses the shared-variable programming model: the iterations of the core computational loops are distributed across the many cores with loop-level load balancing, and performance is tuned with techniques such as data-layout optimization and overlapping computation with memory access; its parallel granularity is relatively small. This two-level model is used in structured-grid applications and achieves good performance there. In unstructured grids, however, most core computational sections have discontinuous memory-access patterns, so parallelizing the core loops directly in the conventional many-core fashion yields low performance because of that discontinuous memory access.
In summary, current unstructured-grid applications designed for heterogeneous many-core processors mostly combine process-level coarse-grained parallelism with thread-level fine-grained parallelism. Because of the flexibility and irregularity of unstructured grids, however, the computation involves a large amount of irregular, scattered memory access, so many-core parallel efficiency is low and performance problems are prominent.
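The locality problem described above can be illustrated with a small experiment. The sketch below is an illustration under stated assumptions, not code from the patent: it builds the face list of a quad mesh, shuffles it to mimic unstructured storage order, and counts how many distinct cells each of four threads must read when (a) the face loop is chunked evenly across threads (loop-level, fine-grained) versus (b) each thread owns a band of cells and the faces attached to them (domain-level, coarse-grained). The band decomposition is a simple geometric stand-in for graph partitioning, and all function names are hypothetical.

```python
import random


def grid_faces(nx, ny):
    """Interior faces of an nx-by-ny quad mesh as (left_cell, right_cell) pairs.
    Only the face list is used afterwards, as in an unstructured solver."""
    faces = []
    for j in range(ny):
        for i in range(nx):
            c = j * nx + i
            if i + 1 < nx:
                faces.append((c, c + 1))    # face shared with the east neighbour
            if j + 1 < ny:
                faces.append((c, c + nx))   # face shared with the north neighbour
    return faces


def distinct_cells_per_thread(faces, face_owner, n_threads):
    """Number of distinct cells each thread reads while processing its faces."""
    touched = [set() for _ in range(n_threads)]
    for f, (a, b) in enumerate(faces):
        touched[face_owner[f]].update((a, b))
    return [len(s) for s in touched]


def loop_level_owner(n_faces, n_threads):
    """Fine-grained style: split the face-loop index range into equal chunks."""
    chunk = -(-n_faces // n_threads)  # ceiling division
    return [f // chunk for f in range(n_faces)]


def domain_level_owner(faces, nx, ny, n_threads):
    """Coarse-grained style: thread t owns a band of cell rows and every face
    whose left cell lies in that band (a geometric stand-in for graph partitioning)."""
    rows_per_thread = -(-ny // n_threads)
    return [(a // nx) // rows_per_thread for a, _ in faces]


if __name__ == "__main__":
    nx = ny = 32
    n_threads = 4
    faces = grid_faces(nx, ny)
    random.Random(0).shuffle(faces)  # mimic an unstructured face ordering
    loop = distinct_cells_per_thread(faces, loop_level_owner(len(faces), n_threads), n_threads)
    dom = distinct_cells_per_thread(faces, domain_level_owner(faces, nx, ny, n_threads), n_threads)
    print("loop-level  :", loop)
    print("domain-level:", dom)  # [288, 288, 288, 256]
```

Under domain-level ownership each thread reads its own 256 cells plus at most one neighbouring row, while a chunk of the shuffled face loop reaches a large fraction of the whole mesh. A smaller per-thread working set is what makes it plausible to keep a thread's data resident in a slave core's limited local storage.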
Summary of the invention
The object of the present invention is to provide a many-core coarse-grained parallelization method for unstructured grids that solves the problems mentioned in the background art above.
To achieve this object, the present invention adopts the following technical solution:
A many-core coarse-grained parallelization method for unstructured grids, which uses two-level load balancing based on the architectural features of domestic heterogeneous many-core processors: on top of the first-level domain decomposition of the unstructured grid, a second-level domain decomposition is added, and each slave core completes the computing tasks of its own region. This guarantees the data hit rate of slave-core computation and achieves coarse-grained many-core parallelism of the computational kernels at the thread level.
In particular, the two-level load balancing based on the architectural features of domestic heterogeneous many-core processors specifically includes: S101, completing first-level, process-level load balancing, in which the computational domain is decomposed with a graph-partitioning method so that load is balanced across processes and communication is minimized; S102, completing second-level, thread-level load balancing, in which the maximum amount of work each slave core can handle is computed from the capacity of the slave core's data storage space, and the computational domain of each process is likewise decomposed at the second level with a graph-partitioning method to assign tasks to the slave cores.
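Step S102 sizes the second-level subdomains from the slave cores' data storage capacity. The source gives no formula, so the following is only a minimal sketch under assumptions: the per-core storage size, the bytes reserved for other uses, and the bytes needed per grid cell are hypothetical parameters, and rounding the number of blocks up to a multiple of the slave-core count (so every core receives the same number of blocks) is a design choice of this sketch, not of the source.

```python
import math
from dataclasses import dataclass


@dataclass
class SlaveCoreStore:
    """Hypothetical description of one slave core's local data storage."""
    capacity_bytes: int   # total local data storage per slave core
    reserved_bytes: int   # assumed to be kept back for stack, halo buffers, etc.


def max_cells_per_core(store: SlaveCoreStore, bytes_per_cell: int) -> int:
    """Largest number of cells whose data fit in one slave core's local store."""
    usable = store.capacity_bytes - store.reserved_bytes
    if usable < bytes_per_cell:
        raise ValueError("local store cannot hold even one cell")
    return usable // bytes_per_cell


def second_level_plan(n_local_cells: int, n_slave_cores: int,
                      store: SlaveCoreStore, bytes_per_cell: int):
    """Return (cells_per_block, n_blocks, rounds) for one process's subdomain.

    Blocks never exceed the per-core capacity, and n_blocks is a multiple of
    n_slave_cores, so each core processes `rounds` blocks one after another."""
    cap = max_cells_per_core(store, bytes_per_cell)
    rounds = max(1, math.ceil(n_local_cells / (cap * n_slave_cores)))
    n_blocks = rounds * n_slave_cores
    cells_per_block = math.ceil(n_local_cells / n_blocks)
    return cells_per_block, n_blocks, rounds


if __name__ == "__main__":
    # Illustrative numbers only: 64 KiB store, 8 KiB reserved, 28 doubles per cell.
    store = SlaveCoreStore(capacity_bytes=64 * 1024, reserved_bytes=8 * 1024)
    print(max_cells_per_core(store, 28 * 8))             # 256
    print(second_level_plan(40_000, 64, store, 28 * 8))  # (209, 192, 3)
```

With these assumed numbers, a process holding 40,000 cells on 64 slave cores is split into three rounds of 64 blocks of at most 209 cells, each within the 256-cell capacity.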
In particular, the coarse-grained many-core parallelism of the computational kernels at the thread level specifically includes: first distributing tasks to the processes according to the load-balancing result of step S101, then assigning tasks to the slave-core threads according to the load-balancing result of step S102; following the computational requirements of computational fluid dynamics, each slave core completes, in order, the computation of every kernel, including flux computation and the solution of the sparse algebraic equation system.
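The two-stage distribution described above can be sketched as follows, assuming the output of the two load-balancing steps is available as a mapping from each cell to a (process, block) pair. The mapping format, the function name `dispatch`, and the round-robin rule that gives block k to slave thread k mod T are assumptions of this sketch; the source only states that tasks go to processes per S101 and then to slave-core threads per S102.

```python
from collections import defaultdict


def dispatch(plan, n_slave_threads):
    """Build per-process, per-slave-thread work queues from a two-level plan.

    plan maps cell -> (process, block). The result maps
    process -> slave thread -> list of blocks (each a sorted list of cells),
    in the order the thread should process them. Block k of a process goes
    to slave thread k % n_slave_threads (an assumed round-robin rule)."""
    blocks = defaultdict(list)
    for cell, (proc, blk) in sorted(plan.items()):
        blocks[(proc, blk)].append(cell)  # group cells by (process, block)
    work = defaultdict(lambda: defaultdict(list))
    for proc, blk in sorted(blocks):
        # hand each process's blocks to its slave threads, lowest block first
        work[proc][blk % n_slave_threads].append(blocks[(proc, blk)])
    return {proc: dict(threads) for proc, threads in work.items()}


if __name__ == "__main__":
    plan = {0: (0, 0), 1: (0, 0), 2: (0, 1), 3: (1, 0), 4: (1, 1), 5: (1, 2)}
    print(dispatch(plan, n_slave_threads=2))
    # {0: {0: [[0, 1]], 1: [[2]]}, 1: {0: [[3], [5]], 1: [[4]]}}
```

Here process 1 has three blocks but only two slave threads, so thread 0 processes blocks 0 and 2 in two successive rounds.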
The many-core coarse-grained parallelization method for unstructured grids proposed by the present invention adds a second, thread-level domain decomposition on top of the first-level domain decomposition of the unstructured grid. Each slave core solves its own independent subdomain, which guarantees the data hit rate of the slave cores' kernel computations and achieves coarse-grained parallelism at both the MPI process level and the slave-core thread level. The invention addresses the adaptability of general unstructured-grid applications to polymorphic heterogeneous processors: according to the data scale of the unstructured grid, it automatically performs the two-level load balancing and the coarse-grained many-core parallelization of the computational kernels, improving the computational efficiency and parallel efficiency of unstructured-grid numerical simulation on heterogeneous architectures.
Brief description of the drawings
Fig. 1 is a flowchart of the many-core coarse-grained parallelization method for unstructured grids provided by an embodiment of the present invention.
Detailed description of the embodiments
The present invention is further described below with reference to the accompanying drawing and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention, not to limit it. It should also be noted that, for ease of description, the drawing shows only the parts related to the present invention rather than all of it. Unless otherwise defined, all technical and scientific terms used herein have the meanings commonly understood by those of ordinary skill in the art to which the invention belongs. The terms used herein are intended only to describe specific embodiments and are not intended to limit the present invention.
Refer to Fig. 1, which is a flowchart of the many-core coarse-grained parallelization method for unstructured grids provided by an embodiment of the present invention.
In this embodiment, the method uses two-level load balancing based on the architectural features of domestic heterogeneous many-core processors: on top of the first-level domain decomposition of the unstructured grid, a second-level domain decomposition is added, and each slave core completes the computing tasks of its own region. This guarantees the data hit rate of slave-core computation and achieves coarse-grained many-core parallelism of the computational kernels at the thread level. The architectural features of domestic heterogeneous many-core processors include a large number of slave cores, relatively strong slave-core computing capability, and slave-core instruction and data spaces of limited capacity.
Specifically, the two-level load balancing based on the architectural features of heterogeneous many-core processors in this embodiment includes: S101, completing first-level, process-level load balancing, in which the computational domain is decomposed with a graph-partitioning method so that load is balanced across processes and communication is minimized; S102, completing second-level, thread-level load balancing, in which the maximum amount of work each slave core can handle is computed from the capacity of the slave core's data storage space, and the computational domain of each process is likewise decomposed at the second level with a graph-partitioning method to assign tasks to the slave cores.
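Both S101 and S102 rely on partitioning the cell-adjacency graph. The source does not name a partitioner; production codes commonly call a library such as METIS. To stay self-contained, the sketch below substitutes a simple greedy graph-growing heuristic (breadth-first region growing from a seed cell) and applies it twice: once over the whole mesh to form process subdomains, then within each process subdomain to form blocks no larger than an assumed per-core capacity. All names are hypothetical, and this heuristic balances part sizes but does not minimize communication as well as a multilevel partitioner would.

```python
from collections import Counter, deque


def grow_partition(adj, vertices, n_parts):
    """Greedy graph growing (a stand-in for a multilevel partitioner).

    Splits `vertices` into n_parts parts whose sizes differ by at most one,
    growing each part breadth-first from a seed so that it tends to stay
    compact. adj[v] lists the neighbours of cell v. Returns {cell: part}."""
    remaining = set(vertices)
    part_of = {}
    for p in range(n_parts):
        target = len(remaining) // (n_parts - p)   # even share of what is left
        queue, size = deque(), 0
        while size < target:
            if not queue:                          # start, or restart after a dead end
                queue.append(min(remaining))
            v = queue.popleft()
            if v not in remaining:                 # already claimed via another path
                continue
            remaining.remove(v)
            part_of[v] = p
            size += 1
            queue.extend(w for w in adj[v] if w in remaining)
    return part_of


def edge_cut(adj, part_of):
    """Number of faces whose two cells lie in different parts (communication proxy)."""
    return sum(1 for v in part_of for w in adj[v]
               if v < w and w in part_of and part_of[v] != part_of[w])


def two_level_partition(adj, n_procs, max_cells_per_block):
    """S101: split the whole mesh across processes.
    S102: split each process's cells into blocks no larger than the
    per-slave-core capacity. Returns ({cell: (process, block)}, level-1 edge cut)."""
    level1 = grow_partition(adj, range(len(adj)), n_procs)
    plan = {}
    for proc in range(n_procs):
        cells = [c for c, p in level1.items() if p == proc]
        n_blocks = max(1, -(-len(cells) // max_cells_per_block))  # ceiling division
        for c, blk in grow_partition(adj, cells, n_blocks).items():
            plan[c] = (proc, blk)
    return plan, edge_cut(adj, level1)


def grid_adjacency(nx, ny):
    """Cell-adjacency lists of an nx-by-ny quad mesh, used only as test input."""
    adj = [[] for _ in range(nx * ny)]
    for j in range(ny):
        for i in range(nx):
            c = j * nx + i
            if i + 1 < nx:
                adj[c].append(c + 1)
                adj[c + 1].append(c)
            if j + 1 < ny:
                adj[c].append(c + nx)
                adj[c + nx].append(c)
    return adj


if __name__ == "__main__":
    adj = grid_adjacency(16, 16)
    plan, cut = two_level_partition(adj, n_procs=4, max_cells_per_block=20)
    print("cells per process:", sorted(Counter(p for p, _ in plan.values()).items()))
    print("largest block:", max(Counter(plan.values()).values()))
    print("process-level edge cut:", cut)
```

Swapping `grow_partition` for a call into a real graph partitioner would leave the two-level structure of `two_level_partition` unchanged.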
In this embodiment, the coarse-grained many-core parallelism of the computational kernels at the thread level specifically includes: S103, first distributing tasks to the processes according to the load-balancing result of step S101, then assigning tasks to the slave-core threads according to the load-balancing result of step S102; following the computational requirements of computational fluid dynamics, each slave core completes, in order, the computation of every kernel, including flux computation and the solution of the sparse algebraic equation system.
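The per-core work order in S103 (flux computation first, then the sparse linear solve) can be illustrated on a model problem. The sketch below is a stand-in chosen for brevity, not the CFD scheme of the source: it solves the scalar system (1 + deg_i) u_i - sum_j u_j = b_i on the cell-adjacency graph. In each outer iteration every block, conceptually on its own slave core, first evaluates face fluxes u_j - u_i to form its residual, then performs one Gauss-Seidel sweep on its block-local system with off-block couplings dropped; all updates are applied together afterwards. Because the matrix is strictly diagonally dominant, this block-wise defect-correction iteration converges. The function names and the model equation are assumptions.

```python
def grid_adjacency(nx, ny):
    """Cell-adjacency lists of an nx-by-ny quad mesh, used only as test input."""
    adj = [[] for _ in range(nx * ny)]
    for j in range(ny):
        for i in range(nx):
            c = j * nx + i
            if i + 1 < nx:
                adj[c].append(c + 1)
                adj[c + 1].append(c)
            if j + 1 < ny:
                adj[c].append(c + nx)
                adj[c + nx].append(c)
    return adj


def block_kernels(block, adj, u, b):
    """Work of one slave core on its own block, in the stated kernel order.

    Kernel 1 (flux): residual R_i = b_i - u_i + sum_j (u_j - u_i) for i in block.
    Kernel 2 (sparse solve): one Gauss-Seidel sweep on the block-local system
        (1 + deg_i) du_i - sum_{j in block} du_j = R_i, with du = 0 off-block.
    Returns {cell: du}."""
    in_block = set(block)
    residual = {}
    for i in block:
        flux = sum(u[j] - u[i] for j in adj[i])   # net face flux into cell i
        residual[i] = b[i] - u[i] + flux
    du = {}
    for i in block:
        coupled = sum(du.get(j, 0.0) for j in adj[i] if j in in_block)
        du[i] = (residual[i] + coupled) / (1.0 + len(adj[i]))
    return du


def solve(adj, blocks, b, n_iters):
    """Outer iterations. Blocks only read the old u within an iteration, so on
    the target hardware each block could run on a different slave core."""
    u = [0.0] * len(adj)
    for _ in range(n_iters):
        updates = {}
        for block in blocks:
            updates.update(block_kernels(block, adj, u, b))
        for i, d in updates.items():
            u[i] += d
    return u


if __name__ == "__main__":
    adj = grid_adjacency(8, 8)
    blocks = [list(range(k * 16, (k + 1) * 16)) for k in range(4)]  # four row bands
    b = [1.0] * len(adj)  # chosen so the exact solution is u = 1 everywhere
    u = solve(adj, blocks, b, n_iters=200)
    print("max error:", max(abs(x - 1.0) for x in u))
```

Keeping each block's flux and solve kernels together on one core means both kernels read the same block-local data, which is the locality argument the method relies on.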
The technical solution of the present invention adds a second, thread-level domain decomposition on top of the first-level domain decomposition of the unstructured grid. Each slave core solves its own independent subdomain, which guarantees the data hit rate of the slave cores' kernel computations and achieves coarse-grained parallelism at both the MPI process level and the slave-core thread level. The invention addresses the adaptability of general unstructured-grid applications to polymorphic heterogeneous processors: according to the data scale of the unstructured grid, it automatically performs the two-level load balancing and the coarse-grained many-core parallelization of the computational kernels, improving the computational efficiency and parallel efficiency of unstructured-grid numerical simulation on heterogeneous architectures.
Those of ordinary skill in the art will understand that all or part of the above embodiment can be implemented by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, can include the flow of the method embodiments above. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the invention is not limited to the specific embodiments described herein, and that various obvious changes, readjustments, and substitutions can be made without departing from the scope of protection of the invention. Therefore, although the present invention has been described in some detail through the above embodiments, it is not limited to them; it may include other equivalent embodiments without departing from the inventive concept, and its scope is determined by the scope of the appended claims.
Claims (3)
1. A many-core coarse-grained parallelization method for unstructured grids, characterized in that the method uses two-level load balancing based on the architectural features of domestic heterogeneous many-core processors: on top of the first-level domain decomposition of the unstructured grid, a second-level domain decomposition is added, and each slave core completes the computing tasks of its own region, which guarantees the data hit rate of slave-core computation and achieves coarse-grained many-core parallelism of the computational kernels at the thread level.
2. The many-core coarse-grained parallelization method for unstructured grids according to claim 1, characterized in that the two-level load balancing based on the architectural features of domestic heterogeneous many-core processors specifically includes: S101, completing first-level, process-level load balancing, in which the computational domain is decomposed with a graph-partitioning method so that load is balanced across processes and communication is minimized; S102, completing second-level, thread-level load balancing, in which the maximum amount of work each slave core can handle is computed from the capacity of the slave-core data storage space, and the computational domain of each process is decomposed at the second level with a graph-partitioning method to assign tasks to the slave cores.
3. The many-core coarse-grained parallelization method for unstructured grids according to claim 2, characterized in that the coarse-grained many-core parallelism of the computational kernels at the thread level specifically includes: first distributing tasks to the processes according to the load-balancing result of step S101, then assigning tasks to the slave-core threads according to the load-balancing result of step S102, wherein, following the computational requirements of computational fluid dynamics, each slave core completes in order the computation of each kernel, including flux computation and the solution of the sparse algebraic equation system.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811583475.2A CN109684061A (en) | 2018-12-24 | 2018-12-24 | Many-core coarse-grained parallelization method for unstructured grids |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811583475.2A CN109684061A (en) | 2018-12-24 | 2018-12-24 | Many-core coarse-grained parallelization method for unstructured grids |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109684061A (en) | 2019-04-26 |
Family
ID=66188193
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811583475.2A Pending CN109684061A (en) | 2018-12-24 | 2018-12-24 | Many-core coarse-grained parallelization method for unstructured grids |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109684061A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140039853A1 (en) * | 2011-01-10 | 2014-02-06 | Saudi Arabian Oil Company | Scalable simulation of multiphase flow in a fractured subterranean reservoir with multiple interacting continua by matrix solution |
CN104239661A (en) * | 2013-06-08 | 2014-12-24 | 中国石油化工股份有限公司 | Large-scale numerical reservoir simulation calculation method |
CN104375882A (en) * | 2014-11-21 | 2015-02-25 | 北京应用物理与计算数学研究所 | Multistage nested data drive calculation method matched with high-performance computer structure |
CN108595277A (en) * | 2018-04-08 | 2018-09-28 | 西安交通大学 | A kind of communication optimization method of the CFD simulated programs based on OpenMP/MPI hybrid programmings |
CN109064559A (en) * | 2018-05-28 | 2018-12-21 | 杭州阿特瑞科技有限公司 | Vascular flow analogy method and relevant apparatus based on mechanical equation |
-
2018
- 2018-12-24 CN CN201811583475.2A patent/CN109684061A/en active Pending
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110543663A (en) * | 2019-07-22 | 2019-12-06 | 西安交通大学 | Coarse-grained MPI + OpenMP hybrid parallel-oriented structural grid area division method |
CN110543663B (en) * | 2019-07-22 | 2021-07-13 | 西安交通大学 | Coarse-grained MPI + OpenMP hybrid parallel-oriented structural grid area division method |
CN110633149A (en) * | 2019-09-10 | 2019-12-31 | 中国人民解放军国防科技大学 | Parallel load balancing method for balancing calculation amount of unstructured grid unit |
CN110633149B (en) * | 2019-09-10 | 2021-06-04 | 中国人民解放军国防科技大学 | Parallel load balancing method for balancing calculation amount of unstructured grid unit |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190426 |