CN111459633B

CN111459633B - Irregular program-oriented self-adaptive thread partitioning method

Info

Publication number: CN111459633B
Application number: CN202010238885.4A
Authority: CN
Inventors: 李玉祥; 张志勇; 牛丹梅; 张丽丽; 赵长伟; 荆军昌; 邵东霞; 徐艳艳
Original assignee: Henan University of Science and Technology
Current assignee: Henan University of Science and Technology
Priority date: 2020-03-30
Filing date: 2020-03-30
Publication date: 2023-04-11
Anticipated expiration: 2040-03-30
Also published as: CN111459633A

Abstract

An adaptive thread partitioning method for irregular programs, relating to the technical field of computers, comprises: building a program complexity calculation model on a multi-core platform; establishing a candidate thread partitioning scheme set based on classical thread partitioning methods; establishing a selection mechanism for thread partitioning schemes according to expert knowledge; and selecting the most suitable thread partitioning scheme for a program according to its context and complexity. The beneficial effects of the invention are as follows: the method can partition irregular programs of different types optimally, exploit their potential parallelism to the greatest extent, and improve program speedup; it resolves the mismatch between serial software and multi-core hardware, makes full use of multi-core processor resources and legacy serial programs, promotes the parallelization of software for multi-core processors, and supports the healthy, sound, and rapid development of related industries such as high-performance computing and cloud computing, and therefore has good application prospects and practical value.

Description

Irregular program-oriented self-adaptive thread partitioning method

Technical Field

The invention belongs to the technical field of computers, and particularly relates to an irregular program-oriented adaptive thread partitioning method.

Background

The multi-core era has arrived, and traditional parallel programming models and compilation techniques face new challenges. An effective way to run serial programs on multi-core processors is to parallelize them, which both transforms legacy serial programs and makes good use of increasingly abundant core resources. Traditional parallelization methods, such as OpenMP, MPI, TBB, OpenCL, and CUDA, handle dependences conservatively: concurrent units (threads or processes) that depend on one another are serialized through synchronization or communication, so irregular programs parallelize poorly. Thread-Level Speculation (TLS), a speculative multithreading technique, allows concurrent units to execute in parallel aggressively even when data dependences exist between them. It overcomes the inability of traditional parallelization methods to resolve ambiguous thread-level dependences and shows good prospects for parallelizing irregular programs. Thread partitioning is the key step in which TLS inserts thread partitioning statements into a serial program; it is the core of speculative parallelization and directly affects speedup, so research on thread partitioning methods is urgent.

The existing thread partitioning methods mainly include heuristic rule-based, machine learning-based, and graph-based methods. The first determines a thread partitioning scheme according to heuristic rules and fixes the insertion positions of thread partitioning statements on the program's execution path; the latter two use machine learning to learn the partitioning knowledge contained in samples, predict a thread partitioning scheme from program features, and partition the program with that scheme. Guided by a partitioning rule or by learned partitioning knowledge, these methods generate a uniform thread partitioning scheme for programs of the same kind. However, the complexity and execution state of an unknown program are difficult to predict, so a uniform thread partitioning scheme can hardly guarantee the maximum performance improvement.

To understand the state of existing thread partitioning methods, existing papers and patents were searched, compared, and analyzed, and the following technical information most relevant to the present method was identified:

Technical scheme 1: During the partitioning of a serial program, the Heuristic Rules-based Thread Partition Approach (HR-based) uses heuristic rules to determine the value ranges of parameters such as thread granularity, inter-thread data dependence, and spawning distance for the threads generated after partitioning, and thereby determines the positions of the partition marks (SP-CQIP points).

The paper entitled "Min-cut program decomposition for thread-level speculation" partitions the program flow graph with a graph min-cut algorithm, uses heuristics to balance the costs of factors such as data dependence, performance overhead, and load imbalance, and obtains a performance improvement after program partitioning.

The paper entitled "A General Compiler Framework for Speculative Multithreading" uses heuristic rules to determine the granularity, priority, and other properties of each thread produced by thread partitioning on the critical path of program execution.

The paper entitled "A Speculative Multithreaded Processor Based on Precomputation Slices" uses heuristic rules to select candidate spawning pairs (SP-CQIP) in order to reduce the search space. During selection, spawning pairs whose contribution rate is below the contribution threshold are discarded; for the pairs that remain, both points lie in the same procedure or loop body, the length of the pair is smaller than the length threshold, the probability of reaching the CQIP from the SP is larger than the probability threshold, and the ratio of the Precomputation-slice (P-slice) length to the speculative thread size is smaller than the ratio threshold.
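
The selection criteria above can be pictured as a simple filter over candidate pairs. The following is a minimal sketch, not the cited paper's implementation: the SpawnPair fields and every threshold value are hypothetical, chosen only to show the shape of such a heuristic rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpawnPair:
    """A candidate SP-CQIP pair with profile data (all fields are illustrative)."""
    contribution: float       # share of execution covered by the pair
    same_region: bool         # SP and CQIP lie in the same procedure or loop body
    length: int               # instructions between SP and CQIP
    reach_probability: float  # probability of reaching the CQIP from the SP
    pslice_length: int        # length of the precomputation slice
    thread_size: int          # size of the speculative thread


# Hypothetical thresholds; the text gives no concrete values.
MIN_CONTRIBUTION = 0.02
MAX_LENGTH = 500
MIN_REACH_PROBABILITY = 0.95
MAX_PSLICE_RATIO = 0.3


def keep_pair(p: SpawnPair) -> bool:
    """Return True when the pair passes every heuristic rule listed above."""
    return (
        p.contribution >= MIN_CONTRIBUTION
        and p.same_region
        and p.length < MAX_LENGTH
        and p.reach_probability > MIN_REACH_PROBABILITY
        and p.pslice_length / p.thread_size < MAX_PSLICE_RATIO
    )


candidates = [
    SpawnPair(0.10, True, 120, 0.99, 10, 100),
    SpawnPair(0.01, True, 120, 0.99, 10, 100),   # contribution too small
    SpawnPair(0.10, False, 120, 0.99, 10, 100),  # SP and CQIP in different regions
    SpawnPair(0.10, True, 120, 0.99, 60, 100),   # P-slice too long for the thread
]
selected = [p for p in candidates if keep_pair(p)]
```

Because every rule is a fixed threshold, the same filter is applied to all programs, which is the uniformity the present method sets out to avoid.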

Technical scheme 2: The Machine Learning-based Thread Partition Approach (ML-based) uses machine learning to learn the thread partitioning knowledge in a sample set, predicts a partitioning scheme for a new input program from its features, and uses that scheme to guide the partitioning of the program.

A thesis entitled "Speculative multithreading partitioning algorithm based on fuzzy clustering" searches the effective thread solution space with a clustering method to obtain a better thread partition.

The paper entitled "A Novel Thread Partitioning Approach Based on Machine Learning for Speculative Multithreading" proposes a KNN-based thread partitioning method. The method has two main parts: generating the training sample set and extracting the partitioning knowledge it contains; and, for each unknown program, computing its similarity to every sample and selecting the k most similar samples to determine the program's thread partitioning scheme.
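
The k-nearest-neighbour idea described above can be sketched as follows. This is an illustration only: the two-element feature vectors, the scheme labels, the Euclidean distance, and the majority vote are all assumptions, since the text does not say how similarity is measured or how the k samples are combined.

```python
import math
from collections import Counter

# Hypothetical training samples: (feature vector, partitioning scheme label).
samples = [
    ((10.0, 2.0), "scheme_A"),
    ((12.0, 3.0), "scheme_A"),
    ((40.0, 9.0), "scheme_B"),
    ((42.0, 8.0), "scheme_B"),
    ((41.0, 10.0), "scheme_B"),
]


def predict_scheme(features, samples, k=3):
    """Pick the scheme that is most common among the k nearest samples."""
    nearest = sorted(samples, key=lambda s: math.dist(features, s[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


small_program = predict_scheme((11.0, 2.5), samples)
large_program = predict_scheme((41.0, 9.0), samples)
```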

A paper entitled "localization streaming parallel for multi-core adaptive processing" uses machine learning to partition streaming programs in a portable, automatic, compiler-based manner: prior knowledge is learned offline and then used to predict the partition structure of an unknown program.

The paper entitled "Optimizing Partition Thresholds in Speculative Multithreading" extracts the five main parameters that influence thread partitioning and optimizes them with a layer-by-layer traversal method, thereby obtaining a better thread partitioning scheme for a program.

The paper entitled "A Parametric Model in Speculative Multithreading" uses linear regression to discover the relationship between thread partitioning parameters and speedup, and derives partitioning schemes for irregular programs.

The paper entitled "Using Artificial Neural Network for Predicting Thread Partitioning in Speculative Multithreading" uses an artificial neural network to learn thread partitioning knowledge and predict the partitioning scheme of an unknown program.

Technical scheme 3: The graph-based thread partitioning method uses the Weighted Control Flow Graph (WCFG) of a program to capture program feature information comprehensively, and partitions the different paths of the WCFG.

The paper entitled "A Graph-Based Thread Partition Approach in Speculative Multithreading" proposes a graph-based thread partitioning method in which a weighted control flow graph formally expresses an irregular program, machine learning is used to learn thread partitioning knowledge and predict a partitioning scheme for an unknown program, and the generated partitioning scheme is applied to every procedure of the program.

The paper entitled "GbA: A graph-based thread partition approach in speculative multithreading" proposes a graph-based thread partitioning method in which an irregular program is formally expressed by its Weighted Control Flow Graph (WCFG), and machine learning is used to learn thread partitioning knowledge and predict a thread partitioning scheme for an unknown program.

The paper entitled "Improving Graph Partitioning for Modern Graphs and Architectures" performs graph partitioning on sparse irregular data and presents the multithreaded graph partitioner mt-Metis; experiments with 20 different graphs from multiple domains on 36 cores verify the effectiveness of the method.

Technical schemes 1, 2, and 3 respectively implement thread partitioning based on heuristic rules, on machine learning, and on graphs. However, each of these schemes has drawbacks.

Technical scheme 1 uses heuristic rules to specify the upper and lower limits of thread granularity, inter-thread dependence, spawning distance, and similar parameters during thread partitioning, and these limits guide where the thread partition marks (SP-CQIP) are inserted. Compared with the other methods, this scheme is simple and easy to apply. However, a single uniform partitioning rule is applied to every program to be partitioned, so some programs do not obtain the best performance improvement.

Technical scheme 2 uses machine learning to learn the thread partitioning knowledge in samples; the program to be partitioned is compared with the samples, and the partitioning scheme of the most similar sample guides its thread partitioning. The machine learning-based method is intelligent and partitions automatically, but compared with scheme 1 it has a granularity problem: the smallest unit to which a partitioning scheme is assigned is the whole program rather than a single procedure. The thread partitioning operation itself, however, is carried out procedure by procedure, so some procedures in the program do not achieve their best performance improvement.

Technical scheme 3 proposes a graph-based thread partitioning method in which an irregular program is formally expressed by a weighted control flow graph, and machine learning is used to learn thread partitioning knowledge and predict a thread partitioning scheme for an unknown program. This scheme can fully mine the feature information of a program, but it cannot partition the procedures within a program individually, so program performance is not improved to the greatest extent.

The current state of research on thread partitioning methods for irregular programs, in China and abroad, can be summarized as follows: heuristic rule-based methods are simple and easy to apply; machine learning-based methods are intelligent and partition automatically; graph-based methods express the data and control information of a program more comprehensively. In summary, however, most existing thread partitioning methods adopt a uniform thread partitioning scheme for irregular programs of the same type and pay little attention to program complexity, execution state, and similar aspects, which seriously reduces the efficiency of serial program parallelization.

Disclosure of Invention

The technical problem to be solved by the invention is to provide an adaptive thread partitioning method for irregular programs on a multi-core platform. It addresses the problem that existing thread partitioning methods adopt a uniform thread partitioning scheme for irregular programs of the same type and pay little attention to program complexity, execution state, and similar aspects, which seriously reduces the efficiency of serial program parallelization.

The technical scheme adopted by the invention for solving the technical problems is as follows: an irregular program-oriented adaptive thread partitioning method comprises the following steps:

Step one, establishing a complexity calculation model of an irregular program

1.1, using a formal expression with the basic block as the analysis unit, constructing the control flow graph (CFG) of the program, and adding the feature values obtained by program analysis to the CFG as annotations to form a weighted control flow graph;

1.2, calculating the complexity of each possible path on the weighted control flow graph, namely the sub-complexity, based on probability statistics and graph traversal;

1.3, integrating the sub-complexities to obtain the overall complexity of the program.

Step two, constructing a candidate thread partition scheme set conforming to the program context

The candidate thread partitioning scheme set is constructed by fusing program context with program features: an initial candidate set is built from the program features, and this initial set is then filtered according to the program's context values to produce the final candidate thread partitioning scheme set.

Step three, constructing a thread division scheme selection mechanism according with the program complexity

After the program complexity and the candidate thread partitioning scheme set are obtained in steps one and two respectively, a rule set mapping "program complexity → thread partitioning scheme" is established according to expert knowledge; the most suitable thread partitioning scheme is then selected from the candidate set according to the mapping rules, the program complexity, and the execution context.

The rule set stores the expert knowledge used for reasoning. In the rule set, the expert knowledge of the thread partitioning scheme selection mechanism is expressed as mapping rules, whose general form is IF <condition>, THEN <conclusion>.

The invention has the following beneficial effects: (1) The invention studies an adaptive thread partitioning method for irregular programs on a multi-core platform. It builds a program complexity calculation model, establishes a candidate partitioning scheme set based on classical thread partitioning methods, establishes a thread partitioning scheme selection mechanism according to expert knowledge, and selects the most suitable thread partitioning scheme according to context and program complexity. This enables optimal partitioning of programs, so that multi-core resources are fully used and the potential parallelism of irregular programs is exploited to the greatest extent.

(2) The invention uses an adaptive mechanism that, on top of a composite thread partitioning method, adaptively selects the most suitable thread partitioning scheme according to program features and context. This resolves the conflict between multi-core platforms and serial program execution, improves the speedup obtained after serial programs are parallelized, and provides a new approach for multi-core processor design.

(3) The adaptive thread partitioning method provided by the invention effectively solves the problem of parallelizing irregular serial programs on multi-core platforms, advances parallel technology, supports the healthy, sound, and rapid development of related industries such as high-performance computing and cloud computing, and has good application prospects and practical value.

Drawings

FIG. 1 is a schematic overall flow chart of the adaptive thread partitioning method according to the present invention;

FIG. 2 is a flow chart of the program complexity calculation of the present invention;

FIG. 3 is a schematic flow chart of the present invention for constructing a candidate thread partition scheme set;

FIG. 4 is a flow diagram of a thread partitioning scheme selection mechanism.

Detailed Description

The following description of specific embodiments (examples) of the present invention is provided in conjunction with the accompanying drawings to enable those skilled in the art to better understand the present invention.

The overall scheme and flow of the adaptive thread partitioning method of the present invention are shown in FIG. 1. The method takes an irregular serial program as input. Its main components are the establishment of the program complexity calculation model, the generation of candidate thread partitioning schemes, and the selection of a partitioning scheme based on expert knowledge. The thread partitioning scheme most suitable for the program is selected and used to partition it into threads, and the speedup and the program's run results are obtained on the Prophet simulator.

(1) Establishment of complexity calculation model of irregular program

Many program features influence thread partitioning, such as data dependence, control dependence, the number of branches, the number of basic blocks, the average number of dynamic instructions, the nesting depth of loop structures, and the number of procedure calls. The values of these features reflect the complexity of the program (complexity being a measure of how complicated the program is). Most existing thread partitioning methods do not fully consider the influence of program complexity on thread partitioning and simply select some program features as input. As a result, different thread partitioning methods select inconsistent program features, and the thread partitioning schemes they generate are not accurate enough.

First, the program complexity calculation model uses a formal expression with the basic block as the analysis unit to construct the control flow graph (CFG) of the program, and the feature values obtained by program analysis are added to the CFG as annotations to form a Weighted Control Flow Graph (WCFG). Next, the complexity of each possible path on the WCFG (the sub-complexity) is calculated based on probability statistics and graph traversal. Finally, the sub-complexities are integrated to obtain the overall complexity of the program. FIG. 2 shows the flow chart of the program complexity calculation.
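
As a rough illustration of the WCFG and of traversing its paths, the sketch below annotates each basic block with feature values and each edge with a branch probability, then enumerates every head-to-tail path together with its probability. The block names, feature names, and numbers are invented for the example, and the graph is assumed to be acyclic (loops are represented only as annotations).

```python
# Weighted control flow graph: each basic block carries annotated feature
# values, and each edge carries a branch probability (all values made up).
blocks = {
    "entry": {"instructions": 5, "loops": 0},
    "then": {"instructions": 8, "loops": 1},
    "else": {"instructions": 3, "loops": 0},
    "exit": {"instructions": 2, "loops": 0},
}
edges = {
    "entry": [("then", 0.7), ("else", 0.3)],
    "then": [("exit", 1.0)],
    "else": [("exit", 1.0)],
    "exit": [],
}


def enumerate_paths(node, prob=1.0, prefix=()):
    """Yield (path, probability) for every path from `node` to a tail node.

    Assumes the graph is acyclic, so the recursion always terminates.
    """
    path = prefix + (node,)
    if not edges[node]:
        yield path, prob
        return
    for successor, branch_prob in edges[node]:
        yield from enumerate_paths(successor, prob * branch_prob, path)


paths = list(enumerate_paths("entry"))

# Per-path feature totals, which a complexity model could consume.
path_features = [
    {key: sum(blocks[b][key] for b in path) for key in ("instructions", "loops")}
    for path, _ in paths
]
```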

In FIG. 2, P represents the input irregular serial program, G(P) represents the WCFG, F1 to Fn (n ∈ N) represent program features, f1() to fn() (n ∈ N) represent conversion functions, Comp1() to Compn() (n ∈ N) represent the complexity of each path, and Comp represents the total complexity of P. In the model, first, the unknown program P is expressed formally and converted into its WCFG, G(P); second, the features of each possible path in G(P) (from a head node to a tail node) are extracted and denoted F1 to Fn; third, the conversion functions f1() to fn() map feature values to complexity, for example, x basic blocks correspond to a complexity of 0.01 × x and y loops correspond to a complexity of 0.2 × y; then, the complexity Comp1() to Compn() of each path in G(P) is calculated; finally, the path complexities are combined to obtain the complexity of program P.
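
The calculation can be sketched in a few lines. The two conversion coefficients (0.01 per basic block, 0.2 per loop) come from the example in the text; everything else is an assumption made for illustration, in particular that a path's complexity is the sum of its converted feature values and that the program complexity is the probability-weighted sum of path complexities, capped at 1.0.

```python
# Conversion functions f_i(): feature value -> complexity contribution.
# The two coefficients come from the example in the text; other features
# would need their own conversion functions, which the text does not give.
CONVERSIONS = {
    "basic_blocks": lambda x: 0.01 * x,
    "loops": lambda y: 0.2 * y,
}


def path_complexity(features):
    """Comp_i: sum of the converted feature values of one path (assumed rule)."""
    return sum(CONVERSIONS[name](value) for name, value in features.items())


def program_complexity(paths):
    """Comp: probability-weighted sum of path complexities, capped at 1.0.

    `paths` is a list of (execution probability, feature dict) pairs.
    The weighting and the cap are assumptions made for this sketch.
    """
    total = sum(prob * path_complexity(feats) for prob, feats in paths)
    return min(total, 1.0)


paths = [
    (0.7, {"basic_blocks": 20, "loops": 2}),  # Comp_1 is about 0.2 + 0.4 = 0.6
    (0.3, {"basic_blocks": 10, "loops": 0}),  # Comp_2 is about 0.1
]
comp = program_complexity(paths)  # about 0.7 * 0.6 + 0.3 * 0.1 = 0.45
```

Keeping the result within (0, 1] matches the complexity intervals used by the selection rules later in the description, although the source does not state how that range is enforced.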

(2) Construction of a set of candidate thread partitioning schemes that conform to a program context

The candidate thread partitioning scheme set is constructed by fusing program context with program features: an initial candidate set is built from the program features, and this initial set is then filtered according to the program's context values to produce the final candidate thread partitioning scheme set. FIG. 3 shows the construction process of the candidate thread partitioning scheme set.

In FIG. 3, P represents an irregular serial program, F1 to Fn represent program features, Formal(P) represents the formal expression of P, M1 to Mn (n ∈ N) represent n classical thread partitioning methods, and Schem1 to Schemn represent n thread partitioning schemes. The thread partitioning methods are numbered as follows: the heuristic rule-based method (M1) is number 1, the machine learning-based method (M2) is number 2, the graph critical-path-based method (M3) is number 3, the graph full-path-based method (M4) is number 4, the hybrid method (M5) is number 5, and so on. The paths are numbered as follows: the critical path is number 1, and the non-critical paths are numbered 2 to n (n ∈ N). A thread partitioning scheme consists of a thread partitioning method number, a path number, and the five main parameters of the thread partitioning algorithm that influence the partitioning result: the Upper Limit of Spawning Distance (ULoSD), the Lower Limit of Spawning Distance (LLoSD), the Data Dependence Count (DDC), the Upper Limit of Thread Granularity (ULoTG), and the Lower Limit of Thread Granularity (LLoTG). Introducing the context parameters δ1 to δn (n ∈ N) makes the thread partitioning method of the invention context-aware, so that the constructed candidate thread partitioning scheme set better captures changes in program state.
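
One way to picture a thread partitioning scheme is as a record with seven fields: the method number, the path number, and the five parameters. The sketch below models it that way and filters an initial candidate set with a single context value. The class name, the parameter values, and the filtering criterion (`max_spawn_distance`, standing in for one context parameter δ) are hypothetical, since the text does not specify how the context values filter the set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionScheme:
    method: int  # 1 = heuristic rules, 2 = machine learning, 3 = graph critical path, ...
    path: int    # 1 = critical path, 2..n = non-critical paths
    ulosd: int   # upper limit of spawning distance
    llosd: int   # lower limit of spawning distance
    ddc: int     # data dependence count
    ulotg: int   # upper limit of thread granularity
    llotg: int   # lower limit of thread granularity


def build_candidates(initial, max_spawn_distance):
    """Filter the initial candidate set with one context value.

    `max_spawn_distance` stands in for a context parameter; the actual
    filtering criterion is not given in the text.
    """
    return [s for s in initial if s.ulosd <= max_spawn_distance]


initial = [
    PartitionScheme(1, 1, ulosd=50, llosd=5, ddc=3, ulotg=40, llotg=8),
    PartitionScheme(2, 1, ulosd=80, llosd=10, ddc=4, ulotg=60, llotg=10),
    PartitionScheme(3, 2, ulosd=30, llosd=5, ddc=2, ulotg=25, llotg=6),
]
candidates = build_candidates(initial, max_spawn_distance=60)
```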

(3) Construction of thread partitioning scheme selection mechanism conforming to program complexity

After the program complexity has been calculated and the candidate thread partitioning scheme set has been constructed in steps (1) and (2), a rule set mapping "program complexity → thread partitioning scheme" is established according to expert knowledge; the most suitable thread partitioning scheme is then selected from the candidate set according to the mapping rules, the program complexity, and the execution context. FIG. 4 is a flow diagram of the thread partitioning scheme selection mechanism.

The rule set stores the expert knowledge used for reasoning. In the rule set, the expert knowledge of the thread partitioning scheme selection mechanism is expressed as production rules (also called mapping rules). A production rule separates the knowledge representation into two parts, a premise and a conclusion. The general form of an expert knowledge production rule is IF <condition>, THEN <conclusion>, for example:

(i) IF <complexity Comp ∈ [0.8, 1.0]>, THEN <select Schem1'>;

(ii) IF <complexity Comp ∈ [0.6, 0.8)>, THEN <select Schem2'>;

(iii) IF <complexity Comp ∈ [0.4, 0.6)>, THEN <select Schem3'>;

(iv) IF <complexity Comp ∈ [0.2, 0.4)>, THEN <select Schem4'>;

(v) IF <complexity Comp ∈ (0.0, 0.2)>, THEN <select Schem5'>.

Schem1' to Schem5' are partitioning schemes selected from the candidate thread partitioning scheme set generated in step (2); which one is chosen is determined by the complexity of the program and the rules. The rules above are examples of such mapping rules.
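
The five example rules can be encoded as a small lookup table. The sketch below is only an illustration of the IF/THEN form: the interval bounds and scheme names are taken from rules (i) to (v), while the table layout and the function name are assumptions.

```python
# (lower bound, upper bound, lower inclusive, upper inclusive, scheme name)
RULES = [
    (0.8, 1.0, True, True, "Schem1'"),
    (0.6, 0.8, True, False, "Schem2'"),
    (0.4, 0.6, True, False, "Schem3'"),
    (0.2, 0.4, True, False, "Schem4'"),
    (0.0, 0.2, False, False, "Schem5'"),
]


def select_scheme(comp):
    """Return the scheme whose IF-part matches `comp`, or None if no rule fires."""
    for low, high, low_inc, high_inc, scheme in RULES:
        above = comp >= low if low_inc else comp > low
        below = comp <= high if high_inc else comp < high
        if above and below:
            return scheme
    return None


chosen = select_scheme(0.45)
```

A complexity of exactly 0.0 matches no rule, because rule (v) uses an open interval; a caller would need to decide how to handle that case.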

Using an adaptive mechanism, the invention provides an adaptive thread partitioning method for irregular programs. Its overall research goal is to maximize the speedup of irregular programs, and it provides a much-needed thread partitioning method and the related basic theory for the wide application and healthy development of emerging parallel technology.

(1) Maximizing program speedup

By studying the relationship between program features and thread partitioning schemes, a composite thread partitioning scheme is established. Guided by the adaptive mechanism and expert knowledge, the most suitable partitioning scheme is selected and executed for each program automatically, so that the maximum speedup is obtained.

(2) Exploring how program features affect speedup

By analyzing the factors that influence program parallelization, a program complexity model, a candidate thread partitioning scheme set, and a partitioning scheme selection mechanism are established. The way in which program features affect program speedup is explored, providing methodological support for the parallelization of irregular programs on multi-core platforms.

Claims

1. An irregular program-oriented adaptive thread partitioning method is characterized in that: the method comprises the following steps:

step one, establishing a complexity calculation model of an irregular program

1.1, using a formal expression with the basic block as the analysis unit, constructing the control flow graph (CFG) of the program, and adding the feature values obtained by program analysis to the CFG as annotations to form a weighted control flow graph;

1.3, integrating the complexity of the sub-components to obtain the overall complexity of the program;

Constructing a candidate thread partitioning scheme set by fusing program context with program features: constructing an initial candidate set from the program features, and filtering the initial candidate set according to the program's context values to generate the final candidate thread partitioning scheme set; the program features are data dependence, control dependence, the number of branches, the number of basic blocks, the average number of dynamic instructions, the nesting depth of loop structures, and the number of procedure calls, and the values of these features reflect the complexity of the program;

2. The adaptive thread partitioning method for irregular programs according to claim 1, wherein: the rule set stores the expert knowledge used for reasoning; in the rule set, the expert knowledge of the thread partitioning scheme selection mechanism is expressed as mapping rules, whose general form is IF <condition>, THEN <conclusion>.