CN118211553A - Greedy strategy-based PISA architecture chip resource arrangement method - Google Patents


Info

Publication number
CN118211553A
CN118211553A
Authority
CN
China
Prior art keywords
basic block
basic
resource
pipeline
basic blocks
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311287817.7A
Other languages
Chinese (zh)
Inventor
王晓丹
向前
付强
宋亚飞
李松
郭相科
丁鹏
张丹丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Air Force Engineering University of PLA
Original Assignee
Air Force Engineering University of PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Air Force Engineering University of PLA filed Critical Air Force Engineering University of PLA
Priority to CN202311287817.7A priority Critical patent/CN118211553A/en
Publication of CN118211553A publication Critical patent/CN118211553A/en
Pending legal-status Critical Current

Landscapes

  • Devices For Executing Special Programs (AREA)

Abstract

The invention discloses a greedy strategy-based PISA architecture chip resource arrangement method, which comprises the following steps: S1: in the PISA architecture programming model, a P4 program is obtained from the P4 language input by the user, and the P4 program is divided into a series of basic blocks; S2: a data dependency matrix and a control dependency matrix are constructed respectively from the variable information read and written by each basic block and from the adjacency information of each basic block in the flow chart; S3: on the basis of S2, the data dependency matrix and the control dependency matrix are merged into a data-control dependency matrix, the prioritized relation and the absolute priority relation between basic blocks are defined, and an arrangement priority relation directed graph G(0) is constructed; S4: starting from the root nodes of G(0), all absolute priority sequences of basic blocks are found, and each pipeline stage is arranged recursively in combination with the resource constraints according to the two greedy strategies 'longer absolute priority sequences are arranged first' and 'maximize resource utilization within the same pipeline stage', finally obtaining an optimal resource arrangement scheme that satisfies the data dependency, control dependency and resource constraint conditions.

Description

Greedy strategy-based PISA architecture chip resource arrangement method
Technical Field
The invention belongs to the technical field of chip resource arrangement, and particularly relates to a greedy strategy-based PISA architecture chip resource arrangement method.
Background
PISA (Protocol Independent Switch Architecture) is currently one of the mainstream programmable switch chip architectures; it offers a processing rate comparable to that of fixed-function switch chips while remaining programmable, which makes it very competitive in the future chip industry. As shown in fig. 1, the PISA architecture comprises three components, a message parser (Parser), a multi-stage message processing pipeline (Pipeline Packet Processing) and a message reassembler (Deparser), which respectively identify the message type, modify the message data and reassemble the message. Resource arrangement in the multi-stage message processing pipeline is a difficult problem that chip technology urgently needs to solve at the present stage, so using an efficient algorithm to arrange the limited resources for maximum utilization is a key problem to be solved at present.
In the chip programming model of the PISA architecture, the P4 language is first used to describe the message behavior, yielding a P4 program; when the compiler compiles the P4 program it divides it into a series of basic blocks, and each basic block is then arranged into a stage of the pipeline. Because each basic block occupies certain chip resources, distributing the basic blocks to the pipeline stages amounts to distributing the resources of each basic block to the stages (i.e. determining which pipeline stage each basic block is placed in); in other words, the PISA architecture chip resource arrangement problem is a basic block arrangement problem.
When a PISA chip is designed, in order to reduce the complexity of basic block arrangement while achieving high resource utilization, there are usually multiple constraints on the resources of each pipeline stage. Thus, the resource arrangement problem of a PISA architecture chip involves constraints from two sides. The first is the data and control dependencies of the program itself. A basic block is essentially a fragment of the source program and must execute instructions to complete its computation, so data dependencies exist between basic blocks according to the read-before-write order of operations; and because a basic block may jump to several different basic blocks after it finishes executing, control dependencies also exist between basic blocks. These two kinds of dependencies constrain the relative order of the pipeline stages in the arrangement. The second is the resource constraints of the chip itself.
The chip contains four types of resources: TCAM, HASH, ALU and QUALIFY. Each stage of the pipeline has an explicit limit on each of the four resource types, and the given resource constraints must be strictly followed when the resources are arranged. Under the data dependency, control dependency and the several resource constraint conditions of each specific sub-problem, how to design an algorithm that realizes an optimal arrangement scheme, minimizing the total number of pipeline stages and maximizing the value of the chip, is a problem that must be solved in the design of PISA architecture chips.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a greedy strategy-based PISA architecture chip resource arrangement method.
In order to achieve the above purpose, the invention adopts the following technical scheme:
S1: in the PISA architecture programming model, a P4 program is obtained from the P4 language input by the user, and the P4 program is divided into a series of basic blocks;
S2: a data dependency matrix and a control dependency matrix are constructed respectively from the variable information read and written by each basic block and from the adjacency information of each basic block in the flow chart;
S3: on the basis of S2, the data dependency matrix and the control dependency matrix are merged to obtain a data-control dependency matrix, the prioritized relation and the absolute priority relation between basic blocks are defined, and an arrangement priority relation directed graph G(0) is constructed;
S4: starting from the root nodes of G(0), all absolute priority sequences of basic blocks are found, and each pipeline stage is arranged recursively in combination with the resource constraints according to the two greedy strategies 'longer absolute priority sequences are arranged first' and 'maximize resource utilization within the same pipeline stage', finally obtaining an optimal resource arrangement scheme that satisfies the data dependency, control dependency and resource constraint conditions.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention combines the data dependency relationship and the control dependency relationship, gives an arrangement scheme according to the data-control dependency relationship, and then uses the resource constraints to judge whether the scheme is satisfied, thereby simplifying the resource arrangement flow.
2. The invention saves the resource occupation of the pipeline stages that have already passed verification during program design, which reduces the amount of computation in the resource constraint verification process; in particular, for problem 2, where a large number of basic blocks on the same flow must be computed, it greatly reduces the amount of repeated computation.
3. By adopting a recursive idea, the invention solves both the problem of arranging resources stage by stage when the total number of pipeline stages is unknown and the problem of traversing the basic blocks on the same flow within a pipeline stage when they are unknown.
4. The method analyses the targets and corresponding principles of optimal resource arrangement according to the problems, and applies the greedy idea to resource arrangement in two ways, maximizing the resource occupancy of each pipeline stage and minimizing the number of pipeline stages as far as possible; the steps of the proposed scheme are concise and easy to implement.
5. The invention uses the recursive method and the priority relations to arrange the resources of each pipeline stage, which avoids a global search and reduces the computational complexity.
6. The invention can obtain a resource arrangement scheme with the minimum number of pipeline stages and maximized resource utilization of each pipeline stage, reducing message processing delay and improving PISA chip performance.
Drawings
FIG. 1 is a schematic diagram of a PISA architecture;
FIG. 2 is a schematic flow diagram illustrating a problem to be solved by the present invention;
FIG. 3 is the flow chart corresponding to the basic blocks in attachment3.csv;
FIG. 4 is an overall flow chart of the present invention;
FIG. 5 is a schematic diagram of data dependence;
FIG. 6 is a schematic diagram of a basic block read-in variable storage matrix;
FIG. 7 is a schematic diagram of a basic block read-in variable judgment matrix;
FIG. 8 is a depth-first traversal process of the graph;
FIG. 9 is a heat map of the data dependency matrix M_D;
FIG. 10 is a directed graph of data dependencies;
FIG. 11 is a control dependency solving process;
FIG. 12 is a heat map of the control dependency matrix;
FIG. 13 is a control dependency graph;
FIG. 14 is a heat map of the M_CD matrix;
FIG. 15 is the directed graph of the M_CD matrix;
FIG. 16 is the prioritized relation directed graph in the M_CD matrix;
FIG. 17 is the absolute priority relation directed graph in the M_CD matrix;
FIG. 18 is a schematic diagram of a basic block arrangement;
FIG. 19 shows the adjacency information (partial data) of every two basic blocks;
FIG. 20 shows the variable information (partial data) read and written by each basic block.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Firstly, before a PISA architecture chip resource arrangement method is proposed, basic concepts related to the PISA architecture are described:
(1) Message: a message is a packet transmitted in network communication; the data sent by a user is encapsulated into individual messages for transmission.
(2) Basic block: a basic block is a program segment of the source program; the source program is divided into individual basic blocks.
(3) Pipeline: a pipeline is formed by connecting a series of processing units in series; a message passes through each processing unit of the pipeline in turn and is finally processed completely. Each stage of the pipeline refers to one processing unit in the pipeline.
Secondly, the resource arrangement problem to be solved by the invention is described and defined.
According to the definition, a basic block is a segment of the source program and it executes instructions to complete its computation. When an instruction is executed, the source operands (i.e. the variables corresponding to the source operands) are read to perform the calculation, and the result is then assigned to the destination operand (i.e. the variable corresponding to the destination operand). For the divided basic blocks, the instructions inside each basic block are executed in parallel, following a read-before-write order (determined by the underlying chip implementation): all source operands are read at the same time, all instructions are then executed in parallel, and finally the results are assigned to the destination operands at the same time. Since writing the same variable several times in parallel would conflict, each basic block writes a given variable only once (i.e. no two instructions in a basic block assign values to the same variable).
A basic block can be abstracted as a node, hiding the concrete instructions executed inside it and keeping only the information about the variables it reads and writes. When basic block A finishes executing and jumps to basic block B, a directed edge is added between A and B; in this way the P4 program can be represented as a directed acyclic graph (a P4 program contains no loops), called the P4 program flow chart, as shown in the left half of fig. 5. In other words, PISA architecture resource arrangement means arranging the nodes (i.e. basic blocks) of the P4 program flow chart into pipeline stages under the constraint conditions.
In summary, the invention first establishes a mathematical model of resource arrangement that satisfies the data dependencies and control dependencies, and then adds different resource constraint conditions on that basis, so as to obtain the arrangement scheme of the program basic blocks in the pipeline stages under different resource constraint conditions. As shown in fig. 2, the problems to be solved by the invention include:
(1) Problem 1: given the per-stage maximum limits on the TCAM, HASH, ALU and QUALIFY resources, and the maximum limits on the TCAM and HASH resources of each pair of folded stages (stage 0 with stage 16, stage 1 with stage 17, and so on up to stage 15 with stage 31; no folding is considered from stage 32 onwards), arrange the basic blocks according to the adjacency information of the basic blocks, under the conditions that the total number of even stages containing TCAM resources does not exceed 5 and that each basic block corresponds to exactly one stage, taking as few pipeline stages as possible as the optimization target. The adjacency information of every two basic blocks is given in the attachment3.csv table, as shown in fig. 19, and the flow chart corresponding to the basic blocks in attachment3.csv is shown in fig. 3.
(2) Problem 2: on the basis of problem 1, basic blocks that are not in the same execution flow are considered (in the P4 program flow chart, if basic block B is a downstream node of basic block A then A and B are in one execution flow; otherwise they are not), together with the resource sharing introduced by execution flows. Basic blocks C and D that are not in one execution flow can share HASH and ALU resources and can be arranged in the same pipeline stage as long as the HASH and ALU resources of either one of them do not exceed the per-stage resource limit. The per-stage HASH and ALU resource constraints of problem 1 therefore become HASH and ALU constraints on each execution flow within each stage. For the two folded stages, the TCAM resource constraint is unchanged, and the sum of the HASH resources occupied by the same execution flow across the two folded stages must not exceed 3.
Again, for the two problems set forth above, analysis and design of resource arrangement schemes are performed separately.
(1) Problem 1 analysis
The final objective of PISA architecture chip resource arrangement optimization is to make the various resources of the chip more highly utilized. To improve chip resource utilization, on the one hand the number of pipeline stages can be reduced as far as possible, and on the other hand the resource utilization of each pipeline stage can be increased as far as possible. To arrange the basic blocks into different pipeline stages, the relative positions of the basic blocks across pipeline stages must first be known; the program flow chart, the data dependencies and the control dependencies of the basic blocks all reflect the relative positions of the basic blocks in the flow chart. Considering that data dependencies impose both '<' and '≤' relations while control dependencies impose '≤' relations, the strictest relation is taken as the final dependency. Any resource arrangement scheme must simultaneously satisfy the data-control dependencies and pass the verification of the resource constraint determination program, i.e. neither condition can be omitted. The invention therefore gives an arrangement by preferentially satisfying one condition and then verifies the arrangement with the other constraint. Since the data-control dependencies imply a priority order for arranging the basic blocks, the scheme can first be arranged according to them and then verified against the resource constraints.
(2) Problem 2 analysis
Because program basic blocks that are not in one execution flow can share HASH and ALU resources, problem 2 appropriately relaxes the constraints of problem 1, and since only the resource constraints are modified, on the basis of problem 1 it is only necessary to verify whether the arrangement scheme satisfies the new resource constraint conditions. To determine the new resource constraints, the maximum value of the sum of HASH resources and the maximum value of the sum of ALU resources of the basic blocks on the same execution flow in each pipeline stage must be computed, and for all program basic blocks arranged in a given pipeline stage, all combinations of blocks belonging to one execution flow must be determined. In view of this, the invention first proposes a search algorithm for basic blocks on the same execution flow, finds all basic blocks on the same flow in each pipeline stage, adds the HASH and ALU resources occupied by the basic blocks on one execution flow to obtain the total of that flow in the stage, and then takes the maximum of the HASH and ALU resources occupied by the different flows in the stage. For the resource arrangement algorithm itself, the method of problem 1 continues to be used.
The solutions to problems 1 and 2 described above are illustrated with an example:
Suppose pipeline stage 0 has basic blocks A, B, C, D, E and F arranged in it, occupying 1, 2, 3, 4, 5 and 6 units of a certain resource respectively, and suppose A-B-C is one execution flow while D-E-F is another execution flow.
Then under problem 1 the resource occupation of pipeline stage 0 is 1+2+3+4+5+6=21.
Under problem 2, execution flow A-B-C occupies 1+2+3=6 units and execution flow D-E-F occupies 4+5+6=15 units; since basic blocks of different execution flows can share resources (mainly HASH and ALU resources in this problem), the resource occupation of pipeline stage 0 is actually the maximum of the resources occupied by each execution flow, here 4+5+6=15 for execution flow D-E-F.
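For clarity, the two occupancy rules of this example can be reproduced with the following short sketch (the block names, resource values and flows are only those of the example above, not data from the attachments):

# Minimal sketch of the two occupancy rules from the example above.
usage = {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5, 'F': 6}   # resource units per block (example values)
flows = [['A', 'B', 'C'], ['D', 'E', 'F']]                  # two execution flows in stage 0

# Problem 1: every basic block in the stage consumes the resource.
occupancy_p1 = sum(usage.values())                                   # 1+2+3+4+5+6 = 21

# Problem 2: flows share the resource, so the stage occupancy is the maximum over per-flow sums.
occupancy_p2 = max(sum(usage[b] for b in flow) for flow in flows)    # max(6, 15) = 15

print(occupancy_p1, occupancy_p2)   # 21 15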
(3) Model assumptions for the resource arrangement problem
Assumption 1: no basic block is executed more than once on the same execution flow.
Assumption 2: no other interfering program in the chip occupies the TCAM, HASH, ALU or QUALIFY resources.
Assumption 3: the actual placement of basic blocks is not limited by chip space or pipeline architecture.
(4) Symbol definition and description
(5) Solution of the above problem
1) Problem 1 solving
According to the variable information read and written by each basic block and the adjacency information of each basic block in the flow chart, the data dependency matrix and the control dependency matrix are computed respectively, and the two matrices are merged to obtain the data-control dependency matrix M_CD. On this basis the prioritized relation and the absolute priority relation of the basic blocks are defined, and the arrangement is carried out so as to maximize resource utilization: basic blocks satisfying the prioritized relation may be placed in the same pipeline stage, basic blocks with the absolute priority relation are placed in different stages, an arrangement scheme is produced accordingly, and finally the scheme is verified with the resource constraints. The overall flow is shown in fig. 4 and consists of the following steps:
First, solving a data dependency matrix:
Based on the characteristics of the PISA architecture, a directed graph describing the adjacency of the basic blocks is first established: for basic blocks with an adjacency relation, a directed edge is drawn from the upstream basic block to the downstream basic block, constructing a directed acyclic graph whose nodes are the basic blocks and whose edges are the adjacency relations between basic blocks.
Because data dependence is determined by the variables read and written by each basic block and by the adjacency relations of the basic blocks in the P4 flow chart, data dependence is divided into 3 modes according to the order of variable reads and writes: write-after-read dependence (RW), read-after-write dependence (WR) and write-after-write dependence (WW), as shown in fig. 5. In the PISA architecture, when basic blocks A and B have a read-after-write or write-after-write dependence, the pipeline stage number of A must be smaller than that of B; when basic blocks A and B have a write-after-read dependence, the pipeline stage number of A must be less than or equal to that of B. In the process of reading and writing variables, the read/write order of the basic blocks is determined by the adjacency relation (i.e. by the directed acyclic graph): for the same variable, a basic block located upstream in the adjacency relation always reads or writes the variable before a basic block located downstream.
For the directed graph shown in the left part of fig. 5, basic block A reads variable X0 and writes variables X1 and X2; basic block B reads no variable and writes variable X0; basic block C reads no variable and writes variable X1; basic block F reads variable X2 and writes no variable; none of the other basic blocks reads or writes any variable. Analysing the left part of fig. 5, basic blocks A and B therefore have a write-after-read dependence, i.e. the pipeline stage number of A is less than or equal to that of B; basic blocks A and C have a write-after-write dependence, i.e. the pipeline stage number of A is smaller than that of C; basic blocks A and F have a read-after-write dependence, i.e. the pipeline stage number of A is smaller than that of F; there are no data dependences between the other basic blocks. The right part of fig. 5 shows one possible pipeline arrangement (ignoring other constraints) among the many arrangements that satisfy the data dependences: basic blocks A, B and D are arranged in layer 0 of the pipeline, basic blocks C, F and G in layer 1, and basic block E in layer 2.
As shown in fig. 20, the variable information read and written by each basic block is given in the attachment2.csv table. Based on this information, let N denote the total number of basic blocks and P the number of distinct variables that are read or written, and establish four N×P matrices M_R, M_W, M_RT and M_WT, as shown in fig. 6 and fig. 7 (taking reads as an example): M_R and M_W store the variables read and written by each basic block, while M_RT and M_WT are used to determine whether each basic block reads or writes the corresponding variable. Element M_R(i, j) of matrix M_R is the j-th variable read by basic block i; element M_W(i, j) of matrix M_W is the j-th variable written by basic block i; element M_RT(i, k) of matrix M_RT equals 1 if basic block i reads variable Xk and 0 otherwise; element M_WT(i, k) of matrix M_WT equals 1 if basic block i writes variable Xk and 0 otherwise, where 0 ≤ i < N and 0 ≤ j, k < P.
For example, in matrix M_R, basic block 0 reads 1 variable, X1, basic block 1 reads 3 variables, X0, X5 and X8, and the other basic blocks read no variables; in matrix M_RT, basic block 0 reads variable X1 and does not read variables X0, X2, ..., X(P-1), while basic block 1 reads variable X0 and does not read variables X1, X2, ..., X(P-1).
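As an illustrative sketch (the per-block variable lists below are invented for illustration, and the exact column layout of attachment2.csv is an assumption), the indicator matrices M_RT and M_WT could be built as follows:

# Sketch: build the M_RT / M_WT indicator matrices from per-basic-block variable lists.
reads  = {0: ['X1'], 1: ['X0', 'X5', 'X8']}   # basic block -> variables read (illustrative)
writes = {0: ['X0'], 1: ['X2']}               # basic block -> variables written (illustrative)

N = 2                                          # total number of basic blocks
variables = sorted({v for d in (reads, writes) for vs in d.values() for v in vs})
P = len(variables)
col = {v: k for k, v in enumerate(variables)}  # variable name -> column index

M_RT = [[0] * P for _ in range(N)]             # 1 if basic block i reads variable k
M_WT = [[0] * P for _ in range(N)]             # 1 if basic block i writes variable k
for i in range(N):
    for v in reads.get(i, []):
        M_RT[i][col[v]] = 1
    for v in writes.get(i, []):
        M_WT[i][col[v]] = 1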
To determine whether a basic block has data dependences with other basic blocks, all the downstream basic blocks of that basic block can be found through a depth-first traversal of the directed graph: only its downstream basic blocks can form data dependences with it, while basic blocks that are not downstream do not form data dependences with it.
All downstream basic blocks of a basic block are found using depth-first search (DFS):
The DFS of a graph starts from an initial node, which may have several adjacent nodes. The depth-first traversal strategy is to visit the first adjacent node first, and then take the visited adjacent node as the new starting node and visit its first adjacent node in turn. In other words, each time the current node is visited, its first adjacent node is visited first; the strategy drills down along a branch instead of visiting all neighbours of a node laterally. Depth-first search is a recursive process, as shown in fig. 8. The specific steps of depth-first search are as follows:
Step1: accessing the initial node V, and marking the node V as accessed;
step2: searching for the 1st unviewed adjacent node W of the node V;
step3: if W is present, execution continues to Step4, and if W is not present, return to Step1 and continue from the next node of V.
Step4: if W is not accessed, a depth-first traversal recursion is performed on W (i.e., W is treated as another V, and then steps Step1-Step3 are performed).
Step5: find the next neighbor node of the W neighbor node of node V, go to Step3.
The depth-first traversal procedure for the directed graph shown in fig. 8 is:
Step1: visit node A, then visit A's 1st adjacent node B along path 1;
Step2: visit B's 1st adjacent node C along path 2;
Step3: node C has no unvisited adjacent node, so return to node B;
Step4: visit the 2nd adjacent node D of node B along path 3;
Step5: visit the 1st adjacent node E of node D along path 4;
Step6: node E has no unvisited adjacent node, so return to node D;
Step7: visit the 2nd adjacent node F of node D along path 5.
Thus, the depth-first traversal order of the directed graph shown in fig. 8 is: A, B, C, D, E, F.
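A minimal recursive sketch of this depth-first collection of downstream basic blocks is given below; the adjacency list reproduces the small graph of fig. 8 as far as it can be inferred from the traversal steps above:

# Sketch: depth-first search that collects every basic block downstream of a start node.
adj = {'A': ['B'], 'B': ['C', 'D'], 'C': [], 'D': ['E', 'F'], 'E': [], 'F': []}

def downstream(node, visited=None):
    """Return the set of all nodes reachable from `node` (excluding the node itself)."""
    if visited is None:
        visited = set()
    for w in adj.get(node, []):          # visit adjacent nodes in order
        if w not in visited:
            visited.add(w)
            downstream(w, visited)       # drill down before moving to the next neighbour
    return visited

print(sorted(downstream('A')))           # ['B', 'C', 'D', 'E', 'F']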
An N×N matrix M_D is established, where element M_D(i, j) indicates whether a data dependence exists between basic block i and basic block j: a value of M_D(i, j) of 1 means that basic blocks i and j have a read-after-write or write-after-write dependence, i.e. the stage number of basic block i must be smaller than that of basic block j; a value of M_D(i, j) of 0 means that basic blocks i and j have a write-after-read dependence, i.e. the stage number of basic block i must be less than or equal to that of basic block j; a value of M_D(i, j) of -1 means that basic blocks i and j have no data dependence, where 0 ≤ i, j < N.
For the current basic block, after its downstream basic blocks have been found by the above method, all variables read or written by the downstream basic blocks are looked up in matrices M_R and M_W, and matrices M_RT and M_WT are used to determine whether the current basic block reads or writes the variables of the downstream basic blocks, as shown in formula (1):
Fig. 9 is a heat map of the data dependency matrix obtained from this algorithm and the variable information read and written by each basic block, and fig. 10 is the corresponding directed graph.
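Since formula (1) itself is not reproduced in this text, the following is only a sketch of the rule described above (a WR or WW dependence gives 1, an RW dependence gives 0, no dependence gives -1), using the read/write sets of each basic block and its downstream blocks; the fig. 5 values are used for illustration and the downstream sets are simplified to the blocks relevant to that example:

# Sketch of filling M_D from read/write sets and the downstream relation.
def build_MD(blocks, reads_of, writes_of, down):
    """Data dependency matrix as a dict keyed by (i, j): 1 = '<', 0 = '<=', -1 = none."""
    M_D = {(i, j): -1 for i in blocks for j in blocks}
    for i in blocks:
        for j in down.get(i, set()):
            if writes_of[i] & writes_of[j] or writes_of[i] & reads_of[j]:
                M_D[(i, j)] = 1      # write-after-write or read-after-write dependence
            elif reads_of[i] & writes_of[j]:
                M_D[(i, j)] = 0      # write-after-read dependence
    return M_D

# Fig. 5 example: A reads X0 and writes X1, X2; B writes X0; C writes X1; F reads X2.
reads_of  = {'A': {'X0'}, 'B': set(), 'C': set(), 'F': {'X2'}}
writes_of = {'A': {'X1', 'X2'}, 'B': {'X0'}, 'C': {'X1'}, 'F': set()}
down      = {'A': {'B', 'C', 'F'}, 'B': set(), 'C': set(), 'F': set()}
M_D = build_MD(['A', 'B', 'C', 'F'], reads_of, writes_of, down)
# M_D[('A', 'B')] == 0, M_D[('A', 'C')] == 1, M_D[('A', 'F')] == 1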
Secondly, solving a control dependency relation matrix:
Control dependences are generated by the program control flow: when not all paths leaving basic block A pass through a certain downstream basic block B, basic blocks A and B have a control dependence, i.e. the pipeline stage number of basic block A must be less than or equal to that of basic block B.
If every path from basic block A to the exit node passes through basic block B, then B is a successor dominant node of A, denoted B pdom A; if B pdom A and there is no basic block C such that B pdom C and C pdom A, then B is the direct successor dominant node of A, denoted B ipdom A. All downstream basic blocks of A, except the successor dominant nodes of A, form control dependences with A. It is not difficult to see that the successor dominant nodes of a basic block consist of its direct successor dominant node plus the successor dominant nodes of that direct successor dominant node; therefore, the successor dominant nodes of a basic block can be obtained by repeatedly looking for the direct successor dominant node.
From the definition of the direct successor dominant node, it suffices to find the basic block B at which all paths leaving basic block A, after passing through other basic blocks, finally converge; B is the direct successor dominant node of A. The direct successor dominant node of B is then found in the same way, and so on until no path remains. This can be achieved by labelling the distance of each downstream basic block, namely:
Step1: the distance of the basic block A is marked as 0;
step2: traversing basic blocks which are reachable by all the basic blocks A in the adjacency relation directed graph by adopting a BFS mode with the basic blocks A as starting points, and marking the distance of the basic blocks directly pointed by all paths sent out from the A as 1;
step3: for a basic block with a distance d, the distance of the basic block pointed to by a path from the basic block is denoted as d+1;
step4: for a basic block C with more than 2 non-zero distance values, modifying paths of the distance values except the maximum value, supplementing the number of virtual basic blocks, and keeping the distance value of C consistent with the maximum value in a plurality of distance values;
Step5: starting from a distance of 1, the number of basic blocks at the distance is found to be 1 basic block, and then the basic blocks and the basic blocks A do not form control dependence, and all other basic blocks downstream of A and A form control dependence.
Taking fig. 11 as an example, the downstream basic blocks that form control dependences with basic block A are found as follows.
Step1: starting from basic block A, traverse the directed graph breadth-first with A as the starting point; the traversal order of the basic blocks is B-C-D-E-F-G;
Step2: mark each basic block with its distance: basic blocks B and C are marked 1, D and E are marked 2, F is marked 3, and basic block G has 2 distance values (distance 2 when reached from B, distance 4 when reached from F);
Step3: modify the path B-G, for which the distance value of basic block G is 2, by adding 2 virtual basic blocks X with distances 2 and 3, so that the distance value of basic block G is marked 4;
Step4: starting from distance 1, look for distances containing only 1 basic block. By counting, the basic blocks at distance 1 are B and C, 2 in total; the basic blocks at distance 2 are D, E and X, 3 in total; the basic blocks at distance 3 are F and X, 2 in total; the basic block at distance 4 is G, 1 in total. Therefore basic blocks B, C, D, E and F form control dependences with basic block A, while G does not.
The control dependences of all basic blocks are calculated, and an N×N matrix M_C is established to represent the control dependency relationships between the basic blocks. Element M_C(i, j) indicates whether a control dependence exists between basic block i and basic block j: a value of M_C(i, j) of 0 means that basic blocks i and j have a control dependence, i.e. the stage number of basic block i must be less than or equal to that of basic block j; a value of M_C(i, j) of -1 means that basic blocks i and j have no control dependence, where 0 ≤ i, j < N.
Next, the data dependences are merged with the control dependences:
The data dependency relationship and the control dependency relationship each determine an ordering of the program's basic blocks on the PISA pipeline; as shown in fig. 9 and fig. 12, the two relationships overlap to a large extent. In order to arrange resources comprehensively according to both the data dependences and the control dependences, the data dependency matrix and the control dependency matrix are merged.
An N×N matrix M_CD is established to represent the data-control dependency relationships between the basic blocks. Element M_CD(i, j) indicates whether an ordering constraint on stage numbers exists between basic block i and basic block j: a value of M_CD(i, j) of 1 means that the stage number of basic block i must be smaller than that of basic block j; a value of M_CD(i, j) of 0 means that the stage number of basic block i must be less than or equal to that of basic block j; a value of M_CD(i, j) of -1 means that basic blocks i and j have neither a data dependence nor a control dependence, where 0 ≤ i, j < N.
The matrix M_CD is obtained by combining matrices M_D and M_C: the dependency relation that simultaneously satisfies both M_D and M_C is chosen as the value of each element of M_CD, as shown in formula (2):
M_CD = {M_CD(i, j)}, where M_CD(i, j) = max{M_D(i, j), M_C(i, j)}, 0 ≤ i, j < N (2)
Fig. 14 is the heat map of the matrix obtained by merging the data dependency matrix and the control dependency matrix, and fig. 15 is the corresponding directed graph.
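The element-wise merge of formula (2) can be sketched in one line with NumPy, assuming M_D and M_C are stored as integer arrays with entries in {-1, 0, 1}:

import numpy as np

# Illustrative 3x3 dependency matrices with entries in {-1, 0, 1}.
M_D = np.array([[-1, 1, 0], [-1, -1, -1], [-1, 0, -1]])
M_C = np.array([[-1, 0, 0], [-1, -1, 0], [-1, -1, -1]])

# Formula (2): take the stricter of the two relations element-wise.
M_CD = np.maximum(M_D, M_C)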
The values in M_CD have three cases, -1, 0 and 1: a value of M_CD(i, j) of 1 means that the stage number of basic block i must be smaller than that of basic block j; a value of M_CD(i, j) of 0 means that the stage number of basic block i must be less than or equal to that of basic block j; a value of M_CD(i, j) of -1 means that basic blocks i and j have neither a data dependence nor a control dependence, where 0 ≤ i, j < N.
The values 0 and 1 are closely related to the arrangement between pipeline stages and correspond to different data-control dependences. For a detailed description of the arrangement of basic blocks, two definitions are given below:
Definition 1 (prioritized relation): for 0 ≤ i, j < N with M_CD(i, j) = 0, there is a prioritized relation between basic block i and basic block j; basic block i may be arranged either in a pipeline stage before basic block j or in the same pipeline stage as basic block j.
Definition 2 (absolute priority relation): for 0 ≤ i, j < N with M_CD(i, j) = 1, there is an absolute priority relation between basic block i and basic block j; basic block i must be arranged in a pipeline stage before basic block j.
Obviously, the prioritized relation and the absolute priority relation are transitive, so a series of basic blocks satisfying the prioritized relation may be arranged in the same pipeline stage, whereas a series of basic blocks with the absolute priority relation must be arranged in different pipeline stages. From definitions 1 and 2, the prioritized relation directed graph and the absolute priority relation directed graph corresponding to the data-control dependences can be drawn, as shown in fig. 16 and fig. 17 respectively. The prioritized graph has 607 nodes and a longest path of length 78; the absolute priority graph has 318 nodes and a longest path of length 34. Since a series of basic blocks with the absolute priority relation must be arranged in different pipeline stages, it follows from the data-control dependences alone that the basic blocks given in the adjacency information occupy no fewer than 34 pipeline stages.
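Assuming M_CD is available as a matrix, the edge lists of the two directed graphs of definitions 1 and 2 can be extracted as follows (illustrative sketch):

# Sketch: derive the prioritized and absolute priority directed graphs from M_CD.
def split_priority_graphs(M_CD):
    n = len(M_CD)
    prioritized = [(i, j) for i in range(n) for j in range(n) if M_CD[i][j] == 0]
    absolute    = [(i, j) for i in range(n) for j in range(n) if M_CD[i][j] == 1]
    return prioritized, absolute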
Next, resource constraint determination:
To verify the scheme in real time during the arrangement process, a determination method applicable to different arrangement schemes, including incomplete ones, must be formulated, while at the same time keeping the program general under different resource limit conditions. Considering that problems 1 and 2 differ only in how resources are computed, a function interface can be reserved for both problems to use. The invention constructs the Python code segment shown as algorithm 1 to determine whether a known arrangement scheme satisfies the constraints; the get_blks_source() function is used to compute the total resource occupancy, and its computation differs between problem 1 and problem 2. Considering that every time a basic block is arranged it must be determined whether the resource constraints are satisfied, the total amount of resources occupied by all basic blocks in the pipeline stages that have already passed verification should be retained. The invention defines the multi-level dictionary plane_info_subject to store an arrangement scheme and its corresponding resource occupation; the plane_info_subject dictionary format is shown below.
The 'plan_part' field holds the specific content of the scheme, and the 'plan_source' field holds the amount of resources occupied by the scheme.
Algorithm 1: Python code segment of the resource constraint determination method
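The patent's own listing is not reproduced in this text. The following is only a minimal sketch of such a determination routine: the stage limits are illustrative values, and the layout of the plane_info_subject dictionary and the get_blks_source() interface are assumptions based on the description above.

# Sketch of a resource constraint check in the spirit of algorithm 1.
STAGE_LIMITS = {'TCAM': 1, 'HASH': 2, 'ALU': 56, 'QUALIFY': 64}   # illustrative limits

def get_blks_source(blocks, block_resources):
    """Total resource occupancy of the given blocks (problem-1 style: plain sum).
    For problem 2 this function would instead take the maximum over execution flows."""
    total = {r: 0 for r in STAGE_LIMITS}
    for b in blocks:
        for r, amount in block_resources[b].items():
            total[r] += amount
    return total

def plan_satisfies_constraints(plane_info_subject, block_resources):
    """Check every (possibly partial) pipeline stage of the plan against STAGE_LIMITS."""
    for stage, blocks in plane_info_subject['plan_part'].items():
        occupied = get_blks_source(blocks, block_resources)
        plane_info_subject['plan_source'][stage] = occupied   # cache for later stages
        if any(occupied[r] > STAGE_LIMITS[r] for r in STAGE_LIMITS):
            return False
    return True

# Illustrative call: two blocks in stage 0, per-block resource usage invented.
plane_info_subject = {'plan_part': {0: ['B0', 'B1']}, 'plan_source': {}}
block_resources = {'B0': {'TCAM': 1, 'HASH': 1, 'ALU': 4, 'QUALIFY': 2},
                   'B1': {'HASH': 1, 'ALU': 3}}
print(plan_satisfies_constraints(plane_info_subject, block_resources))   # True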
Then, the greedy-method-based resource arrangement algorithm:
Based on the data-control dependency matrix solved above and the resource constraint determination designed above, any resource arrangement scheme must simultaneously satisfy the data-control dependences and pass the verification of the resource constraint determination program, i.e. both conditions are indispensable. Therefore, the resource arrangement algorithm can be designed with either of the following two technical schemes.
Scheme 1: based on the data-control dependences, a resource arrangement is produced according to the data-control dependences, and the resource constraints are then used to determine the validity of the scheme, so that the arrangement satisfies both constraints.
Scheme 2: based on the resource constraints, a resource arrangement is produced according to the resource constraints, and the data-control dependences are then used to determine the validity of the scheme, so that the arrangement satisfies both constraints.
In scheme 2, when a resource arrangement scheme is produced it must still be determined both whether the scheme satisfies the resource constraints and whether it satisfies the data-control dependences, i.e. there are 2 verification processes; scheme 1, on the other hand, can produce the resource arrangement according to the basic block priority given by the data-control dependences and then only needs to determine whether the resource constraints are satisfied, so the whole process has only 1 verification process. The invention therefore selects scheme 1 for the algorithm design of the resource arrangement.
The final objective of PISA architecture chip resource arrangement optimization is to make the various resources of the chip more highly utilized. To improve the utilization of chip resources, the following two targets must be achieved:
Target 1: the resource arrangement scheme should minimize the number of pipeline stages.
Target 2: the resource arrangement scheme should raise the resource utilization of each pipeline stage as much as possible.
Since basic block sequences with the absolute priority relation exist in the data-control dependency matrix and these basic blocks must be placed in different pipeline stages, two resource arrangement principles can be obtained based on the greedy idea.
Principle 1: for target 1, for a basic block sequence with the absolute priority relation, its starting basic block should be arranged in a pipeline stage with a small stage number.
Principle 2: for each pipeline stage, as many basic blocks as possible should be arranged under the resource constraints.
In the data-control dependency directed graph, the root nodes are the starting nodes of most of the basic block sequences satisfying the absolute priority relation and of the sequences satisfying the prioritized relation. Hence, for principle 1, a root node can first be placed in a pipeline stage and then deleted from the data-control dependency directed graph, after which the relation between the root nodes of the remaining graph and the deleted root node is examined: if a root node of the remaining graph has the absolute priority relation with the deleted root node, it cannot be placed in the same pipeline stage as the deleted root node; if it has the prioritized relation with the deleted root node, it can be placed in the same pipeline stage, and the nodes that have been placed in the stage are deleted from the remaining graph. According to principle 2, the number of basic blocks placed in the same pipeline stage as the root node should be as large as possible, so as to occupy the resources of that stage fully. Fig. 18 is an example of a resource arrangement scheme combining principle 1 and principle 2.
Arranging resources according to these two principles is essentially a greedy strategy: horizontally, when arranging the resources of each pipeline stage, the resource occupancy of the stage is maximized as far as possible; vertically, because basic block sequences with the absolute priority relation exist in the data-control dependency directed graph, those program blocks are arranged as early as possible. Arranging in this way naturally maximizes the resource utilization of the current pipeline stage, but the scheme of this stage may make it impossible for later stages to satisfy all constraint conditions, so in the algorithm design the program must be able to roll back to the previous stage and rearrange it when the current stage cannot find a feasible scheme. In addition, since the number of pipeline stages that will actually be used cannot be estimated in advance, a loop structure cannot be used for the resource arrangement; preferably, a recursive program design is adopted: after one pipeline stage has been arranged, the next pipeline stage is arranged, until no node remains in the data-control dependency directed graph. Accordingly, the invention gives the greedy-method-based resource arrangement Python code segment shown as algorithm 2. In algorithm 2, the resource constraints must be verified with the Python code segment of the algorithm 1 resource constraint determination method, and algorithm 3 finds all root nodes of a given directed graph.
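Algorithm 3 itself is not reproduced in this text; finding all root nodes (nodes with no incoming edge) of a directed graph can be sketched as follows:

# Sketch of algorithm 3: all root nodes (no incoming edge) of a directed graph.
def find_root_nodes(adj):
    nodes = set(adj) | {v for vs in adj.values() for v in vs}
    has_parent = {v for vs in adj.values() for v in vs}
    return nodes - has_parent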
Algorithm 2: Python code segment for recursively arranging the basic blocks on each pipeline stage
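The patent's own listing is not reproduced in this text. The following is only a simplified sketch of the recursive greedy arrangement described above: it considers only the absolute priority edges and the per-stage resource limits, packs each stage as full as the limits allow, and recurses to the next stage until the graph is empty; the roll-back to earlier stages and the 'longest absolute priority sequence first' ordering are omitted. All helper and parameter names are assumptions.

# Sketch of the recursive greedy arrangement in the spirit of algorithm 2.
def arrange_stage(graph, plan, stage, block_resources, limits):
    """Recursively fill pipeline stage `stage`.
    `graph` maps every remaining block to the blocks it must strictly precede
    (every remaining block appears as a key, possibly with an empty list)."""
    if not graph:
        return plan                                   # nothing left to arrange
    successors = {v for vs in graph.values() for v in vs}
    roots = [u for u in graph if u not in successors]  # blocks with no pending predecessor
    placed, used = [], {r: 0 for r in limits}
    for root in sorted(roots):
        need = block_resources[root]
        if all(used[r] + need.get(r, 0) <= limits[r] for r in limits):
            placed.append(root)                       # principle 2: fill the stage as far as possible
            for r in limits:
                used[r] += need.get(r, 0)
    if not placed:
        return None                                   # no feasible block: caller would roll back
    plan[stage] = placed
    remaining = {u: [v for v in vs if v not in placed]
                 for u, vs in graph.items() if u not in placed}
    return arrange_stage(remaining, plan, stage + 1, block_resources, limits)

# Illustrative call: three blocks, B0 must precede B1 and B2, one ALU unit each.
plan = arrange_stage({'B0': ['B1', 'B2'], 'B1': [], 'B2': []}, {}, 0,
                     {'B0': {'ALU': 1}, 'B1': {'ALU': 1}, 'B2': {'ALU': 1}},
                     {'ALU': 2})
print(plan)   # {0: ['B0'], 1: ['B1', 'B2']}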
Finally, the results of problem 1 are obtained:
According to the above Python code segments, the resource arrangement scheme is solved under the resource limits of problem 1. The optimal result is shown in table 1 and the corresponding resource occupation is shown in table 2; as can be seen from table 1, the arrangement scheme obtained with the above method occupies 71 pipeline stages.
TABLE 1 results of problem 1
TABLE 2 total number of resources occupied by the results of problem 1
2) Problem 2 solving
Because program basic blocks that are not in one execution flow can share HASH and ALU resources, problem 2 appropriately relaxes the constraints of problem 1, and the arrangement algorithm and arrangement result are given again. Since only the resource constraints are modified, on the basis of problem 1 it is only necessary to verify whether the arrangement scheme satisfies the new resource constraint conditions. To determine the new resource constraints, the maximum value of the sum of HASH resources and the maximum value of the sum of ALU resources of the basic blocks on the same execution flow in each pipeline stage are computed, and for all program basic blocks arranged in a given pipeline stage, all combinations of blocks belonging to one execution flow must be determined. In view of this, the invention first proposes a search algorithm for basic blocks on the same execution flow, finds all basic blocks on the same flow in each pipeline stage, adds the HASH and ALU resources occupied by the basic blocks on one execution flow to obtain the total of that flow in the stage, and then takes the maximum of the HASH and ALU resources occupied by the different flows in the stage.
First, the search algorithm for basic blocks on the same execution flow:
Basic blocks in the same execution flow cannot share resources, so all combinations of basic blocks on one flow must be found. Since several execution flows may coexist in a pipeline stage, if k basic blocks exist in the stage, a very large number of combinations (on the order of k!) would have to be verified, so verifying these cases directly is computationally very expensive. The invention therefore proposes the algorithm 4 search algorithm for basic blocks on the same execution flow, which uses a recursive idea to find all combinations of basic blocks on one execution flow. In algorithm 4, the function in_a_line(x, y) verifies whether there is a path between node x and node y using Dijkstra's algorithm: it returns True if such a path exists and False otherwise. After all combinations of code blocks on the same execution flow have been found with algorithm 4, the maximum value over the cases is computed as the resource occupation of the current pipeline stage, and it is then verified whether the resource constraints meet the requirements of problem 2.
Algorithm 4: Python code segment of the search algorithm for basic blocks on the same execution flow
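The patent's own listing is not reproduced in this text. The following is only a simplified sketch of the idea: the blocks of one pipeline stage are grown recursively into groups that are pairwise on one execution flow, using a pairwise path test in_a_line(x, y); a plain reachability search stands in for the Dijkstra-based test mentioned above.

# Sketch in the spirit of algorithm 4: same-execution-flow combinations of one stage's blocks.
def in_a_line(adj, x, y):
    """True if a directed path exists between x and y (in either direction)."""
    def reachable(src, dst):
        stack, seen = [src], {src}
        while stack:
            u = stack.pop()
            if u == dst:
                return True
            for v in adj.get(u, []):
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return False
    return reachable(x, y) or reachable(y, x)

def same_flow_groups(adj, stage_blocks):
    """Recursively grow groups of blocks that are pairwise on one execution flow."""
    groups = []
    def extend(group, candidates):
        grew = False
        for k, b in enumerate(candidates):
            if all(in_a_line(adj, b, g) for g in group):
                extend(group + [b], candidates[k + 1:])
                grew = True
        if not grew:
            groups.append(group)        # maximal group for this branch
        # (duplicate or subset groups are tolerated in this sketch)
    extend([], list(stage_blocks))
    return groups

The problem-2 occupancy of the stage is then taken as the maximum, over these groups, of the summed HASH (or ALU) resources of the blocks in each group, as described above.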
Second, problem 2 results:
According to the above Python code segments, the resource arrangement scheme is solved under the resource limits of problem 2. The optimal result is shown in table 3 and the corresponding resource occupation is shown in table 4; as can be seen from table 3, the arrangement scheme obtained with the above method occupies 57 pipeline stages. Considering that the conditions of problem 2 are a relaxation of those of problem 1, this arrangement occupies fewer total pipeline stages than problem 1, which also reflects, from the side, the effectiveness of the proposed scheme.
TABLE 3 Results of problem 2
TABLE 4 Resources occupied by the arrangement scheme of problem 2
The above embodiments are merely preferred embodiments that fully explain the present invention, and the protection scope of the present invention is not limited thereto. Equivalent substitutions or changes made by those skilled in the art on the basis of the present invention are all within the protection scope of the present invention. The protection scope of the invention is defined by the claims.

Claims (10)

1. A greedy strategy-based PISA architecture chip resource arrangement method is characterized by comprising the following steps:
S1: in the PISA architecture programming model, a P4 program is obtained from the P4 language input by the user, and the P4 program is divided into a series of basic blocks;
S2: constructing a data dependency relation matrix and a control dependency relation matrix:
S21: constructing a data dependency relation matrix according to variable information read and written by each basic block;
s22: constructing a control dependency relation matrix according to adjacent basic block information of each basic block in the flow chart;
S3: on the basis of S2, merging the data dependency matrix and the control dependency matrix to obtain a data-control dependency matrix, defining the prioritized relation and the absolute priority relation between basic blocks, and constructing an arrangement priority relation directed graph G(0);
S4: starting from the root nodes of G(0), finding all absolute priority sequences of basic blocks, recursively arranging each pipeline stage in combination with the resource constraints according to the two greedy strategies 'longer absolute priority sequences are arranged first' and 'maximize resource utilization within the same pipeline stage', and finally obtaining an optimal resource arrangement scheme that satisfies the data dependency, control dependency and resource constraint conditions.
2. The PISA architecture chip resource arrangement method based on greedy strategy as claimed in claim 1, wherein the specific steps of step S21 include:
S211: connecting the upstream basic block and the downstream basic block by adopting a directed line segment, and establishing a directed acyclic graph which takes the basic blocks as nodes and takes the adjacent relation between the basic blocks as an edge;
S212: traversing the directed acyclic graph by a depth-first search method, and searching all downstream basic blocks corresponding to each basic block;
S213: establishing a data dependency matrix M D based on the searched downstream basic blocks:
Wherein i and j represent basic block i and basic block j, respectively; and a value of M D (i, j) of 1 indicates that there is a read-after-write dependency WR or a write-after-write dependency WW for basic block i and basic block j, and a value of M D (i, j) of 0 indicates that there is a write-after-read dependency RW for basic block i and basic block j; a value of-1 indicates that there is no data dependency for basic block i and basic block j.
3. The PISA architecture chip resource arrangement method based on greedy strategy as claimed in claim 2, wherein the specific steps of step S22 include:
S221: finding the downstream basic blocks that form control dependences with each basic block by labelling the distances of the downstream basic blocks;
S222: based on the downstream basic blocks found, establishing a control dependency matrix M_C, wherein a value of M_C(i, j) of 0 indicates that a control dependence exists between basic block i and basic block j, and a value of M_C(i, j) of -1 indicates that basic blocks i and j have no control dependence.
4. The PISA architecture chip resource allocation method according to claim 3, wherein the specific steps of step S221 include:
s2211: the distance of the basic block A is marked as 0;
S2212: taking basic block A as the starting point, traversing all basic blocks reachable from A in the directed acyclic graph by the BFS method, and marking the distance of the basic blocks directly pointed to by the paths leaving A as 1;
s2213: for a basic block with a distance d, the distance of the basic block pointed to by a path from the basic block is denoted as d+1;
S2214: for a basic block C with 2 or more non-zero distance values, modifying the paths of the distance values other than the maximum by adding virtual basic blocks, so that the distance value of C is consistent with the maximum of the several distance values;
S2215: starting from distance 1, looking for distances at which there is only 1 basic block; such basic blocks do not form control dependences with basic block A, and all other basic blocks downstream of A form control dependences with A.
5. The PISA architecture chip resource arrangement method based on greedy strategy as claimed in claim 1, wherein the specific steps of step S3 include:
S31: based on the data dependency matrix M_D and the control dependency matrix M_C, selecting the dependency that simultaneously satisfies both M_D and M_C as the value of each element of the data-control dependency matrix M_CD, so as to establish M_CD:
M_CD = {M_CD(i, j)}, where M_CD(i, j) = max{M_D(i, j), M_C(i, j)}, 0 ≤ i, j < N (2);
S32: when M_CD(i, j) = 0, defining that basic block i and basic block j have the prioritized relation; when M_CD(i, j) = 1, defining that basic block i and basic block j have the absolute priority relation;
S33: establishing the data-control dependency directed graph G(0) according to the data-control dependences between the basic blocks.
6. The PISA architecture chip resource arrangement method based on greedy strategy as claimed in claim 1, wherein the specific steps of step S4 include:
S41: setting the pipeline level currently being arranged to y = 0;
S42: denoting the current data-control dependency relation directed graph as G(y), and storing a copy of the priority relation directed graph G(y) used when arranging the current level;
S43: according to the greedy strategy of 'longer absolute priority sequences are arranged first', obtaining the root node k of the longest absolute priority sequence in the current directed graph G(y), and judging the priority relations between k and all nodes in pipeline level y: executing S44 if an absolute priority relation exists between k and the nodes in pipeline level y, and executing S45 if only a priority relation, and no absolute priority relation, exists between k and the nodes in pipeline level y, or if there is no node in pipeline level y;
S44: starting from pipeline level y+1, finding the minimum pipeline level that still meets the resource constraints after k is placed, placing k into that pipeline level, and executing S47;
S45: starting from pipeline level y, finding the minimum pipeline level that still meets the resource constraints after k is placed, placing k into that pipeline level, and executing S47;
S46: deleting k from G(y) to obtain G'(y), letting G(y) = G'(y), and repeating steps S42-S45 until no further basic block can be arranged on pipeline level y, then letting y = y+1 and G(y+1) = G(y);
S47: repeatedly executing steps S42-S47, recursively arranging the resources of each pipeline level until all basic blocks are arranged, and finally obtaining the optimal arrangement scheme that meets all constraint conditions.
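A skeleton of the recursive arrangement loop of steps S41-S47 is given below. The longest-sequence selection, absolute-priority test, resource check and "level full" test are abstracted into caller-supplied callables; all helper names are illustrative placeholders, and the bookkeeping of the G(y) copies in S42 is omitted, so this is only a loose sketch of the claimed flow.

```python
def arrange(blocks, longest_root, is_absolute, fits, level_full):
    """Skeleton of steps S41-S47 with assumed helper callables:
      longest_root(remaining)          -> root of the longest absolute-priority sequence (S43)
      is_absolute(a, b)                -> True if a and b are in an absolute priority relation
      fits(k, lvl, levels)             -> resource-constraint check of claims 7/8
      level_full(remaining, y, levels) -> True when level y can take no further block (S46)
    """
    levels = {}                                  # pipeline level -> placed basic blocks
    y = 0                                        # S41
    remaining = set(blocks)
    while remaining:                             # S47: repeat until every block is placed
        k = longest_root(remaining)              # S43: "longer absolute priority sequence first"
        blocked = any(is_absolute(k, b) for b in levels.get(y, []))
        lvl = y + 1 if blocked else y            # S44 vs. S45 starting level
        while not fits(k, lvl, levels):          # smallest level still meeting the resources
            lvl += 1
        levels.setdefault(lvl, []).append(k)
        remaining.discard(k)                     # S46: delete k from G(y)
        if remaining and level_full(remaining, y, levels):
            y += 1                               # S46: advance to the next pipeline level
    return levels
```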
7. The PISA architecture chip resource arrangement method according to claim 6, wherein the specific step of determining whether the basic block satisfies the resource constraint in step S44 comprises:
S441: first placing k tentatively in pipeline level y+1, and finding all basic blocks in pipeline level y+1 that belong to the same execution flow through a same-execution-flow basic block searching algorithm;
S442: summing, for each resource class, the resources of the basic blocks belonging to the same execution flow in pipeline level y+1, and, according to the greedy strategy of 'maximum resource utilization within the same pipeline level', taking the maximum of these per-flow sums as the occupation amount of that resource class in pipeline level y+1;
S443: verifying whether the occupation amount of each resource class meets the constraint condition of that resource class; if yes, k can be placed in pipeline level y+1; if not, letting y = y+1;
S444: repeating steps S441-S443 until the minimum pipeline level that still satisfies the resource constraints after k is placed is found.
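The resource check of steps S441-S443 can be sketched as below. The data layout is an assumption made for illustration: resources[b][c] is the amount of resource class c used by basic block b, limits[c] is the per-level budget for class c, and flows_of is the same-execution-flow grouping of claims 9/10.

```python
def fits(k, level_blocks, flows_of, resources, limits):
    """Resource check of steps S441-S443 under the assumed data layout above."""
    candidate = list(level_blocks) + [k]              # S441: tentatively place k
    flows = flows_of(candidate)
    for cls, limit in limits.items():                 # S443: check every resource class
        # S442: sum per execution flow, then the maximum over flows is the level occupancy
        occupancy = max(sum(resources[b][cls] for b in flow) for flow in flows)
        if occupancy > limit:
            return False                              # constraint violated at this level
    return True                                       # k can be placed at this level
```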
8. The PISA architecture chip resource arrangement method according to claim 7, wherein the specific step of determining whether the basic block satisfies the resource constraint in step S45 comprises:
S451: first placing k tentatively in pipeline level y, and finding all basic blocks in pipeline level y that belong to the same execution flow through a same-execution-flow basic block searching algorithm;
S452: summing, for each resource class, the resources of the basic blocks belonging to the same execution flow in pipeline level y, and, according to the greedy strategy of 'maximum resource utilization within the same pipeline level', taking the maximum of these per-flow sums as the occupation amount of that resource class in pipeline level y;
S453: verifying whether the occupation amount of each resource class meets the constraint condition of that resource class; if yes, k can be placed in pipeline level y; if not, letting y = y+1;
S454: repeating steps S451-S453 until the minimum pipeline level that still satisfies the resource constraints after k is placed is found.
9. The PISA architecture chip resource arrangement method according to claim 8, wherein the specific operation steps of step S441 include:
S4411: acquiring the list L of program basic blocks in pipeline level y+1;
S4412: judging, by using the Dijkstra algorithm, whether a path exists between any two basic blocks in L; if so, the two basic blocks form one execution flow; otherwise, each of the two basic blocks forms an execution flow on its own;
S4413: merging execution flows that contain the same basic block into one execution flow.
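Steps S4411-S4413 effectively compute connected groups under path reachability; a union-find over a supplied reachable(a, b) predicate gives the same grouping, as sketched below. The claim names the Dijkstra algorithm, but any path-existence test between two blocks yields the same execution flows.

```python
def group_execution_flows(level_blocks, reachable):
    """Steps S4411-S4413 as a union-find over path reachability.
    `reachable(a, b)` is an assumed path-existence predicate over the DAG."""
    parent = {b: b for b in level_blocks}

    def find(x):                                  # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(level_blocks):          # S4412: pairwise path test
        for b in level_blocks[i + 1:]:
            if reachable(a, b) or reachable(b, a):
                parent[find(a)] = find(b)         # S4413: merge into one execution flow

    flows = {}
    for b in level_blocks:
        flows.setdefault(find(b), []).append(b)
    return list(flows.values())
```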
10. The PISA architecture chip resource arrangement method according to claim 9, wherein the specific operation steps of step S451 include:
S4511: acquiring the list L of program basic blocks in pipeline level y;
S4512: judging, by using the Dijkstra algorithm, whether a path exists between any two basic blocks in L; if so, the two basic blocks form one execution flow; otherwise, each of the two basic blocks forms an execution flow on its own;
S4513: merging execution flows that contain the same basic block into one execution flow.