CN104484160A

CN104484160A - Instruction scheduling and register allocation method on optimized clustered VLIW (Very Long Instruction Word) processor

Info

Publication number: CN104484160A
Application number: CN201410799189.5A
Authority: CN
Inventors: 张雪萌; 吴辉; 孙海燕; 王霁; 阳柳; 郭阳; 扈啸
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2014-12-19
Filing date: 2014-12-19
Publication date: 2015-04-01
Anticipated expiration: 2034-12-19
Also published as: CN104484160B

Abstract

The invention discloses an optimized instruction scheduling and register allocation method for clustered VLIW processors. The method has two stages: in the first stage, a unified algorithm performs a first pass of instruction scheduling and register allocation on all basic blocks; in the second stage, basic blocks with register spills are rescheduled and their registers reallocated, guided by the length of the longest path each basic block lies on and by its highest execution frequency. The method is widely applicable, optimizes performance well, and effectively reduces the worst-case execution time of programs in real-time systems.

Description

An Optimized Instruction Scheduling and Register Allocation Method on Clustered VLIW Processor

Technical field

The invention relates to compiler optimization for processors, and in particular to an optimized instruction scheduling and register allocation method for clustered VLIW processors.

Background

The worst-case execution time of a program is one of the key criteria in designing embedded real-time systems: every timing constraint must be met to guarantee the correctness of a real-time system. The worst-case execution time strongly affects whether a feasible schedule can be assigned to a program. Because a program may take different branches at run time, its running time varies; the worst-case execution time is the longest of all its possible execution times on the target platform. If it exceeds the time limit of the real-time system, no feasible schedule can be assigned to the program. If it can be reduced, a feasible schedule becomes more likely. Minimizing the worst-case execution time of a program is therefore an important problem.

For embedded systems built on a clustered VLIW architecture, instruction scheduling and register allocation are important parts of an optimizing compiler and greatly affect the worst-case execution time of a program. Traditional approaches perform register allocation and instruction scheduling separately, but running each phase on its own causes the phase-ordering problem and leaves the compiled code suboptimal. Clustering is an effective technique for improving the scalability and energy consumption of VLIW processors, but it makes instruction scheduling and register allocation harder. First, when a variable is passed to a different cluster, new live ranges are created dynamically, and several registers on different clusters are needed to hold copies of the same variable. Second, the exact live range of a variable depends on when the instructions containing its first definition and its last use are scheduled, so it cannot be determined by traditional live-range analysis of static code. Third, a poor assignment of instructions to clusters causes unnecessary inter-cluster communication, which lengthens the schedule of a basic block.
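The second difficulty can be illustrated with a small Python sketch. It is not part of the patent text: the function name live_range, the instruction names and the two schedules are assumptions made only for illustration. The sketch shows that the same variable has a different live range under two schedules of the same instructions, which is why static live-range analysis is not sufficient.

```python
def live_range(schedule, def_instr, use_instrs):
    """Return (start, end) cycles of a variable's live range under a schedule.

    schedule maps an instruction name to the cycle in which it is issued.
    """
    start = schedule[def_instr]
    end = max(schedule[u] for u in use_instrs)
    return start, end


# Two hypothetical schedules of the same three instructions.
early = {"def_x": 0, "use_x1": 1, "use_x2": 2}
late = {"def_x": 0, "use_x1": 1, "use_x2": 7}

range_early = live_range(early, "def_x", ["use_x1", "use_x2"])
range_late = live_range(late, "def_x", ["use_x1", "use_x2"])
```

Under the second schedule the register holding x stays occupied for five more cycles, so the register pressure of the block depends on the scheduling decision.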

Summary of the invention

The technical problem addressed by the invention is as follows: in view of the problems in the prior art, the invention provides an optimized instruction scheduling and register allocation method for clustered VLIW processors that is widely applicable, optimizes performance well, and effectively reduces the worst-case execution time of programs in real-time systems.

To solve the above technical problem, the invention adopts the following technical solution:

An optimized instruction scheduling and register allocation method for clustered VLIW processors comprises two stages: in the first stage, a unified algorithm performs a first pass of instruction scheduling and register allocation on all basic blocks; in the second stage, basic blocks with register spills are rescheduled and their registers reallocated, according to the length of the longest path each basic block lies on and its highest execution frequency.

As a further improvement of the invention, the steps of the first stage are:

(1) Construct the weighted control flow graph G of program P. P is represented by the weighted control flow graph G = (V, E, W), where V = {B_1, B_2, ..., B_n} is the set of basic blocks of the program, E = {(B_i, B_j) : B_j is control dependent on B_i}, and W = {w_i : w_i is the execution time of basic block B_i};

(2) Using the unified algorithm, perform instruction scheduling and register allocation on each basic block B_i in reverse postorder.
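The two steps above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the class WeightedCFG, the function reverse_postorder and the example graph are hypothetical, and the edges of E are treated here simply as successor edges between basic blocks.

```python
from dataclasses import dataclass, field


@dataclass
class WeightedCFG:
    """G = (V, E, W): blocks, successor edges and per-block execution times."""
    succs: dict                                   # block -> list of successor blocks
    weights: dict = field(default_factory=dict)   # block -> w_i, filled in after scheduling


def reverse_postorder(cfg, entry):
    """Return the blocks reachable from entry in reverse postorder."""
    visited, postorder = set(), []

    def dfs(block):
        visited.add(block)
        for succ in cfg.succs.get(block, []):
            if succ not in visited:
                dfs(succ)
        postorder.append(block)   # a block is emitted after all its successors

    dfs(entry)
    return postorder[::-1]


# Hypothetical CFG: B1 branches to B2 and B3, both join at B4, and B4 loops back to B1.
cfg = WeightedCFG(succs={"B1": ["B2", "B3"], "B2": ["B4"], "B3": ["B4"], "B4": ["B1"]})
order = reverse_postorder(cfg, "B1")
```

Visiting blocks in this order means that, apart from back edges, a block is scheduled only after all of its predecessors. The weights stay empty until the first scheduling pass has produced an execution time for each block.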

As a further improvement of the invention, the unified algorithm in step (2) combines an incremental register allocation method with a priority-based instruction scheduling method. The instruction scheduling method schedules all basic blocks in reverse postorder and schedules the instructions within each basic block by priority. The priority of each instruction takes inter-instruction latency and processor resource constraints into account, and priorities are updated dynamically during scheduling to reduce register pressure.

As a further improvement of the invention, the steps of the second stage are:

(1) Update the weight w_i of each basic block B_i in the weighted control flow graph G;

(2) Construct the acyclic graph DAG(G). The weighted control flow graph G is converted into the acyclic graph DAG(G) = (V', E', W'), where V' = V is the set of basic blocks of P; E' = E - {(B_i, B_j) : (B_i, B_j) is a back edge} is the set of edges of DAG(G); and W' = {w'_i : w'_i is the weight of node B_i, w'_i = w_i * N(B_i)}, in which w_i is the execution time of B_i and N(B_i) is the highest execution frequency of B_i;

(3) Repeat the following steps until the longest path of DAG(G) can no longer be shortened:

(3a) Compute the longest path of DAG(G);

(3b) Among the basic blocks on the longest path that have register spills, find the basic block B_k with the highest execution frequency;

(3c) Perform instruction rescheduling and register reallocation on B_k;

(3d) Update the weight of node B_k in DAG(G).
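Steps (3a) to (3d) can be sketched in Python as follows. The sketch rests on several assumptions that are not in the patent text: the names longest_path and shorten_longest_path are hypothetical, the rescheduling of step (3c) is abstracted as a callback reschedule(block) that returns the new execution time of the block, and each spilled block is treated at most once so that the loop is guaranteed to terminate.

```python
def longest_path(edges, weights):
    """Return (length, path) of the heaviest node-weighted path in a DAG."""
    succs = {b: [] for b in weights}
    for src, dst in edges:
        succs[src].append(dst)
    best = {}   # block -> (length, path) of the heaviest path starting at block

    def visit(block):
        if block not in best:
            tail = max((visit(s) for s in succs[block]), default=(0, []))
            best[block] = (weights[block] + tail[0], [block] + tail[1])
        return best[block]

    return max(visit(b) for b in weights)


def shorten_longest_path(edges, exec_time, max_freq, spilled, reschedule):
    """Second-stage loop; returns the final (length, path) of the longest path."""
    weights = {b: exec_time[b] * max_freq[b] for b in exec_time}   # w'_i = w_i * N(B_i)
    pending = set(spilled)            # spilled blocks that have not been treated yet
    while True:
        length, path = longest_path(edges, weights)                # step (3a)
        candidates = [b for b in path if b in pending]
        if not candidates:
            return length, path       # nothing left to improve on the longest path
        block = max(candidates, key=lambda b: max_freq[b])         # step (3b)
        pending.discard(block)
        new_weight = reschedule(block) * max_freq[block]           # step (3c)
        weights[block] = min(weights[block], new_weight)           # step (3d)


# Hypothetical example: B3 and B4 contain spills; rescheduling B3 saves 3 cycles.
edges = [("B1", "B2"), ("B2", "B3"), ("B2", "B4")]
exec_time = {"B1": 4, "B2": 6, "B3": 10, "B4": 3}
max_freq = {"B1": 1, "B2": 9, "B3": 8, "B4": 1}
new_times = {"B3": 7, "B4": 3}
result = shorten_longest_path(edges, exec_time, max_freq, {"B3", "B4"}, new_times.get)
```

In this example the longest path B1, B2, B3 shrinks from 138 to 114 cycles. B4 is never rescheduled because it does not lie on the longest path.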

As a further improvement of the invention, each time the second stage selects the basic block B_i with the highest execution frequency on the longest path of DAG(G), it performs the following steps on every live range R_j spilled to memory in order to reduce spills:

I. Find the live range R_k that affects the longest path least and that is longer than R_j, and assign the register of R_k to R_j;

II. Insert spill code for R_k and reschedule all affected instructions;

III. Recompute the execution time of every basic block affected by R_k.

As a further improvement of the invention, step I finds a suitable live range R_k by introducing a new graph DAG(G, k). DAG(G, k) is a subgraph of DAG(G) in which no path is longer than k + l_min, where l_min is the length of the shortest path of DAG(G). The value of k is set to (l_max - l_min)/2, where l_max is the length of the longest path of DAG(G). After DAG(G, (l_max - l_min)/2) has been constructed, the priority rank of each live range R_s is computed as rank(R_s) = n1(R_s)/n(R_s), where n1(R_s) is the total number of references to R_s in all basic blocks of DAG(G, (l_max - l_min)/2) and n(R_s) is the total number of uses of R_s in all basic blocks. The selected R_k is the live range with the highest rank on the longest path.
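The rank formula can be illustrated with a short Python sketch. The data layout is an assumption: refs maps each live range to a per-block reference count, subgraph_blocks is the set of basic blocks of DAG(G, (l_max - l_min)/2), and the candidate list is assumed to contain only live ranges that are already known to be longer than R_j. The function names rank and pick_victim are hypothetical.

```python
def rank(live_range, refs, subgraph_blocks):
    """rank(R_s) = n1(R_s) / n(R_s)."""
    n_total = sum(refs[live_range].values())                       # n(R_s)
    n_sub = sum(count for block, count in refs[live_range].items()
                if block in subgraph_blocks)                       # n1(R_s)
    return n_sub / n_total


def pick_victim(candidates, refs, subgraph_blocks):
    """Choose R_k: the candidate live range with the highest rank."""
    return max(candidates, key=lambda r: rank(r, refs, subgraph_blocks))


# Hypothetical reference counts: R1 is used mostly in B3, R2 mostly in B4.
refs = {"R1": {"B1": 2, "B3": 6}, "R2": {"B2": 1, "B4": 3}}
subgraph_blocks = {"B1", "B4"}     # blocks that lie only on short paths

rank_r1 = rank("R1", refs, subgraph_blocks)
rank_r2 = rank("R2", refs, subgraph_blocks)
victim = pick_victim(["R1", "R2"], refs, subgraph_blocks)
```

A high rank means that most references to the live range sit in blocks on short paths, so spilling it adds little to the longest path. Here R2 is chosen.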

Compared with the prior art, the invention has the following advantages:

1. Before performing instruction rescheduling and register reallocation, the method performs not only a first pass of instruction scheduling but also a first pass of register allocation. Compared with traditional methods, which perform only the first scheduling pass and ignore register allocation, the longest path obtained is more accurate and reliable.

2. During instruction rescheduling and register reallocation, the method gives priority to the basic block with the highest execution frequency on the longest path, whereas traditional methods select the basic block with the greatest instruction-level parallelism on that path. Because the most frequently executed basic block on the longest path has the greatest influence on shortening that path, the selection strategy of the invention is the better one.

3. When register pressure is high, the method assigns each instruction of a basic block a dynamic priority that lowers instruction-level parallelism and thereby reduces register spills. Traditional methods do not consider lowering instruction-level parallelism and may cause repeated register spills. By integrating register allocation and instruction scheduling into a single stage, the invention can generate performance-optimized compiled code.

Description of the drawings

Fig. 1 is a flow chart of the method of the invention.

Fig. 2 shows the weighted control flow graph G = (V, E, W) in a specific application of the invention.

Fig. 3 shows the acyclic graph DAG(G) = (V', E', W') in a specific application of the invention.

Detailed description

The invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

As shown in Fig. 1, the optimized instruction scheduling and register allocation method for clustered VLIW processors aims to minimize the worst-case execution time of a program and comprises two stages: in the first stage, a unified algorithm performs a first pass of instruction scheduling and register allocation on all basic blocks; in the second stage, basic blocks with register spills are rescheduled and their registers reallocated, according to the length of the longest path each basic block lies on and its highest execution frequency.

In a specific application, the detailed flow of the method is:

(1) Construct the weighted control flow graph G of program P; at this point the weights w_i are not yet defined.

(2) Using the unified algorithm, perform instruction scheduling and register allocation on each basic block B_i in reverse postorder.

(3) Update the weight w_i of each basic block B_i in the weighted control flow graph G.

(4) Construct the acyclic graph DAG(G).

(5) Repeat the following steps until the longest path of DAG(G) can no longer be shortened:

(5a) Compute the longest path of DAG(G).

(5b) Among the basic blocks on the longest path that have register spills, find the basic block B_k with the highest execution frequency.

(5c) Perform instruction rescheduling and register reallocation on B_k.

(5d) Update the weight of node B_k in DAG(G).

In a specific application example, as shown in Fig. 2, program P is represented by the weighted control flow graph G = (V, E, W), where V = {B_1, B_2, ..., B_n} is the set of basic blocks of the program, E = {(B_i, B_j) : B_j is control dependent on B_i}, and W = {w_i : w_i is the execution time of basic block B_i}. As shown in Fig. 3, G is converted into the directed acyclic graph DAG(G) = (V', E', W'), where V' = V is the set of basic blocks of P; E' = E - {(B_i, B_j) : (B_i, B_j) is a back edge} is the set of edges of DAG(G); and W' = {w'_i : w'_i is the weight of node B_i, w'_i = w_i * N(B_i)}, in which w_i is the execution time of B_i and N(B_i) is the highest execution frequency of B_i.
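The conversion from G to DAG(G) can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: back edges are detected with a depth-first search from a single entry block, the function name build_dag is hypothetical, and the execution times and frequencies are invented values.

```python
def build_dag(succs, exec_time, max_freq, entry):
    """Drop back edges found by DFS and weight each node by w_i * N(B_i)."""
    state = {}        # block -> "active" while on the DFS stack, "done" afterwards
    dag_edges = []

    def dfs(block):
        state[block] = "active"
        for succ in succs.get(block, []):
            if state.get(succ) == "active":
                continue              # (block, succ) is a back edge: leave it out of E'
            dag_edges.append((block, succ))
            if succ not in state:
                dfs(succ)
        state[block] = "done"

    dfs(entry)
    dag_weights = {b: exec_time[b] * max_freq[b] for b in state}   # w'_i = w_i * N(B_i)
    return dag_edges, dag_weights


# Hypothetical program: B2 and B3 form a loop, B3 jumps back to B2.
succs = {"B1": ["B2"], "B2": ["B3", "B4"], "B3": ["B2"], "B4": []}
exec_time = {"B1": 4, "B2": 6, "B3": 10, "B4": 3}     # w_i in cycles
max_freq = {"B1": 1, "B2": 9, "B3": 8, "B4": 1}       # N(B_i)
dag_edges, dag_weights = build_dag(succs, exec_time, max_freq, "B1")
```

The edge (B3, B2) is removed as a back edge, and the loop body B3 receives the weight 10 * 8 = 80, which reflects its total contribution to the execution time.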

In the first stage of the above flow, the unified algorithm combines an incremental register allocation method with a priority-based instruction scheduling method. The instruction scheduling method schedules all basic blocks in reverse postorder and schedules the instructions within each basic block by priority. The priority of each instruction takes inter-instruction latency and processor resource constraints into account, and priorities are updated dynamically during scheduling to reduce register pressure. The schedulable instruction with the highest priority is assigned to a functional unit on a cluster, and the incremental register allocation method is invoked to assign physical registers to the virtual registers of that instruction. The assignment of instructions to clusters takes into account the start time of each instruction and the register pressure of each cluster.
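The following Python sketch shows one way such a unified pass could be organized. It is a simplified illustration and makes several assumptions that are not in the patent text: the function name schedule_block and its data layout are hypothetical, every instruction defines exactly one value and lists each operand once, each cluster issues one instruction per cycle, the static priority is the latency-weighted height of the instruction, and the registers needed for inter-cluster copies are ignored.

```python
def schedule_block(instrs, n_clusters, regs_per_cluster, copy_latency=1):
    """List-schedule one basic block and allocate registers as instructions issue.

    instrs maps a name to {"deps": [names], "latency": cycles >= 1}.
    Returns name -> (cycle, cluster, register); register is None when no
    register was free, which means a spill would be needed.
    """
    users = {n: [m for m in instrs if n in instrs[m]["deps"]] for n in instrs}
    pending_uses = {n: len(users[n]) for n in instrs}
    free = [list(range(regs_per_cluster)) for _ in range(n_clusters)]
    placed = {}

    def height(n):
        # Static priority: latency-weighted distance to the end of the block.
        return instrs[n]["latency"] + max((height(u) for u in users[n]), default=0)

    def ready_at(n, c):
        # Earliest cycle at which every operand of n is available on cluster c.
        return max((placed[d][0] + instrs[d]["latency"]
                    + (0 if placed[d][1] == c else copy_latency)
                    for d in instrs[n]["deps"]), default=0)

    def priority(n):
        # Dynamic priority: when some cluster is almost out of registers,
        # instructions that end the live range of an operand go first.
        pressure_high = min(len(f) for f in free) <= 1
        releases = sum(1 for d in instrs[n]["deps"] if pending_uses[d] == 1)
        return (releases if pressure_high else 0, height(n))

    cycle = 0
    while len(placed) < len(instrs):
        open_clusters = set(range(n_clusters))     # one issue slot per cluster per cycle
        while open_clusters:
            ready = [n for n in instrs
                     if n not in placed
                     and all(d in placed for d in instrs[n]["deps"])
                     and any(ready_at(n, c) <= cycle for c in open_clusters)]
            if not ready:
                break
            n = max(ready, key=priority)
            # Among clusters that can start n now, prefer the one with most free registers.
            c = max((c for c in sorted(open_clusters) if ready_at(n, c) <= cycle),
                    key=lambda c: len(free[c]))
            for d in instrs[n]["deps"]:            # release operands at their last use
                pending_uses[d] -= 1
                if pending_uses[d] == 0 and placed[d][2] is not None:
                    free[placed[d][1]].append(placed[d][2])
            reg = free[c].pop() if free[c] else None
            placed[n] = (cycle, c, reg)
            open_clusters.remove(c)
        cycle += 1
    return placed


# Hypothetical block: c consumes a and b, d consumes c.
block = {
    "a": {"deps": [], "latency": 1},
    "b": {"deps": [], "latency": 1},
    "c": {"deps": ["a", "b"], "latency": 2},
    "d": {"deps": ["c"], "latency": 1},
}
schedule = schedule_block(block, n_clusters=2, regs_per_cluster=2)
```

In the example, a and b issue in cycle 0 on different clusters, c waits until cycle 2 because one operand must be copied between clusters, and d follows on the same cluster as c in cycle 4. Because a register is assigned at the moment an instruction is placed, scheduling and allocation see each other's decisions, which is the point of a unified pass.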

In the above flow, the goal of the second stage is to minimize register spills on the longest path through instruction rescheduling and register reallocation. Each time the basic block B_i with the highest execution frequency on the longest path of DAG(G) is selected, the following steps are performed on every live range R_j spilled to memory in order to reduce spills:

I. Find the live range R_k that affects the longest path least and that is longer than R_j, and assign the register of R_k to R_j.

II. Insert spill code for R_k and reschedule all affected instructions.

III. Recompute the execution time of every basic block affected by R_k.

In step I above, this embodiment finds a suitable live range R_k by introducing a new graph DAG(G, k). DAG(G, k) is a subgraph of DAG(G) in which no path is longer than k + l_min, where l_min is the length of the shortest path of DAG(G). The value of k is set to (l_max - l_min)/2, where l_max is the length of the longest path of DAG(G). After DAG(G, (l_max - l_min)/2) has been constructed, the priority rank of each live range R_s is computed as rank(R_s) = n1(R_s)/n(R_s), where n1(R_s) is the total number of references to R_s in all basic blocks of DAG(G, (l_max - l_min)/2) and n(R_s) is the total number of uses of R_s in all basic blocks. The selected R_k is the live range with the highest rank on the longest path.
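The text does not say how DAG(G, k) is constructed. The Python sketch below shows one possible construction and should be read as an assumption: it keeps exactly those basic blocks whose heaviest entry-to-exit path is at most l_min + k, which guarantees that no path inside the resulting subgraph exceeds the bound. The function name bounded_subgraph is hypothetical, and the sketch assumes a DAG with a single entry block in which every block appears as a key of succs.

```python
from functools import lru_cache


def bounded_subgraph(succs, weights, entry):
    """Blocks whose heaviest entry-to-exit path is at most l_min + (l_max - l_min) / 2."""
    preds = {b: [] for b in weights}
    for src, dsts in succs.items():
        for dst in dsts:
            preds[dst].append(src)

    @lru_cache(maxsize=None)
    def down(block, pick):
        # Heaviest (pick=max) or lightest (pick=min) path from block to an exit.
        return weights[block] + pick((down(s, pick) for s in succs[block]), default=0)

    @lru_cache(maxsize=None)
    def up(block):
        # Heaviest path from the entry to block, including block itself.
        return weights[block] + max((up(p) for p in preds[block]), default=0)

    l_max, l_min = down(entry, max), down(entry, min)
    bound = l_min + (l_max - l_min) / 2           # k + l_min with k = (l_max - l_min) / 2
    return {b for b in weights
            if up(b) + down(b, max) - weights[b] <= bound}


# Hypothetical diamond: the path through B2 is long, the path through B3 is short.
succs = {"B1": ["B2", "B3"], "B2": ["B4"], "B3": ["B4"], "B4": []}
weights = {"B1": 4, "B2": 50, "B3": 10, "B4": 3}
short_blocks = bounded_subgraph(succs, weights, "B1")
```

Here l_max = 57 and l_min = 17, so the bound is 37. Only B3 stays, because B1, B2 and B4 all lie on the 57-cycle path. Live ranges referenced mainly in B3 would therefore receive a high rank.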

The above is only a preferred embodiment of the invention. The scope of protection of the invention is not limited to the above embodiment; all technical solutions under the idea of the invention fall within its scope of protection. It should be noted that, for those of ordinary skill in the art, improvements and refinements made without departing from the principle of the invention are also to be regarded as within its scope of protection.

Claims

1. An optimized instruction scheduling and register allocation method for clustered VLIW processors, characterized in that it comprises two stages: in the first stage, a unified algorithm performs a first pass of instruction scheduling and register allocation on all basic blocks; in the second stage, basic blocks with register spills are rescheduled and their registers reallocated, according to the length of the longest path each basic block lies on and its highest execution frequency.

2. The optimized instruction scheduling and register allocation method for clustered VLIW processors according to claim 1, characterized in that the steps of the first stage are:

(1) Construct the weighted control flow graph G of program P. P is represented by the weighted control flow graph G = (V, E, W), where V = {B_1, B_2, ..., B_n} is the set of basic blocks of the program, E = {(B_i, B_j) : B_j is control dependent on B_i}, and W = {w_i : w_i is the execution time of basic block B_i};

(2) Using the unified algorithm, perform instruction scheduling and register allocation on each basic block B_i in reverse postorder.

3. The optimized instruction scheduling and register allocation method for clustered VLIW processors according to claim 2, characterized in that the unified algorithm in step (2) combines an incremental register allocation method with a priority-based instruction scheduling method; the instruction scheduling method schedules all basic blocks in reverse postorder and schedules the instructions within each basic block by priority; the priority of each instruction takes inter-instruction latency and processor resource constraints into account, and priorities are updated dynamically during scheduling to reduce register pressure.

4. The optimized instruction scheduling and register allocation method for clustered VLIW processors according to claim 2 or 3, characterized in that the steps of the second stage are:

(1) Update the weight w_i of each basic block B_i in the weighted control flow graph G;

(2) Construct the acyclic graph DAG(G). The weighted control flow graph G is converted into the acyclic graph DAG(G) = (V', E', W'), where V' = V is the set of basic blocks of P; E' = E - {(B_i, B_j) : (B_i, B_j) is a back edge} is the set of edges of DAG(G); and W' = {w'_i : w'_i is the weight of node B_i, w'_i = w_i * N(B_i)}, in which w_i is the execution time of B_i and N(B_i) is the highest execution frequency of B_i;

(3) Repeat the following steps until the longest path of DAG(G) can no longer be shortened:

(3a) Compute the longest path of DAG(G);

(3b) Among the basic blocks on the longest path that have register spills, find the basic block B_k with the highest execution frequency;

(3c) Perform instruction rescheduling and register reallocation on B_k;

(3d) Update the weight of node B_k in DAG(G).

5. The optimized instruction scheduling and register allocation method for clustered VLIW processors according to claim 4, characterized in that, each time the second stage selects the basic block B_i with the highest execution frequency on the longest path of DAG(G), the following steps are performed on every live range R_j spilled to memory in order to reduce spills:

I. Find the live range R_k that affects the longest path least and that is longer than R_j, and assign the register of R_k to R_j;

II. Insert spill code for R_k and reschedule all affected instructions;

III. Recompute the execution time of every basic block affected by R_k.

6. The optimized instruction scheduling and register allocation method for clustered VLIW processors according to claim 5, characterized in that step I finds a suitable live range R_k by introducing a new graph DAG(G, k); DAG(G, k) is a subgraph of DAG(G) in which no path is longer than k + l_min, where l_min is the length of the shortest path of DAG(G); the value of k is set to (l_max - l_min)/2, where l_max is the length of the longest path of DAG(G); after DAG(G, (l_max - l_min)/2) has been constructed, the priority rank of each live range R_s is computed as rank(R_s) = n1(R_s)/n(R_s), where n1(R_s) is the total number of references to R_s in all basic blocks of DAG(G, (l_max - l_min)/2) and n(R_s) is the total number of uses of R_s in all basic blocks; the selected R_k is the live range with the highest rank on the longest path.