CN103294550A

CN103294550A - Heterogeneous multi-core thread scheduling method, heterogeneous multi-core thread scheduling system and heterogeneous multi-core processor

Info

Publication number: CN103294550A
Application number: CN2013102065330A
Authority: CN
Inventors: 王磊; 陈云霁; 陈天石; 陆超; 李梦竹
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2013-05-29
Filing date: 2013-05-29
Publication date: 2013-09-11
Anticipated expiration: 2033-05-29
Also published as: CN103294550B

Abstract

The invention relates to a heterogeneous multi-core thread scheduling method that generates ranked lists for threads and for cores according to the dynamic characteristics of programs, finds the optimal stable matching between threads and cores from those lists, and schedules threads according to that matching. The method includes: receiving the feature vector of the thread running on a core and, from it, ranking the cores in order of selection priority for that thread; ranking the threads for each core; receiving the ranked lists of all threads and cores and finding a stable matching between threads and cores; and receiving the matching result and, through the operating system, assigning each thread to its corresponding core. The method avoids the large overhead of sampling-based scheduling. It takes into account more of the complex factors that affect performance and power consumption, and it needs only the predicted relative order rather than specific values, which reduces model complexity and improves scheduling accuracy.

Description

A heterogeneous multi-core thread scheduling method, system and heterogeneous multi-core processor

Technical field

The present invention relates to thread scheduling policies for single-ISA heterogeneous multi-core processors in which the number of threads equals the number of cores, and in particular to a thread scheduling method in which threads and cores rank each other by selection priority and the Gale-Shapley algorithm is then used to complete the scheduling.

Background

As integrated circuit technology has advanced, more and more cores have been integrated into a single system-on-chip, and chip multi-processors (CMP) have gradually become a mainstream processor architecture. By integrating multiple identical general-purpose cores on a chip, a CMP delivers better performance for programs running in parallel on the system, but it is also constrained by power consumption, heat dissipation, chip area and similar limits. To use the limited on-chip power and area more effectively, industry and academia have proposed heterogeneous multi-core processor architectures.

Heterogeneous multi-core processors take several forms; the present invention mainly concerns single-ISA heterogeneous multi-core processors. In a single-ISA heterogeneous multi-core processor, cores of different types share the same instruction set. Differences between cores may come from parameters such as frequency, cache size and power budget, or from differences in basic microarchitectural design (for example, out-of-order versus in-order execution, or instruction issue width). In addition, the present invention mainly addresses the case in which each core of the heterogeneous multi-core processor runs one single-threaded program, so the number of threads always equals the number of cores in the system, and a thread can be regarded as equivalent to a program.

Different programs usually have different program characteristics. Moreover, even for a single program, the characteristics change significantly with the input set and the execution phase.

In a heterogeneous multi-core processor, scheduling each thread onto the core that suits it best according to its program characteristics is called thread scheduling. The purpose of thread scheduling is to give threads better performance on suitable cores while avoiding wasted power as far as possible, so that the limited on-chip power and area are used more effectively.

Scheduling methods are either static or dynamic. A static scheduling method extracts, offline, program characteristics that are independent of the specific execution environment, uses them to predict how each thread will perform on the different types of cores, and schedules each thread onto the corresponding core according to the prediction. Static scheduling exploits only the differences between programs and ignores the fact that a single program has different characteristics in different execution phases, so it is inherently limited.

Sampling-based dynamic scheduling methods divide scheduling into two phases: a sampling phase and a stable execution phase. When a trigger event signals that the dynamic characteristics of the programs have changed significantly, the sampling phase begins. In the sampling phase, each thread is scheduled onto each type of core for a trial run, so all scheduling schemes must be traversed and the performance of each scheme recorded. The best-performing scheme is then selected for a longer stable execution phase that lasts until the next trigger event. Sampling-based dynamic scheduling can make full use of the dynamic characteristics of programs. However, the sampling phase incurs substantial thread migration cost, and traversing the different schemes forces programs to run under many non-ideal assignments, which causes a large performance overhead. Moreover, the sampling overhead grows rapidly as the number of core types in the system increases, so such methods scale poorly and cannot be applied in practice.

To avoid the overhead of sampling, a class of heuristic scheduling methods has been proposed. These methods use hardware monitors to collect key information while programs run, such as IPC, cache miss rate and stall time, apply empirical rules to this dynamic information to estimate how each thread would perform on the different types of cores, and then use a greedy algorithm to select a suitable core for each thread according to the expected gain.

Representative technical solutions of this class are briefly introduced below:

In a heterogeneous multi-core processor made up of cores with different frequencies, threads are sorted from high to low by their IPC in the previous execution phase, cores are sorted by frequency, and threads are matched to cores by their relative positions in the two orderings. A similar approach collects information such as each thread's cache miss rate to classify threads as compute-intensive or memory-intensive, then schedules compute-intensive threads onto big cores (for example, high frequency, large cache, out-of-order execution) and memory-intensive threads onto small cores (for example, low frequency, small cache, in-order execution). The rationale is that compute-intensive threads with high instruction-level parallelism (ILP) perform better on big cores, while memory-intensive threads placed on small cores save power. A further refinement combines the collected cache miss rate, stall time and similar information with the structural parameters of each core to estimate each thread's performance on the different cores, and then uses a greedy algorithm to schedule threads onto cores according to the performance gain.
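The position-based heuristic described first in the paragraph above can be sketched as follows. This is only an illustration under stated assumptions: the function name `match_by_rank`, the thread and core names, and the IPC and frequency values are all hypothetical and do not come from the patent.

```python
def match_by_rank(thread_ipc, core_freq_ghz):
    """Pair the k-th highest-IPC thread with the k-th highest-frequency core."""
    # Sort thread names by IPC and core names by frequency, both descending.
    threads = sorted(thread_ipc, key=thread_ipc.get, reverse=True)
    cores = sorted(core_freq_ghz, key=core_freq_ghz.get, reverse=True)
    # Threads and cores at the same relative position are matched.
    return dict(zip(threads, cores))


# Hypothetical measurements from the previous execution phase.
thread_ipc = {"T0": 0.8, "T1": 1.9, "T2": 1.2, "T3": 0.5}
core_freq_ghz = {"core0": 3.0, "core1": 2.0, "core2": 1.5, "core3": 1.0}

assignment = match_by_rank(thread_ipc, core_freq_ghz)
print(assignment)
```

Note that only the thread side's single metric drives the result; the cores have no say, which is the limitation the later sections address.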

Such methods usually estimate program performance from only a few important program characteristics (such as cache miss rate and IPC) and core structural parameters (such as frequency and cache size), combined with domain knowledge or empirical rules. In reality, program performance depends on a large number of complex factors, so the predictions are often not accurate enough and the resulting scheduling is less than ideal.

In addition, most existing scheduling methods rely on a formula-based model that considers a limited number of factors to predict how each thread will perform on the different types of cores. Because actual program performance depends on many complex factors, the accuracy of such predictions is limited. On the other hand, even where an accurate prediction model exists, it is usually very complex to implement, and it does not necessarily lead to better scheduling. For example, suppose the actual performance of a thread on two different types of cores is (5, 4.8), model A predicts (4.9, 5.1), and model B predicts (10, 1). Model A's predicted values are clearly closer to the truth, yet it ranks the two cores in the wrong order, so the scheduling decision based on model B's prediction is the more reliable one. The example shows that what is actually needed is not the exact performance of a thread on each core but a relative relationship, that is, a predicted ranking of the thread's performance across the cores.
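The numeric example above can be checked with a few lines of code. The sketch below uses the three tuples from the text; the helper names `ranking` and `mean_abs_error` are illustrative choices and are not part of the patent.

```python
def ranking(scores):
    """Return core indices ordered from best to worst score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def mean_abs_error(pred, actual):
    """Average absolute difference between predicted and actual values."""
    return sum(abs(p - a) for p, a in zip(pred, actual)) / len(actual)


actual = (5.0, 4.8)    # true performance on core type 0 and core type 1
model_a = (4.9, 5.1)   # numerically close, but orders the cores wrongly
model_b = (10.0, 1.0)  # numerically far off, but orders the cores correctly

for name, pred in (("A", model_a), ("B", model_b)):
    same_order = ranking(pred) == ranking(actual)
    print(name, "error:", mean_abs_error(pred, actual), "correct order:", same_order)
```

Model A has the smaller error but the wrong order, while model B has the larger error and the right order, which is why a ranking model is sufficient for scheduling.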

Furthermore, most existing scheduling methods take the thread's point of view, treat the thread as the decision maker, and schedule greedily toward a single optimization goal.

In general, previously proposed scheduling methods set an optimization goal from the program's perspective and let the program, as the decision maker, pick a suitable core. The problem with this one-way selection is that, during scheduling, a core has no right to decide for itself whether to accept a thread based on its own structural characteristics, power budget and similar conditions. For example, when a compute-intensive thread selects a core, that core offers the thread the best performance; but from the core's point of view, if accepting the thread could push its power consumption over its power budget, the schedule is clearly not ideal.

Summary of the invention

To solve the above technical problems, the object of the present invention is to provide a thread scheduling method and scheduling system for heterogeneous multi-core processors based on the Gale-Shapley algorithm. For the thread scheduling problem in heterogeneous multi-core processors, the invention schedules dynamically as program characteristics change, avoids the large overhead of sampling-based scheduling, avoids the weakness of heuristic scheduling (performance is hard to predict precisely, so scheduling is less than ideal), and treats both threads and cores as decision-making participants, so that the needs of both are taken into account during scheduling.

Specifically, the present invention discloses a heterogeneous multi-core thread scheduling method that includes generating ranked lists for threads and for cores according to the dynamic characteristics of programs, finding the optimal stable matching between threads and cores from those lists, and scheduling threads according to that stable matching.

Generating the ranked lists for threads and cores includes generating ranking models, which specifically includes the following steps:

(1) selecting an ideal database;

(2) extracting program sample segments from the database;

(3) running the program sample segments on the simulator of each core to obtain the corresponding responses, and dividing the program sample segments and their responses into a training set and a test set;

(4) selecting a suitable learning algorithm to train the ranking model;

(5) ending the training phase when the test error of the ranking model meets the requirement.

Each program sample segment has a feature vector. For a thread, the input is the feature vector of one program sample segment and the output is a ranked list of the cores. For a core, the input is the feature vectors of the program sample segments of all threads and the output is that core's ranked list of the threads.

The heterogeneous multi-core thread scheduling method specifically includes the following steps:

collecting the various kinds of dynamic information produced while a thread runs, and outputting the feature vector of a program sample segment of the thread;

receiving the feature vector of the thread running on a core and, based on it, ranking the cores in order of selection priority for that thread;

for each core, ranking the threads;

receiving the ranked lists of all threads and cores, and finding a stable matching between threads and cores;

receiving the matching result and, through the operating system, assigning each thread to its corresponding core to run.

In the heterogeneous multi-core thread scheduling method, finding the stable matching between threads and cores includes the following steps:

(1) each thread sends matching requests to cores in order of its priority ranking, from highest to lowest; if a core has no matched partner, it accepts the request and forms a matched pair with the thread;

(2) if the core already has a matched partner, it compares the priority of the new thread with that of its current partner; if the new thread ranks higher than the previously accepted thread, the core accepts the new thread as its partner (and the previously accepted thread becomes a rejected thread); if the new thread ranks lower, the core rejects the new request;

(3) a rejected thread sends a matching request to the next core on its ranked list, and this continues until every thread and every core has found a partner.

Finding the stable matching between threads and cores includes using the Gale-Shapley algorithm.
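Steps (1) to (3) above correspond to the thread-proposing form of the Gale-Shapley algorithm. A minimal sketch follows; the function name `stable_match` and the preference lists are hypothetical examples chosen for illustration, not data from the patent.

```python
from collections import deque


def stable_match(thread_prefs, core_prefs):
    """Thread-proposing Gale-Shapley. Returns {thread: core}."""
    # rank[c][t] is the position of thread t in core c's list (lower is better).
    rank = {c: {t: i for i, t in enumerate(p)} for c, p in core_prefs.items()}
    next_choice = {t: 0 for t in thread_prefs}  # index of next core to try
    holder = {}                                 # core -> tentatively accepted thread
    free = deque(thread_prefs)                  # threads without a partner

    while free:
        t = free.popleft()
        c = thread_prefs[t][next_choice[t]]
        next_choice[t] += 1
        current = holder.get(c)
        if current is None:                 # step (1): core is unmatched, accept
            holder[c] = t
        elif rank[c][t] < rank[c][current]:  # step (2): core prefers the newcomer
            holder[c] = t
            free.append(current)             # displaced thread must propose again
        else:                                # step (2): core rejects the newcomer
            free.append(t)                   # step (3): try the next core later
    return {t: c for c, t in holder.items()}


# Hypothetical ranked lists for four threads and four cores.
thread_prefs = {
    "T0": ["core0", "core1", "core2", "core3"],
    "T1": ["core0", "core2", "core1", "core3"],
    "T2": ["core1", "core0", "core3", "core2"],
    "T3": ["core1", "core3", "core0", "core2"],
}
core_prefs = {
    "core0": ["T1", "T0", "T2", "T3"],
    "core1": ["T3", "T2", "T0", "T1"],
    "core2": ["T0", "T1", "T2", "T3"],
    "core3": ["T2", "T3", "T0", "T1"],
}

matching = stable_match(thread_prefs, core_prefs)
print(matching)
```

With equal numbers of threads and cores and complete lists, the loop always terminates with every thread matched. In the standard formulation of the algorithm, the proposing side (here the threads) receives its best partner among all stable matchings.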

The present invention also discloses a heterogeneous multi-core thread scheduling system, characterized in that it includes an information collection module, a T-ranker, a C-ranker, a matcher and a thread scheduler, wherein:

the information collection module collects the various kinds of dynamic information produced while each thread runs, and outputs the feature vector of a program sample segment of each thread;

the T-ranker receives the feature vector of the thread running on its core and, based on it, ranks the cores in order of selection priority for that thread;

the C-ranker ranks the threads for each core;

the matcher receives the ranked lists of all threads and all cores and produces a stable matching between threads and cores;

the thread scheduler receives the matching result and, through the operating system, assigns each thread to its corresponding core to run.

The present invention also discloses a heterogeneous multi-core processor that uses any of the heterogeneous multi-core thread scheduling methods described above.

The present invention also discloses a heterogeneous multi-core processor that includes the heterogeneous multi-core thread scheduling system described above.

The beneficial effects of the present invention are as follows. It exploits the dynamic characteristics of programs while avoiding the large overhead of sampling-based scheduling. It replaces empirical formulas for predicting performance and power with a nonlinear learning-to-rank model, which can take into account more of the complex factors that affect performance and power and needs only the predicted relative order rather than specific values, reducing model complexity while improving scheduling accuracy. During scheduling, it treats both threads and cores as independent decision makers in a game, so that the performance needs of programs and the power limits of cores are both taken into account. It uses the Gale-Shapley algorithm to find a Pareto-optimal stable matching and schedules threads accordingly.

Brief description of the drawings

Figure 1: Offline training architecture of the ranking model of the present invention

Figure 2: Structure of a quad-core heterogeneous multi-core processor embodiment of the present invention

Detailed description of the embodiments

Drawing on game theory, the present invention treats both threads and cores as selfish decision-making participants. Each tries to maximize its own gain, in performance or in power, from its own point of view, which in effect lets the scheduling method serve the optimization goals of both threads and cores and so reach a better global scheduling decision.

In the present invention, each thread's selection-priority ranking of the cores must be obtained. In addition, from the core's point of view, each core produces an acceptance-priority ranking of the threads.

To obtain these priority rankings, ranking models (rankers) are trained using a learning-to-rank technique. Figure 1 shows one specific way of obtaining the ranking model in the present invention:

Application database: an idealized, infinitely large database containing all programs;

Program sample segments (Sample application phase): program segments extracted from a set of example programs, whose program characteristics should be representative of most commonly used programs; a common program analysis tool such as mika can be used to extract the feature vector of a program;

Simulator: for a given heterogeneous multi-core processor, the number of cores and the type of each core are fixed in advance; the program sample segments are run on the simulator of each core to obtain the corresponding responses (each core response), and the sample segments and their responses are divided into a training set and a test set;

Learning algorithm: training the ranker is a supervised learning process; a suitable learning algorithm, for example RankBoost, is chosen according to the situation to train the ranking model;

Ranker model: when the test error of the ranking model meets the requirement, the training phase ends.

For a thread, the input of the T-ranker is the feature vector of one program segment and the output is a ranked list of the cores. For the T-ranker the cores to be ranked are fixed and the only input variable is the feature vector of each program segment, so a single trained T-ranker can be used on all cores. For a given core, the input of the C-ranker is the feature vectors of the segments of all threads and the output is that core's ranked list of the threads. For the C-ranker the threads to be ranked keep changing and each core has its own structural configuration, so the rankings differ from core to core even for the same set of threads; a separate C-ranker must therefore be trained for each core.

After training, the ranking models can be implemented in hardware and integrated on each core, or made part of the scheduling system of the present invention.

After the ranking models have produced the ranked lists for the threads and the cores, a stable matching is found with the Gale-Shapley algorithm and threads are scheduled according to the matching result. This reaches a Pareto-optimal state in which all threads and cores are relatively satisfied, which in effect achieves an approximately globally optimal thread schedule.

Suppose set A and set B each contain N elements, and each element has its own priority list containing all the elements of the other set. The Gale-Shapley algorithm can then always find a stable matching between the two sets, in which each element obtains the best partner available to it. A matching is unstable if it contains an element a of set A and an element b of set B that each rank the other above their current partner, so that both a and b would rather drop their current partners and match with each other. A matching with no such unstable pair is a stable matching. For sets A and B, more than one stable matching may exist. Theory shows that the matching found by the Gale-Shapley algorithm is always Pareto optimal and is the best of all stable matchings.
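The definition of an unstable pair given above translates directly into a check. The sketch below lists every pair (a, b) that would rather be matched with each other than with their current partners; the function name `blocking_pairs` and the two-element example sets are hypothetical.

```python
def blocking_pairs(matching, a_prefs, b_prefs):
    """Return all (a, b) that prefer each other to their assigned partners.

    matching maps each element of A to its partner in B.
    """
    partner_of_b = {b: a for a, b in matching.items()}
    pairs = []
    for a, prefs in a_prefs.items():
        for b in prefs:
            if b == matching[a]:
                break  # remaining entries rank below a's current partner
            rival = partner_of_b[b]
            # b prefers a if a appears earlier than b's current partner.
            if b_prefs[b].index(a) < b_prefs[b].index(rival):
                pairs.append((a, b))
    return pairs


# Both a0 and a1 prefer b0; both b0 and b1 prefer a0.
a_prefs = {"a0": ["b0", "b1"], "a1": ["b0", "b1"]}
b_prefs = {"b0": ["a0", "a1"], "b1": ["a0", "a1"]}

stable = {"a0": "b0", "a1": "b1"}
unstable = {"a0": "b1", "a1": "b0"}

print(blocking_pairs(stable, a_prefs, b_prefs))
print(blocking_pairs(unstable, a_prefs, b_prefs))
```

A matching is stable exactly when this function returns an empty list.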

The Gale-Shapley-based thread scheduling method for heterogeneous multi-core processors provided by the present invention is described using a 4-core heterogeneous multi-core processor as an example. The invention can equally be extended to heterogeneous multi-core processors that integrate more cores, and it places no restriction on the types of the cores.

As shown in Figure 2, a 4-core heterogeneous multi-core processor includes, in addition to the four cores, the following components: an information collection module (Monitor) on each core, a T-ranker on each core, one C-ranker, one matcher (Matchmaker) and one thread scheduler (Scheduler).

Monitor: collects the various kinds of dynamic information produced while a thread runs, including but not limited to cache miss rate, stall time, number of integer instructions and number of floating-point instructions; its output is the feature vector of a program segment of the thread;

T-ranker: receives the feature vector of the thread running on its core and, based on it, ranks the cores in order of selection priority for that thread, usually with performance as the ranking criterion;

C-ranker: internally integrates four ranking models, one per core, each of which ranks the four threads for its core; the ranking criterion can be set to order threads from high to low by performance-per-watt, subject to meeting the power budget; since each individual ranker needs the feature vectors of all four threads, grouping the rankers together reduces communication overhead, because the information needs to be received only once from the Monitors of the four cores;

Matchmaker: receives the ranked lists of all threads and cores and finds a stable matching using the Gale-Shapley algorithm;

Scheduler: receives the matching result from the Matchmaker and, through the operating system, assigns each thread to its corresponding core to run.
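The C-ranker criterion mentioned in the component list (performance-per-watt order, subject to the power budget) could look roughly like the sketch below. In the patent the ordering comes from a trained ranking model; here explicit predicted values stand in for it. The function name, the numbers, and in particular the choice to place over-budget threads at the end of the list are assumptions made for illustration, since the source does not say how such threads are handled.

```python
def core_acceptance_order(candidates, power_budget_w):
    """Rank threads for one core.

    candidates maps thread -> (predicted_perf, predicted_power_w).
    Threads within the power budget come first, ordered by perf/power
    (high to low). Assumption: over-budget threads are appended last.
    """
    def ratio(thread):
        perf, power = candidates[thread]
        return perf / power

    within = [t for t in candidates if candidates[t][1] <= power_budget_w]
    over = [t for t in candidates if candidates[t][1] > power_budget_w]
    return (sorted(within, key=ratio, reverse=True)
            + sorted(over, key=ratio, reverse=True))


# Hypothetical predictions for one core: (performance, power in watts).
candidates = {
    "T0": (2.0, 4.0),
    "T1": (3.0, 10.0),
    "T2": (1.5, 2.0),
    "T3": (1.0, 4.0),
}

order = core_acceptance_order(candidates, power_budget_w=8.0)
print(order)
```

A list produced this way is a complete ordering of the threads, which is the form of input the Matchmaker needs from each core.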

To make the purpose, technical solution and advantages of the present invention clearer, the Gale-Shapley-based thread scheduling method for heterogeneous multi-core processors is described in further detail below with reference to the drawings and an embodiment. It should be understood that the specific embodiment described here only serves to explain the present invention and does not limit it.

The thread scheduling method of this embodiment includes generating ranked lists for threads and for cores according to the dynamic characteristics of programs, and using the Gale-Shapley algorithm on the ranking results to find an optimal stable matching for thread scheduling. This embodiment assumes that the system contains only four heterogeneous cores, core0, core1, core2 and core3, each with a different structural configuration. The invention can equally be extended to heterogeneous processors with more cores; the implementation differs little from the quad-core case in this example and is therefore not described separately, but all such cases should be regarded as falling within the scope of the present invention.

The offline training of the ranker models is first described in detail, taking the training architecture shown in Figure 1 as an example.

First, representative example programs such as SPEC2006 are used as the program library and are split into a series of program segments according to a fixed rule, for example treating every ten million instructions as one segment. A program analysis tool such as mika is used to extract the feature vector of each segment, which can include ILP, number of integer instructions, number of floating-point instructions, cache miss rate and similar information. A number of segments are selected at random from the library and simulated on the simulators of the four cores core0, core1, core2 and core3, yielding performance information such as IPC and power information such as performance-per-watt. The randomly selected segments and their simulation results are randomly divided into a training set and a test set, and a learning-to-rank algorithm such as RankBoost is chosen to train the ranking model. Training the ranking model is a supervised learning process.
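The data preparation steps in the paragraph above (cutting a program into fixed-length segments and randomly splitting the samples into training and test sets) can be sketched as follows. The function names, the 80/20 split and the 45-million-instruction program are assumptions for illustration; feature extraction, simulation and the RankBoost training itself are outside this sketch.

```python
import random


def split_into_segments(total_instructions, segment_len=10_000_000):
    """Return (start, end) instruction ranges that cover the program."""
    return [(start, min(start + segment_len, total_instructions))
            for start in range(0, total_instructions, segment_len)]


def train_test_split(samples, test_fraction=0.2, seed=0):
    """Randomly divide samples into (train, test) lists."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)  # seeded for repeatability
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# A hypothetical program of 45 million instructions.
segments = split_into_segments(45_000_000)
train, test = train_test_split(segments)
print(len(segments), len(train), len(test))
```

In a full pipeline each segment would be paired with its feature vector and its per-core simulation results before the split.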

For the T-ranker, the input is the feature vector of one program segment and the output is the performance ranking of that segment across the four cores, that is, the thread's selection priority ranking of the cores. As mentioned above, only one T-ranker model needs to be trained, and it can then be used on each of the four cores. For the C-ranker of a given core, the input is the feature vectors of four different program segments running on that core and the output is the ranking of those four segments by performance-to-power ratio, that is, the core's acceptance priority ranking of the threads. A separate C-ranker model must be trained for each core. The training phase ends when the model's error on the test set is acceptably low.
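To make the two kinds of ranking concrete, the sketch below shows how the target rankings for supervised training could be derived from simulation results. It is an illustration only: the function names and all numeric values are hypothetical, and the trained rankers themselves (for example RankBoost models) are not shown.

```python
def thread_side_ranking(ipc_by_core):
    """Target output for the T-ranker: cores ordered from highest to lowest
    IPC for a single program segment."""
    return sorted(ipc_by_core, key=ipc_by_core.get, reverse=True)


def core_side_ranking(perf_per_watt_by_segment):
    """Target output for one core's C-ranker: program segments ordered from
    highest to lowest performance-to-power ratio on that core."""
    return sorted(perf_per_watt_by_segment,
                  key=perf_per_watt_by_segment.get, reverse=True)


if __name__ == "__main__":
    # Hypothetical simulation results for one segment on the four cores.
    ipc = {"core0": 0.8, "core1": 1.4, "core2": 1.9, "core3": 1.1}
    print(thread_side_ranking(ipc))

    # Hypothetical results for four segments simulated on a single core.
    ppw = {"seg_a": 2.0, "seg_b": 3.5, "seg_c": 1.2, "seg_d": 2.7}
    print(core_side_ranking(ppw))
```

Because only the relative order is used as the training target, the rankers do not need to predict the IPC or power values themselves.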

Once the ranking models have been trained, they are implemented in hardware on the heterogeneous multi-core processor and used for thread scheduling.

The implementation of the Gale-Shapley-based thread scheduling method for heterogeneous multi-core processors is described next, taking the scheduling architecture shown in FIG. 2 as an example.

Suppose four threads T0, T1, T2 and T3 run on the heterogeneous multi-core processor. At initialization no prior information about the threads is available, so they are scheduled onto the four cores at random, giving for example the matching (T0, core0), (T1, core1), (T2, core2), (T3, core3).

After the threads have run for a period of time, each Monitor has collected the dynamic program features of its core. It sends them to the corresponding T-ranker and C-ranker, which produce the following rankings:

Table 1: Ranking of the cores by each thread

Table 2: Ranking of the threads by each core

Once these ranked lists have been obtained, they are sent to the Matchmaker, which uses the Gale-Shapley algorithm to find an optimal stable matching:

First, T0 proposes to Core2, the top entry in its selection priority ranking. Core2 has no partner yet, so it accepts the proposal and the pair (T0, Core2) is formed.

Next, T1 proposes to Core1 according to its selection priority ranking. Core1 has no partner yet, so it accepts the proposal and the pair (T1, Core1) is formed.

Then T2 proposes to Core2 according to its selection priority ranking. Core2 is already matched with T0, so it consults its acceptance priority ranking, finds that T2 ranks higher than T0, and accepts T2's proposal, forming the new pair (T2, Core2).

Because Core2 is now matched with T2, T0 has lost its partner. It moves down its ranking and proposes to Core1. T0 ranks higher than T1 in Core1's list, so Core1 accepts, forming the new pair (T0, Core1).

The process continues in the same way. Each thread proposes to the cores in order from the highest to the lowest priority in its ranking. If the core has no partner, it accepts the proposal and forms a pair with the thread. If the core already has a partner, it compares the priority of the new thread with that of its current partner: if the new thread ranks higher, the core accepts the new thread as its partner; if the new thread ranks lower, the core rejects the proposal. A rejected thread proposes to the next core in its ranked list. The Gale-Shapley matching process ends when every thread and every core has a partner. It can be proven that the resulting matching is stable and that, among all stable matchings, it is the best one for the proposing side, here the threads. The matching obtained in this way is also Pareto optimal, because no thread (or core) can improve its own payoff without reducing the payoff of another thread (or core).
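The matching procedure described above can be sketched in software as follows. This is a minimal illustration of thread-proposing Gale-Shapley, not the hardware Matchmaker of the invention. The preference lists in the example are hypothetical values chosen to be consistent with the steps of the walkthrough; they are not a reproduction of Table 1 or Table 2. The sketch assumes equal numbers of threads and cores and complete preference lists.

```python
def stable_match(thread_prefs, core_prefs):
    """Thread-proposing Gale-Shapley matching.

    thread_prefs[t]: cores ordered from most to least preferred by thread t.
    core_prefs[c]:   threads ordered from most to least preferred by core c.
    Returns a dict mapping each thread to its core.
    """
    # rank[c][t] is the position of thread t in core c's list (lower is better).
    rank = {c: {t: i for i, t in enumerate(p)} for c, p in core_prefs.items()}
    next_choice = {t: 0 for t in thread_prefs}  # next list index to propose to
    partner = {}                                # core -> currently accepted thread
    free = list(reversed(list(thread_prefs)))   # stack of unmatched threads

    while free:
        t = free.pop()
        c = thread_prefs[t][next_choice[t]]
        next_choice[t] += 1
        current = partner.get(c)
        if current is None:
            partner[c] = t                      # core had no partner: accept
        elif rank[c][t] < rank[c][current]:
            partner[c] = t                      # core prefers the new thread
            free.append(current)                # displaced thread proposes again
        else:
            free.append(t)                      # rejected: try the next core
    return {t: c for c, t in partner.items()}


def is_stable(matching, thread_prefs, core_prefs):
    """True if no thread and core both prefer each other to their partners."""
    thread_of = {c: t for t, c in matching.items()}
    for t, prefs in thread_prefs.items():
        for c in prefs:
            if c == matching[t]:
                break  # the remaining cores are less preferred by t
            rival = thread_of[c]
            if core_prefs[c].index(t) < core_prefs[c].index(rival):
                return False
    return True


if __name__ == "__main__":
    # Hypothetical preference lists (illustration only).
    thread_prefs = {
        "T0": ["Core2", "Core1", "Core3", "Core0"],
        "T1": ["Core1", "Core3", "Core0", "Core2"],
        "T2": ["Core2", "Core0", "Core1", "Core3"],
        "T3": ["Core0", "Core2", "Core3", "Core1"],
    }
    core_prefs = {
        "Core0": ["T3", "T1", "T0", "T2"],
        "Core1": ["T0", "T1", "T2", "T3"],
        "Core2": ["T2", "T0", "T3", "T1"],
        "Core3": ["T1", "T3", "T2", "T0"],
    }
    matching = stable_match(thread_prefs, core_prefs)
    print(matching, is_stable(matching, thread_prefs, core_prefs))
```

With these assumed lists, the proposals follow the same order as the walkthrough: T0 is displaced from Core2 by T2, then displaces T1 from Core1, and T1 falls back to Core3.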

The final stable matching is (T0, Core1), (T1, Core3), (T2, Core2), (T3, Core0). The Scheduler dispatches each thread to its matched core according to this result. The Monitors continue to collect new program features in preparation for the next round of scheduling.

The above is only one embodiment of the invention. Many other cases can be derived in the same way and are not listed individually. In particular, this embodiment uses the RankBoost learning-to-rank algorithm to obtain the rankings and combines them with the Gale-Shapley algorithm to obtain a stable matching for thread scheduling, but other learning-to-rank algorithms, such as AdaRank or Rank SVM, can be combined with the Gale-Shapley algorithm to the same effect.

Those skilled in the art can evidently make various changes and modifications to the invention without departing from its spirit and scope. Such changes and modifications fall within the scope of protection of the invention.

Claims

1. A heterogeneous multi-core thread scheduling method, characterized in that it comprises generating ranked lists for the threads and for the cores respectively according to the dynamic characteristics of the programs, finding the optimal stable matching of threads and cores according to the ranked lists, and performing thread scheduling according to this stable matching.

2. The heterogeneous multi-core thread scheduling method of claim 1, characterized in that generating the ranked lists for the threads and cores comprises generating ranking models, specifically comprising the following steps:

(1) selecting an ideal database;

(2) extracting sampled program segments from the database;

(3) running the sampled program segments on the simulator of each core to obtain the corresponding responses, and dividing the sampled program segments and their responses into a training set and a test set;

(4) selecting a suitable learning algorithm to train the ranking models;

(5) ending the training stage when the test error of the ranking models meets the requirement.

3. The heterogeneous multi-core thread scheduling method of claim 2, characterized in that each sampled program segment comprises a feature vector; for a thread, the input is the feature vector of one sampled program segment and the output is a ranked list of the cores; for a core, the input is the feature vectors of the sampled program segments of the threads and the output is that core's ranked list of the threads.

4. The heterogeneous multi-core thread scheduling method of claim 1, characterized in that it specifically comprises the following steps:

collecting the various kinds of dynamic information of a running thread, and outputting the feature vector of a sampled program segment of the thread;

receiving the feature vector of the thread running on a core and, based on it, producing a selection priority ranking of the cores for that thread;

for each core, ranking the threads;

receiving the ranked lists of the threads and of the cores, and finding the stable matching result of threads and cores;

receiving the matching result, scheduling through the operating system, and assigning each thread to the corresponding core to run.

5. The heterogeneous multi-core thread scheduling method of claim 1 or 4, characterized in that finding the stable matching of threads and cores comprises the following steps:

(1) each thread proposes to the cores in order from the highest to the lowest priority in its ranking; if the core has no partner, it accepts the proposal and forms a pair with the thread;

(2) if the core already has a partner, the priority of the new thread is compared with that of the current partner; if the priority of the new thread is higher than that of the previously accepted thread, the core accepts the new thread as its partner; if the priority of the new thread is lower than that of the previously accepted thread, the core rejects the new proposal;

(3) a rejected thread proposes to the next core in its ranked list, until all threads and cores have found a partner.

6. The heterogeneous multi-core thread scheduling method of claim 1 or 4, characterized in that finding the stable matching of threads and cores comprises using the Gale-Shapley algorithm.

7. A heterogeneous multi-core thread scheduling system, characterized in that it comprises an information collection module, a T-ranker, a C-ranker, a matcher and a thread scheduler, wherein:

the information collection module is used to collect the various kinds of dynamic information of each running thread, and outputs the feature vector of a sampled program segment of each thread;

the T-ranker is used to receive the feature vector of the thread running on its core and, based on it, to produce a selection priority ranking of the cores for that thread;

the C-ranker is used to rank the threads for each core;

the matcher is used to receive the ranked lists of the threads and of the cores, and to obtain the stable matching result of threads and cores;

the thread scheduler receives the matching result, schedules through the operating system, and assigns each thread to the corresponding core to run.

8. A heterogeneous multi-core processor employing the method of any one of claims 1 to 6.

9. A heterogeneous multi-core processor comprising the subject matter of claim 8.