CN103294550B

CN103294550B - Heterogeneous multi-core thread scheduling method, system, and heterogeneous multi-core processor

Info

Publication number: CN103294550B
Application number: CN201310206533.0A
Authority: CN
Inventors: 王磊; 陈云霁; 陈天石; 陆超; 李梦竹
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2013-05-29
Filing date: 2013-05-29
Publication date: 2016-08-10
Anticipated expiration: 2033-05-29
Also published as: CN103294550A

Abstract

The present invention relates to a heterogeneous multi-core thread scheduling method that generates ranked lists for threads and cores according to the dynamic characteristics of the program, finds the optimal stable matching between threads and cores from those lists, and schedules threads according to that matching. The method includes: receiving the feature vector of the thread running on a core and, for that thread, producing a preference ranking of the cores; ranking the threads for each core; receiving the ranked lists of the threads and cores and finding a stable matching between them; and receiving the matching result and having the operating system assign each thread to its corresponding core. The method avoids the large overhead of sampling-based scheduling. It takes more of the complex factors that affect performance and power into account, and because it only needs to predict relative order rather than exact values, it reduces model complexity while improving scheduling accuracy.

Description

Heterogeneous multi-core thread scheduling method, system, and heterogeneous multi-core processor

Technical field

The present invention relates to the field of thread scheduling policies for single-ISA heterogeneous multi-core processors in which the number of threads equals the number of cores, and in particular to a thread scheduling method in which threads and cores rank each other by preference and the Gale-Shapley algorithm then completes the scheduling.

Background art

With the development of integrated-circuit technology, more and more cores are integrated into a single system on chip (SoC), and chip multiprocessors (CMP) have become a mainstream processor architecture. By integrating multiple identical general-purpose cores on one chip, a chip multiprocessor provides better performance for programs running in parallel, but it is also constrained by power consumption, heat dissipation, and chip area. To make more effective use of the limited on-chip power and area, industry and academia have proposed heterogeneous multi-core processor architectures.

Heterogeneous multi-core processors take many forms; the present invention is mainly concerned with single-ISA heterogeneous multi-core processors. In such a processor, cores of different types share the same instruction set. The differences between cores may come from parameters such as frequency, cache size, and power budget, or from fundamental differences in microarchitectural design (for example, out-of-order versus in-order execution, or instruction issue width). In addition, the present invention mainly addresses the case in which each core of the heterogeneous multi-core processor runs one single-threaded program, so the number of threads always equals the number of cores in the system, and a thread can be treated as equivalent to a program.

Different programs generally have different program characteristics. Moreover, even for the same program, its characteristics can change significantly with the input set and the execution phase.

In a heterogeneous multi-core processor, assigning each thread to the core best suited to it according to its program characteristics is called thread scheduling. The goal of thread scheduling is to give each thread better performance on a suitable core while avoiding wasted power as far as possible, so that the limited on-chip power and area are used more efficiently.

Scheduling methods are either static or dynamic. A static scheduling method extracts, offline, program features that are independent of the concrete execution environment, uses them to estimate the performance of each thread on each type of core, and schedules each thread to a core according to the prediction. Static scheduling exploits only the differences between programs and ignores the fact that a single program has different characteristics in different execution phases, so it is inherently limited.

Sampling-based dynamic scheduling divides scheduling into two phases: a sampling phase and a steady execution phase. When a trigger event indicates that the dynamic characteristics of the program have changed significantly, the sampling phase begins. In the sampling phase, each thread is scheduled to each type of core for a trial run, so every scheduling scheme must be traversed and its performance recorded. The best-performing scheme is then selected and a long steady execution phase begins, lasting until the next trigger event. Sampling-based dynamic scheduling can make full use of the dynamic characteristics of the program. However, the sampling phase incurs a large thread migration cost, and traversing the schemes forces programs to run under various non-ideal schedules, which also costs considerable performance. Furthermore, the sampling overhead grows rapidly with the number of core types in the system, so this kind of method scales poorly and is impractical.

To avoid the overhead of sampling, a class of heuristic scheduling methods has been proposed. These methods use hardware monitors to capture key information about the running program, such as IPC, cache miss rate, and stall time, estimate from this dynamic information and empirical rules the performance of each thread on each type of core, and then use a greedy algorithm to pick a suitable core for each thread according to the expected gain.

Representative technical solutions of this kind are briefly introduced below:

In a heterogeneous multi-core processor made up of cores with different frequencies, threads are sorted from high to low by their IPC in the previous execution phase, cores are sorted by frequency, and threads and cores are then matched by their relative positions in the two orderings. A similar approach collects information such as each thread's cache miss rate and divides threads into two classes, compute-intensive and memory-intensive; compute-intensive threads are scheduled to big cores (for example, high frequency, large cache, out-of-order execution), and memory-intensive threads are scheduled to little cores (for example, low frequency, small cache, in-order execution). The rationale is that compute-intensive threads with higher instruction-level parallelism (ILP) perform better on big cores, while memory-intensive threads can run on little cores to save power. A further refinement combines the collected cache miss rate, stall time, and similar information with the structural parameters of each core to estimate the performance of each thread on the different cores, and then uses a greedy algorithm to schedule threads to cores according to the performance gain.

Such methods mostly use only a few important program characteristics (such as cache miss rate and IPC), combined with the structural parameters of the cores (such as frequency and cache size) and some domain knowledge or empirical rules, to estimate program performance. In practice, program performance depends on a large number of complex factors, so the prediction is often not accurate enough and these methods perform poorly.

In addition, most existing scheduling methods rely on a formula-based model that considers only a few factors to predict the performance of each thread on each type of core. Because the actual performance of a program depends on many complex factors, the accuracy of such predictions is limited. Moreover, even if an accurate prediction model exists, it is usually very complex to implement and does not necessarily lead to better scheduling. For example, suppose the actual performance of a thread on two different types of core is (5, 4.8), model A predicts (4.9, 5.1), and model B predicts (10, 1). The prediction of model A is clearly more accurate, yet the schedule derived from model B is more reliable, because model B orders the two cores correctly and model A does not. This example shows that what is actually needed is not the exact performance of a thread on each core but a relative relationship, namely the predicted performance ranking of the thread across the cores.
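The model A versus model B example can be checked numerically. The following is a minimal sketch using the three value pairs from the text; the helper names `mean_abs_error` and `ranking` are illustrative assumptions, not part of the patent.

```python
def mean_abs_error(predicted, actual):
    """Average absolute prediction error over the cores."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def ranking(scores):
    """Core indices ordered from best to worst score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


actual = (5.0, 4.8)    # true performance on the two core types (from the text)
model_a = (4.9, 5.1)   # close in value, but orders the cores wrongly
model_b = (10.0, 1.0)  # far off in value, but orders the cores correctly

for name, pred in (("A", model_a), ("B", model_b)):
    print(name,
          "error:", round(mean_abs_error(pred, actual), 2),
          "ranking correct:", ranking(pred) == ranking(actual))
```

Model A has the smaller error but the wrong ranking, while model B has the larger error and the right ranking, which is why a scheduler that only needs an ordering can use a simpler model.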

Furthermore, most existing scheduling methods take the thread's point of view, treating the thread as the decision-maker and scheduling greedily toward a single optimization objective.

In short, previously proposed scheduling methods all set an optimization objective from the program's perspective and let the program, as decision-maker, select a suitable core. The problem with this one-way selection is that, during scheduling, a core has no right to decide for itself, based on its own structural characteristics and power budget, whether to accept a thread. For example, when a compute-intensive thread selects a core, it means that this core can provide the best performance for that thread; but from the core's perspective, if accepting the thread might push its power consumption beyond its power budget, the schedule is not ideal.

Summary of the invention

To solve the above technical problems, the object of the present invention is to propose a thread scheduling method and scheduling system for heterogeneous multi-core processors based on the Gale-Shapley algorithm. For the thread scheduling problem in heterogeneous multi-core processors, the present invention schedules dynamically according to changes in program characteristics, avoids the large overhead of sampling-based scheduling, and avoids the poor schedules that result when heuristic methods cannot predict performance accurately. It also treats both threads and cores as decision-making participants, so that the needs of both are taken into account during scheduling.

Specifically, the present invention discloses a heterogeneous multi-core thread scheduling method, comprising: generating ranked lists for the threads and the cores according to the dynamic characteristics of the program, finding the optimal stable matching between threads and cores from the ranked lists, and scheduling threads according to that stable matching.

Generating the ranked lists for the threads and cores includes generating a ranking model, which comprises the following steps:

(1) select an ideal database;

(2) extract sample program segments from the database;

(3) run the sample program segments on the simulator of each core, obtain the respective responses, and divide the sample program segments and their responses into a training set and a test set;

(4) select a suitable learning algorithm to train the ranking model;

(5) end the training phase when the test error of the ranking model meets the requirement.

Each sample program segment has a feature vector. For a thread, the input is the feature vector of one sample program segment and the output is a ranked list of the cores. For a core, the input is the feature vectors of the sample program segments of all threads and the output is that core's ranked list of the threads.

The heterogeneous multi-core thread scheduling method specifically comprises the following steps:

collecting the various kinds of dynamic information produced while each thread runs, and outputting the feature vector of a sample program segment of the thread;

receiving the feature vector of the thread running on a core and, according to it, producing for that thread a preference ranking of the cores;

ranking the threads for each core;

receiving the ranked lists of the threads and the cores and finding a stable matching between threads and cores;

receiving the matching result and, through operating-system scheduling, assigning each thread to its corresponding core to run.

In the heterogeneous multi-core thread scheduling method, finding the stable matching between threads and cores comprises the following steps:

(1) each thread proposes a match to cores in descending order of its preference ranking; if the core has no match yet, it accepts the request and a matched pair is formed;

(2) if the core already has a match, it compares the priority of the new thread with that of its current match; if the new thread has higher priority than the previously accepted thread, the core accepts the new thread as its match, otherwise it rejects the new request;

(3) a rejected thread proposes to the next core on its ranked list, until every thread and every core has found a match.

Finding the stable matching between threads and cores includes using the Gale-Shapley algorithm.
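The three proposal steps above correspond to the thread-proposing form of the Gale-Shapley algorithm. The following is a minimal sketch, assuming complete preference lists and equal numbers of threads and cores; the function name `stable_match` and the two-core example data are hypothetical.

```python
from collections import deque


def stable_match(thread_prefs, core_prefs):
    """Thread-proposing Gale-Shapley. Returns {thread: core}."""
    # rank[c][t] = position of thread t on core c's list (lower is better)
    rank = {c: {t: i for i, t in enumerate(prefs)}
            for c, prefs in core_prefs.items()}
    next_choice = {t: 0 for t in thread_prefs}  # next core each thread will try
    holder = {}                                 # core -> thread it currently accepts
    free = deque(thread_prefs)                  # threads without a match
    while free:
        t = free.popleft()
        c = thread_prefs[t][next_choice[t]]
        next_choice[t] += 1
        current = holder.get(c)
        if current is None:                     # step (1): core is unmatched
            holder[c] = t
        elif rank[c][t] < rank[c][current]:     # step (2): core prefers the new thread
            holder[c] = t
            free.append(current)                # displaced thread proposes again
        else:                                   # step (2): core rejects the request
            free.append(t)                      # step (3): try the next core later
    return {t: c for c, t in holder.items()}


# Hypothetical two-core example: both threads want the big core.
example = stable_match(
    {"T0": ["big", "little"], "T1": ["big", "little"]},
    {"big": ["T1", "T0"], "little": ["T0", "T1"]},
)
print(example)
```

Here the big core prefers T1, so T0 is displaced and ends up on the little core.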

The present invention also discloses a heterogeneous multi-core thread scheduling system, characterized by comprising an information acquisition module, a thread ranker (T-ranker), a core ranker (C-ranker), a matchmaker, and a thread scheduler, wherein:

the information acquisition module collects the various kinds of dynamic information produced while each thread runs and outputs the feature vector of a sample program segment of each thread;

the T-ranker receives the feature vector of the thread running on its core and, according to it, produces for that thread a preference ranking of the cores;

the C-ranker ranks the threads for each core;

the matchmaker receives the ranked lists of the threads and the cores and obtains the stable matching between threads and cores;

the thread scheduler receives the matching result and, through operating-system scheduling, assigns each thread to its corresponding core to run.

The present invention also discloses a heterogeneous multi-core processor that uses any of the heterogeneous multi-core thread scheduling methods described above.

The present invention also discloses a heterogeneous multi-core processor that includes the heterogeneous multi-core thread scheduling system described above.

The beneficial effects of the present invention are as follows. While still exploiting the dynamic characteristics of the program, it avoids the large overhead of sampling-based scheduling. By replacing empirical formulas with a nonlinear learned ranking model to predict performance and power, it can take more of the complex factors that affect performance and power into account; because only relative order, not exact values, must be predicted, model complexity falls while scheduling accuracy improves. During thread scheduling, threads and cores are both treated as independent decision-makers in a game, so that the performance needs of programs and the power budgets of cores are both respected. The Gale-Shapley algorithm is used to find a Pareto-optimal stable matching, and threads are scheduled according to it.

Brief description of the drawings

Fig. 1 shows the offline training framework for the ranking model of the present invention.

Fig. 2 shows the structure of an embodiment of the present invention for a four-core heterogeneous multi-core processor.

Detailed description of the invention

The present invention borrows from game theory: threads and cores are both treated as selfish decision-making participants, each seeking from its own point of view to maximize its performance or power-efficiency gain. As a result, the scheduling method takes the optimization objectives of both threads and cores into account and arrives at a better overall scheduling decision.

The present invention requires, for each thread, a preference ranking of the cores and, from each core's point of view, a priority ranking of the threads it would accept.

To obtain these rankings, a learning-to-rank technique is used to train a ranking model (ranker). Fig. 1 shows one specific way of obtaining the ranking model in the present invention:

Application database: an idealized, infinitely large database containing all programs;

Sample application phase: sample program segments extracted from a number of example programs; their program characteristics should be representative of most common programs, and the feature vector of a program can be extracted with common program analysis tools such as mika;

Simulator: for a given heterogeneous multi-core processor, the number of cores and the type of each core are fixed in advance; the sample program segments are run on the simulator of each core to obtain the corresponding responses (each core response), and the sample program segments and their responses are divided into a training set and a test set;

Learning algorithm: training the ranking model (ranker) is a supervised learning process; a suitable learning algorithm, such as RankBoost, is chosen according to the circumstances to train the ranking model;

Ranker model: the training phase ends when the test error of the ranking model meets the requirement.

For a thread, the input of the T-ranker is the feature vector of one program segment and the output is a ranked list of the cores. For the T-ranker, the set of cores to be ranked is fixed and the only input variable is the feature vector of each program segment, so a single T-ranker can be trained and shared by all cores. For a given core, the input of the C-ranker is the feature vectors of the program segments of all threads and the output is that core's ranked list of the threads. For the C-ranker, the threads to be ranked keep changing and each core has its own structural configuration, so the ranking of even the same set of threads also differs from core to core; a separate C-ranker must therefore be trained for each core.
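The difference between the two ranker interfaces can be sketched as follows. This is only an illustration of inputs and outputs: the linear scoring function, the weight values, and the two-element feature vector (ILP, cache miss rate) are assumptions standing in for a trained ranking model such as one produced by RankBoost.

```python
def dot(weights, features):
    return sum(w * x for w, x in zip(weights, features))


class TRanker:
    """One shared model: ranks all cores for a single thread's feature vector."""

    def __init__(self, weights_per_core):
        # {core: weight vector}; placeholder for a trained ranking model
        self.weights_per_core = weights_per_core

    def rank_cores(self, features):
        score = {c: dot(w, features) for c, w in self.weights_per_core.items()}
        return sorted(score, key=score.get, reverse=True)


class CRanker:
    """One model per core: ranks all threads from that core's point of view."""

    def __init__(self, weights):
        self.weights = weights  # placeholder for this core's trained model

    def rank_threads(self, features_by_thread):
        score = {t: dot(self.weights, x) for t, x in features_by_thread.items()}
        return sorted(score, key=score.get, reverse=True)


# Hypothetical feature vectors: [ILP, cache miss rate]
features_by_thread = {"T0": [2.0, 0.1], "T1": [0.5, 3.0]}

t_ranker = TRanker({"big": [1.0, -0.5], "little": [0.4, -0.1]})
c_ranker_big = CRanker([1.0, -1.0])

print(t_ranker.rank_cores(features_by_thread["T0"]))
print(t_ranker.rank_cores(features_by_thread["T1"]))
print(c_ranker_big.rank_threads(features_by_thread))
```

The T-ranker is called once per thread with one vector, whereas each C-ranker needs the vectors of all threads at once, which is the reason the text gives for training the C-rankers per core.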

After training, the ranking model can be implemented in hardware and integrated into each core, or provided as part of the scheduling system of the present invention.

After the ranking models have produced the ranked lists for the threads and the cores, a stable matching is found with the Gale-Shapley algorithm, and threads are scheduled according to the matching result. This reaches a Pareto-optimal state in which all threads and cores are relatively satisfied, thereby achieving an approximately globally optimal thread schedule.

Suppose sets A and B each contain N elements, and each element has its own preference list containing all elements of the other set. The Gale-Shapley algorithm can then always find a stable matching between the two sets, so that each element is paired with the best partner it is able to obtain. A matching is unstable if there exist an element a in set A and an element b in set B that each rank the other above their current partner, so that a and b would rather abandon their current partners and match with each other. A matching with no such pair is a stable matching. Sets A and B may admit multiple stable matchings. It has been proven that the matching found by the Gale-Shapley algorithm is always Pareto-optimal and is the best of all stable matchings for the side that makes the proposals.
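The definition of an unstable matching can be turned directly into a check. The following minimal sketch lists every pair (a, b) that would rather be matched with each other than with their current partners; the function name `blocking_pairs` and the two-element example sets are hypothetical.

```python
def blocking_pairs(matching, prefs_a, prefs_b):
    """Return all (a, b) pairs that make `matching` ({a: b}) unstable."""
    partner_of_b = {b: a for a, b in matching.items()}
    pairs = []
    for a, prefs in prefs_a.items():
        for b in prefs:
            if b == matching[a]:
                break  # every later b is worse for a than its current partner
            # a prefers b to its partner; does b also prefer a to its partner?
            if prefs_b[b].index(a) < prefs_b[b].index(partner_of_b[b]):
                pairs.append((a, b))
    return pairs


# Hypothetical preference lists for sets A = {a1, a2} and B = {b1, b2}
prefs_a = {"a1": ["b1", "b2"], "a2": ["b1", "b2"]}
prefs_b = {"b1": ["a1", "a2"], "b2": ["a1", "a2"]}

print(blocking_pairs({"a1": "b2", "a2": "b1"}, prefs_a, prefs_b))  # unstable
print(blocking_pairs({"a1": "b1", "a2": "b2"}, prefs_a, prefs_b))  # stable
```

A matching is stable exactly when this function returns an empty list.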

To illustrate the thread scheduling method for heterogeneous multi-core processors based on the Gale-Shapley algorithm provided by the present invention, a four-core heterogeneous multi-core processor is used as an example. Clearly, the present invention also extends to heterogeneous multi-core processors that integrate more cores, and it places no restriction on the types of the cores.

As shown in Fig. 2, the four-core heterogeneous multi-core processor includes, in addition to the four cores, the following components: an information acquisition module (Monitor) on each core, a T-ranker on each core, one C-ranker, one Matchmaker, and one thread scheduler (Scheduler).

Monitor: collects the various kinds of dynamic information produced while a thread runs, including but not limited to cache miss rate, stall time, number of integer instructions, and number of floating-point instructions, and outputs the feature vector of a program segment of the thread;

T-ranker: receives the feature vector of the thread running on its core and, according to it, produces for that thread a preference ranking of the cores, usually with performance as the ranking criterion;

C-ranker: the C-ranker actually integrates four ranking models, one per core, each of which ranks the four threads for its core; the ranking criterion can be set to descending performance-to-power ratio subject to the power budget. Because each individual ranker needs the feature vectors of all four threads, placing the four rankers together reduces communication overhead: the information only has to be received once from the Monitors of the four cores;
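The ranking criterion suggested for the C-ranker can be sketched as a sort key. In the patent the ordering comes from a trained ranking model; the sketch below only illustrates the criterion itself, and the per-thread (performance, power) estimates, the budget value, and the choice to place over-budget threads last are all assumptions.

```python
def core_acceptance_order(estimates, power_budget):
    """Order threads for one core.

    estimates: {thread: (performance, power)} as estimated for this core.
    Threads within the power budget come first, sorted by descending
    performance-to-power ratio; over-budget threads follow (an assumption).
    """
    def key(thread):
        performance, power = estimates[thread]
        over_budget = power > power_budget
        return (over_budget, -performance / power)

    return sorted(estimates, key=key)


# Hypothetical estimates for one core with a power budget of 5.0
estimates = {
    "T0": (2.0, 4.0),  # ratio 0.5
    "T1": (1.5, 2.0),  # ratio 0.75
    "T2": (3.0, 8.0),  # fastest, but exceeds the budget
    "T3": (0.8, 1.0),  # ratio 0.8
}
order = core_acceptance_order(estimates, power_budget=5.0)
print(order)
```

T2 would be the best thread by raw performance, but the core ranks it last because accepting it would break the power budget, which is the kind of decision a one-way, thread-driven scheduler cannot express.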

Matchmaker: receives the ranked lists of the threads and cores and finds the stable matching with the Gale-Shapley algorithm;

Scheduler: receives the matching result from the Matchmaker and, through operating-system scheduling, assigns each thread to its corresponding core to run.
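The data flow between the components of Fig. 2 can be summarized in one function. This is a rough sketch in which every component is passed in as a callable; the toy rankers and the `first_choice_matchmaker` placeholder are hypothetical stand-ins, and a real Matchmaker would run the Gale-Shapley algorithm instead.

```python
def scheduling_round(features_by_core, running_on, t_ranker, c_rankers, matchmaker):
    """One pass Monitor -> T-ranker / C-ranker -> Matchmaker; returns {thread: core}."""
    # Monitor output: feature vector of the thread currently running on each core
    features = {running_on[core]: vec for core, vec in features_by_core.items()}
    # T-ranker: each thread ranks the cores
    thread_prefs = {thread: t_ranker(vec) for thread, vec in features.items()}
    # C-ranker: one model per core ranks all threads
    core_prefs = {core: rank(features) for core, rank in c_rankers.items()}
    # Matchmaker: the Scheduler would then migrate threads according to the result
    return matchmaker(thread_prefs, core_prefs)


def toy_t_ranker(vec):
    # Hypothetical rule: prefer the big core when ILP (vec[0]) is high
    return ["big", "little"] if vec[0] > 1.0 else ["little", "big"]


def toy_c_ranker(features):
    # Hypothetical rule: rank threads by ILP, highest first
    return sorted(features, key=lambda t: features[t][0], reverse=True)


def first_choice_matchmaker(thread_prefs, core_prefs):
    # Placeholder only: valid when first choices do not collide
    return {t: prefs[0] for t, prefs in thread_prefs.items()}


plan = scheduling_round(
    features_by_core={"big": [0.4], "little": [2.5]},
    running_on={"big": "T0", "little": "T1"},
    t_ranker=toy_t_ranker,
    c_rankers={"big": toy_c_ranker, "little": toy_c_ranker},
    matchmaker=first_choice_matchmaker,
)
print(plan)
```

In this toy run the two threads start on the wrong cores and the round swaps them.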

To make the purpose, technical solution, and advantages of the present invention clearer, the thread scheduling method for heterogeneous multi-core processors based on the Gale-Shapley algorithm is described in further detail below with reference to the drawings and an embodiment. It should be understood that the specific embodiment described here only explains the present invention and does not limit it.

The thread scheduling method of this embodiment generates ranked lists for the threads and the cores according to the dynamic characteristics of the program, then uses the Gale-Shapley algorithm to find an optimal stable matching from the rankings and schedules the threads accordingly. This embodiment assumes that the system contains only four heterogeneous cores, core0, core1, core2, and core3, each with a different structural configuration. Clearly, the present invention also extends to heterogeneous processors with more cores; the implementation does not differ greatly from the four-core example here and is not described separately, but all such cases should be regarded as falling within the scope of the present invention.

Taking the ranking-model training framework of Fig. 1 as an example, the offline training of the ranker model is described in detail below.

First, representative benchmark programs such as SPEC2006 are used as the program library and are cut into a series of program segments according to a fixed rule, for example treating every ten million instructions as one program segment. A program analysis tool such as mika extracts the feature vector of each program segment, which may include ILP, number of integer instructions, number of floating-point instructions, cache miss rate, and other information. A number of program segments are picked at random from the library and simulated on the simulators of the four cores core0, core1, core2, and core3, yielding performance information such as IPC and power information such as the performance-to-power ratio. The randomly chosen program segments and their simulation results are randomly divided into a training set and a test set, and a chosen learning-to-rank algorithm such as RankBoost is used to train the ranking model; training the ranking model is a supervised learning process.
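The data-preparation part of this training procedure can be sketched as follows. The ten-million-instruction segment length comes from the text; the helper names, the IPC values, the 20% test fraction, and the fixed random seed are assumptions, and the actual ranking-model training (for example with RankBoost) is not shown.

```python
import random

SEGMENT_LEN = 10_000_000  # instructions per program segment, as in the text


def segment_bounds(total_instructions, segment_len=SEGMENT_LEN):
    """(start, end) instruction ranges of all complete segments."""
    last_start = total_instructions - segment_len
    return [(s, s + segment_len) for s in range(0, last_start + 1, segment_len)]


def label_core_ranking(ipc_by_core):
    """Supervised label for one segment: cores ordered by simulated IPC."""
    return sorted(ipc_by_core, key=ipc_by_core.get, reverse=True)


def split(samples, test_fraction=0.2, seed=0):
    """Randomly divide samples into (training set, test set)."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


bounds = segment_bounds(35_000_000)
label = label_core_ranking({"core0": 1.2, "core1": 0.7, "core2": 1.9, "core3": 0.4})
train, test = split(list(range(10)))
print(len(bounds), label, len(train), len(test))
```

Each training sample would pair a segment's feature vector with a label like `label`; the held-out test set is what the stopping criterion on test error refers to.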

For the T-ranker, the input is the feature vector of one program segment and the output is the performance ranking of that program segment across the four cores, that is, the thread's preference ranking of the cores. As noted above, only one T-ranker model needs to be trained, and it can be used on all four cores. For the C-ranker of a given core, the input is the feature vectors of four different program segments run on that core and the output is the ranking of the four by performance-to-power ratio, that is, the core's acceptance ranking of the threads; an independent C-ranker model must be trained for each core. When the model's error on the test set is low enough to be acceptable, the training phase ends.

After the ranking models have been trained, they are implemented in hardware on the heterogeneous multi-core processor and used for thread scheduling.

Taking the heterogeneous multi-core scheduling architecture of Fig. 2 as an example, the implementation of the Gale-Shapley-based thread scheduling method for heterogeneous multi-core processors is described in detail below.

Suppose four threads T0, T1, T2, and T3 run on this heterogeneous multi-core processor. At initialization, because no prior information about the threads is available, they are scheduled to the four cores at random, giving for example the following matching: (T0, core0), (T1, core1), (T2, core2), (T3, core3).

After running for a while, each Monitor collects the dynamic program characteristics on its core and sends them to the corresponding T-ranker and to the C-ranker, which produce the following rankings:

Table 1: Each thread's ranking of the cores

Table 2: Each core's ranking of the threads

Once these ranked lists are available, they are sent to the Matchmaker, which finds an optimal stable matching with the Gale-Shapley algorithm:

First, T0 proposes to Core2 according to its preference ranking; Core2 has no match yet, so it accepts T0's request and the pair (T0, Core2) is formed;

Next, T1 proposes to Core1 according to its preference ranking; Core1 has no match yet, so it accepts T1's request and the pair (T1, Core1) is formed;

Then T2 proposes to Core2 according to its preference ranking; Core2 is already matched with T0, so it checks its acceptance ranking, finds that T2 has higher priority than T0, accepts T2's request, and forms the new pair (T2, Core2);

Because Core2 is now matched with T2, T0 has lost its match and proposes to the next core on its list, Core1; T0 ranks above T1 on Core1's list, so Core1 accepts, forming the new pair (T0, Core1).

The process continues in the same way: each thread proposes to cores in descending order of its preference ranking; if the core has no match, it accepts the request and a pair is formed; if the core already has a match, it compares the priority of the new thread with that of its current match, accepting the new thread if its priority is higher than that of the previously accepted thread and rejecting the new request otherwise; a rejected thread proposes to the next core on its ranked list. When every thread and every core has found a match, the Gale-Shapley matching process ends. It has been proven that this matching is necessarily stable and is the best of all stable matchings for the proposing threads. The matching obtained in this way is also Pareto-optimal, because no thread (or core) can improve its own payoff without reducing the payoff of another thread (or core).

The final stable matching is: (T0, Core1), (T1, Core3), (T2, Core2), (T3, Core0). The Scheduler assigns each thread to its corresponding core according to the matching result. The Monitors continue to collect new program characteristics in preparation for the next round of scheduling.
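The walk-through can be replayed in code. The contents of Table 1 and Table 2 are not reproduced in this text, so the preference lists below are hypothetical values chosen only to be consistent with the steps and the final matching described above; `traced_match` is likewise an illustrative name.

```python
def traced_match(thread_prefs, core_prefs):
    """Thread-proposing Gale-Shapley that also records each decision."""
    holder = {}                                  # core -> accepted thread
    next_choice = {t: 0 for t in thread_prefs}
    free = list(thread_prefs)
    log = []
    while free:
        t = free.pop(0)
        c = thread_prefs[t][next_choice[t]]
        next_choice[t] += 1
        current = holder.get(c)
        if current is None or core_prefs[c].index(t) < core_prefs[c].index(current):
            holder[c] = t
            log.append(f"{c} accepts {t}")
            if current is not None:
                free.insert(0, current)          # displaced thread proposes next
        else:
            log.append(f"{c} rejects {t}")
            free.insert(0, t)
    return {t: c for c, t in holder.items()}, log


# Hypothetical stand-ins for Table 1 (threads rank cores) and Table 2 (cores rank threads)
thread_prefs = {
    "T0": ["Core2", "Core1", "Core0", "Core3"],
    "T1": ["Core1", "Core3", "Core0", "Core2"],
    "T2": ["Core2", "Core0", "Core1", "Core3"],
    "T3": ["Core0", "Core2", "Core3", "Core1"],
}
core_prefs = {
    "Core0": ["T3", "T2", "T1", "T0"],
    "Core1": ["T0", "T1", "T2", "T3"],
    "Core2": ["T2", "T0", "T3", "T1"],
    "Core3": ["T1", "T3", "T0", "T2"],
}

matching, log = traced_match(thread_prefs, core_prefs)
for line in log:
    print(line)
print(matching)
```

With these lists the first four log entries follow the four steps of the walk-through, and the remaining proposals (T1 to Core3, T3 to Core0) complete the matching given in the text.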

The above is one embodiment of the present invention; many other cases follow by analogy and are not enumerated here. In particular, the present invention merely uses the RankBoost ranking algorithm to obtain the rankings and combines it with the Gale-Shapley algorithm to obtain a stable matching for thread scheduling; other ranking algorithms, such as AdaRank and Rank SVM, can also be combined with the Gale-Shapley algorithm to the same effect.

Clearly, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. Such changes and modifications fall within the scope of protection of the present invention.

Claims

1. A heterogeneous multi-core thread scheduling method, characterized by comprising: generating ranked lists for the threads and the cores according to the dynamic characteristics of the program, finding the optimal stable matching between threads and cores from the ranked lists, and scheduling threads according to that stable matching, wherein finding the optimal stable matching specifically comprises: collecting the various kinds of dynamic information produced while each thread runs, and outputting the feature vector of a sample program segment of the thread;

ranking the threads for each core;

2. The heterogeneous multi-core thread scheduling method of claim 1, characterized in that generating the ranked lists for the threads and cores includes generating a ranking model, which comprises the following steps:

(1) select an ideal database;

(2) extract sample program segments from the database;

(4) select a suitable learning algorithm to train the ranking model;

3. The heterogeneous multi-core thread scheduling method of claim 2, characterized in that each sample program segment has a feature vector; for a thread, the input is the feature vector of one sample program segment and the output is a ranked list of the cores; for a core, the input is the feature vectors of the sample program segments of all threads and the output is that core's ranked list of the threads.

4. The heterogeneous multi-core thread scheduling method of claim 1, characterized in that finding the stable matching between threads and cores comprises the following steps:

5. The heterogeneous multi-core thread scheduling method of claim 1, characterized in that finding the stable matching between threads and cores includes using the Gale-Shapley algorithm.

6. A heterogeneous multi-core thread scheduling system, characterized by comprising an information acquisition module, a thread ranker (T-ranker), a core ranker (C-ranker), a matchmaker, and a thread scheduler, wherein:

the T-ranker receives the feature vector of the thread running on a core and, according to it, produces for that thread a preference ranking of the cores;

the C-ranker ranks the threads for each core;

7. A heterogeneous multi-core processor using the method of any one of claims 1-5.