WO2014183530A1 - Task allocation method, task allocation device, and network-on-chip - Google Patents

Task allocation method, task allocation device, and network-on-chip

Info

Publication number
WO2014183530A1
WO2014183530A1 · PCT/CN2014/075655 · CN2014075655W
Authority
WO
WIPO (PCT)
Prior art keywords
chip
idle
rectangular area
threads
processor cores
Prior art date
Application number
PCT/CN2014/075655
Other languages
English (en)
French (fr)
Inventor
路航
韩银和
付斌章
李晓维
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Priority to EP14797851.4A priority Critical patent/EP2988215B1/en
Priority to KR1020157035119A priority patent/KR101729596B1/ko
Priority to JP2016513212A priority patent/JP6094005B2/ja
Publication of WO2014183530A1 publication Critical patent/WO2014183530A1/zh
Priority to US14/940,577 priority patent/US9965335B2/en
Priority to US15/943,370 priority patent/US10671447B2/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083Techniques for rebalancing the load in a distributed system
    • G06F9/5088Techniques for rebalancing the load in a distributed system involving task migration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/7825Globally asynchronous, locally synchronous, e.g. network on chip
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • G06F9/5066Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs

Definitions

  • Embodiments of the present invention relate to on-chip multi-core network technologies, and in particular, to a task allocation method, a task distribution device, and an on-chip network.
  • NoC Network-on-Chip
  • in the NoC, the communication between processor cores running different threads of the same task is affected by the data flows of other tasks, so the Quality of Service (QoS) cannot be guaranteed.
  • QoS Quality of Service
  • the method of subnetting is usually adopted, that is, the data flow belonging to the same task is limited to a specific area of the NoC.
  • FIG. 1 is a schematic diagram of a task allocation method based on a routing algorithm in the prior art. As shown in Figure 1, if other on-chip routers in task five need to communicate with Dest, they need to go through the same link, which may cause link congestion and affect network throughput.
  • the embodiments of the present invention provide a task allocation method, a task distribution apparatus, and an on-chip network, which are used to solve the problems of high hardware overhead and low network throughput of the task allocation method based on the routing algorithm in the prior art.
  • an embodiment of the present invention provides a task allocation method, including:
  • the threads of the to-be-processed task are allocated to the idle processor cores, where each idle processor core is allocated one thread.
  • the rectangular area expanded from the non-rectangular area is the smallest rectangular area in the on-chip network that contains the non-rectangular area.
  • after the consecutive idle processor cores matching the number of threads are determined in the on-chip network formed by the multi-core processor, the method further includes:
  • if the area formed by the on-chip routers of the determined idle processor cores is a rectangular area, allocating the threads of the to-be-processed task to the idle processor cores, where each processor core is allocated one thread.
  • the on-chip network includes multiple processor cores arranged in rows and columns;
  • correspondingly, determining, in the on-chip network formed by the multi-core processor, consecutive idle processor cores matching the number of threads includes: determining an initial idle processor core in the on-chip network; and, starting from the initial idle processor core, determining consecutive idle processor cores whose number matches the number of threads.
  • the searching for and determining the rectangular area expanded from the non-rectangular area includes: sequentially determining, along the adjacent on-chip routers in the same row as the on-chip router connected to the initial idle processor core, whether there are consecutive idle processor cores matching the number of threads;
  • if the number of processor cores in the consecutive first free area determined along that row does not match the number of threads, sequentially determining a consecutive second free area along the adjacent on-chip routers in the same column as the on-chip router connected to the initial idle processor core, so that the sum of the number of processor cores in the first free area and in the second free area equals the number of threads.
  • alternatively, the searching for and determining the rectangular area expanded from the non-rectangular area includes: sequentially determining, along the adjacent on-chip routers in the same column as the on-chip router connected to the initial idle processor core, whether there are consecutive idle processor cores matching the number of threads;
  • if the number of processor cores in the consecutive third free area determined along that column does not match the number of threads, sequentially determining a consecutive fourth free area along the adjacent on-chip routers in the same row as the on-chip router connected to the initial idle processor core, so that the sum of the number of processor cores in the third free area and in the fourth free area equals the number of threads.
  • an embodiment of the present invention provides a task distribution apparatus, including:
  • a first determining module configured to determine a number of threads included in the task to be processed
  • a second determining module configured to determine, in an on-chip network formed by the multi-core processor, a plurality of idle processor cores that are equal in number to the number of threads, wherein each of the idle processor cores is connected to an on-chip router;
  • a third determining module configured to: when the second determining module determines that the area formed by the on-chip routers connected to the idle processor cores is a non-rectangular area, search for and determine, in the on-chip network, a rectangular area expanded from the non-rectangular area;
  • an allocation module configured to: if the predicted traffic of each on-chip router connected to a non-idle processor core in the rectangular area determined by the third determining module does not exceed a preset threshold, allocate the threads of the to-be-processed task to the idle processor cores, where each idle processor core is allocated one thread.
  • the third determining module is specifically configured to:
  • determine that the rectangular area expanded from the non-rectangular area is the smallest rectangular area in the on-chip network that contains the non-rectangular area.
  • the allocating module is further configured to:
  • if the on-chip routers connected to the plurality of idle processor cores determined by the second determining module form a rectangular area, allocate the threads of the to-be-processed task to the idle processor cores, where each processor core is allocated one thread.
  • the second determining module is specifically configured to:
  • determine an initial idle processor core in the on-chip network formed by the multi-core processor, and then, starting from the initial idle processor core, determine consecutive idle processor cores whose number matches the number of threads.
  • the second determining module is specifically configured to: sequentially determine, along the adjacent on-chip routers in the same row as the on-chip router connected to the initial idle processor core, whether there are consecutive idle processor cores that match the number of threads;
  • if the number of processor cores in the consecutive first free area determined along that row does not match the number of threads, sequentially determine a consecutive second free area along the adjacent on-chip routers in the same column as the on-chip router connected to the initial idle processor core, so that the sum of the number of processor cores in the first free area and in the second free area equals the number of threads.
  • the second determining module is specifically configured to:
  • sequentially determine, along the adjacent on-chip routers in the same column as the on-chip router connected to the initial idle processor core, whether there are consecutive idle processor cores that match the number of threads;
  • if the number of processor cores in the consecutive third free area determined along that column does not match the number of threads, sequentially determine a consecutive fourth free area along the adjacent on-chip routers in the same row as the on-chip router connected to the initial idle processor core, so that the sum of the number of processor cores in the third free area and in the fourth free area equals the number of threads.
  • the task allocation device further includes:
  • a prediction module configured to predict, based on the historical traffic information of the on-chip routers connected to the non-idle processor cores in the rectangular area, the traffic of those on-chip routers, so as to obtain the predicted traffic.
  • an embodiment of the present invention further provides an on-chip network, including a plurality of processor cores, an on-chip router, and an interconnect, and the task distribution apparatus of any of the above.
  • In the task allocation method, task allocation device, and on-chip network provided by the embodiments of the present invention, the number of threads included in a to-be-processed task is determined, and a non-rectangular area formed by idle processor cores whose number matches the required number of threads is determined in the on-chip network. With the help of the boundary on-chip routers adjacent to the non-rectangular area, the on-chip routers connected to the idle processor cores in the non-rectangular area are expanded into a regular rectangular area. It is then determined whether the traffic of the on-chip routers connected to the non-idle processor cores in the rectangular area, that is, the boundary on-chip routers, exceeds a preset threshold; if not, the to-be-processed task is allocated to the processor cores of the free area.
  • With this method, when the idle processor-core resources in the on-chip network are equal to or more than the processor cores required by the to-be-processed task but no regular rectangular area is available for the task, the non-rectangular area is expanded into a regular rectangular area with the help of the boundary routers and the task is allocated there. Within the rectangular area, no routing table is needed to determine how a data packet travels from the source on-chip router to the destination on-chip router; instead, packets are delivered by XY routing, which avoids network congestion and increases network throughput.
  • FIG. 1 is a schematic diagram of a task allocation method based on a routing subnet in the prior art
  • FIG. 2 is a flowchart of Embodiment 1 of the task allocation method according to the present invention;
  • FIG. 3 is a schematic diagram of an on-chip network according to Embodiment 2 of the task assignment method of the present invention.
  • FIG. 4A is a schematic diagram of an on-chip network according to Embodiment 3 of the task allocation method of the present invention;
  • FIG. 4B is a schematic diagram of re-searching for a rectangular area in FIG. 4A;
  • FIG. 5A is a schematic diagram comparing the task allocation method of the present invention with the routing-subnet-based task allocation method under a random uniform traffic model;
  • FIG. 5B is a schematic diagram of analyzing a task allocation method of the present invention and a task allocation method based on a routing subnet by using a bit comparison traffic model;
  • FIG. 5C is a schematic diagram of analyzing a task assignment method and a route subnet-based task assignment method according to the tornado flow model
  • FIG. 6 is a schematic structural diagram of Embodiment 1 of the task allocation device according to the present invention;
  • FIG. 7 is a schematic structural diagram of Embodiment 2 of the task allocation device according to the present invention;
  • FIG. 8 is a schematic structural diagram of Embodiment 3 of the task distribution device of the present invention.
  • FIG. 2 is a flowchart of Embodiment 1 of a task assignment method according to the present invention.
  • the execution subject of this embodiment is a task assignment device which can be integrated in an on-chip network composed of a multi-core processor, which can be, for example, any processor in an on-chip network or the like.
  • This embodiment is applicable to a scenario in which the idle processor core resources in the on-chip network are equal to or more than the processor cores required for the task to be processed.
  • the embodiment includes the following steps:
  • the task assignment device determines the number of threads included in the task to be processed. In general, the number of threads a task contains is the same as the number of processor cores needed to process the task. For example, if a task contains 9 threads, then 9 processor cores are needed to process the task.
  • 102: Determine, in the on-chip network formed by the multi-core processor, consecutive idle processor cores equal in number to the number of threads, where each idle processor core is connected to one on-chip router.
  • The on-chip network supports simultaneous access and offers high reliability and high reusability. It consists of multiple processor cores, on-chip routers, and interconnects (channels), where the interconnects include the internal interconnects between an on-chip router and its processor core and the external interconnects between on-chip routers. Each processor core is connected to one on-chip router, and the on-chip routers are interconnected into a mesh topology (hereinafter referred to as mesh). In this step, after the number of threads included in the to-be-processed task has been determined, the task allocation device determines, according to that number, consecutive idle processor cores equal in number to the number of threads, together with the corresponding on-chip routers, in the on-chip network formed by the multi-core processor.
  • mesh Mesh Topology
  • When the task allocation device has determined consecutive idle processor cores equal in number to the number of threads, if the area formed by the on-chip routers connected to those idle processor cores is a rectangular area, the threads included in the to-be-processed task are allocated directly to the idle processor cores, one thread per processor core. Otherwise, if the area formed by those on-chip routers is a non-rectangular area, the device searches for and determines a rectangular area expanded from the non-rectangular area, which is the smallest rectangular area in the on-chip network that contains the non-rectangular area. For example, suppose the NoC is a 5 x 5 mesh and the to-be-processed task contains 5 threads. If the consecutive idle processor cores found are 5 cores in the same column or row, the 5 threads are allocated to those 5 consecutive idle processor cores, one thread each. If 5 consecutive idle processor cores are found but the area formed by the on-chip routers connected to them is non-rectangular, i.e. an irregularly shaped area, the task allocation device determines the rectangular area containing that non-rectangular area; that is, the non-idle processor cores that have already been assigned tasks and the idle processor cores in the non-rectangular area together form a regular rectangular area.
  • Specifically, taking as an example the case where the five processor cores consist of the first three cores of the first row and the first two cores of the second row, the on-chip router connected to the third processor core of the second row serves as the boundary on-chip router, and the task allocation device determines the rectangular area formed by the on-chip routers connected to the five processor cores together with that boundary on-chip router.
  • It should be noted that the present invention is not limited to this example: in other possible implementations, the consecutive idle processor cores may have different combinations, the non-rectangular area formed by the on-chip routers connected to the processor cores may take various possible shapes, such as L-shaped, E-shaped, F-shaped, 工-shaped (I-beam-shaped), or I-shaped, and correspondingly the rectangular area containing the non-rectangular area may also take various possible forms.
  • Within the rectangular area, no routing table is needed to determine the routing mechanism of a data packet from the source on-chip router to the destination on-chip router; instead, data packets are delivered by XY routing. That is, after the source and destination on-chip routers are determined, the data packet starting from the source on-chip router is first transmitted horizontally to the intermediate on-chip router at the intersection with the column of the destination on-chip router and then vertically to the destination on-chip router; or it is first transmitted vertically to the intermediate on-chip router at the intersection with the row of the destination on-chip router and then horizontally to the destination on-chip router.
  • After the rectangular area containing the non-rectangular area has been determined, the task allocation device predicts, based on the historical traffic information of the on-chip routers connected to the non-idle processor cores in the rectangular area, the traffic of those on-chip routers to obtain the predicted traffic, and determines whether the predicted traffic exceeds a preset threshold. If the predicted traffic does not exceed the preset threshold, the threads included in the to-be-processed task are allocated to the idle processor cores.
  • In the task allocation method provided by this embodiment of the present invention, the number of threads included in the to-be-processed task is determined, a non-rectangular area formed by idle processor cores equal in number to the required threads is determined in the on-chip network, and, with the help of the boundary routers adjacent to the non-rectangular area, the on-chip routers in the non-rectangular area are expanded into a regular rectangular area. It is then determined whether the traffic of the on-chip routers connected to the non-idle processor cores in the rectangular area, that is, the boundary on-chip routers, exceeds the preset threshold; if not, the to-be-processed task is allocated to the processor cores of the free area.
  • With this method, when the idle processor-core resources in the on-chip network are equal to or more than the processor cores required by the to-be-processed task but no regular rectangular area is available, the non-rectangular area is expanded into a regular rectangular area with the help of the boundary routers and the task threads are allocated there. Within the rectangular area, no routing table is needed to determine the routing mechanism of a data packet from the source on-chip router to the destination on-chip router; instead, packets are delivered by XY routing, which avoids the problems of high hardware overhead, low network throughput, and low system utilization found in other task allocation methods.
  • As described above, the NoC consists of on-chip routers and interconnects (channels), each processor core is connected to one on-chip router, and the number of threads included in a task corresponds one-to-one to the number of processor cores required to process the task. Therefore, the number of threads included in the to-be-processed task, the number of processor cores required for the task, and the number of on-chip routers connected to those processor cores are equal, and a processor core is always in the same state as its on-chip router: both are idle, or both have been assigned a task. Finding an idle on-chip router is therefore equivalent to finding an idle processor core.
  • For clarity, only the on-chip routers are shown in the on-chip networks of the following figures.
  • the on-chip network includes multiple processor cores arranged in rows and columns, such as a 5 x 5 on-chip network, including 5 rows and 5 columns of 25 processor cores and 25 on-chip routers.
  • In this case, when determining consecutive idle processor cores matching the number of threads, the idle processor cores may be searched for row by row or column by column. Taking row-wise search as an example, an initial idle processor core may be determined in the on-chip network formed by the multi-core processor, and it is then determined sequentially, along the adjacent on-chip routers in the same row as the on-chip router connected to the initial idle processor core, whether there are consecutive idle processor cores matching the number of threads; if the row alone is not sufficient, a consecutive second free area is determined along the same column, as described above. Several specific examples are given below.
  • FIG. 3 is a schematic diagram of an on-chip network according to Embodiment 2 of the task allocation method of the present invention. As shown in FIG. 3:
  • In this embodiment, the NoC is a 5 x 5 NoC, the processor cores are arranged in rows and columns, and the task queue contains to-be-processed task one (4), indicating that task one includes 4 threads and needs 4 processor cores to process it. The processor core connected to on-chip router R1.1 is randomly determined as the initial idle processor core, and four consecutive idle on-chip routers are determined in sequence along the adjacent on-chip routers in the same row as the router connected to R1.1, namely R1.1, R1.2, R1.3, and R1.4. These four on-chip routers form a first free area, which is a regular rectangular area, so the 4 threads of task one are allocated directly to the processor cores in that rectangular area.
  • FIG. 4A is a schematic diagram of an on-chip network according to Embodiment 3 of the task allocation method of the present invention. As shown in FIG. 4A, in this embodiment the NoC is a 5 x 5 NoC and the processor cores are arranged in rows and columns.
  • In the figure, different markers denote high-load, low-load, and idle on-chip routers: R1.1 to R1.3, R2.1 to R2.4, and R3.1 to R3.4 are high-load on-chip routers, Rs0.1 to Rs0.6 are low-load on-chip routers, and the remaining routers are idle on-chip routers.
  • the method for determining the load on the on-chip router can be set according to requirements. For example, when the traffic carried by an on-chip router is greater than a preset threshold, it is determined to be a high-load on-chip router.
  • task 2 (5) is to be processed in the task queue, indicating that task 2 includes 5 threads, and 5 processor cores need to be allocated to process the task.
  • The processor core connected to on-chip router R5.0 is randomly determined as the initial idle processor core, and consecutive idle on-chip routers are searched for starting from R5.0. With row-wise search, the first free area found along the first row where R5.0 is located contains the four idle on-chip routers R5.0, R5.1, R5.2, and R5.3, which does not match the number of threads of the task; that is, the number of processor cores in the first free area does not satisfy the number of processor cores required by the task. The search therefore continues along the column where R5.0 is located for a second free area, until the sum of the number of processor cores in the second free area and in the first free area equals 5; that is, after R5.4 is found in the second free area, the number of idle on-chip routers equals the number of threads, and R5.0 to R5.4, Rs0.1, Rs0.2, and the high-load on-chip router R1.1 form a regular rectangular area.
  • After a regular rectangular area has been determined from the first and second free areas, it is next determined whether the traffic of the on-chip routers connected to the non-idle processor cores in the rectangular area exceeds the preset threshold. Specifically, the traffic of those on-chip routers may be predicted from their historical traffic information; in this embodiment, it must be determined whether the traffic on Rs0.1, Rs0.2, and the high-load on-chip router R1.1 exceeds the preset threshold.
  • Taking R1.1 as an example, if task two is allocated to the regular rectangular area determined from the first and second free areas, the traffic originally carried by R1.1 (thick black arrow 1 in the figure) is increased by the traffic added after task two is allocated (thick black arrow 2). If the sum of the two does not exceed the preset threshold, R1.1 is considered shareable, and it is determined that task two can be allocated to the processor cores contained in the rectangular area, as shown by the dashed box in the figure; data packets within the rectangular area are then delivered by XY routing.
  • For example, if R5.2 is the source on-chip router and R5.4 is the destination on-chip router, then with XY routing a packet can be delivered from R5.2 to R5.4 via R5.1 and R5.0, or from R5.2 via Rs0.2 and Rs0.1 to R5.4. Conversely, if allocating task two to that rectangular area would make the traffic originally carried by R1.1 plus the traffic added by task two exceed the preset threshold, R1.1 is considered not shareable, and it is determined that task two cannot be allocated to the processor cores of that rectangular area.
  • Suppose, after the above traffic check, that the traffic carried by at least one of the three shared on-chip routers, say R1.1, exceeds the preset threshold. The search is then restarted from R5.0, again row by row or column by column. FIG. 4B is a schematic diagram of re-searching for a rectangular area in FIG. 4A.
  • With column-wise search, the third free area is searched for starting from R5.0: along the first column where R5.0 is located, a third free area containing the three idle on-chip routers R5.0, R5.4, and R5.5 is found, which does not match the number of threads of the task; that is, the number of processor cores in the third free area does not satisfy the number required by the task. The search then continues along the row where R5.0 is located for a fourth free area, until the sum of the number of processor cores in the fourth free area and in the third free area equals 5; that is, after R5.1 and R5.2 are found in the fourth free area, the number of idle on-chip routers equals the number of threads, and R5.0, R5.4, R5.5, R5.1, R5.2, together with the four low-load on-chip routers Rs0.1, Rs0.2, Rs0.3, and Rs0.4 in the second and third rows, form a regular rectangular area.
  • It is then determined whether, if task two were allocated to the regular rectangular area determined from the third and fourth free areas, the traffic originally carried by the shared routers Rs0.1, Rs0.2, Rs0.3, and Rs0.4 plus the traffic added after task two is allocated would exceed the preset threshold. If the traffic of none of the four shared on-chip routers exceeds the preset threshold, task two is allocated to the processor cores contained in the rectangular area, as shown by the dashed box in FIG. 4B; otherwise, if the traffic carried by any one of them would exceed the preset threshold, task two cannot be allocated to that rectangular area and the processor cores must be searched for again. In the above embodiment, if none of the irregular areas passes the traffic prediction, task two simply waits in the waiting queue for the next task scheduling; the processor cores required by the task are searched for again once more processor cores are released after other tasks finish.
  • It should be noted that in the above embodiments the first, second, third, and fourth free areas may be regular rectangular areas or irregular areas. Taking FIG. 4A as an example, if Rs0.1 were also an idle on-chip router, the consecutive idle on-chip routers found starting from R5.0 would include R5.0, R5.1, R5.2, R5.4, and Rs0.1; and if Rs0.2 were an idle on-chip router while R5.4, Rs0.1, and R1.1 were non-idle on-chip routers, the consecutive idle on-chip routers found starting from R5.0 would include R5.0, R5.1, R5.2, Rs0.2, and R5.3.
  • FIG. 5A is a schematic diagram of analyzing a task assignment method and a route subnet-based task assignment method according to the present invention by using a random uniform traffic model.
  • In the random uniform (Uniform) traffic model, the abscissa is the injection rate, which can be understood as the utilization of the processor cores of the on-chip network, and the ordinate is the delay; one curve represents injection rate versus delay for the present invention, and the other represents injection rate versus delay for the routing-subnet-based task allocation method. As shown in FIG. 5A, when the injection rate is 0 to 6×10⁻³, the utilization of the processor cores of the whole on-chip network is not high, and the delay of the technical solution of the present invention is essentially equal to that of the existing technical solution. However, as the processor-core utilization increases, the difference between the two solutions at the same delay grows: in the routing-subnet-based task allocation method, the larger the injection rate, the larger the delay, indicating worse on-chip network performance, i.e. the delay rises markedly and the network throughput is low, whereas in the task allocation method of the present invention the delay increases only slowly with the injection rate, indicating high on-chip network performance, i.e. the delay rise is not obvious and the network throughput is high.
  • FIG. 5B is a schematic diagram comparing the task allocation method of the present invention with the routing-subnet-based task allocation method under the Bitcomp traffic model. As in FIG. 5A, the abscissa is the injection rate, understood as the utilization of the processor cores of the on-chip network, and the ordinate is the delay; one curve represents injection rate versus delay for the present invention, and the other represents injection rate versus delay for the routing-subnet-based task allocation method. When the injection rate exceeds 2×10⁻³, the beneficial effects of the present invention are clearly exhibited.
  • FIG. 5C is a schematic diagram comparing the task allocation method of the present invention with the routing-subnet-based task allocation method under the Tornado traffic model. Likewise, the abscissa is the injection rate, understood as the utilization of the processor cores of the on-chip network, and the ordinate is the delay; one curve represents injection rate versus delay for the present invention, and the other represents injection rate versus delay for the routing-subnet-based task allocation method. When the injection rate exceeds 4×10⁻³, the beneficial effects of the present invention are clearly exhibited.
  • Table 1 compares the system utilization of the rectangular-subnet partitioning method with that of the router-sharing method of the present invention for network load ratios of 0.5 to 1, where the network load ratio denotes the ratio of the number of processor cores required to the number of processor cores the system can actually provide. For example, one column of Table 1 shows that when this ratio is 0.5 the system is unsaturated: the system utilization is 0.478033 with the rectangular-subnet partitioning method and 0.465374 with the router-sharing method of the present invention, a small difference. As the network load increases the system gradually saturates: at a load ratio of 0.9 the utilization is 0.701311 with the rectangular-subnet method versus 0.766011 with the router-sharing method, and when the network load reaches 100%, i.e. the system is saturated, the utilization is 0.707254 versus 0.810507, a difference of nearly 10%.
  • an idle processor core of the on-chip network is randomly used as the initial idle processor core, and the search is started from the initial idle processor core when the processor core needs to be searched again.
  • the initial idle processor core may also be selected according to a preset rule, and the initial idle processor core may be different each time the search is performed.
  • In addition, when the on-chip network contains more than one non-rectangular area of idle, consecutive processor cores, a processor core in one of those areas may be randomly determined as the initial idle processor core.
  • FIG. 6 is a schematic structural diagram of Embodiment 1 of a task distribution device according to the present invention.
  • the task allocation apparatus provided in this embodiment may implement various steps of the method applied to the task distribution apparatus according to any embodiment of the present invention, and the specific implementation process is not described herein again.
  • the task distribution apparatus provided in this embodiment specifically includes:
  • the first determining module 11 is configured to determine the number of threads included in the task to be processed;
  • a second determining module 12 configured to determine, in an on-chip network formed by the multi-core processor, a plurality of idle processor cores that are equal in number to the number of threads, wherein each idle processor core is connected to an on-chip router;
  • a third determining module 13 is configured to: when the second determining module 12 determines that the area formed by the on-chip routers connected to the idle processor cores is a non-rectangular area, search for and determine, in the on-chip network, a rectangular area expanded from the non-rectangular area;
  • the allocation module 14 is configured to: if the predicted traffic of each on-chip router connected to a non-idle processor core in the rectangular area determined by the third determining module does not exceed the preset threshold, allocate the threads of the to-be-processed task to the idle processor cores, where each idle processor core is allocated one thread.
  • In the task allocation device provided by this embodiment of the present invention, the first determining module determines the number of threads included in the to-be-processed task, and the second determining module determines, in the on-chip network, a non-rectangular area formed by idle processor cores equal in number to the required threads. With the help of the boundary on-chip routers adjacent to the non-rectangular area, the on-chip routers connected to the idle processor cores in the non-rectangular area are expanded into a regular rectangular area; the third determining module then determines whether the traffic of the on-chip routers connected to the non-idle processor cores in the rectangular area, that is, the boundary on-chip routers, exceeds the preset threshold, and if not, the allocation module allocates the to-be-processed task to the processor cores of the free area.
  • With this device, when the idle processor-core resources in the on-chip network are equal to or more than the processor cores required by the to-be-processed task but no regular rectangular area is available for the task, the non-rectangular area is expanded into a regular rectangular area with the help of the boundary routers and the task is allocated there. Within the rectangular area, no routing table is needed to determine the routing mechanism of a data packet from the source on-chip router to the destination on-chip router; instead, packets are delivered by XY routing, which saves hardware and avoids the problems of high hardware overhead, low network throughput, and low system utilization of the routing-subnet-based task allocation method.
  • the third determining module 13 is specifically configured to:
  • the rectangular area expanded by the non-rectangular area is the smallest rectangular area containing the non-rectangular area in the on-chip network.
  • allocation module 14 is further configured to:
  • if the on-chip routers connected to the plurality of idle processor cores determined by the second determining module 12 form a rectangular area, the threads of the to-be-processed task are allocated to the idle processor cores, where each processor core is allocated one thread.
  • the second determining module 12 is specifically configured to:
  • determine an initial idle processor core in the on-chip network formed by the multi-core processor, the on-chip network including multiple processor cores arranged in rows and columns;
  • starting from the initial idle processor core, determine, in the on-chip network formed by the multi-core processor, consecutive idle processor cores that match the number of threads.
  • the second determining module 12 is specifically configured to: sequentially determine, along the adjacent on-chip routers in the same row as the on-chip router connected to the initial idle processor core, whether there are consecutive idle processor cores that match the number of threads;
  • if the number of processor cores in the consecutive first free area determined along that row does not match the number of threads, sequentially determine a consecutive second free area along the adjacent on-chip routers in the same column as the on-chip router connected to the initial idle processor core, so that the sum of the number of processor cores in the first free area and in the second free area equals the number of threads.
  • alternatively, the second determining module 12 is specifically configured to: sequentially determine, along the adjacent on-chip routers in the same column as the on-chip router connected to the initial idle processor core, whether there are consecutive idle processor cores that match the number of threads;
  • if the number of processor cores in the consecutive third free area determined along that column does not match the number of threads, sequentially determine a consecutive fourth free area along the adjacent on-chip routers in the same row as the on-chip router connected to the initial idle processor core, so that the sum of the number of processor cores in the third free area and in the fourth free area equals the number of threads.
  • FIG. 7 is a schematic structural diagram of Embodiment 2 of a task distribution device according to the present invention.
  • the task distribution apparatus provided in this embodiment is based on the apparatus shown in FIG. 6.
  • and may further include a prediction module 15, configured to predict, based on the historical traffic information of the on-chip routers connected to the non-idle processor cores in the rectangular area, the traffic of those on-chip routers, so as to obtain the predicted traffic.
  • FIG. 8 is a schematic structural diagram of Embodiment 3 of the task distribution device of the present invention.
  • the task distribution apparatus 800 of the present embodiment may include a processor 81 and a memory 82.
  • the task distribution device 800 can also include a transmitter 83, a receiver 84. Transmitter 83 and receiver 84 can be coupled to processor 81.
  • the memory 82 stores execution instructions. When the task distribution device 800 is running, the processor 81 communicates with the memory 82, and the processor 81 calls the execution instructions in the memory 82 for performing the following operations:
  • the task assignment device 800 determines the number of threads included in the task to be processed
  • determining, in the on-chip network formed by the multi-core processor, consecutive idle processor cores equal in number to the number of threads, each idle processor core being connected to one on-chip router; and, if the area formed by the on-chip routers connected to the idle processor cores is a non-rectangular area, searching for and determining, in the on-chip network, a rectangular area expanded from the non-rectangular area;
  • if the predicted traffic of each on-chip router connected to a non-idle processor core in the expanded rectangular area does not exceed the preset threshold, allocating the threads of the to-be-processed task to the idle processor cores, where each idle processor core is allocated one thread.
  • the rectangular area expanded by the non-rectangular area is the smallest rectangular area in the network on the chip that includes the non-rectangular area.
  • the method further includes:
  • if the area formed by the on-chip routers of the determined idle processor cores is a rectangular area, the threads of the to-be-processed task are allocated to the idle processor cores, where each processor core is allocated one thread.
  • the on-chip network includes multiple processor cores arranged in rows and columns;
  • determining a plurality of idle processor cores that match the number of threads in the on-chip network formed by the multi-core processor includes:
  • determining an initial idle processor core in the on-chip network formed by the multi-core processor, and then, starting from the initial idle processor core, determining consecutive idle processor cores whose number matches the number of threads.
  • searching for and determining a rectangular area expanded by the non-rectangular area includes:
  • sequentially determining, along the adjacent on-chip routers in the same row as the on-chip router connected to the initial idle processor core, whether there are consecutive idle processor cores that match the number of threads;
  • if the number of processor cores in the consecutive first free area determined along that row does not match the number of threads, sequentially determining a consecutive second free area along the adjacent on-chip routers in the same column as the on-chip router connected to the initial idle processor core, so that the sum of the number of processor cores in the first free area and in the second free area equals the number of threads.
  • searching for and determining a rectangular area expanded by the non-rectangular area includes:
  • sequentially determining, along the adjacent on-chip routers in the same column as the on-chip router connected to the initial idle processor core, whether there are consecutive idle processor cores that match the number of threads;
  • if the number of processor cores in the consecutive third free area determined along that column does not match the number of threads, sequentially determining a consecutive fourth free area along the adjacent on-chip routers in the same row as the on-chip router connected to the initial idle processor core, so that the sum of the number of processor cores in the third free area and in the fourth free area equals the number of threads.
  • Optionally, if the predicted traffic of each on-chip router connected to a non-idle processor core in the rectangular area does not exceed the preset threshold, then before the threads included in the to-be-processed task are allocated to the idle processor cores, the method further includes:
  • predicting, based on the historical traffic information of the on-chip routers connected to the non-idle processor cores in the rectangular area, the traffic of those on-chip routers, so as to obtain the predicted traffic.
  • Based on the above task allocation method and task allocation device, an embodiment of the present invention further provides an on-chip network, including multiple processor cores, on-chip routers, and interconnects, together with any task allocation device as shown in FIG. 6 or FIG. 7; correspondingly, it can carry out the technical solutions of any of the method embodiments of FIG. 2 to FIG. 4A, which are not described again here.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Multi Processors (AREA)

Abstract

An embodiment of the present invention provides a task allocation method, a task allocation device, and a network-on-chip. The method includes: determining the number of threads included in a to-be-processed task, and determining, in an on-chip network formed by a multi-core processor, a contiguous area formed by the on-chip routers corresponding to consecutive idle processor cores equal in number to the number of threads. If this area is a non-rectangular area, a rectangular area expanded from it is determined; if the predicted traffic of every on-chip router in the expanded rectangular area that is connected to a non-idle processor core does not exceed a preset threshold, the threads of the to-be-processed task are allocated to the idle processor cores in the area. The task allocation method provided by the embodiments of the present invention expands the non-rectangular area with the help of the boundary routers of already-allocated tasks, thereby avoiding problems such as high hardware overhead, low network throughput, and low system utilization.

Description

Task Allocation Method, Task Allocation Device, and Network-on-Chip

This application claims priority to Chinese Patent Application No. 201310177172.1, filed with the Chinese Patent Office on May 14, 2013 and entitled "Task allocation method, task allocation device, and network-on-chip", which is incorporated herein by reference in its entirety.

Technical Field

Embodiments of the present invention relate to on-chip multi-core network technologies, and in particular to a task allocation method, a task allocation device, and a network-on-chip.

Background

As the integration level of very large scale integrated circuits (VLSI) keeps increasing, more and more on-chip processing units, such as storage units and signal processing units, are integrated on a single chip. Each on-chip processing unit is equivalent to a processor core, and multiple processor cores form a multi-core or many-core processor. The network-on-chip (NoC) is the main means of data transmission between different processor cores in a multi-core processor. As the number of processor cores keeps growing, it is increasingly common for multiple threads of one task, or multiple tasks, to run on the same processor at the same time. If the threads of different tasks are randomly allocated to processor cores, then in the NoC the communication between processor cores running different threads of the same task is affected by the data flows of other tasks, Quality of Service (QoS) cannot be guaranteed, and system performance degrades. To avoid such mutual interference between the data flows of tasks caused by random allocation in the NoC, subnet partitioning is usually adopted, that is, the data flows belonging to one task are confined to a specific region of the NoC.

In the prior art, a routing table is established for each on-chip router in the NoC, and the routing table determines the routing mechanism of a data packet from the source on-chip router to the destination on-chip router. During subnet partitioning, an internal routing algorithm is relied upon to guarantee that the on-chip router reached by the next hop of a task's data flow is an on-chip router allocated to the same task. Such routing algorithms are applicable to arbitrary topologies but are relatively complex and incur high hardware overhead, and irregular subnet shapes easily cause traffic congestion. FIG. 1 is a schematic diagram of a routing-algorithm-based task allocation method in the prior art. As shown in FIG. 1, if the other on-chip routers of task five all need to communicate with Dest, they must go through the same link, which may cause link congestion and affect network throughput.

Summary

Embodiments of the present invention provide a task allocation method, a task allocation device, and a network-on-chip, which are used to solve the problems of high hardware overhead and low network throughput of the routing-algorithm-based task allocation method in the prior art.
According to a first aspect, an embodiment of the present invention provides a task allocation method, including:

determining the number of threads included in a to-be-processed task;

determining, in an on-chip network formed by a multi-core processor, consecutive idle processor cores equal in number to the number of threads, where each idle processor core is connected to one on-chip router; and, if the area formed by the on-chip routers connected to the idle processor cores is a non-rectangular area, searching for and determining, in the on-chip network, a rectangular area expanded from the non-rectangular area;

if the predicted traffic of each on-chip router connected to a non-idle processor core in the expanded rectangular area does not exceed a preset threshold, allocating the threads of the to-be-processed task to the idle processor cores, where each idle processor core is allocated one thread.

In a first possible implementation of the first aspect, the rectangular area expanded from the non-rectangular area is the smallest rectangular area in the on-chip network that contains the non-rectangular area.

With reference to the first aspect or its first possible implementation, in a second possible implementation of the first aspect, after the consecutive idle processor cores matching the number of threads are determined in the on-chip network formed by the multi-core processor, the method further includes: if the area formed by the on-chip routers of the determined idle processor cores is a rectangular area, allocating the threads of the to-be-processed task to the idle processor cores, where each processor core is allocated one thread.

With reference to the first aspect or either of its first and second possible implementations, in a third possible implementation, the on-chip network includes multiple processor cores arranged in rows and columns; correspondingly, the determining, in the on-chip network formed by the multi-core processor, of consecutive idle processor cores matching the number of threads includes: determining an initial idle processor core in the on-chip network formed by the multi-core processor; and, starting from the initial idle processor core, determining, in the on-chip network formed by the multi-core processor, consecutive idle processor cores matching the number of threads.

With reference to the third possible implementation of the first aspect, in a fourth possible implementation, if the area formed by the on-chip routers of the determined idle processor cores is a non-rectangular area, the searching for and determining the rectangular area expanded from the non-rectangular area includes: sequentially determining, along the adjacent on-chip routers in the same row as the on-chip router connected to the initial idle processor core, whether there are consecutive idle processor cores matching the number of threads; and, if the number of processor cores in the consecutive first free area determined along that row does not match the number of threads, sequentially determining a consecutive second free area along the adjacent on-chip routers in the same column as the on-chip router connected to the initial idle processor core, so that the sum of the number of processor cores in the first free area and the number of processor cores in the second free area equals the number of threads.

With reference to the third possible implementation of the first aspect, in a fifth possible implementation, if the area formed by the on-chip routers of the determined idle processor cores is a non-rectangular area, the searching for and determining the rectangular area expanded from the non-rectangular area includes: sequentially determining, along the adjacent on-chip routers in the same column as the on-chip router connected to the initial idle processor core, whether there are consecutive idle processor cores matching the number of threads; and, if the number of processor cores in the consecutive third free area determined along that column does not match the number of threads, sequentially determining a consecutive fourth free area along the adjacent on-chip routers in the same row as the on-chip router connected to the initial idle processor core, so that the sum of the number of processor cores in the third free area and the number of processor cores in the fourth free area equals the number of threads.

With reference to the first aspect or any one of its first to fifth possible implementations, in a sixth possible implementation, if the predicted traffic of each on-chip router connected to a non-idle processor core in the rectangular area does not exceed the preset threshold, then before the threads included in the to-be-processed task are allocated to the idle processor cores, the method further includes: predicting, based on historical traffic information of the on-chip routers connected to the non-idle processor cores in the rectangular area, the traffic of those on-chip routers to obtain the predicted traffic.
According to a second aspect, an embodiment of the present invention provides a task allocation device, including:

a first determining module, configured to determine the number of threads included in a to-be-processed task;

a second determining module, configured to determine, in an on-chip network formed by a multi-core processor, consecutive idle processor cores equal in number to the number of threads, where each idle processor core is connected to one on-chip router;

a third determining module, configured to: when the second determining module determines that the area formed by the on-chip routers connected to the idle processor cores is a non-rectangular area, search for and determine, in the on-chip network, a rectangular area expanded from the non-rectangular area;

an allocation module, configured to: if the predicted traffic of each on-chip router connected to a non-idle processor core in the rectangular area determined by the third determining module does not exceed a preset threshold, allocate the threads of the to-be-processed task to the idle processor cores, where each idle processor core is allocated one thread.

In a first possible implementation of the second aspect, the third determining module is specifically configured to: determine that the rectangular area expanded from the non-rectangular area is the smallest rectangular area in the on-chip network that contains the non-rectangular area.

With reference to the second aspect or its first possible implementation, in a second possible implementation of the second aspect, the allocation module is further configured to: if the on-chip routers connected to the multiple idle processor cores determined by the second determining module form a rectangular area, allocate the threads of the to-be-processed task to the idle processor cores, where each processor core is allocated one thread.

With reference to the second aspect or either of its first and second possible implementations, in a third possible implementation of the second aspect, the second determining module is specifically configured to: determine an initial idle processor core in the on-chip network formed by the multi-core processor, the on-chip network including multiple processor cores arranged in rows and columns; and, starting from the initial idle processor core, determine, in the on-chip network formed by the multi-core processor, consecutive idle processor cores matching the number of threads.

With reference to the third possible implementation of the second aspect, in a fourth possible implementation of the second aspect, the second determining module is specifically configured to sequentially determine, along the adjacent on-chip routers in the same row as the on-chip router connected to the initial idle processor core, whether there are consecutive idle processor cores matching the number of threads; and, if the number of processor cores in the consecutive first free area determined along that row does not match the number of threads, sequentially determine a consecutive second free area along the adjacent on-chip routers in the same column as the on-chip router connected to the initial idle processor core, so that the sum of the number of processor cores in the first free area and the number of processor cores in the second free area equals the number of threads.

With reference to the third possible implementation of the second aspect, in a fifth possible implementation of the second aspect, the second determining module is specifically configured to: sequentially determine, along the adjacent on-chip routers in the same column as the on-chip router connected to the initial idle processor core, whether there are consecutive idle processor cores matching the number of threads; and, if the number of processor cores in the consecutive third free area determined along that column does not match the number of threads, sequentially determine a consecutive fourth free area along the adjacent on-chip routers in the same row as the on-chip router connected to the initial idle processor core, so that the sum of the number of processor cores in the third free area and the number of processor cores in the fourth free area equals the number of threads.

With reference to the second aspect or any one of its first to fifth possible implementations, in a sixth possible implementation of the second aspect, the task allocation device further includes: a prediction module, configured to predict, based on historical traffic information of the on-chip routers connected to the non-idle processor cores in the rectangular area, the traffic of those on-chip routers to obtain the predicted traffic.

According to a third aspect, an embodiment of the present invention further provides an on-chip network, including multiple processor cores, on-chip routers, and interconnects, as well as the task allocation device according to any one of the above.
In the task allocation method, task allocation device, and on-chip network provided by the embodiments of the present invention, the number of threads included in a to-be-processed task is determined, and a non-rectangular area formed by idle processor cores whose number matches the required number of threads is determined in the on-chip network. With the help of the boundary on-chip routers adjacent to the non-rectangular area, the on-chip routers connected to the idle processor cores in the non-rectangular area are expanded into a regular rectangular area. It is then determined whether the traffic of the on-chip routers connected to the non-idle processor cores in the rectangular area, that is, the boundary on-chip routers, exceeds a preset threshold; if not, the to-be-processed task is allocated to the processor cores of the free area. With the task allocation method provided by the embodiments of the present invention, when the idle processor-core resources in the on-chip network are equal to or more than the processor cores required by the to-be-processed task but no regular rectangular area is available for the task, the non-rectangular area is formed into a regular rectangular area with the help of the boundary routers and the task is allocated there. Within the rectangular area, no routing table is needed to determine the routing mechanism of a data packet from the source on-chip router to the destination on-chip router; instead, data packets are delivered by XY routing, which avoids network congestion and increases network throughput.
Brief Description of the Drawings

To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings needed for describing the embodiments or the prior art are briefly introduced below. Apparently, the drawings described below show some embodiments of the present invention, and persons of ordinary skill in the art may derive other drawings from them without creative effort.

FIG. 1 is a schematic diagram of a routing-subnet-based task allocation method in the prior art;
FIG. 2 is a flowchart of Embodiment 1 of the task allocation method according to the present invention;
FIG. 3 is a schematic diagram of an on-chip network according to Embodiment 2 of the task allocation method of the present invention;
FIG. 4A is a schematic diagram of an on-chip network according to Embodiment 3 of the task allocation method of the present invention;
FIG. 4B is a schematic diagram of re-searching for a rectangular area in FIG. 4A;
FIG. 5A is a schematic diagram comparing the task allocation method of the present invention with the routing-subnet-based task allocation method under a random uniform traffic model;
FIG. 5B is a schematic diagram comparing the task allocation method of the present invention with the routing-subnet-based task allocation method under the Bitcomp traffic model;
FIG. 5C is a schematic diagram comparing the task allocation method of the present invention with the routing-subnet-based task allocation method under the Tornado traffic model;
FIG. 6 is a schematic structural diagram of Embodiment 1 of the task allocation device according to the present invention;
FIG. 7 is a schematic structural diagram of Embodiment 2 of the task allocation device according to the present invention;
FIG. 8 is a schematic structural diagram of Embodiment 3 of the task allocation device according to the present invention.

Detailed Description

To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Apparently, the described embodiments are some rather than all of the embodiments of the present invention. All other embodiments obtained by persons of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
FIG. 2 is a flowchart of Embodiment 1 of the task allocation method according to the present invention. The executing body of this embodiment is a task allocation device, which may be integrated in an on-chip network formed by a multi-core processor and may be, for example, any processor in the on-chip network. This embodiment is applicable to the scenario in which the idle processor-core resources in the on-chip network are equal to or more than the processor cores required by the to-be-processed task. Specifically, this embodiment includes the following steps:

101: Determine the number of threads included in a to-be-processed task.

The task allocation device determines the number of threads included in the to-be-processed task. In general, the number of threads a task contains corresponds one-to-one to the number of processor cores needed to process the task; for example, if a task contains 9 threads, 9 processor cores are needed to process it.

102: Determine, in the on-chip network formed by the multi-core processor, consecutive idle processor cores equal in number to the number of threads, where each idle processor core is connected to one on-chip router.

An on-chip network supports simultaneous access and offers high reliability and high reusability. It consists of multiple processor cores, on-chip routers, and interconnects (channels), where the interconnects include the internal interconnects between an on-chip router and its processor core and the external interconnects between on-chip routers. Each processor core is connected to one on-chip router, and the on-chip routers are interconnected into a mesh topology (hereinafter referred to as mesh). In this step, after the number of threads included in the to-be-processed task has been determined, the task allocation device determines, according to that number, consecutive idle processor cores equal in number to the number of threads, together with the corresponding on-chip routers, in the on-chip network formed by the multi-core processor.
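For concreteness, here is a minimal Python sketch of how such a mesh NoC might be represented for allocation purposes. The class and field names (`MeshNoC`, `Core`, `Router`, `idle`, `load`) are illustrative assumptions for this sketch, not terminology from the patent.

```python
from dataclasses import dataclass

@dataclass
class Router:
    row: int
    col: int
    load: float = 0.0        # traffic currently carried by this on-chip router

@dataclass
class Core:
    row: int
    col: int
    idle: bool = True        # a core and its attached router share the same state
    router: Router = None

class MeshNoC:
    """rows x cols mesh: one processor core attached to each on-chip router."""
    def __init__(self, rows: int, cols: int):
        self.rows, self.cols = rows, cols
        self.cores = [[Core(r, c, router=Router(r, c)) for c in range(cols)]
                      for r in range(rows)]

    def idle_cores(self):
        """All idle cores; finding an idle router is equivalent to finding an idle core."""
        return [core for row in self.cores for core in row if core.idle]

noc = MeshNoC(5, 5)          # the 5 x 5 mesh used in the embodiments below
```

The examples below (FIG. 3 and FIG. 4A) can be read against such a grid, with routers such as R1.1 or R5.0 corresponding to particular (row, column) positions.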
103: If the area formed by the on-chip routers connected to the determined idle processor cores is a non-rectangular area, search for and determine, in the on-chip network, a rectangular area expanded from the non-rectangular area.

After the task allocation device has determined consecutive idle processor cores equal in number to the number of threads, if the area formed by the on-chip routers connected to those idle processor cores is a rectangular area, the threads included in the to-be-processed task are allocated directly to the idle processor cores, one thread per processor core. Otherwise, if the area formed by those on-chip routers is a non-rectangular area, the device searches for and determines a rectangular area expanded from the non-rectangular area, which is the smallest rectangular area in the on-chip network that contains the non-rectangular area. For example, suppose the NoC is a 5 x 5 mesh and the to-be-processed task contains 5 threads. If the consecutive idle processor cores found are 5 processor cores in the same column or row, the 5 threads of the task are allocated to those 5 consecutive idle processor cores, one thread each. If 5 consecutive idle processor cores are found but the area formed by the on-chip routers connected to them is non-rectangular, i.e. an irregularly shaped area, the task allocation device determines the rectangular area containing the non-rectangular area; that is, the non-idle processor cores that have already been assigned tasks and the idle processor cores in the non-rectangular area together form a regular rectangular area. Specifically, taking as an example the case where the 5 processor cores consist of the first 3 cores of the first row and the first 2 cores of the second row, the on-chip router connected to the third processor core of the second row serves as the boundary on-chip router, and the task allocation device determines the rectangular area formed by the on-chip routers connected to the 5 processor cores together with that boundary on-chip router.
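The expansion of a non-rectangular idle region into its smallest enclosing rectangle can be sketched as follows; `cells` is assumed to be the set of (row, column) grid positions of the on-chip routers attached to the selected idle cores, and the helper name is illustrative rather than the patent's.

```python
def bounding_rectangle(cells):
    """Smallest rectangle of grid positions containing every cell of the
    (possibly non-rectangular) region. The cells the expansion pulls in are
    exactly the boundary routers that already carry other tasks' traffic."""
    rows = [r for r, _ in cells]
    cols = [c for _, c in cells]
    rect = {(r, c)
            for r in range(min(rows), max(rows) + 1)
            for c in range(min(cols), max(cols) + 1)}
    boundary = rect - set(cells)   # non-idle routers borrowed to square off the area
    return rect, boundary

# The example above: first 3 cores of row 0 plus first 2 cores of row 1 (an L-shape).
region = {(0, 0), (0, 1), (0, 2), (1, 0), (1, 1)}
rect, shared = bounding_rectangle(region)
assert shared == {(1, 2)}          # the single boundary router of the example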
It should be noted that the above uses the case where the 5 processor cores consist of the first 3 cores of the first row and the first 2 cores of the second row only as an example to describe the present invention in detail; the present invention is not limited thereto. In other possible implementations, the consecutive idle processor cores may have different combinations, the non-rectangular area formed by the on-chip routers connected to the processor cores may take various possible shapes, such as L-shaped, E-shaped, F-shaped, 工-shaped (I-beam-shaped), or I-shaped, and correspondingly the rectangular area containing the non-rectangular area may also take various possible forms.

104: If the traffic of each on-chip router connected to a non-idle processor core in the expanded rectangular area does not exceed the preset threshold, allocate the threads of the to-be-processed task to the idle processor cores, one thread per idle processor core.

Within the rectangular area, no routing table is needed to determine the routing mechanism of a data packet from the source on-chip router to the destination on-chip router; instead, data packets are delivered by XY routing. That is, after the source and destination on-chip routers are determined, the data packet starting from the source on-chip router is first transmitted horizontally to the intermediate on-chip router at the intersection with the column of the destination on-chip router and then vertically to the destination on-chip router; or it is first transmitted vertically to the intermediate on-chip router at the intersection with the row of the destination on-chip router and then horizontally to the destination on-chip router.
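A minimal sketch of the XY routing just described, with routers identified by (row, column) grid positions; this is an illustrative implementation of dimension-ordered routing, not code from the patent.

```python
def xy_route(src, dst):
    """Hop-by-hop path from src to dst: first move along the source row to the
    column of the destination, then along that column to the destination.
    (The Y-then-X variant mentioned above is symmetric.) No per-router routing
    table is needed inside a rectangular region."""
    (sr, sc), (dr, dc) = src, dst
    path = [src]
    step = 1 if dc >= sc else -1
    while path[-1][1] != dc:                 # horizontal leg
        path.append((sr, path[-1][1] + step))
    step = 1 if dr >= sr else -1
    while path[-1][0] != dr:                 # vertical leg
        path.append((path[-1][0] + step, dc))
    return path

# xy_route((0, 3), (2, 0)) -> [(0, 3), (0, 2), (0, 1), (0, 0), (1, 0), (2, 0)]
```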
After the rectangular area containing the non-rectangular area has been determined, the task allocation device predicts, based on the historical traffic information of the on-chip routers connected to the non-idle processor cores in the rectangular area, the traffic of those on-chip routers to obtain the predicted traffic, and determines whether the predicted traffic exceeds the preset threshold. If the predicted traffic does not exceed the preset threshold, the threads included in the to-be-processed task are allocated to the idle processor cores.
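The patent only requires that the boundary routers' traffic be predicted from their historical traffic and compared against the preset threshold; it does not prescribe a particular predictor. The sketch below uses an exponentially weighted moving average purely as a placeholder assumption, and the function names are illustrative.

```python
def predict_traffic(history, alpha=0.5):
    """Placeholder predictor: exponentially weighted moving average over a
    router's historical traffic samples (most recent sample last)."""
    estimate = history[0]
    for sample in history[1:]:
        estimate = alpha * sample + (1 - alpha) * estimate
    return estimate

def rectangle_usable(boundary_history, added_traffic, threshold):
    """The expanded rectangle is usable only if every shared (non-idle) boundary
    router stays at or below the preset threshold once the new task's traffic
    is added on top of its predicted traffic."""
    return all(predict_traffic(samples) + added_traffic <= threshold
               for samples in boundary_history.values())

# e.g. rectangle_usable({"R1.1": [0.40, 0.45, 0.50]}, added_traffic=0.2, threshold=0.8)
```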
In the task allocation method provided by this embodiment of the present invention, the number of threads included in the to-be-processed task is determined, a non-rectangular area formed by idle processor cores equal in number to the required threads is determined in the on-chip network, and, with the help of the boundary routers of the area adjacent to the non-rectangular area, the on-chip routers in the non-rectangular area are expanded into a regular rectangular area. It is then determined whether the traffic of the on-chip routers connected to the non-idle processor cores in the rectangular area, that is, the boundary on-chip routers, exceeds the preset threshold; if not, the to-be-processed task is allocated to the processor cores of the free area. With this method, when the idle processor-core resources in the on-chip network are equal to or more than the processor cores required by the to-be-processed task but no regular rectangular area is available, the non-rectangular area is expanded into a regular rectangular area with the help of the boundary routers and the task threads are allocated there. Within the rectangular area, no routing table is needed to determine the routing mechanism of a data packet from the source on-chip router to the destination on-chip router; instead, data packets are delivered by XY routing, which avoids the problems of high hardware overhead, low network throughput, and low system utilization of other task allocation methods.

As can be seen from the above, the NoC consists of on-chip routers and interconnects (channels), each processor core is connected to one on-chip router, and the number of threads included in a task corresponds one-to-one to the number of processor cores required to process the task. Therefore, the number of threads included in the to-be-processed task, the number of processor cores required by it, and the number of on-chip routers connected to those processor cores are equal, and a processor core is always in the same state as its on-chip router: both idle, or both assigned a task. Finding an idle on-chip router is therefore equivalent to finding an idle processor core. For clarity, only the on-chip routers are shown in the on-chip networks of the following figures.
In general, the on-chip network includes multiple processor cores arranged in rows and columns; for example, a 5 x 5 on-chip network includes 25 processor cores and 25 on-chip routers in 5 rows and 5 columns. In this case, when determining consecutive idle processor cores matching the number of threads, the idle processor cores may be searched for row by row or column by column. Taking row-wise search as an example, an initial idle processor core may be determined in the on-chip network formed by the multi-core processor, and it is then determined sequentially, along the adjacent on-chip routers in the same row as the on-chip router connected to the initial idle processor core, whether there are consecutive idle processor cores matching the number of threads. If the number of processor cores in the consecutive first free area determined along that row does not match the number of threads, a consecutive second free area is determined sequentially along the adjacent on-chip routers in the same column as the on-chip router connected to the initial idle processor core, so that the sum of the number of processor cores in the first free area and in the second free area matches the number of threads. Several specific examples follow; a code sketch of this row-first search is also given after the description of FIG. 4A below.

FIG. 3 is a schematic diagram of an on-chip network according to Embodiment 2 of the task allocation method of the present invention. As shown in FIG. 3, in this embodiment the NoC is a 5 x 5 NoC, the processor cores are arranged in rows and columns, and the task queue contains to-be-processed task one (4), indicating that task one includes 4 threads and needs 4 processor cores to process it. The processor core connected to on-chip router R1.1 is randomly determined as the initial idle processor core, and 4 consecutive idle on-chip routers are determined in sequence along the adjacent on-chip routers in the same row as the router connected to R1.1, namely R1.1, R1.2, R1.3, and R1.4. These 4 on-chip routers form a first free area, which is a regular rectangular area, so the 4 threads of task one are allocated directly to the processor cores in the rectangular area where R1.1, R1.2, R1.3, and R1.4 are located.

FIG. 4A is a schematic diagram of an on-chip network according to Embodiment 3 of the task allocation method of the present invention. As shown in FIG. 4A, in this embodiment the NoC is a 5 x 5 NoC and the processor cores are arranged in rows and columns; the figure marks high-load, low-load, and idle on-chip routers with different symbols, namely R1.1 to R1.3, R2.1 to R2.4, and R3.1 to R3.4 are high-load on-chip routers, Rs0.1 to Rs0.6 are low-load on-chip routers, and the remaining routers are idle. The way a router's load level is judged can be set as required; for example, an on-chip router is judged to be high-load when the traffic it carries exceeds a preset threshold.
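Putting the row-first search described above into code gives roughly the following; `idle` is assumed to be a 2-D boolean grid (True = idle core/router) and the walking direction is an arbitrary choice, since the patent allows searching by row or by column in either direction.

```python
def search_row_then_column(idle, start, n_threads):
    """Collect contiguous idle cores along the row of the initial idle core
    (first free area); if still short of n_threads, continue down the column
    of the initial core (second free area). Returns the selected (row, col)
    positions, or None if the required count cannot be reached this way."""
    rows, cols = len(idle), len(idle[0])
    r0, c0 = start
    selected = []
    c = c0
    while c < cols and idle[r0][c] and len(selected) < n_threads:
        selected.append((r0, c))              # first free area, along the row
        c += 1
    r = r0 + 1
    while r < rows and idle[r][c0] and len(selected) < n_threads:
        selected.append((r, c0))              # second free area, down the column
        r += 1
    return selected if len(selected) == n_threads else None
```

Read against FIG. 3, starting from the core attached to R1.1 with 4 threads, the row walk alone already returns four positions, so the first free area is a rectangle and task one can be allocated directly. Read against FIG. 4A with 5 threads, the row walk stops after four idle routers and the column walk supplies the fifth (R5.4), giving the non-rectangular region that is then squared off as in the earlier bounding-rectangle sketch.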
In this embodiment, the task queue contains to-be-processed task two (5), indicating that task two includes 5 threads and needs 5 processor cores to process it. The processor core connected to on-chip router R5.0 is randomly determined as the initial idle processor core; starting from R5.0, consecutive idle on-chip routers are searched for and it is checked whether they can form a rectangular area. If, after traversing the possible row-wise and column-wise cases, no qualifying regular rectangular area has been found, the search continues row by row or column by column. Specifically, with row-wise search, the first free area is searched for starting from R5.0: along the first row where R5.0 is located, a first free area containing the 4 idle on-chip routers R5.0, R5.1, R5.2, and R5.3 is found, which does not match the number of threads of the task; that is, the number of processor cores in the first free area does not satisfy the number of processor cores required by the task. The search therefore continues along the column where R5.0 is located for a second free area, until the sum of the number of processor cores in the second free area and in the first free area equals 5; that is, after R5.4 is found in the second free area, the number of idle on-chip routers equals the number of threads, and R5.0 to R5.4, Rs0.1, Rs0.2, and the high-load on-chip router R1.1 form a regular rectangular area.

After a regular rectangular area has been determined from the first and second free areas as above, it is next determined whether the traffic of the on-chip routers connected to the non-idle processor cores in the rectangular area exceeds the preset threshold. Specifically, the traffic of those on-chip routers may be predicted from their historical traffic information. In this embodiment, it must be determined whether the traffic on Rs0.1, Rs0.2, and the high-load on-chip router R1.1 exceeds the preset threshold. Taking R1.1 as an example, if task two is allocated to the regular rectangular area determined from the first and second free areas, the traffic originally carried by R1.1 (thick black arrow 1 in the figure) is increased by the traffic added after task two is allocated (thick black arrow 2). If the sum of the two does not exceed the preset threshold, R1.1 is considered shareable, and it is determined that task two can be allocated to the processor cores contained in the rectangular area, as shown by the dashed box in the figure; data packets within the rectangular area are then delivered by XY routing. For example, if R5.2 is the source on-chip router and R5.4 is the destination on-chip router, a packet can, with XY routing, be delivered from R5.2 to R5.4 via R5.1 and R5.0, or from R5.2 via Rs0.2 and Rs0.1 to R5.4. Conversely, if allocating task two to that rectangular area would make the traffic originally carried by R1.1 plus the traffic added by task two exceed the preset threshold, R1.1 is considered not shareable, and it is determined that task two cannot be allocated to the processor cores of that rectangular area.
After the traffic carried by the on-chip routers has been judged as above, suppose that at least one of the three on-chip routers carries traffic exceeding the preset threshold, say R1.1. In that case, the search is restarted from R5.0, again row by row or column by column. Specifically, this is shown in FIG. 4B, which is a schematic diagram of re-searching for a rectangular area in FIG. 4A.

With column-wise search, the third free area is searched for starting from R5.0: along the first column where R5.0 is located, a third free area containing the three idle on-chip routers R5.0, R5.4, and R5.5 is found, which does not match the number of threads of the task; that is, the number of processor cores in the third free area does not satisfy the number required by the task. The search then continues along the row where R5.0 is located for a fourth free area, until the sum of the number of processor cores in the fourth free area and in the third free area equals 5; that is, after R5.1 and R5.2 are found in the fourth free area, the number of idle on-chip routers equals the number of threads, and R5.0, R5.4, R5.5, R5.1, R5.2, together with the four low-load on-chip routers Rs0.1, Rs0.2, Rs0.3, and Rs0.4 in the second and third rows, form a regular rectangular area. It is then determined whether, if task two were allocated to this regular rectangular area determined from the third and fourth free areas, the traffic originally carried by the shared routers Rs0.1, Rs0.2, Rs0.3, and Rs0.4 plus the traffic added after task two is allocated would exceed the preset threshold. If the traffic of none of the four shared on-chip routers exceeds the preset threshold, task two is allocated to the processor cores contained in the rectangular area, as shown by the dashed box in FIG. 4B; otherwise, if the traffic carried by any one of them would exceed the preset threshold, task two cannot be allocated to that rectangular area and the processor cores must be searched for again. In the above embodiment, if none of the irregular areas passes the traffic prediction, task two simply waits in the waiting queue for the next task scheduling; the processor cores required by the task are searched for again once more processor cores are released after other tasks finish.
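The retry-or-wait behaviour described in this embodiment can be summarized as the following sketch. It reuses the `bounding_rectangle` and `rectangle_usable` helpers from the earlier sketches, so it is a composition of assumptions rather than the patent's implementation; the candidate regions would come from searches such as the row-first and column-first walks above.

```python
def try_allocate(n_threads, candidate_regions, traffic_history, added_traffic, threshold):
    """Try each candidate free region in turn. A region whose routers already
    form a rectangle is accepted immediately; otherwise its bounding rectangle
    is accepted only if every shared boundary router passes the traffic check.
    Returning None means the task waits in the queue for the next scheduling
    round, until other tasks finish and release more processor cores."""
    for region in candidate_regions:
        if not region or len(region) != n_threads:
            continue                                   # skip failed or wrong-sized searches
        rect, boundary = bounding_rectangle(region)
        if not boundary:                               # already a regular rectangle
            return region
        history = {cell: traffic_history.get(cell, [0.0]) for cell in boundary}
        if rectangle_usable(history, added_traffic, threshold):
            return region
    return None
```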
It should be noted that in the above embodiments the first, second, third, and fourth free areas may be regular rectangular areas or irregular areas. Taking FIG. 4A as an example, if Rs0.1 were also an idle on-chip router, the consecutive idle on-chip routers found starting from R5.0 would include R5.0, R5.1, R5.2, R5.4, and Rs0.1; and if Rs0.2 were an idle on-chip router while R5.4, Rs0.1, and R1.1 were non-idle on-chip routers, the consecutive idle on-chip routers found starting from R5.0 would include R5.0, R5.1, R5.2, Rs0.2, and R5.3.

To clearly compare the beneficial effects of the task allocation method of the present invention with the routing-subnet-based task allocation method in the prior art, different traffic models are used below to analyze the technical solution of the present invention and the existing technical solution.

FIG. 5A is a schematic diagram comparing the task allocation method of the present invention with the routing-subnet-based task allocation method under a random uniform traffic model. As shown in FIG. 5A, in the random uniform (Uniform) traffic model, the abscissa is the injection rate, which can be understood as the utilization of the processor cores of the on-chip network, and the ordinate is the delay; one curve represents injection rate versus delay for the present invention, and the other represents injection rate versus delay for the routing-subnet-based method. When the injection rate is 0 to 6×10⁻³, the utilization of the processor cores of the whole on-chip network is not high, and the delay of the technical solution of the present invention is essentially equal to that of the existing technical solution. However, as the processor-core utilization increases, the difference between the two solutions at the same delay grows: in the routing-subnet-based task allocation method, the larger the injection rate, the larger the delay, indicating worse on-chip network performance, i.e. the delay rises markedly and the network throughput is low, whereas in the task allocation method of the present invention the delay increases only slowly with the injection rate, indicating high on-chip network performance, i.e. the delay rise is not obvious and the network throughput is high.

FIG. 5B is a schematic diagram comparing the task allocation method of the present invention with the routing-subnet-based task allocation method under the Bitcomp traffic model. Similarly to FIG. 5A, in the Bitcomp traffic model of FIG. 5B the abscissa is the injection rate, understood as the utilization of the processor cores of the on-chip network, and the ordinate is the delay; one curve represents injection rate versus delay for the present invention, and the other represents injection rate versus delay for the routing-subnet-based task allocation method. When the injection rate exceeds 2×10⁻³, the beneficial effects of the present invention are clearly exhibited.

FIG. 5C is a schematic diagram comparing the task allocation method of the present invention with the routing-subnet-based task allocation method under the Tornado traffic model. Similarly, in the Tornado traffic model of FIG. 5C the abscissa is the injection rate, understood as the utilization of the processor cores of the on-chip network, and the ordinate is the delay; one curve represents injection rate versus delay for the present invention, and the other represents injection rate versus delay for the routing-subnet-based task allocation method. When the injection rate exceeds 4×10⁻³, the beneficial effects of the present invention are clearly exhibited.

In addition, to clearly compare the beneficial effects of the task allocation method of the present invention with the rectangular-subnet-partitioning-based task allocation method in the prior art, a table of system utilization is used below to compare the router sharing of the present invention with the subnet-partitioning method of the prior art.
[Table 1: comparison of system utilization under the rectangular-subnet partitioning method and under the router-sharing method of the present invention for network load ratios of 0.5 to 1; the values cited below are 0.478033 vs. 0.465374 at a load ratio of 0.5, 0.701311 vs. 0.766011 at 0.9, and 0.707254 vs. 0.810507 at 1.0.]
Table 1 compares the system utilization of the rectangular-subnet partitioning method with that of the router-sharing method of the present invention for network load ratios of 0.5 to 1, where the network load ratio of 0.5 to 1 denotes the ratio of the number of processor cores required to the number of processor cores the system can actually provide. For example, one column of Table 1 shows that when this ratio is 0.5 the system is unsaturated: the system utilization of the rectangular-subnet partitioning method is 0.478033, while that of the router-sharing method of the present invention is 0.465374, a small difference. However, as the network load keeps increasing, the system gradually saturates: at a network load ratio of 0.9, the system utilization of the rectangular-subnet partitioning method is 0.701311, while that of the router-sharing method of the present invention is 0.766011. Finally, when the network load reaches 100%, i.e. the system is saturated, the system utilization of the rectangular-subnet partitioning method is 0.707254, while that of the router-sharing method of the present invention is 0.810507, a difference of nearly 10%.

It should be noted that in the above embodiments an idle processor core of the on-chip network is always chosen randomly as the initial idle processor core, and whenever processor cores need to be searched for again the search starts from that initial idle processor core. However, the present invention is not limited thereto; in other possible implementations, the initial idle processor core may also be selected according to a preset rule, and the initial idle processor core may differ between searches. In addition, when the on-chip network contains more than one non-rectangular area of idle, consecutive processor cores, a processor core in one of those areas may be randomly determined as the initial idle processor core.
FIG. 6 is a schematic structural diagram of Embodiment 1 of the task allocation device according to the present invention. As shown in FIG. 6, the task allocation device provided in this embodiment can carry out the steps of the method applied to a task allocation device in any embodiment of the present invention; the specific implementation process is not described again here. The task allocation device provided in this embodiment specifically includes:

a first determining module 11, configured to determine the number of threads included in a to-be-processed task;

a second determining module 12, configured to determine, in an on-chip network formed by a multi-core processor, consecutive idle processor cores equal in number to the number of threads, where each idle processor core is connected to one on-chip router;

a third determining module 13, configured to: when the area formed by the on-chip routers connected to the idle processor cores determined by the second determining module 12 is a non-rectangular area, search for and determine, in the on-chip network, a rectangular area expanded from the non-rectangular area;

an allocation module 14, configured to: if the predicted traffic of each on-chip router connected to a non-idle processor core in the rectangular area determined by the third determining module does not exceed a preset threshold, allocate the threads of the to-be-processed task to the idle processor cores, where each idle processor core is allocated one thread.

In the task allocation device provided by this embodiment of the present invention, the first determining module determines the number of threads included in the to-be-processed task, and the second determining module determines, in the on-chip network, a non-rectangular area formed by idle processor cores equal in number to the required threads; with the help of the boundary on-chip routers adjacent to the non-rectangular area, the on-chip routers connected to the idle processor cores in the non-rectangular area are expanded into a regular rectangular area. The third determining module then determines whether the traffic of the on-chip routers connected to the non-idle processor cores in the rectangular area, that is, the boundary on-chip routers, exceeds the preset threshold; if not, the allocation module allocates the to-be-processed task to the processor cores of the free area. With this device, when the idle processor-core resources in the on-chip network are equal to or more than the processor cores required by the to-be-processed task but no regular rectangular area is available for the task, the non-rectangular area is expanded into a regular rectangular area with the help of the boundary routers and the task is allocated there. Within the rectangular area, no routing table is needed to determine the routing mechanism of a data packet from the source on-chip router to the destination on-chip router; instead, data packets are delivered by XY routing, which saves hardware and avoids the problems of high hardware overhead, low network throughput, and low system utilization of the routing-subnet-based task allocation method.
Further, the third determining module 13 is specifically configured to:

determine the rectangular area extended from the non-rectangular area as the smallest rectangular area in the network-on-chip that contains the non-rectangular area.

Further, the allocating module 14 is further configured to:

if the on-chip routers connected to the plurality of idle processor cores determined by the second determining module 12 form a rectangular area, allocate the threads of the to-be-processed task to the idle processor cores respectively, where each processor core is allocated one thread.

Further, the second determining module 12 is specifically configured to:

determine an initial idle processor core in the network-on-chip formed by the multi-core processors, where the network-on-chip includes a plurality of processor cores arranged in rows and columns; and

determine, starting from the initial idle processor core, a plurality of contiguous idle processor cores matching the number of threads in the network-on-chip formed by the multi-core processors.

Further, the second determining module 12 is specifically configured to: determine, successively along the adjacent on-chip routers in the same row as the on-chip router connected to the initial idle processor core, whether there are a plurality of contiguous idle processor cores matching the number of threads; and

if the number of processor cores in a contiguous first idle area determined successively along the adjacent on-chip routers in the same row does not match the number of threads, determine a contiguous second idle area successively along the adjacent on-chip routers in the same column as the on-chip router connected to the initial idle processor core, so that the sum of the number of processor cores in the first idle area and the number of processor cores in the second idle area is equal to the number of threads.

Further, the second determining module 12 is specifically configured to: determine, successively along the adjacent on-chip routers in the same column as the on-chip router connected to the initial idle processor core, whether there are a plurality of contiguous idle processor cores matching the number of threads; and

if the number of processor cores in a contiguous third idle area determined successively along the adjacent on-chip routers in the same column does not match the number of threads, determine a contiguous fourth idle area successively along the adjacent on-chip routers in the same row as the on-chip router connected to the initial idle processor core, so that the sum of the number of processor cores in the third idle area and the number of processor cores in the fourth idle area is equal to the number of threads.
FIG. 7 is a schematic structural diagram of Embodiment 2 of a task allocation apparatus according to the present invention. As shown in FIG. 7, on the basis of the apparatus shown in FIG. 6, the task allocation apparatus provided in this embodiment may further include:

a predicting module 15, configured to predict, based on historical traffic information of the on-chip routers that are in the rectangular area and connected to non-idle processor cores, the traffic of the on-chip routers that are in the rectangular area and connected to non-idle processor cores, to obtain the predicted traffic.
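The disclosure does not pin the predicting module to a specific algorithm. As one hedged sketch, it could keep a sliding window of traffic samples per boundary router, take their average as the baseline, add the estimated traffic contributed by the new task, and compare the result with the preset threshold; the window size, the per-task traffic estimate, and all names below are assumptions introduced for illustration only.

```python
# Hedged sketch of history-based traffic prediction for boundary (shared) routers.
# The moving-average model and window size are assumptions; the disclosure only
# requires that the predicted traffic be derived from historical traffic information.

from collections import deque
from statistics import mean

class BoundaryRouterPredictor:
    def __init__(self, window=8):
        self.history = {}           # router id -> recent traffic samples
        self.window = window

    def record(self, router_id, traffic_sample):
        self.history.setdefault(router_id, deque(maxlen=self.window)).append(traffic_sample)

    def predict(self, router_id, added_traffic):
        """Predicted traffic = average of recent history + traffic added by the new task."""
        samples = self.history.get(router_id)
        baseline = mean(samples) if samples else 0.0
        return baseline + added_traffic

    def rectangle_admissible(self, boundary_routers, added_traffic, threshold):
        """Allocate only if no boundary router's predicted traffic exceeds the threshold."""
        return all(self.predict(r, added_traffic) <= threshold for r in boundary_routers)
```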
FIG. 8 is a schematic structural diagram of Embodiment 3 of a task allocation apparatus according to the present invention. As shown in FIG. 8, the task allocation apparatus 800 of this embodiment may include a processor 81 and a memory 82. The task allocation apparatus 800 may further include a transmitter 83 and a receiver 84. The transmitter 83 and the receiver 84 may be connected to the processor 81. The memory 82 stores execution instructions; when the task allocation apparatus 800 runs, the processor 81 communicates with the memory 82, and the processor 81 invokes the execution instructions in the memory 82 to perform the following operations:

the task allocation apparatus 800 determines the number of threads included in a to-be-processed task;

determines, in a network-on-chip formed by multi-core processors, a plurality of contiguous idle processor cores whose number is equal to the number of threads, where each idle processor core is connected to one on-chip router;

if the area formed by the on-chip routers connected to the determined idle processor cores is a non-rectangular area, searches the network-on-chip for and determines a rectangular area extended from the non-rectangular area; and

if the predicted traffic of each on-chip router that is in the extended rectangular area and connected to a non-idle processor core does not exceed a preset threshold, allocates the threads of the to-be-processed task to the idle processor cores, where each idle processor core is allocated one thread.
Optionally, the rectangular area extended from the non-rectangular area is the smallest rectangular area in the network-on-chip that contains the non-rectangular area.

Optionally, after the determining, in a network-on-chip formed by multi-core processors, a plurality of contiguous idle processor cores matching the number of threads, the method further includes:

if the area formed by the on-chip routers of the determined idle processor cores is a rectangular area, allocating the threads of the to-be-processed task to the idle processor cores respectively, where each processor core is allocated one thread.

Optionally, the network-on-chip includes a plurality of processor cores arranged in rows and columns;

correspondingly, the determining, in a network-on-chip formed by multi-core processors, a plurality of contiguous idle processor cores matching the number of threads includes:

determining an initial idle processor core in the network-on-chip formed by the multi-core processors; and

determining, starting from the initial idle processor core, a plurality of contiguous idle processor cores matching the number of threads in the network-on-chip formed by the multi-core processors.
Optionally, if the area formed by the on-chip routers of the determined idle processor cores is a non-rectangular area, the searching for and determining a rectangular area extended from the non-rectangular area includes:

determining, successively along the adjacent on-chip routers in the same row as the on-chip router connected to the initial idle processor core, whether there are a plurality of contiguous idle processor cores matching the number of threads; and

if the number of processor cores in a contiguous first idle area determined successively along the adjacent on-chip routers in the same row does not match the number of threads, determining a contiguous second idle area successively along the adjacent on-chip routers in the same column as the on-chip router connected to the initial idle processor core, so that the sum of the number of processor cores in the first idle area and the number of processor cores in the second idle area is equal to the number of threads.

Optionally, if the area formed by the on-chip routers of the determined idle processor cores is a non-rectangular area, the searching for and determining a rectangular area extended from the non-rectangular area includes:

determining, successively along the adjacent on-chip routers in the same column as the on-chip router connected to the initial idle processor core, whether there are a plurality of contiguous idle processor cores matching the number of threads; and

if the number of processor cores in a contiguous third idle area determined successively along the adjacent on-chip routers in the same column does not match the number of threads, determining a contiguous fourth idle area successively along the adjacent on-chip routers in the same row as the on-chip router connected to the initial idle processor core, so that the sum of the number of processor cores in the third idle area and the number of processor cores in the fourth idle area is equal to the number of threads.

Optionally, if the predicted traffic of each on-chip router that is in the rectangular area and connected to a non-idle processor core does not exceed the preset threshold, before the threads included in the to-be-processed task are allocated to the idle processor cores respectively, the method further includes:

predicting, based on historical traffic information of the on-chip routers that are in the rectangular area and connected to non-idle processor cores, the traffic of the on-chip routers that are in the rectangular area and connected to non-idle processor cores, to obtain the predicted traffic.
Based on the foregoing task allocation method and task allocation apparatus, an embodiment of the present invention further provides a network-on-chip, including a plurality of processor cores, on-chip routers, interconnection lines, and any task allocation apparatus shown in FIG. 6 or FIG. 7; correspondingly, it can execute the technical solution of any one of the method embodiments in FIG. 2 to FIG. 4A, and details are not described here again.
Finally, it should be noted that the foregoing embodiments are merely intended to describe the technical solutions of the present invention, rather than to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, persons of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or make equivalent replacements to some or all of the technical features therein, and such modifications or replacements do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A task allocation method, characterized by comprising:

determining the number of threads included in a to-be-processed task;

determining, in a network-on-chip formed by multi-core processors, a plurality of contiguous idle processor cores whose number is equal to the number of threads, wherein each of the idle processor cores is connected to one on-chip router;

if an area formed by the on-chip routers connected to the idle processor cores is a non-rectangular area, searching the network-on-chip for and determining a rectangular area extended from the non-rectangular area; and

if the predicted traffic of each on-chip router that is in the extended rectangular area and connected to a non-idle processor core does not exceed a preset threshold, allocating the threads of the to-be-processed task to the idle processor cores, wherein each of the idle processor cores is allocated one thread.
2. The method according to claim 1, characterized in that the rectangular area extended from the non-rectangular area is the smallest rectangular area in the network-on-chip that contains the non-rectangular area.
3. The method according to claim 1 or 2, characterized in that, after the determining, in a network-on-chip formed by multi-core processors, a plurality of contiguous idle processor cores matching the number of threads, the method further comprises:

if the area formed by the on-chip routers of the determined idle processor cores is a rectangular area, allocating the threads of the to-be-processed task to the idle processor cores respectively, wherein each processor core is allocated one thread.
4. The method according to any one of claims 1 to 3, characterized in that the network-on-chip comprises a plurality of processor cores arranged in rows and columns;

correspondingly, the determining, in a network-on-chip formed by multi-core processors, a plurality of contiguous idle processor cores matching the number of threads comprises:

determining an initial idle processor core in the network-on-chip formed by the multi-core processors; and

determining, starting from the initial idle processor core, a plurality of contiguous idle processor cores matching the number of threads in the network-on-chip formed by the multi-core processors.
5. The method according to claim 4, characterized in that, if the area formed by the on-chip routers of the determined idle processor cores is a non-rectangular area, the searching for and determining a rectangular area extended from the non-rectangular area comprises:

determining, successively along the adjacent on-chip routers in the same row as the on-chip router connected to the initial idle processor core, whether there are a plurality of contiguous idle processor cores matching the number of threads; and

if the number of processor cores in a contiguous first idle area determined successively along the adjacent on-chip routers in the same row does not match the number of threads, determining a contiguous second idle area successively along the adjacent on-chip routers in the same column as the on-chip router connected to the initial idle processor core, so that the sum of the number of processor cores in the first idle area and the number of processor cores in the second idle area is equal to the number of threads.
6. The method according to claim 4, characterized in that, if the area formed by the on-chip routers of the determined idle processor cores is a non-rectangular area, the searching for and determining a rectangular area extended from the non-rectangular area comprises:

determining, successively along the adjacent on-chip routers in the same column as the on-chip router connected to the initial idle processor core, whether there are a plurality of contiguous idle processor cores matching the number of threads; and

if the number of processor cores in a contiguous third idle area determined successively along the adjacent on-chip routers in the same column does not match the number of threads, determining a contiguous fourth idle area successively along the adjacent on-chip routers in the same row as the on-chip router connected to the initial idle processor core, so that the sum of the number of processor cores in the third idle area and the number of processor cores in the fourth idle area is equal to the number of threads.
7. The method according to any one of claims 1 to 6, characterized in that, if the predicted traffic of each on-chip router that is in the rectangular area and connected to a non-idle processor core does not exceed the preset threshold, before the allocating the threads included in the to-be-processed task to the idle processor cores respectively, the method further comprises:

predicting, based on historical traffic information of the on-chip routers that are in the rectangular area and connected to non-idle processor cores, the traffic of the on-chip routers that are in the rectangular area and connected to non-idle processor cores, to obtain the predicted traffic.
8. A task allocation apparatus, characterized by comprising:

a first determining module, configured to determine the number of threads included in a to-be-processed task;

a second determining module, configured to determine, in a network-on-chip formed by multi-core processors, a plurality of contiguous idle processor cores whose number is equal to the number of threads, wherein each of the idle processor cores is connected to one on-chip router;

a third determining module, configured to: when an area formed by the on-chip routers connected to the idle processor cores determined by the second determining module is a non-rectangular area, search the network-on-chip for and determine a rectangular area extended from the non-rectangular area; and

an allocating module, configured to: if the predicted traffic of each on-chip router that is in the rectangular area determined by the third determining module and connected to a non-idle processor core does not exceed a preset threshold, allocate the threads of the to-be-processed task to the idle processor cores, wherein each of the idle processor cores is allocated one thread.
9. The task allocation apparatus according to claim 8, characterized in that the third determining module is specifically configured to:

determine the rectangular area extended from the non-rectangular area as the smallest rectangular area in the network-on-chip that contains the non-rectangular area.
10. The task allocation apparatus according to claim 8 or 9, characterized in that the allocating module is further configured to:

if the on-chip routers connected to the plurality of idle processor cores determined by the second determining module form a rectangular area, allocate the threads of the to-be-processed task to the idle processor cores respectively, wherein each processor core is allocated one thread.
11. The task allocation apparatus according to any one of claims 8 to 10, characterized in that the second determining module is specifically configured to:

determine an initial idle processor core in the network-on-chip formed by the multi-core processors, wherein the network-on-chip comprises a plurality of processor cores arranged in rows and columns; and

determine, starting from the initial idle processor core, a plurality of contiguous idle processor cores matching the number of threads in the network-on-chip formed by the multi-core processors.
12. The task allocation apparatus according to claim 11, characterized in that the second determining module is specifically configured to: determine, successively along the adjacent on-chip routers in the same row as the on-chip router connected to the initial idle processor core, whether there are a plurality of contiguous idle processor cores matching the number of threads; and

if the number of processor cores in a contiguous first idle area determined successively along the adjacent on-chip routers in the same row does not match the number of threads, determine a contiguous second idle area successively along the adjacent on-chip routers in the same column as the on-chip router connected to the initial idle processor core, so that the sum of the number of processor cores in the first idle area and the number of processor cores in the second idle area is equal to the number of threads.
13. The task allocation apparatus according to claim 11, characterized in that the second determining module is specifically configured to:

determine, successively along the adjacent on-chip routers in the same column as the on-chip router connected to the initial idle processor core, whether there are a plurality of contiguous idle processor cores matching the number of threads; and

if the number of processor cores in a contiguous third idle area determined successively along the adjacent on-chip routers in the same column does not match the number of threads, determine a contiguous fourth idle area successively along the adjacent on-chip routers in the same row as the on-chip router connected to the initial idle processor core, so that the sum of the number of processor cores in the third idle area and the number of processor cores in the fourth idle area is equal to the number of threads.
14. The task allocation apparatus according to any one of claims 8 to 13, characterized in that the task allocation apparatus further comprises:

a predicting module, configured to predict, based on historical traffic information of the on-chip routers that are in the rectangular area and connected to non-idle processor cores, the traffic of the on-chip routers that are in the rectangular area and connected to non-idle processor cores, to obtain the predicted traffic.
15. A network-on-chip, comprising a plurality of processor cores, on-chip routers, and interconnection lines, characterized by further comprising the task allocation apparatus according to any one of claims 8 to 14.
PCT/CN2014/075655 2013-05-14 2014-04-18 任务分配方法、任务分配装置及片上网络 WO2014183530A1 (zh)

Priority Applications (5)

Application Number Priority Date Filing Date Title
EP14797851.4A EP2988215B1 (en) 2013-05-14 2014-04-18 Task assigning method, task assigning apparatus, and network-on-chip
KR1020157035119A KR101729596B1 (ko) 2013-05-14 2014-04-18 작업 할당 방법, 작업 할당 장치, 및 네트워크 온 칩
JP2016513212A JP6094005B2 (ja) 2013-05-14 2014-04-18 タスク割り当て方法、タスク割り当て装置、およびネットワークオンチップ
US14/940,577 US9965335B2 (en) 2013-05-14 2015-11-13 Allocating threads on a non-rectangular area on a NoC based on predicted traffic of a smallest rectangular area
US15/943,370 US10671447B2 (en) 2013-05-14 2018-04-02 Method, apparatus, and network-on-chip for task allocation based on predicted traffic in an extended area

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201310177172.1A CN104156267B (zh) 2013-05-14 2013-05-14 任务分配方法、任务分配装置及片上网络
CN201310177172.1 2013-05-14

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US14/940,577 Continuation US9965335B2 (en) 2013-05-14 2015-11-13 Allocating threads on a non-rectangular area on a NoC based on predicted traffic of a smallest rectangular area

Publications (1)

Publication Number Publication Date
WO2014183530A1 true WO2014183530A1 (zh) 2014-11-20

Family

ID=51881772

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/075655 WO2014183530A1 (zh) 2013-05-14 2014-04-18 任务分配方法、任务分配装置及片上网络

Country Status (6)

Country Link
US (2) US9965335B2 (zh)
EP (1) EP2988215B1 (zh)
JP (1) JP6094005B2 (zh)
KR (1) KR101729596B1 (zh)
CN (1) CN104156267B (zh)
WO (1) WO2014183530A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017067215A1 (zh) * 2015-10-21 2017-04-27 深圳市中兴微电子技术有限公司 众核网络处理器及其微引擎的报文调度方法、系统、存储介质
JP2017539180A (ja) * 2014-12-18 2017-12-28 華為技術有限公司Huawei Technologies Co.,Ltd. 光ネットワークオンチップ、光ルータ、および信号伝送方法

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102285481B1 (ko) * 2015-04-09 2021-08-02 에스케이하이닉스 주식회사 NoC 반도체 장치의 태스크 매핑 방법
CN105718318B (zh) * 2016-01-27 2019-12-13 戴西(上海)软件有限公司 一种基于辅助工程设计软件的集合式调度优化方法
CN105721342B (zh) * 2016-02-24 2017-08-25 腾讯科技(深圳)有限公司 多进程设备的网络连接方法和系统
CN107329822B (zh) * 2017-01-15 2022-01-28 齐德昱 面向多源多核系统的基于超任务网的多核调度方法
CN107632594B (zh) * 2017-11-06 2024-02-06 苏州科技大学 一种基于无线网络的电器集中控制系统和控制方法
CN108694156B (zh) * 2018-04-16 2021-12-21 东南大学 一种基于缓存一致性行为的片上网络流量合成方法
KR102026970B1 (ko) * 2018-10-08 2019-09-30 성균관대학교산학협력단 네트워크 온 칩의 다중 라우팅 경로 설정 방법 및 장치
US11334392B2 (en) 2018-12-21 2022-05-17 Bull Sas Method for deployment of a task in a supercomputer, method for implementing a task in a supercomputer, corresponding computer program and supercomputer
FR3091775A1 (fr) * 2018-12-21 2020-07-17 Bull Sas Execution/Isolation d’application par allocation de ressources réseau au travers du mécanisme de routage
FR3091773A1 (fr) * 2018-12-21 2020-07-17 Bull Sas Execution/Isolation d’application par allocation de ressources réseau au travers du mécanisme de routage
US11327796B2 (en) 2018-12-21 2022-05-10 Bull Sas Method for deploying a task in a supercomputer by searching switches for interconnecting nodes
FR3091771A1 (fr) * 2018-12-21 2020-07-17 Bull Sas Execution/Isolation d’application par allocation de ressources réseau au travers du mécanisme de routage
EP3671455A1 (fr) * 2018-12-21 2020-06-24 Bull SAS Procédé de déploiement d'une tâche dans un supercalculateur, procédé de mise en oeuvre d'une tâche dans un supercalculateur, programme d'ordinateur correspondant et supercalculateur
CN111382115B (zh) 2018-12-28 2022-04-15 北京灵汐科技有限公司 一种用于片上网络的路径创建方法、装置及电子设备
KR102059548B1 (ko) * 2019-02-13 2019-12-27 성균관대학교산학협력단 Vfi 네트워크 온칩에 대한 구역간 라우팅 방법, vfi 네트워크 온칩에 대한 구역내 라우팅 방법, vfi 네트워크 온칩에 대한 구역내 및 구역간 라우팅 방법 및 이를 실행하기 위한 프로그램이 기록된 기록매체
CN109995652B (zh) * 2019-04-15 2021-03-19 中北大学 一种基于冗余通道构筑的片上网络感知预警路由方法
CN110471777B (zh) * 2019-06-27 2022-04-15 中国科学院计算机网络信息中心 一种Python-Web环境中多用户共享使用Spark集群的实现方法和系统
US11134030B2 (en) * 2019-08-16 2021-09-28 Intel Corporation Device, system and method for coupling a network-on-chip with PHY circuitry
CN112612605A (zh) * 2020-12-16 2021-04-06 平安消费金融有限公司 线程分配方法、装置、计算机设备和可读存储介质
CN115686800B (zh) * 2022-12-30 2023-03-21 摩尔线程智能科技(北京)有限责任公司 用于多核系统的动态核心调度方法和装置
CN116405555B (zh) * 2023-03-08 2024-01-09 阿里巴巴(中国)有限公司 数据传输方法、路由节点、处理单元和片上系统

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101403982A (zh) * 2008-11-03 2009-04-08 华为技术有限公司 一种多核处理器的任务分配方法、系统及设备
US20090328047A1 (en) * 2008-06-30 2009-12-31 Wenlong Li Device, system, and method of executing multithreaded applications
CN102193779A (zh) * 2011-05-16 2011-09-21 武汉科技大学 一种面向MPSoC的多线程调度方法
CN102541633A (zh) * 2011-12-16 2012-07-04 汉柏科技有限公司 基于多核cpu的数据平面和控制平面部署系统及方法

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5007050B2 (ja) * 2006-02-01 2012-08-22 株式会社野村総合研究所 格子型コンピュータシステム、タスク割り当てプログラム
JP2008191949A (ja) * 2007-02-05 2008-08-21 Nec Corp マルチコアシステムおよびマルチコアシステムの負荷分散方法
JP5429382B2 (ja) 2010-08-10 2014-02-26 富士通株式会社 ジョブ管理装置及びジョブ管理方法
KR101770587B1 (ko) 2011-02-21 2017-08-24 삼성전자주식회사 멀티코어 프로세서의 핫 플러깅 방법 및 멀티코어 프로세서 시스템
JP5724626B2 (ja) 2011-05-23 2015-05-27 富士通株式会社 プロセス配置装置、プロセス配置方法及びプロセス配置プログラム

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090328047A1 (en) * 2008-06-30 2009-12-31 Wenlong Li Device, system, and method of executing multithreaded applications
CN101403982A (zh) * 2008-11-03 2009-04-08 华为技术有限公司 一种多核处理器的任务分配方法、系统及设备
CN102193779A (zh) * 2011-05-16 2011-09-21 武汉科技大学 一种面向MPSoC的多线程调度方法
CN102541633A (zh) * 2011-12-16 2012-07-04 汉柏科技有限公司 基于多核cpu的数据平面和控制平面部署系统及方法

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP2988215A4 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2017539180A (ja) * 2014-12-18 2017-12-28 華為技術有限公司Huawei Technologies Co.,Ltd. 光ネットワークオンチップ、光ルータ、および信号伝送方法
US10250958B2 (en) 2014-12-18 2019-04-02 Huawei Technologies Co., Ltd Optical network-on-chip, optical router, and signal transmission method
WO2017067215A1 (zh) * 2015-10-21 2017-04-27 深圳市中兴微电子技术有限公司 众核网络处理器及其微引擎的报文调度方法、系统、存储介质
CN106612236A (zh) * 2015-10-21 2017-05-03 深圳市中兴微电子技术有限公司 众核网络处理器及其微引擎的报文调度方法、系统

Also Published As

Publication number Publication date
EP2988215A4 (en) 2016-04-27
US10671447B2 (en) 2020-06-02
US20180225156A1 (en) 2018-08-09
JP6094005B2 (ja) 2017-03-15
KR101729596B1 (ko) 2017-05-11
EP2988215B1 (en) 2021-09-08
CN104156267B (zh) 2017-10-10
CN104156267A (zh) 2014-11-19
KR20160007606A (ko) 2016-01-20
JP2016522488A (ja) 2016-07-28
US20160070603A1 (en) 2016-03-10
EP2988215A1 (en) 2016-02-24
US9965335B2 (en) 2018-05-08

Similar Documents

Publication Publication Date Title
WO2014183530A1 (zh) 任务分配方法、任务分配装置及片上网络
US11516146B2 (en) Method and system to allocate bandwidth based on task deadline in cloud computing networks
EP3422646B1 (en) Method and device for multi-flow transmission in sdn network
US9503394B2 (en) Clustered dispersion of resource use in shared computing environments
CN107454017B (zh) 一种云数据中心网络中混合数据流协同调度方法
CN109614215B (zh) 基于深度强化学习的流调度方法、装置、设备及介质
US11595315B2 (en) Quality of service in virtual service networks
US20190042314A1 (en) Resource allocation
Guo et al. Oversubscription bounded multicast scheduling in fat-tree data center networks
Zhang et al. Load balancing with deadline-driven parallel data transmission in data center networks
Moreno et al. Arbitration and routing impact on NoC design
KR20120121146A (ko) 가상네트워크 환경에서의 자원 할당 방법 및 장치
Alvarez-Horcajo et al. Improving multipath routing of TCP flows by network exploration
WO2012113224A1 (zh) 多节点计算系统下选择共享内存所在节点的方法和装置
CN114996199A (zh) 众核的路由映射方法、装置、设备及介质
Szymanski Low latency energy efficient communications in global-scale cloud computing systems
Guo et al. A QoS aware multicore hash scheduler for network applications
Li et al. Congestion‐free routing strategy in software defined data center networks
González et al. Traffic Injection Regulation Protocol based on free time-slots requests
US20240028881A1 (en) Deep neural network (dnn) compute loading and traffic-aware power management for multi-core artificial intelligence (ai) processing system
EP2939382B1 (en) Distributed data processing system
WO2024021990A1 (zh) 一种路径确定的方法及相关设备
Fan et al. The QoS mechanism for NoC router by dynamic virtual channel allocation and dual-net infrastructure
US10165598B2 (en) Wireless medium clearing
Das et al. Regulating Degree of Adaptiveness for Performance-Centric NoC Routing

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14797851

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2016513212

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 2014797851

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 20157035119

Country of ref document: KR

Kind code of ref document: A