CN103257904B - Operation flow graph mapping method and device for optimizing the repair performance of a many-core system - Google Patents

Operation flow graph mapping method and device for optimizing the repair performance of a many-core system

Info

Publication number
CN103257904B
Authority
CN
China
Prior art keywords
many
flow graph
core
mapping
operation flow
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310144403.9A
Other languages
Chinese (zh)
Other versions
CN103257904A (en)
Inventor
应忍冬
陈鹰翔
叶凝
刘佩林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiao Tong University
Original Assignee
Shanghai Jiao Tong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiao Tong University filed Critical Shanghai Jiao Tong University
Priority to CN201310144403.9A priority Critical patent/CN103257904B/en
Publication of CN103257904A publication Critical patent/CN103257904A/en
Application granted granted Critical
Publication of CN103257904B publication Critical patent/CN103257904B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Hardware Redundancy (AREA)

Abstract

The embodiments of the present invention provide an operation flow graph mapping method and device for optimizing the repair performance of a many-core system. The method mainly comprises: for the operation flow graph of a many-core system, performing a reliability calculation for the task of each node; generating, according to the reliability guarantee priority and the many-core architecture resources, the number of adjacent idle cores that each node needs to reserve; and completing the mapping of the operation flow graph onto the target many-core architecture according to the resulting reliability guarantee priority list and the system operation flow graph. By laying out the computing nodes on the many-core architecture according to their reliability guarantee priorities, the proposed method can better address the latency and power consumption of fault self-repair in many-core systems, as well as the performance degradation after repair.

Description

Operation flow graph mapping method and device for optimizing the repair performance of a many-core system

Technical field

The invention relates to the field of many-core processors, and in particular to an operation flow graph mapping method and device for optimizing the repair performance of a many-core system.

Background

With the steady progress of semiconductor technology, integration density keeps rising and more and more gates can be integrated per unit area. In recent years, designers in the industry have realized that processor chip design faces two problems. First, according to Pollack's rule, the computing performance of a single chip improves only in proportion to the square root of its design complexity (gate count), which means that raising performance by increasing the design complexity of a single chip has hit a bottleneck. Second, system reliability is receiving growing attention: while a chip is in use, the failure of a single basic component or computing unit may bring down the whole system, so how to recover a system-on-chip from a failure and extend the system's lifetime has also become a problem.
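As a worked illustration of Pollack's rule as stated above, the relation can be written as follows. The symbols are notation introduced here for illustration only: P denotes single-chip performance and C denotes design complexity (gate count).

```latex
P \propto \sqrt{C}
\qquad\Longrightarrow\qquad
\frac{P(2C)}{P(C)} = \sqrt{2} \approx 1.41
```

Under this relation, doubling the gate count of a single core yields only about a 41% performance gain, which is the bottleneck the background refers to.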

A single-core system cannot meet the requirements of designers and users, whether in terms of performance improvement, the energy consumption caused by complex design, or the measures available for handling reliability failures. Increasing the number of cores and having multiple cores cooperate in parallel improves task-processing efficiency; moreover, a many-core architecture has many redundant resources, which makes self-repair of the processor system possible. Figure 1 shows a many-core architecture with a rectangular structure, which has many redundant cores, routers and communication links.

Two main techniques exist. In the first, when a fault is detected in a single core, that core is restarted with the cooperation of the control core in order to recover. In the second, when a fault is detected in a single core and the core cannot recover even after a restart, the core is removed from the many-core system and the system's workload is redistributed using redundant cores, so as to restore system functionality.

These two existing methods for improving the reliability of many-core processors have the following shortcomings. The first scheme can recover from soft errors produced by signal changes caused by the environment, high-energy particles and the like, but a restart cannot fix a fault in the core's hardware. The second scheme offers highly reliable recovery, but it provides no mapping method that improves self-repair performance. If a core fails and there is no idle core around it, its task is reassigned to a location far from the original one, or the tasks of several cores have to be moved. As a result, system performance degrades too much after repair, and self-repair takes too long and consumes too much power; the impact grows after repeated fault self-repairs.

Therefore, how to map a system's operation flow graph onto a many-core architecture so that fault self-repair latency is short and the impact on system performance after repair is small has become a problem to be solved. To address this problem, the present invention, funded by project No. 513080102, proposes an operation flow graph mapping method and device for optimizing the repair performance of a many-core system.

Summary of the invention

In view of the above deficiencies of the prior art, the present invention provides an operation flow graph mapping method for optimizing the repair performance of a many-core system. Its purpose is to solve problems such as long self-repair latency after a failure and degraded system performance after self-repair.

The present invention is realized through the following technical solution:

An operation flow graph mapping method for optimizing the repair performance of a many-core system, comprising:

for a many-core system operation flow graph whose tasks have already been assigned, performing a reliability calculation for the task of each node, and generating a reliability guarantee priority for each node task;

generating, based on the reliability guarantee priority, the number of adjacent idle cores to be reserved for each node;

completing the mapping of the operation flow graph onto the target many-core architecture based on the reliability guarantee priority list and the system operation flow graph.

Preferably, performing the reliability calculation for the task of each node of a many-core system operation flow graph whose tasks have already been assigned further comprises:

the more important the task, the higher its reliability guarantee priority;

the higher the failure probability of the task, the higher its reliability guarantee priority;

or, both factors are considered together.

Preferably, the method of generating, according to the reliability guarantee priority, the number of adjacent idle cores a node needs to occupy further comprises:

generating the adjacent idle cores takes into account the total number of task nodes and the many-core architecture resources; the number of adjacent idle cores assigned to each priority is generated automatically by the method or defined manually by the user, and is stored in the reliability guarantee priority list; the many-core architecture resources are configured by the user.

Preferably, the adjacent idle cores are characterized in that:

the range of adjacent idle cores is defined by the user.

Preferably, completing the mapping of the operation flow graph onto the target many-core architecture according to the reliability guarantee priority list and the system operation flow graph comprises the steps of:

Step 1: select the node with the highest reliability guarantee priority among the unmapped nodes;

Step 2: check whether the target many-core architecture has any unallocated cores; if so, go to step 3, otherwise go to step 5;

Step 3: check whether there is a region that satisfies the node's required number of adjacent idle cores; if not, go to step 4, otherwise go to step 6;

Step 4: lower the required number of adjacent idle cores and repeat step 3;

Step 5: among the mapped nodes that have adjacent idle cores, select the one with the lowest reliability guarantee priority, and use one of its idle cores as the mapping target;

Step 6: map the node to the region;

Step 7: check whether mapping is complete; if so, go to step 8, otherwise return to step 1;

Step 8: within each node's region, select one core by some method as the initial position and configure it, completing the system configuration;

Step 9: the remaining unconfigured cores serve as idle spare cores.

The present invention also provides an operation flow graph mapping device for optimizing the repair performance of a many-core system. Its purpose is to solve problems such as long self-repair latency after a failure and degraded system performance after self-repair.

An operation flow graph mapping device for optimizing the repair performance of a many-core system, comprising:

a reliability estimation module, configured to perform a reliability calculation for each node task of the system operation flow graph and to generate a reliability guarantee priority list;

a priority storage unit, configured to store the reliability guarantee priority list;

a mapping configuration unit, configured to map the system operation flow graph onto the target many-core architecture according to the reliability guarantee priority list and to complete the system configuration;

a mapping region storage unit, configured to store the mapping region of each task node.

Preferably, the reliability estimation module automatically determines the priority of a task according to the task's importance and failure probability;

the more important the task, the higher its reliability guarantee priority; the higher the failure probability of the task, the higher its reliability guarantee priority; or, both factors are considered together.

Preferably, the priority list generated by the reliability estimation module is expressed as the number of adjacent idle cores that need to be reserved; this number is generated automatically by the operation flow graph mapping device or configured by the user.

Preferably, the priority storage unit is modified, when the mapping allocation cannot be completed, so as to reduce the required number of reserved adjacent idle cores.

Preferably, the mapping configuration unit is further configured to send a modification request to the priority storage unit when the mapping allocation cannot be completed.

Preferably, the mapping region storage unit is further configured, during system self-repair, to direct the task to be repaired to be configured preferentially within the region allocated to the failed core.

With the operation flow graph mapping method and device provided by the embodiments of the present invention, a certain number of idle cores are reserved near computing nodes with low reliability. During fault self-repair, a repair location can therefore be found quickly in the area near the fault, which completes the self-repair and reduces repair latency. Because the relative positions of the whole system on the many-core architecture change little, the impact of self-repair on system performance is greatly reduced. The approach is especially effective when a vulnerable task node goes through repeated fault self-repairs.

Description of drawings

Figure 1 shows a 4×4 processor array architecture;

Figure 2 is a flowchart of an operation flow graph mapping method for optimizing the repair performance of a many-core system, provided by Embodiment 1 of the present invention;

Figure 3 is a schematic diagram of a reliability guarantee priority list provided by Embodiment 1 of the present invention;

Figure 4 is a detailed flowchart of the operation flow graph mapping configuration step within the method provided by Embodiment 1 of the present invention;

Figure 5 is an operation flow graph of a system provided by Embodiment 1 of the present invention;

Figure 6 is a schematic diagram of mapping and configuring nodes onto the target many-core architecture according to their reliability guarantee priorities, provided by Embodiment 1 of the present invention;

Figure 7 is a detailed structural diagram of an operation flow graph mapping device for optimizing the repair performance of a many-core system, provided by Embodiment 2 of the present invention.

Detailed description

The technical solutions in the embodiments of the present invention are described and discussed clearly and completely below with reference to the accompanying drawings. The embodiments described here are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.

To facilitate understanding of the embodiments of the present invention, several specific embodiments are explained further below with reference to the drawings; none of these embodiments limits the embodiments of the present invention.

Embodiment 1

The processing flow of the operation flow graph mapping method for optimizing the repair performance of a many-core system provided by this embodiment is shown in Figure 2 and comprises the following steps:

Step 201: for a system operation flow graph whose tasks have already been assigned, perform a reliability calculation for the task of each node. The calculation mainly considers the importance of the task and its failure probability, although other factors may also be considered. The more important the task, the higher its reliability guarantee priority; the higher the failure probability of the task, the higher its reliability guarantee priority. This step evaluates every task on these factors together and automatically outputs its reliability guarantee priority.
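The patent does not give a formula for combining importance and failure probability. A minimal sketch of step 201, assuming a weighted sum, might look as follows; the `Task` fields, the 0 to 1 scales and the equal weights are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    importance: float    # assumed scale 0..1, higher means more important
    failure_prob: float  # assumed estimate in 0..1


def priority_score(task, w_importance=0.5, w_failure=0.5):
    """Hypothetical score: grows with both importance and failure probability."""
    return w_importance * task.importance + w_failure * task.failure_prob


def rank_by_priority(tasks):
    """Return tasks ordered from highest to lowest reliability guarantee priority."""
    return sorted(tasks, key=priority_score, reverse=True)


tasks = [Task("A", 0.2, 0.1), Task("B", 0.9, 0.5), Task("C", 0.5, 0.5)]
ranked_names = [t.name for t in rank_by_priority(tasks)]
```

Any scoring function that is monotonically increasing in both factors would satisfy the two rules stated in step 201; the weighted sum is only one such choice.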

Step 202: based on the reliability guarantee priority of each node task given by step 201, together with the total number of task nodes and the resources of the target many-core architecture, determine automatically or manually the number of adjacent idle cores each node needs to reserve when it is mapped, and draw up the reliability guarantee priority list. The many-core architecture resources are not limited to the four-by-four rectangular structure shown in Figure 1; larger or smaller rectangles, circular structures and other similar structures are all included.

The higher the reliability guarantee priority of a task node, the more adjacent idle cores are assigned to it. Alternatively, a fixed number of reserved adjacent idle cores can be assigned to every node whose priority falls within a given range.
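One way to turn a priority ranking into reserved-core counts is sketched below. The round-robin budget policy, the function name and the `max_reserve` cap are assumptions for illustration; the patent only requires that higher-priority nodes receive at least as many reserved cores and that the total fits the architecture.

```python
def reserved_idle_cores(ranked_names, total_cores, max_reserve=2):
    """Hand out spare cores round-robin, highest priority first.

    ranked_names: task names ordered from highest to lowest priority.
    total_cores:  number of cores in the target architecture.
    max_reserve:  hypothetical cap on reserved neighbours per task.
    """
    spare = total_cores - len(ranked_names)  # each task needs one core to run on
    if spare < 0:
        raise ValueError("more tasks than cores")
    reserve = {name: 0 for name in ranked_names}
    for _ in range(max_reserve):
        for name in ranked_names:
            if spare == 0:
                return reserve
            reserve[name] += 1
            spare -= 1
    return reserve


# Seven tasks on a 4x4 array, using the priority order given in the embodiment.
plan = reserved_idle_cores(list("CDEABGF"), total_cores=16)
```

Because every pass walks the list from the top, a task never ends up with more reserved cores than a task ranked above it.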

A schematic diagram of the reliability guarantee priority list provided by this embodiment is shown in Figure 3. Each entry in the table contains the name of a node task and the number of adjacent idle cores that must be reserved when that node task is mapped. Because the reservation requirements of some nodes may not be satisfiable at mapping time, the number of reserved adjacent idle cores in the reliability guarantee priority list can be modified as the situation requires.

The definition of an adjacent idle core varies with the situation: an idle core whose distance from a given core is less than or equal to X is an adjacent idle core of that core, where X is an integer greater than or equal to 1.
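The adjacency rule above can be sketched as follows, assuming cores are addressed as (row, column) pairs on a mesh and that "distance" means Manhattan (hop) distance. The patent does not fix the distance metric, so that choice and the function name are assumptions.

```python
def adjacent_idle_cores(core, idle, x=1):
    """Return the idle cores within Manhattan distance x of `core`, sorted."""
    if x < 1:
        raise ValueError("x must be an integer >= 1")
    r, c = core
    return sorted(
        p for p in idle
        if p != core and abs(p[0] - r) + abs(p[1] - c) <= x
    )


idle = {(0, 1), (1, 0), (1, 1), (3, 3)}
near_1 = adjacent_idle_cores((0, 0), idle, x=1)
near_2 = adjacent_idle_cores((0, 0), idle, x=2)
```

A larger X widens the region searched during repair, trading a longer migration distance for a higher chance of finding a spare core.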

Step 203: based on the reliability guarantee priority list given by step 202, map the node tasks of the system onto the target many-core architecture one by one, in order of priority from high to low.

The flow of steps for mapping and configuring onto the target many-core architecture according to the system operation flow graph and the node reliability guarantee priorities is shown in Figure 4 and mainly comprises:

Step 400: initial step. The mapping method is given the system operation flow graph, the reliability guarantee priority of each node in the graph, and the resources of the target many-core architecture;

Step 401: from the node tasks that have not yet been mapped and configured, select the node with the highest reliability guarantee priority as the node to be mapped;

Step 402: check whether the target many-core architecture still has unallocated cores; if unallocated cores remain, go to step 403; if all cores have been allocated, go to step 405;

Step 403: according to the reliability guarantee priority list given by step 202, check whether the many-core architecture contains a region that satisfies the reliability guarantee requirement of the task node; if not, go to step 404; if so, take that region as the mapping target and go to step 406;

Step 404: in the reliability guarantee priority list, lower the reliability guarantee requirement of the node task, that is, reduce the number of adjacent idle cores reserved for the node, and execute step 403 again;

Step 405: since all cores on the many-core architecture have been allocated, examine the already allocated task nodes in order of reliability guarantee priority from low to high; if a node has been allocated a reserved idle core, use that reserved idle core as the mapping target of the node to be mapped;

Step 406: steps 403 and 405 have determined the mapping target of the node to be mapped; map the node to the target region and mark the allocated cores as allocated;

Step 407: check whether all task nodes have been mapped; if not, execute step 401 again;

Step 408: within the allocated region of each task node, select one core according to some algorithm as the configuration core and configure the corresponding task on it;

Step 409: mapping ends and the many-core system configuration is complete. All unconfigured cores are set to idle and can be configured into the system as replacement cores when a running core fails.
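Steps 401 to 409 can be sketched in Python as below. This is an illustration under stated assumptions, not the patented implementation: cores are (row, column) pairs, adjacency is Manhattan distance at most `x`, regions are searched first-fit in row-major order, and the first core of each region is taken as the configuration core. The patent leaves the region search and the core selection of step 408 to "a certain algorithm".

```python
def neighbours(core, cores, x):
    """Cores within Manhattan distance x of `core`, in sorted order."""
    r, c = core
    return [p for p in sorted(cores)
            if p != core and abs(p[0] - r) + abs(p[1] - c) <= x]


def find_region(free, need, x):
    """Step 403: first free core that has at least `need` free neighbours."""
    for core in sorted(free):
        near = neighbours(core, free, x)
        if len(near) >= need:
            return [core] + near[:need]
    return None


def map_flow_graph(priority_list, rows, cols, x=1):
    """priority_list: (name, reserved_count) pairs, highest priority first."""
    free = {(r, c) for r in range(rows) for c in range(cols)}
    regions = {}  # name -> [configuration core, reserved idle cores...]
    for name, need in priority_list:              # step 401
        if free:                                  # step 402
            region = find_region(free, need, x)   # step 403
            while region is None:                 # step 404: relax and retry
                need -= 1
                region = find_region(free, need, x)
        else:                                     # step 405
            donor = next((n for n in reversed(list(regions))
                          if len(regions[n]) > 1), None)
            if donor is None:
                raise RuntimeError("no reserved idle core left to borrow")
            region = [regions[donor].pop()]
        free.difference_update(region)            # step 406
        regions[name] = region
    placement = {n: reg[0] for n, reg in regions.items()}  # step 408
    return placement, regions                     # step 409: the rest stay idle


placement, regions = map_flow_graph([("A", 1), ("B", 1), ("C", 0)], rows=2, cols=2)
relaxed_placement, relaxed_regions = map_flow_graph([("A", 2)], rows=1, cols=2)
```

In the first call the 2×2 array is exhausted after A and B, so C borrows the reserved core of B, the lowest-priority mapped node that still has one. In the second call no core has two free neighbours, so the requirement for A is relaxed from 2 to 1 before a region is found.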

For example, Figures 5 and 6 illustrate mapping and configuring onto the target many-core architecture according to the system operation flow graph and the node reliability guarantee priorities, where:

next to each task node in the system operation flow graph, the reliability guarantee priority of that node is marked, that is, the number of adjacent idle cores to be reserved during mapping configuration;

following the order of reliability guarantee priority, nodes C, D, E, A, B, G and F are allocated in turn. Note that when F is allocated there are no unallocated cores left, so a reserved idle core of G is selected as the target core for mapping F;

within each task node's region, one core is selected according to some algorithm as the configuration core to complete the system configuration, and the remaining cores are kept idle for use during system self-repair. In particular, a failed core is preferentially repaired within the region allocated to it.
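The repair preference described above can be sketched as follows. The data layout (`placement`, `regions`, `spare`) and the fallback to the nearest global spare core are assumptions for illustration; the patent states only that the reserved region is tried first.

```python
def self_repair(task, placement, regions, spare):
    """Move `task` off its failed core.

    placement: task -> core the task currently runs on.
    regions:   task -> list of cores allocated to the task (running + reserved).
    spare:     set of idle cores not allocated to any region.
    Prefers a reserved core in the task's own region; otherwise falls back
    to the nearest spare core by Manhattan distance (an assumption).
    """
    failed = placement[task]
    regions[task].remove(failed)          # the failed core leaves the region
    if regions[task]:
        target = regions[task][0]         # reserved neighbour, already nearby
    elif spare:
        target = min(
            spare,
            key=lambda p: (abs(p[0] - failed[0]) + abs(p[1] - failed[1]), p),
        )
        spare.remove(target)
        regions[task].append(target)
    else:
        raise RuntimeError("no idle core left for repair")
    placement[task] = target
    return target


placement = {"A": (0, 0)}
regions = {"A": [(0, 0), (0, 1)]}
spare = {(3, 3)}
first = self_repair("A", placement, regions, spare)   # uses the reserved core
second = self_repair("A", placement, regions, spare)  # falls back to a spare
```

The first repair stays one hop from the original position, while the second has to migrate across the array, which is the costly case the reserved region is meant to postpone.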

As the technical solution above shows, the embodiment of the present invention performs a reliability calculation for each task node of the system operation flow graph and reserves more idle cores near nodes with low reliability. When a low-reliability task core fails and needs self-repair, an idle core can be found quickly nearby to complete the self-repair, so the system recovers quickly and efficiently. Because this avoids both relocating a task far from its original core and having to move the tasks of several cores, the impact of self-repair on system performance is greatly reduced. The approach is especially effective when a vulnerable task node goes through repeated fault self-repairs.

Embodiment 2

This embodiment provides an operation flow graph mapping device for optimizing the repair performance of a many-core system. Its structure is shown in Figure 7 and comprises the following modules:

a reliability estimation module 601, configured to perform a reliability calculation for each node task of the system operation flow graph and to generate a reliability guarantee priority list;

a priority storage unit 602, configured to store the reliability guarantee priority list;

a mapping configuration unit 603, configured to map the system operation flow graph onto the target many-core architecture according to the reliability guarantee priority list and to complete the system configuration;

a mapping region storage unit 604, configured to store the mapping region of each task node.

Specifically, the reliability estimation module 601 automatically determines the priority of a task according to the task's importance and failure probability: the more important the task, the higher its reliability guarantee priority; the higher the failure probability of the task, the higher its reliability guarantee priority. Both factors are considered together, and other factors may be taken into account as well.

Specifically, the priority list generated by the reliability estimation module 601 is expressed as the number of adjacent idle cores that need to be reserved; this number can be generated automatically or configured by the user.

Specifically, when the mapping allocation cannot be completed, the priority storage unit 602 can be modified to reduce the required number of reserved adjacent idle cores.

Specifically, the mapping configuration unit 603 is further configured to send a modification request to the priority storage unit 602 when the mapping allocation cannot be completed.

Specifically, the mapping region storage unit 604 is further configured, during system self-repair, to direct the task to be repaired to be configured preferentially within the region allocated to the failed core.

The specific steps by which the device of this embodiment performs a mapping that optimizes the repair performance of a many-core architecture system are similar to those of the method embodiment above and are not repeated here.

A person of ordinary skill in the art will understand that all or part of the flows of the above method embodiments can be carried out by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, can include the flows of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.

In summary, the embodiments of the present invention perform a reliability calculation for each task node of the system operation flow graph and reserve more idle cores near nodes with low reliability. When a low-reliability task core fails and needs self-repair, an idle core can be found quickly nearby to complete the self-repair, so the system recovers quickly and efficiently. Because this avoids both relocating a task far from its original core and having to move the tasks of several cores, the impact of self-repair on system performance is greatly reduced. The approach is especially effective when a vulnerable task node goes through repeated fault self-repairs.

The embodiments of the present invention can better solve problems of many-core systems such as long self-repair latency after a failure and degraded system performance after self-repair.

The above is only a preferred specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be determined by the protection scope of the claims.

Claims (10)

1. An operation flow graph mapping method for optimizing the repair performance of a many-core system, characterized by comprising:
for an operation flow graph of a many-core system whose tasks have already been assigned, performing a reliability calculation on the task of each node and generating a reliability guarantee priority for each node task;
generating, based on the reliability guarantee priority, the number of adjacent idle cores to be reserved for each node;
completing the mapping of the operation flow graph onto the target many-core architecture based on the reliability guarantee priority list and the system operation flow graph;
wherein completing the mapping of the operation flow graph onto the target many-core architecture according to the reliability guarantee priority list and the system operation flow graph comprises the following steps:
Step 1: select, among the unmapped nodes, the node with the highest reliability guarantee priority;
Step 2: check whether the target many-core architecture has any unallocated core; if so, go to step 3; otherwise, go to step 5;
Step 3: check whether there is a region that satisfies the node's adjacent idle core requirement; if not, go to step 4; otherwise, go to step 6;
Step 4: reduce the adjacent idle core requirement and repeat step 3;
Step 5: among the mapped nodes that hold adjacent idle cores, select the one with the lowest reliability guarantee priority and use its idle cores as the mapping target;
Step 6: map the node to the region;
Step 7: check whether the mapping is complete; if so, go to step 8; otherwise, return to step 1;
Step 8: within each node's region, select one core according to a given method as the initial position and configure it, completing the system configuration;
Step 9: the remaining unconfigured cores serve as idle spare cores.

2. The operation flow graph mapping method for optimizing the repair performance of a many-core system according to claim 1, characterized in that performing the reliability calculation on the task of each node, for an operation flow graph of a many-core system whose tasks have already been assigned, further comprises:
the more important the task, the higher its reliability guarantee priority;
the higher the failure probability of the task, the higher its reliability guarantee priority.

3. The operation flow graph mapping method for optimizing the repair performance of a many-core system according to claim 1, characterized in that generating, according to the reliability guarantee priority, the number of adjacent idle cores a node needs to occupy further comprises:
generating the adjacent idle cores takes into account the total number of task nodes and the many-core architecture resources; the number of adjacent idle cores allocated to each priority is generated automatically by the method or defined manually by the user, and is stored in the reliability guarantee priority list; the many-core architecture resources are configured by the user.

4. The operation flow graph mapping method for optimizing the repair performance of a many-core system according to claim 1, characterized in that the range of the adjacent idle cores is defined by the user.

5. An operation flow graph mapping device for optimizing the repair performance of a many-core system, characterized by comprising:
a reliability estimation module, configured to perform a reliability calculation on each node task of the system operation flow graph and to generate a reliability guarantee priority list;
a priority storage unit, configured to store the reliability guarantee priority list;
a mapping configuration unit, configured to map the system operation flow graph onto the target many-core architecture according to the reliability guarantee priority list and to complete the system configuration;
a mapping region storage unit, configured to store the mapping region of each task node.

6. The operation flow graph mapping device for optimizing the repair performance of a many-core system according to claim 5, characterized in that the reliability estimation module automatically determines the priority of a task according to the importance of the task and its failure probability;
the more important the task, the higher its reliability guarantee priority; the higher the failure probability of the task, the higher its reliability guarantee priority.

7. The operation flow graph mapping device for optimizing the repair performance of a many-core system according to claim 5, characterized in that the priority list generated by the reliability estimation module is expressed as the number of adjacent idle cores that need to be reserved, and the number of adjacent idle cores is generated automatically by the operation flow graph mapping device or configured by the user.

8. The operation flow graph mapping device for optimizing the repair performance of a many-core system according to claim 5, characterized in that, when the mapping allocation cannot be completed, the priority storage unit is modified to reduce the required number of reserved adjacent idle cores.

9. The operation flow graph mapping device for optimizing the repair performance of a many-core system according to claim 5, characterized in that the mapping configuration unit is further configured to send a modification request to the priority storage unit when the mapping allocation cannot be completed.

10. The operation flow graph mapping device for optimizing the repair performance of a many-core system according to claim 5, characterized in that the mapping region storage unit is further configured, during system self-repair, to direct the task to be repaired to be configured preferentially within the mapping region stored in the mapping region storage unit.
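The nine mapping steps of claim 1 can be sketched in Python as follows. This is an illustration under stated assumptions, not the claimed implementation: the target architecture is assumed to be a `rows` x `cols` mesh, the user-defined "adjacent" range is modeled as a Manhattan-distance `radius`, regions are searched in sorted coordinate order, and the first core of each region is taken as the initial position in step 8. The names `find_region` and `map_tasks` are hypothetical.

```python
from itertools import product


def find_region(free, want, radius):
    """Return [center, *neighbours] with `want` free neighbours, or None."""
    for center in sorted(free):
        near = [c for c in sorted(free)
                if c != center
                and abs(c[0] - center[0]) + abs(c[1] - center[1]) <= radius]
        if len(near) >= want:
            return [center] + near[:want]
    return None


def map_tasks(priority, need, rows, cols, radius=1):
    all_cores = set(product(range(rows), range(cols)))
    free = set(all_cores)
    regions = {}
    # Steps 1 and 7: visit unmapped nodes from highest to lowest priority.
    for task in sorted(priority, key=priority.get, reverse=True):
        if free:                                  # step 2
            want = need[task]
            region = find_region(free, want, radius)   # step 3
            while region is None:                 # step 4: relax and retry
                want -= 1
                region = find_region(free, want, radius)
        else:                                     # step 5: borrow a reserved core
            donors = [t for t in regions if len(regions[t]) > 1]
            if not donors:
                raise RuntimeError("more tasks than cores")
            donor = min(donors, key=priority.get)
            region = [regions[donor].pop()]
        free -= set(region)                       # step 6
        regions[task] = region
    # Step 8: assumed rule, the first core of each region hosts the task.
    placement = {t: reg[0] for t, reg in regions.items()}
    # Step 9: every core without a task is an idle spare.
    spares = all_cores - set(placement.values())
    return placement, regions, spares
```

The returned `regions` dictionary plays the role of the mapping region storage unit of claim 5: during self-repair, a failed task would first look for an idle core inside its own stored region.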
CN201310144403.9A 2013-04-24 2013-04-24 Optimize computing flow graph mapping method and the device of many core system repairing performances Active CN103257904B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310144403.9A CN103257904B (en) 2013-04-24 2013-04-24 Optimize computing flow graph mapping method and the device of many core system repairing performances

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310144403.9A CN103257904B (en) 2013-04-24 2013-04-24 Optimize computing flow graph mapping method and the device of many core system repairing performances

Publications (2)

Publication Number Publication Date
CN103257904A CN103257904A (en) 2013-08-21
CN103257904B true CN103257904B (en) 2016-05-04

Family

ID=48961838

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310144403.9A Active CN103257904B (en) 2013-04-24 2013-04-24 Optimize computing flow graph mapping method and the device of many core system repairing performances

Country Status (1)

Country Link
CN (1) CN103257904B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4120076A4 (en) * 2020-04-01 2023-04-05 Huawei Technologies Co., Ltd. Task scheduling method and apparatus
CN113220548B (en) * 2021-03-25 2024-02-09 中国航天系统科学与工程研究院 Software reliability index distribution method, medium and equipment based on reliability block diagram

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101556534A (en) * 2009-04-21 2009-10-14 浪潮电子信息产业股份有限公司 Large-scale data parallel computation method with many-core structure
CN102193830A (en) * 2010-03-12 2011-09-21 复旦大学 Many-core environment-oriented division mapping/reduction parallel programming model
US8327187B1 (en) * 2009-09-21 2012-12-04 Tilera Corporation Low-overhead operating systems

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
High Performance Fault-Tolerant Routing Algorithm for NoC-based Many-Core System;Masoumeh Ebrahimi等;《2013 21st Euromicro International Conference on Parallel,Distributed,and Network-Based Processing》;20130301;104-109 *
Parallax-A New Operating System Prototype Demonstrating Service Scaling and Service Self-Repair in Multi-core Servers;Dr.Rao Mikklineni等;《2011 20th IEEE International Workshops on Enabling Technologies:Infrastructure for Collaborative Enterprise》;20110629;462-469 *
众核处理器系统核资源动态分组的自适应调度算法;曹仰杰等;《软件学报》;20120229;第23卷(第2期);240-252 *
面向集成电路可靠性挑战的多核处理器虚拟化技术;张磊等;《信息技术快报》;20100331;第8卷(第2期);1-8 *

Also Published As

Publication number Publication date
CN103257904A (en) 2013-08-21

Similar Documents

Publication Publication Date Title
US20180246751A1 (en) Techniques to select virtual machines for migration
US10162708B2 (en) Fault tolerance for complex distributed computing operations
WO2020107829A1 (en) Fault processing method, apparatus, distributed storage system, and storage medium
US10664401B2 (en) Method and system for managing buffer device in storage system
US11144414B2 (en) Method and apparatus for managing storage system
WO2007096350A1 (en) Dynamic resource allocation for disparate application performance requirements
US9448824B1 (en) Capacity availability aware auto scaling
WO2019148716A1 (en) Data transmission method, server, and storage medium
WO2015042778A1 (en) Data migration method, data migration apparatus and storage device
WO2016165304A1 (en) Method for managing instance node and management device
CN112764661B (en) Method, apparatus and computer program product for managing a storage system
US10346269B2 (en) Selective mirroring of predictively isolated memory
CN113051104B (en) Method and related device for recovering data between disks based on erasure codes
US20170123915A1 (en) Methods and systems for repurposing system-level over provisioned space into a temporary hot spare
US20190041937A1 (en) Power allocation among computing devices
CN104753992A (en) Method, device and system for data storage and method and device for virtual platform failure recovery
US10664392B2 (en) Method and device for managing storage system
CN112506691B (en) Digital twin application fault recovery method and system for multi-energy system
CN103873516B (en) Improve the HA method and systems of physical server utilization rate in cloud computing resource pool
CN103257904B (en) Optimize computing flow graph mapping method and the device of many core system repairing performances
Khalili et al. A reliability-aware multi-application mapping technique in networks-on-chip
CN102629223B (en) Method and device for data recovery
WO2016106663A1 (en) Method for writing data into storage system and storage system
JP5510562B2 (en) Memory management method, memory management device, and memory management circuit
CN104460938A (en) System-wide power conservation method and system using memory cache

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant