CN108234188A - Service platform resource scheduling processing method and device

Info

Publication number: CN108234188A
Application number: CN201611198113.2A
Authority: CN
Inventors: 金昱任; 卞宁艳; 吴勇; 吕鹏
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Group Shanghai Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Group Shanghai Co Ltd
Priority date: 2016-12-22
Filing date: 2016-12-22
Publication date: 2018-06-29
Anticipated expiration: 2036-12-22
Also published as: CN108234188B

Abstract

The invention discloses a service platform resource scheduling processing method and device. The method includes: obtaining work index information of each resource in the service platform through a fault collection interface, and obtaining a resource fault set according to the work index information and preset health index information used to indicate whether a resource has failed, wherein the resource fault set includes the resources that have failed in the current cycle; obtaining a data flow table through a data switch, the data flow table including the resources that were in a normal working state in the previous cycle; and determining, according to the resource fault set, the data flow table, and preset scheduling rules, the resources that need to be isolated and/or recovered in the current cycle. The device is used to perform the above method. The method and device provided by the invention avoid using faulty resources during resource scheduling, thereby improving the efficiency of resource scheduling.

Description

A service platform resource scheduling processing method and device

Technical field

The invention relates to the technical field of the Internet, and in particular to a service platform resource scheduling processing method and device.

Background

In the Internet+ era, wireless communication service functions continue to be enhanced and improved, and industry SMS services are booming. For example, the industry SMS services that industry gateways provide to major customers such as banks, securities firms, and e-commerce companies mainly include verification code SMS, member notification SMS, and member marketing SMS. As industry SMS services grow in scale and importance, customers place higher requirements on the delivery speed and delivery rate of industry SMS and on the stability and security of the channel.

At present, major network operators are exploring various methods, either to reduce the probability of failure or to respond to sudden failures as effectively as possible, in order to minimize the scope and degree of the impact on services. Existing resource scheduling methods mainly focus on the condition of the resources themselves, such as CPU, memory, network bandwidth, and utilization; they take these indicators reaching a certain threshold as the key basis and schedule resources according to a predetermined policy in order to relieve the pressure on resource usage. In practice, however, many causes of failure affect the stability and normal operation of a service: besides the condition of the resources themselves, they also involve the running state of application software, link status, and service indicators. In addition, an abnormal in-use resource, a failed resource node, or a sudden failure also affects resource scheduling. Furthermore, if resources are over-scheduled or scheduled too frequently, too few resources may remain in use to carry the current traffic, which causes new failures, or the frequent scheduling may degrade system stability.

Therefore, how to provide a method that improves the efficiency of resource scheduling has become an important issue to be solved urgently in the industry.

Summary of the invention

In view of the defects in the prior art, the present invention provides a service platform resource scheduling processing method and device.

In one aspect, the present invention provides a service platform resource scheduling processing method, including:

a controller obtains work index information of each resource in the service platform, and obtains a resource fault set according to the work index information and preset health index information used to indicate whether a resource has failed, wherein the resource fault set includes the resources that have failed in the current cycle;

the controller obtains a data flow table, the data flow table including the resources that were in a normal working state in the previous cycle;

the controller determines, according to the resource fault set, the data flow table, and preset scheduling rules, the resources that need to be isolated and/or recovered in the current cycle.

In another aspect, the present invention provides a service platform resource scheduling processing device, including:

a processing unit, configured to obtain work index information of each resource in the service platform, and to obtain a resource fault set according to the work index information and preset health index information used to indicate whether a resource has failed, wherein the resource fault set includes the resources that have failed in the current cycle;

an acquisition unit, configured to acquire a data flow table, the data flow table including the resources that were in a normal working state in the previous cycle;

a processing unit, configured to determine, according to the resource fault set, the data flow table, and preset scheduling rules, the resources that need to be isolated and/or recovered in the current cycle.

The service platform resource scheduling processing method and device provided by the present invention obtain the resource fault set by comparing the work index information with the health index information, obtain the data flow table through the controller, and determine the resources that need to be isolated and/or recovered according to the resource fault set, the data flow table, and the preset scheduling rules. This avoids using faulty resources during resource scheduling and thereby improves the efficiency of resource scheduling.

Brief description of the drawings

In order to illustrate the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic diagram of the architecture of a resource scheduling system based on SDN technology according to an embodiment of the present invention;

FIG. 2 is a schematic flowchart of a service platform resource scheduling processing method according to an embodiment of the present invention;

FIG. 3 is a schematic flowchart of a service platform resource scheduling processing method according to another embodiment of the present invention;

FIG. 4 is a schematic flowchart of a service platform resource scheduling processing method according to yet another embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a service platform resource scheduling processing device according to an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of a service platform resource scheduling processing device according to another embodiment of the present invention;

FIG. 7 is a schematic structural diagram of a service platform resource scheduling processing device according to yet another embodiment of the present invention;

FIG. 8 is a schematic diagram of the physical structure of an electronic device according to an embodiment of the present invention.

Detailed description

In order to make the purpose, technical solutions, and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly below with reference to the accompanying drawings. Obviously, the described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.

FIG. 1 is a schematic diagram of the architecture of a resource scheduling system based on SDN technology according to an embodiment of the present invention, and shows one application environment of the embodiments. A software-defined network (Software Defined Network, hereinafter SDN) is a new type of network architecture and one way of implementing network virtualization. Its core technology, OpenFlow, separates the control plane of network devices from the data plane, which enables flexible control of network traffic and makes the network more intelligent as a pipeline.

As shown in FIG. 1, the service platform 105 consists of multiple application servers, which provide the resources required for processing services. The fault collection interface 104 obtains information from the service platform 105 and uploads it to the controller 102. The data switch 103 uses the OpenFlow protocol interface of SDN technology; it can upload the data flow table to the controller 102 and forwards data packets according to the instructions issued by the controller 102. The controller 102 receives the information uploaded by the fault collection interface 104 and can also send control instructions to the data switch 103 to update the flow table information of the data switch 103. The SDN Manager 101 is a manual console that provides a management human-machine interface for the SDN controller 102; through the console 101, an operator manages the parameters of the controller 102, the OpenFlow protocol parameters, the parameters of the fault collection interface 104, and so on.

FIG. 2 is a schematic flowchart of a service platform resource scheduling processing method according to an embodiment of the present invention. As shown in FIG. 2, the method includes:

S201. The controller obtains work index information of each resource in the service platform, and obtains a resource fault set according to the work index information and preset health index information used to indicate whether a resource has failed, wherein the resource fault set includes the resources that have failed in the current cycle;

Specifically, the fault collection interface obtains the work index information of each resource in the service platform and uploads it to the controller. The fault collection interface may obtain the work index information periodically, that is, once every predetermined interval, for example every 30 s. The predetermined interval can be set according to the actual situation and is not limited in the embodiments of the present invention. The work index information is preset from the perspective of service processing and includes, but is not limited to: the running state of processor processes or threads, the connectivity between the processor and the database, the message processing success rate of the processor, the message queue backlog of the processor, and the proportion of abnormal error codes of the processor. The controller receives the work index information and compares the work index information of each resource with the preset health index information. If the work index information does not satisfy the conditions of the preset health index information, the controller judges the corresponding resource to be a faulty resource. All faulty resources judged in the current cycle constitute the resource fault set. The preset health index information corresponds to the work index information and sets the conditions that a resource must satisfy; a resource must satisfy all of the conditions before the controller judges it to be in a normal working state, that is, a normal resource. Examples of such conditions are: the processor processes or threads are running normally, the connectivity between the processor and the database is normal, and the message processing success rate of the processor is not lower than 60%. The work index information and the preset health index information are set according to the actual service platform and are not limited in the embodiments of the present invention.
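The comparison in S201 can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the `WorkMetrics` fields mirror the work indexes listed above, but the class, the function names, and the backlog and error-ratio limits are assumptions made for the example. Only the 60% success-rate floor comes from the text.

```python
from dataclasses import dataclass


@dataclass
class WorkMetrics:
    """Work index information reported for one resource (field names are hypothetical)."""
    process_running: bool    # processor process/thread running state
    db_connected: bool       # processor-to-database connectivity
    success_rate: float      # message processing success rate, 0.0 to 1.0
    queue_backlog: int       # messages waiting in the processor queue
    error_code_ratio: float  # proportion of abnormal error codes, 0.0 to 1.0


# Health index conditions. Only the 60% floor is taken from the text;
# the other two limits are assumed values for illustration.
MIN_SUCCESS_RATE = 0.60
MAX_QUEUE_BACKLOG = 10_000
MAX_ERROR_CODE_RATIO = 0.20


def is_healthy(m: WorkMetrics) -> bool:
    """A resource is normal only if it satisfies every health condition."""
    return (
        m.process_running
        and m.db_connected
        and m.success_rate >= MIN_SUCCESS_RATE
        and m.queue_backlog <= MAX_QUEUE_BACKLOG
        and m.error_code_ratio <= MAX_ERROR_CODE_RATIO
    )


def build_fault_set(metrics: dict[str, WorkMetrics]) -> set[str]:
    """Collect the IDs of all resources judged faulty in the current cycle."""
    return {rid for rid, m in metrics.items() if not is_healthy(m)}


sample = {
    "app-1": WorkMetrics(True, True, 0.98, 120, 0.01),
    "app-2": WorkMetrics(True, True, 0.45, 300, 0.05),   # success rate below 60%
    "app-3": WorkMetrics(True, False, 0.99, 50, 0.00),   # database unreachable
}
fault_set = build_fault_set(sample)
```

Because every condition must hold, a single failing index is enough to place a resource in the fault set, which matches the "must satisfy all of the conditions" requirement above.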

S202. The controller obtains a data flow table, the data flow table including the resources that were in a normal working state in the previous cycle;

Specifically, the data switch uploads its stored data flow table to the controller. The data flow table includes the resources that were in a normal working state in the previous cycle, hereinafter referred to as normal resources. The controller receives the data flow table uploaded by the data switch.

S203. The controller determines, according to the resource fault set, the data flow table, and preset scheduling rules, the resources that need to be isolated and/or recovered in the current cycle.

Specifically, after obtaining the resource fault set and the data flow table, the controller determines the resources that need to be isolated and/or recovered in the current cycle according to the resource fault set, the data flow table, and the preset scheduling rules. For example, the controller may judge a resource that belongs to the resource fault set and also appears in the data flow table (that is, a resource that was normal in the previous cycle) to be a resource that needs to be isolated, and may judge a normal resource that does not belong to the resource fault set and does not appear in the data flow table to be a resource that needs to be recovered. In subsequent processing, the resources that need to be isolated in the current cycle are isolated, that is, deleted from the data flow table in which they currently appear, and the resources that need to be recovered are recovered, that is, added to the data flow table in which they do not currently appear.
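The example in S203 amounts to two set operations. The sketch below assumes that the controller knows the full set of resource IDs and that the data flow table can be reduced to the set of resource IDs it contains; both are simplifications for illustration, and the preset scheduling rules are deliberately left out.

```python
def plan_isolation_and_recovery(
    all_resources: set[str],
    fault_set: set[str],
    flow_table: set[str],
) -> tuple[set[str], set[str]]:
    """Return (to_isolate, to_recover) before any scheduling rule is applied."""
    # Faulty now, but still receiving traffic because it was normal last cycle.
    to_isolate = fault_set & flow_table
    # Normal now, but absent from the flow table because it was faulty last cycle.
    to_recover = (all_resources - fault_set) - flow_table
    return to_isolate, to_recover


to_isolate, to_recover = plan_isolation_and_recovery(
    all_resources={"a", "b", "c", "d"},
    fault_set={"b", "d"},
    flow_table={"a", "b"},
)
```

In this example resource `d` is faulty and already absent from the flow table, so it appears in neither result and simply stays isolated.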

The service platform resource scheduling processing method provided by the present invention obtains the resource fault set by comparing the work index information with the health index information, obtains the data flow table through the controller, and determines the resources that need to be isolated and/or recovered according to the resource fault set, the data flow table, and the preset scheduling rules. This avoids using faulty resources during resource scheduling and thereby improves the efficiency of resource scheduling.

FIG. 3 is a schematic flowchart of a service platform resource scheduling processing method according to another embodiment of the present invention. As shown in FIG. 3, on the basis of the above embodiment, the method further includes:

S204. The controller issues a control instruction to the data switch according to the resources that need to be isolated and/or recovered, so that the data switch updates the data flow table and forwards data to the resources in the service platform according to the updated data flow table.

Specifically, after determining the resources that need to be isolated and/or recovered, the controller issues a control instruction to the data switch to update its data flow table: the resources that need to be isolated are deleted from the data flow table, and the resources that need to be recovered are added to it. The data switch then forwards data to the resources in the service platform according to the updated data flow table.
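The effect of S204 on forwarding can be illustrated with a toy model. A real OpenFlow flow table holds match/action entries rather than a bare set of resource IDs, and the text does not say how the switch distributes traffic among the listed resources; the set representation and the round-robin distribution below are assumptions chosen only to keep the sketch short.

```python
from itertools import cycle


def update_flow_table(
    flow_table: set[str], to_isolate: set[str], to_recover: set[str]
) -> set[str]:
    """Delete isolated resources from the table and add recovered ones."""
    return (flow_table - to_isolate) | to_recover


def forward(messages: list[str], flow_table: set[str]) -> dict[str, list[str]]:
    """Distribute messages round-robin over the resources in the flow table."""
    if not flow_table:
        raise ValueError("no resource available for forwarding")
    assignment: dict[str, list[str]] = {rid: [] for rid in flow_table}
    targets = cycle(sorted(flow_table))  # sorted for a deterministic order
    for msg in messages:
        assignment[next(targets)].append(msg)
    return assignment


new_table = update_flow_table({"a", "b"}, to_isolate={"b"}, to_recover={"c"})
routed = forward(["m1", "m2", "m3"], new_table)
```

After the update, no message can reach the isolated resource `b`, because it no longer appears in the table that drives forwarding.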

The service platform resource scheduling processing method provided by the present invention obtains the resource fault set by comparing the work index information with the health index information, obtains the data flow table through the controller, and determines the resources that need to be isolated and/or recovered according to the resource fault set, the data flow table, and the preset scheduling rules. This avoids using faulty resources during resource scheduling and thereby improves the efficiency of resource scheduling. Updating the data flow table in the data switch ensures that all resources used in the data flow table are normal resources.

On the basis of the above embodiments, the scheduling rules further include:

if it is determined that the number of resources isolated at the current moment has reached the scheduling critical value, the recovered isolated resource is restored to service;

if it is determined that the number of resources isolated at the current moment has not reached the scheduling critical value, and the recovered isolated resource is not a frequently scheduled resource, the recovered isolated resource is restored to service;

if it is determined that the number of resources isolated at the current moment has not reached the scheduling critical value, and the recovered isolated resource is a frequently scheduled resource, the recovered isolated resource is not restored to service;

wherein the recovered isolated resources appearing in the current cycle are determined according to the resource fault set and the data flow table, a recovered isolated resource being a resource that was faulty in the previous cycle but is in a normal working state in the current cycle; and the scheduling critical value is the maximum number of resources in the service platform that can be isolated while the current traffic volume is still processed normally.

Specifically, the controller compares the number of resources isolated at the current moment with the scheduling critical value. If the number of isolated resources is greater than or equal to the scheduling critical value, the controller restores the recovered isolated resource to service in order to guarantee that the traffic can be processed. The number of resources isolated at the current moment may change during the current cycle. When the controller makes this comparison for the first time, the number of isolated resources equals the number of faulty resources in the previous cycle. Within the current cycle, whenever a resource is restored, the number of isolated resources decreases accordingly, and whenever a resource is isolated, it increases accordingly.

If the number of resources isolated at the current moment is less than the scheduling critical value and the recovered isolated resource is not a frequently scheduled resource, the controller restores the recovered isolated resource to service.

If the number of resources isolated at the current moment is less than the scheduling critical value and the recovered isolated resource is a frequently scheduled resource, the controller does not restore the recovered isolated resource to service.

The controller obtains all faulty resources in the current cycle from the resource fault set and, by combining them with all resources of the service platform, determines the normal resources in the current cycle. The controller obtains the normal resources of the previous cycle from the data flow table and, by combining them with all resources of the service platform, determines all faulty resources of the previous cycle. From the normal resources of the current cycle and the faulty resources of the previous cycle, the controller determines the recovered isolated resources appearing in the current cycle, that is, the resources that were faulty in the previous cycle but are in a normal working state in the current cycle.
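The three restoration rules, together with the running count of isolated resources, can be sketched as a single pass over the recovered isolated resources. The function name, the set-based flow table, and the alphabetical processing order are assumptions of this sketch; the text does not specify the order in which candidates are examined.

```python
def select_restores(
    all_resources: set[str],
    fault_set: set[str],
    flow_table: set[str],
    critical_value: int,
    frequently_scheduled: set[str],
) -> set[str]:
    """Apply the three restoration rules to each recovered isolated resource."""
    # Faulty in the previous cycle (absent from the flow table), normal now.
    candidates = (all_resources - fault_set) - flow_table
    # On the first comparison the isolated count equals last cycle's fault count.
    isolated_count = len(all_resources - flow_table)

    restored: set[str] = set()
    for rid in sorted(candidates):  # processing order is an assumption
        at_critical_value = isolated_count >= critical_value
        if at_critical_value or rid not in frequently_scheduled:
            restored.add(rid)
            isolated_count -= 1  # each restore lowers the isolated count
    return restored


restored = select_restores(
    all_resources={"a", "b", "c", "d", "e"},
    fault_set={"e"},
    flow_table={"a", "b"},
    critical_value=3,
    frequently_scheduled={"d"},
)
```

Here three resources start out isolated, so `c` is restored because the critical value is reached. The count then drops to two, and `d`, being frequently scheduled, is held back.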

The scheduling critical value is the maximum number of resources in the service platform that can be isolated while the current traffic volume is still processed normally. For example, if the service platform has 5 resources in total and 3 resources are needed to process the current traffic volume normally, the maximum number of resources that can currently be isolated is 2, and the scheduling critical value is Δ = 2. The scheduling critical value depends on the current traffic volume of the service platform and adapts as that volume changes: the larger the current traffic volume, the more resources are needed and the fewer resources can be isolated, so the smaller the scheduling critical value should be. The scheduling critical value can be determined by a formula [formula not reproduced in the source text], in which N is the current traffic volume of the service platform, M is the maximum traffic volume the service platform can process, S is the total number of resources of the service platform, and k, greater than 0 and less than or equal to 1, is the utilization rate of the service platform. To ensure normal operation of the service platform, k is usually less than 1, for example k = 90%.
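The worked example above (5 resources, 3 needed, Δ = 2) is a plain subtraction, shown below as `scheduling_critical_value`. The formula relating N, M, S, and k does not appear in the text, so `estimate_required_resources` is only a hypothetical stand-in, not the patent's formula: it assumes each resource can carry k·M/S of traffic and rounds the required count up. It is included solely to show how the critical value could shrink as traffic grows.

```python
import math


def scheduling_critical_value(total_resources: int, required_resources: int) -> int:
    """Maximum number of resources that may be isolated at the same time."""
    return max(total_resources - required_resources, 0)


def estimate_required_resources(n: float, m: float, s: int, k: float) -> int:
    """Hypothetical estimate of the resources needed for traffic volume n.

    Assumption (not from the source): each of the s resources carries at
    most k * m / s of traffic, so ceil(n * s / (k * m)) resources are needed.
    """
    if not 0 < k <= 1:
        raise ValueError("k must be greater than 0 and at most 1")
    return min(s, math.ceil(n * s / (k * m)))


delta = scheduling_critical_value(total_resources=5, required_resources=3)
required = estimate_required_resources(n=50, m=100, s=5, k=0.9)
```

Under this assumed estimate, a traffic volume of 50 out of a maximum of 100 with k = 90% needs 3 of the 5 resources, which reproduces the example's Δ = 2.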

The service platform resource scheduling processing method provided by the present invention obtains the resource fault set by comparing the work index information with the health index information, obtains the data flow table through the controller, and determines the resources that need to be isolated and/or recovered according to the resource fault set, the data flow table, and the preset scheduling rules. This avoids using faulty resources during resource scheduling and thereby improves the efficiency of resource scheduling. The scheduling rules for recovered isolated resources make it easy to decide whether a recovered isolated resource should be restored to service.

On the basis of the above embodiments, the method further includes:

obtaining the number of times the recovered isolated resource has been restored within a unit time; if the number is greater than a threshold, determining that the recovered isolated resource is a frequently scheduled resource; and if the number is not greater than the threshold, determining that the recovered isolated resource is not a frequently scheduled resource.

Specifically, a frequently scheduled resource is a resource that has been isolated or restored a number of times that reaches the threshold within a unit time. The unit time and the threshold are set according to the actual situation and are not limited in the embodiments of the present invention. The controller obtains the number of times the recovered isolated resource has been restored within the unit time. If the number is greater than the threshold, the recovered isolated resource is a frequently scheduled resource; if the number is not greater than the threshold, the recovered isolated resource is not a frequently scheduled resource. A resource that is frequently restored or isolated has been judged faulty many times within a short period. Although it has recovered each time, using it carries a high risk of another failure and makes resource scheduling less stable. Therefore, a resource judged to be frequently scheduled is not restored.

For example, the time at which each resource is isolated or restored is marked each time, and time windows of different granularity can be chosen according to the nature of the service. If a resource is isolated or restored 3 or more times within the same time window, it can be judged to be a frequently scheduled resource. For services that require high network stability of the service platform, a time window of smaller granularity is recommended.
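The time-window example can be sketched with one timestamp queue per resource. The class name, the sliding (rather than fixed) window, and the use of seconds as the time unit are assumptions of this sketch; the limit of 3 events follows the example above. Timestamps are assumed to be recorded in non-decreasing order.

```python
from collections import defaultdict, deque


class SchedulingHistory:
    """Records isolate/restore events and flags frequently scheduled resources."""

    def __init__(self, window_seconds: float, limit: int = 3) -> None:
        self.window = window_seconds
        self.limit = limit
        self._events: dict[str, deque] = defaultdict(deque)

    def record(self, resource_id: str, timestamp: float) -> None:
        """Mark the time at which a resource was isolated or restored."""
        self._events[resource_id].append(timestamp)

    def is_frequent(self, resource_id: str, now: float) -> bool:
        """True if the resource has `limit` or more events inside the window."""
        events = self._events[resource_id]
        # Drop marks that have fallen out of the window ending at `now`.
        while events and events[0] <= now - self.window:
            events.popleft()
        return len(events) >= self.limit


history = SchedulingHistory(window_seconds=300)
for t in (0, 100, 200):
    history.record("app-1", t)

frequent_at_250 = history.is_frequent("app-1", now=250)
frequent_at_350 = history.is_frequent("app-1", now=350)
```

A smaller `window_seconds` makes the check react only to tightly clustered events, so the window size should be chosen to suit the stability requirements of the service.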

本发明提供的业务平台资源调度处理方法，由于能够通过对工作指标信息和健康度指标信息的比较获得资源故障集，并通过控制器获取数据流表，并根据资源故障集、数据流表和预设的调度规则，确定需要隔离和/或恢复的资源，避免了在资源调度过程中使用故障资源，从而提高了资源调度的效率。通过对频繁调度资源的判断，进一步提高了资源调度的效率。The service platform resource scheduling processing method provided by the present invention can obtain the resource fault set by comparing the work index information and the health index information, and obtain the data flow table through the controller, and according to the resource fault set, data flow table and preset The scheduling rules are set to determine the resources that need to be isolated and/or restored, avoiding the use of faulty resources in the resource scheduling process, thereby improving the efficiency of resource scheduling. By judging frequently scheduled resources, the efficiency of resource scheduling is further improved.

若判断获知当前时刻已隔离资源的数量达到调度临界值，则计算所述业务平台中所有故障资源在预设时间段内被隔离的次数，并根据所述次数对所述所有故障资源进行优先级排序，对优先级低的故障资源进行恢复，对优先级高的故障资源进行隔离；其中，所述次数越低，故障资源对应的优先级越低；If it is determined that the number of isolated resources at the current moment reaches the scheduling critical value, then calculate the number of times that all faulty resources in the service platform are isolated within a preset time period, and prioritize all faulty resources according to the number of times sorting, recovering faulty resources with low priority, and isolating faulty resources with high priority; wherein, the lower the number of times, the lower the corresponding priority of the faulty resources;

If it is determined that the number of isolated resources at the current moment has not reached the scheduling critical value, and the faulty resource is an unisolated faulty resource, the faulty resource is isolated; an unisolated faulty resource is a resource that was not isolated in the previous cycle and has failed in the current cycle;

Wherein the faulty resources occurring in the current cycle are determined according to the resource fault set; the scheduling critical value is the maximum number of resources that can be isolated in the service platform while still ensuring that the current traffic is processed normally.

Specifically, the controller compares the number of isolated resources at the current moment with the scheduling critical value. If the number of isolated resources is greater than or equal to the scheduling critical value, low-priority faulty resources are recovered so that the number of isolated resources falls below the scheduling critical value, which ensures that the resources available in the current cycle are sufficient to process the current traffic normally. The controller identifies the low-priority faulty resources by calculating the number of times each faulty resource in the service platform has been isolated within a preset time period and prioritizing all faulty resources by that number; the lower the number, the lower the priority. Once the number of isolated resources is below the scheduling critical value, the high-priority faulty resources that were not recovered can be isolated.

The controller compares the number of isolated resources at the current moment with the scheduling critical value. If the number of isolated resources is less than the scheduling critical value, the controller determines whether the faulty resource is an unisolated faulty resource. From the data flow table the controller obtains the resources that were in a normal working state in the previous cycle; if the faulty resource appears in the data flow table, it is an unisolated faulty resource and is isolated. An unisolated faulty resource is a resource that was not isolated in the previous cycle and has failed in the current cycle.

The controller obtains the faulty resources occurring in the current cycle from the resource fault set. The scheduling critical value has been explained in the above embodiments and is not described again here.
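The protective branch of this rule (the case where the number of isolated resources has reached the scheduling critical value) can be sketched as below. This is an illustrative reading of the text, not a definitive implementation: the function name `protective_schedule`, the dictionary input, and the isolation counts in the usage note are all hypothetical. The sketch recovers just enough low-priority resources for the isolated count to drop below the critical value.

```python
def protective_schedule(isolation_counts, critical_value):
    """Split faulty resources into (to_recover, to_keep_isolated).

    isolation_counts: mapping of faulty resource -> number of times it was
        isolated within the preset time period (fewer times = lower priority).
    critical_value: the scheduling critical value (max isolatable resources).
    """
    # Lowest priority first: resources isolated the fewest times.
    ranked = sorted(isolation_counts, key=isolation_counts.get)
    # The isolated count must end up strictly below the critical value.
    max_isolated = max(0, critical_value - 1)
    n_recover = max(0, len(ranked) - max_isolated)
    return ranked[:n_recover], ranked[n_recover:]
```

With hypothetical counts `{"A": 1, "C": 2, "D": 4, "H": 3}` and a critical value of 4, the function returns `["A"]` to recover and keeps `["C", "H", "D"]` isolated.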

The service platform resource scheduling processing method provided by the present invention obtains a resource fault set by comparing the work index information with the health index information, obtains the data flow table through the controller, and determines the resources that need to be isolated and/or recovered according to the resource fault set, the data flow table, and the preset scheduling rules. This avoids using faulty resources during resource scheduling and thereby improves the efficiency of resource scheduling. The scheduling rule for faulty resources makes it straightforward to decide whether a faulty resource should be recovered or isolated.

FIG. 4 is a schematic flowchart of a service platform resource scheduling processing method according to yet another embodiment of the present invention. The method is illustrated below with reference to FIG. 4. Suppose the service platform has eight resources in total, A, B, C, D, E, F, G, and H, and the controller obtains the fault set A, C, H for the current cycle; the normal resources in the current cycle are then B, D, E, F, G. If the normal resources included in the data flow table obtained by the controller are B, C, E, F, G, then the resources that failed in the previous cycle are A, D, H.
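The bookkeeping in this example is plain set arithmetic. The short sketch below reproduces it using the resource names from the text; the variable names are chosen for illustration only.

```python
all_resources = set("ABCDEFGH")          # the eight resources of the platform
fault_set = {"A", "C", "H"}              # faulty in the current cycle
flow_table = {"B", "C", "E", "F", "G"}   # normal in the previous cycle

# Normal in the current cycle: everything not in the fault set.
normal_now = all_resources - fault_set

# Failed in the previous cycle: everything missing from the flow table.
failed_previously = all_resources - flow_table

# Recovered isolated resources: failed previously, normal now.
recovered_isolated = normal_now & failed_previously

# Unisolated faulty resources: still in the flow table, faulty now.
unisolated_faulty = fault_set & flow_table
```

Here `recovered_isolated` evaluates to `{"D"}` and `unisolated_faulty` to `{"C"}`, the two resources whose state may change in the current cycle.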

As shown in FIG. 4, after determining in step 1 whether each resource is a faulty resource, the controller first processes the normal resources of the current cycle, namely B, D, E, F, and G. Step 2 determines whether a normal resource is a recovered isolated resource. Only D failed in the previous cycle and is normal in the current cycle, so D is the recovered isolated resource; B, E, F, and G need no processing. For D, step 3 determines whether the number of isolated resources at the current moment has reached the scheduling critical value. If current service processing requires 4 resources and the service platform runs at 100% utilization, the scheduling critical value is 8 - 4 = 4. The number of isolated resources at the current moment equals the number of resources that failed in the previous cycle, namely 3, which is less than the scheduling critical value, so the controller proceeds to step 4 and determines whether D is a frequently scheduled resource. If D is a frequently scheduled resource, D is not recovered; if it is not, the general scheduling mechanism is applied to D. The general scheduling mechanism changes the state a resource had in the previous cycle, that is, it recovers an isolated resource or isolates a normal resource. If instead current service processing requires 5 resources, the scheduling critical value at step 3 is 3, which equals the number of currently isolated resources, so the controller applies the general scheduling mechanism to D directly, that is, it recovers D.
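Steps 3 and 4 for a recovered isolated resource such as D reduce to a small decision function. The sketch below is an assumption-laden illustration: the function names are hypothetical, and the critical value is computed as total minus required resources, which the text states only for the case of 100% platform utilization.

```python
def scheduling_critical_value(total_resources, required_resources):
    """Max isolatable resources, assuming 100% platform utilization."""
    return total_resources - required_resources


def should_recover(isolated_count, critical_value, frequently_scheduled):
    """Decide whether a recovered isolated resource is put back in service."""
    if isolated_count >= critical_value:
        # Capacity is at its limit: recover directly (general scheduling).
        return True
    # Otherwise recover only resources that are not frequently scheduled.
    return not frequently_scheduled
```

For the example in the text, `scheduling_critical_value(8, 4)` is 4, so with 3 isolated resources D is recovered only if it is not frequently scheduled; with 5 required resources the critical value is 3 and D is recovered regardless.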

After processing the normal resources B, D, E, F, and G, the controller processes the faulty resources of the current cycle, A, C, and H, in turn. Suppose current service processing requires 4 resources and the result of processing the normal resource D was not to recover it. For A, the controller performs step 5 and determines whether the number of isolated resources at the current moment has reached the scheduling critical value. The number of currently isolated resources is 3, less than the scheduling critical value, so step 6 determines whether A is an unisolated faulty resource. A was already a faulty resource in the previous cycle, so it is not an unisolated faulty resource, and A is not recovered. For C, the number of currently isolated resources is again 3, less than the scheduling critical value, so step 6 is performed for C. C was not isolated in the previous cycle and is faulty in the current cycle, so C is an unisolated faulty resource; the general scheduling mechanism is applied and C is isolated. For H, because C has been isolated, the number of isolated resources at the current moment is now 4, equal to the scheduling critical value, so the controller enables the protective scheduling mechanism: it calculates the number of times each faulty resource in the service platform has been isolated within a preset time period, prioritizes all faulty resources by that number, recovers faulty resources with low priority, and isolates faulty resources with high priority. The controller ranks A, C, D, and H. If A has the lowest priority, that is, it was isolated the fewest times within the preset time period, A is recovered first. After A is recovered, the number of currently isolated resources is 3, less than the scheduling critical value of 4, which satisfies the current service processing demand, and the controller keeps C, D, and H isolated.

FIG. 5 is a schematic structural diagram of a service platform resource scheduling processing device according to an embodiment of the present invention. As shown in FIG. 5, the device includes an acquisition unit 501, a receiving unit 502, and a processing unit 503, wherein:

The acquisition unit 501 is configured to obtain work index information of each resource in the service platform and to obtain a resource fault set according to the work index information and preset health index information used to indicate whether a resource has failed, wherein the resource fault set includes resources that have failed in the current cycle. The receiving unit 502 is configured to obtain a data flow table that includes resources that were in a normal working state in the previous cycle. The processing unit 503 is configured to determine, according to the resource fault set, the data flow table, and preset scheduling rules, the resources that need to be isolated and/or recovered in the current cycle.

Specifically, a fault collection interface obtains the work index information of each resource in the service platform and uploads it to the acquisition unit 501. The fault collection interface may obtain the work index information periodically, that is, once every predetermined interval, for example every 30 s; the interval can be set according to the actual situation and is not limited in this embodiment of the present invention. The work index information is preset from the perspective of service processing and includes, but is not limited to: the running status of processor processes or threads, connectivity between the processor and the database, the processor's message processing success rate, the backlog of the processor's message queue, and the proportion of processor exception error codes. The acquisition unit 501 receives the work index information and compares the work index information of each resource with the preset health index information. If the work index information does not satisfy the conditions of the preset health index information, the acquisition unit 501 determines that the corresponding resource is a faulty resource. All resources determined to be faulty in the current cycle constitute the resource fault set. The preset health index information corresponds to the work index information and defines the conditions under which a resource is considered to be in a normal working state; a resource must satisfy all of these conditions before the controller determines it to be in a normal working state, that is, a normal resource. Examples of such conditions are that processor processes or threads are running normally, that connectivity between the processor and the database is normal, and that the processor's message processing success rate is not lower than 60%. The work index information and the preset health index information are set according to the service platform actually in use and are not limited in this embodiment of the present invention.
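The comparison of work index information against health index conditions can be sketched as below. The data class, its three fields, and the function names are hypothetical and cover only the three example conditions named in the text (process status, database connectivity, and a success rate of at least 60%); a real platform would define its own indices.

```python
from dataclasses import dataclass


@dataclass
class WorkIndex:
    """Hypothetical subset of the work index information for one resource."""
    process_running: bool   # processor processes/threads running normally
    db_connected: bool      # processor-to-database connectivity
    success_rate: float     # message processing success rate, 0.0 to 1.0


def is_healthy(index, min_success_rate=0.60):
    """A resource is normal only if it satisfies ALL health conditions."""
    return (index.process_running
            and index.db_connected
            and index.success_rate >= min_success_rate)


def build_fault_set(indices):
    """Collect every resource whose work index fails the health check."""
    return {name for name, index in indices.items() if not is_healthy(index)}
```

In a periodic setup, `build_fault_set` would be called once per collection cycle (for example every 30 s) on the freshly collected indices.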

The data switch uploads its stored data flow table to the receiving unit 502. The data flow table includes the resources that were in a normal working state in the previous cycle, hereinafter referred to as normal resources. The receiving unit 502 receives the data flow table uploaded by the data switch.

After obtaining the resource fault set and the data flow table, the processing unit 503 determines the resources that need to be isolated and/or recovered in the current cycle according to the resource fault set, the data flow table, and the preset scheduling rules. For example, the controller may determine that a resource which belongs to the resource fault set and also appears in the data flow table needs to be isolated, and that a normal resource which does not belong to the resource fault set and does not appear in the data flow table needs to be recovered. In subsequent processing, the isolation operation deletes the resources to be isolated from the data flow table, and the recovery operation adds the resources to be recovered, which are absent from the data flow table, back into it.
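The basic isolate/recover decision and the resulting flow-table update can be sketched as a single function. This is a simplified illustration that models the data flow table as a set of resource names and ignores the critical-value and frequent-scheduling rules; the function name and the resource identifiers in the usage note are hypothetical.

```python
def plan_flow_table_update(all_resources, fault_set, flow_table):
    """Return (to_isolate, to_recover, updated_flow_table)."""
    # Faulty now but still receiving traffic: delete from the flow table.
    to_isolate = fault_set & flow_table
    # Normal now but absent from the flow table: add back to the flow table.
    to_recover = all_resources - fault_set - flow_table
    updated = (flow_table - to_isolate) | to_recover
    return to_isolate, to_recover, updated
```

For instance, with resources `{"r1", "r2", "r3", "r4"}`, fault set `{"r1", "r2"}`, and flow table `{"r2", "r3"}`, the plan isolates `r2`, recovers `r4`, and leaves the updated table as `{"r3", "r4"}`.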

The service platform resource scheduling processing device provided by the present invention obtains a resource fault set by comparing the work index information with the health index information, obtains the data flow table through the controller, and determines the resources that need to be isolated and/or recovered according to the resource fault set, the data flow table, and the preset scheduling rules. This avoids using faulty resources during resource scheduling and thereby improves the efficiency of resource scheduling.

FIG. 6 is a schematic structural diagram of a service platform resource scheduling processing device according to another embodiment of the present invention. As shown in FIG. 6, the device further includes:

a sending unit 504, configured to issue a control instruction to the data switch according to the resources that need to be isolated and/or recovered, so that the data switch updates the data flow table and forwards data to each resource in the service platform according to the updated data flow table.

The service platform resource scheduling processing device provided by the present invention obtains a resource fault set by comparing the work index information with the health index information, obtains the data flow table through the controller, and determines the resources that need to be isolated and/or recovered according to the resource fault set, the data flow table, and the preset scheduling rules. This avoids using faulty resources during resource scheduling and thereby improves the efficiency of resource scheduling. Updating the data flow table in the data switch ensures that all resources used in the data flow table are normal resources.

Wherein the recovered isolated resources occurring in the current cycle are determined according to the resource fault set and the data flow table; a recovered isolated resource is a resource that failed in the previous cycle but is in a normal working state in the current cycle. The scheduling critical value is the maximum number of resources that can be isolated in the service platform while still ensuring that the current traffic is processed normally.

The service platform resource scheduling processing device provided by the present invention obtains a resource fault set by comparing the work index information with the health index information, obtains the data flow table through the controller, and determines the resources that need to be isolated and/or recovered according to the resource fault set, the data flow table, and the preset scheduling rules. This avoids using faulty resources during resource scheduling and thereby improves the efficiency of resource scheduling. The scheduling rule for recovered isolated resources makes it straightforward to decide whether a recovered isolated resource should be put back into service.

FIG. 7 is a schematic structural diagram of a service platform resource scheduling processing device according to yet another embodiment of the present invention. As shown in FIG. 7, the device further includes:

a judging unit 505, configured to obtain the number of times the recovered isolated resource has been recovered within a unit time; if the number is greater than a threshold, the recovered isolated resource is determined to be a frequently scheduled resource; if the number is not greater than the threshold, the recovered isolated resource is determined not to be a frequently scheduled resource.

Specifically, a frequently scheduled resource is a resource whose number of isolations or recoveries within a unit time reaches a threshold; the unit time and the threshold are set according to the actual situation and are not limited in this embodiment of the present invention. The judging unit 505 obtains the number of times the recovered isolated resource has been recovered within the unit time. If the number is greater than the threshold, the recovered isolated resource is a frequently scheduled resource; if the number is not greater than the threshold, the recovered isolated resource is not a frequently scheduled resource. A resource that is frequently recovered or isolated has been judged faulty several times within a short period. Although it has recovered each time, putting it back into use carries a high risk of further failure and makes resource scheduling less stable. Resources judged to be frequently scheduled are therefore not recovered.

The service platform resource scheduling processing device provided by the present invention obtains a resource fault set by comparing the work index information with the health index information, obtains the data flow table through the controller, and determines the resources that need to be isolated and/or recovered according to the resource fault set, the data flow table, and the preset scheduling rules. This avoids using faulty resources during resource scheduling and thereby improves the efficiency of resource scheduling. Identifying frequently scheduled resources improves the efficiency of resource scheduling further.

If it is determined that the number of isolated resources at the current moment has reached the scheduling critical value, the number of times each resource in the service platform has been isolated within a preset time period is calculated, all faulty resources are prioritized according to that number, faulty resources with low priority are recovered, and faulty resources with high priority are isolated; the lower the number, the lower the priority of the faulty resource;

Wherein the faulty resources occurring in the current cycle are determined according to the resource fault set; the scheduling critical value is the maximum number of resources that can be isolated in the service platform while still ensuring that the current traffic is processed normally.

The controller compares the number of isolated resources at the current moment with the scheduling critical value. If the number of isolated resources is less than the scheduling critical value, the controller determines whether the faulty resource is an unisolated faulty resource. From the data flow table the controller obtains the resources that were in a normal working state in the previous cycle; if the faulty resource appears in the data flow table, it is an unisolated faulty resource and is isolated. An unisolated faulty resource is a resource that was not isolated in the previous cycle and has failed in the current cycle.

The controller obtains the faulty resources occurring in the current cycle from the resource fault set; the scheduling critical value has been explained in the above embodiments and is not described again here.

The device embodiments provided by the present invention can be used to execute the processing flows of the method embodiments above. Their functions are not described again here; refer to the detailed description of the method embodiments.

FIG. 8 is a schematic diagram of the physical structure of an electronic device according to an embodiment of the present invention. As shown in FIG. 8, the electronic device includes:

a processor 801, a memory 802, and a communication bus 803;

wherein:

the processor 801 and the memory 802 communicate with each other through the communication bus 803;

the processor 801 is configured to call program instructions in the memory 802 to execute the methods provided by the above method embodiments, for example including: obtaining work index information of each resource in the service platform, and obtaining a resource fault set according to the work index information and preset health index information used to indicate whether a resource has failed, wherein the resource fault set includes resources that have failed in the current cycle; obtaining a data flow table that includes resources that were in a normal working state in the previous cycle; and determining, according to the resource fault set, the data flow table, and preset scheduling rules, the resources that need to be isolated and/or recovered in the current cycle.

An embodiment of the present invention provides a computer program product. The computer program product includes a computer program stored on a non-transitory computer-readable storage medium, and the computer program includes program instructions. When the program instructions are executed by a computer, the computer can execute the methods provided by the above method embodiments, for example including: obtaining work index information of each resource in the service platform, and obtaining a resource fault set according to the work index information and preset health index information used to indicate whether a resource has failed, wherein the resource fault set includes resources that have failed in the current cycle; obtaining a data flow table that includes resources that were in a normal working state in the previous cycle; and determining, according to the resource fault set, the data flow table, and preset scheduling rules, the resources that need to be isolated and/or recovered in the current cycle.

An embodiment of the present invention provides a non-transitory computer-readable storage medium that stores computer instructions. The computer instructions cause a computer to execute the methods provided by the above method embodiments, for example: obtaining work index information of each resource in the service platform, and obtaining a resource fault set according to the work index information and preset health index information used to indicate whether a resource has failed, wherein the resource fault set includes resources that have failed in the current cycle; obtaining a data flow table that includes resources that were in a normal working state in the previous cycle; and determining, according to the resource fault set, the data flow table, and preset scheduling rules, the resources that need to be isolated and/or recovered in the current cycle.

Those of ordinary skill in the art will understand that all or some of the steps of the above method embodiments can be carried out by hardware under the control of program instructions. The program can be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes ROM, RAM, magnetic disks, optical discs, and other media capable of storing program code.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, or some of their technical features replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A service platform resource scheduling processing method, characterized by comprising:

The controller obtains work index information of each resource in the service platform, and obtains a resource fault set according to the work index information and preset health index information used to indicate whether a resource has failed; wherein the resource fault set includes resources that have failed in the current cycle;

The controller obtains a data flow table, and the data flow table includes resources in a normal working state in the previous cycle;

The controller determines the resources that need to be isolated and/or recovered in the current cycle according to the resource fault set, the data flow table, and a preset scheduling rule.

2. The method according to claim 1, characterized in that the method further comprises:

The controller issues a control instruction to the data switch according to the resources that need to be isolated and/or recovered, so that the data switch updates the data flow table and forwards data to each resource in the service platform according to the updated data flow table.

3. The method according to claim 1 or 2, wherein the scheduling rule comprises:

If it is determined that the number of isolated resources at the current moment reaches the scheduling critical value, the recovered isolated resource is restored;

If it is determined that the number of isolated resources at the current moment has not reached the scheduling critical value, and the recovered isolated resource is not a frequently scheduled resource, the recovered isolated resource is restored;

If it is determined that the number of isolated resources at the current moment has not reached the scheduling critical value, and the recovered isolated resource is a frequently scheduled resource, the recovered isolated resource is not restored;

Wherein the recovered isolated resource occurring in the current cycle is determined according to the resource fault set and the data flow table, and a recovered isolated resource refers to a resource that was faulty in the previous cycle but is in a normal working state in the current cycle; the scheduling critical value refers to the maximum number of resources that can be isolated in the service platform while the current business volume is still processed normally.
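The three branches of this scheduling rule reduce to a single decision per recovered isolated resource. The sketch below is an illustration under stated assumptions: the function name `should_restore` and its parameters are hypothetical, and "reaches the scheduling critical value" is read as greater than or equal to.

```python
def should_restore(isolated_count, critical_value, frequently_scheduled):
    """Decide whether a recovered isolated resource is put back into service.

    isolated_count: number of resources isolated at the current moment.
    critical_value: the scheduling critical value (maximum isolable resources).
    frequently_scheduled: True if the resource is a frequently scheduled one.
    """
    if isolated_count >= critical_value:
        # Capacity is at its limit, so the resource is restored regardless.
        return True
    # Below the limit: restore only resources that are not flapping.
    return not frequently_scheduled
```

The design intent suggested by the claim is that a resource which repeatedly fails and recovers is kept out of service while spare capacity exists, and is only brought back when the platform can no longer afford to keep it isolated.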

4. The method according to claim 3, characterized in that the method further comprises:

Obtaining the number of times the recovered isolated resource is restored within a unit time; if the number of times is greater than a threshold, determining that the recovered isolated resource is a frequently scheduled resource; if the number of times is not greater than the threshold, determining that the recovered isolated resource is not a frequently scheduled resource.
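The frequency check can be sketched as a count over a sliding window. This is only an illustration: it assumes that restore events are recorded as numeric timestamps and that "within a unit time" means the most recent window of length `unit_time` ending at `now`; the function and parameter names are hypothetical.

```python
def is_frequently_scheduled(restore_times, now, unit_time, threshold):
    """Return True if the resource was restored more than `threshold` times
    in the window (now - unit_time, now].

    restore_times: iterable of timestamps at which the resource was restored.
    """
    recent = [t for t in restore_times if now - unit_time < t <= now]
    return len(recent) > threshold
```

Note that the comparison is strict, matching the claim: a count equal to the threshold does not mark the resource as frequently scheduled.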

5. The method according to claim 1 or 2, wherein the scheduling rule comprises:

If it is determined that the number of isolated resources at the current moment reaches the scheduling critical value, the number of times each faulty resource in the service platform has been isolated within a preset time period is calculated, all faulty resources are sorted by priority according to that number of times, faulty resources with low priority are restored, and faulty resources with high priority are isolated; wherein the lower the number of times, the lower the priority of the corresponding faulty resource;

If it is determined that the number of isolated resources at the current moment has not reached the scheduling critical value, and the faulty resource is a faulty resource that has not been isolated, the faulty resource is isolated; wherein a faulty resource that has not been isolated refers to a resource that was not isolated in the previous cycle and has failed in the current cycle;

Wherein the faulty resources occurring in the current cycle are determined according to the resource fault set; the scheduling critical value refers to the maximum number of resources that can be isolated in the service platform while the current business volume is still processed normally.
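The fault-handling rule above can be sketched as follows. The claim does not state how many low-priority resources are restored, so this sketch makes a labeled assumption: once the critical value is reached, at most `critical_value` faulty resources stay isolated, chosen by highest isolation count. The function name `apply_fault_rule`, its parameters, and the tie-break by resource ID are all hypothetical.

```python
def apply_fault_rule(faulty, isolated, isolation_counts, critical_value):
    """Return (to_isolate, to_restore) for the current cycle.

    faulty: all resources that are faulty in the current cycle.
    isolated: resources that are isolated at the current moment.
    isolation_counts: times each resource was isolated in the preset period.
    """
    faulty, isolated = set(faulty), set(isolated)
    if len(isolated) < critical_value:
        # Below the critical value: isolate faults that are not isolated yet.
        return faulty - isolated, set()
    # At the critical value: a higher isolation count means a higher priority.
    ranked = sorted(faulty, key=lambda r: (-isolation_counts.get(r, 0), r))
    keep = set(ranked[:critical_value])  # assumption: cap at the critical value
    return keep - isolated, (isolated & faulty) - keep
```

With faulty resources `a`, `b`, and `c`, isolated resources `a` and `b`, isolation counts of 1, 5, and 3 respectively, and a critical value of 2, the sketch isolates `c` and restores `a`, because `a` has been isolated least often.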

6. A service platform resource scheduling processing device, characterized in that it comprises:

An obtaining unit, configured to obtain the work index information of each resource in the service platform, and to obtain a resource fault set according to the work index information and preset health index information used to indicate whether a resource has failed; wherein the resource fault set includes the resources that have failed in the current cycle;

a receiving unit, configured to obtain a data flow table, the data flow table including resources in a normal working state in the previous cycle;

A processing unit, configured to determine the resources that need to be isolated and/or recovered in the current cycle according to the resource fault set, the data flow table, and a preset scheduling rule.

7. The device according to claim 6, further comprising:

A sending unit, configured to send a control instruction to the data switch according to the resources that need to be isolated and/or recovered, so that the data switch updates the data flow table and forwards data to each resource in the service platform according to the updated data flow table.

8. The device according to claim 6 or 7, wherein the scheduling rule comprises:

If it is determined that the number of isolated resources at the current moment reaches the scheduling critical value, the recovered isolated resource is restored;

9. The device according to claim 8, further comprising:

A judging unit, configured to obtain the number of times the recovered isolated resource is restored within a unit time; if the number of times is greater than a threshold, to determine that the recovered isolated resource is a frequently scheduled resource; and if the number of times is not greater than the threshold, to determine that the recovered isolated resource is not a frequently scheduled resource.

10. The device according to claim 6 or 7, wherein the scheduling rule comprises:

Wherein the faulty resources occurring in the current cycle are determined according to the resource fault set; the scheduling critical value refers to the maximum number of resources that can be isolated in the service platform while the current business volume is still processed normally.