WO2021208184A1 - Method, system and central server for transferring and restoring node traffic - Google Patents

Method, system and central server for transferring and restoring node traffic

Info

Publication number
WO2021208184A1
WO2021208184A1 PCT/CN2020/091868 CN2020091868W WO2021208184A1 WO 2021208184 A1 WO2021208184 A1 WO 2021208184A1 CN 2020091868 W CN2020091868 W CN 2020091868W WO 2021208184 A1 WO2021208184 A1 WO 2021208184A1
Authority
WO
WIPO (PCT)
Prior art keywords
cluster
traffic
current
information
redundant
Prior art date
Application number
PCT/CN2020/091868
Other languages
English (en)
French (fr)
Inventor
郭林斌
Original Assignee
网宿科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 网宿科技股份有限公司
Publication of WO2021208184A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0668 Management of faults, events, alarms or notifications using network fault recovery by dynamic selection of recovery network elements, e.g. replacement by the most appropriate element after failure
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network

Definitions

  • The present invention relates to the field of Internet technology, and in particular to a method, system and central server for transferring and restoring node traffic.
  • A cluster may have faulty nodes while providing services to customers.
  • When a node in the cluster fails, it is usually necessary to transfer the traffic of the failed node to other normal nodes so that customer services can continue to be provided normally.
  • At present, the traffic of the faulty node is usually distributed according to the load conditions of the nodes in other clusters. However, transferring traffic based on load alone may leave the receiving nodes unable to handle the transferred traffic well.
  • In addition, when the failed node returns to normal, the transferred traffic is currently restored to it all at once, which is likely to cause the restored node to fail again due to excessive load.
  • The purpose of this application is to provide a method, system and central server for transferring and restoring node traffic, which can reasonably transfer the traffic of a faulty node and can prevent the node from failing again when it returns to normal.
  • One aspect of the present application provides a method for transferring node traffic.
  • The method includes: obtaining cluster information of each redundant cluster when a node in the current cluster fails; determining the cluster weight value of each of the redundant clusters based on the cluster information; and determining, from the redundant clusters according to the cluster weight values, the target cluster to be transferred into, and transferring the traffic of the failed node in the current cluster into the target cluster.
  • Another aspect of the present application provides a node traffic transfer system.
  • The system includes: a cluster information acquisition unit, configured to acquire cluster information of each redundant cluster if a node in the current cluster fails; a cluster weight value determining unit, configured to determine the cluster weight value of each of the redundant clusters based on the cluster information; and a traffic transfer unit, configured to determine, from the redundant clusters according to the cluster weight values, the target cluster to be transferred into, and to transfer the traffic of the failed node in the current cluster into the target cluster.
  • Another aspect of the present application provides a method for restoring node traffic.
  • The method includes: obtaining cluster information and a cluster weight value of the current cluster, and judging, based on the cluster information and the cluster weight value, whether the current cluster meets the recovery conditions; if the current cluster meets the recovery conditions, recovering the traffic to be recovered in batches according to a preset bandwidth ratio; and, when recovering traffic according to the current bandwidth ratio, adding the current cluster to the coverage clusters of the traffic to be recovered and removing the standby clusters of the current cluster from the coverage clusters in batches, so as to complete the recovery of the traffic to be recovered.
  • Another aspect of the present application provides a node traffic restoration system.
  • The system includes: a recovery condition determination unit, configured to obtain cluster information and a cluster weight value of the current cluster, and to judge, based on the cluster information and the cluster weight value, whether the current cluster meets the recovery conditions; a batch recovery unit, configured to recover the traffic to be recovered in batches according to a preset batch recovery strategy if the current cluster meets the recovery conditions; and a gradual recovery unit, configured to add the current cluster to the coverage clusters of the traffic to be recovered when performing traffic recovery according to the current batch recovery strategy, and to remove the standby clusters of the current cluster from the coverage clusters in batches, so as to complete the recovery of the traffic to be recovered.
  • The central server includes a memory and a processor.
  • The memory is used to store a computer program.
  • When the computer program is executed by the processor, the foregoing method for restoring node traffic is implemented.
  • The technical solutions provided by one or more embodiments of the present application can obtain cluster information of the other redundant clusters when a node in the current cluster fails.
  • The cluster information can reflect several aspects of a redundant cluster, such as its equipment, network, and alarm information.
  • Based on the cluster information, the cluster weight value of each redundant cluster can be determined.
  • The cluster weight value can accurately represent the ability of the redundant cluster to accept traffic.
  • According to the cluster weight values, target clusters with better performance can be selected from the redundant clusters and the traffic of the faulty node can be transferred to them, so that the traffic of the faulty node is reasonably distributed and can be processed normally.
  • In addition, the cluster information and cluster weight value of the failed cluster can be detected in real time to judge whether the current cluster meets the recovery conditions.
  • If the recovery conditions are met, the traffic to be recovered can be recovered in batches.
  • FIG. 1 is a flowchart of the steps of a method for transferring node traffic in an embodiment of the present invention;
  • FIG. 2 is a schematic diagram of the functional modules of a node traffic transfer system in an embodiment of the present invention;
  • FIG. 3 is a flowchart of a method for restoring node traffic in an embodiment of the present invention;
  • FIG. 4 is a schematic diagram of the functional modules of a node traffic restoration system in an embodiment of the present invention.
  • This application provides a method for transferring and restoring node traffic.
  • The method can be applied to each cluster of a CDN system.
  • The method for transferring node traffic may include the following steps.
  • The clusters in the CDN system can provide services for different domain names in different regions.
  • The customers served by the nodes in a cluster can be identified by a combination of region and domain name.
  • For example, the nodes in a cluster can serve Baidu domain names in Central China, Baidu Map domain names in North China, and Tencent domain names in South China.
  • The cluster information may include various contents.
  • The cluster information includes at least one of the following: the health value of the equipment in the cluster; the health value of the network in the cluster; the proportion of redundant bandwidth in the cluster; the restriction information that characterizes traffic transfer in the cluster; the chain switching information of the cluster; the global alarm information of the cluster; the cluster status information; the alarm switching information within the cluster; and the local alarm information of the cluster.
  • The health value of the equipment in the cluster is a parameter value that characterizes the availability of the machines in the cluster.
  • The value can range from 0 to 100, where 0 represents the worst availability and 100 represents the best availability. If the health value of the equipment in the cluster cannot be obtained normally, the corresponding parameter value can be -1.
  • The health value of the network in the cluster is a parameter value that characterizes the network availability of the machines in the cluster.
  • The value can range from 0 to 100, where 0 represents the worst network availability and 100 represents the best network availability. If the health value of the network in the cluster cannot be obtained normally, the corresponding parameter value can be -1.
  • The proportion of redundant bandwidth in the cluster can be calculated by the formula: 1 - (channel bandwidth / rated bandwidth in the cluster). The normal range of the value is therefore 0 to 1. If the channel bandwidth is not collected when the faulty machine is removed, the corresponding proportion of redundant bandwidth can be -1.
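  • The redundant-bandwidth formula above can be sketched in Python as follows. This is only an illustration: the function name, the use of `None` for "not collected", and the guard against a non-positive rated bandwidth are assumptions that do not come from the application.

```python
from typing import Optional

UNAVAILABLE = -1  # sentinel the text uses when a value cannot be collected


def redundant_bandwidth_ratio(channel_bw: Optional[float], rated_bw: float) -> float:
    """Return 1 - channel_bw / rated_bw, or -1 when the input is unusable.

    `channel_bw is None` models "channel bandwidth not collected";
    rejecting rated_bw <= 0 is an added safety assumption.
    """
    if channel_bw is None or rated_bw <= 0:
        return UNAVAILABLE
    return 1 - channel_bw / rated_bw
```

  • For instance, a cluster using 25 units of a rated 100 units would have a redundant bandwidth proportion of 0.75.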
  • The restriction information that characterizes traffic transfer in the cluster reflects the capacity of the nodes in the cluster to accept traffic.
  • The restriction information may include refusing traffic transfer-in and reducing traffic transfer-in.
  • Refusing traffic transfer-in means that no node in the cluster accepts traffic transferred in from outside.
  • Reducing traffic transfer-in means that the nodes in the cluster take on as little external traffic as possible.
  • The chain switching information of the cluster characterizes the stability of the cluster. Specifically, it can be determined as follows: after the current redundant cluster receives transferred traffic, if the current redundant cluster generates failure alarm information within a specified time period, its chain switching information is set to a first value; if it generates no failure alarm information within the specified time period, its chain switching information is set to a second value. In an actual application, the first value may be 0 and the second value may be 100. For example, a node in the original cluster fails and its traffic is transferred to a backup cluster, and the backup cluster generates a failure alarm within 24 hours. The backup cluster is then called a chain switching cluster, and its chain switching information is assigned the value 0.
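  • A minimal sketch of the chain switching rule, assuming alarm times are available as `datetime` objects. The function and parameter names are hypothetical; the 24-hour window and the values 0 and 100 follow the example in the text.

```python
from datetime import datetime, timedelta
from typing import Iterable


def chain_switching_value(
    transfer_time: datetime,
    alarm_times: Iterable[datetime],
    window: timedelta = timedelta(hours=24),
    first_value: int = 0,
    second_value: int = 100,
) -> int:
    """Return first_value if any failure alarm falls inside the window
    that starts when traffic was transferred in, else second_value."""
    deadline = transfer_time + window
    alarmed = any(transfer_time <= t <= deadline for t in alarm_times)
    return first_value if alarmed else second_value
```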
  • The global alarm information of the cluster indicates whether alarms have been raised for all domain names in all regions of the current redundant cluster.
  • The local alarm information of the cluster indicates whether alarms have been raised for some regions and some domain names in the current redundant cluster.
  • The cluster status may be one of two states: normal or faulty.
  • The alarm switching information in the cluster characterizes the number of times, within a period of time, that a failure alarm occurred in the cluster and traffic scheduling was performed. Specifically, the number of traffic scheduling events occurring in the current redundant cluster within a specified time period may be counted, and the alarm switching information of the current redundant cluster may be generated from that count.
  • For example, if no traffic scheduling occurs within the specified time period, the value of the alarm switching information can be 100; if it occurs once, the value can be 90; if it occurs twice, the value can be 80; if it occurs 3 times, the value can be 70; and if it occurs more than 3 times, the value can be 60.
  • The correspondence between the number of fault-triggered traffic scheduling events and the value can be adjusted flexibly according to the actual situation and is not limited here.
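  • The example mapping above can be written as a small lookup. The function name is hypothetical, and the table holds only the example values from the text, which the application says may be adjusted.

```python
def alarm_switching_value(scheduling_count: int) -> int:
    """Map the number of fault-triggered traffic scheduling events in the
    specified period to an alarm switching value (example values only)."""
    if scheduling_count < 0:
        raise ValueError("scheduling_count must be non-negative")
    table = {0: 100, 1: 90, 2: 80, 3: 70}
    # More than 3 occurrences falls through to the lowest example value.
    return table.get(scheduling_count, 60)
```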
  • The above cluster information can be managed and maintained by devices within the cluster, or maintained regularly by the central control system of the CDN. Each item of cluster information can therefore be obtained from the corresponding device or system.
  • The obtained cluster information can be analyzed to evaluate the cluster weight value, which characterizes the ability of the redundant cluster to accept traffic. Specifically, if the restriction information of the current redundant cluster indicates that traffic transfer-in is refused, or the current redundant cluster has global alarm information, or the status information of the current redundant cluster indicates that the cluster is abnormal, the current redundant cluster does not have the ability to accept traffic. In this case, the cluster weight value of the current redundant cluster may be set to 0.
  • If the restriction information of the current redundant cluster indicates reduced traffic transfer-in, the current redundant cluster can accept traffic transferred in from outside but strictly limits its size.
  • In this case, the cluster weight value of the current redundant cluster may be set to a small preset value.
  • For example, the small preset value may be 5 (out of 100).
  • In other cases, the cluster weight value of the current redundant cluster can be calculated by weighted summation. Specifically, the information value represented by each item of cluster information of the current redundant cluster and the preset allocation ratio of each item can be identified; the information values are then weighted by their allocation ratios and summed, and the result is used as the cluster weight value of the current redundant cluster.
  • The cluster weight value can be computed from the information values of the cluster information and the corresponding allocation ratios (P1 to P5) as follows:
  • cluster weight value = (health value of equipment in the cluster * P1) + (health value of the network in the cluster * P2) + (proportion of redundant bandwidth in the cluster * 100 * P3) + (chain switching information of the cluster * P4) + (alarm switching information within the cluster * P5). In the example given in this embodiment, the calculation yields a cluster weight value of 73.
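  • The weighting rules can be sketched as below. The `ClusterInfo` fields and, in particular, the allocation ratios in `DEFAULT_RATIOS` are hypothetical placeholders: the text does not state the actual values of P1 to P5, so the figures here are not the ones that produce 73. The sketch also does not handle the -1 "unavailable" sentinel values.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ClusterInfo:
    device_health: float        # 0..100
    network_health: float       # 0..100
    redundant_bw_ratio: float   # 0..1
    chain_switching: float      # e.g. 0 or 100
    alarm_switching: float      # e.g. 60..100
    refuses_transfer: bool = False
    reduced_transfer: bool = False
    global_alarm: bool = False
    status_normal: bool = True


# Hypothetical allocation ratios P1..P5 (assumed; they sum to 1).
DEFAULT_RATIOS: Tuple[float, ...] = (0.2, 0.2, 0.3, 0.15, 0.15)
REDUCED_TRANSFER_WEIGHT = 5.0  # the "small preset value" from the text


def cluster_weight(info: ClusterInfo, ratios: Tuple[float, ...] = DEFAULT_RATIOS) -> float:
    # A cluster that cannot accept traffic at all gets weight 0.
    if info.refuses_transfer or info.global_alarm or not info.status_normal:
        return 0.0
    # A cluster that accepts only limited traffic gets the small preset value.
    if info.reduced_transfer:
        return REDUCED_TRANSFER_WEIGHT
    p1, p2, p3, p4, p5 = ratios
    return (
        info.device_health * p1
        + info.network_health * p2
        + info.redundant_bw_ratio * 100 * p3
        + info.chain_switching * p4
        + info.alarm_switching * p5
    )
```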
  • S15: Determine the target cluster to be transferred into from the redundant clusters according to the cluster weight values, and transfer the traffic of the failed node in the current cluster into the target cluster.
  • The target clusters suitable for the inbound traffic can be selected from the redundant clusters according to the cluster weight values. Specifically, the redundant clusters can be sorted by cluster weight value in descending order, and the top-ranked redundant clusters are selected as the target clusters.
  • Alternatively, the redundant clusters can first be preliminarily screened and sorted, and then sorted by cluster weight value.
  • Candidate clusters can be filtered from the redundant clusters according to the cluster weight values and cluster information. Specifically, redundant clusters with a cluster weight value of 0 can be eliminated. Then the traffic domain name and traffic zone corresponding to the failed node can be identified; among the remaining redundant clusters, those with local alarms for that traffic domain name and traffic zone are found and also eliminated.
  • For example, the traffic domain name corresponding to the failed node may be baidu.com, and the traffic zone may be Central China.
  • If local alarm information for baidu.com in Central China also appears in some of the redundant clusters, those clusters cannot handle the traffic of baidu.com in Central China normally. There is therefore no need to transfer the traffic of the failed node to them, and they can be excluded from the options. Finally, the remaining redundant clusters are taken as candidate clusters.
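  • The candidate filtering step could look roughly like this. The `RedundantCluster` type and the representation of local alarms as a set of `(domain, zone)` pairs are assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class RedundantCluster:
    name: str
    weight: float
    # (traffic domain name, traffic zone) pairs with an active local alarm
    local_alarms: Set[Tuple[str, str]] = field(default_factory=set)


def filter_candidates(
    clusters: List[RedundantCluster], domain: str, zone: str
) -> List[RedundantCluster]:
    """Drop clusters with weight 0 and clusters that have a local alarm
    for the failed node's traffic domain name and traffic zone."""
    return [
        c for c in clusters
        if c.weight > 0 and (domain, zone) not in c.local_alarms
    ]
```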
  • The resource type corresponding to the failed node may also be identified.
  • The resource type can be, for example, video, audio, picture, or text.
  • Among the candidate clusters, the clusters matching the resource type can be queried and their priority increased.
  • Here, matching the resource type means that the resource types served by the cluster are the same as, or include, the resource type of the failed node. When such clusters take over the traffic of the failed node, the traffic is of a type they already serve, so it can be processed better.
  • For clusters matching the resource type, the sorting priority can be raised appropriately, and the amount by which it is raised can be set as needed. Alternatively, when special requirements for redundant resources are identified, such clusters can be set to the highest or the lowest priority during selection; this can be set flexibly according to the actual situation.
  • In addition, candidate clusters that share the same main-level domain name as the traffic of the faulty node can be preferred.
  • Specifically, the main-level domain name corresponding to the failed node may be identified, the intersection clusters among the candidate clusters that overlap with that main-level domain name may be determined, and the intersection clusters may be ranked before the other candidate clusters. By identifying whether a candidate cluster intersects with the main-level domain name of the faulty traffic, the candidate clusters can be preliminarily sorted, which reduces back-to-source behavior and improves the efficiency of traffic processing.
  • The intersection clusters and the other clusters may then each be sorted by regional level.
  • The regional level refers to the regional relationship between the candidate cluster and the cluster where the faulty node is located.
  • From high to low, the regional levels can be defined by whether the candidate cluster is in the same region as, and belongs to the same operator as, the cluster where the faulty node is located.
  • In this way, the candidate clusters among the intersection clusters can be sorted, and so can the candidate clusters among the non-intersection clusters. After sorting by regional level, several candidate clusters may share the same regional level.
  • In that case, the clusters within the same regional level can be sorted by cluster weight value.
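  • The three-level ordering described above (intersection clusters first, then regional level, then cluster weight value) can be sketched with a tuple sort key. Encoding the regional level as an integer rank, with 0 as the highest level, is an assumption; the resource-type priority adjustment is left out of this sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    shares_main_domain: bool  # True for an "intersection cluster"
    region_rank: int          # assumed encoding: 0 = highest regional level
    weight: float             # cluster weight value


def rank_candidates(candidates: List[Candidate]) -> List[Candidate]:
    """Sort by: intersection clusters first, then regional level from
    high to low, then cluster weight value in descending order."""
    return sorted(
        candidates,
        key=lambda c: (not c.shares_main_domain, c.region_rank, -c.weight),
    )
```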
  • After sorting, the target clusters to be transferred into can be determined from the candidate clusters according to the sorting result.
  • The number of target clusters can be chosen according to the size of the transferred traffic. Specifically, the peak bandwidth of the transferred traffic within 24 hours can be counted, and the number of target clusters determined from the size of that peak bandwidth. Generally speaking, the number of target clusters is directly proportional to the peak bandwidth.
  • The transferred traffic can then be reasonably distributed among the target clusters. Specifically, the traffic domain name and traffic zone corresponding to the failed node can be identified, and the global peak bandwidth of that traffic domain name and traffic zone within a specified time period can be counted. The bandwidth taken by each node in the target clusters can then be determined from the number of nodes currently covering that traffic domain name and traffic zone and the number of nodes in the target clusters to be transferred into. In practical applications, the bandwidth taken by each node in the target clusters can be calculated by the following formula:
  • bandwidth taken by each node = global peak bandwidth within the specified time period / (number of nodes currently covered - 1 + number of nodes in the target clusters to be transferred into)
  • Once the bandwidth is determined, the traffic of the failed node can be transferred to each target cluster.
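  • As a worked sketch of the per-node bandwidth formula (the function name and the error handling for a non-positive denominator are assumptions):

```python
def bandwidth_per_node(global_peak_bw: float, covered_nodes: int, target_nodes: int) -> float:
    """global peak bandwidth / (covered nodes - 1 + target nodes).

    The "- 1" removes the failed node from the nodes currently covered.
    """
    denominator = covered_nodes - 1 + target_nodes
    if denominator <= 0:
        raise ValueError("no nodes available to take the traffic")
    return global_peak_bw / denominator
```

  • For example, with a global peak bandwidth of 900, 4 nodes currently covered (one of them failed) and 6 nodes in the target clusters, each node takes 900 / 9 = 100.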
  • As described above, when a node in the current cluster fails, the cluster information of the other redundant clusters can be obtained.
  • The cluster information can reflect several aspects of a redundant cluster, such as its equipment, network, and alarm information.
  • Based on the cluster information, the cluster weight value of each redundant cluster can be determined.
  • The cluster weight value can accurately represent the ability of the redundant cluster to accept traffic.
  • According to the cluster weight values, target clusters with better performance can be selected from the redundant clusters and the traffic of the faulty node can be transferred to them, so that the traffic of the faulty node is reasonably distributed and can be processed normally.
  • This application also provides a node traffic transfer system. The system includes:
  • a cluster information acquiring unit, configured to acquire cluster information of each redundant cluster if a node in the current cluster fails;
  • a cluster weight value determining unit, configured to determine the cluster weight value of each redundant cluster based on the cluster information; and
  • a traffic transfer unit, configured to determine the target cluster to be transferred into from the redundant clusters according to the cluster weight values, and to transfer the traffic of the failed node in the current cluster into the target cluster.
  • This application also provides a method for restoring node traffic. Please refer to FIG. 3.
  • The method may include the following steps.
  • S21: Obtain the cluster information and the cluster weight value of the current cluster, and determine whether the current cluster meets the recovery conditions according to the cluster information and the cluster weight value.
  • The failed current cluster can be checked periodically, combining its cluster information with the cluster weight value calculated in the above manner to determine whether it meets the recovery conditions.
  • Specifically, if the cluster information of the current cluster indicates that the current cluster has had no alarm or failure within a specified duration, the cluster information indicates that the redundant bandwidth of the current cluster is sufficient for the recovery bandwidth to be undertaken, and the cluster weight value of the current cluster is greater than or equal to the specified weight threshold, it can be determined that the current cluster meets the recovery conditions.
  • The specified duration can be set flexibly according to actual needs; for example, it can be 24 hours.
  • The recovery bandwidth to be undertaken can be determined from the total bandwidth transferred out of the current cluster. Specifically, the total faulty bandwidth transferred out of the current cluster and the number of nodes currently covering that bandwidth can be counted. The recovery bandwidth that each node in the current cluster needs to undertake can then be calculated from the total bandwidth and the number of nodes.
  • In practice, the total transferred-out bandwidth can be multiplied by a proportional coefficient greater than 1 (for example, 1.2) to obtain an upper limit on the transferred-out bandwidth. The number of nodes currently covering the transferred traffic is then counted. Since nodes in the current cluster are about to resume work, the number of nodes actually covering the traffic is the counted number plus the number of nodes expected to be restored in the current cluster. Finally, the upper limit is divided by the number of nodes actually covering the traffic to obtain the recovery bandwidth that each node in the current cluster needs to undertake.
  • The recovery bandwidth that each node needs to undertake is multiplied by the number of nodes recovered in the current cluster to obtain the recovery bandwidth that the current cluster needs to undertake. If the redundant bandwidth of the current cluster is greater than or equal to this value, the current cluster is considered to meet the prerequisite for transferring part of the traffic back.
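  • The recovery bandwidth calculation can be sketched as two small functions. All names are hypothetical; the coefficient 1.2 is the example value from the text.

```python
def per_node_recovery_bandwidth(
    total_called_out_bw: float,
    covering_nodes: int,
    nodes_to_restore: int,
    coefficient: float = 1.2,
) -> float:
    """Upper limit (total * coefficient) divided by the number of nodes
    that will actually cover the traffic once the cluster resumes work."""
    limit = total_called_out_bw * coefficient
    actual_nodes = covering_nodes + nodes_to_restore
    if actual_nodes <= 0:
        raise ValueError("at least one covering node is required")
    return limit / actual_nodes


def cluster_recovery_bandwidth(
    total_called_out_bw: float,
    covering_nodes: int,
    nodes_to_restore: int,
    coefficient: float = 1.2,
) -> float:
    """Per-node recovery bandwidth times the number of restored nodes."""
    per_node = per_node_recovery_bandwidth(
        total_called_out_bw, covering_nodes, nodes_to_restore, coefficient
    )
    return per_node * nodes_to_restore
```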
  • The specified weight threshold can be set flexibly according to actual conditions.
  • For example, the specified weight threshold can be 30 points.
  • Bandwidth the restored cluster needs to undertake = (domain name bandwidth called out from the failed cluster + total area bandwidth) * 1.2 / (the number of IPs in the current area + 1).
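  • Combining the three recovery conditions of step S21 gives a check like the following. The parameter names are assumptions; the defaults of 24 hours and 30 points are the example values from the text.

```python
def meets_recovery_conditions(
    hours_since_last_alarm: float,
    redundant_bw: float,
    required_recovery_bw: float,
    cluster_weight: float,
    quiet_hours: float = 24.0,
    weight_threshold: float = 30.0,
) -> bool:
    """True only if the cluster has been alarm-free for the specified
    duration, has enough redundant bandwidth for the recovery bandwidth,
    and its cluster weight value reaches the threshold."""
    return (
        hours_since_last_alarm >= quiet_hours
        and redundant_bw >= required_recovery_bw
        and cluster_weight >= weight_threshold
    )
```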
  • If the current cluster meets the recovery conditions, the traffic can be recovered in batches, avoiding the risk that the current cluster fails again when all the traffic is recovered at once.
  • For example, the traffic can be recovered in batches according to a preset bandwidth ratio.
  • Alternatively, the traffic can be recovered in batches according to a custom alarm name and domain name.
  • Specifically, a bandwidth ratio can be set for the transferred traffic, and the product of the transferred traffic and the bandwidth ratio is used as the traffic to be restored in the current batch.
  • If the transferred traffic involves multiple domain names and multiple regions, it can be restored in batches by combination of region and domain name. For example, if the transferred traffic is Baidu traffic in Central China, Baidu Map traffic in North China, and Tencent traffic in South China, the traffic of these three region and domain name combinations can be restored in three batches.
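  • Both batching approaches can be sketched briefly. The function names are hypothetical, the requirement that the ratios sum to 1 is an added assumption, and the domain names used in any call are placeholders.

```python
from typing import Dict, List, Sequence, Tuple


def batches_by_ratio(transferred_bw: float, ratios: Sequence[float]) -> List[float]:
    """Each batch restores transferred_bw * ratio.

    Assumption: the ratios of all batches together sum to 1.
    """
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    return [transferred_bw * r for r in ratios]


def batches_by_region_domain(
    flows: Sequence[Tuple[str, str, float]]
) -> Dict[Tuple[str, str], float]:
    """Group (region, domain, bandwidth) records so that each
    region and domain name combination forms one recovery batch."""
    batches: Dict[Tuple[str, str], float] = {}
    for region, domain, bw in flows:
        batches[(region, domain)] = batches.get((region, domain), 0.0) + bw
    return batches
```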
  • In addition, each domain name corresponding to the traffic to be restored, the priority of each domain name, and the size of the channel bandwidth under each domain name can be identified.
  • The restoration can then be performed in batches according to the priority of each domain name and the size of the channel bandwidth.
  • For example, the domain names that need to be restored include domain name 1, domain name 2, and domain name 3.
  • The priority of these three domain names is domain name 2 > domain name 1 > domain name 3.
  • For domain name 2, the same faulty IP can include the bandwidth of three regions.
  • In some cases, the traffic may fail to be restored under the above configuration.
  • In that case, a mandatory recovery strategy can be implemented. Specifically, if the current cluster meets the recovery conditions but the traffic cannot be recovered within a specified period, the current cluster may be forcibly recovered in batches during a specified time window. For example, if the current cluster has been determined to meet the recovery conditions but the traffic has still not been restored 3 hours after the normal recovery time, it can be checked again whether the current cluster still meets the recovery conditions. If it does, the traffic of the current cluster can be forcibly recovered in the early morning window (2 o'clock to 6 o'clock).
  • The mandatory recovery strategy can have certain prerequisites. Specifically, mandatory recovery may be performed only for domain names corresponding to quality alarms, and not for domain names corresponding to interruption alarms.
  • In addition, the recovery conditions can be relaxed within the specified time window. For example, the specified weight threshold can be lowered, or the bandwidth that needs to be undertaken can be reduced. This lowers the recovery threshold of the current cluster, so that a current cluster meeting the relaxed conditions can subsequently be forcibly recovered in batches.
  • If the traffic still cannot be recovered, alarm information may be generated. For example, if the traffic of the current cluster cannot be restored normally in the early morning and the current cluster still meets the recovery conditions, an alarm message can be generated to prompt the administrator to perform manual recovery.
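  • A sketch of the mandatory recovery decision, under stated assumptions: the alarm types are modelled as the strings "quality" and "interruption", the window 2 o'clock to 6 o'clock is treated as the half-open interval [2, 6), and all names are hypothetical.

```python
from typing import Tuple


def should_force_recover(
    meets_conditions: bool,
    hours_overdue: float,
    current_hour: int,
    alarm_type: str,
    overdue_threshold: float = 3.0,
    window: Tuple[int, int] = (2, 6),
) -> bool:
    """Decide whether to force batch recovery for one domain name."""
    # Mandatory recovery applies only to quality alarms, never to
    # interruption alarms.
    if alarm_type != "quality":
        return False
    in_window = window[0] <= current_hour < window[1]
    return meets_conditions and hours_overdue >= overdue_threshold and in_window
```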
  • In existing approaches, when traffic is restored, the recoverable cluster is usually added to the coverage clusters of the traffic, and the other clusters that previously took over the traffic are then removed from the coverage clusters directly, completing the traffic recovery process.
  • However, such a recovery method causes the recoverable cluster to face a relatively large traffic load in a short time and may cause the cluster to fail again.
  • In this embodiment, the standby clusters can be withdrawn from the coverage clusters gradually, so that the recoverable cluster does not face a large load in a short time.
  • Specifically, the recoverable current cluster can first be added to the coverage clusters of the traffic to be recovered.
  • For example, the traffic of a certain domain name was originally served by three clusters A, B and C; cluster A then failed, and its traffic was transferred to the standby clusters D, E and F.
  • At that point, the coverage clusters of the domain name changed from ABC to BCDEF.
  • When cluster A becomes recoverable, it can be added to the coverage clusters of the domain name, changing them to ABCDEF.
  • Then, the standby clusters of the current cluster can be removed from the coverage clusters in batches to complete the recovery of the traffic to be recovered.
  • For example, the current coverage clusters contain three standby clusters D, E and F, which can be removed from the coverage clusters in three batches.
  • The coverage clusters thus change from ABCDEF to ABCDE, then to ABCD, and finally back to the original ABC.
  • In this way, the load of each cluster in the coverage clusters increases gradually rather than sharply in a short time. This prevents the cluster that has just returned to normal from failing again, improving the stability of the overall system.
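  • The gradual withdrawal in the ABCDEF example can be sketched as follows. The function name is hypothetical, and removing one standby cluster per batch, last one first, is an assumption chosen to match the sequence in the example.

```python
from typing import List, Sequence


def staged_coverage(
    current: Sequence[str], recovered: str, standby: Sequence[str]
) -> List[List[str]]:
    """Return the coverage cluster list after each recovery stage.

    Stage 0 adds the recovered cluster; each later stage removes one
    standby cluster.
    """
    coverage = [recovered] + [c for c in current if c != recovered]
    stages = [list(coverage)]
    for cluster in reversed(standby):
        if cluster in coverage:
            coverage.remove(cluster)
            stages.append(list(coverage))
    return stages
```

  • Calling `staged_coverage(list("BCDEF"), "A", list("DEF"))` yields the stages ABCDEF, ABCDE, ABCD and ABC in that order.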
  • As described above, the cluster information and cluster weight value of the failed cluster can be detected in real time to judge whether the current cluster meets the recovery conditions.
  • If the recovery conditions are met, the traffic to be recovered can be recovered in batches.
  • this application also provides a node flow restoration system, the system includes:
  • a recovery condition determination unit configured to obtain cluster information and a cluster weight value of the current cluster, and determine whether the current cluster has recovery conditions according to the cluster information and the cluster weight value;
  • the batch recovery unit is configured to recover the traffic to be recovered in batches according to a preset batch recovery strategy if the current cluster has recovery conditions;
  • a gradual recovery unit which is used to add the current cluster to the coverage cluster of the traffic to be recovered, and remove the current cluster from the coverage cluster in batches when performing traffic recovery according to the current batch recovery strategy To complete the recovery process of the to-be-recovered traffic.
  • An embodiment of the present application also provides a central server, the central server includes a memory and a processor, the memory is used to store a computer program, and when the computer program is executed by the processor, the above-mentioned node traffic recovery is realized method.
  • the embodiments of the present invention can be provided as a method, a system, or a computer program product. Therefore, the present invention may adopt a form of a complete hardware implementation, a complete software implementation, or a combination of software and hardware implementations. Moreover, the present invention may adopt the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program codes.
  • These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, and the instruction device implements the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operation steps is executed on the computer or other programmable equipment to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • the computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • the memory may include non-permanent memory in computer readable media, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of computer readable media.
  • Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage can be realized by any method or technology.
  • the information can be computer-readable instructions, data structures, program modules, or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible to a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

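The present invention discloses a method and system for transferring in and recovering node traffic, and a central server. The transfer-in method includes: if a node in the current cluster fails, acquiring cluster information of each redundant cluster; determining a cluster weight value of each redundant cluster based on the cluster information; and determining, according to the cluster weight values, a target cluster to receive traffic from among the redundant clusters, and transferring the traffic of the failed node in the current cluster into the target cluster. The technical solution provided by this application can reasonably transfer out the traffic of a failed node and, when the failed node returns to normal, can prevent the node from failing again.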

Description

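Method and system for transferring in and recovering node traffic, and central server
Technical Field
The present invention relates to the field of Internet technology, and in particular to a method and system for transferring in and recovering node traffic, and a central server.
Background
In a CDN (Content Delivery Network), a cluster may develop faulty nodes while serving customers. When a node in a cluster fails, the traffic of that node usually needs to be transferred to other normal nodes so that the customer's service can continue to be provided normally.
At present, when the traffic of a failed node is adjusted, it is usually distributed according to the load of nodes in other clusters. However, transferring traffic based on load alone may leave the receiving nodes unable to handle the transferred traffic well. In addition, when the failed node returns to normal, the transferred-out traffic is currently restored to it all at once, which is likely to cause the recovered node to fail again due to excessive load.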
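Summary of the Invention
The purpose of this application is to provide a method and system for transferring in and recovering node traffic, and a central server, which can reasonably transfer out the traffic of a failed node and, when the failed node returns to normal, can prevent the node from failing again.
To achieve the above purpose, one aspect of this application provides a method for transferring in node traffic, the method including: if a node in the current cluster fails, acquiring cluster information of each redundant cluster; determining a cluster weight value of each redundant cluster based on the cluster information; and determining, according to the cluster weight values, a target cluster to receive traffic from among the redundant clusters, and transferring the traffic of the failed node in the current cluster into the target cluster.
To achieve the above purpose, another aspect of this application provides a system for transferring in node traffic, the system including: a cluster information acquisition unit, configured to acquire cluster information of each redundant cluster if a node in the current cluster fails; a cluster weight value determination unit, configured to determine a cluster weight value of each redundant cluster based on the cluster information; and a traffic transfer-in unit, configured to determine, according to the cluster weight values, a target cluster to receive traffic from among the redundant clusters, and to transfer the traffic of the failed node in the current cluster into the target cluster.
To achieve the above purpose, another aspect of this application provides a method for recovering node traffic, the method including: acquiring cluster information and a cluster weight value of the current cluster, and determining, according to the cluster information and the cluster weight value, whether the current cluster meets the recovery conditions; if the current cluster meets the recovery conditions, recovering the traffic to be recovered in batches according to a preset bandwidth ratio; and, when performing traffic recovery according to the current bandwidth ratio, adding the current cluster to the coverage clusters of the traffic to be recovered, and removing the backup clusters of the current cluster from the coverage clusters in batches, so as to complete the recovery process of the traffic to be recovered.
To achieve the above purpose, another aspect of this application provides a system for recovering node traffic, the system including: a recovery condition determination unit, configured to acquire cluster information and a cluster weight value of the current cluster, and to determine, according to the cluster information and the cluster weight value, whether the current cluster meets the recovery conditions; a batch recovery unit, configured to recover the traffic to be recovered in batches according to a preset batch recovery strategy if the current cluster meets the recovery conditions; and a gradual recovery unit, configured to add the current cluster to the coverage clusters of the traffic to be recovered and to remove the backup clusters of the current cluster from the coverage clusters in batches when performing traffic recovery according to the current batch recovery strategy, so as to complete the recovery process of the traffic to be recovered.
To achieve the above purpose, another aspect of this application provides a central server. The central server includes a memory and a processor, the memory is used to store a computer program, and when the computer program is executed by the processor, the above method for recovering node traffic is implemented.
As can be seen from the above, in the technical solutions provided by one or more embodiments of this application, when a node in the current cluster fails, cluster information of the other redundant clusters can be acquired. This cluster information can reflect various aspects of a redundant cluster, such as its equipment, network, and alarm information. Based on the acquired cluster information, the cluster weight value of each redundant cluster can be determined. The cluster weight value can accurately characterize the ability of a redundant cluster to take on traffic. In this way, according to the cluster weight values, target clusters with better performance can be selected from the redundant clusters, and the traffic of the failed node can be transferred into the target clusters, so that the traffic of the failed node is reasonably distributed and can be processed normally. In addition, the cluster information and cluster weight value of the failed cluster can be monitored in real time to determine whether the current cluster meets the recovery conditions. When the current cluster meets the recovery conditions, the traffic to be recovered can be recovered in batches. During batch recovery, the current cluster can first be added to the coverage clusters of the traffic, and the backup clusters in the coverage clusters can then be removed step by step, finally completing the traffic recovery process. In this way, by recovering traffic in batches and removing the backup clusters step by step, the current cluster is prevented from taking on too much load within a short time, which avoids the current cluster failing again.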
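Brief Description of the Drawings
In order to explain the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a step diagram of the method for transferring in node traffic in an embodiment of the present invention;
Fig. 2 is a schematic diagram of the functional modules of the system for transferring in node traffic in an embodiment of the present invention;
Fig. 3 is a flowchart of the method for recovering node traffic in an embodiment of the present invention;
Fig. 4 is a schematic diagram of the functional modules of the system for recovering node traffic in an embodiment of the present invention.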
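Detailed Description
In order to make the purpose, technical solutions, and advantages of this application clearer, the technical solutions of this application are described clearly and completely below in conjunction with specific embodiments of this application and the corresponding drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in this application without creative effort fall within the scope of protection of this application.
This application provides a method for transferring in and recovering node traffic, which can be applied to the clusters of a CDN system. Referring to Fig. 1, the method for transferring in node traffic may include the following steps.
S11: If a node in the current cluster fails, acquire cluster information of each redundant cluster.
In this embodiment, clusters in the CDN system can serve different domain names in different regions. Generally, the identity of the customer served by a node in a cluster can be expressed as a combination of region and domain name. For example, the nodes in a cluster may serve the Baidu domain name in Central China, the Baidu domain name in North China, or the Tencent domain name in South China. When a node in the cluster fails, the traffic on that node can no longer be processed normally. At this point, the traffic on the failed node needs to be transferred into other redundant clusters.
In this embodiment, the cluster information of a redundant cluster can be used to judge comprehensively whether the redundant cluster is suitable for taking on the traffic transferred out of the failed node. Specifically, the cluster information may cover several aspects. In one application scenario, the cluster information includes at least one of the following: the health value of the machines in the cluster; the health value of the network in the cluster; the redundant bandwidth ratio of the cluster; restriction information characterizing traffic transfer-in for the cluster; chain switching information of the cluster; global alarm information of the cluster; cluster status information; alarm switching information of the cluster; and local alarm information of the cluster.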
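The health value of the machines in the cluster may be a parameter value characterizing the availability of the machines in the cluster. In practice, the value may range from 0 to 100, where 0 indicates the worst availability and 100 the best. If the health value of the machines in the cluster cannot be obtained normally, the corresponding parameter value may be -1.
The health value of the network in the cluster may be a parameter value characterizing the network availability of the machines in the cluster. In practice, the value may range from 0 to 100, where 0 indicates the worst network availability and 100 the best. If the health value of the network in the cluster cannot be obtained normally, the corresponding parameter value may be -1.
The redundant bandwidth ratio of the cluster can be calculated by the formula: 1 - channel bandwidth in the cluster / rated bandwidth. The normal range of this value is therefore 0 to 1. If, with the failed machines excluded, the channel bandwidth has not been collected, the corresponding redundant bandwidth ratio may be -1.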
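As a minimal sketch of the redundant bandwidth ratio described above, the following Python function applies the formula 1 - channel bandwidth / rated bandwidth and returns -1 when the channel bandwidth was not collected. The function and parameter names are hypothetical and do not come from the application; representing "not collected" as `None` is also an assumption.

```python
from typing import Optional

# Sentinel the description uses when a value could not be collected.
UNAVAILABLE = -1.0


def redundant_bandwidth_ratio(channel_bw: Optional[float], rated_bw: float) -> float:
    """Return 1 - channel_bw / rated_bw, or -1 when channel_bw is missing.

    channel_bw: channel bandwidth in the cluster (None if not collected).
    rated_bw:   rated bandwidth of the cluster, in the same unit.
    """
    if channel_bw is None:
        return UNAVAILABLE
    if rated_bw <= 0:
        raise ValueError("rated bandwidth must be positive")
    return 1.0 - channel_bw / rated_bw
```

For a cluster using 40 units of a 100-unit rated bandwidth, the ratio is 0.6, which the weighting step later scales to a 0 to 100 range.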
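The restriction information characterizing traffic transfer-in can characterize the capacity of the nodes in the cluster to take on traffic. The restriction information may include "reject traffic transfer-in" and "reduced traffic transfer-in". Rejecting traffic transfer-in means that none of the nodes in the cluster takes on externally transferred traffic. Reduced traffic transfer-in means that the nodes in the cluster take on as little externally transferred traffic as possible.
The chain switching information of the cluster can be used to characterize the stability of the cluster. Specifically, it can be determined as follows: after the current redundant cluster receives transferred traffic, if it generates fault alarm information within a specified period, its chain switching information is set to a first value; if it generates no fault alarm information within the specified period, its chain switching information is set to a second value. In one practical application, the first value may be 0 and the second value may be 100. For example, if a node in the original cluster fails and its traffic is transferred to a backup cluster, and that backup cluster itself generates a fault alarm within 24 hours, the backup cluster is called a chain switching cluster, and its chain switching information is assigned the value 0.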
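The chain switching rule can be illustrated with a short Python sketch. It assumes that alarm timestamps for the redundant cluster are available as a list and uses the example values from the description (0 for the first value, 100 for the second value, 24 hours for the specified period). The names `chain_switch_value`, `transferred_at`, and `alarm_times` are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable

FIRST_VALUE = 0     # cluster raised a fault alarm after receiving traffic
SECOND_VALUE = 100  # cluster stayed alarm-free during the window


def chain_switch_value(
    transferred_at: datetime,
    alarm_times: Iterable[datetime],
    window: timedelta = timedelta(hours=24),
) -> int:
    """Score a redundant cluster by whether it alarmed soon after taking traffic."""
    deadline = transferred_at + window
    alarmed = any(transferred_at <= t <= deadline for t in alarm_times)
    return FIRST_VALUE if alarmed else SECOND_VALUE
```

A cluster that alarms 5 hours after receiving traffic is treated as a chain switching cluster and scores 0, while an alarm 30 hours later falls outside the window and leaves the score at 100.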
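The global alarm information of the cluster can indicate whether all domain names in all regions of the current redundant cluster have raised alarms. The local alarm information of the cluster can indicate whether alarms have been raised for some regions and some domain names in the current redundant cluster.
The cluster status may be one of two states: normal or faulty.
The alarm switching information of the cluster can be used to characterize the number of times, within a period, that a fault alarm occurred in the cluster and traffic scheduling was performed. Specifically, the number of traffic scheduling events in the current redundant cluster can be counted within a specified period, and the alarm switching information of the current redundant cluster can be generated from that count. For example, if within 24 hours the number of fault alarms with traffic scheduling in the current redundant cluster is 0, the alarm switching value may be 100; if it is 1, the value may be 90; if it is 2, the value may be 80; if it is 3, the value may be 70; and if it is greater than 3, the value may be 60. Of course, the mapping between the number of such events and the value can be adjusted flexibly according to actual conditions and is not limited here.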
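The example mapping from scheduling count to alarm switching value can be written as a small lookup. This is only a sketch of the example figures given above (100, 90, 80, 70, and 60 for more than three events); the description states that the mapping is adjustable, and the function name is hypothetical.

```python
# Example mapping from the description: scheduling events in 24 hours -> score.
ALARM_SWITCH_SCORES = {0: 100, 1: 90, 2: 80, 3: 70}
SCORE_ABOVE_THREE = 60


def alarm_switch_value(dispatch_count: int) -> int:
    """Return the alarm switching value for a count of alarm-driven scheduling events."""
    if dispatch_count < 0:
        raise ValueError("dispatch_count cannot be negative")
    return ALARM_SWITCH_SCORES.get(dispatch_count, SCORE_ABOVE_THREE)
```

Keeping the mapping in a table rather than in branching logic makes it straightforward to adjust the thresholds, which the description explicitly allows.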
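In this embodiment, each item of cluster information may be managed and maintained by equipment inside the cluster, or maintained periodically by the central control system of the CDN, so that each item can be obtained from the corresponding equipment or system.
S13: Determine the cluster weight value of each redundant cluster based on the cluster information.
In this embodiment, after the cluster information has been acquired, it can be analyzed to evaluate a cluster weight value that characterizes the ability of a redundant cluster to take on traffic. Specifically, if the restriction information of the current redundant cluster indicates that traffic transfer-in is rejected, or global alarm information has appeared for the current redundant cluster, or the status information of the current redundant cluster indicates that the cluster is abnormal, then the current redundant cluster is not able to take on traffic, and its cluster weight value can be set to 0.
In addition, if the restriction information of the current redundant cluster indicates reduced traffic transfer-in, the current redundant cluster can take on externally transferred traffic but strictly limits the amount. In this case, its cluster weight value can be set to a small preset value. For example, in one practical application scenario, this small preset value may be 5 (out of 100).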
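In this embodiment, if none of the situations listed above applies to the cluster information of the current redundant cluster, its cluster weight value can be calculated as a weighted sum. Specifically, the information value represented by each item of cluster information of the current redundant cluster, and the preset allocation ratio of each item, can be identified; the information values are then weighted and summed according to the preset allocation ratios, and the result is taken as the cluster weight value of the current redundant cluster. For example, in one application scenario, the information values and corresponding allocation ratios may be as follows:
Health value of the machines in the cluster and its allocation ratio: 65 points, P1 = 20%
Health value of the network in the cluster and its allocation ratio: 70 points, P2 = 20%
Redundant bandwidth ratio of the cluster and its allocation ratio: 60%, P3 = 20%
Chain switching information of the cluster and its allocation ratio: 100 points, P4 = 10%
Alarm switching information of the cluster and its allocation ratio: 80 points, P5 = 30%
Substituting these information values and allocation ratios into the formula: cluster weight value = (health value of the machines in the cluster * P1) + (health value of the network in the cluster * P2) + (redundant bandwidth ratio of the cluster * 100 * P3) + (chain switching information of the cluster * P4) + (alarm switching information of the cluster * P5) gives 13 + 14 + 12 + 10 + 24, so the cluster weight value is 73.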
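The weighting rules of step S13 can be sketched in Python as follows. The sketch combines the two override rules (weight 0 and the small preset value 5) with the weighted sum, using the example allocation ratios P1 to P5 from the description. The `ClusterInfo` type and its field names are hypothetical, and the sketch does not address how -1 (unavailable) inputs should be treated, because the description does not say.

```python
from dataclasses import dataclass


@dataclass
class ClusterInfo:
    machine_health: float      # 0..100
    network_health: float      # 0..100
    redundant_bw_ratio: float  # 0..1
    chain_switch: float        # 0 or 100 in the example
    alarm_switch: float        # 60..100 in the example
    rejects_inflow: bool = False
    reduced_inflow: bool = False
    global_alarm: bool = False
    status_normal: bool = True


# Example allocation ratios P1..P5 from the description (they sum to 1).
P1, P2, P3, P4, P5 = 0.20, 0.20, 0.20, 0.10, 0.30
REDUCED_INFLOW_WEIGHT = 5.0  # "small preset value" from the example


def cluster_weight(info: ClusterInfo) -> float:
    """Return the cluster weight value on a 0..100 scale."""
    if info.rejects_inflow or info.global_alarm or not info.status_normal:
        return 0.0
    if info.reduced_inflow:
        return REDUCED_INFLOW_WEIGHT
    return (
        info.machine_health * P1
        + info.network_health * P2
        + info.redundant_bw_ratio * 100 * P3
        + info.chain_switch * P4
        + info.alarm_switch * P5
    )
```

Calling `cluster_weight(ClusterInfo(65, 70, 0.6, 100, 80))` reproduces the worked example and evaluates to 73, up to floating-point rounding.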
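S15: Determine, according to the cluster weight values, the target cluster to receive traffic from among the redundant clusters, and transfer the traffic of the failed node in the current cluster into the target cluster.
In this embodiment, after the cluster weight value of each redundant cluster has been calculated, target clusters suitable for receiving traffic can be selected from the redundant clusters according to the cluster weight values. Specifically, the redundant clusters can be sorted in descending order of cluster weight value, and the top-ranked redundant clusters taken as the selected target clusters.
In addition, depending on the actual application scenario, the redundant clusters can first be preliminarily filtered and sorted, and then sorted in finer detail by cluster weight value. First, candidate clusters can be filtered out of the redundant clusters according to the cluster weight values and cluster information. Specifically, redundant clusters whose cluster weight value is 0 can be removed. Then the traffic domain name and traffic region corresponding to the failed node can be identified, and, among the remaining redundant clusters, those in which a local alarm has occurred for that traffic domain name and traffic region can be queried and removed. For example, the traffic domain name corresponding to the failed node may be baidu.com and the traffic region Central China. If local alarm information for baidu.com in Central China has also appeared in some redundant clusters, those clusters cannot process baidu.com traffic for Central China normally, so there is no need to transfer the failed node's traffic to them, and they can be excluded from the selectable range. Finally, the remaining redundant clusters can be taken as the filtered candidate clusters.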
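The candidate filtering described above might look like the following Python sketch: clusters with weight 0 are dropped first, then clusters with a local alarm for the same domain name and region as the failed traffic. The `RedundantCluster` type, its fields, and the representation of local alarms as a set of (domain, region) pairs are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class RedundantCluster:
    name: str
    weight: float
    # (domain, region) pairs for which this cluster currently has a local alarm
    local_alarms: Set[Tuple[str, str]] = field(default_factory=set)


def filter_candidates(
    clusters: List[RedundantCluster], domain: str, region: str
) -> List[RedundantCluster]:
    """Drop zero-weight clusters and clusters alarming on the failed traffic's key."""
    return [
        c
        for c in clusters
        if c.weight > 0 and (domain, region) not in c.local_alarms
    ]
```

With the example from the text, a cluster that already has a local alarm for ("baidu.com", "Central China") is excluded even if its weight is high.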
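In one embodiment, after the candidate clusters have been filtered out, the resource type corresponding to the failed node can be identified in order to improve compatibility in processing the traffic. The resource type may be, for example, video, audio, image, or text. Clusters matching this resource type can then be queried among the candidate clusters, and their priority raised. Here, matching the resource type may mean that the resource type served by the cluster is the same as, or includes, the resource type of the failed node. When such clusters take on the traffic of the failed node, the resource type of the traffic is one they already handle, so they can process the traffic better. In the final ranking, the priority of these clusters can be raised appropriately. How many levels the priority is raised can be set flexibly according to actual conditions; alternatively, depending on resource requirements, when the redundant resources are identified as having special requirements, the priority can be set to the highest or the lowest during resource selection.
In one embodiment, in order to prevent nodes in the candidate clusters from performing back-to-origin requests when processing the traffic, candidate clusters that share the same primary-layer domain name as the traffic of the failed node can be selected. Specifically, the primary-layer domain name corresponding to the failed node can be identified, the intersecting clusters that share that primary-layer domain name can be determined among the candidate clusters, and the intersecting clusters can be ranked ahead of the other candidate clusters. By identifying whether a candidate cluster intersects with the primary-layer domain name of the failed traffic, the candidate clusters can be preliminarily sorted, which reduces back-to-origin behavior and improves traffic processing efficiency.
In this embodiment, after the candidate clusters have been sorted into intersecting and non-intersecting clusters, the intersecting clusters and the other clusters can each be sorted by region level. Specifically, the region level refers to the regional relationship between a candidate cluster and the cluster containing the failed node. In practice, region levels from high to low may include, for example, same region, same major region, cross major region, same carrier, and cross carrier. According to the region level, the candidate clusters among the intersecting clusters can be sorted, and so can those among the non-intersecting clusters. After sorting by region level, several candidate clusters may share the same region level; these can be sorted by cluster weight value. Finally, the target clusters to receive traffic can be determined from the candidate clusters according to the sorting result. In practice, the number of target clusters can be chosen according to the amount of traffic transferred out. Specifically, the peak bandwidth of the transferred-out traffic over 24 hours can be measured, and the number of target clusters determined from that peak bandwidth. Generally, the number of target clusters can be proportional to the peak bandwidth.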
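The three-level ordering described above (intersecting clusters first, then region level, then cluster weight value) maps naturally onto a composite sort key. The following Python sketch assumes each candidate already carries a flag for the primary-layer domain intersection and a region level label; the `Candidate` type, the label strings, and `pick_targets` are hypothetical names, and how the target count is derived from peak bandwidth is left to the caller.

```python
from dataclasses import dataclass
from typing import List

# Region levels from high to low, following the example order in the description.
REGION_LEVELS = [
    "same_region",
    "same_major_region",
    "cross_major_region",
    "same_carrier",
    "cross_carrier",
]


@dataclass
class Candidate:
    name: str
    weight: float
    region_level: str
    shares_primary_domain: bool


def rank_candidates(candidates: List[Candidate]) -> List[Candidate]:
    """Sort by domain intersection, then region level, then descending weight."""
    return sorted(
        candidates,
        key=lambda c: (
            not c.shares_primary_domain,          # intersecting clusters first
            REGION_LEVELS.index(c.region_level),  # higher region level first
            -c.weight,                            # larger weight first
        ),
    )


def pick_targets(candidates: List[Candidate], count: int) -> List[Candidate]:
    """Take the top `count` candidates as target clusters."""
    return rank_candidates(candidates)[:count]
```

Using a single tuple key keeps the three criteria in one place, so changing their precedence only requires reordering the tuple.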
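In one embodiment, when there are multiple target clusters, the transferred-out traffic can be distributed reasonably among them. Specifically, the traffic domain name and traffic region corresponding to the failed node can be identified, and the global peak bandwidth of that traffic domain name and traffic region within a specified period can be measured. Then, according to the number of nodes currently covered by that traffic domain name and traffic region, and the number of nodes in the target clusters, the bandwidth to be taken on by each node in the target clusters can be determined. In practice, the bandwidth taken on by each node in the target clusters can be calculated by the following formula:
Bandwidth taken on by a node = global peak bandwidth within the specified period / (number of currently covered nodes - 1 + number of nodes in the target clusters)
Once the bandwidth that each node needs to take on has been determined, the traffic of the failed node can be transferred into the respective target clusters.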
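The per-node bandwidth formula can be checked with a short Python function. This is a direct transcription of the formula above under hypothetical parameter names; the subtraction of 1 is taken to account for the single failed node, which is an interpretation rather than something the description states explicitly.

```python
def per_node_bandwidth(
    global_peak_bw: float, covered_nodes: int, target_nodes: int
) -> float:
    """Bandwidth each node takes on after the failed node's traffic is moved.

    global_peak_bw: global peak bandwidth of the domain/region in the period.
    covered_nodes:  nodes currently covering the domain/region (incl. the failed one).
    target_nodes:   nodes in the target clusters receiving the traffic.
    """
    serving_nodes = covered_nodes - 1 + target_nodes
    if serving_nodes <= 0:
        raise ValueError("no nodes left to take on the traffic")
    return global_peak_bw / serving_nodes
```

For a global peak of 900 units, 4 currently covered nodes, and 6 nodes in the target clusters, each node takes on 900 / 9 = 100 units.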
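As can be seen from the above, when a node in the current cluster fails, cluster information of the other redundant clusters can be acquired. This cluster information can reflect various aspects of a redundant cluster, such as its equipment, network, and alarm information. Based on the acquired cluster information, the cluster weight value of each redundant cluster can be determined. The cluster weight value can accurately characterize the ability of a redundant cluster to take on traffic. In this way, according to the cluster weight values, target clusters with better performance can be selected from the redundant clusters, and the traffic of the failed node can be transferred into the target clusters, so that the traffic of the failed node is reasonably distributed and can be processed normally.
Referring to Fig. 2, this application also provides a system for transferring in node traffic, the system including:
a cluster information acquisition unit, configured to acquire cluster information of each redundant cluster if a node in the current cluster fails;
a cluster weight value determination unit, configured to determine a cluster weight value of each redundant cluster based on the cluster information;
a traffic transfer-in unit, configured to determine, according to the cluster weight values, a target cluster to receive traffic from among the redundant clusters, and to transfer the traffic of the failed node in the current cluster into the target cluster.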
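This application also provides a method for recovering node traffic. Referring to Fig. 3, the method may include the following steps.
S21: Acquire cluster information and a cluster weight value of the current cluster, and determine, according to the cluster information and the cluster weight value, whether the current cluster meets the recovery conditions.
In this embodiment, the failed current cluster can be checked periodically, so that its cluster information and its cluster weight value, calculated as described above, can be combined to determine whether the current cluster meets the recovery conditions.
Specifically, if the cluster information of the current cluster indicates that no alarm or fault has occurred in the current cluster within a specified period, the cluster information indicates that the redundant bandwidth of the current cluster is sufficient for the recovery bandwidth to be taken on, and the cluster weight value of the current cluster is greater than or equal to a specified weight threshold, the current cluster can be determined to meet the recovery conditions. The specified period can be set flexibly according to actual needs, for example 24 hours. The recovery bandwidth to be taken on can be determined from the total bandwidth transferred out of the current cluster. Specifically, the sum of the failed bandwidth transferred out of the current cluster can be measured, along with the number of nodes currently covering that bandwidth. The recovery bandwidth to be taken on by the nodes in the current cluster can then be calculated from the bandwidth sum and the node count.
For example, the total bandwidth transferred out when the current cluster failed can be multiplied by a coefficient greater than 1 (for example 1.2) to obtain a margin value for the transferred-out bandwidth. The number of nodes currently covered by the transferred-out traffic can then be counted. Because the nodes in the current cluster are to resume work, the number of nodes actually covered by the transferred-out traffic is the counted number plus the number of nodes expected to recover in the current cluster. Finally, dividing the margin value by the actual number of covered nodes gives the recovery bandwidth to be taken on by each node in the current cluster. Multiplying this per-node recovery bandwidth by the number of recovering nodes in the current cluster gives the recovery bandwidth to be taken on by the current cluster. If the redundant bandwidth of the current cluster is greater than or equal to the recovery bandwidth to be taken on by the current cluster, the current cluster is considered to meet the precondition for taking back part of the transferred-out traffic.
The specified weight threshold can be set flexibly according to actual conditions; for example, it may be 30 points.
Formula for the bandwidth that a recovering failed cluster needs to take on (with a user-defined multiplier that enlarges the bandwidth to be taken on):
Bandwidth to be taken on by the recovering cluster = (domain name bandwidth transferred out of the failed cluster + total regional bandwidth) * 1.2 / (number of IPs in the current region + 1).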
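The recovery check of step S21 can be sketched in Python as two small functions: one computing the recovery bandwidth the cluster must take on, following the worked procedure in the description, and one combining the three recovery conditions. The function names and parameters are hypothetical. With one recovering node, the first function reduces to the "+1" form of the formula above; the margin factor 1.2 and the threshold 30 are the example values from the text.

```python
def required_recovery_bandwidth(
    transferred_bw: float,
    covered_nodes: int,
    recovering_nodes: int = 1,
    margin_factor: float = 1.2,
) -> float:
    """Recovery bandwidth the current cluster must be able to take on."""
    if covered_nodes < 0 or recovering_nodes <= 0:
        raise ValueError("node counts must be non-negative and recovering_nodes > 0")
    margin_bw = transferred_bw * margin_factor      # enlarged transferred-out bandwidth
    actual_nodes = covered_nodes + recovering_nodes  # nodes sharing the traffic after recovery
    per_node = margin_bw / actual_nodes
    return per_node * recovering_nodes


def meets_recovery_conditions(
    alarm_free_in_window: bool,
    redundant_bw: float,
    required_bw: float,
    cluster_weight: float,
    weight_threshold: float = 30.0,
) -> bool:
    """All three conditions from the description must hold."""
    return (
        alarm_free_in_window
        and redundant_bw >= required_bw
        and cluster_weight >= weight_threshold
    )
```

For 1000 units transferred out, 9 currently covering nodes, and 1 recovering node, the cluster must be able to take on about 1000 * 1.2 / 10 = 120 units before recovery is attempted.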
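S23: If the current cluster meets the recovery conditions, recover the traffic to be recovered in batches according to a preset batch recovery strategy.
In this embodiment, once the current cluster meets the recovery conditions, the traffic can be recovered in batches, avoiding the risk that restoring the traffic all at once causes the current cluster to fail again. In practice, batch recovery can be performed according to a preset bandwidth ratio; it can also be performed according to user-defined alarm names and domain names.
Specifically, a bandwidth ratio can be set for the transferred-out traffic, and the product of the transferred-out traffic and the bandwidth ratio taken as the traffic to be recovered in the current batch. In addition, in practice, if the transferred-out traffic involves multiple domain names and multiple regions, it can be recovered in batches by combination of region and domain name. For example, if the transferred-out traffic consists of Baidu traffic in Central China, Baidu traffic in North China, and Tencent traffic in South China, the domain name traffic of these three regions can be recovered in three separate batches.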
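Batch recovery by bandwidth ratio can be illustrated with the following Python sketch, in which each batch recovers the transferred-out traffic multiplied by that batch's ratio. The function name is hypothetical, and the requirement that the ratios sum to 1 (so that all traffic is eventually recovered) is an assumption that the description does not state.

```python
from typing import List


def batch_amounts(transferred_bw: float, ratios: List[float]) -> List[float]:
    """Return the traffic to recover in each batch as transferred_bw * ratio.

    Assumes (not stated in the source) that the ratios cover all of the traffic.
    """
    if any(r <= 0 for r in ratios):
        raise ValueError("each ratio must be positive")
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios are expected to sum to 1")
    return [transferred_bw * r for r in ratios]
```

With 100 units transferred out and ratios of 0.25, 0.25, and 0.5, the three batches recover 25, 25, and 50 units respectively.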
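In one embodiment, when recovering traffic in the current batch, the domain names corresponding to the traffic to be recovered can be identified, together with the priority of each domain name and the size of the channel bandwidth under each domain name. During recovery, the bandwidth to be taken on can also be scaled up by a recovery ratio: for example, if 100 M/s was transferred out at the time of the fault and the recovery ratio is set to 1.5, the recovered node must be able to take on 150 M/s of bandwidth. Batch recovery can then be performed according to the priority of each domain name and the size of the channel bandwidth. Specifically, suppose the domain names to be recovered are domain name 1, domain name 2, and domain name 3, with priority order domain name 2 > domain name 1 > domain name 3; batch recovery can then proceed in the order domain name 2, domain name 1, domain name 3. In addition, within domain name 2, the same failed IP may carry the bandwidth of three regions; when the traffic of domain name 2 is recovered, these three regional bandwidths can be recovered one by one, or preferentially, according to their priority.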
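The domain-priority ordering can be sketched in Python as a sort over the domains awaiting recovery. The `DomainTraffic` type is hypothetical, and two details are assumptions rather than statements from the description: a smaller priority number means higher priority, and domains of equal priority are recovered in ascending order of channel bandwidth.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DomainTraffic:
    domain: str
    priority: int      # assumption: smaller number = higher priority
    channel_bw: float  # channel bandwidth under this domain


def recovery_order(items: List[DomainTraffic]) -> List[str]:
    """Order domains by priority, breaking ties by smaller channel bandwidth first."""
    ordered = sorted(items, key=lambda d: (d.priority, d.channel_bw))
    return [d.domain for d in ordered]


def required_capacity(transferred_bw: float, recovery_ratio: float) -> float:
    """Bandwidth the recovered node must be able to take on after scaling."""
    return transferred_bw * recovery_ratio
```

With the example from the text, `required_capacity(100.0, 1.5)` gives 150.0, and three domains with priorities 2, 1, and 3 are recovered in the order domain name 2, domain name 1, domain name 3.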
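In one embodiment, due to a system fault or other reasons, traffic may fail to be recovered as configured above. In this case, a forced recovery strategy can be executed. Specifically, if the current cluster meets the recovery conditions but the traffic cannot be recovered within a specified period, batch traffic recovery can be performed on the current cluster during a specified time window. For example, if the current cluster has been determined to meet the recovery conditions but 3 hours have passed beyond the normal recovery time, it can be checked again whether the current cluster still meets the recovery conditions. If it does, the traffic of the current cluster can be forcibly recovered during the early-morning window (2:00 to 6:00).
Of course, in practice, the forced recovery strategy may be subject to certain preconditions. Specifically, forced recovery may be applied only to domain names associated with quality alarms, and not to domain names associated with interruption alarms. Meanwhile, if the current cluster does not meet the recovery conditions and the traffic cannot be recovered within the specified period, the recovery conditions can be relaxed during the specified time window. For example, the specified weight threshold can be lowered, or the bandwidth to be taken on can be reduced. This lowers the recovery threshold of the current cluster, after which batch traffic recovery can be forced on a current cluster that meets the relaxed recovery conditions.
In addition, if batch traffic recovery cannot be performed on the current cluster within the specified time window, alarm information can be generated. For example, if the traffic of the current cluster still cannot be recovered normally during the early-morning window and the current cluster still meets the recovery conditions, alarm information can be generated to prompt administrators to recover the traffic manually.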
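The forced recovery decision can be sketched as a single predicate in Python. It uses the example values from the description (3 hours overdue, an early-morning window of 2:00 to 6:00, and forced recovery only for quality alarms). The function name, the `"quality"` and `"interruption"` labels, and treating the window as half-open are assumptions for illustration; relaxing the recovery conditions and raising an alarm on failure are not modeled here.

```python
from datetime import datetime, time

FORCED_WINDOW_START = time(2, 0)  # example early-morning window from the description
FORCED_WINDOW_END = time(6, 0)
OVERDUE_HOURS = 3                 # example overdue threshold from the description


def should_force_recovery(
    now: datetime,
    meets_conditions: bool,
    hours_overdue: float,
    alarm_kind: str,
) -> bool:
    """Decide whether forced batch recovery may start at time `now`."""
    in_window = FORCED_WINDOW_START <= now.time() < FORCED_WINDOW_END
    return (
        meets_conditions
        and hours_overdue >= OVERDUE_HOURS
        and alarm_kind == "quality"  # interruption alarms are not force-recovered
        and in_window
    )
```

A caller that finds this predicate false for the whole window, while the cluster still meets the recovery conditions, would be the natural place to generate the alarm for manual recovery.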
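In practice, different configurations can also be used for different customers. For example, for some customers' traffic, an additional observation period is required even after the current cluster meets the recovery conditions, in order to avoid repeatedly transferring traffic out and recovering it. For these customers, a separate configuration can be set up; when traffic recovery is executed, this configuration is loaded and recovery proceeds accordingly. That is, when the traffic to be recovered is recovered in batches, the domain names corresponding to that traffic can be identified, the configuration information of each domain name read, and the traffic of each domain name recovered at the recovery time indicated by its configuration information.
S25: When performing traffic recovery according to the current batch recovery strategy, add the current cluster to the coverage clusters of the traffic to be recovered, and remove the backup clusters of the current cluster from the coverage clusters in batches, so as to complete the recovery process of the traffic to be recovered.
In the prior art, when traffic is recovered, the recoverable cluster is usually added to the coverage clusters of the traffic, and the other clusters that previously took on the traffic are then withdrawn from the coverage clusters all at once, completing the traffic recovery process. However, this approach causes the recoverable cluster to face a large traffic load within a short time and may cause it to fail again. In view of this, in this embodiment, clusters can be withdrawn from the coverage clusters gradually, so that the recoverable cluster does not face a large load within a short time.
Specifically, the recoverable current cluster can first be added to the coverage clusters of the traffic to be recovered. For example, suppose the traffic of a certain domain name was originally served by three clusters, A, B and C; cluster A then failed, and its traffic was transferred to the backup clusters D, E and F, so that the coverage clusters of the domain name changed from ABC to BCDEF. After cluster A returns to normal, according to the solution of this embodiment, cluster A can be added to the coverage clusters of the domain name, changing them to ABCDEF.
Then the backup clusters of the current cluster can be removed from the coverage clusters in batches to complete the recovery process of the traffic to be recovered. Specifically, since the current coverage clusters contain the three backup clusters D, E and F, the removal can be divided into three batches, each removing one of D, E and F from the coverage clusters. The coverage clusters can thus change from ABCDEF to ABCDE, then to ABCD, and finally back to the original ABC. By removing the backup clusters step by step, the load of each cluster in the coverage clusters is increased in steps rather than raised sharply within a short time, which prevents a cluster that has just returned to normal from failing again and thereby improves the stability of the overall system.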
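The gradual withdrawal of backup clusters can be traced with a short Python sketch that returns the coverage cluster list after each step. The function name is hypothetical, and removing exactly one backup cluster per batch, in reverse order of the backup list, is chosen to match the ABCDEF, ABCDE, ABCD, ABC example; the description does not fix the batch size or order in general.

```python
from typing import List


def recovery_coverage_steps(
    coverage: List[str], recovered: str, backups: List[str]
) -> List[List[str]]:
    """Return the coverage cluster list after each recovery step.

    Step 0 adds the recovered cluster; each later step removes one backup cluster.
    """
    current = [recovered] + [c for c in coverage if c != recovered]
    steps = [list(current)]
    for backup in reversed(backups):
        current = [c for c in current if c != backup]
        steps.append(list(current))
    return steps
```

Called with coverage BCDEF, recovered cluster A, and backups D, E, F, the function yields ABCDEF, ABCDE, ABCD, and ABC, matching the example in the text.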
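As can be seen from the above, the cluster information and cluster weight value of the failed cluster can be monitored in real time to determine whether the current cluster meets the recovery conditions. When the current cluster meets the recovery conditions, the traffic to be recovered can be recovered in batches. During batch recovery, the current cluster can first be added to the coverage clusters of the traffic, and the backup clusters in the coverage clusters can then be removed step by step, finally completing the traffic recovery process. In this way, by recovering traffic in batches and removing the backup clusters step by step, the current cluster is prevented from taking on too much load within a short time, which avoids the current cluster failing again.
Referring to Fig. 4, this application also provides a system for recovering node traffic, the system including:
a recovery condition determination unit, configured to acquire cluster information and a cluster weight value of the current cluster, and to determine, according to the cluster information and the cluster weight value, whether the current cluster meets the recovery conditions;
a batch recovery unit, configured to recover the traffic to be recovered in batches according to a preset batch recovery strategy if the current cluster meets the recovery conditions;
a gradual recovery unit, configured to add the current cluster to the coverage clusters of the traffic to be recovered and to remove the backup clusters of the current cluster from the coverage clusters in batches when performing traffic recovery according to the current batch recovery strategy, so as to complete the recovery process of the traffic to be recovered.
An embodiment of this application also provides a central server. The central server includes a memory and a processor, the memory is used to store a computer program, and when the computer program is executed by the processor, the above method for recovering node traffic is implemented.
The embodiments in this specification are described in a progressive manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the system and central server embodiments can be understood by reference to the description of the foregoing method embodiments.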
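Those skilled in the art should understand that the embodiments of the present invention may be provided as a method, a system, or a computer program product. Therefore, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to embodiments of the present invention. It should be understood that each process and/or block in the flowcharts and/or block diagrams, and combinations of processes and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing equipment to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing equipment produce a device for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, and the instruction device implements the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operation steps is executed on the computer or other programmable equipment to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include non-permanent memory in computer-readable media, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage can be implemented by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible to a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, product, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, product, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, product, or device that includes the element.
The above are only embodiments of this application and are not intended to limit this application. For those skilled in the art, this application may have various modifications and changes. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of this application shall fall within the scope of the claims of this application.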

Claims (21)

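  1. A method for transferring in node traffic, characterized in that the method comprises:
    if a node in the current cluster fails, acquiring cluster information of each redundant cluster;
    determining a cluster weight value of each redundant cluster based on the cluster information;
    determining, according to the cluster weight values, a target cluster to receive traffic from among the redundant clusters, and transferring the traffic of the failed node in the current cluster into the target cluster.
  2. The method according to claim 1, characterized in that the cluster information comprises at least one of the following:
    the health value of the machines in the cluster; the health value of the network in the cluster; the redundant bandwidth ratio of the cluster; restriction information characterizing traffic transfer-in for the cluster; chain switching information of the cluster; global alarm information of the cluster; cluster status information; alarm switching information of the cluster; local alarm information of the cluster.
  3. The method according to claim 2, characterized in that the chain switching information of the cluster is determined as follows:
    after the current redundant cluster receives transferred traffic, if the current redundant cluster generates fault alarm information within a specified period, setting the chain switching information of the current redundant cluster to a first value; if the current redundant cluster generates no fault alarm information within the specified period, setting the chain switching information of the current redundant cluster to a second value.
  4. The method according to claim 2, characterized in that the alarm switching information of the cluster is determined as follows:
    counting, within a specified period, the number of traffic scheduling events occurring in the current redundant cluster, and generating the alarm switching information of the current redundant cluster based on the counted number of traffic scheduling events.
  5. The method according to claim 1 or 2, characterized in that determining the cluster weight value of each redundant cluster based on the cluster information comprises:
    if the restriction information of the current redundant cluster indicates that traffic transfer-in is rejected, or global alarm information has appeared for the current redundant cluster, or the status information of the current redundant cluster indicates that the cluster is abnormal, setting the cluster weight value of the current redundant cluster to 0;
    if the restriction information of the current redundant cluster indicates reduced traffic transfer-in, setting the cluster weight value of the current redundant cluster to a preset value.
  6. The method according to claim 1 or 2, characterized in that determining the cluster weight value of each redundant cluster based on the cluster information comprises:
    identifying the information value represented by each item of cluster information of the current redundant cluster, and the preset allocation ratio of each item of cluster information;
    performing a weighted summation of the information values according to the preset allocation ratios, and taking the result of the weighted summation as the cluster weight value of the current redundant cluster.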
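  7. The method according to claim 1, characterized in that determining, according to the cluster weight values, a target cluster to receive traffic from among the redundant clusters comprises:
    filtering candidate clusters out of the redundant clusters according to the cluster weight values and the cluster information;
    identifying the primary-layer domain name corresponding to the failed node, determining among the candidate clusters the intersecting clusters that intersect with the primary-layer domain name, and ranking the intersecting clusters ahead of the other clusters among the candidate clusters;
    sorting the intersecting clusters and the other clusters separately by region level, and, within the same region level, sorting the clusters by cluster weight value;
    determining the target cluster to receive traffic from the candidate clusters according to the sorting result.
  8. The method according to claim 7, characterized in that filtering candidate clusters out of the redundant clusters comprises:
    removing, from the redundant clusters, the redundant clusters whose cluster weight value is 0;
    identifying the traffic domain name and traffic region corresponding to the failed node, querying, among the remaining redundant clusters, the redundant clusters in which a local alarm has occurred for the traffic domain name and the traffic region, and removing the redundant clusters thus found;
    taking the remaining redundant clusters as the filtered candidate clusters.
  9. The method according to claim 7, characterized in that the method further comprises:
    identifying the resource type corresponding to the failed node, querying, among the candidate clusters, the clusters that match the resource type, and raising the priority of the clusters thus found according to resource requirements.
  10. The method according to claim 1 or 7, characterized in that the method further comprises:
    identifying the traffic domain name and traffic region corresponding to the failed node, and measuring the global peak bandwidth of the traffic domain name and the traffic region within a specified period;
    determining the bandwidth to be taken on by each node in the target cluster according to the number of nodes currently covered by the traffic domain name and the traffic region, and the number of nodes in the target cluster.
  11. A system for transferring in node traffic, characterized in that the system comprises:
    a cluster information acquisition unit, configured to acquire cluster information of each redundant cluster if a node in the current cluster fails;
    a cluster weight value determination unit, configured to determine a cluster weight value of each redundant cluster based on the cluster information;
    a traffic transfer-in unit, configured to determine, according to the cluster weight values, a target cluster to receive traffic from among the redundant clusters, and to transfer the traffic of the failed node in the current cluster into the target cluster.
  12. A central server, characterized in that the central server comprises a memory and a processor, the memory is used to store a computer program, and when the computer program is executed by the processor, the method according to any one of claims 1 to 10 is implemented.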
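  13. A method for recovering node traffic, characterized in that the method comprises:
    acquiring cluster information and a cluster weight value of the current cluster, and determining, according to the cluster information and the cluster weight value, whether the current cluster meets the recovery conditions;
    if the current cluster meets the recovery conditions, recovering the traffic to be recovered in batches according to a preset batch recovery strategy;
    when performing traffic recovery according to the current batch recovery strategy, adding the current cluster to the coverage clusters of the traffic to be recovered, and removing the backup clusters of the current cluster from the coverage clusters in batches, so as to complete the recovery process of the traffic to be recovered.
  14. The method according to claim 13, characterized in that determining, according to the cluster information and the cluster weight value, whether the current cluster meets the recovery conditions comprises:
    if the cluster information of the current cluster indicates that no alarm or fault has occurred in the current cluster within a specified period, the cluster information indicates that the redundant bandwidth of the current cluster is sufficient for the recovery bandwidth to be taken on, and the cluster weight value of the current cluster is greater than or equal to a specified weight threshold, determining that the current cluster meets the recovery conditions.
  15. The method according to claim 12, characterized in that the recovery bandwidth to be taken on is determined as follows:
    measuring the sum of the failed bandwidth transferred out of the current cluster, and counting the number of nodes currently covering the failed bandwidth;
    calculating, according to the bandwidth sum and the number of nodes, the recovery bandwidth to be taken on by the nodes in the current cluster.
  16. The method according to claim 13, characterized in that recovering the traffic to be recovered in batches comprises:
    identifying the domain names corresponding to the traffic to be recovered, identifying the priority of each domain name, and identifying the size of the channel bandwidth under each domain name;
    performing batch recovery according to the priority of each domain name and the size of the channel bandwidth.
  17. The method according to claim 13, characterized in that the method further comprises:
    if the current cluster meets the recovery conditions, performing batch traffic recovery on the current cluster during a specified time window;
    if the current cluster does not meet the recovery conditions and traffic cannot be recovered within a specified period, relaxing the recovery conditions during the specified time window, and forcing batch traffic recovery on the current cluster that meets the recovery conditions;
    if batch traffic recovery cannot be performed on the current cluster within the specified time window, generating alarm information.
  18. The method according to claim 13, characterized in that, when the traffic to be recovered is recovered in batches, the method further comprises:
    identifying the domain names corresponding to the traffic to be recovered, reading the configuration information of each domain name, and recovering the traffic of each domain name at the recovery time indicated by the configuration information.
  19. The method according to claim 13, characterized in that the batch recovery strategy comprises recovering the traffic to be recovered in batches according to user-defined alarm names and domain names, and/or recovering the traffic to be recovered in batches according to a user-defined bandwidth ratio.
  20. A system for recovering node traffic, characterized in that the system comprises:
    a recovery condition determination unit, configured to acquire cluster information and a cluster weight value of the current cluster, and to determine, according to the cluster information and the cluster weight value, whether the current cluster meets the recovery conditions;
    a batch recovery unit, configured to recover the traffic to be recovered in batches according to a preset batch recovery strategy if the current cluster meets the recovery conditions;
    a gradual recovery unit, configured to add the current cluster to the coverage clusters of the traffic to be recovered and to remove the backup clusters of the current cluster from the coverage clusters in batches when performing traffic recovery according to the current batch recovery strategy, so as to complete the recovery process of the traffic to be recovered.
  21. A central server, characterized in that the central server comprises a memory and a processor, the memory is used to store a computer program, and when the computer program is executed by the processor, the method according to any one of claims 13 to 19 is implemented.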
PCT/CN2020/091868 2020-04-13 2020-05-22 Method and system for transferring in and recovering node traffic, and central server WO2021208184A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010285725.5A CN111614484B (zh) 2020-04-13 2020-04-13 Method and system for transferring in and recovering node traffic, and central server
CN202010285725.5 2020-04-13

Publications (1)

Publication Number Publication Date
WO2021208184A1 true WO2021208184A1 (zh) 2021-10-21

Family

ID=72203949

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/091868 WO2021208184A1 (zh) 2020-04-13 2020-05-22 Method and system for transferring in and recovering node traffic, and central server

Country Status (2)

Country Link
CN (1) CN111614484B (zh)
WO (1) WO2021208184A1 (zh)

Also Published As

Publication number Publication date
CN111614484B (zh) 2021-11-02
CN111614484A (zh) 2020-09-01

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20931339

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20931339

Country of ref document: EP

Kind code of ref document: A1