CN112583648B

CN112583648B - Intelligent service fault processing method based on DNS

Info

Publication number: CN112583648B
Application number: CN202110206285.4A
Authority: CN
Inventors: 李金龙; 于松伟; 华福才; 刘占宇; 李娇
Original assignee: Beijing Urban Construction Design and Development Group Co Ltd
Current assignee: Beijing Urban Construction Design and Development Group Co Ltd
Priority date: 2021-02-24
Filing date: 2021-02-24
Publication date: 2021-06-25
Anticipated expiration: 2041-02-24
Also published as: CN112583648A

Abstract

The invention discloses an intelligent service fault processing method based on DNS (Domain Name System). The method implements, through DNS, a failover strategy driven by the detection of multiple service metrics, which makes service reliability more intelligent. Only the single failed service is switched, so the impact of a failover is kept to a minimum and switching is performed at the fine granularity of an individual service. Because the intelligent switching is carried out through DNS, its accuracy, controllability and efficiency are improved. The invention also discloses an intelligent service system based on DNS, in which data center module groups deployed across multiple regions work together to provide automatic service discovery and switching within one data center and across data centers, without modifying the service applications. The invention addresses three problems of the prior art, with notable effect: the switching affects a large scope and ignores failover granularity; the detection process is limited; and the solution is intrusive to business services.

Description

Intelligent service fault processing method based on DNS

Technical Field

The invention relates to the technical field of computers, in particular to an intelligent service fault processing method based on a DNS (domain name system).

Background

With the rapid adoption of cloud computing in enterprises and the steadily rising availability requirements for their core services, cloud computing provides scalable and more reliable infrastructure for the continued growth of enterprise services. Solutions for keeping enterprise services available when a service becomes abnormal or a disaster occurs include deployment modes such as same-city active-active and remote disaster recovery.

One existing approach is failover based on the cooperation of DNS and a health check service (as shown in fig. 1). The failure status of a service IP address is determined from the health check result, and if a failure occurs, a controller switches domain name resolution to a backup service IP. This scheme has a large switching scope: services in the data center that have not failed are also switched directly to the standby data center, so the granularity of failover is not considered. In addition, the scheme only checks whether the IP is available by a generic detection method and does not decide whether to switch on the basis of actual access monitoring metrics, which limits it. Another existing approach is failover based on a unified service registry (as shown in fig. 2). Service availability is determined from the registry's detection of service health, and if the primary service is unavailable, the calling client automatically switches to the backup service. This scheme requires a unified registry and an additional development kit in the business program so that services discover each other through the registry, and the failover is performed by the client, so it is intrusive to the business service.

In view of these problems, a method and a device are designed to solve the following problems of the prior art: the switching affects a large scope and the granularity of failover is not considered; the detection process is limited; and the solution is intrusive to business services.

Disclosure of Invention

In view of the above defects, the invention provides an intelligent service fault processing method based on DNS (Domain Name System) to solve the following problems of the prior art: the switching affects a large scope and the granularity of failover is not considered; the detection process is limited; and the solution is intrusive to business services.

The invention provides an intelligent service fault processing method based on a DNS, which comprises the following specific steps:

step 1, monitoring and probing all deployed services, and registering them with a DNS service based on the probe results;

step 2, flexibly defining the detection statistics of each service, and judging the service state in combination with additional reference factors obtained from the monitoring infrastructure;

step 3, adjusting the DNS service of the data center according to the service state;

step 4, providing visual service management of the service running state and the service switching status;

step 5, performing address resolution on the domain name of the service to be called, and calling the target service through the resolved IP address.

Preferably, the flexibly defined detection in step 2 includes:

step 2.1, probing the IP and port to ensure that the expected service is in a listening state;

step 2.2, checking the health state through the health check address configured by the user;

step 2.3, judging according to the proportion of available instances in each service;

step 2.4, performing failover as needed according to a judgment based on the actual results of calls to the service.

Preferably, step 2.3 specifically includes: if the ratio of unavailable instances to the total number of instances in the service is greater than a specified threshold, that is, the service cannot carry the required workload, the service is failed over.
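As an illustration only, the two basic probes of steps 2.1 and 2.2 could be sketched as follows. The function names `probe_port` and `check_health`, the timeout values, and the rule that an HTTP 2xx status means "healthy" are assumptions made for this sketch and are not specified in the disclosure.

```python
import socket
import urllib.error
import urllib.request


def probe_port(ip: str, port: int, timeout: float = 2.0) -> bool:
    """Step 2.1 (sketch): True if a TCP connection to ip:port succeeds,
    i.e. something is listening on that port."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_health(url: str, timeout: float = 2.0) -> bool:
    """Step 2.2 (sketch): True if the user-configured health check URL
    answers with an HTTP 2xx status (an assumed convention)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # HTTP errors, network errors and malformed URLs all count as unhealthy.
        return False
```

A registration controller could call `probe_port` first and only run `check_health` when the port is listening, since the port probe is the cheaper of the two.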

Preferably, the step 3 specifically comprises the following steps:

step 3.1, when the detected service is in a normal state, resolution returns the address corresponding to the service, and one service directly calls the corresponding service within the data center;

step 3.2, when the detected service is in an unhealthy state, updating the address corresponding to the service to the address of the standby service of that service in the same data center;

step 3.3, when both the detected service and its standby service are in an unhealthy state, updating the address corresponding to the service to the address of another standby service of that service in the standby data center.

Preferably, the step 5 specifically comprises the following steps:

step 5.1, deploying a DNS cache service on each service node, so that when a user service running on the current node needs to call the domain name of another service, the name is resolved through the local DNS cache service first;

step 5.2, if the local DNS cache cannot resolve the service domain name, performing address resolution through the data center DNS service for the service domain name to be called, and obtaining the actual IP address of the corresponding service;

step 5.3, setting an expiration time for the records in the local DNS cache, and periodically synchronizing the latest records from the data center DNS service;

step 5.4, when the local DNS cache service obtains a resolved address from the data center DNS service, the address is also stored in the local DNS cache, and subsequent resolution requests can be answered directly by the local DNS cache service.

Preferably, the specific steps in step 3 further include:

step 3.4, based on the rich metrics that services can provide in a cloud computing environment, the user manually degrades the service and sets the service state to unavailable, and DNS resolution is automatically switched to the standby service, or to another standby service in the standby data center, according to the service state, wherein the rich metrics include availability detection, access efficiency and exception statistics.

Preferably, the reference factors in step 2 include: a state judgment result obtained from SLB-based service detection; abnormal conditions in the resources of the nodes on which the service is deployed; and excessively high service access traffic detected by monitoring.

The invention also provides an intelligent service system based on DNS, which comprises a plurality of groups of data center module groups, wherein the data center module groups comprise:

the main data center module comprises a first service module, a second service module electrically connected with the first service module, a main standby module electrically connected with the first service module, a main registration controller and a main DNS service module, wherein the main registration controller is respectively electrically connected with the first service module, the second service module, the main DNS service module and the main standby module, and the main standby module is used for backing up data on the second service module;

and the auxiliary data center module comprises an auxiliary registration controller, an auxiliary DNS service module and an auxiliary standby module, wherein the auxiliary standby module is electrically connected with the first service module through a gateway and is used for backing up data on the second service module, and the auxiliary registration controller is respectively electrically connected with the main registration controller, the auxiliary DNS service module and the auxiliary standby module.

Preferably, the system further comprises:

the cloud platform module, which is electrically connected with the data center module group and is provided with monitoring equipment, the monitoring equipment being used for acquiring additional reference factors and displaying monitoring data;

and the local DNS module, which is electrically connected with the data center module group.

According to the above scheme, the DNS-based intelligent service fault processing method is a service fault processing method based on a global DNS in a cloud environment. Intelligent failover is carried out through DNS, which improves the accuracy, controllability and efficiency of switching. The method implements, through DNS, a failover strategy driven by the detection of multiple service metrics, which makes service reliability more intelligent. Only the single failed service is switched, so the impact of failover is kept to a minimum and switching is performed at the fine granularity of an individual service. The invention also provides an intelligent service system based on DNS, in which data center module groups deployed across multiple regions work together to provide automatic service discovery and switching within one data center and across data centers, without modifying the service applications. The invention addresses the following problems of the prior art: the switching affects a large scope and the granularity of failover is not considered; the detection process is limited; and the solution is intrusive to business services. It has a notable effect and is suitable for wide adoption.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a block diagram of a prior art DNS and health check service orchestration based failover scheme;

FIG. 2 is a process diagram of a prior art unified service registry based failover scheme;

fig. 3 is a first process block diagram of a DNS-based intelligent service fault handling method according to an embodiment of the present invention;

fig. 4 is a process block diagram of local DNS caching in a DNS-based intelligent service fault handling method according to an embodiment of the present invention;

fig. 5 is a block diagram of a visualized management process of a DNS-based intelligent service fault handling method according to an embodiment of the present invention;

fig. 6 is a process block diagram of a DNS-based intelligent service fault handling method according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Same-city active-active means that two data centers are built in the same city or in nearby areas (no more than 200 km apart): one is the production data center, responsible for daily production operation, and the other is the disaster backup center, responsible for running the application systems after a disaster occurs. In same-city disaster backup, the data center is close to the disaster backup center and the quality of the communication line is better, so synchronous data replication is easier to achieve and better data integrity and consistency are ensured.

Remote disaster recovery means that the main center and the standby center are far apart (more than 200 km), so asynchronous mirroring is generally adopted, and the resulting delay carries some risk of data inconsistency. Remote disaster backup is typically used to cope with catastrophic events such as fire, building damage, earthquake and flood.

Referring to fig. 3 to fig. 6, a description will now be given of an embodiment of a DNS-based intelligent service fault handling method according to the present invention. The intelligent service fault processing method based on the DNS comprises the following specific steps:

S1, monitoring and probing all deployed services, and registering them with a DNS (Domain Name System) service based on the probe results;

S2, flexibly defining the detection statistics of each service, and judging the service state in combination with additional reference factors obtained from the monitoring infrastructure. Detection thereby covers a broader range of signals, which overcomes the limitation of the detection process and improves monitoring accuracy;

The flexibly defined detection in S2 includes:

S2.1, the most basic detection mode: probing the liveness of the IP (Internet Protocol) address and port to ensure that the expected service is in a listening state;

S2.2, checking the health state through the health check address configured by the user;

S2.3, judging according to the proportion of available instances in each service;

S2.3 specifically includes: if the ratio of unavailable instances to the total number of instances in a service is greater than a specified threshold, that is, the service cannot carry the required workload, the service is failed over. Illustratively, if a service has 5 instances and more than 2 of them are unavailable, a failover is required.

S2.4, performing failover as needed according to a judgment based on the actual results of calls to the service. Illustratively, a failover is required if more than 20 of every 100 calls to the service end abnormally (> 20%), or if more than 20% of the calls have a response delay exceeding 5 s.
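The two illustrative thresholds in S2.3 and S2.4 can be expressed as a small sketch. The names `CallStats`, `instances_need_failover` and `calls_need_failover` are hypothetical; the default of 0.4 is derived from the "more than 2 of 5 instances" example, and the handling of empty inputs is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class CallStats:
    total_calls: int   # calls observed in the sampling window
    failed_calls: int  # calls that ended abnormally
    slow_calls: int    # calls whose response delay exceeded the limit (5 s in the example)


def instances_need_failover(unavailable: int, total: int, max_ratio: float = 0.4) -> bool:
    """S2.3 (sketch): fail over when the share of unavailable instances exceeds the threshold."""
    if total <= 0:
        return True  # assumption: a service with no instances is treated as failed
    return unavailable / total > max_ratio


def calls_need_failover(stats: CallStats, max_ratio: float = 0.2) -> bool:
    """S2.4 (sketch): fail over when more than 20% of calls fail or are too slow."""
    if stats.total_calls == 0:
        return False  # assumption: no traffic gives no evidence of failure
    failed_ratio = stats.failed_calls / stats.total_calls
    slow_ratio = stats.slow_calls / stats.total_calls
    return failed_ratio > max_ratio or slow_ratio > max_ratio
```

With these defaults, 3 unavailable instances out of 5 trigger a failover while 2 out of 5 do not, and 21 abnormal calls out of 100 trigger a failover while exactly 20 do not, matching the "more than" wording of the examples.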

Combining the service state judgment in S2 with infrastructure monitoring generally involves the following influencing factors: 1. if the service is exposed through the SLB (Server Load Balancer, i.e. software load balancing) of the infrastructure, detection of the service can be combined with the SLB's monitoring system to judge the state, for example when the service is inaccessible through the SLB, or when the monitoring statistics on the SLB are abnormal, including the number of unhealthy service instances exceeding a certain number, a high access error rate, high access latency, and the like; 2. host monitoring shows that the resources of the nodes on which the service is deployed are abnormal, which makes the service unstable, for example CPU or memory usage exceeding 75%; 3. network monitoring shows that the traffic of service access is too high, occupying 80% of the available bandwidth, so that network congestion is imminent.
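The three kinds of infrastructure factors listed above could be collected into a list of reasons, as in the following sketch. `InfraSignals` and `infra_risk_factors` are hypothetical names, the usage values are assumed to be fractions between 0 and 1, and how the resulting reasons are weighed against the service's own detection results is left open, since the description does not specify it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class InfraSignals:
    slb_reachable: bool       # can the service be reached through the SLB?
    slb_stats_abnormal: bool  # SLB monitoring reports abnormal statistics
    cpu_usage: float          # node CPU usage as a fraction, 0.0 to 1.0
    memory_usage: float       # node memory usage as a fraction
    bandwidth_usage: float    # share of available bandwidth used by service access


def infra_risk_factors(
    s: InfraSignals,
    host_limit: float = 0.75,       # "exceeds 75%" in the example
    bandwidth_limit: float = 0.80,  # "occupies 80%" in the example
) -> List[str]:
    """Return the infrastructure factors that currently argue for a failover."""
    reasons: List[str] = []
    if not s.slb_reachable:
        reasons.append("service unreachable through SLB")
    if s.slb_stats_abnormal:
        reasons.append("abnormal SLB statistics")
    if s.cpu_usage > host_limit or s.memory_usage > host_limit:
        reasons.append("node resources abnormal")
    if s.bandwidth_usage >= bandwidth_limit:
        reasons.append("network congestion risk")
    return reasons
```

Returning the reasons rather than a single boolean keeps the sketch usable for the visual management described in S4, where an operator would want to see why a service was marked unhealthy.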

S3, adjusting the DNS service of the data center according to the service state;

S3.1, when the detected service is in a normal state, resolution returns the address corresponding to the service, and one service directly calls the corresponding service within the data center;

S3.2, when the detected service is in an unhealthy state, updating the address corresponding to the service to the address of the standby service of that service in the same data center;

S3.3, when both the detected service and its standby service are in an unhealthy state, updating the address corresponding to the service to the address of another standby service of that service in the standby data center.

S3.4, based on the rich metrics that services can provide in a cloud computing environment, in necessary scenarios such as planned power outages and system maintenance, the user manually degrades the service and sets the service state to unavailable. The registration controller then automatically switches DNS resolution to the standby service, or to another standby service in the standby data center, according to the service state. Switching in advance makes it possible to verify that the service works normally after the switch. The rich metrics include availability detection, access efficiency and exception statistics.
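The selection order in S3.1 to S3.3 and the manual degradation in S3.4 can be sketched as below. The names `State`, `select_address` and `manual_degrade` are hypothetical, and the two-valued state is a simplification assumed for this sketch.

```python
from enum import Enum
from typing import Dict


class State(Enum):
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"


def select_address(
    primary_state: State,
    standby_state: State,
    primary_ip: str,
    standby_ip: str,
    remote_gateway_ip: str,
) -> str:
    """Pick the address that the service's DNS record should point to."""
    if primary_state is State.HEALTHY:
        return primary_ip         # S3.1: call directly within the data center
    if standby_state is State.HEALTHY:
        return standby_ip         # S3.2: standby service in the same data center
    return remote_gateway_ip      # S3.3: standby service in the standby data center


def manual_degrade(states: Dict[str, State], service: str) -> None:
    """S3.4 (sketch): an operator marks a service unavailable ahead of planned work."""
    states[service] = State.UNHEALTHY
```

For example, with the addresses used in the embodiment's example (10.10.123.2 for service b, 10.10.123.3 for service b', and 192.168.2.3 for the gateway of data center 2), `select_address(State.UNHEALTHY, State.HEALTHY, "10.10.123.2", "10.10.123.3", "192.168.2.3")` returns `"10.10.123.3"`.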

The judgment methods in S2 serve as the policy for changing DNS domain name resolution, which makes service switching more flexible and accurate and allows users to extend and customize the switching criteria. Illustratively, the registration controller exposes an external extension interface for marking service state, which allows users to develop their own service detection implementation and pass the final service state to the registration controller. The returned information may include the list of instances in the service, the state of each instance, access concurrency, latency, resource usage, and the like. The registration controller judges whether the service is in an unhealthy state based on the service state returned by the extension, then updates the domain name mapping record in the DNS service, so that the domain name resolution result is switched when the service is called.
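One possible shape for such an extension interface is sketched below. The description does not define the interface, so every name here (`InstanceReport`, `ServiceReport`, `RegistrationController.register_detector`, `is_unhealthy`) and the ratio-based judgment are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class InstanceReport:
    address: str
    healthy: bool
    concurrency: int = 0     # current concurrent accesses
    latency_ms: float = 0.0  # observed access latency


@dataclass
class ServiceReport:
    service: str
    instances: List[InstanceReport] = field(default_factory=list)


# A user-supplied detector takes a service name and returns its report.
Detector = Callable[[str], ServiceReport]


class RegistrationController:
    """Hypothetical controller that accepts user-defined detectors."""

    def __init__(self, max_unhealthy_ratio: float = 0.4) -> None:
        self._detectors: Dict[str, Detector] = {}
        self._max_unhealthy_ratio = max_unhealthy_ratio

    def register_detector(self, service: str, detector: Detector) -> None:
        self._detectors[service] = detector

    def is_unhealthy(self, service: str) -> bool:
        report = self._detectors[service](service)
        if not report.instances:
            return True  # assumption: no reported instances means unhealthy
        bad = sum(1 for inst in report.instances if not inst.healthy)
        return bad / len(report.instances) > self._max_unhealthy_ratio
```

In this sketch the controller only decides whether the service is unhealthy; updating the DNS record would follow as a separate step once that decision is made.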

S4, providing visual service management of the service running state and the service switching status. A visual management page allows the running and switching of services to be observed in real time, and also allows the user to intervene manually through the page, so that services can be switched ahead of a planned outage to guarantee service reliability;

S5, performing address resolution on the domain name of the service to be called, and finally calling the target service through the resolved IP (Internet Protocol) address.

S5.1, deploying a DNS cache service on each service node. When a user service running on the current node needs to call the domain name of another service, the name is resolved through the local DNS cache service first, which avoids the network congestion and delay caused by calls over the network;

S5.2, if the local DNS cache (Local DNS Cache) cannot resolve the name, performing address resolution through the data center DNS service for the service domain name to be called, and obtaining the actual IP address of the corresponding service;

S5.3, setting an expiration time for the records in the local DNS cache, so that the latest records in the data center DNS service are synchronized periodically;

S5.4, when the local DNS cache service obtains a resolved address from the data center DNS service, the address is also stored in the local DNS cache, and subsequent resolution requests can be answered directly by the local DNS cache service.

Resolving addresses through a local cache improves the efficiency of DNS resolution. It prevents DNS from becoming a bottleneck for service access under a large volume of lookups, reduces network request latency and the load on the data center DNS, and thereby improves service call performance.
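The cache behaviour of S5.1 to S5.4 can be sketched as a small in-memory class. `LocalDnsCache`, the `upstream` callable standing in for the data center DNS service, and the 30-second default expiration are assumptions of this sketch. As a simplification, expired records are refreshed lazily on the next lookup rather than by a periodic background synchronization.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class LocalDnsCache:
    """Node-local cache in front of the data center DNS service (sketch)."""

    def __init__(
        self,
        upstream: Callable[[str], Optional[str]],
        ttl_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._upstream = upstream  # stands in for the data center DNS service
        self._ttl = ttl_seconds    # S5.3: expiration time of cached records
        self._clock = clock
        self._records: Dict[str, Tuple[str, float]] = {}  # name -> (address, expiry)

    def resolve(self, name: str) -> Optional[str]:
        now = self._clock()
        entry = self._records.get(name)
        if entry is not None and entry[1] > now:
            return entry[0]                  # S5.1: answered from the local cache
        address = self._upstream(name)       # S5.2: fall back to the data center DNS
        if address is not None:
            self._records[name] = (address, now + self._ttl)  # S5.4: store the result
        else:
            self._records.pop(name, None)    # drop a stale record that no longer resolves
        return address
```

The expiration time bounds how long a node can keep calling a failed address after the data center DNS record has been switched, so a shorter value gives faster failover at the cost of more upstream lookups.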

Compared with the prior art, the DNS-based intelligent service fault processing method addresses when and how to fail over, in both same-city active-active and remote disaster recovery scenarios, which ensures the efficiency, reliability and observability of switching. It also keeps existing services as free from intrusion as possible, which keeps the implementation cost under control. The method performs intelligent failover through DNS and improves the accuracy, controllability and efficiency of switching. A failover strategy driven by the detection of multiple service metrics is implemented through DNS, which makes service reliability more intelligent. Only the single failed service is switched, so the impact of failover is kept to a minimum and switching is performed at the fine granularity of an individual service. The invention addresses the following problems of the prior art, with notable effect: the switching affects a large scope and the granularity of failover is not considered; the detection process is limited; and the solution is intrusive to business services.

Referring to fig. 3 to fig. 6, an embodiment of a DNS-based intelligent service system according to the present invention will be described. The intelligent service system based on the DNS comprises a plurality of groups of data center module groups, wherein each data center module group comprises a main data center module and an auxiliary data center module, each main data center module comprises a first service module, a second service module electrically connected with the first service module, a main standby module electrically connected with the first service module, a main registration controller and a main DNS service module, each main registration controller is electrically connected with the first service module, the second service module, the main DNS service module and the main standby module respectively, and the main standby module is used for backing up data on the second service module; the auxiliary data center module comprises an auxiliary registration controller, an auxiliary DNS service module and an auxiliary standby module, the auxiliary standby module is electrically connected with the first service module through a gateway and used for backing up data on the second service module, and the auxiliary registration controller is respectively electrically connected with the main registration controller, the auxiliary DNS service module and the auxiliary standby module.

The main registration controller is used for monitoring and detecting all services deployed in the main data center module and performing DNS registration to the main DNS service module based on a detection result, and relevant environment dependence information is configured on the main registration controller to determine domain name rules, gateway addresses and the like of the main data center; the auxiliary registration controller is used for monitoring and detecting all services deployed in the auxiliary data center module and performing DNS registration to the auxiliary DNS service module based on detection results, relevant environment dependence information is configured on the auxiliary registration controller to determine domain name rules, gateway addresses and the like of the auxiliary data center, and the DNS resolved addresses are determined through coordination between the main registration controller and the auxiliary registration controller. According to the intelligent service system based on the DNS, the service automatic discovery and switching in the same data center and across the data centers are realized through the cooperative work of the multi-region deployed data center module group, and the service application is not required to be modified.

In this embodiment, the intelligent service system based on the DNS further includes a cloud platform module and a local DNS module, wherein the cloud platform module is electrically connected to the data center module group and is provided with a monitoring device, and the monitoring device is used for obtaining more reference factors and displaying monitoring data; the local DNS module is electrically connected with the data center module group. The cloud platform module is a visual service management platform, can be used for observing the service running state of the platform at any time and judging whether service switching occurs, and can be used for providing an easy-to-use management tool for service monitoring and management. The local DNS module carries out local DNS address resolution, reduces DNS address resolution pressure of the main data center and the auxiliary data center, and improves working performance.

For example, referring to fig. 3, consider two data centers, which may be in a same-city active-active or a remote disaster recovery scenario. Each data center deploys two services, "service a" (i.e., a first service module) and "service b" (i.e., a second service module), and data center 1 is the main data center (i.e., a main data center module). Each data center has a registration controller and a corresponding DNS service, and the data center's DNS service is used for address resolution when services within that data center call one another. For failover, the registration controller flexibly defines the detection statistics of each service and judges the service state in combination with additional reference factors obtained from the monitoring infrastructure provided by the cloud platform. Service b has a standby service "service b'" (i.e., a main standby module) in data center 1 and a standby service "service b" (i.e., an auxiliary standby module) in data center 2. Under normal conditions, the records in the DNS service of the data center are:

10.10.123.1 service-a.dc1.com
10.10.123.2 service-b.dc1.com

When service a calls service b, service-b.dc1.com is resolved through DNS, and the call is made directly to the address 10.10.123.2 within data center 1. When the registration controller detects that service b is in an unhealthy state, and service b' is marked as the standby service for service b, the DNS record is updated to: 10.10.123.3 service-b.dc1.com, and the address at which service a calls service b is switched automatically.

Through coordination and information synchronization between the registration controllers, data center 2 serves as the standby data center (i.e., an auxiliary data center module). When neither service b nor service b' in data center 1 is available, the registration controller of data center 1 updates the DNS service record in data center 1 to: 192.168.2.3 service-b.gateway.dc2.com. The domain name resolves to the gateway address of data center 2, that is, service b in data center 2 is accessed through the gateway, which implements a service call across data centers.

The embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts among the embodiments are referred to each other. Details which are not described in detail in the embodiments of the invention belong to the prior art which is known to the person skilled in the art.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. An intelligent service fault processing method based on a DNS is characterized by comprising the following specific steps:

step 1, monitoring and detecting all deployed services, registering to a DNS (domain name system) service based on a detection result, configuring relevant environment dependence information on a registration controller, and determining a domain name rule and a gateway address of a data center;

step 2, flexibly defining detection statistics based on services, and judging service states by combining reference factors obtained by monitoring infrastructure, wherein the reference factors comprise state judgment results obtained by performing service detection based on SLB, abnormal conditions of node resources deployed by the services, and overhigh flow of service access;

the flexible definition in step 2 comprises:

step 2.4, performing fault switching according to the requirement on the judgment result based on the actual condition of the called service;

step 3, adjusting the DNS service of the data center according to the service state, wherein the adjustment comprises that one service directly calls the corresponding other service, updates the address and performs degradation processing on the service;

2. The intelligent DNS-based service failure handling method according to claim 1, wherein the specific step of step 2.3 includes: if the ratio of the unavailable instance to the total number of instances in the service is greater than a specified threshold, i.e., the service cannot carry the required workload, the service is failed over.

3. The intelligent service fault handling method based on the DNS as claimed in claim 2, wherein the specific step of the step 3 includes:

4. The intelligent service fault handling method based on DNS according to claim 3, wherein the concrete step of said step 5 includes:

5. The intelligent DNS-based service fault handling method according to any one of claims 3-4, wherein the specific step of step 3 further comprises: