WO2020211362A1 - Method, apparatus, and computer device for improving the availability of a cluster system - Google Patents

Method, apparatus, and computer device for improving the availability of a cluster system

Info

Publication number
WO2020211362A1
Authority
WO
WIPO (PCT)
Prior art keywords
host
service
preset
currently
designated
Prior art date
Application number
PCT/CN2019/118163
Inventor
赵骏
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司
Publication of WO2020211362A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0663 Performing the actions predefined by failover planning, e.g. switching to standby network elements
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network

Definitions

  • This application relates to the field of distributed deployment technology, and in particular to a method, device and computer equipment for improving the availability of a cluster system.
  • Cluster applications can run on thousands of ordinary servers.
  • The scale of the cluster can be expanded dynamically, but the cluster must also withstand the higher failure rate of ordinary computers.
  • This requires the system to ensure high availability in the event of hardware and software failures.
  • In existing systems, system services can only be transferred within the local host, without taking hosts in other service areas into account, so the availability of a Docker container-based cluster system is low and it cannot cope with large-scale system failures.
  • The main purpose of this application is to provide a method, device and computer equipment for improving the availability of a cluster system, aiming to overcome the drawbacks of existing Docker container-based cluster systems, namely low availability and the inability to cope with large-scale system failures.
  • This application provides a method for improving the availability of a cluster system, which is applied to any host in the cluster system. The cluster system includes multiple service areas, and the service areas are distributed in different regions.
  • The host currently executing the method is the first host, and the method includes:
  • marking the second host that is currently in a failed state as the failed host, and screening a host that is currently in a callable state from each designated service area as the standby host, where the designated service areas are the service areas other than the one to which the failed host belongs, and a host in the callable state is a host whose service call proportion and running load meet the first preset requirement;
  • This application also provides a device for improving the availability of a cluster system, which is applied to any host in the cluster system.
  • The cluster system includes multiple service areas, and the service areas are distributed in different regions.
  • The host currently executing the method is the first host, and the device includes:
  • the monitoring module is used to monitor whether each second host is currently malfunctioning, where the second host is a host other than the first host;
  • the screening module is used to mark the second host that is currently in a failed state as the failed host, and to screen a host that is currently in a callable state from each designated service area as the standby host, where the designated service areas are the service areas other than the one to which the failed host belongs, and a host in the callable state is a host whose service call proportion and running load meet the first preset requirement;
  • the running module is used to use the standby host to run the system service of the failed host.
  • The present application also provides a computer device, including a memory and a processor. The memory stores a computer program, and the processor implements the steps of any one of the above methods when executing the computer program.
  • The present application also provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the steps of any one of the above methods are implemented.
  • This application provides a method, device and computer equipment for improving the availability of a cluster system.
  • Hosts in service areas in different regions check one another for failed hosts according to a first preset frequency. After a failed host is found, the hosts broadcast their operating information to one another so as to screen out a standby host that is currently in a callable state.
  • A normally operating host, chosen at random, then issues an instruction that makes the standby host continue running the system services of the failed host.
  • This achieves high availability for a cluster system deployed across service areas in different regions and prevents the system from becoming unable to run after a large-scale failure.
  • FIG. 1 is a schematic diagram of the steps of a method for improving the availability of a cluster system in an embodiment of the present application;
  • FIG. 2 is a block diagram of the overall structure of an apparatus for improving the availability of a cluster system in an embodiment of the present application;
  • FIG. 3 is a schematic block diagram of the structure of a computer device according to an embodiment of the present application.
  • An embodiment of the present application provides a method for improving the availability of a cluster system, which is applied to any host in the cluster system. The cluster system includes multiple service areas, and the service areas are distributed in different regions. The host currently executing the method is the first host, and the method includes:
  • S1 Monitor whether each second host is currently malfunctioning, where the second host is a host other than the first host;
  • S2 If a failure occurs, mark the second host that is currently in the failed state as the failed host, and screen a host that is currently in a callable state from each designated service area as the standby host, where the designated service areas are the service areas other than the one to which the failed host belongs, and a host in the callable state is a host whose service call proportion and running load meet the first preset requirement;
  • Developers deploy multiple service areas in different regions of the world.
  • Each service area is located in a different region.
  • Each service area corresponds to a group of computer rooms in a particular city and is composed of multiple availability zones.
  • Each service area includes multiple hosts, and developers deploy a monitoring service on each host to monitor and manage the various system services and running processes in that host.
  • Each host confirms and exchanges working status with the other hosts by sending preset signals to them and receiving the preset signals they send.
  • The working status includes information such as whether the host is currently operating normally and the number of calls to the various system services in the host.
  • System services are programs, routines, or processes that perform specified system functions to support other programs, especially low-level (close to hardware) programs. A system process is a process in the operating system together with the memory blocks allocated to it, and is the unit of system resource allocation and scheduling.
  • The hosts in each service area determine whether the other hosts are currently malfunctioning according to a preset rule. Specifically, the hosts exchange preset signals at the first preset frequency. If the system service in a host fails, its external service port becomes unreachable and the host cannot send the preset signal to other hosts.
  • While a normally operating host, that is, the first host, monitors the working status of the other hosts, that is, the second hosts, if it cannot receive the preset signal from one of them, it marks that host as a designated host and starts monitoring whether the designated host sends a preset signal within a preset time period after the current time, in other words whether the first host receives a preset signal from the designated host. If the first host receives the preset signal from the designated host within that preset time period, it determines that the designated host has not failed; if it does not, it determines that the designated host has failed.
  • When the first host determines that one of the second hosts has failed, it first obtains the current classification information of the designated service areas, that is, the service areas other than the one to which the failed host belongs.
  • The classification information is assigned to each service area according to its proportion of service process invocations, and the monitoring service in each host updates it at the second preset frequency. Once the monitoring service has graded the service areas, a service area whose classification information is level 4 is of higher importance, so the developer configures its hosts so that they cannot be called on to take over for a failed host. The first host therefore excludes the service areas whose classification information is level 4 and selects the remaining designated service areas as the service areas currently in a callable state.
  • The first host obtains, through the preset signal, the operating information of each host in the service areas currently in a callable state.
  • The operating information includes the service call information and load information of the host.
  • The first host compares the operating information with the preset screening condition, that is, the first preset requirement, and thereby screens the hosts in the service areas currently in a callable state to obtain a host currently in a callable state as the standby host.
  • The first host marks the host that is currently in a failed state as the failed host.
  • After the standby host is obtained by screening, the first host obtains, from the exchanged preset signals, the service process information of the failed host's system service before the failure.
  • The service process information includes the service type of the system service that each host is responsible for and the service progress of the system service before the failure.
  • The first host deploys the system process corresponding to the system service on the standby host according to the service process information of the failed host, for example by controlling the standby host to download the program container image of the system process so that the system process can run. After the standby host completes the deployment of the system process, the first host starts the standby host running the system service.
  • The step of monitoring whether each second host is currently malfunctioning includes:
  • S101 Receive a preset signal sent by each of the second hosts according to a first preset frequency, so as to monitor the working state of each of the second hosts;
  • A monitoring service is installed on each host in each service area.
  • The monitoring service records the working status of the host in real time, including the current load of the host, the number of calls of each service process in the host, and other information.
  • At the first preset frequency, for example once every 5 seconds, the monitoring service broadcasts this working status to the other hosts in a preset signal, while receiving the preset signals broadcast by the other hosts.
  • The preset signal carries identification information of the host that sends it, such as a serial number, so that other hosts can confirm which host the preset signal comes from.
  • The first host can identify the designated host according to whether the preset signal is received. The designated host is a host that is currently not sending a preset signal.
  • After identifying the designated host, the first host monitors the working status of the designated host during a preset time period after the current time, based on the preset signals sent at the first preset frequency.
  • The preset time period is set by the developer, who can set different preset time periods according to the importance of the hosts in different service areas.
  • A mapping relationship table between the preset time periods and the hosts is established and stored in the database of each host.
  • The first host can look up the preset time period corresponding to each host in the mapping relationship table. If, within the preset time period after the current time, for example within 5 minutes, the first host still cannot receive the preset signal sent by the designated host, it determines that the designated host has failed. If the first host receives the preset signal sent by the designated host within that period, it determines that the designated host has not failed.
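The detection rule described above can be sketched as follows. This is only an illustration under stated assumptions: the `FailureDetector` class, its method names, and the status strings are hypothetical, and the 5-second and 5-minute values are simply the examples quoted in the text. Timestamps are passed in explicitly so the logic can be exercised without real clocks or networking.

```python
from dataclasses import dataclass, field

HEARTBEAT_INTERVAL_S = 5.0   # first preset frequency: one signal every 5 seconds (example from the text)
DEFAULT_GRACE_S = 300.0      # preset time period: 5 minutes (example from the text)


@dataclass
class FailureDetector:
    """Tracks the last preset signal seen from each second host (hypothetical helper)."""
    grace_by_host: dict = field(default_factory=dict)  # per-host preset time period (mapping table)
    last_seen: dict = field(default_factory=dict)      # host id -> time of the last received signal

    def record_signal(self, host_id: str, now: float) -> None:
        self.last_seen[host_id] = now

    def status(self, host_id: str, now: float) -> str:
        silence = now - self.last_seen[host_id]
        if silence <= HEARTBEAT_INTERVAL_S:
            return "normal"
        grace = self.grace_by_host.get(host_id, DEFAULT_GRACE_S)
        # A missed signal only marks the host as "designated"; it is declared
        # failed once the silence also outlasts its preset time period.
        if silence <= HEARTBEAT_INTERVAL_S + grace:
            return "designated"
        return "failed"
```

A host that resumes sending signals during the preset time period goes back to "normal" on its next recorded signal, which matches the "has not failed" branch of the rule.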
  • The step of screening a host currently in a callable state as a standby host from each designated service area includes:
  • S201 Acquire current classification information of each designated service area, where the classification information is correspondingly set according to the proportion of service calls in each service area;
  • S202 Filter service areas currently in a callable state according to the classification information, where the service areas in the callable state are service areas for which the classification information meets a second preset requirement;
  • S203 Obtain operating information of each host in the service area currently in a callable state, where the operating information includes the percentage of service calls and operating load of the host;
  • S204 From the service area currently in the callable state, screen the host whose operating information meets the third preset requirement as the standby host.
  • When the first host determines that one of the second hosts has failed, it first obtains the current classification information of the designated service areas, that is, the service areas other than the one to which the failed host belongs.
  • The classification information is set by the developer according to the proportion of system service calls corresponding to each service area.
  • The monitoring service in each host updates the classification information at the second preset frequency and broadcasts it to the other hosts after each update. Each host, including the first host, can therefore directly query the current classification information of each designated service area.
  • According to the developer's settings, the proportion of service calls in a service area whose classification information is level 4 has reached more than 70%. Such an area is of high importance and is unlikely to have spare hosts available to run the system services of other service areas, so the developer specifies that hosts in a level-4 service area cannot be called on to take over for a failed host.
  • The first host excludes the service areas whose classification information is level 4 and selects the remaining designated service areas as the service areas currently in a callable state. The first host then obtains, through the preset signal, the operating information of each host in those service areas. The operating information includes the host's proportion of service calls and its operating load.
  • The first host compares the operating information with the preset screening conditions, and thereby screens the hosts in the service areas currently in a callable state to obtain a host currently in a callable state as the standby host.
  • The screening conditions are preset by the developer.
  • For example, the screening condition may be that a host whose proportion of service calls is less than 1% and whose operating load is less than 10% can serve as a callable host. If the current proportion of service calls of host A is 0.1% and its running load is 5%, the operating information of host A meets the screening condition and host A can serve as the standby host.
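As a rough illustration of steps S201 to S204, the two-stage screening can be written as a filter over service areas followed by a filter over hosts. The `HostInfo` type and the function name are assumptions made for this sketch; the thresholds are the example values from the text, not fixed parts of the method.

```python
from dataclasses import dataclass

EXCLUDED_LEVEL = 4       # level-4 service areas may not supply standby hosts
MAX_CALL_SHARE = 0.01    # example threshold from the text: below 1%
MAX_LOAD = 0.10          # example threshold from the text: below 10%


@dataclass(frozen=True)
class HostInfo:
    name: str
    area: str
    call_share: float    # proportion of service calls, in the range 0..1
    load: float          # running load, in the range 0..1


def screen_standby_hosts(hosts, area_levels, failed_area):
    """Return the hosts that qualify as standby hosts for a failure in failed_area."""
    # Stage 1: keep designated service areas (not the failed host's own area)
    # whose classification level allows their hosts to be called.
    callable_areas = {
        area for area, level in area_levels.items()
        if area != failed_area and level != EXCLUDED_LEVEL
    }
    # Stage 2: within those areas, keep hosts meeting the screening condition.
    return [
        h for h in hosts
        if h.area in callable_areas
        and h.call_share < MAX_CALL_SHARE
        and h.load < MAX_LOAD
    ]
```

With the figures from the text, a host with a 0.1% call proportion and a 5% load passes the second stage, provided its service area passed the first.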
  • the method includes:
  • S4 According to the second preset frequency, obtain the first number of service calls within the time period corresponding to the second preset frequency and the second number of service calls corresponding to each of the second hosts, where the first number of service calls is the number of calls of the system service of the first host, and the second number of service calls is the number of calls of the system service of the second host;
  • S5 According to the first number of service calls and each second number of service calls, calculate the total number of service calls and the number of service calls in each of the service areas;
  • S7 Input each proportion of service calls into a pre-built grading information database and match it to obtain the grading information corresponding to each proportion of service calls, where the grading information database includes a mapping relationship table between proportions of service calls and grading information;
  • S8 Obtain the classification information corresponding to each of the service areas according to the correspondence between the service invocation proportion and the service area, and the correspondence between the service invocation proportion and the classification information.
  • A monitoring service is installed in each host, and it records the working status of the host in real time.
  • The working status includes the number of calls of the system service in each host.
  • The monitoring service sends the first number of service calls of the first host to the remaining hosts at the second preset frequency and receives the second number of service calls sent by each second host.
  • The first number of service calls is the number of calls of the system service of the first host within the time period corresponding to the second preset frequency, and the second number of service calls is the number of calls of the system service of each second host within that time period.
  • For example, if the second preset frequency is once per hour and the first number of service calls was last acquired at 10 am, the first number of service calls acquired now is the number of service calls of the first host between 10 am and 11 am.
  • The monitoring service in the first host first calculates the total number of service calls of the service processes of all hosts from the first number of service calls and each second number of service calls, and sums the service calls of the hosts in each service area to obtain the number of service calls of each service area. It then obtains the proportion of service calls of each service area as the ratio of that area's number of service calls to the total number of service calls.
  • For example, suppose service area A contains hosts A, B, and C, whose numbers of service calls are 5, 8, and 7 respectively, so the number of service calls of service area A is 20. If the total number of service calls calculated from the first number of service calls and the second numbers of service calls is 200, the ratio of the area's number of service calls to the total is 0.1, and the proportion of service calls in service area A is 10%.
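The arithmetic in this example can be checked with a short sketch. The function name and the dictionary layout are assumptions; the call counts for hosts A, B, and C come from the text, while host D and service area B are hypothetical and added only so that the total reaches 200.

```python
from collections import Counter


def call_proportions(calls_by_host, area_of_host):
    """Map each service area to its share of the total number of service calls."""
    per_area = Counter()
    for host, calls in calls_by_host.items():
        per_area[area_of_host[host]] += calls
    total = sum(per_area.values())
    if total == 0:
        # No calls were recorded in this period, so every share is zero.
        return {area: 0.0 for area in per_area}
    return {area: count / total for area, count in per_area.items()}


calls = {"A": 5, "B": 8, "C": 7, "D": 180}   # host D is hypothetical
areas = {"A": "area A", "B": "area A", "C": "area A", "D": "area B"}
shares = call_proportions(calls, areas)      # area A: 20 / 200 = 0.1
```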
  • The first host inputs the proportion of service calls of each service area into the pre-built grading information database and matches it against the mapping relationship table between proportions of service calls and classification information to obtain the classification information corresponding to each proportion.
  • The classification information includes the area level of the service area and the preset number of system processes corresponding to that area level.
  • The first host then obtains the classification information of each service area from the correspondence between the proportion of service calls and the service area and the correspondence between the proportion of service calls and the classification information.
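One possible shape for the mapping relationship table is a list of lower bounds searched with `bisect`. This is a sketch, not the table used in the application: the text only indicates that level 4 corresponds to a proportion above 70% and that level 3 has a preset number of 50 system processes, so every other bound and process count below is a placeholder, and treating exactly 70% as level 4 is also an assumption.

```python
from bisect import bisect_right

# Lower bounds of levels 2, 3 and 4. Only the 70% bound for level 4 comes from
# the text; 0.10 and 0.40 are placeholders.
LEVEL_BOUNDS = [0.10, 0.40, 0.70]
# Preset number of system processes per level. Only level 3 -> 50 comes from
# the text; the other values are placeholders.
PRESET_PROCESSES = {1: 10, 2: 25, 3: 50, 4: 100}


def classify(call_share):
    """Return (area level, preset number of system processes) for a call share."""
    level = bisect_right(LEVEL_BOUNDS, call_share) + 1
    return level, PRESET_PROCESSES[level]
```

Keeping the bounds in a sorted list means the table can be changed without touching the lookup logic.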
  • Further, the grading information includes an area level and a preset number of system processes corresponding to the area level. After the step of obtaining the classification information corresponding to each of the service areas according to the correspondence between the proportion of service calls and the service area and the correspondence between the proportion of service calls and the classification information, the method includes:
  • The service processes in a service area need to be deployed according to the preset number of system processes in its classification information.
  • A system process is a process in the operating system together with the memory blocks allocated to it, and is the unit of system resource allocation and scheduling.
  • The monitoring service in the first host exchanges information with the monitoring services in the second hosts to obtain the current number of system processes in each service area. It then compares each current number of system processes with the preset number of system processes in the corresponding current grading information and determines which is larger.
  • If the current number of system processes is greater than the preset number, the number of system processes of the hosts in the service area is reduced to a specified state.
  • The specified state is that the current number of system processes in the service area equals the preset number of system processes. For example, if the current area level of service area A is level 3, the corresponding preset number of system processes is 50, and the current number of system processes in service area A is 60, the system processes in service area A need to be reduced by shutting down hosts on which the corresponding system services are deployed until the current number of system processes in service area A is 50, matching the preset number and saving resources.
  • If the current number of system processes is less than the preset number, the monitoring service calculates the difference between the current number and the preset number of system processes, then downloads that number of program container images from the Docker central repository to the hosts of the service area.
  • The program container image includes the running program and running environment of the service process. The developer stores it in the Docker central repository in advance so that it can be downloaded and started directly when needed.
  • The monitoring service issues a startup instruction so that the hosts in the service area run each program container image, completing the increase in the number of system processes in the service area.
  • The step of increasing the number of system processes of the host in the service area to the specified state includes:
  • S1001 Calculate the difference between the current number of system processes and the preset number of system processes;
  • S1002 Download program container images corresponding to the number of differences to the host in the service area, where the program container images include the running program and running environment of the system process;
  • The monitoring service in the first host calculates the difference between the current number of system processes and the preset number of system processes, and uses the difference as the number of system processes that need to be added in the service area. It then downloads that number of program container images from the Docker central repository to the hosts in the service area for installation.
  • The program container image includes the running program and running environment of the system process. The developer stores it in the Docker central repository in advance so that it can be downloaded and started directly when needed.
  • The monitoring service sends a startup instruction to the hosts of the service area so that they run each program container image, completing the increase in the number of system processes in the service area.
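The comparison between the current and preset numbers of system processes amounts to a small reconciliation step. The sketch below only computes the decision; the function name and the action labels are hypothetical, and downloading images or shutting hosts down is left to whatever mechanism the monitoring service actually uses.

```python
def plan_process_change(current, preset):
    """Return (action, count) that brings the current number to the preset number."""
    diff = preset - current
    if diff > 0:
        # Too few processes: download and run `diff` program container images.
        return ("start", diff)
    if diff < 0:
        # Too many processes: shut down `-diff` of them to save resources.
        return ("stop", -diff)
    return ("none", 0)
```

For the level-3 example in the text, a service area running 60 system processes against a preset number of 50 yields `("stop", 10)`.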
  • The step of using the standby host to run the system service of the failed host includes:
  • S301 Obtain service process information of the failed host, where the service process information includes the service type of the system service and the service progress before the failure;
  • S302 Deploy the system service process of the standby host according to the service process information;
  • S303 Start the standby host after the deployment of the system service process is completed, and run the system service.
  • After screening out the standby host, the first host obtains the service process information of the failed host's system service before the failure from the preset signals broadcast among the hosts.
  • The service process information includes the service type of the system service that each host is responsible for and the service progress of the system service before the failure.
  • For example, the service type of the system service that host A is responsible for is premium calculation, that of host B is image processing, and that of host C is claims settlement.
  • According to the service type in the service process information, the first host first downloads the program container image corresponding to that service type to the standby host for installation. After the installation is completed, it adjusts the system service in the standby host to the recorded service progress, thereby completing the deployment of the system service process in the standby host.
  • The program container image corresponding to the deployed system service is stored in the Docker central repository and can be downloaded and started directly when needed.
  • The program container image contains not only the program but also its operating environment. Finally, after the deployment is completed, the standby host is started to run the system service.
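Steps S301 to S303 can be outlined as a short orchestration routine. Everything named here is an assumption made for illustration: the `StandbyHost` protocol stands in for whatever remote-control interface the first host has, and the image names in `image_for_type` are hypothetical. No real Docker API is called.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class ServiceProcessInfo:
    service_type: str   # for example "premium-calculation"
    progress: str       # service progress recorded before the failure


class StandbyHost(Protocol):
    """Operations the first host is assumed to be able to trigger remotely."""
    def pull_image(self, image: str) -> None: ...
    def restore_progress(self, progress: str) -> None: ...
    def start_service(self) -> None: ...


def fail_over(info: ServiceProcessInfo, standby: StandbyHost, image_for_type: dict) -> None:
    image = image_for_type[info.service_type]   # S301: the service type selects the image
    standby.pull_image(image)                   # S302: install the program container image
    standby.restore_progress(info.progress)     # S302: resume from the pre-failure progress
    standby.start_service()                     # S303: run the system service
```

Passing the standby host in as an interface keeps the ordering of the three steps testable with a stand-in object before any real deployment mechanism is wired up.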
  • This embodiment provides a method for improving the availability of a cluster system.
  • Hosts in the service areas of different regions check one another for failed hosts according to a first preset frequency, and after a failed host is found, they broadcast their operating information to one another so as to screen out a standby host that is currently in a callable state.
  • A normally running host, chosen at random, then issues a command that makes the standby host continue running the system services of the failed host. This achieves high availability for a cluster system deployed across service areas in different regions and prevents the system from becoming unable to run after a large-scale failure.
  • An embodiment of the present application also provides an apparatus for improving the availability of a cluster system, which is applied to any host in the cluster system. The cluster system includes multiple service areas, and the service areas are distributed in different regions. The host currently executing the method is the first host, and the device includes:
  • the monitoring module 1 is used to monitor whether each second host is currently malfunctioning, where the second host is a host other than the first host;
  • the screening module 2 is used to mark the second host that is currently in a failed state as the failed host, and to screen a host that is currently in a callable state from each designated service area as the standby host, where the designated service areas are the service areas other than the one to which the failed host belongs, and a host in the callable state is a host whose service call proportion and running load meet the first preset requirement;
  • the running module 3 is configured to use the standby host to run the system service of the failed host.
  • The monitoring module 1 includes:
  • a monitoring unit, configured to receive a preset signal sent by each of the second hosts according to a first preset frequency, so as to monitor the working status of each of the second hosts;
  • a first determining unit, configured to determine that the second host is currently operating normally;
  • a judging unit, configured to mark a host that has not sent the preset signal as a designated host, and determine whether the preset signal sent by the designated host is received within a preset time period after the current time;
  • a second determining unit, configured to determine that the designated host is operating normally;
  • a third determining unit, configured to determine that the designated host has failed.
  • the screening module 2 includes:
  • the first obtaining unit is configured to obtain current classification information of each of the designated service areas, where the classification information is correspondingly set according to the proportion of service calls in each service area;
  • the first screening unit is configured to screen service areas currently in a callable state according to the classification information, where the service areas in the callable state are service areas for which the classification information meets the second preset requirement;
  • the second acquiring unit is configured to acquire operating information of each host in the service area currently in a callable state, where the operating information includes the proportion of the host's service calls and the operating load;
  • the second screening unit is configured to screen the host whose operating information meets the third preset requirement from the service area currently in the callable state as the standby host.
  • the device further includes:
  • the first obtaining module 4 is configured to obtain, at a second preset frequency, a first service call count within the time period corresponding to the second preset frequency and the second service call count corresponding to each of the second hosts, where the first service call count is the number of calls to the system services of the first host, and the second service call count is the number of calls to the system services of the second host;
  • the first calculation module 5 is configured to calculate, from the first service call count and each of the second service call counts, the total service call count and the service call sub-count of each of the service areas;
  • the second calculation module 6 is configured to calculate the proportion of service calls corresponding to each of the service areas from each service call sub-count and the total service call count;
  • the first matching module 7 is configured to input each proportion of service calls into a pre-built classification information library and match it to obtain the corresponding classification information, where the classification information library includes a mapping table between proportions of service calls and classification information;
  • the second matching module 8 is configured to obtain the classification information corresponding to each of the service areas according to the correspondence between the proportion of service calls and the service area, and the correspondence between the proportion of service calls and the classification information.
  • the classification information includes an area level and a preset number of system processes corresponding to the area level, and the device further includes:
  • the second obtaining module 9 is configured to obtain the current number of system processes in the service area;
  • the judging module 10 is configured to compare the current number of system processes with the preset number of system processes corresponding to the classification information of the service area, and determine which of the two is larger;
  • the reduction module 11 is configured to reduce the number of system processes of the hosts in the service area to a specified state when the current number exceeds the preset number, where the specified state is that the current number of system processes in the service area is equal to the corresponding preset number of system processes;
  • the adding module 12 is configured to increase the number of system processes of the hosts in the service area to the specified state when the current number is below the preset number.
  • the adding module 12 includes:
  • a calculation unit configured to calculate the difference between the current number of system processes and the preset number of system processes;
  • a downloading unit configured to download a number of program container images corresponding to the difference to the hosts in the service area, where a program container image includes the running program and running environment of the system process;
  • the running unit is configured to run each of the program container images on the hosts in the service area to complete the deployment of the system processes.
  • the operation module 3 includes:
  • an obtaining unit configured to obtain service process information of the failed host, where the service process information includes the service type of the system service and the service progress before the failure;
  • a deployment unit configured to deploy the system service process of the standby host according to the service process information;
  • the starting unit is used to start the standby host after the deployment of the system service process is completed, and run the system service.
  • This embodiment provides a method for improving the availability of a cluster system.
  • Hosts in service areas located in different regions check one another for failed hosts at a first preset frequency and, once a failed host is found, broadcast their operating information to one another so that a standby host currently in a callable state can be screened out.
  • A randomly chosen, normally running host then issues an instruction that makes the standby host continue running the system services of the failed host. This provides high availability for a cluster system deployed across service areas in different regions and keeps the system from becoming unable to run after a large-scale failure.
  • an embodiment of the present application also provides a computer device.
  • the computer device may be a server, and its internal structure may be as shown in FIG. 3.
  • the computer device includes a processor, a memory, a network interface and a database connected through a system bus. The processor of the computer device provides computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, computer readable instructions, and a database.
  • the internal memory provides an environment for the operation of the operating system and computer-readable instructions in the non-volatile storage medium.
  • the database of the computer equipment is used to store data such as program container images.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • FIG. 3 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied.
  • An embodiment of the present application further provides a computer-readable storage medium.
  • the readable storage medium may be a non-volatile or a volatile readable storage medium, on which computer-readable instructions are stored.
  • when the computer-readable instructions are executed by a processor, the processes of the above method embodiments are executed.
  • the above are only preferred embodiments of this application and do not limit its patent scope. Any equivalent structure or equivalent process transformation made using the content of the description and drawings of this application, or any direct or indirect application in other related technical fields, is likewise included in the scope of patent protection of this application.

Abstract

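This application provides a method, device, computer device and readable storage medium for improving the availability of a cluster system, relating to the technical field of distributed deployment. In the method, hosts in service areas located in different regions check one another for failed hosts at a first preset frequency. Once a failed host is found, the hosts broadcast their operating information to one another so that a standby host currently in a callable state can be screened out. A randomly chosen, normally running host then issues an instruction that makes the standby host continue running the system services of the failed host. This provides high availability for a cluster system deployed across service areas in different regions and keeps the system from becoming unable to run after a large-scale failure.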

Description

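Method, device and computer device for improving the availability of a cluster system
This application claims priority to Chinese patent application No. 201910305188.3, filed with the Chinese Patent Office on April 16, 2019 and entitled "Method, device and computer device for improving the availability of a cluster system", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the technical field of distributed deployment, and in particular to a method, device and computer device for improving the availability of a cluster system.
Background
With the rapid development of computer technology and the Internet, cluster systems have become a focus of the computer industry thanks to their low cost, strong computing power and robust fault tolerance. A cluster application can run on thousands of ordinary servers, and the cluster can be scaled out dynamically as the business grows, but it must also tolerate the relatively high failure rate of ordinary computers. This requires the system to remain highly available when software or hardware faults occur. At present, when a fault occurs, system services can only be transferred to local hosts, and hosts in other service areas are not considered. As a result, cluster systems based on Docker containers have low availability and cannot cope with large-scale system failures.
Technical Problem
The main purpose of this application is to provide a method, device and computer device for improving the availability of a cluster system, aiming to overcome the drawback that existing Docker-container-based cluster systems have low availability and cannot cope with large-scale system failures.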
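Technical Solution
To achieve the above purpose, this application provides a method for improving the availability of a cluster system, applied to any host in the cluster system, where the cluster system includes a plurality of service areas distributed in different regions and the host currently executing the method is a first host. The method includes:
monitoring whether each second host currently has a failure, where the second hosts are the hosts other than the first host;
if a failure occurs, marking the second host currently in a failed state as a failed host, and screening, from the designated service areas, a host currently in a callable state as a standby host, where the designated service areas are the service areas other than the one to which the failed host belongs, and a host in the callable state is a host whose service call proportion and operating load meet a first preset requirement;
running the system services of the failed host on the standby host.
This application also provides a device for improving the availability of a cluster system, applied to any host in the cluster system, where the cluster system includes a plurality of service areas distributed in different regions and the host currently executing the method is a first host. The device includes:
a monitoring module configured to monitor whether each second host currently has a failure, where the second hosts are the hosts other than the first host;
a screening module configured to mark the second host currently in a failed state as a failed host and to screen, from the designated service areas, a host currently in a callable state as a standby host, where the designated service areas are the service areas other than the one to which the failed host belongs, and a host in the callable state is a host whose service call proportion and operating load meet a first preset requirement;
an operation module configured to run the system services of the failed host on the standby host.
This application also provides a computer device including a memory and a processor, where the memory stores a computer program and the processor implements the steps of any of the above methods when executing the computer program.
This application also provides a computer-readable storage medium on which a computer program is stored, where the computer program implements the steps of any of the above methods when executed by a processor.
Beneficial Effects
In the method, device and computer device for improving the availability of a cluster system provided by this application, hosts in service areas located in different regions check one another for failed hosts at a first preset frequency. Once a failed host is found, the hosts broadcast their operating information to one another so that a standby host currently in a callable state can be screened out. A randomly chosen, normally running host then issues an instruction that makes the standby host continue running the system services of the failed host. This provides high availability for a cluster system deployed across service areas in different regions and keeps the system from becoming unable to run after a large-scale failure.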
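Description of the Drawings
FIG. 1 is a schematic diagram of the steps of a method for improving the availability of a cluster system in an embodiment of this application;
FIG. 2 is a block diagram of the overall structure of a device for improving the availability of a cluster system in an embodiment of this application;
FIG. 3 is a schematic structural block diagram of a computer device in an embodiment of this application.
The realization of the purpose, the functional characteristics and the advantages of this application are further described below with reference to the embodiments and the drawings.
Best Mode for Carrying Out the Invention
To make the purpose, technical solutions and advantages of this application clearer, the application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain this application and do not limit it.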
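Referring to FIG. 1, an embodiment of this application provides a method for improving the availability of a cluster system, applied to any host in the cluster system, where the cluster system includes a plurality of service areas distributed in different regions and the host currently executing the method is a first host. The method includes:
S1: monitoring whether each second host currently has a failure, where the second hosts are the hosts other than the first host;
S2: if a failure occurs, marking the second host currently in a failed state as a failed host, and screening, from the designated service areas, a host currently in a callable state as a standby host, where the designated service areas are the service areas other than the one to which the failed host belongs, and a host in the callable state is a host whose service call proportion and operating load meet a first preset requirement;
S3: running the system services of the failed host on the standby host.
In this embodiment, developers deploy multiple service areas in different regions around the world. Each service area corresponds to the data center cluster of a city and consists of multiple availability zones, each of which contains multiple hosts. On every host the developers deploy a monitoring service that monitors and manages the system services and running processes of that host. At the first preset frequency, the hosts send preset signals to the other hosts and receive the preset signals sent by them, thereby confirming and exchanging one another's working states. The working state includes working information such as whether the host is currently running normally and the number of calls to each system service on the host. A system service is a program, routine or process that performs a specified system function in order to support other programs, especially low-level (close to hardware) programs. A system process is one of the processes in the operating system together with the memory blocks allocated to it, and is the unit by which the system allocates and schedules resources.
The hosts in the service areas judge, according to a preset rule, whether any of the other hosts has currently failed. The preset rule is as follows: the hosts exchange preset signals at the first preset frequency; if a system service on a host fails, its external service port can no longer be connected and the host cannot send the preset signal to external hosts. Therefore, when a normally running host (the first host) monitors the working state of the other hosts (the second hosts) and does not receive the preset signal of some host, it marks that host as a designated host and starts monitoring whether the designated host can send the preset signal within a preset time period after the current time, that is, whether the first host can receive the preset signal sent by the designated host. If the first host receives the preset signal from the designated host within that period, it determines that the designated host has not failed; if it does not, it determines that the designated host has failed.
When the first host determines that a failed host has appeared among the second hosts, it first obtains the current classification information of the service areas other than the one to which the failed host belongs, that is, the designated service areas. The classification information is level information assigned to each service area according to the call proportion of its service processes, and is updated by the monitoring service on each host at a second preset frequency. After the monitoring service has graded the service areas, a service area whose classification information is level 4 is of high importance, so the developers specify that it cannot supply hosts to take over for a failed host. The first host therefore excludes the level-4 service areas and selects the remaining designated service areas as the service areas currently in a callable state. The first host then obtains, through the preset signals, the operating information of each host in those service areas; the operating information includes the host's service call information and load information. The first host compares the operating information with a preset screening condition, namely the first preset requirement, and thereby screens out, from the hosts in the service areas currently in the callable state, a host currently in the callable state as the standby host. The first host also marks the host currently in the failed state as the failed host.
After the standby host has been screened out, the first host obtains, from the preset signals the hosts exchange, the service process information of the failed host's system services before the failure. The service process information includes the service type of the system services each host is responsible for and the service progress of those services before the failure. According to this information, the first host deploys the system processes corresponding to the system services on the standby host, for example by instructing the standby host to download the program container images of those system processes so that the processes can run. Once the standby host has finished deploying the system processes, the first host starts the standby host to run the system services.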
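Further, the step of monitoring whether each second host currently has a failure includes:
S101: receiving, at a first preset frequency, the preset signal sent by each second host, so as to monitor the working state of each second host;
S102: if the preset signal sent by a second host is received, determining that the second host is currently operating normally;
S103: if the preset signal sent by a second host is not received, marking the host that has not sent the preset signal as a designated host, and determining whether the preset signal sent by the designated host is received within a preset time period after the current time;
S104: if the preset signal sent by the designated host is received, determining that the designated host is operating normally;
S105: if the preset signal sent by the designated host is not received, determining that the designated host has failed.
In this embodiment, a monitoring service is installed on every host in each service area. It records the working state of the host in real time, including information such as the host's current load and the number of calls to each service process on the host, broadcasts that state to the other hosts in a preset signal at the first preset frequency (for example once every 5 seconds), and at the same time receives the preset signals broadcast by the other hosts. A preset signal carries identification information of the host that sent it, such as a number, so that other hosts can confirm which host the signal came from. The first host can identify designated hosts according to whether a preset signal has been received; a designated host is a host that has currently not sent its preset signal. After identifying a designated host, the first host monitors its working state for a preset time period after the current time, based on the preset signals sent at the first preset frequency. The preset time period is set by the developers, who may set different periods according to the importance of the hosts in different service areas. A mapping table between preset time periods and hosts is stored in the database of each host, and the first host can look up the preset time period of each host in this table. If the first host still cannot receive the preset signal of the designated host within the preset time period after the current time, for example within the following 5 minutes, it determines that the designated host has failed. If it does receive the preset signal within that period, it determines that the designated host has not failed.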
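The two-stage check in S101 to S105 can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name `HeartbeatMonitor`, its fields and the status strings are hypothetical, the 5-second interval and 5-minute grace period are the example values from the text, and time is passed in explicitly so the logic can be followed without real clocks or networking.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class HeartbeatMonitor:
    """Hypothetical sketch of the first host's view of the second hosts."""
    interval: float = 5.0          # first preset frequency: one signal every 5 s
    default_grace: float = 300.0   # preset time period: 5 minutes
    # Per-host grace periods, standing in for the host-to-period mapping table.
    grace_by_host: Dict[str, float] = field(default_factory=dict)
    last_seen: Dict[str, float] = field(default_factory=dict)
    suspected_at: Dict[str, float] = field(default_factory=dict)

    def record_signal(self, host_id: str, now: float) -> None:
        # A received preset signal clears any earlier suspicion (S102 / S104).
        self.last_seen[host_id] = now
        self.suspected_at.pop(host_id, None)

    def status(self, host_id: str, now: float) -> str:
        last = self.last_seen.get(host_id)
        if last is not None and now - last <= self.interval:
            return "normal"
        # Signal missed: mark as designated host and start the grace period (S103).
        started = self.suspected_at.setdefault(host_id, now)
        grace = self.grace_by_host.get(host_id, self.default_grace)
        if now - started >= grace:
            return "failed"        # still silent after the preset period (S105)
        return "designated"
```

A host that resumes sending signals during the grace period goes back to `"normal"` on the next call to `record_signal`, which matches the "has not failed" branch of the description.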
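Further, the step of screening, from the designated service areas, a host currently in a callable state as a standby host includes:
S201: obtaining the current classification information of each designated service area, where the classification information is level information set according to the service call proportion of each service area;
S202: screening, according to the classification information, the service areas currently in a callable state, where a service area in the callable state is one whose classification information meets a second preset requirement;
S203: obtaining the operating information of each host in the service areas currently in the callable state, where the operating information includes the host's service call proportion and operating load;
S204: screening, from the service areas currently in the callable state, a host whose operating information meets a third preset requirement as the standby host.
In this embodiment, when the first host determines that a failed host has appeared among the second hosts, it first obtains the current classification information of the service areas other than the one to which the failed host belongs, that is, the designated service areas. The classification information is level information set by the developers for each service area according to the call proportion of its system services; the monitoring service on each host updates it at the second preset frequency and broadcasts it to the other hosts after each update. Every host, including the first host, can therefore look up the current classification information of each designated service area directly. After the monitoring service has graded the service areas, a service area whose classification information is level 4 has, under the developers' settings, a service call proportion of 70% or more. Such an area is of high importance and is unlikely to have spare hosts for running the system services of other service areas, so the developers specify that hosts in level-4 service areas cannot take over for a failed host. The first host excludes the level-4 service areas and selects the remaining designated service areas as the service areas currently in a callable state. It then obtains, through the preset signals, the operating information of each host in those service areas, which includes the host's service call proportion and operating load. The first host compares the operating information with a preset screening condition and thereby screens out, from the hosts in the callable service areas, a host currently in the callable state as the standby host. The screening condition is set in advance by the developers. For example, the condition may be that any host with a service call proportion below 1% and an operating load below 10% can serve as a callable host. If host A currently has a service call proportion of 0.1% and an operating load of 5%, its operating information satisfies the condition and it can serve as the standby host.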
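The screening in S201 to S204 might look like the sketch below. The `Host` record and the function name `select_standby_hosts` are hypothetical; the blocked level (4) and the thresholds (service call proportion below 1%, operating load below 10%) are the example values given in the text, expressed here as fractions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Host:
    name: str
    area: str
    call_share: float  # service call proportion, as a fraction (0.01 == 1%)
    load: float        # operating load, as a fraction (0.10 == 10%)


def select_standby_hosts(
    hosts: List[Host],
    area_levels: Dict[str, int],
    failed_area: str,
    blocked_level: int = 4,
    max_call_share: float = 0.01,
    max_load: float = 0.10,
) -> List[Host]:
    # S201/S202: keep designated areas (not the failed host's own area)
    # whose level is not the blocked one.
    callable_areas = {
        area for area, level in area_levels.items()
        if area != failed_area and level != blocked_level
    }
    # S203/S204: within those areas, keep hosts under both thresholds.
    return [
        h for h in hosts
        if h.area in callable_areas
        and h.call_share < max_call_share
        and h.load < max_load
    ]
```

With the host from the example (service call proportion 0.1%, load 5%) placed in a level-2 area, the function returns that host as a candidate; hosts in a level-4 area or in the failed host's own area are never returned.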
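Further, before the step of obtaining the current classification information of each designated service area, the method includes:
S4: obtaining, at a second preset frequency, a first service call count within the time period corresponding to the second preset frequency and the second service call count corresponding to each second host, where the first service call count is the number of calls to the system services of the first host, and a second service call count is the number of calls to the system services of a second host;
S5: calculating, from the first service call count and the second service call counts, the total service call count and the service call sub-count of each service area;
S6: calculating, from each service call sub-count and the total service call count, the service call proportion of each service area;
S7: inputting each service call proportion into a pre-built classification information library and matching it to obtain the corresponding classification information, where the classification information library includes a mapping table between service call proportions and classification information;
S8: obtaining the classification information of each service area according to the correspondence between service call proportions and service areas and the correspondence between service call proportions and classification information.
In this embodiment, a monitoring service is installed on every host and records the host's working state in real time, including the number of calls to the system services on that host. At the second preset frequency, the monitoring service sends the first service call count of the first host to each of the other hosts and receives the second service call counts sent by the second hosts. The first service call count is the number of calls to the system services of the first host within the time period corresponding to the second preset frequency, and a second service call count is the number of calls to the system services of a second host within that period. For example, if the second preset frequency is once per hour and the first service call count was last obtained at 10 o'clock, the first service call count obtained this time is the number of service calls on the first host between 10 and 11 o'clock. From the first service call count and the second service call counts, the monitoring service on the first host first calculates the total service call count of the service processes of all hosts, and also sums the service calls of all hosts in each service area to obtain the service call sub-count of that area. The service call proportion of each service area is then the ratio of its service call sub-count to the total service call count. For example, service area A contains three hosts A, B and C with 5, 8 and 7 service calls respectively, so the service call sub-count of service area A is 20. If the total service call count calculated from the first and second service call counts is 200, the ratio of the sub-count to the total is 0.1, and the service call proportion of service area A is 10%. The first host inputs the service call proportion of each service area into the pre-built classification information library and, using the library's mapping table between service call proportions and classification information, matches each proportion to its classification information. The classification information includes the area level of the service area and the preset number of system processes corresponding to that level. Finally, from the correspondence between service call proportions and service areas and the correspondence between service call proportions and classification information, the first host obtains the classification information of each service area.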
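The arithmetic in S4 to S8 can be illustrated with a short sketch. The function names and the `LEVEL_TABLE` thresholds are assumptions made for illustration: the text only states that level 4 corresponds to a service call proportion of 70% or more, so the lower boundaries (30%, 10%) are invented placeholders for the mapping table in the classification information library.

```python
from collections import defaultdict
from typing import Dict

# (lower bound of service call proportion, area level), highest first.
# Only the 70% -> level 4 row comes from the text; the rest are placeholders.
LEVEL_TABLE = [(0.70, 4), (0.30, 3), (0.10, 2), (0.0, 1)]


def area_call_shares(
    calls_by_host: Dict[str, int], area_of_host: Dict[str, str]
) -> Dict[str, float]:
    """S5/S6: per-area sub-counts divided by the total call count."""
    total = sum(calls_by_host.values())
    sub_counts: Dict[str, int] = defaultdict(int)
    for host, calls in calls_by_host.items():
        sub_counts[area_of_host[host]] += calls
    if total == 0:
        return {area: 0.0 for area in sub_counts}
    return {area: count / total for area, count in sub_counts.items()}


def level_for(share: float) -> int:
    """S7: look the proportion up in the (assumed) mapping table."""
    for lower_bound, level in LEVEL_TABLE:
        if share >= lower_bound:
            return level
    return 1
```

Using the numbers from the example, hosts A, B and C contribute 5 + 8 + 7 = 20 calls out of a total of 200, so service area A gets a proportion of 0.1; which level that maps to depends entirely on the real mapping table.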
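Further, the classification information includes an area level and a preset number of system processes corresponding to the area level, and after the step of obtaining the classification information of each service area according to the correspondence between service call proportions and service areas and the correspondence between service call proportions and classification information, the method includes:
S9: obtaining the current number of system processes in the service area;
S10: comparing the current number of system processes with the preset number of system processes corresponding to the classification information of the service area, and determining which is larger;
S11: if the current number of system processes is greater than the preset number of system processes, reducing the number of system processes on the hosts in the service area to a specified state, where the specified state is that the current number of system processes in the service area equals the corresponding preset number of system processes;
S12: if the current number of system processes is less than the preset number of system processes, increasing the number of system processes on the hosts in the service area to the specified state.
In this embodiment, after the monitoring service on the first host has matched the classification information of each service area, it deploys the service processes of each service area according to the preset number of system processes in that classification information. A system process is one of the processes in the operating system together with the memory blocks allocated to it, and is the unit by which the system allocates and schedules resources. The monitoring service on the first host exchanges information with the monitoring services on the second hosts to obtain the current number of system processes in each service area. It then compares each current number with the preset number of system processes corresponding to the current classification information and determines which is larger. If the current number is greater than the preset number, the number of system processes on the hosts in the service area is reduced to the specified state, in which the current number equals the preset number. For example, if service area A currently has area level 3 with a preset number of 50 system processes and its current number of system processes is 60, the system processes in service area A must be reduced by shutting down hosts on which the corresponding system services are deployed until the current number is 50, matching the preset number and saving resources. If the current number is less than the preset number, the monitoring service calculates the difference between the two and downloads that many program container images from the Docker central repository to the hosts in the service area. A program container image contains the running program and running environment of a service process and is stored in the Docker central repository by the developers in advance, so that it can be downloaded and started directly when needed. The monitoring service issues a start instruction so that the hosts in the service area run the program container images, which completes the increase in the number of system processes in the service area.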
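The comparison in S9 to S12 reduces to a small decision function. The sketch below assumes the increase branch applies when the current number is below the preset number, which is the reading consistent with the 60-versus-50 example; the name `plan_scaling` and the returned action labels are hypothetical.

```python
from typing import Tuple


def plan_scaling(current: int, preset: int) -> Tuple[str, int]:
    """Return the action needed to reach the specified state (current == preset)."""
    diff = preset - current
    if diff > 0:
        return ("increase", diff)   # too few system processes: add `diff`
    if diff < 0:
        return ("reduce", -diff)    # too many system processes: remove `-diff`
    return ("none", 0)              # already in the specified state
```

For the example in the text, `plan_scaling(60, 50)` yields a reduction of 10 system processes for service area A.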
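Further, the step of increasing the number of system processes on the hosts in the service area to the specified state includes:
S1001: calculating the difference between the current number of system processes and the preset number of system processes;
S1002: downloading a number of program container images corresponding to the difference to the hosts in the service area, where a program container image includes the running program and running environment of a system process;
S1003: running each program container image on the hosts in the service area to complete the deployment of the system processes.
In this embodiment, the monitoring service on the first host calculates the difference between the current number of system processes and the preset number of system processes and takes it as the number of system processes to be added to the service area. It then downloads that many program container images from the Docker central repository to the hosts in the service area for installation. A program container image contains the running program and running environment of a system process and is stored in the Docker central repository by the developers in advance, so that it can be downloaded and started directly when needed. The monitoring service sends a start instruction to the corresponding hosts in the service area so that they run the program container images, which completes the increase in the number of system processes in the service area.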
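As a rough illustration of S1001 to S1003 on a single host, the sketch below only builds the Docker CLI commands and does not execute them. The function name and the image reference passed to it are hypothetical. It also makes one simplifying assumption that differs from the wording of the text: the image is pulled once and then started as many times as the difference requires, rather than being downloaded once per process.

```python
from typing import List


def build_scale_up_commands(image: str, current: int, preset: int) -> List[List[str]]:
    """Build `docker pull` / `docker run -d` commands for the missing processes."""
    diff = preset - current            # S1001: number of processes to add
    if diff <= 0:
        return []                      # nothing to add
    commands = [["docker", "pull", image]]                             # S1002
    commands += [["docker", "run", "-d", image] for _ in range(diff)]  # S1003
    return commands
```

Each returned command is an argument list of the kind accepted by `subprocess.run`; how the commands are distributed across the hosts of a service area is left open here, as the text does not specify it.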
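Further, the step of running the system services of the failed host on the standby host includes:
S301: obtaining service process information of the failed host, where the service process information includes the service type of the system service and the service progress before the failure;
S302: deploying the system service process of the standby host according to the service process information;
S303: starting the standby host after deployment of the system service process is complete, and running the system service.
In this embodiment, after screening out the standby host, the first host obtains the service process information of the failed host's system services before the failure from the preset signals broadcast between it and the second hosts. The service process information includes the service type of the system services each host is responsible for and the service progress of those services before the failure; for example, host A may be responsible for premium calculation, host B for image processing and host C for claim settlement. Based on the service type in the service process information, the first host first downloads the corresponding program container image to the standby host for installation, and then instructs the standby host to bring the installed system service to the recorded service progress, which completes the deployment of the system service process on the standby host. The program container images of the deployed system services are stored in the Docker central repository and can be downloaded and started directly when needed. A program container image contains not only the program but also its running environment. Finally, the standby host is started to run the deployed system services.
In the method for improving the availability of a cluster system provided by this embodiment, hosts in service areas located in different regions check one another for failed hosts at a first preset frequency. Once a failed host is found, the hosts broadcast their operating information to one another so that a standby host currently in a callable state can be screened out. A randomly chosen, normally running host then issues an instruction that makes the standby host continue running the system services of the failed host. This provides high availability for a cluster system deployed across service areas in different regions and keeps the system from becoming unable to run after a large-scale failure.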
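The order of operations in S301 to S303 can be sketched as a plan of steps. Everything named here is an assumption for illustration: `ServiceProcessInfo`, the `IMAGE_BY_SERVICE_TYPE` mapping, the image names and the step labels are hypothetical, and the service progress is modeled as a plain integer because the text does not say how progress is represented.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ServiceProcessInfo:
    service_type: str  # e.g. premium calculation, image processing
    progress: int      # service progress recorded before the failure


# Hypothetical mapping from service type to program container image.
IMAGE_BY_SERVICE_TYPE = {
    "premium_calculation": "example/premium-calc:latest",
    "image_processing": "example/image-proc:latest",
    "claim_settlement": "example/claim-settle:latest",
}


def build_failover_plan(
    info: ServiceProcessInfo, standby_host: str
) -> List[Tuple[str, str, object]]:
    image = IMAGE_BY_SERVICE_TYPE[info.service_type]
    return [
        ("download_image", standby_host, image),             # S302: install by service type
        ("restore_progress", standby_host, info.progress),   # S302: resume at recorded progress
        ("start_service", standby_host, info.service_type),  # S303: start and run
    ]
```

Keeping the plan as data makes the ordering explicit: the image is installed first, the service is then brought to the recorded progress, and only afterwards is it started on the standby host.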
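Referring to FIG. 2, an embodiment of this application further provides a device for improving the availability of a cluster system, applied to any host in the cluster system, where the cluster system includes a plurality of service areas distributed in different regions and the host currently executing the method is a first host. The device includes:
a monitoring module 1 configured to monitor whether each second host currently has a failure, where the second hosts are the hosts other than the first host;
a screening module 2 configured to mark the second host currently in a failed state as a failed host and to screen, from the designated service areas, a host currently in a callable state as a standby host, where the designated service areas are the service areas other than the one to which the failed host belongs, and a host in the callable state is a host whose service call proportion and operating load meet a first preset requirement;
an operation module 3 configured to run the system services of the failed host on the standby host.
Further, the monitoring module 1 includes:
a monitoring unit configured to receive, at a first preset frequency, the preset signal sent by each second host, so as to monitor the working state of each second host;
a first determining unit configured to determine that the second host is currently operating normally;
a judging unit configured to mark a host that has not sent the preset signal as a designated host, and to determine whether the preset signal sent by the designated host is received within a preset time period after the current time;
a second determining unit configured to determine that the designated host is operating normally;
a third determining unit configured to determine that the designated host has failed.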
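Further, the screening module 2 includes:
a first obtaining unit configured to obtain the current classification information of each designated service area, where the classification information is level information set according to the service call proportion of each service area;
a first screening unit configured to screen, according to the classification information, the service areas currently in a callable state, where a service area in the callable state is one whose classification information meets a second preset requirement;
a second obtaining unit configured to obtain the operating information of each host in the service areas currently in the callable state, where the operating information includes the host's service call proportion and operating load;
a second screening unit configured to screen, from the service areas currently in the callable state, a host whose operating information meets a third preset requirement as the standby host.
Further, the device also includes:
a first obtaining module 4 configured to obtain, at a second preset frequency, a first service call count within the time period corresponding to the second preset frequency and the second service call count corresponding to each second host, where the first service call count is the number of calls to the system services of the first host, and a second service call count is the number of calls to the system services of a second host;
a first calculation module 5 configured to calculate, from the first service call count and the second service call counts, the total service call count and the service call sub-count of each service area;
a second calculation module 6 configured to calculate, from each service call sub-count and the total service call count, the service call proportion of each service area;
a first matching module 7 configured to input each service call proportion into a pre-built classification information library and match it to obtain the corresponding classification information, where the classification information library includes a mapping table between service call proportions and classification information;
a second matching module 8 configured to obtain the classification information of each service area according to the correspondence between service call proportions and service areas and the correspondence between service call proportions and classification information.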
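Further, the classification information includes an area level and a preset number of system processes corresponding to the area level, and the device also includes:
a second obtaining module 9 configured to obtain the current number of system processes in the service area;
a judging module 10 configured to compare the current number of system processes with the preset number of system processes corresponding to the classification information of the service area, and determine which is larger;
a reduction module 11 configured to reduce the number of system processes on the hosts in the service area to a specified state, where the specified state is that the current number of system processes in the service area equals the corresponding preset number of system processes;
an adding module 12 configured to increase the number of system processes on the hosts in the service area to the specified state.
Further, the adding module 12 includes:
a calculation unit configured to calculate the difference between the current number of system processes and the preset number of system processes;
a downloading unit configured to download a number of program container images corresponding to the difference to the hosts in the service area, where a program container image includes the running program and running environment of a system process;
a running unit configured to run each program container image on the hosts in the service area to complete the deployment of the system processes.
Further, the operation module 3 includes:
an obtaining unit configured to obtain service process information of the failed host, where the service process information includes the service type of the system service and the service progress before the failure;
a deployment unit configured to deploy the system service process of the standby host according to the service process information;
a starting unit configured to start the standby host after deployment of the system service process is complete, and run the system service.
In this embodiment, the modules and units of the device are implemented in the same way as the corresponding method steps above, and are not described in detail here.
In the method for improving the availability of a cluster system provided by this embodiment, hosts in service areas located in different regions check one another for failed hosts at a first preset frequency. Once a failed host is found, the hosts broadcast their operating information to one another so that a standby host currently in a callable state can be screened out. A randomly chosen, normally running host then issues an instruction that makes the standby host continue running the system services of the failed host. This provides high availability for a cluster system deployed across service areas in different regions and keeps the system from becoming unable to run after a large-scale failure.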
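Referring to FIG. 3, an embodiment of this application further provides a computer device. The computer device may be a server, and its internal structure may be as shown in FIG. 3. The computer device includes a processor, a memory, a network interface and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer-readable instructions and a database. The internal memory provides an environment for running the operating system and the computer-readable instructions stored in the non-volatile storage medium. The database of the computer device stores data such as program container images. The network interface of the computer device communicates with external terminals over a network connection. When executed by the processor, the computer-readable instructions carry out the processes of the method embodiments described above. Those skilled in the art will understand that the structure shown in FIG. 3 is only a block diagram of the part of the structure relevant to the solution of this application and does not limit the computer devices to which the solution is applied.
An embodiment of this application further provides a computer-readable storage medium. The readable storage medium may be a non-volatile or a volatile readable storage medium, on which computer-readable instructions are stored; when executed by a processor, the computer-readable instructions carry out the processes of the method embodiments described above. The above are only preferred embodiments of this application and do not limit its patent scope. Any equivalent structure or equivalent process transformation made using the content of the description and drawings of this application, or any direct or indirect application in other related technical fields, is likewise included in the scope of patent protection of this application.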

Claims (20)

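  1. A method for improving the availability of a cluster system, characterized in that it is applied to any host in the cluster system, the cluster system includes a plurality of service areas distributed in different regions, the host currently executing the method is a first host, and the method includes:
    monitoring whether each second host currently has a failure, where the second hosts are the hosts other than the first host;
    if a failure occurs, marking the second host currently in a failed state as a failed host, and screening, from the designated service areas, a host currently in a callable state as a standby host, where the designated service areas are the service areas other than the one to which the failed host belongs, and a host in the callable state is a host whose service call proportion and operating load meet a first preset requirement;
    running the system services of the failed host on the standby host.
  2. The method for improving the availability of a cluster system according to claim 1, characterized in that the step of monitoring whether each second host currently has a failure includes:
    receiving, at a first preset frequency, the preset signal sent by each second host, so as to monitor the working state of each second host;
    if the preset signal sent by a second host is received, determining that the second host is currently operating normally;
    if the preset signal sent by a second host is not received, marking the host that has not sent the preset signal as a designated host, and determining whether the preset signal sent by the designated host is received within a preset time period after the current time;
    if the preset signal sent by the designated host is received, determining that the designated host is operating normally;
    if the preset signal sent by the designated host is not received, determining that the designated host has failed.
  3. The method for improving the availability of a cluster system according to claim 1, characterized in that the step of screening, from the designated service areas, a host currently in a callable state as a standby host includes:
    obtaining the current classification information of each designated service area, where the classification information is level information set according to the service call proportion of each service area;
    screening, according to the classification information, the service areas currently in a callable state, where a service area in the callable state is one whose classification information meets a second preset requirement;
    obtaining the operating information of each host in the service areas currently in the callable state, where the operating information includes the host's service call proportion and operating load;
    screening, from the service areas currently in the callable state, a host whose operating information meets a third preset requirement as the standby host.
  4. The method for improving the availability of a cluster system according to claim 3, characterized in that, before the step of obtaining the current classification information of each designated service area, the method includes:
    obtaining, at a second preset frequency, a first service call count within the time period corresponding to the second preset frequency and the second service call count corresponding to each second host, where the first service call count is the number of calls to the system services of the first host, and a second service call count is the number of calls to the system services of a second host;
    calculating, from the first service call count and the second service call counts, the total service call count and the service call sub-count of each service area;
    calculating, from each service call sub-count and the total service call count, the service call proportion of each service area;
    inputting each service call proportion into a pre-built classification information library and matching it to obtain the corresponding classification information, where the classification information library includes a mapping table between service call proportions and classification information;
    obtaining the classification information of each service area according to the correspondence between service call proportions and service areas and the correspondence between service call proportions and classification information.
  5. The method for improving the availability of a cluster system according to claim 4, characterized in that the classification information includes an area level and a preset number of system processes corresponding to the area level, and after the step of obtaining the classification information of each service area according to the correspondence between service call proportions and service areas and the correspondence between service call proportions and classification information, the method includes:
    obtaining the current number of system processes in the service area;
    comparing the current number of system processes with the preset number of system processes corresponding to the classification information of the service area, and determining which is larger;
    if the current number of system processes is greater than the preset number of system processes, reducing the number of system processes on the hosts in the service area to a specified state, where the specified state is that the current number of system processes in the service area equals the corresponding preset number of system processes;
    if the current number of system processes is less than the preset number of system processes, increasing the number of system processes on the hosts in the service area to the specified state.
  6. The method for improving the availability of a cluster system according to claim 5, characterized in that the step of increasing the number of system processes on the hosts in the service area to the specified state includes:
    calculating the difference between the current number of system processes and the preset number of system processes;
    downloading a number of program container images corresponding to the difference to the hosts in the service area, where a program container image includes the running program and running environment of a system process;
    running each program container image on the hosts in the service area to complete the deployment of the system processes.
  7. The method for improving the availability of a cluster system according to claim 1, characterized in that the step of running the system services of the failed host on the standby host includes:
    obtaining service process information of the failed host, where the service process information includes the service type of the system service and the service progress before the failure;
    deploying the system service process of the standby host according to the service process information;
    starting the standby host after deployment of the system service process is complete, and running the system service.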
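  8. A device for improving the availability of a cluster system, characterized in that it is applied to any host in the cluster system, the cluster system includes a plurality of service areas distributed in different regions, the host currently executing the method is a first host, and the device includes:
    a monitoring module configured to monitor whether each second host currently has a failure, where the second hosts are the hosts other than the first host;
    a screening module configured to mark the second host currently in a failed state as a failed host and to screen, from the designated service areas, a host currently in a callable state as a standby host, where the designated service areas are the service areas other than the one to which the failed host belongs, and a host in the callable state is a host whose service call proportion and operating load meet a first preset requirement;
    an operation module configured to run the system services of the failed host on the standby host.
  9. The device for improving the availability of a cluster system according to claim 8, characterized in that the monitoring module includes:
    a monitoring unit configured to receive, at a first preset frequency, the preset signal sent by each second host, so as to monitor the working state of each second host;
    a first determining unit configured to determine that the second host is currently operating normally;
    a judging unit configured to mark a host that has not sent the preset signal as a designated host, and to determine whether the preset signal sent by the designated host is received within a preset time period after the current time;
    a second determining unit configured to determine that the designated host is operating normally;
    a third determining unit configured to determine that the designated host has failed.
  10. The device for improving the availability of a cluster system according to claim 8, characterized in that the screening module includes:
    a first obtaining unit configured to obtain the current classification information of each designated service area, where the classification information is level information set according to the service call proportion of each service area;
    a first screening unit configured to screen, according to the classification information, the service areas currently in a callable state, where a service area in the callable state is one whose classification information meets a second preset requirement;
    a second obtaining unit configured to obtain the operating information of each host in the service areas currently in the callable state, where the operating information includes the host's service call proportion and operating load;
    a second screening unit configured to screen, from the service areas currently in the callable state, a host whose operating information meets a third preset requirement as the standby host.
  11. The device for improving the availability of a cluster system according to claim 10, characterized in that the device also includes:
    a first obtaining module configured to obtain, at a second preset frequency, a first service call count within the time period corresponding to the second preset frequency and the second service call count corresponding to each second host, where the first service call count is the number of calls to the system services of the first host, and a second service call count is the number of calls to the system services of a second host;
    a first calculation module configured to calculate, from the first service call count and the second service call counts, the total service call count and the service call sub-count of each service area;
    a second calculation module configured to calculate, from each service call sub-count and the total service call count, the service call proportion of each service area;
    a first matching module configured to input each service call proportion into a pre-built classification information library and match it to obtain the corresponding classification information, where the classification information library includes a mapping table between service call proportions and classification information;
    a second matching module configured to obtain the classification information of each service area according to the correspondence between service call proportions and service areas and the correspondence between service call proportions and classification information.
  12. The device for improving the availability of a cluster system according to claim 11, characterized in that the classification information includes an area level and a preset number of system processes corresponding to the area level, and the device also includes:
    a second obtaining module configured to obtain the current number of system processes in the service area;
    a judging module configured to compare the current number of system processes with the preset number of system processes corresponding to the classification information of the service area, and determine which is larger;
    a reduction module configured to reduce the number of system processes on the hosts in the service area to a specified state, where the specified state is that the current number of system processes in the service area equals the corresponding preset number of system processes;
    an adding module configured to increase the number of system processes on the hosts in the service area to the specified state.
  13. The device for improving the availability of a cluster system according to claim 12, characterized in that the adding module includes:
    a calculation unit configured to calculate the difference between the current number of system processes and the preset number of system processes;
    a downloading unit configured to download a number of program container images corresponding to the difference to the hosts in the service area, where a program container image includes the running program and running environment of a system process;
    a running unit configured to run each program container image on the hosts in the service area to complete the deployment of the system processes.
  14. The device for improving the availability of a cluster system according to claim 8, characterized in that the operation module includes:
    an obtaining unit configured to obtain service process information of the failed host, where the service process information includes the service type of the system service and the service progress before the failure;
    a deployment unit configured to deploy the system service process of the standby host according to the service process information;
    a starting unit configured to start the standby host after deployment of the system service process is complete, and run the system service.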
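  15. A computer device, including a memory and a processor, the memory storing computer-readable instructions, characterized in that the processor, when executing the computer-readable instructions, implements a method for improving the availability of a cluster system, the method being applied to any host in the cluster system, the cluster system including a plurality of service areas distributed in different regions, the host currently executing the method being a first host, and the method including:
    monitoring whether each second host currently has a failure, where the second hosts are the hosts other than the first host;
    if a failure occurs, marking the second host currently in a failed state as a failed host, and screening, from the designated service areas, a host currently in a callable state as a standby host, where the designated service areas are the service areas other than the one to which the failed host belongs, and a host in the callable state is a host whose service call proportion and operating load meet a first preset requirement;
    running the system services of the failed host on the standby host.
  16. The computer device according to claim 15, characterized in that the step, executed by the processor, of monitoring whether each second host currently has a failure includes:
    receiving, at a first preset frequency, the preset signal sent by each second host, so as to monitor the working state of each second host;
    if the preset signal sent by a second host is received, determining that the second host is currently operating normally;
    if the preset signal sent by a second host is not received, marking the host that has not sent the preset signal as a designated host, and determining whether the preset signal sent by the designated host is received within a preset time period after the current time;
    if the preset signal sent by the designated host is received, determining that the designated host is operating normally;
    if the preset signal sent by the designated host is not received, determining that the designated host has failed.
  17. The computer device according to claim 15, characterized in that the step, executed by the processor, of screening, from the designated service areas, a host currently in a callable state as a standby host includes:
    obtaining the current classification information of each designated service area, where the classification information is level information set according to the service call proportion of each service area;
    screening, according to the classification information, the service areas currently in a callable state, where a service area in the callable state is one whose classification information meets a second preset requirement;
    obtaining the operating information of each host in the service areas currently in the callable state, where the operating information includes the host's service call proportion and operating load;
    screening, from the service areas currently in the callable state, a host whose operating information meets a third preset requirement as the standby host.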
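  18. A computer-readable storage medium on which computer-readable instructions are stored, characterized in that the computer-readable instructions, when executed by a processor, implement a method for improving the availability of a cluster system, the method being applied to any host in the cluster system, the cluster system including a plurality of service areas distributed in different regions, the host currently executing the method being a first host, and the method including:
    monitoring whether each second host currently has a failure, where the second hosts are the hosts other than the first host;
    if a failure occurs, marking the second host currently in a failed state as a failed host, and screening, from the designated service areas, a host currently in a callable state as a standby host, where the designated service areas are the service areas other than the one to which the failed host belongs, and a host in the callable state is a host whose service call proportion and operating load meet a first preset requirement;
    running the system services of the failed host on the standby host.
  19. The computer-readable storage medium according to claim 18, characterized in that the step, executed by the processor, of monitoring whether each second host currently has a failure includes:
    receiving, at a first preset frequency, the preset signal sent by each second host, so as to monitor the working state of each second host;
    if the preset signal sent by a second host is received, determining that the second host is currently operating normally;
    if the preset signal sent by a second host is not received, marking the host that has not sent the preset signal as a designated host, and determining whether the preset signal sent by the designated host is received within a preset time period after the current time;
    if the preset signal sent by the designated host is received, determining that the designated host is operating normally;
    if the preset signal sent by the designated host is not received, determining that the designated host has failed.
  20. The computer-readable storage medium according to claim 18, characterized in that the step, executed by the processor, of screening, from the designated service areas, a host currently in a callable state as a standby host includes:
    obtaining the current classification information of each designated service area, where the classification information is level information set according to the service call proportion of each service area;
    screening, according to the classification information, the service areas currently in a callable state, where a service area in the callable state is one whose classification information meets a second preset requirement;
    obtaining the operating information of each host in the service areas currently in the callable state, where the operating information includes the host's service call proportion and operating load;
    screening, from the service areas currently in the callable state, a host whose operating information meets a third preset requirement as the standby host.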
PCT/CN2019/118163 2019-04-16 2019-11-13 提高集群系统可用性的方法、装置和计算机设备 WO2020211362A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910305188.3 2019-04-16
CN201910305188.3A CN110149366B (zh) 2019-04-16 2019-04-16 提高集群系统可用性的方法、装置和计算机设备

Publications (1)

Publication Number Publication Date
WO2020211362A1 true WO2020211362A1 (zh) 2020-10-22

Family

ID=67589761

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/118163 WO2020211362A1 (zh) 2019-04-16 2019-11-13 提高集群系统可用性的方法、装置和计算机设备

Country Status (2)

Country Link
CN (1) CN110149366B (zh)
WO (1) WO2020211362A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110149366B (zh) * 2019-04-16 2022-03-18 平安科技(深圳)有限公司 提高集群系统可用性的方法、装置和计算机设备
CN111338858B (zh) * 2020-02-18 2023-07-14 中国工商银行股份有限公司 一种双机房的容灾方法及装置
CN112787855B (zh) * 2020-12-29 2022-07-26 中国电力科学研究院有限公司 一种面向广域分布式服务的主备管理系统及管理方法
CN117544762B (zh) * 2023-11-17 2024-04-19 广东信佰工程监理有限公司 一种基于大数据分析的项目监理方法及系统

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103647668A (zh) * 2013-12-16 2014-03-19 上海证券交易所 一种高可用集群内主机群体决策系统及切换方法
CN103931139A (zh) * 2013-03-19 2014-07-16 华为技术有限公司 一种冗余保护方法、装置、设备及系统
US20150074447A1 (en) * 2013-09-09 2015-03-12 Samsung Sds Co., Ltd. Cluster system and method for providing service availability in cluster system
CN106982259A (zh) * 2017-04-19 2017-07-25 聚好看科技股份有限公司 服务器集群的故障解决方法
EP3247090A1 (en) * 2015-02-10 2017-11-22 Huawei Technologies Co., Ltd. Method, device and system for processing fault in at least one distributed cluster
CN110149366A (zh) * 2019-04-16 2019-08-20 平安科技(深圳)有限公司 提高集群系统可用性的方法、装置和计算机设备

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7526540B2 (en) * 2003-04-22 2009-04-28 International Business Machines Corporation System and method for assigning data collection agents to storage area network nodes in a storage area network resource management system
CN101656624B (zh) * 2008-08-18 2011-12-07 中兴通讯股份有限公司 一种多节点应用级容灾系统及容灾方法
US20170293540A1 (en) * 2016-04-08 2017-10-12 Facebook, Inc. Failover of application services
CN106557543A (zh) * 2016-10-14 2017-04-05 深圳前海微众银行股份有限公司 节点切换方法及系统
CN106487486B (zh) * 2016-10-18 2019-12-10 泰康保险集团股份有限公司 业务处理方法和数据中心系统
CN107707393B (zh) * 2017-09-26 2021-07-16 赛尔网络有限公司 基于Openstack O版特性的多活系统

Also Published As

Publication number Publication date
CN110149366B (zh) 2022-03-18
CN110149366A (zh) 2019-08-20

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19925379

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19925379

Country of ref document: EP

Kind code of ref document: A1