CN113064732B - Distributed system and management method thereof - Google Patents

Distributed system and management method thereof

Info

Publication number
CN113064732B
Authority
CN
China
Prior art keywords
service
node
service node
address
client
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010002343.7A
Other languages
Chinese (zh)
Other versions
CN113064732A (en
Inventor
李玮玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN202010002343.7A
Publication of CN113064732A
Application granted
Publication of CN113064732B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/54 Interprogram communication
    • G06F9/547 Remote procedure calls [RPC]; Web services
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer And Data Communications (AREA)

Abstract

An embodiment of the invention discloses a distributed system and a management method thereof. The method comprises: receiving an address update request sent by a client, the address update request being used to obtain the address of a service node that serves the client; caching the address update request; and, before the timeout corresponding to the address update request expires, returning the address of a second service node to the client if it is determined that the service node serving the client has migrated from a first service node to the second service node. When a client's service request fails, the scheme of the embodiment speeds up fault repair and improves the user experience.

Description

Distributed system and management method thereof
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a distributed system and a management method thereof.
Background
The service framework of a distributed system generally includes the following components: a client (Client), a user-side program that provides an interface through which users access services; service nodes (Server), which provide the specific interface logic for users and typically consist of a plurality of nodes; and a master node (Master), which is mainly used to manage and schedule the service nodes, perform service admission authentication on clients, and manage and allocate cluster resources, and which generally consists of one or more nodes. A typical distributed system provides application programming interface (API) services externally, generally through the following steps:
1) The client queries the master node for the address of the service node that provides the required service, and the master node returns the corresponding service node address to the client.
2) The client uses a remote procedure call (RPC) to access the returned service node address and obtain the required service; the service process itself requires no further interaction with the master node.
In the above process, a service node in a distributed system usually does not expose its working state. Once a service node fails and the master node schedules its services onto other service nodes, the client must detect the change and update the service address before it can access the required service correctly, so the speed at which the client updates the service address directly determines the speed and accuracy of its access. The address update scheme commonly adopted in the industry is for the client to query the master node periodically until it obtains a new service address. Periodic polling, however, is sensitive to the polling frequency: if the frequency is too high, a large number of invalid requests increase the processing burden on the master node; if it is too low, the client is slow to obtain the new service address. Existing distributed systems therefore still cannot respond promptly when a client fails to access a service node.
Disclosure of Invention
In view of the above problems, an embodiment of the present invention provides a distributed system and a management method thereof, which mainly aims to improve the speed of repairing access failures and improve user experience when a client fails to request service.
In order to achieve the above purpose, the embodiment of the present invention mainly provides the following technical solutions:
In a first aspect, an embodiment of the present invention provides a management method for a distributed system, the method being applied to a master node of the distributed system and including:
receiving an address update request sent by a client, wherein the address update request is used to obtain the address of a service node that serves the client;
caching the address update request;
before the timeout corresponding to the address update request expires, if it is determined that the service node serving the client has migrated from a first service node to a second service node, returning the address of the second service node to the client.
In a second aspect, an embodiment of the present invention provides a management method for a distributed system, the method being applied to a client and including:
sending an address update request to a master node according to request failure information fed back by a first service node, wherein the address update request is used to obtain the address of a service node that serves the client;
receiving response information for the address update request, wherein the response information includes the service node address that the master node has stored for the service;
if the service node address in the response information is the address of the first service node, sending an address update request to the master node again;
if the service node address in the response information is the address of a second service node, requesting the service from the second service node at that address.
In a third aspect, an embodiment of the present invention provides a distributed system, where the distributed system includes a master node, a plurality of service nodes, and at least one client, where the master node performs the management method described in the first aspect, and the client performs the management method described in the second aspect.
In a fourth aspect, an embodiment of the present invention provides a management apparatus for a distributed system, including: a memory and a processor, the memory for storing a computer program; the processor is configured to execute the management method according to the first aspect when the computer program is called.
In a fifth aspect, embodiments of the present invention provide a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the management method as described in the first aspect above, or implements the management method as described in the second aspect above.
With the above technical solution, the distributed system and the management method provided by the embodiments of the present invention allow a user to obtain the address of the service node providing a service quickly when a request to a service node in the distributed system fails, so that the user can access the required service successfully. In the distributed system, when a client's request fails and the client sends an address update request to the master node, the master node caches the request instead of responding to it immediately. The request is held for up to a preset timeout: if the master node detects a new service node address within the timeout, it returns that address to the client; otherwise, once the timeout expires, it returns the address it currently has stored, unchanged. This avoids the large number of invalid requests caused by polling and reduces the processing load on the master node, while the latest service node address can still be returned to the client within the timeout, so the client obtains the new service node address in time to repair the access failure.
The foregoing is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented in accordance with the description, and so that the above and other objects, features, and advantages of the invention are more readily apparent, specific embodiments of the invention are set forth below.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
Fig. 1 shows a flowchart of a data access method of a master node in a distributed system according to an embodiment of the present invention;
Fig. 2 shows a flowchart of a data access method of a client in a distributed system according to an embodiment of the present invention;
Fig. 3 shows a flowchart of a data access method of a service node in a distributed system according to an embodiment of the present invention;
Fig. 4 shows a block diagram of a master node in a distributed system according to an embodiment of the present invention;
Fig. 5 shows a block diagram of a master node in another distributed system according to an embodiment of the present invention;
Fig. 6 shows a block diagram of a client in a distributed system according to an embodiment of the present invention;
Fig. 7 shows a block diagram of a client in another distributed system according to an embodiment of the present invention;
Fig. 8 shows a block diagram of a distributed system according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
The invention is an improvement, within the service framework of existing distributed systems, that speeds up fault repair when a client fails to access a service node. In the embodiments of the invention, an access failure generally means that the client fails to access the service node, mainly because the network is unstable or because the service node currently cannot provide the service. In existing distributed systems, the master node must verify the failure; when it determines that the current service node cannot provide the service, it migrates the service to a new service node and, after the migration, provides the address of the new service node in response to the client's request. In this process the client and the master node do not perceive the failure at the same time, so the client has to wait for feedback from the master node before it can repair the access failure. By improving the client, the master node, and the service node respectively, the invention shortens this waiting time and speeds up the repair of access failures, thereby improving the user's application experience.
The improvements made by the present invention in the client, master node and service node, respectively, are described below by way of specific embodiments.
1) For the master node in a distributed system, an embodiment of the present invention provides a management method for the distributed system. The specific steps of the method are shown in Fig. 1 and include:
Step 101, receiving an address update request sent by a client.
The address update request is sent to the master node of the distributed system and is triggered by the client when it cannot access a service node normally or cannot obtain the service that the service node provides. The address update request is used to obtain the address of the service node that serves the client.
Step 102, caching the address update request.
After receiving the address update request, the master node caches it locally rather than responding to it immediately.
Step 103, before the timeout time corresponding to the address update request is reached, if it is determined that the service node providing the service for the client is migrated from the first service node to the second service node, the address of the second service node is returned to the client.
In this step, the first service node is the service node that the client failed to access. One of the main roles of the master node in the distributed system is to schedule the services provided by the service nodes, so the master node stores the address of each service node in the distributed system together with identification information for the services that node provides. In other words, the master node records the correspondence between each currently valid service and the address of the service node providing it.
Because the address update request is cached in step 102, and the timeout in this step defines the maximum time for which the request may be cached, the master node determines during the caching period whether the service node providing the service has migrated from the first service node to the second service node, that is, whether the address of the service node providing the service has been updated to the address of the second service node. If so, it returns the address of the second service node to the client. If the address has not been updated when the timeout expires, it returns the address of the first service node to the client; on finding that the address in the response information is that of the first service node, the client sends the address update request again, which returns the flow to step 101. Once the master node has responded to an address update request, it deletes that request from the cache.
It should be noted that the timeout may be set uniformly by the master node, set by the master node separately for different services, or customized by each client according to its actual requirements; that is, different clients may set different timeouts.
According to the steps in this embodiment, when the master node processes an address update request sent by a client, it does not respond immediately. Instead, it caches the request locally for a period of time and checks whether the service node corresponding to the requested service has migrated from the first service node to the second service node. If the service is found to have migrated within that period, the address of the second service node is returned to the client directly; if no migration can be confirmed within that period, the locally stored address of the first service node is returned to the client. With a reasonably chosen timeout, the master node therefore avoids receiving a large number of polling address update requests from the client, which reduces repeated processing of invalid requests and lowers its processing load. Meanwhile, while the address update request is cached, the master node can monitor the service node address for updates in real time or periodically and return an update to the client as soon as it is found, which improves the response speed to the client's request and shortens the time needed to repair the access failure.
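Steps 101 to 103 can be illustrated with a minimal Python sketch. It is not the patented implementation: the class name `MasterNode`, its methods, the explicit `now` argument (used instead of a real clock so the example is deterministic), and the sample addresses are all assumptions made for illustration.

```python
class MasterNode:
    """Hypothetical master node that caches address update requests."""

    def __init__(self, addresses):
        self.addresses = dict(addresses)  # service -> current service node address
        self.cache = []                   # (client_id, service, first_address, deadline)

    def receive(self, client_id, service, first_address, now, timeout):
        # Steps 101-102: cache the request instead of answering immediately.
        self.cache.append((client_id, service, first_address, now + timeout))

    def poll(self, now):
        # Step 103: answer a cached request once the service has migrated,
        # or once its timeout has expired; answered requests leave the cache.
        responses, kept = [], []
        for entry in self.cache:
            client_id, service, first_address, deadline = entry
            current = self.addresses[service]
            if current != first_address or now >= deadline:
                responses.append((client_id, current))
            else:
                kept.append(entry)
        self.cache = kept
        return responses


master = MasterNode({"svc": "10.0.0.1"})
master.receive("client-a", "svc", "10.0.0.1", now=0.0, timeout=30.0)
master.receive("client-b", "svc", "10.0.0.1", now=0.0, timeout=4.0)

timed_out = master.poll(now=5.0)       # client-b's timeout has expired
master.addresses["svc"] = "10.0.0.2"   # the service migrates to the second node
migrated = master.poll(now=6.0)        # client-a is answered before its timeout
```

In this sketch `client-b` receives the unchanged first address when its timeout expires and would have to send the request again, while `client-a` receives the second address as soon as the migration is visible.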
Further, for step 103 in the embodiment shown in Fig. 1, in order to determine service migration accurately and to ensure that the address of the second service node is usable, in a preferred embodiment of the present invention the client adds request failure information to the address update request it sends. The request failure information is the information that the first service node fed back to the client to indicate that the client's request to the first service node failed. In addition to error information (that is, the failure cause, which may be indicated by an error code), it includes information such as a service version number. The service version number represents the version of the service. In this embodiment, it changes whenever the address of the service node serving the client changes, for example by increasing monotonically: each time the service migrates, that is, each time the service node address changes, the service version number is incremented by one. Therefore, when service version numbers are compared, a larger value indicates a newer version, and the service node corresponding to the largest service version number is the one that can effectively serve the client.
Based on the added request failure information, the specific manner of determining that the service node serving the client has migrated from the first service node to the second service node is as follows:
Judge whether the service version number that the master node has stored locally for the service is greater than the service version number carried in the address update request.
If the master node's service version number is greater than the one carried in the request, the service has migrated from the first service node to the second service node; in this case, the address of the second service node is obtained and the locally stored service node address for the service is updated.
If the master node's service version number is less than or equal to the one carried in the request, the service has not yet migrated to the second service node, and the locally stored service node address for the service does not need to be updated.
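The version comparison above can be sketched as follows. The `ServiceRecord` type and the function name `migrated_address` are hypothetical names chosen for this example; only the comparison rule comes from the description.

```python
from typing import NamedTuple, Optional


class ServiceRecord(NamedTuple):
    """What the master node is assumed to store per service."""
    address: str
    version: int


def migrated_address(record: ServiceRecord, request_version: int) -> Optional[str]:
    # A strictly greater stored version means the service has migrated.
    if record.version > request_version:
        return record.address
    # Equal or smaller: no migration has been observed yet.
    return None


record = ServiceRecord(address="10.0.0.2", version=4)
newer = migrated_address(record, request_version=3)
same = migrated_address(record, request_version=4)
```

Returning `None` here stands for "keep the request cached"; what the master node does in that case is governed by the timeout handling described earlier.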
Further, the foregoing embodiment uses the service version number carried in the request failure information to verify whether the service has migrated. In another preferred embodiment of the present invention, the error information carried in the request failure information may be used to collect failure statistics for the first service node, so that the master node can actively trigger migration of the service provided by that node. Specifically:
First, the number of clients whose requests for the service from the first service node failed is determined according to the request failure information.
Because the master node serves all clients in the distributed system, it may receive, within a certain period of time, address update requests from multiple clients whose requests for the same service on the first service node failed. In this embodiment, the master node counts the request failure information carried in these address update requests to determine the number of clients.
It should be noted that the request failure information may be filtered during counting, because a client's failure to access the first service node may be caused by an unstable network, or by a fault in the first service node or its inability to provide the corresponding service. Since an access failure caused by a network problem may be temporary and does not require the service to be migrated, such request failure information is filtered out so that only request failure information of a designated type is counted. The designated type may be a single type of request failure information or a set of several types. In other words, the counted clients are those whose requests failed because the first service node could not respond to them normally.
Further, the counted number of clients refers to the number of clients whose requests for the same service on the same service node failed.
Second, when the number of clients is greater than a threshold, a scheduling request for migrating the service is triggered.
The threshold is a user-defined empirical value and may be set differently according to the application scenario of the distributed system.
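The counting and filtering described above might look like the sketch below. The error codes, the class name `FailureCounter`, and the choice to count distinct client identifiers per (service node, service) pair are assumptions for illustration; the description does not specify concrete codes or data structures.

```python
from collections import defaultdict


class FailureCounter:
    """Hypothetical per-(node, service) count of clients reporting node faults."""

    # Assumed error codes that indicate a node-side fault. Anything else
    # (for example a transient network error) is filtered out.
    NODE_FAULT_CODES = frozenset({"NODE_UNAVAILABLE", "SERVICE_NOT_LOADED"})

    def __init__(self, threshold):
        self.threshold = threshold
        self.clients = defaultdict(set)  # (node, service) -> client ids

    def report(self, client_id, node, service, error_code):
        """Record one failure report; return True if migration should be triggered."""
        if error_code not in self.NODE_FAULT_CODES:
            return False
        key = (node, service)
        self.clients[key].add(client_id)  # a client is counted only once
        return len(self.clients[key]) > self.threshold


counter = FailureCounter(threshold=2)
reports = [
    ("c1", "NODE_UNAVAILABLE"),
    ("c2", "NETWORK_TIMEOUT"),   # filtered: likely a temporary network problem
    ("c2", "NODE_UNAVAILABLE"),
    ("c1", "NODE_UNAVAILABLE"),  # duplicate client, count stays at 2
    ("c3", "NODE_UNAVAILABLE"),  # third distinct client exceeds the threshold
]
triggered = [counter.report(c, "node-a", "svc", code) for c, code in reports]
```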
The above describes how the master node actively triggers a scheduling request for service migration based on the address update requests reported by clients. In addition, the master node may determine whether the first service node has failed, that is, whether it is still able to provide the service, through heartbeat detection with the service nodes. If the service node has failed, that is, no heartbeat has been received from the first service node for a long time, the master node triggers a scheduling request to migrate the services provided by the first service node to the second service node.
Further, in the embodiment of the present invention, after the master node issues a scheduling request it also monitors whether the first service node completes the migration of the service. If the migration succeeds, the master node updates the locally stored service version number for the service, so that it detects the migration promptly and can notify the client, that is, respond to the corresponding address update requests in the cache. Correspondingly, the service version number of the service is updated synchronously on the second service node once the migration is complete.
After updating the locally stored service version number, the master node uses the updated record to respond to the cached address update requests for the same service: it compares the updated service version number with the service version number carried in each request and, if the updated number is greater, directly returns the updated second service node address to the corresponding client.
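Heartbeat-based detection can be sketched as below. The description only says that a node is treated as failed when no heartbeat is received "for a long time"; the `max_silence` parameter, the class name, and the explicit `now` arguments are assumptions made so the example is concrete and deterministic.

```python
class HeartbeatMonitor:
    """Hypothetical tracker of the last heartbeat time per service node."""

    def __init__(self, max_silence):
        self.max_silence = max_silence  # assumed limit on time without a heartbeat
        self.last_seen = {}             # node -> time of last heartbeat

    def beat(self, node, now):
        self.last_seen[node] = now

    def failed_nodes(self, now):
        # Nodes silent for longer than max_silence are candidates for migration.
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.max_silence
        )


monitor = HeartbeatMonitor(max_silence=10.0)
monitor.beat("node-a", now=0.0)
monitor.beat("node-b", now=0.0)
monitor.beat("node-b", now=8.0)   # node-b keeps reporting, node-a goes silent
suspects = monitor.failed_nodes(now=12.0)
```

For each node returned by `failed_nodes`, the master node would then issue the scheduling request that migrates its services.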
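The handling of cached requests after a successful migration can be sketched as follows. The tuple layouts and the function name `complete_migration` are illustrative assumptions, not part of the description.

```python
def complete_migration(records, pending, service, new_address):
    """Record a finished migration and answer cached requests for that service.

    records: service -> (address, version), as assumed to be kept by the master node
    pending: list of (client_id, service, request_version) cached requests
    """
    _, old_version = records[service]
    new_version = old_version + 1              # the version grows by one per migration
    records[service] = (new_address, new_version)

    responses, remaining = [], []
    for client_id, req_service, req_version in pending:
        if req_service == service and new_version > req_version:
            responses.append((client_id, new_address))
        else:
            remaining.append((client_id, req_service, req_version))
    pending[:] = remaining                      # answered requests leave the cache
    return responses


records = {"svc": ("10.0.0.1", 3), "other": ("10.0.0.9", 1)}
pending = [("c1", "svc", 3), ("c2", "other", 1), ("c3", "svc", 3)]
responses = complete_migration(records, pending, "svc", "10.0.0.2")
```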
Further, in another preferred embodiment of the present invention, the timeout need not be set by the client in the address update request but may be preset; that is, the master node maintains a data table of the timeouts configured for each client in the distributed system. On receiving an address update request from a client, the master node looks up the timeout corresponding to the client's identifier in the data table and caches the address update request for that timeout.
In addition, the master node may scan the address update requests in the cache at a preset period to determine whether each has reached its timeout. If it has, the address update request is deleted from the cache and the address of the first service node is returned to the corresponding client.
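The timeout table and the periodic scan can be sketched together. The fallback `default` timeout for clients missing from the table is an assumption of this example (the description does not say what happens in that case), as are the function names and the dictionary-based cache entries.

```python
def deadline_for(client_id, timeout_table, now, default=30.0):
    # Look up the client's preset timeout. The default is an assumed fallback.
    return now + timeout_table.get(client_id, default)


def scan_expired(cache, addresses, now):
    """One pass of the periodic scan: answer and drop requests past their timeout."""
    expired = [req for req in cache if now >= req["deadline"]]
    cache[:] = [req for req in cache if now < req["deadline"]]
    # Each expired request gets the address currently stored for its service.
    return [(req["client_id"], addresses[req["service"]]) for req in expired]


timeout_table = {"c1": 5.0, "c2": 20.0}
addresses = {"svc": "10.0.0.1"}
cache = [
    {"client_id": cid, "service": "svc",
     "deadline": deadline_for(cid, timeout_table, now=0.0)}
    for cid in ("c1", "c2")
]
answered = scan_expired(cache, addresses, now=10.0)
```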
2) For the client in a distributed system, an embodiment of the present invention also provides a management method for the distributed system. The specific steps of the method are shown in Fig. 2 and include:
Step 201, sending an address update request to the master node according to the request failure information fed back by the first service node.
The address update request is used for acquiring a service node address for providing service for the client.
In this step, after the client's request for the service from the first service node fails, the client receives the request failure information fed back by the first service node. The request failure information includes at least a service version number and is the same as the request failure information in the embodiment shown in Fig. 1.
Step 202, receiving response information of the address update request.
The response information comprises a service node address corresponding to the service to be requested, which is stored locally by the main control node.
The master node processes the address update request as described in the embodiment shown in Fig. 1. The client then receives the response information fed back by the master node, which includes the service node address that the master node has stored for the service. This stored address is not fixed; it is updated as the service migrates. Therefore, if the client cannot obtain the service from the first service node, it can learn by querying the master node whether the service node providing the service has changed, that is, whether a second service node address exists.
According to the embodiment shown in Fig. 1, the response information fed back by the master node contains a service node address, which may be the second service node address or the original address (the first service node address). The client repairs the access failure by identifying which address the response information contains: if it is the second service node address, step 204 is performed; if it is the first service node address, step 203 is performed.
Step 203, if the service node address in the response information is the first service node address, an address update request is sent to the master control node.
When the service node address obtained is still the first service node address, the master node has not found that the service has migrated, and the client sends an address update request to the master node again. Note that the interval between this address update request and the previous one should be longer than the timeout. Compared with the existing approach of polling at a fixed period, the timeout in this embodiment is generally longer than the polling period, so the client sends fewer address update requests to the master node, which in turn reduces the master node's burden of processing large numbers of invalid requests.
Step 204, if the service node address in the response information is the second service node address, requesting service from the second service node according to the second service node address.
When a different service node address is obtained, the master node has updated the service node address for the service, that is, the service has migrated to the second service node. The client can then access the service at the second service node address, that is, request the service from the second service node.
The above steps describe in detail the fault repair flow that a client of the distributed system follows when an access failure occurs. This flow can greatly speed up fault repair and reduce the number of invalid requests sent to the master node.
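The client-side flow of steps 201 to 204 can be sketched as a small loop. The `query_master` callable stands in for sending an address update request and waiting for the response, and `max_rounds` is an assumed safety limit that does not appear in the description.

```python
def repair_access(first_address, query_master, max_rounds=5):
    """Return the second service node address, or None if none is seen in time.

    query_master is a hypothetical callable that sends one address update
    request and returns the address from the master node's response.
    """
    for _ in range(max_rounds):
        address = query_master()       # steps 201-202: send and wait for the response
        if address != first_address:
            return address             # step 204: request the service at this address
        # Step 203: still the first service node address, so send the request again.
    return None


# Simulated responses: two timeouts with the old address, then the new one.
replies = iter(["10.0.0.1", "10.0.0.1", "10.0.0.2"])
new_address = repair_access("10.0.0.1", lambda: next(replies))
```

Because each call to `query_master` is held by the master node until a migration or the timeout, this loop sends far fewer requests than fixed-period polling would over the same interval.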
Further, based on the embodiment shown in fig. 2, in order to reduce the pressure of the master node to process the address update request before sending the address update request to the master node, in this embodiment, when a service is migrated from the first service node to another service node, the addresses of the other service node may be stored in the first service node, and when the client requests the service from the first service node, the addresses of the other service node are added to the request failure information and fed back to the client together. In this way, when the client receives the request failure information fed back by the first service node, it can first determine whether the request failure information has an address of the third service node (other service nodes), if so, the client can directly request the service to the third service node according to the address of the third service node, and does not need to trigger to send an address update request to the master control node. If the service is not available, the first service node may fail to provide the service, or the service may fail to provide the service in the migration process. That is, the cause of the access failure cannot be determined at this time, and for this reason, the client triggers the address update request to be sent to the master node, i.e. the above-mentioned step 201 is performed. It should be noted that the third service node may be the second service node.
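The client's check of the request failure information can be sketched as follows. The dictionary layout and the field name `migrated_to` are hypothetical; the description only states that the address of the other service node is added to the request failure information.

```python
def next_action(failure_info):
    """Decide what the client does after a failed request to the first service node."""
    redirect = failure_info.get("migrated_to")  # hypothetical field name
    if redirect:
        # The first service node knows where the service went: go there directly.
        return ("request_service", redirect)
    # Cause unknown (node fault or migration in progress): ask the master node.
    return ("send_address_update", None)


with_redirect = next_action({"error": "SERVICE_MOVED", "version": 4,
                             "migrated_to": "10.0.0.3"})
without_redirect = next_action({"error": "NODE_UNAVAILABLE", "version": 3})
```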
Further, based on the embodiment shown in fig. 2, in order to verify more accurately whether the service node address corresponding to the service has been updated to the second service node address, in another preferred embodiment of the present invention the client may add the service version number of the requested service to the address update request when that request is triggered. The master node can then check this service version number against the service version number it stores for the service; the higher service version number identifies the currently valid version. The specific verification process is described in detail in the master node embodiment and is not repeated here.
Further, the client may issue concurrent requests for multiple services provided by the same service node, and all of them may fail. The client would then send multiple address update requests to the master node, and a distributed system contains a large number of clients. To reduce the burden on the master node of processing these concurrent requests, a first threshold is set to limit the number of concurrent address update requests that a client sends for the same service node. For example, service node A hosts 5 services and the client accesses all 5 at the same time, of which 4 fail. If the first threshold is set to 2, the client sends only 2 address update requests to the master node, waits for one of them to complete, and then sends the third, and so on until all access failures are repaired. That is, at most 2 such requests from the client are outstanding at the master node at any time.
In addition to limiting the number of concurrent address update requests triggered by the same service node, the total number of concurrent requests of all clients may also be limited. That is, when a plurality of clients access a plurality of service nodes concurrently and the accesses fail, the total number of concurrent address update requests sent by all clients to the master control node is limited. This total is set as a second threshold, and both it and the first threshold can be customized. For example, assume that the second threshold is 10 and that the master node is currently processing 7 address update requests. When 5 concurrent access failures occur for service node A, it is first necessary to query how many more address update requests the master node can process, here 10 - 7 = 3, so only 3 of the 5 address update requests can be sent to the master node.
Furthermore, under the flow control described above for concurrent address update requests, address update requests for a high-traffic service node could crowd out those for a low-traffic service node. To avoid this, the embodiment of the present invention further provides that each service node is allowed to send at least one address update request. For example, service node A is a high-traffic node with 4 concurrent failed requests; since the first threshold is 2, 2 of its address update requests are sent and the other 2 are held back. Service node B is a low-traffic node with only 1 failed request; because the second threshold is 2, the address update request for B is also held back. Under this principle, when the master node finishes processing one address update request, the address update request for B is processed next, rather than only after all the address update requests for A have been processed.
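The two thresholds and the fairness rule above can be illustrated with a short Python sketch. It is a simplified model, not the embodiment's implementation: the class name AddressUpdateLimiter and its methods are hypothetical, the thresholds are treated as inclusive upper limits (matching the worked examples), and "sending" a request is modelled only by counters.

```python
from collections import deque
from typing import Deque, Dict, Optional


class AddressUpdateLimiter:
    """Client-side flow control for address update requests (illustrative)."""

    def __init__(self, per_node_limit: int, total_limit: int) -> None:
        self.per_node_limit = per_node_limit  # first threshold
        self.total_limit = total_limit        # second threshold
        self.in_flight: Dict[str, int] = {}   # service node -> requests outstanding
        self.waiting: Deque[str] = deque()    # held-back requests, by service node

    def _can_send(self, node: str) -> bool:
        total = sum(self.in_flight.values())
        return (self.in_flight.get(node, 0) < self.per_node_limit
                and total < self.total_limit)

    def submit(self, node: str) -> bool:
        """Return True if the request is sent now, False if it is held back."""
        if self._can_send(node):
            self.in_flight[node] = self.in_flight.get(node, 0) + 1
            return True
        self.waiting.append(node)
        return False

    def complete(self, node: str) -> Optional[str]:
        """Record a finished request; return the node whose request goes next."""
        self.in_flight[node] -= 1
        candidates = [n for n in self.waiting if self._can_send(n)]
        if not candidates:
            return None
        # Fairness: a node with nothing outstanding goes first, so every
        # service node gets at least one request through.
        starved = [n for n in candidates if self.in_flight.get(n, 0) == 0]
        chosen = starved[0] if starved else candidates[0]
        self.waiting.remove(chosen)
        self.in_flight[chosen] = self.in_flight.get(chosen, 0) + 1
        return chosen
```

With both limits set to 2, four submissions for node A send two and hold two back, a submission for node B is held back by the total limit, and the first completion for A releases B's request ahead of A's remaining ones, as in the example above.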
3) For a service node in a distributed system, the embodiment of the invention also provides a management method of the distributed system, and specific steps of the method are shown in fig. 3, and the method comprises the following steps:
Step 301: according to the service request sent by the client, determine whether the requested service has the access right.
In a distributed system, data at the same storage address may need to be read by different services provided by different service nodes. For data with access restrictions, to avoid simultaneous access by too many users, only one service may be allowed to access the data at a time. Whether a service request succeeds therefore also depends on whether the service in the service node is authorized to access the data. In this embodiment, whether the service in the service node has the access right may be determined by means of a file lock: the services that need to read the same access-restricted data compete to acquire the lock, and the service that acquires it has the access right to the data. For the first service node in the above embodiment, it is determined whether the service on the first service node has acquired the file lock. If so, the first service node has the access right and step 302 is executed; otherwise step 303 is executed.
Step 302: if the access right is held, obtain the access right data corresponding to the service as the response information of the service request.
Step 303: if the access right is not held, query the address of the second service node that has the access right, and add the address of the second service node to the response information.
The response information in this step is the response information returned when the first service node fails to respond to the service request.
Therefore, in the embodiment of the present invention, the first service node records the service access right and can thus query the target service node that holds it, which in effect records the migration state of the service. When responding to a client's service request, even if it cannot return the access right data, the first service node can feed back the address of the service node that has the access right. The client can then access that service node directly without sending an address update request to the master control node, which speeds up the repair of the access fault.
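Steps 301 to 303 can be sketched as below. This is a hedged illustration: the file lock is modelled as a plain dictionary mapping each service to the address of the node that currently holds the lock, which is an assumption of this sketch, and the names Response, ServiceNode, lock_holders and handle are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Response:
    ok: bool
    data: Optional[bytes] = None
    redirect_address: Optional[str] = None  # filled in by step 303


class ServiceNode:
    def __init__(self, address: str, lock_holders: Dict[str, str],
                 store: Dict[str, bytes]) -> None:
        self.address = address
        # service name -> address of the node whose service holds the file
        # lock (stand-in for a real shared lock; assumption of this sketch)
        self.lock_holders = lock_holders
        self.store = store

    def handle(self, service: str) -> Response:
        holder = self.lock_holders.get(service)
        # Step 301: does the service on this node hold the lock?
        if holder == self.address:
            # Step 302: return the access right data.
            return Response(ok=True, data=self.store.get(service))
        # Step 303: report the address of the node that holds the lock.
        return Response(ok=False, redirect_address=holder)
```

A node that has lost the lock thus still answers the client, but with a failure response that names the current lock holder instead of the data.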
Further, in order to ensure that the queried target service node can provide effective service, in the preferred embodiment of the present invention, verification may also be performed by determining a service version number of the service. The specific query process is as follows:
First, after the address of the second service node is queried, the service version number of the service provided by the second service node is obtained.
Then it is determined whether the service version number in the second service node is greater than the service version number stored in the first service node. If so, the new address is added to the response information; otherwise it is not added.
The comparison principle for service version numbers is described in the master node and client embodiments above and is not repeated here.
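The version check on the queried address can be written as a small Python function. The function name build_failure_response and the dictionary keys are hypothetical, and service version numbers are assumed to be integers that grow with each migration; the sketch only shows that the new address is attached when the second service node's version is strictly greater.

```python
from typing import Any, Dict


def build_failure_response(local_version: int, remote_version: int,
                           remote_address: str) -> Dict[str, Any]:
    """Build the response of a first service node that cannot serve a request.

    local_version is the service version number stored in the first service
    node; remote_version and remote_address describe the queried second
    service node.
    """
    response: Dict[str, Any] = {"ok": False, "service_version": local_version}
    if remote_version > local_version:
        # The second service node holds a newer, valid version of the service.
        response["redirect_address"] = remote_address
        response["service_version"] = remote_version
    return response
```

When the versions are equal or the remote one is lower, the response carries no redirect address, and the client falls back to asking the master node.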
Further, as an implementation of the method shown in fig. 1, the embodiment of the present invention provides a master control node of a distributed system, which is mainly used for improving the repair speed of access faults and improving the use experience of users. For convenience of reading, details in the foregoing method embodiment are not repeated one by one in this embodiment, but it should be clear that the master control node in this embodiment can correspondingly implement all the details in the foregoing method embodiment. The master control node, as shown in fig. 4, specifically includes:
a receiving unit 41, configured to receive an address update request sent by a client, where the address update request is used to obtain the address of a service node that provides a service for the client;
a caching unit 42, configured to cache the address update request;
a response unit 43, configured to return the address of a second service node to the client if, before the timeout corresponding to the address update request expires, it is determined that the service node providing the service for the client has migrated from the first service node to the second service node.
Further, the address update request also includes request failure information, which indicates that the client's request to the first service node failed;
the request failure information also includes a service version number, which changes whenever the address of the service node serving the client changes.
Further, as shown in fig. 5, the response unit 43 includes:
a judging module 431, configured to judge whether the service version number corresponding to the service stored in the master node is greater than the service version number carried in the address update request;
a determining module 432, configured to determine that the service has been migrated to the second service node and to obtain the address of the second service node if the judging module 431 determines that the stored service version number is greater.
Further, as shown in fig. 5, the master node further includes:
a statistics unit 44, configured to determine, according to request failure information, the number of clients whose requests to the first service node for the service have failed, where the request failure information is generated because the first service node cannot respond to a request;
a triggering unit 45, configured to trigger a scheduling request for migrating the service, so as to migrate the service from the first service node to the second service node, when the number of clients determined by the statistics unit 44 is greater than a threshold.
Further, the triggering unit 45 is further configured to determine, by heartbeat detection, whether the first service node has a failure; and if a failure exists, to trigger a scheduling request for migrating the service, so as to migrate the service from the first service node to the second service node.
Further, as shown in fig. 5, the master node further includes:
an updating unit 46, configured to monitor whether the first service node completes the migration of the service according to the scheduling request triggered by the triggering unit 45; and if the migration is successful, to update the service version number corresponding to the service stored by the master control node.
Further, the response unit 43 is further configured to compare, after the service version number corresponding to the service stored in the master node has been updated, the updated service version number with the service version numbers of the address update requests in the cache that request the same service; and if the updated service version number is greater than the service version number of an address update request, to perform the operation of returning the address of the second service node to the client.
Further, the response unit 43 is further configured to scan the address update requests in the cache according to a preset period and determine whether each address update request has reached its corresponding timeout; and if so, to delete the address update request from the cache and feed back the address of the first service node to the client.
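The interplay of the caching unit, the version comparison, and the periodic timeout scan can be illustrated with the following Python sketch. It is a simplified model under several assumptions not stated in the embodiment: the class and method names are hypothetical, the version number is incremented by 1 on each migration, a request is also checked against the stored version at the moment it arrives, time is passed in explicitly, and replies are collected in a list instead of being sent over a network.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class PendingRequest:
    client: str
    service: str
    client_version: int  # service version number carried in the request
    deadline: float      # time at which the request times out


class MasterNode:
    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self.table: Dict[str, Tuple[str, int]] = {}  # service -> (address, version)
        self.cache: List[PendingRequest] = []
        self.replies: List[Tuple[str, str]] = []     # (client, address) fed back

    def on_address_update(self, client: str, service: str,
                          client_version: int, now: float) -> None:
        address, version = self.table[service]
        if version > client_version:
            # Stored version is newer: the service has already migrated.
            self.replies.append((client, address))
        else:
            self.cache.append(PendingRequest(client, service, client_version,
                                             now + self.timeout))

    def on_migrated(self, service: str, new_address: str) -> None:
        version = self.table[service][1] + 1  # assumed +1 per migration
        self.table[service] = (new_address, version)
        remaining = []
        for req in self.cache:
            if req.service == service and version > req.client_version:
                self.replies.append((req.client, new_address))
            else:
                remaining.append(req)
        self.cache = remaining

    def scan(self, now: float) -> None:
        """Periodic scan: answer timed-out requests with the unchanged address."""
        remaining = []
        for req in self.cache:
            if now >= req.deadline:
                self.replies.append((req.client, self.table[req.service][0]))
            else:
                remaining.append(req)
        self.cache = remaining
```

Holding requests in the cache lets a single migration event answer every waiting client at once, while the scan bounds how long any client waits when no migration occurs.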
Further, as an implementation of the method shown in fig. 2, the embodiment of the present invention provides a client of a distributed system, which is mainly used to speed up the repair of access faults and improve the user experience. For ease of reading, the details of the foregoing method embodiment are not repeated one by one, but it should be clear that the client in this embodiment can correspondingly implement all of those details. The client, as shown in fig. 6, specifically includes:
a first sending unit 51, configured to send an address update request to a master control node according to request failure information fed back by a first service node, where the address update request is used to obtain the address of a service node that provides a service for the client;
a receiving unit 52, configured to receive response information of the address update request, where the response information includes the service node address corresponding to the service stored in the master node;
the first sending unit 51 is further configured to send an address update request to the master node if the service node address in the response information is the first service node address;
a second sending unit 53, configured to request the service from the second service node according to the second service node address if the service node address in the response information is the second service node address.
Further, as shown in fig. 7, the client further includes:
a judging unit 54, configured to judge whether a third service node address exists in the request failure information before the first sending unit 51 sends an address update request to the master node;
a service request unit 55, configured to request the service from the third service node according to the third service node address if that address exists;
the first sending unit 51 is further configured to send an address update request to the master node if the third service node address does not exist.
Further, as shown in fig. 7, the request failure information includes a service version number, where the service version number changes according to a service node address change for providing services to the client, and the client further includes:
An adding unit 56, configured to add the request failure information to the address update request.
Further, as shown in fig. 7, the client further includes:
a concurrency control unit 57, configured to, when multiple concurrent pieces of request failure information exist for the same service node, keep the number of address update requests sent to the master node for that service node less than a first threshold, and to keep the total number of address update requests sent concurrently to the master node for multiple services less than a second threshold.
Further, the concurrency control unit 57 is further configured to allow each service node to send at least one address update request when controlling the number of address update requests sent to the master node.
Further, after the service in the first service node has been migrated to the second service node, the service request unit 55 is further configured to send a service request to the first service node. The first service node determines, according to the service request, whether the access right to the service is held; if so, it obtains the access right data corresponding to the service as the response information of the service request; otherwise it queries the address of the second service node that has the access right and adds that address to the response information;
the receiving unit 52 is further configured to receive the response information fed back by the first service node.
Further, the service request unit 55 is further configured to determine whether the service version number corresponding to the second service node contained in the response information is greater than the locally stored service version number of the service; and if it is greater, to request the service from the second service node according to the address of the second service node.
In view of the foregoing embodiments, the present invention further provides a distributed system, as shown in fig. 8. The system includes a master control node (Master), a plurality of service nodes (servers), and at least one client (Client), wherein:
When the client fails to access the service provided by the first service node, the client determines whether the address of the second service node is present in the request failure information fed back by the first service node; if so, it sends a service request to the second service node; if not, it sends an address update request to the master control node;
After receiving the address update request, the master control node caches it. If the master control node detects, before the timeout expires, that the service has been migrated from the first service node to the second service node, it feeds back the address of the second service node to the client; if no migration has been detected when the timeout is reached, it feeds back the address of the first service node to the client;
When the first service node receives the service request of the client, it determines whether the service has the access right; when it does not, the first service node obtains the address of the target service node that has the access right and feeds that address back to the client.
In addition, the above-mentioned fig. 1-3 also specifically illustrates the respective workflows of the master node, the service node, and the client, which are equally applicable to the distributed system, and the detailed description thereof will not be repeated here.
Further, according to the above-mentioned distributed system, the embodiment of the present invention further provides a management device, which is disposed in a master control node of the distributed system, and includes: a memory and a processor, the memory for storing a computer program; the processor is adapted to execute the management method as described in fig. 1-3 when the computer program is called.
Further, an embodiment of the present invention also proposes a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the management method as described in fig. 1-3.
In the foregoing embodiments, each embodiment is described with its own emphasis; for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
It will be appreciated that the relevant features of the methods and apparatus described above may be referenced to one another. In addition, "first", "second", and the like in the above embodiments serve only to distinguish the embodiments and do not indicate that any embodiment is superior or inferior to another.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with the teachings herein. The required structure for a construction of such a system is apparent from the description above. In addition, the present invention is not directed to any particular programming language. It will be appreciated that the teachings of the present invention described herein may be implemented in a variety of programming languages, and the above description of specific languages is provided for disclosure of enablement and best mode of the present invention.
Furthermore, the memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash RAM. Memory is an example of a computer-readable medium.
Computer-readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and variations of the present application will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the application are to be included in the scope of the claims of the present application.

Claims (10)

1. A method for managing a distributed system, wherein the method is applied to a master control node of the distributed system, and comprises the following steps:
Receiving an address update request sent by a client, wherein the address update request is used for acquiring a service node address for providing service for the client;
caching the address update request;
Before the timeout time corresponding to the address update request arrives, if the service node providing service for the client is determined to be migrated from the first service node to the second service node, the address of the second service node is returned to the client.
2. The method according to claim 1, wherein the address update request further includes request failure information, the request failure information indicating that the client's request to the first service node failed;
the request failure information includes a service version number that changes according to a service node address change that serves the client.
3. The method of claim 2, wherein the determining that the service node serving the client is migrated from the first service node to the second service node comprises:
judging whether the service version number corresponding to the service stored by the main control node is larger than the service version number carried in the address update request;
If it is greater, determining that the service has been migrated to the second service node, and acquiring the second service node address.
4. The method according to claim 2, wherein the method further comprises:
Determining, according to request failure information, the number of clients whose requests to the first service node for the service have failed, wherein the request failure information is generated because the first service node cannot respond to the request;
and triggering a scheduling request for migrating the service when the number of the clients is greater than a threshold value, and migrating the service from the first service node to the second service node.
5. The method according to claim 1, wherein the method further comprises:
Judging whether the first service node has a fault or not by using heartbeat detection;
and if a fault exists, triggering a scheduling request for migrating the service, and migrating the service from the first service node to the second service node.
6. The method according to claim 4 or 5, characterized in that the method further comprises:
monitoring whether the first service node completes the migration of the service according to the scheduling request;
And if the migration is successful, updating the service version number corresponding to the service stored by the main control node.
7. The method of claim 6, wherein after updating the service version number corresponding to the service stored by the master node, the method further comprises:
Comparing the updated service version number with the service version numbers corresponding to the address update requests in the cache that request the same service;
And if the updated service version number is greater than the service version number corresponding to an address update request, returning the address of the second service node to the client.
8. The method according to claim 1, wherein the method further comprises:
Scanning the address update requests in the cache according to a preset period, and judging whether each address update request has reached its corresponding timeout;
and if so, deleting the address updating request in the cache, and feeding back the address of the first service node to the client.
9. A management apparatus for a distributed system, comprising: a memory and a processor, the memory for storing a computer program; the processor is configured to execute the management method according to any one of claims 1-8 when the computer program is invoked.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the management method according to any one of claims 1-8.
CN202010002343.7A 2020-01-02 2020-01-02 Distributed system and management method thereof Active CN113064732B (en)

Publications (2)

Publication Number Publication Date
CN113064732A (en) 2021-07-02
CN113064732B (en) 2024-05-31

Family

ID=76558398

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010002343.7A Active CN113064732B (en) 2020-01-02 2020-01-02 Distributed system and management method thereof

Country Status (1)

Country Link
CN (1) CN113064732B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114222168B (en) * 2021-12-02 2024-03-12 上海哔哩哔哩科技有限公司 Resource scheduling method and system
CN117061324B (en) * 2023-10-11 2023-12-15 佳瑛科技有限公司 Service data processing method and distributed system

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7941560B1 (en) * 2006-07-14 2011-05-10 Intuit Inc. Client caching of target addresses for network requests
US8522086B1 (en) * 2005-05-03 2013-08-27 Emc Corporation Method and apparatus for providing relocation notification
CN103618808A (en) * 2013-11-08 2014-03-05 北京奇虎科技有限公司 Method, device and system for processing address change of server terminal
CN105635331A (en) * 2014-11-18 2016-06-01 阿里巴巴集团控股有限公司 Service addressing method and apparatus in distributed environment
CN109936639A (en) * 2017-12-15 2019-06-25 中兴通讯股份有限公司 A kind of service calling method and server
CN110022333A (en) * 2018-01-09 2019-07-16 阿里巴巴集团控股有限公司 The communication means and device of distributed system
CN110308983A (en) * 2019-04-19 2019-10-08 中国工商银行股份有限公司 Method for balancing resource load and system, service node and client
CN110601868A (en) * 2018-06-13 2019-12-20 阿里巴巴集团控股有限公司 Distributed system, method and electronic equipment for distributing configuration information in real time

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107294799B (en) * 2016-03-31 2020-09-01 阿里巴巴集团控股有限公司 Method and device for processing nodes in distributed system

Also Published As

Publication number Publication date
CN113064732A (en) 2021-07-02

Similar Documents

Publication Publication Date Title
CN107005426B (en) Method and device for managing life cycle of virtual network function
US10846185B2 (en) Method for processing acquire lock request and server
CN111159233B (en) Distributed caching method, system, computer equipment and storage medium
CN111464603B (en) Server capacity expansion method and system
CN113064732B (en) Distributed system and management method thereof
CN107688489B (en) Method and system for scheduling tasks
CN110289965B (en) Application program service management method and device
US20160234129A1 (en) Communication system, queue management server, and communication method
CN114745358B (en) IP address management method, system and controller in load balancing service
CN112860386A (en) Method for switching nodes in distributed master-slave system
US20200142759A1 (en) Rest gateway for messaging
CN112416594A (en) Micro-service distribution method, electronic equipment and computer storage medium
CN110557398B (en) Service request control method, device, system, computer equipment and storage medium
CN114244890B (en) RPA server cluster control method and system
CN111314241A (en) Task scheduling method and scheduling system
CN113326104B (en) Method, system and device for modifying internal configuration of virtual machine
CN107105037B (en) Distributed video CDN resource management system and method based on file verification
CN113076187A (en) Distributed lock management method and device
CN115037627B (en) Network configuration information processing method, SDN controller, system and storage medium
US11496564B2 (en) Device state synchronization method and common capability component
CN115858419A (en) Metadata management method, device, equipment, server and readable storage medium
CN113391759B (en) Communication method and equipment
CN111522649B (en) Distributed task allocation method, device and system
CN111258764A (en) Method and system for providing multi-tenant persistent task records for data center
CN114510282B (en) Method, device, equipment and storage medium for running automation application

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant