CN113064732A

CN113064732A - Distributed system and management method thereof

Info

Publication number: CN113064732A
Application number: CN202010002343.7A
Authority: CN
Inventors: 李玮玮
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2020-01-02
Filing date: 2020-01-02
Publication date: 2021-07-02

Abstract

An embodiment of the invention discloses a distributed system and a management method thereof, wherein the method comprises the following steps: receiving an address update request sent by a client, the address update request being used to acquire the address of a service node that provides service for the client; caching the address update request; and, before the timeout corresponding to the address update request is reached, if it is determined that the service node providing service for the client has been migrated from a first service node to a second service node, returning the address of the second service node to the client. When a client's service request fails, the scheme of the embodiment can speed up fault repair and improve the user experience.

Description

Distributed system and management method thereof

Technical Field

The invention relates to the technical field of computers, in particular to a distributed system and a management method thereof.

Background

The service framework of a distributed system typically includes the following components: a client (Client), which is the user-side program that provides an interface for users to access services; a service node (Server), which provides the specific interface logic to users and generally comprises a plurality of nodes; and a master node (Master), which is mainly used for managing and scheduling service nodes, performing service admission authentication on clients, and managing and allocating cluster resources, and which generally comprises a single node or a plurality of nodes. A typical distributed system that provides Application Program Interface (API) services externally generally works as follows:

1) The client queries the master node for the address of the service node that provides the required service, and the master node returns the corresponding service node address to the client.

2) The client uses a remote procedure call (RPC) to access the obtained service node address and obtain the required service; the rest of the service process no longer requires any interaction with the master node.

In this process, the service nodes in the distributed system do not usually expose their working state. Once a service node fails, the master node schedules the service to another service node, and the client must detect the change and update the service address in order to access the required service correctly. The speed at which the client updates the service address therefore directly determines the speed and accuracy of its access. The address update scheme commonly adopted in the industry is for the client to query periodically until a new service address is obtained. However, periodically polling the master node makes the choice of polling frequency critical: if the polling frequency is too high, a large number of invalid requests are generated and the processing burden of the master node increases; if the polling frequency is too low, the client obtains the new service address more slowly. As a result, current distributed systems still cannot respond in time when a client fails to access a service node.

Disclosure of Invention

In view of the above problems, embodiments of the present invention provide a distributed system and a management method thereof, and mainly aim to improve the speed of repairing an access failure and improve user experience when a client fails to request a service.

In order to achieve the above purpose, the embodiments of the present invention mainly provide the following technical solutions:

In a first aspect, an embodiment of the present invention provides a management method for a distributed system, where the method is applied to a master node of the distributed system and includes:

receiving an address updating request sent by a client, wherein the address updating request is used for acquiring a service node address for providing service for the client;

caching the address updating request;

before the timeout time corresponding to the address updating request is reached, if the service node which provides service for the client is determined to be migrated from the first service node to the second service node, the address of the second service node is returned to the client.

In a second aspect, an embodiment of the present invention provides a management method for a distributed system, where the method is applied to a client, and includes:

sending an address updating request to a master control node according to request failure information fed back by a first service node, wherein the address updating request is used for acquiring a service node address for providing service for the client;

receiving response information of the address updating request, wherein the response information comprises a service node address corresponding to the service stored in the master control node;

if the service node address in the response message is the first service node address, sending an address updating request to the master control node;

and if the service node address in the response message is a second service node address, requesting the service from the second service node according to the second service node address.

In a third aspect, an embodiment of the present invention provides a distributed system, where the distributed system includes a master node, a plurality of service nodes, and at least one client, where the master node executes the management method in the first aspect, and the client executes the management method in the second aspect.

In a fourth aspect, an embodiment of the present invention provides a management apparatus for a distributed system, including: a memory for storing a computer program and a processor; the processor is configured to execute the management method of the first aspect when the computer program is invoked.

In a fifth aspect, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the management method according to the first aspect or implements the management method according to the second aspect.

By means of the above technical scheme, the distributed system and the management method thereof provided by the embodiments of the invention enable a user to quickly acquire the address of the service node that provides the service when a service request to a service node in the distributed system fails, so that the user can successfully access the required service. In this distributed system, when a client request fails and the client sends an address update request to the master node, the master node caches the request rather than responding to it immediately, and holds it for at most a preset timeout. If the master node detects the address of a new service node within the timeout, it returns that address to the client; otherwise, once the timeout is reached, it returns the address it currently stores. This avoids the large number of invalid requests caused by polling and reduces the processing burden of the master node, while still allowing a newly updated service node address to be returned to the client within the timeout, so that the client obtains the address of the new service node in time to repair the access fault.

The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

FIG. 1 is a flowchart illustrating a data access method of a master node in a distributed system according to an embodiment of the present invention;

FIG. 2 is a flowchart illustrating a data access method of a client in a distributed system according to an embodiment of the present invention;

FIG. 3 is a flowchart illustrating a data access method of a service node in a distributed system according to an embodiment of the present invention;

FIG. 4 is a block diagram illustrating a master node in a distributed system according to an embodiment of the present invention;

FIG. 5 is a block diagram illustrating a master node in another distributed system according to an embodiment of the present invention;

FIG. 6 is a block diagram illustrating a client in a distributed system according to an embodiment of the present invention;

FIG. 7 is a block diagram illustrating a client in another distributed system according to an embodiment of the present invention;

FIG. 8 is a block diagram illustrating a distributed system according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.

The invention improves on the service framework of existing distributed systems in order to speed up fault repair when a client fails to access a service node. In the embodiments of the present invention, an access failure generally means that a client cannot access a service node, mainly because the network is unstable or because the service node currently cannot provide the service. In an existing distributed system, the master node must first verify that the current service node cannot provide the service and migrate the service to a new service node; only after the migration does it provide the address of the new service node in response to a client request. In this process the client and the master node do not perceive the fault at the same time, so the client has to wait for feedback from the master node before it can repair the access failure. The invention optimizes the client, the master node, and the service node respectively, so that this waiting time is shortened, access failures are repaired faster, and the user's application experience is improved.

The following description is made of specific embodiments of the present invention with respect to improvements made in the client, the master node, and the service node, respectively.

1) As for a master control node in a distributed system, an embodiment of the present invention provides a management method for a distributed system, which includes the specific steps shown in fig. 1, and the method includes:

step 101, receiving an address update request sent by a client.

The address update request is a request which is triggered by a client and sent to a master control node in the distributed system when the client cannot normally access the service node or cannot obtain the service provided by the service node. And the address updating request is used for acquiring the service node address for providing service for the client.

Step 102, caching the address update request.

After receiving the address update request, the master node caches the address update request locally instead of feeding back the address update request in real time.

Step 103, before the timeout corresponding to the address update request is reached, if it is determined that the service node providing service for the client has been migrated from the first service node to the second service node, returning the address of the second service node to the client.

In this step, the first service node refers to a service node for which the client fails to access. One of the main functions of the main control node in the distributed system is to schedule services provided by each service node in the system, so that the address of each service node in the distributed system and the identification information of the service provided by each service node are stored in the main control node, that is, the corresponding relationship between the currently effective service and the address of the service node providing the service is recorded in the main control node.

The address update request is cached in step 102, and the timeout in this step determines the maximum duration for which the request may remain cached. While the request is cached, the master node determines whether the service node providing the service has been migrated from the first service node to the second service node, that is, whether the address of the service node providing the service has been updated to the address of the second service node. If so, the address of the second service node is returned to the client. If the stored address has not been updated by the time the timeout is reached, the address of the first service node is returned to the client; correspondingly, when the client finds that the address in the received response is still the address of the first service node, it sends the address update request again, that is, the flow returns to step 101. After responding to an address update request, the master node deletes that request from the cache.

It should be noted that the timeout may be set uniformly by the master node, set differently for different services, or customized by the client according to actual requirements, that is, different clients may set different timeouts.

As can be seen from the steps of the foregoing embodiment, when processing an address update request sent by a client, the master node does not respond immediately. Instead, it caches the request for a period of time and meanwhile checks whether the service node corresponding to the requested service has been migrated from the first service node to the second service node. If the migration is detected within that period, the master node returns the address of the second service node to the client directly; if no migration is detected within that period, it returns the locally stored address of the first service node to the client. With a reasonably chosen timeout, the master node no longer receives a large number of polling address update requests from the client, which reduces the repeated processing of invalid requests and lowers the processing burden of the master node. Meanwhile, while an address update request is cached, the master node can monitor the service node address for updates in real time or periodically and return the update to the client as soon as it is found, which improves the response speed for the client request and shortens the time needed to repair the access failure.
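Steps 101 to 103 can be illustrated with a minimal Python sketch. It is not the patented implementation: the class and field names (`MasterNode`, `PendingRequest`, `responses`) are hypothetical, time is passed in explicitly instead of being read from a clock, and the `responses` list stands in for the network reply to the client.

```python
from dataclasses import dataclass, field


@dataclass
class PendingRequest:
    client_id: str
    service: str
    known_address: str  # address of the first service node, as seen by the client
    deadline: float     # absolute time at which the cached request times out


@dataclass
class MasterNode:
    # service name -> address of the node currently providing it
    addresses: dict = field(default_factory=dict)
    # cached address update requests that have not been answered yet
    pending: list = field(default_factory=list)
    # (client_id, address) pairs already sent back; stands in for the network
    responses: list = field(default_factory=list)

    def receive(self, client_id, service, known_address, now, timeout):
        """Steps 101 and 102: receive the request and cache it instead of replying."""
        self.pending.append(
            PendingRequest(client_id, service, known_address, now + timeout)
        )
        self.flush(now)

    def migrate(self, service, new_address, now):
        """Record that a service moved, then answer cached requests for it."""
        self.addresses[service] = new_address
        self.flush(now)

    def flush(self, now):
        """Step 103: answer requests whose service migrated or whose timeout expired."""
        still_waiting = []
        for req in self.pending:
            current = self.addresses[req.service]
            if current != req.known_address or now >= req.deadline:
                # Answering a request also removes it from the cache.
                self.responses.append((req.client_id, current))
            else:
                still_waiting.append(req)
        self.pending = still_waiting
```

In this sketch a periodic timer would call `flush(now)`; a request is answered early when `migrate` changes the stored address, and otherwise receives the unchanged first address once its deadline passes.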

Further, for step 103 in the embodiment shown in FIG. 1, in order to ensure that service node migration is determined accurately and that the address of the second service node is usable, in a preferred embodiment of the present invention the client adds request failure information to the address update request it sends. The request failure information is the information fed back to the client by the first service node to indicate that the client's request to the first service node failed. In addition to error information (that is, the failure reason, which may be indicated by an error code), it includes information such as a service version number. In this embodiment, the service version number of a service changes whenever the address of the service node providing that service to the client changes; for example, the service version number increases monotonically with each change of the service node address. That is, each time the service node providing the service migrates and the service node address changes, the version number of the service is increased by one. Therefore, when service version numbers are compared, a larger value indicates a newer service version, and for the client, the service node corresponding to the largest service version number is the one that can actually provide the service.

Based on the added request failure information, determining that the service node providing service for the client has been migrated from the first service node to the second service node specifically includes:

judging whether the service version number of the service stored locally by the master node is greater than the service version number carried in the address update request.

If the local service version number of the main control node is greater than the service version number carried by the request, the service is transferred from the first service node to the second service node, and at this time, the address of the second service node is obtained, and the local service node address corresponding to the service is updated.

If the local service version number of the main control node is equal to or less than the service version number carried by the request, the service is not currently migrated to the second service node, and at this time, the address of the service node corresponding to the local service does not need to be updated.
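The version comparison described above can be sketched as follows. This is only an illustration under stated assumptions: `ServiceRecord`, `migrate_to`, and `migrated_since` are hypothetical names, and the version number is assumed to be an integer that increases by one per migration, as in the example given in the text.

```python
from dataclasses import dataclass


@dataclass
class ServiceRecord:
    """What the master node stores locally for one service (hypothetical layout)."""
    address: str
    version: int = 1

    def migrate_to(self, new_address: str) -> None:
        # Each migration changes the address and bumps the version by one.
        self.address = new_address
        self.version += 1


def migrated_since(record: ServiceRecord, request_version: int) -> bool:
    """True only if the local version is strictly greater than the one in the request."""
    return record.version > request_version
```

A request carrying version 1 is answered with the new address only after `migrate_to` has raised the local version to 2; an equal or smaller local version means the request stays cached.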

Further, in the above embodiment, the service version number carried in the request failure information is used to verify whether the service node has been migrated. As for the error information carried in the request failure information, in another preferred embodiment of the present invention it may be used to count failures against the first service node, so as to actively trigger migration of the service provided by that service node, specifically:

First, the number of clients whose requests for the service from the first service node have failed is determined according to the request failure information.

Since the master node serves all the clients in the distributed system, within a certain period it may receive address update requests from multiple clients whose requests for the same service on the first service node have failed. The master node therefore collects statistics on the request failure information carried in these address update requests to determine the number of such clients.

It should be noted that the request failure information is filtered during this counting, because a client may fail to access the first service node either because the network is unstable or because the first service node itself has failed or cannot provide the corresponding service. An access failure caused by a network problem may be temporary and does not require the service to be migrated, so request failure information of that type is filtered out, leaving only request failure information of a specified type; the specified type may be a single type of request failure information or a set of several types. That is, the clients that are counted are those whose requests failed because the first service node could not respond to the request normally.

In addition, the counted number of clients refers to the number of clients that failed to request the same service provided in the same service node.

Secondly, when the number of the clients is larger than a threshold value, a scheduling request of the migration service is triggered.

The threshold is an experience value set by a user, and can be set differently according to the application scene of the distributed system.
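The counting and threshold check can be sketched in Python as below. The error codes in `NODE_FAILURE_CODES` and the class name `FailureCounter` are invented for illustration; the source does not name specific error codes or data structures.

```python
# Hypothetical error codes that indicate the node itself cannot serve the request.
# Anything else (for example a network timeout) is filtered out and not counted.
NODE_FAILURE_CODES = {"NODE_UNAVAILABLE", "SERVICE_NOT_LOADED"}


class FailureCounter:
    def __init__(self, threshold: int):
        self.threshold = threshold
        # (node, service) -> set of distinct client ids that reported a failure
        self.clients = {}

    def report(self, client_id: str, node: str, service: str, error_code: str) -> bool:
        """Record one failure report; return True if migration should be triggered."""
        if error_code not in NODE_FAILURE_CODES:
            return False  # filtered: possibly a temporary network problem
        failed = self.clients.setdefault((node, service), set())
        failed.add(client_id)  # a set, so repeated reports count once per client
        return len(failed) > self.threshold
```

Counting distinct clients per (node, service) pair follows the statement that the count refers to clients that failed to request the same service on the same service node.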

In addition, the master node may also determine through heartbeat detection whether the first service node has failed, that is, whether the service node is still able to provide the service. If the first service node has failed, that is, no heartbeat has been received from the first service node for a long time, a scheduling request for migrating the services provided by the first service node is triggered, and those services are migrated to the second service node.
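A minimal heartbeat check along these lines might look like the following sketch. The name `HeartbeatMonitor` and the `max_silence` parameter are assumptions; the source only says that no heartbeat is received "for a long time" and does not specify a value.

```python
class HeartbeatMonitor:
    def __init__(self, max_silence: float):
        # Longest allowed gap between heartbeats before a node counts as failed.
        self.max_silence = max_silence
        self.last_seen = {}  # node name -> time of the most recent heartbeat

    def beat(self, node: str, now: float) -> None:
        """Record a heartbeat received from a service node."""
        self.last_seen[node] = now

    def failed_nodes(self, now: float) -> list:
        """Nodes whose last heartbeat is older than max_silence, sorted by name."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.max_silence
        )
```

Each node returned by `failed_nodes` would then be the subject of a scheduling request that migrates its services elsewhere.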

Further, in view of the two ways in which the master node actively triggers the scheduling request for migrating the service, in the embodiment of the present invention, after the master node sends the scheduling request, the master node further monitors whether the first service node completes migration of the service, and if the migration is successful, the locally stored service version number corresponding to the service is updated, so that the master node can monitor that the service is migrated in time and notify the client of the migration, that is, respond to the corresponding address update request in the cache. Correspondingly, in the second service node after the migration is completed, the service version number of the service is also updated synchronously.

After updating the locally stored service version number of the service, the master node uses the updated service record and its service version number to respond to the cached address update requests that request the same service. That is, the master node compares its locally stored service version number with the service version number carried in each cached address update request, and if the locally stored service version number is greater, it returns the updated second service node address directly to the corresponding client.

Further, in another preferred embodiment of the present invention, the client need not carry the timeout in each address update request; instead, the timeout is configured in advance. That is, the master node maintains a data table of timeouts in which the timeout set by each client in the distributed system is recorded. When the master node receives an address update request sent by a client, it can look up the timeout corresponding to the client identifier in the data table and cache the address update request using that timeout.

In addition, the master node may also periodically scan the address update request in the cache according to a preset period to determine whether the address update request reaches the corresponding timeout time, and if the address update request reaches the timeout time and there is no address of the second service node, delete the address update request from the cache and feed back the address of the first service node to the corresponding client.
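The timeout table lookup and the periodic scan can be sketched as follows. The function names, the table layout, and the fallback to `DEFAULT_TIMEOUT` for a client that has no entry are assumptions made for this example; the source does not say what happens when a client is missing from the table.

```python
DEFAULT_TIMEOUT = 5.0  # assumed fallback in seconds; not specified in the source


def deadline_for(client_id, received_at, table, default=DEFAULT_TIMEOUT):
    """Look up the client's timeout in the data table and compute its deadline."""
    return received_at + table.get(client_id, default)


def scan_expired(cache, now):
    """One pass of the periodic scan.

    cache maps a request id to its deadline. Expired requests are removed from
    the cache and their ids returned, so that the caller can feed back the
    first service node address to the corresponding clients.
    """
    expired = [rid for rid, deadline in cache.items() if now >= deadline]
    for rid in expired:
        del cache[rid]
    return expired
```

The scan period bounds how late an expired request is answered, so it would normally be chosen to be much shorter than the smallest configured timeout.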

2) For the client in the distributed system, an embodiment of the present invention further provides a management method for the distributed system, which includes the specific steps shown in fig. 2, where the method includes:

step 201, according to the request failure information fed back by the first service node, an address update request is sent to the master control node.

The address updating request is used for acquiring a service node address for providing service for the client.

In this step, after the client fails to request the first service node for service, request failure information fed back by the first service node is received, where the request failure information at least includes information such as a service version number, which is the same as the request failure information in the embodiment shown in fig. 1.

Step 202, receiving response information of the address update request.

The response information includes a service node address corresponding to the requested service locally stored by the main control node.

The address update request is processed by the master node as described in the embodiment shown in FIG. 1. The client then receives the response information fed back by the master node, which includes the service node address corresponding to the service as stored in the master node. This stored address is not fixed but is updated as the service is migrated. Therefore, if the client cannot obtain the service from the first service node, it can query the master node to determine whether the service node providing the service has changed, that is, whether a second service node address exists.

As can be seen from the embodiment shown in fig. 1, the response information fed back by the master node includes a service node address, and the service node address may be a second service node address or an original address (a first service node address), so that the embodiment of the present invention determines a specific operation of repairing an access failure by a client by identifying the service node address in the response information, specifically: when the service node address in the response message is the second service node address, executing step 204; if the address is the first service node address, go to step 203.

Step 203, if the service node address in the response message is the first service node address, an address update request is sent to the master control node.

In this step, when the obtained service node address is still the first service node address, it indicates that the migration of the service is not found in the master node, and at this time, the client sends an address update request to the master node again. It should be noted that the interval duration between the time of sending the address update request and the time of last sending should be longer than the timeout time, and compared with the existing mode of polling according to a certain frequency cycle, the timeout time in the embodiment of the present invention is generally longer than the polling cycle, so for the client, this mode can reduce the number of address update requests sent to the main control node, and also reduce the processing load of a large number of invalid requests on the main control node.

Step 204, if the service node address in the response message is the second service node address, requesting the service from the second service node according to the second service node address.

When a different service node address is obtained, it indicates that the master node has updated the service node address corresponding to the service, that is, the service has been migrated to the second service node. At this point the client can access the service using the second service node address, that is, the client requests the service from the second service node.

Through the steps, the specific fault repairing process of the client of the distributed system provided by the invention when the access fault occurs is described in detail, the fault repairing speed can be greatly improved through the process, and the invalid requests sent to the main control node are reduced.
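The client-side flow of steps 201 to 204 can be sketched as a small loop. Here `ask_master` is a hypothetical callable standing in for the RPC to the master node; it is assumed to block until the master node answers (at most the timeout). The `max_rounds` bound is an addition for the sake of a terminating example and is not part of the described method.

```python
def repair_address(ask_master, service, first_address, max_rounds=3):
    """Return the second service node address, or None if none appeared in time.

    ask_master(service, known_address) -> address stored by the master node.
    """
    for _ in range(max_rounds):
        # Steps 201 and 202: send the address update request and wait for the reply.
        address = ask_master(service, first_address)
        if address != first_address:
            # Step 204: a different address means the service was migrated.
            return address
        # Step 203: still the first address, so send the request again.
    return None
```

Because each call to `ask_master` already waits on the master node side, consecutive requests from this loop are naturally spaced by about one timeout rather than by a short polling interval.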

Further, based on the embodiment shown in FIG. 2, in order to reduce the pressure on the master node from processing address update requests, a check may be performed before an address update request is sent to the master node. In this embodiment, when a service is migrated from the first service node to another service node, the address of that other service node may also be stored in the first service node; when the client then requests the service from the first service node, the address of the other service node is added to the request failure information and fed back to the client. Therefore, on receiving the request failure information fed back by the first service node, the client can first judge whether the request failure information contains the address of a third service node (the other service node). If it does, the client can request the service from the third service node directly according to that address, without triggering an address update request to the master node. If it does not, the service may be unavailable because the first service node has failed, or because the service is still in the process of being migrated. In other words, the reason for the access failure cannot be determined at this point, so the client triggers an address update request to the master node, that is, it performs step 201 described above. The third service node may also be the second service node.
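The decision the client makes on receiving request failure information can be sketched as below. The dictionary keys `redirect_address` and `service_version` and the action labels are hypothetical; the source does not define a wire format for the request failure information.

```python
def next_action(failure_info: dict):
    """Decide what the client does after a failed request to the first service node.

    Returns a (action, argument) pair:
      ("request_service", address)  - go straight to the third service node
      ("ask_master", version)       - fall back to an address update request
    """
    redirect = failure_info.get("redirect_address")
    if redirect:
        # The first service node knows where the service went.
        return ("request_service", redirect)
    # Cause unknown (node failure or migration in progress): ask the master node.
    return ("ask_master", failure_info.get("service_version"))
```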

Further, based on the embodiment shown in FIG. 2, in order to verify more accurately whether the service node address corresponding to the service has been updated to the second service node address, in another preferred embodiment of the present invention the client may add the service version number of the requested service to the address update request when triggering it. The master node can then check the service version number it stores for the service against the version number in the request, that is, determine whether its stored version is newer and therefore the currently valid one. The specific verification process has been described in detail in the above embodiment of the master node and is not repeated here.

Further, a client may issue concurrent requests for multiple services provided by the same service node, and all of them may fail; in that case the client would send multiple concurrent address update requests to the master node. Since the number of clients in a distributed system is large, in order to reduce the burden on the master node of processing concurrent requests, the present invention sets a first threshold that limits the number of concurrent address update requests a client may issue for the same service node. For example, suppose service node A provides 5 services and the client accesses all 5 simultaneously, 4 of which fail. If the first threshold is set to 2, the client sends only 2 address update requests to the master node at first; after one of them completes, it sends the 3rd address update request, and so on, until all 4 address update requests have been sent and the access failures have been repaired. In other words, the upper limit on the number of concurrent requests from the client to the master node is 2.
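The first-threshold behaviour in this example can be sketched as a small limiter kept by the client for one service node. The class and method names are hypothetical, and a real client would send RPCs where this sketch only tracks which services have a request in flight.

```python
from collections import deque


class PerNodeLimiter:
    """Caps concurrent address update requests for one service node."""

    def __init__(self, first_threshold: int):
        self.limit = first_threshold
        self.in_flight = set()   # services with an outstanding request
        self.waiting = deque()   # services whose request is held back

    def submit(self, service: str) -> bool:
        """Return True if the request is sent now, False if it is queued."""
        if len(self.in_flight) < self.limit:
            self.in_flight.add(service)
            return True
        self.waiting.append(service)
        return False

    def complete(self, service: str):
        """Mark a request finished; return the queued service sent next, if any."""
        self.in_flight.discard(service)
        if self.waiting:
            nxt = self.waiting.popleft()
            self.in_flight.add(nxt)
            return nxt
        return None
```

With a first threshold of 2 and four failed services, two requests go out immediately and the remaining two are released one at a time as earlier requests complete, so at most 2 are ever in flight.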

In addition to limiting the number of concurrent address update requests triggered by the same service node, the total number of concurrent requests is also limited. That is, when clients access multiple service nodes concurrently and the accesses fail, the total number of concurrent address update requests from all clients to the master node is limited to a second threshold. Both the second threshold and the first threshold can be set in a self-defined manner. For example, assume that the second threshold is 10 and that the master node is currently processing 7 address update requests. When 5 concurrent accesses to service node A fail, the client first needs to query how many more address update requests the master node can accept, which is 3, so only 3 of the 5 address update requests can be sent to the master node at this time.
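The arithmetic of this example can be written out as a small helper. The function name and parameters are hypothetical; the sketch only shows how the remaining capacity under the second threshold bounds the number of requests that can be sent.

```python
def sendable_now(pending: int, in_flight_total: int, second_threshold: int) -> int:
    """Number of pending address update requests that may be sent immediately."""
    remaining = max(0, second_threshold - in_flight_total)  # capacity left at the master node
    return min(pending, remaining)


# Second threshold 10, 7 requests already being processed, 5 new failures on node A.
allowed = sendable_now(pending=5, in_flight_total=7, second_threshold=10)
```

Here `allowed` is 3, matching the example: the other 2 requests wait until capacity is released.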

Further, based on the above flow control for concurrent address update requests of the client, in order to prevent address update requests corresponding to a high-traffic service node from crowding out those corresponding to a low-traffic service node, the embodiment of the present invention further provides a flow control method that allows each service node to send at least one address update request. For example, service node A is a high-traffic node with 4 concurrent failed requests; with the first threshold set to 2, it is limited to 2 concurrent address update requests. Service node B is a low-traffic node with only 1 failed request. If the second threshold is 2, the address update request of node B is blocked. According to this flow control principle, after the master node completes one address update request, the address update request corresponding to node B is processed preferentially, rather than only after all the address update requests of node A have been processed.
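One possible way to realize this preference is to choose, whenever a slot is released, the waiting service node with the fewest requests in flight. The sketch below assumes that policy; the function name and the dictionary-based bookkeeping are hypothetical, and the caller is assumed to invoke it only when the second threshold leaves a free slot.

```python
from typing import Dict, Optional


def pick_next(pending: Dict[str, int], in_flight: Dict[str, int],
              first_threshold: int) -> Optional[str]:
    """Pick the service node whose address update request is sent next."""
    candidates = [
        node for node, count in pending.items()
        if count > 0 and in_flight.get(node, 0) < first_threshold
    ]
    if not candidates:
        return None
    # A node with nothing in flight is served first, so a low-traffic node is not starved.
    return min(candidates, key=lambda node: in_flight.get(node, 0))


# Node A: 2 requests sent, 2 still waiting; one of A's requests has just completed.
# Node B: 1 request waiting, none sent yet.
chosen = pick_next({"A": 2, "B": 1}, {"A": 1, "B": 0}, first_threshold=2)
```

In this scenario `chosen` is node B, so its single request is sent before the remaining requests of node A.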

3) For a service node in a distributed system, an embodiment of the present invention further provides a management method for a distributed system, which includes the specific steps shown in fig. 3, where the method includes:

Step 301, judging whether the service has the access right according to the service request sent by the client.

In a distributed system, different services provided by different service nodes may need to read data at the same storage address. For access-restricted data, only one service may access the data at a time, in order to avoid simultaneous access by too many users. Thus, whether a service request succeeds also depends on whether the service in the service node is authorized for access. In this embodiment, whether a service in the service node has the access right may be determined by whether it has acquired the file lock: the services that need to read the same access-restricted data compete for the lock, and the service that wins the lock has the access right to the data. For the first service node in the above embodiment, it is determined whether its service has acquired the file lock. If so, the first service node has the access right and step 302 is executed; otherwise, step 303 is executed.

Step 302, if so, acquiring the access right data corresponding to the service as the response information of the service request.

Step 303, if not, querying the address of the second service node that has the access right, and adding the address of the second service node to the response information.

The response information in this step is reply information when the first service node fails to respond to the service request.

Therefore, in the embodiment of the invention, the first service node can use the record of the service access right to look up the target service node that holds the right, and thereby track the migration state of the service. When responding to the service request of the client, even if it cannot return the access right data, it can feed back the address of the service node that has the access right, so that the client can access that service node directly without sending an address update request to the master node, which speeds up the repair of the access fault.
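Steps 301 to 303 can be sketched as follows. For illustration only, the file lock is modelled as a dictionary that records which service node currently holds the lock on each resource; a real deployment would use an actual distributed or file-system lock. All names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Response:
    ok: bool
    data: Optional[str] = None              # access right data (step 302)
    redirect_address: Optional[str] = None  # node that holds the lock (step 303)


def handle_request(self_address: str, resource: str,
                   lock_owner: Dict[str, str], store: Dict[str, str]) -> Response:
    """Answer a service request on one service node."""
    owner = lock_owner.get(resource)
    if owner == self_address:
        # Step 302: this node holds the lock, so it may return the data.
        return Response(ok=True, data=store[resource])
    # Step 303: no access right; report the lock holder's address instead (None if unknown).
    return Response(ok=False, redirect_address=owner)


lock_owner = {"orders.db": "node2:8080"}  # assumed record of who holds the lock
store = {"orders.db": "rows"}
from_first = handle_request("node1:8080", "orders.db", lock_owner, store)
from_second = handle_request("node2:8080", "orders.db", lock_owner, store)
```

The response from the first node carries only the redirect address, while the second node, which holds the lock, returns the data.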

Further, in order to ensure that the queried target service node can provide an effective service, in the preferred embodiment of the present invention, the service version number of the service may be determined for verification. The specific query process is as follows:

First, after the address of the second service node is queried, the service version number of the service provided by the second service node is obtained.

Then, it is judged whether the service version number in the second service node is greater than the service version number stored in the first service node. If so, the new address is added to the response information; otherwise, it is not added.

The principle of comparing the service version numbers is already described in the above-mentioned master node and the client, and is not further described here.
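The version check performed by the first service node before it adds the new address might look like the sketch below. The dictionary keys and the function name are hypothetical; the only rule taken from the description is that the address is added only when the second service node's version number is greater.

```python
from typing import Any, Dict


def build_failure_response(local_version: int, remote_version: int,
                           remote_address: str) -> Dict[str, Any]:
    """Build the response information when the first service node cannot serve the request."""
    response: Dict[str, Any] = {"ok": False}
    if remote_version > local_version:
        # The second service node holds a newer version of the service: advertise it.
        response["redirect_address"] = remote_address
        response["service_version"] = remote_version
    return response
```

When the version numbers are equal, or the remote one is lower, the response carries no address and the client falls back to the master node.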

Further, as an implementation of the method shown in fig. 1, an embodiment of the present invention provides a master control node of a distributed system, which is mainly used to improve a repair speed of an access failure and improve user experience. For convenience of reading, details in the foregoing method embodiment are not repeated in this embodiment, but it should be clear that the main control node in this embodiment can correspondingly implement all the contents in the foregoing method embodiment. As shown in fig. 4, the main control node specifically includes:

a receiving unit 41, configured to receive an address update request sent by a client, where the address update request is used to obtain a service node address for providing a service for the client;

a cache unit 42, configured to cache the address update request;

a response unit 43, configured to, before a timeout time corresponding to the address update request is reached, return an address of a second service node to the client if it is determined that the service node that provides service for the client is migrated from the first service node to the second service node.

Further, the address update request further includes request failure information, where the request failure information is used to indicate that the client has failed to request the first service node;

the request failure information further includes a service version number, and the service version number is changed according to a change of a service node address providing a service for the client.

Further, as shown in fig. 5, the response unit 43 includes:

a judging module 431, configured to judge whether a service version number corresponding to the service stored in the master node is greater than a service version number carried in the address update request;

a determining module 432, configured to determine that the service is migrated to the second service node and obtain the address of the second service node if the judging module 431 judges that the stored service version number is greater.

Further, as shown in fig. 5, the master node further includes:

a counting unit 44, configured to determine, according to request failure information, the number of clients that failed to request the service from the first service node, where the request failure information is generated because the first service node cannot respond to a request;

a triggering unit 45, configured to trigger a scheduling request for migrating the service when the number of clients determined by the counting unit 44 is greater than a threshold, and migrate the service from a first service node to a second service node.
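The cooperation of the counting unit 44 and the triggering unit 45 can be illustrated with the sketch below. It assumes that distinct clients are counted per service and service node; the class name, method name, and client identifiers are hypothetical.

```python
from collections import defaultdict


class FailureCounter:
    """Counts distinct clients whose requests to a service node failed (illustrative)."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.failed_clients = defaultdict(set)  # (service, node) -> set of client ids

    def record(self, client: str, service: str, node: str) -> bool:
        """Record one failure report; return True if migration should be triggered."""
        clients = self.failed_clients[(service, node)]
        clients.add(client)  # repeated reports from the same client count once
        return len(clients) > self.threshold


counter = FailureCounter(threshold=2)
results = [counter.record(c, "svc", "node1") for c in ("c1", "c1", "c2", "c3")]
```

With a threshold of 2, the scheduling request is triggered only by the report from the third distinct client.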

Further, the triggering unit 45 is further configured to determine whether the first service node has a fault by using heartbeat detection; and if so, triggering a scheduling request for transferring the service, and transferring the service from the first service node to the second service node.

Further, as shown in fig. 5, the master node further includes:

an updating unit 46, configured to monitor whether the first service node completes migration of the service according to the scheduling request triggered by the triggering unit 45; and if the migration is successful, updating the service version number corresponding to the service stored in the main control node.

Further, the response unit 43 is further configured to, after the service version number corresponding to the service stored in the master node is updated, compare the updated service version number with the service version number carried by each cached address update request for the same service; and if the updated service version number is greater than the service version number of the address update request, return the address of the second service node to the client.

Further, the response unit 43 is further configured to periodically scan the address update request in the cache according to a preset period, and determine whether the address update request reaches a corresponding timeout; and if so, deleting the address updating request in the cache, and feeding back the address of the first service node to the client.
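The behavior of the cache unit 42, the response unit 43, and the updating unit 46 can be sketched together as below. This is a simplified, single-threaded model under stated assumptions: the version number is assumed to increase by one per migration, time is passed in explicitly, and replies are collected in a list instead of being sent over a network. All class, method, and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class PendingRequest:
    client: str
    service: str
    version: int     # service version number carried in the address update request
    deadline: float  # time at which the cached request times out


class MasterNode:
    def __init__(self, timeout: float):
        self.timeout = timeout
        self.routes: Dict[str, Tuple[str, int]] = {}  # service -> (node address, version)
        self.cache: List[PendingRequest] = []
        self.replies: List[Tuple[str, str]] = []      # (client, address) pairs returned

    def receive(self, client: str, service: str, version: int, now: float) -> None:
        address, current = self.routes[service]
        if current > version:
            self.replies.append((client, address))    # migration already happened
        else:
            # Hold the request instead of answering with a possibly stale address.
            self.cache.append(PendingRequest(client, service, version, now + self.timeout))

    def on_migrated(self, service: str, new_address: str) -> None:
        _, current = self.routes[service]
        new_version = current + 1                     # assumption: increment by one
        self.routes[service] = (new_address, new_version)
        waiting = []
        for req in self.cache:
            if req.service == service and new_version > req.version:
                self.replies.append((req.client, new_address))
            else:
                waiting.append(req)
        self.cache = waiting

    def scan(self, now: float) -> None:
        """Periodic scan: answer timed-out requests with the currently stored address."""
        waiting = []
        for req in self.cache:
            if now >= req.deadline:
                self.replies.append((req.client, self.routes[req.service][0]))
            else:
                waiting.append(req)
        self.cache = waiting
```

Holding the request until either the migration completes or the timeout expires is what lets the client learn the new address as soon as it exists, rather than polling repeatedly.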

Further, as an implementation of the method shown in fig. 2, an embodiment of the present invention provides a client of a distributed system, which is mainly used to improve a repair speed of an access failure and improve a user experience. For convenience of reading, details in the foregoing method embodiment are not repeated in this embodiment, but it should be clear that the client in this embodiment can correspondingly implement all the contents in the foregoing method embodiment. As shown in fig. 6, the client specifically includes:

a first sending unit 51, configured to send an address update request to a master node according to request failure information fed back by a first service node, where the address update request is used to obtain a service node address for providing a service for the client;

a receiving unit 52, configured to receive response information of the address update request, where the response information includes a service node address corresponding to the service stored in the master node;

the first sending unit 51 is further configured to send the address update request to the master node again if the service node address in the response information is the first service node address;

a second sending unit 53, configured to request the service from the second service node according to the second service node address if the service node address in the response information is the second service node address.
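The interplay of the first sending unit 51, the receiving unit 52, and the second sending unit 53 amounts to a retry loop, sketched below. The `ask_master` callable stands in for sending an address update request and receiving its response; the bound `max_attempts` is an assumption added for the sketch, since the embodiment does not state a retry limit.

```python
from typing import Callable, Optional


def resolve_new_address(ask_master: Callable[[], str], first_address: str,
                        max_attempts: int = 3) -> Optional[str]:
    """Repeat the address update request until the master node returns a new address."""
    for _ in range(max_attempts):
        address = ask_master()
        if address != first_address:
            return address  # the service has moved: request it at this address
        # Still the first service node's address: the migration is not done, ask again.
    return None


answers = iter(["node1", "node1", "node2"])  # simulated responses from the master node
resolved = resolve_new_address(lambda: next(answers), first_address="node1")
```

In this simulated run the third response carries the second service node's address, which the client then uses to request the service.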

Further, as shown in fig. 7, the client further includes:

a determining unit 54, configured to determine whether a third service node address exists in the request failure information before the first sending unit 51 sends an address update request to the master node;

a service request unit 55, configured to request the service from a third service node according to the third service node address if the third service node address exists;

the first sending unit 51 is further configured to send an address update request to the master node if the third service node address does not exist.

Further, as shown in fig. 7, the request failure information includes a service version number, where the service version number changes according to a change of a service node address providing a service for the client, and the client further includes:

an adding unit 56, configured to add the request failure information to the address update request.

Further, as shown in fig. 7, the client further includes:

a concurrency control unit 57, configured to, when multiple pieces of concurrent request failure information exist for the same service node, control the number of address update requests sent to the master control node for that service node to be smaller than a first threshold; and control the total number of address update requests sent concurrently to the master control node for a plurality of services to be smaller than a second threshold.

Further, the concurrency control unit 57 is further configured to, when controlling the number of address update requests sent to the master node, allow each service node to send at least one address update request.

Further, after the service in the first service node is migrated to a second service node, the service request unit 55 is further configured to send a service request to the first service node, where the first service node determines whether the service has a service access right according to the service request, and if the service has the service access right, the access right data corresponding to the service is acquired as response information of the service request, otherwise, an address of the second service node having the access right is queried, and the address of the second service node is added to the response information;

the receiving unit 52 is further configured to receive the response information fed back by the first service node.

Further, the service request unit 55 is further configured to determine whether a service version number corresponding to a second service node included in the response information is greater than a service version number of the locally stored service; and if so, requesting the service from the second service node according to the address of the second service node.

In summary, an embodiment of the present invention further provides a distributed system. As shown in fig. 8, the system includes a master node (Master), a plurality of service nodes (Server), and at least one client (Client), which are configured to execute the steps of the methods described above for the master node, the service node, and the client, respectively:

when the client fails to access the service provided by the first service node, the client judges whether the address of a second service node exists in the request failure information fed back by the first service node; if it exists, the client sends a service request to the second service node; if it does not exist, the client sends an address update request to the master node;

after receiving the address update request, the master node caches it; before the timeout of the address update request is reached, if the master node detects that the service has been migrated from the first service node to a second service node, it feeds back the address of the second service node to the client; if no migration is detected, it feeds back the address of the first service node to the client when the timeout is reached;

and when the first service node does not have the access right, the address of the target service node with the access right is obtained and fed back to the client.

In addition, respective work flows of the master node, the service node, and the client are also specifically described in fig. 1 to 3, which are also applicable to the distributed system, and the detailed description thereof is not repeated here.

Further, according to the distributed system, an embodiment of the present invention further provides a management apparatus, which is disposed in a master control node of the distributed system, and includes: a memory for storing a computer program and a processor; the processor is adapted to perform the management method as described in fig. 1-3 when the computer program is invoked.

Further, an embodiment of the present invention also provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the management method as described in fig. 1 to 3.

In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

It will be appreciated that the relevant features of the method and apparatus described above are referred to one another. In addition, "first", "second", and the like in the above embodiments are for distinguishing the embodiments, and do not represent merits of the embodiments.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.

In addition, the memory may include volatile memory in a computer readable medium, Random Access Memory (RAM) and/or nonvolatile memory such as Read Only Memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.

Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal or a carrier wave.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims

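1. A management method of a distributed system, characterized in that the method is applied to a main control node of the distributed system and comprises the following steps:

receiving an address updating request sent by a client, wherein the address updating request is used for acquiring a service node address for providing service for the client;

caching the address updating request;

before the timeout time corresponding to the address updating request is reached, if it is determined that the service node providing service for the client is migrated from a first service node to a second service node, returning the address of the second service node to the client.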

2. The method according to claim 1, wherein the address update request further includes request failure information, and the request failure information is used to indicate that the client has failed to request the first service node;

the request failure information includes a service version number, which is changed according to a change in an address of a service node providing a service for the client.

3. The method of claim 2, wherein determining that the service node serving the client is migrated from a first service node to a second service node comprises:

judging whether the service version number corresponding to the service stored by the master control node is greater than the service version number carried in the address updating request;

and if so, determining that the service is migrated to the second service node, and acquiring the address of the second service node.

4. The method of claim 2, further comprising:

determining the number of clients which fail to request the service from the first service node according to request failure information, wherein the request failure information is generated because the first service node cannot respond to the request;

and when the number of the clients is larger than a threshold value, triggering a scheduling request for transferring the service, and transferring the service from a first service node to a second service node.

5. The method of claim 1, further comprising:

judging whether the first service node has a fault by using heartbeat detection;

and if so, triggering a scheduling request for transferring the service, and transferring the service from the first service node to the second service node.

6. The method according to claim 4 or 5, characterized in that the method further comprises:

monitoring whether the first service node completes the migration of the service or not according to the scheduling request;

and if the migration is successful, updating the service version number corresponding to the service stored by the main control node.

7. The method of claim 6, wherein after updating the service version number corresponding to the service stored by the master node, the method further comprises:

comparing the updated service version number with a service version number corresponding to an address update request requesting the same service in the cache;

and if the updated service version number is greater than the service version number corresponding to the address updating request, returning the address of the second service node to the client.

8. The method of claim 1, further comprising:

scanning an address updating request in a cache according to a preset period, and judging whether the address updating request reaches corresponding timeout time;

and if so, deleting the address updating request in the cache, and feeding back the address of the first service node to the client.

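9. A management method of a distributed system, characterized in that the method is applied to a client and comprises the following steps:

sending an address updating request to a main control node according to request failure information fed back by a first service node, wherein the address updating request is used for acquiring a service node address for providing service for the client;

receiving response information of the address updating request, wherein the response information contains a service node address corresponding to the service stored by the main control node;

if the service node address in the response information is the first service node address, sending the address updating request to the main control node again;

and if the service node address in the response information is a second service node address, requesting the service from the second service node according to the second service node address.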

10. The method of claim 9, wherein before sending the address update request to the master node, the method further comprises:

judging whether a third service node address exists in the request failure information or not;

if the third service node address exists, the service is requested to the third service node according to the third service node address;

and if the address does not exist, the operation of sending the address updating request to the main control node is executed.

11. The method of claim 9, wherein the request failure information includes a service version number, and wherein the service version number changes according to a change in an address of a service node that provides a service for the client, the method further comprising:

and adding the request failure information to the address updating request.

12. The method of claim 9, further comprising:

when a plurality of concurrent request failure messages exist for the same service node, controlling the number of address updating requests for sending the service to the main control node to be smaller than a first threshold value;

and controlling the total number of the address updating requests which are sent to the main control node by a plurality of services concurrently to be smaller than a second threshold value.

13. The method of claim 12, further comprising:

when controlling the number of address updating requests sent to the main control node, allowing each service node to send at least one address updating request.

14. The method of claim 9, further comprising:

sending a service request to a first service node, wherein the first service node judges whether the service access authority exists or not according to the service request, if so, the access authority data corresponding to the service is acquired as response information of the service request, otherwise, the address of a second service node with the access authority is inquired, and the address of the second service node is added to the response information;

receiving the response information fed back by the first service node.

15. The method of claim 14, further comprising:

judging whether the service version number corresponding to the second service node contained in the response information is larger than the service version number of the service stored by the client;

and if so, requesting the service from the second service node according to the address of the second service node.

16. A distributed system comprising a master node, a plurality of service nodes, and at least one client, wherein the master node performs the management method of any one of claims 1-8, and the client performs the management method of any one of claims 9-15.

17. An apparatus for managing a distributed system, comprising: a memory for storing a computer program and a processor; the processor is adapted to execute the management method according to any of claims 1-8 when invoking the computer program.

18. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the management method according to any one of claims 1 to 8, or carries out the management method according to any one of claims 9 to 15.