CN115202591A

CN115202591A - Storage device, method and storage medium of distributed database system

Info

Publication number: CN115202591A
Application number: CN202211127361.3A
Authority: CN
Inventors: 雷昱; 齐洁
Original assignee: Xiamen University
Current assignee: Xiamen University
Priority date: 2022-09-16
Filing date: 2022-09-16
Publication date: 2022-10-18
Anticipated expiration: 2042-09-16
Also published as: CN115202591B

Abstract

The invention provides a storage device, a method and a storage medium of a distributed database system, wherein the device comprises: the receiving unit is used for receiving data to be written in the distributed database system and determining the size of the data; the determining unit is used for selecting a first target storage node from the N storage nodes based on a load balancing strategy; the prediction unit predicts a second target storage node from the N storage nodes based on the size of the data to be written by using the trained graph neural network; and the writing unit writes the data to be written based on the first target storage node and the second target storage node. According to the invention, an Artificial Intelligence (AI) mode is adopted to select a proper target writing node based on the size of the written data block, and a more reasonable writing node is selected based on a certain rule between the target writing node selected by AI and the target node selected by a general load balancing strategy, so that the overall writing performance of the distributed database system is improved.

Description

Storage device, method and storage medium of distributed database system

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to a storage device, a storage method and a storage medium of a distributed database system.

Background

Artificial intelligence is the subject of research on making computer to simulate some human thinking process and intelligent behavior (such as learning, reasoning, thinking, planning, etc.), and mainly includes the principle of computer to implement intelligence and the manufacture of computer similar to human brain intelligence to make computer implement higher-level application. Artificial intelligence will relate to computer science, psychology, philosophy and linguistics. The artificial intelligence is in the technical application level of thinking science, is an application branch of the artificial intelligence, and the application range of the artificial intelligence is wider and wider along with the development of the technology.

In the prior art, when data is written into each storage node, the data is generally written into the storage node by adopting a traditional scheduling strategy (namely load balancing), but due to the defect of a scheduling algorithm, the data can be written into some storage nodes all the time, namely, the existing algorithm is not intelligent enough, so that the problem of unbalance of utilization of the storage nodes is caused.

Disclosure of Invention

The present invention proposes the following technical solutions to address one or more technical defects in the prior art.

A storage apparatus of a distributed database system, the distributed database system comprising N storage nodes, the apparatus comprising:

the receiving unit is used for receiving data to be written in the distributed database system and determining the size of the data to be written;

the determining unit is used for selecting one storage node from the N storage nodes as a first target storage node based on a load balancing strategy;

the prediction unit predicts one storage node from the N storage nodes as a second target storage node by using the trained graph neural network;

the writing unit is used for writing the data to be written based on the first target storage node and the second target storage node;

wherein N is more than or equal to 2;

the distributed database system further comprises a scheduling server, the scheduling server writes the data to be written into corresponding storage nodes, the scheduling server is connected with each storage node, and the graph of the graph neural network is formed in the following mode: the scheduling server and the N storage nodes are used as graph nodes, the connection between the scheduling server and the N storage nodes is used as an edge of a graph, the weight of the edge is determined based on the bandwidth between the scheduling server and the corresponding storage node, and the characteristic value of the graph node is determined based on the processing capacity and the storage residual space of the storage node;

the way of determining the weights of the edges of the graph and the eigenvalues of the graph nodes is: normalizing the bandwidth between the scheduling server and the corresponding storage node, the processing capacity of the storage node and the storage residual space to obtain a bandwidth normalization value W between the scheduling server and the corresponding storage node _i And the processing capacity normalization value P of the storage node _i And storing the remainder space normalization value M _i Calculating the Weight of the edge of the graph _i ：

；

Calculating a characteristic value C of a graph node _i ，

；

Wherein i represents the ith storage node and α represents

Beta represents P _i Weight of (a), gamma denotes

The weight of (c).

Further, the writing unit operates to: and judging whether the ID1 of the first target storage node is the same as the ID2 of the second target storage node, if so, writing the data to be written into the storage node corresponding to the ID1, if not, judging whether the priority Pr1 of the first target storage node is greater than or equal to the priority Pr2 of the second target storage node, if so, writing the data to be written into the storage node corresponding to the ID1, and if not, writing the data to be written into the storage node corresponding to the ID 2.

Still further, the load balancing policy is a random selection policy, a polling policy, or a source address hashing policy.

Further, the training of the graph neural network is obtained through historical data, and the historical data comprises the size of a written data block, the identification of a written storage node, the bandwidth of a scheduling node to the node, the processing capacity and the residual storage space of the node and the like.

Furthermore, the priority of the storage node is determined based on the reliability of the storage node and the spatial position of the storage node relative to the scheduling node, the priority of the storage node is updated at regular time, and the priority of the storage node is calculated in a manner that:

wherein the content of the first and second substances,

indicates the priority of the ith storage node, T _i Representing the reliability of the ith storage node, D _i Representing the ith storage nodeRelative to the spatial position of the scheduling node, and when D _i <1 hour, let D _i And =1, δ corresponds to the weight of the reliability of the ith storage node as a whole, and ∈ specifically corresponds to the weight of the spatial position of the scheduling node where the ith storage node is located.

Further, the spatial location of the storage node relative to the scheduling node may be represented using the distance of the storage node relative to the scheduling node.

The invention also provides a storage method of the distributed database system, wherein the distributed database system comprises N storage nodes, and the method comprises the following steps:

a receiving step, namely receiving data to be written in the distributed database system and determining the size of the data to be written;

a determining step, namely selecting one storage node from the N storage nodes as a first target storage node based on a load balancing strategy;

predicting, namely predicting a storage node from the N storage nodes by using the trained graph neural network to serve as a second target storage node;

a writing step, writing the data to be written based on the first target storage node and the second target storage node;

wherein N is more than or equal to 2;

the distributed database system further comprises a scheduling server, the scheduling server writes the data to be written into corresponding storage nodes, the scheduling server is connected with each storage node, and the graph of the graph neural network is formed in the following manner: the method comprises the following steps that the scheduling server and the N storage nodes are used as graph nodes, the connection between the scheduling server and the N storage nodes is used as an edge of a graph, the weight of the edge is determined based on the bandwidth between the scheduling server and the corresponding storage node, the characteristic value of the graph node is determined based on the processing capacity and the storage residual space of the storage node, and the weight of the edge of the graph and the characteristic value of the graph node are determined in the following mode: normalizing the bandwidth between the dispatch server and the corresponding storage node and the processing of the storage nodeObtaining the bandwidth normalization value W between the scheduling server and the corresponding storage node through the capacity and the storage residual space _i And the processing capacity normalization value P of the storage node _i And storing the remainder space normalization value M _i Calculating the Weight of the edge of the graph _i ：

；

Calculating a characteristic value C of a graph node _i ，

；

Wherein i represents the ith storage node and α represents

Beta represents P _i Weight of (a), gamma denotes

The weight of (c).

Further, the writing step operates as: and judging whether the ID1 of the first target storage node is the same as the ID2 of the second target storage node, if so, writing the data to be written into the storage node corresponding to the ID1, if not, judging whether the priority Pr1 of the first target storage node is greater than or equal to the priority Pr2 of the second target storage node, if so, writing the data to be written into the storage node corresponding to the ID1, and if not, writing the data to be written into the storage node corresponding to the ID 2.

Further, the training of the graph neural network is obtained through historical data, wherein the historical data comprises the size of a written data block, the identification of a written storage node, the bandwidth of a scheduling node to the node, the processing capacity and the residual storage space of the node and the like.

wherein, the first and the second end of the pipe are connected with each other,

indicates the priority of the ith storage node, T _i Representing the reliability of the ith storage node, D _i Represents the spatial position of the ith storage node relative to the scheduling node, and when D _i <1 hour, let D _i And =1, δ corresponds to the weight of the reliability of the ith storage node as a whole, and ∈ specifically corresponds to the weight of the spatial position of the scheduling node where the ith storage node is located.

The present invention also proposes a computer-readable storage medium having stored thereon computer program code which, when executed by a computer, performs the method of any of the above.

The invention has the technical effects that: a storage device, method and storage medium of a distributed database system, the device comprising: the receiving unit is used for receiving data to be written in the distributed database system and determining the size of the data to be written; the determining unit is used for selecting one storage node from the N storage nodes as a first target storage node based on a load balancing strategy; the prediction unit predicts one storage node from the N storage nodes as a second target storage node by using the trained graph neural network; and the writing unit writes the data to be written based on the first target storage node and the second target storage node. According to the invention, an Artificial Intelligence (AI) mode is adopted to select a proper target writing node based on the size of a written data block, and a more reasonable writing node is selected based on a certain rule based on the target writing node selected by the AI and the target node selected by a general load balancing strategy, so that the overall writing performance of the distributed database system is improved, and the technical problem of unbalanced writing of the nodes of the distributed database system is solved; in the invention, whether a target node predicted by artificial intelligence and a target writing node selected by a load balancing strategy are the same node or not is compared, if the target node and the target writing node are not the same node, the priority of the target node and the target writing node is compared, and data is written into the node with high priority, namely, the invention provides a node writing mode based on artificial intelligence and priority judgment, thereby further improving the data writing performance; in the invention, based on the actual situation of the distributed database system, a graph neural network is constructed to predict target write-in nodes based on the size of a data block to be written, an initial calculation method of edge weights and graph node characteristic values of the graph neural network is provided, and the constructed graph neural network can well predict the target write-in nodes through simulation calculation so as to balance the write-in data of the distributed database system. In the invention, the AI technology and the traditional selection technology are combined in a priority calculation mode, therefore, the invention provides a storage node priority calculation mode which is determined according to the reliability of the storage node and the spatial position of the storage node relative to the scheduling node and is mainly determined based on the reliability, and the spatial position is taken as an influence factor, so that the spatial position adopts logarithmic calculation to reduce the influence.

Drawings

Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings.

Fig. 1 is a flow chart of a storage method of a distributed database system according to an embodiment of the present invention.

Fig. 2 is a structural diagram of a storage apparatus of a distributed database system according to an embodiment of the present invention.

Detailed Description

The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.

Fig. 1 illustrates a storage method of a distributed database system according to the present invention, where the distributed database system includes N storage nodes, that is, the distributed database system is composed of a plurality of storage nodes, and the method includes:

a receiving step S101, receiving data to be written in the distributed database system, and determining the size of the data to be written;

determining step S102, selecting one storage node from the N storage nodes as a first target storage node based on a load balancing strategy; in the present invention, the load balancing policy is a random selection policy, a polling policy or a source address hash policy, which are relatively mature load balancing policies and are not described again.

A predicting step S103, predicting a storage node from the N storage nodes by using the trained neural network as a second target storage node;

a writing step S104, writing the data to be written based on the first target storage node and the second target storage node; wherein N is more than or equal to 2.

Generally speaking, the method of the invention can be operated on a scheduling server or a load balancing server, and the important inventive concept of the invention is to select a proper target writing node based on the size of a written data block by adopting an Artificial Intelligence (AI) mode, and select a more reasonable writing node based on a certain rule based on the target writing node selected by the AI and the target node selected by a general load balancing strategy, thereby improving the overall writing performance of the distributed database system and solving the technical problem of unbalanced node writing of the distributed database system, which is one of the important invention points of the invention.

In one embodiment, the writing step S104 is performed by: and judging whether the ID1 of the first target storage node is the same as the ID2 of the second target storage node, if so, writing the data to be written into the storage node corresponding to the ID1, if not, judging whether the priority Pr1 of the first target storage node is greater than or equal to the priority Pr2 of the second target storage node, if so, writing the data to be written into the storage node corresponding to the ID1, and if not, writing the data to be written into the storage node corresponding to the ID 2.

In the invention, whether the target node predicted by artificial intelligence and the target writing node selected by the load balancing strategy are the same node or not is compared, if the target node and the target writing node are not the same node, the priority of the target node and the target writing node is compared, and data is written into the node with high priority, namely, the invention provides a node writing mode based on artificial intelligence and priority judgment, thereby further improving the data writing performance, which is another important invention point of the invention.

In one embodiment, the distributed database system of the present invention further includes a scheduling server, where the scheduling server writes the data to be written into corresponding storage nodes, and the scheduling server is connected to each storage node, and a graph of the graph neural network is formed in a manner that: and taking the scheduling server and the N storage nodes as graph nodes, taking the connection between the scheduling server and the N storage nodes as the edges of the graph, determining the weight of the edges based on the bandwidth between the scheduling server and the corresponding storage nodes, and determining the characteristic value of the graph node based on the processing capacity and the storage residual space of the storage node.

In one embodiment, the way to determine the weights of the edges of the graph and the eigenvalues of the graph nodes is: normalizing the bandwidth between the scheduling server and the corresponding storage node, the processing capacity and the storage residual space of the storage node to obtain a bandwidth normalization value W between the scheduling server and the corresponding storage node _i And the processing capacity normalization value P of the storage node _i And storing the remainder space normalization value M _i Calculating the Weight of the edge of the graph _i ：

；

Calculating a characteristic value C of a graph node _i ，

；

Wherein i represents the ith storage node and alpha represents

Beta represents P _i Weight of (2), gamma denotes

The weights may be obtained from historical data or may be adjusted based on experience in actual work.

In the invention, based on the actual situation of the distributed database system, a graph neural network is constructed to predict target write-in nodes based on the size of a data block to be written, an initial calculation method of edge weights and graph node characteristic values of the graph neural network is provided, and the constructed graph neural network can well predict the target write-in nodes through simulation calculation so as to balance the write-in data of the distributed database system.

In one embodiment, the training of the graph neural network is obtained through historical data, wherein the historical data comprises the size of a written data block, the identification of a written storage node, the bandwidth of a scheduling node to the node, the processing capacity and the residual storage space of the node and the like.

In one embodiment, the priority of the storage node is determined based on the reliability of the storage node and the spatial position of the storage node relative to the scheduling node, the priority of the storage node is updated at regular time, and the priority of the storage node is calculated by:

indicates the priority, T, of the ith storage node _i Representing the reliability of the i-th storage node, D _i Represents the spatial position of the ith storage node relative to the scheduling node, and when D _i <1 hour, let D _i And =1, δ corresponds to the weight of the reliability of the ith storage node as a whole, and ∈ specifically corresponds to the weight of the spatial position of the scheduling node where the ith storage node is located.

In the invention, the AI technology and the traditional selection technology are combined in a priority calculation mode, therefore, the invention provides a storage node priority calculation mode which is determined according to the reliability of the storage node and the spatial position of the storage node relative to the scheduling node and is mainly determined based on the reliability, and the spatial position is taken as an influence factor, so that the spatial position adopts logarithmic calculation to reduce the influence, which is another important invention point of the invention.

In one embodiment, the spatial location of the storage node relative to the scheduling node may be represented using the distance of the storage node relative to the scheduling node.

Fig. 2 shows a storage apparatus of a distributed database system of the present invention, the distributed database system including N storage nodes, that is, the distributed database is composed of a plurality of storage nodes, the apparatus includes:

a receiving unit 201, configured to receive data to be written into the distributed database system, and determine the size of the data to be written;

a determining unit 202, configured to select one storage node from the N storage nodes as a first target storage node based on a load balancing policy; in the present invention, the load balancing policy is a random selection policy, a polling policy, or a source address hash policy, which are relatively mature load balancing policies and are not described again.

The predicting unit 203 predicts one storage node from the N storage nodes as a second target storage node by using the trained graph neural network;

a write unit 204, configured to write the data to be written based on the first target storage node and the second target storage node; wherein N is more than or equal to 2.

In one embodiment, the operation of the write unit 204 is: and judging whether the ID1 of the first target storage node is the same as the ID2 of the second target storage node, if so, writing the data to be written into the storage node corresponding to the ID1, if not, judging whether the priority Pr1 of the first target storage node is greater than or equal to the priority Pr2 of the second target storage node, if so, writing the data to be written into the storage node corresponding to the ID1, and if not, writing the data to be written into the storage node corresponding to the ID 2.

In one embodiment, the distributed database system of the present invention further includes a scheduling server, where the scheduling server writes the data to be written into a corresponding storage node, the scheduling server is connected to each storage node, and a diagram of the graph neural network is formed in a manner that: the scheduling server and the N storage nodes are used as graph nodes, the connection between the scheduling server and the N storage nodes is used as an edge of a graph, the weight of the edge is determined based on the bandwidth between the scheduling server and the corresponding storage node, and the characteristic value of the graph node is determined based on the processing capacity and the storage residual space of the storage node.

In one embodiment, the weights of the edges of the graph and the eigenvalues of the graph nodes are determined in the following manner: normalizing the bandwidth between the scheduling server and the corresponding storage node, the processing capacity and the storage residual space of the storage node to obtain a bandwidth normalization value W between the scheduling server and the corresponding storage node _i And the processing capacity normalization value P of the storage node _i And storing the remainder space normalization value M _i Calculating the Weight of the edge of the graph _i ：

；

Computing characteristic values C of graph nodes _i ，

；

Wherein i represents the ith storage node and alpha represents

Beta represents P _i Weight of (2), gamma denotes

In the invention, based on the actual situation of the distributed database system, a graph neural network is constructed to predict target writing nodes based on the size of a data block to be written, an initial calculation method of edge weights and graph node characteristic values of the graph neural network is provided, and the constructed graph neural network can predict the target writing nodes well through simulation calculation so as to balance the writing data of the distributed database system.

wherein the content of the first and second substances,

indicates the priority of the ith storage node, T _i Representing the ith storage nodeDegree of reliability, D _i Represents the spatial position of the ith storage node relative to the scheduling node, and when D _i <1 hour, let D _i And =1, δ corresponds to the weight of the reliability of the ith storage node as a whole, and ∈ specifically corresponds to the weight of the spatial position of the scheduling node where the ith storage node is located.

For convenience of description, the above devices are described as being functionally separated into various units and described separately. Of course, the functionality of the units may be implemented in one or more software and/or hardware when implementing the present application.

From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be essentially implemented or the portions that contribute to the prior art may be embodied in the form of a software product, which may be stored in a storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the apparatuses described in the embodiments or some portions of the embodiments of the present application.

Finally, it should be noted that: although the present invention has been described in detail with reference to the above embodiments, it should be understood by those skilled in the art that: modifications and equivalents may be made thereto without departing from the spirit and scope of the invention and it is intended to cover in the claims the invention as defined in the appended claims.

Claims

1. A storage apparatus of a distributed database system, the distributed database system comprising N storage nodes, the apparatus comprising:

the writing unit writes the data to be written based on the first target storage node and the second target storage node;

wherein N is more than or equal to 2;

Computing characteristic values C of graph nodes _i ，

；

Wherein i represents the ith storage node and alpha represents

Beta represents P _i Weight of (a), gamma denotes

The weight of (c).

2. The apparatus of claim 1, wherein the write unit is operable to: and judging whether the ID1 of the first target storage node is the same as the ID2 of the second target storage node, if so, writing the data to be written into the storage node corresponding to the ID1, if not, judging whether the priority Pr1 of the first target storage node is greater than or equal to the priority Pr2 of the second target storage node, if so, writing the data to be written into the storage node corresponding to the ID1, and if not, writing the data to be written into the storage node corresponding to the ID 2.

3. The apparatus of claim 2, wherein the load balancing policy is a random selection policy, a round robin policy, or a source address hashing policy.

4. A storage method of a distributed database system, wherein the distributed database system includes N storage nodes, the method comprising:

a receiving step, namely receiving data to be written into the distributed database and determining the size of the data to be written;

determining, namely selecting one storage node from the N storage nodes as a first target storage node based on a load balancing strategy;

predicting, namely predicting one storage node from the N storage nodes by using the trained graph neural network to serve as a second target storage node;

a writing step, namely writing the data to be written based on the first target storage node and the second target storage node;

wherein N is more than or equal to 2;

the distributed database system further comprises a scheduling server, the scheduling server writes the data to be written into corresponding storage nodes, the scheduling server is connected with each storage node, and the graph of the graph neural network is formed in the following mode: the method comprises the following steps that the scheduling server and the N storage nodes are used as graph nodes, the connection between the scheduling server and the N storage nodes is used as an edge of a graph, the weight of the edge is determined based on the bandwidth between the scheduling server and the corresponding storage node, the characteristic value of the graph node is determined based on the processing capacity and the storage residual space of the storage node, and the weight of the edge of the graph and the characteristic value of the graph node are determined in the following mode: normalizing the bandwidth between the scheduling server and the corresponding storage node, the processing capacity and the storage residual space of the storage node to obtain a bandwidth normalization value W between the scheduling server and the corresponding storage node _i And the processing capacity normalization value P of the storage node _i And storing the remainder space normalization value M _i Calculating the Weight of the edge of the graph _i ：

；

Computing characteristic values C of graph nodes _i ，

；

Wherein i represents the ith storage node and α represents

Beta represents P _i Weight of (a), gamma denotes

The weight of (c).

5. The method of claim 4, wherein the writing step operates by: and judging whether the ID1 of the first target storage node is the same as the ID2 of the second target storage node, if so, writing the data to be written into the storage node corresponding to the ID1, if not, judging whether the priority Pr1 of the first target storage node is greater than or equal to the priority Pr2 of the second target storage node, if so, writing the data to be written into the storage node corresponding to the ID1, and if not, writing the data to be written into the storage node corresponding to the ID 2.

6. The method of claim 5, wherein the load balancing policy is a random selection policy, a round robin policy, or a source address hashing policy.

7. A computer-readable storage medium having computer program code stored thereon which, when executed by a computer, performs the method of any of the preceding claims 4-6.