Distributed database and access method thereof
Technical Field
The invention relates to the technical field of computer network databases, in particular to a distributed database data segmentation mode and an access method thereof.
Background
Traditional applications generally use a centralized database architecture, connecting to a single database to store data. With the rapid development of networks and information technology, particularly the Internet of Things, data volumes keep growing. The capacity of a database on a single host is limited and its concurrent access capability degrades sharply, so it cannot keep pace with application growth, becomes the bottleneck of the whole system, and cannot meet the requirements of mass data computation. A distributed database solves the storage problem for mass data and makes it possible to store and analyze large amounts of data.
An existing distributed database must migrate data when nodes are added, rebalancing according to the segmentation rule: data is moved from old nodes to the new node, and data also needs to be migrated between the existing nodes. This operation occupies a large amount of bandwidth, takes a long time to complete, and reduces the efficiency of the whole distributed database.
In addition, when a node fails temporarily, an existing distributed database needs other nodes to provide service in its place, and when the node recovers, the data written to the replacement nodes between the time of failure and the time of recovery must be migrated back to it.
Therefore, those skilled in the art are dedicated to developing a distributed database and an access method thereof that avoid large-scale data migration, and the service bandwidth such migration occupies, when data nodes are added or failed nodes recover, thereby improving the concurrent service capability and reliability of the distributed database.
Disclosure of Invention
In view of the above-mentioned drawbacks of the prior art, the technical problem to be solved by the present invention is that a distributed database needs to migrate a large amount of data when expanding nodes and recovering failed nodes.
In order to achieve the above aim, the invention provides a distributed database comprising a data storage node module, a data storage node access module and a data segmentation rule management module; the data segmentation rule management module is used for: managing the data node segmentation rule currently in use in the distributed database; recording the change history of the data node segmentation rule; monitoring the data storage node modules, generating a new data node segmentation rule that removes a failed node when a data storage node or its backup node fails temporarily, and regenerating a new data node segmentation rule containing that node when the failed data storage node recovers; and notifying the data storage node access module when the data node segmentation rule changes; the data storage node module is used for storing the data distributed to the node and searching the data stored in the node; and the data storage node access module is used for keeping synchronization with the data segmentation rule management module.
Further, the data node segmentation rule comprises the data segmentation hash algorithm, the ID and address of each data storage node, the start time and end time of the rule, a connection establishment mode and required credentials.
Further, the data segmentation hash algorithm is a consistent hash algorithm.
The invention also provides a distributed database access method, which comprises the following steps:
the distributed database comprises a data storage node module, a data storage node access module and a data segmentation rule management module;
the data storage node access module is located at a data access client and, when started, connects to the data segmentation rule management module to obtain the data node segmentation rule in use and the historical data node segmentation rules;
when the data storage node access module receives new data, it uses the generation time field of the data to find the most recent data node segmentation rule whose change time is earlier than that generation time, performs hash calculation on the data using that rule to obtain the data storage node on which the data is to be stored, and sends the data to the corresponding data storage node.
Further, when the data storage node access module receives a data query request, the data storage node access module finds all data node segmentation rules contained in the time range according to the time range of the request; when the query condition does not contain the initial value of the data time range, the initial value of the data time range is considered to be zero; when the query condition does not contain the data time range end value, the data time range end value is regarded as infinity.
Further, after the data storage node access module acquires all the data node segmentation rules, for each data node segmentation rule, performing hash calculation on the filtering main key values in the query condition to acquire a corresponding data storage node module;
sending the query conditions and the reduced data filtering time range to the calculated data storage node module;
the data storage node module returns data meeting the conditions according to the inquired filtering main key values and the filtering time periods;
and the data storage node access module returns the returned data of each data storage node module to the client one by one after receiving the returned data.
Further, the data node segmentation rule management module maintains all data node segmentation rules and searches them by the start time and the end time of a given time interval; a data node segmentation rule is considered to be related to the query time period if it meets both of the following conditions:
the start time or the end time of the data node segmentation rule is greater than the start time of the filtering time interval;
the start time of the data node segmentation rule is less than the end time of the filtering time interval.
Further, when a data storage node is added, the new node information is added to the data segmentation rule management module, which writes the current data node segmentation rule into the history record, sets the end time of that rule to the current time, generates a data node segmentation rule containing the new node, and pushes all changed rules to the data storage node access modules that hold a persistent connection.
Further, when one or more data storage nodes or backup nodes thereof fail temporarily, the data segmentation rule management module generates a new data node segmentation rule for removing the failed node, records the change time of the data node segmentation rule, and regenerates the new data node segmentation rule containing the recovery node when the node is recovered.
Further, a query processed by the data storage node access module contains the hash key values to filter on;
storage nodes can be calculated for the same data from different hash key values, so that the same data is stored on several nodes, identical or different, as determined by the hash algorithm;
the query condition handled by the data storage node access module may specify only a start time, only an end time, both, or neither.
The distributed database and the access method thereof have the following beneficial technical effects:
1) When data storage nodes are added, for example because the data scale has grown, the storage nodes of existing data do not need to be recalculated under the latest data segmentation rule and the data does not need to be moved, so data migration is avoided; at the same time, the nodes holding the data can still be located efficiently from the queried time interval.
2) Because adding data storage nodes is convenient, one or more data nodes can be added quickly. As data continues to grow, the data volume across all data storage nodes in the distributed database eventually tends toward balance.
3) When a data storage node and all of its backup nodes fail, the distributed database can still accept the data that would have been written to that node under the original segmentation rule, and the data does not need to be migrated back from other nodes after the node recovers.
The conception, the specific structure and the technical effects of the present invention will be further described with reference to the accompanying drawings to fully understand the objects, the features and the effects of the present invention.
Drawings
FIG. 1 is a diagram of a distributed database architecture in accordance with a preferred embodiment of the present invention.
Detailed Description
The distributed database comprises a data segmentation rule management module, a data storage node module and a data storage node access module. The data segmentation rule management module manages the data node positioning algorithm and rule currently in use in the distributed database, the history of that algorithm and rule, and the time points at which the rule changed. It is responsible for providing all data segmentation rules existing in the distributed database to the data storage node access module, and it notifies the data storage node access module when the data segmentation rule changes. The data segmentation rule management module also monitors all data storage node modules, generates a new data segmentation rule that removes the failed node when a data storage node and its backup node fail temporarily, and regenerates a new data segmentation rule containing the node when the failed data storage node recovers.
The data segmentation rule comprises a data segmentation hash algorithm, the number of data nodes in the distributed database, the address of each node, a connection establishment mode and the required credentials. The data segmentation hash algorithm can be a consistent hash algorithm or any other hash algorithm that distributes data evenly. Given a piece of data, the data segmentation rule applies the hash algorithm, configured for the number of nodes, to the hash key value defined by the data and obtains the corresponding data node.
The data storage node module is responsible for storing the data distributed to the node and searching the data stored in the node. The data storage node access module is responsible for keeping synchronization with the data segmentation rule management module, and stores the current and historical data segmentation rules locally. The data storage node access module is located at a data access client; when it starts, it connects to the data segmentation rule management module and obtains the data segmentation rule in use and the historical data segmentation rules. It keeps a persistent connection with the data segmentation rule management module, which pushes each new rule to all data storage node access modules holding such a connection when the rule changes. The data storage node access module uses the latest rules to calculate the node on which data is to be stored and the nodes that must be accessed to read data.
Each piece of data defines one or more hash input key values for the hash calculation. These typically identify the unique subject of the data, such as a physical address or an IP address. Multiple key values are joined by string concatenation. Each piece of data also contains the time at which it was generated.
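The mapping from a hash key value to a data node can be illustrated with a minimal consistent hash ring. This is a sketch only: the class name `ConsistentHashRing`, the use of MD5, the number of virtual points per node and the sample key are assumptions chosen for illustration and are not specified by the invention.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a 64-bit integer (MD5 is an arbitrary, assumed choice)."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")


class ConsistentHashRing:
    """Hypothetical ring: each node owns several points; a key maps to the
    first point clockwise from the key's own hash."""

    def __init__(self, node_ids, replicas=100):
        # Sorted list of (point hash, node ID) pairs forming the ring.
        self._points = sorted(
            (_hash(f"{node_id}#{i}"), node_id)
            for node_id in node_ids
            for i in range(replicas)
        )
        self._hashes = [h for h, _ in self._points]

    def node_for(self, key: str) -> str:
        # Wrap around to the first point when the key hashes past the last one.
        idx = bisect.bisect_right(self._hashes, _hash(key)) % len(self._points)
        return self._points[idx][1]


ring = ConsistentHashRing(["A", "B", "C", "D"])
node = ring.node_for("00:1A:2B:3C:4D:5E")  # sample hash key value (assumed)
```

With this construction, adding a node only adds points to the ring, so a key either stays on its previous node or moves to the new one.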
When the data storage node access module receives new data, it uses the generation time field of the data to find the most recent segmentation rule whose change time is earlier than that generation time, performs hash calculation on the data using that rule to obtain the data storage node on which the data is to be stored, and sends the data to the corresponding data storage node.
When the data storage node access module receives a data query request, it finds all the data segmentation rules that fall within the requested time range, that is, the rules whose validity period overlaps that range. When the query condition does not contain a start value for the data time range, the start value is taken to be zero. When the query condition does not contain an end value for the data time range, the end value is taken to be infinity. The data storage node access module then processes the acquired data segmentation rules one by one: for each rule, it performs hash calculation on the filtering primary key values in the query condition to obtain the corresponding data storage node, and sends the query condition and the narrowed data filtering time range to that node. The narrowed data filtering time range is the intersection of the validity period of the data segmentation rule and the filtering time period of the query request. Each data storage node returns the data that satisfies the queried filtering primary key values and the filtering time period. As the data storage node access module receives the data returned by each data storage node, it returns that data to the requester one by one.
When a data storage node is added, the new node information is added to the data segmentation rule management module, which writes the current data segmentation rule into the history record, sets the end time of that rule to the current time, generates a data segmentation rule containing the new node, and pushes all changed rules to the data storage node access modules that hold a persistent connection.
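The write path described above can be sketched as follows. The `Rule` class, the helper names and the numeric timestamps are hypothetical, and a simple modulo hash stands in for whatever hash algorithm the rule actually configures.

```python
from dataclasses import dataclass
import hashlib
import math


@dataclass(frozen=True)
class Rule:
    """Hypothetical segmentation rule: node list plus validity period."""
    name: str
    nodes: tuple
    start: float            # time at which the rule took effect
    end: float = math.inf   # open-ended while the rule is in use

    def node_for(self, key: str) -> str:
        # Placeholder modulo hash (assumption); a consistent hash could be used.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return self.nodes[int.from_bytes(digest[:8], "big") % len(self.nodes)]


def rule_for_write(rules, data_time):
    """Return the latest rule whose change time is earlier than the data time."""
    candidates = [r for r in rules if r.start < data_time]
    if not candidates:
        raise LookupError("no segmentation rule predates this data")
    return max(candidates, key=lambda r: r.start)


def route_write(rules, hash_key, data_time):
    """Pick the rule by data time, then the node by hashing the key."""
    rule = rule_for_write(rules, data_time)
    return rule.name, rule.node_for(hash_key)


rules = [
    Rule("R1", ("A", "B"), start=100.0, end=200.0),
    Rule("R2", ("A", "B", "C", "D"), start=200.0),
]
rule_name, target_node = route_write(rules, "K1", data_time=150.0)
```

Because the rule is chosen from the data's own generation time, data that arrives late is still routed under the rule that was in force when it was generated.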
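A minimal sketch of this query planning step is given below, assuming numeric timestamps. The names `Rule`, `pick_node` and `plan_query` are hypothetical, and the modulo hash is a stand-in for the configured hash algorithm.

```python
import hashlib
import math
from typing import NamedTuple, Optional


class Rule(NamedTuple):
    """Hypothetical segmentation rule with its validity period."""
    name: str
    nodes: tuple
    start: float
    end: float


def pick_node(rule: Rule, key: str) -> str:
    # Placeholder modulo hash (assumption), not the invention's algorithm.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return rule.nodes[int.from_bytes(digest[:8], "big") % len(rule.nodes)]


def plan_query(rules, key, t_from: Optional[float] = None,
               t_to: Optional[float] = None):
    """Return (node, narrowed_start, narrowed_end) for every relevant rule."""
    lo = 0.0 if t_from is None else t_from      # missing start value -> zero
    hi = math.inf if t_to is None else t_to     # missing end value -> infinity
    plan = []
    for rule in rules:
        if rule.end > lo and rule.start < hi:   # validity period overlaps query
            plan.append((pick_node(rule, key),
                         max(rule.start, lo),   # intersection of the two periods
                         min(rule.end, hi)))
    return plan


rules = [
    Rule("R1", ("A", "B"), 100.0, 200.0),
    Rule("R2", ("A", "B", "C", "D"), 200.0, math.inf),
]
plan = plan_query(rules, "K1", t_from=150.0, t_to=250.0)
```

Each entry of `plan` names one node to contact together with the narrowed time range to send to it, so a node is never asked about a period during which a different rule was in force.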
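The node-addition step can be sketched as below. The `RuleManager` class and its callback-based `subscribe` method are assumptions that stand in for the persistent connections; rules are plain dictionaries for brevity.

```python
import math
import time


class RuleManager:
    """Hypothetical rule manager keeping one current rule and a history."""

    def __init__(self, nodes, now=None):
        start = time.time() if now is None else now
        self.history = []
        self.current = {"nodes": list(nodes), "start": start, "end": math.inf}
        self.subscribers = []  # callbacks standing in for persistent connections

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def add_node(self, node_id, now=None):
        now = time.time() if now is None else now
        # Close the rule in use: its end time becomes the current time.
        closed = dict(self.current, end=now)
        self.history.append(closed)
        # The new rule starts at the same instant and includes the new node.
        self.current = {"nodes": closed["nodes"] + [node_id],
                        "start": now, "end": math.inf}
        # Push every changed rule to all connected access modules.
        for notify in self.subscribers:
            notify([closed, self.current])
        return self.current


manager = RuleManager(["A", "B", "C", "D"], now=100.0)
received = []
manager.subscribe(received.append)
manager.add_node("E", now=200.0)
```

No stored data is touched in this step: only the rule set changes, which is why adding a node requires no migration.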
FIG. 1 is an application architecture diagram of a distributed database of the present invention. The distributed database is deployed on a distributed cluster consisting of a plurality of hosts. The system comprises a data segmentation rule management module, one or more data storage node access modules and one or more data storage node modules. The data access client establishes connections with all data storage node access modules in the distributed database and uses them in a round-robin manner.
The data segmentation rule management module can store all data segmentation rules in any type of relational or non-relational database; this embodiment uses MySQL. The data segmentation rule management module consists of two hosts, with MySQL configured in master-slave mode, and Keepalived is used to switch the virtual IP between the master and the slave. The data segmentation rule management module listens on TCP port 7000 and accepts connections from the data storage node access modules. After a connection is established, the data segmentation rule management module sends all segmentation rules to the data storage node access module over the TCP connection. The data segmentation rule management module sends a keep-alive message to every connected data storage node access module every 10 seconds. If the keep-alive response of a data storage node access module is not received within 5 seconds, the TCP connection to that data storage node access module is closed. The data segmentation rule management module also establishes connections to all data storage node modules and checks whether the data storage nodes are healthy by periodically sending MySQL ping messages. If a data storage node does not reply to a ping message within the specified time, the data segmentation rule management module considers the node to have failed, removes it from the nodes contained in the existing data segmentation rule, and generates a new data segmentation rule. If the failed node recovers, the data segmentation rule management module adds the recovered node to the nodes contained in the existing data segmentation rule and generates a new data segmentation rule.
A data segmentation rule comprises the following parts: the hash algorithm used, the number of data storage nodes, the ID of each data storage node, the address of each data storage node, the specific access mode together with the required user name, password and the like, the time at which the segmentation rule takes effect, and the time at which it ceases to be effective.
The data segmentation rule receives a string key value and obtains a node ID by executing the corresponding hash algorithm on it. The node ID yields the address of the node and its specific access mode, so that data can be read from or written to the node.
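The failure and recovery handling can be sketched as a small state machine driven by health-check results. The class `FailureAwareRules` and its `report_ping` method are hypothetical; the transport (MySQL ping, timeouts) is left out and only the rule regeneration logic is shown.

```python
import math


class FailureAwareRules:
    """Hypothetical rule list that is regenerated when a node fails or recovers."""

    def __init__(self, nodes, now):
        self.all_nodes = list(nodes)
        self.failed = set()
        self.rules = [{"nodes": list(nodes), "start": now, "end": math.inf}]

    def _regenerate(self, now):
        # Close the rule in use and open a new one over the live nodes.
        self.rules[-1]["end"] = now
        live = [n for n in self.all_nodes if n not in self.failed]
        self.rules.append({"nodes": live, "start": now, "end": math.inf})

    def report_ping(self, node, ok, now):
        """Record one health-check result; act only when the state changes."""
        was_failed = node in self.failed
        if ok and was_failed:            # node recovered: put it back
            self.failed.discard(node)
            self._regenerate(now)
        elif not ok and not was_failed:  # node just failed: take it out
            self.failed.add(node)
            self._regenerate(now)


monitor = FailureAwareRules(["A", "B", "C"], now=0.0)
monitor.report_ping("B", ok=False, now=10.0)  # B misses its ping deadline
monitor.report_ping("B", ok=False, now=15.0)  # still down: no new rule
monitor.report_ping("B", ok=True, now=20.0)   # B answers again
```

Data written while B was down stays on the nodes selected by the intermediate rule and remains reachable through that rule's time period, so nothing needs to be copied back to B.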
The data segmentation rule management module maintains all data segmentation rules and searches them by the start time and the end time of a given time interval. A segmentation rule is considered to be related to the query time period if it meets both of the following conditions:
the start time or the end time of the segmentation rule is greater than the start time of the filtering time interval;
the start time of the segmentation rule is less than the end time of the filtering time interval.
If the validity period of a data segmentation rule is related to the data filtering time interval, the overlapping period is the relevant time period. The start time of the relevant time period is the later of the rule's start time and the filtering interval's start time, and its end time is the earlier of the rule's end time and the filtering interval's end time.
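The relevance test and the relevant time period can be written compactly. The symbols are introduced here for illustration only: $s_r, e_r$ denote the start and end time of a segmentation rule and $s_q, e_q$ those of the filtering time interval. Since $e_r \ge s_r$, the first condition reduces to $e_r > s_q$, and the two conditions are read as jointly required.

```latex
\[
\mathrm{related}(r, q) \iff (e_r > s_q) \,\wedge\, (s_r < e_q)
\]
\[
s_{rq} = \max(s_r, s_q), \qquad e_{rq} = \min(e_r, e_q)
\]
```

Here $s_{rq}$ and $e_{rq}$ are the start and end of the relevant time period sent to the node selected under rule $r$.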
When the data storage node access module receives data to be written to a data storage node, it always uses the latest data segmentation rule to compute the node on which the data is stored.
When the data storage node access module receives a query request, it must acquire all relevant nodes according to the segmentation rule related to the time period, and send the query request and the relevant time period to all relevant nodes.
Assume that there are four data storage nodes A, B, C and D in the distributed database, and that all four are included in the current data segmentation rule. Also assume that there are two data segmentation rules in the distributed database, which are in turn:
R1 contains nodes A and B, with rule start time 2015/12/12 10:50:20 and end time 2015/12/20 11:00:00.
R2 contains nodes A, B, C and D, with rule start time 2015/12/20 11:00:00 and end time infinity.
That is, R2 is the latest data segmentation rule.
Assume that data D1 is to be written and that the hash key value defined by D1 is K1. The hash algorithm in R2 is applied to K1; assuming the result is data storage node B, the data is sent to node B for storage.
If a node E is added, the end time of R2 is rewritten to the current time, for example 2017/04/17 23:50:00, and a new data segmentation rule is generated:
R3 contains nodes A, B, C, D and E, with rule start time 2017/04/17 23:50:00 and end time infinity.
From this point on, if new data is to be written, the storage node must be calculated using segmentation rule R3.
Assume that hash key value K1 needs to be queried and that the query time period is 2016/01/01 00:00:00 to 2017/04/20 12:12:12. By matching the time period against its local segmentation rules, the data storage node access module finds that the rules applicable to the filtering condition are R2 and R3, with relevant time periods 2016/01/01 00:00:00 to 2017/04/17 23:50:00 and 2017/04/17 23:50:00 to 2017/04/20 12:12:12 respectively. The data storage node access module executes the hash algorithm on K1 according to R2 and sends the query conditions and the filtering time period 2016/01/01 00:00:00 to 2017/04/17 23:50:00 to the node calculated by the hash algorithm. It then executes the hash algorithm on K1 according to R3 and sends the query conditions and the filtering time period 2017/04/17 23:50:00 to 2017/04/20 12:12:12 to the node calculated by the hash algorithm. Finally, the data storage node access module sends the data returned by the two nodes to the data access client.
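The time period matching in this example can be reproduced with a short script. It is a sketch under assumptions: the helper names `ts` and `relevant_periods` are hypothetical, `datetime.max` stands in for an infinite end time, and the hash step is omitted because only the period calculation is being checked.

```python
from datetime import datetime

FMT = "%Y/%m/%d %H:%M:%S"


def ts(text: str) -> datetime:
    """Parse a timestamp written in the format used by the example."""
    return datetime.strptime(text, FMT)


# Rules R1 to R3 from the example, as (name, start, end) tuples.
RULES = [
    ("R1", ts("2015/12/12 10:50:20"), ts("2015/12/20 11:00:00")),
    ("R2", ts("2015/12/20 11:00:00"), ts("2017/04/17 23:50:00")),
    ("R3", ts("2017/04/17 23:50:00"), datetime.max),  # open-ended rule
]


def relevant_periods(rules, q_start, q_end):
    """Return (rule name, period start, period end) for each overlapping rule."""
    result = []
    for name, start, end in rules:
        if end > q_start and start < q_end:
            result.append((name, max(start, q_start), min(end, q_end)))
    return result


periods = relevant_periods(
    RULES, ts("2016/01/01 00:00:00"), ts("2017/04/20 12:12:12")
)
for name, start, end in periods:
    print(name, start.strftime(FMT), "-", end.strftime(FMT))
```

Run as written, the script selects R2 and R3 with the two relevant time periods given in the example, and R1 is skipped because it ended before the query period began.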
The foregoing detailed description of the preferred embodiments of the invention has been presented. It should be understood that numerous modifications and variations could be devised by those skilled in the art in light of the present teachings without departing from the inventive concepts. Therefore, the technical solutions available to those skilled in the art through logic analysis, reasoning and limited experiments based on the prior art according to the concept of the present invention should be within the scope of protection defined by the claims.