CN116881320A - Distributed data query method and system oriented to ubiquitous storage - Google Patents
- Publication number
- CN116881320A (CN202310165096.6A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2471—Distributed queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
- G06F16/2246—Trees, e.g. B+trees
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
- G06F16/2255—Hash tables
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/27—Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/907—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a distributed data query method and system for ubiquitous storage. The method comprises the following steps: step 1, constructing a hierarchical index network model comprising a metadata management network and a data block index network, wherein a decentralized metadata management network is constructed in a fully-connected networking mode and a data block index network is constructed based on a distributed hash table; and step 2, defining a hierarchical data index interaction paradigm to realize collaborative management and fast query of metadata and data blocks. Through the hierarchical index network model, the invention decouples metadata from location information and reduces the management and maintenance cost of metadata.
Description
Technical Field
The invention belongs to the technical field of distributed data storage, and particularly relates to a distributed data indexing method and system under a ubiquitous storage application scene.
Background
Currently, conventional distributed storage schemes, represented by data centers, mostly rely on servers to provide reliable data storage and stable access services. They remain centralized at the design level, so the operating efficiency and scalability of the system are limited by server performance. Ubiquitous storage follows a fully decentralized design concept: a user becomes a working entity in the ubiquitous storage system by contributing or renting out idle resources (storage, computation, or bandwidth) of their own terminal devices. As end-node resources continue to converge, the service load is spread evenly across different nodes, which addresses the problems of system capacity expansion and single-point performance bottlenecks. However, in this mode the huge number of highly dispersed storage nodes makes it difficult to locate where data is stored, and the loose organization of nodes cannot guarantee node stability, which further increases the difficulty of data location. Therefore, an efficient lookup algorithm that is resistant to node jitter (nodes going offline or migrating) must be employed to obtain the location information of data quickly.
In the data center storage mode, metadata records the location of data on the storage medium, and the metadata is managed and maintained by high-performance server nodes that process massive numbers of data query requests. As the amount of data or the volume of data access grows, servers must be upgraded or replaced to meet the query demand. This approach tends to be costly and depends heavily on the data processing capability of the server, which violates the design concept of decentralization and is therefore not suitable for ubiquitous storage applications.
Another way to improve data query efficiency is to accelerate data search through an indexing mechanism; typical applications include IPFS and Storj. However, existing indexing algorithms are mainly suited to relatively stable storage nodes, whereas node stability in a ubiquitous storage system is easily affected by network fluctuation and user behavior. In such an environment, the efficiency and accuracy of these indexing algorithms drop sharply.
Based on the above problems, there is a need for an efficient data indexing and management method for ubiquitous storage systems that achieves efficient data query while reducing the influence of node jitter on queries, thereby maintaining query efficiency and accuracy.
Disclosure of Invention
The invention provides a distributed data index and management model and a distributed data index and management method oriented to a ubiquitous storage application scene, and the data searching efficiency and accuracy are improved by layering a data index network structure. Meanwhile, on the basis of the existing index algorithm, a node jitter self-adaptive adjusting method is provided, and the influence of node jitter on data indexes is reduced.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
A distributed data query method oriented to ubiquitous storage comprises the following steps:
Step 1, constructing a hierarchical index network model, wherein the hierarchical index network model comprises a metadata management network and a data block index network:
constructing a decentralized metadata management network in a fully-connected networking mode;
constructing a data block index network based on the design of the distributed hash table (Distributed Hash Table, DHT), so as to locate data blocks quickly.
Step 2, constructing a hierarchical data index interaction paradigm to realize collaborative management and fast query of metadata and data blocks.
Preferably, after step 2, the method further comprises step 3: using node availability probability prediction and an index network adjustment method, establishing a node jitter prediction model based on the heavy-tailed Pareto distribution, and adjusting the index network according to the intensity of the jitter.
Preferably, in step 1, the query network is logically divided into: a metadata management network and a data block indexing network;
The metadata management network is constructed as follows: the metadata management nodes are networked in a fully-connected mode; their core task is to process a large number of client query requests and respond quickly, while updating and maintaining metadata information as the data and storage nodes change.
The data block index network is constructed as follows: following the design of the distributed hash table, the data block index nodes build an index network topology through point-to-point connections; their main task is to record and maintain the location information of the storage nodes of data blocks, while assisting in the management of stored data and monitoring and updating the state of storage nodes.
The invention introduces the data block index network to decouple the metadata and the storage node position information, and reduce the frequent modification and writing of the metadata caused by the change of the node position.
Preferably, in step 1, the metadata management network is constructed as follows:
1) The metadata management nodes are nodes with high stability, large bandwidth and strong data processing capacity, and are networked in a fully-connected mode. Each metadata management node records and maintains only part of the metadata information, and the average data load of the nodes is:
$$E = \frac{1}{n}\sum_{i=1}^{n} w_i$$
where E is the average data load, $w_i$ is the amount of data recorded by the i-th node at a given point in time, and n is the number of nodes.
2) Each metadata management node obtains its own identity through a hash algorithm h1(x), and nodes and metadata are located by maintaining a metadata management node registry. In the registry, the identity (NodeID) and routing information of the other nodes are recorded in order of XOR distance to the node itself, from near to far. The distance is calculated as:
$$\mathrm{dist}(x, y) = \lfloor \log_2 (\mathrm{id}(x) \oplus \mathrm{id}(y)) \rfloor + 1$$
where dist is the relative distance (XOR distance) between nodes, x and y denote two different nodes, and id(x) and id(y) are the computed node identities, which are generally large random numbers.
3) As with node identities, stored files are uniformly addressed using the hash algorithm h1(x). Following the proximity principle, the metadata information of a file is stored on the node with the smallest XOR distance dist(x, y), where x and y represent the node identification value and the metadata identification value, respectively. With this storage mode, the metadata of files is distributed randomly across different nodes, thereby balancing the data load.
Metadata information is stored in the form of < fid, metadata > key-value pairs. Wherein, fid has global uniqueness as a file identifier; specific summary information of the file is recorded in metadata, including file name, author, creation time, data block information, etc. In the invention, the data block information only contains the data block identification information (bid) and is not related with the actual storage position of the data any more, thereby reducing the management and maintenance cost of the node and improving the query efficiency.
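The addressing and placement rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the patent does not name a concrete hash function, so SHA-1 stands in for h1(x), and the helper names `build_registry` and `responsible_node` are hypothetical.

```python
import hashlib
from typing import List


def h1(value: str) -> int:
    """Stand-in for h1(x): a SHA-1 digest read as a 160-bit integer (assumption)."""
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest(), "big")


def dist(a: int, b: int) -> int:
    """floor(log2(a XOR b)) + 1, which equals the bit length of a XOR b (0 if a == b)."""
    return (a ^ b).bit_length()


def build_registry(self_id: int, other_ids: List[int]) -> List[int]:
    """Registry of the other management nodes, ordered from near to far."""
    return sorted(other_ids, key=lambda nid: (dist(self_id, nid), self_id ^ nid))


def responsible_node(fid: int, node_ids: List[int]) -> int:
    """Node with the smallest distance to fid; the raw XOR value breaks ties."""
    return min(node_ids, key=lambda nid: (dist(fid, nid), fid ^ nid))


# Small identifiers keep the example readable; real identifiers would come from h1().
nodes = [1, 7, 12]
owner = responsible_node(6, nodes)            # 6 XOR 7 = 1 is the smallest value
metadata_store = {owner: {6: {"name": "example.txt", "blocks": ["bid-a", "bid-b"]}}}
```

Because `dist` only records the position of the highest differing bit, several nodes can share the same distance; breaking ties on the raw XOR value is one possible convention, not something the patent specifies.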
Preferably, the data block index network construction step in step 1 specifically includes the following steps:
1) In the invention, the data block index network is networked with reference to the data structure of the Kademlia algorithm and finally forms a tree-shaped network topology. Index nodes and data block identifiers (bid) are uniformly addressed with the same hash algorithm h2(x). The location information of nodes and data blocks serves as the leaves of the tree, the position of each node is uniquely determined by the hash value of its identifier, and neighbor relations are established among nodes according to their relative distance.
In actual operation, following the principle of decentralization, any node stores only part of the index data, and the location index of a piece of data is stored on nearby nodes according to the proximity principle. Index establishment can therefore be abstracted as a process of successive approximation that finally converges on the index node with the minimum relative distance; redundant backup and load balancing of the data are achieved during this convergence.
2) In the invention, the index table uses a non-clustered index: nodes do not store the data itself, only the location at which the data is stored. The key-value pair therefore has the structure < id, (ip, port) >, where id is the identifier of a node or data block, and (ip, port) is the network location and interaction interface of an index node or data storage node.
As identification information, id corresponds to two kinds of objects: bid, the data block identifier, and nid, the index node identifier. Both are stored in the k-bucket data structure of the index node, so index nodes and data blocks are peers from the indexing point of view. (ip, port) gives the network location and service identity of the node, and the port distinguishes whether the object is a storage node or an index node.
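A minimal sketch of such a non-clustered index table is shown here. The port numbers, the `role` helper, and the choice of bucket key (bit length of the XOR with the local nid) are illustrative assumptions and are not taken from the patent.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

Address = Tuple[str, int]   # (ip, port)

INDEX_PORT = 7000     # hypothetical port served by index nodes
STORAGE_PORT = 7001   # hypothetical port served by storage nodes


def role(address: Address) -> str:
    """Tell index nodes and storage nodes apart by port, as the text describes."""
    return "index" if address[1] == INDEX_PORT else "storage"


@dataclass
class IndexTable:
    self_id: int
    # bucket number -> {id (nid or bid) -> (ip, port)}
    buckets: Dict[int, Dict[int, Address]] = field(default_factory=dict)

    def bucket_of(self, ident: int) -> int:
        """Bucket chosen by XOR distance between the local nid and the identifier."""
        return (self.self_id ^ ident).bit_length()

    def put(self, ident: int, address: Address) -> None:
        """Record where a node (nid) or a data block (bid) can be reached."""
        self.buckets.setdefault(self.bucket_of(ident), {})[ident] = address

    def get(self, ident: int) -> Optional[Address]:
        return self.buckets.get(self.bucket_of(ident), {}).get(ident)


table = IndexTable(self_id=0b1010)
table.put(0b1000, ("10.0.0.2", INDEX_PORT))     # a neighboring index node (nid)
table.put(0b0110, ("10.0.0.9", STORAGE_PORT))   # a data block (bid) held by a storage node
```

Only locations are stored, never block contents, which is what makes the index non-clustered.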
Preferably, in step 2, a complete file query process is divided into: metadata query and data chunk index positioning.
In the metadata query process, the user first establishes a communication connection with a nearby metadata management node and sends it the file identifier fid of the target file. The node then performs an XOR operation on the file identifier and the node identifiers to obtain the relative distance dist. Finally, the node searches the metadata management node registry for the metadata management node with the minimum relative distance, forwards the request message to that node, and returns the response to the user by proxy.
In the data block indexing process, the metadata is first unpacked to obtain the data block identifiers (bid). The user then connects to a nearby index node and sends a data block query request. On receiving the request, the index node checks whether the location information of the target data block is stored in its own index table. If so, it returns the location information directly; otherwise, it queries by proxy for a node whose relative distance to bid is smaller and redirects the user's index request to that node. This is iterated until the node closest to bid is found and the data is obtained, or the index information of the target data cannot be found and a query failure message is returned.
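The iterative redirection in the data block indexing process can be sketched as follows. The in-memory `network` dictionary replaces real network calls, and the class and function names are hypothetical; the loop is only one way to realize the described behavior.

```python
from typing import Dict, List, Optional, Tuple

Address = Tuple[str, int]


class IndexNode:
    def __init__(self, nid: int) -> None:
        self.nid = nid
        self.index: Dict[int, Address] = {}   # bid -> (ip, port) of the storage node
        self.neighbors: List[int] = []        # nids of known neighbor index nodes


def locate_block(network: Dict[int, IndexNode], start: int, bid: int) -> Optional[Address]:
    """Follow redirections towards bid; return None when no closer node is known."""
    current = network[start]
    while True:
        if bid in current.index:               # hit in the node's own index table
            return current.index[bid]
        closer = [n for n in current.neighbors if (n ^ bid) < (current.nid ^ bid)]
        if not closer:                         # converged without finding the entry
            return None
        current = network[min(closer, key=lambda n: n ^ bid)]


# Three index nodes; only node 7 knows where block 6 is stored.
network = {nid: IndexNode(nid) for nid in (1, 4, 7)}
network[1].neighbors = [4]
network[4].neighbors = [1, 7]
network[7].neighbors = [4]
network[7].index[6] = ("10.0.0.9", 7001)
```

The loop terminates because every redirection strictly reduces the XOR distance to bid.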
Preferably, the step 3 is specifically as follows:
1) Node jitter prediction, comprising two parts: establishing a node jitter prediction model and correcting the model;
building a node jitter prediction model: modeling access loss conditions caused by node offline or network fluctuation in a storage network based on a jitter prediction algorithm of mobile node behavior characteristics and a statistical rule, and simultaneously providing a quantization standard of node jitter degree;
correction of the model: and correcting the model based on the network state history data and the node behavior rules.
2) According to the node jitter prediction model and the actual measurement result, the data redundancy and the access concurrency of the data block index network are dynamically regulated under different jitter conditions, so that the index efficiency and the hit rate are improved, and the data update is accelerated;
3) In order to repair failure data caused by long-time offline or permanently separated nodes of a storage network, the update frequency of index entries needs to be adjusted according to the failure probability of the nodes, and the update of the failure index entries is accelerated by adjusting the data access frequency and the query concurrency.
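One way to picture how a jitter prediction could drive the redundancy adjustment is sketched below. Everything here is an assumption made for illustration: the patent names a heavy-tailed Pareto model but gives no parameters, so node session length is assumed to follow a Pareto (type I) distribution with scale `x_m` and shape `shape`, and the replica rule is a simple independence-based bound rather than the patented method.

```python
import math


def survival(t: float, x_m: float, shape: float) -> float:
    """Pareto (type I) survival function P(T > t)."""
    return 1.0 if t <= x_m else (x_m / t) ** shape


def p_stay_online(uptime: float, horizon: float, x_m: float = 1.0, shape: float = 1.5) -> float:
    """Probability that a node already online for `uptime` stays online for `horizon` more."""
    return survival(uptime + horizon, x_m, shape) / survival(uptime, x_m, shape)


def replicas_needed(p_online: float, target: float = 0.99) -> int:
    """Smallest r with 1 - (1 - p_online)**r >= target, assuming independent nodes."""
    if not 0.0 < p_online <= 1.0:
        raise ValueError("p_online must be in (0, 1]")
    if p_online == 1.0:
        return 1
    return max(1, math.ceil(math.log(1.0 - target) / math.log(1.0 - p_online)))


# A node that has been online for 10 time units, looking 10 units ahead.
p = p_stay_online(uptime=10.0, horizon=10.0, x_m=1.0, shape=1.0)
redundancy = replicas_needed(p, target=0.99)
```

Under this toy model, heavier jitter (lower `p_online`) raises the number of index replicas, and the same count could plausibly be reused as the query concurrency.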
The invention also discloses a system based on the distributed data query method facing the ubiquitous storage, which comprises the following modules:
the hierarchical index network model building module: the hierarchical index network model comprises a metadata management network and a data block index network, and a decentralised metadata management network is constructed in a fully-connected networking mode; constructing a data block index network based on the distributed hash table;
metadata and data block query module: and constructing a hierarchical data index interaction paradigm, and realizing collaborative management and quick query of metadata and data blocks.
Compared with the prior art, the invention has the remarkable advantages that:
according to the method, decoupling of metadata and position information is achieved through the index network hierarchical model, and management and maintenance cost of the metadata is reduced; the metadata nodes adopt a fully-connected networking mode, so that the query efficiency of metadata is improved; the data block index layer is established based on the DHT algorithm, so that the influence of node instability on index efficiency and accuracy is reduced as much as possible while the end node resources are fully utilized, and the expandability of the system is improved; the index adjusting method provided by the invention effectively relieves the problem of index network performance degradation caused by node jitter, and can update failure nodes and data in time, thereby improving the availability of the system.
Drawings
FIG. 1 is a schematic diagram of an indexing network architecture;
FIG. 2 is a schematic diagram of the metadata management node registry;
FIG. 3 is a schematic diagram of data interactions;
FIG. 4 is a flow chart of a file query;
FIG. 5 is a node jitter adjustment flow chart;
FIG. 6 is a block diagram of a distributed data query system for ubiquitous storage in accordance with the present invention.
Detailed Description
The technical scheme of the invention is further described below with reference to the attached drawings and preferred embodiments.
In order to better understand the related method provided in this embodiment, a complete file query is selected, and the index adjustment and repair process under the condition of jitter of different nodes is described, where the specific steps of this embodiment are as follows:
A distributed data query method oriented to ubiquitous storage comprises the following steps:
S1, constructing a hierarchical index network
As shown in fig. 1, in this embodiment, node members in the model can be divided into three roles according to functions: metadata management nodes (MetaNode, MN), data index nodes (IndexNode, IN), and data storage nodes (DataNode, DN). The location and functionality of the different functional nodes in the indexing network is shown in fig. 1.
S11, selecting a plurality of nodes to serve as MN nodes to form a metadata management network according to the scale of the stored data, and forming a data block index network by a plurality of IN nodes.
S12, in the embodiment, a node with good stability and strong performance needs to be selected to serve as an MN node, so that the stability of connection between the nodes is ensured, and an access request is processed efficiently. The MN nodes form a metadata management network in a fully-connected networking mode, each node maintains partial metadata information to jointly form the metadata management network, and the detailed construction steps are shown in S1-12.
S13, further, nodes with better stability and larger communication bandwidth are selected as IN nodes, thereby guaranteeing the efficiency of data query. Each IN node establishes neighbor relationships, through point-to-point connections, with other IN nodes that are relatively close to it. Each node stores part of the index data, and together the nodes form the data block index network; the detailed construction steps are shown in S1-13.
S1-12, constructing a metadata management network model:
1) As shown in FIG. 1, the metadata management nodes construct a management network in a fully-connected mode, the nodes maintain a neighbor relation through data interaction, and each node records network identifications and routing information of other nodes, so that any metadata query operation can be modeled as a 1-hop reachable communication process. In the mode, the query service can be completed by accessing at most two nodes, so that the communication delay is effectively reduced, and the quick response of metadata query is realized.
2) In the system initialization phase, each metadata management node generates its own identity NodeID = h1(node) with the hash function h1() from attribute characteristics of the node (for example, MAC address or access permission number). Each node then calculates the relative distance from every neighboring node to itself, $\mathrm{dist}(x, y) = \lfloor \log_2 (\mathrm{id}(x) \oplus \mathrm{id}(y)) \rfloor + 1$, and sorts the neighbors from near to far by relative distance, thereby constructing the metadata management node registry shown in fig. 2. This accelerates node positioning.
3) When storing file metadata, nodes and stored files are first uniformly addressed with the hash algorithm h1(x) to obtain the node identifier NodeID and the file identifier fid, respectively. Then, to achieve fast indexing, the metadata management node compares fid with each NodeID in the metadata management node registry and calculates their relative distance, namely $\mathrm{dist}(\mathrm{fid}, \mathrm{NodeID}) = \lfloor \log_2 (\mathrm{fid} \oplus \mathrm{NodeID}) \rfloor + 1$. Finally, the metadata is stored on the management node with the smallest relative distance.
It should be noted that, because the values of the metadata identifiers are completely random, the metadata are also stored in the nodes with the closest relative distances in a random manner, so that the load balance among the system nodes is ensured.
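The load-balancing argument can be checked with a small simulation. SHA-1 again stands in for h1(x), and the node and file names are made up for the example; the sketch only shows how the per-node loads $w_i$ and their average E would be measured, not how evenly a real deployment behaves.

```python
import hashlib
from collections import Counter
from typing import Tuple


def h1(value: str) -> int:
    """Stand-in for h1(x): SHA-1 digest as an integer (assumption)."""
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest(), "big")


def simulate_load(num_nodes: int, num_files: int) -> Tuple[Counter, float]:
    """Place each file's metadata on the XOR-nearest node and measure the loads."""
    node_ids = [h1(f"node-{i}") for i in range(num_nodes)]
    loads = Counter({nid: 0 for nid in node_ids})
    for j in range(num_files):
        fid = h1(f"file-{j}")
        loads[min(node_ids, key=lambda nid: nid ^ fid)] += 1
    average = sum(loads.values()) / num_nodes   # E = (1/n) * sum of w_i
    return loads, average


loads, average = simulate_load(num_nodes=8, num_files=1000)
spread = (min(loads.values()), max(loads.values()))   # inspect how far loads deviate from E
```

With few nodes the individual loads can deviate noticeably from E, because each node's share depends on where its random identifier falls in the identifier space.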
S1-13, constructing a data block index network model:
1) As shown in fig. 1, the logical structure of the data block index network follows the Kademlia algorithm and formally constitutes a binary tree. Node identifiers (nid) and data block identifiers (bid) are uniformly addressed with a hash function h2(x) and form the leaf nodes of the binary tree. Nodes are located by nid and establish neighbor relations with some nearby nodes according to the proximity principle. The state of neighboring nodes is not monitored with periodic heartbeat messages; instead, a weak neighbor relation is established and node state is updated in two ways, identity announcement and access interaction, implemented as follows:
When the system is initialized or a new node joins, the node broadcasts its own node identifier nid and network identification information (ip, port) to other nodes by identity announcement. A node that receives the broadcast first calculates the distance between its own nid and the sender's nid, adds the sender and its nid to the index list of the corresponding k-bucket according to that relative distance, and returns its own nid to the sender;
Further, while nodes are running, the neighbor relation among them is maintained and updated through access interaction: each time a complete message exchange takes place between two nodes, both sides update the state information in their neighbor lists and move the identifier of the peer forward in the corresponding k-bucket. Nodes that interact more often are therefore located faster. Through this mechanism, hot data can be accessed more quickly, and information about cold data or invalid nodes is gradually phased out, releasing storage space.
Through these steps, each index node fills the network location information of its neighbors into the corresponding k-bucket structure according to relative distance, thereby constructing a complete node routing table.
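The two maintenance paths, identity announcement and access interaction, can be sketched with a small k-bucket class. The bucket capacity `k`, the eviction of the last entry, and the method names are assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple

Address = Tuple[str, int]


class KBuckets:
    def __init__(self, self_id: int, k: int = 20) -> None:
        self.self_id = self_id
        self.k = k                                      # assumed bucket capacity
        self.buckets: Dict[int, List[Tuple[int, Address]]] = {}

    def _touch(self, nid: int, address: Address) -> None:
        """Move (or insert) the peer to the front of its bucket; drop the coldest entry."""
        bucket = self.buckets.setdefault((self.self_id ^ nid).bit_length(), [])
        bucket[:] = [entry for entry in bucket if entry[0] != nid]
        bucket.insert(0, (nid, address))
        del bucket[self.k:]                             # entries beyond capacity are phased out

    def on_announce(self, nid: int, address: Address) -> int:
        """Identity announcement: record the sender and reply with our own nid."""
        self._touch(nid, address)
        return self.self_id

    def on_interaction(self, nid: int, address: Address) -> None:
        """Access interaction: a completed message exchange refreshes the peer."""
        self._touch(nid, address)


table = KBuckets(self_id=0, k=2)
reply = table.on_announce(4, ("10.0.0.4", 7000))
table.on_announce(5, ("10.0.0.5", 7000))
table.on_interaction(4, ("10.0.0.4", 7000))     # node 4 becomes the hottest entry
table.on_announce(6, ("10.0.0.6", 7000))        # bucket is full, so the coldest entry (5) is dropped
```

No heartbeat timer appears anywhere: freshness comes only from announcements and real traffic, which matches the weak neighbor relation described above.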
2) The index information is stored as a < key, value > pair whose data structure is < id, (ip, port) >. Since any node in the index network stores only part of the index information, new index data must first locate its own storage node before being stored.
Positioning may start from any node in the index network. Using the neighbor identifiers and routing information recorded by each node, the process repeatedly looks for an index node with a smaller relative distance to the id until it reaches the node with the minimum relative distance. During this repeated interaction and final convergence, the id is continually compared with the identifier (nid) of each node interacted with, and the index information is added to the corresponding k-bucket entry, which makes the index information redundant and ensures its availability.
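The convergence-with-redundancy behavior can be sketched as a walk that leaves a copy of the entry at every node it passes. This reading (every visited node keeps a copy) is an interpretation of the text, and the names below are hypothetical.

```python
from typing import Dict, List, Tuple

Address = Tuple[str, int]


class IndexNode:
    def __init__(self, nid: int) -> None:
        self.nid = nid
        self.index: Dict[int, Address] = {}
        self.neighbors: List[int] = []


def store_index(network: Dict[int, IndexNode], start: int, ident: int, address: Address) -> List[int]:
    """Walk towards the node closest to ident, recording the entry at each hop."""
    visited: List[int] = []
    current = network[start]
    while True:
        current.index[ident] = address          # redundant copy on the path
        visited.append(current.nid)
        closer = [n for n in current.neighbors if (n ^ ident) < (current.nid ^ ident)]
        if not closer:                          # converged on the closest known node
            return visited
        current = network[min(closer, key=lambda n: n ^ ident)]


network = {nid: IndexNode(nid) for nid in (1, 4, 7)}
network[1].neighbors = [4]
network[4].neighbors = [1, 7]
network[7].neighbors = [4]
path = store_index(network, start=1, ident=6, address=("10.0.0.9", 7001))
```

After the call, a later lookup that starts at any node on the path can answer from its own table without further redirection.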
S2 file data query flow
S21 metadata acquisition
As shown in fig. 3 and 4, assuming that the user needs to acquire a file of fid=δ, first a connection needs to be established with any neighboring Metadata Node (MN), and a metadata request message of Get (fid) is transmitted. Then, the Metadata index locates the data storage node through fid, finally obtains Metadata information through proxy query, and returns a response result of (fid, metadata). Taking the query operation performed on node nodeid=n as an example, the specific steps are as follows:
First, calculate the XOR distance d between δ and n, $d = \mathrm{dist}(\delta, n)$. If d is smaller than the distance from δ to any neighbor node, the current node is the metadata storage node, so it queries its database and returns the result; otherwise, it finds the management node nearest to δ by querying the node registry.
Once the target node is found in the node registry, the lookup request is forwarded to that node. If that node holds the metadata identified by δ, the metadata information is returned; otherwise, a message that the target data is missing is returned to the client.
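The one-hop metadata lookup can be sketched as below. The `MetaNode` class, its in-memory registry of peers, and the `full_mesh` helper are hypothetical stand-ins for real network connections; raw XOR is used for the comparison, which yields the same nearest node as dist up to tie-breaking.

```python
from typing import Dict, Iterable, Optional, Tuple


class MetaNode:
    def __init__(self, node_id: int) -> None:
        self.node_id = node_id
        self.registry: Dict[int, "MetaNode"] = {}   # every other management node (full mesh)
        self.store: Dict[int, dict] = {}             # fid -> metadata

    def get(self, fid: int) -> Optional[Tuple[int, dict]]:
        """Handle Get(fid): answer locally or proxy the request to the nearest node."""
        owner_id = min([self.node_id, *self.registry], key=lambda nid: nid ^ fid)
        owner = self if owner_id == self.node_id else self.registry[owner_id]
        metadata = owner.store.get(fid)              # at most one extra hop
        return None if metadata is None else (fid, metadata)


def full_mesh(node_ids: Iterable[int]) -> Dict[int, MetaNode]:
    """Create nodes and register every node with every other node."""
    nodes = {nid: MetaNode(nid) for nid in node_ids}
    for node in nodes.values():
        node.registry = {nid: peer for nid, peer in nodes.items() if nid != node.node_id}
    return nodes


nodes = full_mesh([1, 7, 12])
nodes[7].store[6] = {"name": "example.txt", "blocks": ["bid-a", "bid-b"]}
answer = nodes[1].get(6)      # node 1 proxies the request to node 7
missing = nodes[12].get(5)    # no node holds fid 5
```

Because every node knows every other node, the query touches at most two management nodes regardless of where the client connects.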
S22 data block index
In this embodiment, data block queries send the data block location request get_datanode_route(bid) in a concurrent access manner. Each query therefore ends in one of two states: either the query fails, or feedback is obtained simultaneously from α neighbor nodes.
the indexing step is refined as follows:
firstly, when a client obtains metadata, unpacking and analyzing a data packet to obtain data block identification information of a target file, namely: (bid 1, bid2, …, bid).
Secondly, the client accesses adjacent index nodes, and the index nodes serve as proxy nodes to help the client to acquire the position information of the data block storage nodes, so that the method is further refined:
first, the proxy node will compare bid with its own nid to get the relative distance. Inquiring data items in a section corresponding to the index table according to the logic distance, and returning the position information of the storage node if the corresponding items exist; otherwise, it is necessary to request data from nodes that are closer to the bid logic.
When a proxy node needs to request data from a neighbor node, the data is first requested by a find_value instruction. If none of the neighbor NODEs has the target data block location information, then NODE information closer to the target data is requested via the FIND NODE instruction and the user connection is redirected to the queried NODE.
The communication complexity of the whole query process is $O(\log N)$, where N is the number of index nodes. According to theoretical analysis and experimental verification, the larger the network, the more pronounced the speed-up that the index method brings to data queries.
Then, when the proxy node obtains the location information of the data block, it returns the result (bid, (ip, port)) to the client, and the client sends a get_datablock(bid, (ip, port)) request message to the storage node to obtain the data block information (bid, data block).
Finally, once the minimum number of data blocks required to restore the file has been obtained, the whole query process is complete. A data block whose location information cannot be obtained is assumed by default to be lost, or its storage node offline; the index node feeds this result back to the metadata management node to assist in updating the metadata information.
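The block-location steps above can be sketched as an iterative lookup. This is a simplified, sequential model of the concurrent query: `Node`, `network`, and `locate_blocks` are hypothetical names, `ALPHA = 3` is an assumed concurrency, and FIND_VALUE / FIND_NODE are modelled as dictionary accesses rather than network messages.

```python
ALPHA = 3  # assumed access concurrency; the text does not fix a value


class Node:
    """Hypothetical index node: known neighbor ids plus a bid -> (ip, port) table."""

    def __init__(self, nid, neighbors=(), table=None):
        self.nid = nid
        self.neighbors = list(neighbors)
        self.table = dict(table or {})


def get_datanode_route(network, start, bid, alpha=ALPHA):
    """Return the (ip, port) recorded for bid, or None if the lookup fails."""
    current = network[start]
    visited = set()
    while True:
        visited.add(current.nid)
        if bid in current.table:  # the proxy node itself holds the entry
            return current.table[bid]
        # FIND_VALUE: ask up to alpha unvisited neighbors closest to bid
        candidates = sorted(
            (n for n in current.neighbors if n not in visited),
            key=lambda n: n ^ bid,
        )[:alpha]
        for nid in candidates:
            if bid in network[nid].table:
                return network[nid].table[bid]
        # FIND_NODE: redirect to a neighbor strictly closer to bid
        closer = [n for n in candidates if (n ^ bid) < (current.nid ^ bid)]
        if not closer:
            return None  # no closer node: query failed
        current = network[closer[0]]


def locate_blocks(network, start, bids, needed):
    """Resolve block locations until `needed` blocks are found; report the misses."""
    found, missing = {}, []
    for bid in bids:
        location = get_datanode_route(network, start, bid)
        if location is None:
            missing.append(bid)  # to be reported to the metadata management node
        else:
            found[bid] = location
        if len(found) >= needed:
            break
    return found, missing
```

Each redirect moves to a node strictly closer to `bid`, so the loop always terminates; the `missing` list corresponds to the blocks treated as lost or offline.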
S3, node jitter prediction and processing
As shown in fig. 5, to mitigate the influence of node jitter on data indexing, the index network layer is improved as follows:
1) Node offline behavior is modeled through node jitter prediction, which assists the adaptive adjustment of the index network and safeguards system stability;
2) On the basis of the node jitter model, a matching adaptive index optimization method is provided, which keeps index efficiency and accuracy stable under different node jitter conditions by adjusting the index table entry update rate, the table size, the trust evaluation of index nodes, the ageing time, and other aspects.
S31, node jitter prediction model
The system's estimate of the Node Shake Rate (NSR) depends on statistical modeling of the mobility patterns of terminal devices. This embodiment adopts the time-varying mobility pattern of P2P nodes in a typical city. Accordingly, assuming that node accessibility in the system follows a heavy-tailed Pareto distribution with parameters γ and β, the accessibility probability of any node is:
where $t_\alpha$ is a random variable representing the accessible time of a node, T is a given length of time, and γ and β are the shape and scale parameters of the distribution, respectively; the expression gives the probability that the node's accessible time is less than T. The modeling also introduces $T_{last}$, the time elapsed from the last interaction with a neighbor node to the moment of observation, and $T_{online}$, the time elapsed from the neighbor node coming online to the last interaction. The current availability probability p of any node can then be calculated as:
Rearranging the above formula gives:
Here $t_{online}$ and $T_{online}$ both obey the heavy-tailed Pareto distribution. Once $T_{online}$ is determined, for example by taking the median of that distribution, $T_{last}$ depends only on the node online probability p. Therefore, for the $T_{last}$ that corresponds to $p = P_{min}$, 50% of the neighbor nodes will have an online probability less than $P_{min}$; that is, the availability probability of the index table entries stored by the index nodes of the system is less than $P_{min}$.
The node jitter rate in a ubiquitous storage network is affected by many factors. Leaving aside malicious nodes with attack intent, the availability probability of each kind of node usually shows fairly clear temporal characteristics and behavioral patterns, the most typical being the tidal pattern of network links. To reflect the jitter state of the nodes more accurately, the accessibility probability of the nodes must be measured and corrected in real time.
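The quantities used in the jitter model can be explored numerically with the sketch below. The formulas of this embodiment are not reproduced in the text, so the code assumes one common heavy-tailed form, the shifted Pareto (Lomax) distribution with survival function $(\beta/(\beta+t))^{\gamma}$; the function names are hypothetical and the expressions may differ from the ones actually used in the embodiment.

```python
def cdf(t, gamma, beta):
    """P(accessible time < t) under the assumed shifted Pareto distribution."""
    return 1.0 - (beta / (beta + t)) ** gamma


def median(gamma, beta):
    """Median accessible time: the t at which cdf(t) equals 0.5."""
    return beta * (2.0 ** (1.0 / gamma) - 1.0)


def availability(t_online, t_last, gamma, beta):
    """P(node still online after t_last, given it was online for t_online)."""
    return ((beta + t_online) / (beta + t_online + t_last)) ** gamma


def t_last_for(p_min, t_online, gamma, beta):
    """Largest elapsed time since the last interaction that keeps availability >= p_min."""
    return (beta + t_online) * (p_min ** (-1.0 / gamma) - 1.0)
```

Under this assumption, with γ = 1 and β = 10 the median online time is 10, and a neighbor that had been online for 10 time units drops to an availability of 0.5 after 20 further units without interaction.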
S32, index adjusting method
As shown in fig. 5, the index adjustment method in this embodiment acts on the data block index network. Building on the k-bucket mechanism and the concurrent query form of the Kademlia algorithm, it adaptively adjusts the original parameters according to the result of node jitter prediction and optimizes some mechanisms of the original algorithm. The specific steps are as follows:
First, to ensure the availability of index data, i.e. to maintain $P_{min}$, a node must continuously update its neighbor information, and back up and update those index table entries whose availability probability is less than $P_{min}$. To achieve this goal, the node needs to repair 50% of the entries in the index table within $T_{last}$. A parameter C is introduced to represent the number of new neighbor nodes that must be interacted with per unit time, reflecting the interaction frequency of the system nodes. Substituting the resulting relation into the expression above gives:
Setting the parameter γ = 1, the node availability probability changes linearly, and the average availability probability of the index table entries can be derived as:
In this case, the availability probability of a node is determined by β and K. The parameter C represents the node's ability to update routing data; because the Kademlia algorithm verifies node liveness through explicit updates, a node's data update capability is directly related to the access concurrency α per unit time.
Consider the correspondence between the availability probability of a data item in the index table and the k-bucket capacity K. When βC is fixed, the availability probability of data items in the index table decreases steadily as K increases, which shows that, for a given level of node jitter and data update capability, a larger routing table is harder to maintain. When K is fixed, an increase of βC indicates less node jitter or a greater data update capability, and the availability probability of data in the index table increases.
Therefore, when the node availability probability falls, the redundancy of index data can be raised by increasing the k-bucket capacity, which improves index accuracy, and the hit rate of data queries can be raised by increasing the access concurrency α of the index node, which improves overall index efficiency.
S33, updating index table item
Jitter is divided into 3 levels according to the node availability probability: slight jitter, general jitter, and severe jitter. The update frequency is adjusted according to the jitter level, and the update result is fed back to the metadata management node to assist in updating the metadata.
1) When the node jitter level is "slight jitter", the system takes no action, and data is refreshed by the self-updating and repair mechanism of the Kademlia algorithm;
2) When the node jitter level reaches "general jitter", the per-access concurrency α and the access frequency are gradually increased, with update period T. Before each adjustment period starts, the jitter condition of the system nodes is re-evaluated: if the node jitter persists, the update frequency is raised according to the gradient descent principle; if the node jitter eases, the update frequency is gradually lowered, and the access concurrency and access frequency are reduced until the stable level is restored.
3) When the node jitter level reaches "severe jitter", the adjustment is similar to that for "general jitter", but to avoid sharp fluctuations of the system, the updating of failed nodes and data should be accelerated further. The adjustment period is shortened continuously by a fixed ratio, and the access concurrency α and the access frequency are increased continuously according to the gradient ascent principle until the nodes leave the "severe jitter" state.
Maintaining high access concurrency and access frequency consumes a significant amount of link bandwidth and adds computational overhead. Therefore, when the node jitter level falls, the update frequency and the concurrency are gradually reduced to save system resources.
The core of the index adjustment method is continuous monitoring and adjustment driven by the node jitter prediction model, i.e. a cycle of "predict and monitor, adjust and repair, monitor again". To balance system performance and cost, the related parameters of the index algorithm can be rolled back gradually once the node state tends to be stable, avoiding additional overhead.
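The three-level adjustment policy can be sketched as a small controller. The text names the levels but gives no numeric thresholds or step sizes, so the cut-offs (0.8 and 0.5), the baseline parameters, and the increments below are illustrative assumptions, as are the names `jitter_level`, `IndexParams`, and `adjust`.

```python
from dataclasses import dataclass

# Assumed thresholds on the node availability probability.
SLIGHT_MIN = 0.8
GENERAL_MIN = 0.5


def jitter_level(p):
    """Map an availability probability to one of the three jitter levels."""
    if p >= SLIGHT_MIN:
        return "slight"
    if p >= GENERAL_MIN:
        return "general"
    return "severe"


@dataclass
class IndexParams:
    alpha: int = 3        # access concurrency (assumed baseline)
    period: float = 60.0  # update period T in seconds (assumed baseline)


BASE = IndexParams()


def back_off(params):
    """Step one notch back toward the baseline to save bandwidth."""
    return IndexParams(
        alpha=max(BASE.alpha, params.alpha - 1),
        period=min(BASE.period, params.period * 1.25),
    )


def adjust(params, p, prev_p):
    """Return the parameters for the next adjustment period."""
    level = jitter_level(p)
    if level == "slight":
        return back_off(params)
    if level == "general":
        if p <= prev_p:  # jitter persists or worsens: update faster
            return IndexParams(params.alpha + 1, params.period * 0.9)
        return back_off(params)  # jitter is easing
    # severe: shorten the period by a fixed ratio and keep raising alpha
    return IndexParams(params.alpha + 2, params.period * 0.5)
```

Calling `adjust` once per period with the measured availability probability reproduces the re-evaluation step: parameters ramp up while jitter persists and drift back to the baseline once it eases.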
As shown in fig. 6, the embodiment of the present invention further provides a system implementing the ubiquitous-storage-oriented distributed data query method of the above embodiment, which includes the following modules:
Hierarchical index network model building module: the hierarchical index network model comprises a metadata management network and a data block index network; a decentralized metadata management network is constructed in a fully-connected networking mode, and a data block index network is constructed based on the distributed hash table;
Metadata and data block query module: constructs a hierarchical data index interaction paradigm and realizes collaborative management and quick query of metadata and data blocks;
Node jitter prediction and processing module: adopts a node availability probability prediction and index network adjustment method, establishes a node jitter prediction model based on the heavy-tailed Pareto distribution, and adjusts the index network according to the intensity of the jitter.
The foregoing embodiments provide detailed implementation steps on the premise of the technical solution of the present invention, and the accompanying drawings only show one embodiment of the present invention, which can be used as a reference for related application scenarios, and the protection scope of the present invention is not limited to the foregoing embodiments.
Claims (9)
1. A distributed data query method oriented to ubiquitous storage is characterized by comprising the following steps:
step 1, constructing a hierarchical index network model, wherein the hierarchical index network model comprises a metadata management network and a data block index network: constructing a decentralised metadata management network in a fully-connected networking mode; constructing a data block index network based on the distributed hash table;
step 2, constructing a hierarchical data index interaction paradigm, and realizing collaborative management and quick query of metadata and data blocks.
2. The distributed data query method for ubiquitous storage according to claim 1, wherein in step 1: the metadata management network is constructed by the following modes: the metadata management node performs networking in a full-connection mode, and processes a client query request; meanwhile, the metadata information is updated and maintained according to the state changes of the data and the storage nodes.
3. The distributed data query method for ubiquitous storage according to claim 1, wherein in step 1: the data block index network is constructed by the following steps: the data block index node refers to the distributed hash table and builds an index network topology based on point-to-point connection; and simultaneously, auxiliary management of storage data is performed, and the state of the storage nodes is monitored and updated.
4. The distributed data query method for ubiquitous storage according to claim 2, wherein in step 1, the metadata management network is constructed as follows:
1) The metadata management network is composed of a plurality of metadata management nodes networked in a full-connection mode; each metadata management node records and maintains partial metadata information, and the average data load of the nodes is as follows:
$E = \frac{1}{n}\sum_{i=1}^{n} w_i$
wherein E is the average data load, $w_i$ is the data volume recorded by the i-th node at a certain time point, and n is the number of nodes;
2) Each metadata management node obtains its own identity through a hash algorithm h1(x), and nodes and metadata are located by maintaining a metadata management node registry; the registry records the identity NodeID and routing information of the other nodes in order of their exclusive or distance to this node, from near to far, and the calculation formula is expressed as follows:
$\mathrm{dist}(x, y) = \left[\log_2 \mathrm{XOR}(\mathrm{id}(x), \mathrm{id}(y))\right] + 1$
wherein dist is the relative distance between nodes, namely the exclusive or distance, x and y respectively represent two different nodes, and id(x) and id(y) represent the node identities obtained after calculation;
3) Stored files are uniformly addressed with the same hash algorithm h1(x) that is used to compute node identities; according to the proximity principle, the metadata information of a file is stored on the node nearest in exclusive or distance, namely $x \oplus y$, where x and y represent the node identification value and the metadata identification value, respectively.
5. The distributed data query method for ubiquitous storage according to claim 3, wherein the data block index network construction step in step 1 is specifically as follows:
1) The data block index network performs networking by referring to the data structure of the Kademlia algorithm, and finally forms a tree-shaped network topology structure; the index nodes and the data block identifiers bid are uniformly addressed by adopting a hash algorithm h2(x), the position information of the nodes and the data blocks serves as the leaves of the tree, the position of each node is uniquely determined by the hash value of its identifier, and the nodes establish neighbor relations according to the relative distance between them;
2) The index table adopts a non-clustered index mode, namely, the nodes do not store the specific data but record the storage position information of the data; thus, the data structure of the key-value pair is <id, (ip, port)>, where id represents the identification information of the node or data block, and (ip, port) corresponds to the network location and interaction interface of the index node or data storage node.
6. The distributed data query method for ubiquitous storage according to any one of claims 1 to 5, wherein in step 2, a user first establishes a communication connection with a neighboring metadata management node and transmits a file identification fid of a target file to the node in a metadata query process; then, the node performs exclusive OR operation on the file identification and the node identification to obtain a relative distance dist; finally, the node searches the metadata management node registry to find the metadata management node with the minimum relative distance, forwards the request message to the node, and finally returns the response result to the user in a proxy mode;
in the process of indexing the data block, firstly unpacking the metadata to obtain the data block identifier bid; then establishing a connection between the user and an adjacent index node, and sending a data block query request; after receiving the request, the index node judges whether the position information of the target data block is stored in its own index table: if yes, directly returning the position information of the data block; otherwise, querying, in a proxy mode, a node whose relative distance to the bid is smaller, and redirecting the index request of the user to that node; iterating continuously until the node closest to the bid is found and the data are acquired, or, if the target data index information cannot be found, returning a query failure message.
7. The ubiquitous storage-oriented distributed data query method according to any one of claims 1 to 5, further comprising step 3: establishing a node jitter prediction model based on the heavy-tailed Pareto distribution by adopting a node availability probability prediction and index network adjustment method, and adjusting correspondingly according to the intensity of the jitter.
8. The distributed data query method for ubiquitous storage according to claim 7, wherein the step 3 is specifically as follows:
1) Node jitter prediction, comprising:
building a node jitter prediction model: modeling access loss conditions caused by node offline or network fluctuation in a storage network based on a jitter prediction algorithm of mobile node behavior characteristics and a statistical rule, and simultaneously providing a quantization standard of node jitter degree;
correction of the model: correcting the model based on the network state history data and the node behavior rules;
2) Dynamically adjusting the data redundancy and the access concurrency of the data block index network under different jitter conditions according to the node jitter prediction model and the actual measurement result;
3) And adjusting the updating frequency of the index item according to the node failure probability, and accelerating the updating of the failure index item by adjusting the data access frequency and the query concurrency.
9. A system based on the ubiquitous storage oriented distributed data query method according to any of claims 1-8, characterized by comprising the following modules:
the hierarchical index network model building module: the hierarchical index network model comprises a metadata management network and a data block index network, and a decentralised metadata management network is constructed in a fully-connected networking mode; constructing a data block index network based on the distributed hash table;
metadata and data block query module: and constructing a hierarchical data index interaction paradigm, and realizing collaborative management and quick query of metadata and data blocks.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310165096.6A CN116881320A (en) | 2023-02-21 | 2023-02-21 | Distributed data query method and system oriented to ubiquitous storage |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116881320A true CN116881320A (en) | 2023-10-13 |
Family
ID=88264949
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||