CN112800023B - Multi-model data distributed storage and hierarchical query method based on semantic classification - Google Patents

Multi-model data distributed storage and hierarchical query method based on semantic classification

Info

Publication number
CN112800023B
Authority
CN
China
Prior art keywords
node
data
name
query
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011473262.1A
Other languages
Chinese (zh)
Other versions
CN112800023A (en)
Inventor
舒红章
王冲
牛中盈
胡琦
赵子路
胡占尧
李龙鸣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Aerospace Science And Technology Network Information Development Co ltd
Beijing Institute of Computer Technology and Applications
Original Assignee
Aerospace Science And Technology Network Information Development Co ltd
Beijing Institute of Computer Technology and Applications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Aerospace Science And Technology Network Information Development Co ltd and Beijing Institute of Computer Technology and Applications
Priority to CN202011473262.1A
Publication of CN112800023A
Application granted
Publication of CN112800023B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/214 Database migration support
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1446 Point-in-time backing up or restoration of persistent data
    • G06F11/1448 Management of the data involved in backup or backup restore
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2272 Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2201/00 Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/80 Database-specific techniques

Abstract

The invention relates to a multi-model data distributed storage and hierarchical query method based on semantic classification. The method comprises three parts: semantic classification storage of multi-model data; data query and query word management; and distributed primary and backup data migration with secondary index updating. The semantic classification storage of multi-model data comprises: performing a preliminary semantic classification for storage; initializing the index; storing the index, in which each node of the secondary index is added as a key-value pair to an index class held in the memory of the distributed metadata nodes, for use by other nodes that do not know which node, database or class holds the data being queried, while the primary index is maintained by each node's local database; performing classified storage of the multi-model data; performing data semantic query and database query; and managing the query words. The invention reduces the number of nodes that must actually be accessed during a query and thereby reduces query communication overhead.

Description

Multi-model data distributed storage and hierarchical query method based on semantic classification
Technical Field
The invention relates to a network data storage and query technology, in particular to a multi-model data distributed storage and hierarchical query method based on semantic classification.
Background
With the spread of social networks and the rapid development of the industrial internet, the volume of network data stored and transmitted keeps growing. A traditional storage system keeps data on a centralized server and therefore cannot handle the storage, query and transmission of massive data efficiently and safely. A distributed storage system spreads data across different servers, and the servers can be scaled out, which effectively addresses both the heavy storage burden of massive data and single points of failure.
Most distributed storage systems apply a consistent hashing algorithm to data names and node IP addresses, so that data whose hash values fall within a given range are assigned to the same node. This reduces the effect of adding or removing nodes on the data partition mapping.
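As a rough illustration of the consistent hashing described above, the following minimal sketch places node IP addresses and data names on the same hash ring. The class name HashRing, the helper ring_hash and the use of MD5 are assumptions made for this example and are not taken from the patent.

```python
import bisect
import hashlib


def ring_hash(text: str) -> int:
    """Map a string to a ring position (MD5 is only an illustrative choice)."""
    return int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical consistent-hash ring over node IP addresses."""

    def __init__(self, node_ips):
        # Each node sits on the ring at the hash of its IP address.
        self._ring = sorted((ring_hash(ip), ip) for ip in node_ips)

    def add_node(self, ip):
        bisect.insort(self._ring, (ring_hash(ip), ip))

    def node_for(self, data_name):
        # A data name belongs to the first node clockwise from its hash.
        positions = [pos for pos, _ in self._ring]
        i = bisect.bisect_right(positions, ring_hash(data_name))
        return self._ring[i % len(self._ring)][1]
```

In this sketch, adding a node only reassigns the data names that now fall just before the new node on the ring; every other name keeps its previous node.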
In existing databases, the key values of the records in a data set are sorted, and each sorted key value is stored together with the position of its record in a new, smaller data set, which is the index table.
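A minimal sketch of such an index table, assuming records are held in a list and addressed by list position. The function names build_index and lookup are hypothetical.

```python
import bisect


def build_index(records):
    """Return a sorted list of (key, position) pairs for (key, value) records."""
    return sorted((key, pos) for pos, (key, _value) in enumerate(records))


def lookup(index, records, key):
    """Binary-search the index table, then fetch the record at that position."""
    i = bisect.bisect_left(index, (key, -1))
    if i < len(index) and index[i][0] == key:
        return records[index[i][1]][1]
    return None
```

The index table holds only keys and positions, so it stays much smaller than the data set it points into.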
HowNet (the "know net") is a common-sense knowledge base, mainly in Chinese and English, whose basic content is the relations between senses (meaning items) and between the attributes of senses.
HowNet has two main notions: the sense and the sememe. A sense is one paraphrased meaning of a word, and a word may have several senses; a sememe is the basic unit used to describe a sense. Each sense has a sense number that distinguishes it from all other senses. HowNet defines 8 relations between sememes, of which the most important is the hypernym-hyponym (upper-lower) relation, which is the basis for calculating semantic similarity.
The hypernym-hyponym (upper-lower) relationship diagram of the sememes is as follows.
Figure GDA0003894151980000021
Wherein:
"," between several attributes indicates an "and" relationship.
"#" indicates "associated with it".
"*" indicates "can 'V' or is mainly used for 'V'", i.e. the agent or the instrument.
"!" indicates that an attribute is a sensitive attribute, such as "taste" for "food", "height" for "mountain range", "temperature" for "celestial body", etc.
"^" indicates absence or inability.
"[ ]" marks the common attribute of the concept.
In the study "Research on a vocabulary semantic similarity calculation method based on HowNet", Li Fangfang et al. propose a method for calculating word similarity with HowNet. The method mainly comprises sememe similarity calculation and sense similarity calculation. The sense similarity calculation comprises the similarity of the main feature semantic description part, the similarity of the sets of secondary feature semantic description parts, and the similarity of the relation feature semantic description part; the weighted average of these three results is the similarity value between the senses.
The similarity of two sets of secondary feature semantic description parts is calculated as follows:
1. Calculate the similarity value between every pair of elements drawn from the two sets, forming a set of similarity values.
2. Select the pair with the largest similarity value as a match, and delete from the set all similarity values involving either of the two elements.
3. Repeat step 2 until all similarity values have been deleted.
4. Pair each unmatched element in either set with an empty element, and set that similarity value to a very small value.
5. Calculate the average of the similarity values of all matched pairs; this average is the similarity value between the two sets.
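The five steps above can be sketched as a greedy matching. This is only an illustration: the function name set_similarity, the pairwise function sim supplied by the caller, the default empty_sim value and the treatment of two empty sets are all assumptions, not values given in the source.

```python
def set_similarity(set_a, set_b, sim, empty_sim=0.01):
    """Greedy best-pair matching between two collections of elements.

    sim(a, b) is any pairwise similarity in [0, 1]; empty_sim is the
    "very small value" used for elements left without a partner.
    """
    if not set_a and not set_b:
        return 1.0  # assumption: two empty sets are treated as identical
    # Step 1: similarity of every cross pair, largest first.
    pairs = sorted(
        ((sim(a, b), i, j) for i, a in enumerate(set_a) for j, b in enumerate(set_b)),
        reverse=True,
    )
    used_a, used_b, values = set(), set(), []
    # Steps 2-3: repeatedly take the best pair whose elements are both unmatched.
    for value, i, j in pairs:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            values.append(value)
    # Step 4: leftover elements are paired with an empty element.
    values.extend([empty_sim] * (max(len(set_a), len(set_b)) - len(values)))
    # Step 5: the average over all pairs is the set similarity.
    return sum(values) / len(values)
```

Sorting all cross pairs once and skipping pairs whose elements are already matched gives the same result as deleting the related similarity values after each match.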
The multiple data models include:
Class: a category of data, equivalent to a table in an ordinary database. Unlike a table, in which each record consists of attribute values in multiple columns, each record in a class is stored as a key-value pair.
Key-value model: the key is saved in the database index and can be queried directly by the database; the value may be a simple or a complex model, such as a document or a graph. The database first searches for the specified key among all keys and then reads the value under that key.
Document model: a document consists of multiple key-value pairs, each of which is an instance of the key-value model. Documents can be grouped by class, that is, one class may hold multiple documents.
Graph model: a graph consists of vertices and the edges connecting them. Vertices and edges inherit from classes, that is, each vertex or edge can store multiple documents and key-value data.
Block: a vertex, an edge or a class is one block.
Documents, vertices and edges are generally not nested within one another, because nesting creates redundancy.
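One possible way to picture these models in code is shown below. The type names DataClass, Vertex and Edge and their fields are hypothetical; the source does not prescribe a concrete representation.

```python
from dataclasses import dataclass, field


@dataclass
class DataClass:
    """A class: a named group of key-value pairs and documents (one block)."""
    name: str
    key_values: dict = field(default_factory=dict)  # key-value model
    documents: list = field(default_factory=list)   # each document is a dict


@dataclass
class Vertex(DataClass):
    """A graph vertex inherits from a class, so it can hold documents too."""


@dataclass
class Edge(DataClass):
    """A graph edge also inherits from a class and links two vertices by name."""
    source: str = ""
    target: str = ""


def block_count(items):
    """Every class, vertex or edge counts as exactly one block."""
    return sum(1 for item in items if isinstance(item, DataClass))
```

Edges refer to vertices by name rather than embedding them, which matches the remark that documents, vertices and edges are not nested in one another.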
The existing World Wide Web uses search engines to query stored multi-model data. When a user enters a query word, the search engine must quickly return data whose wording resembles the query word, and must also quickly return data whose wording differs from the query word but whose meaning is similar.
Existing consistent hashing partitions data but generally cannot classify it semantically. Data with similar semantics therefore tends to be scattered across many different nodes or databases, related data must be fetched from many nodes at query time, and the query communication overhead is high.
In a conventional ordinary index, the leaf nodes of the index are the specific data. When such an index is stored on the local nodes, a query must traverse all nodes directly, so the query range is large and the query is slow. The existing secondary index improves on the ordinary index: all key values serve as index entries, and the index does not contain the specific data. Such secondary indexes are either kept in the memory of a metadata server, where the metadata cluster locates the data during a query but the index consumes a great deal of memory when the data volume is large, or kept on the disk of an ordinary location server, where fetching the index and locating data is slow. In both cases, the data located by these secondary indexes is still likely to be spread across many different nodes.
In existing database index queries, the query word is matched directly against the key values in the index without any semantic matching, so the results correlate poorly with the meaning of the query word. When a user submits a query word through a search engine, the content obtained is incomplete and of low relevance.
Disclosure of Invention
The invention aims to provide a multi-model data distributed storage and hierarchical query method based on semantic classification that solves the above problems in the prior art.
The multi-model data distributed storage and hierarchical query method based on semantic classification disclosed by the invention comprises: semantic classification storage of multi-model data; data query and query word management; and distributed primary and backup data migration with secondary index updating. The semantic classification storage of multi-model data comprises: performing a preliminary semantic classification for storage; initializing the index; storing the index, in which each node of the secondary index is added as a key-value pair to an index class held in the memory of the distributed metadata nodes, for use by other nodes that do not know which node, database or class holds the data being queried, while the primary index is maintained by each node's local database; performing classified storage of the multi-model data; performing data semantic query and database query; and managing the query words.
The invention classifies multi-model data semantically and then stores it in a distributed manner, which makes it easier for a search engine to query the data semantically. Compared with consistent hashing, fewer nodes need to be accessed during a query, and the query communication overhead is reduced.
Drawings
FIG. 1 is a flow chart of the multi-model data distributed storage and hierarchical query method based on semantic classification according to the present invention;
FIG. 2 is a flow chart of multi-model semantic classification storage;
FIG. 3 is the secondary index structure tree;
FIG. 4 is a data query and query word management diagram;
FIG. 5 is an example diagram of distributed storage and hierarchical-index semantic query;
FIG. 6 is a flow chart of distributed primary and backup data migration and secondary index updating when a node is added;
FIG. 7 is a block data migration model diagram.
Detailed Description
To make the objects, contents and advantages of the present invention clearer, embodiments of the invention are described in detail below with reference to the accompanying drawings and examples.
The multi-model data distributed storage and hierarchical query method based on semantic classification disclosed by the invention mainly comprises three parts: semantic classification storage of multi-model data; data query and query word management; and distributed primary and backup data migration with secondary index updating. As shown in FIG. 1, the method is divided into seven steps: preliminary semantic classification for storage, index initialization, index storage, classified storage of multi-model data, data semantic query and database query, query word management, and distributed primary and backup data migration with secondary index updating.
FIG. 2 is a flow chart of multi-model semantic classification storage. As shown in FIG. 2, the first part, semantic classification storage of multi-model data, includes the following.
The preliminary semantic classification for storage comprises:
For a cluster, a name related to the data to be stored is first drawn up manually from experience. For each node, database and class of the cluster, a category name is likewise drawn up manually according to the different categories of data to be stored. Based on each drafted category name, a program looks up in HowNet the sense name most similar to the drafted name, together with its HowNet sense number, and uses that sense name as the formal category name; some sememes of that sense are taken as semantic keywords.
The index initialization includes:
(1) The secondary index initially consists of the existing cluster name, node names, database names, class names and semantic words (keywords) similar to those names. During querying, an internal program automatically adds the names of newly created databases and classes, the keys of high-frequency key-value pairs, and high-frequency keywords.
(2) The primary index is the index generated automatically by the node's local database.
The index storage includes:
Secondary index storage: each node of the secondary index is added as a key-value pair to an index class in the memory of the distributed metadata nodes, for use by other nodes that do not know which node, database or class holds the data being queried.
Primary index storage: the primary index is maintained by the node's local database.
FIG. 3 is the secondary index structure tree. Each node in the tree (a tree node, as opposed to a server node) is a key-value pair, and the key-value pairs of its child nodes form the value of the parent's key-value pair. A node code is a combination of letters and digits other than the IP address; it indicates that the node may have one or more similar sibling nodes. The sense name is the name of the word's sense in HowNet, and the sense number is the number of that sense in HowNet. To locate data, the similarity between the key-value pair of each parent node in a layer and the data to be stored or the word to be queried is calculated first; then, for the parent nodes with higher similarity in that layer, the similarity between their child nodes and the data to be stored or the word to be queried is calculated, and so on down to the deepest leaf nodes. The node, database or class in which to store or query is then located through the index path from the leaf node with the highest similarity back to the root node.
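The layer-by-layer descent through the index tree might look like the sketch below. The names IndexNode and locate, the keep parameter (how many of the most similar children are followed under each parent) and the caller-supplied sim function are assumptions for illustration; the source only says that the parents "with higher similarity" are expanded.

```python
from dataclasses import dataclass, field


@dataclass
class IndexNode:
    """Hypothetical tree node: a name plus child nodes (the parent's value)."""
    name: str
    children: list = field(default_factory=list)


def locate(root, query, sim, keep=2):
    """Return (score, path) for the most similar leaf reached by the descent.

    path lists names from the root down to the leaf, i.e. the index path
    that identifies the cluster, node, database and class.
    """
    frontier = [(root, [root.name])]
    best = None
    while frontier:
        next_frontier = []
        for node, path in frontier:
            if not node.children:  # leaf: candidate storage or query location
                score = sim(node.name, query)
                if best is None or score > best[0]:
                    best = (score, path)
                continue
            # Follow only the children most similar to the query.
            ranked = sorted(node.children, key=lambda c: sim(c.name, query), reverse=True)
            next_frontier.extend((c, path + [c.name]) for c in ranked[:keep])
        frontier = next_frontier
    return best
```

With a small keep value, whole subtrees of weakly related nodes are never scored, which is the source of the reduced query range.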
The multi-model data classification storage comprises the following steps:
(1) Multiple related key-value pairs and documents to be stored are stored in one class. For a graph model to be stored, the data of each vertex or edge is stored in one or more tables, each table being a class; different vertices or edges are stored independently, that is, they may be stored on the same node or on different nodes.
For a key-value pair to be stored, the method of Li Fangfang et al. is used: HowNet is searched for the existing node name, database name, class name and keyword name most closely related to the key, which determines the node name, database name and class name, together with the sense numbers of these three. If an existing database name or class name is sufficiently related to the key, the key-value pair is stored in that class. If the existing database names or class names are only weakly related to the key, the sense most closely related to the key, and its number, are looked up in HowNet; the English sense name or the sense number is used as the name of a new database or class created on the most closely related node; the key-value pair is stored there; and the semantic words recorded for the node holding the new database or class are supplemented accordingly.
For document data or graph model data to be stored, all keys of the key-value pairs in the document, vertex or edge are treated as one set, and the semantic name and keywords of each parent node in the top layer of the secondary index are treated as another set. Using the set similarity calculation for secondary feature semantic description parts from the paper of Li Fangfang et al., the similarity between every pair of keys from the two sets is calculated and averaged with weights; the result is the similarity between the document, vertex or edge and each parent node of that layer. The parent nodes in that layer most closely related to the keys of the document, vertex or edge are selected, the same calculation is applied to their child nodes in the next layer, and so on down to the deepest leaf nodes. The database or class of the leaf node whose similarity is the highest and exceeds a given value is found, and the document data, or the vertex or edge data of the graph model, is stored there. If the document data or the vertex or edge data to be stored is too weakly related to every existing class or database, then for each key of the document, vertex or edge in turn, the average of its similarity values to all keys of that document, vertex or edge is calculated; the key with the largest average is used to look up in HowNet the most closely related sense and its number; the name or number of that sense is used as the name of a new class or database created on the most closely related node; the document, vertex data or edge data is stored there; and the similar-word sense names recorded for the node holding the new database or class are supplemented accordingly.
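The store-or-create decision for a single key-value pair can be sketched as follows. Everything named here is hypothetical: place_key_value, the threshold of 0.6, the sim function and the nearest_sense callback, which stands in for the HowNet lookup of the most closely related sense name.

```python
def place_key_value(classes, key, value, sim, nearest_sense, threshold=0.6):
    """Store key/value in the most related existing class, or create a new one.

    classes maps class name -> dict of stored key-value pairs.
    Returns the name of the class that received the pair.
    """
    best = max(classes, key=lambda name: sim(name, key), default=None)
    if best is not None and sim(best, key) >= threshold:
        classes[best][key] = value  # an existing class is related enough
        return best
    # No sufficiently related class: name a new class after the nearest sense.
    new_name = nearest_sense(key)
    classes.setdefault(new_name, {})[key] = value
    return new_name
```

For documents, vertices and edges, the same decision would use a set-to-set similarity over all of their keys instead of the single-key similarity used here.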
FIG. 4 is the data query and query word management diagram. The second part, data semantic query and query word management, includes:
1. Data semantic query and database query, including:
(1) When the search engine queries data, the query word is first matched semantically through HowNet, at a metadata node, against the secondary index to find the secondary index words closest to it, which identifies the node, database and class most likely to hold the corresponding data; several sememes closely related to the query word are also found in HowNet. A more specific query is then carried out on the ordinary nodes.
(2) After the secondary index has identified an ordinary node and the one or more classes covered by that node's primary index, the database searches in turn for the keys close to each sememe of the query word and returns the corresponding values. Documents and graph-model vertex or edge data are then returned in descending order of the number of sememes they contain.
(3) If no corresponding data is found in the secondary index of the metadata nodes, either a null value is returned or each ordinary node is traversed in turn to query its primary index.
FIG. 5 is an example diagram of distributed storage and hierarchical-index semantic query. A user enters a query word at the client, and the query word is passed to metadata node 1. Node 1 performs a semantic query of the secondary indexes on metadata node 2 and metadata node 3; node 2 and node 3 return the secondary index entries with high relevance, and these entries point to node 5. Node 1 then sends a query request to node 5. Node 5 queries, through its database, the keys corresponding to the sememes of the query word, and transmits the values, documents and graph vertex or edge data to node 1 in descending order of the number of sememes matched. Node 1 returns the data to the client.
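The final ordering step on the ordinary node can be illustrated with the sketch below. It simplifies "keys close to a sememe" to exact key equality, and the function name rank_by_sememes is an assumption.

```python
def rank_by_sememes(documents, sememes):
    """Order documents by how many query sememes appear among their keys.

    documents is a list of dicts; documents matching no sememe are dropped.
    Exact key equality stands in for the semantic key matching of the method.
    """
    scored = []
    for doc in documents:
        hits = sum(1 for s in sememes if s in doc)
        if hits:
            scored.append((hits, doc))
    # Stable sort: documents with equal hit counts keep their stored order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]
```

Returning the documents that match the most sememes first corresponds to the "descending order of the number of sememes" in the description.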
2. The management of the query words comprises the following:
For each key stored on disk and each query word, the ordinary node counts the number of times it is queried within a period of time.
However often a key is queried, the ordinary node does not change the primary index of its local database.
For query words and keys that are queried many times: if a query word and a key are identical, one of them is selected as the high-frequency keyword; if they differ, the keyword and its index data are converted into a secondary index entry and sent to a metadata node, where the entry is added to the secondary index class in the metadata node's memory so that the ordinary node can be located semantically at the metadata node. An identical keyword and query word correspond to a single index entry, and their query counts are added together.
The metadata nodes count the number of times each keyword is queried. When there are too many keywords, the keywords that have been queried rarely over a longer period, together with their separate index entries, are deleted.
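A minimal sketch of this bookkeeping follows, assuming a fixed keyword capacity on the metadata node and a simple count threshold on the ordinary node. The names high_frequency and KeywordStats, the capacity and the threshold are illustrative assumptions; the source gives no concrete limits.

```python
from collections import Counter


def high_frequency(node_counts, threshold):
    """On an ordinary node: pick the words and keys queried at least threshold times."""
    return {word: n for word, n in node_counts.items() if n >= threshold}


class KeywordStats:
    """On a metadata node: merged query counts with eviction of cold keywords."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = Counter()

    def merge(self, keyword, hits):
        # The same keyword reported again shares one entry; counts are summed.
        self.counts[keyword] += hits
        self._evict()

    def _evict(self):
        # When there are too many keywords, drop the least queried ones.
        while len(self.counts) > self.capacity:
            coldest = min(self.counts, key=self.counts.get)
            del self.counts[coldest]
```

A real deployment would also have to remove the evicted keyword's entry from the secondary index class; that step is omitted here.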
FIG. 6 is a flow chart of distributed primary and backup data migration and secondary index updating when a node is added. As shown in FIG. 6, the third part, distributed primary and backup data migration and secondary index updating, includes:
1. Distributed primary and backup data migration and secondary index updating
(1) When a node is added, the total storage space utilization and the total number of data blocks of each node are counted, sorted and averaged, where one vertex or edge counts as one block and one class counts as one block.
(2) If no node's storage space is full, the node with the largest total number of blocks is selected; otherwise, go to step (5).
(3) The sizes of the primary blocks and of the backup blocks of the node's local data are counted and sorted separately.
(4) While the node's total storage space utilization is greater than the average utilization over all nodes, the node's remaining block count and remaining space utilization are both greater than the average, and the selected blocks' count and total size, as a share of the new node, are both below the average, a primary block at an odd (or even) position in the ordering and a backup block at an even (or odd) position that does not duplicate a selected primary block are selected in turn. Once these conditions on the remaining blocks and on the selected blocks no longer hold, all selected blocks are moved to the new node. If, after selection, the selected block count and space occupancy are still below the per-node average, the node is removed from the block-count ranking and from the storage-utilization ranking; if the node has been removed from both rankings, return to step (2); otherwise, end.
(5) If a node's storage space is full, the node with the highest storage space utilization is selected, and steps (3) and (4) are repeated.
(6) During data migration, the node name, database name, class name and key of the class where the data resided before migration are looked up in the secondary index class; once all of them match, they are replaced with the node name, database name, class name and key after migration. The third step is then repeated.
(7) When nodes are removed, the remaining nodes are sorted by used-space rate from largest to smallest and treated one by one, in that order, as the newly added node of step (1), and data migration and secondary index updating are performed.
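The index rewrite in step (6) amounts to replacing every exact match of the old location with the new one. In this sketch a location is modeled as a (node, database, class, key) tuple and the secondary index class as a plain list; both are simplifying assumptions, as is the name update_index_paths.

```python
def update_index_paths(entries, old, new):
    """Replace index entries equal to old with new; return how many changed.

    Each entry is a (node, database, class, key) tuple. Only entries whose
    four parts all match the pre-migration location are rewritten.
    """
    replaced = 0
    for i, entry in enumerate(entries):
        if entry == old:
            entries[i] = new
            replaced += 1
    return replaced
```

Requiring all four parts to match keeps entries for other keys of the same class untouched when only one block moves.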
FIG. 7 is a block data migration model diagram. Node 1 stores the data and keeps one backup copy. When node 2 is added, node 1 sorts its primary blocks and its backup blocks by size separately, and then moves some of the odd-position classes among the primary blocks and some of the even-position classes among the backup blocks to the newly added node. When the number of nodes grows from 2 to 4, the primary blocks and backup blocks on node 1 and node 2 are sorted by size again, and blocks are moved in the same way as from node 1 to node 2.
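The odd/even selection can be sketched as below. This is a simplified illustration: a single max_blocks budget replaces the averages of block count and space used in steps (2) to (4), sorting is by size in descending order, and the name select_blocks is hypothetical.

```python
from itertools import zip_longest


def select_blocks(primary, backup, max_blocks):
    """Pick blocks to move to a new node.

    primary and backup map block id -> size. Primary blocks are taken from
    odd positions (1st, 3rd, ...) of the size ordering and backup blocks from
    even positions (2nd, 4th, ...), skipping backups of candidate primaries.
    """
    odd_primary = sorted(primary, key=primary.get, reverse=True)[0::2]
    candidates = set(odd_primary)
    even_backup = [b for b in sorted(backup, key=backup.get, reverse=True)[1::2]
                   if b not in candidates]
    moved_primary, moved_backup = [], []
    for p, b in zip_longest(odd_primary, even_backup):
        if p is not None and len(moved_primary) + len(moved_backup) < max_blocks:
            moved_primary.append(p)
        if b is not None and len(moved_primary) + len(moved_backup) < max_blocks:
            moved_backup.append(b)
    return moved_primary, moved_backup
```

Taking primaries and backups from alternating positions means a block's primary copy and its backup copy do not both end up on the new node.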
According to the invention, developers firstly classify the data categories mainly stored in each node according to HowNet semanteme, and search nodes with corresponding semantemes for multi-model data to be stored and then store the nodes in a distributed manner, so that users can inquire the data semanteme through a search engine in the world wide web conveniently.
The index of semantic query by a search engine is divided into two stages, the second-stage index is stored in a distributed metadata cluster node memory, and the first-stage index is the index of a node local database. When the search engine is used for semantic query, a secondary index path is determined through semantic correlation, so that nodes, databases and classes which are possibly located with larger data are located, then a plurality of sememes are searched for at the nodes, primary indexes of the databases are sequentially searched, key value pairs of the sememes and documents, vertexes and edges which contain more sememes are located, and the data which are related to the sememes are returned.
The invention provides a multi-model data distributed storage method classified by semantics, which is a distributed two-stage index query method for facilitating a user to carry out fast semantic query on the data through a search engine, so that when the search engine carries out semantic query, nodes, databases and classes which are possibly located with high speed are located, and then the database query is carried out by using a plurality of sememes of query words on specific nodes.
The invention adopts two-stage index, which is convenient for users to inquire data through a search engine; the secondary indexes are semantic tree indexes and can be stored in a metadata node memory, compared with the common secondary indexes, the secondary indexes can quickly locate nodes with strong semantic correlation but less quantity, reduce the quantity of nodes to be accessed during query, and avoid traversing each node of a cluster during query compared with the single-level indexes which are all stored in the local nodes; the first-level index belongs to the index of the local database, so that the query lexical senses and the key values to be searched can be quickly matched, the amount of the second-level index is reduced, and compared with the single-level index which is completely stored in the metadata node, the metadata memory space required by the index is reduced.
The semantic query of the invention enables a search engine to quickly obtain query results that are more comprehensive and closer to the user's needs than those obtained by character matching.
The above is only a preferred embodiment of the present invention. It should be noted that those skilled in the art may make various improvements and modifications without departing from the technical principle of the present invention, and such improvements and modifications shall also fall within the protection scope of the present invention.

Claims (8)

1. A multi-model data distributed storage and hierarchical query method based on semantic classification, comprising: semantically classified storage of multi-model data, data query and query-word management, and distributed primary and backup data migration with secondary index updating;
the multi-model data semantic classification storage comprises the following steps:
performing storage preliminary semantic classification;
index initialization is carried out;
index storage is carried out: each node of the secondary index is added, as a key-value pair, to an index class in distributed metadata node memory, and is used when another node querying data does not know the node, database or class where the data is located; the primary index is stored by the node's local database;
carrying out multi-model data classification storage;
performing data semantic query and database query;
managing the query words;
wherein, index initialization is performed, comprising:
(1) The secondary index initially consists of the existing cluster name, node names, database names, class names and their near-synonyms; during querying, the names of newly created databases and classes, the keys of high-frequency key-value pairs, and high-frequency keywords are added;
(2) The primary index is an index generated by a node local database;
each node in the structure tree of the secondary index is a key-value pair, and the key-value pairs of child nodes form the value of the parent node's key-value pair; the node code indicates that the node may have one or more similar sibling nodes; the sense name is the sense name of a HowNet word, and the sense number is the number of that sense in HowNet; when the relatedness of data to be stored or of words to be queried is compared, the similarity between the sense name and keywords of each parent node in the upper layer of the secondary index tree and each key of the data to be stored or each word to be queried is calculated, the similarity between a parent node of that layer and the data or words being the weighted average of these per-key similarities; the same calculation is then applied to the child nodes of the parent nodes with higher similarity, and so on until the deepest leaf nodes are reached; the node, database or class in which to store or query is then located through the index path from the leaf node with the highest similarity to the root node.
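The layer-by-layer descent through the secondary index tree recited in claim 1 can be sketched as a small beam search. This is an illustrative reading, not the patented implementation: the `IndexNode` class, the `beam` width, the `exact_match` placeholder similarity (standing in for a HowNet-based measure), and the choice to score each key by its best match among a node's name and keywords are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class IndexNode:
    name: str                      # sense name of the node
    keywords: list                 # semantic keywords attached to the node
    children: list = field(default_factory=list)


def exact_match(a, b):
    """Placeholder similarity; a real system would use a HowNet-based measure."""
    return 1.0 if a == b else 0.0


def node_score(node, keys, weights, sim):
    # Weighted average, over the keys, of each key's best match against the node.
    terms = [node.name, *node.keywords]
    total = sum(w * max(sim(k, t) for t in terms) for k, w in zip(keys, weights))
    return total / sum(weights)


def locate(root, keys, weights, sim, beam=2):
    """Descend layer by layer, keeping the `beam` best nodes per layer."""
    frontier = [(root, (root.name,))]
    leaves = []
    while frontier:
        scored = []
        for node, path in frontier:
            for child in node.children:
                score = node_score(child, keys, weights, sim)
                scored.append((score, child, path + (child.name,)))
        scored.sort(key=lambda item: item[0], reverse=True)
        frontier = []
        for score, child, path in scored[:beam]:
            if child.children:
                frontier.append((child, path))
            else:
                leaves.append((score, path))
    # The best leaf's path back to the root names the node, database and class.
    return max(leaves, key=lambda item: item[0]) if leaves else None


cars = IndexNode("cars", ["vehicle", "wheel"])
fruit = IndexNode("fruit", ["apple"])
root = IndexNode("cluster", [], [
    IndexNode("node1", ["vehicle"], [IndexNode("transport_db", ["vehicle"], [cars])]),
    IndexNode("node2", ["food"], [IndexNode("food_db", ["food"], [fruit])]),
])
best = locate(root, ["vehicle", "wheel"], [2.0, 1.0], exact_match)
```

The returned path reads root to leaf, so its elements correspond to cluster, node, database and class in that order.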
2. The multi-model data distributed storage and hierarchical query method based on semantic classification as claimed in claim 1, wherein performing preliminary semantic classification for storage comprises:
a name related to the data to be stored is drawn up for the cluster, and a data-category name is drawn up for each node, database and class of the cluster according to the different categories of data to be stored; for each drafted category name, the sense name and sense number closest to it are looked up in HowNet, the sense name is used as the formal category name, and the definition of that sense is taken as the semantic keywords.
3. The multi-model data distributed storage and hierarchical query method based on semantic classification as claimed in claim 1, wherein performing multi-model data classification storage comprises:
a plurality of related key-value pairs and document data to be stored are stored in a class; for a graph model to be stored, each kind of vertex or edge data is stored in one or more tables, each table being a class;
for a key-value pair to be stored, the existing node name, database name, class name and keyword most related to its key are found through HowNet, thereby determining the node, database and class in which to store it; if an existing database or class is strongly related to the key, the key-value pair is stored in that class; if the existing databases or classes are only weakly related to the key, the sense most related to the key and its number are looked up in HowNet, a new database or class named with the English sense name or the sense number is created on the most related node, the key-value pair is stored there, and the near-synonyms of the node hosting the database or class are supplemented with the sense name;
for document data or graph model data to be stored, all keys of the key-value pairs in a document, or in a vertex or edge of the graph model, are regarded as one set, and the sense name and keywords of each parent node in the first layer of the secondary index are regarded as another set; the similarity between every pair of keys drawn from the two sets is calculated by the set similarity calculation method of the secondary-feature sememe description part, and the results are weighted and averaged, giving the similarity between the document, vertex or edge and each parent node of that layer; the parent nodes of that layer that are more related to all keys of the document, vertex or edge are selected, and the same calculation is applied to their child nodes in the next layer, until the deepest leaf nodes are reached; the database or class of the leaf node whose similarity is highest and exceeds a given value is found, and the document data or the vertex or edge data of the graph model is stored there; if the document data or the vertex or edge data to be stored is only weakly related to the existing classes or databases, then for each key of the document, vertex or edge in turn, the average of its similarity to all other keys of the document, vertex or edge is calculated; the key with the largest average is used to look up the most related sense and its number in HowNet; a new class or database named with the English sense name or the sense number is created on the most related node, the document, vertex or edge data is stored there, and the near-synonyms of the node hosting the database or class are supplemented with the sense name.
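The placement decision of claim 3 (store into the best-matching existing class, or create a new class named after the most representative key) can be sketched as follows. This is a simplified illustration under stated assumptions: the toy `SEMEMES` lexicon and Jaccard `sim` replace HowNet and the patent's set similarity method, all pairs are weighted equally, the `threshold` value is arbitrary, and the new class is named with the representative key itself rather than with a HowNet sense name.

```python
# Toy sememe lexicon; a hypothetical stand-in for HowNet.
SEMEMES = {
    "car": {"vehicle", "land"},
    "truck": {"vehicle", "land", "cargo"},
    "wheel": {"vehicle", "part"},
    "apple": {"fruit"},
    "stone": {"mineral"},
    "rock": {"mineral"},
}


def sim(a, b):
    """Jaccard overlap of sememe sets (an assumed relatedness measure)."""
    sa, sb = SEMEMES.get(a, {a}), SEMEMES.get(b, {b})
    return len(sa & sb) / len(sa | sb)


def set_similarity(keys, terms):
    # Average of all pairwise similarities between the two sets (equal weights).
    pairs = [sim(k, t) for k in keys for t in terms]
    return sum(pairs) / len(pairs)


def representative_key(keys):
    # The key whose mean similarity to all other keys is largest.
    def mean_to_others(k):
        others = [o for o in keys if o != k]
        return sum(sim(k, o) for o in others) / len(others) if others else 0.0
    return max(keys, key=mean_to_others)


def choose_class(doc, classes, threshold=0.3):
    """Return (class_name, is_new) for a document given existing classes."""
    keys = list(doc)
    scores = {name: set_similarity(keys, terms) for name, terms in classes.items()}
    best = max(scores, key=scores.get) if scores else None
    if best is not None and scores[best] > threshold:
        return best, False
    # Weakly related to everything: name a new class after the representative key.
    return representative_key(keys), True


classes = {"vehicles": ["car", "truck"], "fruit": ["apple"]}
```

With these toy classes, `choose_class({"truck": 1, "wheel": 4}, classes)` selects the existing `vehicles` class, while a document whose keys are `stone` and `rock` falls below the threshold everywhere and triggers creation of a new class.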
4. The multi-model data distributed storage and hierarchical query method based on semantic classification as claimed in claim 1, wherein data semantic query and database query comprise:
when data is queried, the search engine first uses the query word at a metadata node to find, through HowNet semantics, the secondary index words closest to the query word, thereby finding the nodes, databases and classes where the corresponding data may exist, and finds in HowNet several sememes strongly related to the query word; the specific query is then carried out on the ordinary nodes;
after the one or more classes pointed to by the matching primary index have been found through the secondary index, the ordinary node queries the database in turn for key-value pairs close to each sememe of the query word and returns their values, and then returns the document data or the vertex or edge data of the graph model in descending order of the number of sememes contained;
if no corresponding data is found in the secondary index of the metadata nodes, a null value is returned, or each ordinary node is traversed and its primary index is queried in turn.
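The ranking and fallback behaviour of claim 4 can be sketched as below. The data layout is hypothetical: `secondary` maps a sememe directly to a node name, `nodes` maps a node name to a list of documents represented as dictionaries, and the `traverse_on_miss` flag selects between the two fallbacks the claim allows (return a null value, or traverse every ordinary node).

```python
def rank_documents(docs, sememes):
    """Order documents by how many query sememes they contain, most first."""
    hits = [(sum(1 for s in sememes if s in doc), doc) for doc in docs]
    hits = [hit for hit in hits if hit[0] > 0]
    hits.sort(key=lambda hit: hit[0], reverse=True)  # stable sort keeps input order on ties
    return [doc for _, doc in hits]


def query(sememes, secondary, nodes, traverse_on_miss=True):
    # Use the secondary index to pick the ordinary nodes worth visiting.
    targets = sorted({secondary[s] for s in sememes if s in secondary})
    if not targets:
        if not traverse_on_miss:
            return None              # fallback 1: report a null result
        targets = sorted(nodes)      # fallback 2: traverse every ordinary node
    docs = [doc for name in targets for doc in nodes[name]]
    return rank_documents(docs, sememes)


nodes = {
    "n1": [{"vehicle": 3}, {"vehicle": 1, "cargo": 2}],
    "n2": [{"fruit": 1}],
}
secondary = {"vehicle": "n1"}
```

The traversal fallback trades latency for recall: it finds data the secondary index does not yet cover, at the cost of visiting every ordinary node.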
5. The multi-model data distributed storage and hierarchical query method based on semantic classification as claimed in claim 1, wherein the management of query words comprises:
each ordinary node counts, over a period of time, the number of times each key of the key-value pairs stored on its disk is queried and the number of times each query word is used;
for query words and keys that are queried many times: if a query word and a key are identical, one of them is selected as a high-frequency keyword; if they differ, both are selected as keywords; the keywords and their index data are converted into secondary index entries, sent to the metadata node, and merged into the secondary index class in metadata node memory, so that ordinary nodes can be located semantically at the metadata node; an identical keyword and query word share one index entry, and the query counts of a query word and a key are added together when the two match.
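The bookkeeping in claim 5 can be sketched with a counter that merges identical query words and keys, then promotes frequent terms into the secondary index. The `QueryStats` class, the fixed `threshold`, and the `location` tuple are hypothetical names chosen for illustration; the claim does not specify how "queried many times" is decided.

```python
from collections import Counter


class QueryStats:
    """Per-node query counters (hypothetical helper, not from the patent)."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.counts = Counter()

    def record(self, query_word, matched_key):
        # An identical query word and key share a single counter entry.
        for term in {query_word, matched_key}:
            self.counts[term] += 1

    def high_frequency_terms(self):
        return sorted(t for t, c in self.counts.items() if c >= self.threshold)


def promote(stats, secondary, location):
    """Merge high-frequency terms into the secondary index held by the metadata node."""
    for term in stats.high_frequency_terms():
        secondary.setdefault(term, location)  # keep an existing entry if present
    return secondary


stats = QueryStats(threshold=2)
stats.record("car", "car")        # identical word and key: counted once
stats.record("car", "vehicle")
stats.record("boat", "vehicle")
index = promote(stats, {}, ("node1", "transport_db", "vehicles"))
```

Here `car` and `vehicle` each reach the threshold and are promoted, while `boat`, queried only once, stays out of the secondary index.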
6. The multi-model data distributed storage and hierarchical query method based on semantic classification as claimed in claim 1, wherein the distributed primary and backup data migration and secondary index updating comprise:
(1) When a node is added, the total storage space utilization and the total number of data blocks of every node are counted, sorted and averaged;
(2) If the storage space of the nodes is not full, the node with the largest total number of existing blocks is selected; otherwise, go to step (5);
(3) The primary blocks and the backup blocks of the node's local data are each counted and sorted by size;
(4) If the node's total storage space utilization is greater than the average over all nodes, then, while the node's remaining total block count and total space utilization are both greater than the averages and the total count and total space of the selected blocks occupying the new node are both less than the averages, primary blocks at odd (or even) positions in the ordering, and backup blocks at even (or odd) positions that do not duplicate the selected primary blocks, are selected in turn, until the remaining block count and space utilization, or the selected block count and space, no longer satisfy these conditions, and the selected blocks are moved to the new node; if, after this selection, the node's block count and space utilization are less than the averages over all nodes, the node is deleted from the ordering by total data block count and from the ordering by total storage space utilization; if neither ordering is empty after the deletion, return to step (2); otherwise, end;
(5) If the storage space of the nodes is full, the node with the highest storage space utilization is selected, and the procedure returns to step (3).
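Steps (1) to (5) can be approximated by a simplified greedy rebalancing loop. This sketch deliberately departs from the claim in several labeled ways: it balances on block count only and ignores storage space utilization, it fixes the choice to odd-ranked primary blocks and even-ranked backup blocks, and it omits the full-storage branch of step (5). The block tuple layout `(data_id, size, kind)` is a hypothetical representation in which a primary block and its backup share the same `data_id`.

```python
def rebalance_to_new_node(nodes, new_node):
    """nodes: node name -> list of (data_id, size, kind), kind 'primary' or 'backup'."""
    nodes[new_node] = []
    # Step (1), simplified: average block count across all nodes, new one included.
    avg_blocks = sum(len(blocks) for blocks in nodes.values()) / len(nodes)
    # Step (2), simplified: visit donors from most blocks to fewest.
    donors = sorted((n for n in nodes if n != new_node),
                    key=lambda n: len(nodes[n]), reverse=True)
    for donor in donors:
        def by_size(kind):
            # Step (3): sort the donor's blocks of one kind by size, largest first.
            return sorted((b for b in nodes[donor] if b[2] == kind),
                          key=lambda b: b[1], reverse=True)

        # Step (4), simplified: odd-ranked primaries, then even-ranked backups
        # whose data is not already among the selected primaries.
        primaries = by_size("primary")[0::2]
        chosen_ids = {b[0] for b in primaries}
        backups = [b for b in by_size("backup")[1::2] if b[0] not in chosen_ids]
        for block in primaries + backups:
            if len(nodes[donor]) <= avg_blocks or len(nodes[new_node]) >= avg_blocks:
                break  # donor is down to average, or the new node is filled to average
            nodes[donor].remove(block)
            nodes[new_node].append(block)
    return nodes


cluster = {
    "a": [(1, 10, "primary"), (2, 8, "primary"), (3, 6, "backup"), (4, 4, "backup")],
    "b": [(3, 6, "primary"), (4, 4, "primary")],
}
rebalance_to_new_node(cluster, "c")
```

Alternating ranks when choosing primaries and backups, and skipping backups whose primary was already chosen, keeps a block and its copy from landing on the same new node, which appears to be the intent of the non-duplication condition in step (4).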
7. The multi-model data distributed storage and hierarchical query method based on semantic classification as claimed in claim 6, wherein,
during data migration, the node name, database name, class name and key of the key-value pair of the class where the data resided before migration are looked up in the secondary index class, and where they correspond completely, they are replaced with the node name, database name, class name and key of the key-value pair after migration.
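The index update of claim 7 amounts to rewriting every secondary index entry whose stored location matches the pre-migration location exactly. A minimal sketch, assuming the secondary index is a dictionary from index term to a hypothetical `(node, database, class, key)` tuple:

```python
def update_secondary_index(secondary, old_path, new_path):
    """Replace entries that exactly match old_path; return how many were changed."""
    updated = 0
    for term, path in secondary.items():
        if path == old_path:          # only a complete match is rewritten
            secondary[term] = new_path
            updated += 1
    return updated


secondary = {
    "car": ("node1", "transport_db", "vehicles", "car"),
    "auto": ("node1", "transport_db", "vehicles", "car"),
    "apple": ("node2", "food_db", "fruit", "apple"),
}
changed = update_secondary_index(
    secondary,
    ("node1", "transport_db", "vehicles", "car"),
    ("node3", "transport_db", "vehicles", "car"),
)
```

Requiring a complete match leaves entries for other classes on the same node untouched.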
8. The multi-model data distributed storage and hierarchical query method based on semantic classification as claimed in claim 6, wherein, when nodes are removed, the remaining nodes are sorted by space utilization from largest to smallest, and each is treated in turn as the newly added node of step (1) for data migration and secondary index updating.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011473262.1A CN112800023B (en) 2020-12-11 2020-12-11 Multi-model data distributed storage and hierarchical query method based on semantic classification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011473262.1A CN112800023B (en) 2020-12-11 2020-12-11 Multi-model data distributed storage and hierarchical query method based on semantic classification

Publications (2)

Publication Number Publication Date
CN112800023A CN112800023A (en) 2021-05-14
CN112800023B true CN112800023B (en) 2023-01-10

Family

ID=75806733

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011473262.1A Active CN112800023B (en) 2020-12-11 2020-12-11 Multi-model data distributed storage and hierarchical query method based on semantic classification

Country Status (1)

Country Link
CN (1) CN112800023B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114564928B (en) * 2022-02-25 2024-02-27 北京圣博润高新技术股份有限公司 File management method, device, equipment and storage medium for office system
CN116579344B (en) * 2023-07-12 2023-10-20 吉奥时空信息技术股份有限公司 Case main body extraction method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108021592A (en) * 2016-11-04 2018-05-11 上海大学 A kind of Unstructured Data Management for ARTBEATS DESKTOP TECHNOLOGY NTSC field
CN112000851A (en) * 2020-08-28 2020-11-27 北京计算机技术及应用研究所 Key value model, document model and graph model data unified storage method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9679041B2 (en) * 2014-12-22 2017-06-13 Franz, Inc. Semantic indexing engine

Also Published As

Publication number Publication date
CN112800023A (en) 2021-05-14

Similar Documents

Publication Publication Date Title
US9171065B2 (en) Mechanisms for searching enterprise data graphs
US7392250B1 (en) Discovering interestingness in faceted search
US6738759B1 (en) System and method for performing similarity searching using pointer optimization
US6618727B1 (en) System and method for performing similarity searching
US8176052B2 (en) Hyperspace index
Park et al. Keyword search in relational databases
US20140310260A1 (en) Using persistent data samples and query-time statistics for query optimization
Ilyas et al. Adaptive rank-aware query optimization in relational databases
US20070271228A1 (en) Documentary search procedure in a distributed system
CN112800023B (en) Multi-model data distributed storage and hierarchical query method based on semantic classification
Gou et al. A* search: an efficient and flexible approach to materialized view selection
CN110032676B (en) SPARQL query optimization method and system based on predicate association
Si et al. Query optimization for broadcast database
Álvarez-García et al. Compact and efficient representation of general graph databases
CN108804580B (en) Method for querying keywords in federal RDF database
Özsoyoǧlu et al. Querying web metadata: Native score management and text support in databases
CN114911826A (en) Associated data retrieval method and system
Ding et al. An Efficient Relational Database Keyword Search Scheme Based on Combined Candidate Network Evaluation
Yu et al. A tree-based indexing approach for diverse textual similarity search
Sheng et al. A knowledge-based approach to effective document retrieval
Yan et al. RDF knowledge graph keyword type search using frequent patterns
Kwon et al. G-Index Model: A generic model of index schemes for top-k spatial-keyword queries
Zhao et al. Organizing structured deep web by clustering query interfaces link graph
Hui Research on keyword indexing algorithm based on big data
Lee et al. Hybrid Index Structure based on MBB Approximation for Linked Data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant