CN111897911B - Unstructured data query method and system based on secondary attribute graph

Info

Publication number: CN111897911B
Application number: CN202010529960.2A
Authority: CN
Inventors: 沈志宏; 赵子豪; 周园春
Original assignee: Computer Network Information Center of CAS
Current assignee: Computer Network Information Center of CAS
Priority date: 2020-06-11
Filing date: 2020-06-11
Publication date: 2021-08-31
Anticipated expiration: 2040-06-11
Also published as: CN111897911A

Abstract

The invention discloses a method and a system for querying unstructured data based on a secondary attribute graph. The method comprises the following steps: 1) for a target database, taking the unstructured data of each record in the database as a primary attribute of the corresponding record; 2) extracting the intrinsic information in each primary attribute as a secondary attribute graph of that primary attribute; 3) extending the query language of the target database by adding a semantic operator "->", and extending the query engine of the target database so that it compiles and executes query statements conforming to the syntax of the semantic operator "->"; 4) the query engine looks up, in a cache system, cached results that satisfy the query conditions; if no matching result exists, it searches the target database for matching records according to the primary attributes in the query conditions, extracts secondary attribute graphs from the primary attributes of the matching records, matches them against the secondary attribute graphs in the query conditions, and returns the matching results.

Description

Unstructured data query method and system based on secondary attribute graph

Technical Field

The invention relates to the technical fields of unstructured data, data query languages, artificial intelligence, and graph data models. It provides a method and a system for representing and querying unstructured data based on a secondary attribute graph, addressing the fact that the prior art cannot conveniently query the information in unstructured data and has only weak capability to extract and represent that information.

Background

Unstructured data accounts for a large proportion of network data; pictures, sound recordings, videos, and long plain texts are all unstructured data. Technologies for storing and querying structured data are mature, and solutions for storing and managing structured data are well established. However, as technology advances, data sources become broader, data volumes grow, and data forms become more complex. In many application scenarios, engineers must handle not only structured data with a canonical format but also semi-structured data with a self-describing structure and even unstructured data with no fixed structure. Because of its flexible structure, such data is highly extensible and can express information very freely. But because its format is unconstrained, the storage and management of unstructured data has been a problem that has troubled the industry for many years.

Current management and query techniques for unstructured data focus primarily on retrieval based on metadata, such as file name, size, file category, and tag values. Such simple retrieval cannot make full use of AI techniques and cannot directly query or consume the information contained in the unstructured data, which makes unstructured data difficult to query and use. Some artificial intelligence methods can already extract information from unstructured data, for example speech-to-text conversion, face recognition, and license plate number extraction, and the related algorithms achieve fairly high accuracy. However, because AI algorithms have complex dependencies, are difficult to deploy, and differ greatly from tool to tool, using them to obtain the information in unstructured data is inconvenient.

As unstructured data keeps growing and AI algorithms become more accurate and more varied, it is of great significance to develop a method and a system capable of quickly querying the information in unstructured data.

Disclosure of Invention

Aiming at the problem of querying and representing the information in unstructured data, the invention provides a method and a system for representing and querying that information based on a secondary attribute graph, with an implementation based on a graph database. The method represents the information in unstructured data as a secondary attribute graph, extracts information about specified scenes by using AI algorithms, describes the scenes in the unstructured data by means of the secondary attribute graph, and obtains the information in the unstructured data by querying the secondary attribute graph, thereby achieving flexible representation and fast querying of the scene information in unstructured data.

The technical scheme adopted by the invention is as follows:

A method for querying unstructured data based on a secondary attribute graph, comprising the following steps:

1) for a target database, taking the unstructured data of each record in the target database as the primary attribute of the corresponding record;

2) extracting the intrinsic information in each primary attribute, and extracting nodes and the attributes of the nodes from the intrinsic information to construct an attribute graph that serves as the secondary attribute graph of the primary attribute; wherein, in the secondary attribute graph, "()" represents a node, "{ }" represents the attribute set of a node, and "-[ ]-" represents an edge between nodes;

3) extending the query language of the target database: setting the symbols "()", "{ }", and "-[ ]-" for describing the intrinsic information of the unstructured data, and setting a secondary attribute graph extraction symbol "->", wherein the symbol "->" is a binary connector whose left side is a primary attribute and whose right side is the name of a secondary attribute graph; the semantic operator "->" is used in the form 'a->b', meaning that the secondary attribute graph named b in the primary attribute a is queried; and extending the query engine of the target database so that it compiles and executes query statements conforming to the syntax of the semantic operator "->";

4) the query engine looks up, in the cache system, cached results that satisfy the query conditions, and returns the cached results if there is a match; if there is no matching cached result, it searches the target database for matching records according to the primary attribute in the query conditions, extracts the secondary attribute graph from the primary attribute of each matching record, matches each against the secondary attribute graph in the query conditions, and returns the matching results.
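The control flow of step 4) can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name `run_query`, the representation of a secondary attribute graph as a list of (subject, relation, object) triples, and the stand-in extractor `fake_extract` are all assumptions made for the example. Writing the result back to the cache follows the cache design described later in the text.

```python
from typing import Callable, Dict, List, Tuple

# Assumed representation: a secondary attribute graph is a list of triples.
Triple = Tuple[str, str, str]
Graph = List[Triple]


def run_query(
    records: List[dict],
    primary_attr: str,
    graph_name: str,
    wanted: Triple,
    extract: Callable[[bytes, str], Graph],
    cache: Dict[tuple, List[dict]],
) -> List[dict]:
    """Return the records whose secondary graph contains the `wanted` triple."""
    key = (primary_attr, graph_name, wanted)
    if key in cache:
        # A cached result satisfies the query conditions: return it directly.
        return cache[key]
    # Otherwise find the records that carry the primary attribute ...
    matched = [r for r in records if primary_attr in r]
    result = []
    for record in matched:
        # ... extract the secondary attribute graph from each of them ...
        graph = extract(record[primary_attr], graph_name)
        # ... and match it against the graph given in the query conditions.
        if wanted in graph:
            result.append(record)
    cache[key] = result  # stored to speed up later identical queries
    return result


calls: List[str] = []


def fake_extract(data: bytes, graph_name: str) -> Graph:
    """Hypothetical stand-in for an AI algorithm: parses 'a REL b' text."""
    calls.append(graph_name)
    subject, relation, obj = data.decode().split()
    return [(subject, relation, obj)]


records = [
    {"name": "a", "photo": b"boy SIT_ON horse"},
    {"name": "b", "photo": b"girl NEXT_TO car"},
    {"name": "c"},  # no unstructured primary attribute
]
cache: Dict[tuple, List[dict]] = {}
query = ("boy", "SIT_ON", "horse")
first = run_query(records, "photo", "sceneGraph", query, fake_extract, cache)
second = run_query(records, "photo", "sceneGraph", query, fake_extract, cache)
```

The second call is answered from the cache, so the extractor runs only for the two records that have a `photo` attribute during the first call.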

Further, the node contains category and attribute set information, and the edge contains category information.

Further, the query is performed either by directly inputting the primary attribute and the secondary attribute graph, or by first inputting information in the secondary attribute graph to query for secondary attribute graphs and then inputting the primary attribute and querying the secondary attribute graph selected from those returned.

Furthermore, an algorithm mapping library is established, in which the correspondence between each AI algorithm and the different secondary attribute graphs is set, so that different AI algorithms can be called to extract the corresponding secondary attribute graphs from the primary attributes.
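One way to picture the algorithm mapping library is as a registry keyed by the name of the secondary attribute graph. The sketch below is only an assumption about its shape: the class name `AlgorithmMappingLibrary`, its methods, and the two stub extractors are hypothetical and stand in for real AI models.

```python
from typing import Callable, Dict

# An extractor takes the raw primary attribute and returns a graph (as a dict here).
Extractor = Callable[[bytes], dict]


class AlgorithmMappingLibrary:
    """Maps secondary attribute graph names to the algorithm that extracts them."""

    def __init__(self) -> None:
        self._by_graph_name: Dict[str, Extractor] = {}

    def register(self, graph_name: str, extractor: Extractor) -> None:
        self._by_graph_name[graph_name] = extractor

    def extract(self, graph_name: str, primary_value: bytes) -> dict:
        try:
            extractor = self._by_graph_name[graph_name]
        except KeyError:
            raise LookupError(f"no algorithm registered for {graph_name!r}") from None
        return extractor(primary_value)


def stub_scene_model(data: bytes) -> dict:
    # Hypothetical stand-in for a scene-understanding model.
    return {"graph": "sceneGraph", "bytes_seen": len(data)}


def stub_location_model(data: bytes) -> dict:
    # Hypothetical stand-in for a person-position model.
    return {"graph": "locationGraph", "bytes_seen": len(data)}


library = AlgorithmMappingLibrary()
library.register("sceneGraph", stub_scene_model)
library.register("locationGraph", stub_location_model)
scene = library.extract("sceneGraph", b"\x00\x01\x02")
```

Keeping the mapping in one place means the query engine only needs the graph name from the right-hand side of "->" to pick the algorithm.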

An unstructured data query system based on a secondary attribute graph, comprising an information extractor, a task scheduler and a query engine, wherein:

the information extractor is used for extracting the unstructured data of each record from the target database as the primary attribute of the corresponding record, calling the task scheduler to extract the intrinsic information in each primary attribute, and then extracting nodes and the attributes of the nodes from the intrinsic information to construct an attribute graph that serves as the secondary attribute graph of the primary attribute; wherein, in the secondary attribute graph, "()" represents a node, "{ }" represents the attribute set of a node, and "-[ ]-" represents an edge between nodes;

the task scheduler is used for calling different AI algorithms to extract different secondary attribute graphs from the intrinsic information of the primary attributes;

the query engine is used for looking up, in the cache system, cached results that satisfy the query conditions, and returning the cached results if there is a match; if there is no matching cached result, it searches the target database for matching records according to the primary attribute in the query conditions, extracts the secondary attribute graph from the primary attribute of each matching record, matches each against the secondary attribute graph in the query conditions, and returns the matching results;

the query language of the target database is extended: the symbols "()", "{ }", and "-[ ]-" are set for describing the intrinsic information of unstructured data, and a secondary attribute graph extraction symbol "->" is set, wherein the symbol "->" is a binary connector whose left side is a primary attribute and whose right side is the name of a secondary attribute graph; the semantic operator "->" is used in the form 'a->b', meaning that the secondary attribute graph named b in the primary attribute a is queried; and the query engine of the target database is extended so that it compiles and executes query statements conforming to the syntax of the semantic operator "->".
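To make the binary connector "->" concrete, the following sketch parses a single extraction expression into its parts. It is an illustration only: the regular expression, the `GraphExtraction` tuple and the assumption that the left side is written as `variable.attribute` (as in the query examples of this document) are choices made for the example, not the grammar of the extended engine.

```python
import re
from typing import NamedTuple


class GraphExtraction(NamedTuple):
    variable: str           # the bound record or node, e.g. "n"
    primary_attribute: str  # left operand of "->", e.g. "photo"
    graph_name: str         # right operand of "->", e.g. "locationGraph"


# Assumed surface form: <variable>.<primary attribute> -> <graph name>
_EXTRACTION = re.compile(r"^\s*(\w+)\.(\w+)\s*->\s*(\w+)\s*$")


def parse_extraction(expression: str) -> GraphExtraction:
    """Split an 'a->b' expression into variable, primary attribute and graph name."""
    match = _EXTRACTION.match(expression)
    if match is None:
        raise ValueError(f"not a secondary graph extraction: {expression!r}")
    return GraphExtraction(*match.groups())


parsed = parse_extraction("n.photo->locationGraph")
```

A real engine would fold this rule into its grammar and syntax tree rather than use a regular expression; the point here is only that "->" takes exactly two operands.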

A method for querying unstructured data based on a secondary attribute graph, comprising the following steps:

1) for a graph database based on an attribute graph model, in which nodes represent entities and edges represent the relationships between entities: taking the attribute data of each entity as the primary attribute of the corresponding node, extracting the intrinsic information in each primary attribute, and then extracting nodes and the attributes of the nodes from the intrinsic information to construct an attribute graph that serves as the secondary attribute graph of the primary attribute; wherein, in the secondary attribute graph, "()" represents a node, "{ }" represents the attribute set of a node, and "-[ ]-" represents an edge between nodes;

2) extending the Cypher query language of the graph database: setting the symbols "()", "{ }", and "-[ ]-" for describing the intrinsic information of unstructured data, and setting a secondary attribute graph extraction symbol "->"; the semantic operator "->" is a binary operator whose left side is a primary attribute and whose right side is a secondary attribute graph, and it means extracting the content of the secondary attribute graph from the primary attribute; and extending the Cypher query engine of the graph database so that it parses the query statement input by the user into a syntax tree and generates an execution plan;

3) the Cypher query engine looks up, in the cache, cached results that satisfy the query conditions, and returns the cached results if there is a match; if there is no matching cached result, it searches the graph database for matching nodes according to the primary attribute in the query conditions, extracts the secondary attribute graph from the primary attribute of each matching node, matches each against the secondary attribute graph in the query conditions, and returns the matching results.

An unstructured data query system based on a secondary attribute graph, comprising an information extractor, a task scheduler and a Cypher query engine, wherein:

the information extractor is used for extracting the unstructured data of each record of the graph database as the primary attribute of the corresponding record, calling the task scheduler to extract the intrinsic information in each primary attribute, and then extracting nodes and the attributes of the nodes from the intrinsic information to construct an attribute graph that serves as the secondary attribute graph of the primary attribute; wherein, in the secondary attribute graph, "()" represents a node, "{ }" represents the attribute set of a node, and "-[ ]-" represents an edge between nodes;

the Cypher query engine is used for looking up, in the cache, cached results that satisfy the query conditions, and returning the cached results if there is a match; if there is no matching cached result, it searches the graph database for matching nodes according to the primary attribute in the query conditions, extracts the secondary attribute graph from the primary attribute of each matching node, matches each against the secondary attribute graph in the query conditions, and returns the matching results;

the Cypher query language of the graph database is extended: the symbols "()", "{ }", and "-[ ]-" are set for describing the intrinsic information of unstructured data, and a secondary attribute graph extraction symbol "->" is set; the semantic operator "->" is a binary operator whose left side is a primary attribute and whose right side is a secondary attribute graph, and it means extracting the content of the secondary attribute graph from the primary attribute; and the Cypher query engine of the graph database is extended so that it parses the query statement input by the user into a syntax tree and generates an execution plan.

The method for querying unstructured data information based on the secondary attribute graph comprises the following steps:

1) In the original database, unstructured data is represented as an attribute of a database record (hereinafter referred to as a primary attribute).

2) Some intrinsic information in the unstructured data (the primary attribute) is defined as a secondary attribute graph. Information in the same primary attribute is represented in the form of a graph, such as: (:person {type: "boy"})-[:SIT_ON]-(:horse {color: "white"}).

3) On the basis of step 2), the query language of the database is extended to add the capability of describing the intrinsic information of unstructured data: "()" represents a node, "{ }" represents an attribute set, and "-[ ]-" represents an edge. The symbols "()", "{ }", and "-[ ]-" are symbols of the graph data query language, and they are used to represent the graph structure in the secondary attribute graph. Both nodes and edges carry a category, and a node may also carry an attribute set. The category can be set freely by the user and marks the kind of entity, for example Person, Car, or Article; category information is used to label nodes and edges and to narrow the search range. In particular, the invention adds a secondary attribute graph extraction symbol "->", a binary connector whose left side is a primary attribute and whose right side is the name of a secondary attribute graph. The name of the secondary attribute graph, like the names of secondary attributes, can be specified freely by the user. For example, photo->locationGraph means that, for the primary attribute photo (a group photo), the secondary attribute graph describing the positional relationships of the people in the group photo is obtained. A secondary attribute graph can be obtained directly through a query statement, for example: Match (n:{name: "Alice"}) Return n.photo->locationGraph. The information in the secondary attribute graph can also be queried, for example: Match (n:{name: "Alice"}) With n.photo->locationGraph as graph, Match (m)-[:nextTo]-(n:{name: "Alice"}) from graph Return m.name.

4) On the basis of step 3), the query engine of the database is extended to compile and execute query statements that conform to the syntax of step 3), allowing the value of a secondary attribute graph to be obtained in the form 'primary attribute->attribute graph'.

5) In the invention, the secondary attribute graph information is obtained by calling a specific AI algorithm to process the unstructured data. Each AI algorithm extracts secondary attribute graphs of its corresponding pattern; one type of secondary attribute graph (e.g., "child riding a horse") corresponds to one AI algorithm, and the correspondence between algorithms and attribute graphs is maintained by an algorithm mapping library.

6) The function of the algorithm mapping library mentioned in step 5) is to configure a specified algorithm for a specified primary attribute, so that the algorithm can extract secondary attribute graph information from that primary attribute. The algorithm mapping library is responsible for maintaining the mapping between algorithms and secondary attribute graph patterns.

7) To accelerate the query in step 4), the invention designs a cache system. Each query first looks for its result in the cache system; if the cache system holds the latest result for the query, the AI algorithm is not called and the result is returned directly. If the cache system has no corresponding result, the AI algorithm is called to obtain the secondary attribute graph, and the result is stored in the cache system to accelerate subsequent queries.
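The graph notation introduced in steps 2) and 3) can be modeled with two small data classes. The sketch below is an assumption about one possible in-memory form; the class names `Node` and `Edge` and the `render` methods are hypothetical, and the attribute keys `type` and `color` are taken from the boy-on-horse example in this document.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Node:
    """A node carries a category and an optional attribute set."""
    category: str
    attributes: Dict[str, str] = field(default_factory=dict)

    def render(self) -> str:
        # "()" wraps the node, "{ }" wraps its attribute set.
        attrs = ", ".join(f'{key}: "{value}"' for key, value in self.attributes.items())
        if attrs:
            return f"(:{self.category} {{{attrs}}})"
        return f"(:{self.category})"


@dataclass
class Edge:
    """An edge carries only a category and connects two nodes."""
    source: Node
    category: str
    target: Node

    def render(self) -> str:
        # "-[ ]-" is the edge between the two rendered nodes.
        return f"{self.source.render()}-[:{self.category}]-{self.target.render()}"


boy = Node("person", {"type": "boy"})
horse = Node("horse", {"color": "white"})
rendered = Edge(boy, "SIT_ON", horse).render()
```

Rendering the example edge yields the textual form used in the query language, which shows how the three symbols map onto nodes, attribute sets and edges.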

In particular, the invention provides a graph-database-based implementation of the above method:

1. In a graph database based on an attribute graph model, data is organized in the form of nodes and edges. Nodes represent natural entities (such as people, commodities, and organizations), and edges represent the relationships among entities (such as friendship and purchase relationships). On the basis of a graph database, the invention improves the query language and the execution engine and adds an algorithm mapping library, so that the system supports querying the information in unstructured data through an attribute graph. The system architecture is shown in FIG. 1; the main components are a cache layer, a graph database (graph system), and an algorithm mapping library (AI system).

2. Attribute data extends the information describing an entity (e.g., a person's name, date of birth, ID photo, or a photo of the person's car). In particular, the invention supports unstructured data as attribute data of an entity and calls it a "primary attribute". Specific information in the unstructured data is defined as a secondary attribute graph (e.g., a photograph of a boy riding a horse is represented as (:person {type: "boy"})-[:SIT_ON]-(:horse {color: "white"})).

3. In the invention, the secondary attribute graph is obtained by calling a specific AI algorithm to process the unstructured data. One class of secondary attribute graph (e.g., a child riding a horse) corresponds to one AI algorithm, and the correspondence between algorithms and secondary attribute graphs is maintained by an algorithm mapping library.

4. The invention implements a cache layer to accelerate queries on secondary attribute graphs. The cache layer stores the query results for secondary attribute graphs over a certain period of time; as long as neither the data nor the AI algorithm has changed, repeated queries for the same secondary attribute graph do not call the AI algorithm again.

5. The invention extends the Cypher query language to support the semantic extraction symbol "->". The symbol is a binary operator whose left side is a primary attribute and whose right side is a secondary attribute graph, and it means extracting the content of the secondary attribute graph from the primary attribute.

6. On the basis of step 5, the Cypher query engine is extended. The engine parses the query statement input by the user into a syntax tree and generates an execution plan. When a query statement on a secondary attribute graph is executed, the cache layer of step 4 is searched first. On a hit, the result is returned and the AI algorithm is not called again; on a miss, the AI algorithm is called to process the primary attribute and obtain the secondary attribute graph, which is returned to the user and stored in the cache layer to accelerate the next query.

7. The data in the cache layer of step 4 is stored as key-value pairs, where the key is the combination of the id of the unstructured data (primary attribute) and the id of the algorithm, and the value is the result obtained by the AI algorithm processing the unstructured data. When the algorithm or the value of the primary attribute is updated, the combined id changes as well, which invalidates the old cached result; this design ensures that the system obtains the latest secondary attribute graph.

8. For the algorithm mapping described in step 3, the invention implements an algorithm mapping library. The algorithm mapping library manages and maintains the correspondence between secondary attribute graph patterns and AI algorithms, receives the requests issued by the execution engine in step 6 to call an AI algorithm on unstructured data, processes the data, and returns the result.

9. To improve efficiency, the algorithm mapping library and the query engine of the graph database are deployed on different hosts, and the two hosts exchange data over the HTTP protocol.
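The key design of step 7 can be illustrated with a small in-memory cache. The text does not say how the ids are produced, so the following is an assumption: the id of the primary attribute is taken to be a SHA-256 hash of its content, and the algorithm id is a version string such as "locator-v1". The class `SecondaryGraphCache` and the stub algorithm are hypothetical.

```python
import hashlib
from typing import Callable, Dict, Tuple


class SecondaryGraphCache:
    """Key-value cache keyed by (primary attribute id, algorithm id)."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, str], dict] = {}
        self.misses = 0  # number of times the algorithm had to be called

    @staticmethod
    def make_key(primary_value: bytes, algorithm_id: str) -> Tuple[str, str]:
        # Assumption: a content hash serves as the id of the primary attribute,
        # so any change to the data produces a different key.
        return (hashlib.sha256(primary_value).hexdigest(), algorithm_id)

    def get_or_compute(
        self,
        primary_value: bytes,
        algorithm_id: str,
        algorithm: Callable[[bytes], dict],
    ) -> dict:
        key = self.make_key(primary_value, algorithm_id)
        if key not in self._store:
            self.misses += 1
            self._store[key] = algorithm(primary_value)
        return self._store[key]


def stub_algorithm(data: bytes) -> dict:
    # Hypothetical stand-in for an AI algorithm.
    return {"bytes_seen": len(data)}


graph_cache = SecondaryGraphCache()
photo = b"group-photo-bytes"
graph_cache.get_or_compute(photo, "locator-v1", stub_algorithm)
graph_cache.get_or_compute(photo, "locator-v1", stub_algorithm)  # served from cache
misses_after_repeat = graph_cache.misses
graph_cache.get_or_compute(photo, "locator-v2", stub_algorithm)  # algorithm updated
graph_cache.get_or_compute(photo + b"!", "locator-v2", stub_algorithm)  # data updated
```

Because stale entries are simply never looked up again, no explicit invalidation step is needed in this sketch; a production cache would also need eviction of the old entries.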

The invention has the beneficial effects that:

The invention provides a novel method for representing and querying the information in unstructured data. On the basis of the database model, the invention introduces the concept of a secondary attribute graph, represents the information in unstructured data as secondary attribute graphs, and maps each secondary attribute graph pattern to an AI algorithm. The invention makes it possible to query the information in unstructured data through the database query language, simplifies the process of calling AI algorithms to extract information from unstructured data, and makes such queries more flexible. The information extraction capability of AI algorithms and the information query capability of the database are fully combined, providing a new solution for querying the information in unstructured data.

The cache layer reduces the number of AI algorithm calls when the same secondary attribute graph is queried repeatedly, and improves the query efficiency of the system.

Separating the algorithm mapping library (AI system) from the graph database hides the complexity of algorithm dependencies and improves the utilization of system resources.

Drawings

FIG. 1 is a system framework diagram of the present invention.

Detailed Description

The invention is further described by the following specific embodiments in conjunction with the accompanying drawings.

An academic knowledge graph contains data such as academic conference information, scholar information, and scientific research institution information. The graph takes scholars, meetings and institutions as vertices, and relationships such as participation and affiliation as edges. Each meeting node holds the group photo of the corresponding academic meeting.

The user obtains a group photo through a query statement and then obtains the positional relationships of the people in the photo from the secondary attribute graph (Match (n:Meeting) Return n.photo->locationGraph); or, going one step further, the user searches the positional relationships of the people in the photo directly as an attribute graph to obtain information inside the secondary attribute graph. For example, the following query statement obtains the names of the people next to Bob in the group photo of an academic meeting: Match (n:Meeting) With n.photo->locationGraph as graph, Match (m1) in graph Where (m1)-[:nextTo]-(m2 {name: "Bob"}) Return m1.name.
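What the second query computes can be shown on a toy location graph. The data below is invented for illustration (four people standing in a row), and the function `neighbours` and the dictionary layout are assumptions; in the described system the graph would come from an AI algorithm applied to the group photo.

```python
from typing import Dict, List, Tuple

# Hypothetical extracted locationGraph: node id -> attribute set, plus edges.
nodes: Dict[int, Dict[str, str]] = {
    1: {"name": "Alice"},
    2: {"name": "Bob"},
    3: {"name": "Carol"},
    4: {"name": "Dave"},
}
edges: List[Tuple[int, str, int]] = [
    (1, "nextTo", 2),
    (2, "nextTo", 3),
    (3, "nextTo", 4),
]


def neighbours(
    nodes: Dict[int, Dict[str, str]],
    edges: List[Tuple[int, str, int]],
    name: str,
    relation: str = "nextTo",
) -> List[str]:
    """Names of nodes joined to `name` by an undirected `relation` edge."""
    targets = {node_id for node_id, attrs in nodes.items() if attrs.get("name") == name}
    found = []
    for left, category, right in edges:
        if category != relation:
            continue
        # "-[ ]-" has no direction, so check both ends of the edge.
        if left in targets:
            found.append(nodes[right]["name"])
        if right in targets:
            found.append(nodes[left]["name"])
    return sorted(found)


next_to_bob = neighbours(nodes, edges, "Bob")
```

Here `next_to_bob` plays the role of the `m1.name` column returned by the query.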

The above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and a person skilled in the art can make modifications or equivalent substitutions to the technical solution of the present invention without departing from the spirit and scope of the present invention, and the scope of the present invention should be determined by the claims.

Claims

1. A method for querying unstructured data based on a secondary attribute graph comprises the following steps:

2. The method of claim 1, wherein a node contains category and attribute set information and an edge contains category information.

3. The method of claim 1 or 2, wherein the query is performed either by directly inputting the primary attribute and the secondary attribute graph, or by first inputting information in the secondary attribute graph to query for secondary attribute graphs and then inputting the primary attribute and querying the secondary attribute graph selected from those returned.

4. The method of claim 1, wherein an algorithm mapping library is established, in which the correspondence between each AI algorithm and the different secondary attribute graphs is set, so that different AI algorithms can be called to extract the corresponding secondary attribute graphs from the primary attributes.

5. An unstructured data query system based on a secondary attribute graph is characterized by comprising an information extractor, a task scheduler and a query engine; wherein,

6. The system of claim 5, wherein a node contains category and attribute set information and an edge contains category information.

7. The system of claim 5 or 6, wherein the query is performed either by directly inputting the primary attribute and the secondary attribute graph, or by first inputting information in the secondary attribute graph to query for secondary attribute graphs and then inputting the primary attribute and querying the secondary attribute graph selected from those returned.

8. A method for querying unstructured data based on a secondary attribute graph comprises the following steps:

9. The method of claim 8, wherein an algorithm mapping library is established, in which the correspondence between each AI algorithm and the different secondary attribute graphs is set, so that different AI algorithms can be called to extract the corresponding secondary attribute graphs from the primary attributes.

10. An unstructured data query system based on a secondary attribute graph is characterized by comprising an information extractor, a task scheduler and a Cypher query engine; wherein,