CN111831787A - Unstructured data information query method and system based on secondary attributes

Info

Publication number: CN111831787A
Application number: CN202010513529.9A
Authority: CN
Inventors: 沈志宏; 赵子豪; 周园春
Original assignee: Computer Network Information Center of CAS
Current assignee: Computer Network Information Center of CAS
Priority date: 2020-06-08
Filing date: 2020-06-08
Publication date: 2020-10-27
Anticipated expiration: 2040-06-08
Also published as: CN111831787B

Abstract

The invention discloses a method and a system for querying unstructured data information based on secondary attributes. The method comprises the following steps: 1) for a target database, taking the unstructured data of each record in the target database as a primary attribute of the corresponding record; 2) extracting intrinsic information in each primary attribute as a secondary attribute of that primary attribute; 3) extending the query language of the target database by adding a semantic operator "->", and extending the query engine of the target database so that it compiles and executes query statements conforming to the syntax of the semantic operator "->"; 4) the query engine first looks up, in the cache system, cached results that satisfy the query conditions; if no matching result exists, it searches the target database for records matching the primary attribute in the query conditions, extracts the secondary attributes from the primary attributes of the matching records, matches them against the secondary attributes in the query conditions, and returns the matching results.

Description

Unstructured data information query method and system based on secondary attributes

Technical Field

The invention relates to the technical fields of unstructured data, data query languages and artificial intelligence. It addresses the problems in the prior art that information in unstructured data cannot be queried conveniently and cannot be computed on demand, and provides a method and a system for querying unstructured data based on secondary attributes.

Background

Unstructured data accounts for a large proportion of network data; pictures, sound recordings, videos and long plain texts are all unstructured data. Technologies for storing and querying structured data are mature, and well-established solutions exist for its storage and management. However, as technology advances, data sources are becoming broader, data volumes larger and data forms more complex. In many application scenarios, engineers must handle not only structured data with a canonical format, but also semi-structured data with a self-describing structure and even unstructured data with no fixed structure. Because its structure is flexible, such data is highly extensible and can express information very freely. That same freedom of format, however, has made the storage and management of unstructured data a problem that has troubled the industry for many years. Current management and query techniques for unstructured data focus mainly on retrieval by metadata, such as file name, size, file category and tag values. Such simple retrieval cannot make full use of AI techniques and cannot directly query and consume the information contained in the unstructured data, which makes unstructured data difficult to query and use. Some artificial intelligence methods can already extract information from unstructured data, for example speech-to-text conversion, face recognition and license plate number extraction, and the related algorithms have reached fairly high accuracy. However, because AI algorithms have complex dependencies, are difficult to deploy, and differ greatly from tool to tool, using them to obtain the information in unstructured data remains inconvenient.

Given that the volume of unstructured data keeps growing and AI algorithms keep becoming more accurate and more capable, it is of great significance to develop a method and a system that can quickly query the information in unstructured data.

Disclosure of Invention

To address the problem of querying information in unstructured data, the invention provides a method and a system for querying unstructured data information based on secondary attributes, together with an implementation based on a graph database. The method binds specified information in the unstructured data to a secondary attribute name, uses that name to represent the information, uses an AI algorithm to extract the specified secondary attribute, and obtains the information in the unstructured data by querying the secondary attribute. This enables fast querying of information in unstructured data and improves flexibility.

The technical scheme adopted by the invention is as follows:

A secondary-attribute-based unstructured data information query method comprises the following steps:

1) for a target database, taking the unstructured data of each record in the target database as the primary attribute of the corresponding record;

2) extracting intrinsic information in each primary attribute as a secondary attribute of the primary attribute;

3) extending the query language of the target database by adding a semantic operator "->"; the semantic operator is used in the form "a->b", meaning that for the primary attribute a, the value of the secondary attribute b within a is queried; and extending the query engine of the target database so that it compiles and executes query statements conforming to the syntax of the semantic operator "->";

4) the query engine first looks up, in the cache system, cached results that satisfy the query conditions, and returns them if any match; if there is no matching cached result, it searches the target database for records matching the primary attribute in the query conditions, then extracts the secondary attributes from the primary attributes of the matching records, matches them against the secondary attributes in the query conditions, and returns the matching results.
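The meaning of the "->" operator in step 3 can be illustrated with a minimal Python sketch. The record layout, the helper names `split_semantic_expr` and `evaluate`, and the stub extractor `fake_extract` are hypothetical and introduced only for illustration; the source does not prescribe any of them.

```python
from typing import Any, Callable, Dict, Tuple

# Hypothetical extractor type: takes raw unstructured bytes and returns
# a table of secondary attributes {name: value}.
Extractor = Callable[[bytes], Dict[str, Any]]


def split_semantic_expr(expr: str) -> Tuple[str, str]:
    """Split 'a->b' into (primary attribute a, secondary attribute b)."""
    left, sep, right = expr.partition("->")
    if not sep or not left.strip() or not right.strip():
        raise ValueError(f"not a valid 'a->b' expression: {expr!r}")
    return left.strip(), right.strip()


def evaluate(record: Dict[str, bytes], expr: str, extract: Extractor) -> Any:
    """Return the value of secondary attribute b inside primary attribute a."""
    primary, secondary = split_semantic_expr(expr)
    table = extract(record[primary])
    return table.get(secondary)


def fake_extract(blob: bytes) -> Dict[str, Any]:
    # Stand-in for an AI algorithm; returns a fixed table for illustration.
    return {"plateNumber": "A12345", "color": "red"}


record = {"carPhoto": b"<binary image data>"}
print(evaluate(record, "carPhoto->plateNumber", fake_extract))
```

In this sketch the left operand selects the unstructured value stored in the record and the right operand selects one entry of the table extracted from it, which mirrors the "a->b" reading given above.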

Further, for a record i in the target database, if there are n unstructured data in the record i, the n unstructured data are used as n primary attributes of the record i.

Further, the secondary attributes under the same primary attribute are organized in table form, i.e. { secondary attribute 1: value 1, secondary attribute 2: value 2, ... }.

Furthermore, an algorithm mapping library is established that records the correspondence between each AI algorithm and the secondary attributes, and is used to call the appropriate AI algorithm to extract the corresponding secondary attribute from a primary attribute.
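One way to picture the algorithm mapping library is as a registry keyed by secondary attribute name. The following Python sketch assumes that shape; the class name `AlgorithmMappingLibrary`, its methods, and the two stub algorithms are hypothetical and stand in for real models.

```python
from typing import Any, Callable, Dict, List

Algorithm = Callable[[bytes], Any]


class AlgorithmMappingLibrary:
    """Maps secondary attribute names to the algorithm that extracts them."""

    def __init__(self) -> None:
        self._by_attribute: Dict[str, Algorithm] = {}

    def register(self, secondary_attribute: str, algorithm: Algorithm) -> None:
        self._by_attribute[secondary_attribute] = algorithm

    def supported(self) -> List[str]:
        """List the secondary attributes that currently have an algorithm."""
        return sorted(self._by_attribute)

    def extract(self, secondary_attribute: str, data: bytes) -> Any:
        if secondary_attribute not in self._by_attribute:
            raise KeyError(f"no algorithm registered for {secondary_attribute!r}")
        return self._by_attribute[secondary_attribute](data)


# Stub algorithms standing in for real AI models (assumptions).
def recognize_plate(image: bytes) -> str:
    return "A12345"


def transcribe_speech(audio: bytes) -> str:
    return "hello"


library = AlgorithmMappingLibrary()
library.register("plateNumber", recognize_plate)
library.register("transcript", transcribe_speech)
print(library.extract("plateNumber", b"<image bytes>"))
```

Keeping the mapping in one place means the query engine only needs the secondary attribute name, not knowledge of which model produces it.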

Further, when the user enters a query condition, once the primary attribute has been entered, the user either selects the secondary attribute from the secondary attributes supported by that primary attribute, which are prompted automatically, or enters the secondary attribute directly.
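The automatic prompt described above could work roughly as in the following Python sketch. The `SUPPORTED` catalogue, the attribute names in it, and the `suggest` function are hypothetical; the source does not specify how the prompt is implemented.

```python
from typing import Dict, List

# Hypothetical catalogue: primary attribute name -> secondary attributes
# for which an extraction algorithm is available.
SUPPORTED: Dict[str, List[str]] = {
    "carPhoto": ["color", "model", "plateNumber"],
    "paper": ["author", "keywords"],
}


def suggest(partial_query: str) -> List[str]:
    """Suggest secondary attributes once the user has typed 'primary->'."""
    primary, sep, prefix = partial_query.partition("->")
    if not sep:
        return []  # the primary attribute has not been completed yet
    candidates = SUPPORTED.get(primary.strip(), [])
    prefix = prefix.strip()
    return [name for name in candidates if name.startswith(prefix)]


print(suggest("carPhoto->"))
print(suggest("paper->k"))
```

A user who already knows the attribute name can skip the suggestions and type it in full, which corresponds to the direct-input alternative.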

An unstructured data information query system based on secondary attributes is characterized by comprising an information extractor, a task scheduler and a query engine, wherein:

the information extractor is used for extracting unstructured data from each node of the graph database as a primary attribute of the corresponding node, and for calling the task scheduler to extract the specified information in each primary attribute as a secondary attribute of that primary attribute;

the task scheduler is used for calling different AI algorithms to extract different secondary attributes from the intrinsic information of the primary attributes;

the query engine is used for first looking up, in the cache system, cached results that satisfy the query conditions, and returning them if any match; if there is no matching cached result, it searches the target database for records matching the primary attribute in the query conditions, then extracts the secondary attributes from the primary attributes of the matching records, matches them against the secondary attributes in the query conditions, and returns the matching results;

wherein the query language of the target database is extended by adding a semantic operator "->"; the semantic operator is used in the form "a->b", meaning that for the primary attribute a, the value of the secondary attribute b within a is queried; and the query engine of the target database is extended so that it compiles and executes query statements conforming to the syntax of the semantic operator "->".

1) for a graph database based on an attribute graph model, wherein nodes in the graph database are used for representing entities, and edges are used for representing the relationship between the entities;

2) taking the attribute data of each entity as the primary attribute of the corresponding node, and taking the specified information of each primary attribute as the secondary attribute of that primary attribute;

3) extending the Cypher query language of the graph database by adding a semantic operator "->"; the semantic operator "->" is a binary operator whose left operand is a primary attribute and whose right operand is a secondary attribute, and it means that the content of the secondary attribute is extracted from the primary attribute; and extending the Cypher query engine of the graph database so that it parses the query statement entered by the user into a syntax tree and generates an execution plan;

4) the Cypher query engine queries the cache results meeting the query conditions from the cache according to the query conditions, and if the cache results are matched, the cache results are returned; if no matched query result exists, searching a matched node in the graph database according to the primary attribute in the query condition, then extracting the secondary attribute from the primary attribute of the matched node, respectively matching with the secondary attribute in the query condition, and returning the matching result.
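As a rough illustration of step 3, the Python sketch below parses a single property expression that may carry the "->" operator into a small syntax tree and flattens it into an ordered plan. It is not Cypher's real grammar and does not use any real graph database API; the node classes, the regular expression and the step names in the plan are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Property:
    variable: str   # e.g. "n"
    name: str       # primary attribute, e.g. "carPhoto"


@dataclass(frozen=True)
class SemanticExtract:
    source: Property   # left operand: primary attribute
    secondary: str     # right operand: secondary attribute name


Node = Union[Property, SemanticExtract]

# Accepts "var.prop" optionally followed by "-> secondaryName".
_EXPR = re.compile(r"^\s*(\w+)\.(\w+)\s*(?:->\s*(\w+))?\s*$")


def parse(expr: str) -> Node:
    """Build a syntax tree for one property expression."""
    match = _EXPR.match(expr)
    if match is None:
        raise SyntaxError(f"cannot parse {expr!r}")
    variable, primary, secondary = match.groups()
    prop = Property(variable, primary)
    return SemanticExtract(prop, secondary) if secondary else prop


def plan(node: Node) -> List[str]:
    """Flatten the tree into an ordered list of execution steps."""
    if isinstance(node, Property):
        return [f"LoadProperty({node.variable}.{node.name})"]
    return plan(node.source) + [
        f"CacheLookup({node.secondary})",
        f"ExtractOnMiss({node.secondary})",
    ]


print(plan(parse("n.carPhoto->plateNumber")))
```

An ordinary property access such as "n.name" yields a one-step plan, while an expression with "->" adds the cache lookup and the extraction-on-miss step that step 4 describes.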

The unstructured data information query system based on secondary attributes is characterized by comprising an information extractor, a task scheduler and a Cypher query engine, wherein:

an information extractor for extracting unstructured data from each record of the graph database as a primary attribute of the corresponding record; calling a task scheduler to extract the intrinsic information in each primary attribute as a secondary attribute of the primary attribute;

the Cypher query engine is used for querying the cache results meeting the query conditions from the cache according to the query conditions, and returning the cache results if the cache results are matched; if no matched query result exists, searching a matched node in the graph database according to the primary attribute in the query condition, then extracting a secondary attribute from the primary attribute of the matched node, respectively matching the secondary attribute with the secondary attribute in the query condition, and returning the matching result;

wherein the Cypher query language of the graph database is extended by adding a semantic operator "->"; the semantic operator "->" is a binary operator whose left operand is a primary attribute and whose right operand is a secondary attribute, and it means that the content of the secondary attribute is extracted from the primary attribute; and the Cypher query engine of the graph database is extended so that it parses the query statement entered by the user into a syntax tree and generates an execution plan.

The unstructured data information query method based on the secondary attributes comprises the following steps:

1) In the original database, unstructured data is represented as attributes of database records (hereinafter referred to as primary attributes).

2) According to actual needs, the intrinsic information in the unstructured data (primary attributes) that the user needs is defined as secondary attributes. The secondary attributes under the same primary attribute are organized in the form of a table, such as { secondary attribute 1: value 1, secondary attribute 2: value 2, ... }. Intrinsic information refers to information contained in the unstructured data itself; for a picture of a car, for example, the model, color and license plate number of the car are intrinsic information of the picture. The specific secondary attributes to be used can be specified by the user.

3) On the basis of step 2), the query language of the database is extended by adding a semantic operator "->". The operator is used in the form "a->b", meaning that for the primary attribute a, the value of the secondary attribute b is queried.

4) On the basis of step 3), the query engine of the database is extended so that it compiles and executes query statements conforming to the syntax of step 3), allowing the value of a secondary attribute to be obtained in the form "primary attribute->secondary attribute name".

5) In the invention, secondary attributes are obtained by calling a specific AI algorithm to process the unstructured data. Each type of secondary attribute (e.g., the license plate number in a picture) corresponds to one AI algorithm, and the correspondence between algorithms and secondary attributes is maintained by an algorithm mapping library.

6) The function of the algorithm mapping library mentioned in step 5) is to configure a specified algorithm for a specified primary attribute, and the algorithm can extract a secondary attribute list from the primary attribute. The algorithm mapping library is responsible for maintaining the mapping relationship between the algorithms and the secondary attributes.

7) To accelerate the query in step 4), the invention provides a cache system. Each query first looks for its result in the cache system; if the cache system holds an up-to-date result for the query, the AI algorithm is not called and the result is returned directly. If the cache system holds no corresponding result, the AI algorithm is called to obtain the secondary attribute, and the result is stored in the cache system to accelerate subsequent queries.
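Steps 5) to 7) can be sketched together in a few lines of Python. In this sketch the algorithm mapping library and the cache system are reduced to plain dictionaries, the cache key is assumed to be the triple (record id, primary attribute, secondary attribute), and `recognize_plate` is a stub rather than a real model; all names are hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple

call_log: List[str] = []  # records every real invocation of a stub algorithm


def recognize_plate(image: bytes) -> str:
    """Stub standing in for a license plate recognition model."""
    call_log.append("recognize_plate")
    return "A12345"


# Algorithm mapping library reduced to a dict: secondary attribute -> algorithm.
ALGORITHMS: Dict[str, Callable[[bytes], Any]] = {"plateNumber": recognize_plate}

# Cache system reduced to a dict (key layout is an assumption).
cache: Dict[Tuple[str, str, str], Any] = {}


def query(record_id: str, record: Dict[str, bytes],
          primary: str, secondary: str) -> Any:
    key = (record_id, primary, secondary)
    if key in cache:
        return cache[key]                      # hit: the algorithm is not called
    value = ALGORITHMS[secondary](record[primary])  # miss: run the algorithm
    cache[key] = value                         # keep the result for later queries
    return value


car = {"carPhoto": b"<image bytes>"}
print(query("rec-1", car, "carPhoto", "plateNumber"))
print(query("rec-1", car, "carPhoto", "plateNumber"))
print(len(call_log))
```

The second call is answered from the cache, so the stub algorithm runs only once for the two identical queries.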

In particular, the invention provides a graph-database-based implementation of the above method:

1. In a graph database based on an attribute graph model, data is organized in the form of nodes and edges. Nodes represent natural entities (such as people, commodities and organizations), and edges represent the relationships between entities (such as friendship and purchase relationships). On the basis of a graph database, the invention extends the query language and the execution engine and adds an algorithm mapping library, so that the system supports querying the information in unstructured data through secondary attributes.

2. Attribute data extends the information describing an entity (e.g., a person's name, date of birth, ID photo, or a photo of the person's car). In particular, the invention supports unstructured data as attribute data of an entity and refers to it as a "primary attribute". Specific information in the unstructured data is bound to a secondary attribute name (for example, the license plate number in the car photo is defined as placenumber).

3. In the invention, the corresponding secondary attributes are obtained by calling a specific AI algorithm to process unstructured data. One type of secondary attribute (e.g., license plate number in a picture) corresponds to an AI algorithm, and the correspondence between the algorithm and the secondary attribute is maintained by an algorithm mapping library.

4. The invention implements a cache layer for accelerating the query of secondary attributes. The data stored in the cache layer is the result of the query of the secondary attribute in a certain time period, and when the data and the AI algorithm are not changed, the AI algorithm is not repeatedly called for multiple queries of the same secondary attribute.

5. The invention extends the Cypher query language to support the semantic extraction symbol "->". The symbol is a binary operator whose left operand is a primary attribute and whose right operand is a secondary attribute, and it means that the content of the secondary attribute is extracted from the primary attribute.

6. On the basis of step 5, the Cypher query engine is extended. The engine parses the query statement entered by the user into a syntax tree and generates an execution plan. When a query statement involving a secondary attribute is executed, the cache layer of step 4 is searched first. On a hit, the result is returned and the AI algorithm is not called again. On a miss, the AI algorithm is called to process the primary attribute and obtain the secondary attribute, the secondary attribute is returned to the user, and the result is stored in the cache layer to accelerate the next query.

7. The data in the cache layer of step 4 is stored as key-value pairs, where the key is the combination of the id of the unstructured data (primary attribute) and the algorithm id, and the value is the result obtained by processing the unstructured data with the AI algorithm. When the algorithm or the primary attribute value is updated, the combined id changes as well, so the original cached result becomes stale. This design ensures that the system always obtains the latest secondary attribute value.

8. For the algorithm mapping described in step 3, the invention implements an algorithm mapping library. The algorithm mapping library manages and maintains the correspondence between secondary attribute names and AI algorithms, receives the requests issued by the execution engine in step 6 to process unstructured data with an AI algorithm, processes the data, and returns the result.

9. To improve efficiency, the algorithm mapping library and the query engine of the graph database are deployed on different hosts, and the two hosts exchange data over the HTTP protocol.
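The cache key described in item 7 can be sketched as follows. The source only states that the key combines the id of the unstructured data with the algorithm id; deriving the data id from a SHA-256 digest of the content, the "name@version" form of the algorithm id, and the class `SecondaryAttributeCache` are assumptions made for this illustration.

```python
import hashlib
from typing import Any, Dict, Optional


def data_id(blob: bytes) -> str:
    """Identify a primary attribute value by a digest of its content (assumption)."""
    return hashlib.sha256(blob).hexdigest()[:16]


def cache_key(blob: bytes, algorithm_id: str) -> str:
    """Combine the id of the unstructured data with the algorithm id."""
    return f"{data_id(blob)}:{algorithm_id}"


class SecondaryAttributeCache:
    def __init__(self) -> None:
        self._store: Dict[str, Any] = {}

    def get(self, blob: bytes, algorithm_id: str) -> Optional[Any]:
        return self._store.get(cache_key(blob, algorithm_id))

    def put(self, blob: bytes, algorithm_id: str, result: Any) -> None:
        self._store[cache_key(blob, algorithm_id)] = result


attr_cache = SecondaryAttributeCache()
attr_cache.put(b"photo-v1", "plate-ocr@1", {"plateNumber": "A12345"})

print(attr_cache.get(b"photo-v1", "plate-ocr@1"))  # same data, same algorithm
print(attr_cache.get(b"photo-v2", "plate-ocr@1"))  # primary attribute changed
print(attr_cache.get(b"photo-v1", "plate-ocr@2"))  # algorithm updated
```

Because a change to either the data or the algorithm produces a different key, stale entries are simply never looked up again; no explicit invalidation step is needed in this sketch, although old entries would still have to be evicted eventually.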

The invention has the beneficial effects that:

The invention provides a novel method for querying unstructured data information. It introduces the concept of secondary attributes on top of the database model, represents the information in unstructured data as secondary attributes, and associates each secondary attribute name with an AI algorithm. The invention makes it possible to query unstructured data information through the database query language, simplifies the process of calling AI algorithms to extract information from unstructured data, and makes such queries more flexible. It combines the information extraction capability of AI algorithms with the information query capability of databases, providing a new solution for querying information in unstructured data.

The design of the cache layer reduces the calling times of the AI algorithm when the same secondary attribute is repeatedly inquired, and improves the inquiry efficiency of the system.

The design of separating the algorithm mapping library (AI system) from the graph database shields the complexity of algorithm dependence and improves the utilization efficiency of system resources.

Drawings

FIG. 1 is a system framework diagram of the present invention.

FIG. 2 is a schematic diagram of the primary attributes of a node.

FIG. 3 is a diagram of secondary attributes of a node.

FIG. 4 is a schematic diagram of the primary attributes of the newly added node.

FIG. 5 is a diagram of the graph after the new relationship is added.

Detailed Description

The invention is further described by the following specific embodiments in conjunction with the accompanying drawings.

A complete worked example based on an academic knowledge graph follows:

A certain organization constructs a knowledge graph in the academic field, with persons as nodes. (Note: in FIG. 3, the solid-line boxes represent what the prior art can achieve, and the dashed-line boxes are the secondary attributes constructed with the multi-level attributes proposed by the invention; ellipses are nodes and boxes are attributes.)

1. Take Zhang San as an example. The node has three attributes, name: Zhang San, title: researcher, paper: "xx research", as shown in FIG. 2. The name and title attributes are character strings, while the value of the paper attribute is a PDF file stored as a Blob object.

2. To extend the academic knowledge graph, the keywords and co-authors of papers need to be represented in the graph. This information can be extracted from the paper text through PDF parsing and natural language processing, or entered manually. As shown in FIG. 3, keywords and authors are secondary attributes of the attribute "paper"; the concept of such a "secondary attribute" does not exist in the original graph database.

The user can obtain the value of the attribute by querying "Zhang San.paper->author". The system receives the user's query request, first retrieves the PDF file corresponding to the "paper" attribute, then calls a predefined method for extracting author information from PDF files, extracts the author names "Zhang San, Li Si" and returns them to the user.

3. To extend the academic knowledge graph, a new person (Li Si) needs to be added to the graph together with his attributes; the specific values are shown in FIG. 4.

4. To extend the academic knowledge graph, paper collaboration relationships between persons need to be added. By querying the value of the secondary attribute "author" of Zhang San's paper "xx research", the paper is found to have a co-author "Li Si", so a paper collaboration relationship between Li Si and Zhang San can be determined. The new graph is shown in FIG. 5.

The user uses the author information of Zhang San's paper to create a paper collaboration relationship in the graph. The system workflow for this step is as follows:

1. Query the author information of the paper under Zhang San, obtaining Zhang San and Li Si.

2. Query the whole graph for nodes whose name is Zhang San or Li Si.

3. Create a relationship between the nodes of Zhang San and Li Si.
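The three workflow steps can be traced with a small in-memory Python sketch. The dictionaries standing in for the graph, the stub `extract_authors` that replaces the PDF and natural language processing step, and the relationship label "COAUTHOR" are all hypothetical.

```python
from typing import Any, Dict, List, Set, Tuple

# In-memory stand-in for the graph: node name -> attributes.
nodes: Dict[str, Dict[str, Any]] = {
    "Zhang San": {"title": "researcher", "paper": b"%PDF- xx research"},
    "Li Si": {"title": "researcher"},
}
edges: Set[Tuple[str, str, str]] = set()


def extract_authors(pdf: bytes) -> List[str]:
    """Stub for the author extractor; returns fixed names for illustration."""
    return ["Zhang San", "Li Si"]


def link_coauthors(owner: str) -> None:
    # 1. Evaluate owner.paper->author.
    authors = extract_authors(nodes[owner]["paper"])
    # 2. Find the nodes in the whole graph whose name matches an author.
    matched = [name for name in authors if name in nodes]
    # 3. Create a collaboration relationship between each pair of matches.
    for i, first in enumerate(matched):
        for second in matched[i + 1:]:
            edges.add((first, "COAUTHOR", second))


link_coauthors("Zhang San")
print(edges)
```

Authors who have no node in the graph are skipped in this sketch; whether such persons should instead be added as new nodes is a design decision the source leaves open.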

The above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and a person skilled in the art can make modifications or equivalent substitutions to the technical solution of the present invention without departing from the spirit and scope of the present invention, and the scope of the present invention should be determined by the claims.

Claims

1. A secondary attribute-based unstructured data information query method comprises the following steps:

2. The method of claim 1, wherein for a record i in the target database, if there are n unstructured data for the record i, the n unstructured data are taken as n primary attributes of the record i.

3. The method according to claim 1 or 2, characterized in that the secondary attributes under the same primary attribute are organized in table form, i.e. { secondary attribute 1: value 1, secondary attribute 2: value 2, ... }.

4. The method of claim 1, wherein an algorithm mapping library is established, and the correspondence between each AI algorithm and different secondary attributes is set for invoking different AI algorithms to extract corresponding secondary attributes in the primary attributes.

5. The method of claim 1, wherein, when the user enters a query condition and has finished entering the primary attribute, the secondary attribute of the query condition is either selected from the automatically prompted secondary attributes supported by that primary attribute or entered directly.

6. An unstructured data information query system based on secondary attributes is characterized by comprising an information extractor, a task scheduler and a query engine, wherein:

7. The system of claim 6, wherein for a record i in the target database, if there are n unstructured data in the record i, the n unstructured data are taken as n primary attributes of the record i; the secondary attributes under the same primary attribute are organized in table form, namely { secondary attribute 1: value 1, secondary attribute 2: value 2, ... }.

8. A secondary attribute-based unstructured data information query method comprises the following steps:

9. The method of claim 8, wherein an algorithm mapping library is established, and the correspondence between each AI algorithm and different secondary attributes is set for invoking different AI algorithms to extract corresponding secondary attributes in the primary attributes.

10. The unstructured data information query system based on secondary attributes is characterized by comprising an information extractor, a task scheduler and a Cypher query engine, wherein: