CN112966126B

CN112966126B - High-reliability knowledge base construction method capable of inquiring and tracing mass unstructured data content

Info

Publication number: CN112966126B
Application number: CN202110216317.9A
Authority: CN
Inventors: 徐超; 陈勇
Original assignee: NANJING AUDIT UNIVERSITY
Current assignee: NANJING AUDIT UNIVERSITY
Priority date: 2021-02-26
Filing date: 2021-02-26
Publication date: 2021-09-17
Anticipated expiration: 2041-02-26
Also published as: CN112966126A

Abstract

The invention discloses a method for constructing a high-reliability knowledge base that supports query and tracing and is oriented to massive unstructured data content, comprising the following steps: (1) inputting unstructured raw data and storing it in a two-tier structure, in which the cloud stores the raw data and the blockchain stores, in the form of transactions, the summary features of the raw data and the relationships among the raw data; (2) inputting a query condition and automatically extracting summary information from it as the query basis; (3) matching the query basis against the information in the blockchain by relevance, and returning the directly related results together with their predecessor and successor relationships. The method uses deep learning to automatically extract feature information from unstructured data so that the data can readily be used for retrieval; at the same time, the summary information used for retrieval and the relationships among the data are stored on the blockchain, which ensures the high reliability of data retrieval and source tracing in the knowledge base.

Description

High-reliability knowledge base construction method capable of inquiring and tracing mass unstructured data content

Technical Field

The invention belongs to the field of big data storage and blockchain applications, and in particular relates to a method for constructing a high-reliability knowledge base that supports query and tracing and is oriented to massive unstructured data content.

Background

With the rapid development of big data, unstructured data such as pictures and videos is used more and more widely. Enterprises attach great importance to accumulating such data and generally manage it in the form of a knowledge base. A knowledge base is built mainly to serve applications, and two conditions must be met for it to be applied effectively. On the one hand, the reliability of the data stored in it must be ensured; for a decision-support knowledge base in particular, missing data can lead to completely opposite results. On the other hand, it must provide good search capability, including both searching directly for required content by keyword and tracing the ins and outs of specified content. For structured data, existing database management systems already provide good retrieval. For unstructured data, however, because its content is heterogeneous, existing knowledge bases generally build the search index and establish associations through user-defined keywords that reflect an individual's understanding of the data. Searching for related information or for the context of a piece of data can then be done only by keyword. Because unstructured data is rich in what it can represent, simple keywords chosen according to personal understanding cannot fully reflect everything the data covers.
For example, when the police gather evidence for a case, key evidence may have been recorded unintentionally in a picture or video that was originally taken only to publicize a product and has no direct relation to the case. The keywords of that picture or video may therefore contain no corresponding information. Searching only by keywords that individuals have assigned to unstructured data can thus miss much useful data in many cases, which greatly reduces the usefulness of the knowledge base.

Disclosure of Invention

The invention mainly solves the technical problem of providing a knowledge base construction method for massive unstructured data, which can perform data retrieval and tracing from the content level of the unstructured data and ensure the reliability of the knowledge base.

When data in a knowledge base is queried by keyword, existing knowledge bases find it difficult to search based on the content of unstructured data. To solve this problem, the invention uses deep learning to automatically extract the features of unstructured data and uses a blockchain structure to provide a method for constructing a high-reliability knowledge base that supports query and tracing and is oriented to massive unstructured data content. The method comprises:

-inputting unstructured raw data such as pictures, videos and sounds, and storing it in a two-tier "cloud + blockchain" structure, in which the cloud stores the raw data and the blockchain stores the summary features of the data and the relationships among the data in the form of transactions;

-inputting a query condition and automatically extracting the key information of the data as the query basis;

-matching the query basis against the information in the blockchain by relevance, and returning the directly related results together with their predecessors and successors.

Specifically, the unstructured raw data such as input pictures, videos and sounds are saved by a cloud server.

Specifically, the summary features of the unstructured data and the relationships among the data are stored on a blockchain. To ensure that transactions can be used for reliable content-based query of unstructured data, each transaction T is represented as a five-tuple T = <E, H, ID, S, A>, where E is the transaction evidence, H is the hash of the transaction data, ID is the transaction number, S is the digital signature of the data sender, and A is the transaction summary. The i-th transaction in a block can be written as Ti = <Ei, Hi, IDi, Si, Ai>, wherein:

-the transaction evidence field E lists the evidence items on which the transaction depends. Evidence items fall into two categories: a) evidence that already exists on the blockchain, whose address is the corresponding transaction number; b) new evidence material, which must be uploaded to the cloud and whose address is its cloud address. E is defined as E = {e | e ∈ ID or e ∈ Addr}, where ID is the set of all transaction numbers and Addr is the set of all cloud addresses;

-evidence items in the transaction evidence field are separated by ";";

-the data hash field H is used to check the consistency of the transaction. It is calculated as H_e = ∪_{e∈ID} Hash(e) | ∪_{e∈Addr} Hash(e), where, for a cloud address e, Hash(e) means hashing the data stored at that address;

-the signature S is a digital signature generated with the sender's private key and is used to confirm the sender's identity;

-the transaction summary A stores information about the characteristics of the transaction and consists of two parts. The first part, denoted A1, contains the basic features of the transaction, such as the recipient address and the transaction amount. The second part, denoted A2, contains the meta-features of the unstructured data, extracted automatically from the uploaded evidence, such as character features and article features obtained by image recognition, and automatic summaries obtained by character recognition and natural language processing;

-the basic features in A1 are separated by ","; the features in A2 are separated by "," and the whole of A2 is enclosed in parentheses; A1 and A2 are separated by ";".
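As an illustration only, the five-tuple and the separator rules stated above could be modelled as follows. The class name `Transaction`, the helpers `format_evidence`, `format_summary` and `is_on_chain`, and the rule that a purely numeric evidence item is a transaction number are assumptions made for this sketch; the invention does not prescribe them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Transaction:
    """Illustrative model of T = <E, H, ID, S, A> (field names are hypothetical)."""
    evidence: List[str]    # E: transaction numbers and/or cloud addresses
    data_hash: str         # H: hash of the transaction data
    tx_id: str             # ID: transaction number
    signature: str         # S: sender's digital signature
    basic: List[str]       # A1: basic features, e.g. keywords
    features: List[float]  # A2: automatically extracted meta-features


def format_evidence(items: List[str]) -> str:
    # Evidence items are separated by ";".
    return ";".join(items)


def format_summary(basic: List[str], features: List[float]) -> str:
    # A1 items joined by ",", A2 items joined by "," inside parentheses,
    # and A1 separated from A2 by ";".
    a1 = ",".join(basic)
    a2 = "(" + ",".join(str(f) for f in features) + ")"
    return a1 + ";" + a2


def is_on_chain(item: str) -> bool:
    # Assumption for this sketch only: transaction numbers are purely numeric,
    # and anything else is treated as a cloud address.
    return item.isdigit()
```

Keeping A1 and A2 as separate fields makes it straightforward to run exact keyword matching on one and numeric similarity on the other.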

Specifically, when a query condition is input and it is unstructured data, the corresponding features are extracted automatically, using the deep learning model that corresponds to the data type (video, picture, audio, etc.), and serve as the query basis. These deep learning models are trained for the specific type of knowledge base. For example, for a power system equipment knowledge base, the picture-oriented deep learning model is trained on pictures of power-system-related equipment. If the query condition is structured data such as keywords, the keywords are used directly as the query basis.

Specifically, matching the data in the knowledge base against the query basis comprises the following steps:

-first, checking whether the query basis contains keywords; if so, each node i in the blockchain searches, by exact matching on the keywords, the A1 field of every transaction in every block it stores, and the matches it finds form a matching transaction list L1;

-each node broadcasts the matching transaction list L1 that it found, and agreement is reached through the consensus mechanism of the blockchain; if consensus succeeds, the transaction is added to the matching list EXR, and if the summary of the transaction contains a feature field, that feature field is added to the feature list SBL;

-then, detecting whether the query basis contains unstructured data; if so, the features extracted from the unstructured data to be queried are added to SBL; each node then uses the Pearson similarity comparison method to calculate the similarity between the A2 field of every transaction in every block of its stored blockchain and each feature item in SBL, and the transactions whose similarity is greater than a threshold form a matching list L2;

-each node broadcasts the matching transaction list L2 that it found, and agreement is reached through the consensus mechanism of the blockchain; if consensus succeeds, the transaction is added to the matching list EXR;

-traversing each transaction in EXR in turn, using the tracing mechanism of the blockchain to find, from the evidence field of the transaction, all related transactions, and finally returning them to the querier.
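The matching steps above can be sketched as a small in-memory simulation. This is a simplified illustration under several assumptions that are not part of the invention: each node's ledger is a plain dictionary, consensus is approximated by a strict-majority vote, the querier reads transaction fields from the first node's copy, the similarity step runs whenever SBL is non-empty, and tracing follows only evidence items that are transaction numbers. All function names and the keyword given to picture P are hypothetical.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    norm = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return cov / norm if norm else 0.0


def consensus(lists, n_nodes):
    """Stand-in for blockchain consensus: keep IDs reported by a strict majority."""
    votes = Counter(tx_id for lst in lists for tx_id in set(lst))
    return {tx_id for tx_id, n in votes.items() if n > n_nodes / 2}


def trace(ledger, tx_id, seen=None):
    """Follow evidence items that are transaction numbers back to their sources."""
    seen = set() if seen is None else seen
    for item in ledger[tx_id]["E"]:
        if item in ledger and item not in seen:
            seen.add(item)
            trace(ledger, item, seen)
    return seen


def query(nodes, keywords=(), features=None, threshold=0.7):
    """nodes: one ledger per node; a ledger maps tx_id -> {"E", "A1", "A2"}."""
    ledger = nodes[0]  # assumption: the querier reads fields from its own copy
    exr, sbl = set(), []
    if keywords:
        # Exact keyword match on A1 at every node gives the lists L1.
        l1 = [[t for t, tx in led.items() if set(keywords) & set(tx["A1"])]
              for led in nodes]
        exr |= consensus(l1, len(nodes))
        # Feature fields of agreed matches go into SBL.
        sbl += [ledger[t]["A2"] for t in exr if t in ledger and ledger[t]["A2"]]
    if features is not None:
        sbl.append(list(features))
    if sbl:
        # Pearson similarity on A2 at every node gives the lists L2.
        l2 = [[t for t, tx in led.items()
               if tx["A2"] and any(pearson(tx["A2"], f) > threshold for f in sbl)]
              for led in nodes]
        exr |= consensus(l2, len(nodes))
    # Trace each agreed match through its evidence field.
    return {t: sorted(trace(ledger, t)) for t in sorted(exr) if t in ledger}


honest = {
    "14323113": {"E": ["www.knowledgebase.com/pic/1.png"],
                 "A1": ["crane"], "A2": []},  # hypothetical entry for picture P
    "14324147": {"E": ["14323113", "www.knowledgebase.com/pic/2.png"],
                 "A1": ["red-neck crane"],
                 "A2": [0.223, 0.323, 0.523, 0.752, 0.531,
                        0.631, 0.724, 0.891, 0.133, 0.343]},
}
nodes = [honest, honest, {}]  # the third node reports nothing

by_picture = query(nodes, features=[0.433, 0.532, 0.237, 1.342, 0.623,
                                    0.712, 0.561, 0.792, 0.122, 0.353])
by_keyword = query(nodes, keywords=["red-neck crane"])
```

In this sketch the result maps each matched transaction number to the transaction numbers reachable through its evidence field, which corresponds to returning a match together with its provenance.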

Advantageous Effects

The method uses deep learning to automatically extract feature information from unstructured data so that the data can readily be used for retrieval; at the same time, the summary information used for retrieval and the relationships among the data are stored on a blockchain, which ensures the high reliability of data retrieval and tracing in the knowledge base. (Deep learning modeling itself belongs to the prior art; the innovation of the invention lies in applying it to data retrieval and, more importantly, in the design of the relationships between retrieval results and data.)

Drawings

FIG. 1 is a schematic diagram of knowledge base storage and unstructured data query according to the invention.

FIG. 2 is a block transaction field structure diagram.

FIG. 3 is a transaction field diagram.

Detailed Description

The preferred embodiments of the invention are described in detail below with reference to the accompanying drawings, so that the advantages and features of the invention can be more easily understood by those skilled in the art and the scope of protection of the invention is defined more clearly.

Suppose a picture of a red-neck crane is to be added to a rare bird conservation knowledge base. The picture is added to supplement another picture P that is already in the knowledge base (say at the address www.knowledgebase.com/pic/1.png), and it also contains some content showing a black-neck crane. As shown in FIG. 1, the owner of the picture logs into the rare bird conservation knowledge base system with his or her own user ID (assumed to be 1243), then uploads the picture and enters its keyword "red-neck crane" into the system. During the upload, the system sends the picture over the network to the cloud that stores the raw data of the knowledge base, and the cloud returns its address, assumed to be www.knowledgebase.com/pic/2.png. Next, the system calls the picture feature extraction model among the existing unstructured data feature extraction models to extract the features of the picture. Assume the model generates 10-dimensional features, with values (0.223,0.323,0.523,0.752,0.531,0.631,0.724,0.891,0.133,0.343).

Then, according to the transaction structure shown in FIG. 2, each blockchain node assembles the summary information into a transaction in a block. First, the evidence field E is filled in. Because the picture depends on another picture P and is itself a new picture, E should contain the transaction number of P and the cloud address of the picture itself. The cloud server stores the correspondence between cloud addresses and transaction numbers, so the transaction number of P can be looked up from its cloud address. Assuming the transaction number corresponding to www.knowledgebase.com/pic/1.png is 14323113, the value of the evidence field E is "14323113, www.knowledgebase.com/pic/2.png". Second, the transaction number of this transaction is allocated globally and uniformly; assume its value is "14324147". For the signature, the private key is looked up according to the user ID and the user ID number 1243 is then encrypted with that private key; assuming the result is "1299498238a2345b", the value of the signature field is "1299498238a2345b". Finally, the transaction summary field contains, on the one hand, the keyword "red-neck crane" entered by the user and, on the other hand, the unstructured features (0.223,0.323,0.523,0.752,0.531,0.631,0.724,0.891,0.133,0.343) of the picture. All of these fields are then combined, separated by semicolons, and their MD5 value is calculated, namely:

MD5(14323113, www.knowledgebase.com/pic/2.png; 14324147; 1299498238a2345b; red-neck crane, (0.223,0.323,0.523,0.752,0.531,0.631,0.724,0.891,0.133,0.343)) = 72B0EEECCE3B0D1CBE28EC7620F1D205

Thus, the value of the transaction is shown in FIG. 3 after the above analysis.

This transaction will then be packed into a blockchain, and the cloud will also save the correspondence between its picture address www.knowledgebase.com/pic/2.png and transaction number 14324147.
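The assembly of the example transaction can be sketched with Python's standard `hashlib` module. The function name `data_hash`, the "; " separator, the UTF-8 encoding and the upper-case hex output are assumptions of this sketch. Because the exact byte layout hashed in the embodiment is not specified, the digest produced here should not be expected to reproduce the value quoted in the text.

```python
import hashlib


def data_hash(evidence: str, tx_id: str, signature: str, summary: str) -> str:
    """Join the fields with semicolons and return the MD5 digest in upper-case hex."""
    payload = "; ".join([evidence, tx_id, signature, summary])
    return hashlib.md5(payload.encode("utf-8")).hexdigest().upper()


features = (0.223, 0.323, 0.523, 0.752, 0.531, 0.631, 0.724, 0.891, 0.133, 0.343)

# Field values taken from the worked example.
evidence = "14323113, www.knowledgebase.com/pic/2.png"
tx_id = "14324147"
signature = "1299498238a2345b"
summary = "red-neck crane, (" + ",".join(str(f) for f in features) + ")"

h = data_hash(evidence, tx_id, signature, summary)

# Cloud-side index from picture address to transaction number, so that a later
# upload that depends on this picture can look up its transaction number.
address_index = {
    "www.knowledgebase.com/pic/1.png": "14323113",
    "www.knowledgebase.com/pic/2.png": tx_id,
}
```

Any change to a field, for example a tampered feature value, yields a different digest, which is what allows the hash field to serve as a consistency check.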

Suppose that, when querying, the user provides a picture containing a black-neck crane and enters no keyword. In this case, the system uses the picture feature extraction model among the unstructured data feature extraction models to extract the features of the picture; assume the model generates 10-dimensional features with values (0.433,0.532,0.237,1.342,0.623,0.712,0.561,0.792,0.122,0.353). Each blockchain node then compares this feature vector in turn with the unstructured features in all the transaction summaries it stores. The similarity threshold is set to 0.7, and the Pearson correlation coefficient is used to calculate feature similarity. For the previously entered picture of the red-neck crane, whose features are (0.223,0.323,0.523,0.752,0.531,0.631,0.724,0.891,0.133,0.343), the calculated similarity is 0.7037.

Since 0.7037 > 0.7, the threshold is exceeded. Because most nodes in the blockchain are honest, they find that the transaction for the previously entered red-neck crane picture, which contains some black-neck crane content, matches above the threshold, so the honest nodes broadcast this transaction.
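The similarity value used in this example can be checked with a few lines of Python. The sketch below computes the standard Pearson correlation coefficient for the two feature vectors given in the text; the function and variable names are illustrative only.

```python
import math

# Features stored for the red-neck crane picture and features of the query picture.
stored = [0.223, 0.323, 0.523, 0.752, 0.531, 0.631, 0.724, 0.891, 0.133, 0.343]
queried = [0.433, 0.532, 0.237, 1.342, 0.623, 0.712, 0.561, 0.792, 0.122, 0.353]


def pearson(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


r = pearson(queried, stored)   # approximately 0.7037
exceeds_threshold = r > 0.7
```

The margin over the threshold is small in this example, so the outcome of a match is sensitive to the threshold chosen for a given knowledge base.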

As a result, more than 50% of the nodes broadcast the transaction, so the picture corresponding to the transaction (www.knowledgebase.com/pic/2.png) and its link to P are returned to the querying user. The picture related to the black-neck crane and its provenance are thus found.

The above is only an embodiment of the invention and does not limit the scope of the invention. Any equivalent structure or equivalent process made using the contents of this specification and the drawings, or any direct or indirect application in other related technical fields, likewise falls within the scope of protection of the invention.

Claims

1. A method for constructing a high-reliability knowledge base that supports query and tracing and is oriented to massive unstructured data content, characterized by comprising the following steps:

(1) inputting unstructured raw data and storing it in a two-tier structure: the cloud stores the raw data, and the blockchain stores, in the form of transactions, the summary features of the raw data and the relationships among the raw data; the summary features of the raw data and the relationships among the raw data are represented by a block transaction T:

T＝<E,H,ID,S,A>

in the formula, E represents transaction evidence, H represents the hash of transaction data, ID represents a transaction number, S represents a digital signature of a data sender, and A represents a transaction abstract; the transaction summary A is used for storing information of transaction characteristics and comprises two parts:

a) the basic features given according to the transaction type, denoted by A1;

b) according to the uploaded evidence information, the meta-features of the unstructured data automatically extracted by the unstructured feature extraction model are represented by A2;

(2) inputting a query condition, and automatically extracting abstract information of the query condition as a query basis;

(3) matching the query with the relevance according to the information in the block chain, and returning a direct correlation result and a precursor successor relation of the result; the method comprises the following specific steps:

(3-1) first, checking whether the query has keywords: if yes, according to the keywords, each node i in the block chain searches the A1 field of each transaction of each block stored in the block chain according to the principle of complete matching, and after a matching item is found to form a matching transaction list L1, the step (3-2) is carried out; if no keyword exists, directly performing the step (3-3);

(3-2) each node broadcasts the matching transaction list L1 respectively, consensus is carried out by a consensus mechanism of the block chain, and if the consensus is successful, the transaction is added into the matching list EXR; further judging, if the summary of the transaction has a characteristic field, adding the characteristic field of the transaction into a characteristic list SBL;

(3-3) detecting whether the query basis contains unstructured data or not, if so, adding the unstructured data to be queried into the SBL according to the extracted features of the unstructured data, then according to a Pearson similarity comparison method, carrying out similarity calculation on an A2 field of each transaction of each block in a block chain stored by each node and each feature item in the SBL, and extracting the transactions with the similarity larger than a threshold value to form a matched transaction list L2;

(3-4) each node broadcasts the matching transaction list L2 respectively, consensus is carried out by a consensus mechanism of the block chain, and if the consensus is successful, the transaction is added into the matching list EXR;

(3-5) traversing each transaction in the EXR in turn, using the tracing mechanism of the blockchain to find, from the evidence field of the transaction, all related transactions, and finally returning the related transactions to the querier.

2. The method according to claim 1, wherein the transaction evidences E are classified into two broad categories:

a) evidence already on the blockchain;

b) the input unstructured original data serving as the transaction evidence E needs to be uploaded to a cloud end, and the cloud end address serves as the address of the transaction evidence;

E is defined as E = {e | e ∈ ID or e ∈ Addr}, where ID denotes the set of all transaction numbers, Addr is the set of all cloud addresses, and E denotes the sequence set formed by transaction numbers or cloud addresses.

3. The method of claim 1, wherein the hash H of the transaction data verifies the consistency of the transaction by the formula:

H_e = ∪_{e∈ID} Hash(e) | ∪_{e∈Addr} Hash(·e)

wherein ID represents the set of all transaction numbers, and Addr is the set of all cloud addresses; Hash(e) indicates that the transaction number corresponding to e is hashed; Hash(·e) indicates that the data in the cloud address corresponding to e is hashed.

4. The method of claim 1, wherein the unstructured feature extraction model is constructed based on a deep learning approach.

5. The method of claim 1, wherein the query condition is structured data or unstructured data, and wherein:

-structured data: directly taking keywords as query bases;

-unstructured data: and according to the data type of the unstructured data, automatically extracting corresponding abstract information by using a corresponding deep learning model to serve as a query basis.