CN112966126B - High-reliability knowledge base construction method capable of inquiring and tracing mass unstructured data content - Google Patents

Info

Publication number
CN112966126B
Authority
CN
China
Prior art keywords
transaction
data
unstructured
block chain
query
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110216317.9A
Other languages
Chinese (zh)
Other versions
CN112966126A (en)
Inventor
徐超
陈勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NANJING AUDIT UNIVERSITY
Original Assignee
NANJING AUDIT UNIVERSITY
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NANJING AUDIT UNIVERSITY
Priority to CN202110216317.9A
Publication of CN112966126A
Application granted
Publication of CN112966126B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/41 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43 Querying
    • G06F16/432 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/48 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/483 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Library & Information Science (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for constructing a highly reliable knowledge base that supports query and tracing over the content of massive unstructured data. The method comprises the following steps: (1) unstructured raw data is input and stored in a two-level structure, in which the cloud stores the raw data and the blockchain stores, in the form of transactions, the summary features of the raw data and the relationships among the raw data; (2) a query condition is input, and summary information is automatically extracted from it to serve as the query basis; (3) the query basis is matched for relevance against the information in the blockchain, and the directly relevant results are returned together with their predecessor and successor relationships. The method uses deep learning to automatically extract feature information from unstructured data so that the data can be retrieved conveniently. It also uses the blockchain to store the summary information used for retrieval and the relationships among the data, which ensures highly reliable retrieval and tracing in the knowledge base.

Description

High-reliability knowledge base construction method capable of inquiring and tracing mass unstructured data content
Technical Field
The invention belongs to the field of big data storage and blockchain applications, and in particular relates to a method for constructing a highly reliable knowledge base that supports query and tracing over the content of massive unstructured data.
Background
With the rapid development of big data, unstructured data such as pictures and videos is used ever more widely. Enterprises pay close attention to accumulating such data and generally manage it in the form of a knowledge base. A knowledge base is built mainly to serve applications, and for it to be applied effectively two things are required. On the one hand, the reliability of the data stored in the knowledge base must be ensured; for a decision-support knowledge base in particular, missing data can lead to a completely opposite result. On the other hand, good search capability must be provided, including searching directly for required content by keyword and tracing the origin and subsequent development of specified content. For structured data, existing database management systems already provide good retrieval. For unstructured data, however, because its content is heterogeneous, existing knowledge bases generally build search indexes and establish associations through user-defined keywords that reflect an individual's understanding of the data. When related information or the context of a piece of data is needed, the search can only be based on those keywords. Because unstructured data is rich in content, simple keywords determined by personal understanding cannot fully reflect everything the data covers.
For example, when the police gather evidence for a case, some key evidence may have been recorded unintentionally in a picture or video. That picture or video may originally have been taken only to advertise a product and have no direct connection with the case, so its keywords may contain no corresponding information. Searching only by manually extracted keywords can therefore miss much useful data, which greatly reduces the effectiveness of the knowledge base.
Disclosure of Invention
The main technical problem solved by the invention is to provide a knowledge base construction method for massive unstructured data that supports data retrieval and tracing at the content level of the unstructured data and ensures the reliability of the knowledge base.
When a knowledge base is queried by keyword, it is difficult for existing knowledge bases to search on the content of unstructured data. To solve this problem, the invention uses deep learning to automatically extract the features of unstructured data and uses a blockchain structure, and provides a method for constructing a highly reliable knowledge base that supports query and tracing over the content of massive unstructured data. The method comprises:
- inputting unstructured raw data such as pictures, videos and sounds, and storing it in a two-level 'cloud + blockchain' structure, in which the cloud stores the raw data and the blockchain stores the summary features of the data and the relationships among the data in the form of transactions;
- inputting a query condition, and automatically extracting key information from it to serve as the query basis;
- matching the query basis for relevance against the information in the blockchain, and returning the directly relevant results together with their predecessor and successor relationships.
Specifically, the input unstructured raw data such as pictures, videos and sounds is stored on a cloud server.
Specifically, the summary features of the unstructured data and the relationships among the data are stored on a blockchain. To ensure that transactions can support reliable content-based querying of unstructured data, each transaction T is represented as a five-tuple T = <E, H, ID, S, A>, where E is the transaction evidence, H is the hash of the transaction data, ID is the transaction number, S is the digital signature of the data sender, and A is the transaction summary. The i-th transaction in a block can be written as Ti = <Ei, Hi, IDi, Si, Ai>, wherein:
- the transaction evidence field E lists the evidence items on which the transaction depends. Evidence items fall into two categories: a) evidence that already exists on the blockchain, whose address is the corresponding transaction number; b) new evidence material, which must be uploaded to the cloud and whose address is its cloud address. E is defined as E = {e | e ∈ ID or e ∈ Addr}, where ID is the set of all transaction numbers and Addr is the set of all cloud addresses;
- the evidence items in the transaction evidence field are separated by ";";
- the data hash field H is used to check the consistency of the transaction. It is computed as $H_E = \bigcup_{e \in ID} \mathrm{Hash}(e) \mid \bigcup_{e \in Addr} \mathrm{Hash}(e)$, where, for a cloud address e, Hash(e) denotes hashing the data stored at that address;
- the signature S is a digital signature produced with the sender's private key and is used to determine the sender's identity;
- the transaction summary A stores the feature information of the transaction and consists of two parts. The first part, denoted A1, holds the basic features given for the transaction, such as the recipient address and the transaction amount. The second part, denoted A2, holds the meta-features of the unstructured data that are extracted automatically from the uploaded evidence, such as person and object features obtained by image recognition and automatic summaries obtained by character recognition and natural language processing;
- the basic features in A1 are separated by ","; the features in A2 are separated by "," and the whole of A2 is enclosed in parentheses; A1 and A2 are separated by ";".
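The five-tuple and the separator rules above can be sketched as a small data structure. This is only an illustration under stated assumptions: the class name `Transaction`, its field names and the helper methods are hypothetical, the sample values are borrowed from the worked example in the Detailed Description (with the feature vector shortened), and the hash field is left empty because its computation is not part of this sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Transaction:
    """Hypothetical container for the five-tuple T = <E, H, ID, S, A>."""
    evidence: List[str]    # E: transaction numbers or cloud addresses
    data_hash: str         # H: hash over the evidence items
    tx_id: str             # ID: transaction number
    signature: str         # S: digital signature of the sender
    basic: List[str]       # A1: basic features, e.g. keywords
    features: List[float]  # A2: automatically extracted meta-features

    def evidence_field(self) -> str:
        """Serialize E: evidence items separated by ';'."""
        return ";".join(self.evidence)

    def summary(self) -> str:
        """Serialize A: A1 items joined by ',', A2 in parentheses, ';' between them."""
        a1 = ",".join(self.basic)
        a2 = "(" + ",".join(str(f) for f in self.features) + ")"
        return a1 + ";" + a2


tx = Transaction(
    evidence=["14323113", "www.knowledgebase.com/pic/2.png"],
    data_hash="",  # not computed in this sketch
    tx_id="14324147",
    signature="1299498238a2345b",
    basic=["red-neck crane"],
    features=[0.223, 0.323, 0.523],  # shortened for brevity
)
print(tx.evidence_field())
print(tx.summary())
```

Keeping A1 and A2 as separate typed fields and serializing them only at the boundary makes the exact-match search on A1 and the numeric similarity search on A2 straightforward to implement independently.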
Specifically, when a query condition is input and it is unstructured data, the corresponding features are extracted automatically, according to the data type (video, picture, audio, etc.), by the corresponding deep learning model and serve as the query basis. These deep learning models are built by training for the specific type of knowledge base. For a power system equipment knowledge base, for example, the picture-oriented deep learning model is trained on pictures of power system equipment. If the query condition is structured data such as keywords, the keywords are used directly as the query basis.
Specifically, matching the data in the knowledge base against the query basis comprises the following steps:
- first, check whether the query basis contains keywords. If so, each node i in the blockchain searches the A1 field of every transaction in every block it stores for the keywords, using exact matching, and collects the matching transactions into a list L1;
- each node broadcasts the matching transaction list L1 that it found, and agreement is reached through the consensus mechanism of the blockchain. If consensus succeeds, the transaction is added to the matching list EXR; if the summary of the transaction contains a feature field, that feature field is added to the feature list SBL;
- then, check whether the query basis contains unstructured data. If so, the features extracted from the unstructured data to be queried are added to the SBL. Using the Pearson similarity comparison method, each node then computes the similarity between the A2 field of every transaction in every block of its stored blockchain and each feature item in the SBL, and collects the transactions whose similarity exceeds a threshold into a matching list L2;
- each node broadcasts the matching transaction list L2 that it found, and agreement is reached through the consensus mechanism of the blockchain. If consensus succeeds, the transaction is added to the matching list EXR;
- each transaction in the EXR is traversed in turn, all related transactions are found from its evidence field using the tracing mechanism of the blockchain, and the related transactions are finally returned to the inquirer.
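The matching steps above can be sketched in a few dozen lines. This is a simplified illustration, not the patented implementation: the consensus mechanism is replaced by a plain majority vote, each node's chain is modelled as a dictionary, the first node is assumed to hold an honest copy that can be read after consensus, and tracing is limited to direct predecessors and successors. All names (`query`, `majority`, `Ledger`, and the dictionary keys) are hypothetical, and the record for picture P has an empty summary because the source does not give one.

```python
from math import sqrt
from typing import Dict, List, Optional, Sequence, Set

Ledger = Dict[str, dict]  # tx_id -> {"E": [...], "A1": [...], "A2": [...] or None}


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)
    return cov / sqrt(var) if var > 0 else 0.0


def majority(reports: List[Set[str]]) -> Set[str]:
    """Stand-in for consensus: keep IDs reported by more than half the nodes."""
    counts: Dict[str, int] = {}
    for ids in reports:
        for tx_id in ids:
            counts[tx_id] = counts.get(tx_id, 0) + 1
    return {t for t, c in counts.items() if c > len(reports) / 2}


def query(nodes: List[Ledger], keyword: Optional[str] = None,
          feature: Optional[Sequence[float]] = None,
          threshold: float = 0.7) -> Dict[str, dict]:
    reference = nodes[0]  # assumption: an honest copy used to read agreed transactions
    exr: Set[str] = set()
    sbl: List[Sequence[float]] = []

    if keyword is not None:
        # Each node builds L1 by exact keyword match on A1.
        l1 = [{t for t, tx in n.items() if keyword in tx["A1"]} for n in nodes]
        for t in majority(l1):
            exr.add(t)
            if reference[t]["A2"]:
                sbl.append(reference[t]["A2"])

    if feature is not None:
        # Each node builds L2 by Pearson similarity between A2 and every SBL entry.
        sbl.append(feature)
        l2 = [{t for t, tx in n.items()
               if tx["A2"] and any(pearson(tx["A2"], f) > threshold for f in sbl)}
              for n in nodes]
        exr |= majority(l2)

    # Trace: predecessors are cited in E; successors cite this transaction in their E.
    result = {}
    for t in sorted(exr):
        result[t] = {
            "predecessors": [e for e in reference[t]["E"] if e in reference],
            "successors": sorted(u for u, tx in reference.items() if t in tx["E"]),
        }
    return result


chain: Ledger = {
    "14323113": {"E": ["www.knowledgebase.com/pic/1.png"], "A1": [], "A2": None},
    "14324147": {"E": ["14323113", "www.knowledgebase.com/pic/2.png"],
                 "A1": ["red-neck crane"],
                 "A2": [0.223, 0.323, 0.523, 0.752, 0.531,
                        0.631, 0.724, 0.891, 0.133, 0.343]},
}
nodes = [chain, chain, {}]  # the third node is faulty and reports nothing
probe = [0.433, 0.532, 0.237, 1.342, 0.623, 0.712, 0.561, 0.792, 0.122, 0.353]
hits = query(nodes, feature=probe)
print(hits)
```

With two of three nodes reporting the same match, the majority vote accepts transaction 14324147 and the trace step links it back to 14323113 through its evidence field. A real deployment would rely on the chain's own consensus protocol rather than this vote.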
The invention has the advantages of
The method uses deep learning to automatically extract the feature information of unstructured data so that the data can be retrieved conveniently. It also uses the blockchain to store the summary information used for retrieval and the relationships among the data, which ensures highly reliable retrieval and tracing in the knowledge base. (Deep learning modeling itself belongs to the prior art; the innovation of the invention lies in applying it to data retrieval and, more importantly, in the design of the relationships between retrieval results and the data.)
Drawings
FIG. 1 is a schematic diagram of knowledge base storage and unstructured data query according to the invention.
Fig. 2 is a block transaction field structure diagram.
Fig. 3 is a transaction field diagram.
Detailed Description
The preferred embodiments of the invention are described in detail below with reference to the accompanying drawings, so that the advantages and features of the invention can be more readily understood by those skilled in the art and the scope of protection of the invention can be defined more clearly.
Suppose a picture of a red-neck crane is to be entered into a rare-bird conservation knowledge base. The picture is added to supplement another picture P that is already in the knowledge base (say at the address www.knowledgebase.com/pic/1.png), and it also shows part of a black-neck crane. As shown in FIG. 1, the owner of the picture logs into the rare-bird conservation knowledge base system with his or her user ID (assumed to be 1243), then uploads the picture and enters its keyword "red-neck crane". During the upload, the system sends the picture over the network to the cloud that stores the raw data of the knowledge base, and the cloud returns the address of the stored picture, assumed to be www.knowledgebase.com/pic/2.png. Next, the system calls the picture feature extraction model among the existing unstructured data feature extraction models to extract the features of the picture. Assume that the model produces a 10-dimensional feature vector, (0.223,0.323,0.523,0.752,0.531,0.631,0.724,0.891,0.133,0.343).
Then, following the transaction structure shown in FIG. 2, each blockchain node assembles the summary information into a transaction in a block. First, the evidence field E is filled in. Because the picture depends on another picture P and is itself a picture, E should contain the transaction number of P and the cloud address of the picture itself. The cloud server stores the correspondence between cloud addresses and transaction numbers, so the transaction number of P can be looked up from its cloud address. Assuming that the transaction number corresponding to www.knowledgebase.com/pic/1.png is 14323113, the value of the evidence field E is "14323113, www.knowledgebase.com/pic/2.png". Second, the transaction number of this transaction is allocated globally in a unified manner; assume its value is "14324147". For the signature, the private key is looked up by user ID and the user ID 1243 is encrypted with that private key; assuming the result is "1299498238a2345b", the value of the signature field is "1299498238a2345b". Finally, the transaction summary field consists of the keyword "red-neck crane" entered by the user and the unstructured features (0.223,0.323,0.523,0.752,0.531,0.631,0.724,0.891,0.133,0.343) extracted from the picture. All the fields are then combined, separated by semicolons, and their MD5 hash is computed, namely:
MD5(14323113, www.knowledgebase.com/pic/2.png; 14324147; 1299498238a2345b; red-neck crane, (0.223,0.323,0.523,0.752,0.531,0.631,0.724,0.891,0.133,0.343)) = 72B0EEECCE3B0D1CBE28EC7620F1D205
Thus, the value of the transaction is shown in FIG. 3 after the above analysis.
This transaction will then be packed into a blockchain, and the cloud will also save the correspondence between its picture address www.knowledgebase.com/pic/2.png and transaction number 14324147.
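The field assembly and hashing step in this example can be sketched with Python's standard `hashlib` module. The variable names and the "; " joining convention are assumptions made for illustration. The exact byte string hashed in the patent (whitespace, character encoding, and the original-language keyword) is not recoverable from the text, so this sketch is not expected to reproduce the digest printed above.

```python
import hashlib

# Field values taken from the worked example; names are hypothetical.
evidence = "14323113, www.knowledgebase.com/pic/2.png"
tx_id = "14324147"
signature = "1299498238a2345b"
summary = ("red-neck crane, "
           "(0.223,0.323,0.523,0.752,0.531,0.631,0.724,0.891,0.133,0.343)")

# Combine the fields, separated by semicolons, then hash the result.
payload = "; ".join([evidence, tx_id, signature, summary])
digest = hashlib.md5(payload.encode("utf-8")).hexdigest().upper()

print(payload)
print(digest)
```

Any node holding the same field values can recompute the digest and compare it with the stored hash field, which is how the consistency check on a transaction would be carried out in this sketch.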
When the user inputs the query condition, assume that a picture containing a black-neck crane is provided and no keyword is entered. In this case, the system extracts the features of the picture using the picture feature extraction model among the unstructured data feature extraction models; assume that the model produces a 10-dimensional feature vector, (0.433,0.532,0.237,1.342,0.623,0.712,0.561,0.792,0.122,0.353). Each blockchain node then compares this feature vector in turn against the unstructured features in all of the transaction summaries it stores. The similarity threshold is set to 0.7, and the Pearson correlation coefficient is used to compute feature similarity. For the previously entered picture of the red-neck crane, whose features are (0.223,0.323,0.523,0.752,0.531,0.631,0.724,0.891,0.133,0.343), the similarity is computed as:
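$$r = \frac{\sum_{i=1}^{10}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{10}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{10}(y_i-\bar{y})^2}} \approx 0.7037$$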
Since 0.7037 > 0.7, the threshold is exceeded. Because most nodes in the blockchain are honest, they find that the transaction for the previously entered red-neck crane picture, which contains part of a black-neck crane, matches above the threshold, so the honest nodes broadcast this transaction.
In the end, more than 50% of the nodes broadcast the transaction, so the picture corresponding to the transaction (www.knowledgebase.com/pic/2.png) and its link relationship with P are returned to the querying user. In this way, the picture related to the black-neck crane and its tracing relationship are found.
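The similarity value used in this example can be checked with a short script. The sketch below implements the standard Pearson correlation coefficient in plain Python; the function and variable names are illustrative, and the two vectors are the stored and query feature vectors given in the text.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Features stored for the red-neck crane picture (transaction 14324147).
stored = [0.223, 0.323, 0.523, 0.752, 0.531, 0.631, 0.724, 0.891, 0.133, 0.343]
# Features extracted from the query picture of the black-neck crane.
probe = [0.433, 0.532, 0.237, 1.342, 0.623, 0.712, 0.561, 0.792, 0.122, 0.353]

r = pearson(stored, probe)
print(f"r = {r:.4f}, exceeds threshold 0.7: {r > 0.7}")
```

The coefficient evaluates to roughly 0.7037, which agrees with the value in the example and lies only slightly above the 0.7 threshold, so the outcome of this particular query is sensitive to the threshold chosen.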
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes performed by the present specification and drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (5)

1. A construction method of a high-reliability knowledge base which is capable of inquiring and tracing and faces to massive unstructured data contents is characterized by comprising the following steps:
(1) inputting unstructured original data, and storing by adopting a secondary structure: the cloud end stores original data, and the block chain stores abstract characteristics of the original data and the relation between the original data in a transaction mode; the relationship between the abstract features of the raw data and the raw data is represented by a block transaction T:
T=<E,H,ID,S,A>
in the formula, E represents transaction evidence, H represents the hash of transaction data, ID represents a transaction number, S represents a digital signature of a data sender, and A represents a transaction abstract; the transaction summary A is used for storing information of transaction characteristics and comprises two parts:
a) the basic characteristics given according to the transaction type, denoted by A1;
b) according to the uploaded evidence information, the meta-features of the unstructured data automatically extracted by the unstructured feature extraction model are represented by A2;
(2) inputting a query condition, and automatically extracting abstract information of the query condition as a query basis;
(3) matching the query with the relevance according to the information in the block chain, and returning a direct correlation result and a precursor successor relation of the result; the method comprises the following specific steps:
(3-1) first, checking whether the query has keywords: if yes, according to the keywords, each node i in the block chain searches the A1 field of each transaction of each block stored in the block chain according to the principle of complete matching, and after a matching item is found to form a matching transaction list L1, the step (3-2) is carried out; if no keyword exists, directly performing the step (3-3);
(3-2) each node broadcasts the matching transaction list L1 respectively, consensus is carried out by a consensus mechanism of the block chain, and if the consensus is successful, the transaction is added into the matching list EXR; further judging, if the summary of the transaction has a characteristic field, adding the characteristic field of the transaction into a characteristic list SBL;
(3-3) detecting whether the query basis contains unstructured data or not, if so, adding the unstructured data to be queried into the SBL according to the extracted features of the unstructured data, then according to a Pearson similarity comparison method, carrying out similarity calculation on an A2 field of each transaction of each block in a block chain stored by each node and each feature item in the SBL, and extracting the transactions with the similarity larger than a threshold value to form a matched transaction list L2;
(3-4) each node broadcasts the matching transaction list L2 respectively, consensus is carried out by a consensus mechanism of the block chain, and if the consensus is successful, the transaction is added into the matching list EXR;
and (3-5) traversing each transaction in the EXR in turn, finding all related transactions corresponding to the transaction from the evidence field of the transaction by utilizing a traceability mechanism of the block chain, and finally returning the related transactions to the inquirer.
2. The method according to claim 1, wherein the transaction evidences E are classified into two broad categories:
a) evidence already on the blockchain;
b) the input unstructured original data serving as the transaction evidence E needs to be uploaded to a cloud end, and the cloud end address serves as the address of the transaction evidence;
the calculation mode of E is E = {e | e ∈ ID or e ∈ Addr}, where ID denotes the set of all transaction numbers, Addr is the set of all cloud addresses, and E denotes a sequence set formed by transaction numbers or cloud addresses.
3. The method of claim 1, wherein the hash H of the transaction data verifies the consistency of the transaction by the formula:
$$H_E = \bigcup_{e \in ID} \mathrm{Hash}(e) \;\Big|\; \bigcup_{e \in Addr} \mathrm{Hash}(.e)$$
wherein ID represents the set of all transaction numbers, and Addr is the set of all cloud addresses; hash (e) indicates that the transaction number corresponding to e is hashed; hash (. e) indicates that data in the cloud address corresponding to e is hashed.
4. The method of claim 1, wherein the unstructured feature extraction model is constructed based on a deep learning approach.
5. The method of claim 1, wherein the query condition is structured data or unstructured data, and wherein:
-structured data: directly taking keywords as query bases;
-unstructured data: and according to the data type of the unstructured data, automatically extracting corresponding abstract information by using a corresponding deep learning model to serve as a query basis.
CN202110216317.9A 2021-02-26 2021-02-26 High-reliability knowledge base construction method capable of inquiring and tracing mass unstructured data content Active CN112966126B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110216317.9A CN112966126B (en) 2021-02-26 2021-02-26 High-reliability knowledge base construction method capable of inquiring and tracing mass unstructured data content

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110216317.9A CN112966126B (en) 2021-02-26 2021-02-26 High-reliability knowledge base construction method capable of inquiring and tracing mass unstructured data content

Publications (2)

Publication Number Publication Date
CN112966126A CN112966126A (en) 2021-06-15
CN112966126B CN112966126B (en) 2021-09-17

Family

ID=76276144

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110216317.9A Active CN112966126B (en) 2021-02-26 2021-02-26 High-reliability knowledge base construction method capable of inquiring and tracing mass unstructured data content

Country Status (1)

Country Link
CN (1) CN112966126B (en)

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10797863B2 (en) * 2017-12-28 2020-10-06 Intel Corporation Multi-domain cascade convolutional neural network
CN109040012B (en) * 2018-06-19 2021-02-09 西安电子科技大学 Block chain-based data security protection and sharing method and system and application
CN109377363B (en) * 2018-09-26 2020-08-18 电子科技大学 Block chain-based Internet of things data transaction architecture and transaction security method thereof
CN109493017B (en) * 2018-11-05 2021-11-23 江苏大学 Trusted outsourcing storage method based on block chain
CN111414417B (en) * 2020-03-02 2023-02-14 陕西西影数码传媒科技有限责任公司 Video copyright management method based on block chain
CN111737343B (en) * 2020-05-11 2021-12-07 广州大学 Information labeling method based on block chain
CN111581292B (en) * 2020-05-18 2023-12-15 中国工业互联网研究院 Industrial Internet data asset right-determining and trading method and platform
CN111339106B (en) * 2020-05-18 2020-08-28 杭州趣链科技有限公司 Block chain data indexing method
CN112035894A (en) * 2020-07-20 2020-12-04 江苏傲为控股有限公司 Electronic evidence trusteeship system
CN111915301B (en) * 2020-08-05 2022-08-26 腾讯科技(深圳)有限公司 Data processing method and device based on block chain, electronic equipment and readable medium

Also Published As

Publication number Publication date
CN112966126A (en) 2021-06-15

Similar Documents

Publication Publication Date Title
US8805833B2 (en) Enhancing personal data search with information from social networks
US20100082653A1 (en) Event media search
US20150278691A1 (en) User interests facilitated by a knowledge base
US8655902B2 (en) Identifying superphrases of text strings
CN111797214A (en) FAQ database-based problem screening method and device, computer equipment and medium
WO2023108980A1 (en) Information push method and device based on text adversarial sample
CN111475725A (en) Method, apparatus, device, and computer-readable storage medium for searching for content
CN110674313A (en) Method for dynamically updating knowledge graph based on user log
US8046360B2 (en) Reduction of annotations to extract structured web data
CN117351334A (en) Image auditing method and related equipment
CN111752922A (en) Method and device for establishing knowledge database and realizing knowledge query
CN112966126B (en) High-reliability knowledge base construction method capable of inquiring and tracing mass unstructured data content
US20090182759A1 (en) Extracting entities from a web page
US20220156228A1 (en) Data Tagging And Synchronisation System
CN112749246B (en) Evaluation method and device of search phrase, server and storage medium
CN114357086A (en) Patent IPC classification number recommendation method and device based on knowledge graph
Fatemi et al. Record linkage to match customer names: A probabilistic approach
SalahEldeen et al. Reading the correct history? Modeling temporal intention in resource sharing
Abass et al. Information retrieval models, techniques and applications
Xiao et al. Data analysis algorithms for mining online communities from microblogs
Sebastine et al. Semantic web for content based video retrieval
CN112860940B (en) Music resource retrieval method based on sequential concept space on description logic knowledge base
CN116306622B (en) AIGC comment system for improving public opinion atmosphere
CN112287229B (en) National defense construction dynamic information recommendation method based on combined semantic similarity
CN115859380B (en) Electronic data fixed certificate association method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant