CN108846029B - Information correlation analysis method based on knowledge graph - Google Patents
- Publication number
- CN108846029B CN201810519637.XA
- Authority
- CN
- China
- Prior art keywords
- triple
- query
- triplet
- similarity
- idf
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an intelligence correlation analysis method based on a knowledge graph, belonging to the field of relevance retrieval of intelligence over RDF knowledge graphs. The method comprises the following steps: preprocessing the data by parsing the downloaded TXT documents of intelligence data; constructing a triple-based intelligence knowledge base; calculating the weight of each triple and of its keywords using an IDF and information-entropy weighting method, and storing the weights in a database; calculating the triples similar to each triple using a triple similarity formula, and sorting them by similarity; storing the RDF triples; implementing basic SPARQL query operations with the API provided by Jena TDB, and expanding queries using a query method based on triple similarity; and generating query samples, ranking the search result set by the weights of the triples and keywords, and returning the top-k results.
Description
Technical Field
The invention relates to an intelligence analysis scheme based on a knowledge graph, and belongs to the field of relevance retrieval of intelligence over RDF knowledge graphs.
Background
In intelligence science, intelligence is the reprocessing of knowledge, and transitivity and correlation are basic attributes of knowledge. Because intelligence inherits this transitivity and correlation, different pieces of intelligence are correlated with one another. With the rapid development of internet technology, modern intelligence is massive in scale, diverse, and produced in real time, so the degree of association between pieces of intelligence knowledge keeps rising, which makes modern intelligence analysis difficult. At the same time, the development of artificial intelligence techniques, represented by machine learning, makes intelligence analysis feasible in a big-data setting. The targets of intelligence analysis are intelligence resources, which can be decomposed into various mutually associated entities. Machine learning methods can mine the entities in intelligence data and the relationships between them. However, such analysis generally stays at the level of simple statistics and syntactic analysis and does not reach the semantic level of the intelligence content. It is therefore prone to semantic loss, so the analyzed intelligence is not sufficiently accurate or complete and does not reflect the correlations between different pieces of intelligence well.
The knowledge graph is a technology that has emerged in recent years. Its theory and concept were proposed by Google in 2012, combining ideas from knowledge bases, semantic networks, and ontologies. A knowledge graph is essentially a graph structure based on a complex semantic network. It makes full use of visualization technology, so it can describe knowledge resources and their carriers and can also analyze and describe knowledge and the relationships between pieces of knowledge. It presents complex knowledge as a graph in which nodes represent knowledge and edges represent the relationships between pieces of knowledge, so associations are represented more intuitively. Owing to this graph structure, knowledge graphs offer high coverage of entity concepts, diverse semantic relations, a friendly structure, and high quality, and they are increasingly becoming the principal form of knowledge representation in the big-data era. The knowledge graph is also regarded as a development direction for the new generation of artificial intelligence. Knowledge graphs were first applied in search engines: vendors such as Google, Bing, Baidu, and Sogou developed knowledge graph products to improve retrieval quality. Knowledge graphs were then widely applied to intelligent search, personalized recommendation, text classification, content distribution, and similar fields, with good results. Because a knowledge graph can fully reflect the association relationships between entities, applying it to intelligence analysis becomes possible.
Disclosure of Invention
The purpose of the invention is achieved as follows:
The intelligence correlation analysis method based on a knowledge graph comprises the following steps:
Step one, data preprocessing: parse the downloaded TXT documents of intelligence data, extract the title, author, keywords, organization, date, and similar information from each piece of intelligence data, and remove stop words and duplicate items;
Step two, construct a triple-based intelligence knowledge base: build the relations between entities, including authorship relations between intelligence items and authors, affiliation relations between authors and organizations, and cooperation relations between different authors; combine the entities and relations into triple data and store the triple data in a triple database; in the extended triple pattern, the keywords of the intelligence data serve as the extension of the triple pattern;
Step three, calculate the weight of each triple and of its keywords using an IDF and information-entropy weighting method, and store the weights in a database;
Step four, calculate the triples similar to each triple using a triple similarity formula, and sort them by similarity;
Step five, store the RDF triples; implement basic SPARQL query operations with the API provided by Jena TDB, and expand queries using a query method based on triple similarity;
Step six, generate query samples, rank the search result set by the weights of the triples and keywords, and return the top-k results.
The IDF and information-entropy weighting method comprises the following steps:
Step one, the IDF value of each triple is calculated by the following formula;
In the above formula, e denotes an entity or relation in an extended triple, n denotes the number of triples in the RDF dataset, and f(e) denotes the number of times e appears in the RDF dataset. Let IDF(t) be the IDF value of the triple t = (s, p, o); according to the characteristics of the triple structure, the IDF value of each triple is calculated by the following formula, where γ is an adjustment factor and 0 < γ < 1;
IDF(t) = γ × (IDF(s) + IDF(p) + IDF(o))
Step two, the information entropy H(e) of each triple is calculated by the following formula;
where e denotes an entity or relation in an extended triple and k denotes the index of a data class; D_1(e) denotes the frequency with which e occurs in the dataset, D_2(e^(k)) denotes the frequency with which e occurs in the k-th data class, and |C| denotes the total number of data classes; for each triple t = (s, p, o), the information entropy of the triple is calculated by the following formula;
H(t) = η × (H(s) + H(p) + H(o))
where η is an adjustment factor and 0 < η < 1; the weight w(t) of each triple is determined by its IDF value and its information entropy H, as shown in the following formula, where |N| denotes the total number of triples in the dataset;
Step three, the keyword weights are calculated in a way similar to the triple weights: the IDF value and the information entropy of each keyword are calculated separately; first, the IDF of a triple-keyword pair is calculated by the following formula:
Let IDF(v_i | e) be the IDF value of keyword v_i in the keyword group v = (v_1, v_2, …, v_n) belonging to the triple t = (s, p, o), where e denotes the head entity s or the tail entity o of t and excludes the relation between them; F(e, v_i) denotes the number of times that the head entity s or the tail entity o co-occurs with keyword v_i in the dataset; let IDF(v_i | t) denote the IDF value of the triple-keyword pair, whose formula, which includes an adjustment factor, is shown below;
Step four, the information entropy H(v_i | e) of a keyword is calculated by the following formula:
where e denotes only the head and tail entities and excludes the relation between them; k denotes the index of a data class, D_1(v_i, e) denotes the number of co-occurrences of keyword v_i and e in the dataset, D_2(v_i, e^(k)) denotes the number of co-occurrences of keyword v_i and e in the k-th data class, and |C| denotes the total number of data classes; for a triple t and one of its keywords v_i, the information entropy is calculated by the following formula:
H(v_i | t) = θ × (H(v_i | s) + H(v_i | o))
where θ is an adjustment factor and 0 < θ < 1; the weight w(v_i | t) of each triple-keyword pair is composed of the IDF value and the information entropy H, as shown in the following formula:
The query method based on triple similarity comprises the following steps:
Step one, a triple similarity measure;
Let t_1 and t_2 be triples; the similarity of t_1 and t_2 is calculated by the following formula:
Step two, generating relaxed queries: consider an original query and the i-th triple it contains; first, the similarity formula above is applied to each triple in the original query to obtain a group of similar triples, which are sorted by similarity to form a similar-triple queue; for each triple, approximate triples are then selected from its similar-triple queue, and relaxed queries are constructed by replacing the corresponding triples in the original query;
Step three, the similarity between each sub-query and the original query is calculated by the following formula, the queries are sorted by this similarity, and the sorted order determines the order of execution; the original query is executed first, followed by the relaxed queries in descending order of similarity;
Step four, a threshold is set on the semantic similarity between the original query and a relaxed query; if the similarity between the two queries is below the threshold, the relaxed query is not derived.
The beneficial effects of the invention are as follows:
At present, scholars at home and abroad are actively researching relevance retrieval methods for intelligence knowledge graphs, but such methods face problems including incomplete data and relations, a lack of flexible querying, and an inability to reflect the diversity of the query result set. Building on RDF triple retrieval, the invention offers a targeted solution to the main problems of relevance querying over intelligence data and proposes R-SPARQL, a relevance retrieval strategy for extended RDF knowledge graphs.
Drawings
FIG. 1 is a block diagram of the algorithm flow of the present invention;
FIG. 2 is a diagram illustrating semantic overlap based on graph structure according to the present invention;
FIG. 3 is a diagram illustrating an original query derived relaxed query process according to the present invention;
FIG. 4 is a graph comparing average recall and accuracy for different values of the relaxation degree α in accordance with the present invention;
FIG. 5 is a graph comparing average query time for different values of the relaxation degree α in accordance with the present invention;
FIG. 6 is a comparison of average recall for different data set sizes in accordance with the present invention;
FIG. 7 is a comparison of average accuracy for different data set sizes in accordance with the present invention;
FIG. 8 is a comparison of average query time for different data set sizes in accordance with the present invention.
Detailed Description
The invention is explained in more detail below with reference to the drawings and embodiments.
Intelligence correlation analysis is a research focus in information retrieval and intelligence science, and the main task of the invention is to make intelligence retrieval results reflect multiple associated data items. Knowledge-graph-based information retrieval is essentially a subgraph matching problem: a knowledge graph can be regarded as a set of triples in RDF form, and it can be queried with SPARQL, the query language standardized by the W3C. However, such a query returns only the subgraphs that satisfy the conditions exactly, so the result set reflects neither relevance nor diversity.
To address this problem, the invention studies RDF-based knowledge graph retrieval in depth and proposes a relevance retrieval method. The idea is to extend the structure of the RDF knowledge graph with keywords and weights. The keyword extension provides a fuzzy retrieval mode, while the triple and keyword weights are used to calculate the similarity of query triples and to rank the query result set by relevance. By calculating similarity, multiple relaxed queries are derived from the original query, which yields more retrieval results that are associated with one another.
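The exact-match behaviour that motivates this extension can be illustrated with a minimal sketch of triple-pattern matching over an in-memory list of triples. It stands in for a SPARQL basic graph pattern and is not the Jena TDB API; the sample data and the helper name `match_pattern` are hypothetical.

```python
# Minimal sketch: exact triple-pattern matching over an in-memory triple list.
# Strings starting with "?" are variables; every other term must match exactly.
TRIPLES = [
    ("paper1", "writtenBy", "alice"),
    ("paper2", "writtenBy", "bob"),
    ("alice", "memberOf", "instituteA"),
    ("alice", "cooperatesWith", "bob"),
]


def match_pattern(pattern, triples):
    """Return one variable-binding dict per triple that matches the pattern."""
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable may occur twice; both occurrences must agree.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            results.append(binding)
    return results


exact = match_pattern(("?paper", "writtenBy", "alice"), TRIPLES)
missing = match_pattern(("?paper", "writtenBy", "alicia"), TRIPLES)
```

A slightly misspelled entity name (`"alicia"`) returns nothing at all, which is the rigidity that the keyword extension and query relaxation are meant to soften.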
The main inventive points and technical effects are as follows:
At present, scholars at home and abroad are actively researching relevance retrieval methods for intelligence knowledge graphs, but such methods face problems including incomplete data and relations, a lack of flexible querying, and an inability to reflect the diversity of the query result set. Building on RDF triple retrieval, the invention offers a targeted solution to the main problems of relevance querying over intelligence data and proposes R-SPARQL, a relevance retrieval strategy for extended RDF knowledge graphs, whose main points and content are as follows:
(1) Extension of the triple pattern. A user may not know the exact name of a triple element to be queried, or may know only the name of an entity similar to it; such queries often retrieve no data or return wrong results. An extended triple pattern query method is therefore proposed to avoid the overly exact matching of traditional SPARQL queries. First, the invention extends the RDF knowledge graph with keyword sets, associating each triple with a set of keywords. The purpose of the keyword extension is to provide a fuzzy retrieval method that makes retrieval more flexible. Second, each keyword is assigned a weight according to its degree of association with the triple, called the triple-keyword weight. The purpose of the weight extension is to rank the retrieval result set by relevance and to reflect the diversity of the results. The weight assigned to each triple therefore represents not only the importance of the triple in the dataset but also its ability to discriminate between triples. Ranking the result set by weight exploits this discriminating ability, so that the result set exhibits diversity.
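One way to picture the extended triple is as a record that carries a triple weight and a keyword-to-weight map. The sketch below is only an illustration under that assumption: the class name `ExtendedTriple`, the field names, the helper `fuzzy_lookup`, and all weight values are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class ExtendedTriple:
    subject: str
    predicate: str
    obj: str
    weight: float = 0.0  # w(t), filled in by the weighting step
    keyword_weights: dict = field(default_factory=dict)  # keyword -> w(v_i | t)


def fuzzy_lookup(triples, keyword):
    """Return triples whose keyword set contains `keyword`, highest weight first."""
    hits = [t for t in triples if keyword in t.keyword_weights]
    return sorted(hits, key=lambda t: t.keyword_weights[keyword], reverse=True)


store = [
    ExtendedTriple("paper1", "writtenBy", "alice", 0.8,
                   {"knowledge graph": 0.9, "rdf": 0.4}),
    ExtendedTriple("paper2", "writtenBy", "bob", 0.5, {"rdf": 0.7}),
]
ranked = [t.subject for t in fuzzy_lookup(store, "rdf")]
```

Looking up by keyword rather than by exact entity name gives the fuzzy entry point, and the per-keyword weight gives an ordering for the hits.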
Keywords can be obtained in various ways. For example, in scientific literature data, keyword groups can be constructed by extracting the subject terms of each document. Keyword selection also requires some preprocessing, such as removing stop words and duplicate items. Both the triple weights and the keyword weights are key factors in ordering the final result set. The RDF knowledge graph structure is extended with keyword sets and weights. Triple weights can be calculated in various ways depending on the nature of the RDF knowledge graph data source; the invention calculates the triple weights and the keyword weights using inverse document frequency (IDF) and information entropy, in the following steps.
The first step: the IDF value of each triple is calculated by the following formula.
In the above formula, e denotes an entity or relation in an extended triple, n denotes the number of triples in the RDF dataset, and f(e) denotes the number of times e appears in the RDF dataset. Let IDF(t) be the IDF value of the triple t = (s, p, o). According to the characteristics of the triple structure, the IDF value of each triple is calculated by the following formula, where γ is an adjustment factor and 0 < γ < 1.
IDF(t) = γ × (IDF(s) + IDF(p) + IDF(o))    (2)
The second step: the information entropy H(e) of each triple is calculated by the following formula.
Here e denotes an entity or relation in an extended triple and k denotes the index of a data class. D_1(e) denotes the frequency with which e occurs in the dataset, D_2(e^(k)) denotes the frequency with which e occurs in the k-th data class, and |C| denotes the total number of data classes. For each triple t = (s, p, o), the information entropy of the triple is calculated by the following formula.
H(t) = η × (H(s) + H(p) + H(o))    (4)
Here η is an adjustment factor and 0 < η < 1. The weight w(t) of each triple is determined by its IDF value and its information entropy H, as shown in the following formula, where |N| denotes the total number of triples in the dataset.
The third step: the keyword weights are calculated in a way similar to the triple weights; the IDF value and the information entropy of each keyword are calculated separately. First, the IDF of a triple-keyword pair is calculated by the following formula.
Let IDF(v_i | e) be the IDF value of keyword v_i in the keyword group v = (v_1, v_2, …, v_n) belonging to the triple t = (s, p, o), where e denotes the head entity s or the tail entity o of t and excludes the relation between them. F(e, v_i) denotes the number of times that the head entity s or the tail entity o co-occurs with keyword v_i in the dataset. Let IDF(v_i | t) denote the IDF value of the triple-keyword pair, whose formula, which includes an adjustment factor, is shown below.
The fourth step: the information entropy H(v_i | e) of a keyword is calculated by the following formula.
Here e denotes only the head and tail entities and excludes the relation between them. k denotes the index of a data class, D_1(v_i, e) denotes the number of co-occurrences of keyword v_i and e in the dataset, D_2(v_i, e^(k)) denotes the number of co-occurrences of keyword v_i and e in the k-th data class, and |C| denotes the total number of data classes. For a triple t and one of its keywords v_i, the information entropy is calculated by the following formula.
H(v_i | t) = θ × (H(v_i | s) + H(v_i | o))    (9)
Here θ is an adjustment factor and 0 < θ < 1. The weight w(v_i | t) of each triple-keyword pair is composed of the IDF value and the information entropy H, as shown in the following formula.
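The triple-level quantities in formulas (2) and (4) can be sketched as follows. The per-element formulas are not reproduced in this text, so the sketch substitutes two common forms and labels them as assumptions: IDF(e) = log(n / f(e)), and a Shannon entropy over the share of each data class in the occurrences of e. The sample triples, the class labels, and the default values of γ and η are hypothetical. The final weight w(t) is not computed, because its formula is not available here.

```python
import math
from collections import Counter


def element_idf(triples):
    """IDF(e) = log(n / f(e)): an ASSUMED standard form, not the patent's formula."""
    n = len(triples)
    freq = Counter(e for t in triples for e in t)
    return {e: math.log(n / c) for e, c in freq.items()}


def element_entropy(triples, labels):
    """Shannon entropy of each element's class distribution (ASSUMED form).

    labels[i] is the hypothetical data class of triples[i].
    """
    per_class = {}  # element -> Counter mapping class -> occurrence count
    for t, k in zip(triples, labels):
        for e in t:
            per_class.setdefault(e, Counter())[k] += 1
    entropy = {}
    for e, counts in per_class.items():
        total = sum(counts.values())  # plays the role of D_1(e)
        entropy[e] = -sum((c / total) * math.log(c / total)
                          for c in counts.values())
    return entropy


def triple_idf(t, idf, gamma=0.5):
    """Formula (2): IDF(t) = gamma * (IDF(s) + IDF(p) + IDF(o))."""
    return gamma * sum(idf[e] for e in t)


def triple_entropy(t, ent, eta=0.5):
    """Formula (4): H(t) = eta * (H(s) + H(p) + H(o))."""
    return eta * sum(ent[e] for e in t)


triples = [
    ("paper1", "writtenBy", "alice"),
    ("paper2", "writtenBy", "alice"),
    ("alice", "memberOf", "instituteA"),
    ("paper3", "writtenBy", "bob"),
]
labels = ["ai", "ai", "org", "db"]
idf = element_idf(triples)
ent = element_entropy(triples, labels)
```

Under these assumptions, rare elements such as `paper1` receive a high IDF, while elements spread over several classes such as `writtenBy` receive a non-zero entropy.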
(2) A relevance query method based on triple similarity. The invention provides a relevance query strategy suited to the RDF knowledge graph structure. Query relaxation is an important relevance query technique: it changes or deletes one or more hard conditions in the original query, or expands the original query conditions, to produce a query with wider coverage, and finally obtains a group of results associated with the original query results. The specific steps are as follows:
The first step: a triple similarity measure. From the perspective of the knowledge graph structure, an entity may have one or more edges connecting it to other entities, and directly connected triples are often similar: they represent multiple relations that the same entity may have, or different entities linked by the same relation. In practice, two triples may have no directly connecting edge yet share the same entity labels and edge labels, and the similarity between such triples cannot be ignored. For this reason, the concept of triple-based semantic overlap is introduced. Semantic overlap denotes the number of ontology concepts that share the same upper-level concepts, which can indicate how similar two concepts are. The semantic overlap of two RDF triples is defined as the number of triples with the same element; semantic overlap based on graph structure is shown in FIG. 2.
Another method for measuring triple similarity is based on graph embedding, which embeds the triples of the knowledge graph into a low-dimensional vector space, so that the head entity, the tail entity, and the relation of a triple are each represented by a vector. Let t = (s, p, o), with vector representations for its entities and relation; if the triple holds, an equation relating these vectors is satisfied. If two triples t_1 = (s_1, p_1, o_1) and t_2 = (s_2, p_2, o_2) are highly similar, then elements can be exchanged between them and the equation still holds. Therefore, combining these two calculation methods, the similarity of triples t_1 and t_2 is calculated by the following formula.
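The two ingredients of the similarity measure can be sketched separately. Both parts rest on assumptions, because the original formulas are not reproduced in this text: the overlap function counts dataset triples that share an element with both inputs (one possible reading of the definition), and the embedding part assumes a translation-style relation s + p ≈ o in the manner of TransE. The embeddings are hand-picked toy values, and how the two scores are combined into the final similarity is left open.

```python
import math


def shares_element(a, b):
    """True if triples a and b have at least one element in common."""
    return bool(set(a) & set(b))


def semantic_overlap(t1, t2, triples):
    """Count other triples sharing an element with both t1 and t2 (ASSUMED reading)."""
    return sum(1 for t in triples
               if t not in (t1, t2)
               and shares_element(t, t1) and shares_element(t, t2))


def translation_error(s, p, o, emb):
    """Norm of emb[s] + emb[p] - emb[o]; ASSUMES a TransE-style equation."""
    return math.sqrt(sum((a + b - c) ** 2
                         for a, b, c in zip(emb[s], emb[p], emb[o])))


def exchange_error(t1, t2, emb):
    """Average error after swapping the tail entities of t1 and t2."""
    (s1, p1, o1), (s2, p2, o2) = t1, t2
    return 0.5 * (translation_error(s1, p1, o2, emb)
                  + translation_error(s2, p2, o1, emb))


# Hypothetical 2-D embeddings chosen so that every sample triple fits exactly.
emb = {
    "paper1": (0.0, 0.0), "paper2": (0.5, 0.0),
    "alice": (1.0, 0.0), "bob": (1.5, 0.0), "instituteA": (1.0, 1.0),
    "writtenBy": (1.0, 0.0), "memberOf": (0.0, 1.0),
}
t1 = ("paper1", "writtenBy", "alice")
t2 = ("paper2", "writtenBy", "bob")
t3 = ("alice", "memberOf", "instituteA")
dataset = [t1, t2, t3, ("alice", "cooperatesWith", "bob")]

overlap_12 = semantic_overlap(t1, t2, dataset)
err_12 = exchange_error(t1, t2, emb)
err_13 = exchange_error(t1, t3, emb)
```

A lower exchange error means the two triples tolerate swapped elements better, so in this toy data `t2` is closer to `t1` than `t3` is.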
The second step: generating relaxed queries. Consider an original query and the i-th triple it contains. First, the similarity formula above is applied to each triple in the original query to obtain a group of similar triples, which are sorted by similarity to form a similar-triple queue. For each triple, approximate triples are then selected from its similar-triple queue, and relaxed queries are constructed by replacing the corresponding triples in the original query.
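A minimal sketch of this replacement step is given below, assuming the similar-triple queues have already been computed. The function name `relax_query`, the `top_n` cut-off, and the sample similarity values are hypothetical.

```python
def relax_query(query, similar, top_n=2):
    """Derive relaxed queries by replacing one triple at a time.

    query   -- tuple of triples forming the original query
    similar -- dict mapping a triple to a list of (candidate, similarity) pairs
    top_n   -- how many candidates to take from each queue (assumed knob)
    """
    relaxed = []
    for i, triple in enumerate(query):
        # Sort the candidates into a similar-triple queue, best first.
        queue = sorted(similar.get(triple, []),
                       key=lambda cs: cs[1], reverse=True)
        for candidate, sim in queue[:top_n]:
            new_query = query[:i] + (candidate,) + query[i + 1:]
            relaxed.append((new_query, sim))
    return relaxed


query = (("?p", "writtenBy", "alice"), ("alice", "memberOf", "?org"))
similar = {
    ("?p", "writtenBy", "alice"): [
        (("?p", "writtenBy", "bob"), 0.6),
        (("?p", "writtenBy", "carol"), 0.8),
    ],
}
relaxed = relax_query(query, similar)
```

Each relaxed query differs from the original in exactly one triple, which keeps the derived queries close to the user's original intent.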
The third step: the similarity between each sub-query and the original query is calculated by the following formula, the queries are sorted by this similarity, and the sorted order determines the order of execution. The original query is executed first, followed by the relaxed queries in descending order of similarity.
The fourth step: constraining the number of relaxed queries. An original query can derive multiple relaxed queries, and each relaxed query can in turn derive multiple sub-relaxed queries; the number of derivation rounds is called the relaxation degree α. The number of relaxed queries grows exponentially as α increases. Without a limit, an original query may derive too many relaxed queries, many of which retrieve no results or produce meaningless ones. The number of relaxed queries therefore has to be limited. The invention sets a threshold on the semantic similarity between the original query and a relaxed query; if the similarity between the two is below the threshold, the relaxed query is not derived. FIG. 3 illustrates the process of deriving sub-queries from an original query.
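The interplay of the relaxation degree α, the similarity threshold, and the execution order can be sketched as a bounded breadth-first derivation. The query-to-query similarity formula is not reproduced in this text, so the sketch assumes that similarity to the original query multiplies along the derivation path. The function `derive`, the toy derivation graph, and its similarity values are hypothetical.

```python
def derive(original, neighbours, alpha, threshold):
    """Derive relaxed queries breadth-first, up to alpha rounds.

    neighbours(q) returns (relaxed_query, similarity_to_q) pairs. Similarity to
    the original query is ASSUMED to be the product along the derivation path.
    """
    plan = [(1.0, original)]
    frontier = [(1.0, original)]
    seen = {original}
    for _ in range(alpha):
        next_frontier = []
        for sim, q in frontier:
            for child, step_sim in neighbours(q):
                child_sim = sim * step_sim
                if child in seen or child_sim < threshold:
                    continue  # pruned: already derived or too dissimilar
                seen.add(child)
                next_frontier.append((child_sim, child))
        plan.extend(next_frontier)
        frontier = next_frontier
    # The original query runs first, then relaxed queries by descending similarity.
    return [plan[0]] + sorted(plan[1:], key=lambda x: x[0], reverse=True)


graph = {
    "Q": [("Q1", 0.9), ("Q2", 0.5)],
    "Q1": [("Q11", 0.8)],
    "Q2": [("Q21", 0.9)],
}
order = [q for _, q in derive("Q", lambda q: graph.get(q, []),
                              alpha=2, threshold=0.5)]
```

With α = 0 only the original query is returned, and the threshold prunes `Q21`, whose path similarity of 0.45 falls below 0.5.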
(3) A top-k selection scheme for the query result set. Triple-pattern queries often return too many results. Moreover, the result sets returned by relaxed queries are often too similar to one another to exhibit diversity, which makes it hard for users to find the data they actually want. The invention provides a ranking model based on the extended RDF triple pattern, which ranks the result subgraphs by the relevance between each triple and its keywords and finally returns the top-k most relevant results. Let the original query derive m relaxed queries (Q_1, Q_2, …, Q_m). The i-th query triple of query Q_j is extended with a group of keywords. Executing query Q_j yields a set of result triples (q_1, q_2, …, q_s). When a query triple yields the result triple q_w, this is called the transfer from the query triple to q_w. Assuming the keywords in the keyword group are mutually independent, the transfer from the query triple to q_w is decomposed into the transfers from each independent keyword to q_w, and the transition probability can be calculated by the following formula. The result set is ranked by these probability values.
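The final selection can be sketched with a standard top-k heap. The transition probability formula is not reproduced in this text; the sketch assumes, based only on the independence remark, that per-keyword probabilities combine by product. The per-keyword values and the function names are hypothetical.

```python
import heapq
import math


def transfer_score(keyword_probs):
    """Combine per-keyword probabilities by product (ASSUMED combination rule)."""
    return math.prod(keyword_probs)


def top_k(results, k):
    """results: list of (result_id, [per-keyword probabilities]) pairs."""
    return heapq.nlargest(k, results, key=lambda r: transfer_score(r[1]))


results = [
    ("q1", [0.9, 0.5]),
    ("q2", [0.8, 0.8]),
    ("q3", [0.2, 0.9]),
]
best = [r[0] for r in top_k(results, 2)]
```

Using `heapq.nlargest` avoids sorting the full result set when k is small relative to the number of results.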
The technical effects of the invention are as follows:
The invention provides an RDF knowledge graph representation method extended with keywords and weights, and extends the SPARQL query pattern by introducing additional variables. Based on the extended triple pattern, it then provides a query relaxation method built on triple similarity: multiple relaxed queries are generated by calculating similarity to the original query, which broadens the user's query intent and finally returns a group of the most relevant query results. Finally, the query result set is ranked by relevance using the triple weights, taking the diversity of the result set fully into account.
Through simulation experiments and analysis of the results, the traditional SPARQL query is compared with the relaxed query scheme of the invention on three performance indicators, namely average recall, average accuracy, and average query time, and the comparison is presented graphically. First, recall, accuracy, and query time under different relaxation degrees are compared in FIG. 4 and FIG. 5. When α is 0, only the original query is executed; as α increases, the number of relaxed queries grows exponentially, so recall and accuracy also increase. This improvement comes at the cost of increased query time. Recall, accuracy, and query time for different dataset sizes are compared in FIG. 6, FIG. 7, and FIG. 8.
A traditional SPARQL query returns only the result set that satisfies the conditions and does not reflect the degree of association between data. The R-SPARQL method of the invention expands the original query into multiple sub-queries, so it returns not only the result set that satisfies the conditions but also associated entities that fit the structure of the knowledge graph. The R-SPARQL method therefore clearly outperforms the traditional SPARQL query in average recall and average accuracy.
The intelligence correlation query scheme based on the extended RDF knowledge graph is implemented through the following steps.
Step one: Data preprocessing. The main task is to parse the downloaded intelligence data TXT documents, extract information such as the title, author, keywords, organization and date of each intelligence item, and remove stop words and duplicate items.
Step two: Constructing the triple intelligence knowledge base. Relationships are constructed between the entities, such as the writing relationship between an intelligence item and an author, the affiliation relationship between an author and an organization, and the cooperation relationship between different authors. The entities and relationships are combined to form triple data, which is stored in a triple database. In the extended triple pattern, the invention uses the keywords of the intelligence data as the extension of the triple pattern.
Step three: The weight of each triple and of its keywords is calculated with the IDF and information entropy weighting method and stored in the database.
Step four: The similar triples of each triple are calculated with the triple similarity calculation formula and sorted by similarity.
Step five: Efficient storage of the RDF triples. Jena TDB is a component that supports storing, querying and updating RDF data, and ontology data in RDF format can be conveniently imported into it. The method uses the API provided by Jena TDB to implement basic SPARQL query operations and uses the query relaxation model based on triple similarity for query expansion.
Step six: Query samples are generated, the retrieval result set is ranked according to the weights of the triples and keywords, and the top-k results are returned.
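Steps one and two can be illustrated with a minimal Python sketch. It assumes a record that has already been parsed into a dictionary; the field names, the predicate labels (`writtenBy`, `affiliatedWith`, `cooperatesWith`) and the way keywords are attached to a triple are hypothetical choices made for illustration and are not taken from the patent.

```python
from itertools import combinations


def build_triples(record):
    """Turn one parsed intelligence record into (subject, predicate, object) triples."""
    title = record["title"]
    authors = record["authors"]
    triples = []
    for author in authors:
        # Writing relationship between the intelligence item and each author.
        triples.append((title, "writtenBy", author))
        # Affiliation relationship between the author and the organization.
        triples.append((author, "affiliatedWith", record["organization"]))
    # Cooperation relationship between every pair of co-authors.
    for a, b in combinations(authors, 2):
        triples.append((a, "cooperatesWith", b))
    return triples


def extend_with_keywords(triples, keywords):
    """Pair every triple with the record's keywords (assumed form of the extension)."""
    return [(t, tuple(keywords)) for t in triples]


# Hypothetical record, as it might look after preprocessing.
record = {
    "title": "Paper A",
    "authors": ["Li", "Wang"],
    "organization": "Org X",
    "keywords": ["knowledge graph", "SPARQL"],
}
triples = build_triples(record)
extended = extend_with_keywords(triples, record["keywords"])
```

In this sketch every author is assigned the same organization, which is a simplification; a real record would need per-author affiliations.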
Claims (2)
1. An intelligence correlation analysis method based on a knowledge graph, characterized by comprising the following steps:
step one, data preprocessing; parsing the downloaded intelligence data TXT documents, extracting the title, author, keywords, organization and date of each intelligence item, and removing stop words and duplicate items;
step two, constructing a triple intelligence knowledge base; constructing relationships among the entities, including writing relationships between intelligence items and authors, affiliation relationships between authors and organizations, and cooperation relationships between different authors; combining the entities and relationships to form triple data, and storing the triple data in a triple database; in the extended triple pattern, keywords of the intelligence data are used as the extension of the triple pattern;
step three, calculating the weight of each triple and of its keywords by the IDF and information entropy weighting method, and storing the weights in a database;
step four, calculating the similar triples of each triple through a triple similarity calculation formula, and sorting them by similarity;
step five, efficiently storing the RDF triples; implementing basic SPARQL query operations with the API provided by Jena TDB, and performing query expansion according to a query method based on triple similarity;
step six, generating query samples, ranking the retrieval result set according to the weights of the triples and keywords, and returning the top-k results;
the IDF and information entropy weighting method described in step three comprises the following steps:
step (I), calculating the IDF value of each triple, with the calculation formula as follows;
in the above formula, e represents an entity or a relationship in an extended triple, n represents the number of triples in the RDF dataset, and f(e) represents the number of times e appears in the RDF dataset; let IDF(t) be the IDF value of the triple t = (s, p, o); according to the characteristics of the triple structure, the IDF value of each triple is calculated as shown in the following formula, where γ is an adjustment factor and 0 < γ < 1;
IDF(t)=γ×(IDF(s)+IDF(p)+IDF(o))
step (II), calculating the information entropy of each triple; the information entropy H(e) is calculated as shown in the following formula;
wherein e represents an entity or a relationship in the extended triple, and k denotes the index of a data class; D1(e) represents the frequency of occurrence of e in the data set, D2(e(k)) represents the frequency of occurrence of e in the k-th data class, and |C| represents the total number of data classes;
for each triple t = (s, p, o), the information entropy of the triple is calculated as shown in the following formula;
H(t)=η×(H(s)+H(p)+H(o))
wherein η is an adjustment factor and 0 < η < 1; the weight w(t) of each triple is determined by its IDF value and its information entropy H, as shown in the following formula, where |N| represents the total number of triples in the dataset;
step (III), the weight of a keyword is calculated in a way similar to that of a triple: the IDF value and the information entropy of each keyword are calculated separately; first, the IDF calculation for a triple-keyword is shown in the following formula:
let IDF(vi|e) be the IDF value of the keyword vi in the keyword group v = (v1, v2, ..., vn) belonging to the triple t = (s, p, o), where e represents the head entity s or the tail entity o in the triple t, excluding the relationship between them; f(e, vi) represents the number of times the head entity s or the tail entity o of the triple co-occurs with the keyword vi in the data set; let IDF(vi|t) be the IDF value of the triple-keyword, which is calculated as follows, where an adjustment factor is applied;
step (IV), the calculation of the information entropy H(v|e) of a keyword is shown in the following formula:
wherein e represents only the head and tail entities, excluding the relationship between them; k denotes the index of a data class, D1(vi, e(k)) represents the number of times the keyword vi and e co-occur in the data set, D2(vi, e(k)) represents the number of times the keyword vi and e co-occur in the k-th data class, and |C| represents the total number of data classes; for a triple t and one of its keywords vi, the information entropy is calculated as follows:
H(vi|t)=θ×(H(vi|s)+H(vi|o))
wherein θ is an adjustment factor and 0 < θ < 1; the weight w(vi|t) of each triple-keyword is determined by its IDF value and its information entropy H, as shown in the following formula:
the triple similarity measure uses either a judgment based on semantic overlap or a judgment based on the graph-embedding idea applied to the graph structure;
the semantic overlap of two RDF triples is defined as the number of identical elements in the two triples; from the perspective of the knowledge graph structure, an entity may have one or more edges connected to other entities, and triples that are directly connected in this way are similar; two triples with no directly connected edge between them are also similar to each other if they have the same entity label and edge label;
a triple of the knowledge graph is embedded into a low-dimensional vector space, in which the head entity, the tail entity and the relation of the triple are each represented by a vector; let t = (s, p, o), with the vector representations of its entities and relation being s, p and o respectively; if a triple holds, the following equation holds; if two triples t1 = (s1, p1, o1) and t2 = (s2, p2, o2) have a high degree of similarity, the above equation is considered to hold even after elements are interchanged between them.
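Two parts of claim 1 can be sketched in Python under stated assumptions. The per-element formula for IDF(e) is not reproduced in this text, so the sketch assumes the common form log(n / f(e)); only the combination IDF(t) = γ × (IDF(s) + IDF(p) + IDF(o)) comes from the claim. The semantic overlap is interpreted here as the number of positions at which two triples carry the same element, which is one possible reading of the definition. All function names and the sample data are hypothetical.

```python
import math
from collections import Counter


def element_idf(triples):
    """IDF per entity or relation. ASSUMPTION: IDF(e) = log(n / f(e))."""
    n = len(triples)  # n: number of triples in the dataset
    freq = Counter(e for t in triples for e in t)  # f(e): occurrences of e
    return {e: math.log(n / f) for e, f in freq.items()}


def triple_idf(t, idf, gamma=0.5):
    """IDF(t) = gamma * (IDF(s) + IDF(p) + IDF(o)), with 0 < gamma < 1."""
    s, p, o = t
    return gamma * (idf[s] + idf[p] + idf[o])


def semantic_overlap(t1, t2):
    """Number of positions (s, p, o) at which the two triples are identical."""
    return sum(a == b for a, b in zip(t1, t2))


# Hypothetical data set of four triples.
triples = [
    ("A", "writtenBy", "Li"),
    ("B", "writtenBy", "Li"),
    ("B", "writtenBy", "Wang"),
    ("Li", "cooperatesWith", "Wang"),
]
idf = element_idf(triples)
weight_part = triple_idf(triples[0], idf, gamma=0.5)
overlap = semantic_overlap(triples[0], triples[1])
```

Rare elements such as `A` contribute more to IDF(t) than frequent ones such as `writtenBy`, which is the usual motivation for IDF-style weighting.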
2. The intelligence correlation analysis method based on a knowledge graph as claimed in claim 1, wherein the query method based on triple similarity in step five comprises the following steps:
step (I), a measurement method based on triple similarity;
let t1 and t2 be triples; the similarity of t1 and t2 is calculated as shown in the following formula:
step (II), a method for generating relaxed queries; consider an original query and the i-th triple in it; first, the similarity of each triple in the original query is calculated through the above formula to obtain a group of similar triples, which are sorted by similarity to form a similar-triple queue; for each triple, approximate triples are selected from its similar-triple queue, and a relaxed query is constructed by replacing the triple in the original query;
step (III), calculating the similarity between each sub-query and the original query through the following formula, sorting the sub-queries by similarity, and determining the execution order of the queries from the sorted order; the original query is executed first, and the relaxed queries are then executed in descending order of similarity;
step (IV), setting a threshold for the semantic similarity between the original query and a relaxed query; if the similarity between the two queries is smaller than the threshold, the relaxed query is not derived.
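The relaxation procedure of claim 2 can be sketched as follows. This is a minimal illustration, not the patented formulas: the triple similarities are taken as given inputs, and the similarity between a relaxed query and the original query is assumed to equal the similarity of the single triple that was replaced. The function name, the threshold value and the sample patterns are hypothetical.

```python
def relax_query(query, similar, threshold=0.5):
    """Return the original query followed by relaxed queries in descending similarity.

    query:   list of triple patterns
    similar: dict mapping a triple pattern to [(candidate, similarity), ...]
    """
    relaxed = []
    for i, t in enumerate(query):
        for candidate, sim in similar.get(t, []):
            if sim < threshold:
                continue  # step (IV): below the threshold, do not derive this query
            # Step (II): replace the i-th triple with a similar one.
            new_query = query[:i] + [candidate] + query[i + 1:]
            # ASSUMPTION: the relaxed query inherits the replaced triple's similarity.
            relaxed.append((sim, new_query))
    # Step (III): order relaxed queries by similarity, highest first.
    relaxed.sort(key=lambda pair: pair[0], reverse=True)
    # The original query is always executed first.
    return [query] + [q for _, q in relaxed]


# Hypothetical query patterns and precomputed similar-triple queues.
query = [("?paper", "writtenBy", "Li"), ("Li", "affiliatedWith", "?org")]
similar = {
    ("?paper", "writtenBy", "Li"): [
        (("?paper", "writtenBy", "Wang"), 0.8),
        (("?paper", "citedBy", "Li"), 0.3),
    ],
    ("Li", "affiliatedWith", "?org"): [
        (("Wang", "affiliatedWith", "?org"), 0.6),
    ],
}
plan = relax_query(query, similar, threshold=0.5)
```

Each entry of `plan` would then be translated into a SPARQL query and executed in order, for example through the Jena TDB API mentioned in step five.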
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810519637.XA CN108846029B (en) | 2018-05-28 | 2018-05-28 | Information correlation analysis method based on knowledge graph |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108846029A (en) | 2018-11-20 |
CN108846029B (en) | 2021-05-25 |
Family
ID=64213596
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810519637.XA Expired - Fee Related CN108846029B (en) | 2018-05-28 | 2018-05-28 | Information correlation analysis method based on knowledge graph |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108846029B (en) |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
TR01 | Transfer of patent right | Effective date of registration: 20220708. Address after: 100000 Room 401, block D, No. 8, Tongji South Road, Beijing Economic and Technological Development Zone, Daxing District, Beijing. Patentee after: BEIJING SAIXI TECHNOLOGY DEVELOPMENT Co.,Ltd. Address before: 150001 Intellectual Property Office, Harbin Engineering University science and technology office, 145 Nantong Avenue, Nangang District, Harbin, Heilongjiang. Patentee before: HARBIN ENGINEERING University |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20210525 |