CN108846029B - Information correlation analysis method based on knowledge graph - Google Patents

Publication number
CN108846029B
Authority
CN
China
Prior art keywords
triple
query
triplet
similarity
idf
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201810519637.XA
Other languages
Chinese (zh)
Other versions
CN108846029A (en)
Inventor
王念滨
陈锡瑞
谢晓东
王红滨
周连科
陈田田
原明旗
赵昱杰
厉原通
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING SAIXI TECHNOLOGY DEVELOPMENT CO LTD
Original Assignee
Harbin Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Engineering University
Priority to CN201810519637.XA
Publication of CN108846029A
Application granted
Publication of CN108846029B
Legal status: Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an intelligence correlation analysis method based on a knowledge graph, belonging to the field of intelligence relevance retrieval over RDF knowledge graphs. The method comprises the following steps: preprocessing the data by parsing the downloaded TXT documents of intelligence data; constructing a triple-based intelligence knowledge base; calculating the weight of each triple and of its keywords using an IDF and information entropy weighting method, and storing the weights in a database; calculating the similar triples of each triple using a triple similarity formula and sorting them by similarity; storing the RDF triples efficiently; implementing basic SPARQL query operations with the API provided by Jena TDB, and expanding queries with a query method based on triple similarity; and generating query samples, ranking the search result set by the weights of the triples and keywords, and returning the top-k results.

Description

Information correlation analysis method based on knowledge graph
Technical Field
The invention relates to an intelligence analysis scheme based on a knowledge graph, and belongs to the field of intelligence relevance retrieval over RDF knowledge graphs.
Background
In intelligence science, intelligence is reprocessed knowledge, and transitivity and correlation are basic attributes of knowledge. Because intelligence inherits these attributes, different pieces of intelligence are correlated with one another. With the rapid development of internet technology, modern intelligence has become massive, diverse and real-time, so the degree of association between pieces of intelligence knowledge keeps increasing, which makes modern intelligence analysis difficult. At the same time, the rapid development of artificial intelligence techniques, represented by machine learning, has made intelligence analysis feasible in a big data setting. The targets of intelligence analysis are intelligence resources, which can be decomposed into entities that are associated with one another. Machine learning methods can mine these entities and the relationships between them from intelligence data. However, such analysis generally stays at the level of simple statistical and syntactic analysis and does not reach the semantic level of the intelligence content. This easily loses semantics, so the analyzed intelligence is not sufficiently accurate or complete, and the correlation between different pieces of intelligence is not well reflected.
The knowledge graph is a technology that has emerged in recent years; its theory and concept were proposed by Google in 2012 by combining ideas such as the knowledge base, the semantic network and the ontology. A knowledge graph is essentially a graph structure based on a complex semantic network. It makes full use of visualization technology, so that it can describe knowledge resources and their carriers as well as analyze and describe knowledge and the relationships between pieces of knowledge. It presents complex knowledge as a graph in which nodes represent knowledge and edges represent relationships between pieces of knowledge, so that associations are represented more intuitively. Owing to this graph structure, the knowledge graph offers high coverage of entity concepts, diverse semantic relationships, a friendly structure and high quality, and it is increasingly becoming the main form of knowledge representation in the big data era. The knowledge graph is also regarded as a development direction for the new generation of artificial intelligence. Knowledge graphs were first applied in search engines: vendors such as Google, Bing, Baidu and Sogou developed knowledge graph products to improve retrieval quality. Knowledge graphs were then widely applied to intelligent search, personalized recommendation, text classification, content distribution and similar fields, with good results. Because a knowledge graph can adequately reflect the association relationships between entities, it can also be applied to intelligence analysis.
Disclosure of Invention
The purpose of the invention is realized as follows:
The intelligence correlation analysis method based on the knowledge graph is characterized by comprising the following steps:
step one, data preprocessing; parsing the downloaded TXT documents of intelligence data, extracting information such as the title, authors, keywords, organization and date of each piece of intelligence data, and removing stop words and duplicate items;
step two, constructing a triple-based intelligence knowledge base; constructing relationships among the entities, including authorship relationships between intelligence items and authors, affiliation relationships between authors and organizations, and cooperation relationships between different authors; combining the entities and relationships into triple data and storing the triple data in a triple database; in the extended triple pattern, the keywords of the intelligence data are used as the extension of the triple pattern;
step three, calculating the weight of each triple and of its keywords using an IDF and information entropy weighting method, and storing the weights in a database;
step four, calculating the similar triples of each triple using a triple similarity formula, and sorting them by similarity;
step five, storing the RDF triples efficiently; implementing basic SPARQL query operations with the API provided by Jena TDB, and expanding queries with a query method based on triple similarity;
step six, generating query samples, ranking the search result set by the weights of the triples and keywords, and returning the top-k results.
The IDF and information entropy weighting method comprises the following steps:
step one, calculating the IDF value of each triple, starting from the IDF value of each element e as shown in the following formula;
Figure BDA0001674485680000021
in the above formula, e represents an entity or relationship in an extended triple, n represents the number of triples in the RDF dataset, and f(e) represents the number of times e appears in the RDF dataset; let IDF(t) be the IDF value of the triple t = (s, p, o); according to the structure of the triple, the IDF value of each triple is calculated as shown in the following formula, where γ is an adjustment factor and 0 < γ < 1;
IDF(t)=γ×(IDF(s)+IDF(p)+IDF(o))
step two, calculating the information entropy H(e) of each element of a triple as shown in the following formula;
Figure BDA0001674485680000022
where e represents an entity or relationship in an extended triple, and k is the index of a data class; D_1(e) represents the frequency of occurrence of e in the dataset, D_2(e^(k)) represents the frequency of occurrence of e in the k-th data class, and |C| represents the total number of data classes; for each triple t = (s, p, o), the information entropy of the triple is calculated as shown in the following formula;
H(t)=η×(H(s)+H(p)+H(o))
where η is an adjustment factor and 0 < η < 1; the weight w(t) of each triple is determined by its IDF value and its information entropy H, as shown in the following formula, where |N| represents the total number of triples in the dataset;
Figure BDA0001674485680000031
step three, calculating the weights of the keywords in the same way as for the triples, by computing the IDF value and the information entropy of each keyword; first, the IDF value of a triple-keyword is calculated as shown in the following formula:
Figure BDA0001674485680000032
let IDF(v_i | e) be the IDF value of the keyword v_i in the keyword set v = (v_1, v_2, …, v_n) belonging to the triple t = (s, p, o), where e represents the head entity s or the tail entity o of the triple t and does not include the relationship between them; f(e, v_i) represents the number of times the head entity s or the tail entity o of the triple co-occurs with the keyword v_i in the dataset; let IDF(v_i | t) represent the IDF value of the triple-keyword, calculated as shown in the following formula, where
Figure BDA0001674485680000033
is an adjustment factor,
Figure BDA0001674485680000034
Figure BDA0001674485680000035
step four, calculating the information entropy H(v_i | e) of each keyword as shown in the following formula:
Figure BDA0001674485680000036
where e represents only the head and tail entities and does not include the relationship between them; k is the index of a data class, D_1(v_i, e) represents the number of times the keyword v_i and e co-occur in the dataset, D_2(v_i, e^(k)) represents the number of times the keyword v_i and e co-occur in the k-th data class, and |C| represents the total number of data classes; for a triple t and one of its keywords v_i, the information entropy is calculated as follows:
H(v_i|t)=θ×(H(v_i|s)+H(v_i|o))
where θ is an adjustment factor and 0 < θ < 1; the weight w(v_i | t) of each triple-keyword is determined by its IDF value and its information entropy H, as shown in the following formula:
Figure BDA0001674485680000037
The query method based on triple similarity comprises the following steps:
step one, measuring triple similarity;
let t_1 and t_2 be triples; the similarity of t_1 and t_2 is calculated as shown in the following formula:
Figure BDA0001674485680000041
step two, generating relaxed queries; for an original query
Figure BDA0001674485680000042
where
Figure BDA0001674485680000043
represents the i-th triple in the original query, first calculating the similarity for each triple in the original query using the above formula to obtain a group of similar triples, and sorting them by similarity to form a similar-triple queue; then, for each triple
Figure BDA0001674485680000044
selecting approximate triples from its similar-triple queue and constructing relaxed queries by replacing the corresponding triple in the original query;
step three, calculating the similarity between each sub-query and the original query using the following formula, sorting the sub-queries by similarity, and determining the execution order of the queries from this ranking; the original query is executed first, and the relaxed queries are then executed in descending order of similarity;
Figure BDA0001674485680000045
step four, setting a threshold on the semantic similarity between the original query and a relaxed query; if the similarity between the two queries is smaller than the threshold, the relaxed query is not derived.
The beneficial effects of the invention are as follows:
At present, researchers in China and abroad are actively studying relevance retrieval methods for intelligence knowledge graphs, but these methods face problems such as incomplete data and relationships, a lack of flexible queries, and an inability to reflect the diversity of the query result set. Building on RDF triple retrieval, the invention provides a targeted solution to the main problems of relevance queries over intelligence data and proposes R-SPARQL, a relevance retrieval strategy for extended RDF knowledge graphs.
Drawings
FIG. 1 is a block diagram of the algorithm flow of the present invention;
FIG. 2 is a diagram illustrating semantic overlap based on graph structure according to the present invention;
FIG. 3 is a diagram illustrating an original query derived relaxed query process according to the present invention;
FIG. 4 is a comparison of average recall and accuracy for different values of the relaxation degree α according to the present invention;
FIG. 5 is a comparison of average query time for different values of the relaxation degree α according to the present invention;
FIG. 6 is a comparison of average recall for different data set sizes in accordance with the present invention;
FIG. 7 is a comparison of average accuracy for different data set sizes in accordance with the present invention;
FIG. 8 is a comparison of average query time for different data set sizes according to the present invention.
Detailed Description
The invention is explained in more detail below with reference to the drawings and embodiments.
Intelligence correlation analysis is a research focus in information retrieval and intelligence science, and the main task of the invention is to make intelligence retrieval results reflect the associations among the data. Knowledge-graph-based information retrieval is essentially a subgraph matching problem: a knowledge graph can be regarded as a set of triples in RDF form, and it can be queried with the SPARQL language defined by the W3C standard. However, this query mode only returns the subgraph structures that satisfy the query conditions, so relevance and diversity are not reflected in the query result set.
To address this problem, the invention studies RDF-based knowledge graph retrieval in depth and proposes a relevance retrieval method. The idea is to extend the structure of the RDF knowledge graph with keywords and weights. The keyword extension provides a fuzzy retrieval mode, and the weights of the triples and keywords are used to calculate the similarity of query triples and to rank the query result set by relevance. Using the calculated similarity, several relaxed queries are derived from the original query, so that more mutually associated retrieval results are obtained.
The main points of the invention and its technical effects are as follows:
At present, researchers in China and abroad are actively studying relevance retrieval methods for intelligence knowledge graphs, but these methods face problems such as incomplete data and relationships, a lack of flexible queries, and an inability to reflect the diversity of the query result set. Building on RDF triple retrieval, the invention provides a targeted solution to the main problems of relevance queries over intelligence data and proposes R-SPARQL, a relevance retrieval strategy for extended RDF knowledge graphs. Its main ideas and contents are as follows:
(1) An extension of the triple pattern. A user may not know the exact name of the triple element to be queried, or may know only the name of an entity similar to it, so such queries often retrieve no data or return wrong results. An extended triple pattern query method is therefore proposed to avoid the overly exact matching of the traditional SPARQL query. First, the invention extends the RDF knowledge graph with keyword sets, so that each triple is associated with a set of keywords. The purpose of the keyword extension is to provide a fuzzy retrieval method that improves retrieval flexibility. Second, each keyword is assigned a weight according to its degree of association with the triple, called the triple-keyword weight. The purpose of the weight extension is to rank the retrieval result set by relevance so that the diversity of the results is reflected. The weight assigned to each triple therefore represents not only the importance of the triple in the dataset but also its ability to distinguish it from other triples. Ranking the retrieval result set by weight exploits this distinguishing ability and brings diversity into the result set.
Keywords can be obtained in various ways. For example, in scientific literature data, keyword sets can be constructed by extracting the subject terms of each document. Keyword selection also requires some preprocessing, such as removing stop words and duplicate items. Both the triple weight and the keyword weight are key factors in ordering the final result set. The RDF knowledge graph structure is extended with keyword sets and weights. Triple weights can be computed in various ways depending on the nature of the RDF knowledge graph data source; the invention calculates the triple weights and the keyword weights using inverse document frequency (IDF) and information entropy, in the following steps.
The first step: calculate the IDF value of each triple, starting from the IDF value of each element e as shown in the following formula.
Figure BDA0001674485680000061
In the above formula, e represents an entity or relationship in an extended triple, n represents the number of triples in the RDF dataset, and f(e) represents the number of times e appears in the RDF dataset. Let IDF(t) be the IDF value of the triple t = (s, p, o). According to the structure of the triple, the IDF value of each triple is calculated as follows, where γ is an adjustment factor and 0 < γ < 1.
IDF(t)=γ×(IDF(s)+IDF(p)+IDF(o)) (2)
The second step: the information entropy H(e) of each element of a triple is calculated as shown in the following formula.
Figure BDA0001674485680000062
Here e represents an entity or relationship in an extended triple, and k is the index of a data class. D_1(e) represents the frequency of occurrence of e in the dataset, D_2(e^(k)) represents the frequency of occurrence of e in the k-th data class, and |C| represents the total number of data classes. For each triple t = (s, p, o), the information entropy of the triple is calculated as shown below.
H(t)=η×(H(s)+H(p)+H(o)) (4)
Here η is an adjustment factor and 0 < η < 1. The weight w(t) of each triple is determined by its IDF value and its information entropy H, as shown in the following formula, where |N| represents the total number of triples in the dataset.
Figure BDA0001674485680000063
The third step: the keyword weights are calculated in the same way as the triple weights, by computing the IDF value and the information entropy of each keyword. First, the IDF value of a triple-keyword is calculated as shown in the following formula.
Figure BDA0001674485680000064
Let IDF(v_i | e) be the IDF value of the keyword v_i in the keyword set v = (v_1, v_2, …, v_n) belonging to the triple t = (s, p, o), where e represents the head entity s or the tail entity o of the triple t and does not include the relationship between them. f(e, v_i) represents the number of times the head entity s or the tail entity o of the triple co-occurs with the keyword v_i in the dataset. Let IDF(v_i | t) represent the IDF value of the triple-keyword, calculated as shown in the following formula, where
Figure BDA0001674485680000071
is an adjustment factor,
Figure BDA0001674485680000072
Figure BDA0001674485680000073
The fourth step: the information entropy H(v_i | e) of each keyword is calculated as shown in the following formula.
Figure BDA0001674485680000074
Here e represents only the head and tail entities and does not include the relationship between them. k is the index of a data class, D_1(v_i, e) represents the number of times the keyword v_i and e co-occur in the dataset, D_2(v_i, e^(k)) represents the number of times the keyword v_i and e co-occur in the k-th data class, and |C| represents the total number of data classes. For a triple t and one of its keywords v_i, the information entropy is calculated as shown below.
H(v_i|t)=θ×(H(v_i|s)+H(v_i|o)) (9)
Here θ is an adjustment factor and 0 < θ < 1. The weight w(v_i | t) of each triple-keyword is determined by its IDF value and its information entropy H, as shown in the following formula.
Figure BDA0001674485680000075
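The weighting steps above can be sketched in Python. The formula images for the element-level IDF and entropy are not reproduced in this text, so the sketch assumes the conventional forms IDF(e) = log(n / f(e)) and a Shannon entropy over the class distribution D_2(e^(k)) / D_1(e); only equations (2) and (4) are taken directly from the description. The function names, the sample triples and the default factor values are illustrative assumptions, and the final weight w(t) is left out because its formula cannot be recovered from the text.

```python
import math
from collections import Counter

def element_idf(triples):
    """IDF of every subject, predicate and object.
    Assumption: IDF(e) = log(n / f(e)); the source formula image is not available."""
    n = len(triples)
    freq = Counter(e for t in triples for e in t)
    return {e: math.log(n / f) for e, f in freq.items()}

def triple_idf(t, idf, gamma=0.5):
    # Equation (2): IDF(t) = gamma * (IDF(s) + IDF(p) + IDF(o)), 0 < gamma < 1
    s, p, o = t
    return gamma * (idf[s] + idf[p] + idf[o])

def element_entropy(class_counts):
    """Shannon entropy of one element over the data classes (assumed form).
    class_counts[k] plays the role of D_2(e^(k)); their sum plays the role of D_1(e)."""
    total = sum(class_counts)
    probs = [c / total for c in class_counts if c > 0]
    return -sum(p * math.log(p) for p in probs)

def triple_entropy(t, entropy, eta=0.5):
    # Equation (4): H(t) = eta * (H(s) + H(p) + H(o)), 0 < eta < 1
    s, p, o = t
    return eta * (entropy[s] + entropy[p] + entropy[o])

# Hypothetical sample data, not taken from the patent
triples = [
    ("paper1", "writtenBy", "alice"),
    ("paper2", "writtenBy", "alice"),
    ("alice", "memberOf", "HEU"),
    ("bob", "memberOf", "HEU"),
]
idf = element_idf(triples)
idf_t0 = triple_idf(triples[0], idf)
```

Rare elements such as `paper1` receive a higher IDF than frequent ones such as `alice`, which is the behavior the description relies on when it uses the weight to distinguish triples.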
(2) A relevance query method based on triple similarity. The invention provides a relevance query strategy suited to the structure of the RDF knowledge graph. Query relaxation is an important relevance query technique: by changing or deleting one or more hard conditions of an original query, or by expanding its conditions, it produces queries with wider coverage and finally yields a group of results associated with the results of the original query. The specific steps are as follows:
The first step: measuring triple similarity. From the perspective of the knowledge graph structure, an entity may be connected to another entity by one or more edges, and directly connected triples are often similar: they represent the several relationships one entity may have, or the different entities linked by the same relationship. Two triples may also have no directly connecting edge yet share the same entity label or edge label, and the similarity between such triples cannot be neglected. For this reason, the concept of triple-based semantic overlap is introduced. In an ontology, semantic overlap denotes the number of concepts that share the same upper-level concept, which indicates how similar two concepts are. The semantic overlap of two RDF triples is defined as the number of triples with the same element; semantic overlap based on the graph structure is shown in FIG. 2.
Another method for measuring triple similarity is based on graph embedding: the triples of the knowledge graph are embedded into a low-dimensional vector space, so that the head entity, the tail entity and the relationship of each triple are each represented by a vector. Let t = (s, p, o), and let the vector representations of its entities and relationship be
Figure BDA0001674485680000081
and
Figure BDA0001674485680000082
If the triple holds, then the following equation holds:
Figure BDA0001674485680000083
If two triples t_1 = (s_1, p_1, o_1) and t_2 = (s_2, p_2, o_2) are highly similar, the above equation is considered to still hold when elements are exchanged between them. For example, if the triples t_1 and t_2 are similar, then
Figure BDA0001674485680000084
or
Figure BDA0001674485680000085
Therefore, combining these two calculation methods, the similarity of the triples t_1 and t_2 is calculated as shown in the following formula.
Figure BDA0001674485680000086
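The two similarity ingredients described above can be illustrated with a small Python sketch. The combined similarity formula is not reproduced in this text, so everything here is an assumption: semantic overlap is simplified to the fraction of positions at which two triples share an element, the embedding equation is assumed to be the translation form s + p ≈ o, and `beta` is a hypothetical mixing weight that does not come from the source.

```python
import math

def semantic_overlap(t1, t2):
    """Fraction of positions (s, p, o) at which two triples share the same element.
    This is a simplified reading of the overlap definition, not the source formula."""
    return sum(a == b for a, b in zip(t1, t2)) / 3

def residual(s, p, o):
    """Euclidean norm of s + p - o; close to zero when the triple holds under
    the assumed translation model."""
    return math.sqrt(sum((a + b - c) ** 2 for a, b, c in zip(s, p, o)))

def swap_similarity(t1, t2, emb):
    """Exchange the tail entities of t1 and t2 and map the mean residual to (0, 1]."""
    (s1, p1, o1), (s2, p2, o2) = t1, t2
    r = (residual(emb[s1], emb[p1], emb[o2]) + residual(emb[s2], emb[p2], emb[o1])) / 2
    return 1 / (1 + r)

def triple_similarity(t1, t2, emb, beta=0.5):
    # beta is a hypothetical mixing weight between the two measures
    return beta * semantic_overlap(t1, t2) + (1 - beta) * swap_similarity(t1, t2, emb)

# Hypothetical two-dimensional embeddings in which a + r equals both b and c
emb = {"a": [0.0, 0.0], "r": [1.0, 0.0], "b": [1.0, 0.0], "c": [1.0, 0.0]}
t1, t2 = ("a", "r", "b"), ("a", "r", "c")
sim = triple_similarity(t1, t2, emb)
```

In this toy setting the two triples share head and relationship, and their tails are interchangeable in the embedding space, so both measures report a high similarity.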
The second step is that: a method of generating a relaxed query. For an original query
Figure BDA0001674485680000087
Wherein
Figure BDA0001674485680000088
Representing the ith triplet in the original query. Firstly, calculating the similarity of each triple in the original query through the formula to obtain a group of similar triples, and sequencing according to the similarity to form a similar triple queue. Selecting each triplet
Figure BDA0001674485680000089
The approximate triples in the similar triples queue of (b) construct the relaxed query by replacing the triples in the original query.
The third step: the similarity of each sub-query and the original query is calculated through the following formula, the similarity is used for sorting, and the execution sequence of each query is determined according to the sorting. First, the original query would be executed first, and then the relaxed queries with the highest similarity would be executed in order.
Figure BDA00016744856800000810
The fourth step: a constraint scheme for relaxing the number of queries. An original query may derive a plurality of relaxed queries, and for each relaxed query, a plurality of sub relaxed queries are derived, the number of derivation being referred to as the relaxation α. The number of relaxed queries grows exponentially with increasing alpha values. Without limitation, too many relaxed queries may be derived from the original query. However, too many relaxed queries may not retrieve results or produce meaningless results. Therefore, it is necessary to put a limit on the amount of slack. The present invention sets a threshold on the semantic similarity between the original query and the relaxed query, and if the similarity between the two queries is less than this threshold, it will not be derived. FIG. 3 illustrates the process of deriving sub-queries from an original query.
(3) A top-k selection scheme for the query result set. Triple-pattern queries typically return too many results. Moreover, the result sets returned by relaxed queries are often too similar to one another to show diversity, which makes it difficult for the user to find the data actually wanted. The invention provides a ranking model based on the extended RDF triple pattern, which ranks the result subgraphs according to the relevance between each triple and its keywords and finally returns the most relevant top-k results. Let the original query be
Figure BDA0001674485680000091
from which m relaxed queries (Q_1, Q_2, …, Q_m) are derived, where
Figure BDA0001674485680000092
represents the i-th query triple of the query Q_j, and
Figure BDA0001674485680000093
is extended by a keyword set
Figure BDA0001674485680000094
Executing the query Q_j yields a set of result triples (q_1, q_2, …, q_s). If the triple
Figure BDA0001674485680000095
retrieves the result triple q_w, this is called the transition from the triple
Figure BDA0001674485680000096
to q_w. Assuming that the keywords in the keyword set
Figure BDA0001674485680000097
are mutually independent, the transition from
Figure BDA0001674485680000098
to q_w is decomposed into the transitions from each independent keyword
Figure BDA0001674485680000099
to q_w, whose transition probability can be calculated by the following formulas. The result set is ranked by these probability values.
Figure BDA00016744856800000910
Figure BDA00016744856800000911
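A minimal sketch of the top-k selection is given below. The transition probability formulas are not reproduced in this text, so the per-keyword probabilities are taken as given inputs, and the score of a result triple is assumed to be their product, which is one natural reading of the keyword independence assumption. The function names and the sample values are hypothetical.

```python
import heapq
import math

def transition_score(keyword_probs):
    """Score of one result triple: product of the per-keyword transition
    probabilities, assuming the keywords are mutually independent."""
    return math.prod(keyword_probs)

def top_k(results, k):
    """results: dict mapping a result triple to its list of per-keyword
    transition probabilities. Returns the k highest-scoring result triples."""
    scored = ((transition_score(probs), q) for q, probs in results.items())
    return [q for _, q in heapq.nlargest(k, scored)]

# Hypothetical result set with made-up probabilities
results = {
    ("paper1", "writtenBy", "alice"): [0.9, 0.5],
    ("paper2", "writtenBy", "alice"): [0.6, 0.6],
    ("paper3", "writtenBy", "bob"): [0.2, 0.9],
}
best_two = top_k(results, 2)
```

`heapq.nlargest` avoids sorting the whole result set when k is small relative to the number of results.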
The technical effects of the invention are as follows:
The invention provides an RDF knowledge graph representation extended with keywords and weights, and extends the SPARQL query pattern by introducing additional variables. On top of the extended triple pattern, it provides a query relaxation method based on triple similarity: several relaxed queries are generated by calculating their similarity to the original query, which broadens the user's query intention, and a group of the most relevant query results is finally returned. The query result set is then ranked by relevance using the triple weights, taking the diversity of the result set into account.
In simulation experiments, the traditional SPARQL query is compared with the relaxed query scheme of the invention on three performance indicators, namely average recall, average accuracy and average query time, and the results are presented as plots. First, recall, accuracy and query time under different relaxation degrees are compared in FIG. 4 and FIG. 5. When α is 0, only the original query is executed; as α increases, the number of relaxed queries grows exponentially, and recall and accuracy increase as well. This improvement comes at the cost of longer query time. Recall, accuracy and query time for different data set sizes are compared in FIG. 6, FIG. 7 and FIG. 8.
The traditional SPARQL query returns only the result set that satisfies the query conditions and does not reflect the degree of association between data. The R-SPARQL method of the invention expands the original query into several sub-queries, so it returns not only the result set that satisfies the conditions but also associated entities that fit the structure of the knowledge graph. R-SPARQL therefore clearly outperforms the traditional SPARQL query in average recall and average accuracy.
The intelligence correlation query scheme based on the extended RDF knowledge graph is realized through the following steps.
The method comprises the following steps: and (5) preprocessing the data. The main task is to analyze the downloaded information data TXT document, extract information such as title, author, keyword, organization, date and the like of each information data, and remove stop words and repeated items.
Step two: construct the triple intelligence knowledge base. Relationships are constructed between entities, such as the writing relationship between an intelligence item and an author, the affiliation relationship between an author and an organization, and the cooperation relationship between different authors. The entities and relationships are combined into triple data and stored in a triple database. In the extended triple pattern, the invention uses the keywords of the intelligence data as the extension of the triple pattern.
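The construction of extended triples can be sketched in Python as below. The record shape and the predicate names `writes`, `affiliatedWith` and `cooperatesWith` are hypothetical placeholders, and attaching the keywords as a fourth tuple component is only one way to represent the extension.

```python
from itertools import combinations


def build_triples(record):
    """Return (subject, predicate, object, keywords) tuples for one record.

    `record` is a hypothetical dict with "title", "authors" (list),
    "organization" and "keywords" (list). The predicate names are
    placeholders, not identifiers taken from the source text.
    """
    keywords = tuple(record.get("keywords", ()))
    authors = record.get("authors", [])
    triples = []
    for author in authors:
        # Writing relationship between an author and the intelligence item.
        triples.append((author, "writes", record["title"], keywords))
        # Affiliation relationship between an author and an organization.
        if record.get("organization"):
            triples.append((author, "affiliatedWith", record["organization"], keywords))
    # Cooperation relationship between every pair of co-authors.
    for a, b in combinations(authors, 2):
        triples.append((a, "cooperatesWith", b, keywords))
    return triples
```

For a record with two authors and one organization this yields two writing triples, two affiliation triples and one cooperation triple.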
Step three: calculate the weight of each triple and of its keywords using the IDF and information entropy weighting method, and store the weights in the database.
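The two ingredients of the triple weight can be sketched as follows. Only the combinations IDF(t) = γ × (IDF(s) + IDF(p) + IDF(o)) and H(t) = η × (H(s) + H(p) + H(o)) come from the claims; the per-element forms used here, IDF(e) = log(n / f(e)) and a Shannon entropy over data classes, are assumptions, because the original formulas survive only as figure references. The final combination into w(t) is deliberately left out for the same reason.

```python
import math
from collections import Counter


def element_idf(triples):
    """IDF of every entity/relationship, assuming IDF(e) = log(n / f(e)).

    f(e) is counted once per triple containing e, so the value is never negative.
    """
    n = len(triples)
    freq = Counter(e for t in triples for e in set(t))
    return {e: math.log(n / f) for e, f in freq.items()}


def triple_idf(t, idf, gamma=0.5):
    """IDF(t) = gamma * (IDF(s) + IDF(p) + IDF(o)), with 0 < gamma < 1."""
    s, p, o = t
    return gamma * (idf[s] + idf[p] + idf[o])


def element_entropy(e, class_counts):
    """Shannon entropy of e over data classes (assumed form of H(e)).

    class_counts[k][e] is the frequency of e in class k (the role of D2);
    the sum over all classes plays the role of D1.
    """
    total = sum(c.get(e, 0) for c in class_counts)
    h = 0.0
    for c in class_counts:
        if c.get(e, 0):
            p = c[e] / total
            h -= p * math.log2(p)
    return h


def triple_entropy(t, class_counts, eta=0.5):
    """H(t) = eta * (H(s) + H(p) + H(o)), with 0 < eta < 1."""
    return eta * sum(element_entropy(e, class_counts) for e in t)
```

The default values of `gamma` and `eta` are arbitrary; the claims only require both factors to lie strictly between 0 and 1.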
Step four: calculate the similar triples of each triple using the triple similarity formula, and sort them by similarity.
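As a rough illustration of this step, the sketch below ranks candidate triples by semantic overlap, read here as the number of positions (subject, predicate, object) in which two triples carry the same element. This position-wise reading is an assumption, and the overlap count stands in for the similarity formula, which is not reproduced in the text.

```python
def semantic_overlap(t1, t2):
    """Count positions (subject, predicate, object) where two triples agree."""
    return sum(1 for a, b in zip(t1, t2) if a == b)


def similar_triples(target, candidates):
    """Return candidates sharing at least one element, most similar first."""
    scored = [(semantic_overlap(target, c), c) for c in candidates if c != target]
    scored = [(score, c) for score, c in scored if score > 0]
    # list.sort is stable, so candidates with equal overlap keep input order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored
```

The sorted list plays the role of the similar-triple queue from which replacements are later drawn.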
Step five: store the RDF triples efficiently. Jena TDB is a component that supports storing, querying and updating RDF data, and ontology data in RDF format can be conveniently imported into it. The method uses the API provided by Jena TDB to implement basic SPARQL query operations, and uses the query relaxation model based on triple similarity to perform query expansion.
Step six: generate query samples, rank the retrieval result set by the weights of the triples and keywords, and return the top-k results.
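The final ranking can be sketched in a few lines of Python. The result tuple shape is hypothetical, and adding the triple weight to the keyword weight is an assumption; the text only states that both weights drive the ordering.

```python
import heapq


def top_k(results, k):
    """Return the k results with the highest combined weight.

    Each result is a hypothetical (answer, triple_weight, keyword_weight)
    tuple; adding the two weights is an assumption made for this sketch.
    """
    return heapq.nlargest(k, results, key=lambda r: r[1] + r[2])
```

`heapq.nlargest` avoids sorting the whole result set when k is much smaller than the number of results.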

Claims (2)

1. An intelligence correlation analysis method based on a knowledge graph, characterized by comprising the following steps:
step one, data preprocessing; parsing the downloaded intelligence data TXT documents, extracting the title, author, keywords, organization and date of each intelligence item, and removing stop words and duplicate items;
step two, constructing a triple intelligence knowledge base; constructing relationships among the entities, including writing relationships between intelligence items and authors, affiliation relationships between authors and organizations, and cooperation relationships between different authors; combining the entities and the relationships to form triple data, and storing the triple data in a triple database; in the extended triple pattern, the keywords of the intelligence data are used as the extension of the triple pattern;
step three, calculating the weight of each triple and of its keywords using an IDF and information entropy weighting method, and storing the weights in a database;
step four, calculating the similar triples of each triple using a triple similarity formula, and sorting them by similarity;
step five, storing the RDF triples efficiently; implementing basic SPARQL query operations using the API provided by Jena TDB, and performing query expansion according to a query method based on triple similarity;
step six, generating query samples, ranking the retrieval result set by the weights of the triples and keywords, and returning the top-k results;
the IDF and information entropy weighting method described in step three comprises the following steps:
step (I), the IDF value of each triple is calculated as follows;
Figure FDA0002945246540000011
in the above formula, e represents an entity or a relationship in an extended triple, n represents the number of triples in the RDF dataset, and f(e) represents the number of times e appears in the RDF dataset; let IDF(t) be the IDF value of a triple t = (s, p, o); according to the structure of the triple, the IDF value of each triple is calculated as shown in the following formula, where γ is an adjustment factor and 0 < γ < 1;
IDF(t)=γ×(IDF(s)+IDF(p)+IDF(o))
step (II), the information entropy H(e) of each element e of a triple is calculated as shown in the following formula;
Figure FDA0002945246540000021
where e represents an entity or a relationship in the extended triple, and k denotes the k-th data class; D1(e) represents the frequency of occurrence of e in the data set, D2(e(k)) represents the frequency of occurrence of e in the k-th data class, and |C| represents the total number of data classes;
for each triple t = (s, p, o), its information entropy is calculated as shown in the following formula;
H(t)=η×(H(s)+H(p)+H(o))
where η is an adjustment factor and 0 < η < 1; the weight w(t) of each triple is determined by its IDF value and its information entropy H, as shown in the following formula, where |N| represents the total number of triples in the dataset;
Figure FDA0002945246540000022
step (III), the weight of a keyword is calculated in a similar way to that of a triple: the IDF value and the information entropy of each keyword are calculated separately; first, the IDF of a triple-keyword is calculated as shown in the following formula:
Figure FDA0002945246540000023
let IDF(v_i | e) be the IDF value of keyword v_i in the key phrase v = (v_1, v_2, ..., v_n) belonging to the triple t = (s, p, o), where e represents the head entity s or the tail entity o of the triple t, excluding the relationship between them; f(e, v_i) represents the number of times the head entity s or the tail entity o of the triple co-occurs with the keyword v_i in the data set; let IDF(v_i | t) represent the IDF value of the triple-keyword, which, together with its adjustment factor, is given as follows,
Figure FDA0002945246540000031
Figure FDA0002945246540000032
step (IV), the information entropy H(v | e) of a keyword is calculated as shown in the following formula:
Figure FDA0002945246540000033
where e represents only the head and tail entities, excluding the relationship between them; k denotes the k-th data class; D1(v_i, e(k)) represents the number of co-occurrences of keyword v_i and e in the dataset, D2(v_i, e(k)) represents the number of co-occurrences of keyword v_i and e in the k-th data class, and |C| represents the total number of data classes; for a triple t and one of its keywords v_i, the information entropy is calculated as follows:
H(v_i|t)=θ×(H(v_i|s)+H(v_i|o))
where θ represents an adjustment factor and 0 < θ < 1; the weight w(v_i | t) of each triple-keyword is determined by an IDF value and an information entropy H, as shown in the following formula:
Figure FDA0002945246540000034
the triple similarity measure uses either a judgment based on semantic overlap or a judgment based on graph embedding over the graph structure;
the semantic overlap of two RDF triples is defined as the number of elements the triples have in common; from the perspective of the knowledge graph structure, an entity may have one or more edges connecting it to another entity, and triples that are directly connected in this way are similar; two triples with no directly connecting edge between them are also similar if they have the same entity label and edge label;
in graph embedding, the triples of the knowledge graph are embedded into a low-dimensional vector space, where the head entity, tail entity and relationship of a triple are each represented by a vector; let t = (s, p, o), with the vector representations of its entities and relationship being s, p and o respectively; if the triple holds, the following equation holds
Figure FDA0002945246540000041
If two triples t_1 = (s_1, p_1, o_1) and t_2 = (s_2, p_2, o_2) have a high degree of similarity, the above equation still holds after their corresponding elements are interchanged.
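The interchange test described at the end of claim 1 can be illustrated in Python under one explicit assumption: that the embedding equation, which appears above only as a figure reference, is the translation constraint s + p ≈ o used by TransE-style models. The tolerance `tol` and the choice to test every mixed combination are likewise hypothetical.

```python
import math


def translation_error(s, p, o):
    """Euclidean norm of s + p - o; small when the triple fits the embedding."""
    return math.sqrt(sum((si + pi - oi) ** 2 for si, pi, oi in zip(s, p, o)))


def similar_by_embedding(t1, t2, tol=0.1):
    """Check that swapping corresponding elements keeps the relation valid.

    t1 and t2 are (s, p, o) triples of equal-length vectors. The triples
    count as similar when every mixed triple obtained by exchanging
    elements still has a translation error of at most `tol` (a
    hypothetical tolerance, not a value from the source text).
    """
    (s1, p1, o1), (s2, p2, o2) = t1, t2
    swapped = [(s2, p1, o1), (s1, p2, o1), (s1, p1, o2),
               (s1, p2, o2), (s2, p1, o2), (s2, p2, o1)]
    return all(translation_error(*t) <= tol for t in swapped)
```

Two triples whose vectors are nearly interchangeable pass the check, while triples located in distant regions of the vector space fail it as soon as one element is swapped.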
2. The intelligence correlation analysis method based on a knowledge graph as claimed in claim 1, wherein the query method based on triple similarity in step five comprises the following steps:
step (I), measuring triple similarity;
let t_1 and t_2 be triples; the similarity of t_1 and t_2 is calculated as shown in the following formula:
Figure FDA0002945246540000042
step (II), generating relaxed queries; for an original query
Figure FDA0002945246540000043
where
Figure FDA0002945246540000044
represents the i-th triple in the original query; first, the similarity of each triple in the original query is calculated using the above formula to obtain a group of similar triples, which are sorted by similarity to form a similar-triple queue; for each triple
Figure FDA0002945246540000045
the approximate triples in its similar-triple queue are selected, and a relaxed query is constructed by substituting them for the corresponding triples in the original query;
step (III), the similarity between each sub-query and the original query is calculated by the following formula, the sub-queries are sorted by similarity, and the execution order of the queries is determined by this ranking; the original query is executed first, and the relaxed queries are then executed in descending order of similarity;
Figure FDA0002945246540000051
step (IV), a threshold is set for the semantic similarity between the original query and a relaxed query; if the similarity between the two queries is smaller than the threshold, the relaxed query is not derived.
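The relaxation procedure of claim 2 can be sketched in Python as follows. The data shapes are hypothetical, and scoring a relaxed query by the product of its per-triple similarities is an assumption standing in for the query similarity formula, which is not reproduced in the text.

```python
from itertools import product


def relaxed_queries(original, similar, threshold):
    """Generate relaxed queries ordered by similarity to the original.

    original:  list of triple patterns.
    similar:   dict mapping a pattern to [(similarity, replacement), ...].
    threshold: relaxed queries scoring below this value are not derived.
    The query score is the product of per-pattern similarities, which is
    an assumption made for this sketch.
    """
    # Each pattern may stay unchanged (similarity 1.0) or be replaced.
    options = [[(1.0, pat)] + list(similar.get(pat, [])) for pat in original]
    scored = []
    for combo in product(*options):
        score = 1.0
        for sim, _ in combo:
            score *= sim
        query = [pat for _, pat in combo]
        if query != list(original) and score >= threshold:
            scored.append((score, query))
    scored.sort(key=lambda item: item[0], reverse=True)
    # The original query runs first, then relaxed queries by falling similarity.
    return [(1.0, list(original))] + scored
```

Because every pattern can independently keep or swap its triple, the number of candidate queries grows multiplicatively with the number of patterns, which is why the threshold matters for query time.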
CN201810519637.XA 2018-05-28 2018-05-28 Information correlation analysis method based on knowledge graph Expired - Fee Related CN108846029B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810519637.XA CN108846029B (en) 2018-05-28 2018-05-28 Information correlation analysis method based on knowledge graph

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810519637.XA CN108846029B (en) 2018-05-28 2018-05-28 Information correlation analysis method based on knowledge graph

Publications (2)

Publication Number Publication Date
CN108846029A CN108846029A (en) 2018-11-20
CN108846029B true CN108846029B (en) 2021-05-25

Family

ID=64213596

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810519637.XA Expired - Fee Related CN108846029B (en) 2018-05-28 2018-05-28 Information correlation analysis method based on knowledge graph

Country Status (1)

Country Link
CN (1) CN108846029B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109582803A (en) * 2018-11-30 2019-04-05 广东电网有限责任公司 The construction method and system of competitive intelligence database
CN109508389B (en) * 2018-12-19 2021-05-28 哈尔滨工程大学 Visual accelerating method for personnel social relationship map
CN109710621B (en) * 2019-01-16 2022-06-21 福州大学 Keyword search KSANEW method combining semantic nodes and edge weights
CN110083732B (en) * 2019-03-12 2021-08-31 浙江大华技术股份有限公司 Picture retrieval method and device and computer storage medium
CN110082116B (en) * 2019-03-18 2022-04-19 深圳市元征科技股份有限公司 Evaluation method and evaluation device for vehicle four-wheel positioning data and storage medium
CN110222240B (en) * 2019-05-24 2021-03-26 华中科技大学 Abstract graph-based space RDF data keyword query method
CN110457486A (en) * 2019-07-05 2019-11-15 中国人民解放军战略支援部队信息工程大学 The people entities alignment schemes and device of knowledge based map
CN110765233A (en) * 2019-11-11 2020-02-07 中国人民解放军军事科学院评估论证研究中心 Intelligent information retrieval service system based on deep mining and knowledge management technology
CN111753055B (en) * 2020-06-28 2024-01-26 中国银行股份有限公司 Automatic prompt method and device for customer questions and answers
CN113495955A (en) * 2021-07-08 2021-10-12 北京明略软件系统有限公司 Expert pushing method, system, equipment and storage medium for document
CN113901452B (en) * 2021-09-30 2022-05-17 中国电子科技集团公司第十五研究所 Sub-graph fuzzy matching security event identification method based on information entropy

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003030204A (en) * 2001-07-17 2003-01-31 Takami Yasuda Server for providing video contents, device and method for preparing file for video contents retrieval, computer program and device and method for supporting video clip preparation

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104166670A (en) * 2014-06-17 2014-11-26 青岛农业大学 Information inquiry method based on semantic network
CN104765779A (en) * 2015-03-20 2015-07-08 浙江大学 Patent document inquiry extension method based on YAGO2s


Also Published As

Publication number Publication date
CN108846029A (en) 2018-11-20


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right
Effective date of registration: 20220708
Address after: 100000 Room 401, block D, No. 8, Tongji South Road, Beijing Economic and Technological Development Zone, Daxing District, Beijing
Patentee after: BEIJING SAIXI TECHNOLOGY DEVELOPMENT Co.,Ltd.
Address before: 150001 Intellectual Property Office, Harbin Engineering University science and technology office, 145 Nantong Avenue, Nangang District, Harbin, Heilongjiang
Patentee before: HARBIN ENGINEERING University
CF01 Termination of patent right due to non-payment of annual fee
Granted publication date: 20210525