CN108846029B - Information correlation analysis method based on knowledge graph

Info

Publication number: CN108846029B
Application number: CN201810519637.XA
Authority: CN
Inventors: 王念滨; 陈锡瑞; 谢晓东; 王红滨; 周连科; 陈田田; 原明旗; 赵昱杰; 厉原通
Original assignee: Harbin Engineering University
Current assignee: BEIJING SAIXI TECHNOLOGY DEVELOPMENT CO LTD
Priority date: 2018-05-28
Filing date: 2018-05-28
Publication date: 2021-05-25
Anticipated expiration: 2038-05-28
Also published as: CN108846029A

Abstract

The invention discloses an intelligence correlation analysis method based on a knowledge graph, belonging to the field of intelligence relevance retrieval over RDF knowledge graphs. The method comprises the following steps: preprocessing the data by parsing the TXT documents of the downloaded intelligence data; constructing a triple intelligence knowledge base; calculating the weight of each triple and of its keywords using an IDF and information entropy weighting method, and storing the weights in a database; calculating the similar triples of each triple using a triple similarity formula, and sorting them by similarity; storing the RDF triples efficiently; implementing basic SPARQL query operations through the API provided by Jena TDB, and expanding queries with a query method based on triple similarity; and generating query samples, ranking the retrieval result set by the weights of the triples and keywords, and returning the top-k results.

Description

Information correlation analysis method based on knowledge graph

Technical Field

The invention relates to an intelligence analysis scheme based on a knowledge graph, and belongs to the field of intelligence relevance retrieval over RDF knowledge graphs.

Background

In intelligence science, intelligence is reprocessed knowledge, and transitivity and correlation are basic attributes of knowledge. It is precisely these properties of transitivity and correlation that create associations between different pieces of intelligence. With the rapid development of Internet technology, modern intelligence is massive, diverse and produced in real time, so intelligence knowledge is becoming ever more interconnected, which makes modern intelligence harder to analyze. At the same time, the rapid progress of artificial intelligence techniques, represented by machine learning, has made intelligence analysis feasible in a big-data setting. The objects of intelligence analysis are intelligence resources, which can be decomposed into interrelated entities. Machine learning methods can mine the entities in intelligence data and the relationships between them. However, such analysis generally stays at the level of simple statistical and syntactic analysis and does not reach the semantic level of the intelligence content. Semantic information is therefore easily lost, so the analyzed intelligence is neither accurate nor complete enough, and the correlations between different pieces of intelligence are poorly reflected.

The knowledge graph is a technology that has emerged in recent years. Its theory and concept were proposed by Google in 2012, combining ideas from knowledge bases, semantic networks and ontologies. A knowledge graph is essentially a graph structure built on a complex semantic network. It makes full use of visualization techniques, so it can describe knowledge resources and their carriers as well as analyze and describe knowledge and the relationships between pieces of knowledge. It presents complex knowledge as a graph in which nodes represent knowledge and edges represent the relationships between knowledge, so that associations are shown more intuitively. Thanks to its graph structure, a knowledge graph offers high coverage of entities and concepts, diverse semantic relationships, a friendly structure and high quality, and it has increasingly become the principal form of knowledge representation in the big-data era. The knowledge graph has also been hailed as a development direction for the new generation of artificial intelligence. Knowledge graphs were first applied in search engines: vendors such as Google, Bing, Baidu and Sogou all developed knowledge graph products to improve retrieval quality. Knowledge graphs have since been widely applied to intelligent search, personalized recommendation, text classification, content distribution and other fields, with good results. Because a knowledge graph can fully reflect the relationships between entities, applying it to intelligence analysis is feasible.

Disclosure of Invention

The object of the invention is achieved as follows:

the intelligence correlation analysis method based on the knowledge graph is characterized by comprising the following steps:

step one, data preprocessing: parsing the downloaded intelligence data TXT documents, extracting information such as the title, author, keywords, institution and date of each piece of intelligence data, and removing stop words and duplicate items;

step two, constructing a triple intelligence knowledge base: constructing relationships among the entities, including the authorship relationship between a piece of intelligence and its author, the affiliation relationship between an author and an institution, and the cooperation relationship between different authors; combining the entities and relationships into triple data and storing the triple data in a triple database; in the extended triple pattern, the keywords of the intelligence data serve as the extension of the triple pattern;

step three, calculating the weight of each triple and of its keywords using an IDF and information entropy weighting method, and storing the weights in a database;

step four, calculating the similar triples of each triple using a triple similarity formula, and sorting them by similarity;

step five, storing the RDF triples efficiently; implementing basic SPARQL query operations through the API provided by Jena TDB, and expanding queries with a query method based on triple similarity;

step six, generating query samples, ranking the retrieval result set by the weights of the triples and keywords, and returning the top-k results.

The IDF and information entropy weighting method comprises the following steps:

step one, the IDF value of each triple is calculated as follows;

in the formula for IDF(e), e denotes an entity or relation in an extended triple, n denotes the number of triples in the RDF dataset, and f(e) denotes the number of times e appears in the RDF dataset; let IDF(t) be the IDF value of the triple t = (s, p, o); based on the structure of the triple, the IDF value of each triple is calculated by the following formula, where γ is an adjustment factor and 0 < γ < 1;

IDF(t) = γ × (IDF(s) + IDF(p) + IDF(o))

step two, the information entropy H(e) of each element of a triple is calculated by the following formula;

where e denotes an entity or relation in an extended triple, and k indexes the data classes; D₁(e) denotes the frequency of e in the dataset, D₂(e^(k)) denotes the frequency of e in the k-th data class, and |C| denotes the total number of data classes; for each triple t = (s, p, o), the information entropy of the triple is calculated as follows;

H(t) = η × (H(s) + H(p) + H(o))

where η is an adjustment factor and 0 < η < 1; the weight w(t) of each triple is determined by its IDF value and its information entropy H, as shown in the following formula, where |N| denotes the total number of triples in the dataset;

step three, the keyword weight is calculated in a similar way to the triple weight: the IDF value and the information entropy of each keyword are calculated separately; first, the IDF of a triple-keyword is calculated by the following formula:

let IDF(v_i | e) be the IDF value of keyword v_i in the keyword set v = (v₁, v₂, …, v_n) belonging to the triple t = (s, p, o), where e denotes the head entity s or the tail entity o of the triple t, excluding the relation between them; f(e, v_i) denotes the number of times the head entity s or the tail entity o of the triple co-occurs with the keyword v_i in the dataset; let IDF(v_i | t) denote the IDF value of the triple-keyword, calculated by the following formula, which includes an adjustment factor;

step four, the information entropy H(v_i | e) of a keyword is calculated by the following formula:

where e denotes only the head and tail entities, excluding the relation between them; k indexes the data classes, D₁(v_i, e) denotes the number of co-occurrences of keyword v_i and e in the dataset, D₂(v_i, e^(k)) denotes the number of co-occurrences of keyword v_i and e in the k-th data class, and |C| denotes the total number of data classes; for a triple t and one of its keywords v_i, the information entropy is calculated as follows:

H(v_i | t) = θ × (H(v_i | s) + H(v_i | o))

where θ is an adjustment factor and 0 < θ < 1; the weight w(v_i | t) of each triple-keyword is determined by its IDF value and its information entropy H, as shown in the following formula:

the query method based on triple similarity comprises the following steps:

step one, a measure based on triple similarity;

let t₁ and t₂ be triples; the similarity of t₁ and t₂ is calculated by the following formula:

step two, generating relaxed queries: for an original query, consider the i-th triple of the original query; first, the triples similar to each triple of the original query are calculated with the above formula and sorted by similarity to form a similar-triple queue; a relaxed query is then constructed by replacing a triple of the original query with an approximate triple selected from that triple's similar-triple queue;

step three, calculating the similarity between each sub-query and the original query by the following formula, sorting the sub-queries by similarity, and determining the execution order of the queries from this ranking; the original query is executed first, followed by the relaxed queries in descending order of similarity;

step four, setting a threshold on the semantic similarity between the original query and a relaxed query; if the similarity between the two queries is below the threshold, the relaxed query is not derived.

The invention has the beneficial effects that:

At present, scholars at home and abroad are actively studying relevance retrieval methods for intelligence knowledge graphs, but these methods face problems such as incomplete data and relationships, a lack of flexible querying, and an inability to reflect diversity in the query result set. Addressing the main problems of relevance querying over intelligence data, the invention provides a targeted solution based on RDF triple retrieval and proposes R-SPARQL, an extended relevance retrieval strategy for RDF knowledge graphs.

Drawings

FIG. 1 is a block diagram of the algorithm flow of the present invention;

FIG. 2 is a diagram illustrating semantic overlap based on graph structure according to the present invention;

FIG. 3 is a diagram illustrating an original query derived relaxed query process according to the present invention;

FIG. 4 is a comparison of average recall and average accuracy for different values of the relaxation degree α according to the present invention;

FIG. 5 is a comparison of average query time for different values of the relaxation degree α according to the present invention;

FIG. 6 is a comparison of average recall for different data set sizes in accordance with the present invention;

FIG. 7 is a comparison of average accuracy for different data set sizes in accordance with the present invention;

FIG. 8 is a comparison of average query time for different data set sizes in accordance with the present invention.

Detailed Description

The invention is explained in more detail below with reference to the drawings and embodiments.

Intelligence correlation analysis is a research focus in information retrieval and intelligence science, and the main task of the invention is to make intelligence retrieval results reflect multiple pieces of associated data. Knowledge-graph-based information retrieval is essentially a subgraph matching problem. A knowledge graph can be regarded as a set of triples in RDF form, and retrieval over this structure can be performed with the SPARQL query language, a W3C standard. However, this kind of query returns only the subgraphs that satisfy the conditions exactly, so the query result set reflects neither relevance nor diversity.
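
The exact-match behaviour of a plain triple-pattern query can be illustrated with a toy matcher. This is only a sketch in Python and does not use Jena or a real SPARQL engine; the function name `match` and the sample data are hypothetical.

```python
def match(pattern, triple):
    """Return variable bindings if triple matches pattern, else None.

    Terms starting with '?' are variables, as in SPARQL.
    """
    bindings = {}
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            if bindings.setdefault(p, t) != t:
                return None           # same variable bound to two values
        elif p != t:
            return None               # constant term does not match
    return bindings

graph = [
    ("paper1", "written_by", "alice"),
    ("paper2", "written_by", "bob"),
]
pattern = ("?paper", "written_by", "alice")
hits = [b for t in graph if (b := match(pattern, t)) is not None]
```

Only the triple whose constants match exactly is returned; the closely related triple about `bob` is ignored, which is the limitation the relaxed queries described later are meant to address.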

To address this problem, the invention studies RDF-based knowledge graph retrieval in depth and proposes a relevance retrieval method. The idea is to extend the structure of the RDF knowledge graph with keywords and weights. The keyword extension provides a fuzzy retrieval mode, while the weights of the triples and keywords are used to calculate the similarity of query triples and to rank the query result set by relevance. By calculating similarity, multiple relaxed queries are derived from the original query, so that more mutually associated retrieval results are obtained.

The main inventive points and technical effects are as follows:

At present, scholars at home and abroad are actively studying relevance retrieval methods for intelligence knowledge graphs, but these methods face problems such as incomplete data and relationships, a lack of flexible querying, and an inability to reflect diversity in the query result set. Addressing the main problems of relevance querying over intelligence data, the invention provides a targeted solution based on RDF triple retrieval and proposes R-SPARQL, an extended relevance retrieval strategy for RDF knowledge graphs. Its main points and contents are as follows:

(1) An extension of the triple pattern. A user may not know the exact name of the triple element to be queried, or may know only the name of an entity similar to it, so such queries often retrieve no data or return wrong results. An extended triple-pattern query method is therefore proposed to avoid the overly exact matching of traditional SPARQL queries. First, the invention extends the RDF knowledge graph with keyword sets, associating each triple with a set of keywords. The purpose of the keyword extension is to provide a fuzzy retrieval method that makes retrieval more flexible. Second, each keyword is assigned a weight according to its degree of association with the triple; this is called the triple-keyword weight. The purpose of the weight extension is to rank the retrieval result set by relevance and to reflect the diversity of the retrieval results. Thus, the weight assigned to each triple should represent not only the importance of the triple in the dataset but also the ability to distinguish between triples. Ranking the retrieval result set by weight exploits this distinguishing ability, so that the result set exhibits diversity.

Keywords can be obtained in various ways. For example, for scientific literature data, keyword sets can be built by extracting the subject terms of each document. Keyword selection also requires some preprocessing, such as removing stop words and duplicate items. Both the triple weight and the keyword weight are key factors in ranking the final result set. The RDF knowledge graph structure is extended with a set of keywords and weights. Triple weights can be computed in various ways depending on the nature of the RDF knowledge graph data source. Here, the triple weight and the keyword weight are calculated with inverse document frequency (IDF) and information entropy, in the following steps.

The first step: the IDF value of each triple is calculated as follows.

In the formula for IDF(e), e denotes an entity or relation in an extended triple, n denotes the number of triples in the RDF dataset, and f(e) denotes the number of times e appears in the RDF dataset. Let IDF(t) be the IDF value of the triple t = (s, p, o). Based on the structure of the triple, the IDF value of each triple is calculated by the following formula, where γ is an adjustment factor and 0 < γ < 1.

IDF(t) = γ × (IDF(s) + IDF(p) + IDF(o))    (2)
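
As a minimal sketch of formula (2), the following Python snippet computes a triple-level IDF from element-level IDF values. The element-level formula is not reproduced in the text, so the sketch assumes the conventional form IDF(e) = log(n / f(e)). That form, the function names, the sample data and the value γ = 0.5 are illustrative assumptions, not part of the described method.

```python
import math
from collections import Counter

def element_idf(triples):
    """Return IDF(e) for every entity/relation e, ASSUMING IDF(e) = log(n / f(e))."""
    n = len(triples)                                  # n: number of triples
    freq = Counter(e for t in triples for e in t)     # f(e): occurrences of e
    return {e: math.log(n / f) for e, f in freq.items()}

def triple_idf(triple, idf, gamma=0.5):
    """IDF(t) = gamma * (IDF(s) + IDF(p) + IDF(o)), with 0 < gamma < 1."""
    s, p, o = triple
    return gamma * (idf[s] + idf[p] + idf[o])

triples = [
    ("paper1", "written_by", "alice"),
    ("paper2", "written_by", "alice"),
    ("alice", "affiliated_with", "lab_a"),
    ("bob", "affiliated_with", "lab_a"),
]
idf = element_idf(triples)
scores = {t: triple_idf(t, idf) for t in triples}
```

Under this assumption, rare elements such as `paper1` contribute more to a triple's IDF than frequent ones such as `alice`.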

The second step: the information entropy H(e) of each element of a triple is calculated by the following formula.

Here e denotes an entity or relation in an extended triple, and k indexes the data classes. D₁(e) denotes the frequency of e in the dataset, D₂(e^(k)) denotes the frequency of e in the k-th data class, and |C| denotes the total number of data classes. For each triple t = (s, p, o), the information entropy of the triple is calculated as follows.

H(t) = η × (H(s) + H(p) + H(o))    (4)

Here η is an adjustment factor and 0 < η < 1. The weight w(t) of each triple is determined by its IDF value and its information entropy H, as shown in the following formula, where |N| denotes the total number of triples in the dataset.
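
Formula (4) can be sketched in the same way. Because the element-level entropy formula is not reproduced in the text, the snippet below assumes ordinary Shannon entropy over the class distribution D₂(e^(k)) / D₁(e). This choice, the function names and η = 0.5 are assumptions made for illustration only.

```python
import math

def element_entropy(class_counts):
    """Shannon entropy of e over the data classes (an ASSUMED form of H(e)).

    class_counts[k] plays the role of D2(e^(k)); their sum plays the role of D1(e).
    """
    total = sum(class_counts)
    if total == 0:
        return 0.0
    probs = [c / total for c in class_counts if c > 0]
    return -sum(p * math.log(p) for p in probs)

def triple_entropy(h_s, h_p, h_o, eta=0.5):
    """H(t) = eta * (H(s) + H(p) + H(o)), with 0 < eta < 1."""
    return eta * (h_s + h_p + h_o)

h_uniform = element_entropy([5, 5])    # evenly spread across two classes
h_single = element_entropy([10, 0])    # concentrated in a single class
h_t = triple_entropy(h_uniform, h_single, h_uniform)
```

An element concentrated in one class has zero entropy, while an element spread evenly over the classes has the maximum entropy for that number of classes.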

The third step: the keyword weight is calculated in a similar way to the triple weight. The IDF value and the information entropy of each keyword are calculated separately. First, the IDF of a triple-keyword is calculated by the following formula.

Let IDF(v_i | e) be the IDF value of keyword v_i in the keyword set v = (v₁, v₂, …, v_n) belonging to the triple t = (s, p, o), where e denotes the head entity s or the tail entity o of the triple t, excluding the relation between them. f(e, v_i) denotes the number of times the head entity s or the tail entity o of the triple co-occurs with the keyword v_i in the dataset. Let IDF(v_i | t) denote the IDF value of the triple-keyword, calculated by the following formula, which includes an adjustment factor.

The fourth step: the information entropy H(v_i | e) of a keyword is calculated by the following formula.

Here e denotes only the head and tail entities, excluding the relation between them. k indexes the data classes, D₁(v_i, e) denotes the number of co-occurrences of keyword v_i and e in the dataset, D₂(v_i, e^(k)) denotes the number of co-occurrences of keyword v_i and e in the k-th data class, and |C| denotes the total number of data classes. For a triple t and one of its keywords v_i, the information entropy is calculated as follows.

H(v_i | t) = θ × (H(v_i | s) + H(v_i | o))    (9)

Here θ is an adjustment factor and 0 < θ < 1. The weight w(v_i | t) of each triple-keyword is determined by its IDF value and its information entropy H, as shown in the following formula.

(2) A relevance query method based on triple similarity. The invention provides a relevance query strategy suited to the structure of the RDF knowledge graph. Query relaxation is a very important relevance query technique: it changes or deletes one or more hard conditions in the original query, or expands the original query conditions, to obtain a query with wider coverage, and finally yields a group of results associated with the results of the original query. The specific steps are as follows:

The first step: a measure based on triple similarity. From the perspective of the knowledge graph structure, an entity may have one or more edges connecting it to another entity. Such directly connected triples are often similar: they represent either multiple relationships of the same entity or different entities linked by the same relationship. In practice, however, two triples may have no directly connected edge between them and yet share the same entity label and edge label, and the similarity between such triples cannot be ignored. The concept of triple-based semantic overlap is proposed on this basis. In an ontology, semantic overlap denotes the number of concepts that share the same upper-level concept, which indicates how similar two concepts are. The semantic overlap of two RDF triples is defined as the number of triples with the same element; semantic overlap based on the graph structure is illustrated in FIG. 2.
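
One simple reading of semantic overlap between two triples is the number of positions (head, relation, tail) at which they carry the same element. The sketch below implements that reading only; the precise definition in the invention may differ, and the function name and sample triples are hypothetical.

```python
def semantic_overlap(t1, t2):
    """Count the positions (subject, predicate, object) where t1 and t2 agree."""
    return sum(1 for a, b in zip(t1, t2) if a == b)

t1 = ("alice", "affiliated_with", "lab_a")
t2 = ("bob", "affiliated_with", "lab_a")
t3 = ("paper1", "written_by", "alice")

overlap_12 = semantic_overlap(t1, t2)  # same predicate and same object
overlap_13 = semantic_overlap(t1, t3)  # no position carries the same element
```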

Another method for measuring triple similarity is based on the idea of graph embedding, which embeds the triples of the knowledge graph into a low-dimensional vector space so that the head entity, the tail entity and the relation of a triple are each represented by a vector. Let t = (s, p, o), with its entities and relation represented by the vectors s, p and o respectively. If the triple holds, an equation relating these three vectors holds.

If two triples t₁ = (s₁, p₁, o₁) and t₂ = (s₂, p₂, o₂) are highly similar, the equation is considered to still hold after elements are exchanged between them. For example, if t₁ and t₂ are similar, the equation also holds when an element of t₁ is replaced by the corresponding element of t₂, or vice versa.

Therefore, combining these two calculation methods, the similarity of triples t₁ and t₂ is calculated by the following formula.
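
The embedding equation itself is not reproduced in the text. The sketch below assumes a translation-style relation, s + p ≈ o, which is a common choice in knowledge graph embedding but is an assumption here, and checks how well the relation survives when the head entity of one triple is swapped for that of another. The embeddings are hand-picked, hypothetical values.

```python
import math

def translation_residual(s, p, o):
    """Euclidean norm of s + p - o; small values mean the ASSUMED equation fits."""
    return math.sqrt(sum((si + pi - oi) ** 2 for si, pi, oi in zip(s, p, o)))

# Hypothetical 2-dimensional embeddings, chosen by hand for illustration.
emb = {
    "alice": [1.0, 0.0],
    "bob": [1.1, 0.0],
    "affiliated_with": [0.0, 1.0],
    "lab_a": [1.0, 1.0],
}

# Original triple (alice, affiliated_with, lab_a).
fit_original = translation_residual(emb["alice"], emb["affiliated_with"], emb["lab_a"])
# Same relation and tail, with the head swapped for that of (bob, affiliated_with, lab_a).
fit_swapped = translation_residual(emb["bob"], emb["affiliated_with"], emb["lab_a"])
```

A small residual after the swap suggests that the two triples are close in the embedding space, which is the intuition behind using element exchange as a similarity signal.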

The second step: generating relaxed queries. For an original query, consider the i-th triple of the original query. First, the triples similar to each triple of the original query are calculated with the above formula and sorted by similarity to form a similar-triple queue. A relaxed query is then constructed by replacing a triple of the original query with an approximate triple selected from that triple's similar-triple queue.
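
The replacement step can be sketched as follows. The sketch assumes the similar-triple queues have already been computed and sorted; the function name `relaxed_queries`, the data layout and the sample patterns are hypothetical.

```python
def relaxed_queries(query, similar):
    """Derive relaxed queries by replacing one triple pattern at a time.

    query   -- list of triple patterns (tuples)
    similar -- dict mapping a triple pattern to its similar-triple queue,
               already sorted by descending similarity
    """
    derived = []
    for i, pattern in enumerate(query):
        for candidate in similar.get(pattern, []):
            relaxed = list(query)        # copy so the original stays intact
            relaxed[i] = candidate       # swap in the approximate triple
            derived.append(relaxed)
    return derived

query = [("?paper", "written_by", "alice"), ("alice", "affiliated_with", "lab_a")]
similar = {
    ("?paper", "written_by", "alice"): [("?paper", "written_by", "bob")],
    ("alice", "affiliated_with", "lab_a"): [("bob", "affiliated_with", "lab_a")],
}
derived = relaxed_queries(query, similar)
```

Each derived query differs from the original in exactly one triple pattern, so repeated application gives progressively looser queries.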

The third step: the similarity between each sub-query and the original query is calculated by the following formula, the sub-queries are sorted by similarity, and the execution order of the queries is determined from this ranking. The original query is executed first, followed by the relaxed queries in descending order of similarity.

The fourth step: a scheme for limiting the number of relaxed queries. An original query may derive multiple relaxed queries, and each relaxed query may in turn derive multiple sub relaxed queries; the number of derivations is called the relaxation degree α. The number of relaxed queries grows exponentially as α increases. Without a limit, the original query may derive too many relaxed queries, many of which retrieve nothing or produce meaningless results. It is therefore necessary to limit the number of relaxed queries. The invention sets a threshold on the semantic similarity between the original query and a relaxed query: if the similarity between the two queries is below the threshold, the relaxed query is not derived. FIG. 3 illustrates the process of deriving sub-queries from an original query.
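
The ordering and threshold rules of the third and fourth steps can be sketched together. The query-similarity formula is not reproduced in the text, so the sketch takes precomputed similarity scores as input; the function name, the scores and the threshold value are illustrative assumptions.

```python
def plan_execution(original, scored_relaxed, threshold):
    """Return the execution order: original first, then relaxed queries by similarity.

    scored_relaxed -- list of (query, similarity_to_original) pairs
    threshold      -- relaxed queries scoring below this value are not derived
    """
    kept = [(q, sim) for q, sim in scored_relaxed if sim >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [original] + [q for q, _ in kept]

order = plan_execution("Q", [("Q1", 0.4), ("Q2", 0.9), ("Q3", 0.7)], threshold=0.5)
```

Here `Q1` falls below the threshold and is dropped, so the cost of additional relaxed queries is paid only for candidates that stay close to the original intent.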

(3) A top-k selection scheme for the query result set. Triple-pattern queries often return too many results. Moreover, the result sets retrieved through relaxed queries are often too similar to one another to show diversity, which makes it hard for users to find the data they actually want. The invention provides a ranking model based on the extended RDF triple pattern, which ranks the result subgraphs by the relevance between each triple and its keywords and finally returns the most relevant top-k results. Let the original query derive m relaxed queries (Q₁, Q₂, …, Q_m). The i-th query triple of query Q_j is extended by a set of keywords. Executing query Q_j yields a set of result triples (q₁, q₂, …, q_s). When a query triple retrieves the result triple q_w, this is called the transition from the query triple to q_w. Assuming that the keywords in the keyword set are mutually independent, the transition from the query triple to q_w decomposes into the transitions from each independent keyword to q_w, and the transition probability can be calculated by the following formula. The result set is ranked by these probability values.
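
The ranking step can be sketched as follows. The transition-probability formula is not reproduced in the text; the sketch assumes that, under the stated independence of keywords, the score of a result triple is the product of its per-keyword weights. That product form, the function names and the sample weights are assumptions for illustration.

```python
import heapq
import math

def transition_score(keyword_weights):
    """Combine per-keyword weights into one score, ASSUMING independence (product)."""
    return math.prod(keyword_weights)

def top_k(results, k):
    """results maps a result triple to the weights of its matched keywords."""
    scored = ((transition_score(ws), triple) for triple, ws in results.items())
    return [triple for _, triple in heapq.nlargest(k, scored)]

results = {
    ("paper1", "written_by", "alice"): [0.9, 0.8],
    ("paper2", "written_by", "alice"): [0.5, 0.4],
    ("paper3", "written_by", "bob"): [0.9, 0.9],
}
best = top_k(results, k=2)
```

Using `heapq.nlargest` avoids sorting the whole result set when only the k best results are needed.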

The technical effects of the invention are as follows:

The invention provides an RDF knowledge graph representation based on keyword and weight extension, and extends the SPARQL query pattern by introducing additional variables. On top of the extended triple pattern, it provides a query relaxation method based on triple similarity: multiple relaxed queries are generated by calculating similarity to the original query, which broadens the user's query intent, and a group of the most relevant query results is finally returned. Lastly, the query result set is ranked by relevance using the triple weights, taking the diversity of the result set fully into account.

In simulation experiments, the traditional SPARQL query was compared with the relaxed query scheme of the invention on three performance indicators, namely average recall, average accuracy and average query time, and the results are presented as plots. First, recall, accuracy and query time under different relaxation degrees are compared in FIG. 4 and FIG. 5. When α = 0, only the original query is executed; as α increases, the number of relaxed queries grows exponentially, and recall and accuracy rise accordingly. This improvement, however, comes at the cost of longer query time. Recall, accuracy and query time for different data set sizes are compared in FIG. 6, FIG. 7 and FIG. 8.

A traditional SPARQL query returns only the result set that satisfies the conditions and does not reflect the degree of association between data. The R-SPARQL method of the invention expands the original query into multiple sub-queries, so it returns not only the result set that satisfies the conditions but also associated entities that fit the structure of the knowledge graph. The R-SPARQL method therefore clearly outperforms the traditional SPARQL query in average recall and average accuracy.

The intelligence correlation query scheme based on the extended RDF knowledge graph is implemented through the following steps.

Step one: data preprocessing. The main task is to parse the downloaded intelligence data TXT documents, extract information such as the title, author, keywords, institution and date of each piece of intelligence data, and remove stop words and duplicate items.
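
A minimal preprocessing sketch is shown below. The layout of the TXT documents is not described in the text, so the sketch assumes one `Field: value` pair per line and semicolon-separated keywords; the field names, the stop-word list and the sample document are hypothetical.

```python
STOP_WORDS = {"the", "of", "a", "and"}   # illustrative stop-word list

def parse_record(text):
    """Parse 'Field: value' lines into a dict; the line format is an assumption."""
    record = {}
    for line in text.splitlines():
        if ":" in line:
            field, value = line.split(":", 1)
            record[field.strip().lower()] = value.strip()
    return record

def clean_keywords(raw):
    """Split on ';', drop stop words and duplicates while keeping first-seen order."""
    seen, cleaned = set(), []
    for word in (w.strip().lower() for w in raw.split(";")):
        if word and word not in STOP_WORDS and word not in seen:
            seen.add(word)
            cleaned.append(word)
    return cleaned

doc = """Title: Knowledge graph retrieval
Author: Alice; Bob
Keywords: RDF; SPARQL; the; rdf; query relaxation
Organization: Lab A
Date: 2018-05-28"""
record = parse_record(doc)
keywords = clean_keywords(record["keywords"])
```

Lower-casing before deduplication treats `RDF` and `rdf` as the same keyword, which is one possible design choice rather than a requirement of the method.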

Step two: constructing the triple intelligence knowledge base. Relationships are constructed among the entities, such as the authorship relationship between a piece of intelligence and its author, the affiliation relationship between an author and an institution, and the cooperation relationship between different authors. The entities and relationships are combined into triple data and stored in a triple database. In the extended triple pattern, the invention uses the keywords of the intelligence data as the extension of the triple pattern.
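
The construction of triples from one parsed record can be sketched as follows. The relation names (`written_by`, `affiliated_with`, `cooperates_with`), the record layout and the simplification that all authors share one institution are assumptions made for illustration.

```python
from itertools import combinations

def build_triples(record):
    """Build (subject, relation, object) triples from one parsed record.

    The relation names and the record layout are illustrative assumptions.
    """
    triples = []
    for author in record["authors"]:
        triples.append((record["title"], "written_by", author))
        triples.append((author, "affiliated_with", record["organization"]))
    for a, b in combinations(record["authors"], 2):
        triples.append((a, "cooperates_with", b))
    return triples

record = {
    "title": "Knowledge graph retrieval",
    "authors": ["Alice", "Bob"],
    "organization": "Lab A",
    "keywords": ["rdf", "sparql"],
}
triples = build_triples(record)
# Keywords extend each triple rather than becoming triples themselves.
extended = [(t, record["keywords"]) for t in triples]
```

Pairing each triple with the record's keyword list mirrors the extended triple pattern, in which keywords accompany a triple instead of being stored as separate graph edges.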

Step three: the weight of each triple and of its keywords is calculated using the IDF and information entropy weighting method, and the weights are stored in a database.

Step four: the similar triples of each triple are calculated using the triple similarity formula and sorted by similarity.

Step five: the RDF triples are stored efficiently. Jena TDB is a component that supports storing, querying and updating RDF data, and ontology data in RDF format can be conveniently imported into it. The method uses the API provided by Jena TDB to implement basic SPARQL query operations, and uses the query relaxation model based on triple similarity to expand queries.

Step six: query samples are generated, the retrieval result set is ranked by the weights of the triples and keywords, and the top-k results are returned.

Claims

1. The intelligence correlation analysis method based on the knowledge graph is characterized by comprising the following steps:

step one, data preprocessing: parsing the downloaded intelligence data TXT documents, extracting the title, author, keywords, institution and date of each piece of intelligence data, and removing stop words and duplicate items;

step three, calculating the weight of each triple and of its keywords using an IDF and information entropy weighting method, and storing the weights in a database;

step five, storing the RDF triples efficiently; implementing basic SPARQL query operations through the API provided by Jena TDB, and expanding queries with a query method based on triple similarity;

step six, generating query samples, ranking the retrieval result set by the weights of the triples and keywords, and returning the top-k results;

the IDF and information entropy weighting method described in step three comprises the following steps:

step (I), the IDF value of each triple is calculated as follows;

IDF(t) = γ × (IDF(s) + IDF(p) + IDF(o))

step (II), the information entropy H(e) of each element of a triple is calculated by the following formula;

where e denotes an entity or relation in an extended triple, and k indexes the data classes; D₁(e) denotes the frequency of e in the dataset, D₂(e^(k)) denotes the frequency of e in the k-th data class, and |C| denotes the total number of data classes;

for each triple t = (s, p, o), the information entropy of the triple is calculated as follows;

H(t) = η × (H(s) + H(p) + H(o))

step (III), the keyword weight is calculated in a similar way to the triple weight: the IDF value and the information entropy of each keyword are calculated separately; first, the IDF of a triple-keyword is calculated by the following formula:

let IDF(v_i | e) be the IDF value of keyword v_i in the keyword set v = (v₁, v₂, …, v_n) belonging to the triple t = (s, p, o), where e denotes the head entity s or the tail entity o of the triple t, excluding the relation between them; f(e, v_i) denotes the number of times the head entity s or the tail entity o of the triple co-occurs with the keyword v_i in the dataset; let IDF(v_i | t) denote the IDF value of the triple-keyword, calculated by the following formula, which includes an adjustment factor;

step (IV), the information entropy H(v_i | e) of a keyword is calculated by the following formula:

where e denotes only the head and tail entities, excluding the relation between them; k indexes the data classes, D₁(v_i, e) denotes the number of co-occurrences of keyword v_i and e in the dataset, D₂(v_i, e^(k)) denotes the number of co-occurrences of keyword v_i and e in the k-th data class, and |C| denotes the total number of data classes; for a triple t and one of its keywords v_i, the information entropy is calculated as follows:

H(v_i | t) = θ × (H(v_i | s) + H(v_i | o))

the triple similarity measure uses either semantic overlap based on the graph structure or the idea of graph embedding;

the semantic overlap of two RDF triples is defined as the number of triples with the same element; from the perspective of the knowledge graph structure, an entity may have one or more edges connecting it to another entity, and such directly connected triples are similar; two triples with no directly connected edge between them are also similar if they have the same entity label and edge label;

the triples of the knowledge graph are embedded into a low-dimensional vector space, so that the head entity, the tail entity and the relation of a triple are each represented by a vector; let t = (s, p, o), with its entities and relation represented by the vectors s, p and o respectively; if the triple holds, the following equation holds

If two triples t₁ = (s₁, p₁, o₁) and t₂ = (s₂, p₂, o₂) are highly similar, the above equation is considered to still hold after elements are exchanged between them.

2. The intelligence correlation analysis method based on the knowledge graph as claimed in claim 1, wherein the query method based on triple similarity in step five comprises the following steps:

step (I), a measure based on triple similarity;

step (II), generating relaxed queries, starting from an original query and its constituent triples;

step (III), calculating the similarity between each sub-query and the original query by the following formula, sorting the sub-queries by similarity, and determining the execution order of the queries from this ranking; the original query is executed first, followed by the relaxed queries in descending order of similarity;

step (IV), setting a threshold on the semantic similarity between the original query and a relaxed query; if the similarity between the two queries is below the threshold, the relaxed query is not derived.