CN108846029A

CN108846029A - Information association analysis method based on a knowledge graph

Info

Publication number: CN108846029A
Application number: CN201810519637.XA
Authority: CN
Inventors: 王念滨; 陈锡瑞; 谢晓东; 王红滨; 周连科; 陈田田; 原明旗; 赵昱杰; 厉原通
Original assignee: Harbin Engineering University
Current assignee: BEIJING SAIXI TECHNOLOGY DEVELOPMENT CO LTD
Priority date: 2018-05-28
Filing date: 2018-05-28
Publication date: 2018-11-20
Anticipated expiration: 2038-05-28
Also published as: CN108846029B

Abstract

The invention discloses an information association analysis method based on a knowledge graph, belonging to the field of information relevance search over RDF knowledge graphs. The method includes: preprocessing the data by parsing the downloaded information data TXT documents; constructing a triple information knowledge base; computing the weight of each triple and of its keywords with an IDF and information entropy weighting method, and storing the weights in a database; computing the similar triples of each triple with a triple similarity formula and sorting them by similarity; storing the RDF triples efficiently; implementing basic SPARQL query operations with the API provided by Jena TDB and expanding queries with the query method based on triple similarity; and generating query samples, ranking the result set by the weights of the triples and keywords, and returning the top-k results.

Description

Information association analysis method based on a knowledge graph

Technical field

The present invention relates to an intelligence analysis scheme based on a knowledge graph and belongs to the field of information relevance search over RDF knowledge graphs.

Background art

In information science, information is reprocessed knowledge, and transferability and relevance are essential attributes of knowledge. These same properties create associations between different pieces of information. With the rapid development of Internet technology, modern information has become massive in scale, diverse, and real-time, so the degree of association between pieces of information keeps growing, which makes modern information analysis difficult. The rapid progress of artificial intelligence, represented by machine learning, makes intelligence analysis feasible in a big data setting. The object of intelligence analysis is the information resource, which can be decomposed into interrelated entities. Machine learning methods can mine the entities in information data and the relationships between them. However, this kind of analysis generally stays at the level of simple statistical and syntactic analysis and does not reach the semantic level of the content. It easily loses semantics, so the extracted information is neither accurate nor complete enough, and the associations between different pieces of information are often poorly reflected.

The knowledge graph (Knowledge Graph) is a technology that has emerged in recent years. Its theory and concepts combine ideas from knowledge bases, semantic networks, and ontologies, and it was proposed by Google in 2012. A knowledge graph is essentially a graph structure built on a complex semantic network. It makes full use of visualization techniques and can describe knowledge resources and their carriers as well as analyze and describe the connections between pieces of knowledge. It presents complex knowledge graphically: nodes represent knowledge and the edges between nodes represent relationships, so associations are shown more intuitively. This graph structure gives knowledge graphs high coverage of entities and concepts, diverse semantic relations, a friendly structure, and high quality, which has made them the dominant form of knowledge representation in the big data era. Knowledge graphs are also regarded as a development direction for the next generation of artificial intelligence. They were first applied in search engines: Google, Bing, Baidu, Sogou, and other vendors released their own knowledge graph products to improve retrieval quality. Knowledge graphs were subsequently applied widely in intelligent search, personalized recommendation, text classification, content distribution, and other fields, with good results. Because a knowledge graph can fully reflect the associations between entities, applying it to intelligence analysis also becomes possible.

Summary of the invention

The object of the present invention is achieved as follows:

An information association analysis method based on a knowledge graph, characterized by comprising the following steps:

Step 1: preprocess the data. Parse the downloaded information data TXT documents, extract the title, author, keywords, institution, date, and other fields of each information record, and remove stop words and duplicate items.

Step 2: construct the triple information knowledge base. Build relationships between these entities, including the authorship relationship between an information record and its author, the affiliation relationship between an author and an institution, and the cooperation relationship between different authors. Combine the entities and relationships into triple data and store them in a triple database. In the extended triple pattern, the keywords of the information data serve as the extension of the triple pattern.

Step 3: compute the weight of each triple and of its keywords with the IDF and information entropy weighting method, and store the weights in the database.

Step 4: compute the similar triples of each triple with the triple similarity formula, and sort them by similarity.

Step 5: store the RDF triples efficiently. Implement basic SPARQL query operations with the API provided by Jena TDB, and expand queries with the query method based on triple similarity.

Step 6: generate query samples, rank the result set by the weights of the triples and keywords, and return the top-k results.

The IDF and information entropy weighting method comprises the following steps:

Step 1: the IDF value of each triple is calculated as follows.

In the formula for $\mathrm{IDF}(e)$, $e$ denotes an entity or relation in the extended triple, $n$ denotes the number of triples in the RDF data set, and $F(e)$ denotes the number of times $e$ appears in the RDF data set. Let $\mathrm{IDF}(t)$ be the IDF value of the triple $t = (s, p, o)$. Following the structure of a triple, the IDF value of each triple is computed as follows, where $\gamma$ is an adjustment factor with $0 < \gamma < 1$:

$$\mathrm{IDF}(t) = \gamma \times (\mathrm{IDF}(s) + \mathrm{IDF}(p) + \mathrm{IDF}(o))$$

Step 2: the information entropy $H(e)$ for each triple is calculated as follows.

Here $e$ denotes an entity or relation in the extended triple and $k$ indexes the data categories. $D_1(e)$ denotes the frequency with which $e$ occurs in the data set, $D_2(e^{(k)})$ denotes the frequency with which $e$ occurs in the $k$-th data category, and $|C|$ denotes the total number of data categories. For each triple $t = (s, p, o)$, the information entropy of the triple is computed as follows:

$$H(t) = \eta \times (H(s) + H(p) + H(o))$$

where $\eta$ is an adjustment factor with $0 < \eta < 1$. The weight $w(t)$ of each triple is determined jointly by its IDF value and its information entropy $H$, where $|N|$ denotes the total number of triples in the data set.

Step 3: keyword weights are computed in the same way as triple weights, by calculating the IDF value and the information entropy of each keyword. The triple-keyword IDF is calculated first.

Let $\mathrm{IDF}(v_i \mid e)$ be the IDF value of keyword $v_i$ in the keyword group $v = (v_1, v_2, \ldots, v_n)$ belonging to the triple $t = (s, p, o)$, where $e$ denotes the head entity $s$ or the tail entity $o$ of $t$ and excludes the relation between them. $F(e, v_i)$ denotes the number of times the head entity $s$ or tail entity $o$ co-occurs with keyword $v_i$ in the data set. $\mathrm{IDF}(v_i \mid t)$ denotes the triple-keyword IDF value, and its formula uses a further adjustment factor.

Step 4: the information entropy $H(v_i \mid e)$ of a keyword is calculated as follows.

Here $e$ stands only for the head or tail entity and excludes the relation between them, and $k$ indexes the data categories. $D_1(v_i, e)$ denotes the number of times keyword $v_i$ and $e$ co-occur in the data set, $D_2(v_i, e^{(k)})$ denotes the number of times keyword $v_i$ and $e$ co-occur in the $k$-th data category, and $|C|$ denotes the total number of data categories. For a triple $t$ and one of its keywords $v_i$, the information entropy is computed as follows:

$$H(v_i \mid t) = \theta \times (H(v_i \mid s) + H(v_i \mid o))$$

where $\theta$ is an adjustment factor with $0 < \theta < 1$. The weight $w(v_i \mid t)$ of each triple-keyword pair is composed of the IDF value and the information entropy $H$.

The query method based on triple similarity comprises the following steps:

Step 1: a measure based on triple similarity.

Let $t_1$ and $t_2$ be triples; the similarity of $t_1$ and $t_2$ is calculated by the triple similarity formula.

Step 2: generate effective queries. For an original query, the similarity formula is first applied to each triple in the query to obtain a group of similar triples, which are sorted by similarity to form a similar-triple queue. Approximate triples are then taken from the similar-triple queue of each triple, and relaxed queries are constructed by replacing the corresponding triple in the original query.

Step 3: calculate the similarity between each sub-query and the original query, sort the queries by this similarity, and determine the execution order of the queries from the sorted order. The original query is executed first, and the effective queries are then executed in descending order of similarity.

Step 4: set a threshold on the semantic similarity between the original query and an effective query. If the similarity between the two queries is below this threshold, no further derivation takes place.

Beneficial effects of the present invention:

At present, scholars at home and abroad are actively studying relevance retrieval methods for information knowledge graphs, but these methods still face problems such as incomplete data and relationships, a lack of flexible queries, and an inability to reflect the diversity of query result sets. Building on RDF triple retrieval, the present invention addresses the core problem of relevance queries over information data with a targeted solution: R-SPARQL, a relevance retrieval strategy for extended RDF knowledge graphs.

Brief description of the drawings

Fig. 1 is a flow chart of the algorithm of the present invention;

Fig. 2 is a schematic diagram of the graph-based semantic overlap of the present invention;

Fig. 3 is a schematic diagram of how the original query derives effective queries in the present invention;

Fig. 4 compares the average recall and precision of the present invention under different values of the slackness α;

Fig. 5 compares the average query time of the present invention under different values of the slackness α;

Fig. 6 compares the average recall of the present invention under different data set sizes;

Fig. 7 compares the average recall of the present invention under different data set sizes;

Fig. 8 compares the average recall of the present invention under different data set sizes.

Specific embodiment

The present invention is described in more detail below with reference to the accompanying drawings and examples.

Information relevance analysis is a research focus in information retrieval and information science, and the main task of this invention is to make retrieval results reflect the characteristics of multiple associated data items. Information retrieval over a knowledge graph is essentially a subgraph matching problem. A knowledge graph can be regarded as a triple structure in RDF form, and retrieval over this structure can use the SPARQL language, which conforms to the W3C standard. However, this style of query can only return the subgraph structures that satisfy the conditions, and the result set reflects neither relevance nor diversity.

To address this problem, the present invention studies RDF-based knowledge graph retrieval in depth and proposes a relevance retrieval method. The idea is to extend the structure of the RDF knowledge graph with keywords and weights. The keyword extension provides a form of fuzzy retrieval, while the weights of triples and keywords are used to compute the similarity of query triples and to rank the result set by relevance. By computing similarity, the original query is derived into multiple relaxed queries, which yields more search results that are associated with one another.

Main content and technical effects:

At present, scholars at home and abroad are actively studying relevance retrieval methods for information knowledge graphs, but these methods still face problems such as incomplete data and relationships, a lack of flexible queries, and an inability to reflect the diversity of query result sets. Building on RDF triple retrieval, the present invention addresses the core problem of relevance queries over information data with a targeted solution: R-SPARQL, a relevance retrieval strategy for extended RDF knowledge graphs. Its main ideas and content are as follows:

(1) Extension scheme for the triple pattern. A user may not know the exact names of the triple elements to be queried, or may only know the name of an entity similar to the element sought. Such queries often find no data or return wrong results. To avoid the overly exact matching of traditional SPARQL queries, an extended triple pattern query method is proposed. First, the present invention extends the RDF knowledge graph with groups of keywords, and each triple is associated with its keyword group. The purpose of the keyword extension is to provide a form of fuzzy retrieval and thereby make retrieval more flexible. Second, each keyword is assigned a weight according to its degree of association with the triple, called the triple-keyword weight. The purpose of the weight extension is to rank the result set by relevance while reflecting the diversity of the search results. The weight assigned to each triple therefore represents not only the importance of the triple in the data set but also its ability to discriminate between triples. Ranking the result set by weight exploits this discriminating ability, so the result set reflects diversity.

Keywords can be obtained in many ways. For scientific literature data, for example, a keyword group can be built by extracting the subject terms of a document. Selecting keywords also requires some preprocessing, such as removing stop words and duplicate items. Both the triple weights and the keyword weights are key factors in ranking the final result set. The RDF knowledge graph structure is extended with a group of keywords and weights. Triple weights can be computed in several ways depending on the properties of the RDF knowledge graph data source; the present invention uses the inverse document frequency (IDF) algorithm and information entropy to compute the weights of triples and keywords, in the following steps.

First step: the IDF value of each triple is calculated as follows.

In the formula for $\mathrm{IDF}(e)$, $e$ denotes an entity or relation in the extended triple, $n$ denotes the number of triples in the RDF data set, and $F(e)$ denotes the number of times $e$ appears in the RDF data set. Let $\mathrm{IDF}(t)$ be the IDF value of the triple $t = (s, p, o)$. Following the structure of a triple, the IDF value of each triple is computed as follows, where $\gamma$ is an adjustment factor with $0 < \gamma < 1$.

$$\mathrm{IDF}(t) = \gamma \times (\mathrm{IDF}(s) + \mathrm{IDF}(p) + \mathrm{IDF}(o)) \tag{2}$$
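The IDF step can be illustrated with a small Python sketch. The formula for IDF(e) is not reproduced in this text, so the sketch assumes the conventional form IDF(e) = log(n / F(e)) and counts an element once per triple in which it appears. That form, the sample triples, and the value gamma = 0.5 are assumptions for illustration; only the combination in formula (2) comes from the description.

```python
import math
from collections import Counter

def element_idf(triples):
    """Assumed form: IDF(e) = log(n / F(e)) for every subject, predicate and object."""
    n = len(triples)
    freq = Counter()
    for s, p, o in triples:
        # Count each element once per triple it appears in.
        freq.update({s, p, o})
    return {e: math.log(n / f) for e, f in freq.items()}

def triple_idf(triple, idf, gamma=0.5):
    """Formula (2): IDF(t) = gamma * (IDF(s) + IDF(p) + IDF(o)), with 0 < gamma < 1."""
    s, p, o = triple
    return gamma * (idf[s] + idf[p] + idf[o])

# Hypothetical sample data.
triples = [
    ("paper1", "writtenBy", "alice"),
    ("paper2", "writtenBy", "alice"),
    ("paper3", "writtenBy", "bob"),
    ("alice", "worksAt", "heu"),
]
idf = element_idf(triples)
print(triple_idf(triples[0], idf))
```

Under this assumed form, frequent elements such as the shared predicate contribute little, while a rare subject such as `paper1` dominates the triple's IDF value.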

Second step: the information entropy $H(e)$ for each triple is calculated as follows.

Here $e$ denotes an entity or relation in the extended triple and $k$ indexes the data categories. $D_1(e)$ denotes the frequency with which $e$ occurs in the data set, $D_2(e^{(k)})$ denotes the frequency with which $e$ occurs in the $k$-th data category, and $|C|$ denotes the total number of data categories. For each triple $t = (s, p, o)$, the information entropy of the triple is computed as follows.

$$H(t) = \eta \times (H(s) + H(p) + H(o)) \tag{4}$$

where $\eta$ is an adjustment factor with $0 < \eta < 1$. The weight $w(t)$ of each triple is determined jointly by its IDF value and its information entropy $H$, where $|N|$ denotes the total number of triples in the data set.
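The entropy part of this step can be sketched as follows. The formula for H(e) is not reproduced in this text, so the sketch assumes the usual Shannon form over categories, with probability D2(e^(k)) / D1(e) for category k. That form, the category labels, and eta = 0.5 are assumptions; formula (4) is taken from the description. The weight w(t) is left out because its formula is not available here.

```python
import math
from collections import Counter, defaultdict

def element_entropy(categorized_triples):
    """Assumed form: H(e) = -sum_k p_k * log(p_k), with p_k = D2(e^(k)) / D1(e)."""
    d1 = Counter()              # D1(e): occurrences of e in the whole data set
    d2 = defaultdict(Counter)   # D2(e^(k)): occurrences of e in category k
    for category, (s, p, o) in categorized_triples:
        for e in (s, p, o):
            d1[e] += 1
            d2[e][category] += 1
    entropy = {}
    for e, total in d1.items():
        h = 0.0
        for count in d2[e].values():
            prob = count / total
            h -= prob * math.log(prob)
        entropy[e] = h
    return entropy

def triple_entropy(triple, entropy, eta=0.5):
    """Formula (4): H(t) = eta * (H(s) + H(p) + H(o)), with 0 < eta < 1."""
    s, p, o = triple
    return eta * (entropy[s] + entropy[p] + entropy[o])

# Hypothetical sample data: (category, triple).
data = [
    ("cs", ("paper1", "writtenBy", "alice")),
    ("math", ("paper2", "writtenBy", "alice")),
    ("cs", ("paper3", "writtenBy", "bob")),
]
entropy = element_entropy(data)
print(triple_entropy(data[0][1], entropy))
```

An element confined to one category gets entropy 0, while an element spread evenly over categories gets the largest value.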

Third step: keyword weights are computed in the same way as triple weights, by calculating the IDF value and the information entropy of each keyword. The triple-keyword IDF is calculated first.

Let $\mathrm{IDF}(v_i \mid e)$ be the IDF value of keyword $v_i$ in the keyword group $v = (v_1, v_2, \ldots, v_n)$ belonging to the triple $t = (s, p, o)$, where $e$ denotes the head entity $s$ or the tail entity $o$ of $t$ and excludes the relation between them. $F(e, v_i)$ denotes the number of times the head entity $s$ or tail entity $o$ co-occurs with keyword $v_i$ in the data set. $\mathrm{IDF}(v_i \mid t)$ denotes the triple-keyword IDF value, and its formula uses a further adjustment factor.

Fourth step: the information entropy $H(v_i \mid e)$ of a keyword is calculated as follows.

Here $e$ stands only for the head or tail entity and excludes the relation between them, and $k$ indexes the data categories. $D_1(v_i, e)$ denotes the number of times keyword $v_i$ and $e$ co-occur in the data set, $D_2(v_i, e^{(k)})$ denotes the number of times keyword $v_i$ and $e$ co-occur in the $k$-th data category, and $|C|$ denotes the total number of data categories. For a triple $t$ and one of its keywords $v_i$, the information entropy is computed as follows.

$$H(v_i \mid t) = \theta \times (H(v_i \mid s) + H(v_i \mid o)) \tag{9}$$

where $\theta$ is an adjustment factor with $0 < \theta < 1$. The weight $w(v_i \mid t)$ of each triple-keyword pair is composed of the IDF value and the information entropy $H$.
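The keyword entropy can be sketched in the same spirit. The formula for H(v_i | e) is not reproduced in this text, so the sketch assumes a Shannon entropy over categories built from the co-occurrence counts D1(v_i, e) and D2(v_i, e^(k)). That form, the record layout, and theta = 0.5 are assumptions; the combination in formula (9) comes from the description.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_entropy(records):
    """records: iterable of (category, (s, p, o), keywords).

    Assumed form: H(v|e) = -sum_k p_k * log(p_k), p_k = D2(v, e^(k)) / D1(v, e),
    where e is only the head or tail entity of the triple.
    """
    d1 = Counter()
    d2 = defaultdict(Counter)
    for category, (s, _p, o), keywords in records:
        for e in {s, o}:
            for v in set(keywords):
                d1[(v, e)] += 1
                d2[(v, e)][category] += 1
    result = {}
    for pair, total in d1.items():
        h = 0.0
        for count in d2[pair].values():
            prob = count / total
            h -= prob * math.log(prob)
        result[pair] = h
    return result

def keyword_entropy(keyword, triple, pair_entropy, theta=0.5):
    """Formula (9): H(v|t) = theta * (H(v|s) + H(v|o)), with 0 < theta < 1."""
    s, _p, o = triple
    return theta * (pair_entropy.get((keyword, s), 0.0) + pair_entropy.get((keyword, o), 0.0))

# Hypothetical sample data.
records = [
    ("cs", ("paper1", "writtenBy", "alice"), ["rdf", "sparql"]),
    ("math", ("paper2", "writtenBy", "alice"), ["rdf"]),
]
pair_entropy = cooccurrence_entropy(records)
print(keyword_entropy("rdf", records[0][1], pair_entropy))
```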

(2) Association query method based on triple similarity. The invention proposes an association query strategy suited to the RDF knowledge graph structure. Query relaxation is an important relevance query technique: it changes or deletes one or more fixed conditions in the original query, or extends the original query conditions, to obtain a query with wider coverage, and finally returns a group of results associated with the original query results. The specific steps are as follows:

First step: a measure based on triple similarity. In a knowledge graph, an entity may be connected to another entity by one or more edges, and directly connected triples are often similar: they express either several relationships of the same entity or different entities linked by the same relationship. However, two triples may not be directly connected by an edge yet still share entity labels and edge labels, and the similarity between such triples cannot be ignored either. For this reason, the concept of semantic overlap between triples is introduced. For ontology concepts, semantic overlap is the number of superordinate concepts that two concepts share, which indicates how similar the two concepts are. The semantic overlap of two RDF triples is defined as the number of identical elements in the two triples. The graph-based semantic overlap is shown in Fig. 2.
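The semantic overlap of two triples can be sketched in a few lines of Python. Comparing elements position by position (subject with subject, predicate with predicate, object with object) and dividing by 3 to normalize are assumptions made for this illustration; the description only states that the overlap is the number of identical elements.

```python
def semantic_overlap(t1, t2):
    """Number of positions (subject, predicate, object) holding the same element."""
    return sum(1 for a, b in zip(t1, t2) if a == b)

def normalized_overlap(t1, t2):
    """Overlap scaled to [0, 1]; the scaling is an illustrative assumption."""
    return semantic_overlap(t1, t2) / 3

# Hypothetical sample triples.
t1 = ("paper1", "writtenBy", "alice")
t2 = ("paper2", "writtenBy", "alice")
t3 = ("bob", "worksAt", "heu")
print(semantic_overlap(t1, t2), semantic_overlap(t1, t3))
```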

A second way to measure triple similarity is based on graph embedding: the triples of the knowledge graph are embedded into a low-dimensional vector space, in which the head entity, the tail entity, and the relation of a triple are each represented by a vector. For a triple $t = (s, p, o)$, an equation relates the vectors of its entities and relation whenever the triple holds. If two triples $t_1 = (s_1, p_1, o_1)$ and $t_2 = (s_2, p_2, o_2)$ are highly similar, the equation is assumed to still hold after elements are exchanged between them. For example, if $t_1$ and $t_2$ are similar to each other, the equation holds when an element of one triple is substituted into the other. The similarity of $t_1$ and $t_2$ is therefore calculated by a formula that combines both measures.
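The embedding-based idea can be illustrated with a rough sketch. The equation relating the vectors is not reproduced in this text; the sketch assumes a translation-style relation in which the head vector plus the relation vector approximately equals the tail vector. That relation, the tail-exchange test, and the mapping of the residual to a score in (0, 1] are all assumptions, not the formula of the invention.

```python
import math

def residual(s, p, o):
    """Euclidean norm of s + p - o; near 0 when the assumed translation relation holds."""
    return math.sqrt(sum((a + b - c) ** 2 for a, b, c in zip(s, p, o)))

def swap_similarity(t1, t2):
    """Exchange the tail entities of two embedded triples and score how well both still hold."""
    (s1, p1, o1), (s2, p2, o2) = t1, t2
    worst = max(residual(s1, p1, o2), residual(s2, p2, o1))
    return 1.0 / (1.0 + worst)

# Hypothetical 2-dimensional embeddings: (head, relation, tail).
ta = ([0.0, 0.0], [1.0, 0.0], [1.0, 0.0])
tb = ([0.0, 0.0], [1.0, 0.0], [1.0, 0.0])
tc = ([0.0, 0.0], [0.0, 1.0], [0.0, 1.0])
print(swap_similarity(ta, tb), swap_similarity(ta, tc))
```

Triples whose elements are interchangeable keep a small residual after the exchange and therefore score close to 1.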

Second step: generate effective queries. For an original query, the similarity formula is first applied to each triple in the query to obtain a group of similar triples, which are sorted by similarity to form a similar-triple queue. Approximate triples are then taken from the similar-triple queue of each triple, and relaxed queries are constructed by replacing the corresponding triple in the original query.
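The replacement step can be sketched as follows. The representation of a query as a tuple of triple patterns, the `similar` lookup table, and the choice to replace one triple at a time are assumptions made for this illustration.

```python
def relaxed_queries(query, similar):
    """Build relaxed queries by swapping one triple for one of its similar triples.

    query:   tuple of triple patterns
    similar: dict mapping a triple to a list of (candidate_triple, similarity)
    Returns (relaxed_query, similarity) pairs, most similar replacement first.
    """
    results = []
    for i, triple in enumerate(query):
        for candidate, sim in similar.get(triple, []):
            relaxed = query[:i] + (candidate,) + query[i + 1:]
            results.append((relaxed, sim))
    results.sort(key=lambda item: item[1], reverse=True)
    return results

# Hypothetical query and similar-triple queues.
q1 = ("?x", "writtenBy", "alice")
q2 = ("alice", "worksAt", "heu")
similar = {
    q1: [(("?x", "writtenBy", "bob"), 0.9), (("?x", "editedBy", "alice"), 0.6)],
    q2: [(("alice", "worksAt", "hit"), 0.8)],
}
for relaxed, sim in relaxed_queries((q1, q2), similar):
    print(sim, relaxed)
```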

Third step: calculate the similarity between each sub-query and the original query, sort the queries by this similarity, and determine the execution order of the queries from the sorted order. The original query is executed first, and the effective queries are then executed in descending order of similarity.
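The ordering step can be sketched as below. The query similarity formula is not reproduced in this text, so the sketch assumes it is the average of the per-triple similarities over aligned positions, with a simple element-overlap measure standing in for the triple similarity. Both choices are assumptions.

```python
def overlap_similarity(t1, t2):
    """Stand-in triple similarity: fraction of identical positions (an assumption)."""
    return sum(1 for a, b in zip(t1, t2) if a == b) / 3

def query_similarity(original, relaxed, triple_sim=overlap_similarity):
    """Assumed form: mean triple similarity over aligned triple patterns."""
    return sum(triple_sim(a, b) for a, b in zip(original, relaxed)) / len(original)

def execution_order(original, candidates, triple_sim=overlap_similarity):
    """Original query first, then candidates by descending similarity to it."""
    ranked = sorted(
        candidates,
        key=lambda q: query_similarity(original, q, triple_sim),
        reverse=True,
    )
    return [original] + ranked

# Hypothetical queries.
original = (("?x", "writtenBy", "alice"), ("alice", "worksAt", "heu"))
cand_a = (("?x", "writtenBy", "bob"), ("alice", "worksAt", "heu"))
cand_b = (("?x", "editedBy", "bob"), ("alice", "worksAt", "heu"))
order = execution_order(original, [cand_b, cand_a])
print(order)
```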

Fourth step: limit the number of effective queries. One original query can derive multiple effective queries, and each effective query can in turn derive multiple sub-queries; the number of derivation rounds is called the slackness $\alpha$. The number of effective queries grows exponentially with $\alpha$. Without a limit, the original query may derive too many effective queries, and many of them may return no results or meaningless results. The amount of relaxation therefore needs to be limited. The present invention sets a threshold on the semantic similarity between the original query and an effective query: if the similarity between the two queries is below this threshold, no further derivation takes place. Fig. 3 illustrates how sub-queries are derived from the original query.
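The derivation limit can be sketched as a breadth-first expansion that runs for at most alpha rounds and prunes any candidate whose similarity to the original query falls below the threshold. The `expand` and `similarity` callbacks and the integer stand-ins for queries are hypothetical and serve only to make the control flow visible.

```python
def derive(original, expand, similarity, alpha, threshold):
    """Derive effective queries for at most `alpha` rounds.

    expand(query)          -> iterable of queries derived from `query`
    similarity(orig, cand) -> similarity between the original query and a candidate
    A candidate below `threshold` is dropped and is not expanded further.
    """
    frontier = [original]
    seen = {original}
    derived = []
    for _ in range(alpha):
        next_frontier = []
        for query in frontier:
            for child in expand(query):
                if child in seen:
                    continue
                if similarity(original, child) < threshold:
                    continue  # too far from the original query: stop deriving here
                seen.add(child)
                derived.append(child)
                next_frontier.append(child)
        frontier = next_frontier
    return derived

# Toy stand-ins: queries are integers, and similarity decays with distance from the original.
def toy_expand(q):
    return [q + 1, q + 2]

def toy_similarity(orig, cand):
    return 1.0 / (1.0 + abs(cand - orig))

print(derive(0, toy_expand, toy_similarity, alpha=2, threshold=0.3))
```

With alpha = 0 only the original query would run, which matches the description of the unrelaxed case.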

(3) Top-k selection scheme for the query result set. A triple pattern query usually returns too many results. Moreover, in the result set retrieved by effective queries, the top-ranked results are often too similar to one another and show no diversity, which makes it hard for the user to find the data actually wanted. Based on the extended RDF triple pattern, the present invention proposes a ranking model that ranks the result subgraphs by the relevance between each triple and its keywords and finally returns the most relevant top-k results. Suppose the original query derives $m$ effective queries $(Q_1, Q_2, \ldots, Q_m)$, and each query triple in a query $Q_j$ is extended by a group of keywords. Executing $Q_j$ yields a set of result triples $(q_1, q_2, \ldots, q_s)$. When a query triple retrieves the result triple $q_w$, this is called a transfer from the query triple to $q_w$. Assuming the keywords in a keyword group are mutually independent, the transition probability from the query triple to $q_w$ is expanded into the transition probabilities from each individual keyword to $q_w$, from which the overall transition probability is calculated. The result set is ranked by this probability value.
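The final selection can be sketched with a heap-based top-k. The transition probability formula is not reproduced in this text; because the keywords are stated to be independent, the sketch assumes the overall score is the product of the per-keyword transition probabilities. That product form and the sample probabilities are assumptions.

```python
import heapq
import math

def transition_score(keyword_probs):
    """Assumed form: product of independent per-keyword transition probabilities."""
    return math.prod(keyword_probs)

def top_k(results, k):
    """results: dict mapping a result triple (or its id) to its keyword probabilities."""
    return heapq.nlargest(k, results, key=lambda r: transition_score(results[r]))

# Hypothetical per-keyword transition probabilities for three result triples.
results = {
    "q1": [0.9, 0.8],
    "q2": [0.5, 0.5],
    "q3": [0.9, 0.9],
}
print(top_k(results, 2))
```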

The technical effects of the invention are as follows:

The invention proposes an RDF knowledge graph representation extended with keywords and weights, which extends the SPARQL query pattern by introducing additional variables. Based on the extended triple pattern, it then proposes a query relaxation method driven by triple similarity: by computing similarity to the original query, it generates multiple effective queries, which broadens the user's query intent and finally returns a group of the most relevant query results. Finally, the query result set is ranked by relevance using the triple weights, which takes the diversity of the result set fully into account.

Simulation experiments compare the traditional SPARQL query with the effective query scheme proposed by the present invention, and the results are plotted for three performance indicators: average recall, average precision, and average query time. First, recall, precision, and query time are compared under different slackness values, as shown in Fig. 4 and Fig. 5. When α = 0, only the original query is executed. As α increases, the number of effective queries grows exponentially, so recall and precision also increase, but this gain comes at the cost of longer query time. Recall, precision, and query time are then compared under different data set sizes, as shown in Fig. 6, Fig. 7, and Fig. 8.

A traditional SPARQL query can only return the result set that satisfies the query conditions and does not reflect the degree of association between data. The proposed R-SPARQL method expands the original query into multiple sub-queries, so it returns not only the results that satisfy the conditions but also associated entities that fit the knowledge graph structure. In average recall and average precision, the R-SPARQL method therefore clearly outperforms the traditional SPARQL query.

The information relevance query scheme for extended RDF knowledge graphs proposed by the present invention is implemented through the following steps.

Step 1: preprocess the data. The main task is to parse the downloaded information data TXT documents, extract the title, author, keywords, institution, date, and other fields of each information record, and remove stop words and duplicate items.
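The preprocessing step can be sketched as follows. The layout of the TXT documents is not specified in the description, so the `Field: value` lines, blank-line record separator, semicolon-separated keywords, and the small stop-word list are all hypothetical.

```python
# Hypothetical stop-word list; a real list would be much larger.
STOP_WORDS = {"the", "of", "and", "a", "an", "in", "on", "for"}

def parse_records(text):
    """Split a TXT document into records of lower-cased field names and raw values."""
    records = []
    for block in text.strip().split("\n\n"):
        record = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                record[key.strip().lower()] = value.strip()
        if record:
            records.append(record)
    return records

def clean_keywords(raw):
    """Lower-case keywords, drop stop words and duplicates, keep first-seen order."""
    seen, cleaned = set(), []
    for word in raw.split(";"):
        word = word.strip().lower()
        if word and word not in STOP_WORDS and word not in seen:
            seen.add(word)
            cleaned.append(word)
    return cleaned

sample = (
    "Title: RDF search\nAuthor: Alice; Bob\nKeywords: RDF; the; SPARQL; rdf\n\n"
    "Title: Graphs\nAuthor: Bob\nKeywords: graph"
)
records = parse_records(sample)
print(clean_keywords(records[0]["keywords"]))
```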

Step 2: construct the triple information knowledge base. Build relationships between these entities, such as the authorship relationship between an information record and its author, the affiliation relationship between an author and an institution, and the cooperation relationship between different authors. Combine the entities and relationships into triple data and store them in a triple database. In the extended triple pattern, the present invention uses the keywords of the information data as the extension of the triple pattern.
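Triple construction can be sketched as below. The record layout and the predicate names `writtenBy`, `affiliatedWith`, and `cooperatesWith` are hypothetical labels for the authorship, affiliation, and cooperation relationships named in the description, and attaching the keyword group to every triple of a record is an assumption about how the extension is stored.

```python
from itertools import combinations

def build_triples(record):
    """Turn one parsed record into extended triples: ((s, p, o), keywords).

    record: dict with 'title' (str), 'authors' (list of str),
            'institutions' (dict author -> institution), 'keywords' (list of str)
    """
    triples = []
    for author in record["authors"]:
        triples.append((record["title"], "writtenBy", author))
        institution = record["institutions"].get(author)
        if institution:
            triples.append((author, "affiliatedWith", institution))
    # One cooperation triple per unordered pair of co-authors.
    for a, b in combinations(sorted(record["authors"]), 2):
        triples.append((a, "cooperatesWith", b))
    keywords = tuple(record["keywords"])
    return [(t, keywords) for t in triples]

# Hypothetical record.
record = {
    "title": "RDF search",
    "authors": ["bob", "alice"],
    "institutions": {"alice": "heu"},
    "keywords": ["rdf", "sparql"],
}
extended = build_triples(record)
for triple, keywords in extended:
    print(triple, keywords)
```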

Step 3: compute the weight of each triple and of its keywords with the IDF and information entropy weighting method, and store the weights in the database.

Step 4: compute the similar triples of each triple with the triple similarity formula, and sort them by similarity.

Step 5: store the RDF triples efficiently. Jena TDB is a component that supports storing, querying, and updating RDF data, and ontology data in RDF format can be imported into Jena TDB easily. The present invention uses the API provided by Jena TDB to implement basic SPARQL query operations and uses the proposed query relaxation model based on triple similarity to expand queries.

Step 6: generate query samples, rank the result set by the weights of the triples and keywords, and return the top-k results.

Claims

1. An information association analysis method based on a knowledge graph, characterized by comprising the following steps:

Step 1: preprocess the data; parse the downloaded information data TXT documents, extract the title, author, keywords, institution, date, and other fields of each information record, and remove stop words and duplicate items;

Step 2: construct the triple information knowledge base; build relationships between these entities, including the authorship relationship between an information record and its author, the affiliation relationship between an author and an institution, and the cooperation relationship between different authors; combine the entities and relationships into triple data and store them in a triple database; in the extended triple pattern, the keywords of the information data serve as the extension of the triple pattern;

Step 3: compute the weight of each triple and of its keywords with the IDF and information entropy weighting method, and store the weights in the database;

Step 4: compute the similar triples of each triple with the triple similarity formula, and sort them by similarity;

Step 5: store the RDF triples efficiently; implement basic SPARQL query operations with the API provided by Jena TDB, and expand queries with the query method based on triple similarity;

Step 6: generate query samples, rank the result set by the weights of the triples and keywords, and return the top-k results.

2. The information association analysis method based on a knowledge graph according to claim 1, characterized in that the IDF and information entropy weighting method in step 3 comprises the following steps:

In the formula for $\mathrm{IDF}(e)$, $e$ denotes an entity or relation in the extended triple, $n$ denotes the number of triples in the RDF data set, and $F(e)$ denotes the number of times $e$ appears in the RDF data set; let $\mathrm{IDF}(t)$ be the IDF value of the triple $t = (s, p, o)$; following the structure of a triple, the IDF value of each triple is computed as follows, where $\gamma$ is an adjustment factor with $0 < \gamma < 1$;

$$\mathrm{IDF}(t) = \gamma \times (\mathrm{IDF}(s) + \mathrm{IDF}(p) + \mathrm{IDF}(o))$$

In the formula for the information entropy $H(e)$, $e$ denotes an entity or relation in the extended triple and $k$ indexes the data categories; $D_1(e)$ denotes the frequency with which $e$ occurs in the data set, $D_2(e^{(k)})$ denotes the frequency with which $e$ occurs in the $k$-th data category, and $|C|$ denotes the total number of data categories;

For each triple $t = (s, p, o)$, the information entropy of the triple is computed as follows;

$$H(t) = \eta \times (H(s) + H(p) + H(o))$$

where $\eta$ is an adjustment factor with $0 < \eta < 1$; the weight $w(t)$ of each triple is determined jointly by its IDF value and its information entropy $H$, where $|N|$ denotes the total number of triples in the data set;

Step 3: keyword weights are computed in the same way as triple weights, by calculating the IDF value and the information entropy of each keyword; the triple-keyword IDF is calculated first:

Let $\mathrm{IDF}(v_i \mid e)$ be the IDF value of keyword $v_i$ in the keyword group $v = (v_1, v_2, \ldots, v_n)$ belonging to the triple $t = (s, p, o)$, where $e$ denotes the head entity $s$ or the tail entity $o$ of $t$ and excludes the relation between them; $F(e, v_i)$ denotes the number of times the head entity $s$ or tail entity $o$ co-occurs with keyword $v_i$ in the data set; $\mathrm{IDF}(v_i \mid t)$ denotes the triple-keyword IDF value, and its formula uses a further adjustment factor;

In the formula for the keyword information entropy $H(v_i \mid e)$, $e$ stands only for the head or tail entity and excludes the relation between them; $k$ indexes the data categories, $D_1(v_i, e)$ denotes the number of times keyword $v_i$ and $e$ co-occur in the data set, $D_2(v_i, e^{(k)})$ denotes the number of times keyword $v_i$ and $e$ co-occur in the $k$-th data category, and $|C|$ denotes the total number of data categories; for a triple $t$ and one of its keywords $v_i$, the information entropy is computed as follows:

$$H(v_i \mid t) = \theta \times (H(v_i \mid s) + H(v_i \mid o))$$

where $\theta$ is an adjustment factor with $0 < \theta < 1$; the weight $w(v_i \mid t)$ of each triple-keyword pair is composed of the IDF value and the information entropy $H$.

3. The information association analysis method based on a knowledge graph according to claim 1, characterized in that the query method based on triple similarity in step 5 comprises the following steps:

Step 1: a measure based on triple similarity;

Step 2: generate effective queries; for an original query, the similarity formula is first applied to each triple in the query to obtain a group of similar triples, which are sorted by similarity to form a similar-triple queue; approximate triples are then taken from the similar-triple queue of each triple, and relaxed queries are constructed by replacing the corresponding triple in the original query;

Step 3: calculate the similarity between each sub-query and the original query, sort the queries by this similarity, and determine the execution order of the queries from the sorted order; the original query is executed first, and the effective queries are then executed in descending order of similarity;

Step 4: set a threshold on the semantic similarity between the original query and an effective query; if the similarity between the two queries is below this threshold, no further derivation takes place.