CN108846029A - Information association analysis method based on knowledge graph

Publication number
CN108846029A
Authority
CN
China
Prior art keywords
triple
keyword
idf
similarity
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810519637.XA
Other languages
Chinese (zh)
Other versions
CN108846029B (en)
Inventor
王念滨
陈锡瑞
谢晓东
王红滨
周连科
陈田田
原明旗
赵昱杰
厉原通
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING SAIXI TECHNOLOGY DEVELOPMENT CO LTD
Original Assignee
Harbin Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Engineering University
Priority to CN201810519637.XA
Publication of CN108846029A
Application granted
Publication of CN108846029B
Legal status: Expired - Fee Related
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295: Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an information association analysis method based on a knowledge graph, belonging to the field of information correlation search over RDF knowledge graphs. The method comprises: preprocessing the data by parsing the downloaded information data TXT documents; constructing a triple-based information knowledge base; calculating the weight of each triple and of its keywords using IDF and information-entropy weighting, and storing the weights in a database; computing the similar triples of each triple with the triple similarity formula and sorting them by similarity; storing the RDF triples effectively; implementing basic SPARQL query operations with the API provided by Jena TDB and expanding queries with the query method based on triple similarity; and generating query samples, ranking the search result set by the weights of the triples and keywords, and returning the top-k results.

Description

Information association analysis method based on knowledge graph
Technical field
The present invention relates to an intelligence analysis scheme based on a knowledge graph and belongs to the field of information correlation search over RDF knowledge graphs.
Background art
In information science, information is knowledge that has been reprocessed, and transitivity and correlation are essential attributes of knowledge. It is precisely the transitivity and correlation of information that create associations between different pieces of information. With the rapid development of Internet technology, modern information has become massive in scale, diverse, and real-time, so the degree of association between pieces of information knowledge keeps increasing, which makes modern information analysis difficult. However, the considerable progress of artificial intelligence technology, represented by machine learning, has made intelligence analysis in a big-data setting feasible. The object of intelligence analysis is the information resource, which can be decomposed into interrelated entities. Machine learning methods can mine the entities in information data and the relationships between them. But this kind of analysis generally stays at the level of simple statistics and syntactic analysis and does not reach the semantic level of the information content, so it easily loses semantics. The information obtained this way is neither accurate nor complete enough, and the associations between different pieces of information are often not well reflected.
The knowledge graph (Knowledge Graph) is a technology that has emerged in recent years. Its theory and concepts combine ideas from knowledge bases, semantic networks, and ontologies, and it was proposed by Google in 2012. A knowledge graph is essentially a graph structure built on a complex semantic network. It makes full use of visualization techniques: it can describe knowledge resources and their carriers, and it can also analyze and describe the connections between pieces of knowledge. It presents complex knowledge graphically, representing knowledge as nodes and the relationships between pieces of knowledge as edges between nodes, so that associations between pieces of knowledge are shown more intuitively. This graph structure gives knowledge graphs advantages such as high coverage of entities and concepts, diverse semantic relations, a friendly structure, and relatively high quality, which has made the knowledge graph the principal form of knowledge representation in the big-data era. The knowledge graph is also regarded as a development direction for the new generation of artificial intelligence. Knowledge graphs were first applied in search engines: vendors such as Google, Bing, Baidu, and Sogou successively released their own knowledge graph products to improve retrieval quality. Knowledge graphs were subsequently applied widely in intelligent search, personalized recommendation, text classification, content distribution, and other fields, with good results. Because a knowledge graph can fully reflect the associations between entities, applying it to intelligence analysis also becomes possible.
Summary of the invention
The object of the present invention is achieved as follows:
An information association analysis method based on a knowledge graph, characterized by comprising the following steps:
Step 1: Data preprocessing. Parse the downloaded information data TXT documents, extract the title, author, keywords, organization, date, and other fields of each information record, and remove stop words and duplicate items.
Step 2: Construct the triple-based information knowledge base. Build relationships between the entities, including the authorship relationship between an information item and its author, the affiliation relationship between an author and an organization, and the cooperation relationship between different authors. Combine these entities and relationships into triple data and store them in a triple database. In the extended triple pattern, the keywords of the information data serve as the extension of the triple pattern.
Step 3: Using the IDF and information-entropy weighting method, calculate the weight of each triple and of its keywords, and store the weights in the database.
Step 4: Compute the similar triples of each triple with the triple similarity formula and sort them by similarity.
Step 5: Store the RDF triples effectively. Implement basic SPARQL query operations with the API provided by Jena TDB, and expand queries according to the query method based on triple similarity.
Step 6: Generate query samples, rank the search result set by the weights of the triples and keywords, and return the top-k results.
The IDF and information-entropy weighting method comprises the following steps:
Step 1: The IDF value of each triple is calculated as follows.
In the formula, e denotes an entity or relation in the extended triple, n denotes the number of triples in the RDF data set, and F(e) denotes the number of times e appears in the RDF data set. Let IDF(t) be the IDF value of the triple t = (s, p, o). Based on the structure of a triple, the IDF value of each triple is calculated as follows, where γ is a regulating factor, 0 < γ < 1:
IDF(t) = γ × (IDF(s) + IDF(p) + IDF(o))
Step 2: The information entropy H(e) used for each triple is calculated by the following formula.
Here, e denotes an entity or relation in the extended triple, and k indexes the data categories. D1(e) denotes the frequency with which e appears in the data set, D2(e(k)) denotes the frequency with which e appears in the k-th data category, and |C| denotes the total number of data categories. For each triple t = (s, p, o), the information entropy of the triple is calculated as follows:
H(t) = η × (H(s) + H(p) + H(o))
where η is a regulating factor, 0 < η < 1. The weight w(t) of each triple is determined by two parts, its IDF value and its information entropy H, as shown in the following formula, where |N| denotes the total number of triples in the data set:
Step 3: Keyword weights are calculated in a similar way to triple weights: the IDF value and the information entropy of each keyword are calculated separately. First, the triple-keyword IDF is calculated by the following formula:
Let IDF(vi|e) be the IDF value of keyword vi in the keyword group v = (v1, v2, …, vn) belonging to the triple t = (s, p, o), where e denotes the head entity s or the tail entity o of t and excludes the relation between them. F(e, vi) denotes the number of times the head entity s or tail entity o co-occurs with keyword vi in the data set. Let IDF(vi|t) denote the triple-keyword IDF value; it is calculated as follows, scaled by a regulating factor:
Step 4: The information entropy H(vi|e) of a keyword is calculated by the following formula:
Here, e denotes only the head or tail entity and excludes the relation between them. k indexes the data categories, D1(vi, e) denotes the number of times keyword vi co-occurs with e in the data set, D2(vi, e(k)) denotes the number of times keyword vi co-occurs with e in the k-th data category, and |C| denotes the total number of data categories. For a triple t and one of its keywords vi, the information entropy is calculated as follows:
H(vi|t) = θ × (H(vi|s) + H(vi|o))
where θ is a regulating factor, 0 < θ < 1. The weight w(vi|t) of each triple-keyword pair consists of two parts, the IDF value and the information entropy H, as shown in the following formula:
The query method based on triple similarity comprises the following steps:
Step 1: Measure triple similarity.
Let t1 and t2 be triples; the similarity of t1 and t2 is calculated by the following formula:
Step 2: Generate relaxed queries. Take an original query made up of triples, whose i-th element denotes the i-th triple of the original query. First, compute the similarity for each triple in the original query with the above formula to obtain a group of similar triples, and sort them by similarity to form a similar-triple queue. Then select approximate triples from the similar-triple queue of each triple and construct relaxed queries by replacing triples in the original query.
Step 3: Calculate the similarity between each subquery and the original query with the following formula, sort the queries by similarity, and execute them in that order. The original query is executed first, followed by the relaxed queries in descending order of similarity.
Step 4: Set a threshold on the semantic similarity between the original query and a relaxed query; if the similarity between the two queries is below this threshold, no further queries are derived.
Beneficial effects of the present invention:
At present, scholars at home and abroad are actively studying correlation search methods for information knowledge graphs, but they face problems such as incomplete data and relationships, a lack of flexible queries, and query result sets that fail to reflect diversity. Addressing the central problem of association-based querying of information data, and building on RDF triple retrieval, the present invention gives a targeted solution and proposes R-SPARQL, a correlation retrieval strategy for extended RDF knowledge graphs.
Brief description of the drawings
Fig. 1 is a flow diagram of the algorithm of the present invention;
Fig. 2 is a schematic diagram of semantic overlap based on the graph structure;
Fig. 3 is a schematic diagram of the process by which an original query derives relaxed queries;
Fig. 4 compares average recall and accuracy under different values of the slackness α;
Fig. 5 compares average query time under different values of the slackness α;
Fig. 6 compares average recall under different data set sizes;
Fig. 7 compares average accuracy under different data set sizes;
Fig. 8 compares average query time under different data set sizes.
Specific embodiment
The present invention is described in more detail below with reference to the accompanying drawings and examples.
Information correlation analysis is a research focus in information retrieval and information science. The main task of this invention is to make retrieval results reflect the characteristics of multiple associated data items. Information retrieval over a knowledge graph is essentially a subgraph matching problem: a knowledge graph can be regarded as a set of triples in RDF form, and retrieval over this structure can be done with the W3C-standard SPARQL language. However, such a query returns only the subgraph structures that satisfy its conditions, and the query result set reflects neither relevance nor diversity.
To address this problem, the present invention studies RDF-based knowledge graph retrieval in depth and proposes a correlation retrieval method. Its idea is to extend the structure of the RDF knowledge graph with keywords and weights. The keyword extension provides a form of fuzzy search, and the weights of the triples and keywords are used to compute the similarity of query triples and to rank the query result set by relevance. By computing similarity, the original query is derived into multiple relaxed queries, which yields more search results that are associated with one another.
Main content and technical effects:
At present, scholars at home and abroad are actively studying correlation search methods for information knowledge graphs, but they face problems such as incomplete data and relationships, a lack of flexible queries, and query result sets that fail to reflect diversity. Addressing the central problem of association-based querying of information data, and building on RDF triple retrieval, the present invention gives a targeted solution and proposes R-SPARQL, a correlation retrieval strategy for extended RDF knowledge graphs. Its main points and content are as follows:
(1) Extension of the triple pattern. A user may not know the exact name of the triple element to be queried, or may know only the name of an entity similar to that element; such queries often find no data or return wrong results. To avoid the overly exact matching of traditional SPARQL queries, an extended triple pattern query method is proposed. First, the invention extends the RDF knowledge graph with groups of keywords, so that each triple is associated with a keyword group. The purpose of the keyword extension is to provide a fuzzy search method that makes retrieval more flexible. Second, each keyword is assigned a weight according to its degree of association with the triple, called the triple-keyword weight. The purpose of the weight extension is to rank the result set by relevance while reflecting the diversity of the search results. The weight assigned to each triple therefore represents not only the importance of the triple in the data set but also its power to discriminate between triples. Using this discriminating power and ranking the result set by weight makes the result set reflect diversity.
Keywords can be obtained in many ways. For scientific literature data, for example, a keyword group can be built by extracting the subject terms of a document. Keyword selection also requires some preprocessing, such as removing stop words and duplicate items. Both the triple weights and the keyword weights are key factors in ranking the final result set. The RDF knowledge graph structure is extended with a group of keywords and weights. Triple weights can be computed in many ways depending on the properties of the RDF knowledge graph data source; the present invention uses the inverse document frequency (IDF) algorithm and information entropy to calculate the weights of triples and keywords, in the following steps.
Step 1: The IDF value of each triple is calculated as follows.
In the formula, e denotes an entity or relation in the extended triple, n denotes the number of triples in the RDF data set, and F(e) denotes the number of times e appears in the RDF data set. Let IDF(t) be the IDF value of the triple t = (s, p, o). Based on the structure of a triple, the IDF value of each triple is calculated as follows, where γ is a regulating factor, 0 < γ < 1.
IDF(t) = γ × (IDF(s) + IDF(p) + IDF(o))    (2)
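The IDF step can be sketched in Python as follows. Only equation (2) is taken from the description. The element-level formula is not reproduced in this text, so the sketch assumes the conventional form IDF(e) = log(n / F(e)), counting each element once per triple. The function names, the sample triples, and the value γ = 0.5 are illustrative assumptions.

```python
import math
from collections import Counter

def element_idf(triples):
    """Assumed form: IDF(e) = log(n / F(e)), where n is the number of triples
    and F(e) is the number of triples in which element e appears."""
    n = len(triples)
    freq = Counter()
    for t in triples:
        freq.update(set(t))  # count each element once per triple
    return {e: math.log(n / f) for e, f in freq.items()}

def triple_idf(t, idf, gamma=0.5):
    """Equation (2): IDF(t) = gamma * (IDF(s) + IDF(p) + IDF(o)), 0 < gamma < 1."""
    s, p, o = t
    return gamma * (idf[s] + idf[p] + idf[o])

# Made-up sample data; the relation names are hypothetical.
triples = [
    ("paper1", "writtenBy", "alice"),
    ("paper2", "writtenBy", "alice"),
    ("alice", "worksAt", "HEU"),
    ("alice", "cooperatesWith", "bob"),
]
idf = element_idf(triples)
print(triple_idf(triples[0], idf))
```

Under this assumed form, an element that occurs in every triple (here "alice") contributes zero, so rare entities and relations dominate the triple's IDF value.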
Step 2: The information entropy H(e) used for each triple is calculated by the following formula.
Here, e denotes an entity or relation in the extended triple, and k indexes the data categories. D1(e) denotes the frequency with which e appears in the data set, D2(e(k)) denotes the frequency with which e appears in the k-th data category, and |C| denotes the total number of data categories. For each triple t = (s, p, o), the information entropy of the triple is calculated as follows.
H(t) = η × (H(s) + H(p) + H(o))    (4)
where η is a regulating factor, 0 < η < 1. The weight w(t) of each triple is determined by two parts, its IDF value and its information entropy H, as shown in the following formula, where |N| denotes the total number of triples in the data set.
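A minimal sketch of the entropy step follows. Only equation (4) comes from the description. The element-level entropy formula is not reproduced in this text, so the sketch assumes the Shannon form over the category distribution, with p_k = D2(e(k)) / D1(e), and it leaves out any normalization by |C|. The formula that combines IDF and entropy into w(t) is likewise not shown here and is not implemented. All counts are made up.

```python
import math

def element_entropy(category_counts):
    """Assumed Shannon form: H(e) = -sum_k p_k * log(p_k), with
    p_k = D2(e(k)) / D1(e) and D1(e) the total frequency of e."""
    total = sum(category_counts)  # D1(e)
    probs = [c / total for c in category_counts if c > 0]
    return -sum(p * math.log(p) for p in probs)

def triple_entropy(h_s, h_p, h_o, eta=0.5):
    """Equation (4): H(t) = eta * (H(s) + H(p) + H(o)), 0 < eta < 1."""
    return eta * (h_s + h_p + h_o)

# Per-category occurrence counts for the three elements of one triple (made up).
h_subject = element_entropy([1, 0])   # appears in a single category
h_relation = element_entropy([2, 2])  # evenly spread over two categories
h_object = element_entropy([3, 1])
print(triple_entropy(h_subject, h_relation, h_object))
```

An element confined to one category has zero entropy under this assumption, while an element spread evenly across categories has the maximum.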
Step 3: Keyword weights are calculated in a similar way to triple weights: the IDF value and the information entropy of each keyword are calculated separately. First, the triple-keyword IDF is calculated by the following formula.
Let IDF(vi|e) be the IDF value of keyword vi in the keyword group v = (v1, v2, …, vn) belonging to the triple t = (s, p, o), where e denotes the head entity s or the tail entity o of t and excludes the relation between them. F(e, vi) denotes the number of times the head entity s or tail entity o co-occurs with keyword vi in the data set. Let IDF(vi|t) denote the triple-keyword IDF value; it is calculated as follows, scaled by a regulating factor.
Step 4: The information entropy H(vi|e) of a keyword is calculated by the following formula.
Here, e denotes only the head or tail entity and excludes the relation between them. k indexes the data categories, D1(vi, e) denotes the number of times keyword vi co-occurs with e in the data set, D2(vi, e(k)) denotes the number of times keyword vi co-occurs with e in the k-th data category, and |C| denotes the total number of data categories. For a triple t and one of its keywords vi, the information entropy is calculated as follows.
H(vi|t) = θ × (H(vi|s) + H(vi|o))    (9)
where θ is a regulating factor, 0 < θ < 1. The weight w(vi|t) of each triple-keyword pair consists of two parts, the IDF value and the information entropy H, as shown in the following formula.
(2) Association query method based on triple similarity. The invention proposes an association query strategy suited to the RDF knowledge graph structure. Query relaxation is an important relevance query technique: it changes or deletes one or more fixed conditions in the original query, or expands the original query conditions, to obtain a query with broader coverage and finally a group of query results associated with the original query results. The specific steps are as follows:
Step 1: Measuring triple similarity. In a knowledge graph, an entity may be connected to another entity by one or more edges, and triples that are directly connected tend to be similar: they express the several relations that one entity may have, or the different entities linked by the same relation. Two triples may also have no directly connecting edge and yet share the same entity labels and edge labels; the similarity between such triples is important too. For this reason, the concept of semantic overlap between triples is introduced. In an ontology, semantic overlap denotes the number of upper-level concepts that two concepts share, which indicates how similar the two concepts are. The semantic overlap of two RDF triples is defined as the number of identical elements in the two triples. Semantic overlap based on the graph structure is shown in Fig. 2.
Another way to measure triple similarity is based on graph embedding: the triples of the knowledge graph are embedded in a low-dimensional vector space, where the head entity, tail entity, and relation of a triple are each a vector. Given a triple t = (s, p, o) with vector representations for its entities and relation, a fixed equation relates these vectors whenever the triple holds. If two triples t1 = (s1, p1, o1) and t2 = (s2, p2, o2) are highly similar, the equation is assumed to still hold after elements are exchanged between them. For example, if t1 and t2 are similar to each other, the equation holds with their elements exchanged. Combining both calculation methods, the similarity of triples t1 and t2 is calculated by the following formula.
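The two similarity ingredients described here can be sketched separately in Python. The semantic overlap follows the definition in the description, read position by position, which is an assumption. The embedding equation is not reproduced in this text, so the sketch assumes a translation-style relation s + p ≈ o, as used by TransE-like models. The combined similarity formula is not shown here and is not implemented. The embeddings and relation names are made up.

```python
import math

def semantic_overlap(t1, t2):
    """Count positions (subject, predicate, object) where two triples agree."""
    return sum(a == b for a, b in zip(t1, t2))

def translation_residual(s_vec, p_vec, o_vec):
    """Euclidean norm of s + p - o; near zero when s + p ≈ o holds (assumed form)."""
    return math.sqrt(sum((s + p - o) ** 2 for s, p, o in zip(s_vec, p_vec, o_vec)))

def swapped_residual(emb, t1, t2):
    """Residual of t1 after its head entity is exchanged for the head of t2."""
    _, p1, o1 = t1
    s2 = t2[0]
    return translation_residual(emb[s2], emb[p1], emb[o1])

# Toy two-dimensional embeddings (made up).
emb = {
    "alice": [0.0, 0.0],
    "bob": [0.5, 0.0],
    "worksAt": [1.0, 1.0],
    "HEU": [1.0, 1.0],
}
t1 = ("alice", "worksAt", "HEU")
t2 = ("bob", "worksAt", "HEU")
print(semantic_overlap(t1, t2), swapped_residual(emb, t1, t2))
```

A large overlap and a small swapped residual both point to similar triples; how the patent weighs the two is left open here.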
Step 2: Generate relaxed queries. Take an original query made up of triples, whose i-th element denotes the i-th triple of the original query. First, compute the similarity for each triple in the original query with the above formula to obtain a group of similar triples, and sort them by similarity to form a similar-triple queue. Then select approximate triples from the similar-triple queue of each triple and construct relaxed queries by replacing triples in the original query.
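Building the similar-triple queue and performing one replacement can be sketched as below. The similarity function is a stand-in (fraction of matching positions), not the patent's formula, and the names `similar_queue`, `replace_triple`, and `limit` are hypothetical.

```python
def overlap_similarity(t1, t2):
    """Stand-in similarity: fraction of positions on which two triples agree."""
    return sum(a == b for a, b in zip(t1, t2)) / 3

def similar_queue(pattern, candidates, similarity, limit=3):
    """Score candidates against the pattern; return them in descending similarity."""
    scored = [(c, similarity(pattern, c)) for c in candidates if c != pattern]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]

def replace_triple(query, index, new_triple):
    """Build a derived query by swapping one triple of the original query."""
    return query[:index] + [new_triple] + query[index + 1:]

# Made-up data set and original query.
store = [
    ("paper1", "writtenBy", "alice"),
    ("paper2", "writtenBy", "alice"),
    ("paper3", "writtenBy", "bob"),
    ("alice", "affiliatedWith", "HEU"),
]
original = [("paper1", "writtenBy", "alice"), ("alice", "affiliatedWith", "HEU")]
queue = similar_queue(original[0], store, overlap_similarity)
relaxed = [replace_triple(original, 0, t) for t, _ in queue]
print(relaxed[0])
```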
Step 3: Calculate the similarity between each subquery and the original query with the following formula, sort the queries by similarity, and execute them in that order. The original query is executed first, followed by the relaxed queries in descending order of similarity.
Step 4: Limiting the number of relaxed queries. An original query can derive multiple relaxed queries, and each relaxed query can go on to derive further relaxed sub-queries; the number of derivation rounds is called the slackness α. The number of relaxed queries grows exponentially with α. Without any limit, an original query may derive too many relaxed queries, and many of them may retrieve no results or produce meaningless results. The amount of relaxation therefore needs to be limited. The invention sets a threshold on the semantic similarity between the original query and a relaxed query; if the similarity between the two queries is below the threshold, no further queries are derived. Fig. 3 illustrates how subqueries are derived from the original query.
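The derivation, ordering, and pruning steps can be sketched together. The query-level similarity formula is not reproduced in this text, so the sketch assumes that a derived query's similarity to the original is the product of the similarities of the replaced triples. The function name `derive_queries` and all sample values are hypothetical.

```python
def derive_queries(original, similar, alpha, threshold):
    """Breadth-first derivation: alpha rounds of single-triple replacement,
    dropping any query whose similarity to the original is below threshold."""
    start = tuple(original)
    seen = {start: 1.0}  # keeps the first similarity found for each query
    frontier = [(start, 1.0)]
    for _ in range(alpha):
        next_frontier = []
        for query, q_sim in frontier:
            for i, pattern in enumerate(query):
                for candidate, t_sim in similar.get(pattern, []):
                    new_sim = q_sim * t_sim  # assumed aggregation
                    new_query = query[:i] + (candidate,) + query[i + 1:]
                    if new_sim < threshold or new_query in seen:
                        continue
                    seen[new_query] = new_sim
                    next_frontier.append((new_query, new_sim))
        frontier = next_frontier
    # Execution order: the original first, then descending similarity.
    return sorted(seen.items(), key=lambda item: -item[1])

A = ("?x", "writtenBy", "alice")
B = ("?x", "writtenBy", "bob")
C = ("?x", "writtenBy", "carol")
similar = {A: [(B, 0.8)], B: [(C, 0.7)]}  # made-up similar-triple queues

strict = derive_queries([A], similar, alpha=2, threshold=0.6)
loose = derive_queries([A], similar, alpha=2, threshold=0.5)
print(len(strict), len(loose))
```

With the higher threshold the second-round query through C (similarity about 0.56) is pruned, which shows how the threshold caps the growth that a larger α would otherwise cause.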
(3) Top-k selection of query results. A triple pattern query usually returns too many results. Moreover, in the result set retrieved by relaxed queries, the top-ranked results are often too similar to one another and show no diversity, which makes it hard for the user to find the desired data in the result set. Based on the extended RDF triple pattern, the invention proposes a ranking model that ranks the result subgraphs according to the relevance between each triple and its keywords and finally returns the most relevant top-k results. Suppose the original query derives m relaxed queries (Q1, Q2, …, Qm), where each query triple of Qj is extended by a group of keywords. Executing Qj yields a set of result triples (q1, q2, …, qs). When a query triple yields the result triple qw, this is called a transition from the query triple to qw. Assuming the keywords in the keyword group are mutually independent, the transition probability from the query triple to qw expands into the transition probabilities from each individual keyword to qw and can be calculated by the following formula. The result set is ranked by this probability value.
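A minimal sketch of the top-k selection follows. The transition-probability formula is not reproduced in this text; under the stated independence assumption, the sketch takes the score of a result to be the product of its per-keyword transition probabilities, which is an assumption rather than the patent's formula. The probabilities and result identifiers are made up.

```python
import heapq
import math

def result_score(keyword_probs):
    """Assumed combination under keyword independence: the product of the
    per-keyword transition probabilities."""
    return math.prod(keyword_probs)

def top_k(results, k):
    """results maps a result id to its list of per-keyword probabilities."""
    return heapq.nlargest(k, results, key=lambda r: result_score(results[r]))

results = {"q1": [0.9, 0.5], "q2": [0.6, 0.6], "q3": [0.2, 0.9]}
print(top_k(results, 2))
```

`heapq.nlargest` avoids sorting the whole result set when k is small relative to the number of results.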
Technical effect of the invention is:
The invention proposes an RDF knowledge graph representation extended with keywords and weights, which extends the SPARQL query mode by introducing additional variables. Building on the extended triple pattern, it then proposes a query relaxation method based on triple similarity: by computing similarity to the original query, it generates multiple relaxed queries, thereby broadening the user's query intent, and finally returns a group of the most relevant query results. Finally, it ranks the query result set by relevance using the triple weights, taking the diversity of the result set fully into account.
Simulation experiments compare traditional SPARQL queries with the relaxed query scheme proposed by the invention, and the results are plotted for three performance indicators: average recall, average accuracy, and average query time. First, Figs. 4 and 5 compare recall, accuracy, and query time under different slackness values. α = 0 means that only the original query is executed. As α increases, the number of relaxed queries grows exponentially, so recall and accuracy also increase, but this gain comes at the cost of longer query time. Figs. 6, 7, and 8 compare recall, accuracy, and query time for different data set sizes.
A traditional SPARQL query returns only the result set that satisfies its conditions and does not reflect the degree of association between data. The R-SPARQL method proposed by the invention expands the original query into multiple subqueries, so it returns not only the result set that satisfies the conditions but also associated entities that fit the knowledge graph structure. In average recall and average accuracy, R-SPARQL therefore performs significantly better than traditional SPARQL queries.
The information association query scheme based on the extended RDF knowledge graph proposed by the invention is realized by the following steps.
Step 1: Data preprocessing. The main task is to parse the downloaded information data TXT documents, extract the title, author, keywords, organization, date, and other fields of each information record, and remove stop words and duplicate items.
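The preprocessing step might look like the following sketch. The layout of the TXT records is not given in the description, so the "Field: value" lines, the ";" keyword separator, the stop-word list, and the sample record are all assumptions made for illustration.

```python
STOP_WORDS = {"the", "of", "and", "a"}  # illustrative stop-word list

def parse_record(text):
    """Parse one record made of 'Field: value' lines (hypothetical layout)."""
    record = {}
    for line in text.strip().splitlines():
        field, _, value = line.partition(":")
        record[field.strip().lower()] = value.strip()
    return record

def clean_keywords(raw):
    """Split on ';', lowercase, drop stop words and duplicates, keep order."""
    seen, cleaned = set(), []
    for keyword in raw.split(";"):
        keyword = keyword.strip().lower()
        if keyword and keyword not in STOP_WORDS and keyword not in seen:
            seen.add(keyword)
            cleaned.append(keyword)
    return cleaned

sample = """
Title: Graph search
Author: Alice; Bob
Keywords: RDF; the; SPARQL; rdf
Organization: HEU
"""
record = parse_record(sample)
print(clean_keywords(record["keywords"]))
```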
Step 2: Construct the triple-based information knowledge base. Build relationships between the entities, such as the authorship relationship between an information item and its author, the affiliation relationship between an author and an organization, and the cooperation relationship between different authors. Combine these entities and relationships into triple data and store them in a triple database. In the extended triple pattern, the invention uses the keywords of the information data as the extension of the triple pattern.
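The construction of extended triples from one parsed record can be sketched as follows. The three relation types come from the description; the relation names `writtenBy`, `affiliatedWith`, and `cooperatesWith`, the record layout, and the choice to attach the whole keyword group to every triple of the record are illustrative assumptions.

```python
from itertools import combinations

def build_triples(record):
    """Return a mapping from each triple to the keyword group that extends it."""
    title, organization = record["title"], record["organization"]
    authors = record["authors"]
    triples = [(title, "writtenBy", author) for author in authors]
    triples += [(author, "affiliatedWith", organization) for author in authors]
    triples += [(a, "cooperatesWith", b) for a, b in combinations(authors, 2)]
    # Extended triple pattern: every triple carries the record's keyword group.
    return {t: list(record["keywords"]) for t in triples}

record = {
    "title": "Graph search",
    "authors": ["alice", "bob"],
    "organization": "HEU",
    "keywords": ["rdf", "sparql"],
}
extended = build_triples(record)
for triple, keywords in extended.items():
    print(triple, keywords)
```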
Step 3: Using the IDF and information-entropy weighting method, calculate the weight of each triple and of its keywords, and store the weights in the database.
Step 4: Compute the similar triples of each triple with the triple similarity formula and sort them by similarity.
Step 5: Store the RDF triples effectively. Jena TDB is a component that supports storing, querying, and updating RDF data, and ontology data in RDF format can easily be imported into it. The invention uses the API provided by Jena TDB to implement basic SPARQL query operations and uses the proposed query relaxation model based on triple similarity to expand queries.
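To make concrete what a basic SPARQL triple-pattern query returns, the following pure-Python sketch evaluates a basic graph pattern over an in-memory list of triples. It is a stand-in for the query engine, not the Jena TDB API, and the function names and data are hypothetical. Terms starting with "?" are treated as variables.

```python
def match_pattern(pattern, triple, binding):
    """Extend a variable binding so that pattern matches triple, or return None."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if extended.setdefault(term, value) != value:
                return None  # variable already bound to a different value
        elif term != value:
            return None  # constant term does not match
    return extended

def run_query(patterns, store):
    """Evaluate a basic graph pattern with nested-loop joins over the store."""
    bindings = [{}]
    for pattern in patterns:
        joined = []
        for binding in bindings:
            for triple in store:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    joined.append(extended)
        bindings = joined
    return bindings

store = [
    ("paper1", "writtenBy", "alice"),
    ("paper2", "writtenBy", "bob"),
    ("alice", "affiliatedWith", "HEU"),
]
query = [("?p", "writtenBy", "?a"), ("?a", "affiliatedWith", "HEU")]
print(run_query(query, store))
```

Only exact matches are returned: paper2 is excluded because no affiliation triple exists for bob, which is the behaviour that query relaxation is meant to loosen.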
Step 6: Generate query samples, rank the search result set by the weights of the triples and keywords, and return the top-k results.

Claims (3)

1. An information association analysis method based on a knowledge graph, characterized by comprising the following steps:
Step 1: data preprocessing; the downloaded information data TXT documents are parsed, the title, author, keywords, institution, date, and other information of each information record are extracted, and stop words and duplicate entries are removed;
Step 2: constructing the triple information knowledge base; relationships are built among these entities, including the authorship relationship between an information item and its author, the affiliation relationship between an author and an institution, and the collaboration relationship between different authors; the entities and relationships are combined into triple data and stored in a triple database; in the extended triple pattern, the keywords of the information data are used as the extension to the triple pattern;
Step 3: using the IDF and information-entropy weighting method, calculating the weight of each triple and of its keywords, and storing the weights in the database;
Step 4: calculating the similar triples of each triple using the triple similarity formula, and sorting them by similarity;
Step 5: storing the RDF triples efficiently; basic SPARQL query operations are implemented using the API provided by Jena TDB, and query expansion is performed according to the query method based on triple similarity;
Step 6: generating sample queries, sorting the search result set according to the weights of the triples and keywords, and returning the top-k results.
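The ranking in Step 6 can be sketched as follows. This is only an illustration: the text does not state how the triple weight and keyword weight are combined into a score, so the sketch assumes a plain sum, and the function name `top_k` and the dictionary layout are hypothetical.

```python
import heapq

def top_k(hits, triple_weight, keyword_weight, k):
    """Return the k best (triple, keyword) hits.

    Assumption: a hit's score is its triple weight w(t) plus its
    triple-keyword weight w(v | t); the source does not give this rule.
    """
    def score(hit):
        triple, keyword = hit
        return (triple_weight.get(triple, 0.0)
                + keyword_weight.get((keyword, triple), 0.0))
    return heapq.nlargest(k, hits, key=score)

# Hypothetical weights, as they might be read back from the database of Step 3.
t1 = ("Paper A", "writtenBy", "Li")
t2 = ("Paper B", "writtenBy", "Li")
t3 = ("Paper C", "writtenBy", "Li")
triple_weight = {t1: 0.5, t2: 0.9, t3: 0.2}
keyword_weight = {("RDF", t1): 0.6, ("RDF", t2): 0.1, ("SPARQL", t3): 0.1}

best = top_k([(t1, "RDF"), (t2, "RDF"), (t3, "SPARQL")],
             triple_weight, keyword_weight, k=2)
```

Using `heapq.nlargest` avoids sorting the whole result set when k is small relative to the number of hits.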
2. The information association analysis method based on a knowledge graph according to claim 1, characterized in that the IDF and information-entropy weighting method of step 3 comprises the following steps:
Step 1: the IDF value of each triple is calculated by the following formula;
In the above formula, e denotes an entity or relationship in the extended triple, n denotes the number of triples in the RDF data set, and F(e) denotes the number of times e appears in the RDF data set; let IDF(t) be the IDF value of the triple t = (s, p, o); according to the structural characteristics of triples, the IDF value of each triple is calculated as follows, where γ is a regulatory factor, 0 < γ < 1;
IDF(t) = γ × (IDF(s) + IDF(p) + IDF(o))
Step 2: the information entropy H(e) for each triple is calculated as shown in the following formula;
where e denotes an entity or relationship in the extended triple, and k denotes the data category number; D_1(e) denotes the frequency with which e appears in the data set, D_2(e^(k)) denotes the frequency with which e appears in the k-th data category, and |C| denotes the total number of data categories;
for each triple t = (s, p, o), the information entropy of the triple is calculated as follows;
H(t) = η × (H(s) + H(p) + H(o))
where η is a regulatory factor, 0 < η < 1; the weight w(t) of each triple is determined by two parts, its IDF value and its information entropy H, as shown in the following formula, where |N| denotes the total number of triples in the data set;
Step 3: the keyword weight is calculated in a similar way to the triple weight, by computing the IDF value and the information entropy of each keyword separately; first, the triple-keyword IDF is calculated as shown in the following formula:
Let IDF(v_i | e) be the IDF value of keyword v_i in the keyword group v = (v_1, v_2, …, v_n) belonging to the triple t = (s, p, o), where e denotes the head entity s or the tail entity o of triple t and does not include the relationship between them; F(e, v_i) denotes the number of times the head entity s or tail entity o of the triple and the keyword v_i appear together in the data set; let IDF(v_i | t) denote the triple-keyword IDF value, which is calculated by the following formula containing a regulatory factor;
Step 4: the information entropy H(v_i | e) of a keyword is calculated as shown in the following formula:
where e represents only the head and tail entities, not the relationship between them; k denotes the data category number; D_1(v_i, e) denotes the number of times keyword v_i and e appear together in the data set; D_2(v_i, e^(k)) denotes the number of times keyword v_i and e appear together in the k-th data category; |C| denotes the total number of data categories; for a triple t and one of its keywords v_i, the information entropy is calculated as follows:
H(v_i | t) = θ × (H(v_i | s) + H(v_i | o))
where θ denotes a regulatory factor, 0 < θ < 1; the weight w(v_i | t) of each triple-keyword pair consists of two parts, the IDF value and the information entropy H, as shown in the following formula:
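The IDF part of this weighting can be sketched in Python. The combination IDF(t) = γ × (IDF(s) + IDF(p) + IDF(o)) is taken from the claim; the per-element form IDF(e) = log(n / F(e)) is the common textbook definition and is only an assumption here, as are the function names and the sample data. The entropy terms and the final weights w(t) and w(v_i | t) are not sketched.

```python
import math
from collections import Counter

def element_idf(triples):
    """IDF of every entity and relationship, assuming IDF(e) = log(n / F(e)).

    n is the number of triples; F(e) is taken to be the number of triples
    in which e occurs (an interpretation, not stated in the source).
    """
    n = len(triples)
    counts = Counter()
    for s, p, o in triples:
        counts.update({s, p, o})  # count each element once per triple
    return {e: math.log(n / f) for e, f in counts.items()}

def triple_idf(triple, idf, gamma=0.5):
    """IDF(t) = gamma * (IDF(s) + IDF(p) + IDF(o)), with 0 < gamma < 1."""
    s, p, o = triple
    return gamma * (idf[s] + idf[p] + idf[o])

# Hypothetical data set of four triples.
data = [
    ("Paper A", "writtenBy", "Li"),
    ("Paper B", "writtenBy", "Li"),
    ("Paper B", "writtenBy", "Wang"),
    ("Li", "affiliatedWith", "Univ X"),
]
idf = element_idf(data)
score = triple_idf(data[0], idf, gamma=0.5)
```

Under these assumptions a frequent relationship such as `writtenBy` receives a lower IDF than a rare entity such as `Wang`, so triples built from rare elements are weighted more heavily.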
3. The information association analysis method based on a knowledge graph according to claim 1, characterized in that the query method based on triple similarity of step 5 comprises the following steps:
Step 1: the measurement method based on triple similarity;
Let t_1 and t_2 be triples; the similarity of triples t_1 and t_2 is calculated as shown in the following formula:
Step 2: the method of generating effective queries; for an original query, whose i-th element denotes the i-th triple in the original query; first, the similarity of each triple in the original query is calculated by the above formula, giving a group of similar triples, which are sorted by similarity to form a similar-triple queue; approximate triples are chosen from the similar-triple queue of each triple, and relaxed queries are constructed by replacing triples in the original query with them;
Step 3: the similarity between each subquery and the original query is calculated by the following formula, the subqueries are sorted by similarity, and the execution order of each query is determined by this ordering; first, the original query is executed, and then the effective queries are executed in turn in descending order of similarity;
Step 4: a threshold is set on the semantic similarity between the original query and an effective query; if the similarity between the two queries is below this threshold, the effective query is not derived.
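The relaxation procedure of this claim can be sketched as follows. The triple similarity values are supplied as input rather than computed. The query-level similarity is assumed to be the mean of the per-triple similarities purely for illustration, and the names `relax_query`, `similar`, and `top_n` are hypothetical.

```python
from itertools import product

def relax_query(query, similar, threshold, top_n=2):
    """Return the original query followed by relaxed queries, best first.

    query: list of triples.
    similar: maps a triple to its similar-triple queue, a list of
             (candidate, similarity) pairs sorted by descending similarity.
    Assumption: query similarity is the mean of per-triple similarities.
    """
    options = []
    for t in query:
        # Keep the original triple (similarity 1.0) plus its closest substitutes.
        options.append([(t, 1.0)] + similar.get(t, [])[:top_n])
    relaxed = []
    for combo in product(*options):
        candidate = [t for t, _ in combo]
        if candidate == query:
            continue  # the original query is always executed first
        sim = sum(s for _, s in combo) / len(combo)
        if sim >= threshold:  # queries below the threshold are not derived
            relaxed.append((sim, candidate))
    relaxed.sort(key=lambda item: item[0], reverse=True)
    return [query] + [q for _, q in relaxed]

# Hypothetical query and similar-triple queues.
q0 = ("?x", "writtenBy", "Li")
q1 = ("?x", "hasKeyword", "RDF")
similar = {
    q0: [(("?x", "writtenBy", "Wang"), 0.8)],
    q1: [(("?x", "hasKeyword", "SPARQL"), 0.6),
         (("?x", "hasKeyword", "XML"), 0.2)],
}
plan = relax_query([q0, q1], similar, threshold=0.75)
```

With a threshold of 0.75 only the two single-substitution queries with mean similarity 0.9 and 0.8 survive, and they follow the original query in that order.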
CN201810519637.XA 2018-05-28 2018-05-28 Information correlation analysis method based on knowledge graph Expired - Fee Related CN108846029B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810519637.XA CN108846029B (en) 2018-05-28 2018-05-28 Information correlation analysis method based on knowledge graph

Publications (2)

Publication Number Publication Date
CN108846029A true CN108846029A (en) 2018-11-20
CN108846029B CN108846029B (en) 2021-05-25

Family

ID=64213596

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810519637.XA Expired - Fee Related CN108846029B (en) 2018-05-28 2018-05-28 Information correlation analysis method based on knowledge graph

Country Status (1)

Country Link
CN (1) CN108846029B (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109508389A (en) * 2018-12-19 2019-03-22 哈尔滨工程大学 A kind of personnel's social relationships map visualization accelerated method
CN109582803A (en) * 2018-11-30 2019-04-05 广东电网有限责任公司 The construction method and system of competitive intelligence database
CN109710621A (en) * 2019-01-16 2019-05-03 福州大学 In conjunction with the keyword search KSANEW algorithm of semantic category node and side right weight
CN110082116A (en) * 2019-03-18 2019-08-02 深圳市元征科技股份有限公司 A kind of evaluation method, evaluating apparatus and the storage medium of vehicular four wheels location data
CN110083732A (en) * 2019-03-12 2019-08-02 浙江大华技术股份有限公司 Picture retrieval method, device and computer storage medium
CN110222240A (en) * 2019-05-24 2019-09-10 华中科技大学 A kind of space RDF data keyword query method based on summary figure
CN110457486A (en) * 2019-07-05 2019-11-15 中国人民解放军战略支援部队信息工程大学 The people entities alignment schemes and device of knowledge based map
CN110765233A (en) * 2019-11-11 2020-02-07 中国人民解放军军事科学院评估论证研究中心 Intelligent information retrieval service system based on deep mining and knowledge management technology
CN111753055A (en) * 2020-06-28 2020-10-09 中国银行股份有限公司 Automatic client question and answer prompting method and device
CN113495955A (en) * 2021-07-08 2021-10-12 北京明略软件系统有限公司 Expert pushing method, system, equipment and storage medium for document
CN113901452A (en) * 2021-09-30 2022-01-07 中国电子科技集团公司第十五研究所 Sub-graph fuzzy matching security event identification method based on information entropy

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003030204A (en) * 2001-07-17 2003-01-31 Takami Yasuda Server for providing video contents, device and method for preparing file for video contents retrieval, computer program and device and method for supporting video clip preparation
CN104166670A (en) * 2014-06-17 2014-11-26 青岛农业大学 Information inquiry method based on semantic network
CN104765779A (en) * 2015-03-20 2015-07-08 浙江大学 Patent document inquiry extension method based on YAGO2s

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李大振: "基于语义相似度的RDF本体查询松弛方法研究", 《中国优秀硕士学位论文全文数据库》 *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109582803A (en) * 2018-11-30 2019-04-05 广东电网有限责任公司 The construction method and system of competitive intelligence database
CN109508389A (en) * 2018-12-19 2019-03-22 哈尔滨工程大学 A kind of personnel's social relationships map visualization accelerated method
CN109508389B (en) * 2018-12-19 2021-05-28 哈尔滨工程大学 Visual accelerating method for personnel social relationship map
CN109710621B (en) * 2019-01-16 2022-06-21 福州大学 Keyword search KSANEW method combining semantic nodes and edge weights
CN109710621A (en) * 2019-01-16 2019-05-03 福州大学 In conjunction with the keyword search KSANEW algorithm of semantic category node and side right weight
CN110083732A (en) * 2019-03-12 2019-08-02 浙江大华技术股份有限公司 Picture retrieval method, device and computer storage medium
CN110083732B (en) * 2019-03-12 2021-08-31 浙江大华技术股份有限公司 Picture retrieval method and device and computer storage medium
CN110082116A (en) * 2019-03-18 2019-08-02 深圳市元征科技股份有限公司 A kind of evaluation method, evaluating apparatus and the storage medium of vehicular four wheels location data
CN110222240A (en) * 2019-05-24 2019-09-10 华中科技大学 A kind of space RDF data keyword query method based on summary figure
CN110222240B (en) * 2019-05-24 2021-03-26 华中科技大学 Abstract graph-based space RDF data keyword query method
CN110457486A (en) * 2019-07-05 2019-11-15 中国人民解放军战略支援部队信息工程大学 The people entities alignment schemes and device of knowledge based map
CN110765233A (en) * 2019-11-11 2020-02-07 中国人民解放军军事科学院评估论证研究中心 Intelligent information retrieval service system based on deep mining and knowledge management technology
CN111753055A (en) * 2020-06-28 2020-10-09 中国银行股份有限公司 Automatic client question and answer prompting method and device
CN111753055B (en) * 2020-06-28 2024-01-26 中国银行股份有限公司 Automatic prompt method and device for customer questions and answers
CN113495955A (en) * 2021-07-08 2021-10-12 北京明略软件系统有限公司 Expert pushing method, system, equipment and storage medium for document
CN113901452A (en) * 2021-09-30 2022-01-07 中国电子科技集团公司第十五研究所 Sub-graph fuzzy matching security event identification method based on information entropy
CN113901452B (en) * 2021-09-30 2022-05-17 中国电子科技集团公司第十五研究所 Sub-graph fuzzy matching security event identification method based on information entropy

Also Published As

Publication number Publication date
CN108846029B (en) 2021-05-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220708

Address after: 100000 Room 401, block D, No. 8, Tongji South Road, Beijing Economic and Technological Development Zone, Daxing District, Beijing

Patentee after: BEIJING SAIXI TECHNOLOGY DEVELOPMENT Co.,Ltd.

Address before: 150001 Intellectual Property Office, Harbin Engineering University science and technology office, 145 Nantong Avenue, Nangang District, Harbin, Heilongjiang

Patentee before: HARBIN ENGINEERING University

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210525