CN108846029A - Information association analysis method based on a knowledge graph - Google Patents
- Publication number: CN108846029A
- Application number: CN201810519637.XA
- Authority: CN (China)
- Prior art keywords: triple, keyword, IDF, similarity, information
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an information association analysis method based on a knowledge graph, belonging to the field of information correlation retrieval over RDF knowledge graphs. The method comprises: preprocessing the data by parsing the downloaded information data TXT documents; constructing a triple information knowledge base; computing the weight of each triple and of its keywords using IDF and information-entropy weighting, and storing the weights in a database; computing the similar triples of each triple with a triple similarity formula and ranking them by similarity; storing the RDF triples efficiently; implementing basic SPARQL queries with the API provided by Jena TDB and expanding queries with a query method based on triple similarity; and generating query samples, ranking the search result set by the weights of the triples and keywords, and returning the top-k results.
Description
Technical field
The present invention relates to an intelligence analysis scheme based on a knowledge graph, and belongs to the field of information correlation retrieval over RDF knowledge graphs.
Background technique
In information science, information is reprocessed knowledge, and transitivity and correlation are essential attributes of knowledge. It is precisely the transitivity and correlation of information that create associations between different pieces of information. With the rapid development of Internet technology, modern information has become massive in scale, diverse, and real-time, so the degree of association between pieces of information keeps rising, which makes modern information analysis difficult. At the same time, the rapid progress of artificial intelligence technology, represented by machine learning, makes intelligence analysis against a big-data background possible. The object of intelligence analysis is the information resource, which can be decomposed into interrelated entities. Machine learning methods can mine the entities in information data and the relationships between them. However, this kind of analysis usually stays at the level of simple statistics and syntactic analysis and does not reach the semantic level of the content. It is therefore prone to semantic loss: the information it extracts is neither accurate nor complete enough, and the associations between different pieces of information are often poorly captured.
The knowledge graph (Knowledge Graph) is a technology that has emerged in recent years. Its theory and concepts combine ideas from knowledge bases, semantic networks, and ontologies, and it was proposed by Google in 2012. A knowledge graph is essentially a graph structure built on a complex semantic network. It makes full use of visualization technology, and it can not only describe knowledge resources and their carriers but also analyze and describe the connections between pieces of knowledge. It presents complex knowledge graphically, representing knowledge as nodes and the relationships between pieces of knowledge as edges between nodes, so that associations between pieces of knowledge are shown more intuitively. This graph structure gives it high coverage of entities and concepts, diverse semantic relations, a friendly structure, and relatively high quality, and these advantages have made the knowledge graph the dominant form of knowledge representation in the big-data era. The knowledge graph is also regarded as a development direction for the new generation of artificial intelligence. Knowledge graphs were first applied in search engines: vendors such as Google, Bing, Baidu, and Sogou each released their own knowledge graph products to improve retrieval quality. Knowledge graphs have since been widely used in intelligent search, personalized recommendation, text classification, content distribution, and other fields, with good results. Because a knowledge graph can fully reflect the associations between entities, applying it to intelligence analysis is also feasible.
Summary of the invention
The object of the present invention is achieved as follows:
The information association analysis method based on a knowledge graph is characterized by comprising the following steps:
Step 1: data preprocessing. Parse the downloaded information data TXT documents, extract the title, author, keywords, institution, date, and other fields of each information record, and remove stop words and duplicates.
Step 2: construct the triple information knowledge base. Build relationships between these entities, including the authorship relationship between a piece of information and its author, the affiliation relationship between an author and an institution, and the collaboration relationship between different authors. Combine the entities and relationships into triple data and store them in a triple database. In the extended triple pattern, the keywords of the information data serve as the extension of the triple pattern.
Step 3: using the IDF and information-entropy weighting method, compute the weight of each triple and of its keywords, and store the weights in the database.
Step 4: compute the similar triples of each triple with the triple similarity formula, and rank them by similarity.
Step 5: store the RDF triples efficiently. Implement basic SPARQL queries with the API provided by Jena TDB, and expand queries with the query method based on triple similarity.
Step 6: generate query samples, rank the search result set by the weights of the triples and keywords, and return the top-k results.
The IDF and information-entropy weighting method comprises the following steps:
Step 1: compute the IDF value of each triple. The IDF value of a single element, IDF(e), is given by a formula that is not reproduced in this text. In that formula, e denotes an entity or relation in an extended triple, n denotes the number of triples in the RDF data set, and F(e) denotes the number of times e appears in the RDF data set. Let IDF(t) be the IDF value of triple t = (s, p, o). Following the structure of a triple, the IDF value of each triple is computed as follows, where γ is an adjustment factor, 0 < γ < 1:
IDF(t) = γ × (IDF(s) + IDF(p) + IDF(o))
Step 2: compute the information entropy H(e) of each triple element by a formula that is not reproduced in this text. Here e denotes an entity or relation in an extended triple and k ranges over the data categories. D_1(e) denotes the frequency with which e occurs in the data set, D_2(e^(k)) denotes the frequency with which e occurs in the k-th data category, and |C| denotes the total number of data categories. For each triple t = (s, p, o), the information entropy of the triple is computed as follows:
H(t) = η × (H(s) + H(p) + H(o))
Here η is an adjustment factor, 0 < η < 1. The weight w(t) of each triple is determined by two parts, its IDF value and its information entropy H, according to a formula that is not reproduced in this text; in that formula |N| denotes the total number of triples in the data set.
Step 3: keyword weights are computed in the same way as triple weights, by computing the IDF value and the information entropy of each keyword. First, the triple-keyword IDF is computed by a formula that is not reproduced in this text. Let IDF(v_i|e) be the IDF value of keyword v_i in the keyword group v = (v_1, v_2, …, v_n) belonging to triple t = (s, p, o), where e denotes the head entity s or the tail entity o of t, but not the relation between them. F(e, v_i) denotes the number of times the head entity s or tail entity o co-occurs with keyword v_i in the data set. IDF(v_i|t) denotes the triple-keyword IDF value; its formula, which involves an adjustment factor, is not reproduced in this text.
Step 4: compute the information entropy H(v_i|e) of a keyword by a formula that is not reproduced in this text. Here e again stands only for the head or tail entity, not the relation between them, and k ranges over the data categories. D_1(v_i, e) denotes the number of times keyword v_i and e co-occur in the data set, D_2(v_i, e^(k)) denotes the number of times v_i and e co-occur in the k-th data category, and |C| denotes the total number of data categories. For a triple t and one of its keywords v_i, the information entropy is computed as follows:
H(v_i|t) = θ × (H(v_i|s) + H(v_i|o))
Here θ is an adjustment factor, 0 < θ < 1. The weight w(v_i|t) of each triple-keyword pair is composed of two parts, the IDF value and the information entropy H, according to a formula that is not reproduced in this text.
The query method based on triple similarity comprises the following steps:
Step 1: measure triple similarity. For two triples t_1 and t_2, their similarity is computed by a formula that is not reproduced in this text.
Step 2: generate relaxed queries. Consider an original query made up of several triples. First, the formula above is used to compute the similar triples of each triple in the original query, giving a group of similar triples that is sorted by similarity to form a similar-triple queue. Relaxed queries are then constructed by taking approximate triples from each triple's queue and substituting them for the corresponding triples in the original query.
Step 3: compute the similarity between each sub-query and the original query by a formula that is not reproduced in this text, and sort the queries by this similarity to determine their execution order. The original query is executed first, followed by the relaxed queries in descending order of similarity.
Step 4: set a threshold on the semantic similarity between the original query and a relaxed query. If the similarity between the two queries is below the threshold, the relaxed query is not derived.
Beneficial effects of the present invention:
Scholars at home and abroad are actively studying correlation retrieval methods for information knowledge graphs, but these methods still face problems such as incomplete data and relationships, a lack of flexible querying, and query result sets that fail to reflect diversity. Building on RDF triple retrieval and focusing on the main problem of association queries over information data, the present invention gives a targeted solution: R-SPARQL, a correlation retrieval strategy over an extended RDF knowledge graph.
Brief description of the drawings
Fig. 1 is a flow chart of the algorithm of the present invention;
Fig. 2 is a schematic diagram of semantic overlap based on the graph structure;
Fig. 3 is a schematic diagram of deriving relaxed queries from the original query;
Fig. 4 compares the average recall and precision of the present invention under different values of the slackness α;
Fig. 5 compares the average query time of the present invention under different values of the slackness α;
Fig. 6 compares the average recall of the present invention under different data set sizes;
Fig. 7 compares the average recall of the present invention under different data set sizes;
Fig. 8 compares the average recall of the present invention under different data set sizes.
Detailed description of the embodiments
The present invention is described in more detail below with reference to the accompanying drawings and examples.
Information association analysis is a research focus of information retrieval and information science, and the main task of this invention is to make retrieval results reflect multiple kinds of associated data. Information retrieval over a knowledge graph is essentially a subgraph matching problem. A knowledge graph can be viewed as a set of triples in RDF form, and it can be queried with SPARQL, the query language standardized by the W3C. However, such queries return only the subgraphs that satisfy the query conditions, and the result set reflects neither association nor diversity.
To address this problem, the present invention studies RDF-based knowledge graph retrieval in depth and proposes a correlation retrieval method. The idea is to extend the structure of the RDF knowledge graph with keywords and weights. The keyword extension provides a form of fuzzy retrieval, while the weights of the triples and keywords are used to compute the similarity of query triples and to rank the query result set by relevance. By computing similarity, the original query is derived into multiple relaxed queries, which yields more results that are associated with one another.
Main content and technical effects:
Scholars at home and abroad are actively studying correlation retrieval methods for information knowledge graphs, but these methods still face problems such as incomplete data and relationships, a lack of flexible querying, and query result sets that fail to reflect diversity. Building on RDF triple retrieval and focusing on the main problem of association queries over information data, the present invention gives a targeted solution: R-SPARQL, a correlation retrieval strategy over an extended RDF knowledge graph. Its main points are as follows:
(1) Extension of the triple pattern. A user may not know the exact names of the triple elements to be queried, or may know only the name of an entity similar to the one being sought; such queries often find no data or return wrong results. To avoid the overly exact matching of traditional SPARQL queries, an extended triple pattern query method is proposed. First, the RDF knowledge graph is extended with groups of keywords, each triple being associated with one group. The purpose of the keyword extension is to provide fuzzy retrieval and so make retrieval more flexible. Second, each keyword is assigned a weight according to its degree of association with the triple, called the triple-keyword weight. The purpose of the weight extension is to rank the result set by relevance while reflecting the diversity of the results. The weight assigned to each triple should therefore represent not only the importance of the triple in the data set but also its power to discriminate between triples. Ranking the result set by weights that carry this discriminating power makes the diversity of the results visible.
Keywords can be obtained in many ways. For scientific literature data, for example, the keyword group can be built from the subject terms extracted from each document. Keyword selection also requires some preprocessing, such as removing stop words and duplicates. Both the triple weights and the keyword weights are key factors in the final ranking of the result set. The RDF knowledge graph structure is extended with a group of keywords and weights. Triple weights can be computed in many ways depending on the properties of the RDF data source; the present invention computes the weights of triples and keywords with the inverse document frequency (IDF) algorithm and information entropy, in the following steps.
First step: compute the IDF value of each triple. The IDF value of a single element, IDF(e), is given by a formula that is not reproduced in this text.
In that formula, e denotes an entity or relation in an extended triple, n denotes the number of triples in the RDF data set, and F(e) denotes the number of times e appears in the RDF data set. Let IDF(t) be the IDF value of triple t = (s, p, o). Following the structure of a triple, the IDF value of each triple is computed as follows, where γ is an adjustment factor, 0 < γ < 1.
IDF(t) = γ × (IDF(s) + IDF(p) + IDF(o))    (2)
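Formula (2) can be sketched in a few lines of Python. The element-level IDF formula is not shown in this text, so the sketch assumes the conventional form IDF(e) = log(n / F(e)); that form, the function names, the sample triples, and the value γ = 0.5 are all illustrative assumptions rather than the patent's definitions.

```python
import math
from collections import Counter

def element_idf(triples):
    """IDF per element, ASSUMING the conventional log(n / F(e)) form."""
    n = len(triples)  # n: number of triples in the data set
    # F(e): number of times element e appears in the data set
    freq = Counter(e for t in triples for e in t)
    return {e: math.log(n / f) for e, f in freq.items()}

def triple_idf(t, idf, gamma=0.5):
    """Formula (2): IDF(t) = gamma * (IDF(s) + IDF(p) + IDF(o)), 0 < gamma < 1."""
    s, p, o = t
    return gamma * (idf[s] + idf[p] + idf[o])

# Hypothetical sample data
triples = [
    ("paper1", "written_by", "alice"),
    ("paper2", "written_by", "alice"),
    ("alice", "works_at", "lab_x"),
]
idf = element_idf(triples)
print(triple_idf(triples[2], idf))
```

Elements that appear in many triples (here "alice") contribute little, so a triple built from rare elements receives a higher IDF value.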
Second step: compute the information entropy H(e) of each triple element by a formula that is not reproduced in this text.
Here e denotes an entity or relation in an extended triple and k ranges over the data categories. D_1(e) denotes the frequency with which e occurs in the data set, D_2(e^(k)) denotes the frequency with which e occurs in the k-th data category, and |C| denotes the total number of data categories. For each triple t = (s, p, o), the information entropy of the triple is computed as follows.
H(t) = η × (H(s) + H(p) + H(o))    (4)
Here η is an adjustment factor, 0 < η < 1. The weight w(t) of each triple is determined by two parts, its IDF value and its information entropy H, according to a formula that is not reproduced in this text; in that formula |N| denotes the total number of triples in the data set.
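The entropy step can be illustrated with a short sketch. Only formula (4) is visible in this text, so the element-level entropy below is assumed to be the standard Shannon entropy of an element's distribution over data categories; that choice, the helper names, and η = 0.5 are assumptions, and the final weight w(t) is left out because its formula is not available here.

```python
import math

def element_entropy(category_counts):
    """Shannon entropy of an element's distribution over data categories.

    category_counts[k] plays the role of D_2(e^(k)) and their sum plays the
    role of D_1(e). This specific form is an ASSUMPTION.
    """
    total = sum(category_counts)
    probs = [c / total for c in category_counts if c > 0]
    return -sum(p * math.log(p) for p in probs)

def triple_entropy(h_s, h_p, h_o, eta=0.5):
    """Formula (4): H(t) = eta * (H(s) + H(p) + H(o)), 0 < eta < 1."""
    return eta * (h_s + h_p + h_o)

h_s = element_entropy([2, 2])  # evenly spread over two categories
h_p = element_entropy([4, 0])  # concentrated in a single category
h_o = element_entropy([1, 3])
print(triple_entropy(h_s, h_p, h_o))
```

Under this assumption an element concentrated in one category has zero entropy, while an element spread evenly across categories has the maximum.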
Third step: keyword weights are computed in the same way as triple weights, by computing the IDF value and the information entropy of each keyword. First, the triple-keyword IDF is computed by a formula that is not reproduced in this text.
Let IDF(v_i|e) be the IDF value of keyword v_i in the keyword group v = (v_1, v_2, …, v_n) belonging to triple t = (s, p, o), where e denotes the head entity s or the tail entity o of t, but not the relation between them. F(e, v_i) denotes the number of times the head entity s or tail entity o co-occurs with keyword v_i in the data set. IDF(v_i|t) denotes the triple-keyword IDF value; its formula, which involves an adjustment factor, is not reproduced in this text.
Fourth step: compute the information entropy H(v_i|e) of a keyword by a formula that is not reproduced in this text.
Here e again stands only for the head or tail entity, not the relation between them, and k ranges over the data categories. D_1(v_i, e) denotes the number of times keyword v_i and e co-occur in the data set, D_2(v_i, e^(k)) denotes the number of times v_i and e co-occur in the k-th data category, and |C| denotes the total number of data categories. For a triple t and one of its keywords v_i, the information entropy is computed as follows.
H(v_i|t) = θ × (H(v_i|s) + H(v_i|o))    (9)
Here θ is an adjustment factor, 0 < θ < 1. The weight w(v_i|t) of each triple-keyword pair is composed of two parts, the IDF value and the information entropy H, according to a formula that is not reproduced in this text.
(2) Association query method based on triple similarity. The invention proposes an association query strategy suited to the RDF knowledge graph structure. Query relaxation is an important approach to association queries: one or more fixed conditions of the original query are changed or deleted, or the original conditions are extended, to obtain a query with wider coverage, and the final result is a set of query results associated with the results of the original query. The specific steps are as follows:
First step: measuring triple similarity. In a knowledge graph, an entity may be connected to another entity by one or more edges, and triples that are directly connected are often similar: they express either several relations that the same entity may have or different entities linked by the same relation. In practice, however, two triples may have no directly connecting edge and yet share entity labels and edge labels, and the similarity between such triples also matters. For this reason the concept of semantic overlap between triples is introduced. In an ontology, semantic overlap is the number of upper-level concepts that two concepts share, which indicates how similar the two concepts are. The semantic overlap of two RDF triples is defined as the number of identical elements in the two triples. Semantic overlap based on the graph structure is shown in Fig. 2.
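The semantic overlap of two triples can be sketched as a simple count. The text does not say whether identical elements must sit in the same position, so this sketch assumes a position-wise comparison (subject with subject, and so on); a set-based count would be an equally plausible reading. The function name and sample triples are illustrative.

```python
def semantic_overlap(t1, t2):
    """Count positions (subject, predicate, object) where t1 and t2 agree.

    ASSUMPTION: elements are compared position by position.
    """
    return sum(a == b for a, b in zip(t1, t2))

t1 = ("paper1", "written_by", "alice")
t2 = ("paper2", "written_by", "alice")
t3 = ("alice", "works_at", "lab_x")

print(semantic_overlap(t1, t2))  # shares predicate and object
print(semantic_overlap(t1, t3))  # "alice" appears in both, but in different positions
```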
Another way to measure triple similarity is based on graph embedding: the triples of the knowledge graph are embedded in a low-dimensional vector space, so that the head entity, tail entity, and relation of each triple are each a vector. For a triple t = (s, p, o), its entities and relation have corresponding vector representations, and if the triple holds, these vectors satisfy an equation that is not reproduced in this text. If two triples t_1 = (s_1, p_1, o_1) and t_2 = (s_2, p_2, o_2) are highly similar, the equation is assumed to still hold after elements are exchanged between them (the swapped forms of the equation are likewise not reproduced). Combining these two measures, the similarity of triples t_1 and t_2 is computed by a formula that is not reproduced in this text.
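The element-exchange idea can be illustrated with a toy example. Because the embedding equation is not shown in this text, the sketch assumes a translation-style model in which a valid triple satisfies s + p ≈ o; that model, the choice to exchange tail entities, and the hand-picked vectors are all assumptions made for illustration only.

```python
import math

def residual(s, p, o):
    """Distance ||s + p - o||; near zero when a translation-style triple holds."""
    return math.sqrt(sum((a + b - c) ** 2 for a, b, c in zip(s, p, o)))

def swap_residual(t1, t2):
    """Average residual after exchanging the tail entities of t1 and t2."""
    (s1, p1, o1), (s2, p2, o2) = t1, t2
    return 0.5 * (residual(s1, p1, o2) + residual(s2, p2, o1))

# Toy 2-d embeddings (made-up values, not trained)
t1 = ([0.0, 0.0], [1.0, 0.0], [1.0, 0.0])
t2 = ([0.0, 0.1], [1.0, 0.0], [1.0, 0.1])
t3 = ([0.0, 0.0], [0.0, 5.0], [0.0, 5.0])

print(swap_residual(t1, t2))  # small: the two triples are nearly interchangeable
print(swap_residual(t1, t3))  # large: exchanging elements breaks the relation
```

A small swap residual indicates that the two triples are close in the embedding space; how the patent combines this with semantic overlap is not recoverable from this text.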
Second step: generating relaxed queries. Consider an original query made up of several triples. First, the formula above is used to compute the similar triples of each triple in the original query, giving a group of similar triples that is sorted by similarity to form a similar-triple queue. Relaxed queries are then constructed by taking approximate triples from each triple's queue and substituting them for the corresponding triples in the original query.
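The substitution step can be sketched as follows. The sketch assumes the similar-triple queues have already been computed and sorted, and it replaces one triple at a time; the function name, the `similar` table, and the sample query are hypothetical.

```python
def relaxed_queries(query, similar):
    """Yield queries that differ from `query` in exactly one triple.

    query:   list of query triples
    similar: dict mapping a triple to its similar triples, already sorted by
             descending similarity (hypothetical precomputed queues)
    """
    for i, t in enumerate(query):
        for candidate in similar.get(t, []):
            # Replace the i-th triple of the original query with a similar one
            yield query[:i] + [candidate] + query[i + 1:]

query = [("?x", "written_by", "alice"), ("alice", "works_at", "?y")]
similar = {
    ("?x", "written_by", "alice"): [("?x", "written_by", "bob")],
}
for q in relaxed_queries(query, similar):
    print(q)
```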
Third step: the similarity between each sub-query and the original query is computed by a formula that is not reproduced in this text, and the queries are sorted by this similarity to determine their execution order. The original query is executed first, followed by the relaxed queries in descending order of similarity.
Fourth step: limiting the number of relaxed queries. An original query can derive multiple relaxed queries, and each relaxed query can in turn derive further relaxed sub-queries; the number of derivation rounds is called the slackness α. The number of relaxed queries grows exponentially with α. Without a limit, the original query may derive too many relaxed queries, many of which return no results or meaningless ones. The amount of relaxation therefore needs to be limited. The present invention sets a threshold on the semantic similarity between the original query and a relaxed query: if the similarity between the two falls below the threshold, the relaxed query is not derived. Fig. 3 illustrates how sub-queries are derived from the original query.
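The interaction of the slackness α, the similarity threshold, and the execution order can be sketched with a bounded derivation loop. The query-similarity formula is not shown in this text, so `sim` and `expand` are caller-supplied placeholders, and the integer "queries" below are a toy stand-in chosen only to keep the example short.

```python
def derive(original, expand, sim, alpha, threshold):
    """Derive relaxed queries for up to `alpha` rounds.

    expand(q) -> iterable of queries derived from q in one step
    sim(a, b) -> similarity score; both functions are placeholders
    """
    seen = {original}
    frontier, derived = [original], []
    for _ in range(alpha):
        next_frontier = []
        for q in frontier:
            for child in expand(q):
                # Prune: queries too far from the original are not derived
                if child in seen or sim(original, child) < threshold:
                    continue
                seen.add(child)
                next_frontier.append(child)
        derived.extend(next_frontier)
        frontier = next_frontier
    # The original query runs first, then relaxed queries by descending similarity
    derived.sort(key=lambda q: sim(original, q), reverse=True)
    return [original] + derived

# Toy stand-ins: integers as queries, closeness as similarity
expand = lambda q: [q + 1, q + 2]
sim = lambda a, b: 1 / (1 + abs(a - b))

print(derive(0, expand, sim, alpha=2, threshold=0.3))
print(derive(0, expand, sim, alpha=0, threshold=0.3))  # only the original query
```

Raising the threshold or lowering α shrinks the set of derived queries, which is the trade-off between coverage and query time that the fourth step describes.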
(3) Top-k selection of query results. A triple pattern query usually returns too many results. Moreover, in the result set returned by relaxed queries, the top-ranked results are often too similar to one another and show little diversity, which makes it hard for users to find the data they really want. Based on the extended RDF triple pattern, the present invention proposes a ranking model that ranks the result subgraphs by the association between each triple and its keywords and returns the k most relevant results. Suppose the original query derives m relaxed queries (Q_1, Q_2, …, Q_m). Each query triple of Q_j is extended by a group of keywords, and Q_j returns a set of result triples (q_1, q_2, …, q_s). When a query triple returns the result triple q_w, this is called a transfer from the query triple to q_w. Assuming the keywords in a group are mutually independent, the transition probability from a query triple to q_w is expanded into the transition probabilities from each individual keyword to q_w; the formula for this transition probability is not reproduced in this text. The result set is ranked by this probability value.
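Top-k selection over scored results can be sketched with the standard library. Since the transition-probability formula is not shown in this text, the sketch assumes that independent per-keyword probabilities combine by simple product; that assumption, the function names, and the sample numbers are illustrative only.

```python
import heapq
import math

def transition_score(keyword_probs):
    """Combine per-keyword transition probabilities.

    ASSUMPTION: independent keywords combine by product.
    """
    return math.prod(keyword_probs)

def top_k(results, k):
    """results: dict mapping a result triple to its per-keyword probabilities."""
    return heapq.nlargest(k, results, key=lambda q: transition_score(results[q]))

# Hypothetical result triples with made-up per-keyword probabilities
results = {
    "q1": [0.9, 0.8],
    "q2": [0.5, 0.5],
    "q3": [0.9, 0.9],
}
print(top_k(results, 2))
```

`heapq.nlargest` avoids sorting the whole result set when k is much smaller than the number of results.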
The technical effects of the invention are as follows:
The invention proposes an RDF knowledge graph representation extended with keywords and weights, which extends the SPARQL query pattern by introducing additional variables. On top of the extended triple pattern, it proposes a query relaxation method based on triple similarity: by computing similarity to the original query, it generates multiple relaxed queries, thereby broadening the user's query intent, and finally returns a set of the most relevant query results. Finally, the query result set is ranked by relevance using the triple weights, which takes the diversity of the result set into account.
Simulation experiments compare traditional SPARQL queries with the relaxed query scheme proposed by the present invention, and the three performance indicators, average recall, average precision, and average query time, are presented as plots. First, recall, precision, and query time are compared under different slackness values, as shown in Fig. 4 and Fig. 5. α = 0 means that only the original query is executed. As α increases, the number of relaxed queries grows exponentially, so recall and precision also increase, but this improvement comes at the cost of longer query time. Recall, precision, and query time are also compared under different data set sizes, as shown in Fig. 6, Fig. 7, and Fig. 8.
A traditional SPARQL query returns only the result set that satisfies the query conditions and does not reflect the degree of association between data. The R-SPARQL method proposed by the present invention expands the original query into multiple sub-queries, so it returns not only the results that satisfy the conditions but also associated entities that fit the knowledge graph structure. In average recall and average precision, R-SPARQL therefore performs significantly better than traditional SPARQL queries.
The information association query scheme based on the extended RDF knowledge graph proposed by the present invention is implemented by the following steps.
Step 1: data preprocessing. The main task is to parse the downloaded information data TXT documents, extract the title, author, keywords, institution, date, and other fields of each information record, and remove stop words and duplicates.
Step 2: construct the triple information knowledge base. Build relationships between these entities, such as the authorship relationship between a piece of information and its author, the affiliation relationship between an author and an institution, and the collaboration relationship between different authors. Combine the entities and relationships into triple data and store them in a triple database. In the extended triple pattern, the present invention uses the keywords of the information data as the extension of the triple pattern.
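The triple construction in this step can be sketched as follows. The record fields and the predicate names `written_by`, `affiliated_with`, and `cooperates_with` are hypothetical, and pairing every author with a single institution is a simplification made for brevity.

```python
from itertools import combinations

def build_triples(record):
    """Turn one parsed record into extended triples: (triple, keyword group).

    Field and predicate names are hypothetical.
    """
    triples = []
    for author in record["authors"]:
        triples.append((record["title"], "written_by", author))
        triples.append((author, "affiliated_with", record["institution"]))
    # Collaboration relationship between every pair of co-authors
    for a, b in combinations(record["authors"], 2):
        triples.append((a, "cooperates_with", b))
    # Extended triple pattern: attach the record's keywords to every triple
    return [(t, tuple(record["keywords"])) for t in triples]

record = {
    "title": "Graph Search",
    "authors": ["alice", "bob"],
    "institution": "lab_x",
    "keywords": ["rdf"],
}
extended = build_triples(record)
for item in extended:
    print(item)
```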
Step 3: using the IDF and information-entropy weighting method, compute the weight of each triple and of its keywords, and store the weights in the database.
Step 4: compute the similar triples of each triple with the triple similarity formula, and rank them by similarity.
Step 5: store the RDF triples efficiently. Jena TDB is a component that supports storing, querying, and updating RDF data, and ontology data in RDF format can be imported into it easily. The present invention uses the API provided by Jena TDB to implement basic SPARQL queries, and uses the proposed query relaxation model based on triple similarity for query expansion.
Step 6:Inquiry sample is generated, search result is concentrated and is effectively arranged according to triple and the weight of keyword
Sequence returns to top-k result.
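As an illustration of Steps 1 and 2, the following is a minimal Python sketch of how one parsed record might be turned into keyword-extended triples. It is not the implementation of the invention: the record layout, the stop-word list, the predicate names (`wrote`, `affiliated_with`, `cooperates_with`), the direction of each relationship, and the single institution per record are all assumptions made for the example.

```python
from dataclasses import dataclass
from itertools import combinations

# Hypothetical stop-word list; the text does not specify one.
STOP_WORDS = {"the", "of", "and", "a", "an"}


@dataclass(frozen=True)
class ExtendedTriple:
    subject: str
    predicate: str
    obj: str
    keywords: tuple  # keyword extension attached to the triple


def clean_keywords(raw):
    """Lower-case keywords, drop stop words and duplicates, keep first-seen order."""
    seen, cleaned = set(), []
    for word in raw:
        w = word.strip().lower()
        if w and w not in STOP_WORDS and w not in seen:
            seen.add(w)
            cleaned.append(w)
    return tuple(cleaned)


def build_triples(record):
    """Build authorship, affiliation, and cooperation triples for one record."""
    kws = clean_keywords(record["keywords"])
    authors = list(dict.fromkeys(record["authors"]))  # remove duplicates, keep order
    triples = []
    for author in authors:
        triples.append(ExtendedTriple(author, "wrote", record["title"], kws))
        triples.append(
            ExtendedTriple(author, "affiliated_with", record["institution"], kws)
        )
    for a, b in combinations(authors, 2):
        triples.append(ExtendedTriple(a, "cooperates_with", b, kws))
    return triples


# Made-up record used only to exercise the sketch.
record = {
    "title": "Query relaxation over RDF",
    "authors": ["Li", "Wang", "Li"],
    "keywords": ["RDF", "the", "SPARQL", "rdf"],
    "institution": "Example University",
}
triples = build_triples(record)
for t in triples:
    print(t)
```

In this sketch every triple derived from a record carries the same cleaned keyword tuple, which is one simple way to realize the keyword extension of the triple pattern; a real system would then load the triples into a store such as Jena TDB.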
Claims (3)
1. An information correlation analysis method based on a knowledge graph, characterized by comprising the following steps:
Step 1: data preprocessing; the downloaded information data TXT documents are parsed, fields such as the title, author, keywords, institution, and date of each information record are extracted, and stop words and duplicate entries are removed;
Step 2: constructing a triple information knowledge base; relationships are constructed among these entities, including the authorship relationship between an information record and its author, the affiliation relationship between an author and an institution, and the cooperation relationship between different authors; these entities and relationships are combined to form triple data, which are stored in a triple database; in the extended triple pattern, the keywords of the information data are used as the extension of the triple pattern;
Step 3: using the IDF and information entropy weighting method, the weight of each triple and of its keywords is calculated and stored in the database;
Step 4: the similar triples of each triple are calculated with the triple similarity formula and sorted by similarity;
Step 5: the RDF triples are stored efficiently; basic SPARQL query operations are implemented with the API provided by Jena TDB, and query expansion is performed according to the query method based on triple similarity;
Step 6: query samples are generated, the search result set is ranked according to the weights of the triples and keywords, and the top-k results are returned.
2. The information correlation analysis method based on a knowledge graph according to claim 1, characterized in that the IDF and information entropy weighting method described in step 3 comprises the following steps:
Step 1: the IDF value of each triple is calculated; the IDF value is given by the following formula:
[formula not reproduced in the source text]
In the above formula, e denotes an entity or relation in an extended triple, n denotes the number of triples in the RDF data set, and F(e) denotes the number of times e appears in the RDF data set; let IDF(t) be the IDF value of a triple t = (s, p, o); according to the characteristics of the triple structure, the IDF value of each triple is calculated as follows, where γ is a regulatory factor, 0 < γ < 1:
IDF(t) = γ × (IDF(s) + IDF(p) + IDF(o))
Step 2: the information entropy H(e) of each triple is calculated by the following formula:
[formula not reproduced in the source text]
where e denotes an entity or relation in an extended triple, and k indexes the data categories; D_1(e) denotes the frequency with which e appears in the data set, D_2(e^(k)) denotes the frequency with which e appears in the k-th data category, and |C| denotes the total number of data categories; for each triple t = (s, p, o), the information entropy of the triple is calculated as follows:
H(t) = η × (H(s) + H(p) + H(o))
where η is a regulatory factor, 0 < η < 1; the weight w(t) of each triple is determined jointly by its IDF value and its information entropy H, as shown in the following formula, where |N| denotes the total number of triples in the data set:
[formula not reproduced in the source text]
Step 3: keyword weights are calculated similarly to triple weights, by calculating the IDF value and the information entropy of each keyword separately; first, the triple-keyword IDF is calculated by the following formula:
[formula not reproduced in the source text]
Let IDF(v_i | e) be the IDF value of keyword v_i in the keyword group v = (v_1, v_2, …, v_n) belonging to triple t = (s, p, o), where e denotes the head entity s or the tail entity o of triple t and does not include the relation between them; F(e, v_i) denotes the number of times the head entity s or tail entity o of the triple co-occurs with keyword v_i in the data set; let IDF(v_i | t) denote the triple-keyword IDF value, calculated by the following formula, in which the coefficient is a regulatory factor:
[formula not reproduced in the source text]
Step 4: the information entropy H(v_i | e) of a keyword is calculated by the following formula:
[formula not reproduced in the source text]
where e denotes only the head or tail entity and does not include the relation between them; k indexes the data categories, D_1(v_i, e) denotes the number of times keyword v_i co-occurs with e in the data set, D_2(v_i, e^(k)) denotes the number of times keyword v_i co-occurs with e in the k-th data category, and |C| denotes the total number of data categories; for a triple t and one of its keywords v_i, the information entropy is calculated as follows:
H(v_i | t) = θ × (H(v_i | s) + H(v_i | o))
where θ is a regulatory factor, 0 < θ < 1; the weight w(v_i | t) of each triple-keyword pair consists of two parts, the IDF value and the information entropy H, as shown in the following formula:
[formula not reproduced in the source text]
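The triple-level quantities in claim 2 can be illustrated with a short Python sketch. Only the two aggregation formulas, IDF(t) = γ × (IDF(s) + IDF(p) + IDF(o)) and H(t) = η × (H(s) + H(p) + H(o)), are taken from the text. The per-element forms used here are assumptions, since the original formulas are not reproduced: IDF(e) is taken as the common log(n / F(e)), and H(e) as the Shannon entropy of the distribution D_2(e^(k)) / D_1(e) over categories. The sample data, the category labels, and the default values γ = η = 0.5 are made up, and the final combination into w(t) is deliberately left out.

```python
import math
from collections import Counter


def idf_element(e, freq, n):
    """Assumed form IDF(e) = log(n / F(e)); the source formula is not reproduced."""
    return math.log(n / freq[e])


def entropy_element(e, class_freq):
    """Assumed Shannon entropy over categories, with p_k = D2(e^(k)) / D1(e)."""
    counts = [c for c in class_freq[e].values() if c > 0]
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts)


def idf_triple(t, freq, n, gamma=0.5):
    # From the text: IDF(t) = gamma * (IDF(s) + IDF(p) + IDF(o)), 0 < gamma < 1
    return gamma * sum(idf_element(e, freq, n) for e in t)


def entropy_triple(t, class_freq, eta=0.5):
    # From the text: H(t) = eta * (H(s) + H(p) + H(o)), 0 < eta < 1
    return eta * sum(entropy_element(e, class_freq) for e in t)


# Made-up data set: each triple is labelled with one data category.
data = [
    (("Li", "wrote", "Paper A"), "databases"),
    (("Li", "wrote", "Paper B"), "semantic web"),
    (("Wang", "wrote", "Paper C"), "databases"),
    (("Wang", "wrote", "Paper D"), "databases"),
]
n = len(data)                                      # number of triples
freq = Counter(e for t, _ in data for e in t)      # F(e)
class_freq = {}                                    # per-category counts of e
for t, category in data:
    for e in t:
        class_freq.setdefault(e, Counter())[category] += 1

t = ("Li", "wrote", "Paper A")
print(idf_triple(t, freq, n), entropy_triple(t, class_freq))
```

Under these assumptions a relation that occurs in every triple, such as `wrote` above, contributes an IDF of zero, while a rarely occurring entity contributes most of the IDF value of its triple. The keyword quantities IDF(v_i | t) and H(v_i | t) could be sketched the same way by counting co-occurrences of a keyword with the head and tail entities only.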
3. The information correlation analysis method based on a knowledge graph according to claim 1, characterized in that the query method based on triple similarity described in step 5 comprises the following steps:
Step 1: a measurement method based on triple similarity; let t_1 and t_2 be triples; the similarity of t_1 and t_2 is calculated by the following formula:
[formula not reproduced in the source text]
Step 2: a method of generating effective queries; for an original query composed of triples, the similarity of each triple in the original query is first calculated with the above formula, yielding a group of similar triples, which are sorted by similarity to form a similar-triple queue; approximate triples are then chosen from the similar-triple queue of each triple, and relaxed queries are constructed by replacing triples in the original query;
Step 3: the similarity between each subquery and the original query is calculated by the following formula, the subqueries are sorted by similarity, and the execution order of the queries is determined by this ordering; the original query is executed first, and the effective queries are then executed in descending order of similarity:
[formula not reproduced in the source text]
Step 4: a threshold is set on the semantic similarity between the original query and an effective query; if the similarity between the two queries is below this threshold, the effective query is not derived.
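The relaxation procedure of claim 3 can be sketched in Python as follows. This is only an illustration under stated assumptions: the triple similarity and query similarity formulas are not reproduced in the text, so `triple_similarity` (fraction of matching positions) and the query similarity (mean of per-triple similarities) are hypothetical stand-ins, and the function names, sample query, and candidate triples are invented for the example.

```python
def triple_similarity(t1, t2):
    """Hypothetical stand-in: fraction of positions (s, p, o) that match exactly."""
    return sum(a == b for a, b in zip(t1, t2)) / 3


def similar_queue(pattern, candidates):
    """Candidates sorted by descending similarity to the pattern, excluding itself."""
    scored = [(triple_similarity(pattern, c), c) for c in candidates if c != pattern]
    return sorted(scored, key=lambda item: item[0], reverse=True)


def relax(query, candidates, threshold):
    """Build an execution plan: the original query first, then relaxed queries.

    Each relaxed query replaces one triple of the original query with a similar
    triple. Relaxed queries whose similarity to the original query falls below
    the threshold are not derived.
    """
    relaxed = []
    for i, pattern in enumerate(query):
        for sim, cand in similar_queue(pattern, candidates):
            new_query = query[:i] + [cand] + query[i + 1:]
            # Assumed query similarity: mean of per-triple similarities, where
            # unchanged triples count as 1.0 and the replaced triple as sim.
            q_sim = (len(query) - 1 + sim) / len(query)
            if q_sim >= threshold:
                relaxed.append((q_sim, new_query))
    relaxed.sort(key=lambda item: item[0], reverse=True)
    return [(1.0, query)] + relaxed


# Invented query and candidate triples; variables are plain strings here.
query = [("?x", "wrote", "?paper"), ("?x", "affiliated_with", "HEU")]
candidates = [
    ("?x", "affiliated_with", "HIT"),
    ("?x", "cooperates_with", "?y"),
    ("?z", "cites", "?w"),
]
plan = relax(query, candidates, threshold=0.7)
for similarity, q in plan:
    print(round(similarity, 3), q)
```

Sorting the plan by similarity means results closest to the user's original intent are retrieved first, and the threshold bounds how far the relaxed queries may drift from the original query.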
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810519637.XA CN108846029B (en) | 2018-05-28 | 2018-05-28 | Information correlation analysis method based on knowledge graph |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810519637.XA CN108846029B (en) | 2018-05-28 | 2018-05-28 | Information correlation analysis method based on knowledge graph |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108846029A (en) | 2018-11-20 |
CN108846029B (en) | 2021-05-25 |
Family
ID=64213596
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810519637.XA Expired - Fee Related CN108846029B (en) | 2018-05-28 | 2018-05-28 | Information correlation analysis method based on knowledge graph |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108846029B (en) |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109508389A (en) * | 2018-12-19 | 2019-03-22 | 哈尔滨工程大学 | A kind of personnel's social relationships map visualization accelerated method |
CN109582803A (en) * | 2018-11-30 | 2019-04-05 | 广东电网有限责任公司 | The construction method and system of competitive intelligence database |
CN109710621A (en) * | 2019-01-16 | 2019-05-03 | 福州大学 | In conjunction with the keyword search KSANEW algorithm of semantic category node and side right weight |
CN110082116A (en) * | 2019-03-18 | 2019-08-02 | 深圳市元征科技股份有限公司 | A kind of evaluation method, evaluating apparatus and the storage medium of vehicular four wheels location data |
CN110083732A (en) * | 2019-03-12 | 2019-08-02 | 浙江大华技术股份有限公司 | Picture retrieval method, device and computer storage medium |
CN110222240A (en) * | 2019-05-24 | 2019-09-10 | 华中科技大学 | A kind of space RDF data keyword query method based on summary figure |
CN110457486A (en) * | 2019-07-05 | 2019-11-15 | 中国人民解放军战略支援部队信息工程大学 | The people entities alignment schemes and device of knowledge based map |
CN110765233A (en) * | 2019-11-11 | 2020-02-07 | 中国人民解放军军事科学院评估论证研究中心 | Intelligent information retrieval service system based on deep mining and knowledge management technology |
CN111753055A (en) * | 2020-06-28 | 2020-10-09 | 中国银行股份有限公司 | Automatic client question and answer prompting method and device |
CN113495955A (en) * | 2021-07-08 | 2021-10-12 | 北京明略软件系统有限公司 | Expert pushing method, system, equipment and storage medium for document |
CN113901452A (en) * | 2021-09-30 | 2022-01-07 | 中国电子科技集团公司第十五研究所 | Sub-graph fuzzy matching security event identification method based on information entropy |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2003030204A (en) * | 2001-07-17 | 2003-01-31 | Takami Yasuda | Server for providing video contents, device and method for preparing file for video contents retrieval, computer program and device and method for supporting video clip preparation |
CN104166670A (en) * | 2014-06-17 | 2014-11-26 | 青岛农业大学 | Information inquiry method based on semantic network |
CN104765779A (en) * | 2015-03-20 | 2015-07-08 | 浙江大学 | Patent document inquiry extension method based on YAGO2s |
Non-Patent Citations (1)
Title |
---|
李大振: "基于语义相似度的RDF本体查询松弛方法研究", 《中国优秀硕士学位论文全文数据库》 * |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109582803A (en) * | 2018-11-30 | 2019-04-05 | 广东电网有限责任公司 | The construction method and system of competitive intelligence database |
CN109508389A (en) * | 2018-12-19 | 2019-03-22 | 哈尔滨工程大学 | A kind of personnel's social relationships map visualization accelerated method |
CN109508389B (en) * | 2018-12-19 | 2021-05-28 | 哈尔滨工程大学 | Visual accelerating method for personnel social relationship map |
CN109710621B (en) * | 2019-01-16 | 2022-06-21 | 福州大学 | Keyword search KSANEW method combining semantic nodes and edge weights |
CN109710621A (en) * | 2019-01-16 | 2019-05-03 | 福州大学 | In conjunction with the keyword search KSANEW algorithm of semantic category node and side right weight |
CN110083732A (en) * | 2019-03-12 | 2019-08-02 | 浙江大华技术股份有限公司 | Picture retrieval method, device and computer storage medium |
CN110083732B (en) * | 2019-03-12 | 2021-08-31 | 浙江大华技术股份有限公司 | Picture retrieval method and device and computer storage medium |
CN110082116A (en) * | 2019-03-18 | 2019-08-02 | 深圳市元征科技股份有限公司 | A kind of evaluation method, evaluating apparatus and the storage medium of vehicular four wheels location data |
CN110222240A (en) * | 2019-05-24 | 2019-09-10 | 华中科技大学 | A kind of space RDF data keyword query method based on summary figure |
CN110222240B (en) * | 2019-05-24 | 2021-03-26 | 华中科技大学 | Abstract graph-based space RDF data keyword query method |
CN110457486A (en) * | 2019-07-05 | 2019-11-15 | 中国人民解放军战略支援部队信息工程大学 | The people entities alignment schemes and device of knowledge based map |
CN110765233A (en) * | 2019-11-11 | 2020-02-07 | 中国人民解放军军事科学院评估论证研究中心 | Intelligent information retrieval service system based on deep mining and knowledge management technology |
CN111753055A (en) * | 2020-06-28 | 2020-10-09 | 中国银行股份有限公司 | Automatic client question and answer prompting method and device |
CN111753055B (en) * | 2020-06-28 | 2024-01-26 | 中国银行股份有限公司 | Automatic prompt method and device for customer questions and answers |
CN113495955A (en) * | 2021-07-08 | 2021-10-12 | 北京明略软件系统有限公司 | Expert pushing method, system, equipment and storage medium for document |
CN113901452A (en) * | 2021-09-30 | 2022-01-07 | 中国电子科技集团公司第十五研究所 | Sub-graph fuzzy matching security event identification method based on information entropy |
CN113901452B (en) * | 2021-09-30 | 2022-05-17 | 中国电子科技集团公司第十五研究所 | Sub-graph fuzzy matching security event identification method based on information entropy |
Also Published As
Publication number | Publication date |
---|---|
CN108846029B (en) | 2021-05-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108846029A (en) | Information correlation analysis method based on knowledge graph | |
Batsakis et al. | Improving the performance of focused web crawlers | |
Wang et al. | Q2semantic: A lightweight keyword interface to semantic search | |
Bao et al. | Towards an effective XML keyword search | |
Ganti et al. | Keyword++ a framework to improve keyword search over entity databases | |
Alwan et al. | A survey of schema matching research using database schemas and instances | |
Li et al. | Subjective databases | |
CN102043812A (en) | Method and system for retrieving medical information | |
CN102890711A (en) | Retrieval ordering method and system | |
Amolochitis et al. | A heuristic hierarchical scheme for academic search and retrieval | |
Zhang et al. | A graph based document retrieval method | |
Nargesian et al. | Data lake organization | |
Hoang et al. | The state of the art of ontology-based query systems: A comparison of existing approaches | |
Li et al. | Processing xml keyword search by constructing effective structured queries | |
Kanavos et al. | Ranking web search results exploiting wikipedia | |
Wang et al. | Research on discovering deep web entries | |
Nargesian et al. | Optimizing organizations for navigating data lakes | |
Liu et al. | A query suggestion method based on random walk and topic concepts | |
Agrawal et al. | Search Engine Results Improvement--A Review | |
Elbassuoni | Effective searching of RDF knowledge bases | |
Li et al. | A structure-based approach of keyword querying for fuzzy XML data | |
Song et al. | Discussions on subgraph ranking for keyworded search | |
Peng et al. | On the marriage of SPARQL and keywords | |
Ganta et al. | Search engine optimization through spanning forest generation algorithm | |
Nargesian et al. | Data lake organization |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
TR01 | Transfer of patent right | Effective date of registration: 20220708; Address after: 100000 Room 401, block D, No. 8, Tongji South Road, Beijing Economic and Technological Development Zone, Daxing District, Beijing; Patentee after: BEIJING SAIXI TECHNOLOGY DEVELOPMENT Co.,Ltd.; Address before: 150001 Intellectual Property Office, Harbin Engineering University science and technology office, 145 Nantong Avenue, Nangang District, Harbin, Heilongjiang; Patentee before: HARBIN ENGINEERING University |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20210525 |