CN107066589A

CN107066589A - Ranking method and device using entity semantics and term frequency based on comprehensive knowledge

Info

Publication number: CN107066589A
Application number: CN201710252110.0A
Authority: CN
Inventors: 靳小波; 王胜; 曹鹤玲; 肖乐; 费选
Original assignee: Henan University of Technology
Current assignee: Henan University of Technology
Priority date: 2017-04-17
Filing date: 2017-04-17
Publication date: 2017-08-18
Anticipated expiration: 2037-04-17
Also published as: CN107066589B

Abstract

The present invention relates to a ranking method and device using entity semantics and term frequency based on comprehensive knowledge. After domain knowledge has been crawled to extend the entity objects, several effective features are designed, including term-frequency features and semantic features, and a learning-to-rank method is used to predict the relevance between a query and an entity. By making full use of low-level term-frequency features and high-level entity semantic features, the invention better captures the relevance between queries and entities and improves retrieval performance: entity search results are more accurate, which in turn improves user satisfaction with search.

Description

Ranking method and device using entity semantics and term frequency based on comprehensive knowledge

Technical field

The invention belongs to the technical field of entity search, and in particular relates to a ranking method and device using entity semantics and term frequency based on comprehensive knowledge.

Background

"Keyword search", the mainstream technique used by current search engines, is an "existence search": it returns to the user a list of web pages that contain the keywords. The user usually has to browse these pages and filter out a large amount of useless information before finding the result actually wanted. This process is costly in terms of information consumption and considerably degrades the user experience; users would rather "get the answer directly". For example, for the query "Who is Barack Obama's wife", the result the user wants is the concise data entry "Michelle Obama" rather than a large number of web pages. This kind of search is entity search. Its distinguishing feature is that it "gives the answer directly": it is concerned with "objects", which may belong to many different categories, such as people, films, companies and novels. For example, the query "films starring Tom Hanks" expects a list of entities of the category "film".

Traditional entity search falls into three classes: web-based question answering, web-based information extraction, and type-annotated search. Web-based question answering finds the answer to a particular question by mining the diversity of web pages. It needs to search for information of a certain type near certain keywords and to verify further evidence to determine the final answer. Web-based information extraction tries to find all <query word, entity> pairs; it needs to record a large amount of context and to collect matching statistics. Type-annotated search aims to find information of a particular type; it needs to match certain adjacency patterns between keywords and type words and then accumulates all matching information to form the final ranking score.

Deep learning methods use convolutional neural networks to embed sentences into a low-dimensional space while preserving the syntactic and semantic relations between them. However, they use only the meaning of the entity itself and do not consider the inherent meaning of the entity, so the resulting ranking models are significantly biased.

Previous entity ranking methods either focus on the co-occurrence of query and entity or measure the relation between query and entity directly on the basis of specific model assumptions. Co-occurrence features, however, are too weak to represent the relation between a query and an entity. Moreover, these methods seldom consider the semantic relation between query and entity, although the semantic relation is precisely what captures the need behind the user's query words. Semantic relations strengthen the ability of full-text document retrieval to extract and process semantic information, and in particular improve its handling of semantic ambiguity and semantic expansion.

Summary of the invention

The object of the invention is to provide a ranking method and device using entity semantics and term frequency based on comprehensive knowledge, in order to solve the prior-art problem that the predicted relevance between queries and entities is inaccurate.

To solve the above technical problem, the technical solution of the invention is as follows:

The ranking method of the present invention, using entity semantics and term frequency based on comprehensive knowledge, comprises the following steps:

1) collecting description information about entities from external resources and extending the entities;

2) extracting data streams from the query description and the entity description;

3) performing word segmentation on the data streams to obtain a word stream;

4) extracting term-frequency features and semantic features of the word stream and feeding the extracted features together as input to a learning-to-rank method, to obtain a list of entities ranked by the similarity between query and entity.

Further, the description information from external resources is collected using multithreading, a crawler proxy pool, cross-referencing of multiple search engines, and vertical-website crawling.

Further, the data stream is the title, the body text, or the combination of title and body text of the query and the entity.

Further, the word segmentation includes Chinese word segmentation and 2-gram segmentation.

Further, the term-frequency features include TF-IDF features, BM25 features and LMIR features.

Further, the semantic features include the following features of the query and the entity obtained using word2vec: the sum-of-similarity feature, the sum-of-weighted-similarity feature, the max-of-similarity feature and the max-of-weighted-similarity feature.

Further, the learning-to-rank method is a pointwise learning-to-rank method.

Further, the classifiers used in the pointwise learning-to-rank method include AdaBoost, random forest and ExtraTree.

The ranking device of the present invention, using entity semantics and term frequency based on comprehensive knowledge, comprises the following modules:

a module for collecting description information from external resources and extending the entities;

a module for extracting data streams from the query description and the entity description;

a module for performing word segmentation on the data streams to obtain a word stream;

a module for extracting term-frequency features and semantic features of the word stream and feeding the extracted features together as input to a learning-to-rank method, to obtain a list of entities ranked by the similarity between query and entity.

Further, the description information about entities from external resources is collected using multithreading, a crawler proxy pool, cross-referencing of multiple search engines, and vertical-website crawling.

Beneficial effects of the present invention:

After domain knowledge has been crawled to extend the entity objects, the invention designs several effective features, including term-frequency features and semantic features, and uses a learning-to-rank method to predict the relevance between a query and an entity. By making full use of low-level text term-frequency features and high-level entity semantic features, the invention better captures the relevance between queries and entities and improves retrieval performance: entity search results are more accurate, which in turn improves user satisfaction with search.

Brief description of the drawings

Fig. 1 is a flow chart of the method of the present invention;

Fig. 2 shows an example ranking result based on the present invention.

Detailed description

To make the objects, technical solutions and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments; the embodiments of the invention are not limited to those described.

Embodiment of the ranking method using entity semantics and term frequency based on comprehensive knowledge:

A query is a set of keywords or key phrases that describes the user's need. An entity is a distinct, independent individual, such as a person, a restaurant, a film or a TV series. The goal is to find entities that satisfy a specific description, such as a review of a certain film or an impression of the environment of a certain restaurant. The invention trains a ranking model and uses it to predict the order of candidate entities.

First, entity descriptions and user reviews are collected from Baidu, Douban and Dianping to extend the entities. The techniques applied include multithreaded design, a crawler proxy pool, cross-referencing of multiple search engines, and vertical-website crawling. Multithreading runs several crawling threads in parallel, which greatly speeds up crawling. The crawler proxy pool avoids the anti-crawler blocking triggered by frequent requests. Cross-referencing between multiple search engines yields a more accurate description of each entity. Vertical crawling makes the crawled information more targeted and also improves crawling efficiency.
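The combination of parallel crawling threads and a proxy pool can be sketched as follows. This is only an illustration under stated assumptions: `ProxyPool`, `crawl_descriptions` and the `fetch` callable are hypothetical names that do not appear in the patent, and the stub `fake_fetch` stands in for a real HTTP request so that the sketch runs without network access.

```python
import itertools
import threading
from concurrent.futures import ThreadPoolExecutor


class ProxyPool:
    """Thread-safe round-robin pool of proxy addresses (hypothetical helper)."""

    def __init__(self, proxies):
        self._cycle = itertools.cycle(proxies)
        self._lock = threading.Lock()

    def next_proxy(self):
        # The lock keeps concurrent threads from advancing the cycle at once.
        with self._lock:
            return next(self._cycle)


def crawl_descriptions(entity_names, fetch, proxies, max_workers=4):
    """Fetch one description per entity in parallel, rotating proxies.

    `fetch(name, proxy)` is supplied by the caller and returns the text
    found for `name`; it is an assumption of this sketch.
    """
    pool = ProxyPool(proxies)

    def task(name):
        return name, fetch(name, pool.next_proxy())

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return dict(executor.map(task, entity_names))


# Stub fetcher used instead of a real request.
def fake_fetch(name, proxy):
    return f"description of {name} via {proxy}"


result = crawl_descriptions(
    ["film A", "film B", "film C"], fake_fetch, ["proxy-1", "proxy-2"]
)
```

Which proxy serves which entity depends on thread scheduling, so only the set of crawled entities is deterministic here.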

Data streams are then extracted from the query description and the entity description. A data stream may be the title, the body text, or the combination of title and body text. Any of these three data streams can be used to obtain the final learning-to-rank result. The processing below is described for the data stream that combines body text and title.

Next, word segmentation is applied to the query and the entity. Two methods are used: Chinese word segmentation and 2-gram segmentation. Chinese word segmentation splits a character sequence into basic Chinese word units. It is a key step in Chinese language processing, but the performance of the algorithm depends on the domain dictionary and the corpus used. Most segmentation algorithms cannot handle polysemous words and out-of-vocabulary words well. Since most Chinese phrases are built from two characters, a 2-gram representation is added as a supplement to word segmentation. The word streams produced by the two segmentation methods are merged into a single word stream.
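Merging the two segmentations into one word stream can be sketched as below. The patent does not name a particular Chinese segmenter, so a simple forward maximum matching against a small dictionary is used here purely as a stand-in; `forward_max_match`, `char_bigrams`, `word_stream` and the toy vocabulary are hypothetical.

```python
def forward_max_match(text, vocabulary, max_len=4):
    """Dictionary-based segmentation: take the longest known word at each position."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted, so the loop terminates.
            if size == 1 or piece in vocabulary:
                words.append(piece)
                i += size
                break
    return words


def char_bigrams(text):
    """2-gram segmentation: every pair of adjacent characters."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def word_stream(text, vocabulary):
    """Merge the output of both segmentation methods into one word stream."""
    return forward_max_match(text, vocabulary) + char_bigrams(text)


stream = word_stream("锦衣卫电影", {"锦衣卫", "电影"})
```

The bigrams recover units such as "锦衣" even when the dictionary lacks them, which is the role the 2-gram supplement plays for out-of-vocabulary words.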

Next, several effective features are designed and used together as input to the subsequent ranking algorithm, so that the similarity probability of each query-entity pair can be predicted more accurately and accurate entity ranking or recommendation can be achieved. Specifically, this embodiment uses low-level text features, contextual features and high-level semantic features.

The text features include TF-IDF features, BM25 features and LMIR features. LMIR in turn covers three smoothing methods: LMIR.JM, LMIR.ABS and LMIR.DIR. LMIR is based on a uniform prior assumption and computes features related to the language model; the language model reduces to computing the conditional probability $p(q \mid d)$:

$$p(q \mid d) = \prod_{i} p(q_i \mid d)$$

These features are described in detail below.

1) In web content mining, TF-IDF (Term Frequency-Inverse Document Frequency) is a weighting technique commonly used in information retrieval and data mining to assess how important a word is to a document collection or to one document in a corpus.

TF-IDF is in fact TF*IDF. Term frequency (TF) is the number of times a given word occurs in a given document. The more often a word occurs in a data stream, the more it contributes to describing the meaning of that data stream. For a word $t_i$ in a particular data stream, its importance can be expressed as:

$$tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$

where $tf_{i,j}$ is the term-frequency importance of word $t_i$ in the data stream, $n_{i,j}$ is the number of times the word occurs in data stream $d_j$, and $\sum_{k} n_{k,j}$ is the total number of occurrences of all words in data stream $d_j$.

Inverse document frequency (IDF) measures the general importance of a word: the more documents a word occurs in, the less it should contribute to any one document. The IDF of a particular word is obtained by dividing the total number of documents by the number of documents containing the word and taking the logarithm of the quotient:

$$idf_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$

where $idf_i$ is the inverse document frequency of $t_i$, $|D|$ is the total number of data streams in the corpus, and $|\{j : t_i \in d_j\}|$ is the number of data streams containing word $t_i$, that is, the number of data streams with $n_{i,j} \neq 0$. In practice $n_{i,j} \neq 0$ cannot be guaranteed, so smoothing is needed. In the implementation, 0.5 is added to both the numerator and the denominator to make the system more robust, and the formula becomes:

$$idf_i = \log \frac{|D| + 0.5}{|\{j : t_i \in d_j\}| + 0.5}$$

TF-IDF is then:

$$tfidf_{i,j} = tf_{i,j} \cdot idf_i$$

A high term frequency within a particular document, combined with a low document frequency of the word across the whole collection, produces a high TF-IDF weight. TF-IDF therefore tends to filter out common words and retain important ones.
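As a minimal sketch of the computation described above, the following functions compute the term frequency as a word's share of all word occurrences in a data stream, the inverse document frequency with 0.5 added to numerator and denominator, and their product. The function names and the toy data streams are illustrative and not part of the patent.

```python
import math
from collections import Counter


def term_frequency(term, stream):
    """Occurrences of `term` divided by the total number of words in the stream."""
    if not stream:
        return 0.0
    return Counter(stream)[term] / len(stream)


def smoothed_idf(term, streams):
    """log((|D| + 0.5) / (number of streams containing term + 0.5))."""
    containing = sum(1 for s in streams if term in s)
    return math.log((len(streams) + 0.5) / (containing + 0.5))


def tf_idf(term, stream, streams):
    return term_frequency(term, stream) * smoothed_idf(term, streams)


# Toy corpus of three word streams.
streams = [["a", "b", "a"], ["b", "c"], ["c", "d"]]
score_a = tf_idf("a", streams[0], streams)  # frequent here, rare elsewhere
score_b = tf_idf("b", streams[0], streams)  # less frequent here, more common overall
```

With this smoothing, a word that occurs in every data stream receives an IDF of exactly zero, and a word that occurs in none still yields a finite value.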

2) The BM25 feature extraction method was proposed by Robertson et al. and is a typical probabilistic retrieval model. The BM25 model assumes that all terms are mutually independent. In fact, however, the terms in the same document are not isolated from each other: they are semantically related to a greater or lesser degree, and this contextual relation affects the relevance of the terms within the document to some extent. BM25 is an empirical ranking function, calculated as follows:

$$\mathrm{BM25}(q, d) = \sum_{i} \mathrm{idf}(q_i) \cdot \frac{f(q_i, d)\,(k_1 + 1)}{f(q_i, d) + k_1 \left(1 - b + b\,\frac{|d|}{\mathrm{avg}(d)}\right)} \cdot \frac{(k_3 + 1)\, f(q_i, q)}{k_3 + f(q_i, q)}$$

where the query consists of the query words $q_1, \ldots, q_i$; $\mathrm{idf}(q_i)$ is the IDF value of the query word; $f(q_i, d)$ is the number of times $q_i$ occurs in document $d$; $f(q_i, q)$ is the number of times $q_i$ occurs in query $q$; $|d|$ is the total number of words in document $d$; and $\mathrm{avg}(d)$ is the average document length over the whole data set. The parameters are set empirically to $k_1 = 2.0$, $k_3 = 0$, $b = 0.75$.
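A BM25 score with the parameter values given above can be sketched as follows. The sketch uses the common BM25 form with a document-side saturation factor governed by k1 and b and a query-side factor governed by k3, and it reuses the 0.5-smoothed IDF from the TF-IDF section; both choices are assumptions of this example, as are the function name and the toy corpus.

```python
import math
from collections import Counter


def bm25_score(query, doc, corpus, k1=2.0, k3=0.0, b=0.75):
    """Sum of per-query-word BM25 contributions for one document."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    doc_tf = Counter(doc)
    query_tf = Counter(query)
    score = 0.0
    for term in query:
        df = sum(1 for d in corpus if term in d)
        idf = math.log((n_docs + 0.5) / (df + 0.5))  # assumed smoothed IDF
        f = doc_tf[term]
        # Document-side factor: saturates with f and normalises by length.
        doc_part = f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avg_len))
        # Query-side factor: equals 1 when k3 = 0.
        query_part = (k3 + 1) * query_tf[term] / (k3 + query_tf[term])
        score += idf * doc_part * query_part
    return score


corpus = [
    ["jinyiwei", "film", "action"],
    ["romance", "film"],
    ["jinyiwei", "history", "drama", "film"],
]
query = ["jinyiwei", "action"]
scores = [bm25_score(query, d, corpus) for d in corpus]
```

In this toy corpus the first document matches both query words and scores highest, while the second matches neither and scores zero.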

3) LMIR.JM is the Jelinek-Mercer method, which estimates $p(q_i \mid d)$ by linear interpolation between the maximum-likelihood estimate and the corpus model. It is a simple mixture model:

$$p(q_i \mid d) = (1 - \lambda)\, p_{ml}(q_i \mid d) + \lambda\, p(q_i \mid C)$$

where $\lambda = 0.1$ controls the influence of the corpus model, and $p_{ml}(q_i \mid d)$ and $p(q_i \mid C)$ are the frequencies of $q_i$ in document $d$ and in corpus $C$ respectively.

4) LMIR.DIR gives a Bayesian smoothed estimate of $p(q_i \mid d)$ based on a Dirichlet prior:

$$p(q_i \mid d) = \frac{f(q_i, d) + \mu\, p(q_i \mid C)}{|d| + \mu}$$

where the smoothing parameter $\mu$ is set to 2000.

5) LMIR.ABS is the absolute discounting method, which strikes a compromise between a word's document probability and its corpus probability by subtraction. It computes $p(q_i \mid d)$ as follows:

$$p(q_i \mid d) = \frac{\max\bigl(f(q_i, d) - \delta,\ 0\bigr)}{|d|} + \frac{\delta\, |d|_u}{|d|}\, p(q_i \mid C)$$

where $|d|_u$ is the number of distinct words in document $d$, and $\delta \in [0, 1]$ is the discount constant, set here to $\delta = 0.7$. $f(q_i, d)$ is the number of times query word $q_i$ occurs in document $d$, and $p(w \mid C)$ is the probability that word $w$ occurs in corpus $C$.
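The three smoothing methods can be compared side by side in a short sketch. It uses the standard Jelinek-Mercer, Dirichlet and absolute-discounting estimators with the parameter values stated above (λ = 0.1, μ = 2000, δ = 0.7), and scores a query by the sum of log probabilities of its words; the function names and toy corpus are illustrative, and the sketch assumes every query word occurs somewhere in the corpus so that no probability is zero.

```python
import math


def corpus_prob(term, corpus):
    """Relative frequency of `term` over all documents in the corpus."""
    total = sum(len(d) for d in corpus)
    return sum(d.count(term) for d in corpus) / total


def p_jm(term, doc, corpus, lam=0.1):
    """Jelinek-Mercer: linear interpolation of document and corpus models."""
    p_ml = doc.count(term) / len(doc)
    return (1 - lam) * p_ml + lam * corpus_prob(term, corpus)


def p_dir(term, doc, corpus, mu=2000):
    """Dirichlet prior smoothing."""
    return (doc.count(term) + mu * corpus_prob(term, corpus)) / (len(doc) + mu)


def p_abs(term, doc, corpus, delta=0.7):
    """Absolute discounting: subtract delta from each seen count."""
    unique = len(set(doc))
    discounted = max(doc.count(term) - delta, 0) / len(doc)
    return discounted + delta * unique / len(doc) * corpus_prob(term, corpus)


def log_likelihood(query, doc, corpus, smooth):
    """log p(q|d) as the sum of log p(q_i|d) under the chosen smoothing."""
    return sum(math.log(smooth(t, doc, corpus)) for t in query)


corpus = [["a", "b", "a"], ["b", "c"]]
query = ["a", "b"]
ll_doc0 = log_likelihood(query, corpus[0], corpus, p_jm)
ll_doc1 = log_likelihood(query, corpus[1], corpus, p_jm)
```

Each estimator remains a proper distribution over the corpus vocabulary, which is a quick sanity check on an implementation.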

For word-level semantic features, word2vec is commonly used to produce word embedding vectors: it trains a two-layer neural network on a large corpus and then outputs a high-dimensional word vector for each word. To establish a semantic similarity measure between sentences, the similarity between a word and a sentence is defined first.

Word2vec is a toolkit for obtaining word vectors that Google released as open source in 2013. It is simple and efficient: given a corpus, the optimized trained model can quickly and effectively represent a word in vector form.

The similarity between a query word $q_i$ and a sentence $s$ is defined as the maximum similarity between $q_i$ and the words $s_j$ in sentence $s$:

$$\mathrm{sim}(q_i, s) = \max_{s_j \in s}\ q_i^T s_j$$

where $q_i$ and $s_j$ are vectors normalized to length 1, both generated by the word2vec algorithm.

Arranging all query words $q_i \in q$ $(i = 1, 2, \ldots, m)$ into the matrix $Q = [q_1, q_2, \ldots, q_m]^T$ and all words $s_j \in s$ $(j = 1, 2, \ldots, n)$ into $S = [s_1, s_2, \ldots, s_n]^T$ gives:

$$R = Q S^T$$

where $R = [r_1, r_2, \ldots, r_m]^T$ and $r_i = S q_i$ holds the similarities between $q_i$ and every word of $s$. Then:

$$\mathrm{sim}(q_i, s) = \| r_i \|_\infty$$
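The matrix form can be checked with a small pure-Python sketch: rows of Q and S are unit-normalized word vectors, R holds all pairwise dot products, and the similarity of each query word to the sentence is the infinity norm of the corresponding row of R. The toy two-dimensional vectors stand in for word2vec outputs and the function names are illustrative.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_matrix(query_vecs, sentence_vecs):
    """R = Q S^T, where rows of Q and S are unit-length word vectors."""
    Q = [normalize(q) for q in query_vecs]
    S = [normalize(s) for s in sentence_vecs]
    return [[sum(a * b for a, b in zip(q, s)) for s in S] for q in Q]


def word_sentence_similarity(R):
    """sim(q_i, s) = ||r_i||_inf, the largest absolute entry of row i."""
    return [max(abs(x) for x in row) for row in R]


# Toy 2-dimensional vectors in place of real word2vec embeddings.
query_vecs = [[1.0, 0.0], [1.0, 1.0]]
sentence_vecs = [[1.0, 0.0], [0.0, 2.0]]
R = similarity_matrix(query_vecs, sentence_vecs)
sims = word_sentence_similarity(R)
```

The infinity norm takes absolute values, so it coincides with the maximum similarity whenever the entry of largest magnitude in a row is non-negative.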

Based on sum and max operations, four statistical semantic features can be defined: the sum of similarity (SS), the sum of weighted similarity (SWS), the max of similarity (MS) and the max of weighted similarity (MWS).

Once the similarity between the query and each sentence of the entity's description has been computed, the mean and the maximum of the similarity over all sentences are taken.
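The four statistics and the aggregation over sentences can be sketched as follows. The text does not state which weights the weighted variants use, so this sketch takes one weight per query word as an input; using, for example, IDF values as weights would be an assumption. The function names are illustrative.

```python
def semantic_features(sims, weights):
    """Sum and max statistics over per-query-word similarities for one sentence.

    `sims[i]` is sim(q_i, s); `weights[i]` is an assumed weight for q_i.
    """
    weighted = [w * s for w, s in zip(weights, sims)]
    return {
        "SS": sum(sims),       # sum of similarity
        "SWS": sum(weighted),  # sum of weighted similarity
        "MS": max(sims),       # max of similarity
        "MWS": max(weighted),  # max of weighted similarity
    }


def aggregate_over_sentences(values):
    """Mean and maximum of one feature across all sentences of an entity."""
    return sum(values) / len(values), max(values)


features = semantic_features([0.9, 0.5], [2.0, 1.0])
mean_ss, max_ss = aggregate_over_sentences([1.4, 0.6, 1.0])
```

Applying the aggregation to each of the four features gives eight entity-level values per query in this sketch.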

Finally, learning to rank is applied to the features above to find a decision function that can accurately predict the labels of unseen samples. Here the pointwise learning-to-rank approach is chosen. The extracted features are fed together as input to the learning-to-rank method, which yields a list of entities ranked by the similarity between query and entity. Ensemble methods are chosen as the pointwise ranking algorithm to predict the similarity probability of each query-entity pair. The classifier is selected mainly from AdaBoost, random forest and ExtraTree by cross-validation. For the pointwise ranking algorithm, the number of base estimators is drawn at random from the interval [100, 500] and the tree depth is drawn at random from {4, 6, 8, 10, 12}, in order to search for the best model parameters.
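The random draw of model parameters and the final pointwise ordering can be sketched without committing to a particular classifier library. Here `sample_hyperparameters` draws from the ranges stated above, and `rank_entities` sorts candidates by a relevance score supplied by any trained model; the dictionary keys, the function names and the hard-coded scores are hypothetical placeholders, not the output of a real model.

```python
import random


def sample_hyperparameters(rng):
    """Draw one candidate setting from the ranges given in the text."""
    return {
        "n_estimators": rng.randint(100, 500),      # inclusive on both ends
        "max_depth": rng.choice([4, 6, 8, 10, 12]),
    }


def rank_entities(entities, predict_relevance):
    """Pointwise ranking: score each entity independently, sort descending."""
    scored = [(predict_relevance(e), e) for e in entities]
    return [e for _, e in sorted(scored, key=lambda pair: pair[0], reverse=True)]


rng = random.Random(0)
candidates = [sample_hyperparameters(rng) for _ in range(5)]

# Placeholder scores standing in for predicted similarity probabilities.
scores = {"entity A": 0.2, "entity B": 0.9, "entity C": 0.5}
ranking = rank_entities(list(scores), scores.get)
```

In a full pipeline each sampled setting would be evaluated by cross-validation and the best one kept; only the sampling and sorting steps are shown here.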

The pointwise ranking model is simple and quick to train. Because relevant and irrelevant are relative notions, documents only need to be sorted by score from high to low; the score of each document does not have to be predicted precisely. In other embodiments, a pairwise or listwise ranking algorithm may also be chosen. Compared with pointwise algorithms, however, pairwise algorithms have more complex models and longer training times and require more efficient learning algorithms. Listwise algorithms face practical problems, for example that training data is hard to obtain: for a given query, annotators must rank all documents by relevance, which is time-consuming and laborious, so large amounts of training data cannot realistically be obtained.

A concrete example is given below.

As shown in Fig. 2, for the query "about the Jinyiwei (Embroidered Uniform Guard)", a ranking model is built on the training set from the descriptions and reviews of candidate films (entities) such as "Brotherhood of Blades" (绣春刀) and "14 Blades" (锦衣卫). The model predicts the similarity between each test-set query and each candidate entity, and the entities are sorted by similarity to obtain a ranking. Finally, the predicted ranking for every query in the test set is compared with the true ranking of the entities, and the difference is measured with mean average precision (MAP), which yields an evaluation value between 0 and 1. The larger the MAP, the more accurate the ranking model.
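For reference, mean average precision can be computed as in the sketch below, which uses the standard definition for binary relevance: the average precision of one query is the mean of the precision values at the ranks where relevant entities appear, and MAP is the mean over queries. The function names and toy rankings are illustrative.

```python
def average_precision(ranked, relevant):
    """Mean of precision@k taken at each rank k that holds a relevant item."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(rankings, relevant_sets):
    """Average of the per-query average precision values."""
    aps = [average_precision(r, rel) for r, rel in zip(rankings, relevant_sets)]
    return sum(aps) / len(aps)


rankings = [["a", "b", "c", "d"], ["x", "y"]]
relevant_sets = [{"a", "c"}, {"y"}]
map_value = mean_average_precision(rankings, relevant_sets)
```

For the first query the relevant items sit at ranks 1 and 3, giving (1/1 + 2/3) / 2; for the second the single relevant item sits at rank 2, giving 1/2.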

Embodiment of the ranking device using entity semantics and term frequency based on comprehensive knowledge:

The ranking device of the present invention comprises the following modules: a module for collecting description information about entities from external resources and extending the entities; a module for extracting data streams from the query description and the entity description; a module for performing word segmentation on the data streams to obtain a word stream; and a module for extracting term-frequency features and semantic features of the word stream and feeding the extracted features together as input to a learning-to-rank method, to obtain a list of entities ranked by the similarity between query and entity.

The device is in fact a computer-implemented solution based on the flow of the ranking method of the present invention, that is, a software architecture; the modules above are the processes or programs that correspond to the steps of the method. Since the method has been described clearly and completely above, the device is not described in further detail.

Although the content of the invention has been described in detail through the preferred embodiments above, the description should not be regarded as limiting the invention. Various modifications and substitutions will be apparent to those skilled in the art after reading the above. The scope of protection of the invention should therefore be defined by the appended claims.

Claims

1. A ranking method using entity semantics and term frequency based on comprehensive knowledge, characterized by comprising the following steps:

1) collecting description information about entities from external resources and extending the entities;

2) extracting data streams from the query description and the entity description;

3) performing word segmentation on the data streams to obtain a word stream;

4) extracting term-frequency features and semantic features of the word stream and feeding the extracted features together as input to a learning-to-rank method, to obtain a list of entities ranked by the similarity between query and entity.

2. The ranking method using entity semantics and term frequency based on comprehensive knowledge according to claim 1, characterized in that the description information about entities from external resources is collected using multithreading, a crawler proxy pool, cross-referencing of multiple search engines, and vertical-website crawling.

3. The ranking method using entity semantics and term frequency based on comprehensive knowledge according to claim 1, characterized in that the data stream is the title, the body text, or the combination of title and body text of the query and the entity.

4. The ranking method using entity semantics and term frequency based on comprehensive knowledge according to claim 1, characterized in that the word segmentation includes Chinese word segmentation and 2-gram segmentation.

5. The ranking method using entity semantics and term frequency based on comprehensive knowledge according to claim 1, characterized in that the term-frequency features include TF-IDF features, BM25 features and LMIR features.

6. The ranking method using entity semantics and term frequency based on comprehensive knowledge according to claim 1, characterized in that the semantic features include the following features of the query and the entity obtained using word2vec: the sum-of-similarity feature, the sum-of-weighted-similarity feature, the max-of-similarity feature and the max-of-weighted-similarity feature.

7. The ranking method using entity semantics and term frequency based on comprehensive knowledge according to claim 1, characterized in that the learning-to-rank method is a pointwise learning-to-rank method.

8. The ranking method using entity semantics and term frequency based on comprehensive knowledge according to claim 7, characterized in that the classifiers used in the pointwise learning-to-rank method include AdaBoost, random forest and ExtraTree.

9. A ranking device using entity semantics and term frequency based on comprehensive knowledge, characterized by comprising the following modules:

a module for collecting description information about entities from external resources and extending the entities;

a module for extracting data streams from the query description and the entity description;

a module for performing word segmentation on the data streams to obtain a word stream;

a module for extracting term-frequency features and semantic features of the word stream and feeding the extracted features together as input to a learning-to-rank method, to obtain a list of entities ranked by the similarity between query and entity.

10. The ranking device using entity semantics and term frequency based on comprehensive knowledge according to claim 9, characterized in that the description information about entities from external resources is collected using multithreading, a crawler proxy pool, cross-referencing of multiple search engines, and vertical-website crawling.