CN105550226B - A knowledge-base-based method for generating query facets - Google Patents
- Publication number
- CN105550226B (application CN201510888652.8A)
- Authority
- CN
- China
- Prior art keywords
- facet
- inquiry
- entity
- attribute
- search
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/242—Query formulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a knowledge-base-based method for generating query facets, comprising the following steps: 1) for a given query q, obtain the top T retrieval results from a search engine to form the query result set D; 2) obtain a series of initial query facets f based on the QDMiner algorithm, the initial query facets f forming the set F; 3) expand each initial query facet f; 4) filter the expanded initial query facets f using the retrieved documents to ensure the accuracy of the expansion results, and generate the final query facets from the expanded initial query facets f. The invention generates query facets using a knowledge base, which effectively overcomes the limitation that existing methods depend on the retrieval results. By expanding the initial facets with high-quality information from the knowledge base, facet terms that do not appear in the retrieval results, or that were not extracted, can be located accurately, improving the accuracy and coverage of the query facets.
Description
Technical field
The present invention relates to a knowledge-base-based method for generating query facets.
Background art
According to the "2013 Research Report on Chinese Internet Users' Search Behavior" published by the China Internet Network Information Center (CNNIC), as of the end of June 2013 China had 470 million search engine users and 324 million mobile search users. In the preceding six months, the share of users who had used a comprehensive search engine reached 98%. In the Internet era, search engines are thus the main entrance through which people reach the network and the main source from which they obtain online information.
At present, comprehensive search engines mainly present search results as a list of relevant documents sorted by relevance from high to low. For simple, navigational searches, such as a search for "Taobao official website", this meets the need. For complex, informational, or exploratory searches, however, this form of presentation is too thin: the user has to look through thousands of returned results and summarize the needed information, which is inefficient. In some cases the user's search intent is vague and hard to express precisely in one or two words, for example when searching for knowledge about a related field. In other cases the search may be exploratory, and the search engine needs to organize related content by category so that the user can find the desired information step by step; a search on a shopping site, for example, can offer restrictions on the brand, style, size, and so on of a product. For the former case, the mainstream solution today is search suggestion: as content is typed into the search box, the search engine suggests likely queries based on accumulated user search logs. For the latter case, current applications are mainly in vertical domains such as commodities and hotels. Query facets are an effective solution to these problems. A query facet can be regarded as a summary of the query made from a particular angle; for example, the facets of the query "Wang Fei" include her well-known songs, albums, friends, and awards received. Query facets extend the user's query intent and summarize potential query information. They not only help the user clarify the search intent but also suggest related content, so that the user can search in an exploratory way.
At present, methods for mining query facets depend on the document set returned by the search engine. They use a variety of manually defined patterns to extract lists of terms that appear in parallel in the documents, and generate the final query facets through clustering, ranking, and similar processing. Building on this, another scheme uses supervised learning and trains two models: one to judge whether a term belongs to a query facet, and one to judge whether two terms belong to the same query facet. Although both methods achieve good results, the correctness and accuracy of the results are affected by document quality. First, if the retrieved document set does not contain certain facets or terms, existing methods have no way to extract them. Second, even if the retrieval results contain the corresponding facet, existing extraction patterns cannot identify it accurately when it is not presented in list form. Finally, the extracted parallel lists may contain noise terms, and existing approaches cannot efficiently filter out all of the noise.
Therefore, how to solve the above problems is a technical problem that those skilled in the art urgently need to solve.
Summary of the invention
In view of the problems described in the background art, the purpose of the present invention is to provide a knowledge-base-based method for generating query facets. The application generates query facets using a knowledge base, which effectively overcomes the limitation that existing methods depend on the retrieval results. By expanding the initial facets with high-quality information from the knowledge base, facet terms that do not appear in the retrieval results, or that were not extracted, can be located accurately, improving the accuracy and coverage of the query facets.
The purpose of the present invention is achieved through the following technical solutions:
A knowledge-base-based method for generating query facets, the method comprising the following steps:
1) for a given query q, obtain the top T retrieval results from a search engine to form the query result set D;
2) obtain a series of initial query facets f based on the QDMiner algorithm, the initial query facets f forming the set F;
3) expand each initial query facet f;
4) filter the expanded initial query facets f using the retrieved documents to ensure the accuracy of the expansion results, and generate the final query facets from the expanded initial query facets f.
Further, the method in step 2) for obtaining the series of initial query facets f based on the QDMiner algorithm is specifically:
A. List extraction: extract original lists from the query result set D using multiple patterns, including text, HTML tags, and repeated regions;
B. List weighting: evaluate the importance of each original list based on the idea of tf-idf;
C. List clustering: group similar lists together with the WQT method to form query facets;
D. Query facet and term ranking: compute the importance of the different query facets and of the terms within each facet, then rank and output the final result, yielding the series of initial query facets f.
Further, the method in step 3) for expanding each initial query facet f is specifically as follows. Search engine queries are first divided into two kinds: entity-level queries and non-entity-level queries. For an entity-level query, obtain the entity in Freebase that corresponds to the query and obtain its attributes; if the overlap between the original facet and some attribute is high, use that attribute as the expansion of the original facet; if no such attribute can be found, fall back to the method for non-entity-level queries. For a non-entity-level query, find the smallest type in Freebase that contains the original facet, based on the idea of tf-idf; use Freebase to find the query-related attributes shared by the different terms of the original facet; use these attributes to further restrict the type; and return the entities contained in the restricted type as the expansion of the original facet.
Further, the specific method for obtaining the entity in Freebase that corresponds to the query is: use the Freebase Search API to search for the entities corresponding to the query, where the Search API mainly matches the query string against the names and synonyms of entities; then filter the returned entities.
Further, the method for filtering the returned entities is: for a Search API query Q that returns N entities [E1, E2, ..., EN], take an entity E among them, apply word segmentation to all of its synonyms and to the query Q, compute the ratio of the longest common substring of each synonym of E and the query Q to the original string, and take the maximum ratio over all synonyms as the string similarity score StrSim of E and Q. If the score is below the threshold R_StrSim, E is filtered out. The formula is:
where Alias(E) is the set of all synonyms of entity E and len denotes the length of a string. The threshold R_StrSim varies with the length of the query Q, and LCS(Q, a) computes the length of the longest common substring of the query Q and the synonym a:
where pow(x, y) computes x raised to the power y.
Further, the method in step 3) for expanding each initial query facet f is specifically: first find several types that contain the original facet f, score them with the tf-idf method, and select the highest-scoring type; use the Search API to find the entities corresponding to all terms in facet f; among all attributes of these entities, find the shared attributes that are related to the original query and use them to restrict the type; return all entities of the restricted type as the expansion of the original facet.
The present invention has the following positive technical effects:
1. Query facets are generated using a knowledge base
This application proposes generating query facets from the knowledge base and the retrieval results together, which exploits the high-quality, comprehensive information in the knowledge base while capturing the user's interests and focus through the retrieval results.
2. The association between a query and its query facets is determined using the knowledge base
For entity-level queries, the attributes of the entity can be matched against the query facets, which determines the relationship between the query and the query facets and ensures the accuracy of the expansion. For general queries, the scope of the query is limited by means of "type + restriction", which likewise ensures the precision of the expansion.
3. A multi-layer relational network is built from the knowledge base
The method considers not only the one-layer attributes (predicates) of each entity in Freebase but can also consider two-layer and even multi-layer attributes, so that more complex relationships can be described.
Description of the drawings
Fig. 1 is a schematic diagram of the three-layer structure of an entity;
Fig. 2 is a schematic diagram of the three-layer structure of "Feng Xiaogang";
Fig. 3 is a schematic diagram of the tree structure of a 2-layer attribute net;
Fig. 4 is a diagram of the types of a query facet;
Fig. 5 is a schematic diagram of the structure of a partial 2-layer attribute net.
Detailed description of the embodiments
The application is further described below with reference to the accompanying drawings.
This application provides a knowledge-base-based method for generating query facets, comprising the following steps:
1) for a given query q, obtain the top T retrieval results from a search engine to form the query result set D;
2) obtain a series of initial query facets f based on the QDMiner algorithm, the initial query facets f forming the set F;
3) expand each initial query facet f;
4) filter the expanded initial query facets f using the retrieved documents to ensure the accuracy of the expansion results, and generate the final query facets from the expanded initial query facets f.
Like Wikipedia, Freebase is a knowledge base whose content is contributed by users. All knowledge in Freebase is organized in a three-layer structure: domain (Domain) → type (Type) → entity (Topic). An entity can be understood simply as a thing in the real world, and each entity has several attributes (Property). Similar entities form a type, and one entity may belong to multiple types; for example, the director "Feng Xiaogang" belongs to "director" (/film/director), "actor" (/film/actor), "award winner" (/award/award_winner), and other types. A type may contain multiple entities. Similar types form a domain, which is a broader scope; for example, the domain "music" (/music) contains types such as "singer" (/music/singer) and "album" (/music/record).
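As an illustration only, the domain → type → entity organization can be sketched in Python. The class name `Entity`, its fields, and the helper `domain_of` are hypothetical and are not part of Freebase; the film titles are sample values.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A Freebase-style topic: a name, the types it belongs to, and attribute values."""
    name: str
    types: set[str] = field(default_factory=set)
    # attribute id -> list of values (an empty list means the attribute has no value)
    properties: dict[str, list[str]] = field(default_factory=dict)


def domain_of(type_id: str) -> str:
    """Return the domain part of a type id, e.g. '/film/director' -> '/film'."""
    return "/" + type_id.strip("/").split("/")[0]


# One entity may belong to several types, possibly from different domains.
feng = Entity(
    name="Feng Xiaogang",
    types={"/film/director", "/film/actor", "/award/award_winner"},
    properties={"/film/director/film": ["Cell Phone", "The Banquet"]},
)

domains = sorted({domain_of(t) for t in feng.types})
```

Here `domains` collects the domains that the entity's types fall under, which mirrors the statement that similar types form a domain.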
The core function of Freebase is to define a set of metadata, called attributes (Property), for each type. All entities of a type have all the attributes of that type and assign them specific values. For example, the type "director" (/film/director) has an attribute "films directed" (/film/director/film); "Feng Xiaogang", who belongs to "director", has this attribute with the value [Cell Phone, The Banquet, A World Without Thieves, Be There or Be Square, ...].
The initial query facets in the present invention come from the QDMiner algorithm. For a given query q, the top T retrieval results are obtained from a search engine to form the query result set D, and the QDMiner algorithm mines query facets in four steps:
List extraction: extract original lists from the result set D using multiple patterns such as text, HTML tags, and repeated regions.
List weighting: evaluate the importance of each original list based on the idea of tf-idf.
List clustering: group similar lists together with the WQT method to form query facets.
Query facet and term ranking: compute the importance of the different query facets and of the terms within each facet, then rank and output the final result.
The application works on the output of QDMiner. The inputs of the system are the original query Q_search, the document set D, and the set F of initial query facets output by QDMiner. Each query facet f ∈ F is expanded with the same method, so the description below takes a single facet f as an example.
The biggest difference between query facet expansion and the set expansion problem is that a query facet has rich context. Expansion detached from that context is likely to over-extend the user's query intent and thus hurt the quality of the expansion. In other words, whether the relationship between the original query Q_search and the facet f can be fully exploited is the key factor in expansion quality. To find the implicit relationship between the query and the facet, the application starts from two angles: the query and the facet. The application divides search engine queries into two kinds: entity-level queries and non-entity-level queries. An entity-level query is one whose content corresponds directly to one or more actually existing entities, such as "Wang Fei" or "Nokia". A non-entity-level query is one that cannot be mapped directly to an entity, such as "songs of Wang Fei" or "universities in the United States". The distinction is made because for the former the corresponding entity can be found in the knowledge base directly by searching, whereas for the latter it cannot be obtained by simple retrieval.
For an entity-level query, the entity in Freebase that corresponds to the query is obtained, along with its attributes. If the overlap between the original facet and some attribute is high, that attribute is used as the expansion of the original facet; this process looks for the relationship starting from the query. If no such attribute can be found, the method for non-entity-level queries is used. For a non-entity-level query, the smallest type in Freebase that contains the original facet is found based on the idea of tf-idf, Freebase is used to find the query-related attributes shared by the different terms of the original facet, these attributes are used to further restrict the type, and the entities contained in the restricted type are returned as the expansion of the original facet. The expanded facet is then filtered using the retrieved documents to ensure the accuracy of the result. The algorithm flow is described step by step below.
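The overall control flow described above can be sketched as follows. This is a minimal sketch, not the patented implementation: the four callables (`find_entities`, `expand_by_attribute`, `expand_by_type`, `keep_term`) are hypothetical stand-ins for the entity lookup, the two expansion strategies, and the document-based filter.

```python
def expand_facet(query, facet, find_entities, expand_by_attribute,
                 expand_by_type, keep_term):
    """Expand one facet: try attribute matching first, then type + restriction.

    All four callables are hypothetical placeholders for the steps in the text.
    """
    expansion = None
    entities = find_entities(query)      # non-empty: likely an entity-level query
    if entities:
        expansion = expand_by_attribute(entities, facet)
    if expansion is None:                # no matching attribute, or non-entity-level query
        expansion = expand_by_type(query, facet)
    if expansion is None:                # nothing suitable: leave the facet unexpanded
        return list(facet)
    # Final check against the retrieved documents keeps only supported new terms.
    new_terms = [t for t in expansion if t not in facet and keep_term(t)]
    return list(facet) + new_terms


# Toy stubs, only to show how the pieces connect.
def find(q):
    return ["entity"] if q == "Wang Fei" else []


def by_attr(entities, facet):
    return ["s1", "s2", "s3"]


def by_type(query, facet):
    return ["u1", "u2"]


entity_level = expand_facet("Wang Fei", ["s1"], find, by_attr, by_type,
                            lambda t: t != "s3")
non_entity_level = expand_facet("us university", ["u1"], find, by_attr, by_type,
                                lambda t: True)
```

Passing the steps in as functions keeps the fallback order visible without committing to any particular knowledge-base client.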
1. Expanding facets using attributes
The Freebase Search API is used to search for the entities corresponding to the query; if Freebase returns several entities, the query is likely to be an entity-level query. The Freebase Search API mainly matches the query string against the names, synonyms, and similar fields of entities. The results are fairly accurate, but not every returned entity actually corresponds to the query string. For the query "Windows 7", for example, Freebase returns not only "Windows 7" but also related operating systems such as "Windows 8" and "Windows Xp", so filtering is needed. Wherever the Search API is used in this application, the returned entities are filtered with the same kind of algorithm: for a Search API query Q that returns N entities [E1, E2, ..., EN], take an entity E among them, compute the ratio of the longest common substring of each synonym of E and the query Q to the original string, and take the maximum ratio over all synonyms as the string similarity score StrSim of E and Q. If the score is below the threshold R_StrSim, E is filtered out. The formula is:
where Alias(E) is the set of all synonyms of entity E and len denotes the length of a string. To filter out obviously irrelevant entities as accurately as possible, the threshold R_StrSim varies with the length of the query Q:
where pow(x, y) computes x raised to the power y.
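The StrSim filter can be sketched in Python under stated assumptions. The text does not make the denominator of the ratio explicit, so the sketch divides by the length of the longer of the two strings; it compares characters rather than segmented words; and the threshold is passed in as a plain parameter because the length-dependent threshold formula is not reproduced here. All function names are hypothetical.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common contiguous substring of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0] * (len(b) + 1)
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def str_sim(query: str, aliases: list[str]) -> float:
    """Best ratio over all aliases; dividing by the longer string is an assumption."""
    q = query.lower()
    scores = [lcs_length(q, a.lower()) / max(len(q), len(a)) for a in aliases if a]
    return max(scores, default=0.0)


def filter_entities(query, candidates, threshold):
    """Keep the (name, aliases) candidates whose StrSim reaches the threshold."""
    return [name for name, aliases in candidates if str_sim(query, aliases) >= threshold]


candidates = [
    ("Windows 7", ["Windows 7", "Win7"]),
    ("Windows 8", ["Windows 8"]),
    ("Windows XP", ["Windows XP"]),
]
kept = filter_entities("Windows 7", candidates, threshold=0.9)
```

With a threshold of 0.9 only the exact match survives in this toy data, which is the behavior the "Windows 7" example in the text calls for.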
After obviously irrelevant entities have been filtered out with the above method, we take the top n entities [ES1, ES2, ES3, ..., ESn] corresponding to the search engine query Q_search. The entities returned by the Search API are sorted by relevance; following this order, we obtain all attributes of [ES1, ES2, ES3, ..., ESn] in turn and compare them with the original facet f to find the attributes with high similarity. The detailed process is as follows.
As shown in Fig. 1, an entity can be represented with a three-layer structure. An attribute of an entity may be multi-valued, single-valued, or without a value. The structure of the director "Feng Xiaogang", for example, is shown in Fig. 2.
For the director "Feng Xiaogang", "films directed" (/film/director/film) is a multi-valued attribute, "nationality" (/people/person/nationality) is a single-valued attribute, and "religion" (/people/person/religion) is an attribute with no value.
An attribute value (Target) is essentially also an entity, and entities are linked to one another through attributes. If the attributes of the attribute values are obtained in turn, the result is the 2-layer attributes of the source entity E_root. The union of all 1-layer and 2-layer attributes is called the 2-layer attribute net of the source entity E_root, denoted P2. Continuing the iteration yields P3, P4, and so on. The deeper the layer, the longer the path from the source entity to the last-layer attribute values, the weaker the correlation, and the more noise there is. This embodiment mainly uses P2 for the subsequent algorithms.
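Building the attribute net can be sketched as a depth-limited traversal. The in-memory `graph` dictionary and the function `attribute_net` are hypothetical; in practice the attributes would come from the knowledge base. Path segments are joined with `#`, following the notation used in this description.

```python
def attribute_net(graph, root, depth=2):
    """Collect attribute paths of `root` up to `depth` layers.

    graph maps entity -> {attribute: [value entities]}.
    Returns {path: sorted list of values}; depth=2 gives the 2-layer net P2.
    """
    net = {}
    frontier = [("", root)]
    for _ in range(depth):
        next_frontier = []
        for prefix, entity in frontier:
            for prop, values in graph.get(entity, {}).items():
                path = f"{prefix}#{prop}" if prefix else prop
                net.setdefault(path, set()).update(values)
                # Each value is itself an entity and is expanded in the next layer.
                next_frontier.extend((path, v) for v in values)
        frontier = next_frontier
    return {path: sorted(vals) for path, vals in net.items()}


graph = {
    "Feng Xiaogang": {"/film/director/film": ["The Banquet", "Cell Phone"]},
    "The Banquet": {"/film/film/actor": ["Zhou Xun", "Zhang Ziyi"]},
    "Cell Phone": {"/film/film/actor": ["Ge You"]},
}
p2 = attribute_net(graph, "Feng Xiaogang", depth=2)
p1 = attribute_net(graph, "Feng Xiaogang", depth=1)
```

In this sketch the values of a 2-layer path are merged across all intermediate entities; whether to split them per intermediate entity is a separate decision.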
The structure of the attribute net can be converted into a tree. E_root corresponds to the root node of the tree, and the attribute values of each layer of attributes (Pn has n layers) correspond to leaf nodes. The sequence of attributes along the path from the root node to a leaf node can be regarded as a generalized attribute. All attributes other than 1-layer attributes are called multi-layer attributes. For "Feng Xiaogang", for example, the 1-layer attributes are shown in Fig. 2. Expanding one more layer, the film "The Banquet" has the attribute "actors in the film" (/film/film/actor) with the value [Zhou Xun, Zhang Ziyi, Ge You, Wu Yanzu, ...], so /film/director/film#/film/film/actor is a 2-layer attribute of the director "Feng Xiaogang", which can be understood as "actors in films directed by Feng Xiaogang". Multi-layer attributes raise an important question: whether the attribute values of the intermediate layer should be treated as intermediate nodes. For the entity "Feng Xiaogang", for example, should all actors of all films he directed be regarded as a single attribute, or should they be subdivided into several attributes by film, such as /film/director/film#The Banquet#/film/film/actor? Each way of expanding multi-layer attributes has its own emphasis: the former focuses on the whole, while the latter expands in finer detail. In this embodiment, whether to subdivide is decided by the dispersion degree of the attribute values of the preceding layer. For example, the entity "Beijing Subway" has the attribute "lines" (/metropolitan_transit/transit_system/transit_lines) with the value [Beijing Subway Line 1, Beijing Subway Line 2, Beijing Subway Line 4, ...], and each line has the attribute "subway stations" (/metropolitan_transit/transit_line/stops). Because the stations of different lines overlap very little, when the 2-layer attributes are expanded the attribute /metropolitan_transit/transit_system/transit_lines#/metropolitan_transit/transit_line/stops can be subdivided into /metropolitan_transit/transit_system/transit_lines#Beijing Subway Line 1#/metropolitan_transit/transit_line/stops and so on. The specific calculation is as follows.
Consider an attribute P of entity E. If P is single-valued, the question does not arise. If P is multi-valued, its values are written [T1, T2, T3, ..., TN]. The attributes of T1, T2, T3, ..., TN may not be identical: T1 may have an attribute that T2 lacks, because the types that T1 and T2 belong to are not exactly the same. For an attribute P', suppose K of the entities in [T1, T2, T3, ..., TN] have this attribute, namely [Ts1, Ts2, Ts3, ..., Tsk]. If K is 1, the question again does not arise. If K > 1, the dispersion degree of the set T = [T1, T2, T3, ..., TN] on attribute P' is calculated as follows:
The formula is defined over the set of values of attribute P of entity E. The closer S_diversity is to 1, the greater the dispersion of attribute P#P'; if it equals 1, the values of attribute P' of the entities in T do not overlap at all. When S_diversity > 0.5, that is, when each value in the value set of P#P' is repeated less than once on average, the attribute P#P' is subdivided into several attributes P#Ts#P'.
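Since the dispersion formula itself is not reproduced in this text, the following Python sketch uses an assumed definition that is consistent with the description: the number of distinct P' values divided by the total number of P' values over the K entities. Under this assumption a score of 1 means no overlap, and a score above 0.5 means each value is repeated less than once on average. The function names are hypothetical.

```python
def dispersion(value_lists):
    """Distinct values divided by total values across the K entities (assumed definition)."""
    total = sum(len(values) for values in value_lists)
    if total == 0:
        return 0.0
    distinct = len(set().union(*value_lists))
    return distinct / total


def should_subdivide(value_lists, threshold=0.5):
    """Subdivide P#P' into P#Ts#P' only when K > 1 and the dispersion exceeds the threshold."""
    return len(value_lists) > 1 and dispersion(value_lists) > threshold


# Stations of different subway lines barely overlap, so the attribute is subdivided.
stations_per_line = [["A", "B", "C"], ["D", "E"], ["F", "C"]]
# The same few values recur for every intermediate entity, so the attribute is kept whole.
recurring_values = [["X", "Y"], ["X", "Y"], ["X", "Y"]]
```

The first toy set has 6 distinct values out of 7, the second 2 out of 6, so only the first is subdivided.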
The above is the whole process of obtaining attributes. In the end we obtain P2 for the entity E, and this large tree can be simplified to the structure shown in Fig. 3: 1-layer attributes, 2-layer attributes, and 2-layer attributes that pass through an intermediate node are all regarded as generalized attributes of the entity E.
With the generalized attributes shown in Fig. 3 and their corresponding values, we can find the attribute most similar to the original facet f. The initial facet f is expressed as [Str1, Str2, Str3, ..., StrN]. For each term in it, the corresponding entities are searched with the Search API in the same way as for the query Q_search, and the top K returned entities are kept after filtering by StrSim score; more than the first one is kept because terms can be ambiguous. Each entity ES in the list [ES1, ES2, ES3, ..., ESn] for Q_search is then taken in turn, all attributes in its P2 are compared with f, and the attribute with the highest similarity score is taken. If an attribute meeting the condition is found, the process ends; otherwise it continues with the next ES. The algorithm is as follows.
Algorithm 1: computing Property tf values
where Targets(P) denotes all values of attribute P, and Entity(Str) denotes the set of all entities that Freebase returns for Str. An entity in Freebase is used at most once; that is, one entity can represent only one term in the facet.
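The listing of Algorithm 1 is not reproduced in this text. As a sketch of what it computes, the overlap count between an attribute and a facet can be obtained greedily as below; the function name `co_num` and the greedy matching order are assumptions, chosen to respect the rule that each entity represents at most one term.

```python
def co_num(targets, term_entities):
    """Count facet terms matched by a distinct attribute value.

    targets: the values of one attribute, Targets(P).
    term_entities: one ranked entity list per facet term, Entity(Str).
    """
    available = set(targets)
    count = 0
    for candidates in term_entities:
        for entity in candidates:
            if entity in available:
                available.remove(entity)   # an entity may represent only one term
                count += 1
                break                      # a term is counted at most once
    return count


targets = {"e1", "e2", "e3"}
term_entities = [["e1", "e9"], ["e1"], ["e2"], ["e7"]]
overlap = co_num(targets, term_entities)
```

Here the second term also maps to `e1`, but that entity has already been consumed by the first term, so the overlap is 2 rather than 3. A greedy pass does not always find the largest possible matching; it is used here only for brevity.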
Algorithm 1 gives the overlap count CoNum between an attribute and the original facet. This value can be understood as the tf value of the attribute, so the final score of the attribute is computed in the spirit of tf-idf:
where TotalPropertyN plays the role of the total number of documents and is taken as 60,000,000 here (approximately the number of entities in Freebase). A threshold is applied as well: attributes whose value falls below it are filtered out, and the threshold is 0.5 here. If all ES have been tried without finding a qualifying attribute, the method below is used instead.
2. Expanding facets using types and restrictions
Most queries submitted to search engines are not entity-level queries, and even when a query is entity-level, the corresponding attribute may be hard to find in the knowledge base. The entity "watch", for example, has no attribute such as "color" in Freebase, possibly because "color" is not distinctive: any product comes in different colors. Starting from the facet is therefore an important approach.
First, several types that contain the original facet f are found and scored with a tf-idf-like method, and the highest-scoring type is selected. In the ideal case all entities of this type can be used as the expansion of the original facet, but in practice the types in Freebase are too coarse and need to be restricted further. Using a method similar to the one above, the Search API is used to find the entities corresponding to all terms in facet f; among all attributes of these entities, the shared attributes related to Q_search are found and used to restrict the type; and all entities of the restricted type are returned as the expansion of the original facet. For example, if the original facet to expand is "films directed by Feng Xiaogang", the best type found is likely to be "award-winning film", which is too broad to use directly for expansion. If we can find that these films share an attribute "director of the film" whose value is "Feng Xiaogang", we can use MQL (the query language specific to Freebase) to retrieve "award-winning films" whose "director is Feng Xiaogang", which greatly improves both the efficiency of the search and the precision of the expansion.
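A "type + restriction" request of this kind can be sketched as an MQL-style JSON read query built in Python. The helper `build_mql` is hypothetical, and the type and attribute ids used below (`/award/award_winning_work`, `/film/film/directed_by`) are assumptions chosen for illustration; the text does not name the ids it uses. The sketch only builds the query string and does not contact any service.

```python
import json


def build_mql(type_id, restrictions, limit=100):
    """Build an MQL-style read query: entities of `type_id` matching `restrictions`.

    A None value (JSON null) asks the service to fill in that field.
    """
    clause = {"type": type_id, "name": None, "limit": limit}
    clause.update(restrictions)
    return json.dumps([clause], ensure_ascii=False, sort_keys=True)


# "Award-winning films" restricted to "director is Feng Xiaogang" (ids are assumed).
query = build_mql(
    "/award/award_winning_work",
    {"/film/film/directed_by": "Feng Xiaogang"},
)
parsed = json.loads(query)
```

Keeping the restriction as a separate dictionary makes it easy to add further attribute and value pairs when more than one restriction is found.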
In a similar way, for each Str in facet f [Str1, Str2, Str3, ..., StrN], the Search API is used to find the entities in Freebase, and after filtering by StrSim score the structure shown in Fig. 4 is obtained. The tf value of each type is then computed with the following algorithm:
Algorithm 2: computing Type tf values
The tf value of a type can be understood as the proportion of terms in the facet that are covered by the type. As in the computation of attribute tf values above, an entity can be used only once; this technique is applied wherever the ambiguity of Freebase entities is involved. After tf is obtained, types whose tf value is too low are filtered out according to the threshold R_TypeTf, which is computed as follows:
After the two filtering steps, StrSim and TypeTf, the final score is computed over all retained entities and their types by a voting process: the top K entities are retained for each term, and each entity votes for the types it belongs to with weight 1/sqrt(Rank(E)), where Rank(E) is the position of entity E in the entity list of the corresponding term. Because the entities returned by the Search API are ordered, with the most similar first, this ranking information can be exploited. As before, if term Str1 and term Str2 both correspond to entity E, E is used only once. The final score of type T is:
where I(T, E) is an indicator function whose value is 1 if entity E belongs to type T and 0 otherwise; TotalTypeN is 60,000,000; and Df(T) is the number of entities contained in type T.
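The voting process can be sketched as follows. The vote weight 1/sqrt(Rank(E)), the indicator I(T, E), TotalTypeN, and Df(T) come from the text; how they are combined is an assumption here, since the score formula is not reproduced. The sketch multiplies the summed votes by log(TotalTypeN / Df(T)) in the usual tf-idf manner. The function name `type_scores` and the toy data are hypothetical.

```python
import math

TOTAL_TYPE_N = 60_000_000  # value given in the text


def type_scores(term_entities, entity_types, df, k=3):
    """Score types by rank-weighted votes times an assumed idf factor.

    term_entities: one ranked entity list per facet term.
    entity_types: entity -> set of types (the indicator I(T, E)).
    df: type -> number of entities of that type, Df(T).
    """
    votes = {}
    used = set()
    for ranked in term_entities:
        for rank, entity in enumerate(ranked[:k], start=1):
            if entity in used:           # an entity is used only once across terms
                continue
            used.add(entity)
            for t in entity_types.get(entity, ()):
                votes[t] = votes.get(t, 0.0) + 1.0 / math.sqrt(rank)
    # The log idf factor is an assumption, not taken from the source.
    return {t: v * math.log(TOTAL_TYPE_N / df[t]) for t, v in votes.items()}


term_entities = [["a", "b"], ["c"]]
entity_types = {"a": {"/film/film"}, "b": {"/book/book"}, "c": {"/film/film"}}
df = {"/film/film": 200_000, "/book/book": 2_000_000}
scores = type_scores(term_entities, entity_types, df)
best_type = max(scores, key=scores.get)
```

In this toy data "/film/film" receives two first-rank votes and belongs to a smaller type, so it wins over "/book/book".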
Types are ranked by score, and the highest-scoring type is taken as the final result. If no suitable type is found, facet f is not expanded.
Because the types in Freebase are very broad, further restriction is needed to narrow the scope of the query. As before, the entities corresponding to each term Str in facet f [Str1, Str2, Str3, ..., StrN] are obtained first, together with their 1-layer attributes P1 (P2 is not used, for performance reasons). Each attribute and attribute value is treated as a pair Pair(Property, Target). For example, the attribute "films directed" of "Feng Xiaogang" with value [The Banquet, Cell Phone, ...] is treated as (/film/director/film, The Banquet), (/film/director/film, Cell Phone), and so on, so that each entity has a series of Pairs. Pair scores are computed in a way similar to TypeTf: the proportion of terms in which each Pair occurs is taken as its final score. The top K entities are taken for each term, an entity can be used only once, and for a given term the same Pair is counted only once. The resulting score can be understood as the tf value of the Pair. Pairs whose tf value is too low are filtered out. The threshold R_PairTf is computed as:
For the remaining Pairs, all synonyms of the Target are taken, and the Target is checked for containment in Qsearch. If it is contained, the Pair is considered associated with the query; otherwise it is deleted. Finally, the same Target may appear in several Pairs. For example, the query "US universities" has a facet f "universities in the United States", and the selected Pairs include (/location/location/containedby, United States), (/organization/organization/headquarters, United States), and so on. Because each term was limited above to K entities, wrong Pairs may be selected, so among Pairs with the same Target only the top P by tf value are retained. All entities under the restricted type are then retrieved with an MQL query and added as the extension.
When judging whether the Target of a Pair is related to Qsearch, the n-layer attributes Pn of the Target can be obtained in addition to its synonyms (for performance reasons, P1 is used here). If any of these attribute values is contained in Qsearch, the Pair is likewise considered related to Qsearch. Overall, P1 was used earlier to find candidate Pairs as possible restrictions; this step then obtains the P1 of the Target in each Pair, which is equivalent to obtaining part of the P2 of the source entity.
As shown in Figure 5, each rectangle represents the set of all entities corresponding to one term; circular nodes are the attribute values of an entity's layer-1 attributes, and triangular and diamond nodes are the attribute values of its layer-2 attributes. Because the algorithm above retains only the Pairs with high tf values, further obtaining the P1 of a Target amounts to obtaining the diamond nodes only; the triangular nodes are never actually fetched. For example, the query "US universities" has a facet "universities in New York State". These universities share the Pair (/location/location/containedby, New York). By obtaining the P1 of "New York", it is found that "New York" has the attribute "contained by" (/location/location/containedby) with the value "United States", so the Pair (/location/location/containedby, New York) is a restriction related to Qsearch.
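The two-step relevance check (Target name or synonym first, then the Target's own layer-1 attribute values) can be sketched as follows. The data layout, the function names, and the case-insensitive substring matching are assumptions made for illustration; applying synonyms to the second-hop values is also an assumption.

```python
def mentioned(name, query, synonyms):
    """True if the name or one of its synonyms occurs in the query.
    Substring matching is a simplification; a real system would match tokens."""
    names = [name, *synonyms.get(name, ())]
    return any(n.lower() in query.lower() for n in names)


def pair_is_relevant(pair, query, synonyms, layer1):
    """Decide whether a Pair is a restriction related to the query.

    layer1: maps an entity name to its layer-1 attributes as a list of
        (property, value) tuples, i.e. the P1 of that entity.
    """
    _, target = pair
    if mentioned(target, query, synonyms):
        return True
    # Second hop: look at the P1 of the Target itself.
    return any(mentioned(value, query, synonyms)
               for _, value in layer1.get(target, ()))
```

For the example above, "New York" does not occur in "US universities", but its containedby value "United States" has the synonym "US", so the Pair is kept.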
If no restriction is found and the df value of the type is very large, the facet is not extended; if the df value of the type is reasonable, all entities of the type are used as the extension.
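This fallback rule can be written as a small decision function. The text does not say what counts as a "very large" df value, so `max_df` is a hypothetical cut-off, and the function name is likewise an assumption.

```python
def choose_extension(restricted_entities, type_entities, type_df, max_df):
    """Pick the extension for a facet.

    restricted_entities: entities of the type that satisfy the restriction
        (empty when no restriction was found).
    type_entities: all entities of the selected type.
    type_df: Df(T) of the selected type; max_df: assumed upper bound.
    """
    if restricted_entities:
        return list(restricted_entities)
    if type_df > max_df:
        return []  # type too broad and no restriction: do not extend
    return list(type_entities)
```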
The present method differs from the traditional set expansion problem in that Qsearch and the retrieval results are available as context. The extended facets output by the above method can be checked against this context to improve the precision of the extension. There are currently two main checks: the text occurrence score NameOcc and the co-occurrence score CoOcc. The proportion of documents in the search engine's document set that contain the entity is the text occurrence score; the average number, per document, of terms of the original facet f that co-occur with the entity is the co-occurrence score. Entities with NameOcc = 0 or CoOcc = 0 are filtered out.
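A minimal sketch of the two context checks follows. It assumes documents are plain strings, that occurrence is tested by case-insensitive substring matching, and that CoOcc is averaged over all documents in the result set (the text does not say whether the average runs over all documents or only those containing the entity). Function names are hypothetical.

```python
def context_scores(entity, facet_terms, documents):
    """Return (NameOcc, CoOcc) for one candidate entity.

    NameOcc: fraction of documents that contain the entity name.
    CoOcc: average number, per document, of original facet terms that
        appear together with the entity.
    """
    if not documents:
        return 0.0, 0.0
    name = entity.lower()
    terms = [t.lower() for t in facet_terms if t.lower() != name]
    containing = 0
    co_terms = 0
    for doc in documents:
        text = doc.lower()
        if name in text:
            containing += 1
            co_terms += sum(1 for t in terms if t in text)
    n_docs = len(documents)
    return containing / n_docs, co_terms / n_docs


def filter_by_context(candidates, facet_terms, documents):
    """Drop candidates with NameOcc = 0 or CoOcc = 0."""
    kept = []
    for entity in candidates:
        name_occ, co_occ = context_scores(entity, facet_terms, documents)
        if name_occ > 0 and co_occ > 0:
            kept.append(entity)
    return kept
```

A candidate that never appears in the retrieved documents, or that appears only in documents mentioning none of the original facet terms, is removed from the extension.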
The entities in an extended facet are unordered and can be ranked using context and attributes. In addition to NameOcc and CoOcc, the BM25 model can be used to compute the relevance score BM25Score between the entity description and the query Qsearch; entities are then ranked by the product of these three scores.
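For reference, a compact BM25 scorer over tokenised entity descriptions might look like the sketch below. The parameter values k1 = 1.2 and b = 0.75 and the non-negative idf variant log(1 + (N - df + 0.5) / (df + 0.5)) are common defaults chosen here as assumptions; the text does not state which BM25 variant or parameters are used.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.2, b=0.75):
    """Score each tokenised description against the tokenised query."""
    n_docs = len(docs_tokens)
    if n_docs == 0:
        return []
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    doc_freq = Counter()
    for doc in docs_tokens:
        doc_freq.update(set(doc))  # number of descriptions containing a token
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for token in set(query_tokens):
            if tf[token] == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[token] + 0.5) / (doc_freq[token] + 0.5))
            norm = tf[token] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[token] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

The returned list is aligned with the input descriptions, so each value can be used as the BM25Score of the corresponding entity.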
Attributes can also be used to rank the extended entities. The PairTf score computed earlier can be regarded as the importance of a Pair. For a facet f [Str1, Str2, Str3, ..., StrN], all Pairs are obtained and their PairTf is computed; unlike the restriction-finding process, these Pairs are not filtered. The larger the PairTf, the more important the Pair. All Pairs form a set Pairjudge, which serves as the basis for scoring entities. For every entity E, the similarity score ProSim between all Pairs of E and Pairjudge is computed as follows:
Wherein I(E, P) is an indicator function whose value is 1 if entity E has Pair P and 0 otherwise. The product of all scores serves as the basis for the final ranking:
ScoreForRank = NameOcc × CoOcc × BM25Score × ProSim
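The final ranking can be sketched as below. The ProSim formula itself is not reproduced in this text; the sketch assumes the simplest form consistent with the indicator function I(E, P), namely the sum of PairTf over the Pairs in Pairjudge that the entity has. That form, the function names, and the dictionary layout are assumptions. The product in `score_for_rank` follows the ScoreForRank expression given above.

```python
def pro_sim(entity_pairs, pair_judge):
    """Assumed ProSim: sum of PairTf(P) over Pairs P in Pairjudge with I(E, P) = 1.

    entity_pairs: set of (property, target) Pairs of entity E.
    pair_judge: {pair: PairTf} built from the original facet terms.
    """
    return sum(tf for pair, tf in pair_judge.items() if pair in entity_pairs)


def score_for_rank(c):
    """ScoreForRank = NameOcc * CoOcc * BM25Score * ProSim."""
    return c["name_occ"] * c["co_occ"] * c["bm25"] * c["pro_sim"]


def rank_entities(candidates):
    """Sort candidate entities by ScoreForRank, highest first."""
    return sorted(candidates, key=score_for_rank, reverse=True)
```

Because the scores are multiplied, a candidate with any factor equal to zero falls to the bottom of the list.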
The invention discloses a method for mining query facets using a knowledge base. Its main contribution is to use a structured knowledge base to discover the implicit relationship between a query and its facets, so that the targets to be extended are located accurately. The invention uses UserQ as test data to evaluate the algorithm; the results show that the knowledge-base-based method can markedly improve the quality of query facets and raise recall while maintaining precision. In addition, with a knowledge base, facets can be generated directly without relying on initial query facets; for example, the attributes corresponding to an entity-level query can serve directly as query facets. The method can therefore not only extend original facets but also generate new ones.
The above description merely illustrates the invention. It should be understood that the invention is not limited to the above embodiments, and that variants consistent with the inventive concept fall within the scope of protection of the invention.
Claims (4)
1. A knowledge-base-based query facet generation method, characterized in that the method comprises the following steps:
1) for a given query q, obtaining the top T retrieval results from a search engine to form a query result set D;
2) obtaining a series of initial query facets f based on the QDMiner algorithm, the initial query facets f forming a set F;
3) extending each initial query facet f;
4) filtering the extended initial query facets f using the retrieved documents to ensure the accuracy of the extension results, and generating the final query facets from the extended initial query facets f;
wherein the method of obtaining a series of initial query facets f based on the QDMiner algorithm in step 2) is specifically:
A. list extraction: extracting original lists from the query result set D using several kinds of patterns, namely free text, HTML tags, and repeated regions;
B. list weighting: evaluating the importance of each original list based on the tf-idf idea;
C. list clustering: clustering similar lists into query facets using the WQT method;
D. query facet and term ranking: computing the importance of the different query facets and of the terms within each facet, ranking them, and outputting the final result, thereby obtaining a series of initial query facets f;
wherein the method of extending each initial query facet f in step 3) is specifically: first dividing search engine queries into two kinds, entity-level queries and non-entity-level queries; for an entity-level query, obtaining the entity in Freebase corresponding to the query and obtaining its attributes; if the overlap between the original facet and some attribute is very high, using that attribute as the extension of the original facet; if no such attribute can be found, proceeding as for a non-entity-level query; for a non-entity-level query, finding, based on the tf-idf idea, the smallest type in Freebase that contains the original facet, using Freebase to find attributes that are shared by the different terms of the original facet and associated with the query, further restricting the type with such attributes, and returning the entities contained in the restricted type as the extension of the original facet.
2. The knowledge-base-based query facet generation method according to claim 1, characterized in that the specific method of obtaining the entity in Freebase corresponding to the query is: searching for the entity corresponding to the query using the Freebase Search API, which matches the query string against the names and synonyms of entities; and then filtering the returned entities.
3. The knowledge-base-based query facet generation method according to claim 2, characterized in that the method of filtering the returned entities is: for a query Q submitted to the Search API, N entities [E1, E2, ..., EN] are returned; for each entity E among them, word segmentation is performed on all of its synonyms and on the query Q, the ratio of the longest common word string of each synonym of E and the query Q to the original string is computed, and the maximum ratio over all synonyms is taken as the string similarity score StrSim of E and Q; if the score is below the threshold RStrSim, E is filtered out; the formula is:
wherein Alias(E) is the set of all synonyms of entity E, and len denotes the length of a word string; the threshold RStrSim varies with the length of the query Q; LCS(Q, a) computes the length of the longest common word string of the query Q and the synonym a:
wherein pow(x, y) is x raised to the power y.
4. The knowledge-base-based query facet generation method according to claim 1, characterized in that the method of extending each initial query facet f in step 3) is specifically: first finding several types that contain the original facet f, scoring them with the tf-idf method, and selecting the highest-scoring type; using the Search API to find the entities corresponding to all terms in the facet f; finding, among all attributes of these entities, the attributes that are shared and related to the original query, and using them to restrict the type; and returning all entities of the restricted type as the extension of the original facet.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510888652.8A CN105550226B (en) | 2015-12-07 | 2015-12-07 | A kind of inquiry facet generation method in knowledge based library |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510888652.8A CN105550226B (en) | 2015-12-07 | 2015-12-07 | A kind of inquiry facet generation method in knowledge based library |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105550226A CN105550226A (en) | 2016-05-04 |
CN105550226B true CN105550226B (en) | 2018-09-04 |
Family
ID=55829415
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510888652.8A Active CN105550226B (en) | 2015-12-07 | 2015-12-07 | A kind of inquiry facet generation method in knowledge based library |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105550226B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106203632B (en) * | 2016-07-12 | 2018-10-23 | 中国科学院科技政策与管理科学研究所 | A kind of limited knowledge collection recombinant and study and the application system method for being distributed extraction |
US11941010B2 (en) * | 2020-12-22 | 2024-03-26 | International Business Machines Corporation | Dynamic facet ranking |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102402619A (en) * | 2011-12-23 | 2012-04-04 | 广东威创视讯科技股份有限公司 | Search method and device |
CN104794163A (en) * | 2015-03-25 | 2015-07-22 | 中国人民大学 | Entity set extension method |
CN104915449A (en) * | 2015-06-30 | 2015-09-16 | 河海大学 | Faceted search system and method based on water conservancy object classification labels |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9280587B2 (en) * | 2013-03-15 | 2016-03-08 | Xerox Corporation | Mailbox search engine using query multi-modal expansion and community-based smoothing |
Non-Patent Citations (4)
Title |
---|
Facet-based opinion retrieval from blogs;Olga Vechtomova;《Information Processing and Management》;20090717;第71-88页,第3.2节 * |
Finding dimensions for queries;Zhicheng Dou 等;《Acm International Conference on Information & Knowledge Management ACM》;20111028;第1311-1320页,第3.1节至3.5节 * |
Summarization and Expansion of Search Facets;Aparna Nurani Venkitasubramanian 等;《Ceur Workshop Proceedings》;20130426;第1-2页 * |
结合相关规则和本体加权图的查询扩展;郝志峰 等;《计算机应用研究》;20140418;第31卷(第10期);第3028-3032页 * |
Also Published As
Publication number | Publication date |
---|---|
CN105550226A (en) | 2016-05-04 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||