CN105550226B - Method for generating query facets based on a knowledge base - Google Patents


Info

Publication number
CN105550226B
Authority
CN
China
Prior art keywords
facet
query
entity
attribute
search
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510888652.8A
Other languages
Chinese (zh)
Other versions
CN105550226A (en
Inventor
窦志成
文继荣
江政宝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Renmin University of China
Original Assignee
Renmin University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Renmin University of China
Priority to CN201510888652.8A
Publication of CN105550226A
Application granted
Publication of CN105550226B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for generating query facets based on a knowledge base. The method comprises the following steps: 1) for a given query q, obtain the top T retrieval results from a search engine to form the query result set D; 2) obtain a series of initial query facets f based on the QDMiner algorithm, the initial query facets forming a set F; 3) expand each initial query facet f; 4) filter each expanded query facet using the retrieved documents to ensure the accuracy of the expansion, and generate the final query facets from the expanded, filtered facets. By using a knowledge base to generate query facets, the invention overcomes the limitation that existing methods depend on the retrieval results. Because the initial facets are expanded with high-quality information from the knowledge base, facet terms that do not appear in the retrieval results, or that were not extracted from them, can still be located accurately, which improves both the accuracy and the coverage of the query facets.

Description

Method for generating query facets based on a knowledge base
Technical field
The present invention relates to a method for generating query facets based on a knowledge base.
Background
According to the 2013 Research Report on the Search Behavior of Chinese Internet Users published by the China Internet Network Information Center (CNNIC), by the end of June 2013 China had 470 million search engine users and 324 million mobile search users. In the preceding six months, 98% of these users had used a general-purpose search engine. In the Internet era, search engines are therefore the main entrance to the network and the main source of online information.
At present, general-purpose search engines mainly present results as a list of relevant documents, sorted from high to low relevance. For simple navigational searches, such as a search for "Taobao official website", this is sufficient. For complex, informational or exploratory searches, however, this form of presentation is too thin: the user has to look through thousands of returned results and summarize the needed information, which is inefficient. In some cases the user's search intent is vague and hard to express precisely in one or two words, for example when searching for knowledge about a related field. In other cases the search is exploratory, and the search engine needs to organize related content by category so that the user can find the desired information step by step; a shopping site, for example, lets the user restrict a search by brand, style, size and so on. For the former case, the main approach today is search suggestion: as the user types in the search box, the search engine suggests possible queries based on accumulated search logs. For the latter case, current applications are mainly limited to vertical domains such as products and hotels. Query facets are an effective solution to both problems. A query facet can be regarded as a summary of the query from a particular angle; for example, the facets of the query "Wang Fei" (Faye Wong) include her well-known songs, her albums, her friends and the awards she has won. Query facets extend the user's query intent and summarize potentially relevant information. They not only help users clarify their search intent but also suggest related content, supporting exploratory search.
Currently, query facet mining depends on the set of documents returned by the search engine. Manually defined patterns are used to extract lists of terms that appear side by side in the documents, and the final query facets are generated through clustering, ranking and similar processes. Building on this, another approach uses supervised learning to train two models: one judges whether a term belongs to a query facet, and the other judges whether two terms belong to the same query facet. Although both approaches achieve good results, the correctness and accuracy of their output depend on document quality. First, if the retrieved documents do not contain a given facet or term, existing methods cannot extract it. Second, even if the retrieval results contain the facet, existing extraction patterns cannot identify it accurately when it is not presented as a list. Finally, the extracted parallel lists may contain noise items, and existing methods cannot filter all of them out effectively.
How to solve the above problems is therefore a technical problem that those skilled in the art urgently need to solve.
Summary of the invention
In view of the problems described in the background, the purpose of the present invention is to provide a method for generating query facets based on a knowledge base. By using a knowledge base to generate query facets, the application overcomes the limitation that existing methods depend on the retrieval results. Because the initial facets are expanded with high-quality information from the knowledge base, facet terms that do not appear in the retrieval results, or that were not extracted from them, can still be located accurately, which improves both the accuracy and the coverage of the query facets.
The purpose of the present invention is achieved through the following technical solution:
A method for generating query facets based on a knowledge base, comprising the following steps:
1) for a given query q, obtain the top T retrieval results from a search engine to form the query result set D;
2) obtain a series of initial query facets f based on the QDMiner algorithm, the initial query facets forming a set F;
3) expand each initial query facet f;
4) filter each expanded query facet using the retrieved documents to ensure the accuracy of the expansion, and generate the final query facets from the expanded, filtered facets.
Further, in step 2) the series of initial query facets f is obtained based on the QDMiner algorithm as follows:
A. List extraction: original lists are extracted from the query result set D using several kinds of patterns, including free text, HTML tags and repeated regions;
B. List weighting: the importance of each original list is assessed based on the idea of tf-idf;
C. List clustering: similar lists are grouped together with the WQT method to form query facets;
D. Facet and term ranking: the importance of each query facet and of the terms within it is computed, and the ranked result is output as the series of initial query facets f.
Further, in step 3) each initial query facet f is expanded as follows. Search engine queries are first divided into two kinds: entity-level queries and non-entity-level queries. For an entity-level query, the corresponding entity in Freebase is obtained together with its attributes. If the original facet overlaps heavily with one of the attributes, that attribute is used as the expansion of the original facet; if no such attribute can be found, the method falls back to the procedure for non-entity-level queries. For a non-entity-level query, the smallest type in Freebase that contains the original facet is found based on the idea of tf-idf, and Freebase is used to find attributes that are shared by the different terms of the original facet and are related to the query. These attributes are used to further restrict the type, and the entities contained in the restricted type are returned as the expansion of the original facet.
Further, the entity corresponding to the query is obtained from Freebase as follows: the Search API of Freebase is used to search for the entity corresponding to the query, the Search API mainly matching the query string against entity names and synonyms; the returned entities are then filtered.
Further, the returned entities are filtered as follows. For a query Q submitted to the Search API, N entities [E1, E2, ..., EN] are returned. For each entity E, all of its synonyms and the query Q are word-segmented, the proportion of the original string accounted for by the longest common substring of each synonym and Q is calculated, and the maximum proportion over all synonyms is taken as the string similarity score StrSim of E and Q. If the score is below the threshold R_StrSim, E is filtered out. The formula is: (formula not reproduced in this text)
where Alias(E) is the set of all synonyms of entity E and len denotes the length of a string. The threshold R_StrSim varies with the length of query Q, and LCS(Q, a) is the length of the longest common substring of query Q and synonym a: (formula not reproduced in this text)
where pow(x, y) computes x raised to the power y.
Further, in step 3) each initial query facet f may alternatively be expanded as follows. First, the types that contain the original facet f are found and scored with the tf-idf method, and the type with the highest score is selected. The Search API is used to find the entities corresponding to all terms in facet f, and attributes that are shared by these entities and related to the original query are used to restrict the type. All entities of the restricted type are returned as the expansion of the original facet.
The present invention has the following beneficial technical effects:
1. Query facets are generated using a knowledge base
The application generates query facets from both a knowledge base and the retrieval results, so it can draw on the high-quality, comprehensive information in the knowledge base while still capturing users' interests and focus through the retrieval results.
2. The association between a query and its facets is determined using the knowledge base
For entity-level queries, the attributes of the entity can be matched against the query facets, which determines the relationship between the query and the facet and ensures that the expansion is accurate. For general queries, the scope of the query is narrowed by a "type + restriction" approach, which likewise ensures the precision of the expansion.
3. A multi-layer relation network is built from the knowledge base
The method considers not only the 1-layer attributes (predicates) of each entity in Freebase but also 2-layer and deeper attributes, so that more complex relationships can be described.
Brief description of the drawings
Fig. 1 is a schematic diagram of the three-layer structure of an entity;
Fig. 2 is a schematic diagram of the three-layer structure of "Feng Xiaogang";
Fig. 3 is a schematic diagram of the tree structure of a 2-layer attribute net;
Fig. 4 is a diagram of the types of a query facet;
Fig. 5 is a schematic diagram of the structure of a partial 2-layer attribute net.
Detailed description
The application is further described below with reference to the accompanying drawings.
The application provides a method for generating query facets based on a knowledge base, comprising the following steps:
1) for a given query q, obtain the top T retrieval results from a search engine to form the query result set D;
2) obtain a series of initial query facets f based on the QDMiner algorithm, the initial query facets forming a set F;
3) expand each initial query facet f;
4) filter each expanded query facet using the retrieved documents to ensure the accuracy of the expansion, and generate the final query facets from the expanded, filtered facets.
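The four steps can be read as a simple data flow. The following Python sketch shows only that flow; the callables `search`, `qdminer`, `expand` and `keep`, the default `top_t=100`, and the rule that original terms are always kept while newly added terms must pass the document-based filter are all illustrative assumptions, not details given in the description.

```python
from typing import Callable, List

Facet = List[str]  # a facet is an ordered list of terms


def generate_facets(
    query: str,
    search: Callable[[str, int], List[str]],           # hypothetical: top-T documents for a query
    qdminer: Callable[[str, List[str]], List[Facet]],  # hypothetical: QDMiner, returns the set F
    expand: Callable[[str, Facet], Facet],             # hypothetical: knowledge-base expansion
    keep: Callable[[str, List[str]], bool],            # hypothetical: document-based term filter
    top_t: int = 100,                                   # assumed value of T
) -> List[Facet]:
    docs = search(query, top_t)          # step 1: query result set D
    initial = qdminer(query, docs)       # step 2: initial facet set F
    final: List[Facet] = []
    for facet in initial:
        expanded = expand(query, facet)  # step 3: expand each facet
        original = set(facet)
        # step 4 (assumed rule): keep original terms, filter only the added ones
        final.append([t for t in expanded if t in original or keep(t, docs)])
    return final
```

Passing the four stages in as callables keeps the sketch independent of any particular search engine or knowledge base.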
Like Wikipedia, Freebase is a knowledge base whose content is contributed by users. All knowledge in Freebase is organized in a three-layer structure: domain (Domain) → type (Type) → entity (Topic). An entity can be understood simply as a thing in the real world, and each entity has a number of attributes (Property). Similar entities form a type, and an entity may belong to several types; for example, the director "Feng Xiaogang" belongs to the types "director" (/film/director), "actor" (/film/actor) and "award winner" (/award/award_winner), among others. A type may contain many entities. Similar types form a domain, which is a broader scope; for example, the domain "music" (/music) includes the types "singer" (/music/singer) and "album" (/music/record), among others.
The core feature of Freebase is that it defines a set of metadata, called attributes (Property), for each type. All entities of a type have all the attributes of that type and assign concrete values to them. For example, the type "director" (/film/director) has the attribute "films directed" (/film/director/film); "Feng Xiaogang", who belongs to the type "director", has this attribute with the value [Cell Phone, The Banquet, A World Without Thieves, Be There or Be Square, ...].
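To make the domain → type → entity organization concrete, here is a minimal Python sketch of an entity record. The `Entity` class and its fields are hypothetical illustrations, not Freebase's actual storage format; the identifiers are the ones quoted in the description.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Entity:
    """Toy model of a Freebase topic (illustrative only)."""
    name: str
    types: Set[str] = field(default_factory=set)
    # property id -> list of values (empty list means "no value")
    properties: Dict[str, List[str]] = field(default_factory=dict)

    def domains(self) -> Set[str]:
        # The domain is the first path segment of a type id: "/film/director" -> "/film"
        return {"/" + t.strip("/").split("/")[0] for t in self.types}


feng = Entity(
    name="Feng Xiaogang",
    types={"/film/director", "/film/actor", "/award/award_winner"},
    properties={
        "/film/director/film": ["Cell Phone", "The Banquet", "A World Without Thieves"],
    },
)
```

Here `feng.domains()` yields the two domains `/film` and `/award`, which shows how one entity can span several types and domains.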
The initial query facets in the present invention come from the QDMiner algorithm. For a given query q, the top T retrieval results are obtained from a search engine to form the query result set D, and QDMiner mines query facets in four steps:
List extraction: original lists are extracted from the result set D using several kinds of patterns, including free text, HTML tags and repeated regions.
List weighting: the importance of each original list is assessed based on the idea of tf-idf.
List clustering: similar lists are grouped together with the WQT method to form query facets.
Facet and term ranking: the importance of each query facet and of the terms within it is computed, and the ranked result is output.
The present application builds on the output of QDMiner. The inputs to the system are the original query Q_search, the document set D, and the set F of initial query facets output by QDMiner. Every query facet f ∈ F is expanded in the same way, so the description below uses a single facet f as an example.
The biggest difference between query facet expansion and the set expansion problem is that a query facet has rich context; expansion that ignores this context is likely to over-extend the user's query intent and harm the result. In other words, whether the relationship between the original query Q_search and the facet f can be fully exploited is the key to the quality of the expansion. To find the implicit relationship between query and facet, the application starts from two angles: the query and the facet. The application divides search engine queries into two kinds: entity-level queries and non-entity-level queries. An entity-level query is one whose content corresponds directly to one or more real-world entities, such as "Wang Fei" or "Nokia". A non-entity-level query is one that cannot be mapped directly to an entity, such as "songs of Wang Fei" or "universities in the United States". The distinction is made because for the former the corresponding entity can be found directly by searching the knowledge base, whereas for the latter it cannot be obtained by simple retrieval.
For an entity-level query, the corresponding entity in Freebase is obtained together with its attributes. If the original facet overlaps heavily with one of the attributes, that attribute is used as the expansion of the original facet; this process looks for the relationship starting from the query. If no such attribute can be found, the method falls back to the procedure for non-entity-level queries. For a non-entity-level query, the smallest type in Freebase that contains the original facet is found based on the idea of tf-idf, and Freebase is used to find attributes that are shared by the different terms of the original facet and are related to the query. These attributes are used to further restrict the type, and the entities contained in the restricted type are returned as the expansion of the original facet. The expanded facet can then be filtered using the retrieved documents to ensure the accuracy of the result. The steps of the algorithm are described below.
I. Expanding facets using attributes
The Search API of Freebase is used to search for the entity corresponding to the query; if Freebase returns some entities, the query is likely to be an entity-level query. The Freebase Search API mainly matches the query string against entity names, synonyms and similar fields. Its results are fairly accurate, but not every returned entity is the entity that the query string refers to. For example, for the query "Windows 7" Freebase returns not only "Windows 7" but also related operating systems such as "Windows 8" and "Windows XP", so filtering is required. Wherever the Search API is used in this application, the returned entities are filtered with the same kind of algorithm. For a query Q submitted to the Search API, N entities [E1, E2, ..., EN] are returned. For each entity E, the proportion of the original string accounted for by the longest common substring of each synonym of E and Q is calculated, and the maximum proportion over all synonyms is taken as the string similarity score StrSim of E and Q. If the score is below the threshold R_StrSim, E is filtered out. The formula is: (formula not reproduced in this text)
where Alias(E) is the set of all synonyms of entity E and len denotes the length of a string. To filter out clearly unrelated entities as accurately as possible, the threshold R_StrSim varies with the length of query Q: (formula not reproduced in this text)
where pow(x, y) computes x raised to the power y.
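The filtering idea can be sketched in Python as follows. Because the StrSim and threshold formulas are not available in this text, the sketch makes three labeled assumptions: the longest common substring is normalized by the longer of the two strings, matching is done on lowercased characters rather than segmented words, and a fixed threshold of 0.9 stands in for the length-dependent R_StrSim. The function names are hypothetical.

```python
from typing import Iterable, List, Tuple


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common (contiguous) substring of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def str_sim(query: str, aliases: Iterable[str]) -> float:
    """Best alias similarity; normalizing by the longer string is an assumption."""
    q = query.lower()
    scores = [lcs_length(q, a.lower()) / max(len(q), len(a)) for a in aliases if a]
    return max(scores, default=0.0)


def filter_entities(
    query: str,
    candidates: List[Tuple[str, List[str]]],  # (entity name, aliases)
    threshold: float = 0.9,                   # assumed fixed stand-in for R_StrSim
) -> List[str]:
    # Entities scoring below the threshold are filtered out.
    return [name for name, aliases in candidates if str_sim(query, aliases) >= threshold]


candidates = [
    ("Windows 7", ["Windows 7", "Win7"]),
    ("Windows 8", ["Windows 8"]),
    ("Windows XP", ["Windows XP", "WinXP"]),
]
kept = filter_entities("Windows 7", candidates)
```

With these assumptions only "Windows 7" survives, which matches the behavior the description asks for in its example.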
After clearly unrelated entities have been filtered out by the above method, the top n entities [ES1, ES2, ES3, ..., ESn] corresponding to the search engine query Q_search are taken. The entities returned by the Search API are sorted by relevance; following this order, all attributes of [ES1, ES2, ES3, ..., ESn] are obtained in turn and compared with the original facet f to find attributes with high similarity. The detailed process is as follows.
As shown in Fig. 1, an entity can be represented by a three-layer structure. An attribute of an entity may be multi-valued, single-valued or without a value. The structure of the director "Feng Xiaogang", for example, is shown in Fig. 2.
For the director "Feng Xiaogang", "films directed" (/film/director/film) is a multi-valued attribute, "nationality" (/people/person/nationality) is a single-valued attribute, and "religious belief" (/people/person/religion) is an attribute without a value.
An attribute value (Target) is essentially also an entity, and entities are linked to one another through attributes. If the attributes of the attribute values are obtained in turn, the result is the 2-layer attributes of the source entity E_root. The union of all 1-layer and 2-layer attributes is called the 2-layer attribute net of E_root and is denoted P2. Iterating further yields P3, P4 and so on. The deeper the layer, the longer the path from the source entity to the last-layer attribute value, the weaker the correlation, and the more noise there is. This embodiment mainly uses P2 in the subsequent algorithm.
The structure of an attribute net can be converted into a tree. E_root corresponds to the root node, and the attribute values of each layer (Pn has n layers) correspond to leaf nodes. The sequence of attributes along a path from the root node to a leaf node can be regarded as a generalized attribute. All attributes other than 1-layer attributes are called multi-layer attributes. For "Feng Xiaogang", for example, the 1-layer attributes are shown in Fig. 2. Expanding one more layer, the film "The Banquet" has the attribute "actors in the film" (/film/film/actor) with the value [Zhou Xun, Zhang Ziyi, Ge You, Daniel Wu, ...], so /film/director/film#/film/film/actor is a 2-layer attribute of the director "Feng Xiaogang"; it can be read as "actors in films directed by Feng Xiaogang". An important question for multi-layer attributes is whether the attribute values of the intermediate layer should be treated as intermediate nodes. For the entity "Feng Xiaogang", for example, should all actors in all films he directed be treated as a single attribute, or should they be subdivided into several attributes by film, such as /film/director/film#The Banquet#/film/film/actor? Each way of expanding multi-layer attributes has its own emphasis: the former focuses on the whole, while the latter is more fine-grained. In this embodiment, whether to subdivide is decided according to the dispersion of the attribute values of the preceding layer. For example, the entity "Beijing Metro" has the attribute "lines" (/metropolitan_transit/transit_system/transit_lines) with the value [Beijing Metro Line 1, Beijing Metro Line 2, Beijing Metro Line 4, ...], and each line has the attribute "subway stations" (/metropolitan_transit/transit_line/stops). Because the stations of different lines overlap very little, the 2-layer attribute /metropolitan_transit/transit_system/transit_lines#/metropolitan_transit/transit_line/stops can be subdivided into /metropolitan_transit/transit_system/transit_lines#Beijing Metro Line 1#/metropolitan_transit/transit_line/stops and so on. The calculation is as follows.
For an attribute P of entity E, the question does not arise if P is single-valued. If P is multi-valued, its values are denoted [T1, T2, T3, ..., TN]. The attributes of T1, T2, T3, ..., TN may not be identical: T1 may have an attribute that T2 lacks (T1 and T2 need not belong to exactly the same types). For an attribute P', suppose K of the entities in [T1, T2, T3, ..., TN] have this attribute, denoted [Ts1, Ts2, Ts3, ..., TsK]. If K is 1, the question again does not arise. If K > 1, the dispersion of the set T = [T1, T2, T3, ..., TN] on attribute P' is calculated as follows: (formula not reproduced in this text)
The formula uses a symbol, lost in this text, for the set of values of attribute P of entity E. The closer S_diversity is to 1, the greater the dispersion of attribute P#P'; a value of 1 means that the values of attribute P' do not overlap across the entities in T. When S_diversity > 0.5, that is, when each value in the value set of P#P' is repeated less than once on average, attribute P#P' is subdivided into several attributes P#Ts#P'.
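Since the dispersion formula itself is missing here, the Python sketch below uses one definition that is consistent with the stated properties: the number of distinct values of P' divided by the total number of values across the intermediate entities. This equals 1 when nothing overlaps and exceeds 0.5 exactly when each value is repeated less than once on average. Treat it as an assumed reconstruction; the function name `diversity` is hypothetical.

```python
from typing import Dict, List


def diversity(values_per_entity: Dict[str, List[str]]) -> float:
    """Assumed S_diversity: distinct values / total values over the K entities."""
    total = sum(len(set(v)) for v in values_per_entity.values())
    if total == 0:
        return 0.0
    distinct = len({x for v in values_per_entity.values() for x in v})
    return distinct / total


# Subway lines share few stations, so the 2-layer attribute would be subdivided.
lines = {"Line 1": ["A", "B", "C"], "Line 2": ["C", "D", "E"]}
# Films sharing the same cast overlap heavily, so the attribute stays whole.
films = {"Film 1": ["x", "y"], "Film 2": ["x", "y"], "Film 3": ["x", "y"]}

subdivide_lines = diversity(lines) > 0.5
subdivide_films = diversity(films) > 0.5
```

The station and cast names are made-up placeholders chosen only to show one case on each side of the 0.5 threshold.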
The above is the whole process of obtaining attributes, which finally yields P2 of entity E. This large tree can be simplified to the structure shown in Fig. 3: 1-layer attributes, 2-layer attributes, and 2-layer attributes that pass through an intermediate node are all regarded as generalized attributes of entity E.
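As an illustration of how such a net of generalized attributes might be collected, the following Python sketch walks a toy in-memory knowledge base and joins property paths with "#", the separator the description uses. The dictionary layout and the function `attribute_net` are assumptions; the sketch builds only the non-subdivided variant (no intermediate nodes in the path).

```python
from typing import Dict, List

KB = Dict[str, Dict[str, List[str]]]  # entity -> property -> values (toy layout)


def attribute_net(kb: KB, root: str, depth: int = 2) -> Dict[str, List[str]]:
    """Map each generalized attribute (path joined by '#') to its values."""
    net: Dict[str, List[str]] = {}
    frontier: Dict[str, List[str]] = {"": [root]}  # path -> entities reached
    for _ in range(depth):
        reached: Dict[str, List[str]] = {}
        for path, entities in frontier.items():
            for entity in entities:
                for prop, values in kb.get(entity, {}).items():
                    key = f"{path}#{prop}" if path else prop
                    bucket = reached.setdefault(key, [])
                    for value in values:
                        if value not in bucket:  # keep each value once
                            bucket.append(value)
        net.update(reached)
        frontier = reached
    return net


kb: KB = {
    "Feng Xiaogang": {"/film/director/film": ["The Banquet", "Cell Phone"]},
    "The Banquet": {"/film/film/actor": ["Zhou Xun", "Zhang Ziyi", "Ge You"]},
    "Cell Phone": {"/film/film/actor": ["Ge You"]},
}
p2 = attribute_net(kb, "Feng Xiaogang")
```

Raising `depth` would give P3, P4 and so on, at the cost of the extra noise the description warns about.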
Given the generalized attributes shown in Fig. 3 and their values, the attribute most similar to the original facet f can be found. The initial facet f is represented as [Str1, Str2, Str3, ..., StrN]. For each term, the corresponding entities are searched with the Search API in the same way as for the query Q_search, and after filtering by StrSim score the top K returned entities are kept. The top K rather than only the first are kept because terms can be ambiguous. Each entity ES in the list [ES1, ES2, ES3, ..., ESn] for Q_search is then processed in turn: all attributes in its P2 are compared with f, and the attribute with the highest similarity score is taken. If an attribute satisfying the condition is found, the process ends; otherwise it continues with the next ES. The algorithm is as follows.
Algorithm 1: computing the tf value of a property (algorithm listing not reproduced in this text)
where Targets(P) denotes all values of attribute P, and Entity(Str) denotes the set of all Freebase entities returned for Str. A Freebase entity is used at most once; that is, an entity can represent only one term in the facet.
Algorithm 1 yields the number of overlaps CoNum between the attribute and the original facet. This value can be understood as the tf value of the attribute, so the final score of the attribute is computed in a way similar to tf-idf: (formula not reproduced in this text)
where TotalPropertyN corresponds to the total number of documents and is set to 60,000,000 here (approximately the number of entities in Freebase). A threshold is also applied to a quantity whose expression is missing from this text; attributes whose value falls below it are filtered out, and the threshold is set to 0.5 here. If all ES have been tried without finding a qualifying attribute, the method proceeds to the procedure below.
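Because the listing of Algorithm 1 and the score formula are not available in this text, the Python sketch below is only one plausible reading. It counts overlaps greedily with each entity used at most once, assumes an idf term of log(TotalPropertyN / |Targets(P)|), and assumes that the 0.5 threshold applies to the fraction of facet terms covered. All three choices and all names are assumptions.

```python
import math
from typing import Dict, List, Set


def co_num(targets: Set[str], term_entities: Dict[str, List[str]]) -> int:
    """Count facet terms matched by a value of the attribute.

    term_entities maps each facet term to its candidate entities in
    Search API rank order; every entity may represent only one term.
    """
    used: Set[str] = set()
    count = 0
    for candidates in term_entities.values():
        for entity in candidates:
            if entity in targets and entity not in used:
                used.add(entity)
                count += 1
                break  # one match per term is enough
    return count


def property_score(
    targets: Set[str],
    term_entities: Dict[str, List[str]],
    total_n: int = 60_000_000,  # TotalPropertyN from the description
) -> float:
    tf = co_num(targets, term_entities)
    if tf == 0 or not term_entities:
        return 0.0
    if tf / len(term_entities) < 0.5:  # assumed meaning of the 0.5 threshold
        return 0.0
    return tf * math.log(total_n / len(targets))  # assumed idf form


targets = {"The Banquet", "Cell Phone", "A World Without Thieves"}
facet_entities = {
    "The Banquet": ["The Banquet"],
    "Cell Phone": ["Cell Phone (novel)", "Cell Phone"],
    "Titanic": ["Titanic"],
}
overlap = co_num(targets, facet_entities)
```

In this toy facet two of the three terms are covered by the attribute, so it passes the assumed coverage threshold.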
II. Expanding facets using types and restrictions
Most search engine queries are not entity-level queries, and even for an entity-level query it may be hard to find a matching attribute in the knowledge base. For example, the entity "watch" has no attribute such as "color" in Freebase, perhaps because "color" is not a distinctive attribute: any product comes in different colors. Starting from the facet is therefore an important alternative.
First, the types that contain the original facet f are found and scored with a tf-idf-like method, and the type with the highest score is selected. In the ideal case, all entities of this type could be used as the expansion of the original facet, but in practice the types in Freebase are too coarse and need further restriction. In the same way as before, the Search API is used to find the entities corresponding to all terms in facet f, and attributes that are shared by these entities and related to Q_search are used to restrict the type. All entities of the restricted type are returned as the expansion of the original facet. For example, suppose the original facet to be expanded is "films directed by Feng Xiaogang". The best type found is likely to be "award-winning film", which is too broad to be used directly for expansion. If it can be established that these films share the attribute "director of the film" with the value "Feng Xiaogang", MQL (Freebase's dedicated query language) can be used to retrieve the "award-winning films" whose "director is Feng Xiaogang", which greatly improves both the efficiency of the search and the precision of the expansion.
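A "type + restriction" lookup of this kind could be expressed as an MQL query, which is written as JSON. The Python sketch below only builds the query text; it does not call any service (the public Freebase API has since been retired). The type id `/award/award_winning_work` and the property id `/film/film/directed_by` are illustrative assumptions, not identifiers taken from the description, and `limited_type_query` is a hypothetical helper.

```python
import json


def limited_type_query(type_id: str, prop_id: str, value: str) -> str:
    """Build an MQL-style query: entities of type_id whose prop_id equals value.

    A null "name" asks the service to fill in each matching entity's name.
    """
    query = [{"type": type_id, prop_id: value, "name": None}]
    return json.dumps(query, ensure_ascii=False)


mql = limited_type_query(
    "/award/award_winning_work",  # assumed id for "award-winning film"
    "/film/film/directed_by",     # assumed id for "director of the film"
    "Feng Xiaogang",
)
```

Wrapping the pattern in a list asks for all matches rather than a single one, which is what an expansion step needs.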
Similarly, for each Str in facet f [Str1, Str2, Str3, ..., StrN], the Search API is used to find entities in Freebase; after filtering by StrSim score, the structure shown in Fig. 4 is obtained. The tf value of each type is then calculated by the following algorithm:
Algorithm 2: computing the tf value of a type (algorithm listing not reproduced in this text)
The tf value of a type can be understood as the proportion of terms in the facet that the type contains. As in the calculation of the property tf value above, an entity can be used only once; this technique is applied wherever entity ambiguity in Freebase is involved. Once tf has been obtained, types whose tf value is too low are filtered out according to the threshold R_TypeTf, which is calculated as follows: (formula not reproduced in this text)
After the two filtering steps, StrSim and TypeTf, the final score is computed over all remaining entities and their types by a voting process. The top K entities are kept for each term, and each entity votes for the types it belongs to with weight 1/sqrt(Rank(E)), where Rank(E) is the rank of entity E in the entity list of the corresponding term. Because the entities returned by the Search API are ordered, with the most similar first, the rank information can be exploited. As before, if term Str1 and term Str2 both correspond to entity E, E is used only once. The final score of type T is: (formula not reproduced in this text)
where I(T, E) is an indicator function whose value is 1 if entity E belongs to type T and 0 otherwise; TotalTypeN is 60,000,000; and Df(T) is the number of entities that type T contains.
The types are ranked by score and the one with the highest score is taken as the final result. If no suitable type is found, the facet f is not expanded.
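The voting step can be sketched in Python as below. The vote weight 1/sqrt(Rank(E)), the once-only use of each entity, TotalTypeN and Df(T) come from the description; combining the votes with an idf factor log(TotalTypeN / Df(T)) is an assumption, since the score formula is missing here, and `top_k=3` is an arbitrary stand-in for K. The type counts in the example are invented.

```python
import math
from typing import Dict, List, Set


def type_scores(
    term_entities: Dict[str, List[str]],  # term -> entities in Search API rank order
    entity_types: Dict[str, Set[str]],    # entity -> types it belongs to
    type_df: Dict[str, int],              # Df(T): number of entities in type T
    total_n: int = 60_000_000,            # TotalTypeN from the description
    top_k: int = 3,                       # assumed value of K
) -> Dict[str, float]:
    votes: Dict[str, float] = {}
    used: Set[str] = set()
    for ranked in term_entities.values():
        for rank, entity in enumerate(ranked[:top_k], start=1):
            if entity in used:  # an entity votes only once across all terms
                continue
            used.add(entity)
            for t in entity_types.get(entity, ()):
                votes[t] = votes.get(t, 0.0) + 1 / math.sqrt(rank)
    # Assumed tf-idf style combination of votes and type rarity.
    return {t: v * math.log(total_n / type_df[t]) for t, v in votes.items()}


scores = type_scores(
    {"Cell Phone": ["Cell Phone"], "The Banquet": ["The Banquet"]},
    {
        "Cell Phone": {"/film/film"},
        "The Banquet": {"/film/film", "/award/award_winning_work"},
    },
    {"/film/film": 200_000, "/award/award_winning_work": 50_000},
)
best_type = max(scores, key=scores.get)
```

The rarer type gets a larger idf factor, but here the type shared by both terms still wins because it collects twice as many votes.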
Since the type in Freebase is very wide in range, need further to limit to reduce the range of inquiry.Equally, it obtains first Take section f [Str1, Str2, Str3..., StrN] in each corresponding entities of lexical item Str and its 1 layer of attribute P1It (is examined for performance Consider, P is not used2), attribute and attribute value are regarded as several categories to Pair (Property, Target), such as " Feng little Gang " Property " film of director " and value [night fete, mobile phone ...] be considered as (/film/director/film, night fete), (/film/ Director/film, mobile phone) etc., entity each in this way can possess a series of Pair, the computational methods of Pair scores and TypeTf is similar, counts the ratio of appearance of each Pair in all lexical items as final score, each lexical item takes preceding K reality Body, an entity can only be only used once, and for a lexical item, same Pair is used only once, and the score of calculating can be with It is interpreted as the tf values of Pair.Filter out the too low Pair of tf values.Threshold RPairTfCalculating formula is:
For the remaining Pairs, take all synonyms of Target and check whether Target is contained in Qsearch. If it is, the Pair is considered to be associated with the query; otherwise it is deleted. Finally, the same Target may appear in multiple Pairs. For example, the query "us university" has a facet f, "universities in the United States", and the selected Pairs include (/location/location/containedby, United States), (/organization/organization/headquarters, United States), and so on. Because the number of entities taken for each lexical item is limited to K, as described above, wrong Pairs may be selected, so among Pairs with the same Target only the top P by tf value are retained. All entities under the restricted type are then retrieved with an MQL search and added as the extension.
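The Pair scoring and the per-Target pruning described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the input is assumed to already hold, for each lexical item, the Pairs collected from its top K entities; the `threshold` argument stands in for RPairTf, whose formula is not reproduced in the text; and all function names are hypothetical.

```python
from collections import Counter, defaultdict


def pair_tf(pairs_per_item):
    """tf of each (property, target) Pair: share of lexical items containing it.

    pairs_per_item: lexical item -> iterable of (property, target) tuples.
    """
    counts = Counter()
    for pairs in pairs_per_item.values():
        counts.update(set(pairs))  # the same Pair counts once per lexical item
    n = len(pairs_per_item)
    return {pair: c / n for pair, c in counts.items()} if n else {}


def filter_and_dedupe(tf, threshold, top_p=1):
    """Drop low-tf Pairs, then keep only the top_p Pairs for each Target."""
    by_target = defaultdict(list)
    for (prop, target), score in tf.items():
        if score >= threshold:  # threshold is a stand-in for RPairTf
            by_target[target].append((score, prop))
    kept = {}
    for target, scored in by_target.items():
        for score, prop in sorted(scored, reverse=True)[:top_p]:
            kept[(prop, target)] = score
    return kept
```

With `top_p=1`, a Target such as "United States" that is reachable through both `/location/location/containedby` and `/organization/organization/headquarters` keeps only the property with the higher tf value.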
When judging whether the Target in a Pair is related to Qsearch, the n-layer attributes Pn of Target can be used in addition to its synonyms (for performance reasons, P1 is used here). If Target has an attribute value that is contained in Qsearch, the Pair is likewise considered to be related to Qsearch. In other words, P1 was used earlier to find candidate Pairs as possible restrictions, and this step additionally obtains the P1 of the Target in each Pair, which is equivalent to obtaining part of the source entity's P2.
As shown in Fig. 5, each rectangle represents the set of all entities corresponding to one lexical item, circular nodes are the attribute values of the entities' 1-layer attributes, and triangular and diamond nodes are the attribute values of the entities' 2-layer attributes. In the algorithm above, because only the Pairs with higher tf values are retained, further obtaining the P1 of Target amounts to retrieving only the diamond nodes; the triangular nodes are never actually retrieved. For example, the query "us university" has a facet "universities in New York State", and these universities share the Pair (/location/location/containedby, New York). By obtaining the P1 of "New York", it is found that "New York" has the attribute "contained by" (/location/location/containedby) with value "United States", so the Pair (/location/location/containedby, New York) is a restriction related to Qsearch.
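The relevance check, including the New York example, can be sketched as below. The assumptions are that "contained in Qsearch" is tested as a case-insensitive whole-word match, and that synonyms and 1-layer attribute values are available as plain dictionaries; the helper names and the dictionary layout are hypothetical and do not correspond to any Freebase API.

```python
import re


def mentioned_in(name, query):
    """Case-insensitive whole-word match of name inside the query string."""
    pattern = r"\b" + re.escape(name.lower()) + r"\b"
    return re.search(pattern, query.lower()) is not None


def target_is_relevant(target, query, aliases, p1_values):
    """True if Target, or one of its 1-layer attribute values, is named in the query.

    aliases:   name -> iterable of synonyms
    p1_values: name -> iterable of 1-layer attribute values of that name
    """
    candidates = [target, *p1_values.get(target, ())]
    for name in candidates:
        for alias in {name, *aliases.get(name, ())}:
            if mentioned_in(alias, query):
                return True
    return False
```

For the query "us university", the Target "New York" does not match directly, but its 1-layer attribute value "United States" has the synonym "US", so the Pair is kept as a restriction.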
If no restriction is found and the df value of the type is very large, the facet is not extended; if the df value of the type is reasonable, all entities of the type are used as the extension.
The method of this application differs from the traditional set expansion problem in that Qsearch and the retrieval results are available as context. The extended facet output by the above method can be checked against this context to improve the precision of the extension. Two checks are currently used: the text occurrence score NameOcc and the co-occurrence score CoOcc. The proportion of documents in the search engine's document collection that contain the entity is taken as the text occurrence score; the average number, per document, of lexical items of the original facet f that occur together with the entity is taken as the co-occurrence score. Entities with NameOcc=0 or CoOcc=0 are filtered out.
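The two context checks can be sketched as follows. The sketch assumes that documents are plain strings, that occurrence is tested by case-insensitive substring matching, and that CoOcc is averaged over the documents that mention the entity; the text does not fix these details, and the function names are hypothetical.

```python
def name_occ(entity, docs):
    """Share of documents that mention the entity."""
    if not docs:
        return 0.0
    e = entity.lower()
    return sum(e in d.lower() for d in docs) / len(docs)


def co_occ(entity, facet_items, docs):
    """Average number of original facet items found in documents mentioning the entity."""
    e = entity.lower()
    hits = [d.lower() for d in docs if e in d.lower()]
    if not hits:
        return 0.0
    items = [i.lower() for i in facet_items]
    return sum(sum(i in d for i in items) for d in hits) / len(hits)


def context_filter(candidates, facet_items, docs):
    """Keep only candidates with NameOcc > 0 and CoOcc > 0."""
    return [
        c for c in candidates
        if name_occ(c, docs) > 0 and co_occ(c, facet_items, docs) > 0
    ]
```

A candidate that never appears in the retrieved documents, or that appears only in documents containing none of the original facet items, is dropped.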
The entities in the extended facet are unordered and can be ranked using context and attributes. In addition to NameOcc and CoOcc, the BM25 model can be used to compute a relevance score BM25Score between the entity description and the query Qsearch; entities are then ranked by the product of these three scores.
Attributes can also be used to rank the extended entities. The PairTf score computed earlier can be regarded as the importance of a Pair. For facet f [Str1, Str2, Str3, ..., StrN], obtain all Pairs and compute PairTf; unlike the process of finding restrictions, these Pairs are not filtered, and the larger the PairTf, the more important the Pair. All Pairs form a set Pairjudge, which serves as the basis for scoring entities. For every entity E, compute the similarity score ProSim between all Pairs of E and Pairjudge as follows:
where I(E, P) is an indicator function whose value is 1 if entity E has Pair P and 0 otherwise. The product of all scores is used as the basis for the final ranking:
ScoreForRank=NameOcc × CoOcc × BM25Score × ProSim
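The final ranking can be sketched as below. The product ScoreForRank follows the formula above; the ProSim function is only an assumed form (the PairTf-weighted share of Pairjudge that the entity possesses), since the ProSim formula itself is not reproduced in the text. The function names and input layout are hypothetical.

```python
def pro_sim(entity_pairs, pair_judge):
    """ASSUMED form of ProSim: PairTf-weighted share of Pairjudge held by the entity.

    entity_pairs: set of (property, target) Pairs of entity E
    pair_judge:   Pair -> PairTf
    """
    total = sum(pair_judge.values())
    if total == 0:
        return 0.0
    return sum(tf for pair, tf in pair_judge.items() if pair in entity_pairs) / total


def score_for_rank(name_occ, co_occ, bm25_score, pro_sim_score):
    """ScoreForRank = NameOcc x CoOcc x BM25Score x ProSim."""
    return name_occ * co_occ * bm25_score * pro_sim_score


def rank_entities(features):
    """features: entity -> (NameOcc, CoOcc, BM25Score, ProSim); best first."""
    return sorted(features, key=lambda e: score_for_rank(*features[e]), reverse=True)
```

Because the scores are multiplied, an entity with a zero in any one component falls to the bottom of the ranking regardless of its other scores.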
The invention discloses a method for mining query facets using a knowledge base. Its main contribution is to use a structured knowledge base to discover the implicit relationship between a query and its facets, so that the target to be extended can be found accurately. The invention uses UserQ as test data to evaluate the algorithm; the results show that the knowledge-base method markedly improves the quality of query facets and raises recall while maintaining precision. In addition, with a knowledge base, facets can be generated directly even when no initial query facets are available. For example, the attributes corresponding to an entity-level query can serve directly as query facets, so the method can not only extend existing facets but also generate new ones.
The above description merely illustrates the invention. It should be understood that the invention is not limited to the above embodiments, and that variants consistent with the inventive concept fall within the scope of protection of the invention.

Claims (4)

1. A query facet generation method based on a knowledge base, characterized in that the method comprises the following steps:
1) For a given query q, obtain the top T retrieval results from a search engine to form the query result set D;
2) Obtain a series of initial query facets f based on the QDMiner algorithm, the series of initial query facets f forming a set F;
3) Extend each initial query facet f;
4) Filter the extended initial query facets f using the retrieved documents to ensure the accuracy of the extension results, and generate the final query facets from the extended initial query facets f;
The method of obtaining a series of initial query facets f based on the QDMiner algorithm in step 2) is specifically:
A. List extraction: extract original lists from the query result set D using multiple patterns, namely text, HTML tags, and repeat regions;
B. List weighting: assess the importance of each original list based on the tf-idf idea;
C. List clustering: cluster similar lists together to form query facets using the WQT method;
D. Query facet and lexical item ranking: compute the importance of the different query facets and of the lexical items within each facet, rank them, and output the final result, obtaining a series of initial query facets f;
The method of extending each initial query facet f in step 3) is specifically: first, search engine queries are divided into two kinds, entity-level queries and non-entity-level queries. For an entity-level query, obtain the entity in Freebase corresponding to the query and obtain its attributes; if the overlap between the original facet and some attribute is very high, use that attribute as the extension of the original facet; if no such attribute can be found, proceed as for a non-entity-level query. For a non-entity-level query, find, based on the tf-idf idea, the smallest type in Freebase that contains the original facet, use Freebase to find the attributes that are shared by the different lexical items of the original facet and associated with the query, use these attributes to further restrict the type, and return the entities contained in the restricted type as the extension of the original facet.
2. The query facet generation method based on a knowledge base according to claim 1, characterized in that the specific method of obtaining the entity in Freebase corresponding to the query is: use the Search API of Freebase to search for the entity corresponding to the query, the Search API matching the query string against the names and synonyms of entities; then filter the returned entities.
3. The query facet generation method based on a knowledge base according to claim 2, characterized in that the method of filtering the returned entities is: for a query Q to the Search API, N entities [E1, E2, ..., EN] are returned; for an entity E among them, perform word segmentation on all of its synonyms and on the query Q, compute the ratio of the longest common word string of each synonym of E and the query Q to the original string, and take the maximum ratio over all synonyms as the string similarity score StrSim of E and Q; if this score is less than the threshold RStrSim, E is filtered out. The formula is:
where Alias(E) is the set of all synonyms of entity E and len denotes the length of a word string; LCS(Q, a) computes the length of the longest common word string of the query Q and the synonym a; the threshold RStrSim varies with the length of the query Q:
where pow(x, y) is x raised to the power y.
4. The query facet generation method based on a knowledge base according to claim 1, characterized in that the method of extending each initial query facet f in step 3) is specifically: first find several types that contain the original facet f, score them using the tf-idf method, and select the highest-scoring type; use the Search API to find the entities corresponding to all lexical items in facet f; among all attributes of these entities, find the attributes that are shared and related to the original query, and use them to restrict the type; return all entities of the restricted type as the extension of the original facet.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510888652.8A CN105550226B (en) 2015-12-07 2015-12-07 A kind of inquiry facet generation method in knowledge based library

Publications (2)

Publication Number Publication Date
CN105550226A CN105550226A (en) 2016-05-04
CN105550226B true CN105550226B (en) 2018-09-04

Family

ID=55829415

Country Status (1)

Country Link
CN (1) CN105550226B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106203632B (en) * 2016-07-12 2018-10-23 中国科学院科技政策与管理科学研究所 A kind of limited knowledge collection recombinant and study and the application system method for being distributed extraction
US11941010B2 (en) * 2020-12-22 2024-03-26 International Business Machines Corporation Dynamic facet ranking

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102402619A (en) * 2011-12-23 2012-04-04 广东威创视讯科技股份有限公司 Search method and device
CN104794163A (en) * 2015-03-25 2015-07-22 中国人民大学 Entity set extension method
CN104915449A (en) * 2015-06-30 2015-09-16 河海大学 Faceted search system and method based on water conservancy object classification labels

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9280587B2 (en) * 2013-03-15 2016-03-08 Xerox Corporation Mailbox search engine using query multi-modal expansion and community-based smoothing

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Facet-based opinion retrieval from blogs;Olga Vechtomova;《Information Processing and Management》;20090717;第71-88页,第3.2节 *
Finding dimensions for queries;Zhicheng Dou 等;《Acm International Conference on Information & Knowledge Management ACM》;20111028;第1311-1320页,第3.1节至3.5节 *
Summarization and Expansion of Search Facets;Aparna Nurani Venkitasubramanian 等;《Ceur Workshop Proceedings》;20130426;第1-2页 *
结合相关规则和本体加权图的查询扩展;郝志峰 等;《计算机应用研究》;20140418;第31卷(第10期);第3028-3032页 *

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant