CN105550226A - Inquiry sub-page generation method based on knowledge base - Google Patents

Info

Publication number
CN105550226A
Authority
CN
China
Prior art keywords
inquiry
face
entity
point
attribute
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510888652.8A
Other languages
Chinese (zh)
Other versions
CN105550226B (en
Inventor
窦志成
文继荣
江政宝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Renmin University of China
Original Assignee
Renmin University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Renmin University of China filed Critical Renmin University of China
Priority to CN201510888652.8A priority Critical patent/CN105550226B/en
Publication of CN105550226A publication Critical patent/CN105550226A/en
Application granted granted Critical
Publication of CN105550226B publication Critical patent/CN105550226B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for generating query facets based on a knowledge base. The method comprises the following steps: (1) for a given query q, obtain the top T retrieval results from a search engine to form a query result set D; (2) obtain a series of initial query facets f using the QDMiner algorithm, the initial query facets forming a set F; (3) expand each initial query facet f; and (4) filter each expanded facet using the retrieved documents to ensure the accuracy of the expansion result, and generate the final query facets from the expanded initial query facets f. Because the query facets are generated with the help of a knowledge base, the method effectively overcomes the limitation that existing methods depend on the retrieval results. By expanding the initial facets with high-quality information from the knowledge base, facet terms that do not occur in the retrieval results, or that were not extracted from them, can be located accurately, which improves both the accuracy and the coverage of the query facets.

Description

A query facet generation method based on a knowledge base
Technical field
The present invention relates to a method for generating query facets based on a knowledge base.
Background
According to the "2013 Research Report on Chinese Internet Users' Search Behavior" issued by the China Internet Network Information Center (CNNIC), by the end of June 2013 China had 470 million search engine users and 324 million mobile search users. The share of users who had used a general-purpose search engine in the preceding half year reached 98%. In the Internet era, the search engine is thus the main gateway through which people enter the network and the main source from which they obtain online information.
Current general-purpose search engines mainly present search results as a list of relevant documents sorted by relevance in descending order. For simple, navigational searches such as "Taobao official website", this presentation meets the need. For complex, informational, or exploratory searches, however, it is too thin: the user must locate and summarize the needed information among thousands of returned results, which is inefficient. In some cases the user's search intent is vague and hard to express accurately in one or two words, for example when searching for knowledge about a related field. In other cases the search is exploratory, and the search engine needs to organize related content by category so that the user can find the wanted information step by step; for example, search on a shopping website can offer restrictions on the brand, style, size, and so on of the goods. For the former case, the main current approach is search suggestion: as the user types in the search box, the search engine suggests possible queries based on previously accumulated search logs. For the latter case, current applications are mainly confined to vertical domains such as goods and hotels. Query facets are an effective solution to these problems. A query facet can be regarded as a summary of a query made from a particular angle; for example, the facets of the query "Wang Fei" include her well-known songs, albums, friends, and awards. Query facets extend the user's query intent and summarize potential query information. They help the user clarify the search intent and also suggest related content, so that the user can search in an exploratory way.
At present, query facet mining methods depend on the document collection returned by the search engine. They use several manually defined patterns to extract lists of terms that appear side by side in the documents, and then generate the final query facets through clustering, ranking, and similar processing. Building on this, another scheme uses supervised learning to train two models, one that judges whether a term belongs to a query facet and one that judges whether two terms belong to the same query facet. Although both methods achieve good results, the accuracy of their output is affected by document quality. First, if the retrieved document set does not contain a facet or term, existing methods have nothing to extract it from. Second, even if the retrieval results contain the facet, existing extraction patterns cannot identify it accurately when it is not presented in list form. Finally, the extracted parallel lists may contain noise items, and existing methods cannot efficiently filter out all of them.
How to solve the above problems has therefore become a technical problem that those skilled in the art urgently need to solve.
Summary of the invention
In view of the problems described in the background, the object of the present invention is to provide a query facet generation method based on a knowledge base. The application uses a knowledge base to generate query facets, which effectively overcomes the limitation that existing methods depend on the retrieval results. By expanding the initial facets with high-quality information from the knowledge base, facet terms that do not occur in the retrieval results, or that were not extracted, can be located accurately, which improves the accuracy and coverage of the query facets.
The object of the invention is achieved through the following technical solution:
A query facet generation method based on a knowledge base, the method comprising the following steps:
1) for a given query q, obtain the top T retrieval results from a search engine to form a query result set D;
2) obtain a series of initial query facets f using the QDMiner algorithm, the initial query facets f forming a set F;
3) expand each initial query facet f;
4) filter each expanded initial query facet f using the retrieved documents to ensure the accuracy of the expansion result, and generate the final query facets from the expanded initial query facets f.
Further, in step 2), the series of initial query facets f is obtained with the QDMiner algorithm as follows:
A. List extraction: extract raw lists from the query result set D using several kinds of patterns (free text, HTML tags, and repeated regions);
B. List weighting: assess the importance of each raw list, following the idea of tf-idf;
C. List clustering: use the WQT method to group similar lists together into query facets;
D. Facet and term ranking: compute the importance of the different query facets and of the terms within each facet, rank them, and output the final result, which is the series of initial query facets f.
Further, in step 3), each initial query facet f is expanded as follows. Search engine queries are first divided into two kinds: entity-level queries and non-entity-level queries. For an entity-level query, the entity in Freebase corresponding to the query is obtained, together with its attributes. If the original facet overlaps strongly with one of these attributes, that attribute is used as the expansion of the original facet; if no such attribute can be found, the method falls back to the non-entity-level procedure. For a non-entity-level query, the smallest type in Freebase that contains the original facet is found following the idea of tf-idf. Freebase is then used to find attributes that are shared by the different terms of the original facet and are related to the query, these attributes are used to restrict the type further, and the entities contained in the restricted type are returned as the expansion of the original facet.
Further, the entity in Freebase corresponding to the query is obtained as follows: the Search API of Freebase is used to search for the entities corresponding to the query (the Search API matches the query string mainly against entity names and synonyms), and the returned entities are then filtered.
Further, the returned entities are filtered as follows. For a Search API query Q, N entities [E_1, E_2, ..., E_N] are returned. For each entity E among them, all synonyms of E and the query Q are word-segmented. For each synonym, the ratio of the longest common word string shared with Q to the length of the longer of the two strings is computed, and the maximum of this ratio over all synonyms is taken as the string similarity score StrSim of E and Q. If this score is less than the threshold R_StrSim, E is filtered out. The formula is:
\[ \mathrm{StrSim}(E, Q) = \max_{a \in \mathrm{Alias}(E)} \frac{\mathrm{LCS}(Q, a)}{\max(\mathrm{len}(Q), \mathrm{len}(a))} \]
where Alias(E) is the set of all synonyms of entity E, len denotes the length of a word string, and LCS(Q, a) is the length of the longest common word string of the query Q and the synonym a. The threshold R_StrSim varies with the length of the query Q:
\[ R_{\mathrm{StrSim}} = \max\left(0.3, \frac{1}{\mathrm{pow}(\mathrm{len}(Q), 1/3)}\right) \]
where pow(x, y) computes x raised to the power y.
Further, in step 3), each initial query facet f is expanded as follows: several types that contain the original facet f are first found and scored with the tf-idf method, and the highest-scoring type is selected; the Search API is used to find the entities corresponding to all terms in the facet f; among all attributes of these entities, the common attributes related to the original query are found and used to restrict the type; and all entities of the restricted type are returned as the expansion of the original facet.
The present invention has the following positive technical effects:
1. Generating query facets with a knowledge base
The application proposes to generate query facets from the knowledge base and the retrieval results together. This exploits the high-quality, comprehensive information in the knowledge base while capturing users' interests and focus through the retrieval results.
2. Using the knowledge base to determine the association between a query and its query facets
For entity-level queries, the attributes of the entity can be matched against the query facets, which determines the relation between the query and its facets and ensures the accuracy of the expansion. For general queries, a "type + restriction" scheme limits the scope of the query and likewise ensures the precision of the expansion.
3. Building a multilayer relational network from the knowledge base
The method considers not only the first-layer attributes (predicates) of each entity in Freebase but also second-layer and deeper attributes, so that more complex relations can be described.
Brief description of the drawings
Fig. 1 is a schematic diagram of the three-layer structure of an entity;
Fig. 2 is a schematic diagram of the three-layer structure of "Feng Xiaogang";
Fig. 3 is a schematic diagram of the tree structure of a 2-layer attribute net;
Fig. 4 is a diagram of the types of a query facet;
Fig. 5 is a schematic structural diagram of part of a 2-layer attribute net.
Detailed description of the embodiments
The application is further described below with reference to the accompanying drawings.
The application provides a query facet generation method based on a knowledge base, the method comprising the following steps:
1) for a given query q, obtain the top T retrieval results from a search engine to form a query result set D;
2) obtain a series of initial query facets f using the QDMiner algorithm, the initial query facets f forming a set F;
3) expand each initial query facet f;
4) filter each expanded initial query facet f using the retrieved documents to ensure the accuracy of the expansion result, and generate the final query facets from the expanded initial query facets f.
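The four steps can be sketched as a small pipeline. This is only an illustrative skeleton: the names `generate_facets`, `search`, `mine_facets`, `expand`, and `keep` are hypothetical placeholders that do not come from the patent, and each stage is passed in as a callable because the text does not fix a concrete implementation at this point.

```python
from typing import Callable, List

Facet = List[str]


def generate_facets(
    query: str,
    top_t: int,
    search: Callable[[str, int], List[str]],               # hypothetical: top-T documents for q
    mine_facets: Callable[[str, List[str]], List[Facet]],  # hypothetical: stand-in for QDMiner
    expand: Callable[[str, Facet], Facet],                 # hypothetical: knowledge-base expansion
    keep: Callable[[str, List[str]], bool],                # hypothetical: document-based filter
) -> List[Facet]:
    docs = search(query, top_t)          # step 1: query result set D
    initial = mine_facets(query, docs)   # step 2: set F of initial facets
    final: List[Facet] = []
    for facet in initial:
        expanded = expand(query, facet)  # step 3: expand one facet f
        # step 4: filter the expanded facet against the retrieved documents
        final.append([term for term in expanded if keep(term, docs)])
    return final
```

How the document-based filter in step 4 decides what to keep is left to the `keep` callable, since the steps above only state that the retrieved documents are used for filtering.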
Like Wikipedia, Freebase is a knowledge base whose content is contributed by users. All knowledge in Freebase is organized in a three-layer structure: domain (Domain) → type (Type) → entity (Topic). An entity can simply be understood as a thing in the real world, and each entity has a number of attributes (Property). Similar entities form a type, and an entity can belong to several types; for example, the director "Feng Xiaogang" belongs to the types "director" (/film/director), "actor" (/film/actor), "award winner" (award/award_winnner), and others. A type may contain many entities. Similar types form a domain, which is a broader scope; for example, the domain "music" (/music) contains the types "singer" (/music/singer), "album" (/music/record), and others.
The core feature of Freebase is that a set of metadata, called attributes (Property), is defined for each type. All entities of a type have all the attributes of that type and assign them concrete values. For example, the type "director" (/film/director) has the attribute "films directed" (/film/director/film). "Feng Xiaogang", who belongs to "director", has this attribute, with the value [Cell Phone, The Banquet, A World Without Thieves, Be There or Be Square, ...].
The initial query facets in the present invention come from the QDMiner algorithm. For a given query q, the top T retrieval results are obtained from a search engine to form the query result set D, and the QDMiner algorithm mines query facets in four steps:
List extraction: extract raw lists from the result set D using several kinds of patterns, such as free text, HTML tags, and repeated regions.
List weighting: assess the importance of each raw list, following the idea of tf-idf.
List clustering: use the WQT method to group similar lists together into query facets.
Facet and term ranking: compute the importance of the different query facets and of the terms within each facet, rank them, and output the final result.
The application builds on the output of QDMiner. The inputs of the system are the original query Q_search, the document collection D, and the set F of initial query facets output by QDMiner. Each query facet f ∈ F is expanded with the same method, so the description below addresses a single facet f.
The biggest difference between query facet expansion and the general set expansion problem is that a query facet has rich context. An expansion that ignores this context is likely to stray too far from the user's query intent, which degrades the expansion. In other words, whether the relation between the original query Q_search and the facet f is fully exploited is the key to the quality of the expansion. To find the implicit relation between the query and the facet, the application approaches it from two angles: the query and the facet. The application divides search engine queries into two kinds: entity-level queries and non-entity-level queries. An entity-level query is one whose content corresponds directly to one or more physically existing entities, such as "Wang Fei" or "Nokia". A non-entity-level query is one that cannot be mapped directly to an entity, such as "songs of Wang Fei" or "universities in the United States". The distinction is made because the entity corresponding to the former can be found directly by searching the knowledge base, whereas the latter cannot be obtained by simple retrieval.
For an entity-level query, the entity in Freebase corresponding to the query is obtained, together with its attributes. If the original facet overlaps strongly with one of these attributes, that attribute is used as the expansion of the original facet; this process looks for the relation starting from the query. If no such attribute can be found, the method falls back to the non-entity-level procedure. For a non-entity-level query, the smallest type in Freebase that contains the original facet is found following the idea of tf-idf. Freebase is then used to find attributes that are shared by the different terms of the original facet and are related to the query, these attributes are used to restrict the type further, and the entities contained in the restricted type are returned as the expansion of the original facet. The expanded facet can be filtered using the retrieved documents to ensure the accuracy of the result. The algorithm flow is described step by step below.
1. Expanding a facet with attributes
The Search API of Freebase is used to search for the entities corresponding to the query; if Freebase returns some entities, the query is probably an entity-level query. The Freebase Search API matches the query string mainly against entity names, synonyms, and similar fields. Its results are relatively accurate, but not every returned entity actually corresponds to the query string. For example, for the query "Windows7", Freebase returns not only "Windows7" but also related operating systems such as "Windows8" and "WindowsXp", so filtering is needed. Wherever the Search API is used in this application, a similar algorithm filters the returned entities. For a Search API query Q, N entities [E_1, E_2, ..., E_N] are returned. For each entity E among them, the ratio of the longest common word string shared by each synonym of E and the query Q to the length of the longer of the two strings is computed, and the maximum of this ratio over all synonyms is taken as the string similarity score StrSim of E and Q. If this score is less than the threshold R_StrSim, E is filtered out. The formula is:
\[ \mathrm{StrSim}(E, Q) = \max_{a \in \mathrm{Alias}(E)} \frac{\mathrm{LCS}(Q, a)}{\max(\mathrm{len}(Q), \mathrm{len}(a))} \]
where Alias(E) is the set of all synonyms of entity E, len denotes the length of a word string, and LCS(Q, a) is the length of the longest common word string of Q and a. To filter out obviously irrelevant entities as accurately as possible, the threshold R_StrSim varies with the length of the query Q:
\[ R_{\mathrm{StrSim}} = \max\left(0.3, \frac{1}{\mathrm{pow}(\mathrm{len}(Q), 1/3)}\right) \]
where pow(x, y) computes x raised to the power y.
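The filter defined by StrSim and R_StrSim can be sketched as follows. Two points are assumptions rather than statements from the text: the query and synonyms are taken as lists of tokens (so len counts tokens after word segmentation), and the "longest common word string" is read as the longest contiguous run of shared tokens. The function names are illustrative only.

```python
from typing import List, Sequence


def lcs_len(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest contiguous run of tokens shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def str_sim(query: Sequence[str], aliases: List[Sequence[str]]) -> float:
    """StrSim(E, Q): best LCS ratio over all synonyms of the entity."""
    if not query:
        return 0.0
    return max(
        (lcs_len(query, a) / max(len(query), len(a)) for a in aliases),
        default=0.0,
    )


def str_sim_threshold(query_len: int) -> float:
    """R_StrSim = max(0.3, 1 / len(Q)^(1/3))."""
    return max(0.3, 1.0 / query_len ** (1.0 / 3.0))


def keep_entity(query: Sequence[str], aliases: List[Sequence[str]]) -> bool:
    """An entity is filtered out when its score is below the threshold."""
    return str_sim(query, aliases) >= str_sim_threshold(len(query))
```

Under these assumptions, the query ["windows", "7"] keeps an entity with the synonym ["windows", "7"] (score 1.0) and drops one whose only synonym is ["windows", "8"] (score 0.5, below the threshold of about 0.79 for a two-token query).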
After the obviously irrelevant entities have been filtered out with the above method, the top Sn entities corresponding to the search engine query Q_search are taken, [ES_1, ES_2, ES_3, ..., ES_n]. The entities returned by the Search API are sorted by relevance, and this order is followed: all attributes of ES_1, ES_2, ES_3, ..., ES_n are obtained in turn and compared with the original facet f to find an attribute with high similarity. The detailed process is as follows.
As shown in Fig. 1, an entity can be represented by a three-layer structure. An attribute of an entity may be multi-valued, single-valued, or without value. The structure of the director "Feng Xiaogang", for example, is shown in Fig. 2.
For the director "Feng Xiaogang", "films directed" (/film/director/film) is a multi-valued attribute, "nationality" (/people/person/nationality) is a single-valued attribute, and "religion" (/people/person/religion) is an attribute without value.
An attribute value (Target) is in essence also an entity, and entities are linked to one another through attributes. If the attributes of the attribute values are obtained in turn, the result is the 2-layer attributes of the source entity E_root. The union of all 1-layer and 2-layer attributes is called the 2-layer attribute net of the source entity E_root and is denoted P_2. Iterating further yields P_3, P_4, and so on. The deeper the layer, the longer the path from the source entity to the last-layer attribute values, the weaker the relevance, and the more noise is introduced. The present embodiment mainly uses P_2 in the algorithm below.
The structure of the attribute net can be converted into a tree. E_root is the root node of the tree, and the attribute values of each attribute layer (P_n has n layers) act as leaf nodes. The attributes traversed along the path from the root node to a leaf node can be regarded as a generalized attribute. All attributes beyond layer 1 are called multilayer attributes. For "Feng Xiaogang", for example, the 1-layer attributes are as shown in Fig. 2. Expanding one more layer, the film "The Banquet" has the attribute "actors in the film" (/film/film/actor) with the value [Zhou Xun, Zhang Ziyi, Ge You, Wu Yanzu, ...], so /film/director/film#/film/film/actor is a 2-layer attribute of the director "Feng Xiaogang", which can be read as "actors in the films directed by Feng Xiaogang". Multilayer attributes raise an important question: should the attribute values of the middle layer be kept as intermediate nodes? For the entity "Feng Xiaogang", for instance, should all actors of all the films he directed be treated as one attribute, or be subdivided into several attributes by film, such as /film/director/film#The Banquet#/film/film/actor? The two ways of expanding multilayer attributes have different emphases: the former focuses on the whole, while the latter expands in finer detail. In the present embodiment, whether to subdivide is decided by the dispersion of the attribute values of the preceding layer. For example, the entity "Beijing Subway" has the attribute "lines" (/metropolitan_transit/transit_system/transit_lines) with the value [Beijing Subway Line 1, Beijing Subway Line 2, Beijing Subway Line 4, ...], and each line has the attribute "stations" (/metropolitan_transit/transit_line/stops). Because the stations of different lines overlap little, when the 2-layer attribute is expanded:
the attribute /metropolitan_transit/transit_system/transit_lines#/metropolitan_transit/transit_line/stops can be subdivided into /metropolitan_transit/transit_system/transit_lines#Beijing Subway Line 1#/metropolitan_transit/transit_line/stops and so on. The computation is as follows.
For an attribute P of entity E, the problem does not arise if P is single-valued. If P is multi-valued, its attribute values are denoted [T_1, T_2, T_3, ..., T_n]. The attributes of T_1, T_2, T_3, ..., T_n are not necessarily identical: T_1 may have an attribute that T_2 lacks (the types to which T_1 and T_2 belong are not exactly the same). For an attribute P', suppose K entities among [T_1, T_2, T_3, ..., T_n] have this attribute, denoted [Ts_1, Ts_2, Ts_3, ..., Ts_K]. If K is 1, the problem does not arise either. If K > 1, the dispersion of the set T = [T_1, T_2, T_3, ..., T_n] on attribute P' is computed as follows:
\[ S_{\mathrm{diversity}}(T, P') = \frac{\left| \bigcup_{Ts_i \in T} \mathrm{Targets}_{P'}(Ts_i) \right|}{\sum_{Ts_i \in T} \left| \mathrm{Targets}_{P'}(Ts_i) \right|} \]
where Targets_P(E) denotes the set of attribute values of attribute P of entity E. The closer S_diversity is to 1, the more dispersed the attribute P#P' is; a value of 1 means that the values of attribute P' for the entities in T do not overlap at all. When S_diversity > 0.5, that is, when each attribute value in the value set of P#P' is repeated fewer than 1 time on average, the attribute P#P' is subdivided into several attributes P#Ts#P'.
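The dispersion score is the number of distinct second-layer values divided by the total number of values, and it can be computed directly. In this sketch the input is a hypothetical mapping from each intermediate entity Ts_i to its values under P'; the function names are illustrative.

```python
from typing import Dict, List


def diversity(targets_by_entity: Dict[str, List[str]]) -> float:
    """S_diversity: |union of value sets| / sum of value-set sizes."""
    value_sets = [set(values) for values in targets_by_entity.values()]
    total = sum(len(s) for s in value_sets)
    if total == 0:
        return 0.0
    distinct = len(set().union(*value_sets))
    return distinct / total


def should_split(targets_by_entity: Dict[str, List[str]]) -> bool:
    """Subdivide P#P' into per-entity attributes P#Ts#P' when S_diversity > 0.5."""
    return diversity(targets_by_entity) > 0.5
```

For two subway lines with stations [A, B, C] and [C, D, E], the score is 5/6, so the attribute is subdivided per line. For three entities that all share the same two values, the score is 2/6 and the attribute is kept as one.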
The above is the complete process of obtaining attributes, which finally yields P_2 of entity E. This large tree can be simplified to the structure shown in Fig. 3: 1-layer attributes, 2-layer attributes, and 2-layer attributes that pass through an intermediate node are all regarded as generalized attributes of entity E.
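As an illustration of how P_2 can be assembled, the sketch below walks a toy in-memory graph and joins property paths with "#", following the path notation used in the examples above. The graph layout (a dict from entity to its properties and target entities) and the function name are assumptions made for the example; it builds only the unsplit generalized attributes and does not query Freebase.

```python
from typing import Dict, List

Graph = Dict[str, Dict[str, List[str]]]  # entity -> property -> target entities


def attribute_net(graph: Graph, root: str, depth: int = 2) -> Dict[str, List[str]]:
    """Map each '#'-joined property path (up to `depth` layers) to its values."""
    net: Dict[str, List[str]] = {}
    frontier: Dict[str, List[str]] = {"": [root]}  # path -> entities reached so far
    for _ in range(depth):
        next_frontier: Dict[str, List[str]] = {}
        for path, entities in frontier.items():
            for entity in entities:
                for prop, targets in graph.get(entity, {}).items():
                    key = f"{path}#{prop}" if path else prop
                    bucket = next_frontier.setdefault(key, [])
                    for target in targets:
                        if target not in bucket:  # keep each value once per path
                            bucket.append(target)
        net.update(next_frontier)
        frontier = next_frontier
    return net
```

With a toy graph in which "Feng Xiaogang" directs "The Banquet" and "Cell Phone", the result contains the 1-layer key /film/director/film and the 2-layer key /film/director/film#/film/film/actor, whose value list merges the casts of both films.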
With the generalized attributes shown in Fig. 3 and their corresponding values, the attribute most similar to the original facet f can be found. An initial facet f is denoted [Str_1, Str_2, Str_3, ..., Str_n]. For each term in it, the same method as for the query Q_search is used: the Search API is searched for the corresponding entities, and the top K returned entities that survive filtering by StrSim score are kept. More than the first entity is kept because of ambiguity. The entities ES in the Q_search list [ES_1, ES_2, ES_3, ..., ES_n] are then used in turn: all attributes in the P_2 of ES are compared with f, and the attribute with the highest similarity score is taken. If an attribute that satisfies the condition is found, the process stops; otherwise it continues with the next ES. The algorithm is as follows.
Algorithm 1: computing the Property tf value
where Targets(P) denotes all attribute values of attribute P, and Entity(Str) denotes the set of all entities that Freebase returns for Str. An entity in Freebase is used at most once; that is, an entity can represent only one term of the facet.
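The listing of Algorithm 1 is not reproduced in this text, so the following is only one plausible reading of the surrounding description, not the patent's exact procedure. It counts how many facet terms have a candidate entity among Targets(P), using each entity at most once. The greedy, in-order matching and the names `co_num` and `facet_entities` are assumptions.

```python
from typing import List, Set


def co_num(facet_entities: List[List[str]], targets: Set[str]) -> int:
    """Count facet terms matched by Targets(P), each entity used at most once.

    facet_entities[i] is the ranked candidate list Entity(Str_i) for term i.
    """
    used: Set[str] = set()
    count = 0
    for candidates in facet_entities:
        for entity in candidates:
            if entity in targets and entity not in used:
                used.add(entity)   # this entity now represents this term only
                count += 1
                break              # one match per term is enough
    return count
```

Because the matching is greedy, an ambiguous entity claimed by an earlier term is unavailable to later terms; an optimal assignment could in principle match more terms.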
Algorithm 1 yields CoNum, the number of overlaps between the attribute and the original facet. This value can be understood as the tf value of the attribute, so the final attribute score is computed in the spirit of tf-idf:
\[ \mathrm{Score}(P) = \frac{\mathrm{CoNum}}{|f|} \cdot \log \frac{\mathrm{TotalPropertyN}}{|\mathrm{Targets}(P)|} \]
where TotalPropertyN plays the role of the total number of documents and is set to 60 million here (approximately the number of entities in Freebase). A threshold is also applied: attributes that fall below it are filtered out, and the value 0.5 is used here. If all ES have been tried and none has a qualifying attribute, the method below is used instead.
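The attribute score combines a tf-like overlap ratio with an idf-like penalty for attributes that have many values. A minimal sketch follows; the natural logarithm is an assumption, since the text does not state the base, and the function and constant names are illustrative.

```python
import math

TOTAL_PROPERTY_N = 60_000_000  # value stated in the text


def property_score(co_num: int, facet_size: int, n_targets: int,
                   total: int = TOTAL_PROPERTY_N) -> float:
    """Score(P) = CoNum / |f| * log(TotalPropertyN / |Targets(P)|)."""
    if facet_size == 0 or n_targets == 0:
        return 0.0
    return co_num / facet_size * math.log(total / n_targets)
```

For a fixed overlap, an attribute with 10 values scores higher than one with 1,000 values, which favors specific attributes over broad ones.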
2. Expanding a facet with a type and restrictions
Most queries to a search engine are not entity-level queries. Even when a query is entity-level, a corresponding attribute may be hard to find in the knowledge base. For example, the entity "wrist-watch" in Freebase has no attribute like "color", perhaps because "color" is not specific to it: any commodity can come in different colors. Starting from the facet is therefore an important approach.
First, several types that contain the original facet f are found and scored with a tf-idf-like method, and the highest-scoring type is selected. Ideally, all entities of this type could serve as the expansion of the original facet, but in practice the types in Freebase are too coarse and need further restriction. Using a method similar to the one above, the Search API is used to find the entities corresponding to all terms in the facet f. Among all attributes of these entities, the common attributes related to Q_search are found and used to restrict the type, and all entities of the restricted type are returned as the expansion of the original facet. For example, the original facet to be expanded might be "films directed by Feng Xiaogang", and the best type found might be "award-winning film", which is too broad to be used directly for expansion. If these films are found to share the attribute "director of the film" with the value "Feng Xiaogang", MQL (the query language specific to Freebase) can be used to retrieve the "award-winning films" whose "director is Feng Xiaogang", which greatly improves the efficiency of the search and the precision of the expansion.
By a similar method, the Search API is used to find the Freebase entities for each Str in a facet f = [Str_1, Str_2, Str_3, ..., Str_n], and filtering by StrSim score yields the structure shown in Fig. 4. The tf value of each type is then computed with the following algorithm:
Algorithm 2: computing the Type tf value
The tf value of a type can be understood as the proportion of the terms in the facet that are covered by the type. As with the attribute tf value above, an entity can be used only once; this technique is applied wherever entity ambiguity in Freebase is involved. Once tf has been obtained, types whose tf value is too low are filtered out using the threshold R_TypeTf, which is computed as follows:
\[ R_{\mathrm{TypeTf}} = \frac{1}{\mathrm{pow}(|f|, 1/3)} \]
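The listing of Algorithm 2 is not reproduced in this text, so the sketch below is a reading of the description rather than the original algorithm. It takes the Type tf value as the share of facet terms that have an unused candidate entity of the given type, and applies the threshold R_TypeTf. The greedy matching, the input shapes, and all function names are assumptions.

```python
from typing import Dict, List, Set


def type_tf(facet_entities: List[List[str]],
            entity_types: Dict[str, Set[str]],
            type_name: str) -> float:
    """Share of facet terms covered by `type_name`, each entity used once."""
    if not facet_entities:
        return 0.0
    used: Set[str] = set()
    covered = 0
    for candidates in facet_entities:
        for entity in candidates:
            if entity not in used and type_name in entity_types.get(entity, set()):
                used.add(entity)
                covered += 1
                break
    return covered / len(facet_entities)


def type_tf_threshold(facet_size: int) -> float:
    """R_TypeTf = 1 / |f|^(1/3)."""
    return 1.0 / facet_size ** (1.0 / 3.0)


def keep_type(facet_entities: List[List[str]],
              entity_types: Dict[str, Set[str]],
              type_name: str) -> bool:
    """A type survives when its tf value is not below the threshold."""
    tf = type_tf(facet_entities, entity_types, type_name)
    return tf >= type_tf_threshold(len(facet_entities))
```

For a facet of four terms the threshold is about 0.63, so a type covering three of the four terms (tf 0.75) is kept, while a type covering two (tf 0.5) is dropped.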
After StrSim and TypeTf two step is filtered, to all entities retained and corresponding type, the process calculating final score is a voting process, K entity before each lexical item is retained, type ballot it belonged to entity, weight be 1/ (sqrt (Rank (E))) wherein Rank (E) be the sequence of this entity E in the list of entities of corresponding lexical item.Because the entity that SearchAPI returns is orderly, similarity high above, so we can utilize sequencing information.Similar above, if lexical item Str1 and lexical item Str2 is correspondent entity E simultaneously, E can only use once.The score of final type T is:
S c o r e ( T ) = log ( T o t a l T y p e N D f ( T ) ) Σ S t r i n f Σ E i n E n t i t y ( S t r ) I ( T , E ) s q r t ( R a n k ( E ) )
where I(T, E) is an indicator function whose value is 1 if entity E belongs to type T and 0 otherwise; TotalTypeN is the total number of entities, 60 million; and Df(T) is the number of entities that type T contains.
The types are sorted by score, and the highest-scoring type is taken as the final result. If no suitable type is found, facet f is not expanded.
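The voting process and the Score(T) formula can be sketched as follows. This is a hedged illustration: the input layout, the function name `type_scores`, the default K = 5, and the use of the natural logarithm are all assumptions, since the text does not fix them.

```python
import math

TOTAL_TYPE_N = 60_000_000  # total number of entities stated in the text

def type_scores(facet_entities, df, k=5):
    """Score each type by rank-weighted votes times an idf-like factor.

    facet_entities: term -> ranked list of (entity_id, set_of_types),
                    best match first (assumed layout).
    df:             type -> number of entities the type contains.
    k:              entities kept per term (value is an assumption).
    """
    votes = {}
    used = set()  # an entity shared by several terms votes only once
    for term, ranked in facet_entities.items():
        for rank, (entity_id, types) in enumerate(ranked[:k], start=1):
            if entity_id in used:
                continue
            used.add(entity_id)
            for t in types:
                votes[t] = votes.get(t, 0.0) + 1.0 / math.sqrt(rank)
    # Natural log is assumed; the text does not state the base.
    return {t: math.log(TOTAL_TYPE_N / df[t]) * v for t, v in votes.items()}

facet = {
    "term1": [("e1", {"A", "B"}), ("e2", {"A"})],
    "term2": [("e3", {"A"})],
}
df = {"A": 6_000_000, "B": 60}
scores = type_scores(facet, df)
best_type = max(scores, key=scores.get)
```

In this toy input the rare type B outscores the common type A even though A collects more votes, which shows how the log factor penalizes very broad types.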
Because the types in Freebase are very broad, further restriction is needed to narrow the scope of the query. As before, the entities corresponding to each term Str in facet f [Str_1, Str_2, Str_3, ..., Str_n] are obtained first, together with their 1-layer attributes P_1 (for performance reasons, P_2 is not used). Attributes and attribute values are treated as pairs Pair(Property, Target). For example, the attribute "films directed" of "Feng Xiaogang" with the values [The Banquet, Cell Phone, ...] can be regarded as (/film/director/film, The Banquet), (/film/director/film, Cell Phone), and so on, so that each entity has a series of Pairs. The Pair score is computed in a way similar to TypeTf: the proportion of terms in which each Pair appears is taken as its final score. Each term takes its top K entities, an entity can be used only once, and for a given term the same Pair can be counted only once. The resulting score can be understood as the tf value of the Pair. Pairs whose tf value is too low are filtered out. The threshold R_PairTf is computed as:
$$R_{PairTf} = Max\left(\frac{1}{\mathrm{pow}(|f|,\ 1/3)},\ 0.3\right)$$
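The PairTf score and its threshold can be sketched in the same style. This is a minimal illustration under assumptions: the input layout, the function names, and the default K = 5 are not from the text, and the facet is assumed to be non-empty.

```python
def pair_tf(facet_pairs, k=5):
    """Fraction of facet terms in which each (property, target) Pair appears.

    facet_pairs: term -> ranked list of (entity_id, list_of_pairs),
                 where each pair is a (property, target) tuple.
    """
    used = set()
    counts = {}
    for term, ranked in facet_pairs.items():
        seen = set()                   # a Pair counts once per term
        for entity_id, pairs in ranked[:k]:
            if entity_id in used:
                continue               # an entity may be used only once
            used.add(entity_id)
            seen.update(pairs)
        for pair in seen:
            counts[pair] = counts.get(pair, 0) + 1
    n = len(facet_pairs)
    return {pair: c / n for pair, c in counts.items()}

def pair_tf_threshold(facet_size):
    """R_PairTf = Max(1 / pow(|f|, 1/3), 0.3)."""
    return max(1.0 / facet_size ** (1.0 / 3.0), 0.3)

CONTAINED_BY = "/location/location/containedby"
HEADQUARTERS = "/organization/organization/headquarters"
facet = {
    "univ1": [("m.1", [(CONTAINED_BY, "United States"),
                       (HEADQUARTERS, "United States")])],
    "univ2": [("m.2", [(CONTAINED_BY, "United States")])],
    "univ3": [("m.3", [(CONTAINED_BY, "United Kingdom")])],
}
tf = pair_tf(facet)
threshold = pair_tf_threshold(len(facet))
kept = {pair: v for pair, v in tf.items() if v >= threshold}
```

The lower bound of 0.3 only takes effect for large facets; for a facet of 1000 terms the cube-root term would be 0.1, so the threshold stays at 0.3.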
For each remaining Pair, all synonyms of its Target are taken and checked against Q_search: if the Target is contained in Q_search, the Pair is considered related to the query; otherwise it is deleted. Finally, the same Target may appear in several Pairs. For example, the query "us university" has a facet f "universities of the United States", and the selected Pairs include (/location/location/containedby, United States), (/organization/organization/headquarters, United States), and so on. Because the number of entities taken for each term is limited to K, a wrong Pair may be selected, so among Pairs with the same Target only the top P by tf value are retained. MQL is then used to retrieve all entities of the type under the added restrictions as the expansion.
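The relevance check and the per-Target pruning can be sketched as follows. The names `mentioned` and `relevant_pairs`, the alias table, and the whole-word matching rule are assumptions made for illustration; the text only says that the Target or one of its synonyms must be contained in the query.

```python
def mentioned(name, query):
    """True if `name` occurs in `query` as a whole-word phrase (assumed rule)."""
    return f" {name.lower()} " in f" {query.lower()} "

def relevant_pairs(pair_tf, aliases, query, top_p=1):
    """Keep Pairs whose Target (or a synonym) is contained in the query,
    then keep only the top_p Pairs by tf for each Target."""
    by_target = {}
    for (prop, target), tf in pair_tf.items():
        names = [target] + list(aliases.get(target, []))
        if any(mentioned(name, query) for name in names):
            by_target.setdefault(target, []).append(((prop, target), tf))
    kept = []
    for target, items in by_target.items():
        items.sort(key=lambda item: item[1], reverse=True)  # highest tf first
        kept.extend(pair for pair, _ in items[:top_p])
    return kept

CONTAINED_BY = "/location/location/containedby"
HEADQUARTERS = "/organization/organization/headquarters"
pair_scores = {
    (CONTAINED_BY, "United States"): 0.9,
    (HEADQUARTERS, "United States"): 0.5,
    (CONTAINED_BY, "New York"): 0.6,
}
aliases = {"United States": ["US", "USA"]}  # hypothetical synonym table
restrictions = relevant_pairs(pair_scores, aliases, "us university")
```

Here the New York Pair is dropped because the query does not mention it, and of the two United States Pairs only the one with the higher tf is kept.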
When judging whether the Target of a Pair is related to Q_search, in addition to synonyms, the N-layer attributes P_N of the Target can also be obtained (for performance reasons, P_1 is used here); if an attribute value contained in Q_search exists, the Pair is likewise considered related to Q_search. In other words, P_1 was used earlier to find candidate Pairs as possible restrictions, and this step additionally obtains the P_1 of each Pair's Target, which is equivalent to obtaining part of the P_2 of the source entity.
As shown in Figure 5, the rectangle represents the set of all entities corresponding to the terms, circular nodes are attribute values of the entities' 1-layer attributes, and triangular and diamond nodes are attribute values of the entities' 2-layer attributes. In the algorithm above, only Pairs with high tf values are retained, so when the P_1 of a Target is obtained, only the diamond nodes are actually retrieved; the triangular nodes are never fetched. For example, the query "us university" has a facet "universities of New York", and these universities have the Pair (/location/location/containedby, New York). By obtaining the P_1 of "New York", it is found that "New York" has the attribute "contained by" (/location/location/containedby) with the value "United States", so the Pair (/location/location/containedby, New York) is a restriction related to Q_search.
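The one-hop check in the New York example can be sketched as follows. The lookup table `one_hop` stands in for a Freebase request and, like the alias table and function names, is an assumption for illustration.

```python
def mentioned(name, query):
    """True if `name` occurs in `query` as a whole-word phrase (assumed rule)."""
    return f" {name.lower()} " in f" {query.lower()} "

def names_of(entity, aliases):
    """The entity's own name plus its synonyms."""
    return [entity] + list(aliases.get(entity, []))

def target_is_relevant(target, query, aliases, one_hop):
    """A Target is relevant if the query mentions it directly, or mentions
    the value of one of its own 1-layer attributes (P_1 of the Target)."""
    if any(mentioned(n, query) for n in names_of(target, aliases)):
        return True
    for _prop, value in one_hop.get(target, []):
        if any(mentioned(n, query) for n in names_of(value, aliases)):
            return True
    return False

CONTAINED_BY = "/location/location/containedby"
aliases = {"United States": ["US", "USA"]}           # hypothetical synonyms
one_hop = {                                           # stand-in for Freebase
    "New York": [(CONTAINED_BY, "United States")],
    "London": [(CONTAINED_BY, "United Kingdom")],
}
ny_ok = target_is_relevant("New York", "us university", aliases, one_hop)
london_ok = target_is_relevant("London", "us university", aliases, one_hop)
```

Only the attributes of Targets that survived the tf filter are looked up, which is why the text describes this as fetching part of P_2 rather than all of it.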
If no restriction is found and the df value of the type is very large, the facet is not expanded; if the df value of the type is reasonable, all entities of the type are used as the expansion.
The method of this application differs from the traditional set expansion problem in that Q_search and the search results are available as context. The expanded facets output by the method above can therefore be checked against this context to improve the precision of the expansion. Two checks are currently used: the text occurrence score NameOcc and the co-occurrence score CoOcc. The text occurrence score is the proportion of documents in the search engine's document set that contain the entity; the co-occurrence score is the average number of terms of the original facet f that co-occur with the entity per document. Entities with NameOcc = 0 or CoOcc = 0 are filtered out.
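The two context checks can be sketched as follows. This is a simplified illustration: matching is done by case-insensitive substring search, and CoOcc is averaged over all documents in the set (documents without the entity contribute zero). Both choices are assumptions, since the text does not specify them.

```python
def context_scores(entity, facet_terms, documents):
    """Return (NameOcc, CoOcc) for a candidate entity.

    NameOcc: share of documents that contain the entity name.
    CoOcc:   average number of original facet terms that appear in the
             same document as the entity, taken over all documents.
    """
    if not documents:
        return 0.0, 0.0
    name = entity.lower()
    terms = [t.lower() for t in facet_terms]
    hits = 0
    co_occurrences = 0
    for doc in documents:
        text = doc.lower()
        if name in text:
            hits += 1
            co_occurrences += sum(1 for t in terms if t in text)
    n = len(documents)
    return hits / n, co_occurrences / n

def keep_entity(entity, facet_terms, documents):
    """Drop entities whose NameOcc or CoOcc is zero."""
    name_occ, co_occ = context_scores(entity, facet_terms, documents)
    return name_occ > 0 and co_occ > 0

docs = [
    "Harvard, Yale and Princeton are Ivy League schools.",
    "Yale and Harvard compete annually.",
    "Stanford is in California.",
]
original_facet = ["Harvard", "Yale"]
princeton = context_scores("Princeton", original_facet, docs)
```

An entity that appears in the result documents but never next to any original facet term is removed, which is the case for "Stanford" in this toy document set.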
The entities in an expanded facet are unordered and can be sorted using context and attributes. In addition to NameOcc and CoOcc, the BM25 model can be used to compute a relevance score BM25Score between the entity description and the query Q_search, and the product of these three scores is used to sort the entities.
The expanded entities can also be sorted using attributes. The PairTf score computed earlier can be regarded as the importance of a Pair. For a facet f [Str_1, Str_2, Str_3, ..., Str_n], all Pairs are obtained and their PairTf is computed; unlike the process of finding restrictions, these Pairs are not filtered, and a Pair with a higher PairTf is more important. All Pairs form a set Pair_judge, which serves as the basis for scoring entities. For every entity E, the similarity score ProSim between all Pairs of E and Pair_judge is computed as follows:
$$ProSim(E, Pair_{judge}) = \sum_{Pa \in Pair_{judge}} I(E, Pa) \times PairTf(Pa)$$
where I(E, Pa) is an indicator function whose value is 1 if entity E has the Pair Pa and 0 otherwise. The product of all the scores is used as the basis for the final ranking:
ScoreForRank=NameOcc×CoOcc×BM25Score×ProSim
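ProSim and the final ranking score can be sketched as follows. The function names and the candidate record layout are assumptions, and BM25Score is taken as a precomputed input rather than implemented here.

```python
def pro_sim(entity_pairs, pair_judge):
    """Sum of PairTf over the Pairs in Pair_judge that the entity has.

    entity_pairs: set of (property, target) Pairs of the entity.
    pair_judge:   dict mapping each Pair of the original facet to its PairTf.
    """
    return sum(tf for pair, tf in pair_judge.items() if pair in entity_pairs)

def score_for_rank(name_occ, co_occ, bm25_score, prosim):
    """ScoreForRank = NameOcc x CoOcc x BM25Score x ProSim."""
    return name_occ * co_occ * bm25_score * prosim

def rank_entities(candidates, pair_judge):
    """Sort candidate records by ScoreForRank, highest first."""
    def score(c):
        return score_for_rank(c["name_occ"], c["co_occ"], c["bm25"],
                              pro_sim(c["pairs"], pair_judge))
    return [c["name"] for c in sorted(candidates, key=score, reverse=True)]

pair_judge = {("p", "US"): 0.9, ("q", "NY"): 0.4}
candidates = [
    {"name": "E1", "name_occ": 0.5, "co_occ": 2.0, "bm25": 1.5,
     "pairs": {("q", "NY")}},
    {"name": "E2", "name_occ": 0.5, "co_occ": 2.0, "bm25": 1.5,
     "pairs": {("p", "US"), ("r", "X")}},
]
ranking = rank_entities(candidates, pair_judge)
```

Because the scores are multiplied, an entity that shares no Pair with the original facet gets ProSim = 0 and therefore falls to the bottom of the ranking regardless of its other scores.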
The invention discloses a method for mining query facets using a knowledge base. The main contribution of the method is that it uses a structured knowledge base to discover the implicit relationship between a query and its facets, and thereby accurately identifies the target to be expanded. The invention uses UserQ as test data to evaluate the algorithm; the results show that the knowledge-base-based method clearly improves the quality of query facets, raising recall while maintaining precision. In addition, with a knowledge base, facets can be generated directly without relying on initial query facets; for example, the attributes corresponding to an entity-level query can be used directly as query facets. The method can therefore not only expand the original facets but also generate new ones.
The above description is intended only to illustrate the invention. It should be understood that the invention is not limited to the above embodiments, and all variants consistent with the inventive concept fall within the protection scope of the invention.

Claims (6)

1. A knowledge-base-based query facet generation method, characterized in that the method comprises the following steps:
1) for a given query q, obtaining the top T search results from a search engine to form a query result set D;
2) obtaining a series of initial query facets f based on the QDMiner algorithm, the series of initial query facets f forming a set F;
3) expanding each of the initial query facets f;
4) filtering the expanded initial query facets f using the retrieved documents to ensure the accuracy of the expansion results, and generating the final query facets from the expanded initial query facets f.
2. The knowledge-base-based query facet generation method according to claim 1, characterized in that the method of obtaining a series of initial query facets f based on the QDMiner algorithm in step 2) is specifically:
A. list extraction: extracting original lists from the query result set D using multiple patterns, including text, HTML tags, and repeated regions;
B. list weighting: assessing the importance of each original list based on the idea of tf-idf;
C. list clustering: grouping similar lists together with the WQT method to form query facets;
D. ranking of query facets and terms: computing the importance of the different query facets and of the terms within each facet, sorting, and outputting the final result, thereby obtaining the series of initial query facets f.
3. The knowledge-base-based query facet generation method according to claim 1, characterized in that the method of expanding each initial query facet f in step 3) is specifically: first, search engine queries are divided into two kinds, entity-level queries and non-entity-level queries; for an entity-level query, the entity in Freebase corresponding to the query is obtained together with its attributes; if the overlap between the original facet and one of the attributes is high, that attribute is used as the expansion of the original facet; if no such attribute can be found, the query is handled as a non-entity-level query; for a non-entity-level query, the smallest type in Freebase that contains the original facet is found based on the idea of tf-idf, Freebase is used to find the attributes that are shared by the different terms of the original facet and related to the query, the type is further restricted by these attributes, and the entities contained in the restricted type are returned as the expansion of the original facet.
4. The knowledge-base-based query facet generation method according to claim 3, characterized in that the entity in Freebase corresponding to the query is obtained as follows: the Search API of Freebase is used to search for the entities corresponding to the query, the Search API mainly matching the query string against the names and synonyms of entities; the returned entities are then filtered.
5. The knowledge-base-based query facet generation method according to claim 4, characterized in that the returned entities are filtered as follows: for a query Q submitted to the Search API, N entities [E_1, E_2, ..., E_N] are returned; for each entity E among them, all of its synonyms and the query Q are segmented into words, the ratio of the longest common word string between each synonym of E and the query Q to the length of the original string is computed, and the maximum of this ratio over all synonyms is taken as the string similarity score StrSim of E and Q; if this score is less than the threshold R_StrSim, E is filtered out; the formula is:
$$StrSim(E, Q) = Max_{a \in Alias(E)} \left( \frac{LCS(Q, a)}{Max(len(Q),\ len(a))} \right)$$
where Alias(E) is the set of all synonyms of entity E, and len denotes the length of a word string; LCS(Q, a) computes the length of the longest common word string of the query Q and the synonym a; the threshold R_StrSim varies with the length of the query Q:
$$R_{StrSim} = Max\left(0.3,\ \frac{1}{\mathrm{pow}(len(Q),\ 1/3)}\right)$$
where pow(x, y) is x raised to the power y.
6. The knowledge-base-based query facet generation method according to claim 1, characterized in that the method of expanding each initial query facet f in step 3) is specifically: first, several types that contain the original facet f are found and scored with the tf-idf method, and the highest-scoring type is selected; the Search API is used to find the entities corresponding to all terms in facet f; among all attributes of these entities, the attributes that are shared and related to the original query are used to restrict the type; and all entities of the restricted type are returned as the expansion of the original facet.
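For illustration only, and not as part of the claims, the StrSim filter of claim 5 can be sketched as follows. Several points are assumptions: words are obtained by whitespace splitting rather than by a real word segmenter, the "longest common word string" is read as the longest common contiguous run of words, lengths are counted in words, and the query is assumed to be non-empty.

```python
def lcs_len(a, b):
    """Length of the longest common contiguous run of words in a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                cur[j] = prev[j - 1] + 1   # extend the run ending before x, y
                best = max(best, cur[j])
        prev = cur
    return best

def str_sim(query, aliases):
    """Maximum over all aliases of LCS(Q, a) / max(len(Q), len(a))."""
    q = query.lower().split()
    best = 0.0
    for alias in aliases:
        a = alias.lower().split()
        if q and a:
            best = max(best, lcs_len(q, a) / max(len(q), len(a)))
    return best

def str_sim_threshold(query):
    """R_StrSim = Max(0.3, 1 / pow(len(Q), 1/3)), with len counted in words."""
    n = len(query.split())
    return max(0.3, 1.0 / n ** (1.0 / 3.0))

def keep_entity(query, aliases):
    """An entity is kept when its StrSim reaches the threshold."""
    return str_sim(query, aliases) >= str_sim_threshold(query)

exact = str_sim("feng xiaogang", ["Xiaogang Feng", "Feng Xiaogang"])
partial = str_sim("new york university", ["New York"])
```

Because the threshold is 1 for a one-word query and falls toward 0.3 as the query grows, short queries demand a near-exact match while long queries tolerate partial overlap.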
CN201510888652.8A 2015-12-07 2015-12-07 A knowledge-base-based query facet generation method Active CN105550226B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510888652.8A CN105550226B (en) 2015-12-07 2015-12-07 A knowledge-base-based query facet generation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510888652.8A CN105550226B (en) 2015-12-07 2015-12-07 A knowledge-base-based query facet generation method

Publications (2)

Publication Number Publication Date
CN105550226A true CN105550226A (en) 2016-05-04
CN105550226B CN105550226B (en) 2018-09-04

Family

ID=55829415

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510888652.8A Active CN105550226B (en) 2015-12-07 2015-12-07 A knowledge-base-based query facet generation method

Country Status (1)

Country Link
CN (1) CN105550226B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106203632A (en) * 2016-07-12 2016-12-07 中国科学院科技政策与管理科学研究所 A kind of limited knowledge collection recombinant is also distributed the study of extraction and application system method
WO2022134778A1 (en) * 2020-12-22 2022-06-30 International Business Machines Corporation Dynamic facet ranking

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102402619A (en) * 2011-12-23 2012-04-04 广东威创视讯科技股份有限公司 Search method and device
US20140280207A1 (en) * 2013-03-15 2014-09-18 Xerox Corporation Mailbox search engine using query multi-modal expansion and community-based smoothing
CN104794163A (en) * 2015-03-25 2015-07-22 中国人民大学 Entity set extension method
CN104915449A (en) * 2015-06-30 2015-09-16 河海大学 Faceted search system and method based on water conservancy object classification labels

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
APARNA NURANI VENKITASUBRAMANIAN 等: "Summarization and Expansion of Search Facets", 《CEUR WORKSHOP PROCEEDINGS》 *
OLGA VECHTOMOVA: "Facet-based opinion retrieval from blogs", 《INFORMATION PROCESSING AND MANAGEMENT》 *
ZHICHENG DOU 等: "Finding dimensions for queries", 《ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT ACM》 *
郝志峰 等: "结合相关规则和本体加权图的查询扩展", 《计算机应用研究》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106203632A (en) * 2016-07-12 2016-12-07 中国科学院科技政策与管理科学研究所 A kind of limited knowledge collection recombinant is also distributed the study of extraction and application system method
CN106203632B (en) * 2016-07-12 2018-10-23 中国科学院科技政策与管理科学研究所 A kind of limited knowledge collection recombinant and study and the application system method for being distributed extraction
WO2022134778A1 (en) * 2020-12-22 2022-06-30 International Business Machines Corporation Dynamic facet ranking
GB2617500A (en) * 2020-12-22 2023-10-11 Ibm Dynamic facet ranking
US11941010B2 (en) 2020-12-22 2024-03-26 International Business Machines Corporation Dynamic facet ranking

Also Published As

Publication number Publication date
CN105550226B (en) 2018-09-04


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant