CN102799586A - Transferred meaning degree determining method and device for sequencing searching result - Google Patents

Transferred meaning degree determining method and device for sequencing searching result

Info

Publication number
CN102799586A
Authority
CN
China
Prior art keywords
word pair
word
closeness
search request
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011101358053A
Other languages
Chinese (zh)
Other versions
CN102799586B (en)
Inventor
程道放
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201110135805.3A
Publication of CN102799586A
Application granted
Publication of CN102799586B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a transferred meaning degree determining method and device for sequencing a searching result. The method comprises the following steps: A, carrying out compactness analysis on a searching request input by a user and determining the compactness of each word pair in the searching request; B, counting the physical distance distribution of each word pair of the searching request in each webpage according to a result of processing the structure information of each webpage in a searching result corresponding to the searching request; and C, utilizing the compactness corresponding to each word pair in the searching request and the physical distance distribution in each webpage to determine the transferred meaning degree of each webpage in the searching result with respect to the searching request, wherein the transferred meaning degree is used for sequencing the webpages in the searching result. Sequencing the searching result with the transferred meaning degree determined by the method and device improves the sequencing of the searching result, so that network resources are saved.

Description

Escape degree determination method and device for search result sorting
[ technical field ]
The invention relates to the technical field of computers, in particular to an escape degree determining method and device for search result sequencing.
[ background of the invention ]
With the continuous development of computer technology, search engines have become the main means for people to obtain information. When a user inputs a search request query, the search engine includes a page matched with the query in a search result and returns the search result to the user.
The ranking of the pages in the search results is based on the degree of matching between the query input by the user and each page, and in current search technology this matching degree generally depends only on the physical distance between the words of the query in the page. However, how closely the words of a query belong together is not reflected in the ranking of search results in current search technology, and the resulting poor ranking keeps users occupying network resources for a long time, thereby wasting network resources.
[ summary of the invention ]
The invention provides an escape degree determining method and device for search result ranking, which improve the ranking of search results and save network resources.
The specific technical scheme is as follows:
an escape degree determination method for search result ranking, the method comprising:
A. performing closeness analysis on a search request input by a user, and determining the closeness of each word pair in the search request;
B. counting the physical distance distribution of each word pair of the search request in each web page, according to the result of structural information processing performed on each web page in the search result corresponding to the search request;
C. determining the escape degree of each web page in the search result for the search request by using the closeness of each word pair in the search request and its physical distance distribution in each web page, wherein the escape degree is used for ranking the web pages in the search result.
Wherein, step A specifically comprises:
A1. performing word segmentation on the search request;
A2. determining each word pair in the search request from the words obtained by the word segmentation;
A3. querying a pre-mined proper name dictionary and/or co-occurrence dictionary to determine the closeness of each word pair, wherein the proper name dictionary contains pre-mined proper nouns, and the co-occurrence dictionary contains the co-occurrence condition of each word pair in an existing data source.
Preferably, step A1 further includes: filtering the words obtained by the word segmentation based on a stop word list.
Specifically, step A2 includes:
forming a word pair from every two adjacent words among the words obtained by the word segmentation; or,
forming word pairs pairwise from the words with strong ideographic capability among the words obtained by the word segmentation, wherein the words with strong ideographic capability are determined according to parts of speech or sentence components in the search request.
In step A3, querying the pre-mined proper name dictionary to determine the closeness of the word pairs specifically includes:
if a proper noun in the proper name dictionary contains a word pair i, determining the closeness of the word pair i to be a preset closeness value, wherein the word pair i is any one of the word pairs in the search request.
In step A3, querying the pre-mined co-occurrence dictionary to determine the closeness of the word pairs specifically includes:
querying the co-occurrence dictionary to determine the co-occurrence condition of the word pair i in the existing data source, wherein the co-occurrence condition comprises the number of occurrences of the word pair i at each distance range grade, and the word pair i is any one of the word pairs in the search request;
determining, among the distance range grades, the distance range grade at which the relative occurrence probability value of the word pair i is largest;
taking the closeness corresponding to the determined distance range grade as the closeness of the word pair i, wherein different distance range grades are preset to correspond to different closeness values.
In addition, the mining of the co-occurrence dictionary specifically includes:
D1. performing word segmentation on the data source, filtering based on a stop word list, and combining the resulting words pairwise into word pairs;
D2. counting the co-occurrence condition in the data source of the word pairs obtained in step D1, and storing the counted co-occurrence condition in the co-occurrence dictionary.
If both the proper name dictionary and the co-occurrence dictionary are used in step A3, and the closeness of a word pair i can be determined by querying the proper name dictionary, the closeness determined from the proper name dictionary is taken as the closeness of the word pair i, wherein the word pair i is any one of the word pairs in the search request.
Specifically, the structural information processing performed on the web page includes:
dividing a webpage into webpage blocks, segments and sentences;
recording the position information of each word in the web page and storing the position information in a database, wherein the position information comprises: the web page block, segment, and sentence in which the word is located, and its intra-sentence offset.
Based on this, the step B specifically includes:
B1. determining the co-occurrence condition of a word pair i in a web page d according to the position information, recorded in the database, of the two words of the word pair i in the web page d, wherein the word pair i is any one of the word pairs in the search request, and the web page d is any one of the web pages in the search result;
B2. counting the physical distance distribution of the word pair i in the web page d according to the co-occurrence condition determined in step B1.
Step C specifically comprises:
C1. determining a weight value weight(i) of the word pair i by using the closeness of the word pair i in the search request;
C2. determining the satisfaction degree fit(i, d) of the web page d for the word pair i by using the physical distance distribution of the word pair i in the web page d in the search result;
C3. determining the escape degree offset_ratio(d, q) of the web page d for the search request q according to the formula
[Formula image BDA0000063449110000041, not reproduced in the source text]
where φ is the set of word pairs in the search request q.
The weight(i) is:
weight(i) = f1(light(i), imp(i)), where light(i) is the closeness of the word pair i, imp(i) is the degree of importance of the word pair i in the search request q, and f1(light(i), imp(i)) is a function with light(i) as the main factor and imp(i) as an adjustment factor, such that for the same imp(i), the larger the value of light(i), the larger the value of weight(i); or,
weight(i) = f2(light(i)), where f2(light(i)) is a function that normalizes light(i).
The imp(i) is determined by at least one of the following factors:
the part of speech of the word pair i in the search request, the sentence component of the word pair i in the search request, and the inverse document frequency of the word pair i.
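The text leaves f1 and f2 open, so the following is only a minimal sketch of two functions that satisfy the stated constraints. The multiplicative form, the parameter `alpha`, and the maximum closeness value `max_light` are assumptions made for illustration and do not come from the source.

```python
def weight_f1(light, imp, alpha=0.5):
    """Hypothetical f1: closeness is the main factor, importance adjusts it.

    For a fixed imp, the result grows with light, as the text requires.
    alpha is an assumed tuning parameter.
    """
    return light * (1.0 + alpha * imp)


def weight_f2(light, max_light=5):
    """Hypothetical f2: normalize closeness by an assumed maximum value."""
    return light / max_light


print(weight_f1(4, 0.5))  # closeness 4, importance 0.5
print(weight_f2(4))       # closeness 4 on an assumed 1..5 scale
```

Any other function that is monotonically increasing in light(i) for a fixed imp(i) would serve equally well as f1.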
The fit(i, d) is:
fit(i, d) = f3(HIT(i, d), light(i)), where HIT(i, d) denotes the counted physical distance distribution of the word pair i in the web page d, light(i) is the closeness of the word pair i, and f3(HIT(i, d), light(i)) is a function with the distance range of the word pair i determined from HIT(i, d) as the main factor and light(i) as an adjustment factor, such that for the same light(i), the smaller the distance range of the word pair i determined from HIT(i, d), the larger the value of fit(i, d); or,
fit(i, d) = f4(HIT(i, d)), where f4(HIT(i, d)) is a function that maps the distance range of the word pair i determined from HIT(i, d) to a specific fit(i, d) value.
Determining the distance range of the word pair i from HIT(i, d) specifically includes:
taking the minimum distance range of the word pair i in HIT(i, d) as the distance range of the word pair i; or,
taking, according to HIT(i, d), the distance range grade with the largest relative occurrence probability value as the distance range grade of the word pair i.
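As an illustration of the first option (minimum distance range) combined with an f4-style mapping, the sketch below represents HIT(i, d) as a dictionary from distance range grade to count. The grade names and the fit values in `FIT_BY_LEVEL` are hypothetical; the source does not specify them.

```python
# Distance range grades, tightest first (names are assumptions).
LEVELS = ["adjacent", "within_n_words", "sentence", "segment", "block"]

# Hypothetical mapping from distance range grade to a fit value.
FIT_BY_LEVEL = {
    "adjacent": 1.0,
    "within_n_words": 0.8,
    "sentence": 0.6,
    "segment": 0.3,
    "block": 0.1,
}


def min_distance_level(hit):
    """Return the tightest grade with at least one co-occurrence, or None."""
    for level in LEVELS:
        if hit.get(level, 0) > 0:
            return level
    return None


def fit_f4(hit):
    """Map the minimum distance range found in HIT(i, d) to a fit value."""
    level = min_distance_level(hit)
    return FIT_BY_LEVEL[level] if level is not None else 0.0


print(fit_f4({"sentence": 2, "block": 5}))
```

A page in which the pair never co-occurs gets a fit of 0.0 here, which is also an assumption.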
An escape degree determining device for search result ranking, the device comprising: a closeness analyzing unit, a distance distribution determining unit, and an escape degree determining unit;
the closeness analyzing unit is configured to perform closeness analysis on a search request input by a user and determine the closeness of each word pair in the search request;
the distance distribution determining unit is configured to count the physical distance distribution of each word pair of the search request in each web page according to the result of structural information processing performed on each web page in the search result corresponding to the search request;
the escape degree determining unit is configured to determine the escape degree of each web page in the search result for the search request by using the closeness of each word pair in the search request and its physical distance distribution in each web page, wherein the escape degree is used for ranking the web pages in the search result.
Wherein, the closeness analyzing unit specifically comprises: a word segmentation processing subunit, a word pair determining subunit, and a closeness determining subunit;
the word segmentation processing subunit is used for carrying out word segmentation processing on the search request;
the word pair determining subunit is configured to determine, by using the words obtained after the word segmentation processing, each word pair in the search request;
the closeness determining subunit is configured to query a pre-mined proper name dictionary and/or co-occurrence dictionary, and determine closeness of each word pair, where the proper name dictionary includes pre-mined proper nouns, and the co-occurrence dictionary includes co-occurrence conditions of each pre-determined word pair in an existing data source.
Preferably, the closeness analyzing unit further comprises: a filtering processing subunit, configured to filter the words obtained by the word segmentation processing subunit based on a stop word list and to send the filtered words to the word pair determining subunit.
Specifically, the word pair determining subunit forms a word pair from every two adjacent words among the words obtained by the word segmentation; or,
forms word pairs pairwise from the words with strong ideographic capability among the words obtained by the word segmentation, wherein the words with strong ideographic capability are determined according to parts of speech or sentence components in the search request.
If a proper noun in the proper name dictionary contains a word pair i, the closeness determining subunit determines the closeness of the word pair i to be a preset closeness value, wherein the word pair i is any one of the word pairs in the search request.
The closeness determining subunit specifically includes: a dictionary query module, a distance grade determining module, and a closeness determining module;
the dictionary query module is used for querying the co-occurrence dictionary to determine the co-occurrence condition of the word pair i in the existing data source, wherein the co-occurrence condition comprises the occurrence times of the word pair i in each distance range grade, and the word pair i is any one of the word pairs in the search request;
the distance grade determining module is used for determining the distance range grade with the maximum relative occurrence probability value of the word pair i in each distance range grade according to the query result of the dictionary querying module;
the closeness determining module is configured to take the closeness corresponding to the distance range grade determined by the distance grade determining module as the closeness of the word pair i, wherein different distance range grades are preset to correspond to different closeness values.
Still further, the closeness analyzing unit further comprises: a co-occurrence dictionary mining subunit, configured to perform word segmentation on the data source and filtering based on the stop word list, combine the resulting words pairwise into word pairs, count the co-occurrence condition of the resulting word pairs in the data source, and store the counted co-occurrence condition in the co-occurrence dictionary.
If the closeness determining subunit uses both a proper name dictionary and a co-occurrence dictionary, and the closeness of the word pair i can be determined by querying the proper name dictionary, the closeness determined from the proper name dictionary is taken as the closeness of the word pair i, wherein the word pair i is any one of the word pairs in the search request.
Still further, the device further comprises: a structure information processing unit, configured to divide a web page into web page blocks, segments, and sentences, record the position information of each word in the web page, and store the position information in a database, wherein the position information comprises: the web page block, segment, and sentence in which the word is located, and its intra-sentence offset.
The distance distribution determining unit specifically includes: a co-occurrence condition determining subunit and a distance distribution counting subunit;
the co-occurrence condition determining subunit is configured to determine, according to location information of two terms of the term pair i in the search request recorded in the database in a web page d, a co-occurrence condition of the term pair i in the web page d, where the term pair i is any one of the term pairs in the search request, and the web page d is any one of the web pages in the search result;
the distance distribution counting subunit is configured to count the physical distance distribution of the word pair i in the web page d according to the co-occurrence condition determined by the co-occurrence condition determining subunit.
The escape degree determination unit specifically includes: a weighted value determining subunit, a satisfaction degree determining subunit and an escape degree determining subunit;
the weight value determining subunit is configured to determine the weight value weight(i) of the word pair i by using the closeness of the word pair i in the search request;
the satisfaction degree determining subunit is configured to determine the satisfaction degree fit(i, d) of the web page d for the word pair i by using the physical distance distribution of the word pair i in the web page d in the search result;
the escape degree determining subunit is configured to determine the escape degree offset_ratio(d, q) of the web page d for the search request q according to a formula [formula image not reproduced in the source text], where φ is the set of word pairs in the search request q.
The weight value determining subunit determines the weight value weight(i) of the word pair i according to weight(i) = f1(light(i), imp(i)) or weight(i) = f2(light(i));
where light(i) is the closeness of the word pair i, imp(i) is the degree of importance of the word pair i in the search request q, f1(light(i), imp(i)) is a function with light(i) as the main factor and imp(i) as an adjustment factor, such that for the same imp(i), the larger the value of light(i), the larger the value of weight(i), and f2(light(i)) is a function that normalizes light(i).
In this case, the escape degree determining unit further includes: an importance determining subunit, configured to determine imp(i) according to at least one of the following factors:
the part of speech of the word pair i in the search request, the sentence component of the word pair i in the search request, and the inverse document frequency of the word pair i.
The satisfaction degree determining subunit determines the satisfaction degree of the web page d for the word pair i according to fit(i, d) = f3(HIT(i, d), light(i)) or fit(i, d) = f4(HIT(i, d));
where HIT(i, d) denotes the counted physical distance distribution of the word pair i in the web page d, light(i) is the closeness of the word pair i, f3(HIT(i, d), light(i)) is a function with the distance range of the word pair i determined from HIT(i, d) as the main factor and light(i) as an adjustment factor, such that for the same light(i), the smaller the distance range of the word pair i determined from HIT(i, d), the larger the value of fit(i, d), and f4(HIT(i, d)) is a function that maps the distance range of the word pair i determined from HIT(i, d) to a specific fit(i, d) value.
In this case, the escape degree determining unit further includes: a distance range determining subunit, configured to determine the distance range of the word pair i according to HIT(i, d), specifically by:
taking the minimum distance range of the word pair i in HIT(i, d) as the distance range of the word pair i; or,
taking, according to HIT(i, d), the distance range grade with the largest relative occurrence probability value as the distance range grade of the word pair i.
As can be seen from the above technical solution, the escape degree determined by the method and device of the invention is based on the closeness of each word pair in the query and the physical distance distribution of each word pair in the web page. The higher the escape degree of a web page for the query, the higher the matching degree in the web page of the word pairs with high closeness in the query. Ranking according to the escape degree therefore gives a better ranking result, so that the user can obtain the required information more quickly from the ranked search results, which saves network resources.
[ description of the drawings ]
FIG. 1 is a flow chart of a main method provided by an embodiment of the present invention;
FIG. 2 is a flowchart of a method for performing closeness analysis on a query according to a first embodiment of the present invention;
FIG. 3 is a flowchart of a method for counting the physical distance distribution of word pairs in a query in a web page according to a second embodiment of the present invention;
FIG. 4 is a flowchart of a method for determining the escape degree of a web page for a query according to a third embodiment of the present invention;
FIG. 5 is a block diagram of an escape degree determining device according to a fourth embodiment of the present invention.
[ detailed description ]
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
Fig. 1 is a flow chart of the main method provided by an embodiment of the present invention. As shown in fig. 1, the method may include the following steps:
Step 101: Perform closeness analysis on the query input by the user, and determine the closeness of each word pair in the query.
Step 102: Count the physical distance distribution of each word pair of the query in each web page according to the result of structural information processing performed on each web page in the search result corresponding to the query.
Step 103: Determine the escape degree of each web page in the search result for the query by using the closeness of each word pair in the query and its physical distance distribution in each web page of the search result, wherein the escape degree is used for ranking the web pages in the search result.
The steps of the above method are described in detail below. First, step 101, the closeness analysis of the query, is described in detail with reference to the first embodiment.
Embodiment 1
Fig. 2 is a flowchart of a method for performing closeness analysis on a query according to the first embodiment of the present invention. As shown in fig. 2, the method may include the following steps:
Step 201: Perform word segmentation on the query.
The word segmentation in this step may use, but is not limited to, a method based on a dictionary and longest matching, or a method based on a statistical model. Since word segmentation is a mature existing technology, it is not described in detail here.
Preferably, the words obtained by word segmentation can be further filtered based on a stop word list, so that words with poor ideographic capability, such as adverbs, function words, and auxiliary words, are filtered out.
Taking the query "who sings a family of relative love" as an example, the words obtained after word segmentation are: "relative love", "of", "a family", "is", "who", "sing" and "of".
After filtering based on the stop word list, the "of" words, which are in the stop word list, are removed, and the remaining words are: "relative love", "a family", "is", "who" and "sing".
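The filtering step can be sketched as follows. The query is assumed to be already segmented (word segmentation itself is not shown), and the one-entry stop word list is a hypothetical stand-in for a real list.

```python
# Hypothetical stop word list; a real list would be far larger.
STOP_WORDS = {"of"}


def filter_stop_words(words, stop_words=STOP_WORDS):
    """Drop every word found in the stop word list, keeping the original order."""
    return [w for w in words if w not in stop_words]


# Segmented query from the example above.
segmented = ["relative love", "of", "a family", "is", "who", "sing", "of"]
print(filter_stop_words(segmented))
```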
Step 202: Determine each word pair in the query from the word segmentation result.
In determining word pairs in a query, at least one of the following strategies may be employed:
Strategy 1: among the words obtained by word segmentation, every two adjacent words form a word pair.
Strategy 2: among the words obtained by word segmentation, the words with strong ideographic capability are combined pairwise into word pairs.
The words with strong ideographic capability can be determined according to parts of speech or sentence components. For example, at least one of nouns, verbs, adjectives, and pronouns is determined to be words with strong ideographic capability, or at least one of subjects, predicates, and objects is determined to be words with strong ideographic capability.
Still taking the query "who sings a family of relative love" as an example, under strategy 1 every two adjacent words among "relative love", "a family", "is", "who" and "sing" form a word pair, giving the word pairs: "relative love" - "a family", "a family" - "is", "is" - "who", "who" - "sing".
Under strategy 2, the words with strong ideographic capability among the words obtained by segmenting the query are "relative love", "a family" and "sing", and these are combined pairwise into word pairs: "relative love" - "a family", "relative love" - "sing", and "a family" - "sing".
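The two strategies can be sketched as below. The set of strong words is passed in directly; deciding which words are strong (by part of speech or sentence component) is outside this sketch, and the function names are illustrative.

```python
from itertools import combinations


def adjacent_pairs(words):
    """Strategy 1: pair every two adjacent words."""
    return list(zip(words, words[1:]))


def strong_pairs(words, strong_words):
    """Strategy 2: pair the strongly ideographic words two by two, in query order."""
    strong = [w for w in words if w in strong_words]
    return list(combinations(strong, 2))


words = ["relative love", "a family", "is", "who", "sing"]
strong_words = {"relative love", "a family", "sing"}  # assumed to be given

print(adjacent_pairs(words))
print(strong_pairs(words, strong_words))
```

Strategy 1 yields n - 1 pairs for n words, while strategy 2 yields k(k - 1)/2 pairs for k strong words, so the second strategy can produce non-adjacent pairs such as "relative love" - "sing".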
Step 203: Query a pre-mined proper name dictionary and/or co-occurrence dictionary to determine the closeness of each word pair, wherein the proper name dictionary contains pre-mined proper nouns, and the co-occurrence dictionary contains the co-occurrence condition of each pre-determined word pair in an existing data source.
This step involves two dictionaries: a proper name dictionary and/or a co-occurrence dictionary. The proper name dictionary can be mined using the prior art. Proper nouns can currently be divided into 18 types, such as person names, place names, movie names, country names, unit names, and organization names. Each type can be mined from its own corpus; for example, proper nouns of the movie name type can be mined using the titles of video websites as the corpus. The mining of the various types is not specifically limited here.
When the closeness of a word pair is determined using the proper name dictionary, if a proper noun in the proper name dictionary contains the word pair, the closeness of the word pair can be determined to be a preset closeness value. For example, the word pair consisting of "relative love" and "a family" hits the proper noun "a family of relative love" in the proper name dictionary, that is, a proper noun in the proper name dictionary contains the word pair, so the closeness of the word pair consisting of "relative love" and "a family" can be set to the highest closeness.
The mining of the co-occurrence dictionary is described below. The data source for mining the co-occurrence dictionary may be, but is not limited to, at least one of the following: web page content, web page titles, and queries in the search log. Word segmentation is performed on each data source; preferably, words with poor ideographic capability are then filtered out based on the stop word list. The remaining words are combined pairwise into word pairs, the co-occurrence condition of each word pair in the data source is counted, and the result is stored in the co-occurrence dictionary.
The co-occurrence condition of each word pair in the co-occurrence dictionary may be stored as: the word pair, a co-occurrence distance range of the word pair, and the number of co-occurrences within that distance range. The distance range may be preset to several levels, for example five levels: web page block, segment, sentence, within N words (N is an integer greater than 2, e.g., within 3 words), and adjacent.
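A minimal sketch of the mining step follows, under simplifying assumptions: the data source is given as already segmented and filtered sentences, only the three sentence-internal levels are counted (block and segment levels would need page structure), and counts are cumulative, so an adjacent co-occurrence also counts at the looser levels. The key layout `(pair, level)` is hypothetical.

```python
from collections import Counter
from itertools import combinations


def mine_cooccurrence(sentences, n_words=3):
    """Count co-occurrences per unordered word pair at three distance levels."""
    counts = Counter()
    for sentence in sentences:
        # Every pair of positions (p < q) inside one sentence.
        for (p, a), (q, b) in combinations(enumerate(sentence), 2):
            if a == b:
                continue  # a word does not form a pair with itself
            pair = tuple(sorted((a, b)))
            gap = q - p
            counts[(pair, "sentence")] += 1
            if gap <= n_words:
                counts[(pair, "within_n_words")] += 1
            if gap == 1:
                counts[(pair, "adjacent")] += 1
    return counts


sentences = [
    ["who", "sing", "a family"],
    ["who", "is", "here", "now", "sing"],
]
cooc = mine_cooccurrence(sentences)
print(cooc[(("sing", "who"), "sentence")])
```

Storing one counter entry per (word pair, distance level) mirrors the triple described above: the word pair, the distance range, and the number of co-occurrences within it.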
When the co-occurrence dictionary is used to perform closeness analysis on the query input by the user, the co-occurrence dictionary is queried to determine the number of occurrences of each word pair of the query at each distance range level, the distance range level with the largest relative occurrence probability value is determined, and the closeness corresponding to that distance range level is taken as the closeness of the word pair. Different distance range levels may be preset to correspond to different closeness values.
For example, for the word pair "who" - "sing", the co-occurrence dictionary records 2 co-occurrences at the adjacent level, 10 at the within-3-words level, 18 at the sentence level, 40 at the segment level, and 60 at the web page block level. The distance range level with the largest relative occurrence probability value is then determined to be "within 3 words", so the co-occurrence distance range level of the word pair "who" - "sing" is "within 3 words", and the closeness level of this word pair is the second closeness level.
The relative occurrence probability P_j of the j-th level may be:
[Formula image BDA0000063449110000111, not reproduced in the source text]
where x_j is the number of co-occurrences of the word pair at the j-th level, x_{j+1} is the number of co-occurrences of the word pair at the (j+1)-th level, and the levels are sorted from high to low according to closeness. Other definitions of the relative occurrence probability value may also be used and are not limited here.
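Since the formula itself is not available in the text, the sketch below uses an assumed definition, not the patent's: the relative growth of the count at a level over the count at the next tighter level. It is chosen only because it reproduces the "who" - "sing" example, and the text explicitly allows other definitions. Level names are hypothetical.

```python
# Levels ordered from tightest to loosest (names are assumptions).
LEVELS = ["adjacent", "within_3_words", "sentence", "segment", "block"]


def relative_probabilities(counts):
    """Assumed definition: growth of a level's count over the next tighter level."""
    probs = {}
    for tighter, level in zip(LEVELS, LEVELS[1:]):
        if counts[tighter] > 0:
            probs[level] = (counts[level] - counts[tighter]) / counts[tighter]
    return probs


def dominant_level(counts):
    """Return the level with the largest assumed relative occurrence probability."""
    probs = relative_probabilities(counts)
    return max(probs, key=probs.get)


# Counts for the word pair "who" - "sing" from the example above.
counts = {"adjacent": 2, "within_3_words": 10, "sentence": 18, "segment": 40, "block": 60}
print(dominant_level(counts))
```

Under this assumption the tightest level has no tighter neighbour and never receives a value, which is a limitation of the sketch rather than a statement about the original formula.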
If the proper name dictionary and the co-occurrence dictionary are used together and a word pair hits both, the proper name dictionary takes priority; that is, the closeness determined by querying the proper name dictionary is taken as the final closeness of the word pair.
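The priority rule can be sketched as a lookup that consults the proper name dictionary first. The data below, the preset value `MAX_CLOSENESS`, and the use of substring matching to test whether a proper noun "contains" a word pair are all assumptions for illustration.

```python
MAX_CLOSENESS = 5  # hypothetical preset closeness for proper-name hits


def closeness(pair, proper_names, cooccurrence_closeness):
    """Proper name dictionary first, co-occurrence dictionary as fallback."""
    first, second = pair
    for name in proper_names:
        # Assumed test: the proper noun contains both words of the pair.
        if first in name and second in name:
            return MAX_CLOSENESS
    # None means neither dictionary knows the pair.
    return cooccurrence_closeness.get(pair)


proper_names = {"a family of relative love"}  # hypothetical entry
cooc_closeness = {("who", "sing"): 4, ("relative love", "a family"): 3}

print(closeness(("relative love", "a family"), proper_names, cooc_closeness))
print(closeness(("who", "sing"), proper_names, cooc_closeness))
```

Plain substring matching can produce false hits on short words, so a real implementation would more likely match on segmented tokens.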
In this embodiment, the closeness of each word pair may be expressed as one of the closeness levels described above or as a specific closeness value.
This completes the flow of the first embodiment. Step 102, including the structural information processing performed on each web page in the search result, is described in detail below with reference to the second embodiment.
Embodiment 2
Fig. 3 is a flowchart of a method for counting the physical distance distribution of each word pair of a query in a web page according to the second embodiment of the present invention. As shown in fig. 3, the method may include the following steps:
Step 301: Perform structural information processing on each web page in the search result corresponding to the query, the structural information processing comprising: dividing the web page into web page blocks, segments, and sentences.
The web page blocks obtained by the division may include, but are not limited to: a title block, an anchor block, a navigation (mypos) block, and a content block. The anchor block and the content block may be divided at a finer granularity.
Each web page block can be further divided into segments, and each segment can be further divided into sentences.
Step 301 may be performed offline. After the structural information processing described above, each word has an absolute position in the web page, and the position information of each word in each web page may be stored in a database to be queried when step 302 is performed online. The position information may be the web page block, segment, sentence, and intra-sentence offset at which each word is located.
Step 302: and according to the position information of each word in the webpage, counting the physical distance distribution of each word pair in the query in the webpage.
From the position information in the web page of the two words of each word pair in the query, the co-occurrence condition of the word pair in the web page can be determined, that is, the number of times the two words co-occur within a web page block, a segment, or a sentence. Because a word pair may appear in a web page multiple times, the physical distance distribution of the word pair in the web page can be counted from this co-occurrence condition to form an array HIT(i, d), where i identifies the word pair, d identifies the web page, and HIT(i, d) is the counted physical distance distribution of word pair i in web page d.
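The counting in step 302 might be sketched in Python as follows. This is only an illustration under stated assumptions: positions are stored as (block, segment, sentence, offset) tuples, the distribution uses five levels (adjacent, within N words, same sentence, same segment, same block), and each pair of occurrences is counted once, at the tightest level it satisfies. The names `Pos`, `distance_level`, and `count_hits` are hypothetical.

```python
from itertools import product
from typing import List, NamedTuple, Optional


class Pos(NamedTuple):
    block: int     # index of the web page block
    segment: int   # index of the segment within the block
    sentence: int  # index of the sentence within the segment
    offset: int    # word offset within the sentence


def distance_level(a: Pos, b: Pos, n: int = 3) -> Optional[int]:
    """Return the tightest level (0..4) shared by two occurrences, or None."""
    if a.block != b.block:
        return None  # no co-occurrence within any block
    if a.segment != b.segment:
        return 4     # same block only
    if a.sentence != b.sentence:
        return 3     # same segment
    gap = abs(a.offset - b.offset)
    if gap <= 1:
        return 0     # adjacent words
    if gap <= n:
        return 1     # within n words
    return 2         # same sentence


def count_hits(pos_a: List[Pos], pos_b: List[Pos]) -> List[int]:
    """Count co-occurrences of two words per distance level (assumed layout of HIT(i, d))."""
    hit = [0] * 5
    for a, b in product(pos_a, pos_b):
        level = distance_level(a, b)
        if level is not None:
            hit[level] += 1
    return hit
```

Whether an occurrence pair should also be counted at the wider levels that contain it is not stated in the text; counting only the tightest level is a design choice of this sketch.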
This concludes the flow of the second embodiment. The method for determining the escape degree of each web page for the query is described in detail below with reference to the third embodiment.
Example III,
Fig. 4 is a flowchart of a method for determining the escape degree of a web page for a query according to the third embodiment of the present invention. As shown in Fig. 4, the method may include the following steps:
Step 401: Determine the weight value weight(i) of each word pair by using the closeness of the word pair in the query and the importance of the word pair in the query.
Here weight(i) = f1(tight(i), imp(i)), where tight(i) is the closeness of word pair i and imp(i) is the importance of word pair i in the query. f1(tight(i), imp(i)) may be a function with tight(i) as the main factor and imp(i) as an adjustment factor, such that for the same imp(i), the larger the value of tight(i), the larger the value of weight(i). For example, the value obtained by normalizing imp(i) may be multiplied by tight(i).
One specific implementation of f1(tight(i), imp(i)) is as follows:
First, the level corresponding to the value of tight(i) is mapped to a weight value g_tight_map[tight(i)], where different levels may be mapped to different weight values. For example, suppose tight(i) has five levels and takes an integer value in [0, 4]; the mapping to weight values then forms an array, assumed here to be g_tight_map[5] = {16, 8, 4, 2, 1}.
Then weight(i) = f1(tight(i), imp(i)) = g_tight_map[tight(i)] + imp(i), where the value range of g_tight_map[tight(i)] may be larger than that of imp(i), so that tight(i) acts as the main factor and imp(i) as the adjustment factor.
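The example mapping above can be written as a short Python sketch. The array values are the ones given in the text; the function name `weight` and the assumption that imp(i) has been normalized to a range well below the gaps between mapped values (for instance [0, 1)) are illustrative only.

```python
# Mapping from closeness level (integer in [0, 4]) to a weight value,
# using the example array from the text.
G_TIGHT_MAP = [16, 8, 4, 2, 1]


def weight(tight: int, imp: float) -> float:
    """weight(i) = g_tight_map[tight(i)] + imp(i).

    tight is the closeness level of the word pair; imp is its importance in
    the query, assumed here to be normalized so that it only fine-tunes the
    mapped value.
    """
    return G_TIGHT_MAP[tight] + imp
```

With imp(i) kept below 1, two word pairs at different closeness levels can never swap order because of importance alone, which is one way to keep closeness as the main factor.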
The importance imp(i) of word pair i in the query may be determined based on at least one of the following factors: the part of speech in the query, the sentence component in the query, and the inverse document frequency (IDF).
The inverse document frequency IDF_i of word pair i is:
Figure BDA0000063449110000131
where Freq_i is the absolute word frequency of word pair i in a large-scale corpus, and M is the maximum absolute word frequency over all word pairs in the large-scale corpus.
Alternatively, the weight value of a word pair may be determined by using only the closeness of the word pair in the query, that is, weight(i) = f2(tight(i)); in this case, f2(tight(i)) may be a function that normalizes tight(i).
Step 402: Determine the satisfaction degree fit(i, d) of the web page for each word pair by using the physical distance distribution in the web page of each word pair in the query and the closeness of each word pair.
Here fit(i, d) = f3(HIT(i, d), tight(i)), where HIT(i, d) is the counted physical distance distribution of word pair i in web page d and tight(i) is the closeness of word pair i. Specifically, f3(HIT(i, d), tight(i)) may be a function with the distance range of word pair i determined by HIT(i, d) as the main factor and tight(i) as an adjustment factor, such that for the same tight(i), the smaller the distance range of word pair i determined by HIT(i, d), the larger the value of fit(i, d).
One specific implementation of f3(HIT(i, d), tight(i)) is as follows:
HIT(i, d) reflects the physical distance distribution of word pair i in web page d and can be understood as the number of co-occurrences in each physical distance range. Suppose HIT(i, d) is an array hit[5], where hit[0] is the number of adjacent co-occurrences of word pair i in web page d; hit[1] is the number of co-occurrences within 3 words in web page d; hit[2] is the number of co-occurrences within a sentence in web page d; hit[3] is the number of co-occurrences within a segment in web page d; and hit[4] is the number of co-occurrences within a web page block in web page d. tight(i) is the closeness of word pair i and is assumed to be an integer in the range [0, 4].
First, HIT(i, d) is quantified into a distance range value hit_value = 16 × hit[0] + 8 × hit[1] + 4 × hit[2] + 2 × hit[3] + hit[4]. The value range of hit_value may be defined as [0, 16]; if the calculated hit_value is greater than 16, it is set to 16.
The mapping from each combination of a tight(i) value and a hit_value to a fit(i, d) value is predefined and may be embodied as a two-dimensional array g_hit_map_fit[tight(i)][hit_value], for example g_hit_map_fit[5][17], whose entries may be floating-point numbers in the range [0, 1]. That is, fit(i, d) = f3(HIT(i, d), tight(i)) = g_hit_map_fit[tight(i)][hit_value].
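A minimal Python sketch of this table-lookup implementation is given below. It assumes that the weights 16, 8, 4, 2, 1 apply to hit[0] through hit[4] in order, from the narrowest to the widest distance range. The text does not give the contents of the 5 × 17 table, so the values generated here are placeholders chosen only to lie in [0, 1]; `hit_value` and `fit` are hypothetical function names.

```python
from typing import List

# Placeholder table: 5 closeness levels x 17 possible hit_value scores.
# The actual values are not specified in the text; these are illustrative only.
G_HIT_MAP_FIT = [[(v / 16) ** (t + 1) for v in range(17)] for t in range(5)]


def hit_value(hit: List[int]) -> int:
    """Quantify a five-level distance distribution into a score capped at 16."""
    score = 16 * hit[0] + 8 * hit[1] + 4 * hit[2] + 2 * hit[3] + hit[4]
    return min(score, 16)


def fit(hit: List[int], tight: int) -> float:
    """fit(i, d) = g_hit_map_fit[tight(i)][hit_value]."""
    return G_HIT_MAP_FIT[tight][hit_value(hit)]
```

Because a single adjacent co-occurrence already reaches the cap of 16, the wider ranges only matter when the word pair never appears adjacently in the web page.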
When the distance range of word pair i is determined from HIT(i, d), the minimum distance range in which word pair i occurs in HIT(i, d) may be used directly as the distance range of word pair i; alternatively, the distance range level with the maximum relative occurrence probability according to HIT(i, d) may be used as the distance range level of word pair i.
Alternatively, fit(i, d) may be determined by using only the distance range of word pair i determined by HIT(i, d), that is, fit(i, d) = f4(HIT(i, d)); in this case, f4(HIT(i, d)) may be a function that maps the distance range of word pair i determined by HIT(i, d) to a specific fit(i, d) value. For example, different distance range levels are associated in advance with different fit(i, d) values, and after the distance range level of word pair i is determined from HIT(i, d), the fit(i, d) value corresponding to that level is used.
Step 403: Determine the escape degree offset_ratio(d, q) of the web page for the query by using the weight value of each word pair in the query and the satisfaction degree of the web page for each word pair.
Wherein,
Figure BDA0000063449110000151
offset_ratio(d, q) is the escape degree of web page d for query q, and φ is the set of word pairs in q.
After the escape degree for the query has been determined for each web page in the search result of the query, the search results can be sorted in descending order of escape degree. The higher the escape degree of a web page for the query, the higher the degree to which the web page matches the word pairs of high closeness in the query, and the better the ranking obtained by sorting on this matching degree.
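The sorting step might look like the following Python sketch. The formula for offset_ratio(d, q) appears only as an image in this text, so the sketch substitutes a weighted average of fit(i, d) over the word pairs in φ; that combination is an assumption made purely for illustration, and the names `offset_ratio` and `rank_pages` are hypothetical.

```python
from typing import Dict, List


def offset_ratio(weights: Dict[str, float], fits: Dict[str, float]) -> float:
    """Assumed stand-in: weighted average of fit(i, d) over the word pairs.

    weights maps each word pair id to weight(i); fits maps each word pair id
    to fit(i, d) for one web page. The real formula is not reproduced here.
    """
    total = sum(weights.values())
    if total == 0:
        return 0.0
    return sum(w * fits.get(i, 0.0) for i, w in weights.items()) / total


def rank_pages(
    weights: Dict[str, float], fits_by_page: Dict[str, Dict[str, float]]
) -> List[str]:
    """Sort page ids in descending order of their offset_ratio score."""
    return sorted(
        fits_by_page,
        key=lambda d: offset_ratio(weights, fits_by_page[d]),
        reverse=True,
    )
```

Any other combination of weight(i) and fit(i, d) could be substituted in `offset_ratio` without changing the sorting code.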
The method provided by the invention is described above, and the device provided by the invention is described in detail below.
Example IV,
Fig. 5 is a structural diagram of an escape degree determining apparatus according to the fourth embodiment of the present invention. The apparatus may be disposed on the server where the search engine is located, or on another server capable of interacting with the search engine. As shown in Fig. 5, the apparatus may include: a closeness analysis unit 500, a distance distribution determining unit 510, and an escape degree determining unit 520.
The closeness analysis unit 500 performs closeness analysis on the query input by the user, and determines closeness of each word pair in the query.
The distance distribution determining unit 510 counts the physical distance distribution of each word pair in the query in each web page of the search result according to the result of the structural information processing performed on each web page in the search result corresponding to the query.
The distance distribution determining unit 510 may obtain a search result corresponding to the query from the search engine.
The escape degree determining unit 520 determines the escape degree of each web page in the search result for the query by using the closeness of each word pair in the query and the physical distance distribution of each word pair in each web page, where the escape degree is used for ranking the web pages in the search result.
The closeness analysis unit 500 may specifically include: a word segmentation processing subunit 501, a word pair determining subunit 502, and a closeness determining subunit 503.
The word segmentation processing subunit 501 performs word segmentation processing on the query. The employed word segmentation processing method can include but is not limited to: dictionary and longest match based methods, or statistical model based methods, etc.
The word pair determining subunit 502 determines each word pair in the query by using the words obtained after the word segmentation processing.
The closeness determining subunit 503 queries a pre-mined proper name dictionary and/or co-occurrence dictionary to determine the closeness of each word pair, where the proper name dictionary contains pre-mined proper nouns and the co-occurrence dictionary contains the pre-counted co-occurrence conditions of word pairs in existing data sources.
Preferably, the closeness analysis unit 500 may further include a filtering processing subunit 504 provided between the word segmentation processing subunit 501 and the word pair determining subunit 502. The filtering processing subunit 504 filters the words obtained by the word segmentation processing subunit 501 based on a stop word list and sends the words remaining after filtering to the word pair determining subunit 502. The word pair determining subunit 502 then determines each word pair in the query by using the words filtered by the filtering processing subunit 504.
When determining each word pair in the query, the word pair determining subunit 502 may form a word pair from every two adjacent words among the words obtained after word segmentation; or it may form word pairs, two by two, from the words with strong ideographic capability among the words obtained after word segmentation, where the words with strong ideographic capability are determined according to the part of speech or the sentence component in the query.
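The filtering and the two pairing strategies can be sketched in Python as follows. The sketch assumes the query has already been segmented into a list of words; the function names and the `is_strong` predicate (standing in for the part-of-speech or sentence-component test, which the text does not specify) are hypothetical.

```python
from itertools import combinations
from typing import Callable, List, Set, Tuple


def filter_stop_words(words: List[str], stop_words: Set[str]) -> List[str]:
    """Drop every word that appears in the stop word list."""
    return [w for w in words if w not in stop_words]


def adjacent_pairs(words: List[str]) -> List[Tuple[str, str]]:
    """Form a word pair from every two adjacent words."""
    return list(zip(words, words[1:]))


def strong_pairs(
    words: List[str], is_strong: Callable[[str], bool]
) -> List[Tuple[str, str]]:
    """Form word pairs, two by two, from the words judged strongly ideographic."""
    strong = [w for w in words if is_strong(w)]
    return list(combinations(strong, 2))
```

Filtering before pairing means that two words separated only by a stop word become adjacent and therefore form a pair under the first strategy.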
When the closeness determining subunit 503 determines the closeness of each word pair by using the proper name dictionary, if a proper noun in the proper name dictionary contains a word pair i, the closeness determining subunit 503 may determine the closeness of the word pair i as a preset closeness value, where the word pair i is any one of the word pairs in the query. The use of the proper name dictionary is not shown in Fig. 5.
The proper name dictionary may be mined in any manner known in the prior art. Proper names can currently be divided into 18 types: person name, place name, movie name, country name, unit name, organization name, and so on.
When the closeness determining subunit 503 determines the closeness of each word pair by using the co-occurrence dictionary, the closeness determining subunit 503 may specifically include: a dictionary lookup module 5031, a distance level determination module 5032 and a closeness determination module 5033.
The dictionary lookup module 5031 queries the co-occurrence dictionary to determine the co-occurrence condition of the word pair i in the existing data source, wherein the co-occurrence condition comprises the occurrence times of the word pair i in each distance range level.
The distance level determining module 5032 determines the distance range level with the maximum relative occurrence probability value of the word pair i in each distance range level according to the query result of the dictionary querying module 5031.
The closeness determination module 5033 takes the closeness corresponding to the distance range level determined by the distance level determination module 5032 as the closeness of the word pair i, where different distance range levels are preset to correspond to different closeness.
To implement offline mining of the co-occurrence dictionary, the closeness analysis unit 500 may further include a co-occurrence dictionary mining subunit 505, which performs word segmentation processing and stop-word-list filtering on the data source, combines the resulting words two by two to form word pairs, counts the co-occurrence conditions of the resulting word pairs in the data source, and stores the counted co-occurrence conditions in the co-occurrence dictionary.
The data sources employed therein may include, but are not limited to: web page content, web page title, and query in the search log.
The co-occurrence condition of each word pair in the co-occurrence dictionary may be stored as: a word pair, a co-occurrence distance range of a word pair, a number of co-occurrences within the distance range. The distance range may be preset to several levels, for example, five levels are divided: a web page block, a segment, a sentence, N words, and adjacent, where N is an integer greater than 2.
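One possible in-memory layout for such dictionary entries is sketched below in Python. It keys a counter by (word pair, distance range level), using the five example levels from the text. Treating the word pair as unordered, as well as the names `LEVELS` and `add_cooccurrence`, are assumptions of this sketch.

```python
from collections import Counter

# The five example distance range levels, from widest to narrowest.
LEVELS = ("block", "segment", "sentence", "n_words", "adjacent")


def add_cooccurrence(dictionary: Counter, word_a: str, word_b: str, level: str) -> None:
    """Record one co-occurrence of a word pair within the given distance range."""
    if level not in LEVELS:
        raise ValueError(f"unknown distance range level: {level}")
    # Assumption: the pair is unordered, so (a, b) and (b, a) share one entry.
    pair = tuple(sorted((word_a, word_b)))
    dictionary[(pair, level)] += 1
```

Each key then corresponds to one stored record: the word pair, its co-occurrence distance range, and the number of co-occurrences within that range.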
If the closeness determining subunit 503 uses both the proper name dictionary and the co-occurrence dictionary, and the closeness of the word pair i can be determined by querying the proper name dictionary, the closeness determined by querying the proper name dictionary is used as the closeness of the word pair i.
To enable the distance distribution determining unit 510 to count the physical distance distribution of each word pair in the query in each web page of the search result, the apparatus may further include: a structure information processing unit 530, configured to divide the web page into web page blocks, segments, and sentences, record the position information of each word in the web page, and store the position information in a database, where the position information includes the web page block, segment, sentence, and intra-sentence offset at which the word is located.
The web page blocks involved in this embodiment include, but are not limited to: a title block, an anchor block, a mypos block, or a content block. The anchor blocks and content blocks may be divided at a finer granularity.
Based on this, the distance distribution determining unit 510 may specifically include: the co-occurrence determination subunit 511 and the distance distribution statistics subunit 512.
The co-occurrence status determining subunit 511 determines the co-occurrence status of the word pair i in the web page d according to the position information of the two words of the word pair i in the query recorded in the database in the web page d, where the web page d is any one of the web pages in the search result.
The distance distribution statistics subunit 512 counts the physical distance distribution of the word pair i in the web page d according to the co-occurrence condition determined by the co-occurrence condition determination subunit 511.
The structure of the escape degree determining unit 520 is described in detail below. The escape degree determining unit 520 may specifically include: a weight value determining subunit 521, a satisfaction degree determining subunit 522, and an escape degree determining subunit 523.
The weight value determining subunit 521 determines the weight value weight(i) of the word pair i by using the closeness of the word pair i in the query.
The satisfaction degree determining subunit 522 determines the satisfaction degree fit(i, d) of the web page d for the word pair i by using the physical distance distribution of the word pair i in the web page d in the search result.
The escape degree determining subunit 523 determines, according to the formula
Figure BDA0000063449110000181
the escape degree offset_ratio(d, q) of the web page d for the query q, where φ is the set of word pairs in the query q.
The weight value determining subunit 521 may determine the weight value weight(i) of the word pair i according to weight(i) = f1(tight(i), imp(i)) or weight(i) = f2(tight(i)).
Here tight(i) is the closeness of word pair i, imp(i) is the importance of word pair i in query q, f1(tight(i), imp(i)) is a function with tight(i) as the main factor and imp(i) as an adjustment factor, such that for the same imp(i), the larger the value of tight(i), the larger the value of weight(i), and f2(tight(i)) is a function that normalizes tight(i).
In this case, the escape degree determining unit 520 may further include: an importance determining subunit 524, configured to determine imp(i) according to at least one of the following factors: the part of speech of the word pair i in the query, the sentence component of the word pair i in the query, and the inverse document frequency of the word pair i.
The satisfaction degree determining subunit 522 may determine the satisfaction degree fit(i, d) of the web page d for the word pair i according to fit(i, d) = f3(HIT(i, d), tight(i)) or fit(i, d) = f4(HIT(i, d)).
Here HIT(i, d) is the counted physical distance distribution of the word pair i in the web page d, and tight(i) is the closeness of the word pair i. f3(HIT(i, d), tight(i)) is a function with the distance range of the word pair i determined by HIT(i, d) as the main factor and tight(i) as an adjustment factor, such that for the same tight(i), the smaller the distance range of the word pair i determined by HIT(i, d), the larger the value of fit(i, d). f4(HIT(i, d)) is a function that maps the distance range of the word pair i determined by HIT(i, d) to a specific fit(i, d) value.
In this case, the escape degree determining unit 520 may further include: a distance range determining subunit 525, configured to determine the distance range of the word pair i according to HIT(i, d), which may specifically include:
using the minimum distance range in which the word pair i occurs in HIT(i, d) as the distance range of the word pair i; or, according to HIT(i, d), using the distance range level with the maximum relative occurrence probability as the distance range level of the word pair i.
After the apparatus shown in Fig. 5 determines the escape degree of each web page in the search result for the query, the escape degrees may be provided to the search engine for ranking the web pages in the search result; the higher the escape degree, the higher the web page is ranked.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (30)

1. An escape degree determination method for search result ranking, the method comprising:
A. performing closeness analysis on a search request input by a user, and determining the closeness of each word pair in the search request;
B. counting, according to the result of structural information processing performed on each web page in the search result corresponding to the search request, the physical distance distribution in each web page of each word pair in the search request;
C. determining the escape degree of each web page in the search result for the search request by using the closeness of each word pair in the search request and the physical distance distribution of each word pair in each web page, wherein the escape degree is used for ranking the web pages in the search result.
2. The method according to claim 1, wherein step A specifically comprises:
A1. performing word segmentation processing on the search request;
A2. determining each word pair in the search request by using the words obtained after the word segmentation processing;
A3. querying a pre-mined proper name dictionary and/or co-occurrence dictionary to determine the closeness of each word pair, wherein the proper name dictionary contains pre-mined proper nouns, and the co-occurrence dictionary contains the co-occurrence condition of each word pair in an existing data source.
3. The method according to claim 2, wherein step A1 further comprises: filtering the words obtained after the word segmentation processing based on a stop word list.
4. The method according to claim 2, wherein step A2 specifically comprises:
forming a word pair from every two adjacent words among the words obtained after the word segmentation processing; or,
forming word pairs, two by two, from the words with strong ideographic capability among the words obtained after the word segmentation processing, wherein the words with strong ideographic capability are determined according to parts of speech or sentence components in the search request.
5. The method of claim 2, wherein querying a pre-mined proper name dictionary to determine the closeness of the word pairs in step A3 comprises:
if a proper noun in the proper name dictionary contains a word pair i, determining the closeness of the word pair i as a preset closeness value, the word pair i being any one of the word pairs in the search request.
6. The method of claim 2, wherein querying a pre-mined co-occurrence dictionary to determine the closeness of the word pairs in step A3 comprises:
querying the co-occurrence dictionary to determine co-occurrence conditions of the word pair i in the existing data source, wherein the co-occurrence conditions comprise the occurrence times of the word pair i in each distance range grade, and the word pair i is any one of the word pairs in the search request;
determining the distance range grade with the maximum relative occurrence probability value of the word pair i in each distance range grade;
taking the closeness corresponding to the determined distance range grade as the closeness of the word pair i, wherein different distance range grades are preset to correspond to different closeness.
7. The method according to claim 2 or 6, wherein the mining of the co-occurrence dictionary specifically comprises:
D1. performing word segmentation processing and stop-word-list filtering on the data source, and combining the obtained words two by two to form word pairs;
D2. counting the co-occurrence condition in the data source of the word pairs obtained in step D1, and storing the counted co-occurrence condition in the co-occurrence dictionary.
8. The method according to claim 2, wherein if a proper name dictionary and a co-occurrence dictionary are both used in step A3, and the closeness of a word pair i can be determined by querying the proper name dictionary, the closeness determined by querying the proper name dictionary is used as the closeness of the word pair i, the word pair i being any one of the word pairs in the search request.
9. The method of claim 1, wherein the processing of the structure information of the web page comprises:
dividing a webpage into webpage blocks, segments and sentences;
recording the position information of each word in the webpage and storing the position information in a database, wherein the position information comprises: the located web page blocks, segments, sentences and intra-sentence offsets.
10. The method according to claim 9, wherein step B specifically comprises:
B1. determining the co-occurrence condition of the word pair i in the web page d according to the position information, recorded in the database, of the two words of the word pair i of the search request in the web page d, wherein the word pair i is any one of the word pairs in the search request, and the web page d is any one of the web pages in the search result;
B2. counting the physical distance distribution of the word pair i in the web page d according to the co-occurrence condition determined in step B1.
11. The method according to claim 1, wherein step C specifically comprises:
C1. determining the weight value weight(i) of the word pair i by using the closeness of the word pair i in the search request;
C2. determining the satisfaction degree fit(i, d) of the web page d for the word pair i by using the physical distance distribution of the word pair i in the web page d in the search result;
C3. determining, according to the formula
Figure FDA0000063449100000031
the escape degree offset_ratio(d, q) of the web page d for the search request q, where φ is the set of word pairs in the search request q.
12. The method of claim 11, wherein weight(i) is:
weight(i) = f1(tight(i), imp(i)), wherein tight(i) is the closeness of the word pair i, imp(i) is the importance of the word pair i in the search request q, and f1(tight(i), imp(i)) is a function with tight(i) as the main factor and imp(i) as an adjustment factor, such that for the same imp(i), the larger the value of tight(i), the larger the value of weight(i); or,
weight(i) = f2(tight(i)), wherein f2(tight(i)) is a function that normalizes tight(i).
13. The method of claim 12, wherein imp(i) is determined by at least one of the following factors:
the part of speech of the word pair i in the search request, the sentence component of the word pair i in the search request, and the inverse document frequency of the word pair i.
14. The method of claim 11, wherein fit(i, d) is:
fit(i, d) = f3(HIT(i, d), tight(i)), wherein HIT(i, d) is the counted physical distance distribution of the word pair i in the web page d, tight(i) is the closeness of the word pair i, and f3(HIT(i, d), tight(i)) is a function with the distance range of the word pair i determined by HIT(i, d) as the main factor and tight(i) as an adjustment factor, such that the smaller the distance range of the word pair i determined by HIT(i, d), the larger the value of fit(i, d); or,
fit(i, d) = f4(HIT(i, d)), wherein f4(HIT(i, d)) is a function that maps the distance range of the word pair i determined by HIT(i, d) to a specific fit(i, d) value.
15. The method of claim 14, wherein determining the distance range of the word pair i from HIT(i, d) specifically comprises:
using the minimum distance range in which the word pair i occurs in HIT(i, d) as the distance range of the word pair i; or,
according to HIT(i, d), using the distance range grade with the maximum relative occurrence probability value as the distance range grade of the word pair i.
16. An escape degree determining apparatus for search result ranking, the apparatus comprising: a closeness analysis unit, a distance distribution determining unit, and an escape degree determining unit;
the closeness analysis unit is configured to perform closeness analysis on a search request input by a user and determine the closeness of each word pair in the search request;
the distance distribution determining unit is configured to count, according to the result of structural information processing performed on each web page in the search result corresponding to the search request, the physical distance distribution in each web page of each word pair in the search request;
the escape degree determining unit is configured to determine the escape degree of each web page in the search result for the search request by using the closeness of each word pair in the search request and the physical distance distribution of each word pair in each web page, wherein the escape degree is used for ranking the web pages in the search result.
17. The apparatus according to claim 16, wherein the closeness analysis unit specifically comprises: a word segmentation processing subunit, a word pair determining subunit, and a closeness determining subunit;
the word segmentation processing subunit is used for carrying out word segmentation processing on the search request;
the word pair determining subunit is configured to determine, by using the words obtained after the word segmentation processing, each word pair in the search request;
the closeness determining subunit is configured to query a pre-mined proper name dictionary and/or co-occurrence dictionary, and determine closeness of each word pair, where the proper name dictionary includes pre-mined proper nouns, and the co-occurrence dictionary includes co-occurrence conditions of each pre-determined word pair in an existing data source.
18. The apparatus of claim 17, wherein the closeness analysis unit further comprises: a filtering processing subunit, configured to filter the words obtained by the word segmentation processing subunit based on a stop word list and send the words remaining after filtering to the word pair determining subunit.
19. The apparatus according to claim 17, wherein the word pair determining subunit forms a word pair from every two adjacent words among the words obtained after the word segmentation processing; or,
forms word pairs, two by two, from the words with strong ideographic capability among the words obtained after the word segmentation processing, wherein the words with strong ideographic capability are determined according to parts of speech or sentence components in the search request.
20. The apparatus of claim 17, wherein if a proper noun in the proper name dictionary contains a word pair i, the closeness determining subunit determines the closeness of the word pair i as a preset closeness value, the word pair i being any one of the word pairs in the search request.
21. The apparatus according to claim 17, wherein the closeness determining subunit specifically comprises: a dictionary query module, a distance grade determining module, and a closeness determining module;
the dictionary query module is used for querying the co-occurrence dictionary to determine the co-occurrence condition of the word pair i in the existing data source, wherein the co-occurrence condition comprises the occurrence times of the word pair i in each distance range grade, and the word pair i is any one of the word pairs in the search request;
the distance grade determining module is used for determining the distance range grade with the maximum relative occurrence probability value of the word pair i in each distance range grade according to the query result of the dictionary querying module;
and the closeness determining module is used for taking the closeness corresponding to the distance range grade determined by the distance grade determining module as the closeness of the word pair i, wherein different distance range grades are preset to correspond to different closeness.
22. The apparatus according to claim 17 or 21, wherein the compactness analysis unit further comprises: and the co-occurrence dictionary mining subunit is used for combining the obtained words pairwise to form word pairs after performing word segmentation processing and filtering based on the stop word list on the data source, counting the co-occurrence conditions of the obtained word pairs in the data source, and storing the counted co-occurrence conditions in a co-occurrence dictionary.
23. The apparatus according to claim 17, wherein if the closeness determination subunit employs both a proper noun dictionary and a co-occurrence dictionary, and the closeness of a word pair i can be determined by querying the proper noun dictionary, the closeness determined by querying the proper noun dictionary is used as the closeness of the word pair i, the word pair i being any one of the word pairs in the search request.
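The precedence rule of claims 20 and 23 can be illustrated with a short sketch. Representing a proper noun as a tuple of tokens, the `preset` value of 1.0, and passing the co-occurrence result in as a ready-made mapping are all assumptions made for the example.

```python
from typing import Dict, Iterable, Optional, Tuple

WordPair = Tuple[str, str]


def pair_closeness(
    pair: WordPair,
    proper_nouns: Iterable[Tuple[str, ...]],
    co_closeness: Dict[WordPair, float],
    preset: float = 1.0,  # hypothetical preset closeness value
) -> Optional[float]:
    """Prefer the proper noun dictionary; fall back to the co-occurrence result."""
    w1, w2 = pair
    for noun in proper_nouns:
        if w1 in noun and w2 in noun:  # the proper noun contains both words
            return preset
    return co_closeness.get(pair)


proper_nouns = [("new", "york", "times")]
co_closeness = {("new", "york"): 0.3, ("cheap", "flights"): 0.6}
```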
24. The apparatus of claim 16, further comprising: a structure information processing unit, configured to divide a web page into web page blocks, segments and sentences, record the position information of every word in the web page, and store the position information in a database, wherein the position information comprises: the web page block, segment and sentence in which the word is located, and the offset of the word within the sentence.
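One way to picture the position information of claim 24 is the sketch below. It assumes the page has already been divided into blocks and segments (a list of lists of strings), splits sentences on terminal punctuation, tokenizes on whitespace, and uses an in-memory dictionary in place of the database. The `Position` type and `index_page` function are hypothetical names.

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from typing import DefaultDict, List


@dataclass(frozen=True)
class Position:
    block: int     # index of the web page block
    segment: int   # index of the segment within the block
    sentence: int  # index of the sentence within the segment
    offset: int    # word offset within the sentence


def index_page(blocks: List[List[str]]) -> DefaultDict[str, List[Position]]:
    """Record the position of every word; `blocks` is a list of segment lists."""
    index: DefaultDict[str, List[Position]] = defaultdict(list)
    for b, segments in enumerate(blocks):
        for s, segment in enumerate(segments):
            sentences = [x for x in re.split(r"[.!?]+", segment) if x.strip()]
            for n, sentence in enumerate(sentences):
                for off, word in enumerate(sentence.lower().split()):
                    index[word].append(Position(b, s, n, off))
    return index


index = index_page([["Deep learning works. It scales."], ["Learning is fun."]])
```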
25. The apparatus according to claim 24, wherein the distance distribution determining unit specifically includes: a co-occurrence condition determining subunit and a distance distribution counting subunit;
the co-occurrence condition determining subunit is configured to determine the co-occurrence condition of the word pair i in a web page d according to the position information, recorded in the database, of the two words of the word pair i of the search request in the web page d, where the word pair i is any one of the word pairs in the search request, and the web page d is any one of the web pages in the search result;
and the distance distribution counting subunit is configured to count the physical distance distribution of the word pair i in the web page d according to the co-occurrence condition determined by the co-occurrence condition determining subunit.
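The counting in claim 25 might look like the following sketch. Positions are `(block, segment, sentence, offset)` tuples; the five grade names and the rule that every combination of one position of each word counts as one co-occurrence are assumptions for illustration.

```python
from collections import Counter
from itertools import product
from typing import Iterable, Tuple

Pos = Tuple[int, int, int, int]  # (block, segment, sentence, offset)


def co_occurrence_grade(p1: Pos, p2: Pos) -> str:
    """Classify one co-occurrence by the narrowest structure both words share."""
    b1, s1, n1, o1 = p1
    b2, s2, n2, o2 = p2
    if (b1, s1, n1) == (b2, s2, n2):
        return "adjacent" if abs(o1 - o2) == 1 else "same_sentence"
    if (b1, s1) == (b2, s2):
        return "same_segment"
    if b1 == b2:
        return "same_block"
    return "same_page"


def distance_distribution(pos_a: Iterable[Pos], pos_b: Iterable[Pos]) -> Counter:
    """Count co-occurrences per grade over all position combinations."""
    return Counter(co_occurrence_grade(p, q) for p, q in product(pos_a, pos_b))


hit = distance_distribution(
    [(0, 0, 0, 0), (1, 0, 0, 3)],
    [(0, 0, 0, 1), (0, 1, 0, 2)],
)
```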
26. The apparatus according to claim 16, wherein the ambiguity determining unit specifically comprises: a weight determining subunit, a satisfaction degree determining subunit and an escape degree determining subunit;
the weight determining subunit is configured to determine the weight weight(i) of the word pair i by using the closeness of the word pair i in the search request;
the satisfaction degree determining subunit is configured to determine the degree fit(i, d) to which the web page d satisfies the word pair i, by using the physical distance distribution of the word pair i in the web page d in the search result;
the escape degree determining subunit is configured to determine, according to a formula, the escape degree offset_ratio(d, q) of the web page d for the search request q, where φ is the set of word pairs in the search request q.
27. The apparatus of claim 26, wherein the weight determining subunit determines the weight weight(i) of the word pair i as weight(i) = f1(light(i), imp(i)) or weight(i) = f2(light(i));
where light(i) is the closeness of the word pair i, imp(i) is the degree of importance of the word pair i in the search request q, f1(light(i), imp(i)) is a function with light(i) as the primary factor and imp(i) as the secondary factor, such that for the same imp(i), the greater light(i) is, the greater weight(i) is, and f2(light(i)) is a function that normalizes light(i).
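The claim fixes only the properties of f1 and f2, not their form. The sketch below picks one possible form for each as an assumption: a weighted sum with a hypothetical coefficient `alpha` of 0.8 on light(i) for f1, and division by the sum over all word pairs of the search request for f2.

```python
from typing import Dict, Tuple

WordPair = Tuple[str, str]


def f1(light: float, imp: float, alpha: float = 0.8) -> float:
    """Weighted sum: light is the primary factor, imp the secondary one.

    For a fixed imp the result grows with light, as the claim requires.
    The value of alpha is an illustrative assumption.
    """
    return alpha * light + (1.0 - alpha) * imp


def f2(light_by_pair: Dict[WordPair, float]) -> Dict[WordPair, float]:
    """Normalize closeness values so that the weights of all pairs sum to 1."""
    total = sum(light_by_pair.values())
    if total == 0:
        return {pair: 0.0 for pair in light_by_pair}
    return {pair: value / total for pair, value in light_by_pair.items()}


weights = f2({("new", "york"): 0.9, ("york", "hotel"): 0.3})
```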
28. The apparatus of claim 27, wherein the ambiguity determining unit further comprises: an importance determining subunit for determining imp(i) in accordance with at least one of the following factors:
the part of speech of the word pair i in the search request, the sentence component of the word pair i in the search request, and the inverse document frequency of the word pair i.
29. The apparatus of claim 26, wherein the satisfaction degree determining subunit determines the degree to which the web page d satisfies the word pair i according to fit(i, d) = f3(HIT(i, d), light(i)) or fit(i, d) = f4(HIT(i, d));
wherein HIT(i, d) denotes the counted physical distance distribution of the word pair i in the web page d, light(i) is the closeness of the word pair i, f3(HIT(i, d), light(i)) is a function with the distance range of the word pair i determined by HIT(i, d) as the primary factor and light(i) as an adjusting factor, such that for the same light(i), the smaller the distance range of the word pair i determined by HIT(i, d), the greater the value of fit(i, d), and f4(HIT(i, d)) is a function that maps the distance range of the word pair i determined by HIT(i, d) to a specific value of fit(i, d).
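As with the weight functions, the claim constrains f3 and f4 without giving their form. The sketch below is one hypothetical choice: f4 is a lookup table from a distance range grade to a score, and f3 raises that score to the power 1 + light(i), so that pairs with higher closeness are penalized more when they appear far apart. The grade names and table values are assumptions.

```python
from typing import Dict

# Hypothetical mapping from distance range grade to a base satisfaction score.
GRADE_FIT: Dict[str, float] = {
    "adjacent": 1.0,
    "same_sentence": 0.7,
    "same_segment": 0.4,
    "same_page": 0.1,
}


def f4(grade: str) -> float:
    """Map the distance range grade directly to a fit value."""
    return GRADE_FIT[grade]


def f3(grade: str, light: float) -> float:
    """Use the grade as the primary factor and light as an adjusting factor.

    For a fixed light, a narrower grade (larger base score) gives a larger fit.
    """
    return GRADE_FIT[grade] ** (1.0 + light)
```

With these choices an adjacent occurrence always yields a fit of 1.0, while a page-level occurrence of a very tight pair scores lower than the same occurrence of a loose pair.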
30. The apparatus of claim 29, wherein the ambiguity determining unit further comprises: a distance range determining subunit, configured to determine the distance range of the word pair i according to HIT(i, d), specifically by:
adopting the minimum distance range of the word pair i in HIT(i, d) as the distance range of the word pair i; or,
taking, according to HIT(i, d), the distance range grade with the maximum relative occurrence probability as the distance range grade of the word pair i.
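The two alternatives of claim 30 can be compared in a short sketch. It assumes HIT(i, d) is a mapping from grade name to count, that `GRADES` lists hypothetical grades from narrowest to widest, and that "relative occurrence probability" means a grade's share of all counted co-occurrences.

```python
from typing import Dict, Optional

# Hypothetical grades, ordered from the narrowest to the widest distance range.
GRADES = ["adjacent", "same_sentence", "same_segment", "same_block", "same_page"]


def min_distance_range(hit: Dict[str, int]) -> Optional[str]:
    """First alternative: the narrowest grade in which the pair occurs at all."""
    for grade in GRADES:
        if hit.get(grade, 0) > 0:
            return grade
    return None


def most_probable_range(hit: Dict[str, int]) -> Optional[str]:
    """Second alternative: the grade holding the largest share of occurrences."""
    total = sum(hit.get(grade, 0) for grade in GRADES)
    if total == 0:
        return None
    # On ties, max() keeps the first (narrowest) grade in GRADES.
    return max(GRADES, key=lambda grade: hit.get(grade, 0) / total)


hit = {"same_sentence": 1, "same_page": 4}
```

For this distribution the first alternative selects the sentence-level grade and the second the page-level grade, which shows how the choice between them can change the resulting fit(i, d).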
CN201110135805.3A 2011-05-24 2011-05-24 A kind of escape degree defining method for search results ranking and device Active CN102799586B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110135805.3A CN102799586B (en) 2011-05-24 2011-05-24 A kind of escape degree defining method for search results ranking and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110135805.3A CN102799586B (en) 2011-05-24 2011-05-24 A kind of escape degree defining method for search results ranking and device

Publications (2)

Publication Number Publication Date
CN102799586A true CN102799586A (en) 2012-11-28
CN102799586B CN102799586B (en) 2016-04-27

Family

ID=47198698

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110135805.3A Active CN102799586B (en) 2011-05-24 2011-05-24 A kind of escape degree defining method for search results ranking and device

Country Status (1)

Country Link
CN (1) CN102799586B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104216931A (en) * 2013-05-29 2014-12-17 酷盛(天津)科技有限公司 Real-time recommending system and method
CN104778262A (en) * 2015-04-21 2015-07-15 无锡天脉聚源传媒科技有限公司 Searching method and searching device
CN105354321A (en) * 2015-11-16 2016-02-24 中国建设银行股份有限公司 Query data processing method and device
CN105677664A (en) * 2014-11-19 2016-06-15 腾讯科技(深圳)有限公司 Compactness determination method and device based on web search
CN109241356A (en) * 2018-06-22 2019-01-18 腾讯科技(深圳)有限公司 A kind of data processing method, device and storage medium
WO2020199270A1 (en) * 2019-04-04 2020-10-08 Beijing Didi Infinity Technology And Development Co., Ltd. Systems and methods for identifying proper nouns

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080109434A1 (en) * 2006-11-07 2008-05-08 Bellsouth Intellectual Property Corporation Determining Sort Order by Distance
CN101901249A (en) * 2009-05-26 2010-12-01 复旦大学 Text-based query expansion and sort method in image retrieval
CN101957828A (en) * 2009-07-20 2011-01-26 阿里巴巴集团控股有限公司 Method and device for sequencing search results

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104216931A (en) * 2013-05-29 2014-12-17 酷盛(天津)科技有限公司 Real-time recommending system and method
CN105677664A (en) * 2014-11-19 2016-06-15 腾讯科技(深圳)有限公司 Compactness determination method and device based on web search
CN105677664B (en) * 2014-11-19 2019-11-19 腾讯科技(深圳)有限公司 Method and device is determined based on the tightness of web search
CN104778262A (en) * 2015-04-21 2015-07-15 无锡天脉聚源传媒科技有限公司 Searching method and searching device
CN104778262B (en) * 2015-04-21 2018-07-24 无锡天脉聚源传媒科技有限公司 A kind of searching method and device
CN105354321A (en) * 2015-11-16 2016-02-24 中国建设银行股份有限公司 Query data processing method and device
CN109241356A (en) * 2018-06-22 2019-01-18 腾讯科技(深圳)有限公司 A kind of data processing method, device and storage medium
WO2020199270A1 (en) * 2019-04-04 2020-10-08 Beijing Didi Infinity Technology And Development Co., Ltd. Systems and methods for identifying proper nouns
CN111797620A (en) * 2019-04-04 2020-10-20 北京嘀嘀无限科技发展有限公司 System and method for recognizing proper nouns
CN111797620B (en) * 2019-04-04 2023-12-19 北京嘀嘀无限科技发展有限公司 System and method for identifying proper nouns

Also Published As

Publication number Publication date
CN102799586B (en) 2016-04-27

Similar Documents

Publication Publication Date Title
CN107247745B (en) A kind of information retrieval method and system based on pseudo-linear filter model
CN103136352B (en) Text retrieval system based on double-deck semantic analysis
US8332434B2 (en) Method and system for finding appropriate semantic web ontology terms from words
Varadarajan et al. A system for query-specific document summarization
CN101216826B (en) Information search system and method
CN103390004B (en) Determination method and apparatus, corresponding searching method and the device of a kind of semantic redundancy
CN102799586B (en) A kind of escape degree defining method for search results ranking and device
CN103678576A (en) Full-text retrieval system based on dynamic semantic analysis
Erdmann et al. Improving the extraction of bilingual terminology from Wikipedia
Wang et al. Indexing by Latent Dirichlet Allocation and an Ensemble Model
Srinivas et al. A weighted tag similarity measure based on a collaborative weight model
CN109815401A (en) A kind of name disambiguation method applied to Web people search
US20230282018A1 (en) Generating weighted contextual themes to guide unsupervised keyphrase relevance models
Zaila et al. Geographic information extraction, disambiguation and ranking techniques
Madnani et al. Multiple alternative sentence compressions for automatic text summarization
Min et al. Building user interest profiles from wikipedia clusters
Li et al. Computational linguistics literature and citations oriented citation linkage, classification and summarization
Wang et al. Cmu multiple-choice question answering system at ntcir-11 qa-lab
Hoeber et al. Conceptual query expansion
Wang et al. TREC-10 Experiments at CAS-ICT: Filtering, Web and QA.
Ono et al. Person name disambiguation in web pages using social network, compound words and latent topics
Gupta et al. Document summarisation based on sentence ranking using vector space model
Lampert A quick introduction to question answering
Meng et al. Chinese microblog entity linking system combining wikipedia and search engine retrieval results
Dornescu et al. Densification: Semantic document analysis using Wikipedia

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant