CN102799586A - Transferred meaning degree determining method and device for sequencing searching result - Google Patents
- Publication number
- CN102799586A (application CN201110135805)
- Authority
- CN
- China
- Prior art keywords
- word pair
- word
- closeness
- search request
- tight
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a method and device for determining an escape degree (转义度, the "transferred meaning degree" of the title) used for ranking search results. The method comprises: A. analyzing the closeness of the search request input by a user to determine the closeness of each word pair in the search request; B. according to the result of structural information processing performed on each web page in the search results corresponding to the search request, counting the physical distance distribution of each word pair of the search request in each web page; C. using the closeness of each word pair in the search request and its physical distance distribution in each web page, determining the escape degree of each web page in the search results with respect to the search request, the escape degree being used to rank the web pages in the search results. Ranking search results by the escape degree determined in this way improves the ranking quality and thereby saves network resources.
Description
【Technical Field】
The present invention relates to the field of computer technology, and in particular to a method and device for determining an escape degree used for ranking search results.
【Background Art】
With the continuous development of computer technology, search engines have become the main means by which people obtain information. After a user inputs a search request (query), the search engine returns the pages that match the query to the user in the search results.
The pages in the search results are ranked according to how well each page matches the query input by the user. In current search technology this matching degree usually depends only on the physical distance between the query's words in the page. In practice, however, the words of a query differ in closeness. Within the same query, if a word pair with high closeness has the same physical distance in page 1 as a word pair with low closeness has in page 2, then page 1 should clearly rank ahead of page 2. Current search technology cannot reflect this in the ranking of search results; the resulting poor ranking makes users occupy network resources for a long time, which wastes network resources.
【Summary of the Invention】
The present invention provides a method and device for determining an escape degree used for ranking search results, so as to improve the ranking quality of search results and thereby save network resources.
The specific technical solution is as follows:
A method for determining an escape degree used for ranking search results, the method comprising:
A. analyzing the closeness of the search request input by a user to determine the closeness of each word pair in the search request;
B. according to the result of structural information processing performed on each web page in the search results corresponding to the search request, counting the physical distance distribution of each word pair of the search request in each web page;
C. using the closeness of each word pair in the search request and its physical distance distribution in each web page, determining the escape degree of each web page in the search results with respect to the search request, the escape degree being used to rank the web pages in the search results.
Step A specifically comprises:
A1. performing word segmentation on the search request;
A2. using the words obtained from the word segmentation, determining the word pairs in the search request;
A3. querying a pre-mined proper-noun dictionary and/or co-occurrence dictionary to determine the closeness of each word pair, wherein the proper-noun dictionary contains pre-mined proper nouns and the co-occurrence dictionary contains the predetermined co-occurrence status of each word pair in existing data sources.
Preferably, step A1 further comprises: filtering the words obtained from the word segmentation against a stop-word list.
Specifically, step A2 comprises:
forming word pairs from each two adjacent words among the words obtained from the word segmentation; or,
forming word pairs from each two words with strong expressive ability among the words obtained from the word segmentation, wherein the words with strong expressive ability are determined according to part of speech or sentence component in the search request.
In step A3, querying the pre-mined proper-noun dictionary to determine the closeness of each word pair specifically comprises:
if a proper noun in the proper-noun dictionary contains word pair i, setting the closeness of word pair i to a preset closeness value, word pair i being any one of the word pairs in the search request.
In step A3, querying the pre-mined co-occurrence dictionary to determine the closeness of each word pair specifically comprises:
querying the co-occurrence dictionary to determine the co-occurrence status of word pair i in the existing data sources, the co-occurrence status including the number of occurrences of word pair i at each distance-range level, word pair i being any one of the word pairs in the search request;
determining, among the distance-range levels, the level at which word pair i has the largest relative occurrence probability;
taking the closeness corresponding to the determined distance-range level as the closeness of word pair i, wherein different distance-range levels are preset to correspond to different closeness values.
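The co-occurrence lookup above can be sketched as follows. This is only a minimal illustration: the level names, the closeness value preset for each level, and the reading of "relative occurrence probability" as the share of the pair's own occurrences falling at each level are all assumptions, not details given in the text.

```python
from typing import Dict

# Hypothetical closeness values preset per distance-range level
# (the text only says that different levels map to different values).
LEVEL_CLOSENESS: Dict[str, float] = {
    "block": 0.2,
    "paragraph": 0.4,
    "sentence": 0.6,
    "within_n_words": 0.8,
    "adjacent": 1.0,
}


def closeness_from_cooccurrence(level_counts: Dict[str, int]) -> float:
    """Return the closeness of the level with the largest relative probability.

    Assumption: the relative occurrence probability of a level is the
    pair's count at that level divided by its total count over all levels.
    """
    total = sum(level_counts.values())
    if total == 0:
        return 0.0  # pair never co-occurs in the data source
    best_level = max(level_counts, key=lambda lv: level_counts[lv] / total)
    return LEVEL_CLOSENESS[best_level]


# Example: a pair seen mostly as adjacent words gets the highest closeness.
example = closeness_from_cooccurrence({"paragraph": 5, "sentence": 3, "adjacent": 12})
```

Under a different definition of relative probability (for instance, one normalized by how common each level is overall), only the expression inside `max` would change.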
In addition, mining the co-occurrence dictionary specifically comprises:
D1. after performing word segmentation and stop-word-list filtering on the data source, combining the resulting words in pairs to form word pairs;
D2. counting the co-occurrence status in the data source of the word pairs obtained in step D1, and storing the counted co-occurrence status in the co-occurrence dictionary.
If both the proper-noun dictionary and the co-occurrence dictionary are used in step A3, and the closeness of word pair i can be determined by querying the proper-noun dictionary, then the closeness determined from the proper-noun dictionary is taken as the closeness of word pair i, word pair i being any one of the word pairs in the search request.
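The precedence rule between the two dictionaries might look like the sketch below. The function name, the substring test used for "a proper noun contains the word pair", and the preset and default values are hypothetical choices made for illustration.

```python
from typing import Dict, Iterable, Tuple

Pair = Tuple[str, str]


def pair_closeness(pair: Pair,
                   proper_nouns: Iterable[str],
                   cooccurrence_closeness: Dict[Pair, float],
                   preset: float = 1.0,
                   default: float = 0.0) -> float:
    """Prefer the proper-noun dictionary; fall back to co-occurrence closeness."""
    first, second = pair
    for noun in proper_nouns:
        # Assumption: "contains the word pair" means both words occur in the noun.
        if first in noun and second in noun:
            return preset
    return cooccurrence_closeness.get(pair, default)


nouns = {"相亲相爱的一家人"}
cooc = {("相亲相爱", "一家人"): 0.6, ("一家人", "唱"): 0.4}
hit = pair_closeness(("相亲相爱", "一家人"), nouns, cooc)   # proper-noun hit wins
miss = pair_closeness(("一家人", "唱"), nouns, cooc)        # falls back to co-occurrence
```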
Specifically, the structural information processing performed on a web page comprises:
dividing the web page into page blocks, paragraphs and sentences;
recording the position information of each word in the web page and storing it in a database, wherein the position information includes: the page block, paragraph and sentence in which the word is located, and its offset within the sentence.
On this basis, step B specifically comprises:
B1. according to the position information, recorded in the database, of the two words of word pair i of the search request in web page d, determining the co-occurrence status of word pair i in web page d, word pair i being any one of the word pairs in the search request and web page d being any web page in the search results;
B2. according to the co-occurrence status determined in step B1, counting the physical distance distribution of word pair i in web page d.
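Steps B1 and B2 can be illustrated with a small sketch. The `Pos` record, the level names, and the rule that "within N words" means an offset difference of at most N are assumptions; the text only specifies that a position consists of page block, paragraph, sentence and in-sentence offset.

```python
from collections import Counter
from typing import List, NamedTuple, Optional


class Pos(NamedTuple):
    """Position of one word occurrence in a page (hypothetical record layout)."""
    block: int
    paragraph: int   # index within the block
    sentence: int    # index within the paragraph
    offset: int      # word offset within the sentence


def distance_level(a: Pos, b: Pos, n: int = 3) -> Optional[str]:
    """Classify how closely two occurrences co-occur (B1)."""
    if a.block != b.block:
        return None                      # not counted as a co-occurrence
    if a.paragraph != b.paragraph:
        return "block"
    if a.sentence != b.sentence:
        return "paragraph"
    gap = abs(a.offset - b.offset)
    if gap <= 1:
        return "adjacent"
    return "within_n_words" if gap <= n else "sentence"


def distance_distribution(first: List[Pos], second: List[Pos], n: int = 3) -> Counter:
    """Count co-occurrences of a word pair per distance level (B2)."""
    hits: Counter = Counter()
    for a in first:
        for b in second:
            level = distance_level(a, b, n)
            if level is not None:
                hits[level] += 1
    return hits


hits = distance_distribution(
    [Pos(0, 0, 0, 2), Pos(1, 0, 0, 0)],
    [Pos(0, 0, 0, 3), Pos(0, 1, 0, 0)],
)
```

The resulting counter plays the role of HIT(i, d) for one word pair and one page.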
Step C specifically comprises:
C1. using the closeness of word pair i in the search request to determine the weighting value weight(i) of word pair i;
C2. using the physical distance distribution of word pair i in web page d of the search results to determine the satisfaction degree fit(i, d) of web page d for word pair i;
C3. determining the escape degree offset_ratio(d, q) of web page d with respect to the search request q according to the formula [formula image not preserved in this text], where φ is the set of word pairs in the search request q.
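The formula for offset_ratio(d, q) is not reproduced in this text, so the sketch below should be read as one plausible combination only: it assumes a weighted sum of fit(i, d) over the word pairs i in φ, which is consistent with the surrounding definitions but is not confirmed by the source.

```python
from typing import Dict, Hashable


def offset_ratio(weights: Dict[Hashable, float],
                 fits: Dict[Hashable, float]) -> float:
    """Assumed form: sum over word pairs i in phi of weight(i) * fit(i, d).

    `weights` maps each word pair of the query to weight(i);
    `fits` maps each word pair to fit(i, d) for one page d.
    Pairs missing from `fits` are treated as not satisfied (0.0).
    """
    return sum(w * fits.get(pair, 0.0) for pair, w in weights.items())


score = offset_ratio({"pair_a": 0.75, "pair_b": 0.25},
                     {"pair_a": 1.0, "pair_b": 0.5})
```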
The weight(i) is:
weight(i) = f1(tight(i), imp(i)), where tight(i) is the closeness of word pair i, imp(i) is the importance of word pair i in the search request q, and f1(tight(i), imp(i)) is a function that takes tight(i) as the main factor and imp(i) as an adjustment factor, such that for the same imp(i), a larger tight(i) gives a larger weight(i); or,
weight(i) = f2(tight(i)), where f2(tight(i)) is a function that normalizes tight(i).
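The text constrains f1 and f2 only loosely, so the following shows just one possible pair of functions. The multiplicative form of f1, the `alpha` parameter, and normalization by the sum of all closeness values in f2 are assumptions.

```python
from typing import Dict, Hashable


def weight_f1(tight: float, imp: float, alpha: float = 0.5) -> float:
    """One possible f1: tight(i) is the main factor, imp(i) only scales it.

    For a fixed imp, the result grows with tight, as the text requires.
    """
    return tight * (1.0 + alpha * imp)


def weights_f2(tights: Dict[Hashable, float]) -> Dict[Hashable, float]:
    """One possible f2: normalize closeness values so they sum to 1."""
    total = sum(tights.values())
    if total == 0:
        return {pair: 0.0 for pair in tights}
    return {pair: t / total for pair, t in tights.items()}


normalized = weights_f2({"pair_a": 3.0, "pair_b": 1.0})
```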
The imp(i) is determined by at least one of the following factors:
the part of speech of word pair i in the search request, the sentence component of word pair i in the search request, and the inverse document frequency of word pair i.
The fit(i, d) is:
fit(i, d) = f3(HIT(i, d), tight(i)), where HIT(i, d) denotes the counted physical distance distribution of word pair i in web page d, tight(i) is the closeness of word pair i, and f3(HIT(i, d), tight(i)) is a function that takes the distance range of word pair i determined from HIT(i, d) as the main factor and tight(i) as an adjustment factor, such that for the same tight(i), a smaller distance range determined from HIT(i, d) gives a larger fit(i, d); or,
fit(i, d) = f4(HIT(i, d)), where f4(HIT(i, d)) is a function that maps the distance range of word pair i determined from HIT(i, d) to a specific fit(i, d) value.
Determining the distance range of word pair i from HIT(i, d) specifically comprises:
taking the smallest distance range of word pair i in HIT(i, d) as the distance range of word pair i; or,
according to HIT(i, d), taking the distance-range level with the largest relative occurrence probability as the distance-range level of word pair i.
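The two ways of reducing HIT(i, d) to a single distance range, followed by an f4-style mapping, could be sketched as below. The level ordering, the fit value assigned to each level, and the use of raw count share as "relative occurrence probability" are illustrative assumptions.

```python
from typing import Callable, Dict, Optional

# Levels ordered from the widest to the smallest distance range (assumed names).
LEVELS = ["block", "paragraph", "sentence", "within_n_words", "adjacent"]

# Hypothetical f4 table: a smaller distance range maps to a larger fit value.
FIT_BY_LEVEL = {"block": 0.2, "paragraph": 0.4, "sentence": 0.6,
                "within_n_words": 0.8, "adjacent": 1.0}


def range_min(hits: Dict[str, int]) -> Optional[str]:
    """Option 1: the smallest distance range at which the pair occurs at all."""
    for level in reversed(LEVELS):
        if hits.get(level, 0) > 0:
            return level
    return None


def range_most_probable(hits: Dict[str, int]) -> Optional[str]:
    """Option 2: the level holding the largest share of the pair's occurrences."""
    present = {level: count for level, count in hits.items() if count > 0}
    if not present:
        return None
    return max(present, key=lambda level: present[level])


def fit_f4(hits: Dict[str, int],
           choose: Callable[[Dict[str, int]], Optional[str]] = range_min) -> float:
    """Map the chosen distance range to a fit(i, d) value."""
    level = choose(hits)
    return FIT_BY_LEVEL[level] if level is not None else 0.0


page_hits = {"paragraph": 5, "adjacent": 1}
fit_smallest = fit_f4(page_hits)
fit_probable = fit_f4(page_hits, range_most_probable)
```

The example shows why the choice matters: one adjacent hit dominates under option 1, while the five paragraph-level hits dominate under option 2.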
A device for determining an escape degree used for ranking search results, the device comprising: a closeness analysis unit, a distance distribution determination unit and an escape degree determination unit;
the closeness analysis unit is configured to analyze the closeness of the search request input by a user and determine the closeness of each word pair in the search request;
the distance distribution determination unit is configured to count, according to the result of structural information processing performed on each web page in the search results corresponding to the search request, the physical distance distribution of each word pair of the search request in each web page;
the escape degree determination unit is configured to determine, using the closeness of each word pair in the search request and its physical distance distribution in each web page, the escape degree of each web page in the search results with respect to the search request, the escape degree being used to rank the web pages in the search results.
The closeness analysis unit specifically comprises: a word segmentation subunit, a word pair determination subunit and a closeness determination subunit;
the word segmentation subunit is configured to perform word segmentation on the search request;
the word pair determination subunit is configured to determine the word pairs in the search request using the words obtained from the word segmentation;
the closeness determination subunit is configured to query a pre-mined proper-noun dictionary and/or co-occurrence dictionary to determine the closeness of each word pair, wherein the proper-noun dictionary contains pre-mined proper nouns and the co-occurrence dictionary contains the predetermined co-occurrence status of each word pair in existing data sources.
Preferably, the closeness analysis unit further comprises: a filtering subunit, configured to filter the words obtained by the word segmentation subunit against a stop-word list and send the filtered words to the word pair determination subunit.
Specifically, the word pair determination subunit forms word pairs from each two adjacent words among the words obtained from the word segmentation; or,
forms word pairs from each two words with strong expressive ability among the words obtained from the word segmentation, wherein the words with strong expressive ability are determined according to part of speech or sentence component in the search request.
If a proper noun in the proper-noun dictionary contains word pair i, the closeness determination subunit sets the closeness of word pair i to a preset closeness value, word pair i being any one of the word pairs in the search request.
The closeness determination subunit specifically comprises: a dictionary query module, a distance level determination module and a closeness determination module;
the dictionary query module is configured to query the co-occurrence dictionary to determine the co-occurrence status of word pair i in the existing data sources, the co-occurrence status including the number of occurrences of word pair i at each distance-range level, word pair i being any one of the word pairs in the search request;
the distance level determination module is configured to determine, according to the query result of the dictionary query module, the distance-range level at which word pair i has the largest relative occurrence probability;
the closeness determination module is configured to take the closeness corresponding to the distance-range level determined by the distance level determination module as the closeness of word pair i, wherein different distance-range levels are preset to correspond to different closeness values.
Further, the closeness analysis unit also comprises: a co-occurrence dictionary mining subunit, configured to perform word segmentation and stop-word-list filtering on the data source, combine the resulting words in pairs to form word pairs, count the co-occurrence status of these word pairs in the data source, and store the counted co-occurrence status in the co-occurrence dictionary.
If the closeness determination subunit uses both the proper-noun dictionary and the co-occurrence dictionary, and the closeness of word pair i can be determined by querying the proper-noun dictionary, then the closeness determined from the proper-noun dictionary is taken as the closeness of word pair i, word pair i being any one of the word pairs in the search request.
Further, the device also comprises: a structural information processing unit, configured to divide a web page into page blocks, paragraphs and sentences, record the position information of each word in the web page and store it in a database, wherein the position information includes: the page block, paragraph and sentence in which the word is located, and its offset within the sentence.
The distance distribution determination unit specifically comprises: a co-occurrence status determination subunit and a distance distribution statistics subunit;
the co-occurrence status determination subunit is configured to determine, according to the position information, recorded in the database, of the two words of word pair i of the search request in web page d, the co-occurrence status of word pair i in web page d, word pair i being any one of the word pairs in the search request and web page d being any web page in the search results;
the distance distribution statistics subunit is configured to count the physical distance distribution of word pair i in web page d according to the co-occurrence status determined by the co-occurrence status determination subunit.
The escape degree determination unit specifically comprises: a weighting value determination subunit, a satisfaction degree determination subunit and an escape degree determination subunit;
the weighting value determination subunit is configured to determine the weighting value weight(i) of word pair i using the closeness of word pair i in the search request;
the satisfaction degree determination subunit is configured to determine the satisfaction degree fit(i, d) of web page d for word pair i using the physical distance distribution of word pair i in web page d of the search results;
the escape degree determination subunit is configured to determine the escape degree offset_ratio(d, q) of web page d with respect to the search request q according to the formula [formula image not preserved in this text], where φ is the set of word pairs in the search request q.
The weighting value determination subunit determines the weighting value weight(i) of word pair i according to weight(i) = f1(tight(i), imp(i)) or weight(i) = f2(tight(i));
where tight(i) is the closeness of word pair i, imp(i) is the importance of word pair i in the search request q, f1(tight(i), imp(i)) is a function that takes tight(i) as the main factor and imp(i) as an adjustment factor, such that for the same imp(i), a larger tight(i) gives a larger weight(i), and f2(tight(i)) is a function that normalizes tight(i).
In this case, the escape degree determination unit further comprises: an importance determination subunit, configured to determine imp(i) according to at least one of the following factors:
the part of speech of word pair i in the search request, the sentence component of word pair i in the search request, and the inverse document frequency of word pair i.
The satisfaction degree determination subunit determines the satisfaction degree fit(i, d) of web page d for word pair i according to fit(i, d) = f3(HIT(i, d), tight(i)) or fit(i, d) = f4(HIT(i, d));
where HIT(i, d) denotes the counted physical distance distribution of word pair i in web page d, tight(i) is the closeness of word pair i, f3(HIT(i, d), tight(i)) is a function that takes the distance range of word pair i determined from HIT(i, d) as the main factor and tight(i) as an adjustment factor, such that for the same tight(i), a smaller distance range determined from HIT(i, d) gives a larger fit(i, d), and f4(HIT(i, d)) is a function that maps the distance range of word pair i determined from HIT(i, d) to a specific fit(i, d) value.
In this case, the escape degree determination unit further comprises: a distance range determination subunit, configured to determine the distance range of word pair i from HIT(i, d), specifically by:
taking the smallest distance range of word pair i in HIT(i, d) as the distance range of word pair i; or,
according to HIT(i, d), taking the distance-range level with the largest relative occurrence probability as the distance-range level of word pair i.
As can be seen from the above technical solution, the escape degree determined by the method and device of the present invention is based on the closeness of each word pair in the query and on its physical distance distribution in the web page. The higher the escape degree of a web page with respect to the query, the better the page matches the high-closeness word pairs of the query, and the better the ranking based on it. With search results ranked in this way, users can obtain the information they want more quickly, which saves network resources.
【Brief Description of the Drawings】
Fig. 1 is a flowchart of the main method provided by an embodiment of the present invention;
Fig. 2 is a flowchart of the method for analyzing the closeness of a query provided by Embodiment 1 of the present invention;
Fig. 3 is a flowchart of the method for counting the physical distance distribution of each word pair of a query in a web page provided by Embodiment 2 of the present invention;
Fig. 4 is a flowchart of the method for determining the escape degree of a web page with respect to a query provided by Embodiment 3 of the present invention;
Fig. 5 is a structural diagram of the escape degree determination device provided by Embodiment 4 of the present invention.
【Detailed Description】
In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in detail below with reference to the drawings and specific embodiments.
Fig. 1 is a flowchart of the main method provided by an embodiment of the present invention. As shown in Fig. 1, the method may include the following steps:
Step 101: analyze the closeness of the query input by the user and determine the closeness of each word pair in the query.
Step 102: according to the result of structural information processing performed on each web page in the search results corresponding to the query, count the physical distance distribution of each word pair of the query in each web page.
Step 103: using the closeness of each word pair in the query and its physical distance distribution in each web page of the search results, determine the escape degree of each web page of the search results with respect to the query; the escape degree is used to rank the web pages of the search results.
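To show how steps 101 to 103 fit together, here is a small skeleton in which each step is passed in as a function. All names are hypothetical, and the descending sort (a higher escape degree ranks first) and the use of the escape degree as the only ranking signal are simplifying assumptions.

```python
from typing import Callable, Dict, Hashable, List, Tuple

Pair = Tuple[str, str]


def rank_pages(query: str,
               pages: List[Hashable],
               analyze_closeness: Callable[[str], Dict[Pair, float]],
               distance_distribution: Callable[[Pair, Hashable], Dict[str, int]],
               escape_degree: Callable[[Dict[Pair, float], Dict[Pair, Dict[str, int]]], float],
               ) -> List[Hashable]:
    """Rank pages by escape degree, highest first."""
    tight = analyze_closeness(query)                         # step 101
    scored = []
    for page in pages:
        hits = {pair: distance_distribution(pair, page)      # step 102
                for pair in tight}
        scored.append((escape_degree(tight, hits), page))    # step 103
    scored.sort(key=lambda item: item[0], reverse=True)
    return [page for _, page in scored]


# Toy stand-ins for the three steps, for illustration only.
ranked = rank_pages(
    "a b",
    ["p1", "p2"],
    analyze_closeness=lambda q: {("a", "b"): 1.0},
    distance_distribution=lambda pair, page: {"adjacent": 1} if page == "p2" else {},
    escape_degree=lambda tight, hits: sum(t for p, t in tight.items() if hits[p]),
)
```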
Each step of the above method is described in detail below. First, step 101, i.e. the closeness analysis of the query, is described in detail with reference to Embodiment 1.
Embodiment 1
Fig. 2 is a flowchart of the method for analyzing the closeness of a query provided by Embodiment 1 of the present invention. As shown in Fig. 2, the method may include the following steps:
Step 201: perform word segmentation on the query.
The word segmentation method in this step may be, but is not limited to, a method based on a dictionary and longest matching, or a method based on a statistical model. Since word segmentation is a mature existing technique, it is not described further here.
Preferably, the words obtained from word segmentation may further be filtered against a stop-word list to remove words with weak expressive ability, such as adverbs, function words and particles.
Taking the query "相亲相爱的一家人是谁唱的" ("Who sang 'A Family That Loves Each Other'?") as an example, word segmentation yields the words: "相亲相爱" (loving each other), "的" (a particle), "一家人" (a family), "是" (is), "谁" (who), "唱" (sing) and "的".
When filtering against the stop-word list, "的" is removed, leaving the words: "相亲相爱", "一家人", "是", "谁" and "唱".
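The filtering step of this example can be reproduced in a few lines. The stop-word set here contains only "的", which is all the example needs; a real stop-word list would be much larger, and the segmentation itself is taken as given.

```python
from typing import Iterable, List, Set

# Minimal stop-word list for the worked example (assumption: real lists are larger).
STOP_WORDS: Set[str] = {"的"}


def filter_stop_words(tokens: Iterable[str],
                      stop_words: Set[str] = STOP_WORDS) -> List[str]:
    """Drop tokens that appear in the stop-word list, keeping the original order."""
    return [token for token in tokens if token not in stop_words]


segmented = ["相亲相爱", "的", "一家人", "是", "谁", "唱", "的"]
kept = filter_stop_words(segmented)
```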
Step 202: using the result of the word segmentation, determine the word pairs in the query.
When determining the word pairs in the query, at least one of the following strategies may be used:
Strategy 1: among the words obtained from word segmentation, form word pairs from each two adjacent words.
Strategy 2: among the words obtained from word segmentation, form word pairs from each two words with strong expressive ability.
Words with strong expressive ability may be determined according to part of speech or sentence component; for example, at least one of nouns, verbs, adjectives and pronouns is treated as words with strong expressive ability, or at least one of subject, predicate and object is treated as such.
Still taking the query "相亲相爱的一家人是谁唱的" as an example: under Strategy 1, pairing adjacent words among "相亲相爱", "一家人", "是", "谁" and "唱" yields the word pairs "相亲相爱"-"一家人", "一家人"-"是", "是"-"谁" and "谁"-"唱".
Under Strategy 2, the words with strong expressive ability among those obtained from segmenting the query are "相亲相爱", "一家人" and "唱"; pairing them yields the word pairs "相亲相爱"-"一家人", "相亲相爱"-"唱" and "一家人"-"唱".
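Both strategies are easy to express in code. In this sketch the decision about which words are expressive is passed in as a predicate, standing in for a part-of-speech or sentence-component analysis that is not shown; the predicate used in the example simply hard-codes the three words named above.

```python
from itertools import combinations
from typing import Callable, List, Tuple

Pair = Tuple[str, str]


def adjacent_pairs(words: List[str]) -> List[Pair]:
    """Strategy 1: pair each word with its immediate successor."""
    return list(zip(words, words[1:]))


def expressive_pairs(words: List[str],
                     is_expressive: Callable[[str], bool]) -> List[Pair]:
    """Strategy 2: pair every two words judged to have strong expressive ability."""
    strong = [word for word in words if is_expressive(word)]
    return list(combinations(strong, 2))


words = ["相亲相爱", "一家人", "是", "谁", "唱"]
strategy1 = adjacent_pairs(words)
# Hypothetical stand-in for a POS or sentence-component based decision.
strategy2 = expressive_pairs(words, lambda w: w in {"相亲相爱", "一家人", "唱"})
```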
Step 203: query the pre-mined proper-noun dictionary and/or co-occurrence dictionary to determine the closeness of each word pair, wherein the proper-noun dictionary contains pre-mined proper nouns and the co-occurrence dictionary contains the predetermined co-occurrence status of each word pair in existing data sources.
This step involves two kinds of dictionary: the proper-noun dictionary and/or the co-occurrence dictionary. The proper-noun dictionary can be mined using existing techniques. Proper nouns can currently be divided into 18 types, such as personal names, place names, film and television titles, country names, institution names and organization names. Each type can be mined from its own corpus; for example, for proper nouns of the film-and-television-title type, the titles on video websites can be used as the corpus. The mining methods for the various types are not specifically limited here.
When the proper-noun dictionary is used to determine the closeness of each word pair, if a proper noun in the dictionary contains a word pair, the closeness of that word pair can be set to a preset closeness value. For example, the word pair formed by "相亲相爱" and "一家人" hits the proper noun "相亲相爱的一家人" in the proper-noun dictionary, i.e. a proper noun in the dictionary contains the word pair; the closeness of the word pair formed by "相亲相爱" and "一家人" can therefore be set to the highest closeness.
Mining of the co-occurrence dictionary is described next. The data sources may be, without limitation, at least one of the following: web page content, web page titles, and queries in search logs. Each data source is segmented into words. Preferably, words with weak expressive power are then filtered out using a stop-word list. The remaining words are combined two by two into word pairs, the co-occurrence of each pair in the data source is counted, and the result is stored in the co-occurrence dictionary.
The co-occurrence status of each word pair in the dictionary can be stored as: the word pair, the co-occurrence distance range of the pair, and the number of co-occurrences within that range. The distance range can be preset as several levels, for example five: web page block, paragraph, sentence, within N words (N is an integer greater than 2, for example within 3 words), and adjacent.
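A simplified sketch of the mining step, restricted to the three sentence-internal levels (adjacent, within N words, sentence) for brevity. It assumes the data source is already segmented into sentences of tokens, that the gap between two words is measured after stop-word filtering, and that each co-occurrence is counted only at its tightest level; the stop-word list and all names are hypothetical.

```python
from collections import Counter
from itertools import combinations

N = 3  # "within N words" threshold, as in the text's example
STOP_WORDS = {"的", "是"}  # hypothetical stop-word list

def level_in_sentence(gap):
    """Tightest level for two words of one sentence that are `gap` positions apart."""
    if gap == 1:
        return "adjacent"
    return "within_n" if gap <= N else "sentence"

def mine_cooccurrence(sentences):
    """Count (word pair, level) entries over pre-segmented sentences."""
    counts = Counter()
    for tokens in sentences:
        kept = [t for t in tokens if t not in STOP_WORDS]
        for (i, a), (j, b) in combinations(enumerate(kept), 2):
            if a != b:
                counts[(frozenset((a, b)), level_in_sentence(j - i))] += 1
    return counts

cooc = mine_cooccurrence([["谁", "唱", "的"], ["谁", "在", "唱"]])
```

Each key of `cooc` is a (word pair, distance level) tuple and each value is the co-occurrence count, matching the three stored fields described above. Paragraph and block levels would need the same counting over larger units.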
When the co-occurrence dictionary is used to analyze the closeness of a user's query, the dictionary is queried for the number of occurrences of each word pair of the query at each distance range level. The level with the largest relative occurrence probability is determined, and the closeness corresponding to that level is taken as the closeness of the word pair. Different distance range levels can be preset to correspond to different closeness values.
For example, for the word pair "谁"-"唱", the co-occurrence dictionary records 2 co-occurrences at the adjacent level, 10 at the within-3-words level, 18 at the sentence level, 40 at the paragraph level and 60 at the web page block level. The level with the largest relative occurrence probability is then determined to be "within 3 words". The co-occurrence distance range level of "谁"-"唱" is therefore "within 3 words", and its closeness level is the second closeness level.
The relative occurrence probability P_j of the j-th level is defined in terms of x_j, the number of co-occurrences of the word pair at level j, and x_{j+1}, the number of co-occurrences of the word pair at level j+1, with the levels sorted from high to low closeness. Other definitions of the relative occurrence probability may also be used; no limitation is imposed here.
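Since other definitions of the relative occurrence probability are permitted, the sketch below uses one assumed definition that is not necessarily the patent's own: the growth of the count relative to the next-tighter level, with the raw count used for the tightest level. The function names are hypothetical. Under this assumption the "谁"-"唱" counts from the example select the within-3-words level.

```python
# Levels sorted from high to low closeness, as in the text.
LEVELS = ["adjacent", "within_3_words", "sentence", "paragraph", "block"]

def relative_probabilities(counts):
    """Assumed definition: (x_j - x_prev) / max(x_prev, 1), with x_prev = 0 for the first level."""
    probs = []
    prev = 0
    for x in counts:
        probs.append((x - prev) / max(prev, 1))
        prev = x
    return probs

def closeness_level(counts):
    """Index of the level with the largest relative occurrence probability (0 = tightest)."""
    probs = relative_probabilities(counts)
    return probs.index(max(probs))

counts = [2, 10, 18, 40, 60]  # the "谁"-"唱" example
chosen = LEVELS[closeness_level(counts)]
```

The index returned by `closeness_level` can serve directly as the closeness level when each distance range level is preset to its own closeness.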
If both the proper-noun dictionary and the co-occurrence dictionary are used and a word pair hits both, the proper-noun dictionary can be given the higher priority: the closeness determined from the proper-noun dictionary is used as the final closeness of the pair.
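The priority rule can be sketched as below. It assumes that "a proper noun contains the word pair" means both words occur as substrings of the same proper noun, and that closeness is encoded as a level index with 0 as the highest; both are illustrative assumptions, as are the names.

```python
HIGHEST_CLOSENESS = 0  # assumed encoding: 0 is the tightest of the closeness levels

def proper_noun_hit(pair, proper_nouns):
    """True if some proper noun contains both words of the pair."""
    a, b = pair
    return any(a in noun and b in noun for noun in proper_nouns)

def pair_closeness(pair, proper_nouns, cooc_closeness):
    """The proper-noun dictionary wins; otherwise use the co-occurrence result."""
    if proper_noun_hit(pair, proper_nouns):
        return HIGHEST_CLOSENESS
    return cooc_closeness.get(pair)

proper_nouns = {"相亲相爱的一家人"}
cooc_closeness = {("相亲相爱", "一家人"): 2, ("谁", "唱"): 1}
```

Here `pair_closeness(("相亲相爱", "一家人"), proper_nouns, cooc_closeness)` ignores the co-occurrence level 2 because the pair hits the proper-noun dictionary.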
In this embodiment, the closeness of each word pair may be expressed as one of the closeness levels described above, or as a concrete closeness value.
This concludes the flow of Embodiment 1. Step 102, that is, how structural information processing is performed on each web page in the search results, is described in detail below with reference to Embodiment 2.
Embodiment 2
Fig. 3 is a flowchart of the method provided by Embodiment 2 of the present invention for counting the physical distance distribution of each word pair of the query in a web page. As shown in Fig. 3, the method may include the following steps:
Step 301: Perform structural information processing on each web page in the search results corresponding to the query. The processing includes dividing the page into web page blocks, paragraphs and sentences.
The resulting web page blocks may include, without limitation: title blocks, anchor blocks, navigation (mypos) blocks and content blocks. Anchor blocks and content blocks may be divided at a finer granularity.
Each web page block can be further divided into paragraphs, and each paragraph into sentences.
Step 301 can be performed offline. After this processing, every word has an absolute position in the web page, and the position information of each word in each page can be stored in a database for lookup when step 302 is executed online. The position information may be the web page block, paragraph and sentence in which the word occurs, together with its offset within the sentence.
Step 302: Based on the position information of each word in the web page, count the physical distance distribution of each word pair of the query in the web page.
From the position information of the two words of each pair, the co-occurrence status of the pair in the page can be determined, that is, the number of co-occurrences within a web page block, paragraph or sentence. Since a word pair may occur many times in a page, the physical distance distribution of the pair in the page can be counted from this co-occurrence status, forming an array HIT(i, d), where i identifies the word pair, d identifies the web page, and HIT(i, d) is the counted physical distance distribution of word pair i in web page d.
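A minimal sketch of step 302, assuming each word occurrence is stored as a (block, paragraph, sentence, offset) record and that every pair of occurrences is counted once, at the tightest level the two positions share. The text does not fix these counting rules, so they are assumptions, and the names are hypothetical.

```python
from collections import namedtuple
from itertools import product

# Position fields follow the text: block, paragraph, sentence, in-sentence offset.
Pos = namedtuple("Pos", "block para sent offset")
N = 3  # "within N words"

def distance_level(p, q):
    """Tightest level shared by two positions: 0 adjacent ... 4 same block, None otherwise."""
    if p.block != q.block:
        return None
    if p.para != q.para:
        return 4
    if p.sent != q.sent:
        return 3
    gap = abs(p.offset - q.offset)
    return 0 if gap == 1 else 1 if gap <= N else 2

def hit_distribution(positions_a, positions_b):
    """hit[k] = number of occurrence pairs whose tightest shared level is k."""
    hit = [0] * 5
    for p, q in product(positions_a, positions_b):
        level = distance_level(p, q)
        if level is not None:
            hit[level] += 1
    return hit

a = [Pos(0, 0, 0, 1), Pos(1, 0, 0, 0)]
b = [Pos(0, 0, 0, 2), Pos(0, 0, 1, 5), Pos(1, 2, 0, 0)]
hit = hit_distribution(a, b)
```

The resulting five-element list plays the role of HIT(i, d) for one word pair and one page.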
This concludes the flow of Embodiment 2. The method for determining the escape degree of each web page for the query is described in detail below with reference to Embodiment 3.
Embodiment 3
Fig. 4 is a flowchart of the method provided by Embodiment 3 of the present invention for determining the escape degree of a web page for the query. As shown in Fig. 4, the method may include the following steps:
Step 401: Determine the weighted value weight(i) of each word pair from the closeness of the word pair and the importance of the word pair in the query.
Here weight(i) = f1(tight(i), imp(i)), where tight(i) is the closeness of word pair i and imp(i) is the importance of word pair i in the query. f1(tight(i), imp(i)) may be a function that takes tight(i) as the main factor and imp(i) as an adjustment factor: for the same imp(i), a larger tight(i) gives a larger weight(i). For example, it may be tight(i) multiplied by the normalized value of imp(i).
One concrete implementation of f1(tight(i), imp(i)) is as follows:
First, the level corresponding to the tight(i) value is mapped to a weight value g_tight_map[tight(i)], with different levels mapped to different weight values. For example, suppose tight(i) has five levels taking integer values in [0, 4]; the mapped weight values then form an array, say g_tight_map[5] = {16, 8, 4, 2, 1}.
Then take weight(i) = f1(tight(i), imp(i)) = g_tight_map[tight(i)] + imp(i). The value range of g_tight_map[tight(i)] can be made larger than the value range of imp(i), so that tight(i) acts as the main factor and imp(i) as the adjustment factor.
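The additive form of f1 can be sketched as follows. The g_tight_map values come from the text; treating index 0 as the tightest level and restricting imp(i) to [0, 1) are assumptions made so that closeness always dominates.

```python
G_TIGHT_MAP = [16, 8, 4, 2, 1]  # from the text; index 0 is assumed to be the tightest level

def weight(tight, imp):
    """weight(i) = g_tight_map[tight(i)] + imp(i), with imp assumed to lie in [0, 1)."""
    if not 0.0 <= imp < 1.0:
        raise ValueError("imp is assumed to be normalized to [0, 1)")
    return G_TIGHT_MAP[tight] + imp

w_tight = weight(0, 0.5)   # tightest level, medium importance
w_loose = weight(4, 0.9)   # loosest level, high importance
```

Because adjacent entries of the map differ by at least 1 while imp stays below 1, a tighter pair always outweighs a looser one regardless of importance.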
The importance imp(i) of word pair i in the query may be determined from at least one of the following factors: part of speech in the query, sentence component in the query, or inverse document frequency (IDF).
The inverse document frequency IDF_i of word pair i is computed from Freq_i, the absolute frequency of word pair i in a large-scale corpus, and M, the maximum absolute frequency of all word pairs in that corpus.
Alternatively, the weighted value may be determined from closeness alone, that is, weight(i) = f2(tight(i)), where f2(tight(i)) may be a function that normalizes tight(i).
Step 402: Using the physical distance distribution of each word pair of the query in the web page and the closeness of each word pair, determine the satisfaction degree fit(i, d) of the web page for each word pair.
Here fit(i, d) = f3(HIT(i, d), tight(i)), where HIT(i, d) is the counted physical distance distribution of word pair i in web page d and tight(i) is the closeness of word pair i. Specifically, f3(HIT(i, d), tight(i)) may be a function that takes the distance range of word pair i determined from HIT(i, d) as the main factor and tight(i) as an adjustment factor: for the same tight(i), the smaller the distance range determined from HIT(i, d), the larger fit(i, d).
One concrete implementation of f3(HIT(i, d), tight(i)) is as follows:
HIT(i, d) reflects the physical distance distribution of word pair i in web page d and can be understood as the number of co-occurrences in each physical distance range. Suppose HIT(i, d) is an array hit[5]: hit[0] is the number of adjacent co-occurrences of word pair i in web page d; hit[1] is the number of co-occurrences within 3 words; hit[2] is the number of co-occurrences within a sentence; hit[3] is the number of co-occurrences within a paragraph; and hit[4] is the number of co-occurrences within a block. tight(i) is the closeness of word pair i and is assumed to be an integer in the range [0, 4].
First, HIT(i, d) can be quantized into a single distance range value: hit_value = 16*hit[0] + 8*hit[1] + 4*hit[2] + 2*hit[3] + hit[4]. The value range of hit_value can be defined as [0, 16]; if the computed hit_value exceeds 16, it is set to 16.
The combinations of tight(i) and hit_value are mapped in advance to different fit(i, d) values. The mapping can take the form of a two-dimensional array g_hit_map_fit[tight(i)][hit_value], for example g_hit_map_fit[5][17], whose entries may be floating-point numbers in the range [0, 1]. That is, fit(i, d) = f3(HIT(i, d), tight(i)) = g_hit_map_fit[tight(i)][hit_value].
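The quantization and table lookup can be sketched as below, assuming the weights 16, 8, 4, 2, 1 apply to hit[0] through hit[4] in order. The text does not give the contents of g_hit_map_fit, so the table here is a hypothetical filler chosen only to satisfy the stated shape (5 x 17) and value range [0, 1].

```python
def hit_value(hit):
    """Quantize hit[0..4] into one value, capped at 16."""
    weights = (16, 8, 4, 2, 1)
    return min(16, sum(w * h for w, h in zip(weights, hit)))

# Hypothetical 5 x 17 table: fit grows with hit_value, and tighter pairs
# (smaller tight index) need a larger hit_value to reach the same fit.
G_HIT_MAP_FIT = [[min(1.0, v / (16 / (t + 1))) for v in range(17)] for t in range(5)]

def fit(hit, tight):
    """fit(i, d) = g_hit_map_fit[tight(i)][hit_value]."""
    return G_HIT_MAP_FIT[tight][hit_value(hit)]

example = hit_value([0, 1, 1, 0, 1])  # 8 + 4 + 1
```

A single adjacent co-occurrence already saturates hit_value at 16, which matches the text's choice of capping the range at [0, 16].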
When the distance range of word pair i is determined from HIT(i, d), the minimum distance range of word pair i in HIT(i, d) can be used directly, or the distance range level with the largest relative occurrence probability according to HIT(i, d) can be used.
Alternatively, fit(i, d) may be determined from the distance range alone, that is, fit(i, d) = f4(HIT(i, d)), where f4(HIT(i, d)) may be a function that maps the distance range of word pair i determined from HIT(i, d) to a concrete fit(i, d) value. For example, different distance range levels are mapped in advance to different fit(i, d) values; once the distance range level of word pair i has been determined from HIT(i, d), the fit(i, d) value corresponding to that level is used.
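A short sketch of the f4 variant using the minimum-distance-range rule. The per-level fit values are hypothetical, and returning 0.0 when the pair never co-occurs is an assumption the text does not address.

```python
LEVEL_FIT = [1.0, 0.8, 0.6, 0.3, 0.1]  # hypothetical fit value per level, 0 = adjacent

def min_distance_level(hit):
    """Smallest level index with at least one co-occurrence, or None if there is none."""
    return next((k for k, n in enumerate(hit) if n > 0), None)

def fit_f4(hit):
    """Map the tightest observed distance level to its preset fit value."""
    level = min_distance_level(hit)
    return 0.0 if level is None else LEVEL_FIT[level]

sentence_only = fit_f4([0, 0, 3, 1, 5])
```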
Step 403: Using the weighted value of each word pair of the query and the satisfaction degree of the web page for each word pair, determine the escape degree offset_ratio(d, q) of the web page for the query.
Here, offset_ratio(d, q) is the escape degree of web page d for query q, computed from weight(i) and fit(i, d) over the word pairs i in φ, where φ is the set of word pairs in q.
After the escape degree of each web page in the search results for the query has been determined, the search results can be sorted in descending order of escape degree. A higher escape degree means that the page matches the closely bound word pairs of the query better, so ranking by it gives better results.
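The aggregation and ranking can be sketched as follows. The exact formula for offset_ratio(d, q) is not available in this text, so the weight-normalized sum below is an assumed aggregation, not the patent's formula; the function names and sample values are hypothetical. Only the descending sort follows directly from the description.

```python
def offset_ratio(weights, fits):
    """Assumed aggregation: weight-normalized sum of fit(i, d) over the pairs in phi."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    return sum(w * fits.get(pair, 0.0) for pair, w in weights.items()) / total

def rank_pages(weights, fits_by_page):
    """Sort page ids by escape degree, highest first, as the text describes."""
    return sorted(fits_by_page,
                  key=lambda d: offset_ratio(weights, fits_by_page[d]),
                  reverse=True)

weights = {"p1": 16.0, "p2": 2.0}
fits_by_page = {
    "a": {"p1": 0.25, "p2": 1.0},
    "b": {"p1": 1.0, "p2": 0.0},
}
ranking = rank_pages(weights, fits_by_page)
```

Page "b" ranks first because it satisfies the heavily weighted (tight) pair "p1", even though page "a" fully satisfies the loose pair "p2".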
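The above describes the method provided by the present invention. The apparatus provided by the present invention is described in detail below.
Embodiment 4
Fig. 5 is a structural diagram of the escape degree determining apparatus provided by Embodiment 4 of the present invention. The apparatus may be deployed on the server that hosts the search engine, or on another server that can interact with the search engine. As shown in Fig. 5, the apparatus may include: a closeness analysis unit 500, a distance distribution determining unit 510 and an escape degree determining unit 520.
The closeness analysis unit 500 analyzes the closeness of the query entered by the user and determines the closeness of each word pair in the query.
The distance distribution determining unit 510 counts the physical distance distribution of each word pair of the query in each web page of the search results, based on the result of the structural information processing performed on the pages in the search results corresponding to the query.
The distance distribution determining unit 510 can obtain the search results corresponding to the query from the search engine.
The escape degree determining unit 520 uses the closeness of each word pair of the query and its physical distance distribution in each web page to determine the escape degree of each page in the search results for the query. The escape degree is used to rank the pages in the search results.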
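The closeness analysis unit 500 may specifically include: a word segmentation subunit 501, a word pair determining subunit 502 and a closeness determining subunit 503.
The word segmentation subunit 501 segments the query into words. The segmentation methods it uses may include, without limitation, dictionary-based longest matching and methods based on statistical models.
The word pair determining subunit 502 determines the word pairs of the query from the words obtained by segmentation.
The closeness determining subunit 503 queries the pre-mined proper-noun dictionary and/or co-occurrence dictionary to determine the closeness of each word pair. The proper-noun dictionary contains pre-mined proper nouns, and the co-occurrence dictionary records the predetermined co-occurrence status of word pairs in existing data sources.
Preferably, the closeness analysis unit 500 may further include a filtering subunit 504 arranged between the word segmentation subunit 501 and the word pair determining subunit 502. The filtering subunit 504 filters the words produced by the word segmentation subunit 501 against a stop-word list and sends the remaining words to the word pair determining subunit 502, which then determines the word pairs of the query from the filtered words.
When determining the word pairs of the query, the word pair determining subunit 502 may pair adjacent words among the segmented words; or it may pair, two by two, the segmented words with strong expressive power, where such words are identified by their part of speech or their sentence component in the query.
When the closeness determining subunit 503 uses the proper-noun dictionary to determine the closeness of each word pair, if a proper noun in the dictionary contains word pair i, the subunit may set the closeness of word pair i to a preset closeness value, where word pair i is any word pair of the query. The use of the proper-noun dictionary is not shown in Fig. 5.
The proper-noun dictionary can be mined with existing techniques. Proper nouns can currently be divided into 18 types, including personal names, place names, film and television titles, country names, institution names and organization names.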
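For determining the closeness of each word pair with the co-occurrence dictionary, the closeness determining subunit 503 may specifically include: a dictionary query module 5031, a distance level determining module 5032 and a closeness determining module 5033.
The dictionary query module 5031 queries the co-occurrence dictionary to determine the co-occurrence status of word pair i in the existing data sources. The co-occurrence status includes the number of occurrences of word pair i at each distance range level.
The distance level determining module 5032 determines, from the query result of the dictionary query module 5031, the distance range level at which word pair i has the largest relative occurrence probability.
The closeness determining module 5033 takes the closeness corresponding to the distance range level determined by the distance level determining module 5032 as the closeness of word pair i, where different distance range levels are preset to correspond to different closeness values.
To support offline mining of the co-occurrence dictionary, the closeness analysis unit 500 may further include a co-occurrence dictionary mining subunit 505. It segments the data sources and filters the result against a stop-word list, combines the resulting words two by two into word pairs, counts the co-occurrence status of these pairs in the data sources, and stores the counted co-occurrence status in the co-occurrence dictionary.
The data sources used may include, without limitation: web page content, web page titles, and queries in search logs.
The co-occurrence status of each word pair in the dictionary can be stored as: the word pair, the co-occurrence distance range of the pair, and the number of co-occurrences within that range. The distance range can be preset as several levels, for example five: web page block, paragraph, sentence, within N words, and adjacent, where N is an integer greater than 2.
If the closeness determining subunit 503 uses both the proper-noun dictionary and the co-occurrence dictionary, and the closeness of word pair i can be determined by querying the proper-noun dictionary, then that closeness is used as the closeness of word pair i.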
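To allow the distance distribution determining unit 510 to count the physical distance distribution of each word pair of the query in each web page of the search results, the apparatus may further include a structural information processing unit 530. This unit divides web pages into web page blocks, paragraphs and sentences, records the position information of each word in the page, and stores it in a database. The position information includes the web page block, paragraph and sentence in which the word occurs and its offset within the sentence.
The web page blocks in this embodiment include, without limitation: title blocks, anchor blocks, mypos blocks or content blocks. Anchor blocks and content blocks may be divided at a finer granularity.
Accordingly, the distance distribution determining unit 510 may specifically include: a co-occurrence status determining subunit 511 and a distance distribution statistics subunit 512.
The co-occurrence status determining subunit 511 determines the co-occurrence status of word pair i in web page d from the position information, recorded in the database, of the two words of word pair i in web page d. Web page d is any web page in the search results.
The distance distribution statistics subunit 512 counts the physical distance distribution of word pair i in web page d according to the co-occurrence status determined by the co-occurrence status determining subunit 511.
The structure of the escape degree determining unit 520 is described in detail below. It may specifically include: a weighted value determining subunit 521, a satisfaction degree determining subunit 522 and an escape degree determining subunit 523.
The weighted value determining subunit 521 determines the weighted value weight(i) of word pair i from the closeness of word pair i in the query.
The satisfaction degree determining subunit 522 determines the satisfaction degree fit(i, d) of web page d for word pair i from the physical distance distribution of word pair i in web page d of the search results.
The escape degree determining subunit 523 determines the escape degree offset_ratio(d, q) of web page d for query q by a formula over weight(i) and fit(i, d), where φ is the set of word pairs in query q.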
The weighted value determining subunit 521 may determine the weighted value weight(i) of word pair i according to weight(i) = f1(tight(i), imp(i)) or weight(i) = f2(tight(i)).
Here tight(i) is the closeness of word pair i, and imp(i) is the importance of word pair i in query q. f1(tight(i), imp(i)) is a function that takes tight(i) as the main factor and imp(i) as an adjustment factor: for the same imp(i), a larger tight(i) gives a larger weight(i). f2(tight(i)) is a function that normalizes tight(i).
In this case, the escape degree determining unit 520 may further include an importance determining subunit 524, which determines imp(i) from at least one of the following factors: the part of speech of word pair i in the query, the sentence component of word pair i in the query, and the inverse document frequency of word pair i.
The satisfaction degree determining subunit 522 may determine the satisfaction degree fit(i, d) of web page d for word pair i according to fit(i, d) = f3(HIT(i, d), tight(i)) or fit(i, d) = f4(HIT(i, d)).
Here HIT(i, d) is the counted physical distance distribution of word pair i in web page d, and tight(i) is the closeness of word pair i. f3(HIT(i, d), tight(i)) is a function that takes the distance range of word pair i determined from HIT(i, d) as the main factor and tight(i) as an adjustment factor: for the same tight(i), the smaller the distance range determined from HIT(i, d), the larger fit(i, d). f4(HIT(i, d)) is a function that maps the distance range of word pair i determined from HIT(i, d) to a concrete fit(i, d) value.
In this case, the escape degree determining unit 520 may further include a distance range determining subunit 525, which determines the distance range of word pair i from HIT(i, d). Specifically, it may:
use the minimum distance range of word pair i in HIT(i, d) as the distance range of word pair i; or, according to HIT(i, d), use the distance range level with the largest relative occurrence probability as the distance range level of word pair i.
After the apparatus shown in Fig. 5 has determined the escape degree of each web page in the search results for the query, the escape degree can be provided to the search engine for ranking the pages in the search results: the higher a page's escape degree for the query, the higher it is ranked.
The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.
Claims (30)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110135805.3A CN102799586B (en) | 2011-05-24 | 2011-05-24 | A kind of escape degree defining method for search results ranking and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102799586A true CN102799586A (en) | 2012-11-28 |
CN102799586B CN102799586B (en) | 2016-04-27 |
Family
ID=47198698
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201110135805.3A Expired - Fee Related CN102799586B (en) | 2011-05-24 | 2011-05-24 | A kind of escape degree defining method for search results ranking and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102799586B (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104216931A (en) * | 2013-05-29 | 2014-12-17 | 酷盛(天津)科技有限公司 | Real-time recommending system and method |
CN104778262A (en) * | 2015-04-21 | 2015-07-15 | 无锡天脉聚源传媒科技有限公司 | Searching method and searching device |
CN105354321A (en) * | 2015-11-16 | 2016-02-24 | 中国建设银行股份有限公司 | Query data processing method and device |
CN105677664A (en) * | 2014-11-19 | 2016-06-15 | 腾讯科技(深圳)有限公司 | Compactness determination method and device based on web search |
CN109241356A (en) * | 2018-06-22 | 2019-01-18 | 腾讯科技(深圳)有限公司 | A kind of data processing method, device and storage medium |
WO2020199270A1 (en) * | 2019-04-04 | 2020-10-08 | Beijing Didi Infinity Technology And Development Co., Ltd. | Systems and methods for identifying proper nouns |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080109434A1 (en) * | 2006-11-07 | 2008-05-08 | Bellsouth Intellectual Property Corporation | Determining Sort Order by Distance |
CN101901249A (en) * | 2009-05-26 | 2010-12-01 | 复旦大学 | A Text-Based Query Expansion and Ranking Method in Image Retrieval |
CN101957828A (en) * | 2009-07-20 | 2011-01-26 | 阿里巴巴集团控股有限公司 | Method and device for sequencing search results |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104216931A (en) * | 2013-05-29 | 2014-12-17 | 酷盛(天津)科技有限公司 | Real-time recommending system and method |
CN105677664A (en) * | 2014-11-19 | 2016-06-15 | 腾讯科技(深圳)有限公司 | Compactness determination method and device based on web search |
CN105677664B (en) * | 2014-11-19 | 2019-11-19 | 腾讯科技(深圳)有限公司 | Method and device is determined based on the tightness of web search |
CN104778262A (en) * | 2015-04-21 | 2015-07-15 | 无锡天脉聚源传媒科技有限公司 | Searching method and searching device |
CN104778262B (en) * | 2015-04-21 | 2018-07-24 | 无锡天脉聚源传媒科技有限公司 | A kind of searching method and device |
CN105354321A (en) * | 2015-11-16 | 2016-02-24 | 中国建设银行股份有限公司 | Query data processing method and device |
CN109241356A (en) * | 2018-06-22 | 2019-01-18 | 腾讯科技(深圳)有限公司 | A kind of data processing method, device and storage medium |
WO2020199270A1 (en) * | 2019-04-04 | 2020-10-08 | Beijing Didi Infinity Technology And Development Co., Ltd. | Systems and methods for identifying proper nouns |
CN111797620A (en) * | 2019-04-04 | 2020-10-20 | 北京嘀嘀无限科技发展有限公司 | System and method for recognizing proper nouns |
CN111797620B (en) * | 2019-04-04 | 2023-12-19 | 北京嘀嘀无限科技发展有限公司 | System and method for identifying proper nouns |
Also Published As
Publication number | Publication date |
---|---|
CN102799586B (en) | 2016-04-27 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20160427 |