CN105955993B - Search result ordering method and device

Info

Publication number: CN105955993B
Application number: CN201610245408.4A
Authority: CN
Inventors: 苏建雷; 吴文权; 刘占一
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2016-04-19
Filing date: 2016-04-19
Publication date: 2020-09-25
Anticipated expiration: 2036-04-19
Also published as: CN105955993A

Abstract

The invention discloses a search result ordering method and a search result ordering device. The method comprises the following steps: receiving an input query sentence, and dividing the query sentence into a plurality of words; acquiring the collocations corresponding to the words; obtaining a plurality of search results corresponding to the query sentence; and sorting the plurality of search results based on a bag-of-words (BOW) model according to the words and the collocations corresponding to the words. Because the search results are sorted using both the words of the query sentence and their collocations, the ordering of the search results is optimized, search results that better match the user's intention are provided preferentially, and the user experience is improved.

Description

Search result ordering method and device

Technical Field

The invention relates to the field of Internet technology, and in particular to a search result ordering method and a search result ordering device.

Background

With the rapid development of science and technology, the Internet has become part of people's daily life. Users can query the information they need through a search engine. How to sort the search results so that the information the user needs appears at the top is a key topic of current research.

Currently, search result sorting is assisted mainly by the semantic similarity between the query sentence and the title of each search result. For example, a word vector is calculated for each word of the query sentence, and these word vectors are added to obtain the word vector of the query sentence. Similarly, a word vector for the title of the search result is calculated. Finally, the semantic similarity between the two word vectors is calculated, and the search results are sorted accordingly.

However, this method does not consider the structural information of the query sentence, which may cause the final sorting result to deviate. For example, the two query sentences "Wu Song killed the tiger" and "the tiger killed Wu Song" contain the same words and therefore have the same word vector, even though their meanings differ, which reduces the accuracy of the sorting result.
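The order-invariance problem described above can be illustrated with a minimal sketch. The toy embeddings, the example sentences, and the helper name `sentence_vector` are assumptions made for illustration only and do not come from the patent:

```python
# Toy 2-dimensional embeddings; the values are made up for illustration.
toy_vectors = {
    "dog": [1.0, 0.0],
    "bites": [0.0, 1.0],
    "man": [0.5, 0.5],
}


def sentence_vector(words):
    """Sum the word vectors of a sentence; word order plays no role."""
    dim = len(next(iter(toy_vectors.values())))
    total = [0.0] * dim
    for word in words:
        for k, value in enumerate(toy_vectors[word]):
            total[k] += value
    return total


a = sentence_vector(["dog", "bites", "man"])
b = sentence_vector(["man", "bites", "dog"])
print(a == b)  # True: the two sentences cannot be told apart
```

Because addition is commutative, any reordering of the same words yields the same sentence vector, which is the deviation the invention aims to reduce.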

Disclosure of Invention

The present invention is directed to solving, at least to some extent, one of the technical problems in the related art. Therefore, an object of the present invention is to provide a search result ranking method, which can optimize the ranking of search results, preferentially provide search results that better meet the user's intention, and improve the user experience.

A second object of the invention is to provide a search result sorting device.

To achieve the above object, an embodiment of a first aspect of the present invention provides a search result ranking method, including: receiving an input query sentence, and dividing the query sentence into a plurality of words; acquiring the collocations corresponding to the words; obtaining a plurality of search results corresponding to the query sentence; and sorting the plurality of search results based on a BOW model according to the words and the collocations corresponding to the words.

According to the search result ranking method of the embodiment of the invention, the search results are sorted based on the BOW model using both the words of the query sentence and their collocations, so that the ordering of the search results is optimized, search results that better match the user's intention are provided preferentially, and the user experience is improved.

An embodiment of a second aspect of the present invention provides a search result sorting apparatus, including: a segmentation module for receiving an input query sentence and segmenting the query sentence into a plurality of words; a first obtaining module for obtaining the collocations corresponding to the words; a second obtaining module for obtaining a plurality of search results corresponding to the query sentence; and a sorting module for sorting the plurality of search results based on a BOW model according to the words and the collocations corresponding to the words.

According to the search result sorting apparatus of the embodiment of the invention, the search results are sorted based on the BOW model using both the words of the query sentence and their collocations, so that the ordering of the search results is optimized, search results that better match the user's intention are provided preferentially, and the user experience is improved.

Drawings

FIG. 1 is a flow diagram of a search result ranking method according to one embodiment of the invention;

FIG. 2 is a schematic diagram illustrating how candidate collocations are obtained;

FIG. 3 is a flow diagram of ranking a plurality of search results based on a BOW model, according to one embodiment of the invention;

FIG. 4 is a first schematic structural diagram of a search result sorting apparatus according to an embodiment of the present invention;

FIG. 5 is a second schematic structural diagram of a search result sorting apparatus according to an embodiment of the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer throughout to the same or similar elements or to elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary, are intended to explain the invention, and are not to be construed as limiting the invention.

The search result ranking method and apparatus of the embodiments of the present invention are described below with reference to the accompanying drawings.

FIG. 1 is a flow diagram of a search result ranking method according to one embodiment of the invention.

As shown in FIG. 1, the search result ranking method may include:

S1: receiving an input query sentence, and dividing the query sentence into a plurality of words.

For example, if the query sentence is "How does in-vehicle Bluetooth connect automatically?", it can be divided into "in-vehicle", "Bluetooth", "how", "automatic", "connect", and "?".

S2: acquiring the collocations corresponding to the words.

Specifically, the current word and the words within a preset window length of the current word are obtained first; the current word is then combined with each of those words to form a plurality of candidate collocations; finally, the collocations corresponding to the current word are screened out based on a preset collocation dictionary.

Continuing the above example, assume that the current word is "in-vehicle" and the preset window length is 5. The words within the same window as "in-vehicle" then include "Bluetooth", "how", "automatic", and "connect", and the candidate collocations for "in-vehicle" are "in-vehicle Bluetooth", "in-vehicle how", "in-vehicle automatic", and "in-vehicle connect". Similarly, as shown in FIG. 2, the words within the same window as "Bluetooth" include "how", "automatic", "connect", and "?", so the candidate collocations for "Bluetooth" are "Bluetooth how", "Bluetooth automatic", "Bluetooth connect", and "Bluetooth ?". By analogy, each word in the query sentence is scanned to obtain its candidate collocations.

The reasonable collocations are then screened out based on the preset collocation dictionary. A reasonable collocation may be one with a high co-occurrence frequency or a high log-likelihood ratio score. For example, "in-vehicle how", "in-vehicle automatic", "in-vehicle connect", "Bluetooth how", "Bluetooth automatic", and "Bluetooth ?" all have a low co-occurrence frequency or a low log-likelihood ratio score, so these candidates are filtered out. Finally, the highly reasonable collocations "in-vehicle Bluetooth" and "Bluetooth connect" are obtained.
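Step S2 can be sketched as follows. This is a minimal illustration, assuming that the window of length 5 covers the current word plus the four words that follow it (which is how the example above reads) and that the collocation dictionary is a set of word pairs. The function names and the dictionary contents are hypothetical:

```python
def candidate_collocations(words, window=5):
    """For each position, pair the word with the words that follow it in the window."""
    return [
        [(current, other) for other in words[i + 1 : i + window]]
        for i, current in enumerate(words)
    ]


def screen_collocations(candidates, collocation_dict):
    """Keep only the candidate pairs that appear in the collocation dictionary."""
    return [pair for pair in candidates if pair in collocation_dict]


words = ["in-vehicle", "Bluetooth", "how", "automatic", "connect", "?"]
# Hypothetical dictionary contents, chosen to match the example in the text.
collocation_dict = {("in-vehicle", "Bluetooth"), ("Bluetooth", "connect")}

for per_word in candidate_collocations(words):
    kept = screen_collocations(per_word, collocation_dict)
    if kept:
        print(kept)
```

With this dictionary, only the pairs ("in-vehicle", "Bluetooth") and ("Bluetooth", "connect") survive the screening, as in the example.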

S3: acquiring a plurality of search results corresponding to the query sentence.

For example, if the query sentence is "How does in-vehicle Bluetooth connect automatically?", a plurality of search results corresponding to it may be retrieved based on the query sentence.

S4: sorting the plurality of search results based on the BOW model according to the words and the collocations corresponding to the words.

After the words, their corresponding collocations, and the plurality of search results are obtained, the search results may be sorted based on a BOW (Bag of Words) model.

Specifically, as shown in FIG. 3, the sorting process can be divided into the following steps:

S41: calculating the sum of the word vectors of the words and the word vectors of the collocations corresponding to the words, and taking the sum as the word vector of the query sentence.

For example, suppose the query sentence includes word 1, word 2, and word 3, where the collocations corresponding to word 1 are collocation 1 and collocation 2, the collocations corresponding to word 2 are collocation 3, collocation 4, and collocation 5, and the collocation corresponding to word 3 is collocation 6. The word vectors of word 1, word 2, word 3, and collocations 1 through 6 are calculated separately and then added to obtain the word vector of the query sentence. Because the collocation word vectors are added to the input features when the word vector of the query sentence is calculated, features of the syntactic structure are introduced. This effectively overcomes the shortcoming of using only the word vectors of individual words as input features and improves semantic discrimination.
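Step S41 can be sketched as a plain vector sum. The patent does not say how collocation vectors are obtained, so this sketch assumes each collocation has its own entry in a lookup table keyed by the word pair. All names and vector values below are hypothetical:

```python
def add_vectors(vectors, dim):
    """Element-wise sum of a list of equal-length vectors."""
    total = [0.0] * dim
    for vec in vectors:
        for k, value in enumerate(vec):
            total[k] += value
    return total


def query_vector(words, collocations, word_vectors, collocation_vectors, dim):
    """Sum the word vectors and the collocation vectors of a query sentence."""
    vectors = [word_vectors[w] for w in words]
    vectors += [collocation_vectors[c] for c in collocations]
    return add_vectors(vectors, dim)


# Made-up 3-dimensional vectors, for illustration only.
word_vectors = {
    "in-vehicle": [1.0, 0.0, 0.0],
    "Bluetooth": [0.0, 1.0, 0.0],
    "connect": [0.0, 0.0, 1.0],
}
collocation_vectors = {
    ("in-vehicle", "Bluetooth"): [0.5, 0.5, 0.0],
    ("Bluetooth", "connect"): [0.0, 0.5, 0.5],
}

q = query_vector(
    ["in-vehicle", "Bluetooth", "connect"],
    [("in-vehicle", "Bluetooth"), ("Bluetooth", "connect")],
    word_vectors,
    collocation_vectors,
    dim=3,
)
print(q)  # [1.5, 2.0, 1.5]
```

Since a collocation pair is ordered, two sentences that contain the same words in a different arrangement can contribute different collocation vectors and therefore end up with different sums.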

S42: calculating the word vectors of the plurality of search results.

In this embodiment, the word vector of a search result is the word vector of the title corresponding to that search result. The title consists of a plurality of words, so the word vector of the search result can be calculated in the same way as the word vector of the query sentence.

S43: calculating the semantic similarity between the word vector of the query sentence and the word vectors of the plurality of search results.

After the word vector of the query sentence and the word vectors of the search results are calculated, the semantic similarity between the word vector of the query sentence and the word vector of each search result may be calculated separately.

S44: sorting the plurality of search results according to the semantic similarity.

In this embodiment, the search results may be ranked in descending order of semantic similarity, so that search results with higher semantic similarity to the query sentence are ranked higher.
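Steps S43 and S44 can be sketched as follows. The patent does not name a similarity measure; cosine similarity is used here as an assumption because it is a common choice for comparing embedding vectors. The function names, titles, and vectors are hypothetical:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_results(query_vec, results):
    """results is a list of (title, title_vector); returns titles, most similar first."""
    scored = [(cosine_similarity(query_vec, vec), title) for title, vec in results]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [title for _, title in scored]


# Made-up title vectors, for illustration only.
results = [
    ("title A", [0.0, 1.0]),
    ("title B", [1.0, 0.1]),
    ("title C", [1.0, 1.0]),
]
print(rank_results([1.0, 0.0], results))  # ['title B', 'title C', 'title A']
```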

The process of establishing the preset collocation dictionary is described in detail below.

First, a training corpus is obtained, and candidate word pairs are extracted from it. Word pairs are then screened out from the candidate word pairs based on co-occurrence frequency or log-likelihood ratio score, and finally the preset collocation dictionary is established from the screened word pairs.

For example, assuming that the training corpus is "in-vehicle Bluetooth automatic connect", each word can be regarded as a node, and every two nodes are analyzed for a syntactic dependency. Two words that have a syntactic dependency are taken as a candidate word pair. Function-word nodes (for example, structural particles) may exist in the training corpus and can be filtered out.

After the candidate word pairs are selected, they are screened by one of two methods. The first method screens by co-occurrence frequency, that is, the number of times the two words forming a candidate word pair occur together: candidate word pairs whose co-occurrence frequency is greater than a preset number are kept. The second method screens by log-likelihood ratio score: the top N candidate word pairs with the highest log-likelihood ratio scores are kept. The log-likelihood ratio score is calculated mainly from the collocation frequency, the word frequencies, and the corpus size. The collocation frequency is the number of times the two words forming the candidate word pair occur as a collocation; the word frequencies are the numbers of times each of the two words occurs in the corpus; the corpus size is the number of words the corpus contains.

Finally, the preset collocation dictionary is established from the word pairs that remain after screening.
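The two screening methods can be sketched as follows. The patent gives no formula for the log-likelihood ratio score; this sketch assumes Dunning's log-likelihood ratio over a 2x2 contingency table, which uses the three quantities the text names (collocation frequency, word frequencies, and corpus size). Candidate pairs are assumed to have been extracted already, since dependency parsing is outside the scope of the sketch. All function names and counts are hypothetical:

```python
import math
from collections import Counter


def log_likelihood_ratio(pair_count, count_a, count_b, corpus_size):
    """Dunning's log-likelihood ratio for a word pair (assumed scoring formula)."""
    k11 = pair_count                                   # a and b together
    k12 = count_a - pair_count                         # a without b
    k21 = count_b - pair_count                         # b without a
    k22 = corpus_size - count_a - count_b + pair_count # neither a nor b
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = (
        (k11, rows[0], cols[0]),
        (k12, rows[0], cols[1]),
        (k21, rows[1], cols[0]),
        (k22, rows[1], cols[1]),
    )
    score = 0.0
    for observed, row_total, col_total in cells:
        if observed > 0:
            score += observed * math.log(observed * corpus_size / (row_total * col_total))
    return 2.0 * score


def screen_by_frequency(pair_counts, min_count):
    """Method 1: keep pairs that co-occur more than min_count times."""
    return {pair for pair, count in pair_counts.items() if count > min_count}


def screen_by_llr(pair_counts, word_counts, corpus_size, top_n):
    """Method 2: keep the top_n pairs with the highest log-likelihood ratio."""
    def score(pair):
        a, b = pair
        return log_likelihood_ratio(pair_counts[pair], word_counts[a], word_counts[b], corpus_size)
    return set(sorted(pair_counts, key=score, reverse=True)[:top_n])


# Made-up counts, for illustration only.
pair_counts = Counter({("in-vehicle", "Bluetooth"): 10, ("in-vehicle", "how"): 1})
word_counts = Counter({"in-vehicle": 11, "Bluetooth": 10, "how": 10})

print(screen_by_frequency(pair_counts, min_count=5))
print(screen_by_llr(pair_counts, word_counts, corpus_size=100, top_n=1))
```

Both calls keep only ("in-vehicle", "Bluetooth") for these counts. The resulting set can serve directly as the collocation dictionary used to screen candidate collocations.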

To achieve the above object, the invention further provides a search result sorting apparatus.

FIG. 4 is a first schematic structural diagram of a search result sorting apparatus according to an embodiment of the present invention.

As shown in FIG. 4, the search result sorting apparatus may include: a segmentation module 110, a first obtaining module 120, a second obtaining module 130, and a sorting module 140. The first obtaining module 120 may include an obtaining unit 121, a combining unit 122, and a screening unit 123.

The segmentation module 110 is configured to receive an input query sentence and segment the query sentence into a plurality of words. For example, if the query sentence is "How does in-vehicle Bluetooth connect automatically?", it can be divided into "in-vehicle", "Bluetooth", "how", "automatic", "connect", and "?".

The first obtaining module 120 is configured to obtain the collocations corresponding to the words. Specifically, the obtaining unit 121 first obtains the current word and the words within a preset window length of the current word; the combining unit 122 then combines the current word with each of those words to form a plurality of candidate collocations; finally, the screening unit 123 screens out the collocations corresponding to the current word based on a preset collocation dictionary. Continuing the above example, assume that the current word is "in-vehicle" and the preset window length is 5. The words within the same window as "in-vehicle" then include "Bluetooth", "how", "automatic", and "connect", and the candidate collocations for "in-vehicle" are "in-vehicle Bluetooth", "in-vehicle how", "in-vehicle automatic", and "in-vehicle connect". Similarly, as shown in FIG. 2, the words within the same window as "Bluetooth" include "how", "automatic", "connect", and "?", so the candidate collocations for "Bluetooth" are "Bluetooth how", "Bluetooth automatic", "Bluetooth connect", and "Bluetooth ?". By analogy, each word in the query sentence is scanned to obtain its candidate collocations. The reasonable collocations are then screened out based on the preset collocation dictionary. A reasonable collocation may be one with a high co-occurrence frequency or a high log-likelihood ratio score. For example, "in-vehicle how", "in-vehicle automatic", "in-vehicle connect", "Bluetooth how", "Bluetooth automatic", and "Bluetooth ?" all have a low co-occurrence frequency or a low log-likelihood ratio score, so these candidates are filtered out. Finally, the highly reasonable collocations "in-vehicle Bluetooth" and "Bluetooth connect" are obtained.

The second obtaining module 130 is configured to obtain a plurality of search results corresponding to the query sentence. For example, if the query sentence is "How does in-vehicle Bluetooth connect automatically?", the second obtaining module 130 may retrieve a plurality of search results corresponding to it based on the query sentence.

The sorting module 140 is configured to sort the plurality of search results based on the BOW model according to the words and the collocations corresponding to the words. Specifically, the sorting module 140 first calculates the sum of the word vectors of the words and the word vectors of the collocations corresponding to the words, and takes the sum as the word vector of the query sentence. For example, suppose the query sentence includes word 1, word 2, and word 3, where the collocations corresponding to word 1 are collocation 1 and collocation 2, the collocations corresponding to word 2 are collocation 3, collocation 4, and collocation 5, and the collocation corresponding to word 3 is collocation 6. The word vectors of word 1, word 2, word 3, and collocations 1 through 6 are calculated separately and then added to obtain the word vector of the query sentence. Because the collocation word vectors are added to the input features when the word vector of the query sentence is calculated, features of the syntactic structure are introduced. This effectively overcomes the shortcoming of using only the word vectors of individual words as input features and improves semantic discrimination.

The ranking module 140 then computes a word vector for the plurality of search results. In this embodiment, the word vector of the search result is actually the word vector of the title corresponding to the search result. The title corresponding to the search result is composed of a plurality of words, so the word vector of the search result can be calculated by adopting the same method as the method for calculating the word vector of the query sentence.

Then, the sorting module 140 calculates semantic similarity between the word vector of the query sentence and the word vectors of the plurality of search results.

Finally, the ranking module 140 may rank the plurality of search results according to semantic similarity. For example, the search results may be ranked in order of high to low semantic similarity, such that search results that are more semantically similar to the query statement are ranked higher.
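How the four modules might cooperate can be sketched as a single class. This is an illustrative arrangement, not the patented apparatus: whitespace splitting stands in for the segmentation module, the titles are passed in rather than retrieved, cosine similarity is an assumed similarity measure, and the class name, method names, and toy vectors are all hypothetical:

```python
import math


def cosine(a, b):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class SearchResultRanker:
    def __init__(self, embeddings, collocation_dict, dim, window=5):
        self.embeddings = embeddings              # words and (word, word) pairs -> vectors
        self.collocation_dict = collocation_dict  # set of accepted (word, word) pairs
        self.dim = dim
        self.window = window

    def segment(self, text):
        # Stand-in for the segmentation module: split on whitespace.
        return text.split()

    def collocations(self, words):
        # First obtaining module: window candidates screened by the dictionary.
        return [
            (current, other)
            for i, current in enumerate(words)
            for other in words[i + 1 : i + self.window]
            if (current, other) in self.collocation_dict
        ]

    def vector(self, text):
        # Sum of word vectors and collocation vectors; unknown keys count as zero.
        words = self.segment(text)
        total = [0.0] * self.dim
        for key in words + self.collocations(words):
            for k, value in enumerate(self.embeddings.get(key, [0.0] * self.dim)):
                total[k] += value
        return total

    def rank(self, query, titles):
        # Sorting module: most similar title first.
        q = self.vector(query)
        return sorted(titles, key=lambda t: cosine(q, self.vector(t)), reverse=True)


embeddings = {
    "dog": [1.0, 0.0, 0.0, 0.0],
    "bites": [0.0, 1.0, 0.0, 0.0],
    "man": [0.0, 0.0, 1.0, 0.0],
    ("dog", "bites"): [0.0, 0.0, 0.0, 1.0],
    ("man", "bites"): [0.0, 0.0, 0.0, -1.0],
}
ranker = SearchResultRanker(embeddings, {("dog", "bites"), ("man", "bites")}, dim=4)
print(ranker.rank("dog bites man", ["man bites dog", "dog bites man"]))
```

With these toy vectors the title "dog bites man" is ranked above "man bites dog" for the query "dog bites man", even though both titles contain exactly the same words.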

Furthermore, as shown in FIG. 5, the first obtaining module 120 may further include an establishing module 124.

Specifically, the establishing module 124 first obtains a training corpus and extracts candidate word pairs from it, then screens out word pairs from the candidate word pairs based on co-occurrence frequency or log-likelihood ratio score, and finally establishes the preset collocation dictionary from the screened word pairs. For example, assuming that the training corpus is "in-vehicle Bluetooth automatic connect", each word can be regarded as a node, and every two nodes are analyzed for a syntactic dependency. Two words that have a syntactic dependency are taken as a candidate word pair. Function-word nodes (for example, structural particles) may exist in the training corpus and can be filtered out. After the candidate word pairs are selected, they are screened by one of two methods. The first method screens by co-occurrence frequency, that is, the number of times the two words forming a candidate word pair occur together: candidate word pairs whose co-occurrence frequency is greater than a preset number are kept. The second method screens by log-likelihood ratio score: the top N candidate word pairs with the highest log-likelihood ratio scores are kept. The log-likelihood ratio score is calculated mainly from the collocation frequency, the word frequencies, and the corpus size. The collocation frequency is the number of times the two words forming the candidate word pair occur as a collocation; the word frequencies are the numbers of times each of the two words occurs in the corpus; the corpus size is the number of words the corpus contains. Finally, the preset collocation dictionary is established from the word pairs that remain after screening.

Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.

In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, such references do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art can combine the various embodiments or examples described in this specification, and the features of different embodiments or examples, provided they do not contradict one another.

Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims

1. A method for ranking search results, comprising the steps of:

receiving an input query sentence, and segmenting the query sentence into a plurality of words;

acquiring collocation corresponding to the word;

obtaining a plurality of search results corresponding to the query statement; and

sorting the plurality of search results based on a bag of words (BOW) model according to the words and the collocation corresponding to the words;

wherein obtaining a collocation corresponding to the word comprises:

acquiring a current word and the words within a preset window length of the current word;

combining the current word and the words within the preset window length of the current word into a plurality of candidate collocations;

screening out a collocation corresponding to the current word based on a preset collocation dictionary;

wherein, according to the word and the collocation corresponding to the word, the plurality of search results are sorted based on a BOW model, including:

calculating the sum of the word vector of the word and the word vector of the collocation corresponding to the word as the word vector of the query sentence;

calculating a word vector for the plurality of search results;

calculating semantic similarity of the word vector of the query statement and the word vectors of the plurality of search results;

and sequencing the plurality of search results according to the semantic similarity.

2. The method of claim 1, further comprising, prior to screening out a collocation corresponding to the current word based on a preset collocation dictionary:

and establishing a preset collocation dictionary.

3. The method of claim 2, wherein establishing a predetermined collocation dictionary comprises:

acquiring a training corpus and extracting a candidate word pair corpus in the training corpus;

and screening out the word pair corpus from the candidate word pair corpus based on the co-occurrence frequency or the log-likelihood ratio score, and establishing a preset collocation dictionary based on the word pair corpus.

4. A search result ranking apparatus, comprising:

the segmentation module is used for receiving an input query sentence and segmenting the query sentence into a plurality of words;

the first obtaining module is used for obtaining the collocation corresponding to the word;

the second acquisition module is used for acquiring a plurality of search results corresponding to the query statement; and

the sorting module is used for sorting the plurality of search results based on a BOW model according to the words and the collocation corresponding to the words;

wherein, the first obtaining module comprises:

the acquiring unit is used for acquiring a current word and the words within a preset window length of the current word;

the combination unit is used for combining the current word and the words within the preset window length of the current word into a plurality of candidate collocations;

the screening unit is used for screening out the collocation corresponding to the current word based on a preset collocation dictionary;

wherein the sorting module is configured to:

calculating a word vector for the plurality of search results;

5. The apparatus of claim 4, wherein the first obtaining module further comprises:

and the establishing unit is used for establishing a preset collocation dictionary before the collocation corresponding to the current word is screened out based on the preset collocation dictionary.

6. The apparatus of claim 5, wherein the establishing unit is to: