CN105279252B - Method for mining related words, search method, and search system - Google Patents


Info

Publication number
CN105279252B
CN105279252B (application CN201510657691.7A)
Authority
CN
China
Prior art keywords
word, search, words, count, confidence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510657691.7A
Other languages
Chinese (zh)
Other versions
CN105279252A (en)
Inventor
韩增新
蒋冠军
董良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Guangzhou Shenma Mobile Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Shenma Mobile Information Technology Co Ltd filed Critical Guangzhou Shenma Mobile Information Technology Co Ltd
Priority to CN201510657691.7A priority Critical patent/CN105279252B/en
Publication of CN105279252A publication Critical patent/CN105279252A/en
Priority to PCT/CN2016/101700 priority patent/WO2017063538A1/en
Application granted granted Critical
Publication of CN105279252B publication Critical patent/CN105279252B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Abstract

The invention discloses a method for mining related words, comprising: acquiring, based on large-scale user search behavior data, parallel sentence pairs that express the same meaning in different forms; performing word segmentation on each group of parallel sentence pairs; performing word alignment on the segmented parallel sentence pairs to obtain first aligned word pairs; calculating the co-occurrence frequency of the first aligned word pairs; and determining first aligned word pairs whose co-occurrence frequency exceeds a predetermined threshold as related words. With this mining method, related words with a higher degree of correlation can be mined, the scope of the search can be expanded, and the probability of finding better search results is improved. The invention also discloses a search method and a search system.

Description

Method for mining related words, search method, and search system
Technical Field
The present invention relates to the field of information retrieval, and in particular, to a method for mining related words, a search method, and a search system.
Background
A search engine is a function provided in website construction for the convenience of users, and an effective tool for studying the behavior of site visitors. Efficient in-site retrieval lets users find target information quickly and accurately, which effectively solves user problems and promotes product and service sales; deep analysis of visitors' search behavior is also of great value for devising more effective network marketing strategies.
When a user searches with a search engine, a search keyword is entered through the engine's search page, and the engine retrieves and returns results. A typical search engine either searches directly with the keyword the user entered or additionally searches with synonyms of the search term.
However, when a search uses only the original term or its synonyms, the results are limited. There are often good results whose wording does not match the search terms but is semantically closely related to them, so the web pages containing those results cannot be recalled.
Disclosure of Invention
The invention aims to solve the technical problem that a traditional search engine, retrieving only through original words or synonyms, obtains limited results, and accordingly provides a method for mining related words, a search method, and a search system.
According to one aspect of the invention, a method of mining related words is provided.
A method of mining related words, comprising:
acquiring parallel sentence pairs expressing the same meaning by adopting different expression forms based on large-scale user search behavior data;
performing word segmentation on each group of parallel sentence pairs;
performing word alignment processing on the parallel sentence pairs subjected to word segmentation processing to obtain first aligned word pairs;
calculating a co-occurrence frequency of the first aligned word pair;
determining the first aligned word pair having a co-occurrence frequency above a predetermined threshold as a related word.
Therefore, by the method for mining the related words, the related words with higher relevance can be mined, the search range of the search words can be expanded, and the probability of finding better search results is improved.
Preferably, the step of obtaining parallel sentence pairs comprises:
filtering out, according to the literal similarity of the two sentences, candidate pairs that do not actually express the same meaning.
In this way, pairs whose meanings differ are discarded, and parallel sentence pairs that express the same meaning in different wording are retained.
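As an illustration only (not part of the claimed method), the literal-similarity filter can be sketched in Python; the character-level Jaccard measure and the threshold value are assumptions, since the patent does not specify them:

```python
def char_overlap(s1: str, s2: str) -> float:
    """Jaccard similarity over the character sets of two sentences."""
    a, b = set(s1), set(s2)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def filter_parallel_pairs(pairs, threshold=0.3):
    """Keep only candidate pairs literally similar enough to plausibly
    express the same meaning (threshold is a hypothetical value)."""
    return [(s1, s2) for s1, s2 in pairs if char_overlap(s1, s2) >= threshold]
```

For Chinese text, character-set overlap is a reasonable first cut because individual characters carry meaning; a production filter might also use edit distance or word-level overlap.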
Preferably, the method further comprises recording context words for said related words.
By recording the context of the related words, and judging whether the context of the two related words is the same or similar, the method is beneficial to further judging the correlation degree between the related words.
Preferably, the word alignment process comprises a rule-based word alignment process and/or a statistical word alignment process.
Preferably, the rule-based word alignment process includes at least one of exact literal match alignment, partial literal match alignment, or adjacent-word alignment.
Thus, related words with different degrees of relevance can be mined.
Preferably, the statistical word alignment process is performed using the GIZA++ tool.
Preferably, the method further comprises:
filtering the large-scale user searching behavior data by using a linear model to obtain a second alignment word pair;
acquiring statistical characteristics capable of reflecting the correlation degree between the related words;
and training on the positive and negative samples, with the first aligned word pairs as positive samples and the second aligned word pairs as negative samples, using a gradient boosting decision tree (GBDT) algorithm based on the statistical features, to obtain the related word confidence calculation model.
Thus, by establishing a related word confidence calculation model, the degree of correlation between related words can be distinguished through the model.
Preferably, the related word confidence calculation model is a GBDT nonlinear regression model.
According to another aspect of the invention, a search method is also disclosed.
A search method comprising the steps of:
acquiring related words of the search words based on a related word library;
calculating a confidence between the search word and each related word based on a confidence calculation model;
and sequencing results obtained by searching the search words and the related words according to the corresponding confidence degrees.
Therefore, with this search method, corresponding related words can be found for the search words, which expands the search scope and the result set, and prevents results that do not literally match the search words but are semantically very close to them from going unrecalled.
Preferably, the related word lexicon is established by the method for mining related words.
By the method for mining the related words, the related words with higher relevance can be mined, the search range of the search words can be expanded, and the probability of finding better search results is improved.
Preferably, the method further comprises performing word segmentation processing on the search sentence to obtain the search word.
When a user inputs a search sentence, the search sentence is segmented to obtain a plurality of search terms, so that search results related to the search terms are searched by the search method, and the search range is further expanded.
Preferably, the step of calculating the confidence between the search term and each of the related terms based on a confidence calculation model comprises:
obtaining a characteristic value between each search word and each corresponding related word;
and taking the characteristic value as an input of the confidence coefficient calculation model, and calculating the confidence coefficient based on the confidence coefficient calculation model.
Preferably, the characteristic values include:
correlation degree information, used to measure the degree of correlation between each search word and each corresponding related word; and/or
replaceability information, used to measure the degree to which the related word can replace the search word in the context of the related word; and/or
co-occurrence relation information, used to measure the co-occurrence relations among the search words; and/or
language model score information, used to record the language model scores of the search sentence before and after the related word replaces the search word; and/or
weight value information, used to represent the weight of the related word.
Preferably, the correlation degree information includes a first translation probability P1 and/or a second translation probability P2:
P1 = count1(A, A′) / count1(A, ·), P2 = count1(A, A′) / count1(·, A′);
count1(A, ·) = Σ_j count1(A, w_j), count1(·, A′) = Σ_i count1(w_i, A′);
where the search word A and the related word A′ form a first word pair (A, A′); count1(A, A′) is the number of times the first word pair (A, A′) is aligned in the parallel sentence pairs; count1(A, ·) is the total number of times the search word A is aligned in the parallel sentence pairs; count1(·, A′) is the total number of times the related word A′ is aligned in the parallel sentence pairs; w_j is the j-th of all words aligned with the search word A, and w_i is the i-th of all words aligned with the related word A′; count1(A, w_j) is the number of times A and w_j are aligned, and count1(w_i, A′) is the number of times w_i and A′ are aligned; i and j are natural numbers.
Preferably, the replaceability information includes a first replaceability score score(D, Q) and/or a second replaceability score score(D, Q′), computed in the BM25 form:
score(D, Q) = Σ_{i=1..n} f(q_i, D)·(k1 + 1) / ( f(q_i, D) + k1·(1 - b + b·|D|/avgdl) );
score(D, Q′) = Σ_{j=1..m} f(q′_j, D)·(k1 + 1) / ( f(q′_j, D) + k1·(1 - b + b·|D|/avgdl) );
where the search word A and the related word A′ form a first word pair (A, A′);
all the context words of the search word A and the related word A′ together form a document D, and |D| is the length of D;
Q is the search sentence, q_i is the i-th search word of Q, and n is the total number of search words in Q;
Q′ is a combination of m search words, m < n, and q′_j is the j-th search word of the combination Q′;
avgdl is the average length of the documents formed by the contexts of all related words of the word A;
k1 is a first constant and b is a second constant;
f(q_i, D) is the frequency of occurrence of q_i in the document D, and f(q′_j, D) is the frequency of occurrence of q′_j in the document D.
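The replaceability score above has the shape of the Okapi BM25 ranking function. A minimal sketch, assuming the IDF-free variant implied by the variables listed (k1, b, avgdl); the default constants are conventional BM25 values, not taken from the patent:

```python
def bm25_score(query_terms, doc_terms, avgdl, k1=1.2, b=0.75):
    """BM25-style replaceability score of document D (the context words
    of a related word) against the search terms q_i."""
    dl = len(doc_terms)  # |D|
    score = 0.0
    for q in query_terms:
        f = doc_terms.count(q)  # f(q_i, D)
        if f == 0:
            continue
        score += (f * (k1 + 1)) / (f + k1 * (1 - b + b * dl / avgdl))
    return score
```

The first score uses the whole search sentence Q; the second uses a combination Q′ of m < n of its words, with the same function.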
Preferably, the co-occurrence relation information includes first co-occurrence relation information and/or second co-occurrence relation information obtained from the co-occurrence index PMI, where
PMI(A, B) = log( (count2(A, B) · count2(·, ·)) / (count2(A, ·) · count2(·, B)) );
count2(A, ·) = Σ_j count2(A, w_j);
count2(·, B) = Σ_i count2(w_i, B);
count2(·, ·) = Σ_{i,j} count2(w_i, w_j);
count2(A, ·) is the total number of times word A appears together with other words in the search resources; count2(·, B) is the total number of times word B appears together with other words in the search resources; count2(A, B) is the number of times the two words A and B appear together in the search resources; w_j is the j-th of all words that co-occur with A, and w_i is the i-th of all words that co-occur with B; count2(A, w_j) is the number of times A and w_j co-occur, count2(w_i, B) is the number of times w_i and B co-occur, and count2(w_i, w_j) is the number of times w_i and w_j co-occur; i and j are natural numbers.
The first co-occurrence relation information is the average PMI of the search word with the other words in the search sentence;
the second co-occurrence relation information is the average PMI of the related word with the other words in the search sentence.
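A minimal sketch of the PMI index from the counts defined above (the count2 values are passed in directly; the natural log base is an assumption, as the patent does not state one):

```python
import math

def pmi(count_ab, count_a_dot, count_dot_b, count_dot_dot):
    """PMI(A, B) = log( P(A, B) / (P(A) * P(B)) ), with each probability
    estimated from the co-occurrence counts count2."""
    p_ab = count_ab / count_dot_dot
    p_a = count_a_dot / count_dot_dot
    p_b = count_dot_b / count_dot_dot
    return math.log(p_ab / (p_a * p_b))
```

Averaging pmi(...) over the other words of the search sentence yields the first and second co-occurrence relation information.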
Preferably, the method further comprises training an N-gram language model based on the large-scale user search behavior data to obtain the language model.
Preferably, the results obtained by searching with the search word and the related words are ranked according to the corresponding confidences by a ranking model.
Preferably, the method further comprises a step in which the ranking model preliminarily ranks the retrieval resources according to the search sentence and the page information of the retrieval resources.
Preferably, the retrieval resource is a webpage resource and/or a document resource.
According to another aspect of the invention, a search system is also provided.
A search system, comprising:
a related vocabulary storage device;
a related word acquiring device for acquiring related words of the search word based on the related word library stored in the related word library storage device;
confidence calculation means for calculating a confidence between the search word and each of the related words based on a related word confidence calculation model;
and a ranking device, for ranking the results obtained by searching with the search word and the related words according to the corresponding confidences.
Preferably, the search system further includes a related word bank establishing device, configured to establish the related word bank, including:
the parallel sentence acquisition module is used for acquiring parallel sentence pairs expressing the same meaning by adopting different expression forms based on large-scale user search behavior data;
the word segmentation device is used for carrying out word segmentation on each group of parallel sentence pairs;
the word alignment module is used for carrying out word alignment on the parallel sentence pairs subjected to word segmentation processing to obtain first aligned word pairs;
a co-occurrence frequency acquisition module, configured to calculate a co-occurrence frequency of the first alignment word pair;
a related word determination module for determining the first aligned word pair having a co-occurrence frequency higher than a predetermined threshold as a related word.
Preferably, the related word bank establishing device further includes:
and the context acquisition module is used for acquiring the context words of the related words.
Preferably, the search system further includes a related word confidence calculation model establishing device, configured to establish the related word confidence calculation model, including:
a linear model filtering module for filtering the large-scale user search behavior data using a linear model to obtain a second pair of aligned words;
and the training module is used for training the positive sample and the negative sample based on a GBDT algorithm by taking the first aligned word pair as a positive sample and the second aligned word pair as a negative sample to obtain the related word confidence coefficient calculation model.
Preferably, the related word confidence calculation model is a GBDT nonlinear regression model.
Preferably, the word segmentation device is further configured to perform word segmentation processing on the search sentence to obtain a search word.
Preferably, the confidence calculating means includes:
the characteristic value extraction module is used for extracting a characteristic value between each search word and each corresponding related word;
and the confidence coefficient calculation module is used for taking the characteristic value as the input of the related word confidence coefficient calculation model and calculating the confidence coefficient based on the related word confidence coefficient calculation model.
Preferably, the feature value extraction module includes:
the system comprises a correlation degree information acquisition unit, a correlation degree information acquisition unit and a correlation degree information processing unit, wherein the correlation degree information acquisition unit is used for acquiring correlation degree information which is used for measuring the correlation degree between each search word and each corresponding correlation word; and/or
A substitutability information acquisition unit configured to acquire substitutability information that measures a degree of substitutability between the search term and the related term in a context of the related term; and/or
A co-occurrence relation information obtaining unit, configured to obtain co-occurrence relation information, where the co-occurrence relation information is used to measure co-occurrence relations among the search terms; and/or
A language model score information acquisition unit configured to acquire language model score information for displaying language model scores of search sentences before and after the related word replaces the search word; and/or
A weight value information acquiring unit configured to acquire weight value information indicating a weight of the related word.
Preferably, the feature value extraction module further comprises:
and the language model acquisition unit is used for training an N-gram language model based on the large-scale user search behavior data to acquire the language model.
Preferably, the ranking device ranks the results obtained by using the search term and the related term to perform the search according to the corresponding confidence through a ranking model.
Preferably, the ranking device is further configured to preliminarily rank the retrieval resources according to the search sentence and the page information of the retrieval resources through the ranking model.
In this way, with the method for mining related words, the search method, and the search system, related words corresponding to the search words can be found and used in the search together with the search words. This expands the search scope and the result set, and prevents semantically relevant results that do not literally match the search words from being missed.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in greater detail exemplary embodiments thereof with reference to the attached drawings, in which like reference numerals generally represent like parts throughout.
FIG. 1 illustrates a flow diagram of a method of mining related words in accordance with an embodiment of the present invention;
FIG. 2 illustrates a flow diagram of a method of mining related words, according to another embodiment of the invention;
FIG. 3 shows a flow diagram of a search method according to an embodiment of the invention;
FIG. 4 shows a flow diagram of a search method according to another embodiment of the invention;
FIG. 5 shows a flowchart of step S240 of the embodiment shown in FIG. 4;
FIG. 6 shows a schematic diagram of a search system according to an embodiment of the invention;
FIG. 7 shows a schematic diagram of a search system according to another embodiment of the invention;
fig. 8 is a schematic diagram of the related word bank establishing apparatus 310 according to the embodiment shown in fig. 7;
FIG. 9 is a diagram illustrating a related word confidence calculation model building device 350 according to the embodiment shown in FIG. 7;
FIG. 10 shows a schematic diagram of the confidence computation device 390 of the embodiment shown in FIG. 7;
fig. 11 shows a schematic diagram of the feature value extraction module 394 of the embodiment shown in fig. 10.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
A method for mining related words for obtaining related words from large-scale user search behavior data according to an embodiment of the present invention is described below with reference to fig. 1.
Fig. 1 is a flowchart illustrating a method of mining related words according to an embodiment of the present invention.
In step S110, parallel sentence pairs expressing the same meaning in different expression forms are acquired based on the large-scale user search behavior data.
Parallel sentence pairs are acquired from data such as users' search logs and/or search title logs, based on the large-scale user search behavior data. A parallel sentence pair is a pair of sentences that express the same meaning in different forms; for example, "the baby has a mole on its neck" and "a mole has grown on the infant's neck" express the same meaning in different wording.
In the large-scale user search behavior data, such as users' search logs and/or search title logs, there are many sentence pairs that express the same meaning in different wording. Candidate pairs whose meanings actually differ can then be filtered out according to the literal similarity of the two sentences.
In step S120, word segmentation processing is performed for each set of parallel sentence pairs.
And segmenting each sentence in each group of parallel sentence pairs by a word segmentation technology.
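The patent does not fix a particular segmentation algorithm. One classic baseline for Chinese word segmentation is forward maximum matching against a dictionary; the sketch below uses Latin strings and a toy vocabulary purely for illustration:

```python
def forward_max_match(sentence: str, vocab: set, max_len: int = 4) -> list:
    """Greedy forward maximum matching: at each position take the longest
    dictionary word, falling back to a single character."""
    tokens, i = [], 0
    while i < len(sentence):
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + length]
            if length == 1 or piece in vocab:
                tokens.append(piece)
                i += length
                break
    return tokens
```

A production system would more likely use a statistical or neural segmenter, but the interface (sentence in, token list out) is the same.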
In step S130, word alignment processing is performed on the above-mentioned word segmentation processed parallel sentence pair to obtain a first aligned word pair.
Through the word alignment process, words expressing the same meaning can be found.
The word alignment processing may include rule-based word alignment and/or statistical word alignment. Rule-based alignment includes at least one of exact literal match alignment, partial literal match alignment, or adjacent-word alignment. The statistical word alignment may be performed using the GIZA++ tool.
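A hedged sketch of the rule-based alignment: exact literal matches align directly, and partial literal matches align when their character overlap reaches a cutoff (the 0.5 cutoff is a hypothetical value, and the adjacency rule is omitted):

```python
def rule_align(tokens1, tokens2, min_overlap=0.5):
    """Rule-based word alignment: exact matches first, then partial
    literal matches whose character-overlap ratio reaches min_overlap."""
    pairs, used = [], set()
    for t1 in tokens1:
        for j, t2 in enumerate(tokens2):
            if j in used:
                continue
            common = len(set(t1) & set(t2))
            ratio = common / max(len(set(t1)), len(set(t2)))
            if t1 == t2 or ratio >= min_overlap:
                pairs.append((t1, t2))
                used.add(j)
                break
    return pairs
```

Statistical alignment with GIZA++ would complement these rules by aligning word pairs with no literal overlap at all.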
In step S140, the co-occurrence frequency of the first aligned word pair is calculated.
The co-occurrence frequency may be evaluated by the first translation probability P1 and/or the second translation probability P2, calculated as follows:
P1 = count1(A, A′) / count1(A, ·), P2 = count1(A, A′) / count1(·, A′);
count1(A, ·) = Σ_j count1(A, w_j), count1(·, A′) = Σ_i count1(w_i, A′);
where the search word A and the related word A′ form a first word pair (A, A′); count1(A, A′) is the number of times the first word pair (A, A′) is aligned in the parallel sentence pairs; count1(A, ·) is the total number of times the word A is aligned; count1(·, A′) is the total number of times the related word A′ is aligned; w_j is the j-th of all words aligned with the search word A, and w_i is the i-th of all words aligned with the related word A′; count1(A, w_j) is the number of times A and w_j are aligned, and count1(w_i, A′) is the number of times w_i and A′ are aligned; i and j are natural numbers.
Note that the value of count1(A, A′) does not depend on the order of A and A′, i.e. count1(A, A′) and count1(A′, A) are the same.
P1 is the proportion of the number of times the search word A is aligned with the related word A′ among the total number of times A is aligned; P2 is the proportion of the number of times A is aligned with A′ among the total number of times A′ is aligned.
The alignment count is the number of times two words are aligned across the different parallel sentence pairs; the co-occurrence count is the number of times two words appear together in the same corpus.
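The two translation probabilities can be computed directly from a table of alignment counts; here count1 is represented as a plain dictionary keyed by word pairs, which is an assumed data layout, not the patent's:

```python
def translation_probs(align_counts, a, a_prime):
    """P1 = count1(A, A') / count1(A, .); P2 = count1(A, A') / count1(., A')."""
    c_pair = align_counts[(a, a_prime)]
    c_a = sum(c for (x, _), c in align_counts.items() if x == a)         # count1(A, .)
    c_ap = sum(c for (_, y), c in align_counts.items() if y == a_prime)  # count1(., A')
    return c_pair / c_a, c_pair / c_ap
```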
In step S150, a first aligned word pair having a co-occurrence frequency higher than a predetermined threshold is determined as a related word.
The predetermined threshold may be set differently according to the required degree of correlation between related words. In one embodiment, the predetermined threshold may be 1.0e-99.
Therefore, by the method for mining the related words, the related words with higher relevance can be mined, the search range of the search words can be further expanded, and the probability of finding better search results is improved. And, the related words with different similarity can be obtained according to different preset threshold values.
A method of mining related words for obtaining related words from large-scale user search behavior data according to another embodiment of the present invention is described below with reference to fig. 2.
Referring to fig. 2, the method for mining related words further includes the following steps:
in step S160, the context word of the related word is recorded.
By recording the context word of the related word, the context of the related word can be known. By judging whether the context of the two related words is the same or similar, the correlation between the related words can be further judged, and the related words with higher similarity can be obtained.
The acquisition of the context words of the related words can be limited in length to different degrees according to the length of the parallel sentences. In this embodiment, parallel sentence pairs are usually not very long, so no length limit or other restriction needs to be imposed. In other embodiments, the acquisition of context words may be restricted differently according to the required relevance of the related words or other criteria.
In step S170, the large-scale user search behavior data is filtered using a linear model to obtain a second alignment word pair.
The linear model may be a simple linear model, for example one fitted by simple linear regression on the statistical features between the word pairs, using a small number of manually labeled word pairs (on the order of ten thousand). Here, fitting refers to linear regression modeling.
Because the manually labeled word pairs are few and the model is simple, the confidence scores output by the model are not highly reliable. The large-scale user search behavior data is filtered through this linear model, and results whose confidence score is below a specific threshold are taken as the second aligned word pairs; since word pairs filtered out by the model have low confidence, the second aligned word pairs serve as poor word pairs. Specifically, the specific threshold is close to or less than zero.
A "manually labeled" word pair means: under a given query sentence (query), an original word in the query and a candidate related word form a word pair, and the pair is labeled to determine whether it is suitable to serve as related words. For example, under the query "What should an eight-month-old baby eat?", the pair "baby" -> "infant" may be labeled 1, meaning the candidate can serve as a related word; under the same query, a candidate pair that does not fit the context is labeled 0, meaning it cannot serve as a related word.
Poor word pairs are wrong word pairs that should not appear in the current query context, or word pairs that violate the user's intent. For example, when the user searches "baby eats milk", rewriting it as "baby drinks milk" uses a good word pair ("eat" -> "drink", labeled 1); but applying the same pair to turn "what fruit is good to eat" into "what fruit is good to drink" changes the meaning, so in that context it is a poor word pair. Poor word pairs can take many other forms and are not limited to this example.
In step S180, a statistical feature that can reflect the degree of correlation between related words is obtained.
The statistical features are context-word statistical verification features of whether a word pair is suitable in the current query context, and include at least one of the correlation degree information, replaceability information, co-occurrence relation information, language model score information, and weight value information between each pair of related words.
In step S190, the first aligned word pairs are used as positive samples and the second aligned word pairs as negative samples, and, based on the statistical characteristics, the positive and negative samples are trained using a gradient boosting decision tree (GBDT) algorithm to obtain the confidence calculation model of the related words.
The confidence calculation model of the related words may be a GBDT nonlinear regression model.
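As an illustration only, the following minimal Python sketch mimics the shape of this training step with a hand-rolled gradient-boosting loop over depth-1 regression stumps; the two features (a translation probability and an average PMI) and the four labeled pairs are invented, and a real system would use a full GBDT implementation.

```python
# Minimal gradient-boosting sketch (squared loss, depth-1 regression stumps).
# Features and labels below are invented toy data, not from the patent.

def fit_stump(X, residuals, n_features):
    # Exhaustively pick the (feature, threshold) split minimizing squared error.
    best = None
    for f in range(n_features):
        for t in sorted({x[f] for x in X}):
            left = [r for x, r in zip(X, residuals) if x[f] <= t]
            right = [r for x, r in zip(X, residuals) if x[f] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, f, t, lm, rm)
    return best[1:]

def gbdt_fit(X, y, n_rounds=20, lr=0.5):
    base = sum(y) / len(y)
    preds = [base] * len(X)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - p for yi, p in zip(y, preds)]  # gradient of squared loss
        f, t, lm, rm = fit_stump(X, residuals, len(X[0]))
        stumps.append((f, t, lm, rm))
        preds = [p + lr * (lm if x[f] <= t else rm) for x, p in zip(X, preds)]
    return base, lr, stumps

def gbdt_predict(model, x):
    base, lr, stumps = model
    return base + sum(lr * (lm if x[f] <= t else rm) for f, t, lm, rm in stumps)

# toy samples: [translation_probability, average_PMI]; 1 = positive pair, 0 = negative pair
X = [[0.9, 0.8], [0.8, 0.7], [0.1, 0.2], [0.2, 0.1]]
y = [1, 1, 0, 0]
model = gbdt_fit(X, y)
```

The learned model can then score any new word pair's feature vector as a confidence value.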
A search method according to an embodiment of the present invention is described below with reference to fig. 3.
Fig. 3 shows a flow diagram of a search method according to an embodiment of the invention.
A search method comprising the steps of:
in step S220, related words of the search term are obtained based on the related word library.
The related word library is established by the above method for mining related words. In this way, all related words of a search term may be obtained, including not only synonyms of the term (which may include strong synonyms and contextual synonyms) but also related words of broader coverage. Because the mining method finds related words of higher relevance, the search range is expanded and the probability of finding a better search result is improved.
In step S240, a confidence level between the search term and each related term is calculated based on a confidence level calculation model.
In step S260, the results obtained by searching using the search term and the related terms thereof are ranked according to the corresponding confidence.
In this step, the results obtained by searching with the search term and its related terms are ranked according to the corresponding confidence levels through a ranking model. The ranking model may be a quick-sort-based model that orders results according to an existing quick sort algorithm; it is understood that other existing ranking models may also be used.
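A minimal sketch of this re-ranking step, assuming each result carries the confidence of the (search word, related word) pair that recalled it; the result tuples and confidence values are invented:

```python
# Each result: (document id, recalling term, confidence of that term).
# Results recalled by the original search term get confidence 1.0;
# results recalled by related words carry the model's confidence.
results = [
    ("doc_about_infants", "infant", 0.92),  # recalled by a related word
    ("doc_about_babies", "baby", 1.0),      # recalled by the original term
    ("doc_fruit_drink", "drink", 0.31),     # low-confidence related word
]

# Rank by confidence, highest first.
ranked = sorted(results, key=lambda r: r[2], reverse=True)
```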
Searching with related words covers not only high-frequency synonyms but also medium- and low-frequency related words; in particular, when retrieval resources are scarce, searching with related words acquires retrieval information to the maximum extent.
Therefore, with this search method, the corresponding related words can be found for a search term, and the search is performed with both the search term and its related words, so that the search range and thus the search results are expanded; this prevents results that do not literally match the search term but are semantically very similar to it from failing to be recalled.
In another embodiment, before the step S260, a step of the ranking model initially ranking the retrieval resources according to the retrieval statement and the retrieval resource page information may be further included.
The preliminary ranking step is a general search process and may be limited by setting a retrieval threshold, so that only search results reaching a predetermined score are re-ranked in step S260. Thus, when there are many initial search results, the amount of re-ranking is reduced. This two-stage ranking can also be used when the user requires that only highly accurate search results be displayed.
The retrieval resource can be a web page resource and/or a document resource. The retrieval resource can be a piece of text information, a title of a webpage, a sentence of a query, or a document with a longer length.
A search method according to another embodiment of the present invention is described below with reference to fig. 4.
Fig. 4 shows a flowchart of a search method according to another embodiment of the present invention.
The searching method may further include step S210 before step S220. In step S210, a word segmentation process is performed on the search sentence to obtain the search word.
When a user inputs a search sentence, the sentence is segmented to obtain a plurality of search terms, so that search results related to each search term can be retrieved by the search method, further expanding the search range. The word segmentation may include Chinese word segmentation and/or English word segmentation, as well as word segmentation in other languages, and may be implemented with any existing word segmentation technique.
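For illustration, a toy forward-maximum-matching segmenter over a hypothetical dictionary shows one classic way Chinese word segmentation can work; production systems use full segmenters (e.g. jieba), and this tiny dictionary is invented:

```python
# Forward maximum matching (FMM): at each position, greedily take the longest
# dictionary word. The dictionary below is invented for illustration.
DICT = {"宝宝", "八个月", "吃", "什么", "婴儿"}
MAX_LEN = max(len(w) for w in DICT)

def fmm_segment(sentence):
    words, i = [], 0
    while i < len(sentence):
        for l in range(min(MAX_LEN, len(sentence) - i), 0, -1):
            cand = sentence[i:i + l]
            # fall back to a single character when no dictionary word matches
            if l == 1 or cand in DICT:
                words.append(cand)
                i += l
                break
    return words
```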
Referring now to fig. 5, a flowchart of step S240 of the embodiment shown in fig. 4 is shown.
In step S242, a feature value between each search term and each corresponding related term is acquired.
Each time the search content is different, the corresponding search term is also different, and therefore the characteristic value is also different.
In step S244, the feature value is used as an input to the confidence calculation model, and the confidence is calculated based on the confidence calculation model.
The feature value may include at least one of correlation degree information, degree of replaceability information, co-occurrence relationship information, language model score information, and weight value information.
The related degree information is used for measuring the related degree between each search term and each corresponding related term.
The correlation degree information may include a first translation probability P_1 and/or a second translation probability P_2, which are respectively expressed by the following formulas:

P_1 = count_1(A, A′) / count_1(A, ·);

P_2 = count_1(A, A′) / count_1(·, A′);

count_1(A, ·) = Σ_j count_1(A, w_j); count_1(·, A′) = Σ_i count_1(w_i, A′);

wherein the search word A and the related word A′ form a first word pair (A, A′); count_1(A, A′) represents the number of times the first word pair (A, A′) is aligned in the parallel sentence pairs; count_1(A, ·) represents the total number of times the search word A is aligned in the parallel sentence pairs; count_1(·, A′) represents the total number of times the related word A′ is aligned in the parallel sentence pairs; w_j represents the j-th of all words aligned with the search word A in the parallel sentence pairs; w_i represents the i-th of all words aligned with the related word A′ in the parallel sentence pairs; count_1(A, w_j) represents the number of times the search word A and the word w_j are aligned in the parallel sentence pairs; count_1(w_i, A′) represents the number of times the word w_i and the related word A′ are aligned in the parallel sentence pairs; i and j are natural numbers.
As can be appreciated, the value of count_1(A, A′) is independent of the order of A and A′, i.e., count_1(A, A′) and count_1(A′, A) are the same.
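A small sketch of computing the first and second translation probabilities from alignment counts, with invented toy counts (count_1 keyed by (original word, related word) pairs):

```python
from collections import Counter

# Toy alignment counts over parallel sentence pairs; values are invented.
count1 = Counter({("baby", "infant"): 8, ("baby", "child"): 2, ("small", "infant"): 2})

def p1(a, a_prime):
    # P_1 = count_1(A, A') / count_1(A, ·)
    total_a = sum(c for (x, _), c in count1.items() if x == a)
    return count1[(a, a_prime)] / total_a

def p2(a, a_prime):
    # P_2 = count_1(A, A') / count_1(·, A')
    total_ap = sum(c for (_, y), c in count1.items() if y == a_prime)
    return count1[(a, a_prime)] / total_ap
```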
Wherein the degree of replaceability information is used to measure the degree of replaceability between the search term and the related term in the context of the related term.
The replaceability degree information includes a first replaceability degree score Score(D, Q) and/or a second replaceability degree score Score(D, Q′), expressed by the following formulas:

Score(D, Q) = Σ_{i=1..n} f(q_i, D)·(k_1 + 1) / [ f(q_i, D) + k_1·(1 − b + b·|D|/avgdl) ];

Score(D, Q′) = Σ_{j=1..m} f(q′_j, D)·(k_1 + 1) / [ f(q′_j, D) + k_1·(1 − b + b·|D|/avgdl) ];

wherein the search word A and the related word A′ form a first word pair (A, A′);

the context words of the search word A and the context words of the related word A′ are taken together as a document D, and |D| is the length of D; the context words of A and of A′ in the several sentence pairs may each be different, and the context is recorded as a whole;

Q is a search sentence, q_i is the i-th search word of the search sentence Q, and n is the total number of search words in Q;

Q′ is a combination of m words near the search word A, with m < n, and q′_j is the j-th search word of the word combination Q′;

avgdl is the average length of the documents formed by the contexts of all related words of the search word A;

k_1 is a first constant and b is a second constant;

f(q_i, D) represents the frequency of occurrence of q_i in the document D;

f(q′_j, D) represents the frequency of occurrence of q′_j in the document D.
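The parameters above (f, |D|, avgdl, k_1, b) follow the BM25 weighting scheme; a minimal sketch without the IDF factor, on an invented context document, could look like:

```python
def bm25_score(query_terms, doc_terms, avgdl, k1=1.2, b=0.75):
    # BM25-style score of query terms against a context document D.
    # The IDF factor is omitted, since the text defines only f, |D|, avgdl, k1, b;
    # k1 and b defaults are conventional values, not taken from the patent.
    dl = len(doc_terms)  # |D|
    score = 0.0
    for q in query_terms:
        f = doc_terms.count(q)  # f(q, D)
        score += f * (k1 + 1) / (f + k1 * (1 - b + b * dl / avgdl))
    return score

# Invented context document of a related word.
context_doc = ["milk", "drink", "baby", "milk"]
score_q = bm25_score(["milk", "baby"], context_doc, avgdl=4.0)
```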
The co-occurrence relationship information is used for measuring the co-occurrence relationship between the search terms, and refers to statistical data of two search terms appearing in a query corpus (search resources, web pages and/or documents) at the same time.
The co-occurrence relation information comprises first co-occurrence relation information and/or second co-occurrence relation information obtained based on the co-occurrence relation index PMI:

PMI(A, B) = log[ count_2(A, B)·count_2(·, ·) / ( count_2(A, ·)·count_2(·, B) ) ];

count_2(A, ·) = Σ_j count_2(A, w_j);

count_2(·, B) = Σ_i count_2(w_i, B);

count_2(·, ·) = Σ_{i,j} count_2(w_i, w_j);

wherein count_2(A, ·) represents the total number of times the search word A appears simultaneously with other search words in the search resources; count_2(·, B) represents the total number of times the search word B appears simultaneously with other search words in the search resources; count_2(A, B) represents the number of times the two search words A and B appear simultaneously in the search resources; w_j represents the j-th of all words appearing simultaneously with the search word A in the search resources; w_i represents the i-th of all words appearing simultaneously with the related word B in the search resources; count_2(A, w_j) represents the number of times the two search words A and w_j appear simultaneously in the search resources; count_2(w_i, B) represents the number of times the two search words w_i and B appear simultaneously in the search resources; count_2(w_i, w_j) represents the number of times the two search words w_i and w_j appear simultaneously in the search resources; i and j are natural numbers.
It can be understood that the value of count_2(A, B) is independent of the order of A and B, i.e., count_2(A, B) and count_2(B, A) are the same.
The first co-occurrence relation information is an average value of co-occurrence relation indexes PMI of the search word and other words in the search sentence.
The second co-occurrence relationship information is an average value of the co-occurrence relationship index PMI of the related word and the other search terms (the other search terms excluding the search term corresponding to the related word) in the search sentence.
When the first co-occurrence relation information is calculated, the above formula can be used directly and the average value taken; when the second co-occurrence relation information is calculated, the search word A in the formula is replaced with the related word A′.
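An illustrative sketch of the PMI computation and its average over the other query words, with invented co-occurrence counts stored under unordered keys to reflect the order-independence of count_2(A, B):

```python
import math

# Toy co-occurrence counts, keyed by unordered word pairs (invented data).
count2 = {
    frozenset(("baby", "milk")): 4,
    frozenset(("baby", "eat")): 2,
    frozenset(("milk", "eat")): 2,
}

def pmi(a, b):
    total = sum(count2.values())                              # count_2(·, ·)
    c_ab = count2.get(frozenset((a, b)), 0)                   # count_2(A, B)
    c_a = sum(c for pair, c in count2.items() if a in pair)   # count_2(A, ·)
    c_b = sum(c for pair, c in count2.items() if b in pair)   # count_2(·, B)
    return math.log((c_ab * total) / (c_a * c_b))

def avg_pmi(word, other_words):
    # First co-occurrence information: mean PMI of the word with the
    # other words in the search sentence.
    return sum(pmi(word, o) for o in other_words) / len(other_words)
```

For the second co-occurrence information, the same `avg_pmi` is called with the related word in place of the original search word.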
And the language model score information is used for displaying the language model scores of the retrieval sentences before and after the related words replace the retrieval words. The method further comprises the step of training an N-gram language model based on large-scale user search behavior data to obtain the language model.
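A toy bigram (2-gram) model with add-one smoothing illustrates scoring a query before and after a replacement; the miniature corpus below stands in for large-scale user search behavior data and is invented:

```python
import math
from collections import Counter

# Miniature stand-in corpus (a real system trains on large-scale search logs).
corpus = [
    "baby drinks milk", "infant drinks milk", "baby eats fruit",
    "infant eats fruit", "baby drinks water",
]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    toks = ["<s>"] + sent.split()
    unigrams.update(toks)
    for u, v in zip(toks, toks[1:]):
        bigrams[(u, v)] += 1
V = len(unigrams)  # vocabulary size for add-one smoothing

def log_prob(sentence):
    # Add-one (Laplace) smoothed bigram log-probability of the sentence.
    toks = ["<s>"] + sentence.split()
    return sum(
        math.log((bigrams[(u, v)] + 1) / (unigrams[u] + V))
        for u, v in zip(toks, toks[1:])
    )
```

Comparing `log_prob` of a query before and after a related word replaces the search word yields the two language model scores.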
The weight value information is used for representing the weight of the related words.
In step S180, the statistical characteristics between each pair of related words are also calculated using the above statistical characteristic calculation method.
A search system according to an embodiment of the present invention is described below with reference to fig. 6.
FIG. 6 shows a schematic diagram of a search system according to an embodiment of the invention.
A search system 300 comprises a related word stock device 320, a related word acquisition device 340, a search device 360, a sorting device 380 and a confidence calculation device 390.
The related word obtaining means 340 is connected to the related word stock device 320 and obtains the related words of the search word based on the related word library stored therein. The search means 360 performs a search based on the search term and its related terms. The confidence calculation means 390 calculates the confidence between the search word and each of its corresponding related words based on a confidence calculation model. The ranking means 380 ranks the results retrieved by the search means 360 according to the corresponding confidences calculated by the confidence calculation means 390.
Thus, through the search system 300, the corresponding related words can be found for the search terms, and the search is performed with both the search terms and their related words, so that the search range and the search results are expanded and the probability of retrieving the target document is improved. This prevents good search results that do not literally match the search term but are semantically very similar to it from failing to be recalled.
A search system according to another embodiment of the present invention is described below with reference to fig. 7.
FIG. 7 shows a schematic diagram of a search system according to another embodiment of the invention.
The searching system 300 may further include a related word bank building means 310 and a related word confidence calculation model building means 350.
The related word stock establishing device 310 is connected to the related word stock device 320 for establishing the related word stock by the method of mining related words.
Fig. 8 is a diagram of the related word bank building apparatus 310 for building a related word bank according to the embodiment shown in fig. 7.
The related vocabulary base establishing means 310 may include: a parallel sentence acquisition module 311, a word segmenter 313, a word alignment module 315, a co-occurrence frequency acquisition module 317, a related word determination module 319, and a context acquisition module 318.
The parallel sentence acquisition module 311 acquires parallel sentence pairs that express the same meaning in different forms based on large-scale user search behavior data. The word segmenter 313 performs word segmentation on each group of parallel sentence pairs. The word alignment module 315 performs word alignment on the segmented parallel sentence pairs to obtain first aligned word pairs. The co-occurrence frequency acquisition module 317 calculates the co-occurrence frequency of the first aligned word pairs. The related word determination module 319 determines first aligned word pairs whose co-occurrence frequency is higher than a predetermined threshold as related words, forming the related word library.
Thus, through the related word bank establishing device 310, related words with higher relevance can be mined, the search range of search words can be expanded, the probability of finding better search results can be improved, and related words with different similarities can be acquired according to different preset thresholds.
By establishing a related word bank, all related words of the search word can be obtained, and the related words not only comprise synonyms of the search word (which can comprise strong synonyms and context synonyms), but also comprise related words with wider coverage degree. By the method for mining the related words, the related words with higher relevance can be mined, the search range of the search words can be expanded, and the probability of finding better search results is improved.
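The mining steps above can be sketched end to end; here the word aligner is reduced to a trivial positional rule for equal-length pairs (a real system would use, e.g., GIZA++), and the query data is invented:

```python
from collections import Counter

# Invented parallel query pairs expressing the same meaning in different forms,
# already segmented into words.
parallel_pairs = [
    (["baby", "drinks", "milk"], ["infant", "drinks", "milk"]),
    (["baby", "eats", "fruit"], ["infant", "eats", "fruit"]),
    (["baby", "sleeps"], ["infant", "sleeps"]),
]

freq = Counter()
for left, right in parallel_pairs:
    if len(left) != len(right):
        continue  # a real aligner (e.g. GIZA++) handles unequal lengths
    for w, w2 in zip(left, right):
        if w != w2:
            freq[(w, w2)] += 1  # positionally aligned, literally different

THRESHOLD = 2  # illustrative; the patent leaves the threshold configurable
related = {pair for pair, c in freq.items() if c > THRESHOLD}
```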
In addition, the word segmentation unit 313 is further configured to perform word segmentation processing on the search sentence to obtain a search word. When a user inputs a search sentence, the search sentence is segmented to obtain a plurality of search terms, so that search results related to the search terms are searched by the search method, and the search range is further expanded.
Further, the related word bank establishing device 310 further includes a context obtaining module 318 for obtaining context words of the related words.
By recording the context word of the related word, the context of the related word can be known. By judging whether the context of the two related words is the same or similar, the correlation degree between the related words can be further judged, and the related words with higher similarity degree can be obtained.
The acquisition of the context words of a related word may be limited in length to different degrees according to the length of the parallel sentences. In this embodiment, the parallel sentence pairs are not overly long, so no length or other limitation need be made. In other embodiments, the acquisition of context words may be limited differently according to the required relevance of the related words or other criteria.
Referring to fig. 9, a schematic diagram of the related word confidence calculation model building apparatus 350 according to the embodiment shown in fig. 7 is shown.
The related word confidence calculation model building apparatus 350 may include a linear model filtering module 352 and a training module 354.
The linear model filtering module 352 is configured to filter the large-scale user search behavior data using a linear model to obtain second aligned word pairs.
The linear model may be a simple linear model, fitted with a simple linear regression using statistical features between a small number (on the order of ten thousand) of manually labeled word pairs. Because the labeled word pairs are few and the model is simple, the confidence output by the model is of limited precision. The large-scale user search behavior data is filtered through the linear model to obtain the second aligned word pairs, which are poor word pairs: wrong word pairs that should not appear in the context of the current query words, or word pairs that violate the user's intention. For example, when a user searches for "baby eats milk", obtaining "baby drinks milk" is a good word pair; but rewriting "what fruit is good to eat" into "what fruit is good to drink" changes the meaning, so it is a poor word pair.
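A minimal sketch of this filtering step: a linear model fitted by least-squares SGD on a few invented labeled feature vectors, with pairs scoring below a threshold kept as negative samples:

```python
# Simple linear regression by per-sample gradient descent (squared loss).
# Features (e.g. translation probability, average PMI), labels, and the
# filtering threshold below are all invented for illustration.

def fit_linear(X, y, lr=0.1, epochs=500):
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            pred = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = pred - t
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def predict(model, x):
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, x)) + b

X = [[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.0]]
y = [1, 1, 0, 0]
model = fit_linear(X, y)

# Word pairs whose confidence falls below the threshold become negative samples.
threshold = 0.3
negatives = [x for x in X if predict(model, x) < threshold]
```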
The training module 354 is connected to the related word bank establishing device 310 and the linear model filtering module 352, respectively, and trains the positive sample and the negative sample based on the GBDT algorithm to obtain a related word confidence coefficient calculation model by using the first aligned word pair as a positive sample and the second aligned word pair as a negative sample.
The related word confidence calculation model may be a GBDT nonlinear regression model.
Referring to fig. 10, the confidence calculation device 390 of the embodiment shown in fig. 7 may include a confidence calculation module 392 and a feature value extraction module 394.
The feature value extraction module 394 extracts a feature value between each search term and each related term corresponding thereto, and the confidence calculation module 392 calculates the confidence by using the feature value as an input of a confidence calculation model based on the confidence calculation model.
Referring to fig. 11, a diagram of the feature value extraction module 394 of the embodiment shown in fig. 10 is shown.
The feature value extraction module 394 may further include at least one of a correlation degree information acquisition unit 3941, a replaceable degree information acquisition unit 3942, a co-occurrence relationship information acquisition unit 3943, a language model score information acquisition unit 3944, a weight value information acquisition unit 3945, and a language model acquisition unit 3946.
A correlation degree information obtaining unit 3941, configured to obtain correlation degree information. The degree of relevance information is used to measure the degree of relevance between each search term and each corresponding relevant term.
A substitutability information acquisition unit 3942 for acquiring the substitutability information. The degree of replaceability information is used to measure the degree of replaceability between a search term and a related term in the context of the related term.
A co-occurrence relation information obtaining unit 3943, configured to obtain co-occurrence relation information. And the co-occurrence relation information is used for measuring the co-occurrence relation among the search terms.
A language model score information obtaining unit 3944, configured to obtain language model score information. The language model score information is used for displaying the language model scores of the search sentences before and after the related word replaces the search word.
The weight value information acquiring unit 3945 is configured to acquire weight value information. Wherein, the weight value information is used for representing the weight of the related words.
Further, the feature value extraction module 394 may further include a language model obtaining unit 3946. The language model obtaining unit 3946 is configured to train the N-gram language model based on the large-scale user search behavior data to obtain the language model.
The sorting device 380 sorts the results obtained by searching the search terms and the corresponding related terms according to the corresponding confidence information through the sorting model. The sorting model may be a fast sorting model for sorting according to an existing fast sorting algorithm.
Further, the ranking device 380 may also perform a preliminary ranking of the retrieval resources according to the search sentence and the retrieval resource page information through the ranking model. The preliminary ranking is a general search process and may be limited by setting a retrieval threshold, so that only search results reaching a predetermined score enter the re-ranking stage. When there are many initial search results, this reduces the workload of re-ranking. The two-stage ranking may also be used when the user requires that only highly accurate search results be displayed.
Searching with related words covers not only high-frequency synonyms but also medium- and low-frequency search words; in particular, when retrieval resources are scarce, searching with related words acquires retrieval information to the maximum extent. Therefore, with this search system, the corresponding related words can be found for the search terms, and the search is performed with both the search terms and their related words, expanding the search range and the search results; this prevents results that do not literally match the search term but are semantically very similar to it from failing to be recalled.
The method of mining related words, the search method, and the search system according to the present invention have been described above in detail with reference to the accompanying drawings.
Furthermore, the method according to the invention may also be implemented as a computer program product comprising a computer readable medium having stored thereon a computer program for performing the above-mentioned functions defined in the method of the invention. Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
While embodiments of the present invention have been described above, the above description is illustrative, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (24)

1. A search method comprising the steps of:
acquiring related words of the search words based on a related word library;
calculating a confidence between the search word and each of the related words based on a confidence calculation model;
ranking results obtained from the search using the search term and the related term according to the corresponding confidence levels,
wherein the method further comprises:
performing word segmentation processing on a search sentence to obtain the search word,
wherein the step of calculating the confidence between the search word and each of the related words based on a confidence calculation model comprises:
obtaining a characteristic value between each search word and each corresponding related word;
taking the feature value as an input to the confidence computation model, computing the confidence based on the confidence computation model,
wherein the characteristic values include:
-a degree of replaceability information for measuring a degree of replaceability between the search term and the related term in a context of the related term; and
and the co-occurrence relation information is used for measuring the co-occurrence relation among the search terms.
2. The method according to claim 1, wherein the replaceability degree information includes a first replaceability degree score Score(D, Q) and/or a second replaceability degree score Score(D, Q′):

Score(D, Q) = Σ_{i=1..n} f(q_i, D)·(k_1 + 1) / [ f(q_i, D) + k_1·(1 − b + b·|D|/avgdl) ];

Score(D, Q′) = Σ_{j=1..m} f(q′_j, D)·(k_1 + 1) / [ f(q′_j, D) + k_1·(1 − b + b·|D|/avgdl) ];

wherein the search word A and the related word A′ form a first word pair (A, A′);

all the context words of the search word A and of the related word A′ are taken as a document D, and |D| is the length of D;

Q is a search sentence, q_i is the i-th search word of the search sentence Q, and n is the total number of search words in the search sentence Q;

Q′ is a combination of m words near the search word A, m < n, and q′_j is the j-th search word of the word combination Q′;

avgdl is the average length of the documents formed by the contexts of all related words of the search word A, k_1 is a first constant, and b is a second constant;

f(q_i, D) represents the frequency of occurrence of q_i in the document D; and

f(q′_j, D) represents the frequency of occurrence of q′_j in the document D.
3. The method according to claim 1, wherein the co-occurrence relation information comprises first co-occurrence relation information and/or second co-occurrence relation information derived based on a co-occurrence relation index PMI, wherein

PMI(A, B) = log[ count_2(A, B)·count_2(·, ·) / ( count_2(A, ·)·count_2(·, B) ) ];

count_2(A, ·) = Σ_j count_2(A, w_j);

count_2(·, B) = Σ_i count_2(w_i, B);

count_2(·, ·) = Σ_{i,j} count_2(w_i, w_j);

count_2(A, ·) represents the total number of times the search word A appears simultaneously with other search words in the search resources; count_2(·, B) represents the total number of times the search word B appears simultaneously with other search words in the search resources; count_2(A, B) represents the number of times the two search words A and B appear simultaneously in the search resources; w_j represents the j-th of all words appearing simultaneously with the search word A in the search resources; w_i represents the i-th of all words appearing simultaneously with the related word B in the search resources; count_2(A, w_j) represents the number of times the two search words A and w_j appear simultaneously in the search resources; count_2(w_i, B) represents the number of times the two search words w_i and B appear simultaneously in the search resources; count_2(w_i, w_j) represents the number of times the two search words w_i and w_j appear simultaneously in the search resources; i and j are natural numbers;

the first co-occurrence relation information is the average value of the co-occurrence relation index PMI of the search word and the other words in the search sentence;

the second co-occurrence relation information is the average value of the co-occurrence relation index PMI of the related word and the other words in the search sentence.
4. The method of claim 1, wherein the feature values further comprise:
the correlation degree information is used for measuring the correlation degree between each search word and each corresponding correlation word; and/or
Language model score information for displaying language model scores of the search sentences before and after the related word replaces the search word; and/or
And the weight value information is used for representing the weight of the related words.
5. The method of claim 4, wherein the correlation degree information comprises a first translation probability P_1 and/or a second translation probability P_2:

P_1 = count_1(A, A′) / count_1(A, ·);

P_2 = count_1(A, A′) / count_1(·, A′);

count_1(A, ·) = Σ_j count_1(A, w_j); count_1(·, A′) = Σ_i count_1(w_i, A′);

wherein the search word A and the related word A′ form a first word pair (A, A′); count_1(A, A′) represents the number of times the first word pair (A, A′) is aligned in the parallel sentence pairs; count_1(A, ·) represents the total number of times the search word A is aligned in the parallel sentence pairs; count_1(·, A′) represents the total number of times the related word A′ is aligned in the parallel sentence pairs; w_j represents the j-th of all words aligned with the search word A in the parallel sentence pairs; w_i represents the i-th of all words aligned with the related word A′ in the parallel sentence pairs; count_1(A, w_j) represents the number of times the search word A and the word w_j are aligned in the parallel sentence pairs; count_1(w_i, A′) represents the number of times the word w_i and the related word A′ are aligned in the parallel sentence pairs; and i and j are natural numbers.
6. The method of claim 4, further comprising training an N-gram language model to obtain the language model based on large-scale user search behavior data.
7. The method of claim 1, wherein the step of ranking the results of the search using the search term and the related term according to the corresponding confidence levels is ranking the results of the search using the search term and the related term according to the corresponding confidence levels by a ranking model.
8. The method of claim 7, further comprising the step of the ranking model initially ranking the search resources according to the search statement and search resource page information.
9. The method of claim 8, wherein,
the retrieval resources are webpage resources and/or document resources.
10. The method of claim 1, wherein the related word library is created by a method of mining related words, the method of mining related words comprising:
acquiring parallel sentence pairs expressing the same meaning by adopting different expression forms based on large-scale user search behavior data;
performing word segmentation processing on each group of parallel sentence pairs;
performing word alignment processing on the parallel sentence pairs after word segmentation processing to obtain a first aligned word pair;
calculating a co-occurrence frequency of the first aligned word pair;
determining the first aligned word pair having a co-occurrence frequency above a predetermined threshold as a related word.
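The mining steps of claim 10 can be sketched end to end. Whitespace splitting stands in for real word segmentation, positional alignment stands in for real word alignment, and an absolute-count threshold stands in for the co-occurrence threshold; all three are simplifying assumptions:

```python
from collections import Counter

def positional_align(words1, words2):
    # Toy stand-in for real word alignment: align words by position.
    return list(zip(words1, words2))

def mine_related_words(parallel_pairs, align=positional_align, threshold=1):
    """Segment each parallel sentence pair, align the words, count the
    co-occurrences of aligned pairs, and keep pairs above the threshold."""
    pair_freq = Counter()
    for s1, s2 in parallel_pairs:
        for a, b in align(s1.split(), s2.split()):
            if a != b:  # identical words carry no new relation
                pair_freq[(a, b)] += 1
    return {p for p, c in pair_freq.items() if c > threshold}

related = mine_related_words([
    ("buy cheap phone", "buy inexpensive phone"),
    ("find cheap hotel", "find inexpensive hotel"),
])
```

With these two parallel pairs, only `("cheap", "inexpensive")` is aligned more than once and survives the threshold.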
11. The method of claim 10, wherein the step of obtaining parallel sentence pairs comprises:
filtering out, according to the literal similarity of the two sentences, parallel sentence pairs that have different meanings.
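A sketch of the literal-similarity filtering in claim 11. The Jaccard overlap of word sets and the two cut-off values are assumptions; the claim only requires that pairs with different meanings be filtered out by literal similarity:

```python
def literal_similarity(s1: str, s2: str) -> float:
    """Jaccard overlap of the two sentences' word sets, a simple
    stand-in for a literal similarity measure."""
    a, b = set(s1.split()), set(s2.split())
    return len(a & b) / len(a | b) if a | b else 0.0

def filter_parallel_pairs(pairs, low=0.2, high=0.95):
    # Keep pairs that share enough words to plausibly mean the same thing,
    # but are not near-identical (i.e., they use different expression forms).
    return [(s1, s2) for s1, s2 in pairs
            if low <= literal_similarity(s1, s2) < high]

kept = filter_parallel_pairs([
    ("buy cheap phone", "buy inexpensive phone"),
    ("weather today", "train tickets"),
])
```

The first pair shares two of four distinct words (similarity 0.5) and is kept; the second shares none and is discarded.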
12. The method of claim 10, wherein the method of mining related words further comprises:
recording context words of the related words.
13. The method of claim 10, wherein,
the word alignment processing comprises regular word alignment processing and/or statistical word alignment processing;
the regular word alignment processing comprises at least one of: alignment of words whose literals are completely identical, alignment of words whose literals are partially identical, or adjacent word alignment processing;
the statistical word alignment processing is performed by using the GIZA++ tool.
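The rule-based alignments of claim 13 can be sketched as follows; the "more than half the characters shared" criterion for partial literal identity is an assumed threshold, and adjacent-word alignment is omitted for brevity:

```python
def rule_align(words1, words2):
    """Align words whose literals are completely identical, or partially
    identical by character-set overlap (a hypothetical criterion)."""
    pairs = []
    for a in words1:
        for b in words2:
            if a == b:
                pairs.append((a, b))  # completely identical literal
            else:
                common = len(set(a) & set(b))
                if common / max(len(set(a)), len(set(b))) > 0.5:
                    pairs.append((a, b))  # partially identical literal
    return pairs
```

For example, "cellphone" and "phone" share all five of the shorter word's distinct characters, so they align as a partially identical pair, while unrelated words do not.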
14. The method of claim 10, further comprising:
filtering the large-scale user search behavior data by using a linear model to obtain second aligned word pairs;
acquiring statistical characteristics capable of reflecting the correlation degree between the related words; and
training, based on the statistical characteristics and by using a gradient boosting decision tree (GBDT) algorithm, with the first aligned word pairs as positive samples and the second aligned word pairs as negative samples, to obtain the related word confidence calculation model.
15. The method of claim 14, wherein the related word confidence calculation model is a GBDT non-linear regression model.
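A minimal sketch of the GBDT confidence model of claims 14 and 15, using scikit-learn's gradient boosting regressor as a stand-in for the patent's implementation. The feature names (replaceability, co-occurrence, relatedness, language-model score delta) and the synthetic values are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical feature vectors for (search word, related word) pairs:
# [replaceability, co-occurrence, relatedness, LM score delta].
X = np.array([
    [0.9, 0.8, 0.7, 0.1],    # first aligned word pairs  -> positive samples
    [0.8, 0.9, 0.6, 0.2],
    [0.1, 0.2, 0.1, -0.5],   # linear-model filtered pairs -> negative samples
    [0.2, 0.1, 0.2, -0.4],
])
y = np.array([1.0, 1.0, 0.0, 0.0])

# GBDT non-linear regression model; its prediction on a new feature
# vector serves as the confidence between search word and related word.
model = GradientBoostingRegressor(n_estimators=20, max_depth=2,
                                  random_state=0).fit(X, y)
confidence = model.predict([[0.85, 0.85, 0.65, 0.15]])[0]
```

A feature vector resembling the positive samples yields a confidence close to 1, so the corresponding related word would rank highly.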
16. A search system, comprising:
a related word bank storage device for storing a related word bank;
a related word acquiring device for acquiring related words of each search word based on the related word bank stored in the related word bank storage device;
confidence calculation means for calculating a confidence between the search word and each of the related words based on a related word confidence calculation model;
a sorting device for sorting the results obtained by searching using the search word and the related word according to the corresponding confidence degrees,
wherein the search system further comprises a word segmentation device for performing word segmentation processing on a retrieval sentence to obtain the search words,
wherein the confidence calculating means comprises:
a feature value extraction module for extracting feature values between each search word and each corresponding related word; and
a confidence calculation module for taking the feature values as input to the related word confidence calculation model and calculating the confidence based on the related word confidence calculation model,
wherein the feature value extraction module comprises:
a degree of replaceability information obtaining unit configured to obtain degree of replaceability information, the degree of replaceability information being used to measure a degree of replaceability between the search word and the related word in a context of the related word; and
a co-occurrence relation information acquisition unit configured to acquire co-occurrence relation information, the co-occurrence relation information being used to measure the co-occurrence relation among the search words.
17. The search system of claim 16, wherein the feature value extraction module further comprises:
a relevant degree information acquisition unit configured to acquire relevant degree information, the relevant degree information being used to measure the degree of relevance between each search word and each corresponding related word; and/or
a language model score information acquisition unit configured to acquire language model score information, the language model score information showing the language model scores of the retrieval sentence before and after the related word replaces the search word; and/or
a weight value information acquisition unit configured to acquire weight value information indicating the weight of the related word.
18. The search system of claim 17, wherein the feature value extraction module further comprises:
a language model acquisition unit configured to train an N-gram language model based on large-scale user search behavior data to obtain the language model.
19. The search system of claim 16, further comprising related word bank establishing means for establishing the related word bank, the related word bank establishing means comprising:
the parallel sentence acquisition module is used for acquiring parallel sentence pairs expressing the same meaning by adopting different expression forms based on large-scale user search behavior data;
a word segmentation module for performing word segmentation processing on each group of parallel sentence pairs;
a word alignment module for performing word alignment processing on the parallel sentence pairs after word segmentation processing to obtain first aligned word pairs;
a co-occurrence frequency acquisition module, configured to calculate a co-occurrence frequency of the first alignment word pair;
a related word determining module for determining the first aligned word pair with the co-occurrence frequency higher than a predetermined threshold as a related word.
20. The search system according to claim 19, wherein the related word bank establishing means further comprises:
a context acquisition module for acquiring the context words of the related words.
21. The search system according to claim 19, further comprising related word confidence calculation model building means for building the related word confidence calculation model, including:
a linear model filtering module for filtering the large-scale user search behavior data using a linear model to obtain a second pair of aligned words;
a training module for training, by using the first aligned word pairs as positive samples and the second aligned word pairs as negative samples, based on a GBDT algorithm, to obtain the related word confidence calculation model.
22. The search system of claim 21, wherein the related term confidence calculation model is a GBDT non-linear regression model.
23. The search system according to claim 16, wherein the ranking means ranks results obtained by the search using the search term and the related term according to the corresponding confidence levels through a ranking model.
24. The search system of claim 23, wherein the ranking means is further configured to initially rank, by the ranking model, the retrieval resources according to the retrieval statements and retrieval resource page information.
CN201510657691.7A 2015-10-12 2015-10-12 Method for mining related words, search method, search system Active CN105279252B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201510657691.7A CN105279252B (en) 2015-10-12 2015-10-12 Method for mining related words, search method, search system
PCT/CN2016/101700 WO2017063538A1 (en) 2015-10-12 2016-10-10 Method for mining related words, search method, search system

Publications (2)

Publication Number Publication Date
CN105279252A CN105279252A (en) 2016-01-27
CN105279252B true CN105279252B (en) 2017-12-26

Family

ID=55148266

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510657691.7A Active CN105279252B (en) Method for mining related words, search method, search system

Country Status (2)

Country Link
CN (1) CN105279252B (en)
WO (1) WO2017063538A1 (en)

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105279252B (en) * 2015-10-12 2017-12-26 Guangzhou Shenma Mobile Information Technology Co., Ltd. Method for mining related words, search method, search system
CN105868847A (en) * 2016-03-24 2016-08-17 车智互联(北京)科技有限公司 Shopping behavior prediction method and device
CN105955993B (en) * 2016-04-19 2020-09-25 北京百度网讯科技有限公司 Search result ordering method and device
CN108205757B (en) * 2016-12-19 2022-05-27 创新先进技术有限公司 Method and device for verifying legality of electronic payment service
CN107168958A (en) * 2017-05-15 2017-09-15 北京搜狗科技发展有限公司 A kind of interpretation method and device
CN107909088B (en) * 2017-09-27 2022-06-28 百度在线网络技术(北京)有限公司 Method, apparatus, device and computer storage medium for obtaining training samples
CN108171570B (en) * 2017-12-15 2021-04-27 北京星选科技有限公司 Data screening method and device and terminal
CN108733766B (en) * 2018-04-17 2020-10-02 腾讯科技(深圳)有限公司 Data query method and device and readable medium
CN110472251B (en) 2018-05-10 2023-05-30 腾讯科技(深圳)有限公司 Translation model training method, sentence translation equipment and storage medium
CN109241356B (en) * 2018-06-22 2023-04-14 腾讯科技(深圳)有限公司 Data processing method, device and storage medium
CN110795613B (en) * 2018-07-17 2023-04-28 阿里巴巴集团控股有限公司 Commodity searching method, device and system and electronic equipment
CN109298796B (en) * 2018-07-24 2022-05-24 北京捷通华声科技股份有限公司 Word association method and device
CN109151599B (en) * 2018-08-30 2020-10-09 百度在线网络技术(北京)有限公司 Video processing method and device
CN111400577B (en) * 2018-12-14 2023-06-30 阿里巴巴集团控股有限公司 Search recall method and device
CN109885696A (en) * 2019-02-01 2019-06-14 杭州晶一智能科技有限公司 A kind of foreign language word library construction method based on self study
CN109918661B (en) * 2019-03-04 2023-05-30 腾讯科技(深圳)有限公司 Synonym acquisition method and device
CN110413737B (en) * 2019-07-29 2022-10-14 腾讯科技(深圳)有限公司 Synonym determination method, synonym determination device, server and readable storage medium
CN110851584B (en) * 2019-11-13 2023-12-15 成都华律网络服务有限公司 Legal provision accurate recommendation system and method
CN111241319B (en) * 2020-01-22 2023-10-03 北京搜狐新媒体信息技术有限公司 Image-text conversion method and system
CN113496411A (en) * 2020-03-18 2021-10-12 北京沃东天骏信息技术有限公司 Page pushing method, device and system, storage medium and electronic equipment
CN112199958A (en) * 2020-09-30 2021-01-08 平安科技(深圳)有限公司 Concept word sequence generation method and device, computer equipment and storage medium
CN112541076B (en) * 2020-11-09 2024-03-29 北京百度网讯科技有限公司 Method and device for generating expanded corpus in target field and electronic equipment
CN112307198B (en) * 2020-11-24 2024-03-12 腾讯科技(深圳)有限公司 Method and related device for determining abstract of single text
CN112835923A (en) * 2021-02-02 2021-05-25 中国工商银行股份有限公司 Correlation retrieval method, device and equipment
CN113609843B (en) * 2021-10-12 2022-02-01 京华信息科技股份有限公司 Sentence and word probability calculation method and system based on gradient lifting decision tree
CN114969310B (en) * 2022-06-07 2024-04-05 南京云问网络技术有限公司 Multi-dimensional data-oriented sectional search ordering system design method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6633868B1 (en) * 2000-07-28 2003-10-14 Shermann Loyall Min System and method for context-based document retrieval
CN102033955A (en) * 2010-12-24 2011-04-27 常华 Method for expanding user search results and server
CN103514150A (en) * 2012-06-21 2014-01-15 富士通株式会社 Method and device for recognizing ambiguous words with combinatorial ambiguities
CN104063454A (en) * 2014-06-24 2014-09-24 北京奇虎科技有限公司 Search push method and device for mining user demands

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101819578B (en) * 2010-01-25 2012-05-23 青岛普加智能信息有限公司 Retrieval method, method and device for establishing index and retrieval system
CN102591862A (en) * 2011-01-05 2012-07-18 华东师范大学 Control method and device of Chinese entity relationship extraction based on word co-occurrence
CN104239286A (en) * 2013-06-24 2014-12-24 阿里巴巴集团控股有限公司 Method and device for mining synonymous phrases and method and device for searching related contents
CN103942339B (en) * 2014-05-08 2017-06-09 深圳市宜搜科技发展有限公司 Synonym method for digging and device
CN105279252B (en) * 2015-10-12 2017-12-26 Guangzhou Shenma Mobile Information Technology Co., Ltd. Method for mining related words, search method, search system

Also Published As

Publication number Publication date
CN105279252A (en) 2016-01-27
WO2017063538A1 (en) 2017-04-20

Similar Documents

Publication Publication Date Title
CN105279252B (en) Method for mining related words, search method, search system
CN108509474B (en) Synonym expansion method and device for search information
CN106649786B (en) Answer retrieval method and device based on deep question answering
CN110929038B (en) Knowledge graph-based entity linking method, device, equipment and storage medium
US7949514B2 (en) Method for building parallel corpora
US20180300315A1 (en) Systems and methods for document processing using machine learning
Al-Hashemi Text Summarization Extraction System (TSES) Using Extracted Keywords.
CN113011533A (en) Text classification method and device, computer equipment and storage medium
US20160283468A1 (en) Context Based Synonym Filtering for Natural Language Processing Systems
CN107577671B (en) Subject term extraction method based on multi-feature fusion
CN108255813B (en) Text matching method based on word frequency-inverse document and CRF
CA2774278C (en) Methods and systems for extracting keyphrases from natural text for search engine indexing
US20150100308A1 (en) Automated Formation of Specialized Dictionaries
US20110295850A1 (en) Detection of junk in search result ranking
US11893537B2 (en) Linguistic analysis of seed documents and peer groups
Landthaler et al. Extending Full Text Search for Legal Document Collections Using Word Embeddings.
CN101702167A (en) Method for extracting attribution and comment word with template based on internet
CN112307182B (en) Question-answering system-based pseudo-correlation feedback extended query method
Sabuna et al. Summarizing Indonesian text automatically by using sentence scoring and decision tree
CN105653562A (en) Calculation method and apparatus for correlation between text content and query request
CN108228612B (en) Method and device for extracting network event keywords and emotional tendency
WO2020074017A1 (en) Deep learning-based method and device for screening for keywords in medical document
Rahman et al. NLP-based automatic answer script evaluation
CN112015907A (en) Method and device for quickly constructing discipline knowledge graph and storage medium
CN110019814B (en) News information aggregation method based on data mining and deep learning

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20200812

Address after: 310052 room 508, floor 5, building 4, No. 699, Wangshang Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province

Patentee after: Alibaba (China) Co.,Ltd.

Address before: 510627 Guangdong city of Guangzhou province Whampoa Tianhe District Road No. 163 Xiping Yun Lu Yun Ping square B radio tower 12 layer self unit 01

Patentee before: GUANGZHOU SHENMA MOBILE INFORMATION TECHNOLOGY Co.,Ltd.