WO2017063538A1 - Method for mining related words, search method, and search system - Google Patents

Method for mining related words, search method, and search system

Info

Publication number
WO2017063538A1
WO2017063538A1 PCT/CN2016/101700 CN2016101700W WO2017063538A1 WO 2017063538 A1 WO2017063538 A1 WO 2017063538A1 CN 2016101700 W CN2016101700 W CN 2016101700W WO 2017063538 A1 WO2017063538 A1 WO 2017063538A1
Authority
WO
WIPO (PCT)
Prior art keywords
search
word
words
pair
search term
Application number
PCT/CN2016/101700
Other languages
English (en)
French (fr)
Inventor
韩增新
蒋冠军
董良
Original Assignee
广州神马移动信息科技有限公司
Application filed by 广州神马移动信息科技有限公司
Publication of WO2017063538A1 publication Critical patent/WO2017063538A1/zh


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Definitions

  • The present application relates to the field of information retrieval, and in particular to a method for mining related words, a search method, and a search system.
  • A search engine is a necessary feature that a website provides for the convenience of its users, and is also an effective tool for studying the behavior of website users.
  • Efficient on-site search lets users find target information quickly and accurately, effectively solves user problems, promotes product/service sales more effectively, and, through in-depth analysis of visitors' search behavior, is of great value for formulating more effective online marketing strategies.
  • When a user searches with a search engine, the user enters search keywords on the search engine's search page, and the search engine retrieves and returns the results.
  • A typical search engine searches directly with the keywords the user entered (the original words), or with synonyms of the search terms.
  • However, searching with original words or synonyms yields limited results. There are often good results whose wording is not identical to the search terms but is semantically closely related to them, and the pages containing such results cannot be recalled.
  • The technical problem to be solved by the present application is that a traditional search engine obtains only limited results by searching with original words or synonyms; the application provides a method for mining related words, a search method, and a search system.
  • a method of mining related words is provided.
  • a method of mining related words including:
  • the first aligned word pair having a co-occurrence frequency above a predetermined threshold is determined as a related word.
  • The step of obtaining parallel sentence pairs comprises: filtering out sentence pairs with different meanings according to the literal similarity of the two sentences.
  • The method further comprises recording the contextual words of the related words.
  • the word alignment process includes rule word alignment processing and/or statistical word alignment processing.
  • The rule-based word alignment includes at least one of literally identical word alignment, literally partially identical word alignment, or adjacent word alignment.
  • the statistical word alignment process is a statistical word alignment process using a GIZA++ tool.
  • the method further comprises:
  • Taking the first aligned word pairs as positive samples and the second aligned word pairs as negative samples, and, based on the statistical features, training the positive and negative samples with a gradient boosting decision tree (GBDT) algorithm to obtain the related word confidence calculation model.
  • By building the related word confidence calculation model, the model can distinguish degrees of relevance between related words.
  • the correlation word confidence calculation model is a GBDT nonlinear regression model.
  • a search method is also disclosed.
  • a search method includes the following steps:
  • the results obtained by searching using the search term and the related words are sorted according to the corresponding confidence.
  • With this search method, the corresponding related words can be found for a search term, expanding the scope of the search and the search results; it prevents good results whose wording is not identical to the search term but semantically very close to it from being unrecallable.
  • the related word lexicon is established by the method of mining related words according to the above.
  • the method further comprises performing word segmentation on the retrieval statement to obtain the search term.
  • the search sentence is segmented to obtain a plurality of search terms, and the search results related to the plurality of search terms are searched by the search method, thereby further expanding the search range.
  • the step of calculating the confidence between the search term and each of the related words based on the confidence calculation model comprises:
  • the feature value is used as an input of the confidence calculation model, and the confidence is calculated based on the confidence calculation model.
  • The characteristic values comprise:
  • relevance degree information, for measuring the degree of relevance between each search term and each corresponding related word; and/or
  • replaceability degree information, for measuring the degree of substitutability between the search term and the related word in the context of the related word; and/or
  • co-occurrence relationship information, for measuring the co-occurrence relationship between the search terms; and/or
  • language model score information, for indicating the language model score of the search sentence before and after the related word replaces the search term; and/or
  • weight value information, for indicating the weight of the related word.
  • The relevance degree information includes a first translation probability P1 and/or a second translation probability P2:
    P1 = count1(A, A′) / count1(A, ·); P2 = count1(A, A′) / count1(·, A′);
    count1(A, ·) = Σj count1(A, wj); count1(·, A′) = Σi count1(wi, A′);
  • The search term A and the related word A′ form the first word pair (A, A′); count1(A, A′) is the number of times the first word pair (A, A′) is aligned in the parallel sentence pairs; count1(A, ·) is the total number of times the search term A is aligned in the parallel sentence pairs; count1(·, A′) is the total number of times the related word A′ is aligned in the parallel sentence pairs; wj is the j-th of all words aligned with the search term A; wi is the i-th of all words aligned with the related word A′; count1(A, wj) is the number of times the search term A is aligned with the word wj; count1(wi, A′) is the number of times the word wi is aligned with the related word A′; i and j are natural numbers.
  • The replaceability degree information includes a first replaceability score(D, Q) and/or a second replaceability score(D, Q′), of the BM25 form:
    score(D, Q) = Σi=1..n f(qi, D) · (k1 + 1) / (f(qi, D) + k1 · (1 - b + b · |D| / avgdl));
    score(D, Q′) = Σj=1..m f(q′j, D) · (k1 + 1) / (f(q′j, D) + k1 · (1 - b + b · |D| / avgdl));
  • where the search term A and the related word A′ form the first word pair (A, A′); all context words of the search term A and the related word A′ constitute the document D, and |D| is the length of D; Q is the search sentence; qi is the i-th search term of Q; n is the total number of search terms in Q; Q′ is the combination of the m search terms near the search term A, m < n; q′j is the j-th search term of the combination Q′; avgdl is the average length of the documents formed by the contexts of all related words of the search term A; k1 is a first constant and b is a second constant; f(qi, D) is the frequency of occurrence of qi in the document D; f(q′j, D) is the frequency of occurrence of q′j in the document D.
  • The co-occurrence relationship information includes first co-occurrence relationship information and/or second co-occurrence relationship information obtained based on the co-occurrence index PMI, where
    PMI(A, B) = log( count2(A, B) · count2(·, ·) / (count2(A, ·) · count2(·, B)) );
    count2(A, ·) = Σj count2(A, wj); count2(·, B) = Σi count2(wi, B); count2(·, ·) = Σi,j count2(wi, wj);
  • count2(A, ·) is the total number of times the search term A appears together with other search terms in the search resource; count2(·, B) is the total number of times the search term B appears together with other search terms in the search resource; count2(A, B) is the number of times the two search terms A and B appear together in the search resource; wj is the j-th of all words in the search resource that appear together with the search term A; wi is the i-th of all words in the search resource that appear together with the related word B; count2(A, wj) is the number of times A and wj appear together in the search resource; count2(wi, B) is the number of times wi and B appear together in the search resource; count2(wi, wj) is the number of times wi and wj appear together in the search resource; i and j are natural numbers.
  • The first co-occurrence relationship information is the average of the PMI co-occurrence indices between the search term and the other words in the search sentence;
  • The second co-occurrence relationship information is the average of the PMI co-occurrence indices between the related word and the other words in the search sentence.
  • The method further comprises training an N-gram language model on the large-scale user search behavior data to acquire the language model.
  • In the step of sorting, a ranking model sorts the results obtained by searching with the search term and the related words according to the corresponding confidence.
  • The method further comprises a step in which the ranking model performs an initial ranking of the search resources according to the search sentence and the search resource page information.
  • the retrieval resource is a webpage resource and/or a document resource.
  • a search system is also provided.
  • a search system that includes:
  • a related word obtaining device configured to acquire a related word of the search term based on the related word lexicon stored by the related word storage device;
  • a confidence calculation device configured to calculate a confidence level between the search term and each of the related words based on a related word confidence calculation model
  • a sorting means for sorting results obtained by searching using the search term and the related words according to the corresponding confidence.
  • The search system further includes a related word lexicon establishing device for establishing the related word lexicon, including:
  • a parallel sentence obtaining module configured to acquire parallel sentence pairs that use different expression forms to express the same meaning based on large-scale user search behavior data
  • a word segmenter, configured to perform word segmentation on each of the parallel sentence pairs;
  • a word alignment module configured to perform word alignment processing on the parallel sentence pairs processed by the word segmentation to obtain a first aligned word pair
  • a co-occurrence frequency acquisition module configured to calculate a co-occurrence frequency of the first aligned word pair
  • the related word determining module is configured to determine the first aligned word pair whose co-occurrence frequency is higher than a predetermined threshold as a related word.
  • The related word lexicon establishing device further includes:
  • a context obtaining module configured to acquire a contextual context word of the related word.
  • the search system further comprises a related word confidence calculation model establishing means for establishing the related word confidence calculation model, comprising:
  • a linear model filtering module configured to filter the large-scale user search behavior data using a linear model to obtain a second aligned word pair
  • a training module, configured to take the first aligned word pairs as positive samples and the second aligned word pairs as negative samples, and to train the positive and negative samples based on a GBDT algorithm to obtain the related word confidence calculation model.
  • the correlation word confidence calculation model is a GBDT nonlinear regression model.
  • The word segmenter is further configured to perform word segmentation on the search sentence to obtain the search terms.
  • the confidence calculation device comprises:
  • a feature value extraction module, configured to extract feature values between each search term and each corresponding related word;
  • a confidence calculation module is configured to use the feature value as an input of the related word confidence calculation model, and calculate the confidence based on the related word confidence calculation model.
  • the feature value extraction module includes:
  • a relevance degree information acquisition unit, configured to acquire relevance degree information used to measure the degree of relevance between each search term and each corresponding related word; and/or
  • a replaceability degree information acquisition unit, configured to acquire replaceability degree information used to measure the degree of substitutability between the search term and the related word in the context of the related word; and/or
  • a co-occurrence relationship information acquisition unit, configured to acquire co-occurrence relationship information used to measure the co-occurrence relationship between the search terms; and/or
  • a language model score information acquisition unit, configured to acquire language model score information used to indicate the language model score of the search sentence before and after the related word replaces the search term; and/or
  • a weight value information acquisition unit, configured to acquire weight value information used to indicate the weight of the related word.
  • the feature value extraction module further includes:
  • a language model obtaining unit configured to acquire the language model by training an N-gram language model based on the large-scale user search behavior data.
  • the sorting means sorts the results obtained by searching the search term and the related words according to the corresponding confidence by the sorting model.
  • the sorting device is further configured to perform initial sorting on the search resources according to the search statement and the search resource page information by using the sorting model.
  • According to another aspect of the present application, a computing device is provided, comprising:
  • one or more processors; and
  • a memory storing instructions which, when executed by the one or more processors, cause the computing device to perform the search method described above, including sorting the results obtained by searching with the search term and the related words according to the corresponding confidence.
  • a computer readable recording medium having recorded thereon a program for executing the above method.
  • In this way, the related words corresponding to the search terms can be found and used together with the search terms for retrieval, which expands the search range and the search results; it prevents good results whose wording is not identical to the search terms but semantically very close from being unrecallable.
  • FIG. 1 shows a flow chart of a method of mining related words according to an embodiment of the present application
  • FIG. 2 is a flow chart showing a method of mining related words according to another embodiment of the present application.
  • FIG. 3 shows a flow chart of a search method according to an embodiment of the present application
  • FIG. 4 shows a flowchart of a search method according to another embodiment of the present application.
  • FIG. 5 is a flow chart showing the step S240 of the embodiment shown in Figure 4.
  • FIG. 6 shows a schematic diagram of a search system in accordance with an embodiment of the present application.
  • FIG. 7 shows a schematic diagram of a search system according to another embodiment of the present application.
  • FIG. 8 is a schematic diagram of the related word lexicon establishing device 310 of the embodiment shown in FIG. 7;
  • FIG. 9 is a schematic diagram of the related word confidence calculation model establishing device 350 of the embodiment shown in FIG. 7;
  • FIG. 10 is a schematic diagram of the confidence calculation device 390 of the embodiment shown in FIG. 7;
  • FIG. 11 is a schematic diagram of the feature value extraction module 394 of the embodiment shown in FIG. 10;
  • FIG. 12 is a block diagram showing the structure of a computing device provided in accordance with an embodiment of the present invention.
  • FIG. 1 shows a flow chart of a method of mining related words according to an embodiment of the present application.
  • In step S110, parallel sentence pairs that express the same meaning in different forms are acquired based on the large-scale user search behavior data.
  • parallel sentence pairs are obtained from data such as a user's search log and/or a search title log.
  • parallel sentence pairs refer to sentence pairs that use different expressions to express the same meaning.
  • For example, a pair of parallel sentences expressing the same meaning in different wording may be "the baby has a red spot on the neck" and "the baby's neck has spots".
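The application does not spell out how such pairs are extracted from the logs; the sketch below illustrates one plausible reading, in which a query and the title of the result the user clicked are treated as two phrasings of the same intent and then filtered by literal similarity. The log schema, similarity measure, and thresholds are all illustrative assumptions.

```python
# Hypothetical extraction of parallel sentence pairs from a search log.
# Treats a query and its clicked result title as two phrasings of the same
# meaning; the log schema and thresholds are illustrative assumptions.
def literal_similarity(s1: str, s2: str) -> float:
    a, b = set(s1), set(s2)
    return len(a & b) / len(a | b)  # character-level Jaccard similarity

search_log = [
    {"query": "宝宝脖子上有红色的斑点", "clicked_title": "婴儿的颈部有斑点"},
    {"query": "天气", "clicked_title": "汽车保养指南"},  # unrelated pair
]

# keep pairs that share some wording but are not near-identical
parallel_pairs = [
    (r["query"], r["clicked_title"])
    for r in search_log
    if 0.2 <= literal_similarity(r["query"], r["clicked_title"]) < 0.9
]
print(parallel_pairs)  # [('宝宝脖子上有红色的斑点', '婴儿的颈部有斑点')]
```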
  • In step S120, each parallel sentence pair is subjected to word segmentation.
  • Each sentence of each parallel sentence pair is segmented using a word segmentation technique.
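As an illustration, the sketch below segments the example pair with the open-source jieba tokenizer; the application does not name a specific segmenter, and the Chinese renderings of the example sentences are assumed.

```python
# Word segmentation of one parallel sentence pair. jieba is one common
# Chinese tokenizer; the application does not name a specific segmenter.
import jieba

sentence_a = "宝宝脖子上有红色的斑点"  # "the baby has a red spot on the neck"
sentence_b = "婴儿的颈部有斑点"        # "the baby's neck has spots"

tokens_a = jieba.lcut(sentence_a)
tokens_b = jieba.lcut(sentence_b)
print(tokens_a, tokens_b)
```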
  • In step S130, word alignment is performed on the segmented parallel sentence pairs to obtain first aligned word pairs.
  • Word alignment can be used to find words that express the same meaning.
  • the word alignment process may include a rule word alignment process and/or a statistical word alignment process.
  • The rule-based word alignment includes at least one of literally identical word alignment, literally partially identical word alignment, or adjacent word alignment.
  • The statistical word alignment is performed using the GIZA++ tool.
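A minimal sketch of the rule-based alignment cases named above (literally identical and literally partially identical words) follows; the exact matching rules and thresholds are not specified in the text and are assumed here.

```python
# Rule-based word alignment: align literally identical tokens, plus tokens
# that share enough characters (literal partial match). The 0.5 overlap
# threshold is an illustrative assumption.
def rule_align(tokens_a, tokens_b, min_overlap=0.5):
    pairs = []
    for wa in tokens_a:
        for wb in tokens_b:
            if wa == wb:                      # literally identical words
                pairs.append((wa, wb))
            else:                             # literal partial match
                shared = set(wa) & set(wb)
                ratio = len(shared) / max(len(set(wa)), len(set(wb)))
                if ratio >= min_overlap:
                    pairs.append((wa, wb))
    return pairs

print(rule_align(["宝宝", "脖子", "有", "斑点"], ["婴儿", "颈部", "有", "斑点"]))
# [('有', '有'), ('斑点', '斑点')]
```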
  • In step S140, the co-occurrence frequency of the first aligned word pairs is calculated.
  • The evaluation index of the co-occurrence frequency may be the first translation probability P1 and/or the second translation probability P2, calculated as:
    P1 = count1(A, A′) / count1(A, ·); P2 = count1(A, A′) / count1(·, A′);
    count1(A, ·) = Σj count1(A, wj); count1(·, A′) = Σi count1(wi, A′);
  • The search term A and the related word A′ form the first word pair (A, A′); count1(A, A′) is the number of times the first word pair (A, A′) is aligned in the parallel sentence pairs; count1(A, ·) is the total number of times the search term A is aligned in the parallel sentence pairs; count1(·, A′) is the total number of times the related word A′ is aligned in the parallel sentence pairs; wj is the j-th of all words aligned with the search term A; wi is the i-th of all words aligned with the related word A′; count1(A, wj) is the number of times the search term A is aligned with the word wj; count1(wi, A′) is the number of times the word wi is aligned with the related word A′; i and j are natural numbers.
  • count1(A, A′) is independent of the order of A and A′; that is, count1(A, A′) is the same as count1(A′, A).
  • P1 is the ratio of the number of times the search term A is aligned with the related word A′ to the total number of times A is aligned; P2 is the ratio of the number of times the search term A is aligned with the related word A′ to the total number of times A′ is aligned.
  • The number of alignments is the number of times two words are aligned across different parallel sentence pairs, whereas the number of co-occurrences is the number of times two words appear together in the same corpus.
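Putting the formulas above into code, the sketch below computes P1 and P2 from alignment counts and keeps pairs that clear the threshold; the toy alignment data is illustrative only.

```python
# Compute the translation probabilities P1 and P2 from alignment counts and
# keep first aligned word pairs whose co-occurrence frequency clears the
# predetermined threshold. The toy counts below are illustrative only.
from collections import Counter

aligned_pairs = [("宝宝", "婴儿")] * 8 + [("宝宝", "孩子")] * 2 + [("脖子", "颈部")] * 5

pair_count = Counter(aligned_pairs)                 # count1(A, A')
left_count = Counter(a for a, _ in aligned_pairs)   # count1(A, ·)
right_count = Counter(b for _, b in aligned_pairs)  # count1(·, A')

def p1(a, a_prime):  # share of A's alignments that go to A'
    return pair_count[(a, a_prime)] / left_count[a]

def p2(a, a_prime):  # share of A''s alignments that come from A
    return pair_count[(a, a_prime)] / right_count[a_prime]

threshold = 1e-99  # the predetermined threshold named in the text
related_words = [(a, b) for (a, b) in pair_count
                 if p1(a, b) > threshold and p2(a, b) > threshold]
print(p1("宝宝", "婴儿"), p2("宝宝", "婴儿"))  # 0.8 1.0
print(related_words)
```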
  • In step S150, first aligned word pairs whose co-occurrence frequency is higher than a predetermined threshold are determined to be related words.
  • the predetermined threshold value may be set to different degrees according to different requirements for the correlation between related words.
  • For example, the predetermined threshold may be 1.0e-99.
  • related words with higher relevance can be mined, and the scope of search term search can be further expanded, and the probability of finding better search results can be improved.
  • related words having different degrees of similarity may also be acquired according to different predetermined threshold values.
  • the above method for mining related words further includes the following steps:
  • In step S160, the contextual words of the related words are recorded.
  • In this way, the context of the related words is known; by judging whether the contexts of two related words are the same or similar, the relevance between them can be further assessed, which helps obtain related words of higher similarity.
  • The acquisition of the contextual words of the related words may be limited according to the length of the parallel sentences; alternatively, no length or other limitation may be imposed.
  • The length or the manner of acquiring contextual words may be defined differently according to the requirements on the relevance of the related words or other criteria.
  • In step S170, the large-scale user search behavior data is filtered using a linear model to obtain second aligned word pairs.
  • The linear model may be a simple linear model.
  • The simple linear model may be fitted by simple linear regression on a small set (on the order of 10,000) of manually labeled word pairs, using the statistical features between those word pairs.
  • The fitting may refer to linear-regression fitting and modeling.
  • Because the manually labeled word pairs are few and the model is simple, the confidence scores output by this model are not high.
  • The large-scale user search behavior data is filtered with the linear model, and results whose confidence score is below a specific threshold are taken as second aligned word pairs; since word pairs filtered out by the model have low confidence scores, the second aligned word pairs are poor word pairs.
  • The specific threshold is close to or less than zero.
  • the above-mentioned "manually labeled" word pair means that under a query, the original word in a query to a related word constitutes a word pair, and the word pair is labeled as a related word.
  • the above labeling method can be, in the "eight months baby what to eat?" in the query, baby -> baby this related word pair, "baby” is the original word, "baby” It is a related word.
  • This related word can be marked with 1 point.
  • the representative can be used as a related word.
  • “baby”->“baby” is marked with 0 points, which means it cannot be used as a related word.
  • the above-mentioned poor word pair refers to a pair of erroneous words that should not appear under the current query term, or a word pair that violates the user's intention. For example, the user searches for “baby to eat milk” and obtains “baby drinking milk” is a good word pair (ie, a related word with 1 point); however, “what fruit is delicious” becomes “what fruit is good”, that is An erroneous pair of wrong words, that is, a poor pair of words.
  • the above-mentioned poor word pairs may have more forms of representation, and are not limited to the examples.
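The sketch below illustrates this filtering step: a simple linear regression is fitted on a small manually labeled set (label 1 = valid related word, 0 = not), and candidate pairs scoring below a threshold near zero become negative samples. The features and data are stand-ins, not the application's actual features.

```python
# Negative-sample filter: fit a simple linear regression on manually labeled
# word pairs, then take candidates whose predicted confidence falls below a
# threshold near zero as second aligned word pairs (poor pairs). Features
# and data here are illustrative stand-ins.
import numpy as np
from sklearn.linear_model import LinearRegression

# each row: statistical features of one labeled word pair
X_labeled = np.array([[0.8, 0.9], [0.7, 0.6], [0.1, 0.2], [0.05, 0.1]])
y_labeled = np.array([1.0, 1.0, 0.0, 0.0])  # 1 = valid related word

linear = LinearRegression().fit(X_labeled, y_labeled)

X_candidates = np.array([[0.75, 0.8], [0.02, 0.05]])
scores = linear.predict(X_candidates)

threshold = 0.05  # "close to or less than zero" per the text
second_aligned = X_candidates[scores < threshold]
print(scores, second_aligned)
```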
  • In step S180, statistical features that can reflect the relevance between related words are obtained.
  • These statistical features are computed over the contextual words appropriate to the word pair in the current query context, and include at least one of relevance degree information, replaceability degree information, co-occurrence relationship information, language model score information, and weight value information between every two related words.
  • In step S190, the first aligned word pairs are taken as positive samples and the second aligned word pairs as negative samples.
  • A gradient boosting decision tree (GBDT) algorithm is used to train the positive and negative samples to obtain the related word confidence calculation model.
  • The related word confidence calculation model may be a GBDT nonlinear regression model.
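A minimal training sketch follows, with scikit-learn's GradientBoostingRegressor standing in for whatever GBDT implementation the application uses; the feature layout and toy samples are assumptions.

```python
# Train the related word confidence calculation model: first aligned word
# pairs are positive samples, filtered pairs are negative samples, and a
# GBDT regressor is fitted on the statistical features. scikit-learn's
# GradientBoostingRegressor stands in for the unspecified GBDT tool.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# each row: [P1, replaceability score, PMI average, LM score, weight]
X_pos = np.array([[0.80, 2.1, 1.5, -3.2, 0.9],
                  [0.70, 1.8, 1.2, -3.5, 0.8]])
X_neg = np.array([[0.10, 0.3, -0.5, -7.1, 0.2],
                  [0.05, 0.2, -0.8, -8.0, 0.1]])

X = np.vstack([X_pos, X_neg])
y = np.array([1.0] * len(X_pos) + [0.0] * len(X_neg))

gbdt = GradientBoostingRegressor(n_estimators=100, max_depth=3).fit(X, y)

# confidence for a new (search term, related word) feature vector
print(gbdt.predict([[0.6, 1.5, 1.0, -4.0, 0.7]]))
```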
  • FIG. 3 shows a flow chart of a search method in accordance with an embodiment of the present application.
  • a search method includes the following steps:
  • In step S220, related words of the search term are obtained based on the related word lexicon.
  • the above related word lexicon is established by the method of mining related words as described above.
  • the related words include not only synonyms of the search terms (which may include strong synonyms and context synonyms), but also related words with wider coverage.
  • related words with higher relevance can be mined, which further expands the scope of search and improves the probability of finding better search results.
  • In step S240, the confidence between the search term and each related word is calculated based on the confidence calculation model.
  • In step S260, the results obtained by searching with the search term and its related words are sorted according to the corresponding confidence.
  • In this step, a ranking model is used to sort the results obtained by searching with the search terms and related words according to the corresponding confidence.
  • The ranking model may be a quick-sort model that sorts according to an existing quicksort algorithm; other existing ranking models may also be used.
  • Searching with related words not only covers high-frequency synonyms but also attends to low- and medium-frequency related words; especially when search resources are relatively scarce, searching with related words retrieves information to the greatest extent.
  • In this way, the corresponding related words can be found for the search terms and used together with them for retrieval, which expands the scope of the search and the search results; it prevents good results whose wording is not identical to the search terms but semantically very close from being unrecallable.
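An end-to-end sketch of this flow follows: look up related words, retrieve with the expanded term set, and weight each hit by the confidence of the term that matched. The lexicon, confidence values, and corpus are toy stand-ins for the trained components.

```python
# Query expansion and confidence-weighted ranking. The lexicon, confidence
# values, and corpus are toy stand-ins for the trained components.
related_lexicon = {"宝宝": ["婴儿", "孩子"]}
confidence = {("宝宝", "婴儿"): 0.9, ("宝宝", "孩子"): 0.6}

def search(terms, corpus):
    scored = []
    for doc in corpus:
        score = 0.0
        for term in terms:
            if term in doc:
                score += 1.0  # the original word gets full weight
            for rel in related_lexicon.get(term, []):
                if rel in doc:
                    score += confidence[(term, rel)]  # related word: its confidence
        if score > 0:
            scored.append((score, doc))
    return sorted(scored, reverse=True)  # sort results by confidence-weighted score

corpus = ["婴儿吃什么辅食", "宝宝睡眠时间", "汽车保养指南"]
print(search(["宝宝"], corpus))
# [(1.0, '宝宝睡眠时间'), (0.9, '婴儿吃什么辅食')]
```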
  • The method may further include a step of initially ranking the search resources according to the search sentence and the search resource page information.
  • The initial ranking is a general retrieval process; by setting a retrieval threshold, only search results reaching a predetermined score enter the re-ranking of step S260.
  • When the initial retrieval results are numerous, this reduces the amount of re-ranking; this two-stage ranking can also be used when the user requests that only highly accurate search results be displayed.
  • the above search resource may be a webpage resource and/or a document resource.
  • The search resource can be a piece of text information, a webpage title, a query, or a long document.
  • FIG. 4 shows a flow chart of a search method according to another embodiment of the present application.
  • the above search method may further include step S210 before the step S220.
  • In step S210, the search sentence is segmented to obtain the search terms.
  • In this way, when the user enters a search sentence, segmenting it yields several search terms, so that results related to all of these terms can be retrieved, further expanding the scope of the search.
  • The word segmentation may include Chinese word segmentation and/or English word segmentation, as well as word segmentation for other languages; the segmenter may use any of various existing word segmentation techniques.
  • Figure 5 shows a flow chart of step S240 of the embodiment shown in Figure 4.
  • In step S242, feature values between each search term and each corresponding related word are acquired.
  • In step S244, the feature values are used as input to the confidence calculation model, and the confidence is calculated based on the model.
  • the feature value may include at least one of relevance degree information, replaceability degree information, co-occurrence relationship information, language model score information, and weight value information.
  • the above correlation degree information is used to measure the degree of correlation between each search term and each corresponding related word.
  • The relevance degree information may include the first translation probability P1 and/or the second translation probability P2, expressed as:
    P1 = count1(A, A′) / count1(A, ·); P2 = count1(A, A′) / count1(·, A′);
    count1(A, ·) = Σj count1(A, wj); count1(·, A′) = Σi count1(wi, A′);
  • where the search term A and the related word A′ form the first word pair (A, A′); count1(A, A′) is the number of times the first word pair (A, A′) is aligned in the parallel sentence pairs; count1(A, ·) is the total number of times the search term A is aligned; count1(·, A′) is the total number of times the related word A′ is aligned; wj is the j-th of all words aligned with the search term A; wi is the i-th of all words aligned with the related word A′; count1(A, wj) is the number of times the search term A is aligned with the word wj; count1(wi, A′) is the number of times the word wi is aligned with the related word A′; i and j are natural numbers.
  • count1(A, A′) is independent of the order of A and A′; that is, count1(A, A′) is the same as count1(A′, A).
  • the degree of replaceability information is used to measure the degree of substitutability between the search term and the related words in the context of the related words.
  • The replaceability degree information includes a first replaceability score(D, Q) and/or a second replaceability score(D, Q′), of the BM25 form:
    score(D, Q) = Σi=1..n f(qi, D) · (k1 + 1) / (f(qi, D) + k1 · (1 - b + b · |D| / avgdl));
    score(D, Q′) = Σj=1..m f(q′j, D) · (k1 + 1) / (f(q′j, D) + k1 · (1 - b + b · |D| / avgdl));
  • where the search term A and the related word A′ form the first word pair (A, A′); all context words of the search term A and the related word A′ constitute the document D, and |D| is the length of D; Q is the search sentence; qi is the i-th search term of Q; n is the total number of search terms in Q; Q′ is the combination of the m search terms near the search term A, m < n; q′j is the j-th search term of the combination Q′; avgdl is the average length of the documents formed by the contexts of all related words of the search term A; k1 is a first constant and b is a second constant; f(qi, D) is the frequency of occurrence of qi in the document D; f(q′j, D) is the frequency of occurrence of q′j in the document D.
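In code, the score reduces to the familiar BM25 term-frequency saturation; the constants k1 and b are not fixed in the text, so common BM25 defaults are assumed below.

```python
# Replaceability score in the BM25 form implied by the symbols above.
# k1 = 1.2 and b = 0.75 are common BM25 defaults, assumed here since the
# text does not fix the two constants.
def replaceability_score(query_terms, doc_tokens, avgdl, k1=1.2, b=0.75):
    dl = len(doc_tokens)  # |D|
    score = 0.0
    for q in query_terms:
        f = doc_tokens.count(q)  # f(q_i, D)
        score += f * (k1 + 1) / (f + k1 * (1 - b + b * dl / avgdl))
    return score

# D = the context words of the search term and the related word
context_doc = ["吃", "什么", "辅食", "喝", "奶粉", "睡觉"]
print(replaceability_score(["吃", "奶粉"], context_doc, avgdl=8.0))
```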
  • The co-occurrence relationship information is used to measure the co-occurrence relationship between search terms, and refers to statistics on two search terms appearing together in a query corpus (search resources, webpages, and/or documents).
  • The co-occurrence relationship information includes first co-occurrence relationship information and/or second co-occurrence relationship information obtained based on the co-occurrence index PMI:
    PMI(A, B) = log( count2(A, B) · count2(·, ·) / (count2(A, ·) · count2(·, B)) );
    count2(A, ·) = Σj count2(A, wj); count2(·, B) = Σi count2(wi, B); count2(·, ·) = Σi,j count2(wi, wj);
  • count2(A, ·) is the total number of times the search term A appears together with other search terms in the search resource; count2(·, B) is the total number of times the search term B appears together with other search terms in the search resource; count2(A, B) is the number of times the two search terms A and B appear together in the search resource; wj is the j-th of all words in the search resource that appear together with the search term A; wi is the i-th of all words in the search resource that appear together with the related word B; count2(A, wj) is the number of times the two search terms A and wj appear together in the search resource; count2(wi, B) is the number of times the two search terms wi and B appear together in the search resource; count2(wi, wj) is the number of times the two search terms wi and wj appear together in the search resource; i and j are natural numbers.
  • count2(A, B) is independent of the order of A and B; that is, count2(A, B) is the same as count2(B, A).
  • The first co-occurrence relationship information is the average of the PMI co-occurrence indices between the search term and the other words in the search sentence.
  • The second co-occurrence relationship information is the average of the PMI co-occurrence indices between the related word and the other search terms in the search sentence (excluding the search term to which the related word corresponds).
  • When calculating the first co-occurrence relationship information, the formula above is used directly and the average is taken; when calculating the second, the search term A in the formula is replaced with the related word A′.
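The sketch below computes the PMI feature from co-occurrence counts over a toy search resource, following the count2 definitions above.

```python
# PMI feature from co-occurrence counts, following the count2 definitions:
# count2(A,B) is symmetric, and the marginals sum over co-occurring words.
import math
from collections import Counter
from itertools import combinations

docs = [["宝宝", "喝", "奶粉"], ["婴儿", "喝", "奶粉"], ["宝宝", "睡觉"]]

pair_count = Counter()
for doc in docs:
    for a, b in combinations(set(doc), 2):
        pair_count[(a, b)] += 1
        pair_count[(b, a)] += 1  # co-occurrence is order-independent

def count2(a=None, b=None):
    if a is not None and b is not None:
        return pair_count[(a, b)]
    if a is not None:  # count2(A, ·)
        return sum(c for (x, _), c in pair_count.items() if x == a)
    if b is not None:  # count2(·, B)
        return sum(c for (_, y), c in pair_count.items() if y == b)
    return sum(pair_count.values())  # count2(·, ·)

def pmi(a, b):
    total = count2()
    joint = count2(a, b) / total
    return math.log(joint / ((count2(a=a) / total) * (count2(b=b) / total)))

print(pmi("宝宝", "奶粉"))  # ≈ 0.15 on this toy corpus
```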
  • The language model score information is used to indicate the language model score of the search sentence before and after the related word replaces the search term.
  • The method may further include training an N-gram language model on the large-scale user search behavior data to acquire the language model.
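A minimal bigram version of such a language model is sketched below, trained on toy query text and used to score a sentence before and after substituting a related word; the smoothing choice and data are assumptions.

```python
# Minimal bigram language model with add-one smoothing, trained on toy
# query text standing in for the large-scale user search behavior data.
# It scores the search sentence before and after a related word replaces
# the search term.
import math
from collections import Counter

queries = [["宝宝", "吃", "什么"], ["婴儿", "吃", "什么"], ["宝宝", "喝", "奶粉"]]

unigrams, bigrams = Counter(), Counter()
for q in queries:
    unigrams.update(q)
    bigrams.update(zip(q, q[1:]))

V = len(unigrams)  # vocabulary size for add-one smoothing

def sentence_logprob(tokens):
    lp = 0.0
    for w1, w2 in zip(tokens, tokens[1:]):
        lp += math.log((bigrams[(w1, w2)] + 1) / (unigrams[w1] + V))
    return lp

before = sentence_logprob(["宝宝", "吃", "什么"])
after = sentence_logprob(["婴儿", "吃", "什么"])  # related word substituted
print(before, after)
```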
  • the weight value information is used to indicate the weight of the related words.
  • The calculation manner of the above statistical features is also used in step S180 to calculate the statistical features between related words.
  • FIG. 6 shows a schematic diagram of a search system in accordance with an embodiment of the present application.
  • a search system 300 includes a related word storage device 320, a related word acquisition device 340, a search device 360, a sorting device 380, and a confidence calculation device 390.
  • the related word acquisition means 340 connects the related word storage means 320, and acquires related words of the search term based on the related word storage means 320.
  • the search device 360 searches based on the search term and the related words of the search term.
  • The confidence calculation device 390 calculates the confidence between the search term and each of its related words based on the confidence calculation model.
  • the sorting means 380 sorts the results obtained by the search means 360 based on the corresponding confidence calculated by the confidence calculating means 390.
  • In this way, the corresponding related words can be found for the search term, and the search is performed with the search term and its related words, which expands the scope of the search, further expands the search results, and improves the probability of retrieving the target documents; it prevents good results whose wording is not identical to the search term but semantically very close from being unrecallable.
  • a search system according to another embodiment of the present application is described below with reference to FIG.
  • FIG. 7 shows a schematic diagram of a search system in accordance with another embodiment of the present application.
  • In some embodiments, the search system 300 may further include a related word lexicon establishing device 310 and a related word confidence calculation model establishing device 350.
  • The related word lexicon establishing device 310 is connected to the related word storage device 320 and establishes the related word lexicon by the above method of mining related words.
  • The related word lexicon establishing device 310 of the embodiment shown in FIG. 7 is described below with reference to FIG. 8.
  • FIG. 8 is a schematic diagram of the related word lexicon establishing device 310 of the embodiment shown in FIG. 7.
  • The related word lexicon establishing device 310 may include a parallel sentence acquisition module 311, a word segmenter 313, a word alignment module 315, a co-occurrence frequency acquisition module 317, a related word determination module 319, and a context acquisition module 318.
  • The parallel sentence acquisition module 311 acquires, based on the large-scale user search behavior data, parallel sentence pairs that express the same meaning in different forms; the word segmenter 313 performs word segmentation on each parallel sentence pair; the word alignment module 315 performs word alignment on the segmented pairs to obtain first aligned word pairs; the co-occurrence frequency acquisition module 317 calculates the co-occurrence frequency of the first aligned word pairs; and the related word determination module 319 determines the first aligned word pairs whose co-occurrence frequency is higher than a predetermined threshold to be related words, forming the related word lexicon.
  • In this way, related words of higher relevance can be mined, the scope of search-term retrieval expanded, and the probability of finding better search results improved; related words of different similarity can also be obtained according to the predetermined threshold.
  • the related words include not only synonymous words of the search words (which may include strong synonyms and context synonyms), but also related words with wider coverage.
  • The word segmenter 313 is further configured to perform word segmentation on the search sentence to obtain the search terms.
  • the search sentence is segmented to obtain a plurality of search terms, and the search results related to the plurality of search terms are searched by the search method, thereby further expanding the search range.
  • the related word vocabulary establishing device 310 further includes a context obtaining module 318, configured to acquire a contextual context word of the related word.
  • In this way, the context of the related words is known; by judging whether the contexts of two related words are the same or similar, the relevance between them can be further assessed, which helps obtain related words of higher similarity.
  • The acquisition of the contextual words of the related words may be limited according to the length of the parallel sentences; alternatively, no length or other limitation may be imposed.
  • The length or the manner of acquiring contextual words may be defined differently according to the requirements on the relevance of the related words or other criteria.
  • FIG. 9 is a diagram showing the related word confidence calculation model establishing means 350 of the embodiment shown in FIG.
  • the related word confidence calculation model establishing means 350 may include a linear model filtering module 352 and a training module 354.
  • the linear model filtering module 352 is configured to filter large scale user search behavior data using a linear model to obtain a second aligned word pair.
  • The linear model may be a simple linear model.
  • The simple linear model may be fitted by simple linear regression on a small set (on the order of 10,000) of manually labeled word pairs, using the statistical features between those word pairs.
  • Because the manually labeled word pairs are few and the model is simple, the confidence output by the model is not high. The linear model filters the large-scale user search behavior data to obtain second aligned word pairs; the second aligned word pairs are poor word pairs, that is, erroneous pairs that should not appear in the current query context, or pairs that violate the user's intent.
  • The training module 354 is connected to the related word lexicon establishing device 310 and the linear model filtering module 352; taking the first aligned word pairs as positive samples and the second aligned word pairs as negative samples, it trains the positive and negative samples based on the GBDT algorithm to obtain the related word confidence calculation model.
  • the above-mentioned related word confidence calculation model may be a GBDT nonlinear regression model.
  • The confidence calculation device 390 of the embodiment of FIG. 7 may include a confidence calculation module 392 and a feature value extraction module 394.
  • The feature value extraction module 394 extracts the feature values between each search term and each corresponding related word; the confidence calculation module 392 uses the feature values as input to the confidence calculation model and calculates the confidence based on that model.
  • FIG. 11 is a schematic diagram of the feature value extraction module 394 of the embodiment shown in FIG. 10.
  • The feature value extraction module 394 may include at least one of a relevance degree information acquisition unit 3941, a replaceability degree information acquisition unit 3942, a co-occurrence relationship information acquisition unit 3943, a language model score information acquisition unit 3944, a weight value information acquisition unit 3945, and a language model acquisition unit 3946.
  • the relevance level information obtaining unit 3941 is configured to acquire the relevance level information.
  • the relevance level information is used to measure the degree of correlation between each search term and each corresponding related word.
  • the replaceability degree information obtaining unit 3942 is configured to acquire the replaceability degree information.
  • the degree of replaceability information is used to measure the degree of substitutability between a search term and a related word in the context of the related word.
  • the co-occurrence relationship information obtaining unit 3943 is configured to acquire co-occurrence relationship information. Among them, the co-occurrence relationship information is used to measure the co-occurrence relationship between the search terms.
  • The language model score information acquisition unit 3944 is configured to acquire language model score information, which is used to indicate the language model score of the search sentence before and after the related word replaces the search term.
  • the weight value information obtaining unit 3945 is configured to obtain weight value information. Wherein, the weight value information is used to indicate the weight of the related word.
  • the feature value extraction module 394 may further include a language model acquisition unit 3946.
  • the language model obtaining unit 3946 is configured to acquire the above language model by training the N-gram language model based on the large-scale user search behavior data.
  • the sorting device 380 sorts the results obtained by searching the search terms and the corresponding related words according to the corresponding confidence information by the sorting model.
  • the above sorting model may be a quick sorting model sorted according to an existing quick sorting algorithm.
  • The sorting device 380 may further perform an initial ranking of the search resources according to the search sentence and the search resource page information, using the ranking model.
  • The initial ranking is a general retrieval process; it can also be constrained by setting a retrieval threshold, so that only search results reaching a predetermined score enter the re-ranking. When the initial retrieval results are numerous, this reduces the re-ranking workload; this two-stage ranking can also be used when the user requests that only highly accurate search results be displayed.
  • Searching with related words not only covers high-frequency synonyms but also attends to low- and medium-frequency related words; especially when search resources are relatively scarce, searching with related words retrieves information to the greatest extent.
  • In this way, the corresponding related words can be found for the search terms and used together with them for retrieval, which expands the scope of the search and the search results; it prevents good results whose wording is not identical to the search terms but semantically very close from being unrecallable.
  • the method according to the present application can also be embodied as a computer program product comprising a computer readable medium having stored thereon a computer program for performing the functions described above in the method of the present application .
  • the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
  • the computing device can be implemented as various types of computer devices, such as desktops, portable computers, tablets, smart phones, personal data assistants (PDAs), smart wearable devices, or other types of computer devices, but is not limited to any particular form.
  • the computer can include a processing module 1100, a storage subsystem 1200, an input device 1300, a display 1400, a network interface 1500, and a bus 1600.
  • the processing module 1100 can be a multi-core processor or multiple processors.
  • the processing module 1100 can include a general purpose main processor and one or more special coprocessors, such as a graphics processing unit (GPU), a digital signal processor (DSP), and the like.
  • processor 1100 can be implemented using custom circuitry, such as an application specific integrated circuit (ASIC) or field programmable gate arrays (FPGA).
  • In other embodiments, the processing module 1100 may be similar circuitry that executes instructions stored on itself.
  • the processing module 1100 can execute executable instructions stored on the storage subsystem 1200.
  • Storage subsystem 1200 can include various types of storage units, such as system memory, read only memory (ROM), and persistent storage.
  • the ROM can store static data or instructions required by the processing module 1100 or other modules of the computer.
  • the persistent storage device can be a readable and writable storage device.
  • the persistent storage device may be a non-volatile storage device that does not lose stored instructions and data even after the computer is powered off.
  • Some embodiments employ a mass storage device (e.g., a magnetic or optical disk, or flash memory) as the persistent storage device.
  • the persistent storage device can be a removable storage device (eg, a floppy disk, an optical drive).
  • the system memory can be a readable and writable storage device or a volatile read/write storage device, such as dynamic random access memory.
  • System memory can store instructions and data that some or all of the processors need at runtime.
  • storage subsystem 1200 can include any combination of computer readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read only memory), and magnetic disks and/or optical disks can also be employed.
  • Storage subsystem 1200 can include removable storage devices that are readable and/or writable, such as a compact disc (CD), a read-only digital versatile disc (e.g., a DVD-ROM or dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density disc, a flash card (such as an SD card, mini-SD card, or Micro-SD card), a magnetic floppy disk, and so on.
  • the computer readable storage medium does not include a carrier wave and an instantaneous electronic signal transmitted by wireless or wire.
  • the storage subsystem 1200 can store one or more software programs that can be executed by the processing module 1100 or resource files that need to be invoked.
  • The resource files can include third-party libraries, including but not limited to audio libraries, video libraries, 2D graphics libraries, and 3D graphics libraries.
  • the user interface can be provided by one or more user input devices 1300, display 1400, and/or one or more other user output devices.
  • Input device 1300 can include any device through which a user inputs signals to the computer; the computer interprets the signals as particular user requests or information.
  • For example, a web address may be entered through a keyboard, and the user interface displays the webpage content corresponding to the entered address.
  • Input device 1300 can include some or all of keyboard buttons, a touch screen, a mouse or other pointing device, a scroll wheel, a click wheel, a dial, a button, a switch, a keypad, a microphone, and the like.
  • The display 1400 can display computer-generated images and can include various types of image devices, such as cathode ray tubes (CRTs), liquid crystal displays (LCDs), light-emitting diodes (LEDs, including organic light-emitting diodes (OLEDs)), projection systems, and the like, together with supporting electronics (such as DACs, ADCs, and signal processors). In some embodiments, other user output devices, such as a signal light, a speaker, a tactile sensor, or a printer, may be provided in addition to or in place of the display 1400.
  • the user interface can be provided through a graphical user interface.
  • Certain areas of the display 1400 define some visual graphical elements as interactive objects or control objects that the user selects through the input device 1300.
  • The user can operate the user input device 1300 to move a pointer to a specified location on the screen, input a URL, and control the display 1400 to show the corresponding webpage content.
  • A touch device that can identify user gestures can be used as an input device; it may, but need not, be associated with a touch array on the display 1400.
  • Network interface 1500 provides sound and/or data communication functionality to the computer.
  • Network interface 1500 can include a radio frequency transceiver for voice and/or data communication (e.g., using cellular telephone technology such as 3G, 4G, or EDGE, or Wi-Fi data network technology), a GPS receiver module, and/or other modules.
  • The network interface 1500 can provide additional Wi-Fi or an alternative wireless interface.
  • Network interface 1500 may be a combination of hardware (eg, antennas, modems, codecs, and other analog and/or digital signal processing circuits) and software modules.
  • Bus 1600 can include various system, peripheral, and chipset buses that connect the various components within the computer.
  • bus 1600 connects processing device 1100 to storage subsystem 1200, and may also connect input device 1300 and display 1400.
  • Bus 1600 can also cause a computer to interface with the network via network interface 1500.
  • the computer can be part of multiple networked computer devices. Any or all of the components of the computer can be used in concert in embodiments of the present invention.
  • Some embodiments include electronic components, such as a microprocessor, a memory that stores computer instructions and data in a computer readable storage medium. Many of the features described in the Detailed Description section can be implemented by the method steps of executing computer instructions stored on a computer readable storage medium. When these computer instructions are executed, the computer processing unit performs various functions of the instructions.
  • The program instructions or computer code may be machine code, such as that produced by a compiler, or files containing higher-level language code that a computer, electronic component, or microprocessor executes using an interpreter.
  • The computer described here is illustrative.
  • the computer may have other functions not specifically described (eg, mobile call, GPS, power management, one or more cameras, various connection ports or accessories for connecting external devices, etc.).
  • The specific functional modules of the computer are described herein for convenience of description; this does not imply a specific physical configuration of the functional components, and the functional modules need not correspond one-to-one with physical modules.
  • the module can be configured to perform various operations, such as by programming or setting up appropriate control circuitry, and the module may be reconfigured according to initial settings.
  • Embodiments of the invention may be implemented in a variety of devices, including electronic devices, through the use of a combination of hardware and software.
  • Each block of the flowcharts or block diagrams can represent a module, a program segment, or a portion of code that includes one or more executable instructions.
  • The functions noted in the blocks may also occur in an order different from that shown in the drawings; for example, two consecutive blocks may be executed substantially in parallel, or sometimes in the reverse order, depending on the functionality involved.
  • each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts can be implemented in a dedicated hardware-based system that performs the specified function or operation. Or it can be implemented by a combination of dedicated hardware and computer instructions.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method for mining related words, comprising: acquiring, based on large-scale user search behavior data, parallel sentence pairs that express the same meaning in different forms (S110); performing word segmentation on each parallel sentence pair (S120); performing word alignment on the segmented parallel sentence pairs to obtain first aligned word pairs (S130); calculating the co-occurrence frequency of the first aligned word pairs (S140); and determining first aligned word pairs whose co-occurrence frequency is higher than a predetermined threshold to be related words (S150). In this way, the mining method can discover related words of higher relevance, expand the scope of search-term retrieval, and increase the probability of finding better search results. A search method and a search system are also disclosed.

Description

Method for mining related words, search method, and search system

Technical Field

The present application relates to the field of information retrieval, and in particular to a method for mining related words, a search method, and a search system.

Background

A search engine is a necessary feature that a website provides for the convenience of its users, and is also an effective tool for studying the behavior of website users. Efficient on-site search lets users find target information quickly and accurately, effectively solves user problems, promotes product/service sales more effectively, and, through in-depth analysis of visitors' search behavior, is of great value for formulating more effective online marketing strategies.

When a user searches with a search engine, the user enters search keywords on the search engine's search page, and the search engine retrieves and returns the results. A typical search engine searches directly with the keywords entered by the user (the original words), or with synonyms of the search terms.

However, searching with the original search terms or their synonyms yields limited results. There are often good results whose wording is not identical to the search terms but is semantically closely related to them, and the pages containing such results cannot be recalled.

Summary of the Invention

The technical problem to be solved by the present application is that a traditional search engine retrieves only limited results through original words or synonyms; the application provides a method for mining related words, a search method, and a search system.
根据本申请的一个方面,提供了一种挖掘相关词的方法。
一种挖掘相关词的方法,包括:
基于大规模用户搜索行为数据获取采用不同表述形式来表达相同含义的平行句对;
对每组所述平行句对进行分词处理;
对所述分词处理后的平行句对进行词对齐处理,以获取第一对齐词 对;
计算所述第一对齐词对的共现频率;
将共现频率高于预定阈值的所述第一对齐词对确定为相关词。
这样,通过该挖掘相关词方法,可以挖掘出更高相关度的相关词,也可以扩大检索词搜索的范围,提高找到更好的搜索结果的概率。
Preferably, the step of obtaining parallel sentence pairs comprises:
filtering out sentence pairs with different meanings according to the literal similarity of the two sentences.
In this way, sentence pairs with different meanings are filtered out by literal similarity, yielding parallel sentence pairs that express the same meaning in different wording.
Preferably, the method further comprises recording the context words of the related words.
By recording a related word's context and judging whether the contexts of two related words are identical or similar, the degree of relevance between related words can be judged further.
Preferably, the word alignment comprises rule-based word alignment and/or statistical word alignment.
Preferably, the rule-based word alignment comprises at least one of: alignment of literally identical words, alignment of literally partially identical words, or alignment of adjacent words.
In this way, related words with different degrees of relevance can be mined.
Preferably, the statistical word alignment is performed with the GIZA++ tool.
Preferably, the method further comprises:
filtering the large-scale user search behavior data with a linear model to obtain second aligned word pairs;
obtaining statistical features that reflect the degree of relevance between the related words;
taking the first aligned word pairs as positive samples and the second aligned word pairs as negative samples, and, based on the statistical features, training the positive and negative samples with a gradient boosting decision tree (GBDT) algorithm to obtain the related-word confidence computation model.
In this way, a related-word confidence computation model is built, with which the degree of relevance between related words can be distinguished.
Preferably, the related-word confidence computation model is a GBDT nonlinear regression model.
According to another aspect of the present application, a search method is also disclosed.
A search method, comprising the following steps:
obtaining the related words of a search term based on a related-word lexicon;
computing the confidence between the search term and each of the related words based on a confidence computation model;
ranking the results retrieved with the search term and the related words according to the corresponding confidences.
In this way, the search method finds, for a search term, its corresponding related words, broadening the scope of the search and enlarging the set of results; it prevents good results whose wording is not identical to the search term, yet semantically closely related to it, from being impossible to recall.
Preferably, the related-word lexicon is built with the above method of mining related words.
With the above mining method, related words of higher relevance can be discovered, the search scope can be broadened, and the probability of finding better search results is increased.
Preferably, the method further comprises performing word segmentation on a query sentence to obtain the search terms.
When a user enters a query sentence, segmenting the sentence yields several search terms, and the method then retrieves results related to all of these search terms, further broadening the scope of the search.
Preferably, the step of computing the confidence between the search term and each of the related words based on the confidence computation model comprises:
obtaining the feature values between each search term and each of its corresponding related words;
taking the feature values as inputs to the confidence computation model and computing the confidences with that model.
Preferably, the feature values comprise:
relevance information, which measures the degree of relevance between each search term and each corresponding related word; and/or
replaceability information, which measures the degree to which the search term can be replaced by the related word within the related word's context; and/or
co-occurrence information, which measures the co-occurrence relation between the search terms; and/or
language model score information, which shows the language model scores of the query sentence before and after the related word replaces the search term; and/or
weight information, which represents the weight of the related word.
Preferably, the relevance information comprises a first translation probability P1 and/or a second translation probability P2:
P1 = count1(A, A′) / count1(A, ·), P2 = count1(A, A′) / count1(·, A′);
count1(A, ·) = ∑j count1(A, wj), count1(·, A′) = ∑i count1(wi, A′);
where the search term A and the related word A′ form a first word pair (A, A′); count1(A, A′) is the number of times the first word pair (A, A′) is aligned in the parallel sentence pairs; count1(A, ·) is the total number of times the search term A is aligned in the parallel sentence pairs; count1(·, A′) is the total number of times the related word A′ is aligned in the parallel sentence pairs; wj is the j-th of all words aligned with A in the parallel sentence pairs; wi is the i-th of all words aligned with A′ in the parallel sentence pairs; count1(A, wj) is the number of times A is aligned with wj in the parallel sentence pairs; count1(wi, A′) is the number of times wi is aligned with A′ in the parallel sentence pairs; and i and j are natural numbers.
Preferably, the replaceability information comprises a first replaceability score(D, Q) and/or a second replaceability score(D, Q′);
score(D, Q) = ∑i=1..n [ f(qi, D)·(k1 + 1) ] / [ f(qi, D) + k1·(1 − b + b·|D|/avgdl) ];
score(D, Q′) = ∑j=1..m [ f(q′j, D)·(k1 + 1) ] / [ f(q′j, D) + k1·(1 − b + b·|D|/avgdl) ];
where the search term A and the related word A′ form a first word pair (A, A′),
all context words of A and A′ together serve as a document D, |D| being the length of D,
Q is the query sentence, qi is the i-th search term of Q, and n is the total number of search terms in Q,
Q′ is the combination of the m search terms near A, m < n, and q′j is the j-th search term of the combination Q′,
avgdl is the average length of the documents formed by the contexts of all related words of A,
k1 is a first constant and b is a second constant,
f(qi, D) is the frequency of qi in the document D,
f(q′j, D) is the frequency of q′j in the document D.
Preferably, the co-occurrence information comprises first co-occurrence information and/or second co-occurrence information obtained from the pointwise mutual information (PMI) co-occurrence index, where
PMI(A, B) = log[ count2(A, B)·count2(·, ·) / ( count2(A, ·)·count2(·, B) ) ];
count2(A, ·) = ∑j count2(A, wj);
count2(·, B) = ∑i count2(wi, B);
count2(·, ·) = ∑i,j count2(wi, wj);
count2(A, ·) is the total number of times the search term A co-occurs with other search terms in the retrieval resources; count2(·, B) is the total number of times the search term B co-occurs with other search terms in the retrieval resources; count2(A, B) is the number of times the two search terms A and B co-occur in the retrieval resources; wj is the j-th of all words co-occurring with A in the retrieval resources; wi is the i-th of all words co-occurring with B in the retrieval resources; count2(A, wj) is the number of times A and wj co-occur in the retrieval resources; count2(wi, B) is the number of times wi and B co-occur in the retrieval resources; count2(wi, wj) is the number of times wi and wj co-occur in the retrieval resources; and i and j are natural numbers;
the first co-occurrence information is the average PMI between the search term and the other words in the query sentence;
the second co-occurrence information is the average PMI between the related word and the other words in the query sentence.
Preferably, the method further comprises training an N-gram language model on the large-scale user search behavior data to obtain the language model.
Preferably, the step of ranking the results retrieved with the search term and the related words according to the corresponding confidences is performed by a ranking model that ranks those results according to the corresponding confidences.
Preferably, the method further comprises a step in which the ranking model performs an initial ranking of the retrieval resources according to the query sentence and retrieval resource page information.
Preferably, the retrieval resources are web page resources and/or document resources.
According to another aspect of the present application, a search system is also provided.
A search system, comprising:
a related-word lexicon storage device;
a related-word acquisition device, configured to obtain the related words of a search term based on the related-word lexicon stored in the related-word lexicon storage device;
a confidence computation device, configured to compute the confidence between the search term and each of the related words based on a related-word confidence computation model;
a ranking device, configured to rank the results retrieved with the search term and the related words according to the corresponding confidences.
Preferably, the search system further comprises a related-word lexicon building device, configured to build the related-word lexicon and comprising:
a parallel sentence acquisition module, configured to obtain, based on large-scale user search behavior data, parallel sentence pairs that express the same meaning in different forms;
a word segmenter, configured to perform word segmentation on each of the parallel sentence pairs;
a word alignment module, configured to perform word alignment on the segmented parallel sentence pairs to obtain first aligned word pairs;
a co-occurrence frequency acquisition module, configured to compute the co-occurrence frequency of the first aligned word pairs;
a related-word determination module, configured to determine the first aligned word pairs whose co-occurrence frequency is above a predetermined threshold as related words.
Preferably, the related-word lexicon building device further comprises:
a context acquisition module, configured to obtain the context words of the related words.
Preferably, the search system further comprises a related-word confidence computation model building device, configured to build the related-word confidence computation model and comprising:
a linear model filtering module, configured to filter the large-scale user search behavior data with a linear model to obtain second aligned word pairs;
a training module, configured to take the first aligned word pairs as positive samples and the second aligned word pairs as negative samples and train them with the GBDT algorithm to obtain the related-word confidence computation model.
Preferably, the related-word confidence computation model is a GBDT nonlinear regression model.
Preferably, the word segmenter is further configured to perform word segmentation on a query sentence to obtain the search terms.
Preferably, the confidence computation device comprises:
a feature value extraction module, configured to extract the feature values between each search term and each of its corresponding related words;
a confidence computation module, configured to take the feature values as inputs to the related-word confidence computation model and compute the confidences with that model.
Preferably, the feature value extraction module comprises:
a relevance information acquisition unit, configured to obtain relevance information, which measures the degree of relevance between each search term and each corresponding related word; and/or
a replaceability information acquisition unit, configured to obtain replaceability information, which measures the degree to which the search term can be replaced by the related word within the related word's context; and/or
a co-occurrence information acquisition unit, configured to obtain co-occurrence information, which measures the co-occurrence relation between the search terms; and/or
a language model score information acquisition unit, configured to obtain language model score information, which shows the language model scores of the query sentence before and after the related word replaces the search term; and/or
a weight information acquisition unit, configured to obtain weight information, which represents the weight of the related word.
Preferably, the feature value extraction module further comprises:
a language model acquisition unit, configured to train an N-gram language model on the large-scale user search behavior data to obtain the language model.
Preferably, the ranking device ranks, by a ranking model, the results retrieved with the search term and the related words according to the corresponding confidences.
Preferably, the ranking device is further configured to perform, by the ranking model, an initial ranking of the retrieval resources according to the query sentence and retrieval resource page information.
According to another aspect of the present invention, a computing device is also provided, comprising:
one or more processors;
a memory;
wherein the memory is configured to execute:
obtaining the related words of a search term based on a related-word lexicon;
computing the confidence between the search term and each of the related words based on a confidence computation model;
ranking the results retrieved with the search term and the related words according to the corresponding confidences.
According to another aspect of the present invention, a computer-readable recording medium on which a program for executing the above method is recorded is also provided.
In this way, with the above method of mining related words, search method, and search system, the related words of a search term can be found and used together with the search term for retrieval, broadening the scope of the search and enlarging the set of results; good results whose wording is not identical to the search term, yet semantically closely related to it, are prevented from being impossible to recall.
Brief Description of the Drawings
The above and other objects, features, and advantages of the present disclosure will become more apparent from the following more detailed description of exemplary embodiments of the present disclosure taken in conjunction with the accompanying drawings, in which like reference numerals generally denote like parts.
Fig. 1 shows a flowchart of a method of mining related words according to an embodiment of the present application;
Fig. 2 shows a flowchart of a method of mining related words according to another embodiment of the present application;
Fig. 3 shows a flowchart of a search method according to an embodiment of the present application;
Fig. 4 shows a flowchart of a search method according to another embodiment of the present application;
Fig. 5 shows a flowchart of step S240 of the embodiment shown in Fig. 4;
Fig. 6 shows a schematic diagram of a search system according to an embodiment of the present application;
Fig. 7 shows a schematic diagram of a search system according to another embodiment of the present application;
Fig. 8 shows a schematic diagram of the related-word lexicon building device 310 of the embodiment shown in Fig. 7;
Fig. 9 shows a schematic diagram of the related-word confidence computation model building device 350 of the embodiment shown in Fig. 7;
Fig. 10 shows a schematic diagram of the confidence computation device 390 of the embodiment shown in Fig. 7;
Fig. 11 shows a schematic diagram of the feature value extraction module 394 of the embodiment shown in Fig. 10;
Fig. 12 shows a structural block diagram of a computing device according to an embodiment of the present invention.
Detailed Description
Preferred embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show preferred embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here; rather, these embodiments are provided so that the disclosure will be thorough and complete and will fully convey its scope to those skilled in the art.
A method of mining related words according to an embodiment of the present application, used to obtain related words from large-scale user search behavior data, is described below with reference to Fig. 1.
Fig. 1 shows a flowchart of the method of mining related words according to an embodiment of the present application.
In step S110, parallel sentence pairs that express the same meaning in different forms are obtained based on large-scale user search behavior data.
Based on the large-scale user search behavior data, parallel sentence pairs are obtained from data such as users' query logs and/or query-title logs. A parallel sentence pair is a pair of sentences that express the same meaning in different forms, for example "婴儿颈部长有红斑痣" ("the infant has a red mark on its neck") and "宝宝脖子有斑痣" ("the baby has a mark on its neck").
In the above large-scale user search behavior data, e.g. in users' query logs and/or query-title logs, there are many sentence pairs that have the same meaning but are expressed differently. Further, sentence pairs with different meanings can be filtered out according to the literal similarity of the two sentences, as in the sketch below.
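The patent does not fix a concrete literal-similarity measure for this filtering step. The following minimal sketch, in which the measure (character-level Jaccard) and both thresholds are assumptions for illustration only, keeps candidate pairs that overlap enough to plausibly share meaning and discards unrelated pairs:

```python
def char_jaccard(s1: str, s2: str) -> float:
    """Character-level Jaccard similarity of two sentences."""
    a, b = set(s1), set(s2)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def filter_parallel_pairs(candidates, lo=0.2, hi=0.9):
    """Keep pairs similar enough to plausibly share meaning, but not
    near-identical restatements; lo/hi are illustrative thresholds."""
    return [(s1, s2) for s1, s2 in candidates
            if lo <= char_jaccard(s1, s2) < hi]

candidates = [("婴儿颈部长有红斑痣", "宝宝脖子有斑痣"),   # same meaning
              ("什么水果好吃", "明天天气怎么样")]          # different meaning
print(filter_parallel_pairs(candidates))  # only the first pair survives
```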
In step S120, word segmentation is performed on each parallel sentence pair.
Each sentence in each of the above parallel sentence pairs is segmented with a word segmentation technique, for example as follows.
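As an illustration only, the open-source jieba tokenizer can segment both sentences of a pair; the patent does not name a particular segmenter, and the exact token boundaries below depend on jieba's dictionary:

```python
import jieba  # pip install jieba; an example segmenter, not mandated by the patent

pair = ("婴儿颈部长有红斑痣", "宝宝脖子有斑痣")
segmented = tuple(jieba.lcut(s) for s in pair)
print(segmented)
# e.g. (['婴儿', '颈部', '长', '有', '红斑', '痣'], ['宝宝', '脖子', '有', '斑痣'])
```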
In step S130, word alignment is performed on the segmented parallel sentence pairs to obtain first aligned word pairs.
Word alignment finds the words that express the same meaning.
The word alignment may comprise rule-based word alignment and/or statistical word alignment. The rule-based word alignment comprises at least one of: alignment of literally identical words, alignment of literally partially identical words, or alignment of adjacent words. The statistical word alignment is performed with the GIZA++ tool. A toy illustration of the rule-based variant follows.
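A minimal sketch of the rule-based variant; the character-overlap rule used here for "literally partially identical" words is an assumption for illustration, and a production system would add position constraints and the statistical (e.g. GIZA++) alignment:

```python
def rule_align(src_tokens, tgt_tokens):
    """Toy rule-based aligner: align literally identical tokens first,
    then tokens sharing at least one character (partial overlap)."""
    pairs = []
    for s in src_tokens:
        for t in tgt_tokens:
            if s == t:
                pairs.append((s, t, "identical"))
            elif set(s) & set(t):
                pairs.append((s, t, "partial"))
    return pairs

print(rule_align(["婴儿", "颈部", "有", "红斑", "痣"],
                 ["宝宝", "脖子", "有", "斑痣"]))
# [('有', '有', 'identical'), ('红斑', '斑痣', 'partial'), ('痣', '斑痣', 'partial')]
```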
In step S140, the co-occurrence frequency of the first aligned word pairs is computed.
The co-occurrence frequency may be evaluated by a first translation probability P1 and/or a second translation probability P2, computed as follows:
P1 = count1(A, A′) / count1(A, ·), P2 = count1(A, A′) / count1(·, A′);
count1(A, ·) = ∑j count1(A, wj), count1(·, A′) = ∑i count1(wi, A′);
where the search term A and the related word A′ form a first word pair (A, A′); count1(A, A′) is the number of times the first word pair (A, A′) is aligned in the parallel sentence pairs; count1(A, ·) is the total number of times A is aligned in the parallel sentence pairs; count1(·, A′) is the total number of times A′ is aligned in the parallel sentence pairs; wj is the j-th of all words aligned with A in the parallel sentence pairs; wi is the i-th of all words aligned with A′ in the parallel sentence pairs; count1(A, wj) is the number of times A is aligned with wj; count1(wi, A′) is the number of times wi is aligned with A′; and i and j are natural numbers.
It will be understood that the value of count1(A, A′) is independent of the order of A and A′, i.e. count1(A, A′) and count1(A′, A) are identical.
P1 is the proportion of the times the query word A is aligned with the related word A′ among the total times A is aligned; P2 is the proportion of the times A is aligned with A′ among the total times A′ is aligned.
Here, the alignment count is the number of times two words are aligned across many different parallel sentence pairs, while the co-occurrence count is the number of times two words appear together in the same corpus.
In step S150, the first aligned word pairs whose co-occurrence frequency is above a predetermined threshold are determined as related words.
The predetermined threshold can be set to different levels according to the required degree of relevance between related words. In one embodiment, the predetermined threshold may be 1.0×10^-99.
In this way, the mining method can discover related words of higher relevance, further broaden the scope of the search, and increase the probability of finding better search results. Moreover, related words of different degrees of similarity can be obtained with different predetermined thresholds. A minimal sketch of steps S140-S150 follows.
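A minimal sketch of steps S140-S150, assuming toy alignment counts; the threshold value and the rule for combining P1 and P2 are illustrative (the patent only requires the frequency to exceed a predetermined threshold):

```python
from collections import Counter

# Toy aligned word pairs harvested from many parallel sentence pairs;
# real counts come from large-scale user search behavior data.
aligned = [("宝宝", "婴儿")] * 8 + [("宝宝", "宝贝")] * 2 + [("脖子", "颈部")] * 5

count1 = Counter(aligned)                    # count1(A, A')
row = Counter(a for a, _ in aligned)         # count1(A, ·)
col = Counter(b for _, b in aligned)         # count1(·, A')

THRESHOLD = 0.5                              # illustrative value only
for (a, ap), c in count1.items():
    p1 = c / row[a]                          # share of A's alignments
    p2 = c / col[ap]                         # share of A''s alignments
    if min(p1, p2) > THRESHOLD:              # combining rule is an assumption
        print(f"related: {a} -> {ap}  P1={p1:.2f} P2={p2:.2f}")
```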
A method of mining related words according to another embodiment of the present application, used to obtain related words from large-scale user search behavior data, is described below with reference to Fig. 2.
Referring to Fig. 2, the method of mining related words further comprises the following steps:
In step S160, the context words of the related words are recorded.
By recording the context words of a related word, the context of the related word becomes known. Judging whether the contexts of two related words are identical or similar allows the degree of relevance between them to be judged further, which helps to obtain related words of higher similarity.
The acquisition of the context words of the related words may be limited to different lengths according to the lengths of the parallel sentences. In this embodiment, since parallel sentence pairs are generally not very long, no limit on length or of any other form need be imposed. In other embodiments, the length, or the way the context words are obtained, may be limited differently according to the required degree of relevance or other criteria.
In step S170, the large-scale user search behavior data is filtered with a linear model to obtain second aligned word pairs.
The linear model may be a simple linear model. Further, the simple linear model may be a linear model fitted with a simple linear regression model, using the statistical features between a small number (on the order of tens of thousands) of manually labeled word pairs; the fitting may refer to linear-regression fitting and modeling.
Since the number of manually labeled word pairs is small and the model simple, the confidence scores output by this model are not high. The large-scale user search behavior data is filtered with this linear model, and the results whose confidence scores are below a specific threshold are taken as the second aligned word pairs; because the word pairs filtered out by this model have low confidence scores, the second aligned word pairs serve as poor word pairs. Specifically, the specific threshold is close to or below zero.
The "manually labeled" word pairs are pairs formed, under a given query, by an original word of the query and a related word, labeled as suitable or not suitable as a related word. For example, under the query "八个月宝宝吃什么?" ("what should an eight-month-old baby eat?"), in the pair 宝宝->婴儿 ("baby"->"infant"), 宝宝 is the original word and 婴儿 the related word; this pair can be labeled 1, meaning it can serve as a related word. Under the same query, 宝宝->宝贝 ("baby"->"darling") is labeled 0, meaning it cannot serve as a related word.
A poor word pair is a wrong word pair that should not appear in the context of the current query, in other words a pair that violates the user's intent. For example, for the query 宝宝吃奶 ("the baby feeds"), obtaining 宝宝喝奶 ("the baby drinks milk") is a good word pair (i.e. a related word labeled 1); but turning 什么水果好吃 ("which fruit tastes good") into 什么水果好喝 ("which fruit is good to drink") is a wrong pair with shifted meaning, i.e. a poor word pair. Poor word pairs can take many more forms and are not limited to this example. A sketch of this filtering step follows.
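A sketch of the filtering step, assuming a tiny stand-in for the manually labeled set and two hypothetical per-pair features; the cutoff "close to or below zero" follows the text, but its exact value here is illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in for the small manually labeled set: one feature row per word
# pair, label 1 = usable related word in its query context, 0 = not.
X_labeled = np.array([[0.9, 0.8], [0.7, 0.6], [0.2, 0.1], [0.1, 0.3]])
y_labeled = np.array([1.0, 1.0, 0.0, 0.0])

linear = LinearRegression().fit(X_labeled, y_labeled)  # simple linear fit

# Score unlabeled pairs mined from the logs; pairs scoring near or
# below zero are kept as "poor" pairs, i.e. negative training samples.
X_mined = np.array([[0.85, 0.90], [0.12, 0.05], [0.05, 0.15]])
scores = linear.predict(X_mined)
negatives = X_mined[scores < 0.05]   # cutoff close to zero (illustrative)
print(scores)
print(negatives)                      # the two low-scoring pairs are kept
```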
In step S180, statistical features that reflect the degree of relevance between the related words are obtained.
The statistical features are context-word statistical verification features of whether a word pair is suitable in the current query context; they include at least one of the relevance information, replaceability information, co-occurrence information, language model score information, and weight information between each two related words.
In step S190, taking the first aligned word pairs as positive samples and the second aligned word pairs as negative samples, and based on the statistical features, the positive and negative samples are trained with a gradient boosting decision tree (GBDT) algorithm to obtain the related-word confidence computation model.
The related-word confidence computation model may be a GBDT nonlinear regression model, e.g. as sketched below.
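A minimal sketch of the training step under assumed toy features; the five feature columns and the hyperparameters are illustrative, not taken from the patent:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Positive samples: first aligned word pairs; negative samples: second
# aligned word pairs. Each row holds the statistical features of one
# word pair, e.g. [P1, P2, replaceability, avg PMI, LM score delta].
X_pos = rng.random((200, 5)) * 0.5 + 0.5     # toy: positives score high
X_neg = rng.random((200, 5)) * 0.5           # toy: negatives score low
X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(200), np.zeros(200)])

# GBDT nonlinear regression model, as named in the text; the regression
# output serves directly as the related-word confidence.
gbdt = GradientBoostingRegressor(n_estimators=100, max_depth=3).fit(X, y)
print(gbdt.predict([[0.9, 0.8, 0.7, 0.6, 0.5]]))  # high input -> high confidence
```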
A search method according to an embodiment of the present application is described below with reference to Fig. 3.
Fig. 3 shows a flowchart of the search method according to an embodiment of the present application.
A search method comprises the following steps:
In step S220, the related words of a search term are obtained based on a related-word lexicon.
The related-word lexicon is built with the above method of mining related words. In this way, all related words of the search term can be obtained; they include not only the synonyms of the search term (which may include strong synonyms and contextual synonyms) but also related words of broader coverage. With the above mining method, related words of higher relevance can be mined, further broadening the scope of the search and increasing the probability of finding better search results.
In step S240, the confidence between the search term and each related word is computed based on a confidence computation model.
In step S260, the results retrieved with the search term and its related words are ranked according to the corresponding confidences.
In this step, a ranking model ranks the results retrieved with the search term and the related words according to the corresponding confidences. The ranking model may be a quick-sort ranking model based on an existing quick-sort algorithm; it may also be another existing model.
Searching by related words not only covers the high-frequency synonyms but also pays more attention to medium- and low-frequency related words; especially when retrieval resources are scarce, searching with related words obtains the retrieval information to the greatest extent.
In this way, the search method finds the related words of a search term and retrieves with both, broadening the search scope and enlarging the result set; it prevents good results whose wording is not identical to the search term, yet semantically closely related to it, from being impossible to recall.
In another embodiment, before step S260, the ranking model may perform an initial ranking of the retrieval resources according to the query sentence and retrieval resource page information.
The initial ranking is an ordinary retrieval process; a retrieval threshold can also be set, so that only retrieval results reaching a predetermined score enter the re-ranking of step S260. In this way, when the initial results are numerous, the amount of re-ranking is reduced. This dual ranking can also be used when the user asks to display only highly precise results.
The retrieval resources may be web page resources and/or document resources; a retrieval resource may be a piece of text, the title of a web page, a query sentence, or a longer document. An end-to-end sketch of this search flow is given below.
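An end-to-end sketch of the search flow, with tiny in-memory stand-ins for the related-word lexicon, the confidence model, and the index; all data, the cutoff, and the score-accumulation rule are assumptions for illustration:

```python
# In-memory stand-ins for the lexicon, confidence model and index.
LEXICON = {"宝宝": ["婴儿", "宝贝"], "脖子": ["颈部"]}
CONFIDENCE = {("宝宝", "婴儿"): 0.9, ("宝宝", "宝贝"): 0.3, ("脖子", "颈部"): 0.8}
DOCS = {1: ["婴儿", "颈部", "红斑"], 2: ["宝宝", "玩具"], 3: ["天气"]}

def search(terms, cutoff=0.5):
    # Expand each search term with related words above the cutoff.
    weights = {}
    for t in terms:
        weights[t] = 1.0                        # original term, full weight
        for r in LEXICON.get(t, []):
            c = CONFIDENCE.get((t, r), 0.0)
            if c > cutoff:
                weights[r] = max(weights.get(r, 0.0), c)
    # Retrieve documents containing any expanded term and rank them by
    # the summed confidences of the terms they match.
    scored = {d: sum(w for t, w in weights.items() if t in toks)
              for d, toks in DOCS.items()}
    return sorted((d for d in scored if scored[d] > 0),
                  key=lambda d: scored[d], reverse=True)

print(search(["宝宝", "脖子"]))  # [1, 2]: doc 1 matches 婴儿+颈部 (0.9+0.8)
```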
A search method according to another embodiment of the present application is described below with reference to Fig. 4.
Fig. 4 shows a flowchart of the search method according to another embodiment of the present application.
The search method may further comprise step S210 before step S220. In step S210, a query sentence is segmented to obtain the search terms.
When a user enters a query sentence, segmenting the sentence yields several search terms, and the method then retrieves results related to all of these terms, further broadening the scope of the search. The segmentation may include Chinese and/or English word segmentation, as well as segmentation for other languages, using any of the various existing segmentation techniques.
The flowchart of step S240 of the embodiment shown in Fig. 4 is described below with reference to Fig. 5.
Fig. 5 shows a flowchart of step S240 of the embodiment shown in Fig. 4.
In step S242, the feature values between each search term and each of its corresponding related words are obtained.
The content retrieved differs from search to search, and so do the search terms; the feature values therefore differ as well.
In step S244, the feature values are taken as inputs to the confidence computation model, and the confidences are computed with that model.
The feature values may include at least one of relevance information, replaceability information, co-occurrence information, language model score information, and weight information.
The relevance information measures the degree of relevance between each search term and each corresponding related word.
The relevance information may comprise the first translation probability P1 and/or the second translation probability P2, expressed by the following formulas:
P1 = count1(A, A′) / count1(A, ·), P2 = count1(A, A′) / count1(·, A′);
count1(A, ·) = ∑j count1(A, wj), count1(·, A′) = ∑i count1(wi, A′);
where the search term A and the related word A′ form a first word pair (A, A′); count1(A, A′) is the number of times the first word pair (A, A′) is aligned in the parallel sentence pairs; count1(A, ·) is the total number of times A is aligned; count1(·, A′) is the total number of times A′ is aligned; wj is the j-th of all words aligned with A; wi is the i-th of all words aligned with A′; count1(A, wj) is the number of times A is aligned with wj; count1(wi, A′) is the number of times wi is aligned with A′; and i and j are natural numbers.
It will be understood that the value of count1(A, A′) is independent of the order of A and A′, i.e. count1(A, A′) and count1(A′, A) are identical.
The replaceability information measures the degree to which the search term can be replaced by the related word within the related word's context.
The replaceability information comprises a first replaceability score(D, Q) and/or a second replaceability score(D, Q′), expressed by the following formulas:
score(D, Q) = ∑i=1..n [ f(qi, D)·(k1 + 1) ] / [ f(qi, D) + k1·(1 − b + b·|D|/avgdl) ];
score(D, Q′) = ∑j=1..m [ f(q′j, D)·(k1 + 1) ] / [ f(q′j, D) + k1·(1 − b + b·|D|/avgdl) ];
where the search term A and the related word A′ form a first word pair (A, A′),
the context words of A together with the context words of A′ serve as a document D, |D| being the length of D; the context words of A and A′ are the same in most sentence pairs, but occasional differences are also recorded as part of the overall context;
Q is the query sentence, qi is the i-th search term of Q, and n is the total number of search terms in Q,
Q′ is the combination of the m search terms near A, m < n, and q′j is the j-th search term of the combination Q′,
avgdl is the average length of the documents formed by the contexts of all related words of A,
k1 is a first constant and b is a second constant,
f(qi, D) is the frequency of qi in the document D, and
f(q′j, D) is the frequency of q′j in the document D. A sketch of this computation follows.
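A direct transcription of score(D, Q) as reconstructed above (a BM25-style sum; the patent's symbol list defines no IDF factor, so none appears here). The constants k1, b, avgdl and the toy data are illustrative:

```python
def replaceability(doc_tokens, query_terms, k1=1.2, b=0.75, avgdl=10.0):
    """score(D, Q): sum over query terms of
    f(q, D)*(k1+1) / (f(q, D) + k1*(1 - b + b*|D|/avgdl))."""
    dl = len(doc_tokens)
    score = 0.0
    for q in query_terms:
        f = doc_tokens.count(q)                 # f(q_i, D)
        score += f * (k1 + 1) / (f + k1 * (1 - b + b * dl / avgdl))
    return score

# D: merged context words of the search term A and the related word A'.
D = ["八个月", "吃", "什么", "喝", "奶", "吃", "辅食"]
print(replaceability(D, ["吃", "什么", "奶"]))   # ~3.78 with these constants
```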
The co-occurrence information, which measures the co-occurrence relation between search terms, consists of statistics on two search terms appearing together in a query corpus (retrieval resources: web pages and/or documents).
The co-occurrence information comprises first co-occurrence information and/or second co-occurrence information obtained from the co-occurrence index PMI:
PMI(A, B) = log[ count2(A, B)·count2(·, ·) / ( count2(A, ·)·count2(·, B) ) ];
count2(A, ·) = ∑j count2(A, wj);
count2(·, B) = ∑i count2(wi, B);
count2(·, ·) = ∑i,j count2(wi, wj);
count2(A, ·) is the total number of times the search term A co-occurs with other search terms in the retrieval resources; count2(·, B) is the total number of times the search term B co-occurs with other search terms in the retrieval resources; count2(A, B) is the number of times the two search terms A and B co-occur in the retrieval resources; wj is the j-th of all words co-occurring with A in the retrieval resources; wi is the i-th of all words co-occurring with B in the retrieval resources; count2(A, wj) is the number of times A and wj co-occur; count2(wi, B) is the number of times wi and B co-occur; count2(wi, wj) is the number of times wi and wj co-occur; and i and j are natural numbers.
It will be understood that the value of count2(A, B) is independent of the order of A and B, i.e. count2(A, B) and count2(B, A) are identical.
The first co-occurrence information is the average PMI between the search term and the other words in the query sentence.
The second co-occurrence information is the average PMI between the related word and the other search terms in the query sentence (excluding the search term corresponding to that related word).
When computing the first co-occurrence information, the above formula can be used directly and the average taken; when computing the second co-occurrence information, the search term A in the above formula is replaced by its related word A′. A minimal PMI sketch follows.
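A minimal PMI sketch over a toy query corpus; counting each unordered word pair once per query is an assumption about how the co-occurrences are tallied:

```python
import math
from collections import Counter
from itertools import combinations

corpus = [["宝宝", "吃", "奶"], ["宝宝", "喝", "奶"], ["宝宝", "睡觉"],
          ["婴儿", "吃", "奶"], ["什么", "水果", "好吃"]]   # toy queries

pair_counts = Counter()                       # count2(A, B), order-free
for q in corpus:
    for a, b in combinations(sorted(set(q)), 2):
        pair_counts[(a, b)] += 1

row = Counter()                               # count2(x, ·)
for (a, b), n in pair_counts.items():
    row[a] += n
    row[b] += n
total = sum(pair_counts.values())             # count2(·, ·)

def pmi(a, b):
    c = pair_counts[tuple(sorted((a, b)))]
    if c == 0:
        return float("-inf")                  # never co-occur
    return math.log(c * total / (row[a] * row[b]))

print(pmi("宝宝", "奶"), pmi("宝宝", "好吃"))   # finite vs. -inf
```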
The language model score information shows the language model scores of the query sentence before and after the related word replaces the search term. The method further comprises training an N-gram language model on the large-scale user search behavior data to obtain the language model, e.g. as sketched below.
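A minimal sketch of the language-model feature, assuming a tiny add-one-smoothed bigram model in place of the full N-gram model trained on large-scale data:

```python
import math
from collections import Counter

corpus = [["宝宝", "吃", "奶"], ["婴儿", "吃", "奶"], ["宝宝", "喝", "奶"],
          ["什么", "水果", "好吃"]]             # toy segmented queries

bigrams, unigrams = Counter(), Counter()
for q in corpus:
    toks = ["<s>"] + q + ["</s>"]
    unigrams.update(toks[:-1])                 # history counts
    bigrams.update(zip(toks, toks[1:]))
V = len(unigrams) + 1                          # crude vocabulary size

def logprob(query):
    """Add-one-smoothed bigram log-probability of a segmented query."""
    toks = ["<s>"] + query + ["</s>"]
    return sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + V))
               for a, b in zip(toks, toks[1:]))

# Feature: LM scores before and after the related word replaces the term.
print(logprob(["什么", "水果", "好吃"]))        # before substitution
print(logprob(["什么", "水果", "好喝"]))        # after: the score drops
```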
The weight information represents the weight of the related word.
The computation of the above statistical features is likewise used in step S180 to compute the statistical features between each pair of related words.
A search system according to an embodiment of the present application is described below with reference to Fig. 6.
Fig. 6 shows a schematic diagram of the search system according to an embodiment of the present application.
A search system 300 comprises a related-word lexicon storage device 320, a related-word acquisition device 340, a search device 360, a ranking device 380, and a confidence computation device 390.
The related-word acquisition device 340 is connected to the related-word lexicon storage device 320 and obtains the related words of a search term based on it. The search device 360 retrieves based on the search term and its related words. The confidence computation device 390 computes, based on the confidence computation model, the confidence between the search term and each of its corresponding related words. The ranking device 380 ranks the results retrieved by the search device 360 according to the corresponding confidences computed by the confidence computation device 390.
In this way, the search system 300 finds the related words of a search term and retrieves with both, broadening the search scope, further enlarging the result set, and increasing the probability of retrieving the target document; it prevents good results whose wording is not identical to the search term, yet semantically closely related to it, from being impossible to recall.
A search system according to another embodiment of the present application is described below with reference to Fig. 7.
Fig. 7 shows a schematic diagram of the search system according to another embodiment of the present application.
The search system 300 may further comprise a related-word lexicon building device 310 and a related-word confidence computation model building device 350.
The related-word lexicon building device 310 is connected to the related-word lexicon storage device 320 and builds the related-word lexicon with the above method of mining related words.
The related-word lexicon building device 310 of the embodiment shown in Fig. 7, used to build the related-word lexicon, is described with reference to Fig. 8.
Fig. 8 shows a schematic diagram of the related-word lexicon building device 310 of the embodiment shown in Fig. 7.
The related-word lexicon building device 310 may comprise: a parallel sentence acquisition module 311, a word segmenter 313, a word alignment module 315, a co-occurrence frequency acquisition module 317, a related-word determination module 319, and a context acquisition module 318.
The parallel sentence acquisition module 311 obtains, based on large-scale user search behavior data, parallel sentence pairs that express the same meaning in different forms; the word segmenter 313 segments each parallel sentence pair; the word alignment module 315 performs word alignment on the segmented pairs to obtain first aligned word pairs; the co-occurrence frequency acquisition module 317 computes the co-occurrence frequency of the first aligned word pairs; and the related-word determination module 319 determines the pairs whose co-occurrence frequency is above a predetermined threshold as related words, which make up the related-word lexicon.
In this way, the related-word lexicon building device 310 can mine related words of higher relevance, broaden the search scope, increase the probability of finding better search results, and obtain related words of different degrees of similarity under different predetermined thresholds.
By building the related-word lexicon, all related words of a search term can be obtained; they include not only the synonyms of the search term (which may include strong synonyms and contextual synonyms) but also related words of broader coverage. With the above mining method, related words of higher relevance can be mined, the search scope broadened, and the probability of finding better search results increased.
In addition, the word segmenter 313 is also used to segment a query sentence to obtain the search terms. When a user enters a query sentence, segmenting it yields several search terms, and retrieval then covers results related to all of them, further broadening the search scope.
Further, the related-word lexicon building device 310 further comprises a context acquisition module 318, configured to obtain the context words of the related words.
By recording the context words of a related word, the context of the related word becomes known. Judging whether the contexts of two related words are identical or similar allows the degree of relevance between them to be judged further, helping to obtain related words of higher similarity.
The acquisition of the context words may be limited to different lengths according to the lengths of the parallel sentences. In this embodiment, since parallel sentence pairs are generally not very long, no limit on length or of any other form need be imposed. In other embodiments, the length, or the way the context words are obtained, may be limited differently according to the required degree of relevance or other criteria.
The related-word confidence computation model building device 350 of the embodiment shown in Fig. 7 is described below with reference to Fig. 9.
Fig. 9 shows a schematic diagram of the related-word confidence computation model building device 350 of the embodiment shown in Fig. 7.
The related-word confidence computation model building device 350 may comprise a linear model filtering module 352 and a training module 354.
The linear model filtering module 352 filters the large-scale user search behavior data with a linear model to obtain second aligned word pairs.
The linear model may be a simple linear model; further, it may be a linear model fitted with a simple linear regression model, using the statistical features between a small number (on the order of tens of thousands) of manually labeled word pairs. Since the labeled pairs are few and the model simple, the confidence output by this model is not precise. Filtering the large-scale user search behavior data with this linear model yields the second aligned word pairs, which are poor word pairs: wrong pairs that should not appear in the context of the current query, in other words pairs that violate the user's intent. For example, for the query 宝宝吃奶, obtaining 宝宝喝奶 is a good word pair; but turning 什么水果好吃 into 什么水果好喝 is a wrong pair with shifted meaning, i.e. a poor word pair.
The training module 354 is connected to the related-word lexicon building device 310 and the linear model filtering module 352; taking the first aligned word pairs as positive samples and the second aligned word pairs as negative samples, it trains them with the GBDT algorithm to obtain the related-word confidence computation model.
The related-word confidence computation model may be a GBDT nonlinear regression model.
Referring to Fig. 10, the confidence computation device 390 of the embodiment shown in Fig. 7 may comprise a confidence computation module 392 and a feature value extraction module 394.
The feature value extraction module 394 extracts the feature values between each search term and each of its corresponding related words, and the confidence computation module 392 takes these feature values as inputs to the confidence computation model and computes the confidences with that model.
Fig. 11 shows a schematic diagram of the feature value extraction module 394 of the embodiment shown in Fig. 10.
The feature value extraction module 394 may further comprise at least one of: a relevance information acquisition unit 3941, a replaceability information acquisition unit 3942, a co-occurrence information acquisition unit 3943, a language model score information acquisition unit 3944, a weight information acquisition unit 3945, and a language model acquisition unit 3946.
The relevance information acquisition unit 3941 obtains relevance information, which measures the degree of relevance between each search term and each corresponding related word.
The replaceability information acquisition unit 3942 obtains replaceability information, which measures the degree to which the search term can be replaced by the related word within the related word's context.
The co-occurrence information acquisition unit 3943 obtains co-occurrence information, which measures the co-occurrence relation between search terms.
The language model score information acquisition unit 3944 obtains language model score information, which shows the language model scores of the query sentence before and after the related word replaces the search term.
The weight information acquisition unit 3945 obtains weight information, which represents the weight of the related word.
Further, the feature value extraction module 394 may further comprise a language model acquisition unit 3946, which trains an N-gram language model on the large-scale user search behavior data to obtain the language model.
The ranking device 380 ranks, by a ranking model, the results retrieved with the search term and its corresponding related words according to the corresponding confidence information. The ranking model may be a quick-sort ranking model based on an existing quick-sort algorithm.
Further, the ranking device 380 may also perform, by the ranking model, an initial ranking of the retrieval resources according to the query sentence and retrieval resource page information. The initial ranking is an ordinary search process; a retrieval threshold can also be set so that only results reaching a predetermined score enter re-ranking. When the initial results are numerous, this reduces the re-ranking workload. The dual ranking can also be used when the user asks to display only highly precise results.
Searching by related words not only covers high-frequency synonyms but also pays more attention to medium- and low-frequency search terms; especially when retrieval resources are scarce, searching with related words obtains the retrieval information to the greatest extent. In this way, the search system finds the related words of a search term and retrieves with both, broadening the search scope and enlarging the result set; it prevents good results whose wording is not identical to the search term, yet semantically closely related to it, from being impossible to recall.
The method of mining related words, the search method, and the search system according to the present application have been described above in detail with reference to the accompanying drawings.
Furthermore, the method according to the present application may also be implemented as a computer program product comprising a computer-readable medium on which a computer program for performing the above-defined functions of the method of the present application is stored. Those skilled in the art will also understand that the various exemplary logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or a combination of both.
Fig. 12 shows a structural block diagram of a computing device according to an embodiment of the present invention. The computing device can be implemented as various types of computer apparatus, such as a desktop, a portable computer, a tablet, a smartphone, a personal digital assistant (PDA), a smart wearable device, or other types of computer apparatus, and is not limited to any particular form. The computer may include a processing module 1100, a storage subsystem 1200, an input device 1300, a display 1400, a network interface 1500, and a bus 1600.
The processing module 1100 may be a multi-core processor or may contain multiple processors. In some embodiments, the processing module 1100 may contain a general-purpose main processor and one or more special coprocessors, such as a graphics processing unit (GPU) or a digital signal processor (DSP). In some embodiments, the processor 1100 may be implemented with custom circuits, such as an application specific integrated circuit (ASIC) or field programmable gate arrays (FPGA). In some implementations, the processing module 1100 may be a similar circuit that executes instructions stored on itself; in other implementations, it may execute instructions stored in the storage subsystem 1200.
The storage subsystem 1200 may include various types of storage units, such as system memory, read-only memory (ROM), and permanent storage. The ROM may store static data or instructions needed by the processing module 1100 or other modules of the computer. The permanent storage may be a readable-writable storage device, a non-volatile device that does not lose its stored instructions and data even when the computer is powered off. In some implementations, the permanent storage is a large-capacity storage device (e.g., a magnetic or optical disk, or flash memory); in other implementations, it may be a removable storage device (e.g., a floppy disk or optical drive). The system memory may be a readable-writable storage device or a volatile readable-writable storage device, such as dynamic random access memory, and may store some or all of the instructions and data the processor needs at runtime. In addition, the storage subsystem 1200 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash, programmable read-only memory); magnetic and/or optical disks may also be used. In some implementations, the storage subsystem 1200 may include readable and/or writable removable storage devices, such as compact discs (CD), read-only digital versatile discs (e.g., DVD-ROM, dual-layer DVD-ROM), read-only Blu-ray discs, ultra-density discs, flash cards (e.g., SD, mini SD, Micro-SD cards), magnetic floppy disks, and so on. Computer-readable storage media do not include carrier waves or transient electronic signals transmitted wirelessly or over wires. In some implementations, the storage subsystem 1200 can store one or more software programs executable by the processing module 1100, or resource files to be called; the resource files may contain third-party libraries, including but not limited to audio, video, 2D graphics, and 3D graphics libraries.
The user interface may be provided by one or more user input devices 1300, the display 1400, and/or one or more other user output devices. The input device 1300 may include devices through which the user inputs signals to the computer, which the computer can interpret as containing particular user requests or information. In some implementations, a web address can be entered into the user interface through a keyboard, and the web page content corresponding to that address is displayed. In some implementations, the input device 1300 may include some or all of keyboard buttons, a touch screen, a mouse or other pointing device, a scroll wheel, a click wheel, a dial, buttons, switches, a keypad, a microphone, and so on.
The display 1400 can display images generated by the computer and may include various types of image devices, such as a cathode ray tube (CRT), liquid crystal display (LCD), light-emitting diodes (LED) (including organic light-emitting diodes (OLED)), or projection systems, together with supporting electronics (e.g., DACs, ADCs, signal processors). In some implementations, other user output devices, such as indicator lights, speakers, tactile sensors, or printers, may be provided in addition to or instead of the display 1400.
In some implementations, the user interface may be provided through a graphical user interface. Certain regions of the display 1400 define visible graphical elements as interactive or control objects that the user selects through the input device 1300. For example, the user may operate the user input device 1300 to move to a designated position on the screen, enter a web address, and have the web page content corresponding to that address displayed on the display 1400. In some implementations, a touch device that recognizes user gestures may serve as the input device; such gestures may, but need not, be associated with the array on the display 1400.
The network interface 1500 provides voice and/or data communication functions for the computer. In some implementations, the network interface 1500 may include a radio-frequency transceiver for transmitting voice and/or data (e.g., using cellular technologies such as 3G, 4G, or EDGE, or data network technologies such as WiFi), a GPS receiver module, and/or other modules. In some implementations, the network interface 1500 may provide an additional wireless network connection or an alternative wireless interface. The network interface 1500 may be a combination of hardware (e.g., antennas, modems, codecs, and other analog and/or digital signal processing circuits) and software modules.
The bus 1600 may include various system, peripheral, and chip buses that connect the internal components of the computer. For example, the bus 1600 connects the processing module 1100 with the storage subsystem 1200, and may also connect the input device 1300 and the display 1400. The bus 1600 also allows the computer to connect to a network through the network interface 1500; the computer can then be part of multiple networked computer devices. Any or all components of the computer can be used in concert in embodiments of the present invention.
Some implementations include electronic components, such as a microprocessor and a memory storing computer instructions and data in a computer-readable storage medium. Many of the features described in the Detailed Description can be implemented as method steps that execute computer instructions stored on a computer-readable storage medium; when these instructions are executed, the computer's processing unit performs the various functions they specify. Program instructions or computer code may be machine code, for example code obtained by compiling other high-level languages using a computer, an electronic component, or a microprocessor with an interpreter.
It should be understood that the computer is schematic. The computer may have other functions not specifically described (e.g., mobile calls, GPS, power management, one or more cameras, various connection ports or accessories for connecting external devices, etc.). Further, the specific functional modules of the computer 1100 are described here for convenience of description and do not imply a specific physical configuration of the functional components; nor do these functional modules need to correspond one-to-one with physical modules. Modules can be configured to perform various operations, e.g., by programming or by setting up appropriate control circuits, and may also be reconfigured from their initial settings. Embodiments of the present invention can be implemented in various devices, including electronic devices, through a combination of hardware and software.
The flowcharts and block diagrams in the drawings show the possible architectures, functions, and operations of systems and methods according to multiple embodiments of the present application. In this regard, each block of the flowcharts or block diagrams can represent a module, program segment, or portion of code containing one or more executable instructions for implementing the specified logical functions. It should also be noted that in some alternative implementations, the functions noted in the blocks may occur in an order different from that shown in the drawings; for example, two consecutive blocks may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks therein, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The embodiments of the present application have been described above; the description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terms used here were chosen to best explain the principles of the embodiments, their practical application, or improvements to technologies on the market, or to enable other persons of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (33)

  1. A method of mining related words, comprising:
    obtaining, based on large-scale user search behavior data, parallel sentence pairs that express the same meaning in different forms;
    performing word segmentation on each of the parallel sentence pairs;
    performing word alignment on the segmented parallel sentence pairs to obtain first aligned word pairs;
    computing the co-occurrence frequency of the first aligned word pairs;
    when the co-occurrence frequency is above a predetermined threshold, determining the first aligned word pairs as related words.
  2. The method according to claim 1, wherein, before the step of performing word segmentation on each of the parallel sentence pairs, the method further comprises:
    filtering out sentence pairs with different meanings according to the literal similarity of the two sentences.
  3. The method according to claim 1, wherein, after determining the first aligned word pairs as related words, the method further comprises:
    recording the context words of the related words.
  4. The method according to claim 1, wherein
    the word alignment comprises rule-based word alignment and/or statistical word alignment;
    the rule-based word alignment comprises at least one of: alignment of literally identical words, alignment of literally partially identical words, or alignment of adjacent words;
    the statistical word alignment is performed with the GIZA++ tool.
  5. The method according to claim 1, further comprising:
    obtaining statistical features that reflect the degree of relevance between the related words;
    based on the statistical features, obtaining the related-word confidence computation model with a gradient boosting decision tree (GBDT) algorithm.
  6. The method according to claim 5, wherein the method further comprises:
    filtering the large-scale user search behavior data with a linear model to obtain second aligned word pairs;
    and the step of obtaining the related-word confidence computation model with the GBDT algorithm based on the statistical features comprises:
    taking the first aligned word pairs as positive samples and the second aligned word pairs as negative samples, and, based on the statistical features, training the positive and negative samples with the GBDT algorithm to obtain the related-word confidence computation model;
    wherein the related-word confidence computation model is a GBDT nonlinear regression model.
  7. A search method, comprising the following steps:
    obtaining the related words of a search term based on a related-word lexicon;
    computing the confidence between the search term and each of the related words based on a confidence computation model;
    ranking the results retrieved with the search term and the related words according to the corresponding confidences.
  8. The method according to claim 7, wherein, before the step of obtaining the related words of a search term based on a related-word lexicon, the method further comprises:
    obtaining, based on large-scale user search behavior data, parallel sentence pairs that express the same meaning in different forms;
    performing word segmentation on each of the parallel sentence pairs;
    performing word alignment on the segmented parallel sentence pairs to obtain first aligned word pairs;
    computing the co-occurrence frequency of the first aligned word pairs;
    when the co-occurrence frequency is above a predetermined threshold, determining the first aligned word pairs as related words.
  9. The method according to claim 8, wherein, before the step of computing the confidence between the search term and each of the related words based on the confidence computation model, the method further comprises:
    obtaining statistical features that reflect the degree of relevance between the related words;
    based on the statistical features, obtaining the related-word confidence computation model with a gradient boosting decision tree (GBDT) algorithm.
  10. The method according to claim 9, wherein the related-word lexicon is built with the method according to any one of claims 2, 3, 4, and 6.
  11. The method according to claim 9, wherein, before the step of obtaining the related words of a search term based on the related-word lexicon, the method further comprises:
    performing word segmentation on a query sentence to obtain the search terms.
  12. The method according to claim 11, wherein the step of computing the confidence between the search term and each of the related words based on the confidence computation model comprises:
    obtaining the feature values between each search term and each of its corresponding related words;
    taking the feature values as inputs to the confidence computation model and computing the confidences based on the confidence computation model.
  13. The method according to claim 12, wherein the feature values comprise:
    relevance information, which measures the degree of relevance between each search term and each corresponding related word; and/or
    replaceability information, which measures the degree to which the search term can be replaced by the related word within the related word's context; and/or
    co-occurrence information, which measures the co-occurrence relation between the search terms; and/or
    language model score information, which shows the language model scores of the query sentence before and after the related word replaces the search term; and/or
    weight information, which represents the weight of the related word.
  14. The method according to claim 13, wherein the relevance information comprises a first translation probability P1 and/or a second translation probability P2:
    P1 = count1(A, A′) / count1(A, ·), P2 = count1(A, A′) / count1(·, A′);
    count1(A, ·) = ∑j count1(A, wj), count1(·, A′) = ∑i count1(wi, A′);
    where the search term A and the related word A′ form a first word pair (A, A′); count1(A, A′) is the number of times the first word pair (A, A′) is aligned in the parallel sentence pairs; count1(A, ·) is the total number of times A is aligned in the parallel sentence pairs; count1(·, A′) is the total number of times A′ is aligned in the parallel sentence pairs; wj is the j-th of all words aligned with A in the parallel sentence pairs; wi is the i-th of all words aligned with A′ in the parallel sentence pairs; count1(A, wj) is the number of times A is aligned with wj; count1(wi, A′) is the number of times wi is aligned with A′; and i and j are natural numbers.
  15. The method according to claim 13, wherein the replaceability information comprises a first replaceability score(D, Q) and/or a second replaceability score(D, Q′);
    score(D, Q) = ∑i=1..n [ f(qi, D)·(k1 + 1) ] / [ f(qi, D) + k1·(1 − b + b·|D|/avgdl) ];
    score(D, Q′) = ∑j=1..m [ f(q′j, D)·(k1 + 1) ] / [ f(q′j, D) + k1·(1 − b + b·|D|/avgdl) ];
    where the search term A and the related word A′ form a first word pair (A, A′),
    all context words of A and A′ together serve as a document D, |D| being the length of D,
    Q is the query sentence, qi is the i-th search term of Q, and n is the total number of search terms in Q,
    Q′ is the combination of the m search terms near A, m < n, and q′j is the j-th search term of the combination Q′,
    avgdl is the average length of the documents formed by the contexts of all related words of A,
    k1 is a first constant and b is a second constant,
    f(qi, D) is the frequency of qi in the document D,
    f(q′j, D) is the frequency of q′j in the document D.
  16. The method according to claim 13, wherein the co-occurrence information comprises first co-occurrence information and/or second co-occurrence information obtained from the co-occurrence index PMI, where
    PMI(A, B) = log[ count2(A, B)·count2(·, ·) / ( count2(A, ·)·count2(·, B) ) ];
    count2(A, ·) = ∑j count2(A, wj);
    count2(·, B) = ∑i count2(wi, B);
    count2(·, ·) = ∑i,j count2(wi, wj);
    count2(A, ·) is the total number of times the search term A co-occurs with other search terms in the retrieval resources; count2(·, B) is the total number of times the search term B co-occurs with other search terms in the retrieval resources; count2(A, B) is the number of times the two search terms A and B co-occur in the retrieval resources; wj is the j-th of all words co-occurring with A in the retrieval resources; wi is the i-th of all words co-occurring with B in the retrieval resources; count2(A, wj) is the number of times A and wj co-occur; count2(wi, B) is the number of times wi and B co-occur; count2(wi, wj) is the number of times wi and wj co-occur; and i and j are natural numbers;
    the first co-occurrence information is the average PMI between the search term and the other words in the query sentence;
    the second co-occurrence information is the average PMI between the related word and the other words in the query sentence.
  17. The information retrieval method according to claim 13, further comprising training an N-gram language model on the large-scale user search behavior data to obtain the language model.
  18. The method according to claim 7 or 11, wherein the step of ranking the results retrieved with the search term and the related words according to the corresponding confidences is performed by a ranking model that ranks those results according to the corresponding confidences.
  19. The method according to claim 18, further comprising a step in which the ranking model performs an initial ranking of the retrieval resources according to the query sentence and retrieval resource page information.
  20. The method according to claim 19, wherein
    the retrieval resources are web page resources and/or document resources.
  21. A search system, comprising:
    a related-word lexicon storage device;
    a related-word acquisition device, configured to obtain the related words of a search term based on the related-word lexicon stored in the related-word lexicon storage device;
    a confidence computation device, configured to compute the confidence between the search term and each of the related words based on a related-word confidence computation model;
    a ranking device, configured to rank the results retrieved with the search term and the related words according to the corresponding confidences.
  22. The search system according to claim 21, further comprising a related-word lexicon building device, configured to build the related-word lexicon and comprising:
    a parallel sentence acquisition module, configured to obtain, based on large-scale user search behavior data, parallel sentence pairs that express the same meaning in different forms;
    a word segmenter, configured to perform word segmentation on each of the parallel sentence pairs;
    a word alignment module, configured to perform word alignment on the segmented parallel sentence pairs to obtain first aligned word pairs;
    a co-occurrence frequency acquisition module, configured to compute the co-occurrence frequency of the first aligned word pairs;
    a related-word determination module, configured to determine the first aligned word pairs whose co-occurrence frequency is above a predetermined threshold as related words.
  23. The search system according to claim 22, wherein the related-word lexicon building device further comprises:
    a context acquisition module, configured to obtain the context words of the related words.
  24. The search system according to claim 22, further comprising a related-word confidence computation model building device, configured to build the related-word confidence computation model and comprising:
    a statistical feature acquisition module, configured to obtain statistical features that reflect the degree of relevance between the related words;
    a training module, configured to obtain the related-word confidence computation model with the GBDT algorithm based on the statistical features.
  25. The search system according to claim 24, further comprising:
    a linear model filtering module, configured to filter the large-scale user search behavior data with a linear model to obtain second aligned word pairs;
    the training module being further configured to take the first aligned word pairs as positive samples and the second aligned word pairs as negative samples and train them with the GBDT algorithm to obtain the related-word confidence computation model;
    the related-word confidence computation model being a GBDT nonlinear regression model.
  26. The search system according to claim 22, wherein
    the word segmenter is further configured to perform word segmentation on a query sentence to obtain the search terms.
  27. The search system according to claim 26, wherein the confidence computation device comprises:
    a feature value extraction module, configured to extract the feature values between each search term and each of its corresponding related words;
    a confidence computation module, configured to take the feature values as inputs to the related-word confidence computation model and compute the confidences based on that model.
  28. The search system according to claim 27, wherein the feature value extraction module comprises:
    a relevance information acquisition unit, configured to obtain relevance information, which measures the degree of relevance between each search term and each corresponding related word; and/or
    a replaceability information acquisition unit, configured to obtain replaceability information, which measures the degree to which the search term can be replaced by the related word within the related word's context; and/or
    a co-occurrence information acquisition unit, configured to obtain co-occurrence information, which measures the co-occurrence relation between the search terms; and/or
    a language model score information acquisition unit, configured to obtain language model score information, which shows the language model scores of the query sentence before and after the related word replaces the search term; and/or
    a weight information acquisition unit, configured to obtain weight information, which represents the weight of the related word.
  29. The search system according to claim 28, wherein the feature value extraction module further comprises:
    a language model acquisition unit, configured to train an N-gram language model on the large-scale user search behavior data to obtain the language model.
  30. The search system according to claim 21, wherein the ranking device ranks, by a ranking model, the results retrieved with the search term and the related words according to the corresponding confidences.
  31. The search system according to claim 30, wherein the ranking device is further configured to perform, by the ranking model, an initial ranking of the retrieval resources according to the query sentence and retrieval resource page information.
  32. A computing device, comprising:
    one or more processors;
    a memory;
    wherein the memory is configured to execute:
    obtaining the related words of a search term based on a related-word lexicon;
    computing the confidence between the search term and each of the related words based on a confidence computation model;
    ranking the results retrieved with the search term and the related words according to the corresponding confidences.
  33. A computer-readable recording medium on which a program for executing the method according to any one of claims 1-20 is recorded.
PCT/CN2016/101700 2015-10-12 2016-10-10 挖掘相关词的方法、搜索方法、搜索系统 WO2017063538A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510657691.7A CN105279252B (zh) 2015-10-12 2015-10-12 挖掘相关词的方法、搜索方法、搜索系统
CN201510657691.7 2015-10-12

Publications (1)

Publication Number Publication Date
WO2017063538A1 true WO2017063538A1 (zh) 2017-04-20

Family

ID=55148266

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/101700 WO2017063538A1 (zh) 2015-10-12 2016-10-10 挖掘相关词的方法、搜索方法、搜索系统

Country Status (2)

Country Link
CN (1) CN105279252B (zh)
WO (1) WO2017063538A1 (zh)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107909088A (zh) * 2017-09-27 2018-04-13 百度在线网络技术(北京)有限公司 获取训练样本的方法、装置、设备和计算机存储介质
CN110795613A (zh) * 2018-07-17 2020-02-14 阿里巴巴集团控股有限公司 商品搜索方法、装置、系统及电子设备
CN110851584A (zh) * 2019-11-13 2020-02-28 成都华律网络服务有限公司 一种法律条文精准推荐系统和方法
CN111241319A (zh) * 2020-01-22 2020-06-05 北京搜狐新媒体信息技术有限公司 一种图文转换的方法及系统
CN111400577A (zh) * 2018-12-14 2020-07-10 阿里巴巴集团控股有限公司 一种搜索召回方法及装置
CN112835923A (zh) * 2021-02-02 2021-05-25 中国工商银行股份有限公司 一种相关检索方法、装置和设备
CN113496411A (zh) * 2020-03-18 2021-10-12 北京沃东天骏信息技术有限公司 页面推送方法、装置、系统、存储介质及电子设备
CN113553483A (zh) * 2021-07-02 2021-10-26 广联达科技股份有限公司 构件检索方法、装置、电子设备及可读存储介质
CN114969310A (zh) * 2022-06-07 2022-08-30 南京云问网络技术有限公司 一种面向多维数据的分段式检索排序系统设计方法

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105279252B (zh) * 2015-10-12 2017-12-26 广州神马移动信息科技有限公司 挖掘相关词的方法、搜索方法、搜索系统
CN105868847A (zh) * 2016-03-24 2016-08-17 车智互联(北京)科技有限公司 一种购物行为的预测方法及装置
CN105955993B (zh) * 2016-04-19 2020-09-25 北京百度网讯科技有限公司 搜索结果排序方法和装置
CN108205757B (zh) * 2016-12-19 2022-05-27 创新先进技术有限公司 电子支付业务合法性的校验方法和装置
CN107168958A (zh) * 2017-05-15 2017-09-15 北京搜狗科技发展有限公司 一种翻译方法及装置
CN108171570B (zh) * 2017-12-15 2021-04-27 北京星选科技有限公司 一种数据筛选方法、装置及终端
CN108733766B (zh) * 2018-04-17 2020-10-02 腾讯科技(深圳)有限公司 一种数据查询方法、装置和可读介质
CN110472251B (zh) * 2018-05-10 2023-05-30 腾讯科技(深圳)有限公司 翻译模型训练的方法、语句翻译的方法、设备及存储介质
CN109241356B (zh) * 2018-06-22 2023-04-14 腾讯科技(深圳)有限公司 一种数据处理方法、装置及存储介质
CN109298796B (zh) * 2018-07-24 2022-05-24 北京捷通华声科技股份有限公司 一种词联想方法及装置
CN109151599B (zh) * 2018-08-30 2020-10-09 百度在线网络技术(北京)有限公司 视频处理方法和装置
CN109885696A (zh) * 2019-02-01 2019-06-14 杭州晶一智能科技有限公司 一种基于自学习的外语联想词库构建方法
CN109918661B (zh) * 2019-03-04 2023-05-30 腾讯科技(深圳)有限公司 同义词获取方法及装置
CN110413737B (zh) * 2019-07-29 2022-10-14 腾讯科技(深圳)有限公司 一种同义词的确定方法、装置、服务器及可读存储介质
CN112199958A (zh) * 2020-09-30 2021-01-08 平安科技(深圳)有限公司 概念词序列生成方法、装置、计算机设备及存储介质
CN112541076B (zh) * 2020-11-09 2024-03-29 北京百度网讯科技有限公司 目标领域的扩充语料生成方法、装置和电子设备
CN112307198B (zh) * 2020-11-24 2024-03-12 腾讯科技(深圳)有限公司 一种单文本的摘要确定方法和相关装置
CN113609843B (zh) * 2021-10-12 2022-02-01 京华信息科技股份有限公司 一种基于梯度提升决策树的句词概率计算方法及系统

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6633868B1 (en) * 2000-07-28 2003-10-14 Shermann Loyall Min System and method for context-based document retrieval
CN101819578A (zh) * 2010-01-25 2010-09-01 青岛普加智能信息有限公司 检索方法、索引建立方法和装置及检索系统
CN102591862A (zh) * 2011-01-05 2012-07-18 华东师范大学 一种基于词共现的汉语实体关系提取的控制方法及装置
CN103942339A (zh) * 2014-05-08 2014-07-23 深圳市宜搜科技发展有限公司 同义词挖掘方法及装置
CN104239286A (zh) * 2013-06-24 2014-12-24 阿里巴巴集团控股有限公司 同义短语的挖掘方法和装置及搜索相关内容的方法和装置
CN105279252A (zh) * 2015-10-12 2016-01-27 广州神马移动信息科技有限公司 挖掘相关词的方法、搜索方法、搜索系统

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102033955B (zh) * 2010-12-24 2012-12-05 常华 扩展用户搜索结果的方法及服务器
CN103514150A (zh) * 2012-06-21 2014-01-15 富士通株式会社 识别具有组合型歧义的歧义词的方法和装置
CN104063454A (zh) * 2014-06-24 2014-09-24 北京奇虎科技有限公司 一种挖掘用户需求的搜索推送方法和装置

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6633868B1 (en) * 2000-07-28 2003-10-14 Shermann Loyall Min System and method for context-based document retrieval
CN101819578A (zh) * 2010-01-25 2010-09-01 青岛普加智能信息有限公司 检索方法、索引建立方法和装置及检索系统
CN102591862A (zh) * 2011-01-05 2012-07-18 华东师范大学 一种基于词共现的汉语实体关系提取的控制方法及装置
CN104239286A (zh) * 2013-06-24 2014-12-24 阿里巴巴集团控股有限公司 同义短语的挖掘方法和装置及搜索相关内容的方法和装置
CN103942339A (zh) * 2014-05-08 2014-07-23 深圳市宜搜科技发展有限公司 同义词挖掘方法及装置
CN105279252A (zh) * 2015-10-12 2016-01-27 广州神马移动信息科技有限公司 挖掘相关词的方法、搜索方法、搜索系统

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107909088A (zh) * 2017-09-27 2018-04-13 百度在线网络技术(北京)有限公司 获取训练样本的方法、装置、设备和计算机存储介质
CN107909088B (zh) * 2017-09-27 2022-06-28 百度在线网络技术(北京)有限公司 获取训练样本的方法、装置、设备和计算机存储介质
CN110795613B (zh) * 2018-07-17 2023-04-28 阿里巴巴集团控股有限公司 商品搜索方法、装置、系统及电子设备
CN110795613A (zh) * 2018-07-17 2020-02-14 阿里巴巴集团控股有限公司 商品搜索方法、装置、系统及电子设备
CN111400577A (zh) * 2018-12-14 2020-07-10 阿里巴巴集团控股有限公司 一种搜索召回方法及装置
CN111400577B (zh) * 2018-12-14 2023-06-30 阿里巴巴集团控股有限公司 一种搜索召回方法及装置
CN110851584A (zh) * 2019-11-13 2020-02-28 成都华律网络服务有限公司 一种法律条文精准推荐系统和方法
CN110851584B (zh) * 2019-11-13 2023-12-15 成都华律网络服务有限公司 一种法律条文精准推荐系统和方法
CN111241319A (zh) * 2020-01-22 2020-06-05 北京搜狐新媒体信息技术有限公司 一种图文转换的方法及系统
CN111241319B (zh) * 2020-01-22 2023-10-03 北京搜狐新媒体信息技术有限公司 一种图文转换的方法及系统
CN113496411A (zh) * 2020-03-18 2021-10-12 北京沃东天骏信息技术有限公司 页面推送方法、装置、系统、存储介质及电子设备
CN112835923A (zh) * 2021-02-02 2021-05-25 中国工商银行股份有限公司 一种相关检索方法、装置和设备
CN113553483A (zh) * 2021-07-02 2021-10-26 广联达科技股份有限公司 构件检索方法、装置、电子设备及可读存储介质
CN114969310A (zh) * 2022-06-07 2022-08-30 南京云问网络技术有限公司 一种面向多维数据的分段式检索排序系统设计方法
CN114969310B (zh) * 2022-06-07 2024-04-05 南京云问网络技术有限公司 一种面向多维数据的分段式检索排序系统设计方法

Also Published As

Publication number Publication date
CN105279252B (zh) 2017-12-26
CN105279252A (zh) 2016-01-27

Similar Documents

Publication Publication Date Title
WO2017063538A1 (zh) 挖掘相关词的方法、搜索方法、搜索系统
US11182445B2 (en) Method, apparatus, server, and storage medium for recalling for search
US11361243B2 (en) Recommending machine learning techniques, features, and feature relevance scores
US11216504B2 (en) Document recommendation method and device based on semantic tag
US10423652B2 (en) Knowledge graph entity reconciler
US11449767B2 (en) Method of building a sorting model, and application method and apparatus based on the model
US10586155B2 (en) Clarification of submitted questions in a question and answer system
AU2016277558B2 (en) Generating a semantic network based on semantic connections between subject-verb-object units
US9971769B2 (en) Method and system for providing translated result
CN108334490B (zh) 关键词提取方法以及关键词提取装置
US20150178273A1 (en) Unsupervised Relation Detection Model Training
US20220318275A1 (en) Search method, electronic device and storage medium
US9342561B2 (en) Creating and using titles in untitled documents to answer questions
US20130132381A1 (en) Tagging entities with descriptive phrases
US10943673B2 (en) Method and apparatus for medical data auto collection segmentation and analysis platform
US20200372117A1 (en) Proximity information retrieval boost method for medical knowledge question answering systems
CN105653701A (zh) 模型生成方法及装置、词语赋权方法及装置
US20190005028A1 (en) Systems, methods, and computer-readable medium for validation of idiomatic expressions
JP2022510818A (ja) 改良されたデータマッチングのためのデータレコードの字訳
US9633009B2 (en) Knowledge-rich automatic term disambiguation
CN113988157A (zh) 语义检索网络训练方法、装置、电子设备及存储介质
TWI640877B (zh) 語意分析裝置、方法及其電腦程式產品
US8046361B2 (en) System and method for classifying tags of content using a hyperlinked corpus of classified web pages
CN107239209B (zh) 一种拍照搜索方法、装置、终端及存储介质
US20190057401A1 (en) Identifying market-agnostic and market-specific search queries

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16854917

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16854917

Country of ref document: EP

Kind code of ref document: A1