CN105279252B - Method for mining related words, search method, and search system - Google Patents


Info

Publication number
CN105279252B
CN105279252B (application CN201510657691.7A)
Authority
CN
China
Prior art keywords
word, search, words, count, confidence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510657691.7A
Other languages
Chinese (zh)
Other versions
CN105279252A (en)
Inventor
韩增新
蒋冠军
董良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Guangzhou Shenma Mobile Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Shenma Mobile Information Technology Co Ltd filed Critical Guangzhou Shenma Mobile Information Technology Co Ltd
Priority to CN201510657691.7A priority Critical patent/CN105279252B/en
Publication of CN105279252A publication Critical patent/CN105279252A/en
Priority to PCT/CN2016/101700 priority patent/WO2017063538A1/en
Application granted granted Critical
Publication of CN105279252B publication Critical patent/CN105279252B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Abstract

The invention discloses a method for mining related words, comprising: acquiring, based on large-scale user search behavior data, parallel sentence pairs that express the same meaning in different forms; performing word segmentation on each group of parallel sentence pairs; performing word alignment on the segmented parallel sentence pairs to obtain first aligned word pairs; calculating the co-occurrence frequency of the first aligned word pairs; and determining first aligned word pairs whose co-occurrence frequency exceeds a predetermined threshold as related words. With this mining method, related words with a higher degree of correlation can be mined, the scope of the search can be expanded, and the probability of finding better search results is improved. The invention also discloses a search method and a search system.

Description

Method for mining related words, search method, and search system
Technical Field
The present invention relates to the field of information retrieval, and in particular, to a method for mining related words, a search method, and a search system.
Background
A search engine is a function provided in website construction for the convenience of users, and an effective tool for studying the behavior of site visitors. Efficient in-site retrieval lets users find target information quickly and accurately, which effectively solves user problems and promotes product and service sales; deep analysis of visitors' search behavior is also of great value for devising more effective network marketing strategies.
When a user searches with a search engine, a search keyword is entered through the engine's search page, and the engine retrieves and returns results. A typical search engine either searches directly with the keyword the user entered or additionally searches with synonyms of the search term.
However, when a search uses only the original term or its synonyms, the results are limited. There are often good results whose wording does not match the search terms but is semantically closely related to them, so the web pages containing those results cannot be recalled.
Disclosure of Invention
The invention aims to solve the technical problem that a traditional search engine, retrieving only through original words or synonyms, obtains limited results, and accordingly provides a method for mining related words, a search method, and a search system.
According to one aspect of the invention, a method of mining related words is provided.
A method of mining related words, comprising:
acquiring parallel sentence pairs expressing the same meaning by adopting different expression forms based on large-scale user search behavior data;
performing word segmentation on each group of parallel sentence pairs;
performing word alignment processing on the parallel sentence pairs subjected to word segmentation processing to obtain first aligned word pairs;
calculating a co-occurrence frequency of the first aligned word pair;
determining the first aligned word pair having a co-occurrence frequency above a predetermined threshold as a related word.
Therefore, by the method for mining the related words, the related words with higher relevance can be mined, the search range of the search words can be expanded, and the probability of finding better search results is improved.
Preferably, the step of obtaining parallel sentence pairs comprises:
filtering out, according to the literal similarity of the two sentences, candidate pairs that do not actually express the same meaning.
In this way, pairs whose meanings differ are discarded, and parallel sentence pairs that express the same meaning in different wording are retained.
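As an illustration only (not part of the claimed method), the literal-similarity filter can be sketched in Python; the character-level Jaccard measure and the threshold value are assumptions, since the patent does not specify them:

```python
def char_overlap(s1: str, s2: str) -> float:
    """Jaccard similarity over the character sets of two sentences."""
    a, b = set(s1), set(s2)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def filter_parallel_pairs(pairs, threshold=0.3):
    """Keep only candidate pairs literally similar enough to plausibly
    express the same meaning (threshold is a hypothetical value)."""
    return [(s1, s2) for s1, s2 in pairs if char_overlap(s1, s2) >= threshold]
```

For Chinese text, character-set overlap is a reasonable first cut because individual characters carry meaning; a production filter might also use edit distance or word-level overlap.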
Preferably, the method further comprises recording context words for said related words.
By recording the context of the related words, and judging whether the context of the two related words is the same or similar, the method is beneficial to further judging the correlation degree between the related words.
Preferably, the word alignment process comprises a rule-based word alignment process and/or a statistical word alignment process.
Preferably, the rule-based word alignment process includes at least one of exact literal match alignment, partial literal match alignment, or adjacent-word alignment.
Thus, related words with different degrees of relevance can be mined.
Preferably, the statistical word alignment process is performed using the GIZA++ tool.
Preferably, the method further comprises:
filtering the large-scale user searching behavior data by using a linear model to obtain a second alignment word pair;
acquiring statistical characteristics capable of reflecting the correlation degree between the related words;
and training on the positive and negative samples, with the first aligned word pairs as positive samples and the second aligned word pairs as negative samples, using a gradient boosting decision tree (GBDT) algorithm based on the statistical features, to obtain the related word confidence calculation model.
Thus, by establishing a related word confidence calculation model, the degree of correlation between related words can be distinguished through the model.
Preferably, the related word confidence calculation model is a GBDT nonlinear regression model.
According to another aspect of the invention, a search method is also disclosed.
A search method comprising the steps of:
acquiring related words of the search words based on a related word library;
calculating a confidence between the search word and each related word based on a confidence calculation model;
and sequencing results obtained by searching the search words and the related words according to the corresponding confidence degrees.
Therefore, with this search method, corresponding related words can be found for the search words, which expands the search scope and the result set, and prevents results that do not literally match the search words but are semantically very close to them from going unrecalled.
Preferably, the related word lexicon is established by the method for mining related words.
By the method for mining the related words, the related words with higher relevance can be mined, the search range of the search words can be expanded, and the probability of finding better search results is improved.
Preferably, the method further comprises performing word segmentation processing on the search sentence to obtain the search word.
When a user inputs a search sentence, the search sentence is segmented to obtain a plurality of search terms, so that search results related to the search terms are searched by the search method, and the search range is further expanded.
Preferably, the step of calculating the confidence between the search term and each of the related terms based on a confidence calculation model comprises:
obtaining a characteristic value between each search word and each corresponding related word;
and taking the characteristic value as an input of the confidence coefficient calculation model, and calculating the confidence coefficient based on the confidence coefficient calculation model.
Preferably, the characteristic values include:
correlation degree information, used to measure the degree of correlation between each search word and each corresponding related word; and/or
replaceability information, used to measure the degree to which the related word can replace the search word in the context of the related word; and/or
co-occurrence relation information, used to measure the co-occurrence relations among the search words; and/or
language model score information, used to record the language model scores of the search sentence before and after the related word replaces the search word; and/or
weight value information, used to represent the weight of the related word.
Preferably, the correlation degree information includes a first translation probability P1 and/or a second translation probability P2:
P1 = count1(A, A′) / count1(A, ·), P2 = count1(A, A′) / count1(·, A′);
count1(A, ·) = Σ_j count1(A, w_j), count1(·, A′) = Σ_i count1(w_i, A′);
where the search word A and the related word A′ form a first word pair (A, A′); count1(A, A′) is the number of times the first word pair (A, A′) is aligned in the parallel sentence pairs; count1(A, ·) is the total number of times the search word A is aligned in the parallel sentence pairs; count1(·, A′) is the total number of times the related word A′ is aligned in the parallel sentence pairs; w_j is the j-th of all words aligned with the search word A, and w_i is the i-th of all words aligned with the related word A′; count1(A, w_j) is the number of times A and w_j are aligned, and count1(w_i, A′) is the number of times w_i and A′ are aligned; i and j are natural numbers.
Preferably, the replaceability information includes a first replaceability score score(D, Q) and/or a second replaceability score score(D, Q′), computed in the BM25 form:
score(D, Q) = Σ_{i=1..n} f(q_i, D)·(k1 + 1) / ( f(q_i, D) + k1·(1 - b + b·|D|/avgdl) );
score(D, Q′) = Σ_{j=1..m} f(q′_j, D)·(k1 + 1) / ( f(q′_j, D) + k1·(1 - b + b·|D|/avgdl) );
where the search word A and the related word A′ form a first word pair (A, A′);
all the context words of the search word A and the related word A′ together form a document D, and |D| is the length of D;
Q is the search sentence, q_i is the i-th search word of Q, and n is the total number of search words in Q;
Q′ is a combination of m search words, m < n, and q′_j is the j-th search word of the combination Q′;
avgdl is the average length of the documents formed by the contexts of all related words of the word A;
k1 is a first constant and b is a second constant;
f(q_i, D) is the frequency of occurrence of q_i in the document D, and f(q′_j, D) is the frequency of occurrence of q′_j in the document D.
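The replaceability score above has the shape of the Okapi BM25 ranking function. A minimal sketch, assuming the IDF-free variant implied by the variables listed (k1, b, avgdl); the default constants are conventional BM25 values, not taken from the patent:

```python
def bm25_score(query_terms, doc_terms, avgdl, k1=1.2, b=0.75):
    """BM25-style replaceability score of document D (the context words
    of a related word) against the search terms q_i."""
    dl = len(doc_terms)  # |D|
    score = 0.0
    for q in query_terms:
        f = doc_terms.count(q)  # f(q_i, D)
        if f == 0:
            continue
        score += (f * (k1 + 1)) / (f + k1 * (1 - b + b * dl / avgdl))
    return score
```

The first score uses the whole search sentence Q; the second uses a combination Q′ of m < n of its words, with the same function.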
Preferably, the co-occurrence relation information includes first co-occurrence relation information and/or second co-occurrence relation information obtained from the co-occurrence index PMI, where
PMI(A, B) = log( (count2(A, B) · count2(·, ·)) / (count2(A, ·) · count2(·, B)) );
count2(A, ·) = Σ_j count2(A, w_j);
count2(·, B) = Σ_i count2(w_i, B);
count2(·, ·) = Σ_{i,j} count2(w_i, w_j);
count2(A, ·) is the total number of times word A appears together with other words in the search resources; count2(·, B) is the total number of times word B appears together with other words in the search resources; count2(A, B) is the number of times the two words A and B appear together in the search resources; w_j is the j-th of all words that co-occur with A, and w_i is the i-th of all words that co-occur with B; count2(A, w_j) is the number of times A and w_j co-occur, count2(w_i, B) is the number of times w_i and B co-occur, and count2(w_i, w_j) is the number of times w_i and w_j co-occur; i and j are natural numbers.
The first co-occurrence relation information is the average PMI of the search word with the other words in the search sentence;
the second co-occurrence relation information is the average PMI of the related word with the other words in the search sentence.
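A minimal sketch of the PMI index from the counts defined above (the count2 values are passed in directly; the natural log base is an assumption, as the patent does not state one):

```python
import math

def pmi(count_ab, count_a_dot, count_dot_b, count_dot_dot):
    """PMI(A, B) = log( P(A, B) / (P(A) * P(B)) ), with each probability
    estimated from the co-occurrence counts count2."""
    p_ab = count_ab / count_dot_dot
    p_a = count_a_dot / count_dot_dot
    p_b = count_dot_b / count_dot_dot
    return math.log(p_ab / (p_a * p_b))
```

Averaging pmi(...) over the other words of the search sentence yields the first and second co-occurrence relation information.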
Preferably, the method further comprises training an N-gram language model based on the large-scale user search behavior data to obtain the language model.
Preferably, the results obtained by searching with the search word and the related words are ranked according to the corresponding confidences by a ranking model.
Preferably, the method further comprises a step in which the ranking model preliminarily ranks the retrieval resources according to the search sentence and the page information of the retrieval resources.
Preferably, the retrieval resource is a webpage resource and/or a document resource.
According to another aspect of the invention, a search system is also provided.
A search system, comprising:
a related vocabulary storage device;
a related word acquiring device for acquiring related words of the search word based on the related word library stored in the related word library storage device;
confidence calculation means for calculating a confidence between the search word and each of the related words based on a related word confidence calculation model;
and a ranking device, for ranking the results obtained by searching with the search word and the related words according to the corresponding confidences.
Preferably, the search system further includes a related word bank establishing device, configured to establish the related word bank, including:
the parallel sentence acquisition module is used for acquiring parallel sentence pairs expressing the same meaning by adopting different expression forms based on large-scale user search behavior data;
the word segmentation device is used for carrying out word segmentation on each group of parallel sentence pairs;
the word alignment module is used for carrying out word alignment on the parallel sentence pairs subjected to word segmentation processing to obtain first aligned word pairs;
a co-occurrence frequency acquisition module, configured to calculate a co-occurrence frequency of the first alignment word pair;
a related word determination module for determining the first aligned word pair having a co-occurrence frequency higher than a predetermined threshold as a related word.
Preferably, the related word bank establishing device further includes:
and the context acquisition module is used for acquiring the context words of the related words.
Preferably, the search system further includes a related word confidence calculation model establishing device, configured to establish the related word confidence calculation model, including:
a linear model filtering module for filtering the large-scale user search behavior data using a linear model to obtain a second pair of aligned words;
and the training module is used for training the positive sample and the negative sample based on a GBDT algorithm by taking the first aligned word pair as a positive sample and the second aligned word pair as a negative sample to obtain the related word confidence coefficient calculation model.
Preferably, the related word confidence calculation model is a GBDT nonlinear regression model.
Preferably, the word segmentation device is further configured to perform word segmentation processing on the search sentence to obtain a search word.
Preferably, the confidence calculating means includes:
the characteristic value extraction module is used for extracting a characteristic value between each search word and each corresponding related word;
and the confidence coefficient calculation module is used for taking the characteristic value as the input of the related word confidence coefficient calculation model and calculating the confidence coefficient based on the related word confidence coefficient calculation model.
Preferably, the feature value extraction module includes:
the system comprises a correlation degree information acquisition unit, a correlation degree information acquisition unit and a correlation degree information processing unit, wherein the correlation degree information acquisition unit is used for acquiring correlation degree information which is used for measuring the correlation degree between each search word and each corresponding correlation word; and/or
A substitutability information acquisition unit configured to acquire substitutability information that measures a degree of substitutability between the search term and the related term in a context of the related term; and/or
A co-occurrence relation information obtaining unit, configured to obtain co-occurrence relation information, where the co-occurrence relation information is used to measure co-occurrence relations among the search terms; and/or
A language model score information acquisition unit configured to acquire language model score information for displaying language model scores of search sentences before and after the related word replaces the search word; and/or
A weight value information acquiring unit configured to acquire weight value information indicating a weight of the related word.
Preferably, the feature value extraction module further comprises:
and the language model acquisition unit is used for training an N-gram language model based on the large-scale user search behavior data to acquire the language model.
Preferably, the ranking device ranks the results obtained by using the search term and the related term to perform the search according to the corresponding confidence through a ranking model.
Preferably, the ranking device is further configured to preliminarily rank the retrieval resources according to the search sentence and the page information of the retrieval resources through the ranking model.
In this way, with the method for mining related words, the search method, and the search system, related words corresponding to the search words can be found and used in the search together with the search words. This expands the search scope and the result set, and prevents semantically relevant results that do not literally match the search words from being missed.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in greater detail exemplary embodiments thereof with reference to the attached drawings, in which like reference numerals generally represent like parts throughout.
FIG. 1 illustrates a flow diagram of a method of mining related words in accordance with an embodiment of the present invention;
FIG. 2 illustrates a flow diagram of a method of mining related words, according to another embodiment of the invention;
FIG. 3 shows a flow diagram of a search method according to an embodiment of the invention;
FIG. 4 shows a flow diagram of a search method according to another embodiment of the invention;
FIG. 5 shows a flowchart of step S240 of the embodiment shown in FIG. 4;
FIG. 6 shows a schematic diagram of a search system according to an embodiment of the invention;
FIG. 7 shows a schematic diagram of a search system according to another embodiment of the invention;
fig. 8 is a schematic diagram of the related word bank establishing apparatus 310 according to the embodiment shown in fig. 7;
FIG. 9 is a diagram illustrating a related word confidence calculation model building device 350 according to the embodiment shown in FIG. 7;
FIG. 10 shows a schematic diagram of the confidence computation device 390 of the embodiment shown in FIG. 7;
fig. 11 shows a schematic diagram of the feature value extraction module 394 of the embodiment shown in fig. 10.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
A method for mining related words for obtaining related words from large-scale user search behavior data according to an embodiment of the present invention is described below with reference to fig. 1.
Fig. 1 is a flowchart illustrating a method of mining related words according to an embodiment of the present invention.
In step S110, parallel sentence pairs expressing the same meaning in different expression forms are acquired based on the large-scale user search behavior data.
Parallel sentence pairs are acquired from data such as users' search logs and/or search title logs, based on the large-scale user search behavior data. A parallel sentence pair is a pair of sentences that express the same meaning in different forms; for example, "the baby has a mole on its neck" and "a mole has grown on the infant's neck" express the same meaning in different wording.
In the large-scale user search behavior data, such as users' search logs and/or search title logs, there are many sentence pairs that express the same meaning in different wording. Candidate pairs whose meanings actually differ can then be filtered out according to the literal similarity of the two sentences.
In step S120, word segmentation processing is performed for each set of parallel sentence pairs.
And segmenting each sentence in each group of parallel sentence pairs by a word segmentation technology.
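The patent does not fix a particular segmentation algorithm. One classic baseline for Chinese word segmentation is forward maximum matching against a dictionary; the sketch below uses Latin strings and a toy vocabulary purely for illustration:

```python
def forward_max_match(sentence: str, vocab: set, max_len: int = 4) -> list:
    """Greedy forward maximum matching: at each position take the longest
    dictionary word, falling back to a single character."""
    tokens, i = [], 0
    while i < len(sentence):
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + length]
            if length == 1 or piece in vocab:
                tokens.append(piece)
                i += length
                break
    return tokens
```

A production system would more likely use a statistical or neural segmenter, but the interface (sentence in, token list out) is the same.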
In step S130, word alignment processing is performed on the above-mentioned word segmentation processed parallel sentence pair to obtain a first aligned word pair.
Through the word alignment process, words expressing the same meaning can be found.
The word alignment processing may include rule-based word alignment and/or statistical word alignment. Rule-based alignment includes at least one of exact literal match alignment, partial literal match alignment, or adjacent-word alignment. The statistical word alignment may be performed using the GIZA++ tool.
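A hedged sketch of the rule-based alignment: exact literal matches align directly, and partial literal matches align when their character overlap reaches a cutoff (the 0.5 cutoff is a hypothetical value, and the adjacency rule is omitted):

```python
def rule_align(tokens1, tokens2, min_overlap=0.5):
    """Rule-based word alignment: exact matches first, then partial
    literal matches whose character-overlap ratio reaches min_overlap."""
    pairs, used = [], set()
    for t1 in tokens1:
        for j, t2 in enumerate(tokens2):
            if j in used:
                continue
            common = len(set(t1) & set(t2))
            ratio = common / max(len(set(t1)), len(set(t2)))
            if t1 == t2 or ratio >= min_overlap:
                pairs.append((t1, t2))
                used.add(j)
                break
    return pairs
```

Statistical alignment with GIZA++ would complement these rules by aligning word pairs with no literal overlap at all.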
In step S140, the co-occurrence frequency of the first aligned word pair is calculated.
The co-occurrence frequency may be evaluated by the first translation probability P1 and/or the second translation probability P2, calculated as follows:
P1 = count1(A, A′) / count1(A, ·), P2 = count1(A, A′) / count1(·, A′);
count1(A, ·) = Σ_j count1(A, w_j), count1(·, A′) = Σ_i count1(w_i, A′);
where the search word A and the related word A′ form a first word pair (A, A′); count1(A, A′) is the number of times the first word pair (A, A′) is aligned in the parallel sentence pairs; count1(A, ·) is the total number of times the word A is aligned; count1(·, A′) is the total number of times the related word A′ is aligned; w_j is the j-th of all words aligned with the search word A, and w_i is the i-th of all words aligned with the related word A′; count1(A, w_j) is the number of times A and w_j are aligned, and count1(w_i, A′) is the number of times w_i and A′ are aligned; i and j are natural numbers.
Note that the value of count1(A, A′) does not depend on the order of A and A′, i.e. count1(A, A′) and count1(A′, A) are the same.
P1 is the proportion of the number of times the search word A is aligned with the related word A′ among the total number of times A is aligned; P2 is the proportion of the number of times A is aligned with A′ among the total number of times A′ is aligned.
The alignment count is the number of times two words are aligned across the different parallel sentence pairs; the co-occurrence count is the number of times two words appear together in the same corpus.
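The two translation probabilities can be computed directly from a table of alignment counts; here count1 is represented as a plain dictionary keyed by word pairs, which is an assumed data layout, not the patent's:

```python
def translation_probs(align_counts, a, a_prime):
    """P1 = count1(A, A') / count1(A, .); P2 = count1(A, A') / count1(., A')."""
    c_pair = align_counts[(a, a_prime)]
    c_a = sum(c for (x, _), c in align_counts.items() if x == a)         # count1(A, .)
    c_ap = sum(c for (_, y), c in align_counts.items() if y == a_prime)  # count1(., A')
    return c_pair / c_a, c_pair / c_ap
```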
In step S150, a first aligned word pair having a co-occurrence frequency higher than a predetermined threshold is determined as a related word.
The predetermined threshold may be set differently according to the required degree of correlation between related words. In one embodiment, the predetermined threshold may be 1.0e-99.
Therefore, by the method for mining the related words, the related words with higher relevance can be mined, the search range of the search words can be further expanded, and the probability of finding better search results is improved. And, the related words with different similarity can be obtained according to different preset threshold values.
A method of mining related words for obtaining related words from large-scale user search behavior data according to another embodiment of the present invention is described below with reference to fig. 2.
Referring to fig. 2, the method for mining related words further includes the following steps:
in step S160, the context word of the related word is recorded.
By recording the context word of the related word, the context of the related word can be known. By judging whether the context of the two related words is the same or similar, the correlation between the related words can be further judged, and the related words with higher similarity can be obtained.
The acquisition of the context words of the related words can be limited in length to different degrees according to the length of the parallel sentences. In this embodiment, parallel sentence pairs are usually not very long, so no length limit or other restriction needs to be imposed. In other embodiments, the acquisition of context words may be restricted differently according to the required relevance of the related words or other criteria.
In step S170, the large-scale user search behavior data is filtered using a linear model to obtain a second alignment word pair.
The linear model may be a simple linear model, for example one fitted by simple linear regression on the statistical features between the word pairs, using a small number of manually labeled word pairs (on the order of ten thousand). Here, fitting refers to linear regression modeling.
Because the manually labeled word pairs are few and the model is simple, the confidence scores output by the model are not highly reliable. The large-scale user search behavior data is filtered through this linear model, and results whose confidence score is below a specific threshold are taken as the second aligned word pairs; since word pairs filtered out by the model have low confidence, the second aligned word pairs serve as poor word pairs. Specifically, the specific threshold is close to or less than zero.
A "manually labeled" word pair means: under a given query sentence (query), an original word in the query and a candidate related word form a word pair, and the pair is labeled to determine whether it is suitable to serve as related words. For example, under the query "What should an eight-month-old baby eat?", the pair "baby" -> "infant" may be labeled 1, meaning the candidate can serve as a related word; under the same query, a candidate pair that does not fit the context is labeled 0, meaning it cannot serve as a related word.
Poor word pairs are wrong word pairs that should not appear in the current query context, or word pairs that violate the user's intent. For example, when the user searches "baby eats milk", rewriting it as "baby drinks milk" uses a good word pair ("eat" -> "drink", labeled 1); but applying the same pair to turn "what fruit is good to eat" into "what fruit is good to drink" changes the meaning, so in that context it is a poor word pair. Poor word pairs can take many other forms and are not limited to this example.
In step S180, a statistical feature that can reflect the degree of correlation between related words is obtained.
The statistical features are context-word statistical verification features of whether a word pair is suitable in the current query context, and include at least one of the correlation degree information, replaceability information, co-occurrence relation information, language model score information, and weight value information between each pair of related words.
In step S190, the first aligned word pairs are used as positive samples and the second aligned word pairs as negative samples, and, based on the statistical characteristics, the positive and negative samples are trained using a gradient boosting decision tree (GBDT) algorithm to obtain the confidence calculation model of the related words.
The confidence calculation model of the related words may be a GBDT nonlinear regression model.
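As an illustration only, the following minimal Python sketch mimics the shape of this training step with a hand-rolled gradient-boosting loop over depth-1 regression stumps; the two features (a translation probability and an average PMI) and the four labeled pairs are invented, and a real system would use a full GBDT implementation.

```python
# Minimal gradient-boosting sketch (squared loss, depth-1 regression stumps).
# Features and labels below are invented toy data, not from the patent.

def fit_stump(X, residuals, n_features):
    # Exhaustively pick the (feature, threshold) split minimizing squared error.
    best = None
    for f in range(n_features):
        for t in sorted({x[f] for x in X}):
            left = [r for x, r in zip(X, residuals) if x[f] <= t]
            right = [r for x, r in zip(X, residuals) if x[f] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, f, t, lm, rm)
    return best[1:]

def gbdt_fit(X, y, n_rounds=20, lr=0.5):
    base = sum(y) / len(y)
    preds = [base] * len(X)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - p for yi, p in zip(y, preds)]  # gradient of squared loss
        f, t, lm, rm = fit_stump(X, residuals, len(X[0]))
        stumps.append((f, t, lm, rm))
        preds = [p + lr * (lm if x[f] <= t else rm) for x, p in zip(X, preds)]
    return base, lr, stumps

def gbdt_predict(model, x):
    base, lr, stumps = model
    return base + sum(lr * (lm if x[f] <= t else rm) for f, t, lm, rm in stumps)

# toy samples: [translation_probability, average_PMI]; 1 = positive pair, 0 = negative pair
X = [[0.9, 0.8], [0.8, 0.7], [0.1, 0.2], [0.2, 0.1]]
y = [1, 1, 0, 0]
model = gbdt_fit(X, y)
```

The learned model can then score any new word pair's feature vector as a confidence value.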
A search method according to an embodiment of the present invention is described below with reference to fig. 3.
Fig. 3 shows a flow diagram of a search method according to an embodiment of the invention.
A search method comprising the steps of:
in step S220, related words of the search term are obtained based on the related word library.
The related word library is established by the above method for mining related words. In this way, all related words of a search term may be obtained, including not only synonyms of the term (which may include strong synonyms and contextual synonyms) but also related words of broader coverage. Because the mining method finds related words of higher relevance, the search range is expanded and the probability of finding a better search result is improved.
In step S240, a confidence level between the search term and each related term is calculated based on a confidence level calculation model.
In step S260, the results obtained by searching using the search term and the related terms thereof are ranked according to the corresponding confidence.
In this step, the results obtained by searching with the search term and its related terms are ranked according to the corresponding confidence levels through a ranking model. The ranking model may be a quick-sort-based model that orders results according to an existing quick sort algorithm; it is understood that other existing ranking models may also be used.
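A minimal sketch of this re-ranking step, assuming each result carries the confidence of the (search word, related word) pair that recalled it; the result tuples and confidence values are invented:

```python
# Each result: (document id, recalling term, confidence of that term).
# Results recalled by the original search term get confidence 1.0;
# results recalled by related words carry the model's confidence.
results = [
    ("doc_about_infants", "infant", 0.92),  # recalled by a related word
    ("doc_about_babies", "baby", 1.0),      # recalled by the original term
    ("doc_fruit_drink", "drink", 0.31),     # low-confidence related word
]

# Rank by confidence, highest first.
ranked = sorted(results, key=lambda r: r[2], reverse=True)
```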
Searching with related words covers not only high-frequency synonyms but also medium- and low-frequency related words; in particular, when retrieval resources are scarce, searching with related words acquires retrieval information to the maximum extent.
Therefore, with this search method, the corresponding related words can be found for a search term, and the search is performed with both the search term and its related words, so that the search range and thus the search results are expanded; this prevents results that do not literally match the search term but are semantically very similar to it from failing to be recalled.
In another embodiment, before the step S260, a step of the ranking model initially ranking the retrieval resources according to the retrieval statement and the retrieval resource page information may be further included.
The preliminary ranking step is a general search process and may be limited by setting a retrieval threshold, so that only search results reaching a predetermined score are re-ranked in step S260. Thus, when there are many initial search results, the amount of re-ranking is reduced. This two-stage ranking can also be used when the user requires that only highly accurate search results be displayed.
The retrieval resource can be a web page resource and/or a document resource. The retrieval resource can be a piece of text information, a title of a webpage, a sentence of a query, or a document with a longer length.
A search method according to another embodiment of the present invention is described below with reference to fig. 4.
Fig. 4 shows a flowchart of a search method according to another embodiment of the present invention.
The searching method may further include step S210 before step S220. In step S210, a word segmentation process is performed on the search sentence to obtain the search word.
When a user inputs a search sentence, the sentence is segmented to obtain a plurality of search terms, so that search results related to each search term can be retrieved by the search method, further expanding the search range. The word segmentation may include Chinese word segmentation and/or English word segmentation, as well as word segmentation in other languages, and may be implemented with any existing word segmentation technique.
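For illustration, a toy forward-maximum-matching segmenter over a hypothetical dictionary shows one classic way Chinese word segmentation can work; production systems use full segmenters (e.g. jieba), and this tiny dictionary is invented:

```python
# Forward maximum matching (FMM): at each position, greedily take the longest
# dictionary word. The dictionary below is invented for illustration.
DICT = {"宝宝", "八个月", "吃", "什么", "婴儿"}
MAX_LEN = max(len(w) for w in DICT)

def fmm_segment(sentence):
    words, i = [], 0
    while i < len(sentence):
        for l in range(min(MAX_LEN, len(sentence) - i), 0, -1):
            cand = sentence[i:i + l]
            # fall back to a single character when no dictionary word matches
            if l == 1 or cand in DICT:
                words.append(cand)
                i += l
                break
    return words
```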
Referring now to fig. 5, a flowchart of step S240 of the embodiment shown in fig. 4 is shown.
In step S242, a feature value between each search term and each corresponding related term is acquired.
Each time the search content is different, the corresponding search term is also different, and therefore the characteristic value is also different.
In step S244, the feature value is used as an input to the confidence calculation model, and the confidence is calculated based on the confidence calculation model.
The feature value may include at least one of correlation degree information, degree of replaceability information, co-occurrence relationship information, language model score information, and weight value information.
The related degree information is used for measuring the related degree between each search term and each corresponding related term.
The correlation degree information may include a first translation probability P_1 and/or a second translation probability P_2, which are respectively expressed by the following formulas:

P_1 = count_1(A, A′) / count_1(A, ·);

P_2 = count_1(A, A′) / count_1(·, A′);

count_1(A, ·) = Σ_j count_1(A, w_j); count_1(·, A′) = Σ_i count_1(w_i, A′);

wherein the search word A and the related word A′ form a first word pair (A, A′); count_1(A, A′) represents the number of times the first word pair (A, A′) is aligned in the parallel sentence pairs; count_1(A, ·) represents the total number of times the search word A is aligned in the parallel sentence pairs; count_1(·, A′) represents the total number of times the related word A′ is aligned in the parallel sentence pairs; w_j represents the j-th of all words aligned with the search word A in the parallel sentence pairs; w_i represents the i-th of all words aligned with the related word A′ in the parallel sentence pairs; count_1(A, w_j) represents the number of times the search word A and the word w_j are aligned in the parallel sentence pairs; count_1(w_i, A′) represents the number of times the word w_i and the related word A′ are aligned in the parallel sentence pairs; i and j are natural numbers.
As can be appreciated, the value of count_1(A, A′) is independent of the order of A and A′, i.e., count_1(A, A′) and count_1(A′, A) are the same.
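A small sketch of computing the first and second translation probabilities from alignment counts, with invented toy counts (count_1 keyed by (original word, related word) pairs):

```python
from collections import Counter

# Toy alignment counts over parallel sentence pairs; values are invented.
count1 = Counter({("baby", "infant"): 8, ("baby", "child"): 2, ("small", "infant"): 2})

def p1(a, a_prime):
    # P_1 = count_1(A, A') / count_1(A, ·)
    total_a = sum(c for (x, _), c in count1.items() if x == a)
    return count1[(a, a_prime)] / total_a

def p2(a, a_prime):
    # P_2 = count_1(A, A') / count_1(·, A')
    total_ap = sum(c for (_, y), c in count1.items() if y == a_prime)
    return count1[(a, a_prime)] / total_ap
```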
Wherein the degree of replaceability information is used to measure the degree of replaceability between the search term and the related term in the context of the related term.
The replaceability degree information includes a first replaceability degree score Score(D, Q) and/or a second replaceability degree score Score(D, Q′), expressed by the following formulas:

Score(D, Q) = Σ_{i=1..n} f(q_i, D)·(k_1 + 1) / [ f(q_i, D) + k_1·(1 − b + b·|D|/avgdl) ];

Score(D, Q′) = Σ_{j=1..m} f(q′_j, D)·(k_1 + 1) / [ f(q′_j, D) + k_1·(1 − b + b·|D|/avgdl) ];

wherein the search word A and the related word A′ form a first word pair (A, A′);

the context words of the search word A and the context words of the related word A′ are taken together as a document D, and |D| is the length of D; the context words of A and of A′ in the several sentence pairs may each be different, and the context is recorded as a whole;

Q is a search sentence, q_i is the i-th search word of the search sentence Q, and n is the total number of search words in Q;

Q′ is a combination of m words near the search word A, with m < n, and q′_j is the j-th search word of the word combination Q′;

avgdl is the average length of the documents formed by the contexts of all related words of the search word A;

k_1 is a first constant and b is a second constant;

f(q_i, D) represents the frequency of occurrence of q_i in the document D;

f(q′_j, D) represents the frequency of occurrence of q′_j in the document D.
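The parameters above (f, |D|, avgdl, k_1, b) follow the BM25 weighting scheme; a minimal sketch without the IDF factor, on an invented context document, could look like:

```python
def bm25_score(query_terms, doc_terms, avgdl, k1=1.2, b=0.75):
    # BM25-style score of query terms against a context document D.
    # The IDF factor is omitted, since the text defines only f, |D|, avgdl, k1, b;
    # k1 and b defaults are conventional values, not taken from the patent.
    dl = len(doc_terms)  # |D|
    score = 0.0
    for q in query_terms:
        f = doc_terms.count(q)  # f(q, D)
        score += f * (k1 + 1) / (f + k1 * (1 - b + b * dl / avgdl))
    return score

# Invented context document of a related word.
context_doc = ["milk", "drink", "baby", "milk"]
score_q = bm25_score(["milk", "baby"], context_doc, avgdl=4.0)
```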
The co-occurrence relationship information is used for measuring the co-occurrence relationship between the search terms, and refers to statistical data of two search terms appearing in a query corpus (search resources, web pages and/or documents) at the same time.
The co-occurrence relation information comprises first co-occurrence relation information and/or second co-occurrence relation information obtained based on the co-occurrence relation index PMI:

PMI(A, B) = log[ count_2(A, B)·count_2(·, ·) / ( count_2(A, ·)·count_2(·, B) ) ];

count_2(A, ·) = Σ_j count_2(A, w_j);

count_2(·, B) = Σ_i count_2(w_i, B);

count_2(·, ·) = Σ_{i,j} count_2(w_i, w_j);

wherein count_2(A, ·) represents the total number of times the search word A appears simultaneously with other search words in the search resources; count_2(·, B) represents the total number of times the search word B appears simultaneously with other search words in the search resources; count_2(A, B) represents the number of times the two search words A and B appear simultaneously in the search resources; w_j represents the j-th of all words appearing simultaneously with the search word A in the search resources; w_i represents the i-th of all words appearing simultaneously with the related word B in the search resources; count_2(A, w_j) represents the number of times the two search words A and w_j appear simultaneously in the search resources; count_2(w_i, B) represents the number of times the two search words w_i and B appear simultaneously in the search resources; count_2(w_i, w_j) represents the number of times the two search words w_i and w_j appear simultaneously in the search resources; i and j are natural numbers.
It can be understood that the value of count_2(A, B) is independent of the order of A and B, i.e., count_2(A, B) and count_2(B, A) are the same.
The first co-occurrence relation information is an average value of co-occurrence relation indexes PMI of the search word and other words in the search sentence.
The second co-occurrence relationship information is an average value of the co-occurrence relationship index PMI of the related word and the other search terms (the other search terms excluding the search term corresponding to the related word) in the search sentence.
When the first co-occurrence relation information is calculated, the above formula can be used directly and the average value taken; when the second co-occurrence relation information is calculated, the search word A in the formula is replaced with the related word A′.
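An illustrative sketch of the PMI computation and its average over the other query words, with invented co-occurrence counts stored under unordered keys to reflect the order-independence of count_2(A, B):

```python
import math

# Toy co-occurrence counts, keyed by unordered word pairs (invented data).
count2 = {
    frozenset(("baby", "milk")): 4,
    frozenset(("baby", "eat")): 2,
    frozenset(("milk", "eat")): 2,
}

def pmi(a, b):
    total = sum(count2.values())                              # count_2(·, ·)
    c_ab = count2.get(frozenset((a, b)), 0)                   # count_2(A, B)
    c_a = sum(c for pair, c in count2.items() if a in pair)   # count_2(A, ·)
    c_b = sum(c for pair, c in count2.items() if b in pair)   # count_2(·, B)
    return math.log((c_ab * total) / (c_a * c_b))

def avg_pmi(word, other_words):
    # First co-occurrence information: mean PMI of the word with the
    # other words in the search sentence.
    return sum(pmi(word, o) for o in other_words) / len(other_words)
```

For the second co-occurrence information, the same `avg_pmi` is called with the related word in place of the original search word.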
And the language model score information is used for displaying the language model scores of the retrieval sentences before and after the related words replace the retrieval words. The method further comprises the step of training an N-gram language model based on large-scale user search behavior data to obtain the language model.
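A toy bigram (2-gram) model with add-one smoothing illustrates scoring a query before and after a replacement; the miniature corpus below stands in for large-scale user search behavior data and is invented:

```python
import math
from collections import Counter

# Miniature stand-in corpus (a real system trains on large-scale search logs).
corpus = [
    "baby drinks milk", "infant drinks milk", "baby eats fruit",
    "infant eats fruit", "baby drinks water",
]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    toks = ["<s>"] + sent.split()
    unigrams.update(toks)
    for u, v in zip(toks, toks[1:]):
        bigrams[(u, v)] += 1
V = len(unigrams)  # vocabulary size for add-one smoothing

def log_prob(sentence):
    # Add-one (Laplace) smoothed bigram log-probability of the sentence.
    toks = ["<s>"] + sentence.split()
    return sum(
        math.log((bigrams[(u, v)] + 1) / (unigrams[u] + V))
        for u, v in zip(toks, toks[1:])
    )
```

Comparing `log_prob` of a query before and after a related word replaces the search word yields the two language model scores.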
The weight value information is used for representing the weight of the related words.
In step S180, the statistical characteristics between each pair of related words are also calculated using the above statistical characteristic calculation method.
A search system according to an embodiment of the present invention is described below with reference to fig. 6.
FIG. 6 shows a schematic diagram of a search system according to an embodiment of the invention.
A search system 300 comprises a related word stock device 320, a related word acquisition device 340, a search device 360, a sorting device 380 and a confidence calculation device 390.
The related word obtaining means 340 is connected to the related word stock device 320 and obtains the related words of the search word based on the related word library stored therein. The search means 360 performs a search based on the search term and its related terms. The confidence calculation means 390 calculates the confidence between the search word and each of its corresponding related words based on a confidence calculation model. The ranking means 380 ranks the results retrieved by the search means 360 according to the corresponding confidences calculated by the confidence calculation means 390.
Thus, through the search system 300, the corresponding related words can be found for the search terms, and the search is performed with both the search terms and their related words, so that the search range and the search results are expanded and the probability of retrieving the target document is improved. This prevents good search results that do not literally match the search term but are semantically very similar to it from failing to be recalled.
A search system according to another embodiment of the present invention is described below with reference to fig. 7.
FIG. 7 shows a schematic diagram of a search system according to another embodiment of the invention.
The searching system 300 may further include a related word bank building means 310 and a related word confidence calculation model building means 350.
The related word stock establishing device 310 is connected to the related word stock device 320 for establishing the related word stock by the method of mining related words.
Fig. 8 is a diagram of the related word bank building apparatus 310 for building a related word bank according to the embodiment shown in fig. 7.
The related vocabulary base establishing means 310 may include: a parallel sentence acquisition module 311, a word segmenter 313, a word alignment module 315, a co-occurrence frequency acquisition module 317, a related word determination module 319, and a context acquisition module 318.
The parallel sentence acquisition module 311 acquires parallel sentence pairs that express the same meaning in different forms based on large-scale user search behavior data. The word segmenter 313 performs word segmentation on each group of parallel sentence pairs. The word alignment module 315 performs word alignment on the segmented parallel sentence pairs to obtain first aligned word pairs. The co-occurrence frequency acquisition module 317 calculates the co-occurrence frequency of the first aligned word pairs. The related word determination module 319 determines first aligned word pairs whose co-occurrence frequency is higher than a predetermined threshold as related words, forming the related word library.
Thus, through the related word bank establishing device 310, related words with higher relevance can be mined, the search range of search words can be expanded, the probability of finding better search results can be improved, and related words with different similarities can be acquired according to different preset thresholds.
By establishing a related word bank, all related words of the search word can be obtained, and the related words not only comprise synonyms of the search word (which can comprise strong synonyms and context synonyms), but also comprise related words with wider coverage degree. By the method for mining the related words, the related words with higher relevance can be mined, the search range of the search words can be expanded, and the probability of finding better search results is improved.
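The mining steps above can be sketched end to end; here the word aligner is reduced to a trivial positional rule for equal-length pairs (a real system would use, e.g., GIZA++), and the query data is invented:

```python
from collections import Counter

# Invented parallel query pairs expressing the same meaning in different forms,
# already segmented into words.
parallel_pairs = [
    (["baby", "drinks", "milk"], ["infant", "drinks", "milk"]),
    (["baby", "eats", "fruit"], ["infant", "eats", "fruit"]),
    (["baby", "sleeps"], ["infant", "sleeps"]),
]

freq = Counter()
for left, right in parallel_pairs:
    if len(left) != len(right):
        continue  # a real aligner (e.g. GIZA++) handles unequal lengths
    for w, w2 in zip(left, right):
        if w != w2:
            freq[(w, w2)] += 1  # positionally aligned, literally different

THRESHOLD = 2  # illustrative; the patent leaves the threshold configurable
related = {pair for pair, c in freq.items() if c > THRESHOLD}
```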
In addition, the word segmentation unit 313 is further configured to perform word segmentation processing on the search sentence to obtain a search word. When a user inputs a search sentence, the search sentence is segmented to obtain a plurality of search terms, so that search results related to the search terms are searched by the search method, and the search range is further expanded.
Further, the related word bank establishing device 310 further includes a context obtaining module 318 for obtaining context words of the related words.
By recording the context word of the related word, the context of the related word can be known. By judging whether the context of the two related words is the same or similar, the correlation degree between the related words can be further judged, and the related words with higher similarity degree can be obtained.
The acquisition of the context words of a related word may be limited in length to different degrees according to the length of the parallel sentences. In this embodiment, the parallel sentence pairs are not overly long, so no length or other limitation need be made. In other embodiments, the acquisition of context words may be limited differently according to the required relevance of the related words or other criteria.
Referring to fig. 9, a schematic diagram of the related word confidence calculation model building apparatus 350 according to the embodiment shown in fig. 7 is shown.
The related word confidence calculation model building apparatus 350 may include a linear model filtering module 352 and a training module 354.
The linear model filtering module 352 is configured to filter the large-scale user search behavior data using a linear model to obtain second aligned word pairs.
The linear model may be a simple linear model, fitted with a simple linear regression using statistical features between a small number (on the order of ten thousand) of manually labeled word pairs. Because the labeled word pairs are few and the model is simple, the confidence output by the model is of limited precision. The large-scale user search behavior data is filtered through the linear model to obtain the second aligned word pairs, which are poor word pairs: wrong word pairs that should not appear in the context of the current query words, or word pairs that violate the user's intention. For example, when a user searches for "baby eats milk", obtaining "baby drinks milk" is a good word pair; but rewriting "what fruit is good to eat" into "what fruit is good to drink" changes the meaning, so it is a poor word pair.
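A minimal sketch of this filtering step: a linear model fitted by least-squares SGD on a few invented labeled feature vectors, with pairs scoring below a threshold kept as negative samples:

```python
# Simple linear regression by per-sample gradient descent (squared loss).
# Features (e.g. translation probability, average PMI), labels, and the
# filtering threshold below are all invented for illustration.

def fit_linear(X, y, lr=0.1, epochs=500):
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            pred = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = pred - t
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def predict(model, x):
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, x)) + b

X = [[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.0]]
y = [1, 1, 0, 0]
model = fit_linear(X, y)

# Word pairs whose confidence falls below the threshold become negative samples.
threshold = 0.3
negatives = [x for x in X if predict(model, x) < threshold]
```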
The training module 354 is connected to the related word bank establishing device 310 and the linear model filtering module 352, respectively, and trains the positive sample and the negative sample based on the GBDT algorithm to obtain a related word confidence coefficient calculation model by using the first aligned word pair as a positive sample and the second aligned word pair as a negative sample.
The related word confidence calculation model may be a GBDT nonlinear regression model.
Referring to fig. 10, the confidence calculation device 390 of the embodiment shown in fig. 7 may include a confidence calculation module 392 and a feature value extraction module 394.
The feature value extraction module 394 extracts a feature value between each search term and each related term corresponding thereto, and the confidence calculation module 392 calculates the confidence by using the feature value as an input of a confidence calculation model based on the confidence calculation model.
Referring to fig. 11, a diagram of the feature value extraction module 394 of the embodiment shown in fig. 10 is shown.
The feature value extraction module 394 may further include at least one of a correlation degree information acquisition unit 3941, a replaceable degree information acquisition unit 3942, a co-occurrence relationship information acquisition unit 3943, a language model score information acquisition unit 3944, a weight value information acquisition unit 3945, and a language model acquisition unit 3946.
A correlation degree information obtaining unit 3941, configured to obtain correlation degree information. The degree of relevance information is used to measure the degree of relevance between each search term and each corresponding relevant term.
A substitutability information acquisition unit 3942 for acquiring the substitutability information. The degree of replaceability information is used to measure the degree of replaceability between a search term and a related term in the context of the related term.
A co-occurrence relation information obtaining unit 3943, configured to obtain co-occurrence relation information. And the co-occurrence relation information is used for measuring the co-occurrence relation among the search terms.
A language model score information obtaining unit 3944, configured to obtain language model score information. The language model score information is used for displaying the language model scores of the search sentences before and after the related word replaces the search word.
The weight value information acquiring unit 3945 is configured to acquire weight value information. Wherein, the weight value information is used for representing the weight of the related words.
Further, the feature value extraction module 394 may further include a language model obtaining unit 3946. The language model obtaining unit 3946 is configured to train the N-gram language model based on the large-scale user search behavior data to obtain the language model.
The sorting device 380 sorts the results obtained by searching the search terms and the corresponding related terms according to the corresponding confidence information through the sorting model. The sorting model may be a fast sorting model for sorting according to an existing fast sorting algorithm.
Further, the ranking device 380 may also perform a preliminary ranking of the retrieval resources according to the search sentence and the retrieval resource page information through the ranking model. The preliminary ranking is a general search process and may be limited by setting a retrieval threshold, so that only search results reaching a predetermined score enter the re-ranking stage. When there are many initial search results, this reduces the workload of re-ranking. The two-stage ranking may also be used when the user requires that only highly accurate search results be displayed.
Searching with related words covers not only high-frequency synonyms but also medium- and low-frequency search words; in particular, when retrieval resources are scarce, searching with related words acquires retrieval information to the maximum extent. Therefore, with this search system, the corresponding related words can be found for the search terms, and the search is performed with both the search terms and their related words, expanding the search range and the search results; this prevents results that do not literally match the search term but are semantically very similar to it from failing to be recalled.
The method of mining related words, the search method, and the search system according to the present invention have been described above in detail with reference to the accompanying drawings.
Furthermore, the method according to the invention may also be implemented as a computer program product comprising a computer readable medium having stored thereon a computer program for performing the above-mentioned functions defined in the method of the invention. Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
While embodiments of the present invention have been described above, the above description is illustrative, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (24)

1. A search method comprising the steps of:
acquiring related words of the search words based on a related word library;
calculating a confidence between the search word and each of the related words based on a confidence calculation model;
ranking results obtained from the search using the search term and the related term according to the corresponding confidence levels,
wherein the method further comprises:
performing word segmentation processing on a search sentence to obtain the search word,
wherein the step of calculating the confidence between the search word and each of the related words based on a confidence calculation model comprises:
obtaining a characteristic value between each search word and each corresponding related word;
taking the feature value as an input to the confidence computation model, computing the confidence based on the confidence computation model,
wherein the characteristic values include:
-a degree of replaceability information for measuring a degree of replaceability between the search term and the related term in a context of the related term; and
and the co-occurrence relation information is used for measuring the co-occurrence relation among the search terms.
2. The method according to claim 1, wherein the replaceability degree information includes a first replaceability degree score Score(D, Q) and/or a second replaceability degree score Score(D, Q′):

Score(D, Q) = Σ_{i=1..n} f(q_i, D)·(k_1 + 1) / [ f(q_i, D) + k_1·(1 − b + b·|D|/avgdl) ];

Score(D, Q′) = Σ_{j=1..m} f(q′_j, D)·(k_1 + 1) / [ f(q′_j, D) + k_1·(1 − b + b·|D|/avgdl) ];

wherein the search word A and the related word A′ form a first word pair (A, A′);

all the context words of the search word A and of the related word A′ are taken as a document D, and |D| is the length of D;

Q is a search sentence, q_i is the i-th search word of the search sentence Q, and n is the total number of search words in the search sentence Q;

Q′ is a combination of m words near the search word A, m < n, and q′_j is the j-th search word of the word combination Q′;

avgdl is the average length of the documents formed by the contexts of all related words of the search word A, k_1 is a first constant, and b is a second constant;

f(q_i, D) represents the frequency of occurrence of q_i in the document D; and

f(q′_j, D) represents the frequency of occurrence of q′_j in the document D.
3. The method according to claim 1, wherein the co-occurrence relation information comprises first co-occurrence relation information and/or second co-occurrence relation information derived based on a co-occurrence relation index PMI, wherein

PMI(A, B) = log[ count_2(A, B)·count_2(·, ·) / ( count_2(A, ·)·count_2(·, B) ) ];

count_2(A, ·) = Σ_j count_2(A, w_j);

count_2(·, B) = Σ_i count_2(w_i, B);

count_2(·, ·) = Σ_{i,j} count_2(w_i, w_j);

count_2(A, ·) represents the total number of times the search word A appears simultaneously with other search words in the search resources; count_2(·, B) represents the total number of times the search word B appears simultaneously with other search words in the search resources; count_2(A, B) represents the number of times the two search words A and B appear simultaneously in the search resources; w_j represents the j-th of all words appearing simultaneously with the search word A in the search resources; w_i represents the i-th of all words appearing simultaneously with the related word B in the search resources; count_2(A, w_j) represents the number of times the two search words A and w_j appear simultaneously in the search resources; count_2(w_i, B) represents the number of times the two search words w_i and B appear simultaneously in the search resources; count_2(w_i, w_j) represents the number of times the two search words w_i and w_j appear simultaneously in the search resources; i and j are natural numbers;

the first co-occurrence relation information is the average value of the co-occurrence relation index PMI of the search word and the other words in the search sentence;

the second co-occurrence relation information is the average value of the co-occurrence relation index PMI of the related word and the other words in the search sentence.
4. The method of claim 1, wherein the feature values further comprise:
the correlation degree information is used for measuring the correlation degree between each search word and each corresponding correlation word; and/or
Language model score information for displaying language model scores of the search sentences before and after the related word replaces the search word; and/or
And the weight value information is used for representing the weight of the related words.
5. The method of claim 4, wherein the correlation degree information comprises a first translation probability P_1 and/or a second translation probability P_2:

P_1 = count_1(A, A′) / count_1(A, ·);

P_2 = count_1(A, A′) / count_1(·, A′);

count_1(A, ·) = Σ_j count_1(A, w_j); count_1(·, A′) = Σ_i count_1(w_i, A′);

wherein the search word A and the related word A′ form a first word pair (A, A′); count_1(A, A′) represents the number of times the first word pair (A, A′) is aligned in the parallel sentence pairs; count_1(A, ·) represents the total number of times the search word A is aligned in the parallel sentence pairs; count_1(·, A′) represents the total number of times the related word A′ is aligned in the parallel sentence pairs; w_j represents the j-th of all words aligned with the search word A in the parallel sentence pairs; w_i represents the i-th of all words aligned with the related word A′ in the parallel sentence pairs; count_1(A, w_j) represents the number of times the search word A and the word w_j are aligned in the parallel sentence pairs; count_1(w_i, A′) represents the number of times the word w_i and the related word A′ are aligned in the parallel sentence pairs; and i and j are natural numbers.
6. The method of claim 4, further comprising training an N-gram language model to obtain the language model based on large-scale user search behavior data.
7. The method of claim 1, wherein the step of ranking the results of the search using the search term and the related term according to the corresponding confidence levels is ranking the results of the search using the search term and the related term according to the corresponding confidence levels by a ranking model.
8. The method of claim 7, further comprising the step of the ranking model initially ranking the search resources according to the search statement and search resource page information.
9. The method of claim 8, wherein,
the retrieval resources are webpage resources and/or document resources.
10. The method of claim 1, wherein the related word library is created by a method of mining related words, the method of mining related words comprising:
acquiring parallel sentence pairs expressing the same meaning by adopting different expression forms based on large-scale user search behavior data;
performing word segmentation processing on each group of parallel sentence pairs;
performing word alignment processing on the parallel sentence pairs after word segmentation processing to obtain a first aligned word pair;
calculating a co-occurrence frequency of the first aligned word pair;
determining the first aligned word pair having a co-occurrence frequency above a predetermined threshold as a related word.
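The mining steps of claim 10 can be sketched end to end. Whitespace splitting stands in for real word segmentation, positional alignment stands in for real word alignment, and an absolute-count threshold stands in for the co-occurrence threshold; all three are simplifying assumptions:

```python
from collections import Counter

def positional_align(words1, words2):
    # Toy stand-in for real word alignment: align words by position.
    return list(zip(words1, words2))

def mine_related_words(parallel_pairs, align=positional_align, threshold=1):
    """Segment each parallel sentence pair, align the words, count the
    co-occurrences of aligned pairs, and keep pairs above the threshold."""
    pair_freq = Counter()
    for s1, s2 in parallel_pairs:
        for a, b in align(s1.split(), s2.split()):
            if a != b:  # identical words carry no new relation
                pair_freq[(a, b)] += 1
    return {p for p, c in pair_freq.items() if c > threshold}

related = mine_related_words([
    ("buy cheap phone", "buy inexpensive phone"),
    ("find cheap hotel", "find inexpensive hotel"),
])
```

With these two parallel pairs, only `("cheap", "inexpensive")` is aligned more than once and survives the threshold.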
11. The method of claim 10, wherein the step of obtaining parallel sentence pairs comprises:
filtering out, according to the literal similarity of the two sentences, parallel sentence pairs that have different meanings.
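A sketch of the literal-similarity filtering in claim 11. The Jaccard overlap of word sets and the two cut-off values are assumptions; the claim only requires that pairs with different meanings be filtered out by literal similarity:

```python
def literal_similarity(s1: str, s2: str) -> float:
    """Jaccard overlap of the two sentences' word sets, a simple
    stand-in for a literal similarity measure."""
    a, b = set(s1.split()), set(s2.split())
    return len(a & b) / len(a | b) if a | b else 0.0

def filter_parallel_pairs(pairs, low=0.2, high=0.95):
    # Keep pairs that share enough words to plausibly mean the same thing,
    # but are not near-identical (i.e., they use different expression forms).
    return [(s1, s2) for s1, s2 in pairs
            if low <= literal_similarity(s1, s2) < high]

kept = filter_parallel_pairs([
    ("buy cheap phone", "buy inexpensive phone"),
    ("weather today", "train tickets"),
])
```

The first pair shares two of four distinct words (similarity 0.5) and is kept; the second shares none and is discarded.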
12. The method of claim 10, wherein the method of mining related words further comprises:
recording context words of the related words.
13. The method of claim 10, wherein,
the word alignment processing comprises regular word alignment processing and/or statistical word alignment processing;
the regular word alignment processing comprises at least one of: alignment of words whose literals are completely identical, alignment of words whose literals are partially identical, or adjacent word alignment processing;
the statistical word alignment processing is performed by using the GIZA++ tool.
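The rule-based alignments of claim 13 can be sketched as follows; the "more than half the characters shared" criterion for partial literal identity is an assumed threshold, and adjacent-word alignment is omitted for brevity:

```python
def rule_align(words1, words2):
    """Align words whose literals are completely identical, or partially
    identical by character-set overlap (a hypothetical criterion)."""
    pairs = []
    for a in words1:
        for b in words2:
            if a == b:
                pairs.append((a, b))  # completely identical literal
            else:
                common = len(set(a) & set(b))
                if common / max(len(set(a)), len(set(b))) > 0.5:
                    pairs.append((a, b))  # partially identical literal
    return pairs
```

For example, "cellphone" and "phone" share all five of the shorter word's distinct characters, so they align as a partially identical pair, while unrelated words do not.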
14. The method of claim 10, further comprising:
filtering the large-scale user search behavior data by using a linear model to obtain second aligned word pairs;
acquiring statistical characteristics capable of reflecting the correlation degree between the related words; and
training, based on the statistical characteristics and by using a gradient boosting decision tree (GBDT) algorithm, with the first aligned word pairs as positive samples and the second aligned word pairs as negative samples, to obtain the related word confidence calculation model.
15. The method of claim 14, wherein the related word confidence calculation model is a GBDT non-linear regression model.
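A minimal sketch of the GBDT confidence model of claims 14 and 15, using scikit-learn's gradient boosting regressor as a stand-in for the patent's implementation. The feature names (replaceability, co-occurrence, relatedness, language-model score delta) and the synthetic values are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical feature vectors for (search word, related word) pairs:
# [replaceability, co-occurrence, relatedness, LM score delta].
X = np.array([
    [0.9, 0.8, 0.7, 0.1],    # first aligned word pairs  -> positive samples
    [0.8, 0.9, 0.6, 0.2],
    [0.1, 0.2, 0.1, -0.5],   # linear-model filtered pairs -> negative samples
    [0.2, 0.1, 0.2, -0.4],
])
y = np.array([1.0, 1.0, 0.0, 0.0])

# GBDT non-linear regression model; its prediction on a new feature
# vector serves as the confidence between search word and related word.
model = GradientBoostingRegressor(n_estimators=20, max_depth=2,
                                  random_state=0).fit(X, y)
confidence = model.predict([[0.85, 0.85, 0.65, 0.15]])[0]
```

A feature vector resembling the positive samples yields a confidence close to 1, so the corresponding related word would rank highly.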
16. A search system, comprising:
a related word bank storage device for storing a related word bank;
a related word acquiring device for acquiring related words of each search word based on the related word bank stored in the related word bank storage device;
confidence calculation means for calculating a confidence between the search word and each of the related words based on a related word confidence calculation model;
a sorting device for sorting the results obtained by searching using the search word and the related word according to the corresponding confidence degrees,
wherein the search system further comprises a word segmentation device for performing word segmentation processing on a retrieval sentence to obtain the search words,
wherein the confidence calculating means comprises:
a feature value extraction module for extracting feature values between each search word and each corresponding related word; and
a confidence calculation module for taking the feature values as input to the related word confidence calculation model and calculating the confidence based on the related word confidence calculation model,
wherein the feature value extraction module comprises:
a degree of replaceability information obtaining unit configured to obtain degree of replaceability information, the degree of replaceability information being used to measure a degree of replaceability between the search word and the related word in a context of the related word; and
a co-occurrence relation information acquisition unit configured to acquire co-occurrence relation information, the co-occurrence relation information being used to measure the co-occurrence relation among the search words.
17. The search system of claim 16, wherein the feature value extraction module further comprises:
a relevant degree information acquisition unit configured to acquire relevant degree information, the relevant degree information being used to measure the degree of relevance between each search word and each corresponding related word; and/or
a language model score information acquisition unit configured to acquire language model score information, the language model score information showing the language model scores of the retrieval sentence before and after the related word replaces the search word; and/or
a weight value information acquisition unit configured to acquire weight value information indicating the weight of the related word.
18. The search system of claim 17, wherein the feature value extraction module further comprises:
a language model acquisition unit configured to train an N-gram language model based on large-scale user search behavior data to obtain the language model.
19. The search system of claim 16, further comprising related word bank establishing means for establishing the related word bank, the related word bank establishing means comprising:
the parallel sentence acquisition module is used for acquiring parallel sentence pairs expressing the same meaning by adopting different expression forms based on large-scale user search behavior data;
a word segmentation module for performing word segmentation processing on each group of parallel sentence pairs;
a word alignment module for performing word alignment processing on the parallel sentence pairs after word segmentation processing to obtain first aligned word pairs;
a co-occurrence frequency acquisition module, configured to calculate a co-occurrence frequency of the first alignment word pair;
a related word determining module for determining the first aligned word pair with the co-occurrence frequency higher than a predetermined threshold as a related word.
20. The search system according to claim 19, wherein the related word bank establishing means further comprises:
a context acquisition module for acquiring the context words of the related words.
21. The search system according to claim 19, further comprising related word confidence calculation model building means for building the related word confidence calculation model, including:
a linear model filtering module for filtering the large-scale user search behavior data using a linear model to obtain a second pair of aligned words;
a training module for training, by using the first aligned word pairs as positive samples and the second aligned word pairs as negative samples, based on a GBDT algorithm, to obtain the related word confidence calculation model.
22. The search system of claim 21, wherein the related term confidence calculation model is a GBDT non-linear regression model.
23. The search system according to claim 16, wherein the ranking means ranks results obtained by the search using the search term and the related term according to the corresponding confidence levels through a ranking model.
24. The search system of claim 23, wherein the ranking means is further configured to initially rank, by the ranking model, the retrieval resources according to the retrieval statements and retrieval resource page information.
CN201510657691.7A 2015-10-12 2015-10-12 Method for mining related words, search method, search system Active CN105279252B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201510657691.7A CN105279252B (en) 2015-10-12 2015-10-12 Method for mining related words, search method, search system
PCT/CN2016/101700 WO2017063538A1 (en) 2015-10-12 2016-10-10 Method for mining related words, search method, search system

Publications (2)

Publication Number Publication Date
CN105279252A CN105279252A (en) 2016-01-27
CN105279252B true CN105279252B (en) 2017-12-26

Family

ID=55148266

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510657691.7A Active CN105279252B (en) Method for mining related words, search method, search system

Country Status (2)

Country Link
CN (1) CN105279252B (en)
WO (1) WO2017063538A1 (en)

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105279252B (en) * 2015-10-12 2017-12-26 Guangzhou Shenma Mobile Information Technology Co., Ltd. Method for mining related words, search method, search system
CN105868847A (en) * 2016-03-24 2016-08-17 车智互联(北京)科技有限公司 Shopping behavior prediction method and device
CN105955993B (en) * 2016-04-19 2020-09-25 北京百度网讯科技有限公司 Search result ordering method and device
CN108205757B (en) * 2016-12-19 2022-05-27 创新先进技术有限公司 Method and device for verifying legality of electronic payment service
CN107168958A (en) * 2017-05-15 2017-09-15 北京搜狗科技发展有限公司 A kind of interpretation method and device
CN107909088B (en) * 2017-09-27 2022-06-28 百度在线网络技术(北京)有限公司 Method, apparatus, device and computer storage medium for obtaining training samples
CN108171570B (en) * 2017-12-15 2021-04-27 北京星选科技有限公司 Data screening method and device and terminal
CN108733766B (en) * 2018-04-17 2020-10-02 腾讯科技(深圳)有限公司 Data query method and device and readable medium
CN110472251B (en) 2018-05-10 2023-05-30 腾讯科技(深圳)有限公司 Translation model training method, sentence translation equipment and storage medium
CN109241356B (en) * 2018-06-22 2023-04-14 腾讯科技(深圳)有限公司 Data processing method, device and storage medium
CN110795613B (en) * 2018-07-17 2023-04-28 阿里巴巴集团控股有限公司 Commodity searching method, device and system and electronic equipment
CN109298796B (en) * 2018-07-24 2022-05-24 北京捷通华声科技股份有限公司 Word association method and device
CN109151599B (en) * 2018-08-30 2020-10-09 百度在线网络技术(北京)有限公司 Video processing method and device
CN111400577B (en) * 2018-12-14 2023-06-30 阿里巴巴集团控股有限公司 Search recall method and device
CN109885696A (en) * 2019-02-01 2019-06-14 杭州晶一智能科技有限公司 A kind of foreign language word library construction method based on self study
CN109918661B (en) * 2019-03-04 2023-05-30 腾讯科技(深圳)有限公司 Synonym acquisition method and device
CN110413737B (en) * 2019-07-29 2022-10-14 腾讯科技(深圳)有限公司 Synonym determination method, synonym determination device, server and readable storage medium
CN110851584B (en) * 2019-11-13 2023-12-15 成都华律网络服务有限公司 Legal provision accurate recommendation system and method
CN111241319B (en) * 2020-01-22 2023-10-03 北京搜狐新媒体信息技术有限公司 Image-text conversion method and system
CN113496411A (en) * 2020-03-18 2021-10-12 北京沃东天骏信息技术有限公司 Page pushing method, device and system, storage medium and electronic equipment
CN112199958A (en) * 2020-09-30 2021-01-08 平安科技(深圳)有限公司 Concept word sequence generation method and device, computer equipment and storage medium
CN112541076B (en) * 2020-11-09 2024-03-29 北京百度网讯科技有限公司 Method and device for generating expanded corpus in target field and electronic equipment
CN112307198B (en) * 2020-11-24 2024-03-12 腾讯科技(深圳)有限公司 Method and related device for determining abstract of single text
CN112835923A (en) * 2021-02-02 2021-05-25 中国工商银行股份有限公司 Correlation retrieval method, device and equipment
CN113609843B (en) * 2021-10-12 2022-02-01 京华信息科技股份有限公司 Sentence and word probability calculation method and system based on gradient lifting decision tree
CN114969310B (en) * 2022-06-07 2024-04-05 南京云问网络技术有限公司 Multi-dimensional data-oriented sectional search ordering system design method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6633868B1 (en) * 2000-07-28 2003-10-14 Shermann Loyall Min System and method for context-based document retrieval
CN102033955A (en) * 2010-12-24 2011-04-27 常华 Method for expanding user search results and server
CN103514150A (en) * 2012-06-21 2014-01-15 富士通株式会社 Method and device for recognizing ambiguous words with combinatorial ambiguities
CN104063454A (en) * 2014-06-24 2014-09-24 北京奇虎科技有限公司 Search push method and device for mining user demands

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101819578B (en) * 2010-01-25 2012-05-23 青岛普加智能信息有限公司 Retrieval method, method and device for establishing index and retrieval system
CN102591862A (en) * 2011-01-05 2012-07-18 华东师范大学 Control method and device of Chinese entity relationship extraction based on word co-occurrence
CN104239286A (en) * 2013-06-24 2014-12-24 阿里巴巴集团控股有限公司 Method and device for mining synonymous phrases and method and device for searching related contents
CN103942339B (en) * 2014-05-08 2017-06-09 深圳市宜搜科技发展有限公司 Synonym method for digging and device
CN105279252B (en) * 2015-10-12 2017-12-26 Guangzhou Shenma Mobile Information Technology Co., Ltd. Method for mining related words, search method, search system

Also Published As

Publication number Publication date
CN105279252A (en) 2016-01-27
WO2017063538A1 (en) 2017-04-20

Similar Documents

Publication Publication Date Title
CN105279252B (en) Method for mining related words, search method, search system
CN108509474B (en) Synonym expansion method and device for search information
CN106649786B (en) Answer retrieval method and device based on deep question answering
CN110929038B (en) Knowledge graph-based entity linking method, device, equipment and storage medium
US7949514B2 (en) Method for building parallel corpora
US20180300315A1 (en) Systems and methods for document processing using machine learning
Al-Hashemi Text Summarization Extraction System (TSES) Using Extracted Keywords.
CN113011533A (en) Text classification method and device, computer equipment and storage medium
US20160283468A1 (en) Context Based Synonym Filtering for Natural Language Processing Systems
CN107577671B (en) Subject term extraction method based on multi-feature fusion
CN108255813B (en) Text matching method based on word frequency-inverse document and CRF
CA2774278C (en) Methods and systems for extracting keyphrases from natural text for search engine indexing
US20150100308A1 (en) Automated Formation of Specialized Dictionaries
US20110295850A1 (en) Detection of junk in search result ranking
US11893537B2 (en) Linguistic analysis of seed documents and peer groups
Landthaler et al. Extending Full Text Search for Legal Document Collections Using Word Embeddings.
CN101702167A (en) Method for extracting attribution and comment word with template based on internet
CN112307182B (en) Question-answering system-based pseudo-correlation feedback extended query method
Sabuna et al. Summarizing Indonesian text automatically by using sentence scoring and decision tree
CN105653562A (en) Calculation method and apparatus for correlation between text content and query request
CN108228612B (en) Method and device for extracting network event keywords and emotional tendency
WO2020074017A1 (en) Deep learning-based method and device for screening for keywords in medical document
Rahman et al. NLP-based automatic answer script evaluation
CN112015907A (en) Method and device for quickly constructing discipline knowledge graph and storage medium
CN110019814B (en) News information aggregation method based on data mining and deep learning

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20200812

Address after: 310052 room 508, floor 5, building 4, No. 699, Wangshang Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province

Patentee after: Alibaba (China) Co.,Ltd.

Address before: 510627 Guangdong city of Guangzhou province Whampoa Tianhe District Road No. 163 Xiping Yun Lu Yun Ping square B radio tower 12 layer self unit 01

Patentee before: GUANGZHOU SHENMA MOBILE INFORMATION TECHNOLOGY Co.,Ltd.