CN107491447A

CN107491447A - Method for establishing a query rewrite discrimination model, method for discriminating query rewrites, and corresponding apparatuses

Info

Publication number: CN107491447A
Application number: CN201610408229.8A
Authority: CN
Inventors: 成幸毅; 林荣逸; 吕钦; 李磊
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Baidu Online Network Technology Beijing Co Ltd; Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2016-06-12
Filing date: 2016-06-12
Publication date: 2017-12-19
Anticipated expiration: 2036-06-12
Also published as: CN107491447B

Abstract

The invention provides a method for establishing a query rewrite discrimination model, a method for discriminating query rewrites, and corresponding apparatuses. The method for establishing the query rewrite discrimination model includes: training M neural network models separately on a first sample set consisting of first positive samples and first negative samples, each sample being a query pair, to obtain M base models, where M is a positive integer; extracting features from a second sample set consisting of second positive samples and second negative samples, each sample being a query pair, where the features include the score that each of the M base models gives to each query pair in the second sample set; and training a classification model on the extracted features to obtain the query rewrite discrimination model. The invention uses state-of-the-art machine learning techniques to learn the latent associations in textual expression, and thereby discriminates query rewrites accurately.

Description

Method for establishing a query rewrite discrimination model, method for discriminating query rewrites, and corresponding apparatuses

【Technical field】

The present invention relates to the field of computer application technology, and in particular to a method for establishing a query rewrite discrimination model, a method for discriminating query rewrites, and corresponding apparatuses.

【Background technology】

Search engines have introduced query rewriting to improve search results. The query entered by the user is rewritten so that the search can also recall the results corresponding to the rewritten query, making the expression of the user's need more accurate.

Existing query rewriting techniques rely mainly on manually formulated rules, such as fragment rewriting rules, reordering rewriting rules, chained rewriting rules, and omission rewriting rules. However, natural-language Chinese is broad and profound, carrying thousands of years of cultural heritage between the lines, and rewrites based on manually formulated rules often fail to meet high accuracy requirements. For example, under a fragment rewriting rule, "old foster-mother" is rewritten as "old adopted mother"; under a reordering rewriting rule, "Beijing South to Shenzhen" is rewritten as "Nanjing to Shenzhen North"; under a chained rewriting rule, "Hubei bus ticket" is rewritten as "Hubei ticket" and then as "Hubei train ticket"; under an omission rewriting rule, "market of stock in America" is rewritten as "beautiful market". The accuracy of these rewrites is clearly poor. A way of discriminating whether one query can serve as a query rewrite of another query is therefore urgently needed.

【Summary of the invention】

In view of this, the invention provides a method for establishing a query rewrite discrimination model, a method for discriminating query rewrites, and corresponding apparatuses, so that whether one query can serve as a query rewrite of another query can be discriminated accurately.

The specific technical solution is as follows:

The invention provides a method for establishing a query rewrite discrimination model, the method including:

training M neural network models separately on a first sample set consisting of first positive samples and first negative samples, each sample being a query pair, to obtain M base models, where M is a positive integer;

extracting features from a second sample set consisting of second positive samples and second negative samples, each sample being a query pair, where the features include the score that each of the M base models gives to each query pair in the second sample set;

training a classification model on the extracted features to obtain the query rewrite discrimination model.

According to a preferred embodiment of the invention, the first sample set is obtained as follows:

obtaining from search logs query pairs formed by two queries whose clicked-URL similarity is greater than or equal to a first threshold, as first positive samples; and/or determining a high-quality rewritten query for an original query using existing rewriting rules, and taking the query pair formed by the original query and its high-quality rewritten query as a first positive sample;

obtaining from search logs query pairs formed by two queries whose clicked-URL similarity is less than or equal to a second threshold, as first negative samples;

where the first threshold is higher than the second threshold.

According to a preferred embodiment of the invention, the second sample set is obtained as follows:

obtaining from search logs query pairs formed by two queries whose clicked-URL similarity is greater than or equal to a third threshold and less than or equal to a fourth threshold, where the third threshold is greater than the second threshold and the fourth threshold is less than the first threshold;

according to the results of manually annotating these query pairs, taking the query pairs annotated as expressing the same meaning as second positive samples, and the query pairs annotated as expressing different meanings as second negative samples.

According to a preferred embodiment of the invention, at least one of the following filters is applied to the positive samples:

if the number of URLs shared by the top q search results of the two queries in a query pair is less than a preset number threshold, the query pair is filtered out, where q is a preset positive integer;

if the two queries in a query pair yield the same expression after their stop words are removed, the query pair is filtered out;

if the two queries in a query pair contain different numeric content, the query pair is filtered out;

if the total number of URL clicks for the two queries in a query pair is less than a preset click-count threshold, the query pair is filtered out;

if one query in a query pair is a spelling-corrected form of the other, the query pair is filtered out.

According to a preferred embodiment of the invention, at least one of the following filters is applied to the negative samples:

if neither query in a query pair is a query with a preset need, the query pair is filtered out;

if a query appears in multiple query pairs, m of those query pairs are retained and the others are filtered out, where m is a preset positive integer.

According to a preferred embodiment of the invention, the neural network models include at least one of the following:

a bag-of-words neural network based on a multi-layer perceptron (BOW_NN), a convolutional neural network (CNN), and a bidirectional recurrent neural network (BiRNN).

According to a preferred embodiment of the invention, the features further include one or any combination of the following:

statistical features, distance features, position features, term importance features, semantic features, and synonym rewriting features.

According to a preferred embodiment of the invention, training a classification model on the extracted features to obtain the query rewrite discrimination model includes:

training N classification models separately on the extracted features to obtain N higher-order models, where N is a positive integer greater than 1;

selecting from and integrating the N higher-order models to obtain the query rewrite discrimination model.

According to a preferred embodiment of the invention, the classification models include at least one of the following:

gradient boosting decision tree (GBDT), support vector machine (SVM), logistic regression (LR), random forest (RF), and multi-layer perceptron (MLP).

According to a preferred embodiment of the invention, selecting from and integrating the N higher-order models to obtain the query rewrite discrimination model includes:

evaluating the results of the N higher-order models on a test set, the test set containing query pairs whose rewrite scores have already been determined;

selecting P higher-order models according to the test scores, where P is less than or equal to N;

weighting the P higher-order models to obtain the query rewrite discrimination model.

The invention also provides a method for discriminating query rewrites, the method including:

extracting features from a query pair to be discriminated, where the features include the scores given to the query pair by M base models, M being a positive integer;

inputting the extracted features into a query rewrite discrimination model to obtain the discrimination result of the query rewrite discrimination model;

where the M base models and the query rewrite discrimination model are obtained using the method described above.

The invention further provides an apparatus for establishing a query rewrite discrimination model, the apparatus including:

a first sample acquisition unit, configured to obtain a first sample set consisting of first positive samples and first negative samples, each sample being a query pair;

a second sample acquisition unit, configured to obtain a second sample set consisting of second positive samples and second negative samples, each sample being a query pair;

a first training unit, configured to train M neural network models separately on the first sample set to obtain M base models, where M is a positive integer;

a feature extraction unit, configured to extract features from the second sample set, where the features include the score that each of the M base models gives to each query pair in the second sample set;

a second training unit, configured to train a classification model on the features extracted by the feature extraction unit to obtain the query rewrite discrimination model.

According to a preferred embodiment of the invention, the first sample acquisition unit is specifically configured to obtain the first sample set as follows:

where the first threshold is higher than the second threshold.

According to a preferred embodiment of the invention, the second sample acquisition unit is specifically configured to obtain the second sample set as follows:

According to a preferred embodiment of the invention, the first sample acquisition unit and the second sample acquisition unit are further configured to apply at least one of the following filters to the positive samples:

According to a preferred embodiment of the invention, the first sample acquisition unit and the second sample acquisition unit are further configured to apply at least one of the following filters to the negative samples:

According to a preferred embodiment of the invention, the second training unit is specifically configured to: train N classification models separately on the extracted features to obtain N higher-order models, where N is a positive integer greater than 1; and select from and integrate the N higher-order models to obtain the query rewrite discrimination model.

According to a preferred embodiment of the invention, when selecting from and integrating the N higher-order models to obtain the query rewrite discrimination model, the second training unit specifically performs:

The invention also provides an apparatus for discriminating query rewrites, the apparatus including:

a feature extraction unit, configured to extract features from a query pair to be discriminated, where the features include the scores given to the query pair by M base models, M being a positive integer;

a discrimination unit, configured to input the features extracted by the feature extraction unit into a query rewrite discrimination model to obtain the discrimination result of the query rewrite discrimination model;

where the M base models and the query rewrite discrimination model are obtained by the apparatus for establishing a query rewrite discrimination model described above.

As can be seen from the above technical solutions, the invention uses the scores that self-learned base models give to query pairs as features for training classification models, thereby obtaining the query rewrite discrimination model. This approach uses state-of-the-art machine learning techniques to learn the latent associations in textual expression, and thereby discriminates query rewrites accurately.

【Brief description of the drawings】

Fig. 1 is a flowchart of the method for establishing a query rewrite discrimination model provided by an embodiment of the invention;

Fig. 2 is a diagram of an example of establishing a query rewrite discrimination model provided by an embodiment of the invention;

Fig. 3 is a flowchart of the method for discriminating query rewrites provided by an embodiment of the invention;

Fig. 4 is a diagram of an example of discriminating a query rewrite provided by an embodiment of the invention;

Fig. 5 is a structural diagram of the apparatus for establishing a query rewrite discrimination model provided by an embodiment of the invention;

Fig. 6 is a structural diagram of the apparatus for discriminating query rewrites provided by an embodiment of the invention.

【Detailed description】

To make the objects, technical solutions, and advantages of the invention clearer, the invention is described in detail below with reference to the accompanying drawings and specific embodiments.

The terms used in the embodiments of the invention are for the purpose of describing specific embodiments only and are not intended to limit the invention. The singular forms "a", "said", and "the" used in the embodiments of the invention and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise.

It should be understood that the term "and/or" used herein merely describes an association between associated objects and indicates that three relationships are possible. For example, "A and/or B" can mean: A alone, both A and B, or B alone. In addition, the character "/" herein generally indicates an "or" relationship between the objects before and after it.

Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", "in response to determining", or "in response to detecting". Similarly, depending on the context, the phrase "if it is determined" or "if (a stated condition or event) is detected" may be interpreted as "when it is determined", "in response to determining", "when (a stated condition or event) is detected", or "in response to detecting (a stated condition or event)".

The invention departs from the rule-based approach by using supervised machine learning to discover the regularities of semantic expression from large-scale data and algorithms. The method is described in detail below through embodiments.

Fig. 1 is a flowchart of the method provided by an embodiment of the invention. As shown in Fig. 1, the method may include the following steps:

In 101, a first sample set consisting of first positive samples and first negative samples, each sample being a query pair, is obtained.

The qualifier "first" is used here mainly to distinguish these samples from the sample data qualified by "second" that appears later. It carries no semantic restriction, and the same holds for "second".

The first sample set obtained in this step is a sample set based on big data. Positive samples may be obtained in, but are not limited to, the following two ways:

First way: a high-quality rewritten query for an original query is determined using existing rewriting rules, and the query pair formed by the original query and its high-quality rewritten query is taken as a first positive sample.

As described in the background, query rewriting currently relies mainly on manually formulated rewriting rules, and some of the rewritten queries produced by these rules are of very high quality. In the embodiments of the invention, query pairs can therefore be selected from the rewrite vocabulary obtained through existing rewriting rules, each pair consisting of an original query and a high-quality rewritten query.

Second way: query pairs formed by two queries whose clicked-URL similarity is greater than or equal to a first threshold are obtained from search logs as first positive samples. If the search results of two queries are similar, their semantics (intent) can generally be considered similar as well. For medium- and high-frequency queries in particular, the search results have been validated by user clicks, so this correlation is stronger. In this way a large number of query pairs with similar intent can be obtained at low cost, covering most domains, to serve as positive samples for the base models.

The similarity of the clicked URLs of two queries can be measured in several ways. One such way is given here:

Assume that the two queries of a query pair are queryLeft and queryRight, and that the set of URLs clicked for both queries is overlapUrls. If

min(overlapClickRatioLeft, overlapClickRatioRight) > 0.3 and

max(overlapClickRatioLeft, overlapClickRatioRight) > 0.6,

then queryLeft and queryRight are considered to have a substantial number of common clicks, that is, the similarity of their clicked URLs meets the requirement for a positive sample. Note that 0.3 and 0.6 are preferred threshold choices; other values may be used.

Here, leftUrls is the set of URLs clicked for queryLeft, rightUrls is the set of URLs clicked for queryRight, clickLeft(u) is the number of clicks on URL u for queryLeft, and clickRight(u) is the number of clicks on URL u for queryRight.
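The definitions of overlapClickRatioLeft and overlapClickRatioRight are not reproduced in this text. The sketch below assumes one reading that is consistent with the listed symbols: each ratio is the share of a query's clicks that fall on URLs in overlapUrls. The function names and the dictionary representation of click counts are illustrative assumptions, not part of the source.

```python
from typing import Dict


def overlap_click_ratio(own_clicks: Dict[str, int], other_clicks: Dict[str, int]) -> float:
    """ASSUMED definition: clicks on commonly clicked URLs / all clicks of this query."""
    total = sum(own_clicks.values())
    if total == 0:
        return 0.0
    # overlapUrls: URLs clicked for both queries
    shared = sum(count for url, count in own_clicks.items() if url in other_clicks)
    return shared / total


def is_positive_pair(click_left: Dict[str, int], click_right: Dict[str, int],
                     min_thr: float = 0.3, max_thr: float = 0.6) -> bool:
    """Apply the min/max test from the text with the preferred thresholds 0.3 and 0.6."""
    ratio_left = overlap_click_ratio(click_left, click_right)
    ratio_right = overlap_click_ratio(click_right, click_left)
    return min(ratio_left, ratio_right) > min_thr and max(ratio_left, ratio_right) > max_thr
```

Under this assumed definition, a pair passes only when both queries send a meaningful share of their clicks to the common URLs, and at least one of them sends most of its clicks there.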

The positive samples obtained in the above manner may be filtered in at least one of the following ways:

First filter: if the number of URLs shared by the top q search results of the two queries in a query pair is less than a preset number threshold, the query pair is filtered out. For example, if fewer than 3 URLs are shared among the top 10 URLs in the search results of the two queries, the two queries are not that similar semantically, and the pair can be filtered out of the positive sample set.

Second filter: if the two queries in a query pair yield the same expression after their stop words are removed, the query pair is filtered out.

Third filter: if the two queries in a query pair contain different numeric content, the query pair is filtered out. For example, "Running Man season 3" and "Running Brothers season 4" contain conflicting numeric content and differ considerably in intent, so the pair is not suitable as a positive sample for query rewriting.

Fourth filter: if the total number of URL clicks for the two queries in a query pair is less than a preset click-count threshold, the query pair is filtered out. This situation largely indicates that the query is poorly phrased, so that the search results did not meet the user's need; such query pairs are therefore unsuitable as positive samples.

Fifth filter: if one query in a query pair is a spelling-corrected form of the other, the query pair is filtered out. For example, in "even horizontal bar" and "practice horizontal bar", the latter corrects a wrong character in the former, so the pair is not suitable as a positive sample for query rewriting.
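The first three filters can be sketched as simple predicates. This is a minimal illustration, assuming queries are already tokenized into terms and that numeric content appears as digit characters (the season example above uses written-out numerals in the original language, which this sketch would not catch). All function names and default values other than q = 10 and the threshold of 3 are hypothetical.

```python
import re
from typing import Iterable, List, Set


def shares_enough_top_urls(top_left: List[str], top_right: List[str],
                           q: int = 10, min_common: int = 3) -> bool:
    """First filter: keep the pair only if the top-q results share enough URLs."""
    return len(set(top_left[:q]) & set(top_right[:q])) >= min_common


def same_after_stop_words(terms_left: Iterable[str], terms_right: Iterable[str],
                          stop_words: Set[str]) -> bool:
    """Second filter: True if the queries are identical once stop words are dropped."""
    left = [t for t in terms_left if t not in stop_words]
    right = [t for t in terms_right if t not in stop_words]
    return left == right


def has_conflicting_numbers(left: str, right: str) -> bool:
    """Third filter: True if the two queries contain different sets of digit runs."""
    return set(re.findall(r"\d+", left)) != set(re.findall(r"\d+", right))
```

A pair would be kept as a positive sample only if `shares_enough_top_urls` is true and the other two predicates are false; the click-count and spelling-correction filters need log data and a corrector, so they are left out here.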

For negative samples, query pairs formed by two queries whose clicked-URL similarity is less than or equal to a second threshold can be obtained from search logs as first negative samples. The second threshold is lower than the first threshold. With this method, too, a large number of query pairs with dissimilar intent can be obtained at low cost, covering most domains, to serve as negative samples for the base models.

Let overlapN denote the number of URLs common to the top N clicked URLs of the two queries in a query pair. A query pair is taken as a negative sample if it satisfies the following condition:

1 ≤ overlap15 ≤ 3 and 0 ≤ overlap10 ≤ 2 and overlap5 = 0 and clickLeft ≥ 2 and clickRight ≥ 5, where clickLeft is the number of clicks on all URLs for queryLeft and clickRight is the number of clicks on all URLs for queryRight.
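The negative-sample condition translates directly into a predicate. The sketch below assumes that each query's clicked URLs are available as a list ordered by rank, which the text implies but does not spell out; the function names are illustrative.

```python
from typing import List


def top_n_overlap(urls_left: List[str], urls_right: List[str], n: int) -> int:
    """Number of URLs common to the top-n clicked URLs of the two queries."""
    return len(set(urls_left[:n]) & set(urls_right[:n]))


def is_negative_pair(urls_left: List[str], urls_right: List[str],
                     click_left: int, click_right: int) -> bool:
    """Condition from the text: slight overlap deep in the ranking, none at the top."""
    overlap15 = top_n_overlap(urls_left, urls_right, 15)
    overlap10 = top_n_overlap(urls_left, urls_right, 10)
    overlap5 = top_n_overlap(urls_left, urls_right, 5)
    return (1 <= overlap15 <= 3
            and 0 <= overlap10 <= 2
            and overlap5 == 0
            and click_left >= 2
            and click_right >= 5)
```

Requiring at least one shared URL in the top 15 keeps the negatives from being trivially unrelated, while the empty top-5 overlap keeps them clearly different in intent.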

For negative samples, at least one of the following filters may be applied:

First filter: if neither query in a query pair is a query with a preset need, the query pair is filtered out. For example, if query rewriting targets structured search and neither query in a pair has a structured-search need, the pair is filtered out.

Second filter: if a query appears in multiple query pairs, m of those query pairs are retained and the others are filtered out, where m is a preset positive integer. For example, each query may appear in at most 5 query pairs, that is, at most 5 queryRight are retained for each queryLeft; 5 can be chosen at random and the others filtered out.
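The second negative-sample filter can be sketched as a per-query cap with random sampling. The pair representation as `(queryLeft, queryRight)` tuples and the fixed seed are assumptions made for reproducibility of the example.

```python
import random
from collections import defaultdict
from typing import List, Tuple


def cap_pairs_per_query(pairs: List[Tuple[str, str]], m: int = 5,
                        seed: int = 0) -> List[Tuple[str, str]]:
    """Keep at most m randomly chosen queryRight values for each queryLeft."""
    rng = random.Random(seed)
    by_left = defaultdict(list)
    for left, right in pairs:
        by_left[left].append(right)
    kept = []
    for left, rights in by_left.items():
        # Sample only when the query has more than m partners.
        chosen = rights if len(rights) <= m else rng.sample(rights, m)
        kept.extend((left, right) for right in chosen)
    return kept
```

Capping the number of pairs per query keeps a few very frequent queries from dominating the negative set.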

The sample set obtained in the above manner is very large, reaching the order of hundreds of millions of samples.

In 102, M neural network models are trained separately on the first sample set to obtain M base models, where M is a positive integer.

A neural network model can learn features from the samples automatically. The resulting base model can score any input query pair, and the score can be regarded as the rewrite score of queryRight as a query rewrite of queryLeft.

In the embodiments of the invention, the neural network models may be, for example, a bag-of-words neural network based on a multi-layer perceptron (BOW_NN), a convolutional neural network (CNN), or a bidirectional recurrent neural network (BiRNN). Because the implementation of neural network models and the way they learn from text are mature techniques, they are not described further here. Char-BiRNN is a preferred bidirectional recurrent neural network: it does not require word segmentation of the input, and it learns markedly better than the other neural networks.

In addition, relying on a single neural network model is risky. The invention can use multiple neural network models with different architectures, that is, M may be 2 or more, and multiple base models are obtained by training each of them.

However, if only the base models are used to discriminate query rewrites, accuracy is still not high: testing shows that the base models' discrimination accuracy is generally around 70%. This is mainly caused by the sample distribution and by the fact that most samples have overly obvious features, so samples near the boundary are not strictly distinguished, resulting in insufficient precision. To overcome this problem, the following steps are performed to build more accurate higher-order models.

In 103, a second sample set consisting of second positive samples and second negative samples, each sample being a query pair, is obtained.

The second sample set mainly consists of boundary data, so that the model can distinguish finer differences in intent. For example, "diabetes treatment" and "causes of diabetes" should be judged dissimilar, but the pair may not be obtainable with the acquisition strategy for first negative samples described above, because the two queries have some similarity in their clicked URLs and that similarity is not low enough to fall below the second threshold. Samples closer to the boundary therefore need to be mined. First, query pairs formed by two queries whose clicked-URL similarity is greater than or equal to a third threshold and less than or equal to a fourth threshold are obtained from search logs, where the third threshold is greater than the second threshold and the fourth threshold is less than the first threshold. Some of these query pairs are then submitted for manual annotation: pairs annotated as expressing the same meaning are taken as second positive samples, and pairs annotated as expressing different meanings are taken as second negative samples.

Because samples on the boundary require higher annotation accuracy, multiple annotators can be used here. For example, three engineers familiar with query rewriting annotate the pairs independently, and only the annotations on which they agree are retained. The second sample set obtained in this way contains on the order of tens of thousands of samples.
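The two stages of building the second sample set, selecting candidates in the similarity band and keeping only agreed annotations, can be sketched as follows. The sketch assumes that "agreement" means all annotators gave the same label and that a label of 1 means "same meaning"; the names and data shapes are hypothetical.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def select_boundary_pairs(scored_pairs: List[Tuple[Pair, float]],
                          third_thr: float, fourth_thr: float) -> List[Pair]:
    """Keep pairs whose clicked-URL similarity lies in [third_thr, fourth_thr]."""
    return [pair for pair, sim in scored_pairs if third_thr <= sim <= fourth_thr]


def unanimous_labels(annotations: Dict[Pair, List[int]]) -> Dict[Pair, int]:
    """Keep only pairs on which every annotator agrees (ASSUMED reading of agreement)."""
    return {pair: labels[0]
            for pair, labels in annotations.items()
            if labels and len(set(labels)) == 1}
```

Pairs with a label of 1 in the result would become second positive samples and pairs with a label of 0 second negative samples; disputed pairs are simply dropped.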

It should be noted that the order of step 103 relative to steps 101 and 102 is not restricted. Step 103 may be executed at the same time as step 101 or after step 101; the order shown in Fig. 1 is only one possible order.

In 104, features are extracted from the second sample set, where the features include the score that each of the M base models gives to each query pair in the second sample set.

After the second sample set is input into each of the M base models, each base model's score for each query pair is obtained, and these scores can serve as features for training the final query rewrite discrimination model. In effect, these features transfer the knowledge that the base models learned from the large-scale training samples to the manually annotated boundary samples.

In addition to the above features, other features may be included for training the higher-order models, for example one or any combination of the following:

1) Statistical features. For example, the number or proportion of terms in a query, where terms may take n-gram form; whether digits are present, and their proportion.

2) Distance features. For example, the Jaccard distance or edit distance between the two queries. Here the Jaccard measure is the ratio of the number of terms that co-occur in the two queries of a pair to the total number of terms contained in the pair.

3) Position features. For example, the mean positional variance, across the two queries, of the terms they have in common.

4) Term importance features. For example, tf-idf features of the terms in the query pair.

5) Semantic features. For example, the part of speech of terms, sentence constituents, and so on.

6) Synonym rewriting features. For example, identifying terms in the query pair that are synonyms of each other.
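A feature vector of the kind listed above can be sketched by concatenating the base-model scores with a few hand-built features. This is a minimal illustration, assuming whitespace tokenization and covering only term counts, the Jaccard ratio, and edit distance; the feature order and function names are hypothetical.

```python
from typing import List, Sequence


def jaccard(terms_left: Sequence[str], terms_right: Sequence[str]) -> float:
    """Ratio of co-occurring terms to all distinct terms in the pair."""
    a, b = set(terms_left), set(terms_right)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]


def extract_features(left: str, right: str, base_scores: Sequence[float]) -> List[float]:
    """Base-model scores first, then term counts, Jaccard ratio, and edit distance."""
    terms_left, terms_right = left.split(), right.split()
    return list(base_scores) + [
        len(terms_left),
        len(terms_right),
        jaccard(terms_left, terms_right),
        float(edit_distance(left, right)),
    ]
```

For unsegmented text, the whitespace split would be replaced by a tokenizer or character n-grams, as the statistical-feature item suggests.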

In 105, a classification model is trained on the extracted features to obtain the query rewrite discrimination model.

A single classification model may be trained in this step, in which case the trained classification model is the query rewrite discrimination model.

As a preferred embodiment, multiple classification models can be trained in this step: N classification models are trained separately on the extracted features to obtain N higher-order models, where N is a positive integer greater than 1, and the N higher-order models are then selected from and integrated to obtain the query rewrite discrimination model. The query rewrite discrimination model obtained in this way is in effect an ensemble model.

The classification models in this step may be GBDT (gradient boosting decision tree), SVM (support vector machine), LR (logistic regression), RF (random forest), MLP (multi-layer perceptron), and so on. The N classification models may be of different types, or of the same type with different model parameters.

For example, N GBDT models can be trained on the features extracted from the second sample set, each with different model parameters (such as depth, number of decision trees, and learning rate), yielding N higher-order models. The N models could be integrated directly, but not all of them necessarily reach the expected discrimination accuracy, so the models that do reach it can be selected from the N models and integrated.

The results of the N higher-order models can be evaluated on a test set that contains query pairs whose rewrite scores have already been determined. These query pairs are input into each of the N higher-order models to obtain each model's score for each query pair, and the scores are compared with the known rewrite scores in the test set to obtain a test score for each of the N higher-order models; AUC, for example, may be used as the test score. P higher-order models can then be selected according to the test scores, for example those whose test score exceeds a preset threshold, where P is less than or equal to N.

After the P higher-order models are selected, they can be integrated by weighting to obtain the query rewrite discrimination model. Each higher-order model is assigned its own weight. When the query rewrite discrimination model is used to discriminate whether one query is a query rewrite of another, the scores given to the query pair by the higher-order models are weighted, the resulting final score serves as the score of the query rewrite discrimination model, and the discrimination result is produced from that score.
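The selection and weighting procedure can be sketched without committing to a particular classifier library. The sketch below treats each higher-order model as a callable that maps a feature vector to a score, computes AUC by pairwise comparison, and, as an assumption not stated in the source, sets each selected model's weight proportional to its AUC. The threshold value is likewise illustrative.

```python
from typing import Callable, List, Sequence, Tuple

Model = Callable[[Sequence[float]], float]


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def select_and_weight(models: List[Model], test_features: List[Sequence[float]],
                      test_labels: Sequence[int],
                      auc_threshold: float = 0.8) -> List[Tuple[Model, float]]:
    """Keep models whose test AUC exceeds the threshold; weight them by AUC (ASSUMPTION)."""
    selected = []
    for model in models:
        score = auc(test_labels, [model(x) for x in test_features])
        if score > auc_threshold:
            selected.append((model, score))
    total = sum(score for _, score in selected)
    return [(model, score / total) for model, score in selected]


def ensemble_score(weighted_models: List[Tuple[Model, float]],
                   features: Sequence[float]) -> float:
    """Weighted sum of the selected models' scores for one query pair."""
    return sum(weight * model(features) for model, weight in weighted_models)
```

The pairwise AUC is quadratic in the test-set size, which is acceptable for a small evaluation set; a rank-based computation would be the usual choice at scale.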

A specific example follows:

As shown in Fig. 2, big-data samples are obtained in the manner described in 101, and boundary samples are obtained in the manner described in 103. Three models, BOW_NN, CNN, and BiRNN, are trained separately on the big-data samples. The boundary samples are then input into the three trained models to obtain a score for each query pair in the boundary samples. The scores of the three models are used as features, together with other features extracted from the boundary samples, such as statistical features, distance features, position features, term importance features, semantic features, and synonym rewriting features, to train N GBDT models. P GBDT models are then selected from them and integrated to obtain the final query rewrite discrimination model.

After the query rewrite discrimination model has been established, the process of discriminating query rewrites with this model may be as shown in Fig. 3 and includes the following steps:

In 301, features are extracted from the query pair to be discriminated; these features include the scores given to this query pair by each of the above underlying models.

Suppose it is to be discriminated whether one query of a query pair is a query rewrite of the other query. The query pair can be input into the M underlying models trained in the above embodiment to obtain each underlying model's score for the pair. These M scores are used as features and are further combined with the statistical features, distance features, position features, word importance features, semantic features, synonym rewrite features and the like extracted from the query pair (whichever features were used when training the query rewrite discrimination model are the features extracted here from the query pair to be discriminated).
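A minimal sketch of how such a feature vector might be assembled is given here. The underlying models are represented as plain callables that return a score, and the two hand-written extractors (`length_features`, `token_overlap`) are toy stand-ins for the statistical and distance features; all of these names are hypothetical and are not taken from the patent.

```python
from typing import Callable, List, Sequence, Tuple

QueryPair = Tuple[str, str]
Scorer = Callable[[str, str], float]            # stand-in for an underlying model
Extractor = Callable[[str, str], List[float]]   # stand-in for a feature extractor


def length_features(left: str, right: str) -> List[float]:
    """Toy statistical feature: token count of each query."""
    return [float(len(left.split())), float(len(right.split()))]


def token_overlap(left: str, right: str) -> List[float]:
    """Toy distance feature: Jaccard overlap of the two token sets."""
    a, b = set(left.split()), set(right.split())
    union = a | b
    return [len(a & b) / len(union) if union else 0.0]


def build_features(
    pair: QueryPair,
    base_models: Sequence[Scorer],
    extractors: Sequence[Extractor],
) -> List[float]:
    """Concatenate the M underlying-model scores with the other features.

    The same models and extractors, in the same order, must be used at
    training time and at discrimination time.
    """
    left, right = pair
    features = [model(left, right) for model in base_models]
    for extract in extractors:
        features.extend(extract(left, right))
    return features
```

Keeping the model scores first and the remaining features in a fixed order is one simple way to guarantee that training and discrimination see identical feature layouts.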

In 302, the extracted features are input into the query rewrite discrimination model to obtain its discrimination result.

If the query rewrite discrimination model is an integration of multiple high-order models, this step actually inputs the extracted features into each high-order model to obtain each high-order model's score for the query pair to be discriminated. These scores are then weighted according to the weights of the high-order models, for example by weighted summation or weighted averaging, and the final score is used to discriminate whether one query of the pair is a query rewrite of the other query.
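The weighting step can be sketched as follows. The scores are assumed to have been produced already by the P high-order models; the function names and the decision threshold of 0.5 are illustrative assumptions, since the text only states that the discrimination result is derived from the final score.

```python
from typing import Sequence


def ensemble_score(
    scores: Sequence[float],
    weights: Sequence[float],
    average: bool = True,
) -> float:
    """Combine high-order model scores by weighted averaging or summation."""
    if not scores or len(scores) != len(weights):
        raise ValueError("need one weight per score")
    total = sum(w * s for w, s in zip(weights, scores))
    if not average:
        return total  # weighted summation
    weight_sum = sum(weights)
    if weight_sum <= 0:
        raise ValueError("weights must sum to a positive value")
    return total / weight_sum  # weighted averaging


def is_rewrite(
    scores: Sequence[float],
    weights: Sequence[float],
    decision_threshold: float = 0.5,  # assumed cut-off, not from the patent
) -> bool:
    """Discriminate a query pair from the weighted-average score."""
    return ensemble_score(scores, weights) >= decision_threshold
```

Weighted averaging keeps the final score on the same scale as the individual model scores, which makes a single cut-off easier to interpret than with an unnormalized weighted sum.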

Taking the query rewrite discrimination model shown in Fig. 2 as an example:

As shown in Fig. 4, the query pair to be discriminated is input into the three underlying models BOW_NN, CNN and BiRNN to obtain three output scores. Statistical features, distance features, position features, word importance features, semantic features, synonym rewrite features and the like are extracted from the query pair to be discriminated and, together with the above three scores, are input as features into the query rewrite discrimination model. This model is an integration of P GBDT models; the scores output by the P GBDT models are weighted to yield the final discrimination result of the query rewrite discrimination model.

The above describes the method provided by the present invention; the apparatus provided by the present invention is described in detail below.

Fig. 5 is a structural diagram of an apparatus for establishing a query rewrite discrimination model according to an embodiment of the present invention. As shown in Fig. 5, the apparatus may include a first sample acquisition unit 01, a second sample acquisition unit 02, a first training unit 03, a feature extraction unit 04 and a second training unit 05. The main functions of the units are as follows:

The first sample acquisition unit 01 is responsible for obtaining a first sample set of query pairs, consisting of first positive samples and first negative samples.

Specifically, the first sample acquisition unit 01 may obtain the first sample set as follows:

From the search log, query pairs formed by two queries whose clicked-url similarity is greater than or equal to a first threshold are obtained as first positive samples; and/or a high-quality rewritten query of an original query is determined using existing rewrite rules, and the query pair formed by the original query and the high-quality rewritten query is taken as a first positive sample.

From the search log, query pairs formed by two queries whose clicked-url similarity is less than or equal to a second threshold are obtained as first negative samples, where the first threshold is higher than the second threshold.
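The threshold-based construction of the first sample set can be sketched as follows. The text does not define how the clicked-url similarity is computed; this sketch assumes Jaccard similarity over the sets of urls clicked for each query, and the function names and input layout are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

QueryPair = Tuple[str, str]


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Assumed similarity measure between two sets of clicked urls."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_first_sample_set(
    clicked_urls: Dict[str, Set[str]],
    first_threshold: float,
    second_threshold: float,
) -> Tuple[List[QueryPair], List[QueryPair]]:
    """Split query pairs into first positive and first negative samples."""
    if first_threshold <= second_threshold:
        raise ValueError("first threshold must be higher than second threshold")
    positives: List[QueryPair] = []
    negatives: List[QueryPair] = []
    for q1, q2 in combinations(sorted(clicked_urls), 2):
        sim = jaccard(clicked_urls[q1], clicked_urls[q2])
        if sim >= first_threshold:
            positives.append((q1, q2))
        elif sim <= second_threshold:
            negatives.append((q1, q2))
        # pairs strictly between the two thresholds are left out of this set
    return positives, negatives
```

Enumerating all pairs is quadratic in the number of queries, so a real search log would need candidate generation (for example, grouping queries by shared url) before this comparison.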

The second sample acquisition unit 02 is responsible for obtaining a second sample set of query pairs, consisting of second positive samples and second negative samples.

Specifically, the second sample acquisition unit 02 may obtain the second sample set as follows:

First, query pairs formed by two queries whose clicked-url similarity is greater than or equal to a third threshold and less than or equal to a fourth threshold are obtained from the search log, where the third threshold is greater than the second threshold and the fourth threshold is less than the first threshold. Then, according to the results of manual annotation of these query pairs, the pairs manually labeled as expressing the same meaning are taken as second positive samples, and the pairs manually labeled as expressing different meanings are taken as second negative samples.
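A minimal sketch of this two-stage construction follows. It assumes the clicked-url similarity of each candidate pair has already been computed and that manual annotations are available as a mapping from query pair to a boolean (True for "same meaning"); the function name and both input layouts are hypothetical.

```python
from typing import Dict, List, Tuple

QueryPair = Tuple[str, str]


def build_second_sample_set(
    pair_similarity: Dict[QueryPair, float],
    manual_labels: Dict[QueryPair, bool],
    third_threshold: float,
    fourth_threshold: float,
) -> Tuple[List[QueryPair], List[QueryPair]]:
    """Select pairs in the similarity band, then split them by manual label."""
    if third_threshold > fourth_threshold:
        raise ValueError("third threshold must not exceed fourth threshold")
    positives: List[QueryPair] = []
    negatives: List[QueryPair] = []
    for pair, sim in pair_similarity.items():
        if not (third_threshold <= sim <= fourth_threshold):
            continue  # outside the band between the two thresholds
        label = manual_labels.get(pair)
        if label is None:
            continue  # not annotated yet, so it cannot be used
        (positives if label else negatives).append(pair)
    return positives, negatives
```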

For the positive samples obtained in the above manner, the first sample acquisition unit 01 and the second sample acquisition unit 02 may apply at least one of the following filters:

Filter 1: if the number of urls shared by the top q search results of the two queries in a query pair is less than a preset number threshold, the query pair is filtered out, where q is a preset positive integer.

Filter 2: if the two queries in a query pair yield the same expression after stop words are removed from each, the query pair is filtered out.

Filter 3: if the two queries in a query pair contain different numeric content, the query pair is filtered out.

Filter 4: if the total number of clicks on the urls corresponding to the two queries in a query pair is less than a preset click-count threshold, the query pair is filtered out.

Filter 5: if one query in a query pair is an error-corrected expression of the other query, the query pair is filtered out.
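The first four positive-sample filters can be sketched as a single predicate. The record layout `PairStats`, the stop-word list and all default thresholds are assumptions made for illustration; the fifth filter is omitted because it depends on an error-correction component that the text does not describe. "Different numeric content" is interpreted here as the two queries containing different sets of digit strings.

```python
import re
from dataclasses import dataclass
from typing import List, Set

# Hypothetical stop-word list; a real system would use a curated one.
STOP_WORDS = frozenset({"the", "a", "of", "in", "to"})


@dataclass
class PairStats:
    left: str
    right: str
    top_urls_left: List[str]    # ranked search results for the left query
    top_urls_right: List[str]   # ranked search results for the right query
    total_clicks: int           # total url clicks over both queries


def strip_stop_words(query: str) -> List[str]:
    return [t for t in query.lower().split() if t not in STOP_WORDS]


def numbers(query: str) -> Set[str]:
    return set(re.findall(r"\d+", query))


def keep_positive(
    p: PairStats, q: int = 10, min_common: int = 2, min_clicks: int = 5
) -> bool:
    """Return False if any of filters 1 to 4 removes the pair."""
    common = set(p.top_urls_left[:q]) & set(p.top_urls_right[:q])
    if len(common) < min_common:
        return False  # filter 1: too few shared urls in the top q results
    if strip_stop_words(p.left) == strip_stop_words(p.right):
        return False  # filter 2: identical once stop words are removed
    if numbers(p.left) != numbers(p.right):
        return False  # filter 3: different numeric content
    if p.total_clicks < min_clicks:
        return False  # filter 4: too few clicks to trust the pair
    return True
```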

For the negative samples, the first sample acquisition unit 01 and the second sample acquisition unit 02 may apply at least one of the following filters:

Filter 1: if none of the queries in a query pair is a query with a preset need, the query pair is filtered out.

Filter 2: if one query appears in multiple query pairs, m of those query pairs are retained and the others are filtered out, where m is a preset positive integer.
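The second negative-sample filter can be sketched as a per-query cap. The text does not say which m pairs are retained; this sketch keeps the first ones encountered and counts a query on either side of a pair, both of which are assumptions, and the function name is hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple

QueryPair = Tuple[str, str]


def cap_pairs_per_query(pairs: Iterable[QueryPair], m: int) -> List[QueryPair]:
    """Keep a pair only while neither of its queries already has m kept pairs."""
    if m < 1:
        raise ValueError("m must be a positive integer")
    kept: List[QueryPair] = []
    counts: Counter = Counter()
    for left, right in pairs:
        if counts[left] >= m or counts[right] >= m:
            continue  # one of the queries already appears in m retained pairs
        kept.append((left, right))
        counts[left] += 1
        counts[right] += 1
    return kept
```

Capping the number of pairs per query keeps very frequent queries from dominating the negative samples.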

The first training unit 03 is responsible for training M neural network models on the first sample set to obtain M underlying models, where M is a positive integer. The neural network models may include, but are not limited to, BOW_NN, CNN and BiRNN. A neural network model can learn features from the samples automatically, and the resulting underlying model can score any input query pair; the score can be regarded as a score for the queryRight of the query pair being a query rewrite of its queryLeft.

The feature extraction unit 04 is responsible for extracting features from the second sample set. The features include the scores given by each of the M underlying models to each query pair in the second sample set, and further include one or any combination of statistical features, distance features, position features, word importance features, semantic features, synonym rewrite features and the like.

The second training unit 05 is responsible for training a classification model using the features extracted by the feature extraction unit 04 to obtain the query rewrite discrimination model. A single classification model may be trained in this step, in which case the trained classification model is the query rewrite discrimination model. In a preferred embodiment, the second training unit 05 may use the extracted features to train N classification models separately, obtaining N high-order models, where N is a positive integer greater than 1, and then select from and integrate the N high-order models to obtain the query rewrite discrimination model.

The classification model may be one or any combination of GBDT, SVM, LR, RF, MLP and the like. The multiple classification models used may be of different types, or of the same type but with different model parameters.

When selecting from and integrating the N high-order models to obtain the query rewrite discrimination model, the second training unit 05 may integrate the N high-order models directly. Alternatively, it may evaluate the results of the N high-order models on a test set containing query pairs whose rewrite scores have been determined, select P high-order models according to the test evaluation, where P is less than or equal to N, and weight the P high-order models to obtain the query rewrite discrimination model.

Fig. 6 is a structural diagram of an apparatus for discriminating query rewrites according to an embodiment of the present invention. As shown in Fig. 6, the apparatus includes a feature extraction unit 11 and a discrimination unit 12. The main functions of the units are as follows:

The feature extraction unit 11 is responsible for extracting features from the query pair to be discriminated. The features include the scores given to this query pair by M underlying models, where M is a positive integer. The underlying models are those trained in the above embodiment. These M scores are used as features and are further combined with the statistical features, distance features, position features, word importance features, semantic features, synonym rewrite features and the like extracted from the query pair. The features extracted by the feature extraction unit 11 are consistent with those extracted by the feature extraction unit 04 in the embodiment shown in Fig. 5.

The discrimination unit 12 is responsible for inputting the features extracted by the feature extraction unit 11 into the query rewrite discrimination model to obtain its discrimination result. If the query rewrite discrimination model is an integration of multiple high-order models, the discrimination unit 12 actually inputs the extracted features into each high-order model to obtain each high-order model's score for the query pair to be discriminated, and then weights these scores according to the weights of the high-order models, for example by weighted summation or weighted averaging. The final score is used to discriminate whether one query of the query pair to be discriminated is a query rewrite of the other query.

The above method and apparatus provided by embodiments of the present invention can be used to accurately discriminate whether one query can serve as a query rewrite of another query. They can be used offline to build and optimize a rewrite dictionary, online to discriminate and select query rewrites, and in a variety of other application scenarios, which are not enumerated here.

In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division into units is only a division by logical function, and other divisions are possible in actual implementation.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in hardware, or in hardware plus software functional units.

An integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) or a processor to perform some of the steps of the methods of the embodiments of the present invention. The storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk or an optical disc.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A method for establishing a query rewrite discrimination model, characterized in that the method comprises:

training M neural network models respectively using a first sample set of query pairs consisting of first positive samples and first negative samples, to obtain M underlying models, M being a positive integer;

extracting features from a second sample set of query pairs consisting of second positive samples and second negative samples, the features including the scores given by each of the M underlying models to each query pair in the second sample set;

training a classification model using the extracted features to obtain the query rewrite discrimination model.
2. The method according to claim 1, characterized in that the first sample set is obtained as follows:

obtaining from a search log, as first positive samples, query pairs formed by two queries whose clicked-url similarity is greater than or equal to a first threshold, and/or determining a high-quality rewritten query of an original query using existing rewrite rules and taking the query pair formed by the original query and the high-quality rewritten query as a first positive sample;

obtaining from the search log, as first negative samples, query pairs formed by two queries whose clicked-url similarity is less than or equal to a second threshold;

wherein the first threshold is higher than the second threshold.
3. The method according to claim 2, characterized in that the second sample set is obtained as follows:

obtaining from the search log query pairs formed by two queries whose clicked-url similarity is greater than or equal to a third threshold and less than or equal to a fourth threshold, the third threshold being greater than the second threshold and the fourth threshold being less than the first threshold;

according to the results of manual annotation of the query pairs, taking the query pairs manually labeled as expressing the same meaning as second positive samples, and taking the query pairs manually labeled as expressing different meanings as second negative samples.
4. The method according to claim 2 or 3, characterized in that at least one of the following filters is applied to the positive samples:

if the number of urls shared by the top q search results of the two queries in a query pair is less than a preset number threshold, filtering out the query pair, q being a preset positive integer;

if the two queries in a query pair yield the same expression after stop words are removed from each, filtering out the query pair;

if the two queries in a query pair contain different numeric content, filtering out the query pair;

if the total number of clicks on the urls corresponding to the two queries in a query pair is less than a preset click-count threshold, filtering out the query pair;

if one query in a query pair is an error-corrected expression of the other query, filtering out the query pair.
5. The method according to claim 2 or 3, characterized in that at least one of the following filters is applied to the negative samples:

if none of the queries in a query pair is a query with a preset need, filtering out the query pair;

if one query appears in multiple query pairs, retaining m of those query pairs and filtering out the others, m being a preset positive integer.
6. The method according to claim 1, characterized in that the neural network models include at least one of the following:

a neural network BOW_NN based on a multilayer perceptron, a convolutional neural network CNN, and a bidirectional recurrent neural network BiRNN.
7. The method according to claim 1, characterized in that the features further include one or any combination of the following:

statistical features, distance features, position features, word importance features, semantic features and synonym rewrite features.
8. The method according to claim 1, characterized in that training a classification model using the extracted features to obtain the query rewrite discrimination model comprises:

training N classification models respectively using the extracted features to obtain N high-order models, N being a positive integer greater than 1;

selecting from and integrating the N high-order models to obtain the query rewrite discrimination model.
9. The method according to claim 8, characterized in that the classification models include at least one of the following:

gradient boosting decision tree GBDT, support vector machine SVM, logistic regression LR, random forest RF, and multilayer perceptron MLP.
10. The method according to claim 8, characterized in that selecting from and integrating the N high-order models to obtain the query rewrite discrimination model comprises:

evaluating the results of the N high-order models using a test set, the test set containing query pairs whose rewrite scores have been determined;

selecting P high-order models according to the test evaluation, P being less than or equal to N;

weighting the P high-order models to obtain the query rewrite discrimination model.
11. A method for discriminating query rewrites, characterized in that the method comprises:

extracting features from a query pair to be discriminated, the features including the scores given to the query pair by M underlying models, M being a positive integer;

inputting the extracted features into a query rewrite discrimination model to obtain the discrimination result of the query rewrite discrimination model;

wherein the M underlying models and the query rewrite discrimination model are obtained using the method according to any one of claims 1 to 10.
12. An apparatus for establishing a query rewrite discrimination model, characterized in that the apparatus comprises:

a first sample acquisition unit, configured to obtain a first sample set of query pairs consisting of first positive samples and first negative samples;

a second sample acquisition unit, configured to obtain a second sample set of query pairs consisting of second positive samples and second negative samples;

a first training unit, configured to train M neural network models respectively using the first sample set to obtain M underlying models, M being a positive integer;

a feature extraction unit, configured to extract features from the second sample set, the features including the scores given by each of the M underlying models to each query pair in the second sample set;

a second training unit, configured to train a classification model using the features extracted by the feature extraction unit to obtain the query rewrite discrimination model.
13. The apparatus according to claim 12, characterized in that the first sample acquisition unit is specifically configured to obtain the first sample set as follows:

obtaining from a search log, as first positive samples, query pairs formed by two queries whose clicked-url similarity is greater than or equal to a first threshold, and/or determining a high-quality rewritten query of an original query using existing rewrite rules and taking the query pair formed by the original query and the high-quality rewritten query as a first positive sample;

obtaining from the search log, as first negative samples, query pairs formed by two queries whose clicked-url similarity is less than or equal to a second threshold;

wherein the first threshold is higher than the second threshold.
14. The apparatus according to claim 12, characterized in that the second sample acquisition unit is specifically configured to obtain the second sample set as follows:

obtaining from the search log query pairs formed by two queries whose clicked-url similarity is greater than or equal to a third threshold and less than or equal to a fourth threshold, the third threshold being greater than the second threshold and the fourth threshold being less than the first threshold;

according to the results of manual annotation of the query pairs, taking the query pairs manually labeled as expressing the same meaning as second positive samples, and taking the query pairs manually labeled as expressing different meanings as second negative samples.
15. The apparatus according to claim 13 or 14, characterized in that the first sample acquisition unit and the second sample acquisition unit are further configured to apply at least one of the following filters to the positive samples:

if the number of urls shared by the top q search results of the two queries in a query pair is less than a preset number threshold, filtering out the query pair, q being a preset positive integer;

if the two queries in a query pair yield the same expression after stop words are removed from each, filtering out the query pair;

if the two queries in a query pair contain different numeric content, filtering out the query pair;

if the total number of clicks on the urls corresponding to the two queries in a query pair is less than a preset click-count threshold, filtering out the query pair;

if one query in a query pair is an error-corrected expression of the other query, filtering out the query pair.
16. The apparatus according to claim 13 or 14, characterized in that the first sample acquisition unit and the second sample acquisition unit are further configured to apply at least one of the following filters to the negative samples:

if none of the queries in a query pair is a query with a preset need, filtering out the query pair;

if one query appears in multiple query pairs, retaining m of those query pairs and filtering out the others, m being a preset positive integer.
17. The apparatus according to claim 12, characterized in that the neural network models include at least one of the following:

a neural network BOW_NN based on a multilayer perceptron, a convolutional neural network CNN, and a bidirectional recurrent neural network BiRNN.
18. The apparatus according to claim 12, characterized in that the features further include one or any combination of the following:

statistical features, distance features, position features, word importance features, semantic features and synonym rewrite features.
19. The apparatus according to claim 12, characterized in that the second training unit is specifically configured to: train N classification models respectively using the extracted features to obtain N high-order models, N being a positive integer greater than 1; and select from and integrate the N high-order models to obtain the query rewrite discrimination model.
20. The apparatus according to claim 19, characterized in that the classification models include at least one of the following:

gradient boosting decision tree GBDT, support vector machine SVM, logistic regression LR, random forest RF, and multilayer perceptron MLP.
21. The apparatus according to claim 19, characterized in that, when selecting from and integrating the N high-order models to obtain the query rewrite discrimination model, the second training unit specifically performs:

evaluating the results of the N high-order models using a test set, the test set containing query pairs whose rewrite scores have been determined;

selecting P high-order models according to the test evaluation, P being less than or equal to N;

weighting the P high-order models to obtain the query rewrite discrimination model.
22. An apparatus for discriminating query rewrites, characterized in that the apparatus comprises:

a feature extraction unit, configured to extract features from a query pair to be discriminated, the features including the scores given to the query pair by M underlying models, M being a positive integer;

a discrimination unit, configured to input the features extracted by the feature extraction unit into a query rewrite discrimination model to obtain the discrimination result of the query rewrite discrimination model;

wherein the M underlying models and the query rewrite discrimination model are obtained using the apparatus according to any one of claims 12 to 21.