CN106844530A

CN106844530A - Training method and device for a question-answer pair classification model

Info

Publication number: CN106844530A
Application number: CN201611249261.2A
Authority: CN
Inventors: 庞伟
Original assignee: Beijing Qihoo Technology Co Ltd
Current assignee: Beijing Qihoo Technology Co Ltd
Priority date: 2016-12-29
Filing date: 2016-12-29
Publication date: 2017-06-13

Abstract

Embodiments of the invention provide a training method and device for a question-answer pair classification model. The method includes: obtaining question-answer pair data; extracting question-answer pair features from the question-answer pair data; labeling the question-answer pair data with classification labels according to the quality of the question-answer pair data; and training a question-answer pair classification model using the question-answer pair features and the classification labels. A large training set is labeled automatically from the quality of the question-answer pair data, and the trained model classifies question-answer pairs, that is, predicts a quality score, without hand-crafted strategies. This avoids the problems of such strategies: they use little feature information, users rarely give active feedback, the result relies on the asker's subjective judgment, advertising fraud is serious, and the imbalance in user feedback between new and historical question-answer pair data makes the strategy unstable. Good prediction accuracy is obtained on both historical and newly generated question-answer pair data.

Description

Training method and device for a question-answer pair classification model

Technical field

The present invention relates to the technical field of computer processing, and in particular to a training method for a question-answer pair classification model and a training device for a question-answer pair classification model.

Background

At present there are many interactive Q&A platforms on the network. A user posts a question on a Q&A platform, and the platform invites other users to answer it and resolve the asker's doubt.

Q&A platforms have accumulated large numbers of users and produced massive amounts of question-answer pair data (i.e. questions and answers). The quality of this data varies: low-quality question-answer pair data has little value and harms the user experience, whereas high-quality question-answer pair data is an important data resource of a Q&A platform.

To mine high-quality question-answer pair data, the traditional approach computes a quality score with a hand-crafted strategy: a strategy is designed that judges the quality of question-answer pair data from the feedback that the asker or other users give on the answer.

For example, a Q&A platform provides interactive buttons, such as like and dislike buttons, for other users to interact with. When the asker marks an answer as the "best answer", or when the number of likes exceeds the number of dislikes, the answer can be judged to be of good quality.

However, a hand-crafted strategy uses little feature information, users rarely give active feedback, the result relies on the asker's subjective judgment, advertising fraud is serious, and the imbalance in user feedback between new and historical question-answer pair data makes the strategy unstable. As a result, the accuracy of the quality judgment for question-answer pair data is low.

In particular, for newly generated question-answer pair data, which lacks user feedback, the accuracy is even lower.

Summary of the invention

In view of the above problems, the present invention is proposed to provide a training method for a question-answer pair classification model, and a corresponding training device for a question-answer pair classification model, that overcome the above problems or at least partially solve them.

According to one aspect of the invention, a training method for a question-answer pair classification model is provided, including:

Obtaining question-answer pair data;

Extracting question-answer pair features from the question-answer pair data;

Labeling the question-answer pair data with classification labels according to the quality of the question-answer pair data;

Training a question-answer pair classification model using the question-answer pair features and the classification labels.

Optionally, the question-answer pair features include one or more of the following:

Asker features, answerer features, question-answer pair text semantic features, question-answer pair numeric features, and user feedback features.

Optionally, the question-answer pair data includes a question and an answer, and the question-answer pair text semantic features include a question-answer pairing feature;

The step of extracting question-answer pair features from the question-answer pair data includes:

Finding word pairs in which a term in the question co-occurs with a term in the answer;

Counting the number of co-occurring word pairs as the question-answer pairing feature.

Optionally, the question-answer pair data includes a question and an answer, and the question-answer pair text semantic features include a question-answer pair word mover's distance;

Extracting keywords from the question to generate a question keyword set;

Extracting keywords from the answer to generate an answer keyword set;

Calculating the similarity between the question keyword set and the answer keyword set;

Accumulating the similarities to obtain the question-answer pair word mover's distance.

Optionally, the question-answer pair data includes a question and an answer, and the question-answer pair text semantic features include a question-answer pair sentence similarity;

Converting the question into a first sentence vector;

Converting the answer into a second sentence vector;

Calculating the similarity between the first sentence vector and the second sentence vector as the question-answer pair sentence similarity.

Optionally, the step of labeling the question-answer pair data with classification labels according to the quality of the question-answer pair data includes:

Looking up the search record data recorded when the question-answer pair data was searched;

Labeling the question-answer pair data with classification labels according to the search record data.

Optionally, the step of labeling the question-answer pair data with classification labels according to the search record data includes:

Mining the average click weight of the question-answer pair data under search keywords;

Mining the last-click weight of the question-answer pair data under search keywords;

Fitting a continuous score from the average click weight and the last-click weight;

Discretizing the continuous score into a classification label.

Optionally, the step of mining the average click weight of the question-answer pair data under search keywords includes:

Recording the address of the web page to which the question-answer pair data belongs;

Calculating the click score of the address under a specified search keyword;

Using the click score to calculate the click score distribution information of the address under the specified search keyword;

Using the click score distribution information to calculate the average click weight of the question-answer pair data under search keywords.

Optionally, the step of calculating the click score of the address under a specified search keyword includes:

Counting the number of clicks on the address under the specified keyword;

Counting the number of searches for the specified keyword;

Calculating the click score of the address under the specified search keyword from the number of clicks and the number of searches.

Optionally, the step of mining the last-click weight of the question-answer pair data under search keywords includes:

Recording the address of the web page to which the question-answer pair data belongs;

Calculating the last-click score of the address under a specified search keyword;

Using the last-click score to calculate the last-click weight of the question-answer pair data under search keywords;

Optionally, the step of calculating the last-click score of the address under a specified search keyword includes:

Counting the number of last clicks on the address under the specified keyword;

Counting the number of searches for the specified keyword;

Calculating the last-click score of the address under the specified search keyword from the number of last clicks and the number of searches.

Optionally, after the step of labeling the question-answer pair data with classification labels according to the quality of the question-answer pair data, the method further includes:

Normalizing the question-answer pair features.

Optionally, the step of normalizing the question-answer pair features includes:

Computing the mean and standard deviation of each dimension of the question-answer pair features;

Subtracting the mean from each dimension of the question-answer pair features and dividing by the standard deviation.

Adjusting the classification label of the current question-answer pair data according to neighboring question-answer pair data.

Optionally, the step of adjusting the classification label of the current question-answer pair data according to neighboring question-answer pair data includes:

Clustering the question-answer pair data;

For each question-answer pair, selecting the N nearest-neighbor question-answer pairs after clustering;

Calculating the distance between the current question-answer pair data and the neighboring question-answer pair data;

Refitting the classification label based on the distance.

Optionally, the method further includes:

Identifying the importance of the question-answer pair features to the question-answer pair classification model;

Extending the M most important question-answer pair features to obtain extended question-answer pair features, and returning to the step of labeling the question-answer pair data with classification labels according to the quality of the question-answer pair data.

According to another aspect of the invention, a training device for a question-answer pair classification model is provided, including:

A question-answer pair data acquisition module, adapted to obtain question-answer pair data;

A question-answer pair feature extraction module, adapted to extract question-answer pair features from the question-answer pair data;

A classification label labeling module, adapted to label the question-answer pair data with classification labels according to the quality of the question-answer pair data;

A model training module, adapted to train a question-answer pair classification model using the question-answer pair features and the classification labels.

The question-answer pair feature extraction module is further adapted to:

Extract keywords from the answer to generate an answer keyword set;

Convert the question into a first sentence vector;

Convert the answer into a second sentence vector;

Optionally, the classification label labeling module is further adapted to:

Discretize the continuous score into a classification label.

Optionally, the classification label labeling module is further adapted to:

Record the address of the web page to which the question-answer pair data belongs;

Calculate the click score of the address under a specified search keyword;

Optionally, the classification label labeling module is further adapted to:

Count the number of clicks on the address under the specified keyword;

Count the number of searches for the specified keyword;

Optionally, the classification label labeling module is further adapted to:

Record the address of the web page to which the question-answer pair data belongs;

Optionally, the classification label labeling module is further adapted to:

Count the number of last clicks on the address under the specified keyword;

Count the number of searches for the specified keyword;

Optionally, the device further includes:

A normalization module, adapted to normalize the question-answer pair features.

Optionally, the normalization module is further adapted to:

Optionally, the device further includes:

A classification label adjustment module, adapted to adjust the classification label of the current question-answer pair data according to neighboring question-answer pair data.

Optionally, the classification label adjustment module is further adapted to:

Cluster the question-answer pair data;

Refit the classification label based on the distance.

Optionally, the device further includes:

An importance identification module, adapted to identify the importance of the question-answer pair features to the question-answer pair classification model;

A question-answer pair feature extension module, adapted to extend the M most important question-answer pair features to obtain extended question-answer pair features, and to return to calling the model training module.

Embodiments of the invention propose a machine-learning-based method of computing quality scores. It makes combined use of question-answer pair features along many dimensions of the question-answer pair data, automatically labels a large training set from the quality of the question-answer pair data, and trains a question-answer pair classification model to perform classification, that is, to predict a quality score. Hand-crafted strategies are avoided, along with their problems: they use little feature information, users rarely give active feedback, the result relies on the asker's subjective judgment, advertising fraud is serious, and the imbalance in user feedback between new and historical question-answer pair data makes the strategy unstable. Good prediction accuracy is obtained on both historical and newly generated question-answer pair data.

The above is only an overview of the technical solution of the invention. So that the technical means of the invention can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features and advantages of the invention become more apparent, specific embodiments of the invention are set out below.

Brief description of the drawings

Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered a limitation of the invention. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:

Fig. 1 shows a flow chart of the steps of a training method for a question-answer pair classification model according to one embodiment of the invention;

Fig. 2 shows a flow chart of the steps of another training method for a question-answer pair classification model according to one embodiment of the invention;

Fig. 3 shows a structural block diagram of a training device for a question-answer pair classification model according to one embodiment of the invention; and

Fig. 4 shows a structural block diagram of another training device for a question-answer pair classification model according to one embodiment of the invention.

Detailed description

Exemplary embodiments of the disclosure are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and its scope conveyed completely to those skilled in the art.

Referring to Fig. 1, a flow chart of the steps of a training method for a question-answer pair classification model according to one embodiment of the invention is shown. The method may specifically include the following steps:

Step 101: obtain question-answer pair data.

Question-answer pair data (Question & Answer, Q&A) includes a question and an answer.

For example, the question "How high is Mount Everest?" and the answer "8844 meters" form one question-answer pair.

Because a question may have one or more answers, one question can form one or more question-answer pairs with its answers.

Step 102: extract question-answer pair features from the question-answer pair data.

In embodiments of the invention, feature engineering is used to extract question-answer pair features, i.e. information that characterizes the question-answer pair data, from the question-answer pair data.

In a specific implementation, the question-answer pair features include one or more of the following:

1. Asker features

Asker features describe the user who posed the question (the asker), for example:

Answer_count_questioner	Number of answers given by the asker
Question_posted_count	Number of questions posted by the asker
bestA_count_questioner	Number of best answers given by the asker
bestA_ratio_questioner	Proportion of best answers among the asker's answers

2. Answerer features

Answerer features describe the user who answered the question (the answerer), for example:

bestA_ratio_answerer	Proportion of best answers given by the answerer within a quarter
A_count_answerer	Number of answers given by the answerer within a quarter
bestA_ratio_answerer	Number of best answers given by the answerer within a quarter
Q_count_answerer	Number of questions posted by the answerer within a quarter
Status_answerer	Identity of the answerer on the Q&A website
Accept_percent_answerer	Rate at which the answerer's answers are accepted on the Q&A website

3. Question-answer pair text semantic features

Question-answer pair text semantic features are semantic features of the question-answer pair data.

In one example of the embodiment of the invention, the question-answer pair text semantic features include a question-answer pairing feature (topic_focus_count_qa). In this example, step 102 may include the following sub-steps:

Sub-step 1021: find word pairs in which a term in the question co-occurs with a term in the answer;

Sub-step 1022: count the number of co-occurring word pairs as the question-answer pairing feature.

The question-answer pairing feature is a numeric feature: the number of word pairs that co-occur in the question and the answer.

A pairing dictionary is generated by mining: over a large amount of question-answer pair data, word pairs are counted in which a term in the question, such as an entity term or a focus term, co-occurs with a term in the answer.

For example, in the question "How high is Mount Everest", "Mount Everest" is the subject of the question (the entity term) and "how high" is the focus of the question (the focus term); together with "8848" and "8848 meters" in the answer, they form high-frequency co-occurring word pairs.

Because a question can be phrased in many ways, this feature is the number of co-occurring word pairs, for example:

Term in question	Term in answer	Statistical indicators
Mount Everest	8848	2.759 2.951 5.710 211 3466 1230
Mount Everest	8844	10.255 10.752 21.007 408 3466 1419
Mount Everest	8848 meters	0.477 0.534 1.011 78 3466 231
Mount Everest	8844 meters	0.282 0.316 0.598 73 3466 134
How many	8848 meters	0.000 0.000 0.000 1 45878 231
How many	8848	0.000 0.000 0.000 2 45878 1230
How many	008848 meter	0.000 0.000 0.000 2 45878 3
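The pairing feature can be sketched as a lookup against the mined pairing dictionary. This is a minimal illustration under stated assumptions: the function name `pairing_feature`, the representation of the dictionary as a set of (question term, answer term) tuples, and the sample entries are all hypothetical and are not taken from the patent.

```python
from itertools import product


def pairing_feature(question_terms, answer_terms, pairing_dict):
    """Count (question term, answer term) pairs present in the pairing dictionary."""
    return sum(
        1
        for q, a in product(set(question_terms), set(answer_terms))
        if (q, a) in pairing_dict
    )


# Hypothetical pairing dictionary, loosely modeled on the Mount Everest example.
pairing_dict = {
    ("Mount Everest", "8848"),
    ("Mount Everest", "8844"),
    ("Mount Everest", "8848 meters"),
    ("how high", "8848 meters"),
}

count = pairing_feature(
    ["Mount Everest", "how high"],  # terms extracted from the question
    ["8848 meters"],                # terms extracted from the answer
    pairing_dict,
)
print(count)
```

Term extraction (finding entity and focus terms) is left out here; the sketch only shows the counting step once both term lists are available.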

In another example of the embodiment of the invention, the question-answer pair text semantic features include a question-answer pair word mover's distance (Word_mover_distance). In this example, step 102 may include the following sub-steps:

Sub-step 1023: extract keywords from the question to generate a question keyword set;

Sub-step 1024: extract keywords from the answer to generate an answer keyword set;

Sub-step 1025: calculate the similarity between the question keyword set and the answer keyword set;

Sub-step 1026: accumulate the similarities to obtain the question-answer pair word mover's distance.

In this example, the question-answer pair word mover's distance can be the accumulated sum over the Cartesian product of the question keyword set and the answer keyword set.

First, the similarity (e.g. cosine similarity) is computed for every pair of terms, one from the question keyword set and one from the answer keyword set; the similarities are then summed into a single value.

For example, the top 5 terms are selected from the question keyword set and the top 15 from the answer keyword set, the cosine similarity of the 75 term pairs is computed, and the results are summed to obtain the question-answer pair word mover's distance.
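A minimal sketch of this accumulation, assuming each keyword has a word vector available in a lookup table. The names `keyword_similarity_sum` and `vectors` and the toy 2-dimensional vectors are hypothetical. Note that the sketch follows the summed pairwise similarity described in the text; it does not solve the optimal-transport problem usually associated with the term "word mover's distance".

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def keyword_similarity_sum(q_keywords, a_keywords, vectors, top_q=5, top_a=15):
    """Sum cosine similarity over the Cartesian product of the top keywords."""
    total = 0.0
    for q in q_keywords[:top_q]:
        for a in a_keywords[:top_a]:
            # Keywords without a vector are skipped (an assumption of this sketch).
            if q in vectors and a in vectors:
                total += cosine(vectors[q], vectors[a])
    return total


# Toy word vectors, for illustration only.
vectors = {
    "everest": [1.0, 0.0],
    "height": [0.0, 1.0],
    "meters": [0.0, 2.0],
}
feature = keyword_similarity_sum(["everest", "height"], ["meters"], vectors)
print(feature)
```

With the defaults `top_q=5` and `top_a=15`, at most 75 term pairs contribute to the sum, matching the example in the text.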

In another example of the embodiment of the invention, the question-answer pair text semantic features include a question-answer pair sentence similarity (Cosine_sim_qa). In this example, step 102 may include the following sub-steps:

Sub-step 1027: convert the question into a first sentence vector;

Sub-step 1028: convert the answer into a second sentence vector;

Sub-step 1029: calculate the similarity between the first sentence vector and the second sentence vector as the question-answer pair sentence similarity.

In this example, the question is represented as one sentence vector and the answer as another, and the similarity (e.g. cosine similarity) between the two sentence vectors is computed.
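The source does not say how a sentence is converted into a vector. The sketch below assumes a simple bag-of-words count vector (a stand-in, not the patent's method) to illustrate how the sentence similarity could be computed; `sentence_vector` and `cosine_sim_qa` are hypothetical names.

```python
import math
from collections import Counter


def sentence_vector(tokens):
    """Assumed representation: sparse term-count vector of the sentence."""
    return Counter(tokens)


def cosine_sim_qa(question_tokens, answer_tokens):
    """Cosine similarity between the question vector and the answer vector."""
    q_vec = sentence_vector(question_tokens)
    a_vec = sentence_vector(answer_tokens)
    dot = sum(q_vec[t] * a_vec[t] for t in q_vec)
    norm_q = math.sqrt(sum(c * c for c in q_vec.values()))
    norm_a = math.sqrt(sum(c * c for c in a_vec.values()))
    if norm_q == 0.0 or norm_a == 0.0:
        return 0.0
    return dot / (norm_q * norm_a)


sim = cosine_sim_qa(
    ["how", "high", "is", "everest"],
    ["everest", "is", "8848", "meters", "high"],
)
print(round(sim, 4))
```

Any other sentence embedding could be substituted for `sentence_vector` as long as both sentences are mapped into the same vector space.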

3rd, question and answer are to numerical characteristic

Question and answer are characterized as the feature of the digitized information of question and answer to numeral, for example：

4th, user feedback feature.

User feedback is characterized as other users (non-quizmaster, answerer) to question and answer to the feature of the feedback information of data.

Certainly, above-mentioned judgement processing method is intended only as example, when the embodiment of the present invention is implemented, can be according to actual feelings Condition sets other question and answer to feature, and the embodiment of the present invention is not any limitation as to this.In addition, in addition to above-mentioned question and answer are to feature, this Art personnel can also be according to actual needs using other question and answer to feature, and the embodiment of the present invention is not also limited this System.

Step 103: label the question-answer pair data with classification labels according to the quality of the question-answer pair data.

In embodiments of the invention, the quality of question-answer pair data can be divided into several tiers, each corresponding to a classification label, so that quality is treated as a multi-class classification problem.

For example, quality is divided into three tiers, good, average and poor, which correspond to the three classification labels 4, 2 and 0 respectively.

In one embodiment of the invention, step 103 may include the following sub-steps:

Sub-step 1031: look up the search record data recorded when the question-answer pair data was searched;

Sub-step 1032: label the question-answer pair data with classification labels according to the search record data.

In embodiments of the invention, when a user searches with a search engine, question-answer pair data from Q&A websites often appears among the search results. Recording the user's operations on that question-answer pair data yields search record data, which is stored in the session log of the search engine.

Because user behavior reflects the quality of question-answer pair data to some extent, the search record data can be used to label the question-answer pair data with classification labels.

In one embodiment of the invention, sub-step 1032 may further include the following sub-steps:

Sub-step 10321: mine the average click weight (avg_click_docwei) of the question-answer pair data under search keywords (query).

In one example of the embodiment of the invention, sub-step 10321 may further include the following sub-steps:

Sub-step 103211: record the address, such as the URL (Uniform Resource Locator), of the web page to which the question-answer pair data (pair) belongs.

Note that one question-answer pair is one document, i.e. one URL.

Sub-step 103212: calculate the click score (score) of the address under a specified search keyword (query).

In one calculation, the number of clicks on the address (URL) under the specified keyword (query) is counted (click, i.e. the click count of the query_url_pair).

The number of searches (search_count) for the specified keyword (query) is counted.

The click score of the address under the specified search keyword is calculated from the number of clicks and the number of searches. For example, the number of clicks is multiplied by itself, and the ratio of this product to the number of searches is taken as the click score, i.e. score = click * click / search_count.

Sub-step 103213: use the click score (score) to calculate the click score distribution information (dwei) of the address (URL) under the specified search keyword.

For example, dwei = score / norm * 100, where norm is a normalization factor.

Sub-step 103214: use the click score distribution information to calculate the average click weight (avg_click_docwei) of the question-answer pair data under search keywords.

For example, avg_click_docwei = (dwei_1 + dwei_2 + ... + dwei_n) / n, where n is the number of search keywords (query) under which the address (URL) was clicked.
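Sub-steps 103212 to 103214 can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the average click weight is assumed to be the arithmetic mean of dwei over the n queries, the value of `norm` is made up because the source does not define how the normalization factor is obtained, and the log counts are hypothetical.

```python
def click_score(click, search_count):
    """score = click * click / search_count for one (query, URL) pair."""
    if search_count == 0:
        return 0.0
    return click * click / search_count


def click_distribution(score, norm):
    """dwei = score / norm * 100, where norm is the normalization factor."""
    return score / norm * 100


def avg_click_docwei(dweis):
    """Assumed: arithmetic mean of dwei over the n queries that clicked the URL."""
    return sum(dweis) / len(dweis) if dweis else 0.0


# Hypothetical log counts for one URL: query -> (click, search_count).
stats = {
    "everest height": (30, 100),
    "how high is everest": (10, 50),
}
norm = 20.0  # assumed value; the source does not say how norm is computed

dweis = [click_distribution(click_score(c, s), norm) for c, s in stats.values()]
weight = avg_click_docwei(dweis)
print(weight)
```

Squaring the click count before dividing by the search count rewards addresses that are clicked often in absolute terms, not only those with a high click-through ratio.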

Sub-step 10322: mine the last-click weight (last_click_docwei) of the question-answer pair data under search keywords.

In one example of the embodiment of the invention, sub-step 10322 may further include the following sub-steps:

Sub-step 103221: record the address (URL) of the web page to which the question-answer pair data (pair) belongs.

Sub-step 103222: calculate the last-click score (last_click_score) of the address under a specified search keyword.

In one calculation, the number of last clicks (last_click) on the address (URL) under the specified keyword (query) is counted.

The last-click score (last_click_score) of the address under the specified search keyword (query) is calculated from the number of last clicks (last_click) and the number of searches (search_count).

For example, the number of last clicks is multiplied by itself, and the ratio of this product to the number of searches is taken as the last-click score, i.e. last_click_score = last_click * last_click / search_count.

Sub-step 103223: use the last-click score (last_click_score) to calculate the last-click weight (last_click_docwei) of the question-answer pair data under search keywords.

For example, a preset weight is applied to the last-click score to obtain the last-click weight, e.g. last_click_docwei = 0.60 * last_click_score.

Sub-step 10323: fit a continuous score (QA_score) from the average click weight and the last-click weight.

In a specific implementation, the average click weight and the last-click weight are added to obtain the continuous score, i.e. QA_score = avg_click_docwei + last_click_docwei.
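The last-click weight and the continuous score can be sketched as below. The formulas and the 0.60 weight come from the text; the function names and the input counts are hypothetical.

```python
def last_click_docwei(last_click, search_count, weight=0.60):
    """last_click_docwei = weight * last_click * last_click / search_count."""
    if search_count == 0:
        return 0.0
    last_click_score = last_click * last_click / search_count
    return weight * last_click_score


def qa_score(avg_click_docwei, last_click_docwei_value):
    """QA_score = avg_click_docwei + last_click_docwei."""
    return avg_click_docwei + last_click_docwei_value


# Hypothetical inputs: an average click weight of 27.5 and 20 last clicks
# out of 100 searches for the query.
score = qa_score(27.5, last_click_docwei(last_click=20, search_count=100))
print(score)
```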

Sub-step 10324: discretize the continuous score into a classification label (label).

The value obtained by discretizing the continuous score (QA_score) serves as the classification label (label).

For example, the continuous score (QA_score) is discretized into 4, 2 or 0, indicating that the quality of the question-answer pair data is good, average or poor respectively.
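A minimal sketch of the discretization, assuming two cut-off thresholds. The source gives the label values 4, 2 and 0 but not the thresholds, so `good_threshold` and `fair_threshold` below are purely illustrative.

```python
def discretize(qa_score, good_threshold=20.0, fair_threshold=5.0):
    """Map a continuous QA_score to a label: 4 (good), 2 (average), 0 (poor)."""
    if qa_score >= good_threshold:
        return 4
    if qa_score >= fair_threshold:
        return 2
    return 0


labels = [discretize(s) for s in (29.9, 7.0, 0.3)]
print(labels)
```

In practice the thresholds would have to be chosen from the observed distribution of QA_score values.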

Step 104: train a question-answer pair classification model using the question-answer pair features and the classification labels.

Because random forest (Random Forest, RF) is an ensemble learning algorithm that is fairly robust to missing data and imbalanced data, in embodiments of the invention a random forest model can be trained on the question-answer pair features and the classification labels to obtain the question-answer pair classification model. The model can be used to classify question-answer pair data (assign quality tiers) and achieves good results on both new and historical question-answer pair data.

Of course, besides random forest, the question-answer pair classification model can also be trained in other ways, for example with an SVM (Support Vector Machine) or a CNN (Convolutional Neural Network); the embodiments of the invention place no limitation on this.

Referring to Fig. 2, a flow chart of the steps of another training method for a question-answer pair classification model according to one embodiment of the invention is shown. The method may specifically include the following steps:

Step 201: obtain question-answer pair data.

Step 202: extract question-answer pair features from the question-answer pair data.

Step 203: label the question-answer pair data with classification labels according to the quality of the question-answer pair data.

Step 204: normalize the question-answer pair features.

In embodiments of the invention, the multi-dimensional (e.g. 24-dimensional) features of the question-answer pair data are standardized, with each dimension normalized separately.

In a specific implementation, the mean and standard deviation of each dimension of the question-answer pair features are computed, the mean is subtracted from each dimension and the result is divided by the standard deviation, and the mean and standard deviation are saved for use at model prediction time.
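The normalization step can be sketched with the standard library. The sketch assumes the population standard deviation (the source does not say whether population or sample deviation is used) and maps constant dimensions to 0.0 to avoid division by zero; `fit_normalizer` and `normalize` are hypothetical names.

```python
import statistics


def fit_normalizer(rows):
    """Compute the per-dimension mean and standard deviation on the training set."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]  # population std (assumed)
    return means, stds


def normalize(row, means, stds):
    """Subtract the mean and divide by the std; constant dimensions become 0.0."""
    return [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]


train = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
means, stds = fit_normalizer(train)  # saved and reused at prediction time
normalized = [normalize(r, means, stds) for r in train]
print(normalized)
```

At prediction time a new feature row is passed through `normalize` with the saved `means` and `stds`, so that training and prediction use the same scale.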

The normalization in embodiments of the invention lets positive and negative random noise cancel out and strengthens the effect of the informative features, so that models such as random forest are trained effectively and generalize better.

Step 205: adjust the classification label of the current question-answer pair data according to neighboring question-answer pair data.

Because users' click behavior may contain noise, the fitted classification labels (label) may also contain noise. Therefore, in embodiments of the invention, the distribution of the classification labels (label) can be fine-tuned.

In a specific implementation, question-answer pair data with similar question-answer pair features also have close continuous scores (QA_score), but an improperly chosen threshold during discretization may give them different labels. Therefore, the classification label of the current question-answer pair data can be adjusted using its neighboring question-answer pair data.

In one embodiment of the present invention, step 205 may include the following sub-steps:

Sub-step 2051: cluster the question-answer pair data;

Sub-step 2052: for each question-answer pair data item, select N neighboring question-answer pair data items after clustering;

Sub-step 2053: compute the distance between the current question-answer pair data item and each neighboring question-answer pair data item;

Sub-step 2054: re-fit the classification label based on the distance.

In this embodiment of the present invention, an algorithm such as KNN (k-Nearest Neighbor) may be used to cluster the question-answer pair data.

For each question-answer pair data item, N (N is a positive integer, for example 100) neighboring question-answer pair data items are selected, and the distance (for example, the Euclidean distance) between the item and each neighbor is computed.

Using an algorithm such as Gaussian-kernel weighting based on Euclidean distance, the value of the classification label is re-fitted and then discretized into a classification label again, which effectively reduces the noise in the classification labels.
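The label adjustment of step 205 might look like the sketch below. It makes several assumptions that the text does not fix: neighbors are found by brute-force search rather than by a clustering index, each item counts as its own nearest neighbor, the kernel bandwidth is 1.0, and the discretization threshold is 0.5. The names `smooth_labels` and `discretize` are hypothetical.

```python
import math

def smooth_labels(features, scores, n_neighbors=3, bandwidth=1.0):
    """Re-fit each score as a Gaussian-kernel weighted mean over its nearest neighbors."""
    smoothed = []
    for x in features:
        # Euclidean distance from this item to every item (itself included, distance 0).
        dists = [(math.dist(x, y), j) for j, y in enumerate(features)]
        nearest = sorted(dists)[:n_neighbors]
        weights = [math.exp(-(d ** 2) / (2 * bandwidth ** 2)) for d, _ in nearest]
        total = sum(w * scores[j] for w, (_, j) in zip(weights, nearest))
        smoothed.append(total / sum(weights))
    return smoothed

def discretize(score, thresholds):
    """Map a continuous score to a label: the number of thresholds it meets or exceeds."""
    return sum(score >= t for t in thresholds)

# Made-up one-dimensional features; item 1 has a noisy low score among high-scoring neighbors.
features = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
scores = [0.9, 0.1, 0.9, 0.1, 0.2, 0.1]
smoothed = smooth_labels(features, scores)
labels_before = [discretize(s, [0.5]) for s in scores]
labels_after = [discretize(s, [0.5]) for s in smoothed]
```

In this toy data, item 1 sits between two high-scoring neighbors, so its re-fitted score rises above the threshold and its label changes, while the low-scoring group around 5.0 keeps its labels.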

Step 206: train the question-answer pair classification model using the question-answer pair features and the classification labels.

In one example, about 50 million question-answer pair data items can be collected, from which 500,000 are randomly selected to train a random forest model.

The random forest model uses 200 trees with a tree depth of 50. The oobrmse of the model (out-of-bag estimate, a measure of the prediction error of an RF model) is about 0.652314, and the average prediction accuracy on new and old question-answer pair data reaches 81%.
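The text does not spell out how oobrmse is computed. Under the usual reading, a root-mean-square error over out-of-bag predictions, it would be written as below; the symbols n, y_i, and the out-of-bag prediction are notation assumed here, not taken from the source.

```latex
\mathrm{oobrmse} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i^{\mathrm{oob}}\right)^{2}}
```

Here n is the number of training items, y_i is the label value of item i, and the out-of-bag prediction for item i averages only those trees whose bootstrap sample did not contain item i.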

Step 207: identify the importance of each question-answer pair feature to the question-answer pair classification model.

Step 208: extend the M most important question-answer pair features to obtain extended question-answer pair features, and return to step 206.

For the question-answer pair classification model, the importance of each question-answer pair feature can be analyzed.

In one example, the 10 most important question-answer pair features are as shown in the following table:

Most of these 10 important features are answerer features and question-answer pair text semantic features, which are effective for predicting the quality (that is, the class) of question-answer pair data.

The random forest model in step 206 uses the 24 basic question-answer pair features (that is, the features before extension), and its average prediction accuracy on new and old question-answer pair data reaches 81%.

The basic question-answer pair features are extended by a Cartesian product transformation to obtain M (M is a positive integer) extended question-answer pair features. The extended features represent the interaction effects between basic features and express the synergy between them, which improves the generalization ability of the question-answer pair classification model and thereby its prediction accuracy.

If the top 10 important question-answer pair features are selected for the Cartesian product transformation, 45 extended question-answer pair features are obtained; some of the extended features are as follows:

The basic and extended question-answer pair features total 69 features. The random forest model is retrained with the other parameters unchanged; the oobrmse of the model is about 0.414505, and the average prediction accuracy increases by 3 percentage points to 84%.
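The feature counts above can be checked with a short sketch. The text says only that a Cartesian product transformation is applied; the example assumes that each extended feature is the product of two distinct top-ranked features, which is consistent with 10 features yielding 45 unordered pairs. The function names, feature values, and importance values are made up for illustration.

```python
from itertools import combinations

def top_m(importances, m):
    """Indices of the m largest importance values."""
    order = sorted(range(len(importances)), key=lambda i: importances[i], reverse=True)
    return order[:m]

def extend_features(row, top_indices):
    """Append one product term per unordered pair of the selected feature indices."""
    crosses = [row[i] * row[j] for i, j in combinations(top_indices, 2)]
    return list(row) + crosses

# 24 made-up basic feature values and made-up importance scores.
row = [float(k) for k in range(1, 25)]
importances = [1.0 / (k + 1) for k in range(24)]

selected = top_m(importances, 10)
extended = extend_features(row, selected)
n_cross = len(extended) - len(row)
```

With 10 selected features there are 10 × 9 / 2 = 45 pairs, so the 24 basic features grow to 24 + 45 = 69, matching the total stated above.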

After feature extension, the model's average prediction accuracy on new and old question-answer pair data reaches 84%, which is better than the 74% accuracy of the traditional manual-strategy-based method on old question-answer pair data; moreover, the traditional method cannot be applied to predicting new question-answer pair data.

For brevity, the method embodiments are described as a series of combined actions, but those skilled in the art should know that embodiments of the present invention are not limited by the described order of actions, because according to embodiments of the present invention some steps may be performed in another order or simultaneously. Those skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by embodiments of the present invention.

Referring to Fig. 3, a structural block diagram of a training device for a question-answer pair classification model according to an embodiment of the present invention is shown, which may specifically include the following modules:

a question-answer pair data acquisition module 301, adapted to obtain question-answer pair data;

a question-answer pair feature extraction module 302, adapted to extract question-answer pair features from the question-answer pair data;

a classification label marking module 303, adapted to mark the question-answer pair data with a classification label according to the quality of the question-answer pair data;

a model training module 304, adapted to train the question-answer pair classification model using the question-answer pair features and the classification labels.

In a specific implementation, the question-answer pair features include one or more of the following:

In one example of this embodiment of the present invention, the question-answer pair data includes a question and an answer, and the question-answer pair text semantic features include a question-answer pairing feature;

The question-answer pair feature extraction module 302 is further adapted to:

In another example of this embodiment of the present invention, the question-answer pair data includes a question and an answer, and the question-answer pair text semantic features include a question-answer pair minimal routing distance;

extract keywords from the answer to generate an answer keyword set;

In another example of this embodiment of the present invention, the question-answer pair data includes a question and an answer, and the question-answer pair text semantic features include a question-answer pair sentence similarity;

convert the question into a first sentence vector;

convert the answer into a second sentence vector;

In one embodiment of the present invention, the classification label marking module 303 is further adapted to:

discretize the continuous score into a classification label.

record the address of the web page to which the question-answer pair data belongs;

compute the click score of the address under a specified search keyword;

count the number of clicks on the address under the specified keyword;

count the number of searches for the specified keyword;

record the address of the web page to which the question-answer pair data belongs;

count the number of last clicks on the address under the specified keyword;

count the number of searches for the specified keyword;

Referring to Fig. 4, a structural block diagram of another training device for a question-answer pair classification model according to an embodiment of the present invention is shown, which may specifically include the following modules:

a question-answer pair data acquisition module 401, adapted to obtain question-answer pair data;

a question-answer pair feature extraction module 402, adapted to extract question-answer pair features from the question-answer pair data;

a classification label marking module 403, adapted to mark the question-answer pair data with a classification label according to the quality of the question-answer pair data;

a normalization module 404, adapted to normalize the question-answer pair features;

a classification label adjustment module 405, adapted to adjust the classification label of the current question-answer pair data according to neighboring question-answer pair data;

a model training module 406, adapted to train the question-answer pair classification model using the question-answer pair features and the classification labels;

an importance identification module 407, adapted to identify the importance of each question-answer pair feature to the question-answer pair classification model;

a question-answer pair feature extension module 408, adapted to extend the M most important question-answer pair features to obtain extended question-answer pair features, and to return to calling the model training module 406.

In one embodiment of the present invention, the normalization module 404 is further adapted to:

In one embodiment of the present invention, the classification label adjustment module 405 is further adapted to:

cluster the question-answer pair data;

re-fit the classification label based on the distance.

Because the device embodiments are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the description of the method embodiments.

The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such systems is apparent. Moreover, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described herein may be implemented in various programming languages, and that the description above of a specific language is made to disclose the best mode of the invention.

In the description provided herein, numerous specific details are set forth. It is to be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.

Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, features of the invention are sometimes grouped together into a single embodiment, figure, or description thereof in the above description of exemplary embodiments of the invention. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.

Those skilled in the art will appreciate that the modules in the device of an embodiment may be adaptively changed and arranged in one or more devices different from that embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and may furthermore be divided into a plurality of sub-modules or sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, equivalent, or similar purpose.

Furthermore, those skilled in the art will appreciate that although some embodiments described herein include some features included in other embodiments rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.

The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the training device for a question-answer pair classification model according to embodiments of the invention. The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.

It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second, and third does not indicate any order. These words may be interpreted as names.

Claims

1. A training method for a question-answer pair classification model, comprising:

obtaining question-answer pair data;

2. The method of claim 1, wherein the question-answer pair features include one or more of the following:

3. The method of any one of claims 1-2, wherein the question-answer pair data includes a question and an answer, and the question-answer pair text semantic features include a question-answer pairing feature;

4. The method of any one of claims 1-3, wherein the question-answer pair data includes a question and an answer, and the question-answer pair text semantic features include a question-answer pair minimal routing distance;

extracting keywords from the answer to generate an answer keyword set;

5. The method of any one of claims 1-4, wherein the question-answer pair data includes a question and an answer, and the question-answer pair text semantic features include a question-answer pair sentence similarity;

converting the question into a first sentence vector;

converting the answer into a second sentence vector;

computing the similarity between the first sentence vector and the second sentence vector as the question-answer pair sentence similarity.

6. The method of any one of claims 1-5, wherein the step of marking the question-answer pair data with a classification label according to the quality of the question-answer pair data comprises:

7. The method of any one of claims 1-6, wherein the step of marking the question-answer pair data with a classification label according to the search record data comprises:

discretizing the continuous score into a classification label.

8. The method of any one of claims 1-7, wherein the step of mining the average click weight of the question-answer pair data under search keywords comprises:

recording the address of the web page to which the question-answer pair data belongs;

computing the click score of the address under a specified search keyword;

computing the average click weight of the question-answer pair data under search keywords using click score distribution information.

9. The method of any one of claims 1-8, wherein the step of computing the click score of the address under a specified search keyword comprises:

counting the number of clicks on the address under the specified keyword;

counting the number of searches for the specified keyword;

computing the click score of the address under the specified search keyword using the number of clicks and the number of searches.

10. A training device for a question-answer pair classification model, comprising:

a classification label marking module, adapted to mark the question-answer pair data with a classification label according to the quality of the question-answer pair data;