CN101980211A - Machine learning model and establishing method thereof - Google Patents

Machine learning model and establishing method thereof

Publication number
CN101980211A
Authority
CN
China
Prior art keywords
sample
machine learning
learning model
classification
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2010105423748A
Other languages
Chinese (zh)
Inventor
�田�浩
万伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baidu Online Network Technology Beijing Co Ltd
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN2010105423748A
Publication of CN101980211A
Legal status: Pending

Abstract

The embodiment of the invention provides a machine learning model and an establishing method thereof. The method comprises the following steps of: a1, acquiring a sample library of conventional target words; a2, performing positive and negative sample classification on the conventional target words for classifying the conventional target words into at least one positive sample and at least one negative sample; a3, performing feature extraction on the positive sample and the negative sample; and a4, forming the machine learning model according to the extracted features of the positive sample and the negative sample. In the method of the embodiment of the invention, the machine learning model is established by performing classified and graded statistics on conventional search keywords and advertising auction words, so as to automatically identify, analyze and evaluate new keywords or auction words by a machine; and therefore, a classified and graded method and a classified and graded system for the target words are realized.

Description

A machine learning model and a method for establishing it
[Technical field]
The present invention relates to a machine learning model and a method for establishing it, and in particular to a machine learning model for target words such as keywords and/or advertising slogans, and a method for establishing it.
[Background]
Advertising is becoming ever more pervasive and important, both online and in everyday life. This has produced a large number of advertising slogans, and the network also carries similar target words such as search terms, keywords, and auction words. For a newly created advertising slogan, however, whether it has a positive effect, and how strong that effect is, can generally only be judged subjectively by people. Inexperienced people easily misjudge. Manual judgment is also hard to scale, and the consistency of subjective judgments is hard to guarantee. How to have a computer system classify and grade search terms and advertising slogans automatically is therefore a technical problem that needs to be solved.
[Summary of the invention]
The embodiments of the invention provide a method for establishing a machine learning model. The model can be used to classify and grade new target words, and further to estimate the value of a target word.
The embodiments of the invention provide a method for establishing a machine learning model, comprising the steps of: a1. acquiring a sample library of existing target words; a2. classifying the existing target words into positive and negative samples, so that they are divided into at least one positive sample and at least one negative sample; a3. extracting features from the positive samples and the negative samples; a4. forming the machine learning model from the extracted features of the positive and negative samples.
According to a preferred embodiment of the invention, the positive samples comprise target words with a high ad click-through rate and/or a high advertising price; the negative samples comprise one or more of: target words with a low ad click-through rate, target words with a low advertising price, and target words for which no advertisement is displayed.
According to a preferred embodiment of the invention, step a2 further comprises grading the positive samples, so that the existing target words are divided into positive samples of several different grades and negative samples.
According to a preferred embodiment of the invention, in step a2, at least some of the existing target words are classified into positive and negative samples, and the positive samples are graded, by reading a preset sample database.
According to a preferred embodiment of the invention, in step a4, machine learning is performed on the features of the positive and negative samples to form the machine learning model.
According to a preferred embodiment of the invention, the positive samples are further divided into at least two grades of samples according to their classification level.
According to a preferred embodiment of the invention, the graded samples comprise grade-A, grade-B, and grade-C samples; or grade-A, grade-B, grade-C, and grade-D samples; or grade-A, grade-B, grade-C, grade-D, and grade-E samples. Grade-A samples have the highest classification level, and the level decreases with each subsequent grade.
According to a preferred embodiment of the invention, the classification level is determined by how high the target word's ad click-through rate and/or advertising price is.
According to a preferred embodiment of the invention, the method further comprises the steps of: a5. extracting features from the existing target words that were not classified and graded in step a2; a6. performing model computation on the features of these remaining target words according to the machine learning model, classifying and grading them accordingly, and adding the sample features of the classified and graded remaining target words to the machine learning model.
According to a preferred embodiment of the invention, in step a6, the model computation on the features of the remaining target words consists of computing positive-class and negative-class confidence scores for those features from the feature parameters trained in the machine learning model.
According to a preferred embodiment of the invention, word segmentation is performed before feature extraction.
According to a preferred embodiment of the invention, the word segmentation method comprises: forward matching, reverse matching, bidirectional matching, segmentation based on a full segmentation word graph, maximum entropy Markov model segmentation, maximum entropy segmentation, or conditional random field segmentation.
The invention further provides a machine learning model comprising: a sample library acquisition module for acquiring a sample library of existing target words; a sample classification and grading module for classifying the existing target words into positive and negative samples, so that they are divided into at least one positive sample and at least one negative sample; a first sample feature extraction module for extracting features from the positive and negative samples; and a machine learning model forming module for forming the machine learning model from the extracted features of the positive and negative samples.
According to a preferred embodiment of the invention, the positive samples comprise target words with a high ad click-through rate and/or a high advertising price; the negative samples comprise one or more of: target words with a low ad click-through rate, target words with a low advertising price, and target words for which no advertisement is displayed.
According to a preferred embodiment of the invention, the sample classification and grading module further grades the positive samples, so that the existing target words are divided into positive samples of several different grades and negative samples.
According to a preferred embodiment of the invention, in the sample classification and grading module, at least some of the existing target words are classified into positive and negative samples, and the positive samples are graded, by reading a preset sample database.
According to a preferred embodiment of the invention, in the machine learning model forming module, machine learning is performed on the features of the positive and negative samples to form the machine learning model.
According to a preferred embodiment of the invention, in the sample classification and grading module, the positive samples are further divided into at least two grades of samples according to their classification level.
According to a preferred embodiment of the invention, the graded samples comprise grade-A, grade-B, and grade-C samples; or grade-A, grade-B, grade-C, and grade-D samples; or grade-A, grade-B, grade-C, grade-D, and grade-E samples. Grade-A samples have the highest classification level, and the level decreases with each subsequent grade.
According to a preferred embodiment of the invention, the classification level is determined by how high the target word's ad click-through rate and/or advertising price is.
According to a preferred embodiment of the invention, the machine learning model further comprises: a second sample feature extraction module for extracting features from the remaining target words, that is, the existing target words that were not classified and graded by the sample classification and grading module; and a sample model computation module for performing model computation on the features of the remaining target words according to the machine learning model, classifying and grading them accordingly, and adding the sample features of the classified and graded remaining target words to the machine learning model.
According to a preferred embodiment of the invention, in the sample model computation module, the model computation on the features of the remaining target words consists of computing positive-class and negative-class confidence scores for those features from the feature parameters trained in the machine learning model.
According to a preferred embodiment of the invention, word segmentation is performed before feature extraction.
According to a preferred embodiment of the invention, the word segmentation method comprises: forward matching, reverse matching, bidirectional matching, segmentation based on a full segmentation word graph, maximum entropy Markov model segmentation, maximum entropy segmentation, or conditional random field segmentation.
In this way, a machine learning model is established by performing classification and grading statistics on existing search keywords and advertising auction words, so that new keywords or auction words can be automatically identified, analyzed, and evaluated by machine. This realizes a complete method and system for classifying and grading target words.
[Description of the drawings]
To illustrate the technical solutions of the embodiments of the invention more clearly, the drawings used in describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort. The drawings are not drawn to scale.
Fig. 1 is a schematic block diagram of the structure of a target word classification and grading system according to an embodiment of the invention.
Fig. 2 is a schematic block diagram of the structure of the machine learning model shown in Fig. 1.
Fig. 3 is a schematic flowchart of a target word classification and grading method according to an embodiment of the invention.
Fig. 4 is a schematic flowchart of the method for establishing the machine learning model shown in Fig. 3.
Fig. 5 is a schematic flowchart of a machine learning model according to another embodiment of the invention.
[Detailed description of the embodiments]
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the invention.
Fig. 1 is a schematic block diagram of the structure of a target word classification and grading system according to an embodiment of the invention. For convenience, search terms, keywords, and/or auction words are referred to collectively as "target words" herein. The target word classification and grading system of this embodiment comprises machine learning model 11, feature extraction module 12, and model computation module 13. Machine learning model 11 is a machine learning model that embodies the classification and grading of target words. Feature extraction module 12 extracts features from new target words. A "new target word" here means a target word that needs to be classified and graded. Model computation module 13 applies machine learning model 11 to the features of the new target word extracted by feature extraction module 12, performs model computation, and thereby determines the classification and grade of the new target word. In other embodiments, machine learning model 11 may instead be applied to perform model matching to determine the classification and grade of the new target word.
Fig. 2 is a schematic block diagram of the structure of the machine learning model shown in Fig. 1. Referring also to Fig. 1, machine learning model 11 comprises sample library acquisition module 21, sample classification and grading module 22, first sample feature extraction module 23, and machine learning model forming module 24. Sample library acquisition module 21 acquires a sample library of existing target words. Sample classification and grading module 22 classifies the existing target words acquired by sample library acquisition module 21 into positive samples and negative samples, and grades the positive samples. At least part of this classification and grading is done by reading a preset sample database. The sample database contains positive sample data and negative sample data generated by a computer that classifies existing target words according to statistical criteria such as ad click-through rate, advertisement volume, and advertisement rank; the positive and negative sample data in the database can also be adjusted manually. First sample feature extraction module 23 extracts features from the positive and negative samples. Machine learning model forming module 24 performs machine learning on the features of the positive and negative samples extracted by first sample feature extraction module 23, and thereby forms the machine learning model.
As shown in Fig. 2, the machine learning model establishing system of this embodiment further comprises second sample feature extraction module 25 and sample model computation module 26. Second sample feature extraction module 25 extracts features from the remaining target words, that is, the existing target words that were not classified and graded by sample classification and grading module 22. Sample model computation module 26 performs model computation, according to the machine learning model, on the features of the remaining target words obtained by second sample feature extraction module 25, and classifies and grades them accordingly. In other embodiments, sample model computation module 26 may instead perform model matching on those features according to the machine learning model, and classify and grade them accordingly. Sample model computation module 26 further adds the features of the classified and graded remaining target words to the machine learning model, which improves the model further.
The specific functions of each module of the target word classification and grading system and of the machine learning model establishing system are described below with reference to Fig. 3 and Fig. 4.
Fig. 3 is a schematic flowchart of a target word classification and grading method according to an embodiment of the invention.
In step 31, machine learning model 11, which embodies the classification and grading of target words, is obtained.
In step 32, a new target word is obtained by feature extraction module 12. The new target word may be entered by a user or obtained in some other way.
In step 33, feature extraction module 12 extracts features from the new target word. Before feature extraction, the new target word must first be segmented into words. Word segmentation methods include: forward matching, reverse matching, bidirectional matching, segmentation based on a full segmentation word graph, maximum entropy Markov model segmentation, maximum entropy segmentation, or conditional random field segmentation.
Forward maximum matching and reverse maximum matching are taken as examples below. Consider the sentence "今天来了许多新同事" ("many new colleagues came today"). With forward maximum matching and a maximum length of 5, the first 5 characters of the sentence are taken first: 今天来了许. These 5 characters are not a word, so the last character is removed, giving 今天来了. This is still not a word, so the last character is removed again, and so on: 今天来; 今天. This yields a word: 今天 (today). Next: 来了许多新; 来了许多; 来了许; 来了; 来. This yields a word: 来 (come). Next: 了许多新同; 了许多新; 了许多; 了许; 了. This yields a word: 了 (a particle). Next: 许多新同事; 许多新同; 许多新; 许多. This yields a word: 许多 (many). Next: 新同事; 新同; 新. This yields a word: 新 (new). Finally: 同事. This yields a word: 同事 (colleague). The result of forward maximum matching is /今天/来/了/许多/新/同事/. With reverse maximum matching and the same maximum length of 5, candidates are taken from the end of the sentence and shortened from the front: 许多新同事; 多新同事; 新同事; 同事. This yields a word: 同事. Continuing in the same way, the result of reverse maximum matching is also /今天/来/了/许多/新/同事/. Note, however, that forward and reverse maximum matching do not always give the same result. For example, for "我一个人吃饭" ("I eat alone") with a maximum length of 5, forward maximum matching gives /我/一个/人/吃饭/ (I/one/person/eat), whereas reverse maximum matching gives /我/一/个人/吃饭/ (I/one/individual/eat). Different segmentation methods can therefore lead to different feature extraction results for a target word.
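The two matching procedures can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: it assumes the example sentences are the Chinese sentences 今天来了许多新同事 ("many new colleagues came today") and 我一个人吃饭 ("I eat alone"), and the small `DICTIONARY` set is a hypothetical word list chosen only to cover them. A single character is accepted as a word when no longer dictionary entry matches.

```python
def forward_max_match(text, dictionary, max_len=5):
    """Greedy left-to-right segmentation: try the longest candidate first."""
    words, i = [], 0
    while i < len(text):
        # Shrink the candidate from the right until it is a dictionary word;
        # a single character is always accepted as a fallback.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


def reverse_max_match(text, dictionary, max_len=5):
    """Greedy right-to-left segmentation: shrink the candidate from the left."""
    words, end = [], len(text)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                end -= size
                break
    return words[::-1]  # words were collected from the end of the sentence


# Hypothetical toy dictionary, just large enough for the two example sentences.
DICTIONARY = {"今天", "来", "了", "许多", "新", "同事",
              "我", "一", "个", "一个", "个人", "人", "吃饭"}

for sentence in ("今天来了许多新同事", "我一个人吃饭"):
    print("forward:", "/".join(forward_max_match(sentence, DICTIONARY)))
    print("reverse:", "/".join(reverse_max_match(sentence, DICTIONARY)))
```

With this dictionary the two directions agree on the first sentence and differ on the second, which is why the same segmentation method should be used for training and for new target words.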
In step 34, the features of the new target word extracted in step 33 are obtained.
In step 35, the machine learning model obtained in step 31 is applied to the features of the new target word obtained in step 34 to perform model computation (the details of the model computation are described below). In other embodiments, the machine learning model obtained in step 31 may instead be applied to those features to perform model matching.
In step 36, model computation module 13 determines the classification and grade of the new target word from the confidence scores produced by the model computation on the features obtained in step 34.
Fig. 4 is a schematic flowchart of the method for establishing the machine learning model shown in Fig. 3. This is the machine learning model obtained in step 31.
In step 41, sample library acquisition module 21 acquires a sample library of existing target words. Each existing target word is accompanied by information such as its ad click-through rate. The ad click-through rate may, for example, be a statistic over a past period.
In step 42, sample classification and grading module 22 classifies the existing target words into positive and negative samples, so that they are divided into at least one positive sample and at least one negative sample, and grades the positive samples. Positive samples are mainly selected in one of the following ways: selecting target words with a high ad click-through rate, selecting target words with a high advertising price, or selecting by a combination of these two conditions. Negative samples are mainly selected in one of the following ways: target words with a low ad click-through rate, target words with a low advertising price, target words for which no advertisement is displayed, or a combination of these three conditions. Put another way, the target words in the positive samples have directly or indirectly created high value, whereas the target words in the negative samples have created no value, directly or indirectly, or only low value. For illustration, a simple example of a group of positive and negative samples is given below.
For example, suppose there are four keywords: "Panpan arrives home, live and work in peace and contentment", "Use Buyang, I feel at ease", "Meixin door, happy regard", and "common anti-theft tips". Clearly, the first three are advertising slogans for brands of security doors, and all have some value, especially commercial value. They are therefore classified as positive samples. The value of "common anti-theft tips" is very low, and it has no commercial value in particular. It is therefore classified as a negative sample.
More precisely, in step 42, at least some of the existing target words are classified into positive and negative samples, and the positive samples are graded, by reading a preset sample database. This includes both classification into positive and negative samples by a computer according to statistical criteria such as the target word's ad click-through rate, advertisement volume, and advertisement rank, and manual classification of existing target words into positive and negative samples together with manual grading of the positive samples. The ad click-through rate is highly correlated with the ad display strategy of the advertising system, so it is unreasonable to treat as negative samples those target words that obviously have value but have not yet been given effective ad display. Assisting with manual classification and grading solves this kind of problem well.
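The combination of statistical labeling with manual correction can be sketched as follows. This is a minimal illustration under stated assumptions: the `TargetWord` record, the `label_sample` function, and both threshold values are hypothetical, since the text does not say what counts as a "high" click-through rate or advertising price.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class TargetWord:
    text: str
    ctr: Optional[float]    # ad click-through rate; None if no ad was ever displayed
    price: Optional[float]  # advertising price; None if unknown


def label_sample(word: TargetWord,
                 manual_labels: Dict[str, str],
                 ctr_threshold: float = 0.05,   # hypothetical cut-off
                 price_threshold: float = 1.0   # hypothetical cut-off
                 ) -> str:
    """Return "positive" or "negative" for one existing target word."""
    # A manual label always wins, e.g. for a valuable word that has not yet
    # been given effective ad display and so has misleading statistics.
    if word.text in manual_labels:
        return manual_labels[word.text]
    high_ctr = word.ctr is not None and word.ctr >= ctr_threshold
    high_price = word.price is not None and word.price >= price_threshold
    return "positive" if (high_ctr or high_price) else "negative"


words = [
    TargetWord("slogan with many clicks", ctr=0.12, price=0.4),
    TargetWord("tips with few clicks", ctr=0.01, price=0.1),
    TargetWord("valuable but never displayed", ctr=None, price=None),
]
manual = {"valuable but never displayed": "positive"}
labels = {w.text: label_sample(w, manual) for w in words}
print(labels)
```

Keeping the manual labels in a separate mapping means the statistical rule can be re-run when the click data changes without losing the human corrections.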
The grading of positive samples mentioned in step 42 is explained in detail here: the positive samples are divided into at least two grades of samples according to their classification level. The classification level is determined by how high the target word's ad click-through rate and/or advertising price is. In general, the graded samples comprise grade-A, grade-B, and grade-C samples; or grade-A, grade-B, grade-C, and grade-D samples; or grade-A, grade-B, grade-C, grade-D, and grade-E samples. The target words in the grade-A samples have the highest ad click-through rate and/or advertising price, so grade A is the highest classification level, and the level decreases with each subsequent grade. These three grading schemes keep the grading accurate without requiring excessive computation. If too many grades are used, the amount of computation increases and the boundaries between grades become blurred. Applying grading to the example sample group above: the value of "Panpan arrives home, live and work in peace and contentment" is very high, so it is assigned to grade A. The value of "Use Buyang, I feel at ease" is medium, so it is assigned to grade B. The value of "Meixin door, happy regard" is lower, so it is assigned to grade C.
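One simple way to turn click-through rates into grades is to rank the positive samples and cut the ranking into equal bands. The sketch below assumes this rank-based rule, which the text does not prescribe; the function name `grade_positive_samples` and the click-through values are hypothetical.

```python
def grade_positive_samples(ctr_by_word, grades=("A", "B", "C")):
    """Rank positive samples by click-through rate and cut the ranking into grades."""
    ranked = sorted(ctr_by_word, key=ctr_by_word.get, reverse=True)
    n = len(ranked)
    # Position `rank` in the ranking falls into band rank * len(grades) // n,
    # so the best-ranked words receive the first (highest) grade.
    return {word: grades[rank * len(grades) // n]
            for rank, word in enumerate(ranked)}


# Hypothetical click-through rates for three positive samples.
positive_ctr = {"first slogan": 0.15, "second slogan": 0.08, "third slogan": 0.03}
print(grade_positive_samples(positive_ctr))
print(grade_positive_samples(positive_ctr, grades=("A", "B")))
```

Passing a longer `grades` tuple gives the four-grade or five-grade schemes; fixed thresholds per grade would be an equally valid alternative.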
When the positive samples are graded, manual operation can influence the grading to some extent. For example, if a target word has very high value but its ad click data is not very high, it can be manually assigned to grade A. However, the same target word may well be assigned to different grades by different people, so manual grading alone can have a relatively high error rate. Combining manual operation with computer-based grading on various data (such as the existing ad click-through rate, advertisement volume, and advertisement rank) is therefore the safer method.
In step 43, first sample feature extraction module 23 extracts features from the positive and negative samples. As in step 33, word segmentation may be performed first. Because different segmentation methods can lead to different feature extraction results for a target word, the segmentation method in step 43 is preferably the same as the one used in step 33. For example, feature extraction on the sample group above gives the corresponding features: /Panpan/arrives home/live and work in peace and contentment/, /use/Buyang/I/feel at ease/, /Meixin/door/happy/regard/.
In step 44, machine learning model forming module 24 performs machine learning on the features of the positive and negative samples extracted by first sample feature extraction module 23, and thereby forms the machine learning model. Machine learning studies how computers can simulate or realize human learning behavior in order to acquire new knowledge or skills, and reorganize existing knowledge structures so as to keep improving their own performance. The knowledge acquired by a machine learning system includes: rules of behavior, descriptions of physical objects, problem-solving strategies, various classifications and gradings, and other types of knowledge used to accomplish tasks. Inductive learning, one of the major categories of machine learning, is used as an example below. In inductive learning, a teacher or the environment provides some examples or counterexamples of a concept, and the learner derives a general description of the concept by induction. Here, positive sample features and negative sample features are provided, and the machine derives by induction a general description of the positive sample concept and a general description of the negative sample concept, so that it gains the ability to decide whether other sample features belong to a positive or a negative sample. A maximum entropy classifier can be used when establishing the machine learning model. Classifiers such as SVM (support vector machine) and Boosting can also be used and achieve a similar effect.
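For two classes, a maximum entropy classifier is equivalent to logistic regression, so the training step can be sketched with a small logistic regression over segmented words. This is an illustrative sketch, not the patent's implementation: the function names, the learning rate, the number of epochs, and the tokenized training samples are all assumptions, and the tokens only loosely follow the slogan examples in the text.

```python
import math
from collections import defaultdict


def train_maxent(samples, epochs=200, lr=0.5):
    """Train a binary maximum entropy (logistic regression) model.

    samples: list of (tokens, label) pairs, label 1 = positive, 0 = negative.
    Returns (weights, bias), one weight per segmented word.
    """
    weights, bias = defaultdict(float), 0.0
    for _ in range(epochs):
        for tokens, label in samples:
            z = bias + sum(weights[t] for t in tokens)
            p = 1.0 / (1.0 + math.exp(-z))   # predicted P(positive)
            error = label - p                # gradient of the log-likelihood
            bias += lr * error
            for t in tokens:
                weights[t] += lr * error
    return weights, bias


def confidence(tokens, model):
    """Return positive-class and negative-class confidence scores."""
    weights, bias = model
    z = bias + sum(weights.get(t, 0.0) for t in tokens)  # unseen words count as 0
    p_pos = 1.0 / (1.0 + math.exp(-z))
    return {"positive": p_pos, "negative": 1.0 - p_pos}


# Hypothetical segmented training samples.
training = [
    (["Panpan", "home", "peace"], 1),
    (["use", "brand", "feel at ease"], 1),
    (["door", "happy"], 1),
    (["anti-theft", "tips"], 0),
    (["flu", "prevention", "tips"], 0),
]
model = train_maxent(training)
print(confidence(["use", "brand", "feel at ease"], model))
print(confidence(["flu", "prevention", "tips"], model))
```

The two scores returned by `confidence` are the kind of positive-class and negative-class confidence scores that the model computation compares; an SVM or boosted classifier could replace the model as long as it exposes comparable scores.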
In step 45, second sample feature extraction module 25 extracts features from the part of the sample library of existing target words that was not classified and graded into positive and negative samples in step 42, obtaining the sample features of the remaining target words. This step is similar to step 43; the difference is that step 45 extracts features from the remaining target words, that is, those in the sample library acquired in step 41 that were not classified and graded in step 42. Manual classification and grading can improve accuracy to some extent, but if the number of existing target words acquired in step 41 is very large, classifying them entirely by hand involves a very heavy workload and can lead to problems such as long working time and high cost. Having people classify and grade one part of the target words first, and leaving the remaining target words to be classified and graded by machine, therefore saves time and effort without losing accuracy.
In step 46, sample model computation module 26 performs model computation on the sample features of the remaining target words according to the machine learning model, and classifies and grades them accordingly. In other embodiments, sample model computation module 26 may instead perform model matching on those features according to the machine learning model, and classify and grade them accordingly. This step hands the classification and grading of the remaining target words over to the machine, which saves a great deal of manual work while still ensuring a certain accuracy. The sample features of the classified and graded remaining target words are then added to the machine learning model, improving it further.
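Steps 45 and 46 amount to labeling the remaining target words with the current model and then folding them back into it. The sketch below shows one way to structure that loop, assuming pluggable `train` and `predict` functions; the vote-counting model is a deliberately simple stand-in used only to keep the example self-contained, and all names and samples are hypothetical.

```python
from collections import Counter


def extend_with_remaining(labeled, remaining, train, predict):
    """Label the remaining samples with the current model, then retrain on all."""
    model = train(labeled)
    newly_labeled = [(tokens, predict(model, tokens)) for tokens in remaining]
    extended = labeled + newly_labeled
    return train(extended), extended


def train_token_votes(samples):
    """Stand-in model: each word votes +1 per positive and -1 per negative sample."""
    votes = Counter()
    for tokens, label in samples:
        for t in tokens:
            votes[t] += 1 if label == "positive" else -1
    return votes


def predict_by_votes(votes, tokens):
    """Classify by the sign of the summed votes; unseen words contribute 0."""
    score = sum(votes[t] for t in tokens)
    return "positive" if score > 0 else "negative"


labeled = [
    (["use", "brand", "feel at ease"], "positive"),
    (["anti-theft", "tips"], "negative"),
]
remaining = [["use", "XX", "feel at ease"], ["flu", "tips"]]

model, extended = extend_with_remaining(
    labeled, remaining, train_token_votes, predict_by_votes)
print(extended[2:])
print(model["XX"], model["flu"])
```

After the second training pass the model has entries for words such as "XX" and "flu" that appeared only in the remaining samples, which is the sense in which adding them improves the model.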
As shown in Fig. 3 and Fig. 4, in step 35, performing model calculation on the features of a new target word means calculating positive-class and negative-class confidence scores for those features using the feature parameters trained in the machine learning model. If the positive-class confidence score computed from the machine learning model parameters is higher than the negative-class confidence score, then in step 36 the target word is assigned to the valuable class; if the negative-class confidence score is higher than the positive-class confidence score, then in step 36 it is assigned to the class of no value or low value. Similarly, in step 46, model calculation on the remaining target word sample features is performed by calculating positive-class and negative-class confidence scores for the features of the remaining target words using the feature parameters trained in the machine learning model. Take, for example, "Use XX, and you can rest assured" and "flu-prevention tips". The positive-class confidence score of "Use XX, and you can rest assured" is very high, so it is classified as a valuable target word. The negative-class confidence score of "flu-prevention tips" is high, so it is assigned to the class of no value or low value. If the classification and grading of step 42 only divides the prior target words into positive and negative samples, without grading the positive samples, then step 36 yields a coarse evaluation; if the positive samples are graded, then step 36 yields a detailed evaluation.
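The coarse evaluation of step 36 reduces to comparing two scores, as the following sketch shows. The function name `classify_value` and the numeric scores are assumptions for illustration; the source describes the scores only as "very high" and "high", and does not say how a tie is handled.

```python
def classify_value(positive_score, negative_score):
    """Coarse evaluation: compare positive- and negative-class confidence."""
    if positive_score > negative_score:
        return "valuable"
    if negative_score > positive_score:
        return "no or low value"
    return None  # tie: behaviour not specified in the source


# Hypothetical confidence scores for the two example target words.
slogan_result = classify_value(positive_score=0.91, negative_score=0.09)
tips_result = classify_value(positive_score=0.20, negative_score=0.80)
```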
Detailed evaluation is explained below for the case where the grade samples of the positive samples comprise grade A samples, grade B samples, and grade C samples. After the positive samples are divided into three grades, the machine learning model accordingly contains four kinds of model parameters: grade A sample feature model parameters, grade B sample feature model parameters, grade C sample feature model parameters, and negative sample feature model parameters. In step 36, one way of classifying and grading is to calculate, from the target word sample features, the confidence score of each category based on that category's model parameters. The target word is assigned to whichever category has the highest confidence score. For example, suppose the features of a target word have a confidence score of 0.12 for grade A, 0.63 for grade B, and 0.17 for grade C. Because the grade B sample feature model gives this target word the highest score, 0.63, the target word is classified as grade B. Besides this approach, other commonly used, more complex classification and grading methods may also be used.
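The worked example above amounts to picking the category with the highest confidence score. A minimal sketch, using the three scores given in the text (the function name `assign_grade` is hypothetical, and the example omits a negative-class score because the source does not give one):

```python
def assign_grade(confidence_scores):
    """Return the category whose confidence score is highest."""
    return max(confidence_scores, key=confidence_scores.get)


# Scores taken from the example in the text.
scores = {"A": 0.12, "B": 0.63, "C": 0.17}
grade = assign_grade(scores)
print(grade)  # B
```

A negative-class score, when available, can simply be added to the same dictionary under its own key.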
Fig. 5 is a schematic flow diagram of the machine learning model according to another embodiment of the present invention.
In step 51, a sample library of prior target words is obtained. Each prior target word also carries information such as the ad click-through rate of that target word. The click-through rate may be, for example, a statistic over a past period of time.
In step 52, positive and negative samples are classified: according to differences in classification level, the prior target words are divided into positive samples of a plurality of different grades and negative samples. The classification level is judged according to the ad click-through rate of the target word and/or the advertising rates.
In step 53, the positive samples of a plurality of different grades are obtained. For a description of the positive samples of different grades, see the part of the detailed description of step 42 above that covers grading the positive samples.
In step 54, the negative samples are obtained. Of course, sometimes some prior target words are assigned neither to the positive samples nor to the negative samples.
In step 55, feature extraction is performed on the positive samples and the negative samples. Likewise, as in step 33, the feature extraction in step 55 may be preceded by word segmentation.
In step 56, a plurality of positive sample features are obtained. The positive samples of each grade yield their own corresponding positive sample features.
In step 57, the negative sample features are obtained.
In step 58, machine learning is performed on the plurality of positive sample features and the negative sample features.
In step 59, the machine learning model is formed.
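The flow of steps 51 to 59 can be sketched as follows, under several stated assumptions: the click-through-rate thresholds are invented (the source gives no numeric cut-offs), forward maximum matching is used as the word segmentation method, and "machine learning" is reduced to counting feature frequencies per class as a stand-in for whichever learner is actually used. All names are hypothetical, and ASCII strings are used for readability; the matching works the same way character by character on Chinese text.

```python
from collections import Counter

# Hypothetical thresholds on click-through rate; not taken from the source.
GRADE_THRESHOLDS = [("A", 0.10), ("B", 0.05), ("C", 0.02)]
NEGATIVE_MAX_CTR = 0.005


def grade_by_ctr(ctr):
    """Step 52: map a click-through rate to a grade, 'negative', or None."""
    for grade, minimum in GRADE_THRESHOLDS:
        if ctr >= minimum:
            return grade
    if ctr <= NEGATIVE_MAX_CTR:
        return "negative"
    return None  # neither positive nor negative (see step 54)


def forward_max_match(text, dictionary, max_len=8):
    """Greedy longest-match segmentation, scanning left to right."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when nothing matches.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def build_model(sample_library, dictionary):
    """Steps 52-59: grade samples, extract features, count them per class."""
    model = {}
    for word, ctr in sample_library:                         # step 51 output
        label = grade_by_ctr(ctr)                            # steps 52-54
        if label is None:
            continue
        features = forward_max_match(word, dictionary)       # step 55
        model.setdefault(label, Counter()).update(features)  # steps 56-59
    return model


dictionary = {"machine", "learning", "model"}
library = [
    ("machinelearningmodel", 0.12),  # graded A
    ("learningmodel", 0.001),        # negative
    ("model", 0.01),                 # neither positive nor negative
]
model = build_model(library, dictionary)
```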
It should be noted that the above steps may be suitably modified in practice. For example, in step 52, the positive and negative sample classification may first determine the plurality of positive samples and then obtain the negative samples by subtracting those positive samples from the whole sample library.
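This variant of step 52 is a set difference. A minimal sketch, with the hypothetical name `derive_negatives` and made-up sample data:

```python
def derive_negatives(sample_library, graded_positives):
    """Return every library entry that is not in any positive grade."""
    positives = set().union(*graded_positives.values())
    return [word for word in sample_library if word not in positives]


library = ["w1", "w2", "w3", "w4"]
graded_positives = {"A": ["w1"], "B": ["w3"]}
negatives = derive_negatives(library, graded_positives)
```

Note that under this variant every target word that is not a positive sample becomes a negative sample, so no word is left unassigned.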
In this way, existing search keywords and ad auction words can be classified, graded, and counted to establish a machine learning model, so that new keywords or auction words can be automatically identified, analyzed, and evaluated by machine, realizing a method and system for classifying and grading target words. With the classification and grading method of this machine learning model, it is possible not only to judge whether a target word has value (such as ad click-through rate), but also to grade how high that value is.
The above embodiments describe the invention by way of example only; after reading this patent application, those skilled in the art may make various modifications to the invention without departing from its spirit and scope.

Claims (24)

1. A method for establishing a machine learning model, characterized in that the method comprises the steps of:
a1. obtaining a sample library of prior target words;
a2. performing positive and negative sample classification on the prior target words, dividing the prior target words into at least one positive sample and one negative sample;
a3. performing feature extraction on the positive sample and the negative sample;
a4. forming the machine learning model according to the extracted features of the positive sample and the negative sample.
2. The method for establishing a machine learning model according to claim 1, characterized in that the positive sample comprises target words with a high ad click-through rate and/or high advertising rates; the negative sample comprises one or a combination of: target words with a low ad click-through rate, target words with low advertising rates, or target words for which no advertisement is shown.
3. The method for establishing a machine learning model according to claim 1, characterized in that step a2 further comprises grading the positive sample, so that the prior target words are divided into positive samples of a plurality of different grades and negative samples.
4. The method for establishing a machine learning model according to claim 3, characterized in that, in step a2, at least part of the prior target words are classified into positive and negative samples, and the positive samples are graded, by reading a preset sample database.
5. The method for establishing a machine learning model according to claim 4, characterized in that, in step a4, machine learning is performed on the features of the positive sample and the negative sample, thereby forming the machine learning model.
6. The method for establishing a machine learning model according to claim 4, characterized in that the positive sample is further divided into at least two grade samples according to differences in classification level.
7. The method for establishing a machine learning model according to claim 6, characterized in that the grade samples comprise grade A samples, grade B samples, and grade C samples; or comprise grade A samples, grade B samples, grade C samples, and grade D samples; or comprise grade A samples, grade B samples, grade C samples, grade D samples, and grade E samples; wherein the grade A samples have the highest classification level, and the classification levels of the other grade samples decrease in order.
8. The method for establishing a machine learning model according to claim 7, characterized in that the classification level is judged according to the ad click-through rate of the target word and/or the advertising rates.
9. The method for establishing a machine learning model according to claim 6, characterized in that step a further comprises:
a5. performing feature extraction on the prior target words that were not classified and graded in step a2;
a6. performing model calculation on the obtained features of the remaining target words according to the machine learning model, then classifying and grading them, and adding the classified and graded remaining target word sample features to the machine learning model.
10. The method for establishing a machine learning model according to claim 9, characterized in that, in step a6, the model calculation on the features of the remaining target words is performed by calculating positive-class and negative-class confidence scores for the features of the remaining target words according to the feature parameters trained in the machine learning model.
11. The method for establishing a machine learning model according to claim 1, 2 or 9, characterized in that word segmentation is performed first when the feature extraction is carried out.
12. The method for establishing a machine learning model according to claim 11, characterized in that the word segmentation method comprises: forward matching segmentation, backward matching segmentation, bidirectional matching segmentation, segmentation based on a full-segmentation word graph, maximum entropy Markov model segmentation, maximum entropy segmentation, or conditional random field segmentation.
13. A machine learning model, characterized in that the machine learning model comprises:
a sample library acquisition module, for obtaining a sample library of prior target words;
a sample classification and grading module, for performing positive and negative sample classification on the prior target words, dividing the prior target words into at least one positive sample and one negative sample;
a first sample feature extraction module, for performing feature extraction on the positive sample and the negative sample;
a machine learning model forming module, for forming the machine learning model according to the extracted features of the positive sample and the negative sample.
14. The machine learning model according to claim 13, characterized in that the positive sample comprises target words with a high ad click-through rate and/or high advertising rates; the negative sample comprises one or a combination of: target words with a low ad click-through rate, target words with low advertising rates, or target words for which no advertisement is shown.
15. The machine learning model according to claim 13, characterized in that the sample classification and grading module further grades the positive sample, so that the prior target words are divided into positive samples of a plurality of different grades and negative samples.
16. The machine learning model according to claim 15, characterized in that, in the sample classification and grading module, at least part of the prior target words are classified into positive and negative samples, and the positive samples are graded, by reading a preset sample database.
17. The machine learning model according to claim 16, characterized in that, in the machine learning model forming module, machine learning is performed on the features of the positive sample and the negative sample, thereby forming the machine learning model.
18. The machine learning model according to claim 16, characterized in that, in the sample classification and grading module, the positive sample is further divided into at least two grade samples according to differences in classification level.
19. The machine learning model according to claim 18, characterized in that the grade samples comprise grade A samples, grade B samples, and grade C samples; or comprise grade A samples, grade B samples, grade C samples, and grade D samples; or comprise grade A samples, grade B samples, grade C samples, grade D samples, and grade E samples; wherein the grade A samples have the highest classification level, and the classification levels of the other grade samples decrease in order.
20. The machine learning model according to claim 18, characterized in that the classification level is judged according to the ad click-through rate of the target word and/or the advertising rates.
21. The machine learning model according to claim 18, characterized in that the machine learning model further comprises:
a second sample feature extraction module, for performing feature extraction on the remaining target words, that is, the prior target words that were not classified and graded in the sample classification and grading module;
a sample model calculation module, for performing model calculation on the obtained features of the remaining target words according to the machine learning model, then classifying and grading them, and adding the classified and graded remaining target word sample features to the machine learning model.
22. The machine learning model according to claim 21, characterized in that, in the sample model calculation module, the model calculation on the features of the remaining target words is performed by calculating positive-class and negative-class confidence scores for the features of the remaining target words according to the feature parameters trained in the machine learning model.
23. The machine learning model according to claim 13, 14 or 21, characterized in that word segmentation is performed first when feature extraction is carried out.
24. The machine learning model according to claim 23, characterized in that the word segmentation method comprises: forward matching segmentation, backward matching segmentation, bidirectional matching segmentation, segmentation based on a full-segmentation word graph, maximum entropy Markov model segmentation, maximum entropy segmentation, or conditional random field segmentation.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010105423748A CN101980211A (en) 2010-11-12 2010-11-12 Machine learning model and establishing method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2010105423748A CN101980211A (en) 2010-11-12 2010-11-12 Machine learning model and establishing method thereof

Publications (1)

Publication Number Publication Date
CN101980211A (en) 2011-02-23

Family

ID=43600713

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010105423748A Pending CN101980211A (en) 2010-11-12 2010-11-12 Machine learning model and establishing method thereof

Country Status (1)

Country Link
CN (1) CN101980211A (en)

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103123634A (en) * 2011-11-21 2013-05-29 北京百度网讯科技有限公司 Copyright resource identification method and copyright resource identification device
CN103123634B (en) * 2011-11-21 2016-04-27 北京百度网讯科技有限公司 A kind of copyright resource identification method and device
CN102591854A (en) * 2012-01-10 2012-07-18 凤凰在线(北京)信息技术有限公司 Advertisement filtering system and advertisement filtering method specific to text characteristics
CN102591983A (en) * 2012-01-10 2012-07-18 凤凰在线(北京)信息技术有限公司 Advertisement filter system and advertisement filter method
CN102591854B (en) * 2012-01-10 2015-08-05 凤凰在线(北京)信息技术有限公司 For advertisement filtering system and the filter method thereof of text feature
CN103377088A (en) * 2012-04-27 2013-10-30 国际商业机器公司 Method and system for discovering and grouping related computing resources using machine learning
CN103377088B (en) * 2012-04-27 2016-08-03 国际商业机器公司 For finding with packet about the method and system calculating resource
CN103425677B (en) * 2012-05-18 2016-08-24 阿里巴巴集团控股有限公司 Keyword classification model determines method, keyword classification method and device
CN103425677A (en) * 2012-05-18 2013-12-04 阿里巴巴集团控股有限公司 Method for determining classified models of keywords and method and device for classifying keywords
WO2015124024A1 (en) * 2014-02-24 2015-08-27 北京奇虎科技有限公司 Method and device for promoting exposure rate of information, method and device for determining value of search word
CN104537118A (en) * 2015-01-26 2015-04-22 苏州大学 Microblog data processing method, device and system
CN104537118B (en) * 2015-01-26 2017-12-26 苏州大学 A kind of microblog data processing method, apparatus and system
CN108205766A (en) * 2016-12-19 2018-06-26 阿里巴巴集团控股有限公司 Information-pushing method, apparatus and system
WO2018192348A1 (en) * 2017-04-20 2018-10-25 腾讯科技(深圳)有限公司 Data processing method and device, and server
CN107292154A (en) * 2017-06-09 2017-10-24 北京奇安信科技有限公司 A kind of terminal feature recognition methods and system
TWI793170B (en) * 2017-10-20 2023-02-21 美商雅虎廣告技術有限責任公司 System, devices, and method for automated bidding using deep neural language models
CN110110076A (en) * 2017-12-28 2019-08-09 重庆南华中天信息技术有限公司 Classification method based on machine learning knowledge
CN110110077A (en) * 2017-12-28 2019-08-09 重庆南华中天信息技术有限公司 Sorter based on machine learning knowledge
CN112052671A (en) * 2019-06-06 2020-12-08 阿里巴巴集团控股有限公司 Negative sample sampling method, text processing method, device, equipment and medium
CN112052671B (en) * 2019-06-06 2023-10-27 阿里巴巴集团控股有限公司 Negative sample sampling method, text processing method, device, equipment and medium
CN112259085A (en) * 2020-09-28 2021-01-22 上海声瀚信息科技有限公司 Two-stage voice awakening algorithm based on model fusion framework
CN115344757A (en) * 2022-02-07 2022-11-15 花瓣云科技有限公司 Label prediction method, electronic equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20110223