CN101980210A - Marked word classifying and grading method and system - Google Patents

Publication number
CN101980210A
Authority
CN
China
Prior art keywords
target speech
sample
grading
classify
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2010105423714A
Other languages
Chinese (zh)
Inventor
�田�浩
万伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baidu Online Network Technology Beijing Co Ltd
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN2010105423714A
Publication of CN101980210A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention provides a marked word classifying and grading method and a marked word classifying and grading system. The method comprises the following steps of: a, acquiring a marked word-containing classifying and grading machine model; b, extracting the features of a newly-marked word; and c, calculating by using the machine model according to the extracted features of the newly-marked word so as to determine the classification and grading of the newly-marked word. In the embodiment of the invention, the existing search keywords and advertising auction words are classified, graded and counted and the machine model is established, so that new keywords or auction words are automatically recognized, analyzed and evaluated by a machine and a set of marked word classifying and grading method and a set of marked word classifying and grading system are realized.

Description

A target word classification and grading method and system
[Technical Field]
The present invention relates to a target word classification and grading method and system, and in particular to a method and system for classifying and grading target words such as keywords and/or advertising words.
[Background Art]
In every corner of the online world and of real-world society, the reach and importance of advertising grow by the day. This has produced a large number of advertising words, and similar target words such as search terms, keywords and auction words also exist on the network. For a newly emerging advertising word, however, one can generally only rely on people to judge subjectively whether it has a positive effect and what level that effect reaches. Inexperienced people can easily make wrong judgments. Moreover, manual judgment is hard to carry out on a large scale, and the consistency of subjective judgments is hard to guarantee. How to classify and grade search terms and advertising words automatically by computer system is a technical problem that the information society needs to solve.
[Summary of the Invention]
The embodiments of the invention provide a target word classification and grading method and system, which can be used to classify and grade new target words and, further, to estimate the value of target words.
An embodiment of the invention provides a target word classification and grading method, comprising the steps of: a. obtaining a machine model containing target word classification and grading; b. extracting features from a new target word; and c. performing a calculation with the machine model on the extracted features of the new target word to determine the classification and grading of the new target word.
According to one embodiment of the present invention, the calculation computes positive-class and negative-class confidence scores for the features of the new target word, using the feature parameters trained in the machine model.
According to one embodiment of the present invention, step a further comprises: a1. obtaining a sample library of prior target words; a2. classifying the prior target words into positive and negative samples, so that the prior target words are divided into at least one positive sample and one negative sample; a3. extracting features from the positive and negative samples; a4. forming the machine model from the extracted features of the positive and negative samples.
According to one embodiment of the present invention, the positive samples comprise target words with a high ad click-through rate and/or a high advertising price; the negative samples comprise one or a combination of: target words with a low ad click-through rate, target words with a low advertising price, and target words for which no advertisement is displayed.
According to one embodiment of the present invention, step a2 further comprises grading the positive samples, so that the prior target words are divided into positive samples of a plurality of different grades and negative samples.
According to one embodiment of the present invention, in step a2, at least some of the prior target words are classified into positive and negative samples, and the positive samples are graded, by reading a preset sample database.
According to one embodiment of the present invention, in step a4, machine modeling is performed on the features of the positive and negative samples, thereby forming the machine model.
According to one embodiment of the present invention, the positive samples are further divided into at least two graded samples according to grade level.
According to one embodiment of the present invention, the graded samples comprise grade A, grade B and grade C samples; or grade A, grade B, grade C and grade D samples; or grade A, grade B, grade C, grade D and grade E samples; wherein grade A samples have the highest grade level and the grade levels of the other graded samples decrease in order.
According to one embodiment of the present invention, the grade level is determined by how high the ad click-through rate and/or the advertising price of the target word is.
According to one embodiment of the present invention, step a further comprises: a5. extracting features from the remaining target words, that is, the prior target words that were not classified and graded in step a2; a6. performing a calculation on the obtained features of the remaining target words according to the machine model, classifying and grading them accordingly, and adding the sample features of the classified and graded remaining target words to the machine model.
According to one embodiment of the present invention, in step a6, the calculation computes positive-class and negative-class confidence scores for the features of the remaining target words, using the feature parameters trained in the machine model.
According to one embodiment of the present invention, word segmentation is performed first when features are extracted.
According to one embodiment of the present invention, the word segmentation method comprises: forward matching segmentation, reverse matching segmentation, bidirectional matching segmentation, segmentation based on a full-segmentation word graph, maximum entropy Markov model segmentation, maximum entropy segmentation, or conditional random field segmentation.
An embodiment of the invention further provides a target word classification and grading system, comprising: a machine model containing target word classification and grading; a feature extraction module for extracting features from a new target word; and a computing module for performing a calculation with the machine model on the extracted features of the new target word to determine the classification and grading of the new target word.
According to one embodiment of the present invention, in the computing module, the calculation computes positive-class and negative-class confidence scores for the features of the new target word, using the feature parameters trained in the machine model.
According to one embodiment of the present invention, the machine model comprises: a sample library acquisition module for obtaining a sample library of prior target words; a sample classification and grading module for classifying the prior target words into positive and negative samples, so that the prior target words are divided into at least one positive sample and one negative sample; a first sample feature extraction module for extracting features from the positive and negative samples; and a machine model forming module for forming the machine model from the extracted features of the positive and negative samples.
According to one embodiment of the present invention, the positive samples comprise target words with a high ad click-through rate and/or a high advertising price; the negative samples comprise one or a combination of: target words with a low ad click-through rate, target words with a low advertising price, and target words for which no advertisement is displayed.
According to one embodiment of the present invention, the sample classification and grading module further grades the positive samples, so that the prior target words are divided into positive samples of a plurality of different grades and negative samples.
According to one embodiment of the present invention, in the sample classification and grading module, at least some of the prior target words are classified into positive and negative samples, and the positive samples are graded, by reading a preset sample database.
According to one embodiment of the present invention, in the machine model forming module, machine modeling is performed on the features of the positive and negative samples, thereby forming the machine model.
According to one embodiment of the present invention, in the sample classification and grading module, the positive samples are further divided into at least two graded samples according to grade level.
According to one embodiment of the present invention, the graded samples comprise grade A, grade B and grade C samples; or grade A, grade B, grade C and grade D samples; or grade A, grade B, grade C, grade D and grade E samples; wherein grade A samples have the highest grade level and the grade levels of the other graded samples decrease in order.
According to one embodiment of the present invention, the grade level is determined by how high the ad click-through rate and/or the advertising price of the target word is.
According to one embodiment of the present invention, the machine model further comprises: a second sample feature extraction module for extracting features from the remaining target words, that is, the prior target words that were not classified and graded in the sample classification and grading module; and a sample calculation module for performing a calculation on the obtained features of the remaining target words according to the machine model, classifying and grading them accordingly, and adding the sample features of the classified and graded remaining target words to the machine model.
According to one embodiment of the present invention, in the sample calculation module, the calculation computes positive-class and negative-class confidence scores for the features of the remaining target words, using the feature parameters trained in the machine model.
According to one embodiment of the present invention, word segmentation is performed first when features are extracted.
According to one embodiment of the present invention, the word segmentation method comprises: forward matching segmentation, reverse matching segmentation, bidirectional matching segmentation, segmentation based on a full-segmentation word graph, maximum entropy Markov model segmentation, maximum entropy segmentation, or conditional random field segmentation.
By classifying, grading and compiling statistics on existing search keywords and ad auction words and building a machine model from them, the embodiments of the invention enable new keywords or auction words to be automatically recognized, analyzed and evaluated by machine, thereby realizing a target word classification and grading method and system.
[Brief Description of the Drawings]
To illustrate the technical solutions in the embodiments of the invention more clearly, the drawings used in describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort. In addition, the drawings are not drawn to scale. In the drawings:
Fig. 1 is a schematic structural block diagram of a target word classification and grading system according to an embodiment of the present invention.
Fig. 2 is a schematic structural block diagram of the machine model shown in Fig. 1.
Fig. 3 is a schematic flow diagram of a target word classification and grading method according to an embodiment of the present invention.
Fig. 4 is a schematic flow diagram of the machine model building method shown in Fig. 3.
Fig. 5 is a schematic flow diagram of a machine model according to another embodiment of the present invention.
[Detailed Description]
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.
Fig. 1 is a schematic structural block diagram of a target word classification and grading system according to an embodiment of the present invention. For convenience, search terms, keywords and/or auction words are collectively referred to herein as "target words". The target word classification and grading system of this embodiment comprises a machine model 11, a feature extraction module 12 and a computing module 13. The machine model 11 is a machine model containing target word classification and grading. The feature extraction module 12 extracts features from a new target word; a "new target word" here means a target word that needs to be classified and graded. The computing module 13 performs a calculation with the machine model 11 on the features of the new target word extracted by the feature extraction module 12, and thereby determines the classification and grading of the new target word. In other embodiments, the machine model 11 may instead be used for model matching to determine the classification and grading of the new target word.
Fig. 2 is a schematic structural block diagram of the machine model shown in Fig. 1. Referring to Fig. 1, the machine model 11 comprises a sample library acquisition module 21, a sample classification and grading module 22, a first sample feature extraction module 23 and a machine model forming module 24. The sample library acquisition module 21 obtains a sample library of prior target words. The sample classification and grading module 22 classifies the prior target words obtained by the sample library acquisition module 21 into positive samples and negative samples, and grades the positive samples. At least part of this classification and grading of the prior target words in the sample classification and grading module 22 is done by reading a preset sample database. The sample database contains positive sample data and negative sample data generated by computer through classification statistics on existing target words according to statistical criteria such as ad click-through rate, ad volume and ad rank; the positive and negative sample data in the sample database can also be adjusted manually. The first sample feature extraction module 23 extracts features from the positive and negative samples. The machine model forming module 24 performs machine modeling on the features of the positive and negative samples extracted by the first sample feature extraction module 23, and thereby forms the machine model.
As shown in Fig. 2, the machine model building system of this embodiment further comprises a second sample feature extraction module 25 and a sample calculation module 26. The second sample feature extraction module 25 extracts features from the remaining target words, that is, the prior target words that were not classified and graded in the sample classification and grading module 22. The sample calculation module 26 performs a calculation, according to the machine model, on the features of the remaining target words obtained by the second sample feature extraction module 25, and classifies and grades them accordingly. In other embodiments, the sample calculation module 26 may instead perform model matching, according to the machine model, on the features of the remaining target words obtained by the second sample feature extraction module 25, and classify and grade them accordingly. The sample calculation module 26 further adds the features of the classified and graded remaining target words to the machine model, so that the machine model is further improved.
The specific functions of the modules of the target word classification and grading system and of the machine model building system are described below with reference to Fig. 3 and Fig. 4.
Fig. 3 is a schematic flow diagram of a target word classification and grading method according to an embodiment of the present invention.
In step 31, the machine model 11 containing target word classification and grading is obtained.
In step 32, a new target word is obtained by the feature extraction module 12. The new target word may be entered by a user or obtained in some other way.
In step 33, the feature extraction module 12 extracts features from the new target word. Before features are extracted, the new target word must first be segmented into words. The word segmentation method comprises: forward matching segmentation, reverse matching segmentation, bidirectional matching segmentation, segmentation based on a full-segmentation word graph, maximum entropy Markov model segmentation, maximum entropy segmentation, or conditional random field segmentation.
Forward maximum matching segmentation and reverse maximum matching segmentation are taken as examples below. Consider a Chinese sentence meaning "Many new colleagues came today", with the maximum length set to 5. Forward maximum matching starts at the first character of the sentence and takes 5 characters. These 5 characters do not form a word, so the last character is removed. The remainder is still not a word, so the last character is removed again, and so on, until a word is obtained: "today". Matching then restarts from the next character and, in the same way, yields the words "come", a particle, "many", "new" and "colleague" in turn. The final result of forward maximum matching is: /today/come/(particle)/many/new/colleague/. Reverse maximum matching, with the maximum length likewise set to 5, starts from the end of the sentence and removes characters from the front of the window instead: the last 5 characters do not form a word, so the first character is removed repeatedly until a word is obtained: "colleague". Proceeding in the same way, the final result of reverse maximum matching is: /today/come/(particle)/many/new/colleague/. Note, however, that the results of forward and reverse maximum matching are not necessarily the same. For example, for a sentence meaning "I eat alone" with the maximum length set to 5, forward maximum matching gives /I/one/person/have a meal/, whereas reverse maximum matching gives /I/one/individual/have a meal/, because the two directions place the boundary between the middle characters differently. Different segmentation methods may therefore lead to different feature extraction results for a target word.
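The two matching strategies can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the toy dictionary `VOCAB`, the function names, and the Chinese strings (an assumed rendering of the two example sentences, whose original characters are not given in this text) are all hypothetical.

```python
# Hypothetical toy dictionary; a real segmenter would load a large lexicon.
VOCAB = {"今天", "来", "了", "许多", "新", "同事",
         "我", "一个", "个人", "一", "人", "吃饭"}


def forward_max_match(text, vocab, max_len=5):
    """Scan left to right, always taking the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        # Try the longest window first, then drop the last character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words


def reverse_max_match(text, vocab, max_len=5):
    """Scan right to left, always taking the longest dictionary word."""
    words, end = [], len(text)
    while end > 0:
        # Try the longest window first, then drop the first character.
        for size in range(min(max_len, end), 0, -1):
            piece = text[end - size:end]
            if size == 1 or piece in vocab:
                words.append(piece)
                end -= size
                break
    return words[::-1]  # words were collected back to front


if __name__ == "__main__":
    for sentence in ("今天来了许多新同事", "我一个人吃饭"):
        print("/".join(forward_max_match(sentence, VOCAB)))
        print("/".join(reverse_max_match(sentence, VOCAB)))
```

With this assumed dictionary the two functions agree on the first sentence and disagree on the second, which is the behavior described in the text.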
In step 34, the features of the new target word extracted in step 33 are obtained.
In step 35, the machine model obtained in step 31 is applied to perform a calculation on the features of the new target word obtained in step 34 (the details of the calculation are introduced below). In other embodiments, the machine model obtained in step 31 may instead be applied to perform model matching on the features of the new target word obtained in step 34.
In step 36, the computing module 13 determines the classification and grading of the new target word according to the confidence scores calculated from the features of the new target word obtained in step 34.
Fig. 4 is a schematic flow diagram of the machine model building method shown in Fig. 3. The machine model built here is the machine model obtained in step 31.
In step 41, the sample library acquisition module 21 obtains a sample library of prior target words. Each prior target word is accompanied by information such as its ad click-through rate. The ad click-through rate may be, for example, a statistic over a past period of time.
In step 42, the sample classification and grading module 22 classifies the prior target words into positive and negative samples, dividing them into at least one positive sample and one negative sample, and grades the positive samples. Positive samples are mainly selected in one of the following ways: selecting target words with a high ad click-through rate, selecting target words with a high advertising price, or selecting on the basis of both conditions together. Negative samples are mainly selected in one of the following ways: selecting target words with a low ad click-through rate, target words with a low advertising price, target words for which no advertisement is displayed, or selecting on the basis of these three conditions together. Put another way, the target words in the positive samples are those that have directly or indirectly created high value, whereas the target words in the negative samples are those that have created no value, directly or indirectly, or only low value. For ease of explanation, a simple example of a group of positive and negative samples is given below.
For example, suppose there are four keywords: "Panpan arrives home, live and work in peace", "Use Buyang, I am at ease", "Meixin door, happy regard" and "popular anti-theft tips". Clearly, the first three keywords are all advertising slogans of security door brands and all have a certain value, especially commercial value. Therefore "Panpan arrives home, live and work in peace", "Use Buyang, I am at ease" and "Meixin door, happy regard" are classified as positive samples. The value of "popular anti-theft tips", by contrast, is very low, and it has no commercial value in particular. Therefore "popular anti-theft tips" is classified as a negative sample.
More precisely, in step 42, at least some of the prior target words are classified into positive and negative samples, and the positive samples are graded, by reading a preset sample database. This includes both classification into positive and negative samples by computer, according to statistical criteria such as the ad click-through rate, ad volume and ad rank of the target word, and manual classification of the prior target words into positive and negative samples and manual grading of the positive samples. Because the ad click-through rate is highly correlated with the ad display strategy of the advertising system, it would be unreasonable to treat as negative samples those target words that obviously have value but have temporarily not received effective ad display. Assisting with manual classification and grading solves this kind of problem well.
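As a rough sketch of how the statistical criteria and the manual adjustment described for step 42 might combine, the following Python fragment labels a prior target word as a positive or negative sample. The record fields, the thresholds `HIGH_CTR` and `HIGH_PRICE`, and the rule that a manual label always takes precedence are assumptions made for illustration; the text does not specify them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PriorWord:
    text: str
    click_rate: float                   # ad click-through rate over a past period
    ad_price: float                     # advertising price
    shown: bool                         # whether any ad was displayed for the word
    manual_label: Optional[str] = None  # "positive"/"negative" set by a reviewer


# Hypothetical thresholds; real values would come from the ad system's statistics.
HIGH_CTR = 0.05
HIGH_PRICE = 2.0


def label_sample(word: PriorWord) -> str:
    """Return "positive" or "negative" for one prior target word."""
    # Assumed rule: a manual decision overrides the statistics, so a valuable
    # word that has not yet been displayed is not forced into the negatives.
    if word.manual_label is not None:
        return word.manual_label
    if not word.shown:
        return "negative"
    if word.click_rate >= HIGH_CTR or word.ad_price >= HIGH_PRICE:
        return "positive"
    return "negative"
```

For instance, `label_sample(PriorWord("slogan", 0.0, 0.0, False, manual_label="positive"))` returns `"positive"` even though no ad was ever displayed for that word.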
The grading of positive samples mentioned in step 42 is explained in detail here. The positive samples are graded, that is, divided into at least two graded samples according to grade level. The grade level is determined by how high the ad click-through rate and/or the advertising price of the target word is. In general, the graded samples comprise grade A, grade B and grade C samples; or grade A, grade B, grade C and grade D samples; or grade A, grade B, grade C, grade D and grade E samples. The target words in the grade A samples have the highest ad click-through rate and/or advertising price, so grade A has the highest grade level, and the grade levels of the other graded samples decrease in order. These three grading schemes ensure accuracy in grading without consuming too much calculation. If there are too many grades, the amount of calculation increases and the boundaries between grades become blurred. Grading the example sample group mentioned above: the value of "Panpan arrives home, live and work in peace" is very high, so it is assigned to the grade A samples. The value of "Use Buyang, I am at ease" is medium, so it is assigned to the grade B samples. The value of "Meixin door, happy regard" is lower, so it is assigned to the grade C samples.
When the positive samples are graded, manual operation can influence the result. For example, if a target word has very high value but its ad click data are not very high, it can be assigned manually to the grade A samples. However, the same target word may well be assigned to different grades by different people, so the error rate of manual operation in grading can be relatively high. Combining manual operation with computer recognition of various data (such as the existing ad click-through rate, ad volume and ad rank) is therefore the safer method of grading.
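One way to picture the grading of positive samples is a table of descending cutoffs on the click-through rate, with an optional manual grade. This is only a sketch: the cutoff values, the three-grade scheme chosen here, and the precedence given to the manual grade are assumptions, not values taken from the text.

```python
# Hypothetical cutoffs on ad click-through rate, highest grade first.
GRADE_CUTOFFS = [("A", 0.10), ("B", 0.05)]
LOWEST_GRADE = "C"


def grade_positive_sample(click_rate, manual_grade=None):
    """Return the grade ("A", "B" or "C") of one positive sample."""
    # Assumed rule: a reviewer's grade, when present, takes precedence.
    if manual_grade is not None:
        return manual_grade
    # Walk the cutoffs from the highest grade down and stop at the first hit.
    for grade, cutoff in GRADE_CUTOFFS:
        if click_rate >= cutoff:
            return grade
    return LOWEST_GRADE
```

Adding a grade D or E under this sketch only requires extending `GRADE_CUTOFFS`, which also makes the trade-off in the text visible: more entries mean narrower, blurrier bands.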
In step 43, the first sample feature extraction module 23 extracts features from the positive and negative samples. As in step 33, word segmentation may be performed first in the feature extraction of step 43. Because different segmentation methods may lead to different feature extraction results for a target word, the segmentation method used in step 43 is preferably the same as the one used in step 33. For example, extracting features from the example sample group mentioned above gives the corresponding features: /Panpan/arrives home/live and work in peace/, /use/Buyang/I/at ease/, /Meixin/door/happy/regard/.
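The text does not say how the segmented tokens are turned into features. One common choice, assumed here purely for illustration, is a bag of tokens in which each token becomes a feature whose value is its count. The helper names below are hypothetical.

```python
from collections import Counter


def parse_segmented(slashed):
    """Split a '/a/b/c/' style segmentation result into a list of tokens."""
    return [tok.strip() for tok in slashed.split("/") if tok.strip()]


def extract_features(tokens):
    """Turn segmented tokens into a bag-of-tokens feature dictionary."""
    return dict(Counter(tokens))


if __name__ == "__main__":
    tokens = parse_segmented("/Panpan/home/peace/")
    print(extract_features(tokens))
```

Because the features are keyed by token, a change of segmentation method changes the feature keys, which is why the same method should be used for training samples and for new target words.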
In step 44, the machine model forming module 24 performs machine modeling on the features of the positive and negative samples extracted by the first sample feature extraction module 23, and thereby forms the machine model. The machine modeling process may use machine learning, or may be realized in other ways such as mathematical induction or probability statistics. Machine learning is taken as an example below.
Machine learning studies how a computer can simulate or realize human learning behavior in order to acquire new knowledge or skills and reorganize its existing knowledge structure so as to keep improving its own performance. The knowledge acquired by a machine learning system includes rules of behavior, descriptions of physical objects, problem-solving strategies, various classifications and gradings, and other types of knowledge used to accomplish tasks. Inductive learning, one of the major categories of machine learning, is used as an illustration. In inductive learning, a teacher or the environment provides some examples or counterexamples of a concept, and the learner derives a general description of the concept by induction. For instance, positive sample features and negative sample features are provided, and the machine derives by induction a general description of the positive sample concept and a general description of the negative sample concept, so that it gains the ability to analyze whether other sample features belong to a positive sample or a negative sample. A maximum entropy classifier can be used when building the machine model. Classifiers such as SVM (support vector machine) and Boosting can also be used to similar effect.
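For two classes, a maximum entropy classifier is equivalent to logistic regression, so a very small training loop can illustrate the idea. The sketch below is not the model of the embodiment: the stochastic gradient update, the learning rate, the number of epochs and the toy features are all assumptions.

```python
import math


def _score(weights, features):
    """Weighted sum of the feature values (missing weights count as 0)."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def train_maxent(samples, epochs=200, lr=0.5):
    """Fit a two-class maximum entropy (logistic) model.

    samples: list of (features, label) pairs, where features maps a feature
    name to a value and label is 1 (positive sample) or 0 (negative sample).
    """
    weights = {}
    for _ in range(epochs):
        for features, label in samples:
            prob = 1.0 / (1.0 + math.exp(-_score(weights, features)))
            # Gradient step on the log-likelihood: move each weight in
            # proportion to the prediction error.
            for name, value in features.items():
                weights[name] = weights.get(name, 0.0) + lr * (label - prob) * value
    return weights


def positive_confidence(weights, features):
    """Confidence that the features belong to the positive class."""
    return 1.0 / (1.0 + math.exp(-_score(weights, features)))


if __name__ == "__main__":
    training = [
        ({"brand": 1, "door": 1}, 1),
        ({"brand": 1, "safe": 1}, 1),
        ({"tips": 1, "antitheft": 1}, 0),
        ({"tips": 1, "cold": 1}, 0),
    ]
    model = train_maxent(training)
    print(positive_confidence(model, {"brand": 1}))
    print(positive_confidence(model, {"tips": 1}))
```

A feature never seen in training has weight 0 and leaves the confidence at 0.5, which is one reason a larger and better-labeled sample library matters.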
In step 45, the second sample feature extraction module 25 extracts features from the part of the sample library of prior target words that was not classified and graded into positive and negative samples in step 42, obtaining the sample features of the remaining target words. This step is similar to step 43. The difference is that step 45 extracts features from the remaining target words, that is, those words in the sample library obtained in step 41 that were not classified and graded in step 42. Although manual classification and grading of positive and negative samples can add a degree of accuracy, if the number of prior target words obtained in step 41 is very large, classifying all of them manually would mean a very heavy workload and could lead to problems such as excessive working time and excessive cost. Having people classify and grade part of the target words first, and leaving the remaining target words to machine classification and grading, is therefore a method that saves time and effort without losing accuracy.
In step 46, residue target speech sample characteristics is calculated according to machine mould by sample calculation module 26, and then carry out classify and grading.In other embodiments, can carry out Model Matching according to machine mould to residue target speech sample characteristics by sample calculation module 26, and then carry out classify and grading.What this step was carried out is that the machine sort classification transferred in residue target speech, saves great amount of manpower, but still can guarantee certain accuracy.And, add the residue target speech sample characteristics behind the classify and grading to machine mould, make it further perfect.
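Steps 45 and 46 resemble a single round of self-training: a model built from the manually labeled part labels the remainder, and the newly labeled features are folded back into the model. The sketch below illustrates that loop under simplifying assumptions; the count-based scoring and all names (`build_model`, `class_score`, `label_remaining`) are hypothetical stand-ins for whatever classifier the machine model actually uses.

```python
from collections import Counter


def build_model(labeled):
    """labeled: list of (features, label). Returns {label: Counter of features}."""
    model = {}
    for features, label in labeled:
        model.setdefault(label, Counter()).update(features)
    return model


def class_score(model, features, label):
    """Share of the class's feature mass covered by the given features."""
    counts = model[label]
    total = sum(counts.values())
    return sum(counts[f] for f in features) / total if total else 0.0


def classify(model, features):
    """Pick the class whose score is highest."""
    return max(model, key=lambda label: class_score(model, features, label))


def label_remaining(model, remaining):
    """Label the unlabeled samples, then add their features to the model."""
    newly_labeled = [(features, classify(model, features)) for features in remaining]
    for features, label in newly_labeled:
        model[label].update(features)
    return newly_labeled
```

All remaining samples are labeled before the model is updated, so the result does not depend on the order in which they are processed.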
As shown in Fig. 3 and Fig. 4, in step 35, when the features of a new target word are calculated, positive-class and negative-class confidence scores are computed for those features using the feature parameters trained in the machine model. If the positive-sample-class confidence score computed from the machine model parameters is higher than the negative-sample-class confidence score, then in step 36 the target word is assigned to the valuable class; if the negative-sample-class confidence score is higher than the positive-sample-class confidence score, then in step 36 it is assigned to the class of no or low value. Similarly, in step 46, the remaining target word sample features are calculated by computing positive-class and negative-class confidence scores for the features of the remaining target words using the feature parameters trained in the machine model. Take "Use XX, guaranteed peace of mind" and "flu-prevention tips" as examples. The positive-sample-class confidence score of "Use XX, guaranteed peace of mind" is very high, so it is classified as a valuable target word, whereas the negative-sample-class confidence score of "flu-prevention tips" is high, so it is assigned to the class of no or low value. If the classification and grading in step 42 only divides the prior target words into positive and negative samples without grading the positive samples, the evaluation in step 36 is a coarse evaluation; if the positive samples are graded, the evaluation in step 36 is a detailed evaluation.
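The coarse evaluation can be sketched as follows, assuming the trained feature parameters take the form of per-class feature weights and that the confidence scores are obtained by normalizing the exponentiated weight sums, as in a maximum entropy model. The weights, the class names and the tie-breaking rule (a tie counts as low value) are hypothetical; the text specifies none of them.

```python
import math


def confidence_scores(class_weights, features):
    """class_weights: {class_name: {feature: weight}}.

    Returns a normalized confidence score per class (the scores sum to 1).
    """
    raw = {
        name: sum(weights.get(f, 0.0) for f in features)
        for name, weights in class_weights.items()
    }
    peak = max(raw.values())  # subtract the maximum for numerical stability
    exps = {name: math.exp(value - peak) for name, value in raw.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


def coarse_evaluation(scores):
    """Valuable if the positive score beats the negative score (assumed rule)."""
    if scores["positive"] > scores["negative"]:
        return "valuable"
    return "low or no value"


# Hypothetical trained parameters.
WEIGHTS = {
    "positive": {"use": 1.0, "buy": 1.5},
    "negative": {"flu": 0.8, "tips": 1.2},
}
```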
The detailed evaluation is explained below for the case in which the graded positive samples comprise grade-A, grade-B and grade-C samples. After the positive samples are divided into three grades, the machine model correspondingly includes four kinds of model parameters: grade-A sample feature model parameters, grade-B sample feature model parameters, grade-C sample feature model parameters, and negative sample feature model parameters. In step 36, one classification and grading approach is to calculate, from the target word sample features, the confidence score of each class based on that class's model parameters; the target word is assigned to whichever grade has the highest confidence score. For example, suppose the features of a target word have a confidence score of 0.12 for grade A, 0.63 for grade B and 0.17 for grade C. Because the grade-B sample feature model gives this target word the highest score, 0.63, the target word is assigned to grade B. Besides this approach, other commonly used, more complex classification and grading approaches can also be used.
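The highest-score rule from the example above amounts to an argmax over the per-class confidence scores. A minimal sketch, using the three scores given in the text (the function name is hypothetical):

```python
def assign_grade(scores):
    """Return the class whose confidence score is highest."""
    return max(scores, key=scores.get)


# Scores from the worked example: grade A 0.12, grade B 0.63, grade C 0.17.
example_scores = {"A": 0.12, "B": 0.63, "C": 0.17}
grade = assign_grade(example_scores)  # "B"
```

In a full model the dictionary would also carry the negative-sample score, so the same call could return the negative class.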
Fig. 5 is a schematic flowchart of the machine model according to another embodiment of the present invention.
In step 51, a sample library of prior target words is obtained. Each prior target word is accompanied by information such as its ad click-through rate. The click-through rate may be, for example, a statistic over a past period of time.
In step 52, the positive and negative samples are classified: the prior target words are divided, according to their grading level, into positive samples of a plurality of different grades and negative samples. The grading level is determined by the ad click-through rate of the target word and/or the advertising price.
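One way to picture step 52 is a threshold table over the click-through rate. The sketch below assumes three positive grades and invented cut-off values; the patent names the criteria (click-through rate and/or advertising price) but gives no thresholds, so every number here is hypothetical.

```python
# Hypothetical cut-offs, highest grade first: (grade, minimum click-through rate).
CTR_THRESHOLDS = (("A", 0.10), ("B", 0.05), ("C", 0.01))


def grade_by_ctr(ctr, thresholds=CTR_THRESHOLDS):
    """Map a click-through rate to a positive grade, or to the negative class."""
    for grade, minimum in thresholds:
        if ctr >= minimum:
            return grade
    return "negative"


def split_sample_library(library):
    """library: {word: ctr}. Returns {grade_or_negative: [words]}."""
    groups = {}
    for word, ctr in library.items():
        groups.setdefault(grade_by_ctr(ctr), []).append(word)
    return groups
```

This simplification puts every word into some class; a real split could also leave some words unclassified.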
In step 53, the positive samples of a plurality of different grades are obtained. For a description of the positive samples of different grades, see the part of the detailed description of step 42 above that concerns grading the positive samples.
In step 54, the negative samples are obtained. Of course, there are sometimes prior target words that are classified neither as positive samples nor as negative samples.
In step 55, feature extraction is performed on the positive samples and the negative samples. As in step 33, word segmentation may be performed first in the feature extraction of step 55.
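Forward matching is one of the segmentation methods the claims list. The sketch below shows a common variant, forward maximum matching, which repeatedly takes the longest dictionary entry starting at the current position. The dictionary and the input are hypothetical, and a space-free ASCII string stands in for unsegmented Chinese text.

```python
def forward_max_match(text, dictionary, max_len=10):
    """Segment text greedily from left to right, longest dictionary match first.

    A single character is emitted as a fallback when nothing in the
    dictionary matches at the current position.
    """
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
    return tokens


# Hypothetical dictionary; "tip" is included to show that the longer "tips" wins.
DICTIONARY = {"flu", "prevention", "tips", "tip"}
features = forward_max_match("flupreventiontips", DICTIONARY)
```

The resulting tokens can then serve as the sample features passed to the modeling step.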
In step 56, a plurality of positive sample features are obtained. The positive samples of each grade yield corresponding positive sample features.
In step 57, the negative sample features are obtained.
In step 58, machine modeling is performed on the plurality of positive sample features and the negative sample features.
In step 59, the machine model is formed.
It should be noted that the above steps can be suitably varied in practice. For example, in step 52, the positive/negative classification may instead first determine the plurality of positive samples and then subtract them from the whole sample library to obtain the negative samples.
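The variation described above, in which the negatives are whatever remains once the graded positives are removed from the sample library, can be sketched as a set difference. The function name and the data layout are hypothetical.

```python
def derive_negatives(sample_library, graded_positives):
    """Return the words in the library that belong to no positive grade.

    sample_library: list of words.
    graded_positives: {grade: [words]}.
    The original library order is preserved.
    """
    positive_words = set()
    for words in graded_positives.values():
        positive_words.update(words)
    return [word for word in sample_library if word not in positive_words]
```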
In this way, existing search keywords and ad auction words can be classified, graded and counted and a machine model established, so that new keywords or auction words are automatically recognized, analyzed and evaluated by machine, realizing a target word classification and grading method and system. With the classification and grading method using this machine model, it is possible not only to judge whether a target word has value (such as ad click-through rate), but also to grade how high that value is.
The above embodiments describe the present invention only by way of example; after reading this patent application, those skilled in the art can make various modifications to the present invention without departing from its spirit and scope.

Claims (28)

1. A target word classification and grading method, characterized in that the method comprises the steps of:
a. obtaining a machine model containing target word classification and grading;
b. performing feature extraction on a new target word; and
c. calculating, using the machine model, according to the extracted features of the new target word, to determine the classification and grading of the new target word.
2. The target word classification and grading method according to claim 1, characterized in that the calculation computes positive-class and negative-class confidence scores for the features of the new target word using the feature parameters trained in the machine model.
3. The target word classification and grading method according to claim 1, characterized in that step a further comprises the steps of:
a1. obtaining a sample library of prior target words;
a2. performing positive/negative sample classification on the prior target words, dividing the prior target words into at least one positive sample and one negative sample;
a3. performing feature extraction on the positive sample and the negative sample;
a4. forming the machine model according to the extracted features of the positive sample and the negative sample.
4. The target word classification and grading method according to claim 3, characterized in that the positive sample comprises target words with a high ad click-through rate and/or a high advertising price; and the negative sample comprises one or a combination of: target words with a low ad click-through rate, target words with a low advertising price, and target words for which no advertisement is displayed.
5. The target word classification and grading method according to claim 3, characterized in that step a2 further comprises grading the positive samples, so that the prior target words are divided into positive samples of a plurality of different grades and negative samples.
6. The target word classification and grading method according to claim 5, characterized in that, in step a2, the positive/negative sample classification and the grading of the positive samples are performed on at least part of the prior target words by reading a preset sample database.
7. The target word classification and grading method according to claim 6, characterized in that, in step a4, machine modeling is performed on the features of the positive samples and the negative samples, thereby forming the machine model.
8. The target word classification and grading method according to claim 6, characterized in that the positive samples are further divided into at least two grades of samples according to grading level.
9. The target word classification and grading method according to claim 8, characterized in that the graded samples comprise grade-A, grade-B and grade-C samples; or grade-A, grade-B, grade-C and grade-D samples; or grade-A, grade-B, grade-C, grade-D and grade-E samples; wherein the grade-A samples have the highest grading level and the grading levels of the other graded samples decrease in order.
10. The target word classification and grading method according to claim 9, characterized in that the grading level is determined by the ad click-through rate of the target word and/or the advertising price.
11. The target word classification and grading method according to claim 8, characterized in that step a further comprises:
a5. performing feature extraction on the remaining prior target words that were not classified and graded in step a2;
a6. calculating the obtained features of the remaining target words according to the machine model, then classifying and grading them, and adding the classified and graded remaining target word sample features to the machine model.
12. The target word classification and grading method according to claim 11, characterized in that, in step a6, the features of the remaining target words are calculated by computing positive-class and negative-class confidence scores for those features using the feature parameters trained in the machine model.
13. The target word classification and grading method according to any one of claims 1, 3 or 11, characterized in that word segmentation is performed first when the feature extraction is performed.
14. The target word classification and grading method according to claim 13, characterized in that the word segmentation method comprises: forward matching segmentation, reverse matching segmentation, bidirectional matching segmentation, segmentation based on a full segmentation word graph, maximum entropy Markov model segmentation, maximum entropy segmentation, or conditional random field segmentation.
15. A target word classification and grading system, characterized in that the system comprises:
a machine model, the machine model being a machine model containing target word classification and grading;
a feature extraction module, configured to perform feature extraction on a new target word;
a calculation module, configured to calculate, using the machine model, according to the extracted features of the new target word, to determine the classification and grading of the new target word.
16. The target word classification and grading system according to claim 15, characterized in that, in the calculation module, the calculation computes positive-class and negative-class confidence scores for the features of the new target word using the feature parameters trained in the machine model.
17. The target word classification and grading system according to claim 15, characterized in that the machine model comprises:
a sample library acquisition module, configured to obtain a sample library of prior target words;
a sample classification and grading module, configured to perform positive/negative sample classification on the prior target words, dividing the prior target words into at least one positive sample and one negative sample;
a first sample feature extraction module, configured to perform feature extraction on the positive sample and the negative sample;
a machine model forming module, configured to form the machine model according to the extracted features of the positive sample and the negative sample.
18. The target word classification and grading system according to claim 17, characterized in that the positive sample comprises target words with a high ad click-through rate and/or a high advertising price; and the negative sample comprises one or a combination of: target words with a low ad click-through rate, target words with a low advertising price, and target words for which no advertisement is displayed.
19. The target word classification and grading system according to claim 17, characterized in that the sample classification and grading module further grades the positive samples, so that the prior target words are divided into positive samples of a plurality of different grades and negative samples.
20. The target word classification and grading system according to claim 19, characterized in that, in the sample classification and grading module, the positive/negative sample classification and the grading of the positive samples are performed on at least part of the prior target words by reading a preset sample database.
21. The target word classification and grading system according to claim 20, characterized in that, in the machine model forming module, machine modeling is performed on the features of the positive samples and the negative samples, thereby forming the machine model.
22. The target word classification and grading system according to claim 21, characterized in that, in the sample classification and grading module, the positive samples are further divided into at least two grades of samples according to grading level.
23. The target word classification and grading system according to claim 22, characterized in that the graded samples comprise grade-A, grade-B and grade-C samples; or grade-A, grade-B, grade-C and grade-D samples; or grade-A, grade-B, grade-C, grade-D and grade-E samples; wherein the grade-A samples have the highest grading level and the grading levels of the other graded samples decrease in order.
24. The target word classification and grading system according to claim 22, characterized in that the grading level is determined by the ad click-through rate of the target word and/or the advertising price.
25. The target word classification and grading system according to claim 22, characterized in that the machine model further comprises:
a second sample feature extraction module, configured to perform feature extraction on the remaining prior target words that were not classified and graded in the sample classification and grading module;
a sample calculation module, configured to calculate the obtained features of the remaining target words according to the machine model, then classify and grade them, and add the classified and graded remaining target word sample features to the machine model.
26. The target word classification and grading system according to claim 25, characterized in that, in the sample calculation module, the features of the remaining target words are calculated by computing positive-class and negative-class confidence scores for those features using the feature parameters trained in the machine model.
27. The target word classification and grading system according to claim 14, 16 or 25, characterized in that word segmentation is performed first when feature extraction is performed.
28. The target word classification and grading system according to claim 27, characterized in that the word segmentation method comprises: forward matching segmentation, reverse matching segmentation, bidirectional matching segmentation, segmentation based on a full segmentation word graph, maximum entropy Markov model segmentation, maximum entropy segmentation, or conditional random field segmentation.
CN2010105423714A 2010-11-12 2010-11-12 Marked word classifying and grading method and system Pending CN101980210A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010105423714A CN101980210A (en) 2010-11-12 2010-11-12 Marked word classifying and grading method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2010105423714A CN101980210A (en) 2010-11-12 2010-11-12 Marked word classifying and grading method and system

Publications (1)

Publication Number Publication Date
CN101980210A true CN101980210A (en) 2011-02-23

Family

ID=43600712

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010105423714A Pending CN101980210A (en) 2010-11-12 2010-11-12 Marked word classifying and grading method and system

Country Status (1)

Country Link
CN (1) CN101980210A (en)

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103123634B (en) * 2011-11-21 2016-04-27 北京百度网讯科技有限公司 A kind of copyright resource identification method and device
CN103123634A (en) * 2011-11-21 2013-05-29 北京百度网讯科技有限公司 Copyright resource identification method and copyright resource identification device
CN103136220A (en) * 2011-11-24 2013-06-05 北京百度网讯科技有限公司 Method of establishing term requirement classification model, term requirement classification method and device
CN103136220B (en) * 2011-11-24 2016-12-14 北京百度网讯科技有限公司 Set up the method for lexical item demand classification model, lexical item demand classification method and device
CN103164454A (en) * 2011-12-15 2013-06-19 百度在线网络技术(北京)有限公司 Keyword grouping method and keyword grouping system
CN103164454B (en) * 2011-12-15 2016-03-23 百度在线网络技术(北京)有限公司 Keyword group technology and system
CN103425677A (en) * 2012-05-18 2013-12-04 阿里巴巴集团控股有限公司 Method for determining classified models of keywords and method and device for classifying keywords
CN103425677B (en) * 2012-05-18 2016-08-24 阿里巴巴集团控股有限公司 Keyword classification model determines method, keyword classification method and device
WO2015124024A1 (en) * 2014-02-24 2015-08-27 北京奇虎科技有限公司 Method and device for promoting exposure rate of information, method and device for determining value of search word
CN105095210A (en) * 2014-04-22 2015-11-25 阿里巴巴集团控股有限公司 Method and apparatus for screening promotional keywords
CN104537118A (en) * 2015-01-26 2015-04-22 苏州大学 Microblog data processing method, device and system
CN104537118B (en) * 2015-01-26 2017-12-26 苏州大学 A kind of microblog data processing method, apparatus and system
CN106548186B (en) * 2015-09-16 2019-11-08 阿里巴巴集团控股有限公司 A kind of method and apparatus that sample yield is determined based on confidence level
CN106548186A (en) * 2015-09-16 2017-03-29 阿里巴巴集团控股有限公司 A kind of method and apparatus that sample yield is determined based on confidence level
CN107292342A (en) * 2017-06-21 2017-10-24 广东欧珀移动通信有限公司 Data processing method and related product
CN108647201A (en) * 2018-04-04 2018-10-12 卓望数码技术(深圳)有限公司 A kind of classifying identification method and system based on mobile application
CN110399479A (en) * 2018-04-20 2019-11-01 北京京东尚科信息技术有限公司 Search for data processing method, device, electronic equipment and computer-readable medium
CN110288007A (en) * 2019-06-05 2019-09-27 北京三快在线科技有限公司 The method, apparatus and electronic equipment of data mark
CN110288007B (en) * 2019-06-05 2021-02-02 北京三快在线科技有限公司 Data labeling method and device and electronic equipment
CN111950254A (en) * 2020-09-22 2020-11-17 北京百度网讯科技有限公司 Method, device and equipment for extracting word features of search sample and storage medium
CN111950254B (en) * 2020-09-22 2023-07-25 北京百度网讯科技有限公司 Word feature extraction method, device and equipment for searching samples and storage medium
CN115344757A (en) * 2022-02-07 2022-11-15 花瓣云科技有限公司 Label prediction method, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN101980211A (en) Machine learning model and establishing method thereof
CN101980210A (en) Marked word classifying and grading method and system
CN101620615B (en) Automatic image annotation and translation method based on decision tree learning
CN107766371A (en) A kind of text message sorting technique and its device
CN106776538A (en) The information extracting method of enterprise's noncanonical format document
CN104516986A (en) Statement identification method and device
CN107818105A (en) The recommendation method and server of application program
CN109165294A (en) Short text classification method based on Bayesian classification
CN105975478A (en) Word vector analysis-based online article belonging event detection method and device
CN104881458A (en) Labeling method and device for web page topics
CN107704512A (en) Financial product based on social data recommends method, electronic installation and medium
CN109657039B (en) Work history information extraction method based on double-layer BilSTM-CRF
CN107798351B (en) Deep learning neural network-based identity recognition method and system
CN108804577B (en) Method for estimating interest degree of information tag
CN111984790B (en) Entity relation extraction method
CN109492105A (en) A kind of text sentiment classification method based on multiple features integrated study
CN105095196A (en) Method and device for finding new word in text
CN110705272A (en) Named entity identification method for automobile engine fault diagnosis
CN109902284A (en) A kind of unsupervised argument extracting method excavated based on debate
CN103324632A (en) Concept identification method and device based on collaborative learning
CN103500216A (en) Method for extracting file information
CN103493067A (en) Method and apparatus for recognizing a character of a video
CN109783807A (en) A kind of user comment method for digging for APP software defect
CN106022389A (en) Related feedback method for actively selecting multi-instance multi-mark digital image
CN110163525A (en) Terminal recommended method and terminal recommender system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20110223