CN108959263A - Term weight calculation model training method and device - Google Patents

Term weight calculation model training method and device

Info

Publication number
CN108959263A
CN108959263A (application CN201810757233.4A)
Authority
CN
China
Prior art keywords
term
sequence
group
weight calculation
relative importance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810757233.4A
Other languages
Chinese (zh)
Other versions
CN108959263B (en)
Inventor
王亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd
Priority to CN201810757233.4A
Publication of CN108959263A
Application granted
Publication of CN108959263B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars

Abstract

This application discloses a term weight calculation model training method and device. The method splits each sample sentence in an acquired sample sentence set to obtain the term sequence corresponding to that sentence, where a term sequence contains at least one term produced by the split; determines the relative importance of each term in each term sequence; groups the terms in each term sequence according to their relative importance to obtain the annotation sequence corresponding to each term sequence, where an annotation sequence contains at least one term group produced by the grouping and a term group contains at least one term; and trains a preset term weight calculation model on the annotation sequences to obtain the values of the model parameters in the term weight calculation model. Because an annotation sequence is derived from the relative importance of the terms within a single term sequence, the annotation is more accurate, which improves the accuracy of the term weight calculation model.

Description

Term weight calculation model training method and device
Technical field
This application relates to the technical field of data processing, and more specifically to a term weight calculation model training method and device.
Background technique
Term weight calculation is an important natural language processing task, and its accuracy directly affects the performance of keyword extraction, tag extraction, search ranking, and similar applications. At present, term weights can be computed by a term weight calculation model, which can be obtained through supervised learning. Obtaining such a model requires labeling the sample sentences used to train it. The labeling process is as follows:
First, the range of term weights is divided into a number of levels. Then each term in the term sequence corresponding to a sample sentence is labeled with a weight level. For example, with 5 weight levels, the lowest weight level of a term is level1 and the highest is level5. The term weight calculation model is then trained on the labeled weight levels and the feature vectors of the terms.
In the above method, setting a fixed number of weight levels amounts to computing term weights by classification. A classification method, however, must determine an absolute importance level for each term, i.e., rank the term's importance against all sample sentences. The accuracy of such labels is low, so the trained term weight calculation model is inaccurate.
Summary of the invention
In view of this, this application provides a term weight calculation model training method and device to improve the accuracy of the term weight calculation model.
To achieve the above goal, the following scheme is proposed:
A term weight calculation model training method, the method comprising:
obtaining a sample sentence set;
splitting each sample sentence in the sample sentence set to obtain the term sequence corresponding to that sample sentence, the term sequence including at least one term obtained by splitting the sample sentence;
determining the relative importance of each term in each term sequence;
grouping the terms in each term sequence according to the relative importance of the terms in that sequence to obtain the annotation sequence corresponding to each term sequence, the annotation sequence including at least one term group obtained by grouping the terms in the term sequence, and a term group including at least one term; and
training a preset term weight calculation model on the annotation sequences to obtain the values of the model parameters in the term weight calculation model.
A term weight calculation model training device, the device comprising:
an acquiring unit for obtaining a sample sentence set;
a splitting unit for splitting each sample sentence in the sample sentence set to obtain the term sequence corresponding to that sample sentence, the term sequence including at least one term obtained by splitting the sample sentence;
a determination unit for determining the relative importance of each term in each term sequence;
a grouping unit for grouping the terms in each term sequence according to the relative importance of the terms in that sequence to obtain the annotation sequence corresponding to each term sequence, the annotation sequence including at least one term group obtained by grouping the terms in the term sequence, and a term group including at least one term; and
a training unit for training a preset term weight calculation model on the annotation sequences to obtain the values of the model parameters in the term weight calculation model.
As can be seen from the above technical scheme, in this application each sample sentence in the sample sentence set is split to obtain its corresponding term sequence, the term sequence including at least one term obtained by the split; the relative importance of each term in each term sequence is determined; and the terms in each term sequence are grouped according to their relative importance to obtain the annotation sequence corresponding to each term sequence, the annotation sequence including at least one term group obtained by the grouping, and a term group including at least one term. The annotation sequence is thus derived from the relative importance of each term within its own term sequence. Compared with the prior art, which directly labels the absolute weight level of each term, the present application no longer labels absolute weight levels; it determines the relative importance of each term within the same term sequence and builds the annotation sequence from it. Within a specific term sequence, the relative importance between different terms is comparatively stable, so the annotation data, i.e., the annotation sequences, obtained by this labeling scheme is more accurate, and the term weight calculation model trained on the annotation sequences is more accurate as well.
Detailed description of the invention
In order to explain the technical solutions in the embodiments of this application or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of this application; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a term weight calculation model training method disclosed in an embodiment of this application;
Fig. 2 is a flowchart of a term weight calculation model training method disclosed in another embodiment of this application;
Fig. 3 is a flowchart of training a term weight calculation model with a pair-wise algorithm, disclosed in an embodiment of this application;
Fig. 4 is a structural block diagram of a term weight calculation model training device disclosed in an embodiment of this application.
Specific embodiment
The technical solutions in the embodiments of this application are described clearly and completely below in conjunction with the drawings in the embodiments. The described embodiments are only a part of the embodiments of this application, not all of them. All other embodiments obtained by those of ordinary skill in the art from the embodiments in this application without creative effort fall within the protection scope of this application.
An embodiment of this application provides a term weight calculation model training method. As shown in Fig. 1, the method comprises:
S100: obtain a sample sentence set.
The sample sentences in the sample sentence set are search query sentences entered by users in a search engine, video search terms entered in a video player, video titles seen while browsing videos, or headlines seen while browsing news. For example, video titles such as "A university teacher's graduation speech goes viral" and "Explosion in Bangkok, Thailand: a dazzling fireball falls from the sky" can serve as sample sentences.
S101: split each sample sentence in the sample sentence set to obtain the term sequence corresponding to each sample sentence, the term sequence including at least one term obtained by splitting the sample sentence.
Specifically, each sample sentence is split into terms according to the punctuation marks, such as spaces, contained in the acquired sample sentence, so that the separated words or characters are obtained; and/or the character string is split into words or characters by a word segmentation program.
For example, splitting "Explosion in Bangkok, Thailand: a dazzling fireball falls from the sky" into terms yields the term sequence: "Thailand", "Bangkok", "occur", "explosion", "sky", "fall", "dazzling", "fireball".
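The splitting step S101 can be sketched as follows. This is an illustrative sketch, not code from the patent: it splits on whitespace and punctuation, standing in for both the punctuation-based splitting and the word-segmentation program mentioned above (real Chinese text would need an actual segmenter), and the English terms are glosses of the example sentence's terms.

```python
import re

def split_sentence(sentence):
    """Split a sample sentence into its term sequence.

    Splits on whitespace and common punctuation, matching the
    punctuation-based splitting the description mentions; a word
    segmentation program would replace this for unsegmented text.
    """
    terms = re.split(r"[\s,.:;!?]+", sentence)
    return [t for t in terms if t]  # drop empty strings at the edges

# English gloss of the example sentence's terms
print(split_sentence("Thailand Bangkok occur explosion, sky fall dazzling fireball"))
```

Each sample sentence in the set would be passed through such a function to obtain its term sequence.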
S102: determine the relative importance of each term in each term sequence.
Continuing with the example sentence, in the term sequence "Thailand", "Bangkok", "occur", "explosion", "sky", "fall", "dazzling", "fireball", annotators compare the terms according to objective criteria, e.g., the generally accepted rule that show titles, proper nouns, person names, and nouns are of higher importance. The relative importance of each term within the term sequence is determined from the relative importance the annotators assign to each term: the term of highest relative importance is "explosion", followed by "Thailand" and "Bangkok", with "fireball" in third place, while "occur", "sky", "fall", and "dazzling" have the lowest relative importance.
Specifically, annotators can mark the relative importance of each term with relative-importance labels such as 1, 2, 3, and so on. Alternatively, key terms are first selected from the terms and only their relative importance is labeled; for example, "Thailand", "Bangkok", "explosion", and "fireball" in the example sentence are taken as keywords for subsequent processing.
S103: group the terms in each term sequence according to the relative importance of the terms in that sequence to obtain the annotation sequence corresponding to each term sequence, the annotation sequence including at least one term group obtained by grouping the terms in the term sequence, and a term group including at least one term.
S104: train a preset term weight calculation model on the annotation sequences to obtain the values of the model parameters in the term weight calculation model.
Specifically, the term weight calculation model can be a linear model:
weight(q) = w0 + Σj (wj · φj(q)),
where q denotes a term, weight(q) denotes the term weight, w0 is the bias term, φj(q) is the j-th feature value of term q, and wj is the weight coefficient corresponding to the j-th feature value. w0 and the wj are the parameters of the term weight calculation model. The linear model is trained with a conventional learning-to-rank (LTR) algorithm.
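The linear model above can be written as a one-line function. The coefficient values below are made-up placeholders, not learned values from the patent; only the "explosion" feature vector comes from the worked example in this description.

```python
def term_weight(phi, w, w0):
    """weight(q) = w0 + sum_j w_j * phi_j(q), the linear model above.

    phi: feature vector of term q; w: per-feature coefficients wj; w0: bias.
    """
    return w0 + sum(wj * pj for wj, pj in zip(w, phi))

# "explosion" feature vector from the worked example in this description
phi_explosion = (0.4818, 0.3795, 0.6780, 0.3010, 0.8000)
w = (0.5, 0.2, 0.1, 0.0, 0.1)   # hypothetical coefficients for illustration
print(round(term_weight(phi_explosion, w, w0=0.0), 4))
```

Training (steps S104/S205 below) is what determines the actual values of w0 and the wj.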
In the above embodiment, each sample sentence in the sample sentence set is split to obtain its corresponding term sequence; the relative importance of each term in each term sequence is determined; and the terms in each term sequence are grouped by relative importance to obtain the annotation sequence corresponding to each term sequence. The annotation sequence is thus derived from the relative importance of each term within its own term sequence. Unlike the prior art, which directly labels the absolute weight level of each term, no absolute weight levels are labeled; instead, the relative importance of each term within the same term sequence is determined and the annotation sequence is built from it. Since the relative importance between different terms within a specific term sequence is comparatively stable, the annotation data obtained in this way is more accurate, and so is the term weight calculation model trained on it.
Moreover, the term weight computed by the above embodiment is a continuous floating-point value, such as 0.41. It is no longer constrained by a fixed set of importance levels, i.e., it need not be mapped onto a limited number of weight levels, which improves the precision of the term weight.
In another embodiment of this application, the term weight calculation model training method, as shown in Fig. 2, comprises:
S200: obtain a sample sentence set.
S201: split each sample sentence in the sample sentence set to obtain the term sequence corresponding to each sample sentence, the term sequence including at least one term obtained by splitting the sample sentence.
S202: determine the relative importance of each term in each term sequence.
S203: for any term in each term sequence, obtain from the term sequence, according to the relative importance of the terms in the sequence, the terms whose relative importance matches that of the term, and store the term and the obtained terms in the same term group.
Specifically, for each term sequence, the terms can be matched against each other based on their relative-importance labels within the sequence, and matching terms are stored in the same term group. After matching is complete, one or more term groups are formed, each containing at least one term. The relative-importance labels of the terms within one group match, where matching can be defined as the labels being identical, or as the labels differing by at most a certain amount, e.g., by 1.
Alternatively, after the relative importance of the terms has been labeled, terms of identical relative importance can be stored in the same term group directly through an input operation by the annotator.
S204: for all term groups in each term sequence, sort the groups by the relative importance of the terms they contain, and take the sorted sequence as the annotation sequence corresponding to the term sequence.
Specifically, all term groups in each term sequence can be sorted from high to low by the relative importance of the terms in each group. For the example sentence, the annotation sequence obtained after sorting is: [explosion] [Thailand, Bangkok] [fireball] [sky, fall, dazzling, occur].
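Steps S203 and S204 (group terms with matching relative-importance labels, then sort the groups) can be sketched as follows, assuming each term already carries an annotator-assigned rank (1 = highest). The ranks below encode the example sentence's labels; the order of terms inside a group is not significant.

```python
from collections import defaultdict

def annotation_sequence(labels):
    """Group terms with equal relative-importance ranks (S203), then sort
    the groups from most to least important (S204).

    labels: iterable of (term, rank) pairs, rank 1 being most important.
    """
    groups = defaultdict(list)
    for term, rank in labels:
        groups[rank].append(term)
    return [groups[r] for r in sorted(groups)]  # best rank first

# Ranks for the example sentence's terms (1 = highest importance)
labels = [("Thailand", 2), ("Bangkok", 2), ("occur", 4), ("explosion", 1),
          ("sky", 4), ("fall", 4), ("dazzling", 4), ("fireball", 3)]
print(annotation_sequence(labels))
# [['explosion'], ['Thailand', 'Bangkok'], ['fireball'], ['occur', 'sky', 'fall', 'dazzling']]
```

The "matching" here is label equality; the tolerance-based matching the embodiment also allows (e.g., labels differing by at most 1) would replace the exact-rank bucketing.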
S205: train a preset term weight calculation model on the annotation sequences to obtain the values of the model parameters in the term weight calculation model.
In the above embodiment, the relative importance of each term within the same term sequence is determined, the terms are grouped by relative importance, and the resulting term groups are sorted by the relative importance of the terms they contain. Since the relative importance between different terms in the same term sequence is comparatively stable, the resulting annotation sequence data is more accurate, and the term weight calculation model obtained by training on the sorted annotation sequences with an LTR algorithm is more accurate as well.
In the above embodiment, the term groups obtained after grouping can also be used directly as the annotation sequence, without sorting.
An embodiment of this application specifically discloses a method of training the term weight calculation model with the pair-wise algorithm of the LTR family. As shown in Fig. 3, the method comprises:
S300: generate term pairs from every two term groups in the annotation sequence, the two terms of each pair having different relative importance and being arranged in a predetermined order; and obtain the feature vector of each term of each term pair.
Specifically, the training method in this embodiment is illustrated with the annotation sequence obtained above: [explosion] [Thailand, Bangkok] [fireball] [sky, fall, dazzling, occur].
The term pairs generated from every two term groups in the annotation sequence are:
<explosion, Thailand> <explosion, Bangkok> <explosion, fireball> <explosion, sky> <explosion, fall> <explosion, dazzling> <explosion, occur>;
<Thailand, fireball> <Thailand, sky> <Thailand, fall> <Thailand, dazzling> <Thailand, occur>;
<Bangkok, fireball> <Bangkok, sky> <Bangkok, fall> <Bangkok, dazzling> <Bangkok, occur>;
<fireball, sky> <fireball, fall> <fireball, dazzling> <fireball, occur>.
The relative importance of the first term of each pair is greater than that of the second term, i.e., this ordering is the same as the ordering of the term groups in the annotation sequence.
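A sketch of the pair-generation part of step S300, reproducing the 21 pairs enumerated above; the first term of each pair comes from the more important group, following the groups' order in the annotation sequence.

```python
from itertools import combinations, product

def term_pairs(annotation_seq):
    """Generate ordered term pairs from every two groups of an annotation
    sequence (step S300). Groups appear most-important first, so the first
    term of each pair always outranks the second."""
    pairs = []
    for g1, g2 in combinations(annotation_seq, 2):  # every two groups, in order
        pairs.extend(product(g1, g2))               # cross terms of the two groups
    return pairs

seq = [["explosion"], ["Thailand", "Bangkok"], ["fireball"],
       ["sky", "fall", "dazzling", "occur"]]
pairs = term_pairs(seq)
print(len(pairs))   # 21 pairs, matching the enumeration above
print(pairs[0])     # ('explosion', 'Thailand')
```

No pairs are generated within a group, since the two terms of a pair must differ in relative importance.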
The feature vector of each term is then obtained. Specifically, the features can include lexical features of the term, such as part of speech; statistical features, such as tf-idf; and user behavior features, such as the number of times the term is clicked as a tag, i.e., term features derived from the search log of a search engine. The resulting term feature vectors are shown in Table 1 below:
where in-domain idf denotes the domain-specific inverse document frequency feature of the term, domain-free idf denotes the domain-independent inverse document frequency feature, log(#query) denotes the term feature derived from the search log, log(word length) denotes the length feature of the term, and pos denotes the part-of-speech feature of the term.
Taking the term pair <explosion, Thailand> as an example, the feature vector Φ1 of "explosion" is (0.4818, 0.3795, 0.6780, 0.3010, 0.8000), and the feature vector Φ2 of "Thailand" is (0.3621, 0.5101, 0.8130, 0.3010, 1.1000).
S301: generate a first training sample set and a second training sample set from the feature vectors of the terms of each term pair.
Specifically, for the term pair <explosion, Thailand>, the positive-sample feature vector Φpos = Φ1 - Φ2 = (0.1197, -0.1306, -0.135, 0.000, -0.3000) is generated, with sample output label 1, and the negative-sample feature vector Φneg = Φ2 - Φ1 = (-0.1197, 0.1306, 0.135, 0.000, 0.3000), with sample output label -1.
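The sample construction of step S301 can be sketched as follows, using the two feature vectors from the <explosion, Thailand> example:

```python
def pair_samples(phi1, phi2):
    """Build the two training samples for one term pair (step S301):
    positive sample phi1 - phi2 with label 1, since the first term
    outranks the second, and negative sample phi2 - phi1 with label -1."""
    diff = tuple(a - b for a, b in zip(phi1, phi2))
    neg = tuple(-d for d in diff)
    return (diff, 1), (neg, -1)

phi_explosion = (0.4818, 0.3795, 0.6780, 0.3010, 0.8000)
phi_thailand  = (0.3621, 0.5101, 0.8130, 0.3010, 1.1000)
(pos, pos_label), (neg, neg_label) = pair_samples(phi_explosion, phi_thailand)
print([round(v, 4) for v in pos])   # [0.1197, -0.1306, -0.135, 0.0, -0.3]
```

Applying this to every term pair of every annotation sequence yields the first (positive) and second (negative) training sample sets.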
S302: train the preset term weight calculation model on the first training sample set and the second training sample set to obtain the values of the model parameters in the term weight calculation model.
Specifically, the positive and negative sample sets generated from all term pairs of all annotation sequences are used to train the term weight calculation model weight(q) = w0 + Σj (wj · φj(q)), so as to determine the values of the model parameters w0 and wj.
Once the values of the model parameters are obtained, the term weight calculation model is complete. Whenever the term weights of a short text, such as a search query, a title, or a synopsis, need to be computed later, the feature vector of each term in the short text is obtained and substituted into the term weight calculation model to obtain the weight of each term.
The above embodiment performs model training with an SVM algorithm combined with the pair-wise algorithm, which achieves simple and effective training.
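A minimal sketch of pair-wise training on the difference-vector samples. The embodiment uses an SVM; to stay dependency-free, this sketch substitutes a plain perceptron, which likewise learns a linear scorer w such that w·(Φ1 - Φ2) > 0 whenever the first term outranks the second. It illustrates the training setup, not the patent's exact algorithm.

```python
def train_pairwise(samples, epochs=10, lr=1.0):
    """Learn linear coefficients w from (difference-vector, label) samples.

    Perceptron stand-in for the SVM of the embodiment: both fit a linear
    scorer that gives a positive score to positive (label 1) samples.
    """
    dim = len(samples[0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in samples:
            score = sum(wj * xj for wj, xj in zip(w, x))
            if y * score <= 0:  # misclassified pair: nudge w toward it
                w = [wj + lr * y * xj for wj, xj in zip(w, x)]
    return w

phi_explosion = (0.4818, 0.3795, 0.6780, 0.3010, 0.8000)
phi_thailand  = (0.3621, 0.5101, 0.8130, 0.3010, 1.1000)
diff = tuple(a - b for a, b in zip(phi_explosion, phi_thailand))
samples = [(diff, 1), (tuple(-d for d in diff), -1)]  # pos and neg sample
w = train_pairwise(samples)

score = lambda phi: sum(wj * pj for wj, pj in zip(w, phi))
print(score(phi_explosion) > score(phi_thailand))   # True: ranking recovered
```

In practice all positive and negative samples from all annotation sequences are pooled, and the learned w (plus a bias w0) becomes the parameter vector of the term weight calculation model.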
In other embodiments, the model can also be trained with other LTR algorithms, such as a list-wise algorithm. A list-wise algorithm obtains one term from each term group contained in the annotation sequence to form an ordered term sequence. For example, taking one term from each term group of the annotation sequence above, [explosion] [Thailand, Bangkok] [fireball] [sky, fall, dazzling, occur], yields 8 ordered term sequences: [explosion, Thailand, fireball, sky], [explosion, Bangkok, fireball, sky], [explosion, Thailand, fireball, fall], and so on. A list-wise algorithm such as ListNet or LambdaMART then fits the ordering of the terms in each ordered term sequence, and the term weight calculation model is trained on the fitted term orderings to obtain the values of its model parameters, which can improve the accuracy of the training result. This application does not specifically limit the LTR algorithm: any LTR algorithm that can perform model training on the annotation sequences proposed in this application and obtain the values of the model parameters falls within the protection scope of this application.
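The enumeration of ordered term sequences that a list-wise method trains on, one term per group, can be sketched as:

```python
from itertools import product

def ordered_term_sequences(annotation_seq):
    """Enumerate the ordered term sequences for list-wise training:
    one term from each group, keeping the groups' importance order."""
    return [list(combo) for combo in product(*annotation_seq)]

seq = [["explosion"], ["Thailand", "Bangkok"], ["fireball"],
       ["sky", "fall", "dazzling", "occur"]]
sequences = ordered_term_sequences(seq)
print(len(sequences))   # 1 * 2 * 1 * 4 = 8 ordered sequences
print(sequences[0])     # ['explosion', 'Thailand', 'fireball', 'sky']
```

The fitting of these orderings by ListNet, LambdaMART, or a similar algorithm is not shown here.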
An embodiment of this application also discloses a term weight calculation model training device. As shown in Fig. 4, the device comprises:
an acquiring unit 400 for obtaining a sample sentence set;
a splitting unit 401 for splitting each sample sentence in the sample sentence set to obtain the term sequence corresponding to each sample sentence, the term sequence including at least one term obtained by splitting the sample sentence;
a determination unit 402 for determining the relative importance of each term in each term sequence;
a grouping unit 403 for grouping the terms in each term sequence according to the relative importance of the terms in that sequence to obtain the annotation sequence corresponding to each term sequence, the annotation sequence including at least one term group obtained by grouping the terms in the term sequence, and a term group including at least one term; and
a training unit 404 for training a preset term weight calculation model on the annotation sequences to obtain the values of the model parameters in the term weight calculation model.
Preferably, the grouping unit comprises:
a matching subunit for, for any term in each term sequence, obtaining from the term sequence, according to the relative importance of the terms in the sequence, the terms whose relative importance matches that of the term, and storing the term and the obtained terms in the same term group; and
a first generating subunit for obtaining the annotation sequence corresponding to each term sequence from the term groups in that sequence, the annotation sequence including at least one term group obtained by grouping the terms in the term sequence, and a term group including at least one term.
Preferably, the first generating subunit comprises a sorting module for sorting all term groups in each term sequence by the relative importance of the terms they contain and taking the sorted sequence as the annotation sequence corresponding to the term sequence.
Preferably, the training unit comprises:
a second generating subunit for generating term pairs from every two term groups in the annotation sequence, the two terms of each pair having different relative importance and being arranged in a predetermined order, and obtaining the feature vector of each term of each term pair;
a third generating subunit for generating a first training sample set and a second training sample set from the feature vectors of the terms of each term pair; and
a training subunit for training the preset term weight calculation model on the first training sample set and the second training sample set to obtain the values of the model parameters in the term weight calculation model.
Finally, it should be noted that relational terms such as "first" and "second" are used herein only to distinguish one entity or operation from another and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", and any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. In the absence of further restrictions, an element limited by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or device that includes it.
Each embodiment in this specification is described in a progressive manner; each embodiment highlights its differences from the others, and identical or similar parts of the embodiments may be referred to each other.
The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use this application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein can be realized in other embodiments without departing from the spirit or scope of this application. Therefore, this application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (8)

1. A term weight calculation model training method, characterized in that the method comprises:
obtaining a sample sentence set;
splitting each sample sentence in the sample sentence set to obtain the term sequence corresponding to each sample sentence, the term sequence including at least one term obtained by splitting the sample sentence;
determining the relative importance of each term in each term sequence;
grouping the terms in each term sequence according to the relative importance of the terms in that sequence to obtain the annotation sequence corresponding to each term sequence, the annotation sequence including at least one term group obtained by grouping the terms in the term sequence, and a term group including at least one term; and
training a preset term weight calculation model on the annotation sequences to obtain the values of the model parameters in the term weight calculation model.
2. The method according to claim 1, characterized in that grouping the terms in each term sequence according to the relative importance of each term in that sequence to obtain the annotation sequence corresponding to each term sequence comprises:
for any term in each term sequence: obtaining from the term sequence, according to the relative importance of the terms in the sequence, the terms whose relative importance matches that of the term, and storing the term and the obtained terms in the same term group; and
obtaining the annotation sequence corresponding to each term sequence from the term groups in that sequence, the annotation sequence including at least one term group obtained by grouping the terms in the term sequence, and a term group including at least one term.
3. The method according to claim 2, characterized in that obtaining the annotation sequence corresponding to each term sequence from the term groups in that sequence comprises:
for all term groups in each term sequence: sorting the groups by the relative importance of the terms they contain, and taking the sorted sequence as the annotation sequence corresponding to the term sequence.
4. The method according to claim 1, characterized in that training a preset term weight calculation model on the annotation sequence corresponding to each term sequence to obtain the values of the model parameters in the term weight calculation model comprises:
generating term pairs from every two term groups in the annotation sequence, the two terms of each pair having different relative importance and being arranged in a predetermined order, and obtaining the feature vector of each term of each term pair;
generating a first training sample set and a second training sample set from the feature vectors of the terms of each term pair; and
training the term weight calculation model on the first training sample set and the second training sample set to obtain the values of the model parameters in the term weight calculation model.
5. An entry weight calculation model training apparatus, wherein the apparatus comprises:
an acquiring unit, configured to obtain a sample sentence set;
a splitting unit, configured to split each sample sentence in the sample sentence set to obtain the entry sequence corresponding to the sample sentence, wherein the entry sequence comprises at least one entry obtained after the sample sentence is split;
a determining unit, configured to determine the relative importance of each entry in each entry sequence;
a grouping unit, configured to group the entries in each entry sequence according to the relative importance of each entry in the entry sequence, to obtain the annotated sequence corresponding to the entry sequence, wherein the annotated sequence comprises at least one entry group obtained after the entries in the entry sequence are grouped, and each entry group comprises at least one entry;
a training unit, configured to train a preset entry weight calculation model according to each annotated sequence, to obtain the values of the model parameters in the entry weight calculation model.
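The units of claim 5 chain into a split → score → group → train pipeline. A toy end-to-end sketch follows; whitespace tokenization and a length-based importance score are stand-ins for the splitter and scorer the claims leave unspecified, and the trainer itself is stubbed out.

```python
def split_sentence(sentence):
    """Illustrative splitting unit: whitespace tokenization stands in
    for the patent's unspecified sentence-splitting step."""
    return sentence.split()

def run_pipeline(sample_sentences, importance_of):
    """Mirror the device units of claim 5: split each sample sentence,
    score each entry, and group consecutive equal-importance entries
    into the annotated sequences a trainer would consume."""
    annotated = []
    for sentence in sample_sentences:
        entries = split_sentence(sentence)
        scored = sorted(entries, key=importance_of, reverse=True)
        groups, prev = [], None
        for e in scored:
            if importance_of(e) != prev:  # new importance level -> new group
                groups.append([])
                prev = importance_of(e)
            groups[-1].append(e)
        annotated.append(groups)
    # A real training unit would now fit model parameters from `annotated`.
    return annotated
```

For example, `run_pipeline(["watch the movie"], len)` groups the two five-letter entries together ahead of "the".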
6. The apparatus according to claim 5, wherein the grouping unit comprises:
a matching subunit, configured to, for any entry in each entry sequence: obtain from the entry sequence, according to the relative importance of the entries in the entry sequence, the entries whose relative importance matches that of the entry, and store the entry and the obtained entries in the same entry group;
a first generating subunit, configured to obtain, according to the entry groups in each entry sequence, the annotated sequence corresponding to the entry sequence, wherein the annotated sequence comprises at least one entry group obtained after the entries in the entry sequence are grouped, and each entry group comprises at least one entry.
7. The apparatus according to claim 6, wherein the first generating subunit comprises a sorting module configured to, for all entry groups in each entry sequence: sort the entry groups according to the relative importance of the entries in each group, and take the sequence formed after sorting as the annotated sequence corresponding to the entry sequence.
8. The apparatus according to claim 5, wherein the training unit comprises:
a second generating subunit, configured to generate entry pairs based on every two entry groups in the annotated sequence, wherein the two entries of each entry pair differ in relative importance and are arranged in a predetermined order, and to obtain a feature vector for each entry of each entry pair;
a third generating subunit, configured to generate a first training sample set and a second training sample set according to the feature vectors of the entries in each entry pair;
a training subunit, configured to train the preset entry weight calculation model according to the first training sample set and the second training sample set, to obtain the values of the model parameters in the entry weight calculation model.
CN201810757233.4A 2018-07-11 2018-07-11 Entry weight calculation model training method and device Active CN108959263B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810757233.4A CN108959263B (en) 2018-07-11 2018-07-11 Entry weight calculation model training method and device


Publications (2)

Publication Number Publication Date
CN108959263A true CN108959263A (en) 2018-12-07
CN108959263B CN108959263B (en) 2022-06-03

Family

ID=64483601

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810757233.4A Active CN108959263B (en) 2018-07-11 2018-07-11 Entry weight calculation model training method and device

Country Status (1)

Country Link
CN (1) CN108959263B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110472665A (en) * 2019-07-17 2019-11-19 新华三大数据技术有限公司 Model training method, file classification method and relevant apparatus
CN113392651A (en) * 2020-11-09 2021-09-14 腾讯科技(深圳)有限公司 Training word weight model, and method, device, equipment and medium for extracting core words
CN113392651B (en) * 2020-11-09 2024-05-14 腾讯科技(深圳)有限公司 Method, device, equipment and medium for training word weight model and extracting core words

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103164454A (en) * 2011-12-15 2013-06-19 百度在线网络技术(北京)有限公司 Keyword grouping method and keyword grouping system
CN105589847A (en) * 2015-12-22 2016-05-18 北京奇虎科技有限公司 Weighted article identification method and device
CN107562717A (en) * 2017-07-24 2018-01-09 南京邮电大学 A kind of text key word abstracting method being combined based on Word2Vec with Term co-occurrence
CN107967256A (en) * 2017-11-14 2018-04-27 北京拉勾科技有限公司 Term weighing prediction model generation method, position recommend method and computing device


Also Published As

Publication number Publication date
CN108959263B (en) 2022-06-03

Similar Documents

Publication Publication Date Title
Ravichandran et al. Learning surface text patterns for a question answering system
Hoffart et al. KORE: keyphrase overlap relatedness for entity disambiguation
CN108763321B (en) Related entity recommendation method based on large-scale related entity network
GB2568118A (en) Large-scale image tagging using image-to-topic embedding
US10552467B2 (en) System and method for language sensitive contextual searching
Risvik et al. Query Segmentation for Web Search.
US20130036076A1 (en) Method for keyword extraction
US8583415B2 (en) Phonetic search using normalized string
CN106557476A (en) The acquisition methods and device of relevant information
CN112612875A (en) Method, device and equipment for automatically expanding query words and storage medium
Xu et al. Exploring similarity between academic paper and patent based on Latent Semantic Analysis and Vector Space Model
Mali Spam Detection Using Bayesian with Pattern Discovery
CN108959263A (en) A kind of entry weight calculation model training method and device
Muthukrishnan et al. Simultaneous similarity learning and feature-weight learning for document clustering
KR101351555B1 (en) classification-extraction system based meaning for text-mining of large data.
US10409861B2 (en) Method for fast retrieval of phonetically similar words and search engine system therefor
Jain et al. How do they compare? automatic identification of comparable entities on the Web
JP2006139484A (en) Information retrieval method, system therefor and computer program
Hu et al. Scope-aware code completion with discriminative modeling
Dua et al. Query completion without query logs for song search
Ramachandran et al. Document Clustering Using Keyword Extraction
Nikolić et al. Modelling the System of Receiving Quick Answers for e-Government Services: Study for the Crime Domain in the Republic of Serbia
Barresi et al. A concept based indexing approach for document clustering
Wenzlitschke et al. Using BERT to retrieve relevant and argumentative sentence pairs.
Deveaud et al. Lia at inex 2010 book track

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant