CN108959263B - Entry weight calculation model training method and device

Info

Publication number
CN108959263B
Authority
CN (China)
Prior art keywords
entry, sequence, training, relative importance, weight calculation
Legal status
Active
Application number
CN201810757233.4A
Other languages
Chinese (zh)
Other versions
CN108959263A (en)
Inventor
王亮
Current Assignee
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date
2018-07-11
Filing date
2018-07-11
Publication date
2022-06-03
Application filed by Beijing QIYI Century Science and Technology Co Ltd
Priority to CN201810757233.4A
Publication of CN108959263A
Application granted
Publication of CN108959263B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars


Abstract

The application discloses a method and device for training an entry weight calculation model. The method splits each sample sentence in an obtained sample sentence set to obtain an entry sequence corresponding to each sample sentence, the entry sequence comprising at least one entry obtained by splitting the sample sentence; determines the relative importance degree of each entry in each entry sequence; groups the entries in each entry sequence according to the relative importance degree of each entry in that sequence to obtain a tagging sequence corresponding to each entry sequence, the tagging sequence comprising at least one entry group obtained after the entries in the entry sequence are grouped, and each entry group comprising at least one entry; and trains a preset entry weight calculation model according to each tagging sequence to obtain the values of the model parameters in the entry weight calculation model. Because the tagging sequence is based on the relative importance degree of each entry within its own entry sequence, it is more accurate, which improves the accuracy of the entry weight calculation model.

Description

Entry weight calculation model training method and device
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method and an apparatus for training an entry weight calculation model.
Background
Entry weight calculation is an important natural language processing task, and its accuracy directly affects the performance of keyword extraction, tag extraction, search ranking, and similar applications. Entry weights can be obtained from an entry weight calculation model. Such a model is currently obtained by supervised learning, and obtaining it requires labeling the sample sentences used for training. The labeling process is as follows:
First, entry weights are divided into several levels, i.e., the number of weight levels is fixed; then the weight level of each entry in the entry sequence corresponding to each sample sentence is labeled. For example, entries may be labeled on 5 weight levels, where the lowest weight level of an entry is level1 and the highest is level5. An entry weight calculation model is then trained based on the labeled weight levels and the feature vectors of the entries.
In this method, setting a fixed number of weight levels amounts to calculating entry weights by classification. A classification method, however, determines the absolute importance level of an entry, i.e., the importance of the entry across all sample sentences. The resulting labels have low accuracy, so the trained entry weight calculation model is inaccurate.
Disclosure of Invention
In view of this, the present application provides a method and an apparatus for training an entry weight calculation model to improve the accuracy of the entry weight calculation model.
In order to achieve the above object, the following solutions are proposed:
a method for training an entry weight calculation model, the method comprising:
acquiring a sample sentence set;
splitting each sample sentence in the sample sentence set to obtain an entry sequence corresponding to each sample sentence, wherein the entry sequence comprises at least one entry obtained after the sample sentence is split;
determining the relative importance degree of each entry in each entry sequence;
grouping the entries in each entry sequence according to the relative importance degree of each entry in each entry sequence to obtain a tagging sequence corresponding to each entry sequence, wherein the tagging sequence comprises at least one entry group obtained after the entries in the entry sequence are grouped, and each entry group comprises at least one entry;
and training a preset entry weight calculation model according to each tagging sequence to obtain values of the model parameters in the entry weight calculation model.
An apparatus for training an entry weight calculation model, the apparatus comprising:
an obtaining unit, configured to obtain a sample sentence set;
a splitting unit, configured to split each sample sentence in the sample sentence set to obtain an entry sequence corresponding to each sample sentence, wherein the entry sequence comprises at least one entry obtained after the sample sentence is split;
a determining unit, configured to determine the relative importance degree of each entry in each entry sequence;
a grouping unit, configured to group the entries in each entry sequence according to the relative importance degree of each entry in each entry sequence to obtain a tagging sequence corresponding to each entry sequence, wherein the tagging sequence comprises at least one entry group obtained after the entries in the entry sequence are grouped, and each entry group comprises at least one entry;
and a training unit, configured to train a preset entry weight calculation model according to each tagging sequence to obtain values of the model parameters in the entry weight calculation model.
According to the technical solutions above, each sample sentence in the sample sentence set is split to obtain an entry sequence corresponding to each sample sentence, the entry sequence comprising at least one entry obtained after the sample sentence is split. The relative importance degree of each entry in each entry sequence is determined, and the entries in each entry sequence are grouped according to these relative importance degrees to obtain a tagging sequence corresponding to each entry sequence, the tagging sequence comprising at least one entry group, and each entry group comprising at least one entry. The tagging sequence is thus obtained from the relative importance degree of each entry within its own entry sequence. Compared with directly labeling the absolute weight level of an entry as in the prior art, absolute weight levels are no longer labeled; instead, the relative importance degrees of the entries within the same entry sequence are determined and the tagging sequence is derived from them, which yields more accurate labels and therefore a more accurate entry weight calculation model.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a flowchart of an entry weight calculation model training method disclosed in an embodiment of the present application;
Fig. 2 is a flowchart of an entry weight calculation model training method disclosed in another embodiment of the present application;
Fig. 3 is a flowchart of a method for training an entry weight calculation model based on the pair-wise algorithm, disclosed in an embodiment of the present application;
Fig. 4 is a structural block diagram of an entry weight calculation model training apparatus disclosed in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
An embodiment of the present application provides a method for training an entry weight calculation model. As shown in Fig. 1, the method includes:
S100, acquiring a sample sentence set;
The sample sentences in the sample sentence set may be users' search query sentences in a search engine, video search terms in a video player, or video titles and news titles presented during video or news browsing. For example, a sample sentence may be the video title "An explosion occurred at place B in country A, a dazzling fireball fell from the sky".
S101, splitting each sample sentence in the sample sentence set to obtain an entry sequence corresponding to each sample sentence, wherein the entry sequence comprises at least one entry obtained after the sample sentence is split;
Each sample sentence is split into entries. Specifically, words or characters may be separated according to spaces, punctuation marks, and similar delimiters contained in the sample sentence; and/or the character string may be split with a word segmentation program to obtain words or characters.
Splitting "An explosion occurred at place B in country A, a dazzling fireball fell from the sky" into entries yields the entry sequence: "country A", "place B", "occurred", "explosion", "sky", "fell", "dazzling", "fireball".
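As an illustration of this splitting step (a minimal sketch, not part of the claimed method; the delimiter rule and the English stand-in sentence are assumptions, and Chinese text would normally go through a dedicated word segmentation tool rather than a regex):

```python
import re

def split_into_entries(sentence: str) -> list[str]:
    """Split a sample sentence into entries on spaces and punctuation.

    A stand-in for the word segmentation program mentioned above.
    """
    # Split on whitespace and common ASCII/CJK punctuation, drop empty pieces.
    return [tok for tok in re.split(r"[\s,.;:!?，。；：！？]+", sentence) if tok]

print(split_into_entries(
    "country-A place-B occurred explosion, sky fell dazzling fireball"))
# -> ['country-A', 'place-B', 'occurred', 'explosion', 'sky', 'fell',
#     'dazzling', 'fireball']
```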
S102, determining the relative importance degree of each entry in each entry sequence;
Continuing with the example sentence: within the entry sequence "country A", "place B", "occurred", "explosion", "sky", "fell", "dazzling", "fireball", annotators compare the entries against objective criteria, for example the common criteria that drama titles, proper nouns, person names, and nouns are of high importance. The relative importance degree of each entry in the entry sequence is determined from the relative importance degrees input by the annotators: the entry with the highest relative importance degree is "explosion", followed by "country A" and "place B", then "fireball", while "occurred", "sky", "fell", and "dazzling" have the lowest relative importance degree.
Specifically, the annotators may mark the relative importance degree of each entry with identifiers such as 1, 2, 3, and so on. Alternatively, keywords may first be determined among the entries and only the relative importance degrees of the keywords labeled; for example, "country A", "place B", "explosion", and "fireball" in the example sentence may be used as keywords for subsequent processing.
S103, grouping the entries in each entry sequence according to the relative importance degree of each entry in each entry sequence to obtain a tagging sequence corresponding to each entry sequence, wherein the tagging sequence comprises at least one entry group obtained after the entries in the entry sequence are grouped, and each entry group comprises at least one entry;
and S104, training a preset entry weight calculation model according to each tagging sequence to obtain values of model parameters in the entry weight calculation model.
Specifically, the entry weight calculation model may be a linear model:

f(t) = b + Σ_i w_i · x_i(t),

where t denotes an entry, f(t) denotes the weight of the entry, b is an offset term, x_i(t) is the i-th feature value of entry t, and w_i is the weight coefficient corresponding to the i-th feature value; b and the coefficients w_i are the parameters of the entry weight calculation model. The linear model is trained using a conventional Learning To Rank (LTR) method.
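For concreteness, the linear model can be written out in a few lines of code (an illustrative sketch only; the class name and the example coefficient values are assumptions, not part of the application):

```python
from dataclasses import dataclass

@dataclass
class LinearEntryWeightModel:
    """The linear model f(t) = b + sum_i w_i * x_i(t) described above."""
    weights: list[float]  # w_i, one coefficient per feature
    bias: float           # b, the offset term

    def score(self, features: list[float]) -> float:
        # features is the feature vector x(t) of a single entry
        return self.bias + sum(w * x for w, x in zip(self.weights, features))

# Once b and the w_i have been learned, an entry's weight is a continuous float:
model = LinearEntryWeightModel(weights=[0.9, -0.2, 0.5, 0.1, 0.3], bias=0.05)
print(model.score([0.4818, 0.3795, 0.6780, 0.3010, 0.8000]))
```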
In this embodiment, each sample sentence in the sample sentence set is split to obtain an entry sequence corresponding to each sample sentence; the relative importance degree of each entry in each entry sequence is determined, and the entries in each entry sequence are grouped according to these relative importance degrees to obtain a tagging sequence corresponding to each entry sequence. The tagging sequence is thus obtained from the relative importance degrees of the entries within the same entry sequence rather than from directly labeled absolute weight levels as in the prior art, so the labels are more accurate and the trained entry weight calculation model is more accurate.
Moreover, the entry weights calculated in this embodiment are continuous floating-point values, such as 0.41, and are not constrained by importance levels, i.e., they are not mapped onto a limited set of weight levels, which improves the precision of the entry weight values.
In another embodiment of the present application, a method for training an entry weight calculation model, as shown in Fig. 2, includes:
S200, acquiring a sample sentence set;
S201, splitting each sample sentence in the sample sentence set to obtain an entry sequence corresponding to each sample sentence, wherein the entry sequence comprises at least one entry obtained after the sample sentence is split;
S202, determining the relative importance degree of each entry in each entry sequence;
S203, for any entry in each entry sequence: acquiring, from the entry sequence, the entries whose relative importance degree matches that of the entry, according to the relative importance degree of the entry in the entry sequence, and storing the entry and the acquired entries in the same entry group;
Specifically, for each entry sequence, the entries may be matched based on their relative importance degree identifiers, and matched entries are stored in the same entry group. After matching is completed, one or more entry groups are formed; each entry group contains at least one entry, and the relative importance degree identifiers of the entries within a group match one another. "Matching" may be defined as having identical relative importance degree identifiers, or as identifiers differing by no more than a set amount, for example a difference of 1.
Alternatively, after the relative importance degrees of the entries have been labeled, the entries with the same relative importance degree may be stored directly in the same entry group through an input operation of the annotator.
S204, for all the entry groups in each entry sequence: sorting the groups according to the relative importance degree of the entries they contain, and taking the sorted sequence as the tagging sequence corresponding to the entry sequence;
specifically, all the entry groups in each entry sequence may be sorted in order from top to bottom according to the relative importance of the entries in the entry groups. For the example sentence, the sequence of the labels obtained after sorting is as follows: [ detonation ] [ Country A, Country B ] [ Mare ] [ Tian, descent, dazzling, occurrence ].
S205, training a preset entry weight calculation model according to each tagging sequence to obtain values of model parameters in the entry weight calculation model.
In the above embodiment, the relative importance degree of each entry within the same entry sequence is determined, the entries are then grouped based on their importance, and the resulting entry groups are sorted according to the relative importance degree of the entries in each group to form the tagging sequence.
In the above embodiment, the entry groups obtained by grouping may also be used directly as the tagging sequence.
An embodiment of the present application further discloses a method for training the entry weight calculation model based on the pair-wise algorithm, one of the LTR algorithms. As shown in Fig. 3, the method includes:
s300, generating an entry pair based on every two entry groups in the tagging sequence, wherein the two entries in the entry pair have different relative importance degrees and are arranged according to a preset sequence; acquiring a feature vector of each entry in each entry pair;
specifically, the obtained labeling sequence is continuously utilized: [ detonation ] [ nation a, B ] [ fireball ] [ sky, dode, flare, happen ], description of the training method in this example is made:
based on every two vocabulary entry groups in the tagging sequence, generating vocabulary entry pairs as follows:
<explosion, country A> <explosion, place B> <explosion, fireball> <explosion, sky> <explosion, fell> <explosion, dazzling> <explosion, occurred>;
<country A, fireball> <country A, sky> <country A, fell> <country A, dazzling> <country A, occurred>;
<place B, fireball> <place B, sky> <place B, fell> <place B, dazzling> <place B, occurred>;
<fireball, sky> <fireball, fell> <fireball, dazzling> <fireball, occurred>.
In each entry pair, the relative importance degree of the first entry is greater than that of the second, i.e., the order within a pair follows the order of the entry groups in the tagging sequence.
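The pair generation can be sketched as follows (illustrative only; the function name is an assumption):

```python
from itertools import combinations, product

def generate_entry_pairs(tagging_sequence):
    """For every two entry groups, pair each entry of the more important
    group (listed first) with each entry of the less important group."""
    pairs = []
    for high_group, low_group in combinations(tagging_sequence, 2):
        pairs.extend(product(high_group, low_group))
    return pairs

tagging_sequence = [["explosion"], ["country A", "place B"], ["fireball"],
                    ["sky", "fell", "dazzling", "occurred"]]
pairs = generate_entry_pairs(tagging_sequence)
print(len(pairs))   # 21 pairs, as enumerated above
print(pairs[:2])    # [('explosion', 'country A'), ('explosion', 'place B')]
```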
A feature vector is acquired for each entry. Specifically, the features may include lexical features of the entry, such as its part of speech; statistical features of the entry, such as tf-idf; and user behavior features of the entry, such as the number of times the entry is clicked as a tag, or entry features obtained from the search logs of a search engine. The feature vectors of the entries obtained in the example are shown in Table 1 below:
Table 1

Entry       domain idf   domain-free idf   log(#query)   log(word length)   pos
explosion   0.4818       0.3795            0.6780        0.3010             0.8000
country A   0.3621       0.5101            0.8130        0.3010             1.1000

Here, domain idf denotes the domain-dependent inverse document frequency feature of the entry, domain-free idf denotes its domain-independent inverse document frequency feature, log(#query) denotes an entry feature obtained from the search logs, log(word length) denotes the length feature of the entry, and pos denotes the part-of-speech feature of the entry.
Taking the entry pair <explosion, country A> as an example, the feature vector Φ1 of "explosion" is (0.4818, 0.3795, 0.6780, 0.3010, 0.8000), and the feature vector Φ2 of "country A" is (0.3621, 0.5101, 0.8130, 0.3010, 1.1000).
S301, generating a first training sample set and a second training sample set according to the feature vector of each entry in each entry pair;
specifically, the entry pair is generated<Explosion, nation A>Feature vector of the corresponding positive sample: phi (phi) ofIs just for12= (0.1197, -0.1306, -0.135, 0.000, -0.3000), corresponding to a sample output label of 1; eigenvector Φ of negative examplesNegative pole21= (-0.1197, 0.1306, 0.135, 0.000, 0.3000), corresponding to a sample output flag label of-1.
S302, training a preset entry weight calculation model according to the first training sample set and the second training sample set to obtain values of model parameters in the entry weight calculation model.
Specifically, the entry weight calculation model

f(t) = b + Σ_i w_i · x_i(t)

is trained on the positive sample set and the negative sample set corresponding to all the entry pairs generated from all the tagging sequences, so as to determine the values of the model parameters b and w_i.
After the values of the model parameters are obtained, the entry weight calculation model is complete. Subsequently, whenever the entry weights in a short text such as a search sentence, a title, or an introduction need to be calculated, the feature vector of each entry in the short text is obtained and substituted into the entry weight calculation model to obtain the weight of each entry.
In this embodiment, model training is performed by combining the SVM algorithm with the pair-wise algorithm, which makes the training simple and effective.
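One possible realization of this SVM plus pair-wise training (a sketch only; the patent does not prescribe an implementation, and scikit-learn as well as the no-intercept choice are assumptions on my part) fits a linear SVM on the difference vectors, whose coefficients then serve as the feature weights w_i:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Difference vectors and labels from all entry pairs of all tagging sequences
# (only the single <explosion, country A> pair is shown here for brevity).
X = np.array([[ 0.1197, -0.1306, -0.1350, 0.0000, -0.3000],   # positive sample
              [-0.1197,  0.1306,  0.1350, 0.0000,  0.3000]])  # negative sample
y = np.array([1, -1])

# fit_intercept=False: the +/- difference samples are symmetric about the
# origin, so no offset is needed and coef_ gives the weight coefficients w_i.
svm = LinearSVC(fit_intercept=False).fit(X, y)
w = svm.coef_[0]

# Scoring a new entry with the learned weights, f(t) = w . x(t)
# (phi_fireball is a hypothetical feature vector, not from the patent):
phi_fireball = np.array([0.40, 0.45, 0.70, 0.30, 0.90])
print(float(w @ phi_fireball))
```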
In other embodiments, other LTR algorithms may also be used to train the model, such as a list-wise algorithm. Specifically, one entry is taken from each entry group in the tagging sequence to form an ordered entry sequence. For example, taking one entry from each group of the tagging sequence above, [explosion] [country A, place B] [fireball] [sky, fell, dazzling, occurred], yields 8 ordered entry sequences, such as [explosion, country A, fireball, sky], [explosion, place B, fireball, sky], [explosion, country A, fireball, fell], and so on. The order of the entries in each ordered entry sequence is then fitted with a list-wise algorithm such as ListNet or LambdaMART, and the entry weight calculation model is trained on the fitted orders to obtain the values of its model parameters. The LTR algorithm is not specifically limited here; any LTR algorithm that can train the model on the tagging sequences provided by the present application so as to obtain the values of the model parameters falls within the scope of the present application.
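Enumerating the ordered entry sequences is straightforward (a sketch; the list-wise fitting itself, e.g. with ListNet or LambdaMART, is not shown):

```python
from itertools import product

tagging_sequence = [["explosion"], ["country A", "place B"], ["fireball"],
                    ["sky", "fell", "dazzling", "occurred"]]

# One entry per group, in group order: 1 * 2 * 1 * 4 = 8 ordered sequences.
ordered_sequences = [list(seq) for seq in product(*tagging_sequence)]
print(len(ordered_sequences))   # 8
print(ordered_sequences[0])     # ['explosion', 'country A', 'fireball', 'sky']
```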
An embodiment of the present application further discloses an apparatus for training the entry weight calculation model. As shown in Fig. 4, the apparatus includes:
an obtaining unit 400, configured to obtain a sample sentence set;
a splitting unit 401, configured to split each sample sentence in the sample sentence set to obtain an entry sequence corresponding to each sample sentence, where the entry sequence includes at least one entry obtained after the sample sentence is split;
a determining unit 402, configured to determine the relative importance degree of each entry in each entry sequence;
a grouping unit 403, configured to group the entries in each entry sequence according to the relative importance degree of each entry in each entry sequence to obtain a tagging sequence corresponding to each entry sequence, where the tagging sequence includes at least one entry group obtained after the entries in the entry sequence are grouped, and each entry group includes at least one entry;
and a training unit 404, configured to train a preset entry weight calculation model according to each tagging sequence, so as to obtain values of the model parameters in the entry weight calculation model.
Preferably, the grouping unit includes:
a matching subunit, configured to, for any entry in each entry sequence: acquire, from the entry sequence, the entries whose relative importance degree matches that of the entry, and store the entry and the acquired entries in the same entry group;
and a first generating subunit, configured to obtain, according to the entry groups in each entry sequence, a tagging sequence corresponding to each entry sequence, where the tagging sequence includes at least one entry group obtained after the entries in the entry sequence are grouped, and each entry group includes at least one entry.
Preferably, the first generating subunit includes a sorting module, configured to, for all entry groups in each entry sequence: sort the groups according to the relative importance degree of the entries they contain, and take the sorted sequence as the tagging sequence corresponding to the entry sequence.
Preferably, the training unit includes:
a second generating subunit, configured to generate entry pairs based on every two entry groups in the tagging sequence, where the two entries in an entry pair have different relative importance degrees and are arranged in a preset order, and to acquire a feature vector of each entry in each entry pair;
a third generating subunit, configured to generate a first training sample set and a second training sample set according to the feature vector of each entry in each entry pair;
and a training subunit, configured to train a preset entry weight calculation model according to the first training sample set and the second training sample set to obtain values of the model parameters in the entry weight calculation model.
Finally, it should also be noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between those entities or actions. Also, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.

Claims (6)

1. A method for training an entry weight calculation model, the method comprising:
acquiring a sample sentence set;
splitting each sample sentence in the sample sentence set to obtain an entry sequence corresponding to each sample sentence, wherein the entry sequence comprises at least one entry obtained after the sample sentence is split;
determining the relative importance degree of each entry in each entry sequence;
grouping the entries in each entry sequence according to the relative importance degree of each entry in each entry sequence to obtain a tagging sequence corresponding to each entry sequence, wherein the tagging sequence comprises at least one entry group obtained after the entries in the entry sequence are grouped, and each entry group comprises at least one entry;
and training a preset entry weight calculation model according to each tagging sequence to obtain values of the model parameters in the entry weight calculation model;
wherein the grouping of the entries in each entry sequence according to the relative importance degree of each entry in each entry sequence to obtain the tagging sequence corresponding to each entry sequence comprises:
for any entry in each entry sequence: acquiring, from the entry sequence, the entries whose relative importance degree matches that of the entry, according to the relative importance degree of the entry in the entry sequence, and storing the entry and the acquired entries in the same entry group;
and obtaining the tagging sequence corresponding to each entry sequence according to the entry groups in each entry sequence, wherein the tagging sequence comprises at least one entry group obtained after the entries in the entry sequence are grouped, and each entry group comprises at least one entry.
2. The method of claim 1, wherein obtaining the tagging sequence corresponding to each entry sequence according to the entry groups in each entry sequence comprises:
for all entry groups in each entry sequence: sorting the groups according to the relative importance degree of the entries they contain, and taking the sorted sequence as the tagging sequence corresponding to the entry sequence.
3. The method of claim 1, wherein training a preset entry weight calculation model according to the tagging sequence corresponding to each entry sequence comprises:
generating entry pairs based on every two entry groups in the tagging sequence, wherein the two entries in an entry pair have different relative importance degrees and are arranged in a preset order; and acquiring a feature vector of each entry in each entry pair;
generating a first training sample set and a second training sample set according to the feature vector of each entry in each entry pair;
and training the entry weight calculation model according to the first training sample set and the second training sample set to obtain values of the model parameters in the entry weight calculation model.
4. An apparatus for training an entry weight calculation model, the apparatus comprising:
an obtaining unit, configured to obtain a sample sentence set;
a splitting unit, configured to split each sample sentence in the sample sentence set to obtain an entry sequence corresponding to each sample sentence, wherein the entry sequence comprises at least one entry obtained after the sample sentence is split;
a determining unit, configured to determine the relative importance degree of each entry in each entry sequence;
a grouping unit, configured to group the entries in each entry sequence according to the relative importance degree of each entry in each entry sequence to obtain a tagging sequence corresponding to each entry sequence, wherein the tagging sequence comprises at least one entry group obtained after the entries in the entry sequence are grouped, and each entry group comprises at least one entry;
and a training unit, configured to train a preset entry weight calculation model according to each tagging sequence to obtain values of the model parameters in the entry weight calculation model;
wherein the grouping unit comprises:
a matching subunit, configured to, for any entry in each entry sequence: acquire, from the entry sequence, the entries whose relative importance degree matches that of the entry, and store the entry and the acquired entries in the same entry group;
and a first generating subunit, configured to obtain, according to the entry groups in each entry sequence, a tagging sequence corresponding to each entry sequence, wherein the tagging sequence comprises at least one entry group obtained after the entries in the entry sequence are grouped, and each entry group comprises at least one entry.
5. The apparatus of claim 4, wherein the first generating subunit comprises a sorting module, configured to, for all entry groups in each entry sequence: sort the groups according to the relative importance degree of the entries they contain, and take the sorted sequence as the tagging sequence corresponding to the entry sequence.
6. The apparatus of claim 4, wherein the training unit comprises:
a second generating subunit, configured to generate entry pairs based on every two entry groups in the tagging sequence, wherein the two entries in an entry pair have different relative importance degrees and are arranged in a preset order, and to acquire a feature vector of each entry in each entry pair;
a third generating subunit, configured to generate a first training sample set and a second training sample set according to the feature vector of each entry in each entry pair;
and a training subunit, configured to train a preset entry weight calculation model according to the first training sample set and the second training sample set to obtain values of the model parameters in the entry weight calculation model.
CN201810757233.4A 2018-07-11 2018-07-11 Entry weight calculation model training method and device Active CN108959263B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810757233.4A CN108959263B (en) 2018-07-11 2018-07-11 Entry weight calculation model training method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810757233.4A CN108959263B (en) 2018-07-11 2018-07-11 Entry weight calculation model training method and device

Publications (2)

Publication Number Publication Date
CN108959263A CN108959263A (en) 2018-12-07
CN108959263B (en) 2022-06-03

Family

ID=64483601

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810757233.4A Active CN108959263B (en) 2018-07-11 2018-07-11 Entry weight calculation model training method and device

Country Status (1)

Country Link
CN (1) CN108959263B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110472665A (en) * 2019-07-17 2019-11-19 新华三大数据技术有限公司 Model training method, file classification method and relevant apparatus
CN113392651B (en) * 2020-11-09 2024-05-14 腾讯科技(深圳)有限公司 Method, device, equipment and medium for training word weight model and extracting core words

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103164454A (en) * 2011-12-15 2013-06-19 百度在线网络技术(北京)有限公司 Keyword grouping method and keyword grouping system
CN105589847A (en) * 2015-12-22 2016-05-18 北京奇虎科技有限公司 Weighted article identification method and device
CN107562717A (en) * 2017-07-24 2018-01-09 南京邮电大学 A kind of text key word abstracting method being combined based on Word2Vec with Term co-occurrence
CN107967256A (en) * 2017-11-14 2018-04-27 北京拉勾科技有限公司 Term weighing prediction model generation method, position recommend method and computing device


Also Published As

Publication number Publication date
CN108959263A (en) 2018-12-07


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant