CN108959263B - Entry weight calculation model training method and device

Info

Publication number
CN108959263B
Authority
CN (China)
Prior art keywords
entry, sequence, training, relative importance, weight calculation
Legal status
Active
Application number
CN201810757233.4A
Other languages
Chinese (zh)
Other versions
CN108959263A (en)
Inventor
王亮
Current Assignee
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date
2018-07-11
Filing date
2018-07-11
Publication date
2022-06-03
Application filed by Beijing QIYI Century Science and Technology Co Ltd
Priority to CN201810757233.4A
Publication of CN108959263A
Application granted
Publication of CN108959263B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars


Abstract

The application discloses a method and device for training an entry weight calculation model. The method splits each sample sentence in an obtained sample sentence set to obtain an entry sequence corresponding to each sample sentence, the entry sequence comprising at least one entry obtained by splitting the sample sentence; determines the relative importance degree of each entry in each entry sequence; groups the entries in each entry sequence according to the relative importance degree of each entry in that sequence to obtain a tagging sequence corresponding to each entry sequence, the tagging sequence comprising at least one entry group obtained after the entries in the entry sequence are grouped, and each entry group comprising at least one entry; and trains a preset entry weight calculation model according to each tagging sequence to obtain the values of the model parameters in the entry weight calculation model. Because the tagging sequence is based on the relative importance degree of each entry within its own entry sequence, it is more accurate, which improves the accuracy of the entry weight calculation model.

Description

Entry weight calculation model training method and device
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method and an apparatus for training an entry weight calculation model.
Background
Entry weight calculation is an important natural language processing task, and its accuracy directly affects the performance of keyword extraction, tag extraction, search ranking, and similar applications. Entry weights can be obtained from an entry weight calculation model. Such a model is currently obtained by supervised learning, and obtaining it requires labeling the sample sentences used for training. The labeling process is as follows:
First, entry weights are divided into several levels, i.e., the number of weight levels is fixed; then the weight level of each entry in the entry sequence corresponding to each sample sentence is labeled. For example, entries may be labeled on 5 weight levels, where the lowest weight level of an entry is level1 and the highest is level5. An entry weight calculation model is then trained based on the labeled weight levels and the feature vectors of the entries.
In this method, setting a fixed number of weight levels amounts to calculating entry weights by classification. A classification method, however, determines the absolute importance level of an entry, i.e., the importance of the entry across all sample sentences. The resulting labels have low accuracy, so the trained entry weight calculation model is inaccurate.
Disclosure of Invention
In view of this, the present application provides a method and an apparatus for training an entry weight calculation model to improve the accuracy of the entry weight calculation model.
In order to achieve the above object, the following solutions are proposed:
a method for training an entry weight calculation model, the method comprising:
acquiring a sample sentence set;
splitting each sample sentence in the sample sentence set to obtain an entry sequence corresponding to each sample sentence, wherein the entry sequence comprises at least one entry obtained after the sample sentence is split;
determining the relative importance degree of each entry in each entry sequence;
grouping the entries in each entry sequence according to the relative importance degree of each entry in each entry sequence to obtain a tagging sequence corresponding to each entry sequence, wherein the tagging sequence comprises at least one entry group obtained after the entries in the entry sequence are grouped, and each entry group comprises at least one entry;
and training a preset entry weight calculation model according to each tagging sequence to obtain values of the model parameters in the entry weight calculation model.
An apparatus for training an entry weight calculation model, the apparatus comprising:
an obtaining unit, configured to obtain a sample sentence set;
a splitting unit, configured to split each sample sentence in the sample sentence set to obtain an entry sequence corresponding to each sample sentence, wherein the entry sequence comprises at least one entry obtained after the sample sentence is split;
a determining unit, configured to determine the relative importance degree of each entry in each entry sequence;
a grouping unit, configured to group the entries in each entry sequence according to the relative importance degree of each entry in each entry sequence to obtain a tagging sequence corresponding to each entry sequence, wherein the tagging sequence comprises at least one entry group obtained after the entries in the entry sequence are grouped, and each entry group comprises at least one entry;
and a training unit, configured to train a preset entry weight calculation model according to each tagging sequence to obtain values of the model parameters in the entry weight calculation model.
According to the technical solutions above, each sample sentence in the sample sentence set is split to obtain an entry sequence corresponding to each sample sentence, the entry sequence comprising at least one entry obtained after the sample sentence is split. The relative importance degree of each entry in each entry sequence is determined, and the entries in each entry sequence are grouped according to these relative importance degrees to obtain a tagging sequence corresponding to each entry sequence, the tagging sequence comprising at least one entry group, and each entry group comprising at least one entry. The tagging sequence is thus obtained from the relative importance degree of each entry within its own entry sequence. Compared with directly labeling the absolute weight level of an entry as in the prior art, absolute weight levels are no longer labeled; instead, the relative importance degrees of the entries within the same entry sequence are determined and the tagging sequence is derived from them, which yields more accurate labels and therefore a more accurate entry weight calculation model.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a flowchart of an entry weight calculation model training method disclosed in an embodiment of the present application;
Fig. 2 is a flowchart of an entry weight calculation model training method disclosed in another embodiment of the present application;
Fig. 3 is a flowchart of a method for training an entry weight calculation model based on the pair-wise algorithm, disclosed in an embodiment of the present application;
Fig. 4 is a structural block diagram of an entry weight calculation model training apparatus disclosed in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
An embodiment of the present application provides a method for training an entry weight calculation model. As shown in Fig. 1, the method includes:
S100, acquiring a sample sentence set;
The sample sentences in the sample sentence set may be users' search query sentences in a search engine, video search terms in a video player, or video titles and news titles presented during video or news browsing. For example, a sample sentence may be the video title "An explosion occurred at place B in country A, a dazzling fireball fell from the sky".
S101, splitting each sample sentence in the sample sentence set to obtain an entry sequence corresponding to each sample sentence, wherein the entry sequence comprises at least one entry obtained after the sample sentence is split;
Each sample sentence is split into entries. Specifically, words or characters may be separated according to spaces, punctuation marks, and similar delimiters contained in the sample sentence; and/or the character string may be split with a word segmentation program to obtain words or characters.
Splitting "An explosion occurred at place B in country A, a dazzling fireball fell from the sky" into entries yields the entry sequence: "country A", "place B", "occurred", "explosion", "sky", "fell", "dazzling", "fireball".
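As an illustration of this splitting step (a minimal sketch, not part of the claimed method; the delimiter rule and the English stand-in sentence are assumptions, and Chinese text would normally go through a dedicated word segmentation tool rather than a regex):

```python
import re

def split_into_entries(sentence: str) -> list[str]:
    """Split a sample sentence into entries on spaces and punctuation.

    A stand-in for the word segmentation program mentioned above.
    """
    # Split on whitespace and common ASCII/CJK punctuation, drop empty pieces.
    return [tok for tok in re.split(r"[\s,.;:!?，。；：！？]+", sentence) if tok]

print(split_into_entries(
    "country-A place-B occurred explosion, sky fell dazzling fireball"))
# -> ['country-A', 'place-B', 'occurred', 'explosion', 'sky', 'fell',
#     'dazzling', 'fireball']
```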
S102, determining the relative importance degree of each entry in each entry sequence;
Continuing with the example sentence: within the entry sequence "country A", "place B", "occurred", "explosion", "sky", "fell", "dazzling", "fireball", annotators compare the entries against objective criteria, for example the common criteria that drama titles, proper nouns, person names, and nouns are of high importance. The relative importance degree of each entry in the entry sequence is determined from the relative importance degrees input by the annotators: the entry with the highest relative importance degree is "explosion", followed by "country A" and "place B", then "fireball", while "occurred", "sky", "fell", and "dazzling" have the lowest relative importance degree.
Specifically, the annotators may mark the relative importance degree of each entry with identifiers such as 1, 2, 3, and so on. Alternatively, keywords may first be determined among the entries and only the relative importance degrees of the keywords labeled; for example, "country A", "place B", "explosion", and "fireball" in the example sentence may be used as keywords for subsequent processing.
S103, grouping the entries in each entry sequence according to the relative importance degree of each entry in each entry sequence to obtain a tagging sequence corresponding to each entry sequence, wherein the tagging sequence comprises at least one entry group obtained after the entries in the entry sequence are grouped, and each entry group comprises at least one entry;
and S104, training a preset entry weight calculation model according to each tagging sequence to obtain values of model parameters in the entry weight calculation model.
Specifically, the entry weight calculation model may be a linear model:

f(t) = b + Σ_i w_i · x_i(t),

where t denotes an entry, f(t) denotes the weight of the entry, b is an offset term, x_i(t) is the i-th feature value of entry t, and w_i is the weight coefficient corresponding to the i-th feature value; b and the coefficients w_i are the parameters of the entry weight calculation model. The linear model is trained using a conventional Learning To Rank (LTR) method.
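For concreteness, the linear model can be written out in a few lines of code (an illustrative sketch only; the class name and the example coefficient values are assumptions, not part of the application):

```python
from dataclasses import dataclass

@dataclass
class LinearEntryWeightModel:
    """The linear model f(t) = b + sum_i w_i * x_i(t) described above."""
    weights: list[float]  # w_i, one coefficient per feature
    bias: float           # b, the offset term

    def score(self, features: list[float]) -> float:
        # features is the feature vector x(t) of a single entry
        return self.bias + sum(w * x for w, x in zip(self.weights, features))

# Once b and the w_i have been learned, an entry's weight is a continuous float:
model = LinearEntryWeightModel(weights=[0.9, -0.2, 0.5, 0.1, 0.3], bias=0.05)
print(model.score([0.4818, 0.3795, 0.6780, 0.3010, 0.8000]))
```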
In this embodiment, each sample sentence in the sample sentence set is split to obtain an entry sequence corresponding to each sample sentence; the relative importance degree of each entry in each entry sequence is determined, and the entries in each entry sequence are grouped according to these relative importance degrees to obtain a tagging sequence corresponding to each entry sequence. The tagging sequence is thus obtained from the relative importance degrees of the entries within the same entry sequence rather than from directly labeled absolute weight levels as in the prior art, so the labels are more accurate and the trained entry weight calculation model is more accurate.
Moreover, the entry weights calculated in this embodiment are continuous floating-point values, such as 0.41, and are not constrained by importance levels, i.e., they are not mapped onto a limited set of weight levels, which improves the precision of the entry weight values.
In another embodiment of the present application, a method for training an entry weight calculation model, as shown in Fig. 2, includes:
S200, acquiring a sample sentence set;
S201, splitting each sample sentence in the sample sentence set to obtain an entry sequence corresponding to each sample sentence, wherein the entry sequence comprises at least one entry obtained after the sample sentence is split;
S202, determining the relative importance degree of each entry in each entry sequence;
S203, for any entry in each entry sequence: acquiring, from the entry sequence, the entries whose relative importance degree matches that of the entry, according to the relative importance degree of the entry in the entry sequence, and storing the entry and the acquired entries in the same entry group;
Specifically, for each entry sequence, the entries may be matched based on their relative importance degree identifiers, and matched entries are stored in the same entry group. After matching is completed, one or more entry groups are formed; each entry group contains at least one entry, and the relative importance degree identifiers of the entries within a group match one another. "Matching" may be defined as having identical relative importance degree identifiers, or as identifiers differing by no more than a set amount, for example a difference of 1.
Alternatively, after the relative importance degrees of the entries have been labeled, the entries with the same relative importance degree may be stored directly in the same entry group through an input operation of the annotator.
S204, for all the entry groups in each entry sequence: sorting the groups according to the relative importance degree of the entries they contain, and taking the sorted sequence as the tagging sequence corresponding to the entry sequence;
specifically, all the entry groups in each entry sequence may be sorted in order from top to bottom according to the relative importance of the entries in the entry groups. For the example sentence, the sequence of the labels obtained after sorting is as follows: [ detonation ] [ Country A, Country B ] [ Mare ] [ Tian, descent, dazzling, occurrence ].
S205, training a preset entry weight calculation model according to each tagging sequence to obtain values of model parameters in the entry weight calculation model.
In the above embodiment, the relative importance degree of each entry within the same entry sequence is determined, the entries are then grouped based on their importance, and the resulting entry groups are sorted according to the relative importance degree of the entries in each group to form the tagging sequence.
In the above embodiment, the entry groups obtained by grouping may also be used directly as the tagging sequence.
An embodiment of the present application further discloses a method for training the entry weight calculation model based on the pair-wise algorithm, one of the LTR algorithms. As shown in Fig. 3, the method includes:
s300, generating an entry pair based on every two entry groups in the tagging sequence, wherein the two entries in the entry pair have different relative importance degrees and are arranged according to a preset sequence; acquiring a feature vector of each entry in each entry pair;
specifically, the obtained labeling sequence is continuously utilized: [ detonation ] [ nation a, B ] [ fireball ] [ sky, dode, flare, happen ], description of the training method in this example is made:
based on every two vocabulary entry groups in the tagging sequence, generating vocabulary entry pairs as follows:
<explosion, country A> <explosion, place B> <explosion, fireball> <explosion, sky> <explosion, fell> <explosion, dazzling> <explosion, occurred>;
<country A, fireball> <country A, sky> <country A, fell> <country A, dazzling> <country A, occurred>;
<place B, fireball> <place B, sky> <place B, fell> <place B, dazzling> <place B, occurred>;
<fireball, sky> <fireball, fell> <fireball, dazzling> <fireball, occurred>.
In each entry pair, the relative importance degree of the first entry is greater than that of the second, i.e., the order within a pair follows the order of the entry groups in the tagging sequence.
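The pair generation can be sketched as follows (illustrative only; the function name is an assumption):

```python
from itertools import combinations, product

def generate_entry_pairs(tagging_sequence):
    """For every two entry groups, pair each entry of the more important
    group (listed first) with each entry of the less important group."""
    pairs = []
    for high_group, low_group in combinations(tagging_sequence, 2):
        pairs.extend(product(high_group, low_group))
    return pairs

tagging_sequence = [["explosion"], ["country A", "place B"], ["fireball"],
                    ["sky", "fell", "dazzling", "occurred"]]
pairs = generate_entry_pairs(tagging_sequence)
print(len(pairs))   # 21 pairs, as enumerated above
print(pairs[:2])    # [('explosion', 'country A'), ('explosion', 'place B')]
```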
A feature vector is acquired for each entry. Specifically, the features may include lexical features of the entry, such as its part of speech; statistical features of the entry, such as tf-idf; and user behavior features of the entry, such as the number of times the entry is clicked as a tag, or entry features obtained from the search logs of a search engine. The feature vectors of the entries obtained in the example are shown in Table 1 below:
Table 1

Entry       domain idf   domain-free idf   log(#query)   log(word length)   pos
explosion   0.4818       0.3795            0.6780        0.3010             0.8000
country A   0.3621       0.5101            0.8130        0.3010             1.1000

Here, domain idf denotes the domain-dependent inverse document frequency feature of the entry, domain-free idf denotes its domain-independent inverse document frequency feature, log(#query) denotes an entry feature obtained from the search logs, log(word length) denotes the length feature of the entry, and pos denotes the part-of-speech feature of the entry.
Taking the entry pair <explosion, country A> as an example, the feature vector Φ1 of "explosion" is (0.4818, 0.3795, 0.6780, 0.3010, 0.8000), and the feature vector Φ2 of "country A" is (0.3621, 0.5101, 0.8130, 0.3010, 1.1000).
S301, generating a first training sample set and a second training sample set according to the feature vector of each entry in each entry pair;
specifically, the entry pair is generated<Explosion, nation A>Feature vector of the corresponding positive sample: phi (phi) ofIs just for12= (0.1197, -0.1306, -0.135, 0.000, -0.3000), corresponding to a sample output label of 1; eigenvector Φ of negative examplesNegative pole21= (-0.1197, 0.1306, 0.135, 0.000, 0.3000), corresponding to a sample output flag label of-1.
S302, training a preset entry weight calculation model according to the first training sample set and the second training sample set to obtain values of model parameters in the entry weight calculation model.
Specifically, the entry weight calculation model

f(t) = b + Σ_i w_i · x_i(t)

is trained on the positive sample set and the negative sample set corresponding to all the entry pairs generated from all the tagging sequences, so as to determine the values of the model parameters b and w_i.
After the values of the model parameters are obtained, the entry weight calculation model is complete. Subsequently, whenever the entry weights in a short text such as a search sentence, a title, or an introduction need to be calculated, the feature vector of each entry in the short text is obtained and substituted into the entry weight calculation model to obtain the weight of each entry.
In this embodiment, model training is performed by combining the SVM algorithm with the pair-wise algorithm, which makes the training simple and effective.
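One possible realization of this SVM plus pair-wise training (a sketch only; the patent does not prescribe an implementation, and scikit-learn as well as the no-intercept choice are assumptions on my part) fits a linear SVM on the difference vectors, whose coefficients then serve as the feature weights w_i:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Difference vectors and labels from all entry pairs of all tagging sequences
# (only the single <explosion, country A> pair is shown here for brevity).
X = np.array([[ 0.1197, -0.1306, -0.1350, 0.0000, -0.3000],   # positive sample
              [-0.1197,  0.1306,  0.1350, 0.0000,  0.3000]])  # negative sample
y = np.array([1, -1])

# fit_intercept=False: the +/- difference samples are symmetric about the
# origin, so no offset is needed and coef_ gives the weight coefficients w_i.
svm = LinearSVC(fit_intercept=False).fit(X, y)
w = svm.coef_[0]

# Scoring a new entry with the learned weights, f(t) = w . x(t)
# (phi_fireball is a hypothetical feature vector, not from the patent):
phi_fireball = np.array([0.40, 0.45, 0.70, 0.30, 0.90])
print(float(w @ phi_fireball))
```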
In other embodiments, other LTR algorithms may also be used to train the model, such as a list-wise algorithm. Specifically, one entry is taken from each entry group in the tagging sequence to form an ordered entry sequence. For example, taking one entry from each group of the tagging sequence above, [explosion] [country A, place B] [fireball] [sky, fell, dazzling, occurred], yields 8 ordered entry sequences, such as [explosion, country A, fireball, sky], [explosion, place B, fireball, sky], [explosion, country A, fireball, fell], and so on. The order of the entries in each ordered entry sequence is then fitted with a list-wise algorithm such as ListNet or LambdaMART, and the entry weight calculation model is trained on the fitted orders to obtain the values of its model parameters. The LTR algorithm is not specifically limited here; any LTR algorithm that can train the model on the tagging sequences provided by the present application so as to obtain the values of the model parameters falls within the scope of the present application.
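Enumerating the ordered entry sequences is straightforward (a sketch; the list-wise fitting itself, e.g. with ListNet or LambdaMART, is not shown):

```python
from itertools import product

tagging_sequence = [["explosion"], ["country A", "place B"], ["fireball"],
                    ["sky", "fell", "dazzling", "occurred"]]

# One entry per group, in group order: 1 * 2 * 1 * 4 = 8 ordered sequences.
ordered_sequences = [list(seq) for seq in product(*tagging_sequence)]
print(len(ordered_sequences))   # 8
print(ordered_sequences[0])     # ['explosion', 'country A', 'fireball', 'sky']
```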
An embodiment of the present application further discloses an apparatus for training the entry weight calculation model. As shown in Fig. 4, the apparatus includes:
an obtaining unit 400, configured to obtain a sample sentence set;
a splitting unit 401, configured to split each sample sentence in the sample sentence set to obtain an entry sequence corresponding to each sample sentence, where the entry sequence includes at least one entry obtained after the sample sentence is split;
a determining unit 402, configured to determine the relative importance degree of each entry in each entry sequence;
a grouping unit 403, configured to group the entries in each entry sequence according to the relative importance degree of each entry in each entry sequence to obtain a tagging sequence corresponding to each entry sequence, where the tagging sequence includes at least one entry group obtained after the entries in the entry sequence are grouped, and each entry group includes at least one entry;
and a training unit 404, configured to train a preset entry weight calculation model according to each tagging sequence, so as to obtain values of the model parameters in the entry weight calculation model.
Preferably, the grouping unit includes:
a matching subunit, configured to, for any entry in each entry sequence: acquire, from the entry sequence, the entries whose relative importance degree matches that of the entry, and store the entry and the acquired entries in the same entry group;
and a first generating subunit, configured to obtain, according to the entry groups in each entry sequence, a tagging sequence corresponding to each entry sequence, where the tagging sequence includes at least one entry group obtained after the entries in the entry sequence are grouped, and each entry group includes at least one entry.
Preferably, the first generating subunit includes a sorting module, configured to, for all entry groups in each entry sequence: sort the groups according to the relative importance degree of the entries they contain, and take the sorted sequence as the tagging sequence corresponding to the entry sequence.
Preferably, the training unit includes:
a second generating subunit, configured to generate entry pairs based on every two entry groups in the tagging sequence, where the two entries in an entry pair have different relative importance degrees and are arranged in a preset order, and to acquire a feature vector of each entry in each entry pair;
a third generating subunit, configured to generate a first training sample set and a second training sample set according to the feature vector of each entry in each entry pair;
and a training subunit, configured to train a preset entry weight calculation model according to the first training sample set and the second training sample set to obtain values of the model parameters in the entry weight calculation model.
Finally, it should also be noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between those entities or actions. Also, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.

Claims (6)

1. A method for training an entry weight calculation model, the method comprising:
acquiring a sample sentence set;
splitting each sample sentence in the sample sentence set to obtain an entry sequence corresponding to each sample sentence, wherein the entry sequence comprises at least one entry obtained after the sample sentence is split;
determining the relative importance degree of each entry in each entry sequence;
grouping the entries in each entry sequence according to the relative importance degree of each entry in each entry sequence to obtain a tagging sequence corresponding to each entry sequence, wherein the tagging sequence comprises at least one entry group obtained after the entries in the entry sequence are grouped, and each entry group comprises at least one entry;
and training a preset entry weight calculation model according to each tagging sequence to obtain values of the model parameters in the entry weight calculation model;
wherein the grouping of the entries in each entry sequence according to the relative importance degree of each entry in each entry sequence to obtain the tagging sequence corresponding to each entry sequence comprises:
for any entry in each entry sequence: acquiring, from the entry sequence, the entries whose relative importance degree matches that of the entry, according to the relative importance degree of the entry in the entry sequence, and storing the entry and the acquired entries in the same entry group;
and obtaining the tagging sequence corresponding to each entry sequence according to the entry groups in each entry sequence, wherein the tagging sequence comprises at least one entry group obtained after the entries in the entry sequence are grouped, and each entry group comprises at least one entry.
2. The method of claim 1, wherein obtaining the tagging sequence corresponding to each entry sequence according to the entry groups in each entry sequence comprises:
for all entry groups in each entry sequence: sorting the groups according to the relative importance degree of the entries they contain, and taking the sorted sequence as the tagging sequence corresponding to the entry sequence.
3. The method of claim 1, wherein training a preset entry weight calculation model according to the tagging sequence corresponding to each entry sequence comprises:
generating entry pairs based on every two entry groups in the tagging sequence, wherein the two entries in an entry pair have different relative importance degrees and are arranged in a preset order; and acquiring a feature vector of each entry in each entry pair;
generating a first training sample set and a second training sample set according to the feature vector of each entry in each entry pair;
and training the entry weight calculation model according to the first training sample set and the second training sample set to obtain values of the model parameters in the entry weight calculation model.
4. An apparatus for training an entry weight calculation model, the apparatus comprising:
an obtaining unit, configured to obtain a sample sentence set;
a splitting unit, configured to split each sample sentence in the sample sentence set to obtain an entry sequence corresponding to each sample sentence, wherein the entry sequence comprises at least one entry obtained after the sample sentence is split;
a determining unit, configured to determine the relative importance degree of each entry in each entry sequence;
a grouping unit, configured to group the entries in each entry sequence according to the relative importance degree of each entry in each entry sequence to obtain a tagging sequence corresponding to each entry sequence, wherein the tagging sequence comprises at least one entry group obtained after the entries in the entry sequence are grouped, and each entry group comprises at least one entry;
and a training unit, configured to train a preset entry weight calculation model according to each tagging sequence to obtain values of the model parameters in the entry weight calculation model;
wherein the grouping unit comprises:
a matching subunit, configured to, for any entry in each entry sequence: acquire, from the entry sequence, the entries whose relative importance degree matches that of the entry, and store the entry and the acquired entries in the same entry group;
and a first generating subunit, configured to obtain, according to the entry groups in each entry sequence, a tagging sequence corresponding to each entry sequence, wherein the tagging sequence comprises at least one entry group obtained after the entries in the entry sequence are grouped, and each entry group comprises at least one entry.
5. The apparatus of claim 4, wherein the first generating subunit comprises a sorting module, configured to, for all entry groups in each entry sequence: sort the groups according to the relative importance degree of the entries they contain, and take the sorted sequence as the tagging sequence corresponding to the entry sequence.
6. The apparatus of claim 4, wherein the training unit comprises:
a second generating subunit, configured to generate entry pairs based on every two entry groups in the tagging sequence, wherein the two entries in an entry pair have different relative importance degrees and are arranged in a preset order, and to acquire a feature vector of each entry in each entry pair;
a third generating subunit, configured to generate a first training sample set and a second training sample set according to the feature vector of each entry in each entry pair;
and a training subunit, configured to train a preset entry weight calculation model according to the first training sample set and the second training sample set to obtain values of the model parameters in the entry weight calculation model.
CN201810757233.4A 2018-07-11 2018-07-11 Entry weight calculation model training method and device Active CN108959263B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810757233.4A CN108959263B (en) 2018-07-11 2018-07-11 Entry weight calculation model training method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810757233.4A CN108959263B (en) 2018-07-11 2018-07-11 Entry weight calculation model training method and device

Publications (2)

Publication Number Publication Date
CN108959263A CN108959263A (en) 2018-12-07
CN108959263B (en) 2022-06-03

Family

ID=64483601

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810757233.4A Active CN108959263B (en) 2018-07-11 2018-07-11 Entry weight calculation model training method and device

Country Status (1)

Country Link
CN (1) CN108959263B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110472665A (en) * 2019-07-17 2019-11-19 新华三大数据技术有限公司 Model training method, file classification method and relevant apparatus
CN113392651B (en) * 2020-11-09 2024-05-14 腾讯科技(深圳)有限公司 Method, device, equipment and medium for training word weight model and extracting core words

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103164454A (en) * 2011-12-15 2013-06-19 百度在线网络技术(北京)有限公司 Keyword grouping method and keyword grouping system
CN105589847A (en) * 2015-12-22 2016-05-18 北京奇虎科技有限公司 Weighted article identification method and device
CN107562717A (en) * 2017-07-24 2018-01-09 南京邮电大学 A kind of text key word abstracting method being combined based on Word2Vec with Term co-occurrence
CN107967256A (en) * 2017-11-14 2018-04-27 北京拉勾科技有限公司 Term weighing prediction model generation method, position recommend method and computing device


Also Published As

Publication number Publication date
CN108959263A (en) 2018-12-07


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant