WO2022062404A1 - Training method, apparatus, device, and storage medium for a text classification model - Google Patents

Training method, apparatus, device, and storage medium for a text classification model

Info

Publication number
WO2022062404A1
Authority
WO
WIPO (PCT)
Prior art keywords
training
text
iteration
sub
classification model
Prior art date
Application number
PCT/CN2021/091090
Other languages
English (en)
French (fr)
Inventor
刘广 (Liu Guang)
黄海龙 (Huang Hailong)
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2022062404A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Definitions

  • the present application belongs to the technical field of artificial intelligence, and in particular relates to a training method, apparatus, device and storage medium for a text classification model.
  • text classification is being studied and applied more and more widely.
  • text classification based on a text classification model usually faces the problem of abundant data but scarce labels (a low-resource setting); that is, labeled data is scarce.
  • semi-supervised training methods can use a very small amount of labeled corpus and a large amount of unlabeled data to obtain a high-performance text classification model.
  • one such semi-supervised method in the prior art is virtual adversarial training (VAT).
  • One of the purposes of the embodiments of the present application is to provide a training method, apparatus, device and storage medium for a text classification model, so as to solve the technical problem of poor classification effect of the text classification model in the prior art.
  • a first aspect of the embodiments of the present application provides a training method for a text classification model, including:
  • the training sample set includes N labeled training samples and M unlabeled training samples, each labeled training sample includes text information and a category label of the text information, and each unlabeled training sample includes text information; M and N are integers greater than 1;
  • in the i-th alternate iterative training process, the M enhanced training samples are generated by performing text enhancement processing on the M unlabeled training samples according to the text enhancement model obtained by the (i-1)-th alternate iteration, where i is an integer greater than 1.
  • a second aspect of the embodiments of the present application provides a training device for a text classification model, the device comprising:
  • the acquisition module is used to acquire a training sample set; the training sample set includes N labeled training samples and M unlabeled training samples, each labeled training sample includes text information and a category label of the text information, and each unlabeled training sample includes text information, where M and N are both integers greater than 1;
  • the training module is used to perform alternate iterative training on the initial text classification model and the initial text enhancement model according to the training sample set and the M enhanced training samples to obtain the target text classification model; wherein, in the i-th alternate iterative training process, the M enhanced training samples are generated by performing text enhancement processing on the M unlabeled training samples according to the text enhancement model obtained by the (i-1)-th alternate iteration, where i is an integer greater than 1.
  • a third aspect of the embodiments of the present application provides a training device for a text classification model, including a memory, a processor, and a computer program stored in the memory and running on the processor.
  • the processor implements the following steps when executing the computer program:
  • the training sample set includes N labeled training samples and M unlabeled training samples, each labeled training sample includes text information and a category label of the text information, and each unlabeled training sample includes text information; M and N are integers greater than 1;
  • in the i-th alternate iterative training process, the M enhanced training samples are generated by performing text enhancement processing on the M unlabeled training samples according to the text enhancement model obtained by the (i-1)-th alternate iteration, where i is an integer greater than 1.
  • a fourth aspect of the embodiments of the present application provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the following steps are implemented:
  • the training sample set includes N labeled training samples and M unlabeled training samples, each labeled training sample includes text information and a category label of the text information, and each unlabeled training sample includes text information; M and N are integers greater than 1;
  • in the i-th alternate iterative training process, the M enhanced training samples are generated by performing text enhancement processing on the M unlabeled training samples according to the text enhancement model obtained by the (i-1)-th alternate iteration, where i is an integer greater than 1.
  • the initial text classification model and the initial text enhancement model are alternately and iteratively trained according to the training sample set and the M enhanced training samples; that is, the text enhancement model can be trained while the text classification model is trained, and the data enhancement strategy is inductively trained according to the classification performance of the text classification model, so that the training target of the text classification model is consistent with that of the text enhancement model, which greatly improves the performance of the final text classification model. In each alternate iteration, the M enhanced training samples are generated by performing text enhancement processing on the M unlabeled training samples according to the text enhancement model obtained in the previous alternate iteration.
  • the goal of text enhancement is to expand or modify the data of the unlabeled training samples to obtain enhanced training samples similar to real data. Compared with the enhanced samples obtained by VAT in the prior art, the enhanced training samples obtained through text enhancement are highly interpretable and can provide guidance on the types of data to label in the future.
  • FIG. 1 is a schematic flowchart of a training method for a text classification model provided by an embodiment of the present application
  • FIG. 2 is a schematic flowchart of an alternate iterative training process provided by an embodiment of the present application
  • FIG. 3 is a schematic flowchart of obtaining a text classification model and a text enhancement model obtained by the jth sub-iteration according to an embodiment of the present application;
  • FIG. 5 is a schematic flowchart of determining a first loss function value according to an embodiment of the present application
  • FIG. 6 is a schematic flowchart of obtaining an enhanced training sample corresponding to an unlabeled training sample according to an embodiment of the present application
  • FIG. 7 is a schematic structural diagram of a training device for a text classification model provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a hardware composition of a training device for a text classification model provided by an embodiment of the present application.
  • references in this specification to "one embodiment" or "some embodiments" and the like mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application.
  • appearances of the phrases "in one embodiment", "in some embodiments", "in other embodiments", etc. in various places in this specification do not necessarily all refer to the same embodiment, but mean "one or more but not all embodiments" unless specifically emphasized otherwise.
  • the terms "comprising", "including", "having" and their variants mean "including but not limited to" unless specifically emphasized otherwise.
  • FIG. 1 is a schematic flowchart of a training method for a text classification model provided by an embodiment of the present application. As shown in Figure 1, the method includes:
  • the training sample set includes N labeled training samples and M unlabeled training samples, each labeled training sample includes text information and a category label of the text information, and each unlabeled training sample includes text information .
  • the labeled training samples represent the labeled corpus
  • the unlabeled training samples represent the unlabeled corpus.
  • in the training sample set of this embodiment, the number M of unlabeled training samples is much larger than the number N of labeled training samples, where both M and N are integers greater than 1.
  • the text information may refer to the text sequence to be classified
  • the category label may refer to the category of the content represented by the text sequence to be classified.
  • the category label may be determined according to the application field of the text classification model.
  • the category label may refer to the sentiment tendency of the content expressed by the text sequence to be classified.
  • the emotional inclination can be any one of positive news, neutral news and negative news.
  • the training sample set can be obtained according to the application field of the text classification model, so as to increase the pertinence of the text classification model.
  • a verification sample set may also be obtained, wherein the verification sample set includes P verification samples, and each verification sample includes verification text information and a category label of the verification text information.
  • the purpose of this embodiment is to enhance the classification performance of the text classification model by alternately training the initial text classification model and the initial text enhancement model so that the training target of the initial text classification model is consistent with the training target of the initial text enhancement model.
  • the output of the initial text enhancement model is the input of the initial text classification model
  • the consistent training target may mean that the output of the trained text enhancement model matches the input of the trained text classification model, so that the trained text classification model has a better classification effect on unlabeled text information.
  • the initial text classification model can be used as a classifier: when a sample containing text information is input to the initial text classification model, the model classifies the sample and determines the category label of the input sample; the loss function value of the sample can also be obtained, so that the model parameters of the initial text classification model can be optimized according to the loss function value.
  • the initial text augmentation model can be used as a sample generator, and the initial text augmentation model can augment/modify data for text information without class labels to obtain augmented samples similar to real data.
  • both the initial text classification model and the initial text enhancement model may be open-source language models, which are not specifically limited here.
  • the initial text classification model is a BERT model
  • the initial text enhancement model is a CBERT model
  • performing alternate iterative training on the initial text classification model and the initial text enhancement model may refer to updating the model parameters of the current text classification model and the model parameters of the current text enhancement model in sequence during one iterative training process.
  • in the first alternate iterative training, text enhancement processing is performed on the M unlabeled training samples according to the initial text enhancement model to generate M enhanced training samples. Then, one alternate iteration of training is performed on the initial text classification model and the initial text enhancement model according to the training sample set and the above M enhanced training samples, obtaining the text classification model and the text enhancement model of the first alternate iteration.
  • in the second alternate iterative training, text enhancement processing is performed on the M unlabeled training samples according to the text enhancement model obtained in the first alternate iteration to generate M enhanced training samples. Then, according to the training sample set and the above M enhanced training samples, one alternate iteration of training is performed on the text classification model and the text enhancement model obtained in the first alternate iteration, obtaining the text classification model and the text enhancement model of the second alternate iteration.
  • in the i-th alternate iterative training, the M unlabeled training samples are subjected to text enhancement processing according to the text enhancement model obtained in the (i-1)-th alternate iteration to generate M enhanced training samples. Then, according to the training sample set and the above M enhanced training samples, one alternate iteration of training is performed on the text classification model and the text enhancement model obtained in the (i-1)-th alternate iteration, obtaining the text classification model and the text enhancement model of the i-th alternate iteration.
  • the above-mentioned alternate iterative training process is performed until the preset alternate iterative training end condition is met, and the target text classification model is obtained.
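The alternating loop described in the preceding paragraphs can be sketched in Python. Everything below (class names, method signatures, the stubbed model updates) is illustrative and not part of the original disclosure:

```python
class StubEnhancer:
    """Stands in for the text enhancement model (e.g. a CBERT-style model)."""

    def enhance(self, sample):
        # Placeholder for mask-and-predict text enhancement.
        return sample + " <enhanced>"

    def updated(self, classifier, unlabeled):
        # Placeholder for one parameter update of the enhancer.
        return self


class StubClassifier:
    """Stands in for the text classification model (e.g. a BERT-style model)."""

    def updated(self, labeled, unlabeled, enhanced):
        # Placeholder for one parameter update of the classifier.
        return self


def train_alternately(classifier, enhancer, labeled, unlabeled, n_iters=3):
    """Run n alternate iterations; at iteration i the enhancer from
    iteration i-1 generates the M enhanced samples."""
    for _ in range(n_iters):
        enhanced = [enhancer.enhance(x) for x in unlabeled]
        # Within one alternate iteration, the classifier is updated first,
        # then the enhancer.
        classifier = classifier.updated(labeled, unlabeled, enhanced)
        enhancer = enhancer.updated(classifier, unlabeled)
    return classifier
```

In practice both `updated` calls would run one pass of gradient-based training over the data, and the stop condition would be the preset iteration count n or convergence of the validation loss.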
  • the condition for ending the alternate iterative training may include: the number of alternate iterative training rounds equals n, where n ≥ i.
  • the target text classification model generated after the alternate iterative training is the text classification model obtained after the n-th alternate training.
  • the condition for ending the alternate iterative training may further include that, after the latest alternate training process, the output result of the target text classification model converges.
  • the target text classification model generated after the alternate iterative training is the text classification model obtained after the latest alternate training.
  • the verification sample set includes P verification samples, and each verification sample includes verification text information and a category label of the verification text information.
  • the verification text information of the P verification samples is used as the features, and the category labels of the verification text information of the P verification samples are used as the labels; the i-th loss function value is obtained according to the text classification model obtained in the i-th alternate training, and whether the current i-th loss function value has converged can be judged according to the loss function values after each alternate iteration of training. If it has not converged, the (i+1)-th alternate iteration of training is performed, until the loss function of the current alternate iteration of training converges.
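The convergence test on the per-iteration validation loss can be as simple as comparing successive loss values. A hedged sketch follows; the tolerance below is an illustrative choice, not from the original disclosure:

```python
def has_converged(loss_history, tol=1e-4):
    """Return True once the latest validation loss value changed by
    less than tol relative to the previous alternate iteration."""
    if len(loss_history) < 2:
        return False
    return abs(loss_history[-1] - loss_history[-2]) < tol
```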
  • in the training method provided by this embodiment, the initial text classification model and the initial text enhancement model are alternately and iteratively trained according to the training sample set and the M enhanced training samples; that is, the text enhancement model can be trained while the text classification model is trained, and the data enhancement strategy is inductively trained according to the classification performance of the text classification model, so that the training target of the text classification model is consistent with that of the text enhancement model, which greatly improves the performance of the final text classification model. In each alternate iteration, the M enhanced training samples are generated by performing text enhancement processing on the M unlabeled training samples according to the text enhancement model obtained in the previous alternate iteration; the unlabeled training samples are expanded or modified to obtain enhanced training samples similar to real data. Compared with the enhanced samples obtained by VAT in the prior art, the enhanced training samples obtained through text enhancement are highly interpretable and can provide guidance on the types of data to label in the future.
  • each alternate iterative training process includes k sub-iteration processes.
  • the processing method of each alternate iteration training is the same, and the processing method of each sub-iteration process is also the same, and an alternate iteration process is exemplarily described below with reference to the embodiment of FIG. 2 .
  • FIG. 2 is a schematic flowchart of an alternate iterative training process provided by an embodiment of the present application.
  • the embodiment of FIG. 2 describes a possible implementation of an alternate iterative process in step 20 of the embodiment of FIG. 1 .
  • the initial text classification model and the initial text enhancement model are alternately and iteratively trained according to the training sample set and M enhanced training samples, and the target text classification model is obtained, including:
  • in each alternate iteration of training, the multiple training samples in the training sample set are divided into multiple batches, and the above two models are trained batch by batch.
  • each alternate iteration training includes multiple sub-iteration processes (corresponding to multiple batches), and each sub-iteration process is processed in the same manner. After all the training samples in the training sample set are iterated once, the alternate iteration training process is completed, and the text classification model after the alternate iteration training is obtained.
  • each alternate iteration training process may be the same.
  • the purpose of this step is to obtain a batch of training samples.
  • the preset ratio can be set by the user. For example, the ratio of labeled training samples to unlabeled training samples is 1:3.
  • j is 2.
  • the marked training samples and the unmarked training samples are extracted from the training sample set according to the ratio of 1:3, and S marked training samples and 3S unmarked training samples are obtained.
  • the S labeled training samples and the 3S unlabeled training samples are a batch of training data.
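Drawing one batch at the preset 1:3 ratio can be sketched as follows; the helper name and the fixed seed are illustrative assumptions:

```python
import random


def draw_batch(labeled, unlabeled, s, rng=None):
    """Extract S labeled and 3S unlabeled training samples as one batch."""
    rng = rng or random.Random(0)  # fixed seed only for reproducibility
    return rng.sample(labeled, s), rng.sample(unlabeled, 3 * s)
```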
  • in the j-th sub-iteration, text enhancement processing is performed on the unlabeled training samples according to the text enhancement model obtained in the previous, (j-1)-th, sub-iteration, generating the enhanced training samples respectively corresponding to the multiple unlabeled training samples.
  • the unlabeled training samples refer to the 3S unlabeled training samples extracted in step 21.
  • the enhanced training samples correspond one-to-one to the extracted unlabeled training samples.
  • the text classification model obtained by the jth sub-iteration and the text enhancement model obtained by the jth sub-iteration are determined as the text classification model and the text enhancement model obtained by this alternate iteration training.
  • in the (j+1)-th sub-iteration, the training samples extracted in step 21 are different from the training samples extracted in the j-th sub-iteration process.
  • the above sub-iteration training process is performed until the N labeled training samples and the M unlabeled training samples in the training sample set are all iterated once, and the text classification model after the current alternate iteration training is obtained.
  • each batch contains both labeled and unlabeled training samples, and the group of data in a batch jointly determines the direction of the gradient step, so the gradient is less likely to deviate as it descends, reducing randomness; moreover, the amount of sample data in each batch is much smaller than the entire training sample set, so the computation of each iteration of training is greatly reduced.
  • FIG. 3 is a schematic flowchart of obtaining a text classification model and a text enhancement model obtained by the jth sub-iteration according to an embodiment of the present application, and describes a possible implementation of S23 in the embodiment of FIG. 2 .
  • the enhanced training samples and the extracted labeled and unlabeled training samples are used as input, and the text classification model and the text enhancement model obtained in the (j-1)-th sub-iteration are trained to obtain the text classification model and the text enhancement model of the j-th sub-iteration, including:
  • the first loss function value includes a supervised loss function value and an unsupervised loss function value, wherein the supervised loss function value is generated according to the labeled training samples, and the unsupervised loss function value is generated according to the unlabeled training samples and the corresponding enhanced training samples.
  • FIG. 4 is a schematic flowchart of the sub-iteration training provided by the embodiment of the present application.
  • the input of the text classification model includes labeled training samples, unlabeled training samples, and enhanced training samples processed by the text enhancement model.
  • the output of the text classification model includes a supervised loss and an unsupervised loss, where the supervised loss is generated from the labeled training samples, and the unsupervised loss is generated from the unlabeled training samples and the corresponding enhanced training samples.
  • the input of the text enhancement model is the unlabeled training sample, and the output is the enhanced training sample corresponding to the unlabeled training sample.
  • the parameters of the text classification model obtained in the (j-1)-th sub-iteration and the parameters of the text enhancement model obtained in the (j-1)-th sub-iteration are updated in sequence through backpropagation.
  • FIG. 5 is a schematic flowchart of determining the value of the first loss function according to an embodiment of the present application, which describes a possible implementation of S231 in the embodiment of FIG. 3 .
  • determining the first loss function value according to the obtained text classification model includes:
  • the second loss function value may refer to the value of the cross-entropy function, which may be written as formula (1): L1 = -(1/M) * Σ_{m=1}^{M} y_m * log(p_m), where L1 is the cross-entropy function value, M is the number of labeled training samples, y_m is the class label of the m-th labeled training sample, and p_m is the predicted probability distribution of the m-th labeled training sample, with m an integer greater than or equal to 1 and less than or equal to M.
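A minimal numeric sketch of the cross-entropy of formula (1), assuming one-hot class labels and averaging over the M labeled samples (the averaging convention is an assumption, not stated in the original):

```python
import math


def cross_entropy(labels, probs):
    """L1 = -(1/M) * sum_m sum_c y_mc * log(p_mc), with y one-hot."""
    total = 0.0
    for y, p in zip(labels, probs):
        total -= sum(yc * math.log(pc) for yc, pc in zip(y, p) if yc > 0)
    return total / len(labels)
```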
  • the third loss function value is used to represent the closeness between the probability distribution of the unlabeled training samples and the probability distribution of the corresponding enhanced training samples.
  • the third loss function value may be the KL divergence, which is used to compare how close two probability distributions are.
  • the calculation formula of the third loss function value may refer to formula (2): KL(p‖q) = (1/N) * Σ_{n=1}^{N} p(x_n) * log(p(x_n)/q(x_n)), where KL(p‖q) refers to the KL divergence value, N is the number of unlabeled training samples, x_n is the n-th unlabeled training sample, p(x_n) is the probability distribution of the n-th unlabeled training sample, and q(x_n) is the probability distribution of the enhanced training sample corresponding to the n-th unlabeled training sample, with n an integer greater than or equal to 1 and less than or equal to N.
  • the first loss function value includes the supervised loss function value generated according to the labeled training samples and the unsupervised loss function value generated according to the unlabeled training samples, where the supervised loss function value may refer to the second loss function value and the unsupervised loss function value may refer to the third loss function value.
  • the calculation formula of the first loss function value can be expressed as formula (3): L = L1 + r * KL(p‖q), where L1 is the cross-entropy function value in formula (1), KL(p‖q) is the KL divergence value in formula (2), and r is a hyperparameter.
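Formulas (2) and (3) can be sketched together in a few lines; averaging the KL term over the N unlabeled samples is an assumption about the exact convention:

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for one pair of probability distributions."""
    return sum(pc * math.log(pc / qc) for pc, qc in zip(p, q) if pc > 0)


def first_loss(l1, p_unlabeled, q_enhanced, r=1.0):
    """Formula (3): L = L1 + r * mean_n KL(p(x_n) || q(x_n))."""
    kl = sum(kl_divergence(p, q) for p, q in zip(p_unlabeled, q_enhanced))
    return l1 + r * kl / len(p_unlabeled)
```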
  • FIG. 6 is a schematic flowchart of obtaining enhanced training samples corresponding to unlabeled training samples according to an embodiment of the present application, and describes a possible implementation of S22 in the embodiment of FIG. 2 .
  • processing the unlabeled training samples with the text enhancement model obtained in the (j-1)-th sub-iteration to obtain the enhanced training samples corresponding to the unlabeled training samples includes:
  • the word segmentation processing may refer to dividing the continuous text sequence in the unlabeled training sample into separate words according to a certain specification.
  • the unlabeled training samples can be segmented according to syntax and semantics.
  • for example, the unlabeled training sample is "I like playing basketball, and Xiaoming also likes it"; segmenting it according to semantics generates the corresponding first text sequence {I, like, play, basketball, Xiaoming, also, like}.
  • the above word segmentation method is only an example; word segmentation may also be performed on unlabeled training samples based on existing word segmentation tools, which is not limited herein.
  • S222 Encode the first text sequence based on the preset dictionary, and generate a first vector corresponding to the first text sequence, where the first vector includes a plurality of encoded values.
  • the preset dictionary may include all words, object-oriented domain keywords and professional terms in the standard modern Chinese corpus; the preset dictionary may also include the respective numerical values of all the above words. It should be understood that the value corresponding to each word in the preset dictionary is generally different.
  • encoding the first text sequence based on the preset dictionary may refer to mapping each word in the first text sequence to its corresponding numerical value in the preset dictionary to obtain a target vector, adding a start mark before the start position of the target vector, and adding a stop mark after the end position of the first vector, to generate the first vector corresponding to the first text sequence.
  • the start mark can be <CLS>, and the stop mark can be <SEP>.
  • the length of the first vector is a fixed value L, such as 128.
  • the length of the target vector can meet the requirements by adding an invalid code value, such as 0, after the end position of the target vector.
  • for example, the first text sequence is {I, like, play, basketball, Xiaoming, also, like}, and the corresponding first vector can be [CLS, 1, 2, 3, 4, 5, 6, 2, 0, 0, 0, ..., SEP], where each value in the first vector is the encoded value corresponding to a word in the first text sequence (the repeated word "like" maps to the same value, 2), and the length of the first vector is 128.
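The encoding of S222 can be sketched as follows; the numeric stand-ins for the <CLS> and <SEP> marks are illustrative assumptions:

```python
CLS, SEP, PAD = -1, -2, 0  # placeholder codes for the marks and padding


def encode(sequence, dictionary, length=128):
    """Map words to dictionary values, add start/stop marks, pad to length."""
    codes = [dictionary[w] for w in sequence]  # the target vector
    # Invalid code 0 fills the gap so the first vector has fixed length.
    return [CLS] + codes + [PAD] * (length - len(codes) - 2) + [SEP]
```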
  • the preset probability represents the ratio of the encoded value used for mask processing in the first vector to all encoded values in the first vector.
  • the preset probability can be set by the user, which is not limited here.
  • the preset probability may be 15%.
  • the second vector is obtained by masking some encoded values in the first vector, so the second vector has multiple mask positions.
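Mask processing at the preset probability can be sketched as below; the MASK stand-in value and the choice to leave padding unmasked are assumptions:

```python
import random

MASK = -3  # placeholder code for a masked position


def mask_vector(vector, prob=0.15, rng=None):
    """Replace roughly `prob` of the non-padding encoded values by MASK;
    return the second vector and its mask positions."""
    rng = rng or random.Random(0)
    masked, positions = [], []
    for i, code in enumerate(vector):
        if code != 0 and rng.random() < prob:  # 0 is the invalid padding code
            masked.append(MASK)
            positions.append(i)
        else:
            masked.append(code)
    return masked, positions
```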
  • S224 Input the second vector into the text enhancement model obtained by the j-1th sub-iteration to obtain an enhanced training sample corresponding to the unlabeled training sample.
  • obtaining the enhanced training samples corresponding to the unlabeled training samples may include the following steps:
  • Step 1 Input the second vector into the text enhancement model obtained by the j-1th sub-iteration, and obtain the word probability distribution of each mask position in the second vector.
  • the probability distribution of each mask position may refer to the probability distribution of all words in the preset dictionary appearing at the mask position.
  • for example, the second vector may be Y and may include x mask positions; then, for each mask position, the probability distribution of the mask position may refer to the probability distribution of all words in the preset dictionary appearing at that mask position.
  • for example, the preset dictionary contains k words, A1, A2, ..., Ak, and the probability distribution of the k words at the mask position is p1, p2, ..., pk, where pi represents the probability of Ai appearing, and i is an integer greater than or equal to 1 and less than or equal to k.
  • Step 2 Determine the word corresponding to each mask position based on the multinomial distribution sampling process.
  • the multinomial distribution is an extension of the binomial distribution.
  • filling in the mask positions of the second vector once is equivalent to one result A, and multiple candidate results A can be obtained.
  • according to the probability distributions of the different mask positions in step 1, the probability of occurrence of each result can be obtained.
  • the multinomial probability values corresponding to the different results can be determined according to the probability of occurrence of each result; the result corresponding to the maximum of the above multinomial probability values can be determined as the target result, and the word at each mask position is determined according to the target result.
  • Step 3 Determine the enhanced training samples corresponding to the second vector according to the second vector and the words corresponding to each mask position.
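Steps 1-3 can be sketched with `random.choices`, which draws from the multinomial weights; sampling one word per mask position independently (rather than ranking whole results A) is a simplifying assumption:

```python
import random


def fill_masks(vector, mask_positions, word_probs, dictionary, rng=None):
    """word_probs[pos] is the distribution over the dictionary's words
    at mask position pos; dictionary maps word -> encoded value."""
    rng = rng or random.Random(0)
    words = list(dictionary)
    filled = list(vector)
    for pos in mask_positions:
        word = rng.choices(words, weights=word_probs[pos])[0]
        filled[pos] = dictionary[word]
    return filled
```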
  • with the method for obtaining enhanced training samples corresponding to unlabeled training samples provided by the embodiments of the present application, some words in the input unlabeled training samples are randomly masked through mask processing, and the masked words are predicted from their context based on the IDs in the preset vocabulary; the enhanced training text obtained with this model incorporates contextual information, is highly interpretable, and can provide guidance on the types of data to label in the future.
  • the embodiment of the present application further provides an embodiment of an apparatus for implementing the foregoing method embodiment.
  • FIG. 7 is a schematic structural diagram of an apparatus for training a text classification model according to an embodiment of the present application.
  • the training device 30 of the text classification model includes an acquisition module 301 and a training module 302, wherein:
  • the obtaining module 301 is used to obtain a training sample set, the training sample set includes N labeled training samples and M unlabeled training samples, each labeled training sample includes text information and a category label of the text information, and each unlabeled training sample includes text information;
  • the training module 302 is used to perform alternating iterative training on the initial text classification model and the initial text enhancement model according to the training sample set and the M enhanced training samples to obtain the target text classification model; wherein, during the i-th alternating iteration, the M enhanced training samples are generated by performing text enhancement on the M unlabeled training samples using the text enhancement model obtained in the (i-1)-th alternating iteration, where i is an integer greater than 1.
  • The apparatus for training a text classification model performs alternating iterative training on the initial text classification model and the initial text enhancement model according to the training sample set and the M enhanced training samples; that is, the text enhancement model can be trained at the same time as the text classification model, and the data enhancement strategy is trained inductively according to the classification performance of the text classification model, so that the training objective of the text classification model is consistent with that of the text enhancement model, which greatly improves the performance of the final text classification model.
  • Moreover, in each alternating iteration, the M enhanced training samples are generated by performing text enhancement on the M unlabeled training samples using the text enhancement model obtained in the previous alternating iteration; text enhancement aims to expand or modify the unlabeled training samples to obtain enhanced training samples similar to real data. Compared with the enhanced samples obtained by VAT in the prior art, the enhanced training samples obtained through text enhancement are highly interpretable and can provide guidance on the types of data to label in the future.
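As a rough illustration of the alternating scheme just described, the following Python sketch alternates between updating a classifier and an enhancer. `StubClassifier`, `StubEnhancer` and their `step`/`augment` methods are invented stand-ins for this sketch, not interfaces defined by the application.

```python
class StubEnhancer:
    """Hypothetical stand-in for the text enhancement model."""
    def augment(self, text):
        return text + " [augmented]"   # toy augmentation: echo with a marker
    def step(self, loss):
        self.last_loss = loss          # pretend parameter update

class StubClassifier:
    """Hypothetical stand-in for the text classification model."""
    def step(self, labeled, unlabeled, augmented):
        # pretend update; return a toy loss shrinking with batch size
        return 1.0 / (len(labeled) + len(unlabeled) + len(augmented))

def alternate_training(classifier, enhancer, labeled, unlabeled, n_rounds=3):
    """Each round: the current enhancer augments the unlabeled samples,
    the classifier is updated on all data, then the enhancer is updated
    using the same loss (the alternating scheme described above)."""
    for _ in range(n_rounds):
        augmented = [enhancer.augment(x) for x in unlabeled]
        loss = classifier.step(labeled, unlabeled, augmented)
        enhancer.step(loss)
    return classifier, enhancer
```

A real implementation would back-propagate the combined supervised/unsupervised loss through BERT- and CBERT-style models; this sketch only shows the control flow.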
  • Optionally, the alternating iterative training is performed multiple times, and each alternating iteration includes k sub-iterations;
  • the training module 302 being configured to perform alternating iterative training on the initial text classification model and the initial text enhancement model according to the training sample set and the M enhanced training samples to obtain the target text classification model includes:
  • for the j-th sub-iteration of each alternating iteration, extracting labeled training samples and unlabeled training samples from the training sample set according to a preset ratio, wherein 1 < j ≤ k;
  • taking the enhanced training samples, the extracted labeled training samples and the unlabeled training samples as input, and training the text classification model obtained in the (j-1)-th sub-iteration and the text enhancement model obtained in the (j-1)-th sub-iteration to obtain the text classification model of the j-th sub-iteration and the text enhancement model of the j-th sub-iteration;
  • the training module 302 being configured to take the enhanced training samples, the extracted labeled training samples and the unlabeled training samples as input, and train the text classification model obtained in the (j-1)-th sub-iteration and the text enhancement model obtained in the (j-1)-th sub-iteration to obtain the text classification model and the text enhancement model of the j-th sub-iteration, includes:
  • determining a first loss function value based on the text classification model obtained in the (j-1)-th sub-iteration;
  • the training module 302 being configured to determine the first loss function value based on the text classification model obtained in the (j-1)-th sub-iteration includes:
  • taking the text information in the labeled training samples as features and the corresponding category labels as labels, and obtaining a second loss function value based on the text classification model obtained in the (j-1)-th sub-iteration;
  • taking the unlabeled training samples and the enhanced training samples corresponding to the unlabeled training samples as input, and obtaining a third loss function value based on the text classification model obtained in the (j-1)-th sub-iteration;
  • determining the first loss function value according to the second loss function value and the third loss function value.
  • the training module 302 being configured to process the unlabeled training samples according to the text enhancement model obtained in the (j-1)-th sub-iteration to obtain the enhanced training samples corresponding to the unlabeled training samples includes: performing word segmentation on the unlabeled training samples to obtain a first text sequence; encoding the first text sequence based on a preset dictionary to generate a first vector; masking the code values in the first vector based on a preset probability to generate a second vector; and inputting the second vector into the text enhancement model obtained in the (j-1)-th sub-iteration;
  • the training module 302 being configured to input the second vector into the text enhancement model obtained in the (j-1)-th sub-iteration to obtain the enhanced training samples corresponding to the unlabeled training samples includes: obtaining the word probability distribution at each mask position of the second vector; determining the word corresponding to each mask position based on multinomial distribution sampling; and determining the enhanced training samples corresponding to the second vector according to the second vector and the words corresponding to the mask positions.
  • The condition for ending the alternating iterative training includes at least one of the following: the number of alternating iterations equals n, or the output result of the target text classification model converges; wherein n ≥ i.
  • the apparatus for training a text classification model provided by the embodiment shown in FIG. 7 can be used to execute the technical solutions in the foregoing method embodiments, and its implementation principles and technical effects are similar, and details are not described herein again in this embodiment.
  • FIG. 8 is a schematic diagram of a training device for a text classification model provided by an embodiment of the present application.
  • the training device 40 of the text classification model includes: at least one processor 401 , a memory 402 , and a computer program stored in the memory 402 and executable on the processor 401 .
  • the training device of the text classification model further includes a communication part 403 , wherein the processor 401 , the memory 402 and the communication part 403 are connected through a bus 404 .
  • the processor 401 executes the computer program, it implements the steps in each of the above embodiments of the training method for the text classification model, for example, steps S10 to S20 in the embodiment shown in FIG. 1 .
  • the processor 401 executes the computer program, the functions of the modules/units in the foregoing device embodiments are implemented, for example, the functions of the modules 301 to 302 shown in FIG. 7 .
  • the computer program may be divided into one or more modules/units, and the one or more modules/units are stored in the memory 402 and executed by the processor 401 to complete the present application.
  • One or more modules/units may be a series of computer program instruction segments capable of accomplishing specific functions, and the instruction segments are used to describe the execution process of the computer program in the training device 40 of the text classification model.
  • FIG. 8 is only an example of a training device for a text classification model, and does not constitute a limitation on the training device for a text classification model, and may include more or less components than those shown in the figure, or combine some components, or different components such as input and output devices, network access devices, buses, etc.
  • the training device of the text classification model in this embodiment of the present application may be a terminal device, a server, or the like, which is not specifically limited herein.
  • the so-called processor 401 may be a Central Processing Unit (CPU), or another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the memory 402 can be an internal storage unit of the training device of the text classification model, or an external storage device of the training device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card (Flash Card), etc.
  • the memory 402 is used to store the computer program and other programs and data required by the training device of the text classification model.
  • the memory 402 may also be used to temporarily store data that has been or will be output.
  • the bus can be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like.
  • the bus can be divided into address bus, data bus, control bus and so on.
  • the buses in the drawings of the present application are not limited to only one bus or one type of bus.
  • Embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the steps in the foregoing method embodiments can be implemented.
  • the embodiments of the present application provide a computer program product, when the computer program product runs on a training device for a text classification model, the training device for a text classification model implements the steps in the above method embodiments.
  • the integrated unit if implemented as a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium.
  • all or part of the processes in the methods of the above embodiments can be implemented by a computer program to instruct the relevant hardware.
  • the computer program can be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the foregoing method embodiments may be implemented.
  • the computer program includes computer program code, and the computer program code may be in the form of source code, object code, executable file or some intermediate forms, and the like.
  • the computer-readable medium may include at least: any entity or apparatus capable of carrying the computer program code to the photographing apparatus/terminal device, a recording medium, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), electrical carrier signals, telecommunication signals, and software distribution media, for example a USB flash drive, a removable hard disk, a magnetic disk, or an optical disc.
  • In some jurisdictions, according to legislation and patent practice, computer-readable media may not include electrical carrier signals and telecommunication signals.
  • the disclosed apparatus/network device and method may be implemented in other manners.
  • the apparatus/network device embodiments described above are only illustrative.
  • the division of modules or units is only a logical function division.
  • the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, indirect coupling or communication connection of devices or units, and may be in electrical, mechanical or other forms.
  • Units described as separate components may or may not be physically separated, and components shown as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application is applicable to the technical field of artificial intelligence and provides a method for training a text classification model. The method acquires a training sample set comprising N labeled training samples and M unlabeled training samples, where each labeled training sample includes text information and a category label of the text information, each unlabeled training sample includes text information, and M and N are both integers greater than 1; the initial text classification model and the initial text enhancement model are trained by alternating iterations according to the training sample set and M enhanced training samples to obtain a target text classification model, wherein, during the i-th alternating iteration, the M enhanced training samples are generated by performing text enhancement on the M unlabeled training samples using the text enhancement model obtained in the (i-1)-th alternating iteration, and i is an integer greater than 1.

Description

Method, Apparatus, Device and Storage Medium for Training a Text Classification Model
This application claims priority to Chinese Patent Application No. 202011038589.6, entitled "Method, Apparatus, Device and Storage Medium for Training a Text Classification Model" and filed with the China Patent Office on September 28, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
The present application belongs to the technical field of artificial intelligence, and in particular relates to a method, apparatus, device and storage medium for training a text classification model.
Background
As an important task in natural language processing, text classification is being studied and applied ever more widely. In real-world scenarios, text classification based on a text classification model usually faces the problem of abundant data but scarce labels (low resources). In a low-resource application scenario where labeled data is scarce, semi-supervised training methods can use a very small amount of labeled corpus together with a large amount of unlabeled data to obtain a high-performance text classification model.
At present, semi-supervised training usually adopts Virtual Adversarial Training (VAT), which generalizes the model by introducing noise vectors (local perturbations) into the data to be labeled. However, the inventors found that, because noise vectors are poorly interpretable, VAT cannot clearly indicate the types of the data to be labeled and cannot help provide guidance on the types of data to label in the future when data is scarce; moreover, when the amount of labeled data is small, the model is more sensitive to noise, resulting in poor classification performance of the text classification model.
Technical Problem
One of the objectives of the embodiments of the present application is to provide a method, apparatus, device and storage medium for training a text classification model, so as to solve the technical problem in the prior art that the classification performance of text classification models is poor.
Technical Solution
To solve the above technical problem, the technical solutions adopted in the embodiments of the present application are as follows:
A first aspect of the embodiments of the present application provides a method for training a text classification model, comprising:
acquiring a training sample set, the training sample set comprising N labeled training samples and M unlabeled training samples, each labeled training sample comprising text information and a category label of the text information, and each unlabeled training sample comprising text information, M and N both being integers greater than 1;
performing alternating iterative training on an initial text classification model and an initial text enhancement model according to the training sample set and M enhanced training samples to obtain a target text classification model; wherein, during the i-th alternating iteration, the M enhanced training samples are generated by performing text enhancement on the M unlabeled training samples using the text enhancement model obtained in the (i-1)-th alternating iteration, i being an integer greater than 1.
A second aspect of the embodiments of the present application provides an apparatus for training a text classification model, the apparatus comprising:
an acquisition module, configured to acquire a training sample set, the training sample set comprising N labeled training samples and M unlabeled training samples, each labeled training sample comprising text information and a category label of the text information, and each unlabeled training sample comprising text information, wherein M and N are both integers greater than 1;
a training module, configured to perform alternating iterative training on an initial text classification model and an initial text enhancement model according to the training sample set and M enhanced training samples to obtain a target text classification model; wherein, during the i-th alternating iteration, the M enhanced training samples are generated by performing text enhancement on the M unlabeled training samples using the text enhancement model obtained in the (i-1)-th alternating iteration, i being an integer greater than 1.
A third aspect of the embodiments of the present application provides a device for training a text classification model, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer program:
acquiring a training sample set, the training sample set comprising N labeled training samples and M unlabeled training samples, each labeled training sample comprising text information and a category label of the text information, and each unlabeled training sample comprising text information, M and N both being integers greater than 1;
performing alternating iterative training on an initial text classification model and an initial text enhancement model according to the training sample set and M enhanced training samples to obtain a target text classification model; wherein, during the i-th alternating iteration, the M enhanced training samples are generated by performing text enhancement on the M unlabeled training samples using the text enhancement model obtained in the (i-1)-th alternating iteration, i being an integer greater than 1.
A fourth aspect of the embodiments of the present application provides a computer-readable storage medium storing a computer program, wherein the computer program implements the following steps when executed by a processor:
acquiring a training sample set, the training sample set comprising N labeled training samples and M unlabeled training samples, each labeled training sample comprising text information and a category label of the text information, and each unlabeled training sample comprising text information, M and N both being integers greater than 1;
performing alternating iterative training on an initial text classification model and an initial text enhancement model according to the training sample set and M enhanced training samples to obtain a target text classification model; wherein, during the i-th alternating iteration, the M enhanced training samples are generated by performing text enhancement on the M unlabeled training samples using the text enhancement model obtained in the (i-1)-th alternating iteration, i being an integer greater than 1.
Beneficial Effects
The beneficial effects of the present application are as follows:
In the technical solution proposed in the embodiments of the present application, on the one hand, the initial text classification model and the initial text enhancement model are trained by alternating iterations according to the training sample set and the M enhanced training samples; that is, the text enhancement model can be trained at the same time as the text classification model, and the data enhancement strategy is trained inductively according to the classification performance of the text classification model, so that the training objective of the text classification model is consistent with that of the text enhancement model, which greatly improves the performance of the finally obtained text classification model. On the other hand, in each alternating iteration, the M enhanced training samples are generated by performing text enhancement on the M unlabeled training samples using the text enhancement model obtained in the previous alternating iteration; the goal of text enhancement is to expand or modify the unlabeled training samples to obtain enhanced training samples similar to real data. Compared with the enhanced samples obtained by VAT in the prior art, the enhanced training samples obtained through text enhancement are highly interpretable and can provide guidance on the types of data to label in the future.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of a method for training a text classification model provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of one alternating iteration of training provided by an embodiment of the present application;
FIG. 3 is a schematic flowchart of obtaining the text classification model and the text enhancement model of the j-th sub-iteration provided by an embodiment of the present application;
FIG. 4 is a schematic flowchart of the sub-iteration training provided by an embodiment of the present application;
FIG. 5 is a schematic flowchart of determining the first loss function value provided by an embodiment of the present application;
FIG. 6 is a schematic flowchart of obtaining the enhanced training samples corresponding to the unlabeled training samples provided by an embodiment of the present application;
FIG. 7 is a schematic structural diagram of an apparatus for training a text classification model provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of the hardware composition of a device for training a text classification model provided by an embodiment of the present application.
Embodiments of the Present Application
In the following description, specific details such as particular system structures and techniques are set forth for the purpose of illustration rather than limitation, so as to provide a thorough understanding of the embodiments of the present application. However, it should be clear to those skilled in the art that the present application can also be implemented in other embodiments without these specific details. In other cases, detailed descriptions of well-known systems, apparatuses, circuits and methods are omitted so that unnecessary details do not obscure the description of the present application.
Reference in this specification to "one embodiment" or "some embodiments" and the like means that a particular feature, structure or characteristic described in connection with that embodiment is included in one or more embodiments of the present application. Thus, the phrases "in one embodiment", "in some embodiments", "in some other embodiments", "in still other embodiments" and the like appearing in different places in this specification do not necessarily all refer to the same embodiment, but mean "one or more but not all embodiments" unless otherwise specifically emphasized. The terms "comprise", "include", "have" and their variants all mean "including but not limited to" unless otherwise specifically emphasized.
The technical solutions of the present application, and how they solve the above technical problems, are exemplified below with specific embodiments. It is worth noting that the specific embodiments listed below may be combined with each other, and the same or similar concepts or processes may not be repeated in some embodiments.
FIG. 1 is a schematic flowchart of a method for training a text classification model provided by an embodiment of the present application. As shown in FIG. 1, the method includes:
S10: Acquire a training sample set, the training sample set including N labeled training samples and M unlabeled training samples, each labeled training sample including text information and a category label of the text information, and each unlabeled training sample including text information.
In this embodiment, the labeled training samples represent labeled corpus and the unlabeled training samples represent unlabeled corpus. To match low-resource application scenarios in which labeled corpus is scarce in practice, the number M of unlabeled training samples in the training sample set is far larger than the number N of labeled training samples, where M and N are both integers greater than 1.
In this embodiment, the text information may refer to a text sequence to be classified, and the category label may be the category of the content expressed by the text sequence to be classified.
In this embodiment, the category labels may be determined according to the application field of the text classification model.
For example, if the text classification model is used in the field of financial sentiment classification, the category label may indicate the sentiment tendency of the content expressed by the text sequence to be classified, where the sentiment tendency may be any one of positive news, neutral news and negative news.
In this embodiment, the training sample set may be acquired according to the application field of the text classification model, so as to make the text classification model more targeted.
In this embodiment, a validation sample set may also be acquired, the validation sample set including P validation samples, each validation sample including validation text information and a category label of that validation text information.
S20: Perform alternating iterative training on the initial text classification model and the initial text enhancement model according to the training sample set and M enhanced training samples to obtain a target text classification model; wherein, during the i-th alternating iteration, the M enhanced training samples are generated by performing text enhancement on the M unlabeled training samples using the text enhancement model obtained in the (i-1)-th alternating iteration, and i is an integer greater than 1.
The purpose of this embodiment is to train the initial text classification model and the initial text enhancement model alternately so that their training objectives become consistent, thereby improving the classification performance of the text classification model.
Here, the output of the initial text enhancement model is the input of the initial text classification model. Consistent training objectives may mean that the output of the trained text enhancement model matches the input expected by the trained text classification model, so that the trained text classification model classifies unlabeled text information better.
In this embodiment, the initial text classification model may act as a classifier: given an input sample containing text information, the initial text classification model classifies the sample and determines its category label; a loss function value can also be obtained for the sample, so that the model parameters of the initial text classification model can be optimized according to the loss function value.
The initial text enhancement model may act as a sample generator: it expands or modifies text information without category labels to obtain enhanced samples similar to real data.
Both the initial text classification model and the initial text enhancement model may be open-source language models, which are not specifically limited here.
Illustratively, the initial text classification model is a BERT model and the initial text enhancement model is a CBERT model.
In this embodiment, performing alternating iterative training on the initial text classification model and the initial text enhancement model may mean that, in one iteration of training, the model parameters of the current text classification model and those of the current text enhancement model are updated in turn.
For example, the parameters of the current text enhancement model are first kept fixed while the parameters of the current text classification model are updated, obtaining an updated text classification model. Then the parameters of the updated text classification model are kept fixed while the parameters of the current text enhancement model are updated, obtaining an updated text enhancement model. In the next iteration of training, the process proceeds from the updated text classification model and the updated text enhancement model. Repeating the above process realizes the alternating iterative training of the text classification model and the text enhancement model.
In this embodiment, in the first alternating iteration, M enhanced training samples are generated by performing text enhancement on the M unlabeled training samples using the initial text enhancement model. Then, one round of alternating iterative training is performed on the initial text classification model and the initial text enhancement model according to the training sample set and the above M enhanced training samples, obtaining the text classification model and the text enhancement model of the first alternating iteration.
In the second alternating iteration, M enhanced training samples are generated by performing text enhancement on the M unlabeled training samples using the text enhancement model obtained in the first alternating iteration. Then, according to the training sample set and these M enhanced training samples, one round of alternating iterative training is performed on the text classification model and the text enhancement model obtained in the first alternating iteration, obtaining the text classification model and the text enhancement model of the second alternating iteration.
In the i-th alternating iteration, M enhanced training samples are generated by performing text enhancement on the M unlabeled training samples using the text enhancement model obtained in the (i-1)-th alternating iteration. Then, according to the training sample set and these M enhanced training samples, one round of alternating iterative training is performed on the text classification model and the text enhancement model obtained in the (i-1)-th alternating iteration, obtaining the text classification model and the text enhancement model of the i-th alternating iteration.
The above alternating iterative training is performed until a preset end condition for the alternating iterative training is satisfied, and the target text classification model is obtained.
It should be understood that the M unlabeled training samples may differ between alternating iterations.
In this embodiment, the condition for ending the alternating iterative training may include: the number of alternating iterations equals n, where n ≥ i.
Correspondingly, the target text classification model generated after the alternating iterative training is the text classification model obtained after the n-th alternating iteration.
The condition for ending the alternating iterative training may also include that, after the latest alternating iteration, the output result of the target text classification model converges.
Correspondingly, the target text classification model generated after the alternating iterative training is the text classification model obtained after the latest alternating iteration.
Whether the output of the text classification model has converged may be judged based on the validation set acquired in S10, which includes P validation samples, each comprising validation text information and a category label of that validation text information.
Specifically, after the i-th alternating iteration is completed, the validation text information of the P validation samples is used as features and the category labels of the validation text information as labels, and the i-th loss function value is obtained from the text classification model of the i-th alternating iteration. Whether the current i-th loss function value has converged can then be judged from the loss function values obtained after each alternating iteration: if it has converged, the text classification model of the i-th alternating iteration is taken as the target text classification model; if not, the (i+1)-th alternating iteration is performed until the loss function of the current alternating iteration converges.
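The convergence test on the validation loss described above might be sketched as follows. The patience and tolerance parameters are illustrative assumptions of this sketch, not values specified by the application.

```python
def has_converged(val_losses, patience=2, tol=1e-3):
    """Declare convergence when the validation loss has improved by less
    than `tol` for `patience` consecutive alternating iterations."""
    if len(val_losses) <= patience:
        return False
    recent = val_losses[-(patience + 1):]
    # every consecutive improvement in the window must be below tol
    return all(prev - cur < tol for prev, cur in zip(recent, recent[1:]))
```

After each alternating iteration, the latest validation loss would be appended and `has_converged` decides whether to stop or run the (i+1)-th iteration.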
According to the method for training a text classification model provided by the embodiments of the present application, on the one hand, the initial text classification model and the initial text enhancement model are trained by alternating iterations according to the training sample set and the M enhanced training samples; that is, the text enhancement model can be trained at the same time as the text classification model, and the data enhancement strategy is trained inductively according to the classification performance of the text classification model, so that the training objective of the text classification model is consistent with that of the text enhancement model, which greatly improves the performance of the finally obtained text classification model. On the other hand, in each alternating iteration, the M enhanced training samples are generated by performing text enhancement on the M unlabeled training samples using the text enhancement model obtained in the previous alternating iteration; the goal of text enhancement is to expand or modify the unlabeled training samples to obtain enhanced training samples similar to real data. Compared with the enhanced samples obtained by VAT in the prior art, the enhanced training samples obtained through text enhancement are highly interpretable and can provide guidance on the types of data to label in the future.
As can be seen from the embodiment of FIG. 1, the alternating iterative training that yields the target text classification model is performed multiple times, and each alternating iteration includes k sub-iterations. Each alternating iteration is processed in the same way, as is each sub-iteration; one alternating iteration is illustrated below with the embodiment of FIG. 2.
FIG. 2 is a schematic flowchart of one alternating iteration of training provided by an embodiment of the present application, describing a possible implementation of one alternating iteration in step S20 of the embodiment of FIG. 1. As shown in FIG. 2, performing alternating iterative training on the initial text classification model and the initial text enhancement model according to the training sample set and the M enhanced training samples to obtain the target text classification model includes:
S21: For the j-th sub-iteration of each alternating iteration, extract labeled training samples and unlabeled training samples from the training sample set according to a preset ratio, where 1 < j ≤ k.
In this embodiment, in each alternating iteration, the training samples in the training sample set are divided into multiple batches, and the above two models are trained batch by batch.
Correspondingly, each alternating iteration includes multiple sub-iterations (corresponding to the multiple batches), each handled in the same way. After all training samples in the training sample set have been iterated once, the current alternating iteration is completed and the text classification model of this alternating iteration is obtained.
The sub-iterations contained in each alternating iteration may be the same.
The purpose of this step is to obtain one batch of training samples.
The preset ratio may be set by the user; for example, the ratio of labeled training samples to unlabeled training samples is 1:3.
Illustratively, with j = 2, in the second sub-iteration, labeled and unlabeled training samples are extracted from the training sample set at a ratio of 1:3, obtaining S labeled training samples and 3S unlabeled training samples; these S labeled training samples and 3S unlabeled training samples constitute one batch of training data.
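The batch extraction at a preset ratio can be sketched as below. The 1:3 ratio and the batch size are the illustrative values from the example above, and drawing without replacement via `random.sample` is one possible way to extract; the application does not prescribe a particular sampling routine.

```python
import random

def draw_batch(labeled, unlabeled, n_labeled=4, ratio=3, seed=0):
    """Draw one sub-iteration batch: n_labeled labeled samples plus
    ratio * n_labeled unlabeled samples (e.g. a 1:3 preset ratio)."""
    rng = random.Random(seed)
    return (rng.sample(labeled, n_labeled),
            rng.sample(unlabeled, ratio * n_labeled))
```

Iterating such batches until every training sample has been seen once completes one alternating iteration.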
S22: Process the extracted unlabeled training samples with the text enhancement model obtained in the (j-1)-th sub-iteration to obtain the enhanced training samples corresponding to the unlabeled training samples.
In this embodiment, the unlabeled training samples are text-enhanced with the text enhancement model obtained in the previous ((j-1)-th) sub-iteration, generating the enhanced training sample corresponding to each of these unlabeled training samples in the j-th sub-iteration.
The unlabeled training samples here refer to the 3S unlabeled training samples extracted in step S21.
It can be understood that the enhanced training samples correspond one-to-one to the extracted unlabeled training samples.
S23: Take the enhanced training samples, the extracted labeled training samples and the unlabeled training samples as input, and train the text classification model obtained in the (j-1)-th sub-iteration and the text enhancement model obtained in the (j-1)-th sub-iteration, obtaining the text classification model of the j-th sub-iteration and the text enhancement model of the j-th sub-iteration.
S24: Return to the step of extracting labeled and unlabeled training samples from the training sample set according to the preset ratio, until the N labeled training samples and the M unlabeled training samples in the training sample set have all been iterated once, and obtain the text classification model after the current alternating iteration.
In this embodiment, after obtaining the text classification model and the text enhancement model of the j-th sub-iteration, it is judged whether the N labeled training samples and the M unlabeled training samples in the training sample set have all been iterated once.
If so, the text classification model and the text enhancement model of the j-th sub-iteration are determined as the text classification model and the text enhancement model obtained by the current alternating iteration.
If not, the (j+1)-th sub-iteration is entered and steps S21 to S23 above are executed again.
The training samples then extracted in step S21 differ from those extracted in the j-th sub-iteration.
The above sub-iteration training is performed until the N labeled training samples and the M unlabeled training samples in the training sample set have all been iterated once, and the text classification model of the current alternating iteration is obtained.
In the alternating iterative training provided by the embodiments of the present application, the training samples in the training sample set are divided into multiple batches, and the two models are trained batch by batch. Each batch contains both labeled and unlabeled training samples, and the samples in a batch jointly determine the direction of the current gradient step, so the gradient is less likely to drift and randomness is reduced; moreover, the amount of sample data in each batch is much smaller than that of the whole training sample set, so the computation of each iteration of training is greatly reduced.
FIG. 3 is a schematic flowchart of obtaining the text classification model and the text enhancement model of the j-th sub-iteration provided by an embodiment of the present application, describing a possible implementation of S23 in the embodiment of FIG. 2. As shown in FIG. 3, taking the enhanced training samples, the extracted labeled training samples and the unlabeled training samples as input, and training the text classification model obtained in the (j-1)-th sub-iteration and the text enhancement model obtained in the (j-1)-th sub-iteration to obtain the text classification model of the j-th sub-iteration and the text enhancement model of the j-th sub-iteration, includes:
S231: Take the enhanced training samples, the extracted labeled training samples and the unlabeled training samples as input, and determine a first loss function value based on the text classification model obtained in the (j-1)-th sub-iteration.
In this embodiment, the first loss function value includes a supervised loss function value and an unsupervised loss function value, where the supervised loss function value is generated from the labeled training samples and the unsupervised loss function value is generated from the unlabeled training samples and their corresponding enhanced training samples.
Illustratively, please also refer to FIG. 4, a schematic flowchart of the sub-iteration training provided by an embodiment of the present application. As shown in FIG. 4, the input of the text classification model includes the labeled training samples, the unlabeled training samples and the enhanced training samples produced from the unlabeled training samples by the text enhancement model, and the output of the text classification model includes a supervised loss and an unsupervised loss, where the supervised loss is generated from the labeled training samples and the unsupervised loss is generated from the unlabeled training samples and the corresponding enhanced training samples.
The input of the text enhancement model is the unlabeled training samples, and its output is the enhanced training samples corresponding to the unlabeled training samples.
As shown in FIG. 4, in each sub-iteration, the extracted labeled training samples and unlabeled training samples are taken as input, and a supervised loss and an unsupervised loss are finally obtained; their function values together constitute the first loss function value.
S232: Keep the parameters of the text enhancement model obtained in the (j-1)-th sub-iteration fixed, and update the parameters of the text classification model obtained in the (j-1)-th sub-iteration according to the first loss function value, obtaining the text classification model of the j-th sub-iteration.
S233: Keep the parameters of the text classification model obtained in the j-th sub-iteration fixed, and update the parameters of the text enhancement model obtained in the (j-1)-th sub-iteration according to the first loss function value, obtaining the text enhancement model of the j-th sub-iteration.
In this embodiment, the parameters of the text classification model obtained in the (j-1)-th sub-iteration and then the parameters of the text enhancement model obtained in the (j-1)-th sub-iteration are updated in turn by back-propagation.
FIG. 5 is a schematic flowchart of determining the first loss function value provided by an embodiment of the present application, describing a possible implementation of S231 in the embodiment of FIG. 3. As shown in FIG. 5, determining the first loss function value based on the text classification model obtained in the (j-1)-th sub-iteration includes:
S2311: Take the text information in the labeled training samples as features and the category labels corresponding to the text information as labels, and obtain a second loss function value based on the text classification model obtained in the (j-1)-th sub-iteration.
In this embodiment, the second loss function value may refer to the value of a cross-entropy function.
The cross-entropy function may be written as:
L₁ = -(1/M) · Σ_{m=1}^{M} y_m · log(p_m)    (1)
where L₁ is the cross-entropy function value, M is the number of labeled training samples, y_m is the category label of the m-th labeled training sample, p_m is the probability distribution of the m-th labeled training sample, and m is an integer with 1 ≤ m ≤ M.
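A direct reading of equation (1) can be sketched as follows, under the assumption (not stated explicitly in the application) that y_m selects the true class and p_m is the probability the classifier assigns to that class:

```python
import math

def supervised_loss(true_labels, prob_dists):
    """Cross-entropy of equation (1): average negative log-probability
    assigned by the classifier to each labeled sample's true class."""
    M = len(true_labels)
    return -sum(math.log(prob_dists[m][true_labels[m]]) for m in range(M)) / M
```

A perfectly confident, correct classifier yields a loss of 0; a uniform two-class prediction yields log 2.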
S2312: Take the unlabeled training samples and the enhanced training samples corresponding to the unlabeled training samples as input, and obtain a third loss function value based on the text classification model obtained in the (j-1)-th sub-iteration.
In this embodiment, the third loss function value characterizes how close the probability distribution of an unlabeled training sample is to the probability distribution of its enhanced training sample.
For example, the third loss function value may be the KL divergence, which is used to compare the closeness of two probability distributions.
Illustratively, in this embodiment, the third loss function value may be computed according to equation (2):
D_KL(p‖q) = Σ_{n=1}^{N} p(x_n) · log(p(x_n)/q(x_n))    (2)
where D_KL(p‖q) is the KL divergence value, N is the number of unlabeled training samples, x_n is the n-th unlabeled training sample, p(x_n) is the probability distribution of the n-th unlabeled training sample, q(x_n) is the probability distribution of the enhanced training sample corresponding to the n-th unlabeled training sample, and n is an integer with 1 ≤ n ≤ N.
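Equation (2) can be sketched as follows. Averaging over the N samples and the small epsilon for numerical stability are assumptions of this sketch rather than details fixed by the application.

```python
import math

def consistency_loss(p_dists, q_dists, eps=1e-12):
    """KL divergence of equation (2): how close the class distribution of
    each unlabeled sample (p) is to that of its augmented counterpart (q),
    averaged over the samples."""
    total = 0.0
    for p, q in zip(p_dists, q_dists):
        total += sum(pi * math.log((pi + eps) / (qi + eps))
                     for pi, qi in zip(p, q))
    return total / len(p_dists)
```

Identical distributions give a loss of 0, so minimizing this term pushes the classifier to predict consistently on a sample and its augmentation.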
S2313: Determine the first loss function value according to the second loss function value and the third loss function value.
In this embodiment, the first loss function value includes the supervised loss function value generated from the labeled training samples and the unsupervised loss function value generated from the unlabeled training samples, where the supervised loss function value may refer to the second loss function value and the unsupervised loss function value may refer to the third loss function value.
For example, the first loss function value may be computed according to equation (3):
L = L₁ + r · D_KL(p‖q)    (3)
where L₁ is the cross-entropy function value in equation (1), D_KL(p‖q) is the KL divergence value in equation (2), and r is a hyperparameter.
FIG. 6 is a schematic flowchart of obtaining the enhanced training samples corresponding to the unlabeled training samples provided by an embodiment of the present application, describing a possible implementation of S22 in the embodiment of FIG. 2. As shown in FIG. 6, processing the unlabeled training samples with the text enhancement model obtained in the (j-1)-th sub-iteration to obtain the enhanced training samples corresponding to the unlabeled training samples includes:
S221: Perform word segmentation on an unlabeled training sample to obtain a first text sequence corresponding to the unlabeled training sample, the first text sequence including at least one word.
In this step, word segmentation may refer to splitting the continuous text sequence of the unlabeled training sample into individual words according to certain norms.
The word segmentation of the unlabeled training sample may be performed according to syntax and semantics.
For example, for the unlabeled training sample "我喜欢打篮球,小明也喜欢" ("I like playing basketball, and Xiao Ming likes it too"), semantic segmentation generates the corresponding first text sequence {我, 喜欢, 打, 篮球, 小明, 也, 喜欢}.
It should be understood that the above word segmentation method is only an example; existing word segmentation tools may be used to segment the unlabeled training samples, which is not limited here.
S222: Encode the first text sequence based on a preset dictionary to generate a first vector corresponding to the first text sequence, the first vector including multiple code values.
In this step, the preset dictionary may contain all words in a standard modern Chinese corpus, object-oriented domain keywords and technical terms; the preset dictionary may also include a numerical value for each of these words. It should be understood that the values corresponding to different words in the preset dictionary are generally distinct.
In this step, encoding the first text sequence based on the preset dictionary may mean mapping each word of the first text sequence to its corresponding value in the preset dictionary to obtain a target vector, adding a start marker before the start position of the target vector and an end marker after the end position, thereby generating the first vector corresponding to the first text sequence.
The start marker may be <CLS> and the end marker may be <SEP>.
To facilitate subsequent processing, the length of the first vector is a fixed value L, for example 128.
If the length of the target vector does not meet this requirement, invalid code values, for example 0, may be appended after the end of the target vector so that its length meets the requirement.
Illustratively, for the first text sequence {我, 喜欢, 打, 篮球, 小明, 也, 喜欢}:
the corresponding first vector may be [CLS, 1, 2, 3, 4, 5, 6, 2, 0, 0, 0, ..., SEP], where the numbers in the first vector are the code values of the words in the first text sequence (the repeated word 喜欢 maps to the same code value 2), and the length of the first vector is 128.
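The encoding step can be sketched as follows. The numeric IDs used for the <CLS>/<SEP> markers (101/102 here) and the padding value 0 are illustrative assumptions; the application only requires distinct markers, a per-word dictionary value, and padding to the fixed length L.

```python
def encode(tokens, word2id, max_len=128, cls=101, sep=102, pad=0):
    """Map each token to its dictionary ID, wrap with CLS/SEP markers,
    and pad with an invalid ID to the fixed length L (128 here)."""
    ids = [cls] + [word2id[t] for t in tokens] + [sep]
    return ids + [pad] * (max_len - len(ids))
```

Repeated words naturally receive the same ID because the mapping is a dictionary lookup.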
S223: Mask the code values in the first vector based on a preset probability, generating a second vector corresponding to the first vector.
In this embodiment, the preset probability characterizes the ratio of the code values in the first vector selected for masking to all code values in the first vector. The preset probability may be set by the user and is not limited here.
For example, the preset probability may be 15%.
In this step, the second vector is obtained by masking some of the code values in the first vector, so the second vector has multiple mask positions.
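The masking step can be sketched as below. The mask ID 103 and the choice to leave marker and padding positions untouched are illustrative assumptions of this sketch, not requirements stated by the application.

```python
import random

def mask_vector(vec, mask_id=103, prob=0.15, pad=0, specials=(101, 102), seed=0):
    """Replace roughly `prob` of the ordinary code values with a mask ID,
    leaving marker (CLS/SEP) and padding values untouched; return the
    second vector together with the masked positions."""
    rng = random.Random(seed)
    out, positions = list(vec), []
    for i, v in enumerate(vec):
        if v not in specials and v != pad and rng.random() < prob:
            out[i] = mask_id
            positions.append(i)
    return out, positions
```

The returned positions are exactly the mask positions whose words the text enhancement model must later predict.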
S224: Input the second vector into the text enhancement model obtained in the (j-1)-th sub-iteration to obtain the enhanced training sample corresponding to the unlabeled training sample.
In this embodiment, obtaining the enhanced training sample corresponding to the unlabeled training sample may include the following steps:
Step 1: Input the second vector into the text enhancement model obtained in the (j-1)-th sub-iteration, and obtain the word probability distribution at each mask position of the second vector.
The probability distribution at a mask position may refer to the probability distribution of all words in the preset dictionary appearing at that mask position.
For example, the second vector may be Y and contain x mask positions; then, for each mask position, the probability distribution at that mask position may refer to the probability distribution of all words in the preset dictionary appearing at that position.
Illustratively, the preset dictionary contains k words A1, A2, ..., Ak, whose probability distribution at a mask position is p1, p2, ..., pk, where pi characterizes the probability that Ai appears and i is a value with 1 ≤ i ≤ k.
Step 2: Determine the word corresponding to each mask position based on multinomial distribution sampling.
In this step, the multinomial distribution is an extension of the binomial distribution.
Illustratively, suppose a random trial has k possible outcomes A1, A2, ..., Ak, the number of occurrences of each outcome is the random variable X1, X2, ..., Xk, and the probability of each outcome is P1, P2, ..., Pk. Then, over Q independent repeated trials, the probability that A1 occurs n1 times, A2 occurs n2 times, ..., and Ak occurs nk times follows the multinomial distribution, specifically equation (4):
P(X₁=n₁, X₂=n₂, ..., X_k=n_k) = (Q! / (n₁!·n₂!·...·n_k!)) · P₁^{n₁} · P₂^{n₂} · ... · P_k^{n_k}    (4)
where n₁ + n₂ + ... + n_k = Q,
and P(X₁=n₁, X₂=n₂, ..., X_k=n_k) denotes the probability that, in Q independent repeated trials, A1 occurs n1 times, A2 occurs n2 times, ..., and Ak occurs nk times.
In this step, each transformation of a mask position in the second vector corresponds to one outcome A, so multiple outcomes A can be obtained. From the probability distributions of the different mask positions in Step 1, the probability of occurrence of each outcome can be obtained; the multinomial probability value corresponding to each outcome can then be determined from these probabilities, the outcome corresponding to the maximum multinomial probability value is determined as the target outcome, and the word at each mask position is determined according to the target outcome.
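Equation (4) can be checked numerically with a short sketch that evaluates the multinomial probability mass for given outcome counts:

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """P(X1=n1, ..., Xk=nk) for Q = sum(counts) independent trials,
    per equation (4): (Q! / (n1!...nk!)) * P1^n1 * ... * Pk^nk."""
    Q = sum(counts)
    coef = factorial(Q)
    for n in counts:
        coef //= factorial(n)   # multinomial coefficient, exact integer
    return coef * prod(p ** n for p, n in zip(probs, counts))
```

For two outcomes this reduces to the binomial pmf, consistent with the multinomial distribution being an extension of the binomial distribution.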
Step 3: Determine the enhanced training sample corresponding to the second vector according to the second vector and the word corresponding to each mask position.
Specifically, the code values of the second vector other than the mask positions are mapped back to the corresponding words according to the preset dictionary, generating a second text sequence; each mask position in the second text sequence is then replaced with its corresponding word, generating the enhanced training text corresponding to the second vector.
In the method for obtaining the enhanced training samples corresponding to the unlabeled training samples provided by the embodiments of the present application, some words of the input unlabeled training sample are randomly masked through mask processing, and the IDs of those words in the preset vocabulary are predicted from their context. The enhanced training text obtained from this model therefore incorporates contextual information, is highly interpretable, and can provide guidance on the types of data to label in the future.
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
Based on the method for training a text classification model provided by the above embodiments, the embodiments of the present application further provide apparatus embodiments implementing the above method embodiments.
FIG. 7 is a schematic structural diagram of an apparatus for training a text classification model provided by an embodiment of the present application. As shown in FIG. 7, the apparatus 30 for training a text classification model includes an acquisition module 301 and a training module 302, wherein:
the acquisition module 301 is configured to acquire a training sample set, the training sample set including N labeled training samples and M unlabeled training samples, each labeled training sample including text information and a category label of the text information, and each unlabeled training sample including text information;
the training module 302 is configured to perform alternating iterative training on the initial text classification model and the initial text enhancement model according to the training sample set and M enhanced training samples to obtain the target text classification model; wherein, during the i-th alternating iteration, the M enhanced training samples are generated by performing text enhancement on the M unlabeled training samples using the text enhancement model obtained in the (i-1)-th alternating iteration, and i is an integer greater than 1.
According to the apparatus for training a text classification model provided by the embodiments of the present application, on the one hand, the initial text classification model and the initial text enhancement model are trained by alternating iterations according to the training sample set and the M enhanced training samples; that is, the text enhancement model can be trained at the same time as the text classification model, and the data enhancement strategy is trained inductively according to the classification performance of the text classification model, so that the training objective of the text classification model is consistent with that of the text enhancement model, which greatly improves the performance of the finally obtained text classification model. On the other hand, in each alternating iteration, the M enhanced training samples are generated by performing text enhancement on the M unlabeled training samples using the text enhancement model obtained in the previous alternating iteration; the goal of text enhancement is to expand or modify the unlabeled training samples to obtain enhanced training samples similar to real data. Compared with the enhanced samples obtained by VAT in the prior art, the enhanced training samples obtained through text enhancement are highly interpretable and can provide guidance on the types of data to label in the future.
Optionally, the alternating iterative training is performed multiple times, and each alternating iteration includes k sub-iterations;
correspondingly, the training module 302 being configured to perform alternating iterative training on the initial text classification model and the initial text enhancement model according to the training sample set and the M enhanced training samples to obtain the target text classification model includes:
for the j-th sub-iteration of each alternating iteration, extracting labeled training samples and unlabeled training samples from the training sample set according to a preset ratio, wherein 1 < j ≤ k;
processing the extracted unlabeled training samples with the text enhancement model obtained in the (j-1)-th sub-iteration to obtain the enhanced training samples corresponding to the unlabeled training samples;
taking the enhanced training samples, the extracted labeled training samples and the unlabeled training samples as input, and training the text classification model obtained in the (j-1)-th sub-iteration and the text enhancement model obtained in the (j-1)-th sub-iteration, obtaining the text classification model of the j-th sub-iteration and the text enhancement model of the j-th sub-iteration;
returning to the step of extracting labeled and unlabeled training samples from the training sample set according to the preset ratio, until the N labeled training samples and the M unlabeled training samples in the training sample set have all been iterated once, and obtaining the text classification model after the current alternating iteration.
Optionally, the training module 302 being configured to take the enhanced training samples, the extracted labeled training samples and the unlabeled training samples as input, and train the text classification model obtained in the (j-1)-th sub-iteration and the text enhancement model obtained in the (j-1)-th sub-iteration to obtain the text classification model and the text enhancement model of the j-th sub-iteration, includes:
taking the enhanced training samples, the extracted labeled training samples and the unlabeled training samples as input, and determining a first loss function value based on the text classification model obtained in the (j-1)-th sub-iteration;
keeping the parameters of the text enhancement model obtained in the (j-1)-th sub-iteration fixed, and updating the parameters of the text classification model obtained in the (j-1)-th sub-iteration according to the first loss function value, obtaining the text classification model of the j-th sub-iteration;
keeping the parameters of the text classification model obtained in the j-th sub-iteration fixed, and updating the parameters of the text enhancement model obtained in the (j-1)-th sub-iteration according to the first loss function value, obtaining the text enhancement model of the j-th sub-iteration.
Optionally, the training module 302 being configured to determine the first loss function value based on the text classification model obtained in the (j-1)-th sub-iteration includes:
taking the text information in the labeled training samples as features and the category labels corresponding to the text information as labels, and obtaining a second loss function value based on the text classification model obtained in the (j-1)-th sub-iteration;
taking the unlabeled training samples and the enhanced training samples corresponding to the unlabeled training samples as input, and obtaining a third loss function value based on the text classification model obtained in the (j-1)-th sub-iteration;
determining the first loss function value according to the second loss function value and the third loss function value.
Optionally, the training module 302 being configured to process the unlabeled training samples according to the text enhancement model obtained in the (j-1)-th sub-iteration to obtain the enhanced training samples corresponding to the unlabeled training samples includes:
performing word segmentation on the unlabeled training samples to obtain a first text sequence corresponding to the unlabeled training samples, the first text sequence including at least one word;
encoding the first text sequence based on a preset dictionary to generate a first vector corresponding to the first text sequence, the first vector including multiple code values;
masking the code values in the first vector based on a preset probability to generate a second vector corresponding to the first vector;
inputting the second vector into the text enhancement model obtained in the (j-1)-th sub-iteration to obtain the enhanced training samples corresponding to the unlabeled training samples.
Optionally, the training module 302 being configured to input the second vector into the text enhancement model obtained in the (j-1)-th sub-iteration to obtain the enhanced training samples corresponding to the unlabeled training samples includes:
inputting the second vector into the text enhancement model obtained in the (j-1)-th sub-iteration, and obtaining the word probability distribution at each mask position of the second vector;
determining the word corresponding to each mask position based on multinomial distribution sampling;
determining the enhanced training samples corresponding to the second vector according to the second vector and the words corresponding to the mask positions.
Optionally, the condition for ending the alternating iterative training includes at least one of the following: the number of alternating iterations equals n, or the output result of the target text classification model converges, wherein n ≥ i.
The apparatus for training a text classification model provided by the embodiment shown in FIG. 7 can be used to execute the technical solutions of the above method embodiments; its implementation principles and technical effects are similar and are not repeated here in this embodiment.
FIG. 8 is a schematic diagram of a device for training a text classification model provided by an embodiment of the present application. As shown in FIG. 8, the device 40 for training a text classification model includes: at least one processor 401, a memory 402, and a computer program stored in the memory 402 and executable on the processor 401. The device further includes a communication component 403, and the processor 401, the memory 402 and the communication component 403 are connected through a bus 404.
When the processor 401 executes the computer program, the steps in each of the above embodiments of the method for training a text classification model are implemented, for example steps S10 to S20 in the embodiment shown in FIG. 1. Alternatively, when the processor 401 executes the computer program, the functions of the modules/units in the above apparatus embodiments are implemented, for example the functions of modules 301 to 302 shown in FIG. 7.
Illustratively, the computer program may be divided into one or more modules/units, which are stored in the memory 402 and executed by the processor 401 to complete the present application. The one or more modules/units may be a series of computer program instruction segments capable of accomplishing specific functions, and the instruction segments are used to describe the execution of the computer program in the device 40 for training a text classification model.
Those skilled in the art can understand that FIG. 8 is only an example of a device for training a text classification model and does not constitute a limitation on it; the device may include more or fewer components than shown, or combine some components, or different components, such as input/output devices, network access devices, buses, and the like.
The device for training a text classification model in the embodiments of the present application may be a terminal device, a server, or the like, which is not specifically limited here.
The so-called processor 401 may be a Central Processing Unit (CPU), or another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or any conventional processor.
The memory 402 may be an internal storage unit of the device for training a text classification model, or an external storage device of the device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card (Flash Card), etc. The memory 402 is used to store the computer program and other programs and data required by the device for training a text classification model, and may also be used to temporarily store data that has been or will be output.
The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, the buses in the drawings of the present application are not limited to only one bus or one type of bus.
The embodiments of the present application further provide a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of each of the above method embodiments.
The embodiments of the present application provide a computer program product which, when run on a device for training a text classification model, causes the device to implement the steps of the above method embodiments.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments of the present application may be completed by instructing relevant hardware through a computer program; the computer program may be stored in a computer-readable storage medium, and when executed by a processor, the computer program can implement the steps of each of the above method embodiments. The computer program includes computer program code, which may be in source code form, object code form, an executable file, some intermediate forms, or the like. The computer-readable medium may include at least: any entity or apparatus capable of carrying the computer program code to the photographing apparatus/terminal device, a recording medium, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunication signal, and a software distribution medium, for example a USB flash drive, a removable hard disk, a magnetic disk, or an optical disc. In some jurisdictions, according to legislation and patent practice, computer-readable media may not include electrical carrier signals and telecommunication signals.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not detailed in one embodiment, reference may be made to the relevant descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented by electronic hardware, or by a combination of computer software and electronic hardware. Whether these functions are executed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/network device and method may be implemented in other manners. For example, the apparatus/network device embodiments described above are merely illustrative; for example, the division of modules or units is only a logical function division, and there may be other division manners in actual implementation, for example multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, apparatuses or units, and may be electrical, mechanical or in other forms.
Units described as separate components may or may not be physically separated, and components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.
The above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they can still modify the technical solutions recorded in the foregoing embodiments or equivalently replace some of the technical features therein; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all be included within the protection scope of the present application.

Claims (20)

  1. A method for training a text classification model, wherein the method comprises:
    acquiring a training sample set, the training sample set comprising N labeled training samples and M unlabeled training samples, each labeled training sample comprising text information and a category label of the text information, and each unlabeled training sample comprising text information, wherein M and N are both integers greater than 1;
    performing alternating iterative training on an initial text classification model and an initial text enhancement model according to the training sample set and M enhanced training samples to obtain a target text classification model; wherein, during the i-th alternating iteration, the M enhanced training samples are generated by performing text enhancement on the M unlabeled training samples using the text enhancement model obtained in the (i-1)-th alternating iteration, and i is an integer greater than 1.
  2. The method for training a text classification model according to claim 1, wherein the alternating iterative training is performed multiple times, and each alternating iteration comprises k sub-iterations;
    performing alternating iterative training on the initial text classification model and the initial text enhancement model according to the training sample set and the M enhanced training samples to obtain the target text classification model comprises:
    for the j-th sub-iteration of each alternating iteration, extracting labeled training samples and unlabeled training samples from the training sample set according to a preset ratio, wherein 1 < j ≤ k;
    processing the extracted unlabeled training samples with the text enhancement model obtained in the (j-1)-th sub-iteration to obtain enhanced training samples corresponding to the unlabeled training samples;
    taking the enhanced training samples, the extracted labeled training samples and the unlabeled training samples as input, and training the text classification model obtained in the (j-1)-th sub-iteration and the text enhancement model obtained in the (j-1)-th sub-iteration, obtaining the text classification model of the j-th sub-iteration and the text enhancement model of the j-th sub-iteration;
    returning to the step of extracting labeled and unlabeled training samples from the training sample set according to the preset ratio, until the N labeled training samples and the M unlabeled training samples in the training sample set have all been iterated once, and obtaining the text classification model after the current alternating iteration.
  3. The method for training a text classification model according to claim 2, wherein taking the enhanced training samples, the extracted labeled training samples and the unlabeled training samples as input, and training the text classification model obtained in the (j-1)-th sub-iteration and the text enhancement model obtained in the (j-1)-th sub-iteration to obtain the text classification model and the text enhancement model of the j-th sub-iteration, comprises:
    taking the enhanced training samples, the extracted labeled training samples and the unlabeled training samples as input, and determining a first loss function value based on the text classification model obtained in the (j-1)-th sub-iteration;
    keeping the parameters of the text enhancement model obtained in the (j-1)-th sub-iteration fixed, and updating the parameters of the text classification model obtained in the (j-1)-th sub-iteration according to the first loss function value, obtaining the text classification model of the j-th sub-iteration;
    keeping the parameters of the text classification model obtained in the j-th sub-iteration fixed, and updating the parameters of the text enhancement model obtained in the (j-1)-th sub-iteration according to the first loss function value, obtaining the text enhancement model of the j-th sub-iteration.
  4. The method for training a text classification model according to claim 3, wherein determining the first loss function value based on the text classification model obtained in the (j-1)-th sub-iteration comprises:
    taking the text information in the labeled training samples as features and the category labels corresponding to the text information as labels, and obtaining a second loss function value based on the text classification model obtained in the (j-1)-th sub-iteration;
    taking the unlabeled training samples and the enhanced training samples corresponding to the unlabeled training samples as input, and obtaining a third loss function value based on the text classification model obtained in the (j-1)-th sub-iteration;
    determining the first loss function value according to the second loss function value and the third loss function value.
  5. The method for training a text classification model according to claim 2, wherein processing the unlabeled training samples with the text enhancement model obtained in the (j-1)-th sub-iteration to obtain the enhanced training samples corresponding to the unlabeled training samples comprises:
    performing word segmentation on the unlabeled training samples to obtain a first text sequence corresponding to the unlabeled training samples, the first text sequence comprising at least one word;
    encoding the first text sequence based on a preset dictionary to generate a first vector corresponding to the first text sequence, the first vector comprising multiple code values;
    masking the code values in the first vector based on a preset probability to generate a second vector corresponding to the first vector;
    inputting the second vector into the text enhancement model obtained in the (j-1)-th sub-iteration to obtain the enhanced training samples corresponding to the unlabeled training samples.
  6. The method for training a text classification model according to claim 5, wherein inputting the second vector into the text enhancement model obtained in the (j-1)-th sub-iteration to obtain the enhanced training samples corresponding to the unlabeled training samples comprises:
    inputting the second vector into the text enhancement model obtained in the (j-1)-th sub-iteration, and obtaining a word probability distribution at each mask position of the second vector;
    determining the word corresponding to each mask position based on multinomial distribution sampling;
    determining the enhanced training samples corresponding to the second vector according to the second vector and the words corresponding to the mask positions.
  7. The method for training a text classification model according to any one of claims 1 to 6, wherein the condition for ending the alternating iterative training comprises at least one of the following:
    the number of alternating iterations equals n, or the output result of the target text classification model converges, wherein n ≥ i.
  8. An apparatus for training a text classification model, wherein the apparatus comprises:
    an acquisition module, configured to acquire a training sample set, the training sample set comprising N labeled training samples and M unlabeled training samples, each labeled training sample comprising text information and a category label of the text information, and each unlabeled training sample comprising text information, wherein M and N are both integers greater than 1;
    a training module, configured to perform alternating iterative training on an initial text classification model and an initial text enhancement model according to the training sample set and M enhanced training samples to obtain a target text classification model; wherein, during the i-th alternating iteration, the M enhanced training samples are generated by performing text enhancement on the M unlabeled training samples using the text enhancement model obtained in the (i-1)-th alternating iteration, and i is an integer greater than 1.
  9. A device for training a text classification model, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the steps implemented by the processor when executing the computer program comprise:
    acquiring a training sample set, the training sample set comprising N labeled training samples and M unlabeled training samples, each labeled training sample comprising text information and a category label of the text information, and each unlabeled training sample comprising text information, wherein M and N are both integers greater than 1;
    performing alternating iterative training on an initial text classification model and an initial text enhancement model according to the training sample set and M enhanced training samples to obtain a target text classification model; wherein, during the i-th alternating iteration, the M enhanced training samples are generated by performing text enhancement on the M unlabeled training samples using the text enhancement model obtained in the (i-1)-th alternating iteration, and i is an integer greater than 1.
  10. 根据权利要求9所述的文本分类模型的训练设备,其中,所述交替迭代训练的次数为多次,且每次交替迭代训练过程包括k次子迭代过程,所述处理器执行所述计算机程序时实现步骤还包括:
    对于每次交替迭代训练中的第j次子迭代过程,按照预设比例从所述训练样本集中抽取有标训练样本以及无标训练样本;其中,1<j≤k;
    根据第j-1次子迭代得到的文本增强模型对抽取的无标训练样本进行处理,获得与所述无标训练样本对应的增强训练样本;
    将所述增强训练样本、抽取的有标训练样本以及所述无标训练样本作为输入,对第j-1次子迭代得到的文本分类模型和第j-1次子迭代得到的文本增强模型进行训练,得到第j次子迭代得到的文本分类模型和第j次子迭代得到的文本增强模型;
    返回执行所述按照预设比例从所述训练样本集中抽取有标训练样本以及无标训练样本 的步骤,直至所述训练样本集中N个有标训练样本和M个无标训练样本均迭代一次后,获得当前交替迭代训练后的文本分类模型。
11. The device for training a text classification model according to claim 10, wherein the processor, when executing the computer program, further implements steps comprising:
    determining a first loss function value based on the text classification model obtained in the (j-1)-th sub-iteration, with the augmented training samples, the extracted labeled training samples, and the unlabeled training samples as input;
    keeping the parameters of the text augmentation model obtained in the (j-1)-th sub-iteration unchanged, and updating the parameters of the text classification model obtained in the (j-1)-th sub-iteration according to the first loss function value, to obtain the text classification model of the j-th sub-iteration; and
    keeping the parameters of the text classification model obtained in the j-th sub-iteration unchanged, and updating the parameters of the text augmentation model obtained in the (j-1)-th sub-iteration according to the first loss function value, to obtain the text augmentation model of the j-th sub-iteration.
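The update order of claim 11 — freeze the augmenter and update the classifier from the first loss, then freeze the updated classifier and update the augmenter from the same loss — can be sketched on scalar "parameters", with finite-difference gradients standing in for backpropagation. The learning rate and step sizes are illustrative assumptions.

```python
def sub_iteration_update(clf_params, aug_params, loss_fn, lr=0.1, eps=1e-4):
    # Step 1: hold aug_params fixed; update clf_params against the first loss.
    g_clf = (loss_fn(clf_params + eps, aug_params)
             - loss_fn(clf_params - eps, aug_params)) / (2 * eps)
    clf_params = clf_params - lr * g_clf
    # Step 2: hold the freshly updated clf_params fixed; update aug_params
    # against the same first loss.
    g_aug = (loss_fn(clf_params, aug_params + eps)
             - loss_fn(clf_params, aug_params - eps)) / (2 * eps)
    aug_params = aug_params - lr * g_aug
    return clf_params, aug_params
```

In a real implementation both models are neural networks and the two updates would be ordinary backpropagation steps with the other model's parameters detached.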
12. The device for training a text classification model according to claim 11, wherein the processor, when executing the computer program, further implements steps comprising:
    obtaining a second loss function value based on the text classification model obtained in the (j-1)-th sub-iteration, with the text information in the labeled training samples as features and the category labels corresponding to the text information as labels;
    obtaining a third loss function value based on the text classification model obtained in the (j-1)-th sub-iteration, with the unlabeled training samples and the augmented training samples corresponding to the unlabeled training samples as input; and
    determining the first loss function value according to the second loss function value and the third loss function value.
13. The device for training a text classification model according to claim 10, wherein the processor, when executing the computer program, further implements steps comprising:
    performing word segmentation on the unlabeled training samples to obtain a first text sequence corresponding to the unlabeled training samples, the first text sequence comprising at least one word;
    encoding the first text sequence based on a preset dictionary to generate a first vector corresponding to the first text sequence, the first vector comprising a plurality of encoded values;
    masking the encoded values in the first vector based on a preset probability to generate a second vector corresponding to the first vector; and
    inputting the second vector into the text augmentation model obtained in the (j-1)-th sub-iteration to obtain the augmented training samples corresponding to the unlabeled training samples.
14. The device for training a text classification model according to claim 13, wherein the processor, when executing the computer program, further implements steps comprising:
    inputting the second vector into the text augmentation model obtained in the (j-1)-th sub-iteration to obtain a word probability distribution at each mask position in the second vector;
    determining the word corresponding to each mask position based on multinomial distribution sampling; and
    determining the augmented training sample corresponding to the second vector according to the second vector and the word corresponding to each mask position.
15. The device for training a text classification model according to any one of claims 9 to 14, wherein, when the processor executes the computer program, the condition for ending the alternating iterative training comprises at least one of the following:
    the number of rounds of alternating iterative training equals n, or the output of the target text classification model converges, where n ≥ i.
16. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements steps comprising:
    acquiring a training sample set, the training sample set comprising N labeled training samples and M unlabeled training samples, each labeled training sample comprising text information and a category label of the text information, and each unlabeled training sample comprising text information, where M and N are both integers greater than 1; and
    performing alternating iterative training on an initial text classification model and an initial text augmentation model according to the training sample set and M augmented training samples to obtain a target text classification model, wherein in the i-th round of alternating iterative training, the M augmented training samples are generated by applying text augmentation to the M unlabeled training samples with the text augmentation model obtained in the (i-1)-th round of alternating iteration, where i is an integer greater than 1.
17. The computer-readable storage medium according to claim 16, wherein the alternating iterative training is performed over multiple rounds, each round comprising k sub-iterations, and the computer program, when executed by a processor, further implements steps comprising:
    for the j-th sub-iteration in each round of alternating iterative training, extracting labeled training samples and unlabeled training samples from the training sample set according to a preset ratio, where 1 < j ≤ k;
    processing the extracted unlabeled training samples with the text augmentation model obtained in the (j-1)-th sub-iteration to obtain augmented training samples corresponding to the unlabeled training samples;
    training the text classification model and the text augmentation model obtained in the (j-1)-th sub-iteration, with the augmented training samples, the extracted labeled training samples, and the unlabeled training samples as input, to obtain the text classification model and the text augmentation model of the j-th sub-iteration; and
    returning to the step of extracting labeled training samples and unlabeled training samples from the training sample set according to the preset ratio, until all N labeled training samples and M unlabeled training samples in the training sample set have been iterated over once, to obtain the text classification model of the current round of alternating iterative training.
18. The computer-readable storage medium according to claim 17, wherein the computer program, when executed by a processor, further implements steps comprising:
    determining a first loss function value based on the text classification model obtained in the (j-1)-th sub-iteration, with the augmented training samples, the extracted labeled training samples, and the unlabeled training samples as input;
    keeping the parameters of the text augmentation model obtained in the (j-1)-th sub-iteration unchanged, and updating the parameters of the text classification model obtained in the (j-1)-th sub-iteration according to the first loss function value, to obtain the text classification model of the j-th sub-iteration; and
    keeping the parameters of the text classification model obtained in the j-th sub-iteration unchanged, and updating the parameters of the text augmentation model obtained in the (j-1)-th sub-iteration according to the first loss function value, to obtain the text augmentation model of the j-th sub-iteration.
19. The computer-readable storage medium according to claim 18, wherein the computer program, when executed by a processor, further implements steps comprising:
    obtaining a second loss function value based on the text classification model obtained in the (j-1)-th sub-iteration, with the text information in the labeled training samples as features and the category labels corresponding to the text information as labels;
    obtaining a third loss function value based on the text classification model obtained in the (j-1)-th sub-iteration, with the unlabeled training samples and the augmented training samples corresponding to the unlabeled training samples as input; and
    determining the first loss function value according to the second loss function value and the third loss function value.
20. The computer-readable storage medium according to claim 17, wherein the computer program, when executed by a processor, further implements steps comprising:
    performing word segmentation on the unlabeled training samples to obtain a first text sequence corresponding to the unlabeled training samples, the first text sequence comprising at least one word;
    encoding the first text sequence based on a preset dictionary to generate a first vector corresponding to the first text sequence, the first vector comprising a plurality of encoded values;
    masking the encoded values in the first vector based on a preset probability to generate a second vector corresponding to the first vector; and
    inputting the second vector into the text augmentation model obtained in the (j-1)-th sub-iteration to obtain the augmented training samples corresponding to the unlabeled training samples.
PCT/CN2021/091090 2020-09-28 2021-04-29 Training method, apparatus, device and storage medium for text classification model WO2022062404A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011038589.6A CN112115267B (zh) 2020-09-28 2020-09-28 Training method, apparatus, device and storage medium for text classification model
CN202011038589.6 2020-09-28

Publications (1)

Publication Number Publication Date
WO2022062404A1 true WO2022062404A1 (zh) 2022-03-31

Family

ID=73797210

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/091090 WO2022062404A1 (zh) 2020-09-28 2021-04-29 Training method, apparatus, device and storage medium for text classification model

Country Status (2)

Country Link
CN (1) CN112115267B (zh)
WO (1) WO2022062404A1 (zh)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114896307A (zh) * 2022-06-30 2022-08-12 北京航空航天大学杭州创新研究院 Time-series data augmentation method and apparatus, and electronic device
CN114970499A (zh) * 2022-04-27 2022-08-30 上海销氪信息科技有限公司 Dialogue text augmentation method, apparatus, device and storage medium
CN116150379A (zh) * 2023-04-04 2023-05-23 中国信息通信研究院 SMS text classification method and apparatus, electronic device and storage medium
CN116226382A (zh) * 2023-02-28 2023-06-06 北京数美时代科技有限公司 Keyword-conditioned text classification method and apparatus, electronic device and medium
CN116340510A (zh) * 2023-02-14 2023-06-27 北京数美时代科技有限公司 Optimization method, system, medium and device for text classification variant recall

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112115267B (zh) * 2020-09-28 2023-07-07 平安科技(深圳)有限公司 Training method, apparatus, device and storage medium for text classification model
CN112733539A (zh) * 2020-12-30 2021-04-30 平安科技(深圳)有限公司 Interview entity recognition model training and interview information entity extraction method and apparatus
CN112948582B (zh) * 2021-02-25 2024-01-19 平安科技(深圳)有限公司 Data processing method, apparatus, device and readable medium
CN112906392B (zh) * 2021-03-23 2022-04-01 北京天融信网络安全技术有限公司 Text augmentation method, text classification method and related apparatus
CN113178189B (zh) * 2021-04-27 2023-10-27 科大讯飞股份有限公司 Information classification method and apparatus, and information classification model training method and apparatus
CN113537630B (zh) * 2021-08-04 2024-06-14 支付宝(杭州)信息技术有限公司 Training method and apparatus for a service prediction model
CN114091577B (zh) * 2021-11-02 2022-12-16 北京百度网讯科技有限公司 Method, apparatus, device, medium and program product for training a model

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009049262A1 (en) * 2007-10-11 2009-04-16 Honda Motor Co., Ltd. Text categorization with knowledge transfer from heterogeneous datasets
US20180285771A1 (en) * 2017-03-31 2018-10-04 Drvision Technologies Llc Efficient machine learning method
CN109063724A (zh) * 2018-06-12 2018-12-21 中国科学院深圳先进技术研究院 Enhanced generative adversarial network and target sample recognition method
CN109522961A (zh) * 2018-11-23 2019-03-26 中山大学 Semi-supervised image classification method based on dictionary deep learning
CN110263165A (zh) * 2019-06-14 2019-09-20 中山大学 User comment sentiment analysis method based on semi-supervised learning
CN112115267A (zh) * 2020-09-28 2020-12-22 平安科技(深圳)有限公司 Training method, apparatus, device and storage medium for text classification model

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7558766B1 (en) * 2006-09-29 2009-07-07 Hewlett-Packard Development Company, L.P. Classification using enhanced feature sets
US11205103B2 (en) * 2016-12-09 2021-12-21 The Research Foundation for the State University Semisupervised autoencoder for sentiment analysis
CN113627458A (zh) * 2017-10-16 2021-11-09 因美纳有限公司 Recurrent neural network-based variant pathogenicity classifier
CN110196908A (zh) * 2019-04-17 2019-09-03 深圳壹账通智能科技有限公司 Data classification method and apparatus, computer apparatus and storage medium
CN111046673B (zh) * 2019-12-17 2021-09-03 湖南大学 Training method of a generative adversarial network for defending against malicious text samples
CN111444326B (zh) * 2020-03-30 2023-10-20 腾讯科技(深圳)有限公司 Text data processing method, apparatus, device and storage medium
CN111666409B (zh) * 2020-05-28 2022-02-08 武汉大学 Holistic sentiment intelligent classification method for complex review texts based on an integrated deep capsule network
CN114117048A (zh) * 2021-11-29 2022-03-01 平安银行股份有限公司 Text classification method, apparatus, computer device and storage medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Qizhe Xie; Zihang Dai; Eduard Hovy; Minh-Thang Luong; Quoc V. Le: "Unsupervised Data Augmentation for Consistency Training", arXiv.org, 25 June 2020 (2020-06-25), pages 1-21, XP081678973 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114970499A (zh) * 2022-04-27 2022-08-30 上海销氪信息科技有限公司 Dialogue text augmentation method, apparatus, device and storage medium
CN114970499B (zh) * 2022-04-27 2024-05-31 上海销氪信息科技有限公司 Dialogue text augmentation method, apparatus, device and storage medium
CN114896307A (zh) * 2022-06-30 2022-08-12 北京航空航天大学杭州创新研究院 Time-series data augmentation method and apparatus, and electronic device
CN114896307B (zh) * 2022-06-30 2022-09-27 北京航空航天大学杭州创新研究院 Time-series data augmentation method and apparatus, and electronic device
CN116340510A (zh) * 2023-02-14 2023-06-27 北京数美时代科技有限公司 Optimization method, system, medium and device for text classification variant recall
CN116340510B (zh) * 2023-02-14 2023-10-24 北京数美时代科技有限公司 Optimization method, system, medium and device for text classification variant recall
CN116226382A (zh) * 2023-02-28 2023-06-06 北京数美时代科技有限公司 Keyword-conditioned text classification method and apparatus, electronic device and medium
CN116226382B (zh) * 2023-02-28 2023-08-01 北京数美时代科技有限公司 Keyword-conditioned text classification method and apparatus, electronic device and medium
CN116150379A (zh) * 2023-04-04 2023-05-23 中国信息通信研究院 SMS text classification method and apparatus, electronic device and storage medium
CN116150379B (zh) * 2023-04-04 2023-06-30 中国信息通信研究院 SMS text classification method and apparatus, electronic device and storage medium

Also Published As

Publication number Publication date
CN112115267A (zh) 2020-12-22
CN112115267B (zh) 2023-07-07

Similar Documents

Publication Publication Date Title
WO2022062404A1 (zh) Training method, apparatus, device and storage medium for text classification model
Chang et al. Chinese named entity recognition method based on BERT
WO2019169719A1 (zh) Automatic abstract extraction method and apparatus, computer device and storage medium
CN112613308A (zh) User intent recognition method and apparatus, terminal device and storage medium
CN111599340A (zh) Polyphonic character pronunciation prediction method and apparatus, and computer-readable storage medium
CN112101031B (zh) Entity recognition method, terminal device and storage medium
CN113434858B (zh) Malware family classification method based on disassembled code structure and semantic features
WO2023092960A1 (zh) Labeling method and apparatus for named entity recognition in legal documents
CN112417878B (zh) Entity relation extraction method, system, electronic device and storage medium
WO2024067276A1 (zh) Method, apparatus, device and medium for determining video tags
WO2023020522A1 (zh) Method and device for natural language processing and for training a natural language processing model
CN111339308B (zh) Training method and apparatus for a base classification model, and electronic device
CN114610851A (zh) Training method for an intent recognition model, intent recognition method, device and medium
CN112101042A (zh) Text emotion recognition method and apparatus, terminal device and storage medium
CN114218945A (zh) Entity recognition method, apparatus, server and storage medium
CN115795065A (zh) Cross-modal multimedia data retrieval method and system based on weighted hash codes
CN112232070A (zh) Natural language processing model construction method, system, electronic device and storage medium
CN113486178A (zh) Text recognition model training method, text recognition method, apparatus and medium
Lyu et al. Deep learning for textual entailment recognition
Chan et al. Applying and optimizing NLP model with CARU
CN112528657A (zh) Text intent recognition method and apparatus based on bidirectional LSTM, server and medium
CN115910065A (zh) Lip reading method, system and medium based on a subspace sparse attention mechanism
CN114330375A (zh) Terminology translation method and system based on a fixed paradigm
CN117235205A (zh) Named entity recognition method, apparatus and computer-readable storage medium
CN113283218A (zh) Semantic text compression method and computer device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21870792

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21870792

Country of ref document: EP

Kind code of ref document: A1