CN116226375A - Training method and device for classification model suitable for text auditing


Info

Publication number
CN116226375A
Authority
CN
China
Prior art keywords
text
text sample
enhanced
samples
prediction
Prior art date
Legal status
Pending
Application number
CN202310019334.2A
Other languages
Chinese (zh)
Inventor
王赞博
曹宇慧
黄硕
陈永锋
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd

Classifications

    • G: Physics
    • G06: Computing; Calculating or Counting
    • G06F: Electric digital data processing
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35: Clustering; Classification
    • G06F 16/353: Clustering; Classification into predefined classes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a training method and apparatus for a classification model suitable for text auditing, relating to the field of artificial intelligence and in particular to deep learning and computer technology. The method comprises: performing different types of prediction on the text samples in an i-th round text sample set based on a pre-trained language model, to obtain enhanced text samples of different prediction types; acquiring labels and confidences of the enhanced text samples based on a first classification model after a j-th round of training; screening the enhanced text samples according to their prediction types, labels, and confidences; and updating the i-th round text sample set with the screened enhanced text samples to obtain an (i+1)-th round text sample set, on which the first classification model is trained to obtain a second classification model after the (j+1)-th round of training. In this way, noise is kept out of model training, the generalization capability of the classification model is fully exploited, and the accuracy of the classification model is improved.

Description

Training method and device for classification model suitable for text auditing
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular to the fields of deep learning and computer technology.
Background
In the related art, text auditing is an automated, intelligent process that uses natural language processing to judge whether a piece of text complies with the content rules of internet and media platforms. Common text-auditing scenarios include user signatures/nicknames, comments/messages, instant-messaging content, user posts, media information, merchandise information, live-video bullet comments, image-text information, and the like.
Text auditing covers many audit types, and the massive volume of user data generated on the internet every day makes auditing a heavy task. A classification model can use computers and natural language processing to detect and recognize violating text content automatically, replacing or assisting manual review and greatly reducing the labor cost of the personnel involved.
Therefore, improving the accuracy and generalization ability of classification models suitable for text auditing has become an important research direction.
Disclosure of Invention
The disclosure provides a training method and device for a classification model suitable for text auditing.
According to an aspect of the present disclosure, there is provided a training method of a classification model suitable for text auditing, the method comprising:
for any text audit type among a plurality of text audit types, acquiring an i-th round text sample set corresponding to the text audit type and a first classification model after a j-th round of training, where i and j are positive integers;
performing different types of prediction on the text samples in the i-th round text sample set based on a pre-trained language model, to obtain enhanced text samples of different prediction types;
acquiring labels and confidences of the enhanced text samples based on the first classification model;
screening the enhanced text samples according to their prediction types, labels, and confidences, and updating the i-th round text sample set with the screened enhanced text samples to obtain an (i+1)-th round text sample set;
training the first classification model on the (i+1)-th round text sample set to obtain a second classification model after the (j+1)-th round of training, and continuing to acquire the next round text sample set to train the second classification model until training ends and a target classification model is generated.
In this way, noise is kept out of model training, the diversity of the training data is increased, the boundary of model training is expanded, the generalization capability of the classification model is fully exploited, and the accuracy of the classification model is improved.
According to another aspect of the present disclosure, there is provided a training apparatus for a classification model suitable for text auditing, comprising:
a first acquisition module, configured to acquire, for any text audit type among a plurality of text audit types, an i-th round text sample set corresponding to the text audit type and a first classification model after a j-th round of training, where i and j are positive integers;
a second acquisition module, configured to perform different types of prediction on the text samples in the i-th round text sample set based on a pre-trained language model, to obtain enhanced text samples of different prediction types;
a third acquisition module, configured to acquire labels and confidences of the enhanced text samples based on the first classification model;
an updating module, configured to screen the enhanced text samples according to their prediction types, labels, and confidences, and update the i-th round text sample set with the screened enhanced text samples to obtain an (i+1)-th round text sample set;
a training module, configured to train the first classification model on the (i+1)-th round text sample set to obtain a second classification model after the (j+1)-th round of training, and continue to acquire the next round text sample set to train the second classification model until training ends and a target classification model is generated.
According to another aspect of the present disclosure, there is provided an electronic device including at least one processor, and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the training method of a classification model suitable for text auditing according to the embodiments of the first aspect of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the training method of a classification model suitable for text auditing according to the embodiments of the first aspect of the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the training method of a classification model suitable for text auditing according to the first aspect of the present disclosure.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flowchart of a training method of a classification model suitable for text auditing according to an embodiment of the present disclosure;
FIG. 2 is a flowchart of a training method of a classification model suitable for text auditing according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a training method of a classification model suitable for text auditing according to an embodiment of the present disclosure;
FIG. 4 is a flowchart of a training method of a classification model suitable for text auditing according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a training method of a classification model suitable for text auditing according to an embodiment of the present disclosure;
FIG. 6 is a block diagram of a training apparatus of a classification model suitable for text auditing according to an embodiment of the present disclosure;
FIG. 7 is a block diagram of an electronic device for implementing a training method of a classification model suitable for text auditing according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The embodiments of the present disclosure relate to artificial intelligence fields such as computer vision and deep learning.
Artificial intelligence (AI) is a technical science that studies and develops theories, methods, technologies, and application systems for simulating, extending, and expanding human intelligence.
Deep learning (DL) is a research direction in the field of machine learning (ML), introduced to bring machine learning closer to its original goal, artificial intelligence. Deep learning learns the inherent regularities and hierarchical representations of sample data, and the information obtained during such learning helps interpret data such as text, images, and sound. Its ultimate goal is to give machines human-like analytical learning ability, enabling them to recognize text, image, and sound data.
Computer technology uses the computing, logical-judgment, and simulation capabilities of computers to quantitatively calculate and analyze systems, providing means and tools for solving complex-system problems.
A text auditing system typically comprises a sensitive-dictionary matching module and a classification model built on deep learning. The classification model must be trained on manually labeled, high-quality data, and the quantity and quality of the training data strongly affect the resulting target classification model. In practice, labeling large amounts of data consumes substantial human resources; moreover, text samples that match a given text audit type make up only a small proportion of natural data, so they are difficult to collect. The present disclosure obtains text samples matching a text audit type without additional manual labeling, while increasing data diversity and improving the generalization capability of the model.
The training method and apparatus of a classification model suitable for text auditing are described below with reference to the accompanying drawings.
FIG. 1 is a flowchart of a training method of a classification model suitable for text auditing according to an embodiment of the present disclosure. As shown in FIG. 1, the method comprises the following steps:
s101, aiming at any text auditing type in a plurality of text auditing types, acquiring an ith round of text sample set corresponding to the text auditing type and a jth round of trained first classification model, wherein i and j are positive integers.
In some implementations, the text audit types include negative-information types, sensitive-information types, and so on, with each text audit type corresponding to its own classification model.
Optionally, the classification model in the embodiments of the present disclosure may be a binary classification model. For example, a pre-trained model (such as ERNIE 2.0) may be fine-tuned to obtain a binary classifier; for any text audit type among the plurality of text audit types, whether a text to be processed matches that audit type can then be judged with the classification model corresponding to that type.
Optionally, the initial text sample set comprises a plurality of annotated text samples; that is, each text sample carries a label.
For example, if the text audit type is advertisement promotion, the i-th round text sample set contains text samples labeled 0 and 1, where label 0 marks samples that contain advertisement promotion and label 1 marks samples that do not.
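While the disclosure does not prescribe an implementation, a minimal sketch of such a fine-tuning step using the HuggingFace transformers library might look as follows; the checkpoint name, the toy samples, and the hyperparameters are illustrative assumptions:

```python
# Minimal sketch (assumptions: checkpoint name, toy data, hyperparameters).
# Fine-tunes a pre-trained encoder into a binary classifier for one audit
# type; label 0 = contains advertisement promotion, label 1 = does not.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

CKPT = "nghuyong/ernie-2.0-base-en"  # assumed community ERNIE 2.0 checkpoint
tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForSequenceClassification.from_pretrained(CKPT, num_labels=2)

train_set = Dataset.from_dict({
    "text": ["Huge discount, click the link now!", "The weather is nice today."],
    "label": [0, 1],
}).map(lambda batch: tokenizer(batch["text"], truncation=True,
                               padding="max_length", max_length=128),
       batched=True)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="clf_round_0", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train_set,
).train()
```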
S102: perform different types of prediction on the text samples in the i-th round text sample set based on a pre-trained language model, to obtain enhanced text samples of different prediction types.
Optionally, in the embodiments of the present disclosure, the prediction types may include mask prediction and continuation prediction.
It should be noted that the pre-trained language model in the embodiments of the present disclosure may be a generative pre-trained language model (such as ERNIE 3.0, T5, or GPT-3).
In some implementations, different prediction types correspond to different data-preprocessing procedures: the text samples in the i-th round text sample set are preprocessed according to the prediction type and then fed into the pre-trained language model for prediction, generating enhanced text samples of the corresponding prediction types.
S103: acquire labels and confidences of the enhanced text samples based on the first classification model.
The enhanced text samples are classified with the first classification model trained in the j-th round, which constructs a label for each enhanced text sample and yields its confidence.
In some implementations the label of an enhanced text sample indicates that the sample matches the text audit type; in others it indicates that the sample does not.
Note that a label constructed for an enhanced text sample by the first classification model is a pseudo-label.
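A minimal sketch of this pseudo-labeling step, assuming a transformers-style sequence classifier; the function name and the "confidence = maximum softmax probability" convention are assumptions, since the disclosure only requires that each enhanced sample receive a label and a confidence:

```python
# Hypothetical sketch: score enhanced samples with the j-th round classifier.
import torch

def pseudo_label(model, tokenizer, texts, batch_size=32):
    """Return one (pseudo_label, confidence) pair per enhanced text sample."""
    model.eval()
    results = []
    for start in range(0, len(texts), batch_size):
        batch = tokenizer(texts[start:start + batch_size], padding=True,
                          truncation=True, max_length=128, return_tensors="pt")
        with torch.no_grad():
            probs = torch.softmax(model(**batch).logits, dim=-1)
        conf, label = probs.max(dim=-1)  # highest class probability wins
        results.extend(zip(label.tolist(), conf.tolist()))
    return results
```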
S104: screen the enhanced text samples according to their prediction types, labels, and confidences, and update the i-th round text sample set with the screened enhanced text samples to obtain the (i+1)-th round text sample set.
In some implementations, different prediction types correspond to different screening policies.
For example, if the prediction type of an enhanced text sample is mask prediction, the core aim of improving the classification model is to learn the positions of the keywords that determine the label. Based on the labels and confidences of the enhanced text samples, several enhanced text samples that match the text audit type and several that do not can each be obtained from the screened samples to update the i-th round text sample set.
As another example, if the prediction type is continuation prediction, the aim is to improve the recall of the model and fully exploit the generalization capability of the classification model; a number of enhanced text samples matching the text audit type can be screened out based on their labels to update the i-th round text sample set.
Note that the text samples in the (i+1)-th round text sample set are labeled text samples, and their labels are the pseudo-labels produced by the classification model.
S105: train the first classification model on the (i+1)-th round text sample set to obtain a second classification model after the (j+1)-th round of training, and continue to acquire the next round text sample set to train the second classification model until training ends and a target classification model is generated.
Optionally, the first classification model is trained on the (i+1)-th round text sample set and adjusted by back-propagation according to a loss function, yielding the second classification model after the (j+1)-th round of training.
The next round text sample set is then acquired to train the second classification model, and the above steps are repeated until a preset end-of-training condition is met, at which point training of the classification model is complete and the target classification model is generated. Optionally, the end condition may be reaching a preset number of training iterations, or the loss value of the loss function converging to a preset loss threshold; the embodiments of the present disclosure are not limited in this respect.
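Putting S101-S105 together, one possible shape of the iterative loop is sketched below; the four injected callables stand in for the steps described above and are assumptions, not an API defined by the disclosure:

```python
# Hypothetical sketch of the overall loop (S101-S105) for one audit type.
def self_training(initial_samples, classifier, max_rounds=5, *,
                  augment, pseudo_label, screen, train_one_round):
    sample_set = list(initial_samples)                # i-th round sample set
    for _ in range(max_rounds):
        enhanced = augment(sample_set)                # S102: mask + continuation
        labeled = pseudo_label(classifier, enhanced)  # S103: pseudo-labels
        sample_set = sample_set + screen(labeled)     # S104: (i+1)-th round set
        classifier, loss = train_one_round(classifier, sample_set)  # S105
        if loss < 1e-3:                               # preset end condition
            break
    return classifier                                 # target classification model
```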
In the embodiments of the present disclosure, different types of prediction are performed on the text samples in the i-th round text sample set based on a pre-trained language model to obtain enhanced text samples of different prediction types; labels and confidences of the enhanced text samples are acquired based on the first classification model; the enhanced text samples are screened according to their prediction types, labels, and confidences; and the i-th round text sample set is updated with the screened enhanced text samples to obtain the (i+1)-th round text sample set. In this way, noise is kept out of model training, the diversity of the training data is increased, the boundary of model training is expanded, the generalization capability of the classification model is fully exploited, and the accuracy of the classification model is improved.
FIG. 2 is a flowchart of a training method of a classification model suitable for text auditing according to an embodiment of the present disclosure. As shown in FIG. 2, the method comprises the following steps:
s201, aiming at any text auditing type in a plurality of text auditing types, acquiring an ith round of text sample set corresponding to the text auditing type and a jth round of trained first classification model, wherein i and j are positive integers.
The description of step S201 may be referred to the content of the above embodiment, and will not be repeated here.
S202: apply mask coverage to the text samples, and perform mask prediction on the masked text samples based on the pre-trained language model to generate enhanced text samples whose prediction type is mask prediction.
The text samples are masked according to a preset coverage ratio to generate first text samples, each containing one or more masks. A first text sample is input into the pre-trained language model, and the one or more masks in it are predicted based on the pre-trained language model to generate enhanced text samples whose prediction type is mask prediction.
Optionally, the coverage ratio may be 15%; in other implementations it may take other values, and the embodiments of the present disclosure are not limited in this respect.
Masking a text sample according to the preset coverage ratio can generate a plurality of first text samples. For example, if the text sample is "I am an employee", the generated first text samples may be "I am a [mask]", "[mask] is an employee", and so on.
Predicting the mask in the first text sample "I am a [mask]" based on the pre-trained language model may generate enhanced text samples such as "I am a teacher", "I am a student", "I am an employee", and "I am a staff member".
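A minimal sketch of this augmentation step, assuming the HuggingFace fill-mask pipeline with a generic checkpoint (the disclosure itself uses a generative ERNIE-style model; the checkpoint, word-level masking, and the 15% ratio here are illustrative):

```python
# Hypothetical sketch: word-level mask coverage + fill-mask prediction.
import random
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")  # assumed model
MASK = fill_mask.tokenizer.mask_token                         # "[MASK]"

def mask_augment(text, ratio=0.15, top_k=5):
    """Yield enhanced text samples whose prediction type is mask prediction."""
    words = text.split()
    n_mask = max(1, int(len(words) * ratio))       # preset coverage ratio
    for pos in random.sample(range(len(words)), n_mask):
        first_sample = words.copy()
        first_sample[pos] = MASK                   # build a first text sample
        for pred in fill_mask(" ".join(first_sample), top_k=top_k):
            yield pred["sequence"]                 # one enhanced text sample

print(list(mask_augment("I am an employee")))
```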
S203: rewrite the text samples based on preset prompt words, and perform continuation prediction on the rewritten text samples based on the pre-trained language model to generate enhanced text samples whose prediction type is continuation prediction.
To improve the recall of the model and fully exploit the generalization capability of the classification model, second text samples whose labels match the text audit type are obtained from the text samples. Each second text sample is rewritten based on a preset prompt word to obtain a third text sample; any two third text samples are concatenated to obtain a fourth text sample; and the fourth text sample is continued based on the pre-trained language model to generate enhanced text samples whose prediction type is continuation prediction.
As shown in FIG. 3, several different prompt-word templates are designed first. Second text samples are randomly sampled and rewritten based on a preset template to obtain third text samples, which are then concatenated according to a defined template (usable templates are shown in FIG. 3). The core idea is to construct a prompt that lists two existing second text samples matching the text audit type and to have the pre-trained language model continue the list, generating further text matching that type, i.e., enhanced text samples whose prediction type is continuation prediction.
Taking the ERNIE 3.0 model as the pre-trained language model for illustration: ERNIE 3.0 has two parts, an understanding network and a generation network, so understanding tasks and generation tasks can be performed simultaneously during pre-training. Here the generation network continues the fourth text sample to generate enhanced text samples whose prediction type is continuation prediction.
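A minimal sketch of the list-style prompt construction and continuation, assuming a generic text-generation pipeline; the template, checkpoint, and decoding parameters are illustrative assumptions, not the templates shown in FIG. 3:

```python
# Hypothetical sketch: few-shot list prompt + continuation prediction.
import random
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # assumed checkpoint

TEMPLATE = ("The following are comments containing advertisement promotion:\n"
            "1. {a}\n2. {b}\n3.")                      # assumed prompt template

def continuation_augment(second_samples, n_prompts=4):
    """Yield enhanced text samples whose prediction type is continuation."""
    for _ in range(n_prompts):
        a, b = random.sample(second_samples, 2)   # two rewritten third samples
        prompt = TEMPLATE.format(a=a, b=b)        # concatenated fourth sample
        for out in generator(prompt, max_new_tokens=40, do_sample=True,
                             num_return_sequences=2):
            # keep only the newly generated continuation after the prompt
            yield out["generated_text"][len(prompt):].strip()
```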
S204: acquire labels and confidences of the enhanced text samples based on the first classification model.
S205: screen the enhanced text samples according to their prediction types, labels, and confidences, and update the i-th round text sample set with the screened enhanced text samples to obtain the (i+1)-th round text sample set.
S206: train the first classification model on the (i+1)-th round text sample set to obtain a second classification model after the (j+1)-th round of training, and continue to acquire the next round text sample set to train the second classification model until training ends and a target classification model is generated.
For steps S204 to S206, refer to the descriptions in the above embodiments; they are not repeated here.
In the embodiments of the present disclosure, the text samples are masked and mask prediction is performed on the masked text samples based on the pre-trained language model, generating enhanced text samples whose prediction type is mask prediction; the text samples are also rewritten based on preset prompt words and continuation prediction is performed on the rewritten text samples, generating enhanced text samples whose prediction type is continuation prediction. In this way, noise is kept out of model training, the diversity of the training data is increased, the boundary of model training is expanded, the generalization capability of the classification model is fully exploited, and the accuracy of the classification model is improved.
FIG. 4 is a flowchart of a training method of a classification model suitable for text auditing according to an embodiment of the present disclosure. As shown in FIG. 4, the method comprises the following steps:
S401: for any text audit type among a plurality of text audit types, acquire an i-th round text sample set corresponding to the text audit type and a first classification model after a j-th round of training, where i and j are positive integers.
S402: perform different types of prediction on the text samples in the i-th round text sample set based on a pre-trained language model, to obtain enhanced text samples of different prediction types.
For steps S401 and S402, refer to the descriptions in the above embodiments; they are not repeated here.
S403: acquire labels and confidences of the enhanced text samples based on the first classification model.
Optionally, the first classification model in the embodiments of the present disclosure may be a binary classification model. The enhanced text samples are classified based on the first classification model, which judges whether each enhanced text sample matches the current text audit type and thereby yields its label and confidence.
In some implementations the label of an enhanced text sample indicates that the sample matches the text audit type; in others it indicates that the sample does not.
S404: in response to the prediction type of the enhanced text samples being mask prediction, screen the enhanced text samples based on their labels and confidences to generate first enhanced text samples.
In some implementations, the M highest-confidence enhanced text samples labeled as matching the text audit type are taken as third enhanced text samples, the N highest-confidence enhanced text samples labeled as not matching are taken as fourth enhanced text samples, and the third and fourth enhanced text samples are determined to be the first enhanced text samples. M and N are positive integers.
For example, with M = N = 1, the single highest-confidence enhanced text sample labeled as matching the text audit type and the single highest-confidence enhanced text sample labeled as not matching may together be taken as the first enhanced text samples.
Note that this data-enhancement process aims to learn why a text sample matches the text audit type, i.e., where the keywords are. For instance, modifying the mask part of one text sample yields two very similar pieces of enhanced text, one matching the audit type and one not; feeding both into the classification model helps it learn why changing only the masked part flips the label, which improves the accuracy of the classification model.
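A minimal sketch of this screening policy; the label convention (0 = matches the audit type, mirroring the earlier example) and the default M and N values are assumptions:

```python
# Hypothetical sketch: keep the top-M pseudo-labeled positives and the
# top-N pseudo-labeled negatives by confidence (mask-prediction policy).
MATCH, NO_MATCH = 0, 1  # assumed label values, as in the earlier example

def screen_mask_samples(samples, m=100, n=100):
    """samples: list of (text, pseudo_label, confidence) tuples."""
    ranked = sorted(samples, key=lambda s: s[2], reverse=True)
    third = [s for s in ranked if s[1] == MATCH][:m]      # third enhanced samples
    fourth = [s for s in ranked if s[1] == NO_MATCH][:n]  # fourth enhanced samples
    return third + fourth                                 # first enhanced samples
```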
S405: in response to the prediction type of the enhanced text samples being continuation prediction, screen the enhanced text samples based on their labels to generate second enhanced text samples.
In some implementations, to reduce computation and speed up training, the enhanced text samples may be merged to remove duplicates, yielding fifth enhanced text samples; the fifth enhanced text samples labeled as matching the text audit type are determined to be the second enhanced text samples.
Note that this data-enhancement process aims to generate more violating samples (labeled as matching the text audit type) and to filter out samples predicted not to violate (labeled as not matching), thereby improving the recall of the model.
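Correspondingly, a sketch of the continuation-prediction screening policy under the same assumed label convention:

```python
# Hypothetical sketch: deduplicate, then keep only samples pseudo-labeled
# as violating (continuation-prediction policy).
MATCH = 0  # assumed label value for "matches the text audit type"

def screen_continuation_samples(samples):
    """samples: list of (text, pseudo_label, confidence) tuples."""
    seen, fifth = set(), []
    for text, label, conf in samples:          # merge duplicate enhanced samples
        if text not in seen:
            seen.add(text)
            fifth.append((text, label, conf))  # fifth enhanced samples
    return [s for s in fifth if s[1] == MATCH]  # second enhanced samples
```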
S406: add the first enhanced text samples and the second enhanced text samples to the i-th round text sample set to obtain the (i+1)-th round text sample set.
S407: train the first classification model on the (i+1)-th round text sample set to obtain a second classification model after the (j+1)-th round of training, and continue to acquire the next round text sample set to train the second classification model until training ends and a target classification model is generated.
For steps S406 and S407, refer to the descriptions in the above embodiments; they are not repeated here.
In the embodiments of the present disclosure, labels and confidences of the enhanced text samples are acquired based on the first classification model; in response to the prediction type being mask prediction, the enhanced text samples are screened based on their labels and confidences to generate first enhanced text samples; and in response to the prediction type being continuation prediction, they are screened based on their labels to generate second enhanced text samples. Scoring the enhanced data samples with the classification model from the previous round to generate their labels avoids introducing noise during model training; and because the pre-trained language model was trained on massive data during its pre-training stage, it can generate semantically rich data, so using it for data enhancement increases the diversity of the training data, expands the boundary of the classification model, and fully exploits its generalization capability.
FIG. 5 is a schematic diagram of a training method of a classification model suitable for text auditing according to an embodiment of the present disclosure. As shown in FIG. 5, in an embodiment of the present disclosure, for any text audit type among a plurality of text audit types, an i-th round text sample set corresponding to the text audit type and a first classification model after a j-th round of training are acquired. The text samples are masked and mask prediction is performed on the masked text samples based on the pre-trained language model, generating enhanced text samples whose prediction type is mask prediction; the text samples are also rewritten based on preset prompt words and continuation prediction is performed on the rewritten text samples, generating enhanced text samples whose prediction type is continuation prediction. Labels and confidences of the enhanced text samples are acquired based on the first classification model. In response to the prediction type being mask prediction, the enhanced text samples are screened based on their labels and confidences to generate first enhanced text samples; in response to the prediction type being continuation prediction, they are screened based on their labels to generate second enhanced text samples. The first and second enhanced text samples are added to the i-th round text sample set to obtain the (i+1)-th round text sample set. The first classification model is trained on the (i+1)-th round text sample set to obtain a second classification model after the (j+1)-th round of training, and the next round text sample set continues to be acquired to train the second classification model until training ends and a target classification model is generated.
Scoring the enhanced data samples with the classification model from the previous round to generate their labels avoids introducing noise during model training. Because the pre-trained language model was trained on massive data during its pre-training stage, it can generate semantically rich data; using it for data enhancement therefore increases the diversity of the training data, expands the boundary of the classification model, and fully exploits its generalization capability.
FIG. 6 is a block diagram of a training apparatus for a classification model suitable for text auditing according to an embodiment of the present disclosure. As shown in FIG. 6, the training apparatus 600 includes:
a first acquisition module 610, configured to acquire, for any text audit type among a plurality of text audit types, an i-th round text sample set corresponding to the text audit type and a first classification model after a j-th round of training, where i and j are positive integers;
a second acquisition module 620, configured to perform different types of prediction on the text samples in the i-th round text sample set based on a pre-trained language model, to obtain enhanced text samples of different prediction types;
a third acquisition module 630, configured to acquire labels and confidences of the enhanced text samples based on the first classification model;
an updating module 640, configured to screen the enhanced text samples according to their prediction types, labels, and confidences, and update the i-th round text sample set with the screened enhanced text samples to obtain the (i+1)-th round text sample set;
a training module 650, configured to train the first classification model on the (i+1)-th round text sample set to obtain a second classification model after the (j+1)-th round of training, and continue to acquire the next round text sample set to train the second classification model until training ends and a target classification model is generated.
In some implementations, the second acquisition module 620 is further configured to: apply mask coverage to the text samples and perform mask prediction on the masked text samples based on the pre-trained language model, generating enhanced text samples whose prediction type is mask prediction; and/or rewrite the text samples based on preset prompt words and perform continuation prediction on the rewritten text samples based on the pre-trained language model, generating enhanced text samples whose prediction type is continuation prediction.
In some implementations, the second acquisition module 620 is further configured to: mask the text samples according to a preset coverage ratio to generate first text samples, each comprising one or more masks; input a first text sample into the pre-trained language model; and predict the one or more masks in the first text sample based on the pre-trained language model to generate enhanced text samples.
In some implementations, the second acquisition module 620 is further configured to: acquire, from the text samples, second text samples whose labels match the text audit type; rewrite the second text samples based on preset prompt words to obtain third text samples; concatenate any two third text samples to obtain a fourth text sample; and continue the fourth text sample based on the pre-trained language model to generate enhanced text samples.
In some implementations, the updating module 640 is further configured to: in response to the prediction type of the enhanced text samples being mask prediction, screen the enhanced text samples based on their labels and confidences to generate first enhanced text samples; in response to the prediction type being continuation prediction, screen the enhanced text samples based on their labels to generate second enhanced text samples; and add the first and second enhanced text samples to the i-th round text sample set to obtain the (i+1)-th round text sample set.
In some implementations, the labels include matching the text audit type and not matching the text audit type, and the updating module 640 is further configured to: take the M highest-confidence enhanced text samples labeled as matching the text audit type as third enhanced text samples and the N highest-confidence enhanced text samples labeled as not matching as fourth enhanced text samples; and determine the third and fourth enhanced text samples to be the first enhanced text samples.
In some implementations, the updating module 640 is further configured to: merge the enhanced text samples to remove duplicates and obtain fifth enhanced text samples; and determine the fifth enhanced text samples labeled as matching the text audit type to be the second enhanced text samples.
In this way, noise is kept out of model training, the diversity of the training data is increased, the boundary of model training is expanded, the generalization capability of the classification model is fully exploited, and the accuracy of the classification model is improved.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
FIG. 7 is a block diagram of an electronic device for implementing a training method of a classification model suitable for text auditing according to an embodiment of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown here, their connections and relationships, and their functions are meant to be exemplary only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the apparatus 700 includes a computing unit 701 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 may also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Various components in device 700 are connected to I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, etc.; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, an optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be any of various general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various specialized artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, or microcontroller. The computing unit 701 performs the various methods and processes described above, such as the training method of a classification model suitable for text auditing. For example, in some embodiments, the training method of a classification model suitable for text auditing may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the training method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured by any other suitable means (e.g., by means of firmware) to perform the training method of a classification model suitable for text auditing.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (17)

1. A training method of a classification model suitable for text auditing, comprising:
for any text audit type among a plurality of text audit types, acquiring an i-th round text sample set corresponding to the text audit type and a first classification model after a j-th round of training, wherein i and j are positive integers;
performing different types of prediction on the text samples in the i-th round text sample set based on a pre-trained language model, to obtain enhanced text samples of different prediction types;
acquiring labels and confidences of the enhanced text samples based on the first classification model;
screening the enhanced text samples according to their prediction types, labels, and confidences, and updating the i-th round text sample set with the screened enhanced text samples to obtain an (i+1)-th round text sample set; and
training the first classification model on the (i+1)-th round text sample set to obtain a second classification model after the (j+1)-th round of training, and continuing to acquire the next round text sample set to train the second classification model until training ends and a target classification model is generated.
2. The method of claim 1, wherein the prediction types comprise mask prediction and/or continuation prediction, and the performing different types of prediction on the text samples in the i-th round text sample set based on a pre-trained language model to obtain enhanced text samples of different prediction types comprises:
applying mask coverage to the text samples, and performing mask prediction on the masked text samples based on the pre-trained language model to generate enhanced text samples whose prediction type is mask prediction; and/or
rewriting the text samples based on a preset prompt word, and performing continuation prediction on the rewritten text samples based on the pre-trained language model to generate enhanced text samples whose prediction type is continuation prediction.
3. The method of claim 2, wherein the applying mask coverage to the text samples and performing mask prediction on the masked text samples based on the pre-trained language model to generate enhanced text samples whose prediction type is mask prediction comprises:
masking the text sample according to a preset coverage ratio to generate a first text sample, wherein the first text sample comprises one or more masks; and
inputting the first text sample into the pre-trained language model, and predicting the one or more masks in the first text sample based on the pre-trained language model to generate an enhanced text sample whose prediction type is mask prediction.
4. The method of claim 2, wherein the text samples have labels, and the rewriting the text samples based on a preset prompt word and performing continuation prediction on the rewritten text samples based on the pre-trained language model to generate enhanced text samples whose prediction type is continuation prediction comprises:
acquiring, from the text samples, a second text sample whose label matches the text audit type;
rewriting the second text sample based on a preset prompt word to obtain a third text sample;
concatenating any two third text samples to obtain a fourth text sample; and
continuing the fourth text sample based on the pre-trained language model to generate an enhanced text sample whose prediction type is continuation prediction.
5. The method according to any one of claims 2-4, wherein the screening the enhanced text samples according to their prediction types, labels, and confidences, and updating the i-th round text sample set with the screened enhanced text samples to obtain the (i+1)-th round text sample set comprises:
in response to the prediction type of the enhanced text samples being mask prediction, screening the enhanced text samples based on their labels and confidences to generate first enhanced text samples;
in response to the prediction type of the enhanced text samples being continuation prediction, screening the enhanced text samples based on their labels to generate second enhanced text samples; and
adding the first enhanced text samples and the second enhanced text samples to the i-th round text sample set to obtain the (i+1)-th round text sample set.
6. The method of claim 5, wherein the labels comprise matching the text audit type and not matching the text audit type, and the screening the enhanced text samples based on their labels and confidences to generate first enhanced text samples comprises:
taking the M highest-confidence enhanced text samples labeled as matching the text audit type as third enhanced text samples, and the N highest-confidence enhanced text samples labeled as not matching the text audit type as fourth enhanced text samples, wherein M and N are positive integers; and
determining the third enhanced text samples and the fourth enhanced text samples to be the first enhanced text samples.
7. The method of claim 6, wherein the screening the enhanced text samples based on their labels to generate second enhanced text samples comprises:
merging the enhanced text samples to remove duplicate enhanced text samples and obtain fifth enhanced text samples; and
determining the fifth enhanced text samples labeled as matching the text audit type to be the second enhanced text samples.
8. A training apparatus for a classification model suitable for text auditing, comprising:
a first acquisition module, configured to acquire, for any text audit type among a plurality of text audit types, an i-th round text sample set corresponding to the text audit type and a first classification model after a j-th round of training, wherein i and j are positive integers;
a second acquisition module, configured to perform different types of prediction on the text samples in the i-th round text sample set based on a pre-trained language model, to obtain enhanced text samples of different prediction types;
a third acquisition module, configured to acquire labels and confidences of the enhanced text samples based on the first classification model;
an updating module, configured to screen the enhanced text samples according to their prediction types, labels, and confidences, and update the i-th round text sample set with the screened enhanced text samples to obtain an (i+1)-th round text sample set; and
a training module, configured to train the first classification model on the (i+1)-th round text sample set to obtain a second classification model after the (j+1)-th round of training, and continue to acquire the next round text sample set to train the second classification model until training ends and a target classification model is generated.
9. The apparatus of claim 8, wherein the second acquisition module is further configured to:
performing mask coverage on the text samples, and performing mask prediction on the mask-covered text samples based on the pre-training language model to generate enhanced text samples with a prediction type of mask prediction; and/or
rewriting the text samples based on a preset prompt word, and performing continuation writing prediction on the rewritten text samples based on the pre-training language model to generate enhanced text samples with a prediction type of continuation writing prediction.
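For concreteness, a sketch of the two augmentation paths using Hugging Face pipelines. The patent names only "a pre-training language model", so the library, the placeholder checkpoints, and the single-character masking rule here are all assumptions.

```python
from transformers import pipeline

# Placeholder checkpoints; any masked LM / causal LM pair would do.
fill_mask = pipeline("fill-mask", model="bert-base-chinese")
generate = pipeline("text-generation", model="gpt2")

def mask_predict(text: str) -> list:
    """Mask-prediction path: cover one mid-string character, let the LM fill it."""
    k = len(text) // 2
    masked = text[:k] + fill_mask.tokenizer.mask_token + text[k + 1:]
    return [r["sequence"] for r in fill_mask(masked)]  # candidate enhanced samples

def continue_predict(prompted: str) -> str:
    """Continuation-writing path: the LM extends a prompt-rewritten sample."""
    return generate(prompted, max_new_tokens=30)[0]["generated_text"]
```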
10. The apparatus of claim 9, wherein the second acquisition module is further configured to:
performing mask coverage on the text sample according to a preset coverage proportion to generate a first text sample, wherein the first text sample comprises one or more masks;
inputting the first text sample into the pre-training language model, and predicting the one or more masks in the first text sample based on the pre-training language model to generate enhanced text samples.
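A sketch of the proportion-based masking of claim 10; random position choice is an assumption, since the claim fixes only the coverage proportion.

```python
import random
from typing import List

def mask_by_proportion(tokens: List[str], ratio: float,
                       mask_token: str = "[MASK]") -> List[str]:
    """Cover a preset proportion of tokens to build the first text sample."""
    out = list(tokens)
    k = min(len(out), max(1, round(len(out) * ratio)))
    for pos in random.sample(range(len(out)), k):
        out[pos] = mask_token
    return out

# e.g. mask_by_proportion(["the", "text", "to", "audit"], 0.25)
# -> ["the", "[MASK]", "to", "audit"]  (one of four possible outputs)
```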
11. The apparatus of claim 9, wherein the second acquisition module is further configured to:
acquiring, from the text samples, a second text sample whose label conforms to the text auditing type;
rewriting the second text sample based on a preset prompt word to obtain a third text sample;
splicing any two third text samples to obtain a fourth text sample;
and performing continuation writing prediction on the fourth text sample based on the pre-training language model to generate an enhanced text sample.
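The steps of claim 11 chain naturally: filter, prompt-rewrite, splice pairs. A sketch with an invented prompt word, since the claim leaves the preset prompt word unspecified:

```python
from itertools import combinations
from typing import Dict, List

PROMPT = "Please continue: "  # hypothetical preset prompt word

def build_continuation_inputs(samples: List[Dict]) -> List[str]:
    """Select second samples whose label conforms to the audit type, rewrite
    each with the prompt word (third samples), then splice every pair into a
    fourth sample ready for continuation-writing prediction."""
    second = [s["text"] for s in samples if s["label"] == "conforms"]
    third = [PROMPT + t for t in second]
    return [a + b for a, b in combinations(third, 2)]
```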
12. The apparatus of any of claims 9-11, wherein the updating module is further configured to:
in response to the prediction type of an enhanced text sample being mask prediction, screen the enhanced text sample based on its label and confidence to generate a first enhanced text sample;
in response to the prediction type of an enhanced text sample being continuation writing prediction, screen the enhanced text sample based on its label to generate a second enhanced text sample;
and add the first enhanced text sample and the second enhanced text sample to the ith round of text sample set to obtain an (i+1)th round of text sample set.
13. The apparatus of claim 12, wherein the labels include a label conforming to the text auditing type and a label not conforming to the text auditing type, the updating module being further configured to:
obtain, from the enhanced text samples labeled as conforming to the text auditing type, the M samples with the highest confidence as third enhanced text samples, and, from the enhanced text samples labeled as not conforming to the text auditing type, the N samples with the highest confidence as fourth enhanced text samples, wherein M and N are positive integers;
and determine the third enhanced text samples and the fourth enhanced text samples as the first enhanced text samples.
14. The apparatus of claim 13, wherein the updating module is further configured to:
merge the enhanced text samples to remove duplicate enhanced text samples and obtain fifth enhanced text samples;
and determine the fifth enhanced text samples whose labels conform to the text auditing type as the second enhanced text samples.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
16. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the steps of the method according to any one of claims 1-7.
17. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-7.
CN202310019334.2A 2023-01-06 2023-01-06 Training method and device for classification model suitable for text auditing Pending CN116226375A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310019334.2A CN116226375A (en) 2023-01-06 2023-01-06 Training method and device for classification model suitable for text auditing

Publications (1)

Publication Number Publication Date
CN116226375A (en) 2023-06-06

Family

ID=86577873

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310019334.2A Pending CN116226375A (en) 2023-01-06 2023-01-06 Training method and device for classification model suitable for text auditing

Country Status (1)

Country Link
CN (1) CN116226375A (en)

Similar Documents

Publication Publication Date Title
CN113705187B (en) Method and device for generating pre-training language model, electronic equipment and storage medium
CN111078887B (en) Text classification method and device
CN112749300B (en) Method, apparatus, device, storage medium and program product for video classification
CN113836925B (en) Training method and device for pre-training language model, electronic equipment and storage medium
CN113011155B (en) Method, apparatus, device and storage medium for text matching
CN115063875A (en) Model training method, image processing method, device and electronic equipment
CN111950279A (en) Entity relationship processing method, device, equipment and computer readable storage medium
CN113887627A (en) Noise sample identification method and device, electronic equipment and storage medium
CN113434683A (en) Text classification method, device, medium and electronic equipment
CN113657483A (en) Model training method, target detection method, device, equipment and storage medium
CN112559885A (en) Method and device for determining training model of map interest point and electronic equipment
CN114970540A (en) Method and device for training text audit model
CN112270169B (en) Method and device for predicting dialogue roles, electronic equipment and storage medium
CN112906368A (en) Industry text increment method, related device and computer program product
CN113641724B (en) Knowledge tag mining method and device, electronic equipment and storage medium
CN116048463A (en) Intelligent recommendation method and device for content of demand item based on label management
CN113688232B (en) Method and device for classifying bid-inviting text, storage medium and terminal
CN113360672B (en) Method, apparatus, device, medium and product for generating knowledge graph
CN113051911B (en) Method, apparatus, device, medium and program product for extracting sensitive words
CN115565186A (en) Method and device for training character recognition model, electronic equipment and storage medium
CN114817476A (en) Language model training method and device, electronic equipment and storage medium
CN116226375A (en) Training method and device for classification model suitable for text auditing
CN113807390A (en) Model training method and device, electronic equipment and storage medium
CN114491030A (en) Skill label extraction and candidate phrase classification model training method and device
CN114492364A (en) Same vulnerability judgment method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination