CN112347261A - Classification model training method, system, equipment and storage medium - Google Patents
- Publication number: CN112347261A
- Application number: CN202011418498.5A
- Authority: CN (China)
- Prior art keywords: text sample, classification model, sample, training, text
- Prior art date: 2020-12-07
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F16/355 — Information retrieval of unstructured textual data; Clustering; Classification: class or cluster creation or modification
- G06F40/205 — Handling natural language data; Natural language analysis: parsing
- G06F40/58 — Handling natural language data; Processing or translation of natural language: use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
Abstract
The invention provides a classification model training method, system, equipment, and storage medium. The method comprises: acquiring a training set, wherein the training set comprises a first text sample and a second text sample, the first text sample has a classification label, and the second text sample has no classification label; and training a classification model based on semi-supervised learning using the training set, wherein the first text sample is used for supervised learning of the classification model and the second text sample is used for unsupervised learning of the classification model. The method makes full use of existing data, both labeled and unlabeled, improves the training effect of the model, mitigates the overfitting that prior-art classification model training suffers when existing data are insufficient, and, because no manual labeling is required, reduces the manual labeling cost.
Description
Technical Field
The invention relates to the technical field of data processing, and in particular to a classification model training method, system, equipment, and storage medium.
Background
When a classification model is trained, the number of samples in the training set may be insufficient, and insufficient training samples easily lead to overfitting. Data enhancement therefore needs to be performed on the training set to increase the number of training samples. Several data enhancement algorithms are currently used for text, including back-translation, EDA (Easy Data Augmentation), and replacement based on non-core words. The basic flow of back-translation is simple: a translation model translates the original language-1 text into a language-2 expression, translates the language-2 expression into a language-3 expression, and finally translates the language-3 expression directly back into language 1; the resulting text is the enhanced version of the original.
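As a concrete illustration of this flow, here is a minimal Python sketch. The `translate` helper is a placeholder standing in for any machine-translation model or service, and the language codes and pivot choice are assumptions introduced only for the example:

```python
def translate(text: str, src: str, tgt: str) -> str:
    # Placeholder: substitute a real machine-translation model or API here.
    # The name and signature are assumptions for illustration only.
    return text

def back_translate(text: str, lang1: str = "zh", pivots=("en", "fr")) -> str:
    """Augment a language-1 text by routing it through two pivot languages."""
    step1 = translate(text, src=lang1, tgt=pivots[0])        # language 1 -> language 2
    step2 = translate(step1, src=pivots[0], tgt=pivots[1])   # language 2 -> language 3
    return translate(step2, src=pivots[1], tgt=lang1)        # language 3 -> language 1
```

With a real translation model plugged in, `back_translate(sample)` yields a paraphrase of `sample` whose wording differs while the meaning is preserved.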
EDA, a general name for a class of text data enhancement methods based on random word replacement, works much like random cropping and scaling in image enhancement: a certain proportion of words in a text is randomly selected and subjected to simple operations such as synonym replacement and deletion. Unlike back-translation, it requires no external pre-trained model. EDA can effectively improve the generalization ability of a model and reduce generalization error; even on a complete data set it brings an average improvement of 0.8 percentage points. However, because EDA selects the words to replace at random, intuition suggests that replacing important words degrades the quality of the enhanced text. To avoid this problem, data enhancement techniques based on non-core word replacement have also emerged.
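A toy sketch of EDA-style augmentation in Python follows; the synonym table is made-up illustrative data, and real implementations typically draw synonyms from WordNet or embedding neighbours:

```python
import random

SYNONYMS = {"good": ["fine", "great"], "fast": ["quick", "rapid"]}  # assumed toy data

def eda_augment(text: str, replace_ratio: float = 0.1, delete_prob: float = 0.1) -> str:
    words = text.split()
    n_replace = max(1, int(len(words) * replace_ratio))
    # Synonym replacement on randomly chosen positions.
    for idx in random.sample(range(len(words)), k=min(n_replace, len(words))):
        candidates = SYNONYMS.get(words[idx].lower())
        if candidates:
            words[idx] = random.choice(candidates)
    # Random deletion, keeping at least one word.
    kept = [w for w in words if random.random() > delete_prob]
    return " ".join(kept or words[:1])

print(eda_augment("the fast model gives good results"))
```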
However, the above algorithms only perform local transformations on the text content and do not fully account for the effectiveness and diversity of the enhanced text.
Disclosure of Invention
In view of the problems in the prior art, the invention aims to provide a classification model training method, system, equipment, and storage medium that make full use of existing data, reduce labeling cost, and improve the model training effect.
The embodiment of the invention provides a classification model training method, which comprises the following steps:
acquiring a training set, wherein the training set comprises a first text sample and a second text sample, the first text sample has a classification label, and the second text sample has no classification label;
training a classification model based on semi-supervised learning using the training set, wherein the first text sample is used for supervised learning of the classification model and the second text sample is used for unsupervised learning of the classification model.
In some embodiments, the training the classification model based on semi-supervised learning comprises constructing a loss function of the classification model based on a loss function of supervised learning and a loss function of unsupervised learning.
In some embodiments, supervised learning of the classification model using the first text sample includes constructing a supervised learning loss function of the classification model using a cross entropy loss function based on a prediction class and a label class of the first text sample.
In some embodiments, the method further comprises the steps of:
and performing data enhancement based on the second text sample to obtain an enhanced text sample.
In some embodiments, the data enhancement based on the second text sample includes data enhancement of the second text sample in at least one of:
performing back-translation processing on the second text sample to obtain a first enhanced text sample serving as an enhanced text sample;
performing similar word replacement of random words on the second text sample to obtain a second enhanced text sample as an enhanced text sample;
and performing similar word replacement based on the word importance on the second text sample to obtain a third enhanced text sample as the enhanced text sample.
In some embodiments, the data enhancement based on the second text sample comprises the following steps:
dividing the second text samples into three portions according to a predetermined sample-number distribution ratio;
performing back-translation processing on the first portion of second text samples to obtain first enhanced text samples;
performing similar-word replacement of random words on the second portion of second text samples to obtain second enhanced text samples;
performing similar-word replacement based on word importance on the third portion of second text samples to obtain third enhanced text samples;
taking the first, second, and third enhanced text samples as the enhanced text samples.
In some embodiments, unsupervised learning of the classification model using the second text sample includes constructing a loss function for unsupervised learning based on a prediction class distribution of the second text sample and a prediction class distribution of the enhanced text sample.
In some embodiments, the constructing the unsupervised learning loss function includes constructing the unsupervised learning loss function with a KL divergence based on the prediction class distribution of the second text sample and the prediction class distribution of the enhanced text sample.
In some embodiments, the performing similar word replacement based on word importance includes the following steps:
evaluating the importance of each word of the second text sample by using TF-IDF;
and replacing the words with the importance lower than the preset importance threshold value by using similar words.
The embodiment of the invention also provides a classification model training system for implementing the classification model training method described above, the system comprising:
a sample acquisition module, configured to acquire a training set, wherein the training set comprises a first text sample and a second text sample, and the first text sample has a classification label;
and a model training module, configured to train the classification model based on semi-supervised learning using the training set, wherein the first text sample is used for supervised learning of the classification model and the second text sample is used for unsupervised learning of the classification model.
In some embodiments, the system further includes a sample enhancement module, configured to perform data enhancement based on the second text sample, to obtain an enhanced text sample;
when the model training module trains the classification model based on semi-supervised learning, constructing a loss function of the classification model based on a loss function of supervised learning and a loss function of unsupervised learning;
in the supervised learning, the model training module adopts a cross entropy loss function to construct a supervised learning loss function of the classification model based on the prediction class and the label class of the first text sample;
in the unsupervised learning, the model training module constructs a loss function of the unsupervised learning based on the prediction category distribution of the second text sample and the prediction category distribution of the enhanced text sample.
An embodiment of the present invention further provides a classification model training device, including:
a processor;
a memory having stored therein executable instructions of the processor;
wherein the processor is configured to perform the steps of the classification model training method via execution of the executable instructions.
The embodiment of the present invention further provides a computer-readable storage medium for storing a program, where the program, when executed by a processor, implements the steps of the classification model training method.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
The classification model training method, the classification model training system, the classification model training equipment and the storage medium have the following beneficial effects:
the method fully utilizes the existing data comprising the labeled data and the unlabeled data, improves the model training effect, solves the problem that the classification model is easy to over-fit due to the fact that the existing data is not enough in the process of training the classification model in the prior art, reduces the manual labeling cost due to the fact that manual labeling is not needed, and greatly improves the F1 parameter of the model compared with the model training mode without the method.
Drawings
Other features, objects and advantages of the present invention will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, with reference to the accompanying drawings.
FIG. 1 is a flow chart of a classification model training method according to an embodiment of the present invention;
FIG. 2 is a diagram of classification model training according to an embodiment of the present invention;
FIG. 3 is a flow diagram of text enhancement according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a classification model training system according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a classification model training apparatus according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present invention.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the steps. For example, some steps may be decomposed, and some steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
As shown in fig. 1, an embodiment of the present invention provides a classification model training method, including the following steps:
S100: acquiring a training set, wherein the training set comprises a first text sample and a second text sample, the first text sample has a classification label, and the second text sample has no classification label;
S200: training a classification model based on semi-supervised learning using the training set, wherein the first text sample is used for supervised learning of the classification model and the second text sample is used for unsupervised learning of the classification model.
In the classification model training method of the invention, the classification model is trained with both the labeled first text sample and the unlabeled second text sample. This makes full use of existing data, labeled and unlabeled, and improves the model training effect: it solves the prior-art problem that a classification model easily overfits when training data are insufficient, and, because no manual labeling is needed, it reduces the manual labeling cost. Compared with model training that does not use this method, the F1 score of the model is greatly improved.
The classification model to which the training method of the invention is applied may be a deep learning model, such as a convolutional neural network (CNN), a recurrent neural network (RNN), or a Transformer. The invention is not limited thereto, however; in other alternative embodiments the classification model may be another type of model, which also falls within the scope of the invention.
As shown in fig. 2, training the classification model based on semi-supervised learning includes constructing the loss function of the classification model from a supervised-learning loss function and an unsupervised-learning loss function; that is, the total loss of the classification model comprises a supervised part and an unsupervised part. In step S200, training the classification model with the training set based on semi-supervised learning means iteratively training the classification model on this loss function, the training target being to minimize the sum of the supervised-learning and unsupervised-learning loss functions.
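To make the construction concrete, here is a minimal PyTorch sketch of a combined objective of this shape. It is an illustration under assumptions, not the patent's own implementation: `model`, the batch tensors, and the weighting coefficient `lambda_u` are introduced here for the example, and stopping gradients through the original prediction follows common consistency-training practice rather than anything stated in the text:

```python
import torch
import torch.nn.functional as F

def semi_supervised_loss(model, labeled_x, labels, unlabeled_x, augmented_x,
                         lambda_u: float = 1.0):
    """Total loss = supervised cross entropy + lambda_u * unsupervised consistency."""
    # Supervised part: cross entropy between predictions and ground-truth labels.
    sup_loss = F.cross_entropy(model(labeled_x), labels)

    # Unsupervised part: KL divergence between the predicted class distribution
    # of the original unlabeled sample and that of its enhanced counterpart.
    with torch.no_grad():
        p_orig = F.softmax(model(unlabeled_x), dim=-1)      # original distribution
    log_q_aug = F.log_softmax(model(augmented_x), dim=-1)   # enhanced distribution
    unsup_loss = F.kl_div(log_q_aug, p_orig, reduction="batchmean")

    return sup_loss + lambda_u * unsup_loss
```

The relative weight `lambda_u` is an assumed hyperparameter that balances the two parts; the patent text does not fix a value.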
As shown in fig. 2, in step S200, performing supervised learning on the classification model with the first text sample includes constructing the supervised-learning loss function of the classification model as a cross-entropy loss between the prediction class and the label class of the first text sample. Cross entropy is an important concept in information theory, used mainly to measure the difference between two probability distributions. It can serve as a loss function in a neural network: with p the distribution of the true labels and q the distribution predicted by the trained model, the cross-entropy loss measures how similar p and q are. Using cross entropy as the loss function also avoids the learning slowdown that the mean-squared-error loss suffers during gradient descent when a sigmoid activation is used.
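In standard notation (supplied here for reference, not quoted from the patent text), with $p$ the true label distribution over $C$ classes and $q$ the predicted distribution, the cross-entropy loss is

$$H(p, q) = -\sum_{c=1}^{C} p(c)\,\log q(c),$$

which for a one-hot label reduces to the negative log-probability the model assigns to the correct class.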
In this embodiment, the method for training the classification model further includes the following steps:
and performing data enhancement based on the second text sample to obtain an enhanced text sample.
Further, the data enhancement based on the second text sample comprises data enhancement of the second text sample in at least one of the following ways:
performing back-translation processing on the second text sample to obtain a first enhanced text sample serving as an enhanced text sample; back-translation follows the method described above: a translation model translates the original language-1 text into a language-2 expression, translates the language-2 expression into a language-3 expression, and finally translates the language-3 expression directly back into language 1; the resulting text is the first enhanced text sample obtained by back-translating the second text sample;
performing similar-word replacement of random words on the second text sample to obtain a second enhanced text sample as an enhanced text sample, where similar-word replacement of random words is the EDA text enhancement described above;
and performing similar-word replacement based on word importance on the second text sample to obtain a third enhanced text sample as an enhanced text sample.
In this embodiment, in order to fully account for both the effectiveness of the enhanced text samples (the enhanced samples remain close to the original samples) and their diversity (the enhanced samples differ considerably from the originals in form while keeping similar semantics), the above data enhancement algorithms are used in combination.
Specifically, as shown in fig. 3, the data enhancement based on the second text sample includes the following steps:
S310: dividing the second text samples into three portions according to a predetermined sample-number distribution ratio;
S320: performing back-translation processing on the first portion of second text samples to obtain first enhanced text samples;
S330: performing similar-word replacement of random words on the second portion of second text samples to obtain second enhanced text samples;
S340: performing similar-word replacement based on word importance on the third portion of second text samples to obtain third enhanced text samples;
S350: taking the first, second, and third enhanced text samples as the enhanced text samples.
The predetermined sample-number distribution ratio may be selected and adjusted as needed; for example, the first portion may contain 35% of the second text samples, the second portion 35%, and the third portion 30%. This distribution ratio is merely exemplary, and other values may be used in other alternative embodiments.
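A small Python sketch of this three-way split might look as follows; the three augmenter callables are assumed to be supplied (for instance the back-translation and EDA sketches above plus an importance-based replacer), and the 35/35/30 split is the exemplary ratio from the text:

```python
import random

def augment_unlabeled(samples, back_translate, random_replace, importance_replace,
                      ratios=(0.35, 0.35, 0.30)):
    """Split the unlabeled samples into three portions and augment each differently."""
    samples = samples[:]          # work on a copy
    random.shuffle(samples)
    n1 = int(len(samples) * ratios[0])
    n2 = int(len(samples) * ratios[1])
    part1, part2, part3 = samples[:n1], samples[n1:n1 + n2], samples[n1 + n2:]

    enhanced = ([back_translate(s) for s in part1] +       # first enhanced samples
                [random_replace(s) for s in part2] +       # second enhanced samples
                [importance_replace(s) for s in part3])    # third enhanced samples
    return enhanced
```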
In this embodiment, in step S200, performing unsupervised learning on the classification model with the second text sample includes constructing the unsupervised-learning loss function from the prediction category distribution of the second text sample and the prediction category distribution of the enhanced text sample. The principle of this unsupervised learning is that local transformations of the text content, such as back-translation, similar-word replacement of random words, and similar-word replacement based on word importance, should not change the classification distribution of the original text.
Further, this embodiment uses the KL divergence to measure the discrepancy between the category distribution of the original text and that of the transformed text; that is, constructing the unsupervised-learning loss function includes constructing it with the KL divergence between the prediction category distribution of the second text sample and the prediction category distribution of the enhanced text sample. The KL divergence, also known as relative entropy (Kullback-Leibler divergence) or information divergence, is an asymmetric measure of the difference between two probability distributions.
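In standard notation (supplied for reference), with $p$ the predicted category distribution of the original second text sample and $q$ that of its enhanced counterpart, the divergence is

$$D_{\mathrm{KL}}(p \,\|\, q) = \sum_{c=1}^{C} p(c)\,\log \frac{p(c)}{q(c)},$$

and it is asymmetric: in general $D_{\mathrm{KL}}(p \,\|\, q) \neq D_{\mathrm{KL}}(q \,\|\, p)$.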
In this embodiment, the performing of similar word replacement based on word importance includes the following steps:
evaluating the importance of each word of the second text sample by using TF-IDF;
and replacing words whose importance is lower than a preset importance threshold with similar words.
TF-IDF (term frequency-inverse document frequency) is a weighting technique used in information retrieval and data mining; TF is the term frequency and IDF is the inverse document frequency. TF-IDF is used here to evaluate how important a word is to a candidate sample: the importance of a word increases in proportion to the number of times it appears in the candidate sample, but decreases in proportion to its frequency across the entire corpus.
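As an illustration of these two steps, the following sketch scores words with scikit-learn's TfidfVectorizer (assuming a recent scikit-learn with `get_feature_names_out`) and swaps out low-scoring words; the synonym table and the threshold value are placeholders, since the patent fixes neither:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def importance_replace(sample: str, corpus: list, synonyms: dict,
                       threshold: float = 0.2) -> str:
    """Replace words whose TF-IDF importance falls below `threshold`."""
    vec = TfidfVectorizer()
    vec.fit(corpus)                               # IDF statistics over the whole corpus
    scores = dict(zip(vec.get_feature_names_out(),
                      vec.transform([sample]).toarray()[0]))
    out = []
    for word in sample.split():
        score = scores.get(word.lower(), 0.0)
        if 0.0 < score < threshold and word.lower() in synonyms:
            out.append(synonyms[word.lower()][0])  # replace low-importance word
        else:
            out.append(word)                       # keep important or unknown words
    return " ".join(out)
```

Words absent from the fitted vocabulary are left untouched here; whether to replace them too is a design choice the text does not specify.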
As shown in fig. 4, an embodiment of the present invention further provides a classification model training system, for implementing the classification model training method, where the system includes:
a sample obtaining module M100, configured to obtain a training set, where the training set includes a first text sample and a second text sample, and the first text sample has a classification label;
and the model training module M200, configured to train the classification model based on semi-supervised learning using the training set, wherein the first text sample is used for supervised learning of the classification model and the second text sample is used for unsupervised learning of the classification model.
In the classification model training system of the invention, the model training module M200 trains the classification model with both the labeled first text sample and the unlabeled second text sample, making full use of existing data, labeled and unlabeled, and improving the model training effect. On the one hand, this solves the prior-art problem that a classification model easily overfits when training data are insufficient; on the other hand, no manual labeling is required, so the manual labeling cost is reduced. Compared with model training that does not use the method of the invention, the F1 score of the model is greatly improved.
The classification model to which the training system of the invention is applied may be a deep learning model, such as a convolutional neural network (CNN), a recurrent neural network (RNN), or a Transformer. The invention is not limited thereto, however; in other alternative embodiments the classification model may be another type of model, which also falls within the scope of the invention.
In this embodiment, when the model training module M200 trains the classification model based on semi-supervised learning, the loss function of the classification model is constructed from the supervised-learning loss function and the unsupervised-learning loss function; that is, the training target is to minimize the sum of the two.
In the supervised learning, the model training module M200 constructs a supervised learning loss function of the classification model by using a cross entropy loss function based on the prediction class and the label class of the first text sample.
As shown in fig. 4, the classification model training system further includes a sample enhancement module M300, configured to perform data enhancement based on the second text sample to obtain an enhanced text sample. In the unsupervised learning, the model training module M200 constructs the unsupervised-learning loss function from the prediction category distribution of the second text sample and the prediction category distribution of the enhanced text sample; further, in this embodiment it uses the KL divergence between these two distributions to measure the discrepancy between the category distribution of the original text and that of the transformed text.
The sample enhancement module M300 performs data enhancement on the second text sample, which may adopt any one of the following manners:
performing back-translation processing on the second text sample to obtain a first enhanced text sample serving as an enhanced text sample; back-translation follows the method described above: a translation model translates the original language-1 text into a language-2 expression, translates the language-2 expression into a language-3 expression, and finally translates the language-3 expression directly back into language 1; the resulting text is the first enhanced text sample obtained by back-translating the second text sample;
performing similar-word replacement of random words on the second text sample to obtain a second enhanced text sample as an enhanced text sample, where similar-word replacement of random words is the EDA text enhancement described above;
and performing similar-word replacement based on word importance on the second text sample to obtain a third enhanced text sample as an enhanced text sample; specifically, words of low importance are replaced with similar words based on TF-IDF word importance: the importance of each word of the second text sample is first evaluated with TF-IDF, and words whose importance is below a preset importance threshold are replaced with similar words.
Further, in order to fully account for the effectiveness and diversity of the enhanced text samples, the text enhancement methods described above are combined, and the sample enhancement module M300 performs data enhancement based on the second text sample in the following steps:
the sample enhancement module M300 divides the second text samples into three portions according to a predetermined sample-number distribution ratio;
the sample enhancement module M300 performs back-translation processing on the first portion of second text samples to obtain first enhanced text samples;
the sample enhancement module M300 performs similar-word replacement of random words on the second portion of second text samples to obtain second enhanced text samples;
the sample enhancement module M300 performs similar-word replacement based on word importance on the third portion of second text samples to obtain third enhanced text samples;
the sample enhancement module M300 takes the first, second, and third enhanced text samples as the enhanced text samples.
The embodiment of the invention also provides a classification model training device, which comprises a processor; a memory having stored therein executable instructions of the processor; wherein the processor is configured to perform the steps of the classification model training method via execution of the executable instructions.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method, or program product. Thus, various aspects of the invention may be embodied in the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, which may all generally be referred to herein as a "circuit," "module," or "platform."
An electronic device 600 according to this embodiment of the invention is described below with reference to fig. 5. The electronic device 600 shown in fig. 5 is only an example and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 5, the electronic device 600 is embodied in the form of a general purpose computing device. The components of the electronic device 600 may include, but are not limited to: at least one processing unit 610, at least one storage unit 620, a bus 630 that connects the various system components (including the storage unit 620 and the processing unit 610), a display unit 640, and the like.
Wherein the storage unit stores program code executable by the processing unit 610 to cause the processing unit 610 to perform steps according to various exemplary embodiments of the present invention described in the classification model training method section above in this specification. For example, the processing unit 610 may perform the steps as shown in fig. 1.
The storage unit 620 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM) 6201 and/or a cache memory unit 6202, and may further include a read-only memory unit (ROM) 6203.
The memory unit 620 may also include a program/utility 6204 having a set (at least one) of program modules 6205, such program modules 6205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
The electronic device 600 may also communicate with one or more external devices 700 (e.g., keyboard, pointing device, Bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 600, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 600 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 650. Also, the electronic device 600 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via the network adapter 660. The network adapter 660 may communicate with other modules of the electronic device 600 via the bus 630. It should be appreciated that, although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 600, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
In the classification model training apparatus, the program in the memory is executed by the processor to implement the steps of the classification model training method, so the apparatus can also achieve the technical effects of the classification model training method described above.
The embodiment of the present invention further provides a computer-readable storage medium for storing a program, where the program, when executed by a processor, implements the steps of the classification model training method. In some possible embodiments, aspects of the present invention may also be implemented in the form of a program product comprising program code for causing a terminal device to perform the steps according to various exemplary embodiments of the present invention described in the classification model training method section above of this specification, when the program product is executed on the terminal device.
Referring to fig. 6, a program product 800 for implementing the above method according to an embodiment of the present invention is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be executed on a terminal device, such as a personal computer. However, the program product of the present invention is not limited in this regard and, in the present document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including but not limited to electromagnetic or optical signals, or any suitable combination thereof. A readable signal medium may also be any readable medium, other than a readable storage medium, that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++ as well as conventional procedural programming languages such as the "C" programming language. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the Internet using an Internet service provider).
The program in the computer storage medium, when executed by a processor, implements the steps of the classification model training method, and thus the computer storage medium may also achieve the technical effects of the classification model training method described above.
The foregoing describes the invention in further detail with reference to specific preferred embodiments, but the invention is not to be considered limited to these specific details. Those skilled in the art may make several simple deductions or substitutions without departing from the spirit of the invention, and all such variations shall fall within the protection scope of the invention.
Claims (13)
1. A classification model training method is characterized by comprising the following steps:
acquiring a training set, wherein the training set comprises a first text sample and a second text sample, the first text sample has a classification label, and the second text sample has no classification label;
training a classification model based on semi-supervised learning using the training set, wherein the first text sample is used for supervised learning of the classification model and the second text sample is used for unsupervised learning of the classification model.
2. The method of claim 1, wherein training the classification model based on semi-supervised learning comprises constructing a loss function of the classification model based on a loss function of supervised learning and a loss function of unsupervised learning.
3. The method of claim 1, wherein supervised learning of the classification model using the first text sample comprises constructing a supervised learning loss function of the classification model using a cross entropy loss function based on a prediction class and a label class of the first text sample.
4. The classification model training method according to claim 1, further comprising the steps of:
and performing data enhancement based on the second text sample to obtain an enhanced text sample.
5. The method for training classification models according to claim 4, wherein the enhancing data based on the second text sample comprises enhancing data of the second text sample by at least one of:
performing back-translation processing on the second text sample to obtain a first enhanced text sample serving as an enhanced text sample;
performing similar word replacement of random words on the second text sample to obtain a second enhanced text sample as an enhanced text sample;
and performing similar word replacement based on the word importance on the second text sample to obtain a third enhanced text sample as the enhanced text sample.
6. The method for training classification models according to claim 5, wherein the data enhancement based on the second text sample comprises the following steps:
dividing the second text samples into three portions according to a predetermined sample-number distribution ratio;
performing back-translation processing on the first portion of second text samples to obtain first enhanced text samples;
performing similar-word replacement of random words on the second portion of second text samples to obtain second enhanced text samples;
performing similar-word replacement based on word importance on the third portion of second text samples to obtain third enhanced text samples;
taking the first, second, and third enhanced text samples as the enhanced text samples.
7. The method of claim 4, wherein unsupervised learning the classification model using the second text sample comprises constructing a loss function for unsupervised learning based on a prediction class distribution of the second text sample and a prediction class distribution of the enhanced text sample.
8. The classification model training method according to claim 7, wherein the constructing the unsupervised learning loss function includes constructing the unsupervised learning loss function using KL divergence based on the prediction class distribution of the second text sample and the prediction class distribution of the enhanced text sample.
9. The method for training classification models according to claim 7 or 8, wherein the performing of similar word replacement based on word importance comprises the following steps:
evaluating the importance of each word of the second text sample by using TF-IDF;
and replacing the words with the importance lower than the preset importance threshold value by using similar words.
10. A classification model training system for implementing the classification model training method according to any one of claims 1 to 8, the system comprising:
a sample acquisition module, configured to acquire a training set, wherein the training set comprises a first text sample and a second text sample, and the first text sample has a classification label;
and a model training module, configured to train the classification model based on semi-supervised learning using the training set, wherein the first text sample is used for supervised learning of the classification model and the second text sample is used for unsupervised learning of the classification model.
11. The system for training classification models according to claim 10, further comprising a sample enhancement module configured to perform data enhancement based on the second text sample to obtain an enhanced text sample;
when the model training module trains the classification model based on semi-supervised learning, constructing a loss function of the classification model based on a loss function of supervised learning and a loss function of unsupervised learning;
in the supervised learning, the model training module adopts a cross entropy loss function to construct a supervised learning loss function of the classification model based on the prediction class and the label class of the first text sample;
in the unsupervised learning, the model training module constructs a loss function of the unsupervised learning based on the prediction category distribution of the second text sample and the prediction category distribution of the enhanced text sample.
12. A classification model training apparatus, comprising:
a processor;
a memory having stored therein executable instructions of the processor;
wherein the processor is configured to perform the steps of the classification model training method of any one of claims 1 to 8 via execution of the executable instructions.
13. A computer-readable storage medium storing a program which, when executed by a processor, performs the steps of the classification model training method of any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011418498.5A CN112347261A (en) | 2020-12-07 | 2020-12-07 | Classification model training method, system, equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112347261A (en) | 2021-02-09
Family
ID=74427490
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011418498.5A Pending CN112347261A (en) | 2020-12-07 | 2020-12-07 | Classification model training method, system, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112347261A (en) |
2020-12-07: CN application CN202011418498.5A filed; publication CN112347261A (en), status: active, Pending.
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110298415A (en) * | 2019-08-20 | 2019-10-01 | 视睿(杭州)信息科技有限公司 | A kind of training method of semi-supervised learning, system and computer readable storage medium |
CN111522958A (en) * | 2020-05-28 | 2020-08-11 | 泰康保险集团股份有限公司 | Text classification method and device |
CN111723209A (en) * | 2020-06-28 | 2020-09-29 | 上海携旅信息技术有限公司 | Semi-supervised text classification model training method, text classification method, system, device and medium |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112948582B (en) * | 2021-02-25 | 2024-01-19 | 平安科技(深圳)有限公司 | Data processing method, device, equipment and readable medium |
CN112948582A (en) * | 2021-02-25 | 2021-06-11 | 平安科技(深圳)有限公司 | Data processing method, device, equipment and readable medium |
WO2022227214A1 (en) * | 2021-04-29 | 2022-11-03 | 平安科技(深圳)有限公司 | Classification model training method and apparatus, and terminal device and storage medium |
WO2022242459A1 (en) * | 2021-05-17 | 2022-11-24 | 腾讯科技(深圳)有限公司 | Data classification and identification method and apparatus, and device, medium and program product |
CN113378895A (en) * | 2021-05-24 | 2021-09-10 | 成都欧珀通信科技有限公司 | Classification model generation method and device, storage medium and electronic equipment |
CN113378895B (en) * | 2021-05-24 | 2024-03-01 | 成都欧珀通信科技有限公司 | Classification model generation method and device, storage medium and electronic equipment |
CN113378573A (en) * | 2021-06-24 | 2021-09-10 | 北京华成智云软件股份有限公司 | Content big data oriented small sample relation extraction method and device |
WO2023011470A1 (en) * | 2021-08-05 | 2023-02-09 | 上海高德威智能交通系统有限公司 | Machine learning system and model training method |
CN113806535A (en) * | 2021-09-07 | 2021-12-17 | 清华大学 | Method and device for improving classification model performance by using label-free text data samples |
CN113806535B (en) * | 2021-09-07 | 2024-09-06 | 清华大学 | Method and device for improving classification model performance by using unlabeled text data sample |
CN113806536A (en) * | 2021-09-14 | 2021-12-17 | 广州华多网络科技有限公司 | Text classification method and device, equipment, medium and product thereof |
CN113806536B (en) * | 2021-09-14 | 2024-04-16 | 广州华多网络科技有限公司 | Text classification method and device, equipment, medium and product thereof |
CN117235534A (en) * | 2023-11-13 | 2023-12-15 | 支付宝(杭州)信息技术有限公司 | Method and device for training content understanding model and content generating model |
CN117235534B (en) * | 2023-11-13 | 2024-02-20 | 支付宝(杭州)信息技术有限公司 | Method and device for training content understanding model and content generating model |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112347261A (en) | Classification model training method, system, equipment and storage medium | |
US11288593B2 (en) | Method, apparatus and device for extracting information | |
CN110287278B (en) | Comment generation method, comment generation device, server and storage medium | |
CN107908635B (en) | Method and device for establishing text classification model and text classification | |
CN111723209A (en) | Semi-supervised text classification model training method, text classification method, system, device and medium | |
CN110362823B (en) | Training method and device for descriptive text generation model | |
US20200210526A1 (en) | Document classification using attention networks | |
US11886480B2 (en) | Detecting affective characteristics of text with gated convolutional encoder-decoder framework | |
CN110377902B (en) | Training method and device for descriptive text generation model | |
US20200104409A1 (en) | Method and system for extracting information from graphs | |
US20220059200A1 (en) | Deep-learning systems and methods for medical report generation and anomaly detection | |
CN109635197B (en) | Searching method, searching device, electronic equipment and storage medium | |
US11520993B2 (en) | Word-overlap-based clustering cross-modal retrieval | |
CN110245232B (en) | Text classification method, device, medium and computing equipment | |
CN112115700A (en) | Dependency syntax tree and deep learning based aspect level emotion analysis method | |
CN111783450B (en) | Phrase extraction method and device in corpus text, storage medium and electronic equipment | |
CN111079432A (en) | Text detection method and device, electronic equipment and storage medium | |
US11880664B2 (en) | Identifying and transforming text difficult to understand by user | |
CN114298050A (en) | Model training method, entity relation extraction method, device, medium and equipment | |
CN115587184A (en) | Method and device for training key information extraction model and storage medium thereof | |
CN112417860A (en) | Training sample enhancement method, system, device and storage medium | |
CN115952854B (en) | Training method of text desensitization model, text desensitization method and application | |
CN113761895A (en) | Text abstract generation method and device, electronic equipment and storage medium | |
Jorge-Botana et al. | Predicting word maturity from frequency and semantic diversity: a computational study | |
CN111666405B (en) | Method and device for identifying text implication relationship |
Legal Events

Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication |
 | SE01 | Entry into force of request for substantive examination |
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20210209