CN111881295A - Text classification model training method and device and text labeling method and device

Info

Publication number
CN111881295A
CN111881295A (application CN202010761788.3A)
Authority
CN
China
Prior art keywords
classification model
text classification
samples
sample
sample set
Prior art date
Legal status
Pending
Application number
CN202010761788.3A
Other languages
Chinese (zh)
Inventor
马小龙
Current Assignee
China Everbright Bank Co Ltd
Original Assignee
China Everbright Bank Co Ltd
Priority date
Filing date
Publication date
Application filed by China Everbright Bank Co Ltd
Priority to CN202010761788.3A
Publication of CN111881295A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/24 Classification techniques

Abstract

Embodiments of the invention provide a text classification model training method and apparatus and a text labeling method and apparatus. The text classification model training method includes: generating a first text classification model and a second text classification model from a first sample set, where there are one or more first text classification models and the samples in the first sample set are labeled samples; labeling the samples in a second sample set through at least one first text classification model, so that at least some of the samples in the second sample set that satisfy a first preset condition are marked as first cycle samples, the samples in the second sample set being unlabeled samples; and training the second text classification model on the first cycle samples. The embodiments solve the problem of low text labeling efficiency in the related art, and the trained second text classification model can markedly improve text labeling efficiency.

Description

Text classification model training method and device and text labeling method and device
Technical Field
The embodiment of the invention relates to the field of natural language processing, in particular to a text classification model training method and device and a text labeling method and device.
Background
Labeling and analyzing text data samples is a necessary step in natural language processing (NLP), and the results of text labeling analysis directly affect how natural language processing performs in downstream applications in fields such as finance and commerce.
In the related art, text labeling analysis is mostly performed by manual screening, which is inefficient: a single annotator typically completes between one hundred and one thousand lines per day. For example, a sample set containing 7,476 lines of data would take one person about a month to process.
No effective solution to this low text labeling efficiency has been proposed in the related art.
Disclosure of Invention
Embodiments of the invention provide a text classification model training method and apparatus and a text labeling method and apparatus, which at least solve the problem of low text labeling efficiency in the related art.
According to an embodiment of the present invention, there is provided a text classification model training method, including:
generating a first text classification model and a second text classification model through the first sample set; wherein the first text classification model is one or more, and the samples in the first sample set are labeled samples;
labeling samples in a second sample set through at least one first text classification model so as to mark at least part of samples meeting a first preset condition in the second sample set as first cycle samples; wherein the samples in the second sample set are unlabeled samples;
and training the second text classification model on the first cycle samples.
According to another embodiment of the present invention, there is also provided a text labeling method, including the text classification model training method described in the above embodiment; the text labeling method comprises the following steps:
and labeling the samples in the second sample set through the trained second text classification model.
According to another embodiment of the present invention, there is also provided a text classification model training apparatus including:
the generating module is used for generating a first text classification model and a second text classification model through the first sample set; wherein the first text classification model is one or more, and the samples in the first sample set are labeled samples;
the loop module is used for labeling the samples in the second sample set through at least one first text classification model, so as to mark at least part of the samples meeting a first preset condition in the second sample set as first cycle samples; wherein the samples in the second sample set are unlabeled samples;
and the training module is used for training the second text classification model on the first cycle samples.
According to another embodiment of the present invention, there is also provided a text labeling apparatus, including the text classification model training apparatus in the above embodiment; the text labeling device comprises:
and the marking module is used for marking the samples in the second sample set through the trained second text classification model.
According to another embodiment of the present invention, there is also provided a computer-readable storage medium having a computer program stored thereon, wherein the computer program is arranged to, when executed, perform the steps of any of the above method embodiments.
According to another embodiment of the present invention, there is also provided an electronic device, including a memory in which a computer program is stored and a processor configured to execute the computer program to perform the steps in any of the above method embodiments.
According to the embodiments of the invention, one or more first text classification models and a second text classification model can be generated from the first sample set; the samples in the second sample set are then labeled through at least one first text classification model, so that at least some of the samples in the second sample set that satisfy a first preset condition are marked as first cycle samples; and the second text classification model is trained on the first cycle samples. The samples in the first sample set are labeled samples, and the samples in the second sample set are unlabeled samples. The embodiments of the invention therefore solve the problem of low text labeling efficiency in the related art, and the trained second text classification model can markedly improve text labeling efficiency.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a scene schematic diagram of a text classification model training method and apparatus, and a text labeling method and apparatus according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a terminal device provided according to an embodiment of the present invention;
FIG. 3 is a flowchart of a text classification model training method provided in accordance with an embodiment of the present invention;
FIG. 4 is a flowchart of a text annotation method provided in accordance with an embodiment of the present invention;
FIG. 5 is a block diagram of a text classification model training apparatus according to an embodiment of the present invention;
FIG. 6 is a block diagram of a text annotation device according to an embodiment of the present invention.
Detailed Description
Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings in conjunction with the embodiments.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
To further explain the working modes of the text classification model training method and device and the text labeling method and device in the embodiments of the present invention, the following describes the application scenarios of the text classification model training method and device and the text labeling method and device in the embodiments of the present invention:
fig. 1 is a scene schematic diagram of a text classification model training method and apparatus, and a text labeling method and apparatus according to embodiments of the present invention, where the text classification model training method and the text labeling method in embodiments of the present invention may be applied to the system architecture shown in fig. 1. As shown in fig. 1, the system architecture 100 may include one or more of terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen, including but not limited to desktop computers, portable computers, smart phones, tablet computers, and the like. It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative, and that there may be any number of terminal devices, networks, and servers, as desired for an implementation. For example, server 105 may be a server cluster comprised of multiple servers, or the like.
The text classification model training method and the text labeling method provided by the embodiments of the present invention are generally executed by the server 105, and accordingly, the text classification model training apparatus and the text labeling apparatus provided by the embodiments of the present invention are generally disposed in the server 105. However, as is easily understood by those skilled in the art, the text classification model training method and the text labeling method may also be executed by the terminal devices 101, 102, and 103, and accordingly, the text classification model training apparatus and the text labeling apparatus may also be disposed in the terminal devices 101, 102, and 103; this is not particularly limited in the embodiments of the present invention.
For example, in an exemplary embodiment, a user may upload a first sample set and a second sample set to the server 105 through the terminal devices 101, 102, and 103, the server 105 completes training of a text classification model through the text classification model training method provided in the embodiment of the present invention, completes labeling of a text through the text labeling method provided in the embodiment of the present invention, and transmits the labeled text to the terminal devices 101, 102, and 103, and the like.
Taking the above terminal device as an example, fig. 2 is a schematic structural diagram of the terminal device provided according to an embodiment of the present invention. As shown in fig. 2, the computer system 200 includes a central processing unit (CPU) 201 that can execute various appropriate actions and processes according to a program stored in a read-only memory (ROM) 202 or a program loaded from a storage section 208 into a random access memory (RAM) 203. The RAM 203 also stores various programs and data necessary for system operation. The CPU 201, ROM 202, and RAM 203 are connected to each other via a bus 204. An input/output (I/O) interface 205 is also connected to the bus 204.
The I/O interface 205 also connects the following components: an input section 206 including a keyboard, a mouse, and the like; an output section 207 including a cathode ray tube (CRT) or liquid crystal display (LCD), a speaker, and the like; a storage section 208 including a hard disk and the like; and a communication section 209 including a network interface card such as a LAN card, a modem, or the like. The communication section 209 performs communication processing via a network such as the Internet. A drive 210 is also connected to the I/O interface 205 as needed. A removable medium 211, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 210 as necessary, so that a computer program read out therefrom is installed into the storage section 208 as necessary.
The processes of the text classification model training method and the text labeling method provided by the embodiment of the invention can be realized as a computer software program. For example, in one exemplary embodiment, a computer program product is included that includes a computer program embodied on a computer-readable medium, the computer program including program code for performing a text classification model training method and a text labeling method. In an exemplary embodiment, the computer program can be downloaded and installed from a network through the communication section 209, and/or installed from the removable medium 211. The computer program, when executed by the central processing unit CPU201, performs various functions defined in the methods and apparatus of the present application.
It should be noted that the computer readable storage medium in the embodiments of the present invention may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory RAM, a read-only memory ROM, an erasable programmable read-only memory EPROM or flash memory, an optical fiber, a portable compact disc read-only memory CD-ROM, an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The working modes of the text classification model training method and device and the text labeling method and device in the embodiment of the invention are explained as follows:
example 1
According to an embodiment of the present invention, a method for training a text classification model is provided, fig. 3 is a flowchart of the method for training a text classification model according to the embodiment of the present invention, and as shown in fig. 3, the method for training a text classification model in the embodiment includes:
s102, generating a first text classification model and a second text classification model through a first sample set; the first text classification model is one or more, and the samples in the first sample set are labeled samples;
s104, marking the samples in the second sample set through at least one first text classification model so as to mark at least part of samples meeting a first preset condition in the second sample set as first cyclic samples; wherein the samples in the second sample set are unlabeled samples;
and S106, training the second text classification model through the first loop sample.
It should be further noted that, in step S102, since the samples in the first sample set are labeled, a preset text classification model, such as fastText, textCNN, textRNN, or Transformer, can be trained on the labeled samples in the first sample set to obtain the first text classification model and the second text classification model. There may be one first text classification model, in which case two text classification models are generated from the first sample set, corresponding to the first text classification model and the second text classification model. There may also be multiple first text classification models, in which case more than two text classification models are generated from the first sample set: one of them serves as the second text classification model, and the remaining ones form the multiple first text classification models. When there are multiple first text classification models, they may be text classification models of mutually different types, i.e., multiple different text classification models are generated from the first sample set.
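By way of illustration only, the following is a minimal sketch of step S102 in Python, using scikit-learn pipelines as stand-ins for the fastText/textCNN classifiers named above; the function and variable names are assumptions, not part of the embodiment:

    # A minimal sketch of step S102, assuming scikit-learn stand-ins for the
    # fastText/textCNN classifiers mentioned in the text; names are illustrative.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    def generate_models(labeled_texts, labels):
        # Two structurally different classifiers trained on the same labeled
        # first sample set, playing the roles of the first (F1) and second (F2)
        # text classification models.
        f1 = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        f2 = make_pipeline(TfidfVectorizer(), MultinomialNB())
        f1.fit(labeled_texts, labels)
        f2.fit(labeled_texts, labels)
        return f1, f2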
In step S104, when there are multiple first text classification models, the samples in the second sample set are labeled by each of the first text classification models, and during each model's labeling the samples in the second sample set that meet the first preset condition are selected as first cycle samples. Since the first cycle samples carry labels assigned by the first text classification models, the second text classification model can be trained on them.
It should be further noted that, in the embodiments of the invention, "first text classification model" and "second text classification model" do not refer to fixed, specific models. For example, if text classification models F1 and F2 are generated from the first sample set in step S102, then in step S104 F1 may be taken as the first text classification model and F2 as the second text classification model, so that F1 labels the samples in the second sample set and the resulting first cycle samples are used to train F2. Equally, F2 may be taken as the first text classification model and F1 as the second, so that F2 labels the samples in the second sample set and the resulting cycle samples are used to train F1.
Through the above definitions of different text classification models, step S104 in the embodiment of the present invention can implement mutual training of the first text classification model and the second text classification model.
According to the embodiments of the invention, one or more first text classification models and a second text classification model can be generated from the first sample set; the samples in the second sample set are then labeled through at least one first text classification model, so that at least some of the samples in the second sample set that satisfy a first preset condition are marked as first cycle samples; and the second text classification model is trained on the first cycle samples. The samples in the first sample set are labeled samples, and the samples in the second sample set are unlabeled samples. Training the second text classification model on the first cycle samples yields a trained second text classification model, which can then label unlabeled text automatically. The embodiments of the invention therefore solve the problem of low text labeling efficiency in the related art: replacing the manual labeling of the related art with automatic labeling by the trained second text classification model markedly improves labeling efficiency.
Moreover, because the first and second text classification models alternately label the second sample set and train each other, the labeling precision of the finally obtained trained second text classification model is significantly higher than that of manual labeling or of a single text classifier.
In an exemplary embodiment, for a case where the first text classification model is one, the text classification model training method in an embodiment of the present invention includes:
generating a first text classification model and a second text classification model through the first sample set;
labeling the samples in the second sample set through the first text classification model, so as to mark at least part of the samples in the second sample set whose classification confidence is higher than a preset threshold as first cycle samples; labeling the samples in the second sample set through the second text classification model, so as to mark at least part of the samples in the second sample set whose classification confidence is higher than a preset threshold as second cycle samples;
and training the first text classification model on the second cycle samples, and training the second text classification model on the first cycle samples.
It should be further noted that, in the above example, the first preset condition is that the classification confidence level is higher than the preset threshold.
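Under the stated assumptions of the earlier sketch, this confidence-based selection could look as follows (the threshold value of 0.9 is illustrative):

    # Sketch of the confidence-based selection (the first preset condition in
    # the single-model case): keep only pseudo-labels whose predicted-class
    # probability clears a threshold. Assumes the scikit-learn models from the
    # earlier sketch; all names are illustrative.
    import numpy as np

    def select_cycle_samples(model, unlabeled_texts, threshold=0.9):
        proba = model.predict_proba(unlabeled_texts)   # per-class probabilities
        confidence = proba.max(axis=1)                 # classification confidence
        pseudo = model.classes_[proba.argmax(axis=1)]  # pseudo-labels
        keep = confidence > threshold
        texts = [t for t, k in zip(unlabeled_texts, keep) if k]
        return texts, pseudo[keep]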
In an exemplary embodiment, after training the first text classification model on the second cycle samples and training the second text classification model on the first cycle samples, the method further includes:
circularly executing the following operations until a second preset condition is met:
labeling the samples in the second sample set through the trained first text classification model, so as to mark at least part of the samples in the second sample set whose classification confidence is higher than a preset threshold as first cycle samples; and labeling the samples in the second sample set through the trained second text classification model, so as to mark at least part of the samples in the second sample set whose classification confidence is higher than a preset threshold as second cycle samples;
and training the trained first text classification model on the second cycle samples, and training the trained second text classification model on the first cycle samples.
It should be further noted that, in the above example, after the first text classification model has been trained on the second cycle samples and the second text classification model on the first cycle samples, the labeling and training process can be repeated cyclically. For example, once the first text classification model (denoted the initial first text classification model) has labeled the second sample set to obtain first cycle samples, the second text classification model (denoted the initial second text classification model) can be trained on those first cycle samples; the result is denoted the once-trained second text classification model. Correspondingly, once the second text classification model has labeled the second sample set to obtain second cycle samples, the first text classification model can be trained on those second cycle samples; the result is denoted the once-trained first text classification model.
The once-trained first and second text classification models may then relabel the second sample set, and first and second cycle samples are selected again according to the classification confidence observed during labeling. With the new cycle samples, the once-trained first text classification model is trained on the second cycle samples to obtain the twice-trained first text classification model, and the once-trained second text classification model is trained on the first cycle samples to obtain the twice-trained second text classification model.
The same procedure can be repeated for the twice-trained models, and so on, realizing cyclic training of the first and second text classification models until a second preset condition is met.
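A condensed sketch of this mutual training loop, continuing the assumptions and the select_cycle_samples helper from the sketches above (the fixed cycle count stands in for the second preset condition):

    # Hedged sketch of the cyclic mutual training: each model labels the
    # unlabeled pool, and the high-confidence samples one model produces are
    # used to retrain the other. Keeping the original labeled set in each fit
    # is a design choice of this sketch, not stated in the text.
    def co_train(f1, f2, labeled_texts, labels, unlabeled_texts,
                 threshold=0.9, max_cycles=5):
        for _ in range(max_cycles):
            t1, y1 = select_cycle_samples(f1, unlabeled_texts, threshold)  # first cycle samples
            t2, y2 = select_cycle_samples(f2, unlabeled_texts, threshold)  # second cycle samples
            f2.fit(list(labeled_texts) + t1, list(labels) + list(y1))  # F1's labels train F2
            f1.fit(list(labeled_texts) + t2, list(labels) + list(y2))  # F2's labels train F1
        return f1, f2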
In an exemplary embodiment, the second preset condition includes at least one of:
the agreement between the labeling results of the first text classification model and the second text classification model on the samples in the second sample set is greater than or equal to a preset threshold;
the number of cycles is greater than or equal to a preset threshold;
each sample in the second sample set has been marked as a first cycle sample and/or a second cycle sample.
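One possible reading of these stopping conditions, continuing the earlier sketches (all thresholds are assumptions):

    # Sketch of the second preset condition: stop when the two models agree on
    # enough of the second sample set, the cycle budget is exhausted, or every
    # sample has been marked as a cycle sample. Thresholds are illustrative.
    import numpy as np

    def should_stop(f1, f2, unlabeled_texts, cycles, n_marked,
                    agree_threshold=0.95, max_cycles=10):
        agreement = np.mean(np.asarray(f1.predict(unlabeled_texts)) ==
                            np.asarray(f2.predict(unlabeled_texts)))
        return (agreement >= agree_threshold
                or cycles >= max_cycles
                or n_marked >= len(unlabeled_texts))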
In an exemplary embodiment, training the first text classification model on the second cycle samples and training the second text classification model on the first cycle samples comprises:
taking the second cycle samples as input samples of the first text classification model and the labels assigned to the second cycle samples by the second text classification model as the corresponding output samples, and training the first text classification model on these input and output samples;
and taking the first cycle samples as input samples of the second text classification model and the labels assigned to the first cycle samples by the first text classification model as the corresponding output samples, and training the second text classification model on these input and output samples.
In an exemplary embodiment, for a case where a plurality of first text classification models are provided, a text classification model training method in an embodiment of the present invention includes:
generating a first text classification model and a second text classification model through the first sample set; wherein there are multiple first text classification models;
labeling the samples in the second sample set through each of the plurality of first text classification models, so as to mark at least part of the samples in the second sample set on which the labels of the plurality of first text classification models agree as first cycle samples;
training the second text classification model on the first cycle samples.
It should be further noted that, in the above example, suppose that generating the first and second text classification models from the first sample set yields two first text classification models, F1 and F2, and one further text classification model, F3. F1 and F2 each label the samples in the second sample set; since F1 and F2 may be different types of text classification models, their labels for a given sample may or may not coincide. The samples on which F1 and F2 assign the same label are marked as first cycle samples, and F3 can be trained on those first cycle samples.
As described above, the roles of F1, F2, and F3 may be reassigned: F1 and F3 may be taken as the first text classification models and F2 as the second, in which case the process is repeated with F1 and F3 labeling the samples in the second sample set, the samples on which their labels agree being marked as first cycle samples, and F2 being trained on them. Similarly, F2 and F3 may be taken as the first text classification models and F1 as the second, in which case F2 and F3 label the samples in the second sample set, the samples on which their labels agree are marked as first cycle samples, and F1 is trained on them.
Thus, when there are multiple first text classification models, the first cycle samples are determined jointly by the multiple first text classification models, and the second text classification model is then trained on them.
It should be further noted that this training process may also be cyclic: for example, the trained F1, F2, and F3 may label the samples in the second sample set again to obtain new first cycle samples and be trained again, repeating until the second preset condition is met. The loop is analogous to the single-first-model case described earlier and is not detailed again here.
It should be further noted that, in the above example, the first preset condition is that the labels assigned by the multiple first text classification models to a sample in the second sample set are consistent.
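Under the same illustrative assumptions as the earlier sketches, this agreement-based first preset condition could be implemented as:

    # Sketch of the multi-model case: samples on which all of the first text
    # classification models assign the same label become first cycle samples,
    # which then train the second model. Names are illustrative.
    import numpy as np

    def select_by_agreement(first_models, unlabeled_texts):
        preds = [np.asarray(m.predict(unlabeled_texts)) for m in first_models]
        agree = np.all([p == preds[0] for p in preds], axis=0)
        texts = [t for t, a in zip(unlabeled_texts, agree) if a]
        return texts, preds[0][agree]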
In an exemplary embodiment, the plurality of first text classification models are different text classification models.
In an exemplary embodiment, generating the first text classification model and the second text classification model by the first sample set includes:
generating the first text classification model and the second text classification model from the samples in the first sample set and their corresponding labels, according to a preset classification mode.
In an exemplary embodiment, after generating the first text classification model and the second text classification model by the first sample set, the method further includes:
labeling at least part of the samples in the second sample set through the first text classification model, and putting the samples whose error is smaller than a preset threshold into the first sample set;
labeling at least part of the samples in the second sample set through the second text classification model, and putting the samples whose error is smaller than a preset threshold into the first sample set.
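Reading "error" here as one minus the classification confidence (an assumption about the text), this augmentation step could look like the following, reusing select_cycle_samples from the earlier sketch:

    # Sketch of the optional augmentation: samples whose (assumed) labeling
    # error is below a threshold are pseudo-labeled and moved into the first
    # (labeled) sample set, growing the pool for subsequent training.
    def augment_labeled_set(model, labeled_texts, labels, unlabeled_texts,
                            error_threshold=0.1):
        texts, pseudo = select_cycle_samples(model, unlabeled_texts,
                                             threshold=1.0 - error_threshold)
        return list(labeled_texts) + texts, list(labels) + list(pseudo)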
To further illustrate the text classification model training method in the embodiment of the present invention, the following description is made with an exemplary embodiment:
s1, establishing a first sample set. The following text samples are labeled manually or automatically, and the labeled text samples can form a first sample set.
The first set of samples includes:
"annotated sample 01: the company is mainly engaged in the research and production of cell detection preparation and storage, gene detection, in vitro diagnostic reagents and instruments, and the research and production of biological genes, proteins, antibodies, medical intermediates and experimental comprehensive agents. Annotated sample 02: companies have established an industry chain of unique pre-clinical drug research services, clinical services, pharmacoviginal alert services, breeding and marketing of high-quality laboratory animals, and gene editing model animal customization services. Labeled sample 03: the business engaged in by the company mainly comprises five major boards of a whole-process design consultation business, a general engineering contract business, a green energy-saving scientific and technological service business, an engineering detection business and an investment and industry combined business. Labeled sample 04: the company main operation business is an engineering consultation and engineering contract business, covers the industries of highways, municipal administration, buildings, water transportation and the like, and mainly provides engineering technical services such as investigation, design, consultation, test detection, supervision, construction, general contract and the like in the fields of highways, bridges, tunnels, rock-soil, electromechanics, municipal administration, buildings, ports, navigation channels and the like.
In the first sample set, each sample is labeled "main business of the company".
S2, training a fastText classifier and a textCNN classifier on the first sample set, respectively, to obtain a first text classification model F1 and a second text classification model F2.
S3, labeling the second sample set through the first text classification model F1.
The second set of samples includes:
"unlabeled sample 01: the company automobile body part products mainly refer to stamping and welding assembly parts forming the white automobile body of the automobile, and comprise a wheel cover assembly, a column assembly, a skylight frame assembly, a rear end plate assembly, a coat and hat plate assembly, a tail lamp support assembly, a side wall assembly, a middle channel assembly and the like, and the stamping and assembly parts of aluminum alloy are produced in batch. Unlabeled sample 02: the company is a comprehensive pharmaceutical enterprise integrating the whole value chain of research, production and marketing, integrating raw material medicines and preparations and developing in multiple regions, and mainly manages research, development, production and sale services of chemical raw material medicines and preparations. Unlabeled sample 03: the company is concentrated in the field of smart grid business, and specializes in research, development, production, sales and technical service of products such as smart grid power distribution, power transformation, power utilization, high-low voltage switches, complete equipment, distributed photovoltaic power generation equipment and the like. Unlabeled sample 04: the main business and operation mode of the company include liquefied natural gas production/sale and investment, energy engineering service, methanol and other energy chemical products production, sale and trade, coal mining, washing and trade, and biological pesticide and veterinary drug raw material and preparation production and sale. Unlabeled sample 05: welding and cutting equipment which companies have in charge is called as a steel sewing machine and a steel scissors respectively, is indispensable basic processing equipment in modern industrial production, needs the welding and cutting equipment as long as the industrial field of metal material processing is used, and has a very wide application range. Unlabeled sample 06: the main business of the main business company of the company is to engage in the production and the sale of light packaging products and heavy packaging products, and provide packaging product research and development design, integral packaging scheme optimization, third party purchase and packaging product logistics distribution, supplier inventory management, auxiliary packaging operation and other packaging integrated services for customers.
After the first text classification model F1 labels the second sample set, the classification confidence of unlabeled samples 02, 03, 04, and 06 is higher than the preset threshold, so these samples are marked and used as first cycle samples. The first cycle samples are labeled "main business of the company".
S4, training the second text classification model F2 on the first cycle samples.
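Expressed with the illustrative helpers sketched earlier (the variable names and the scikit-learn stand-ins are assumptions; the texts would be the company descriptions above):

    # Tying the exemplary embodiment S1-S4 to the earlier sketches.
    # first_set_texts/first_set_labels hold the labeled samples from S1, and
    # second_set_texts the unlabeled samples from S3 (definitions elided).
    f1, f2 = generate_models(first_set_texts, first_set_labels)             # S2
    cycle_texts, cycle_labels = select_cycle_samples(f1, second_set_texts)  # S3
    f2.fit(cycle_texts, cycle_labels)  # S4: train F2 on the first cycle samples
    # (In practice one might also keep the first sample set in this fit.)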
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
Example 2
According to an embodiment of the present invention, a text labeling method is provided. The text labeling method in this embodiment uses the second text classification model obtained by the text classification model training method of embodiment 1. Fig. 4 is a flowchart of the text labeling method provided according to an embodiment of the present invention; as shown in fig. 4, the text labeling method in this embodiment includes:
s202, generating a first text classification model and a second text classification model through a first sample set; the first text classification model is one or more, and the samples in the first sample set are labeled samples;
s204, labeling the samples in the second sample set through at least one first text classification model so as to mark at least part of samples meeting a first preset condition in the second sample set as first cyclic samples; wherein the samples in the second sample set are unlabeled samples;
s206, training a second text classification model through the first circulation sample;
and S208, labeling the samples in the second sample set through the trained second text classification model.
It should be further noted that the steps S202 to S206 correspond to the steps S102 to S106 in the embodiment 1, and the technical solutions described in the steps S202 to S206 in this embodiment can be applied to the exemplary embodiment corresponding to the steps S102 to S106 in the embodiment 1, and therefore, the description thereof is omitted.
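Continuing the earlier sketches, step S208 reduces to a single prediction call (names remain illustrative):

    # Sketch of step S208: the trained second text classification model labels
    # the samples of the second sample set directly.
    predicted = f2.predict(second_set_texts)
    for text, label in zip(second_set_texts, predicted):
        print(label, "->", text[:40])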
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
Example 3
According to another embodiment of the present invention, there is also provided a text classification model training apparatus for implementing the foregoing embodiments and preferred embodiments; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the embodiments below is preferably implemented in software, an implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
Fig. 5 is a block diagram of a text classification model training apparatus according to an embodiment of the present invention, and as shown in fig. 5, the apparatus includes:
a generating module 302, configured to generate a first text classification model and a second text classification model through a first sample set; the first text classification model is one or more, and the samples in the first sample set are labeled samples;
a loop module 304, configured to label, through at least one first text classification model, samples in the second sample set, so as to mark at least part of the samples in the second sample set that meet a first preset condition as first cycle samples; wherein the samples in the second sample set are unlabeled samples;
and a training module 306, configured to train the second text classification model on the first cycle samples.
It should be further noted that the technical effects of the text classification model training apparatus in this embodiment and the remaining exemplary embodiments are all corresponding to the text classification model training method in embodiment 1, and therefore are not described herein again.
In an exemplary embodiment, in the present embodiment, the generating module 302 is further configured to generate a first text classification model and a second text classification model through the first sample set;
the loop module 304 is further configured to label the samples in the second sample set through the first text classification model, so as to mark at least part of the samples in the second sample set whose classification confidence is higher than a preset threshold as first cycle samples, and to label the samples in the second sample set through the second text classification model, so as to mark at least part of the samples in the second sample set whose classification confidence is higher than a preset threshold as second cycle samples;
the training module 306 is further configured to train the first text classification model on the second cycle samples and the second text classification model on the first cycle samples.
In an exemplary embodiment, after training the first text classification model on the second cycle samples and training the second text classification model on the first cycle samples, the method further includes:
circularly executing the following operations until a second preset condition is met:
labeling the samples in the second sample set through the trained first text classification model, so as to mark at least part of the samples in the second sample set whose classification confidence is higher than a preset threshold as first cycle samples; and labeling the samples in the second sample set through the trained second text classification model, so as to mark at least part of the samples in the second sample set whose classification confidence is higher than a preset threshold as second cycle samples;
and training the trained first text classification model on the second cycle samples, and training the trained second text classification model on the first cycle samples.
In an exemplary embodiment, the second preset condition includes at least one of:
the agreement between the labeling results of the first text classification model and the second text classification model on the samples in the second sample set is greater than or equal to a preset threshold;
the number of cycles is greater than or equal to a preset threshold;
each sample in the second sample set has been marked as a first cycle sample and/or a second cycle sample.
In an exemplary embodiment, training the first text classification model on the second cycle samples and training the second text classification model on the first cycle samples comprises:
taking the second cycle samples as input samples of the first text classification model and the labels assigned to the second cycle samples by the second text classification model as the corresponding output samples, and training the first text classification model on these input and output samples;
and taking the first cycle samples as input samples of the second text classification model and the labels assigned to the first cycle samples by the first text classification model as the corresponding output samples, and training the second text classification model on these input and output samples.
In an exemplary embodiment:
the generating module 302 is further configured to generate a first text classification model and a second text classification model through the first sample set, where there are multiple first text classification models;
the loop module 304 is further configured to label the samples in the second sample set through each of the plurality of first text classification models, so as to mark at least part of the samples in the second sample set on which the labels of the plurality of first text classification models agree as first cycle samples;
the training module 306 is further configured to train the second text classification model on the first cycle samples.
In an exemplary embodiment, the plurality of first text classification models are different text classification models.
In an exemplary embodiment, generating the first text classification model and the second text classification model by the first sample set includes:
generating the first text classification model and the second text classification model from the samples in the first sample set and their corresponding labels, according to a preset classification mode.
In an exemplary embodiment, after generating the first text classification model and the second text classification model by the first sample set, the method further includes:
labeling at least part of the samples in the second sample set through the first text classification model, and putting the samples whose error is smaller than a preset threshold into the first sample set;
labeling at least part of the samples in the second sample set through the second text classification model, and putting the samples whose error is smaller than a preset threshold into the first sample set.
It should be noted that, the above modules may be implemented by software or hardware, and for the latter, the following may be implemented, but not limited to: the modules are all positioned in the same processor; alternatively, the modules are respectively located in different processors in any combination.
Example 4
According to another embodiment of the present invention, there is also provided a text labeling apparatus, which is used for implementing the foregoing embodiment and preferred embodiments, and the description of the text labeling apparatus is omitted. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
The text labeling apparatus in this embodiment is applied to the second text classification model obtained by the training of the text classification model training apparatus in embodiment 3, fig. 6 is a block diagram of a structure of the text labeling apparatus according to the embodiment of the present invention, and as shown in fig. 6, the apparatus includes:
a generating module 402, configured to generate a first text classification model and a second text classification model through a first sample set; where there are one or more first text classification models, and the samples in the first sample set are labeled samples;
a loop module 404, configured to label, through at least one first text classification model, samples in the second sample set, so as to mark at least part of the samples in the second sample set that meet a first preset condition as first cycle samples; wherein the samples in the second sample set are unlabeled samples;
a training module 406, configured to train the second text classification model on the first cycle samples;
and a labeling module 408, configured to label the samples in the second sample set through the trained second text classification model.
It should be further noted that the generating module 402, the loop module 404, and the training module 406 correspond to the generating module 302, the loop module 304, and the training module 306 of embodiment 3, and the exemplary embodiments described there for those modules apply equally here; they are therefore not described again.
It should be noted that, the above modules may be implemented by software or hardware, and for the latter, the following may be implemented, but not limited to: the modules are all positioned in the same processor; alternatively, the modules are respectively located in different processors in any combination.
Example 5
Embodiments of the present invention also provide a computer-readable storage medium having a computer program stored thereon, wherein the computer program is arranged to perform the steps of any of the above-mentioned method embodiments when executed.
In this embodiment, the computer-readable storage medium may be configured to store a computer program for executing the steps in embodiments 1 and 2.
In an exemplary embodiment, the computer-readable storage medium may include, but is not limited to: various media capable of storing computer programs, such as a usb disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Example 6
Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.
In an exemplary embodiment, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
In an exemplary embodiment, the processor may be configured to execute the steps in embodiment 1 and embodiment 2 through a computer program.
For specific examples in this embodiment, reference may be made to the examples described in the above embodiments and exemplary embodiments, and details of this embodiment are not repeated herein.
It will be apparent to those skilled in the art that the various modules or steps of the invention described above may be implemented using a general purpose computing device, they may be centralized on a single computing device or distributed across a network of computing devices, and they may be implemented using program code executable by the computing devices, such that they may be stored in a memory device and executed by the computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into various integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.

Claims (14)

1. A method for training a text classification model, the method comprising:
generating a first text classification model and a second text classification model through the first sample set; wherein the first text classification model is one or more, and the samples in the first sample set are labeled samples;
labeling samples in a second sample set through at least one first text classification model so as to mark at least part of samples meeting a first preset condition in the second sample set as first cycle samples; wherein the samples in the second sample set are unlabeled samples;
and training the second text classification model on the first cycle samples.
2. The method of claim 1, wherein in a case where the first text classification model is one, the method comprises:
generating the first text classification model and the second text classification model from the first sample set;
labeling the samples in the second sample set through the first text classification model, so as to mark at least part of the samples in the second sample set whose classification confidence is higher than a preset threshold as the first cycle samples; labeling the samples in the second sample set through the second text classification model, so as to mark at least part of the samples in the second sample set whose classification confidence is higher than a preset threshold as second cycle samples;
and training the first text classification model on the second cycle samples, and training the second text classification model on the first cycle samples.
3. The method of claim 2, wherein after training the first text classification model on the second cycle samples and training the second text classification model on the first cycle samples, further comprising:
circularly executing the following operations until a second preset condition is met:
labeling samples in a second sample set through the trained first text classification model, so as to label at least part of samples with classification credibility higher than a preset threshold value in the second sample set as the first cyclic sample; labeling the samples in the second sample set through the trained second text classification model, so as to label at least part of samples with classification credibility higher than a preset threshold value in the second sample set as the second cyclic sample;
and training the trained first text classification model according to the second cyclic sample, and training the trained second text classification model according to the first cyclic sample.
4. The method of claim 3, wherein the second preset condition comprises at least one of:
the predicted values of the consistency of the results of labeling the samples in the second sample set by the first text classification model and the second text classification model are greater than or equal to a preset threshold value;
the number of cycles is greater than or equal to a preset threshold;
each sample in the second sample set is marked as a first cycle sample and/or a second cycle sample.
5. The method of claim 2, wherein training the first text classification model with the second cycle samples and training the second text classification model with the first cycle samples comprises:
taking the second cycle samples as input samples of the first text classification model, taking the labels assigned to the second cycle samples by the second text classification model as output samples of the first text classification model, and training the first text classification model with these input samples and output samples; and
taking the first cycle samples as input samples of the second text classification model, taking the labels assigned to the first cycle samples by the first text classification model as output samples of the second text classification model, and training the second text classification model with these input samples and output samples.
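To show how claims 2 to 5 interlock, the sketch below (illustrative only) cross-trains the two models until one of the claim 4 stopping conditions holds; the agreement measure, the thresholds, and the refit-based training are assumptions, not the patent's prescribed implementation:

    import numpy as np
    from scipy.sparse import vstack

    def co_train(first_model, second_model, X_lab, y_lab, X_unlab,
                 conf_threshold=0.9, agree_threshold=0.95, max_cycles=10):
        """Hypothetical co-training loop for claims 2-5."""
        for _ in range(max_cycles):  # stop: cycle count reaches a preset threshold
            p1 = first_model.predict_proba(X_unlab)
            p2 = second_model.predict_proba(X_unlab)
            y1 = first_model.classes_[p1.argmax(axis=1)]
            y2 = second_model.classes_[p2.argmax(axis=1)]

            # Stop: the two labelings of the second sample set are consistent
            # at least as often as the preset threshold.
            if (y1 == y2).mean() >= agree_threshold:
                break

            # First/second cycle samples: confidently labeled by each model.
            m1 = p1.max(axis=1) >= conf_threshold
            m2 = p2.max(axis=1) >= conf_threshold

            # Cross-training (claim 5): each model's output samples are the
            # labels the other model assigned to its cycle samples.
            first_model.fit(vstack([X_lab, X_unlab[m2]]),
                            np.concatenate([y_lab, y2[m2]]))
            second_model.fit(vstack([X_lab, X_unlab[m1]]),
                             np.concatenate([y_lab, y1[m1]]))

            # Stop: every sample is covered by the first and/or second cycle set.
            if (m1 | m2).all():
                break
        return first_model, second_model

Refitting from the original labeled set on each cycle is only one defensible reading of "training the trained model"; incremental updates would be equally consistent with the claim language.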
6. The method of claim 1, wherein, in the case where there are a plurality of first text classification models, the method comprises:
generating the plurality of first text classification models and the second text classification model from the first sample set;
labeling samples in the second sample set by means of each of the plurality of first text classification models, so as to mark at least some samples in the second sample set that are labeled consistently by all of the first text classification models as the first cycle samples; and
training the second text classification model with the first cycle samples.
7. The method of claim 6, wherein the plurality of first text classification models are different text classification models.
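For claims 6 and 7, the first preset condition becomes unanimous agreement among several mutually different first models. A possible helper, with hypothetical names and the same assumed stack as the earlier sketches:

    import numpy as np

    def consistent_cycle_samples(first_models, X_unlab):
        # Keep only samples that every first model labels identically --
        # one hypothetical reading of the claim 6 condition.
        preds = np.stack([m.predict(X_unlab) for m in first_models])
        unanimous = (preds == preds[0]).all(axis=0)
        return unanimous, preds[0]

    # Sketch of use: the second model then trains on the agreed pseudo-labels.
    # mask, labels = consistent_cycle_samples([nb, lr, svm], X_unlab)
    # second_model.fit(vstack([X_lab, X_unlab[mask]]),
    #                  np.concatenate([y_lab, labels[mask]]))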
8. The method of any one of claims 1 to 7, wherein generating the first text classification model and the second text classification model from the first sample set comprises:
generating the first text classification model and the second text classification model in a preset classification manner, according to the samples in the first sample set and the labels corresponding to those samples.
9. The method of claim 8, further comprising, after generating the first text classification model and the second text classification model from the first sample set:
labeling at least some samples in the second sample set by means of the first text classification model, and putting samples whose labeling error is smaller than a preset threshold into the first sample set; and
labeling at least some samples in the second sample set by means of the second text classification model, and putting samples whose labeling error is smaller than a preset threshold into the first sample set.
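Claim 9 does not define how the "error" of a label on an unlabeled sample is measured; one hedged reading, used in the sketch below, is the model's predicted-class uncertainty (1 minus the top class probability). The threshold and the helper name are assumptions:

    import numpy as np
    from scipy.sparse import vstack

    ERROR_THRESHOLD = 0.1  # assumed preset threshold

    def absorb_low_error_samples(model, X_lab, y_lab, X_unlab):
        # Hypothetical claim 9 step: move samples the model labels with low
        # "error" (read here as 1 - max class probability) into the first
        # (labeled) sample set, together with their predicted labels.
        proba = model.predict_proba(X_unlab)
        error = 1.0 - proba.max(axis=1)
        keep = error < ERROR_THRESHOLD
        X_new = vstack([X_lab, X_unlab[keep]])
        y_new = np.concatenate([y_lab, model.classes_[proba.argmax(axis=1)][keep]])
        return X_new, y_new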
10. A text labeling method, comprising the text classification model training method of any one of claims 1 to 9, the text labeling method further comprising:
labeling the samples in the second sample set by means of the trained second text classification model.
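Under the same assumptions, the labeling step of claim 10 reduces to a single prediction pass with the trained second model:

    # Hypothetical claim 10 step (names carried over from the earlier sketches).
    final_labels = second_model.predict(vectorizer.transform(unlabeled_texts))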
11. A text classification model training device, comprising:
a generation module, configured to generate a first text classification model and a second text classification model from a first sample set, wherein there are one or more first text classification models, and the samples in the first sample set are labeled samples;
a cycle module, configured to label samples in a second sample set by means of at least one first text classification model, so as to mark at least some samples in the second sample set that meet a first preset condition as first cycle samples, wherein the samples in the second sample set are unlabeled samples; and
a training module, configured to train the second text classification model with the first cycle samples.
12. A text labeling device, comprising the text classification model training device of claim 11, the text labeling device further comprising:
a labeling module, configured to label the samples in the second sample set by means of the trained second text classification model.
13. A computer-readable storage medium, in which a computer program is stored, wherein the computer program is arranged, when executed, to perform the method of any one of claims 1 to 9 or the method of claim 10.
14. An electronic device, comprising a memory and a processor, wherein the memory has a computer program stored therein, and the processor is arranged to execute the computer program to perform the method of any one of claims 1 to 9 or the method of claim 10.
CN202010761788.3A 2020-07-31 2020-07-31 Text classification model training method and device and text labeling method and device Pending CN111881295A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010761788.3A CN111881295A (en) 2020-07-31 2020-07-31 Text classification model training method and device and text labeling method and device

Publications (1)

Publication Number Publication Date
CN111881295A 2020-11-03

Family

ID=73205023

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010761788.3A Pending CN111881295A (en) 2020-07-31 2020-07-31 Text classification model training method and device and text labeling method and device

Country Status (1)

Country Link
CN (1) CN111881295A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109460795A (en) * 2018-12-17 2019-03-12 北京三快在线科技有限公司 Classifier training method, apparatus, electronic equipment and computer-readable medium
CN109582793A (en) * 2018-11-23 2019-04-05 深圳前海微众银行股份有限公司 Model training method, customer service system and data labeling system, readable storage medium storing program for executing
CN110110080A (en) * 2019-03-29 2019-08-09 平安科技(深圳)有限公司 Textual classification model training method, device, computer equipment and storage medium
CN110889463A (en) * 2019-12-10 2020-03-17 北京奇艺世纪科技有限公司 Sample labeling method and device, server and machine-readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination