CN115858776B - Variant text classification recognition method, system, storage medium and electronic equipment - Google Patents

Variant text classification recognition method, system, storage medium and electronic equipment

Info

Publication number
CN115858776B
Authority
CN
China
Prior art keywords
text
variant
corpus
data set
supervised
Prior art date
Legal status
Active
Application number
CN202211348321.1A
Other languages
Chinese (zh)
Other versions
CN115858776A (en)
Inventor
刘苏楠
Current Assignee
Shumei Tianxia Beijing Technology Co ltd
Beijing Nextdata Times Technology Co ltd
Original Assignee
Shumei Tianxia Beijing Technology Co ltd
Beijing Nextdata Times Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shumei Tianxia Beijing Technology Co ltd, Beijing Nextdata Times Technology Co ltd
Priority to CN202211348321.1A
Publication of CN115858776A
Application granted
Publication of CN115858776B

Landscapes

  • Character Discrimination (AREA)

Abstract

The invention relates to a variant text classification recognition method, system, storage medium and electronic device, comprising the following steps: acquiring a first text data set, a supervised corpus data set and an unsupervised corpus data set, and constructing a variant error correction text data set from the supervised corpus data set and the unsupervised corpus data set; training a first original neural network model based on the first text data set and the variant error correction text data set to obtain a target text classification model for text variant error correction and text classification recognition; and inputting the text to be recognized into the target text classification model to obtain a target recognition result containing the variant error correction and text classification of the text to be recognized. By constructing the variant error correction data set from the supervised and unsupervised corpus data sets, training the variant error correction task on that data set, and training the model on this auxiliary task together with the classification task, the invention regularizes the model's semantic understanding of variants and thereby improves the recognition accuracy of the classification model.

Description

Variant text classification recognition method, system, storage medium and electronic equipment
Technical Field
The invention relates to the technical field of text classification, and in particular to a variant text classification recognition method, system, storage medium and electronic device.
Background
A neural network can be trained to obtain a classification model that identifies and intercepts forbidden content. To evade network supervision, harmful text content often contains large numbers of variants, either near-sound (homophone) or near-shape (visually similar) substitutions, which poses a great challenge to internet content supervision. A common solution to this challenge is to add corresponding variant samples to the data set on which the classification model is trained. However, while this scheme improves the model's recall on variant samples, it reduces the accuracy of the classification model.
Therefore, there is a need to provide a solution to the problems in the prior art.
Disclosure of Invention
To solve the above technical problem, the invention provides a variant text classification recognition method, system, storage medium and electronic device.
The technical scheme of the variant text classification recognition method is as follows:
acquiring a first text data set, a supervised corpus data set and an unsupervised corpus data set, and constructing a variant error correction text data set according to the supervised corpus data set and the unsupervised corpus data set;
training a first original neural network model based on the first text data set and the variant error correction text data set to obtain a target text classification model for text variant error correction and text classification recognition;
inputting the text to be identified into the target text classification model to obtain a target identification result containing the variant error correction and text classification of the text to be identified.
The variant text classification recognition method has the following beneficial effects:
according to the method, the variant error correction data set is constructed through the supervised and unsupervised corpus data sets, the variant error correction task is trained through the variant error correction data set, and the variant error correction task is used as an auxiliary task to train the model together with the classification task, so that the regular effect on the variant semantic understanding of the model can be achieved, and the recognition accuracy of the classification model is improved.
Based on the scheme, the variant text classification recognition method can be improved as follows.
Further, the method further comprises the following steps:
training a second original neural network model for text classification based on the first text data set to obtain an original text classification model.
Further, the step of constructing a variant error correction text dataset from the supervised corpus dataset and the unsupervised corpus dataset, comprises:
classifying the supervised corpus data set by using the original text classification model to obtain a supervised corpus black sample set and a supervised corpus white sample set, and classifying the unsupervised corpus data set by using the original text classification model to obtain an unsupervised corpus black sample set and an unsupervised corpus white sample set;
generating a supervised language model by using the supervised corpus black sample set, and generating an unsupervised language model by using the unsupervised corpus black sample set;
based on a keyword extraction technique, extracting a black sample template from the unsupervised corpus black sample set, and obtaining a first variant mapping data set according to the black sample template, the supervised language model and the unsupervised language model;
manually labeling the first variant mapping data set to obtain a target variant mapping data set, and obtaining the variant error correction text data set according to the target variant mapping data set, the supervised corpus white sample set and the unsupervised corpus white sample set.
The beneficial effects of adopting this further scheme are as follows: the variant error correction data set is constructed automatically by building the supervised language model and the unsupervised language model, which improves the production efficiency of the variant error correction data set compared with fully manual annotation.
Further, the step of generating a supervised language model by training the supervised corpus black sample set and generating an unsupervised language model by training the unsupervised corpus black sample set includes:
training the supervised corpus black sample set by adopting a Masked LM (masked language model) mode to obtain the supervised language model, and training the unsupervised corpus black sample set in the same mode to obtain the unsupervised language model.
The technical scheme of the variant text classification recognition system is as follows:
comprising: a construction module, a training module and an identification module;
the construction module is used for: acquiring a first text data set, a supervised corpus data set and an unsupervised corpus data set, and constructing a variant error correction text data set according to the supervised corpus data set and the unsupervised corpus data set;
the training module is used for: training a first original neural network model based on the first text data set and the variant error correction text data set to obtain a target text classification model for text variant error correction and text classification recognition;
the identification module is used for: inputting the text to be identified into the target text classification model to obtain a target identification result containing the variant error correction and text classification of the text to be identified.
The variant text classification recognition system has the following beneficial effects:
according to the system, the variant error correction data set is constructed through the supervised and unsupervised corpus data sets, the variant error correction task is trained through the variant error correction data set, and the variant error correction task is used as an auxiliary task to train the model together with the classification task, so that the regular effect can be played on the variant semantic understanding of the model, and the recognition accuracy of the classification model is improved.
Based on the scheme, the variant text classification recognition system can be improved as follows.
Further, the system further comprises: a processing module;
the processing module is used for: and training a second original neural network model for text classification based on the first text data set to obtain an original text classification model.
Further, the construction module is specifically configured to:
classifying the supervised corpus data set by using the original text classification model to obtain a supervised corpus black sample set and a supervised corpus white sample set, and classifying the unsupervised corpus data set by using the original text classification model to obtain an unsupervised corpus black sample set and an unsupervised corpus white sample set;
generating a supervised language model by using the supervised corpus black sample set, and generating an unsupervised language model by using the unsupervised corpus black sample set;
based on a keyword extraction technique, extracting a black sample template from the unsupervised corpus black sample set, and obtaining a first variant mapping data set according to the black sample template, the supervised language model and the unsupervised language model;
manually labeling the first variant mapping data set to obtain a target variant mapping data set, and obtaining the variant error correction text data set according to the target variant mapping data set, the supervised corpus white sample set and the unsupervised corpus white sample set.
The beneficial effects of adopting this further scheme are as follows: the variant error correction data set is constructed automatically by building the supervised language model and the unsupervised language model, which improves the production efficiency of the variant error correction data set compared with fully manual annotation.
Further, the construction module is specifically configured to:
training the supervised corpus black sample set by adopting a Masked LM mode to obtain the supervised language model, and training the unsupervised corpus black sample set in the same mode to obtain the unsupervised language model.
The technical scheme of the storage medium is as follows:
the storage medium has stored therein instructions which, when read by a computer, cause the computer to perform the steps of a variant text classification recognition method according to the invention.
The technical scheme of the electronic equipment is as follows:
The electronic device comprises a memory, a processor and a computer program stored on the memory and executable on the processor; when the processor executes the computer program, the computer performs the steps of the variant text classification recognition method according to the invention.
Drawings
FIG. 1 is a flow chart of a variant text classification recognition method according to an embodiment of the invention;
FIG. 2 is a schematic structural diagram of a variant text classification recognition system according to an embodiment of the invention.
Detailed Description
As shown in FIG. 1, a variant text classification recognition method according to an embodiment of the present invention includes the following steps:
s1, acquiring a first text data set, a supervised corpus data set and an unsupervised corpus data set, and constructing a variant error correction text data set according to the supervised corpus data set and the unsupervised corpus data set.
Wherein, (1) the first text data set is a data set comprising a plurality of texts that can be used to train a text classification model; each piece of data in the first text data set is labeled with a classification type, such as: forbidden, normal, etc. (2) The supervised corpus data set includes a plurality of supervised corpus texts, obtained from text content sent by a population subject to supervision; these texts contain many variant texts. (3) The unsupervised corpus data set includes a plurality of unsupervised corpus texts, obtained from text content sent by a population not subject to supervision; these texts essentially contain no variant texts. (4) The variant error correction text data set is used to train the variant error correction task and comprises a plurality of variant data pairs. For example, one variant data pair is: "你是沙子" ("you are sand", the variant text) and "你是傻子" ("you are a fool", the ontology text), where the near-sound character "沙" disguises the ontology character "傻".
S2, training the first original neural network model based on the first text data set and the variant error correction text data set to obtain a target text classification model for text variant error correction and text classification recognition.
Wherein, (1) the first original neural network model is a neural network model that can be used simultaneously for text variant error correction and text classification recognition; the two tasks share a model backbone and differ only in their output layers. (2) The target text classification model is the model for text variant error correction and text classification recognition obtained after training.
It should be noted that, in the training process of the first original neural network model, the first text data set is used for training the text classification task of the first original neural network model, and the variant error correction text data set is used for training the variant error correction task of the first original neural network model.
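To make the shared-backbone, two-head structure concrete, the following is a minimal PyTorch sketch of one way such a model could be assembled; the BERT-style encoder, the head shapes, the cross-entropy losses and the equal task weighting are illustrative assumptions, not details specified by the patent.

```python
# Hypothetical sketch: one shared encoder backbone with two output layers,
# a sentence-level classification head and a token-level error correction head.
import torch.nn as nn
from transformers import BertModel

class VariantAwareClassifier(nn.Module):
    def __init__(self, encoder_name="bert-base-chinese", num_classes=2):
        super().__init__()
        self.encoder = BertModel.from_pretrained(encoder_name)  # shared backbone
        hidden = self.encoder.config.hidden_size
        vocab = self.encoder.config.vocab_size
        self.cls_head = nn.Linear(hidden, num_classes)  # text classification output layer
        self.corr_head = nn.Linear(hidden, vocab)       # variant error correction output layer

    def forward(self, input_ids, attention_mask, cls_labels=None, corr_labels=None):
        states = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        cls_logits = self.cls_head(states[:, 0])        # prediction from the [CLS] position
        corr_logits = self.corr_head(states)            # prediction at every token position
        loss = None
        if cls_labels is not None or corr_labels is not None:
            ce = nn.CrossEntropyLoss(ignore_index=-100)
            loss = 0.0
            if cls_labels is not None:                  # classification task
                loss = loss + ce(cls_logits, cls_labels)
            if corr_labels is not None:                 # auxiliary error correction task
                loss = loss + ce(corr_logits.view(-1, corr_logits.size(-1)),
                                 corr_labels.view(-1))
        return cls_logits, corr_logits, loss
```

Under this reading, batches from the first text data set supply cls_labels and batches from the variant error correction text data set supply corr_labels (the ontology token ids, with -100 at positions to ignore), so each data set drives its own head while the auxiliary task regularizes the shared encoder.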
S3, inputting the text to be identified into the target text classification model to obtain a target identification result containing the variant error correction and text classification of the text to be identified.
Wherein, (1) the text to be recognized is any chosen text; it may be a variant text or an ontology text. (2) The target recognition result includes the variant error correction result and the text classification result. For example, if the text to be recognized is "你是沙子" ("you are sand"), the corresponding target recognition result includes the variant error correction result "你是傻子" ("you are a fool") and the text classification result "forbidden".
It should be noted that the text classification result is judged against a preset threshold of the target text classification model. For example, assuming the preset threshold, i.e. the lower limit of the forbidden probability, defaults to 0.7: when inputting the text to be recognized into the target text classification model yields a forbidden probability of 0.8, the text classification result is judged to be forbidden; when the forbidden probability is 0.3, the text classification result is judged to be normal. In this embodiment, the preset threshold can be set as required and is not limited here.
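As a small illustration of that thresholding step (the function name and interface are assumptions, with the 0.7 default taken from the example above):

```python
def classify_with_threshold(forbidden_prob: float, threshold: float = 0.7) -> str:
    # A forbidden probability at or above the threshold is judged forbidden.
    return "forbidden" if forbidden_prob >= threshold else "normal"

assert classify_with_threshold(0.8) == "forbidden"
assert classify_with_threshold(0.3) == "normal"
```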
Preferably, the method further comprises:
training a second original neural network model for text classification based on the first text data set to obtain an original text classification model.
Wherein, (1) the second original neural network model is a neural network model that can be used for text classification. (2) The original text classification model is the text classification model obtained after training; its specific training process is not described in detail here.
Preferably, the step of constructing a variant error correction text dataset from the supervised corpus dataset and the unsupervised corpus dataset comprises:
classifying the supervised corpus data set by using the original text classification model to obtain a supervised corpus black sample set and a supervised corpus white sample set, and classifying the unsupervised corpus data set by using the original text classification model to obtain an unsupervised corpus black sample set and an unsupervised corpus white sample set.
The original text classification model judges forbidden text against a default threshold of 0.7: texts whose forbidden probability is greater than or equal to 0.7 form the black sample set, and texts below 0.7 form the white sample set. The supervised corpus data set and the unsupervised corpus data set are therefore each classified by the original text classification model, yielding the supervised corpus black sample set, the supervised corpus white sample set, the unsupervised corpus black sample set and the unsupervised corpus white sample set respectively.
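The black/white split could be implemented along the following lines; predict_forbidden_prob stands in for the original text classification model's scoring interface and is an assumed name, not an API from the patent.

```python
# Illustrative sketch: partition a corpus into black and white sample sets by
# the forbidden probability of the original text classification model.
def split_black_white(corpus, predict_forbidden_prob, threshold=0.7):
    black, white = [], []
    for text in corpus:
        (black if predict_forbidden_prob(text) >= threshold else white).append(text)
    return black, white

# Applied once per corpus:
# supervised_black, supervised_white = split_black_white(supervised_corpus, prob_fn)
# unsupervised_black, unsupervised_white = split_black_white(unsupervised_corpus, prob_fn)
```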
Generating a supervised language model by using the supervised corpus black sample set, and generating an unsupervised language model by using the unsupervised corpus black sample set.
Specifically, a Masked LM mode is adopted to train on the supervised corpus black sample set to obtain the supervised language model, and on the unsupervised corpus black sample set to obtain the unsupervised language model.
Wherein, (1) training on a sample set in the Masked LM mode yields the corresponding language model. (2) The function of both the supervised language model and the unsupervised language model is to predict missing text from its context. For example, given the input "今天_气真好" ("the weather is really nice today", with one character masked), the model predicts the masked position and outputs "天".
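One conventional way to realize this Masked LM training is with the Hugging Face masked-language-modeling utilities; in the sketch below, one model is trained per black sample set, and the bert-base-chinese backbone, 15% masking rate and hyperparameters are all illustrative assumptions.

```python
# Sketch: train a masked language model on one black sample set.
from datasets import Dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

def train_masked_lm(texts, output_dir):
    tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
    model = AutoModelForMaskedLM.from_pretrained("bert-base-chinese")
    ds = Dataset.from_dict({"text": texts}).map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
        batched=True, remove_columns=["text"])
    # Randomly masks 15% of tokens; the model learns to restore them from context.
    collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)
    args = TrainingArguments(output_dir=output_dir, num_train_epochs=3,
                             per_device_train_batch_size=32)
    Trainer(model=model, args=args, train_dataset=ds, data_collator=collator).train()
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)

# train_masked_lm(supervised_black, "lm-supervised")      # supervised language model
# train_masked_lm(unsupervised_black, "lm-unsupervised")  # unsupervised language model
```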
Based on a keyword extraction technique, extracting a black sample template from the unsupervised corpus black sample set, and obtaining a first variant mapping data set according to the black sample template, the supervised language model and the unsupervised language model.
Wherein, (1) the black sample template is a text template containing a forbidden phrase. For example, when the black sample is "你是傻子" ("you are a fool"), the keyword "傻子" is extracted by the keyword extraction technique; a character of the keyword is then deleted at random, giving a black sample template such as "你是_子" or "你是傻_". (2) The first variant mapping data set comprises a plurality of low-precision variant pairs. Specifically, the supervised language model and the unsupervised language model both predict (complete) the same black sample template: the supervised language model completes "你是_子" to "你是沙子" ("you are sand"), while the unsupervised language model completes it to "你是傻子" ("you are a fool"), and the two completions form a variant pair.
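A sketch of this template-and-completion step follows; jieba's TF-IDF extractor stands in for "a keyword extraction technique", the transformers fill-mask pipeline performs the completion, and the lm-supervised / lm-unsupervised paths refer to the models saved in the previous sketch — all of these tooling choices are assumptions.

```python
# Sketch: build low-precision variant pairs by masking one character of an
# extracted keyword and letting both language models fill in the blank.
import random
import jieba.analyse
from transformers import pipeline

sup_fill = pipeline("fill-mask", model="lm-supervised")      # tends to produce variants
unsup_fill = pipeline("fill-mask", model="lm-unsupervised")  # tends to restore the ontology

def first_variant_mapping(black_samples, top_keywords=5):
    pairs = []
    mask = sup_fill.tokenizer.mask_token
    for text in black_samples:
        for kw in jieba.analyse.extract_tags(text, topK=top_keywords):
            if kw not in text or len(kw) < 2:
                continue
            i = random.randrange(len(kw))                    # drop one keyword character
            template = text.replace(kw, kw[:i] + mask + kw[i + 1:], 1)
            variant = sup_fill(template)[0]["sequence"]      # e.g. completes to 你是沙子
            ontology = unsup_fill(template)[0]["sequence"]   # e.g. completes to 你是傻子
            if variant != ontology:                          # keep only candidate pairs
                pairs.append((variant, ontology))
    return pairs
```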
Manually labeling the first variant mapping data set to obtain a target variant mapping data set, and obtaining the variant error correction text data set according to the target variant mapping data set, the supervised corpus white sample set and the unsupervised corpus white sample set.
Wherein, (1) the target variant mapping data set comprises a plurality of manually labeled variant pairs. (2) Since white samples generally contain no variants, in the variant mapping pairs constructed from the supervised corpus white sample set and the unsupervised corpus white sample set, the ontology and the variant are both the corresponding white sample itself.
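Assembling the final variant error correction text data set could then be as simple as the following sketch, in which each white sample contributes an identity pair (the helper name is hypothetical):

```python
def build_correction_dataset(target_variant_pairs, supervised_white, unsupervised_white):
    # Manually labeled (variant, ontology) pairs, plus identity pairs in which
    # the ontology and the variant are the white sample itself.
    identity_pairs = [(text, text) for text in supervised_white + unsupervised_white]
    return target_variant_pairs + identity_pairs
```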
It should be noted that, since the first variant mapping data set may contain errors (it is generated automatically by the models, so keyword extraction errors, mispredictions by the supervised or unsupervised language model, and ontology-variant pairs that do not actually correspond may all occur), it needs to be corrected by manual labeling to obtain the high-precision target variant mapping data set.
According to the above technical scheme, the variant error correction data set is constructed automatically by building the supervised language model and the unsupervised language model, which improves the production efficiency of the variant error correction data set compared with fully manual annotation; furthermore, the variant error correction task can be trained on this data set and, as an auxiliary task, trained together with the classification task, which regularizes the model's semantic understanding of variants and further improves the recognition accuracy of the classification model.
As shown in FIG. 2, a variant text classification recognition system 200 according to an embodiment of the present invention includes: a construction module 210, a training module 220, and an identification module 230;
the construction module 210 is configured to: acquiring a first text data set, a supervised corpus data set and an unsupervised corpus data set, and constructing a variant error correction text data set according to the supervised corpus data set and the unsupervised corpus data set;
the training module 220 is configured to: training a first original neural network model based on the first text data set and the variant error correction text data set to obtain a target text classification model for text variant error correction and text classification recognition;
the identification module 230 is configured to: inputting the text to be identified into the target text classification model to obtain a target identification result containing the variant error correction and text classification of the text to be identified.
Preferably, the system further comprises: a processing module;
the processing module is used for: and training a second original neural network model for text classification based on the first text data set to obtain an original text classification model.
Preferably, the construction module 210 is specifically configured to:
classifying the supervised corpus data set by using the original text classification model to obtain a supervised corpus black sample set and a supervised corpus white sample set, and classifying the unsupervised corpus data set by using the original text classification model to obtain an unsupervised corpus black sample set and an unsupervised corpus white sample set;
generating a supervised language model by using the supervised corpus black sample set, and generating an unsupervised language model by using the unsupervised corpus black sample set;
based on a keyword extraction technology, extracting a black sample template from the black sample set of the non-supervised corpus, and obtaining a first variant mapping dataset according to the black sample template, the supervised language model and the non-supervised language model;
manually labeling the first variant mapping data set to obtain a target variant mapping data set, and obtaining the variant error correction text data set according to the target variant mapping data set, the supervised corpus white sample set and the unsupervised corpus white sample set.
Preferably, the construction module 210 is specifically configured to:
training the supervised corpus black sample set by adopting a Masked LM mode to obtain the supervised language model, and training the unsupervised corpus black sample set in the same mode to obtain the unsupervised language model.
According to the above technical scheme, the variant error correction data set is constructed automatically by building the supervised language model and the unsupervised language model, which improves the production efficiency of the variant error correction data set compared with fully manual annotation; furthermore, the variant error correction task can be trained on this data set and, as an auxiliary task, trained together with the classification task, which regularizes the model's semantic understanding of variants and further improves the recognition accuracy of the classification model.
For the steps by which the parameters and modules in the variant text classification recognition system 200 of this embodiment implement their corresponding functions, reference may be made to the parameters and steps in the above embodiment of the variant text classification recognition method, which are not repeated here.
The storage medium provided by the embodiment of the invention stores instructions which, when read by a computer, cause the computer to perform the steps of the variant text classification recognition method; for details, reference may be made to the parameters and steps in the above embodiment of the variant text classification recognition method, which are not repeated here.
Computer storage media include, for example, flash drives and portable hard disks.
The electronic device provided by the embodiment of the invention includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the computer performs the steps of the variant text classification recognition method; for details, reference may be made to the parameters and steps in the above embodiment of the variant text classification recognition method, which are not repeated here.
Those skilled in the art will appreciate that the present invention may be implemented as a method, system, storage medium, and electronic device.
Thus, the invention may be embodied in the following forms: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software, referred to herein generally as a "circuit", "module" or "system". Furthermore, in some embodiments, the invention may also be embodied as a computer program product in one or more computer-readable media containing computer-readable program code. Any combination of one or more computer-readable media may be employed. The computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. The computer-readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.

While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the invention, and that variations, modifications, alternatives and changes may be made to the above embodiments by those of ordinary skill in the art within the scope of the invention.

Claims (6)

1. A method for classifying and identifying variant text, comprising:
acquiring a first text data set, a supervised corpus data set and an unsupervised corpus data set, and constructing a variant error correction text data set according to the supervised corpus data set and the unsupervised corpus data set;
training a first original neural network model based on the first text data set and the variant error correction text data set to obtain a target text classification model for text variant error correction and text classification recognition;
inputting a text to be identified into the target text classification model to obtain a target identification result containing variant error correction and text classification of the text to be identified;
further comprises: training a second original neural network model for text classification based on the first text data set to obtain an original text classification model;
the step of constructing a variant error correction text dataset from the supervised corpus dataset and the unsupervised corpus dataset comprises:
classifying the supervised corpus data set by using the original text classification model to obtain a supervised corpus black sample set and a supervised corpus white sample set, and classifying the unsupervised corpus data set by using the original text classification model to obtain an unsupervised corpus black sample set and an unsupervised corpus white sample set;
generating a supervised language model by using the supervised corpus black sample set, and generating an unsupervised language model by using the unsupervised corpus black sample set;
based on a keyword extraction technique, extracting a black sample template from the unsupervised corpus black sample set, and obtaining a first variant mapping dataset according to the black sample template, the supervised language model and the unsupervised language model;
manually labeling the first variant mapping data set to obtain a target variant mapping data set, and obtaining the variant error correction text data set according to the target variant mapping data set, the supervised corpus white sample set and the unsupervised corpus white sample set;
wherein the first text data set is: a dataset comprising a plurality of text, each piece of data in the first text dataset being labeled with a classification type; the supervised corpus data set includes: a plurality of pieces of supervised corpus text, each piece of supervised corpus text being obtained from text content sent by a supervisory population and including a plurality of variant text; the unsupervised corpus data set includes: a plurality of pieces of non-supervised corpus texts, wherein the non-supervised corpus texts are obtained from text contents which are sent by non-supervised people and do not contain variant texts; the variant error correction text dataset comprises: a plurality of variant data pairs; the black sample template is: a text template containing forbidden phrases; the first variant mapping dataset comprises: a plurality of low precision variant pairs; the target variant mapping dataset comprises: a plurality of artificially labeled variant pairs.
2. The method of claim 1, wherein the step of generating a supervised language model using the supervised corpus black sample set and generating an unsupervised language model using the unsupervised corpus black sample set comprises:
training the supervised corpus black sample set by adopting a Masked LM mode to obtain the supervised language model, and training the unsupervised corpus black sample set to obtain the unsupervised language model.
3. A variant text classification recognition system, comprising: the system comprises a construction module, a training module and an identification module;
the construction module is used for: acquiring a first text data set, a supervised corpus data set and an unsupervised corpus data set, and constructing a variant error correction text data set according to the supervised corpus data set and the unsupervised corpus data set;
the training module is used for: training a first original neural network model based on the first text data set and the variant error correction text data set to obtain a target text classification model for text variant error correction and text classification recognition;
the identification module is used for: inputting a text to be identified into the target text classification model to obtain a target identification result containing variant error correction and text classification of the text to be identified;
further comprises: a processing module;
the processing module is used for: training a second original neural network model for text classification based on the first text data set to obtain an original text classification model;
the construction module is specifically used for:
classifying the supervised corpus data set by using the original text classification model to obtain a supervised corpus black sample set and a supervised corpus white sample set, and classifying the unsupervised corpus data set by using the original text classification model to obtain an unsupervised corpus black sample set and an unsupervised corpus white sample set;
generating a supervised language model by using the supervised corpus black sample set, and generating an unsupervised language model by using the unsupervised corpus black sample set;
based on a keyword extraction technique, extracting a black sample template from the unsupervised corpus black sample set, and obtaining a first variant mapping dataset according to the black sample template, the supervised language model and the unsupervised language model;
manually labeling the first variant mapping data set to obtain a target variant mapping data set, and obtaining the variant error correction text data set according to the target variant mapping data set, the supervised corpus white sample set and the unsupervised corpus white sample set;
wherein the first text data set is: a dataset comprising a plurality of text, each piece of data in the first text dataset being labeled with a classification type; the supervised corpus data set includes: a plurality of pieces of supervised corpus text, each piece of supervised corpus text being obtained from text content sent by a supervisory population and including a plurality of variant text; the unsupervised corpus data set includes: a plurality of pieces of non-supervised corpus texts, wherein the non-supervised corpus texts are obtained from text contents which are sent by non-supervised people and do not contain variant texts; the variant error correction text dataset comprises: a plurality of variant data pairs; the black sample template is: a text template containing forbidden phrases; the first variant mapping dataset comprises: a plurality of low precision variant pairs; the target variant mapping dataset comprises: a plurality of artificially labeled variant pairs.
4. A variant text classification recognition system according to claim 3, wherein the construction module is specifically configured to:
training the supervised corpus black sample set by adopting a Masked LM mode to obtain the supervised language model, and training the unsupervised corpus black sample set to obtain the unsupervised language model.
5. A storage medium having instructions stored therein which, when read by a computer, cause the computer to perform a variant text classification recognition method according to claim 1 or 2.
6. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, causes the computer to perform a variant text classification recognition method as claimed in claim 1 or 2.
CN202211348321.1A 2022-10-31 2022-10-31 Variant text classification recognition method, system, storage medium and electronic equipment Active CN115858776B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211348321.1A CN115858776B (en) 2022-10-31 2022-10-31 Variant text classification recognition method, system, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211348321.1A CN115858776B (en) 2022-10-31 2022-10-31 Variant text classification recognition method, system, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN115858776A CN115858776A (en) 2023-03-28
CN115858776B (en) 2023-06-23

Family

ID=85662162

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211348321.1A Active CN115858776B (en) 2022-10-31 2022-10-31 Variant text classification recognition method, system, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN115858776B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116501867B (en) * 2023-03-29 2023-09-12 北京数美时代科技有限公司 Variant knowledge mastery detection method, system and storage medium based on mutual information

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109766475A (en) * 2018-12-13 2019-05-17 北京爱奇艺科技有限公司 A kind of recognition methods of rubbish text and device

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112287100A (en) * 2019-07-12 2021-01-29 阿里巴巴集团控股有限公司 Text recognition method, spelling error correction method and voice recognition method
CN113297833A (en) * 2020-02-21 2021-08-24 华为技术有限公司 Text error correction method and device, terminal equipment and computer storage medium
CN113642317A (en) * 2021-08-12 2021-11-12 广域铭岛数字科技有限公司 Text error correction method and system based on voice recognition result
CN114564942B (en) * 2021-09-06 2023-07-18 北京数美时代科技有限公司 Text error correction method, storage medium and device for supervision field
CN114203158A (en) * 2021-12-14 2022-03-18 苏州驰声信息科技有限公司 Child Chinese spoken language evaluation and error detection and correction method and device
CN114861636A (en) * 2022-05-10 2022-08-05 网易(杭州)网络有限公司 Training method and device of text error correction model and text error correction method and device

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109766475A (en) * 2018-12-13 2019-05-17 北京爱奇艺科技有限公司 A kind of recognition methods of rubbish text and device

Also Published As

Publication number Publication date
CN115858776A (en) 2023-03-28

Similar Documents

Publication Publication Date Title
US20210150142A1 (en) Method and apparatus for determining feature words and server
US20170270912A1 (en) Language modeling based on spoken and unspeakable corpuses
CN110444198B (en) Retrieval method, retrieval device, computer equipment and storage medium
US9626622B2 (en) Training a question/answer system using answer keys based on forum content
CN107301170B (en) Method and device for segmenting sentences based on artificial intelligence
CN111444330A (en) Method, device and equipment for extracting short text keywords and storage medium
CN106528694B (en) semantic judgment processing method and device based on artificial intelligence
CN112580346B (en) Event extraction method and device, computer equipment and storage medium
CN115328756A (en) Test case generation method, device and equipment
CN112347760A (en) Method and device for training intention recognition model and method and device for recognizing intention
CN110633475A (en) Natural language understanding method, device and system based on computer scene and storage medium
CN112528637A (en) Text processing model training method and device, computer equipment and storage medium
CN114003682A (en) Text classification method, device, equipment and storage medium
CN115858776B (en) Variant text classification recognition method, system, storage medium and electronic equipment
CN111508497B (en) Speech recognition method, device, electronic equipment and storage medium
CN110738056B (en) Method and device for generating information
CN111354354B (en) Training method, training device and terminal equipment based on semantic recognition
CN111492364B (en) Data labeling method and device and storage medium
CN115455416A (en) Malicious code detection method and device, electronic equipment and storage medium
CN111785259A (en) Information processing method and device and electronic equipment
CN117077678B (en) Sensitive word recognition method, device, equipment and medium
CN114519357B (en) Natural language processing method and system based on machine learning
CN114648984B (en) Audio sentence-breaking method and device, computer equipment and storage medium
CN113722496B (en) Triple extraction method and device, readable storage medium and electronic equipment
CN117573956B (en) Metadata management method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant