CN115858776A - Variant text classification recognition method, system, storage medium and electronic equipment
- Publication number: CN115858776A (application CN202211348321.1A)
- Authority: CN (China)
- Prior art keywords: data set, corpus, variant, text, unsupervised
- Legal status: Granted
Landscapes
- Character Discrimination (AREA)
Abstract
The invention relates to a variant text classification recognition method, system, storage medium and electronic device, comprising the following steps: constructing a variant error correction text data set from a supervised corpus data set and an unsupervised corpus data set; training a first original neural network model based on a first text data set and the variant error correction text data set to obtain a target text classification model for text variant error correction and text classification recognition; and inputting a text to be recognized into the target text classification model to obtain a target recognition result containing the variant error correction and text classification of that text. The method constructs the variant error correction data set from the supervised and unsupervised corpus data sets, trains the variant error correction task on that data set, and trains this task as an auxiliary task together with the classification task; this has a regularizing effect on the model's understanding of variant semantics and further improves the recognition accuracy of the classification model.
Description
Technical Field
The invention relates to the technical field of text classification, and in particular to a variant text classification recognition method, system, storage medium and electronic device.
Background
A classification model obtained by neural network training can be used to identify and intercept forbidden content. To evade network moderation, objectionable text content often contains a large number of variants, such as near-sound (homophone) or near-shape (homoglyph) substitutions, which poses a significant challenge to internet content moderation. A common way to address these variants is to add corresponding variant samples to the data set used to train the classification model. However, while this scheme improves the model's recall on variant samples, it also reduces the accuracy of the classification model.
Therefore, it is desirable to provide a technical solution to the problems in the prior art.
Disclosure of Invention
In order to solve the technical problem, the invention provides a variant text classification and identification method, a variant text classification and identification system, a storage medium and electronic equipment.
The technical scheme of the variant text classification and identification method is as follows:
acquiring a first text data set, a supervised corpus data set and an unsupervised corpus data set, and constructing a variant error correction text data set according to the supervised corpus data set and the unsupervised corpus data set;
training a first original neural network model based on the first text data set and the variant error correction text data set to obtain a target text classification model for text variant error correction and text classification identification;
and inputting the text to be recognized into the target text classification model to obtain a target recognition result containing variant error correction and text classification of the text to be recognized.
The variant text classification and identification method has the following beneficial effects:
the method constructs the variant error correction data set from the supervised and unsupervised corpus data sets, trains the variant error correction task on that data set, and trains this task as an auxiliary task together with the classification task; this has a regularizing effect on the model's understanding of variant semantics and further improves the recognition accuracy of the classification model.
On the basis of the scheme, the variant text classification and identification method can be further improved as follows.
Further, the method also includes:
and training a second original neural network model for text classification based on the first text data set to obtain an original text classification model.
Further, the step of constructing a variant error correction text dataset from the supervised corpus dataset and the unsupervised corpus dataset comprises:
classifying the supervised corpus data set by using the original text classification model to obtain a supervised corpus black sample set and a supervised corpus white sample set, and classifying the unsupervised corpus data set by using the original text classification model to obtain an unsupervised corpus black sample set and an unsupervised corpus white sample set;
training by utilizing the supervised corpus black sample set to generate a supervised language model, and training by utilizing the unsupervised corpus black sample set to generate an unsupervised language model;
extracting a black sample template from the unsupervised corpus black sample set based on a keyword extraction technology, and obtaining a first variant mapping data set according to the black sample template, the supervised language model and the unsupervised language model;
and manually labeling the first variant mapping data set to obtain a target variant mapping data set, and obtaining the variant error correction text data set according to the target variant mapping data set, the supervised corpus white sample set and the unsupervised corpus white sample set.
The beneficial effect of adopting the further technical scheme is that: the variant error correction data set is constructed automatically by building the supervised and unsupervised language models, which improves the production efficiency of the variant error correction data set compared with fully manual labeling.
Further, the step of training to generate a supervised language model using the supervised corpus black sample set and generating an unsupervised language model using the unsupervised corpus black sample set includes:
and training the supervised corpus black sample set by adopting a Masked LM mode to obtain the supervised language model, and training the unsupervised corpus black sample set to obtain the unsupervised language model.
The technical scheme of the variant text classification and identification system is as follows:
the method comprises the following steps: the system comprises a construction module, a training module and an identification module;
the building module is used for: acquiring a first text data set, a supervised corpus data set and an unsupervised corpus data set, and constructing a variant error correction text data set according to the supervised corpus data set and the unsupervised corpus data set;
the training module is configured to: training a first original neural network model based on the first text data set and the variant error correction text data set to obtain a target text classification model for text variant error correction and text classification identification;
the identification module is configured to: and inputting the text to be recognized into the target text classification model to obtain a target recognition result containing variant error correction and text classification of the text to be recognized.
The variant text classification and identification system has the following beneficial effects:
the system of the invention constructs the variant error correction data set from the supervised and unsupervised corpus data sets, trains the variant error correction task on that data set, and trains this task as an auxiliary task together with the classification task; this has a regularizing effect on the model's understanding of variant semantics and further improves the recognition accuracy of the classification model.
On the basis of the scheme, the variant text classification and recognition system can be further improved as follows.
Further, the system also comprises: a processing module;
the processing module is used for: and training a second original neural network model for text classification based on the first text data set to obtain an original text classification model.
Further, the building module is specifically configured to:
classifying the supervised corpus data set by using the original text classification model to obtain a supervised corpus black sample set and a supervised corpus white sample set, and classifying the unsupervised corpus data set by using the original text classification model to obtain an unsupervised corpus black sample set and an unsupervised corpus white sample set;
training by utilizing the supervised corpus black sample set to generate a supervised language model, and training by utilizing the unsupervised corpus black sample set to generate an unsupervised language model;
extracting a black sample template from the unsupervised corpus black sample set based on a keyword extraction technology, and obtaining a first variant mapping data set according to the black sample template, the supervised language model and the unsupervised language model;
and manually labeling the first variant mapping data set to obtain a target variant mapping data set, and obtaining the variant error correction text data set according to the target variant mapping data set, the supervised corpus white sample set and the unsupervised corpus white sample set.
The beneficial effect of adopting the further technical scheme is that: the variant error correction data set is constructed automatically by building the supervised and unsupervised language models, which improves the production efficiency of the variant error correction data set compared with fully manual labeling.
Further, the building module is specifically configured to:
and training the supervised corpus black sample set by adopting a mask LM mode to obtain the supervised language model, and training the unsupervised corpus black sample set to obtain the unsupervised language model.
The technical scheme of the storage medium of the invention is as follows:
the storage medium has stored therein instructions which, when read by a computer, cause the computer to perform the steps of a variant text classification recognition method according to the invention.
The technical scheme of the electronic equipment is as follows:
The electronic device comprises a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, causes the computer to carry out the steps of the variant text classification recognition method of the present invention.
Drawings
Fig. 1 is a schematic flow chart of a variant text classification and identification method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a variant text classification recognition system according to an embodiment of the present invention.
Detailed Description
As shown in fig. 1, a method for classifying and recognizing a variant text according to an embodiment of the present invention includes the following steps:
s1, a first text data set, a supervised corpus data set and an unsupervised corpus data set are obtained, and a variant error correction text data set is constructed according to the supervised corpus data set and the unsupervised corpus data set.
Wherein, (1) the first text data set is a data set containing a plurality of texts that can be used to train a text classification model; each piece of data in the first text data set is labeled with a classification type, such as: forbidden, normal, etc. (2) The supervised corpus data set includes a plurality of supervised corpus texts, which are obtained from text content sent by a supervised (moderated) population and contain a large number of variant texts. (3) The unsupervised corpus data set includes a plurality of unsupervised corpus texts, which are obtained from text content sent by an unsupervised population and contain essentially no variant texts. (4) The variant error correction text data set is used to train the variant error correction task and includes a plurality of variant data pairs. For example, one variant data pair is: "you are sand" (variant text) and "you are fool" (ontology text).
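By way of illustration, the four data sets might be represented as follows (a minimal sketch; all variable names and records are hypothetical, following the examples above):

```python
# Hypothetical, minimal representations of the four data sets described above.

# (1) First text data set: texts labeled with a classification type.
first_text_dataset = [
    {"text": "you are fool", "label": "forbidden"},
    {"text": "nice weather today", "label": "normal"},
]

# (2)/(3) Raw corpus texts from the supervised and unsupervised populations.
supervised_corpus = ["you are sand"]     # contains many variant texts
unsupervised_corpus = ["you are fool"]   # contains essentially no variants

# (4) Variant error correction data set: (variant text, ontology text) pairs.
variant_correction_dataset = [
    {"variant": "you are sand", "ontology": "you are fool"},
]
```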
S2, training a first original neural network model based on the first text data set and the variant error correction text data set to obtain a target text classification model for text variant error correction and text classification recognition.
Wherein, (1) the first original neural network model is a neural network model that can be used simultaneously for text variant error correction and text classification recognition; the two functions share a model backbone and differ only in the output layer of the model. (2) The target text classification model is the model obtained after training, used for text variant error correction and text classification recognition.
It should be noted that, in the training process of the first original neural network model, the first text data set is used for training the text classification task of the first original neural network model, and the variant error correction text data set is used for training the variant error correction task of the first original neural network model.
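The patent does not disclose a concrete network architecture, so the following is only a minimal sketch of the shared-backbone, two-head design and the joint training step described above; the Transformer encoder, layer sizes, mean pooling and all variable names are assumptions:

```python
import torch
import torch.nn as nn

class MultiTaskTextModel(nn.Module):
    """Sketch of a shared backbone with two output layers (assumed design).

    The classification head labels the whole text; the error correction head
    predicts an ontology token for every input position.
    """
    def __init__(self, vocab_size: int, hidden: int = 256, num_classes: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)  # shared backbone
        self.cls_head = nn.Linear(hidden, num_classes)  # text classification output layer
        self.corr_head = nn.Linear(hidden, vocab_size)  # variant error correction output layer

    def forward(self, token_ids: torch.Tensor):
        h = self.backbone(self.embed(token_ids))   # (batch, seq, hidden)
        cls_logits = self.cls_head(h.mean(dim=1))  # pooled -> class scores
        corr_logits = self.corr_head(h)            # per-position ontology scores
        return cls_logits, corr_logits

# Joint training step: classification loss plus auxiliary error correction loss.
model = MultiTaskTextModel(vocab_size=8000)
ce = nn.CrossEntropyLoss()
tokens = torch.randint(0, 8000, (2, 16))           # dummy token ids
cls_logits, corr_logits = model(tokens)
cls_labels = torch.tensor([1, 0])                  # e.g. forbidden / normal
corr_targets = torch.randint(0, 8000, (2, 16))     # ontology token ids
loss = ce(cls_logits, cls_labels) + ce(corr_logits.reshape(-1, 8000), corr_targets.reshape(-1))
loss.backward()
```

Summing the two cross-entropy losses is the simplest way to realize the auxiliary-task training; a weighting factor between the two terms is another natural design choice not specified by the patent.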
S3, inputting the text to be recognized into the target text classification model to obtain a target recognition result containing variant error correction and text classification of the text to be recognized.
Wherein, (1) the text to be recognized is an arbitrarily selected text, which may be a variant text or an ontology text. (2) The target recognition result includes the variant error correction result and the text classification result. For example, if the text to be recognized is "you are sand", its target recognition result includes the variant error correction result "you are fool" and the text classification result "violation".
It should be noted that the text classification result is decided against the preset threshold of the target text classification model. For example, assume the preset threshold defaults to a forbidden-probability lower limit of 0.7: when the text to be recognized is input into the target text classification model and the resulting forbidden probability is 0.8, the text classification result is judged to be "violation"; when the forbidden probability is 0.3, the text classification result is judged to be "normal". In this embodiment, the preset threshold can be set as required and is not limited here.
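As a minimal sketch, the threshold decision reduces to a single comparison (the 0.7 default and the label names follow the example above; the function name is hypothetical):

```python
def classify_by_threshold(forbidden_prob: float, threshold: float = 0.7) -> str:
    """Map the model's forbidden probability to a text classification result."""
    return "violation" if forbidden_prob >= threshold else "normal"

assert classify_by_threshold(0.8) == "violation"
assert classify_by_threshold(0.3) == "normal"
```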
Preferably, the method further comprises the following steps:
and training a second original neural network model for text classification based on the first text data set to obtain an original text classification model.
Wherein, (1) the second original neural network model is a neural network model that can be used for text classification. (2) The original text classification model is the text classification model obtained after training; its specific training process is conventional and is not described in detail here.
Preferably, the step of constructing a variant error corrected text data set from the supervised corpus data set and the unsupervised corpus data set comprises:
classifying the supervised corpus data set by using the original text classification model to obtain a supervised corpus black sample set and a supervised corpus white sample set, and classifying the unsupervised corpus data set by using the original text classification model to obtain an unsupervised corpus black sample set and an unsupervised corpus white sample set.
The original text classification model judges whether a text is forbidden against a preset threshold that defaults to 0.7: texts with a forbidden probability greater than or equal to 0.7 form the black sample set, and texts below 0.7 form the white sample set. The supervised corpus data set and the unsupervised corpus data set are each classified by the original text classification model in this way, yielding the supervised corpus black sample set, the supervised corpus white sample set, the unsupervised corpus black sample set and the unsupervised corpus white sample set.
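A minimal sketch of this split, assuming a hypothetical `forbidden_prob` callable that wraps the original text classification model and returns the forbidden probability of one text:

```python
from typing import Callable, Iterable, List, Tuple

def split_black_white(
    corpus: Iterable[str],
    forbidden_prob: Callable[[str], float],
    threshold: float = 0.7,
) -> Tuple[List[str], List[str]]:
    """Split a corpus into black (>= threshold) and white (< threshold) samples."""
    black, white = [], []
    for text in corpus:
        (black if forbidden_prob(text) >= threshold else white).append(text)
    return black, white

# Applied once per corpus:
# sup_black, sup_white = split_black_white(supervised_corpus, forbidden_prob)
# unsup_black, unsup_white = split_black_white(unsupervised_corpus, forbidden_prob)
```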
And training by using the supervised corpus black sample set to generate a supervised language model, and training by using the unsupervised corpus black sample set to generate an unsupervised language model.
Specifically, the supervised corpus black sample set is trained in a Masked LM mode to obtain the supervised language model, and the unsupervised corpus black sample set is trained in the same way to obtain the unsupervised language model.
Wherein, (1) the process of training a corpus sample set in the Masked LM mode to obtain the corresponding language model is prior art. (2) Both the supervised and the unsupervised language model predict missing characters from context. For example, given the input "today's _ true and good.", the model predicts the blank "_" and outputs "day".
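As a sketch of the Masked LM data preparation: the 15% masking rate below is a BERT-style assumption, since the patent only names the Masked LM mode without giving parameters:

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_rate: float = 0.15):
    """BERT-style Masked LM corruption: hide some tokens, keep them as targets."""
    inputs, targets = [], []
    for tok in tokens:
        if random.random() < mask_rate:
            inputs.append(MASK)
            targets.append(tok)    # the language model must recover this token
        else:
            inputs.append(tok)
            targets.append(None)   # no prediction loss at unmasked positions
    return inputs, targets

inputs, targets = mask_tokens("today is a fine day".split())
```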
And extracting a black sample template from the unsupervised corpus black sample set based on a keyword extraction technology, and obtaining a first variant mapping data set according to the black sample template, the supervised language model and the unsupervised language model.
Wherein, (1) the black sample template is a text template containing a forbidden phrase with part of it blanked out. For example, when the black sample is "you are fool", the keyword "fool" is extracted by a keyword extraction technique; characters in the keyword are then randomly deleted to obtain a black sample template such as "you are _ children" or "you are fool _". (2) The first variant mapping data set includes a plurality of low-precision variant pairs. For example, the same black sample template is predicted (completed) using both the supervised and the unsupervised language model; specifically, the supervised language model completes the black sample template "you are _ children" to obtain "you are sand", while the unsupervised language model completes the same template to obtain "you are fool", so one variant pair is obtained.
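A minimal sketch of the template construction and dual-model completion; the keyword extractor and the two fill-in-the-blank callables are assumed interfaces, not interfaces disclosed by the patent:

```python
import random

def make_black_sample_template(black_sample: str, keyword: str) -> str:
    """Blank out one randomly chosen character of the extracted keyword."""
    start = black_sample.index(keyword)
    i = start + random.randrange(len(keyword))
    return black_sample[:i] + "_" + black_sample[i + 1:]

def build_variant_pair(template, supervised_lm, unsupervised_lm):
    """Complete the same template with both language models.

    The supervised model (trained on variant-rich text) tends to fill in the
    variant form; the unsupervised model tends to fill in the ontology form.
    """
    return supervised_lm(template), unsupervised_lm(template)

# e.g. template = make_black_sample_template("you are fool", "fool")
# variant, ontology = build_variant_pair(template, supervised_lm, unsupervised_lm)
```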
And manually labeling the first variant mapping data set to obtain a target variant mapping data set, and obtaining the variant error correction text data set according to the target variant mapping data set, the supervised corpus white sample set and the unsupervised corpus white sample set.
Wherein, (1) the target variant mapping data set includes a plurality of manually labeled variant pairs. (2) Since white samples generally contain no variants, in the variant mapping pairs constructed from the supervised corpus white sample set and the unsupervised corpus white sample set, both the ontology and the variant are the corresponding white sample itself.
It should be noted that, since the first variant mapping data set may contain errors (it is generated automatically by the models, so there may be keyword extraction errors, prediction errors by the supervised and unsupervised language models, ontology-variant mismatches, and so on), it needs to be corrected by manual labeling to obtain the high-precision target variant mapping data set.
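Putting the pieces together, a minimal sketch of assembling the variant error correction text data set, where white samples are assumed variant-free and become identity pairs as noted above:

```python
def build_correction_dataset(labeled_variant_pairs, white_samples):
    """Merge manually verified variant pairs with identity pairs from white samples."""
    data = [{"variant": v, "ontology": o} for v, o in labeled_variant_pairs]
    # White samples are assumed variant-free, so ontology == variant.
    data += [{"variant": w, "ontology": w} for w in white_samples]
    return data

dataset = build_correction_dataset(
    [("you are sand", "you are fool")],
    ["nice weather today"],
)
```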
According to the above technical scheme, the variant error correction data set is constructed automatically by building the supervised and unsupervised language models, which improves the production efficiency of the variant error correction data set compared with fully manual labeling; the variant error correction task can then be trained on this data set as an auxiliary task together with the classification task, which has a regularizing effect on the model's understanding of variant semantics and further improves the recognition accuracy of the classification model.
As shown in fig. 2, a variant text classification recognition system 200 according to an embodiment of the present invention includes: a construction module 210, a training module 220, and a recognition module 230;
the building module 210 is configured to: acquiring a first text data set, a supervised corpus data set and an unsupervised corpus data set, and constructing a variant error correction text data set according to the supervised corpus data set and the unsupervised corpus data set;
the training module 220 is configured to: training a first original neural network model based on the first text data set and the variant error correction text data set to obtain a target text classification model for text variant error correction and text classification identification;
the identification module 230 is configured to: and inputting the text to be recognized into the target text classification model to obtain a target recognition result containing variant error correction and text classification of the text to be recognized.
Preferably, the system further comprises: a processing module;
the processing module is used for: and training a second original neural network model for text classification based on the first text data set to obtain an original text classification model.
Preferably, the building module 210 is specifically configured to:
classifying the supervised corpus data set by using the original text classification model to obtain a supervised corpus black sample set and a supervised corpus white sample set, and classifying the unsupervised corpus data set by using the original text classification model to obtain an unsupervised corpus black sample set and an unsupervised corpus white sample set;
training by utilizing the supervised corpus black sample set to generate a supervised language model, and training by utilizing the unsupervised corpus black sample set to generate an unsupervised language model;
extracting a black sample template from the unsupervised corpus black sample set based on a keyword extraction technology, and obtaining a first variant mapping data set according to the black sample template, the supervised language model and the unsupervised language model;
and manually labeling the first variant mapping data set to obtain a target variant mapping data set, and obtaining the variant error correction text data set according to the target variant mapping data set, the supervised corpus white sample set and the unsupervised corpus white sample set.
Preferably, the building module 210 is specifically configured to:
and training the supervised corpus black sample set by adopting a mask LM mode to obtain the supervised language model, and training the unsupervised corpus black sample set to obtain the unsupervised language model.
According to the above technical scheme, the variant error correction data set is constructed automatically by building the supervised and unsupervised language models, which improves the production efficiency of the variant error correction data set compared with fully manual labeling; the variant error correction task can then be trained on this data set as an auxiliary task together with the classification task, which has a regularizing effect on the model's understanding of variant semantics and further improves the recognition accuracy of the classification model.
For the steps by which the parameters and modules of the variant text classification recognition system 200 of this embodiment realize their corresponding functions, reference may be made to the parameters and steps in the above embodiment of the variant text classification recognition method, which are not repeated here.
An embodiment of the present invention provides a storage medium storing instructions; when the instructions are read by a computer, the computer is caused to execute the steps of the variant text classification recognition method, for which reference may be made to the parameters and steps in the above method embodiment, not repeated here.
Examples of computer storage media include flash drives, portable hard disks, and the like.
An embodiment of the present invention provides an electronic device including a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the computer performs the steps of the variant text classification recognition method, for which reference may be made to the parameters and steps in the above method embodiment, not repeated here.
Those skilled in the art will appreciate that the present invention may be embodied as methods, systems, storage media and electronic devices.
Thus, the present invention may be embodied entirely in hardware, entirely in software (including firmware, resident software, micro-code, etc.), or in a combination of hardware and software, which may be referred to herein generally as a "circuit", "module" or "system". Furthermore, in some embodiments, the invention may also take the form of a computer program product embodied in one or more computer-readable media having computer-readable program code embodied therein.

Any combination of one or more computer-readable media may be employed. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.

Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention; variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.
Claims (10)
1. A variant text classification recognition method is characterized by comprising the following steps:
acquiring a first text data set, a supervised corpus data set and an unsupervised corpus data set, and constructing a variant error correction text data set according to the supervised corpus data set and the unsupervised corpus data set;
training a first original neural network model based on the first text data set and the variant error correction text data set to obtain a target text classification model for text variant error correction and text classification identification;
and inputting the text to be recognized into the target text classification model to obtain a target recognition result containing variant error correction and text classification of the text to be recognized.
2. The variant text classification recognition method of claim 1, further comprising:
and training a second original neural network model for text classification based on the first text data set to obtain an original text classification model.
3. The variant text classification recognition method according to claim 2, wherein the step of constructing a variant corrected text data set from the supervised corpus data set and the unsupervised corpus data set comprises:
classifying the supervised corpus data set by using the original text classification model to obtain a supervised corpus black sample set and a supervised corpus white sample set, and classifying the unsupervised corpus data set by using the original text classification model to obtain an unsupervised corpus black sample set and an unsupervised corpus white sample set;
training by utilizing the supervised corpus black sample set to generate a supervised language model, and training by utilizing the unsupervised corpus black sample set to generate an unsupervised language model;
extracting a black sample template from the unsupervised corpus black sample set based on a keyword extraction technology, and obtaining a first variant mapping data set according to the black sample template, the supervised language model and the unsupervised language model;
and manually labeling the first variant mapping data set to obtain a target variant mapping data set, and obtaining the variant error correction text data set according to the target variant mapping data set, the supervised corpus white sample set and the unsupervised corpus white sample set.
4. The method according to claim 3, wherein the step of training to generate the supervised language model using the supervised corpus black sample set and to generate the unsupervised language model using the unsupervised corpus black sample set comprises:
and training the supervised corpus black sample set by adopting a mask LM mode to obtain the supervised language model, and training the unsupervised corpus black sample set to obtain the unsupervised language model.
5. A variant text classification recognition system, comprising: the system comprises a construction module, a training module and an identification module;
the building module is used for: acquiring a first text data set, a supervised corpus data set and an unsupervised corpus data set, and constructing a variant error correction text data set according to the supervised corpus data set and the unsupervised corpus data set;
the training module is configured to: training a first original neural network model based on the first text data set and the variant error correction text data set to obtain a target text classification model for text variant error correction and text classification identification;
the identification module is configured to: and inputting the text to be recognized into the target text classification model to obtain a target recognition result containing variant error correction and text classification of the text to be recognized.
6. The variant text classification recognition system of claim 5, further comprising: a processing module;
the processing module is used for: and training a second original neural network model for text classification based on the first text data set to obtain an original text classification model.
7. The system according to claim 6, wherein the construction module is specifically configured to:
classifying the supervised corpus data set by using the original text classification model to obtain a supervised corpus black sample set and a supervised corpus white sample set, and classifying the unsupervised corpus data set by using the original text classification model to obtain an unsupervised corpus black sample set and an unsupervised corpus white sample set;
training by utilizing the supervised corpus black sample set to generate a supervised language model, and training by utilizing the unsupervised corpus black sample set to generate an unsupervised language model;
extracting a black sample template from the unsupervised corpus black sample set based on a keyword extraction technology, and obtaining a first variant mapping data set according to the black sample template, the supervised language model and the unsupervised language model;
and manually labeling the first variant mapping data set to obtain a target variant mapping data set, and obtaining the variant error correction text data set according to the target variant mapping data set, the supervised corpus white sample set and the unsupervised corpus white sample set.
8. The system according to claim 7, wherein the construction module is specifically configured to:
and training the supervised corpus black sample set by adopting a mask LM mode to obtain the supervised language model, and training the unsupervised corpus black sample set to obtain the unsupervised language model.
9. A storage medium having stored thereon instructions which, when read by a computer, cause the computer to carry out a variant text classification recognition method as claimed in any one of claims 1 to 4.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, causes the computer to perform a variant text classification recognition method according to any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211348321.1A (granted as CN115858776B) | 2022-10-31 | 2022-10-31 | Variant text classification recognition method, system, storage medium and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115858776A | 2023-03-28 |
CN115858776B CN115858776B (en) | 2023-06-23 |
Family
ID=85662162
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211348321.1A Active CN115858776B (en) | 2022-10-31 | 2022-10-31 | Variant text classification recognition method, system, storage medium and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115858776B (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109766475A (en) * | 2018-12-13 | 2019-05-17 | 北京爱奇艺科技有限公司 | A kind of recognition methods of rubbish text and device |
CN112287100A (en) * | 2019-07-12 | 2021-01-29 | 阿里巴巴集团控股有限公司 | Text recognition method, spelling error correction method and voice recognition method |
CN113297833A (en) * | 2020-02-21 | 2021-08-24 | 华为技术有限公司 | Text error correction method and device, terminal equipment and computer storage medium |
CN113642317A (en) * | 2021-08-12 | 2021-11-12 | 广域铭岛数字科技有限公司 | Text error correction method and system based on voice recognition result |
CN114564942A (en) * | 2021-09-06 | 2022-05-31 | 北京数美时代科技有限公司 | Text error correction method, storage medium and device for supervision field |
CN114203158A (en) * | 2021-12-14 | 2022-03-18 | 苏州驰声信息科技有限公司 | Child Chinese spoken language evaluation and error detection and correction method and device |
CN114861636A (en) * | 2022-05-10 | 2022-08-05 | 网易(杭州)网络有限公司 | Training method and device of text error correction model and text error correction method and device |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116501867A (en) * | 2023-03-29 | 2023-07-28 | 北京数美时代科技有限公司 | Variant knowledge mastery detection method, system and storage medium based on mutual information |
CN116501867B (en) * | 2023-03-29 | 2023-09-12 | 北京数美时代科技有限公司 | Variant knowledge mastery detection method, system and storage medium based on mutual information |
Also Published As
Publication number | Publication date |
---|---|
CN115858776B (en) | 2023-06-23 |
Legal Events
Code | Title |
---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
GR01 | Patent grant |