CN116362292A - Text classification model training method and device, text classification method and device - Google Patents
- Publication number
- CN116362292A (application CN202211729559.9A)
- Authority
- CN
- China
- Prior art keywords
- text
- classification model
- classification
- target
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Biophysics (AREA)
- Evolutionary Computation (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application discloses a text classification model training method and device and a text classification method and device. The text classification model training method comprises: determining a training text set, wherein the training text set comprises unlabeled offending texts, offending texts with marked forbidden categories, and unlabeled normal texts; generating training data using preset hidden characters and a target text selected from the training text set; inputting the training data into a text classification model to obtain the target characters predicted by the model and the classification result predicted by the model based on the training data; and adjusting the parameters of the model based on a text semantic loss value and a classification loss value until both loss values meet preset conditions, thereby obtaining a trained text classification model. The text classification model trained by the training method provided by the application therefore has stronger semantic analysis capability and a stronger ability to distinguish offending text.
Description
Technical Field
The present disclosure relates to the field of information processing technologies, and in particular, to a method and apparatus for training a text classification model, and a method and apparatus for classifying text.
Background
With the rapid development of internet technology, reading text information on the internet has become a common form of recreation. However, as the threshold for publishing text information on the internet has fallen, a large amount of offending text unsuitable for display to users, especially underage users, now exists on the internet. Such offending text must therefore be handled to protect the physical and mental health of users, particularly minors. The prerequisite for handling offending text is finding it within a large volume of text information.
Based on this, in order to find offending text within a large volume of text information, a classification model capable of identifying offending text may be introduced to classify each piece of text information.
Disclosure of Invention
In view of this, the present application provides a text classification model training method and apparatus, and a text classification method and apparatus, for training a classification model capable of identifying offending text and for classifying each piece of text information.
To achieve the above object, the following solutions are proposed:
a text classification model training method, comprising:
determining a training text set, wherein the training text set comprises a plurality of unlabeled illegal texts, a plurality of illegal texts with marked forbidden categories and a plurality of unlabeled normal texts;
sequentially selecting target texts from the training text set;
generating training data by using preset hidden characters and the target text, wherein the training data is the target text with partial characters replaced by the hidden characters;
inputting the training data into a text classification model to obtain target characters predicted by the text classification model and classification results predicted by the text classification model based on the training data, wherein the classification results are classification results corresponding to a plurality of forbidden categories, and the text classification model is a text classification model to be trained;
calculating a text semantic loss value of the text classification model according to the target characters and the target text;
calculating a classification loss value of the text classification model according to the classification result and the target text;
and adjusting parameters of the text classification model based on the text semantic loss value and the classification loss value until the text semantic loss value and the classification loss value accord with preset conditions, so as to obtain a trained text classification model.
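The steps above can be sketched as a minimal training loop. Everything in the sketch (the toy model, the halving loss values, masking only the first character) is an illustrative assumption; only the control flow of the claimed steps is taken from the method itself.

```python
# Toy sketch of the claimed training loop. The model and loss values are
# stand-ins (assumptions); the structure mirrors the claimed steps: select
# target texts in turn, mask characters, predict, compute losses, adjust
# parameters until the losses meet the preset condition.
class ToyModel:
    def __init__(self):
        self.loss = 1.0
    def __call__(self, data):
        # Returns (predicted target characters, classification result)
        return "predicted chars", {"category": "no"}
    def update(self, sem_loss, cls_loss):
        self.loss *= 0.5  # pretend the parameter adjustment improves the model

def train(model, texts, threshold=0.1):
    for target_text in texts:                                    # select target texts in turn
        data = target_text.replace(target_text[0], "[MASK]", 1)  # hide some characters
        pred_chars, cls_result = model(data)                     # two predictions per step
        sem_loss = cls_loss = model.loss                         # toy loss values
        model.update(sem_loss, cls_loss)                         # adjust parameters
        if sem_loss < threshold and cls_loss < threshold:        # preset condition met
            break
    return model

m = train(ToyModel(), ["text one", "text two", "text three", "text four", "text five"])
```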
Optionally, the method further comprises:
and adjusting the bias parameters and the weight parameters of the trained text classification model to obtain a processed text classification model, wherein the output of the processed text classification model is a classification result corresponding to the input text.
Optionally, adjusting the bias parameter and the weight parameter of the trained text classification model to obtain a processed text classification model, which includes:
and adjusting weight parameters and bias parameters related to the predicted target characters and the predicted classification results in the trained text classification model to obtain a processed text classification model.
Optionally, adjusting a weight parameter and a bias parameter related to the predicted target character and the predicted classification result in the trained text classification model to obtain a processed text classification model, which includes:
adjusting weight parameters and bias parameters related to the predicted target characters and the predicted classification results in the trained text classification model by using a preset adjusting formula to obtain a processed text classification model;
the adjustment formula is as follows:
H = BERT(X; θ)
H_c = Slice_1(H)
H_f = LN(GELU(H_c · W_1 + B_1))
E_t = Slice_2(E)
S = H_f · E_t^T + B_2
P = Softmax(S)
wherein X represents the input of the trained text classification model; θ represents the weight parameters of the trained text classification model; BERT represents semantic encoding with the text classification model; H represents the semantic vectors encoded by the text classification model; Slice_1 truncates the encoded semantic vectors to the positions corresponding to the classification results, and H_c represents the semantic vectors corresponding to the classification results; H_f represents the fully connected semantic vectors; LN represents the Layer Normalization operation; GELU represents the Gaussian error linear unit activation function; W_1 represents the weight parameters of the fully connected layer; B_1 represents the bias parameters of the fully connected layer; E represents the vector matrix of the dictionary; Slice_2 truncates the dictionary's vector matrix to the vector matrix corresponding to the target characters only, which is E_t; E_t^T represents the transpose of the vector matrix corresponding to the target characters only; B_2 represents the bias parameters of the dimension transformation (upscaling); S represents the scores of the classification results corresponding to the forbidden categories; P represents the probabilities of the classification results corresponding to the forbidden categories, with Softmax representing the Softmax function used to obtain the probabilities; and the dictionary is a pre-established character database.
Optionally, generating training data by using preset hidden characters and the target text includes:
replacing part of characters of the target text by using the hidden characters to obtain a replaced text;
if the target text does not have a label, directly taking the replacement text as the training data;
If the target text has a label, processing the replacement text by using a preset text template to obtain training data.
Optionally, the text template includes a fixed order and specific positions corresponding to the forbidden categories, and further includes a specific position for the replacement text;
processing the replacement text by using a preset text template to obtain training data, wherein the processing comprises the following steps:
determining a classification result corresponding to each forbidden class by using the labeling label of the target text corresponding to the replacement text;
determining the sequence of each classification result based on the fixed sequence corresponding to the forbidden classes in the text template;
forming a binary classification result combination according to the order of the binary classification results;
and generating training data based on specific positions corresponding to the forbidden categories in the text template, specific positions of the replacement text in the text template, the classification result combination and the replacement text.
Optionally, the generating training data based on specific positions corresponding to the forbidden categories in the text template, specific positions of the replaced text in the text template, the classification result combination and the replaced text includes:
combining the binary classification result combination and the replacement text according to the specific positions corresponding to the forbidden categories in the text template and the specific position of the replacement text, to obtain combined data;
and adding preset prefix characters to the combined data, and adding suffix characters between the classification result combination and the replacement text in the combined data, to obtain training data, so that after the training data is input into the text classification model, the model distinguishes the classification result combination from the replacement text based on the prefix characters and the suffix characters.
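The template steps above can be sketched as follows. The category names, the separator tokens ([PRE], [SUF]), and the yes/no encoding are illustrative assumptions, not prescribed by the application:

```python
# Hypothetical sketch of the text-template step: the binary result for each
# forbidden category is placed in a fixed order, a prefix marks the combined
# data, and a suffix separates the results from the replacement text.
FORBIDDEN_CATEGORIES = ["violence", "gambling"]  # fixed category order (assumed)

def build_training_data(replaced_text: str, labels: dict) -> str:
    # Determine the binary (yes/no) result for each category in the fixed order
    results = ["yes" if labels.get(c) else "no" for c in FORBIDDEN_CATEGORIES]
    combo = " ".join(f"{c}:{r}" for c, r in zip(FORBIDDEN_CATEGORIES, results))
    # Prefix marks the combined data; suffix separates the results from the text
    return f"[PRE] {combo} [SUF] {replaced_text}"

example = build_training_data("this [MASK] is a bet", {"gambling": True})
# → "[PRE] violence:no gambling:yes [SUF] this [MASK] is a bet"
```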
A text classification model training device, comprising:
the determining unit is used for determining a training text set, wherein the training text set comprises a plurality of unlabeled offending texts, a plurality of offending texts with marked forbidden categories, and a plurality of unlabeled normal texts;
the selecting unit is used for sequentially selecting target texts from the training text set;
the generation unit is used for generating training data by utilizing preset hidden characters and the target text, wherein the training data is the target text with partial characters replaced by the hidden characters;
The classifying unit is used for inputting the training data into a text classifying model to obtain target characters predicted by the text classifying model and classifying results predicted by the text classifying model based on the training data, wherein the classifying results are classifying results corresponding to a plurality of forbidden categories, and the text classifying model is a text classifying model to be trained;
the calculating unit is used for calculating a text semantic loss value of the text classification model according to the target characters and the target text;
the utilization unit is used for calculating a classification loss value of the text classification model according to the classification result and the target text;
and the adjusting unit is used for adjusting the parameters of the text classification model based on the text semantic loss value and the classification loss value until the text semantic loss value and the classification loss value accord with preset conditions, so as to obtain the trained text classification model.
A text classification method, comprising:
acquiring text information to be classified;
and classifying the text information to be classified by using the text classification model trained by the text classification model training method to obtain a classification result, wherein the classification result comprises a classification result corresponding to a plurality of forbidden categories.
A text classification device, comprising:
the text acquisition unit is used for acquiring text information to be classified;
the information classification unit is used for classifying the text information to be classified by the text classification model trained by the text classification model training method to obtain a classification result, wherein the classification result comprises classification results corresponding to a plurality of forbidden categories.
According to the above technical solution, the text classification model training method determines a training text set, wherein the training text set comprises a plurality of unlabeled offending texts, a plurality of offending texts with marked forbidden categories, and a plurality of unlabeled normal texts; sequentially selects target texts from the training text set; generates training data using preset hidden characters and the target text, wherein the training data is the target text with some characters replaced by the hidden characters; and inputs the training data into a text classification model to obtain the target characters predicted by the model and the classification result it predicts based on the training data, wherein the classification result comprises binary classification results corresponding to a plurality of forbidden categories. Through this process, the semantic analysis capability of the text classification model can be trained by hiding some characters, and, by adopting a mixed training mode of normal texts and offending texts, the model can learn the distinction between normal texts and offending samples, improving its discrimination capability. The method then calculates a text semantic loss value of the model according to the target characters and the target text; calculates a classification loss value of the model according to the classification result and the target text; and adjusts the parameters of the model based on the text semantic loss value and the classification loss value until both loss values meet preset conditions, obtaining a trained text classification model.
Through the process, the parameters of the text classification model can be adjusted by using the loss value, so that the purposes of improving the semantic analysis capability of the text classification model and the illegal text recognition capability are achieved. Therefore, the text classification model obtained by training by the text classification model training method provided by the application has higher semantic analysis capability and illegal text distinguishing capability.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present application, and that other drawings may be obtained according to the provided drawings without inventive effort to a person skilled in the art.
FIG. 1 is a flowchart of a text classification model training method disclosed in the present application;
FIG. 2 is a structural block diagram of a text classification model training device according to an embodiment of the present application;
FIG. 3 is a block diagram of a hardware structure of a text classification model training device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
The text classification model training method provided by the application can be applied to numerous general or special computing device environments or configurations, for example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, distributed computing environments that include any of the above systems or devices, and the like.
The text classification model training method of the present application is described in detail below with reference to fig. 1, and includes the following steps:
step S1, determining a training text set.
Specifically, the training text set may include a plurality of unlabeled offending texts, a plurality of offending texts with marked offending categories, and a plurality of unlabeled normal texts.
The proportions among unlabeled offending texts, offending texts with marked forbidden categories, and unlabeled normal texts in the training text set can be set according to actual training requirements; generally speaking, offending texts with marked forbidden categories can be the main component of the training text set.
The training text set may include offensive text corresponding to a plurality of offensive categories, and the same offensive text may correspond to one or more offensive categories. The same illicit category may correspond to multiple illicit text.
The forbidden categories can be set according to actual requirements.
And S2, selecting target texts from the training text set in sequence.
Specifically, target texts can be sequentially selected from the training text set; a target text can be an unlabeled offending text, an offending text with marked forbidden categories, or an unlabeled normal text.
And S3, generating training data by utilizing the preset hidden characters and the target text, wherein the training data is the target text with partial characters replaced by the hidden characters.
Specifically, some characters in the selected target text can be hidden: some words in the target text can be masked with the hidden characters, or some words can be replaced with randomly selected characters, thereby achieving the purpose of hiding.
Training data can be generated from the preset hidden characters and the target text in various ways. For example, some words in the target text can be randomly replaced to generate the training data; alternatively, sensitive words in the target text can be identified and replaced with the hidden characters to generate the training data.
Wherein the characters in the target text that are replaced by hidden characters may account for 15% of the total characters in the target text.
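As a sketch of this masking step, assuming [MASK] as the hidden character and random character-level replacement (one of the two strategies described above):

```python
import random

MASK = "[MASK]"     # assumed hidden-character token
MASK_RATIO = 0.15   # the 15% of total characters stated above

def mask_text(text: str, ratio: float = MASK_RATIO, seed: int = 0) -> list:
    # Randomly replace ~15% of the characters with the hidden character;
    # a fixed seed keeps the sketch deterministic
    rng = random.Random(seed)
    chars = list(text)
    n = max(1, int(len(chars) * ratio))
    for i in rng.sample(range(len(chars)), n):  # distinct positions
        chars[i] = MASK
    return chars

masked = mask_text("an example target text for masking")  # 34 characters, 5 masked
```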
And S4, inputting the training data into a text classification model to obtain target characters predicted by the text classification model and classification results predicted by the text classification model based on the training data.
In particular, the text classification model may be a text classification model that requires training.
The training data can be input into the text classification model for training of the text classification model, and target characters and classification results output by the text classification model are obtained.
The classification result is obtained by the text classification model based on the training data and the predicted target characters corresponding to the training data.
The target characters may be characters obscured by hidden characters in the target text predicted by the text classification model based on the training data.
The classification result is the set of binary classification results corresponding to the multiple forbidden categories. For example, the forbidden categories may be a first forbidden type and a second forbidden type, and each binary classification result is yes or no, so the classification result indicates whether the text belongs to the first forbidden type and whether it belongs to the second forbidden type.
If the target text corresponding to the training data is a normal text, the binary classification results corresponding to all forbidden categories are no.
And S5, calculating a text semantic loss value of the text classification model according to the target characters and the target text.
Specifically, the predicted target characters can be substituted for the hidden characters in the training data to obtain a predicted text; the cross-entropy loss between the predicted text and the target text can then be calculated, and the text semantic loss value of the text classification model determined from this semantic distance.
And S6, calculating a classification loss value of the text classification model according to the classification result and the target text.
Specifically, it is first determined whether the target text has an annotation label. If an annotation label exists, the classification loss value of the text classification model is calculated according to the distance between the classification result and the annotation label. If the target text is a normal text and the classification result indicates that the training data matches no forbidden category, the classification loss value is determined to be 0. If the target text is an unlabeled offending text and the classification result indicates that the training data matches some forbidden category, the classification loss value is likewise directly determined to be 0.
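The cases in step S6 can be sketched as a small decision function. The function shape, and the fallback penalty for cases not specified in the text, are assumptions:

```python
# Hedged sketch of the step-S6 classification-loss rule (all names assumed).
def classification_loss(has_label: bool, is_normal: bool,
                        label_distance: float,
                        predicts_no_category: bool) -> float:
    if has_label:
        # Labeled offending text: loss is the distance between the predicted
        # classification result and the annotation label
        return label_distance
    if is_normal and predicts_no_category:
        # Normal text correctly predicted as matching no forbidden category
        return 0.0
    if not is_normal and not predicts_no_category:
        # Unlabeled offending text predicted to match some forbidden category
        return 0.0
    # Remaining cases are not specified by the text; a real implementation
    # would define a penalty here (assumption)
    return 1.0
```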
And step S7, adjusting parameters of the text classification model based on the text semantic loss value and the classification loss value until the text semantic loss value and the classification loss value accord with preset conditions, and obtaining the trained text classification model.
Specifically, the parameters of the text classification model may be adjusted based on the sizes of the text semantic loss value and the classification loss value, until both the text semantic loss value and the classification loss value fall below a threshold.
The threshold value can be preset according to actual requirements, and different accuracy requirements can be corresponding to different threshold values.
A loss function may also be employed to calculate the loss when training the text classification model using the training data, the target characters and the classification result. The loss function is as follows:
L = -(1/N) Σ_i m_i Σ_j y_ij log p_ij
wherein i denotes the i-th character of the training data; j denotes the j-th character in the dictionary, from which the target characters are selected (the dictionary may contain 21128 characters); L denotes the loss value corresponding to a single piece of training data; N denotes the number of characters in the training data replaced by hidden characters; m_i indicates whether the i-th character in the training data was replaced by a hidden character; y_ij indicates whether the j-th character in the dictionary is the label for the i-th character; and p_ij denotes the probability that the j-th character in the dictionary is the target character at position i.
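A minimal, pure-Python sketch of this masked-character cross-entropy (the function name and the toy values are assumptions):

```python
import math

def mlm_loss(mask_flags, one_hot_labels, probs):
    """Cross-entropy over masked positions only.

    mask_flags[i]        -- m_i: 1 if character i was replaced by the hidden character
    one_hot_labels[i][j] -- y_ij: 1 if dictionary entry j is the true character at i
    probs[i][j]          -- p_ij: predicted probability of dictionary entry j at i
    """
    n = sum(mask_flags)  # N: number of characters replaced by hidden characters
    total = 0.0
    for m, y_row, p_row in zip(mask_flags, one_hot_labels, probs):
        if m:  # only masked positions contribute
            total += sum(y * math.log(p) for y, p in zip(y_row, p_row) if y)
    return -total / n

# Two positions, toy dictionary of size 3; only position 0 is masked
loss = mlm_loss([1, 0],
                [[0, 1, 0], [1, 0, 0]],
                [[0.2, 0.5, 0.3], [0.9, 0.05, 0.05]])
# loss = -log(0.5) ≈ 0.693
```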
According to the above technical solution, the text classification model training method trains the semantic analysis capability of the text classification model by hiding some characters, and, by adopting a mixed training mode of normal texts and offending texts, enables the model to learn the distinction between normal texts and offending samples, improving its discrimination capability; the parameters of the model can then be adjusted using the loss values, improving both its semantic analysis capability and its offending-text recognition capability. Therefore, the text classification model trained by the training method provided by the application has stronger semantic analysis capability and a stronger ability to distinguish offending text.
In addition, by mixing unlabeled offending texts, offending texts with marked forbidden categories, and unlabeled normal texts in the training text set to train the text classification model, both supervised and self-supervised learning of the model can be realized; that is, semi-supervised learning can be carried out by fully utilizing labeled and unlabeled texts, further improving the reliability and learning capacity of the text classification model.
In some embodiments of the present application, the trained text classification model can predict target characters, but in actual prediction only the classification result is needed; the model is not required to predict characters. Therefore, after the text classification model is trained, a processing step can be added that removes the character-prediction function of the model so as to accelerate the prediction of classification results. This processing is described in detail below:
and S8, adjusting the bias parameters and the weight parameters of the trained text classification model to obtain a processed text classification model, wherein the output of the processed text classification model is a classification result corresponding to the input text.
Specifically, parameters of the text classification model may be adjusted to implement that the output of the text classification model is only a classification result, where the classification result includes classification results corresponding to a plurality of forbidden categories.
The adjusted parameter may be a bias parameter and a weight parameter.
Compared with the previous embodiment, this embodiment adds the process of adjusting the bias parameters and weight parameters of the text classification model. Through this process, the model's target-character prediction step can be removed, so that, with the model outputting only the classification result, the prediction difficulty and the prediction procedure are reduced, the prediction of classification results is accelerated, and the prediction efficiency of the text classification model is improved.
In some embodiments of the present application, the process of adjusting the bias parameters and the weight parameters of the trained text classification model in step S8 to obtain the processed text classification model is described in detail, and the steps are as follows:
and S80, adjusting weight parameters and bias parameters related to the predicted target characters and the predicted classification results in the trained text classification model to obtain a processed text classification model.
Specifically, in the process of adjusting the bias parameters and weight parameters of the text classification model, only the weight parameters and bias parameters related to the predicted target characters and the predicted classification results need to be adjusted.
From the above technical solution, it can be seen that this embodiment provides an optional manner for adjusting parameters of a text classification model, which can adjust parameters related to a predicted target character and a predicted classification result in the text classification model, so as to further improve efficiency of adjusting the text classification model and further improve distinguishing efficiency of the text classification model.
In some embodiments of the present application, in step S80, the process of adjusting the weight parameters and bias parameters related to the predicted target characters and the predicted classification result in the trained text classification model to obtain the processed text classification model is described in detail, and the steps are as follows:
s800, adjusting weight parameters and bias parameters related to the predicted target characters and the predicted classification result in the trained text classification model by using a preset adjusting formula to obtain a processed text classification model.
Specifically, the adjustment formula is as follows:
H = BERT(X; θ)
H_c = Slice_1(H)
H_f = LN(GELU(H_c · W_1 + B_1))
E_t = Slice_2(E)
S = H_f · E_t^T + B_2
P = Softmax(S)
wherein X represents the input of the trained text classification model; θ represents the weight parameters of the trained text classification model; BERT represents semantic encoding with the text classification model; H represents the semantic vectors encoded by the text classification model; Slice_1 truncates the encoded semantic vectors to the positions corresponding to the classification results, and H_c represents the semantic vectors corresponding to the classification results; H_f represents the fully connected semantic vectors; LN represents the Layer Normalization operation; GELU represents the Gaussian error linear unit activation function; W_1 represents the weight parameters of the fully connected layer; B_1 represents the bias parameters of the fully connected layer; E represents the vector matrix of the dictionary; Slice_2 truncates the dictionary's vector matrix to the vector matrix corresponding to the target characters only, which is E_t; E_t^T represents the transpose of the vector matrix corresponding to the target characters only; B_2 represents the bias parameters of the dimension transformation (upscaling); S represents the scores of the classification results corresponding to the forbidden categories; P represents the probabilities of the classification results corresponding to the forbidden categories, with Softmax representing the Softmax function used to obtain the probabilities; the dictionary is a pre-established character database, and the target characters are selected from the dictionary.
b represents the batch size, s represents the sequence length, d represents the vector dimension, k represents the number of forbidden categories, v represents the dictionary size, and typically v > s.
As can be seen from the above technical solution, this process enables the parameters of the text classification model to be adjusted effectively, thereby improving the prediction efficiency of the text classification model.
After the parameters of the trained text classification model are adjusted with the adjustment formula, the number of operations performed by the text classification model, and in particular the large number of power-function evaluations in the Softmax function, can be reduced, so the prediction efficiency of the text classification model is improved. Verification shows that after the parameters of the text classification model are adjusted, the accuracy of the classification results predicted by the text classification model does not decrease; therefore, the text classification model obtained in this embodiment improves prediction efficiency while guaranteeing discrimination accuracy.
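The efficiency claim above can be illustrated with a small sketch. The following Python snippet (all dimensions, vectors and dictionary indices are made-up toy values, not the patent's actual parameters) shows that slicing the dictionary's vector matrix down to the target-character rows before the Softmax — the role Slice 2 plays in the adjustment formula — yields exactly the probabilities obtained by renormalizing a full-dictionary Softmax over those two characters, while computing only 2 exponentials per position instead of v:

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy setup: d = 4 hidden dimensions, v = 8 dictionary entries.
# h plays the role of one fully connected semantic vector at a
# classification-result position; E plays the role of the dictionary matrix.
h = [0.5, -1.2, 0.3, 0.8]
E = [[0.1 * (i + 1) * ((-1) ** j) for j in range(4)] for i in range(8)]
YES, NO = 2, 5  # made-up dictionary indices of the two target characters

# Full-dictionary head: v dot products and v exponentials per position.
p_full = softmax([dot(h, row) for row in E])
denom = p_full[YES] + p_full[NO]
p_pair = [p_full[YES] / denom, p_full[NO] / denom]

# Sliced head: keep only the target-character rows (the Slice 2 step),
# then a Softmax over 2 logits instead of v.
E_tgt = [E[YES], E[NO]]
p_sliced = softmax([dot(h, row) for row in E_tgt])

assert all(abs(a - b) < 1e-9 for a, b in zip(p_pair, p_sliced))
```

Because restricting a Softmax to a subset of logits preserves their relative probabilities, the yes/no decision per forbidden category is unchanged, which matches the observation that accuracy does not drop while far fewer power-function evaluations are needed.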
In some embodiments of the present application, the process in step S3 of generating training data by using preset hidden characters and the target text, where the training data is the target text with part of its characters replaced by hidden characters, is described in detail. The steps are as follows:
S30, replacing part of characters of the target text by using the hidden characters to obtain a replaced text.
Specifically, the replacement text can be obtained in various ways. For example, part of the characters of the target text can be replaced at random with hidden characters to obtain the replacement text, or the sensitive words in the target text can be replaced with hidden characters to obtain the replacement text.
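As a concrete illustration of step S30, the following Python sketch implements both replacement strategies described above; the mask token "[M]", the replacement ratio and the random seed are illustrative assumptions, not values fixed by the patent:

```python
import random

MASK = "[M]"  # hypothetical hidden character

def replace_randomly(text, ratio=0.15, seed=42):
    # Randomly replace about `ratio` of the characters with the hidden character.
    rng = random.Random(seed)
    n = max(1, int(len(text) * ratio))
    chars = list(text)
    for pos in rng.sample(range(len(chars)), n):
        chars[pos] = MASK
    return "".join(chars)

def replace_sensitive(text, sensitive_words):
    # Replace every character of each sensitive word with the hidden character.
    for word in sensitive_words:
        text = text.replace(word, MASK * len(word))
    return text

print(replace_randomly("some target text to be masked"))
print(replace_sensitive("please buy badword here", ["badword"]))
```

Either function produces a replacement text in which the hidden character marks the positions the model must reconstruct during training.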
And S31, if the target text does not have a label, directly taking the replacement text as the training data.
Specifically, when no label exists in the target text, the target text is replaced by the replacement characters, and the obtained replacement text can be directly used as training data.
S32, if the target text has a label, processing the replacement text by using a preset text template to obtain training data.
Specifically, if a tag exists in the target text, the text template can be utilized to process the replacement text to obtain training data, so that forbidden categories exist at fixed positions of each training data.
As can be seen from the above technical solution, this embodiment provides an optional way of generating training data. Training data generated in this way share a similar format, which facilitates semi-supervised learning of the text classification model during training and speeds up the training process of the text classification model.
In some embodiments of the present application, in step S32, if the target text has a tag, the process of processing the replacement text by using a preset text template to obtain training data is described in detail, and the steps are as follows:
S320, determining the classification result corresponding to each forbidden category by using the labeling label of the target text corresponding to the replacement text.
Specifically, the labeling label of the target text indicates the forbidden category that the target text matches, and the classification result corresponding to each forbidden category is determined from this matched category. For example, suppose the labeling label indicates that the target text matches the fifth forbidden category, and the forbidden categories that the text classification model can distinguish are the first, second, third, fourth and fifth forbidden categories. Then the classification results corresponding to the first, second, third and fourth forbidden categories may each be no, and the classification result corresponding to the fifth forbidden category may be yes.
S321, determining the sequence of each classification result based on the fixed sequence corresponding to the forbidden classes in the text template.
Specifically, the order of the two classification results corresponding to each forbidden category that the text classification model can identify can be determined according to the fixed order of the forbidden categories in the text template. For example, when the forbidden categories that the text classification model can identify are the first, second, third, fourth and fifth forbidden types, the order of the two classification results corresponding to each of these five forbidden types needs to be determined.
S322, forming a two-classification result combination according to the sequence of the two-classification results.
Specifically, the respective classification results may be combined according to their corresponding order, and the combined result is the two-classification result combination. For example, suppose the fixed order of the forbidden categories in the text template is the first, second, third, fourth and fifth forbidden types; the classification result "no" corresponding to the first forbidden type is then placed first, the "no" corresponding to the second forbidden type second, the "no" corresponding to the third forbidden type third, the "no" corresponding to the fourth forbidden type fourth, and the "yes" corresponding to the fifth forbidden type fifth, so the formed two-classification result combination may be "no no no no yes".
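The mapping from a labeling label to a two-classification result combination (steps S320 to S322) can be sketched as follows; the category names, the "yes"/"no" strings and the space separator are illustrative assumptions rather than details fixed by the patent:

```python
def binary_result_combination(matched_categories, category_order):
    # S320: derive a per-category "yes"/"no" result from the labeling label;
    # S321/S322: arrange the results in the template's fixed category order
    # and join them into one combination string.
    return " ".join("yes" if c in matched_categories else "no"
                    for c in category_order)

ORDER = ["first", "second", "third", "fourth", "fifth"]  # assumed fixed order
print(binary_result_combination({"fifth"}, ORDER))  # -> no no no no yes
```

A label matching only the fifth forbidden category yields the combination "no no no no yes", matching the worked example above.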
S323, training data is generated based on specific positions corresponding to the forbidden categories in the text template, specific positions of the replacement text in the text template, the classification result combination and the replacement text.
Specifically, the two-classification result combination and the replacement text are combined according to the specific positions corresponding to the forbidden categories in the text template and the specific position of the replacement text, so as to generate the training data. For example, if the position corresponding to the forbidden categories in the text template is at the left end of the replacement text, the two-classification result combination can be placed before the replacement text to generate the training data.
As can be seen from the above technical solution, this embodiment provides an optional way of assembling the training data. In this way, during the training of the text classification model, the model can distinguish the replacement text from the classification result combination according to the position of each component of the training data, which improves the training efficiency of the text classification model.
In some embodiments of the present application, the process of generating training data in step S323 based on specific positions corresponding to the forbidden categories in the text template, specific positions of the substituted text in the text template, the combination of the classification results, and the substituted text is described in detail, and the steps are as follows:
S3230, combining the two classification result combinations and the replacement text according to specific positions corresponding to the forbidden categories and specific positions of the replacement text in the text template to obtain combined data.
Specifically, the combination of the classification result and the replacement text can be integrated according to the forbidden classification and the position of the replacement text in the text template, and the obtained result is combination data, wherein the combination data comprises the combination of the classification result and the replacement text.
S3231, adding a preset prefix character to the combined data and adding a suffix character between the two-classification result combination of the combined data and the replacement text, so as to obtain training data; in this way, after the training data is input into the text classification model, the text classification model distinguishes and identifies the classification result combination and the replacement text based on the prefix character and the suffix character.
Specifically, a prefix character and a suffix character may be preset, and their specific contents may be set according to actual requirements; the prefix character may include a character string indicating a starting meaning. The training data may then be composed as "prefix character + two-classification result combination + suffix character + replacement text". Thus, the data between the prefix character and the suffix character is the two-classification result combination, and the data after the suffix character is the replacement text.
As can be seen from the above technical solution, this embodiment provides an optional way of generating training data from the replacement text and the two-classification result combination. With the prefix character and the suffix character, the two-classification result combination and the replacement text in the training data can be better distinguished, so training of the text classification model can be better completed.
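Steps S3230 and S3231 amount to a simple string assembly, sketched below. The concrete prefix "[BOS]" and suffix "[SEP]" are placeholders chosen for illustration (the patent only requires that the prefix indicate a starting meaning), as is the sample text:

```python
PREFIX = "[BOS]"  # assumed character string indicating a starting meaning
SUFFIX = "[SEP]"  # assumed separator between the combination and the text

def build_training_data(result_combination, replacement_text):
    # "prefix character + two-classification result combination
    #  + suffix character + replacement text"
    return f"{PREFIX}{result_combination}{SUFFIX}{replacement_text}"

sample = build_training_data("no no no no yes", "this [M] is a replacement text")
print(sample)
```

Everything between the prefix and the suffix is then the two-classification result combination, and everything after the suffix is the replacement text, so the model can separate the two parts purely by position.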
The text classification method provided by the embodiments of the present application will now be described in detail. The text classification model trained above can be applied in the text classification method provided below, and the text classification method described below and the text classification model training method provided above may be referred to in correspondence with each other.
The specific steps of the text classification method can be as follows:
s1, acquiring text information to be classified.
Specifically, text information to be classified can be acquired.
The text information to be classified can be obtained from the Internet; for example, it can be obtained from the chat records of a live broadcast room, or from a chat interface.
S2, classifying the text information to be classified by using the text classification model trained by the text classification model training method provided by any embodiment to obtain a classification result, wherein the classification result comprises classification results corresponding to a plurality of forbidden categories.
Specifically, the text information to be classified can be input into the text classification model obtained through the training, a classification result predicted by the text classification model is obtained, the classification result can comprise classification results corresponding to a plurality of forbidden categories, and whether the text information to be classified belongs to the illegal text can be known through the classification result.
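Reading out the final verdict from the classification results corresponding to the forbidden categories can be sketched as below; the probability values, the 0.5 threshold and the category names are invented for illustration — the patent does not fix these details:

```python
def interpret(yes_probabilities, category_order, threshold=0.5):
    # Map per-category probabilities of the "yes" result (hypothetical model
    # output) to classification results and an overall illegal/normal verdict.
    results = {c: ("yes" if p >= threshold else "no")
               for c, p in zip(category_order, yes_probabilities)}
    verdict = "illegal" if "yes" in results.values() else "normal"
    return results, verdict

ORDER = ["first", "second", "third", "fourth", "fifth"]  # assumed category order
results, verdict = interpret([0.02, 0.01, 0.97, 0.05, 0.10], ORDER)
print(verdict, results)
```

A text is treated as an illegal text as soon as any forbidden category's classification result is "yes"; otherwise it is a normal text.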
As can be seen from the above technical solution, this embodiment provides a text classification method through which illegal texts and normal texts can be identified, so that the illegal texts can be handled, thereby safeguarding the physical and mental health of underage users and healthy Internet use.
The text classification model training device provided in the embodiments of the present application is described below, and the text classification model training device described below and the text classification model training method described above may be referred to correspondingly.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a training device for text classification model according to an embodiment of the present application.
As shown in fig. 2, the text classification model training apparatus may include:
a determining unit 1, configured to determine a training text set, where the training text set includes a plurality of unlabeled illegal texts, a plurality of illegal texts with labeled forbidden categories, and a plurality of unlabeled normal texts;
A selecting unit 2, configured to sequentially select target texts from the training text set;
a generating unit 3, configured to generate training data by using preset hidden characters and the target text, where the training data is a target text in which part of characters are replaced by hidden characters;
the classifying unit 4 is used for inputting the training data into a text classifying model to obtain target characters predicted by the text classifying model and classifying results predicted by the text classifying model based on the training data, wherein the classifying results are classifying results corresponding to a plurality of forbidden categories;
a calculating unit 5, configured to calculate a text semantic loss value of the text classification model according to the target character and the target text;
a utilization unit 6, configured to calculate a classification loss value of the text classification model according to the classification result and the target text;
and the adjusting unit 7 is configured to adjust parameters of the text classification model based on the text semantic loss value and the classification loss value until the text semantic loss value and the classification loss value meet preset conditions, thereby obtaining a trained text classification model.
Optionally, the text classification model training apparatus may further include:
and the parameter adjusting unit is used for adjusting the bias parameters and the weight parameters of the trained text classification model to obtain a processed text classification model, and the output of the processed text classification model is a classification result corresponding to the input text.
Alternatively, the parameter adjusting unit may include:
and the weight parameter adjusting unit is used for adjusting weight parameters and bias parameters related to the predicted target characters and the predicted classification results in the trained text classification model to obtain a processed text classification model.
Alternatively, the weight parameter adjusting unit may include:
the formula utilization unit is used for adjusting, by using a preset adjustment formula, the weight parameters and bias parameters related to the predicted target characters and the predicted classification results in the trained text classification model, so as to obtain the processed text classification model;
the adjustment formula is as follows:
H = BERT(X; θ)
H_cls = Slice_1(H)
H_fc = LN(GELU(H_cls · W_1 + B_1))
E_tgt = Slice_2(E)
S = H_fc · E_tgt^T + B_2
P = Softmax(S)
wherein X represents the input of the trained text classification model, θ represents the weight parameters of the trained text classification model, BERT represents semantic encoding with the text classification model, H represents the semantic vectors encoded by the text classification model, Slice_1 represents intercepting the semantic vectors encoded by the text classification model, H_cls represents the semantic vectors corresponding to the classification results, H_fc represents the semantic vectors after the fully connected layer, LN represents the Layer Normalization operation, GELU represents the Gaussian error linear unit activation function, W_1 represents the weight parameters of the fully connected layer, B_1 represents the bias parameters of the fully connected layer, E represents the vector matrix of the dictionary, Slice_2 represents intercepting from the dictionary the vector matrix corresponding to the target characters, E_tgt represents the vector matrix corresponding to the target characters only, E_tgt^T represents the transpose of the vector matrix corresponding to the target characters only, B_2 represents the bias parameters of the dimension transformation (up-scaling), S represents the scores of the classification results corresponding to the forbidden categories, P represents the probabilities of the classification results corresponding to the forbidden categories, and Softmax represents computing the probabilities with the Softmax function.
Alternatively, the generating unit may include:
a character replacing unit, configured to replace a part of characters of the target text with hidden characters, so as to obtain a replaced text;
the label judging unit is used for directly taking the replacement text as the training data if the target text does not have a label;
and the text processing unit is used for processing the replacement text by using a preset text template if the target text has a label, so as to obtain training data.
Alternatively, the text processing unit may include:
the classification result determining unit is used for determining a classification result corresponding to each forbidden class by using the labeling label of the target text corresponding to the replacement text;
the sequence determining unit is used for determining the sequence of each classification result based on the fixed sequence corresponding to the forbidden classes in the text template;
the two-classification result combination unit is used for forming two-classification result combinations according to the sequence of the two-classification results;
the position utilization unit is used for generating training data based on specific positions corresponding to the forbidden categories in the text template, specific positions of the replaced text in the text template, the classification result combination and the replaced text.
Alternatively, the location utilization unit may include:
the first position utilization unit is used for combining the two classification result combinations and the replacement text according to specific positions corresponding to the forbidden categories in the text template and specific positions of the replacement text to obtain combined data;
and the second position utilization unit is used for adding preset prefix characters into the combined data and adding suffix characters between the classification result of the combined data and the replacement text to obtain training data, so that after the training data is input into the text classification model, the text classification model distinguishes and identifies the classification result combination and the replacement text based on the prefix characters and the suffix characters.
The text classification model training device provided by the embodiment of the application can be applied to text classification model training equipment, such as a PC terminal, a cloud platform, a server cluster and the like. Optionally, fig. 3 shows a block diagram of a hardware structure of the text classification model training apparatus, and referring to fig. 3, the hardware structure of the text classification model training apparatus may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
in the embodiment of the application, the number of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 complete communication with each other through the communication bus 4;
the processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application, etc.;
the memory 3 may comprise a high-speed RAM memory, and may further comprise a non-volatile memory (non-volatile memory) or the like, such as at least one magnetic disk memory;
wherein the memory stores a program, the processor is operable to invoke the program stored in the memory, the program operable to:
Determining a training text set, wherein the training text set comprises a plurality of unlabeled illegal texts, a plurality of illegal texts with marked forbidden categories and a plurality of unlabeled normal texts;
sequentially selecting target texts from the training text set;
generating training data by using preset hidden characters and the target text, wherein the training data is the target text with partial characters replaced by the hidden characters;
inputting the training data into a text classification model to obtain target characters predicted by the text classification model and classification results predicted by the text classification model based on the training data, wherein the classification results are classification results corresponding to a plurality of forbidden categories;
calculating a text semantic loss value of the text classification model according to the target characters and the target text;
calculating a classification loss value of the text classification model according to the classification result and the target text;
and adjusting parameters of the text classification model based on the text semantic loss value and the classification loss value until the text semantic loss value and the classification loss value accord with preset conditions, so as to obtain a trained text classification model.
Alternatively, the refinement function and the extension function of the program may be described with reference to the above.
The embodiment of the application also provides a readable storage medium, which can store a program suitable for being executed by a processor, the program being configured to:
determining a training text set, wherein the training text set comprises a plurality of unlabeled illegal texts, a plurality of illegal texts with marked forbidden categories and a plurality of unlabeled normal texts;
sequentially selecting target texts from the training text set;
generating training data by using preset hidden characters and the target text, wherein the training data is the target text with partial characters replaced by the hidden characters;
inputting the training data into a text classification model to obtain target characters predicted by the text classification model and classification results predicted by the text classification model based on the training data, wherein the classification results are classification results corresponding to a plurality of forbidden categories;
calculating a text semantic loss value of the text classification model according to the target characters and the target text;
calculating a classification loss value of the text classification model according to the classification result and the target text;
And adjusting parameters of the text classification model based on the text semantic loss value and the classification loss value until the text semantic loss value and the classification loss value accord with preset conditions, so as to obtain a trained text classification model.
Alternatively, the refinement function and the extension function of the program may be described with reference to the above.
The text classification device provided in the embodiments of the present application will be described in detail below, and the text classification device described below may be referred to with the text classification method provided above.
The text classification apparatus may include:
the text acquisition unit is used for acquiring text information to be classified;
the information classification unit is used for classifying the text information to be classified by using a text classification model trained by a text classification model training method to obtain a classification result, wherein the classification result comprises classification results corresponding to a plurality of forbidden categories.
Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In the present specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, and identical and similar parts between the embodiments are all enough to refer to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Various embodiments of the present application may be combined with one another. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (10)
1. A method for training a text classification model, comprising:
determining a training text set, wherein the training text set comprises a plurality of unlabeled illegal texts, a plurality of illegal texts with marked forbidden categories and a plurality of unlabeled normal texts;
sequentially selecting target texts from the training text set;
Generating training data by using preset hidden characters and the target text, wherein the training data is the target text with partial characters replaced by the hidden characters;
inputting the training data into a text classification model to obtain target characters predicted by the text classification model and classification results predicted by the text classification model based on the training data, wherein the classification results are classification results corresponding to a plurality of forbidden categories;
calculating a text semantic loss value of the text classification model according to the target characters and the target text;
calculating a classification loss value of the text classification model according to the classification result and the target text;
and adjusting parameters of the text classification model based on the text semantic loss value and the classification loss value until the text semantic loss value and the classification loss value accord with preset conditions, so as to obtain a trained text classification model.
2. The text classification model training method of claim 1, further comprising:
and adjusting the bias parameters and the weight parameters of the trained text classification model to obtain a processed text classification model, wherein the output of the processed text classification model is a classification result corresponding to the input text.
3. The method for training a text classification model according to claim 2, wherein the adjusting the bias parameter and the weight parameter of the trained text classification model to obtain the processed text classification model comprises:
and adjusting weight parameters and bias parameters related to the predicted target characters and the predicted classification results in the trained text classification model to obtain a processed text classification model.
4. A method of training a text classification model according to claim 3, wherein adjusting the weight parameters and bias parameters associated with the predicted target character and the predicted classification result in the trained text classification model to obtain a processed text classification model comprises:
adjusting weight parameters and bias parameters related to the predicted target characters and the predicted classification results in the trained text classification model by using a preset adjusting formula to obtain a processed text classification model;
the adjustment formula is as follows:
H = BERT(X; θ)
H_cls = Slice_1(H)
H_fc = LN(GELU(H_cls · W_1 + B_1))
E_tgt = Slice_2(E)
S = H_fc · E_tgt^T + B_2
P = Softmax(S)
wherein X represents the input of the trained text classification model, θ represents the weight parameters of the trained text classification model, BERT represents semantic encoding with the text classification model, H represents the semantic vectors encoded by the text classification model, Slice_1 represents intercepting the semantic vectors encoded by the text classification model, H_cls represents the semantic vectors corresponding to the classification results, H_fc represents the semantic vectors after the fully connected layer, LN represents the Layer Normalization operation, GELU represents the Gaussian error linear unit activation function, W_1 represents the weight parameters of the fully connected layer, B_1 represents the bias parameters of the fully connected layer, E represents the vector matrix of the dictionary, Slice_2 represents intercepting from the dictionary the vector matrix corresponding to the target characters, E_tgt represents the vector matrix corresponding to the target characters only, E_tgt^T represents the transpose of the vector matrix corresponding to the target characters only, B_2 represents the bias parameters of the dimension transformation (up-scaling), S represents the scores of the classification results corresponding to the forbidden categories, P represents the probabilities of the classification results corresponding to the forbidden categories, and Softmax represents computing the probabilities with the Softmax function. The dictionary is a pre-established character database.
5. The text classification model training method of claim 1, wherein generating training data using preset hidden characters and the target text comprises:
replacing part of characters of the target text by using the hidden characters to obtain a replaced text;
If the target text does not have a label, directly taking the replacement text as the training data;
if the target text has a label, processing the replacement text by using a preset text template to obtain training data.
6. The text classification model training method according to claim 5, wherein the text template comprises a fixed order and specific positions corresponding to a plurality of forbidden categories, and further comprises a specific position for the replacement text;
processing the replacement text by using a preset text template to obtain training data comprises:
determining a binary classification result corresponding to each forbidden category by using the annotation label of the target text corresponding to the replacement text;
determining the order of each binary classification result based on the fixed order corresponding to the forbidden categories in the text template;
forming a binary classification result combination according to the order of the binary classification results;
and generating the training data based on the specific positions corresponding to the forbidden categories in the text template, the specific position of the replacement text in the text template, the binary classification result combination, and the replacement text.
7. The text classification model training method according to claim 6, wherein generating the training data based on the specific positions corresponding to the plurality of forbidden categories in the text template, the specific position of the replacement text in the text template, the binary classification result combination, and the replacement text comprises:
combining the binary classification result combination and the replacement text according to the specific positions corresponding to the forbidden categories in the text template and the specific position of the replacement text, to obtain combined data;
and adding a preset prefix character to the combined data and adding a suffix character between the binary classification result combination in the combined data and the replacement text, to obtain the training data, so that after the training data is input into the text classification model, the text classification model distinguishes the binary classification result combination from the replacement text based on the prefix character and the suffix character.
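For illustration only, the template assembly described in claims 6 and 7 — a fixed category order, one binary result per category combined in that order, and prefix/suffix characters separating the result combination from the replacement text — might be sketched as follows; the particular tokens and category names are assumptions:

```python
# Assumed vocabulary for the sketch; the patent does not fix these tokens.
PREFIX = "[CLS]"  # preset prefix character
SUFFIX = "[SEP]"  # suffix character between the result combination and the text
CATEGORY_ORDER = ["porn", "abuse", "politics"]  # fixed order of forbidden categories

def build_training_sample(replaced_text: str, labels: dict) -> str:
    """Assemble: prefix + binary result combination (fixed order) + suffix + replacement text."""
    # one binary result character per forbidden category, in the fixed template order
    results = "".join("1" if labels.get(c) else "0" for c in CATEGORY_ORDER)
    return f"{PREFIX}{results}{SUFFIX}{replaced_text}"

sample = build_training_sample("this [MASK] is fine", {"abuse": True})
print(sample)  # [CLS]010[SEP]this [MASK] is fine
```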
8. A text classification model training device, comprising:
the acquisition unit is used for acquiring a training text set, wherein the training text set comprises a plurality of unlabeled illegal texts, a plurality of labeled illegal texts marked with forbidden categories, and a plurality of unlabeled normal texts;
the selecting unit is used for sequentially selecting target texts from the training text set;
the generation unit is used for generating training data by utilizing preset hidden characters and the target text, wherein the training data is the target text with partial characters replaced by the hidden characters;
the classification unit is used for inputting the training data into a text classification model to obtain target characters predicted by the text classification model and classification results predicted by the text classification model based on the training data, wherein the classification results are binary classification results corresponding to a plurality of forbidden categories;
the calculating unit is used for calculating a text semantic loss value of the text classification model according to the target characters and the target text;
the utilization unit is used for calculating a classification loss value of the text classification model according to the classification result and the target text;
and the adjusting unit is used for adjusting the parameters of the text classification model based on the text semantic loss value and the classification loss value until the text semantic loss value and the classification loss value accord with preset conditions, so as to obtain the trained text classification model.
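For illustration only, the joint objective used by the adjusting unit — a text semantic loss on the predicted target characters plus a classification loss on the per-category results — can be sketched with cross-entropy terms; the equal weighting between the two terms is an assumption:

```python
import math

def cross_entropy(probs: list, target_index: int) -> float:
    # negative log-likelihood of the correct character/class
    return -math.log(probs[target_index])

def joint_loss(char_probs, char_targets, cls_probs, cls_targets, alpha=1.0):
    """Text semantic loss (masked-character prediction) plus classification loss.

    char_probs / char_targets -- per masked position: a distribution over candidate
                                 characters and the index of the true character
    cls_probs / cls_targets   -- per forbidden category: a binary distribution and
                                 the index of the true result
    alpha                     -- assumed weighting between the two loss terms
    """
    semantic = sum(cross_entropy(p, t) for p, t in zip(char_probs, char_targets))
    classification = sum(cross_entropy(p, t) for p, t in zip(cls_probs, cls_targets))
    return semantic + alpha * classification

loss = joint_loss(
    char_probs=[[0.7, 0.2, 0.1]], char_targets=[0],
    cls_probs=[[0.9, 0.1], [0.2, 0.8]], cls_targets=[0, 1],
)
print(round(loss, 4))  # 0.6852
```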
9. A method of text classification, comprising:
acquiring text information to be classified;
classifying the text information to be classified by using the text classification model trained by the text classification model training method according to any one of claims 1 to 7 to obtain a classification result, wherein the classification result comprises classification results corresponding to a plurality of forbidden classes.
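For illustration only, applying a trained model to new text to obtain one binary result per forbidden category might look like this sketch; the 0.5 decision threshold and the stub model are assumptions:

```python
def classify(text: str, model, categories: list) -> dict:
    """Run a trained model on one text and return a per-category verdict.

    `model` is any callable returning one probability of violation per category.
    """
    probs = model(text)
    return {c: p >= 0.5 for c, p in zip(categories, probs)}

# Stub standing in for the trained text classification model
fake_model = lambda text: [0.9, 0.1]
print(classify("some user comment", fake_model, ["porn", "abuse"]))
```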
10. A text classification device, comprising:
the text acquisition unit is used for acquiring text information to be classified;
the information classification unit is used for classifying the text information to be classified by using the text classification model trained by the text classification model training method according to any one of claims 1-7 to obtain a classification result, wherein the classification result comprises classification results corresponding to a plurality of forbidden categories.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211729559.9A CN116362292A (en) | 2022-12-30 | 2022-12-30 | Text classification model training method and device, text classification method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116362292A true CN116362292A (en) | 2023-06-30 |
Family
ID=86926325
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211729559.9A Pending CN116362292A (en) | 2022-12-30 | 2022-12-30 | Text classification model training method and device, text classification method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116362292A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---|
CN117874239A * | 2024-03-11 | 2024-04-12 | 腾讯科技(深圳)有限公司 | Content generation method, device, equipment and storage medium |
CN117874239B * | 2024-03-11 | 2024-06-11 | 腾讯科技(深圳)有限公司 | Content generation method, device, equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112084337B (en) | Training method of text classification model, text classification method and equipment | |
Zhao et al. | Cyberbullying detection based on semantic-enhanced marginalized denoising auto-encoder | |
Fang et al. | Self multi-head attention-based convolutional neural networks for fake news detection | |
CN107992596B (en) | Text clustering method, text clustering device, server and storage medium | |
CN110069709B (en) | Intention recognition method, device, computer readable medium and electronic equipment | |
US10803387B1 (en) | Deep neural architectures for detecting false claims | |
CN111680159B (en) | Data processing method and device and electronic equipment | |
US7689531B1 (en) | Automatic charset detection using support vector machines with charset grouping | |
US7827133B2 (en) | Method and arrangement for SIM algorithm automatic charset detection | |
CN108536754A (en) | Electronic health record entity relation extraction method based on BLSTM and attention mechanism | |
CN112231485B (en) | Text recommendation method and device, computer equipment and storage medium | |
CN110795525A (en) | Text structuring method and device, electronic equipment and computer readable storage medium | |
CN110188158B (en) | Keyword and topic label generation method, device, medium and electronic equipment | |
CN112667813B (en) | Method for identifying sensitive identity information of referee document | |
CN112464655A (en) | Word vector representation method, device and medium combining Chinese characters and pinyin | |
CN111898704A (en) | Method and device for clustering content samples | |
CN116362292A (en) | Text classification model training method and device, text classification method and device | |
CN115186085A (en) | Reply content processing method and interaction method of media content interaction content | |
CN114722832A (en) | Abstract extraction method, device, equipment and storage medium | |
Trisal et al. | K-RCC: A novel approach to reduce the computational complexity of KNN algorithm for detecting human behavior on social networks | |
Zhao et al. | Topic identification of text‐based expert stock comments using multi‐level information fusion | |
Karaoglan et al. | Enhancing Aspect Category Detection Through Hybridised Contextualised Neural Language Models: A Case Study In Multi-Label Text Classification | |
Khan et al. | Fake news classification using machine learning: Count vectorizer and support vector machine | |
Ling | Coronavirus public sentiment analysis with BERT deep learning | |
Hossain et al. | An Ensemble Method-Based Machine Learning Approach Using Text Mining to Identify Semantic Fake News |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||