CN114116969A - Corpus screening method based on multiple loss fusion text classification model results - Google Patents


Info

Publication number: CN114116969A
Application number: CN202111341075.2A
Authority: CN (China)
Prior art keywords: text classification, classification model, data, output, loss
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 徐泽坤, 岳文浩
Current Assignee: Hisense Visual Technology Co Ltd
Original Assignee: Hisense Visual Technology Co Ltd

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 — Information retrieval of unstructured textual data
    • G06F16/33 — Querying
    • G06F16/3331 — Query processing
    • G06F16/334 — Query execution
    • G06F16/3344 — Query execution using natural language analysis
    • G06F16/3346 — Query execution using probabilistic model
    • G06F16/335 — Filtering based on additional data, e.g. user or group profiles
    • G06F16/35 — Clustering; Classification


Abstract

The application provides a corpus screening method based on the results of a text classification model that fuses multiple losses. At the model level, multiple loss functions are fused into the text classification model so that it adaptively learns the weight each loss function contributes to its classification effect, which improves the model's robustness. At the data level, based on the results of the loss-fused model, the classification quality of the training sample data's output categories is judged by variance calculation, poor-quality data to be screened is found, and that data is sent for review processing. The text classification model is then retrained according to the processing result, improving its classification and prediction effect. By calculating the confusion degree between output categories, a quantitative score is produced for the model's classification system and used as a basis for adjusting category definitions, further improving the model's prediction effect.

Description

Corpus screening method based on multiple loss fusion text classification model results
Technical Field
The application relates to the technical field of natural language processing, in particular to a corpus screening method based on a result of a multi-loss fusion text classification model.
Background
In the data-driven artificial intelligence era, man-machine conversation is widely applied; in some application scenarios, the operation of an intelligent terminal can be controlled by voice. The bottom layer of these operations is text classification, which relies on a text classification model.
The text classification model is obtained by training a deep learning neural network, and for such a network, acquiring high-quality training sample data is what guarantees an accurate prediction effect. The model classifies and labels a large amount of training sample data according to a classification system or standard, and extracts the semantic features of the text by learning from the existing training sample data, so as to classify the data and predict its output category.
In one implementation, the training sample data is unscreened and contains a large amount of low-quality data. Because the training process depends on the quality of the training sample data, this low-quality unscreened data inevitably affects the training process and training effect, and therefore the accuracy, of the text classification model.
Disclosure of Invention
The application provides a corpus screening method based on a result of a multi-loss fusion text classification model, which aims to solve the problem of poor prediction effect of the current text classification model.
The corpus screening method based on the result of the multiple loss fusion text classification model comprises the following steps:
dividing the text classification model into a model layer and a data layer according to functions;
on the model level, fusing a plurality of loss functions on the text classification model to obtain the output class and the class probability value of training sample data in the text classification model;
on a data level, calculating the variance of training sample data according to the output category and the category probability value;
and screening the review data to be screened, wherein the variance of the review data to be screened is lower than a variance threshold value, and the variance threshold value is preset according to the output category of the training sample data.
According to the technical scheme above, to address the poor robustness of the text classification model, multiple loss functions are fused so that the model adaptively learns the weight each loss function contributes to its classification effect, improving robustness. At the data level, during training, the classification quality of the training sample data's output categories is judged by variance calculation based on the loss-fused model results, with the aim of finding poor-quality review data to be screened and sending it for review processing. The model is then retrained according to the processing result, improving its classification and prediction effect. In addition, by calculating the confusion degree between output categories, a quantitative score is produced for the model's classification system and used as a basis for adjusting category definitions, further improving the prediction effect. In summary, the embodiments of the application improve the prediction effect of the text classification model at both the model level and the data level through the fusion of multiple loss functions and the screening of the training corpus.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a schematic diagram illustrating a principle logic of a text classification model according to an embodiment of the present application;
fig. 2 is a schematic view of a dialog system page of a terminal device in an embodiment of the present application;
fig. 3 is a schematic diagram illustrating a display effect of a terminal device when a text classification is poor in an embodiment of the present application;
FIG. 4 is a schematic diagram of a general framework of a corpus screening method based on a result of a multi-loss fusion text classification model;
FIG. 5 is a schematic flow chart of a corpus screening method based on a result of a multiple-loss fusion text classification model;
FIG. 6 is a schematic flow chart illustrating a process of fusing multiple loss functions to a text classification model to obtain an output class and a class probability value;
FIG. 7 is a schematic diagram of a flow of performing multiple loss function calculations on training sample data and a schematic diagram of data flow;
fig. 8 is a schematic diagram of a display effect of the terminal device after text classification is promoted in the embodiment of the present application.
Detailed Description
To make the purpose and embodiments of the present application clearer, the following will clearly and completely describe the exemplary embodiments of the present application with reference to the attached drawings in the exemplary embodiments of the present application, and it is obvious that the described exemplary embodiments are only a part of the embodiments of the present application, and not all of the embodiments.
It should be noted that the brief descriptions of the terms in the present application are only for the convenience of understanding the embodiments described below, and are not intended to limit the embodiments of the present application. These terms should be understood in their ordinary and customary meaning unless otherwise indicated.
This application relates to, but is not limited to, text classification and dialog systems. A dialog system may, for example, control a terminal device by voice: in one usage scenario, a television, an air conditioner, and the like are controlled by voice, for example adjusting the air conditioner's temperature or fan mode, or having the television search for and play a certain type of program, and the application is not limited in this respect. Text classification is the underlying technical support for what the terminal displays: for example, when a category of programs such as "movie" is searched on a television through the dialog system, the effect of the text classification directly affects what the dialog system displays, and if the underlying text classification is poor, the effect presented to the user is not ideal.
Text classification is determined by a text classification model. Fig. 1 is a schematic diagram of the principle logic of the text classification model in an embodiment of the present application. As shown in fig. 1, the text classification model can classify and label a large amount of training sample data according to a classification system or standard: after the training sample data (training corpus) is input, the semantic features of the text are extracted by learning from the existing training sample data, the data is classified, and its output category is predicted, i.e., the final output category is produced. It follows that the text classification model plays a decisive role in the effectiveness of text classification.
To show the association between the underlying text classification and the display effect of the terminal device intuitively, the embodiments of the application take a smart television as an illustrative example; note that in actual usage scenarios the method is not limited to smart televisions, which are used here only as one example to aid understanding. Fig. 2 is a schematic diagram of a dialog system page of a terminal device in an embodiment of the present application. In this embodiment, the display terminal of the smart television is denoted display device 200. As shown in fig. 2, the home page of the dialog system may include entry options such as "movie", "art", "music", and the like, and when the user issues a voice control instruction through the dialog system, the corresponding page effect can be displayed for the selected category.
The text classification model is obtained by deep learning neural network training and depends on training sample data, i.e., on the training corpus. The corpus is the language material to be trained and may be a collection of text resources of a certain quantity and scale, for example one or more pieces of training sample data (corpus) collected or input by a terminal. A high-quality corpus can ensure a more accurate prediction effect of the text classification model, while a low-quality corpus degrades its final prediction effect.
If the training corpus is unscreened and contains a large amount of low-quality data, the training process and training effect of the text classification model are necessarily affected, which in turn affects its accuracy. To improve the prediction effect of the text classification model, the quality of the corpus therefore needs to be improved: the corpus must be screened, poor-quality data found and screened out, and, depending on actual conditions, inspected a second time or re-labeled, with the aim of improving corpus quality and thereby the prediction effect of the text classification model.
Fig. 3 is a schematic diagram of the display effect of the terminal device when text classification is poor in an embodiment of the present application. For example, the voice instruction sent to the display device 200 through the dialog system is {movie performed by hua zi}. As shown in fig. 3, the output of the display device 200 includes "movie A", "movie B", "movie C", and the like, but because the underlying text classification is poor, results that do not belong to the movie category but are related to hua zi are also displayed, such as "hua zi song 1", "hua zi song 2", and "hua zi talks movies", which harms the user's impression. The underlying text classification is thus the basis of what the terminal device displays, and text classification is determined by the text classification model. Since the text classification model predicts the output category of training sample data by learning from the existing training sample data, the quality of text classification can only be improved by improving the classification and prediction effect of the model.
To improve the effect of the text classification model, the embodiments of the present application divide it into a model level and a data level. Fig. 4 is a schematic diagram of the general framework of the corpus screening method based on the results of a text classification model fused with multiple losses. As shown in fig. 4, at the model level, to address the poor robustness of current text classification models, and building on experiments that tried several loss optimizations, multiple loss functions are fused in the text classification model so that it adaptively learns the weight each loss function contributes to the classification effect; that is, different loss functions are deeply fused into the model, which improves its robustness.
In the data aspect, in the process of training the text classification model, based on the text classification model result fused by the loss functions, the corpus in the text classification model result is screened, so that the corpus with poor quality is found out, the quality of the corpus is improved from the data aspect, and the classification effect of the text classification model is improved. In addition, depending on the screening result of the training corpus, a quantitative score is made for the classification system of the text classification model, and the quantitative score is used as a basis for adjusting the classification definition in the text classification model, so that the prediction effect of the text classification model is improved.
In the embodiment of the present application, the training corpus refers to a language material to be trained, and may be a set of text resources of a certain quantity and scale. The training corpus may be large or small in scale, and may be one or several training sample data, or may be a general term for a set of all training sample data to be trained. Training corpora can be synchronously optimized by optimizing training sample data, and similarly, training sample data can also be synchronously optimized by optimizing the training corpora, and the training sample data and the training corpora complement each other without a strict limit (the training sample data and the training corpora can be the same and the application is not limited).
Specifically, referring to fig. 5, fig. 5 is a schematic flow chart of a corpus screening method based on a result of a multi-loss fusion text classification model, and as shown in fig. 5, the corpus screening method based on a result of a multi-loss fusion text classification model includes the following steps:
s1: dividing the text classification model into a model layer and a data layer according to functions;
from a text classification model functional perspective, the text classification model can be divided into two distinct parts. One part is a data layer, namely training sample data, and is used for providing a data base for the training of the text classification model; the other part is a model level, namely a text classification model itself, and is used for training sample data.
S2: on the model level, fusing a plurality of loss functions on the text classification model to obtain the output class and the class probability value of training sample data in the text classification model;
In the embodiment of the application, the fusion of multiple loss functions can be based on a fully-connected neural network. The fully-connected network is the most naive neural network, with the most parameters and the largest amount of computation; it can have multiple fully-connected layers, and a multi-layer fully-connected network can better improve the prediction effect and classification accuracy of the text classification model.
In one implementation, training sample data is unscreened and most of it is high-dimensional and sparse, so the feature expression capability for text classification is weak and the classification inaccurate. The fully-connected neural network solves the text representation problem in large-scale text classification through deep learning: it turns the text representation from a high-dimensional, highly sparse form that neural networks handle poorly into continuous dense data similar to images and speech, allowing better feature extraction from texts, labels, and related knowledge.
In one implementation, the function used by the text classification model is a cross entropy loss function, and the data uses one-hot (one-hot coded) labels, i.e., only one state is labeled for a sample's classification category (hereinafter simply "category"). During network learning with one-hot labels, the predicted probability of the target output category is encouraged to approach 1 and that of non-target output categories to approach 0; that is, the target category's value in the finally predicted logits vector (in deep learning, logits refers to the raw scores a classifier or text classification model assigns to each category) tends toward infinity. The model therefore learns in the direction of infinitely enlarging the logit gap between the correct prediction and the wrong labels, and an excessive logit gap makes the model lack adaptability, leading to over-confident predictions and overfitting.
In one embodiment, to solve the overfitting problem in the text classification model's calculation, the cross entropy loss function can be supplemented or replaced with a label smoothing loss function. Label smoothing is a regularization function that suppresses overfitting: it mainly uses a soft one-hot, i.e., a soft label, to fit the actual scenario better, reducing the weight of the true sample label's output category when the loss is calculated and thereby suppressing overfitting. Replacing the cross entropy loss with a label smoothing loss lets the text classification model learn label probabilities reasonably.
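As a concrete illustration, the following minimal sketch (an addition of this description, not code from the application; PyTorch is assumed) builds the soft one-hot target and computes the smoothed cross entropy:

```python
import torch
import torch.nn.functional as F

def label_smoothing_loss(logits, target, smoothing=0.1):
    # Soft one-hot: the true class keeps (1 - smoothing) of the probability
    # mass; the rest is spread evenly over the other classes.
    n_classes = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    soft_target = torch.full_like(log_probs, smoothing / (n_classes - 1))
    soft_target.scatter_(1, target.unsqueeze(1), 1.0 - smoothing)
    # Cross entropy against the soft target instead of the hard label.
    return -(soft_target * log_probs).sum(dim=-1).mean()
```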
In one embodiment, to address sample imbalance between different output categories, the cross entropy loss function can be supplemented or replaced with a Focal Loss function, which alleviates both output-category imbalance and the imbalance between easy and hard samples in the text classification model. One implementation adds a hyper-parameter to the cross entropy; this hyper-parameter can be derived in reverse from the sample proportions of the corpus, typically from inverse class frequency or by cross-validation. To handle samples of different difficulty, an adjusting factor smaller than 1 can be introduced to produce a decay effect: for high-scoring samples the factor decays quickly, and the closer the score is to 1 the stronger the decay, so the Focal Loss function focuses on the hard-to-classify samples.
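A minimal sketch of this decay behavior (again an illustrative assumption, not the application's code):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, gamma=2.0, alpha=None):
    # Cross entropy scaled by (1 - p_t)^gamma: the adjusting factor is below 1
    # and decays fastest for well-classified (high p_t) samples, so training
    # concentrates on the hard samples.
    log_probs = F.log_softmax(logits, dim=-1)
    log_pt = log_probs.gather(1, target.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()
    loss = -((1.0 - pt) ** gamma) * log_pt
    if alpha is not None:
        # Optional per-class weights, e.g. derived from inverse class frequency.
        loss = alpha[target] * loss
    return loss.mean()
```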
In one embodiment, to improve the robustness and generalization of the text classification model (robustness is the model's anti-interference capability; generalization is its degree of adaptation to the training data), the cross entropy loss function can be supplemented or replaced with an adversarial training loss function. Adversarial training is an important way of strengthening robustness during model learning, and poor robustness hurts the classification effect. For example, take a training sample like {how to make Chongqing noodles}; when one word is replaced with an equivalent, the changed sample expresses the same meaning as before, yet the text classification model fails to classify it correctly, which is a symptom of poor robustness.
The adversarial training loss function provides a regularized learning algorithm: the maximum acceptable perturbation is added to the word-vector layer of the text classification model so that the layer keeps the same classification ability within a neighborhood, and small fluctuations of the input vector improve the generalization ability (the adaptability of a machine learning algorithm to fresh samples) and the robustness of the model. Moreover, the adversarial perturbation in text classification does not act on the original discrete input but on the continuous word embedding ("embedding" in deep learning means mapping the sentence formed by the words into a representation vector). In adversarial training, with the input denoted x and the model parameters denoted θ, the following loss is added to the current loss function:

$$-\log p(y \mid x + r_{adv};\ \theta), \quad \text{where } r_{adv} = \arg\min_{r,\ \|r\| \le \epsilon} \log p(y \mid x + r;\ \hat{\theta})$$

where $r_{adv}$ is the perturbation of the current input and $\hat{\theta}$ is the current model's parameters, acting as a constant in the loss (using $\hat{\theta}$ rather than θ indicates that the gradient should not be propagated backwards through the perturbation-generation process). In each round of adversarial training, the perturbation $r_{adv}$ with the greatest influence on the current model is generated first, then $r_{adv}$ is added to the word embedding, and minimizing this loss function gives the text classification model better robustness to perturbation, i.e., improves its anti-interference ability.
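In practice this is often realized as a fast-gradient perturbation of the embedding layer. The sketch below is one such realization under assumptions (the class name FGM, the epsilon radius, and matching parameters by the substring "embedding" are all illustrative; the application does not prescribe an implementation):

```python
import torch

class FGM:
    """Apply/remove an adversarial perturbation r_adv on the embedding layer."""
    def __init__(self, model, emb_name="embedding", epsilon=1.0):
        self.model, self.emb_name, self.epsilon = model, emb_name, epsilon
        self.backup = {}

    def attack(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad and self.emb_name in name and param.grad is not None:
                self.backup[name] = param.data.clone()
                norm = torch.norm(param.grad)
                if norm != 0:
                    # Step along the gradient: the most damaging perturbation
                    # within a ball of radius epsilon around the embedding.
                    param.data.add_(self.epsilon * param.grad / norm)

    def restore(self):
        for name, param in self.model.named_parameters():
            if name in self.backup:
                param.data = self.backup[name]
        self.backup = {}
```

A typical training step would call attack() after the normal backward pass, run a second forward and backward pass on the perturbed embedding, and call restore() before the optimizer step.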
The above lists only a few possible loss functions; in the embodiments of the present application, the multiple loss functions may include, but are not limited to, the label smoothing loss function, the Focal Loss function, and the adversarial training loss function above. For example, the loss functions may be selected based on past experience or on experiments trying several loss optimizations.
In the embodiment of the present application, taking the label smoothing loss function, the Focal Loss function, and the adversarial training loss function as the example, fusing multiple loss functions means fusing these three according to a preset fusion method. Fig. 6 is a schematic flow chart of fusing multiple loss functions into the text classification model to obtain an output category and a category probability value; as shown in fig. 6, the step may include:
s21: inputting training sample data into a text classification model to obtain a process category and a process probability value;
the tag smooth Loss function, the Focal local Loss function, and the anti-training Loss function can all calculate the Loss value of the batch data in the current training sample data (the batch data may refer to a part of the training sample data input into the text classification model during Loss value calculation), fig. 7 is a schematic flow diagram and a schematic data flow diagram of the training sample data for performing multiple Loss function calculations, as shown in fig. 7, the training sample data may be input into the text classification model in an Embedding manner, and the text classification model may output predicted logits. The predicted logits may include process classes and process probability values, but this logis the result of the lossless fusion and does not represent the final output, so it is called process class and process probability value.
S22: calculating the process loss value of each loss function according to the process category and the process probability value. After the process category and process probability value output in step S21 are obtained, the text classification model calculates all the loss types to obtain each loss function's process loss value.
S23: inputting the process loss value into a fully-connected neural network to obtain a final loss value;
The text classification model computes the process loss values of all the loss functions and passes them to the next network layer, a fully-connected neural network. After the results of the three loss functions are fed into this network, it deeply fuses the loss values of the multiple loss functions, so that the text classification model can adaptively learn from the different losses.
During specific implementation, the text classification model can learn parameters throughout the training process, so that it adapts better to the training sample data and its generalization improves across more scenarios; for example, it can learn how to weigh the three loss functions against each other, giving a better prediction effect.
The calculation of each process loss value can be realized as follows: the text classification model can include a fusion module that selects one or more loss functions to compute the model's process loss value. The fusion module can select loss functions in the following ways:
in the first embodiment, the fusion module randomly selects the loss function of the current batch of data, and the process loss value of any one loss function may be the final loss value. Different loss functions have different advantages, the different loss functions can optimize the text classification model from multiple angles, and random selection of the loss functions can also optimize the text classification model from different angles in large-scale batch data.
In the second embodiment, the fusion module computes the process loss values of all the loss functions for the current batch, selects the largest as the final loss value, and updates the model parameters with the loss function corresponding to that final loss value. For example, let the process loss values of the label smoothing loss function, the Focal Loss function, and the adversarial training loss function be A1, A2, and A3, respectively, with A1 > A2 > A3; then A1 is the maximum process loss value and becomes the final loss value. Since the loss function corresponding to A1 is the label smoothing loss, the classification module performs gradient-descent parameter updates on the model according to the label smoothing loss's process loss value. Text classification models often fail to converge or suffer vanishing gradients late in training, and selecting the maximum process loss value for gradient descent can alleviate this.
In the third embodiment, the fusion module sums the process loss values of the three loss functions and takes the sum as the final loss value, and the classification module performs gradient-descent parameter updates on the model accordingly. This fuses the advantages of the three loss functions from multiple angles: the label smoothing loss addresses overfitting in the model's calculation, the Focal Loss addresses sample imbalance between output categories, and the adversarial training loss improves robustness and generalization, so the text classification model is optimized from multiple angles and its prediction effect improves.
In the fourth embodiment, after summing the process loss values of the three loss functions, the text classification model may add a fully-connected layer and an output layer according to the actual training requirements; the fully-connected layer's parameters learn from the batch data to adjust each loss function's proportion of the total loss, so the model better adapts to the training sample data and different usage scenarios.
It should be noted that other calculation methods for the loss function and the final loss value may exist in the actual training process; the above are only four possible ways and do not represent all possibilities. Depending on the scenario and task, the text classification model generally runs tens of rounds of training on the sample data. Typically, the third or fourth embodiment is chosen for the earlier rounds, while later rounds select a loss function and compute the final loss value at random to satisfy the model's need for generalization; finally, gradient descent is performed according to the maximum process loss value of the current batch until all training rounds are finished.
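Read together, the four embodiments amount to a small fusion module; the sketch below is schematic (the function and class names are assumptions, and LearnedFusion corresponds to the fourth embodiment's learnable weighting):

```python
import random
import torch

def fuse_losses(losses, mode="sum"):
    # losses: [label_smoothing_loss, focal_loss, adversarial_loss] as scalars.
    if mode == "random":   # embodiment 1: pick one loss per batch at random
        return random.choice(losses)
    if mode == "max":      # embodiment 2: descend on the largest process loss
        return torch.stack(losses).max()
    if mode == "sum":      # embodiment 3: plain sum of the process losses
        return torch.stack(losses).sum()
    raise ValueError(f"unknown fusion mode: {mode}")

class LearnedFusion(torch.nn.Module):
    # Embodiment 4: a fully-connected layer learns each loss's proportion
    # of the total loss from the batch data.
    def __init__(self, n_losses=3):
        super().__init__()
        self.fc = torch.nn.Linear(n_losses, 1, bias=False)
        torch.nn.init.constant_(self.fc.weight, 1.0 / n_losses)

    def forward(self, losses):
        return self.fc(torch.stack(losses)).squeeze()
```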
S24: and training a text classification model according to the final loss value to obtain an output class and a class probability value.
In step S23, no matter which loss function's process loss value is selected as the final loss value, the text classification model is trained according to it, for example by updating the model's parameters. Through the choice of loss functions and the calculation method of the final loss value, the weight of each loss is fully considered during training, and the training improves the model's robustness at the model level.
To further clarify the calculation flow of the various loss functions and the data flow in the text classification model, refer to fig. 7. Before any operation, the initial training sample data is "the song sung by Xiaohua in the singer program" (in actual processing many training sample data are handled simultaneously; one is used here only to explain the data flow). The initial training sample data is a sentence of text with a label: the text is "the song sung by Xiaohua in the singer program" and the label is "movie and television".
However, this label is a hard, 0-or-1 label: it has only one output category and ignores all others, without analyzing the text or labeling other potential output categories. After the initial training sample data passes through a text classification model based on multiple-loss fusion, the commonly used cross entropy loss function is optimized into multiple loss functions, the model is trained, and the hard-label annotation of training sample data is abandoned. The model outputs the result for the training sample data in the soft-label form "output category plus category probability value": after the initial training sample data passes through the model, every output category and its probability value are produced. For example, after "the song sung by Xiaohua in the singer program" passes through a model fused from multiple loss functions, the output is {"movie and television": "0.60", "music": "0.40"}, where "movie and television" and "music" are output categories and "0.60" and "0.40" are the corresponding category probability values.
It should be noted that, in the model level, no matter whether the text classification model is fused by multiple loss functions, the output result of the text classification model is in a soft label form of "output category + category probability value", but the difference is that the text classification model fused by multiple loss functions can make the performance of the text classification model better, can improve the robustness and the generalization of the text classification model, and further improve the prediction and classification effects of the text classification model.
S3: and in the data layer, calculating the variance of the training sample data according to the output class and the class probability value.
Typically, the last layer of the text classification model is the softmax layer, i.e., the logistic regression layer. Before the softmax layer the model's outputs are not probabilities; after it they are normalized into probability values that sum to 1, for example {"movie": "0.998", "music": "0.0019"}. At the data level, based on the "output category + category probability value" results produced by the multiple-loss fusion, the variance of the training sample data is calculated from the output categories and category probability values.
The ideal output result of the normalized text classification model is { "movie": 1.0"," music ": 0.0" }, which means that the text classification model determines the result of the training sample data one hundred percent. If the output result of the text classification model is { "movie": 0.50"," music ": 0.50" }, it means that the text classification model cannot determine the classification result of the training sample data, and it may be that the training sample data is labeled incorrectly, or the training sample data itself may be either the "movie" output category or the "music" output category. Based on this, the embodiments of the present application introduce variance to characterize the confidence level of a certain piece of training sample data.
In statistics, the variance is a metric describing a discrete degree of a group of data, and in the embodiment of the present application, the output class classification quality of training sample data is determined by variance calculation. The higher the variance is, the higher the discrimination of the text classification model to the training sample data is, the better the output classification quality of the training sample data is; the lower the variance is, the more difficult the text classification model is to distinguish the output classes of the training sample data, and the output class classification quality of the training sample data with the low variance is relatively poor.
In one embodiment, the variance of the training sample data is calculated by the following formula:
$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2$
where x are the sample values of the training sample data (here, the category probability values), μ is the average of all sample values, and n is the number of samples. For example, if the output of the text classification model is {movie 0.4, music 0.3, game 0.2, gourmet 0.1}, then in the variance formula x takes the values 0.4, 0.3, 0.2, and 0.1, μ is 0.25, and n is 4.
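A minimal plain-Python sketch of this calculation on the worked example:

```python
def prob_variance(probs):
    # probs: the normalized category probability values of one sample.
    mu = sum(probs) / len(probs)
    return sum((p - mu) ** 2 for p in probs) / len(probs)

# {movie 0.4, music 0.3, game 0.2, gourmet 0.1}: mu = 0.25, variance = 0.0125
print(prob_variance([0.4, 0.3, 0.2, 0.1]))
```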
S4: and screening the review data to be screened, wherein the variance of the review data to be screened is lower than a variance threshold value, and the variance threshold value is preset according to the output category of the training sample data.
Considering that different output categories have different variance distributions, in one embodiment the quality of the training sample data under each output category is evaluated against a preset variance threshold. For example, suppose the variance threshold set for some output category is a and the variance of some training sample data is b. If b < a, the classification quality of that sample is considered poor, and it becomes review data to be screened that needs a second inspection. If b > a, its classification quality is considered good and no screening or review is needed. By calculating the variance of training sample data and comparing it with the preset variance threshold, samples with potential labeling errors or likely classification difficulty can be picked out quickly for screening and secondary inspection, avoiding re-inspecting the entire training corpus.
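A minimal sketch of this screening rule, assuming per-category thresholds have been preset and a simple (sample id, top category, variance) layout:

```python
def screen_for_review(samples, thresholds):
    # samples: iterable of (sample_id, top output category, variance).
    # thresholds: per-category variance threshold, e.g. {"movie": 0.03}.
    return [sid for sid, category, variance in samples
            if variance < thresholds[category]]
```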
It should be noted that after the review data to be screened is determined, it needs review processing. The manner of review may be determined by actual conditions and is not specially limited in this application; for example, the corpus may be optimized or the training sample data re-labeled, and the text classification model retrained according to the processing result. At retraining time, the training basis is no longer the low-quality data to be screened but the data after review processing, i.e., the remaining high-quality training sample data, which gives the text classification model a better classification and prediction effect.
Based on the screening result, the method further includes generating a problem sample set for each output category, containing all of that category's review data to be screened. Each output category screens its data against its own variance threshold, and gathering all such data under a category yields that category's problem samples. From the problem samples, label pairs of the review data are generated, and all review data with the same label pair are grouped into the same label pair class, each class containing two output categories. The confusion degree of the two output categories in each label pair class is then calculated; if the confusion degree is below the confusion threshold, the output categories in that label pair cross.
Specifically, the training sample data of a text classification model often exceeds a million samples, and when each output category is defined during planning, unclear or confusable definitions inevitably occur, which affects the classification effect. For example, a training sample like {whether there is teaching related to java} could fall into the {movie} output category or the {adult education} output category, and the model has difficulty distinguishing them during classification. Therefore, besides evaluating the quality of the training sample data, the current classification system is also given a quantitative score, which is used to judge whether the classification system needs adjustment.
The classification category of a training sample is originally defined manually: for example, if a movie title and an actor's name appear in the sample, it is manually placed in the movie-and-television output category; if the name of a piece of music appears, it is classified into the music output category. Ideally the division boundaries between output categories are clear: movies are movies and music is music. In practice, however, especially when there are many output categories, the boundaries between different categories can cross and blur. So when output categories are disputed, a value is needed to judge whether two or more categories intersect, so that the category division can be adjusted to make the boundaries clearly demarcated.
In one implementation, the classification state of the whole classification system is described by a "confusion degree" index, which judges whether two or more output categories cross. One way to compute it: generate label pairs of the review data from the problem samples, group all review data with the same label pair into the same label pair class (each class containing two output categories), and calculate the confusion degree of those two categories from the grouped review data.
In specific implementation, the step of generating the label pair of the to-be-screened review data according to the problem sample comprises the following steps:
firstly, the problem sample comprises the data to be screened and rechecked, and the data to be screened and rechecked are sorted and sorted according to a preset sorting mode. For example, the review data to be filtered may be sorted according to the magnitude order of the output category probability values, and taking a certain review data to be filtered as an example, the sorting mode may be { movie 0.4, music 0.3, game 0.2, and gourmet 0.1}, where the output category probability value is large and the output category probability value is small and the output category probability value is large and the output category probability value is small.
Then, once the ordering is confirmed, the first two output categories of the review data form a label pair: in the example above the label pair is {movie 0.4, music 0.3}, and since 0.4 and 0.3 are very close in value, the two output categories movie and music cross.
Label pairs are generated this way for all problem samples, and all samples (i.e., review data to be screened) with the same label pair are grouped into the same label pair class. For example, given label pair 1 = {"movie": "0.60", "music": "0.30"}, label pair 2 = {"movie": "0.60", "gourmet": "0.30"}, label pair 3 = {"music": "0.60", "drama": "0.30"}, label pair 4 = {"movie": "0.50", "music": "0.40"}, and label pair 5 = {"movie": "0.40", "gourmet": "0.40"}: pairs 1 and 4 both contain movie and music, so they fall into one label pair class; pairs 2 and 5 both contain movie and gourmet, so they fall into another; pair 3 matches neither and forms its own class.
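The grouping just described can be sketched as follows (the data shape is an assumption):

```python
from collections import defaultdict

def group_by_label_pair(review_samples):
    # review_samples: iterable of (sample_id, {category: probability, ...}).
    groups = defaultdict(list)
    for sid, probs in review_samples:
        top2 = sorted(probs, key=probs.get, reverse=True)[:2]
        # Sort the pair so {movie, music} and {music, movie} share one class.
        groups[tuple(sorted(top2))].append(sid)
    return groups
```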
After the label pairs are grouped, the confusion degree of the two output categories in each label pair class is calculated from the grouped review data. To judge the degree of intersection between two output categories, for example movie and music, all review data grouped under the movie-music label pair class is selected. Calculating the confusion degree between every two output categories yields the confusion picture of the whole classification system.
In one implementation, the confusion degree of two output categories is calculated by a scoring formula:

[confusion score formula; the original equation image is not recoverable]

where Score is the confusion score, C1 and C2 are respectively the sample counts of the two output categories in the label pair class, K is the number of all problem samples across C1 and C2, and σ is the variance of the training sample data. This formula gives the confusion degree between the first two output categories, and the same method computes the confusion degree between any two output categories.
The step above yields the confusion score between any two output categories. To judge whether two output categories cross, a confusion threshold can be set; one way is to take the average of all confusion scores as the threshold, and when a label pair's confusion score is far below it, the definitions of the pair's two output categories are considered to cross. If a label pair has crossing output categories, the output category needs to be redefined, for example by defining a new output category according to actual conditions or by re-sorting the corpus of training sample data.
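Assuming pairwise confusion scores have already been computed, the mean-based threshold check is a short step (a sketch, not the application's code):

```python
def flag_crossed_pairs(pair_scores):
    # pair_scores: {("movie", "music"): confusion score, ...}
    threshold = sum(pair_scores.values()) / len(pair_scores)
    # Pairs scoring well below the mean threshold are flagged for redefinition.
    return [pair for pair, score in pair_scores.items() if score < threshold]
```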
In terms of the effect presented by the terminal device, improving the classification and prediction effect of the text classification model necessarily improves the text classification effect and thus the display effect. Fig. 8 is a schematic diagram of the display effect of the terminal device after text classification is improved in the embodiment of the present application. The voice instruction sent to the display device 200 through the dialog system is {movie performed by hua zi}; as shown in fig. 8, the output of the display device 200 includes "movie A", "movie B", "movie C" … "movie J". Compared with fig. 3, the search result after the model is improved is very accurate, with no interference from other output categories, so the user experience is improved.
According to the technical scheme above, at the model level, to address the poor robustness of current text classification models, multiple loss functions are fused so that the model adaptively learns the weight each loss function contributes to its classification effect, improving robustness. At the data level, during training, the classification quality of the training sample data is judged by variance calculation based on the loss-fused model results, so that poor-quality review data to be screened can be found and reviewed; the model is then retrained according to the processing result, improving its classification and prediction effect. In addition, calculating the confusion degree between categories produces a quantitative score for the model's classification system, used as a basis for adjusting category definitions and further improving the prediction effect. In summary, the embodiments of the application improve the prediction effect of the text classification model at both the model level and the data level through the fusion of multiple loss functions and the screening of the training corpus.
The embodiments provided in the present application are only a few examples of the general concept of the present application, and do not limit the scope of the present application. Any other embodiments extended according to the scheme of the present application without inventive efforts will be within the scope of protection of the present application for a person skilled in the art.

Claims (11)

1. The corpus screening method based on the results of the multiple loss fusion text classification models is characterized by comprising the following steps:
dividing the text classification model into a model layer and a data layer according to functions;
on the model level, fusing a plurality of loss functions of the text classification model to obtain the output class and the class probability value of training sample data in the text classification model;
calculating the variance of the training sample data according to the output category and the category probability value at the data level;
and screening the review data to be screened, wherein the variance of the review data to be screened is lower than a variance threshold value, and the variance threshold value is preset according to the output category of the training sample data.
2. The corpus screening method based on multiple loss fusion text classification model results according to claim 1, wherein the fusion of the multiple loss functions is based on a fully-connected neural network, and the fusion combines a label smoothing loss function, a Focal Loss function, and an adversarial training loss function according to a preset fusion mode.
3. The corpus screening method according to claim 2, wherein fusing the plurality of loss functions of the text classification model to obtain the output category and the category probability value comprises:
inputting the training sample data into the text classification model to obtain a process category and a process probability value;
calculating a process loss value for each loss function according to the process category and the process probability value;
inputting the process loss values into the fully-connected neural network to obtain a final loss value;
and training the text classification model according to the final loss value to obtain the output category and the category probability value.
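Read together, the steps of claim 3 suggest a training step like the sketch below. It reuses the LossFusionHead sketched after the summary above, computes a label-smoothing loss and a focal loss with standard formulations, and approximates the adversarial training loss with an FGM-style perturbation of the input embeddings. The hyperparameters (smoothing 0.1, gamma 2.0, epsilon 1.0) and the FGM choice are common defaults assumed for illustration; the patent does not specify them.

    import torch
    import torch.nn.functional as F

    def focal_loss(logits, targets, gamma=2.0):
        # Focal loss: down-weights well-classified samples by (1 - p_t)^gamma.
        log_pt = F.log_softmax(logits, dim=-1).gather(1, targets.unsqueeze(1)).squeeze(1)
        return (-(1.0 - log_pt.exp()) ** gamma * log_pt).mean()

    def fused_training_step(model, fusion_head, embeddings, targets, optimizer, eps=1.0):
        # model: maps embedding tensors to class logits; optimizer covers
        # both the model's and the fusion head's parameters.
        embeddings = embeddings.detach().requires_grad_(True)
        logits = model(embeddings)

        ls_loss = F.cross_entropy(logits, targets, label_smoothing=0.1)
        fl_loss = focal_loss(logits, targets)

        # FGM-style adversarial loss: perturb the embeddings along the gradient.
        grad = torch.autograd.grad(ls_loss, embeddings, retain_graph=True)[0]
        delta = eps * grad / (grad.norm() + 1e-8)
        adv_loss = F.cross_entropy(model(embeddings + delta), targets)

        # Fuse the three process loss values into the final loss and train on it.
        final_loss = fusion_head(torch.stack([ls_loss, fl_loss, adv_loss]))
        optimizer.zero_grad()
        final_loss.backward()
        optimizer.step()
        return final_loss.item()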
4. The corpus screening method according to claim 2, wherein said fully-connected neural network is a multi-layer fully-connected neural network.
5. The corpus screening method based on multiple loss fusion text classification model results according to claim 1, further comprising:
after the review data to be screened are determined, performing review processing on the review data to be screened, and retraining the text classification model according to the processing result.
6. The corpus screening method based on multiple loss fusion text classification model results according to claim 1, wherein the variance of the training sample data is calculated by the following formula:
$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \mu\right)^2$$
wherein x_i is the i-th sample value of the training sample data, μ is the mean of all sample values, and n is the number of samples.
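As a quick numeric illustration of the formula, assuming, per claim 1, that the sample values are the normalized category probability values of one sample:

    # Peaked (confident) probability vector
    x = [0.70, 0.10, 0.10, 0.10]
    mu = sum(x) / len(x)                                     # mu = 0.25
    print(sum((xi - mu) ** 2 for xi in x) / len(x))          # 0.0675

    # Flat (uncertain) vector -> zero variance, below any positive threshold
    flat = [0.25, 0.25, 0.25, 0.25]
    print(sum((xi - 0.25) ** 2 for xi in flat) / len(flat))  # 0.0

A confidently classified sample thus yields a higher variance than an ambiguous one, which is why samples below the variance threshold are routed to review.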
7. The corpus screening method based on multiple loss fusion text classification model results according to claim 1, wherein the category probability value is obtained after normalization processing.
8. The corpus screening method based on multiple loss fusion text classification model results according to claim 1, further comprising:
generating a question sample corresponding to each output category, wherein the question sample comprises all the review data to be screened in that output category;
generating label pairs for the review data to be screened according to the question samples, and grouping all review data with the same label pair into the same label-pair class, wherein each label-pair class comprises two output categories;
calculating the confusion degree of the two output categories in the label-pair class;
and if the confusion degree is below a confusion-degree threshold, determining that the output categories in the label pair intersect.
9. The corpus screening method according to claim 8, wherein generating the label pairs of the review data to be screened according to the question samples comprises:
sorting the review data to be screened according to a preset sorting mode;
and forming the review data of the first two output categories into a label pair.
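A minimal sketch of this pairing step follows. The patent does not specify the preset sorting mode; sorting each sample's output categories by their probability values is assumed here, and all names are illustrative.

    from collections import defaultdict

    def build_label_pair_classes(review_items):
        # review_items: list of dicts like {"text": ..., "probs": {category: prob}}
        # For each review sample, rank its output categories by probability
        # (the assumed preset sorting mode) and pair the top two.
        pair_classes = defaultdict(list)
        for item in review_items:
            ranked = sorted(item["probs"], key=item["probs"].get, reverse=True)
            pair = tuple(sorted(ranked[:2]))  # unordered pair of two output categories
            pair_classes[pair].append(item)
        return pair_classes

All review data sharing the same label pair end up in the same label-pair class, whose confusion degree is then computed as in claim 10.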
10. The corpus screening method based on multiple loss fusion text classification model results according to claim 8, wherein the confusion degree of the two output categories is calculated by the following formula:
[Formula image FDA0003352036370000021: confusion degree of the two output categories, a function of C1, C2, K and σ]
wherein C1 and C2 are respectively the numbers of samples of the two output categories in the label pair, K is the sum of C1 and C2, and σ is the variance of the training sample data.
11. The corpus screening method according to claim 8, further comprising: if the output categories in the label pair intersect, redefining the output categories.
CN202111341075.2A 2021-11-12 2021-11-12 Corpus screening method based on multiple loss fusion text classification model results Pending CN114116969A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111341075.2A CN114116969A (en) 2021-11-12 2021-11-12 Corpus screening method based on multiple loss fusion text classification model results

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111341075.2A CN114116969A (en) 2021-11-12 2021-11-12 Corpus screening method based on multiple loss fusion text classification model results

Publications (1)

Publication Number Publication Date
CN114116969A true CN114116969A (en) 2022-03-01

Family

ID=80379151

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111341075.2A Pending CN114116969A (en) 2021-11-12 2021-11-12 Corpus screening method based on multiple loss fusion text classification model results

Country Status (1)

Country Link
CN (1) CN114116969A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118014793A (en) * 2024-02-02 2024-05-10 广州铭德教育投资有限公司 Individualized knowledge tracking method based on post-class problem difficulty and student capacity


Similar Documents

Publication Publication Date Title
Bakhtin et al. Real or fake? learning to discriminate machine from human generated text
CN111708882B (en) Transformer-based Chinese text information missing completion method
CN112070138B (en) Construction method of multi-label mixed classification model, news classification method and system
CN109189862A (en) A kind of construction of knowledge base method towards scientific and technological information analysis
CN110866113A (en) Text classification method based on sparse self-attention mechanism fine-tuning Bert model
CN114462489A (en) Training method of character recognition model, character recognition method and equipment, electronic equipment and medium
CN113360659B (en) Cross-domain emotion classification method and system based on semi-supervised learning
CN113886562A (en) AI resume screening method, system, equipment and storage medium
CN113869055A (en) Power grid project characteristic attribute identification method based on deep learning
CN115797701A (en) Target classification method and device, electronic equipment and storage medium
CN114116969A (en) Corpus screening method based on multiple loss fusion text classification model results
CN113297387B (en) News detection method for image-text mismatching based on NKD-GNN
CN109034182A (en) A kind of zero sample image identification new method based on attribute constraint
CN116204642B (en) Intelligent character implicit attribute recognition analysis method, system and application in digital reading
CN112434512A (en) New word determining method and device in combination with context
CN114595695B (en) Self-training model construction method for small sample intention recognition system
CN116431813A (en) Intelligent customer service problem classification method and device, electronic equipment and storage medium
WO2023177666A1 (en) Deep learning systems and methods to disambiguate false positives in natural language processing analytics
CN115422349A (en) Hierarchical text classification method based on pre-training generation model
CN116956915A (en) Entity recognition model training method, device, equipment, storage medium and product
CN113435190B (en) Chapter relation extraction method integrating multilevel information extraction and noise reduction
Lai et al. Domain-aware recurrent neural network for cross-domain sentiment classification
US20230289531A1 (en) Deep Learning Systems and Methods to Disambiguate False Positives in Natural Language Processing Analytics
CN118070775B (en) Performance evaluation method and device of abstract generation model and computer equipment
CN114154519B (en) Neural machine translation model training method based on weighted label smoothing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination