CN113590814A - Text classification method fusing text interpretation features - Google Patents
- Publication number: CN113590814A (application CN202110521823.9A)
- Authority
- CN
- China
- Prior art keywords
- sentence
- interpretation
- features
- feature
- text classification
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention relates to a text classification method fusing text interpretation features. The method comprises the following steps: (1) training a neural-network-based text classification model to predict the category of each sentence; (2) acquiring interpretation features of the sentence prediction results of step (1) using a linear fitting method based on local random perturbation sampling; (3) selecting key interpretation features beneficial to the classification effect according to the frequency and weight of the acquired interpretation features; (4) fusing the key interpretation features acquired in step (3) with the original data and retraining the text classification model. The method uses linear fitting based on local random perturbation sampling to explain which key features contribute most to the prediction of the text classification model, fuses these features with the original labeled samples, and highlights the key features of each sample, thereby improving the classification effect.
Description
Technical Field
The invention relates to the technical field of natural language processing, in particular to a text classification method fusing text interpretation features. A trained neural-network-based text classification model is interpreted with a linear fitting method based on random perturbation sampling to obtain interpretation features for the prediction of each sentence, and the model is then retrained with the key interpretation features fused in. The method can be applied to specific fields such as spam identification, text topic classification, and sentiment analysis.
Background
Text classification is an important research direction in the field of natural language processing; its task is to map a text to a predefined category. Text classification methods include rule-based methods and machine-learning-based methods.
Rule-based text classification requires different rules for different texts, which is time-consuming and labor-intensive, and neither coverage nor accuracy can be guaranteed. With the rise of machine learning, machine-learning methods have been applied to the text classification task with better results. However, many machine-learning models are black boxes: we can obtain the prediction given by the model but not the reason the model gives that result, and can judge the model's reliability only from metrics such as accuracy. In fields such as medicine, knowing not only the prediction and its accuracy but also the basis of the prediction provides a more reliable decision basis for model users; moreover, intervening in the training process according to that basis can improve the classification effect.
In summary, because of the uninterpretability of deep-learning models, it is difficult for model users to determine the basis of a model's prediction and to make correct decisions based on it.
Disclosure of Invention
The invention mainly aims to overcome the defects of the prior art and provide a text classification method fusing text interpretation features. A linear fitting method based on random perturbation sampling interprets the predictions of a neural-network-based text classification model; the interpretation features of each sentence are obtained from the classification features used in the linear fit; key interpretation features are selected according to the frequency and weight of these features, fused with the original data, and the text classification model is retrained, making the text classification results more accurate.
In order to achieve the purpose, the invention adopts the following technical scheme:
a text classification method fusing text interpretation features comprises the following operation steps:
step 1, training a text classification model based on a neural network to predict the category of a sentence;
step 2, obtaining the interpretation characteristics of the sentence prediction result in the step 1 by using a linear fitting method based on local random disturbance sampling;
step 3, selecting key interpretation characteristics which are beneficial to the classification effect according to the frequency and the weight of the interpretation characteristics acquired in the step 2;
step 4, fusing the key interpretation features acquired in step 3 with the original data and retraining the text classification model.
Preferably, the training of the neural-network-based text classification model in step 1, used for predicting the category to which a sentence belongs, comprises the following specific steps:
(1-1) Input layer: the input to the text classification model is a set of sentences with category labels, S = (S_1, S_2, S_3, ..., S_N), where S_i denotes the i-th sentence in the data set, N denotes the number of sentences, w_j^i denotes the j-th word in the i-th sentence, and k denotes the number of words in the i-th sentence;
(1-2) Sentence vectorization: word vectors are trained with GloVe, converting each word in the vocabulary V = (w_1, w_2, w_3, ..., w_M) into a 64-dimensional vector and generating a vectorized vocabulary V' = (v_1, v_2, v_3, ..., v_M) of dimension M × 64, where w_i denotes a word in the vocabulary, v_i denotes the vector of word w_i, and M denotes the number of distinct words appearing in the data set; looking words up in V' converts a sentence into its vector representation, so sentence S_i is represented as X_i = (v_1^i, v_2^i, ..., v_k^i);
(1-3) Linear layer: the vectorized sentence X_i = (v_1^i, v_2^i, ..., v_k^i) is input to a linear layer that predicts the category label of the sentence; the linear layer formula is:
y_l = l(X_i) = W^T X_i + b
where y_l, the prediction result, is an array of num_class numbers, num_class is the predefined number of classes, each number represents the likelihood of the class at that position, l denotes the linear transformation, and W^T and b are the parameters of the linear layer;
(1-4) Softmax layer: the softmax function maps each value of the prediction y_l into [0, 1]:
softmax(y_l^j) = exp(y_l^j) / Σ_{t=1}^{num_class} exp(y_l^t)
where y_l^j denotes the j-th value of the prediction y_l; after each value of y_l is transformed by the softmax function, the num_class values sum to 1;
(1-5) Loss function: the final output of the model is the class label y_pre corresponding to the maximum value in the prediction result; the loss function is loss(y_i, y_pre) = -y_pre · log(softmax(y_i)), where loss(y_i, y_pre) denotes the loss and y_i is the label of input sentence S_i;
(1-6) Parameter optimization: the parameters of the text classification model are optimized to minimize the loss function, yielding the trained text classification model.
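As a rough sketch, steps (1-1) to (1-6) can be written as a plain NumPy forward pass and loss. The mean pooling of word vectors, the toy sizes, and all identifiers here are illustrative assumptions, not the patent's exact implementation, and the gradient-descent optimization of (1-6) is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: M-word vocabulary, 64-dim vectors, num_class categories.
M, DIM, NUM_CLASS = 1000, 64, 4

V_prime = rng.normal(size=(M, DIM))                # vectorized vocabulary V' (stands in for GloVe)
W = rng.uniform(-0.3, 0.3, size=(DIM, NUM_CLASS))  # linear-layer weight W
b = np.zeros(NUM_CLASS)                            # linear-layer bias b

def softmax(y):
    """Step (1-4): map each value of y_l into [0, 1]; the values sum to 1."""
    e = np.exp(y - y.max())
    return e / e.sum()

def predict(word_ids):
    """Steps (1-2)-(1-4): look up V', pool the sentence (mean pooling is an
    assumption), apply the linear layer l, then softmax."""
    X_i = V_prime[word_ids].mean(axis=0)           # vectorized sentence
    y_l = W.T @ X_i + b                            # linear layer
    return softmax(y_l)

def loss(word_ids, label):
    """Step (1-5): negative log-likelihood of the gold label."""
    return -np.log(predict(word_ids)[label])

probs = predict([3, 17, 42])                       # a 3-word toy sentence
```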
Preferably, step 2 obtains the interpretation features of the sentence predictions of step 1 using a linear fitting method based on local random perturbation sampling; the specific steps are:
(2-1) Select the sentence S_i to be interpreted and sample near S_i by random perturbation: S_i = (w_1^i, w_2^i, ..., w_k^i) is a sentence of k words in the original data set; S_i is randomly perturbed to obtain sampled copies, generating a data set of perturbed samples, each represented as a 0/1 vector. The random perturbation process is as follows:
Randomly delete words from sentence S_i, with the number of deleted words greater than 0 and less than k, obtaining a new sentence S_i^t = (w_1^t, w_2^t, ..., w_c^t), i.e. the t-th randomly perturbed sample of S_i, where w_j^t is the j-th word of the t-th perturbed sample and c is the number of words remaining after perturbation; initialize a 1 × k vector, set the positions of the deleted words to 0 and the other positions to 1, obtaining the vectorized representation z_t of S_i^t, with every element z_t^j ∈ {0, 1}; perform 4999 random perturbations to obtain a new data set X = (S_i^0, S_i^1, ..., S_i^4999) of 5000 sentences, where S_i^0 is the original sentence S_i, whose vector representation contains k ones; the vector matrix of the new data set X has dimension 5000 × k;
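A minimal sketch of the perturbation sampling in step (2-1), assuming word-level deletion as described; the function name and variable names are illustrative:

```python
import random

def perturb_samples(sentence, n_samples=5000, seed=0):
    """Step (2-1) sketch: sample 0 is the original sentence (mask of k ones);
    each further sample deletes between 1 and k-1 randomly chosen words and
    records a 0/1 mask z_t over the k word positions."""
    rng = random.Random(seed)
    k = len(sentence)
    texts, masks = [list(sentence)], [[1] * k]
    for _ in range(n_samples - 1):
        n_del = rng.randint(1, k - 1)            # 0 < deleted words < k
        drop = set(rng.sample(range(k), n_del))
        masks.append([0 if j in drop else 1 for j in range(k)])
        texts.append([w for j, w in enumerate(sentence) if j not in drop])
    return texts, masks

texts, masks = perturb_samples("the cat sat on the mat".split(), n_samples=10)
```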
(2-2) Label the newly generated data: each sample in data set X is input into the trained text classification model for prediction to obtain the corresponding result; denoting the trained model by f, after steps (1-1) to (1-4) the prediction f(S_i^t) of each sample is an array of num_class numbers, each representing the probability of the corresponding class;
(2-3) Compute the distance between each perturbed sample and the original sample in the new data set Z as the perturbed sample's weight: the closer a perturbed sample is to the original, the better it explains the prediction, so it is given a higher weight; the weight of each newly generated sample is defined with an exponential kernel:
π_z(t) = exp(-D(S_i, S_i^t)² / σ²)
where π_z is an exponential kernel defined on the cosine distance D, representing the distance weight between samples (the smaller the distance, the larger π_z), and σ is the kernel width;
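The kernel weighting of step (2-3) can be sketched as follows; the function name and the default kernel width are assumptions for illustration:

```python
import numpy as np

def kernel_weight(z_orig, z_pert, sigma=0.25):
    """Step (2-3) sketch: pi_z = exp(-D^2 / sigma^2), with D the cosine
    distance between the original mask and a perturbed mask."""
    z_orig = np.asarray(z_orig, float)
    z_pert = np.asarray(z_pert, float)
    cos = z_orig @ z_pert / (np.linalg.norm(z_orig) * np.linalg.norm(z_pert))
    d = 1.0 - cos                                 # cosine distance
    return np.exp(-(d ** 2) / sigma ** 2)
```

An unperturbed sample gets weight 1; samples with more words deleted drift farther from the original and get exponentially smaller weight.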
(2-4) Fit the new data set Z with a linear model g:
g(z_t) = w_g · z_t
where z_t is a vector in data set Z and w_g is the weight coefficient of the linear model;
(2-5) Determine the coefficients of the linear model: train the linear model to determine the weight coefficients with the loss function
L(f, g, π_z) = Σ_t π_z(t) (f(S_i^t) - g(z_t))²
Minimizing L(f, g, π_z) yields the optimal linear-model weights w_g, whose dimension is num_class × k, where S_i^t is the t-th perturbed sample and z_t is its vector form;
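Steps (2-4) and (2-5) amount to a weighted least-squares fit per class; a closed-form sketch (the small L2 term for numerical stability and the bias column are added assumptions, not stated in the patent):

```python
import numpy as np

def fit_local_linear(Z, f_vals, pi, l2=1e-6):
    """Minimize sum_t pi_t * (f(z_t) - g(z_t))^2 for one class, where
    g(z) = w_g . z + intercept. Z: (n, k) 0/1 masks; f_vals: (n,) model
    probabilities for that class; pi: (n,) kernel weights."""
    Z = np.asarray(Z, float)
    pi = np.asarray(pi, float)
    f_vals = np.asarray(f_vals, float)
    Zb = np.hstack([Z, np.ones((Z.shape[0], 1))])   # add a bias column
    A = Zb.T @ (Zb * pi[:, None]) + l2 * np.eye(Zb.shape[1])
    w = np.linalg.solve(A, Zb.T @ (pi * f_vals))
    return w[:-1], w[-1]                            # per-word weights w_g, intercept
```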
(2-6) Acquire interpretation features and denoise: after the linear model is trained, Fea_i = w_g × S_i gives the interpretation features and their weights for the different classes; sort the features of the m-th class by the absolute value of their weights from large to small, remove auxiliary words, conjunctions, punctuation, and similar noise, and select the top T as the interpretation features of sentence S_i predicted as the m-th class:
Fea_i^m = ((fea_1^i, θ_1^i), (fea_2^i, θ_2^i), ..., (fea_T^i, θ_T^i))
where Fea_i^m denotes the set of features, with their weights, obtained by the model interpretation method for predicting the i-th sentence as the m-th class, m is a class label with 1 ≤ m ≤ num_class, fea_j^i is the j-th feature of sentence S_i, and θ_j^i is the weight of feature fea_j^i. A positive weight means the model considers the feature to support classifying the i-th sample into the m-th class, and we call such a feature a positive feature; a negative weight means the model considers the feature not to support that classification, called a negative feature.
Preferably, step 3 selects the key feature set according to the frequency and weight of the acquired interpretation features; the specific steps are:
(3-1) Acquire all interpretation features of sentence S_i: Fea_i = (Fea_i^1, Fea_i^2, ..., Fea_i^num_class) denotes the set of features, obtained in step (2-6), for predicting sentence S_i as each class;
(3-2) Compute the frequency and weight of each feature: since the same feature may appear in different classes, it may occur multiple times in Fea_i. Sum the weights of all identical positive features in Fea_i, sort them by weight from large to small, and take the top c1 features; in the same way, sum the weights of all identical negative features in Fea_i, sort them by the absolute value of the weight from large to small, and take the top c2 features; at the same time, compute the frequency of each negative feature in Fea_i, sort from high to low, and take the top c3 features;
(3-3) Obtain the key interpretation features of sentence S_i: the key interpretation feature set of S_i is the intersection of the three sets obtained in step (3-2) and contains p key interpretation features, Fea_i^key = (fea_1^i, fea_2^i, ..., fea_p^i);
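A sketch of steps (3-1) to (3-3), pooling the (feature, weight) pairs of all classes into one list and taking the three-way intersection described above; the function name and cutoff defaults are illustrative:

```python
from collections import Counter

def key_features(pairs, c1=10, c2=10, c3=10):
    """pairs: (feature, weight) tuples pooled over all classes for one sentence.
    Top c1 positive features by summed weight, top c2 negative features by
    summed |weight|, top c3 negative features by frequency; the key set is
    the intersection of the three (a feature can be positive for one class
    and negative for another, so the intersection need not be empty)."""
    pos, neg, neg_freq = Counter(), Counter(), Counter()
    for fea, w in pairs:
        if w > 0:
            pos[fea] += w
        elif w < 0:
            neg[fea] += abs(w)
            neg_freq[fea] += 1
    top = lambda cnt, c: {f for f, _ in cnt.most_common(c)}
    return top(pos, c1) & top(neg, c2) & top(neg_freq, c3)
```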
preferably, in the step 4, the key interpretation features and the raw data obtained in the step 3 are fused to retrain the text classification model, and the specific steps include:
(4-1) Acquire data fused with the key interpretation features: concatenate the acquired key interpretation features of sentence S_i with S_i itself as input to the text classification model; the fused sentence is expressed as
S_i' = (w_1^i, w_2^i, ..., w_k^i, fea_1^i, fea_2^i, ..., fea_p^i)
where w_1^i, ..., w_k^i are the k words of sentence S_i and fea_1^i, ..., fea_p^i are its p acquired key interpretation features;
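Step (4-1) is then a simple concatenation; a sketch with illustrative names:

```python
def fuse(sentence_words, key_feats):
    """Step (4-1): S_i' = the k original words followed by the p key
    interpretation features, highlighting them for retraining."""
    return list(sentence_words) + list(key_feats)

def fuse_dataset(sentences, feats_per_sentence):
    """Step (4-2) input: fuse every training/test sample with its features."""
    return [fuse(s, f) for s, f in zip(sentences, feats_per_sentence)]
```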
(4-2) Retrain the text classification model: apply steps (2-1) to (4-1) to all training and test samples to fuse key interpretation features, obtaining a new data set S' = (S_1', S_2', S_3', ..., S_N'), then retrain the text classification model on S' according to the training process of step 1; the resulting text classification results are more accurate.
Compared with the prior art, the invention has the following obvious and prominent substantive characteristics and remarkable technical progress:
1. the method uses a linear fitting method based on local random disturbance sampling to explain which key features have the greatest contribution to the prediction result of the text classification model, fuses the features and the original labeled sample, and highlights the key features of the original sample, thereby improving the classification effect;
2. the method can efficiently retrain the text classification model, so that the text classification result is more accurate.
Drawings
FIG. 1 is a flow chart of a text classification method for fusing text interpretation features according to the present invention.
FIG. 2 is a diagram of a neural network-based text classification model according to the present invention.
FIG. 3 is a flow chart of the present invention for obtaining interpretation characteristics using a model interpretation method.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings and tables.
The invention aims to provide a text classification method fusing text interpretation features, which is used for acquiring key features of a prediction result given by a text classification model through a model interpretation method, and using the key features and an original text together as an input retraining model of the text classification model, thereby improving the effect of the text classification model.
The invention provides a text classification method fusing text interpretation characteristics, which is characterized in that a linear fitting method based on local random disturbance sampling is used for interpreting a prediction result of a text classification model based on a neural network to obtain interpretation characteristics, key interpretation characteristics are obtained according to the frequency and weight of the characteristics and fused with original data, and the text classification model is retrained, so that the text classification result is more accurate. The basic features of the present invention mainly include the following aspects:
firstly, interpreting a prediction result of a trained text classification model by using a linear fitting method based on local random disturbance sampling to obtain an interpretation characteristic;
selecting key interpretation features which are beneficial to text classification according to the weight and the frequency of the interpretation features;
and thirdly, fusing the original data with key interpretation characteristics to retrain the text classification model.
The first embodiment is as follows:
referring to fig. 1, a text classification method fusing text interpretation features includes the following operation steps:
step 1, training a text classification model based on a neural network to predict the category of a sentence;
step 2, obtaining the interpretation characteristics of the sentence prediction result in the step 1 by using a linear fitting method based on local random disturbance sampling;
step 3, selecting key interpretation characteristics which are beneficial to the classification effect according to the frequency and the weight of the interpretation characteristics acquired in the step 2;
step 4, fusing the key interpretation features acquired in step 3 with the original data and retraining the text classification model.
The method can efficiently retrain the text classification model, so that the text classification result is more accurate.
Example two:
This embodiment follows the flowchart of the text classification method fusing text interpretation features shown in fig. 1.
a text classification method fusing text interpretation features comprises the following steps of:
step S1: training a text classification model based on a neural network for predicting the category of the sentence, wherein the text classification model is illustrated in the attached figure 2, and the model parameter setting is illustrated in the table 1; the specific process is as follows:
(1-1) Input layer: acquire the AG-News data set, a standard English text classification data set containing 127600 samples in four categories; considering the time required to train the text classification model and to acquire the interpretation features of each sample, data of each category are uniformly and randomly sampled from AG-News, and 16000 samples are selected for the experiment, of which the training set contains 12800 and the validation and test sets contain 1600 each; the input to the text classification model is a set of sentences with category labels, S = (S_1, S_2, S_3, ..., S_N), where S_i denotes the i-th sentence in the data set, N denotes the number of sentences (here 16000), w_j^i denotes the j-th word of the i-th sentence, and k denotes the number of words in the i-th sentence (k varies with sentence length);
Table 1. Text classification model parameter settings
(1-2) Sentence vectorization: word vectors are trained with GloVe, converting each word in the vocabulary V = (w_1, w_2, w_3, ..., w_M) into a 64-dimensional vector and generating a vectorized vocabulary V' = (v_1, v_2, v_3, ..., v_M), where w_i denotes a word in the vocabulary, v_i denotes the vector of word w_i, and M, the number of distinct words appearing in the data set, is 161067, so V' has dimension 161067 × 64; looking words up in V' converts a sentence into its vector representation, so sentence S_i is represented as X_i = (v_1^i, v_2^i, ..., v_k^i);
(1-3) Linear layer: the vectorized sentence X_i = (v_1^i, v_2^i, ..., v_k^i) is input to a linear layer that predicts the category label of the sentence; the linear layer formula is:
y_l = l(X_i) = W^T X_i + b
where y_l, the prediction result, is an array of 4 values, each representing the likelihood of the class at that position, l denotes the linear transformation, and W^T and b are the parameters of the linear layer, randomly initialized in the range (-0.3, 0.3).
(1-4) Softmax layer: the softmax function maps each value of the prediction y_l into [0, 1]:
softmax(y_l^j) = exp(y_l^j) / Σ_{t=1}^{4} exp(y_l^t)
where y_l^j denotes the j-th value of the prediction y_l; after each value of y_l is transformed by the softmax function, the 4 values sum to 1.
(1-5) Loss function: the final output of the model is the class label y_pre corresponding to the maximum value in the prediction result; the loss function is loss(y_i, y_pre) = -y_pre · log(softmax(y_i)), where loss(y_i, y_pre) denotes the loss and y_i is the label of the input sentence.
(1-6) Parameter optimization: optimize the parameters of the text classification model to minimize the loss function. As shown in table 1, the batch size is set to 16, i.e. 16 sentences are input into the text classification model at a time. The learning rate during training is 2.0, with a decay multiplier of 0.8 applied at an interval of 1 epoch, i.e. after each epoch the learning rate becomes 0.8 times that of the previous epoch; the model completes training after 35 iterations.
Step S2: acquire the interpretation features of the sentence predictions of step S1 using a linear fitting method based on local random perturbation sampling. The specific process is shown in fig. 3:
(2-1) Select the sentence S_i to be interpreted and sample near S_i by random perturbation: S_i = (w_1^i, w_2^i, ..., w_k^i) is a sentence of k words in the original data set; S_i is randomly perturbed to obtain sampled copies, generating a data set of perturbed samples, each represented as a 0/1 vector. The random perturbation process is as follows:
Randomly delete words from sentence S_i, with the number of deleted words greater than 0 and less than k, obtaining a new sentence S_i^t = (w_1^t, w_2^t, ..., w_c^t), i.e. the t-th randomly perturbed sample of S_i, where w_j^t is the j-th word of the t-th perturbed sample and c is the number of words remaining after perturbation. Initialize a 1 × k vector, set the positions of the deleted words to 0 and the other positions to 1, obtaining the vectorized representation z_t of S_i^t, with every element z_t^j ∈ {0, 1}. Perform 4999 random perturbations to obtain a new data set X = (S_i^0, S_i^1, ..., S_i^4999) of 5000 sentences, where S_i^0 is the original sentence S_i, whose vector representation contains k ones. The vector matrix of the new data set X has dimension 5000 × k.
(2-2) Label the newly generated data: each sample in data set X is input into the trained text classification model for prediction to obtain the corresponding result. Denoting the trained model by f, after steps (1-1) to (1-4) the prediction f(S_i^t) of each sample is an array of 4 numbers, 4 being the number of data classes, each representing the probability of the corresponding class.
(2-3) Compute the distance between each perturbed sample and the original sample in the new data set Z as the perturbed sample's weight: the closer a perturbed sample is to the original, the better it explains the prediction, so it is given a higher weight; the weight of each newly generated sample is defined with an exponential kernel:
π_z(t) = exp(-D(S_i, S_i^t)² / σ²)
where π_z is an exponential kernel defined on the cosine distance D, representing the distance weight between samples (the smaller the distance, the larger π_z), and σ is the kernel width.
(2-4) Fit the new data set Z with a linear model g:
g(z_t) = w_g · z_t
where z_t is a vector in data set Z and w_g is the weight coefficient of the linear model.
(2-5) Determine the coefficients of the linear model: the loss function is set as
L(f, g, π_z) = Σ_t π_z(t) (f(S_i^t) - g(z_t))²
Minimizing L(f, g, π_z) yields the optimal linear-model weights w_g, whose dimension is 4 × k, where S_i^t is the t-th perturbed sample and z_t is its vector form.
(2-6) Acquire interpretation features and denoise: after the linear model is trained, Fea_i = w_g × S_i gives the interpretation features and weights for the different classes; sort the features of the m-th class by the absolute value of their weights from large to small, remove auxiliary words, conjunctions, punctuation, and similar noise, and select the top T as the interpretation features of sentence S_i predicted as the m-th class:
Fea_i^m = ((fea_1^i, θ_1^i), (fea_2^i, θ_2^i), ..., (fea_T^i, θ_T^i))
where Fea_i^m denotes the set of features, with their weights, output by the model interpretation method for predicting the i-th sentence as the m-th class, m is a class label with 1 ≤ m ≤ 4, fea_j^i is the j-th feature of sentence S_i, and θ_j^i is the weight of feature fea_j^i. A positive weight means the model considers the feature to support classifying the i-th sample into the m-th class, and we call such a feature a positive feature; a negative weight means the model considers the feature not to support that classification, called a negative feature.
Step S3: selecting key interpretation characteristics beneficial to classification effect according to the frequency and weight of the acquired interpretation characteristics, and specifically performing the following steps:
(3-1) Acquire all interpretation features of sentence S_i: Fea_i = (Fea_i^1, Fea_i^2, Fea_i^3, Fea_i^4) denotes the set of features, obtained in step (2-6), for predicting sentence S_i as each class;
(3-2) Compute the frequency and weight of each feature: since the same feature may appear in different classes, it may occur multiple times in Fea_i. Sum the weights of all identical positive features in Fea_i, sort them by weight from large to small, and take the top c1 features; in the same way, sum the weights of all identical negative features in Fea_i, sort them by the absolute value of the weight from large to small, and take the top c2 features; at the same time, compute the frequency of each negative feature in Fea_i, sort from high to low, and take the top c3 features;
(3-3) Obtain the key interpretation features of sentence S_i: the key interpretation feature set of S_i is the intersection of the three sets obtained in step (3-2) and contains p key interpretation features, Fea_i^key = (fea_1^i, fea_2^i, ..., fea_p^i);
step S4: fusing the key interpretation features and the raw data acquired in the step S3, and retraining the text classification model, which specifically comprises:
(4-1) Acquire data fused with the key interpretation features: concatenate the acquired key interpretation features of sentence S_i with S_i itself as input to the text classification model; the fused sentence is expressed as
S_i' = (w_1^i, w_2^i, ..., w_k^i, fea_1^i, fea_2^i, ..., fea_p^i)
where w_1^i, ..., w_k^i are the k words of sentence S_i and fea_1^i, ..., fea_p^i are its p acquired key interpretation features.
(4-2) retrain the text classification model: apply steps (2-1) to (4-1) to all training and test samples to fuse the key interpretation features, obtaining a new data set S′ = (S_1′, S_2′, S_3′, ..., S_N′); then retrain the text classification model on S′ following the training process of step S1. The resulting text classification is more accurate.
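The fusion of step (4-1) amounts to appending the key features to the original word sequence. A minimal sketch (the example sentence and feature list are invented for illustration):

```python
def fuse(sentence, key_feats):
    """Sketch of step (4-1): build S_i' = (w_1,...,w_k, f_1,...,f_p) by
    appending the sentence's p key interpretation features to its k words,
    so the retrained classifier sees the key features highlighted."""
    return sentence + " " + " ".join(key_feats)

s_prime = fuse("stocks fell sharply after the report", ["stocks", "report"])
```

Because the appended tokens already occur in the vocabulary, no change to the word table V′ is needed; the fused sentences simply replace the originals when retraining.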
Description of the experiment and results: the experimental data set is a subset of the AG-News data set of step (1-1); 16000 samples were obtained by uniform random sampling over the categories, of which the training set contains 12800 samples and the validation and test sets contain 1600 samples each. Table 2 compares training the text classification model on data fused with key interpretation features against training it on the raw data, where Train_acc is the training-set accuracy, Test_acc the test-set accuracy, Test_ma_R the test-set macro recall, Test_ma_f1 the test-set macro F1, and Test_mi_f1 the test-set micro F1. The proposed method improves on every index; in particular, the test-set accuracy rises by 2.39 percentage points, showing that the method improves the effect of the text classification model.
TABLE 2 Experimental results
The method uses linear fitting based on local random-perturbation sampling to explain which key features contribute most to the prediction of the text classification model, fuses those features with the original labeled samples, and thereby highlights the key features of each sample, improving the classification effect; the text classification model can then be retrained efficiently so that its results are more accurate.
The foregoing is a more detailed description of the present invention in connection with specific/preferred embodiments thereof, and it is not intended that the practice of the invention be limited to these descriptions. It will be apparent to those skilled in the art that various substitutions and modifications can be made to the described embodiments without departing from the spirit of the invention, and these substitutions and modifications should be considered to fall within the scope of the invention.
Claims (5)
1. A text classification method fusing text interpretation features is characterized by comprising the following operation steps:
step 1, training a text classification model based on a neural network to predict the category of a sentence;
step 2, obtaining the interpretation characteristics of the sentence prediction result in the step 1 by using a linear fitting method based on local random disturbance sampling;
step 3, selecting key interpretation characteristics which are beneficial to the classification effect according to the frequency and the weight of the interpretation characteristics acquired in the step 2;
and 4, fusing the key interpretation features acquired in step 3 with the raw data, and retraining the text classification model.
2. The text classification method fusing text interpretation features according to claim 1, wherein in step 1 a neural-network-based text classification model is trained to predict the category to which a sentence belongs, the specific steps comprising:
(1-1) input layer: the input of the text classification model is sentences with category labels, S = (S_1, S_2, S_3, ..., S_N), where S_i denotes the ith sentence in the data set and N the number of sentences; S_i = (w_1^i, w_2^i, ..., w_k^i), where w_j^i denotes the jth word in the ith sentence and k the number of words in the ith sentence;
(1-2) sentence vectorization: word vectors are trained with GloVe; the vocabulary is V = (w_1, w_2, w_3, ..., w_M); each word is converted into a 64-dimensional vector, generating the vectorized word table V′ = (v_1, v_2, v_3, ..., v_M) of dimension M × 64, where w_i denotes a word in the vocabulary, v_i the word vector of w_i, and M the number of all distinct words present in the data set; looking up the word table V′ converts the words of a sentence into their corresponding vector representations, and sentence S_i is represented as S_i^v = (v_1^i, v_2^i, ..., v_k^i);
(1-3) linear layer: the vectorized sentence S_i^v is input into a linear layer to predict the category label of the sentence; the linear-layer formula is y_l = l(S_i^v) = W^T S_i^v + b, where the prediction result y_l is an array of num_class numbers, num_class being the predefined number of categories, each number representing the likelihood of predicting the category represented by its position; l denotes the linear transformation, and W^T and b are the weight and bias parameters of the linear layer;
(1-4) softmax layer: each value of the prediction result y_l is mapped into the range [0, 1] with the softmax function, softmax(y_l)_j = exp(y_l[j]) / Σ_{n=1}^{num_class} exp(y_l[n]), where softmax(y_l) denotes the transformed prediction result; after every value in y_l is transformed by the softmax function, the num_class values sum to 1;
(1-5) loss equation: the final output of the model is the category label y_pre corresponding to the maximum value in the prediction result; the loss function is determined by the formula loss(y_i, y_pre) = -y_pre log(softmax(y_i)), where loss(y_i, y_pre) denotes the loss function and y_i is the label of the input sentence S_i;
(1-6) parameter optimization: and optimizing parameters of the text classification model by taking the minimized loss function as a target to obtain the trained text classification model.
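Steps (1-2) to (1-5) can be sketched with NumPy. This is a minimal illustration under stated assumptions, not the patent's implementation: the word table is random instead of GloVe-trained, the sentence vector is mean-pooled (the patent does not specify the pooling), and the toy sizes are invented except for the 64-dimensional vectors and num_class = 4 used in the AG-News experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes for illustration; V' stands in for the GloVe-trained word table.
M, dim, num_class, k = 10, 64, 4, 5
V_prime = rng.normal(size=(M, dim))           # vectorized word table V'
W = rng.normal(size=(dim, num_class)) * 0.1   # linear-layer weights
b = np.zeros(num_class)                       # linear-layer bias

def softmax(y):
    e = np.exp(y - y.max())                   # subtract max for stability
    return e / e.sum()

def forward(word_ids):
    """Steps (1-2)-(1-4): look up word vectors, pool, linear layer, softmax."""
    s_vec = V_prime[word_ids].mean(axis=0)    # sentence vector (mean pooling: our assumption)
    y_l = s_vec @ W + b                       # linear layer: y_l = W^T x + b
    return softmax(y_l)

def cross_entropy(probs, label):
    """Step (1-5): negative log-probability of the true class."""
    return -np.log(probs[label])

probs = forward(rng.integers(0, M, size=k))   # probabilities over num_class categories
```

Step (1-6) would then minimize the cross-entropy over the training set by gradient descent on V′, W and b.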
3. The text classification method fusing text interpretation features according to claim 1, wherein in step 2 the interpretation features of the sentence prediction results of step 1 are obtained by linear fitting based on local random-perturbation sampling, the specific steps comprising:
(2-1) select the sentence S_i to be interpreted and sample near S_i by random perturbation: S_i = (w_1^i, ..., w_k^i) is a sentence of k words in the original data set; apply random perturbation to S_i to acquire sampling samples, generate a data set containing a number of such samples, and represent each sample as a vector of 0s and 1s. The random-perturbation process is as follows:
randomly delete words of sentence S_i, the number of deleted words being greater than 0 and less than k, to obtain a new sentence S_i^t = (w_1^{i,t}, ..., w_c^{i,t}), i.e. the tth randomly perturbed sample of S_i, where w_j^{i,t} is the jth word of the tth perturbed sample and c is the number of words remaining after the perturbation; initialize a 1 × k vector, set the positions of the deleted words to 0 and the remaining positions to 1, obtaining the vectorized representation z_t of S_i^t, each element of which is 0 or 1. Perform 4999 random perturbations to obtain a new data set of 5000 sentences X = (S_i^0, S_i^1, ..., S_i^4999), where S_i^0 is the original sentence S_i, whose vector representation contains k 1s; the vector matrix of the new data set X is denoted Z = (z_0, z_1, ..., z_4999);
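The sampling of (2-1) can be sketched as follows (a minimal illustration; the example sentence and the reduced sample count are our own, and the helper name `perturb` is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb(sentence, n_samples=5000):
    """Sketch of step (2-1): local random-perturbation sampling.

    Returns (texts, masks): each perturbed sample deletes a random subset
    of words (more than 0, fewer than k), and its 1 x k 0/1 mask marks
    which words survive. Index 0 holds the original sentence (all 1s).
    """
    words = sentence.split()
    k = len(words)
    texts, masks = [sentence], [np.ones(k, dtype=int)]
    for _ in range(n_samples - 1):
        n_del = rng.integers(1, k)                    # 0 < deleted words < k
        drop = rng.choice(k, size=n_del, replace=False)
        mask = np.ones(k, dtype=int)
        mask[drop] = 0
        texts.append(" ".join(w for w, m in zip(words, mask) if m))
        masks.append(mask)
    return texts, np.array(masks)

texts, Z = perturb("stocks fell sharply after the report", n_samples=100)
```

Each row of `Z` is the z_t vector above; row 0 is the unperturbed original with k ones.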
(2-2) label the newly generated data:
input each sample of the data set X into the trained text classification model for prediction to obtain its corresponding prediction result; the trained text classification model is denoted f, and after steps (1-1) to (1-4) the prediction result of each sample is f(S_i^t), an array of num_class numbers, each of which represents the probability of prediction as the corresponding category;
(2-3) calculating the distance between all the disturbance data and the original data in the new data set Z as the disturbance data weight:
the closer a newly generated perturbed sample is to the original sample, the better it explains the prediction, so it is given a higher weight; the weight of each newly generated sample is defined with an exponential kernel, π_z(t) = exp(-D(z_t, z_0)² / σ²), where π_z is an exponential kernel defined on the cosine distance D, representing the distance weight between samples (the closer the distance, the larger the value of π_z), and σ is the kernel width;
(2-4) fit the new data set Z with a linear model: the linear model is denoted g, with formula g(z_t) = w_g · z_t, where z_t is a vector in the data set Z and w_g is the weight coefficient vector of the linear model;
(2-5) determine the coefficients of the linear model: train the linear model to determine the weight coefficients, with the loss equation L(f, g, π_z) = Σ_t π_z(t) · (f(S_i^t) − g(z_t))²;
minimizing L(f, g, π_z) yields the optimal linear-model weights w_g, of dimension k × num_class, where S_i^t is the tth perturbed sample and z_t is its vector form;
(2-6) acquire the interpretation features and denoise: after the linear model is trained, Fea_i = w_g × S_i gives the interpretation features and their weights for the different categories; for the mth category, sort the features by the absolute value of their weights from large to small, remove auxiliary words, conjunctions, punctuation marks and similar tokens, and select the first T features as the interpretation features of sentence S_i predicted as the mth category, Fea_i^m = {(f_1^i, w_1^i), ..., (f_T^i, w_T^i)};
where Fea_i^m denotes the set of features, with the weight of each feature, obtained by the model interpretation method for predicting the ith sentence as the mth category; m is the label of a category, 1 ≤ m ≤ num_class; f_j^i is the jth feature of sentence S_i and w_j^i is the weight corresponding to f_j^i. A feature with a positive weight indicates that the model regards that feature as supporting the classification of the ith sample into the mth category, called a positive (or forward) feature; a feature with a negative weight indicates that the model regards that feature as not supporting that classification, called a negative feature.
4. The text classification method fusing text interpretation features according to claim 1, wherein in step 3 the key feature set is selected according to the frequency and weight of the acquired interpretation features, the specific steps comprising:
(3-1) acquire all interpretation features of sentence S_i: Fea_i denotes the union, over all categories, of the per-category feature sets of sentence S_i obtained in step (2-6);
(3-2) calculate the frequency and weight of each feature: since the same feature may appear under different categories, it may occur multiple times in Fea_i. Sum the weights of all identical positive features in Fea_i and keep the top c1 features in order of summed weight from large to small, giving a first candidate set; likewise, sum the weights of all identical negative features, rank them by the absolute value of the summed weight from large to small, and keep the top c2 features, giving a second candidate set; at the same time, count the frequency of each negative feature in Fea_i, sort from high to low, and keep the top c3 features, giving a third candidate set;
(3-3) obtain the key interpretation features of sentence S_i: the key interpretation feature set of S_i is the intersection of the three sets obtained in step (3-2), and contains p key interpretation features.
5. The text classification method fusing text interpretation features according to claim 1, wherein in step 4 the key interpretation features obtained in step 3 are fused with the raw data and the text classification model is retrained, the specific steps comprising:
(4-1) acquire data fused with the key interpretation features: the acquired key interpretation features of sentence S_i are taken together with sentence S_i as the input of the text classification model; the sentence fused with key interpretation features is denoted S_i′ = (w_1^i, ..., w_k^i, f_1^i, ..., f_p^i), where w_1^i, ..., w_k^i are the k words of sentence S_i and f_1^i, ..., f_p^i are the p key interpretation features obtained for sentence S_i;
(4-2) retrain the text classification model: apply steps (2-1) to (4-1) to all training and test samples to fuse the key interpretation features, obtaining a new data set S′ = (S_1′, S_2′, S_3′, ..., S_N′); then retrain the text classification model on the data set S′ according to the process of claim 2; the resulting text classification is more accurate.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110521823.9A CN113590814A (en) | 2021-05-13 | 2021-05-13 | Text classification method fusing text interpretation features |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113590814A true CN113590814A (en) | 2021-11-02 |
Family
ID=78243402
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110521823.9A Pending CN113590814A (en) | 2021-05-13 | 2021-05-13 | Text classification method fusing text interpretation features |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113590814A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108182186A (en) * | 2016-12-08 | 2018-06-19 | 广东精点数据科技股份有限公司 | A kind of Web page sequencing method based on random forests algorithm |
CN110688491A (en) * | 2019-09-25 | 2020-01-14 | 暨南大学 | Machine reading understanding method, system, device and medium based on deep learning |
CN111967354A (en) * | 2020-07-31 | 2020-11-20 | 华南理工大学 | Depression tendency identification method based on multi-modal characteristics of limbs and microexpressions |
US20210012199A1 (en) * | 2019-07-04 | 2021-01-14 | Zhejiang University | Address information feature extraction method based on deep neural network model |
Non-Patent Citations (3)
Title |
---|
MARCO TULIO RIBEIRO et al.: ""Why Should I Trust You?": Explaining the Predictions of Any Classifier", KDD '16: The 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining * |
ZHOU QIANRONG: "Research on Deep Representation Learning Techniques for Sentence Classification", China Doctoral Dissertations Full-text Database * |
DAI YAPING et al.: "Theory and Application of Intelligent Multi-sensor Data Fusion (Textbook Series for General Higher Education in Emerging Engineering)", China Machine Press, page 143 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20220019745A1 (en) | Methods and apparatuses for training service model and determining text classification category | |
CN109189925B (en) | Word vector model based on point mutual information and text classification method based on CNN | |
CN110008338B (en) | E-commerce evaluation emotion analysis method integrating GAN and transfer learning | |
CN110245229B (en) | Deep learning theme emotion classification method based on data enhancement | |
CN106407333B (en) | Spoken language query identification method and device based on artificial intelligence | |
CN107943784B (en) | Relationship extraction method based on generation of countermeasure network | |
CN110033281B (en) | Method and device for converting intelligent customer service into manual customer service | |
CN111738007B (en) | Chinese named entity identification data enhancement algorithm based on sequence generation countermeasure network | |
CN111506732B (en) | Text multi-level label classification method | |
CN112364638B (en) | Personality identification method based on social text | |
CN108038492A (en) | A kind of perceptual term vector and sensibility classification method based on deep learning | |
CN112052684A (en) | Named entity identification method, device, equipment and storage medium for power metering | |
CN112966068A (en) | Resume identification method and device based on webpage information | |
CN113255366B (en) | Aspect-level text emotion analysis method based on heterogeneous graph neural network | |
CN111191031A (en) | Entity relation classification method of unstructured text based on WordNet and IDF | |
CN114564563A (en) | End-to-end entity relationship joint extraction method and system based on relationship decomposition | |
CN111581364B (en) | Chinese intelligent question-answer short text similarity calculation method oriented to medical field | |
CN114841151B (en) | Medical text entity relation joint extraction method based on decomposition-recombination strategy | |
Niyozmatova et al. | Classification based on decision trees and neural networks | |
CN114691525A (en) | Test case selection method and device | |
CN111651597A (en) | Multi-source heterogeneous commodity information classification method based on Doc2Vec and convolutional neural network | |
CN115017879A (en) | Text comparison method, computer device and computer storage medium | |
CN114722198A (en) | Method, system and related device for determining product classification code | |
CN114239584A (en) | Named entity identification method based on self-supervision learning | |
CN112989803A (en) | Entity link model based on topic vector learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||