CN113590814A - Text classification method fusing text interpretation features - Google Patents

Text classification method fusing text interpretation features Download PDF

Info

Publication number
CN113590814A
Authority
CN
China
Prior art keywords
sentence
interpretation
features
feature
text classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110521823.9A
Other languages
Chinese (zh)
Inventor
骆祥峰
陈璐
陈雪
高剑奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology filed Critical University of Shanghai for Science and Technology
Priority to CN202110521823.9A priority Critical patent/CN113590814A/en
Publication of CN113590814A publication Critical patent/CN113590814A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a text classification method that fuses text interpretation features. The specific implementation steps are as follows: (1) train a neural-network-based text classification model to predict the category of a sentence; (2) obtain the interpretation features of the sentence prediction results of step (1) using a linear fitting method based on local random perturbation sampling; (3) select the key interpretation features that benefit the classification effect according to the frequency and weight of the acquired interpretation features; (4) fuse the key interpretation features acquired in step (3) with the original data and retrain the text classification model. The method uses a linear fitting method based on local random perturbation sampling to explain which key features contribute most to the prediction results of the text classification model, fuses these features with the original labeled samples, and highlights the key features of the original samples, thereby improving the classification effect.

Description

Text classification method fusing text interpretation features
Technical Field
The invention relates to the technical field of natural language processing, in particular to a text classification method fusing text interpretation features. The method interprets a trained neural-network-based text classification model with a linear fitting method based on random perturbation sampling to obtain the interpretation features of each sentence's prediction result, and retrains the text classification model with the key interpretation features fused in. It can be applied to specific tasks such as spam identification, text topic classification, and sentiment analysis.
Background
Text classification is an important research direction in the field of natural language processing; its task is to map a text to a predefined category by some method. Text classification methods include rule-based methods and machine-learning-based methods.
Rule-based text classification requires setting different rules for different texts, which is time-consuming and labor-intensive, and neither coverage nor accuracy can be guaranteed. With the rise of machine learning, machine-learning methods have been applied to the text classification task and achieve better results. However, many machine-learning models are black boxes: we can only obtain the prediction the model gives, not the reason it gives that prediction, and can only judge the model's reliability from indices such as accuracy. Yet in fields such as medicine, knowing not only the model's prediction and accuracy but also the basis on which it makes that prediction provides model users with a more reliable basis for decision-making, and allows intervening in the model training process according to that basis, thereby improving the classification effect.
In summary, because deep learning models are not interpretable, it is difficult for model users to determine the basis of a model's prediction and to make correct decisions from it.
Disclosure of Invention
The main purpose of the invention is to overcome the defects of the prior art and provide a text classification method fusing text interpretation features. The method uses a linear fitting method based on random perturbation sampling to interpret the prediction results of a neural-network-based text classification model, obtains the interpretation features of each sentence from the classification features used during linear fitting, selects key interpretation features according to the frequency and weight of those features, fuses the key interpretation features with the original data, and retrains the text classification model, making the text classification results more accurate.
In order to achieve the purpose, the invention adopts the following technical scheme:
a text classification method fusing text interpretation features comprises the following operation steps:
step 1, training a neural-network-based text classification model to predict the category to which a sentence belongs;
step 2, obtaining the interpretation features of the sentence prediction results of step 1 by using a linear fitting method based on local random perturbation sampling;
step 3, selecting the key interpretation features beneficial to the classification effect according to the frequency and weight of the interpretation features acquired in step 2;
step 4, fusing the key interpretation features acquired in step 3 with the original data, and retraining the text classification model.
Preferably, the training of the neural network-based text classification model in step 1 is used for predicting the category to which the sentence belongs, and the specific steps include:
(1-1) Input layer: the input to the text classification model is a set of sentences with category labels, $S = (S_1, S_2, S_3, \ldots, S_N)$, where $S_i$ denotes the $i$-th sentence in the data set and $N$ the number of sentences; $S_i = (w_1^i, w_2^i, \ldots, w_k^i)$, where $w_j^i$ denotes the $j$-th word in the $i$-th sentence and $k$ the number of words in the $i$-th sentence;
(1-2) Sentence vectorization: word vectors are trained with GloVe; each word in the vocabulary $V = (w_1, w_2, w_3, \ldots, w_M)$ is converted into a 64-dimensional vector, producing a vectorized vocabulary $V' = (v_1, v_2, v_3, \ldots, v_M)$ of dimension $M \times 64$, where $w_i$ denotes a word in the vocabulary, $v_i$ the vectorized representation of $w_i$, and $M$ the number of distinct words appearing in the data set; looking words up in $V'$ converts the words of a sentence into their vector representations, so that sentence $S_i$ is represented as $S_i^v = (v_1^i, v_2^i, \ldots, v_k^i)$;
(1-3) Linear layer: the vectorized sentence $S_i^v$ is input to a linear layer that predicts the sentence's category label:
$$y_l = l(S_i^v) = W^T S_i^v + b$$
where the prediction result $y_l$ is an array of num_class numbers (num_class being the predefined number of categories), each number giving the likelihood of the category at that position; $l$ denotes the linear transformation, and $W^T$ and $b$ are the parameters of the linear layer;
(1-4) Softmax layer: the softmax function maps each value of the prediction result $y_l$ into the range $[0, 1]$:
$$\mathrm{softmax}(y_l^j) = \frac{e^{y_l^j}}{\sum_{j=1}^{num\_class} e^{y_l^j}}$$
where $y_l^j$ denotes the $j$-th value of the prediction result $y_l$; after every value of $y_l$ is transformed by the softmax function, the num_class values sum to 1;
(1-5) Loss function: the final output of the model is the class label $y_{pre}$ corresponding to the maximum value in the prediction result; the loss function is defined as $loss(y_i, y_{pre}) = -y_{pre} \log(\mathrm{softmax}(y_i))$, where $loss(y_i, y_{pre})$ denotes the loss and $y_i$ is the label of the input sentence $S_i$;
(1-6) Parameter optimization: the parameters of the text classification model are optimized with the goal of minimizing the loss function, yielding the trained text classification model.
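As a concrete illustration of steps (1-1) to (1-6), the following is a minimal PyTorch sketch of such a classifier. The patent does not state how the sequence of word vectors is reduced before the linear layer, so mean pooling is assumed here; the class and variable names are illustrative, not taken from the patent.

```python
import torch
import torch.nn as nn

class TextClassifier(nn.Module):
    """Embedding -> mean pooling -> linear layer; softmax is applied
    inside the cross-entropy loss. The embedding stands in for the
    GloVe-trained 64-dimensional word vectors of step (1-2)."""
    def __init__(self, vocab_size: int, num_class: int, emb_dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.linear = nn.Linear(emb_dim, num_class)   # y_l = W^T s + b

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        vecs = self.embed(token_ids)   # (batch, k, 64)
        sent = vecs.mean(dim=1)        # (batch, 64) sentence vector (assumption)
        return self.linear(sent)       # unnormalized class scores

# one training step; cross-entropy combines softmax and the log loss
model = TextClassifier(vocab_size=161067, num_class=4)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=2.0)

tokens = torch.randint(0, 161067, (16, 20))   # dummy batch of 16 sentences
labels = torch.randint(0, 4, (16,))
loss = loss_fn(model(tokens), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```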
Preferably, in step 2, a linear fitting method based on local random perturbation sampling is used to obtain the interpretation features of the sentence prediction results of step 1; the specific steps include:
(2-1) Select the sentence $S_i$ to be interpreted and sample near $S_i$ by random perturbation: $S_i = (w_1^i, w_2^i, \ldots, w_k^i)$ is a sentence of $k$ words in the original data set. $S_i$ is randomly perturbed to obtain sample instances, generating a data set containing many such samples, each given a vectorized 0/1 representation; the random perturbation process is as follows:
Words of sentence $S_i$ are deleted at random, the number of deleted words being greater than 0 and less than $k$, giving a new sentence $\hat{S}_i^t = (w_1^t, w_2^t, \ldots, w_c^t)$, i.e., a randomly perturbed sample of $S_i$, where $w_j^t$ is the $j$-th word in the $t$-th randomly perturbed sample and $c$ is the number of words remaining after perturbation. A $1 \times k$ vector is initialized, the positions of the deleted words are set to 0 and the other positions to 1, giving the vectorized representation $z_t = (z_1^t, z_2^t, \ldots, z_k^t)$ of $\hat{S}_i^t$, in which every element $z_j^t \in \{0, 1\}$.
Random perturbation is performed 4999 times, giving a new data set of 5000 sentences $X = (S_i^1, S_i^2, \ldots, S_i^{5000})$, where $S_i^1$ is the original sentence $S_i$, whose vectorized representation is a vector of $k$ ones; the vector matrix of the new data set $X$ is denoted $Z \in \{0, 1\}^{5000 \times k}$;
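The perturbation sampling of step (2-1) can be sketched as follows. This is a plain NumPy illustration of the patent's description (4999 random deletions plus the original sentence); the function and parameter names are chosen for illustration.

```python
import numpy as np

def perturb_sentence(words, n_samples=5000, rng=None):
    """Randomly delete between 1 and k-1 words, n_samples-1 times.
    Returns the perturbed word lists and the n_samples x k 0/1 matrix Z
    (first row = the original sentence, all ones)."""
    rng = rng or np.random.default_rng(0)
    k = len(words)
    Z = [np.ones(k, dtype=int)]              # row 1: the original sentence
    samples = [list(words)]
    for _ in range(n_samples - 1):
        n_del = rng.integers(1, k)           # 0 < deleted words < k
        drop = rng.choice(k, size=n_del, replace=False)
        mask = np.ones(k, dtype=int)
        mask[drop] = 0                       # deleted positions -> 0
        Z.append(mask)
        samples.append([w for w, m in zip(words, mask) if m == 1])
    return samples, np.stack(Z)

samples, Z = perturb_sentence("the cat sat on the mat".split())
print(Z.shape)  # (5000, 6)
```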
(2-2) Label the newly generated data:
each instance in the data set $X$ is input to the trained text classification model for prediction to obtain the corresponding prediction result; denoting the trained text classification model by $f$, after steps (1-1) to (1-4) the prediction result $f(z_t)$ of each sample is an array of num_class numbers, each value giving the probability of prediction as the corresponding category;
(2-3) Compute the distance between every perturbed instance in the new data set $Z$ and the original instance, and use it as the perturbed instance's weight:
the closer a newly generated perturbed instance is to the original instance, the better it can explain the prediction, so it is given a higher weight. The weight of newly generated data is defined with an exponential kernel:
$$\pi_z(z_t) = \exp\!\left(-\frac{D(S_i, z_t)^2}{\sigma^2}\right)$$
where $\pi_z$ is an exponential kernel defined on the cosine distance $D$, representing the distance weight between samples: the smaller the distance, the larger the value of $\pi_z(z_t)$; $\sigma$ is the kernel width;
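A minimal sketch of this kernel, assuming the original sentence corresponds to the all-ones row of $Z$; the kernel width used here is illustrative, since the patent does not fix $\sigma$:

```python
import numpy as np
from scipy.spatial.distance import cosine

def kernel_weight(z_row, sigma=0.25):
    """pi_z(z_t) = exp(-D(S_i, z_t)^2 / sigma^2), with D the cosine
    distance between the perturbed 0/1 mask and the all-ones original."""
    original = np.ones_like(z_row, dtype=float)
    d = cosine(original, z_row.astype(float))  # cosine distance in [0, 1]
    return np.exp(-d**2 / sigma**2)
```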
(2-4) Fit the new data set $Z$ with a linear model: denoting the linear model by $g$, its formula is
$$g(z_t) = w_g \cdot z_t$$
where $z_t$ is a vector in the data set $Z$ and $w_g$ is the weight coefficient of the linear model;
(2-5) Determine the coefficients of the linear model: a linear classification model is trained to determine the weight coefficients, with the loss defined as
$$L(f, g, \pi_z) = \sum_{t} \pi_z(z_t)\,\bigl(f(z_t) - g(z_t)\bigr)^2$$
Minimizing $L(f, g, \pi_z)$ yields the optimal linear model weights $w_g$, of dimension $num\_class \times k$, where $\hat{S}_i^t$ is the $t$-th perturbed instance and $z_t$ is its vector form;
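One plausible realization of steps (2-3) to (2-5) fits one weighted least-squares regression per category against the probabilities output by $f$. The use of scikit-learn's Ridge (an L2-regularized fit) is an assumption; the patent only specifies the weighted squared loss.

```python
import numpy as np
from sklearn.linear_model import Ridge

def fit_surrogate(Z, probs, weights):
    """Fit g(z) = w_g . z per class by weighted least squares.
    Z: (5000, k) perturbation masks; probs: (5000, num_class) outputs
    of the trained model f; weights: (5000,) kernel weights pi_z."""
    num_class = probs.shape[1]
    w_g = np.zeros((num_class, Z.shape[1]))
    for m in range(num_class):
        reg = Ridge(alpha=1.0)                       # regularization assumed
        reg.fit(Z, probs[:, m], sample_weight=weights)
        w_g[m] = reg.coef_                           # one weight per word position
    return w_g                                       # shape: num_class x k
```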
(2-6) Acquire the interpretation features and denoise: after the linear model is trained, $Fea_i = w_g \times S_i$ gives the interpretation features and weights for the different categories. For the $m$-th category, the features are sorted in descending order of the absolute value of their weights, auxiliary words, connectives, punctuation, and similar tokens are removed, and the first $T$ are selected as the interpretation features of sentence $S_i$ being predicted as the $m$-th category:
$$Fea_i^m = \{(f_1^i, e_1^m), (f_2^i, e_2^m), \ldots, (f_T^i, e_T^m)\}$$
where $Fea_i^m$ denotes the set of features, each with its corresponding weight, obtained by the model interpretation method for predicting the $i$-th sentence as the $m$-th category; $m$ is the label of a category, $1 \le m \le num\_class$; $f_j^i$ is the $j$-th feature of sentence $S_i$ and $e_j^m$ is the weight corresponding to $f_j^i$. A positive weight indicates that the model considers the feature to support classifying the $i$-th sample into the $m$-th category; we call such a feature a positive feature. A negative weight indicates that the model considers the feature not to support classifying the $i$-th sample into the $m$-th category; such a feature is called a negative feature.
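A small sketch of the per-category selection in step (2-6); the stop-word list stands in for the removal of auxiliary words, connectives, and punctuation and is purely illustrative.

```python
STOPWORDS = {"the", "a", "of", "and", ",", "."}   # illustrative stop list

def explain_class(words, w_g, m, T=10):
    """Return the top-T (feature, weight) pairs Fea_i^m for class m,
    sorted by |weight|, with stop words and punctuation removed."""
    pairs = [(w, w_g[m][j]) for j, w in enumerate(words)
             if w.lower() not in STOPWORDS]
    pairs.sort(key=lambda p: abs(p[1]), reverse=True)
    return pairs[:T]   # weight > 0 supports class m; weight < 0 opposes it
```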
Preferably, in step 3, the key feature set is selected according to the frequency and weight of the acquired interpretation features; the specific steps include:
(3-1) Acquire all interpretation features of sentence $S_i$: $Fea_i$ denotes the union of the feature sets, obtained in step (2-6), of sentence $S_i$ being predicted as each category:
$$Fea_i = Fea_i^1 \cup Fea_i^2 \cup \ldots \cup Fea_i^{num\_class}$$
(3-2) Compute the frequency and weight of each feature: since the same feature may appear in different categories, it may occur multiple times in $Fea_i$. The weights of all identical positive features in $Fea_i$ are summed, the features are sorted in descending order of weight, and the first $c_1$ features are taken, giving $Fea_i^{pos}$. In the same way, the weights of all identical negative features in $Fea_i$ are summed, the features are sorted in descending order of the absolute value of weight, and the first $c_2$ features are taken, giving $Fea_i^{neg}$. At the same time, the frequency of each negative feature in $Fea_i$ is computed, the features are sorted from high to low frequency, and the first $c_3$ features are taken, giving $Fea_i^{fre}$.
(3-3) Acquire the key interpretation features of sentence $S_i$: the key interpretation feature set $Fea_i^{key}$ of sentence $S_i$ is finally obtained as the intersection of the three sets from step (3-2), and contains $p$ key interpretation features:
$$Fea_i^{key} = Fea_i^{pos} \cap Fea_i^{neg} \cap Fea_i^{fre} = \{f_1^i, f_2^i, \ldots, f_p^i\}$$
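Steps (3-1) to (3-3) can be sketched as follows. The cutoffs c1, c2, c3 are left open by the patent, so the defaults here are placeholders; the final line takes the intersection of the three sets exactly as the patent states.

```python
from collections import defaultdict

def key_features(all_class_feats, c1=10, c2=10, c3=10):
    """all_class_feats: per-class lists of (feature, weight) pairs, i.e.
    Fea_i^1 ... Fea_i^num_class. Builds the three sets of step (3-2)
    and intersects them as in step (3-3)."""
    pos_w, neg_w, neg_n = defaultdict(float), defaultdict(float), defaultdict(int)
    for pairs in all_class_feats:
        for f, e in pairs:
            if e > 0:
                pos_w[f] += e        # summed positive weight
            else:
                neg_w[f] += e        # summed negative weight
                neg_n[f] += 1        # negative-feature frequency
    top = lambda d, key, c: {f for f, _ in sorted(d.items(), key=key, reverse=True)[:c]}
    pos = top(pos_w, lambda kv: kv[1], c1)        # Fea_i^pos
    negw = top(neg_w, lambda kv: abs(kv[1]), c2)  # Fea_i^neg
    negf = top(neg_n, lambda kv: kv[1], c3)       # Fea_i^fre
    return pos & negw & negf                      # intersection, as stated
```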
preferably, in the step 4, the key interpretation features and the raw data obtained in the step 3 are fused to retrain the text classification model, and the specific steps include:
(4-1) Acquire data fused with the key interpretation features: the acquired key interpretation features of sentence $S_i$ are used, together with sentence $S_i$ itself, as the input of the text classification model; the sentence fused with key interpretation features is denoted $S_i'$:
$$S_i' = (w_1^i, w_2^i, \ldots, w_k^i, f_1^i, f_2^i, \ldots, f_p^i)$$
where $w_1^i, \ldots, w_k^i$ are the $k$ words of sentence $S_i$ and $f_1^i, \ldots, f_p^i$ are the acquired $p$ key interpretation features of sentence $S_i$;
(4-2) Retrain the text classification model: steps (2-1) to (4-1) are applied to all training and test samples to fuse in the key interpretation features, giving a new data set $S' = (S_1', S_2', S_3', \ldots, S_N')$; the text classification model is then retrained on the data set $S'$ following the training procedure of step 1, and the resulting text classification results are more accurate.
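A minimal sketch of the fusion in step (4-1): the key interpretation features are simply appended to the original word sequence before retraining. The ordering of the appended features is an assumption; the patent does not specify one.

```python
def fuse(words, key_feats):
    """S_i' = the original words followed by the p key interpretation
    features; the fused sentences are then fed back through the same
    training procedure as step 1."""
    return list(words) + sorted(key_feats)

fused = fuse("the cat sat on the mat".split(), {"cat", "mat"})
# -> ['the', 'cat', 'sat', 'on', 'the', 'mat', 'cat', 'mat']
```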
Compared with the prior art, the invention has the following obvious and prominent substantive characteristics and remarkable technical progress:
1. the method uses a linear fitting method based on local random disturbance sampling to explain which key features have the greatest contribution to the prediction result of the text classification model, fuses the features and the original labeled sample, and highlights the key features of the original sample, thereby improving the classification effect;
2. the method can efficiently retrain the text classification model, so that the text classification result is more accurate.
Drawings
FIG. 1 is a flow chart of a text classification method for fusing text interpretation features according to the present invention.
FIG. 2 is a diagram of a neural network-based text classification model according to the present invention.
FIG. 3 is a flow chart of the present invention for obtaining interpretation characteristics using a model interpretation method.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings and tables.
The invention aims to provide a text classification method fusing text interpretation features, which is used for acquiring key features of a prediction result given by a text classification model through a model interpretation method, and using the key features and an original text together as an input retraining model of the text classification model, thereby improving the effect of the text classification model.
The invention provides a text classification method fusing text interpretation characteristics, which is characterized in that a linear fitting method based on local random disturbance sampling is used for interpreting a prediction result of a text classification model based on a neural network to obtain interpretation characteristics, key interpretation characteristics are obtained according to the frequency and weight of the characteristics and fused with original data, and the text classification model is retrained, so that the text classification result is more accurate. The basic features of the present invention mainly include the following aspects:
firstly, interpreting a prediction result of a trained text classification model by using a linear fitting method based on local random disturbance sampling to obtain an interpretation characteristic;
selecting key interpretation features which are beneficial to text classification according to the weight and the frequency of the interpretation features;
and thirdly, fusing the original data with key interpretation characteristics to retrain the text classification model.
The first embodiment is as follows:
referring to fig. 1, a text classification method fusing text interpretation features includes the following operation steps:
step 1, training a neural-network-based text classification model to predict the category to which a sentence belongs;
step 2, obtaining the interpretation features of the sentence prediction results of step 1 by using a linear fitting method based on local random perturbation sampling;
step 3, selecting the key interpretation features beneficial to the classification effect according to the frequency and weight of the interpretation features acquired in step 2;
step 4, fusing the key interpretation features acquired in step 3 with the original data, and retraining the text classification model.
The method can efficiently retrain the text classification model, so that the text classification result is more accurate.
Example two:
Building on the above embodiment, and referring to the flowchart of the method in fig. 1, a text classification method fusing text interpretation features comprises the following specific operation steps:
step S1: training a text classification model based on a neural network for predicting the category of the sentence, wherein the text classification model is illustrated in the attached figure 2, and the model parameter setting is illustrated in the table 1; the specific process is as follows:
(1-1) Input layer: the AG-News data set is acquired; AG-News is a standard English text classification data set comprising 127600 instances in four categories. Considering the time required to train the text classification model and to acquire the interpretation features of every instance, data of each category are sampled uniformly at random from AG-News, and 16000 instances are selected for the experiment, of which the training set contains 12800 and the validation set and test set contain 1600 each. The input to the text classification model is a set of sentences with category labels, $S = (S_1, S_2, S_3, \ldots, S_N)$, where $S_i$ denotes the $i$-th sentence in the data set and $N$, the number of sentences, is 16000; $S_i = (w_1^i, w_2^i, \ldots, w_k^i)$, where $w_j^i$ denotes the $j$-th word of the $i$-th sentence and $k$ the number of words in the $i$-th sentence; $k$ is not fixed because sentence lengths differ;
TABLE 1 Text classification model parameter settings
[Table 1 appears only as an image in the original document; per the text of step (1-6), the settings include Batch Size 16, learning rate 2.0, a learning-rate adjustment multiple of 0.8 per epoch, and 35 training iterations.]
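The uniform per-category sampling described in (1-1) can be sketched as follows; loading AG-News itself is omitted, and the function name and seed are illustrative.

```python
import numpy as np

def stratified_sample(labels, per_class, rng=None):
    """Uniform random sampling of the same number of instances from each
    category (e.g. 16000 total = 4000 per class for four AG-News classes);
    returns the selected indices."""
    rng = rng or np.random.default_rng(0)
    labels = np.asarray(labels)
    picked = [rng.choice(np.where(labels == c)[0], per_class, replace=False)
              for c in np.unique(labels)]
    return np.concatenate(picked)
```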
(1-2) Sentence vectorization: word vectors are trained with GloVe; each word in the vocabulary $V = (w_1, w_2, w_3, \ldots, w_M)$ is converted into a 64-dimensional vector representation, producing a vectorized vocabulary $V' = (v_1, v_2, v_3, \ldots, v_M)$, where $w_i$ denotes a word in the vocabulary, $v_i$ the vectorized representation of $w_i$, and $M$ the number of distinct words appearing in the data set; $M$ takes the value 161067, and the dimension of $V'$ is $M \times 64$. Looking words up in $V'$ converts the words of a sentence into their vector representations, so that sentence $S_i$ is represented as $S_i^v = (v_1^i, v_2^i, \ldots, v_k^i)$;
(1-3) Linear layer: the vectorized sentence $S_i^v$ is input to a linear layer that predicts the sentence's category label:
$$y_l = l(S_i^v) = W^T S_i^v + b$$
where the prediction result $y_l$ is an array of 4 values, each giving the likelihood of the category at that position; $l$ denotes the linear transformation, and $W^T$ and $b$ are the parameters of the linear layer, randomly initialized in the range (-0.3, 0.3).
(1-4) Softmax layer: the softmax function maps each value of the prediction result $y_l$ into the range $[0, 1]$:
$$\mathrm{softmax}(y_l^j) = \frac{e^{y_l^j}}{\sum_{j=1}^{4} e^{y_l^j}}$$
where $y_l^j$ denotes the $j$-th value of the prediction result $y_l$; after every value of $y_l$ is transformed by the softmax function, the 4 values sum to 1.
(1-5) Loss function: the final output of the model is the class label $y_{pre}$ corresponding to the maximum value in the prediction result; the loss function is defined as $loss(y_i, y_{pre}) = -y_{pre} \log(\mathrm{softmax}(y_i))$, where $y_i$ is the label of the input sentence.
(1-6) Parameter optimization: the parameters of the text classification model are optimized with the goal of minimizing the loss function. As shown in table 1, Batch Size is set to 16, i.e., 16 sentences are input into the text classification model at a time. The learning rate during training is 2.0, the learning-rate adjustment multiple is 0.8, and the adjustment interval is 1 epoch, i.e., after every epoch the learning rate is adjusted to 0.8 times that of the previous epoch; the model completes training after 35 iterations.
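A minimal sketch of this training schedule, reusing the TextClassifier from the earlier sketch. The choice of plain SGD is an assumption (the patent names only the learning-rate settings), and the placeholder batches stand in for the real AG-News batches of size 16.

```python
import torch
import torch.nn as nn

model = TextClassifier(vocab_size=161067, num_class=4)  # from the earlier sketch
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=2.0)  # optimizer choice assumed
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.8)

train_batches = [(torch.randint(0, 161067, (16, 20)), torch.randint(0, 4, (16,)))
                 for _ in range(10)]          # placeholder batches of size 16

for epoch in range(35):                       # 35 training iterations (epochs)
    for tokens, labels in train_batches:
        optimizer.zero_grad()
        loss = loss_fn(model(tokens), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()                          # lr becomes 0.8x after every epoch
```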
Step S2: the interpretation features of the sentence prediction results of step S1 are acquired with a linear fitting method based on local random perturbation sampling. The specific process is shown in fig. 3:
(2-1) Select the sentence $S_i$ to be interpreted and sample near $S_i$ by random perturbation: $S_i = (w_1^i, w_2^i, \ldots, w_k^i)$ is a sentence of $k$ words in the original data set. $S_i$ is randomly perturbed to obtain sample instances, generating a data set containing many such samples, each given a vectorized 0/1 representation. The random perturbation process is as follows:
Words of sentence $S_i$ are deleted at random, the number of deleted words being greater than 0 and less than $k$, giving a new sentence $\hat{S}_i^t = (w_1^t, w_2^t, \ldots, w_c^t)$, i.e., a randomly perturbed sample of $S_i$, where $w_j^t$ is the $j$-th word in the $t$-th randomly perturbed sample and $c$ is the number of words remaining after perturbation. A $1 \times k$ vector is initialized, the positions of the deleted words are set to 0 and the other positions to 1, giving the vectorized representation $z_t = (z_1^t, z_2^t, \ldots, z_k^t)$ of $\hat{S}_i^t$, in which every element $z_j^t \in \{0, 1\}$.
Random perturbation is performed 4999 times, giving a new data set of 5000 sentences $X = (S_i^1, S_i^2, \ldots, S_i^{5000})$, where $S_i^1$ is the original sentence $S_i$, whose vectorized representation is a vector of $k$ ones. The vector matrix of the new data set $X$ is denoted $Z \in \{0, 1\}^{5000 \times k}$.
(2-2) Label the newly generated data: each instance in the data set $X$ is input to the trained text classification model for prediction to obtain the corresponding prediction result. Denoting the trained text classification model by $f$, after steps (1-1) to (1-4) the prediction result $f(z_t)$ of each sample is an array of 4 numbers, 4 being the number of data categories, each of which gives the probability of prediction as the corresponding category.
(2-3) Compute the distance between every perturbed instance in the new data set $Z$ and the original instance, and use it as the perturbed instance's weight: the closer a newly generated perturbed instance is to the original instance, the better it can explain the prediction, so it is given a higher weight. The weight of newly generated data is defined with an exponential kernel:
$$\pi_z(z_t) = \exp\!\left(-\frac{D(S_i, z_t)^2}{\sigma^2}\right)$$
where $\pi_z$ is an exponential kernel defined on the cosine distance $D$, representing the distance weight between samples: the smaller the distance, the larger the value of $\pi_z(z_t)$; $\sigma$ is the kernel width.
(2-4) Fit the new data set $Z$ with a linear model: denoting the linear model by $g$, its formula is
$$g(z_t) = w_g \cdot z_t$$
where $z_t$ is a vector in the data set $Z$ and $w_g$ is the weight coefficient of the linear model.
(2-5) Determine the coefficients of the linear model: the loss is defined as
$$L(f, g, \pi_z) = \sum_{t} \pi_z(z_t)\,\bigl(f(z_t) - g(z_t)\bigr)^2$$
Minimizing $L(f, g, \pi_z)$ yields the optimal linear model weights $w_g$, of dimension $4 \times k$, where $\hat{S}_i^t$ is the $t$-th perturbed instance and $z_t$ is its vector form.
(2-6) Acquire the interpretation features and denoise: after the linear model is trained, $Fea_i = w_g \times S_i$ gives the interpretation features and weights for the different categories. For the $m$-th category, the features are sorted in descending order of the absolute value of their weights, auxiliary words, connectives, punctuation, and similar tokens are removed, and the first $T$ are selected as the interpretation features of sentence $S_i$ being predicted as the $m$-th category:
$$Fea_i^m = \{(f_1^i, e_1^m), (f_2^i, e_2^m), \ldots, (f_T^i, e_T^m)\}$$
where $Fea_i^m$ denotes the set of features, each with its corresponding weight, output by the model interpretation method for predicting the $i$-th sentence as the $m$-th category; $m$ is the label of a category, $1 \le m \le 4$; $f_j^i$ is the $j$-th feature of sentence $S_i$ and $e_j^m$ is the weight corresponding to $f_j^i$. A positive weight indicates that the model considers the feature to support classifying the $i$-th sample into the $m$-th category; we call such a feature a positive feature. A negative weight indicates that the model considers the feature not to support classifying the $i$-th sample into the $m$-th category; such a feature is called a negative feature.
Step S3: selecting key interpretation characteristics beneficial to classification effect according to the frequency and weight of the acquired interpretation characteristics, and specifically performing the following steps:
(3-1) Acquire all interpretation features of sentence $S_i$: $Fea_i$ denotes the union of the feature sets, obtained in step (2-6), of sentence $S_i$ being predicted as each category:
$$Fea_i = Fea_i^1 \cup Fea_i^2 \cup Fea_i^3 \cup Fea_i^4$$
(3-2) Compute the frequency and weight of each feature:
since the same feature may appear in different categories, it may occur multiple times in $Fea_i$. The weights of all identical positive features in $Fea_i$ are summed, the features are sorted in descending order of weight, and the first $c_1$ features are taken, giving $Fea_i^{pos}$. In the same way, the weights of all identical negative features in $Fea_i$ are summed, the features are sorted in descending order of the absolute value of weight, and the first $c_2$ features are taken, giving $Fea_i^{neg}$. At the same time, the frequency of each negative feature in $Fea_i$ is computed, the features are sorted from high to low frequency, and the first $c_3$ features are taken, giving $Fea_i^{fre}$.
(3-3) Acquire the key interpretation features of sentence $S_i$: the key interpretation feature set $Fea_i^{key}$ of sentence $S_i$ is finally obtained as the intersection of the three sets from step (3-2), and contains $p$ key interpretation features:
$$Fea_i^{key} = Fea_i^{pos} \cap Fea_i^{neg} \cap Fea_i^{fre} = \{f_1^i, f_2^i, \ldots, f_p^i\}$$
step S4: fusing the key interpretation features and the raw data acquired in the step S3, and retraining the text classification model, which specifically comprises:
(4-1) Acquire data fused with the key interpretation features: the acquired key interpretation features of sentence $S_i$ are used, together with sentence $S_i$ itself, as the input of the text classification model; the sentence fused with key interpretation features is denoted $S_i'$:
$$S_i' = (w_1^i, w_2^i, \ldots, w_k^i, f_1^i, f_2^i, \ldots, f_p^i)$$
where $w_1^i, \ldots, w_k^i$ are the $k$ words of sentence $S_i$ and $f_1^i, \ldots, f_p^i$ are the acquired $p$ key interpretation features of sentence $S_i$.
(4-2) Retrain the text classification model: steps (2-1) to (4-1) are applied to all training and test samples to fuse in the key interpretation features, giving a new data set $S' = (S_1', S_2', S_3', \ldots, S_N')$; the text classification model is then retrained on the data set $S'$ following the training procedure of step S1, and the resulting text classification results are more accurate.
Description of the experiment and results: the experimental data set is the part of the AG-News data set described in step (1-1); 16000 instances were obtained by uniformly and randomly sampling the data of each category, with 12800 instances in the training set and 1600 each in the validation and test sets. Table 2 compares training the text classification model on data fused with key interpretation features against training it on the original data, where Train_acc is the training-set accuracy, Test_acc the test-set accuracy, Test_ma_R the test-set macro recall, Test_ma_f1 the test-set macro F1, and Test_mi_f1 the test-set micro F1. The proposed method improves on every index; in particular the test-set accuracy rises by 2.39 percentage points, showing that the proposed method can improve the effect of the text classification model.
TABLE 2 Experimental results
[Table 2 appears only as an image in the original document; per the text above, the model trained on fused data improves on all listed metrics, including a 2.39-percentage-point gain in test-set accuracy.]
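The test-set metrics named above can be computed as follows; this is a standard scikit-learn sketch, not code from the patent.

```python
from sklearn.metrics import accuracy_score, recall_score, f1_score

def test_metrics(y_true, y_pred):
    """Compute the test-set metrics reported in Table 2: accuracy,
    macro recall (Test_ma_R), macro F1 (Test_ma_f1), micro F1 (Test_mi_f1)."""
    return {
        "Test_acc": accuracy_score(y_true, y_pred),
        "Test_ma_R": recall_score(y_true, y_pred, average="macro"),
        "Test_ma_f1": f1_score(y_true, y_pred, average="macro"),
        "Test_mi_f1": f1_score(y_true, y_pred, average="micro"),
    }
```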
The method uses a linear fitting method based on local random disturbance sampling to explain which key features have the greatest contribution to the prediction result of the text classification model, fuses the features and the original labeled sample, and highlights the key features of the original sample, thereby improving the classification effect; the method can efficiently retrain the text classification model, so that the text classification result is more accurate.
The foregoing is a more detailed description of the present invention in connection with specific/preferred embodiments thereof, and it is not intended that the practice of the invention be limited to these descriptions. It will be apparent to those skilled in the art that various substitutions and modifications can be made to the described embodiments without departing from the spirit of the invention, and these substitutions and modifications should be considered to fall within the scope of the invention.

Claims (5)

1. A text classification method fusing text interpretation features is characterized by comprising the following operation steps:
step 1, training a neural-network-based text classification model to predict the category to which a sentence belongs;
step 2, obtaining the interpretation features of the sentence prediction results of step 1 by using a linear fitting method based on local random perturbation sampling;
step 3, selecting the key interpretation features beneficial to the classification effect according to the frequency and weight of the interpretation features acquired in step 2;
step 4, fusing the key interpretation features acquired in step 3 with the original data, and retraining the text classification model.
2. The method for classifying the text fusing the text interpretation features according to claim 1, wherein the training of the neural network-based text classification model in the step 1 is used for predicting the category to which the sentence belongs, and the specific steps include:
(1-1) input layer: the input to the text classification model is a set of sentences with category labels, $S = (S_1, S_2, S_3, \ldots, S_N)$, where $S_i$ denotes the $i$-th sentence in the data set and $N$ the number of sentences; $S_i = (w_1^i, w_2^i, \ldots, w_k^i)$, where $w_j^i$ denotes the $j$-th word in the $i$-th sentence and $k$ the number of words in the $i$-th sentence;
(1-2) sentence vectorization: word vectors are trained with GloVe; each word in the vocabulary $V = (w_1, w_2, w_3, \ldots, w_M)$ is converted into a 64-dimensional vector, producing a vectorized vocabulary $V' = (v_1, v_2, v_3, \ldots, v_M)$ of dimension $M \times 64$, where $w_i$ denotes a word in the vocabulary, $v_i$ the vectorized representation of $w_i$, and $M$ the number of distinct words appearing in the data set; looking words up in $V'$ converts the words of a sentence into their vector representations, so that sentence $S_i$ is represented as $S_i^v = (v_1^i, v_2^i, \ldots, v_k^i)$;
(1-3) linear layer: the vectorized sentence $S_i^v$ is input to a linear layer that predicts the sentence's category label:
$$y_l = l(S_i^v) = W^T S_i^v + b$$
where the prediction result $y_l$ is an array of num_class numbers (num_class being the predefined number of categories), each number giving the likelihood of the category at that position; $l$ denotes the linear transformation, and $W^T$ and $b$ are the parameters of the linear layer;
(1-4) softmax layer: the softmax function maps each value of the prediction result $y_l$ into the range $[0, 1]$:
$$\mathrm{softmax}(y_l^j) = \frac{e^{y_l^j}}{\sum_{j=1}^{num\_class} e^{y_l^j}}$$
where $y_l^j$ denotes the $j$-th value of the prediction result $y_l$; after every value of $y_l$ is transformed by the softmax function, the num_class values sum to 1;
(1-5) loss function: the final output of the model is the class label $y_{pre}$ corresponding to the maximum value in the prediction result; the loss function is defined as $loss(y_i, y_{pre}) = -y_{pre} \log(\mathrm{softmax}(y_i))$, where $loss(y_i, y_{pre})$ denotes the loss and $y_i$ is the label of the input sentence $S_i$;
(1-6) parameter optimization: the parameters of the text classification model are optimized with the goal of minimizing the loss function, yielding the trained text classification model.
3. The method for classifying texts fusing text interpretation features according to claim 1, wherein in the step 2, the linear fitting method based on local random disturbance sampling is used to obtain the interpretation features of the sentence prediction results in the step 1; the method comprises the following specific steps:
(2-1) select the sentence $S_i$ to be interpreted and sample near $S_i$ by random perturbation: $S_i = (w_1^i, w_2^i, \ldots, w_k^i)$ is a sentence of $k$ words in the original data set. $S_i$ is randomly perturbed to obtain sample instances, generating a data set containing many such samples, each given a vectorized 0/1 representation; the random perturbation process is as follows:
words of sentence $S_i$ are deleted at random, the number of deleted words being greater than 0 and less than $k$, giving a new sentence $\hat{S}_i^t = (w_1^t, w_2^t, \ldots, w_c^t)$, i.e., a randomly perturbed sample of $S_i$, where $w_j^t$ is the $j$-th word in the $t$-th randomly perturbed sample and $c$ is the number of words remaining after perturbation; a $1 \times k$ vector is initialized, the positions of the deleted words are set to 0 and the other positions to 1, giving the vectorized representation $z_t = (z_1^t, z_2^t, \ldots, z_k^t)$ of $\hat{S}_i^t$, in which every element $z_j^t \in \{0, 1\}$;
random perturbation is performed 4999 times, giving a new data set of 5000 sentences $X = (S_i^1, S_i^2, \ldots, S_i^{5000})$, where $S_i^1$ is the original sentence $S_i$, whose vectorized representation is a vector of $k$ ones; the vector matrix of the new data set $X$ is denoted $Z \in \{0, 1\}^{5000 \times k}$;
(2-2) label the newly generated data:
each instance in the data set $X$ is input to the trained text classification model for prediction to obtain the corresponding prediction result; denoting the trained text classification model by $f$, after steps (1-1) to (1-4) the prediction result $f(z_t)$ of each sample is an array of num_class numbers, each of which gives the probability of prediction as the corresponding category;
(2-3) compute the distance between every perturbed instance in the new data set $Z$ and the original instance, and use it as the perturbed instance's weight:
the closer a newly generated perturbed instance is to the original instance, the better it can explain the prediction, so it is given a higher weight; the weight of newly generated data is defined with an exponential kernel:
$$\pi_z(z_t) = \exp\!\left(-\frac{D(S_i, z_t)^2}{\sigma^2}\right)$$
where $\pi_z$ is an exponential kernel defined on the cosine distance $D$, representing the distance weight between samples: the smaller the distance, the larger the value of $\pi_z(z_t)$; $\sigma$ is the kernel width;
(2-4) fit the new data set $Z$ with a linear model: denoting the linear model by $g$, its formula is
$$g(z_t) = w_g \cdot z_t$$
where $z_t$ is a vector in the data set $Z$ and $w_g$ is the weight coefficient of the linear model;
(2-5) determine the coefficients of the linear model: a linear classification model is trained to determine the weight coefficients, with the loss defined as
$$L(f, g, \pi_z) = \sum_{t} \pi_z(z_t)\,\bigl(f(z_t) - g(z_t)\bigr)^2$$
Minimizing $L(f, g, \pi_z)$ yields the optimal linear model weights $w_g$, of dimension $num\_class \times k$, where $\hat{S}_i^t$ is the $t$-th perturbed instance and $z_t$ is its vector form;
(2-6) acquire the interpretation features and denoise: after the linear model is trained, $Fea_i = w_g \times S_i$ gives the interpretation features and weights for the different categories; for the $m$-th category, the features are sorted in descending order of the absolute value of their weights, auxiliary words, connectives, punctuation, and similar tokens are removed, and the first $T$ are selected as the interpretation features of sentence $S_i$ being predicted as the $m$-th category:
$$Fea_i^m = \{(f_1^i, e_1^m), (f_2^i, e_2^m), \ldots, (f_T^i, e_T^m)\}$$
where $Fea_i^m$ denotes the set of features, each with its corresponding weight, obtained by the model interpretation method for predicting the $i$-th sentence as the $m$-th category; $m$ is the label of a category, $1 \le m \le num\_class$; $f_j^i$ is the $j$-th feature of sentence $S_i$ and $e_j^m$ is the weight corresponding to $f_j^i$; a positive weight indicates that the model considers the feature to support classifying the $i$-th sample into the $m$-th category, and we call such a feature a positive feature; a negative weight indicates that the model considers the feature not to support classifying the $i$-th sample into the $m$-th category, and such a feature is called a negative feature.
4. The method for classifying texts fusing text interpretation features according to claim 1, wherein in the step 3, the key feature set is selected according to the frequency and weight of the obtained interpretation features, and the specific steps include:
(3-1) acquire all interpretation features of sentence $S_i$: $Fea_i$ denotes the union of the feature sets, obtained in step (2-6), of sentence $S_i$ being predicted as each category:
$$Fea_i = Fea_i^1 \cup Fea_i^2 \cup \ldots \cup Fea_i^{num\_class}$$
(3-2) compute the frequency and weight of each feature: since the same feature may appear in different categories, it may occur multiple times in $Fea_i$; the weights of all identical positive features in $Fea_i$ are summed, the features are sorted in descending order of weight, and the first $c_1$ features are taken, giving $Fea_i^{pos}$; in the same way, the weights of all identical negative features in $Fea_i$ are summed, the features are sorted in descending order of the absolute value of weight, and the first $c_2$ features are taken, giving $Fea_i^{neg}$; at the same time, the frequency of each negative feature in $Fea_i$ is computed, the features are sorted from high to low frequency, and the first $c_3$ features are taken, giving $Fea_i^{fre}$;
(3-3) acquire the key interpretation features of sentence $S_i$: the key interpretation feature set $Fea_i^{key}$ of sentence $S_i$ is finally obtained as the intersection of the three sets from step (3-2), and contains $p$ key interpretation features:
$$Fea_i^{key} = Fea_i^{pos} \cap Fea_i^{neg} \cap Fea_i^{fre} = \{f_1^i, f_2^i, \ldots, f_p^i\}$$
5. the method for classifying texts fusing text interpretation features according to claim 1, wherein in the step 4, the key interpretation features obtained in the step 3 are fused with the raw data, and the text classification model is retrained, and the specific steps include:
(4-1) acquire data fused with the key interpretation features: the acquired key interpretation features of sentence $S_i$ are used, together with sentence $S_i$ itself, as the input of the text classification model; the sentence fused with key interpretation features is denoted $S_i'$:
$$S_i' = (w_1^i, w_2^i, \ldots, w_k^i, f_1^i, f_2^i, \ldots, f_p^i)$$
where $w_1^i, \ldots, w_k^i$ are the $k$ words of sentence $S_i$ and $f_1^i, \ldots, f_p^i$ are the acquired $p$ key interpretation features of sentence $S_i$;
(4-2) retrain the text classification model: steps (2-1) to (4-1) are applied to all training and test samples to fuse in the key interpretation features, giving a new data set $S' = (S_1', S_2', S_3', \ldots, S_N')$; the text classification model is then retrained on the data set $S'$ according to the process of claim 2, and the resulting text classification results are more accurate.
CN202110521823.9A 2021-05-13 2021-05-13 Text classification method fusing text interpretation features Pending CN113590814A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110521823.9A CN113590814A (en) 2021-05-13 2021-05-13 Text classification method fusing text interpretation features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110521823.9A CN113590814A (en) 2021-05-13 2021-05-13 Text classification method fusing text interpretation features

Publications (1)

Publication Number Publication Date
CN113590814A true CN113590814A (en) 2021-11-02

Family

ID=78243402

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110521823.9A Pending CN113590814A (en) 2021-05-13 2021-05-13 Text classification method fusing text interpretation features

Country Status (1)

Country Link
CN (1) CN113590814A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108182186A (en) * 2016-12-08 2018-06-19 广东精点数据科技股份有限公司 A kind of Web page sequencing method based on random forests algorithm
CN110688491A (en) * 2019-09-25 2020-01-14 暨南大学 Machine reading understanding method, system, device and medium based on deep learning
CN111967354A (en) * 2020-07-31 2020-11-20 华南理工大学 Depression tendency identification method based on multi-modal characteristics of limbs and microexpressions
US20210012199A1 (en) * 2019-07-04 2021-01-14 Zhejiang University Address information feature extraction method based on deep neural network model

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108182186A (en) * 2016-12-08 2018-06-19 广东精点数据科技股份有限公司 A kind of Web page sequencing method based on random forests algorithm
US20210012199A1 (en) * 2019-07-04 2021-01-14 Zhejiang University Address information feature extraction method based on deep neural network model
CN110688491A (en) * 2019-09-25 2020-01-14 暨南大学 Machine reading understanding method, system, device and medium based on deep learning
CN111967354A (en) * 2020-07-31 2020-11-20 华南理工大学 Depression tendency identification method based on multi-modal characteristics of limbs and microexpressions

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
MARCO TULIO RIBEIRO et al.: ""Why Should I Trust You?": Explaining the Predictions of Any Classifier", KDD '16: The 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining *
ZHOU Qianrong: "Research on Deep Representation Learning for Sentence Classification" (面向句子分类的深度表示学习技术研究), China Doctoral Dissertations Full-text Database *
DAI Yaping et al.: "Theory and Application of Intelligent Multi-sensor Data Fusion" (《多传感器数据智能融合理论与应用 面向新工科普通高等教育系列教材》), China Machine Press, p. 143 *

Similar Documents

Publication Publication Date Title
US20220019745A1 (en) Methods and apparatuses for training service model and determining text classification category
CN109189925B (en) Word vector model based on point mutual information and text classification method based on CNN
CN110008338B (en) E-commerce evaluation emotion analysis method integrating GAN and transfer learning
CN110245229B (en) Deep learning theme emotion classification method based on data enhancement
CN106407333B (en) Spoken language query identification method and device based on artificial intelligence
CN107943784B (en) Relationship extraction method based on generation of countermeasure network
CN110033281B (en) Method and device for converting intelligent customer service into manual customer service
CN111738007B (en) Chinese named entity identification data enhancement algorithm based on sequence generation countermeasure network
CN111506732B (en) Text multi-level label classification method
CN112364638B (en) Personality identification method based on social text
CN108038492A (en) A kind of perceptual term vector and sensibility classification method based on deep learning
CN112052684A (en) Named entity identification method, device, equipment and storage medium for power metering
CN112966068A (en) Resume identification method and device based on webpage information
CN113255366B (en) Aspect-level text emotion analysis method based on heterogeneous graph neural network
CN111191031A (en) Entity relation classification method of unstructured text based on WordNet and IDF
CN114564563A (en) End-to-end entity relationship joint extraction method and system based on relationship decomposition
CN111581364B (en) Chinese intelligent question-answer short text similarity calculation method oriented to medical field
CN114841151B (en) Medical text entity relation joint extraction method based on decomposition-recombination strategy
Niyozmatova et al. Classification based on decision trees and neural networks
CN114691525A (en) Test case selection method and device
CN111651597A (en) Multi-source heterogeneous commodity information classification method based on Doc2Vec and convolutional neural network
CN115017879A (en) Text comparison method, computer device and computer storage medium
CN114722198A (en) Method, system and related device for determining product classification code
CN114239584A (en) Named entity identification method based on self-supervision learning
CN112989803A (en) Entity link model based on topic vector learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination