CN108595568B - Text emotion classification method based on great irrelevant multiple logistic regression - Google Patents

Info

Publication number
CN108595568B
Authority
CN
China
Prior art keywords
model
data
text
logistic regression
cost function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810332338.5A
Other languages
Chinese (zh)
Other versions
CN108595568A (en
Inventor
雷大江
张红宇
陈浩
张莉萍
吴渝
杨杰
程克非
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications
Priority to CN201810332338.5A
Publication of CN108595568A
Application granted
Publication of CN108595568B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis

Abstract

The invention provides a text emotion classification method based on maximally uncorrelated multiple logistic regression, which comprises the following steps: acquiring text data and preprocessing the text data; obtaining a cost function of a second model by introducing a correlated-parameter penalty term into the cost function of a first model; inputting the training data obtained by preprocessing into the derivative function of the cost function of the second model, and solving to obtain the second model, wherein the first model is a multiple logistic regression model and the second model is a maximally uncorrelated multiple logistic regression model; and inputting the data to be predicted obtained by preprocessing into the second model to obtain the emotion category to which each text entry in the data to be predicted belongs. Adding the uncorrelated constraint term makes the method more robust to redundant data; the complexity of the traditional multiple logistic regression model is reduced and its generalization capability is improved, so that the text entries in the acquired target text data can be classified accurately.

Description

Text emotion classification method based on great irrelevant multiple logistic regression
Technical Field
The invention relates to the field of machine learning, in particular to a text emotion classification method based on maximally uncorrelated multiple logistic regression.
Background
Classification is a key part of machine learning and data mining, with wide application in image recognition, drug development, speech recognition, handwriting recognition and the like. It is a supervised learning problem: identifying which class a new instance belongs to on the basis of a known training set. For a classification algorithm, non-linear classification capability and extensibility to multi-class problems are important.
A Support Vector Machine (SVM) is a classical binary classifier that uses the hinge loss and establishes the optimal boundary between data sets by solving a constrained quadratic optimization problem. Its main advantage over other algorithms is that, with different kernel functions, an SVM can be used for both linear and non-linear classification. However, SVMs are very limited in multi-class classification because they rely on a one-versus-one scheme, and despite many efforts to extend SVMs to multi-class classification, these methods still have many drawbacks. For example, the one-versus-rest decision method is strongly affected by class imbalance in the data set. Another important issue is that the same instance may be assigned to multiple classes. Although many methods have been proposed to address these problems, they introduce other adverse effects, such as reduced efficiency. The output of an SVM is purely a binary decision and does not support probabilistic output. The numerical output of an SVM on one task is not comparable with its numerical output on another task. Furthermore, compared with a confidence-based classifier, such unbounded values make it difficult for the end user to interpret what lies behind them.
Logistic Regression (LR) is one of the important classification methods. Standard logistic regression uses the logistic loss and classifies by a coefficient-weighted linear combination of the input variables. Through a non-linear mapping, logistic regression greatly reduces the weight of points far from the classification plane and increases the weight of the data points most relevant to classification. Unlike a support vector machine, it can give an estimate of the class distribution, and it also has a great advantage in model training time. The logistic regression model is relatively simple and easy to understand, and it is convenient to implement for large-scale linear classification. Furthermore, standard logistic regression is more easily extended to multi-class classification than a support vector machine. Improved variants of logistic regression, such as sparse logistic regression and weighted logistic regression, perform well in their respective fields.
However, logistic regression applies only to two-class problems and cannot be directly applied to multi-class (k > 2 classes) problems. Two kinds of extension are generally used to solve multi-class problems with logistic regression. The first establishes k independent binary classifiers, each of which marks the samples of one class as positive and the samples of all other classes as negative. For a given test sample, each classifier gives the probability that the sample belongs to its class, so multi-class classification can be performed by taking the class with the maximum probability. The second is Multiple Logistic Regression (MLR), a generalization of the logistic regression model to multi-class problems. Which method to choose generally depends on whether the classes are mutually exclusive. In multi-class problems the classes are usually mutually exclusive, so multiple logistic regression generally gives better results than logistic regression. In addition, multiple logistic regression only needs to be trained once, so it runs faster.
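The first extension described above (k independent binary classifiers, then taking the maximum class probability) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `one_vs_rest_predict` and the example parameter values are assumptions, and each binary classifier is assumed to be an already trained logistic regression given by its parameter vector.

```python
import math

def one_vs_rest_predict(binary_thetas, x):
    """Return (predicted class index, per-class probabilities).

    binary_thetas[j] is the parameter vector of the j-th binary classifier,
    which treats class j as positive and every other class as negative.
    """
    probs = []
    for theta in binary_thetas:
        score = sum(t * v for t, v in zip(theta, x))
        probs.append(1.0 / (1.0 + math.exp(-score)))  # sigmoid of the linear score
    # The predicted class is the one whose classifier is most confident.
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs

label, probs = one_vs_rest_predict([[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]], [1.0, 0.2])
```

Because the k classifiers are trained independently, the returned probabilities need not sum to 1, which is one practical difference from multiple logistic regression.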
In the field of computer information processing, a text data set usually contains a large amount of common information, which greatly increases the complexity and error of recognition. Although multiple logistic regression trains multiple groups of parameters to calculate the corresponding probability for each category, it does not consider whether the groups of parameters are correlated. Therefore, a text emotion classification method based on maximally uncorrelated multiple logistic regression has practical significance.
Disclosure of Invention
In order to solve the above technical problem, the invention provides a text emotion classification method based on maximally uncorrelated multiple logistic regression, which comprises the following steps:
acquiring text data and preprocessing the text data; the text data comprises training data and data to be predicted; the data to be predicted comprises a plurality of text entries;
obtaining a cost function of a second model by introducing a correlated-parameter penalty term into the cost function of a first model;
inputting the training data obtained by preprocessing into the derivative function of the cost function of the second model, and solving to obtain the second model; the first model is a multiple logistic regression model, and the second model is a maximally uncorrelated multiple logistic regression model;
and inputting the data to be predicted obtained by preprocessing into the second model to obtain the emotion category to which each text entry in the data to be predicted belongs.
Further, the step of inputting the preprocessed data to be predicted into the second model to obtain the emotion category to which each text entry in the data to be predicted belongs includes:
inputting each text entry in the data to be predicted obtained through preprocessing into the second model to obtain the text emotion category probability of each text entry;
setting a classification threshold;
when the text emotion category probability of the text entry is larger than the classification threshold value, judging that the text entry belongs to a first emotion category;
and when the text emotion category probability of the text entry is less than or equal to the classification threshold value, judging that the text entry belongs to a second emotion category.
Further, obtaining the cost function of the second model by introducing a correlated-parameter penalty term into the cost function of the first model includes:
obtaining a negative log-likelihood function of the model parameters of the first model;
acquiring an uncorrelated constraint term;
and introducing the uncorrelated constraint term into the cost function of the first model to obtain the cost function of the second model.
Further, the first model is:
Figure BDA0001628291250000031
wherein
Figure BDA0001628291250000032
The negative log-likelihood function of the parameter θ of the first model is:
Figure BDA0001628291250000033
the negative log-likelihood function is the cost function of the first model; where m is the number of independent samples.
Further, the uncorrelated constraint term is:
Figure BDA0001628291250000041
The uncorrelated constraint term is the correlated-parameter penalty term, where θ_i and θ_j are any two different sets of parameters;
the cost function of the second model is:
Figure BDA0001628291250000042
further, the derivative function of the cost function of the second model is:
Figure BDA0001628291250000043
The text emotion classification method based on maximally uncorrelated multiple logistic regression has the following technical effects:
On the basis of the traditional multiple logistic regression model, the cost function of the maximally uncorrelated multiple logistic regression model is obtained by introducing a correlated-parameter penalty term (the uncorrelated constraint term), and the maximally uncorrelated multiple logistic regression model is obtained from the derivative function of that cost function. Adding the uncorrelated constraint term makes the method more robust to redundant data; the complexity of the traditional multiple logistic regression model is reduced, and the resulting new classification model (the maximally uncorrelated multiple logistic regression model) has stronger generalization capability, so that the text entries in the acquired target text data can be classified accurately.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions and advantages of the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flowchart of a text emotion classification method based on maximally uncorrelated multiple logistic regression according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating an example of target text data provided by an embodiment of the invention;
FIG. 3 is a flowchart of a method for determining an emotion classification to which each text entry belongs according to an embodiment of the present invention;
FIG. 4 is a flowchart of a method for obtaining a cost function of a second model according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating the magnitudes of the MLR and UMLR parameter norms on the MNIST data set provided in an embodiment of the present invention;
FIG. 6 is a diagram illustrating the magnitudes of the MLR and UMLR parameter norms on the COIL20 data set provided in an embodiment of the present invention;
FIG. 7 is a diagram illustrating the magnitudes of the MLR and UMLR parameter norms on the ORL data set provided in an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in other sequences than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be noted that the prior-art Logistic Regression (LR) algorithm and l2-constrained multiple logistic regression (RMLR) algorithm have shortcomings in classification applications, so an improved algorithm, maximally uncorrelated multiple logistic regression, is proposed.
Logistic Regression (LR) algorithm:
For logistic regression, assume a data set D = {(x_i, y_i)}, i = 1, …, N, with x_i ∈ R^D and y_i ∈ {0, 1}; the input vector is x = (x^(1), …, x^(D)) and the class label y is binary: y = 0 or 1. Logistic Regression (LR) is a probabilistic model based on:
$$h_\theta(x) = g(\theta^T x)$$
where
$$g(z) = \frac{1}{1 + e^{-z}}$$
is called the logistic function or sigmoid function.
For the binary problem, assume that y takes the value 0 or 1 and follows a Bernoulli distribution; then:
p(y = 1 | x; θ) = h_θ(x)
p(y = 0 | x; θ) = 1 - h_θ(x)
The two formulas can be combined as follows:
$$p(y \mid x; \theta) = h_\theta(x)^{y} \, (1 - h_\theta(x))^{1-y} \qquad (2)$$
where y ∈ {0, 1}. Assuming that the m samples are independent, the likelihood function of the parameter θ can be written as:
$$L(\theta) = \prod_{i=1}^{m} p(y_i \mid x_i; \theta) = \prod_{i=1}^{m} h_\theta(x_i)^{y_i} \, (1 - h_\theta(x_i))^{1-y_i}$$
The log-likelihood function can be expressed as:
$$l(\theta) = \log L(\theta) = \sum_{i=1}^{m} \left[ y_i \log h_\theta(x_i) + (1 - y_i) \log (1 - h_\theta(x_i)) \right]$$
The optimal θ can be obtained by maximizing l(θ). Usually one sets
$$J(\theta) = -\frac{1}{m} \, l(\theta)$$
to obtain the loss function corresponding to l(θ), and the optimal θ is solved for by minimizing this loss function. However, logistic regression can only deal with the two-class problem and cannot be directly applied to the multi-class problem.
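As a concrete illustration of the loss described above, the following minimal sketch evaluates the sigmoid function and the averaged negative log-likelihood for a small data set. It assumes the common scaling J(θ) = -l(θ)/m; the function names are illustrative and not taken from the patent.

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def negative_log_likelihood(theta, xs, ys):
    """Averaged negative log-likelihood of binary logistic regression.

    theta: parameter vector; xs: list of input vectors; ys: list of 0/1 labels.
    """
    total = 0.0
    for x, y in zip(xs, ys):
        h = sigmoid(sum(t * v for t, v in zip(theta, x)))
        # Log-likelihood contribution of one sample.
        total += y * math.log(h) + (1 - y) * math.log(1.0 - h)
    return -total / len(xs)

# With theta = 0 every sample has probability 0.5, so the loss equals log(2).
loss_at_zero = negative_log_likelihood([0.0, 0.0], [[1.0, 2.0], [3.0, 4.0]], [0, 1])
```

Minimizing this quantity over θ is equivalent to maximizing l(θ).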
l2-constrained multiple logistic regression (RMLR) algorithm:
To address the inability of traditional logistic regression to handle multi-class problems, Multiple Logistic Regression (MLR) adapts to multi-class classification by modifying the cost function of logistic regression.
Suppose a data set D = {(x_i, y_i)}, i = 1, …, N, with x_i ∈ R^D and y_i ∈ {1, …, k} (k > 2), and an input vector x = (x^(1), …, x^(D)). Multiple Logistic Regression (MLR) is a probabilistic model based on:
$$p(y = j \mid x; \theta) = \frac{e^{\theta_j^T x}}{\sum_{l=1}^{k} e^{\theta_l^T x}}, \qquad j = 1, \dots, k$$
wherein
Figure BDA0001628291250000072
The cost function is:
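$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{k} 1\{y_i = j\} \log \frac{e^{\theta_j^T x_i}}{\sum_{l=1}^{k} e^{\theta_l^T x_i}}$$
where 1{·} is the indicator function and m is the number of samples.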
However, multiple logistic regression has an unusual property: its set of parameters is "redundant". Suppose the vector ψ is subtracted from each parameter vector θ_j, so that each θ_j becomes θ_j - ψ (j = 1, …, k). The hypothesis function then becomes:
$$p(y = j \mid x; \theta) = \frac{e^{(\theta_j - \psi)^T x}}{\sum_{l=1}^{k} e^{(\theta_l - \psi)^T x}} = \frac{e^{\theta_j^T x} \, e^{-\psi^T x}}{\sum_{l=1}^{k} e^{\theta_l^T x} \, e^{-\psi^T x}} = \frac{e^{\theta_j^T x}}{\sum_{l=1}^{k} e^{\theta_l^T x}}$$
This shows that subtracting ψ from θ_j does not affect the prediction of the hypothesis function at all; that is, the multiple logistic regression model described above has redundant parameters.
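The redundancy described above can be checked numerically. The sketch below computes the class probabilities of a multiple logistic regression model before and after subtracting the same vector ψ from every parameter vector; the function names and the example values are illustrative assumptions.

```python
import math

def softmax_probs(thetas, x):
    """Class probabilities p(y = j | x) for parameter vectors thetas[j]."""
    scores = [sum(t * v for t, v in zip(theta, x)) for theta in thetas]
    top = max(scores)  # subtract the maximum score for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def shift_parameters(thetas, psi):
    """Subtract the same vector psi from every parameter vector."""
    return [[t - p for t, p in zip(theta, psi)] for theta in thetas]

thetas = [[1.0, 2.0], [0.5, -1.0], [0.0, 0.3]]
x = [0.2, -0.7]
before = softmax_probs(thetas, x)
after = softmax_probs(shift_parameters(thetas, [5.0, -3.0]), x)
# before and after agree up to floating-point rounding.
```

The common factor exp(-ψᵀx) cancels between numerator and denominator, so infinitely many parameter settings give identical predictions.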
To address the over-parameterization of the multiple logistic regression model, the l2-constrained multiple logistic regression (RMLR) algorithm modifies the cost function by adding a weight decay term that penalizes excessively large parameter values, turning the cost function into a strictly convex function and thus guaranteeing a unique solution. The cost function is:
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{k} 1\{y_i = j\} \log \frac{e^{\theta_j^T x_i}}{\sum_{l=1}^{k} e^{\theta_l^T x_i}} + \frac{\lambda}{2} \sum_{i=1}^{k} \sum_{j=1}^{D} \theta_{ij}^2$$
The Hessian matrix is now invertible, and since the cost function is convex, an optimization algorithm is guaranteed to converge to the globally optimal solution. Although the l2-constrained multiple logistic regression (RMLR) algorithm alleviates the overfitting problem to some extent, it performs poorly on data sets with redundant information.
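A minimal sketch of such a weight-decayed cost is given below, assuming the usual form (λ/2)·Σθ² for the weight decay term and class labels encoded as indices 0 … k-1 (an encoding chosen here for convenience, not taken from the patent). The function names are illustrative.

```python
import math

def softmax_nll(thetas, xs, ys):
    """Averaged negative log-likelihood of multiple logistic regression."""
    total = 0.0
    for x, y in zip(xs, ys):
        scores = [sum(t * v for t, v in zip(theta, x)) for theta in thetas]
        top = max(scores)
        # log of the softmax normalizer, computed stably
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        total += scores[y] - log_norm
    return -total / len(xs)

def weight_decay(thetas, lam):
    """Assumed weight decay term: (lam / 2) * sum of squared parameters."""
    return 0.5 * lam * sum(t * t for theta in thetas for t in theta)

def rmlr_cost(thetas, xs, ys, lam):
    """Negative log-likelihood plus weight decay."""
    return softmax_nll(thetas, xs, ys) + weight_decay(thetas, lam)

xs, ys = [[1.0, 0.0], [0.0, 1.0]], [0, 1]
a = [[1.0, 2.0], [3.0, 0.0]]
b = [[0.0, 1.0], [2.0, -1.0]]  # a with the vector (1, 1) subtracted from each row
```

Here `softmax_nll(a, xs, ys)` and `softmax_nll(b, xs, ys)` coincide, while `weight_decay` differs between `a` and `b`; this is how the added term removes the redundancy and singles out one solution.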
Based on the above analysis, a maximally uncorrelated multiple logistic regression model is proposed. Specifically, this embodiment provides a text emotion classification method based on maximally uncorrelated multiple logistic regression; as shown in FIG. 1, the method includes:
S101, acquiring text data and preprocessing the text data; the text data comprises training data and data to be predicted; the data to be predicted comprises a plurality of text entries;
For example, review text data written by consumers after visiting a store is read; the text data consists of the consumers' post-visit comments. As shown in FIG. 2, the first column is the text label column, where 0 represents a positive comment and 1 represents a negative comment, and the second column is the consumer review column. Because the original text data contains a large amount of noise, it is not suitable for direct training and must first be preprocessed. Preprocessing the review text data set specifically comprises:
in step S103 and step S106, the method for preprocessing the RDD includes:
acquiring the separator characters in the text comment sentence to be processed, and replacing them with empty strings;
acquiring special character strings, numbers and the like in the comment sentence, and replacing them with empty strings;
acquiring the words that express a fuzzy tone in the comment sentence, and converting them into words of absolute expression;
adding a custom dictionary, and adding the higher-frequency nouns in the text comment sentences to be processed to the custom dictionary;
performing word segmentation on the processed comment sentence, and filtering out the stop words in it;
and performing vector conversion on the words of the segmented comment sentence to generate word vectors.
Specifically, the method for preprocessing the text to be processed comprises the following steps:
Matching the segments of a comment that begin and end with "#" using a pattern compiled with re.compile (of the form '#(...)#'), and replacing them with an empty string; re is the Python regular expression module, whose functions can be called directly to perform regular-expression matching on strings.
Matching special character strings, numbers and the like in the comment using the re module with the pattern u'[^\u4e00-\u9fa5|a-zA-Z]+', and replacing them with an empty string;
replacing the comment text with a function flash. Fuzzy-tone expressions are converted into absolute expressions; for example, "not so great" is replaced with "bad", and "not particularly" with "not";
adding a custom dictionary: for nouns that occur with high frequency in the text data set, new nouns are added to the dictionary, which improves word segmentation accuracy; adding scene-specific nouns to the custom dictionary allows word segmentation to be completed more efficiently and accurately.
Segmenting the comment with the function jieba.cut(), and filtering out the stop words in the comment; a stop word is a word that does not help the text classification target, such as 'of', 'on', 'o', etc.; different scenes have different stop word lists, and the stop words in the text are deleted according to the corresponding list.
The comment data set that has completed the word segmentation is converted into a word2vec model using the function generic.
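The cleaning steps above can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions: the "#"-segment pattern, the fuzzy-to-absolute mapping and the stop-word list are hypothetical examples, word segmentation (done with jieba in the text) is assumed to have already produced a token list, and only the character-class pattern is taken from the description.

```python
import re

# Hypothetical pattern for a '#'-delimited segment such as "#topic#".
TOPIC_RE = re.compile(r"#[^#]*#")
# Character class quoted in the text: keeps CJK characters, ASCII letters and '|'.
NON_TEXT_RE = re.compile(r"[^\u4e00-\u9fa5|a-zA-Z]+")
# Assumed example mapping: "不怎么样" (not so great) -> "不好" (bad).
FUZZY_TO_ABSOLUTE = {"不怎么样": "不好"}
# Assumed example stop-word list: "的" is a common Chinese particle.
STOP_WORDS = {"的"}

def clean_comment(text):
    """Remove '#...#' segments and non-text characters, then normalize fuzzy tone."""
    text = TOPIC_RE.sub("", text)
    text = NON_TEXT_RE.sub("", text)
    for fuzzy, absolute in FUZZY_TO_ABSOLUTE.items():
        text = text.replace(fuzzy, absolute)
    return text

def filter_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop stop words from an already segmented comment."""
    return [tok for tok in tokens if tok not in stop_words]

cleaned = clean_comment("#双十一#这家店的菜不怎么样123!!")
tokens = filter_stop_words(["这家", "店", "的", "菜", "不好"])
```

In a real pipeline the token list would come from a word segmenter applied to `cleaned`, and the mapping and stop-word list would be chosen for the specific scene.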
For a comment, a word vector of each word is generated by using a word2vec model, and the word vectors of all the words in the comment are averaged according to dimensions to obtain word vector representation of the comment. Assume that the comment dataset contains n non-identical words. If a sentence contains m words, the word vector of each word of the sentence is shown as equation (81):
Figure BDA0001628291250000091
the word vector of the sentence is as shown in equation (82):
Figure BDA0001628291250000092
where
Figure BDA0001628291250000093
and repeating the step of generating the word vector corresponding to each word by using the word2vec model, thereby obtaining the word vector representation of the whole comment data set.
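The dimension-wise averaging described above can be sketched as follows. The sketch assumes the word vectors are already available as a plain dictionary from word to vector (in the text they come from a word2vec model); the function name, the handling of unknown words and the toy vectors are illustrative assumptions.

```python
def average_word_vectors(tokens, embeddings, dim):
    """Dimension-wise mean of the vectors of the tokens found in `embeddings`.

    Tokens without a vector are skipped; a comment with no known token
    is represented by the zero vector (an assumption of this sketch).
    """
    vectors = [embeddings[tok] for tok in tokens if tok in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(vec[d] for vec in vectors) / len(vectors) for d in range(dim)]

toy_embeddings = {"good": [1.0, 3.0], "food": [3.0, 5.0]}
comment_vector = average_word_vectors(["good", "food", "unknown"], toy_embeddings, 2)
```

Applying the function to every comment yields one fixed-length vector per comment, which is the input format the classifier expects.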
S102, on the basis of the cost function of the first model, obtaining a cost function of a second model by introducing a related parameter penalty term;
S103, inputting the training data obtained through preprocessing into the derivative function of the cost function of the second model, and solving to obtain the second model; the first model is a multiple logistic regression model, and the second model is a maximally uncorrelated multiple logistic regression model;
and S104, inputting the data to be predicted obtained through preprocessing into the second model to obtain the emotion category of each text entry in the data to be predicted.
That is, a correlated-parameter penalty term is introduced and a maximally uncorrelated multiple logistic regression model is established; the processed, formatted data is input into the model, and the emotion category of each review text is predicted.
Specifically, in step S104, inputting the preprocessed data to be predicted into the second model to obtain the emotion category to which each text entry in the data to be predicted belongs, as shown in FIG. 3, includes:
S104a, inputting each text entry in the data to be predicted obtained through preprocessing into the second model to obtain the text emotion category probability of each text entry;
S104b, setting a classification threshold;
Preferably, for the binary classification case of multiple logistic regression, the classification threshold is 0.5.
S104c, when the text emotion category probability of the text entry is larger than the classification threshold, judging that the text entry belongs to a first emotion category;
S104d, when the text emotion category probability of the text entry is smaller than or equal to the classification threshold, judging that the text entry belongs to a second emotion category.
For example, the classification threshold is set to 0.5. When the probability computed by the model that a sample belongs to the category is greater than 0.5, the comment is marked as 1, representing a positive comment. When the probability is less than or equal to 0.5, the comment is marked as 0, representing a negative comment.
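The thresholding rule in steps S104b to S104d amounts to a single comparison. A minimal sketch, with an illustrative function name and the 1/0 marking taken from the example above:

```python
def classify_by_threshold(probability, threshold=0.5):
    """Return 1 (first emotion category) if probability > threshold, else 0."""
    return 1 if probability > threshold else 0

labels = [classify_by_threshold(p) for p in (0.7, 0.5, 0.2)]
```

Note that a probability exactly equal to the threshold falls into the second category, matching the "less than or equal to" condition of step S104d.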
In the field of computer information processing, a data set usually contains a large amount of common information, which greatly increases the complexity and error of recognition. Although multiple logistic regression trains k groups of parameters to calculate the corresponding probability for each category, it does not consider whether the k groups of parameters are correlated. If the parameters (θ_1, θ_2, …, θ_k) are a minimum point of the cost function, any parameter θ_i can be linearly represented by the other θ_j (j ≠ i), i.e.
θ_i = λ_0 + Σ_{j≠i} λ_j θ_j (9)
This indicates that the parameters of different classes are correlated. The l2 regularization term
Figure BDA0001628291250000101
constrains the elements within each group of parameters, but it does not consider the correlation between the parameters of different classes, so classification performance is poor on data sets with much redundant information. For any two different sets of parameters θ_i and θ_j, according to the basic inequality:
Figure BDA0001628291250000111
where the maximum value is obtained if and only if θ_i = θ_j.
If θ_i and θ_j are correlated, i.e. θ_i = λ_0 + λ_j θ_j, then
Figure BDA0001628291250000112
takes a large value, so we add an uncorrelated constraint term:
Figure BDA0001628291250000113
This constraint penalizes correlated parameters, so that as many uncorrelated, discriminative features as possible are retained. And because
Figure BDA0001628291250000114
the cost function is obtained as:
Figure BDA0001628291250000115
To use an optimization algorithm, the derivative of J(θ) is found as follows:
Figure BDA0001628291250000116
Based on this derivation, the uncorrelated parameters θ can be obtained quickly by the gradient descent algorithm and its improved variants.
Specifically, in this embodiment, obtaining the cost function of the second model in step S102 by introducing a correlated-parameter penalty term into the cost function of the first model, as shown in FIG. 4, includes:
S102a, obtaining a negative log-likelihood function of the model parameters of the first model;
S102b, acquiring an uncorrelated constraint term;
S102c, introducing the uncorrelated constraint term into the cost function of the first model to obtain the cost function of the second model.
Correspondingly, the first model is:
Figure BDA0001628291250000117
wherein
Figure BDA0001628291250000118
The negative log-likelihood function of the parameter θ of the first model is:
Figure BDA0001628291250000121
The negative log-likelihood function is the cost function of the first model, where m is the number of independent samples. Further, the uncorrelated constraint term is:
Figure BDA0001628291250000122
The uncorrelated constraint term is the correlated-parameter penalty term, where θ_i and θ_j are any two different sets of parameters. The cost function of the second model is:
Figure BDA0001628291250000123
further, the derivative function of the cost function of the second model is:
Figure BDA0001628291250000124
Based on the above, the algorithm comprises the following steps:
Input: training set D = {(x_1, y_1), (x_2, y_2), …, (x_m, y_m)};
Process:
Initialize λ, η, Θ
While stopping criteria are not satisfied do:
    For j = 1, 2, …, k:
        Figure BDA0001628291250000125
        Figure BDA0001628291250000126
    Θ = L-BFGS(Loss, dΘ)
Output: regression coefficients Θ
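The overall shape of this training loop can be sketched in Python. Several points are assumptions of the sketch and not the patent's formulas: the pairwise penalty λ·Σ_{i<j}(θ_i·θ_j)² is a hypothetical stand-in for the uncorrelated constraint term, plain gradient descent with a fixed number of steps replaces L-BFGS and the stopping criterion, and labels are encoded as indices 0 … k-1.

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def loss_and_grad(theta, xs, ys, lam):
    """Softmax negative log-likelihood plus the ASSUMED pairwise penalty
    lam * sum_{i<j} (theta_i . theta_j)^2, together with the gradient."""
    k, d, m = len(theta), len(theta[0]), len(xs)
    loss = 0.0
    grad = [[0.0] * d for _ in range(k)]
    for x, y in zip(xs, ys):
        probs = softmax([dot(t, x) for t in theta])
        loss -= math.log(probs[y]) / m
        for j in range(k):
            coeff = (probs[j] - (1.0 if j == y else 0.0)) / m
            for c in range(d):
                grad[j][c] += coeff * x[c]
    for i in range(k):
        for j in range(k):
            if i == j:
                continue
            s = dot(theta[i], theta[j])
            if i < j:
                loss += lam * s * s  # count each unordered pair once
            for c in range(d):
                grad[i][c] += 2.0 * lam * s * theta[j][c]
    return loss, grad

def train(xs, ys, k, lam=0.01, eta=0.5, steps=200):
    """Gradient descent from zero initialization; returns (theta, loss history)."""
    theta = [[0.0] * len(xs[0]) for _ in range(k)]
    history = []
    for _ in range(steps):
        loss, grad = loss_and_grad(theta, xs, ys, lam)
        history.append(loss)
        theta = [[t - eta * g for t, g in zip(tr, gr)] for tr, gr in zip(theta, grad)]
    return theta, history

def predict(theta, x):
    """Index of the class with the largest score."""
    scores = [dot(t, x) for t in theta]
    return scores.index(max(scores))

xs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
ys = [0, 0, 1, 1]
theta, history = train(xs, ys, k=2)
```

Swapping in a different penalty only requires changing the second loop of `loss_and_grad`; the outer structure (initialize, compute loss and gradient, update Θ, output Θ) follows the steps listed above.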
Further, a convergence analysis is performed on the maximally uncorrelated multiple logistic regression algorithm:
From the loss function of maximally uncorrelated multiple logistic regression:
Figure BDA0001628291250000127
it is possible to obtain:
Figure BDA0001628291250000131
J(θ) is a strictly convex function because the second derivative of J(θ) is always greater than 0.
The convergence of the algorithm can be verified following the analysis used in the online learning framework and the convergence analysis of the Adam algorithm.
Further, the maximally uncorrelated multiple logistic regression (UMLR) algorithm proposed by the present invention was evaluated. The experiments focus on two questions: classification accuracy and execution speed. The classification algorithms compared are multiple logistic regression with weight decay, the support vector machine, and uncorrelated multiple logistic regression. The experiments use artificial data sets with different degrees of correlation and four real data sets (MNIST, COIL20, GT and ORL), and the results are verified by cross-validation.
(1) Normalization
Suppose Φ(x)_min and Φ(x)_max are respectively the minimum and the maximum in the data set. For one instance, the normalization is computed as follows:
Figure BDA0001628291250000132
Normalization converts dimensional quantities into dimensionless ones, which addresses the problem of unbalanced contributions from different features.
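A common way to normalize with a minimum and a maximum is min-max scaling, which maps each value to (value - min) / (max - min). The sketch below assumes this standard form, which may differ in notation from the formula in the figure; the function name and the handling of constant features are illustrative.

```python
def min_max_normalize(values):
    """Scale a list of numbers to the range [0, 1] by min-max normalization."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max_normalize([2.0, 4.0, 6.0])
```

Applied per feature, this puts all features on the same scale so that no single feature dominates the cost function.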
(2) Experimental results on an artificial data set
To verify the effectiveness of the algorithm on linearly correlated data, we generated artificial data sets as follows: the intra-class correlation is greater than 0.9, and the inter-class similarity is 0.5, 0.6, 0.7, 0.8 and 0.9 respectively.
The sample size and data dimension are (m, n) = (5000, 1000), with 5 classes in total and 1000 samples per class.
The following compares the recognition rates of the maximally uncorrelated multiple logistic regression algorithm and the l2-constrained multiple logistic regression algorithm on data with different degrees of correlation.
TABLE 1 Recognition rates of MLR and UMLR on data sets with different degrees of correlation
Figure BDA0001628291250000133
Figure BDA0001628291250000141
(3) Experimental results on MNIST and COIL20 data sets
The MNIST data set is widely used in the field of pattern recognition. It contains 10 categories corresponding to the handwritten digits 0-9, with 5000 pictures per category. The COIL20 data set has 20 different categories with 72 pictures in each category.
TABLE 2 Recognition rates of SVM, MLR and UMLR on the MNIST and COIL20 data sets
Figure BDA0001628291250000142
The table above reports the accuracy of the three algorithms on the two data sets. FIG. 5 shows the magnitudes of the MLR and UMLR parameter norms on the MNIST data set; FIG. 6 shows the magnitudes of the MLR and UMLR parameter norms on the COIL20 data set. In FIG. 5 and FIG. 6, the left side is the histogram of UMLR parameter norms for the corresponding data set, and the right side is the histogram of MLR parameter norms for the corresponding data set.
(4) Experimental results on the GT and ORL datasets
The GT data set has 50 categories, each containing 15 pictures. The ORL dataset contains 20 categories, each containing 10 pictures.
TABLE 3 Recognition rates of SVM, MLR and UMLR on the GT and ORL data sets
Figure BDA0001628291250000143
FIG. 7 shows the magnitudes of the MLR and UMLR parameter norms on the ORL data set; the left side of FIG. 7 is the histogram of UMLR parameter norms, and the right side of FIG. 7 is the histogram of MLR parameter norms.
(5) Analysis of Experimental results
The experimental results show that the maximally uncorrelated multiple logistic regression achieves higher classification accuracy than both the $\ell_2$-constrained multiple logistic regression algorithm and the support vector machine algorithm. The effect is especially pronounced on data sets with high inter-class correlation, which indicates that the method is highly robust to redundant data. Its converged parameters are also smaller than those of the $\ell_2$-constrained multiple logistic regression, which generally means that it has stronger generalization capability.
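The comparison of converged parameter sizes can be sketched as a per-class norm computation. This is an illustration under assumptions: the text does not say which norm the figures use, so the Euclidean norm is assumed here, and the function names are hypothetical.

```python
import math


def parameter_norms(theta):
    """Return the Euclidean norm of each class's parameter vector.

    theta is a list with one parameter vector (list of floats) per class.
    """
    return [math.sqrt(sum(w * w for w in row)) for row in theta]


def mean_norm(theta):
    """Average parameter norm over all classes."""
    norms = parameter_norms(theta)
    return sum(norms) / len(norms)
```

Given the converged parameter matrices of two trained models, comparing `mean_norm` for each (or plotting `parameter_norms` as a histogram) gives the kind of comparison that FIG. 5 to FIG. 7 describe.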
Classification is an important branch of pattern recognition and data mining with an increasingly wide range of applications, and it is gradually becoming a core, key technology in systems for public security and criminal investigation, electronic payment, medical treatment and the like.
The invention provides a maximally uncorrelated multiple logistic regression model, which constructs a novel classifier on top of the basic multiple logistic regression model. The experimental results show that the method outperforms traditional classification algorithms in both classification accuracy and classification robustness. Moreover, the trained model is more interpretable than methods such as the support vector machine and naive Bayes.
In summary, the text emotion classification method based on maximally uncorrelated multiple logistic regression provided by the invention has the following technical effects:
on the basis of a traditional multiple logistic regression model, the cost function of the maximally uncorrelated multiple logistic regression model is obtained by introducing a correlated-parameter penalty term (the uncorrelated constraint term), and the maximally uncorrelated multiple logistic regression model is then obtained from the derivative function of that cost function. Adding the uncorrelated constraint term makes the method more robust to redundant data; it also reduces the complexity of the traditional multiple logistic regression model, so that the new classification model (the maximally uncorrelated multiple logistic regression model) has stronger generalization capability. As a result, the text entries in the acquired target text data can be classified accurately.
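The idea of adding a penalty on correlated parameters to the multiple logistic regression cost and solving through its derivative can be sketched as follows. This is a hedged illustration, not the patented formula: the actual constraint term is given only in the figures of the claims, so the penalty used here (the sum of squared inner products between the parameter vectors of distinct classes) is an assumption, as are all function names, the weight `lam`, and the plain gradient-descent solver.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def class_probabilities(theta, x):
    """Softmax over the class scores theta[c] . x (numerically stabilized)."""
    scores = [dot(t, x) for t in theta]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def correlation_penalty(theta):
    """ASSUMED penalty: squared inner products over distinct class pairs."""
    k = len(theta)
    return sum(dot(theta[i], theta[j]) ** 2
               for i in range(k) for j in range(i + 1, k))


def cost(theta, samples, labels, lam):
    """Average negative log-likelihood plus lam times the assumed penalty."""
    nll = -sum(math.log(class_probabilities(theta, x)[y])
               for x, y in zip(samples, labels)) / len(samples)
    return nll + lam * correlation_penalty(theta)


def gradient(theta, samples, labels, lam):
    """Analytic derivative of cost() with respect to every parameter."""
    k, n, m = len(theta), len(theta[0]), len(samples)
    grad = [[0.0] * n for _ in range(k)]
    for x, y in zip(samples, labels):
        probs = class_probabilities(theta, x)
        for c in range(k):
            err = probs[c] - (1.0 if c == y else 0.0)
            for d in range(n):
                grad[c][d] += err * x[d] / m
    # Penalty part: d/d theta_c = 2 * sum over j != c of (theta_c . theta_j) theta_j
    for c in range(k):
        for j in range(k):
            if j != c:
                coeff = 2.0 * lam * dot(theta[c], theta[j])
                for d in range(n):
                    grad[c][d] += coeff * theta[j][d]
    return grad


def train(samples, labels, num_classes, lam=0.01, rate=0.5, steps=200):
    """Plain gradient descent starting from all-zero parameters."""
    theta = [[0.0] * len(samples[0]) for _ in range(num_classes)]
    for _ in range(steps):
        grad = gradient(theta, samples, labels, lam)
        theta = [[t - rate * g for t, g in zip(t_row, g_row)]
                 for t_row, g_row in zip(theta, grad)]
    return theta


def predict(theta, x):
    """Return the index of the most probable class for feature vector x."""
    probs = class_probabilities(theta, x)
    return probs.index(max(probs))
```

For text data, each sample would be the preprocessed feature vector of one text entry and each label its emotion category. Setting `lam` to 0 recovers ordinary multiple logistic regression without any constraint, which makes the effect of the penalty easy to compare.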
It should be noted that: the sequence of the above embodiments of the present invention is only for description, and does not represent the advantages or disadvantages of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (6)

1. A text emotion classification method based on maximally uncorrelated multiple logistic regression, characterized by comprising the following steps:
acquiring text data and preprocessing the text data; the text data comprises training data and data to be predicted; the data to be predicted comprises a plurality of text entries;
obtaining the cost function of a second model by introducing a correlated-parameter penalty term on the basis of the cost function of a first model, which comprises: obtaining an uncorrelated constraint term;
inputting the training data obtained by preprocessing into a derivative function of the cost function of the second model, and solving to obtain the second model; the first model is a multiple logistic regression model, and the second model is a maximally uncorrelated multiple logistic regression model;
inputting the data to be predicted obtained by preprocessing into the second model to obtain the emotion category to which each text entry in the data to be predicted belongs;
wherein the first model is:
Figure FDA0003459961600000011
wherein
Figure FDA0003459961600000012
the uncorrelated constraint term is:
Figure FDA0003459961600000013
the uncorrelated constraint term is the correlated-parameter penalty term; wherein $\theta_i$ and $\theta_j$ are any two different sets of parameters.
2. The method of claim 1, wherein the inputting the preprocessed data to be predicted into the second model to obtain an emotion category to which each text entry in the data to be predicted belongs comprises:
inputting each text entry in the data to be predicted obtained through preprocessing into the second model to obtain the text emotion category probability of each text entry;
setting a classification threshold;
when the text emotion category probability of the text entry is larger than the classification threshold value, judging that the text entry belongs to a first emotion category;
and when the text emotion category probability of the text entry is less than or equal to the classification threshold value, judging that the text entry belongs to a second emotion category.
3. The method according to claim 1 or 2, wherein the obtaining the cost function of the second model by introducing a correlated-parameter penalty term on the basis of the cost function of the first model further comprises:
obtaining a negative log-likelihood function of a model parameter of the first model;
and introducing the uncorrelated constraint term into the cost function of the first model to obtain the cost function of the second model.
4. The method of claim 3, wherein the negative log-likelihood function for the parameter θ of the first model is:
Figure FDA0003459961600000021
the negative log-likelihood function is the cost function of the first model; where m is the number of independent samples.
5. The method of claim 4, wherein the cost function of the second model is:
Figure FDA0003459961600000022
6. The method of claim 5, wherein the derivative function of the cost function of the second model is:
Figure FDA0003459961600000023
CN201810332338.5A 2018-04-13 2018-04-13 Text emotion classification method based on great irrelevant multiple logistic regression Active CN108595568B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810332338.5A CN108595568B (en) 2018-04-13 2018-04-13 Text emotion classification method based on great irrelevant multiple logistic regression

Publications (2)

Publication Number Publication Date
CN108595568A CN108595568A (en) 2018-09-28
CN108595568B true CN108595568B (en) 2022-05-17

Family

ID=63622383

Country Status (1)

Country Link
CN (1) CN108595568B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004094583A (en) * 2002-08-30 2004-03-25 Ntt Advanced Technology Corp Method of classifying writings
CN103473380A (en) * 2013-09-30 2013-12-25 南京大学 Computer text sentiment classification method
CN103514279A (en) * 2013-09-26 2014-01-15 苏州大学 Method and device for classifying sentence level emotion
CN103955451A (en) * 2014-05-15 2014-07-30 北京优捷信达信息科技有限公司 Method for judging emotional tendentiousness of short text
CN104239554A (en) * 2014-09-24 2014-12-24 南开大学 Cross-domain and cross-category news commentary emotion prediction method
CN104462408A (en) * 2014-12-12 2015-03-25 浙江大学 Topic modeling based multi-granularity sentiment analysis method
CN105389583A (en) * 2014-09-05 2016-03-09 华为技术有限公司 Image classifier generation method, and image classification method and device
CN105740349A (en) * 2016-01-25 2016-07-06 重庆邮电大学 Sentiment classification method capable of combining Doc2vce with convolutional neural network
CN106156004A (en) * 2016-07-04 2016-11-23 中国传媒大学 The sentiment analysis system and method for film comment information based on term vector
CN107798349A (en) * 2017-11-03 2018-03-13 合肥工业大学 A kind of transfer learning method based on the sparse self-editing ink recorder of depth

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10073830B2 (en) * 2014-01-10 2018-09-11 Cluep Inc. Systems, devices, and methods for automatic detection of feelings in text

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
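Text Visualization - What Colors Tell About a Text; Wibke Weber et al.; 2007 11th International Conference Information Visualization (IV '07); 20070716; pp. 1-6 *
Text sentiment analysis based on mixed chi-square statistics and logistic regression; Li Ping et al.; Computer Engineering (《计算机工程》); 20171215; pp. 192-196, 202 *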

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant