CN108595568A - A kind of text sentiment classification method based on very big unrelated multivariate logistic regression - Google Patents

A kind of text sentiment classification method based on very big unrelated multivariate logistic regression Download PDF

Info

Publication number
CN108595568A
CN108595568A CN201810332338.5A CN201810332338A CN108595568A CN 108595568 A CN108595568 A CN 108595568A CN 201810332338 A CN201810332338 A CN 201810332338A CN 108595568 A CN108595568 A CN 108595568A
Authority
CN
China
Prior art keywords
model
data
logistic regression
cost function
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810332338.5A
Other languages
Chinese (zh)
Other versions
CN108595568B (en
Inventor
雷大江
张红宇
陈浩
张莉萍
吴渝
杨杰
程克非
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN201810332338.5A priority Critical patent/CN108595568B/en
Publication of CN108595568A publication Critical patent/CN108595568A/en
Application granted granted Critical
Publication of CN108595568B publication Critical patent/CN108595568B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/18Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Operations Research (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Algebra (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a kind of text sentiment classification method based on very big unrelated multivariate logistic regression, the method includes:Text data is obtained, and the text data is pre-processed;On the basis of the cost function of the first model, by introducing relevant parameter penalty term, the cost function of the second model is obtained;The training data that pretreatment is obtained inputs the derived function of the cost function of the second model, and is solved to obtain the second model;First model is multivariate logistic regression model, and second model is very big unrelated multivariate logistic regression model;The data to be predicted that pretreatment is obtained input second model, obtain the emotional category belonging to each textual entry in data to be predicted.So that there is higher robustness for redundant data by adding uncorrelated bound term;The complexity of traditional multivariate logistic regression model is reduced, there is stronger generalization ability;And then precise classification can be carried out to textual entry in the target text data of acquisition.

Description

A kind of text sentiment classification method based on very big unrelated multivariate logistic regression
Technical field
The present invention relates to machine learning field more particularly to a kind of text emotions based on very big unrelated multivariate logistic regression Sorting technique.
Background technology
Classify key component as machine learning, data mining, in image recognition, drug development, speech recognition, hand-written Identification etc. has a wide range of applications.It is to identify that which classification a new example belonged to has prison based on known training set The problem concerning study superintended and directed.In sorting algorithm, non-linear classification and can expand to more classification most important.
Support vector machines (SVM) is a kind of two-value grader of classics, and Hinge is used to lose, by solving belt restraining item The double optimization problem of part establishes the best line of demarcation between data set.Compared with other algorithms, considerable advantage is:It is logical It crosses using different kernel functions, SVM both can be used for linear classification, can be used for Nonlinear Classification.But due to its dependence In one-to-one pattern, SVM is very limited on multicategory classification, although having been done much expanding to SVM on multicategory classification Effort, but these methods still have many negative impacts.For example, in multi-class classification, decision-making technique one-to-many SVM is just deep It is unbalanced between by data set class to influence.Another important problem, which is it, may distribute to same instance multiple classes.Although Many methods, which are suggested, to be solved these problems, but they have other adverse effects:Such as efficiency.SVM's the result is that pure Pure two points, do not support probability output.SVM does not have from the numerical value output of a task and the numerical value output of another task can Compare property.In addition, compared with the grader based on degree of belief, this numerical value that there is no limit is difficult to explain for terminal user The meaning of its behind.
Logistic regression (LR) is one of the important method of classification.Standard logic recurrence is lost using Logistical, is passed through The coefficient weighted linear combination of input variable is classified.Logistic regression is substantially reduced by Nonlinear Mapping from classification plane The weight of point farther out improves the weight with the maximally related data point of classifying, compared to support vector machines, from a certain given In class, standard logic recurrence can provide corresponding class distribution estimation, and very big advantage is also accounted on the model training time.Logic Comparatively model is simpler for recurrence, understands well, is implemented when for extensive linear classification more convenient.In addition, standard Logistic regression is easier to expand to multicategory classification than support vector machines.Some are directed to the innovatory algorithm of logistic regression for example:It is sparse Logistic regression, weighted logistic regression etc. all obtain preferable effect in corresponding field.
However logistic regression is only used for two classification problems, is not directly applicable multi-class (classification k>2) classification problem. In order to solve more classification problems with logistic regression, usually there are two logic of class to return extended mode, one kind is to establish k independent two A kind of sample labeling is positive sample by meta classifier, each grader, is negative sample by the sample labeling of every other classification.Needle To giving test sample, each grader can obtain the test sample and belong to this kind of probability, therefore can be by taking Maximum class probability carries out classify more.In addition a kind of to be then referred to as multivariate logistic regression (Multinomial Logistic Regression, MLR), it is popularization of the Logic Regression Models in more classification problems.Specifically choose which kind of method handles more points Class problem generally depend between classification to be sorted whether mutual exclusion.It is typically mutual exclusion for more classification problems, between classification 's.Therefore, using multivariate logistic regression better result is usually led to compared to logistic regression.Meanwhile multivariate logistic regression Only need training primary, therefore it also has the faster speed of service.
In computer information processing field, text data set usually contains more common information, these common informations are big The big complexity and identification error for increasing identification, although multivariate logistic regression trains multigroup parameter to be directed to each classification meter Calculate corresponding probability, however there is no consider between each group parameter whether related problem.Therefore a kind of based on greatly unrelated Multivariate logistic regression text sentiment classification method realization have certain realistic meaning.
Invention content
In order to solve the above technical problem, the present invention provides a kind of text feelings based on very big unrelated multivariate logistic regression Feel sorting technique, the method includes:
Text data is obtained, and the text data is pre-processed;The text data includes training data and waits for Prediction data;The data to be predicted include multiple textual entries;
On the basis of the cost function of the first model, by introducing relevant parameter penalty term, the cost of the second model is obtained Function;
Will the obtained training data of pretreatment input the second model cost function derived function, and solved to obtain the Two models;First model is multivariate logistic regression model, and second model is very big unrelated multivariate logistic regression model;
The data to be predicted that pretreatment is obtained input second model, obtain each textual entry in data to be predicted Affiliated emotional category.
Further, the data to be predicted that pretreatment is obtained input second model, obtain data to be predicted In emotional category belonging to each textual entry, including:
Each textual entry in the data to be predicted that pretreatment is obtained inputs second model, obtains each text The text emotion class probability of entry;
Classification thresholds are set;
When the text emotion class probability of textual entry is more than the classification thresholds, judge that the textual entry belongs to the One emotional category;
When the text emotion class probability of textual entry is less than or equal to the classification thresholds, the textual entry category is judged In the second emotional category.
Further, described on the basis of the cost function of the first model, by introducing relevant parameter penalty term, obtain the The cost function of two models, including:
Obtain the negative log-likelihood function of the model parameter of the first model;
Obtain uncorrelated bound term;
The cost function that uncorrelated bound term is introduced to the first model, obtains the cost function of the second model.
Further, first model is:
Wherein
The negative log-likelihood function of the parameter θ of first model is:
The negative log-likelihood function i.e. cost function of the first model;Wherein, m is independent the number of sample.
Further, the uncorrelated bound term is:
Uncorrelated bound term, that is, relevant parameter the penalty term;Wherein, θiAnd θjFor arbitrary two groups of different parameters;
The cost function of second model is:
Further, the derived function of the cost function of second model is:
A kind of text sentiment classification method based on very big unrelated multivariate logistic regression provided by the invention, the technology having Effect is:
The present invention is on the basis of traditional multivariate logistic regression model by introducing relevant parameter penalty term (uncorrelated about Beam item), obtain the cost function of greatly unrelated multivariate logistic regression model;According to the solution greatly unrelated multivariate logistic regression The derived function of the cost function of model obtains the greatly unrelated multivariate logistic regression model.By adding uncorrelated bound term So that there is higher robustness for redundant data;The complexity for reducing traditional multivariate logistic regression model, obtains New disaggregated model (very big unrelated multivariate logistic regression model) has stronger generalization ability;And then it can be to the target of acquisition Textual entry carries out precise classification in text data.
Description of the drawings
It in order to more clearly explain the embodiment of the invention or the technical proposal in the existing technology and advantage, below will be to implementing Example or attached drawing needed to be used in the description of the prior art are briefly described, it should be apparent that, the accompanying drawings in the following description is only Only it is some embodiments of the present invention, for those of ordinary skill in the art, without creative efforts, It can also be obtained according to these attached drawings other attached drawings.
Fig. 1 is a kind of text sentiment classification method based on very big unrelated multivariate logistic regression provided in an embodiment of the present invention Flow chart;
Fig. 2 is target text data instance figure provided in an embodiment of the present invention;
Fig. 3 is the method flow diagram of the emotional category belonging to each textual entry of determination that is provided in the embodiment of the present invention;
Fig. 4 is the method flow diagram of the cost function provided in an embodiment of the present invention for obtaining the second model;
Fig. 5 is the MNIST data sets MLR provided in the embodiment of the present invention and the big logotype of UMLR parameter norms;
Fig. 6 is the COIL20 data sets MLR provided in the embodiment of the present invention and the big logotype of UMLR parameter norms;
Fig. 7 is the ORL data sets MLR provided in the embodiment of the present invention and the big logotype of UMLR parameter norms.
Specific implementation mode
Following will be combined with the drawings in the embodiments of the present invention, and technical solution in the embodiment of the present invention carries out clear, complete Site preparation describes, it is clear that described embodiments are only a part of the embodiments of the present invention, instead of all the embodiments.It is based on Embodiment in the present invention, those of ordinary skill in the art obtained without making creative work it is all its His embodiment, shall fall within the protection scope of the present invention.
It should be noted that term " first " in description and claims of this specification and above-mentioned attached drawing, " Two " etc. be for distinguishing similar object, without being used to describe specific sequence or precedence.It should be appreciated that using in this way Data can be interchanged in the appropriate case, so as to the embodiment of the present invention described herein can in addition to illustrating herein or Sequence other than those of description is implemented.In addition, term " comprising " and " having " and their any deformation, it is intended that cover It includes to be not necessarily limited to for example, containing the process of series of steps or unit, method, system, product or equipment to cover non-exclusive Those of clearly list step or unit, but may include not listing clearly or for these processes, method, product Or the other steps or unit that equipment is intrinsic.
It needs to give explanation, logistic regression (LR) algorithm and l existing in the prior art2Polynary logic is constrained to return Return (RMLR) algorithm, there are some shortcoming and defect in classification application, and then it is greatly unrelated polynary to propose modified hydrothermal process Logistic regression algorithm.
Logistic regression (LR) algorithm:
For logistic regression, hypothesis has data set D={ xi,yi, i=1 ..., N, xi∈RD,yi∈ { 0,1 }, input Vector is x=(x(1),…,x(D)), class label y is two-valued function:Y is 0 or 1.Logistic regression (LR) is based on following probability Model:
Wherein,Referred to as Logistic functions or Sigmoid functions.
For two classification problems, it is assumed that the value of y is 0 or 1, and the probability that y=1 occurs obeys Bernoulli Jacob's distribution, then has:
P (y=1 | x;θ)=hθ(x)
P (y=0 | x;θ)=1-hθ(x)
Two formula as above can be merged into:
p(y|x;θ)=hθ(x)y(1-hθ(x))1-y (2)
Wherein y ∈ { 0,1 }.It is assumed that m sample is independent, then the likelihood function of parameter θ can be write out:
Then log-likelihood function can be expressed as:
Optimal θ can be obtained by maximizing l (θ).Usually enableObtain the corresponding losses of l (θ) Function solves optimal θ by minimizing loss function.However logistic regression can only handle two classification problems, it can not be direct It is applied in more classification problems.
l2Constrain multivariate logistic regression (RMLR) algorithm:
Polytypic problem cannot be handled for traditional logistic regression, multivariate logistic regression (MLR) is by changing logic The cost function of recurrence, to adapt to more classification problems.
Assuming that there is data set D={ xi,yi, i=1 ..., N, xi∈RD,yi∈{0,…,K}(K>2), input vector x =(x(1),…,x(D)), multivariate logistic regression (MLR) is based on following probabilistic model:
Wherein
Cost function is:
However there are one uncommon features for multivariate logistic regression, there are one the parameter sets of " redundancy " for it.Assuming that we from Parameter vector θjIn subtracted vectorial ψ, at this moment, each θjAll become θj- ψ (j=1 ..., k).It is assumed that function becomes Following formula:
This shows from θjIn subtract ψ completely and do not influence the prediction result for assuming function, that is to say, that polynary patrolled above-mentioned There are the parameters of redundancy in volume regression model.
For multivariate logistic regression model overparameterization problem, l2Multivariate logistic regression (RMLR) algorithm is constrained by adding A weight attenuation term is added to change cost function, this attenuation term can punish excessive parameter value, and cost function is made to become tight The convex function of lattice ensures that uniquely solved in this way.Its cost function is:
Hessian matrixes at this time become invertible matrix, and because cost function is convex function, can using optimization algorithm To ensure to converge to globally optimal solution.Although l2Constraint multivariate logistic regression (RMLR) algorithm has mitigated excessively quasi- to a certain extent Conjunction problem, however for the data set for possessing redundancy, l2It is poor to constrain the performance of multivariate logistic regression (RMLR) algorithm.
According to above-mentioned analysis, and then propose very big unrelated multivariate logistic regression model:Specifically, the present embodiment provides A kind of text sentiment classification method based on very big unrelated multivariate logistic regression, as shown in Figure 1, the method includes:
S101. text data is obtained, and the text data is pre-processed;The text data includes training data With data to be predicted;The data to be predicted include multiple textual entries;
For example, reading the evaluation text data after consumer consumes to shop, the text data is consumed by consumer to shop Comment composition afterwards.As shown in Figure 2, wherein first is classified as text label row, and 0 indicates positive comment, and 1 indicates negative sense comment.The Two are classified as consumer reviews' row.Then, it because there are much noises directly to train in original text data, needs corresponding Pretreatment;Pretreatment operation is carried out to evaluation text data set, is specifically included:
Wherein, in step S103 and step S106, preprocess method is carried out to RDD, including:
The gap character in pending text comments sentence is obtained, and the space character is replaced using null character string;
Special string, the number etc. in comment sentence are obtained, and the special string, number are replaced using null character string Word etc.;
It obtains in comment sentence and expresses the word for obscuring the tone, convert expression fuzzy expression word to absolutely expression word Language, and then so that the fuzzy tone is expressed and be converted into absolute expression;
Custom dictionaries is added, the higher noun of frequency in pending text comments sentence is added to custom dictionaries In;
Word in above-mentioned processed comment sentence is segmented, and filters the stop words in comment sentence;
Word in the comment sentence for having completed participle is converted into row vector, and then generates term vector.
Specifically, carrying out pretreated method to pending text includes:
Using function re.compile (' # ([^>] *) #') match in comment with the comment of " # " beginning and end, and use Null character string is replaced;Wherein re is the regular expression module of python, can directly invoke function therein to realize character The canonical of string matches.
Using function re.compile (u'[^ u4e00- u9fa5 | a-zA-Z]+') matching comment in special string, Number etc., and replaced using null character string;
Comment text is replaced using function flashtext.KeywordProcessor.The fuzzy tone is expressed It is converted into absolute expression.Such as " so-so " is replaced with into " bad ", " not being special " replaces with " no ";
Custom dictionaries is added, for the higher noun of frequency in text data set, is added in new term to dictionary, enhancing Segment accuracy;Wherein, it is added in specific noun to custom dictionaries according to specific scene, it is more efficient complete very accurately At participle work.
Comment is segmented using function jieba.cut (), and filters the stop words in comment;Wherein stop words is pair Text classification target helps little word or word, such as ' ', ' ', ' ' etc.;Different scenes have different deactivated vocabularys, According to the corresponding stop words deactivated in vocabulary deletion text.
The comment data collection for having completed participle is converted to using function gensim.models.Word2Vec () Word2vec models generate term vector.
One is commented on, the term vector of each word is generated using word2vec models, the word of all words in comment Vector is averaged by dimension, and the term vector for obtaining this comment indicates.Assuming that comment data is concentrated comprising n different words. Include m word in some sentence, then shown in the term vector such as formula (81) of each word of the sentence:
Shown in the term vector of the sentence such as formula (82):
Wherein,
Recycling word2vec models generate that corresponding step of term vector of each word, and then are entirely commented It is indicated by the term vector of data set.
S102. on the basis of the cost function of the first model, by introducing relevant parameter penalty term, the second model is obtained Cost function;
S103. training data pretreatment obtained inputs the derived function of the cost function of the second model, and is solved Obtain the second model;First model is multivariate logistic regression model, and second model is that very big unrelated polynary logic is returned Return model;
S104. the data to be predicted pretreatment obtained input second model, obtain each text in data to be predicted Emotional category belonging to this entry.
That is, introducing relevant parameter penalty term, greatly unrelated multivariate logistic regression model is established;It will handle well Format data input model, predict every evaluation text emotional category.
Specifically, in step S104, the data to be predicted that pretreatment is obtained input second model, obtain waiting for pre- Emotional category in measured data belonging to each textual entry, as shown in figure 3, including:
S104a. each textual entry in the data to be predicted pretreatment obtained inputs second model, obtains every The text emotion class probability of a textual entry;
S104b., classification thresholds are set;
Wherein it is preferred in two classification problems of the classification thresholds in multivariate logistic regression, specially 0.5.
S104c. when the text emotion class probability of textual entry is more than the classification thresholds, judge the textual entry Belong to the first emotional category;
S104d. when the text emotion class probability of textual entry is less than or equal to the classification thresholds, judge the text Entry belongs to the second emotional category.
Such as;Classification thresholds are set as 0.5, when model calculates sample generic probability more than 0.5, comment is marked It is denoted as 1, is expressed as positive comment.When sample generic probability is less than or equal to 0.5, comment is marked as 0, is expressed as negatively commenting By.
Wherein, in computer information processing field, data set usually contains more common information, these common informations are big The big complexity and identification error for increasing identification, although multivariate logistic regression trains k groups parameter to be directed to each classification meter Calculate corresponding probability, however there is no consider between k group parameters whether the problem of correlation, if parameter (θ12,…,θk) be The minimum point of cost function, then any parameter θiIt all can be by other θj(j ≠ i) linear expression, i.e.,
θi0+∑j≠iλjθj (9)
This illustrate it is different classes of between parameter have correlation.l2Canonical
Although constraining the group interior element of every group of parameter, the problem of different classes of parameter correlation is not considered yet, is led Causing to be directed to has the data set classifying quality of more redundancy poor.For arbitrary two groups of different parameter θsiAnd θj, according to basic Inequality:
Wherein, and if only if θijWhen obtain maximum value.
If θiWith θjCorrelation, i.e. θi0jθj, thenIt is worth larger, therefore we are added to uncorrelated bound term:
This bound term can punish relevant parameter, for ensureing to retain as possible more spies that are uncorrelated, having differentiation Sign.And because
It is so as to obtain its cost function:
In order to use optimization algorithm, the derivative for acquiring J (θ) is as follows:
It is derived according to above, uncorrelated parameter θ can quickly be acquired by gradient descent algorithm and its innovatory algorithm.
Specifically, it corresponds into the present embodiment, in step S102, on the basis of the cost function of the first model, by drawing Enter relevant parameter penalty term, the cost function of the second model is obtained, as shown in figure 4, including:
S102a. the negative log-likelihood function of the model parameter of the first model is obtained;
S102b. uncorrelated bound term is obtained;
S102c., the cost function that uncorrelated bound term is introduced to the first model, obtains the cost function of the second model.
Accordingly, first model is:
Wherein
The negative log-likelihood function of the parameter θ of first model is:
The negative log-likelihood function i.e. cost function of the first model;Wherein, m is independent the number of sample.Into one Step ground, the uncorrelated bound term are:
Uncorrelated bound term, that is, relevant parameter the penalty term;Wherein, θiAnd θjFor arbitrary two groups of different parameters;It is described The cost function of second model is:
Further, the derived function of the cost function of second model is:
For above-mentioned content, then algorithm steps:
Input:Training set D={ (x1,y1),(x2,y2),…,(xm,ym)};
Process:
Initializeλ,η,Θ
While stopping criterion are not satisfied do:
Forj=1,2 ..., k:
Θ=L-BFGS (Loss, d Θ)
Output:Regression coefficient Θ
Further, convergence is carried out to the greatly unrelated multivariate logistic regression algorithm:
According to the loss function of very big unrelated multivariate logistic regression:
It can obtain:
Because it is stringent convex function that the second dervative perseverance of J (θ), which is more than 0, J (θ),.
Wherein, algorithm receipts can be demonstrate,proved according to on-line study frame analysis algorithm and about the analysis of Adam convergences It holds back.
Further, greatly unrelated multivariate logistic regression (UMLR) algorithm proposed by the present invention is assessed.Experiment knot Fruit is concentrated mainly on following two problems:Nicety of grading executes speed.Data classification algorithm for comparing includes weight decaying Multivariate logistic regression, support vector machines and the unrelated multivariate logistic regression of parameter.The artificial of the different degrees of correlation has been respectively adopted in experiment 4 real data sets such as data set and MNIST, COIL20, GT and ORL, verification mode are ten folding cross validations.
(1) it normalizes
Assuming that Φ (x)minWith Φ (x)maxMaximum value and minimum value respectively in data set.For an example, normalizing It is as follows to change algorithm:
By normalized mode, there will be the expression formula of dimension to be converted into nondimensional expression formula, solve contribution data Unbalanced problem.
(2) experimental result on artificial data collection
In order to which verification algorithm is to the validity of linearly related data set, we generate artificial data collection as follows:Class The interior degree of correlation is more than 0.9, and similarity distinguishes value 0.5,0.6,0.7,0.8,0.9. between class
Sample size and data dimension are selected as (m, n)=(5000,1000), amount to 5 classifications, each classification 1000 Bar sample.
It is the data for the different degrees of correlation, very big unrelated multivariate logistic regression algorithm and l below2Polynary logic is constrained to return The comparison of reduction method discrimination.
The discrimination of table 1.MLR, UMLR difference correlation data collection
(3) experimental result on MNIST and COIL20 data sets
MNIST data sets are widely used in area of pattern recognition.It includes 10 classifications, this ten classifications correspond to hand-written Digital 0-9, each classification have 5000 plurality of pictures.COIL20 data sets possess 20 different classifications, and each classification has 72 Picture.
Table 2.SVM, MLR, UMLR are directed to the discrimination of MINIST, COIL20 data set
Above table illustrates the accuracy that three kinds of different algorithms are directed to two kinds of data sets.It is illustrated in figure 5 MNIST numbers According to collection MLR and the big logotype of UMLR parameter norms;It is illustrated in figure 6 COIL20 data sets MLR and UMLR parameter norm sizes Schematic diagram.Wherein, it is UMLR parameter norm size block diagrams under corresponding data collection, Fig. 5 that left side in Fig. 5 and Fig. 6 is corresponding MLR parameter norm size block diagrams under corresponding data collection corresponding with the right side of Fig. 6.
(4) experimental result on GT and ORL data sets
Totally 50 classifications, each classification include 15 pictures to GT data sets.ORL data sets totally 20 classifications, each classification Including 10 pictures.
Table 3.SVM, MLR, UMLR are directed to the discrimination of GT, ORL data set
It is illustrated in figure 7 ORL data sets MLR and the big logotype of UMLR parameter norms;Wherein, the left side of Fig. 7 is corresponding It is the UMLR parameter norm size block diagrams under corresponding data collection, the MLR parameter models under the corresponding corresponding data collection in right side of Fig. 7 Number size block diagram.
(5) analysis of experimental results
The experimental results showed that greatly unrelated multivariate logistic regression compares l2Constraint multivariate logistic regression algorithm and support to Amount machine algorithm has higher nicety of grading.It is with obvious effects especially for the higher data set of correlation between class, illustrate it to superfluous Remainder evidence has higher robustness.Its convergence parameter is compared to compared with l2The convergence parameter for constraining multivariate logistic regression is small, this is logical Often mean that it possesses stronger generalization ability.
According to above-mentioned analysis of experimental results as it can be seen that important branch of the classification as pattern-recognition, data mining, has more More be widely applied field, so, be increasingly becoming police criminal detection solve a case, e-payment, the core of the systems such as medical treatment and pass Key technology.
A kind of greatly unrelated multivariate logistic regression model proposed by the present invention;This method is based on the basic of multivariate logistic regression Model constructs a kind of novel classification device.The experimental results showed that its nicety of grading, classification robustness on than traditional classification algorithm It is advantageous.And it is stronger explanatory that it trains obtained model to have than the methods of support vector machines, naive Bayesian.
In conclusion a kind of text sentiment classification method based on very big unrelated multivariate logistic regression provided by the invention, The technique effect having is:
The present invention is on the basis of traditional multivariate logistic regression model by introducing relevant parameter penalty term (uncorrelated about Beam item), obtain the cost function of greatly unrelated multivariate logistic regression model;According to the solution greatly unrelated multivariate logistic regression The derived function of the cost function of model obtains the greatly unrelated multivariate logistic regression model.By adding uncorrelated bound term So that there is higher robustness for redundant data;The complexity for reducing traditional multivariate logistic regression model, obtains New disaggregated model (very big unrelated multivariate logistic regression model) has stronger generalization ability;And then it can be to the target of acquisition Textual entry carries out precise classification in text data.
It should be noted that:Embodiments of the present invention sequencing is for illustration only, can not represent the quality of embodiment.
One of ordinary skill in the art will appreciate that realizing that all or part of step of above-described embodiment can pass through hardware It completes, relevant hardware can also be instructed to complete by program, the program can be stored in a kind of computer-readable In storage medium, storage medium mentioned above can be read-only memory, disk or CD etc..
The foregoing is merely presently preferred embodiments of the present invention, is not intended to limit the invention, it is all the present invention spirit and Within principle, any modification, equivalent replacement, improvement and so on should all be included in the protection scope of the present invention.

Claims (6)

1. a kind of text sentiment classification method based on very big unrelated multivariate logistic regression, which is characterized in that the method includes:
Text data is obtained, and the text data is pre-processed;The text data includes training data and to be predicted Data;The data to be predicted include multiple textual entries;
On the basis of the cost function of the first model, by introducing relevant parameter penalty term, the cost function of the second model is obtained;
The training data that pretreatment is obtained inputs the derived function of the cost function of the second model, and is solved to obtain the second mould Type;First model is multivariate logistic regression model, and second model is very big unrelated multivariate logistic regression model;
The data to be predicted that pretreatment is obtained input second model, obtain in data to be predicted belonging to each textual entry Emotional category.
2. according to the method described in claim 1, it is characterized in that, described in the data to be predicted input that pretreatment is obtained Second model obtains the emotional category belonging to each textual entry in data to be predicted, including:
Each textual entry in the data to be predicted that pretreatment is obtained inputs second model, obtains each textual entry Text emotion class probability;
Classification thresholds are set;
When the text emotion class probability of textual entry is more than the classification thresholds, judge that the textual entry belongs to the first feelings Feel classification;
When the text emotion class probability of textual entry is less than or equal to the classification thresholds, judge that the textual entry belongs to the Two emotional categories.
3. method according to claim 1 or 2, which is characterized in that it is described on the basis of the cost function of the first model, lead to Introducing relevant parameter penalty term is crossed, the cost function of the second model is obtained, including:
Obtain the negative log-likelihood function of the model parameter of the first model;
Obtain uncorrelated bound term;
The cost function that uncorrelated bound term is introduced to the first model, obtains the cost function of the second model.
4. according to the method described in claim 3, it is characterized in that, first model is:
Wherein
The negative log-likelihood function of the parameter θ of first model is:
The negative log-likelihood function i.e. cost function of the first model;Wherein, m is independent the number of sample.
5. according to the method described in claim 4, it is characterized in that, the uncorrelated bound term is:
Uncorrelated bound term, that is, relevant parameter the penalty term;Wherein, θiAnd θjFor arbitrary two groups of different parameters;
The cost function of second model is:
6. according to the method described in claim 5, it is characterized in that, the derived function of the cost function of second model is:
CN201810332338.5A 2018-04-13 2018-04-13 Text emotion classification method based on great irrelevant multiple logistic regression Active CN108595568B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810332338.5A CN108595568B (en) 2018-04-13 2018-04-13 Text emotion classification method based on great irrelevant multiple logistic regression

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810332338.5A CN108595568B (en) 2018-04-13 2018-04-13 Text emotion classification method based on great irrelevant multiple logistic regression

Publications (2)

Publication Number Publication Date
CN108595568A true CN108595568A (en) 2018-09-28
CN108595568B CN108595568B (en) 2022-05-17

Family

ID=63622383

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810332338.5A Active CN108595568B (en) 2018-04-13 2018-04-13 Text emotion classification method based on great irrelevant multiple logistic regression

Country Status (1)

Country Link
CN (1) CN108595568B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109671487A (en) * 2019-02-25 2019-04-23 上海海事大学 A kind of social media user psychology crisis alert method
CN110942450A (en) * 2019-11-19 2020-03-31 武汉大学 Multi-production-line real-time defect detection method based on deep learning
CN112802456A (en) * 2021-04-14 2021-05-14 北京世纪好未来教育科技有限公司 Voice evaluation scoring method and device, electronic equipment and storage medium

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004094583A (en) * 2002-08-30 2004-03-25 Ntt Advanced Technology Corp Method of classifying writings
CN103473380A (en) * 2013-09-30 2013-12-25 南京大学 Computer text sentiment classification method
CN103514279A (en) * 2013-09-26 2014-01-15 苏州大学 Method and device for classifying sentence level emotion
CN103955451A (en) * 2014-05-15 2014-07-30 北京优捷信达信息科技有限公司 Method for judging emotional tendentiousness of short text
CN104239554A (en) * 2014-09-24 2014-12-24 南开大学 Cross-domain and cross-category news commentary emotion prediction method
CN104462408A (en) * 2014-12-12 2015-03-25 浙江大学 Topic modeling based multi-granularity sentiment analysis method
CN105389583A (en) * 2014-09-05 2016-03-09 华为技术有限公司 Image classifier generation method, and image classification method and device
CN105740349A (en) * 2016-01-25 2016-07-06 重庆邮电大学 Sentiment classification method capable of combining Doc2vce with convolutional neural network
US20160321243A1 (en) * 2014-01-10 2016-11-03 Cluep Inc. Systems, devices, and methods for automatic detection of feelings in text
CN106156004A (en) * 2016-07-04 2016-11-23 中国传媒大学 The sentiment analysis system and method for film comment information based on term vector
CN107798349A (en) * 2017-11-03 2018-03-13 合肥工业大学 A kind of transfer learning method based on the sparse self-editing ink recorder of depth

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004094583A (en) * 2002-08-30 2004-03-25 Ntt Advanced Technology Corp Method of classifying writings
CN103514279A (en) * 2013-09-26 2014-01-15 苏州大学 Method and device for classifying sentence level emotion
CN103473380A (en) * 2013-09-30 2013-12-25 南京大学 Computer text sentiment classification method
US20160321243A1 (en) * 2014-01-10 2016-11-03 Cluep Inc. Systems, devices, and methods for automatic detection of feelings in text
CN103955451A (en) * 2014-05-15 2014-07-30 北京优捷信达信息科技有限公司 Method for judging emotional tendentiousness of short text
CN105389583A (en) * 2014-09-05 2016-03-09 华为技术有限公司 Image classifier generation method, and image classification method and device
CN104239554A (en) * 2014-09-24 2014-12-24 南开大学 Cross-domain and cross-category news commentary emotion prediction method
CN104462408A (en) * 2014-12-12 2015-03-25 浙江大学 Topic modeling based multi-granularity sentiment analysis method
CN105740349A (en) * 2016-01-25 2016-07-06 重庆邮电大学 Sentiment classification method capable of combining Doc2vce with convolutional neural network
CN106156004A (en) * 2016-07-04 2016-11-23 中国传媒大学 The sentiment analysis system and method for film comment information based on term vector
CN107798349A (en) * 2017-11-03 2018-03-13 合肥工业大学 A kind of transfer learning method based on the sparse self-editing ink recorder of depth

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WIBKE WEBER等: "Text Visualization - What Colors Tell About a Text", 《2007 11TH INTERNATIONAL CONFERENCE INFORMATION VISUALIZATION (IV "07)》 *
李平等: "基于混合卡方统计量与逻辑回归的文本情感分析", 《计算机工程》 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109671487A (en) * 2019-02-25 2019-04-23 上海海事大学 A kind of social media user psychology crisis alert method
CN110942450A (en) * 2019-11-19 2020-03-31 武汉大学 Multi-production-line real-time defect detection method based on deep learning
CN112802456A (en) * 2021-04-14 2021-05-14 北京世纪好未来教育科技有限公司 Voice evaluation scoring method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN108595568B (en) 2022-05-17

Similar Documents

Publication Publication Date Title
CN112084337B (en) Training method of text classification model, text classification method and equipment
CN108960073B (en) Cross-modal image mode identification method for biomedical literature
CN107832663B (en) Multi-modal emotion analysis method based on quantum theory
KR102008845B1 (en) Automatic classification method of unstructured data
Peng et al. A joint framework for coreference resolution and mention head detection
CN108595708A (en) A kind of exception information file classification method of knowledge based collection of illustrative plates
CN112732916B (en) BERT-based multi-feature fusion fuzzy text classification system
Lopes et al. An AutoML-based approach to multimodal image sentiment analysis
CN112749274B (en) Chinese text classification method based on attention mechanism and interference word deletion
CN112528894A (en) Method and device for distinguishing difference items
CN108536838A (en) Very big unrelated multivariate logistic regression model based on Spark is to text sentiment classification method
CN113627151B (en) Cross-modal data matching method, device, equipment and medium
CN108595568A (en) A kind of text sentiment classification method based on very big unrelated multivariate logistic regression
WO2021148625A1 (en) A method for identifying vulnerabilities in computer program code and a system thereof
CN116911286A (en) Dictionary construction method, emotion analysis device, dictionary construction equipment and storage medium
CN116304051A (en) Text classification method integrating local key information and pre-training
Yoon et al. ESD: Expected Squared Difference as a Tuning-Free Trainable Calibration Measure
CN115827871A (en) Internet enterprise classification method, device and system
CN115687917A (en) Sample processing method and device, and recognition model training method and device
CN115017894A (en) Public opinion risk identification method and device
CN113434721A (en) Expression package classification method and device, computer equipment and storage medium
CN114610882A (en) Abnormal equipment code detection method and system based on electric power short text classification
CN112528653A (en) Short text entity identification method and system
Prabhu et al. A dynamic weight function based BERT auto encoder for sentiment analysis
Jabde et al. Offline Handwritten Multilingual Numeral Recognition Using CNN

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant