CN108595568A

CN108595568A - A kind of text sentiment classification method based on very big unrelated multivariate logistic regression

Info

Publication number: CN108595568A
Application number: CN201810332338.5A
Authority: CN
Inventors: 雷大江; 张红宇; 陈浩; 张莉萍; 吴渝; 杨杰; 程克非
Original assignee: Chongqing University of Post and Telecommunications
Current assignee: Chongqing University of Post and Telecommunications
Priority date: 2018-04-13
Filing date: 2018-04-13
Publication date: 2018-09-28
Anticipated expiration: 2038-04-13
Also published as: CN108595568B

Abstract

The present invention provides a kind of text sentiment classification method based on very big unrelated multivariate logistic regression, the method includes：Text data is obtained, and the text data is pre-processed；On the basis of the cost function of the first model, by introducing relevant parameter penalty term, the cost function of the second model is obtained；The training data that pretreatment is obtained inputs the derived function of the cost function of the second model, and is solved to obtain the second model；First model is multivariate logistic regression model, and second model is very big unrelated multivariate logistic regression model；The data to be predicted that pretreatment is obtained input second model, obtain the emotional category belonging to each textual entry in data to be predicted.So that there is higher robustness for redundant data by adding uncorrelated bound term；The complexity of traditional multivariate logistic regression model is reduced, there is stronger generalization ability；And then precise classification can be carried out to textual entry in the target text data of acquisition.

Description

A kind of text sentiment classification method based on very big unrelated multivariate logistic regression

Technical field

The present invention relates to machine learning field more particularly to a kind of text emotions based on very big unrelated multivariate logistic regression Sorting technique.

Background technology

Classify key component as machine learning, data mining, in image recognition, drug development, speech recognition, hand-written Identification etc. has a wide range of applications.It is to identify that which classification a new example belonged to has prison based on known training set The problem concerning study superintended and directed.In sorting algorithm, non-linear classification and can expand to more classification most important.

Support vector machines (SVM) is a kind of two-value grader of classics, and Hinge is used to lose, by solving belt restraining item The double optimization problem of part establishes the best line of demarcation between data set.Compared with other algorithms, considerable advantage is：It is logical It crosses using different kernel functions, SVM both can be used for linear classification, can be used for Nonlinear Classification.But due to its dependence In one-to-one pattern, SVM is very limited on multicategory classification, although having been done much expanding to SVM on multicategory classification Effort, but these methods still have many negative impacts.For example, in multi-class classification, decision-making technique one-to-many SVM is just deep It is unbalanced between by data set class to influence.Another important problem, which is it, may distribute to same instance multiple classes.Although Many methods, which are suggested, to be solved these problems, but they have other adverse effects：Such as efficiency.SVM's the result is that pure Pure two points, do not support probability output.SVM does not have from the numerical value output of a task and the numerical value output of another task can Compare property.In addition, compared with the grader based on degree of belief, this numerical value that there is no limit is difficult to explain for terminal user The meaning of its behind.

Logistic regression (LR) is one of the important method of classification.Standard logic recurrence is lost using Logistical, is passed through The coefficient weighted linear combination of input variable is classified.Logistic regression is substantially reduced by Nonlinear Mapping from classification plane The weight of point farther out improves the weight with the maximally related data point of classifying, compared to support vector machines, from a certain given In class, standard logic recurrence can provide corresponding class distribution estimation, and very big advantage is also accounted on the model training time.Logic Comparatively model is simpler for recurrence, understands well, is implemented when for extensive linear classification more convenient.In addition, standard Logistic regression is easier to expand to multicategory classification than support vector machines.Some are directed to the innovatory algorithm of logistic regression for example：It is sparse Logistic regression, weighted logistic regression etc. all obtain preferable effect in corresponding field.

However logistic regression is only used for two classification problems, is not directly applicable multi-class (classification k>2) classification problem. In order to solve more classification problems with logistic regression, usually there are two logic of class to return extended mode, one kind is to establish k independent two A kind of sample labeling is positive sample by meta classifier, each grader, is negative sample by the sample labeling of every other classification.Needle To giving test sample, each grader can obtain the test sample and belong to this kind of probability, therefore can be by taking Maximum class probability carries out classify more.In addition a kind of to be then referred to as multivariate logistic regression (Multinomial Logistic Regression, MLR), it is popularization of the Logic Regression Models in more classification problems.Specifically choose which kind of method handles more points Class problem generally depend between classification to be sorted whether mutual exclusion.It is typically mutual exclusion for more classification problems, between classification 's.Therefore, using multivariate logistic regression better result is usually led to compared to logistic regression.Meanwhile multivariate logistic regression Only need training primary, therefore it also has the faster speed of service.

In computer information processing field, text data set usually contains more common information, these common informations are big The big complexity and identification error for increasing identification, although multivariate logistic regression trains multigroup parameter to be directed to each classification meter Calculate corresponding probability, however there is no consider between each group parameter whether related problem.Therefore a kind of based on greatly unrelated Multivariate logistic regression text sentiment classification method realization have certain realistic meaning.

Invention content

In order to solve the above technical problem, the present invention provides a kind of text feelings based on very big unrelated multivariate logistic regression Feel sorting technique, the method includes：

Text data is obtained, and the text data is pre-processed；The text data includes training data and waits for Prediction data；The data to be predicted include multiple textual entries；

On the basis of the cost function of the first model, by introducing relevant parameter penalty term, the cost of the second model is obtained Function；

Will the obtained training data of pretreatment input the second model cost function derived function, and solved to obtain the Two models；First model is multivariate logistic regression model, and second model is very big unrelated multivariate logistic regression model；

The data to be predicted that pretreatment is obtained input second model, obtain each textual entry in data to be predicted Affiliated emotional category.

Further, the data to be predicted that pretreatment is obtained input second model, obtain data to be predicted In emotional category belonging to each textual entry, including：

Each textual entry in the data to be predicted that pretreatment is obtained inputs second model, obtains each text The text emotion class probability of entry；

Classification thresholds are set；

When the text emotion class probability of textual entry is more than the classification thresholds, judge that the textual entry belongs to the One emotional category；

When the text emotion class probability of textual entry is less than or equal to the classification thresholds, the textual entry category is judged In the second emotional category.

Further, described on the basis of the cost function of the first model, by introducing relevant parameter penalty term, obtain the The cost function of two models, including：

Obtain the negative log-likelihood function of the model parameter of the first model；

Obtain uncorrelated bound term；

The cost function that uncorrelated bound term is introduced to the first model, obtains the cost function of the second model.

Further, first model is：

Wherein

The negative log-likelihood function of the parameter θ of first model is：

The negative log-likelihood function i.e. cost function of the first model；Wherein, m is independent the number of sample.

Further, the uncorrelated bound term is：

Uncorrelated bound term, that is, relevant parameter the penalty term；Wherein, θ_iAnd θ_jFor arbitrary two groups of different parameters；

The cost function of second model is：

Further, the derived function of the cost function of second model is：

A kind of text sentiment classification method based on very big unrelated multivariate logistic regression provided by the invention, the technology having Effect is：

The present invention is on the basis of traditional multivariate logistic regression model by introducing relevant parameter penalty term (uncorrelated about Beam item), obtain the cost function of greatly unrelated multivariate logistic regression model；According to the solution greatly unrelated multivariate logistic regression The derived function of the cost function of model obtains the greatly unrelated multivariate logistic regression model.By adding uncorrelated bound term So that there is higher robustness for redundant data；The complexity for reducing traditional multivariate logistic regression model, obtains New disaggregated model (very big unrelated multivariate logistic regression model) has stronger generalization ability；And then it can be to the target of acquisition Textual entry carries out precise classification in text data.

Description of the drawings

It in order to more clearly explain the embodiment of the invention or the technical proposal in the existing technology and advantage, below will be to implementing Example or attached drawing needed to be used in the description of the prior art are briefly described, it should be apparent that, the accompanying drawings in the following description is only Only it is some embodiments of the present invention, for those of ordinary skill in the art, without creative efforts, It can also be obtained according to these attached drawings other attached drawings.

Fig. 1 is a kind of text sentiment classification method based on very big unrelated multivariate logistic regression provided in an embodiment of the present invention Flow chart；

Fig. 2 is target text data instance figure provided in an embodiment of the present invention；

Fig. 3 is the method flow diagram of the emotional category belonging to each textual entry of determination that is provided in the embodiment of the present invention；

Fig. 4 is the method flow diagram of the cost function provided in an embodiment of the present invention for obtaining the second model；

Fig. 5 is the MNIST data sets MLR provided in the embodiment of the present invention and the big logotype of UMLR parameter norms；

Fig. 6 is the COIL20 data sets MLR provided in the embodiment of the present invention and the big logotype of UMLR parameter norms；

Fig. 7 is the ORL data sets MLR provided in the embodiment of the present invention and the big logotype of UMLR parameter norms.

Specific implementation mode

Following will be combined with the drawings in the embodiments of the present invention, and technical solution in the embodiment of the present invention carries out clear, complete Site preparation describes, it is clear that described embodiments are only a part of the embodiments of the present invention, instead of all the embodiments.It is based on Embodiment in the present invention, those of ordinary skill in the art obtained without making creative work it is all its His embodiment, shall fall within the protection scope of the present invention.

It should be noted that term " first " in description and claims of this specification and above-mentioned attached drawing, " Two " etc. be for distinguishing similar object, without being used to describe specific sequence or precedence.It should be appreciated that using in this way Data can be interchanged in the appropriate case, so as to the embodiment of the present invention described herein can in addition to illustrating herein or Sequence other than those of description is implemented.In addition, term " comprising " and " having " and their any deformation, it is intended that cover It includes to be not necessarily limited to for example, containing the process of series of steps or unit, method, system, product or equipment to cover non-exclusive Those of clearly list step or unit, but may include not listing clearly or for these processes, method, product Or the other steps or unit that equipment is intrinsic.

It needs to give explanation, logistic regression (LR) algorithm and l existing in the prior art₂Polynary logic is constrained to return Return (RMLR) algorithm, there are some shortcoming and defect in classification application, and then it is greatly unrelated polynary to propose modified hydrothermal process Logistic regression algorithm.

Logistic regression (LR) algorithm：

For logistic regression, hypothesis has data set D={ x_i,y_i, i=1 ..., N, x_i∈RD,y_i∈ { 0,1 }, input Vector is x=(x⁽¹⁾,…,x^(D)), class label y is two-valued function：Y is 0 or 1.Logistic regression (LR) is based on following probability Model：

Wherein,Referred to as Logistic functions or Sigmoid functions.

For two classification problems, it is assumed that the value of y is 0 or 1, and the probability that y=1 occurs obeys Bernoulli Jacob's distribution, then has：

P (y=1 | x；θ)=h_θ(x)

P (y=0 | x；θ)=1-h_θ(x)

Two formula as above can be merged into：

p(y|x；θ)=h_θ(x)^y(1-h_θ(x))^1-y (2)

Wherein y ∈ { 0,1 }.It is assumed that m sample is independent, then the likelihood function of parameter θ can be write out：

Then log-likelihood function can be expressed as：

Optimal θ can be obtained by maximizing l (θ).Usually enableObtain the corresponding losses of l (θ) Function solves optimal θ by minimizing loss function.However logistic regression can only handle two classification problems, it can not be direct It is applied in more classification problems.

l₂Constrain multivariate logistic regression (RMLR) algorithm：

Polytypic problem cannot be handled for traditional logistic regression, multivariate logistic regression (MLR) is by changing logic The cost function of recurrence, to adapt to more classification problems.

Assuming that there is data set D={ x_i,y_i, i=1 ..., N, x_i∈RD,y_i∈{0,…,K}(K>2), input vector x =(x⁽¹⁾,…,x^(D)), multivariate logistic regression (MLR) is based on following probabilistic model：

Wherein

Cost function is：

However there are one uncommon features for multivariate logistic regression, there are one the parameter sets of " redundancy " for it.Assuming that we from Parameter vector θ_jIn subtracted vectorial ψ, at this moment, each θ_jAll become θ_j- ψ (j=1 ..., k).It is assumed that function becomes Following formula：

This shows from θ_jIn subtract ψ completely and do not influence the prediction result for assuming function, that is to say, that polynary patrolled above-mentioned There are the parameters of redundancy in volume regression model.

For multivariate logistic regression model overparameterization problem, l₂Multivariate logistic regression (RMLR) algorithm is constrained by adding A weight attenuation term is added to change cost function, this attenuation term can punish excessive parameter value, and cost function is made to become tight The convex function of lattice ensures that uniquely solved in this way.Its cost function is：

Hessian matrixes at this time become invertible matrix, and because cost function is convex function, can using optimization algorithm To ensure to converge to globally optimal solution.Although l₂Constraint multivariate logistic regression (RMLR) algorithm has mitigated excessively quasi- to a certain extent Conjunction problem, however for the data set for possessing redundancy, l₂It is poor to constrain the performance of multivariate logistic regression (RMLR) algorithm.

According to above-mentioned analysis, and then propose very big unrelated multivariate logistic regression model：Specifically, the present embodiment provides A kind of text sentiment classification method based on very big unrelated multivariate logistic regression, as shown in Figure 1, the method includes：

S101. text data is obtained, and the text data is pre-processed；The text data includes training data With data to be predicted；The data to be predicted include multiple textual entries；

For example, reading the evaluation text data after consumer consumes to shop, the text data is consumed by consumer to shop Comment composition afterwards.As shown in Figure 2, wherein first is classified as text label row, and 0 indicates positive comment, and 1 indicates negative sense comment.The Two are classified as consumer reviews' row.Then, it because there are much noises directly to train in original text data, needs corresponding Pretreatment；Pretreatment operation is carried out to evaluation text data set, is specifically included：

Wherein, in step S103 and step S106, preprocess method is carried out to RDD, including：

The gap character in pending text comments sentence is obtained, and the space character is replaced using null character string；

Special string, the number etc. in comment sentence are obtained, and the special string, number are replaced using null character string Word etc.；

It obtains in comment sentence and expresses the word for obscuring the tone, convert expression fuzzy expression word to absolutely expression word Language, and then so that the fuzzy tone is expressed and be converted into absolute expression；

Custom dictionaries is added, the higher noun of frequency in pending text comments sentence is added to custom dictionaries In；

Word in above-mentioned processed comment sentence is segmented, and filters the stop words in comment sentence；

Word in the comment sentence for having completed participle is converted into row vector, and then generates term vector.

Specifically, carrying out pretreated method to pending text includes：

Using function re.compile (' # ([^>] *) #') match in comment with the comment of " # " beginning and end, and use Null character string is replaced；Wherein re is the regular expression module of python, can directly invoke function therein to realize character The canonical of string matches.

Using function re.compile (u'[^ u4e00- u9fa5 | a-zA-Z]+') matching comment in special string, Number etc., and replaced using null character string；

Comment text is replaced using function flashtext.KeywordProcessor.The fuzzy tone is expressed It is converted into absolute expression.Such as " so-so " is replaced with into " bad ", " not being special " replaces with " no "；

Custom dictionaries is added, for the higher noun of frequency in text data set, is added in new term to dictionary, enhancing Segment accuracy；Wherein, it is added in specific noun to custom dictionaries according to specific scene, it is more efficient complete very accurately At participle work.

Comment is segmented using function jieba.cut (), and filters the stop words in comment；Wherein stop words is pair Text classification target helps little word or word, such as ' ', ' ', ' ' etc.；Different scenes have different deactivated vocabularys, According to the corresponding stop words deactivated in vocabulary deletion text.

The comment data collection for having completed participle is converted to using function gensim.models.Word2Vec () Word2vec models generate term vector.

One is commented on, the term vector of each word is generated using word2vec models, the word of all words in comment Vector is averaged by dimension, and the term vector for obtaining this comment indicates.Assuming that comment data is concentrated comprising n different words. Include m word in some sentence, then shown in the term vector such as formula (81) of each word of the sentence：

Shown in the term vector of the sentence such as formula (82)：

Wherein,

Recycling word2vec models generate that corresponding step of term vector of each word, and then are entirely commented It is indicated by the term vector of data set.

S102. on the basis of the cost function of the first model, by introducing relevant parameter penalty term, the second model is obtained Cost function；

S103. training data pretreatment obtained inputs the derived function of the cost function of the second model, and is solved Obtain the second model；First model is multivariate logistic regression model, and second model is that very big unrelated polynary logic is returned Return model；

S104. the data to be predicted pretreatment obtained input second model, obtain each text in data to be predicted Emotional category belonging to this entry.

That is, introducing relevant parameter penalty term, greatly unrelated multivariate logistic regression model is established；It will handle well Format data input model, predict every evaluation text emotional category.

Specifically, in step S104, the data to be predicted that pretreatment is obtained input second model, obtain waiting for pre- Emotional category in measured data belonging to each textual entry, as shown in figure 3, including：

S104a. each textual entry in the data to be predicted pretreatment obtained inputs second model, obtains every The text emotion class probability of a textual entry；

S104b., classification thresholds are set；

Wherein it is preferred in two classification problems of the classification thresholds in multivariate logistic regression, specially 0.5.

S104c. when the text emotion class probability of textual entry is more than the classification thresholds, judge the textual entry Belong to the first emotional category；

S104d. when the text emotion class probability of textual entry is less than or equal to the classification thresholds, judge the text Entry belongs to the second emotional category.

Such as；Classification thresholds are set as 0.5, when model calculates sample generic probability more than 0.5, comment is marked It is denoted as 1, is expressed as positive comment.When sample generic probability is less than or equal to 0.5, comment is marked as 0, is expressed as negatively commenting By.

Wherein, in computer information processing field, data set usually contains more common information, these common informations are big The big complexity and identification error for increasing identification, although multivariate logistic regression trains k groups parameter to be directed to each classification meter Calculate corresponding probability, however there is no consider between k group parameters whether the problem of correlation, if parameter (θ₁,θ₂,…,θ_k) be The minimum point of cost function, then any parameter θ_iIt all can be by other θ_j(j ≠ i) linear expression, i.e.,

θ_i=λ₀+∑_j≠iλ_jθ_j (9)

This illustrate it is different classes of between parameter have correlation.l₂Canonical

Although constraining the group interior element of every group of parameter, the problem of different classes of parameter correlation is not considered yet, is led Causing to be directed to has the data set classifying quality of more redundancy poor.For arbitrary two groups of different parameter θs_iAnd θ_j, according to basic Inequality：

Wherein, and if only if θ_i=θ_jWhen obtain maximum value.

If θ_iWith θ_jCorrelation, i.e. θ_i=λ₀+λ_jθ_j, thenIt is worth larger, therefore we are added to uncorrelated bound term：

This bound term can punish relevant parameter, for ensureing to retain as possible more spies that are uncorrelated, having differentiation Sign.And because

It is so as to obtain its cost function：

In order to use optimization algorithm, the derivative for acquiring J (θ) is as follows：

It is derived according to above, uncorrelated parameter θ can quickly be acquired by gradient descent algorithm and its innovatory algorithm.

Specifically, it corresponds into the present embodiment, in step S102, on the basis of the cost function of the first model, by drawing Enter relevant parameter penalty term, the cost function of the second model is obtained, as shown in figure 4, including：

S102a. the negative log-likelihood function of the model parameter of the first model is obtained；

S102b. uncorrelated bound term is obtained；

S102c., the cost function that uncorrelated bound term is introduced to the first model, obtains the cost function of the second model.

Accordingly, first model is：

Wherein

The negative log-likelihood function of the parameter θ of first model is：

The negative log-likelihood function i.e. cost function of the first model；Wherein, m is independent the number of sample.Into one Step ground, the uncorrelated bound term are：

Uncorrelated bound term, that is, relevant parameter the penalty term；Wherein, θ_iAnd θ_jFor arbitrary two groups of different parameters；It is described The cost function of second model is：

Further, the derived function of the cost function of second model is：

For above-mentioned content, then algorithm steps：

Input：Training set D={ (x₁,y₁),(x₂,y₂),…,(x_m,y_m)}；

Process：

Initializeλ,η,Θ

While stopping criterion are not satisfied do:

Forj=1,2 ..., k:

Θ=L-BFGS (Loss, d Θ)

Output：Regression coefficient Θ

Further, convergence is carried out to the greatly unrelated multivariate logistic regression algorithm：

According to the loss function of very big unrelated multivariate logistic regression：

It can obtain：

Because it is stringent convex function that the second dervative perseverance of J (θ), which is more than 0, J (θ),.

Wherein, algorithm receipts can be demonstrate,proved according to on-line study frame analysis algorithm and about the analysis of Adam convergences It holds back.

Further, greatly unrelated multivariate logistic regression (UMLR) algorithm proposed by the present invention is assessed.Experiment knot Fruit is concentrated mainly on following two problems:Nicety of grading executes speed.Data classification algorithm for comparing includes weight decaying Multivariate logistic regression, support vector machines and the unrelated multivariate logistic regression of parameter.The artificial of the different degrees of correlation has been respectively adopted in experiment 4 real data sets such as data set and MNIST, COIL20, GT and ORL, verification mode are ten folding cross validations.

(1) it normalizes

Assuming that Φ (x)_minWith Φ (x)_maxMaximum value and minimum value respectively in data set.For an example, normalizing It is as follows to change algorithm：

By normalized mode, there will be the expression formula of dimension to be converted into nondimensional expression formula, solve contribution data Unbalanced problem.

(2) experimental result on artificial data collection

In order to which verification algorithm is to the validity of linearly related data set, we generate artificial data collection as follows：Class The interior degree of correlation is more than 0.9, and similarity distinguishes value 0.5,0.6,0.7,0.8,0.9. between class

Sample size and data dimension are selected as (m, n)=(5000,1000), amount to 5 classifications, each classification 1000 Bar sample.

It is the data for the different degrees of correlation, very big unrelated multivariate logistic regression algorithm and l below₂Polynary logic is constrained to return The comparison of reduction method discrimination.

The discrimination of table 1.MLR, UMLR difference correlation data collection

(3) experimental result on MNIST and COIL20 data sets

MNIST data sets are widely used in area of pattern recognition.It includes 10 classifications, this ten classifications correspond to hand-written Digital 0-9, each classification have 5000 plurality of pictures.COIL20 data sets possess 20 different classifications, and each classification has 72 Picture.

Table 2.SVM, MLR, UMLR are directed to the discrimination of MINIST, COIL20 data set

Above table illustrates the accuracy that three kinds of different algorithms are directed to two kinds of data sets.It is illustrated in figure 5 MNIST numbers According to collection MLR and the big logotype of UMLR parameter norms；It is illustrated in figure 6 COIL20 data sets MLR and UMLR parameter norm sizes Schematic diagram.Wherein, it is UMLR parameter norm size block diagrams under corresponding data collection, Fig. 5 that left side in Fig. 5 and Fig. 6 is corresponding MLR parameter norm size block diagrams under corresponding data collection corresponding with the right side of Fig. 6.

(4) experimental result on GT and ORL data sets

Totally 50 classifications, each classification include 15 pictures to GT data sets.ORL data sets totally 20 classifications, each classification Including 10 pictures.

Table 3.SVM, MLR, UMLR are directed to the discrimination of GT, ORL data set

It is illustrated in figure 7 ORL data sets MLR and the big logotype of UMLR parameter norms；Wherein, the left side of Fig. 7 is corresponding It is the UMLR parameter norm size block diagrams under corresponding data collection, the MLR parameter models under the corresponding corresponding data collection in right side of Fig. 7 Number size block diagram.

(5) analysis of experimental results

The experimental results showed that greatly unrelated multivariate logistic regression compares l₂Constraint multivariate logistic regression algorithm and support to Amount machine algorithm has higher nicety of grading.It is with obvious effects especially for the higher data set of correlation between class, illustrate it to superfluous Remainder evidence has higher robustness.Its convergence parameter is compared to compared with l₂The convergence parameter for constraining multivariate logistic regression is small, this is logical Often mean that it possesses stronger generalization ability.

According to above-mentioned analysis of experimental results as it can be seen that important branch of the classification as pattern-recognition, data mining, has more More be widely applied field, so, be increasingly becoming police criminal detection solve a case, e-payment, the core of the systems such as medical treatment and pass Key technology.

A kind of greatly unrelated multivariate logistic regression model proposed by the present invention；This method is based on the basic of multivariate logistic regression Model constructs a kind of novel classification device.The experimental results showed that its nicety of grading, classification robustness on than traditional classification algorithm It is advantageous.And it is stronger explanatory that it trains obtained model to have than the methods of support vector machines, naive Bayesian.

In conclusion a kind of text sentiment classification method based on very big unrelated multivariate logistic regression provided by the invention, The technique effect having is：

It should be noted that：Embodiments of the present invention sequencing is for illustration only, can not represent the quality of embodiment.

One of ordinary skill in the art will appreciate that realizing that all or part of step of above-described embodiment can pass through hardware It completes, relevant hardware can also be instructed to complete by program, the program can be stored in a kind of computer-readable In storage medium, storage medium mentioned above can be read-only memory, disk or CD etc..

The foregoing is merely presently preferred embodiments of the present invention, is not intended to limit the invention, it is all the present invention spirit and Within principle, any modification, equivalent replacement, improvement and so on should all be included in the protection scope of the present invention.

Claims

1. a kind of text sentiment classification method based on very big unrelated multivariate logistic regression, which is characterized in that the method includes：

Text data is obtained, and the text data is pre-processed；The text data includes training data and to be predicted Data；The data to be predicted include multiple textual entries；

On the basis of the cost function of the first model, by introducing relevant parameter penalty term, the cost function of the second model is obtained；

The training data that pretreatment is obtained inputs the derived function of the cost function of the second model, and is solved to obtain the second mould Type；First model is multivariate logistic regression model, and second model is very big unrelated multivariate logistic regression model；

The data to be predicted that pretreatment is obtained input second model, obtain in data to be predicted belonging to each textual entry Emotional category.

2. according to the method described in claim 1, it is characterized in that, described in the data to be predicted input that pretreatment is obtained Second model obtains the emotional category belonging to each textual entry in data to be predicted, including：

Each textual entry in the data to be predicted that pretreatment is obtained inputs second model, obtains each textual entry Text emotion class probability；

Classification thresholds are set；

When the text emotion class probability of textual entry is more than the classification thresholds, judge that the textual entry belongs to the first feelings Feel classification；

When the text emotion class probability of textual entry is less than or equal to the classification thresholds, judge that the textual entry belongs to the Two emotional categories.

3. method according to claim 1 or 2, which is characterized in that it is described on the basis of the cost function of the first model, lead to Introducing relevant parameter penalty term is crossed, the cost function of the second model is obtained, including：

Obtain uncorrelated bound term；

4. according to the method described in claim 3, it is characterized in that, first model is：

Wherein

The negative log-likelihood function of the parameter θ of first model is：

5. according to the method described in claim 4, it is characterized in that, the uncorrelated bound term is：

The cost function of second model is：

6. according to the method described in claim 5, it is characterized in that, the derived function of the cost function of second model is：