CN108595568A - A kind of text sentiment classification method based on very big unrelated multivariate logistic regression - Google Patents
- Publication number: CN108595568A (application CN201810332338.5A)
- Authority
- CN
- China
- Prior art keywords
- model
- data
- logistic regression
- cost function
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/18—Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
Abstract
The present invention provides a text sentiment classification method based on maximally uncorrelated multinomial logistic regression. The method includes: obtaining text data and preprocessing it; obtaining the cost function of a second model by introducing a correlated-parameter penalty term into the cost function of a first model; feeding the preprocessed training data into the derivative of the second model's cost function and solving it to obtain the second model, where the first model is a multinomial logistic regression model and the second model is a maximally uncorrelated multinomial logistic regression model; and feeding the preprocessed data to be predicted into the second model to obtain the sentiment category of each text entry in the data to be predicted. Adding the uncorrelatedness constraint term makes the model more robust to redundant data and reduces the complexity of the traditional multinomial logistic regression model, giving it stronger generalization ability, so that the text entries in the acquired target text data can be classified accurately.
Description
Technical field
The present invention relates to the field of machine learning, and in particular to a text sentiment classification method based on maximally uncorrelated multinomial logistic regression.
Background technology
Classify key component as machine learning, data mining, in image recognition, drug development, speech recognition, hand-written
Identification etc. has a wide range of applications.It is to identify that which classification a new example belonged to has prison based on known training set
The problem concerning study superintended and directed.In sorting algorithm, non-linear classification and can expand to more classification most important.
Support vector machines (SVM) is a kind of two-value grader of classics, and Hinge is used to lose, by solving belt restraining item
The double optimization problem of part establishes the best line of demarcation between data set.Compared with other algorithms, considerable advantage is:It is logical
It crosses using different kernel functions, SVM both can be used for linear classification, can be used for Nonlinear Classification.But due to its dependence
In one-to-one pattern, SVM is very limited on multicategory classification, although having been done much expanding to SVM on multicategory classification
Effort, but these methods still have many negative impacts.For example, in multi-class classification, decision-making technique one-to-many SVM is just deep
It is unbalanced between by data set class to influence.Another important problem, which is it, may distribute to same instance multiple classes.Although
Many methods, which are suggested, to be solved these problems, but they have other adverse effects:Such as efficiency.SVM's the result is that pure
Pure two points, do not support probability output.SVM does not have from the numerical value output of a task and the numerical value output of another task can
Compare property.In addition, compared with the grader based on degree of belief, this numerical value that there is no limit is difficult to explain for terminal user
The meaning of its behind.
Logistic regression (LR) is one of the important methods for classification. Standard logistic regression uses the logistic loss and classifies via a weighted linear combination of the input variables. Through its nonlinear mapping, logistic regression greatly reduces the weight of points far from the separating plane and increases the weight of the data points most relevant to classification. Compared with the support vector machine, standard logistic regression can provide an estimate of the class distribution within a given class, and it also holds a clear advantage in model training time. The logistic regression model is comparatively simple, easy to understand, and more convenient to implement for large-scale linear classification. In addition, standard logistic regression extends to multi-class classification more easily than the support vector machine. Several improved logistic regression algorithms, such as sparse logistic regression and weighted logistic regression, have achieved good results in their respective fields.
However, logistic regression only handles binary classification and cannot be applied directly to multi-class (k > 2) classification problems. To solve multi-class problems with logistic regression, there are generally two extension schemes. One is to build k independent binary classifiers, each labeling the samples of one class as positive and the samples of all other classes as negative; for a given test sample, each classifier yields the probability that the sample belongs to its class, so multi-class classification can be performed by taking the class with the largest probability. The other is multinomial logistic regression (Multinomial Logistic Regression, MLR), the generalization of the logistic regression model to multi-class problems. Which scheme to use generally depends on whether the categories to be classified are mutually exclusive. For most multi-class problems the categories are mutually exclusive, so multinomial logistic regression usually gives better results than one-versus-rest logistic regression. Moreover, multinomial logistic regression needs to be trained only once, so it also runs faster.
In the field of computer information processing, text data sets usually contain a large amount of shared information, which greatly increases the complexity of recognition and the recognition error. Although multinomial logistic regression trains a group of parameters per category to compute the corresponding probability, it does not consider whether the parameter groups are correlated with one another. A text sentiment classification method based on maximally uncorrelated multinomial logistic regression is therefore of practical significance.
Summary of the invention
To solve the above technical problem, the present invention provides a text sentiment classification method based on maximally uncorrelated multinomial logistic regression. The method includes:
obtaining text data and preprocessing it, where the text data includes training data and data to be predicted, and the data to be predicted includes multiple text entries;
obtaining the cost function of a second model by introducing a correlated-parameter penalty term on the basis of the cost function of a first model;
feeding the preprocessed training data into the derivative of the second model's cost function and solving it to obtain the second model, where the first model is a multinomial logistic regression model and the second model is a maximally uncorrelated multinomial logistic regression model; and
feeding the preprocessed data to be predicted into the second model to obtain the sentiment category of each text entry in the data to be predicted.
Further, feeding the preprocessed data to be predicted into the second model to obtain the sentiment category of each text entry in the data to be predicted includes:
feeding each text entry of the preprocessed data to be predicted into the second model to obtain the text sentiment class probability of each text entry;
setting a classification threshold;
when the text sentiment class probability of a text entry is greater than the classification threshold, judging that the text entry belongs to the first sentiment category; and
when the text sentiment class probability of a text entry is less than or equal to the classification threshold, judging that the text entry belongs to the second sentiment category.
Further, obtaining the cost function of the second model by introducing the correlated-parameter penalty term on the basis of the cost function of the first model includes:
obtaining the negative log-likelihood function of the model parameters of the first model;
obtaining the uncorrelatedness constraint term; and
introducing the uncorrelatedness constraint term into the cost function of the first model to obtain the cost function of the second model.
Further, the first model is:
P(y = j | x; θ) = exp(θ_j^T x) / Σ_{l=1}^{k} exp(θ_l^T x), j = 1, …, k
The negative log-likelihood function of the parameter θ of the first model is:
J(θ) = -(1/m) Σ_{i=1}^{m} Σ_{j=1}^{k} 1{y_i = j} log [ exp(θ_j^T x_i) / Σ_{l=1}^{k} exp(θ_l^T x_i) ]
This negative log-likelihood function is the cost function of the first model, where m is the number of independent samples.
Further, the uncorrelatedness constraint term is:
λ Σ_{i<j} θ_i^T θ_j
The uncorrelatedness constraint term is the correlated-parameter penalty term, where θ_i and θ_j are any two different parameter groups.
The cost function of the second model is:
J(θ) = -(1/m) Σ_{i=1}^{m} Σ_{j=1}^{k} 1{y_i = j} log [ exp(θ_j^T x_i) / Σ_{l=1}^{k} exp(θ_l^T x_i) ] + λ Σ_{i<j} θ_i^T θ_j
Further, the derivative of the cost function of the second model is:
∇_{θ_j} J(θ) = -(1/m) Σ_{i=1}^{m} x_i (1{y_i = j} - P(y_i = j | x_i; θ)) + λ Σ_{l≠j} θ_l
The text sentiment classification method based on maximally uncorrelated multinomial logistic regression provided by the present invention has the following technical effects:
On the basis of the traditional multinomial logistic regression model, the present invention introduces a correlated-parameter penalty term (the uncorrelatedness constraint term) to obtain the cost function of the maximally uncorrelated multinomial logistic regression model, and obtains that model by solving the derivative of its cost function. Adding the uncorrelatedness constraint term makes the model more robust to redundant data and reduces the complexity of the traditional multinomial logistic regression model, so the new classification model (the maximally uncorrelated multinomial logistic regression model) has stronger generalization ability and can accurately classify the text entries in the acquired target text data.
Description of the drawings
To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of a text sentiment classification method based on maximally uncorrelated multinomial logistic regression provided by an embodiment of the present invention;
Fig. 2 is an example of target text data provided by an embodiment of the present invention;
Fig. 3 is a flow chart of the method for determining the sentiment category of each text entry provided in an embodiment of the present invention;
Fig. 4 is a flow chart of the method for obtaining the cost function of the second model provided by an embodiment of the present invention;
Fig. 5 is a schematic comparison of the MLR and UMLR parameter norm magnitudes on the MNIST data set provided in an embodiment of the present invention;
Fig. 6 is a schematic comparison of the MLR and UMLR parameter norm magnitudes on the COIL20 data set provided in an embodiment of the present invention;
Fig. 7 is a schematic comparison of the MLR and UMLR parameter norm magnitudes on the ORL data set provided in an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
It should be noted that the terms "first", "second", and the like in the specification, claims, and drawings are used to distinguish similar objects and are not used to describe a specific order or precedence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the present invention described herein can be implemented in orders other than those illustrated or described. In addition, the terms "comprising" and "having" and any variants of them are intended to cover non-exclusive inclusion: a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units explicitly listed, and may include other steps or units that are not explicitly listed or that are inherent to the process, method, product, or device.
It should be explained that the existing logistic regression (LR) algorithm and l2-constrained multinomial logistic regression (RMLR) algorithm have shortcomings and defects in classification applications, which motivates the improved maximally uncorrelated multinomial logistic regression algorithm proposed here.
Logistic regression (LR) algorithm:
For logistic regression, assume a data set D = {x_i, y_i}, i = 1, …, N, x_i ∈ R^D, y_i ∈ {0, 1}; the input vector is x = (x^(1), …, x^(D)), and the class label y is binary: y is 0 or 1. Logistic regression (LR) is based on the following probabilistic model:
h_θ(x) = g(θ^T x) = 1 / (1 + e^{-θ^T x}) (1)
where g(z) = 1 / (1 + e^{-z}) is called the logistic or sigmoid function.
For the binary classification problem, assume y takes the value 0 or 1 and the event y = 1 follows a Bernoulli distribution; then:
P(y = 1 | x; θ) = h_θ(x)
P(y = 0 | x; θ) = 1 - h_θ(x)
The two formulas above can be merged into:
p(y | x; θ) = h_θ(x)^y (1 - h_θ(x))^{1-y} (2)
where y ∈ {0, 1}. Assuming the m samples are independent, the likelihood function of the parameter θ can be written as:
L(θ) = ∏_{i=1}^{m} h_θ(x_i)^{y_i} (1 - h_θ(x_i))^{1-y_i}
The log-likelihood function can then be expressed as:
l(θ) = Σ_{i=1}^{m} [ y_i log h_θ(x_i) + (1 - y_i) log(1 - h_θ(x_i)) ]
The optimal θ can be obtained by maximizing l(θ). Usually one sets J(θ) = -(1/m) l(θ) to obtain the loss function corresponding to l(θ), and solves for the optimal θ by minimizing the loss function. However, logistic regression can only handle binary classification and cannot be applied directly to multi-class problems.
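The sigmoid model and the loss J(θ) = -(1/m) l(θ) described above can be written as a minimal, self-contained sketch; the toy data below are an assumption for illustration only:

```python
import math

def sigmoid(z):
    # logistic (sigmoid) function g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + math.exp(-z))

def lr_loss(theta, X, y):
    # J(theta) = -(1/m) * sum_i [ y_i*log h(x_i) + (1 - y_i)*log(1 - h(x_i)) ]
    m = len(X)
    total = 0.0
    for xi, yi in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, xi)))
        total += yi * math.log(h) + (1 - yi) * math.log(1.0 - h)
    return -total / m

# Two toy samples with two features each (illustrative only)
X = [[1.0, 2.0], [1.0, -1.0]]
y = [1, 0]
print(round(lr_loss([0.0, 0.0], X, y), 4))  # 0.6931: the loss at theta = 0 is ln 2
```

At θ = 0 every prediction is 0.5, so the average negative log-likelihood equals ln 2, a convenient sanity check for any implementation.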
l2-constrained multinomial logistic regression (RMLR) algorithm:
Because traditional logistic regression cannot handle multi-class problems, multinomial logistic regression (MLR) changes the cost function of logistic regression to adapt to multi-class classification.
Assume a data set D = {x_i, y_i}, i = 1, …, N, x_i ∈ R^D, y_i ∈ {1, …, K} (K > 2), with input vector x = (x^(1), …, x^(D)). Multinomial logistic regression (MLR) is based on the following probabilistic model:
P(y = j | x; θ) = exp(θ_j^T x) / Σ_{l=1}^{K} exp(θ_l^T x)
where θ_1, …, θ_K are the parameter vectors of the K classes. Its cost function is:
J(θ) = -(1/m) Σ_{i=1}^{m} Σ_{j=1}^{K} 1{y_i = j} log [ exp(θ_j^T x_i) / Σ_{l=1}^{K} exp(θ_l^T x_i) ]
However, multinomial logistic regression has an unusual feature: its parameter set is "redundant". Suppose we subtract a vector ψ from every parameter vector θ_j, so that each θ_j becomes θ_j - ψ (j = 1, …, K). The hypothesis function becomes:
P(y = j | x; θ) = exp((θ_j - ψ)^T x) / Σ_{l=1}^{K} exp((θ_l - ψ)^T x) = exp(θ_j^T x) / Σ_{l=1}^{K} exp(θ_l^T x)
This shows that subtracting ψ from each θ_j does not affect the predictions of the hypothesis function at all; that is, the multinomial logistic regression model above has redundant parameters.
To address this over-parameterization, the l2-constrained multinomial logistic regression (RMLR) algorithm modifies the cost function by adding a weight-decay term. This decay term penalizes overly large parameter values and makes the cost function strictly convex, which guarantees a unique solution. Its cost function is:
J(θ) = -(1/m) Σ_{i=1}^{m} Σ_{j=1}^{K} 1{y_i = j} log [ exp(θ_j^T x_i) / Σ_{l=1}^{K} exp(θ_l^T x_i) ] + (λ/2) Σ_{j=1}^{K} Σ_{d=1}^{D} θ_{jd}^2
The Hessian matrix then becomes invertible, and because the cost function is convex, optimization algorithms are guaranteed to converge to the globally optimal solution. Although the RMLR algorithm alleviates over-fitting to a certain extent, its performance is poor on data sets that contain redundancy.
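The parameter redundancy and the weight-decay cost can be checked with a small self-contained sketch; the function names and toy numbers below are assumptions for illustration, not part of the patent:

```python
import math

def softmax_probs(Theta, x):
    # P(y = j | x) = exp(theta_j . x) / sum_l exp(theta_l . x)
    scores = [sum(t * v for t, v in zip(theta_j, x)) for theta_j in Theta]
    mx = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - mx) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def rmlr_cost(Theta, X, y, lam):
    # MLR negative log-likelihood plus the l2 weight-decay term (lam/2)*||Theta||^2
    m = len(X)
    nll = -sum(math.log(softmax_probs(Theta, xi)[yi]) for xi, yi in zip(X, y)) / m
    decay = 0.5 * lam * sum(t * t for row in Theta for t in row)
    return nll + decay

# Subtracting the same vector psi from every theta_j leaves the predicted
# probabilities unchanged: the redundancy discussed above.
Theta = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
x = [2.0, -1.0]
psi = [0.3, -0.2]
shifted = [[t - p for t, p in zip(row, psi)] for row in Theta]
p1 = softmax_probs(Theta, x)
p2 = softmax_probs(shifted, x)
print(all(abs(a - b) < 1e-12 for a, b in zip(p1, p2)))  # True
```

The `True` output illustrates why the unpenalized cost has no unique minimizer and why the decay term is added.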
Based on the above analysis, a maximally uncorrelated multinomial logistic regression model is proposed. Specifically, this embodiment provides a text sentiment classification method based on maximally uncorrelated multinomial logistic regression. As shown in Fig. 1, the method includes:
S101. Obtain text data and preprocess it; the text data includes training data and data to be predicted, and the data to be predicted includes multiple text entries.
For example, read the evaluation text data left by consumers after consuming at a shop; the text data consists of consumers' post-consumption comments. As shown in Fig. 2, the first column is the text label column, where 0 indicates a positive comment and 1 indicates a negative comment, and the second column is the consumer comment column. Because the original text data contains too much noise to train on directly, corresponding preprocessing is needed. The preprocessing of the evaluation text data set (applied to the RDD in steps S103 and S106) specifically includes:
obtaining the whitespace characters in the text comment sentences to be processed and replacing them with the empty string;
obtaining special strings, digits, and the like in the comment sentences and replacing them with the empty string;
obtaining words in the comment sentences that express a vague tone and converting vague expressions into absolute expressions, so that the vague tone is expressed absolutely;
adding a custom dictionary, into which the high-frequency nouns in the text comment sentences to be processed are added;
segmenting the words in the comment sentences processed above and filtering out the stop words; and
converting the words in the segmented comment sentences into row vectors, thereby generating word vectors.
Specifically, the method for preprocessing the pending text includes:
using the function re.compile('#([^>]*)#') to match comment fragments that begin and end with "#" and replacing them with the empty string, where re is the regular expression module of Python, whose functions can be called directly to perform regular matching on strings;
using the function re.compile(u'[^\u4e00-\u9fa5|a-zA-Z]+') to match special strings, digits, and the like in comments and replacing them with the empty string;
using the function flashtext.KeywordProcessor to replace comment text, converting vague expressions into absolute expressions, for example replacing "so-so" with "bad" and "not particularly" with "not";
adding a custom dictionary: high-frequency nouns in the text data set are added as new terms to the dictionary to enhance segmentation accuracy; domain-specific nouns are added to the custom dictionary according to the specific scene, so that segmentation can be completed more efficiently and accurately;
segmenting comments with the function jieba.cut() and filtering the stop words, where stop words are characters or words that contribute little to the text classification target (for example, common Chinese function words); different scenes have different stop-word lists, and the stop words in the text are deleted according to the corresponding list; and
converting the segmented comment data set into a word2vec model using the function gensim.models.Word2Vec() to generate the word vectors.
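The first three cleaning steps above can be sketched with Python's standard `re` module alone; the patterns and the small replacement table are assumptions modeled on the functions named in the text (a plain dict stands in for flashtext here):

```python
import re

# Hypothetical cleaning pipeline mirroring the steps above.
hash_tag = re.compile(r'#([^>]*)#')               # "#...#" fragments in comments
non_cjk = re.compile(u'[^\u4e00-\u9fa5a-zA-Z]+')  # keep only CJK and Latin letters
absolute = {'一般般': '不好'}                      # vague wording -> absolute wording

def clean(comment):
    comment = hash_tag.sub('', comment)            # drop "#...#" fragments
    comment = non_cjk.sub('', comment)             # drop digits, punctuation, spaces
    for fuzzy, repl in absolute.items():
        comment = comment.replace(fuzzy, repl)     # absolutize the vague tone
    return comment

print(clean('#tag123# 味道一般般, 123!!'))  # 味道不好
```

The replacement table would of course be much larger in practice; the point is only the order of operations.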
For a comment, the word vector of each word is generated with the word2vec model, and the word vectors of all words in the comment are averaged dimension by dimension to obtain the word-vector representation of the comment. Suppose the comment data set contains n distinct words and a sentence contains m words; then the word vector of each word of the sentence is as shown in formula (81):
w_i = (w_i^(1), w_i^(2), …, w_i^(n)), i = 1, …, m (81)
and the word vector of the sentence is as shown in formula (82):
s = (1/m) Σ_{i=1}^{m} w_i (82)
By repeating the word-vector generation step for each word with the word2vec model, the word-vector representation of the whole comment data set is obtained.
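The dimension-wise averaging of formula (82) can be sketched as follows; the tiny two-dimensional vectors stand in for word2vec output and are purely illustrative:

```python
# Per-word vectors (stand-ins for word2vec output; illustrative values only)
word_vecs = {
    '味道': [0.25, 0.5],
    '不错': [0.75, 0.0],
}

def sentence_vector(words, vecs):
    # average the word vectors of a sentence dimension by dimension
    dims = len(next(iter(vecs.values())))
    out = [0.0] * dims
    for w in words:
        for d, v in enumerate(vecs[w]):
            out[d] += v
    return [v / len(words) for v in out]

print(sentence_vector(['味道', '不错'], word_vecs))  # [0.5, 0.25]
```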
S102. On the basis of the cost function of the first model, obtain the cost function of the second model by introducing the correlated-parameter penalty term.
S103. Feed the preprocessed training data into the derivative of the cost function of the second model and solve it to obtain the second model; the first model is a multinomial logistic regression model, and the second model is a maximally uncorrelated multinomial logistic regression model.
S104. Feed the preprocessed data to be predicted into the second model to obtain the sentiment category of each text entry in the data to be predicted.
That is, the correlated-parameter penalty term is introduced and the maximally uncorrelated multinomial logistic regression model is established; the properly formatted data are fed into the model to predict the sentiment category of each evaluation text.
Specifically, in step S104, feeding the preprocessed data to be predicted into the second model to obtain the sentiment category of each text entry in the data to be predicted, as shown in Fig. 3, includes:
S104a. Feed each text entry of the preprocessed data to be predicted into the second model to obtain the text sentiment class probability of each text entry.
S104b. Set a classification threshold. The classification threshold is preferably used in the binary classification case of multinomial logistic regression and is specifically 0.5.
S104c. When the text sentiment class probability of a text entry is greater than the classification threshold, judge that the text entry belongs to the first sentiment category.
S104d. When the text sentiment class probability of a text entry is less than or equal to the classification threshold, judge that the text entry belongs to the second sentiment category.
For example, with the classification threshold set to 0.5: when the class probability computed by the model for a sample is greater than 0.5, the comment is labeled 1 and treated as a positive comment; when the class probability is less than or equal to 0.5, the comment is labeled 0 and treated as a negative comment.
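The threshold rule of steps S104b-S104d amounts to a one-line decision; a minimal sketch with the 0.5 threshold from the example above:

```python
def classify(prob, threshold=0.5):
    # probability above the threshold -> label 1 (positive); otherwise 0 (negative)
    return 1 if prob > threshold else 0

print([classify(p) for p in (0.91, 0.5, 0.12)])  # [1, 0, 0]
```

Note that a probability exactly equal to the threshold falls into the second (negative) category, matching step S104d.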
In the field of computer information processing, data sets usually contain much shared information, which greatly increases the complexity of recognition and the recognition error. Although multinomial logistic regression trains k groups of parameters to compute the probability for each category, it does not consider whether the k parameter groups are correlated. If the parameters (θ_1, θ_2, …, θ_k) form a minimum point of the cost function, then any parameter θ_i can be expressed linearly in terms of the other θ_j (j ≠ i), i.e.
θ_i = λ_0 + Σ_{j≠i} λ_j θ_j (9)
which shows that the parameters of different categories are correlated. Although the l2 regularizer constrains the elements within each parameter group, it does not consider the correlation between the parameter groups of different categories, which leads to poor classification performance on data sets with much redundancy. For any two different parameter groups θ_i and θ_j, by the basic inequality:
θ_i^T θ_j ≤ (1/2)(||θ_i||^2 + ||θ_j||^2)
with equality if and only if θ_i = θ_j. If θ_i is correlated with θ_j, i.e. θ_i = λ_0 + λ_j θ_j, then θ_i^T θ_j is relatively large, so we add the uncorrelatedness constraint term:
λ Σ_{i<j} θ_i^T θ_j
This constraint term penalizes correlated parameters and helps retain as many uncorrelated, discriminative features as possible. The cost function is therefore:
J(θ) = -(1/m) Σ_{i=1}^{m} Σ_{j=1}^{k} 1{y_i = j} log [ exp(θ_j^T x_i) / Σ_{l=1}^{k} exp(θ_l^T x_i) ] + λ Σ_{i<j} θ_i^T θ_j
To apply optimization algorithms, the derivative of J(θ) is:
∇_{θ_j} J(θ) = -(1/m) Σ_{i=1}^{m} x_i (1{y_i = j} - P(y_i = j | x_i; θ)) + λ Σ_{l≠j} θ_l
From this derivation, the uncorrelated parameters θ can be obtained quickly by the gradient descent algorithm and its improved variants.
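A runnable numpy sketch of the cost and its derivative as derived above. The exact form of the uncorrelatedness constraint term (λ Σ_{i<j} θ_iᵀθ_j) is an assumption reconstructed from the inequality argument, and plain gradient descent stands in for the improved algorithms mentioned:

```python
import numpy as np

def umlr_loss_grad(Theta, X, y, lam):
    # Theta: (k, d) parameter groups; X: (m, d) samples; y: (m,) labels in [0, k).
    # Penalty lam * sum_{i<j} theta_i . theta_j is an assumed reconstruction
    # of the uncorrelatedness constraint term described in the text.
    m = X.shape[0]
    scores = X @ Theta.T
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    P = np.exp(scores)
    P /= P.sum(axis=1, keepdims=True)             # P[i, j] = P(y_i = j | x_i)
    nll = -np.log(P[np.arange(m), y]).mean()
    gram = Theta @ Theta.T
    penalty = lam * (gram.sum() - np.trace(gram)) / 2.0
    Y = np.eye(Theta.shape[0])[y]                 # one-hot labels
    # d/d theta_j of the penalty is lam * sum_{l != j} theta_l
    grad = -(Y - P).T @ X / m + lam * (Theta.sum(axis=0) - Theta)
    return nll + penalty, grad

# Toy 3-class problem with a real signal, solved by plain gradient descent
rng = np.random.default_rng(0)
X = rng.normal(size=(90, 4))
y = (X[:, 0] > 0).astype(int) + (X[:, 1] > 0).astype(int)  # labels 0, 1, 2
Theta = np.zeros((3, 4))
for _ in range(150):
    loss, grad = umlr_loss_grad(Theta, X, y, lam=0.01)
    Theta -= 0.1 * grad
```

Starting from Θ = 0 the loss equals ln 3 (uniform predictions, zero penalty) and decreases as the descent proceeds.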
Specifically, corresponding to this embodiment, in step S102, obtaining the cost function of the second model by introducing the correlated-parameter penalty term on the basis of the cost function of the first model, as shown in Fig. 4, includes:
S102a. Obtain the negative log-likelihood function of the model parameters of the first model.
S102b. Obtain the uncorrelatedness constraint term.
S102c. Introduce the uncorrelatedness constraint term into the cost function of the first model to obtain the cost function of the second model.
Accordingly, the first model is:
P(y = j | x; θ) = exp(θ_j^T x) / Σ_{l=1}^{k} exp(θ_l^T x)
The negative log-likelihood function of the parameter θ of the first model is:
J(θ) = -(1/m) Σ_{i=1}^{m} Σ_{j=1}^{k} 1{y_i = j} log [ exp(θ_j^T x_i) / Σ_{l=1}^{k} exp(θ_l^T x_i) ]
This negative log-likelihood function is the cost function of the first model, where m is the number of independent samples. Further, the uncorrelatedness constraint term is:
λ Σ_{i<j} θ_i^T θ_j
The uncorrelatedness constraint term is the correlated-parameter penalty term, where θ_i and θ_j are any two different parameter groups. The cost function of the second model is:
J(θ) = -(1/m) Σ_{i=1}^{m} Σ_{j=1}^{k} 1{y_i = j} log [ exp(θ_j^T x_i) / Σ_{l=1}^{k} exp(θ_l^T x_i) ] + λ Σ_{i<j} θ_i^T θ_j
Further, the derivative of the cost function of the second model is:
∇_{θ_j} J(θ) = -(1/m) Σ_{i=1}^{m} x_i (1{y_i = j} - P(y_i = j | x_i; θ)) + λ Σ_{l≠j} θ_l
For the above, the algorithm steps are:
Input: training set D = {(x1, y1), (x2, y2), …, (xm, ym)}
Process:
  initialize λ, η, Θ
  while the stopping criterion is not satisfied do:
    for j = 1, 2, …, k:
      compute Loss and dΘ from the cost function and its derivative
    Θ = L-BFGS(Loss, dΘ)
Output: regression coefficients Θ
Further, the convergence of the maximally uncorrelated multinomial logistic regression algorithm is analyzed. From the loss function of maximally uncorrelated multinomial logistic regression, it can be obtained that the second derivative of J(θ) is always greater than 0, so J(θ) is a strictly convex function. The convergence of the algorithm can then be proved by analyzing it within the online learning framework and via the analysis of Adam's convergence.
Further, the maximally uncorrelated multinomial logistic regression (UMLR) algorithm proposed by the present invention is evaluated. The experimental results focus on two questions: classification accuracy and execution speed. The data classification algorithms used for comparison include weight-decay multinomial logistic regression, the support vector machine, and the parameter-uncorrelated multinomial logistic regression. The experiments use artificial data sets with different degrees of correlation and four real data sets (MNIST, COIL20, GT, and ORL), with ten-fold cross-validation as the verification mode.
(1) Normalization
Suppose Φ(x)_min and Φ(x)_max are the minimum and maximum values in the data set, respectively. For an example, the normalization algorithm is as follows:
x' = (Φ(x) - Φ(x)_min) / (Φ(x)_max - Φ(x)_min)
Through normalization, dimensional expressions are converted into dimensionless ones, solving the problem of unbalanced data contributions.
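The min-max normalization step above can be sketched as follows; it maps every feature value into [0, 1]:

```python
def min_max_normalize(values):
    # x' = (x - min) / (max - min): maps each value into [0, 1]
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_normalize([2.0, 4.0, 10.0]))  # [0.0, 0.25, 1.0]
```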
(2) Experimental results on the artificial data sets
To verify the algorithm's effectiveness on linearly correlated data sets, artificial data sets are generated as follows: the within-class correlation exceeds 0.9, and the between-class correlation takes the values 0.5, 0.6, 0.7, 0.8, and 0.9 respectively. The sample size and data dimension are (m, n) = (5000, 1000), with 5 categories of 1000 samples each.
The following compares the recognition rates of the maximally uncorrelated multinomial logistic regression algorithm and the l2-constrained multinomial logistic regression algorithm on data with different degrees of correlation.
Table 1. Recognition rates of MLR and UMLR on data sets with different degrees of correlation
(3) Experimental results on the MNIST and COIL20 data sets
The MNIST data set is widely used in the field of pattern recognition. It contains 10 categories corresponding to the handwritten digits 0-9, each with more than 5000 images. The COIL20 data set has 20 different categories, each with 72 images.
Table 2. Recognition rates of SVM, MLR, and UMLR on the MNIST and COIL20 data sets
The table above shows the accuracy of the three algorithms on the two data sets. Fig. 5 is a schematic comparison of the MLR and UMLR parameter norm magnitudes on the MNIST data set, and Fig. 6 the same comparison on the COIL20 data set. In each of Fig. 5 and Fig. 6, the left side is the bar chart of the UMLR parameter norm magnitudes on the corresponding data set, and the right side is the bar chart of the MLR parameter norm magnitudes.
(4) Experimental results on the GT and ORL data sets
The GT data set has 50 categories in total, each containing 15 images. The ORL data set has 20 categories in total, each containing 10 images.
Table 3. Recognition rates of SVM, MLR, and UMLR on the GT and ORL data sets
Fig. 7 is a schematic comparison of the MLR and UMLR parameter norm magnitudes on the ORL data set; the left side of Fig. 7 is the bar chart of the UMLR parameter norm magnitudes on the corresponding data set, and the right side is the bar chart of the MLR parameter norm magnitudes.
(5) Analysis of experimental results
The experiments show that maximally irrelevant multivariate logistic regression (UMLR) achieves higher classification accuracy than the l2-constrained multivariate logistic regression algorithm and the support vector machine algorithm. The improvement is especially pronounced on datasets with high inter-class correlation, indicating greater robustness to redundant data. Its converged parameters are also smaller than those of l2-constrained multivariate logistic regression, which usually implies stronger generalization ability.
The above analysis shows that classification, as an important branch of pattern recognition and data mining, has an increasingly wide range of applications and has become a core and key technology in systems such as criminal investigation, electronic payment, and medical care.
The present invention proposes a maximally irrelevant multivariate logistic regression model. Building on the basic multivariate logistic regression model, the method constructs a novel classifier. Experiments show that it outperforms traditional classification algorithms in both accuracy and robustness, and that the trained model is more interpretable than methods such as support vector machines and naive Bayes.
In conclusion a kind of text sentiment classification method based on very big unrelated multivariate logistic regression provided by the invention,
The technique effect having is:
The present invention is on the basis of traditional multivariate logistic regression model by introducing relevant parameter penalty term (uncorrelated about
Beam item), obtain the cost function of greatly unrelated multivariate logistic regression model;According to the solution greatly unrelated multivariate logistic regression
The derived function of the cost function of model obtains the greatly unrelated multivariate logistic regression model.By adding uncorrelated bound term
So that there is higher robustness for redundant data;The complexity for reducing traditional multivariate logistic regression model, obtains
New disaggregated model (very big unrelated multivariate logistic regression model) has stronger generalization ability;And then it can be to the target of acquisition
Textual entry carries out precise classification in text data.
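The cost-function construction described above (the first model's negative log-likelihood plus an irrelevance constraint term) can be sketched in Python. This is an illustrative sketch, not the patented implementation: the patent's exact penalty formula appears only as an image in the original, so the sketch assumes a common choice, the sum of squared inner products between distinct class weight vectors, with a hypothetical coefficient `lam`.

```python
import numpy as np

def softmax(Z):
    # Row-wise softmax with max subtraction for numerical stability.
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def umlr_cost(Theta, X, Y, lam=0.1):
    """Sketch of the second model's cost: the first model's negative
    log-likelihood plus an assumed irrelevance penalty on the pairwise
    inner products of the class parameter vectors.

    Theta: (d, k) weights, one column per class
    X: (m, d) samples;  Y: (m, k) one-hot labels
    lam: hypothetical penalty coefficient (not from the patent text)
    """
    m = X.shape[0]
    P = softmax(X @ Theta)
    nll = -np.sum(Y * np.log(P + 1e-12)) / m            # cost of the first model
    G = Theta.T @ Theta                                  # all pairwise inner products
    penalty = np.sum(G ** 2) - np.sum(np.diag(G) ** 2)   # keep only i != j terms
    return nll + lam * penalty

# Toy check: with Theta = 0 every class probability is 0.5 and the
# penalty vanishes, so the cost reduces to -log(0.5) = log 2.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Y = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
cost = umlr_cost(np.zeros((2, 2)), X, Y)
```

With a nonzero `Theta`, the penalty term is positive whenever two class weight vectors are correlated, which is the mechanism the irrelevance constraint is meant to suppress.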
It should be noted that the numbering of the embodiments of the present invention is for description only and does not indicate their relative merits.
Those of ordinary skill in the art will understand that all or part of the steps of the above embodiments can be implemented in hardware, or by a program instructing the relevant hardware; the program can be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disc.
The foregoing are merely preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its protection scope.
Claims (6)
1. A text sentiment classification method based on maximally irrelevant multivariate logistic regression, characterized in that the method comprises:
obtaining text data and preprocessing the text data, the text data comprising training data and data to be predicted, and the data to be predicted comprising a plurality of text entries;
on the basis of the cost function of a first model, introducing a correlation parameter penalty term to obtain the cost function of a second model;
inputting the preprocessed training data into the derivative function of the cost function of the second model and solving it to obtain the second model, the first model being a multivariate logistic regression model and the second model being a maximally irrelevant multivariate logistic regression model;
inputting the preprocessed data to be predicted into the second model to obtain the emotion category of each text entry in the data to be predicted.
2. The method according to claim 1, characterized in that inputting the preprocessed data to be predicted into the second model to obtain the emotion category of each text entry in the data to be predicted comprises:
inputting each text entry in the preprocessed data to be predicted into the second model to obtain the text emotion class probability of each text entry;
setting a classification threshold;
when the text emotion class probability of a text entry is greater than the classification threshold, judging that the text entry belongs to a first emotion category;
when the text emotion class probability of a text entry is less than or equal to the classification threshold, judging that the text entry belongs to a second emotion category.
3. The method according to claim 1 or 2, characterized in that, on the basis of the cost function of the first model, introducing the correlation parameter penalty term to obtain the cost function of the second model comprises:
obtaining the negative log-likelihood function of the model parameters of the first model;
obtaining the irrelevance constraint term;
introducing the irrelevance constraint term into the cost function of the first model to obtain the cost function of the second model.
4. The method according to claim 3, characterized in that the first model is:
wherein
the negative log-likelihood function of the parameter θ of the first model is:
this negative log-likelihood function being the cost function of the first model, where m is the number of independent samples.
5. The method according to claim 4, characterized in that the irrelevance constraint term is:
the irrelevance constraint term being the correlation parameter penalty term, where θi and θj are any two different groups of parameters;
and the cost function of the second model is:
6. The method according to claim 5, characterized in that the derivative function of the cost function of the second model is:
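Claims 1-2 describe prediction as computing each text entry's emotion class probability and comparing it with a classification threshold. Below is a minimal sketch of that decision step for the binary sentiment case, assuming a sigmoid probability from a trained weight vector; the names, feature values, and the 0.5 default are illustrative, not taken from the claims.

```python
import numpy as np

def classify_entries(X, theta, threshold=0.5):
    """Assign each preprocessed text entry (a row of X) to an emotion category.

    As in claim 2: a probability above `threshold` gives the first emotion
    category (label 1); a probability at or below it gives the second (label 0).
    """
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))  # sigmoid class probability
    return (p > threshold).astype(int), p

# Two toy feature vectors and an illustrative weight vector.
X = np.array([[2.0, 1.0], [-1.0, -3.0]])
theta = np.array([1.0, 0.5])
labels, probs = classify_entries(X, theta)
# Scores are 2.5 and -2.5, so the first entry falls in the first
# emotion category and the second in the second.
```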
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810332338.5A CN108595568B (en) | 2018-04-13 | 2018-04-13 | Text emotion classification method based on great irrelevant multiple logistic regression |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108595568A true CN108595568A (en) | 2018-09-28 |
CN108595568B CN108595568B (en) | 2022-05-17 |
Family
ID=63622383
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810332338.5A Active CN108595568B (en) | 2018-04-13 | 2018-04-13 | Text emotion classification method based on great irrelevant multiple logistic regression |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108595568B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109671487A (en) * | 2019-02-25 | 2019-04-23 | 上海海事大学 | A kind of social media user psychology crisis alert method |
CN110942450A (en) * | 2019-11-19 | 2020-03-31 | 武汉大学 | Multi-production-line real-time defect detection method based on deep learning |
CN112802456A (en) * | 2021-04-14 | 2021-05-14 | 北京世纪好未来教育科技有限公司 | Voice evaluation scoring method and device, electronic equipment and storage medium |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2004094583A (en) * | 2002-08-30 | 2004-03-25 | Ntt Advanced Technology Corp | Method of classifying writings |
CN103473380A (en) * | 2013-09-30 | 2013-12-25 | 南京大学 | Computer text sentiment classification method |
CN103514279A (en) * | 2013-09-26 | 2014-01-15 | 苏州大学 | Method and device for classifying sentence level emotion |
CN103955451A (en) * | 2014-05-15 | 2014-07-30 | 北京优捷信达信息科技有限公司 | Method for judging emotional tendentiousness of short text |
CN104239554A (en) * | 2014-09-24 | 2014-12-24 | 南开大学 | Cross-domain and cross-category news commentary emotion prediction method |
CN104462408A (en) * | 2014-12-12 | 2015-03-25 | 浙江大学 | Topic modeling based multi-granularity sentiment analysis method |
CN105389583A (en) * | 2014-09-05 | 2016-03-09 | 华为技术有限公司 | Image classifier generation method, and image classification method and device |
CN105740349A (en) * | 2016-01-25 | 2016-07-06 | 重庆邮电大学 | Sentiment classification method capable of combining Doc2vce with convolutional neural network |
US20160321243A1 (en) * | 2014-01-10 | 2016-11-03 | Cluep Inc. | Systems, devices, and methods for automatic detection of feelings in text |
CN106156004A (en) * | 2016-07-04 | 2016-11-23 | 中国传媒大学 | The sentiment analysis system and method for film comment information based on term vector |
CN107798349A (en) * | 2017-11-03 | 2018-03-13 | 合肥工业大学 | A kind of transfer learning method based on the sparse self-editing ink recorder of depth |
Non-Patent Citations (2)
Title |
---|
WIBKE WEBER et al.: "Text Visualization - What Colors Tell About a Text", 2007 11th International Conference Information Visualization (IV '07) |
LI Ping et al.: "Text sentiment analysis based on hybrid chi-square statistics and logistic regression", Computer Engineering |
Also Published As
Publication number | Publication date |
---|---|
CN108595568B (en) | 2022-05-17 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |