CN107391492A - Label-distribution Chinese emotion prediction method based on local sample correlation

Info

Publication number: CN107391492A
Application number: CN201710661382.6A
Authority: CN
Inventors: 贾修; 贾修一; 郑翔
Original assignee: Nanjing University of Science and Technology
Current assignee: Nanjing University of Science and Technology
Priority date: 2017-08-04
Filing date: 2017-08-04
Publication date: 2017-11-24

Abstract

The present invention provides a label-distribution Chinese emotion prediction method based on local sample correlation, comprising the following steps: clustering the training set into m clusters using the k-means clustering method, and initializing the local correlation feature matrix and the cluster-center label matrix; optimizing the objective function by gradient descent to solve for the original feature coefficient matrix, the local correlation feature coefficient matrix, and the local correlation feature matrix; training m linear regression models, with the original features as input and each column of the solved local correlation feature matrix c as output; predicting the local correlation features of a test sample using the trained linear regression models; and predicting the label distribution of the test sample using the output model.

Description

Label-distribution Chinese emotion prediction method based on local sample correlation

Technical field

The present invention relates to emotion prediction technology, and in particular to a label-distribution Chinese emotion prediction method based on local sample correlation.

Background art

Label ambiguity is a popular research direction in machine learning. Two relatively mature paradigms currently address label ambiguity: single-label learning and multi-label learning. In single-label learning, an instance corresponds to exactly one label, whereas in multi-label learning an instance may be associated with several labels. Multi-label learning is thus an extension of single-label learning. Extensive research and experiments have shown that multi-label learning is an effective paradigm with a wide range of applications. Some problems, however, are still not well suited to multi-label learning. For example, in some cases we need to know not only which emotions a sentence is associated with, but also the degree to which each emotion describes the sentence. Label distribution learning was proposed to solve such problems. It further extends multi-label learning: whereas multi-label learning outputs a label set, label distribution learning outputs a label distribution, in which each component represents the degree to which the corresponding label describes the instance (the description degree). Label distribution learning applies to a broader range of scenarios and can handle more label ambiguity problems.

There are currently three main strategies for designing label distribution learning algorithms. The first strategy is problem transformation. It first converts the label distribution learning problem into another problem, such as single-label learning, solves it with an existing algorithm from that paradigm, and then converts the output back into a label distribution. The second strategy is algorithm adaptation. It does not convert the label distribution learning problem into a problem of another learning paradigm. Instead, it takes existing algorithms that solve multivariate regression problems and adapts them to label distribution learning. The third strategy is to design algorithms specifically for the label distribution learning scenario. It involves no problem transformation and solves the label distribution learning problem directly. Unlike the first strategy, it outputs a label distribution directly, so the output does not need to be converted.

Existing label distribution algorithms seldom consider the correlation between labels, or consider only global label correlation. In practice, however, the correlation between labels is usually local. Here we exploit the label correlation within local samples and propose a new label distribution algorithm. We assume that instances can be divided into different clusters and that the instances within each cluster share the same label correlation. To represent the influence of local label correlation, we construct a local correlation vector for each instance as an additional feature of that instance; each entry of the local correlation vector represents the influence of one group of local samples on the instance.

Summary of the invention

An object of the present invention is to provide a label-distribution Chinese emotion prediction method based on local sample correlation, comprising:

Step 1: clustering the training set into m clusters using the k-means clustering method, and initializing the local correlation feature matrix c and the cluster-center label matrix P;

Step 2: optimizing the objective function by gradient descent to solve for the original feature coefficient matrix θ, the local correlation feature coefficient matrix w, and the local correlation feature matrix c;

Step 3: training m linear regression models with an existing linear regression method, taking the original features of the data as input and the local correlation feature matrix c solved in step 2 as output;

Step 4: predicting the local correlation features of a test sample using the trained linear regression models;

Step 5: predicting the label distribution of the test sample using the output model.

The present invention exploits the label correlation within local samples and proposes a new label distribution algorithm. Instances are divided into different clusters, and the instances within each cluster share the same label correlation. For each instance, a local correlation vector is constructed as an additional feature of that instance, and each entry of the local correlation vector represents the influence of one group of local samples on the instance. The label-distribution Chinese emotion prediction method based on local sample correlation achieves good performance.

The present invention is further described below with reference to the accompanying drawing.

Brief description of the drawings

Fig. 1 is a flow chart of the method of the present invention.

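Detailed description of the embodiments

With reference to Fig. 1, a label-distribution Chinese emotion prediction method based on local sample correlation comprises the following steps:

Step 1: clustering the training set into m clusters using the k-means clustering method, and initializing the local correlation feature matrix c and the cluster-center label matrix P;

Step 2: optimizing the objective function by gradient descent to solve for the original feature coefficient matrix θ, the local correlation feature coefficient matrix w, and the local correlation feature matrix c;

Step 3: training m linear regression models with an existing linear regression method, taking the original features of the data as input and the local correlation feature matrix c solved in step 2 as output;

Step 4: predicting the local correlation features of a test sample using the trained linear regression models;

Step 5: predicting the label distribution of the test sample using the output model.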

In step 1, the detailed process of clustering the training set into m clusters using the k-means clustering method and initializing the local correlation feature matrix c and the cluster-center label matrix P is as follows:

Step S100: let the original feature space of the Chinese emotion data be X = R^q, and let the emotion label set corresponding to the i-th instance in the data set be D_i = {d_i^1, d_i^2, …, d_i^L}, where q is the dimension of the original features, L is the number of labels, and d_i^l denotes the description degree of the l-th label for instance x_i. Given a training set S = {(x₁,D₁),(x₂,D₂),…,(x_n,D_n)}, where x_i ∈ X is an instance, the samples are clustered into m clusters in the label space using the k-means clustering method.

Step S101: according to the clustering result, the local correlation feature matrix c and the cluster-center label matrix P are initialized as follows. If instance x_i belongs to the j-th cluster, c_i^j is initialized to 1; otherwise it is initialized to 0, where c_i^j is an element of the local correlation feature matrix c. The j-th cluster center is initialized as

$$
p_j = \frac{1}{|G_j|} \sum_{x_k \in G_j} D_k
$$

where G_j is the j-th cluster, |G_j| is the number of instances in the cluster, and x_k is the k-th element of the j-th cluster.
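Steps S100 and S101 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: it assumes NumPy, a plain Lloyd-style k-means with randomly chosen initial centers, and a cluster center taken as the mean label distribution of the cluster. The function names `kmeans_labels` and `init_c_and_P` are hypothetical.

```python
import numpy as np

def kmeans_labels(D, m, n_iter=50, seed=0):
    """Cluster the rows of D (n x L label distributions) into m clusters.

    Returns an array of n cluster indices. Plain Lloyd iterations; the
    initial centers are m distinct training samples chosen at random.
    """
    rng = np.random.default_rng(seed)
    centers = D[rng.choice(len(D), size=m, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every sample to every center (n x m).
        dist = ((D[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dist.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(m):
            if np.any(assign == j):          # keep the old center if a cluster empties
                new_centers[j] = D[assign == j].mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign

def init_c_and_P(D, assign, m):
    """Initialize c (n x m, one-hot cluster membership) and P (m x L, cluster centers)."""
    n = D.shape[0]
    c = np.zeros((n, m))
    c[np.arange(n), assign] = 1.0            # c_i^j = 1 if x_i is in cluster j, else 0
    P = np.zeros((m, D.shape[1]))
    for j in range(m):
        members = D[assign == j]
        if len(members):                     # an empty cluster keeps a zero row
            P[j] = members.mean(axis=0)
    return c, P
```

Because the clustering runs on the label distributions rather than on the original features, each row of P is itself a label distribution whenever its cluster is non-empty.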

In step 2, the detailed process of optimizing the objective function by gradient descent is as follows:

Step S200: the objective function of the algorithm is:

$$
\begin{aligned}
T(\theta, w, c) ={}& \sum_{i} \sum_{l} \left( d_i^l \ln \frac{d_i^l}{p(y_l \mid x_i; \theta, w, c)} \right) + \lambda_1 \|\theta\|_F^2 + \lambda_2 \|w\|_F^2 \\
&+ \lambda_3 \sum_{i=1}^{n} \sum_{j=1}^{m} c_i^j \left\| p(y \mid x_i; \theta, w, c) - p_j \right\|_2^2
\end{aligned}
$$

where n is the number of samples, m is the number of clusters, p_j is the j-th cluster center, c_i^j is an element of the local correlation feature matrix c, ||·||_F is the Frobenius norm of a matrix, λ₁, λ₂, and λ₃ are three balance parameters, p(y_l|x_i; θ, w, c) is the l-th component of p(y|x_i; θ, w, c), and p(y|x_i; θ, w, c) is the predicted label distribution. The first term of the objective function is the KL divergence, which measures the similarity between the predicted result and the ground truth. The second and third terms are regularization terms intended to simplify the model. The fourth term makes similar samples have similar local correlations: the more similar sample x_i is to p_j, the larger c_i^j is.

Step S201: the objective function in S200 is optimized by gradient descent to solve for the parameters θ, w, and c.
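The objective of step S200 and the descent loop of step S201 can be sketched as below. This is an illustration under several assumptions rather than the patented procedure: θ is stored as a q × L array and w as an m × L array, the predicted distribution is the softmax output model described for step 5, gradients are approximated by central finite differences instead of analytic formulas, the step size is fixed, and c is left unconstrained because the source states no constraint on it. All function names are hypothetical.

```python
import numpy as np

def predict_dist(X, c, theta, w):
    """Softmax output model; each row of the result is a predicted label distribution."""
    scores = X @ theta + c @ w                            # n x L
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability only
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def objective(X, D, P, theta, w, c, lam1, lam2, lam3, eps=1e-12):
    """T(theta, w, c): KL term + two Frobenius regularizers + local-correlation term."""
    pred = predict_dist(X, c, theta, w)
    kl = np.sum(D * (np.log(D + eps) - np.log(pred + eps)))
    reg = lam1 * np.sum(theta ** 2) + lam2 * np.sum(w ** 2)
    # Squared distance between every prediction and every cluster center (n x m).
    sq = ((pred[:, None, :] - P[None, :, :]) ** 2).sum(axis=2)
    return kl + reg + lam3 * np.sum(c * sq)

def numerical_grad(f, A, h=1e-6):
    """Central-difference gradient of the scalar f() with respect to array A (edited in place)."""
    g = np.zeros_like(A)
    for idx in np.ndindex(A.shape):
        old = A[idx]
        A[idx] = old + h
        f_plus = f()
        A[idx] = old - h
        f_minus = f()
        A[idx] = old
        g[idx] = (f_plus - f_minus) / (2 * h)
    return g

def gradient_descent(X, D, P, theta, w, c, lams, lr=0.05, n_iter=20):
    """Fixed-step gradient descent on theta, w and c; returns the parameters and loss history."""
    params = [np.array(a, dtype=float) for a in (theta, w, c)]   # working copies
    f = lambda: objective(X, D, P, *params, *lams)
    history = [f()]
    for _ in range(n_iter):
        grads = [numerical_grad(f, p) for p in params]
        for p, g in zip(params, grads):
            p -= lr * g                                           # simultaneous update
        history.append(f())
    return params, history
```

Finite differences keep the sketch short but scale poorly; for realistic feature dimensions an analytic or automatically differentiated gradient would be the practical choice.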

In step 3, the detailed process of training the m linear regression models is as follows:

For a test sample, the local correlation features are unknown. We therefore train m linear regression models, taking the original features of the training set as input and each column of the local correlation feature matrix c as output.

In step 4, the detailed process of predicting the local correlation features of a test sample using the trained linear regression models is as follows:

Using the m linear regression models obtained in step 3, with the original features of the test sample as input, each of the m local correlation features of the test sample is predicted.
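Steps 3 and 4 can be illustrated with ordinary least squares. The source only says "an existing linear regression method", so the choice of least squares with an intercept, and the names `fit_linear_models` and `predict_local_features`, are assumptions made for this sketch.

```python
import numpy as np

def fit_linear_models(X_train, c_train):
    """Fit m linear models at once: column j of the result maps features to c[:, j].

    X_train is n x q (original features), c_train is n x m (solved local
    correlation features). Returns a (q + 1) x m coefficient array whose last
    row holds the intercepts.
    """
    Xb = np.hstack([X_train, np.ones((X_train.shape[0], 1))])   # append bias column
    B, *_ = np.linalg.lstsq(Xb, c_train, rcond=None)
    return B

def predict_local_features(B, X_test):
    """Predict the m local correlation features for each row of X_test."""
    Xb = np.hstack([X_test, np.ones((X_test.shape[0], 1))])
    return Xb @ B
```

Solving all m regressions in one `lstsq` call is equivalent to fitting them one by one, because each column of `c_train` is an independent target.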

In step 5, the detailed process of predicting the label distribution of a test sample using the output model is as follows:

The original features and local correlation features of the test sample, together with the original feature coefficient matrix and the local correlation feature coefficient matrix, are substituted into the output model to predict the emotion distribution of the test sample. The output model is:

$$
p(y_l \mid x_i; \theta, w, c) = \frac{1}{Z} \exp\left( \sum_{k1} \theta_{l,k1} x_i^{k1} + \sum_{k2} w_{l,k2} c_i^{k2} \right)
$$

where

$$
Z = \sum_{l} \exp\left( \sum_{k1} \theta_{l,k1} x_i^{k1} + \sum_{k2} w_{l,k2} c_i^{k2} \right)
$$

is a normalization term that ensures the description degrees of all labels for one sample sum to 1. θ_{l,k1} is the element in row k1, column l of the original feature coefficient matrix, x_i^{k1} is the k1-th original feature of instance x_i, w_{l,k2} is the element in row k2, column l of the local sample correlation feature coefficient matrix, and c_i^{k2} is the k2-th element of the local sample correlation vector of instance x_i. The first term in the exponent carries the information of the original features; the second term carries the information of the additional features, i.e. the local correlation information.
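The output model is a softmax over the labels, which the following sketch evaluates for a single test sample. It assumes θ is stored as a q × L array and w as an m × L array (row = feature index, column = label index); the function name `output_model` is hypothetical.

```python
import numpy as np

def output_model(x, c_vec, theta, w):
    """Predicted label distribution for one sample.

    x      : length-q vector of original features
    c_vec  : length-m vector of (predicted) local correlation features
    theta  : q x L original feature coefficient matrix
    w      : m x L local correlation feature coefficient matrix
    """
    scores = x @ theta + c_vec @ w      # one score per label
    scores = scores - scores.max()      # shift for numerical stability; cancels in the ratio
    e = np.exp(scores)
    return e / e.sum()                  # division by Z makes the entries sum to 1
```

For example, with one original feature x = [1], θ = [[ln 3, 0]], and w set to zero, the scores are (ln 3, 0), so the predicted distribution is (0.75, 0.25).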

Finally, the performance of the label distribution algorithms is compared using six evaluation measures: Euclidean, Sørensen, Squared χ², K-L, Intersection, and Fidelity.
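The source names the six measures without giving their formulas. The sketch below uses the definitions commonly found in the label distribution learning literature, which is an assumption: the first four are distances (lower is better) and the last two are similarities (higher is better). The function name `ldl_measures` and the smoothing constant `eps` are hypothetical.

```python
import numpy as np

def ldl_measures(d, p, eps=1e-12):
    """Compare a ground-truth distribution d with a predicted distribution p."""
    d = np.asarray(d, dtype=float)
    p = np.asarray(p, dtype=float)
    return {
        # Distances: smaller means the prediction is closer to the truth.
        "euclidean": np.sqrt(np.sum((d - p) ** 2)),
        "sorensen": np.sum(np.abs(d - p)) / np.sum(d + p),
        "squared_chi2": np.sum((d - p) ** 2 / (d + p + eps)),
        "kl": np.sum(d * np.log((d + eps) / (p + eps))),
        # Similarities: larger means the prediction is closer to the truth.
        "intersection": np.sum(np.minimum(d, p)),
        "fidelity": np.sum(np.sqrt(d * p)),
    }
```

In a typical evaluation these values would be averaged over all test samples, one call per sample.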

Claims

1. A label-distribution Chinese emotion prediction method based on local sample correlation, characterized by comprising the following steps:

Step 1: clustering the training set into m clusters using the k-means clustering method, and initializing the local correlation feature matrix c and the cluster-center label matrix P;

Step 2: optimizing the objective function by gradient descent to solve for the original feature coefficient matrix θ, the local correlation feature coefficient matrix w, and the local correlation feature matrix c;

Step 3: training m linear regression models with an existing linear regression method, taking the original features of the data as input and the local correlation feature matrix c solved in step 2 as output;

Step 4: predicting the local correlation features of a test sample using the trained linear regression models;

Step 5: predicting the label distribution of the test sample using the output model.

2. The method according to claim 1, characterized in that the detailed process of step 1 is as follows:

Let the original feature space of the Chinese emotion data be X = R^q, and let the emotion label set corresponding to the i-th instance in the data set be D_i = {d_i^1, d_i^2, …, d_i^L}, where q is the dimension of the original features, L is the number of labels, and d_i^l denotes the description degree of the l-th label for instance x_i;

Given a training set S = {(x₁,D₁),(x₂,D₂),…,(x_n,D_n)}, where x_i ∈ X is an instance;

In the label space, the samples are clustered into m clusters using the k-means clustering method;

According to the clustering result, the local correlation feature matrix c and the cluster-center label matrix P are initialized as follows: if instance x_i belongs to the j-th cluster, c_i^j is initialized to 1, and otherwise to 0, where c_i^j is an element of the local correlation feature matrix c; the j-th cluster center is initialized as

$$
p_j = \frac{1}{|G_j|} \sum_{x_k \in G_j} D_k
$$

where G_j is the j-th cluster, |G_j| is the number of instances in the cluster, and x_k is the k-th element of the j-th cluster.

3. The method according to claim 2, characterized in that the detailed process of step 2 is:

The objective function is established as follows:

$$
\begin{aligned}
T(\theta, w, c) ={}& \sum_{i} \sum_{l} \left( d_i^l \ln \frac{d_i^l}{p(y_l \mid x_i; \theta, w, c)} \right) + \lambda_1 \|\theta\|_F^2 + \lambda_2 \|w\|_F^2 \\
&+ \lambda_3 \sum_{i=1}^{n} \sum_{j=1}^{m} c_i^j \left\| p(y \mid x_i; \theta, w, c) - p_j \right\|_2^2
\end{aligned}
$$

where n is the number of samples, m is the number of clusters, p_j is the j-th cluster center, c_i^j is an element of the local correlation feature matrix c, ||·||_F is the Frobenius norm of a matrix, λ₁, λ₂, and λ₃ are three balance parameters, p(y_l|x_i; θ, w, c) is the l-th component of p(y|x_i; θ, w, c), and p(y|x_i; θ, w, c) is the predicted label distribution;

The above objective function is optimized by gradient descent to solve for the parameters θ, w, and c.

4. The method according to claim 3, characterized in that the detailed process of step 5 is:

The output model is established as follows:

$$
p(y_l \mid x_i; \theta, w, c) = \frac{1}{Z} \exp\left( \sum_{k1} \theta_{l,k1} x_i^{k1} + \sum_{k2} w_{l,k2} c_i^{k2} \right)
$$

where θ_{l,k1} is the element in row k1, column l of the original feature coefficient matrix, x_i^{k1} is the k1-th original feature of instance x_i, w_{l,k2} is the element in row k2, column l of the local sample correlation feature coefficient matrix, and c_i^{k2} is the k2-th element of the local sample correlation vector of instance x_i;

The original features and local correlation features of the test sample, together with the original feature coefficient matrix and the local correlation feature coefficient matrix, are substituted into the output model to predict the emotion distribution of the test sample;

The performance of the label distribution algorithms is compared using six evaluation measures: Euclidean, Sørensen, Squared χ², K-L, Intersection, and Fidelity.