CN110033824A - Gene expression profile classification method based on shared dictionary learning

Info

Publication number: CN110033824A
Application number: CN201910296287.XA
Authority: CN
Inventors: 彭绍亮; 刘伟; 李非; 杨亚宁; 李肯立; 潘佳铭; 骆嘉伟; 刘云浩; 田李
Original assignee: Hunan University
Current assignee: Hunan University
Priority date: 2019-04-13
Filing date: 2019-04-13
Publication date: 2019-07-19

Abstract

The invention belongs to the field of gene expression profile classification and discloses a gene expression profile classification method based on shared dictionary learning, an application of sparse dictionary learning to biological big data. The method first constructs a shared dictionary that captures the samples of all classes. A projection matrix is then trained together with the dictionary; projecting samples with this matrix widens the distance between samples of different classes. Finally, the class of a test sample is determined from the distance between the coding coefficient vectors obtained by reconstructing the sample with the dictionary. The method classifies gene expression profile data quickly and efficiently, which helps to distinguish cancer types and subtypes, supports understanding of tumor pathogenesis at the molecular level, and provides a gene-level approach to treating tumors. The method can share samples across classes, keeps a stable projection ability when only a small number of samples is available, and substantially improves classification accuracy compared with general classification methods.

Description

Gene expression profile classification method based on shared dictionary learning

Technical field:

The invention belongs to the field of gene expression profile classification. It relates in particular to classification methods for tumor cell line gene expression profile data, and specifically to a gene expression profile classification method based on shared dictionary learning.

Background:

Tumors are diseases that seriously threaten human life and health. Researchers have long sought the best means of curing them. However, there are many tumor types, even the same tumor can be divided into many different subtypes, and the treatment differs between subtypes. Classifying tumors accurately and quickly therefore maximizes the therapeutic effect and can extend or even save the patient's life. Classifying tumors by their gene expression profiles is a comparatively new approach: it is fast, the classification process is automated, and it saves considerable manpower and material resources, so it has become a research hotspot in cancer classification. However, most conventional machine learning methods currently achieve generally low accuracy on tumor gene expression profiles, and a more suitable classification method is needed. Dictionary learning is comparatively well suited to gene expression profile data, but general dictionary learning methods focus only on improving the trained dictionary's ability to reconstruct samples and ignore its ability to discriminate between them. Gene expression profile data also contain a large amount of redundancy and noise, so ordinary dictionary learning methods cannot make full use of the sample data to obtain the required features. The resulting classifiers discriminate weakly, which degrades the final classification results and keeps performance below the required standard.

Summary of the invention:

The technical problem to be solved by the invention is to make full use of the dictionary's ability to reconstruct samples in dictionary learning while using sample sharing to capture the samples of all classes, thereby improving feature mapping and discrimination and solving the low classification accuracy of conventional methods on tumor cell line gene expression profiles. To this end, the invention adopts the following technical solution:

A gene expression profile classification method based on shared dictionary learning, comprising the following steps:

Step 1: Initialize the dictionary and the projection matrix P, comprising the following steps:

1.1. Input the gene expression profile training sample set Y = [Y₁, Y₂, ..., Y_c], where c is the total number of classes and Y_c is the subset of the training data belonging to class c.

1.2. Initialize the dictionary with a random number sequence. The dictionary comprises the shared dictionary D₀, which captures the samples of all classes, and the class-specific dictionary D = [D₁, D₂, ..., D_c], where D_c is the sub-dictionary corresponding to training subset Y_c.

1.3. Initialize the projection matrix P with the principal component analysis (PCA) transformation matrix of the training sample set Y.
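Step 1 can be sketched in Python with NumPy as follows. This is a minimal illustration, not the patented implementation: the function name, the argument names, the unit-norm scaling of the random atoms, and the choice to define the atoms in the projected low-dimensional space (suggested by step 6.3) are all assumptions made for the example.

```python
import numpy as np

def init_dictionary_and_projection(Y_classes, n_shared, n_per_class,
                                   n_components, seed=0):
    """Y_classes: list of arrays (n_genes, n_samples_k), one per class."""
    rng = np.random.default_rng(seed)
    Y = np.hstack(Y_classes)                       # (n_genes, n_samples)

    # Step 1.3: PCA transformation matrix of Y as the initial projection P.
    Y_centered = Y - Y.mean(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(Y_centered, full_matrices=False)
    P = U[:, :n_components].T                      # (n_components, n_genes)

    # Step 1.2: random initialization; atoms are scaled to unit L2 norm
    # (an assumption, the text only says "random number sequence").
    def random_atoms(n_atoms):
        D = rng.standard_normal((n_components, n_atoms))
        return D / np.linalg.norm(D, axis=0, keepdims=True)

    D0 = random_atoms(n_shared)                    # shared dictionary D0
    D_sub = [random_atoms(n_per_class) for _ in Y_classes]  # D_1 ... D_c
    return D0, D_sub, P
```

Here `n_components` must not exceed the smaller of the number of genes and the number of training samples, since the PCA basis is taken from the left singular vectors of the centered data.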

Step 2: Calculate and update the sparse coding coefficient matrix, comprising the following steps:

2.1. Obtain the sparse coding coefficient matrix of the dictionary from its component coefficient matrices. The class-specific coefficient matrix is X = [X₁, X₂, ..., X_c], where X_c is the sub-coefficient matrix of class c and X^T is the transpose of X; X⁰ is the sparse coefficient matrix of Y on D, and (X⁰)^T is the transpose of X⁰.

2.2. Obtain the objective function to be minimized by sparse representation. The objective function combines a discriminative fidelity term, a sparsity term, and a coefficient discrimination term F(X).

Here c is the total number of classes, and each class has its own sub-coefficient matrix of training samples. The discriminative fidelity term minimizes the error with which the dictionary reconstructs each class of projected training samples, which strengthens the representation of samples from the same class and weakens the representation of samples from other classes, so that the dictionary reconstructs the projected samples as faithfully as possible. The sparsity term keeps the sparse coefficient matrix sparse, with the degree of sparsity adjusted by the parameter λ. The coefficient discrimination term F(X) adjusts the distribution of X so that the within-class scatter is minimized and the between-class scatter is maximized, which gives the dictionary discriminative ability on the training sample set Y.

2.3. Fix the values of the dictionary and the projection matrix P in the objective function, so that the unknown becomes the sparse coding coefficient matrix. Compute the coding coefficients of each class with the iterative projection method, then combine the class coding coefficients into the sparse coding coefficient matrix.
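The coding step with the dictionary and P held fixed can be illustrated with a simplified problem. The sketch below uses ISTA (iterative shrinkage-thresholding) as a stand-in solver and keeps only the reconstruction and sparsity terms; it omits F(X) and is not claimed to be the iterative projection method of the text. The names `sparse_code`, `soft_threshold`, `lam`, and `n_iter` are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    """Element-wise shrinkage operator used by ISTA."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code(D, Z, lam=0.1, n_iter=200):
    """Minimize 0.5 * ||Z - D X||_F^2 + lam * ||X||_1 over X.

    D: (d, n_atoms) fixed dictionary.
    Z: (d, n_samples) projected samples, i.e. P @ Y with P fixed.
    """
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    X = np.zeros((D.shape[1], Z.shape[1]))
    for _ in range(n_iter):
        grad = D.T @ (D @ X - Z)           # gradient of the quadratic term
        X = soft_threshold(X - grad / L, lam / L)
    return X
```

Coding each class subset separately and stacking the results column-wise would mirror the per-class computation and final combination described in step 2.3.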

Step 3: Update the projection matrix P. Fix the values of the dictionary and the sparse coding coefficient matrix in the objective function, and project the projection matrix P directly onto the training sample set Y.

Step 4: Update the dictionary. Fix the values of the projection matrix P and the sparse coding coefficient matrix in the objective function, compute the sub-dictionary of each class with the iterative projection method, then combine the class sub-dictionaries into the dictionary.
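The per-class structure of the dictionary update in Step 4 can be sketched as follows. This example swaps in a regularized least-squares update per class followed by projection of the atoms onto the unit ball; it ignores the shared dictionary and the discrimination terms, so it is a simplification and not the update rule of the invention. The function names and the `eps` parameter are hypothetical.

```python
import numpy as np

def update_sub_dictionary(Z_k, X_k, eps=1e-8):
    """Least-squares update of one class sub-dictionary.

    Z_k: (d, n_k) projected samples of class k (P fixed).
    X_k: (n_atoms, n_k) their coding coefficients (fixed).
    """
    # D_k = Z_k X_k^T (X_k X_k^T + eps * I)^-1, solved without an explicit inverse.
    G = X_k @ X_k.T + eps * np.eye(X_k.shape[0])
    D_k = np.linalg.solve(G, X_k @ Z_k.T).T
    # Project atoms onto the unit ball: atoms with norm > 1 are rescaled to 1.
    norms = np.maximum(np.linalg.norm(D_k, axis=0, keepdims=True), 1.0)
    return D_k / norms

def update_dictionary(Z_classes, X_classes):
    """Update every class sub-dictionary, then combine them column-wise."""
    subs = [update_sub_dictionary(Z, X) for Z, X in zip(Z_classes, X_classes)]
    return subs, np.hstack(subs)
```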

Step 5: Seek a locally optimal solution of the objective function by gradient descent. During the solution, execute Step 3 and Step 4 in a loop until the reconstruction error stabilizes and no longer changes, which yields the final dictionary and projection matrix P.
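The stopping rule of Step 5 amounts to repeating the update steps until the reconstruction error stops changing. A generic sketch of such a loop is given here; the relative tolerance `tol`, the cap `max_iter`, and the representation of the model as an opaque `state` passed through callables are assumptions made for the example.

```python
def alternate_until_stable(update_steps, reconstruction_error, state,
                           tol=1e-4, max_iter=100):
    """Apply the update steps in turn until the error stabilizes.

    update_steps: list of functions state -> state (e.g. the P update
                  and the dictionary update).
    reconstruction_error: function state -> float.
    Returns the final state and the error history.
    """
    prev = reconstruction_error(state)
    history = [prev]
    for _ in range(max_iter):
        for step in update_steps:
            state = step(state)
        err = reconstruction_error(state)
        history.append(err)
        # Stop when the change in error is small relative to its size.
        if abs(prev - err) <= tol * max(abs(prev), 1.0):
            break
        prev = err
    return state, history
```

In a full implementation, `state` would hold the dictionary, the projection matrix P, and the coefficient matrix, and each entry of `update_steps` would update one of them with the others fixed.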

Step 6: Determine the class of the test data from the distance between sparse coding vectors, comprising the following steps:

6.1. Pass in the dictionary and projection matrix P obtained in Step 5 together with the test data set y.

6.2. Project y with the projection matrix P into a low-dimensional space to obtain the projected sample.

6.3. In the low-dimensional space, use the dictionary to form a sparse linear representation of the projected sample, which yields the sparse coding vector u.

6.4. Use the distance between sparse coding vectors u as the criterion for the final classification of the projected sample.
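Steps 6.1 to 6.4 can be sketched as a single function. The text does not state which vectors u is compared against, so this example assumes the reference points are the mean coding vectors of each training class (`class_means`, a hypothetical argument) and uses Euclidean distance. The sparse coding again uses ISTA as a stand-in solver.

```python
import numpy as np

def classify(y, P, D, class_means, lam=0.1, n_iter=200):
    """Return (predicted class index, sparse coding vector u).

    y: (n_genes,) test sample; P: (d, n_genes); D: (d, n_atoms).
    class_means: list of (n_atoms,) mean coding vectors, one per class
                 (assumed reference points, not specified in the text).
    """
    z = P @ y                                   # 6.2: low-dimensional projection
    # 6.3: sparse linear representation of z over D (ISTA stand-in).
    L = np.linalg.norm(D, 2) ** 2
    u = np.zeros(D.shape[1])
    for _ in range(n_iter):
        v = u - D.T @ (D @ u - z) / L
        u = np.sign(v) * np.maximum(np.abs(v) - lam / L, 0.0)
    # 6.4: the nearest class mean in coefficient space decides the class.
    dists = [np.linalg.norm(u - m) for m in class_means]
    return int(np.argmin(dists)), u
```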

Compared with the prior art, the invention discloses a gene expression profile classification method based on shared dictionary learning. While attending to the dictionary's ability to reconstruct samples, the method uses the shared dictionary to improve feature extraction and discrimination, so that tumor cell line gene expression profile data can be classified accurately and quickly. The method overcomes shortcomings of conventional classification methods and general dictionary learning, namely weak sample discrimination and poor classification performance.

Description of the drawings:

Fig. 1 is a flow chart of the dictionary training process;

Fig. 2 is a schematic diagram of the shared dictionary learning method.

Detailed description of the embodiments

The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments:

The dictionary training process shown in Fig. 1 comprises the five steps from Step 1 to Step 5, as follows:

Here c is the total number of classes, and each class has its own sub-coefficient matrix of training samples. The discriminative fidelity term minimizes the error with which the dictionary reconstructs each class of projected training samples, which strengthens the representation of samples from the same class and weakens the representation of samples from other classes, so that the dictionary reconstructs the projected samples as faithfully as possible. The sparsity term keeps the sparse coefficient matrix sparse, with the degree of sparsity adjusted by the parameter λ. The coefficient discrimination term F(X) adjusts the distribution of X so that the within-class scatter is minimized and the between-class scatter is maximized, which gives the dictionary discriminative ability on the training sample set Y. The principle of the shared dictionary learning method is shown in Fig. 2.
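The coefficient discrimination term F(X) is described only qualitatively: small within-class scatter and large between-class scatter. One common way to express this, assumed here for illustration and not confirmed as the form used in the invention, is a Fisher-type criterion equal to the within-class scatter minus the between-class scatter of the coefficient vectors. The function name `fisher_term` and the `labels` argument are hypothetical.

```python
import numpy as np

def fisher_term(X, labels):
    """Within-class scatter minus between-class scatter of coefficients.

    X: (n_atoms, n_samples) coding coefficients, one column per sample.
    labels: (n_samples,) class id of each column.
    Lower values mean tighter classes that lie further apart.
    """
    labels = np.asarray(labels)
    global_mean = X.mean(axis=1, keepdims=True)
    within, between = 0.0, 0.0
    for k in np.unique(labels):
        X_k = X[:, labels == k]
        class_mean = X_k.mean(axis=1, keepdims=True)
        within += np.sum((X_k - class_mean) ** 2)
        between += X_k.shape[1] * np.sum((class_mean - global_mean) ** 2)
    return within - between
```

With two tight, well-separated groups of coefficient vectors, the value is strongly negative when the labels match the groups and positive when the labels are mixed across them, which is the behavior the term is meant to reward.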

6.2. Project y with the projection matrix P into a low-dimensional space to obtain the projected sample.

The above is only a preferred embodiment of the invention. The scope of protection is not limited to this embodiment; all technical solutions within the concept of the invention fall within its scope of protection. It should be noted that, for those of ordinary skill in the art, improvements and modifications made without departing from the principles of the invention shall also be regarded as falling within its scope of protection.

Claims

1. A gene expression profile classification method based on shared dictionary learning, characterized by comprising the following steps:

1.1. Input the gene expression profile training sample set Y = [Y₁, Y₂, ..., Y_c], where c is the total number of classes and Y_c is the subset of the training data belonging to class c;

1.2. Initialize the dictionary with a random number sequence, the dictionary comprising the shared dictionary D₀, which captures the samples of all classes, and the class-specific dictionary D = [D₁, D₂, ..., D_c], where D_c is the sub-dictionary corresponding to training subset Y_c;

1.3. Initialize the projection matrix P with the principal component analysis (PCA) transformation matrix of the training sample set Y;

2.1. Obtain the sparse coding coefficient matrix of the dictionary from its component coefficient matrices, the class-specific coefficient matrix being X = [X₁, X₂, ..., X_c], where X_c is the sub-coefficient matrix of class c, X^T is the transpose of X, X⁰ is the sparse coefficient matrix of Y on D, and (X⁰)^T is the transpose of X⁰;

wherein c is the total number of classes and each class has its own sub-coefficient matrix of training samples; the discriminative fidelity term minimizes the error with which the dictionary reconstructs each class of projected training samples, which strengthens the representation of samples from the same class and weakens the representation of samples from other classes, so that the dictionary reconstructs the projected samples as faithfully as possible; the sparsity term keeps the sparse coefficient matrix sparse, with the degree of sparsity adjusted by the parameter λ; the coefficient discrimination term F(X) adjusts the distribution of X so that the within-class scatter is minimized and the between-class scatter is maximized, which gives the dictionary discriminative ability on the training sample set Y;

2.3. Fix the values of the dictionary and the projection matrix P in the objective function, so that the unknown becomes the sparse coding coefficient matrix; compute the coding coefficients of each class with the iterative projection method, then combine the class coding coefficients into the sparse coding coefficient matrix;

Step 3: Update the projection matrix P: fix the values of the dictionary and the sparse coding coefficient matrix in the objective function, and project the projection matrix P directly onto the training sample set Y;

Step 5: Seek a locally optimal solution of the objective function by gradient descent; during the solution, execute Step 3 and Step 4 in a loop until the reconstruction error stabilizes and no longer changes, which yields the final dictionary and projection matrix P;

6.1. Pass in the dictionary and projection matrix P obtained in Step 5 together with the test data set y;

6.3. In the low-dimensional space, use the dictionary to form a sparse linear representation of the projected sample, which yields the sparse coding vector u;