CN101996284A

CN101996284A - Screening method of characteristic gene of certain disease

Info

Publication number: CN101996284A
Application number: CN2010105623087A
Authority: CN
Inventors: 王�华; 梁素梅; 王建军; 孟华; 李红娟
Original assignee: Kunming University of Science and Technology
Current assignee: Kunming University of Science and Technology
Priority date: 2010-11-29
Filing date: 2010-11-29
Publication date: 2011-03-30

Abstract

The invention provides a method for screening the characteristic genes of a disease, analyzing the gene expression profile from a new angle. First, principal component analysis (PCA) is used to reduce dimensionality: at a contribution rate of 99%, the eigenvalues and contribution rates serve as classification factors to screen the characteristic genes among the tumor genes and to reasonably reduce the effective dimensionality of the gene expression space. Then, on the basis of the PCA result, a support vector machine based on the Fourier transform in the complex field classifies and identifies the samples. The method thus combines PCA with data processing that moves from the real field to the complex field. The frequency is recorded; the higher the frequency, the better the classification result, and the gene label is extracted reasonably and effectively. The invention can be applied in the field of biological diseases, for example to gene classification and identification, and in meteorology and geography, for example to meteorological observation, and it has clear practical value.

Description

Method for screening the characteristic genes of a disease

Technical field

The invention belongs to the field of bioengineering. Specifically, it is a method suited to extracting biological gene labels, namely a method for screening the characteristic genes of a disease, and it relates to the biostatistical processing of small-sample, high-throughput, high-dimensional data.

Background technology

With the development of large-scale gene expression profiling technology, the normal gene expression of various human tissues has been obtained, and reference benchmarks now exist for the gene expression profiles of various kinds of patients. The analysis and modeling of gene expression data have therefore become an important topic in bioinformatics research. Being able to use gene expression profiles to identify tumor subtypes accurately at the molecular level would be significant for the diagnosis and treatment of tumors, because each tumor has a characteristic gene expression profile (see the accompanying drawings). The task is to find, among the thousands of genes measured on a DNA chip, a group of gene "labels" that determine the sample class, that is, the "informative genes". Because the number of genes is usually very large, a large number of "irrelevant genes" must be weeded out when determining the tumor gene label, which greatly narrows the range of tumor genes to be searched. In practice, the expression levels of some genes in a gene expression profile are very close across all samples. For example, for many genes neither the mean nor the variance of their distribution differs significantly between the two acute leukemia subtypes (ALL and AML). Such genes can be regarded as irrelevant to the sample class: they provide no useful information for distinguishing sample types and only increase the computational complexity of the informative-gene search. These "irrelevant genes" must therefore be rejected.

For the task of extracting genomic information, mathematical modeling can be used to extract the genomic information of the samples under test effectively. At home and abroad, the application of the PCA method to gene chip research is still at an early stage.

Summary of the invention

The object of the present invention is to provide a method for effectively extracting gene labels, namely a method for screening the characteristic genes of a disease. It can be applied to any small-sample, high-throughput, high-dimensional data processing. The method is simple and convenient and has high value for wider adoption. It provides a reliable and practical approach to classification and identification problems in biology such as gene label extraction, and it allows gene labels to be extracted effectively and promptly so that the disease can be analyzed and the patient's suffering relieved.

The object of the invention is achieved by the following technical scheme:

The method for screening the characteristic genes of a disease uses principal component analysis and a support vector machine classification method based on the Fourier transform, and comprises the following steps:

(1) Principal component analysis (PCA) is used to reduce the dimensionality of the samples to be analyzed, reducing the gene expression space. With a contribution rate of 78%-88% or more, the eigenvalues and contribution rates are used as classification factors to screen the characteristic genes of the disease, reasonably reducing the effective dimensionality of the gene expression space;

The correlation matrix of all genes is obtained and its eigenvalues are computed (eig). The underlying idea is that an eigenvalue is a variance and variance is information content: the larger the variance, the more information it contains, so larger eigenvalues are preferred. The reduced set of genes at a contribution rate of 99% is thus obtained:

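The formula is as follows:

Contribution rate:

\frac{\lambda_{i}}{\sum_{k=1}^{p} \lambda_{k}} = \frac{Var(F_{i})}{\sum_{k=1}^{p} Var(F_{k})}

In the formula, p is the number of eigenvalues, which equals the number of genes; \lambda_{i} is the i-th eigenvalue; F_{i} is the i-th principal component; and Var denotes variance.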
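The contribution rate defined above can be illustrated numerically. The following Python sketch is not part of the original disclosure, which refers only to MATLAB's eig. It assumes a genes-by-samples array and uses NumPy; the function names `contribution_rates` and `components_for_threshold` are hypothetical.

```python
import numpy as np

def contribution_rates(data):
    """Eigenvalues of the gene correlation matrix and their contribution rates.

    data: array of shape (genes, samples); each row is one gene (assumed layout).
    Returns eigenvalues in descending order, per-eigenvalue contribution
    rates, and cumulative contribution rates.
    """
    corr = np.corrcoef(data)                  # gene-by-gene correlation matrix
    eigvals = np.linalg.eigvalsh(corr)[::-1]  # symmetric matrix; sort descending
    eigvals = np.clip(eigvals, 0.0, None)     # remove tiny negative round-off
    rates = eigvals / eigvals.sum()           # lambda_i / sum_k lambda_k
    return eigvals, rates, np.cumsum(rates)

def components_for_threshold(cumulative, threshold=0.99):
    """Smallest number of leading components whose cumulative rate reaches threshold."""
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(cumulative))
```

Because the correlation matrix has ones on its diagonal, its eigenvalues sum to the number of genes, so each contribution rate is simply the eigenvalue divided by p.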

(2) On the basis of the principal component analysis, a support vector machine based on the Fourier transform in the complex field is used to classify and identify the samples. The frequency is recorded; the higher the frequency, the better the classification result. The gene label of the disease is then extracted;

The characteristic genes of the disease screened out in step (1) are classified and identified with the support vector machine based on the Fourier transform in the complex field, and the gene label is extracted:

The formula is as follows. The two-dimensional discrete Fourier transform of alpha is computed:

X(k, l) = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} x(n, m) e^{-j \frac{2\pi}{N} kn} e^{-j \frac{2\pi}{M} lm}

= \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} x(n, m) W_{N}^{kn} W_{M}^{lm}

where

W_{N} = e^{-j \frac{2\pi}{N}}, W_{M} = e^{-j \frac{2\pi}{M}}

In the formula, M and N are the number of normal samples and the number of pathological samples, respectively, and x here represents alpha.
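As an illustration of the two-dimensional transform, the following Python sketch evaluates the double sum directly and can be checked against NumPy's `np.fft.fft2`. It is an assumption-laden sketch rather than the patented procedure: the input is taken to be an array indexed as x[n, m] with shape (N, M), and the function name `dft2_direct` is hypothetical.

```python
import numpy as np

def dft2_direct(x):
    """Direct evaluation of X(k, l) = sum_n sum_m x[n, m] W_N^{kn} W_M^{lm}.

    x: array of shape (N, M), indexed as x[n, m] (assumed layout).
    """
    x = np.asarray(x, dtype=complex)
    N, M = x.shape
    n = np.arange(N)
    m = np.arange(M)
    # A[k, n] = W_N^{kn} with W_N = exp(-j 2 pi / N)
    A = np.exp(-2j * np.pi * np.outer(n, n) / N)
    # B[m, l] = W_M^{lm} with W_M = exp(-j 2 pi / M)
    B = np.exp(-2j * np.pi * np.outer(m, m) / M)
    return A @ x @ B

def amplitude(X):
    """Magnitude of each complex transform coefficient."""
    return np.abs(X)
```

For any real or complex input, `dft2_direct(x)` should agree with `np.fft.fft2(x)` up to floating-point error; the direct form is shown only because it mirrors the formula term by term.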

The purpose of the invention is to extract gene labels effectively by working in both the real field and the frequency domain. The means and variances of any two genes often differ very little, so the analysis starts from the matrix formed by the data itself. Because gene expression levels are strongly correlated, the invention analyzes them with principal component analysis and classifies using the eigenvalues (that is, the variances) of the correlation matrix and of the data matrix itself. The number of samples is usually very small relative to the number of genes, so using the data directly for classification leads to a small-sample learning problem. Reducing the number of gene features used for classification and identification is therefore the core of the classification problem; in practice, classification works well only when the features are few. On the basis of step (1), the support vector machine based on the Fourier transform is then used for classification and identification. The classification result is good, and gene labels can be extracted effectively.

Description of drawings:

Fig. 1 is the scatter plot of all sample points;

Fig. 2 is the empirical distribution plot;

Fig. 3 is classification result 1 of the support vector machine based on the Fourier transform in the frequency domain, for small samples;

Fig. 4 is plot 1 of α and amplitude;

Fig. 5 is classification result 2 of the support vector machine based on the Fourier transform in the frequency domain, for small samples;

Fig. 6 is plot 2 of α and amplitude.

Embodiment

The substance of the invention is further described below with reference to the accompanying drawings and an embodiment, which does not limit the invention.

Embodiment 1. The basic steps are as follows:

Screening is performed with principal component analysis and the support vector machine classification method based on the Fourier transform, and comprises the following steps:

(1) Principal component analysis (PCA) is used to reduce the dimensionality of the samples to be analyzed. The object of analysis is a 2000 × 62 matrix of fluorescence data for colon cancer genes, in which the rows represent genes, 22 columns are normal samples, and 40 columns are patient samples. The gene expression space is reduced. With a contribution rate of 78%-88% or more, the eigenvalues and contribution rates are used as classification factors to screen the characteristic genes of the disease, reasonably reducing the effective dimensionality of the gene expression space;

The correlation matrix of all genes is obtained and its eigenvalues are computed (eig). The underlying idea is that an eigenvalue is a variance and variance is information content, so larger eigenvalues are preferred. The reduced set of genes at a contribution rate of 99% is thus obtained:

The formula is as follows:

Contribution rate:

\frac{\lambda_{i}}{\sum_{k=1}^{p} \lambda_{k}} = \frac{Var(F_{i})}{\sum_{k=1}^{p} Var(F_{k})}

The formula is as follows. The two-dimensional discrete Fourier transform of alpha is computed:

X(k, l) = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} x(n, m) e^{-j \frac{2\pi}{N} kn} e^{-j \frac{2\pi}{M} lm}

= \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} x(n, m) W_{N}^{kn} W_{M}^{lm}

where

W_{N} = e^{-j \frac{2\pi}{N}}, W_{M} = e^{-j \frac{2\pi}{M}}.

The analysis results are as follows:

(1) 1. The DNA microarray, also called the gene chip, is a technology developed in recent years that can detect DNA fragment sequences and gene expression levels quickly and efficiently. Hundreds to millions of nucleotide sequences, called probes, are fixed on a small (about 1 cm²) solid substrate such as glass, a silicon chip, or a film; a substrate with probes fixed on it is called a DNA microarray. Because nucleic acid molecules follow the base-complementarity principle when forming double strands, nucleotide fragments in the sample that are complementary to the probe array can be detected, yielding information about gene expression in the sample. This is the gene expression profile. A gene expression profile can therefore be represented as a matrix or a vector whose element values are the expression levels of the genes. Here the fluorescence intensity data of 2000 genes are known; these data are analyzed to extract the gene label of the tumor genes.

2. The experimental data from step 1 comprise samples from 40 tumor patients and 22 normal persons, each containing expression data for 2000 genes. The gene chip data were first analyzed in MATLAB. Several important statistics of the gene expression profile data were calculated, as shown in Table 1, and the scatter plot of all sample points (Fig. 1) and the empirical distribution plot (Fig. 2) were drawn.

Table 1. Summary statistics of the gene expression profile

min	max	mean	median	std
2.5401	14.3514	7.6554	7.5851	1.6411
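Statistics of the kind listed in Table 1 can be computed in a few lines. The following Python sketch is illustrative only: the source does not say whether the statistics were pooled over all matrix entries or which normalization the standard deviation used, so pooling over all entries and the sample standard deviation (divisor n - 1) are assumptions here, and the function name `summary_statistics` is hypothetical.

```python
import numpy as np

def summary_statistics(expr):
    """Pooled summary statistics over every entry of an expression matrix.

    expr: array of expression values of any shape.
    The standard deviation uses ddof=1 (sample standard deviation), an assumption.
    """
    values = np.asarray(expr, dtype=float).ravel()
    return {
        "min": float(values.min()),
        "max": float(values.max()),
        "mean": float(values.mean()),
        "median": float(np.median(values)),
        "std": float(values.std(ddof=1)),
    }
```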

(2) Programming the PCA method in MATLAB gives the correlation matrix, eigenvalues, eigenvectors, and contribution rates, shown in Table 2. Table 2 also shows that the cumulative contribution rate of the first 1948 genes is only about 1%, while the last 52 contain 99% of the information of all the indices. It is reasonable to take as characteristic genes the genes corresponding to the entries of largest absolute value in the eigenvectors. Computing the correlation matrix and eigenvalues separately for the normal and the tumor patient samples gives results similar to those for all samples, so all samples can be analyzed together as the object of study. The genes obtained after dimensionality reduction are shown in Table 3.
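One way to read the selection rule in this step is sketched below in Python. This is an interpretation, not the original MATLAB program: for each retained principal component, the gene with the largest absolute eigenvector entry is taken as a characteristic gene. The genes-by-samples layout, the 0-based gene indices, and the function name `select_characteristic_genes` are assumptions.

```python
import numpy as np

def select_characteristic_genes(data, threshold=0.99):
    """Pick one gene per retained principal component.

    data: array of shape (genes, samples) (assumed layout).
    Components are kept until the cumulative contribution rate reaches
    threshold; for each kept component the gene with the largest absolute
    eigenvector entry is selected. Returns (k, sorted unique gene indices).
    """
    corr = np.corrcoef(data)
    eigvals, eigvecs = np.linalg.eigh(corr)      # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]            # reorder to descending
    eigvals = np.clip(eigvals[order], 0.0, None)
    eigvecs = eigvecs[:, order]
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    k = min(int(np.searchsorted(cumulative, threshold)) + 1, len(eigvals))
    top = np.argmax(np.abs(eigvecs[:, :k]), axis=0)  # one gene per component
    return k, sorted({int(g) for g in top})
```

Several components may point to the same gene, so the number of selected genes can be smaller than the number of retained components.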

Table 2. Eigenvalues and contribution rates of the correlation matrix, computed with the PCA method in MATLAB

Gene number	Eigenvalue	Contribution rate	Cumulative contribution rate
1949	3.1439	0.0016	0.0116
1950	3.2714	0.0016	0.0132
1951	3.2961	0.0016	0.0148
1952	3.426	0.0017	0.0166
1953	3.5774	0.0018	0.0183
1954	3.5935	0.0018	0.0201
1955	3.8179	0.0019	0.0221
1956	3.9897	0.002	0.024
1957	4.0711	0.002	0.0261
1958	4.142	0.0021	0.0282
1959	4.2776	0.0021	0.0303
1960	4.4746	0.0022	0.0325
1961	4.5358	0.0023	0.0348
1962	4.664	0.0023	0.0371
1963	4.955	0.0025	0.0396
1964	5.1764	0.0026	0.0422
1965	5.234	0.0026	0.0448
1966	5.3834	0.0027	0.0475
1967	5.5517	0.0028	0.0503
1968	5.8656	0.0029	0.0532
1969	5.9414	0.003	0.0562
1970	6.2221	0.0031	0.0593
1971	6.3619	0.0032	0.0625
1972	6.8821	0.0034	0.0659
1973	7.1863	0.0036	0.0695
1974	7.593	0.0038	0.0733
1975	7.7005	0.0039	0.0772
1976	8.1223	0.0041	0.0812
1977	8.5939	0.0043	0.0855
1978	8.7864	0.0044	0.0899
1979	9.1849	0.0046	0.0945
1980	9.3452	0.0047	0.0992
1981	9.6833	0.0048	0.104
1982	10.4506	0.0052	0.1092
1983	10.939	0.0055	0.1147
1984	11.85	0.0059	0.1206
1985	12.6548	0.0063	0.127
1986	13.1761	0.0066	0.1336
1987	14.1693	0.0071	0.1406
1988	17.6173	0.0088	0.1494
1989	18.3988	0.0092	0.1586
1990	21.0809	0.0105	0.1692
1991	22.1408	0.0111	0.1803
1992	24.8231	0.0124	0.1927
1993	28.4395	0.0142	0.2069
1994	38.0237	0.019	0.2259
1995	54.129	0.0271	0.253
1996	62.571	0.0313	0.2842
1997	90.865	0.0454	0.3297
1998	109.9177	0.055	0.3846
1999	112.6179	0.0563	0.4409
2000	179.7467	0.0899	0.5308

Table 3. Gene numbers obtained by classification with principal component analysis

(3) On the basis of (2), the samples are classified with the support vector machine based on the Fourier transform (see Table 5). The number of genes obtained by applying the Fourier-transform-based support vector machine after principal component analysis is compared with the number obtained by applying it directly, without principal component analysis (see Table 4). That is, frequency-domain Fourier-transform SVC applied after reducing the dimensionality of the data with PCA is compared with the same processing without PCA, and the classification is evaluated at the various screening levels used to select genes related to tissue sample typing. The results show that, when tissue samples are analyzed with the frequency-domain Fourier-transform method, PCA improves classification quality, screens differentially expressed genes reasonably, and clearly improves the classification result.

Table 4. Support vector machine applied directly to the data and applied after principal component analysis

	SVM applied directly to the data	SVM applied after principal component analysis
Execution time	0.1 seconds	0.0
Status	OPTIMAL_SOLUTION	OPTIMAL_SOLUTION
|w0|^2	0.057335	1.984033
Margin	8.352546	1.419893
Sum alpha	0.057335	1.984033
Support Vectors	35 (56.5%)	25 (55.6%)
bias	0	0

Table 5. Data after processing with the Fourier-transform-based support vector machine following principal component analysis

Frequency	36	37	38
Gene number	839	1283	1464	1695	1862	1864
Frequency	38	38	39	41	42	43
Gene number	1791	788	988	1810	1671	249

The data and figures below result from the support vector machine based on the Fourier transform. Applied directly, with the frequency interval [1801, 1994] as the range, 25 genes can be extracted as label genes. Applied after principal component analysis, with the frequency interval [36, 43] as the range, 12 genes can be extracted as label genes (see Table 5). Figs. 3 to 6 show that the classification in Figs. 5 and 6 is better than in Figs. 3 and 4: in Figs. 5 and 6 the alpha values are smoother and the upper plot has more peaks, while in the lower plot the amplitude after the Fourier transform is steadier and smaller in value, so the result is better.

In summary, the frequency-domain Fourier-transform support vector machine applied after principal component analysis works well and extracts the classification gene label more quickly; in practice, this method can be considered for sample classification problems.

Compared with the prior art, the beneficial effects of the invention are as follows:

PCA filters out spurious information, reduces the influence of irrelevant variables, and removes interference from "noise". PCA simplifies gene chip data analysis and reduces the effective dimensionality of the gene expression space. By determining the number of common factors (that is, the number of principal components or composite factors), PCA can indicate to some extent the number of classes of samples or genes. Through the signs of the factor loadings, PCA shows not only positive correlations between samples or genes but also negative ones. Whereas earlier approaches processed only real numbers, the support vector machine based on the Fourier transform in the complex field processes complex-valued data and can classify more effectively.

The invention can be applied to any small-sample, high-throughput, high-dimensional data processing. The method is simple and convenient and has high value for wider adoption. It provides a reliable and practical approach to classification and identification problems in biology such as gene label extraction, and it allows gene labels to be extracted effectively and promptly so that the disease can be analyzed and the patient's suffering relieved.

Claims

1. A method for screening the characteristic genes of a disease, using principal component analysis and a support vector machine classification method based on the Fourier transform, comprising the following steps:

The correlation matrix of all genes is obtained and its eigenvalues are computed (eig). An eigenvalue is a variance, and variance is information content: the larger the eigenvalue, that is, the variance, the more information it contains. The reduced set of genes at a contribution rate of 99% is obtained:

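The formula is as follows:

Contribution rate:

\frac{\lambda_{i}}{\sum_{k=1}^{p} \lambda_{k}} = \frac{Var(F_{i})}{\sum_{k=1}^{p} Var(F_{k})}

In the formula, p is the number of eigenvalues, which equals the number of genes; \lambda_{i} is the i-th eigenvalue; F_{i} is the i-th principal component; and Var denotes variance;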

The formula is as follows. The two-dimensional discrete Fourier transform of alpha is computed:

X(k, l) = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} x(n, m) e^{-j \frac{2\pi}{N} kn} e^{-j \frac{2\pi}{M} lm}

= \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} x(n, m) W_{N}^{kn} W_{M}^{lm}

where

W_{N} = e^{-j \frac{2\pi}{N}}, W_{M} = e^{-j \frac{2\pi}{M}}

In the formula, M and N are the number of normal samples and the number of pathological samples, respectively, and x here represents alpha.