CN101996284A - Screening method of characteristic gene of certain disease - Google Patents

Info

Publication number
CN101996284A
CN101996284A CN2010105623087A CN201010562308A CN101996284A CN 101996284 A CN101996284 A CN 101996284A CN 2010105623087 A CN2010105623087 A CN 2010105623087A CN 201010562308 A CN201010562308 A CN 201010562308A CN 101996284 A CN101996284 A CN 101996284A
Authority
CN
China
Prior art keywords
gene
sigma
fourier transform
classification
contribution rate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2010105623087A
Other languages
Chinese (zh)
Inventor
王�华
梁素梅
王建军
孟华
李红娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology

Landscapes

  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention provides a method for screening the characteristic genes of a disease by analyzing gene expression profiles from a new angle. First, principal component analysis (PCA) is used for dimensionality reduction: at a contribution rate of 99%, the eigenvalues and contribution rates serve as classification factors to screen the characteristic genes of tumor genes and to reasonably reduce the effective dimensionality of the gene expression space. Then, on the basis of PCA, a support vector machine based on the Fourier transform over the complex field classifies and identifies the samples effectively. Combining PCA in this way moves the data processing from the real field to the complex field. The frequency is recorded; the larger the frequency, the better the classification result, and the gene labels are extracted reasonably and effectively. The invention can be applied to biological disease research, such as gene classification and identification, and to meteorology and geography, such as meteorological observation, and has an obvious effect and high practical value.

Description

Screening method for the characteristic genes of a disease
Technical field
The invention belongs to the field of bioengineering. Specifically, it is a method suited to extracting biological gene labels, namely a method for screening the characteristic genes of a disease, and it relates to the biostatistical processing of small-sample, high-throughput, high-dimensional data.
Background
With the development of large-scale gene expression profile technology, the normal gene expression of various human tissues has been obtained, and the gene expression profiles of all kinds of patients now have a reference baseline. The analysis and modeling of gene expression data has therefore become an important topic in bioinformatics research. Identifying tumor subtypes accurately at the molecular level from gene expression profiles would be of great significance for the diagnosis and treatment of tumors, because each tumor has its own characteristic gene expression profile (see the accompanying drawings). The task is to find, among the thousands of genes measured on a DNA chip, a group of gene "labels" that determine the sample class, i.e. the "informative genes". Because the number of genes is usually very large, a large number of "irrelevant genes" must be weeded out when determining tumor gene labels, which greatly narrows the range of tumor genes to be searched. In a gene expression profile, the expression levels of some genes are in fact very close across all samples. For example, for many genes the distributions in the two acute leukemia subtypes (ALL, AML) show no significant difference in either mean or variance. Such genes can be regarded as unrelated to the sample class: they provide no useful information for distinguishing sample types and only increase the computational complexity of the search for informative genes. These "irrelevant genes" must therefore be rejected.
In the field of genome information extraction, mathematical modeling can be used to extract the genome information of a sample under test effectively. Research applying the PCA method to gene chips, in China and abroad, is still at an early stage.
Summary of the invention
The object of the present invention is to provide a method for extracting gene labels effectively, namely a method for screening the characteristic genes of a disease. It can be applied to all small-sample, high-throughput, high-dimensional data processing. The method is simple and convenient and has high promotional value. It provides a reliable and practical approach to classification and identification problems in biology such as gene label extraction, and it can extract gene labels effectively and in time, so that the disease can be analyzed and the patient's suffering alleviated.
The object of the invention is achieved by the following technical scheme:
The method for screening the characteristic genes of a disease uses principal component analysis together with a support vector machine classification method based on the Fourier transform, and comprises the following steps:
(1) Use principal component analysis (PCA) to reduce the dimensionality of the samples to be analyzed, shrinking the gene expression space. With the contribution rate above 78%-88%, screen the characteristic genes of the disease using the eigenvalues and contribution rates as classification factors, reasonably reducing the effective dimensionality of the gene expression space;
Obtain the correlation matrix of all genes and compute its eigendecomposition (eig). The underlying idea is that eigenvalues are variances and variance measures information content: the larger the variance, the more information it contains, so larger eigenvalues are better. Obtain the reduced set of genes at a contribution rate of 99%:
The formula is as follows:
Contribution rate: $\lambda_i \big/ \sum_{k=1}^{p} \lambda_k = \mathrm{Var}(F_i) \big/ \sum_{k=1}^{p} \mathrm{Var}(F_k)$
In the formula, p denotes the number of eigenvalues, i.e. the number of genes, and Var denotes variance.
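The contribution rate above can be illustrated with a short Python sketch. It is only a minimal illustration under stated assumptions: the eigenvalues are hypothetical, and the helper names `contribution_rates` and `components_needed` are invented for this example and do not come from the patent.

```python
def contribution_rates(eigenvalues):
    """Return each eigenvalue's share of the total variance."""
    total = sum(eigenvalues)
    return [value / total for value in eigenvalues]


def components_needed(eigenvalues, threshold):
    """Count how many of the largest eigenvalues are needed for the
    cumulative contribution rate to reach the given threshold."""
    rates = sorted(contribution_rates(eigenvalues), reverse=True)
    cumulative = 0.0
    for count, rate in enumerate(rates, start=1):
        cumulative += rate
        if cumulative >= threshold:
            return count
    return len(rates)


# Hypothetical eigenvalues of a 4-variable correlation matrix (they sum to 4).
example = [2.5, 1.0, 0.4, 0.1]
rates = contribution_rates(example)
needed = components_needed(example, 0.85)
print(rates, needed)
```

Raising the threshold toward 99% keeps more components, which is the trade-off between dimensionality and retained variance that step (1) describes.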
(2) On the basis of the principal component analysis (PCA), use a support vector machine based on the Fourier transform over the complex field to classify and identify the samples effectively. Record the frequency; the larger the frequency, the better the classification, and extract the gene labels of the disease;
The characteristic genes of the disease screened out in step (1) are classified and identified with the support vector machine based on the Fourier transform over the complex field, and the gene labels are extracted:
The formula is as follows. A two-dimensional discrete Fourier transform is applied to alpha:
$X(k, l) = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} X(n, m)\, e^{-j \frac{2\pi}{N} kn}\, e^{-j \frac{2\pi}{M} lm}$
$= \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} X(n, m)\, W_N^{kn}\, W_M^{lm}$
where $W_N = e^{-j \frac{2\pi}{N}}$ and $W_M = e^{-j \frac{2\pi}{M}}$.
In the formula, M and N denote the number of normal samples and the number of pathological samples, respectively, and X here denotes alpha.
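As a hedged illustration of the two-dimensional discrete Fourier transform given above, the following Python sketch evaluates the double sum directly. The function name `dft2` and the 2 x 2 block of alpha values are hypothetical, and the direct summation is chosen for readability rather than speed; it is not the patent's MATLAB implementation.

```python
import cmath


def dft2(x):
    """Two-dimensional DFT of x, indexed as x[n][m] with N rows and M columns.
    The result is indexed as result[k][l]."""
    N, M = len(x), len(x[0])
    result = []
    for k in range(N):
        row = []
        for l in range(M):
            total = 0j
            for n in range(N):
                for m in range(M):
                    # Exponent -j*2*pi*(k*n/N + l*m/M), as in the double sum.
                    angle = -2 * cmath.pi * (k * n / N + l * m / M)
                    total += x[n][m] * cmath.exp(1j * angle)
            row.append(total)
        result.append(row)
    return result


# Hypothetical 2 x 2 block of alpha values, for illustration only.
alpha = [[1.0, 2.0], [3.0, 4.0]]
spectrum = dft2(alpha)
amplitude = [[abs(value) for value in row] for row in spectrum]
print(amplitude)
```

The amplitude of each complex coefficient is the kind of quantity plotted against alpha in the amplitude figures; for real data sets a fast Fourier transform routine would normally replace the four nested loops.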
The object of the invention is to extract gene labels effectively, working from both the real-number domain and the frequency domain. When the differences in mean and variance between any two genes are small, the analysis starts from the matrix formed by the data itself. Because gene expression levels are strongly correlated, the invention analyzes them by principal component analysis, classifying by the eigenvalues (that is, the variances) of the correlation matrix and of the data matrix itself. Compared with the number of genes, the number of samples is usually very small, so using the data directly for classification leads to a small-sample learning problem. Reducing the number of gene features used for classification and identification is the core of the classification problem: in practice, classification works well only when there are few features. On the basis of step (1), classification and identification are then carried out with the support vector machine based on the Fourier transform; the classification works well and the gene labels can be extracted effectively.
Description of drawings:
Fig. 1 is the scatter plot of all sample points;
Fig. 2 is the empirical distribution plot;
Fig. 3 is classification result 1 of the support vector machine based on the frequency-domain Fourier transform on small samples;
Fig. 4 is plot 1 of α and amplitude;
Fig. 5 is classification result 2 of the support vector machine based on the frequency-domain Fourier transform on small samples;
Fig. 6 is plot 2 of α and amplitude.
Embodiment
The essential content of the invention is further explained below with reference to the accompanying drawings and an embodiment, which does not limit the invention.
Embodiment 1: the basic steps are as follows.
Screening uses principal component analysis and a support vector machine classification method based on the Fourier transform, and comprises the following steps:
(1) Use principal component analysis (PCA) to reduce the dimensionality of the samples to be analyzed, shrinking the gene expression space. The object of analysis is a 2000 × 62 matrix of fluorescence data for colon cancer genes, in which rows represent genes and columns represent samples: 22 normal samples and 40 patient samples. With the contribution rate above 78%-88%, screen the characteristic genes of the disease using the eigenvalues and contribution rates as classification factors, reasonably reducing the effective dimensionality of the gene expression space;
Obtain the correlation matrix of all genes and compute its eigendecomposition (eig). Eigenvalues are variances and variance measures information content, so larger eigenvalues are better. Obtain the reduced set of genes at a contribution rate of 99%:
The formula is as follows:
Contribution rate: $\lambda_i \big/ \sum_{k=1}^{p} \lambda_k = \mathrm{Var}(F_i) \big/ \sum_{k=1}^{p} \mathrm{Var}(F_k)$
(2) On the basis of the principal component analysis (PCA), use a support vector machine based on the Fourier transform over the complex field to classify and identify the samples effectively. Record the frequency; the larger the frequency, the better the classification, and extract the gene labels of the disease;
The characteristic genes of the disease screened out in step (1) are classified and identified with the support vector machine based on the Fourier transform over the complex field, and the gene labels are extracted:
The formula is as follows. A two-dimensional discrete Fourier transform is applied to alpha:
$X(k, l) = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} X(n, m)\, e^{-j \frac{2\pi}{N} kn}\, e^{-j \frac{2\pi}{M} lm}$
$= \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} X(n, m)\, W_N^{kn}\, W_M^{lm}$
where $W_N = e^{-j \frac{2\pi}{N}}$ and $W_M = e^{-j \frac{2\pi}{M}}$.
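Step (1) of the embodiment starts from the correlation matrix of the genes. The sketch below, a minimal illustration with hypothetical expression values and an invented helper name `correlation_matrix`, builds such a matrix in plain Python. It also shows a general property that makes the contribution rate easy to interpret: the diagonal of a correlation matrix is all ones, so its trace, and hence the sum of its eigenvalues, equals the number of genes p.

```python
import math


def correlation_matrix(rows):
    """Pearson correlation matrix of the rows (each row is one gene
    measured across all samples)."""
    def centered(values):
        mean = sum(values) / len(values)
        return [v - mean for v in values]

    centered_rows = [centered(row) for row in rows]
    norms = [math.sqrt(sum(v * v for v in row)) for row in centered_rows]
    size = len(rows)
    return [
        [
            sum(a * b for a, b in zip(centered_rows[i], centered_rows[j]))
            / (norms[i] * norms[j])
            for j in range(size)
        ]
        for i in range(size)
    ]


# Hypothetical expression levels: 3 genes (rows) measured in 4 samples (columns).
genes = [
    [7.1, 7.9, 6.5, 8.2],
    [5.0, 5.4, 4.6, 5.9],
    [9.3, 8.1, 9.9, 7.7],
]
R = correlation_matrix(genes)
trace = sum(R[i][i] for i in range(len(R)))
print(trace)  # approximately the number of genes
```

The eigendecomposition itself would normally be delegated to a numerical library (the patent uses MATLAB's eig); it is left out here to keep the sketch dependency-free.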
The analysis results are as follows:
(1) 1. The DNA microarray, also called a gene chip, is a technique developed in recent years for the fast and efficient detection of DNA fragment sequences and gene expression levels. Hundreds to millions of nucleotide sequences, called probes, are fixed on a small (about 1 cm²) solid substrate such as glass, a silicon chip or a membrane; a substrate carrying fixed probes is called a DNA microarray. Because nucleic acid molecules follow the base-complementarity principle when forming double strands, the nucleotide fragments in a sample that are complementary to the probe array can be detected, which yields information on gene expression in the sample: this is the gene expression profile. A gene expression profile can therefore be represented by a matrix or a vector whose element values are the expression levels of the genes. Here the fluorescence intensity data of 2000 genes are known; these data are analyzed and the gene labels of the tumor genes are extracted.
2. Based on step 1, the experimental data comprise samples from 40 tumor patients and 22 normal persons, and each sample contains expression profile data for 2000 genes. The gene chip data were first analyzed in MATLAB. Several important statistics of the gene expression profile data were calculated, as shown in Table 1, and the scatter plot of all sample points (Fig. 1) and the empirical distribution plot (Fig. 2) were drawn.
Table 1. Several statistics of the gene expression profile
min      max       mean     median   std
2.5401   14.3514   7.6554   7.5851   1.6411
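The five statistics in Table 1 can be reproduced for any list of expression values with Python's standard library. The sketch below uses a handful of hypothetical values, not the patent's data. It assumes the sample standard deviation (normalized by N - 1), which is MATLAB's default for `std`; whether the patent used that default is not stated.

```python
import statistics

# Hypothetical expression values, for illustration only.
values = [2.5, 7.6, 14.3, 7.5, 6.9, 8.1]

summary = {
    "min": min(values),
    "max": max(values),
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "std": statistics.stdev(values),  # sample standard deviation (N - 1)
}
print(summary)
```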
(2) Programming the PCA method in MATLAB gives the correlation matrix, eigenvalues, eigenvectors and contribution rates, shown in Table 2. Table 2 shows that the cumulative contribution rate of the first 1948 genes is 1%, while the last 52 contain 99% of the information of all indices. It is reasonable to take as characteristic genes the genes corresponding to the entries of largest absolute value in the eigenvectors. Computing the correlation matrix and eigenvalues separately for the normal and tumor-patient samples gives results similar to those for all samples, so all samples can be taken together as the object of analysis. The genes obtained after dimensionality reduction are listed in Table 3.
Table 2. Correlation matrix results from the PCA method programmed in MATLAB: eigenvalues and contribution rates
Gene number   Eigenvalue   Contribution rate   Cumulative contribution rate
1949 3.1439 0.0016 0.0116
1950 3.2714 0.0016 0.0132
1951 3.2961 0.0016 0.0148
1952 3.426 0.0017 0.0166
1953 3.5774 0.0018 0.0183
1954 3.5935 0.0018 0.0201
1955 3.8179 0.0019 0.0221
1956 3.9897 0.002 0.024
1957 4.0711 0.002 0.0261
1958 4.142 0.0021 0.0282
1959 4.2776 0.0021 0.0303
1960 4.4746 0.0022 0.0325
1961 4.5358 0.0023 0.0348
1962 4.664 0.0023 0.0371
1963 4.955 0.0025 0.0396
1964 5.1764 0.0026 0.0422
1965 5.234 0.0026 0.0448
1966 5.3834 0.0027 0.0475
1967 5.5517 0.0028 0.0503
1968 5.8656 0.0029 0.0532
1969 5.9414 0.003 0.0562
1970 6.2221 0.0031 0.0593
1971 6.3619 0.0032 0.0625
1972 6.8821 0.0034 0.0659
1973 7.1863 0.0036 0.0695
1974 7.593 0.0038 0.0733
1975 7.7005 0.0039 0.0772
1976 8.1223 0.0041 0.0812
1977 8.5939 0.0043 0.0855
1978 8.7864 0.0044 0.0899
1979 9.1849 0.0046 0.0945
1980 9.3452 0.0047 0.0992
1981 9.6833 0.0048 0.104
1982 10.4506 0.0052 0.1092
1983 10.939 0.0055 0.1147
1984 11.85 0.0059 0.1206
1985 12.6548 0.0063 0.127
1986 13.1761 0.0066 0.1336
1987 14.1693 0.0071 0.1406
1988 17.6173 0.0088 0.1494
1989 18.3988 0.0092 0.1586
1990 21.0809 0.0105 0.1692
1991 22.1408 0.0111 0.1803
1992 24.8231 0.0124 0.1927
1993 28.4395 0.0142 0.2069
1994 38.0237 0.019 0.2259
1995 54.129 0.0271 0.253
1996 62.571 0.0313 0.2842
1997 90.865 0.0454 0.3297
1998 109.9177 0.055 0.3846
1999 112.6179 0.0563 0.4409
2000 179.7467 0.0899 0.5308
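The contribution-rate column of Table 2 can be cross-checked against the eigenvalue column. The sketch below assumes, as a labeled assumption rather than something the patent states, that the denominator is p = 2000, which is the trace of a correlation matrix of 2000 genes. Under that assumption the three rows sampled from the table agree with the printed rates to four decimal places.

```python
P = 2000  # assumed total variance: trace of a 2000-gene correlation matrix

# (gene number, eigenvalue, printed contribution rate), copied from Table 2.
rows = [
    (1949, 3.1439, 0.0016),
    (1995, 54.129, 0.0271),
    (2000, 179.7467, 0.0899),
]

recomputed = [(gene, round(eigenvalue / P, 4)) for gene, eigenvalue, _ in rows]
print(recomputed)
```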
Table 3. Gene classification numbers obtained by principal component analysis
[Table 3 is provided only as an image in the original: Figure BSA00000362889800061]
(3) On the basis of (2), classification with the support vector machine based on the Fourier transform is shown in Table 5. The genes obtained by the Fourier-transform-based support vector machine after PCA were compared with the genes obtained by applying the Fourier-transform-based support vector machine directly, without PCA; see Table 4. That is, the frequency-domain Fourier-transform SVC was run both after PCA dimensionality reduction and directly on the data without PCA, and the classification was evaluated against the various screening levels for genes related to tissue sample typing. The results show that when the tissue samples are analyzed with PCA followed by the frequency-domain Fourier transform, PCA improves the classification quality, screens differentially expressed genes reasonably, and clearly improves the classification.
Table 4. Data processed directly by the support vector machine versus data processed by the support vector machine after PCA
Item              SVM applied directly   SVM after PCA
Execution time    0.1 seconds            0.0
Status            OPTIMAL_SOLUTION       OPTIMAL_SOLUTION
|w0|^2            0.057335               1.984033
Margin            8.352546               1.419893
Sum alpha         0.057335               1.984033
Support vectors   35 (56.5%)             25 (55.6%)
Bias              0                      0
Table 5. Results of the Fourier-transform-based support vector machine after PCA
Frequency     36    36     36     36     37     38
Gene number   839   1283   1464   1695   1862   1864
Frequency     38     38    39    41     42     43
Gene number   1791   788   988   1810   1671   249
The data and figures from the Fourier-transform-based support vector machine are as follows. Applied directly, with the frequency interval [1801, 1994] as the range, 25 genes can be extracted as label genes. Applied after PCA, with the frequency interval [36, 43] as the range, 12 genes can be extracted as label genes; see Table 5. Figs. 3 to 6 show that the classification in Figs. 5 and 6 is better than that in Figs. 3 and 4: the alpha values in Figs. 5 and 6 are smoother, whereas the earlier plots show many spikes, and the amplitude after the Fourier transform is steadier and smaller, so the effect is better.
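The selection of label genes by frequency interval can be sketched in a few lines of Python. The (gene number, frequency) pairs below are read column by column from Table 5; the variable names are hypothetical, and the sketch only illustrates the filtering and ranking step, not how the frequencies themselves were produced.

```python
# (gene number, frequency) pairs read column by column from Table 5.
table5 = [
    (839, 36), (1283, 36), (1464, 36), (1695, 36), (1862, 37), (1864, 38),
    (1791, 38), (788, 38), (988, 39), (1810, 41), (1671, 42), (249, 43),
]

LOW, HIGH = 36, 43  # frequency interval used after PCA

# Keep the genes whose frequency falls inside the interval.
label_genes = [gene for gene, freq in table5 if LOW <= freq <= HIGH]

# Rank by frequency, highest first, since a larger frequency is read as better.
ranked = sorted(table5, key=lambda pair: pair[1], reverse=True)
print(len(label_genes), ranked[0])
```

All 12 pairs in Table 5 fall inside [36, 43], which matches the 12 label genes reported for the PCA-based variant.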
In summary, the frequency-domain Fourier-transform support vector machine applied after PCA works well and classifies gene labels more quickly; in practice, this method can be considered for small-sample problems.
Compared with the prior art, the beneficial effects of the invention are as follows:
PCA can filter out spurious information, reduce the impact of irrelevant variables, and exclude the interference of "noise". PCA simplifies the analysis of gene chip data and reduces the effective dimensionality of the gene expression space. PCA determines the number of common factors (that is, the number of principal components or composite factors), which to some extent indicates the number of classes of samples or genes. Through the sign of the factor loadings, PCA shows not only positive correlations between samples or between genes but also negative correlations between them. Whereas earlier approaches processed only real numbers, the support vector machine based on the Fourier transform over the complex field processes complex-valued data and can classify more effectively.
The invention can be applied to all small-sample, high-throughput, high-dimensional data processing. The method is simple and convenient and has high promotional value. It provides a reliable and practical approach to classification and identification problems in biology such as gene label extraction, and it can extract gene labels effectively and in time, so that the disease can be analyzed and the patient's suffering alleviated.

Claims (1)

1. A method for screening the characteristic genes of a disease, using principal component analysis and a support vector machine classification method based on the Fourier transform, comprising the following steps:
(1) Use principal component analysis (PCA) to reduce the dimensionality of the samples to be analyzed, shrinking the gene expression space. With the contribution rate above 78%-88%, screen the characteristic genes of the disease using the eigenvalues and contribution rates as classification factors, reasonably reducing the effective dimensionality of the gene expression space;
Obtain the correlation matrix of all genes and compute its eigendecomposition (eig). Eigenvalues are variances and variance measures information content: the larger the eigenvalue, i.e. the variance, the more information it contains. Obtain the reduced set of genes at a contribution rate of 99%:
The formula is as follows:
Contribution rate: $\lambda_i \big/ \sum_{k=1}^{p} \lambda_k = \mathrm{Var}(F_i) \big/ \sum_{k=1}^{p} \mathrm{Var}(F_k)$
In the formula, p denotes the number of eigenvalues, i.e. the number of genes, and Var denotes variance;
(2) On the basis of the principal component analysis (PCA), use a support vector machine based on the Fourier transform over the complex field to classify and identify the samples effectively. Record the frequency; the larger the frequency, the better the classification, and extract the gene labels of the disease;
The characteristic genes of the disease screened out in step (1) are classified and identified with the support vector machine based on the Fourier transform over the complex field, and the gene labels are extracted:
The formula is as follows. A two-dimensional discrete Fourier transform is applied to alpha:
$X(k, l) = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} X(n, m)\, e^{-j \frac{2\pi}{N} kn}\, e^{-j \frac{2\pi}{M} lm}$
$= \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} X(n, m)\, W_N^{kn}\, W_M^{lm}$
where $W_N = e^{-j \frac{2\pi}{N}}$ and $W_M = e^{-j \frac{2\pi}{M}}$.
In the formula, M and N denote the number of normal samples and the number of pathological samples, respectively, and X here denotes alpha.
CN2010105623087A 2010-11-29 2010-11-29 Screening method of characteristic gene of certain disease Pending CN101996284A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010105623087A CN101996284A (en) 2010-11-29 2010-11-29 Screening method of characteristic gene of certain disease

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2010105623087A CN101996284A (en) 2010-11-29 2010-11-29 Screening method of characteristic gene of certain disease

Publications (1)

Publication Number Publication Date
CN101996284A true CN101996284A (en) 2011-03-30

Family

ID=43786431

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010105623087A Pending CN101996284A (en) 2010-11-29 2010-11-29 Screening method of characteristic gene of certain disease

Country Status (1)

Country Link
CN (1) CN101996284A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20110330