CN107729718A - Method for screening feature genes associated with breast cancer occurrence - Google Patents
Method for screening feature genes associated with breast cancer occurrence
- Publication number
- CN107729718A (application CN201710966853.4A)
- Authority
- CN
- China
- Prior art keywords
- gene
- screening
- mammary gland
- correlated characteristic
- characterizing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
Abstract
The invention discloses a method for screening feature genes associated with the occurrence of breast cancer. Taking the breast cancer data in the TCGA database as the research object, it applies a multi-stage feature gene screening procedure that selects genuine feature genes on the basis of correlation, specificity and biological function. Using this feature gene extraction method on cancer genome data, genes related to early-stage cancer occurrence can be extracted and a classification model built, enabling automated early diagnosis of breast cancer.
Description
Technical field
The invention belongs to the field of bioinformatics and relates to a method for screening feature genes associated with the occurrence of breast cancer. It is used to identify feature genes related to cancer occurrence and for automated diagnosis, and is efficient and broadly applicable.
Background
The rapid development of high-throughput gene sequencing technologies such as gene microarrays, together with bioinformatics, has provided the necessary means for large-scale, genome-level screening of genes related to cancer pathogenesis. However, gene methylation microarray data are ultra-high-dimensional, noisy and drawn from small samples, so the information carried by a few important genes is easily drowned out by the noise of the tens of thousands of genes across the whole genome, causing information saturation. This makes feature gene screening and gene-level early diagnosis of cancer difficult. The primary task is therefore to reduce the dimensionality of the data through feature selection.
For example, Xie J Y et al. combined discriminability and independence in a feature selection method: all genes were ranked in descending order of the ability of their expression data to discriminate breast cancer, and a cluster of 10 genes was selected that distinguishes breast cancer tissue from normal tissue well (accuracy 85.32%). The signal-to-noise-ratio-based feature gene screening method proposed by Wang Wei and Luo Linkai reached a classification accuracy of 96.15% on lymphoma data. The feature gene screening method combining signal-to-noise ratio and clustering proposed by Ruan Xiaogang and Chao Hao reached 94.15% on leukemia data. Zhang Shizhi proposed an embedded feature gene selection method based on support vector machines (SVM) whose classification accuracy on acute leukemia and prostate cancer data reached 98.67% and 98.96%, respectively.
These methods do not deal effectively with multicollinearity among features and cannot fully retain feature genes that share the same biological function, so important genes are easily missed. Some of them also cannot be separated from the classification model and are strongly affected by classifier performance. A comprehensive, efficient and relatively independent feature gene screening method is therefore urgently needed.
Summary of the invention
The technical problem to be solved by the invention is to provide a screening method and a classification model for feature genes associated with the early-stage occurrence of breast cancer, based on high-throughput sequencing information.
To this end, the invention adopts the following technical solution:
A method for screening feature genes associated with the occurrence of breast cancer: based on the breast cancer data in the TCGA cancer genome database, a multi-stage screening method is used to screen feature genes for cancer occurrence, which are then used by a classification model for classification.
Preferably, the multi-stage screening method combines correlation screening, significance-of-difference screening and elastic net screening to screen the whole genome in multiple steps.
Preferably, the feature gene screening method comprises the following specific steps:
1. Correlation screening: genes whose methylation level is clearly correlated with their gene expression level (absolute value of the Spearman correlation coefficient between methylation level and gene expression level greater than 0.5) are selected as the first part of the candidate gene set;
2. Significance-of-difference screening: first, genes whose methylation level is clearly correlated with the classification result are selected (genes with correlation coefficient r > 0.5 are retained); second, analysis of variance is used to assess how strongly each gene differs between the two sample classes. A homogeneity-of-variance test must be carried out on the candidate genes and classification results before the analysis of variance. For genes that satisfy homogeneity, one-way analysis of variance is used to screen feature genes, retaining genes that differ significantly with respect to the class label (p < 0.05, FDR < 0.01). For genes that do not satisfy homogeneity, the non-parametric Kruskal-Wallis rank-sum test is used, again retaining genes that differ significantly with respect to the class label (p < 0.05, FDR < 0.01). The retained genes form the second part of the candidate gene set.
3. The candidate gene sets are merged and de-duplicated, and elastic net screening is applied repeatedly until the minimal feature gene set is obtained.
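Step 1 of the procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `correlation_screen`, the samples-by-genes array layout and the toy data are hypothetical; only the Spearman coefficient and the 0.5 threshold come from the text.

```python
import numpy as np
from scipy.stats import spearmanr


def correlation_screen(methylation, expression, gene_names, threshold=0.5):
    """Keep genes whose methylation and expression are strongly rank-correlated.

    methylation, expression: arrays of shape (n_samples, n_genes) whose
    columns are aligned with gene_names (an assumed layout).
    Returns the gene names with |Spearman rho| > threshold.
    """
    kept = []
    for j, gene in enumerate(gene_names):
        rho, _ = spearmanr(methylation[:, j], expression[:, j])
        if abs(rho) > threshold:
            kept.append(gene)
    return kept


# Toy data, made up for illustration: 30 samples, 3 hypothetical genes.
rng = np.random.default_rng(0)
meth = rng.random((30, 3))
expr = np.column_stack([
    1.0 - meth[:, 0],   # decreases monotonically with methylation
    rng.random(30),     # unrelated to methylation
    meth[:, 2] ** 2,    # increases monotonically with methylation
])
selected = correlation_screen(meth, expr, ["GENE_A", "GENE_B", "GENE_C"])
```

Because the rank correlation is taken in absolute value, both negatively and positively associated genes (here `GENE_A` and `GENE_C`) pass the screen.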
Preferably, a support vector machine classifier is used, and the classification model is optimized by 5-fold cross-validation.
The above technical solution has the following advantages:
1. The multi-stage feature gene screening method considers correlation, specificity and biological function, which avoids missing feature genes and fully retains genes with the same biological function.
2. The feature gene screening method can be applied to data from different test platforms and is broadly applicable and reliable.
3. The screened feature genes and the resulting classification model achieve high accuracy in the early diagnosis of breast cancer.
4. The feature screening process is independent of the classification model and is not affected by classifier performance.
5. The invention can be used to screen feature genes associated with the occurrence of all cancers and for feature gene screening in other related problems.
6. The classification results of the invention for breast cancer are far better than those of other methods.
Brief description of the drawings
Fig. 1 is a flow chart of the method of the invention.
Detailed description
As shown in Fig. 1, the invention provides a method for screening feature genes associated with the occurrence of breast cancer: based on the breast cancer data in the TCGA cancer genome database, a multi-stage screening method is used to screen feature genes for cancer occurrence, which are then used by a classification model for classification.
Preferably, the multi-stage screening method combines correlation screening, significance-of-difference screening and elastic net screening to screen the whole genome in multiple steps.
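The significance-of-difference stage named above can be sketched as follows. Several choices here are assumptions that the text does not specify: Levene's test as the homogeneity-of-variance test, the Benjamini-Hochberg procedure for the FDR, the Spearman coefficient (in absolute value) for the correlation with the class label, and FDR correction over only the genes that pass that correlation filter. All function names and the toy data are hypothetical.

```python
import numpy as np
from scipy.stats import f_oneway, kruskal, levene, spearmanr


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (assumed FDR procedure)."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted


def significance_screen(methylation, labels, gene_names,
                        r_min=0.5, alpha=0.05, fdr_max=0.01):
    """Return genes that differ significantly between the two classes."""
    labels = np.asarray(labels)
    candidates, pvals = [], []
    for j, gene in enumerate(gene_names):
        x = methylation[:, j]
        r, _ = spearmanr(x, labels)
        if not abs(r) > r_min:          # correlation filter with the label
            continue
        normal, tumour = x[labels == 0], x[labels == 1]
        _, p_homogeneity = levene(normal, tumour)
        if p_homogeneity > alpha:       # variances look homogeneous
            _, p = f_oneway(normal, tumour)
        else:                           # fall back to the rank-sum test
            _, p = kruskal(normal, tumour)
        candidates.append(gene)
        pvals.append(p)
    if not candidates:
        return []
    fdr = benjamini_hochberg(pvals)
    return [g for g, p, q in zip(candidates, pvals, fdr)
            if p < alpha and q < fdr_max]


# Toy data, made up for illustration: 20 normal (0) and 20 tumour (1) samples.
labels = np.array([0] * 20 + [1] * 20)
gene_a = np.concatenate([np.linspace(0.1, 0.3, 20), np.linspace(0.7, 0.9, 20)])
gene_b = np.tile(np.linspace(0.1, 0.9, 20), 2)   # identical in both classes
gene_c = np.concatenate([np.linspace(0.10, 0.12, 20), np.linspace(0.5, 0.9, 20)])
meth = np.column_stack([gene_a, gene_b, gene_c])
selected = significance_screen(meth, labels, ["GENE_A", "GENE_B", "GENE_C"])
```

In the toy data `GENE_A` has equal spread in both classes and goes through one-way ANOVA, while `GENE_C` has clearly unequal spread and goes through the Kruskal-Wallis test; `GENE_B` is dropped by the correlation filter.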
The method is described in detail below with reference to data.
I. Selection of materials and data processing
The invention uses breast cancer methylation data from two test platforms in the TCGA public database, Illumina Infinium Human Methylation 450K and Illumina Infinium Human Methylation 27K, together with gene expression data from the Illumina HiSeq 2000 RNA Sequencing Version 2 platform. Normal samples and stage I cancer samples are extracted as the research objects. The 450K platform data serve as the training set, the 27K platform data as the independent test set, and the gene expression data as the independent validation set. Details of the data are given in Table 1.
Table 1. Summary of data classification information
Data processing: 1. For each gene, the mean β value of all probes located in the gene's promoter region is taken as the gene's methylation level. 2. The gene expression data are normalized so that their values lie in the interval [0, 1]; the normalization formula is
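The normalization formula did not survive in this copy of the text. As a minimal sketch, assuming the common min-max rescaling x' = (x - min) / (max - min) applied per gene, the two processing steps could look like the code below. The per-gene scope, the function names, the probe identifiers and the toy values are all assumptions made for illustration.

```python
import numpy as np


def min_max_normalize(values):
    """Rescale each column (gene) of a samples-by-genes array to [0, 1]."""
    values = np.asarray(values, dtype=float)
    lo = values.min(axis=0)
    hi = values.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # constant columns map to 0
    return (values - lo) / span


def promoter_methylation(beta_values, probe_to_gene):
    """Average the beta values of the promoter probes assigned to each gene."""
    per_gene = {}
    for probe, gene in probe_to_gene.items():
        per_gene.setdefault(gene, []).append(beta_values[probe])
    return {gene: float(np.mean(vals)) for gene, vals in per_gene.items()}


# Toy values, made up for illustration.
expression = [[2.0, 10.0], [4.0, 10.0], [6.0, 10.0]]
normalized = min_max_normalize(expression)

betas = {"probe_1": 0.2, "probe_2": 0.4, "probe_3": 0.9}
mapping = {"probe_1": "GENE_A", "probe_2": "GENE_A", "probe_3": "GENE_B"}
methylation = promoter_methylation(betas, mapping)
```

The guard for constant columns avoids a division by zero when a gene has the same expression value in every sample.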
II. Feature gene screening method
The specific steps of the feature gene screening method of the invention are:
1. Correlation screening: genes whose methylation level is clearly correlated with their gene expression level (absolute value of the Spearman correlation coefficient between methylation level and gene expression level greater than 0.5) are selected as the first part of the candidate gene set.
2. Significance-of-difference screening: first, genes whose methylation level is clearly correlated with the classification result are selected (genes with correlation coefficient r > 0.5 are retained); second, analysis of variance is used to assess how strongly each gene differs between the two sample classes. A homogeneity-of-variance test must be carried out on the candidate genes and classification results before the analysis of variance. For genes that satisfy homogeneity, one-way analysis of variance is used to screen feature genes, retaining genes that differ significantly with respect to the class label (p < 0.05, FDR < 0.01). For genes that do not satisfy homogeneity, the non-parametric Kruskal-Wallis rank-sum test is used, again retaining genes that differ significantly with respect to the class label (p < 0.05, FDR < 0.01). The retained genes form the second part of the candidate gene set.
3. The candidate gene sets are merged and de-duplicated, and elastic net screening is applied repeatedly until the minimal feature gene set is obtained.
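Step 3 can be sketched as a merge followed by a refit-and-prune loop. The text does not say which elastic net formulation, penalty strength or stopping rule is used, so the following are assumptions: scikit-learn's `ElasticNet` regression on 0/1 class labels, the `alpha` and `l1_ratio` values, and stopping when a refit drops no further genes. Function names and toy data are hypothetical.

```python
import numpy as np
from sklearn.linear_model import ElasticNet


def merge_candidates(part_one, part_two):
    """Union of the two candidate lists, order-preserving, duplicates removed."""
    return list(dict.fromkeys(list(part_one) + list(part_two)))


def elastic_net_screen(features, labels, gene_names,
                       alpha=0.05, l1_ratio=0.5, max_rounds=20):
    """Refit an elastic net and drop zero-coefficient genes until stable."""
    names = list(gene_names)
    X = np.asarray(features, dtype=float)
    y = np.asarray(labels, dtype=float)
    for _ in range(max_rounds):
        model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, max_iter=10000)
        model.fit(X, y)
        keep = model.coef_ != 0
        if keep.all() or not keep.any():
            break   # nothing was dropped, or everything would be dropped
        X = X[:, keep]
        names = [name for name, k in zip(names, keep) if k]
    return names


# Toy data, made up for illustration: only GENE_A carries class information.
rng = np.random.default_rng(0)
y = np.array([0] * 30 + [1] * 30)
X = rng.normal(size=(60, 5))
X[:, 0] = y + 0.1 * rng.normal(size=60)
names = merge_candidates(["GENE_A", "GENE_B", "GENE_C"],
                         ["GENE_C", "GENE_D", "GENE_E"])
final = elastic_net_screen(X, y, names)
```

The L1 part of the penalty sets uninformative coefficients to exactly zero, which is what allows the gene set to shrink between rounds; the L2 part tends to keep correlated genes together rather than arbitrarily picking one.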
III. Construction of the classification model and evaluation metrics
A classification model for the training set is built from the genes screened on the training set. The feature genes obtained from the training set are then matched to the test set, and the test set classification model is built from the matched genes.
The invention uses a support vector machine as the classifier of the classification model and optimizes the model by 5-fold cross-validation.
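A minimal sketch of a support vector machine tuned by cross-validation is given below, with the number of folds exposed as the `n_folds` parameter (5 here, as in the claims). The text does not say which hyperparameters are optimized, so the grid over `C` and kernel, the feature standardization step, the function name `fit_svm` and the toy data are assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def fit_svm(features, labels, n_folds=5, seed=0):
    """Select SVM hyperparameters by stratified k-fold cross-validation."""
    pipeline = make_pipeline(StandardScaler(), SVC())
    grid = {
        "svc__C": [0.1, 1.0, 10.0],          # assumed search range
        "svc__kernel": ["linear", "rbf"],    # assumed candidate kernels
    }
    folds = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    search = GridSearchCV(pipeline, grid, cv=folds, scoring="accuracy")
    search.fit(features, labels)
    return search


# Toy data, made up for illustration: two well-separated classes, 5 features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (40, 5)), rng.normal(3.0, 1.0, (40, 5))])
y = np.array([0] * 40 + [1] * 40)
search = fit_svm(X, y)
best_model = search.best_estimator_   # refit on all training samples
```

Putting the scaler inside the pipeline means it is refit on each training fold, so no information from the held-out fold leaks into the model being evaluated.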
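Classification results are evaluated with four metrics: accuracy, sensitivity, specificity and the Matthews correlation coefficient (MCC), defined as follows:
Accuracy: $\mathrm{ACC} = \dfrac{t_p + t_n}{t_p + t_n + f_p + f_n}$
Sensitivity: $\mathrm{SE} = \dfrac{t_p}{t_p + f_n}$
Specificity: $\mathrm{SP} = \dfrac{t_n}{t_n + f_p}$
Correlation coefficient: $\mathrm{MCC} = \dfrac{t_p t_n - f_p f_n}{\sqrt{(t_p + f_p)(t_p + f_n)(t_n + f_p)(t_n + f_n)}}$
where $t_p$ is the number of true positives, $t_n$ the number of true negatives, $f_p$ the number of false positives and $f_n$ the number of false negatives.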
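The four metrics can be computed directly from the confusion counts, using their standard definitions. The sketch below is illustrative only: the function name `classification_metrics` is hypothetical and the counts are made up, not results from the patent.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity and MCC from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)     # true positive rate
    specificity = tn / (tn + fp)     # true negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Made-up confusion counts for illustration.
metrics = classification_metrics(tp=45, tn=40, fp=10, fn=5)
```

MCC is reported alongside accuracy because it stays informative when the two classes are of unequal size, as is common when normal samples are far fewer than tumour samples.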
IV. Classification results
The classification performance of the feature genes screened by the method on the training set and the independent test set is shown in Table 2.
Table 2. Classification results on the training set and test set
As the table shows, for both the training set and the independent test set the specificity and sensitivity of the recognition are very close, and accuracy is high. For the test set, specificity is 98.21% and sensitivity is 96.29%, a difference of only 1.02%, smaller than the corresponding difference of 1.04% on the training set, which shows that the method is well balanced and reliable.
The general applicability of the method is examined using the independent validation set; its classification results are shown in Table 3.
Table 3. Classification results on the independent validation set
The results show that the feature gene screening method performs excellently on different data sets, which demonstrates its general applicability.
To further examine the classification performance of the screened feature genes, the top 10 and top 15 genes ranked by contribution to classification were extracted from the training set, test set and independent validation set and used for modeling; the classification results are shown in Table 4.
Table 4. Classification results with a small number of feature genes
The results show that a small number of the genes that contribute most to classification already achieve excellent classification performance, which demonstrates the effectiveness of the screened feature genes for early diagnosis and hence the effectiveness of the feature gene screening method of the invention.
The invention discloses a method for screening feature genes associated with the occurrence of breast cancer. The emergence of high-throughput gene detection technology has provided the necessary means for early diagnosis of breast cancer, but microarray data are ultra-high-dimensional, noisy and drawn from small samples, so the information carried by a few important genes is easily drowned out by the noise of the tens of thousands of genes across the whole genome, causing information saturation and making early diagnosis difficult. To overcome these unfavorable factors, the invention takes the breast cancer data in the TCGA database as the research object and applies a multi-stage feature gene screening method that selects genuine feature genes on the basis of correlation, specificity and biological function. Using this feature gene extraction method on cancer genome data, genes related to early-stage cancer occurrence can be extracted and a classification model built, enabling automated early diagnosis of breast cancer.
Claims (4)
1. A method for screening feature genes associated with the occurrence of breast cancer, characterized in that, based on the breast cancer data in the TCGA cancer genome database, a multi-stage screening method is used to screen feature genes for cancer occurrence, which are used by a classification model for classification.
2. The method for screening feature genes associated with the occurrence of breast cancer as claimed in claim 1, characterized in that the multi-stage screening method combines correlation screening, significance-of-difference screening and elastic net screening to screen the whole genome in multiple steps.
3. The method for screening feature genes associated with the occurrence of breast cancer as claimed in claim 2, characterized in that the feature gene screening method comprises the following specific steps:
(1) Correlation screening: genes whose methylation level is clearly correlated with their gene expression level (absolute value of the Spearman correlation coefficient between methylation level and gene expression level greater than 0.5) are selected as the first part of the candidate gene set;
(2) Significance-of-difference screening: first, genes whose methylation level is clearly correlated with the classification result are selected (genes with correlation coefficient r > 0.5 are retained); second, analysis of variance is used to assess how strongly each gene differs between the two sample classes. A homogeneity-of-variance test must be carried out on the candidate genes and classification results before the analysis of variance. For genes that satisfy homogeneity, one-way analysis of variance is used to screen feature genes, retaining genes that differ significantly with respect to the class label (p < 0.05, FDR < 0.01). For genes that do not satisfy homogeneity, the non-parametric Kruskal-Wallis rank-sum test is used, again retaining genes that differ significantly with respect to the class label (p < 0.05, FDR < 0.01). The retained genes form the second part of the candidate gene set;
(3) The candidate gene sets are merged and de-duplicated, and elastic net screening is applied repeatedly until the minimal feature gene set is obtained.
4. The method for screening feature genes associated with the occurrence of breast cancer as claimed in claim 1, characterized in that a support vector machine classifier is used and the classification model is optimized by 5-fold cross-validation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710966853.4A CN107729718A (en) | 2017-10-17 | 2017-10-17 | Method for screening feature genes associated with breast cancer occurrence |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710966853.4A CN107729718A (en) | 2017-10-17 | 2017-10-17 | Method for screening feature genes associated with breast cancer occurrence |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107729718A true CN107729718A (en) | 2018-02-23 |
Family
ID=61211503
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710966853.4A Pending CN107729718A (en) | 2017-10-17 | 2017-10-17 | A kind of mammary gland carcinogenesis correlated characteristic genetic screening methodology |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107729718A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109841280A (en) * | 2017-11-29 | 2019-06-04 | 郑州大学第一附属医院 | The identification of cancer of the esophagus correlated characteristic access and the construction method of early stage diagnostic model |
CN110257524A (en) * | 2019-08-01 | 2019-09-20 | 浙江大学 | It is a kind of distinguish colorectal cancer cancerous tissue and Carcinoma side normal tissue colorectal cancer discrimination model and its construction method |
CN111378754A (en) * | 2020-04-23 | 2020-07-07 | 嘉兴市第一医院 | TCGA (TCGA-based genetic algorithm) database-based breast cancer methylation biomarker and screening method thereof |
CN112465533A (en) * | 2019-09-09 | 2021-03-09 | 中国移动通信集团河北有限公司 | Intelligent product selection method and device and computing equipment |
CN113017650A (en) * | 2021-03-12 | 2021-06-25 | 南昌航空大学 | Electroencephalogram feature extraction method and system based on power spectral density image |
CN116312785A (en) * | 2023-01-19 | 2023-06-23 | 首都医科大学附属北京胸科医院 | Breast cancer diagnosis marker gene and screening method thereof |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109841280A (en) * | 2017-11-29 | 2019-06-04 | 郑州大学第一附属医院 | The identification of cancer of the esophagus correlated characteristic access and the construction method of early stage diagnostic model |
CN110257524A (en) * | 2019-08-01 | 2019-09-20 | 浙江大学 | It is a kind of distinguish colorectal cancer cancerous tissue and Carcinoma side normal tissue colorectal cancer discrimination model and its construction method |
CN112465533A (en) * | 2019-09-09 | 2021-03-09 | 中国移动通信集团河北有限公司 | Intelligent product selection method and device and computing equipment |
CN111378754A (en) * | 2020-04-23 | 2020-07-07 | 嘉兴市第一医院 | TCGA (TCGA-based genetic algorithm) database-based breast cancer methylation biomarker and screening method thereof |
CN111378754B (en) * | 2020-04-23 | 2020-11-17 | 嘉兴市第一医院 | TCGA (TCGA-based genetic algorithm) database-based breast cancer methylation biomarker and screening method thereof |
CN113017650A (en) * | 2021-03-12 | 2021-06-25 | 南昌航空大学 | Electroencephalogram feature extraction method and system based on power spectral density image |
CN113017650B (en) * | 2021-03-12 | 2022-06-28 | 南昌航空大学 | Electroencephalogram feature extraction method and system based on power spectral density image |
CN116312785A (en) * | 2023-01-19 | 2023-06-23 | 首都医科大学附属北京胸科医院 | Breast cancer diagnosis marker gene and screening method thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20180223 |