CN107729718A - Method for screening feature genes associated with breast cancer occurrence - Google Patents

Publication number
CN107729718A
CN107729718A CN201710966853.4A CN201710966853A CN107729718A CN 107729718 A CN107729718 A CN 107729718A CN 201710966853 A CN201710966853 A CN 201710966853A CN 107729718 A CN107729718 A CN 107729718A
Authority
CN
China
Prior art keywords
gene
screening
mammary gland
correlated characteristic
characterizing
Legal status
Pending
Application number
CN201710966853.4A
Other languages
Chinese (zh)
Inventor
李晓琴
王学栋
常宇
Current Assignee
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date
2017-10-17
Filing date
2017-10-17
Publication date
2018-02-23

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations

Abstract

The invention discloses a method for screening feature genes associated with breast cancer occurrence. Taking the breast cancer data in the TCGA database as the research object, it applies multiple feature gene screening methods to identify genuine feature genes from several perspectives, including correlation, specificity, and biological function. Working from cancer genomic data, the feature gene extraction method of the invention can extract genes associated with the early occurrence of cancer and establish a classification model, thereby enabling automated early diagnosis of breast cancer.

Description

Method for screening feature genes associated with breast cancer occurrence
Technical field
The invention belongs to the field of bioinformatics and relates to a method for screening feature genes associated with breast cancer occurrence. The method is used to identify feature genes associated with cancer occurrence and to support automated diagnosis, and it is efficient and broadly applicable.
Background art
The rapid development of high-throughput gene detection technologies such as gene microarrays, together with bioinformatics, provides the necessary means for large-scale, genome-level screening of genes related to cancer pathogenesis. However, gene methylation microarray data are ultra-high-dimensional, noisy, and small in sample size, so the information carried by a few important genes is easily submerged in the noise of the tens of thousands of genes across the whole genome, producing information saturation. This makes both feature gene screening and early diagnosis of cancer at the gene level difficult. The primary task is therefore to reduce the dimensionality of the data through feature selection. For example, Xie J Y et al. used a feature selection method combining discriminability and independence: genes were ranked in descending order of their ability to discriminate breast cancer from gene expression data, and a cluster of 10 genes was selected that distinguishes breast cancer tissue from normal tissue well (accuracy 85.32%). The signal-to-noise-ratio-based feature gene screening method proposed by Wang Wei and Luo Linkai reached a classification accuracy of 96.15% on lymphoma data. The feature gene screening method combining signal-to-noise ratio and clustering proposed by Ruan Xiaogang and Chao Hao reached 94.15% on leukemia data. The embedded feature gene selection method based on the support vector machine (SVM) proposed by Zhang Shizhi reached classification accuracies of 98.67% and 98.96% on acute leukemia and prostate cancer data, respectively. These feature gene screening methods do not handle multicollinearity between features effectively and cannot fully retain feature genes that share the same biological function, so important genes are easily missed. Moreover, some of the methods are not independent of the classification model and are strongly affected by classifier performance. A comprehensive, efficient, and relatively independent feature gene screening method is therefore urgently needed.
Summary of the invention
The technical problem to be solved by the invention is to provide a screening method and a classification model for feature genes associated with the early occurrence of breast cancer, based on high-throughput sequencing information.
To achieve this, the invention adopts the following technical solution:
A method for screening feature genes associated with breast cancer occurrence which, based on the breast cancer data in the TCGA cancer genome database, screens feature genes associated with cancer occurrence using a multiple screening method and uses them for classification with a classification model.
Preferably, the multiple screening method combines correlation screening, significance-of-difference screening, and elastic net screening to screen the whole genome in multiple steps.
Preferably, the specific steps of the feature gene screening method are:
1. Correlation screening: select genes whose methylation level is clearly correlated with their gene expression level (absolute value of the Spearman correlation coefficient between methylation level and gene expression level greater than 0.5) as the first part of the candidate gene set;
2. Significance-of-difference screening: first, select genes whose methylation level is clearly correlated with the classification result (retain genes with correlation coefficient r > 0.5); second, use analysis of variance to assess how much each gene differs between the two classes of samples. A homogeneity-of-variance test must be performed before the analysis of variance of candidate genes against the classification result. Genes that satisfy homogeneity of variance are screened with one-way analysis of variance, retaining those that differ significantly with respect to the classification label (p < 0.05, FDR < 0.01). Genes that do not satisfy homogeneity of variance are screened with the Kruskal-Wallis rank-sum test, a nonparametric test, again retaining those that differ significantly with respect to the classification label (p < 0.05, FDR < 0.01). The retained genes form the second part of the candidate gene set.
3. Merge and deduplicate the candidate gene sets, then screen repeatedly with the elastic net until the minimal feature gene set is obtained.
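Step 1 above can be illustrated with a short sketch. It is not the inventors' implementation: it assumes that methylation and expression profiles are available per gene over the same matched samples, and the names `average_ranks`, `spearman`, and `correlation_screen` are hypothetical. Only the 0.5 threshold on the absolute Spearman coefficient comes from the text.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman coefficient computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant profile is treated as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def correlation_screen(methylation, expression, threshold=0.5):
    """Keep genes whose |Spearman rho| between the two profiles exceeds threshold.

    Both arguments map a gene name to a list of per-sample values, with the
    samples in the same order (an assumption of this sketch).
    """
    return [
        gene
        for gene in methylation
        if gene in expression
        and abs(spearman(methylation[gene], expression[gene])) > threshold
    ]
```

The absolute value matters here because promoter methylation is often negatively correlated with expression, so a strongly negative coefficient passes the filter just as a strongly positive one does.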
Preferably, a support vector machine classifier is used, and the classification model is optimized by 5-fold cross-validation.
The above technical solution of the invention has the following advantages: 1. Multiple feature gene screening methods are applied from several perspectives, including correlation, specificity, and biological function, which avoids missing feature genes and fully retains genes that share the same biological function. 2. The feature gene screening method can be applied to data from different test platforms, so it is broadly applicable and reliable. 3. The screened feature genes and the resulting classification model achieve high accuracy in the early diagnosis of breast cancer. 4. The feature screening process is independent of the classification model, so it is not affected by classifier performance. 5. The invention can be used to select feature genes associated with the occurrence of all cancers and for feature gene screening in other related problems. 6. The classification results of the invention for breast cancer are far better than those of other methods.
Brief description of the drawings
Fig. 1 is a flow chart of the method of the invention.
Detailed description of the embodiments
As shown in Fig. 1, the invention provides a method for screening feature genes associated with breast cancer occurrence which, based on the breast cancer data in the TCGA cancer genome database, screens feature genes associated with cancer occurrence using a multiple screening method and uses them for classification with a classification model.
Preferably, the multiple screening method combines correlation screening, significance-of-difference screening, and elastic net screening to screen the whole genome in multiple steps.
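The elastic net stage named above can be sketched as follows. This is one possible reading under stated assumptions, not the patent's implementation: the text does not give the penalty settings or the stopping rule, so `alpha`, `l1_ratio`, and the rule "refit on the surviving genes until the set stops shrinking" are assumptions, and all function names are hypothetical. The solver is plain coordinate descent on the commonly used objective (1/2n)·||y − Xw||² + α·ρ·||w||₁ + ½·α·(1 − ρ)·||w||².

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net_fit(X, y, alpha=0.1, l1_ratio=0.5, n_iter=200):
    """Coordinate-descent elastic net on centered data; returns coefficients."""
    n, p = len(X), len(X[0])
    col_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - col_mean[j] for j in range(p)] for row in X]
    residual = [yi - y_mean for yi in y]
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            z = sum(Xc[i][j] ** 2 for i in range(n)) / n
            if z == 0.0:
                continue  # a constant column can never enter the model
            rho = sum(Xc[i][j] * (residual[i] + Xc[i][j] * w[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, alpha * l1_ratio) / (z + alpha * (1 - l1_ratio))
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= Xc[i][j] * delta
                w[j] = new_w
    return w


def repeated_elastic_net_screen(X, y, gene_names, **fit_options):
    """Refit on the genes with nonzero coefficients until the set stops shrinking."""
    selected = list(range(len(gene_names)))
    while True:
        sub = [[row[j] for j in selected] for row in X]
        w = elastic_net_fit(sub, y, **fit_options)
        kept = [selected[k] for k, wk in enumerate(w) if wk != 0.0]
        if not kept or len(kept) == len(selected):
            return [gene_names[j] for j in (kept or selected)]
        selected = kept
```

The L2 part of the penalty is what lets correlated genes enter the model together rather than one being dropped arbitrarily, which matches the stated goal of retaining genes with the same biological function.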
The method is described in detail below with reference to the data.
I. Selection of materials and data processing
From the TCGA public database, the invention selects breast cancer methylation data from two test platforms, Illumina Infinium Human Methylation 450K and Illumina Infinium Human Methylation 27K, together with gene expression data from the Illumina HiSeq 2000 RNA Sequencing Version 2 platform, and takes the normal samples and stage I cancer samples as the research objects. The 450K platform data serve as the training set, the 27K platform data as the independent test set, and the gene expression data as the independent validation set. Details of the data are given in Table 1.
Table 1. Summary of data classification information
Data processing: 1. For each gene, the mean β value of all probes located in the gene's promoter region is taken as the gene's methylation level. 2. The gene expression data are normalized to the interval [0, 1] using the formula $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$, where $x_{\min}$ and $x_{\max}$ are the minimum and maximum of the values being normalized.
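The two data processing steps can be sketched as below. The sketch assumes that min-max scaling is the intended way of mapping expression values into [0, 1] and that missing probe values are simply skipped; both are assumptions, and the function names are hypothetical.

```python
def promoter_methylation_level(beta_values):
    """Mean beta value over the promoter-region probes of one gene in one sample."""
    observed = [b for b in beta_values if b is not None]  # assumption: skip missing probes
    if not observed:
        return None
    return sum(observed) / len(observed)


def min_max_normalize(values):
    """Map values linearly onto [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # assumption: a constant profile maps to 0
    return [(v - low) / (high - low) for v in values]
```

For example, `min_max_normalize([2.0, 4.0, 6.0])` maps the smallest value to 0, the largest to 1, and the middle value to 0.5.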
II. Feature gene screening method
The specific steps of the feature gene screening method of the invention are:
1. Correlation screening: select genes whose methylation level is clearly correlated with their gene expression level (absolute value of the Spearman correlation coefficient between methylation level and gene expression level greater than 0.5) as the first part of the candidate gene set.
2. Significance-of-difference screening: first, select genes whose methylation level is clearly correlated with the classification result (retain genes with correlation coefficient r > 0.5); second, use analysis of variance to assess how much each gene differs between the two classes of samples. A homogeneity-of-variance test must be performed before the analysis of variance of candidate genes against the classification result. Genes that satisfy homogeneity of variance are screened with one-way analysis of variance, retaining those that differ significantly with respect to the classification label (p < 0.05, FDR < 0.01). Genes that do not satisfy homogeneity of variance are screened with the Kruskal-Wallis rank-sum test, a nonparametric test, again retaining those that differ significantly with respect to the classification label (p < 0.05, FDR < 0.01). The retained genes form the second part of the candidate gene set.
3. Merge and deduplicate the candidate gene sets, then screen repeatedly with the elastic net until the minimal feature gene set is obtained.
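The nonparametric branch of step 2 can be sketched as follows. The sketch deliberately covers only part of the step: it omits the label-correlation filter, the homogeneity-of-variance test, and the one-way analysis of variance branch, and it applies the Kruskal-Wallis test to every gene. The text does not say how FDR is computed, so the Benjamini-Hochberg adjustment is an assumption, as are the function names and the 0/1 label coding (0 = normal, 1 = stage I cancer). The p-value uses the chi-square approximation with one degree of freedom, which is rough for very small groups.

```python
import math
from collections import Counter


def kruskal_wallis_two_groups(group_a, group_b):
    """Return (H, p) for two groups; p uses the chi-square(df=1) approximation."""
    pooled = list(group_a) + list(group_b)
    n = len(pooled)
    order = sorted(range(n), key=lambda i: pooled[i])
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and pooled[order[end + 1]] == pooled[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1  # mean rank for ties
        start = end + 1
    rank_sum_a = sum(ranks[: len(group_a)])
    rank_sum_b = sum(ranks[len(group_a):])
    h = 12.0 / (n * (n + 1)) * (
        rank_sum_a ** 2 / len(group_a) + rank_sum_b ** 2 / len(group_b)
    ) - 3 * (n + 1)
    tie_term = sum(t ** 3 - t for t in Counter(pooled).values())
    correction = 1 - tie_term / (n ** 3 - n)
    if correction == 0:
        return 0.0, 1.0  # every value identical: no evidence of a difference
    h = max(h / correction, 0.0)
    return h, math.erfc(math.sqrt(h / 2))  # survival function of chi-square, df=1


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def significance_screen(profiles, labels, p_cut=0.05, fdr_cut=0.01):
    """Keep genes with p < p_cut and adjusted p < fdr_cut between the two classes."""
    genes = list(profiles)
    p_values = []
    for gene in genes:
        normal = [v for v, c in zip(profiles[gene], labels) if c == 0]
        tumor = [v for v, c in zip(profiles[gene], labels) if c == 1]
        p_values.append(kruskal_wallis_two_groups(normal, tumor)[1])
    q_values = benjamini_hochberg(p_values)
    return [g for g, p, q in zip(genes, p_values, q_values) if p < p_cut and q < fdr_cut]
```

Because the adjustment is applied across all tested genes at once, the FDR cutoff becomes harder to pass as the number of candidate genes grows, which is the intended protection when tens of thousands of genes are tested.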
III. Construction of the classification model and evaluation indices
The genes screened from the training set are used to build the classification model on the training set. The feature genes obtained from the training set are then matched to the test set, and the test-set classification model is built from the matched genes.
The invention uses a support vector machine as the classifier of the classification model and optimizes the model by 5-fold cross-validation.
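The cross-validation loop used for model optimization can be sketched as below, with the fold count `k` left as a parameter (the summary and claim 4 specify 5 folds). The support vector machine itself is not implemented: `fit` and `predict` are hypothetical callables standing in for whatever SVM library is used, and the stratified, round-robin fold assignment is an assumption, since the text does not say how folds are drawn.

```python
def stratified_k_folds(labels, k=5):
    """Yield (train_indices, test_indices) with each class spread across folds."""
    folds = [[] for _ in range(k)]
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)  # round-robin within each class
    for f in range(k):
        test_idx = sorted(folds[f])
        train_idx = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train_idx, test_idx


def cross_val_accuracy(X, y, fit, predict, k=5):
    """Mean held-out accuracy over k folds.

    fit(X_train, y_train) returns a model; predict(model, x) returns a label.
    Both are supplied by the caller.
    """
    scores = []
    for train_idx, test_idx in stratified_k_folds(y, k):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        hits = sum(predict(model, X[i]) == y[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores)
```

In practice this function would be called once per candidate hyperparameter setting of the classifier, and the setting with the best mean held-out accuracy would be kept.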
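The classification results are evaluated with four indices: accuracy, sensitivity, specificity, and the Matthews correlation coefficient (MCC), defined as follows:
Accuracy: $\mathrm{ACC} = \frac{t_p + t_n}{t_p + t_n + f_p + f_n}$
Sensitivity: $\mathrm{SEN} = \frac{t_p}{t_p + f_n}$
Specificity: $\mathrm{SPE} = \frac{t_n}{t_n + f_p}$
Correlation coefficient: $\mathrm{MCC} = \frac{t_p t_n - f_p f_n}{\sqrt{(t_p + f_p)(t_p + f_n)(t_n + f_p)(t_n + f_n)}}$
where $t_p$ is the number of true positives, $t_n$ the number of true negatives, $f_p$ the number of false positives, and $f_n$ the number of false negatives.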
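Assuming the standard confusion-matrix definitions of the four indices, they can be computed as in the following sketch. The function name and the convention of returning 0 when a denominator is zero are assumptions made for illustration.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, and MCC from confusion-matrix counts."""

    def ratio(numerator, denominator):
        # Assumption: an undefined ratio (zero denominator) is reported as 0.
        return numerator / denominator if denominator else 0.0

    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "mcc": ratio(tp * tn - fp * fn, mcc_denominator),
    }
```

Unlike accuracy, MCC stays informative when the normal and cancer classes differ greatly in size, which is why it is a useful companion to the other three indices.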
IV. Classification performance
The classification performance of the feature genes screened by the method of the invention on the training set and the independent test set is shown in Table 2.
Table 2. Classification results on the training set and test set
As the classification results in the table show, for both the training set and the independent test set the specificity and sensitivity are very close and the accuracy is high. For the test set, the specificity is 97.31% and the sensitivity is 96.29%, a difference of only 1.02%, which is smaller than the corresponding difference of 1.04% for the training set. This indicates that the method of the invention is well balanced and reliable.
The broad applicability of the method was tested on the independent validation set; the classification results are shown in Table 3.
Table 3. Classification performance on the independent validation set
The results in the table show that the feature gene screening method performs well on different data sets, which demonstrates its broad applicability.
To further examine the classification performance of the feature genes screened by the invention, the top 10 and top 15 genes ranked by contribution to classification were extracted from the training set, the test set, and the independent validation set and used for modeling; the classification results are shown in Table 4.
Table 4. Classification performance with a small number of feature genes
The classification results in the table show that a small number of top-ranked genes is enough to achieve very good classification performance. This demonstrates that the feature genes screened by the method are effective for early diagnosis, and hence that the feature gene screening method of the invention is itself effective.
The invention discloses a method for screening feature genes associated with breast cancer occurrence. The emergence of high-throughput gene detection technology provides the necessary means for early diagnosis of breast cancer, but microarray data are ultra-high-dimensional, noisy, and small in sample size, so the information carried by a few important genes is easily submerged in the noise of the tens of thousands of genes across the whole genome. This produces information saturation and makes early diagnosis difficult. To overcome these unfavorable factors, the invention takes the breast cancer data in the TCGA database as the research object and applies multiple feature gene screening methods to identify genuine feature genes from several perspectives, including correlation, specificity, and biological function. Working from cancer genomic data, the feature gene extraction method of the invention can extract genes associated with the early occurrence of cancer and establish a classification model, thereby enabling automated early diagnosis of breast cancer.

Claims (4)

1. A method for screening feature genes associated with breast cancer occurrence, characterized in that, based on the breast cancer data in the TCGA cancer genome database, feature genes associated with cancer occurrence are screened using a multiple screening method and are used for classification with a classification model.
2. The method for screening feature genes associated with breast cancer occurrence according to claim 1, characterized in that the multiple screening method combines correlation screening, significance-of-difference screening, and elastic net screening to screen the whole genome in multiple steps.
3. The method for screening feature genes associated with breast cancer occurrence according to claim 2, characterized in that the specific steps of the feature gene screening method are:
1. Correlation screening: select genes whose methylation level is clearly correlated with their gene expression level (absolute value of the Spearman correlation coefficient between methylation level and gene expression level greater than 0.5) as the first part of the candidate gene set;
2. Significance-of-difference screening: first, select genes whose methylation level is clearly correlated with the classification result (retain genes with correlation coefficient r > 0.5); second, use analysis of variance to assess how much each gene differs between the two classes of samples, performing a homogeneity-of-variance test before the analysis of variance of candidate genes against the classification result; genes that satisfy homogeneity of variance are screened with one-way analysis of variance, retaining those that differ significantly with respect to the classification label (p < 0.05, FDR < 0.01); genes that do not satisfy homogeneity of variance are screened with the Kruskal-Wallis rank-sum test, a nonparametric test, retaining those that differ significantly with respect to the classification label (p < 0.05, FDR < 0.01); the retained genes form the second part of the candidate gene set;
3. Merge and deduplicate the candidate gene sets, then screen repeatedly with the elastic net until the minimal feature gene set is obtained.
4. The method for screening feature genes associated with breast cancer occurrence according to claim 1, characterized in that a support vector machine classifier is used and the classification model is optimized by 5-fold cross-validation.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710966853.4A 2017-10-17 2017-10-17 Method for screening feature genes associated with breast cancer occurrence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710966853.4A 2017-10-17 2017-10-17 Method for screening feature genes associated with breast cancer occurrence

Publications (1)

Publication Number Publication Date
CN107729718A 2018-02-23

Family

ID=61211503

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710966853.4A Method for screening feature genes associated with breast cancer occurrence (CN107729718A (en), Pending) 2017-10-17 2017-10-17

Country Status (1)

Country Link
CN (1) CN107729718A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20180223