CN110751983A - Method for screening characteristic mRNAs for the diagnosis of early-stage lung cancer - Google Patents
- Publication number
- CN110751983A (application CN201911146308.6A)
- Authority
- CN
- China
- Prior art keywords
- lung cancer
- mrna
- diagnosis
- sample
- matrix
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
Abstract
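The invention discloses a method for screening characteristic mRNAs for the diagnosis of early-stage lung cancer using an uncorrelated linear discriminant analysis (ULDA) model. First, the mRNA expression data of known lung cancer samples and normal samples are standardized and used as the training set. The training set is then modeled by uncorrelated linear discriminant analysis to obtain the transformation matrix G, which holds one value per variable. The standardized samples to be diagnosed serve as the prediction set; their mRNA expression data are multiplied by the matrix G to obtain uncorrelated discriminant vector values, from which lung cancer samples are identified. Characteristic mRNAs are screened by the absolute value of the G entry corresponding to each mRNA, and the screening threshold is adjusted repeatedly to reduce the number of retained mRNAs and update the ULDA lung cancer diagnosis model until the diagnostic accuracy on the prediction set begins to decline. The smallest set of mRNAs that still gives a diagnostic accuracy of 100% can serve as potential characteristic biomarkers for lung cancer diagnosis.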
Description
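Technical Field
The invention relates to a method for screening characteristic mRNAs for the diagnosis of early-stage lung cancer by uncorrelated linear discriminant analysis.
Background
Early diagnosis and timely treatment are the most effective way to improve the survival rate of cancer patients, yet early diagnosis of cancer has long been a difficult problem for medical staff and researchers. Lung cancer is currently one of the most common malignant tumors in the world and one of the deadliest, and its incidence and growth rate rank first among malignant tumors. The early symptoms of lung cancer are not obvious. Clinical practice currently relies on imaging examinations such as X-ray fluoroscopy and computed tomography, together with liquid biopsy, but by the time the diagnosis is confirmed the disease has often progressed to an intermediate or advanced stage.
Cancer is a disease caused by gene mutations that arise when cells are exposed to carcinogenic factors. Because gene expression differs between normal cells and cancer cells, identifying the differentially expressed genes (markers) that signal malignant transformation offers an effective means of early cancer diagnosis. With the rapid development of high-throughput sequencing and the falling cost of testing, obtaining large amounts of gene expression data is no longer difficult. Extracting cancer-related genes from such large and complex data, however, challenges existing data processing and analysis methods. Cheminformatics provides an effective approach to this problem. Pattern recognition (classification) models such as partial least squares discriminant analysis and support vector machines have already been built from gene expression data using cheminformatics methods, both to screen important feature variables and to identify cancer.
mRNA (messenger RNA), or messenger ribonucleic acid, is the intermediate that carries genetic information from deoxyribonucleic acid (DNA) to the ribosome. Abnormal mRNA expression has been reported in different types of tumors.
Summary of the Invention
The object of the invention is to provide a method for screening characteristic mRNAs for the diagnosis of early-stage lung cancer, offering a new route to fast, efficient and accurate early diagnosis of lung cancer.
The confirmed lung cancer patient group and the normal group form the training set, and their mRNA expression data are standardized. The training set is then modeled by uncorrelated linear discriminant analysis (ULDA) to obtain the transformation matrix G, which holds one value per variable. G has dimension n×1, where n is the number of variables. The standardized samples to be diagnosed serve as the prediction set, and their mRNA expression data are multiplied by the matrix G to obtain the uncorrelated discriminant vector values (UDV values). When there are two sample classes, a single UDV value is sufficient to classify a sample. The UDV values of samples from different classes differ markedly and form separate clusters in space. The importance of each mRNA is reflected in the magnitude of its corresponding value in the matrix G; key mRNAs are screened by the absolute value of their G entries, and the screening threshold is adjusted repeatedly. After each screening round the original gene expression data are reduced in dimension and a lung cancer diagnosis model is built. Provided that the lung cancer group and the control group are still identified correctly, the screening threshold is raised step by step to reduce the number of retained mRNAs and the diagnosis model is updated, until the diagnostic accuracy begins to decline. The smallest set of mRNAs that still guarantees accurate diagnosis can serve as potential biomarkers for the early diagnosis of lung cancer.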
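The standardization and projection described above can be sketched as follows. This is a minimal illustration and not the patent's implementation: the function names, the toy data, the placeholder matrix G, the choice to scale the prediction set with the training-set mean and standard deviation, and the nearest-class-center decision rule are all assumptions introduced here. The source only states that samples are identified from the differences and clustering of their UDV values.

```python
import numpy as np

def standardize(train, new):
    """Autoscale both sets with the training-set mean and standard deviation (assumed choice)."""
    mean = train.mean(axis=0)
    std = train.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # guard against constant variables
    return (train - mean) / std, (new - mean) / std

def classify_by_udv(udv_train, y_train, udv_new):
    """Assign each new sample to the class whose mean training UDV is nearest (assumed rule)."""
    classes = np.unique(y_train)
    centers = np.array([udv_train[y_train == c].mean() for c in classes])
    nearest = np.abs(udv_new.reshape(-1, 1) - centers.reshape(1, -1)).argmin(axis=1)
    return classes[nearest]

# Toy data: 4 training samples, 2 variables; class 0 = normal, class 1 = cancer.
X_train = np.array([[1.0, 5.0], [2.0, 6.0], [8.0, 1.0], [9.0, 2.0]])
y_train = np.array([0, 0, 1, 1])
X_new = np.array([[1.5, 5.5], [8.5, 1.5]])
G = np.array([[1.0], [-1.0]])  # placeholder n x 1 transformation matrix, not a fitted one

Z_train, Z_new = standardize(X_train, X_new)
udv_train = (Z_train @ G).ravel()  # UDV = X G, one value per sample for two classes
udv_new = (Z_new @ G).ravel()
pred = classify_by_udv(udv_train, y_train, udv_new)
print(pred)
```

With two classes each sample is reduced to one number, so the decision rule only has to compare that number with the two class centers.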
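The specific steps are as follows:
1. The mRNA expression data of the confirmed lung cancer patient group and the normal group are standardized and used as the training set;
3) Construct the matrix U = [u1, …, ur], where ui (i = 1, 2, …, r) is the i-th column of U and r is the rank of St [r = rank(St)];
6) Perform singular value decomposition on the matrix B;
8) Take the first q columns of the matrix A to form the transformation matrix G, i.e. G = [a1, …, aq], where q is the rank of Sb [q = rank(Sb)];
9) For the data of unknown samples to be diagnosed (the prediction set) Xnew, the corresponding low-dimensional matrix, the uncorrelated discriminant vector (UDV), is computed as UDV = XnewG. The cancer group and the normal group are identified from the differences between UDV values and from the clustering result.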
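The computation of G can be sketched in NumPy as below. Only some of the numbered steps survive in the text above, so the intermediate steps here follow a commonly used SVD-based formulation of ULDA and should be read as an assumption, not as the patent's exact procedure. The names `ulda_fit`, `Ht`, `Hb` and `tol` are hypothetical, and the data are synthetic.

```python
import numpy as np

def ulda_fit(X, y, tol=1e-10):
    """Return the n x q transformation matrix G for centered data X (samples x variables)."""
    classes = np.unique(y)
    center = X.mean(axis=0)
    # Ht (n x N) and Hb (n x k) are built so that St = Ht Ht^T and Sb = Hb Hb^T.
    Ht = (X - center).T
    Hb = np.column_stack([
        np.sqrt(np.sum(y == c)) * (X[y == c].mean(axis=0) - center)
        for c in classes
    ])
    # Reduced SVD of Ht; keep the r = rank(St) leading singular vectors U = [u1, ..., ur].
    U, s, _ = np.linalg.svd(Ht, full_matrices=False)
    r = int(np.sum(s > tol * s.max()))
    U, s = U[:, :r], s[:r]
    # B = Sigma^-1 U^T Hb, followed by the SVD of B.
    B = (U.T @ Hb) / s[:, None]
    P, sb, _ = np.linalg.svd(B, full_matrices=False)
    q = int(np.sum(sb > tol * sb.max()))  # q = rank(Sb)
    # A = U Sigma^-1 P; G consists of the first q columns of A.
    A = (U / s) @ P
    return A[:, :q]

# Synthetic undersampled data: 10 samples, 50 variables, two classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 50))
y = np.array([0] * 5 + [1] * 5)
X[y == 1, :5] += 3.0           # shift five variables in the second class
Xc = X - X.mean(axis=0)        # centering stands in for full standardization here
G = ulda_fit(Xc, y)
udv = (Xc @ G).ravel()         # UDV = X G
print(G.shape, udv.round(3))
```

For two classes the between-class scatter has rank one, so G comes out as a single column, which matches the n×1 dimension stated in the summary.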
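The mean of the absolute values of the elements of the matrix G is computed, and a screening threshold K is defined. An mRNA whose G value satisfies the threshold condition is retained and carried into the ULDA model for further calculation, and the diagnostic accuracy is recorded. The screening threshold is raised repeatedly to reduce the number of retained mRNAs and update the ULDA lung cancer diagnosis model until the diagnostic accuracy begins to decline. The smallest set of mRNAs that still gives a diagnostic accuracy of 100% is taken as the characteristic mRNAs for diagnosing lung cancer.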
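The screening loop might look like the sketch below. Several parts are assumptions and are flagged in the comments: the retention rule |g_i| ≥ K × mean(|G|) is a guess, because the exact formula does not survive in the text; every round screens with the G of the full model; and `mean_diff_weights` is a deliberately simple stand-in for the ULDA fit, used only to keep the example short. Standardization is omitted for brevity.

```python
import numpy as np

def mean_diff_weights(X, y):
    """Stand-in for the ULDA fit (NOT ULDA): weights are the difference of the class means."""
    return (X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)).reshape(-1, 1)

def predict(X_tr, y_tr, X_te, G):
    """Nearest class-center rule on the one-dimensional scores X @ G (assumed rule)."""
    s_tr, s_te = (X_tr @ G).ravel(), (X_te @ G).ravel()
    c0, c1 = s_tr[y_tr == 0].mean(), s_tr[y_tr == 1].mean()
    return (np.abs(s_te - c1) < np.abs(s_te - c0)).astype(int)

def screen(X_tr, y_tr, X_te, y_te, fit, k_values):
    """Raise K step by step; return the smallest variable subset that kept 100% accuracy."""
    g_abs = np.abs(fit(X_tr, y_tr)).ravel()
    mean_abs = g_abs.mean()
    best = np.arange(X_tr.shape[1])
    for k in k_values:
        # Assumed retention rule: keep variable i when |g_i| >= K * mean(|G|).
        keep = np.flatnonzero(g_abs >= k * mean_abs)
        if keep.size == 0:
            break
        G = fit(X_tr[:, keep], y_tr)  # update the model on the reduced data
        acc = np.mean(predict(X_tr[:, keep], y_tr, X_te[:, keep], G) == y_te)
        if acc < 1.0:                 # accuracy has started to drop: stop
            break
        best = keep
    return best

# Toy data: variables 0 and 1 separate the classes, the other four barely differ.
X_tr = np.array([[0.0, 1.0, 1.0, 2.0, 1.0, 1.0],
                 [1.0, 0.0, 2.0, 1.0, 1.2, 1.0],
                 [10.0, 11.0, 2.0, 1.0, 1.4, 1.2],
                 [11.0, 10.0, 2.0, 1.0, 1.2, 1.0]])
y_tr = np.array([0, 0, 1, 1])
X_te = np.array([[0.5, 0.5, 1.5, 1.5, 1.0, 1.0],
                 [10.0, 10.0, 2.0, 1.0, 1.3, 1.1]])
y_te = np.array([0, 1])

selected = screen(X_tr, y_tr, X_te, y_te, mean_diff_weights, k_values=[0.1, 1.0, 3.0])
print(selected)
```

Because the model-fitting function is passed in as `fit`, an actual ULDA fit can be substituted without changing the loop.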
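Brief Description of the Drawings
Figure 1 shows the identification and diagnosis results for the lung cancer samples and normal samples of the embodiment.
Detailed Description
Embodiment:
(1) mRNA expression data for lung cancer tissue and normal tissue were downloaded from https://www.pnas.org/content/98/24/13790. The data set contains 21 lung cancer samples and 17 normal tissue samples, each with 12,600 expression values.
(2) The mRNA expression data of the lung cancer samples and of the normal samples were each randomly divided into a training set and a prediction set and then standardized.
(3) Uncorrelated linear discriminant analysis was applied to the training set samples to compute the transformation matrix G.
(4) The UDV values of the prediction set samples were computed from G, and the cancer group and the normal group were discriminated from the differences between UDV values and from the clustering result.
(6) Step (5) was repeated, raising the screening threshold K so that the number of mRNAs decreased step by step, and the ULDA lung cancer diagnosis model was updated until the diagnostic accuracy began to decline. The smallest set of mRNAs (134) that still gave a diagnostic accuracy of 100% can serve as potential markers for diagnosing lung cancer and as a basis for further research.
The variables with the largest |G| values can serve as characteristic mRNAs for diagnosing lung cancer. They include Homeobox D3 (HOXD3), Pre-B-cell leukemia transcription factor 2 (PBX2) and KIR2DLA, which have been confirmed in some studies.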
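Step (2) of the embodiment, in which each class is split at random on its own, could be sketched as follows. The 2/3 training fraction, the random seed and the function name `split_per_class` are assumptions, since the source does not state the split ratio. Only the class sizes (21 lung cancer, 17 normal) come from the text.

```python
import numpy as np

def split_per_class(y, train_fraction, rng):
    """Randomly split the sample indices of each class separately into training and prediction sets."""
    train_parts, test_parts = [], []
    for c in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == c))
        n_train = int(round(train_fraction * idx.size))
        train_parts.append(idx[:n_train])
        test_parts.append(idx[n_train:])
    return np.sort(np.concatenate(train_parts)), np.sort(np.concatenate(test_parts))

# 21 lung cancer samples (label 1) and 17 normal samples (label 0), as in the embodiment.
y = np.array([1] * 21 + [0] * 17)
rng = np.random.default_rng(0)  # fixed seed so the split is reproducible
train_idx, test_idx = split_per_class(y, 2 / 3, rng)
print(len(train_idx), len(test_idx))
```

Splitting within each class keeps both classes represented in the training set and in the prediction set, which matters with only 38 samples.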
Claims (1)
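1. A method for screening characteristic mRNAs for the diagnosis of early-stage lung cancer, characterized by comprising the following steps:
Step 1: the mRNA expression data of known lung cancer samples and normal samples are standardized and used as the training set;
Step 2: uncorrelated linear discriminant analysis is applied to the training set to compute the transformation matrix G, which holds one value per variable;
Step 3: the standardized samples to be diagnosed are used as the prediction set, and their mRNA expression data are multiplied by the matrix G to obtain the uncorrelated discriminant vector values, from which lung cancer samples and normal samples are identified;
Step 5: step 3 is repeated, raising the screening threshold K so that the number of retained mRNAs decreases step by step, until the diagnostic accuracy on the prediction set begins to decline; the smallest set of mRNAs that still gives a diagnostic accuracy of 100% can serve as potential characteristic biomarkers for the diagnosis of early-stage lung cancer.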
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911146308.6A CN110751983A (zh) | 2019-11-14 | 2019-11-14 | Method for screening characteristic mRNAs for the diagnosis of early-stage lung cancer |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911146308.6A CN110751983A (zh) | 2019-11-14 | 2019-11-14 | Method for screening characteristic mRNAs for the diagnosis of early-stage lung cancer |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110751983A (zh) | 2020-02-04 |
Family
ID=69283939
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911146308.6A Pending CN110751983A (zh) | Method for screening characteristic mRNAs for the diagnosis of early-stage lung cancer | 2019-11-14 | 2019-11-14 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110751983A (zh) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
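CN111444941A (zh) * | 2020-02-24 | 2020-07-24 | North China Electric Power University (Baoding) | Method for diagnosing early-stage lung cancer by combining serum electrolyte and proteomics data |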
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
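CN102999765A (zh) * | 2012-11-09 | 2013-03-27 | Jiangsu University | Method for determining pork storage time using adaptive boosting and uncorrelated discriminant analysis |
CN109837340A (zh) * | 2017-11-24 | 2019-06-04 | Gu Wanjun | Peripheral blood gene markers for non-invasive diagnosis of lung cancer |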
Non-Patent Citations (1)
Title |
---|
Zhang Mingjin: "Multivariate data analysis methods based on feature selection and their application in spectroscopy research" * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20200204 |