CN113764044A - Method for constructing myelodysplastic syndrome progress gene prediction model - Google Patents

Method for constructing myelodysplastic syndrome progress gene prediction model

Info

Publication number
CN113764044A
Authority
CN
China
Prior art keywords
mds
mutation
gene
risk
group
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111009322.9A
Other languages
Chinese (zh)
Other versions
CN113764044B (en)
Inventor
侯珺
杜欣
孙启慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT
Priority to CN202111009322.9A
Publication of CN113764044A
Application granted
Publication of CN113764044B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/50 Mutagenesis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biotechnology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Analytical Chemistry (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Chemical & Material Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioethics (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a method for constructing a myelodysplastic syndrome (MDS) progression gene prediction model, comprising the following steps: collecting samples from three groups of patients, extracting DNA, and sequencing to obtain a gene mutation spectrum and mutation frequencies; dividing the genes into four modules according to their mutation frequencies, using the following criteria: module 1: the gene mutation rate shows an ascending trend across the groups; module 2: the gene mutation rate shows a descending trend across the groups; module 3: the mutation frequency of the gene in the high-risk MDS group is lower than in the low-risk MDS group, and the mutation frequency in the MDS-AML group is the highest; module 4: the mutation frequency of the gene in the high-risk MDS group is higher than in the low-risk MDS group, and the mutation frequency in the MDS-AML group is the lowest; the genes included in these four modules are MDS progression-associated genes; and training an SVM classifier model to complete construction of the model. The invention can predict disease progression at the molecular level and obtain an accurate prediction.

Description

Method for constructing myelodysplastic syndrome progress gene prediction model
Technical Field
The invention relates to a method for constructing a myelodysplastic syndrome progress gene prediction model.
Background
Myelodysplastic syndrome (MDS) is a common group of hematological malignancies resulting from the clonal proliferation of myeloid hematopoietic stem and progenitor cells. The development of MDS is well recognized as a multistep process, including a pre-MDS phase, an MDS phase, and a phase of leukemia secondary to MDS. Its development is related to dysregulation of the bone marrow microenvironment and to the gradual acquisition of additional driver mutations. In this multistep process, mutations that confer a selective advantage are acquired repeatedly, so that the dominant cell populations carrying the same mutations continue to expand.
Clonal evolution is currently divided into a linear mode and a branching mode. Linear evolution is characterized by the successive emergence of dominant clones from their ancestral clones, which continue to expand after acquiring additional mutations. Branching evolution is characterized by different subclones arising from a common ancestral clone, simultaneously or sequentially, resulting in the coexistence of related (sub)clones that carry partially overlapping sets of mutations. The genetic diversity among these subclones may lead to more complex disease types and to treatment resistance, since some subclones may be resistant to a particular type of treatment.
Studies of the genomic features of the various stages of MDS progression have identified a number of mutated genes. However, it should be emphasized that the results of the individual studies are not identical, owing to tumor heterogeneity, differences between the enrolled patients, differences in sequencing methods, and complex interrelations between genes. Furthermore, the relationships between upstream and downstream genes, the co-occurrence and mutual exclusivity of gene mutations, and the stromal and immune cells in the patient's bone marrow microenvironment may all affect the patient's overall condition, making prediction of MDS progression difficult and uncertain.
Disclosure of Invention
In the method of the invention, the differentially mutated genes of patients in the low-risk MDS, high-risk MDS, and MDS-related AML groups are compared, an SVM is selected as the optimal machine learning classification model, and the progression of MDS is predicted, thereby constructing a myelodysplastic syndrome progression gene prediction model that provides clinical guidance for the selection of subsequent treatment strategies and for disease prediction.
The purpose of the invention is realized by the following technical scheme:
a method for constructing a myelodysplastic syndrome progression gene prediction model comprises the following steps:
(1) collecting samples from patients with low-risk and high-risk myelodysplastic syndrome (hereinafter, low-risk MDS and high-risk MDS) and from leukemia patients (hereinafter, MDS-AML), and extracting DNA from each sample; the samples from the low-risk and high-risk myelodysplastic syndrome patients form the training set;
the blast (primitive cell) percentage of the low-risk myelodysplastic syndrome patients is less than 5%;
the blast percentage of the high-risk myelodysplastic syndrome patients is greater than or equal to 5% and less than 20%;
the leukemia patients have a leukemia history of more than 2 months and a blast percentage greater than or equal to 20%;
the sample can be a blood, tissue, or bone marrow aspirate sample from the patient;
(2) sequencing the DNA samples of the three groups of patients and aligning them to the hg19 reference genome to obtain a gene mutation spectrum and mutation frequencies;
the genes are divided into four modules according to their mutation frequencies, using the following criteria:
module 1: the gene mutation rate shows an ascending trend from the low-risk MDS group to the MDS-AML group;
module 2: the gene mutation rate shows a descending trend from the low-risk MDS group to the MDS-AML group;
module 3: the mutation frequency of the gene in the high-risk MDS group is lower than in the low-risk MDS group, and the mutation frequency in the MDS-AML group is the highest;
module 4: the mutation frequency of the gene in the high-risk MDS group is higher than in the low-risk MDS group, and the mutation frequency in the MDS-AML group is the lowest;
the genes included in these four modules are MDS progression-associated genes;
most myeloid-associated genes are mutated at every stage of myelodysplastic syndrome progression (low-risk MDS, high-risk MDS, and MDS-AML) and do not occur uniquely at one stage. However, the mutation frequencies of some genes differ between stages of progression. Genes whose mutation frequency increases or decreases steadily across the three patient groups as the disease progresses are therefore sought, because they correlate strongly with disease progression (module 1 and module 2).
In addition, some patients do not progress step by step from low-risk MDS through high-risk MDS to MDS-AML: they may progress rapidly from low-risk MDS directly to MDS-AML, or may already have high-risk MDS at diagnosis. So as not to miss the genes associated with these progression patterns, such genes are incorporated into the module 3 and module 4 gene sets.
The MDS progression-associated genes are: ABL, ANKRD, ARID1, ATG2, BCORL, BIRC, BRAF, BRINP, CALR, CARD, CBL, CCND, CEBPA, CREBP, CUX, CXCR, DDX3, DNM, DNMT3, ECT2, EP300, ETNK, EZH, FAM46, FGFR, FLT, GATA, ID, IDH, JAK, KDM6, KIT, KMT2, MAPK, MPL, NOTCH, PDS5, PHF, PIGA, PLCG, PRKCB, PRPF40, RAD, RBBP, RELN, RUNX, TBSEP, SETD, SF3A, SF3B, SMC1, SMC, SRP, STAG, TERT, TET, TP, TPMT, TRAF, XPO, and ZRSR, for a total of 64;
the mutations include missense mutations, nonsense mutations, frameshift insertions, frameshift deletions, non-frameshift insertions, non-frameshift deletions, and splice-site mutations; intronic and synonymous mutations are excluded;
the average sequencing depth is not less than 800×;
the sequencing methods include Sanger sequencing, ARMS-PCR (Amplification Refractory Mutation System PCR), MASS-PCR (Mutation-Selected Amplification specificity System), whole-genome sequencing, whole-exome sequencing, and small-panel targeted sequencing;
(3) selecting the MDS progression-associated genes in the DNA of the training samples according to the four modules and marking their mutation status, a gene being marked A1 when a mutation is present and A2 when it is absent; training the SVM classifier model with the MDS progression-associated gene mutation markers of the training samples as input to complete construction of the model;
in the step (3), A1 is 1, A2 is 0, or A1 is 1, and A2 is 0;
the SVM classifier model can use 0.3 as the threshold for predicting a sample: when the predicted value of the sample is greater than or equal to 0.3, the sample is predicted to be at high risk of disease progression; when the predicted value is less than 0.3, the sample is predicted to be at low risk of progression.
In the step (3), preferably, 70% of the patients in the enrolled MDS group are randomly selected with the sample function as the training set, the mutation markers of their MDS progression-associated genes are used as input to train the SVM classifier model, and the predicted values of the remaining 30% of samples are then used for verification;
in the step (3), clinical information, including the patient's sex, age, bone marrow blast count, primary red blood cell count, white blood cell count, platelet count and the like, is extracted for each training sample; during training, the MDS progression-associated gene mutations of the training samples are fused with the clinical information and then used as input to train the SVM classifier model.
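The 70%/30% split in step (3) can be sketched in Python as follows. This is a minimal illustration, not the patent's code: the function name, the sample IDs, and the seed value are hypothetical, and the sketch fixes the training-set size exactly, whereas a per-row random assignment only approximates a 70/30 ratio.

```python
import random


def split_train_validation(sample_ids, train_fraction=0.7, seed=0):
    """Shuffle sample IDs and split them into training and validation lists."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    rng = random.Random(seed)  # fixed (arbitrary) seed so the split is repeatable
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# 20 hypothetical MDS sample IDs (placeholders, not patient identifiers)
samples = [f"P{i}" for i in range(1, 21)]
train_ids, validation_ids = split_train_validation(samples)
```

With 20 samples and a 0.7 fraction this yields 14 training and 6 validation samples; a different fraction or cohort size changes the counts accordingly.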
Compared with the prior art, the invention has the following advantages and effects:
compared with existing indices for predicting the progression of myelodysplastic syndrome, the invention predicts disease progression early and at the molecular level, can obtain a more accurate prediction, allows early intervention in high-risk patients to delay disease progression, assists the selection of subsequent targeted drugs, and has higher clinical practicability.
Drawings
FIG. 1 is a gene frequency distribution of 64 genes incorporated into each of 4 disease progression-associated modules in example 1.
FIG. 2 is a sample clustering heatmap of 64 genes based on the constructed prediction model in example 1; the abscissa is 20 samples and the ordinate 64 relevant genes.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
Example 1
Selection of the myelodysplastic syndrome progression gene prediction signature and establishment of the prediction model
1) Pre-treatment bone marrow aspirate samples were collected from 23 patients with myeloid tumors in the department of hematology of the national hospital of Guangdong province between January 2016 and December 2017: 10 patients with low-risk MDS (blast count < 5%), 10 patients with high-risk MDS (5% ≤ blast count < 20%), and 3 patients with MDS-related AML (AML patients with a clear MDS history of > 2 months and a blast count ≥ 20%).
2) 20 ng of bone marrow aspirate was taken and genomic DNA was extracted.
3) Pre-sequencing processing: DNA was extracted from the bone marrow fluid by cell lysis, binding of DNA to the membrane, DNA purification, and related steps, using the TaKaRa MiniBEST Universal Genomic DNA Extraction Kit Ver.5.0. The quality of the extracted DNA was then checked with a NanoDrop 2000 ultra-micro spectrophotometer, and samples that passed quality control proceeded to library preparation.
The ultra-micro spectrophotometer was used to measure the concentration and purity of the nucleic acid in the extracted sample. Quality control is passed when the A260/A280 ratio of the extracted double-stranded DNA is 1.7-1.9. A higher ratio indicates residual RNA in the extracted DNA and a lower ratio indicates residual protein; in either case, resampling and repeat quality control are needed.
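The A260/A280 acceptance rule can be written as a small check. The sketch below is illustrative Python; the function name and the returned messages are hypothetical, and treating both bounds as inclusive is an assumption, since the text gives only the range 1.7-1.9.

```python
def a260_a280_qc(a260, a280, low=1.7, high=1.9):
    """Return (ratio, verdict) for a double-stranded DNA purity check."""
    if a280 <= 0:
        raise ValueError("A280 absorbance must be positive")
    ratio = a260 / a280
    if ratio < low:
        # Below the window: the text attributes this to residual protein.
        return ratio, "fail: ratio too low, possible protein residue"
    if ratio > high:
        # Above the window: the text attributes this to residual RNA.
        return ratio, "fail: ratio too high, possible RNA residue"
    return ratio, "pass"
```

For example, `a260_a280_qc(1.8, 1.0)` returns a passing verdict, while readings outside the window flag the sample for resampling.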
4) Library preparation: using an amplicon capture method, 114 genes related to hematological tumors were targeted, and sequencing libraries were prepared from the amplified DNA fragments using Ion Kits.
5) Gene sequencing: emulsion (water-in-oil) amplification was performed and the purified magnetic beads were collected using the Ion OneTouch™ 2 System (Ion OneTouch™ 2 Instrument and Ion OneTouch™ ES enrichment system), together with a MILLIPORE Milli-Q water purifier, an ABI Veriti 96 PCR amplification apparatus, an Applied Biosystems 7500 Q-PCR apparatus, and the like. On-machine sequencing was performed on an Ion Proton™ sequencer.
6) Processing raw data and identifying variants: the gene sequencing above yields files in BAM format for subsequent analysis. The mean sequencing depth was 800×, no non-repeat sequence analysis method was present. First, the data were sorted, PCR duplicates were removed, and the files were indexed using software such as Samtools-1.8 and Picard-2.19. Next, variant calling was performed using bcftools mpileup in combination with bcftools call. For mutation sites with a variant frequency of 5%, the detection rate in this study was 97%-98%.
The variants include missense mutations, nonsense mutations, frameshift insertions, frameshift deletions, non-frameshift insertions, non-frameshift deletions, and splice-site mutations; intronic and synonymous variants are excluded;
7) Variant filtering and annotation: the variants detected in the above procedure (in VCF, Variant Call Format) were further filtered. The filter criteria are QUAL < 20 and MQ < 40.
To further clarify the significance of the mutations, the table_annovar.pl tool of the ANNOVAR software was used, with hg19 as the reference genome, to annotate the mutations and analyze amino acid changes against databases such as refGene, cytoBand, avsnp150, esp6500siv2_all, 1000g2015aug_eas, dbnsfp30a, cosmic70, exac03, and clinvar_20140929.
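The quality filter and the mutation-type selection can be combined into one predicate, sketched here in Python. Several points are assumptions rather than statements from the text: that a variant is discarded when QUAL < 20 or MQ < 40, that the record is a plain dictionary with the hypothetical keys shown, and that the listed mutation types map onto ANNOVAR's refGene labels as written in the comments.

```python
# Assumed mapping of the listed mutation types to ANNOVAR ExonicFunc.refGene labels:
# missense -> "nonsynonymous SNV", nonsense -> "stopgain".
KEPT_EXONIC_FUNCS = {
    "nonsynonymous SNV",
    "stopgain",
    "frameshift insertion",
    "frameshift deletion",
    "nonframeshift insertion",
    "nonframeshift deletion",
}


def keep_variant(variant, min_qual=20.0, min_mq=40.0):
    """Return True if an annotated variant passes quality and type filters.

    `variant` is a hypothetical dict with keys: qual, mq, func, exonic_func.
    """
    # Assumption: failing either quality criterion removes the variant.
    if variant["qual"] < min_qual or variant["mq"] < min_mq:
        return False
    # Splice-site variants are kept; ANNOVAR labels them "splicing" in Func.refGene.
    if variant["func"] == "splicing":
        return True
    # Intronic variants have func == "intronic"; synonymous variants have
    # exonic_func == "synonymous SNV". Both fall through to False here.
    return variant["func"] == "exonic" and variant["exonic_func"] in KEPT_EXONIC_FUNCS


example_variants = [
    {"qual": 55.0, "mq": 60.0, "func": "exonic", "exonic_func": "nonsynonymous SNV"},
    {"qual": 12.0, "mq": 60.0, "func": "exonic", "exonic_func": "stopgain"},
    {"qual": 80.0, "mq": 60.0, "func": "intronic", "exonic_func": "."},
    {"qual": 80.0, "mq": 45.0, "func": "exonic", "exonic_func": "synonymous SNV"},
    {"qual": 30.0, "mq": 41.0, "func": "splicing", "exonic_func": "."},
]
retained = [v for v in example_variants if keep_variant(v)]
```

The example records are invented for illustration; only the first and last survive the filter.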
8) Construction of a progress-related module:
The mutation information of the 3 patient groups was summarized and, based on the pattern of change across the stages of MDS, mutation modules were constructed to identify genes that are differentially mutated at the various stages of disease progression.
As shown in FIG. 1, the mutation frequency of the module 1 genes increases from low-risk MDS to MDS-AML, and the opposite is true for the module 2 genes. The genes in module 3 must satisfy two criteria: the mutation frequency in the high-risk MDS group is lower than in the low-risk MDS group, and the mutation frequency in the MDS-AML group is higher than in both MDS groups. The genes in module 4 have a higher mutation frequency in the high-risk MDS group than in the low-risk MDS group, and a higher mutation frequency in the MDS groups than in the MDS-AML group, as shown in Table 1 and FIG. 1.
Table 1: construction of progression-related Module (numerical values in the Table are the variation frequencies)
[Table 1 is reproduced as images in the original publication: Figure BDA0003238069700000061 and Figure BDA0003238069700000071]
Note: the variation frequency is the patients with the mutation in each group as a percentage of all patients
The 64 genes included in the modules were carried forward into subsequent analyses.
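The four module criteria can be expressed as a short decision function. The Python sketch below is illustrative only: it assumes the variation frequency of a gene is computed per group (mutated patients divided by patients in that group), uses strict inequalities, and returns None for genes that fit no module, including ties. None of these details is specified in the text, and the function names are hypothetical.

```python
def mutation_frequency(n_mutated, n_patients):
    """Percentage of patients in one group who carry a mutation in the gene."""
    if n_patients <= 0:
        raise ValueError("group must contain at least one patient")
    return 100.0 * n_mutated / n_patients


def assign_module(low, high, aml):
    """Assign a gene to module 1-4 from its three group frequencies, else None."""
    if low < high < aml:
        return 1  # ascending from low-risk MDS to MDS-AML
    if low > high > aml:
        return 2  # descending from low-risk MDS to MDS-AML
    if high < low and aml > low:
        return 3  # high-risk below low-risk, MDS-AML highest of the three
    if high > low and aml < low:
        return 4  # high-risk above low-risk, MDS-AML lowest of the three
    return None   # ties or patterns outside the four modules


# Hypothetical gene: mutated in 1 of 10 low-risk, 3 of 10 high-risk, 2 of 3 MDS-AML patients
frequencies = (
    mutation_frequency(1, 10),
    mutation_frequency(3, 10),
    mutation_frequency(2, 3),
)
module = assign_module(*frequencies)
```

The group sizes 10, 10, and 3 match the cohort in example 1; the mutation counts are invented for the illustration.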
9) For the 20 enrolled MDS bone marrow samples (high-risk and low-risk groups), the presence or absence of mutations in the 64 genes included in the 4 modules was recorded: a mutation present is marked 1 and a mutation absent is marked 0. The mutation status of all samples is shown in FIG. 2, where red represents the presence of a mutation and blue its absence.
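The 1/0 marking in step 9 amounts to building a sample-by-gene indicator matrix. A minimal Python sketch follows; the function name and the per-sample mutation sets are hypothetical, and the three gene names are taken from the progression-associated gene list only to keep the example short.

```python
def encode_mutations(sample_mutations, genes):
    """Map each sample to a 1/0 vector: 1 if the gene is mutated, else 0.

    sample_mutations: dict of sample ID -> set of mutated gene names.
    genes: ordered list of genes defining the vector positions.
    """
    return {
        sample: [1 if gene in mutated else 0 for gene in genes]
        for sample, mutated in sample_mutations.items()
    }


genes = ["TET", "TP", "RUNX"]  # subset of the listed genes, for illustration
sample_mutations = {           # invented mutation calls
    "sample_A": {"TET"},
    "sample_B": {"TP", "RUNX"},
    "sample_C": set(),
}
matrix = encode_mutations(sample_mutations, genes)
```

Keeping `genes` as an ordered list matters: every sample's vector must use the same gene order so that each column of the classifier input always refers to the same gene.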
10) Construction of the disease progression prediction model: using the sample function in R 3.6.1, the 64-gene mutation profiles of 13 MDS samples were randomly selected as the training set, and the SVM classifier model was trained with the R package "e1071". Based on the training-set results, 0.3 was selected as the prediction threshold to divide the samples into a disease-progression group and a non-progression group, and the classification was verified in the remaining 7 patients (as shown in Table 2).
The code used to construct the SVM classifier model is as follows:
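# Load the SVM implementation (e1071) and the heatmap package (pheatmap).
library(e1071)
library(pheatmap)
# Read the mutation-marker table; the outcome column AML is taken to be column 1,
# as implied by the traindata[, -1] and traindata[, 1] indexing below.
SVM1 <- read.table("C:/Users/30798/Desktop/SVM_AML.txt", header = TRUE, sep = "\t")
# Assign each row at random to group 1 (training, probability 0.7) or group 2 (test, 0.3).
index <- sample(2, nrow(SVM1), replace = TRUE, prob = c(0.7, 0.3))
traindata <- SVM1[index == 1, ]
testdata <- SVM1[index == 2, ]
# Train the SVM with AML as the response and all other columns as predictors.
cats_svm_model <- svm(AML ~ ., data = traindata)
cats_svm_model
# Predict on the training set and tabulate predictions against the true values.
cats_svm_model_pred_1 <- predict(cats_svm_model, traindata[, -1])
cats_table_1 <- table(pred = cats_svm_model_pred_1, true = traindata[, 1])
cats_table_1
# Predict on the held-out test set and tabulate in the same way.
cats_svm_model_pred_2 <- predict(cats_svm_model, testdata[, -1])
cats_table_2 <- table(pred = cats_svm_model_pred_2, true = testdata[, 1])
cats_table_2
# Read the gene-by-sample matrix and draw the clustering heatmap.
SVM2 <- read.table("C:/Users/30798/Desktop/heatmap_SVM.txt", header = TRUE, row.names = 1, sep = "\t")
SVM2 <- as.matrix(SVM2)
pheatmap(SVM2, color = colorRampPalette(c("navy", "white", "firebrick3"))(50))
# Column annotations for the 20 samples: risk group and observed progression.
annotation_col = data.frame(CellType = factor(rep(c("low", "high"), 5)), progress = c("NO", "NO", "NO", "YES", "NO", "NO", "NO", "NO", "YES", "NO", "YES", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "YES"))
rownames(annotation_col) = colnames(SVM2)
pheatmap(SVM2, annotation_col = annotation_col)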
table 2: disease progression gene model prediction and clinical outcome for 20 MDS patients
Sample(s) SVM prediction value Predicted results Clinical results Consistency
T1 0.09958<0.3 Low risk progression Not progressed Uniformity
T2 0.09975<0.3 Low risk progression Not progressed Uniformity
T3 0.0999<0.3 Low risk progression Not progressed Uniformity
T4 0.09997<0.3 Low risk progression Not progressed Uniformity
T5 0.10006<0.3 Low risk progression Not progressed Uniformity
T6 0.10006<0.3 Low risk progression Not progressed Uniformity
T7 0.10021<0.3 Low risk progression Not progressed Uniformity
T8 0.10023<0.3 Low risk progression Not progressed Uniformity
T9 0.10024<0.3 Low risk progression Not progressed Uniformity
T10 0.43693>0.3 High risk progression Progress of the development Uniformity
T11 0.45018>0.3 High risk progression Progress of the development Uniformity
T12 0.48653>0.3 High risk progression Progress of the development Uniformity
T13 0.49049>0.3 High risk progression Progress of the development Uniformity
V1 0.10333<0.3 Low risk progression Not progressed Uniformity
V2 0.16121<0.3 Low risk progression Not progressed Uniformity
V3 0.16969<0.3 Low risk progression Not progressed Uniformity
V4 0.1955<0.3 Low risk progression Not progressed Uniformity
V5 0.20033<0.3 Low risk progression Not progressed Uniformity
V6 0.24631<0.3 Low risk progression Not progressed Uniformity
V7 0.2711<0.3 Low risk progression Not progressed Uniformity
Note: t represents training set samples, V represents validation group samples
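The 0.3 threshold rule and the consistency column of Table 2 can be re-derived with a few lines of Python. The prediction values and clinical outcomes below are copied from Table 2; the function and variable names are hypothetical.

```python
THRESHOLD = 0.3


def predicted_risk(score, threshold=THRESHOLD):
    """Apply the stated rule: >= threshold is high risk, otherwise low risk."""
    return "high" if score >= threshold else "low"


# (sample, SVM prediction value, clinically progressed) as listed in Table 2
TABLE_2 = [
    ("T1", 0.09958, False), ("T2", 0.09975, False), ("T3", 0.0999, False),
    ("T4", 0.09997, False), ("T5", 0.10006, False), ("T6", 0.10006, False),
    ("T7", 0.10021, False), ("T8", 0.10023, False), ("T9", 0.10024, False),
    ("T10", 0.43693, True), ("T11", 0.45018, True), ("T12", 0.48653, True),
    ("T13", 0.49049, True),
    ("V1", 0.10333, False), ("V2", 0.16121, False), ("V3", 0.16969, False),
    ("V4", 0.1955, False), ("V5", 0.20033, False), ("V6", 0.24631, False),
    ("V7", 0.2711, False),
]


def concordance(rows):
    """Count rows where the thresholded prediction matches the clinical outcome."""
    hits = sum(
        (predicted_risk(score) == "high") == progressed
        for _, score, progressed in rows
    )
    return hits, len(rows)


validation_rows = [row for row in TABLE_2 if row[0].startswith("V")]
overall = concordance(TABLE_2)
validation = concordance(validation_rows)
```

Applied to the table, the rule reproduces the listed agreement for all 20 samples. Every validation sample (V1 to V7) is a non-progressor, so the validation rows exercise only the low-risk side of the threshold.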
Example 2
Establishment of a prediction model combining the myelodysplastic syndrome progression prediction gene signature with clinical indices, and prediction with the model
Using the 20 MDS patients (high-risk and low-risk groups) enrolled in example 1, gene sequencing and tabulation of the presence or absence of mutations were performed with reference to the gene model and analysis method of example 1.
15 patients were randomly selected as the training set. The mutations of the 64 genes were combined with the patients' clinical information (sex, age, bone marrow blast count, primary red blood cells, white blood cells, platelet count, and the like) and entered into the model, and a classification model was constructed with the SVM classifier. As before, 0.3 was selected as the prediction threshold: a predicted value greater than or equal to 0.3 defines the high-risk progression group, and a predicted value less than 0.3 defines the low-risk progression group, as shown in Table 3.
The remaining 5 test-set patients were then predicted with this model. The predicted results are shown in Table 4.
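One simple way to fuse the gene mutation markers with clinical information is to concatenate them into a single feature vector per patient, sketched below in Python. The text does not say how the fusion was done, so the concatenation itself, the field names, the 0/1 coding of sex, and the example values are all assumptions made for illustration.

```python
# Hypothetical clinical field names, in the order they are appended.
CLINICAL_FIELDS = ("sex", "age", "bm_blast_pct", "rbc", "wbc", "platelets")


def fuse_features(mutation_markers, clinical):
    """Concatenate 1/0 mutation markers with numeric clinical values."""
    missing = [field for field in CLINICAL_FIELDS if field not in clinical]
    if missing:
        raise KeyError(f"missing clinical fields: {missing}")
    # Assumed coding: female = 1.0, anything else = 0.0.
    sex_code = 1.0 if clinical["sex"] == "F" else 0.0
    numeric = [float(clinical[field]) for field in CLINICAL_FIELDS[1:]]
    return [float(marker) for marker in mutation_markers] + [sex_code] + numeric


markers = [1, 0, 0, 1]  # shortened marker vector; the examples use 64 genes
clinical = {            # invented values, not patient data
    "sex": "F", "age": 63, "bm_blast_pct": 7.5,
    "rbc": 2.9, "wbc": 3.1, "platelets": 85,
}
features = fuse_features(markers, clinical)
```

Because the clinical values sit on very different scales from the 0/1 markers, scaling matters for an SVM; the `svm()` function in e1071 scales its input variables by default (`scale = TRUE`).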
Table 3: disease progression gene combination clinical factor model prediction and clinical outcome for 15 MDS patients
Sample(s) SVM prediction value Predicted results Clinical results Consistency
S1 0.045547<0.3 Low risk progression Not progressed Uniformity
S2 0.045744<0.3 Low risk progression Not progressed Uniformity
S3 0.045766<0.3 Low risk progression Not progressed Uniformity
S4 0.045779<0.3 Low risk progression Not progressed Uniformity
S5 0.045785<0.3 Low risk progression Not progressed Uniformity
S6 0.045789<0.3 Low risk progression Not progressed Uniformity
S7 0.045791<0.3 Low risk progression Not progressed Uniformity
S8 0.045791<0.3 Low risk progression Not progressed Uniformity
S9 0.045822<0.3 Low risk progression Not progressed Uniformity
S10 0.045833<0.3 Low risk progression Not progressed Uniformity
S11 0.045866<0.3 Low risk progression Not progressed Uniformity
S12 0.595040>0.3 High risk progression Progress of the development Uniformity
S13 0.623043>0.3 High risk progression Progress of the development Uniformity
S14 0.649239>0.3 High risk progression Progress of the development Uniformity
S15 0.724980>0.3 High risk progression Progress of the development Uniformity
Table 4: disease progression gene-associated clinical factor model prediction and clinical outcome for 5 MDS patients
Sample(s) SVM prediction value Predicted results Clinical results Consistency
S1 0.162420<0.3 Low risk progression Not progressed Uniformity
S2 0.226496<0.3 Low risk progression Not progressed Uniformity
S3 0.239825<0.3 Low risk progression Not progressed Uniformity
S4 0.270390<0.3 Low risk progression Not progressed Uniformity
S5 0.279371<0.3 Low risk progression Not progressed Uniformity
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (10)

1. A method for constructing a myelodysplastic syndrome progression gene prediction model, which is characterized by comprising the following steps:
(1) collecting samples from low-risk and high-risk myelodysplastic syndrome patients and from leukemia patients, and extracting DNA from each sample; the samples from the low-risk and high-risk myelodysplastic syndrome patients form the training set;
(2) sequencing the DNA samples of the three groups of patients and aligning them to the hg19 reference genome to obtain a gene mutation spectrum and mutation frequencies;
the genes are divided into four modules according to their mutation frequencies, using the following criteria:
module 1: the gene mutation rate shows an ascending trend from the low-risk MDS group to the MDS-AML group;
module 2: the gene mutation rate shows a descending trend from the low-risk MDS group to the MDS-AML group;
module 3: the mutation frequency of the gene in the high-risk MDS group is lower than in the low-risk MDS group, and the mutation frequency in the MDS-AML group is the highest;
module 4: the mutation frequency of the gene in the high-risk MDS group is higher than in the low-risk MDS group, and the mutation frequency in the MDS-AML group is the lowest;
the genes included in these four modules are MDS progression-associated genes;
(3) selecting the MDS progression-associated genes in the DNA of the training samples according to the four modules and marking their mutation status, a gene being marked A1 when a mutation is present and A2 when it is absent; and training the SVM classifier model with the MDS progression-associated gene mutation markers of the training samples as input to complete construction of the model.
2. The method of claim 1, wherein: the MDS progress related genes in the step (2) are as follows: ABL, ANKRD, ARID1, ATG2, BCORL, BIRC, BRAF, BRINP, CALR, CARD, CBL, CCND, CEBPA, CREBP, CUX, CXCR, DDX3, DNM, DNMT3, ECT2, EP300, ETNK, EZH, FAM46, FGFR, FLT, GATA, ID, IDH, JAK, KDM6, KIT, KMT2, MAPK, MPL, NOTCH, PDS5, PHF, PIGA, PLCG, PRKCB, PRPF40, RAD, RBBP, RELN, RUNX, TBSEP, SETD, SF3A, SF3B, SMC1, SMC, SRP, STAG, TERT, TET, TP, TPMT, TRAF, XPO, and ZRSR, for a total of 64.
3. The method of claim 1, wherein: in the step (3), A1 is 1, A2 is 0, or A1 is 1, and A2 is 0; selecting 0.3 as a threshold value for the SVM classifier model to predict the sample, and predicting the sample to have high risk of developing disease progress when the predicted value of the sample is more than or equal to 0.3; when the sample prediction value is <0.3, the sample is predicted to be a progression low risk sample.
4. The method of claim 1, wherein: in the step (3), 70% of patients are selected from the grouped MDS group samples at random through a sample function as a training set, mutation markers of MDS progress related genes of the patients are used as input, an SVM classifier model is trained, and then the predicted values of the rest 30% of samples are used for verification.
5. The method of claim 1, wherein: in the step (3), clinical information is extracted for each training sample, and during training, the MDS progress-related gene mutation of the training sample is fused with the clinical information and then used as input to train the SVM classifier model.
6. The method of claim 5, wherein: the clinical information includes patient gender, age, bone marrow primary cell count, primary red blood cell, white blood cell, and platelet count.
7. The method of claim 1, wherein: in step (1), the blast (primitive cell) percentage of a low-risk myelodysplastic syndrome patient is less than 5%;
the blast percentage of a high-risk myelodysplastic syndrome patient is at least 5% and less than 20%;
a leukemia patient has a leukemia history of more than 2 months and a blast percentage of at least 20%.
8. The method of claim 1, wherein: the sample in step (1) is a blood, tissue, or bone marrow aspirate sample from a patient.
9. The method of claim 1, wherein: the mutations in step (2) comprise missense mutations, nonsense mutations, frameshift insertions, frameshift deletions, in-frame insertions, in-frame deletions, and splice-site mutations; intronic mutations and synonymous mutations are excluded.
10. The method of claim 1, wherein: the average sequencing depth in step (2) is at least 800×.
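
The data-handling steps recited in claims 1, 3, 4, 7 and 9 can be illustrated with a minimal Python sketch. It is not the applicant's implementation. The function names, the mutation-type labels, and the toy variant list are hypothetical and chosen only for illustration, and the SVM classifier itself is not shown. The sketch assumes A1 = 1 and A2 = 0, treats the predicted value as a plain number, and groups samples by blast (primitive cell) percentage alone, ignoring the leukemia-history condition of claim 7.

```python
import random

# Mutation classes retained under claim 9; intronic and synonymous variants
# are dropped. The string labels are hypothetical names for this sketch.
KEPT_MUTATION_TYPES = {
    "missense", "nonsense",
    "frameshift_insertion", "frameshift_deletion",
    "inframe_insertion", "inframe_deletion",
    "splice_site",
}

def mutation_markers(variants, panel, a1=1, a2=0):
    """Return one marker per panel gene: a1 if a retained mutation is present, else a2."""
    mutated = {gene for gene, kind in variants if kind in KEPT_MUTATION_TYPES}
    return [a1 if gene in mutated else a2 for gene in panel]

def risk_label(predicted_value, threshold=0.3):
    """Apply the claim 3 cut-off: values at or above the threshold are high risk."""
    return "high" if predicted_value >= threshold else "low"

def blast_group(blast_percent):
    """Group a sample by blast percentage only (claim 7 boundaries, history ignored)."""
    if blast_percent < 5:
        return "low-risk MDS"
    if blast_percent < 20:
        return "high-risk MDS"
    return "leukemia"

def split_70_30(sample_ids, seed=0):
    """Randomly assign 70% of the samples to training and the rest to validation."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    cut = round(len(ids) * 0.7)
    return ids[:cut], ids[cut:]

# Toy example: a four-gene excerpt of the claim 2 list and three invented variants.
panel = ["CALR", "CBL", "CEBPA", "TERT"]
variants = [
    ("CBL", "missense"),
    ("TERT", "synonymous"),            # excluded, so TERT stays unmarked
    ("CALR", "frameshift_deletion"),
]
features = mutation_markers(variants, panel)
train_ids, test_ids = split_70_30(range(10))
```

In this sketch the marker vector from `mutation_markers` is what would be passed to a classifier, and `risk_label` would be applied to that classifier's predicted value. Fixing the random seed only makes the 70/30 split reproducible and is not something the claims require.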
CN202111009322.9A 2021-08-31 2021-08-31 Method for constructing myelodysplastic syndrome progress gene prediction model Active CN113764044B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111009322.9A CN113764044B (en) 2021-08-31 2021-08-31 Method for constructing myelodysplastic syndrome progress gene prediction model

Publications (2)

Publication Number Publication Date
CN113764044A (en) 2021-12-07
CN113764044B (en) 2023-07-21

Family

ID=78792060

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111009322.9A Active CN113764044B (en) 2021-08-31 2021-08-31 Method for constructing myelodysplastic syndrome progress gene prediction model

Country Status (1)

Country Link
CN (1) CN113764044B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104862407A (en) * 2015-06-02 2015-08-26 上海艾迪康医学检验所有限公司 Primer and method for detecting EZH2 genes
CN106566875A (en) * 2016-09-20 2017-04-19 上海荻硕贝肯医学检验所有限公司 Primers, kit and method for detecting myelodysplastic syndromes (MDS) gene mutation
CN107949643A (en) * 2015-04-23 2018-04-20 奎斯特诊断投资股份有限公司 For detecting the method and composition that CALR is mutated in bone marrow proliferative diseases
CN110846411A (en) * 2019-11-21 2020-02-28 上海仁东医学检验所有限公司 Method for distinguishing gene mutation types of single tumor sample based on next generation sequencing
CN110993026A (en) * 2019-12-30 2020-04-10 苏州大学 Assessment method and system for myelodysplastic syndrome
CN111154881A (en) * 2020-03-09 2020-05-15 南京实践医学检验有限公司 Detection kit for gene mutation in acute myeloid leukemia and application
CN112094914A (en) * 2020-11-17 2020-12-18 苏州科贝生物技术有限公司 Kit for combined detection of acute myeloid leukemia
CN113025619A (en) * 2021-03-25 2021-06-25 大连医科大学附属第二医院 HOOK3-FGFR1 novel fusion gene and application and detection kit thereof

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
PING WU ET AL.: "Co-occurrence of RUNX1 and ASXL1 mutations underlie poor response and outcome for MDS patients treated with HMAs", 《AM J TRANSL RES》, vol. 11, no. 6, pages 1 - 2 *

Also Published As

Publication number Publication date
CN113764044B (en) 2023-07-21

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant