CN114203261A - Method for developing gene detection Panel clinical diagnosis index algorithm - Google Patents
Method for developing gene detection Panel clinical diagnosis index algorithm
- Publication number
- CN114203261A (application CN202111251878.9A)
- Authority
- CN
- China
- Prior art keywords
- data
- gene detection
- detection panel
- sequencing
- panel
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
Abstract
The invention discloses a method for developing a clinical diagnosis index algorithm for a gene detection Panel, belonging to the field of cell gene detection. The method comprises the following steps: providing a table of gene locus information and filtering sequencing data; simulating the sequencing data and taking the simulated data as virtual gene detection Panel data; analyzing the sequencing data with an existing index analysis algorithm; analyzing the virtual Panel data with an existing index analysis algorithm; integrating the analysis results and performing model training; and evaluating the performance of the candidate calculation models and selecting the optimal scheme. The method starts from whole-genome and whole-exome sequencing data in public databases, uses the site information of each gene in the gene detection Panel to extract the corresponding site data and construct virtual gene detection Panel data, and develops the algorithm on that virtual Panel data, thereby improving the quality and efficiency of gene detection Panel product development.
Description
Technical Field
The invention belongs to the field of cell gene detection and relates to a method for developing a clinical diagnosis index algorithm for a gene detection Panel; the clinical diagnosis indexes of samples sequenced with a gene (locus) detection Panel are developed and optimized through a novel data analysis model. Specifically, the method is based on multi-omics sequencing data (including but not limited to whole genome sequencing, whole exome sequencing, whole genome methylation sequencing, whole transcriptome sequencing and the like), and helps developers construct a digital detection Panel by simulating characteristics such as the distribution pattern of gene sites and the read enrichment bias under a specific detection Panel. On this basis, an artificial intelligence algorithm is used to fit the index value calculated from the detection Panel to the original index value, so that the detection Panel achieves detection performance consistent with the multi-omics sequencing data. The invention can greatly reduce the cost of developing and testing a detection Panel.
Background
In the prior art, algorithms for the clinical diagnosis indexes of a gene detection Panel are mainly developed by collecting a large number of samples and running the gene Panel assay on them to generate a large amount of data. This consumes a great deal of money, time and manpower, and if the initial site design of the gene Panel is flawed, it can bring great risk to product development. Meanwhile, whole-genome and whole-exome high-throughput sequencing is currently expensive and covers many detection sites unrelated to disease. Some gene detection Panels are therefore designed to detect the mutation status of sites in important disease-related genes, which both reduces detection cost and concentrates sequencing depth on those specific sites, improving the sensitivity and accuracy of the detection results. However, when clinical diagnosis indexes (such as TMB and MSI) are analyzed from the sequencing data generated by these Panels, factors such as the bias of the selected gene combination mean that the results of existing index calculation methods cannot fully reflect the true state of the sample.
Conventional methods: at present, the following two methods are mainly used to construct and optimize a Panel.
(1) Constructing the Panel from scratch by large-scale sampling. The specific steps are: a: collecting a large number of samples (such as 100 or 500 samples), and performing both a specific omics sequencing (such as whole exome sequencing) and detection Panel sequencing on each sample; b: analyzing the data from the two sequencing methods with similar analysis algorithms to obtain a score for a given index; c: fitting the index score obtained from the detection Panel to the index score obtained from the specific omics sequencing, so as to obtain a standard score for clinical evaluation and diagnosis. The biggest drawbacks of this method are that the early sample collection period is long, the cost is high, and a great deal of manpower is required; and if the initial site design of the gene Panel is flawed, it can bring great risk to product development.
(2) Optimizing the Panel prediction algorithm based on public data. The specific steps are: a: collecting the relevant omics sequencing data from a public database, and capturing the regions of the collected sequencing data that correspond to the genome regions covered by the Panel, so as to simulate the sequencing data of the detection Panel; steps b and c are the same as steps b and c of the first method. Although this method greatly reduces the up-front cost of Panel optimization, the technical bias of the Panel itself means that the regions actually captured, the detection depth of different regions and so on can differ greatly from the sequencing data in existing public databases. Simply capturing the corresponding regions for subsequent simulation analysis therefore has limited effect, and may even give results opposite to those of the actual detection and analysis process; the method thus has a limited range of application and is difficult to popularize on a large scale.
A new index analysis algorithm therefore urgently needs to be developed based on gene detection Panel data. The invention is mainly directed at the development and application of clinical diagnosis indexes for gene detection Panels.
Disclosure of Invention
Purpose of the invention: the invention aims to provide a method for developing a gene detection Panel clinical diagnosis index algorithm.
Technical scheme: the invention relates to a method for developing a gene detection Panel clinical diagnosis index algorithm, which comprises two stages: constructing a virtual gene detection Panel, and developing a clinical index analysis algorithm for the virtual gene detection Panel data;
firstly, the virtual gene detection Panel is constructed as follows:
(1) providing information on all detection sites involved in the designed gene detection Panel;
(2) filtering the whole genome or whole exome sequencing data;
(3) simulating, based on a set of sequencing-related parameters, the retained sequencing data covering the detection sites;
(4) sorting and storing the simulated data produced in step (3) as virtual gene detection Panel data;
secondly, the clinical index analysis algorithm for the virtual gene detection Panel data is developed as follows:
(5) analyzing the whole genome or whole exome sequencing data input in step (2) with an existing index analysis algorithm;
(6) analyzing the virtual gene detection Panel data provided in step (4) with an existing index analysis algorithm;
(7) integrating the results of steps (5) and (6): matching the result of each sample in step (5) to the corresponding sample in step (6) and marking it as the expected result of that sample;
performing model training on the integrated results with a suitable machine learning algorithm;
(8) evaluating the performance of the candidate calculation models and selecting the optimal scheme.
Further, in step (1), the information provided includes, but is not limited to, the position of each site on the genome and the sequence of the site.
Further, in step (2), filtering the whole genome or whole exome sequencing data specifically comprises: extracting sequencing data based on the detection site information provided in step (1), and retaining only the sequencing data that falls within the detection sites.
Further, in step (3), the parameters include, but are not limited to, the sequencing platform, the sequence length, the sequencing depth, and the GC content of the sequence;
the simulation process includes, but is not limited to, re-fitting the read distribution and enrichment degree of the data passed on from step (2) (the sequencing data within the detection sites) according to the parameter settings, so that the generated data is consistent in read distribution and enrichment degree with gene detection Panel sequencing data obtained under real conditions.
Further, in step (6), the results produced by the index analysis algorithm are divided into two groups, a training set and a test set; the samples analyzed in step (6) are the same as the samples analyzed in step (5);
the existing data is randomly divided into the training set and the test set in a 7:3 ratio, wherein 70% of the sample data is used as the training set for training the model, and the remaining 30% is used as the test set for the final evaluation of the model's prediction performance.
Notes:
Gene detection Panel: a test that detects not a single site or gene but multiple sites and multiple genes simultaneously; these sites and genes are selected and combined according to a defined standard to form a detection set, and this set of gene sites is called a gene detection Panel.
Whole genome sequencing: all the DNA in the cell nucleus is collectively called the genome; high-throughput sequencing of the genome is whole genome sequencing.
Whole exome sequencing: the portion of cellular DNA that directs the encoding of proteins is called an "exon"; the set of all such DNA segments is called the exome; high-throughput sequencing of the exome is whole exome sequencing.
Sequencing depth: the ratio of the total number of bases obtained by sequencing to the size of the genome; it is one of the indexes for evaluating sequencing quantity.
TMB: tumor mutational burden; defined as the total number of somatic gene coding errors, base substitutions, and gene insertion or deletion errors detected per million bases; TMB is a recent marker for evaluating the therapeutic effect of PD-1 antibodies, and its value has been demonstrated in the treatment of a variety of tumors.
MSI: microsatellite instability. Microsatellites are short tandem repeat DNA sequences in the genome whose repeat units generally consist of 1-6 nucleotides; they are polymorphic within populations because the number of copies of the core repeat unit differs. MSI arises from a functional defect in DNA mismatch repair in tumor tissue; MSI accompanied by DNA mismatch repair deficiency is a clinically important tumor marker.
Read: a sequence fragment obtained by sequencing.
Omics: mainly includes genomics, proteomics, metabolomics, transcriptomics, lipidomics, immunomics, glycomics, radiomics (imaging omics), ultrasound omics and the like.
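The definitions of sequencing depth and TMB above are simple ratios, and a short worked example may make the units concrete. The following is a minimal sketch in Python; the function names and the numbers in the usage note are illustrative assumptions and do not come from the patent.

```python
def sequencing_depth(total_bases: int, region_size_bp: int) -> float:
    """Sequencing depth: total sequenced bases divided by the size of the
    sequenced region (the genome, or a Panel's target region)."""
    if region_size_bp <= 0:
        raise ValueError("region_size_bp must be positive")
    return total_bases / region_size_bp


def tmb_per_megabase(n_mutations: int, covered_bp: int) -> float:
    """TMB: number of detected somatic mutations per million covered bases."""
    if covered_bp <= 0:
        raise ValueError("covered_bp must be positive")
    return n_mutations / (covered_bp / 1_000_000)
```

For instance, under these assumptions 3,000,000 sequenced bases over a hypothetical 30,000 bp target give a depth of 100x, and 15 mutations over 1.5 Mb of covered sequence give a TMB of 10 mutations per megabase.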
Beneficial effects: compared with the prior art, the invention has the following advantages: it is based on whole-genome and whole-exome sequencing data from public databases (or accumulated by the developers themselves), uses the site information of each gene in the gene detection Panel to extract the corresponding site data and construct virtual gene detection Panel data, and develops the algorithm on that virtual Panel data, thereby improving the quality and efficiency of gene detection Panel product development and greatly reducing development cost and risk.
Drawings
FIG. 1 is a flow chart of the operation of the present invention;
FIG. 2 is a graphical representation of TMB values for two sets of data analyzed using a linear fitting algorithm in accordance with the present invention;
FIG. 3 is a schematic representation of the statistical signal values of the Beta mixture model for a single probe in the present invention;
FIG. 4 is a graph of G-CIMP values for two sets of data analyzed using a linear fitting algorithm in accordance with the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
The invention constructs a virtual gene detection Panel and develops a clinical index calculation method through the following scheme.
As shown in FIG. 1, the invention is divided into two stages and 8 main steps.
The 1st stage is mainly used for constructing the virtual gene detection Panel, and comprises the following specific steps:
Step 1: obtaining information on all detection sites involved in the designed gene detection Panel;
the information includes, but is not limited to, the position of each site on the genome, the sequence of the site and the like; this information is passed to step 2;
such information can be provided directly by the personnel who designed the gene detection Panel; alternatively, the specific site information can be determined by sequence alignment analysis (for example with BWA or another alignment tool) of sequencing data from test samples of the gene detection Panel;
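As an illustration of the site information that step 1 passes on, the sketch below parses a tab-separated site table into sorted records. The four-column, BED-like layout (chromosome, 0-based start, end, site name), the `PanelSite` type and the `parse_site_table` function are assumptions made for this example; the patent does not prescribe a file format.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PanelSite:
    chrom: str
    start: int  # 0-based, inclusive (assumed convention)
    end: int    # 0-based, exclusive (assumed convention)
    name: str


def parse_site_table(text: str) -> List[PanelSite]:
    """Parse a tab-separated site table, skipping blank lines and '#' comments."""
    sites = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        chrom, start, end, name = line.split("\t")[:4]
        sites.append(PanelSite(chrom, int(start), int(end), name))
    # Sort by chromosome and start so later steps can scan sites in order.
    return sorted(sites, key=lambda s: (s.chrom, s.start))
```

The resulting list is the kind of structure that could be handed to the filtering step; a sequence column could be added to `PanelSite` if the site sequences are also needed.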
Step 2: filtering whole genome or whole exome (or other omics) sequencing data; specifically, sequencing data is extracted based on the detection site information provided in step 1, and only the sequencing data falling within the detection sites is retained; the retained data is passed to step 3;
in the invention, capturing site data mainly means capturing the read data at the corresponding sites in the reference data set according to the site coordinate information obtained in step 1;
the reference data set can be downloaded from a platform such as a public database, or can be sequencing data accumulated by the developers themselves;
the data can be captured by extracting the data at the specific sites from the reference data set with tools such as BWA and samtools, but the method is not limited to these tools;
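The filtering logic of step 2 can be sketched independently of any alignment tool. The example below keeps only reads that overlap at least one Panel site, with reads and sites represented as plain `(chrom, start, end)` tuples in half-open coordinates. This representation and the `filter_reads` function are assumptions for illustration; in practice the same selection would typically be done on aligned files with tools such as samtools.

```python
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), half-open coordinates


def filter_reads(reads: List[Interval], sites: List[Interval]) -> List[Interval]:
    """Return the reads that overlap at least one detection site."""
    sites_by_chrom: Dict[str, List[Tuple[int, int]]] = {}
    for chrom, start, end in sites:
        sites_by_chrom.setdefault(chrom, []).append((start, end))

    kept = []
    for chrom, r_start, r_end in reads:
        # Two half-open intervals overlap when each starts before the other ends.
        if any(r_start < s_end and s_start < r_end
               for s_start, s_end in sites_by_chrom.get(chrom, [])):
            kept.append((chrom, r_start, r_end))
    return kept
```

The linear scan over sites is adequate for a small Panel; for large site tables an interval tree or a sorted sweep would be the more usual choice.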
Step 3: the data retained in step 2 is simulated based on a series of sequencing-related parameters including, but not limited to, the sequencing platform, the sequence length, the sequencing depth, the GC content of the sequence and the like. The simulation process includes, but is not limited to, re-fitting the read distribution, enrichment degree and the like of the data passed on from step 2 according to the parameter settings, so that the generated data is consistent in read distribution, enrichment degree and the like with gene detection Panel sequencing data obtained under real conditions. The fitted data is passed to step 4;
in a first approach, a mathematical statistical model (such as a Poisson distribution model) is constructed directly from parameters provided by the developers, such as the sequencing platform, sequence length, sequencing depth and GC content of the sequence, and the number of reads at each site in the captured data is fitted so that the distribution of read counts is consistent with that of data generated by the real gene detection Panel;
in a second approach, information such as the sequence length, sequencing depth and GC content of the sequence is calculated from the sequencing data of test samples of the gene detection Panel with tools such as BWA and samtools (for example samtools flagstat), a mathematical statistical model (such as a Poisson distribution model) is constructed from this parameter information, and the number of reads at each site in the captured data is fitted so that the distribution of read counts is consistent with that of data generated by the real gene detection Panel;
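To make the Poisson read-count model of step 3 concrete, the sketch below draws a target read count for each site from a Poisson distribution whose mean is the expected Panel depth scaled by a per-site capture weight. The capture weights (which could stand in for GC-content or enrichment bias), the function names and the use of Knuth's sampling method are all assumptions for illustration and are not specified in the patent.

```python
import math
import random
from typing import Dict


def poisson_sample(lam: float, rng: random.Random) -> int:
    """Draw one Poisson(lam) variate with Knuth's multiplication method.

    Suitable for modest means (tens to a few hundred)."""
    if lam <= 0:
        return 0
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_read_counts(mean_depth: float,
                         capture_weight: Dict[str, float],
                         seed: int = 0) -> Dict[str, int]:
    """Simulate a per-site read count: Poisson(mean_depth * capture weight)."""
    rng = random.Random(seed)
    return {site: poisson_sample(mean_depth * weight, rng)
            for site, weight in capture_weight.items()}
```

A call such as `simulate_read_counts(50.0, {"siteA": 1.0, "siteB": 0.5})` would give site A roughly twice the expected coverage of site B, which is the kind of uneven enrichment the simulation is meant to reproduce.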
Step 4: the data passed on from step 3 is sorted and stored as the virtual gene detection Panel data;
The 2nd stage is mainly used for developing a clinical index analysis algorithm for the virtual gene detection Panel data, and comprises the following specific steps:
Step 5: analyzing the whole genome or whole exome (or other omics) sequencing data input in step 2 with an existing index analysis algorithm (including but not limited to TMB, MSI and the like); because the standard calculation methods of most clinical indexes are built on whole-genome or whole-exome sequencing data, the result of this step is used as the gold standard for the algorithm training in step 7;
the index scores include, but are not limited to, those given by index calculation methods such as MSI, HRD and TMB;
Step 6: analyzing the virtual gene detection Panel data provided in step 4 with an existing index analysis algorithm; the analysis results are divided into a training set and a test set; the samples analyzed in this step are the same as the samples analyzed in step 5; the existing data is randomly divided into the training set and the test set in a 7:3 ratio, wherein 70% of the sample data is used as the training set for training the model, and the remaining 30% is used as the test set for the final evaluation of the model's prediction performance;
the index scores include, but are not limited to, those given by index calculation methods such as MSI, HRD and TMB;
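The random 7:3 grouping described in step 6 can be sketched as a seeded shuffle followed by a cut. The function name, the seed parameter and the sample identifiers in the usage note are illustrative assumptions.

```python
import random
from typing import List, Sequence, Tuple


def split_samples(sample_ids: Sequence[str],
                  train_fraction: float = 0.7,
                  seed: int = 0) -> Tuple[List[str], List[str]]:
    """Randomly split sample IDs into a training set and a test set."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    n_train = round(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]
```

With 100 hypothetical sample IDs this yields 70 training samples and 30 test samples; splitting by sample ID keeps each sample's whole-exome score and virtual Panel score in the same group.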
Step 7: integrating the results of steps 5 and 6; the result of each sample in step 5 is matched to the corresponding sample in step 6 and marked as the expected result of that sample; model training is performed on the integrated results with a suitable machine learning algorithm (including but not limited to support vector machines, deep learning algorithms and the like);
step 7 of the invention constructs a prediction model from the index scores calculated in steps 5 and 6, with the following specific steps:
firstly, the result of each sample in step 5 is matched to the corresponding sample in step 6 and marked as the expected result of that sample, and the paired sample results are divided into a training set and a test set in a ratio of 1:1 (or 7:3 and the like);
secondly, on the training set, a model is trained with one or more machine learning algorithms (such as linear fitting), so that the score the model calculates from the virtual gene detection Panel data approximates the score calculated for the corresponding sample in the reference data set; the prediction performance of the model is then evaluated on the test set;
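A minimal sketch of step 7 with linear fitting, the simplest of the algorithms the text mentions: scores are paired by sample ID, and an ordinary least-squares line maps the virtual Panel score to the reference (whole-genome or whole-exome) score. The dictionary-based inputs and the function names are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def pair_scores(panel_scores: Dict[str, float],
                reference_scores: Dict[str, float]) -> Tuple[List[float], List[float]]:
    """Pair Panel and reference scores for samples present in both results."""
    shared = sorted(set(panel_scores) & set(reference_scores))
    return ([panel_scores[s] for s in shared],
            [reference_scores[s] for s in shared])


def fit_linear(x: List[float], y: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values are all identical; slope is undefined")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model: Tuple[float, float], x: float) -> float:
    """Apply a fitted (slope, intercept) model to one Panel score."""
    slope, intercept = model
    return slope * x + intercept
```

The fitted `(slope, intercept)` pair would be estimated on training-set samples only and then applied to test-set Panel scores; a support vector machine or a deep learning model could be substituted behind the same interface.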
Step 8: evaluating the performance of the candidate calculation models and selecting the optimal scheme;
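Step 8 does not name an evaluation metric. As one possible reading, the sketch below scores each candidate model by root-mean-square error (RMSE) on the test set and selects the model with the lowest error. The choice of RMSE, the representation of models as plain callables and the function names are assumptions for illustration.

```python
import math
from typing import Callable, Dict, List, Tuple


def rmse(predicted: List[float], expected: List[float]) -> float:
    """Root-mean-square error between predictions and expected scores."""
    return math.sqrt(sum((p - e) ** 2 for p, e in zip(predicted, expected))
                     / len(expected))


def select_best_model(models: Dict[str, Callable[[float], float]],
                      panel_scores: List[float],
                      reference_scores: List[float]) -> Tuple[str, Dict[str, float]]:
    """Score every candidate model on the test set and return the best name."""
    errors = {name: rmse([model(x) for x in panel_scores], reference_scores)
              for name, model in models.items()}
    best = min(errors, key=errors.get)
    return best, errors
```

Other criteria, such as correlation with the gold-standard score or agreement of clinical classifications at a decision threshold, could replace RMSE without changing the selection logic.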
Example 1:
Constructing a TMB prediction algorithm for a lung cancer gene detection Panel:
TMB is the tumor mutation burden, representing the density of non-synonymous mutations in protein-coding regions; in some cancer types, patients with high TMB may benefit from immunotherapy;
1. downloading whole exome sequencing data for 100 lung cancer cases from the GDC website; at the same time, downloading gene detection Panel data generated with the MSK-IMPACT commercial kit design as the object to be simulated;
2. extracting the corresponding reads from the exome sequencing data according to the site information of the gene detection Panel;
3. constructing a Poisson distribution model based on the number of reads at each site in the gene detection Panel sequencing data, and recording the parameters of the model;
4. adding and deleting reads among those extracted from the exome sequencing data according to the Poisson model parameters constructed in step 3, so that the relative distribution of read counts across sites in the exome-derived data is consistent with the relative distribution of read counts across sites on the gene detection Panel;
5. applying a conventional TMB calculation method separately to the exome sequencing data and to the read data extracted from it, obtaining TMB scores for the two groups of data;
6. analyzing the two groups of data with a linear fitting model to construct a prediction model, so that the TMB score calculated from the extracted read data can be used by the model to predict a result close to the TMB score calculated directly from the exome data; the result is shown in FIG. 2.
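The read addition and deletion in step 4 of this example can be sketched as per-site resampling toward a target share of the total reads. In the sketch below the target fractions are given directly rather than derived from a fitted Poisson model, and reads are added by duplicating existing reads at the site; both choices, along with the function name, are simplifying assumptions for illustration.

```python
import random
from typing import Dict, List


def rebalance_reads(reads_by_site: Dict[str, List[str]],
                    target_fraction: Dict[str, float],
                    total_reads: int,
                    seed: int = 0) -> Dict[str, List[str]]:
    """Resample each site's reads so its share of total_reads matches the target."""
    rng = random.Random(seed)
    rebalanced = {}
    for site, reads in reads_by_site.items():
        target = round(total_reads * target_fraction.get(site, 0.0))
        if not reads or target == 0:
            rebalanced[site] = []
        elif target <= len(reads):
            # Too many reads: delete by sampling without replacement.
            rebalanced[site] = rng.sample(reads, target)
        else:
            # Too few reads: keep all and add duplicates drawn with replacement.
            extra = rng.choices(reads, k=target - len(reads))
            rebalanced[site] = list(reads) + extra
    return rebalanced
```

Duplicating reads raises coverage without adding new information, so a site that has very few exome reads remains poorly characterised however many copies are made.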
Example 2:
Constructing a G-CIMP prediction algorithm for a brain tumor DNA methylation Panel:
G-CIMP is an epigenetic feature of glioma in which a large number of CpG islands are methylated; patients carrying this feature generally have a better prognosis;
1. downloading Illumina 450K DNA methylation chip data for 100 brain cancer cases from the GDC website; at the same time, downloading Illumina 27K DNA methylation chip data for 10 cases as the object to be simulated;
2. extracting the corresponding data from the Illumina 450K DNA methylation data according to the site information of the Illumina 27K DNA methylation Panel;
3. constructing a Beta mixture model (see FIG. 3) based on the signal value at each site in the Illumina 27K DNA methylation data, and recording the parameters of the model;
4. adding to and removing from the data extracted from the Illumina 450K DNA methylation data according to the Beta mixture model parameters constructed in step 3, so that the relative distribution of signal values across sites is consistent with that of the Illumina 27K DNA methylation data;
5. applying a conventional G-CIMP calculation method separately to the Illumina 450K DNA methylation data and to the data extracted from it, obtaining G-CIMP scores for the two groups of data;
6. analyzing the two groups of data with a linear fitting model to construct a prediction model, so that the score calculated from the data extracted from the Illumina 450K DNA methylation data can be used by the model to predict a result close to the G-CIMP score calculated directly from the Illumina 450K DNA methylation data (see FIG. 4).
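Step 3 of this example models methylation signal values with a Beta mixture model. Fitting a full mixture needs an iterative procedure such as expectation-maximisation; as a deliberately simplified stand-in, the sketch below fits a single Beta component to one probe's signal values by the method of moments. The function name and the single-component simplification are assumptions, not the patent's method.

```python
from typing import Sequence, Tuple


def fit_beta_moments(values: Sequence[float]) -> Tuple[float, float]:
    """Method-of-moments estimate of Beta(alpha, beta) for values in (0, 1)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    # A Beta distribution requires 0 < var < mean * (1 - mean).
    if not 0 < var < mean * (1 - mean):
        raise ValueError("sample moments are not compatible with a Beta distribution")
    common = mean * (1 - mean) / var - 1
    return mean * common, (1 - mean) * common
```

In a mixture setting, estimates like these could serve as per-component updates once probe values have been assigned to, or weighted by, components (for example unmethylated, partially methylated and methylated states).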
The above is only a preferred embodiment of the present invention. The protection scope of the present invention is not limited to the above embodiments; all technical solutions within the concept of the present invention fall within its protection scope. It should be noted that modifications and refinements made by those skilled in the art without departing from the principle of the invention also fall within the protection scope of the invention.
Claims (5)
1. A method for developing a gene detection Panel clinical diagnosis index algorithm, characterized by comprising two processes: constructing a virtual gene detection Panel, and developing a clinical index analysis algorithm for the data of the virtual gene detection Panel;
firstly, the specific process of constructing the virtual gene detection Panel is as follows:
(1) providing information on all detection sites involved in the designed gene detection Panel;
(2) filtering the whole genome or whole exome sequencing data;
(3) simulating, based on a set of sequencing-related parameters, the retained sequencing data that cover the detection sites;
(4) collating and storing the simulated data as the virtual gene detection Panel data;
secondly, the specific process of developing the clinical index analysis algorithm for the virtual gene detection Panel data is as follows:
(5) analyzing the filtered whole genome or whole exome sequencing data with an existing index analysis algorithm;
(6) analyzing the provided virtual gene detection Panel data with an existing index analysis algorithm;
(7) integrating the analysis results of steps (5) and (6): matching the result of each sample in step (5) to the corresponding sample in step (6) and marking it as the expected result for that sample;
performing model training on the integrated results with a suitable machine learning algorithm;
(8) evaluating the performance of the various calculation models and selecting the optimal scheme.
2. The method for gene detection Panel clinical diagnostic indicator algorithm development as claimed in claim 1, wherein in step (1), the provided information includes, but is not limited to, position information of the locus on the genome and sequence information of the locus.
3. The method for gene detection Panel clinical diagnostic index algorithm development according to claim 1, wherein in step (2), the filtering of whole genome or whole exome sequencing data specifically means: extracting sequencing data based on the detection site information provided in step (1), and retaining only the sequencing data that fall within the detection sites.
4. The method for gene detection Panel clinical diagnostic indicator algorithm development as claimed in claim 1, wherein in step (3), the sequencing-related parameters include, but are not limited to, the sequencing platform, the sequence length, the sequencing depth and the GC content of the sequence;
the simulation process includes, but is not limited to, re-fitting the read distribution and enrichment degree of the sequencing data within the detection sites according to the parameter settings, so that the generated data are consistent in read distribution and enrichment degree with gene detection Panel sequencing data obtained under real conditions.
5. The method for gene detection Panel clinical diagnosis index algorithm development according to claim 1, characterized in that, in step (6), the analysis results obtained with the index analysis algorithm are divided into two groups, namely a training set and a test set;
the existing data are randomly split into the training set and the test set at a ratio of 7:3, wherein 70% of the sample data are used as the training set for training the model, and the remaining 30% are used as the test set for the final evaluation of the predictive performance of the model.
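As an illustration of the filtering in claim 3 and the random 7:3 grouping in claim 5, the following is a minimal sketch. It assumes reads and detection sites are represented as `(chromosome, start, end)` tuples with half-open coordinates. This representation, the function names and the fixed random seed are assumptions made for the example, not details given in the claims.

```python
import random


def overlaps(read, region):
    """True if a read and a region on the same chromosome share at least one base.
    Both are (chrom, start, end) tuples with half-open [start, end) coordinates."""
    return read[0] == region[0] and read[1] < region[2] and region[1] < read[2]


def filter_to_panel(reads, panel_regions):
    """Keep only the reads that overlap at least one detection site of the panel."""
    return [r for r in reads if any(overlaps(r, g) for g in panel_regions)]


def split_7_3(sample_ids, seed=0):
    """Randomly split samples into a 70% training set and a 30% test set."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)   # seeded so the split is reproducible
    cut = (len(ids) * 7) // 10         # integer arithmetic avoids float rounding
    return ids[:cut], ids[cut:]


# Hypothetical reads, panel region and sample names, for illustration only.
reads = [("chr1", 100, 250), ("chr1", 500, 650), ("chr2", 100, 250)]
panel = [("chr1", 200, 300)]
kept = filter_to_panel(reads, panel)

train, test = split_7_3([f"sample_{i}" for i in range(10)])
```

The linear scan over regions is adequate for a small example; for real sequencing data an indexed interval lookup would be needed to keep the filtering tractable.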
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111251878.9A CN114203261A (en) | 2021-10-26 | 2021-10-26 | Method for developing gene detection Panel clinical diagnosis index algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111251878.9A CN114203261A (en) | 2021-10-26 | 2021-10-26 | Method for developing gene detection Panel clinical diagnosis index algorithm |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114203261A (en) | 2022-03-18 |
Family
ID=80646355
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111251878.9A Pending CN114203261A (en) | 2021-10-26 | 2021-10-26 | Method for developing gene detection Panel clinical diagnosis index algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114203261A (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107451419A (en) * | 2017-07-14 | 2017-12-08 | 浙江大学 | It is a kind of that the method for simplifying DNA methylation sequencing data is produced by computer program simulation |
CN109136371A (en) * | 2018-07-25 | 2019-01-04 | 南京世和基因生物技术有限公司 | A kind of radiotherapy effect and the combination of toxic reaction related gene, detection probe library and detection kit |
CN109880910A (en) * | 2019-04-25 | 2019-06-14 | 南京世和基因生物技术有限公司 | A kind of detection site combination, detection method, detection kit and the system of Tumor mutations load |
CN111826447A (en) * | 2020-09-21 | 2020-10-27 | 求臻医学科技(北京)有限公司 | Method for detecting tumor mutation load and prediction model |
CN112029861A (en) * | 2020-09-07 | 2020-12-04 | 臻悦生物科技江苏有限公司 | Tumor mutation load detection device and method based on capture sequencing technology |
US20210020314A1 (en) * | 2018-03-30 | 2021-01-21 | Juno Diagnostics, Inc. | Deep learning-based methods, devices, and systems for prenatal testing |
CN112786103A (en) * | 2020-12-31 | 2021-05-11 | 普瑞基准生物医药(苏州)有限公司 | Method and device for analyzing feasibility of target sequencing Panel for estimating tumor mutation load |
CN113517066A (en) * | 2020-08-03 | 2021-10-19 | 东南大学 | Depression assessment method and system based on candidate gene methylation sequencing and deep learning |
2021-10-26: CN application CN202111251878.9A, published as CN114203261A (status: Pending)
Non-Patent Citations (3)
Title |
---|
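冉冰冰;梁楠;孙辉: "Research progress on the application of omics technologies in precision tumor diagnosis and treatment: from single-omics analysis to multi-omics integration", Chinese Journal of Cancer Biotherapy, no. 12, 25 December 2019 (2019-12-25) * |
徐云碧;杨泉女;郑洪建;许彦芬;桑志勤;郭子锋;彭海;张丛;蓝昊发;王蕴波;吴坤生;陶家军;张嘉楠: "Genotyping by target sequencing (GBTS) technology and its applications", Scientia Agricultura Sinica, no. 15, 1 August 2020 (2020-08-01) * |
陈如萍;刘蕊: "Application of next-generation sequencing technology in the diagnosis and treatment of colorectal cancer", Tianjin Medical Journal, no. 09, 15 September 2020 (2020-09-15) * |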
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||