CN114203261A - Method for developing gene detection Panel clinical diagnosis index algorithm

Info

Publication number: CN114203261A
Application number: CN202111251878.9A
Authority: CN
Inventors: 汪强虎; 李铜舒; 吴玲祥; 黄斌; 夏鹏; 葛东伟; 吴维; 李�杰; 王子宇
Original assignee: Ankai Life Technology Suzhou Co ltd
Current assignee: Ankai Life Technology Suzhou Co ltd
Priority date: 2021-10-26
Filing date: 2021-10-26
Publication date: 2022-03-18

Abstract

The invention discloses a method for developing a clinical diagnosis index algorithm for a gene detection Panel. It belongs to the field of cell gene detection and comprises the following steps: providing a gene locus information table and filtering sequencing data; simulating the sequencing data and taking the resulting data as virtual gene detection Panel data; analyzing the sequencing data with an existing index analysis algorithm; analyzing the virtual Panel data with an existing index analysis algorithm; integrating the analysis results and performing model training; and evaluating the performance of the various calculation models and selecting an optimal scheme. The method is based on whole-genome and whole-exome sequencing data in public databases. It extracts site data from the sequencing data using the site information of each gene in the gene detection Panel to construct virtual gene detection Panel data, and carries out algorithm development on the virtual Panel data, thereby improving the development quality and efficiency of gene detection Panel products.

Description

Method for developing gene detection Panel clinical diagnosis index algorithm

Technical Field

The invention belongs to the field of cell gene detection and relates to a method for developing a clinical diagnosis index algorithm for a gene detection Panel. Clinical diagnosis indexes for samples sequenced with a gene (locus) detection Panel are developed and optimized through a novel data analysis model. Specifically, the method is based on multi-omics sequencing data (including but not limited to whole-genome sequencing, whole-exome sequencing, whole-genome methylation sequencing and whole-transcriptome sequencing), and helps developers construct a digital detection Panel by simulating characteristics of a specific detection Panel such as the distribution pattern of gene sites and read enrichment bias. On this basis, an artificial intelligence algorithm is used to fit the index value calculated from the detection Panel to the original index value, so that the detection Panel achieves detection performance consistent with the multi-omics sequencing data. The invention can greatly reduce the cost of developing and testing a detection Panel.

Background

In the prior art, a clinical diagnosis index algorithm for a gene detection Panel is mainly developed by collecting a large number of samples and running the gene Panel on them to generate the data needed for algorithm development. This consumes a great deal of money, time and manpower, and if the initial site design of the gene Panel is wrong, it poses a great risk to product development. Meanwhile, whole-genome and whole-exome high-throughput sequencing is currently expensive and covers many detection sites that are irrelevant to disease. Some gene detection Panels are therefore designed to detect the mutation status of sites in a limited set of important disease-related genes, which not only reduces detection cost but also concentrates sequencing depth on those specific gene sites, improving the sensitivity and accuracy of the results. However, when clinical diagnosis indexes (such as TMB and MSI) are analyzed from the sequencing data generated by such Panels, factors such as the bias of the selected gene combination mean that the results of existing index calculation methods cannot fully reflect the true state of the sample.

Conventional methods: at present, the following two methods are mainly used to construct and optimize a Panel.

(1) Collecting a large number of samples and constructing the Panel from scratch. The specific steps are as follows:

a: collecting a large number of samples (for example, 100 or 500 samples), and performing both a specific omics sequencing (such as whole-exome sequencing) and detection Panel sequencing on each sample;

b: analyzing the data from the two sequencing methods with similar analysis algorithms to obtain a score for a given index;

c: fitting the index score obtained from the detection Panel to the index score obtained from the omics sequencing, so as to obtain a standard score for clinical evaluation and diagnosis.

The biggest drawbacks of this method are that early sample collection takes a long time, costs are high, and a great deal of manpower is required; and if the initial sites of the gene Panel are designed incorrectly, product development is exposed to great risk.

(2) Optimizing a Panel prediction algorithm based on public data. The specific steps are as follows:

a: collecting relevant omics sequencing data from a public database, and capturing from the collected data the regions corresponding to the genome regions covered by the Panel, so as to simulate the sequencing data of the detection Panel; steps b and c are the same as steps b and c of the first method.

Although this method greatly reduces the cost of early preparation for Panel optimization, the technical bias of the Panel itself means that the regions actually captured, the detection depth of different regions and so on can differ greatly from the sequencing data in existing public databases. Simply capturing the corresponding regions for subsequent simulation analysis therefore has limited effect, and may even give results opposite to those of the actual detection and analysis process. As a result, the method has a limited range of application and is difficult to popularize on a large scale.

Therefore, a new index analysis algorithm based on gene detection Panel data urgently needs to be developed. The invention is mainly aimed at the application and development of clinical diagnosis indexes for gene detection Panels.

Disclosure of Invention

Purpose of the invention: the invention aims to provide a method for developing a clinical diagnosis index algorithm for a gene detection Panel.

Technical scheme: the invention relates to a method for developing a clinical diagnosis index algorithm for a gene detection Panel, comprising two stages: constructing a virtual gene detection Panel, and developing a clinical index analysis algorithm for the virtual gene detection Panel data.

First, the virtual gene detection Panel is constructed as follows:

(1) providing information on all detection sites involved in the designed gene detection Panel;

(2) filtering whole-genome or whole-exome sequencing data;

(3) simulating, based on a set of sequencing-related parameters, the retained sequencing data that cover the detection sites;

(4) organizing and storing the simulated data passed on from step (3) as virtual gene detection Panel data.

Second, the clinical index analysis algorithm for the virtual gene detection Panel data is developed as follows:

(5) analyzing the whole-genome or whole-exome sequencing data input in step (2) with an existing index analysis algorithm;

(6) analyzing the virtual gene detection Panel data provided in step (4) with an existing index analysis algorithm;

(7) integrating the results of steps (5) and (6): the result for each sample in step (5) is matched to the corresponding sample in step (6) and marked as the expected result for that sample; model training is then performed on the integrated results with a suitable machine learning algorithm;

(8) evaluating the performance of the various calculation models and selecting an optimal scheme.
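The sample matching in step (7) can be illustrated with a minimal sketch. It assumes that the scores from steps (5) and (6) are held in dictionaries keyed by sample ID; the function name `pair_scores` and the sample IDs are hypothetical and do not come from the patent.

```python
def pair_scores(reference_scores, panel_scores):
    """Match each virtual-Panel score with the reference score of the same sample.

    Both arguments are assumed to be {sample_id: score} dictionaries.
    Returns (sample_id, panel_score, expected_score) tuples, where the
    expected score is the one calculated from the full sequencing data.
    """
    shared_ids = sorted(set(reference_scores) & set(panel_scores))
    return [(sid, panel_scores[sid], reference_scores[sid]) for sid in shared_ids]


# Illustrative values only.
pairs = pair_scores({"s1": 10.0, "s2": 4.0}, {"s1": 8.5, "s2": 3.0})
```

Samples present in only one of the two result sets are dropped here; whether to drop or flag them is a design choice the text does not address.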

Further, in step (1), the information provided includes, but is not limited to, the position information of the locus on the genome and the sequence information of the locus.

Further, in step (2), filtering the whole-genome or whole-exome sequencing data specifically means: extracting sequencing data based on the detection site information provided in step (1), and retaining only the sequencing data that cover the detection sites.

Further, in step (3), the parameters include, but are not limited to, the sequencing platform, the sequence length, the sequencing depth, and the GC content of the sequence;

the simulation process includes, but is not limited to, re-fitting the read distribution and degree of enrichment in the data passed to step (3) (the sequencing data within the detection sites) according to the parameter settings, so that the generated data are consistent in read distribution and degree of enrichment with the sequencing data of a gene detection Panel obtained under real conditions.

Further, in step (6), the results produced by the index analysis algorithm are divided into two groups, a training set and a test set; the samples analyzed in step (6) are the same as the samples analyzed in step (5);

the training set and the test set are formed by randomly splitting the existing data in a 7:3 ratio: 70% of the sample data are used as the training set for training the model, and the remaining 30% are used as the test set for the final evaluation of the model's prediction performance.
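The random 7:3 grouping described above could be done as in the following sketch. The function name `split_samples`, the fixed seed and the sample IDs are illustrative assumptions; the patent only specifies a random split in the stated ratio.

```python
import random


def split_samples(sample_ids, train_fraction=0.7, seed=0):
    """Randomly split sample IDs into a training set and a test set."""
    ids = list(sample_ids)
    rng = random.Random(seed)  # fixed seed so that the split is reproducible
    rng.shuffle(ids)
    n_train = round(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]


# 100 hypothetical samples give 70 training and 30 test samples.
train_ids, test_ids = split_samples([f"sample_{i}" for i in range(100)])
```

Splitting by sample ID, rather than by individual score, keeps the reference score and the virtual Panel score of a sample in the same group.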

Notes on terms:

Gene detection Panel: a test that detects not just one site or one gene but a plurality of sites and genes simultaneously; these sites and genes are selected and combined according to a standard to form a detection set, and this set of gene loci is called a gene detection Panel.

Whole-genome sequencing: all the DNA in the cell nucleus is collectively called the genome; high-throughput sequencing of the genome is whole-genome sequencing.

Whole-exome sequencing: the portion of DNA in the cell that directs the encoding of proteins is called an "exon"; all DNA fragments with this function are collectively called the exome; high-throughput sequencing of the exome is whole-exome sequencing.

Sequencing depth: the ratio of the total number of bases obtained by sequencing to the size of the genome; it is one of the indexes for evaluating sequencing quantity.

TMB: tumor mutational burden; defined as the total number of somatic coding errors, base substitutions, and insertion or deletion errors detected per million bases. TMB is a recent marker for evaluating the therapeutic effect of PD-1 antibodies, and its value has been demonstrated in the treatment of a variety of tumors.

MSI: microsatellite instability. Microsatellites are short tandem repeat DNA sequences in the genome, generally with a repeat unit of 1-6 nucleotides; they are polymorphic in the population because the number of copies of the core repeat unit varies. MSI arises from a functional defect in DNA mismatch repair in tumor tissue; MSI accompanied by DNA mismatch repair deficiency is a clinically important tumor marker.

Read: a sequence fragment obtained by sequencing.

Omics: mainly includes genomics, proteomics, metabolomics, transcriptomics, lipidomics, immunomics, glycomics, radiomics, ultrasomics and the like.

Advantageous effects: compared with the prior art, the invention is based on whole-genome and whole-exome sequencing data from public databases (or accumulated by the developers), extracts site data from the sequencing data using the site information of each gene in the gene detection Panel to construct virtual gene detection Panel data, and carries out algorithm development on the virtual Panel data, thereby improving the development quality and efficiency of gene detection Panel products and greatly reducing development cost and risk.

Drawings

FIG. 1 is a flow chart of the operation of the present invention;

FIG. 2 is a graphical representation of TMB values for two sets of data analyzed using a linear fitting algorithm in accordance with the present invention;

FIG. 3 is a schematic representation of the statistical signal values of the Beta mixture model for a single probe in the present invention;

FIG. 4 is a graph of G-CIMP values for two sets of data analyzed using a linear fitting algorithm in accordance with the present invention.

Detailed Description

The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.

The invention aims to realize the construction of virtual gene detection Panel and the development of a clinical index calculation method through the following scheme.

As shown in FIG. 1, the invention is divided into two stages and eight main steps.

The first stage is mainly used for constructing the virtual gene detection Panel, and comprises the following specific steps:

Step 1: providing information on all detection sites involved in the designed gene detection Panel;

the information includes, but is not limited to, the position of each locus on the genome and the sequence of the locus; this information is passed to step 2;

such information can be provided directly by the personnel who designed the gene detection Panel; alternatively, the specific locus information can be determined by sequence alignment analysis (for example with BWA or another alignment tool) of sequencing data from test samples run on the gene detection Panel;
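As an illustration of what the locus information from step 1 might look like in practice, the sketch below parses a tab-separated table with one locus per line. The four-column, BED-like layout (chromosome, start, end, gene), the 0-based half-open coordinates, and the names `Locus`, `parse_locus_table` and `panel_size_bp` are all assumptions made for this example; the patent does not prescribe a file format.

```python
from typing import NamedTuple


class Locus(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (BED-style; an assumption)
    end: int    # exclusive
    gene: str


def parse_locus_table(text):
    """Parse tab-separated lines of: chrom, start, end, gene."""
    loci = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        chrom, start, end, gene = line.split("\t")[:4]
        loci.append(Locus(chrom, int(start), int(end), gene))
    return loci


def panel_size_bp(loci):
    """Total number of bases covered by the Panel (assumes non-overlapping loci)."""
    return sum(locus.end - locus.start for locus in loci)


# Made-up coordinates and gene names, for illustration only.
example = "#chrom\tstart\tend\tgene\nchr1\t100\t200\tGENE_A\nchr2\t50\t80\tGENE_B\n"
loci = parse_locus_table(example)
```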

Step 2: filtering whole-genome or whole-exome (or other omics) sequencing data; specifically, sequencing data are extracted based on the detection site information provided in step 1, and only the sequencing data covering the detection sites are retained; the retained data are passed to step 3;

step 2 of the invention captures local locus data from whole-genome, whole-exome, whole-transcriptome, whole-genome methylation or other omics sequencing data (hereinafter referred to as the reference data set);

capturing the site data mainly means capturing, from the reference data set, the read data at the corresponding sites according to the site coordinate information obtained in step 1;

the reference data set can be downloaded from a platform such as a public database, or can be sequencing data accumulated by the developers;

data capture can be carried out by extracting the data at the specific sites from the reference data set with tools such as, but not limited to, BWA and samtools;
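The capture logic of step 2 amounts to keeping only the reads that overlap a Panel locus. The sketch below shows that logic on in-memory tuples; it is not a substitute for BWA or samtools on real alignment files. The tuple layout `(chrom, start, end)` with half-open coordinates and the function names are assumptions made for the example.

```python
def overlaps(read_start, read_end, locus_start, locus_end):
    """Half-open intervals overlap when each one starts before the other ends."""
    return read_start < locus_end and locus_start < read_end


def filter_reads(reads, loci):
    """Keep reads that overlap at least one locus on the same chromosome.

    reads: iterable of tuples whose first three fields are (chrom, start, end)
    loci:  iterable of (chrom, start, end)
    """
    loci_by_chrom = {}
    for chrom, start, end in loci:
        loci_by_chrom.setdefault(chrom, []).append((start, end))
    kept = []
    for read in reads:
        chrom, start, end = read[:3]
        if any(overlaps(start, end, s, e) for s, e in loci_by_chrom.get(chrom, [])):
            kept.append(read)
    return kept
```

The linear scan over loci is adequate for a small Panel; for large site lists an interval index or sorted sweep would be the usual choice.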

Step 3: the data retained in step 2 are simulated based on a series of sequencing-related parameters including, but not limited to, the sequencing platform, the sequence length, the sequencing depth and the GC content of the sequence. The simulation process includes, but is not limited to, re-fitting the read distribution, degree of enrichment and the like in the data passed to step 3 according to the parameter settings, so that the generated data are consistent in read distribution, degree of enrichment and the like with the sequencing data of a gene detection Panel obtained under real conditions. The fitted data are then passed to step 4;

step 3 of the invention fits the distribution characteristics of the data captured from the reference data set to the data distribution characteristics of the gene detection Panel, mainly by one of the following two methods:

first, directly constructing a mathematical statistical model (such as a Poisson distribution model) from parameters provided by the developers, such as the sequencing platform, sequence length, sequencing depth and GC content of the sequence, and fitting the number of reads at each site in the captured data so that the read count distribution is consistent with the data distribution produced by a real gene detection Panel;

second, calculating information such as sequence length, sequencing depth and GC content of the sequence from the sequencing data of test samples run on the gene detection Panel, with tools such as BWA, samtools and flagstat, constructing a mathematical statistical model (such as a Poisson distribution model) from this parameter information, and fitting the number of reads at each site in the captured data so that the read count distribution is consistent with the data distribution produced by a real gene detection Panel;
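A minimal sketch of the Poisson option follows. It assumes that per-site read counts from a few real Panel test samples are available, estimates one Poisson rate per site as the mean observed count (the maximum-likelihood estimate for a Poisson rate), and draws a target read count per site for the virtual Panel. Platform, read length and GC content are not modelled here, and all names are hypothetical.

```python
import math
import random


def fit_poisson_rates(panel_counts):
    """panel_counts: {site: [read count in each real Panel test sample]}."""
    return {site: sum(counts) / len(counts) for site, counts in panel_counts.items()}


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) value with Knuth's multiplication method.

    Large rates are split into chunks of at most 30 so that exp(-chunk)
    does not underflow; a sum of independent Poisson draws is Poisson
    with the summed rate.
    """
    total = 0
    while lam > 0:
        chunk = min(lam, 30.0)
        limit = math.exp(-chunk)
        k, p = 0, rng.random()
        while p > limit:
            k += 1
            p *= rng.random()
        total += k
        lam -= chunk
    return total


def simulate_counts(rates, rng):
    """Draw a target read count for every site of the virtual Panel."""
    return {site: sample_poisson(lam, rng) for site, lam in rates.items()}


# Illustrative counts only.
rates = fit_poisson_rates({"site_1": [10, 20, 30], "site_2": [0, 0]})
targets = simulate_counts(rates, random.Random(0))
```

The simulated targets would then drive how many captured reads are kept or added at each site.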

Step 4: the data passed on from step 3 are organized and stored as virtual gene detection Panel data;

step 4 of the invention stores the fitted data in a format such as FASTQ or BAM and designates them as virtual gene detection Panel data for subsequent analysis.
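For the FASTQ option, each read is stored as four lines: an `@`-prefixed identifier, the sequence, a `+` separator, and a quality string of the same length as the sequence. The sketch below assumes reads are held as `(read_id, sequence, quality)` tuples; the function names and the output path are hypothetical.

```python
def format_fastq(reads):
    """Format (read_id, sequence, quality_string) tuples as FASTQ text."""
    records = []
    for read_id, seq, qual in reads:
        if len(seq) != len(qual):
            raise ValueError(f"sequence and quality lengths differ for {read_id}")
        records.append(f"@{read_id}\n{seq}\n+\n{qual}\n")
    return "".join(records)


def write_virtual_panel(path, reads):
    """Store the simulated reads as virtual gene detection Panel data."""
    with open(path, "w") as handle:
        handle.write(format_fastq(reads))
```

For example, `write_virtual_panel("virtual_panel.fastq", reads)` would write one four-line record per simulated read.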

The second stage is mainly used for developing a clinical index analysis algorithm for the virtual gene detection Panel data, and comprises the following specific steps:

Step 5: analyzing the whole-genome or whole-exome (or other omics) sequencing data input in step 2 with an existing index analysis algorithm (including but not limited to TMB, MSI and the like); because the standard calculation methods for most clinical indexes are built on whole-genome or whole-exome sequencing data, the result of this step is used as the gold standard for the algorithm training in step 7;

step 5 of the invention uses index analysis algorithms widely used in the industry to calculate the corresponding index score for each sample in the reference data set, and the calculated index score is used as the gold standard;

the indexes include, but are not limited to, MSI, HRD and TMB;
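As a concrete instance of an index score, TMB as defined in this document is a mutation count normalized per million bases. The sketch below shows only that normalization; which somatic mutations are counted, and how they are called, depends on the specific TMB method and is outside this example. The function name and the numbers are illustrative.

```python
def tumor_mutation_burden(n_somatic_mutations, covered_bases):
    """Return mutations per million covered bases (mutations/Mb)."""
    if covered_bases <= 0:
        raise ValueError("covered_bases must be positive")
    return n_somatic_mutations / (covered_bases / 1_000_000)


# Hypothetical sample: 300 mutations over 30 Mb of exome gives 10 mutations/Mb.
exome_tmb = tumor_mutation_burden(300, 30_000_000)
```

Because a Panel covers far fewer bases than an exome, the same formula applied to Panel data rests on a much smaller mutation count, which is one reason the Panel score needs the calibration described in step 7.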

Step 6: analyzing the virtual gene detection Panel data provided in step 4 with an existing index analysis algorithm; the analysis results are divided into a training set and a test set; the samples analyzed in this step are the same as the samples analyzed in step 5; the training set and the test set are formed by randomly splitting the existing data in a 7:3 ratio, with 70% of the sample data used as the training set for training the model and the remaining 30% used as the test set for the final evaluation of the model's prediction performance;

step 6 of the invention uses index analysis algorithms widely used in the industry to calculate the corresponding index score for each sample in the virtual gene detection Panel data;

Step 7: integrating the results of steps 5 and 6; the result for each sample in step 5 is matched to the corresponding sample in step 6 and marked as the expected result for that sample; model training is performed on the integrated results with a suitable machine learning algorithm (including but not limited to support vector machines, deep learning algorithms and the like);

step 7 of the invention constructs a prediction model from the index scores calculated in steps 5 and 6, and the specific steps are as follows:

first, the result for each sample in step 5 is matched to the corresponding sample in step 6 and marked as the expected result for that sample, and the paired sample results are divided into a training set and a test set in a ratio of 1:1 (or 7:3 or the like);

second, on the training set, a model is trained with one of various machine learning algorithms (such as linear fitting), so that the score the model calculates from the virtual gene detection Panel data approximates the score calculated for the corresponding sample in the reference data set; the prediction performance of the model is then evaluated on the test set;
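For the linear fitting case, training reduces to an ordinary least-squares line from the virtual Panel score (input) to the reference score (expected output). The following sketch uses the closed-form solution; the function names and the example values are hypothetical.

```python
def fit_line(panel_scores, reference_scores):
    """Least-squares fit of reference = slope * panel + intercept."""
    n = len(panel_scores)
    mean_x = sum(panel_scores) / n
    mean_y = sum(reference_scores) / n
    sxx = sum((x - mean_x) ** 2 for x in panel_scores)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(panel_scores, reference_scores))
    if sxx == 0:
        raise ValueError("panel scores are all identical; slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, panel_scores):
    """Apply a fitted (slope, intercept) model to new Panel scores."""
    slope, intercept = model
    return [slope * x + intercept for x in panel_scores]


# Illustrative training pairs lying exactly on reference = 2 * panel + 1.
model = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```

The fitted model would be applied to the test-set Panel scores with `predict`, and the predictions compared with the test-set reference scores.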

Step 8: evaluating the performance of the various calculation models and selecting an optimal scheme;

step 8 of the invention compares the prediction performance of the models from step 7 and selects the best one as the index calculation method for the virtual gene detection Panel.
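The comparison in step 8 can be sketched as scoring every candidate model on the test set and keeping the one with the lowest error. The patent does not name an evaluation metric; root-mean-square error is used here as an assumption, and the candidate models below are hypothetical.

```python
import math


def rmse(predicted, expected):
    """Root-mean-square error between predicted and expected scores."""
    return math.sqrt(
        sum((p - e) ** 2 for p, e in zip(predicted, expected)) / len(expected)
    )


def select_best_model(models, panel_scores, reference_scores):
    """Return the name of the lowest-error model and all test-set errors.

    models: {name: callable mapping one Panel score to a predicted reference score}
    """
    errors = {
        name: rmse([model(s) for s in panel_scores], reference_scores)
        for name, model in models.items()
    }
    best_name = min(errors, key=errors.get)
    return best_name, errors


# Two hypothetical candidates: no correction, and a fitted linear correction.
candidates = {
    "uncorrected": lambda s: s,
    "linear": lambda s: 2.0 * s + 1.0,
}
best, test_errors = select_best_model(candidates, [1, 2, 3], [3, 5, 7])
```

Including the uncorrected Panel score as a baseline shows how much each trained model actually improves on the raw calculation.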

Example 1:

Constructing a TMB prediction algorithm for a lung cancer gene detection Panel:

TMB is the tumor mutational burden, representing the density of non-synonymous mutations in protein coding regions; in some cancer types, patients with high TMB may benefit from immunotherapy;

1. downloading exome sequencing data for 100 lung cancer cases from the GDC website; at the same time, downloading gene detection Panel data for the Panel design of the commercial MSK-IMPACT kit, to serve as the simulation target;

2. extracting the corresponding reads from the exome sequencing data according to the site information of the gene detection Panel;

3. constructing a Poisson distribution model based on the number of reads at each site in the gene detection Panel sequencing data, and recording the parameters of the model;

4. adding and deleting reads among those extracted from the exome sequencing data according to the Poisson model parameters constructed in step 3, so that the relative distribution of read counts across sites in the exome-derived data is consistent with the relative distribution of read counts across sites on the gene detection Panel;

5. applying a conventional TMB calculation method to both the exome sequencing data and the read data extracted from the exome data, to obtain TMB scores for the two sets of data;

6. analyzing the two sets of data with a linear fitting model and constructing a prediction model, so that from the TMB score calculated on the extracted read data the model predicts a result close to the TMB score calculated directly on the exome data; the result is shown in FIG. 2.
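The read adjustment in item 4 of this example can be sketched as resampling the extracted reads so that each site receives a target share of the total. In this sketch, reads are deleted by random subsampling and added by duplicating randomly chosen reads of the same site; duplication is an assumption, since the patent does not say how reads are added. The function name, the data layout and the values are hypothetical.

```python
import random


def resample_reads(reads_by_site, target_fraction, total_reads, seed=0):
    """Adjust per-site read counts to match a target relative distribution.

    reads_by_site:   {site: [read, ...]} extracted from the exome data
    target_fraction: {site: share of all Panel reads expected at that site}
    total_reads:     total number of reads wanted in the virtual Panel data
    """
    rng = random.Random(seed)
    adjusted = {}
    for site, reads in reads_by_site.items():
        target = round(total_reads * target_fraction.get(site, 0.0))
        if not reads or target == 0:
            adjusted[site] = []
        elif target <= len(reads):
            adjusted[site] = rng.sample(reads, target)  # delete surplus reads
        else:
            extra = rng.choices(reads, k=target - len(reads))  # add by duplication
            adjusted[site] = list(reads) + extra
    return adjusted


# Illustrative input: site "a" is over-represented, site "b" under-represented.
balanced = resample_reads(
    {"a": ["r1", "r2", "r3", "r4"], "b": ["r5"]},
    {"a": 0.5, "b": 0.5},
    total_reads=4,
)
```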

Example 2:

Constructing a G-CIMP prediction algorithm for a brain tumor DNA methylation Panel:

G-CIMP is an epigenetic feature of glioma, meaning that a large number of CpG islands in the glioma are methylated; patients carrying this feature generally have a better prognosis;

1. downloading Illumina 450K DNA methylation chip data for 100 brain cancer cases from the GDC website; at the same time, downloading Illumina 27K DNA methylation chip data for 10 cases, to serve as the simulation target;

2. extracting the corresponding data from the Illumina 450K DNA methylation data according to the site information of the Illumina 27K DNA methylation Panel;

3. constructing a Beta mixture model (see FIG. 3) based on the signal value at each site in the Illumina 27K DNA methylation data, and recording the parameters of the model;

4. adjusting the data extracted from the Illumina 450K DNA methylation data upward or downward according to the Beta mixture model parameters constructed in step 3, so that the relative distribution of signal values across sites is consistent with that of the Illumina 27K DNA methylation data;

5. applying a conventional G-CIMP calculation method to both the Illumina 450K DNA methylation data and the data extracted from it, to obtain G-CIMP scores for the two sets of data;

6. analyzing the two sets of data with a linear fitting model to construct a prediction model, so that from the score calculated on the data extracted from the Illumina 450K DNA methylation data the model predicts a result close to the G-CIMP score calculated directly on the Illumina 450K DNA methylation data (see FIG. 4).
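Fitting a full Beta mixture model is beyond a short example, so the sketch below deliberately substitutes a simpler technique: a single Beta distribution fitted to the signal values of one probe by the method of moments. It assumes methylation signal values lie strictly between 0 and 1 on average; the function name and the simulated values are hypothetical.

```python
import random


def fit_beta_moments(values):
    """Method-of-moments estimate of (alpha, beta) for one Beta component.

    This is a single-component simplification, not a mixture model.
    """
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    if not 0 < mean < 1 or var <= 0 or var >= mean * (1 - mean):
        raise ValueError("values are not compatible with a Beta distribution")
    common = mean * (1 - mean) / var - 1
    return mean * common, (1 - mean) * common


# Simulated signal values for one hypothetical probe.
rng = random.Random(0)
signals = [rng.betavariate(2.0, 5.0) for _ in range(5000)]
alpha, beta = fit_beta_moments(signals)

# New signal values with the fitted distribution can then be drawn.
simulated = [rng.betavariate(alpha, beta) for _ in range(10)]
```

A mixture version would fit several such components per probe, typically with an expectation-maximization procedure, and record the component weights alongside the parameters.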

The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiments; all technical solutions that follow the idea of the present invention fall within its protection scope. It should be noted that those skilled in the art may make modifications and refinements without departing from the principle of the invention, and such modifications and refinements also fall within the protection scope of the invention.

Claims

1. A method for developing a clinical diagnosis index algorithm for a gene detection Panel, characterized by comprising two stages: constructing a virtual gene detection Panel, and developing a clinical index analysis algorithm for the virtual gene detection Panel data;

(2) filtering the sequencing data of the whole genome or the whole exome;

(4) the data passing through the simulation is sorted and stored, and is used as virtual gene detection Panel data;

(5) analyzing the filtered whole genome or whole exome sequencing data by adopting an existing index analysis algorithm;

(6) analyzing the provided virtual gene detection Panel data by adopting an existing index analysis algorithm;

(7) integrating the analysis results of steps (5) and (6): the result for each sample in step (5) is matched to the corresponding sample in step (6) and marked as the expected result for that sample;

2. The method for gene detection Panel clinical diagnostic indicator algorithm development as claimed in claim 1, wherein in step (1), the provided information includes, but is not limited to, position information of the locus on the genome and sequence information of the locus.

3. The method for gene detection Panel clinical diagnostic index algorithm development according to claim 1, wherein in step (2), the filtering of whole genome or whole exome sequencing data specifically means: extracting sequencing data based on the detection site information provided in step (1), and only preserving the sequencing data contained in the detection site.

4. The method for gene detection Panel clinical diagnostic indicator algorithm development as claimed in claim 1, wherein in step (3), the simulation is based on a series of sequencing-related parameters including but not limited to the sequencing platform, the sequence length, the sequencing depth and the GC content of the sequence;

the simulation process includes but is not limited to fitting read distribution and enrichment degree in sequencing data in the detection site again according to parameter setting, so that the generated data is consistent with the sequencing data of the gene detection Panel obtained under the real condition in the read distribution and enrichment degree.

5. The method for gene detection Panel clinical diagnosis index algorithm development according to claim 1, characterized in that, in step (6), the analysis results analyzed by the index analysis algorithm are divided into two groups, namely a training set and a test set;