Method for establishing a baseline and a model for detecting microsatellite instability, and application thereof
Technical Field
The invention relates to the field of gene sequencing data analysis, and in particular to a method for establishing a baseline and a model for detecting microsatellite instability, and to applications of the method.
Background
Microsatellite instability (MSI) refers to a decrease or increase in the number of microsatellite repeat units, which gives rise to new alleles. Numerous studies have shown that microsatellite instability is caused by defects in mismatch repair genes and is closely related to tumorigenesis. Microsatellite instability is used clinically as an important molecular marker for prognosis and for formulating adjuvant treatment regimens in colorectal cancer and other solid tumors, and is also used to assist in Lynch syndrome screening. However, there is no gold standard for determining microsatellite instability from NGS data.
Currently, as shown in FIG. 1, most NGS-based approaches examine microsatellite loci recommended by NCCN or similar sources and then classify the sample using 20% as the cut-off, that is, the sample is judged to be microsatellite unstable when 20% or more of the microsatellite loci are unstable. However, existing methods for detecting microsatellite instability still suffer from low sensitivity.
Disclosure of Invention
The main object of the invention is to provide a method for establishing a baseline and a model for detecting microsatellite instability, and applications thereof, so as to solve the prior-art problem of low sensitivity when detecting microsatellite instability from sequencing data.
To achieve the above object, according to one aspect of the present invention, there is provided a method of establishing a baseline for detecting microsatellite instability, the method including: searching the human reference genome for all available microsatellite loci within the region covered by the sequencing data of the sample to be tested; using the sequencing data of a plurality of control blood cell samples, calculating the average coverage depth baseline of each microsatellite locus in the control blood cell samples, and retaining the microsatellite loci whose average coverage depth baseline meets a depth threshold as candidate microsatellite loci; and, using the candidate microsatellite loci and the average coverage depth baseline, calculating the average coverage depth and the number of peaks of each candidate microsatellite locus in the sequencing data of each of a plurality of positive samples and a plurality of negative samples, selecting as detection microsatellite loci the candidate microsatellite loci whose number of peaks differs significantly between the positive samples and the negative samples, and forming the baseline for detecting microsatellite instability from the average coverage depth and the number of peaks of each detection microsatellite locus in the positive samples and the negative samples.
Further, the minimum length of all available microsatellite loci is 10 bp.
Further, the depth threshold is greater than or equal to 30.
According to a second aspect of the present invention, there is provided a method of establishing a model for detecting microsatellite instability, the method comprising: establishing a baseline for detecting microsatellite instability by any one of the methods described above; and modeling the average coverage depth and the number of peaks in the plurality of positive samples and the plurality of negative samples in the baseline by a machine learning algorithm to obtain a model for detecting microsatellite instability.
Further, the machine learning algorithm is a random forest algorithm.
According to a third aspect of the present invention there is provided a model for detecting microsatellite instability, the model being constructed using any of the methods described above.
According to a fourth aspect of the present invention, there is provided a method of detecting microsatellite instability, the method comprising: for the detection microsatellite loci determined by any one of the methods described above, determining the number of peaks of each detection microsatellite locus in the sequencing data of the sample to be tested; and analyzing the number of peaks of each detection microsatellite locus in the sequencing data of the sample to be tested with any one of the models for detecting microsatellite instability described above, thereby obtaining the microsatellite instability status of the sample to be tested.
According to a fifth aspect of the present invention, there is provided an apparatus for establishing a baseline for detecting microsatellite instability, the apparatus comprising a microsatellite locus searching module, a candidate microsatellite locus screening module and a baseline establishing module, wherein the microsatellite locus searching module is used for comparing the sequencing data of a sample to be tested with the human reference genome sequence to obtain all available microsatellite loci, the sequencing data of the sample to be tested comprising sequencing data of all known microsatellite loci; the candidate microsatellite locus screening module is used for calculating, from the sequencing data of a plurality of control blood cell samples, the average coverage depth baseline of each microsatellite locus in the control blood cell samples and retaining the microsatellite loci whose average coverage depth baseline meets a depth threshold as candidate microsatellite loci; and the baseline establishing module is used for calculating, with the candidate microsatellite loci and the average coverage depth baseline, the average coverage depth and the number of peaks of each candidate microsatellite locus in the sequencing data of a plurality of positive samples and a plurality of negative samples, selecting as detection microsatellite loci the candidate microsatellite loci whose number of peaks differs significantly between the positive samples and the negative samples, and forming the baseline for detecting microsatellite instability from the average coverage depth and the number of peaks of each detection microsatellite locus in the positive samples and the negative samples.
According to a sixth aspect of the present invention, there is provided an apparatus for establishing a model for detecting microsatellite instability, the apparatus comprising, in addition to the microsatellite locus searching module, the candidate microsatellite locus screening module and the baseline establishing module of the above apparatus for establishing a baseline for detecting microsatellite instability, a machine learning modeling module, wherein the machine learning modeling module is used for modeling the average coverage depth and the number of peaks in the plurality of positive samples and the plurality of negative samples in the baseline by a machine learning algorithm to obtain a model for detecting microsatellite instability.
According to a seventh aspect of the present invention, there is provided an apparatus for detecting microsatellite instability, the apparatus comprising: the microsatellite locus searching module, the candidate microsatellite locus screening module and the baseline establishing module of the above apparatus for establishing a baseline for detecting microsatellite instability; the machine learning modeling module of the above apparatus for establishing a model for detecting microsatellite instability; a detection module; and a prediction module, wherein the detection module is used for determining the number of peaks of each detection microsatellite locus in the sequencing data of a sample to be tested, and the prediction module is used for analyzing the number of peaks of each detection microsatellite locus in the sequencing data of the sample to be tested with the model for detecting microsatellite instability, so as to obtain the microsatellite instability status of the sample to be tested.
According to an eighth aspect of the present invention, there is provided a storage medium comprising a stored program, wherein the program, when executed, controls an apparatus on which the storage medium is located to perform the above-described method of establishing a baseline for detecting microsatellite instability, the above-described method of establishing a model for detecting microsatellite instability, or the above-described method of detecting microsatellite instability.
According to a ninth aspect of the present invention, there is provided a processor for running a program, wherein the program, when run, performs the above-described method of establishing a baseline for detecting microsatellite instability, the above-described method of establishing a model for detecting microsatellite instability, or the above-described method of detecting microsatellite instability.
By applying the technical scheme of the invention, all available microsatellite loci in the sequencing data are found by comparing the sequencing data with the human reference genome sequence, so that the microsatellite loci contained in the NGS data are fully utilized. Sequencing data of control blood cell samples are then used to screen out the microsatellite loci with higher capture efficiency for subsequent analysis. Next, using sequencing data of positive samples and negative samples of known microsatellite status, the loci whose number of peaks differs significantly between the two groups are identified among the high-capture-efficiency loci, and the average coverage depth and the number of peaks of these loci form the baseline for subsequently detecting microsatellite instability in a sample to be tested. Compared with the methods commonly used on the market, this method uses more microsatellite locus information to establish the baseline, and more microsatellite loci are examined when the microsatellite status of a sample is subsequently detected or judged, which improves both the utilization of the sequencing data and the detection sensitivity.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 shows a schematic flow diagram of a method for detecting microsatellite instability according to the prior art;
FIG. 2 shows a schematic flow diagram of a method for establishing a baseline for detecting microsatellite instability according to a preferred embodiment of the present application;
FIG. 3 shows a detailed flow diagram of a method for detecting microsatellite instability according to a preferred embodiment of the present application;
FIG. 4 shows a cluster analysis plot of the microsatellite loci whose number of peaks differs significantly between the microsatellite-positive and microsatellite-negative samples in Example 5; and
FIG. 5 shows the results of ROC analysis of the peak matrices of the microsatellite-positive and microsatellite-negative samples in Example 5.
Detailed Description
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present invention will be described in detail with reference to examples.
Microsatellites: microsatellites are short tandem repeats distributed throughout the human genome, in which a repeat unit of one, two or more nucleotides is repeated 10-50 times.
Microsatellite instability: a change in the length of a microsatellite caused by the insertion or deletion of repeat units.
The BED file format provides a flexible way to define the data lines that describe annotation data. Each BED line has 3 required fields and 9 additional optional fields. The three required fields are:
1, chrom, the name of the chromosome or scaffold, such as chr3 or chrY.
2, chromStart, the starting position on the chromosome or scaffold; the first base of a chromosome is numbered 0.
3, chromEnd, the ending position on the chromosome or scaffold.
The 9 additional optional BED fields are:
4, name, the name of the BED line;
5, score, a value from 0 to 1000; if the useScore attribute of the track line is set to 1 for the annotation, the score determines the gray level in which the feature is displayed, and the larger the score, the darker the gray;
6, strand, the strand, defined as + or -;
7, thickStart, the position at which thick drawing of the feature begins;
8, thickEnd, the position at which thick drawing of the feature ends;
9, itemRgb, an RGB value of the form R,G,B;
10, blockCount, the number of blocks (exons) in the BED line;
11, blockSizes, a comma-separated list of block sizes, with the number of items corresponding to blockCount;
12, blockStarts, a comma-separated list of block start positions.
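As an illustration of the field layout described above, the following minimal Python sketch parses the three required fields and the optional name field of a tab-separated BED line. The names `BedRecord` and `parse_bed_line`, and the example coordinates, are hypothetical and are not part of the described method.

```python
from typing import NamedTuple, Optional


class BedRecord(NamedTuple):
    chrom: str                  # chromosome or scaffold name
    chrom_start: int            # 0-based start position
    chrom_end: int              # end position (not included in the feature)
    name: Optional[str] = None  # optional 4th field


def parse_bed_line(line: str) -> BedRecord:
    """Parse the required BED fields plus the optional name field."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("a BED line needs chrom, chromStart and chromEnd")
    name = fields[3] if len(fields) > 3 else None
    return BedRecord(fields[0], int(fields[1]), int(fields[2]), name)


# Illustrative line only; the coordinates are made up.
record = parse_bed_line("chr3\t100\t200\tregion1")
length = record.chrom_end - record.chrom_start
```

Because chromStart is 0-based and chromEnd is not included in the feature, the feature length is simply chromEnd minus chromStart.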
Example 1
In a preferred embodiment of the present application, a method for establishing a baseline for detecting microsatellite instability is provided. FIG. 2 is a flow chart of the method for establishing a baseline for detecting microsatellite instability according to an embodiment of the present invention. As shown in FIG. 2, the method includes:
Step S101, searching the human reference genome for all available microsatellite loci within the region covered by the sequencing data of the sample to be tested;
Step S102, using the sequencing data of a plurality of control blood cell samples, calculating the average coverage depth baseline of each microsatellite locus in the control blood cell samples, and retaining the microsatellite loci whose average coverage depth baseline meets a depth threshold as candidate microsatellite loci;
Step S103, using the candidate microsatellite loci and the average coverage depth baseline, calculating the average coverage depth and the number of peaks of each candidate microsatellite locus in the sequencing data of each of a plurality of positive samples and a plurality of negative samples, selecting as detection microsatellite loci the candidate microsatellite loci whose number of peaks differs significantly between the positive samples and the negative samples, and forming the baseline for detecting microsatellite instability from the average coverage depth and the number of peaks of each detection microsatellite locus in the positive samples and the negative samples.
In this method, all available microsatellite loci in the sequencing data are found by comparing the sequencing data with the human reference genome sequence, so that the microsatellite loci contained in the NGS data are fully utilized. Sequencing data of control blood cell samples are then used to screen out the microsatellite loci with higher capture efficiency for subsequent analysis. Next, using sequencing data of positive samples and negative samples of known microsatellite status, the loci whose number of peaks differs significantly between the two groups are identified among the selected high-capture-efficiency loci, and the average coverage depth and the number of peaks of these loci form the baseline for subsequently detecting microsatellite instability in a sample to be tested. Compared with the methods commonly used on the market, this method uses more microsatellite locus information to establish the baseline, and more microsatellite loci are examined when the microsatellite status of a sample is subsequently detected or judged, which improves both the utilization of the sequencing data and the detection sensitivity.
In the above embodiment, the sequencing data of the sample to be tested is preferably sequencing data of tumor tissue. The control blood cell sample and the tumor tissue are derived from the same subject. More preferably, a capture panel is designed to cover all reported microsatellite loci and all loci known to be associated with tumor mutations, a capture library is constructed with it, and the library is sequenced to obtain the sequencing data. The microsatellite loci used therefore include not only the commonly used loci reported by NCCN but also previously unreported loci identified by comparing the sequencing data with the reference genome sequence.
The control blood cell sample is the same as that used in conventional methods for detecting microsatellite instability, namely a blood cell sample from the same source as the tumor tissue to be tested. Here, the blood cell samples are used to calculate the average coverage depth of each microsatellite locus to be detected. A plurality of blood cell samples is normally used, preferably at least 5, 10, 15, 20, 25, 30, 40 or 50, or more than 50. In an alternative embodiment of the present application, the number of blood cell samples is 52.
The peaks of the above microsatellite loci with high capture efficiency are calculated in the microsatellite-positive samples and the microsatellite-negative samples, and the peaks are used to further screen out the loci that discriminate well between the two groups, that is, the loci whose number of peaks differs significantly between the microsatellite-positive samples and the microsatellite-negative samples. The significance of the difference is preferably assessed with a rank-sum test at p = 0.01.
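The locus screening described here can be sketched with a two-sided Wilcoxon rank-sum test. The version below is a minimal standard-library sketch that uses the normal approximation with a tie correction and no continuity correction; the text does not state which variant of the rank-sum test was actually used, so this choice, the function name `rank_sum_p_value` and the peak counts in the example are all assumptions for illustration.

```python
import math
from typing import Sequence


def rank_sum_p_value(pos: Sequence[float], neg: Sequence[float]) -> float:
    """Two-sided rank-sum p-value (normal approximation, tie-corrected)."""
    n1, n2 = len(pos), len(neg)
    n = n1 + n2
    # Pool both groups, remembering group membership (0 = positive group).
    pooled = sorted([(v, 0) for v in pos] + [(v, 1) for v in neg])
    rank_sum_pos = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        mid_rank = (i + 1 + j) / 2.0  # average of 1-based ranks i+1..j
        t = j - i                     # size of this group of tied values
        tie_term += t ** 3 - t
        rank_sum_pos += mid_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_pos - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return 1.0  # all values identical: no evidence of a difference
    z = (u - mean_u) / math.sqrt(var_u)
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical peak counts at one locus: 9 positive and 18 negative samples.
pos_peaks = [6, 7, 5, 8, 6, 7, 9, 6, 7]
neg_peaks = [2, 3, 2, 1, 2, 3, 2, 2, 1, 3, 2, 2, 3, 1, 2, 2, 3, 2]
keep_locus = rank_sum_p_value(pos_peaks, neg_peaks) < 0.01
```

A locus would be retained as a detection locus when its p-value falls below the chosen significance level, here 0.01.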
In the present application, the peak (or number of peaks) is a statistic describing the insertion and deletion state of a microsatellite locus. For a microsatellite locus in a tested sample, the distinct repeat lengths observed in the reads are counted, and a length is counted as valid only if it is supported by at least 3 reads. For example, suppose a microsatellite locus is 24A on the reference genome and several lengths are observed in a test sample: 15A (2 reads), 20A (10 reads), 21A (20 reads), 22A (40 reads) and 23A (100 reads). The 15A length is discarded first because it is supported by fewer than 3 reads; 4 valid lengths remain, so the peak is 4.
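The worked example above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `count_peaks` is hypothetical, and treating exactly 3 supporting reads as sufficient is an assumption, since the example only establishes that 2 reads are too few.

```python
from typing import Dict


def count_peaks(length_support: Dict[int, int], min_reads: int = 3) -> int:
    """Count the distinct repeat lengths supported by enough reads.

    length_support maps an observed repeat length to its number of
    supporting reads. min_reads=3 (inclusive) is an assumed cut-off.
    """
    return sum(1 for reads in length_support.values() if reads >= min_reads)


# The 24A locus from the text: observed length -> supporting reads.
observed = {15: 2, 20: 10, 21: 20, 22: 40, 23: 100}
peak = count_peaks(observed)  # 15A is dropped, 4 lengths remain
```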
In an optional embodiment, the step of comparing the sequencing data of the sample to be tested with the human reference genome sequence to obtain all available microsatellite loci comprises: using the msisensor software (v0.5) to search the human reference genome sequence (hg19) for all microsatellite loci within the range of the sequencing data file of the sample to be tested, with default software parameters except that the minimum length of the microsatellite loci is set to 10.
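To make the effect of the minimum-length setting concrete, the sketch below applies a length filter to a few rows copied from Table 1 of Example 5 (where a minimum length of 8 bp was used). It is not msisensor's implementation: the helper names are hypothetical, and treating start and end as inclusive coordinates and the cut-off as inclusive are assumptions, although the inclusive-coordinate reading is consistent with every row of Table 1 (span = unit length × repeat count).

```python
from typing import List, Tuple

# chr, start, end, repeat unit, repeat count, MSID
Locus = Tuple[str, int, int, str, int, str]

# Rows copied from Table 1 (Example 5).
TABLE1_ROWS: List[Locus] = [
    ("chr1", 8074168, 8074175, "AG", 4, "MS1"),
    ("chr1", 11182071, 11182082, "TCT", 4, "MS2"),
    ("chr1", 16255142, 16255153, "GA", 6, "MS4"),
    ("chr1", 27022940, 27022954, "CCG", 5, "MS7"),
]


def locus_length(locus: Locus) -> int:
    """Length in bp, treating start and end as inclusive (an assumption)."""
    _, start, end, _, _, _ = locus
    return end - start + 1


def filter_by_min_length(loci: List[Locus], min_length: int) -> List[Locus]:
    """Keep loci at least min_length bp long (inclusive cut-off assumed)."""
    return [locus for locus in loci if locus_length(locus) >= min_length]


# Each span equals repeat-unit length times repeat count.
consistent = all(locus_length(l) == len(l[3]) * l[4] for l in TABLE1_ROWS)
kept_ids = [l[5] for l in filter_by_min_length(TABLE1_ROWS, 10)]
```

With a minimum length of 10 bp, the 8 bp locus MS1 is excluded while the 12 bp and 15 bp loci are kept.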
Step S102 selects the microsatellite loci with high capture efficiency from all available microsatellite loci. In an alternative embodiment, sequencing data from a plurality of blood cell samples sequenced at the same time are used to calculate the average coverage depth of every available microsatellite locus, and the loci with an average coverage depth of more than 30 are selected as the loci with high capture efficiency.
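The depth-based screening of step S102 can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `select_candidate_loci` and the per-sample depth values are hypothetical, and the comparison is written as "greater than or equal to 30", which matches the stated depth threshold but differs at the boundary from the "more than 30" wording of this embodiment.

```python
from statistics import mean
from typing import Dict, List


def select_candidate_loci(
    depths: Dict[str, List[float]], depth_threshold: float = 30.0
) -> Dict[str, float]:
    """Return {locus: average depth} for loci meeting the depth threshold.

    depths maps a locus ID to its coverage depth in each control sample.
    """
    baseline = {locus: mean(values) for locus, values in depths.items() if values}
    return {locus: d for locus, d in baseline.items() if d >= depth_threshold}


# Hypothetical per-sample depths for two loci in three control samples.
control_depths = {"MS1": [310, 295, 302], "MS_low": [12, 20, 25]}
candidates = select_candidate_loci(control_depths)
```

The returned dictionary doubles as the average coverage depth baseline for the retained loci.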
Example 2
In a preferred embodiment, the present application also provides a method of establishing a model for detecting microsatellite instability, the method comprising: establishing a baseline for detecting microsatellite instability by any one of the methods described above; and modeling the average coverage depth and the number of peaks in the plurality of positive samples and the plurality of negative samples in the baseline by a machine learning algorithm to obtain a model for detecting microsatellite instability.
In this preferred embodiment, on the one hand, the detection baseline is established by making maximal use of the microsatellite locus information in the sequencing data; on the other hand, a prediction model for detecting microsatellite instability is built with a machine learning algorithm from the number of peaks of each detection microsatellite locus in the positive samples and in the negative samples of the baseline. When the model built by machine learning is used to predict the microsatellite status of a sample, the judgment is likewise made by machine learning, which gives higher detection sensitivity and specificity and avoids the excessive subjectivity of the existing approach of using a fixed 20% cut-off.
In the step of building the model by machine learning, a random forest algorithm (sklearn 0.20.0) is preferably used for modeling.
Example 3
In a preferred embodiment of the present application, a model for detecting microsatellite instability is also provided, the model being built by the above method of establishing a model for detecting microsatellite instability. Using the model built by the machine learning algorithm to analyze and judge the microsatellite status of a sample gives higher sensitivity and specificity.
Example 4
In a preferred embodiment of the present application, there is also provided a method of detecting microsatellite instability, the method including: for the detection microsatellite loci determined in the above method of establishing a baseline for detecting microsatellite instability, determining the number of peaks of each detection microsatellite locus in the sequencing data of a sample to be tested; and analyzing the number of peaks of each detection microsatellite locus in the sequencing data of the sample to be tested with the above model for detecting microsatellite instability, thereby obtaining the microsatellite instability status of the sample to be tested.
The model established by the machine learning algorithm is used for analyzing and judging the microsatellite state of the sample, and the method has higher sensitivity and specificity.
In a more specific embodiment, the step of detecting microsatellite instability comprises:
1) calculating the peaks of the microsatellite loci for the sample to be tested;
2) building a model from the peak matrices of the negative sample set and the positive sample set, and analyzing the calculated peaks of the final microsatellite loci with the random forest machine learning model (sklearn 0.20.0) to give the probability that the microsatellite status of the sample is positive, all model parameters being left at their defaults;
3) the sample is considered MSI-H when the predicted probability of being positive is greater than 0.6, MSI-L when it is greater than 0.4 and less than 0.6, and MSS when it is less than 0.4 (the thresholds of 0.4 and 0.6 are the values that best separate the positive and negative sample sets).
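The decision rule in step 3) can be written as a small function. This is a minimal sketch: the name `classify_msi` is hypothetical, and assigning a probability of exactly 0.4 or 0.6 to MSI-L is an assumption, because the text does not say how the boundary values are handled.

```python
def classify_msi(probability: float, low: float = 0.4, high: float = 0.6) -> str:
    """Map the predicted probability of being positive to an MSI status."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if probability > high:
        return "MSI-H"
    if probability < low:
        return "MSS"
    # Values in [low, high]; putting the exact boundaries here is an assumption.
    return "MSI-L"


status = classify_msi(0.9)
```

For a predicted probability of 0.9, as reported for the sample in Table 4 of Example 5, the rule returns MSI-H.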
By contrast, the existing scheme cannot flexibly handle samples near the critical value: the proportion of unstable loci among all valid detected loci is calculated for each sample, and the microsatellite instability of the sample is then judged from that proportion. Once the cut-off proportion is fixed at a specific value, samples close to the cut-off cannot be handled well.
In this preferred embodiment, the model is built directly on the peak matrices of the negative sample set and the positive sample set, and the learning ability of machine learning is used to predict new samples, which avoids the handling of a critical value and improves the accuracy of sample discrimination.
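For comparison, the fixed-proportion rule of the existing scheme can be sketched as below. The function name `msi_by_fraction` and the input flags are hypothetical; how each locus is judged unstable in the first place is outside this sketch.

```python
from typing import Sequence


def msi_by_fraction(unstable_flags: Sequence[bool], cutoff: float = 0.20) -> bool:
    """Call a sample unstable when the unstable-locus fraction reaches cutoff."""
    if not unstable_flags:
        raise ValueError("at least one valid locus is required")
    fraction = sum(unstable_flags) / len(unstable_flags)
    return fraction >= cutoff


# Two hypothetical samples on either side of the 20% cut-off.
at_cutoff = msi_by_fraction([True] + [False] * 4)         # 1 of 5 = 20%
below_cutoff = msi_by_fraction([True] * 19 + [False] * 81)  # 19 of 100 = 19%
```

The two samples differ by a single percentage point yet receive opposite calls, which illustrates the sensitivity to the chosen cut-off described above.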
Example 5
Objective: to determine the microsatellite status of a sample sequenced by NGS (panel sequencing, with hg19 as the reference genome).
Method: the specific detection process is shown in FIG. 3 and comprises the following steps.
1. The msisensor software (v0.5) was used to search the human reference genome for all microsatellite loci within the sample's sequencing panel file (i.e., the BED region file), with default software parameters except that the minimum length of the microsatellite loci was set to 8 bp. This step (i.e., step A) yields all available microsatellite loci; some of the results are shown in Table 1.
Table 1:
chr | start | end | MS | repeat | MSID
chr1 | 8074168 | 8074175 | AG | 4 | MS1
chr1 | 11182071 | 11182082 | TCT | 4 | MS2
chr1 | 16203144 | 16203155 | CAG | 4 | MS3
chr1 | 16255142 | 16255153 | GA | 6 | MS4
chr1 | 16256107 | 16256114 | AG | 4 | MS5
chr1 | 16262695 | 16262702 | CA | 4 | MS6
chr1 | 27022940 | 27022954 | CCG | 5 | MS7
chr1 | 27022977 | 27022988 | AGC | 4 | MS8
chr1 | 27023008 | 27023022 | GGC | 5 | MS9
… | … | … | … | … | …
Note: chr denotes the chromosome; start denotes the starting position of the microsatellite locus; end denotes the ending position of the microsatellite locus; MS denotes the repeat unit; repeat denotes the number of repetitions; and MSID denotes the identifier of the microsatellite locus.
2. 52 blood cell samples sequenced with the same panel were selected, the average coverage depth of all the microsatellite loci from step 1 was calculated, and the loci with high capture efficiency, i.e., with an average coverage depth of more than 30, were selected as the basic loci for further screening. This step (i.e., step B) mainly yields the baseline of average locus depth and the microsatellite loci with high capture efficiency, as shown in Table 2.
Table 2:
MSID | qcs | Average_Total_Reads | Count
MS1 | pass | 302.3529412 | 51
MS2 | pass | 236.2156863 | 51
MS3 | pass | 234.4705882 | 51
MS4 | pass | 200.8235294 | 51
MS5 | pass | 199.627451 | 51
MS6 | pass | 481.7254902 | 51
MS7 | pass | 131.3333333 | 51
MS8 | pass | 133.9803922 | 51
MS9 | pass | 125.96 | 50
… | … | … | …
Note: in Table 2, MSID is the identifier of the microsatellite locus; qcs is the quality control status of the supporting reads; Average_Total_Reads is the average number of reads covering the microsatellite locus (i.e., the average coverage depth); Count is the number of samples that pass quality control.
3. Using 9 positive samples (MSI-H, microsatellite instability-high) and 18 negative samples (MSS, microsatellite stable) of known microsatellite status, together with the baseline of average coverage depth of the high-capture-efficiency microsatellite loci obtained in step 2, the average depth and the number of peaks of each locus were calculated, and the loci that discriminate well between positive and negative samples were identified. These are the loci whose number of peaks differs significantly between the 9 positive samples and the 18 negative samples (rank-sum test, p = 0.01; 53 loci in total). This step (i.e., step C) yields the baseline of average coverage depth and the peak matrix for the microsatellite loci that discriminate well between positive and negative samples. The results are shown in Tables 3-1 and 3-2.
Table 3-1:
Table 3-2:
4. The peak matrices of the microsatellite loci with high capture efficiency and high discrimination between positive and negative samples were analyzed with the PCA algorithm; the result is shown in FIG. 4. As can be seen from FIG. 4 (PC1 is the first principal component and PC2 is the second principal component), the selected microsatellite loci differ markedly between the two groups of samples.
5. ROC analysis with cross-validation was performed on the peak matrices of the microsatellite loci with high capture efficiency and high discrimination between positive and negative samples; the results are shown in FIG. 5. As can be seen from FIG. 5, both the sensitivity and the specificity of the model constructed above are 100%.
6. For the sample to be tested, the peaks of the microsatellite loci with high capture efficiency and high discrimination between positive and negative samples were calculated, the model was then used for prediction, and the final result was reported (i.e., step D); the result is shown in Table 4.
Table 4:
Sample | Probability of being MSI-H
180504253T1 | 0.9
From this example it can be seen that: the method has 100% sensitivity and 100% specificity for 9 known MSI-H and 18 MSS samples.
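The sensitivity and specificity figures quoted above follow from the usual definitions, sketched below. The function name `sensitivity_specificity` is hypothetical, and the label vectors are constructed only to mirror the reported counts (9 MSI-H and 18 MSS samples, all classified correctly); they are not the actual per-sample results.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    truth: Sequence[bool], predicted: Sequence[bool]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity); True means microsatellite unstable."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present")
    return tp / (tp + fn), tn / (tn + fp)


# 9 positive and 18 negative samples, every call assumed correct.
truth = [True] * 9 + [False] * 18
predicted = list(truth)
sens, spec = sensitivity_specificity(truth, predicted)
```

A single missed positive sample would lower the sensitivity to 8/9 while leaving the specificity unchanged, which shows how coarse these estimates are with 27 samples.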
Through the above description of the embodiments, those skilled in the art can clearly understand that the methods according to the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and can of course also be implemented by hardware, although the former is the better implementation in many cases. Based on this understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.
Corresponding to the above methods, the present application further provides an apparatus for establishing a baseline for detecting microsatellite instability, an apparatus for establishing a model for detecting microsatellite instability, and an apparatus for detecting microsatellite instability, which are used to implement the above embodiments and preferred embodiments; what has already been described will not be repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatuses described in the embodiments below are preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
This is further illustrated below in connection with alternative embodiments.
Example 6
in this embodiment, there is also provided an apparatus for establishing a baseline for detecting microsatellite instability, the apparatus comprising: the system comprises a microsatellite locus searching module, a candidate microsatellite locus screening module and a baseline establishing module, wherein the microsatellite locus searching module is used for searching all available microsatellite loci in a region corresponding to sequencing data of a sample to be tested on a human reference genome; the candidate microsatellite locus screening module is used for counting the average coverage depth baseline of each microsatellite locus in each control blood cell sample by using the sequencing data of a plurality of control blood cell samples and reserving the microsatellite locus of which the average coverage depth baseline meets the depth threshold as a candidate microsatellite locus; and the baseline establishing module is used for calculating the average coverage depth and the number of peaks of each candidate microsatellite locus in the sequencing data of the multiple positive samples and the multiple negative samples by utilizing the candidate microsatellite loci and the average coverage depth baseline, finding out the candidate microsatellite loci with the number of peaks having significant difference in the multiple positive samples and the multiple negative samples as the detection microsatellite loci, and forming unstable baselines of the detection microsatellite by the average coverage depth and the number of peaks of each detection microsatellite locus in the multiple positive samples and the multiple negative samples.
The device uses the microsatellite locus searching module to compare the sequencing data with the human reference genome sequence and find all available microsatellite loci in the sequencing data, thereby making full use of the microsatellite loci contained in the NGS data. The candidate microsatellite locus screening module then uses the sequencing data of tumor control blood cell samples to screen out the microsatellite loci with higher capture efficiency for subsequent analysis. The baseline establishing module then uses the sequencing data of positive samples and negative samples with known microsatellite status to find, among the selected loci with high capture efficiency, those whose number of peaks differs significantly between the two groups of samples. The average coverage depth and the number of peaks of these loci form the baseline used in subsequent detection of microsatellite instability in a sample to be tested. Compared with devices currently common on the market, this device uses more microsatellite locus information to establish the baseline, so that more microsatellite loci are examined when the microsatellite status of a sample to be tested is subsequently detected or judged. This improves the utilization efficiency of the sequencing data and the detection sensitivity.
Optionally, the minimum length of each available microsatellite locus is 10 bp.
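The locus search with a 10 bp minimum length can be illustrated by the following minimal Python sketch. It is not the implementation of the searching module: the function name `find_microsatellites`, the restriction to motifs of 1 to 6 bp and the requirement of at least two repeats are assumptions made for illustration; only the 10 bp minimum length comes from the text.

```python
MIN_LENGTH = 10   # minimum locus length in bp, as stated in the text
MAX_MOTIF = 6     # assumption: motifs of 1-6 bp are treated as microsatellites


def find_microsatellites(sequence, min_length=MIN_LENGTH, max_motif=MAX_MOTIF):
    """Scan a reference sequence for tandem repeats.

    Returns a list of (start, end, motif, repeats) tuples with 0-based,
    half-open coordinates. Hypothetical helper, not taken from the source.
    """
    seq = sequence.upper()
    loci = []
    i = 0
    while i < len(seq):
        best = None
        for k in range(1, max_motif + 1):
            motif = seq[i:i + k]
            # Stop at the sequence end or when the motif contains a non-ACGT base.
            if len(motif) < k or set(motif) - set("ACGT"):
                break
            repeats = 1
            while seq[i + repeats * k:i + (repeats + 1) * k] == motif:
                repeats += 1
            span = repeats * k
            # Keep the longest span; on ties the shortest motif wins
            # because shorter motifs are tried first.
            if repeats >= 2 and span >= min_length:
                if best is None or span > best[1] - best[0]:
                    best = (i, i + span, motif, repeats)
        if best:
            loci.append(best)
            i = best[1]   # continue after the locus just found
        else:
            i += 1
    return loci


# Toy reference fragment: a 12 bp poly-A run and a (CA)6 repeat.
reference = "GTC" + "A" * 12 + "TGC" + "CA" * 6 + "GTAAAAT"
loci = find_microsatellites(reference)
```

In this toy fragment the trailing `AAAA` run is not reported because it is shorter than 10 bp.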
Optionally, the depth threshold is greater than or equal to 30.
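The screening and baseline steps of this embodiment can be sketched as follows. This is a hedged illustration under stated assumptions: the depth baseline is taken as the mean depth across control samples, and the "significant difference" in the number of peaks is assessed with an exact two-sided permutation test at an assumed level of 0.05. The text does not name a specific statistical test, and all function names and data are hypothetical.

```python
from itertools import combinations
from statistics import mean

DEPTH_THRESHOLD = 30  # from the text: depth threshold >= 30


def screen_candidate_loci(control_depths, depth_threshold=DEPTH_THRESHOLD):
    """control_depths: {locus: [mean depth in each control blood cell sample]}.

    Returns {locus: depth baseline} for loci meeting the threshold.
    Assumption: the baseline is the mean over all control samples.
    """
    baseline = {locus: mean(depths) for locus, depths in control_depths.items()}
    return {locus: b for locus, b in baseline.items() if b >= depth_threshold}


def exact_permutation_p(pos, neg):
    """Two-sided exact permutation p-value for a difference in mean peak number.

    Enumerates every split of the pooled values, so it is only practical
    for small groups; a stand-in for whichever test is actually used.
    """
    observed = abs(mean(pos) - mean(neg))
    pooled = list(pos) + list(neg)
    indices = range(len(pooled))
    hits = total = 0
    for chosen in combinations(indices, len(pos)):
        chosen_set = set(chosen)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        if abs(mean(group_a) - mean(group_b)) >= observed - 1e-12:
            hits += 1
    return hits / total


def select_detection_loci(candidates, pos_peaks, neg_peaks, alpha=0.05):
    """Keep candidate loci whose peak numbers differ between the two groups."""
    return [locus for locus in candidates
            if exact_permutation_p(pos_peaks[locus], neg_peaks[locus]) < alpha]


# Made-up toy data, for illustration only.
control_depths = {"locusA": [120, 95, 110], "locusB": [12, 25, 18], "locusC": [60, 45, 52]}
pos_peaks = {"locusA": [6, 7, 5, 8], "locusC": [3, 2, 3, 2]}   # MSI-positive samples
neg_peaks = {"locusA": [2, 3, 2, 3], "locusC": [2, 3, 2, 3]}   # MSI-negative samples

candidates = screen_candidate_loci(control_depths)
detection_loci = select_detection_loci(candidates, pos_peaks, neg_peaks)
```

With this toy data, `locusB` is dropped for insufficient depth and `locusC` is dropped because its peak numbers do not differ between the groups, leaving `locusA` as the only detection locus.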
Example 7
In this embodiment, a device for establishing a model for detecting microsatellite instability is also provided. In addition to the microsatellite locus searching module, the candidate microsatellite locus screening module and the baseline establishing module of the device for establishing a baseline for detecting microsatellite instability, the device comprises a machine learning modeling module. The machine learning modeling module is used for modeling, with a machine learning algorithm, the average coverage depths and the numbers of peaks of the plurality of positive samples and the plurality of negative samples in the baseline, so as to obtain the model for detecting microsatellite instability.
Compared with existing devices, the microsatellite loci used by this device include not only the loci used by existing algorithms but also other loci that can clearly distinguish MSI-H from MSS, so that the sensitivity is improved. In addition, the device uses machine learning to build a model from positive samples and negative samples with known microsatellite instability status and then judges the sample to be tested with that model. Compared with conventional devices, which judge the sample by a hard 20% boundary, the specificity of detection is improved.
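The modeling step can be illustrated with a minimal sketch. The text only says "a machine learning algorithm", so the logistic regression below is an assumed stand-in, not the algorithm of the invention. For brevity the features are the peak numbers at three hypothetical detection loci only; the average coverage depth is omitted, and all values are made up.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(features, labels, lr=0.05, epochs=5000):
    """Fit a logistic regression by batch gradient descent.

    features: one list of peak numbers per sample (one value per locus).
    labels:   1 for positive (MSI-H) samples, 0 for negative (MSS) samples.
    Returns (weights, bias).
    """
    n_samples = len(features)
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = sigmoid(bias + sum(w * v for w, v in zip(weights, x))) - y
            for j, v in enumerate(x):
                grad_w[j] += err * v
            grad_b += err
        weights = [w - lr * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n_samples
    return weights, bias


def predict_proba(model, x):
    """Probability that a sample with peak numbers x is microsatellite unstable."""
    weights, bias = model
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


# Made-up training data: peak numbers at three hypothetical detection loci.
positive = [[6, 7, 5], [7, 6, 6], [5, 8, 7]]   # known MSI-H samples
negative = [[2, 3, 2], [3, 2, 2], [2, 2, 3]]   # known MSS samples
model = train_logistic(positive + negative, [1] * 3 + [0] * 3)
```

Unlike a fixed 20% cut-off on the fraction of unstable loci, a fitted model of this kind weights each locus according to how well it separates the two groups in the training samples.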
Example 8
In this embodiment, a device for detecting microsatellite instability is also provided. In addition to the microsatellite locus searching module, the candidate microsatellite locus screening module, the baseline establishing module and the machine learning modeling module of the device for establishing a model for detecting microsatellite instability, the device comprises a detection module and a prediction module. The detection module is used for detecting the number of peaks of each detection microsatellite locus in the sequencing data of a sample to be tested. The prediction module is used for analyzing, with the model for detecting microsatellite instability, the number of peaks of each detection microsatellite locus in the sequencing data of the sample to be tested, so as to obtain the microsatellite instability status of the sample to be tested.
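The detection and prediction modules can be sketched as follows. This section does not define how a "peak" is counted, so the definition used here is an assumption: a peak is a distinct repeat length supported by at least 2 reads and at least 5% of the reads covering the locus. The function names, the thresholds and the model weights are hypothetical.

```python
import math
from collections import Counter


def count_peaks(repeat_lengths, min_fraction=0.05, min_reads=2):
    """Count distinct repeat lengths with enough read support at one locus.

    repeat_lengths: repeat length observed in each read spanning the locus.
    The 5% / 2-read thresholds are illustrative assumptions.
    """
    counts = Counter(repeat_lengths)
    total = sum(counts.values())
    if total == 0:
        return 0
    return sum(1 for c in counts.values()
               if c >= min_reads and c / total >= min_fraction)


def peak_features(sample_lengths, detection_loci):
    """Peak number of each detection locus, in a fixed locus order."""
    return [count_peaks(sample_lengths.get(locus, [])) for locus in detection_loci]


def predict_msi(features, weights, bias, cutoff=0.5):
    """Apply a fitted linear-logistic model (hypothetical weights) to a sample."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    prob = 1.0 / (1.0 + math.exp(-z))
    return ("MSI-H" if prob >= cutoff else "MSS"), prob


# Made-up read data for one sample at two hypothetical detection loci.
sample = {
    "locusA": [12] * 15 + [11] * 10 + [10] * 9 + [9] * 8 + [8] * 5 + [7],
    "locusD": [14] * 40 + [13] * 6 + [15],
}
features = peak_features(sample, ["locusA", "locusD"])
status, probability = predict_msi(features, weights=[0.8, 0.8], bias=-5.0)
```

Here `locusA` shows five supported repeat lengths and `locusD` two; single-read lengths are ignored as likely sequencing noise.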
From the above description, it can be seen that the above embodiments of the present invention achieve the following technical effects. In the method and the device for establishing a baseline for detecting microsatellite instability, the microsatellite loci used include not only the loci already in use but also other loci that can clearly distinguish MSI-H from MSS, so that the sensitivity is improved. In addition, the method and the device of the present application do not use 20% as the boundary for judging the final microsatellite status of a sample. Instead, they use machine learning to model positive samples and negative samples with known microsatellite instability status and then judge the sample to be tested, so that the specificity is improved relative to a hard 20% boundary.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.