CN110570907B - Method for establishing baseline and model for detecting instability of microsatellite and application - Google Patents

Method for establishing baseline and model for detecting instability of microsatellite and application


Publication number
CN110570907B
Authority
CN
China
Prior art keywords
microsatellite
baseline
instability
sample
detecting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910833273.7A
Other languages
Chinese (zh)
Other versions
CN110570907A (en)
Inventor
周涛
陈利斌
郭璟
楼峰
曹善柏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xiangxin Biotechnology Co ltd
Original Assignee
Beijing Xiangxin Biotechnology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xiangxin Biotechnology Co ltd
Priority to CN201910833273.7A
Publication of CN110570907A
Application granted
Publication of CN110570907B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • General Physics & Mathematics (AREA)
  • Chemical & Material Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention provides a method for establishing a baseline and a model for detecting microsatellite instability, and applications thereof. The method comprises: searching the human reference genome for all available microsatellite loci in the region corresponding to the sequencing data of the sample to be tested; using sequencing data of a plurality of control blood cell samples to calculate an average coverage depth baseline, and retaining the microsatellite loci whose average coverage depth baseline meets a depth threshold as candidate microsatellite loci; and using each candidate microsatellite locus and the average coverage depth baseline to identify the candidate microsatellite loci whose peak numbers differ significantly between a plurality of positive samples and a plurality of negative samples, these loci serving as detection microsatellite loci, wherein the average coverage depth and the peak number of each detection microsatellite locus in the positive and negative samples form the baseline for detecting microsatellite instability. The method improves both the utilization efficiency of sequencing data and the detection sensitivity.

Description

Method for establishing baseline and model for detecting instability of microsatellite and application
Technical Field
The invention relates to the field of gene sequencing data analysis, and in particular to a method for establishing a baseline and a model for detecting microsatellite instability, and applications thereof.
Background
Microsatellite instability (MSI) refers to a decrease or increase in the number of microsatellite repeats, leading to the appearance of new alleles. Numerous studies have shown that microsatellite instability is caused by defects in mismatch repair genes and is closely related to tumorigenesis. Microsatellite instability is used clinically as an important molecular marker for prognosis and for formulating adjuvant treatment regimens in colorectal cancer and other solid tumors, and is also applied to assist in Lynch syndrome screening. However, there is no gold standard for determining microsatellite instability from NGS data.
Currently, as shown in FIG. 1, most NGS-based methods examine microsatellite loci taken from sources such as the NCCN and then apply a 20% cutoff: a sample is judged to be microsatellite unstable when 20% or more of the examined microsatellite loci are unstable. However, existing methods for detecting microsatellite instability still suffer from low sensitivity.
Disclosure of Invention
The main object of the invention is to provide a method for establishing a baseline and a model for detecting microsatellite instability, and applications thereof, so as to solve the problem in the prior art of low sensitivity when detecting microsatellite instability from sequencing data.
To achieve the above object, according to one aspect of the present invention, there is provided a method of establishing a baseline for detecting microsatellite instability, the method including: searching the human reference genome for all available microsatellite loci in the region corresponding to the sequencing data of the sample to be tested; using the sequencing data of a plurality of control blood cell samples to calculate the average coverage depth baseline of each microsatellite locus in the control blood cell samples, and retaining the microsatellite loci whose average coverage depth baseline meets the depth threshold as candidate microsatellite loci; and using each candidate microsatellite locus and the average coverage depth baseline to calculate the average coverage depth and the number of peaks of each candidate microsatellite locus in the sequencing data of each of a plurality of positive samples and a plurality of negative samples, identifying the candidate microsatellite loci whose numbers of peaks differ significantly between the positive samples and the negative samples as the detection microsatellite loci, wherein the average coverage depth and the number of peaks of each detection microsatellite locus in the positive and negative samples form the baseline for detecting microsatellite instability.
Further, the minimum length of all available microsatellite loci is 10 bp.
Further, the depth threshold is equal to or greater than 30.
According to a second aspect of the present invention, there is provided a method of modelling for detecting microsatellite instability, the method comprising: establishing a baseline for detecting the instability of the microsatellite by adopting any one of the methods; and modeling the average coverage depth and the number of peaks in a plurality of positive samples and a plurality of negative samples in the baseline by using a machine learning algorithm to obtain a model for detecting the instability of the microsatellite.
Further, the machine learning algorithm is a random forest algorithm.
According to a third aspect of the present invention there is provided a model for detecting microsatellite instability, the model being constructed using any of the methods described above.
According to a fourth aspect of the present invention, there is provided a method of detecting microsatellite instability, the method comprising: detecting the microsatellite loci according to any one of the methods, and detecting the number of peaks of each detected microsatellite locus in the sequencing data of the sample to be detected; and analyzing the number of peaks of each detected microsatellite locus in the sequencing data of the sample to be detected by using any one of the models for detecting the instability of the microsatellite, thereby obtaining the unstable state result of the microsatellite of the sample to be detected.
According to a fifth aspect of the present invention there is provided apparatus for establishing a baseline for detecting microsatellite instability, the apparatus comprising: the system comprises a microsatellite locus searching module, a candidate microsatellite locus screening module and a base line establishing module, wherein the microsatellite locus searching module is used for comparing sequencing data of a sample to be tested with a human reference genome sequence to obtain all available microsatellite loci, and the sequencing data of the sample to be tested comprises sequencing data of all known microsatellite loci; the candidate microsatellite locus screening module is used for counting the average coverage depth baseline of each microsatellite locus in each control blood cell sample by using the sequencing data of a plurality of control blood cell samples and reserving the microsatellite locus of which the average coverage depth baseline meets the depth threshold as a candidate microsatellite locus; and the baseline establishing module is used for calculating the average coverage depth and the number of peaks of each candidate microsatellite locus in the sequencing data of the multiple positive samples and the multiple negative samples by utilizing the candidate microsatellite loci and the average coverage depth baseline, finding out the candidate microsatellite loci with the number of peaks having significant difference in the multiple positive samples and the multiple negative samples as the detection microsatellite loci, and forming unstable baselines of the detection microsatellite by the average coverage depth and the number of peaks of each detection microsatellite locus in the multiple positive samples and the multiple negative samples.
According to a sixth aspect of the present invention, there is provided an apparatus for modeling for detecting microsatellite instability, the apparatus comprising: the device for establishing the baseline for detecting the instability of the microsatellite comprises a microsatellite locus searching module, a candidate microsatellite locus screening module, a baseline establishing module and a machine learning modeling module, wherein the machine learning modeling module is used for modeling the average coverage depth and the peak number of a plurality of positive samples and a plurality of negative samples in the baseline by utilizing a machine learning algorithm to obtain a model for detecting the instability of the microsatellite.
According to a seventh aspect of the present invention, there is provided an apparatus for detecting microsatellite instability, the apparatus comprising: the microsatellite locus searching module, the candidate microsatellite locus screening module and the baseline establishing module of the above apparatus for establishing a baseline for detecting microsatellite instability; the machine learning modeling module of the above apparatus for establishing a model for detecting microsatellite instability; a detection module; and a prediction module, wherein the detection module is used for detecting the number of peaks of each detection microsatellite locus in the sequencing data of a sample to be tested; and the prediction module is used for analyzing the number of peaks of each detection microsatellite locus in the sequencing data of the sample to be tested by using the model for detecting microsatellite instability, so as to obtain the microsatellite instability status of the sample to be tested.
According to an eighth aspect of the present invention, there is provided a storage medium comprising a stored program, wherein the program, when executed, controls an apparatus on which the storage medium is located to perform the above-described method of establishing a baseline for detecting instability of a microsatellite, or to perform the above-described method of establishing a model for detecting instability of a microsatellite, or to perform the above-described method of detecting instability of a microsatellite.
According to a ninth aspect of the present invention, there is provided a processor comprising a stored program, wherein the program, when executed, controls an apparatus on a storage medium to perform the above-described method of establishing a baseline for detecting instability of a microsatellite, or to perform the above-described method of establishing a model for detecting instability of a microsatellite, or to perform the above-described method of detecting instability of a microsatellite.
By applying the technical scheme of the invention, all available microsatellite loci in the sequencing data are found by comparing the sequencing data with the human reference genome sequence, so that the microsatellite loci contained in the NGS data are fully utilized. The sequencing data of control blood cell samples matched to the tumors are then used to screen out the microsatellite loci with higher capture efficiency for subsequent analysis. From these loci, the sequencing data of positive samples and negative samples with known microsatellite status are used to identify the microsatellite loci whose peak numbers differ significantly between the two groups of samples, and the average coverage depth and peak numbers of these loci form the baseline for subsequently detecting microsatellite instability in a sample to be tested. Compared with the methods commonly used on the market, this method uses more microsatellite locus information to establish the baseline, and more microsatellite loci are examined when the microsatellite status of a sample to be tested is subsequently detected or judged, so that both the utilization efficiency of sequencing data and the detection sensitivity are improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 shows a schematic flow diagram of a method for detecting microsatellite instability according to the prior art;
FIG. 2 shows a schematic flow diagram of a method for establishing a baseline for detecting microsatellite instability in accordance with a preferred embodiment of the present application;
FIG. 3 shows a detailed flow diagram of a method for detecting microsatellite instability in accordance with a preferred embodiment of the present application;
FIG. 4 shows a cluster analysis plot of the microsatellite loci whose peak numbers differ significantly between microsatellite-positive and microsatellite-negative samples in Example 5; and
FIG. 5 shows the results of ROC analysis of the peak matrices of microsatellite-positive and microsatellite-negative samples in Example 5.
Detailed Description
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present invention will be described in detail with reference to examples.
Microsatellites: short tandem repeats distributed throughout the human genome, in which a repeat unit of one, two or more nucleotides is repeated 10-50 times.
Microsatellite instability: a change in the length of a microsatellite caused by the insertion or deletion of repeat units.
The BED file format is a flexible, line-based way of describing annotation data. A BED line has 3 required fields and 9 additional optional fields. The three required fields are:
1. chrom: the name of the chromosome or scaffold, such as chr3 or chrY.
2. chromStart: the start position on the chromosome or scaffold; the first base of a chromosome is numbered 0.
3. chromEnd: the end position on the chromosome or scaffold.
The 9 additional optional BED fields are:
4. name: the name of the BED line;
5. score: a value of 0-1000; if the useScore attribute of the track line is set to 1, the score determines the gray level used for display, and the larger the number, the darker the gray;
6. strand: + or -;
7. thickStart: the position where thick drawing begins;
8. thickEnd: the position where thick drawing ends;
9. itemRgb: an RGB value of the form R,G,B;
10. blockCount: the number of blocks (exons) in the BED line;
11. blockSizes: a comma-separated list of block sizes, with as many items as blockCount;
12. blockStarts: a comma-separated list of block start positions.
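As an illustration of the three required fields, the following is a minimal Python sketch that reads one BED line into a record. The names `BedInterval` and `parse_bed_line`, and the example coordinates, are hypothetical and are not part of the described method.

```python
from typing import NamedTuple


class BedInterval(NamedTuple):
    """The three required BED fields (illustrative record type)."""
    chrom: str        # chromosome or scaffold name, e.g. "chr1"
    chrom_start: int  # 0-based start position
    chrom_end: int    # end position (not included in the feature)


def parse_bed_line(line: str) -> BedInterval:
    """Parse the required fields of one BED line; optional fields are ignored."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("a BED line needs at least chrom, chromStart, chromEnd")
    return BedInterval(fields[0], int(fields[1]), int(fields[2]))


# Hypothetical panel region with an optional name field.
region = parse_bed_line("chr1\t8074000\t8074300\ttarget_1")
region_length = region.chrom_end - region.chrom_start
```

Because the start is 0-based and the end is exclusive, the region length is simply the difference between the two coordinates.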
Example 1
In a preferred embodiment of the present application, a method for establishing a baseline for detecting microsatellite instability is provided, and fig. 2 is a flow chart of the method for establishing a baseline for detecting microsatellite instability according to an embodiment of the present invention. As shown in fig. 2, the method includes:
step S101, searching all available microsatellite loci in a region corresponding to sequencing data of a sample to be detected on a human reference genome;
step S102, counting the average coverage depth baseline of each microsatellite locus in each control blood cell sample by using the sequencing data of a plurality of control blood cell samples, and reserving the microsatellite locus of which the average coverage depth baseline meets the depth threshold as a candidate microsatellite locus;
step S103, using each candidate microsatellite locus and the average coverage depth baseline to calculate the average coverage depth and the number of peaks of each candidate microsatellite locus in the sequencing data of each of a plurality of positive samples and a plurality of negative samples, identifying the candidate microsatellite loci whose numbers of peaks differ significantly between the positive samples and the negative samples as detection microsatellite loci, wherein the average coverage depth and the number of peaks of each detection microsatellite locus in the positive and negative samples form the baseline for detecting microsatellite instability.
The method compares the sequencing data with the human reference genome sequence to find all available microsatellite loci in the sequencing data, so that the microsatellite loci contained in the NGS data are fully utilized. The sequencing data of control blood cell samples matched to the tumors are then used to screen out the microsatellite loci with higher capture efficiency for subsequent analysis. From these loci, the sequencing data of positive samples and negative samples with known microsatellite status are used to identify the microsatellite loci whose peak numbers differ significantly between the two groups of samples, and the average coverage depth and peak numbers of these loci form the baseline for subsequently detecting microsatellite instability in a sample to be tested. Compared with the methods commonly used on the market, this method uses more microsatellite locus information to establish the baseline, and more microsatellite loci are examined when the microsatellite status of a sample to be tested is subsequently detected or judged, so that both the utilization efficiency of sequencing data and the detection sensitivity are improved.
In the above embodiment, the sequencing data of the sample to be tested is preferably sequencing data of tumor tissue. The control blood cell sample and the tumor tissue are derived from the same subject. More preferably, a panel is designed for all reported microsatellite loci and all loci known to be associated with tumor mutations, a capture library is constructed with this panel, and the library is sequenced to obtain the sequencing data. In addition to the commonly used microsatellite loci reported by the NCCN, the microsatellite loci determined here also include previously unreported loci identified by aligning the sequencing data to the reference genome sequence.
The control blood cell sample is the same as the blood cell sample used in conventional methods for detecting microsatellite instability, that is, a blood cell sample from the same source as the tumor tissue to be tested. Here, the blood cell samples are used to calculate the average coverage depth of each microsatellite locus to be examined. A plurality of blood cell samples is generally used, preferably at least 5, 10, 15, 20, 25, 30, 40 or 50, or more than 50. In an alternative embodiment of the present application, the number of blood cell samples is 52.
Peaks are then calculated for the above microsatellite loci with high capture efficiency using the microsatellite-positive and microsatellite-negative samples, and the peaks are used to further screen out the microsatellite loci that discriminate strongly between the two groups, namely the loci whose peak numbers differ significantly between the positive and negative samples. The significance of the difference is preferably assessed with a rank-sum test, with p = 0.01.
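The rank-sum screening described here can be sketched as follows. This is a minimal illustration, assuming the peak numbers are held in per-locus lists and that `scipy.stats.ranksums` (the Wilcoxon rank-sum test in SciPy) is an acceptable stand-in for the test used. The function name `select_detection_loci`, the locus names and all peak values are invented for illustration.

```python
from scipy.stats import ranksums


def select_detection_loci(pos_peaks, neg_peaks, alpha=0.01):
    """Return loci whose peak numbers differ between the two groups (p < alpha).

    pos_peaks / neg_peaks map a locus ID to the list of peak numbers
    observed in the positive / negative samples (assumed data layout).
    """
    selected = []
    for locus in pos_peaks:
        _, p_value = ranksums(pos_peaks[locus], neg_peaks[locus])
        if p_value < alpha:
            selected.append(locus)
    return selected


# Invented peak numbers for 9 positive and 18 negative samples.
pos = {
    "MS_a": [6, 7, 8, 7, 6, 9, 8, 7, 6],  # clearly more peaks in positives
    "MS_b": [3, 2, 3, 2, 3, 2, 3, 2, 3],  # similar to negatives
}
neg = {
    "MS_a": [2, 3] * 9,
    "MS_b": [2, 3] * 9,
}
detection_loci = select_detection_loci(pos, neg)
```

Only the locus whose peak numbers separate the two groups is retained; the 0.01 cutoff follows the text, while treating it as a strict "less than" comparison is an assumption.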
In the present application, the peak number (peak) is a statistic describing the insertion and deletion status of a microsatellite locus. For a microsatellite locus in a test sample, the distinct repeat lengths observed in the reads are counted, and a length is counted as valid only if it is supported by more than 3 reads. For example, suppose a microsatellite locus is 24A on the reference genome and a test sample shows several lengths: 15A (2 reads), 20A (10 reads), 21A (20 reads), 22A (40 reads) and 23A (100 reads). The 15A length is discarded first because it is supported by fewer than 3 reads; 4 lengths remain, so the peak number is 4.
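The worked example above can be reproduced with a short Python sketch. The function name `count_peaks` and the input layout (one observed repeat length per read) are assumptions. The text does not say how a length supported by exactly 3 reads is treated, so the cutoff is left as a parameter and the default of keeping lengths with at least 3 reads is only one possible reading.

```python
from collections import Counter


def count_peaks(repeat_lengths, min_reads=3):
    """Count distinct repeat lengths supported by at least `min_reads` reads.

    `repeat_lengths` holds one observed repeat length per read covering
    the locus. Whether exactly 3 supporting reads should count is not
    settled by the text; `min_reads` makes that choice explicit.
    """
    support = Counter(repeat_lengths)
    return sum(1 for n_reads in support.values() if n_reads >= min_reads)


# Example from the text: 15A (2 reads), 20A (10), 21A (20), 22A (40), 23A (100).
observed = [15] * 2 + [20] * 10 + [21] * 20 + [22] * 40 + [23] * 100
peaks = count_peaks(observed)  # 15A is dropped, leaving 4 lengths
```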
In an optional embodiment, the sequencing data of the sample to be tested are compared with the human reference genome sequence to obtain all available microsatellite loci: msisensor software (v0.5) is used to search the human reference genome sequence (hg19) for all microsatellite loci within the range of the sequencing data file of the sample to be tested. Default software parameters are used, except that the minimum length of a microsatellite locus is set to 10.
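To make the idea of scanning a reference sequence for microsatellite loci concrete, the following is a simplified Python sketch. It is not msisensor and does not reproduce its algorithm or parameters; it is a brute-force stand-in that reports perfect tandem repeats of a short unit whose total length reaches a minimum. The function name `find_microsatellites`, the `max_unit` limit of 5 and the toy sequence are assumptions.

```python
def find_microsatellites(seq, min_len=10, max_unit=5):
    """Return (start, end, unit, repeats) for perfect tandem repeats.

    Coordinates are 0-based with an exclusive end. At each position the
    longest repeat over unit sizes 1..max_unit is kept, and ties go to
    the shorter unit. Simplified illustration only.
    """
    seq = seq.upper()
    n = len(seq)
    loci = []
    i = 0
    while i < n:
        best = None
        for k in range(1, max_unit + 1):
            if i + k > n:
                break
            unit = seq[i:i + k]
            j = i + k
            while seq[j:j + k] == unit:  # extend while the unit repeats
                j += k
            repeats = (j - i) // k
            if repeats >= 2 and j - i >= min_len:
                if best is None or j - i > best[1] - best[0]:
                    best = (i, j, unit, repeats)
        if best is None:
            i += 1
        else:
            loci.append(best)
            i = best[1]  # continue after the reported locus
    return loci


# Toy sequence: a 12 bp homopolymer and a 12 bp trinucleotide repeat.
toy = "GG" + "A" * 12 + "CG" + "AAT" * 4 + "GC"
found = find_microsatellites(toy, min_len=10)
```

In practice the scan would be restricted to the panel regions of the reference genome; here a single short string stands in for that input.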
Step S102 selects the microsatellite loci with high capture efficiency from all available microsatellite loci. In an alternative embodiment, the sequencing data of a plurality of blood cell samples sequenced simultaneously are used to calculate the average coverage depth of all available microsatellite loci, and the loci with an average coverage depth of more than 30 are selected as the microsatellite loci with high capture efficiency.
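The depth-based screening of step S102 can be sketched in a few lines of Python. The function name `select_candidate_loci`, the dictionary layout and the depth values are hypothetical; the threshold of 30 and the "more than" comparison follow the paragraph above.

```python
from statistics import mean


def select_candidate_loci(depths_by_locus, depth_threshold=30):
    """Keep loci whose mean coverage depth across control samples exceeds the threshold.

    `depths_by_locus` maps a locus ID to the list of read depths observed
    in the control blood cell samples (assumed data layout). The returned
    dictionary is the average-depth baseline of the retained loci.
    """
    baseline = {locus: mean(depths) for locus, depths in depths_by_locus.items()}
    return {locus: avg for locus, avg in baseline.items() if avg > depth_threshold}


# Invented depths for three loci in three control samples.
depths = {
    "MS_a": [310, 295, 302],
    "MS_b": [12, 25, 20],    # poorly captured locus
    "MS_c": [140, 128, 131],
}
candidates = select_candidate_loci(depths)
```

The poorly captured locus is dropped, and the mean depths of the remaining loci serve as the depth baseline for later steps.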
Example 2
In a preferred embodiment, the present application also provides a method of modeling for detecting microsatellite instability, the method comprising: establishing a baseline for detecting the instability of the microsatellite by adopting any one of the methods; and modeling the average coverage depth and the number of peaks in a plurality of positive samples and a plurality of negative samples in the baseline by using a machine learning algorithm to obtain a model for detecting the instability of the microsatellite.
In the preferred embodiment, on the one hand, the detection baseline is established by making maximal use of the microsatellite locus information in the sequencing data; on the other hand, a prediction model for detecting microsatellite instability is established with a machine learning algorithm, using the number of peaks of each detection microsatellite locus in the positive samples and in the negative samples of the baseline. When the model established by the machine learning method is used to predict the microsatellite status of a sample, the judgment is likewise made by machine learning, so that both the sensitivity and the specificity of detection are higher. Compared with the judgment methods commonly used on the market, which generally take 20% as the cutoff, this avoids the excessive subjectivity of the existing methods.
In the step of establishing a model by a machine learning method, a random forest algorithm (sklearn 0.20.0) is preferably used for modeling.
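A minimal sketch of this modeling step with scikit-learn's `RandomForestClassifier` is given below. The peak matrix and labels are invented, only peak numbers are used as features, and `random_state=0` is added purely so that the example is repeatable; none of these choices is specified by the text beyond the use of a random forest with default parameters.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Invented peak matrix: rows are samples, columns are detection loci.
X_train = np.array([
    [6, 7, 5], [7, 8, 6], [6, 6, 7],                         # positive samples
    [2, 3, 2], [3, 2, 2], [2, 2, 3], [3, 3, 2], [2, 3, 3],   # negative samples
])
y_train = np.array([1, 1, 1, 0, 0, 0, 0, 0])  # 1 = positive, 0 = negative

# Default hyperparameters; random_state is an added assumption for repeatability.
model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)

# Probability that a new (invented) sample is microsatellite positive.
X_new = np.array([[7, 7, 6]])
p_positive = model.predict_proba(X_new)[0, 1]
```

The second column of `predict_proba` corresponds to the label 1 because scikit-learn orders the classes as listed in `model.classes_`.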
Example 3
In a preferred embodiment of the present application, a model for detecting instability of a microsatellite, which is built by the method for building a model for detecting instability of a microsatellite is also provided. The model established by the machine learning algorithm is used for analyzing and judging the microsatellite state of the sample, and the method has higher sensitivity and specificity.
Example 4
In a preferred embodiment of the present application, there is also provided a method of detecting microsatellite instability, the method including: for the detection microsatellite loci obtained by the above method for establishing a baseline for detecting microsatellite instability, determining the number of peaks of each detection microsatellite locus in the sequencing data of a sample to be tested; and analyzing the number of peaks of each detection microsatellite locus in the sequencing data of the sample to be tested by using the above model for detecting microsatellite instability, thereby obtaining the microsatellite instability status of the sample to be tested.
The model established by the machine learning algorithm is used for analyzing and judging the microsatellite state of the sample, and the method has higher sensitivity and specificity.
In a more specific embodiment, the step of microsatellite stability detection comprises:
1) calculating peaks of the microsatellite loci for a sample to be detected;
2) establishing a model from the peak matrices of the negative sample set and the positive sample set, and analyzing the calculated peaks of the final microsatellite loci with a random forest machine learning model (sklearn 0.20.0) to give the probability that the microsatellite status of the sample is positive, all model parameters being left at their defaults;
3) classifying the sample as MSI-H when the predicted probability of being positive is greater than 0.6, as MSI-L when it is greater than 0.4 and less than 0.6, and as MSS when it is less than 0.4 (the thresholds of 0.4 and 0.6 are the values that best separate the positive and negative sample sets).
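The thresholding in step 3) can be written as a small Python function. The function name `classify_msi` is hypothetical. The text does not state how probabilities of exactly 0.4 or 0.6 are handled, so assigning them to MSI-L here is an assumption.

```python
def classify_msi(p_positive, low=0.4, high=0.6):
    """Map the predicted probability of being positive to a microsatellite status.

    Values above `high` give MSI-H and values below `low` give MSS.
    Everything in between, including the exact boundary values (an
    assumption), gives MSI-L.
    """
    if p_positive > high:
        return "MSI-H"
    if p_positive < low:
        return "MSS"
    return "MSI-L"


statuses = [classify_msi(p) for p in (0.9, 0.5, 0.1)]
```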
Existing schemes cannot handle borderline cases flexibly: for each sample they calculate the proportion of unstable loci among all valid detected loci and then judge the microsatellite instability of the sample from this proportion, and once the cutoff proportion is fixed at a specific value, samples near that value cannot be handled well.
In the preferred embodiment, the model is established directly from the peak matrices of the negative sample set and the positive sample set, and the learning ability of machine learning is used to predict new samples, which avoids having to handle a fixed cutoff and improves the accuracy of sample discrimination.
Example 5
Objective: to examine the microsatellite status of an NGS-sequenced sample (panel sequencing, with the hg19 reference genome).
Method: the specific detection procedure is shown in FIG. 3.
1. The msisensor software (v0.5) is used to search the human reference genome for all microsatellite loci within the sample sequencing panel file (i.e., the BED region file); default software parameters are used, except that the minimum length of a microsatellite locus is set to 8 bp. All available microsatellite loci are obtained from this step (i.e., step A), and some of the results are shown in Table 1.
Table 1:
chr start end MS repeat MSID
chr1 8074168 8074175 AG 4 MS1
chr1 11182071 11182082 TCT 4 MS2
chr1 16203144 16203155 CAG 4 MS3
chr1 16255142 16255153 GA 6 MS4
chr1 16256107 16256114 AG 4 MS5
chr1 16262695 16262702 CA 4 MS6
chr1 27022940 27022954 CCG 5 MS7
chr1 27022977 27022988 AGC 4 MS8
chr1 27023008 27023022 GGC 5 MS9
Note: chr denotes the chromosome; start denotes the start position of the microsatellite locus; end denotes the end position of the microsatellite locus; MS denotes the repeat unit; repeat denotes the number of repetitions; and MSID denotes the ID of the microsatellite locus.
2. Select 52 blood cell samples sequenced with the same panel, calculate the average coverage depth of all the microsatellite loci from step 1, and select the loci with high capture efficiency, i.e. an average coverage depth of more than 30, as the basic loci for further screening. This step (i.e., step B) mainly yields the baseline of the average depth of the loci and the microsatellite loci with high capture efficiency, as shown in Table 2.
Table 2:
MSID qcs Average_Total_Reads Count
MS1 pass 302.3529412 51
MS2 pass 236.2156863 51
MS3 pass 234.4705882 51
MS4 pass 200.8235294 51
MS5 pass 199.627451 51
MS6 pass 481.7254902 51
MS7 pass 131.3333333 51
MS8 pass 133.9803922 51
MS9 pass 125.96 50
attached: in table 2, MSID indicates the number of microsatellite loci; qcs: representing the quality control state of the support sequence; average _ Total _ Reads represents the Average number of Reads covering each microsatellite locus (i.e., Average depth of coverage); count represents the number of samples that pass quality control.
3. Using 9 positive samples (MSI-H, microsatellite instability-high) and 18 negative samples (MSS, microsatellite stable) of known microsatellite status, together with the average coverage depth baseline of the high-capture-efficiency microsatellite loci obtained in step 2, the average depth and the number of peaks of each locus were calculated, and the microsatellite loci with large discrimination between positive and negative samples were identified. A locus with large discrimination is one whose number of peaks differs significantly between the group of 9 positive samples and the group of 18 negative samples (rank sum test, p = 0.01; 53 loci in total). This step (i.e., step C) yields the baseline of average coverage depth and the peak matrix for the microsatellite loci with large discrimination between positive and negative samples. The results are shown in Tables 3-1 and 3-2.
Table 3-1:
Figure BDA0002191418030000091
Table 3-2:
Figure BDA0002191418030000092
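The two calculations in step 3 can be illustrated with a short Python sketch. `peak_number` follows the definition of the peak number given in claim 1 (the number of read length types supported by more than 3 reads). `rank_sum_p` is a plain two-sided Wilcoxon rank-sum test using the normal approximation with a tie correction and no continuity correction; the text does not say which implementation of the rank sum test was used, so this is an assumption, and for small groups an exact test may give different p-values. All function names and input values below are hypothetical.

```python
from collections import Counter
from math import sqrt
from statistics import NormalDist


def peak_number(read_lengths, min_support=3):
    """Count read length types supported by more than min_support reads."""
    counts = Counter(read_lengths)
    return sum(1 for n in counts.values() if n > min_support)


def rank_sum_p(xs, ys):
    """Two-sided rank-sum p-value (normal approximation, tie-corrected)."""
    n1, n2 = len(xs), len(ys)
    n = n1 + n2
    if n1 == 0 or n2 == 0 or n < 2:
        return 1.0
    pooled = sorted(list(xs) + list(ys))
    rank_of, tie_term, i = {}, 0, 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        # Tied values share the average of ranks i+1 .. j.
        rank_of[pooled[i]] = (i + 1 + j) / 2
        tie_term += (j - i) ** 3 - (j - i)
        i = j
    u1 = sum(rank_of[x] for x in xs) - n1 * (n1 + 1) / 2
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return 1.0  # every value is identical
    z = (u1 - n1 * n2 / 2) / sqrt(variance)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical reads at one locus: lengths 12, 11 and 10 have more than
# 3 supporting reads; length 13 has exactly 3 and is not counted.
reads = [12] * 10 + [11] * 5 + [13] * 3 + [10] * 4
peaks = peak_number(reads)

# Hypothetical peak numbers at one locus for 9 MSI-H and 18 MSS samples.
positive_peaks = [5, 6, 7, 6, 5, 8, 7, 6, 7]
negative_peaks = [2, 1, 2, 3, 2, 1, 2, 2, 3, 2, 1, 2, 2, 3, 2, 1, 2, 2]
p_value = rank_sum_p(positive_peaks, negative_peaks)
is_discriminating = p_value < 0.01  # significance level assumed from step 3
```

Applying `rank_sum_p` locus by locus and keeping the loci below the significance level would give a set of detection loci analogous to the 53 loci reported above.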
4. The peak matrix of the microsatellite loci with high capture efficiency and large discrimination between positive and negative samples was analyzed with the PCA algorithm; the result is shown in FIG. 4. As can be seen from FIG. 4 (PC1 denotes the first principal component and PC2 the second principal component), the selected microsatellite loci differ significantly between the two groups of samples.
5. ROC analysis was performed on peak matrices of microsatellite loci with high capture efficiency and large discrimination between positive and negative samples by cross-validation, and the analysis results are shown in FIG. 5. As can be seen from FIG. 5, the sensitivity and specificity of the model constructed above are both 100%.
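For reference, sensitivity and specificity as reported in step 5 can be computed from known and predicted labels as in the sketch below. The function name and the label lists are hypothetical: `predicted` simply mirrors the known labels to reproduce the reported outcome on 9 MSI-H and 18 MSS samples and is not the actual model output.

```python
def sensitivity_specificity(truth, predicted, positive="MSI-H"):
    """Return (sensitivity, specificity) for paired label lists."""
    pairs = list(zip(truth, predicted))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity


truth = ["MSI-H"] * 9 + ["MSS"] * 18
predicted = list(truth)  # hypothetical: every sample classified correctly
sens, spec = sensitivity_specificity(truth, predicted)

# One missed MSI-H sample lowers sensitivity to 8/9 and leaves specificity at 1.
one_miss = ["MSS"] + predicted[1:]
sens_miss, spec_miss = sensitivity_specificity(truth, one_miss)
```

With only 9 positive samples, a single misclassified MSI-H sample changes sensitivity by about 11 percentage points, which is worth keeping in mind when reading the 100% figures.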
6. For a sample to be tested, the number of peaks of each microsatellite locus with high capture efficiency and large discrimination between positive and negative samples is calculated, the model is then used for prediction, and the final result is reported (i.e., step D). The result is shown in Table 4.
Table 4:
Sample Probability of being MSI-H
180504253T1 0.9
From this example it can be seen that: the method has 100% sensitivity and 100% specificity for 9 known MSI-H and 18 MSS samples.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
Corresponding to the above manner, the present application further provides a device for establishing a baseline for detecting instability of a microsatellite, a device for establishing a model for detecting instability of a microsatellite, and a device for detecting instability of a microsatellite, which are used to implement the above embodiments and preferred embodiments, and have been described above and will not be described again. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
This is further illustrated below in connection with alternative embodiments.
Example 6
In this embodiment, there is also provided an apparatus for establishing a baseline for detecting microsatellite instability, the apparatus comprising a microsatellite locus searching module, a candidate microsatellite locus screening module and a baseline establishing module. The microsatellite locus searching module is used for searching all available microsatellite loci in the region of the human reference genome corresponding to the sequencing data of a sample to be tested. The candidate microsatellite locus screening module is used for calculating, from the sequencing data of a plurality of control blood cell samples, the average coverage depth baseline of each microsatellite locus across the control blood cell samples, and retaining the microsatellite loci whose average coverage depth baseline meets the depth threshold as candidate microsatellite loci. The baseline establishing module is used for calculating, using the candidate microsatellite loci and the average coverage depth baseline, the average coverage depth and the number of peaks of each candidate microsatellite locus in the sequencing data of a plurality of positive samples and a plurality of negative samples, and for finding the candidate microsatellite loci whose number of peaks differs significantly between the positive samples and the negative samples as detection microsatellite loci; the average coverage depth and the number of peaks of each detection microsatellite locus in the positive samples and the negative samples form the baseline for detecting microsatellite instability.
The apparatus uses the microsatellite locus searching module to compare the sequencing data with the human reference genome sequence and find all available microsatellite loci in the sequencing data, making full use of the microsatellite loci contained in the NGS data. The candidate microsatellite locus screening module then uses the sequencing data of tumor control blood cell samples to screen out the microsatellite loci with higher capture efficiency for subsequent analysis. The baseline establishing module then uses the sequencing data of positive and negative samples of known microsatellite status to find, among the loci with high capture efficiency, those whose number of peaks differs significantly between the two groups of samples; the average coverage depth and the number of peaks of these loci form the baseline used subsequently for detecting microsatellite instability in a sample to be tested. Compared with apparatuses currently on the market, this apparatus uses more microsatellite locus information to establish the baseline, so that more microsatellite loci are examined when the microsatellite status of a sample to be tested is subsequently detected or judged, which improves both the utilization of the sequencing data and the detection sensitivity.
Optionally, the minimum length of all available microsatellite loci is 10 bp.
Optionally, the depth threshold is equal to or greater than 30.
Example 7
In this embodiment, there is also provided an apparatus for establishing a model for detecting microsatellite instability. In addition to the microsatellite locus searching module, the candidate microsatellite locus screening module and the baseline establishing module of the apparatus for establishing a baseline for detecting microsatellite instability, the apparatus comprises a machine learning modeling module, which is used for modeling the average coverage depths and numbers of peaks of the plurality of positive samples and the plurality of negative samples in the baseline with a machine learning algorithm to obtain the model for detecting microsatellite instability.
Compared with existing apparatuses, the microsatellite loci used by this apparatus include not only the loci used by existing algorithms but also other loci that can clearly distinguish MSI-H from MSS, which improves sensitivity. In addition, the apparatus builds a model by machine learning from positive and negative samples of known microsatellite instability status and then uses the model to judge the sample to be tested, which improves detection specificity compared with conventional apparatuses that judge the sample by a hard 20% boundary.
Example 8
In this embodiment, there is also provided an apparatus for detecting microsatellite instability, the apparatus comprising: the microsatellite locus searching module, the candidate microsatellite locus screening module and the baseline establishing module of the apparatus for establishing a baseline for detecting microsatellite instability; the machine learning modeling module of the apparatus for establishing a model for detecting microsatellite instability; a detection module; and a prediction module. The detection module is used for detecting the number of peaks of each detection microsatellite locus in the sequencing data of a sample to be tested. The prediction module is used for analyzing the number of peaks of each detection microsatellite locus in the sequencing data of the sample to be tested with the model for detecting microsatellite instability, so as to obtain the microsatellite instability status of the sample to be tested.
From the above description, it can be seen that the above embodiments of the present invention achieve the following technical effects. In the method and apparatus for establishing a baseline for detecting microsatellite instability, the microsatellite loci used include not only the loci in existing use but also other loci that can clearly distinguish MSI-H from MSS, which improves sensitivity. In addition, the method and apparatus of the present application do not use 20% as the boundary for judging the final microsatellite status of a sample; instead, positive and negative samples of known microsatellite instability status are modeled by machine learning and the sample to be tested is then judged with the model, which improves specificity relative to a hard 20% boundary.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (11)

1. A method of establishing a baseline for detecting microsatellite instability, said method comprising:
searching all available microsatellite loci in a region corresponding to sequencing data of a sample to be detected on a human reference genome;
counting an average coverage depth baseline of each microsatellite locus in each control blood cell sample by using sequencing data of a plurality of control blood cell samples, and reserving the microsatellite locus of which the average coverage depth baseline meets a depth threshold value as a candidate microsatellite locus;
calculating the average coverage depth and the number of peaks of each candidate microsatellite locus in the sequencing data of a plurality of positive samples and a plurality of negative samples by utilizing each candidate microsatellite locus and the average coverage depth baseline, finding out the candidate microsatellite loci whose number of peaks differs significantly between the plurality of positive samples and the plurality of negative samples as detection microsatellite loci, and forming the baseline for detecting microsatellite instability from the average coverage depth and the number of peaks of each detection microsatellite locus in the plurality of positive samples and the plurality of negative samples;
the peak number refers to an insertion deletion condition statistic value of each candidate microsatellite locus, namely the number of read segment length types of each candidate microsatellite locus, wherein the support reads of each read segment length type is more than 3;
the positive sample is a microsatellite high-frequency unstable MSI-H sample, and the negative sample is a microsatellite stable MSS sample.
2. The method of claim 1, wherein the minimum length of all available microsatellite loci is 10 bp.
3. The method of claim 1, wherein the depth threshold is greater than or equal to 30.
4. A method of modeling for detecting microsatellite instability, said method comprising:
establishing a baseline for detecting microsatellite instability using the method of any one of claims 1 to 3;
and modeling the average coverage depth and the number of peaks in the plurality of positive samples and the plurality of negative samples in the baseline by utilizing a machine learning algorithm to obtain the model for detecting microsatellite instability.
5. The method of claim 4, wherein the machine learning algorithm is a random forest algorithm.
6. A method of detecting microsatellite instability, said method comprising:
detecting the number of peaks, in the sequencing data of a sample to be detected, of each detection microsatellite locus obtained by the method according to any one of claims 1 to 3;
analyzing the number of peaks of each detection microsatellite locus in the sequencing data of the sample to be detected using the model for detecting microsatellite instability established by the method of claim 4 or 5, so as to obtain the microsatellite instability status of the sample to be detected.
7. An apparatus for establishing a baseline for detecting microsatellite instability, said apparatus comprising:
the microsatellite locus searching module is used for comparing the sequencing data of the sample to be tested with the human reference genome sequence to obtain all available microsatellite loci;
the candidate microsatellite locus screening module is used for counting the average coverage depth baseline of each microsatellite locus in each control blood cell sample by using the sequencing data of a plurality of control blood cell samples and reserving the microsatellite locus of which the average coverage depth baseline meets a depth threshold value as a candidate microsatellite locus;
a baseline establishing module, configured to calculate, using the candidate microsatellite loci and the average coverage depth baseline, an average coverage depth and a number of peaks of each candidate microsatellite locus in respective sequencing data of a plurality of positive samples and a plurality of negative samples, and find out the candidate microsatellite loci where the number of peaks is significantly different among the plurality of positive samples and the plurality of negative samples, as detection microsatellite loci, where the average coverage depth and the number of peaks of each detection microsatellite locus in the plurality of positive samples and the plurality of negative samples form a baseline for detecting instability of the microsatellite;
the peak number refers to an insertion deletion condition statistic value of each candidate microsatellite locus, namely the number of read segment length types of each candidate microsatellite locus, wherein the support reads of each read segment length type is more than 3;
the positive sample is a microsatellite high-frequency unstable MSI-H sample, and the negative sample is a microsatellite stable MSS sample.
8. An apparatus for modeling detection of microsatellite instability, said apparatus comprising:
the microsatellite locus searching module, the candidate microsatellite locus screening module, the baseline establishing module and the machine learning modeling module in the apparatus for establishing a baseline for detecting microsatellite instability as set forth in claim 7,
the machine learning modeling module is used for modeling the average coverage depth and the number of peaks in the positive samples and the negative samples in the baseline by utilizing a machine learning algorithm to obtain a model for detecting the instability of the microsatellite.
9. An apparatus for detecting microsatellite instability, said apparatus comprising:
the microsatellite locus searching module, the candidate microsatellite locus screening module and the baseline establishing module of the apparatus for establishing a baseline for detecting microsatellite instability according to claim 7, and the machine learning modeling module of the apparatus for establishing a model for detecting microsatellite instability according to claim 8;
the detection module is used for detecting the number of peaks of each detected microsatellite locus in the sequencing data of a sample to be detected;
and the prediction module is used for analyzing the number of peaks of each detected microsatellite locus in the sequencing data of the sample to be detected by utilizing the model for detecting the instability of the microsatellite, so as to obtain the result of the unstable state of the microsatellite of the sample to be detected.
10. A storage medium comprising a stored program, wherein the program, when executed, controls a device on which the storage medium is located to perform the method of establishing a baseline for detecting microsatellite instability as claimed in any one of claims 1 to 3, or to perform the method of establishing a model for detecting microsatellite instability as claimed in claim 4 or 5, or to perform the method of detecting microsatellite instability as claimed in claim 6.
11. A processor for running a program, wherein the program is run to perform the method of establishing a baseline for detecting microsatellite instability as claimed in any one of claims 1 to 3 or the method of establishing a model for detecting microsatellite instability as claimed in claim 4 or 5 or the method of detecting microsatellite instability as claimed in claim 6.
CN201910833273.7A 2019-09-04 2019-09-04 Method for establishing baseline and model for detecting instability of microsatellite and application Active CN110570907B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910833273.7A CN110570907B (en) 2019-09-04 2019-09-04 Method for establishing baseline and model for detecting instability of microsatellite and application

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910833273.7A CN110570907B (en) 2019-09-04 2019-09-04 Method for establishing baseline and model for detecting instability of microsatellite and application

Publications (2)

Publication Number Publication Date
CN110570907A CN110570907A (en) 2019-12-13
CN110570907B true CN110570907B (en) 2021-07-30

Family

ID=68777794

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910833273.7A Active CN110570907B (en) 2019-09-04 2019-09-04 Method for establishing baseline and model for detecting instability of microsatellite and application

Country Status (1)

Country Link
CN (1) CN110570907B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110797078A (en) * 2020-01-06 2020-02-14 北京吉因加科技有限公司 Method and device for constructing microsatellite unstable site screening and analyzing model
CN111583999B (en) * 2020-04-24 2023-08-18 北京优迅医学检验实验室有限公司 Method, device and application for establishing baseline for detecting microsatellite instability
CN111785324B (en) * 2020-07-02 2021-02-02 深圳市海普洛斯生物科技有限公司 Microsatellite instability analysis method and device
CN112365922B (en) * 2021-01-13 2021-06-15 臻和(北京)生物科技有限公司 Microsatellite locus for detecting MSI, screening method and application thereof
CN113744251B (en) * 2021-09-07 2023-08-29 上海桐树生物科技有限公司 Method for predicting microsatellite instability from pathological pictures based on self-attention mechanism
CN114708916B (en) * 2022-03-15 2023-11-10 至本医疗科技(上海)有限公司 Method and device for detecting stability of microsatellite, computer equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104428425A (en) * 2012-05-04 2015-03-18 考利达基因组股份有限公司 Methods for determining absolute genome-wide copy number variations of complex tumors
CN114272371A (en) * 2015-07-29 2022-04-05 诺华股份有限公司 Combination therapy comprising anti-PD-1 antibody molecules
CN109182525B (en) * 2018-09-29 2019-09-06 广州燃石医学检验所有限公司 A kind of microsatellite biomarker combinations, detection kit and application thereof
CN109207594B (en) * 2018-09-29 2020-09-25 广州燃石医学检验所有限公司 Method for detecting microsatellite stability state and genome change through plasma based on next generation sequencing

Also Published As

Publication number Publication date
CN110570907A (en) 2019-12-13


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Method and application of establishing baseline and model for detecting microsatellite instability

Effective date of registration: 20220105

Granted publication date: 20210730

Pledgee: Beijing ustron Tongsheng financing Company limited by guarantee

Pledgor: Beijing Xiangxin Biotechnology Co.,Ltd.

Registration number: Y2022990000003

PC01 Cancellation of the registration of the contract for pledge of patent right

Date of cancellation: 20240103

Granted publication date: 20210730

Pledgee: Beijing ustron Tongsheng financing Company limited by guarantee

Pledgor: Beijing Xiangxin Biotechnology Co.,Ltd.

Registration number: Y2022990000003

PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: The Method and Application of Establishing Baselines and Models for Detecting Unstable Microsatellites

Effective date of registration: 20240103

Granted publication date: 20210730

Pledgee: Beijing ustron Tongsheng financing Company limited by guarantee

Pledgor: Beijing Xiangxin Biotechnology Co.,Ltd.

Registration number: Y2023990000651
