CN117198399A

CN117198399A - Microsatellite locus, system and kit for predicting MSI state

Info

Publication number: CN117198399A
Application number: CN202311224358.8A
Authority: CN
Inventors: 裘迈宁; 陈丽衫; 郎秋蕾; 韩斐然
Original assignee: Hangzhou Link Care Medical Laboratory Co ltd
Current assignee: Hangzhou Link Care Medical Laboratory Co ltd
Priority date: 2023-09-21
Filing date: 2023-09-21
Publication date: 2023-12-08
Anticipated expiration: 2043-09-21
Also published as: CN117198399B

Abstract

The application discloses marker microsatellite loci, a system and a kit for predicting the overall MSI status, and belongs to the technical field of MSI detection. According to the application, a regression model is built from known microsatellite loci and used to predict the stability of 22 marker microsatellite loci, and a machine learning model is then built to predict the overall MSI status. Using the application, the overall MSI status can be predicted across different NGS detection platforms and different cancer types by analyzing next-generation sequencing data, without matched normal samples. The method is stable and fast, gives accurate and highly reproducible results, and reduces the limitations of detection.

Description

Microsatellite locus, system and kit for predicting MSI state

Technical Field

The application belongs to the technical field of MSI detection, and particularly relates to marker microsatellite loci, a system and a kit for predicting MSI status.

Background

Microsatellite instability (MSI) is a phenomenon in which the length of a microsatellite (MS) sequence changes because of insertion or deletion mutations during DNA replication; it is usually caused by defects in mismatch repair (MMR) function. MS sequences are short, repetitive DNA sequences whose repeat units generally consist of 1-6 nucleotides arranged in tandem; dinucleotide (CA/GA/GT) and mononucleotide (A/T) repeats are the most common. MS sequences can be located in important non-coding regions or in coding regions of genes; they are polymorphic, distributed throughout the genome, and vary greatly between individuals. Cells normally repair mismatches that arise during genome replication by means of mismatch repair factors. In some cases a mismatch repair factor, such as MLH1, MSH2 or MSH6, loses its function, the cell fails to repair the mismatches, and the cell develops an MSI phenotype. The resulting change in the number of repeat units may lengthen or shorten the microsatellite.

Microsatellite instability is relatively common in tumors. The microsatellite instability status reflects the cause and development of a tumor, and can play an important role in auxiliary diagnosis and medication guidance across different cancer types. In general, microsatellite status is classified as microsatellite instability-high (MSI-H), microsatellite instability-low (MSI-L) or microsatellite stable (MSS). In cancer types such as colorectal cancer, endometrial cancer and gastric cancer, patients with MSI-H status differ significantly from other patients in survival, medication preference, prognosis after palliative treatment, and the like. MSI detection can help doctors fully understand the cancer type and propose an appropriate diagnosis and treatment plan.

Another important role of MSI detection is to aid in screening for Lynch syndrome, a hereditary syndrome caused by germline mutations in mismatch repair genes and characterized by colorectal, endometrial and gastric cancers that occur at a younger age, together with related cancers in the family. MSI status correlates with Lynch syndrome to some extent, and most patients with stable microsatellites do not have Lynch syndrome. Traditionally, if a patient shows the features described above, a physician will often consider detailed testing. If immunohistochemistry is used, two experienced pathologists are required to assess the result jointly, and the accuracy is low. If MSI detection is used, testing is more convenient and more accurate. When the patient is judged to be MSI-H, comprehensive genetic diagnosis can follow to determine whether the patient has Lynch syndrome.

Because of the sequence characteristics of microsatellite loci, sequencing a single locus is not very reliable; MSI detection based on next-generation sequencing usually examines hundreds of loci, and microsatellite status can be detected by multi-site clone detection or by genome sequencing. The multi-site cloning method is commonly used in current MSI detection: the lengths of five to several dozen microsatellite regions highly correlated with MSI status are checked for changes, each locus is judged stable or unstable according to the degree of change, and the higher the proportion of unstable loci, the more likely the cells are in an MSI state. This method is inexpensive and has a short experimental workflow. However, many current tools that predict MSI status from NGS data, such as mSINGS and MANTIS, require matched normal samples.

Disclosure of Invention

The main purpose of the application is to provide a method for predicting the overall MSI status from liquid-phase capture next-generation sequencing data of microsatellite loci without matched normal samples; the method is computationally convenient, stable, fast, accurate and highly reproducible, and reduces the limitations of detection.

In order to achieve the above purpose, the technical scheme provided by the application is as follows:

A first aspect of the present application provides marker microsatellite loci for determining the overall MSI status of a sample to be tested, the marker microsatellite loci comprising:

Microsatellite (MS) refers to short tandem repeats (STRs) in the human genome, including mononucleotide repeats, dinucleotide repeats and repeats of longer units. Microsatellite instability (MSI) refers to any change in the length of a microsatellite in tumor tissue, relative to normal tissue, caused by insertion or deletion of repeat units. In the present application, the overall MSI status refers to the status of all microsatellites in a sample to be tested; a microsatellite locus is a specific microsatellite, and the marker microsatellite loci are the microsatellites used to judge the overall MSI status of the sample to be tested.

Microsatellites comprising other single nucleotide repeats and/or dinucleotide repeats may also be selected or included as marker microsatellite loci in the present application.

In the present application, the number of repeats is the standard value of the marker microsatellite locus length ($L$), i.e., the number of repeats listed in the table above (the standard number of repeats).

In the present application, because a microsatellite changes length through insertion or deletion of repeat units, the length of a microsatellite is expressed as the number of times the repeat unit is repeated. For example, if the repeat unit has 2 bases and is repeated 10 times, the sequence is 20 bases long, but the length of the microsatellite locus is taken to be 10.

For a microsatellite locus, each repeat unit has three possible states: deletion, unchanged and insertion; that is, the copy number of each repeat unit may be 0, 1 or 2. On this basis, the length of each microsatellite locus lies in the range $[0, 2L]$, where $L$ is the standard value of the locus length.

In a second aspect, the application provides the use of a detection reagent for labelling microsatellite loci according to the first aspect of the application in the preparation of a kit for determining the overall status of MSI of a sample to be tested.

A third aspect of the application provides a system for determining the MSI overall status of a sample to be tested, comprising the following modules:

the system comprises a marker microsatellite locus data input module, a first data processing module and a second data processing module, wherein the marker microsatellite locus data input module is used for receiving stability data of the marker microsatellite loci of the first aspect of the application for a sample to be tested, the stability status of each locus being either stable or unstable;

an MSI global state storage module for storing stability data and MSI global state data for the tagged microsatellite loci of a second population sample, the MSI global state comprising MSI-H, MSI-L and MSS;

the MSI overall state determining module is respectively connected with the data input module and the MSI overall state storage module and is used for constructing a prediction model by utilizing the stability data of the marked microsatellite loci of the second population samples and determining the MSI overall state of the sample to be detected based on the stability data of the marked microsatellite loci of the sample to be detected obtained from the marked microsatellite locus stability prediction module.

In some embodiments of the present application, the stability data of the tagged microsatellite loci refer to whether there is a change in the length of the tagged microsatellite loci, and if a significant change in the length of a tagged microsatellite locus occurs, i.e., the length of the tagged microsatellite locus is significantly changed due to insertion or deletion of a repeat unit, the tagged microsatellite locus is unstable; in contrast, a tagged microsatellite locus is stable if the length of the tagged microsatellite locus is unchanged, i.e., there is no change in length of the tagged microsatellite locus due to insertion or deletion of a repeat unit or if the length change is insignificant even if insertion or deletion of a repeat unit is present.

In the present application, the MSI-H, MSI-L and MSS have the following meanings:

(1) MSS: microsatellite stability, i.e., all microsatellite loci are stable.

(2) MSI-L: low-frequency microsatellite instability, i.e., the number or proportion of unstable microsatellite loci is below a preset threshold.

(3) MSI-H: high-frequency microsatellite instability, i.e., the number or proportion of unstable microsatellite loci is not less than a preset threshold.

In some embodiments of the application, the preset threshold may be an absolute value of the quantity, or may be a proportion, for example 30%.
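The three definitions above can be sketched as a simple rule, assuming the preset threshold is given as a proportion (30% here, following the example in the text). The function name `classify_msi` and its arguments are illustrative and not part of the application; the system itself predicts the overall state with a trained model rather than with this rule alone.

```python
from typing import Iterable


def classify_msi(unstable_flags: Iterable[bool], threshold: float = 0.30) -> str:
    """Map per-locus stability calls to MSS / MSI-L / MSI-H.

    unstable_flags: one value per marker locus, True if the locus is unstable.
    threshold: assumed proportion cut-off (0.30 mirrors the 30% example).
    """
    flags = [bool(flag) for flag in unstable_flags]
    if not flags:
        raise ValueError("at least one locus is required")
    fraction_unstable = sum(flags) / len(flags)
    if fraction_unstable == 0:
        return "MSS"  # all loci are stable
    if fraction_unstable >= threshold:
        return "MSI-H"  # proportion of unstable loci is not less than the threshold
    return "MSI-L"  # some loci are unstable, but fewer than the threshold


if __name__ == "__main__":
    print(classify_msi([False] * 22))
    print(classify_msi([True] * 6 + [False] * 16))
    print(classify_msi([True] * 7 + [False] * 15))
```

With 22 marker loci and a 30% threshold, six unstable loci (about 27%) fall below the cut-off and seven (about 32%) reach it.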

In some embodiments of the application, in the MSI overall state determining module, constructing the prediction model using the stability data of the marker microsatellite loci of the second population sample comprises the following steps:

S21, randomly dividing the stability data of the marker microsatellite loci of the second population sample into two groups, one being a second training set and the other a second test set, each group comprising stability data of the marker microsatellite loci of MSI-H, MSI-L and MSS samples;

S22, constructing an MSI overall state prediction model based on a machine learning algorithm using the second training set data;

S23, verifying the obtained MSI overall state prediction model on the second test set.

In some embodiments of the application, the machine learning algorithm is selected from any one of the following algorithms:

random forest algorithm, neural network algorithm, support vector machine algorithm, Bayesian classification algorithm, gradient boosting algorithm, K-nearest neighbor algorithm and decision tree algorithm.

In some preferred embodiments of the application, the MSI global state prediction model is trained using a random forest model.
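Steps S21 to S23 with a random forest can be sketched as follows. This is a minimal illustration using scikit-learn on synthetic data: the helper names, the 7:3 split, the number of trees and the simulated population are all assumptions made for the example and are not taken from the application.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def make_synthetic_population(n_samples=300, n_loci=22, seed=0):
    """Hypothetical stand-in for the second population sample."""
    rng = np.random.default_rng(seed)
    # Each synthetic sample gets a per-locus instability rate.
    rates = rng.choice([0.0, 0.15, 0.7], size=n_samples)
    # 1 = unstable locus, 0 = stable locus.
    X = (rng.random((n_samples, n_loci)) < rates[:, None]).astype(int)
    fraction = X.mean(axis=1)
    # Labels follow the 30% example threshold (an assumption for this demo).
    y = np.where(fraction == 0, "MSS",
                 np.where(fraction >= 0.30, "MSI-H", "MSI-L"))
    return X, y


def train_msi_forest(X, y, seed=0):
    # S21: random split into training and test sets, keeping all three
    # classes in both groups (stratified; the 7:3 ratio is assumed).
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)
    # S22: fit the overall-state model on the training set.
    model = RandomForestClassifier(n_estimators=200, random_state=seed)
    model.fit(X_train, y_train)
    # S23: verify on the held-out test set.
    return model, model.score(X_test, y_test)


if __name__ == "__main__":
    X, y = make_synthetic_population()
    model, accuracy = train_msi_forest(X, y)
    print(sorted(model.classes_), round(accuracy, 3))
```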

In some embodiments of the application, the stability data of the tagged microsatellite loci obtained in the tagged microsatellite locus data input module is obtained by a PCR-based method.

Further, the system further comprises:

a sequencing data input module, used for inputting capture sequencing data of a target region containing the marker microsatellite loci in the sample to be tested and obtaining the peak number, kurtosis, skewness, standard deviation and standard offset of the marker microsatellite locus length;

a microsatellite locus storage module, used for storing the peak number, kurtosis, skewness, standard deviation and standard offset of the known microsatellite locus length of the first population sample, together with the stability data;

a marker microsatellite locus stability prediction module, connected to the sequencing data input module, the microsatellite locus storage module and the marker microsatellite locus data input module respectively, used for constructing a marker microsatellite locus stability prediction model from the peak number, kurtosis, skewness, standard deviation and standard offset of the known microsatellite locus length of the first population sample together with the stability data, for predicting the stability of the marker microsatellite loci from the peak number, kurtosis, skewness, standard deviation and standard offset of the marker microsatellite locus length obtained from the sequencing data input module, and for outputting the stability data of the marker microsatellite loci to the marker microsatellite locus data input module.

In some embodiments of the application, the known microsatellite loci comprise:

In some embodiments of the application, constructing the marker microsatellite locus stability prediction model from the peak number, kurtosis, skewness, standard deviation and standard offset of the known microsatellite locus length together with the stability data comprises the following steps:

S11, randomly dividing the peak number, kurtosis, skewness, standard deviation, standard offset and stability data of the known microsatellite locus length of the first population sample into two groups, one being a first training set and the other a first test set;

S12, constructing a marker microsatellite locus stability prediction model based on a regression algorithm using the first training set data;

S13, verifying the obtained marker microsatellite locus stability prediction model on the first test set.

In some embodiments of the application, the regression algorithm is selected from any one of the following algorithms: logistic regression algorithm, linear regression algorithm.

In some embodiments of the present application, for any marker microsatellite locus, the peak number of the locus length is first counted with a peak-finding algorithm, and the skewness Skew, kurtosis Kurt, standard deviation S and standard offset P of the locus length are then calculated:

Skewness measures the asymmetry of the probability distribution of a random variable, i.e., the degree of asymmetry relative to the mean; the skewness coefficient indicates the degree and direction of asymmetry of the marker microsatellite locus length distribution. Skewness is measured relative to the normal distribution, whose skewness is 0; that is, the skewness is 0 if the length distribution of the marker microsatellite locus is symmetric. If the skewness is greater than 0, the distribution is right-skewed, i.e., it has a long tail on the right; if the skewness is less than 0, the distribution is left-skewed, i.e., it has a long tail on the left. The larger the absolute value of the skewness, the more strongly the distribution is shifted.

The skewness calculation formula is:

wherein $x_i$ is the number of reads when the marker microsatellite locus length is $i$, $i \in [1, n]$; $\bar{x}$ is the mean of the numbers of reads over the different lengths of the marker microsatellite locus; $n$ is the number of length classes of the marker microsatellite locus, i.e., how many different lengths there are, $n = 2L$; and $L$ is the standard value of the marker microsatellite locus length.
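The formula itself is not reproduced in this text. As an assumption, a standard moment-based skewness written only with the quantities defined here ($x_i$ the number of reads at length $i$, $\bar{x}$ their mean, $n$ the number of length classes) would be:

$$\mathrm{Skew} = \frac{\frac{1}{n}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^{3}}{\left(\frac{1}{n}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^{2}\right)^{3/2}}, \qquad \bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i$$

The original may use a different normalization, for example $n-1$ in place of $n$.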

Kurtosis is a statistic that describes how peaked or flat a data distribution is; the kurtosis coefficient indicates whether the marker microsatellite locus length distribution is steeper or flatter than a normal distribution. If the kurtosis equals 3, the distribution has the same peakedness as a normal distribution; if the kurtosis is greater than 3, the distribution is steep (tall and sharp); if the kurtosis is less than 3, the distribution is flat (short and broad). The kurtosis calculation formula is:

wherein $x_i$ is the number of reads when the marker microsatellite locus length is $i$, $i \in [1, n]$; $\bar{x}$ is the mean of the numbers of reads over the different lengths of the marker microsatellite locus; $n$ is the number of length classes of the marker microsatellite locus, $n = 2L$; and $L$ is the standard value of the marker microsatellite locus length.
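The kurtosis formula is likewise not reproduced in this text. Because the description takes 3 as the value for a normal distribution, a non-excess (Pearson) form is assumed here, written with $x_i$ the number of reads at length $i$, $\bar{x}$ their mean and $n$ the number of length classes:

$$\mathrm{Kurt} = \frac{\frac{1}{n}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^{4}}{\left(\frac{1}{n}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^{2}\right)^{2}}$$

As with skewness, the original may apply a sample-size correction that this sketch omits.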

The standard deviation reflects the degree of dispersion of the length of the marked microsatellite loci, and the larger the value is, the more the value is dispersed, namely the larger the difference between different lengths of the marked microsatellite loci is.

The standard deviation calculation formula is:

wherein $x_i$ is the number of reads when the marker microsatellite locus length is $i$, $i \in [1, n]$; $\bar{x}$ is the mean of the numbers of reads over the different lengths of the marker microsatellite locus; $y_i$ is the normalized value of the number of reads when the microsatellite locus length is $i$; $\bar{y}$ is the mean of $y_i$; $n$ is the number of length classes of the marker microsatellite locus, $n = 2L$; and $L$ is the standard value of the marker microsatellite locus length.

The standard offset is the degree to which the marker microsatellite locus length deviates from the standard value of the marker microsatellite locus length.

The standard offset calculation formula is:

wherein $x_i$ is the number of reads when the marker microsatellite locus length is $i$, $i \in [1, n]$; $y_i$ is the normalized value of the number of reads when the microsatellite locus length is $i$; $n$ is the number of length classes of the marker microsatellite locus, $n = 2L$; and $L$ is the standard value of the marker microsatellite locus length.
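The skewness, kurtosis and standard deviation described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the application's formulas: population moments (division by n) are used, and the normalization is assumed to be each read count divided by the total read count. The standard offset P is left out because its formula cannot be recovered from this text. The function name `length_features` is hypothetical.

```python
import math
from typing import Dict, Sequence


def length_features(read_counts: Sequence[float]) -> Dict[str, float]:
    """Compute Skew, Kurt and S from per-length read counts.

    read_counts[i - 1] is the number of reads whose locus length is i,
    for i = 1..n (n = 2L length classes).
    """
    n = len(read_counts)
    total = sum(read_counts)
    if n == 0 or total <= 0:
        raise ValueError("read_counts must contain at least one read")

    mean_x = total / n
    # Central moments of the read counts (population form, assumed).
    m2 = sum((x - mean_x) ** 2 for x in read_counts) / n
    m3 = sum((x - mean_x) ** 3 for x in read_counts) / n
    m4 = sum((x - mean_x) ** 4 for x in read_counts) / n
    if m2 == 0:
        raise ValueError("all length classes have the same read count")

    # ASSUMPTION: y_i = x_i / sum(x), so that the y_i sum to 1.
    y = [x / total for x in read_counts]
    mean_y = sum(y) / n
    std_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / n)

    return {
        "skew": m3 / m2 ** 1.5,
        "kurt": m4 / m2 ** 2,  # equals 3 for a normal distribution
        "std": std_y,
    }


if __name__ == "__main__":
    # Hypothetical read counts over 8 length classes with a single peak.
    print(length_features([0, 2, 30, 400, 35, 3, 0, 0]))
```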

In some embodiments of the application, in the sequencing data input module, the peak number, kurtosis, skewness, standard deviation and standard offset of the marker microsatellite locus length are calculated only if the capture sequencing depth of the target region reaches 400×. Specifically, the numbers of reads of different lengths of the marker microsatellite loci are calculated from the result file of Msisensor2.

In a fourth aspect, the application provides a kit for determining the overall status of MSI of a sample to be tested, comprising a detection reagent for labelling microsatellite loci according to the first aspect of the application.

In the present application, the "first population sample" and the "second population sample" are distinguished only in name. The data of the first population sample include the peak number, kurtosis, skewness, standard deviation and standard offset of the known microsatellite locus length of each sample together with the stability data, and are used to construct the marker microsatellite locus stability prediction model based on a regression algorithm; the data of the second population sample include the stability data of the marker microsatellite loci and the MSI overall state data, and are used to construct the MSI overall state prediction model.

In the present application, the sample to be tested is derived from a human and is preferably a tumor sample, such as fresh tissue or a formalin-fixed paraffin-embedded (FFPE) tissue block.

The beneficial effects of the application are that

Compared with the prior art, the application has the following beneficial effects:

by utilizing the application, MSI overall state prediction can be performed in different NGS detection platforms and different cancer species through analysis of second generation sequencing data without normal sample pairing.

The method for predicting the MSI overall state is stable and rapid, accurate in result and high in repeatability, and reduces detection limitation.

Drawings

FIG. 1 shows a length distribution of a microsatellite loci in example 3 of the present application.

FIG. 2 shows the MSI-PCR capillary electrophoresis detection result for a microsatellite locus in Example 4 of the present application. A: tumor sample; B: blood sample.

FIG. 3 shows the performance evaluation of the logistic regression model established in Example 4 of the present application for predicting microsatellite locus stability.

FIG. 4 shows the performance evaluation of the random forest model established in Example 5 of the present application for predicting the MSI overall state.

FIG. 5 shows the verification result on the external data set in Example 6 of the present application.

FIG. 6 is a schematic diagram of the system for determining the MSI overall state of a sample to be tested constructed in Example 7 of the present application.

Detailed Description

Unless otherwise indicated, implied by the context, or customary in the art, all parts and percentages in the present application are based on weight, and the test and characterization methods used are those current as of the filing date of the present application. Where applicable, the disclosure of any patent, patent application or publication referred to in this application is incorporated by reference in its entirety, and equivalent patents of those cited are likewise incorporated by reference, particularly with respect to the definitions of relevant terms in the art. If the definition of a particular term disclosed in the prior art is inconsistent with any definition provided in the present application, the definition of the term provided in the present application controls.

The numerical ranges in the present application are approximations, so that it may include the numerical values outside the range unless otherwise indicated. The numerical range includes all values from the lower value to the upper value that increase by 1 unit, provided that there is a spacing of at least 2 units between any lower value and any higher value. For ranges containing values less than 1 or containing fractions greater than 1 (e.g., 1.1,1.5, etc.), then 1 unit is suitably considered to be 0.0001,0.001,0.01, or 0.1. For a range containing units of less than 10 (e.g., 1 to 5), 1 unit is generally considered to be 0.1. These are merely specific examples of what is intended to be provided, and all possible combinations of numerical values between the lowest value and the highest value enumerated are to be considered to be expressly stated in this disclosure.

The terms "comprises," "comprising," "including," and their derivatives do not exclude the presence of any other component, step or process, whether or not such other component, step or process is disclosed in the present application. For the avoidance of any doubt, all uses of the terms "comprising", "including" or "having" herein, unless expressly stated otherwise, may include any additional additive, adjuvant or compound. In contrast, the term "consisting essentially of" excludes any other component, step or process from the scope of what follows, except those necessary for operability. The term "consisting of" excludes any component, step or process not specifically described or listed. The term "or" refers to the listed members individually or in any combination, unless explicitly stated otherwise.

In order to make the technical problems, technical schemes and beneficial effects solved by the application more clear, the application is further described in detail below with reference to the embodiments.

Examples

The following examples are presented herein to demonstrate preferred embodiments of the present application. It will be appreciated by those skilled in the art that the techniques disclosed in the examples which follow represent techniques discovered by the inventor to function in the practice of the application, and thus can be considered to constitute preferred modes for its practice. Those of skill in the art should, in light of the present disclosure, appreciate that many changes can be made in the specific embodiments which are disclosed and still obtain a like or similar result without departing from the spirit or scope of the application.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.

Those skilled in the art will recognize, or be able to ascertain using no more than routine experimentation, many equivalents to the specific embodiments of the application described herein. Such equivalents are intended to be encompassed by the claims.

The experimental methods in the following examples are conventional methods unless otherwise specified. The instruments used in the following examples are laboratory conventional instruments unless otherwise specified; the test materials used in the examples described below, unless otherwise specified, were purchased from conventional biochemical reagent stores.

Example 1 selection of microsatellite loci

800 clinical tumor patient tissue samples containing colorectal cancer, lung cancer and other cancer species were collected for MSI prediction.

For the selection of microsatellite loci, this example used a large gene panel (shown in Table 1) covering 169 genes, developed with VariantBaits™ technology and associated with solid tumors such as colorectal, lung, endometrial and gastric cancers. DNA extracted from each sample was hybrid-captured with target-region probes and a library was constructed to obtain the corresponding sample library; the library was then subjected to next-generation sequencing to obtain raw data in FASTQ format.

TABLE 1 list of 169 genes

Example 2 sample library-based sequencing and sequence alignment

The raw data obtained in Example 1 were quality-controlled with the quality control software picard: sequencing adapters, low-quality bases, erroneous sequencing fragments and the like were filtered out, yielding high-quality data (clean data).

The clean data were aligned with the sequence alignment software BWA-MEM to obtain the genomic position of each sequence (reference genome: hg19); sorting and deduplication were then performed with samtools and sambamba to obtain the original sample BAM file.

The original sample BAM file was input into Msisensor2 for analysis to obtain the numbers of reads of different lengths of the repetitive base sequences in the target region, i.e., the sample data set.

Example 3 microsatellite locus stability State prediction

The target region must be sequenced to a depth of at least 400×; subsequent calculations are performed only if this depth is reached.

Take a particular microsatellite locus as an example: chr2_47641559_CAGGT_27[A]_GGGTT (see Table 2).

Table 2 Information on chr2_47641559_CAGGT_27[A]_GGGTT

The length distribution is shown in Table 3 and FIG. 1.

Table 3 Length distribution of chr2_47641559_CAGGT_27[A]_GGGTT

In this example, five features of the length of the repetitive base sequence in the target region are used to determine the MSI state: the peak number N, the skewness Skew, the kurtosis Kurt, the standard deviation S and the standard offset P.

For a specific repetitive base sequence, the following was calculated.

Peak count calculation:

The sequencing depth of the target region is counted, and the number of peaks in the microsatellite locus length distribution is obtained with a peak-finding algorithm. Specifically, this is implemented with SciPy: peak prominence is calculated to calibrate the peak count and remove noise.
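A prominence-based peak count of this kind can be sketched with `scipy.signal.find_peaks`. The prominence cut-off (5% of the tallest peak) and the function name `count_peaks` are assumptions chosen for illustration; the application does not state which threshold it uses.

```python
import numpy as np
from scipy.signal import find_peaks


def count_peaks(read_counts, min_prominence_fraction=0.05):
    """Count peaks in a per-length read-count histogram.

    min_prominence_fraction is a hypothetical noise cut-off: peaks whose
    prominence is below this fraction of the maximum count are ignored.
    """
    counts = np.asarray(read_counts, dtype=float)
    if counts.size == 0 or counts.max() <= 0:
        return 0
    # Pad with zeros so that a peak in the first or last length class
    # still has a neighbour on each side and can be detected.
    padded = np.concatenate(([0.0], counts, [0.0]))
    peaks, _ = find_peaks(
        padded, prominence=min_prominence_fraction * counts.max())
    return len(peaks)


if __name__ == "__main__":
    print(count_peaks([0, 2, 30, 400, 35, 3, 0, 0]))       # single peak
    print(count_peaks([0, 5, 200, 10, 3, 8, 150, 4, 0]))   # two peaks
```

Small bumps next to a dominant peak have low prominence and are filtered out, which is the noise-removal effect the text describes.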

In this example, the number of peaks of chr2_47641559_CAGGT_27[A]_GGGTT is 1.

Then, using the stable length $L$ of the known microsatellite locus (here 27), the theoretical maximum length of the microsatellite locus is $2L$, i.e., 54; thus, for chr2_47641559_CAGGT_27[A]_GGGTT, the number of length classes is $n = 54$ and $i \in [1, 54]$.
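The relation between the locus identifier and the number of length classes can be illustrated with a small parser. The field layout (chromosome, position, left flank, repeat count, repeat unit, right flank) is inferred from the identifier used in this example and should be treated as an assumption; the names `Locus` and `parse_locus` are hypothetical.

```python
import re
from dataclasses import dataclass

# Assumed layout: chrom_position_leftFlank_repeatCount[unit]_rightFlank
LOCUS_PATTERN = re.compile(
    r"^(?P<chrom>chr[0-9A-Za-z]+)_(?P<pos>\d+)_(?P<left>[ACGT]+)_"
    r"(?P<repeats>\d+)\[(?P<unit>[ACGT]+)\]_(?P<right>[ACGT]+)$"
)


@dataclass(frozen=True)
class Locus:
    chrom: str
    position: int
    left_flank: str
    standard_length: int  # L, the standard number of repeats
    unit: str
    right_flank: str

    @property
    def n_length_classes(self) -> int:
        # n = 2L, the theoretical maximum length of the locus
        return 2 * self.standard_length


def parse_locus(identifier: str) -> Locus:
    match = LOCUS_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"unrecognized locus identifier: {identifier!r}")
    return Locus(
        chrom=match["chrom"],
        position=int(match["pos"]),
        left_flank=match["left"],
        standard_length=int(match["repeats"]),
        unit=match["unit"],
        right_flank=match["right"],
    )


if __name__ == "__main__":
    locus = parse_locus("chr2_47641559_CAGGT_27[A]_GGGTT")
    print(locus.standard_length, locus.n_length_classes)
```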

The skewness calculation formula is:

the kurtosis calculation formula is:

the standard deviation calculation formula is:

the standard offset calculation formula is:

wherein $x_i$ is the number of reads when the microsatellite locus length is $i$; $\bar{x}$ is the mean of the numbers of reads over the different lengths of the microsatellite locus; $y_i$ is the normalized value of the number of reads when the microsatellite locus length is $i$; and $\bar{y}$ is the mean of $y_i$.

According to the above formulas, the skewness, kurtosis, standard deviation and standard offset of the chr2_47641559_CAGGT_27[A]_GGGTT length are 2.80, 9.87, 0.05 and 19.82, respectively.

Example 4 establishing a logistic regression model to predict the stability of microsatellite loci

In this embodiment, a model is built using 6 microsatellite loci with known stability data to predict the stability of microsatellite loci with unknown stability data. The microsatellite loci of the 6 known stability data are shown in Table 4.

Table 4 6 known microsatellite loci

To establish the microsatellite locus stability model, verify the MSI prediction results and evaluate their performance, the inventors performed MSI-PCR capillary electrophoresis detection on the samples, used the result as the gold standard, and compared the consistency between the MSI prediction results and the MSI-PCR capillary electrophoresis results. The kit used for MSI-PCR capillary electrophoresis detection was the Tongshu microsatellite instability (MSI) detection kit (multiplex fluorescent PCR-capillary electrophoresis method); the state of each microsatellite locus in each sample and the overall MSI state of the sample were obtained from the MSI-PCR capillary electrophoresis detection.

Specifically, according to the MSI-PCR capillary electrophoresis results, each microsatellite locus is classified as either stable or unstable. If a microsatellite locus has changed, i.e., its length has changed significantly because of insertion or deletion of repeat units, the locus is unstable; conversely, if a microsatellite locus is unchanged, i.e., its length has not changed through insertion or deletion of repeat units, or the length change is insignificant even though an insertion or deletion is present, the locus is stable.

As shown in FIG. 2, the MSI-PCR capillary electrophoresis result for the BAT25 locus (chr4:55598211) shows that the locus is homozygous and that, compared with the paired sample, the tumor sample has an additional group of shifted peaks (shown by the dotted line in the figure); the locus is therefore unstable in the tumor sample and stable in the paired sample (blood).

Further, according to the stability degree of the microsatellite loci, the MSI total state of the sample is divided into three states of MSI-H, MSI-L, MSS, which correspond to the three states of microsatellite high instability, microsatellite low instability and microsatellite stability respectively.

MSS: all microsatellite loci are stable.

MSI-L: fewer than 30% of the microsatellite loci are unstable.

MSI-H: 30% or more of the microsatellite loci are unstable.
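The three-state rule above can be sketched in Python as follows. This is a minimal illustration only; the function name `classify_msi_state` and the boolean encoding (True = unstable) are assumptions, not part of the application.

```python
def classify_msi_state(locus_unstable):
    """Map per-locus calls (True = unstable) to MSS, MSI-L or MSI-H."""
    if not locus_unstable:
        raise ValueError("at least one locus call is required")
    fraction = sum(locus_unstable) / len(locus_unstable)
    if fraction == 0:
        return "MSS"      # every locus is stable
    if fraction < 0.30:
        return "MSI-L"    # some, but fewer than 30%, are unstable
    return "MSI-H"        # 30% or more are unstable


# With 6 loci, one unstable locus is 1/6 (about 17%), two are 2/6 (about 33%).
one_of_six = classify_msi_state([True, False, False, False, False, False])
two_of_six = classify_msi_state([True, True, False, False, False, False])
```

With a 6-locus panel, a single unstable locus therefore gives MSI-L, while two or more give MSI-H.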

In this example, MSI-PCR capillary electrophoresis data for 800 samples were obtained, including stability data for 6 microsatellite loci and MSI overall status data for 800 samples.

The MSI-PCR capillary electrophoresis stability data for the 6 microsatellite loci of the 800 samples were processed, and samples whose NGS results failed quality control were removed, giving 3780 microsatellite locus stability records in total. These records were divided into a training set and a test set at a ratio of 7:3 by stratified random sampling. For the training set, the lengths of the 6 microsatellite loci were calculated from the Msisensor2 result files obtained in Example 2, and the peak number N, kurtosis (kurt), skewness (skew), standard deviation S and standard offset P of the microsatellite locus lengths were obtained by the method of Example 3 to generate the training data. Unstable microsatellite locus samples were labeled 1 and stable ones 0, forming training set 1. Test set 1 was produced in the same way. A logistic regression model was then trained on training set 1.
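A stratified 7:3 split of this kind can be sketched in plain Python as below. The helper name `stratified_split`, the fixed seed and the toy label counts are assumptions made for illustration; the application does not state which tool performed the split.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_idx, test_idx) with class proportions kept in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)                       # random order within the class
        cut = round(len(members) * train_fraction)
        train_idx.extend(members[:cut])
        test_idx.extend(members[cut:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels only (0 = stable, 1 = unstable); not the data of the application.
labels = [0] * 150 + [1] * 10
train_idx, test_idx = stratified_split(labels)
```

Splitting each class separately keeps the rare unstable records represented in both the training and the test set.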

Using the logistic regression model obtained from training set 1, the stability of the microsatellite loci in test set 1 was predicted and the prediction accuracy was evaluated against the MSI-PCR capillary electrophoresis results. The model was then corrected: noise was removed and the model parameters and solver were adjusted. In the final model, the regularization penalty is L2, the maximum number of iterations is 5000, and the task is binary classification. Because unstable microsatellite locus samples in the actual training set are far fewer than stable ones (a stable-to-unstable ratio of 15:1), the class weighting of the loss function is set to balanced and the model is solved with the L-BFGS algorithm, giving the preferred logistic regression model1, whose performance evaluation is shown in FIG. 3.
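The effect of a balanced class weighting on a 15:1 imbalance can be illustrated with the short sketch below. It uses the common heuristic n_samples / (n_classes × class_count), which is what scikit-learn applies for `class_weight="balanced"`; whether the application used scikit-learn is an assumption, and the function name `balanced_class_weights` is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Toy labels with 15 stable (0) records for every unstable (1) record.
weights = balanced_class_weights([0] * 15 + [1])
ratio = weights[1] / weights[0]   # the rare class is up-weighted 15-fold
```

If scikit-learn were used, the settings described above would correspond to something like `LogisticRegression(penalty="l2", solver="lbfgs", max_iter=5000, class_weight="balanced")`, again as an assumption about the implementation.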

Based on the logistic regression model1 obtained in training set 1, the stability of all 22 selected microsatellite loci was calculated to obtain a microsatellite locus dataset as shown in Table 5.

Table 5. The 22 microsatellite loci

Example 5 Establishment of a random forest model for predicting the MSI overall state of a sample

The dataset of the 22 microsatellite loci was divided into a training set and a test set at a ratio of 7:3 by stratified random sampling, giving training set 2 and test set 2. A random forest model was trained on training set 2 with the class weighting set to balanced, giving the random forest model of training set 2.

Using the random forest model obtained from training set 2, the MSI overall state of test set 2 was predicted, and the accuracy of the predicted MSI overall state of the test samples was evaluated against the MSI-PCR capillary electrophoresis results. These steps were repeated until the preferred random forest model2 was obtained; its performance evaluation is shown in FIG. 4.

Example 6 Application of the prediction models

143 samples were collected, and their NGS sequencing data were analyzed and processed to give an external dataset. The microsatellite locus stability and MSI overall state of the external dataset were predicted with logistic regression model1 and random forest model2 and validated against the MSI overall state data obtained by MSI-PCR capillary electrophoresis as in Example 4. The results are shown in Table 6.

Table 6. Validation results for the 143 external samples

From this, the sensitivity of the predictive model of the application was 1 and the specificity was 0.92.
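For reference, sensitivity and specificity can be computed from paired reference and predicted labels as sketched below. Treating MSI-H as the positive class is an assumption, the function name is hypothetical, and the labels are toy values, not the Table 6 data.

```python
def sensitivity_specificity(truth, predicted, positive="MSI-H"):
    """Return (sensitivity, specificity) for one positive class."""
    tp = fn = tn = fp = 0
    for t, p in zip(truth, predicted):
        if t == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both positive and negative reference samples are needed")
    return tp / (tp + fn), tn / (tn + fp)


# Toy labels only: every true MSI-H is found, one MSS is over-called.
truth = ["MSI-H", "MSI-H", "MSS", "MSS", "MSS", "MSS"]
predicted = ["MSI-H", "MSI-H", "MSS", "MSI-H", "MSS", "MSS"]
sens, spec = sensitivity_specificity(truth, predicted)
```

A sensitivity of 1 means no reference-positive sample was missed, while a specificity below 1 means some reference-negative samples were called positive.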

In clinical practice, when guiding medication, MSI-L and MSS are generally grouped into a single class, MSI-L/MSS. The ROC curve of the resulting model is shown in FIG. 5, with an AUC of 0.99.
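The grouping of MSI-L with MSS and the meaning of the AUC can be sketched as follows. The AUC is computed here as the probability that a positive sample receives a higher score than a negative one; the function names, the scores and the labels are illustrative assumptions, not values from the application.

```python
def collapse_state(state):
    """Group MSI-L with MSS (0) and keep MSI-H as the positive class (1)."""
    return 1 if state == "MSI-H" else 0


def roc_auc(y_true, scores):
    """AUC as the chance that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy reference states and hypothetical model scores for the MSI-H class.
states = ["MSI-H", "MSS", "MSI-L", "MSI-H", "MSS"]
scores = [0.9, 0.1, 0.4, 0.35, 0.2]
auc = roc_auc([collapse_state(s) for s in states], scores)
```

In this toy case 5 of the 6 positive-negative pairs are ranked correctly, so the AUC is 5/6.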

Example 7 System for determining MSI Overall State of sample under test

The system of this example for determining the MSI overall state of a sample to be tested is built on the method described above and, as shown in FIG. 6, comprises:

(1) A sequencing data input module, used to input capture sequencing data of the target region containing the 22 microsatellite loci in the sample to be tested and to obtain the peak number, kurtosis, skewness, standard deviation and standard offset of the lengths of the 22 microsatellite loci.

Quality control is performed on the raw capture sequencing data with the quality-control software picard, filtering out sequencing adapters, low-quality bases, erroneous sequencing fragments and the like to obtain high-quality data (clean data).

The original sample bam file is input into the software Msisensor2 for analysis to obtain the lengths of the repeated base sequences (microsatellite loci) in the target region.

For a given microsatellite locus, the peak number of the microsatellite locus length is first counted with a peak-finding algorithm, and the skewness, kurtosis, standard deviation and standard offset of the microsatellite locus length are then calculated with the following formulas:

the skewness calculation formula is:

the kurtosis calculation formula is:

the standard deviation calculation formula is:

the standard offset calculation formula is:

wherein […] is the number of reads when the microsatellite locus length is […], with […] = [1, […]]; […] is the mean of the numbers of reads over the different lengths of the microsatellite locus; […] is the normalized value of the number of reads when the microsatellite locus length is […], […]; […] is the mean of […]; n is the number of length classes of the microsatellite locus, […] = 2[…]; and […] is the standard value of the microsatellite locus length.
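As a rough illustration of these per-locus features, the Python sketch below applies the conventional population-moment definitions of standard deviation, skewness and excess kurtosis to the read counts per length class, and counts peaks as strict interior local maxima. The function name, the peak rule and the choice of population moments are assumptions and may differ from the formulas of the application; the standard offset is left out because its definition is not recoverable from this text.

```python
import math


def length_distribution_features(read_counts):
    """read_counts[i] is the number of reads supporting the i-th length class."""
    n = len(read_counts)
    if n < 2:
        raise ValueError("at least two length classes are required")
    mean = sum(read_counts) / n
    # Central moments of the read counts (population form, divide by n).
    m2 = sum((x - mean) ** 2 for x in read_counts) / n
    m3 = sum((x - mean) ** 3 for x in read_counts) / n
    m4 = sum((x - mean) ** 4 for x in read_counts) / n
    if m2 == 0:
        raise ValueError("read counts are constant; moments are undefined")
    # Assumed peak rule: a class strictly higher than both neighbours.
    peaks = sum(
        1
        for i in range(1, n - 1)
        if read_counts[i] > read_counts[i - 1] and read_counts[i] > read_counts[i + 1]
    )
    return {
        "peak_number": peaks,
        "std": math.sqrt(m2),
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3.0,   # excess kurtosis
    }


# Toy histograms: one clean peak versus two separated peaks.
single = length_distribution_features([1, 5, 20, 60, 20, 5, 1])
double = length_distribution_features([2, 30, 8, 3, 25, 4])
```

A second group of shifted peaks, as in the tumor trace of FIG. 2, would show up here as a higher peak number.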

(2) A microsatellite locus storage module, used to store the peak number, kurtosis, skewness, standard deviation and standard offset of the lengths of the 6 microsatellite loci of the first population sample, together with their stability data;

(3) A marked microsatellite locus stability prediction module, connected to the sequencing data input module, the microsatellite locus storage module and the marked microsatellite locus data input module, respectively. It is used to construct a stability prediction model for the 22 microsatellite loci from the peak number, kurtosis, skewness, standard deviation and standard offset of the lengths of the 6 microsatellite loci of the first population sample together with their stability data, to predict the stability of the 22 microsatellite loci from the peak number, kurtosis, skewness, standard deviation and standard offset of their lengths obtained from the sequencing data input module, and to output the stability data of the 22 microsatellite loci to the marked microsatellite locus data input module.

Constructing the marked microsatellite locus stability prediction model from the peak number, kurtosis, skewness, standard deviation and standard offset of the known microsatellite locus lengths and the stability data comprises the following steps:

S12, constructing a marked microsatellite locus stability prediction model based on a logistic regression algorithm by using the first training set data;

(4) A marked microsatellite locus data input module, used to receive the stability data of the 22 microsatellite loci of the sample to be tested, each locus being either stable or unstable;

(5) An MSI overall state storage module, used to store the stability data of the marked microsatellite loci and the MSI overall state data of the second population sample, the MSI overall state being MSI-H, MSI-L or MSS;

(6) An MSI overall state determining module, connected to the data input module and the MSI overall state storage module, respectively. It is used to construct a prediction model from the stability data of the marked microsatellite loci of the second population sample and to determine the MSI overall state of the sample to be tested from the stability data of its marked microsatellite loci obtained from the marked microsatellite locus stability prediction module.

The method for constructing the prediction model by using the stability data of the marked microsatellite loci of the population samples comprises the following steps:

S21, randomly dividing the stability data of the marked microsatellite loci of the second population sample into two groups, one being a second training set and the other a second test set, each group containing stability data of the marked microsatellite loci of MSI-H samples, MSI-L samples and MSS samples;

S22, constructing an MSI overall state prediction model based on a random forest algorithm by using the second training set data;
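The data flow between modules (1) to (6) can be summarized by the two-stage sketch below: per-locus features go through a locus-level model, and the resulting stability vector goes through a sample-level model. The function `predict_overall_state` and both toy models are hypothetical stand-ins for the trained logistic regression and random forest models, which are not reproduced here.

```python
from typing import Callable, Mapping, Sequence


def predict_overall_state(
    features_by_locus: Mapping[str, Sequence[float]],
    locus_model: Callable[[Sequence[float]], int],
    state_model: Callable[[Sequence[int]], str],
) -> str:
    """Stage 1: per-locus stability calls. Stage 2: overall state of the sample."""
    # Fix the locus order so the state model always sees the same layout.
    loci = sorted(features_by_locus)
    stability = [locus_model(features_by_locus[name]) for name in loci]
    return state_model(stability)


# Hypothetical stand-ins for the trained models, for illustration only.
def toy_locus_model(features):
    return 1 if features[0] >= 2 else 0   # call "unstable" on two or more peaks


def toy_state_model(stability):
    return "MSI-H" if sum(stability) / len(stability) >= 0.30 else "MSI-L/MSS"


# Feature order assumed: peak number, kurtosis, skewness, std, standard offset.
sample = {
    "locus_a": (3, 1.2, 0.8, 9.0, 1.5),
    "locus_b": (1, -0.4, 0.1, 2.0, 0.2),
    "locus_c": (1, -0.2, 0.0, 1.8, 0.1),
}
result = predict_overall_state(sample, toy_locus_model, toy_state_model)
```

Keeping the two stages behind plain callables mirrors the module separation: either model can be retrained or replaced without changing the other.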

All documents mentioned in this disclosure are incorporated by reference in this disclosure as if each were individually incorporated by reference. Further, it will be appreciated that various changes and modifications may be made by those skilled in the art after reading the above teachings, and such equivalents are intended to fall within the scope of the application as defined in the appended claims.

Claims

1. Tagged microsatellite loci for determining the MSI overall state of a sample to be tested, said tagged microsatellite loci comprising:

。

2. Use of a detection reagent for the tagged microsatellite loci according to claim 1 in the preparation of a kit for determining the MSI overall state of a sample to be tested.

3. A system for determining the MSI overall state of a sample to be tested, comprising the following modules:

a tagged microsatellite locus data input module for receiving stability data of the tagged microsatellite loci of claim 1 of a sample to be tested, each locus being either stable or unstable;

4. A system according to claim 3, wherein in the MSI overall state determination module, the constructing of a prediction model using stability data of the tagged microsatellite loci of the population sample comprises the steps of:

S23, in the second test set, verifying the obtained prediction model.

5. The system of claim 4, wherein the machine learning algorithm is selected from any one of the following algorithms:

6. The system of claim 3, wherein the stability data for the tagged microsatellite loci obtained in the tagged microsatellite locus data entry module is obtained by a PCR-based method.

7. A system according to claim 3, wherein the system further comprises:

a sequencing data input module, used to input capture sequencing data of the target region of the tagged microsatellite loci of the sample to be tested and to obtain the peak number, kurtosis, skewness, standard deviation and standard offset of the tagged microsatellite locus lengths;

a microsatellite locus storage module, used to store the peak number, kurtosis, skewness, standard deviation and standard offset of the lengths of known microsatellite loci of a first population sample, together with their stability data; the known microsatellite loci include:

8. The system of claim 7, wherein in the tagged microsatellite locus stability prediction module, the constructing a tagged microsatellite locus stability prediction model comprises the steps of:

9. The system of claim 7 or 8, wherein for a particular tagged microsatellite locus:

the peak number of the tagged microsatellite locus length is counted with a peak-finding algorithm, and the skewness, kurtosis, standard deviation and standard offset of the tagged microsatellite locus length are calculated with the following formulas:

the skewness calculation formula is:

the kurtosis calculation formula is:

the standard deviation calculation formula is:

the standard offset calculation formula is:

wherein […] is the number of reads when the tagged microsatellite locus length is […], with […] = [1, […]]; […] is the mean of the numbers of reads over the different lengths of the tagged microsatellite locus; […] is the normalized value of the number of reads when the microsatellite locus length is […], […]; […] is the mean of […]; […] is the number of length classes of the tagged microsatellite locus, […] = 2[…]; and […] is the standard value of the tagged microsatellite locus length.

10. A kit for determining the MSI overall state of a sample to be tested, comprising a detection reagent for the tagged microsatellite loci according to claim 1.