CN113488105B - Microsatellite locus based on amplicon next-generation sequencing MSI detection, screening method and application thereof - Google Patents

Publication number
CN113488105B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111046575.3A
Other languages
Chinese (zh)
Other versions
CN113488105A (en)
Inventor
赵利利
郑露露
王璐
魏丽
许青
杜波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhenyue Biotechnology Jiangsu Co ltd
Zhenhe Beijing Biotechnology Co ltd
Original Assignee
Zhenyue Biotechnology Jiangsu Co ltd
Zhenhe Beijing Biotechnology Co ltd

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869: Methods for sequencing

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Engineering & Computer Science (AREA)
  • Organic Chemistry (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Wood Science & Technology (AREA)
  • Zoology (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Analytical Chemistry (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Immunology (AREA)
  • Microbiology (AREA)
  • Molecular Biology (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biochemistry (AREA)
  • General Engineering & Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention provides microsatellite loci for MSI detection based on amplicon next-generation sequencing, a screening method and applications thereof. The screening method comprises the following steps: selecting microsatellite loci that are 7-15 bp single-base repeats of A or T and whose flanking sequences have a similarity value below a similarity threshold, and recording them as a first locus set; designing the amplification primers and sequencing direction for each microsatellite locus according to the read-length requirement of the sequencing reads, so that the sequencing reads completely span each microsatellite locus region; obtaining sequencing data of the amplicon library and counting the repeat unit types and their frequencies for each locus in the first locus set; selecting, as a second locus set, the loci whose most frequent repeat unit type is consistent with the reference sequence and whose polymorphism in the population is below 5%; and counting and retaining the loci in the second locus set whose deletion ratio differs significantly between the negative and positive sample groups. These loci have higher sensitivity and specificity.

Description

Microsatellite locus based on amplicon next-generation sequencing MSI detection, screening method and application thereof
Technical Field
The invention relates to the field of high-throughput sequencing data analysis, and in particular to microsatellite loci for MSI detection based on amplicon next-generation sequencing, a screening method and applications thereof.
Background
Microsatellite instability (MSI) is a phenotypic manifestation of mismatch repair (MMR) deficiency and is increasingly used as a biomarker in clinical tumor diagnosis and therapy. In clinical practice, the MSI status of a sample is mainly determined by MSI-PCR. This method uses fluorescently labeled primers and capillary electrophoresis to determine the fragment length polymorphism of the five loci of the Promega panel: NR-21, NR-24, BAT-25, BAT-26 and MONO-27. Compared with MSI-PCR, NGS-based MSI detection has the advantages that MSI can be detected together with other targets in a customized target region (targeted panel), that the tumor purity requirement is low, and that no control sample is needed.
MSI detection based on hybrid-capture next-generation sequencing has developed rapidly; a variety of methods and tools have been published and applied to clinical testing and scientific research. Next-generation sequencing based on amplicon enrichment mainly uses multiplex PCR to specifically amplify and enrich multiple target region sequences, yielding amplicons of the target regions, which are then sequenced to obtain the sequence information of the target regions.
Compared with hybrid-capture sequencing, next-generation sequencing based on amplicon enrichment has a simpler experimental workflow, fewer manual steps, a shorter library preparation time, higher sequencing depth, lower input requirements and lower cost, and is therefore increasingly widely used. However, no method or tool currently exists for MSI analysis based on amplicon next-generation sequencing.
Disclosure of Invention
The main object of the invention is to provide microsatellite loci for MSI detection based on amplicon next-generation sequencing, a screening method and applications thereof, so as to solve the problem that the prior art offers no way to detect MSI from amplicon-based next-generation sequencing data.
To achieve the above object, according to one aspect of the present invention, there is provided a method for screening microsatellite loci for MSI detection based on amplicon next-generation sequencing, the method comprising: selecting the microsatellite loci that meet a first condition and recording them as a first locus set, wherein the first condition comprises: a. the locus is a 7-15 bp single-base repeat of A or T; b. the similarity value between the single-base repeat and its flanking sequences is below a similarity threshold; c. an amplification primer and a sequencing direction can be designed for the locus according to the read-length requirement of the sequencing reads such that the sequencing reads completely span the microsatellite locus region; obtaining sequencing data of amplicon libraries of a plurality of microsatellite-stable samples, identifying the first locus set in the sequencing data of each microsatellite-stable sample, and counting the repeat unit types of each microsatellite locus in the first locus set and the frequency of each repeat unit type; selecting, from the first locus set, the microsatellite loci that satisfy a second condition as a second locus set, the second condition comprising: 1) the most frequent repeat unit type is consistent with the reference sequence; 2) the polymorphism in the population is less than 5%; and, using a negative sample group consisting of a plurality of microsatellite-stable samples and a positive sample group consisting of a plurality of microsatellite-unstable samples, counting and retaining the microsatellite loci in the second locus set whose deletion ratio differs significantly between the negative sample group and the positive sample group; wherein the deletion ratio is the ratio of the number of spanning reads whose repeat unit type is shorter than the reference sequence of the microsatellite locus to the total number of spanning reads of the microsatellite locus, and spanning reads are reads that cover the microsatellite locus and at least 2 bp on each of its left and right sides.
Further, the similarity value is calculated as $\sum (d_2 + 1 - d_1)/d_2$, where $d_1$ is the distance to the microsatellite locus of each base, within the flanking sequences of set length on the left and right sides, that is identical to the base of the microsatellite locus, $d_2$ is the set length, and $d_2$ is 8-12 bp.
Further, the similarity threshold is 1.5-2.5.
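The similarity value described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `similarity_value`, its parameters, and the convention that the base adjacent to the locus has distance d1 = 1 are assumptions made for the example; d2 = 10 is one value from the stated 8-12 bp range.

```python
def similarity_value(left_flank, right_flank, repeat_base, d2=10):
    """Sum (d2 + 1 - d1) / d2 over flanking bases equal to repeat_base.

    left_flank / right_flank are the sequences immediately upstream and
    downstream of the repeat; only the d2 bases nearest the locus are used.
    Assumption: the base adjacent to the locus has distance d1 = 1.
    """
    base = repeat_base.upper()
    left = left_flank.upper()[-d2:]    # the d2 bases closest to the locus
    right = right_flank.upper()[:d2]
    total = 0.0
    # Walk the left flank outward from the locus, so d1 counts from 1.
    for d1, b in enumerate(reversed(left), start=1):
        if b == base:
            total += (d2 + 1 - d1) / d2
    # The right flank already starts next to the locus.
    for d1, b in enumerate(right, start=1):
        if b == base:
            total += (d2 + 1 - d1) / d2
    return total


# Hypothetical flanks around a poly-A locus.
value = similarity_value("GCGCGCGCGA", "CAGCGCGCGC", "A")
keep_locus = value < 2.0  # 2.0 is one value within the stated 1.5-2.5 range
```

Bases close to the repeat contribute more to the sum than distant ones, so a low value indicates flanks that are unlikely to be confused with the repeat itself.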
Further, obtaining sequencing data of the amplicon libraries of the plurality of microsatellite-stable samples, identifying the first locus set in the sequencing data of each microsatellite-stable sample, and counting the repeat unit types of each microsatellite locus in the first locus set and the frequency of each repeat unit type comprises: aligning the sequencing data of the amplicon library of each microsatellite-stable sample to the reference genome sequence to obtain an alignment result; locating the first locus set in the alignment result, and extracting from it the spanning reads covering each microsatellite locus in the first locus set; and counting the repeat unit types of each microsatellite locus and the frequency of each repeat unit type, wherein the frequency is the ratio of the number of spanning reads of a given repeat unit type at the microsatellite locus to the total number of spanning reads covering the microsatellite locus.
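The frequency statistic can be illustrated with a short Python sketch. It assumes that the repeat length observed in each spanning read at one locus has already been extracted into a list; the function names and the input layout are hypothetical and chosen only for the example.

```python
from collections import Counter


def repeat_type_frequencies(repeat_lengths):
    """Map each repeat unit type (observed repeat length) to its frequency.

    repeat_lengths holds one observed length per spanning read at a locus.
    """
    counts = Counter(repeat_lengths)
    total = sum(counts.values())
    return {length: n / total for length, n in counts.items()}


def most_frequent_type(repeat_lengths):
    """Return the repeat length seen in the largest number of spanning reads."""
    return Counter(repeat_lengths).most_common(1)[0][0]


# Hypothetical locus with a 10 bp reference repeat.
lengths = [10] * 8 + [9] * 2
freqs = repeat_type_frequencies(lengths)
matches_reference = most_frequent_type(lengths) == 10
```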
Further, a nonparametric test is used to count and retain the microsatellite loci in the second locus set whose deletion ratio differs significantly between the negative sample group and the positive sample group; preferably, the nonparametric test is the Wilcoxon rank-sum test (wilcox test); preferably, a microsatellite locus with a significant difference is one with a p-value < 0.05.
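A rank-sum comparison of deletion ratios between the two groups might look like the sketch below. The text names only the test and the p-value cutoff; the two-sided normal approximation with tie correction and no continuity correction used here is an assumption of this example, as are the function name and the sample values. In practice an established statistics package would normally be used instead.

```python
import math


def rank_sum_p_value(group_a, group_b):
    """Two-sided rank-sum p-value via the normal approximation (assumption)."""
    n1, n2 = len(group_a), len(group_b)
    if n1 == 0 or n2 == 0:
        raise ValueError("both groups must be non-empty")
    n = n1 + n2
    # Pool the values, remembering which group each one came from.
    pooled = sorted([(v, 0) for v in group_a] + [(v, 1) for v in group_b])
    rank_sum_a = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        mid_rank = (i + 1 + j) / 2  # average of ranks i+1 .. j
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_a += mid_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_a - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0:
        return 1.0  # all values tied: no evidence of a difference
    z = (u - mean_u) / math.sqrt(var_u)
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical deletion ratios at one locus.
negative = [0.010, 0.020, 0.015, 0.012, 0.018, 0.011, 0.013, 0.017]
positive = [0.20, 0.30, 0.25, 0.22, 0.28, 0.21, 0.23, 0.27]
retain_locus = rank_sum_p_value(negative, positive) < 0.05
```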
According to a second aspect of the present application, there is provided a baseline construction method for detecting MSI, the construction method comprising: obtaining sequencing data of amplicon libraries of a plurality of known microsatellite-stable samples, first extracting the reads whose 5' end is an amplicon primer sequence according to the enrichment characteristics of amplicons, and counting, for each sample, the spanning coverage of each microsatellite locus and the number of spanning reads of each repeat unit type; wherein the microsatellite loci are loci for MSI detection based on amplicon next-generation sequencing selected by any one of the screening methods above; where the spanning coverage reaches a saturation value, calculating the deletion ratio of each microsatellite locus from the number of spanning reads of each repeat unit type at that locus in each sample, removing the microsatellite loci that are polymorphic in the microsatellite-stable samples, and then obtaining the mean and standard deviation of the deletion ratio across all samples at each microsatellite locus, thereby constructing the baseline of the deletion ratio of each microsatellite locus; wherein the spanning coverage is the sum of the numbers of spanning reads covering the different repeat unit types at each microsatellite locus.
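The baseline step can be sketched as below. Several details are assumptions of this example rather than statements of the text: the input layout, the function name, the use of the sample standard deviation, and the `saturation` argument, whose actual value is not given in this section. Removal of polymorphic loci is not shown.

```python
from statistics import mean, stdev


def build_baseline(observations, saturation):
    """Per-locus (mean, SD) of the deletion ratio across stable samples.

    observations: {locus: [(spanning_coverage, deletion_ratio), ...]},
    one tuple per microsatellite-stable sample (hypothetical layout).
    saturation: minimum spanning coverage for a sample to be counted.
    """
    baseline = {}
    for locus, per_sample in observations.items():
        ratios = [ratio for coverage, ratio in per_sample if coverage >= saturation]
        if len(ratios) >= 2:  # a standard deviation needs at least two values
            baseline[locus] = (mean(ratios), stdev(ratios))
    return baseline


# Hypothetical data: the third sample at locus_1 fails the coverage check.
observations = {
    "locus_1": [(500, 0.01), (600, 0.03), (20, 0.50)],
    "locus_2": [(10, 0.10)],
}
baseline = build_baseline(observations, saturation=100)
```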
According to a third aspect of the present application, there is provided a method of detecting the microsatellite status, the method comprising: obtaining sequencing data of a sample to be tested based on the amplicon library and calculating the spanning coverage and the deletion ratio of each microsatellite locus in the sample to be tested; if the spanning coverage of a microsatellite locus reaches a saturation value, the microsatellite locus passes quality control; comparing the deletion ratio of each microsatellite locus of the sample to be tested with the baseline constructed by the above construction method; if the deletion ratio of locus i in the sample to be tested is greater than $\mathrm{mean}(D_i) + n \times \mathrm{SD}(D_i)$, where $\mathrm{mean}(D_i)$ and $\mathrm{SD}(D_i)$ are the baseline mean and standard deviation of the deletion ratio at locus i and $3.5 \le n \le 4.4$, determining that the microsatellite locus is unstable; and judging the microsatellite status of the sample to be tested according to the following conditions: (1) if the number n1 of microsatellite loci passing quality control is greater than or equal to 15, the number of unstable loci is n2, and n2/n1 is greater than or equal to 0.1, the microsatellite status of the sample to be tested is judged to be MSI-H; (2) if the number n1 of microsatellite loci passing quality control is greater than or equal to 15, the number of unstable loci is n2, and n2/n1 is less than 0.1, the microsatellite status of the sample to be tested is judged to be MSS; (3) if the number n1 of loci passing quality control is less than 15, the microsatellite status of the sample to be tested is judged to be undetermined; wherein the microsatellite loci are loci for MSI detection based on amplicon next-generation sequencing selected by any one of the screening methods above.
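The decision rule above can be expressed compactly in Python. The sketch below is illustrative only: the function name, the data layout and the `saturation` argument are hypothetical, and the default n = 4.0 is simply one value inside the stated range of 3.5 to 4.4.

```python
def call_msi_status(sample, baseline, saturation, n=4.0,
                    min_loci=15, unstable_fraction=0.1):
    """Return "MSI-H", "MSS" or "undetermined" for one sample.

    sample:   {locus: (spanning_coverage, deletion_ratio)}
    baseline: {locus: (mean_deletion_ratio, sd_deletion_ratio)}
    """
    passed = 0    # n1: loci passing quality control
    unstable = 0  # n2: loci called unstable
    for locus, (coverage, ratio) in sample.items():
        if locus not in baseline or coverage < saturation:
            continue  # locus fails quality control
        passed += 1
        mean_d, sd_d = baseline[locus]
        if ratio > mean_d + n * sd_d:
            unstable += 1
    if passed < min_loci:
        return "undetermined"
    return "MSI-H" if unstable / passed >= unstable_fraction else "MSS"


# Hypothetical example: 20 loci, 3 of them with a strongly elevated ratio.
baseline = {f"locus_{i}": (0.02, 0.005) for i in range(20)}
sample = {f"locus_{i}": (500, 0.2 if i < 3 else 0.02) for i in range(20)}
status = call_msi_status(sample, baseline, saturation=100)
```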
According to a fourth aspect of the present application, there is provided a screening apparatus for microsatellite loci for MSI detection based on amplicon next-generation sequencing, the screening apparatus comprising: a first locus set acquisition module, configured to select the microsatellite loci that meet a first condition and record them as a first locus set, wherein the first condition comprises: a. the locus is a 7-15 bp single-base repeat of A or T; b. the similarity value between the single-base repeat and its flanking sequences is below a similarity threshold; c. an amplification primer and a sequencing direction can be designed for the locus according to the read-length requirement of the sequencing reads such that the sequencing reads corresponding to the amplicon of the amplification primer completely span the microsatellite locus region; a repeat unit type and frequency counting module, configured to obtain sequencing data of amplicon libraries of a plurality of microsatellite-stable samples, identify the first locus set in the sequencing data of each microsatellite-stable sample, and count the repeat unit types of each microsatellite locus in the first locus set and the frequency of each repeat unit type; a second locus set acquisition module, configured to select, from the first locus set, the microsatellite loci that satisfy a second condition as a second locus set, wherein the second condition comprises: 1) the most frequent repeat unit type is consistent with the reference sequence; 2) the polymorphism in the population is less than 5%; and a locus screening module, configured to use a negative sample group consisting of a plurality of microsatellite-stable samples and a positive sample group consisting of a plurality of microsatellite-unstable samples to count and retain the microsatellite loci in the second locus set whose deletion ratio differs significantly between the negative sample group and the positive sample group; wherein the deletion ratio is the ratio of the number of spanning reads whose repeat unit type is shorter than the reference sequence of the microsatellite locus to the total number of spanning reads of the microsatellite locus, and spanning reads are reads that completely cover the microsatellite locus and at least 2 bp on each of its left and right sides.
Further, the similarity value is calculated as $\sum (d_2 + 1 - d_1)/d_2$, where $d_1$ is the distance to the microsatellite locus of each base, within the flanking sequences of set length on the left and right sides, that is identical to the base of the microsatellite locus, $d_2$ is the set length, and $d_2$ is 8-12 bp.
Further, the similarity threshold is 1.5-2.5.
Further, the repeat unit type and frequency counting module comprises: an alignment module, configured to align the sequencing data of the amplicon library of each microsatellite-stable sample to the reference genome sequence to obtain an initial alignment result; a positioning module, configured to extract from the initial alignment result the reads whose 5' end is an amplicon primer sequence, according to the enrichment characteristics of amplicons, so as to eliminate alignment errors caused by similar sequences and obtain a corrected alignment result; a spanning reads extraction module, configured to locate the first locus set in the corrected alignment result and extract the spanning reads covering each microsatellite locus in the first locus set; and a counting module, configured to count the repeat unit types of each microsatellite locus and the frequency of each repeat unit type, wherein the frequency is the ratio of the number of spanning reads of a given repeat unit type at the microsatellite locus to the total number of spanning reads covering the microsatellite locus.
Further, the locus screening module uses a nonparametric test to count and retain the microsatellite loci in the second locus set whose deletion ratio differs significantly between the negative sample group and the positive sample group; preferably, the nonparametric test is the Wilcoxon rank-sum test (wilcox test).
According to a fifth aspect of the present application, there is provided a baseline construction apparatus for detecting MSI, the construction apparatus comprising: an acquisition and counting module, configured to obtain sequencing data of amplicon libraries of a plurality of known MSS samples and to count, for each sample, the spanning coverage of each microsatellite locus and the number of spanning reads covering each repeat unit type, wherein the microsatellite loci are loci for MSI detection based on amplicon next-generation sequencing selected by any one of the screening methods above; and a baseline construction module, configured to calculate, where the spanning coverage reaches a saturation value, the deletion ratio of each microsatellite locus from the number of spanning reads of each repeat unit type at that locus in each MSS sample, to remove the microsatellite loci that are polymorphic in the microsatellite-stable samples, and then to obtain the mean and standard deviation of the deletion ratio across all samples at each microsatellite locus, thereby constructing the baseline of the deletion ratio of each microsatellite locus; wherein the spanning coverage is the sum of the numbers of spanning reads covering the different repeat unit types at each microsatellite locus.
According to a sixth aspect of the present application, there is provided a detection apparatus for the microsatellite status, the detection apparatus comprising: an acquisition and calculation module, configured to obtain sequencing data of the sample to be tested based on the amplicon library and to calculate the spanning coverage and the deletion ratio of each microsatellite locus in the sample to be tested; a quality control module, configured to judge that a microsatellite locus passes quality control when its spanning coverage reaches a saturation value; a comparison module, configured to compare the deletion ratio of each microsatellite locus of the sample to be tested with the baseline constructed by the above construction method; an unstable locus determination module, configured to determine that a microsatellite locus is unstable if its deletion ratio in the sample to be tested is greater than $\mathrm{mean}(D_i) + n \times \mathrm{SD}(D_i)$, where $3.5 \le n \le 4.4$; and a microsatellite status judging module, configured to judge the microsatellite status of the sample to be tested according to the following conditions: (1) if the number n1 of loci passing quality control is greater than or equal to 15, the number of unstable loci is n2, and n2/n1 is greater than or equal to 0.1, the microsatellite status of the sample to be tested is judged to be MSI-H; (2) if the number n1 of loci passing quality control is greater than or equal to 15, the number of unstable loci is n2, and n2/n1 is less than 0.1, the microsatellite status of the sample to be tested is judged to be MSS; (3) if the number n1 of loci passing quality control is less than 15, the microsatellite status of the sample to be tested is judged to be undetermined; wherein the microsatellite loci are loci for MSI detection based on amplicon next-generation sequencing selected by any one of the screening methods above.
According to a seventh aspect of the present application, there are provided microsatellite loci for MSI detection based on amplicon next-generation sequencing, comprising at least 15 of the 68 microsatellite loci shown in Table 1.
According to an eighth aspect of the present application, there is provided a kit for detecting MSI, the kit comprising detection reagents for microsatellite loci for MSI detection based on amplicon next-generation sequencing, the microsatellite loci comprising at least 15 of the 68 microsatellite loci shown in Table 1.
According to a ninth aspect of the present application, there is provided a processor for running a program, wherein the program, when run, performs any one of the screening methods described above, or any one of the construction methods described above, or any one of the detection methods described above.
According to a tenth aspect of the present application, there is provided a storage medium for storing a program, wherein the program, when executed, performs any one of the screening methods described above, or any one of the construction methods described above, or any one of the detection methods described above.
By applying the technical solution of the invention, single-base repeats of A or T with a length of 7-15 bp are selected as candidate microsatellite loci, which increases the proportion of true single-base repeat fragments in the sample sequencing data; among these, loci whose flanking 10 bp sequences have a low similarity value to the microsatellite are selected to reduce the influence of sequencing errors on the result; from the loci meeting these conditions, loci with low polymorphism are selected; and finally loci with high sensitivity and high specificity are obtained through differential screening. Using these loci to detect the microsatellite status therefore yields higher sensitivity and specificity.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 illustrates the accuracy of the MSI detection method in a preferred embodiment of the present application in determining microsatellite instability;
FIG. 2 shows the lowest tumor detection limit of the MSI detection method employed in the preferred embodiment of the present application.
Detailed Description
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present invention will be described in detail with reference to examples.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances such that embodiments of the application described herein may be used. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For convenience of description, some terms or expressions referred to in the embodiments of the present application are explained below:
Spanning reads: in the present application, reads that completely cover the microsatellite locus region and at least 2 bp on each of its left and right sides.
Spanning coverage: the sum of the numbers of spanning reads of all repeat unit types at a given microsatellite locus.
Deletion ratio: in the present application, the ratio of the number of spanning reads of repeat unit types whose repeat sequence is shorter than the reference sequence to the total number of spanning reads of that microsatellite locus.
Wilcox test: a nonparametric test for detecting differences between two groups, also called the Wilcoxon rank-sum test. When the data do not satisfy the parametric assumptions of the t-test (e.g., the distribution is not normal, or the variables are inherently heavily skewed or ordinal), the t-test cannot be used and a nonparametric method is used instead.
Frequency: in the present application, the ratio of the number of spanning reads covering a given repeat unit type at a microsatellite locus to the total number of spanning reads covering that microsatellite locus.
Reference: the reference sequence, here the hg19 human reference genome; it may be the allele that is wild-type in the study population.
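The definitions of spanning reads and spanning coverage can be illustrated with a short Python sketch. The 0-based, half-open coordinate convention, the function names and the example coordinates are assumptions made for this illustration; the text specifies only the 2 bp minimum on each side.

```python
def is_spanning_read(read_start, read_end, locus_start, locus_end, min_flank=2):
    """True if the read covers the locus plus at least min_flank bp per side.

    All coordinates are 0-based and half-open: [start, end).
    """
    return (read_start <= locus_start - min_flank
            and read_end >= locus_end + min_flank)


def spanning_coverage(read_intervals, locus_start, locus_end, min_flank=2):
    """Count the aligned reads that span the locus."""
    return sum(
        1 for start, end in read_intervals
        if is_spanning_read(start, end, locus_start, locus_end, min_flank)
    )


# Hypothetical locus at [100, 110) and three aligned reads.
reads = [(50, 150), (99, 112), (98, 112)]
coverage = spanning_coverage(reads, 100, 110)
```

In this example the second read is rejected because it reaches only 1 bp into the left flank.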
As mentioned in the background section, because microsatellite loci are highly repetitive, probe design is difficult for hybrid-capture methods, and hybridization efficiency is lower than in other capture regions. Amplicon-based enrichment has a simpler experimental workflow, fewer manual steps, a shorter library preparation time, higher sequencing depth, lower input requirements and lower cost, but no method or tool currently exists for MSI analysis based on amplicon next-generation sequencing. To this end, the present application studies and improves MSI analysis from the perspective of amplicon library sequencing data, as follows:
Next-generation sequencing based on amplicon enrichment mainly uses multiplex PCR to specifically amplify and enrich multiple target region sequences, yielding amplicons of the target regions, which are then sequenced to obtain the sequence information of the target regions. Primers can therefore be designed on both sides of a microsatellite locus to efficiently enrich the target microsatellite region; in addition, the CleanPlex multiplex PCR enrichment technology with background removal (Background clearing) almost completely eliminates non-specific amplification products, which greatly improves amplification uniformity and further optimizes the sensitivity and specificity of detection.
The above is an improvement on the experimental side, at the library construction stage. On the algorithmic side, the method exploits the characteristics of an amplicon library: before the spanning reads are extracted, the reads with a primer sequence at the 5' end are extracted, which eliminates alignment errors caused by similar sequences; and the proportion, among all repeat unit types, of the shortened repeat unit types that represent microsatellite instability is used as the statistic and screening index. This further improves the sensitivity and specificity of the screened microsatellite loci; using these loci to detect the microsatellite status of a sample to be tested improves detection sensitivity and lowers the minimum detection limit of tumor purity.
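The primer-based read filter mentioned above can be sketched as follows. Exact, case-insensitive prefix matching with no mismatch tolerance is an assumption of this example, and the function name and sequences are hypothetical; the text does not state how primer matches are evaluated.

```python
def reads_with_primer(reads, primers):
    """Keep only the reads whose 5' end starts with one of the primers.

    reads:   iterable of read sequences, written 5' to 3'
    primers: iterable of amplicon primer sequences
    """
    primer_tuple = tuple(p.upper() for p in primers)
    # str.startswith accepts a tuple and matches any of its entries.
    return [r for r in reads if r.upper().startswith(primer_tuple)]


# Hypothetical primers and reads.
primers = ["ACGTAC", "TTGCAA"]
reads = ["ACGTACGGGAAAAAAAAC", "GGGACGTACAAAA", "ttgcaaCCTTTTTTTTG"]
kept = reads_with_primer(reads, primers)
```

Reads that align to the target region but do not begin with a panel primer are discarded, which is how off-target alignments from similar sequences would be excluded in this sketch.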
In the present application, a repeat unit type whose length equals that of the reference sequence is called repeat unit type R, one shorter than the reference sequence is called repeat unit type D, and one longer than the reference sequence is called repeat unit type L.
Therefore, building on previous research, the inventors further propose a method for screening and analyzing MSI loci from amplicon library sequencing data, according to the characteristics of such data. In the locus selection method, a) and b) are the same as before, and the present application focuses on improving c). The screening conditions for microsatellite loci in the present application are as follows:
according to the PCR experimental conditions and the influence of PCR on the background noise of the single base repeated fragment, selecting a) the single base repeated fragment with the length less than or equal to 15bp and the base of A or T as a candidate microsatellite locus. b) On the basis, the similarity between the flanking 10bp sequences and the microsatellite is calculated, and sites with low similarity are selected to reduce the influence of sequencing errors on the result. c) According to the reading length requirement of sequencing reads, a proper primer and directional sequencing can be designed, so that the sequencing reads can completely span the microsatellite locus region. d) Counting the polymorphism ratio of the sites, selecting the site with polymorphism less than 0.5, and improving the specificity of the site; e) sites with high sensitivity and specificity were selected. In addition, by designing proper primers and directional sequencing, the data of single-ended sequencing can be used for performing microsatellite instability analysis, so that the cost is further reduced.
The specific operation is as follows: comparing reads obtained by sequencing with a reference sequence, performing weight comparison, extracting the reads with the 5' end as an amplicon primer sequence according to the enrichment characteristic of the amplicon, then extracting the reads (namely the scanning reads) which completely cover each microsatellite locus region and at least 2bp lengths on both wings, extracting the microsatellite locus sequence in each scanning read and calculating the sequence length, wherein each different length represents a repeat unit type. Respectively calculating the number of the scanning reads (such as X, X is a natural number which is more than or equal to 2) of the covering repeating unit type R, the number of the scanning reads (such as Y, Y is a natural number which is more than or equal to 2) of the covering repeating unit type D, and the number of the scanning reads (such as Z, Z is a natural number which is more than or equal to 2) of the covering repeating unit type L, and obtaining the ratio of the number of the scanning reads of the covering repeating unit type D in all the repeating unit types of each microsatellite locus, namely the deletion ratio (such as Y/(X + Y + Z)).
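The deletion ratio described above can be illustrated with a minimal Python sketch. The function name `deletion_ratio` and the example read lengths are hypothetical and serve only to show the arithmetic Y/(X + Y + Z); they are not taken from the application.

```python
from collections import Counter


def deletion_ratio(repeat_lengths, ref_length):
    """Return Y / (X + Y + Z) for one microsatellite locus.

    repeat_lengths: repeat-tract length observed in each spanning read.
    ref_length: repeat-tract length in the reference sequence.
    """
    counts = Counter(repeat_lengths)
    x = counts[ref_length]  # reads of repeating unit type R
    y = sum(n for length, n in counts.items() if length < ref_length)  # type D
    z = sum(n for length, n in counts.items() if length > ref_length)  # type L
    total = x + y + z
    return y / total if total else 0.0


# Hypothetical locus: reference tract of 12 bp, ten spanning reads.
lengths = [12, 12, 12, 12, 12, 12, 11, 11, 10, 13]
print(deletion_ratio(lengths, 12))  # 0.3
```

Here X = 6, Y = 3 and Z = 1, so the deletion ratio is 3/10.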
Microsatellite loci with significant differences in deletion ratios were screened as sites for detecting microsatellite instability between positive MSI samples and negative MSS samples.
Finally, 68 microsatellite loci are screened, and corresponding primers and molecular sequencing directions are designed. The sensitivity and specificity were 100% for verification using 6 MSI-H samples and 28 MSS samples; meanwhile, 2 MSI-H cell line samples are used to be matched with MSS cell line samples according to different proportions, the lowest detection limit of the tumor purity of the method is simulated, and the result shows that the lowest detection line of the method for the tumor purity reaches 5%. Is far superior to 20 percent of the lowest detection line of the MSI-PCR method, and can meet the clinical requirement.
Based on the above research results, the applicant proposes a series of technical solutions of the present application.
Example 1
In this example, a method for screening microsatellite loci for MSI detection based on amplicon second-generation sequencing is provided, the method comprising:
S101, selecting microsatellite loci that satisfy a first condition and recording them as a first locus set, wherein the first condition comprises: a. the locus is an A or T single-base repeat sequence of 7-15 bp; b. its similarity value with the flanking sequences of the single-base repeat is below a similarity threshold; c. amplification primers (located on either side of the microsatellite locus) and a sequencing direction (such that single-end sequencing yields reads spanning both sides of the locus) are designed for each microsatellite locus according to the read length of the sequencing reads, and loci for which the sequencing reads can completely span the microsatellite locus region are selected;
S103, obtaining sequencing data of the amplicon libraries of a plurality of microsatellite-stable samples (the amplicon libraries being those targeting the selected microsatellite loci), locating the first locus set in the sequencing data of each microsatellite-stable sample, and counting the repeating unit types of each microsatellite locus in the first locus set and the frequency of each repeating unit type;
S105, selecting from the first locus set the microsatellite loci that satisfy a second condition as a second locus set, wherein the second condition comprises: 1) the most frequent repeating unit type is consistent with the reference sequence; 2) the polymorphism in the population is less than 5%;
S109, using a negative sample group consisting of a plurality of microsatellite-stable samples and a positive sample group consisting of a plurality of microsatellite-unstable samples, computing the deletion ratio of each microsatellite locus in the second locus set in both groups, and retaining the loci whose deletion ratio differs significantly between the negative and positive sample groups; wherein the deletion ratio is the ratio of the number of spanning reads of repeating unit types shorter than the reference sequence of the microsatellite locus to the total number of spanning reads of that locus, and spanning reads are reads covering the microsatellite locus and at least 2 bp at each of its left and right ends.
In the above screening method, single-base repeats of A or T with a length of 7-15 bp are selected as candidate microsatellite loci, which raises the proportion of true single-base repeats in the sample sequencing data. Loci whose 10 bp flanking sequences have low similarity to the microsatellite are then selected to reduce the effect of sequencing errors on the result. On this basis, amplification primers and a sequencing direction are designed for each locus according to the read length of the sequencing reads, so that the reads completely span the microsatellite locus region and single-end sequencing can be used, further reducing cost. Highly monomorphic loci are selected from those meeting these conditions, and loci with high sensitivity and high specificity are finally obtained by differential screening. Detecting microsatellite status with these loci therefore gives higher sensitivity and specificity.
In step S101, short repeating unit types and specifically A or T base repeats are selected because the background noise introduced by PCR for such repeats is low. Selecting as candidate loci the 7-15 bp single-base repeats whose flanking sequences at both ends differ clearly from the repeat significantly reduces interference of the flanking sequences with variant detection at the microsatellite locus and lowers noise.
In step S101, the amplification primers and the sequencing direction of each microsatellite locus are designed according to the read length of the sequencing reads so that the reads completely span the microsatellite locus region. For example, with single-end sequencing (read length 100 bp), the amplification primers are designed so that the amplicon covers the microsatellite locus within the length reachable by a single-end read. If the microsatellite locus is a 15 bp A repeat with a 35 bp flanking sequence on the left and a 50 bp flanking sequence on the right, sequencing can be designed to start from either the left or the right flanking sequence. An amplicon library is then constructed from the amplicons of the microsatellite loci obtained with the amplification primers.
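The read-length reasoning in this example can be checked with a short Python sketch. It assumes, as a simplification not stated in the application, that each flank length is measured from the amplicon end (primer included) and that a read must also cover at least 2 bp beyond the repeat; the function name `usable_directions` is hypothetical.

```python
def usable_directions(read_len, left_flank, repeat_len, right_flank, min_flank=2):
    """List the sequencing directions whose single-end read spans the locus.

    "+" means the read starts at the left (forward-primer) end of the amplicon,
    "-" means it starts at the right (reverse-primer) end.
    """
    directions = []
    # Read from the left: left flank + repeat + at least min_flank bases on the right.
    if right_flank >= min_flank and left_flank + repeat_len + min_flank <= read_len:
        directions.append("+")
    # Read from the right: right flank + repeat + at least min_flank bases on the left.
    if left_flank >= min_flank and right_flank + repeat_len + min_flank <= read_len:
        directions.append("-")
    return directions


# The 15 bp A repeat with 35 bp and 50 bp flanks from the text, 100 bp reads.
print(usable_directions(100, 35, 15, 50))  # ['+', '-']
# With a hypothetical 60 bp read only the left-start direction still spans the locus.
print(usable_directions(60, 35, 15, 50))   # ['+']
```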
Sequencing data in this application refers to sequencing data of the amplicon library. Because each amplicon is obtained by PCR amplification and enrichment with the same forward and reverse primers, the sequencing reads of one amplicon all have the same start position and the same end position on the reference genome, so the deduplication strategies used in capture sequencing are difficult to apply.
In the above screening method of the present application, the ratio of the types of repeating units having a reduced length representing the "microsatellite instability state" is used as a microsatellite locus instability signal to improve the detection sensitivity and lower the minimum detection limit of tumor purity.
The microsatellite stabilised sample herein preferably refers to a normal healthy sample, either a normal tissue sample or a normal lymphocyte sample, preferably a buffy coat sample. It should be noted that other stable microsatellite samples which may have a somatic SNV mutation are not excluded, although such stable microsatellite samples which have a somatic SNV mutation may have variations affecting a certain microsatellite locus.
The polymorphism in the above population can be evaluated as follows: if the frequencies of the most frequent and the second most frequent repeating unit types are similar, the locus is judged to be heterozygous and the sample is counted as a polymorphic sample. If the frequency of the most frequent repeating unit type is far greater than that of the second most frequent type, the locus is judged to be homozygous; if, in addition, this most frequent type differs from the most frequent type of most normal samples, the sample is counted as a polymorphic sample. The polymorphism ratio is the ratio of polymorphic samples to all samples.
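The evaluation rule above can be sketched in Python as follows. The application does not state how "similar" and "far greater" are quantified, so the cutoff `het_ratio=0.5` (second most frequent type at least half as frequent as the first) is purely an assumption, as are the function names and the example counts.

```python
from collections import Counter


def is_polymorphic(type_counts, common_type, het_ratio=0.5):
    """Decide whether one sample is polymorphic at one locus.

    type_counts: {repeat length: number of spanning reads} for the sample.
    common_type: most frequent repeat length in most normal samples.
    het_ratio: assumed cutoff for calling the two top types "similar".
    """
    ranked = Counter(type_counts).most_common(2)
    if not ranked:
        return False
    top_type, top_n = ranked[0]
    second_n = ranked[1][1] if len(ranked) > 1 else 0
    if second_n >= het_ratio * top_n:
        return True  # heterozygous: two types of similar frequency
    return top_type != common_type  # homozygous but unlike most samples


def polymorphism_ratio(samples, common_type, het_ratio=0.5):
    """Fraction of samples that are polymorphic at the locus."""
    flags = [is_polymorphic(s, common_type, het_ratio) for s in samples]
    return sum(flags) / len(flags)


# Hypothetical cohort of four samples at a locus whose common type is 12 bp.
cohort = [{12: 900, 11: 80}, {12: 500, 11: 450}, {11: 950, 12: 40}, {12: 980}]
print(polymorphism_ratio(cohort, 12))  # 0.5
```

In this toy cohort the second sample is heterozygous and the third is homozygous for a different type, so two of four samples are polymorphic.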
In a preferred embodiment, the above similarity value is calculated as $\sum (d_2 + 1 - d_1)/d_2$, where $d_1$ is the distance to the microsatellite locus of each base, in the sequence of set length at the left and right ends, that is identical to the base of the microsatellite locus, and $d_2$ is the set length, which is 8-12 bp.
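As an illustration of this formula, the following Python sketch sums (d2 + 1 - d1)/d2 over every flanking base that matches the repeat base, so that matching bases close to the locus weigh more. The function name `flank_similarity` and the example flanks are hypothetical.

```python
def flank_similarity(left_flank, right_flank, repeat_base, d2=10):
    """Similarity between the d2-bp flanks and a single-base repeat.

    Each flanking base equal to repeat_base at distance d1 (1 = adjacent
    to the locus) contributes (d2 + 1 - d1) / d2 to the score.
    """
    score = 0.0
    # Left flank: the base adjacent to the locus is the last character.
    for d1, base in enumerate(reversed(left_flank[-d2:]), start=1):
        if base == repeat_base:
            score += (d2 + 1 - d1) / d2
    # Right flank: the base adjacent to the locus is the first character.
    for d1, base in enumerate(right_flank[:d2], start=1):
        if base == repeat_base:
            score += (d2 + 1 - d1) / d2
    return score


# Hypothetical flanks of an A repeat: A at distance 6 on the left,
# and at distances 2 and 9 on the right, giving 0.5 + 0.9 + 0.2.
print(round(flank_similarity("GCTGACCTGC", "GATCCGTGAC", "A"), 2))  # 1.6
```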
In a preferred embodiment, obtaining sequencing data of a plurality of MSS samples (i.e., normal lymphocyte samples), locating the first locus set in the sequencing data of each MSS sample, and counting the repeating unit types of each microsatellite locus in the first locus set and the frequency of each repeating unit type comprises: aligning the sequencing data of each MSS sample to the reference genome sequence to obtain an initial alignment result; extracting from the initial alignment result, according to the enrichment characteristics of amplicons, the reads whose 5' end is an amplicon primer sequence, so as to eliminate alignment errors caused by similar sequences and obtain a corrected alignment result; locating the first locus set in the corrected alignment result and extracting the spanning reads covering each microsatellite locus in the first locus set, the spanning reads being reads that cover the locus and at least 2 bp at each of its left and right ends; and counting the repeating unit types covering each microsatellite locus and the frequency of each repeating unit type (the frequency being the proportion of the spanning reads of a given repeating unit type at the locus to the total number of spanning reads covering the locus).
The flanking sequences are used to determine the position of the reads on the reference genome and to make the detected repeating unit types accurate, so that the counts for each repeating unit type are accurate and the accuracy of the detection result is improved. The flanks are at least 2 bp each, more preferably exactly 2 bp: requiring reads to cover the whole locus region plus about 2 bp on each side ensures that the reads completely span the locus while keeping data loss to a minimum (the longer the required flanking coverage, the stricter the matching condition and the fewer reads qualify), and at the same time avoids insertions and deletions in the locus region affecting the determination of the repeating unit type. The preset flank length can of course also be 3 bp, 4 bp, 5 bp, 6 bp, 7 bp, 8 bp, 9 bp or longer, and can be adjusted as appropriate.
In the construction method based on the amplicon library, because the amplification primers of each microsatellite locus are the same, the sequences of the origins of reads derived from the same microsatellite locus on the reference genome in the sequencing data are the same, so that the de-duplication treatment is difficult to carry out.
The significance of the difference can be tested with a nonparametric method, preferably the Wilcoxon test. Preferably, the p value is lower than 0.05.
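A minimal Python sketch of such a significance test is given below. It assumes that the test meant here is the two-sample Wilcoxon rank-sum (Mann-Whitney U) test, and it uses a normal approximation with tie correction rather than exact p values; in practice a statistics package would normally be used. The function name and the example deletion ratios are hypothetical.

```python
import math


def rank_sum_p(group_a, group_b):
    """Two-sided Wilcoxon rank-sum p value (normal approximation, tie-corrected)."""
    pooled = sorted((v, g) for g, vals in enumerate((group_a, group_b)) for v in vals)
    n1, n2 = len(group_a), len(group_b)
    n = n1 + n2
    rank_sum_a = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        mid_rank = (i + 1 + j) / 2  # average of ranks i+1 .. j for tied values
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_a += mid_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_a - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return 1.0
    z = (u - mean_u) / math.sqrt(var_u)
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical deletion ratios of one locus in MSS and MSI-H samples.
mss = [0.02, 0.03, 0.01, 0.04, 0.02, 0.03]
msi = [0.30, 0.25, 0.40, 0.35, 0.28, 0.33]
print(rank_sum_p(mss, msi) < 0.05)  # True: the locus would be retained
```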
Example 2
The embodiment provides a baseline construction method for detecting MSI, which comprises the following steps:
obtaining sequencing data of the amplicon libraries of a plurality of known MSS samples, extracting the reads whose 5' end is an amplicon primer sequence according to the enrichment characteristics of amplicons, and counting, for each microsatellite locus of each sample, the spanning coverage and the number of spanning reads covering each repeating unit type, wherein the microsatellite loci are those obtained by the above screening method for MSI detection based on amplicon second-generation sequencing (for example, 15 of the 68 loci shown in Table 1, or all of them);
provided that the spanning coverage reaches a saturation value (for example 1000-2000; the value differs between loci), calculating the deletion ratio of each microsatellite locus of each sample from the number of spanning reads of each repeating unit type at that locus, removing the microsatellite loci that are polymorphic in a microsatellite-stable sample, and obtaining the mean and standard deviation of the deletion ratio of all samples at each microsatellite locus, thereby constructing a baseline of the deletion ratio for each microsatellite locus;
wherein the spanning coverage is the sum of the numbers of spanning reads covering the different repeating unit types at each microsatellite locus, and spanning reads are reads that completely cover the microsatellite locus and at least 2 bp at each of its left and right ends.
The baseline construction method takes the deletion ratio of the repetitive unit type with the reduced length representing the unstable state of the microsatellite as the index of measurement, and calculates the index values of a plurality of known MSS samples, thereby obtaining the baseline level of each site in a negative sample and being beneficial to improving the accuracy of the detection result.
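The baseline construction can be sketched in Python as below. The data layout, the function name `build_baseline`, and the use of the sample standard deviation (`statistics.stdev`) are assumptions; the application only specifies that a mean and a standard deviation are computed per locus after polymorphic loci are removed.

```python
from statistics import mean, stdev


def build_baseline(ratios_by_sample, polymorphic=frozenset()):
    """Per-locus (mean, SD) of the deletion ratio over known MSS samples.

    ratios_by_sample: {sample_id: {locus_id: deletion_ratio}} for loci whose
        spanning coverage already reached the saturation value.
    polymorphic: set of (sample_id, locus_id) pairs to leave out.
    """
    per_locus = {}
    for sample_id, loci in ratios_by_sample.items():
        for locus_id, ratio in loci.items():
            if (sample_id, locus_id) not in polymorphic:
                per_locus.setdefault(locus_id, []).append(ratio)
    # A standard deviation needs at least two observations.
    return {locus: (mean(v), stdev(v)) for locus, v in per_locus.items() if len(v) >= 2}


# Hypothetical deletion ratios of one locus in three MSS samples.
data = {"s1": {"L1": 0.02}, "s2": {"L1": 0.04}, "s3": {"L1": 0.03}}
base_mean, base_sd = build_baseline(data)["L1"]
print(round(base_mean, 3), round(base_sd, 3))  # 0.03 0.01
```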
The MSS sample refers to a normal healthy sample, either a normal tissue sample or a normal lymphocyte sample, preferably a buffy coat sample. It should be noted that other MSS samples that may have a somatic SNV mutation are not excluded, although such MSS samples with a somatic SNV mutation may have a variation affecting a certain microsatellite locus.
Example 3
This embodiment provides a method for detecting the microsatellite status, comprising: obtaining sequencing data of a sample to be tested and calculating the spanning coverage and the deletion ratio of each microsatellite locus in the sample;
if the spanning coverage of a microsatellite locus reaches the saturation value, the microsatellite locus passes quality control;
comparing the deletion ratio of each microsatellite locus of the sample to be tested with the baseline constructed by the above construction method;
if the deletion ratio of the sample to be tested $> \mathrm{Mean}(D_i) + n \cdot \mathrm{SD}(D_i)$, where $3.5 \le n \le 4.4$, the microsatellite locus is judged to be unstable;
judging the microsatellite state of the sample to be detected according to the following conditions:
(1) if the number n1 of the sites passing the quality control is more than or equal to 15, the number of unstable sites is n2, and n2/n1 is more than or equal to 0.1, judging the microsatellite state of the sample to be detected to be MSI-H;
(2) if the number n1 of the sites passing the quality control is more than or equal to 15, the number of unstable sites is n2, and n2/n1 is less than 0.1, judging the microsatellite state of the sample to be detected to be MSS;
(3) if the number n1 of the sites passing the quality control is less than 15, judging the microsatellite state of the sample to be detected to be undetermined;
wherein the microsatellite loci are those obtained by the above screening method for MSI detection based on amplicon second-generation sequencing.
According to the detection method for the unstable state of the microsatellite, the deletion ratio representing the type of the length-reduced repeating unit of the unstable state of the microsatellite is used as a detection index, so that the sensitivity of detecting unstable signals of the microsatellite locus is improved, and the minimum detection limit of tumor purity is reduced.
The rule for determining the microsatellite status may be determined by machine learning based on the microsatellite loci and the baseline of the present application.
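The threshold rule of this example can be sketched in Python as follows. The data layout and the function name are hypothetical; n = 4 is one value within the stated range of 3.5 to 4.4, and the cutoffs of 15 loci and 0.1 are those given above.

```python
def call_microsatellite_status(sample, baseline, saturation,
                               n=4.0, min_loci=15, unstable_fraction=0.1):
    """Classify one sample as MSI-H, MSS or undetermined.

    sample: {locus: (spanning_coverage, deletion_ratio)}
    baseline: {locus: (mean, sd)} of the deletion ratio in MSS samples
    saturation: {locus: spanning coverage required to pass quality control}
    """
    passed = [locus for locus, (coverage, _) in sample.items()
              if locus in baseline and coverage >= saturation[locus]]
    if len(passed) < min_loci:
        return "undetermined"
    unstable = 0
    for locus in passed:
        ratio = sample[locus][1]
        base_mean, base_sd = baseline[locus]
        if ratio > base_mean + n * base_sd:
            unstable += 1
    return "MSI-H" if unstable / len(passed) >= unstable_fraction else "MSS"


# Hypothetical panel of 20 loci sharing one baseline and one saturation value.
loci = [f"L{i}" for i in range(20)]
baseline = {locus: (0.03, 0.01) for locus in loci}
saturation = {locus: 1000 for locus in loci}
sample = {locus: (1500, 0.03) for locus in loci}
sample["L0"] = (1500, 0.30)
sample["L1"] = (1500, 0.30)
print(call_microsatellite_status(sample, baseline, saturation))  # MSI-H
```

With 2 unstable loci out of 20 passing quality control, n2/n1 equals 0.1 and the sample is called MSI-H.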
Example 4
The embodiment provides a method for detecting an unstable state of a microsatellite, which comprises the following detailed steps:
1) The number of spanning reads at each microsatellite locus of each sample is calculated from the amplicon library sequencing data of (≥ 30) blood cell (BC) samples or MSS tissue samples.
1.1 The sequencing reads are aligned to the human genome with the BWA software to obtain an alignment file.
1.2 After realignment with GATK, the reads whose 5' end is an amplicon primer sequence and which completely cover the microsatellite locus region plus at least 2 bp of each flank (the spanning reads) are extracted.
1.3 The microsatellite sequence in each spanning read from 1.2 is extracted and its length calculated; each distinct length represents one repeating unit type.
1.4 The total number of spanning reads at the microsatellite locus is counted and defined as the spanning coverage.
1.5 If the spanning coverage of the microsatellite locus is greater than the saturation value of the spanning coverage of that locus, the locus passes quality control.
1.6 If the number of spanning reads supporting a repeating unit type is 2 or more, the repeating unit type is valid.
1.7 The number of spanning reads covering repeating unit type D is counted, and its ratio to the total number of spanning reads at the microsatellite locus, i.e. the deletion ratio, is calculated.
1.8 statistics of all sample site polymorphisms.
2) Baseline construction
The deletion ratio of every microsatellite locus of all BC or MSS samples is calculated according to 1), and the microsatellite loci that are polymorphic in a sample are removed. The mean $\mathrm{Mean}(D_i)$ and the standard deviation $\mathrm{SD}(D_i)$ of the deletion ratio of each microsatellite locus are calculated to construct the baseline.
3) The deletion ratio of the microsatellite loci of the sample to be examined is calculated according to the above 1.1,1.2,1.3,1.4,1.5,1.6 and 1.7.
4) The deletion ratio of the sample to be tested is compared with the baseline; if the deletion ratio of the sample at locus $i$ $> \mathrm{Mean}(D_i) + 4 \cdot \mathrm{SD}(D_i)$, the locus is judged to be unstable.
6) If the number of sites passing through quality control is more than or equal to 15, the number of unstable sites/the number of sites passing through quality control is more than or equal to 0.1, and the state of the microsatellite is judged to be MSI-H.
7) If the number of the sites passing through the quality control is more than or equal to 15, the number of unstable sites/the number of the sites passing through the quality control is less than 0.1, and the micro-satellite state is judged to be MSS.
8) If the number of loci passing quality control is less than 15, the microsatellite status is judged to be QNS (Quantity Not Sufficient).
It should be noted that the primers and the method of this embodiment are suitable for both single-end and paired-end sequencing, single-end sequencing being the less expensive.
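Steps 1.2, 1.3 and 1.6 can be illustrated with the Python sketch below. It is a simplification that works directly on read strings with a regular expression instead of on BWA/GATK alignments, and it assumes that the flank anchors neither end nor start with the repeat base; the function name, primer and sequences are all hypothetical.

```python
import re
from collections import Counter


def repeat_type_counts(reads, primer, left_flank, right_flank, base, min_support=2):
    """Count valid repeating unit types among primer-anchored spanning reads.

    left_flank / right_flank: at least 2 bp of sequence on each side of the locus.
    Returns {repeat length: number of spanning reads} for types with
    at least min_support supporting reads (step 1.6).
    """
    pattern = re.compile(re.escape(left_flank) + "(" + base + "+)" + re.escape(right_flank))
    counts = Counter()
    for read in reads:
        if not read.startswith(primer):
            continue  # step 1.2: the 5' end must be the amplicon primer
        match = pattern.search(read)
        if match:  # the read covers the locus and both flank anchors
            counts[len(match.group(1))] += 1  # step 1.3: length = type
    return {length: n for length, n in counts.items() if n >= min_support}


head = "ACGTACTTGC"  # hypothetical primer ACGTAC followed by flank ...GC
reads = (
    [head + "A" * 12 + "GTCC"] * 3
    + [head + "A" * 11 + "GTCC"] * 2
    + [head + "A" * 10 + "GTCC"]          # single read: type not valid
    + ["TTGTACTTGC" + "A" * 12 + "GTCC"]  # 5' end is not the primer
    + [head + "A" * 12]                   # does not span the right flank
)
print(repeat_type_counts(reads, "ACGTAC", "GC", "GT", "A"))  # {12: 3, 11: 2}
```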
The benefits of the present application will be further illustrated with reference to other specific examples.
Example 5: screening for microsatellite loci
Sites were selected within the reference sequence of the whole genome or of the target panel in the following order:
1) a single base repeated microsatellite sequence;
2) the length range is 7-15 bp;
3) the basic group is A or T;
4) calculating the similarity value between the 10 bp sequences at the left and right ends of the microsatellite and the microsatellite sequence according to the formula $\sum (11 - n)/10$, and selecting loci with a similarity value less than or equal to 2;
5) according to the reading length requirement of sequencing reads, a proper primer and directional sequencing can be designed, so that the sequencing reads can completely span the microsatellite locus region.
6) Amplicon sequencing is performed on N (> 100) peripheral blood leukocyte or MSS samples; the repeating unit types of each microsatellite locus and the proportion of each type are counted, a frequency spectrum is constructed, the repeating unit type with the highest proportion is determined, and the loci whose most frequent repeating unit type is consistent with the reference sequence are selected for calculation. In practice this condition need not be imposed; in that case, the reference sequence should be replaced by the wild-type repeating unit type of the target population when calculating the deletion ratio.
7) The polymorphism ratio of each locus is determined, and loci with a polymorphism ratio of less than 5% are selected.
8) A Wilcoxon test is performed using M (e.g., 100) MSI-H samples and M' (e.g., 30) MSS samples to assess, for each locus, the difference in deletion ratio between the two groups, and loci with a p value lower than 0.05 are selected.
Following the above steps, 68 single-base repeat sequences were screened that are 10-15 bp long, consist of A or T, have low similarity to their flanking sequences, have suitable primer sequences and directional sequencing, and show high monomorphism, high sensitivity and high specificity. Specific information for each locus is given in Table 1.
Table 1:
[Table 1: images DEST_PATH_IMAGE001 to DEST_PATH_IMAGE003]
wherein: in the sequencing direction, "+" indicates sequencing from the forward primer sequence, i.e., read1 from the forward primer sequence, and "-" indicates sequencing from the reverse primer sequence, i.e., read1 from the reverse primer sequence. Sequencing primer sequences in combination with directed sequencing can be used for MSI detection using single-ended sequencing.
It should be noted that the microsatellite loci in this example are not unique or fixed, and the enriched regions may differ between designs. However, loci selected according to the single-base repeat selection criteria of this example give the best results. The primers and sequencing direction required for these microsatellite loci will also vary with the sequencing length, e.g., 100SE or 150SE.
Example 6: baseline construction
30 leukocyte samples from healthy individuals were selected, and the deletion ratio baseline data of MSS samples were constructed according to step 3) of the method scheme.
Example 7: repeat unit type frequency distribution saturation analysis
12 MSS samples were down-sampled according to the protocol of Example 4, and the saturation of the deletion ratio of each microsatellite locus was evaluated. The deletion ratio was calculated as the spanning coverage increased and a saturation curve was plotted, in order to determine the spanning coverage required for microsatellite analysis and hence the quality control standard for each locus.
According to this method, the quality control threshold for the spanning coverage of each microsatellite locus, as determined from the second-generation sequencing results of the amplicon library, is 1000-2000.
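The down-sampling analysis can be sketched in Python as follows. The sketch draws random subsets of the spanning reads of one locus at increasing depths and recomputes the deletion ratio at each depth; the function name, the depths and the simulated read lengths are hypothetical, and how the saturation point is read off the curve is left open, as in the text.

```python
import random


def saturation_curve(repeat_lengths, ref_length, depths, seed=0):
    """Deletion ratio of one locus at several down-sampled spanning coverages.

    repeat_lengths: repeat-tract length of every spanning read at the locus.
    depths: spanning coverages to sample; depths above the data size are skipped.
    """
    rng = random.Random(seed)  # fixed seed so the curve is reproducible
    curve = []
    for depth in depths:
        if depth > len(repeat_lengths):
            continue
        subset = rng.sample(repeat_lengths, depth)
        ratio = sum(1 for length in subset if length < ref_length) / depth
        curve.append((depth, ratio))
    return curve


# Hypothetical MSS locus: 2000 spanning reads, 5% with a shortened tract.
reads = [12] * 1900 + [11] * 100
for depth, ratio in saturation_curve(reads, 12, [100, 500, 1000, 2000]):
    print(depth, ratio)
```

Plotting the ratio against depth gives the saturation curve; at the full depth of 2000 the ratio equals the true value of 0.05.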
Example 8:
34 samples were selected as the test set, 6 of which were MSI-H samples and 28 of which were MSS samples, and the detection was performed by the method of example 4, with 100% sensitivity, 100% specificity and 100% accuracy. The specific results are shown in FIG. 1.
It should be noted that the detection results of the MSI detection method of the present application are consistent with those of the MSI-PCR method. However, its minimum detection limit for tumor purity reaches 5%, which is better than the 20% of MSI-PCR.
Example 9:
2 MSI-H cell line samples were each mixed with one MSS cell line sample at tumor contents of 5%, 10%, 15%, 20%, 25% and 30%, with an input amount of 10 ng; the microsatellite status of the mixed samples was predicted and the minimum detection limit for tumor content was analyzed. The results are shown in Table 2 and FIG. 2.
TABLE 2
[Table 2: images DEST_PATH_IMAGE004 and DEST_PATH_IMAGE005]
From the results, it was concluded that the MSI-H samples had a minimum detection limit of 5% for tumor content.
From the above description, it can be seen that the above-described embodiments of the present invention achieve the following technical effects: the detection scheme of the MSI can be successfully realized based on amplicon second-generation sequencing data, the accuracy of the MSI detection is higher, and the minimum detection limit of the tumor content of the MSI-H sample is reduced.
Specifically, the method of the present application has the following advantages:
1) No normal tissue sample is required as a control. This advantage rests on the 68 detected microsatellite loci being monomorphic or nearly monomorphic in the population cohort.
2) The enrichment efficiency is high. Because amplicon enrichment is used and suitable primer sequences and directional sequencing are designed, the reads obtained by sequencing the microsatellite amplicons can completely cover the microsatellite loci, so the enrichment efficiency is high.
3) Single-end sequencing reduces cost, because the primer sequences and sequencing direction are designed specifically for these loci.
4) By utilizing the primer sequence information, the alignment error caused by similar sequences is eliminated, and the analysis accuracy is improved.
5) The deletion ratio is used as the calculated characteristic value. When the microsatellite is unstable, the vast majority of the microsatellite loci are shown to be shortened in length, so that the proportion of the types of the repeating units with the shortened length is increased, and the proportion of the types of the repeating units with the shortened length is used as unstable signals of the microsatellite loci, thereby being convenient for improving the detection sensitivity and reducing the minimum detection limit of tumor purity.
In addition, 68 single base repeat sequences are used as a microsatellite locus combination for carrying out microsatellite instability (MSI) analysis, and the locus combination can improve the stability and accuracy of analysis results and reduce the requirement of the tumor content of a sample.
The 68 microsatellite loci of the application have short repetitive unit types and specific A or T basic groups, so that the background noise caused by PCR is low, and the 68 loci are used for MSI detection, which is favorable for reducing the minimum detection limit of the tumor content of the MSI-H sample.
The 68 microsatellite loci of the application 1) have low similarity to their flanking sequences, which reduces alignment errors; 2) are monomorphic, which helps improve detection specificity; and 3) differ significantly between MSI-H and MSS samples, which helps improve detection sensitivity. Using this microsatellite locus combination for MSI detection therefore improves the stability and accuracy of the analysis results.
Furthermore, the validation samples show that the detection method combined with these loci can detect MSI status in samples with a tumor content as low as 5%.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
Corresponding to the above manner, the present application also provides a series of apparatuses, which are used for implementing the above embodiments and preferred embodiments, and the descriptions of the apparatuses that have been already described are omitted. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
The benefits of the present application are further described below in connection with certain alternative embodiments.
Example 10
This example provides a screening apparatus for microsatellite loci used in amplicon-based next-generation sequencing MSI detection. The screening apparatus comprises:
a first site set acquisition module, configured to select the microsatellite loci that meet a first condition and record them as a first site set, wherein the first condition comprises: a. the locus is a single-base repeat of A or T, 7-15 bp in length; b. the similarity value between the single-base repeat and its flanking sequences is below a similarity threshold; c. amplification primers and a sequencing direction are designed for each microsatellite locus according to the read-length requirement of the sequencing reads, and only loci for which the sequencing reads of the corresponding amplicon can completely span the microsatellite locus region are selected;
a repeat unit type and frequency statistics module, configured to acquire sequencing data of amplicon libraries of a plurality of microsatellite-stable samples, locate the first site set in the sequencing data of each microsatellite-stable sample, and count the repeat unit types of each microsatellite locus in the first site set and the frequency of each repeat unit type;
a second site set acquisition module, configured to select, from the first site set, the microsatellite loci that meet a second condition as a second site set, wherein the second condition comprises: 1) the most frequent repeat unit type is consistent with the reference sequence; 2) the polymorphism in the population is less than 5%;
a site screening module, configured to use a negative sample group consisting of a plurality of microsatellite-stable samples and a positive sample group consisting of a plurality of microsatellite-unstable samples, and to identify and retain those microsatellite loci in the second site set whose deletion ratio differs significantly between the negative sample group and the positive sample group;
wherein the deletion ratio is the ratio of the number of spanning reads whose repeat unit type is shorter than the reference sequence of the microsatellite locus to the total number of spanning reads at that locus, and spanning reads are reads that completely cover the microsatellite locus plus at least 2 bp on each of its left and right sides.
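As an illustration of the deletion ratio defined above, the following minimal sketch assumes that the spanning reads of one locus have already been summarised as a mapping from observed repeat length to read count. The function name `deletion_ratio`, its parameters and the example counts are hypothetical and are not part of the disclosed apparatus.

```python
def deletion_ratio(length_counts, ref_length):
    """Fraction of spanning reads whose repeat is shorter than the reference.

    length_counts: dict mapping observed repeat length (bp) to the number of
                   spanning reads carrying that length (assumed input format).
    ref_length:    repeat length of the locus in the reference sequence.
    """
    total = sum(length_counts.values())
    if total == 0:
        raise ValueError("no spanning reads at this locus")
    shortened = sum(n for length, n in length_counts.items() if length < ref_length)
    return shortened / total


# Hypothetical locus whose reference allele is a 10 bp poly-A run.
counts = {10: 880, 9: 90, 8: 20, 11: 10}
ratio = deletion_ratio(counts, ref_length=10)
```

Reads with a longer repeat than the reference (the 11 bp reads here) count towards the denominator only, since the definition considers shortened repeat types alone.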
Optionally, the similarity value is calculated as Σ(d2 + 1 - d1)/d2, where d1 is the distance from the microsatellite locus to a base, within the set-length flanking sequence on the left or right side, that is identical to the repeated base of the microsatellite locus; d2 is the set length; and d2 is 8-12 bp.
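One possible reading of this formula is sketched below. It assumes, beyond what the text states, that d1 is the 1-based distance from the edge of the repeat (so an adjacent base has d1 = 1) and that the contributions of both flanks are summed. The function name and the example sequences are hypothetical.

```python
def flank_similarity(left_flank, right_flank, repeat_base, d2=10):
    """Sum (d2 + 1 - d1) / d2 over flanking bases equal to the repeat base.

    Assumptions (not stated explicitly in the text): d1 is the 1-based
    distance from the repeat boundary, and both flanks are summed.
    left_flank is written 5'->3', so its last base is adjacent to the repeat.
    """
    repeat_base = repeat_base.upper()
    score = 0.0
    # Left flank: walk outwards from the repeat, i.e. from the end backwards.
    for d1, base in enumerate(reversed(left_flank.upper()[-d2:]), start=1):
        if base == repeat_base:
            score += (d2 + 1 - d1) / d2
    # Right flank: the first base is adjacent to the repeat.
    for d1, base in enumerate(right_flank.upper()[:d2], start=1):
        if base == repeat_base:
            score += (d2 + 1 - d1) / d2
    return score


# Hypothetical poly-A locus with 10 bp flanks on each side.
score = flank_similarity("GCTAGCTGCG", "GACCTGATTC", "A", d2=10)
passes = score < 2.0  # 2.0 is one value inside the 1.5-2.5 threshold range
```

Under this reading, a matching base next to the repeat contributes 1 and a matching base at the far end of the window contributes 1/d2, so flanks that resemble the repeat close to its boundary are penalised most.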
Optionally, the similarity threshold is 1.5-2.5.
Optionally, the repeat unit type and frequency statistics module comprises: an alignment module, configured to align the sequencing data of the amplicon library of each microsatellite-stable sample against the reference genome sequence to obtain an initial alignment result; a positioning module, configured to extract from the initial alignment result, based on the enrichment characteristics of amplicons, the reads whose 5' end is an amplicon primer sequence, so as to eliminate alignment errors caused by similar sequences and obtain a corrected alignment result; a spanning reads extraction module, configured to locate the first site set in the corrected alignment result and extract the spanning reads covering each microsatellite locus in the first site set; and a counting module, configured to count the repeat unit types of each microsatellite locus and the frequency of each repeat unit type, wherein the frequency is the proportion of the number of spanning reads of a given repeat unit type at the microsatellite locus to the total number of spanning reads covering that locus.
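The primer-anchored extraction and the frequency count can be illustrated with a simplified sketch. It works on plain read strings instead of alignment records, and it finds the repeat by matching short flanking anchors; both simplifications, as well as the function name, the primer and the anchor sequences, are assumptions made for illustration only.

```python
import re
from collections import Counter


def repeat_length_frequencies(reads, primer, left_anchor, right_anchor, base="A"):
    """Frequency of each repeat length among primer-anchored spanning reads.

    All arguments are illustrative: reads are plain sequence strings, and
    left_anchor / right_anchor are the >= 2 bp flanks that a read must contain
    on both sides of the repeat to count as a spanning read.
    """
    pattern = re.compile(
        re.escape(left_anchor) + "(" + re.escape(base) + "+)" + re.escape(right_anchor)
    )
    counts = Counter()
    for read in reads:
        if not read.startswith(primer):  # keep reads whose 5' end is the primer
            continue
        match = pattern.search(read)
        if match:  # the read covers the repeat and both flanks
            counts[len(match.group(1))] += 1
    total = sum(counts.values())
    return {length: n / total for length, n in counts.items()} if total else {}


reads = (
    ["ACGTCCGC" + "A" * 8 + "TGCC"] * 3    # reference-length repeat
    + ["ACGTCCGC" + "A" * 7 + "TGCC"]      # 1 bp deletion
    + ["TTTTCCGC" + "A" * 8 + "TGCC"]      # no primer at the 5' end: discarded
    + ["ACGTCCGC" + "A" * 5]               # right flank missing: not spanning
)
freqs = repeat_length_frequencies(reads, primer="ACGT", left_anchor="GC", right_anchor="TG")
```

Only the four reads that start with the primer and contain both anchors contribute, so the two discarded reads do not distort the frequencies.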
Optionally, the site screening module uses a nonparametric test to identify and retain those microsatellite loci in the second site set whose deletion ratio differs significantly between the negative sample group and the positive sample group; preferably, the nonparametric test is the Wilcoxon test.
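The nonparametric comparison can be sketched as follows. This is a minimal pure-Python illustration, assuming that the test meant here is the two-sample Wilcoxon rank-sum (Mann-Whitney U) test and using a normal approximation with tie correction and no continuity correction. The function name and the deletion-ratio values are hypothetical, and a statistics library would normally be used in practice.

```python
import math


def rank_sum_p_value(group_a, group_b):
    """Two-sided rank-sum p-value (normal approximation, ties averaged)."""
    n_a, n_b = len(group_a), len(group_b)
    values = sorted(list(group_a) + list(group_b))
    rank_of = {}      # value -> average rank of that value
    tie_term = 0.0    # sum of t**3 - t over groups of t tied values
    i = 0
    while i < len(values):
        j = i
        while j < len(values) and values[j] == values[i]:
            j += 1
        rank_of[values[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        t = j - i
        tie_term += t ** 3 - t
        i = j
    rank_sum_a = sum(rank_of[v] for v in group_a)
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    n = n_a + n_b
    mean_u = n_a * n_b / 2
    var_u = n_a * n_b / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return 1.0
    z = (u_a - mean_u) / math.sqrt(var_u)
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))


# Hypothetical deletion ratios of one locus in the two sample groups.
mss_ratios = [0.02, 0.03, 0.01, 0.04, 0.02, 0.03, 0.05, 0.02]
msi_ratios = [0.30, 0.45, 0.28, 0.52, 0.38, 0.41, 0.33, 0.47]
p_value = rank_sum_p_value(mss_ratios, msi_ratios)
keep_locus = p_value < 0.05
```

A locus would be retained when the p-value falls below the chosen significance level; 0.05 is the level named in the claims.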
Example 11
This example provides a baseline construction apparatus for detecting MSI, comprising: an acquisition and counting module, configured to acquire sequencing data of amplicon libraries of a plurality of known MSS samples and to count, for each sample, the spanning coverage of each microsatellite locus and the number of spanning reads covering each repeat unit type, wherein the microsatellite loci are loci for amplicon-based next-generation sequencing MSI detection screened by any one of the screening methods described above; and a baseline construction module, configured to calculate, when the spanning coverage reaches a saturation value, the deletion ratio of each microsatellite locus from the number of spanning reads of each repeat unit type at that locus in each MSS sample, to remove microsatellite loci that are polymorphic in the microsatellite-stable samples, and to obtain the mean and standard deviation of the deletion ratios of all samples at each microsatellite locus, thereby constructing a baseline of the deletion ratio for each microsatellite locus; wherein the spanning coverage is the sum of the numbers of spanning reads covering the different repeat unit types at each microsatellite locus.
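A minimal sketch of the baseline statistics is given below. It assumes that the per-sample deletion ratios and spanning coverages are already available as nested dictionaries, uses a hypothetical `min_coverage` parameter in place of the saturation value, and computes the sample standard deviation (n - 1 denominator), which the text does not specify. The removal of polymorphic loci is omitted.

```python
from statistics import mean, stdev


def build_baseline(sample_ratios, sample_coverage, min_coverage):
    """Per-locus mean and standard deviation of the deletion ratio in MSS samples.

    sample_ratios:   {sample_id: {locus: deletion_ratio}}
    sample_coverage: {sample_id: {locus: spanning_coverage}}
    min_coverage:    hypothetical stand-in for the coverage saturation value.
    """
    per_locus = {}
    for sample_id, ratios in sample_ratios.items():
        for locus, ratio in ratios.items():
            if sample_coverage[sample_id][locus] >= min_coverage:
                per_locus.setdefault(locus, []).append(ratio)
    # stdev needs at least two observations per locus.
    return {
        locus: (mean(values), stdev(values))
        for locus, values in per_locus.items()
        if len(values) >= 2
    }


ratios = {
    "s1": {"L1": 0.02, "L2": 0.10},
    "s2": {"L1": 0.04, "L2": 0.12},
    "s3": {"L1": 0.03, "L2": 0.50},
}
coverage = {
    "s1": {"L1": 500, "L2": 400},
    "s2": {"L1": 600, "L2": 450},
    "s3": {"L1": 550, "L2": 20},   # L2 is under-covered in s3 and is skipped
}
baseline = build_baseline(ratios, coverage, min_coverage=100)
```

In this example the outlying value 0.50 never enters the L2 baseline, because the locus does not reach the coverage requirement in sample s3.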
Example 12
This example provides an apparatus for detecting microsatellite status, comprising: an acquisition and calculation module, configured to acquire amplicon-library sequencing data of a sample to be detected and to calculate the spanning coverage and the deletion ratio of each microsatellite locus in the sample; a quality control module, configured to judge that a microsatellite locus passes quality control when its spanning coverage reaches a saturation value; a comparison module, configured to compare the deletion ratio of each microsatellite locus of the sample with the baseline constructed by the construction method described above; an unstable site determination module, configured to judge a microsatellite locus to be unstable if the deletion ratio of the sample at that locus is greater than mean(Di) + n × SD(Di), where 3.5 ≤ n ≤ 4.4; and a microsatellite status determination module, configured to determine the microsatellite status of the sample according to the following conditions: (1) if the number n1 of loci passing quality control is at least 15, the number of unstable loci is n2, and n2/n1 ≥ 0.1, the microsatellite status of the sample is judged to be MSI-H; (2) if the number n1 of loci passing quality control is at least 15, the number of unstable loci is n2, and n2/n1 < 0.1, the microsatellite status of the sample is judged to be MSS; (3) if the number n1 of loci passing quality control is less than 15, the microsatellite status of the sample is judged to be undetermined; wherein the microsatellite loci are loci for amplicon-based next-generation sequencing MSI detection screened by any one of the screening methods described above.
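The decision rule of this example can be written compactly as the sketch below. The function name, the dictionary-based inputs, the `min_coverage` stand-in for the saturation value and the default n = 4.0 (one value inside the stated 3.5-4.4 range) are assumptions made for illustration.

```python
def classify_sample(ratios, coverage, baseline, min_coverage,
                    n_sd=4.0, min_loci=15, unstable_fraction=0.1):
    """Return "MSI-H", "MSS" or "undetermined" for one sample.

    ratios:   {locus: deletion_ratio} of the sample to be detected
    coverage: {locus: spanning_coverage} of the sample to be detected
    baseline: {locus: (mean, sd)} of the deletion ratio in MSS samples
    """
    # Quality control: keep loci with a baseline and sufficient coverage.
    passed = [
        locus for locus in ratios
        if locus in baseline and coverage.get(locus, 0) >= min_coverage
    ]
    if len(passed) < min_loci:
        return "undetermined"
    # A locus is unstable when its ratio exceeds mean + n_sd * sd.
    unstable = sum(
        1 for locus in passed
        if ratios[locus] > baseline[locus][0] + n_sd * baseline[locus][1]
    )
    return "MSI-H" if unstable / len(passed) >= unstable_fraction else "MSS"


# Hypothetical panel of 20 loci sharing the same baseline.
loci = [f"L{i}" for i in range(20)]
baseline = {locus: (0.03, 0.01) for locus in loci}
coverage = {locus: 500 for locus in loci}
ratios = {locus: 0.03 for locus in loci}
for locus in loci[:3]:
    ratios[locus] = 0.30  # three clearly shifted loci
status = classify_sample(ratios, coverage, baseline, min_coverage=100)
```

With 3 unstable loci out of 20 passing quality control, the unstable fraction is 0.15, which meets the 0.1 cut-off.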
Example 13
This example provides microsatellite loci for amplicon-based next-generation sequencing MSI detection, comprising at least 15 of the 68 microsatellite loci shown in Table 1.
Further, a kit for detecting MSI is provided, the kit comprising at least 15 of the 68 microsatellite loci shown in table 1.
Example 14
This example provides a storage medium comprising a stored program, wherein, when the program runs, the device in which the storage medium is located is controlled to execute any one of the screening methods, any one of the construction methods, or any one of the detection methods described above.
The application also provides a processor, wherein the processor is used for running a program, and when the program runs, any screening method, any construction method or any detection method is executed.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible to a computing device. As defined herein, computer-readable media do not include transitory media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Sequence listing
<110> Zhen He (Beijing) Biotechnology Ltd
Zhen Yue Biotechnology Jiangsu Co Ltd
<120> Microsatellite locus based on amplicon next-generation sequencing MSI detection, screening method and application thereof
<130> PN153011ZYSW
<160> 136
<170> SIPOSequenceListing 1.0
<210> 1
<211> 29
<212> DNA
<213> Homo sapiens
<400> 1
tgtttaggtt gaaacagcat tagaaaact 29
<210> 2
<211> 27
<212> DNA
<213> Homo sapiens
<400> 2
ggactcaaat tttcctctga atgctaa 27
<210> 3
<211> 33
<212> DNA
<213> Homo sapiens
<400> 3
gtggatgtta aataaaagta ctttagtcac tca 33
<210> 4
<211> 25
<212> DNA
<213> Homo sapiens
<400> 4
ggacagaaac ccaactccta aaatc 25
<210> 5
<211> 30
<212> DNA
<213> Homo sapiens
<400> 5
gtccatgttg aaattgatca gtgtaagtta 30
<210> 6
<211> 25
<212> DNA
<213> Homo sapiens
<400> 6
cctggtttct tcttcaagac tctga 25
<210> 7
<211> 25
<212> DNA
<213> Homo sapiens
<400> 7
cccctaccat gactttattc tggaa 25
<210> 8
<211> 23
<212> DNA
<213> Homo sapiens
<400> 8
cattgcactc atcagagcta cag 23
<210> 9
<211> 25
<212> DNA
<213> Homo sapiens
<400> 9
cagatcccag cacctattga attac 25
<210> 10
<211> 26
<212> DNA
<213> Homo sapiens
<400> 10
gcatatatgg ccacagtcta aatacg 26
<210> 11
<211> 24
<212> DNA
<213> Homo sapiens
<400> 11
ggcacagtta cagatcacag aaaa 24
<210> 12
<211> 32
<212> DNA
<213> Homo sapiens
<400> 12
atcaaatagc atatagttaa caccatggtt ac 32
<210> 13
<211> 28
<212> DNA
<213> Homo sapiens
<400> 13
agtaactaaa ttcaccccca gactttaa 28
<210> 14
<211> 25
<212> DNA
<213> Homo sapiens
<400> 14
actttggttt ctttctttcc acgtt 25
<210> 15
<211> 25
<212> DNA
<213> Homo sapiens
<400> 15
gctgttggag gaaagttcca tttag 25
<210> 16
<211> 24
<212> DNA
<213> Homo sapiens
<400> 16
acagtatccc tatggcttct cttg 24
<210> 17
<211> 22
<212> DNA
<213> Homo sapiens
<400> 17
cttcatgtgc ctagggctag ta 22
<210> 18
<211> 24
<212> DNA
<213> Homo sapiens
<400> 18
gtatggcaca ctctggagaa aatt 24
<210> 19
<211> 27
<212> DNA
<213> Homo sapiens
<400> 19
gctttatttg tttaatgcag agttgca 27
<210> 20
<211> 27
<212> DNA
<213> Homo sapiens
<400> 20
ctctttcaga gggcaataat ggtattg 27
<210> 21
<211> 18
<212> DNA
<213> Homo sapiens
<400> 21
cattttccac cacgcgga 18
<210> 22
<211> 18
<212> DNA
<213> Homo sapiens
<400> 22
tcaccacctc tgtagcgg 18
<210> 23
<211> 25
<212> DNA
<213> Homo sapiens
<400> 23
cattatccca gtctagcaat cgttg 25
<210> 24
<211> 26
<212> DNA
<213> Homo sapiens
<400> 24
tggcccatag agtgttttaa acattt 26
<210> 25
<211> 25
<212> DNA
<213> Homo sapiens
<400> 25
gggaggaata tgatagatgg gcatt 25
<210> 26
<211> 27
<212> DNA
<213> Homo sapiens
<400> 26
aagaaagtgt tcactttaac agggttt 27
<210> 27
<211> 25
<212> DNA
<213> Homo sapiens
<400> 27
acactggaat tgaaatgttg aggtt 25
<210> 28
<211> 25
<212> DNA
<213> Homo sapiens
<400> 28
aggagaatct tagggaaaac agctt 25
<210> 29
<211> 28
<212> DNA
<213> Homo sapiens
<400> 29
cataagcaag gcacagtagt aaaagtag 28
<210> 30
<211> 33
<212> DNA
<213> Homo sapiens
<400> 30
aaagattatg tatgtgtatg tttaccttta aca 33
<210> 31
<211> 29
<212> DNA
<213> Homo sapiens
<400> 31
atttttaggc agtcacataa ctaacaaaa 29
<210> 32
<211> 26
<212> DNA
<213> Homo sapiens
<400> 32
aggtgacata cctggtacat aacttt 26
<210> 33
<211> 24
<212> DNA
<213> Homo sapiens
<400> 33
caccagagtt ggctctttct ttac 24
<210> 34
<211> 24
<212> DNA
<213> Homo sapiens
<400> 34
gactttcacg taagtgacct atcg 24
<210> 35
<211> 30
<212> DNA
<213> Homo sapiens
<400> 35
gctaatacag tgcttgaaca tgtaatatct 30
<210> 36
<211> 27
<212> DNA
<213> Homo sapiens
<400> 36
ctctatactg acgaaccaga agaagat 27
<210> 37
<211> 30
<212> DNA
<213> Homo sapiens
<400> 37
tcttattcca gtaatgtcct cttttaaggt 30
<210> 38
<211> 31
<212> DNA
<213> Homo sapiens
<400> 38
aaaaactccg aagaaataag aattgaaatg a 31
<210> 39
<211> 26
<212> DNA
<213> Homo sapiens
<400> 39
ctcattagaa aaggaagcaa ggagaa 26
<210> 40
<211> 27
<212> DNA
<213> Homo sapiens
<400> 40
tagaaacagt atgtggaaca catttcg 27
<210> 41
<211> 24
<212> DNA
<213> Homo sapiens
<400> 41
acactggcat ttagcaaaca gaat 24
<210> 42
<211> 18
<212> DNA
<213> Homo sapiens
<400> 42
gctgagatcg tgccaact 18
<210> 43
<211> 26
<212> DNA
<213> Homo sapiens
<400> 43
gtgttactgt tgagaagttc agtgtc 26
<210> 44
<211> 27
<212> DNA
<213> Homo sapiens
<400> 44
gaacactgtc aagattaata ccccttc 27
<210> 45
<211> 30
<212> DNA
<213> Homo sapiens
<400> 45
gtaagtccta attgtattcc aaagtcttca 30
<210> 46
<211> 25
<212> DNA
<213> Homo sapiens
<400> 46
tctaggcttt cagtgggtaa gattt 25
<210> 47
<211> 28
<212> DNA
<213> Homo sapiens
<400> 47
tgagaatcta tatttgtggt ggatcaca 28
<210> 48
<211> 32
<212> DNA
<213> Homo sapiens
<400> 48
tctcccttgt cattatttaa tcatgagata ct 32
<210> 49
<211> 20
<212> DNA
<213> Homo sapiens
<400> 49
ctgtagacac gggacttgtg 20
<210> 50
<211> 32
<212> DNA
<213> Homo sapiens
<400> 50
gtttctggtt ttaatgtttt ctttttgttg tt 32
<210> 51
<211> 29
<212> DNA
<213> Homo sapiens
<400> 51
caactaaaaa taagaacaag agggaagga 29
<210> 52
<211> 25
<212> DNA
<213> Homo sapiens
<400> 52
cgcttgatgg atttactctg gaaag 25
<210> 53
<211> 28
<212> DNA
<213> Homo sapiens
<400> 53
tgatttgaaa agcagagctt aaataggt 28
<210> 54
<211> 28
<212> DNA
<213> Homo sapiens
<400> 54
tgttcatctt ctcaactaaa agcttctg 28
<210> 55
<211> 28
<212> DNA
<213> Homo sapiens
<400> 55
gcctcaaggt aaatgaattt gcataaat 28
<210> 56
<211> 33
<212> DNA
<213> Homo sapiens
<400> 56
aatgcaaaga agtatatcac ttttatggtt atc 33
<210> 57
<211> 28
<212> DNA
<213> Homo sapiens
<400> 57
aaacctactg cactaactag ttttatgc 28
<210> 58
<211> 28
<212> DNA
<213> Homo sapiens
<400> 58
gttcatattt taggacctag gtgattcc 28
<210> 59
<211> 22
<212> DNA
<213> Homo sapiens
<400> 59
tgcatcccac gtggtaagaa ta 22
<210> 60
<211> 26
<212> DNA
<213> Homo sapiens
<400> 60
aggaaaagaa agaaagaaca gcaagt 26
<210> 61
<211> 32
<212> DNA
<213> Homo sapiens
<400> 61
atcatctcct tccttcttta aataagagta ac 32
<210> 62
<211> 27
<212> DNA
<213> Homo sapiens
<400> 62
ggtcctaaat ccccaaatca gaattta 27
<210> 63
<211> 26
<212> DNA
<213> Homo sapiens
<400> 63
tggcagtgca gcagaatata aataac 26
<210> 64
<211> 32
<212> DNA
<213> Homo sapiens
<400> 64
ctattgggtt taatagtatt cgttgtttct ga 32
<210> 65
<211> 22
<212> DNA
<213> Homo sapiens
<400> 65
acttggtgga agaacagctt tg 22
<210> 66
<211> 20
<212> DNA
<213> Homo sapiens
<400> 66
agaaaggcac cgctcagata 20
<210> 67
<211> 22
<212> DNA
<213> Homo sapiens
<400> 67
ctcctgactc ccattctgat ga 22
<210> 68
<211> 23
<212> DNA
<213> Homo sapiens
<400> 68
cagctttctt ctagtcaccc aat 23
<210> 69
<211> 33
<212> DNA
<213> Homo sapiens
<400> 69
aaaatctatg tttaaagttt tgttttctgt cgt 33
<210> 70
<211> 24
<212> DNA
<213> Homo sapiens
<400> 70
tttcttcttg tacagttggt ctgc 24
<210> 71
<211> 30
<212> DNA
<213> Homo sapiens
<400> 71
tatgtaggat ttcacaattg tttggctaag 30
<210> 72
<211> 28
<212> DNA
<213> Homo sapiens
<400> 72
gggcatcact accctctaag ataaaata 28
<210> 73
<211> 25
<212> DNA
<213> Homo sapiens
<400> 73
tggatatttg cttggaaaaa cgtgt 25
<210> 74
<211> 24
<212> DNA
<213> Homo sapiens
<400> 74
tttctactgc gacattagcc aaaa 24
<210> 75
<211> 26
<212> DNA
<213> Homo sapiens
<400> 75
ggttccttct gccttcttca aattac 26
<210> 76
<211> 24
<212> DNA
<213> Homo sapiens
<400> 76
ccattcagac atgtcacact tgaa 24
<210> 77
<211> 25
<212> DNA
<213> Homo sapiens
<400> 77
aatacttctg ccctgaaaac atcag 25
<210> 78
<211> 24
<212> DNA
<213> Homo sapiens
<400> 78
atgccctatt cgacagaact gata 24
<210> 79
<211> 27
<212> DNA
<213> Homo sapiens
<400> 79
aggtacggac ttatctatcc attcaag 27
<210> 80
<211> 29
<212> DNA
<213> Homo sapiens
<400> 80
gacctacttt tctttcagaa agtgtctaa 29
<210> 81
<211> 24
<212> DNA
<213> Homo sapiens
<400> 81
ggaatcttca tgttgtgggt catc 24
<210> 82
<211> 27
<212> DNA
<213> Homo sapiens
<400> 82
agaattccct ggtaagccat gaatata 27
<210> 83
<211> 22
<212> DNA
<213> Homo sapiens
<400> 83
ccgggagcac caaatcaatt ag 22
<210> 84
<211> 23
<212> DNA
<213> Homo sapiens
<400> 84
gcatgcaaaa tgtcccctct tac 23
<210> 85
<211> 28
<212> DNA
<213> Homo sapiens
<400> 85
gtagttttgg aaaaagtttg acaaaggt 28
<210> 86
<211> 24
<212> DNA
<213> Homo sapiens
<400> 86
ctgttgtctc agaggaaaat gctt 24
<210> 87
<211> 25
<212> DNA
<213> Homo sapiens
<400> 87
tgtgccatac agttgcaaaa tactt 25
<210> 88
<211> 29
<212> DNA
<213> Homo sapiens
<400> 88
cagactactg aaggtaatat agtttgcag 29
<210> 89
<211> 27
<212> DNA
<213> Homo sapiens
<400> 89
ttttgaggtc cattgcttta ctaagac 27
<210> 90
<211> 26
<212> DNA
<213> Homo sapiens
<400> 90
tgtagtacaa gagactttcc atgtca 26
<210> 91
<211> 22
<212> DNA
<213> Homo sapiens
<400> 91
cacccgactc tcatagaaaa cg 22
<210> 92
<211> 24
<212> DNA
<213> Homo sapiens
<400> 92
ccgctcttca gaagaactga aaac 24
<210> 93
<211> 25
<212> DNA
<213> Homo sapiens
<400> 93
tgaagaacta ttccgttaac cacct 25
<210> 94
<211> 19
<212> DNA
<213> Homo sapiens
<400> 94
tgaggcttcg gtgtaccat 19
<210> 95
<211> 33
<212> DNA
<213> Homo sapiens
<400> 95
ttttatcaag agggataaaa caccatgaaa ata 33
<210> 96
<211> 30
<212> DNA
<213> Homo sapiens
<400> 96
gtcaggaaaa gagaattgtt cctataactg 30
<210> 97
<211> 33
<212> DNA
<213> Homo sapiens
<400> 97
ttttatcaag agggataaaa caccatgaaa ata 33
<210> 98
<211> 27
<212> DNA
<213> Homo sapiens
<400> 98
tttttatgcc attttgctaa tgtaccc 27
<210> 99
<211> 26
<212> DNA
<213> Homo sapiens
<400> 99
ctttaatgag tgtctttgac ccatgt 26
<210> 100
<211> 27
<212> DNA
<213> Homo sapiens
<400> 100
gatttatgga gaaggatccc taccttt 27
<210> 101
<211> 33
<212> DNA
<213> Homo sapiens
<400> 101
aaatcctttt tctgtatggg attatggaat att 33
<210> 102
<211> 23
<212> DNA
<213> Homo sapiens
<400> 102
tgtgaaggtt tcagatagag cct 23
<210> 103
<211> 33
<212> DNA
<213> Homo sapiens
<400> 103
aagttataaa ataactgatg tgttctgtta agc 33
<210> 104
<211> 27
<212> DNA
<213> Homo sapiens
<400> 104
tagttttaca aacatcttgg tcacgac 27
<210> 105
<211> 23
<212> DNA
<213> Homo sapiens
<400> 105
taagcaaata gccgaaggaa acc 23
<210> 106
<211> 19
<212> DNA
<213> Homo sapiens
<400> 106
gctgtgaggc taccgtgta 19
<210> 107
<211> 33
<212> DNA
<213> Homo sapiens
<400> 107
acttataaat gttgttttaa ggaatgtgat ttc 33
<210> 108
<211> 21
<212> DNA
<213> Homo sapiens
<400> 108
gcctgcacca acgtagaatt t 21
<210> 109
<211> 23
<212> DNA
<213> Homo sapiens
<400> 109
ccagagaacc gtcttcattg aac 23
<210> 110
<211> 21
<212> DNA
<213> Homo sapiens
<400> 110
ctaagggaga gagacgtttg c 21
<210> 111
<211> 27
<212> DNA
<213> Homo sapiens
<400> 111
cgcattaatt ttgtcaccac tttacag 27
<210> 112
<211> 25
<212> DNA
<213> Homo sapiens
<400> 112
attcaagact actcacggaa tctca 25
<210> 113
<211> 28
<212> DNA
<213> Homo sapiens
<400> 113
cttttaaggt gaccactttg aaaggata 28
<210> 114
<211> 24
<212> DNA
<213> Homo sapiens
<400> 114
cccactttga gacaaagtgg taac 24
<210> 115
<211> 28
<212> DNA
<213> Homo sapiens
<400> 115
ctgaaggatt tattgccttt gagtatct 28
<210> 116
<211> 24
<212> DNA
<213> Homo sapiens
<400> 116
ccctttcaga gctcttgttt actg 24
<210> 117
<211> 25
<212> DNA
<213> Homo sapiens
<400> 117
caacaaggag atatctgcct tcttc 25
<210> 118
<211> 33
<212> DNA
<213> Homo sapiens
<400> 118
ttttagaaca atgtacatga taaatatgac aga 33
<210> 119
<211> 33
<212> DNA
<213> Homo sapiens
<400> 119
tctctgattc atttatactt aactcatcaa cat 33
<210> 120
<211> 24
<212> DNA
<213> Homo sapiens
<400> 120
ccttttctct tggcacagta tgat 24
<210> 121
<211> 25
<212> DNA
<213> Homo sapiens
<400> 121
cctttgatgc tctccttcca ttttc 25
<210> 122
<211> 21
<212> DNA
<213> Homo sapiens
<400> 122
ctgggtcttt tgcatctacc c 21
<210> 123
<211> 24
<212> DNA
<213> Homo sapiens
<400> 123
tataacaggg caagggaaga cttt 24
<210> 124
<211> 30
<212> DNA
<213> Homo sapiens
<400> 124
cattttccct gaaactaggc ttgatattat 30
<210> 125
<211> 26
<212> DNA
<213> Homo sapiens
<400> 125
gctttgtgat ttgttcagca agtatc 26
<210> 126
<211> 25
<212> DNA
<213> Homo sapiens
<400> 126
aaacacacaa tcaagtaggg aactg 25
<210> 127
<211> 26
<212> DNA
<213> Homo sapiens
<400> 127
aagatatcac ctgagcaggt gataat 26
<210> 128
<211> 24
<212> DNA
<213> Homo sapiens
<400> 128
acactttgca atttgttcca ttcg 24
<210> 129
<211> 23
<212> DNA
<213> Homo sapiens
<400> 129
agtctcgaca tccacatgtg ata 23
<210> 130
<211> 26
<212> DNA
<213> Homo sapiens
<400> 130
tgattaggaa gcatttggta gaagga 26
<210> 131
<211> 28
<212> DNA
<213> Homo sapiens
<400> 131
agactcaatt tagctctctg aactagtt 28
<210> 132
<211> 23
<212> DNA
<213> Homo sapiens
<400> 132
taacacgctt ggattctagg tct 23
<210> 133
<211> 28
<212> DNA
<213> Homo sapiens
<400> 133
ccccaaaagt tttcttgttc tctgaata 28
<210> 134
<211> 29
<212> DNA
<213> Homo sapiens
<400> 134
gggaaataga ggcagtatat aaagacaga 29
<210> 135
<211> 25
<212> DNA
<213> Homo sapiens
<400> 135
tgttacaaag taagtgtggg aacac 25
<210> 136
<211> 25
<212> DNA
<213> Homo sapiens
<400> 136
gccactgaaa aagaaacctt aatgc 25

Claims (23)

1. A method for screening microsatellite loci based on amplicon next-generation sequencing MSI detection, said screening method comprising:
selecting microsatellite loci that meet a first condition and recording them as a first site set, wherein the first condition comprises: a. the locus is a single-base repeat of A or T, 7-15 bp in length; b. the similarity value between the single-base repeat and its flanking sequences is below a similarity threshold; c. amplification primers and a sequencing direction are designed for each microsatellite locus according to the read-length requirement of the sequencing reads, and the microsatellite loci for which the sequencing reads can completely span the microsatellite locus region are selected;
obtaining sequencing data of amplicon libraries of a plurality of microsatellite-stable samples, locating the first site set in the sequencing data of each microsatellite-stable sample, and counting the repeat unit types of each microsatellite locus in the first site set and the frequency of each repeat unit type;
selecting, from the first site set, the microsatellite loci that meet a second condition as a second site set, wherein the second condition comprises: 1) the most frequent repeat unit type is consistent with the reference sequence; 2) the polymorphism in the population is less than 5%;
using a negative sample group consisting of a plurality of microsatellite-stable samples and a positive sample group consisting of a plurality of microsatellite-unstable samples, identifying and retaining those microsatellite loci in the second site set whose deletion ratio differs significantly between the negative sample group and the positive sample group;
wherein the deletion ratio is the ratio of the number of spanning reads whose repeat unit type is shorter than the reference sequence to the total number of spanning reads at the microsatellite locus, and spanning reads are reads that cover the microsatellite locus plus at least 2 bp on each of its left and right sides;
the screening method is a method not for the purpose of medical diagnosis or treatment.
2. The screening method according to claim 1, wherein the similarity value is calculated as Σ(d2 + 1 - d1)/d2, where d1 is the distance from the microsatellite locus to a base, within the set-length flanking sequence on the left or right side, that is identical to the repeated base of the microsatellite locus; d2 is the set length; and d2 is 8-12 bp.
3. The screening method according to claim 1, wherein the similarity threshold is 1.5 to 2.5.
4. The screening method of claim 1, wherein obtaining sequencing data of amplicon libraries of a plurality of microsatellite-stable samples, locating the first site set in the sequencing data of each microsatellite-stable sample, and counting the repeat unit types of each microsatellite locus in the first site set and the frequency of each repeat unit type comprises:
aligning the sequencing data of the amplicon library of each microsatellite-stable sample against a reference genome sequence to obtain an initial alignment result;
extracting from the initial alignment result, based on the enrichment characteristics of amplicons, the reads whose 5' end is an amplicon primer sequence, so as to eliminate alignment errors caused by similar sequences and obtain a corrected alignment result;
locating the first site set in the corrected alignment result, and extracting the spanning reads covering each microsatellite locus in the first site set;
counting the repeat unit types of each microsatellite locus and the frequency of each repeat unit type, wherein the frequency is the ratio of the number of spanning reads of a given repeat unit type at the microsatellite locus to the total number of spanning reads covering that locus.
5. The screening method of claim 4, wherein said amplicon library is obtained by the CleanPlex multiplex PCR enrichment method using said amplification primers that amplify said microsatellite loci.
6. The screening method according to claim 1, wherein a nonparametric test is used to identify and retain those microsatellite loci in the second site set whose deletion ratio differs significantly between the negative sample group and the positive sample group.
7. The screening method according to claim 6, wherein the nonparametric test is the Wilcoxon test.
8. The screening method of claim 6, wherein a microsatellite locus with a significant difference is a microsatellite locus with a p-value < 0.05.
9. A baseline construction method for detecting MSI, the construction method comprising:
obtaining sequencing data of an amplicon library of a plurality of known stable microsatellite samples, firstly extracting reads with the 5' end as an amplicon primer sequence according to the enrichment characteristic of amplicons, and counting the number of the spinning coverage of each microsatellite locus and the number of the spinning reads of each repetitive unit type of each sample; wherein the microsatellite loci selected by the screening method of any one of claims 1 to 8 based on amplicon secondary sequencing MSI detection;
under the condition that the scanning coverage reaches a saturation value, calculating the deletion ratio of each microsatellite locus according to the scanning reads number of the repeating unit type of each microsatellite locus of each sample, removing the microsatellite loci with polymorphism in the stable microsatellite samples so as to obtain the average value and the standard deviation of the deletion ratio of all samples at each microsatellite locus, thereby constructing a baseline for obtaining the deletion ratio of each microsatellite locus;
wherein the spanning coverage refers to the sum of the number of spanning reads covering different types of repeating units at each microsatellite locus;
the construction method is a method for non-medical diagnosis or treatment purposes.
10. A method for detecting the status of a microsatellite for non-medical diagnostic or therapeutic purposes, said method comprising:
obtaining sequencing data of a sample to be detected based on an amplicon library and calculating the spanning coverage and the deletion ratio of each microsatellite locus in the sample to be detected, wherein the amplicon library is obtained by the CleanPlex multiplex PCR enrichment method using amplification primers that amplify the microsatellite loci;
if the spanning coverage of a microsatellite locus reaches a saturation value, the microsatellite locus passes quality control;
comparing the deletion ratio value of each microsatellite locus of the sample to be detected with the baseline constructed by the baseline construction method of claim 9;
if the deletion ratio of locus i in the sample to be detected satisfies deletion ratio(i) > Mean(Di) + n*SD(Di), where 3.5 ≤ n ≤ 4.4, judging that the microsatellite locus is unstable;
judging the microsatellite state of the sample to be detected according to the following conditions:
(1) if the number n1 of microsatellite loci passing quality control is greater than or equal to 15, the number of unstable loci is n2, and n2/n1 is greater than or equal to 0.1, judging the microsatellite state of the sample to be detected to be MSI-H;
(2) if the number n1 of microsatellite loci passing quality control is greater than or equal to 15, the number of unstable loci is n2, and n2/n1 is less than 0.1, judging the microsatellite state of the sample to be detected to be MSS;
(3) if the number n1 of microsatellite loci passing quality control is less than 15, judging the microsatellite state of the sample to be detected to be undetermined;
wherein the microsatellite loci are the loci for amplicon-based next-generation sequencing MSI detection selected by the screening method of any one of claims 1 to 8.
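The decision rule of claim 10 can be sketched in a few lines. The thresholds (n between 3.5 and 4.4, at least 15 loci, ratio 0.1) come from the claim; the function name, the input layout, the default n = 4.0 and the `saturation` parameter are illustrative assumptions.

```python
def call_microsatellite_state(loci, baseline, saturation, n=4.0,
                              min_loci=15, unstable_fraction=0.1):
    """Classify a sample as "MSI-H", "MSS" or "undetermined".

    loci maps a locus to (spanning_coverage, deletion_ratio) for the sample.
    baseline maps a locus to (mean, sd) of the deletion ratio in stable samples.
    """
    if not 3.5 <= n <= 4.4:
        raise ValueError("the claim restricts n to the range 3.5 to 4.4")
    passed = 0    # n1: loci passing quality control
    unstable = 0  # n2: loci judged unstable
    for locus, (coverage, ratio) in loci.items():
        if locus not in baseline or coverage < saturation:
            continue  # locus fails quality control
        passed += 1
        mu, sd = baseline[locus]
        if ratio > mu + n * sd:
            unstable += 1
    if passed < min_loci:
        return "undetermined"
    return "MSI-H" if unstable / passed >= unstable_fraction else "MSS"
```

A sample with 20 loci passing quality control, 3 of which exceed their baseline cutoffs, has n2/n1 = 0.15 and would be called MSI-H under this sketch.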
11. A screening apparatus for microsatellite loci based on amplicon next generation sequencing MSI detection, said screening apparatus comprising:
a first site set obtaining module, configured to select the microsatellite sites meeting a first condition and record them as a first site set, where the first condition includes: a. the site is a single-base repeat sequence of A or T that is 7-15 bp long; b. the similarity value between the single-base repeat sequence and its flanking sequences is below a similarity threshold; c. an amplification primer and a sequencing direction are designed for each microsatellite site according to the read-length requirement of the sequencing reads, and the microsatellite sites are selected for which the sequencing reads corresponding to the amplicon of the amplification primer completely span the microsatellite site region;
a repeating unit type and frequency statistics module, configured to obtain sequencing data of an amplicon library of a plurality of microsatellite-stable samples, look up the first site set in the sequencing data of each microsatellite-stable sample, and count the repeating unit types of each microsatellite site in the first site set and the frequency of each repeating unit type;
a second site set obtaining module, configured to select, from the first site set, the microsatellite sites satisfying a second condition as a second site set, where the second condition includes: 1) the repeating unit type with the highest frequency is consistent with the reference sequence; 2) polymorphism in the population is less than 5%;
a site screening module, configured to use a negative sample group consisting of a plurality of microsatellite-stable samples and a positive sample group consisting of a plurality of microsatellite-unstable samples to identify and retain those microsatellite sites in the second site set whose deletion ratio differs significantly between the negative sample group and the positive sample group;
wherein the deletion ratio refers to the ratio of the number of spanning reads whose repeating unit type is shorter than the reference sequence at the microsatellite site to the total number of spanning reads at the microsatellite site, and the spanning reads refer to reads that completely cover the microsatellite site and at least 2 bp of flanking sequence on each of its left and right sides.
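The two definitions that close claim 11 can be sketched as follows. The coordinate convention (inclusive start and end positions on the reference) and the representation of repeating unit types as repeat lengths are assumptions for illustration.

```python
def is_spanning(read_start, read_end, site_start, site_end, flank=2):
    """True if the read covers the site plus at least `flank` bp on each side.

    All coordinates are inclusive reference positions (assumed convention).
    """
    return read_start <= site_start - flank and read_end >= site_end + flank

def deletion_ratio(length_counts, reference_length):
    """Fraction of spanning reads whose repeat is shorter than the reference.

    length_counts maps an observed repeat length (bp) to its spanning-read
    count; their sum is the spanning coverage of the site.
    """
    total = sum(length_counts.values())
    if total == 0:
        raise ValueError("no spanning reads at this site")
    shorter = sum(count for length, count in length_counts.items()
                  if length < reference_length)
    return shorter / total
```

For example, a site with reference length 10 and spanning-read counts of 80, 15 and 5 for lengths 10, 9 and 8 has a deletion ratio of 20/100 = 0.2.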
12. The screening apparatus according to claim 11, wherein the similarity value is calculated according to the following formula: Σ(d2 + 1 - d1)/d2, wherein d1 is the distance to the microsatellite locus of each base, within the sequence of the set length at the left and right ends, that is the same as the base of the microsatellite locus, d2 is the set length, and d2 is 8-12 bp.
13. The screening apparatus according to claim 11, wherein the similarity threshold is 1.5 to 2.5.
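The similarity value of claims 12 and 13 can be illustrated with a small sketch. It assumes that d1 = 1 for the flanking base immediately adjacent to the repeat and that the sum runs over both flanks; the claim does not spell out either point, and the function names and the default threshold of 2.0 (within the claimed 1.5 to 2.5 range) are hypothetical.

```python
def flank_similarity(left_flank, right_flank, repeat_base, d2=10):
    """Sum (d2 + 1 - d1) / d2 over flanking bases equal to the repeat base.

    left_flank ends next to the repeat; right_flank starts next to it.
    Only the d2 bases closest to the repeat on each side are considered.
    """
    score = 0.0
    # Left flank: walk outward from the base adjacent to the repeat.
    for d1, base in enumerate(reversed(left_flank[-d2:]), start=1):
        if base == repeat_base:
            score += (d2 + 1 - d1) / d2
    # Right flank: the first base is adjacent to the repeat.
    for d1, base in enumerate(right_flank[:d2], start=1):
        if base == repeat_base:
            score += (d2 + 1 - d1) / d2
    return score

def passes_similarity_filter(score, threshold=2.0):
    """Condition b of the first condition: similarity below the threshold."""
    return score < threshold
```

Under this reading, matching bases close to the repeat contribute close to 1 each and distant ones contribute close to 1/d2, so flanks that resemble the repeat near its boundary are penalized most.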
14. The screening apparatus of claim 11, wherein the repeating unit type and frequency statistics module comprises:
an alignment module, configured to align the sequencing data of the amplicon library of each microsatellite-stable sample with a reference genome sequence to obtain an initial alignment result;
a positioning module, configured to extract, from the initial alignment result and according to the enrichment characteristic of the amplicon, the reads whose 5' end is an amplicon primer sequence, so as to eliminate alignment errors caused by similar sequences and obtain a corrected alignment result;
a spanning reads extraction module, configured to look up the first site set in the corrected alignment result and extract from it the spanning reads covering each of the microsatellite sites in the first site set;
and a counting module, configured to count the repeating unit types of each microsatellite site and the frequency of each repeating unit type, wherein the frequency is the proportion of the number of spanning reads of each repeating unit type at the microsatellite site to the total number of spanning reads covering the microsatellite site.
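The primer filter and frequency count of claim 14 can be sketched on raw read strings. This is a simplification: the claimed modules operate on alignment results, whereas the sketch locates the repeat with hypothetical anchor sequences that flank it (at least 2 bp each, and assumed not to end or begin with the repeat base).

```python
import re
from collections import Counter

def repeat_length_frequencies(reads, primer, left_anchor, right_anchor,
                              repeat_base):
    """Return {repeat_length: frequency} among primer-led spanning reads.

    A read is kept only if it starts with the amplicon primer (5' end) and
    contains left_anchor + repeat run + right_anchor, i.e. it spans the site.
    """
    pattern = re.compile(
        re.escape(left_anchor) + "(" + re.escape(repeat_base) + "+)"
        + re.escape(right_anchor)
    )
    counts = Counter()
    for read in reads:
        if not read.startswith(primer):
            continue  # not produced by this amplicon primer
        match = pattern.search(read)
        if match:
            counts[len(match.group(1))] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {length: count / total for length, count in counts.items()}
```

The most frequent length returned here is what the second condition of claim 11 compares against the reference sequence.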
15. The screening apparatus of claim 14, wherein said amplicon library is obtained by the CleanPlex multiplex PCR enrichment method using said amplification primers that amplify each of said microsatellite loci.
16. The screening apparatus according to claim 11, wherein the site screening module uses a non-parametric test method to count and retain microsatellite sites in which the deletion ratio of each microsatellite site in the second site set is significantly different between the negative sample group and the positive sample group.
17. The screening apparatus of claim 11, wherein the non-parametric test is a Wilcoxon test.
18. A baseline construction apparatus for detecting MSI, the baseline construction apparatus comprising:
an acquisition and statistics module, configured to obtain sequencing data of an amplicon library of a plurality of known MSS samples, and to count, for each sample, the spanning coverage of each microsatellite locus and the number of spanning reads covering each repeating unit type, wherein the microsatellite loci are the loci for amplicon-based next-generation sequencing MSI detection selected by the screening method according to any one of claims 1 to 8, and the amplicon library is obtained by the CleanPlex multiplex PCR enrichment method using amplification primers that amplify the microsatellite loci;
a baseline building module, configured to calculate, when the spanning coverage reaches a saturation value, the deletion ratio of each microsatellite locus from the number of spanning reads of each repeating unit type at that locus in each MSS sample, and to remove the microsatellite loci that are polymorphic in the microsatellite-stable samples, thereby obtaining the mean and the standard deviation of the deletion ratio across all samples at each microsatellite locus and building a baseline of the deletion ratio for each microsatellite locus;
wherein the spanning coverage refers to the sum of the number of spanning reads covering different types of repeating units at each microsatellite locus.
19. A device for detecting the state of a microsatellite, said device comprising:
an acquisition and calculation module, configured to acquire sequencing data of a sample to be detected based on an amplicon library and to calculate the spanning coverage and the deletion ratio of each microsatellite locus in the sample to be detected, wherein the amplicon library is obtained by the CleanPlex multiplex PCR enrichment method using amplification primers that amplify the microsatellite loci;
a quality control module, configured to determine that a microsatellite locus passes quality control if the spanning coverage of the microsatellite locus reaches a saturation value;
a comparing module, configured to compare a deletion ratio of each microsatellite locus of the sample to be detected with the baseline constructed by the baseline construction method according to claim 9;
an unstable site determination module, configured to judge that a microsatellite locus is unstable if the deletion ratio of that locus i in the sample to be detected satisfies deletion ratio(i) > Mean(Di) + n*SD(Di), where 3.5 ≤ n ≤ 4.4;
the microsatellite state judging module is used for judging the microsatellite state of the sample to be detected according to the following conditions:
(1) if the number n1 of the sites passing the quality control is more than or equal to 15, the number of unstable sites is n2, and n2/n1 is more than or equal to 0.1, judging the microsatellite state of the sample to be detected to be MSI-H;
(2) if the number n1 of the sites passing the quality control is more than or equal to 15, the number of unstable sites is n2, and n2/n1 is less than 0.1, judging the microsatellite state of the sample to be detected to be MSS;
(3) if the number n1 of the sites passing the quality control is less than 15, judging the microsatellite state of the sample to be detected to be undetermined;
wherein the microsatellite loci are the loci for amplicon-based next-generation sequencing MSI detection selected by the screening device of any one of claims 1 to 8.
20. A microsatellite locus for MSI detection based on amplicon next-generation sequencing, wherein said microsatellite locus is selected from at least 15 of the 68 microsatellite loci shown in table 1.
21. A kit comprising detection reagents for microsatellite loci for MSI detection based on amplicon next-generation sequencing, the microsatellite loci comprising at least 15 of the 68 microsatellite loci shown in table 1.
22. A processor, characterized in that the processor is configured to run a program, wherein the program is configured to perform the screening method according to any one of claims 1 to 8, or the baseline construction method according to claim 9, or the detection method according to claim 10 when running.
23. A storage medium for storing a program, wherein the program, when executed, performs the screening method of any one of claims 1 to 8, or the baseline construction method of claim 9, or the detection method of claim 10.
CN202111046575.3A 2021-09-08 2021-09-08 Microsatellite locus based on amplicon next-generation sequencing MSI detection, screening method and application thereof Active CN113488105B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111046575.3A CN113488105B (en) 2021-09-08 2021-09-08 Microsatellite locus based on amplicon next-generation sequencing MSI detection, screening method and application thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111046575.3A CN113488105B (en) 2021-09-08 2021-09-08 Microsatellite locus based on amplicon next-generation sequencing MSI detection, screening method and application thereof

Publications (2)

Publication Number Publication Date
CN113488105A CN113488105A (en) 2021-10-08
CN113488105B true CN113488105B (en) 2022-01-18

Family

ID=77947351

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111046575.3A Active CN113488105B (en) 2021-09-08 2021-09-08 Microsatellite locus based on amplicon next-generation sequencing MSI detection, screening method and application thereof

Country Status (1)

Country Link
CN (1) CN113488105B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114150067B (en) * 2022-02-07 2022-05-17 元码基因科技(北京)股份有限公司 Method, system and probe set for determining combination of sites for detecting microsatellite instability state
CN116705157B (en) * 2022-03-28 2024-01-30 北京吉因加医学检验实验室有限公司 Method and device for detecting microsatellite state of plasma sample based on second-generation sequencing

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8609338B2 (en) * 2006-02-28 2013-12-17 University Of Louisville Research Foundation, Inc. Detecting fetal chromosomal abnormalities using tandem single nucleotide polymorphisms
CN111627501B (en) * 2020-05-22 2023-06-02 无锡臻和生物科技有限公司 Microsatellite locus for detecting MSI, screening method and application thereof
CN111785324B (en) * 2020-07-02 2021-02-02 深圳市海普洛斯生物科技有限公司 Microsatellite instability analysis method and device
CN112365922B (en) * 2021-01-13 2021-06-15 臻和(北京)生物科技有限公司 Microsatellite locus for detecting MSI, screening method and application thereof

Also Published As

Publication number Publication date
CN113488105A (en) 2021-10-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant