CN111627501B - Microsatellite locus for detecting MSI, screening method and application thereof - Google Patents

Info

Publication number
CN111627501B
CN111627501B CN202010444206.9A CN202010444206A CN111627501B CN 111627501 B CN111627501 B CN 111627501B CN 202010444206 A CN202010444206 A CN 202010444206A CN 111627501 B CN111627501 B CN 111627501B
Authority
CN
China
Prior art keywords
sample
mss
microsatellite
type
microsatellite loci
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010444206.9A
Other languages
Chinese (zh)
Other versions
CN111627501A (en
Inventor
赵利利
于佳宁
闫慧婷
洪媛媛
陈维之
何骥
杜波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuxi Zhenhe Biotechnology Co ltd
Original Assignee
Wuxi Zhenhe Biotechnology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuxi Zhenhe Biotechnology Co ltd filed Critical Wuxi Zhenhe Biotechnology Co ltd
Priority to CN202010444206.9A priority Critical patent/CN111627501B/en
Publication of CN111627501A publication Critical patent/CN111627501A/en
Application granted granted Critical
Publication of CN111627501B publication Critical patent/CN111627501B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • C12Q1/6886 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30 Data warehousing; Computing architectures
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00 Oligonucleotides characterized by their use
    • C12Q2600/156 Polymorphic or mutational markers

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Biotechnology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Theoretical Computer Science (AREA)
  • Organic Chemistry (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Analytical Chemistry (AREA)
  • Zoology (AREA)
  • Databases & Information Systems (AREA)
  • Immunology (AREA)
  • Bioethics (AREA)
  • Pathology (AREA)
  • Genetics & Genomics (AREA)
  • Wood Science & Technology (AREA)
  • Biochemistry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Microbiology (AREA)
  • Oncology (AREA)
  • Hospice & Palliative Care (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Other Investigation Or Analysis Of Materials By Electrical Means (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention provides microsatellite loci for detecting MSI, a screening method, and applications thereof. Candidate microsatellite loci 7-15 bp in length whose flanking sequences on both sides have low similarity to the repeat are selected; the repeating unit types and their frequencies at these loci are determined in MSS sequencing data; loci with population polymorphism above 5% are removed; and loci whose repeating unit type frequencies have a degree of dispersion below a dispersion threshold are retained as the microsatellite loci of an MSS model sample. Finally, loci at which the difference level between a sample's repeating unit frequency distribution and that of the MSS model sample differs significantly between MSI-H samples and MSS samples are selected as the microsatellite loci for detecting MSI. Using the MSS model sample as the normal control makes it straightforward to establish a negative sample baseline or to determine the microsatellite status of a sample to be tested. As a result, instability can be detected by sequencing the test sample alone, without a control sample.

Description

Microsatellite locus for detecting MSI, screening method and application thereof
Technical Field
The invention relates to the field of high-throughput sequencing data analysis, in particular to a microsatellite locus for detecting MSI, a screening method and application thereof.
Background
Microsatellites are tandem repeat sequences in the human genome. Microsatellite instability (MSI) refers to a change in the number of microsatellite repeat units that gives rise to new alleles. Its underlying mechanism is dysfunction of the mismatch repair (MMR) system, which limits the ability to correct spontaneous length-altering somatic mutations in microsatellites; these mutations accumulate and ultimately produce MSI.
Mismatch repair (MMR) system dysfunction mainly arises in two ways: 1) a germline mutation in one or more of the mismatch repair genes MLH1, MSH2, MSH6 and PMS2 causes a mismatch repair defect, and MSI-H is observed in hereditary non-polyposis colorectal cancer (Lynch syndrome); 2) hypermethylation of the MLH1 promoter region, in which case MSI-H is observed in various cancers such as colorectal, endometrial, ovarian and gastric cancer.
MSI detection can be used to diagnose Lynch syndrome, and to guide medication and predict prognosis for patients with metastatic colorectal cancer, non-colorectal MSI-H solid tumors, and stage II colorectal cancer. In clinical practice, MSI status is mainly determined by the MSI-PCR method, which uses fluorescently labeled primers and capillary electrophoresis to determine fragment length polymorphisms at the 5 loci of the Promega panel: NR-21, NR-24, BAT-25, BAT-26 and MONO-27. The tumor sample is compared with a control sample: if none of the 5 microsatellite loci shows a change in PCR fragment size, the sample is microsatellite stable (MSS); if 1 of the 5 loci shows a change in PCR fragment size, the sample is microsatellite instability-low (MSI-L); if 2 or more of the 5 loci show a change in PCR fragment size, the sample is microsatellite instability-high (MSI-H).
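The five-locus MSI-PCR rule described above can be written as a small function. This is a minimal sketch for illustration; the function name `classify_msi_pcr` and its arguments are hypothetical and not part of the cited method.

```python
def classify_msi_pcr(shifted_loci: int, total_loci: int = 5) -> str:
    """Classify MSI status from the number of loci whose PCR fragment
    size changed between the tumor sample and the control sample."""
    if not 0 <= shifted_loci <= total_loci:
        raise ValueError("shifted_loci must be between 0 and total_loci")
    if shifted_loci == 0:
        return "MSS"      # no locus changed
    if shifted_loci == 1:
        return "MSI-L"    # exactly one locus changed
    return "MSI-H"        # two or more loci changed


for n in range(6):
    print(n, classify_msi_pcr(n))
```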
Immunohistochemical detection of mismatch repair proteins is another method for assessing microsatellite instability. The concordance between mismatch repair gene loss or intactness and microsatellite status reaches 0.92, so this method produces a certain proportion of missed and false detections.
In recent years, with the development of MSI algorithms for next-generation sequencing (NGS) data, NGS-based MSI analysis methods have come into increasingly wide use, for example MSIsensor, MSI-ColonCore, mSINGS, and internally developed MSI analysis methods, which offer high sensitivity and high specificity. Compared with MSI-PCR, the NGS approach can analyze MSI, SNV, CNV, gene fusion and other biomarkers at the same time, with considerable advantages in sample consumption, turnaround time and cost.
During PCR amplification, whether in MSI-PCR or in NGS, single-base repeat sequences (microsatellite loci) are prone to strand slippage because of their sequence characteristics, producing insertions or deletions. These newly generated allele types, called stutter, constitute the background noise of MSI analysis. Because of this background noise, a tumor purity of at least 20% is required for MSI analysis. Most plasma samples have a low tumor content, so their MSI status cannot be accurately determined by MSI-PCR or existing NGS methods. Existing methods therefore still need improvement for analyzing the MSI status of plasma samples with low tumor content.
Disclosure of Invention
The invention mainly aims to provide a microsatellite locus for detecting MSI, a screening method and application thereof, so as to solve the problem that the MSI state of a plasma sample with low tumor content is difficult to accurately analyze in the prior art.
In order to achieve the above object, according to one aspect of the present invention, there is provided a screening method for microsatellite loci of an MSS plasma model sample, the screening method comprising: extracting microsatellite loci that satisfy a first condition from a human reference genome sequence or a target gene capture sequence, recorded as a first locus set, wherein the first condition comprises: a. the locus is a single-base repeat sequence of 7-15 bp; b. the similarity value between the flanking sequences on both sides and the 7-15 bp single-base repeat sequence is below a similarity threshold; acquiring sequencing data of a plurality of MSS samples, identifying the first locus set in the sequencing data of each MSS sample, and counting the repeating unit types of each microsatellite locus in the first locus set and the frequency of occurrence of each repeating unit type; selecting from the first locus set the microsatellite loci that satisfy a second condition as a second locus set, the second condition comprising: polymorphism in the population below 5%, and capture efficiency during library construction and sequencing above a capture threshold; calculating, across all MSS plasma samples, the average level and the degree of dispersion of the frequency of occurrence of each repeating unit type at each microsatellite locus in the second locus set, and selecting the microsatellite loci whose average degree of dispersion is below a dispersion threshold as a third locus set; and taking the microsatellite loci in the third locus set as the microsatellite loci of the MSS plasma model sample, and taking the average frequency of occurrence of each repeating unit type at each microsatellite locus in the third locus set as the frequency distribution of repeating unit types at that locus in the MSS plasma model sample.
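The last two steps above (the dispersion filter and the construction of the model distribution) can be sketched as follows. This is an illustrative sketch only: the population standard deviation is assumed as the measure of dispersion, a repeating unit type missing from a sample is assumed to have frequency 0, and the name `build_mss_model` and the threshold value 0.05 are hypothetical.

```python
from statistics import mean, pstdev


def build_mss_model(samples, dispersion_threshold=0.05):
    """samples: list of {locus: {repeat_length: frequency}}, one per MSS sample.

    Returns {locus: {repeat_length: mean frequency}} for loci whose average
    dispersion across repeat types is below the threshold (assumed value).
    """
    if not samples:
        return {}
    # Only loci observed in every MSS sample are considered.
    loci = set.intersection(*(set(s) for s in samples))
    model = {}
    for locus in sorted(loci):
        types = sorted({t for s in samples for t in s[locus]})
        per_type = {t: [s[locus].get(t, 0.0) for s in samples] for t in types}
        # Assumption: dispersion = population SD, averaged over repeat types.
        avg_dispersion = mean([pstdev(v) for v in per_type.values()])
        if avg_dispersion < dispersion_threshold:
            model[locus] = {t: mean(v) for t, v in per_type.items()}
    return model


mss_samples = [
    {"A": {10: 0.8, 9: 0.2}, "B": {12: 0.9, 11: 0.1}},
    {"A": {10: 0.8, 9: 0.2}, "B": {12: 0.3, 11: 0.7}},
]
print(build_mss_model(mss_samples))  # locus "B" is too variable and is dropped
```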
Further, extracting microsatellite loci that satisfy the first condition from a human reference genome sequence or a target gene capture sequence, recorded as a first locus set, comprises: extracting microsatellite loci that are 7-15 bp single-base repeat sequences from the human reference genome sequence; for each microsatellite locus, calculating the similarity value between the sequences of a set length on the left and right sides of the 7-15 bp single-base repeat sequence and the 7-15 bp single-base repeat sequence itself; and selecting the microsatellite loci whose similarity value is below the similarity threshold as the first locus set; preferably, the similarity value is calculated as Σ (d2 + 1 − d1) / d2, wherein d1 is the distance to the microsatellite locus of a base, within the set-length sequences on the left and right sides, that is identical to the base of the 7-15 bp single-base repeat sequence, and d2 is the set length; preferably, d2 is 8 to 12 bp, more preferably 10 bp; preferably, the similarity threshold is 1.5 to 2.5, more preferably 2.
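The similarity value Σ (d2 + 1 − d1) / d2 can be computed as in the sketch below. It assumes that d1 is counted in bases starting from 1 at the base adjacent to the repeat, and that the sum runs over both flanks; these readings, and the function name `flank_similarity`, are assumptions made for illustration.

```python
def flank_similarity(left_flank: str, right_flank: str,
                     repeat_base: str, d2: int = 10) -> float:
    """Sum (d2 + 1 - d1) / d2 over flank bases equal to the repeat base.

    d1 is the 1-based distance from the flank base to the repeat
    (assumption: the base adjacent to the repeat has d1 = 1).
    """
    score = 0.0
    # Left flank: the base adjacent to the repeat is the last character.
    for d1, base in enumerate(reversed(left_flank[-d2:]), start=1):
        if base == repeat_base:
            score += (d2 + 1 - d1) / d2
    # Right flank: the base adjacent to the repeat is the first character.
    for d1, base in enumerate(right_flank[:d2], start=1):
        if base == repeat_base:
            score += (d2 + 1 - d1) / d2
    return score


SIMILARITY_THRESHOLD = 2.0  # the more preferred value stated in the text
s = flank_similarity("CCCCCCCCCA", "GGGGGGGGGA", "A")
print(s, s < SIMILARITY_THRESHOLD)
```

A matching base next to the repeat contributes 1.0, while one at the far end of a 10 bp flank contributes 0.1, so near-repeat matches dominate the score.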
Further, acquiring sequencing data of a plurality of MSS samples, identifying the first locus set in the sequencing data of each MSS sample, and counting the repeating unit types of each microsatellite locus in the first locus set and the frequency of occurrence of each repeating unit type comprises: aligning the sequencing data of each MSS sample to the reference genome sequence to obtain an alignment result; locating the first locus set in the alignment result, and extracting from the alignment result the end-to-end reads covering each microsatellite locus in the first locus set, wherein end-to-end reads are reads that cover a microsatellite locus satisfying the first condition together with at least 2 bp on each of its left and right sides; and counting each repeating unit type in the end-to-end reads covering each microsatellite locus and the frequency of occurrence of each repeating unit type.
Further, extracting the end-to-end reads covering each microsatellite locus in the first locus set from the alignment result comprises: identifying in the alignment result the end-to-end reads that belong to the same repeat sequence family, counting the number of each distinct repeating unit type within the same repeat sequence family, selecting the most numerous repeating unit type as the repeating unit type of that repeat sequence family, and counting the number of end-to-end reads supporting the microsatellite locus; preferably, each repeating unit type of each microsatellite locus is supported by at least two end-to-end reads; preferably, the capture efficiency is measured as the ratio of the number of end-to-end reads of each microsatellite locus to the sequencing depth of the sample; preferably, the capture threshold is ≥ 0.4.
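The counting of end-to-end reads and the capture efficiency check can be illustrated on plain read strings. This is a simplified sketch, not the described pipeline: it works on raw sequences instead of an alignment result, does not model repeat sequence families, and uses hypothetical names (`repeat_length_frequencies`, `capture_efficiency`) and toy flanks.

```python
import re
from collections import Counter


def repeat_length_frequencies(reads, left, base, right, min_support=2):
    """Count repeat lengths in reads that span left flank + repeat + right flank.

    left/right are flanking anchors of at least 2 bp; repeat types supported
    by fewer than min_support end-to-end reads are discarded.
    """
    pattern = re.compile(re.escape(left) + "(" + re.escape(base) + "+)" + re.escape(right))
    counts = Counter()
    for read in reads:
        match = pattern.search(read)
        if match:  # the read covers the whole locus plus both anchors
            counts[len(match.group(1))] += 1
    kept = {length: n for length, n in counts.items() if n >= min_support}
    total = sum(kept.values())
    freqs = {length: n / total for length, n in kept.items()} if total else {}
    return freqs, total


def capture_efficiency(n_end_to_end: int, depth: int) -> float:
    """Ratio of end-to-end reads to the sequencing depth of the sample."""
    return n_end_to_end / depth if depth else 0.0


reads = (["CG" + "A" * 8 + "GT"] * 3 + ["CG" + "A" * 7 + "GT"] * 2
         + ["CG" + "A" * 6 + "GT"] + ["A" * 6 + "GT"])  # last read lacks the left anchor
freqs, n_reads = repeat_length_frequencies(reads, "CG", "A", "GT")
print(freqs, n_reads, capture_efficiency(n_reads, 10) >= 0.4)
```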
According to a second aspect of the present application, there is provided a screening method for detecting microsatellite loci of MSI, the screening method comprising: selecting sequencing data of a plurality of known MSI-H samples and a plurality of known MSS samples, respectively screening microsatellite loci in an MSS plasma model sample according to any screening method, and respectively calculating to obtain frequency distribution of types of repeating units of each microsatellite locus of the known MSI-H samples and the known MSS samples; the level of difference between the frequency distribution of the type of repeating unit in each microsatellite loci of the known MSI-H sample and the known MSS sample and the frequency distribution of the type of repeating unit in the MSS plasma model sample is calculated separately, and the microsatellite loci having a significant difference between the known MSI-H sample and the known MSS sample are retained as the microsatellite loci for detecting MSI.
Further, calculating KLD values between the frequency distribution of the type of the repeating unit of the known MSI-H sample and the known MSS sample at each microsatellite loci and the frequency distribution of the type of the repeating unit in the MSS plasma model sample, respectively, according to formula (I), and retaining microsatellite loci for detecting MSI, for which there is a significant difference in KLD values between the known MSI-H sample and the known MSS sample;
$$\mathrm{KLD}(p \,\|\, q) = \sum_{x} p(x)\,\log\frac{p(x)}{q(x)} \qquad \text{(I)}$$
wherein p(x) represents the frequency distribution at the microsatellite locus in a known MSI-H sample or a known MSS sample, and q(x) represents the frequency distribution at the same microsatellite locus in the MSS plasma model sample; preferably, a non-parametric test is used to test whether the KLD values differ significantly between the known MSI-H samples and the known MSS samples, preferably the Wilcoxon test.
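A minimal sketch of this locus selection step is given below, under stated assumptions: the natural logarithm is used for the KLD, both distributions share the same repeating unit types with non-zero q(x), and the Wilcoxon rank-sum test is approximated by a normal distribution without tie or continuity correction. The function names and the example values are hypothetical.

```python
import math


def kld(p, q):
    """Kullback-Leibler divergence of p from q (natural log assumed)."""
    return sum(p[x] * math.log(p[x] / q[x]) for x in p if p[x] > 0)


def rank_sum_p_value(a, b):
    """Two-sided Wilcoxon rank-sum p-value, normal approximation
    (no tie correction of the variance, no continuity correction)."""
    pooled = sorted((value, group) for group, values in enumerate((a, b)) for value in values)
    rank_sum_a = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        rank_sum_a += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(a), len(b)
    u = rank_sum_a - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    return math.erfc(abs(z) / math.sqrt(2))


model = {8: 0.9, 7: 0.1}
print(kld({8: 0.5, 7: 0.5}, model))

# Hypothetical per-sample KLD values at one locus.
msi_h_klds = [5.0, 6.0, 7.0, 8.0, 9.0]
mss_klds = [0.0, 1.0, 2.0, 3.0, 4.0]
print(rank_sum_p_value(msi_h_klds, mss_klds) < 0.05)  # keep the locus if significant
```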
According to a third aspect of the present application, there are provided microsatellite loci for detecting MSI, the microsatellite loci being obtained using any of the screening methods described above.
According to a fourth aspect of the present application, there are provided microsatellite loci for detecting MSI, comprising at least 15 of the 38 microsatellite loci shown in Table 1.
According to a fifth aspect of the present application, there is provided a kit for detecting MSI, the kit comprising detection reagents for the microsatellite loci for detecting MSI, the microsatellite loci comprising at least 15 of the 38 microsatellite loci shown in Table 1.
According to a sixth aspect of the present application, there is provided a baseline construction method for detecting MSI, the construction method comprising: screening at least 15 of the 38 microsatellite loci shown in table 1 from a plurality of known MSS samples according to any one of the screening methods described above for MSI detection; counting the frequency distribution of the types of the repeated units of each microsatellite locus, and calculating the difference level of the frequency distribution of the types of the repeated units and the frequency distribution of the types of the repeated units of the MSS plasma model sample; removing microsatellite loci with polymorphism in each sample; the average level and degree of dispersion of the level of difference for each microsatellite loci of all MSS samples are counted to construct a baseline for MSI detection.
Further, the KLD value of the frequency distribution of each repeating unit type and the frequency distribution of the repeating unit type of the MSS plasma model sample is calculated according to the formula (I),
$$\mathrm{KLD}(p \,\|\, q) = \sum_{x} p(x)\,\log\frac{p(x)}{q(x)} \qquad \text{(I)}$$
wherein p (x) represents the frequency distribution of said microsatellite loci of a known MSS sample and q (x) represents the frequency distribution of said microsatellite loci of an MSS plasma model sample; the average level and degree of dispersion of KLD values for each microsatellite loci of all known MSS samples are counted to construct a baseline for MSI detection.
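Baseline construction can be sketched as computing, for each microsatellite locus, the mean and standard deviation of the KLD values of the known MSS samples against the model. The sketch assumes the sample standard deviation as the degree of dispersion, the natural logarithm, and distributions that already share the same repeating unit types with non-zero q(x); `build_baseline` is a hypothetical name.

```python
import math
from statistics import mean, stdev


def kld(p, q):
    """Kullback-Leibler divergence of p from q (natural log assumed)."""
    return sum(p[x] * math.log(p[x] / q[x]) for x in p if p[x] > 0)


def build_baseline(mss_samples, model):
    """Return {locus: (mean KLD, SD of KLD)} over the known MSS samples.

    mss_samples: list of {locus: {repeat_length: frequency}}
    model:       {locus: {repeat_length: frequency}} of the MSS plasma model sample
    """
    baseline = {}
    for locus, q in model.items():
        values = [kld(s[locus], q) for s in mss_samples if locus in s]
        if len(values) >= 2:  # an SD needs at least two samples
            baseline[locus] = (mean(values), stdev(values))
    return baseline


model = {"L1": {8: 0.8, 7: 0.2}}
known_mss = [{"L1": {8: 0.8, 7: 0.2}}, {"L1": {8: 0.7, 7: 0.3}}]
baseline = build_baseline(known_mss, model)
print(baseline)
```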
Further, when the repeating unit types in a known MSS sample are not consistent with the repeating unit types in the MSS plasma model sample, calculating the KLD value between the frequency distribution of repeating unit types in each known MSS sample and the frequency distribution of repeating unit types in the MSS plasma model sample comprises: taking the union of the repeating unit types in the known MSS sample and the repeating unit types in the MSS plasma model sample, denoting the number of repeating unit types in the union as M, and setting a minimum value ε; smoothing the frequency distribution of repeating unit types in the known MSS sample and the frequency distribution of repeating unit types in the MSS plasma model sample respectively; and calculating the KLD value between the smoothed frequency distribution of repeating unit types in the known MSS sample and the smoothed frequency distribution of repeating unit types in the MSS plasma model sample; preferably, the smoothing comprises: if, compared with the union, n repeating unit types are absent from the known MSS sample or from the MSS plasma model sample, the frequency of each absent repeating unit type is set to ε/n, and the frequency of each remaining repeating unit type is set to p(x) − ε/(M − n).
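The smoothing rule can be sketched as follows: each of the n absent types receives ε/n, and each of the M − n present types gives up ε/(M − n), so the distribution still sums to 1. The value of ε, the natural logarithm and the function names are assumptions for illustration.

```python
import math


def smooth(dist, union_types, eps=1e-6):
    """Give absent repeat types a small frequency so the KLD stays finite."""
    absent = [t for t in union_types if dist.get(t, 0.0) == 0.0]
    n, m = len(absent), len(union_types)
    if n == m:
        raise ValueError("distribution has no observed repeat types")
    if n == 0:
        return {t: dist[t] for t in union_types}
    return {
        t: eps / n if t in absent else dist[t] - eps / (m - n)
        for t in union_types
    }


def kld(p, q):
    """Kullback-Leibler divergence of p from q (natural log assumed)."""
    return sum(p[x] * math.log(p[x] / q[x]) for x in p if p[x] > 0)


sample = {8: 0.7, 7: 0.3}   # known MSS sample: lengths 7 and 8 observed
model = {8: 0.9, 9: 0.1}    # MSS plasma model sample: lengths 8 and 9
union = sorted(set(sample) | set(model))
p_s, q_s = smooth(sample, union), smooth(model, union)
print(p_s, q_s, kld(p_s, q_s))
```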
According to a seventh aspect of the present application, there is provided a method for detecting microsatellite status, the method comprising: obtaining from the sample to be tested, for at least 15 of the 38 microsatellite loci for detecting MSI shown in Table 1 and according to any one of the screening methods described above, the repeating unit types of each microsatellite locus and the frequency distribution of the repeating unit types; calculating the difference level g1 between the frequency distribution of repeating unit types at each microsatellite locus of the sample to be tested and the frequency distribution of repeating unit types in the MSS plasma model sample; calculating the Z value of the sample to be tested from g1 and the difference level g0 between the frequency distribution of repeating unit types at each microsatellite locus of the baseline samples and the frequency distribution of repeating unit types in the MSS plasma model sample; selecting the repeating unit type Mp with the highest frequency at each microsatellite locus of the sample to be tested and the repeating unit type Mq with the highest frequency at the same microsatellite locus of the MSS plasma model sample, and judging the microsatellite locus to be an unstable locus if either of the following holds: (1) Mp ≠ Mq and g1 > average(g0) + z·SD(g0); (2) Mp = Mq, p(Mp) <= average(q(Mq)) + z·SD(q(Mq)), and g1 > average(g0) + z·SD(g0); wherein average(g0) represents the average level of the difference level g0 of the baseline samples, SD(g0) represents the degree of dispersion of g0, and z represents the coefficient of the degree of deviation of g0; counting the Z values of the microsatellite loci that meet a depth threshold in the sample to be tested, obtaining the average level of the Z values, and judging the microsatellite status of the sample to be tested as follows: (1) if the number n1 of microsatellite loci meeting the depth threshold is ≥ 15, of which n2 are unstable loci, and n2/n1 ≥ a or the average level of the Z values ≥ b, the microsatellite status of the sample to be tested is judged to be MSI-H; (2) if n1 ≥ 15, n2/n1 < a, and the average level of the Z values < b, the microsatellite status of the sample to be tested is judged to be MSS; (3) if n1 < 15, the microsatellite status of the sample to be tested is undetermined; wherein a is 0.15-0.3 and b is 0.8-2.
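The two decision levels (per locus and per sample) can be sketched as below. The inequalities follow the text as written; the defaults a = 0.2 and b = 1.0 are hypothetical values chosen inside the stated ranges, z = 3 follows the 3SD variant given later in the text, and the function names are illustrative.

```python
def locus_is_unstable(mp, mq, p_mp, q_mq_mean, q_mq_sd,
                      g1, g0_mean, g0_sd, z=3.0):
    """Apply the two instability rules to one microsatellite locus.

    mp / mq: most frequent repeat type in the test sample / the model sample.
    p_mp: frequency of mp in the test sample.
    q_mq_mean, q_mq_sd: mean and SD of the frequency of mq.
    g1: difference level of the test sample; g0_mean, g0_sd: baseline statistics.
    """
    exceeds_baseline = g1 > g0_mean + z * g0_sd
    if mp != mq:
        return exceeds_baseline                      # rule (1)
    return p_mp <= q_mq_mean + z * q_mq_sd and exceeds_baseline  # rule (2)


def call_microsatellite_status(n1, n2, mean_z, a=0.2, b=1.0, min_loci=15):
    """n1: loci meeting the depth threshold; n2: unstable loci among them."""
    if n1 < min_loci:
        return "undetermined"
    if n2 / n1 >= a or mean_z >= b:
        return "MSI-H"
    return "MSS"


print(locus_is_unstable(7, 8, 0.5, 0.8, 0.02, 0.5, 0.01, 0.01))
print(call_microsatellite_status(n1=20, n2=4, mean_z=0.1))
```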
Further, the detection method comprises: calculating the difference level g0 and the difference level g1 according to formula (I) to obtain the KLD values of the baseline samples and the KLD value of the sample to be tested; and calculating the Z value of the sample to be tested from the KLD value of the sample to be tested and the average level and degree of dispersion of the KLD values of the baseline samples;
$$\mathrm{KLD}(p \,\|\, q) = \sum_{x} p(x)\,\log\frac{p(x)}{q(x)} \qquad \text{(I)}$$
wherein p (x) represents the frequency distribution of the microsatellite loci of the sample to be tested or the microsatellite loci of the baseline sample, and q (x) represents the frequency distribution of the microsatellite loci of the MSS plasma model sample.
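The text does not spell out the Z value formula. The sketch below assumes the usual standard score, Z = (KLD of the sample − mean baseline KLD) / SD of baseline KLD, and averages it over the loci that meet a depth threshold; the threshold value, the data layout and the function names are hypothetical.

```python
from statistics import mean


def locus_z(kld_sample, baseline_mean, baseline_sd):
    """Standard score of the sample KLD against the baseline (assumed form)."""
    if baseline_sd == 0:
        raise ValueError("baseline SD is zero; Z value is undefined")
    return (kld_sample - baseline_mean) / baseline_sd


def mean_z_value(sample_kld, baseline, depth, depth_threshold):
    """Average Z over loci whose depth meets the threshold.

    sample_kld: {locus: KLD of the test sample against the model}
    baseline:   {locus: (mean KLD, SD of KLD) of the baseline samples}
    depth:      {locus: read depth in the test sample}
    Returns (mean Z or None, number of loci used).
    """
    z_values = [
        locus_z(sample_kld[locus], *baseline[locus])
        for locus in sample_kld
        if locus in baseline and depth.get(locus, 0) >= depth_threshold
    ]
    return (mean(z_values) if z_values else None), len(z_values)


sample_kld = {"L1": 0.5, "L2": 0.1}
baseline = {"L1": (0.1, 0.2), "L2": (0.1, 0.1)}
depth = {"L1": 500, "L2": 50}
print(mean_z_value(sample_kld, baseline, depth, depth_threshold=100))
```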
Further, a microsatellite locus is judged to be an unstable locus if either of the following holds: (1) Mp ≠ Mq and the KLD value of the sample to be tested > average(Ki) + 3SD(Ki); (2) Mp = Mq, p(Mp) <= average(q(Mq)) + z·SD(q(Mq)), and the KLD value of the sample to be tested > average(Ki) + 3SD(Ki); wherein average(Ki) represents the average level of the KLD values of the baseline samples, and SD(Ki) represents the degree of dispersion of the KLD values of the baseline samples.
According to an eighth aspect of the present application, there is provided a screening apparatus for microsatellite loci of an MSS plasma model sample, the screening apparatus comprising: a first locus set module for extracting microsatellite loci that satisfy a first condition from a human reference genome sequence or a target gene capture sequence, recorded as a first locus set, the first condition comprising: a. the locus is a single-base repeat sequence of 7-15 bp; b. the similarity value between the flanking sequences on both sides and the 7-15 bp single-base repeat sequence is below a similarity threshold; an MSS sample screening statistics module for acquiring sequencing data of a plurality of MSS samples, identifying the first locus set in the sequencing data of each MSS sample, and counting the repeating unit types of each microsatellite locus in the first locus set and the frequency of occurrence of each repeating unit type; a second locus set module for selecting from the first locus set the microsatellite loci that satisfy a second condition as a second locus set, the second condition comprising: polymorphism in the population below 5%, and capture efficiency during library construction and sequencing above a capture threshold; a third locus set module for calculating, across all MSS plasma samples, the average level and the degree of dispersion of the frequency of occurrence of each repeating unit type at each microsatellite locus in the second locus set, and selecting the microsatellite loci whose average degree of dispersion is below a dispersion threshold as a third locus set; and an MSS plasma model sample module for taking the microsatellite loci in the third locus set as the microsatellite loci of the MSS plasma model sample, and taking the average frequency of occurrence of each repeating unit type at each microsatellite locus in the third locus set as the frequency distribution of repeating unit types at that locus in the MSS plasma model sample.
Further, the first locus set module includes: a first locus selection module for extracting microsatellite loci that are 7-15 bp single-base repeat sequences from a human reference genome sequence or a target gene capture sequence; a similarity value calculation module for calculating, for each microsatellite locus, the similarity value between the sequences of a set length on the left and right sides of the 7-15 bp single-base repeat sequence and the 7-15 bp single-base repeat sequence itself; and a second locus selection module for selecting the microsatellite loci whose similarity value is below the similarity threshold as the first locus set; preferably, the similarity value is calculated as Σ (d2 + 1 − d1) / d2, wherein d1 is the distance to the microsatellite locus of a base, within the set-length sequences on the left and right sides, that is identical to the base of the 7-15 bp single-base repeat sequence, and d2 is the set length; preferably, d2 is 8 to 12 bp, more preferably 10 bp; preferably, the similarity threshold is 1.5 to 2.5, more preferably 2.
Further, the MSS sample screening statistics module includes: an alignment module for aligning the sequencing data of each MSS sample to the reference genome sequence to obtain an alignment result; a search and extraction module for locating the first locus set in the alignment result and extracting from the alignment result the end-to-end reads covering each microsatellite locus in the first locus set, wherein end-to-end reads are reads that cover the microsatellite locus together with at least 2 bp on each of its left and right sides; and a type frequency statistics module for counting each repeating unit type in the end-to-end reads covering each microsatellite locus and the frequency of occurrence of each repeating unit type.
Further, the search and extraction module includes: a repeat sequence family statistics module for identifying in the alignment result the end-to-end reads that belong to the same repeat sequence family and counting the number of each distinct repeating unit type within the same repeat sequence family; and a repeating unit type selection module for selecting the most numerous repeating unit type as the repeating unit type of that repeat sequence family and counting the number of end-to-end reads supporting the microsatellite locus; preferably, each repeating unit type of each microsatellite locus is supported by at least two end-to-end reads; preferably, the capture efficiency is measured as the ratio of the number of end-to-end reads of each microsatellite locus to the sequencing depth of the sample; preferably, the capture threshold is ≥ 0.4.
According to a ninth aspect of the present application, there is provided a screening apparatus for microsatellite loci for detecting MSI, the screening apparatus comprising: a first calculation module for selecting sequencing data of a plurality of known MSI-H samples and a plurality of known MSS samples, identifying in each the microsatellite loci of the MSS plasma model sample obtained with any of the screening apparatuses described above, and calculating the frequency distribution of repeating unit types at each microsatellite locus for the known MSI-H samples and the known MSS samples respectively; and a second calculation module for calculating the difference level between the frequency distribution of repeating unit types at each microsatellite locus of the known MSI-H samples and the known MSS samples and the frequency distribution of repeating unit types in the MSS plasma model sample, and retaining the microsatellite loci at which the difference level differs significantly between the known MSI-H samples and the known MSS samples as the microsatellite loci for detecting MSI.
Further, the second calculation module calculates KLD values between the frequency distribution of the type of the repeating unit of each microsatellite loci of the known MSI-H sample and the known MSS sample and the frequency distribution of the type of the repeating unit in the MSS plasma model sample according to formula (I), respectively, and retains microsatellite loci for which there is a significant difference in KLD values between the known MSI-H sample and the known MSS sample as microsatellite loci for detecting MSI;
$$\mathrm{KLD}(p \,\|\, q) = \sum_{x} p(x)\,\log\frac{p(x)}{q(x)} \qquad \text{(I)}$$
wherein p (x) represents the frequency distribution of the microsatellite loci of an MSI-H sample or of a known MSS sample, q (x) represents the frequency distribution of the microsatellite loci of an MSS plasma model sample; preferably, a non-parametric test is used to test for significant differences between the KLD values, preferably a Wilcox test.
According to a tenth aspect of the present application, there is provided a baseline building apparatus for detecting MSI, the building apparatus comprising: a microsatellite loci screening module for screening at least 15 of the 38 microsatellite loci shown in table 1 for MSI detection from a plurality of known MSS samples according to any of the screening apparatus described above; the frequency distribution difference statistics module is used for counting the frequency distribution of the types of the repeated units of each microsatellite locus and calculating the difference level of the frequency distribution of the types of the repeated units and the frequency distribution of the types of the repeated units of the MSS plasma model sample; a diversity removal module for removing samples having polymorphisms from a plurality of known MSS samples; and the baseline establishment module is used for counting the average level and the discrete degree of the difference level of each microsatellite loci of all MSS samples, so as to construct a baseline for detecting MSI.
Further, the frequency distribution difference statistics module is a KLD module, the KLD module is used for calculating the KLD value of the frequency distribution of each repeating unit type and the frequency distribution of the repeating unit type of the MSS plasma model sample according to the formula (I),
$$\mathrm{KLD}(p \,\|\, q) = \sum_{x} p(x)\,\log\frac{p(x)}{q(x)} \qquad \text{(I)}$$
wherein p (x) represents the frequency distribution of said microsatellite loci of a known MSS sample and q (x) represents the frequency distribution of said microsatellite loci of an MSS plasma model sample; and the baseline establishment module is used for counting the average level and the discrete degree of the KLD value of each microsatellite loci of all known MSS samples, so as to construct a baseline for detecting MSI.
Further, the KLD module further comprises a type-inconsistency KLD calculation module, used for calculating, when the repeating unit types in a known MSS sample are inconsistent with the repeating unit types in the MSS plasma model sample, the KLD value between the frequency distribution of the repeating unit types in that known MSS sample and the frequency distribution of the repeating unit types of the MSS plasma model sample. The type-inconsistency KLD calculation module comprises: a union module for taking the union of the repeating unit types in the known MSS sample and the repeating unit types in the MSS plasma model sample, denoting the number of repeating unit types in the union as M, and setting a very small value ε; a smoothing module for smoothing the frequency distribution of the repeating unit types in the known MSS sample and the frequency distribution of the repeating unit types in the MSS plasma model sample, respectively; and a post-processing calculation module for calculating the KLD value between the smoothed frequency distribution of the repeating unit types in the known MSS sample and the smoothed frequency distribution of the repeating unit types of the MSS plasma model sample. Preferably, the smoothing process includes: relative to the union, if n repeating unit types are absent from the known MSS sample or from the MSS plasma model sample, the frequency of each absent repeating unit type is set to ε/n, and the frequency of each remaining repeating unit type is set to p(x) - ε/(M - n).
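The smoothing and KLD calculation described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function names `smooth` and `kld`, the default `eps=1e-6`, the natural logarithm, and the use of repeat lengths as dictionary keys are all assumptions, since the text only specifies "a very small ε" and formula (I). The sketch also assumes every present type has a frequency larger than ε/(M - n) and that each distribution contains at least one type.

```python
import math

def smooth(dist, all_types, eps=1e-6):
    """Give absent repeat unit types a small frequency so the KLD stays finite.

    dist: {repeat_unit_type: frequency}; all_types: union of types (size M).
    Absent types get eps/n each; present types lose eps/(M - n) each,
    so the total probability mass is unchanged.
    """
    missing = {t for t in all_types if dist.get(t, 0.0) == 0.0}
    n, m = len(missing), len(all_types)
    if n == 0:
        return {t: dist[t] for t in all_types}
    return {
        t: (eps / n) if t in missing else dist[t] - eps / (m - n)
        for t in all_types
    }

def kld(p, q, eps=1e-6):
    """KLD(p || q) per formula (I), after smoothing both distributions."""
    all_types = sorted(set(p) | set(q))
    ps = smooth(p, all_types, eps)
    qs = smooth(q, all_types, eps)
    return sum(ps[t] * math.log(ps[t] / qs[t]) for t in all_types)

# Hypothetical frequencies: the sample lacks the 11 bp type seen in the model.
sample = {10: 0.9, 9: 0.1}
model = {10: 0.95, 9: 0.04, 11: 0.01}
print(kld(sample, model))
```

Because both distributions still sum to 1 after smoothing, the result remains non-negative and equals 0 only when the two distributions coincide.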
According to an eleventh aspect of the present application, there is provided a detection apparatus for a microsatellite state, the detection apparatus comprising: a microsatellite locus screening module, configured to screen, from the sample to be tested and according to any one of the screening devices described above, the repeating unit types and the frequency distribution of the repeating unit types of each microsatellite locus, for at least 15 of the 38 microsatellite loci shown in Table 1 for detecting MSI; a difference level detection module for calculating a difference level g1 between the frequency distribution of the repeating unit types of each microsatellite locus of the sample to be tested and the frequency distribution of the repeating unit types of the MSS plasma model sample; a Z value calculation module for calculating the Z value of the sample to be tested according to the difference level g0 between the frequency distribution of the repeating unit types of each microsatellite locus of the baseline sample and the frequency distribution of the repeating unit types of the MSS plasma model sample; an unstable site judging module for selecting the type Mp of the highest-frequency repeating unit in each microsatellite locus of the sample to be tested and the type Mq of the highest-frequency repeating unit at the same microsatellite locus of the MSS plasma model sample, and judging the microsatellite locus to be an unstable site according to either of the following: (1) Mp is not equal to Mq, and the difference level of the sample to be tested satisfies g1 > average(g0) + zSD(g0); (2) Mp = Mq, p(Mp) <= average(q(Mq)) + zSD(q(Mq)), and at the same time the difference level of the sample to be tested satisfies g1 > average(g0) + zSD(g0); wherein average(g0) represents the average level of the difference level g0 of the baseline sample, SD(g0) represents the degree of dispersion of the difference level g0 of the baseline sample, and z represents the coefficient of the degree of deviation from the difference level g0 of the baseline sample; and a microsatellite state judging module for counting the Z values of the microsatellite loci meeting a depth threshold in the sample to be tested, obtaining the average level of the Z values, and judging the microsatellite state of the sample to be tested according to the following conditions: (1) the number n1 of microsatellite loci meeting the depth threshold is ≥ 15, of which n2 are unstable loci; if n2/n1 ≥ a, where a is 0.15 to 0.3, or the average level of the Z values is ≥ b, where b is 0.8 to 2, the microsatellite state of the sample to be tested is judged to be MSI-H; (2) the number n1 of microsatellite loci meeting the depth threshold is ≥ 15, of which n2 are unstable loci; if n2/n1 < a and the average level of the Z values is < b, the microsatellite state of the sample to be tested is judged to be MSS; (3) if the number n1 of microsatellite loci meeting the depth threshold is < 15, the microsatellite state of the sample to be tested is undetermined.
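The three sample-level conditions can be read as a short decision function. The sketch below is illustrative only: the function name `call_microsatellite_state` and the defaults `a=0.2` and `b=1.0` are assumptions picked from inside the stated ranges (a in 0.15 to 0.3, b in 0.8 to 2), not values fixed by the text.

```python
def call_microsatellite_state(n1, n2, mean_z, a=0.2, b=1.0, min_sites=15):
    """Judge the microsatellite state of one sample.

    n1: number of loci meeting the depth threshold
    n2: number of those loci judged unstable
    mean_z: average Z value over the n1 loci
    a, b: illustrative thresholds chosen from the ranges given in the text
    """
    if n1 < min_sites:
        return "undetermined"      # condition (3): too few usable loci
    if n2 / n1 >= a or mean_z >= b:
        return "MSI-H"             # condition (1)
    return "MSS"                   # condition (2): n2/n1 < a and mean_z < b

print(call_microsatellite_state(n1=30, n2=9, mean_z=0.4))
```

Either criterion alone is sufficient for an MSI-H call, so a sample with few unstable loci but a high average Z value is still reported as MSI-H.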
Further, the detection device includes: calculating the difference level g according to the formula (I) 0 And a difference level g 1 Obtaining the KLD value of the baseline sample and the KLD value of the sample to be tested; calculating the Z value of the sample to be detected according to the average level and the discrete degree of the KLD value of the sample to be detected and the KLD value of the baseline sample;
$$\mathrm{KLD}(p \,\|\, q) = \sum_{x} p(x)\,\log\frac{p(x)}{q(x)} \qquad \text{(I)}$$
wherein p (x) represents the frequency distribution of the microsatellite loci of the sample to be tested or the microsatellite loci of the baseline sample, and q (x) represents the frequency distribution of the microsatellite loci of the MSS plasma model sample.
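The text states that the Z value is obtained from the KLD value of the sample to be tested together with the average level and degree of dispersion of the baseline KLD values, but does not spell out the expression. The sketch below assumes the conventional standard score, (sample KLD minus baseline mean) divided by baseline standard deviation, and assumes the sample standard deviation is the dispersion measure; the function name `z_value` is hypothetical.

```python
import statistics

def z_value(sample_kld, baseline_klds):
    """Standard score of one locus KLD against the MSS baseline (assumed form)."""
    mean = statistics.mean(baseline_klds)
    sd = statistics.stdev(baseline_klds)   # sample SD; population SD is another option
    return (sample_kld - mean) / sd

# Hypothetical baseline KLD values for one locus across three MSS samples.
print(z_value(0.05, [0.01, 0.02, 0.03]))
```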
Further, a microsatellite locus is judged to be an unstable locus according to either of the following: (1) Mp is not equal to Mq, and the KLD value of the sample to be tested > average(Ki) + 3SD(Ki); (2) Mp = Mq, p(Mp) <= average(q(Mq)) + zSD(q(Mq)), and at the same time the KLD value of the sample to be tested > average(Ki) + 3SD(Ki); wherein average(Ki) represents the average level of the KLD values of the baseline samples, and SD(Ki) represents the degree of dispersion of the KLD values of the baseline samples.
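A hedged sketch of the locus-level rule follows, reading both branches as requiring the sample KLD to exceed average(Ki) + 3SD(Ki). Several points are assumptions made for illustration: the function name `is_unstable_locus`, the reading of average(q(Mq)) and SD(q(Mq)) as statistics of the top-type frequency across baseline samples, the use of the sample standard deviation, and the default `z=3.0`, which the text leaves unspecified.

```python
import statistics

def is_unstable_locus(p, q_model, sample_kld, baseline_klds,
                      baseline_top_freqs, z=3.0):
    """Apply rules (1) and (2) to one microsatellite locus.

    p, q_model: {repeat_unit_type: frequency} for the sample and the model
    baseline_klds: KLD values Ki of this locus in the baseline samples
    baseline_top_freqs: frequency of type Mq in each baseline sample (assumed)
    z: deviation coefficient (hypothetical default)
    """
    kld_cutoff = statistics.mean(baseline_klds) + 3 * statistics.stdev(baseline_klds)
    if sample_kld <= kld_cutoff:
        return False                       # both rules need KLD above the cutoff
    mp = max(p, key=p.get)                 # highest-frequency type in the sample
    mq = max(q_model, key=q_model.get)     # highest-frequency type in the model
    if mp != mq:
        return True                        # rule (1)
    freq_cutoff = (statistics.mean(baseline_top_freqs)
                   + z * statistics.stdev(baseline_top_freqs))
    return p[mp] <= freq_cutoff            # rule (2)

print(is_unstable_locus({10: 0.6, 9: 0.4}, {10: 0.95, 9: 0.05},
                        0.5, [0.01, 0.02, 0.03], [0.94, 0.95, 0.96]))
```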
According to a twelfth aspect of the present application, there is provided a storage medium including a stored program, wherein, when the program is run, the device in which the storage medium is located is controlled to execute any one of the above screening methods, any one of the above construction methods, or any one of the above detection methods.
According to a thirteenth aspect of the present application, there is provided a processor for running a program, wherein the program when run performs any one of the screening methods described above, or any one of the construction methods described above, or any one of the detection methods described above.
By applying the technical scheme of the invention, the MSS plasma model sample is established as the contrast of the normal sample.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention. In the drawings:
FIG. 1 shows a graph of saturation analysis of end-to-end reads for 38 microsatellite loci;
FIG. 2 shows the relationship of the site repeat fragment length to the stutter ratio; and
figure 3 shows the effect of different deduplication methods on the frequency distribution of microsatellite loci repeat unit types.
Detailed Description
It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other. The present invention will be described in detail with reference to examples.
In order to make the present application solution better understood by those skilled in the art, the following description will be made in detail and with reference to the accompanying drawings in the embodiments of the present application, it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, shall fall within the scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate in order to describe the embodiments of the present application described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For convenience of description, the following will describe some terms or terms related to the embodiments of the present application:
Spanning reads: in this application, spanning reads (also called end-to-end reads) are reads that completely cover the microsatellite locus region and extend at least 2 bp beyond each of its left and right ends.
Duplicate: in this application, a duplicate (duplicate reads family, i.e. a family of repeated sequences) arises when the same DNA fragment is sequenced multiple times. Because PCR amplification is required during library construction, one DNA fragment is amplified into many copies. When the sequencing depth is high, the same sequence is read multiple times, and these reads are called a duplicate (reads family). In microsatellite sequences, strand slippage can occur during PCR, lengthening or shortening the microsatellite sequence, and repeated sequences of different lengths become different repeating unit types. There are generally two methods for removing duplicate reads: one based on alignment to the reference genome, the other based on comparing the reads with one another. The principle of the first is that identical reads align to the same position with the same insert length; samtools and picard both provide tools based on this principle for removing duplicate reads from a BAM file. The second, based on comparing reads directly, is used mainly when no reference genome is available or alignment is inconvenient. Because the number of high-throughput sequencing reads is large and short-read alignment software is not suited to aligning reads against themselves, direct comparison is typically used to find identical reads.
KLD: Kullback-Leibler divergence, abbreviated KLD and also called relative entropy, is an asymmetric measure of the difference between two probability distributions P and Q defined on the same event space.
Wilcox test: a non-parametric test for detecting differences between two groups, also called the Wilcoxon rank-sum test. When the data do not satisfy the parametric assumptions of the t-test (for example, the distribution is not normal, or the variable is heavily skewed or ordinal in nature) so that the t-test cannot be used, a non-parametric method can be used to complete the test.
The ratio, abundance and frequency, which have the same meaning in this application, refer to the ratio of the number of reads of a repeat unit type at the microsatellite loci to the total number of reads covering the microsatellite loci.
As mentioned in the background section, plasma sample sequencing is characterized by low tumor content, high depth and high repetition rate compared to tissue samples, and MSI status cannot be accurately determined for such samples using MSI-PCR or existing NGS methods. For this purpose, the applicant has studied and analyzed the existing detection schemes of MSI status, in particular as follows:
1. For MSI analysis, first-generation sequencing methods include MSI-PCR, and second-generation sequencing methods include MSIsensor, MSI-ColonCore, mSINGS and others. These methods require a tumor purity greater than 20% and are suited to tissue samples. For plasma samples, the only reported approach relies on an improved library construction method that allows errors introduced by PCR during library construction or sequencing to be corrected afterwards from the sequencing data; because it uses UMI technology, its economic cost in application is high.
2. Site length comparison: the PCR process affects single-base repeats of different lengths differently. For 47 MSS samples, the stutter at sites with lengths of 10 bp, 11 bp, 12 bp, 14 bp, 15 bp, 21 bp, 22 bp, 23 bp, 24 bp and 25 bp was counted. The results are shown in FIG. 2: 1) the proportion and dispersion of stutter at sites 21-25 bp long are markedly higher than at sites 7-15 bp long; 2) within the 7-15 bp range, the stutter proportion increases with site length, whereas within the 21-24 bp range the stutter proportion shows no clear trend with site length, probably because the differences between individual sites are larger.
3. Comparison of deduplication methods: the effect of the deduplication method on the result was evaluated. Five microsatellite loci with lengths of 10 bp, 11 bp, 12 bp, 14 bp and 15 bp were randomly selected, and the effects of no deduplication, deduplication with the picard software, and the deduplication method of the present application on the repeating unit type frequency distribution of the loci were evaluated. The results are shown in FIG. 3: 1) whichever deduplication method is used, the number of repeating unit types is reduced; 2) with picard deduplication, background noise increases as the length of the repeat increases: at the 10 bp, 11 bp and 12 bp sites, picard deduplication gives almost the same result as no deduplication, while at the 14 bp and 15 bp sites it increases background noise; 3) the deduplication algorithm of the present application reduces the stutter ratio and the background noise at loci of all lengths.
According to the characteristics, the background noise is reduced from the two aspects of microsatellite locus selection and algorithm optimization, so that the purpose of microsatellite analysis of the plasma sample is achieved. The specific improvement thought is as follows:
1) Microsatellite loci: a) Compared with long (> 20 bp) single-base repeats, short (7-15 bp) repeats yield relatively more spanning (end-to-end) reads, are less affected by the PCR process, produce a lower proportion of stutter, and show a stable repeating unit type frequency distribution in the population with low background noise; single-base repeats of 7-15 bp are therefore selected as the site pool. b) On this basis, the similarity between the 10 bp flanking sequence on each side and the microsatellite is calculated, and sites with low similarity are selected to reduce the influence of sequencing and alignment errors on the result. c) The polymorphism ratio, sensitivity and specificity of the sites are then counted, and 38 sites with a low polymorphism ratio and high sensitivity and specificity are selected.
2) Algorithm: a) The widely used deduplication software picard classifies reads aligned to the same genomic position with the same insert length as one duplicate and, within each duplicate, retains the read with the highest sequencing quality as its representative; this tends to select reads containing shorter repeating unit types and amplifies background noise. For MSI analysis, we optimized the deduplication algorithm: for reads that completely cover a given microsatellite locus and belong to one duplicate, the locus fragment is extracted from each read and its length calculated, and the fragment length with the largest count is selected as the representative of the duplicate. This algorithm corrects the allele type so that the result approaches the true allele type distribution; the higher the duplication ratio, the stronger the correction and the lower the background noise. Plasma samples generally have a high sequencing depth and a high duplication ratio, so this deduplication algorithm effectively reduces their background noise. b) The Kullback-Leibler divergence (KLD) is an asymmetric measure of the difference between two probability distributions. Applied to MSI analysis, it captures comprehensive information about the number, abundance and order of the repeating unit types of a microsatellite locus, and is therefore used as the characteristic value for judging the state of the locus. KLD values are >= 0; when KLD = 0, the repeating unit type frequency distribution of the microsatellite locus in the sample to be tested is identical to that of the MSS sample. The larger the KLD value, the greater the difference between the repeating unit type frequency distribution of the locus in the sample to be tested and in the MSS sample, and the locus is judged to be unstable when the difference is significant.
Compared with the prior analysis method which uses the Peak number or the highest Peak ratio as an index for judging the microsatellite locus state, the analysis method has the advantages that when the difference is analyzed, the KLD method has more comprehensive information (comprising the number, the abundance and the position of peaks with different lengths) for each locus, so that the detection sensitivity is higher.
For plasma samples, we first select a batch of microsatellite loci with low background noise (apart from the highest-frequency repeating unit type, all other repeating unit types are regarded as background noise, so the higher the frequency of the highest-peak repeating unit type, the lower the background noise). We then further reduce background noise by optimizing the deduplication algorithm, and use the KLD value to characterize each microsatellite locus for MSI analysis. This greatly reduces background noise throughout the process and strengthens the true mutation signal, so that MSI-H can be detected in samples with low tumor content.
Based on the above results, the applicant has proposed a series of technical solutions of the present application.
Example 1
In this embodiment, a screening method for microsatellite loci of an MSS plasma model sample is provided, the screening method comprising:
S101, extracting microsatellite loci meeting a first condition from a human reference genome sequence or a target gene capture sequence, and recording them as a first locus set, wherein the first condition comprises: a. the locus is a 7-15 bp single-base repeat; b. the similarity value between the flanking sequences on both sides and the 7-15 bp single-base repeat is lower than a similarity threshold;
S103, acquiring sequencing data of a plurality of MSS samples, screening the first locus set from the sequencing data of each MSS sample, and counting the repeating unit types of each microsatellite locus in the first locus set and the occurrence frequency of each repeating unit type;
S105, selecting microsatellite loci satisfying a second condition from the first locus set as a second locus set, the second condition including: polymorphism in the population below 5%, and capture efficiency during library construction and sequencing above a capture threshold;
S107, calculating, across all MSS plasma samples, the average level and the degree of dispersion of the occurrence frequency of each repeating unit type of each microsatellite locus in the second locus set, and selecting the microsatellite loci whose average degree of dispersion is lower than a discrete threshold as a third locus set;
S109, taking the microsatellite loci in the third locus set as the microsatellite loci of the MSS plasma model sample, and taking the average level (median) of the occurrence frequency of each repeating unit type of each microsatellite locus in the third locus set as the frequency distribution of the repeating unit types of each microsatellite locus in the MSS plasma model sample.
In the above screening method for the microsatellite loci of the MSS plasma model sample, candidate microsatellite loci that are short and whose flanking sequences at both ends have low similarity to the repeat are first selected from the human reference genome sequence or the designed gene capture sequence. The repeating unit types and their frequencies at these loci are then detected in the MSS sequencing data, and loci whose population polymorphism is not lower than 5% are removed to obtain the second locus set. From these, microsatellite loci at which the frequencies of the repeating unit types are relatively concentrated, with a degree of dispersion lower than the discrete threshold, are selected as the third locus set. The microsatellite loci in the third locus set are the microsatellite loci of the MSS plasma model sample, and the average level of the frequency of each repeating unit type in the third locus set is the frequency distribution of the repeating unit types of the MSS plasma model sample.
The microsatellite loci of the MSS plasma model sample and the frequency distribution of the types of the repeated units thereof are convenient to be used as a reference object when detecting the microsatellite state of the sample to be detected or establishing a negative sample base line. And the control sample is not required to be sequenced at the same time when the sample to be detected is detected. So that only a single sample can be sequenced to detect an unstable state. The MSS sample herein preferably refers to a normal healthy sample, which may be a normal tissue sample or a normal lymphocyte sample, and the established plasma model sample is a normal control sample model. Preferably, the MSS sample is a leukocyte sample. It should be noted that other MSS samples that may have somatic SNV mutations are not excluded, although such MSS samples with somatic SNV mutations may have a situation that affects a certain microsatellite locus variation.
Polymorphisms in the above population can be assessed as follows: if the frequencies of the repeating unit types corresponding to the first high frequency and the second high frequency are similar, determining that the locus is heterozygous, and adding a polymorphism sample. If the frequency of the first high-frequency repeating unit type is far greater than the frequency of the repeating unit type corresponding to the second high frequency, determining that the locus is homozygous, and the first high-frequency repeating unit type is different from the first high-frequency repeating unit type of most normal samples, and adding one polymorphic sample. The polymorphism ratio is the ratio of the polymorphic sample to the total sample.
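The polymorphism assessment above can be sketched as follows. The text does not quantify when two frequencies are "similar" or when one is "far greater", so the single cutoff `het_ratio=0.5` on the second-to-first frequency ratio is purely a hypothetical stand-in, as is the function name `polymorphism_ratio`.

```python
from collections import Counter

def polymorphism_ratio(sample_dists, het_ratio=0.5):
    """Fraction of samples counted as polymorphic at one locus.

    sample_dists: list of {repeat_unit_type: frequency}, one per normal sample
    het_ratio: hypothetical cutoff; second/first frequency at or above it
               is treated as "similar", i.e. heterozygous
    """
    tops = [max(d, key=d.get) for d in sample_dists]
    common_top = Counter(tops).most_common(1)[0][0]   # top type of most samples
    polymorphic = 0
    for dist, top in zip(sample_dists, tops):
        freqs = sorted(dist.values(), reverse=True)
        second = freqs[1] if len(freqs) > 1 else 0.0
        if second / freqs[0] >= het_ratio:
            polymorphic += 1       # heterozygous locus
        elif top != common_top:
            polymorphic += 1       # homozygous, but not the common type
    return polymorphic / len(sample_dists)

samples = [
    {10: 0.95, 9: 0.05},
    {10: 0.90, 9: 0.10},
    {10: 0.50, 11: 0.50},   # heterozygous
    {11: 0.96, 10: 0.04},   # homozygous for a different type
]
print(polymorphism_ratio(samples))
```

A locus would then pass the second condition of the screening method only if this ratio is below 5%.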
The average level of the occurrence frequency of each repeating unit type of a microsatellite locus may be represented by the mean or the median, or by any other parameter capable of representing an average level. The degree of dispersion may be the standard deviation, the variance, or the like. The discrete threshold referred to above in "selecting the microsatellite loci whose average degree of dispersion is lower than a discrete threshold as a third locus set" is preferably 0.01 in this application.
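Steps S107 and S109 for a single locus can be illustrated with the sketch below. It assumes the median as the average level and the sample standard deviation as the degree of dispersion (both are among the options the text allows), treats a type missing from a sample as frequency 0, and uses the preferred discrete threshold of 0.01; the function name `build_model_locus` is hypothetical.

```python
import statistics

def build_model_locus(sample_dists, sd_threshold=0.01):
    """Return the model frequency distribution for one locus, or None.

    sample_dists: list of {repeat_unit_type: frequency}, one per MSS sample.
    The locus is kept only if the mean, over repeat unit types, of the
    per-type standard deviation across samples is below sd_threshold.
    """
    types = sorted(set().union(*sample_dists))
    per_type = {t: [d.get(t, 0.0) for d in sample_dists] for t in types}
    mean_sd = statistics.mean([statistics.stdev(v) for v in per_type.values()])
    if mean_sd >= sd_threshold:
        return None                                   # too dispersed: drop locus
    return {t: statistics.median(v) for t, v in per_type.items()}

stable = [{10: 0.950, 9: 0.050}, {10: 0.952, 9: 0.048}, {10: 0.948, 9: 0.052}]
noisy = [{10: 0.9, 9: 0.1}, {10: 0.6, 9: 0.4}, {10: 0.8, 9: 0.2}]
print(build_model_locus(stable))
print(build_model_locus(noisy))
```

Note that per-type medians need not sum to exactly 1; whether to renormalize is not stated in the text and is left out here.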
The extracting microsatellite loci meeting the first condition from the human reference genome sequence or the target gene capturing sequence, and recording the microsatellite loci as a first locus set comprises: extracting microsatellite loci of 7-15 bp single base repetitive sequences from a human reference genome sequence; calculating the similarity value of the sequence with the set length at the left end and the right end of the single base repetitive sequence of 7-15 bp and the single base repetitive sequence of 7-15 bp for each microsatellite locus; microsatellite loci with similarity values lower than a similarity threshold are selected as a first set of loci.
Preferably, the similarity value is calculated as $\sum (d_2 + 1 - d_1)/d_2$, where the sum runs over every base in the flanking sequences of set length at the left and right ends that is identical to the repeated base of the 7-15 bp single-base repeat, d1 is the distance from that base to the microsatellite locus, and d2 is the set length; preferably, d2 is 8 to 12 bp, more preferably 10 bp; preferably, the similarity threshold is 1.5 to 2.5, more preferably 2.
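As a worked illustration of the similarity value, the sketch below scores each flanking base that matches the repeated base, weighting it by its distance to the locus. It assumes that d1 = 1 for the base immediately adjacent to the repeat and that matches from both flanks are added into a single sum; the text does not state either point explicitly. The function name `flank_similarity` and the example sequences are hypothetical.

```python
def flank_similarity(left_flank, right_flank, repeat_base, d2=10):
    """Sum of (d2 + 1 - d1) / d2 over flanking bases equal to repeat_base.

    left_flank: sequence upstream of the repeat (its last base touches the locus)
    right_flank: sequence downstream of the repeat (its first base touches it)
    d2: set flank length; 10 bp is the preferred value in the text
    """
    # Reorder the left flank so that index 0 is the base nearest the locus.
    left = left_flank[-d2:][::-1]
    right = right_flank[:d2]
    score = 0.0
    for flank in (left, right):
        for i, base in enumerate(flank):
            d1 = i + 1                      # distance to the microsatellite locus
            if base == repeat_base:
                score += (d2 + 1 - d1) / d2
    return score

# Poly-A locus: the left flank has A at distances 3 and 7, the right flank none.
score = flank_similarity("CGTACGTACG", "CGTCGTCGTC", "A")
print(round(score, 3), score < 2)           # compare against the preferred threshold of 2
```

Matches close to the repeat contribute almost 1 each, while matches near the far end of the flank contribute about 1/d2, so the score mainly penalizes repeat-like bases right next to the locus.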
Selecting as candidate microsatellite loci those 7-15 bp single-base repeats whose flanking sequences at both ends differ markedly from the repeat significantly reduces the interference of the flanking sequences with variant detection at the microsatellite locus, and thus reduces noise.
In a preferred embodiment, obtaining sequencing data for a plurality of MSS samples (i.e., normal lymphocyte samples), and screening a first set of loci from the sequencing data for each MSS sample, and counting the type of repeat units for each microsatellite loci in the first set of loci and the frequency of occurrence of each repeat unit type comprises: comparing the sequencing data of each MSS sample with a reference genome sequence to obtain a comparison result; searching a first position set from the comparison result, and extracting end-to-end reads covering each microsatellite position in the first position set from the comparison result, wherein the end-to-end reads refer to reads covering at least 2bp at the left end and the right end of each microsatellite position in the first position set; the type of each repeat unit and the frequency of occurrence of each repeat unit type in the end-to-end reads covering each microsatellite locus are counted.
End-to-end reads cover the microsatellite locus region plus at least 2 bp of each flank. The flanking sequences allow the position of the read on the reference genome to be determined, and full coverage ensures that the repeating unit type detected at the locus and the count of each repeating unit type are accurate, which improves the accuracy of the detection result. "At least 2 bp each", more preferably 2 bp, means that the read spans the entire locus region and extends about 2 bp into each flank. This guarantees that the read fully spans the locus region while minimizing data loss (the longer the flanking sequence that must be covered, the stricter the condition and the fewer reads satisfy it), and also avoids the influence of insertions and deletions in the flanks on the determination of the repeating unit type. Of course, the preset length can also be 3 bp, 4 bp, 5 bp, 6 bp, 7 bp, 8 bp, 9 bp or even longer, and can be adjusted as appropriate.
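The selection of end-to-end reads and the counting of repeating unit type frequencies can be sketched on a simplified read representation. Here each read is a hypothetical tuple of (reference start, reference end, observed repeat length) with inclusive coordinates; parsing real alignments, including indels inside the repeat, is outside the scope of this illustration, and the function name is an assumption.

```python
from collections import Counter

def spanning_type_frequencies(reads, locus_start, locus_end, flank=2):
    """Frequency of each repeat unit type among end-to-end reads.

    reads: iterable of (ref_start, ref_end, repeat_length), inclusive coordinates
    A read qualifies only if it covers the locus plus `flank` bp on both sides.
    """
    lengths = [
        rep_len
        for start, end, rep_len in reads
        if start <= locus_start - flank and end >= locus_end + flank
    ]
    counts = Counter(lengths)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()} if total else {}

reads = [
    (90, 120, 10),
    (95, 115, 10),
    (98, 111, 9),
    (99, 120, 10),   # starts only 1 bp before the locus: rejected
    (90, 110, 10),   # ends only 1 bp after the locus: rejected
]
print(spanning_type_frequencies(reads, locus_start=100, locus_end=109))
```

Raising `flank` makes the filter stricter and discards more reads, which is the trade-off the paragraph above describes.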
In a preferred embodiment, extracting from the alignment results the end-to-end reads covering each microsatellite locus in the first locus set comprises: counting the end-to-end reads belonging to the same repeated sequence family from the comparison result, counting the number of reads of each different repeating unit type within that family, selecting the repeating unit type with the largest number as the repeating unit type of the family (that is, retaining one end-to-end read that covers the most numerous repeating unit type in the family), and counting the number of end-to-end reads supporting the microsatellite locus; preferably, each repeating unit type of each microsatellite locus is supported by at least two end-to-end reads; preferably, the capture efficiency is measured as the ratio of the number of end-to-end reads for each microsatellite locus to the sequencing depth of the sample, and the capture threshold is preferably ≥ 0.4.
Because PCR amplification is required during library construction, one DNA fragment is amplified into many copies. When the sequencing depth is high, the same sequence is read multiple times, and these reads are called a duplicate (reads family), i.e. a family of repeated sequences. In microsatellite sequences, strand slippage can occur during PCR, lengthening or shortening the microsatellite sequence, and repeated sequences of different lengths become different repeating unit types.
A real DNA fragment is tested multiple times, and a piece of DNA needs to be selected to remain. For microsatellite loci, the types of repeat units of the same family of repeat sequences may not be identical. Among a family of repeat sequences, we selected the most frequently detected repeat unit type as the microsatellite sequence length of the authentic DNA fragment. There are N types of repeat units covering the same microsatellite locus, and the family of repeat sequences only holds an end-to-end read that contains entirely the "most frequently detected repeat unit type in the family of repeat sequences". Compared with the prior art, the method has the advantages that the read with the highest sequencing quality is reserved as the representative of the repeated sequence family, the preference of the read of a shorter microsatellite sequence is overcome, and the background noise is effectively reduced, so that the obtained repeated units are more similar to the real type distribution.
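The family-level majority vote and the support filter can be sketched as follows. This is a simplified illustration, not the applicant's implementation: reads are assumed to be already reduced to `(family_key, repeat_length)` pairs, ties within a family are broken by first occurrence, and frequencies are normalized over the supported types only. All three choices are assumptions of this sketch.

```python
from collections import Counter, defaultdict


def collapse_families(reads):
    """Map each read family to its most frequently observed repeat length.

    reads: iterable of (family_key, repeat_length) for end-to-end reads.
    """
    families = defaultdict(list)
    for family_key, repeat_length in reads:
        families[family_key].append(repeat_length)
    # One representative per family: the most common repeat length.
    return {key: Counter(lengths).most_common(1)[0][0]
            for key, lengths in families.items()}


def repeat_type_frequencies(representatives, min_support=2):
    """Frequency of each repeat unit type supported by >= min_support families."""
    counts = Counter(representatives.values())
    supported = {t: c for t, c in counts.items() if c >= min_support}
    total = sum(supported.values())
    return {t: c / total for t, c in supported.items()} if total else {}


reads = [("f1", 10), ("f1", 10), ("f1", 9),   # PCR slippage inside family f1
         ("f2", 10), ("f3", 9), ("f3", 9), ("f4", 9), ("f5", 11)]
reps = collapse_families(reads)
print(reps)
print(repeat_type_frequencies(reps))
```

In the example, the single 9 bp copy in family `f1` is treated as slippage noise and discarded, and the 11 bp type is dropped because only one family supports it.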
Example 2
The present embodiment provides a method for screening microsatellite loci for detecting MSI, comprising: selecting sequencing data of a plurality of known MSI-H samples and a plurality of known MSS samples; obtaining, according to any of the above methods for screening microsatellite loci in an MSS plasma model sample, the frequency distribution of repeat unit types at each microsatellite locus in the known MSI-H samples and in the known MSS samples; calculating, separately for the known MSI-H samples and the known MSS samples, the difference level between the frequency distribution of repeat unit types at each microsatellite locus and the corresponding frequency distribution in the MSS plasma model sample; and retaining the microsatellite loci whose difference levels differ significantly between the known MSI-H samples and the known MSS samples as the microsatellite loci for detecting MSI.
That is, for a plurality of samples of known MSI-H status and a plurality of samples of known MSS status, the repeat unit types and their frequency distribution at each microsatellite locus are obtained for both groups in the same way as for the MSS plasma model sample. The difference level between the frequency distribution in each sample and that in the MSS plasma model sample is then calculated, and the microsatellite loci showing a significant difference between the two groups are retained as loci for detecting microsatellite instability.
The difference level can be assessed with various difference-testing methods; loci with a p value below 0.05 are preferably retained.
To represent the difference between samples of the two states more accurately, in a preferred embodiment the KLD value between the frequency distribution of repeat unit types at each microsatellite locus and the corresponding frequency distribution in the MSS plasma model sample is calculated according to formula (I), separately for the known MSI-H samples and the known MSS samples, and the microsatellite loci whose KLD values differ significantly between the two groups (preferably loci with a p value below 0.05) are retained as microsatellite loci for detecting MSI;
$$\mathrm{KLD}(p \,\|\, q) = \sum_{x} p(x)\,\log\frac{p(x)}{q(x)} \qquad \text{(I)}$$
where p(x) represents the frequency distribution of repeat unit types at the microsatellite locus of a known MSI-H sample or a known MSS sample, and q(x) represents the frequency distribution of repeat unit types at the same microsatellite locus of the MSS plasma model sample.
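As a worked illustration of formula (I), the sketch below computes the Kullback-Leibler divergence between two repeat unit type distributions stored as dictionaries. The natural logarithm and the example frequencies are assumptions; the application does not state the logarithm base, and the numbers are invented for illustration.

```python
import math


def kld(p, q):
    """Kullback-Leibler divergence sum_x p(x) * log(p(x) / q(x)).

    p, q: dicts mapping repeat unit type (here, repeat length) to frequency.
    Every type with p(x) > 0 must have q(x) > 0; natural log is assumed.
    """
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)


model = {10: 0.90, 9: 0.10}       # hypothetical MSS plasma model distribution q
stable = {10: 0.88, 9: 0.12}      # close to the model
shifted = {10: 0.50, 9: 0.50}     # strong shift toward the shorter allele

print(kld(stable, model))
print(kld(shifted, model))
```

The shifted distribution yields a much larger value than the near-model one, which is the property the locus screening relies on.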
Preferably, a non-parametric test, preferably the Wilcoxon test, is used to test whether the KLD values differ significantly.
The KLD method captures more information at each locus, including the number, abundance and position of peaks of different lengths. Compared with existing methods that measure the difference level by the number of peaks or the proportion of the highest peak, the KLD method is more sensitive. A conventional chi-square test can also be used to detect the difference level, but it requires reaching a p value threshold such as 0.05 or 0.01, which is relatively difficult when the tumor content in plasma is low. With the KLD method, by contrast, the cutoff for the difference value can be set by the user. In addition, KLD is a relative entropy, and other applicable divergence measures may also be employed in the present application.
Establishing the repeat unit types and their frequency distribution at the microsatellite loci of a plasma model sample has the following two advantages: 1) the calculation of KLD requires a normal control; with a plasma model sample, only the single test sample needs to be analyzed (at higher sequencing depth) and no matched control sample is needed, whereas existing methods usually use white blood cells as the control sample, sequenced at low depth; 2) the plasma model sample is constructed from multiple samples, which makes it more random and more representative than a single sample.
Example 3
In this embodiment, microsatellite loci for detecting MSI are obtained using the above screening method.
The present embodiment also provides microsatellite loci for detecting MSI, comprising at least 15 of the 38 microsatellite loci shown in Table 1.
The 38 microsatellite loci in this example were obtained by screening plasma samples of two different known microsatellite states and can serve as marker loci for characterizing whether microsatellite instability is present in a plasma sample. Using at least 15 of these loci to test a sample of unknown status makes the detection result more accurate. More preferably, at least 20, 25, 30, 35 or all 38 of these loci, or any number from 15 to 38, may be employed.
Example 4
The kit for detecting MSI of the present embodiment includes detection reagents for the microsatellite loci for detecting MSI, the microsatellite loci including at least 15 of the 38 microsatellite loci shown in Table 1.
The 38 microsatellite loci in this example were obtained by screening plasma samples of two different known microsatellite states and can serve as marker loci for characterizing whether microsatellite instability is present in a plasma sample. Testing a sample of unknown status with a kit built on at least 15 of these loci makes the detection result more accurate.
Preferably, at least 15, 20, 25, 30, 35 or all 38 of these loci, or any number from 15 to 38, may be employed.
Example 5
The present embodiment provides a baseline construction method for detecting MSI, comprising: obtaining, from a plurality of known MSS samples and according to any of the screening methods described above, data for at least 15 of the 38 microsatellite loci for MSI detection shown in Table 1; counting the frequency distribution of repeat unit types at each microsatellite locus, and calculating the difference level between that frequency distribution and the frequency distribution of repeat unit types in the MSS plasma model sample; removing microsatellite loci that show polymorphism; and computing the average level (mean or median) and the degree of dispersion (standard deviation or variance) of the difference level at each microsatellite locus across all MSS samples to construct the baseline for MSI detection.
In this baseline construction method, at least 15, more, or all of the 38 screened microsatellite loci are examined in negative (MSS) samples; the repeat unit types and their frequency distribution are counted; and the mean and standard deviation of the KLD values between the frequency distribution at each locus and that of the model sample are calculated, thereby obtaining the baseline.
The MSS sample refers to a normal healthy sample, which can be a normal tissue sample or a normal lymphocyte sample, and the established plasma model sample is a normal control sample model. Preferably, the MSS sample is a leukocyte sample. It should be noted that other MSS samples that may carry somatic SNV mutations are not excluded, although somatic SNV mutations in such samples may affect the variation observed at a particular microsatellite locus.
A specific example of the step of removing polymorphic microsatellite loci is as follows. The frequencies of the repeat unit types at each locus are examined in the MSS samples. Because the microsatellite loci in this application are all relatively short, the frequency distribution of a locus is dominated by one repeat unit type whose frequency is much higher than that of the other types (as shown, for example, in FIG. 3). If, in a given MSS sample, the highest-frequency and second-highest-frequency repeat unit types at a microsatellite locus have similar frequencies, the sample is heterozygous at that locus and a polymorphism exists. In addition, if the sample is homozygous at the locus but its highest-frequency repeat unit type differs from that of the vast majority of MSS samples, this is also a polymorphism. Microsatellite loci showing these frequency patterns are removed, which reduces or even avoids differences in repeat unit type frequencies caused by polymorphism.
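The two polymorphism patterns described above can be sketched as below. The text only says the two highest frequencies are "similar" for a heterozygous sample and gives no numeric cutoff, so the `het_ratio` parameter (second-highest frequency divided by highest) and its default of 0.5 are purely hypothetical, as are the function name and the example data.

```python
from collections import Counter


def polymorphic_flags(freqs_by_sample, het_ratio=0.5):
    """Flag MSS samples that look polymorphic at one locus.

    freqs_by_sample: list of dicts {repeat unit type: frequency}, one per sample.
    het_ratio: hypothetical cutoff on second-highest / highest frequency.
    """
    tops = [max(f, key=f.get) for f in freqs_by_sample]
    # Dominant repeat unit type shared by most MSS samples.
    consensus = Counter(tops).most_common(1)[0][0]
    flags = []
    for freqs, top in zip(freqs_by_sample, tops):
        ranked = sorted(freqs.values(), reverse=True)
        second = ranked[1] if len(ranked) > 1 else 0.0
        heterozygous = second / ranked[0] >= het_ratio
        variant_homozygous = top != consensus
        flags.append(heterozygous or variant_homozygous)
    return flags


samples = [
    {10: 0.90, 9: 0.10},            # typical sample
    {10: 0.88, 9: 0.12},            # typical sample
    {10: 0.50, 9: 0.45, 11: 0.05},  # two similar peaks: heterozygous
    {9: 0.90, 10: 0.10},            # single peak at a different length
]
flags = polymorphic_flags(samples)
print(flags, sum(flags) / len(flags))  # flags and the polymorphism proportion
```

The final ratio corresponds to the polymorphism proportion of the locus, i.e. the number of polymorphic samples divided by the number of samples tested.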
To further increase the sensitivity of detecting differences, in a preferred embodiment the KLD value between the frequency distribution of repeat unit types at each microsatellite locus and the frequency distribution of repeat unit types in the MSS plasma model sample is calculated according to formula (I), wherein p(x) represents the frequency distribution at the microsatellite locus of a known MSS sample and q(x) represents the frequency distribution at the same microsatellite locus of the MSS plasma model sample;
$$\mathrm{KLD}(p \,\|\, q) = \sum_{x} p(x)\,\log\frac{p(x)}{q(x)} \qquad \text{(I)}$$
The average level and degree of dispersion of the KLD values at each microsatellite locus across all known MSS samples are then computed to construct the baseline for MSI detection. The average level here may be the mean or the median, and the degree of dispersion may be the standard deviation or the variance.
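A baseline of this kind can be sketched as a per-locus mean and standard deviation of KLD values. The input layout (one dictionary per MSS sample, with a locus simply left out when the sample was excluded there, for example for polymorphism), the locus names, and the use of the sample standard deviation are assumptions of this sketch; the text allows the median and the variance as alternatives.

```python
from statistics import mean, stdev


def build_baseline(kld_by_sample):
    """Return {locus: (mean KLD, SD of KLD)} over the MSS baseline samples.

    kld_by_sample: list of dicts {locus name: KLD value}, one dict per sample.
    Loci with fewer than two values are skipped because no SD can be computed.
    """
    loci = set().union(*kld_by_sample)
    baseline = {}
    for locus in sorted(loci):
        values = [s[locus] for s in kld_by_sample if locus in s]
        if len(values) >= 2:
            baseline[locus] = (mean(values), stdev(values))
    return baseline


# Hypothetical KLD values for three MSS samples at two loci
samples = [
    {"locus_A": 0.01, "locus_B": 0.02},
    {"locus_A": 0.03, "locus_B": 0.02},
    {"locus_A": 0.02},                  # excluded at locus_B
]
print(build_baseline(samples))
```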
To further increase the sensitivity of difference detection, in a preferred embodiment, when the repeat unit types in a known MSS sample are not the same as the repeat unit types in the MSS plasma model sample, calculating the KLD value between the two frequency distributions comprises: taking the union of the repeat unit types in the known MSS sample and those in the MSS plasma model sample, denoting the union as M and the number of repeat unit types it contains also as M, and setting a minimum value ε; smoothing the frequency distribution of repeat unit types in the known MSS sample and that in the MSS plasma model sample separately; and calculating the KLD value between the two smoothed frequency distributions.
To calculate KLD values at loci where the repeat unit types differ between the two distributions, each repeat unit type missing from a distribution is assigned a very small frequency. The observed frequencies of the repeat unit types are thus essentially preserved, while both distributions can be handled over the same set of repeat unit types and the corresponding KLD value can be calculated.
Preferably, the smoothing process includes: relative to M, if n repeat unit types are absent from the known MSS sample or the MSS plasma model sample, each absent repeat unit type is assigned the frequency ε/n, and the frequency of each remaining repeat unit type becomes p(x) − ε/(M − n).
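The smoothing rule can be checked with a short sketch: the n absent types receive ε/n each (ε in total), and the M − n observed types each give up ε/(M − n), so the distribution still sums to 1. The value ε = 1e-6, the natural logarithm and the example distributions are assumptions; the application only calls ε "a minimum value".

```python
import math


def smooth(dist, all_types, eps=1e-6):
    """Smooth `dist` over the union `all_types` following the epsilon rule."""
    missing = [t for t in all_types if t not in dist]
    n = len(missing)
    if n == 0:
        return dict(dist)
    m = len(all_types)
    smoothed = {t: f - eps / (m - n) for t, f in dist.items()}
    smoothed.update({t: eps / n for t in missing})
    return smoothed


def kld(p, q):
    """Kullback-Leibler divergence with natural log (assumed base)."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)


sample = {10: 0.9, 9: 0.1}    # lacks the 11 bp type
model = {10: 0.8, 11: 0.2}    # lacks the 9 bp type
union = set(sample) | set(model)

p = smooth(sample, union)
q = smooth(model, union)
print(sum(p.values()), sum(q.values()))
print(kld(p, q))
```

Without smoothing, the 9 bp type would have q(x) = 0 and the ratio p(x)/q(x) in formula (I) would be undefined.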
Example 6
The embodiment provides a method for detecting a microsatellite state, comprising the following steps: for at least 15 of the 38 microsatellite loci for detecting MSI shown in Table 1, obtaining from the sample to be tested, according to the above screening method, the repeat unit types of each microsatellite locus and the frequency distribution of those types; calculating the difference level g1 between the frequency distribution of repeat unit types at each microsatellite locus of the sample to be tested and the frequency distribution of repeat unit types in the MSS plasma model sample; calculating the Z value of the sample to be tested from g1 and from the difference level g0 between the frequency distribution of repeat unit types at each microsatellite locus of the baseline samples and that of the MSS plasma model sample; and selecting the repeat unit type Mp with the highest frequency at each microsatellite locus of the sample to be tested and the repeat unit type Mq with the highest frequency at the same microsatellite locus of the MSS plasma model sample, and judging the microsatellite locus to be an unstable locus if either of the following holds:
(1) Mp ≠ Mq, and the difference level of the sample to be tested satisfies g1 > average(g0) + z·SD(g0);
(2) Mp = Mq, p(Mp) <= average(q(Mq)) + z·SD(q(Mq)), and at the same time the difference level of the sample to be tested satisfies g1 > average(g0) + z·SD(g0);
where average(g0) represents the average level of the difference level g0 of the baseline samples, SD(g0) represents the degree of dispersion of the difference level g0 of the baseline samples, and z represents the coefficient applied to the degree of dispersion of g0.
The Z values of the microsatellite loci in the sample to be tested that meet a depth threshold (preferably ≥ 400x when the sequencing depth is 1000x) are then collected, their average level is obtained, and the microsatellite state of the sample to be tested is judged as follows:
(1) if the number n1 of microsatellite loci meeting the depth threshold is ≥ 15 and, with n2 denoting the number of unstable loci among them, n2/n1 ≥ a or the average level of the Z values ≥ b, the microsatellite state of the sample to be tested is judged to be MSI-H;
(2) if n1 ≥ 15, n2/n1 < a and the average level of the Z values < b, the microsatellite state of the sample to be tested is judged to be MSS;
(3) if n1 < 15, the microsatellite state of the sample to be tested is undetermined;
where a is 0.15 to 0.3 and b is 0.8 to 2.
According to this method for detecting microsatellite instability, a batch of microsatellite loci with low background noise is first selected (apart from the repeat unit type with the highest frequency, all other repeat unit types are background noise, so the higher the frequency of the highest peak, the lower the background noise). The background noise is then further reduced by the optimized deduplication algorithm, and the KLD value is used to characterize each microsatellite locus for MSI analysis. The whole process greatly reduces background noise and raises the relative strength of the real mutation signal, so that MSI-H can be detected in samples with low tumor content.
Preferably, in determining the stability of the sample, the above method jointly considers: 1) whether the number of loci involved in the determination is not less than 8; 2) the number of unstable loci; and 3) whether the median of the Z values is 1 or more.
In the above method, the difference levels g1 and g0 may be obtained with different difference measures; in this application, the KLD value is more preferably used. In a preferred embodiment, the detection method comprises: calculating the difference level g0 and the difference level g1 according to formula (I) to obtain the KLD values of the baseline samples and the KLD value of the sample to be tested; and calculating the Z value of the sample to be tested from its KLD value and from the average level and degree of dispersion of the KLD values of the baseline samples;
$$\mathrm{KLD}(p \,\|\, q) = \sum_{x} p(x)\,\log\frac{p(x)}{q(x)} \qquad \text{(I)}$$
where p(x) represents the frequency distribution of repeat unit types at the microsatellite locus of the sample to be tested or of a baseline sample, and q(x) represents the frequency distribution of repeat unit types at the same microsatellite locus of the MSS plasma model sample.
For each microsatellite locus, the KLD value between the repeat unit type distribution of the sample to be tested and that of the plasma model sample is calculated, and it is checked whether this KLD value deviates significantly from the KLD values of the baseline samples. Specifically, the cutoff can be the average of the baseline KLD values plus n times their degree of dispersion, where n can be 3 or 4, more preferably 3. In a preferred embodiment, the microsatellite locus is judged to be an unstable locus if either of the following holds: (1) Mp ≠ Mq and the KLD value of the sample to be tested > average(Ki) + 3SD(Ki); (2) Mp = Mq, p(Mp) <= average(q(Mq)) + zSD(q(Mq)), and at the same time the KLD value of the sample to be tested > average(Ki) + 3SD(Ki); where average(Ki) represents the average level of the KLD values of the baseline samples (which may be the mean or median, as described above), and SD(Ki) represents the degree of dispersion of the KLD values of the baseline samples (which may be the variance or standard deviation, as described above).
In the above preferred embodiment, the stability of each locus is determined by comparing the KLD value of the sample to be tested with the average of the baseline KLD values plus 3 times their degree of dispersion.
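The per-locus decision can be written out as a small function. The inequalities are transcribed as they appear in the text, including the `<=` in the Mp = Mq branch; the function name, argument layout and example numbers are assumptions made for illustration.

```python
def is_unstable_locus(p, q_mean, q_sd, kld_value, kld_mean, kld_sd, z=3.0):
    """Apply the two instability conditions to one locus.

    p:        {repeat unit type: frequency} in the sample to be tested
    q_mean:   {repeat unit type: mean frequency} in the MSS plasma model
    q_sd:     {repeat unit type: SD of frequency} in the MSS plasma model
    kld_*:    KLD of the test sample and baseline mean / SD at this locus
    z:        coefficient on the dispersion (3 in the preferred embodiment)
    """
    mp = max(p, key=p.get)            # highest-frequency type in the sample
    mq = max(q_mean, key=q_mean.get)  # highest-frequency type in the model
    kld_exceeds = kld_value > kld_mean + z * kld_sd
    if mp != mq:
        return kld_exceeds            # condition (1)
    peak_condition = p[mp] <= q_mean[mq] + z * q_sd[mq]
    return peak_condition and kld_exceeds  # condition (2)


q_mean = {10: 0.90, 9: 0.10}
q_sd = {10: 0.02, 9: 0.02}
print(is_unstable_locus({10: 0.6, 9: 0.4}, q_mean, q_sd, 0.30, 0.01, 0.005))
print(is_unstable_locus({10: 0.9, 9: 0.1}, q_mean, q_sd, 0.012, 0.01, 0.005))
```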
Example 7
The embodiment provides a method for detecting microsatellite instability, with the following detailed steps:
1) Using N (>= 30) MSS plasma samples, calculate the frequency of each repeat unit type at each microsatellite locus of each sample.
1.1 Align the sequencing reads to the human genome with BWA, sort the alignments with samtools, and extract the reads that completely cover the microsatellite locus region plus at least 2 bp of each flank (spanning reads, also called end-to-end reads), with realignment performed using GATK.
1.2 Extract the sequence at the locus from each spanning read of 1.1 and calculate its length; each distinct length represents one repeat unit type.
1.3 For the spanning reads of 1.1: if read1 and read2 align to the same chromosome, determine the leftmost alignment position of the read pair on the chromosome and the insert length; reads with the same leftmost alignment position and the same insert length belong to the same duplicate read family. If read1 and read2 align to different chromosomes, or only one read aligns, determine the leftmost alignment position of the read, its aligned length, and the chromosome to which its mate aligns; reads sharing the same chromosome, leftmost alignment position, aligned length and mate chromosome belong to the same duplicate read family. Count the repeat unit types within each duplicate read family, keep the most frequent one as the repeat unit type of the family, and retain only one spanning read per family.
1.4 If the number of spanning reads at a microsatellite locus is >= 400, the locus passes quality control;
1.5 If the number of reads supporting a repeat unit type is >= 2, the repeat unit type is valid;
1.6 Calculate the frequency of each repeat unit type at the microsatellite locus;
2) From the frequency (Pi) of each repeat unit type at each microsatellite locus of the MSS plasma samples, calculate the mean, mean(Pi), and the standard deviation, SD(Pi), and construct the repeat unit type distribution q of each microsatellite locus of the MSS plasma model sample;
3) Using another N' (>= 30) MSS plasma samples, calculate the repeat unit type distribution p of each microsatellite locus for all samples according to 1), and calculate, according to the formula
$$\mathrm{KLD}(p \,\|\, q) = \sum_{x} p(x)\,\log\frac{p(x)}{q(x)}$$
the Kullback-Leibler divergence (KLD) between the microsatellite locus repeat unit type distribution p of the sample and the microsatellite locus repeat unit type distribution q of the model sample. If the repeat unit types of the sample are not the same as those of the model, take the union of the two sets of repeat unit types, denoted M, set a minimum value ε, and smooth the sample distribution and the model distribution separately: relative to M, if n repeat unit types are absent from the sample or the model, each absent repeat unit type is assigned the frequency ε/n, and the frequency of each other repeat unit type becomes p(x) − ε/(M − n).
4) Calculate the KLD value (Ki) at each microsatellite locus of the MSS plasma samples according to 3), and calculate the mean, mean(Ki), and the standard deviation, SD(Ki), to construct the baseline.
5) Let Mp be the highest-frequency repeat unit type at a microsatellite locus of the sample to be tested and Mq the highest-frequency repeat unit type at the same locus of the MSS model; calculate the KLD value and the Z value of the sample (Z value = (KLD of the test sample − mean KLD of the baseline samples) / SD).
5.1 If Mp = Mq, p(Mp) <= mean(q(Mq)) + 3SD(q(Mq)), and the KLD of the sample to be tested > mean(Ki) + 3SD(Ki), the locus is judged to be unstable.
5.2 If Mp ≠ Mq and the KLD of the sample to be tested > mean(Ki) + 3SD(Ki), the locus is judged to be unstable.
5.3 Calculate the median, median(zscore), of the Z values of all loci meeting the requirement (i.e., number of spanning reads >= 400).
6) If the number of loci meeting the requirement is >= 15, and the number of unstable loci / the number of loci meeting the requirement >= 0.2 or median(zscore) >= 1, the microsatellite state is judged to be MSI-H.
7) If the number of loci meeting the requirement is >= 15, the number of unstable loci / the number of loci meeting the requirement < 0.2, and median(zscore) < 1, the microsatellite state is judged to be MSS.
8) If the number of loci meeting the requirement is < 15, the microsatellite state is reported as QNS (Quantity Not Sufficient).
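Steps 5.3 to 8) amount to a three-way sample-level call, which can be sketched as follows. The thresholds (400 spanning reads, 15 loci, ratio 0.2, median Z of 1) are those stated in the steps; the record layout, field names and function name are hypothetical, and the per-locus `unstable` flag is assumed to have been computed beforehand.

```python
from statistics import median


def call_microsatellite_status(loci, min_reads=400, min_loci=15,
                               ratio_cutoff=0.2, z_cutoff=1.0):
    """Return 'MSI-H', 'MSS' or 'QNS' for one sample.

    loci: list of dicts with hypothetical keys
          'spanning_reads', 'kld', 'baseline_mean', 'baseline_sd', 'unstable'.
    """
    usable = [s for s in loci if s["spanning_reads"] >= min_reads]
    if len(usable) < min_loci:
        return "QNS"  # Quantity Not Sufficient
    z_scores = [(s["kld"] - s["baseline_mean"]) / s["baseline_sd"] for s in usable]
    unstable_ratio = sum(1 for s in usable if s["unstable"]) / len(usable)
    if unstable_ratio >= ratio_cutoff or median(z_scores) >= z_cutoff:
        return "MSI-H"
    return "MSS"


def make_locus(kld, unstable, reads=800):
    """Build one hypothetical locus record with a fixed baseline."""
    return {"spanning_reads": reads, "kld": kld, "baseline_mean": 0.01,
            "baseline_sd": 0.005, "unstable": unstable}


stable = [make_locus(0.01, False) for _ in range(20)]
mixed = [make_locus(0.20, True) for _ in range(5)] + stable[:15]
print(call_microsatellite_status(stable))
print(call_microsatellite_status(mixed))
print(call_microsatellite_status(stable[:10]))
```

Because the MSI-H rule uses a logical OR, a sample can be called MSI-H either by the fraction of unstable loci or by an overall upward shift of the Z values.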
The beneficial effects of the present application will be further described below in conjunction with specific embodiments.
Example 8: MSI site selection
We identified single-base repeat microsatellites 7 to 15 bp in length within the panel range, and selected loci in the following order:
1.1 Calculate the similarity value between the 10 bp sequences on the left and right of the microsatellite and the microsatellite sequence, and select loci with low similarity values.
1.2 Perform high-depth sequencing (> 1000X) on 209 MSS plasma samples, count the repeat unit types at each locus and the proportion of each type, construct a spectrum, determine the polymorphism proportion of each locus, and select quasi-monomorphic loci (< 5%).
The frequencies of the repeat unit types at each locus are examined in the MSS samples. Because our loci are relatively short, the frequency distribution of a locus is dominated by one repeat unit type whose frequency is much higher than that of the other types, as in FIG. 3. If, in an MSS sample, the highest-frequency and second-highest-frequency repeat unit types at a microsatellite locus have similar frequencies, the sample is heterozygous at that locus and a polymorphism is present. In addition, if the sample is homozygous but its highest-frequency repeat unit type differs from that of most MSS samples, this is also a polymorphism. The polymorphism proportion is calculated as the ratio of the number of polymorphic samples to the total number of samples tested.
1.3 Calculate the ratio of the number of spanning reads to the sequencing depth, and select loci with a high spanning read ratio (the ratio expresses the effective capture efficiency of the locus; preferably the ratio is ≥ 0.4, and loci with high effective capture efficiency are retained).
1.4 Count the frequency of each repeat unit type at each locus in all samples, calculate the mean and standard deviation for each repeat unit type, calculate the average of the standard deviations over all repeat unit types of the locus, select loci with a small average standard deviation, and use the mean frequency of each repeat unit type as the repeat unit type frequency distribution q of the microsatellite locus of the MSS plasma model sample.
1.5 Perform a Wilcoxon test using 17 MSI-H samples and 45 MSS samples, compare the KLD values of the two groups at each locus, and select loci with a p value below 0.05.
Following the procedure described above, a total of 38 loci were obtained; specific information for each locus is shown in Table 1.
The 38 loci are single-base repeat sequences 7 to 15 bp in length that are highly monomorphic, have low similarity to their flanking sequences, show high sensitivity and high specificity, have a high proportion of spanning reads, and have a stable distribution of repeat unit types in MSS samples. Table 1 shows the specific parameters of each locus.
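Step 1.4 can be sketched for a single locus as below. Treating a repeat unit type that is absent from a sample as frequency 0 and using the sample standard deviation are assumptions of this sketch, as are the function name and the example data.

```python
from statistics import mean, stdev


def build_model_distribution(freqs_by_sample):
    """Build the model distribution q for one locus from several MSS samples.

    freqs_by_sample: list of dicts {repeat unit type: frequency}.
    Returns (q, sd_per_type, mean_sd), where q holds the mean frequency of each
    type and mean_sd is the average of the per-type standard deviations.
    """
    types = sorted(set().union(*freqs_by_sample))
    q, sd_per_type = {}, {}
    for t in types:
        values = [s.get(t, 0.0) for s in freqs_by_sample]  # absent type -> 0
        q[t] = mean(values)
        sd_per_type[t] = stdev(values)
    mean_sd = mean(sd_per_type.values())
    return q, sd_per_type, mean_sd


samples = [
    {10: 0.90, 9: 0.10},
    {10: 0.80, 9: 0.10, 11: 0.10},
]
q, sd_per_type, mean_sd = build_model_distribution(samples)
print(q)
print(mean_sd)  # a small value indicates a locus that is stable across MSS samples
```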
Table 1:
Figure BDA0002505152650000201
Figure BDA0002505152650000211
wherein:
site name: microsatellite site names;
physical location: coordinates of microsatellite loci on the human genome;
length: microsatellite site length;
similarity value: the similarity between the 10 bp sequences flanking the microsatellite locus on the left and right and the microsatellite sequence itself, calculated as Σ(11 − n)/10, where n is the distance from the microsatellite locus of each flanking base that is identical to the microsatellite sequence;
polymorphism ratio: the proportion of the population cohort in which the microsatellite locus exhibits polymorphism;
average of standard deviations of the proportions of the different repeating units: the average of the standard deviations of the proportions of each repeat unit type at the microsatellite locus in the MSS population;
proportion of spanning reads: the ratio of the number of spanning reads at the locus to the sample depth (because capture efficiency differs between intervals, the sequencing depth at some loci exceeds the average depth, in which case the spanning read ratio is greater than 1). It evaluates whether the locus is easily captured and is therefore used in this application to measure capture efficiency. The ratios in the above table are all ≥ 0.4, i.e., the capture threshold needs to be ≥ 0.4.
P value: the p value of the Wilcoxon test on the difference in KLD values at the microsatellite locus between the MSI-H and MSS groups of samples.
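One possible reading of the similarity value Σ(11 − n)/10 for a single-base repeat is sketched below: every flanking base identical to the repeated base contributes (11 − n)/10, where n = 1 is the base directly adjacent to the locus and n = 10 the farthest one. This interpretation, the orientation convention and the example sequences are assumptions; the application gives only the formula.

```python
def similarity_value(left_flank, right_flank, repeat_base):
    """Similarity between the two 10 bp flanks and a single-base repeat.

    left_flank:  10 bp upstream, written 5'->3' (last base is adjacent to the locus)
    right_flank: 10 bp downstream, written 5'->3' (first base is adjacent)
    """
    score = 0.0
    # Walk outward from the locus on each side; n is the distance in bases.
    for n, base in enumerate(reversed(left_flank), start=1):
        if base == repeat_base:
            score += (11 - n) / 10
    for n, base in enumerate(right_flank, start=1):
        if base == repeat_base:
            score += (11 - n) / 10
    return score


# Hypothetical poly-A locus with one matching base on each side
print(similarity_value("CCGTCCGTCA", "GCCTGCCTGC", "A"))  # adjacent A on the left
print(similarity_value("ACGTCCGTCC", "GCCTGCCTGC", "A"))  # distant A on the left
```

Under this reading, matching bases close to the repeat weigh more, which fits the stated goal of preferring loci whose flanks are unlikely to be confused with the repeat itself.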
Example 9: baseline construction (i.e., establishing the mean and standard deviation of the KLD values for negative samples)
In addition, 30 plasma samples with microsatellite status MSS were selected to construct the baseline. First, the frequency distribution of repeat unit types at each locus of each sample and its KLD value relative to the model sample were calculated; samples with polymorphism were removed; then the mean and standard deviation of the KLD values at each locus across all samples were calculated.
Example 10: repeat unit type frequency distribution saturation analysis
KLD value saturation was assessed for each locus using 28 healthy human plasma samples. The KLD value was calculated as the number of spanning reads increased and a saturation curve was drawn, from which the number of spanning reads required for microsatellite analysis, and hence the quality control standard for each locus, was determined.
FIG. 1 shows the saturation curve of each locus; the KLD value essentially reaches saturation at about 400 spanning reads.
Example 11:
Plasma samples were simulated by mixing 7 MSI-H tissue samples with their corresponding leukocyte samples at different dilutions, giving 21 samples in total. The tumor content (ccf) of each sample was calculated and its microsatellite state was predicted; the results are shown in Table 2.
Table 2:
Figure BDA0002505152650000221
Figure BDA0002505152650000231
Example 12:
Four MSI-H cell line samples were each mixed with one MSS cell line sample to simulate MSI plasma samples with tumor contents of 0.878%, 1.97%, 2.96%, 4.44%, 6.67%, 10% and 20%, using a starting input of 10 ng. The microsatellite state of each sample was predicted and the minimum tumor content detection limit was analyzed; the results are shown in Table 3.
Table 3:
Figure BDA0002505152650000232
Figure BDA0002505152650000241
From the results, when ccf >= 0.02, 2 samples were judged negative (namely the DLD1 samples with tumor contents of 1.97% and 2.96%).
Example 13
The microsatellite state of 95 MSS samples was predicted; all samples were judged to be MSS, giving a specificity of 100%.
Example 14
Combining the results of Example 11, Example 12 and Example 13, when ccf >= 0.02 the sensitivity was 93.33% and the specificity was 100%. The results are shown in Table 4.
Table 4:
Figure BDA0002505152650000251
From the above description, it can be seen that the above embodiments of the present invention achieve the following technical effects: the scheme of the present application can detect the microsatellite state of MSI samples in plasma samples with low tumor content. The method has the following advantages:
1) No normal tissue sample is required as a reference. This advantage arises because the 38 microsatellite loci in the scheme of the present application are monomorphic or quasi-monomorphic in the population;
2) The repeat unit types at each locus and their proportions are detected accurately. This advantage arises from the low similarity between the repeat unit sequence of the 38 microsatellite loci and their flanking sequences, from the method of extracting microsatellite repeat fragments from the sequencing reads (extracting end-to-end reads), and from the deduplication of reads belonging to the same repeat sequence family (retaining an end-to-end read of the most frequent repeat unit type in the family);
3) It is applicable to a wide range of MSI-H manifestations.
Owing to the combined effect of the PCR process and repair deficiency, microsatellite loci of MSI-H samples show varied frequency distributions of repeat unit types.
When the frequency distributions of repeat unit types of two samples are compared, the difference between the sample to be tested and the control (baseline) is assessed as a whole, without distinguishing whether MSI manifests as a deletion or an insertion, or the extent of the insertion or deletion.
4) High detection sensitivity.
This advantage arises from the stable frequency distribution of repeat unit types at the 38 microsatellite loci in MSS samples, from the method of extracting microsatellite repeat fragments from the sequencing reads (extracting end-to-end reads), from the deduplication of reads belonging to the same repeat sequence family (retaining an end-to-end read of the most frequent repeat unit type in the family), and from the use of KLD as the locus feature value together with the method for determining the MSI state of the sample.
The above method screened out 38 microsatellite loci that differ significantly between MSS and MSI samples. Using any combination of several of these 38 single-base repeat sequences as microsatellite loci, in place of the 5 loci in the Promega kit, for microsatellite instability (MSI) analysis increases the number of end-to-end reads (spanning reads), lowers the required tumor content of the sample, and improves the stability and accuracy of the analysis results.
The increase in end-to-end reads arises because the 38 microsatellite loci are 7 to 15 bp long, so end-to-end reads (spanning reads) are easier to obtain than for longer fragments (> 20 bp).
The reduced tumor-content requirement arises because the 38 single-base repeats are 7 to 15 bp long: the proportion of stutter generated by PCR is small and the background noise is low, so the true repeat types of the sample account for the majority, and against this background the repeat types produced by MSI are relatively enriched and therefore easier to detect.
The stability and accuracy of the analysis results derive from the sensitivity and specificity of the loci in the scheme: because the loci have high sensitivity and specificity, the analysis results are stable and the accuracy is improved.
Based on the validation samples, the analysis method combined with the loci of the product can detect MSI status in plasma samples with tumor content as low as 2%.
From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method according to the embodiments of the present invention.
Corresponding to the above manner, the present application further provides a series of devices, which are used to implement the above embodiments and preferred embodiments, and are not described in detail. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
The beneficial effects of the present application are further described below in conjunction with some alternative embodiments.
Example 15
This embodiment provides a screening apparatus for microsatellite loci of an MSS plasma model sample, the screening apparatus comprising: a first locus set acquisition module, an MSS sample screening and statistics module, a second locus set acquisition module, a third locus set acquisition module, and an MSS plasma model sample establishment module, wherein
the first locus set acquisition module is used for extracting microsatellite loci meeting a first condition from a human reference genome sequence or a capture sequence of a target gene, denoted as a first locus set, the first condition comprising: a. the locus is a 7-15 bp single-base repeat sequence; b. the similarity value between the flanking sequences on both sides and the 7-15 bp single-base repeat sequence is lower than a similarity threshold;
the MSS sample screening and statistics module is used for acquiring sequencing data of a plurality of MSS samples, screening the first locus set from the sequencing data of each MSS sample, and counting the repeat-unit types of each microsatellite locus in the first locus set and the occurrence frequency of each repeat-unit type;
the second locus set acquisition module is used for selecting microsatellite loci meeting a second condition from the first locus set as a second locus set, the second condition comprising: a polymorphism in the population below 5%, and a capture efficiency during library-construction sequencing above a capture threshold;
the third locus set acquisition module is used for calculating, across all MSS plasma samples, the average level and the degree of dispersion of the occurrence frequency of each repeat-unit type of each microsatellite locus in the second locus set, and selecting microsatellite loci whose average degree of dispersion is lower than a dispersion threshold as a third locus set;
the MSS plasma model sample establishment module is used for taking the microsatellite loci in the third locus set as the microsatellite loci of the MSS plasma model sample, and taking the average level (median) of the occurrence frequency of each repeat-unit type of each microsatellite locus in the third locus set as the frequency distribution of the repeat-unit types of that microsatellite locus in the MSS plasma model sample.
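The last two modules can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the input layout (`freqs_by_sample`), the function name `build_mss_model`, the use of the population standard deviation as the "degree of dispersion", and the threshold value are all hypothetical. The source specifies only that the median is used as the average level.

```python
from statistics import median, pstdev

def build_mss_model(freqs_by_sample, dispersion_threshold):
    """Select stable loci and build the MSS plasma model distribution.

    freqs_by_sample: list with one dict per MSS sample, shaped as
        {locus: {repeat_length: frequency}}  (hypothetical layout).
    Returns {locus: {repeat_length: median frequency}} for retained loci.
    """
    loci = set().union(*(s.keys() for s in freqs_by_sample))
    model = {}
    for locus in sorted(loci):
        per_sample = [s.get(locus, {}) for s in freqs_by_sample]
        types = set().union(*(d.keys() for d in per_sample))
        medians, spreads = {}, []
        for t in sorted(types):
            # A type not observed in a sample counts as frequency 0.
            values = [d.get(t, 0.0) for d in per_sample]
            medians[t] = median(values)
            # Assumption: population SD stands in for "degree of dispersion".
            spreads.append(pstdev(values))
        # Keep the locus if the mean dispersion over its types is small.
        if spreads and sum(spreads) / len(spreads) < dispersion_threshold:
            model[locus] = medians
    return model
```

A locus whose repeat-unit frequencies vary widely between MSS samples is dropped, so only loci with a stable background enter the model sample.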
In a preferred embodiment, the first locus set acquisition module comprises: a first locus selection module for extracting microsatellite loci that are 7-15 bp single-base repeat sequences from a human reference genome sequence or a capture sequence of a target gene; a similarity value calculation module for calculating, for each microsatellite locus, the similarity value between the sequences of a set length at the left and right ends of the 7-15 bp single-base repeat sequence and the 7-15 bp single-base repeat sequence itself; and a second locus selection module for selecting microsatellite loci with similarity values lower than a similarity threshold as the first locus set.
Preferably, the similarity value is calculated as follows: Σ(d2 + 1 - d1)/d2, wherein d1 is the distance to the microsatellite locus of each base, within the sequences of set length at the left and right ends, that is the same as the base of the 7-15 bp single-base repeat sequence, and d2 is the set length; preferably, d2 is 8 to 12 bp, more preferably 10 bp;
preferably, the similarity threshold is 1.5 to 2.5, more preferably 2.
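The similarity value can be illustrated with a short sketch. It assumes that the sum runs over both flanks together and that the base directly adjacent to the repeat has distance d1 = 1; the source does not state either point explicitly. The function name `flank_similarity` and the example flanks are hypothetical.

```python
def flank_similarity(left_flank, right_flank, repeat_base, d2=10):
    """Sum (d2 + 1 - d1) / d2 over flank bases equal to the repeat base.

    left_flank:  bases upstream of the repeat, written 5'->3', so its
                 last character is adjacent to the repeat.
    right_flank: bases downstream of the repeat; its first character
                 is adjacent to the repeat.
    """
    score = 0.0
    # Walk outward from the repeat on the left side (d1 = 1, 2, ...).
    for d1, base in enumerate(reversed(left_flank[-d2:]), start=1):
        if base == repeat_base:
            score += (d2 + 1 - d1) / d2
    # Walk outward from the repeat on the right side.
    for d1, base in enumerate(right_flank[:d2], start=1):
        if base == repeat_base:
            score += (d2 + 1 - d1) / d2
    return score

# Hypothetical poly-A locus with 10 bp flanks on each side.
score = flank_similarity("GCTAGCTAGC", "CGTAGCTAGC", "A")
keep_locus = score < 2  # preferred similarity threshold from the text
```

Bases matching the repeat contribute more the closer they sit to the locus, so a low score means the repeat boundaries are unambiguous in the reads.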
In a preferred embodiment, the MSS sample screening and statistics module comprises: an alignment module for aligning the sequencing data of each MSS sample to the reference genome sequence to obtain an alignment result; a search and extraction module for locating the first locus set in the alignment result and extracting from it the end-to-end reads covering each microsatellite locus in the first locus set, where an end-to-end read is a read that covers a microsatellite locus meeting the first condition together with at least 2 bp on each of its left and right sides; and a type frequency statistics module for counting each repeat-unit type in the end-to-end reads covering each microsatellite locus and the occurrence frequency of each repeat-unit type.
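The end-to-end read criterion and the frequency statistics can be sketched as below. The read representation (dicts with `ref_start`, `ref_end`, and `repeat_len`) is a hypothetical simplification that assumes alignment and repeat-length measurement have already been done upstream; coordinates are assumed to be 0-based and half-open.

```python
from collections import Counter

def is_end_to_end(read_start, read_end, locus_start, locus_end, min_flank=2):
    """True if the aligned read spans the locus plus min_flank bp per side."""
    return (read_start <= locus_start - min_flank
            and read_end >= locus_end + min_flank)

def repeat_type_frequencies(reads, locus_start, locus_end, min_flank=2):
    """Frequency of each observed repeat length among end-to-end reads."""
    counts = Counter(
        r["repeat_len"] for r in reads
        if is_end_to_end(r["ref_start"], r["ref_end"],
                         locus_start, locus_end, min_flank)
    )
    total = sum(counts.values())
    return {length: n / total for length, n in counts.items()} if total else {}
```

Reads that stop inside the repeat or cover less than 2 bp of flank are ignored, because their repeat length cannot be measured reliably.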
In a preferred embodiment, the search and extraction module comprises: a repeated-sequence-family statistics module for identifying, from the alignment result, the end-to-end reads belonging to the same repeated sequence family and counting the number of reads of each repeat-unit type within that family; and a repeat-unit type selection module for selecting the most numerous repeat-unit type as the repeat-unit type of that family (within a family, the end-to-end reads of the most abundant type are retained), and counting the number of end-to-end reads supporting the microsatellite locus.
Preferably, each repeat-unit type at a microsatellite locus is supported by at least two end-to-end reads.
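A minimal sketch of this de-duplication step follows. How reads are grouped into a repeated sequence family is not specified in this passage, so `family_id` is a hypothetical precomputed label. Counting support per family after de-duplication, rather than per raw read, is also an assumption.

```python
from collections import Counter, defaultdict

def consensus_type_counts(reads, min_support=2):
    """Collapse each family to its most frequent repeat length.

    reads: iterable of (family_id, repeat_len) pairs for the end-to-end
           reads at one locus (hypothetical representation).
    Returns {repeat_len: number of supporting families}, keeping only
    repeat lengths with at least min_support supporting families.
    """
    by_family = defaultdict(Counter)
    for family_id, repeat_len in reads:
        by_family[family_id][repeat_len] += 1
    support = Counter()
    for counter in by_family.values():
        # Ties resolve to the repeat length seen first in that family.
        top_type, _ = counter.most_common(1)[0]
        support[top_type] += 1
    return {t: n for t, n in support.items() if n >= min_support}
```

Taking the majority type within a family suppresses PCR stutter and sequencing errors that affect only some copies of the same original molecule.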
Preferably, the capture efficiency is measured as the ratio of the number of end-to-end reads at each microsatellite locus to the sequencing depth of the sample, and the capture threshold is preferably >= 0.4.
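As a small illustration of this filter, the sketch below applies the ratio with the preferred threshold of 0.4. The function name `passes_capture` is hypothetical, and how the sequencing depth of the sample is measured is left to the caller.

```python
def passes_capture(end_to_end_count, sequencing_depth, capture_threshold=0.4):
    """True if end-to-end reads / sequencing depth meets the threshold."""
    if sequencing_depth <= 0:
        return False  # no usable depth, so the locus cannot pass
    return end_to_end_count / sequencing_depth >= capture_threshold
```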
Example 16
The present embodiment provides a screening apparatus for detecting microsatellite loci of MSI, the screening apparatus comprising: a first calculation module, configured to select sequencing data of a plurality of known MSI-H samples and a plurality of known MSS samples, respectively screen microsatellite loci in the MSS plasma model samples according to any one of the screening apparatuses, and respectively calculate frequency distributions of types of repeating units of each microsatellite locus of the known MSI-H samples and the known MSS samples; a second calculation module for calculating a level of difference between the frequency distribution of the type of the repeating unit in each microsatellite loci of the known MSI-H samples and the known MSS samples and the frequency distribution of the type of the repeating unit in the MSS plasma model samples, respectively, and retaining microsatellite loci having significant differences between the plurality of known MSI-H samples and the plurality of known MSS samples as microsatellite loci for detecting MSI.
In a preferred embodiment, the second calculation module calculates KLD values between the frequency distribution of the type of repeating units in each microsatellite loci of the known MSI-H samples and the known MSS samples and the frequency distribution of the type of repeating units in the MSS plasma model samples, respectively, according to formula (I), and retains microsatellite loci for which there is a significant difference in KLD values between the plurality of known MSI-H samples and the plurality of known MSS samples as microsatellite loci for detecting MSI;
Figure BDA0002505152650000281
wherein p (x) represents the frequency distribution of the microsatellite loci of an MSI-H sample or of a known MSS sample, and q (x) represents the frequency distribution of the microsatellite loci of an MSS plasma model sample.
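Formula (I) appears only as a figure in this text. The sketch below assumes it is the standard Kullback-Leibler divergence, KLD(p || q) = Σ p(x) · log(p(x)/q(x)), with the natural logarithm; the logarithm base and the dict-based representation are assumptions.

```python
import math

def kld(p, q):
    """Kullback-Leibler divergence of p from q over repeat-unit types.

    p, q: {repeat_len: frequency}. Every type with p(x) > 0 must also
    have q(x) > 0; otherwise the distributions need smoothing first.
    """
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue  # a zero-probability term contributes nothing
        qx = q.get(x, 0.0)
        if qx == 0:
            raise ValueError(f"q has zero frequency for type {x}; smooth first")
        total += px * math.log(px / qx)
    return total
```

The value is 0 when the sample distribution matches the model exactly and grows as the sample's repeat-unit types shift away from the MSS model.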
Preferably, a non-parametric test, more preferably the Wilcoxon (Wilcox) test, is used to test whether the KLD values differ significantly.
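Because the MSI-H and MSS groups are independent samples, the test is assumed here to be the two-sample Wilcoxon rank-sum (Mann-Whitney U) form. The sketch uses only the standard library and a normal approximation without tie or continuity correction, so it is an illustration rather than a replacement for a statistics package. Both inputs must be non-empty.

```python
import math

def rank_sum_test(x, y):
    """Two-sided rank-sum test; returns (U statistic for x, p-value)."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n1, n2 = len(x), len(y)
    r1 = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u1 = r1 - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return u1, p_value
```

A locus would be retained when the KLD values of the MSI-H group differ significantly from those of the MSS group at a chosen significance level, which the text does not specify.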
Example 17
In this embodiment, there is provided a baseline building apparatus for detecting MSI, the building apparatus comprising: a microsatellite loci screening module for screening at least 15 of the 38 microsatellite loci shown in table 1 for MSI detection from a plurality of known MSS samples according to any of the screening apparatus described above; the frequency distribution difference statistics module is used for counting the frequency distribution of the types of the repeated units of each microsatellite locus and calculating the difference level of the frequency distribution of the types of the repeated units and the frequency distribution of the types of the repeated units of the MSS plasma model sample; a diversity removal module for removing samples having polymorphisms from a plurality of known MSS samples; and the baseline establishment module is used for counting the average level and the discrete degree of the difference level of each microsatellite loci of all MSS samples, so as to construct a baseline for detecting MSI.
In a preferred embodiment, the frequency distribution difference statistics module is a KLD module, which is used to calculate the KLD value of the frequency distribution of the type of each repeating unit and the frequency distribution of the type of repeating unit of the MSS plasma model sample according to formula (I),
Figure BDA0002505152650000282
wherein p (x) represents the frequency distribution of said microsatellite loci of a known MSS sample and q (x) represents the frequency distribution of said microsatellite loci of an MSS plasma model sample; and the baseline establishment module is used for counting the average level and the discrete degree of the KLD value of each microsatellite locus of the known MSS sample, so as to construct and obtain a baseline for detecting MSI.
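The baseline statistics can be sketched as follows, assuming the per-sample KLD values have already been computed. Using the arithmetic mean and the sample standard deviation as the "average level" and "degree of dispersion" is an assumption, and the input layout is hypothetical.

```python
from statistics import mean, stdev

def build_baseline(kld_by_sample):
    """Per-locus mean and SD of KLD values across known MSS samples.

    kld_by_sample: list with one {locus: KLD value} dict per MSS sample.
    Returns {locus: (mean, sd)}; loci seen in fewer than two samples are
    skipped because a sample SD cannot be computed for them.
    """
    values = {}
    for sample in kld_by_sample:
        for locus, k in sample.items():
            values.setdefault(locus, []).append(k)
    return {locus: (mean(v), stdev(v))
            for locus, v in values.items() if len(v) >= 2}
```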
In a preferred embodiment, the KLD module further comprises: a type-inconsistency KLD calculation module for calculating the KLD value between the frequency distribution of the repeat-unit types in each known MSS sample and the frequency distribution of the repeat-unit types of the MSS plasma model sample when the repeat-unit types in the known MSS sample are inconsistent with those in the MSS plasma model sample, the type-inconsistency KLD calculation module comprising: a union module for taking the union of the repeat-unit types in the known MSS sample and the repeat-unit types in the MSS plasma model sample, denoting the number of repeat-unit types in the union as M, and setting a minimal value epsilon; a smoothing module for smoothing the frequency distribution of the repeat-unit types in the known MSS sample and the frequency distribution of the repeat-unit types in the MSS plasma model sample, respectively; and a post-processing calculation module for calculating the KLD value between the smoothed frequency distribution of the repeat-unit types in the known MSS sample and the smoothed frequency distribution of the repeat-unit types in the MSS plasma model sample.
Preferably, the smoothing process includes: relative to the union, if n repeat-unit types are absent from the known MSS sample or from the MSS plasma model sample, each absent repeat-unit type is assigned a frequency of epsilon/n, and each remaining repeat-unit type is assigned a frequency of p(x) - epsilon/(M - n).
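The smoothing rule can be sketched as follows. The value of epsilon is not given in the text, so the default below is purely illustrative, and the function name `smooth` is hypothetical.

```python
def smooth(dist, all_types, epsilon=1e-6):
    """Give absent repeat-unit types a small frequency, keeping the sum at 1.

    dist:      {repeat_len: frequency} for one sample or for the model.
    all_types: union of repeat-unit types across sample and model (size M).
    """
    missing = {t for t in all_types if dist.get(t, 0.0) == 0.0}
    n, m = len(missing), len(all_types)
    if n == 0:
        return {t: dist[t] for t in all_types}
    if n == m:
        raise ValueError("distribution has no observed types")
    smoothed = {}
    for t in all_types:
        if t in missing:
            smoothed[t] = epsilon / n              # absent types share epsilon
        else:
            smoothed[t] = dist[t] - epsilon / (m - n)  # present types give it up
    return smoothed

# Usage: smooth both distributions over the same union before computing KLD.
p = {10: 0.9, 9: 0.1}
q = {10: 0.95, 11: 0.05}
union = sorted(set(p) | set(q))
p_s, q_s = smooth(p, union), smooth(q, union)
```

The absent types together receive epsilon and the observed types together lose epsilon, so each smoothed distribution still sums to 1 and every log ratio in the KLD is defined.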
Example 18
The present embodiment provides a detection apparatus for detecting a microsatellite state, the detection apparatus including:
a microsatellite locus screening module, configured to screen the type of the repeating unit and the frequency distribution of the type of the repeating unit of each microsatellite locus from the sample to be tested according to any one of the screening devices described above, for at least 15 loci of the 38 microsatellite loci shown in table 1 for detecting MSI;
a difference level detection module for calculating a difference level g1 between the frequency distribution of the type of the repeating unit of each microsatellite locus of the sample to be tested and the frequency distribution of the type of the repeating unit of the MSS plasma model sample;
a Z value calculation module for calculating the Z value of the sample to be tested according to the difference level g0 between the frequency distribution of the type of the repeating unit of each microsatellite locus of the baseline sample and the frequency distribution of the type of the repeating unit of the MSS plasma model sample;
an unstable locus judging module for selecting the highest-frequency repeating-unit type Mp at each microsatellite locus of the sample to be tested and the highest-frequency repeating-unit type Mq at the same microsatellite locus of the MSS plasma model sample, and judging the microsatellite locus to be an unstable locus according to either of the following:
(1) Mp is not equal to Mq, and the difference level of the sample to be tested satisfies g1 > average(g0) + z·SD(g0);
(2) Mp = Mq, p(Mp) <= average(q(Mq)) + z·SD(q(Mq)), and at the same time the difference level of the sample to be tested satisfies g1 > average(g0) + z·SD(g0);
wherein average(g0) represents the average level of the difference level g0 of the baseline samples, SD(g0) represents the degree of dispersion of the difference level g0 of the baseline samples, and z represents a coefficient applied to the degree of dispersion of the difference level g0 of the baseline samples;
a microsatellite status judging module for collecting the Z values of the microsatellite loci in the sample to be tested that meet a depth threshold, obtaining the average level of the Z values, and judging the microsatellite status of the sample to be tested according to the following conditions:
(1) when the number n1 of microsatellite loci meeting the depth threshold is greater than or equal to 15, of which n2 are unstable loci, and n2/n1 >= a or the average level of the Z values >= b, the microsatellite status of the sample to be tested is judged to be MSI-H;
(2) when the number n1 of microsatellite loci meeting the depth threshold is greater than or equal to 15, of which n2 are unstable loci, and n2/n1 < a and the average level of the Z values < b, the microsatellite status of the sample to be tested is judged to be MSS;
(3) when the number n1 of microsatellite loci meeting the depth threshold is less than 15, the sample to be tested fails quality control and its microsatellite status is undetermined;
wherein a is 0.15-0.3, and b is 0.8-2.
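The three-way decision rule can be written down directly. In this sketch the defaults a = 0.2 and b = 1.0 are arbitrary picks from the stated ranges (0.15 to 0.3 and 0.8 to 2), and the function name `call_status` is hypothetical.

```python
def call_status(n1, n2, mean_z, a=0.2, b=1.0, min_loci=15):
    """Classify a sample as 'MSI-H', 'MSS', or 'undetermined'.

    n1:     number of loci meeting the depth threshold
    n2:     number of those loci judged unstable
    mean_z: average Z value over the n1 loci
    """
    if n1 < min_loci:
        return "undetermined"  # fails quality control
    if n2 / n1 >= a or mean_z >= b:
        return "MSI-H"
    return "MSS"
```

Either criterion alone is enough for an MSI-H call, so a sample with few unstable loci but a high mean Z value is still flagged.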
In a preferred embodiment, the detection apparatus calculates the difference level g0 and the difference level g1 according to formula (I), obtaining the KLD value of the baseline sample and the KLD value of the sample to be tested, and calculates the Z value of the sample to be tested from the KLD value of the sample to be tested and the average level and degree of dispersion of the KLD values of the baseline samples;
Figure BDA0002505152650000301
wherein p (x) represents the frequency distribution of the microsatellite loci of the sample to be tested or the microsatellite loci of the baseline sample, and q (x) represents the frequency distribution of the microsatellite loci of the MSS plasma model sample.
In a preferred embodiment, the microsatellite locus is judged to be an unstable locus according to either of the following: (1) Mp is not equal to Mq, and the KLD value of the sample to be tested > average(Ki) + 3·SD(Ki); (2) Mp = Mq, p(Mp) <= average(q(Mq)) + z·SD(q(Mq)), and at the same time the KLD value of the sample to be tested > average(Ki) + 3·SD(Ki); wherein average(Ki) represents the average level of the KLD values of the baseline samples, and SD(Ki) represents the degree of dispersion of the KLD values of the baseline samples.
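A sketch of the per-locus Z value and the unstable-locus rule follows. The text does not give the Z formula explicitly, so the standard score (g1 - mean) / SD is an assumption. The parameters `q_top_mean` and `q_top_sd`, taken here as the mean and SD of the frequency of type Mq across baseline samples, are a hypothetical reading of average(q(Mq)) and SD(q(Mq)).

```python
def z_value(g1, baseline_mean, baseline_sd):
    """Standard score of the sample's KLD against the baseline (assumed form)."""
    return (g1 - baseline_mean) / baseline_sd

def is_unstable(p, mq, q_top_mean, q_top_sd, g1,
                baseline_mean, baseline_sd, z=3.0):
    """Apply rules (1) and (2) for one locus.

    p:  {repeat_len: frequency} for the sample to be tested
    mq: highest-frequency repeat length in the MSS plasma model sample
    g1: KLD of the sample against the model at this locus
    """
    mp = max(p, key=p.get)  # highest-frequency type in the sample
    exceeds = g1 > baseline_mean + z * baseline_sd
    if mp != mq:
        return exceeds                                   # rule (1)
    return p[mp] <= q_top_mean + z * q_top_sd and exceeds  # rule (2)
```

With z = 3 the KLD condition matches the 3·SD cut-off of the preferred embodiment.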
Example 19
This embodiment provides a storage medium that includes a stored program, wherein, when the program runs, the device in which the storage medium is located is controlled to execute any one of the above screening methods, any one of the above construction methods, or any one of the above detection methods.
The application also provides a processor, wherein the processor is used for running a program, and the program executes any one of the screening methods, any one of the construction methods or any one of the detection methods.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash RAM. Memory is an example of a computer-readable medium.
Computer-readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (51)

1. A method for screening microsatellite loci of an MSS plasma model sample, the method comprising:
extracting microsatellite loci meeting a first condition from a human reference genome sequence or a target gene capture sequence, and recording the microsatellite loci as a first locus set, wherein the first condition comprises: a 7-15 bp single base repetitive sequence; b. the similarity value of the two wing sequences of the 7-15 bp single base repetitive sequence is lower than a similarity threshold value;
acquiring sequencing data of a plurality of MSS samples, screening the first locus set from the sequencing data of each MSS sample, and counting the type of a repeating unit of each microsatellite locus in the first locus set and the occurrence frequency of the type of each repeating unit;
selecting microsatellite loci satisfying a second condition from the first locus set as a second locus set, the second condition comprising: a polymorphism in the population below 5%, and a capture efficiency during library-construction sequencing above a capture threshold;
calculating the average level and the degree of dispersion of the occurrence frequency of each repeating unit type of each microsatellite locus in the second locus set across all MSS samples, and selecting the microsatellite loci whose average degree of dispersion is lower than a dispersion threshold as a third locus set;
taking the microsatellite loci in the third locus set as the microsatellite loci of the MSS plasma model sample, wherein the average level of the frequency of occurrence of each of the types of repeating units of each of the microsatellite loci in the third locus set is taken as the frequency distribution of the types of repeating units of each of the microsatellite loci in the MSS plasma model sample.
2. The screening method of claim 1, wherein extracting microsatellite loci satisfying a first condition from a human reference genomic sequence or a target gene capture sequence, denoted as a first set of loci comprises:
extracting microsatellite loci of 7-15 bp single base repetitive sequences from a human reference genome sequence;
calculating the similarity value of the sequence with the set length at the left end and the right end of the 7-15 bp single base repetitive sequence and the 7-15 bp single base repetitive sequence for each microsatellite locus;
and selecting microsatellite loci with the similarity value lower than a similarity threshold value as the first locus set.
3. The screening method of claim 2, wherein the similarity value is calculated according to the following formula: Σ(d2 + 1 - d1)/d2, wherein d1 is the distance to the microsatellite locus of each base, within the sequences of the set length at the left and right ends, that is the same as the base of the 7-15 bp single base repeat sequence, and d2 is the set length.
4. The method according to claim 3, wherein d2 is 8 to 12bp.
5. The method according to claim 4, wherein d2 is 10bp.
6. The screening method of claim 2, wherein the similarity threshold is 1.5 to 2.5.
7. The method of screening according to claim 6, wherein the similarity threshold is 2.
8. The screening method of claim 2, wherein obtaining sequencing data for a plurality of MSS samples and screening the first set of sites from the sequencing data for each of the MSS samples and counting the type of repeat units for each of the microsatellite loci in the first set of sites and the frequency of occurrence of each of the types of repeat units comprises:
aligning the sequencing data of each MSS sample to a reference genome sequence to obtain an alignment result;
locating the first locus set in the alignment result, and extracting from the alignment result end-to-end reads covering each microsatellite locus in the first locus set, wherein the end-to-end reads refer to reads covering a microsatellite locus meeting the first condition and at least 2 bp at each of the left end and the right end of the microsatellite locus;
counting the type of each repeating unit in said end-to-end reads covering each said microsatellite loci and the frequency of occurrence of each said repeating unit type.
9. The method of claim 8, wherein extracting end-to-end reads covering each of the microsatellite loci in the first set of loci from the alignment comprises:
counting the end-to-end reads belonging to the same repeated sequence family from the comparison result, counting the number of types of different repeated units in the same repeated sequence family, selecting the type with the largest number of the repeated units as the type of the repeated units of the same repeated sequence family, and counting the number of the end-to-end reads supporting the microsatellite loci.
10. The method of claim 9, wherein the number of end-to-end reads that support each repeat unit type of each microsatellite locus is at least two.
11. The method of claim 9, wherein the capture efficiency is measured as a ratio of the number of end-to-end reads for each of the microsatellite loci to the sequencing depth of the sample.
12. The method of claim 1, wherein the capture threshold is greater than or equal to 0.4.
13. A screening method for detecting microsatellite loci of MSI, said screening method comprising:
selecting a plurality of known MSI-H samples and sequencing data of a plurality of known MSS samples, screening microsatellite loci in the MSS plasma model samples according to the screening method of any one of claims 1 to 12, and calculating frequency distribution of types of repeating units of each of the microsatellite loci of the known MSI-H samples and the known MSS samples, respectively;
calculating a level of difference between the frequency distribution of the type of repeating units in each of the microsatellite loci of the known MSI-H sample and the known MSS sample and the frequency distribution of the type of repeating units in the MSS plasma model sample, respectively, and retaining a microsatellite locus where the level of difference has a significant difference between the MSI-H sample and the MSS sample as the microsatellite locus for detecting MSI.
14. The screening method according to claim 13, wherein KLD values between the frequency distribution of the type of repeating units of each of said microsatellite loci of said known MSI-H samples and said known MSS samples and the frequency distribution of the type of repeating units in said MSS plasma model samples are calculated respectively according to formula (I), and microsatellite loci for which said KLD values have a significant difference between said known MSI-H samples and said known MSS samples are retained as said microsatellite loci for detecting MSI;
Figure QLYQS_1
wherein p (x) represents the frequency distribution of each of the microsatellite loci of the MSI-H samples or each of the microsatellite loci of the known MSS samples, and q (x) represents the frequency distribution of each of the microsatellite loci of the MSS plasma model samples.
15. The screening method of claim 14, wherein a non-parametric test is used to test whether each of said KLD values has a significant difference between said known MSI-H samples and said known MSS samples.
16. The screening method according to claim 15, wherein said non-parametric test method is the Wilcoxon test.
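The per-locus comparison of claims 15 and 16 can be sketched as below. This assumes the test meant is the unpaired Wilcoxon rank-sum (Mann-Whitney U) test, since the MSI-H and MSS groups are independent samples; that reading, the function name, and the normal approximation are all assumptions of this sketch.

```python
import math


def rank_sum_test(a, b):
    """Two-sided Wilcoxon rank-sum test using the normal approximation.

    Ties receive average ranks; no tie or continuity correction is applied.
    Returns (U statistic of group `a`, approximate two-sided p-value).
    """
    n1, n2 = len(a), len(b)
    if n1 == 0 or n2 == 0:
        raise ValueError("both groups must be non-empty")
    # Pool the values, remembering which group (0 or 1) each came from.
    pooled = sorted((v, g) for g, vals in enumerate((a, b)) for v in vals)
    rank_sum_a = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        rank_sum_a += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_a - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    return u, math.erfc(abs(z) / math.sqrt(2))
```

A locus would be retained when the p-value for its KLD values in the MSI-H group versus the MSS group falls below a chosen significance level; the claims do not state that level.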
17. Microsatellite loci for use in the detection of MSI, which are screened using the screening method of any one of claims 13 to 16.
18. Microsatellite loci for use in the detection of MSI, comprising at least 15 of the 38 microsatellite loci shown in table 1.
19. A kit for detecting MSI, comprising detection reagents for detecting microsatellite loci of MSI, said microsatellite loci comprising at least 15 of the 38 microsatellite loci shown in table 1.
20. A baseline construction method for detecting MSI, the construction method comprising:
screening at least 15 of the 38 microsatellite loci shown in table 1 for MSI detection from a plurality of known MSS samples according to the screening method of any one of claims 1 to 12;
counting the frequency distribution of the type of the repeating unit of each microsatellite locus, and calculating the difference level between the frequency distribution of the type of each repeating unit and the frequency distribution of the type of the repeating unit of the MSS plasma model sample;
removing microsatellite loci with polymorphism in each sample;
and counting the average level and the discrete degree of the difference level of each microsatellite loci of all MSS samples, thereby constructing the baseline for detecting MSI.
21. The construction method according to claim 20, wherein the KLD value of the frequency distribution of the type of each of the repeating units and the frequency distribution of the type of the repeating units of the MSS plasma model sample is calculated according to the formula (I),
Figure QLYQS_2
wherein p (x) represents the frequency distribution of the microsatellite loci of the known MSS sample and q (x) represents the frequency distribution of the microsatellite loci of the MSS plasma model sample;
the average level and degree of dispersion of the KLD values for each of the microsatellite loci of all known MSS samples are counted to construct the baseline for MSI detection.
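The baseline of claims 20 and 21 can be sketched as a per-locus summary of KLD values across the known MSS samples. Taking the "average level" as the arithmetic mean and the "degree of dispersion" as the sample standard deviation is an assumption (the SD notation in the later claims suggests it); the function name and data layout are hypothetical.

```python
import statistics


def build_baseline(kld_by_sample):
    """Summarise per-locus KLD values over known MSS samples.

    kld_by_sample: {sample_id: {locus_id: KLD value}}
    Returns {locus_id: (mean, sd)}. Loci observed in fewer than two samples
    are skipped because a sample standard deviation is undefined for them.
    """
    per_locus = {}
    for locus_values in kld_by_sample.values():
        for locus, kld in locus_values.items():
            per_locus.setdefault(locus, []).append(kld)
    return {
        locus: (statistics.mean(vals), statistics.stdev(vals))
        for locus, vals in per_locus.items()
        if len(vals) >= 2
    }
```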
22. The method of claim 21, wherein when the type of repeating units in the known MSS sample does not match the type of repeating units in the MSS plasma model sample, calculating the KLD value for the frequency distribution of the type of repeating units in each of the known MSS samples and the frequency distribution of the type of repeating units in the MSS plasma model sample comprises:
taking a union set of the types of the repeating units in the known MSS sample and the types of the repeating units in the MSS plasma model sample, marking the union set as M, marking the number of the types of the repeating units as M, and setting a minimum epsilon;
Smoothing the frequency distribution of the type of the repeating units in the known MSS sample and the frequency distribution of the type of the repeating units in the MSS plasma model sample respectively;
calculating the KLD value of the frequency distribution of the type of the repeating unit in the known MSS sample after the smoothing and the frequency distribution of the type of the repeating unit in the MSS plasma model sample.
23. The construction method according to claim 22, wherein the smoothing process includes:
in the known MSS sample or the MSS plasma model sample, if n types of repeating units are absent, the frequency of the absent repeating unit type is ε/n, and the frequency of the remaining repeating unit type is p (x) - ε/(M-n) compared to the M.
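The smoothing of claims 22 and 23 can be sketched as below. The claim text uses M for both the union set and its size; this sketch treats M as the number of types in the union. It also assumes that the KLD is the standard Kullback-Leibler divergence with a natural logarithm, that repeating-unit types are keyed by any hashable and sortable label, and that ε is 1e-6; none of these are stated in the claims.

```python
import math


def smooth(freqs, all_types, eps=1e-6):
    """Smooth one frequency distribution over the union of types.

    Each of the n absent types receives eps/n, and each of the (M - n)
    present types gives up eps/(M - n), so the result still sums to 1.
    """
    present = [t for t in all_types if freqs.get(t, 0.0) > 0.0]
    n = len(all_types) - len(present)
    if n == 0:
        return {t: freqs[t] for t in all_types}
    return {
        t: freqs[t] - eps / len(present) if freqs.get(t, 0.0) > 0.0 else eps / n
        for t in all_types
    }


def kld(p, q, eps=1e-6):
    """KLD between a sample distribution p and a model distribution q."""
    types = sorted(set(p) | set(q))
    ps, qs = smooth(p, types, eps), smooth(q, types, eps)
    return sum(ps[t] * math.log(ps[t] / qs[t]) for t in types)
```

Smoothing matters because a type present in the sample but absent from the model would otherwise put a zero in the denominator of the logarithm.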
24. A method for detecting a microsatellite status, the method comprising:
screening a sample to be tested for the type of repeating unit of each microsatellite loci and the frequency distribution of the type of repeating unit according to the screening method of any one of claims 1 to 12 for at least 15 loci of 38 microsatellite loci shown in table 1 for MSI;
calculating the difference level g1 between the frequency distribution of the type of the repeating unit of each microsatellite locus of the sample to be tested and the frequency distribution of the type of the repeating unit of the MSS plasma model sample;
and calculating the Z value of the sample to be detected based on the difference level g0 between the frequency distribution of the type of repeating unit of each of the microsatellite loci of the baseline sample and the frequency distribution of the type of repeating unit of the MSS plasma model sample;
selecting the type Mp of the repeating unit with the highest frequency in each microsatellite locus of the sample to be detected and the type Mq of the repeating unit with the highest frequency in the same microsatellite locus of the MSS plasma model sample, and judging the microsatellite locus as an unstable locus according to any one of the following methods:
(1) if Mp is not equal to Mq and the difference level g1 of the microsatellite loci of the sample to be tested > Average(g0) + zSD(g0);
(2) if Mp = Mq and p(Mp) <= Average(q(Mq)) + zSD(q(Mq)), and at the same time the difference level g1 of the microsatellite loci of the sample to be tested > Average(g0) + zSD(g0);
wherein Average(g0) represents the average level of the difference level g0 of the microsatellite loci of the baseline sample, SD(g0) represents the degree of dispersion of the difference level g0 of the microsatellite loci of the baseline sample, and z represents the coefficient of the degree of deviation of the difference level g0 of the microsatellite loci of the baseline sample;
counting Z values of the microsatellite loci meeting a depth threshold in the sample to be detected, obtaining the average level of the Z values, and judging the microsatellite state of the sample to be detected according to the following conditions:
(1) The number n1 of the microsatellite loci meeting the depth threshold is more than or equal to 15, wherein the number of the unstable loci is n2, and if n2/n1 is more than or equal to a or the average level of Z values is more than or equal to b, the microsatellite state of the sample to be detected is judged to be MSI-H;
(2) The number n1 of the microsatellite loci meeting the depth threshold is more than or equal to 15, wherein the number of the unstable loci is n2, n2/n1 is less than a, and the average level of Z values is less than b, and judging that the microsatellite state of the sample to be detected is MSS;
(3) The number n1 of the microsatellite loci meeting the depth threshold is less than 15, and the microsatellite state of the sample to be detected is undetermined;
wherein a is 0.15-0.3, and b is 0.8-2.
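The three-way decision at the end of claim 24 can be sketched as follows. The defaults a = 0.2 and b = 1.0 are hypothetical picks from inside the claimed ranges, the "average level" of the Z values is taken to be the arithmetic mean, and the function name is illustrative only.

```python
def call_microsatellite_status(n_unstable, z_values, a=0.2, b=1.0, min_loci=15):
    """Classify a sample from its loci that met the depth threshold.

    n_unstable: number n2 of those loci judged unstable.
    z_values:   Z values of those loci, one per locus (so n1 = len(z_values)).
    """
    n1 = len(z_values)
    if n1 < min_loci:
        return "undetermined"
    mean_z = sum(z_values) / n1
    # MSI-H if either the unstable fraction or the mean Z value is high enough.
    if n_unstable / n1 >= a or mean_z >= b:
        return "MSI-H"
    return "MSS"
```

Note that the two criteria are joined by "or" for MSI-H, so an MSS call requires both the unstable fraction and the mean Z value to fall below their thresholds.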
25. The method of claim 24, wherein the method of detecting comprises:
calculating the difference level g0 and the difference level g1 according to formula (I) to obtain the KLD value of the baseline sample and the KLD value of the sample to be tested;
calculating the Z value of the sample to be detected according to the average level and the discrete degree of the KLD value of the sample to be detected and the KLD value of the baseline sample;
Figure QLYQS_3
wherein p (x) represents the frequency distribution of the microsatellite loci of the sample to be tested or the microsatellite loci of the baseline sample, and q (x) represents the frequency distribution of the microsatellite loci of the MSS plasma model sample.
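Claim 25 does not spell out how the Z value is derived from the baseline statistics. One plausible reading, shown here purely as an assumption, is the usual standard score of the sample's KLD value against the baseline mean and standard deviation for the same locus; the function name is hypothetical.

```python
def z_value(kld_sample: float, baseline_mean: float, baseline_sd: float) -> float:
    """Standard score of a locus KLD value against the MSS baseline (assumed form)."""
    if baseline_sd <= 0:
        raise ValueError("baseline standard deviation must be positive")
    return (kld_sample - baseline_mean) / baseline_sd
```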
26. The method of claim 24, wherein the microsatellite loci are determined to be unstable loci according to any one of the following methods:
(1) if Mp is not equal to Mq and the KLD value of the sample to be tested > Average(Ki) + 3SD(Ki);
(2) if Mp = Mq and p(Mp) <= Average(q(Mq)) + zSD(q(Mq)), and at the same time the KLD value of the sample to be tested > Average(Ki) + 3SD(Ki);
wherein Average(Ki) represents the average level of KLD values of the baseline samples, and SD(Ki) represents the degree of dispersion of KLD values of the baseline samples.
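The unstable-locus rule of claim 26 can be sketched as below, following the wording of the two conditions literally. The KLD coefficient is fixed at 3 as in the claim; the coefficient z applied to q(Mq) is not given a value in the claim, so its default of 3 here is a hypothetical choice, as are all parameter names.

```python
def is_unstable_locus(p, q_model, kld_sample, kld_mean, kld_sd,
                      q_top_mean, q_top_sd, z_kld=3.0, z_freq=3.0):
    """Apply conditions (1) and (2) of claim 26 to one locus.

    p:          {repeat type: frequency} in the sample to be tested
    q_model:    {repeat type: frequency} in the MSS plasma model sample
    kld_mean, kld_sd:     Average(Ki) and SD(Ki) from the baseline
    q_top_mean, q_top_sd: Average(q(Mq)) and SD(q(Mq)) from the baseline
    """
    mp = max(p, key=p.get)              # most frequent type in the sample
    mq = max(q_model, key=q_model.get)  # most frequent type in the model
    kld_high = kld_sample > kld_mean + z_kld * kld_sd
    if mp != mq:
        return kld_high                 # condition (1)
    # Condition (2): same dominant type, frequency bound met, and high KLD.
    return p[mp] <= q_top_mean + z_freq * q_top_sd and kld_high
```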
27. A screening apparatus for microsatellite loci of an MSS plasma model sample, the screening apparatus comprising:
a first set of sites module for extracting microsatellite loci satisfying a first condition from a human reference genomic sequence or a capture sequence of a target gene, denoted as a first set of sites, the first condition comprising: a.7-15 bp single base repeat sequence; b. the similarity value of the two wing sequences of the 7-15 bp single base repetitive sequence is lower than a similarity threshold value;
The MSS sample screening and counting module is used for acquiring sequencing data of a plurality of MSS samples, screening the first position set from the sequencing data of each MSS sample, and counting the type of a repeated unit of each microsatellite locus in the first position set and the occurrence frequency of the type of each repeated unit;
a second set of sites module that selects microsatellite loci satisfying a second condition from the first set of sites as a second set of sites, the second condition comprising: polymorphisms in the population below 5% and capture efficiency during pool sequencing above capture threshold;
a third site set module, configured to calculate an average level and a degree of dispersion of the frequencies of occurrence of each of the types of repeating units of each of the microsatellite loci in the second site set for all the MSS samples, and select, as a third site set, a microsatellite locus having an average level of the degree of dispersion lower than a dispersion threshold;
an MSS plasma model sample module for taking a microsatellite loci in the third locus set as microsatellite loci of the MSS plasma model sample, the average level of the frequency of occurrence of each of the types of repeating units of each of the microsatellite loci in the third locus set as a frequency distribution of the types of repeating units of each of the microsatellite loci in the MSS plasma model sample.
28. The screening device of claim 27, wherein the first locus set module comprises:
the first locus selection module is used for extracting microsatellite loci of 7-15 bp single base repetitive sequences from a human reference genome sequence or a capture sequence of a target gene;
the similarity value calculation module is used for calculating the similarity value of the sequence with the set length at the left end and the right end of the 7-15 bp single base repetitive sequence and the 7-15 bp single base repetitive sequence for each microsatellite locus;
and the second position selecting module is used for selecting microsatellite positions with the similarity value lower than a similarity threshold value as the first position set.
29. The screening apparatus of claim 28, wherein the similarity value is calculated according to the formula: and (d 2+1-d 1)/d 2, wherein d1 is the distance from the same base as the 7-15 bp single base repeat sequence in the sequences with the set lengths of the left and right ends to the microsatellite loci, and d2 is the set length.
30. The screening device of claim 29, wherein d2 is 8-12 bp.
31. The screening device of claim 30, wherein d2 is 10bp.
32. The screening apparatus of claim 28, wherein the similarity threshold is 1.5 to 2.5.
33. The screening apparatus of claim 32, wherein the similarity threshold is 2.
34. The screening apparatus of claim 28, wherein the MSS sample screening statistics module comprises:
the comparison module is used for comparing the sequencing data of each MSS sample with a reference genome sequence respectively to obtain a comparison result;
the searching and extracting module is used for searching the first position set from the comparison result and extracting end-to-end reads covering each microsatellite position in the first position set from the comparison result, wherein the end-to-end reads refer to reads covering at least 2bp on the left and right ends of each microsatellite position in the first position set;
and a type frequency statistics module for counting the type of each repeating unit in the end-to-end reads covering each microsatellite loci and the frequency of occurrence of each type of repeating unit.
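The read filter and the type-frequency count of claim 34 can be sketched as follows. The sketch assumes half-open reference coordinates [start, end) and represents each repeating-unit type by the repeat length observed in a read; both conventions and the function names are assumptions, while the 2 bp minimum flank comes from the claim.

```python
from collections import Counter


def is_end_to_end(read_start, read_end, locus_start, locus_end, min_flank=2):
    """True if the read covers the locus plus at least min_flank bp on each side."""
    return (locus_start - read_start >= min_flank
            and read_end - locus_end >= min_flank)


def repeat_type_frequencies(repeat_lengths):
    """Frequency of each repeat type among the end-to-end reads of one locus.

    repeat_lengths: one observed repeat length per end-to-end read.
    """
    counts = Counter(repeat_lengths)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {length: count / total for length, count in counts.items()}
```

For a locus spanning positions 100 to 112, a read aligned from 98 to 115 qualifies, while one starting at 99 does not because it covers only 1 bp of the left flank.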
35. The screening apparatus of claim 34, wherein the look-up extraction module comprises:
the same repeated sequence family statistics module is used for counting the end-to-end reads belonging to the same repeated sequence family from the comparison result and counting the number of the types of different repeated units in the same repeated sequence family,
And the repeating unit type selection module is used for selecting the type of the repeating unit with the largest number as the type of the repeating unit of the same repeating sequence family and counting the support number of the end-to-end reads supporting the microsatellite loci.
36. The screening apparatus of claim 35, wherein the number of end-to-end reads supporting each repeat unit type for each microsatellite locus is at least two.
37. The screening device of claim 34, wherein the capture efficiency is measured as a ratio of the number of end-to-end reads for each of the microsatellite loci to the sequencing depth of the sample.
38. The screening device of claim 27, wherein the capture threshold is greater than or equal to 0.4.
39. A screening apparatus for detecting microsatellite loci of MSI, said screening apparatus comprising:
a first calculation module for selecting a plurality of known MSI-H samples and sequencing data of a plurality of known MSS samples, screening microsatellite loci in the MSS plasma model samples according to the screening apparatus of any one of claims 27 to 38, and calculating frequency distributions of types of repeating units of each of the microsatellite loci of the known MSI-H samples and the known MSS samples, respectively;
A second calculation module for calculating a level of difference between the frequency distribution of the type of the repeating unit of each of the microsatellite loci of the known MSI-H sample and the known MSS sample and the frequency distribution of the type of the repeating unit in the MSS plasma model sample, respectively, and retaining the microsatellite loci having a significant difference between the known MSI-H sample and the known MSS sample as the microsatellite loci for detecting MSI.
40. The screening apparatus of claim 39, wherein the second calculation module calculates a KLD value between the frequency distribution of the type of repeating unit of each of the microsatellite loci and the frequency distribution of the type of repeating unit in the MSS plasma model sample for the known MSI-H sample and the known MSS sample, respectively, according to equation (I), and retains microsatellite loci for which there is a significant difference in the KLD value between the known MSI-H sample and the known MSS sample as the microsatellite loci for detecting MSI;
Figure QLYQS_4
wherein p (x) represents the frequency distribution of the microsatellite loci of the MSI-H sample or the known MSS sample and q (x) represents the frequency distribution of each of the microsatellite loci of the MSS plasma model sample.
41. The screening apparatus of claim 40, wherein a non-parametric test is used to test for a significant difference between the KLD values.
42. The screening apparatus of claim 41, wherein the non-parametric test is the Wilcoxon test.
43. A baseline building apparatus for detecting MSI, the building apparatus comprising:
a microsatellite loci screening module for screening at least 15 of the 38 microsatellite loci shown in table 1 for MSI detection from a plurality of known MSS samples according to the screening apparatus of any one of claims 27 to 38;
the frequency distribution difference statistics module is used for counting the frequency distribution of the types of the repeated units of each microsatellite locus and calculating the difference level of the frequency distribution of the types of the repeated units and the frequency distribution of the types of the repeated units of the MSS plasma model sample;
a diversity removal module for removing samples having polymorphisms from a plurality of the known MSS samples;
and the baseline establishment module is used for counting the average level and the discrete degree of the difference level of each microsatellite loci of all MSS samples, so as to construct the baseline for detecting MSI.
44. The building apparatus according to claim 43, wherein the frequency distribution difference statistics module is a KLD module for calculating KLD values of the frequency distribution of the type of each of the repeating units and the frequency distribution of the type of the repeating unit of the MSS plasma model sample according to equation (I),
Figure QLYQS_5
wherein p (x) represents the frequency distribution of the microsatellite loci of the known MSS sample and q (x) represents the frequency distribution of the microsatellite loci of the MSS plasma model sample;
the baseline establishment module is used for counting the average level and the discrete degree of the KLD value of each microsatellite loci of all known MSS samples, thereby constructing and obtaining the baseline for detecting MSI.
45. The build apparatus of claim 44, wherein the KLD module further comprises: a type inconsistency KLD calculation module for calculating a KLD value of a frequency distribution of a type of a repeating unit in each of the known MSS samples and a frequency distribution of a type of a repeating unit of the MSS plasma model sample when the type of the repeating unit in the known MSS sample is inconsistent with the type of the repeating unit in the MSS plasma model sample, the type inconsistency KLD calculation module comprising:
A union module, configured to take a union of the types of the repeating units in the known MSS sample and the types of the repeating units in the MSS plasma model sample, denoted as M, and the number of the types of the repeating units denoted as M, and set a minimum epsilon;
a smoothing module, configured to smooth a frequency distribution of a type of a repeating unit in the known MSS sample and a frequency distribution of a type of a repeating unit in the MSS plasma model sample, respectively;
and the post-processing calculation module is used for calculating the KLD value of the frequency distribution of the type of the repeating unit in the known MSS sample after the smoothing and the frequency distribution of the type of the repeating unit in the MSS plasma model sample.
46. The build apparatus of claim 45, wherein the smoothing process comprises:
in the known MSS sample or the MSS plasma model sample, if n types of repeating units are absent, the frequency of the absent repeating unit type is ε/n, and the frequency of the remaining repeating unit type is p (x) - ε/(M-n) compared to the M.
47. A device for detecting a microsatellite state, the device comprising:
A microsatellite loci screening module for screening a sample to be tested for the type of repeating units of each microsatellite loci and the frequency distribution of the types of repeating units according to the screening device of any one of claims 27 to 38 for at least 15 loci of the 38 microsatellite loci shown in table 1;
a difference level detection module for calculating the difference level g1 between the frequency distribution of the type of the repeating unit of each microsatellite locus of the sample to be tested and the frequency distribution of the type of the repeating unit of the MSS plasma model sample;
a Z value calculation module for calculating the Z value of the sample to be detected according to the difference level g0 between the frequency distribution of the type of the repeating unit of each microsatellite locus of the baseline sample and the frequency distribution of the type of the repeating unit of the MSS plasma model sample;
the unstable site judging module is used for selecting the type Mp of the highest-frequency repeating unit in each microsatellite locus of the sample to be detected and the type Mq of the highest-frequency repeating unit in the same microsatellite locus of the MSS plasma model sample, and judging the microsatellite locus as an unstable locus according to any one of the following methods:
(1) if Mp is not equal to Mq and the difference level g1 of the sample to be tested > Average(g0) + zSD(g0);
(2) if Mp = Mq and p(Mp) <= Average(q(Mq)) + zSD(q(Mq)), and at the same time the difference level g1 of the sample to be tested > Average(g0) + zSD(g0);
wherein Average(g0) represents the average level of the difference level g0 of the baseline sample, SD(g0) represents the degree of dispersion of the difference level g0 of the baseline sample, and z represents the coefficient of the degree of deviation of the difference level g0 of the baseline sample;
the microsatellite state judging module is used for counting Z values of the microsatellite loci meeting a depth threshold in the sample to be detected, obtaining the average level of the Z values, and judging the microsatellite state of the sample to be detected according to the following conditions:
(1) The number n1 of the microsatellite loci meeting the depth threshold is more than or equal to 15, wherein the number of the unstable loci is n2, and if n2/n1 is more than or equal to a, or the average level of Z values is more than or equal to b, wherein a is 0.15-0.3 and b is 0.8-2, the microsatellite state of the sample to be detected is judged to be MSI-H;
(2) The number n1 of the microsatellite loci meeting the depth threshold is more than or equal to 15, wherein the number of the unstable loci is n2, n2/n1 is less than a, and the average level of Z values is less than b, and judging that the microsatellite state of the sample to be detected is MSS;
(3) And if the number n1 of the microsatellite loci meeting the depth threshold is less than 15, determining that the microsatellite state of the sample to be detected is undetermined.
48. The device of claim 47, wherein the device comprises:
calculating the difference level g0 and the difference level g1 according to formula (I) to obtain the KLD value of the baseline sample and the KLD value of the sample to be tested;
calculating the Z value of the sample to be detected according to the average level and the discrete degree of the KLD value of the sample to be detected and the KLD value of the baseline sample;
Figure QLYQS_6
wherein p (x) represents the frequency distribution of the microsatellite loci of the sample to be tested or the microsatellite loci of the baseline sample, and q (x) represents the frequency distribution of the microsatellite loci of the MSS plasma model sample.
49. The apparatus of claim 47, wherein the microsatellite loci are determined to be unstable loci according to any one of the following methods:
(1) if Mp is not equal to Mq and the KLD value of the sample to be tested > Average(Ki) + 3SD(Ki);
(2) if Mp = Mq and p(Mp) <= Average(q(Mq)) + zSD(q(Mq)), and at the same time the KLD value of the sample to be tested > Average(Ki) + 3SD(Ki);
wherein Average(Ki) represents the average level of KLD values of the baseline samples, and SD(Ki) represents the degree of dispersion of KLD values of the baseline samples.
50. A storage medium comprising a stored program, wherein the program, when run, controls a device on which the storage medium is located to perform the screening method of any one of claims 1 to 16, or the construction method of any one of claims 20 to 23, or the detection method of any one of claims 24 to 26.
51. A processor, characterized in that the processor is configured to run a program, wherein the program when run performs the screening method of any one of claims 1 to 16, or the construction method of any one of claims 20 to 23, or the detection method of any one of claims 24 to 26.
CN202010444206.9A 2020-05-22 2020-05-22 Microsatellite locus for detecting MSI, screening method and application thereof Active CN111627501B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010444206.9A CN111627501B (en) 2020-05-22 2020-05-22 Microsatellite locus for detecting MSI, screening method and application thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010444206.9A CN111627501B (en) 2020-05-22 2020-05-22 Microsatellite locus for detecting MSI, screening method and application thereof

Publications (2)

Publication Number Publication Date
CN111627501A CN111627501A (en) 2020-09-04
CN111627501B true CN111627501B (en) 2023-06-02

Family

ID=72272280

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010444206.9A Active CN111627501B (en) 2020-05-22 2020-05-22 Microsatellite locus for detecting MSI, screening method and application thereof

Country Status (1)

Country Link
CN (1) CN111627501B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112435711B (en) * 2020-11-11 2022-04-01 赛福解码(北京)基因科技有限公司 Method for improving detection effect of large CNV in small PANEL data
CN112365922B (en) * 2021-01-13 2021-06-15 臻和(北京)生物科技有限公司 Microsatellite locus for detecting MSI, screening method and application thereof
CN113488105B (en) * 2021-09-08 2022-01-18 臻和(北京)生物科技有限公司 Microsatellite locus based on amplicon next-generation sequencing MSI detection, screening method and application thereof
CN116705157B (en) * 2022-03-28 2024-01-30 北京吉因加医学检验实验室有限公司 Method and device for detecting microsatellite state of plasma sample based on second-generation sequencing

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107058551A (en) * 2017-05-04 2017-08-18 北京诺禾致源科技股份有限公司 Detect the instable method and device of microsatellite locus
CN107526944A (en) * 2017-09-06 2017-12-29 南京世和基因生物技术有限公司 Sequencing data analysis method, device and the computer-readable medium of a kind of microsatellite instability
CN109207594A (en) * 2018-09-29 2019-01-15 广州燃石医学检验所有限公司 A method of microsatellite stable state and genome variation are detected by blood plasma based on the sequencing of two generations
CN110343763A (en) * 2019-06-11 2019-10-18 浙江中创生物医药有限公司 A kind of mankind's microsatellite instability state MSI detection site and detection method and application

Also Published As

Publication number Publication date
CN111627501A (en) 2020-09-04


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 1310-1318, block C, Xidong chuangfong building, 88 Danshan Road, anzhen street, Xishan District, Wuxi City, Jiangsu Province, 214500

Applicant after: Wuxi Zhenhe Biotechnology Co.,Ltd.

Address before: 214500 Room 401, block D, Xidong chuangfong building, anzhen street, Xishan District, Wuxi City, Jiangsu Province

Applicant before: Wuxi Zhenhe Biotechnology Co.,Ltd.

CB02 Change of applicant information

Address after: 1310-1318, block C, Xidong chuangfong building, 88 Danshan Road, anzhen street, Xishan District, Wuxi City, Jiangsu Province, 214500

Applicant after: Wuxi Zhenhe Biotechnology Co.,Ltd.

Address before: 1310-1318, block C, Xidong chuangfong building, 88 Danshan Road, anzhen street, Xishan District, Wuxi City, Jiangsu Province, 214500

Applicant before: Wuxi Zhenhe Biotechnology Co.,Ltd.

GR01 Patent grant