CN110232949B - Genome microsatellite wide-area length distribution estimation method considering tumor purity factor

Publication number
CN110232949B
Authority
CN
China
Prior art keywords
microsatellite
length
read
tumor
pair
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910385057.0A
Other languages
Chinese (zh)
Other versions
CN110232949A (en
Inventor
王嘉寅
王以瑄
张选平
闫新兴
冯旋
赵仲孟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University
Priority to CN201910385057.0A
Publication of CN110232949A
Application granted
Publication of CN110232949B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00 ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Genetics & Genomics (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Physiology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a genome microsatellite wide-area length distribution estimation method that considers the tumor purity factor. The method comprises: extracting data features; finding microsatellite candidate regions; screening for overlooked microsatellite candidate regions with a clustering algorithm; traversing and segmenting the reads of each region; estimating the tumor purity of a given sequencing sample; estimating the length distribution parameters of microsatellites in tumor tissue; representing the overall length distribution of a long microsatellite by its average length distribution; estimating the average microsatellite length from the coverage of a specified window containing the microsatellite, then iteratively re-estimating the coverage of that window from the updated average length, thereby completing detection of long microsatellites in a pure tumor sample; and determining the state of long microsatellites in the tumor to complete the wide-area length distribution estimation. The invention corrects the calculation bias caused by the tumor purity of the input sample, overcomes the limit that the sequencing read length places on the length of detectable genome microsatellites, and achieves wide-area length detection.

Description

Genome microsatellite wide-area length distribution estimation method considering tumor purity factor
Technical Field
The invention belongs to the technical field of data science with precision medicine as its application background, and particularly relates to a genome microsatellite wide-area length distribution estimation method considering the tumor purity factor.
Background
Genomic microsatellites (MS) are DNA sequences consisting of repeats of a specific oligonucleotide unit, usually 1-6 nucleotides long. Their lengths vary, and this variation is usually described by a length distribution. Microsatellite instability (MSI) refers to a hypermutation pattern caused by defects in the DNA mismatch repair (MMR) system, and is characterized by widespread length diversity of microsatellite repeats and an increased frequency of single nucleotide variations (SNVs). When the length distribution of the same microsatellite differs significantly between tissue samples (such as a tumor tissue sample and a normal tissue sample), a microsatellite instability event is called; otherwise, a microsatellite stability (MSS) event is called. MSI status is one of the important indicators in precision tumor diagnosis and treatment; it is particularly suited to the diagnosis and typing of digestive system and urinary system cancers, and has broad clinical significance in other common cancers. MSI has been studied extensively in modern oncology, clinical medicine and pharmacy, and it is widely reported that MSI can be used not only for cancer diagnosis but also as an important clinical indicator for medication decisions and patient prognosis. The efficacy of the currently widely used tumor immunotherapies is closely related to MSI-positive status. Given this clinical value, the detection of genome microsatellites is of great importance.
As next-generation sequencing (NGS) becomes increasingly widespread, detecting MSI with data mining models and algorithms based on NGS data has become the mainstream approach and has largely replaced traditional PCR-based fragment analysis. Current NGS-based MSI detection algorithms fall roughly into two categories: algorithms based on read-count distributions and algorithms based on tumor mutational burden. Algorithms of the first category work on paired normal-tumor sequencing data: for each microsatellite, they count the reads carrying microsatellites of each length, compute the frequency distribution of those reads, reconstruct the length distribution, and then judge the stability of the microsatellite with a statistical test. Existing detection algorithms improve specificity and sensitivity mainly by adapting the detection method and decision criteria to characteristics such as the sequencing data type and the cancer type. Such algorithms are already used clinically. Algorithms of the second category also work on paired normal-tumor sequencing data, but collect different data features to build a machine learning model and determine the MSI state from indicators such as tumor mutational burden. Such algorithms are not yet mature.
However, existing algorithms have two kinds of defects. First, they can only detect MS length distributions for microsatellites shorter than the read length (typically 100 bp, where bp, base pairs, is a unit of genome length). For microsatellites longer than a sequencing read, existing detection algorithms cannot locate the microsatellite by the anchoring method or the split-read method for paired-end reads, and therefore cannot determine its length distribution. Second, existing algorithms all carry an implicit model assumption that the tumor purity of the input sequencing data is 100% or close to 100%. Tumor purity is the proportion of tumor cells in a given sample. This assumption is unrealistic: although tumor purity varies between cancers, the principles of modern oncology and of current sequencing technology make it impossible to obtain a sequencing sample with 100% tumor purity, so existing algorithms produce errors due to insufficient tumor purity. A microsatellite in a mixed normal-tumor sample may have different length distributions in the different tissues, and the distribution observed in the sequencing data is actually a convolution of the distribution in normal cells with the distribution in tumor cells. Specifically, as shown in fig. 1, the left and right dotted normal curves correspond to the microsatellite length distributions in normal tissue and tumor tissue, respectively; the true microsatellite length distribution in the normal-tumor mixed sample is the bimodal solid line marked with circles; and the middle dotted line marked with diamonds is the convolution distribution obtained by direct fitting without considering tumor purity, which differs greatly from the true bimodal distribution.
Stable microsatellite length data from normal cells dilute the unstable microsatellite length signal from tumor cells, mislead statistical tests into reporting microsatellite stability (MSS) events, and ultimately introduce false negatives. If an algorithm ignores the tumor purity of the input sample, the computed microsatellite length distribution and state are inaccurate. Existing algorithms therefore cannot overcome the limits that read length and sample tumor purity place on MSI detection, which causes large errors, and their clinical performance urgently needs improvement.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide a genome microsatellite wide-area length distribution estimation method considering the tumor purity factor, which corrects the calculation bias caused by the tumor purity of the input sample, overcomes the limit that the sequencing read length places on the length of detectable genome microsatellites, and achieves wide-area length detection.
The invention adopts the following technical scheme:
The genome microsatellite wide-area length distribution estimation method considering the tumor purity factor comprises the following steps:
S1, defining data features and collecting read statistics to complete data feature extraction;
S2, scanning a given reference genome sequence to find microsatellite candidate regions, recording microsatellites whose repeat unit is at most 6 bp long, and storing the position and associated sequence of each microsatellite; further screening for overlooked microsatellite candidate regions with a clustering algorithm; once the number of microsatellites is determined, traversing and segmenting the reads of each candidate microsatellite region with a k-mer-based algorithm, and identifying the microsatellite repeat unit and breakpoint;
S3, estimating the tumor purity of the given sequencing sample from the read counts at selected SNVs;
S4, detecting short microsatellites in the mixed sample, and estimating the length distribution parameters of the microsatellites in tumor tissue by maximum likelihood estimation;
S5, representing the overall length distribution of a long microsatellite by its average length distribution; using an expectation-maximization algorithm, estimating the average microsatellite length from the coverage of a specified window containing the microsatellite, then re-estimating the coverage of the specified window from the updated average length, and iterating until convergence to complete detection of long microsatellites in a pure tumor sample;
S6, determining the state of long microsatellites in the tumor with an independent z-test to complete the wide-area length distribution estimation.
Specifically, in step S1, the data characteristics are:
MS-pair: a pair of reads in which one read is perfectly aligned and the other crosses the breakpoint;
SB-read: the read in an MS-pair that crosses the breakpoint;
PSset: a set of 2-tuples (POS, SEQ), each consisting of the start position and sequence of an SB-read;
sk-mer: a sequence consisting of the first k bases.
Specifically, step S2 specifically includes:
S201, reading the alignment results in PSset, and assigning two SB-reads to the same cluster when the distance between their start positions in PSset is less than 50 kbps;
S202, treating each cluster as one candidate microsatellite region, thereby identifying the number of microsatellites in the input sample;
S203, for each microsatellite candidate region, taking a right-end read that is an SB-read aligned into the region, selecting the k bases starting at the first base of that read as the initial k-mer, and recording the first base; then sliding forward one base at a time, selecting a new k-mer at each position, recording its first base, and checking whether the new k-mer matches the initial k-mer;
S204, when the two k-mers match, taking the recorded base sequence as a candidate repeat unit and the position of its first base as a candidate breakpoint of the microsatellite;
S205, performing the same operation on all reads of the other candidate microsatellite regions.
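The clustering rule in S201 and S202 can be sketched as a single pass over SB-reads sorted by start position. This is only an illustration under stated assumptions: the function name `cluster_sb_reads` is hypothetical, all reads are assumed to lie on one chromosome, and "distance less than 50 kbps" is interpreted as the gap to the nearest already-clustered neighbour (single linkage).

```python
from typing import List, Tuple

PSEntry = Tuple[int, str]  # one (POS, SEQ) tuple from PSset


def cluster_sb_reads(psset: List[PSEntry], max_gap: int = 50_000) -> List[List[PSEntry]]:
    """Group SB-reads whose start positions are closer than max_gap (assumed single linkage)."""
    clusters: List[List[PSEntry]] = []
    for entry in sorted(psset, key=lambda e: e[0]):
        # Extend the current cluster while the gap to its last read stays below max_gap.
        if clusters and entry[0] - clusters[-1][-1][0] < max_gap:
            clusters[-1].append(entry)
        else:
            clusters.append([entry])
    return clusters


# Hypothetical SB-reads: two near position 1 kb / 30 kb, two near 120 kb.
example = [(1_000, "ACAC"), (120_000, "GTGT"), (30_000, "ACAC"), (125_500, "GTGT")]
clusters = cluster_sb_reads(example)
num_microsatellites = len(clusters)  # one candidate region per cluster (S202)
```

Sorting first keeps the pass linear after the sort; whether the original method uses this linkage rule or a different one is not stated in the text.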
Specifically, step S4 specifically includes:
S401, given a mixed normal-tumor sequencing sample in which the proportion of normal cells is (1-c)% and the proportion of tumor cells is c%, and a short microsatellite region in that sample, counting the supporting reads for every observed length of the microsatellite;
S402, from the numbers of supporting reads and the lengths they support, obtaining the set of length values of the microsatellite region, $L = \{l_1, l_2, \ldots, l_N\}$, where $L$ is a length data set drawn at random from two mutually independent samples that follow different normal distributions: each value in $L$ is normal-tissue microsatellite length data with probability (1-c)% and tumor-tissue microsatellite length data with probability c%;
S403, given a short microsatellite region, letting its length follow the normal distribution $N_1(\mu_1, \sigma_1^2)$ when it belongs to normal tissue and the normal distribution $N_2(\mu_2, \sigma_2^2)$ when it belongs to tumor tissue; the length of the mixed microsatellite then follows the density function $f = (1-c) f_1 + c f_2$ (with $c$ expressed as a fraction), where $f_1$ and $f_2$ are the density functions of $N_1$ and $N_2$, respectively, and the values of $\mu_1$ and $\sigma_1$ are obtained by examining the normal sample separately;
S404, judging the stability of the short microsatellite with a z-test based on the estimated length distribution parameters of the tumor-sample microsatellite and the length data of the normal-sample microsatellite.
Further, in step S403, $\mu_2$ and $\sigma_2$ are obtained by maximum likelihood estimation, where the likelihood function is the joint probability density function of the microsatellite lengths:
Figure BDA0002054567140000051
where $L = \{l_1, l_2, \ldots, l_N\}$ is the length data set of the mixed microsatellite and $N$ is the size of the length set;
maximizing the joint probability of the microsatellite lengths yields the values to be estimated:
Figure BDA0002054567140000052
where $\mu_2$ and $\sigma_2$ are the length distribution parameters of the tumor-tissue microsatellite.
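The estimation in S403 can be illustrated numerically. The two equation images are not reproduced in this text, so the sketch below assumes the likelihood is the product over all $l_i$ of $(1-c) f_1(l_i) + c f_2(l_i)$, with $c$, $\mu_1$ and $\sigma_1$ known. The text only says "maximum likelihood estimation"; the EM-style optimizer used here is a choice made for this example, and `estimate_tumor_params` is a hypothetical name.

```python
import math
import random
from typing import List, Tuple


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def estimate_tumor_params(lengths: List[float], c: float, mu1: float, sigma1: float,
                          iters: int = 200, tol: float = 1e-8) -> Tuple[float, float]:
    """Maximize prod_i [(1-c) f1(l_i) + c f2(l_i)] over (mu2, sigma2); c, mu1, sigma1 stay fixed."""
    n = len(lengths)
    mu2 = sum(lengths) / n  # crude starting point: moments of the mixed data
    sigma2 = math.sqrt(sum((x - mu2) ** 2 for x in lengths) / n) or 1.0
    for _ in range(iters):
        # E-step: probability that each observed length came from the tumor component.
        resp = []
        for x in lengths:
            tumor = c * normal_pdf(x, mu2, sigma2)
            total = (1.0 - c) * normal_pdf(x, mu1, sigma1) + tumor
            resp.append(tumor / total if total > 0.0 else 0.0)
        weight = sum(resp)
        if weight == 0.0:
            break
        # M-step: weighted mean and standard deviation of the tumor component only.
        new_mu2 = sum(r * x for r, x in zip(resp, lengths)) / weight
        var2 = sum(r * (x - new_mu2) ** 2 for r, x in zip(resp, lengths)) / weight
        new_sigma2 = max(math.sqrt(var2), 1e-6)
        converged = abs(new_mu2 - mu2) < tol and abs(new_sigma2 - sigma2) < tol
        mu2, sigma2 = new_mu2, new_sigma2
        if converged:
            break
    return mu2, sigma2


# Simulated mixed sample (illustrative values): purity c = 0.6, normal N(20, 1^2), tumor N(30, 2^2).
random.seed(0)
data = [random.gauss(30.0, 2.0) if random.random() < 0.6 else random.gauss(20.0, 1.0)
        for _ in range(2000)]
mu2_hat, sigma2_hat = estimate_tumor_params(data, c=0.6, mu1=20.0, sigma1=1.0)
```

Holding the normal-tissue component and the mixing weight fixed is what distinguishes this from fitting a generic two-component mixture: only the tumor parameters are free, which is how the purity estimate from S3 enters the calculation.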
Specifically, step S5 specifically includes:
S501, collecting read information:
WIN-bk: a window on the reference genome sequence whose midpoint is the breakpoint of a microsatellite;
C-pair: a read pair in which both reads align within the WIN-bk region;
T-pair: a read pair in which both reads align to the microsatellite region;
O-pair: a read pair in which one read aligns perfectly to the WIN-bk region and the other aligns to the microsatellite region;
SO-pair: a read pair in which one read aligns to the microsatellite region and the other spans the breakpoint;
S-pair: a read pair in which one read aligns perfectly to the WIN-bk region and the other spans the breakpoint;
S-read: a read that spans the breakpoint in an SO-pair or S-pair;
S502, letting $m$ be the total number of microsatellites, $p$ the index of the current microsatellite, $S$ the sampling count, $L_{Win}$ the length of WIN-bk, $L_{aln}$ the total number of bases belonging to the microsatellite region in all S-reads, and $L_{set}$ the set of microsatellite lengths, and initializing these variables;
S503, calculating the average coverage of WIN-bk from the read-pair statistics;
S504, assuming coverage is uniformly distributed, so that the coverage calculated in step S503 equals the coverage of the microsatellite region, and updating the microsatellite length accordingly;
S505, if $|L'' - L'| > \delta$,
Figure BDA0002054567140000061
letting $L' = L''$ and repeating step S503;
S506, adding the obtained microsatellite length to the set: $L_{set} = L_{set} \cup \{L''\}$;
S507, to estimate the normal distribution parameters of a given microsatellite, sampling at least 30 times with different values of $L_{Win}$: setting $S = S + 1$ and, if $S < 30$, setting $L_{Win} = L_{Win} + 1000$ and returning to step S502;
S508, applying a normality test algorithm and the Shapiro-Wilk test to the microsatellite lengths obtained from the 30 sampling runs, and outputting the microsatellite length distribution parameters $N(\mu, \sigma^2)$;
S509, if $p < m$, setting $p = p + 1$ and returning to step S502;
S510, comparing the microsatellite states of the tumor cells and the normal cells with an independent z-test; if the p-value is less than 0.05, the microsatellite is judged unstable (MSI); otherwise it is judged a microsatellite stability (MSS) event.
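The decision in S510 can be sketched as a two-sample z-test on the estimated length distributions. The text does not give the exact test statistic, so the form below (difference of means over the pooled standard error, two-sided p-value) is an assumption, as are the function name `msi_z_test` and the use of 30 observations per sample to mirror the 30 sampling runs of S507.

```python
import math
from typing import Tuple


def msi_z_test(mu_t: float, sigma_t: float, n_t: int,
               mu_n: float, sigma_n: float, n_n: int,
               alpha: float = 0.05) -> Tuple[float, float, str]:
    """Two-sample z-test on mean microsatellite length (assumed form of the independent z-test)."""
    standard_error = math.sqrt(sigma_t ** 2 / n_t + sigma_n ** 2 / n_n)
    z = (mu_t - mu_n) / standard_error
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value, ("MSI" if p_value < alpha else "MSS")


# Illustrative numbers only: tumor N(260, 12^2) versus normal N(250, 10^2), 30 samplings each.
z_stat, p_val, status = msi_z_test(260.0, 12.0, 30, 250.0, 10.0, 30)
_, p_same, status_same = msi_z_test(250.0, 10.0, 30, 250.0, 10.0, 30)
```

With identical tumor and normal parameters the statistic is zero and the call is MSS, which is the expected behaviour for a stable microsatellite.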
Further, step S502 specifically includes:
S5021, initializing the number of microsatellites, the repeat units and the breakpoints through data preprocessing;
S5022, dividing the read pairs aligned to WIN-bk into 5 types: C-pair, T-pair, O-pair, S-pair and SO-pair;
S5023, counting the read pairs of each of the 5 types, $NUM_C$, $NUM_T$, $NUM_O$, $NUM_S$ and $NUM_{SO}$, for C-pair, T-pair, O-pair, S-pair and SO-pair respectively, and computing $L_{aln}$;
S5024, setting $m$ to the number of microsatellites, $p = 1$, $S = 1$, $L_{Win} = 5$ kbps, $L' = 0$ and $L_{set} = \varnothing$.
Further, in step S503, the average coverage of WIN-bk is calculated as follows:
Figure BDA0002054567140000071
$SUM_{bp} = 2 \times (NUM_C + NUM_T + NUM_O + NUM_S + NUM_{SO}) \times L_{read} + L_{aln}$
$L = L' + L_{Win}$
where $C$ is the coverage of WIN-bk, $SUM_{bp}$ is the total number of read bases aligned into WIN-bk, $L_{read}$ is the read length, and $L$ is the length of the microsatellite.
Further, in step S504, the updated microsatellite length is:
Figure BDA0002054567140000072
$SUM_{bp} = (2 \times NUM_T + NUM_O + NUM_{SO}) \times L_{read} + L_{aln}$
where $C$ is the coverage of WIN-bk, $SUM_{bp}$ is the total number of read bases aligned into the microsatellite region, $L_{read}$ is the read length, $NUM_T$ is the number of T-pairs, $NUM_O$ is the number of O-pairs, and $NUM_{SO}$ is the number of SO-pairs.
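The loop formed by S503 to S505 can be sketched as follows. The two equation images for the coverage and for the updated length are not reproduced in this text, so the quotient forms used here, $C = SUM_{bp} / (L' + L_{Win})$ and $L'' = SUM_{bp} / C$ with the microsatellite-region base count, are assumptions inferred from the surrounding definitions and are not confirmed by the source. The function name and the example counts are hypothetical.

```python
def estimate_ms_length(num_c: int, num_t: int, num_o: int, num_s: int, num_so: int,
                       l_read: int, l_aln: int, l_win: int = 5000,
                       delta: float = 1.0, max_iter: int = 1000) -> float:
    """Iterate coverage and length estimates until the length changes by at most delta."""
    # Bases aligned into WIN-bk (S503) and into the microsatellite region (S504), as given in the text.
    sum_win = 2 * (num_c + num_t + num_o + num_s + num_so) * l_read + l_aln
    sum_ms = (2 * num_t + num_o + num_so) * l_read + l_aln
    length = 0.0  # L' = 0, as initialized in S5024
    for _ in range(max_iter):
        coverage = sum_win / (length + l_win)  # ASSUMED form of the missing coverage equation
        updated = sum_ms / coverage            # ASSUMED form of the missing length update (L'')
        if abs(updated - length) <= delta:     # convergence check of S505
            return updated
        length = updated                       # let L' = L'' and repeat S503
    return length


# Hypothetical read-pair counts for one window with 100 bp reads.
ms_length = estimate_ms_length(num_c=400, num_t=20, num_o=30, num_s=25, num_so=15,
                               l_read=100, l_aln=2000)
```

Under these assumed forms the update is linear in the current length, so the loop settles quickly whenever fewer bases fall in the microsatellite region than in the whole window; the `max_iter` cap corresponds to the upper limit on the number of cycles mentioned in the description of convergence.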
Specifically, in step S6, $N_1$ is obtained by examining the normal sample separately. According to the identification procedure for long microsatellites in a pure tumor sample and the central limit theorem, the length of a long microsatellite in the mixed sample follows a normal distribution with $\mu = (1-c)\mu_1 + c\mu_2$ and $\sigma^2 = (1-c)\sigma_1^2 + c\sigma_2^2$, where the mixture parameters $\mu$ and $\sigma$ are obtained by applying the expectation-maximization algorithm to the mixed sample.
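Taking the two relations stated for step S6 at face value, the tumor-tissue parameters follow by rearranging them once $\mu$, $\sigma$, $\mu_1$, $\sigma_1$ and the purity $c$ are known. The rearrangement below is plain algebra on those stated relations, with $c$ treated as a fraction; it is a worked illustration rather than a formula given in the source.

```latex
\begin{align*}
\mu_2 &= \frac{\mu - (1-c)\,\mu_1}{c}, \\
\sigma_2^2 &= \frac{\sigma^2 - (1-c)\,\sigma_1^2}{c}, \qquad 0 < c \le 1 .
\end{align*}
```

For example, with $c = 0.5$, $\mu_1 = 250$ and a mixed-sample mean $\mu = 255$, the first relation gives $\mu_2 = (255 - 125) / 0.5 = 260$.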
Compared with the prior art, the invention has at least the following beneficial effects:
The invention provides a genome microsatellite wide-area length distribution estimation method considering the tumor purity factor. It corrects the calculation bias caused by the tumor purity of the input sample, overcomes the limit that the sequencing read length places on the length of detectable genome microsatellites, and achieves wide-area length detection. It uses existing purity estimation software to obtain an estimate of the tumor purity of the input sample and, based on that purity, accelerates the deconvolution that solves for the mixed microsatellite length distribution, so as to identify the length distribution and detect the state of both short and long microsatellites in a mixed sample.
Furthermore, to make the read statistics easy to pass to other upstream and downstream data processing software and to standardize the data format for processing, the data features are defined and the read statistics are collected from input data in BAM format.
Further, the reference genome is scanned to obtain initial microsatellite candidate regions, and clustering on the distance between the start alignment positions of breakpoint-crossing reads screens out all microsatellite candidate regions. A k-mer-based traversal algorithm segments the reads in each candidate region to identify the repeat unit and breakpoint of the microsatellite. The repeat unit is the minimal repeated sequence that forms the microsatellite and, according to biological knowledge, consists of 1-6 nucleotides; the breakpoint is the start position of the microsatellite.
Furthermore, the reads supporting each microsatellite length are counted and, starting from this preliminary detection result, maximum likelihood estimation (MLE) is applied to the length data set of the microsatellite detected in the mixed sample, the microsatellite length distribution in the normal sample, and the estimated tumor purity, to calculate the length distribution parameters of short microsatellites (0-200 bp) in the tumor sample.
Further, the MSI state is judged with a statistical test: the stability of a short microsatellite is tested using the estimated length distribution parameters of the tumor-sample microsatellite and the length data of the normal-sample microsatellite.
Further, the expectation-maximization (EM) algorithm is used with a continuous estimation strategy: the average length of the microsatellite is estimated from the coverage of a specified window containing the microsatellite, and the coverage of the specified window is then updated using the new average length. The loop iterates until convergence and yields the length distribution of long microsatellites (microsatellites longer than 200 bp). Convergence here means that the average microsatellite length no longer changes significantly or that the upper limit on the number of iterations is reached.
Further, using the central limit theorem, the length distribution of a long microsatellite in the tumor sample is calculated from the estimated tumor purity and the microsatellite length distribution of the normal sample, and the MSI state is judged with a statistical test: the stability of the long microsatellite is tested using the estimated length distribution of the tumor-sample microsatellite and the length distribution parameters of the normal-sample microsatellite.
In summary, the invention determines the microsatellite repeat unit, breakpoint and candidate region; estimates the length distribution of short microsatellites in a mixed sample and judges their stability more accurately; and estimates the length distribution and identifies the stability state of long microsatellites in both pure tumor samples and mixed samples. The method thereby corrects the calculation bias caused by the tumor purity of the input sample, overcomes the limit that the sequencing read length places on the length of detectable genome microsatellites, and achieves wide-area length detection.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
FIG. 1 is a simulation of the length distribution of microsatellites in a mixed sample;
FIG. 2 is a schematic representation of microsatellites in a mixed sample, wherein (a) is a normal-tumor sequencing sample and (b) is a mixed normal-tumor sample;
FIG. 3 is a schematic diagram of coverage variation in the event of microsatellite instability and the definition of different read pairs;
FIG. 4 is a flow chart of the present invention.
Detailed Description
The invention provides ELMSI (Estimation of Long Micro-satellite), a method for estimating the length distribution and state of genome microsatellites based on tumor purity. The input is data from a normal sample and a matched tumor sample. Using the tumor purity estimated by purity estimation software, the method accelerates the deconvolution that solves for the mixed microsatellite length distribution; it accurately identifies the length distribution and state of short microsatellites in the mixed sample by combining microsatellite read counts with maximum likelihood estimation; and it infers the length distribution and state of long microsatellites by combining the expectation-maximization algorithm with the central limit theorem. It thereby overcomes the limits that sample tumor purity and sequencing read length place on MSI detection.
The present invention is based on the following assumptions, which are generally accepted in academia:
1. When the patient is MSI positive, the microsatellites in the patient's normal cells and tumor cells follow two different length distributions.
2. According to the genomic microsatellite evolution model in modern oncology, the length distribution of a microsatellite is approximately normal.
3. The total length of the microsatellite to be detected is less than 50 kbps, and its repeat unit consists of no more than 6 nucleotides (6 bp); bp and kbps are both units of genome length, with 1 kbps equal to 1000 bp.
Referring to fig. 4, a method for estimating the wide-area distribution of genomic microsatellite length considering the tumor purity factor according to the present invention includes the following steps:
S1, data feature extraction: to make the read statistics easy to pass to other upstream and downstream data processing software and to standardize the data format, data features are defined and read statistics are collected;
The data features are as follows:
MS-pair: a pair of reads in which one read is perfectly aligned and the other crosses the breakpoint;
SB-read: the read in an MS-pair that crosses the breakpoint;
PSset: a set of 2-tuples (POS, SEQ), each consisting of the start position and sequence of an SB-read;
sk-mer: a sequence consisting of the first k bases.
The input data are whole exome sequencing (WES) data in Binary Alignment/Map (BAM) format.
S2, data preprocessing
A given reference genome sequence is scanned to find microsatellite candidate regions; microsatellites whose repeat unit is at most 6 bp long are recorded, and the position and associated sequence of each are stored. A clustering algorithm then screens for candidate regions that may have been overlooked; specifically, it clusters on the distance between the start alignment positions of the breakpoint-crossing reads. Once the number of microsatellites is determined, the reads of each candidate microsatellite region are traversed and segmented with a k-mer-based algorithm to identify the microsatellite repeat unit and breakpoint, as follows:
S201, reading the alignment results in PSset, and assigning two SB-reads to the same cluster when the distance between their start positions in PSset is less than 50 kbps;
S202, treating each cluster as one candidate microsatellite region, thereby identifying the number of microsatellites in the input sample;
S203, for each microsatellite candidate region, taking a right-end read that is an SB-read aligned into the region, selecting the k bases starting at the first base of that read as the initial k-mer (k = 6 by default), and recording the first base; then sliding forward one base at a time, selecting a new k-mer at each position, recording its first base, and checking whether the new k-mer matches the initial k-mer;
S204, when the two k-mers match, taking the recorded base sequence as a candidate repeat unit and the position of its first base as a candidate breakpoint of the microsatellite;
S205, performing the same operation on all reads of the other candidate microsatellite regions.
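The k-mer traversal of S203 and S204 can be sketched for a single read as below. This is a minimal illustration under stated assumptions: the read is assumed to start inside the repeat, the search stops after 6 shifts because the repeat unit is at most 6 bp, and the function name `find_repeat_unit` and the example sequences are hypothetical.

```python
from typing import Optional, Tuple


def find_repeat_unit(seq: str, pos: int, k: int = 6,
                     max_unit: int = 6) -> Optional[Tuple[str, int]]:
    """Return (candidate repeat unit, candidate breakpoint) for one read, or None if no match."""
    initial = seq[:k]
    if len(initial) < k:
        return None
    recorded = [seq[0]]  # first base of the initial k-mer
    for shift in range(1, max_unit + 1):
        window = seq[shift:shift + k]
        if len(window) < k:
            break
        if window == initial:
            # The bases recorded so far form the candidate unit; its first base marks the breakpoint.
            return "".join(recorded), pos
        recorded.append(seq[shift])  # record the first base of the new k-mer
    return None


dinucleotide = find_repeat_unit("ACACACACACGT", pos=500)
homopolymer = find_repeat_unit("AAAAAAAAGG", pos=42)
no_repeat = find_repeat_unit("ACGTTGCAAGCT", pos=0)
```

The shift at which the sliding k-mer first equals the initial k-mer is the period of the repeat, which is why the recorded first bases spell out the repeat unit.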
S3, estimation of tumor purity
Purity is calculated with any commonly used tumor purity estimation software, such as EMpurity. Such methods generally estimate the tumor purity of a given sequenced sample from the read counts at selected SNV positions;
S4, detection of short microsatellites in mixed samples
Because many existing algorithms can already detect short microsatellites accurately in pure tumor samples, the invention further provides a method for detecting short microsatellites in mixed samples, which specifically comprises the following steps:
S401, referring to fig. 2(a), given a mixed normal-tumor sequencing sample in which, according to S3, the proportion of normal cells is (1-c)% and the proportion of tumor cells is c%, and given a short microsatellite region in the mixed sample, the number of reads supporting each distinct length of the microsatellite is counted;
S402, from the numbers of supporting reads and the lengths they support, the set of length values of the microsatellite region, L = {l_1, l_2, ..., l_N}, is obtained. L is in effect a set of length data drawn at random from two mutually independent samples that follow different normal distributions. By the law of large numbers, each datum in L has a probability of (1-c)% of being a normal-tissue microsatellite length and a probability of c% of being a tumor-tissue microsatellite length;
S403, given a short microsatellite region, when it belongs to normal tissue its length follows the normal distribution N_1(μ_1, σ_1^2); when it belongs to tumor tissue its length follows the normal distribution N_2(μ_2, σ_2^2). The microsatellite length in the mixed sample then follows the density function f = (1-c)f_1 + cf_2, where f_1 and f_2 are the density functions of N_1 and N_2, respectively. The values of μ_1 and σ_1 can be obtained by testing a normal sample alone;
Under the above known conditions, estimates of μ_2 and σ_2 can be obtained by maximum likelihood estimation (MLE). The likelihood function is the joint probability density function of the microsatellite lengths:
Figure BDA0002054567140000121
where L = {l_1, l_2, ..., l_N} is the length data set of the microsatellite in the mixed sample and N is the size of the length set. Maximizing this joint probability yields the values to be estimated:
Figure BDA0002054567140000122
where μ_2 and σ_2 are the length distribution parameters of the tumor-tissue microsatellite. The length distribution of short microsatellites can therefore be identified from a given mixed sample.
S404, based on the estimated length distribution parameters of the tumor-sample microsatellite and the length data of the normal-sample microsatellite, the stability of the short microsatellite is determined by a z-test.
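The maximization in S403 can be illustrated with a small numerical sketch. The text does not say which optimizer is used, so this example swaps in EM-style updates with the normal component N_1 and the purity c held fixed; each update does not decrease the mixture likelihood built from f = (1-c)f_1 + cf_2. The function names and the synthetic data are hypothetical, and c is treated as a fraction between 0 and 1.

```python
import math
import random
from typing import List, Tuple


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def estimate_tumor_params(lengths: List[float], c: float, mu1: float, sigma1: float,
                          n_iter: int = 200, tol: float = 1e-8) -> Tuple[float, float]:
    """Estimate (mu2, sigma2) of the tumor component with c, mu1, sigma1 known."""
    n = len(lengths)
    mu2 = sum(lengths) / n  # crude starting point: pooled mean and spread
    sigma2 = max(math.sqrt(sum((x - mu2) ** 2 for x in lengths) / n), 1e-3)
    for _ in range(n_iter):
        # Posterior probability that each length came from the tumor component.
        resp = []
        for x in lengths:
            a = (1 - c) * normal_pdf(x, mu1, sigma1)
            b = c * normal_pdf(x, mu2, sigma2)
            resp.append(b / (a + b) if a + b > 0 else 0.5)
        total = sum(resp)
        if total == 0:
            break
        # Weighted mean and standard deviation maximize the expected log-likelihood.
        new_mu2 = sum(r * x for r, x in zip(resp, lengths)) / total
        new_var = sum(r * (x - new_mu2) ** 2 for r, x in zip(resp, lengths)) / total
        new_sigma2 = max(math.sqrt(new_var), 1e-3)
        done = abs(new_mu2 - mu2) < tol and abs(new_sigma2 - sigma2) < tol
        mu2, sigma2 = new_mu2, new_sigma2
        if done:
            break
    return mu2, sigma2


def simulate_lengths(n: int, c: float, mu1: float, sigma1: float,
                     mu2: float, sigma2: float, seed: int = 0) -> List[float]:
    """Synthetic mixed-sample lengths (illustration only, not patent data)."""
    rng = random.Random(seed)
    return [rng.gauss(mu2, sigma2) if rng.random() < c else rng.gauss(mu1, sigma1)
            for _ in range(n)]


lengths = simulate_lengths(2000, 0.4, 20.0, 1.0, 14.0, 1.5)
mu2_hat, sigma2_hat = estimate_tumor_params(lengths, 0.4, 20.0, 1.0)
```

With well-separated components, as in this synthetic example, the estimates should land close to the simulated tumor parameters; with heavily overlapping components the result depends more strongly on the starting point.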
S5, detection of long microsatellites in pure tumor samples
Specifically, first, by the central limit theorem, whatever the distribution of a population, the sample means drawn from it are approximately normally distributed around the population mean; second, because the prior art cannot obtain the specific length of a long microsatellite, the invention uses the distribution of the average length of the long microsatellite to reflect its overall length distribution. The invention adopts the expectation-maximization (EM) algorithm, an iterative estimation strategy: the average length of the microsatellite is estimated from the coverage of a given window containing it, and the coverage of the window is then re-estimated with the updated average length. The iteration continues until convergence, meaning that the average microsatellite length no longer changes significantly or the maximum number of iterations is reached. The specific EM process is as follows:
S501, with reference to fig. 3, read information is collected according to the following definitions:
WIN-bk: a window on the reference genome sequence with the breakpoint of a microsatellite as its midpoint; the default length is 5 kbp;
C-pair: paired reads aligned within the WIN-bk region;
T-pair: paired reads in which both reads align to the microsatellite region;
O-pair: paired reads in which one read aligns perfectly to the WIN-bk region and the other aligns to the microsatellite region;
SO-pair: paired reads in which one read aligns to the microsatellite region and the other spans the breakpoint;
S-pair: paired reads in which one read aligns perfectly to the WIN-bk region and the other spans the breakpoint;
S-read: the breakpoint-spanning reads in SO-pairs and S-pairs.
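The five pair classes above can be expressed as a lookup on where each mate aligns. This is a minimal sketch; the `Mate` categories and the function name are hypothetical labels for the three alignment situations the definitions mention, and how a real aligner output maps onto them is left open.

```python
from enum import Enum
from typing import Optional


class Mate(Enum):
    WIN = "aligned to the WIN-bk region outside the microsatellite"
    MS = "aligned to the microsatellite region"
    SPAN = "spans the breakpoint"


# Unordered combination of the two mates -> pair class from S501.
PAIR_CLASSES = {
    frozenset((Mate.WIN,)): "C-pair",
    frozenset((Mate.MS,)): "T-pair",
    frozenset((Mate.WIN, Mate.MS)): "O-pair",
    frozenset((Mate.MS, Mate.SPAN)): "SO-pair",
    frozenset((Mate.WIN, Mate.SPAN)): "S-pair",
}


def classify_pair(first: Mate, second: Mate) -> Optional[str]:
    """Return the pair class, or None for combinations the text does not define."""
    return PAIR_CLASSES.get(frozenset((first, second)))
```

A pair in which both mates span the breakpoint is not covered by the definitions, so the sketch returns `None` for it rather than guessing a class.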
S502, initializing variables:
let m be the total number of microsatellites, p the index of the current (p-th) microsatellite, S the number of samplings, L_Win the length of WIN-bk, L_aln the total number of bases in all S-reads that belong to the microsatellite region, and L_set the set of microsatellite lengths;
S5021, initializing the number, repeat units, and breakpoints of the microsatellites through data preprocessing;
S5022, dividing the paired reads aligned to WIN-bk into 5 classes: C-pair, T-pair, O-pair, S-pair and SO-pair;
S5023, counting the paired reads in each class, where NUM_C, NUM_T, NUM_O, NUM_S and NUM_SO denote the numbers of C-pairs, T-pairs, O-pairs, S-pairs and SO-pairs, respectively, and computing L_aln;
S5024, setting m to the number of microsatellites, p = 1, S = 1, L_Win = 5 kbp, L' = 0, and L_set = ∅.
S503, calculating the average coverage of WIN-bk from the paired-read counts, specifically:
Figure BDA0002054567140000131
SUM_bp = 2 × (NUM_C + NUM_T + NUM_O + NUM_S + NUM_SO) × L_read + L_aln
L = L' + L_Win
S504, assuming that coverage is uniformly distributed, the coverage calculated in step S503 equals the coverage of the microsatellite region, and the microsatellite length is updated accordingly;
the updated microsatellite length is:
Figure BDA0002054567140000141
SUM_bp = (2 × NUM_T + NUM_O + NUM_SO) × L_read + L_aln
S505, if |L'' - L'| > δ,
Figure BDA0002054567140000142
let L' = L'' and repeat step S503;
S506, adding the obtained microsatellite length to the set: L_set = L_set ∪ {L''};
S507, to estimate the normal distribution parameters of the given microsatellite, L_Win is varied and the sampling is performed at least 30 times: set S = S + 1; if S < 30, let L_Win = L_Win + 1000 and go to step S502;
S508, the microsatellite length data obtained from the 30 sampling experiments are statistically tested with a normality test algorithm and the Shapiro-Wilk algorithm, and the microsatellite length distribution parameters N(μ, σ^2) are output;
S509, if p < m, let p = p + 1 and return to step S502;
S510, an independent z-test is used to compare the microsatellite status of the tumor cells and the normal cells. If the p-value is less than 0.05, the microsatellite is judged unstable (MSI); otherwise it is judged a microsatellite stable (MSS) event.
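The iteration in S503 to S505 can be sketched as below for a single window. The two equation images are not reproduced in this text, so the sketch assumes the forms suggested by the variable definitions: coverage C = SUM_bp / L with L = L' + L_Win, and updated length L'' = SUM_bp / C with the microsatellite SUM_bp. The function name, the dictionary keys and the example counts are hypothetical.

```python
from typing import Dict


def estimate_ms_length(counts: Dict[str, int], l_read: int, l_aln: int, l_win: int,
                       delta: float = 0.01, max_iter: int = 1000) -> float:
    """Iterate the window-coverage and length updates for one WIN-bk."""
    # Bases aligned into the window (all five pair classes, S503).
    sum_win = 2 * (counts["C"] + counts["T"] + counts["O"]
                   + counts["S"] + counts["SO"]) * l_read + l_aln
    # Bases aligned into the microsatellite region (S504).
    sum_ms = (2 * counts["T"] + counts["O"] + counts["SO"]) * l_read + l_aln

    length = 0.0  # L' = 0, as initialized in S5024
    for _ in range(max_iter):
        coverage = sum_win / (length + l_win)  # assumed form of the S503 image
        updated = sum_ms / coverage            # assumed form of the S504 image
        if abs(updated - length) <= delta:     # S505 stopping rule
            return updated
        length = updated
    return length


example_counts = {"C": 400, "T": 50, "O": 30, "S": 10, "SO": 10}
estimate = estimate_ms_length(example_counts, l_read=100, l_aln=1000, l_win=5000)
```

Under these assumed formulas each pass multiplies (L' + L_Win) by the fixed ratio r = SUM_bp(microsatellite) / SUM_bp(window), so the iteration settles at r × L_Win / (1 - r) whenever r < 1. Repeating the call with L_Win increased by 1000 per sampling, as in S507, would require recounting the pairs for each window.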
S6, detection of long microsatellites in mixed samples.
Referring to FIG. 2(b), given a mixed normal-tumor sample, let the proportion of tumor cells be c% for ease of calculation. As before, the microsatellite lengths belonging to the (1-c)% of normal tissue follow the normal distribution N_1(μ_1, σ_1^2), while the microsatellite lengths belonging to the c% of tumor tissue follow the normal distribution N_2(μ_2, σ_2^2). The parameters of N_1 can be obtained by separately testing a normal sample (e.g., a human blood sample).
According to the identification steps for long microsatellites in pure tumor samples and the central limit theorem, the length of a long microsatellite in the mixed sample follows a normal distribution with μ = (1-c)μ_1 + cμ_2 and σ^2 = (1-c)σ_1^2 + cσ_2^2. The mixture parameters μ and σ can be estimated by applying the above EM algorithm to the mixed sample. From these known quantities, μ_2 and σ_2 can be calculated. Independent z-tests are again used to determine the microsatellite status of long tumor microsatellites.
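Solving the two mixture relations for the tumor parameters, followed by a z-test, can be sketched as follows. The rearrangement follows directly from μ = (1-c)μ_1 + cμ_2 and σ^2 = (1-c)σ_1^2 + cσ_2^2 with c taken as a fraction. The exact form of the z-test is not given in the text, so the two-sample statistic and the sample sizes used here are assumptions, and the function names are hypothetical.

```python
import math
from statistics import NormalDist
from typing import Tuple


def tumor_params_from_mixture(mu: float, var: float, mu1: float, var1: float,
                              c: float) -> Tuple[float, float]:
    """Solve the mixture relations for (mu2, sigma2^2) given tumor fraction c."""
    if not 0 < c <= 1:
        raise ValueError("c must be a fraction in (0, 1]")
    mu2 = (mu - (1 - c) * mu1) / c
    var2 = (var - (1 - c) * var1) / c
    if var2 < 0:
        raise ValueError("inputs imply a negative tumor variance")
    return mu2, var2


def z_test_p_value(mu_a: float, var_a: float, n_a: int,
                   mu_b: float, var_b: float, n_b: int) -> float:
    """Two-sided p-value of a two-sample z-test on the means (assumed form)."""
    z = (mu_a - mu_b) / math.sqrt(var_a / n_a + var_b / n_b)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical numbers: normal N(100, 4), mixture N(110, 6.5), purity 0.5.
mu2, var2 = tumor_params_from_mixture(mu=110.0, var=6.5, mu1=100.0, var1=4.0, c=0.5)
p_value = z_test_p_value(mu2, var2, 30, 100.0, 4.0, 30)
is_msi = p_value < 0.05
```

The sample size of 30 mirrors the 30 samplings of S507; whether the method uses that value in the test is not stated.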
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
To further verify the performance of the invention in identifying long microsatellites while ignoring the effect of tumor purity, experiments were conducted on a series of simulated data sets with different configurations, varying the number of microsatellites, the coverage, and the read length.
A correct microsatellite identification is defined as follows: the repeat unit of the microsatellite is correctly identified; the detected breakpoint lies in (b - 10 bp, b + 10 bp), where b is the true breakpoint set in the simulation; and the actual microsatellite length lies in (μ - 3σ, μ + 3σ), where (μ, σ^2) are the estimated normal distribution parameters. An identification is correct only if all of these conditions are satisfied.
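The correctness criterion can be written as a single predicate. This is a minimal sketch with hypothetical parameter names; the intervals are treated as open, as written in the definition.

```python
def is_correct_identification(pred_unit: str, true_unit: str,
                              pred_breakpoint: int, true_breakpoint: int,
                              true_length: float, mu: float, sigma: float,
                              bp_tolerance: int = 10) -> bool:
    """True when repeat unit, breakpoint and length all satisfy the criterion."""
    unit_ok = pred_unit == true_unit
    breakpoint_ok = abs(pred_breakpoint - true_breakpoint) < bp_tolerance
    length_ok = mu - 3 * sigma < true_length < mu + 3 * sigma
    return unit_ok and breakpoint_ok and length_ok
```

For example, a call with repeat unit `CAG`, a detected breakpoint 4 bp from the true one, and a true length of 310 against an estimate of N(300, 20^2) counts as correct.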
The number of microsatellites was varied first, increasing stepwise from 20 to 100. To better reflect the effect of the number of microsatellites on the algorithm, the coverage was set to 30×, 60×, 100× and 120× for each number of microsatellites. In this set of experiments, the read length was set to 100 bp. Each configuration was repeated 5 times with the same settings, and the averaged results are shown in Table 1.
Table 1: ELMSI performance at different microsatellite numbers
Figure BDA0002054567140000151
Figure BDA0002054567140000161
Figure BDA0002054567140000171
An increasing number of microsatellites affects the robustness of ELMSI. In practice, however, the number of microsatellites in a given 10 Mbp region of a chromosomal sequence is small, since microsatellites are rare. Even so, the density of microsatellites was increased in order to test ELMSI. As Table 1 shows, ELMSI accurately identifies microsatellites and excludes interference from non-microsatellite sequences.
Meanwhile, sequencing coverage directly affects somatic mutation calling, which may affect the performance of the algorithm. To evaluate ELMSI at different coverages, the coverage was further adjusted stepwise from 10× to 100×. As shown in Table 2, the key indicators change with coverage. In this set of experiments, the number of microsatellites was set to 20, 40 and 60, and the read length was set to 100 bp.
Table 2: ELMSI Performance at different coverage
Figure BDA0002054567140000181
Figure BDA0002054567140000191
The lower the coverage, the greater the difficulty the algorithm faces. Consistent with this, Table 2 shows that the detection performance of ELMSI improves with increasing coverage, with maximum recall exceeding 80%. Thus, the higher the coverage, the more accurately ELMSI detects long microsatellites.
ELMSI also remains effective when the read length varies. In this set of experiments, the number of microsatellites was set to 20 to 50, the coverage was set to 30×, 60×, 100× and 120×, and the read lengths were set to 100 bp, 150 bp, 200 bp, 250 bp and 300 bp. The results are shown in Table 3:
Table 3: ELMSI performance at different read lengths
Figure BDA0002054567140000192
Figure BDA0002054567140000201
Figure BDA0002054567140000211
NGS is a high-throughput, low-cost sequencing technology. Its main disadvantage is the large amount of read assembly required. The longer the reads, the lighter the assembly workload and the fewer the errors introduced by assembly. Thus, as the read length increases, the performance of ELMSI improves. Table 3 shows that the longer the read length, the more accurate the estimation result.
Compared with existing algorithms, the method addresses two problems that they cannot handle effectively: identifying microsatellite status in mixed samples, and identifying long microsatellites that exceed the sequencing read length. Because no detection algorithm for long microsatellites currently exists, the ability of the invention to classify microsatellite status was tested first in order to verify its validity, and two key indicators, accuracy and recall, were compared with the results of MSIsensor. In these simulation experiments, the basic evaluation counts were true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). Five indicators were then calculated from them: accuracy, recall, precision, MCC, and Gain.
Accuracy=(TP+TN)/(TP+TN+FN+FP)
Recall=TP/(TP+FN)
Precision=TP/(TP+FP)
MCC=(TP×TN-FP×FN)/√((TP+FP)(TP+FN)(TN+FP)(TN+FN))
Gain=(TP-FP)/(TP+FN)
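The five indicators can be computed from the four counts as in the sketch below. Accuracy, Recall, Precision and Gain follow the formulas given in the text; MCC uses the standard Matthews correlation coefficient definition. The function name is hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this sketch, not something the text specifies.

```python
import math
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Accuracy, Recall, Precision, MCC and Gain from confusion-matrix counts."""
    def ratio(num: float, den: float) -> float:
        return num / den if den else 0.0  # convention for empty denominators

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + tn + fn + fp),
        "recall": ratio(tp, tp + fn),
        "precision": ratio(tp, tp + fp),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
        "gain": ratio(tp - fp, tp + fn),
    }


metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

The counts in the usage line are made-up illustration values, not results from the experiments.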
To generate the simulated data set, a 10 Mbp region was first randomly selected on human chromosome 19, and the lengths, repeat units, and breakpoints of the implanted microsatellites were chosen at random. As previously mentioned, the microsatellite length of a given individual is normally distributed. The normal distribution was divided into 7 parts, μ-3σ, μ-2σ, μ-σ, μ, μ+σ, μ+2σ and μ+3σ; for each part, microsatellites were implanted on the reference sequence at the coverage multiplied by that part's probability, namely 1%, 6%, 24%, 38%, 24%, 6% and 1%, respectively. Once the seven parts had been implanted, the seven simulated read files were merged and aligned to the reference sequence, and the resulting alignment file was supplied to the different mutation calling tools.
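The seven-part split can be sketched as a small planning helper that pairs each implanted length with its share of the total coverage. The function name and the example values of μ and σ are hypothetical; the read simulator and aligner that would consume this plan are outside the sketch.

```python
from typing import List, Tuple

# (offset in units of sigma, probability of that part), as listed in the text.
PARTS: List[Tuple[int, float]] = [
    (-3, 0.01), (-2, 0.06), (-1, 0.24), (0, 0.38), (1, 0.24), (2, 0.06), (3, 0.01),
]


def coverage_plan(mu: float, sigma: float, total_coverage: float) -> List[Tuple[float, float]]:
    """Return (microsatellite length, coverage for that length) for the 7 parts."""
    return [(mu + k * sigma, total_coverage * prob) for k, prob in PARTS]


plan = coverage_plan(mu=300.0, sigma=20.0, total_coverage=100.0)
```

Because the seven probabilities sum to 100%, the per-part coverages add back up to the total coverage of the merged read set.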
For the comparison experiments, the number of microsatellites was set to 30, the coverage to 100×, and the read length to 200 bp. The tumor purities of the input samples were set to 0.9, 0.7, 0.5, 0.3, and 0.1, respectively. Microsatellite status was then identified with both MSIsensor and ELMSI. The results are shown in Table 4.
Table 4: comparison of the results of ELMSI and MSIsensor
Figure BDA0002054567140000222
Compared with MSIsensor, the method performs better in microsatellite status classification of mixed samples. When the tumor purity of the input sequenced sample falls below a certain ratio, the MSS signal from normal cells dilutes the MSI signal of the tumor cells, causing MSIsensor to report MSS events falsely. MSIsensor therefore fails to identify MSI accurately when the input sample is contaminated with normal cells. The invention can classify the microsatellite status of a sample even when the tumor purity is as low as 10%.
A series of experiments was also carried out to verify the effectiveness of the method in detecting long microsatellites in mixed samples. The number of microsatellites was set to 30, the coverage to 100×, and the read length to 200 bp. The tumor purities of the input samples were set to 0.9, 0.7, 0.5, 0.3, and 0.1, respectively; the specific results are shown in Table 5.
Table 5: performance of ELMSI long microsatellite classification
Figure BDA0002054567140000231
Figure BDA0002054567140000241
Figure BDA0002054567140000251
Figure BDA0002054567140000261
As shown in Table 5, a decrease in tumor purity affects the accuracy of ELMSI. However, even at purities as low as 10%, the results indicate that ELMSI can still provide a reliable MSI classification.
The above-mentioned contents are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modification made on the basis of the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims (8)

1. The method for estimating the wide-area length distribution of the genome microsatellite considering the tumor purity factor is characterized by comprising the following steps of:
S1, defining data features and collecting read statistics to complete data feature extraction, wherein the data features are as follows:
MS-pair: two paired reads, one perfectly aligned and the other spanning the breakpoint;
SB-read: the read that spans the breakpoint in an MS-pair;
PSset: a set of 2-tuples, each consisting of the initial position POS and the sequence SEQ of an SB-read, denoted (POS, SEQ);
sk-mer: a sequence consisting of the first k bases;
S2, scanning a given reference genome sequence to find microsatellite candidate regions, recording microsatellites with a maximum repeat unit length of 6 bp, and storing the position and associated sequence of each microsatellite; further screening overlooked microsatellite candidate regions with a clustering algorithm; after the number of microsatellites is determined, traversing and segmenting the reads of each candidate microsatellite region with a k-mer-based algorithm, and identifying the microsatellite repeat unit and breakpoint;
S3, estimating the tumor purity of the given sequencing sample from the read counts of the selected SNVs;
S4, detecting short microsatellites in the mixed sample, and estimating the length distribution parameters of the microsatellites in the tumor tissue by maximum likelihood estimation;
S5, using the distribution of the average length of a long microsatellite to reflect its overall length distribution; adopting the expectation-maximization algorithm, estimating the average length of the microsatellite from the coverage of a specified window containing it, then iteratively estimating the coverage of the specified window with the updated average length, and iterating until convergence to complete the detection of long microsatellites in pure tumor samples;
S6, judging the microsatellite status of long tumor microsatellites by independent z-test to complete the wide-area length distribution estimation, wherein the parameters of N_1 are obtained by separately testing a normal sample; according to the identification steps for long microsatellites of pure tumor samples and the central limit theorem, the length of a long microsatellite in the mixed sample follows a normal distribution with μ = (1-c)μ_1 + cμ_2 and σ^2 = (1-c)σ_1^2 + cσ_2^2, and the mixture parameters μ and σ are estimated from the mixed sample by the expectation-maximization algorithm.
2. The method according to claim 1, wherein step S2 is specifically:
S201, reading the alignment results in PSset, and when the distance between the initial positions of two SB-reads in PSset is less than 50 kbp, assigning the two SB-reads to the same cluster;
S202, each cluster represents a candidate microsatellite region, whereby the number of microsatellites in the input sample is identified;
S203, for each microsatellite candidate region, starting from the first base of the right-end read (belonging to the SB-reads) that aligns into the region, selecting the next k bases as the initial k-mer and recording its first base; then shifting one base at a time, selecting a new k-mer, recording its first base, and checking whether the new k-mer is identical to the initial k-mer;
S204, when the two k-mer sequences are identical, the recorded base sequence is a candidate repeat unit, and the position of its first base is a candidate breakpoint of the microsatellite;
S205, performing the same operation on all reads of the other candidate microsatellite regions.
3. The method according to claim 1, wherein step S4 is specifically:
S401, given a mixed normal-tumor sequencing sample in which the proportion of normal cells is (1-c)% and the proportion of tumor cells is c%, and given a short microsatellite region in the mixed sample, counting the number of reads supporting each distinct length of the microsatellite;
S402, obtaining the set of length values of the microsatellite region, L = {l_1, l_2, ..., l_N}, from the numbers of supporting reads and the lengths they support, wherein L is a length data set drawn at random from two mutually independent samples that follow different normal distributions, each datum in L having a probability of (1-c)% of being a normal-tissue microsatellite length and a probability of c% of being a tumor-tissue microsatellite length;
S403, given a short microsatellite region, when it belongs to normal tissue its length follows the normal distribution N_1(μ_1, σ_1^2), μ_1 and σ_1 being its estimated length distribution parameters; when it belongs to tumor tissue its length follows the normal distribution N_2(μ_2, σ_2^2), μ_2 and σ_2 being the length distribution parameters of the tumor-tissue microsatellite; the microsatellite length in the mixed sample follows the density function f = (1-c)f_1 + cf_2, wherein f_1 and f_2 are the density functions of N_1 and N_2, respectively, and the values of μ_1 and σ_1 are obtained by separately testing a normal sample;
S404, based on the estimated length distribution parameters of the tumor-sample microsatellite and the length data of the normal-sample microsatellite, judging the stability of the short microsatellite by a z-test.
4. The method of claim 3, wherein in step S403, estimates of μ_2 and σ_2 are obtained by maximum likelihood estimation, the likelihood function being the joint probability density function of the microsatellite lengths:
Figure FDA0002984653600000031
wherein L = {l_1, l_2, ..., l_N} is the length data set of the short microsatellite and N is the size of the length set;
maximizing the joint probability of the microsatellite lengths yields the values to be estimated:
Figure FDA0002984653600000032
wherein μ_2 and σ_2 are the length distribution parameters of the tumor-tissue microsatellite.
5. The method according to claim 1, wherein step S5 is specifically:
S501, collecting read information:
WIN-bk: a window on the reference genome sequence with the breakpoint of a microsatellite as its midpoint;
C-pair: paired reads aligned within the WIN-bk region;
T-pair: paired reads in which both reads align to the microsatellite region;
O-pair: paired reads in which one read aligns perfectly to the WIN-bk region and the other aligns to the microsatellite region;
SO-pair: paired reads in which one read aligns to the microsatellite region and the other spans the breakpoint;
S-pair: paired reads in which one read aligns perfectly to the WIN-bk region and the other spans the breakpoint;
S-read: the breakpoint-spanning reads in SO-pairs and S-pairs;
S502, initializing variables, wherein m is the total number of microsatellites, p is the index of the current (p-th) microsatellite, S is the number of samplings, L_Win is the length of WIN-bk, L_aln is the total number of bases in all S-reads that belong to the microsatellite region, and L_set is the set of microsatellite lengths;
S503, calculating the average coverage of WIN-bk from the paired-read counts;
S504, assuming that coverage is uniformly distributed, the coverage calculated in step S503 equals the coverage of the microsatellite region, and the microsatellite length is updated accordingly;
S505, if
Figure FDA0002984653600000041
wherein L is the set of length values of the microsatellite region, L'' is the updated microsatellite length, and L' is the microsatellite length, setting L' = L'' and repeating step S503;
S506, adding the obtained microsatellite length to the set: L_set = L_set ∪ {L''};
S507, to estimate the normal distribution parameters of the given microsatellite, varying L_Win and sampling at least 30 times: setting S = S + 1, and if S < 30, letting L_Win = L_Win + 1000 and going to step S502;
S508, statistically testing the microsatellite length data obtained from the 30 sampling experiments with a normality test algorithm and the Shapiro-Wilk algorithm, and outputting the microsatellite length distribution parameters N(μ, σ^2);
S509, if p < m, letting p = p + 1 and returning to step S502;
S510, comparing the microsatellite status of the tumor cells and the normal cells by an independent z-test; if the p-value is less than 0.05, the microsatellite is judged unstable, and otherwise it is judged a microsatellite stable event.
6. The method according to claim 5, wherein step S502 specifically comprises:
S5021, initializing the number, repeat units, and breakpoints of the microsatellites through data preprocessing;
S5022, dividing the paired reads aligned to WIN-bk into 5 classes: C-pair, T-pair, O-pair, S-pair and SO-pair;
S5023, counting the paired reads in the 5 classes, wherein NUM_C, NUM_T, NUM_O, NUM_S and NUM_SO denote the numbers of C-pairs, T-pairs, O-pairs, S-pairs and SO-pairs, respectively, and computing L_aln;
S5024, setting m to the number of microsatellites, p = 1, S = 1, L_Win = 5 kbp, L' = 0, and L_set = ∅.
7. The method according to claim 5, wherein in step S503, the average coverage of WIN-bk is calculated, specifically:
Figure FDA0002984653600000051
SUM_bp = 2 × (NUM_C + NUM_T + NUM_O + NUM_S + NUM_SO) × L_read + L_aln
L = L' + L_Win
wherein C is the coverage of WIN-bk, SUM_bp is the total number of read bases aligned into WIN-bk, L_read is the read length, and L is the length of the microsatellite.
8. The method of claim 5, wherein in step S504, the length of the updated microsatellite is:
Figure FDA0002984653600000052
SUM_bp = (2 × NUM_T + NUM_O + NUM_SO) × L_read + L_aln
wherein C is the coverage of WIN-bk, SUM_bp is the total number of read bases aligned into the microsatellite region, L_read is the read length, NUM_T is the number of T-pairs, NUM_O is the number of O-pairs, and NUM_SO is the number of SO-pairs.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910385057.0A CN110232949B (en) 2019-05-09 2019-05-09 Genome microsatellite wide-area length distribution estimation method considering tumor purity factor

Publications (2)

Publication Number Publication Date
CN110232949A CN110232949A (en) 2019-09-13
CN110232949B true CN110232949B (en) 2021-08-13

Family

ID=67860505

Country Status (1)

Country Link
CN (1) CN110232949B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112086129B (en) * 2020-09-23 2021-04-06 深圳吉因加医学检验实验室 Method and system for predicting cfDNA of tumor tissue
KR102529641B1 (en) * 2020-11-20 2023-05-08 국립암센터 Method for determining microsatellite instability through tumor purity correction

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101627128A (en) * 2005-05-02 2010-01-13 基因信息公司 Bladder cancer biomarkers and uses thereof
CN106834479A (en) * 2017-02-16 2017-06-13 凯杰(苏州)转化医学研究有限公司 Microsatellite instability state analysis system in immunotherapy of tumors
CN109584961A (en) * 2018-12-03 2019-04-05 元码基因科技(北京)股份有限公司 Method based on two generation sequencing technologies detection blood microsatellite instability
WO2019074963A1 (en) * 2017-10-09 2019-04-18 Strata Oncology, Inc. Microsatellite instability characterization

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Estimating the Length Distributions of Genomic Micro-satellites from Next Generation Sequencing Data;Xuan Feng et al.;《IWBBIO》;20181231;第461–472页 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant