CN111370056B - Method, system and computer readable medium for determining predetermined chromosome instability index of a sample to be tested

Publication number
CN111370056B
Authority
CN
China
Prior art keywords
window
sequence
sample
window sequence
sequencing
Prior art date
Legal status
Active
Application number
CN201910428251.2A
Other languages
Chinese (zh)
Other versions
CN111370056A (en
Inventor
李世勇
茅矛
张锋
陈彦
钟果林
张岩
陈灏
封裕敏
Current Assignee
Shenzhen Siqin Medical Technology Co ltd
Original Assignee
Shenzhen Siqin Medical Technology Co ltd
Application filed by Shenzhen Siqin Medical Technology Co ltd
Priority to CN201910428251.2A
Publication of CN111370056A
Application granted
Publication of CN111370056B

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B35/00 ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
    • G16B35/10 Design of libraries

Abstract

The invention provides a method for determining a predetermined chromosome instability index of a sample to be tested. The method comprises the following steps: (1) dividing the reference sequence into window sequences (bins); (2) aligning sequencing data from the sample to the reference sequence; (3) counting the number of matched sequencing reads for each bin; (4) filtering, normalizing and correcting the number of matched sequencing reads of each bin; (5) taking the logarithm of the results obtained in step (4) to obtain the logR ratio of the number of sequencing reads for each bin; (6) determining a first preselected abnormal window sequence; (7) determining a second preselected abnormal window sequence; (8) determining an abnormal window sequence; (9) determining the frequency of occurrence of each copy number variation of the abnormal window sequence; and (10) determining the instability index of the sample to be tested for the predetermined chromosome.

Description

Method, system and computer readable medium for determining predetermined chromosome instability index of a sample to be tested
Technical Field
The present invention relates to the field of biological information, and in particular, to a method, system and computer readable medium for determining a predetermined chromosome instability index for a test sample.
Background
Cancer causes amplification or deletion of certain regions of the genome, and chromosome doubling occurs in 30% of cancer patients. Is the rate of chromosomal amplification or deletion therefore correlated with cancer? Put differently, can the probability that a sample originates from an organism with cancer be inferred from the rate of chromosomal amplification or deletion?
This is a problem that researchers urgently need to solve.
Disclosure of Invention
The present invention aims to solve, at least to some extent, one of the technical problems in the related art. To this end, the invention provides a method for calculating a chromosome instability index, the Chromosome Instability (CIN) score, which measures the instability of a predetermined chromosome of a sample.
Based on this, in a first aspect of the invention, a method for determining a predetermined chromosome instability index of a sample to be tested is presented. According to an embodiment of the invention, the method comprises: (1) dividing the reference sequence of the predetermined chromosome into a plurality of window sequences (bins) of the same length; (2) aligning sequencing data from the sample to be tested to the reference sequence, the sequencing data consisting of a plurality of sequencing reads; (3) counting, based on the alignment result of step (2), the number of matched sequencing reads for each of the plurality of window sequences; (4) filtering, normalizing and correcting the number of matched sequencing reads of each window based on the counts of step (3); (5) taking the logarithm of the normalized and corrected number of matched sequencing reads of each window obtained in step (4) to obtain the logR ratio of the number of sequencing reads of each window; (6) smoothing the logR ratio obtained in step (5), merging windows and performing CNV segmentation to determine a first preselected abnormal window sequence; (7) obtaining a standard score of the number of matched sequencing reads of each window based on the normalized and corrected number of matched sequencing reads of each window, and determining a second preselected abnormal window sequence; (8) determining an abnormal window sequence based on the first preselected abnormal window sequence determined in (6) or the second preselected abnormal window sequence determined in (7); (9) determining, based on a database of known tumors, the frequency of occurrence of each copy number variation of the abnormal window sequence; (10) determining an instability index of the sample to be tested for the predetermined chromosome based on the number of matched sequencing reads, the frequency of occurrence of copy number variations, the length of the window sequence, and a predetermined conventional parameter for each of a plurality of the abnormal window sequences. As will be understood by those skilled in the art, the logR ratio is obtained by normalizing and correcting the number of matched sequencing reads in each window, dividing the normalized and corrected number by the copy number of the normal reference sample (for example, 2 for a human diploid sample), and then taking the logarithm of the quotient. Once the instability index of the sample to be tested for the predetermined chromosome has been obtained by the method according to the embodiment of the invention, the probability that the sample is derived from a cancer sample can be determined, which provides a detection index for scientific research. For example, in research on screening cancer treatment drugs or ascertaining the cause of an individual's cancer, reliable drugs or possible contributing factors can be identified from the change in the instability index before and after administration of a drug or exposure to an interfering factor. Alternatively, the probability that the sample is derived from a cancer sample can serve as an index for cancer detection.
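Steps (1) to (3) can be illustrated with a minimal Python sketch. It assumes that each aligned read is represented by its 0-based start position on the chromosome and is assigned to the bin containing that position; the text does not state how reads spanning two bins are assigned, and the function name `count_reads_per_bin` is hypothetical.

```python
import numpy as np

def count_reads_per_bin(read_starts, chrom_length, bin_size):
    """Count reads in fixed-length bins, keyed by 0-based read start position.

    Assumption (not from the source): a read belongs to the bin that
    contains its start coordinate.
    """
    n_bins = -(-chrom_length // bin_size)  # ceiling division; last bin may be shorter
    bin_index = np.asarray(read_starts, dtype=np.int64) // bin_size
    return np.bincount(bin_index, minlength=n_bins)

# Toy example: a 40-base "chromosome" split into 10-base bins.
counts = count_reads_per_bin([0, 5, 12, 19, 20, 39], chrom_length=40, bin_size=10)
print(counts)  # [2 2 1 1]
```

In practice the start positions would come from an alignment file, and `bin_size` would be one of the window lengths listed in the embodiments.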
According to an embodiment of the present invention, the method may further include at least one of the following additional technical features:
According to an embodiment of the present invention, the sample to be tested is derived from a suspected cancer patient.
According to an embodiment of the present invention, the sample to be tested is blood, body fluid, urine, saliva or skin.
According to an embodiment of the invention, the window sequence is 1M, 50K, 20K, 10K or 5K in length.
According to an embodiment of the invention, the sequencing data is obtained by constructing a whole-genome library from plasma cell-free DNA and sequencing it on a second-generation sequencer, with an average sequencing depth of less than 1X, 2X, 3X, 4X or 5X.
The type of the second generation sequencer is not particularly limited, and according to a specific embodiment of the present invention, the second generation sequencer is XTen, NovaSeq, or NextSeq 500.
According to an embodiment of the invention, in step (4), the filtering applies the following criteria:
1) Mappability: the probability that reads sequenced from the region align uniquely and correctly is > 0.5;
2) the proportion of N (bases other than A, T, C, G) within each bin is < 0.5;
3) not in the region files wgEncodeDacMapabilityConsensusExcludable.bed and wgEncodeDukeMapabilityRegionsExcludable.bed downloaded from UCSC;
4) the X and Y chromosomes;
5) bins that, in the normal reference set, deviate by more than 4 standard deviations after inter-sample normalization (division by the sample mean).
According to an embodiment of the present invention, in step (4), the matched sequencing read counts are obtained after GC and mappability correction.
According to an embodiment of the present invention, the GC correction and the mappability correction are performed as follows:
1) GC calculation: count the total number of A, T, C and G bases in each window (bin) and the number of G and C bases. The proportion of G and C is the GC content of the window.
2) Mappability calculation: using the ENCODE mappability bigWig file downloaded from UCSC, intersect the regions in the file with each bin, and take the average mappability of all regions within a bin as the mappability value of that bin.
3) Filter out bins with abnormal read counts, retaining bins between the 1% and 99% quantiles.
4) Combine the GC content and the mappability of each bin, group the bins by this combination, and calculate the median read count of all bins for each GC and mappability combination.
5) Determine the optimal parameter of the locally weighted nonparametric regression by cross-validation, construct a fitted curve, and finally divide the normalized depth of each bin by the value predicted by the curve to obtain the corrected value.
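The grouping logic of steps 3) and 4) can be sketched as follows. This is a simplified illustration, not the regression of step 5): the cross-validated locally weighted fit is not implemented, and the per-group median stands in for the value the fitted curve would predict. The rounding step of 0.01 used to form the GC and mappability groups and the function name `stratified_median_correction` are assumptions.

```python
import numpy as np

def stratified_median_correction(counts, gc, mappability, step=0.01):
    """Divide each bin's read count by the median count of its (GC, mappability) group.

    Simplification: the group median replaces the prediction of the locally
    weighted regression described in step 5). Bins outside the 1%-99% count
    quantiles, or in groups with a zero median, are returned as NaN.
    """
    counts = np.asarray(counts, dtype=float)
    gc = np.asarray(gc, dtype=float)
    mappability = np.asarray(mappability, dtype=float)

    # Step 3: keep bins whose read count lies within the 1%-99% quantiles.
    low, high = np.quantile(counts, [0.01, 0.99])
    keep = (counts >= low) & (counts <= high)

    # Step 4: group retained bins by rounded (GC, mappability) values.
    keys = np.stack([np.round(gc / step), np.round(mappability / step)], axis=1).astype(int)
    groups = {}
    for i in np.flatnonzero(keep):
        groups.setdefault(tuple(keys[i]), []).append(i)

    corrected = np.full(counts.shape, np.nan)
    for members in groups.values():
        median = np.median(counts[members])
        if median > 0:
            corrected[members] = counts[members] / median
    return corrected

# Toy input: six bins in two GC groups with identical mappability.
corrected = stratified_median_correction(
    counts=[100, 110, 90, 200, 220, 180],
    gc=[0.4, 0.4, 0.4, 0.5, 0.5, 0.5],
    mappability=[1.0] * 6,
)
```

With so few bins the quantile filter removes the smallest and largest counts; on genome-wide data it trims only the extreme tails.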
According to an embodiment of the present invention, in step (6), a logR ratio greater than 0.1 or less than -0.1 after smoothing, window merging and CNV segmentation is an indication that the corresponding window is a first preselected abnormal window sequence.
In step (7), the second preselected abnormal window sequence is determined according to the following equation:
z_i = (x_i - μ_i) / σ_i
wherein x_i represents the corrected number of sequencing reads from the sample to be tested that match the reference sequence of the i-th window;
μ_i represents the mean, over the plurality of reference set samples, of the corrected number of sequencing reads that match the reference sequence of the i-th window;
σ_i represents the standard deviation, over the plurality of reference set samples, of the corrected number of sequencing reads that match the reference sequence of the i-th window;
z_i represents the standard score (z-score) of the number of matched sequencing reads of the i-th window;
the reference set consists of samples from a known normal population.
According to an embodiment of the present invention, z_i greater than 3 or less than -3 is an indication that the i-th window sequence of the sample to be tested is an abnormal window sequence.
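The standard score above can be computed per window as in the following sketch. It assumes the reference set is supplied as a matrix with one row per normal sample and one column per window, and that the sample standard deviation (`ddof=1`) is intended; the source does not say which estimator is used. The function name `bin_z_scores` is hypothetical.

```python
import numpy as np

def bin_z_scores(sample, reference):
    """Per-window z-score of a test sample against a normal reference set.

    sample:    corrected read counts of the test sample, shape (n_bins,)
    reference: corrected read counts of normal samples, shape (n_samples, n_bins)
    """
    sample = np.asarray(sample, dtype=float)
    reference = np.asarray(reference, dtype=float)
    mu = reference.mean(axis=0)
    sigma = reference.std(axis=0, ddof=1)  # assumption: sample standard deviation
    return (sample - mu) / sigma

# Toy reference set: three normal samples, two windows.
reference = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1]])
z = bin_z_scores([1.05, 1.5], reference)
flagged = np.abs(z) > 3  # threshold from the text: z > 3 or z < -3
```

Windows whose reference standard deviation is zero would yield infinite or undefined scores and need separate handling.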
According to an embodiment of the invention, in step (8), a logR ratio greater than 0.1 or less than -0.1 and/or a z_i greater than 3 or less than -3 is an indication that the window is an abnormal window sequence.
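A minimal sketch of this decision rule is given below. It reads "and/or" as a logical OR, takes already-smoothed logR ratios as input, and merges consecutive flagged windows into half-open intervals regardless of whether they are gains or losses. That run-merging is a simplification standing in for the smoothing and CNV segmentation of step (6), which the text does not specify in detail; the function name `abnormal_intervals` is hypothetical.

```python
import numpy as np

def abnormal_intervals(log_r, z, bin_size, log_r_cut=0.1, z_cut=3.0):
    """Flag windows by |logR| > log_r_cut OR |z| > z_cut and merge adjacent ones.

    Returns half-open (start, end) coordinates of each run of flagged windows.
    """
    log_r = np.asarray(log_r, dtype=float)
    z = np.asarray(z, dtype=float)
    flagged = (np.abs(log_r) > log_r_cut) | (np.abs(z) > z_cut)

    # Pad with False so that every run has both a rising and a falling edge.
    padded = np.concatenate(([False], flagged, [False])).astype(int)
    edges = np.diff(padded)
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1)
    return [(int(s) * bin_size, int(e) * bin_size) for s, e in zip(starts, ends)]

intervals = abnormal_intervals(
    log_r=[0.0, 0.2, 0.15, 0.0, 0.0, -0.3],
    z=[0.5, 1.0, 4.0, 3.5, 0.0, -1.0],
    bin_size=10_000,
)
print(intervals)  # [(10000, 40000), (50000, 60000)]
```

The length of each returned interval (end minus start) corresponds to the length of an abnormal window used later in the score.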
According to an embodiment of the invention, in step (10), the instability index (CIN score) of the sample to be tested for the predetermined chromosome is determined, and the CIN score of the chromosome calculated, by the following formulas:
[Formula image GDA0002757076870000031: not reproduced in this text]
[Formula image GDA0002757076870000032: not reproduced in this text]
wherein n represents the total number of window sequences;
a denotes a predetermined constant related to the window size;
l_k represents the length of the k-th abnormal window;
f_k represents the frequency of CNV occurrence in the k-th abnormal window sequence;
abs(Z-score) represents the absolute value of the standard score of the k-th window;
abs(logR) represents the absolute value of the logR ratio of the k-th window after smoothing.
According to an embodiment of the present invention, the frequency of CNV occurrence in the k-th abnormal window sequence is determined from the CNV results of whole-genome sequencing (WGS) tumor samples: when the overlap between the k-th abnormal window sequence interval and a CNV region of a tumor sample accounts for 90% or more of the k-th abnormal window sequence interval, this is an indication that the tumor sample has a CNV in that interval. f_k is the ratio of the number of cancer samples having a CNV in the k-th abnormal window sequence interval to the total number of the cancer samples.
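The 90% overlap rule can be sketched as below. The sketch assumes half-open (start, end) coordinates on a single chromosome, and that a single CNV region of a tumor sample must by itself cover at least 90% of the window; the text does not say whether overlaps from several regions may be summed. The function name `cnv_frequency` and the toy coordinates are hypothetical.

```python
def cnv_frequency(window, tumor_cnvs, min_fraction=0.9):
    """Fraction of tumor samples with a CNV covering >= min_fraction of the window.

    window:     (start, end) of the abnormal window interval, half-open
    tumor_cnvs: one list of (start, end) CNV regions per tumor sample
    """
    w_start, w_end = window
    w_length = w_end - w_start
    hits = 0
    for cnv_regions in tumor_cnvs:
        for c_start, c_end in cnv_regions:
            overlap = min(w_end, c_end) - max(w_start, c_start)
            if overlap >= min_fraction * w_length:
                hits += 1
                break  # count each tumor sample at most once
    return hits / len(tumor_cnvs)

# Four toy tumor samples; two of them cover at least 90% of the window.
f_k = cnv_frequency(
    window=(100, 200),
    tumor_cnvs=[[(90, 210)], [(150, 300)], [], [(102, 198)]],
)
print(f_k)  # 0.5
```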
According to the embodiment of the invention, the method further comprises the step of determining the cancer probability of the sample to be tested based on a plurality of samples with known states and the CIN score and/or standard score of the sample to be tested.
According to a specific embodiment of the invention, a normal distribution of the CIN score is constructed using the CIN scores of known normal samples as baseline data, and its mean and standard deviation are obtained. A p-value of less than 0.01 for the CIN score of the sample to be tested is an indication that the sample is derived from a cancer sample, which in turn provides a detection index for related scientific research or cancer detection.
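A minimal sketch of this test follows. It assumes a one-sided (upper-tail) p-value, that is, only CIN scores above the normal baseline count as evidence; the text does not state whether the test is one-sided or two-sided. The baseline values are toy numbers and the function name `cin_p_value` is hypothetical.

```python
import math
import statistics

def cin_p_value(cin_score, normal_scores):
    """Upper-tail p-value of a CIN score under a normal fit to baseline scores."""
    mu = statistics.mean(normal_scores)
    sigma = statistics.stdev(normal_scores)  # sample standard deviation
    z = (cin_score - mu) / sigma
    # Survival function of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2))

baseline = [1.0, 1.2, 0.8, 1.1, 0.9]  # toy CIN scores of known normal samples
p = cin_p_value(2.0, baseline)
indicates_cancer = p < 0.01  # threshold from the text
```

A score equal to the baseline mean gives a p-value of 0.5 under this convention.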
In a second aspect of the invention, a computer-readable medium is presented. According to an embodiment of the present invention, the computer-readable medium stores instructions adapted to be executed by a processor to determine a predetermined chromosome instability index of a sample to be tested by: (1) dividing the reference sequence of the predetermined chromosome into a plurality of window sequences (bins) of the same length; (2) aligning sequencing data from the sample to be tested to the reference sequence, the sequencing data consisting of a plurality of sequencing reads; (3) counting, based on the alignment result of step (2), the number of matched sequencing reads for each of the plurality of window sequences; (4) filtering, normalizing and correcting the number of matched sequencing reads of each window based on the counts of step (3); (5) taking the logarithm of the normalized and corrected number of matched sequencing reads of each window obtained in step (4) to obtain the logR ratio of the number of sequencing reads of each window; (6) smoothing the logR ratio obtained in step (5), merging windows and performing CNV segmentation to determine a first preselected abnormal window sequence; (7) obtaining a standard score of the number of matched sequencing reads of each window based on the normalized and corrected number of matched sequencing reads of each window, and determining a second preselected abnormal window sequence; (8) determining an abnormal window sequence based on the first preselected abnormal window sequence determined in (6) or the second preselected abnormal window sequence determined in (7); (9) determining, based on a database of known tumors, the frequency of occurrence of each copy number variation of the abnormal window sequence; (10) determining an instability index of the sample to be tested for the predetermined chromosome based on the number of matched sequencing reads, the frequency of occurrence of copy number variations, the length of the window sequence, and a predetermined conventional parameter for each of a plurality of the abnormal window sequences. By using the computer-readable medium according to the embodiment of the invention, once the instability index of the sample to be tested for the predetermined chromosome is obtained, the probability that the sample is derived from a cancer sample can be determined, which provides a detection index for scientific research or an index for cancer detection.
In a third aspect of the invention, a system for determining a predetermined chromosome instability index of a sample to be tested is provided. According to an embodiment of the invention, the system comprises: a window dividing device adapted to divide the reference sequence of the predetermined chromosome into a plurality of window sequences (bins) of the same length; an alignment device connected to the window dividing device and adapted to align sequencing data from the sample to be tested to the reference sequence, the sequencing data consisting of a plurality of sequencing reads; a statistical device connected to the alignment device and adapted to count the number of matched sequencing reads for each of the window sequences based on the alignment result obtained by the alignment device; a correcting device connected to the statistical device and adapted to filter, normalize and correct the number of matched sequencing reads of each window based on the counts obtained by the statistical device; a log-taking device connected to the correcting device and adapted to take the logarithm of the normalized and corrected number of matched sequencing reads of each window obtained by the correcting device, so as to obtain the logR ratio of the number of sequencing reads of each window; a first preselected abnormal window sequence determining device connected to the log-taking device and adapted to smooth the logR ratio obtained by the log-taking device, merge windows and perform CNV segmentation to determine a first preselected abnormal window sequence; a second preselected abnormal window sequence determining device connected to the correcting device and adapted to obtain a standard score of the number of matched sequencing reads of each window based on the normalized and corrected number of matched sequencing reads of each window, and to determine a second preselected abnormal window sequence; an abnormal window sequence determining device connected to the first preselected abnormal window sequence determining device and the second preselected abnormal window sequence determining device and adapted to determine an abnormal window sequence based on the first preselected abnormal window sequence or the second preselected abnormal window sequence determined by the respective device; a copy number variation occurrence frequency determining device connected to the abnormal window sequence determining device and adapted to determine the frequency of occurrence of each copy number variation of the abnormal window sequence based on a known tumor database; and an instability index determining device connected to the copy number variation occurrence frequency determining device and adapted to determine an instability index of the sample to be tested for the predetermined chromosome based on the number of matched sequencing reads, the copy number variation occurrence frequency, the length of the window sequence, and a predetermined conventional parameter for each of a plurality of the abnormal window sequences. The system according to the embodiment of the invention is suitable for executing the method for determining the predetermined chromosome instability index of the sample to be tested, and further determines, from the index obtained, the probability that the sample is derived from a cancer sample, so as to provide a detection index for scientific research or an index for cancer detection.
It should be noted that, as will be understood by those skilled in the art, the features and advantages of the method described above for determining a predetermined chromosome instability index of a sample to be tested also apply to the computer-readable medium and to the system for determining a predetermined chromosome instability index of a sample to be tested, and are not repeated here for brevity.
Drawings
FIG. 1 is a schematic diagram of a system for determining a predetermined chromosome instability index of a sample to be tested according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of another embodiment of a system for determining a predetermined chromosome instability index of a sample under test according to the present invention;
FIG. 3 is an exemplary diagram of a selected window interval size according to an embodiment of the invention;
FIG. 4 is a GC distribution graph of sequencing read data (bins are grouped by GC% and the corresponding frequencies of bins at different GC percentages) according to an embodiment of the invention;
FIG. 5 is a normal distribution diagram of the CIN score according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
FIG. 1 is a schematic structural diagram of a system for determining a predetermined chromosome instability index of a sample to be tested according to the present invention. According to an embodiment of the invention, the system comprises:
a window dividing device 100, adapted to divide the reference sequence of the predetermined chromosome into a plurality of window sequences (bins) of the same length;
an alignment device 200, connected to the window dividing device 100 and adapted to align sequencing data from the sample to be tested to the reference sequence, wherein the sequencing data consists of a plurality of sequencing reads;
a statistical device 300, connected to the alignment device 200 and adapted to count the number of matched sequencing reads for each of the plurality of window sequences based on the alignment result obtained by the alignment device;
a correcting device 400, connected to the statistical device 300 and adapted to filter, normalize and correct the number of matched sequencing reads of each window based on the counts obtained by the statistical device; optionally, the matched sequencing read counts are obtained after GC and mappability correction;
a log-taking device 500, connected to the correcting device 400 and adapted to take the logarithm of the normalized and corrected number of matched sequencing reads of each window obtained by the correcting device, so as to obtain the logR ratio of the number of sequencing reads of each window;
a first preselected abnormal window sequence determining device 600, connected to the log-taking device 500 and adapted to smooth the logR ratio obtained by the log-taking device, merge windows and perform CNV segmentation to determine a first preselected abnormal window sequence;
optionally, a logR ratio greater than 0.1 or less than -0.1 after smoothing, window merging and CNV segmentation is an indication that the corresponding window is a first preselected abnormal window sequence;
a second preselected abnormal window sequence determining device 700, connected to the correcting device 400 and adapted to obtain a standard score of the number of matched sequencing reads of each window based on the normalized and corrected number of matched sequencing reads of each window, and to determine a second preselected abnormal window sequence;
optionally, the second preselected abnormal window sequence is determined based on the following equation:
z_i = (x_i - μ_i) / σ_i
wherein x_i represents the corrected number of sequencing reads from the sample to be tested that match the reference sequence of the i-th window;
μ_i represents the mean, over the plurality of reference set samples, of the corrected number of sequencing reads that match the reference sequence of the i-th window;
σ_i represents the standard deviation, over the plurality of reference set samples, of the corrected number of sequencing reads that match the reference sequence of the i-th window;
z_i represents the standard score of the number of matched sequencing reads of the i-th window;
the reference set consists of samples from a known normal population;
wherein z_i greater than 3 or less than -3 is an indication that the i-th window sequence of the sample to be tested is a second preselected abnormal window sequence;
an abnormal window sequence determining device 800, connected to the first preselected abnormal window sequence determining device 600 and the second preselected abnormal window sequence determining device 700 and adapted to determine an abnormal window sequence based on the first preselected abnormal window sequence or the second preselected abnormal window sequence determined by the respective device;
a copy number variation occurrence frequency determining device 900, connected to the abnormal window sequence determining device 800 and adapted to determine the frequency of occurrence of each copy number variation of the abnormal window sequence based on a known tumor database;
an instability index determining device 1000, connected to the copy number variation occurrence frequency determining device 900 and adapted to determine an instability index of the sample to be tested for the predetermined chromosome based on the number of matched sequencing reads, the copy number variation occurrence frequency, the length of the window sequence and a predetermined conventional parameter for each of a plurality of the abnormal window sequences;
optionally, the instability index (CIN score) of the sample to be tested for the predetermined chromosome is determined by the following formulas:
[Formula image GDA0002757076870000071: not reproduced in this text]
[Formula image GDA0002757076870000072: not reproduced in this text]
wherein n represents the total number of window sequences;
a represents a predetermined constant related to the window size;
l_k represents the length of the k-th abnormal window;
f_k represents the frequency of CNV occurrence in the k-th abnormal window sequence;
abs(Z-score) represents the absolute value of the standard score of the k-th window;
abs(logR) represents the absolute value of the logR ratio of the k-th window after smoothing;
the frequency of CNV occurrence in the k-th abnormal window sequence is determined from the CNV results of whole-genome sequencing (WGS) tumor samples, wherein an overlap between the k-th abnormal window sequence interval and a CNV region of a tumor sample that accounts for 90% or more of the k-th abnormal window sequence interval is an indication that the tumor sample has a CNV in that interval, and f_k is the ratio of the number of cancer samples having a CNV in the k-th abnormal window sequence interval to the total number of the cancer samples.
Specifically, the sample to be tested is derived from a suspected cancer patient.
Specifically, the sample to be tested is blood, body fluid, urine, saliva or skin.
Specifically, the sequencing data is obtained by constructing a whole-genome library from plasma cell-free DNA and sequencing it on a second-generation sequencer, with an average sequencing depth of less than 1X, 2X, 3X, 4X or 5X.
Specifically, the second generation sequencer is XTen, NovaSeq, or NextSeq 500.
Specifically, the length of the window sequence is determined as follows: 1M, 50K, 20K, 10K or 5K.
According to still another embodiment of the present invention, referring to fig. 2, the system further includes: a cancer probability determination means 1100, said cancer probability determination means 1100 being connected to said instability index determination means 1000 and being adapted to determine a cancer probability of said sample to be tested based on a plurality of samples of known status and CIN score and/or standard score of said sample to be tested.
The scheme of the invention is explained below with reference to the examples. It will be appreciated by those skilled in the art that the following examples only illustrate the invention and should not be taken as limiting its scope. Where the examples do not specify particular techniques or conditions, they are carried out according to techniques or conditions described in the literature in the art (for example, Molecular Cloning: A Laboratory Manual, 3rd edition, by J. Sambrook et al., translated by Huang Peitang et al., Science Press) or according to the product instructions. Reagents or instruments for which no manufacturer is indicated are conventional, commercially available products, for example from Illumina.
Example 1 preparation of sequencing samples
1. Plasma separation
a) Preparing instruments, reagents and consumables required by the experiment, and precooling the high-speed refrigerated centrifuge to 4 ℃ in advance.
b) If the peripheral blood sample is collected in an EDTA anticoagulant tube, place it in a 4 ℃ refrigerator immediately after blood collection and separate the plasma within 2 hours. If the peripheral blood sample is collected in a cell-free nucleic acid preservation tube such as a Streck tube, it can be kept at room temperature, and the plasma is separated within the time specified in the instructions of the blood collection tube.
c) Record the sample information, fit the high-speed refrigerated centrifuge with a swing-out (horizontal) rotor, and set the parameters: temperature 4 ℃, centrifugal force 1600 g, time 10 min. Balance the blood collection tubes, then place them in the centrifuge and centrifuge.
d) After centrifugation was completed, the blood collection tubes were placed on a centrifuge tube rack of a biosafety cabinet. The supernatant from the centrifuged blood collection tube was collected into a new 15mL centrifuge tube, and the tube wall was marked with the sample number and the operation time. Note that careful handling is required to collect the supernatant to avoid aspiration of leukocytes. The remaining blood cells were used to extract gDNA, and were dispensed into a new 15mL centrifuge tube, and the tube wall was labeled with the sample number and the operating time.
e) Fit the high-speed refrigerated centrifuge with a fixed-angle rotor and set the parameters: temperature 4 ℃, centrifugal force 16000 g, time 10 min. Balance the 15 mL centrifuge tubes containing the supernatant, place them in the centrifuge, and centrifuge.
f) After centrifugation is complete, place the 15 mL centrifuge tubes containing the supernatant on the centrifuge tube rack of a biosafety cabinet. Collect the supernatant from each centrifuged tube into a new 15 mL tube. Collect the supernatant carefully to avoid aspirating the pellet. The purpose of this step is to remove impurities such as cellular debris from the plasma.
g) Store the plasma and blood cells in a -80 ℃ freezer until use.
h) After the experiment is finished, return all items, clean the bench top, turn on the ultraviolet lamp of the biosafety cabinet, and turn it off after 30 min of irradiation. Keep a detailed experimental record.
2. cfDNA extraction
i) Prepare the instruments, reagents and consumables required for the experiment. Turn on the water bath and set the temperature to 60 ℃. Turn on the metal bath and set the temperature to 56 ℃. Confirm that the kit is within its validity period, that the appropriate amount of isopropanol has been added to Buffer ACB, and that the appropriate amount of absolute ethanol has been added to Buffer ACW1 and Buffer ACW2.
j) Record the sample number and other information.
k) If the plasma has just been freshly separated, proceed directly to cfDNA extraction. If the plasma has been stored frozen at -80 ℃, thaw the plasma sample and centrifuge it at 16,000 x g (fixed-angle rotor) at 4 ℃ for 5 min to remove cryoprecipitates.
l) prepare the required amount of ACL mixture according to Table 1.
Table 1: volumetric amounts of Buffer ACL and carrier RNA (dissolved in Buffer AVE) required to treat 4ml samples
Figure GDA0002757076870000091
Figure GDA0002757076870000101
m) transfer 400. mu.l of Proteinase K to a 50ml centrifuge tube containing 4ml of plasma. Vortex intermittently for 30s to mix well.
n) Add 3.2 ml of Buffer ACL (containing 1.0 μg of carrier RNA). Vortex vigorously for 15 seconds. Make sure the tube is vortexed vigorously enough that the sample and Buffer ACL are thoroughly mixed, so that lysis is efficient.
o) Note: do not interrupt the procedure after this step; proceed immediately to the lysis incubation step.
p) Incubate the centrifuge tube in a 60 ℃ water bath for 30 minutes.
q) 7.2ml of Buffer ACB were added to the above reaction mixture. The tube cap was closed and vortexed intermittently for 15s to mix well.
r) the lysates containing Buffer ACB were incubated on ice or refrigerated for 5 min.
s) Assemble the vacuum filtration device: insert a VacValve into the 24-position base, insert a VacConnector into the VacValve, attach a QIAamp Mini silica membrane column to the VacConnector, and finally insert a 20 ml tube extender into the silica membrane column. Make sure the tube extender is inserted firmly to prevent sample leakage. Note: keep the 2 ml collection tube for the later dry spin. Mark the sample number on the silica membrane column. The VacValve regulates the flow rate, the VacConnector prevents contamination, the QIAamp Mini silica membrane column binds the DNA, and the tube extender holds the large plasma volume.
t) Transfer the incubated mixture into the tube extender and turn on the vacuum pump. Once all the lysate has been drawn through the spin column, turn off the vacuum pump and open the vent valve on the side of the 24-position base to release the pressure to 0 MPa. Carefully remove and discard the tube extender.
u) Add 600 μl of Buffer ACW1 to the QIAamp Mini silica membrane column, close the vent valve, and turn on the vacuum pump to draw the liquid through. Once all the Buffer ACW1 has been drawn through the spin column, turn off the vacuum pump and open the vent valve on the side of the 24-position base to release the pressure to 0 MPa.
v) Add 750 μl of Buffer ACW2 to the QIAamp Mini silica membrane column, close the vent valve, and turn on the vacuum pump to draw the liquid through. Once all the Buffer ACW2 has been drawn through the spin column, turn off the vacuum pump and open the vent valve on the side of the 24-position base to release the pressure to 0 MPa.
w) Add 750 μl of absolute ethanol to the QIAamp Mini silica membrane column, close the vent valve, and turn on the vacuum pump to draw the liquid through. Once all the absolute ethanol has been drawn through the spin column, turn off the vacuum pump and open the vent valve on the side of the 24-position base to release the pressure to 0 MPa. Switch off the power supply of the vacuum pump.
x) Close the lid of the QIAamp Mini silica membrane column, remove it from the vacuum manifold, place it in a clean 2 ml collection tube, and discard the VacConnector. Centrifuge the collection tube at full speed (20,000 x g; 14,000 rpm) for 3 min.
y) Place the QIAamp Mini silica membrane column in a new 2 ml collection tube, open the lid, and dry it on the 56 ℃ metal bath for 10 min until the silica membrane is completely dry.
z) Remove the QIAamp Mini silica membrane column, place it in a clean 1.5 ml elution tube (supplied with the kit), and discard the used 2 ml collection tube.
aa) Carefully add 55 μl of nuclease-free water to the center of the silica membrane in the QIAamp Mini silica membrane column. Close the lid and incubate at room temperature for 3 min.
bb) Centrifuge the elution tube in a microcentrifuge at full speed (20,000 x g; 14,000 rpm) for 1 min to elute the cfDNA.
cc) quality standards and assessments
High-sensitivity (HS) quantification: 1 μL of cfDNA was quantified using the
Figure GDA0002757076870000111
dsDNA HS Assay Kit, and the concentration was recorded.
Agilent 2100 detection: cfDNA fragment distribution was determined.
dd) After the experiment is finished, return all items, clean the bench top, turn on the ultraviolet lamp of the biosafety cabinet, and turn it off after 30 min of irradiation. Keep a detailed experimental record.
3. cfDNA library construction
ee) preparation before construction of library
i. Take the magnetic beads for DNA purification (AMPure XP beads, Beckman) out of the 4 ℃ refrigerator and equilibrate them at room temperature for 30 min before use.
ii. Take the End Repair & A-Tailing Buffer and the End Repair & A-Tailing Enzyme Mix out of the -20 ℃ freezer and thaw them on ice for later use.
iii. Record the name, sampling date and DNA concentration of each cfDNA sample to be used for library construction in the laboratory notebook, and assign each a serial number to simplify later steps.
iv. Take the corresponding number of 200 μL PCR tubes and number them (on both the lid and the wall of the tube).
v. Calculate the volume of DNA solution required for each cfDNA sample so that the library input is at least 10 ng and at most 100 ng, record the volume in the laboratory notebook, and transfer that volume into the corresponding 200 μL PCR tube.
vi. Add an appropriate amount of nuclease-free water to each 200 μL PCR tube to bring the final volume to 50 μL.
vii. Note: the following rules apply when preparing all reaction systems during library construction. If there are fewer than four samples, no master mix is needed, and each component of the reaction system is added to each sample individually. If there are more than four samples, prepare a master mix using 105% of the required amount of each component of the reaction system, and then add the master mix to each sample in turn.
ff) End repair and A-tailing
i. Prepare the end repair and A-tailing reaction system as shown in Table 2.
Table 2:
Figure GDA0002757076870000121
ii. Add 10 μL of the above end repair and A-tailing reaction system to each 200 μL PCR tube, mix well, centrifuge briefly at low speed, and run the PCR instrument with the program shown in Table 3 below.
Table 3:
Figure GDA0002757076870000122
iii. Take the reactions out of the PCR instrument, place them on a small yellow plate, and proceed to the adapter ligation reaction.
gg) Adapter ligation reaction
i. Prepare the adapter ligation reaction system as shown in Table 4.
Table 4:
Component            1 reaction    8 reactions (5% excess)
PCR-grade water      5 μL          42 μL
Ligation Buffer      30 μL         252 μL
DNA Ligase           10 μL         84 μL
Total volume         45 μL         378 μL
ii. Add 45 μL of the above reaction system to each reaction tube, mix gently, and centrifuge briefly at low speed.
iii. Choose the adapter concentration according to the amount of input DNA, as shown in Table 5 below, and add 5 μL of adapter to each reaction tube. In addition, according to the sequencing requirements, add a different adapter to each sample so that no two samples in the same lane use the same adapter, and record the adapter used for each sample.
Table 5:
Figure GDA0002757076870000131
iv. Mix well, place the tubes in the PCR instrument, set the temperature to 20 ℃, and react for 15 min.
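The master-mix rule from the preparation note (105% of the required amount of each component) and the per-reaction volumes in Table 4 can be sketched as follows. This is a minimal illustration only; the function name `master_mix` and the dictionary layout are assumptions made for this sketch and are not part of the protocol.

```python
# Per-reaction volumes (in μL) copied from Table 4.
LIGATION_MIX_UL = {
    "PCR-grade water": 5.0,
    "Ligation Buffer": 30.0,
    "DNA Ligase": 10.0,
}


def master_mix(per_reaction_ul, n_samples, excess=0.05):
    """Scale per-reaction volumes to n_samples reactions plus a fractional excess.

    `master_mix` is a hypothetical helper; the 5% excess follows the
    105% rule stated in the library preparation note.
    """
    factor = n_samples * (1.0 + excess)
    volumes = {name: round(vol * factor, 2) for name, vol in per_reaction_ul.items()}
    volumes["Total volume"] = round(sum(per_reaction_ul.values()) * factor, 2)
    return volumes


mix = master_mix(LIGATION_MIX_UL, n_samples=8)
for component, volume in mix.items():
    print(f"{component}: {volume} μL")
```

For eight reactions this reproduces the right-hand column of Table 4, which is a quick way to check a hand-calculated master mix before pipetting.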
hh) DNA purification
i. Prepare 80% ethanol (for example, 50 mL of 80% ethanol: 40 mL of absolute ethanol + 10 mL of nuclease-free water). The 80% ethanol should be prepared fresh just before use.
ii. Prepare the corresponding number of 1.5 mL sample tubes and label them accordingly.
iii. Mix the room-temperature-equilibrated magnetic beads thoroughly by shaking and dispense 88 μL into each tube.
iv. Mix the adapter-ligated DNA with the magnetic beads. Let stand at room temperature for 10 min.
v. Place the 1.5 mL sample tube on a magnetic rack to collect the beads until the solution is clear.
vi. Carefully remove the supernatant, add 200 μL of 80% ethanol, rotate the sample tube horizontally through 360 degrees, let it stand for 30 s, and discard the supernatant. (Keep the tube on the magnetic rack throughout this step.)
vii. Repeat step vi once.
viii. Remove all residual ethanol. Open the tube lid and dry the magnetic beads at room temperature so that the ethanol evaporates; excess ethanol would impair the enzymes in the subsequent reaction system. Note: do not over-dry the beads, otherwise the DNA is not easily eluted from them and yield is lost. Drying is complete when the surface of the beads is no longer glossy.
ix. Add 21 μL of nuclease-free water to each sample tube, resuspend the beads, mix well, and let stand at room temperature for 5 min.
x. Prepare a new batch of 200 μL PCR tubes and label the tube lids with the corresponding sample numbers.
xi. Place the sample tube on the magnetic rack to collect the beads until the solution is clear, and transfer the supernatant into the PCR tube with the corresponding number for use as the template in the PCR experiment.
ii) library amplification
i. Library amplification reaction systems were prepared as shown in Table 6.
Table 6:
Figure GDA0002757076870000132
Figure GDA0002757076870000141
ii. Add 30 μL of the Pre-PCR amplification reaction system to each 0.2 mL sample tube, mix gently, centrifuge briefly at low speed, and place the tubes in the PCR instrument.
iii. Program the PCR instrument as shown in Table 7, adjusting the number of PCR cycles according to the amount of input DNA.
Table 7:
Figure GDA0002757076870000142
iv. Select the number of cycles with reference to Table 8.
Table 8:
Amount of input DNA (ng)    PCR cycles
X > 50 ng                   4
25 ng < X ≤ 50 ng           5
10 ng < X ≤ 25 ng           6
X ≤ 10 ng                   7
After the end of the Pre-PCR reaction, library purification was started.
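The cycle selection in Table 8 is a simple threshold lookup, sketched below. The function name `pcr_cycles` is hypothetical and introduced only for illustration; the thresholds are those of Table 8.

```python
def pcr_cycles(input_ng):
    """Return the number of Pre-PCR cycles for a given input DNA amount (ng).

    Thresholds follow Table 8; `pcr_cycles` itself is a hypothetical helper.
    """
    if input_ng <= 0:
        raise ValueError("input DNA amount must be positive")
    if input_ng > 50:
        return 4
    if input_ng > 25:
        return 5
    if input_ng > 10:
        return 6
    return 7


for amount in (80, 50, 25, 10):
    print(amount, "ng ->", pcr_cycles(amount), "cycles")
```

Note that the boundaries are inclusive on the upper side, so exactly 50 ng gives 5 cycles and exactly 10 ng gives 7 cycles.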
jj) library purification
i. Prepare the corresponding number of 1.5 mL sample tubes and label them accordingly.
ii. Mix the room-temperature-equilibrated magnetic beads thoroughly by shaking and dispense 50 μL into each tube.
iii. Mix the amplified, adapter-ligated DNA with the magnetic beads. Let stand at room temperature for 10 min.
iv. Place the 1.5 mL sample tube on a magnetic rack to collect the beads until the solution is clear.
v. Carefully remove the supernatant, add 200 μL of 80% ethanol, rotate the sample tube horizontally through 360 degrees, let it stand for 30 s, and discard the supernatant. (Keep the tube on the magnetic rack throughout this step.)
vi. Repeat step v once.
vii. Remove all residual ethanol. Open the tube lid and dry the magnetic beads at room temperature so that the ethanol evaporates; excess ethanol would impair the enzymes in the subsequent reaction system. Note: do not over-dry the beads, otherwise the DNA is not easily eluted from them and yield is lost. Drying is complete when the surface of the beads is no longer glossy.
viii. Add 35 μL of nuclease-free water to each sample tube, resuspend the beads, mix well, and let stand at room temperature for 5 min.
ix. Prepare a batch of new centrifuge tubes; mark the project, sampling date and sample name on the tube lids, and the adapter information, library construction date and concentration on the tube walls.
x. Place the 1.5 mL sample tube on the magnetic rack to collect the beads until the solution is clear, and transfer the supernatant to the corresponding new 1.5 mL centrifuge tube labeled with the sample information.
xi. Use 1 μL of sample to measure the concentration and 1 μL of sample to determine the library fragment size on the Agilent 2100, and record the corresponding information.
xii. Put the samples into the freezer box of the corresponding project and store them at -20 ℃.
xiii. After the experiment is finished, return all items, clean the bench top, turn on the ultraviolet lamp of the clean bench, and turn it off after 30 min of irradiation. Keep a detailed experimental record.
4. Library pooling
kk) preparing instruments, reagents and consumables required by the experiment.
ll) Calculate the pooling volume of each library from its measured concentration and the amount of sequencing data required for it.
mm) a new 1.5ml centrifuge tube is taken and marked. Pooling was performed according to the calculated pooling volume.
nn) after mixing well, the concentration was measured and the information was recorded.
oo) after the experiment was completed, all items were returned and the experiment table was cleaned.
5. Sequencing on machine
The pooled library was diluted and denatured with Tris-HCl and NaOH, and then loaded onto the sequencer for sequencing.
Example 2
The inventors have innovatively developed a method of Chromosome Instability (CIN) score calculation to measure chromosomal instability in cancer patients:
(1) Library construction and sequencing of sample "HZ042" were carried out as in Example 1. After the raw sequencing data were obtained, low-quality reads and the like were filtered out, and the remaining reads were aligned to the human reference genome (hg19) using the alignment software bwa.
(2) The alignment results are filtered: an alignment quality value of >30 is required, and duplicate reads, incorrectly paired reads, etc. are removed. The alignment start positions of read 1 are obtained using tools within bedtools.
(3) Based on the alignment start positions, the inventors calculate the Akaike information criterion (AIC) and the cross-validation log-likelihood corresponding to different interval sizes using a published method (Gusnanto et al. (2014)). As shown in Fig. 3, the interval size corresponding to the minimum AIC value (or the maximum log-likelihood value) is chosen, and 10,000 bp is finally selected as the interval size.
(4) The human reference genome is divided into intervals (bins) of 10,000 bp each, and the aligned reads in each bin are counted;
(5) The bins are filtered as follows: 1) mappability > 0.5; 2) proportion of N < 0.5; 3) the bin is not located in the regions listed in the files wgEncodeDacMapabilityConsensusExcludable.bed and wgEncodeDukeMapabilityRegionsExcludable.bed downloaded from UCSC; 4) the X and Y chromosomes are filtered out; 5) using the normal reference set, bins deviating by more than 4 standard deviations after inter-sample normalization (division by the sample mean) are identified and filtered out. The whole genome is divided into 309,579 bins in total, of which 251,519 remain after filtering;
(6) For each sample, the read count of each bin is corrected for bin length (divided by the non-N proportion of the bin);
(7) GC content of each bin: count the total number of A, T, C and G bases in each window (bin) and the number of G and C bases; the ratio of the latter to the former is the GC content of the window. Fig. 4 is a GC distribution diagram of the sequencing read data of the sample to be tested.
(8) Mappability calculation: using the ENCODE mappability bigWig file downloaded from UCSC, the regions in the file are matched to the bins, and the average mappability of all regions within each bin is taken as the mappability value of that bin.
(9) Bins with abnormal read counts are filtered out: only bins between the 1% and 99% quantiles are retained;
(10) The GC content and mappability of each bin are combined, the bins are grouped by this combination, and the median read count of all bins in each GC and mappability group is calculated.
(11) Using a generalized cross-validation method, the bins are divided evenly into 10 parts; a locally weighted nonparametric regression curve is fitted on 9 parts, and the remaining part is used as the test set for prediction and for calculating the AIC and the like;
the optimal value of the locally weighted nonparametric regression parameter (minimum AIC) is determined, the fitted curve is constructed, and finally the normalized depth of each bin is divided by the value predicted by the curve to obtain the corrected value.
(12) It is assumed that normal samples have few CNV changes and that inherited CNVs occur at random, so that in the normal population the corrected depth of a given bin follows a normal distribution. The inventors therefore sequenced and analyzed 300 normal individuals using the same method and obtained the mean and standard deviation of the normal distribution for each bin (the means and standard deviations of some bins are shown in Table 9 below). The Z-score of the subject is calculated from the subject's normalized depth in the same bin. If the absolute value of the subject's Z-score is greater than 3, the sample is considered to carry a deletion or amplification in that bin. The abnormal bins (markers) are selected, and the logratio of the test sample relative to the reference set is calculated.
Table 9: mean and standard deviation calculated from a reference set of nearly 300 normal samples
Figure GDA0002757076870000161
Figure GDA0002757076870000171
Figure GDA0002757076870000181
Figure GDA0002757076870000191
Figure GDA0002757076870000201
Figure GDA0002757076870000211
Figure GDA0002757076870000221
Figure GDA0002757076870000231
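The per-bin Z-score of step (12) can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names `reference_stats` and `z_scores` are hypothetical, the sample standard deviation (`statistics.stdev`) is assumed because the text does not say which estimator was used, and the depths are toy values rather than data from Table 9.

```python
from statistics import mean, stdev


def reference_stats(reference_depths):
    """Per-bin (mean, standard deviation) over the normal reference samples.

    reference_depths holds one list per reference sample, each with one
    corrected depth per bin, all in the same bin order. Hypothetical helper.
    """
    per_bin = zip(*reference_depths)  # regroup values bin by bin
    return [(mean(values), stdev(values)) for values in per_bin]


def z_scores(sample_depths, stats):
    """Z-score of each bin of the test sample against the reference set."""
    return [(x - mu) / sd for x, (mu, sd) in zip(sample_depths, stats)]


# Toy data: three reference samples and two bins (illustrative values only).
reference = [[1.0, 0.9], [1.1, 1.0], [0.9, 1.1]]
stats = reference_stats(reference)
z = z_scores([1.0, 1.5], stats)
abnormal = [abs(value) > 3 for value in z]  # |Z| > 3 marks a candidate bin
print(z, abnormal)
```

In this toy case the first bin sits on the reference mean and the second bin is far above it, so only the second bin is flagged.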
(13) The values of the bins are smoothed and outliers are corrected (smooth) using the published R software package DNAcopy (https://bioconductor.org/packages/release/bioc/html/DNAcopy.html);
(14) The CNV results for the sample are finally obtained by merging bins into segments using published algorithms such as circular binary segmentation (CBS; DNAcopy) or hidden Markov models (HMMcopy: Lai D, Ha G, Shah S (2019)).
Table 10: CNV results based on hidden Markov models for HZ042
Figure GDA0002757076870000232
Figure GDA0002757076870000241
Figure GDA0002757076870000251
Figure GDA0002757076870000261
Figure GDA0002757076870000271
Figure GDA0002757076870000281
Note: 1 represents 0 copies, homozygous deletion; 2 represents 1 copy, loss of heterozygosity; 3 represents 2 copies, normal state; 4 represents 3 copies, chromosomal gain; 5 represents 4 copies, chromosomal amplification; 6 represents 5 or more copies, high-level amplification.
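The state coding described in the note to Table 10 can be captured in a small lookup, sketched here for illustration. The names `HMM_STATES` and `describe_state` are hypothetical; the copy numbers and labels are taken from the note, with state 6 stored as its lower bound of 5 copies.

```python
# HMM state -> (copy number, label), following the note to Table 10.
# State 6 covers 5 or more copies; 5 is stored as the lower bound.
HMM_STATES = {
    1: (0, "homozygous deletion"),
    2: (1, "loss of heterozygosity"),
    3: (2, "normal"),
    4: (3, "gain"),
    5: (4, "amplification"),
    6: (5, "high-level amplification"),
}


def describe_state(state):
    """Return (copy number, label) for an HMM state from Table 10."""
    if state not in HMM_STATES:
        raise ValueError(f"unknown HMM state: {state}")
    return HMM_STATES[state]


print(describe_state(3))
print(describe_state(6))
```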
(15) The average logratio of each segment is taken as the logratio of every bin within that segment;
(16) Bins with an absolute Z-score > 3 and/or a logratio > 0.1 or < -0.1 are selected as the final markers with CNV variation. In sample HZ042, 95 bins were found to have CNV variation.
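The selection rule of step (16) can be sketched as a predicate on a bin's Z-score and segment logratio. Because the text allows "and/or", the sketch exposes both readings through a flag; the function name `is_cnv_bin` and the `require_both` parameter are assumptions made for this illustration.

```python
def is_cnv_bin(z, logratio, z_cut=3.0, lr_cut=0.1, require_both=False):
    """Decide whether a bin is kept as a CNV marker.

    Thresholds follow step (16): |Z| > 3 and/or |logratio| > 0.1.
    require_both=False applies the "or" reading, True the "and" reading.
    Hypothetical helper, not the inventors' implementation.
    """
    z_hit = abs(z) > z_cut
    lr_hit = abs(logratio) > lr_cut
    return (z_hit and lr_hit) if require_both else (z_hit or lr_hit)


print(is_cnv_bin(z=3.5, logratio=0.05))                     # Z-score alone
print(is_cnv_bin(z=3.5, logratio=0.05, require_both=True))  # needs both
print(is_cnv_bin(z=-1.0, logratio=-0.2))                    # logratio alone
```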
Based on ICGC, TCGA and tumor sequencing data owned by the inventors, a large tumor database was constructed. In total, the inventors collected whole-genome CNV variation data for more than 10,000 tumor samples. The CNV variation regions of each tumor sample are compared with each bin, and the CNV occurrence frequency f_k of each bin in the tumor database is calculated. The specific calculation is as follows: the CNV variation regions of each sample in the tumor database are compared with each marker (bin) region. If the overlap between a bin and the CNV variation regions of a sample accounts for more than 90% of the bin (overlap length / bin length > 0.9), the marker is considered to have CNV variation in that sample. The number of tumor samples with CNV variation in the bin is divided by the total number of tumor samples to obtain f_k. If no sample has CNV variation in a given bin, f_k = 1/(total number of tumor samples + 1).
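The f_k calculation described above can be sketched as follows. This is an illustration under assumptions not stated in the text: coordinates are half-open (start, end) pairs on the same chromosome as the bin, overlapping CNV regions within one sample are counted once, and the names `covered_fraction` and `cnv_frequency` are hypothetical.

```python
def covered_fraction(bin_start, bin_end, regions):
    """Fraction of the bin [bin_start, bin_end) covered by CNV regions."""
    clipped = sorted(
        (max(start, bin_start), min(end, bin_end))
        for start, end in regions
        if start < bin_end and end > bin_start
    )
    covered, cursor = 0, bin_start
    for start, end in clipped:
        start = max(start, cursor)  # skip the part already counted
        if end > start:
            covered += end - start
            cursor = end
    return covered / (bin_end - bin_start)


def cnv_frequency(bin_start, bin_end, tumor_cnvs, min_overlap=0.9):
    """f_k for one bin; tumor_cnvs holds one list of CNV regions per tumor sample."""
    total = len(tumor_cnvs)
    hits = sum(
        1 for regions in tumor_cnvs
        if covered_fraction(bin_start, bin_end, regions) > min_overlap
    )
    # With no supporting sample, fall back to 1 / (total + 1) as in the text.
    return hits / total if hits else 1 / (total + 1)


samples = [[(0, 20000)], [(0, 5000)], [], [(500, 10000)]]
print(cnv_frequency(0, 10000, samples))
print(cnv_frequency(50000, 60000, samples))
```

In the toy data, two of four samples cover more than 90% of the first bin, and no sample touches the second bin, so the fallback value applies there.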
Further, an instability index CIN score of the chromosome is calculated by the following formula,
Figure GDA0002757076870000282
Figure GDA0002757076870000283
wherein n represents the total number of window sequences;
a represents a predetermined constant, related to the window size;
l_k represents the length of the kth abnormal window;
f_k represents the probability of CNV occurrence in the kth abnormal window sequence;
abs(Z-score) represents the absolute value of the standard score of the kth window;
abs(logR) represents the absolute value of the logratio of the kth window after smoothing.
Similarly, the inventors tested normal samples, calculated the CIN score of each normal sample according to the above method and formula, and constructed a normal distribution of CIN scores from this baseline data. The probability that the subject's CIN score is abnormal with respect to this distribution is taken as the probability of cancer.
Example 3
The inventors constructed a normal distribution of CIN scores using the method of Example 2 and obtained the mean and standard deviation of this distribution, as shown in Fig. 5.
From the fitted distribution and the CIN score of the sample to be tested, P = P(x < CIS | mean, sd) can be calculated, where CIS denotes the CIN score of the sample. The p-value used to judge whether the sample is a tumor is 1 - P, and the corresponding cut-off value is 0.01.
For example, sample 18091403BP is a tumor sample (confirmed by clinical pathology). Its CIN score calculated according to the method of Example 2 is 93.48 and its p-value is 0, which is far below 0.01, so the sample is judged to be a tumor sample.
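The decision rule of this example can be sketched with the standard library's normal distribution. This is an illustration under assumptions: the p-value is read as the upper-tail probability 1 - P(x < CIS | mean, sd), the helper names are hypothetical, and the baseline mean and standard deviation below are placeholder values, since the actual values are given only in Fig. 5.

```python
from statistics import NormalDist


def tumor_p_value(cin_score, baseline_mean, baseline_sd):
    """Upper-tail probability of the CIN score under the normal baseline."""
    p_below = NormalDist(baseline_mean, baseline_sd).cdf(cin_score)
    return 1.0 - p_below


def is_tumor(cin_score, baseline_mean, baseline_sd, cutoff=0.01):
    """Call a tumor when the p-value falls below the cut-off (0.01 in the text)."""
    return tumor_p_value(cin_score, baseline_mean, baseline_sd) < cutoff


# Placeholder baseline parameters, NOT the values of Fig. 5.
BASELINE_MEAN, BASELINE_SD = 10.0, 5.0

print(tumor_p_value(93.48, BASELINE_MEAN, BASELINE_SD))
print(is_tumor(93.48, BASELINE_MEAN, BASELINE_SD))
print(is_tumor(12.0, BASELINE_MEAN, BASELINE_SD))
```

With any baseline whose mean lies many standard deviations below 93.48, the p-value rounds to 0 in floating point, which matches the reported p-value of 0 for sample 18091403BP.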
In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined by one skilled in the art without contradiction.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (36)

1. A method for determining a predetermined chromosome instability index for a test sample, comprising:
(1) dividing the reference sequence of the predetermined chromosome into a plurality of window sequence bins with the same length;
(2) comparing sequencing data from the sample to be tested with the reference sequence, wherein the sequencing data is composed of a plurality of sequencing reads;
(3) respectively counting the number of matched sequencing reads of each of the window sequences based on the comparison result of the step (2);
(4) based on the statistical result of the step (3), filtering, standardizing and correcting the number of the matched sequencing reads of each window;
(5) logarithmically processing the normalized and corrected number of matching sequencing reads for each window obtained in step (4) to obtain a log value logratio of the number of sequencing reads for each window;
(6) smoothing the logratio obtained in the step (5), merging windows and performing CNV fragmentation processing on the windows, and determining a first preselected abnormal window sequence;
(7) obtaining a standard score of the number of matched sequencing reads of each window based on the number of the normalized and corrected matched sequencing reads of each window, and determining a second preselected abnormal window sequence;
(8) determining an anomaly window sequence based on the first preselected anomaly window sequence or the second preselected anomaly window sequence determined in (6) or (7);
(9) determining, based on a database of known tumors, a frequency of occurrence of each copy number variation of the aberrant window sequence;
(10) determining an instability index of the sample to be tested for the predetermined chromosome based on the number of matched sequencing reads, the frequency of occurrence of copy number variations, the length of the window sequence, and a predetermined conventional parameter for each of a plurality of the aberrant window sequences,
wherein the second preselected anomaly window sequence is determined based on the following equation:
z_i = (x_i - μ_i) / σ_i
wherein x_i represents the number of matched sequencing reads between the corrected sequencing data of the ith window sequence from the sample to be tested and the ith window reference sequence;
μ_i represents the average of the numbers of matched sequencing reads between the corrected sequencing data of the ith window sequence from the plurality of reference set samples and the ith window reference sequence;
σ_i represents the standard deviation of the numbers of matched sequencing reads between the corrected sequencing data of the ith window sequence from the plurality of reference set samples and the ith window reference sequence;
z_i represents the standard score of the number of matched sequencing reads of the ith window;
the reference set is a sample of a known normal population;
determining an instability index CIN score of the sample to be tested against the predetermined chromosome by the following formula,
Figure FDA0002853536210000021
Figure FDA0002853536210000022
wherein n represents the total number of window sequences;
a represents a predetermined constant, related to the window size;
l_k represents the length of the kth abnormal window;
f_k represents the probability of CNV occurrence in the kth abnormal window sequence;
abs(Z-score) represents the absolute value of the standard score of the kth window;
abs(logR) represents the absolute value of the logratio of the kth window after smoothing.
2. The method of claim 1, wherein the test sample is derived from a patient suspected of having cancer.
3. The method of claim 2, wherein the sample to be tested is blood, body fluid, urine, saliva, or skin.
4. The method of claim 1, wherein the sequencing data is obtained by whole-genome library construction from plasma cell-free DNA followed by sequencing on a second-generation sequencer, and the average sequencing depth is less than 1X, 2X, 3X, 4X, or 5X.
5. The method of claim 4, wherein the second generation sequencer is XTen, NovaSeq, or NextSeq 500.
6. The method of claim 2, wherein the window sequence is determined to have a length of: 1M, 50K, 20K, 10K or 5K.
7. The method of claim 1, wherein in step (4), the match sequencing read data is obtained after GC and alignment correction.
8. The method of claim 1, wherein in step (6), the logratio of the windows after the smoothing and merging of the windows and the CNV fragmentation is greater than 0.1 or less than-0.1 is an indication that the window to which the logratio corresponds is the first preselected abnormal window sequence.
9. The method of claim 1, wherein zi is greater than 3 or less than-3 is an indication that the ith window sequence of the sample under test is a second preselected outlier window sequence.
10. The method according to claim 8 or 9, wherein in step (8) the logratio is greater than 0.1 or less than-0.1 and/or zi is greater than 3 or less than-3, being an indication that the window is an abnormal window sequence.
11. The method of claim 1, wherein the frequency of CNV occurrence in the kth abnormal window sequence is determined based on the CNV variation of WGS tumor samples, wherein an overlap between the kth abnormal window sequence interval and a CNV variation region of a tumor sample that accounts for more than 90% of the kth abnormal window sequence interval is an indication that a CNV exists in the kth abnormal window sequence interval in that tumor sample, and f_k is the ratio of the number of cancer samples having a CNV in the kth abnormal window sequence interval to the total number of said cancer samples.
12. The method of claim 1, further comprising determining a probability of cancer for the test sample based on a plurality of samples of known status and CIN score and/or standard score of the test sample.
13. A computer readable medium having stored therein instructions adapted to be processed and executed to determine a predetermined chromosome instability index for a sample to be tested by performing the following steps:
(1) dividing the reference sequence of the predetermined chromosome into a plurality of window sequence bins with the same length;
(2) comparing sequencing data from the sample to be tested with the reference sequence, wherein the sequencing data is composed of a plurality of sequencing reads;
(3) respectively counting the number of matched sequencing reads of each of the window sequences based on the comparison result of the step (2);
(4) based on the statistical result of the step (3), filtering, standardizing and correcting the number of the matched sequencing reads of each window;
(5) logarithmically processing the normalized and corrected number of matching sequencing reads for each window obtained in step (4) to obtain a log value logratio of the number of sequencing reads for each window;
(6) smoothing the logratio obtained in the step (5), merging windows and performing CNV fragmentation processing on the windows, and determining a first preselected abnormal window sequence;
(7) obtaining a standard score of the number of matched sequencing reads of each window based on the number of the normalized and corrected matched sequencing reads of each window, and determining a second preselected abnormal window sequence;
(8) determining an abnormal window sequence based on the first preselected abnormal window sequence or the second preselected abnormal window sequence determined in (6) or (7);
(9) determining, based on a database of known tumors, the frequency of copy number variation occurrence for each abnormal window sequence;
(10) determining an instability index of the sample to be tested for the predetermined chromosome based on the number of matched sequencing reads, the frequency of occurrence of copy number variations, the length of the window sequence, and a predetermined conventional parameter for each of a plurality of the abnormal window sequences,
wherein, in step (7), the second preselected abnormal window sequence is determined based on the following equation:
zi=(xi-μi)/σi
wherein xi represents the number of corrected sequencing reads from the sample to be tested that match the reference sequence of the ith window;
μi represents the mean, across the plurality of reference set samples, of the number of corrected sequencing reads that match the reference sequence of the ith window;
σi represents the standard deviation, across the plurality of reference set samples, of the number of corrected sequencing reads that match the reference sequence of the ith window;
zi represents the standard score of the number of matched sequencing reads of the ith window;
the reference set consists of samples from a known normal population;
in the step (10), the instability index CIN score of the sample to be tested for the predetermined chromosome is determined by the following formula,
Figure FDA0002853536210000041
Figure FDA0002853536210000042
wherein n represents the total number of window sequences;
a represents a predetermined constant related to the window size;
lk represents the length of the kth abnormal window;
fk represents the probability of CNV occurrence in the kth abnormal window sequence;
abs(Z-score) represents the absolute value of the standard score of the kth window;
abs(logR) represents the absolute value of the logratio of the kth window after the smoothing process.
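Steps (1), (3) and (7) of claim 13 can be illustrated with the following Python sketch, which counts reads per fixed-length window and computes zi=(xi-μi)/σi against a reference set. It is a simplified illustration: reads are represented only by their 0-based start positions, the filtering, standardization and GC/alignment correction of step (4) are assumed to have been applied already, and the sample standard deviation is used because the claim does not say whether the sample or population form is intended. All function names are hypothetical.

```python
import math
from statistics import mean, stdev
from typing import List, Sequence


def count_reads_per_window(
    read_starts: Sequence[int], chrom_length: int, window_size: int
) -> List[int]:
    """Count reads whose start position falls in each fixed-length window."""
    # The last window may be shorter if chrom_length is not a multiple of window_size.
    n_windows = math.ceil(chrom_length / window_size)
    counts = [0] * n_windows
    for pos in read_starts:
        if 0 <= pos < chrom_length:
            counts[pos // window_size] += 1
    return counts


def z_scores(
    sample: Sequence[float], reference_set: Sequence[Sequence[float]]
) -> List[float]:
    """Per-window standard score of the sample against normal reference samples."""
    scores = []
    for i, x_i in enumerate(sample):
        ref_i = [ref[i] for ref in reference_set]  # window i across reference samples
        mu_i = mean(ref_i)
        sigma_i = stdev(ref_i)  # sample standard deviation (assumption)
        scores.append((x_i - mu_i) / sigma_i)
    return scores
```

The sketch requires at least two reference samples and a nonzero spread in every window; a production pipeline would need to handle windows where σi is zero.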
14. The computer-readable medium of claim 13, wherein the test sample is derived from a patient suspected of having cancer.
15. The computer-readable medium of claim 14, wherein the sample to be tested is blood, body fluid, urine, saliva, or skin.
16. The computer readable medium of claim 15, wherein the sequencing data is obtained by whole-genome library construction from plasma cell-free DNA followed by sequencing on a second-generation sequencer, and has an average sequencing depth of less than 1X, 2X, 3X, 4X, or 5X.
17. The computer-readable medium of claim 16, wherein the second generation sequencer is X Ten, NovaSeq, or NextSeq 500.
18. The computer-readable medium of claim 13, wherein the window sequence is determined to have a length of: 1M, 50K, 20K, 10K or 5K.
19. The computer-readable medium of claim 13, wherein in step (4), the match sequencing read data is obtained after GC and alignment correction.
20. The computer-readable medium of claim 13, wherein in step (6), a logratio greater than 0.1 or less than -0.1 after the smoothing, window merging, and CNV fragmentation is an indication that the window to which the logratio corresponds is a first preselected abnormal window sequence.
21. The computer-readable medium of claim 13, wherein zi greater than 3 or less than -3 is an indication that the ith window sequence of the sample to be tested is a second preselected abnormal window sequence.
22. The computer-readable medium of claim 13, wherein in step (8), a logratio greater than 0.1 or less than -0.1 and/or zi greater than 3 or less than -3 is an indication that the window is an abnormal window sequence.
23. The computer readable medium of claim 13, wherein the frequency of CNV occurrence in the kth abnormal window sequence is determined based on CNV results of WGS tumor samples; specifically, in a WGS tumor sample, an overlap between the kth abnormal window sequence interval and a CNV region of the tumor sample that accounts for 90% or more of the kth abnormal window sequence interval indicates the presence of a CNV in that tumor sample in the kth abnormal window sequence interval, and fk is the ratio of the number of cancer samples in which a CNV exists in the kth abnormal window sequence interval to the total number of said cancer samples.
24. The computer-readable medium of claim 13, further comprising determining a probability of cancer for the test sample based on a plurality of samples of known status and the CIN score and/or standard score of the test sample.
25. A system for determining a predetermined chromosome instability index for a test sample, comprising:
a windowing means adapted to divide the reference sequence of the predetermined chromosome into a plurality of window sequence bins of the same length;
a comparison means, connected to the windowing means and adapted to compare sequencing data from the sample to be tested with the reference sequence, the sequencing data consisting of a plurality of sequencing reads;
a statistical means, connected to the comparison means and adapted to count the number of matched sequencing reads of each of the window sequences based on the comparison result obtained by the comparison means;
a correcting means, connected to the statistical means and adapted to filter, standardize and correct the number of matched sequencing reads of each window based on the statistical result obtained by the statistical means;
a log taking means, connected to the correcting means and adapted to take the logarithm of the normalized and corrected number of matched sequencing reads of each window obtained by the correcting means, to obtain a log value logratio of the number of sequencing reads of each window;
a first preselected abnormal window sequence determining means, connected to the log taking means and adapted to smooth the logratio obtained by the log taking means, merge windows and perform CNV fragmentation processing on the windows, and determine a first preselected abnormal window sequence;
a second preselected abnormal window sequence determining means, connected to the correcting means and adapted to obtain a standard score of the number of matched sequencing reads of each window based on the normalized and corrected number of matched sequencing reads of each window, and determine a second preselected abnormal window sequence;
an abnormal window sequence determining means, connected to the first preselected abnormal window sequence determining means and the second preselected abnormal window sequence determining means, and adapted to determine an abnormal window sequence based on the first preselected abnormal window sequence or the second preselected abnormal window sequence determined thereby;
a copy number variation occurrence frequency determining means, connected to the abnormal window sequence determining means and adapted to determine the copy number variation occurrence frequency of each abnormal window sequence based on a known tumor database;
an instability index determination means, connected to the copy number variation occurrence frequency determining means and adapted to determine an instability index of the sample to be tested for the predetermined chromosome based on the number of matched sequencing reads, the copy number variation occurrence frequency, the length of the window sequence, and a predetermined conventional parameter for each of a plurality of the abnormal window sequences,
wherein the second preselected abnormal window sequence determining means is adapted to determine the second preselected abnormal window sequence based on the following equation:
zi=(xi-μi)/σi
wherein xi represents the number of corrected sequencing reads from the sample to be tested that match the reference sequence of the ith window;
μi represents the mean, across the plurality of reference set samples, of the number of corrected sequencing reads that match the reference sequence of the ith window;
σi represents the standard deviation, across the plurality of reference set samples, of the number of corrected sequencing reads that match the reference sequence of the ith window;
zi represents the standard score of the number of matched sequencing reads of the ith window;
the reference set consists of samples from a known normal population;
the instability index determination means is adapted to perform the following operations: determining an instability index CIN score of the sample to be tested against the predetermined chromosome by the following formula,
Figure FDA0002853536210000061
Figure FDA0002853536210000062
wherein n represents the total number of window sequences;
a represents a predetermined constant related to the window size;
lk represents the length of the kth abnormal window;
fk represents the probability of CNV occurrence in the kth abnormal window sequence;
abs(Z-score) represents the absolute value of the standard score of the kth window;
abs(logR) represents the absolute value of the logratio of the kth window after the smoothing process.
26. The system of claim 25, wherein the test sample is derived from a patient suspected of having cancer.
27. The system of claim 26, wherein the sample to be tested is blood, body fluid, urine, saliva, or skin.
28. The system of claim 27, wherein the sequencing data is obtained by whole-genome library construction from plasma cell-free DNA followed by sequencing on a second-generation sequencer, and has an average sequencing depth of less than 1X, 2X, 3X, 4X, or 5X.
29. The system of claim 28, wherein the second-generation sequencer is an X Ten, NovaSeq, or NextSeq 500.
30. The system of claim 25, wherein the window sequence is determined to have a length of: 1M, 50K, 20K, 10K or 5K.
31. The system of claim 25, wherein the number of matched sequencing reads processed in the correcting means is obtained after GC correction and alignment correction.
32. The system of claim 25, wherein the first preselected abnormal window sequence determining means is adapted to treat a logratio greater than 0.1 or less than -0.1, after the smoothing, window merging, and CNV fragmentation, as an indication that the window to which the logratio corresponds is a first preselected abnormal window sequence.
33. The system of claim 25, wherein zi greater than 3 or less than -3 is an indication that the ith window sequence of the sample to be tested is a second preselected abnormal window sequence.
34. The system according to claim 25, wherein the abnormal window sequence determining means is adapted to treat a logratio greater than 0.1 or less than -0.1 and/or zi greater than 3 or less than -3 as an indication that the window is an abnormal window sequence.
35. The system of claim 25, wherein the frequency of CNV occurrence in the kth abnormal window sequence is determined based on CNV results of WGS tumor samples, wherein, in a WGS tumor sample, an overlap between the kth abnormal window sequence interval and a CNV region of the tumor sample that accounts for 90% or more of the kth abnormal window sequence interval is an indication that a CNV exists in that tumor sample in the kth abnormal window sequence interval, and fk is the ratio of the number of cancer samples in which a CNV exists in the kth abnormal window sequence interval to the total number of said cancer samples.
36. The system according to claim 25, further comprising a cancer probability determination means, connected to said instability index determination means and adapted to determine a cancer probability for said test sample based on a plurality of samples of known state and the CIN score and/or standard score of said test sample.
CN201910428251.2A 2019-05-22 2019-05-22 Method, system and computer readable medium for determining predetermined chromosome instability index of a sample to be tested Active CN111370056B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910428251.2A CN111370056B (en) 2019-05-22 2019-05-22 Method, system and computer readable medium for determining predetermined chromosome instability index of a sample to be tested

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910428251.2A CN111370056B (en) 2019-05-22 2019-05-22 Method, system and computer readable medium for determining predetermined chromosome instability index of a sample to be tested

Publications (2)

Publication Number Publication Date
CN111370056A CN111370056A (en) 2020-07-03
CN111370056B true CN111370056B (en) 2021-03-30

Family

ID=71209985

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910428251.2A Active CN111370056B (en) 2019-05-22 2019-05-22 Method, system and computer readable medium for determining predetermined chromosome instability index of a sample to be tested

Country Status (1)

Country Link
CN (1) CN111370056B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112397143B (en) * 2020-10-30 2022-06-21 深圳思勤医疗科技有限公司 Method for predicting tumor risk value based on plasma multi-omic multi-dimensional features and artificial intelligence
CN112669906B (en) * 2020-11-25 2021-09-28 深圳华大基因股份有限公司 Detection method, device, terminal device and computer-readable storage medium for measuring genome instability
CN112634987B (en) * 2020-12-25 2021-07-27 北京吉因加医学检验实验室有限公司 Method and device for detecting copy number variation of single-sample tumor DNA
CN114093417B (en) * 2021-11-23 2022-10-04 深圳吉因加信息科技有限公司 Method and device for identifying chromosomal arm heterozygosity loss
CN114220481B (en) * 2021-11-25 2023-09-08 深圳思勤医疗科技有限公司 Method, system and computer readable medium for completing karyotyping of a sample to be tested based on whole genome sequencing
CN114792548B (en) * 2022-06-14 2022-09-09 北京贝瑞和康生物技术有限公司 Methods, apparatus and media for correcting sequencing data, detecting copy number variations

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2015230677A1 (en) * 2014-03-11 2016-10-27 The Council Of The Queensland Institute Of Medical Research Determining cancer agressiveness, prognosis and responsiveness to treatment
GB201503023D0 (en) * 2015-02-24 2015-04-08 King S College London Chromosomal instability

Also Published As

Publication number Publication date
CN111370056A (en) 2020-07-03

Similar Documents

Publication Publication Date Title
CN111370056B (en) Method, system and computer readable medium for determining predetermined chromosome instability index of a sample to be tested
CN109686408B (en) Metagenome data analysis method and system for identifying drug-resistant gene and/or drug-resistant gene mutation site
AU2020202153B2 (en) Single-molecule sequencing of plasma DNA
EP2772549B1 (en) Method for detecting genetic variation
CN111370057B (en) Method for determining chromosome structure variation signal intensity and insert length distribution characteristics of sample and application
CN112397143B (en) Method for predicting tumor risk value based on plasma multi-omic multi-dimensional features and artificial intelligence
CN106834502A (en) A kind of spinal muscular atrophy related gene copy number detection kit and method based on gene trap and two generation sequencing technologies
WO2016049878A1 (en) Snp profiling-based parentage testing method and application
CN108315404B (en) Method and system for determining fetal beta thalassemia gene haplotype
CN109637587B (en) Method, device, storage medium, processor and method for standardizing transcriptome data expression quantity for detecting gene fusion mutation
He et al. Assessing the impact of data preprocessing on analyzing next generation sequencing data
CN107699957A (en) Fusion based on DNA, which is quantitatively sequenced, builds storehouse, detection method and its application
CN111850116A (en) Gene mutation site group of NK/T cell lymphoma, targeted sequencing kit and application
CN113265452A (en) Bioinformatics pathogen detection method based on Nanopore metagenome RNA-seq
CN111304299B (en) Primer combination, kit and method for detecting copy number variation of autosome
CN108070648B (en) Method and system for determining fetal spinal muscular atrophy (SMR) gene haplotype
CN108315403B (en) Method and system for determining fetus Duchenne muscular dystrophy gene haplotype
CN115198035A (en) Detection method for simultaneously obtaining virus integration transcript and RNA modification based on nanopore sequencing and application
CN113637747B (en) Method for determining SNV and tumor mutation load in nucleic acid sample and application
CN114107454A (en) Respiratory tract infection pathogen detection method based on macrogene/macrotranscriptome sequencing
CN111593126A (en) Primer and probe for detecting MYD88 gene L265P mutation and high-sensitivity detection method
CN112513292A (en) Method and device for detecting homologous sequence based on high-throughput sequencing
CN113186255A (en) Method and device for detecting nucleotide variation based on single molecule sequencing
CN114220481B (en) Method, system and computer readable medium for completing karyotyping of a sample to be tested based on whole genome sequencing
CN114107325B (en) Metagenome internal reference, preparation method and application thereof, and metagenome blood flow pathogen detection method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant