CN111370057B - Method for determining chromosome structure variation signal intensity and insert length distribution characteristics of sample and application - Google Patents

Method for determining chromosome structure variation signal intensity and insert length distribution characteristics of sample and application Download PDF

Info

Publication number
CN111370057B
CN111370057B CN201910704257.8A CN201910704257A CN111370057B CN 111370057 B CN111370057 B CN 111370057B CN 201910704257 A CN201910704257 A CN 201910704257A CN 111370057 B CN111370057 B CN 111370057B
Authority
CN
China
Prior art keywords
chromosome
sample
reads
predetermined
reference sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910704257.8A
Other languages
Chinese (zh)
Other versions
CN111370057A (en
Inventor
李世勇
茅矛
张锋
钟果林
陈彦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Siqin Medical Technology Co ltd
Original Assignee
Shenzhen Siqin Medical Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Siqin Medical Technology Co ltd filed Critical Shenzhen Siqin Medical Technology Co ltd
Priority to CN201910704257.8A priority Critical patent/CN111370057B/en
Publication of CN111370057A publication Critical patent/CN111370057A/en
Application granted granted Critical
Publication of CN111370057B publication Critical patent/CN111370057B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30Detection of binding sites or motifs
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10Sequence alignment; Homology search

Abstract

The invention provides a method for determining the chromosome structure variation signal intensity and the length distribution characteristics of an insert and application thereof. And in particular to a method of determining the origin of a sample. The method comprises the following steps: (1) comparing the sequencing read data of the sample chromosome with a reference genome, and determining the low-quality comparison rate and the high-quality comparison rate of the sequencing read data of the sample chromosome; (2) determining the ratio of structural variation of the sample chromosome, the content of mitochondrial DNA and the ratio of the insert with the preset length based on the reads corresponding to the high-quality comparison rate; (3) determining the probability of the sample source based on a preset tumor prediction model, the structural variation proportion of the chromosome obtained in the step (2), the content of mitochondrial DNA and the proportion of the insert with the preset length; (4) based on the probability of the sample source, the sample source is determined.

Description

Method for determining chromosome structure variation signal intensity and insert length distribution characteristics of sample and application
Technical Field
The invention relates to the field of biological information, in particular to a method for determining the intensity of a sample chromosome structure variation signal and counting the length distribution of an insert of a sample and application thereof, and more particularly to a method for determining the intensity of the sample chromosome structure variation signal and the distribution of the insert, the copy number of mitochondria and a method for determining the source of the sample.
Background
It has been reported that approximately 10% of cancers exhibit a large amount of structural chromosomal variation, chromosome disruption (Chromothripsis). There are mainly 4 structural variations occurring in chromosomes:
1. deletion of a certain segment of a chromosome, for example, Cat's syndrome is a genetic disease caused by partial deletion of chromosome 5 in humans, because sick children cry lightly and are called Cat's high pitch. The patients with the cat-cry syndrome have long distance between eyes, low ear position and slow growth and development and have serious intellectual disability; the formation of the notch wing in Drosophila is also caused by the deletion of a segment of the chromosome.
2. The phenomenon that the repetitive chromosomes increase the rod eye of a certain fragment of drosophila is caused by partial repetition on the X chromosome.
3. The inversion of a certain segment of the chromosome is reversed by 180 degrees, resulting in rearrangement within the chromosome such as female habitual abortion (long-arm inversion of chromosome 9).
4. Such a mutation often occurs in evening primrose when a certain segment of the translocation chromosome is transferred to another non-homologous chromosome or a different region on the same chromosome, such as human Chronic myeloleukemia (which is caused by the translocation of a part of chromosome 22 to chromosome 14). Some tumor samples are due to variation in chromosomal outcome, and nearly 10% of the samples have a large number of structural variables (i.e., Chromothripsis). Thus by testing the intensity of the chromosomal variation signal in the sample, the origin of the tumor can be distinguished.
Meanwhile, a plurality of recent studies show that the fragmentation characteristics of cfDNA (circulating cell-free DNA) based on low-depth whole genome sequencing can reflect the cell source and the molecular mechanism of blood entering the cfDNA to a certain extent. Typically the cfDNA fragment size distribution of healthy individuals is mostly concentrated at 167bp, which mainly reflects the location distribution of white cell nucleosomes. The distribution profile of cfDNA fragments of cancer patients is different from that of cancer patients, mainly due to the fact that the overall fragment size distribution of cfDNA is shorter than that of normal cfDNA. The ctDNA (circulating tumor DNA) from the cfDNA of a late-stage tumor patient can be effectively enriched by screening small-fragment cfDNA (such as less than 150bp), and the gene mutation information of the DNA fragments is not acquired. Therefore, the cfDNA fragmentation characteristic analysis method based on low-depth whole genome sequencing can enrich ctDNA through fragment screening, improve mutation detection limit, can be used as a supplement of a high-depth sequencing method of a target region, and is applied to clinical scenes such as scientific research and cancer screening, medication guidance, recurrence monitoring and the like.
Disclosure of Invention
The present application is based on the discovery and recognition by the inventors of the following facts and problems:
the inventors found that the probability of occurrence of structural chromosomal variation is low even in tumor samples, and in addition, there is no difference between other tumor samples and normal human samples. Therefore, the signal intensity of the chromosome structure variation is detected to predict the source of the sample, and the sensitivity is not ideal. The inventor of the application creatively fits the chromosome structure variation signal intensity and the insert length distribution characteristics, greatly improves the accuracy of determining the sample source by optimizing the fitting process and parameters, and lays a foundation for subsequent scientific research based on sample data, such as drug screening or clinical application, such as cancer screening, medication guidance and relapse monitoring.
In a first aspect of the invention, a method of determining the source of a sample is presented. According to an embodiment of the invention, the method comprises: (1) comparing the sequencing read data of the sample chromosome with a reference genome, and determining the low-quality comparison rate and the high-quality comparison rate of the sequencing read data of the sample chromosome; (2) determining the signal proportion of structural variation of the sample chromosome, the content of mitochondrial DNA and the proportion of the insert with the preset length based on the reads corresponding to the high-quality comparison rate; (3) determining the probability of the sample source based on a preset tumor prediction model, the signal proportion of the structural variation of the chromosome obtained in the step (2), the content of mitochondrial DNA and the proportion of the insert with the preset length; (4) based on the probability of the sample source, the sample source is determined. Wherein the signal proportion of the structural variation of the sample chromosome reflects the signal intensity of the structural variation of the sample chromosome. The method provided by the embodiment of the invention can accurately judge the source of the sample, and lays a solid foundation for subsequent scientific research based on sample data, such as drug screening or clinical application, such as cancer screening, medication guidance and relapse monitoring.
According to an embodiment of the present invention, the method may further include at least one of the following additional technical features:
according to an embodiment of the invention, the ratio of inserts of predetermined length comprises the ratio of the number of reads of the insert of predetermined length to the total number of sequencing reads of the sample chromosome and the ratio of the number of reads of the inserts of different predetermined length of the predetermined interval of the sample chromosome.
According to an embodiment of the invention, the low quality alignment of the sequencing read data is determined by: determining the number of sequencing reads with the alignment quality value of any end of the sequencing reads less than 30, wherein the alignment quality value of each read can be automatically output by alignment software (BWA) to represent the probability of error of the alignment position of the read, and the specific calculation method can be seen in Li H, Ruan J, Durbin R. (2008) Genome Research 18: 1851-8; determining a low quality alignment rate of sequencing read data based on a ratio of the number of sequencing reads having an alignment quality value of less than 30 on either end to the total number of sequencing reads. And then used as comparison quality control of the sample.
According to an embodiment of the invention, the high quality alignment of the sequencing read data is determined by: determining the number of sequencing reads with alignment quality values at both ends of the sequencing reads greater than 30; and determining the high-quality alignment rate of the sequencing read data based on the ratio of the number of the sequencing reads with the alignment quality values of both ends of the sequencing reads larger than 30 to the total number of the sequencing reads.
According to an embodiment of the present invention, the signal ratio of the structural variation of the chromosome is determined by the following formula:
y=x/A,
wherein y represents a signal ratio of a chromosomal structural variation,
x represents the number of anomalous alignment sequencing reads,
a represents the total number of sequencing reads for the sample chromosome.
It should be noted that the forward reads and the reverse reads described herein are forward sequencing reads and reverse sequencing reads based on a predetermined sequence, and when a sequencing read in one direction is considered as a forward read, a sequencing read in the other direction is a reverse read. For example, when "the forward reads perfectly match the first predetermined reference sequence of the first chromosome and the reverse reads perfectly match the second predetermined reference sequence" is described, it can also be understood as "the reverse reads perfectly match the first predetermined reference sequence of the first chromosome and the forward reads perfectly match the second predetermined reference sequence", both expressions are consistent with the technical solutions described in the present application.
According to a specific embodiment of the present invention, the aberrant alignment sequencing reads refer to: based on a pair of sequencing reads comprising a forward read and a reverse read of a predetermined sequence, a portion of the sequence of the forward read perfectly matches a first predetermined reference sequence of a first chromosome and another portion perfectly matches a second predetermined reference sequence, the second predetermined reference sequence is on the same chromosome as the first predetermined reference sequence and is more than 10000bp apart or the second predetermined reference sequence is on a second chromosome different from the first chromosome. The reverse reads are completely matched with the third predetermined reference sequence, and the third predetermined sequence is on the same chromosome with the first predetermined reference sequence or the second predetermined reference sequence and is positioned at a distance of less than 1000bp, and the signal ratio of the chromosome structure variation is determined by the following formula:
y1=x1/A,
wherein, y1A signal ratio representing a structural variation of the chromosome in the case of the above-mentioned abnormal alignment sequencing reads,
x1representing the number of aberrant alignment sequencing reads in the case of the aberrant alignment sequencing reads described above,
a represents the total number of sequencing reads for the sample chromosome.
For ease of understanding, applicants represent the structural chromosomal variation in the case of the above-described aberrant alignment sequencing reads as FIG. 1.
According to a specific embodiment of the present invention, the aberrant alignment sequencing reads refer to: based on a pair of sequencing reads comprising a forward read and a reverse read of a predetermined sequence, the forward read perfectly matching a first predetermined reference sequence of a first chromosome, the reverse read perfectly matching a second predetermined reference sequence, the second predetermined reference sequence being on the same chromosome as the first predetermined reference sequence and being more than 10000bp apart or the second predetermined reference sequence being on a second chromosome different from the first chromosome, the proportion of chromosomal structural variation being determined by the formula:
y2=x2/A,
wherein, y2A signal ratio representing the structural variation of the chromosome in the case of the above-mentioned abnormal alignment sequencing reads,
X2representing the number of said aberrant alignment sequencing reads in the case of said aberrant alignment sequencing reads,
a represents the total number of sequencing reads for the sample chromosome.
For ease of understanding, applicants represent the structural chromosomal variation in the case of the above-described aberrant alignment sequencing reads as FIG. 2.
According to a specific embodiment of the present invention, the aberrant alignment sequencing reads refer to: based on a pair of sequencing reads comprising a forward read and a reverse read of a predetermined sequence, a portion of the forward read and the reverse read simultaneously align to a first predetermined reference sequence of a first chromosome, the remaining portion of the forward read and the reverse read simultaneously align to a second predetermined reference sequence that is complete and the same as the starting position of the second predetermined reference sequence, the second predetermined reference sequence is on the same chromosome as the first predetermined reference sequence and is more than 10000bp apart or the second predetermined reference sequence is on a second chromosome different from the first chromosome, the signal proportion of the chromosomal structural variation is determined by the following formula:
y3=x3/A,
wherein, y3A signal ratio representing the structural variation of the chromosome in the case of the above-mentioned abnormal alignment sequencing reads,
X3representing the number of aberrant alignment sequencing reads in the case of the aberrant alignment sequencing reads described above,
a represents the total number of sequencing reads for the sample chromosome.
For ease of understanding, applicants represent the structural chromosomal variation in the case of the above-described aberrant alignment sequencing reads as FIG. 3.
According to a specific embodiment of the present invention, the aberrant alignment sequencing reads refer to: based on a pair of sequencing reads comprising a forward read and a reverse read of a predetermined sequence, the forward read or reverse read perfectly matching a sense strand of a first predetermined reference sequence and a sense strand of a second predetermined reference sequence, respectively; or a perfect match with the antisense strand of the first predetermined reference sequence and the perfect match with the antisense strand of the second predetermined reference sequence, respectively; or the forward reads perfectly match the sense strand of a first predetermined reference sequence of the first chromosome, the reverse reads perfectly match the antisense strand of a second predetermined reference sequence of the first chromosome, the second predetermined reference sequence being located at a position on the first chromosome that is smaller than the first predetermined reference sequence,
y4=x4/A,
wherein, y4A signal ratio representing the structural variation of the chromosome in the case of the above-mentioned abnormal alignment sequencing reads,
X4representing the number of said aberrant alignment sequencing reads in the case of said aberrant alignment sequencing reads,
a represents the total number of sequencing reads for the sample chromosome.
For ease of understanding, applicants represent the structural chromosomal variation in the case of the above-described aberrant alignment sequencing reads as FIG. 4.
According to an embodiment of the present invention, the content of mitochondrial DNA is determined by: determining the number of sequencing reads aligned to a reference mitochondrial gene sequence; the content of mitochondrial DNA is determined based on the ratio of the number of sequencing reads aligned to the reference mitochondrial gene sequence to the total number of sequencing reads.
According to an embodiment of the invention, determining the ratio of the number of reads of the predetermined length insert to the total number of sequencing reads of the sample chromosome is performed by: the length of the insert with the preset length is set to be 70-150 bp, 150-180 bp, 180-220 bp or 250-330 bp. The inventor finds that strong negative correlation exists between the proportion of the insert with the length of 70-150 bp (marked as P70-150) and the proportion of the insert with the length of 180-220 bp (marked as P180-220) in the same sample, and the proportion of the insert of the cancer sample and the insert of the normal sample is subjected to t test to find that the P-value of P180-220 is smaller, so that the accuracy of the result can be further improved by counting P180-220; meanwhile, the inventor finds that the insert fragment with the length of 250-330 bp is more in a cancer sample, so that the result accuracy can be further improved by counting P250-330 bp.
According to the embodiment of the invention, the method further comprises the step of determining the peak-valley distance, wherein the peak-valley distance refers to the difference of the insertion fragments with the lengths of peaks and valleys within the range of less than 150bp and the ratio of the number of sequencing reads of 2bp +/-2 bp above and below the insertion fragments to the total number of sequencing reads of the sample chromosome.
According to an embodiment of the invention, the ratio of the number of reads of the inserts of different predetermined lengths of the predetermined interval of the sample chromosome is determined by: dividing a reference gene sequence into a plurality of window intervals with the same length, wherein the size of each window interval is 100kb optionally; determining the number of reads of the insert with the preset length in each window interval, wherein the length of the insert with the preset length is 100-150 bp or 151-220 bp optionally; the ratio of the number of reads for different inserts of predetermined length within each window interval is determined. The inventors found that the difference between the total proportion of the number of sample sequencing reads for inserts of predetermined length from samples of different origin (tumor VS normal) was small, but the stability was good; the ratio of the predetermined inserts with different lengths in a certain window interval is obvious in difference, but is greatly influenced by external factors (such as GC content, sequence complexity and the like), so that the overall proportion of the number of the inserts with the predetermined lengths in the sample sequencing reads and the ratio of the number of the inserts with different predetermined lengths in the predetermined interval are comprehensively analyzed, and the authenticity of detection and prediction results can be obviously improved.
According to the embodiment of the invention, in each window interval, the method further comprises the step of correcting the number of the reading of the insertion sheet fragments with the preset length. Further improving the accuracy of judging the sample source.
According to an embodiment of the present invention, the correction process is obtained by adding a median value of the number of reads of a predetermined length of an insertion fragment within each window section to a fragment number residual. Further, the influence on the number of reads of an insert of a predetermined length due to the difference in GC and alignment ratio for each window interval can be eliminated.
According to an embodiment of the present invention, the residual of the number of fragments is obtained by: (1) determining the GC content and the comparison ratio in each window interval, wherein the GC content in each window interval is determined by the following method: counting the number of A, T, C and G bases in each window (bin), calculating the ratio of the number of GC bases to the total number of bases, namely the GC content of the window, wherein the comparison ratio (mappability) of each window interval is obtained by the following steps: according to an ENCODE's mappability bigwig file downloaded from UCSC, comparing mappability of each region in the file with divided window intervals, and calculating the average value of the comparison rate of all the intervals as the comparison rate of the interval; (2) combining and grouping the GC content and the comparison rate in each window interval obtained in the step (1) to obtain a median of the number of sequencing reads of all window intervals corresponding to each GC content and comparison rate combination; (3) constructing a fitting curve of the median value of the number of sequencing reads in a window interval corresponding to the combination of the GC content and the comparison rate relative to the GC content and the comparison rate based on a local weighted non-parametric regression method; (4) determining the number of theoretical inserts in each window interval based on the fitted curve and the GC content and the alignment ratio in each window interval; (5) and (4) subtracting the theoretical number of the inserts obtained in the step (4) from the number of the reads of the inserts with the preset length in each window interval to obtain the residual error of the number of the inserts with the preset length in each window interval. The residual error of the fragment number obtained by the method can effectively eliminate the difference of GC content and contrast ratio in each window interval.
According to an embodiment of the present invention, further comprising summing up the number of inserts of a predetermined length within a predetermined interval, the summing up process comprising summing up the number of reads of inserts of 100 to 150bp in length and the number of reads of inserts of 151 to 220bp in length, respectively,
according to the embodiment of the present invention, the length of the section after the addition processing may be 1M, 5M, 10M, and the like. According to a specific embodiment of the present invention, the length of the section after the summation processing is 5M. Thereby effectively improving the stability of the signals and the significance of the difference.
According to an embodiment of the present invention, the method further comprises dividing the sum of the number of reads of the insert with the length of 100-150 bp by the sum of the number of reads of the insert with the length of 151-22 bp to obtain the sum ratio of the number of reads of the insert.
According to an embodiment of the present invention, the method further comprises performing a principal component analysis on the summed ratio, and determining the number of the selected principal components according to the cumulative variance (or contribution ratio) being greater than a certain threshold, such as 80%, 85%, 90%, 95%.
According to an embodiment of the invention, the predetermined tumor prediction model comprises at least one selected from SVM, Lasso, GBM.
According to an embodiment of the present invention, the predetermined tumor prediction model is GBM, and the corresponding threshold is determined based on the ROC curve and a predetermined sensitivity and specificity. According to a particular embodiment of the invention, said predetermined specificity is 98%, corresponding to said threshold value being 0.45.
According to an embodiment of the invention, in step (4), the probability that the sample originates from a tumor is greater than 0.45, which is an indication that the sample originates from a tumor patient.
In summary, for ease of understanding, applicants have summarized the flow of the above-described method of determining sample sources according to an embodiment of the invention as fig. 5.
In a second aspect of the invention, a system for determining the source of a sample is provided. According to an embodiment of the invention, the system comprises: the comparison module is used for comparing the sequencing read data of the sample chromosome with the reference gene and determining the low-quality comparison rate and the high-quality comparison rate of the sequencing read data of the sample chromosome; an obtaining module, configured to determine, based on the reads corresponding to the high quality comparison ratio, a ratio of structural variation of the sample chromosome, a content of mitochondrial DNA, and a ratio of inserts of a predetermined length; a sample source probability determining module for determining the probability of the source of the sample based on the predetermined tumor prediction model, the ratio of the structural variation of the chromosome, the content of the mitochondrial DNA and the ratio of the insert with the predetermined length obtained by the obtaining module; a decision module to determine a sample source based on the probability of the sample source determined by the determine sample source probability module. The system according to the embodiment of the invention is suitable for executing the method for determining the sample source, so that the source of the sample can be accurately judged, and a solid foundation is laid for the subsequent scientific research based on the sample data, such as drug screening or clinical application, such as cancer screening, medication guidance and relapse monitoring.
For ease of understanding, applicants have represented the structure of the system for determining the source of a sample described above as FIG. 6. According to an embodiment of the invention, the system comprises: the comparison module 100 is used for comparing the sequencing read data of the sample chromosome with the reference gene, and determining the low-quality comparison rate and the high-quality comparison rate of the sequencing read data of the sample chromosome; an obtaining module 200, where the obtaining module 200 is configured to determine, based on the reads corresponding to the high quality alignment ratio output by the alignment module 100, a ratio of structural variation of a sample chromosome, a content of mitochondrial DNA, and a ratio of an insert of a predetermined length; a sample origin probability determining module 300, wherein the sample origin probability determining module 300 is configured to determine a probability of a sample origin based on a predetermined tumor prediction model, and the ratio of structural variation, the content of mitochondrial DNA, and the ratio of predetermined-length insert of the chromosome output by the obtaining module 200; a decision module 400, the decision module 400 configured to determine a sample source based on the probability of determining the sample source output by the sample source probability module 300.
The system according to the embodiment of the present invention has the same additional technical features as those of the method for determining the source of the sample, and the advantages and effects thereof are similar and will not be described herein again.
In a third aspect of the invention, an electronic device for determining a source of a sample is presented. According to an embodiment of the present invention, the electronic device includes a memory, a processor; wherein the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory, so as to implement the method for determining the source of the sample. The electronic equipment provided by the embodiment of the invention can be used for detecting whether the sample to be detected is from a tumor patient, and has the advantages of high detection accuracy and low cost.
In a fourth aspect of the invention, the invention proposes a computer-readable storage medium having a computer program stored thereon. According to an embodiment of the invention, the program, when executed by a processor, implements a method of determining a source of a sample as described above. The system according to the embodiment of the invention is suitable for executing the method for determining the sample source, so that the source of the sample can be accurately judged, and a solid foundation is laid for the subsequent scientific research based on the sample data, such as drug screening or clinical application, such as cancer screening, medication guidance and relapse monitoring.
Drawings
FIG. 1 is a schematic diagram of structural variation of chromosomes in the case of aberrant alignment sequencing reads according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of structural variation of chromosomes in the case of aberrant alignment sequencing reads according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of structural variation of chromosomes in the case of aberrant alignment sequencing reads according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of structural variation of chromosomes in the case of 3 aberrant alignment sequencing reads according to an embodiment of the present invention;
FIG. 5 is a flow diagram of a method of determining a source of a sample according to an embodiment of the invention;
FIG. 6 is a block diagram of a system for determining a source of a sample in accordance with an embodiment of the present invention;
FIG. 7 is a graph showing the difference in distribution of the intensity of a chromosome structural variation signal in a cancer sample and a normal sample, when compared using a t-test, between two groups, which are significantly different (P-value < 0.05);
FIG. 8 shows the insert length for a sample according to an embodiment of the present invention, wherein the arrows indicate local peaks (secondary peaks) smaller than 150 bp;
FIG. 9 is a difference in distribution of the ratio between the normal sample and the cancer sample between the inserts 180 to 220 according to an embodiment of the present invention (P-value is obtained by t-test);
FIG. 10 is a Principal Component Analysis (PCA) result according to the present embodiment; and
fig. 11 is a predictive ROC curve on a test set for 10-fold cross-validation according to the present implementation, wherein the average specificity and sensitivity during 10-fold cross-validation is shown in black bold lines in the figure.
Detailed description of the preferred embodiments
Example 1 sample plasma separation, library preparation, on-machine sequencing;
1. plasma separation
a) Preparing instruments, reagents and consumables required by the experiment, and precooling the high-speed refrigerated centrifuge to 4 ℃ in advance.
b) If peripheral blood samples were collected using EDTA anticoagulation tubes, they were immediately placed in a 4 ℃ freezer after blood withdrawal and plasma separation was performed within 2 hours. If the peripheral blood sample is collected by using a free nucleic acid storage tube such as a streck tube, the peripheral blood sample can be left at room temperature and separated into plasma within a time specified in the specification of the blood collection tube.
c) Recording sample information, balancing a blood collection tube, replacing a high-speed refrigerated centrifuge with a horizontal rotor, and setting parameters: the temperature is 4 ℃, the centrifugal force is 1600g, and the time is 10 min. The blood collection tube was trimmed, and then placed in a centrifuge for centrifugation.
d) After centrifugation was completed, the blood collection tubes were placed on a centrifuge tube rack of a biosafety cabinet. The supernatant from the centrifuged blood collection tube was collected into a new 15mL centrifuge tube, and the tube wall was marked with the sample number and the operation time. Note that careful handling is required to collect the supernatant to avoid aspiration of leukocytes. The remaining blood cells were used to extract gDNA, and were dispensed into a new 15mL centrifuge tube, and the tube wall was labeled with the sample number and the operating time.
e) The high-speed refrigerated centrifuge is replaced by an angle rotor, and the parameters are set as follows: the temperature is 4 ℃, the centrifugal force is 16000g, and the time is 10 min. A15 mL centrifuge tube containing the supernatant was trimmed, placed in a centrifuge, and centrifuged.
f) After centrifugation was complete, 15mL centrifuge tubes containing the supernatant were placed on the centrifuge tube rack of a biosafety cabinet. The supernatant from the centrifuged tube was collected into a new 15mL tube. Care was taken to collect the supernatant and avoid aspiration precipitation. The purpose of this step is to remove impurities such as cellular debris from the plasma.
g) Storing the blood plasma and blood cells in a refrigerator at-80 deg.C for use.
h) After the experiment is finished, all the articles are returned, the experiment table top is cleaned, the ultraviolet lamp of the biological safety cabinet is turned on, and the biological safety cabinet is turned off after 30min of irradiation. Record the detailed experimental record.
cfDNA extraction
i) Preparing instruments, reagents and consumables required by the experiment. The water bath was opened and the temperature was adjusted to 60 ℃. The metal bath was opened and the temperature was adjusted to 56 ℃. Confirming the validity of the kit, whether the buffer ACB is added with proper amount of isopropanol, and whether the buffer ACW1 and the buffer ACW1 are added with proper amount of absolute ethyl alcohol.
j) Record the sample number and other information.
k) If the plasma is separated from the fresh plasma, cfDNA extraction is directly carried out. When plasma jelly exists at-80 deg.C, plasma sample is thawed, and centrifuged at 16,000x g [ fixed angle head ] under centrifugal force and at 4 deg.C for 5min to remove frozen precipitate.
l) prepare the required amount of ACL mixture according to Table 1.
Table 1: volumetric amounts of Buffer ACL and carrier RNA (dissolved in Buffer AVE) required to treat 4ml samples
Figure BDA0002151642070000091
Figure BDA0002151642070000101
m) transfer 400. mu.l of Proteinase K to a 50ml centrifuge tube containing 4ml of plasma. Vortex intermittently for 30s to mix well.
n) 3.2ml of Buffer ACL (containing 1.0. mu.g of carrier RNA) was added. Vortex vigorously and mix for 15 seconds. Ensure the centrifuge tube through violent vortex to guarantee the repeated mixing of sample and Buffer ACL, thereby realize efficient schizolysis.
o) note that: after this step, the experiment was left uninterrupted and the next lysis incubation step was immediately performed.
p) centrifuge tube followed by a water bath at 60 ℃ for 30 minutes.
q) 7.2ml of Buffer ACB were added to the above reaction mixture. The tube cap was closed and vortexed intermittently for 15s to mix well.
r) the lysates containing Buffer ACB were incubated on ice or refrigerated for 5 min.
s) assembling a suction filtration device: VacValve was inserted on a 24-well bottom, VacConnectors were inserted in the VacValve, QIAamp Mini silica gel membrane columns were attached to the VacConnectors, and finally 20ml flash tubes were inserted on the silica gel membrane columns. Ensure that the dilatation pipe is inserted compactly to prevent the sample from leaking. Note that: the 2ml collection tube was left to use until subsequent idling. And marking the sample number on a silica gel membrane column. VacValve can regulate the flow rate, VacConnectors can prevent pollution, a QIAamp Mini silica gel membrane column is used for adsorbing DNA, and a dilatation tube is used for containing large-volume plasma.
t) transferring the incubated mixture into a dilatation tube, turning on a vacuum pump, turning off the vacuum pump after the lysate in the centrifugal column is completely drained, and opening an exhaust valve at one side of the 24-hole base to release the pressure to 0 MPa. The flash tube is carefully removed and discarded.
u) to the QIAamp Mini silica gel membrane column, 600. mu.l of Buffer ACW1 was added, the exhaust valve was closed, and the vacuum pump was turned on to suction-filter the liquid. When the Buffer ACW1 in the spin column was drained, the vacuum pump was turned off and the vent valve on the base side of the 24 wells was opened to release the pressure to 0 MPa.
v) to the QIAamp Mini silica gel membrane column, 750. mu.l of Buffer ACW2 was added, the exhaust valve was closed, and the vacuum pump was turned on to suction-filter the liquid. When the Buffer ACW2 in the spin column was drained, the vacuum pump was turned off and the vent valve on the base side of the 24 wells was opened to release the pressure to 0 MPa.
w) to a QIAamp Mini silica gel membrane column, 750. mu.l of an absolute ethanol solution was added, the exhaust valve was closed, and the vacuum pump was turned on to suction-filter the liquid. And when the absolute ethyl alcohol in the centrifugal column is pumped to be dry, closing the vacuum pump, and opening an exhaust valve at one side of the 24-hole base to release the pressure to 0 MPa. And turning off the power supply of the vacuum pump.
x) cover the QIAamp Mini silica gel membrane column and remove from the vacuum manifold and place into a clean 2ml collection tube, discarding the VacConnector. The collection tube was centrifuged for 3min at full speed (20,000x g; 14,000 rpm).
y) the QIAamp Mini silica gel membrane column was placed in a new 2ml collection tube, uncapped and placed on a metal bath at 56 ℃ for drying for 10min until the silica gel membrane was completely dried.
z) the QIAamp Mini silica gel membrane column was removed and placed into a clean 1.5ml elution tube (kit-of-parts) and the used 2ml collection tube was discarded.
aa) to the center of the silica gel membrane in a QIAamp Mini silica gel membrane column, 55. mu.l of nucleic-free water was carefully added. The tube was capped and incubated at room temperature for 3 min.
bb) the elution tube was placed in a mini centrifuge at full speed (20,000x g; 14,000rpm) for 1min to elute cfDNA.
cc) quality standards and assessments
Quantitive HS quantification: 1 μ LcfDNA was used
Figure BDA0002151642070000112
The dsDNA HS Assay Kit was quantitated and the concentration was recorded.
Agilent 2100 detection: cfDNA fragment distribution was determined.
dd), returning all articles, cleaning the experiment table, turning on the ultraviolet lamp of the biological safety cabinet, and turning off after irradiating for 30 min. Record the detailed experimental record.
cfDNA library construction
ee) preparation before construction of library
i. Magnetic beads (AMPureXP beads, Beckman) for DNA purification were removed from the 4 ℃ freezer and equilibrated at room temperature for 30min before use.
And ii, taking the End Repair & A-Tailing Buffer and the End Repair & A-Tailing Buffer mix reagent out of a refrigerator at the temperature of-20 ℃, placing the reagents on an ice box, and unfreezing for later use.
And iii, recording the name of the cfDNA sample to be subjected to library construction, the sampling date and the DNA concentration on an experimental record book, and writing a serial number to facilitate later operation.
Corresponding number of 200. mu.L PCR tubes were taken and numbered (tube lid and tube wall are numbered).
v. calculating the volume of the DNA solution required by each cfDNA sample according to the standard that the initial amount of the cfDNA library is more than or equal to 10ng and less than or equal to 100ng, recording the volume on an experimental record book, and placing the corresponding volume in a corresponding 200 mu L PCR tube.
Add appropriate amount of nucleic-Free water to each 200. mu.L PCR tube to bring the final volume to 50. mu.L.
vii, annotate: the following rules should be followed for formulating all reaction systems during the library building process: if the number of the samples is less than four, a mixed system is not required to be prepared, and each sample is independently added into each component solution in the reaction system; if the amount of the reaction solution exceeds four samples, preparing a mixed system by 105 percent of the required amount of each component solution in the reaction system, and then adding the mixed system into each sample one by one.
ff) end repair & Add A
i. The end-repair & A reaction system was prepared as shown in Table 2.
Table 2:
Figure BDA0002151642070000111
Figure BDA0002151642070000121
add 10. mu.L of the above-mentioned end-repair reaction system to each 200. mu.L PCR tube, mix well and centrifuge at low speed, set up the PCR instrument, and program as in Table 3 below.
Table 3:
Figure BDA0002151642070000122
and iii, taking the reaction system out of the PCR instrument, placing the reaction system on a small yellow plate, and performing joint connection reaction.
gg) linker ligation reaction System
i. The linker ligation reaction system was prepared as shown in Table 4.
Table 4:
composition (I) 1 reaction system 8 reaction systems (5% excess)
PCR-grade water (PCR-grade water) 5μL 42μL
Ligation Buffer (Ligation Buffer) 30μL 252μL
DNA Ligase (DNA Ligase) 10μL 84μL
Total volume (Total volume) 45μL 378μL
And ii, adding 45 mu L of the reaction system into each reaction tube, mixing the mixture gently and uniformly, and centrifuging the mixture at a low speed.
Add the appropriate amount of adapter according to the amount of input DNA, which is shown in Table 5 below, and add 5. mu.L of adapter to each reaction tube. In addition, according to the sequencing requirement, different adapters are added to each sample, so that the situation that two samples use the same adapter cannot occur in the same lane, and the adapter information used by each sample is recorded.
Table 5:
Figure BDA0002151642070000123
Figure BDA0002151642070000131
and iv, mixing uniformly, putting into a PCR instrument, setting the temperature to be 20 ℃, and reacting for 15 min.
hh) DNA purification
i. 80% ethanol (for example, 50mL of 80% ethanol: 40mL of anhydrous ethanol +10mL of clean-free Water) is prepared, and the 80% ethanol should be prepared as it is.
Prepare a corresponding number of 1.5mL sample tubes and mark them accordingly.
The beads equilibrated at room temperature were mixed well with shaking and dispensed into 88. mu.L each tube.
And iv, mixing the DNA added with the adapter with the magnetic beads. Standing at room temperature for 10 min.
v. place 1.5mL sample tube on magnetic rack for magnetic bead adsorption until the solution is clear.
Carefully remove the supernatant, add 200 μ L80% ethanol, rotate the sample tube horizontally 360 degrees, stand for 30s and discard the supernatant. (this process, the centrifuge tube was kept on the magnetic stand.)
Repeating the steps once.
All remaining alcohol solution should be removed. And opening the tube cover, drying the magnetic beads at normal temperature, and volatilizing ethanol to prevent excessive ethanol from influencing the effect of the enzyme in a subsequent reaction system. Note that: the beads cannot be dried too much, which would otherwise result in DNA not being easily eluted from the beads, resulting in yield loss. When the surface of the magnetic beads is no longer glossy, the drying is finished.
Add 21. mu.L of nucleic-Free water to each sample tube, resuspend the beads, mix well and then let stand at room temperature for 5 min.
x. prepare a new batch of 200 μ L PCR tubes, with the tube lid labeled with the corresponding sample number.
And xi, placing the sample tube in a magnetic frame, carrying out magnetic bead adsorption until the solution is clarified, and transferring the supernatant into the PCR tube with the corresponding number to be used as a template of the PCR experiment.
ii) library amplification
i. Library amplification reaction systems were prepared as shown in Table 6.
Table 6:
Figure BDA0002151642070000132
and ii, adding 30 mu L of Pre-PCR amplification reaction system into each 0.2mL sample tube, gently mixing uniformly, centrifuging at low speed, and placing into a PCR instrument for reaction.
The PCR machine was programmed as follows, and the PCR cycles were adjusted appropriately according to the amount of input DNA, see Table 7.
Table 7:
Figure BDA0002151642070000141
cycle number selection reference table 8.
Table 8:
amount of Input DNA (ng) PCR cycle
X>50ng 4
25ng<X≤50ng 5
10ng<X≤25ng 6
X≤10ng 7
After the end of the Pre-PCR reaction, library purification was started.
jj) library purification
i. A corresponding number of 1.5mL sample tubes are prepared and labeled accordingly.
The beads equilibrated at room temperature were mixed well with shaking and 50. mu.L of each tube was dispensed.
And iii, mixing the DNA added with the adapter with the magnetic beads. Standing at room temperature for 10 min.
Placing a 1.5mL sample tube on a magnetic rack, and performing magnetic bead adsorption until the solution is clarified.
v. carefully remove the supernatant, add 200 μ L80% ethanol, rotate the sample tube horizontally 360 degrees, stand for 30s and discard the supernatant. (this process, the centrifuge tube was kept on the magnetic stand.)
Repeating the steps once.
All remaining alcohol solution should be removed. And opening the tube cover, drying the magnetic beads at normal temperature, and volatilizing ethanol to prevent excessive ethanol from influencing the effect of the enzyme in a subsequent reaction system. Note that: the beads cannot be dried too much, which would otherwise result in DNA not being easily eluted from the beads, resulting in yield loss. When the surface of the magnetic beads is no longer glossy, the drying is finished.
Add 35. mu.L of nucleic-Free water to each sample tube, resuspend the beads, mix well and then let stand at room temperature for 5 min.
Preparing a batch of new centrifuge tubes, and marking the items, the sampling date and the sample name on tube covers; and marking joint information, database building date and concentration on the pipe wall.
And x, placing the 1.5mL sample tube on a magnetic rack, carrying out magnetic bead adsorption until the solution is clarified, and transferring the supernatant to a corresponding new 1.5mL centrifuge tube written with sample information.
Taking 1ul sample for concentration determination, determining the size of the library fragment by using 1ul sample through Agilent 2100, and recording corresponding information.
And xi, putting the sample into a freezing storage box of a corresponding project, and storing at-20 ℃.
And xiii, after the experiment is finished, returning all the articles, cleaning the experiment table top, turning on the ultraviolet lamp of the ultra-clean workbench, and turning off the ultraviolet lamp after irradiating for 30 min. Detailed experimental information was recorded.
4. Library pooling
kk) preparing instruments, reagents and consumables required by the experiment.
ll) pooling volume was calculated according to the concentration measured and the amount of data that needed to be measured.
mm) a new 1.5ml centrifuge tube is taken and marked. Pooling was performed according to the calculated pooling volume.
nn) after mixing well, the concentration was measured and the information was recorded.
oo) after the experiment was completed, all items were returned and the experiment table was cleaned.
5. Sequencing on machine
The above pooling library was denatured by dilution with Tris-HCl and NaOH, and then subjected to on-machine sequencing.
Example 2
First, the method of example 1 was followed to perform the pooling sequencing of the samples, obtain off-line data, filter out low quality reads, and align these sequencing reads to the human reference genome using alignment software (bwa) (hg 19).
Secondly, counting the bam files after comparison:
1. filtering reads and duplicate reads that are aligned on only one side (duplicate reads labeled with samtools or picard)
2. Counting the low-quality comparison rate: low-quality comparison is comparison reads with the comparison quality value of less than 30 of reads at any end, all the reads are added, and the total number of the reads is divided to obtain the low-quality comparison rate;
3. comparing abnormal alignments with high (greater than 30) alignment quality values, counting reads signal intensities supporting structural variation: including 1) as shown in FIG. 4 below, statistics of the number of reads (reads) with abnormal 3 alignment directions; normally, two ends of reads are aligned to a sense strand (F: forward) and an antisense strand (reverse), and the alignment position of the sense strand is smaller than that of the antisense strand, which is commonly called FR alignment. Thus, when two reads are aligned to the sense strand (FF) simultaneously, to the antisense strand (RR) simultaneously, and at a position aligned to the antisense strand at one end less than the position aligned to the sense strand at the other end (RF), all three alignments are likely to be due to structural variation, and are named "FF _ RR _ RF". 2) In addition to the above, as shown in fig. 2: two ends of reads are aligned to different chromosomes or aligned to the same chromosome, but the distance between the two ends is more than 10000bp, the normal distance is less than 1000bp, and the method is named as Diff _ Chr _ Pos _ PE. 3) As shown in fig. 1: reads at one end, split into two alignments from the middle, align to two different locations on the genome (different chromosomes, or two distances >10000bp), respectively, and this alignment is called "splice alignment". The other normal alignment of reads is on human chromosome, and the distance between the normal alignment and any part of splicing alignment is not more than 1000bp, and the alignment is named as 'PE _ Sclip'; 4) as shown in fig. 3: the alignment of both ends of reads was close to the starting position (<150), and both reads were splice aligned, while the splice break points of both reads were at the same position, named "twos plicemap". And (4) counting the number of each abnormal read according to the above, and dividing the ratio obtained by the total comparison read number of the sample to obtain the chromosome structure abnormal signal intensity value. Based on example 1, the inventors analyzed 400 cases of cancer and normal samples, and found that the intensity of the chromosomal structural variation signal was significantly higher in the cancer sample than in the normal sample (P-value <0.05) using t-test, and the results are shown in fig. 7. The values of the chromosomal abnormality signals of the samples in this example are shown in Table 9.
Table 9:
Figure BDA0002151642070000161
Figure BDA0002151642070000171
Figure BDA0002151642070000181
Figure BDA0002151642070000191
Figure BDA0002151642070000201
Figure BDA0002151642070000211
Figure BDA0002151642070000221
Figure BDA0002151642070000231
Figure BDA0002151642070000241
Figure BDA0002151642070000251
4. for normal alignment reads with high alignment quality (>30), the distribution of insert lengths (distance from normal alignment to both ends of reads on the chromosome) was counted. Many studies report that the length of the insert of the free DNA fragment derived from cancer tumor cells is smaller than that of the insert of the free DNA fragment derived from normal cells. As shown in FIG. 8, the insert distribution map of a sample of an embodiment. According to the distribution, the inventor counts the proportion of the insert segments between 70-150, 180-220, 250-300, which is marked as P70_150, P180_220, P250_ 300. Meanwhile, a strong negative correlation exists between P70_150 and P180_220 through comparison. Therefore, one of them was selected as the predictive feature P180_220 for cancer and normal (FIG. 9: t-test between cancer and normal samples, P180_220 has a smaller P-value). Meanwhile, as shown in FIG. 8, in the portion of less than 150bp, there are small peaks and valleys (indicated by arrows in the figure), and the same lines exist for different samples, so the inventors counted the difference before each secondary peak (insert length corresponding to peak: 8192102112122134) and the corresponding valley (insert length corresponding to valley: 8496106116126137. add up 6 differences, named "peak-valley distance";
5. based on the reference (Jiang, P., et al.,. Proc Natl Acad Sci U S A,2015.112(11): p.E1317-25.), the mitochondrial DNA copy number of cancer samples was significantly higher than that of normal samples. The inventor uses the normal comparison to obtain the number of reads on human mitochondria, and divides the total comparison number of reads to measure the content of mitochondrial DNA; the statistics for the near 400 samples are shown in table 10 below.
Table 10:
Figure BDA0002151642070000252
Figure BDA0002151642070000261
Figure BDA0002151642070000271
Figure BDA0002151642070000281
Figure BDA0002151642070000291
Figure BDA0002151642070000301
Figure BDA0002151642070000311
Figure BDA0002151642070000321
Figure BDA0002151642070000331
Figure BDA0002151642070000341
6. meanwhile, the inventor divides the whole genome into regions (bins) with the size of 100kb uniformly, counts the number of reads with the length of 100 to 150bp of the inserted fragment in each interval and records the number as the number of short fragments, and counts the number of reads with the length of 151 to 220bp of the inserted fragment in each interval and records the number as the number of long fragments. Considering that the GC content and the alignment (mapability) were different for each region, the inventors corrected the number of short and long fragments, respectively, using a local weighted non-parametric regression parameter (loess). The specific process is as follows:
a) the filtering of bins includes: 1) mappability>0.6; 2) proportion of N<0.5; 3) region files not being downloaded from UCSCwgEncodeDacMapabilityConsensusExcludable.bedAndwgEncodeDukeMapabili tyRegionsExcludable.bed(ii) a 4) Filtering out X and Y chromosomes;
b) according to the GC value of each bin: counting the number of A, T, C and G bases in each window (bin); and the number of G and C. The ratio of GC is the GC content of the window.
c) Mapavailability calculation: according to the ENCODE's mappability bigwig file downloaded from UCSC, the mappability of each region in the file is compared with a bin, and the average value of the mappability of all regions in each bin is calculated as the mappability value of the bin.
d) The number of each interval, corrected with respect to the length of the bins (divided by the proportion of non-N bins);
e) and combining the GC and the mapcapability of each bin, grouping according to the combination, and simultaneously calculating the median of the numbers of reads of all the bins corresponding to each GC and mapcapability combination.
f) Using the generalized cross-validation method (loess), a fitted curve of GC and mappability was constructed with respect to the number of long or short fragments. And finally, calculating the theoretical fragment number corresponding to the region according to the GC content and maplability corresponding to each bin and the curve fitted above, and subtracting the theoretical fragment number from the fragment number counted in the interval to obtain the residual error of the fragment number.
g) Using the median of the number of the long segments or the short segments of the sample plus the residual value as the final correction value of the region; adding up the adjacent segments, and finally calculating the correction value of the number of long segments and the correction value of the number of short segments of every 5M of one region;
h) and filtering the number of fragments in each 5M interval, wherein the significance of the number of fragments in the filtered interval is required to be less than 3 times of the standard deviation, and finally 537 5M intervals are obtained.
i) For each interval after filtering, the fragment ratio of each interval is obtained by dividing the number of short fragments by the number of long fragments. The fraction ratios of the fractional samples, fractional intervals are shown in Table 11 below.
Table 11:
Figure BDA0002151642070000351
Figure BDA0002151642070000361
Figure BDA0002151642070000371
7. the fraction ratios of all above samples in each 5M interval were subjected to Principal Component Analysis (PCA) and the data were subjected to dimensionality reduction (see fig. 10), requiring a cumulative variance > 80%, so the inventors selected the top 10 principal components. Example sample (part) corresponds to the values of the first 10 principal components, as in table 12.
Table 12:
Figure BDA0002151642070000372
Figure BDA0002151642070000381
Figure BDA0002151642070000391
Figure BDA0002151642070000401
Figure BDA0002151642070000411
Figure BDA0002151642070000421
Figure BDA0002151642070000431
example 3:
based on example 2, the following were obtained: (1) signal strength to support chromosomal structural variation; (2) a ratio of mitochondrial content of the sample; (3) the ratio of the entire sample insert between 180-220 and 250-300, and the "peak-to-valley spacing" between peaks and valleys of less than 150 bp. (3) And (3) taking the values of the first 10 principal components after the dimension reduction of the ratio of the short segments to the long segments in each 5M interval through principal component analysis.
These statistics of the samples are input as feature vectors, and the effect of tumor prediction is tested using machine learning methods (e.g., SVM, Lasso, GBM) and 10-fold cross-validation based on the above nearly 400 cases of cancer and normal samples. And averagely dividing the sample into 10 parts, and establishing a tumor prediction model by sequentially using 9 parts of data as a training set. The remaining one is used as a training set to measure the prediction effect of the model. AUC values (defined as the area under the ROC curve bounded by the coordinate axes) were calculated for each test set, and are shown in detail in FIG. 11, where the black bold lines represent the mean of 10 fold cross-validation. Wherein the mean AUC value is 0.864 (10 fold cross validation AUC mean for LASSO 0.871, SVM 0.867), with the maximum AUC value of 0.932;
based on the model selected above, a prediction model is constructed, tumor prediction is carried out on all samples (including part of third-party independent verification samples), and the probability of all samples from tumors is determined. Finally, based on the ROC curve, taking the corresponding p-value at 98% specificity as cut-off value: 0.43.
similarly, the inventors can also construct a tumor recurrence prediction model based on machine learning based on the statistical results of example 1 and example 2 performed on normal persons and tumor patients after surgery.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (25)

1. A method of determining the source of a sample, comprising:
(1) comparing the sequencing read data of the sample chromosome with a reference genome, and determining the low-quality comparison rate and the high-quality comparison rate of the sequencing read data of the sample chromosome;
(2) determining the signal proportion of structural variation of the sample chromosome, the content of mitochondrial DNA and the proportion of the insert with the preset length based on the reads corresponding to the high-quality comparison rate;
(3) determining the probability of the sample source based on a preset tumor prediction model, the signal proportion of the structural variation of the chromosome obtained in the step (2), the content of mitochondrial DNA and the proportion of the insert with the preset length;
(4) determining a sample source based on the probability of the sample source,
wherein the signal proportion of the structural variation of the chromosome is determined by the following formula:
y=x/A,
wherein y represents a signal ratio of structural variation of the chromosome,
x represents the number of anomalous alignment sequencing reads,
a represents the total number of sequencing reads of the sample chromosome;
the ratio of inserts of predetermined length comprises the ratio of the number of reads of the insert of predetermined length to the total number of sample sequencing reads and the ratio of the number of reads of inserts of different predetermined lengths for a predetermined interval of the sample chromosome.
2. The method of claim 1, wherein the low quality alignment of the sequencing read data is determined by:
determining the number of sequencing reads having an alignment quality value of less than 30 on either end;
determining a low quality alignment rate of sequencing read data based on a ratio of the number of sequencing reads having an alignment quality value of less than 30 on either end to the total number of sequencing reads.
3. The method of claim 1, wherein the high quality alignment of sequencing read data is determined by:
determining the number of sequencing reads with alignment quality values at both ends of the sequencing reads greater than 30;
and determining the high-quality alignment rate of the sequencing read data based on the ratio of the number of the sequencing reads with the alignment quality values of both ends of the sequencing reads larger than 30 to the total number of the sequencing reads.
4. The method of claim 1, wherein the aberrant alignment sequencing reads are: based on a pair of sequencing reads comprising a forward read and a reverse read of a predetermined sequence, a portion of the sequence of the forward read perfectly matches a first predetermined reference sequence of a first chromosome and another portion perfectly matches a second predetermined reference sequence, the second predetermined reference sequence is on the same chromosome as the first predetermined reference sequence and is more than 10000bp apart or the second predetermined reference sequence is on a second chromosome different from the first chromosome, the reverse read perfectly matches a third predetermined reference sequence, and the third predetermined reference sequence is on the same chromosome as the first predetermined reference sequence or the second predetermined reference sequence and is located less than 1000bp apart.
5. The method of claim 1, wherein the aberrant alignment sequencing reads are: based on a pair of sequencing reads comprising a forward read and a reverse read of a predetermined sequence, the forward read perfectly matching a first predetermined reference sequence of a first chromosome, the reverse read perfectly matching a second predetermined reference sequence, the second predetermined reference sequence being on the same chromosome as the first predetermined reference sequence and being more than 10000bp apart or the second predetermined reference sequence being on a second chromosome different from the first chromosome.
6. The method of claim 1, wherein the aberrant alignment sequencing reads are: based on a pair of sequencing reads comprising a forward read and a reverse read of a predetermined sequence, a portion of the forward read and the reverse read align to a first predetermined reference sequence of a first chromosome simultaneously, the remaining portion of the forward read and the reverse read align to a second predetermined reference sequence simultaneously and align to the same starting position of the second predetermined reference sequence, the second predetermined reference sequence is on the same chromosome as the first predetermined reference sequence and is more than 10000bp apart or the second predetermined reference sequence is on a second chromosome different from the first chromosome.
7. The method of claim 1, wherein the aberrant alignment sequencing reads are: based on a pair of sequencing reads comprising a forward read and a reverse read of a predetermined sequence, the forward read or reverse read perfectly matching a sense strand of a first predetermined reference sequence and a sense strand of a second predetermined reference sequence, respectively; or a perfect match with the antisense strand of the first predetermined reference sequence and the perfect match with the antisense strand of the second predetermined reference sequence, respectively; or the forward reads perfectly match the sense strand of a first predetermined reference sequence of the first chromosome, the reverse reads perfectly match the antisense strand of a second predetermined reference sequence of the first chromosome, the second predetermined reference sequence being located at a position on the first chromosome that is smaller than the first predetermined reference sequence.
8. The method of claim 1, wherein the content of mitochondrial DNA is determined by:
determining the number of sequencing reads aligned to a reference mitochondrial gene sequence;
the content of mitochondrial DNA is determined based on the ratio of the number of sequencing reads aligned to the reference mitochondrial gene sequence to the total number of sequencing reads.
9. The method of claim 1, wherein determining the ratio of the number of reads of the insert of the predetermined length to the total number of sequencing reads on the sample chromosome is performed by:
the length of the insert with the preset length is set to be 70-150 bp, 150-180 bp, 180-220 bp or 250-330 bp.
10. The method of claim 1, further comprising determining a peak-to-valley spacing that is a difference between the insert of corresponding peak-to-valley length and the number of sequencing reads 2bp [ ± 2bp ] above and below the insert in a range of less than 150bp as a proportion of the total number of sequencing reads for the sample chromosome.
11. The method of claim 1, wherein determining the ratio of the number of reads of inserts of different predetermined lengths for a predetermined interval of a sample chromosome is performed by:
dividing a reference gene sequence into a plurality of window intervals with the same length;
determining the number of reads of different preset-length inserts in each window interval;
the ratio of the number of reads for different inserts of predetermined length within each window interval is determined.
12. The method of claim 11, wherein the window interval is 100kb in size.
13. The method of claim 11, wherein the predetermined length insert is 100-150 bp or 151-220 bp in length.
14. The method of claim 11, further comprising correcting the number of reads for a predetermined length of interleaf fragments within each window interval.
15. The method of claim 14, wherein the correction process is obtained by adding a median of the number of reads of a predetermined length of an insertion fragment within each window interval to a fragment number residual.
16. The method of claim 15, wherein the residual of the number of fragments is obtained by:
(1) determining GC content and a comparison rate in each window interval;
(2) combining and grouping the GC content and the comparison rate in each window interval obtained in the step (1) to obtain a median of the number of sequencing reads in each window interval corresponding to the combination of the GC content and the comparison rate;
(3) constructing a fitting curve of the median value of the number of sequencing reads in a window interval corresponding to the combination of the GC content and the comparison rate relative to the GC content and the comparison rate based on a local weighted non-parametric regression method;
(4) determining the number of theoretical inserts in each window interval based on the fitted curve and the GC content and the alignment ratio in each window interval;
(5) and (4) subtracting the theoretical number of the inserts obtained in the step (4) from the number of the reads of the inserts with the preset length in each window interval to obtain the residual error of the number of the inserts with the preset length in each window interval.
17. The method according to claim 11, further comprising summing up the number of inserts of a predetermined length within a predetermined interval, the summing up comprising summing up the number of reads of inserts of 100 to 150bp in length and summing up the number of reads of inserts of 151 to 220bp in length, respectively.
18. The method of claim 17, wherein the length of the summed interval is 5M.
19. The method of claim 17, further comprising summing the number of reads of an insert having an insert length of 100-150 bp divided by the number of reads of an insert having an insert length of 151-220 bp to obtain a summed number of insert reads ratio.
20. The method of claim 19, further comprising performing a principal component analysis on the summed ratio.
21. The method of claim 1, wherein the predetermined tumor prediction model comprises at least one selected from SVM, Lasso, GBM.
22. The method of claim 1, wherein the predetermined tumor prediction model is GBM and the respective threshold is determined based on a ROC curve, a predetermined sensitivity or specificity.
23. The method of claim 22, wherein the predetermined specificity is 98% and the threshold is 0.45.
24. The method of claim 22, wherein in step (4), the probability that the sample originates from a tumor is greater than 0.45, which is an indication that the sample originated from a tumor patient.
25. A system for determining the source of a sample, comprising:
the comparison module is used for comparing the sequencing read data of the sample chromosome with the reference gene and determining the low-quality comparison rate and the high-quality comparison rate of the sequencing read data of the sample chromosome;
an obtaining module, configured to determine, based on the reads corresponding to the high quality comparison ratio, a ratio of structural variation of the sample chromosome, a content of mitochondrial DNA, and a ratio of inserts of a predetermined length;
a sample source probability determining module for determining the probability of the source of the sample based on the predetermined tumor prediction model, the ratio of the structural variation of the chromosome, the content of the mitochondrial DNA and the ratio of the insert with the predetermined length obtained by the obtaining module;
a decision module to determine a sample source based on the probability of the sample source determined by the determine sample source probability module;
wherein the ratio of structural variation of the chromosome is determined by the following formula:
y=x/A,
wherein y represents the ratio of structural variation of the chromosome,
x represents the number of anomalous alignment sequencing reads,
a represents the total number of sequencing reads of the sample chromosome;
the ratio of inserts of predetermined length comprises the ratio of the number of reads of the insert of predetermined length to the total number of sample sequencing reads and the ratio of the number of reads of inserts of different predetermined lengths for a predetermined interval of the sample chromosome.
CN201910704257.8A 2019-07-31 2019-07-31 Method for determining chromosome structure variation signal intensity and insert length distribution characteristics of sample and application Active CN111370057B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910704257.8A CN111370057B (en) 2019-07-31 2019-07-31 Method for determining chromosome structure variation signal intensity and insert length distribution characteristics of sample and application

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910704257.8A CN111370057B (en) 2019-07-31 2019-07-31 Method for determining chromosome structure variation signal intensity and insert length distribution characteristics of sample and application

Publications (2)

Publication Number Publication Date
CN111370057A CN111370057A (en) 2020-07-03
CN111370057B true CN111370057B (en) 2021-03-30

Family

ID=71207876

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910704257.8A Active CN111370057B (en) 2019-07-31 2019-07-31 Method for determining chromosome structure variation signal intensity and insert length distribution characteristics of sample and application

Country Status (1)

Country Link
CN (1) CN111370057B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112410422B (en) * 2020-10-30 2022-06-03 深圳思勤医疗科技有限公司 Method for predicting tumor risk value based on fragmentation pattern
CN112397143B (en) * 2020-10-30 2022-06-21 深圳思勤医疗科技有限公司 Method for predicting tumor risk value based on plasma multi-omic multi-dimensional features and artificial intelligence
CN113035276B (en) * 2021-03-11 2021-12-03 深圳荻硕贝肯精准医学有限公司 Method and system for analyzing heterozygous deletion of human HLA chromosome region
CN113192557B (en) * 2021-06-03 2022-01-25 中国人民解放军军事科学院军事医学研究院 Chromosome variation detection method, device, electronic equipment and medium
CN113643759B (en) * 2021-10-15 2022-01-11 臻和(北京)生物科技有限公司 Chromosome stability evaluation method and device based on liquid biopsy, terminal equipment and storage medium
CN114220481B (en) * 2021-11-25 2023-09-08 深圳思勤医疗科技有限公司 Method, system and computer readable medium for completing karyotyping of a sample to be tested based on whole genome sequencing

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105349678A (en) * 2015-12-03 2016-02-24 上海美吉生物医药科技有限公司 Detection method of chromosome copy number variation
CN107209814A (en) * 2015-01-13 2017-09-26 10X基因组学有限公司 For making structure variation and the visual system and method for phase information
WO2018204777A2 (en) * 2017-05-05 2018-11-08 The Broad Institute, Inc. Methods for identification and modification of lncrna associated with target genotypes and phenotypes
CN109402241A (en) * 2017-08-07 2019-03-01 深圳华大基因研究院 Identification and the method for analyzing ancient DNA sample

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1290217A2 (en) * 2000-02-04 2003-03-12 Aeomica, Inc. Methods and apparatus for predicting, confirming, and displaying functional information derived from genomic sequence
MX2017015148A (en) * 2015-06-03 2018-08-01 Dow Agrosciences Llc Genetic locus associated with phytophthora root and stem rot in soybean.
CN106544407B (en) * 2015-09-18 2019-11-08 广州华大基因医学检验所有限公司 The method for determining donor source cfDNA ratio in receptor cfDNA sample
CN106202991B (en) * 2016-06-30 2019-03-08 厦门艾德生物医药科技股份有限公司 The detection method of abrupt information in product is sequenced in a kind of genome multiplex amplification
CN109993305B (en) * 2018-01-03 2023-01-03 成都二十三魔方生物科技有限公司 Ancestral polymorphism prediction method based on big data artificial intelligence algorithm
CN110029157B (en) * 2018-01-11 2020-12-22 北京大学 Method for detecting haploid copy number variation of tumor single cell genome
CN108090324B (en) * 2018-01-16 2020-03-27 深圳市泰康吉音生物科技研发服务有限公司 Pathogenic microorganism identification method based on high-throughput gene sequencing data

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107209814A (en) * 2015-01-13 2017-09-26 10X基因组学有限公司 For making structure variation and the visual system and method for phase information
CN105349678A (en) * 2015-12-03 2016-02-24 上海美吉生物医药科技有限公司 Detection method of chromosome copy number variation
WO2018204777A2 (en) * 2017-05-05 2018-11-08 The Broad Institute, Inc. Methods for identification and modification of lncrna associated with target genotypes and phenotypes
CN109402241A (en) * 2017-08-07 2019-03-01 深圳华大基因研究院 Identification and the method for analyzing ancient DNA sample

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Calling known variants and identifying new variants while rapidly aligning sequence data";P.M. VanRaden等;《Journal of Dairy Science》;20190430;第102卷;第3216-3229页 *
"中华蜜蜂保种(资源)分子生物学评价方法";陈道印等;《中国蜂业》;20190630(第6期);第71-74页 *

Also Published As

Publication number Publication date
CN111370057A (en) 2020-07-03

Similar Documents

Publication Publication Date Title
CN111370057B (en) Method for determining chromosome structure variation signal intensity and insert length distribution characteristics of sample and application
Li et al. Identification of transcription factor binding sites using ATAC-seq
US20220093212A1 (en) Size-based analysis of fetal dna fraction in plasma
CN112397143B (en) Method for predicting tumor risk value based on plasma multi-omic multi-dimensional features and artificial intelligence
CN111370056B (en) Method, system and computer readable medium for determining predetermined chromosome instability index of a sample to be tested
CN110739027B (en) Cancer tissue positioning method and system based on chromatin region coverage depth
CN109767810B (en) High-throughput sequencing data analysis method and device
TW201718874A (en) Single-molecule sequencing of plasma DNA
US20200372296A1 (en) Systems and methods for determining whether a subject has a cancer condition using transfer learning
EP3405573A1 (en) Methods and systems for high fidelity sequencing
EP4004238A1 (en) Systems and methods for determining tumor fraction
CN110621785B (en) Method and device for haplotyping diploid genome based on three-generation capture sequencing
IL258999A (en) Methods for detecting copy-number variations in next-generation sequencing
US20160203260A1 (en) Applications of plasma mitochondrial dna analysis
US20210358626A1 (en) Systems and methods for cancer condition determination using autoencoders
WO2019213811A1 (en) Method, apparatus, and system for detecting chromosomal aneuploidy
CN112410422B (en) Method for predicting tumor risk value based on fragmentation pattern
US20240120026A1 (en) Method and device for extracting somatic mutations from single-cell transcriptome sequencing data
Wilmott et al. Tumour procurement, DNA extraction, coverage analysis and optimisation of mutation-detection algorithms for human melanoma genomes
ES2937408T3 (en) Massive sequencing fetal DNA analysis method and computer product
CN113862351B (en) Kit and method for identifying extracellular RNA biomarkers in body fluid sample
CN113637747B (en) Method for determining SNV and tumor mutation load in nucleic acid sample and application
CN113186255A (en) Method and device for detecting nucleotide variation based on single molecule sequencing
CN112513292A (en) Method and device for detecting homologous sequence based on high-throughput sequencing
CN115691665B (en) Transcription factor-based cancer early-stage screening and diagnosis method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant