CN114792548B - Methods, apparatus and media for correcting sequencing data, detecting copy number variations - Google Patents

Methods, apparatus and media for correcting sequencing data, detecting copy number variations Download PDF

Info

Publication number
CN114792548B
CN114792548B CN202210663961.5A CN202210663961A CN114792548B CN 114792548 B CN114792548 B CN 114792548B CN 202210663961 A CN202210663961 A CN 202210663961A CN 114792548 B CN114792548 B CN 114792548B
Authority
CN
China
Prior art keywords
window
unique
chromosome
log
level
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210663961.5A
Other languages
Chinese (zh)
Other versions
CN114792548A (en
Inventor
张钰
孙怀玉
卢玉林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Berry Genomics Co Ltd
Original Assignee
Berry Genomics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Berry Genomics Co Ltd filed Critical Berry Genomics Co Ltd
Priority to CN202210663961.5A priority Critical patent/CN114792548B/en
Publication of CN114792548A publication Critical patent/CN114792548A/en
Application granted granted Critical
Publication of CN114792548B publication Critical patent/CN114792548B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10Gene or protein expression profiling; Expression-ratio estimation or normalisation

Landscapes

  • Health & Medical Sciences (AREA)
  • Genetics & Genomics (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The present invention relates to a method, apparatus and medium for correcting sequencing data, detecting copy number variation. The method comprises the following steps: calculating the position of the read length on each unique alignment in the reference genome and the number of times each position is sequenced; dividing the genome according to windows with preset lengths so as to count the number of the reads in each window compared with the reads in the pair; performing GC correction for the number of read lengths on the comparison within each window to generate a unique log of each window via GC correction at the window level; obtaining a unique log-ratio adjustment ratio based on the unique log-ratio of the chromosome level to generate a unique log of each window corrected by GC at the chromosome level; and normalizing the unique alignment number for each window corrected by GC at the chromosome level to obtain a unique log of each window corrected by normalization and GC. The present invention can significantly improve the correction effect for regions of high GC content.

Description

Methods, apparatus and media for correcting sequencing data, detecting copy number variations
Technical Field
The present invention relates generally to biological information processing, and in particular, to methods for correcting sequencing data, methods for detecting copy number variations, computing devices, and computer storage media.
Background
Sequencing data obtained on a second generation sequencer generally show a correlation between sequencing depth and GC content, and is called GC bias (or "GC bias"). GC bias refers to regions on the genome where GC content is around 50% more easily detected, resulting in more reads, higher coverage of these regions, less easily detected in regions with high or low GC content, resulting in fewer reads, and less coverage of these regions. For more accurate subsequent analysis of the bioinformatics, corrections for GC bias are usually needed.
Conventional protocols for correcting sequencing data include, for example: the GC content is then windowed (or "bin") and then GC corrected using a similar algorithm such as Loess regression on a window basis. However, in the conventional scheme for correcting sequencing data, which is mainly based on genome-level GC correction, there are some regions with poor correction effect, especially regions with high GC content, which is not ideal, for example, it is difficult to eliminate the bias introduced by sequencing and genome characteristics.
In summary, the conventional scheme for correcting sequencing data has the following disadvantages: the correction effect for the region of high GC content is poor.
Disclosure of Invention
The present invention provides a method, a computing device and a computer storage medium for correcting sequencing data, which can significantly improve the correction effect for a region of high GC content.
Further, the present invention also provides a method for detecting copy number variation, which can ensure that the whole large-fragment CNV is not too fragmented and is sensitive enough to low-proportion chimerism. But also can ensure enough sensitivity and accuracy for small segment CNV.
According to a first aspect of the invention, a method for correcting sequencing data is provided. The method comprises the following steps: calculating the position of the read length of each unique alignment in the reference genome and the sequencing frequency of each position based on the alignment result data of the sequencing data of the sample to be detected and the human reference genome sequence; dividing the genome according to windows with preset lengths so as to count the number of the reads in each window compared with the reads in the pair; performing GC corrections for the number of read lengths on the alignment within each window to generate a unique log of bits per window via GC corrections at the window level; obtaining a unique log-ratio adjustment ratio based on the unique number-of-comparisons of the chromosome level of the sample to be tested, so as to generate a unique log of each window corrected by the GC of the chromosome level based on the unique log of each window corrected by the GC of the window level and the unique number-of-comparisons adjustment ratio; and normalizing the unique alignment number for each window corrected by GC at the chromosome level to obtain a unique log of each window corrected by normalization and GC.
According to a second aspect of the present invention, a method for detecting copy number variation is provided. The method comprises the following steps: according to the steps of performing the method of the first aspect of the invention, so as to obtain a unique log of the ratio for each window via normalization and GC correction; constructing an autosomal negative reference set and a sex chromosome negative reference set; determining a normalized copy rate of each window of the sample to be tested based on the unique log of each window of the sample to be tested through normalization and GC correction, the unique comparison number of each window of the constructed autosomal negative reference set and the sex chromosome negative reference set; combining a first preset number of continuous windows into a section so as to take the area with the copying rate of a second preset number of continuous sections meeting the preset condition as a candidate area; calculating the offset direction of each segment in the candidate region to determine the copy number variation of the first type of segment with the length in a predetermined range; and determining the copy number variation of the second type of fragments with the length larger than a preset value so as to determine the copy number variation of the sample to be tested based on the copy number variation of the first type of fragments and the copy number variation of the second type of fragments.
According to a third aspect of the present invention, there is also provided a computing device comprising: a memory configured to store one or more computer programs; and a processor coupled to the memory and configured to execute the one or more programs to cause the apparatus to perform the method of the first or second aspect of the invention.
According to a fourth aspect of the invention, there is also provided a non-transitory computer-readable storage medium. The non-transitory computer readable storage medium has stored thereon machine executable instructions which, when executed, cause a machine to perform the method of the first or second aspect of the invention.
In some embodiments, the unique alignment scores for the chromosomal level of the test sample comprise: and comparing the predicted unique comparison number of each chromosome of the sample to be detected with the actual unique comparison number of each chromosome of the sample to be detected.
In some embodiments, generating a unique alignment number for each window corrected by GC at the window level comprises: selecting windows with reading lengths on all autosomes in comparison so as to determine the proportion of GC (gas chromatography) bases on each window; and obtaining a predicted unique log of alignments over each window via a trained GC correction model based on the determined proportion of GC bases over each window for generating unique aligned alignments per window via GC correction at the window level.
In some embodiments, obtaining the predicted unique log of alignments over each window for use in generating the unique log of alignments for each window via GC corrections for the window level comprises: calculating the median of the normalized unique log of all windows on the autosome where read lengths on alignments exist; and generating a unique log for each window corrected by GC at the window level based on the predicted unique log over each window and the median of the calculated normalized unique log.
In some embodiments, generating a unique alignment number for each window corrected by GC at the chromosome level comprises: calculating the proportion of GC bases on each chromosome based on the human reference genome sequence; calculating the unique alignment number ratio of each chromosome of the negative reference set; calculating the actual unique comparison number ratio of each chromosome of the sample to be detected so as to calculate the content ratio of the actual unique comparison number ratio of each chromosome of the sample to be detected relative to the unique comparison number ratio of each chromosome of the negative reference set; and fitting via a linear regression model based on the calculated content ratio and the proportion of GC bases on each chromosome to determine regression coefficients for generating a predicted unique log for each chromosome of the sample to be tested based on the determined regression coefficients to generate a unique log for each window corrected by GC at the chromosome level.
In some embodiments, generating the predicted unique log of pairs for each chromosome based on the determined regression coefficients to generate unique aligned pairs for each window corrected by GC at the chromosome level comprises: obtaining a unique comparison number ratio adjustment ratio based on the predicted unique comparison number ratio of each chromosome and the actual unique comparison number ratio of each chromosome of each sample to be detected; and generating a unique log of each window corrected by GC at the chromosome level based on the unique log of each window corrected by GC at the window level and the unique alignment fraction scaling.
In some embodiments, determining copy number variation of fragments of the first type having a length within a predetermined range comprises: calculating the offset direction of each section in the candidate region by using a mean shift algorithm aiming at the first type of segment with the length belonging to a preset range; and merging the offset directions for the respective segments in the candidate region to determine the copy number variation of the first type segment.
In some embodiments, the copy number variation of the second type of fragments having a length greater than a predetermined value is determined: copy number variations of the second type of fragments having a length greater than a predetermined value are detected based on a cyclic binary partitioning algorithm.
In some embodiments, constructing the autosomal negative reference set and the sex chromosome negative reference set comprises: performing window-level GC correction and chromosome-level GC correction for the unique pair-log of each window of the negative reference set to obtain normalized and GC-corrected unique pair-log of each window of the negative reference set; calculating the mean and standard deviation of the unique log of the ratio over each window; determining a retention interval for each window based on the calculated mean and standard deviation to determine a retention coefficient for each window of the sample to be tested; and constructing an autosomal negative reference set and a sex chromosome negative reference set based on the retention coefficient of each window of the sample to be detected.
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the invention, nor is it intended to be used to limit the scope of the invention.
Drawings
FIG. 1 shows a schematic diagram of a system for a method of correcting sequencing data, according to an embodiment of the invention.
Fig. 2 shows a flow diagram of a method for correcting sequencing data according to an embodiment of the invention.
FIG. 3 shows a flow diagram of a method for generating a unique log of bits per window via GC correction at the window level, according to an embodiment of the invention.
FIG. 4 shows a flow diagram of a method for generating a unique log of each window via GC correction at a chromosome level, according to an embodiment of the invention.
FIG. 5 shows a flow diagram of a method for detecting copy number variation according to an embodiment of the invention.
FIG. 6 schematically shows a block diagram of an electronic device suitable for use to implement an embodiment of the invention.
Like or corresponding reference characters designate like or corresponding parts throughout the several views.
Detailed Description
Preferred embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
The term "include" and variations thereof as used herein is meant to be inclusive in an open-ended manner, i.e., "including but not limited to". Unless specifically stated otherwise, the term "or" means "and/or". The term "based on" means "based at least in part on". The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment". The term "another embodiment" means "at least one additional embodiment". The terms "first," "second," and the like may refer to different or the same object.
As described above, in the conventional methods for correcting sequencing data, which are mainly based on genome-level GC correction, there are some regions with poor correction effect, especially regions with high GC content.
To address, at least in part, one or more of the above problems, as well as other potential problems, example embodiments of the present invention propose a scheme for correcting sequencing data. The windows are divided so as to carry out GC correction on the window level and GC correction on the chromosome level on all the windows on the sample to be detected, so that the influence of different GC content areas on the correction result can be eliminated, and a better sequencing data correction effect is achieved.
Fig. 1 shows a schematic diagram of a system 100 for a method of correcting sequencing data according to an embodiment of the invention. It should be understood that the system 100 may also implement methods for detecting copy number variations according to embodiments of the present invention. As shown in fig. 1, the system 100 includes: computing device 110, sequencing device 130, network 140. In some embodiments, the computing device 110, the sequencing device 130, and the network 140 interact with each other via data.
A sequencing apparatus 130, for example, for sequencing a sample to be tested of a test object, so as to generate sequencing data on the sample to be tested; and sending the generated sequencing sequence data of the sample to be tested to the computing device 110. The sequencing data for the sample to be tested is for example low depth whole genome sequencing.
With respect to the computing device 110, it is used, for example, to calculate the position of the read length on each unique alignment in the reference genome and the number of times each position was sequenced based on the alignment result data of the sequencing data of the sample to be tested and the human reference genome sequence; the genome is divided into windows of predetermined length to count the number of reads longer than the reads on the pair within each window. The computing device 110 may also be used to make the window-level GC corrections and the chromosome-level GC corrections, e.g., the computing device 110 makes the GC corrections for the number of reads over the alignment within each window to generate a unique log of each window via the window-level GC corrections; performing GC correction at the chromosome level for the unique log per window corrected by GC correction at the window level to generate a unique log per window corrected by GC correction at the chromosome level; and normalizing the unique alignment number for each window corrected by GC at the chromosome level to obtain a unique log of each window corrected by normalization and GC.
In some embodiments, computing device 110 may have one or more processing units, including special purpose processing units such as GPUs, FPGAs, and ASICs, as well as general purpose processing units such as CPUs. In addition, one or more virtual machines may also be running on each computing device. The computing device 110 includes, for example: the positions and positions are sequenced to the number of times calculation unit 112, the number of reads on the window alignment statistics unit 114, the window-level GC correction unit 116, the chromosome-level GC correction unit 118, the unique alignment number obtaining unit 120. The above positions and positions are sequenced to the number of times calculation unit 112, the intra-window alignment read-up number statistics unit 114, the window-level GC correction unit 116, the chromosome-level GC correction unit 118, and the unique alignment number obtaining unit 120. May be configured on one or more computing devices 110.
And a sequencing-to-times-for-position calculation unit 112 for calculating the position of the read length on each unique alignment in the reference genome and the number of times each position was sequenced based on the alignment result data of the sequencing data of the sample to be tested and the human reference genome sequence.
And a statistics unit 114 for the number of reads in the intra-window alignment, which is used for dividing the genome into windows of predetermined lengths so as to count the number of reads in the alignment in each window.
A window-level GC correction unit 116 for GC correction for the number of read lengths on the alignment within each window to generate a unique log of each window corrected by GC of the window-level.
Regarding the chromosome level GC correction unit 118, it is configured to obtain a unique log-ratio adjustment ratio based on the unique alignment ratio of the chromosome level of the sample to be tested, so as to generate a unique log of each window corrected by GC of the chromosome level based on the unique log of each window corrected by GC of the window level and the unique alignment ratio adjustment ratio.
A unique alignment number obtaining unit 120 for normalizing the unique alignment number for each window corrected by GC at the chromosome level to obtain a unique log of each window corrected by normalization and GC.
A method for correcting sequencing data according to an embodiment of the present invention will be described below with reference to fig. 2. FIG. 2 shows a flow diagram of a method 200 for correcting sequencing data, according to an embodiment of the invention. It should be understood that the method 200 may be performed, for example, at the electronic device 600 depicted in fig. 6. May also be executed at the computing device 110 depicted in fig. 1. It should be understood that method 200 may also include additional acts not shown and/or may omit acts shown, as the scope of the invention is not limited in this respect.
At step 202, the computing device 110 calculates the location of the read length on each unique alignment in the reference genome and the number of times each location was sequenced based on the alignment data of the test sample with the human reference genome sequence.
The sequencing data of the sample to be tested is for example and not limited to low depth whole genome sequencing data, e.g., sequencing depth < = 3X.
For example, computing device 110 aligns the sequencing fastq data onto a human reference genomic sequence (e.g., hg19 or hg 38) using BWA or similar public software to generate alignment result data; and based on the comparison result data, reserving the read length on the unique comparison; the position of each read in the reference genome (positional information including, for example, chromosome number and genome coordinates) and the number of times each position was sequenced were calculated.
At step 204, the computing device 110 divides the genome into windows of predetermined length to count the number of reads within each window that are longer than on the pair.
For example, the computing device 110 determines whether the sequencing depth of the sequencing data of the sample to be tested is less than or equal to a predetermined sequencing depth value (e.g., the sequencing data is low-depth whole genome sequencing data with a sequencing depth < = 3X), then divides the sequencing data into a plurality of windows by a predetermined length (e.g., without limitation, 5 Kb) so as to count the number of reads within each window over a pair.
At step 206, the computing device 110 performs GC corrections for the number of read lengths over the alignment within each window in order to generate a unique log of each window via GC corrections at the window level.
With respect to a method of generating a unique log of each window corrected by GC at the window level, it includes, for example: the computing device 110 selects windows of aligned read lengths on all autosomes to determine the proportion of GC bases on each window; and obtaining a predicted unique one-bit logarithm over each window via a trained GC correction model based on the determined proportion of GC bases over each window for generating the unique one-bit logarithm per window via the first GC correction. By adopting the means, the accuracy of the unique comparison number corrected by the GC at the window level can be improved, and the method is further favorable for improving the accuracy of detecting the copy number variation based on the more accurate unique comparison number.
For example, the computing device 110 selects windows with aligned read lengths on all autosomes to determine the proportion of GC bases in each window on each chromosome of the sample to be tested; order to
Figure 722937DEST_PATH_IMAGE001
Representing the proportion of GC bases on the jth window of the ith chromosome on the kth sample of the training sample so as to
Figure 786707DEST_PATH_IMAGE002
Represents the number of reads on the alignment of the jth window of the ith chromosome on the kth sample of the training sample. The computing apparatus 110 then trains the GC correction model using smoothening spline algorithm with Y (i.e., the base-on-window GC ratio) as the independent variable and X (i.e., the number of read lengths on the unique alignment, or "log-only ratio") as the dependent variable. Inputting the proportion of GC bases on each window of all chromosomes of the sample to be detected into a trained GC correction model so as to obtain a predicted unique log on each window of the sample to be detected; then, the computing device 110 calculates the median of the normalized unique log of all windows on the autosome where there are read lengths on the alignments; and generating a unique log for each window corrected by GC at the window level based on the predicted unique log over each window and the median of the calculated normalized unique log.
The method 300 for generating the unique bitlog for each window corrected by GC at the window level will be described in detail with reference to fig. 3, and will not be described herein again.
At step 208, the computing device 110 obtains a unique log-to-ratio adjustment ratio based on the unique log-to-ratio of the chromosome level of the sample under test to generate a unique log of each window corrected via GC of the chromosome level based on the unique log of each window corrected via GC of the window level and the unique log-to-ratio adjustment ratio. The unique alignment ratio of the chromosome level of the sample to be tested includes, for example: and comparing the predicted unique comparison number of each chromosome of the sample to be detected with the actual unique comparison number of each chromosome of the sample to be detected.
A method for generating a unique log of each window corrected by GC at the chromosome level, which includes, for example: the computing device 110 calculates the proportion of GC bases on each chromosome based on the human reference genomic sequence; calculating the unique alignment number ratio of each chromosome of the negative reference set; calculating the unique comparison number ratio of each chromosome of the sample to be detected so as to calculate the content ratio of the unique comparison number ratio of each chromosome of the sample to be detected relative to the unique comparison number ratio of each chromosome of the negative reference set; and fitting via a linear regression model based on the calculated content ratio, the proportion of GC bases on each chromosome, so as to determine regression coefficients for generating a predicted unique log for each chromosome based on the determined regression coefficients to generate a unique log for each window corrected via GC at the chromosome level. For example, the computing device 110 obtains a unique alignment number ratio adjustment ratio based on the predicted unique alignment number ratio of each chromosome and the actual unique alignment number ratio of each chromosome of each sample to be tested; and generating the unique log-log per window corrected by GC at the chromosome level based on the unique log-per-window and the unique alignment score scaling of the GC at the window level. By adopting the above means, the present invention can eliminate the influence of the high GC content region or the low GC content region on the correction result.
The method 400 for generating the unique log of each window corrected by GC at the chromosome level will be described in detail below with reference to fig. 4, and will not be described herein again.
At step 210, the computing device 110 normalizes for the unique alignment number for each window via GC correction at the chromosome level to obtain a unique log of each window via normalization and GC correction. It will be appreciated that the data volume of the sequencing data for each sample is numerous and numerous, and that for comparison at the same data level, the unique alignment numbers for each window corrected by GC at the chromosome level need to be normalized.
For example, the computing device 110 normalizes the unique alignment number for each window of GC correction via the chromosome level for all samples to be tested on one chip to the same level. The algorithm for normalization is described below with reference to equation (1).
Figure 110241DEST_PATH_IMAGE003
(1)
In the above-mentioned formula (1),
Figure 638175DEST_PATH_IMAGE004
represents the data volume of the test sample via normalization (e.g., the unique log of each window of the test sample via normalization and GC correction).
Figure 425346DEST_PATH_IMAGE005
Represents the average of the data quantities representing n samples tested (e.g., the unique aligned number average for each window of GC correction via the chromosome level for a plurality of samples tested).
In the scheme, the windows are divided so as to perform window-level GC correction and chromosome-level GC correction on all the windows on the sample to be detected, so that the influence of different GC content areas on the correction result can be eliminated, and a better sequencing data correction effect is achieved. For example, for high GC regions, and for chromosomes with higher GC content, such as chromosome 16, 19, etc., the method 200 can better correct the data bias caused by GC bias.
A method for generating a unique log of each window via GC correction at the window level according to an embodiment of the present invention will be described below in conjunction with fig. 3. FIG. 3 shows a flow diagram of a method 300 for generating a unique log of bits per window via GC correction of a window level, according to an embodiment of the invention. It should be understood that the method 300 may be performed, for example, at the electronic device 600 depicted in fig. 6. May also be executed at the computing device 110 depicted in fig. 1. It should be understood that method 300 may also include additional acts not shown and/or may omit acts shown, as the scope of the invention is not limited in this respect.
At step 302, the computing device 110 selects windows of aligned read lengths on all autosomes to determine the proportion of GC bases on each window.
For example, the computing device 110 selects a window where aligned read lengths exist on all autosomes, and lets
Figure 914097DEST_PATH_IMAGE006
Represents the first
Figure 447846DEST_PATH_IMAGE007
On the sample
Figure 33548DEST_PATH_IMAGE008
First of a chromosome
Figure 322447DEST_PATH_IMAGE009
The proportion of GC bases on each window.
At step 304, the computing device 110 obtains a predicted unique log of ratios over each window based on the determined proportion of GC bases over each window via a trained GC correction model.
For example, the computing device 110 takes as the independent variable the proportion of GC bases on each window of the training sample (e.g., Y), and as the dependent variable the number of reads on the alignment on each window of the training sample (e.g., X),for example and without limitation, the smoothening spline algorithm is used to train the GC correction model. For example,
Figure 360810DEST_PATH_IMAGE010
represents the first
Figure 698251DEST_PATH_IMAGE007
On the sample
Figure 76142DEST_PATH_IMAGE008
First of a chromosome
Figure 208047DEST_PATH_IMAGE009
The number of read lengths on the alignment on each window is input into the proportion of GC bases on each window through a trained GC correction model
Figure 796023DEST_PATH_IMAGE011
) Generating a predicted unique log over each window
Figure 674505DEST_PATH_IMAGE012
). It should be understood that other algorithms besides smoothening spline may be used to train the GC correction model, and by using smoothening spline model to perform GC correction at the window level, a better data correction effect may be achieved.
At step 306, the computing device 110 calculates the median of the normalized unique log pairs of all windows on the autosome where read lengths on alignments exist.
For example, the computing device 110 calculates the median of the normalized unique log of all windows on the K sample genomes where there are read lengths on alignments, e.g.,
Figure 500378DEST_PATH_IMAGE013
at step 308, the computing device 110 generates a unique bit log for each window via GC correction at the window level based on the predicted unique bit log over each window and the median of the calculated normalized unique bit logs.
For example, the computing device 110 bases the predicted unique alignment number
Figure 68763DEST_PATH_IMAGE014
And the median of the calculated normalized unique bit logarithm
Figure 753822DEST_PATH_IMAGE013
The unique log of each window corrected by GC at the window level is generated by the method shown in the following equation (2).
Figure DEST_PATH_IMAGE015
In the above-mentioned formula (2),
Figure 292120DEST_PATH_IMAGE016
represents the unique log of each window corrected by GC at the window level.
Figure 441341DEST_PATH_IMAGE017
Represents the first
Figure 118310DEST_PATH_IMAGE007
On the sample
Figure 212037DEST_PATH_IMAGE008
First of a chromosome
Figure 426462DEST_PATH_IMAGE009
The number of read lengths on the window alignment, or "the first
Figure 633452DEST_PATH_IMAGE007
On the sample
Figure 543640DEST_PATH_IMAGE008
First of a chromosome
Figure 62346DEST_PATH_IMAGE009
The only log of the ratio over each window ".
Figure 21074DEST_PATH_IMAGE013
Represents the median of the normalized unique log of all windows on the autosome where read length on alignment is present.
Figure 879309DEST_PATH_IMAGE018
Representing the only log of the prediction over each window.
It should be understood that the GC proportion is related to the unique comparison number, and the invention can obtain the unique comparison logarithm close to the actual level by adopting the above means. In the conventional GC correction method, the whole genome is used as a GC content reference, so that the obtained correction coefficient is deviated, and the corrected unique alignment number is not accurate enough. Therefore, by adopting the means, the method can not only perform GC correction on all windows on the sample to be detected, but also improve the accuracy of the corrected unique comparison number so as to provide more accurate data signals for subsequent CNV detection.
A method for generating a unique bitlog for each window corrected by GC at the chromosome level according to an embodiment of the present invention will be described below in conjunction with fig. 4. FIG. 4 illustrates a flow diagram of a method 400 for generating a unique log of each window via GC corrections at a chromosome level, according to an embodiment of the invention. It should be understood that the method 400 may be performed, for example, at the electronic device 600 depicted in fig. 6. May also be executed at the computing device 110 depicted in fig. 1. It is to be understood that method 400 may also include additional acts not shown and/or may omit acts shown, as the scope of the invention is not limited in this respect.
At step 402, the computing device 110 calculates the proportion of GC bases on each chromosome based on the human reference genomic sequence.
For example, the computing device 110 calculates the ratio of GC bases on each chromosome from the human reference genome sequence and makes it as
Figure 22714DEST_PATH_IMAGE019
At step 404, the computing device 110 calculates a unique alignment count for each chromosome of the negative reference set.
An algorithm for calculating the unique per-chromosome one-log ratio of each autosome of the negative reference set is described below in conjunction with equation (3). That is, the computing device 110 calculates the per-chromosome unique ratio log ratio of the negative reference set autosomes based on the sum of the per-autosomal unique ratio log of the negative reference set and all the autosomal unique ratio logs of the negative reference set.
Figure 231979DEST_PATH_IMAGE020
(3)
In the above-mentioned formula (3),
Figure 994398DEST_PATH_IMAGE021
the unique alignment counts for each autosome representing the negative reference set were compared.
Figure 972719DEST_PATH_IMAGE022
The only log of each autosome representing the negative reference set.
Figure 21446DEST_PATH_IMAGE023
The sum of the unique alignment numbers of 22 autosomes representing the negative reference set.
It should be understood that the calculation method of the sex chromosome unique alignment number ratio of the negative reference set is the same as the algorithm of each chromosome unique alignment number ratio of the autosomes of the negative reference set shown in formula (3), except that the sex chromosome unique alignment number ratio pair is replaced by the autosomes unique alignment number pair of the negative reference set in the numerator.
At step 406, the computing device 110 calculates an actual unique alignment count ratio for each chromosome of the test sample to calculate a content ratio of the actual unique alignment count ratio for each chromosome of the test sample relative to the unique alignment count ratio for each chromosome of the negative reference set.
A method for calculating a unique alignment count ratio for each chromosome of a test sample includes, for example: the calculation device 110 calculates the sum of the proportion of GC bases of all windows of each autosome of the sample to be detected, the sum of the proportion of GC bases of 22 autosomes, and the unique comparison number proportion of each autosome of each sample to be detected; and calculating the unique comparison number ratio of the sex chromosomes of the sample to be detected based on the sum of the proportion of the GC bases of all windows of each sex chromosome of the sample to be detected and the sum of the proportion of the GC bases of all sex chromosomes so as to calculate the actual unique comparison number ratio of each chromosome of each sample to be detected.
An algorithm for calculating the unique alignment ratio of each autosome of each test sample is described below with reference to formula (4).
Figure 720936DEST_PATH_IMAGE024
(4)
In the above-mentioned formula (4),
Figure 21468DEST_PATH_IMAGE025
the unique alignment numbers of each autosome representing each test sample are compared.
Figure 119874DEST_PATH_IMAGE026
Represents the sum of the proportion of GC bases in all (for example, but not limited to 12500) windows of each autosome of the sample to be tested.
Figure 339502DEST_PATH_IMAGE027
Represents the sum of the proportions of GC bases of 22 autosomes.
It should be understood that the calculation method of the unique to logarithmic ratio of each sex chromosome of each test sample is the same as the algorithm of the unique to logarithmic ratio of each autosome of each test sample shown in formula (4), except that the unique to logarithmic ratio of the autosome in the numerator is replaced by the unique to logarithmic ratio of the sex chromosome. It will be appreciated that the calculation of the unique alignment ratio for each sex chromosome of the test samples for males and females is calculated separately.
An algorithm for calculating a content ratio of the unique alignment number ratio of each chromosome of the test sample to the unique alignment number ratio of chromosomes of the negative reference set is described below with reference to formula (5).
Figure 523359DEST_PATH_IMAGE028
(5)
In the above-mentioned formula (5),
Figure 627581DEST_PATH_IMAGE021
representing the unique alignment count per chromosome of the negative reference set.
Figure 111652DEST_PATH_IMAGE025
The actual unique alignment numbers representing each chromosome of each test sample are compared.
Figure 705445DEST_PATH_IMAGE029
Representing the content ratio of the unique alignment count ratio of each chromosome of the test sample relative to the unique alignment count ratio of each chromosome of the negative reference set.
At step 408, the computing device 110 fits via a linear regression model based on the calculated content ratio, the proportion of GC bases on each chromosome, to determine regression coefficients for generating a predicted unique log of each chromosome of the test sample based on the determined regression coefficients.
For example, the computing device 110 calculates, for each training sample, the ratio of GC bases on each chromosome (i.e.,
Figure 48701DEST_PATH_IMAGE030
) As an argument, and as a content ratio of the unique aligned-number ratio of each chromosome of the training sample relative to the unique aligned-number ratio of chromosomes of the negative reference set (i.e.,
Figure 18931DEST_PATH_IMAGE029
) For dependent variables, using linear models "Y = aX + b ", fitting, e.g. using least-squares, the best-fit line being selected so as to calculate the best regression coefficients
Figure 623088DEST_PATH_IMAGE031
Figure 650431DEST_PATH_IMAGE032
(i.e., using a formula)
Figure 480984DEST_PATH_IMAGE033
To pair
Figure 989326DEST_PATH_IMAGE034
Correction is performed). The predicted unique log of each chromosome is then pre-generated by a linear regression model. In some embodiments, by combining the smoothening spline model and the linear model, GC correction is performed twice at the window level and the chromosome level, respectively, so that a better data correction effect can be achieved.
An algorithm for generating a predicted unique log of each chromosome is described below in conjunction with equation (6).
Figure 713568DEST_PATH_IMAGE035
In the above-mentioned formula (6),
Figure 914742DEST_PATH_IMAGE036
representing the predicted unique log-log ratio of each chromosome.
Figure 967012DEST_PATH_IMAGE031
Figure 13465DEST_PATH_IMAGE032
Representing the regression coefficients of a linear regression model.
At step 410, the computing device 110 obtains a unique alignment number fraction adjustment ratio based on the predicted unique alignment number fraction of each chromosome and the actual unique alignment number fraction of each chromosome of each sample to be tested.
For example, the computing device 110 calculates the adjustment ratio based on a ratio of the predicted unique per-chromosome log ratio to the actual unique per-chromosome ratio of each chromosome of each sample to be tested (i.e., the actual unique per-chromosome ratio). An algorithm for calculating the ratio of the one-to-one log to the ratio adjustment is described below with reference to equation (7).
Figure 592214DEST_PATH_IMAGE037
(7)
In the above-mentioned formula (7),
Figure 964290DEST_PATH_IMAGE038
representing the unique alignment number ratio.
Figure 769435DEST_PATH_IMAGE039
Representing the actual unique alignment number ratio of each chromosome of each sample to be tested, i.e. the actual unique alignment number ratio of each chromosome
At step 412, the computing device 110 generates a unique log of each window corrected by GC at the chromosome level based on the unique log of each window corrected by GC at the window level and the unique alignment score scaling.
For example, the computing device 110 calculates the unique log of each window corrected via the chromosome level GC based on the product of the unique log of each window corrected via the window level GC and the unique alignment fraction scaling. An algorithm to compute the unique log of each window corrected by chromosome-level GC is described below in conjunction with equation (8).
Figure 416317DEST_PATH_IMAGE040
(8)
In the above-mentioned formula (8),
Figure 321343DEST_PATH_IMAGE038
representing the unique alignment number ratio.
Figure 270845DEST_PATH_IMAGE041
Represents the unique log of each window corrected by GC at chromosome level, i.e., the unique number of alignments for each window after the second GC correction. Represents the unique log of each window via GC correction at the window level, i.e., the unique log of each window after the first GC correction.
It will be appreciated that the normal reference genome has a GC content of between 30-40%, and that a high GC content chromosome (e.g., with a GC content of 45%) will have a CG deviation from that of the whole genome (say 35%) that is somewhat greater if compared to that of the whole genome (say 35%) according to conventional GC correction methods, for example. According to the invention, the unique comparison log proportion of the sample on the chromosome is compared with the corresponding unique comparison log proportion of the chromosome of the negative reference set, so that the CG deviation proportion obtained by comparison is smaller and is closer to the actual real level.
In the above scheme, the method can further eliminate the influence of the high-GC content region or the low-GC content region on the correction result by predicting the predicted value of the unique comparison ratio through the regression model based on the content ratio of the unique comparison ratio of each chromosome of the sample to be detected to the unique comparison ratio of each chromosome of the negative reference set and correcting the unique comparison ratio of each window according to the adjustment ratio determined based on the actually measured unique comparison ratio and the predicted unique comparison ratio.
A method for detecting copy number variation according to an embodiment of the present invention will be described below with reference to fig. 5. FIG. 5 shows a flow diagram of a method 500 for detecting copy number variation, according to an embodiment of the invention. It should be understood that the method 500 may be performed, for example, at the electronic device 600 depicted in fig. 6. May also be executed at the computing device 110 depicted in fig. 1. It is to be understood that method 500 may also include additional acts not shown and/or may omit acts shown, as the scope of the invention is not limited in this respect.
At step 502, the computing device 110 performs the aforementioned method 200 to obtain a unique bitlog for each window via normalization and GC correction.
At step 504, the computing device 110 constructs an autosomal negative reference set and a sex chromosome negative reference set.
Regarding the method for constructing the autosomal negative reference set and the sex chromosome negative reference set, it includes, for example: the computing device 110 performs a window-level GC correction and a chromosome-level GC correction for the unique log of each window of the negative reference set to obtain a normalized and GC-corrected unique log of each window of the negative reference set; calculating the mean and standard deviation of the unique log of the ratio over each window; determining a retention interval for each window based on the calculated mean and standard deviation to determine a retention coefficient for each window of the sample to be tested; and constructing an autosomal negative reference set and a sex chromosome negative reference set based on the retention coefficient of each window of the sample to be detected.
For example, the computing device 110 selects some negative samples for copy number variation (i.e., no change in copy number) to construct a negative reference set. The two GC corrections (i.e., the window-level GC correction and the chromosome-level GC correction) shown in method 200 are performed for the unique alignment number for each window of the negative reference set to obtain a unique log of each window of the negative reference set that is normalized and GC corrected. The computing device 110 then calculates the mean and standard deviation of the unique log over each window separately for the normalized and GC corrected unique log of the negative reference set for each window.
The algorithm for calculating the mean and standard deviation of the unique log ratio over each window of the negative reference set is described below in conjunction with equations (9) and (10).
Figure 360024DEST_PATH_IMAGE042
(9)
Figure 76176DEST_PATH_IMAGE043
(10)
In the above formulas (9) and (10),
Figure 567200DEST_PATH_IMAGE044
the mean of only one log over each window representing the negative reference set.
Figure 953182DEST_PATH_IMAGE045
Represents the standard deviation of the unique log of each window of the negative reference set.
Figure 795236DEST_PATH_IMAGE046
The unique log of each window via normalization and GC correction representing the negative reference set.
For each window, a reserve interval is calculated
Figure 49500DEST_PATH_IMAGE047
. The calculation of the reserved interval is described below with reference to equation (11)
Figure 395030DEST_PATH_IMAGE047
The algorithm of (1).
Figure 951914DEST_PATH_IMAGE048
(11)
In the above-mentioned formula (11),
Figure 78002DEST_PATH_IMAGE047
representing the reserved interval. It is understood that the distribution of the next generation sequencing data across the genome is random, assuming it conforms to a normal distribution level. Based on the theory of normal distribution, setting an accuracy of 95% (i.e., 95% of the data is accurate, falling within the retention interval), the corresponding coefficient is 1.96.
Calculation of the retention factor is described below in conjunction with equation (12)
Figure 94147DEST_PATH_IMAGE049
The method of (1). If the only comparison number of the current window belongs to the reserved interval, the reservation coefficient is 1, namely the current window belongs to a normal state, and the current window is reserved; if the only comparison number of the current window belongs to the reserved interval, the reserved coefficient is 0, the current window belongs to an abnormal state, and the current window is filtered.
Figure 559764DEST_PATH_IMAGE050
=
Figure 881024DEST_PATH_IMAGE051
。(12)
In the above-mentioned formula (12),
Figure 369774DEST_PATH_IMAGE049
representing the retention factor.
The calculation of all the windows of chromosomes for the negative reference set is described below in connection with equation (13)
Figure 903523DEST_PATH_IMAGE052
The algorithm of (1). For example, all the windows of chromosomes of the autosomal negative reference set can be constructed by formula (13) for the autosomes
Figure 285963DEST_PATH_IMAGE052
(ii) a And respectively constructing a male negative reference set and a female negative reference set according to sex aiming at sex chromosomes.
Figure 184649DEST_PATH_IMAGE053
(13)
In the above-mentioned formula (13),
Figure 223012DEST_PATH_IMAGE054
representing all windows of chromosomes of a negative reference set
Figure 560452DEST_PATH_IMAGE052
At step 506, the computing device 110 determines a normalized copy rate for each window of the sample under test based on the unique log of each window of the sample under test via normalization and GC correction, the constructed unique alignment numbers for each window of the autosomal negative reference set and the sex chromosome negative reference set. By adopting the means, the method can compare the standardized effective data volume of the sample to be detected with the normal reference level, and can obtain the standardized depth value (namely, the copy rate CopyRatio, CR) of each window of the sample to be detected.
The algorithm for calculating the normalized copy rate for each window of the test sample is described below in conjunction with equations (14) and (15). Wherein, the formula (14) indicates the algorithm of the normalized copy rate of the autosome and female sex chromosome of the sample to be tested. Equation (15) indicates the algorithm for the normalized copy rate of the male sex chromosome of the test sample.
Figure 797399DEST_PATH_IMAGE055
(14)
Figure 866986DEST_PATH_IMAGE056
(15)
In the above formulas (14) and (15),
Figure 661154DEST_PATH_IMAGE057
is representative of the sample to be tested
Figure 536706DEST_PATH_IMAGE007
A sample of
Figure 628159DEST_PATH_IMAGE009
First of a chromosome
Figure 868647DEST_PATH_IMAGE008
Normalized copy rate for each window.
Figure 881603DEST_PATH_IMAGE058
Represents the first
Figure 623163DEST_PATH_IMAGE007
A sample of
Figure 444488DEST_PATH_IMAGE009
First of a chromosome
Figure 183774DEST_PATH_IMAGE008
Unique alignment numbers after GC correction for each window.
Figure 480763DEST_PATH_IMAGE054
Representing all windows of chromosomes in the same position in the negative reference set
Figure 698118DEST_PATH_IMAGE052
At step 508, the computing device 110 combines the consecutive first predetermined number of windows into one section, so as to take the area where the copy rate of the consecutive second predetermined number of sections meets the predetermined condition as a candidate area.
With respect to the first predetermined number, it is, for example and without limitation, 10. With respect to the second predetermined number, it is, for example and without limitation, greater than or equal to 10. It should be understood that the first predetermined number and the second predetermined number may be the same or different.
For example, the computing device 110 joins 10 consecutive windows into one section (or window); then, with one section as a step size, whether the copy rate (i.e., copy ratio) of consecutive 10 sections or more meets a predetermined condition (e.g., the copy rate of consecutive 10 sections or more is greater than or equal to 1.1, or less than or equal to 0.9) is detected, and if the copy rate of consecutive 10 sections or more meets the predetermined condition, the region is determined as a candidate region. It will be appreciated how many windows are combined into a single segment can be adjusted according to the sequencing depth and the size of the CNV to be detected.
At step 510, the computing device 110 calculates the offset direction for each segment in the candidate region in order to determine copy number variations for segments of the first type having lengths that fall within a predetermined range.
As for the predetermined range, it is, for example, without limitation, 50 kb-1M. The first type of segment whose length falls within the predetermined range is, for example, a large segment.
A method for determining copy number variations of fragments of a first type having a length within a predetermined range, for example, comprising: the calculation device 110 calculates, for a first type segment whose length falls within a predetermined range, a shift direction of each segment in the candidate region using a mean-shift (mean-shift) algorithm; and merging the offset directions for the respective segments in the candidate region to determine the copy number variation of the first type segment. It should be understood that the CNV drift direction in each step is calculated by a mean shift algorithm, and then the current step and the left or right are merged or segmented according to whether the CNV drift directions of the left or right of the current step are the same or opposite, so as to accurately detect the small segment copy number variation.
An algorithm for calculating the drift direction of each section in the candidate region is described below with reference to equations (16) and (17). Wherein equation (16) indicates the algorithm for calculating the estimated density function. Equation (17) indicates an algorithm for inverting the estimated density function in order to determine the drift direction.
Figure 905108DEST_PATH_IMAGE059
(16)
Figure 812366DEST_PATH_IMAGE060
(17)
In the above formulas (16) and (17),
Figure 534334DEST_PATH_IMAGE061
representing the estimated density function.
Figure 617697DEST_PATH_IMAGE008
Represents the first
Figure 741511DEST_PATH_IMAGE008
A window.
Figure 760282DEST_PATH_IMAGE009
Represents and is
Figure 766284DEST_PATH_IMAGE008
A window adjacent to the window. r represents the unique log of the GC corrected window mentioned above (e.g.)
Figure 325442DEST_PATH_IMAGE062
Representing the bandwidth of one point per window, with the center point window extending (in some embodiments,
Figure 569341DEST_PATH_IMAGE062
default to 5 windows). Representing a range of copy numbers (in some embodiments,
Figure 493435DEST_PATH_IMAGE063
default to 0.2).
Figure 252312DEST_PATH_IMAGE064
Representing a constant, for example 2.72.
The algorithm for determining the drift direction is described below in conjunction with equation (18).
Figure 883669DEST_PATH_IMAGE065
(18)
In the above-mentioned formula (18),
Figure 716496DEST_PATH_IMAGE066
representing the drift direction. Some windows representing the current window and the left side do not belong to a segment.
Figure 77070DEST_PATH_IMAGE067
Some windows representing the current window and the right side belong to one segment.
For example, if the computing device 110 finds two adjacent windows with directions of-1 and 1, respectively, if at least 3 of the 4 windows behind the right window are oriented to the right, and at least 3 of the 4 windows in front of the left window are oriented to the left, then the break point is located between the two windows, and the region from the break point to the window with the same direction is a segment.
It will be appreciated that by calculating the copy number of a fragment, it is possible to determine whether there is a copy number variation.
An algorithm for calculating the copy number of a fragment for an autosome is described below in conjunction with equation (19); and a method for determining whether a copy number variation exists based on the copy number of the fragment is described in connection with equation (20).
Figure 260927DEST_PATH_IMAGE068
Figure 224204DEST_PATH_IMAGE069
(20)
In the above-mentioned formulas (19) and (20),
Figure 177116DEST_PATH_IMAGE070
representing the number of copies of the fragment. It should be understood that for male sex chromosomes, equation (19)
Figure 708592DEST_PATH_IMAGE071
The calculation of (c) does not need to be multiplied by 2.
As can be seen from equation (20), if the calculated copy number of the current segment is in the range [0.3,0.7], the current segment is deleted; if the calculated copy number of the current fragment is in the range [1.3, 2.7], then it is determined that the current fragment has a copy number variation of the first type of fragment.
At step 512, the computing device 110 determines copy number variations of the second type of fragments that are greater than a predetermined value in length to determine copy number variations of the test sample based on the copy number variations of the first type of fragments and the copy number variations of the second type of fragments.
For example, the computing device 110 integrates the copy number variation of the first type of fragment and the copy number variation of the second type of fragment as the copy number variation of the sample to be tested.
As for the predetermined value, it is, for example, without limitation, 1 Mb. The second type segments having a length larger than a predetermined value are for example large segments.
A method for determining copy number variation of a second type of fragment having a length greater than a predetermined value, comprising, for example: the computing device 110 detects copy number variations for fragments greater than 1Mb in length using a Cyclic Binary Segmentation (CBS) algorithm. Specifically, the computing device 110 initializes each chromosome of the genome to be detected in a circle; in a way that each window in the area to be detected corresponds to one point, enabling a plurality of (for example, n) windows in the area to be detected to correspond to a plurality of points; two points are arbitrarily selected from a plurality of points (for example, an arbitrary two points are taken as
Figure 114165DEST_PATH_IMAGE072
Figure 349975DEST_PATH_IMAGE073
Two points of
Figure 219711DEST_PATH_IMAGE074
) To calculate the degree of deviation variance (i.e., the Z value) of the segment of the region defined by the two selected points; determining whether the calculated degree of deviation variance of the region segment is greater than or equal to a predetermined deviation variance threshold; in response to determining that the calculated degree of deviation variance of the region segment is greater than or equal to a predetermined deviation variance threshold (e.g., a predetermined deviation variance threshold of
Figure 981475DEST_PATH_IMAGE075
) Determining that a breakpoint exists in the region segment defined by the two selected points, so as to select two points again in the region segment to circularly calculate the deviation variance degree of the region segment defined by the two selected points again; the section defined by the two points selected based on the last cycle is determined as a copyAnd the number of copies variation region fragments are used for determining the copy number variation of the second type of fragments based on the copy number variation region fragments so as to determine the copy number variation of the sample to be tested. If it is determined that the calculated degree of deviation variance of the region segment is less than the predetermined deviation variance threshold, determining that no break point exists within the region segment defined by the selected two points, and ending the loop;
an algorithm for calculating the degree of deviation variance of the region segment defined by the selected two points is described below in conjunction with equation (21).
Figure 812027DEST_PATH_IMAGE076
In the above-mentioned formula (21),
Figure 54790DEST_PATH_IMAGE077
to be provided with
Figure 44611DEST_PATH_IMAGE072
Figure 980206DEST_PATH_IMAGE073
The Z value of a region segment that is a breakpoint, i.e., the degree to which the region segment deviates from variance.
Figure 298055DEST_PATH_IMAGE078
Representing the total number of windows of the suspect region.
Figure 344509DEST_PATH_IMAGE072
Figure 188837DEST_PATH_IMAGE073
Representing 2 arbitrarily selected breakpoint positions, i.e. the first
Figure 295333DEST_PATH_IMAGE072
A window and the second
Figure 100478DEST_PATH_IMAGE073
And (4) a window.
Figure DEST_PATH_IMAGE079
Represents the first
Figure 547027DEST_PATH_IMAGE072
From window to first
Figure 855649DEST_PATH_IMAGE073
Standard deviation of copy rate over a range of regions per window.
Figure 195363DEST_PATH_IMAGE080
The copy rate standard deviation in the region range representing the whole number of windows of the region to be detected.
Figure 284542DEST_PATH_IMAGE081
Represents the first
Figure 203957DEST_PATH_IMAGE073
From window to first
Figure 367085DEST_PATH_IMAGE078
From window to window
Figure 877700DEST_PATH_IMAGE082
Standard deviation of copy rate over a range of regions of each window.
An algorithm for determining the number of cycles of deviation of the degree of variance of the region segment defined by the candidate breakpoints is described below in conjunction with equation (22).
Figure 985334DEST_PATH_IMAGE083
(22)
In the above-mentioned formula (22),
Figure 177281DEST_PATH_IMAGE084
the loop determines a number of loops of a plurality of degrees of deviation from the variance for the segment of the region defined by the plurality of candidate breakpoints.
With respect to a predetermined deviation variance threshold
Figure 788391DEST_PATH_IMAGE075
For example, without limitation, 2.58. The predetermined deviation variance threshold is obtained by setting an α value according to a P-value method based on a standard normal distribution curve. For example when a =0.1,
Figure 610853DEST_PATH_IMAGE085
2.58。
in the scheme, different processes are adopted by the large segment CNV and the small segment CNV, so that the large segment CNV is not fragmented completely and is sensitive to low-proportion embedding. But also can ensure enough sensitivity and accuracy for small segment CNV.
Fig. 6 schematically shows a block diagram of an electronic device 600 suitable for implementing an embodiment of the invention. The electronic device 600 may be a device for implementing the method 200 to 500 shown in fig. 2 to 5. As shown in fig. 6, electronic device 600 includes a central processing unit (i.e., CPU 601) that can perform various appropriate actions and processes in accordance with computer program instructions stored in a read-only memory (i.e., ROM 602) or loaded from storage unit 608 into a random access memory (i.e., RAM 603). In the RAM 603, various programs and data necessary for the operation of the electronic apparatus 600 can also be stored. The CPU 601, ROM 602, and RAM 603 are connected to each other via a bus 604. An input/output interface (i.e., I/O interface 605) is also connected to bus 604.
Various components in the electronic device 600 are connected to the I/O interface 605, including: an input unit 606, an output unit 607, a storage unit 608, and the CPU 601 executes the respective methods and processes described above, for example, the methods 200 to 500. For example, in some embodiments, methods 200-500 may be implemented as a computer software program stored on a machine-readable medium, such as storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into RAM 603 and executed by CPU 601, one or more operations of methods 200 through 500 described above may be performed. Alternatively, in other embodiments, CPU 601 may be configured by any other suitable means (e.g., by way of firmware) to perform one or more acts of methods 200-500.
It should be further appreciated that the present invention may be embodied as methods, apparatus, systems, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therein for carrying out aspects of the present invention.
The computer-readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical encoding device, such as punch cards or in-groove raised structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present invention may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C + + or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing an electronic circuit, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), with state information of computer-readable program instructions, which can execute the computer-readable program instructions.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor in a voice interaction device, a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
The above description is only an alternative embodiment of the present invention and is not intended to limit the present invention, and various modifications and variations of the present invention are possible to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (12)

1. A method for correcting sequencing data, comprising:
calculating the position of the read length of each unique alignment in the reference genome and the sequencing frequency of each position based on the sequencing data of the sample to be detected and the alignment result data of the human reference genome sequence;
dividing the genome according to windows with preset lengths so as to count the number of the longer reads in each window than the pairs;
performing GC corrections for the number of read lengths on the alignment within each window to generate a unique log of bits per window via GC corrections at the window level;
obtaining a unique log-to-ratio adjustment ratio based on the unique number-of-comparisons of the chromosome level of the sample to be tested, so as to generate a unique log-to-ratio of each window corrected by the GC of the chromosome level based on the unique log-to-ratio of each window corrected by the GC of the window level and the unique number-of-comparisons adjustment ratio; and
the unique alignment numbers for each window corrected by GC at the chromosome level are normalized to obtain a unique log of each window corrected by normalization and GC.
2. The method of claim 1, wherein comparing the unique alignment numbers for the chromosome level of the test sample comprises: and comparing the predicted unique comparison number of each chromosome of the sample to be detected with the actual unique comparison number of each chromosome of the sample to be detected.
3. The method of claim 1, wherein generating a unique alignment number for each window via GC correction at the window level comprises:
selecting windows with reading lengths on all autosomes in comparison so as to determine the proportion of GC (gas chromatography) bases on each window; and
based on the determined proportion of GC bases on each window, a predicted unique log over each window is obtained via a trained GC correction model for generating a unique log for each window via GC correction at the window level.
4. The method of claim 3, wherein obtaining the predicted unique log of comparisons over each window for generating the unique log of comparisons for each window via GC corrections for the window level comprises:
calculating the median of the normalized unique log of all windows with aligned read lengths on the autosome; and
generating a unique log for each window corrected by GC at the window level based on the predicted unique log over each window and the median of the calculated normalized unique log.
5. The method of claim 1, wherein generating a unique alignment number for each window corrected by GC at the chromosome level comprises:
calculating the proportion of GC bases on each chromosome based on the human reference genome sequence;
calculating the unique alignment number ratio of each chromosome of the negative reference set;
calculating the actual unique comparison number ratio of each chromosome of the sample to be detected so as to calculate the content ratio of the actual unique comparison number ratio of each chromosome of the sample to be detected relative to the unique comparison number ratio of each chromosome of the negative reference set; and
fitting via a linear regression model based on the calculated content ratio, the proportion of GC bases on each chromosome, to determine regression coefficients for generating a predicted unique log for each chromosome of the test sample based on the determined regression coefficients to generate a unique log for each window corrected by GC at the chromosome level.
6. The method of claim 5, wherein generating the predicted unique log-log for each chromosome based on the determined regression coefficients to generate a unique alignment number for each window corrected by GC at the chromosome level comprises:
obtaining a unique comparison number ratio adjustment ratio based on the predicted unique comparison number ratio of each chromosome and the actual unique comparison number ratio of each chromosome of each sample to be detected; and
the unique log-log per window corrected by GC at the chromosome level is generated based on the unique log-per-window and unique alignment scale adjustments of the GC at the window level.
7. A method for detecting copy number variation, comprising:
performing the steps of the method according to any one of claims 1 to 6, so as to obtain a unique one-bit logarithm per window via normalization and GC correction;
constructing an autosomal negative reference set and a sex chromosome negative reference set;
determining a normalized copy rate of each window of the sample to be tested based on the unique log of each window of the sample to be tested through normalization and GC correction, the unique comparison number of each window of the constructed autosomal negative reference set and the sex chromosome negative reference set;
combining continuous first preset number of windows into a section so as to take the area with the copying rate of continuous second preset number of sections meeting the preset condition as a candidate area;
calculating the offset direction of each segment in the candidate region to determine the copy number variation of the first type of segment with the length in a predetermined range; and
and determining the copy number variation of the second type of fragments with the length larger than a preset value so as to determine the copy number variation of the sample to be tested based on the copy number variation of the first type of fragments and the copy number variation of the second type of fragments.
8. The method of claim 7, wherein determining copy number variations for the first type of fragments having lengths within a predetermined range comprises:
calculating the offset direction of each section in the candidate region by using a mean shift algorithm aiming at the first type of segment with the length belonging to a preset range; and
the offset directions for the respective segments in the candidate region are combined to determine the copy number variation of the first type of segment.
9. The method of claim 7, wherein the copy number variation of the second type of fragments having a length greater than a predetermined value is determined by:
copy number variations of the second type of fragments having a length greater than a predetermined value are detected based on a cyclic binary partitioning algorithm.
10. The method of claim 7, wherein constructing the autosomal negative reference set and the sex chromosome negative reference set comprises:
performing a window-level GC correction and a chromosome-level GC correction for the unique log of each window of the negative reference set to obtain a normalized and GC-corrected unique log of each window of the negative reference set;
calculating the mean and standard deviation of the unique log of the ratio over each window;
determining a retention interval for each window based on the calculated mean and standard deviation to determine a retention coefficient for each window of the sample to be tested; and
and constructing an autosomal negative reference set and a sex chromosome negative reference set based on the retention coefficient of each window of the sample to be detected.
11. A computing device, comprising:
at least one processing unit;
at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, the instructions when executed by the at least one processing unit, cause the apparatus to perform the steps of the method of any of claims 1 to 10.
12. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when executed by a machine, implements the method according to any one of claims 1 to 10.
CN202210663961.5A 2022-06-14 2022-06-14 Methods, apparatus and media for correcting sequencing data, detecting copy number variations Active CN114792548B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210663961.5A CN114792548B (en) 2022-06-14 2022-06-14 Methods, apparatus and media for correcting sequencing data, detecting copy number variations

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210663961.5A CN114792548B (en) 2022-06-14 2022-06-14 Methods, apparatus and media for correcting sequencing data, detecting copy number variations

Publications (2)

Publication Number Publication Date
CN114792548A CN114792548A (en) 2022-07-26
CN114792548B true CN114792548B (en) 2022-09-09

Family

ID=82463542

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210663961.5A Active CN114792548B (en) 2022-06-14 2022-06-14 Methods, apparatus and media for correcting sequencing data, detecting copy number variations

Country Status (1)

Country Link
CN (1) CN114792548B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013107048A1 (en) * 2012-01-20 2013-07-25 深圳华大基因健康科技有限公司 Method and system for determining whether copy number variation exists in sample genome, and computer readable medium
CN105483229A (en) * 2015-12-21 2016-04-13 广东腾飞基因科技有限公司 Method and system for detecting fetal chromosome aneuploidy
CN105574361A (en) * 2015-11-05 2016-05-11 上海序康医疗科技有限公司 Method for detecting variation of copy numbers of genomes
WO2019213811A1 (en) * 2018-05-07 2019-11-14 深圳市真迈生物科技有限公司 Method, apparatus, and system for detecting chromosomal aneuploidy
CN111916150A (en) * 2019-05-10 2020-11-10 北京贝瑞和康生物技术有限公司 Method and device for detecting genome copy number variation
CN113789371A (en) * 2021-09-17 2021-12-14 广州燃石医学检验所有限公司 Method for detecting copy number variation based on batch correction
WO2022110039A1 (en) * 2020-11-27 2022-06-02 深圳华大生命科学研究院 Fetal chromosomal abnormality detection method and system

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120046877A1 (en) * 2010-07-06 2012-02-23 Life Technologies Corporation Systems and methods to detect copy number variation
CN104133914B (en) * 2014-08-12 2017-03-08 厦门万基生物科技有限公司 A kind of GC deviation eliminating high-flux sequence introducing and the detection method to chromosome copies number variation
WO2016176847A1 (en) * 2015-05-06 2016-11-10 安诺优达基因科技(北京)有限公司 Reagent kit, apparatus, and method for detecting chromosome aneuploidy
WO2018161245A1 (en) * 2017-03-07 2018-09-13 深圳华大基因研究院 Method and device for detecting chromosomal variations
CN108573125B (en) * 2018-04-19 2022-05-13 上海亿康医学检验所有限公司 Method for detecting genome copy number variation and device comprising same
CN111370056B (en) * 2019-05-22 2021-03-30 深圳思勤医疗科技有限公司 Method, system and computer readable medium for determining predetermined chromosome instability index of a sample to be tested
WO2022015998A1 (en) * 2020-07-15 2022-01-20 University Of Connecticut Gene panels and methods of use thereof for screening and diagnosis of congenital heart defects and diseases
CN112669901A (en) * 2020-12-31 2021-04-16 北京优迅医学检验实验室有限公司 Chromosome copy number variation detection device based on low-depth high-throughput genome sequencing

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013107048A1 (en) * 2012-01-20 2013-07-25 深圳华大基因健康科技有限公司 Method and system for determining whether copy number variation exists in sample genome, and computer readable medium
CN105574361A (en) * 2015-11-05 2016-05-11 上海序康医疗科技有限公司 Method for detecting variation of copy numbers of genomes
CN105483229A (en) * 2015-12-21 2016-04-13 广东腾飞基因科技有限公司 Method and system for detecting fetal chromosome aneuploidy
WO2019213811A1 (en) * 2018-05-07 2019-11-14 深圳市真迈生物科技有限公司 Method, apparatus, and system for detecting chromosomal aneuploidy
CN111916150A (en) * 2019-05-10 2020-11-10 北京贝瑞和康生物技术有限公司 Method and device for detecting genome copy number variation
WO2022110039A1 (en) * 2020-11-27 2022-06-02 深圳华大生命科学研究院 Fetal chromosomal abnormality detection method and system
CN113789371A (en) * 2021-09-17 2021-12-14 广州燃石医学检验所有限公司 Method for detecting copy number variation based on batch correction

Also Published As

Publication number Publication date
CN114792548A (en) 2022-07-26

Similar Documents

Publication Publication Date Title
CN111462816B (en) Method, electronic device and computer storage medium for detecting microdeletion and microduplication of germ line genes
CN111429968A (en) Method, electronic device, and computer storage medium for predicting tumor type
CN111916150A (en) Method and device for detecting genome copy number variation
CN111292802A (en) Method, electronic device, and computer storage medium for detecting sudden change
CN111933214B (en) Method and computing device for detecting RNA level somatic gene variation
CN114496077B (en) Methods, devices, and media for detecting single nucleotide variations and indels
US20190005099A1 (en) Low memory sampling-based estimation of distinct elements and deduplication
CN114649055B (en) Methods, devices and media for detecting single nucleotide variations and indels
Beuf et al. Improved base-calling and quality scores for 454 sequencing based on a Hurdle Poisson model
Behr et al. Multiscale blind source separation
CN111584002A (en) Method, computing device and computer storage medium for detecting tumor mutational burden
CN114792548B (en) Methods, apparatus and media for correcting sequencing data, detecting copy number variations
Das et al. Base calling for high-throughput short-read sequencing: dynamic programming solutions
US8031628B2 (en) Optimal probing for unicast network delay tomography
CN114758720B (en) Method, apparatus and medium for detecting copy number variation
CN111932536B (en) Method and device for verifying lesion marking, computer equipment and storage medium
CN114694752B (en) Method, computing device and medium for predicting homologous recombination repair defects
US10650318B2 (en) Systems and methods of determining sufficient causes from multiple outcomes
KR101522087B1 (en) System and method for aligning genome sequnce considering mismatch
Chang et al. Empirical likelihood based tests for stochastic ordering under right censorship
CN111211886B (en) Energy analysis detection method for SM2 decryption algorithm
CN109933579B (en) Local K neighbor missing value interpolation system and method
CN113990492A (en) Method, apparatus and storage medium for determining detection parameters for minimal residual disease of solid tumors
CN103745136A (en) Efficient haplotype inference and deleted genotype fill method
CN114008658A (en) K-line form recognition method and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40077973

Country of ref document: HK