CN114792548B

CN114792548B - Methods, apparatus and media for correcting sequencing data, detecting copy number variations

Info

Publication number: CN114792548B
Application number: CN202210663961.5A
Authority: CN
Inventors: 张钰; 孙怀玉; 卢玉林
Original assignee: Berry Genomics Co Ltd
Current assignee: Berry Genomics Co Ltd
Priority date: 2022-06-14
Filing date: 2022-06-14
Publication date: 2022-09-09
Anticipated expiration: 2042-06-14
Also published as: CN114792548A

Abstract

The present invention relates to a method, apparatus and medium for correcting sequencing data, detecting copy number variation. The method comprises the following steps: calculating the position of the read length on each unique alignment in the reference genome and the number of times each position is sequenced; dividing the genome according to windows with preset lengths so as to count the number of the reads in each window compared with the reads in the pair; performing GC correction for the number of read lengths on the comparison within each window to generate a unique log of each window via GC correction at the window level; obtaining a unique log-ratio adjustment ratio based on the unique log-ratio of the chromosome level to generate a unique log of each window corrected by GC at the chromosome level; and normalizing the unique alignment number for each window corrected by GC at the chromosome level to obtain a unique log of each window corrected by normalization and GC. The present invention can significantly improve the correction effect for regions of high GC content.

Description

Methods, apparatus and media for correcting sequencing data, detecting copy number variations

Technical Field

The present invention relates generally to biological information processing, and in particular, to methods for correcting sequencing data, methods for detecting copy number variations, computing devices, and computer storage media.

Background

Sequencing data obtained on a second generation sequencer generally show a correlation between sequencing depth and GC content, and is called GC bias (or "GC bias"). GC bias refers to regions on the genome where GC content is around 50% more easily detected, resulting in more reads, higher coverage of these regions, less easily detected in regions with high or low GC content, resulting in fewer reads, and less coverage of these regions. For more accurate subsequent analysis of the bioinformatics, corrections for GC bias are usually needed.

Conventional protocols for correcting sequencing data include, for example: the GC content is then windowed (or "bin") and then GC corrected using a similar algorithm such as Loess regression on a window basis. However, in the conventional scheme for correcting sequencing data, which is mainly based on genome-level GC correction, there are some regions with poor correction effect, especially regions with high GC content, which is not ideal, for example, it is difficult to eliminate the bias introduced by sequencing and genome characteristics.

In summary, the conventional scheme for correcting sequencing data has the following disadvantages: the correction effect for the region of high GC content is poor.

Disclosure of Invention

The present invention provides a method, a computing device and a computer storage medium for correcting sequencing data, which can significantly improve the correction effect for a region of high GC content.

Further, the present invention also provides a method for detecting copy number variation, which can ensure that the whole large-fragment CNV is not too fragmented and is sensitive enough to low-proportion chimerism. But also can ensure enough sensitivity and accuracy for small segment CNV.

According to a first aspect of the invention, a method for correcting sequencing data is provided. The method comprises the following steps: calculating the position of the read length of each unique alignment in the reference genome and the sequencing frequency of each position based on the alignment result data of the sequencing data of the sample to be detected and the human reference genome sequence; dividing the genome according to windows with preset lengths so as to count the number of the reads in each window compared with the reads in the pair; performing GC corrections for the number of read lengths on the alignment within each window to generate a unique log of bits per window via GC corrections at the window level; obtaining a unique log-ratio adjustment ratio based on the unique number-of-comparisons of the chromosome level of the sample to be tested, so as to generate a unique log of each window corrected by the GC of the chromosome level based on the unique log of each window corrected by the GC of the window level and the unique number-of-comparisons adjustment ratio; and normalizing the unique alignment number for each window corrected by GC at the chromosome level to obtain a unique log of each window corrected by normalization and GC.

According to a second aspect of the present invention, a method for detecting copy number variation is provided. The method comprises the following steps: according to the steps of performing the method of the first aspect of the invention, so as to obtain a unique log of the ratio for each window via normalization and GC correction; constructing an autosomal negative reference set and a sex chromosome negative reference set; determining a normalized copy rate of each window of the sample to be tested based on the unique log of each window of the sample to be tested through normalization and GC correction, the unique comparison number of each window of the constructed autosomal negative reference set and the sex chromosome negative reference set; combining a first preset number of continuous windows into a section so as to take the area with the copying rate of a second preset number of continuous sections meeting the preset condition as a candidate area; calculating the offset direction of each segment in the candidate region to determine the copy number variation of the first type of segment with the length in a predetermined range; and determining the copy number variation of the second type of fragments with the length larger than a preset value so as to determine the copy number variation of the sample to be tested based on the copy number variation of the first type of fragments and the copy number variation of the second type of fragments.

According to a third aspect of the present invention, there is also provided a computing device comprising: a memory configured to store one or more computer programs; and a processor coupled to the memory and configured to execute the one or more programs to cause the apparatus to perform the method of the first or second aspect of the invention.

According to a fourth aspect of the invention, there is also provided a non-transitory computer-readable storage medium. The non-transitory computer readable storage medium has stored thereon machine executable instructions which, when executed, cause a machine to perform the method of the first or second aspect of the invention.

In some embodiments, the unique alignment scores for the chromosomal level of the test sample comprise: and comparing the predicted unique comparison number of each chromosome of the sample to be detected with the actual unique comparison number of each chromosome of the sample to be detected.

In some embodiments, generating a unique alignment number for each window corrected by GC at the window level comprises: selecting windows with reading lengths on all autosomes in comparison so as to determine the proportion of GC (gas chromatography) bases on each window; and obtaining a predicted unique log of alignments over each window via a trained GC correction model based on the determined proportion of GC bases over each window for generating unique aligned alignments per window via GC correction at the window level.

In some embodiments, obtaining the predicted unique log of alignments over each window for use in generating the unique log of alignments for each window via GC corrections for the window level comprises: calculating the median of the normalized unique log of all windows on the autosome where read lengths on alignments exist; and generating a unique log for each window corrected by GC at the window level based on the predicted unique log over each window and the median of the calculated normalized unique log.

In some embodiments, generating a unique alignment number for each window corrected by GC at the chromosome level comprises: calculating the proportion of GC bases on each chromosome based on the human reference genome sequence; calculating the unique alignment number ratio of each chromosome of the negative reference set; calculating the actual unique comparison number ratio of each chromosome of the sample to be detected so as to calculate the content ratio of the actual unique comparison number ratio of each chromosome of the sample to be detected relative to the unique comparison number ratio of each chromosome of the negative reference set; and fitting via a linear regression model based on the calculated content ratio and the proportion of GC bases on each chromosome to determine regression coefficients for generating a predicted unique log for each chromosome of the sample to be tested based on the determined regression coefficients to generate a unique log for each window corrected by GC at the chromosome level.

In some embodiments, generating the predicted unique log of pairs for each chromosome based on the determined regression coefficients to generate unique aligned pairs for each window corrected by GC at the chromosome level comprises: obtaining a unique comparison number ratio adjustment ratio based on the predicted unique comparison number ratio of each chromosome and the actual unique comparison number ratio of each chromosome of each sample to be detected; and generating a unique log of each window corrected by GC at the chromosome level based on the unique log of each window corrected by GC at the window level and the unique alignment fraction scaling.

In some embodiments, determining copy number variation of fragments of the first type having a length within a predetermined range comprises: calculating the offset direction of each section in the candidate region by using a mean shift algorithm aiming at the first type of segment with the length belonging to a preset range; and merging the offset directions for the respective segments in the candidate region to determine the copy number variation of the first type segment.

In some embodiments, the copy number variation of the second type of fragments having a length greater than a predetermined value is determined: copy number variations of the second type of fragments having a length greater than a predetermined value are detected based on a cyclic binary partitioning algorithm.

In some embodiments, constructing the autosomal negative reference set and the sex chromosome negative reference set comprises: performing window-level GC correction and chromosome-level GC correction for the unique pair-log of each window of the negative reference set to obtain normalized and GC-corrected unique pair-log of each window of the negative reference set; calculating the mean and standard deviation of the unique log of the ratio over each window; determining a retention interval for each window based on the calculated mean and standard deviation to determine a retention coefficient for each window of the sample to be tested; and constructing an autosomal negative reference set and a sex chromosome negative reference set based on the retention coefficient of each window of the sample to be detected.

This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the invention, nor is it intended to be used to limit the scope of the invention.

Drawings

FIG. 1 shows a schematic diagram of a system for a method of correcting sequencing data, according to an embodiment of the invention.

Fig. 2 shows a flow diagram of a method for correcting sequencing data according to an embodiment of the invention.

FIG. 3 shows a flow diagram of a method for generating a unique log of bits per window via GC correction at the window level, according to an embodiment of the invention.

FIG. 4 shows a flow diagram of a method for generating a unique log of each window via GC correction at a chromosome level, according to an embodiment of the invention.

FIG. 5 shows a flow diagram of a method for detecting copy number variation according to an embodiment of the invention.

FIG. 6 schematically shows a block diagram of an electronic device suitable for use to implement an embodiment of the invention.

Like or corresponding reference characters designate like or corresponding parts throughout the several views.

Detailed Description

Preferred embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.

The term "include" and variations thereof as used herein is meant to be inclusive in an open-ended manner, i.e., "including but not limited to". Unless specifically stated otherwise, the term "or" means "and/or". The term "based on" means "based at least in part on". The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment". The term "another embodiment" means "at least one additional embodiment". The terms "first," "second," and the like may refer to different or the same object.

As described above, in the conventional methods for correcting sequencing data, which are mainly based on genome-level GC correction, there are some regions with poor correction effect, especially regions with high GC content.

To address, at least in part, one or more of the above problems, as well as other potential problems, example embodiments of the present invention propose a scheme for correcting sequencing data. The windows are divided so as to carry out GC correction on the window level and GC correction on the chromosome level on all the windows on the sample to be detected, so that the influence of different GC content areas on the correction result can be eliminated, and a better sequencing data correction effect is achieved.

Fig. 1 shows a schematic diagram of a system 100 for a method of correcting sequencing data according to an embodiment of the invention. It should be understood that the system 100 may also implement methods for detecting copy number variations according to embodiments of the present invention. As shown in fig. 1, the system 100 includes: computing device 110, sequencing device 130, network 140. In some embodiments, the computing device 110, the sequencing device 130, and the network 140 interact with each other via data.

A sequencing apparatus 130, for example, for sequencing a sample to be tested of a test object, so as to generate sequencing data on the sample to be tested; and sending the generated sequencing sequence data of the sample to be tested to the computing device 110. The sequencing data for the sample to be tested is for example low depth whole genome sequencing.

With respect to the computing device 110, it is used, for example, to calculate the position of the read length on each unique alignment in the reference genome and the number of times each position was sequenced based on the alignment result data of the sequencing data of the sample to be tested and the human reference genome sequence; the genome is divided into windows of predetermined length to count the number of reads longer than the reads on the pair within each window. The computing device 110 may also be used to make the window-level GC corrections and the chromosome-level GC corrections, e.g., the computing device 110 makes the GC corrections for the number of reads over the alignment within each window to generate a unique log of each window via the window-level GC corrections; performing GC correction at the chromosome level for the unique log per window corrected by GC correction at the window level to generate a unique log per window corrected by GC correction at the chromosome level; and normalizing the unique alignment number for each window corrected by GC at the chromosome level to obtain a unique log of each window corrected by normalization and GC.

In some embodiments, computing device 110 may have one or more processing units, including special purpose processing units such as GPUs, FPGAs, and ASICs, as well as general purpose processing units such as CPUs. In addition, one or more virtual machines may also be running on each computing device. The computing device 110 includes, for example: the positions and positions are sequenced to the number of times calculation unit 112, the number of reads on the window alignment statistics unit 114, the window-level GC correction unit 116, the chromosome-level GC correction unit 118, the unique alignment number obtaining unit 120. The above positions and positions are sequenced to the number of times calculation unit 112, the intra-window alignment read-up number statistics unit 114, the window-level GC correction unit 116, the chromosome-level GC correction unit 118, and the unique alignment number obtaining unit 120. May be configured on one or more computing devices 110.

And a sequencing-to-times-for-position calculation unit 112 for calculating the position of the read length on each unique alignment in the reference genome and the number of times each position was sequenced based on the alignment result data of the sequencing data of the sample to be tested and the human reference genome sequence.

And a statistics unit 114 for the number of reads in the intra-window alignment, which is used for dividing the genome into windows of predetermined lengths so as to count the number of reads in the alignment in each window.

A window-level GC correction unit 116 for GC correction for the number of read lengths on the alignment within each window to generate a unique log of each window corrected by GC of the window-level.

Regarding the chromosome level GC correction unit 118, it is configured to obtain a unique log-ratio adjustment ratio based on the unique alignment ratio of the chromosome level of the sample to be tested, so as to generate a unique log of each window corrected by GC of the chromosome level based on the unique log of each window corrected by GC of the window level and the unique alignment ratio adjustment ratio.

A unique alignment number obtaining unit 120 for normalizing the unique alignment number for each window corrected by GC at the chromosome level to obtain a unique log of each window corrected by normalization and GC.

A method for correcting sequencing data according to an embodiment of the present invention will be described below with reference to fig. 2. FIG. 2 shows a flow diagram of a method 200 for correcting sequencing data, according to an embodiment of the invention. It should be understood that the method 200 may be performed, for example, at the electronic device 600 depicted in fig. 6. May also be executed at the computing device 110 depicted in fig. 1. It should be understood that method 200 may also include additional acts not shown and/or may omit acts shown, as the scope of the invention is not limited in this respect.

At step 202, the computing device 110 calculates the location of the read length on each unique alignment in the reference genome and the number of times each location was sequenced based on the alignment data of the test sample with the human reference genome sequence.

The sequencing data of the sample to be tested is for example and not limited to low depth whole genome sequencing data, e.g., sequencing depth < = 3X.

For example, computing device 110 aligns the sequencing fastq data onto a human reference genomic sequence (e.g., hg19 or hg 38) using BWA or similar public software to generate alignment result data; and based on the comparison result data, reserving the read length on the unique comparison; the position of each read in the reference genome (positional information including, for example, chromosome number and genome coordinates) and the number of times each position was sequenced were calculated.

At step 204, the computing device 110 divides the genome into windows of predetermined length to count the number of reads within each window that are longer than on the pair.

For example, the computing device 110 determines whether the sequencing depth of the sequencing data of the sample to be tested is less than or equal to a predetermined sequencing depth value (e.g., the sequencing data is low-depth whole genome sequencing data with a sequencing depth < = 3X), then divides the sequencing data into a plurality of windows by a predetermined length (e.g., without limitation, 5 Kb) so as to count the number of reads within each window over a pair.

At step 206, the computing device 110 performs GC corrections for the number of read lengths over the alignment within each window in order to generate a unique log of each window via GC corrections at the window level.

With respect to a method of generating a unique log of each window corrected by GC at the window level, it includes, for example: the computing device 110 selects windows of aligned read lengths on all autosomes to determine the proportion of GC bases on each window; and obtaining a predicted unique one-bit logarithm over each window via a trained GC correction model based on the determined proportion of GC bases over each window for generating the unique one-bit logarithm per window via the first GC correction. By adopting the means, the accuracy of the unique comparison number corrected by the GC at the window level can be improved, and the method is further favorable for improving the accuracy of detecting the copy number variation based on the more accurate unique comparison number.

For example, the computing device 110 selects windows with aligned read lengths on all autosomes to determine the proportion of GC bases in each window on each chromosome of the sample to be tested; order to

Representing the proportion of GC bases on the jth window of the ith chromosome on the kth sample of the training sample so as to

Represents the number of reads on the alignment of the jth window of the ith chromosome on the kth sample of the training sample. The computing apparatus 110 then trains the GC correction model using smoothening spline algorithm with Y (i.e., the base-on-window GC ratio) as the independent variable and X (i.e., the number of read lengths on the unique alignment, or "log-only ratio") as the dependent variable. Inputting the proportion of GC bases on each window of all chromosomes of the sample to be detected into a trained GC correction model so as to obtain a predicted unique log on each window of the sample to be detected; then, the computing device 110 calculates the median of the normalized unique log of all windows on the autosome where there are read lengths on the alignments; and generating a unique log for each window corrected by GC at the window level based on the predicted unique log over each window and the median of the calculated normalized unique log.

The method 300 for generating the unique bitlog for each window corrected by GC at the window level will be described in detail with reference to fig. 3, and will not be described herein again.

At step 208, the computing device 110 obtains a unique log-to-ratio adjustment ratio based on the unique log-to-ratio of the chromosome level of the sample under test to generate a unique log of each window corrected via GC of the chromosome level based on the unique log of each window corrected via GC of the window level and the unique log-to-ratio adjustment ratio. The unique alignment ratio of the chromosome level of the sample to be tested includes, for example: and comparing the predicted unique comparison number of each chromosome of the sample to be detected with the actual unique comparison number of each chromosome of the sample to be detected.

A method for generating a unique log of each window corrected by GC at the chromosome level, which includes, for example: the computing device 110 calculates the proportion of GC bases on each chromosome based on the human reference genomic sequence; calculating the unique alignment number ratio of each chromosome of the negative reference set; calculating the unique comparison number ratio of each chromosome of the sample to be detected so as to calculate the content ratio of the unique comparison number ratio of each chromosome of the sample to be detected relative to the unique comparison number ratio of each chromosome of the negative reference set; and fitting via a linear regression model based on the calculated content ratio, the proportion of GC bases on each chromosome, so as to determine regression coefficients for generating a predicted unique log for each chromosome based on the determined regression coefficients to generate a unique log for each window corrected via GC at the chromosome level. For example, the computing device 110 obtains a unique alignment number ratio adjustment ratio based on the predicted unique alignment number ratio of each chromosome and the actual unique alignment number ratio of each chromosome of each sample to be tested; and generating the unique log-log per window corrected by GC at the chromosome level based on the unique log-per-window and the unique alignment score scaling of the GC at the window level. By adopting the above means, the present invention can eliminate the influence of the high GC content region or the low GC content region on the correction result.

The method 400 for generating the unique log of each window corrected by GC at the chromosome level will be described in detail below with reference to fig. 4, and will not be described herein again.

At step 210, the computing device 110 normalizes for the unique alignment number for each window via GC correction at the chromosome level to obtain a unique log of each window via normalization and GC correction. It will be appreciated that the data volume of the sequencing data for each sample is numerous and numerous, and that for comparison at the same data level, the unique alignment numbers for each window corrected by GC at the chromosome level need to be normalized.

For example, the computing device 110 normalizes the unique alignment number for each window of GC correction via the chromosome level for all samples to be tested on one chip to the same level. The algorithm for normalization is described below with reference to equation (1).

（1）

In the above-mentioned formula (1),

represents the data volume of the test sample via normalization (e.g., the unique log of each window of the test sample via normalization and GC correction).

Represents the average of the data quantities representing n samples tested (e.g., the unique aligned number average for each window of GC correction via the chromosome level for a plurality of samples tested).

In the scheme, the windows are divided so as to perform window-level GC correction and chromosome-level GC correction on all the windows on the sample to be detected, so that the influence of different GC content areas on the correction result can be eliminated, and a better sequencing data correction effect is achieved. For example, for high GC regions, and for chromosomes with higher GC content, such as chromosome 16, 19, etc., the method 200 can better correct the data bias caused by GC bias.

A method for generating a unique log of each window via GC correction at the window level according to an embodiment of the present invention will be described below in conjunction with fig. 3. FIG. 3 shows a flow diagram of a method 300 for generating a unique log of bits per window via GC correction of a window level, according to an embodiment of the invention. It should be understood that the method 300 may be performed, for example, at the electronic device 600 depicted in fig. 6. May also be executed at the computing device 110 depicted in fig. 1. It should be understood that method 300 may also include additional acts not shown and/or may omit acts shown, as the scope of the invention is not limited in this respect.

At step 302, the computing device 110 selects windows of aligned read lengths on all autosomes to determine the proportion of GC bases on each window.

For example, the computing device 110 selects a window where aligned read lengths exist on all autosomes, and lets

Represents the first

On the sample

First of a chromosome

The proportion of GC bases on each window.

At step 304, the computing device 110 obtains a predicted unique log of ratios over each window based on the determined proportion of GC bases over each window via a trained GC correction model.

For example, the computing device 110 takes as the independent variable the proportion of GC bases on each window of the training sample (e.g., Y), and as the dependent variable the number of reads on the alignment on each window of the training sample (e.g., X),for example and without limitation, the smoothening spline algorithm is used to train the GC correction model. For example,

represents the first

On the sample

First of a chromosome

The number of read lengths on the alignment on each window is input into the proportion of GC bases on each window through a trained GC correction model

) Generating a predicted unique log over each window

). It should be understood that other algorithms besides smoothening spline may be used to train the GC correction model, and by using smoothening spline model to perform GC correction at the window level, a better data correction effect may be achieved.

At step 306, the computing device 110 calculates the median of the normalized unique log pairs of all windows on the autosome where read lengths on alignments exist.

For example, the computing device 110 calculates the median of the normalized unique log of all windows on the K sample genomes where there are read lengths on alignments, e.g.,

。

at step 308, the computing device 110 generates a unique bit log for each window via GC correction at the window level based on the predicted unique bit log over each window and the median of the calculated normalized unique bit logs.

For example, the computing device 110 bases the predicted unique alignment number

And the median of the calculated normalized unique bit logarithm

The unique log of each window corrected by GC at the window level is generated by the method shown in the following equation (2).

In the above-mentioned formula (2),

represents the unique log of each window corrected by GC at the window level.

Represents the first

On the sample

First of a chromosome

The number of read lengths on the window alignment, or "the first

On the sample

First of a chromosome

The only log of the ratio over each window ".

Represents the median of the normalized unique log of all windows on the autosome where read length on alignment is present.

Representing the only log of the prediction over each window.

It should be understood that the GC proportion is related to the unique comparison number, and the invention can obtain the unique comparison logarithm close to the actual level by adopting the above means. In the conventional GC correction method, the whole genome is used as a GC content reference, so that the obtained correction coefficient is deviated, and the corrected unique alignment number is not accurate enough. Therefore, by adopting the means, the method can not only perform GC correction on all windows on the sample to be detected, but also improve the accuracy of the corrected unique comparison number so as to provide more accurate data signals for subsequent CNV detection.

A method for generating a unique bitlog for each window corrected by GC at the chromosome level according to an embodiment of the present invention will be described below in conjunction with fig. 4. FIG. 4 illustrates a flow diagram of a method 400 for generating a unique log of each window via GC corrections at a chromosome level, according to an embodiment of the invention. It should be understood that the method 400 may be performed, for example, at the electronic device 600 depicted in fig. 6. May also be executed at the computing device 110 depicted in fig. 1. It is to be understood that method 400 may also include additional acts not shown and/or may omit acts shown, as the scope of the invention is not limited in this respect.

At step 402, the computing device 110 calculates the proportion of GC bases on each chromosome based on the human reference genomic sequence.

For example, the computing device 110 calculates the ratio of GC bases on each chromosome from the human reference genome sequence and makes it as

。

At step 404, the computing device 110 calculates a unique alignment count for each chromosome of the negative reference set.

An algorithm for calculating the unique per-chromosome one-log ratio of each autosome of the negative reference set is described below in conjunction with equation (3). That is, the computing device 110 calculates the per-chromosome unique ratio log ratio of the negative reference set autosomes based on the sum of the per-autosomal unique ratio log of the negative reference set and all the autosomal unique ratio logs of the negative reference set.

（3）

In the above-mentioned formula (3),

the unique alignment counts for each autosome representing the negative reference set were compared.

The only log of each autosome representing the negative reference set.

The sum of the unique alignment numbers of 22 autosomes representing the negative reference set.

It should be understood that the calculation method of the sex chromosome unique alignment number ratio of the negative reference set is the same as the algorithm of each chromosome unique alignment number ratio of the autosomes of the negative reference set shown in formula (3), except that the sex chromosome unique alignment number ratio pair is replaced by the autosomes unique alignment number pair of the negative reference set in the numerator.

At step 406, the computing device 110 calculates an actual unique alignment count ratio for each chromosome of the test sample to calculate a content ratio of the actual unique alignment count ratio for each chromosome of the test sample relative to the unique alignment count ratio for each chromosome of the negative reference set.

A method for calculating a unique alignment count ratio for each chromosome of a test sample includes, for example: the calculation device 110 calculates the sum of the proportion of GC bases of all windows of each autosome of the sample to be detected, the sum of the proportion of GC bases of 22 autosomes, and the unique comparison number proportion of each autosome of each sample to be detected; and calculating the unique comparison number ratio of the sex chromosomes of the sample to be detected based on the sum of the proportion of the GC bases of all windows of each sex chromosome of the sample to be detected and the sum of the proportion of the GC bases of all sex chromosomes so as to calculate the actual unique comparison number ratio of each chromosome of each sample to be detected.

An algorithm for calculating the unique alignment ratio of each autosome of each test sample is described below with reference to formula (4).

（4）

In the above-mentioned formula (4),

the unique alignment numbers of each autosome representing each test sample are compared.

Represents the sum of the proportion of GC bases in all (for example, but not limited to 12500) windows of each autosome of the sample to be tested.

Represents the sum of the proportions of GC bases of 22 autosomes.

It should be understood that the calculation method of the unique to logarithmic ratio of each sex chromosome of each test sample is the same as the algorithm of the unique to logarithmic ratio of each autosome of each test sample shown in formula (4), except that the unique to logarithmic ratio of the autosome in the numerator is replaced by the unique to logarithmic ratio of the sex chromosome. It will be appreciated that the calculation of the unique alignment ratio for each sex chromosome of the test samples for males and females is calculated separately.

An algorithm for calculating a content ratio of the unique alignment number ratio of each chromosome of the test sample to the unique alignment number ratio of chromosomes of the negative reference set is described below with reference to formula (5).

（5）

In the above-mentioned formula (5),

representing the unique alignment count per chromosome of the negative reference set.

The actual unique alignment numbers representing each chromosome of each test sample are compared.

Representing the content ratio of the unique alignment count ratio of each chromosome of the test sample relative to the unique alignment count ratio of each chromosome of the negative reference set.

At step 408, the computing device 110 fits via a linear regression model based on the calculated content ratio, the proportion of GC bases on each chromosome, to determine regression coefficients for generating a predicted unique log of each chromosome of the test sample based on the determined regression coefficients.

For example, the computing device 110 calculates, for each training sample, the ratio of GC bases on each chromosome (i.e.,

) As an argument, and as a content ratio of the unique aligned-number ratio of each chromosome of the training sample relative to the unique aligned-number ratio of chromosomes of the negative reference set (i.e.,

) For dependent variables, using linear models "Y = aX + b ", fitting, e.g. using least-squares, the best-fit line being selected so as to calculate the best regression coefficients

、

(i.e., using a formula)

To pair

Correction is performed). The predicted unique log of each chromosome is then pre-generated by a linear regression model. In some embodiments, by combining the smoothening spline model and the linear model, GC correction is performed twice at the window level and the chromosome level, respectively, so that a better data correction effect can be achieved.

An algorithm for generating a predicted unique log of each chromosome is described below in conjunction with equation (6).

In the above-mentioned formula (6),

representing the predicted unique log-log ratio of each chromosome.

、

Representing the regression coefficients of a linear regression model.

At step 410, the computing device 110 obtains a unique alignment number fraction adjustment ratio based on the predicted unique alignment number fraction of each chromosome and the actual unique alignment number fraction of each chromosome of each sample to be tested.

For example, the computing device 110 calculates the adjustment ratio based on a ratio of the predicted unique per-chromosome log ratio to the actual unique per-chromosome ratio of each chromosome of each sample to be tested (i.e., the actual unique per-chromosome ratio). An algorithm for calculating the ratio of the one-to-one log to the ratio adjustment is described below with reference to equation (7).

（7）

In the above-mentioned formula (7),

representing the unique alignment number ratio.

Representing the actual unique alignment number ratio of each chromosome of each sample to be tested, i.e. the actual unique alignment number ratio of each chromosome。

At step 412, the computing device 110 generates a unique log of each window corrected by GC at the chromosome level based on the unique log of each window corrected by GC at the window level and the unique alignment score scaling.

For example, the computing device 110 calculates the unique log of each window corrected via the chromosome level GC based on the product of the unique log of each window corrected via the window level GC and the unique alignment fraction scaling. An algorithm to compute the unique log of each window corrected by chromosome-level GC is described below in conjunction with equation (8).

（8）

In the above-mentioned formula (8),

representing the unique alignment number ratio.

Represents the unique log of each window corrected by GC at chromosome level, i.e., the unique number of alignments for each window after the second GC correction. Represents the unique log of each window via GC correction at the window level, i.e., the unique log of each window after the first GC correction.

It will be appreciated that the normal reference genome has a GC content of between 30-40%, and that a high GC content chromosome (e.g., with a GC content of 45%) will have a CG deviation from that of the whole genome (say 35%) that is somewhat greater if compared to that of the whole genome (say 35%) according to conventional GC correction methods, for example. According to the invention, the unique comparison log proportion of the sample on the chromosome is compared with the corresponding unique comparison log proportion of the chromosome of the negative reference set, so that the CG deviation proportion obtained by comparison is smaller and is closer to the actual real level.

In the above scheme, the method can further eliminate the influence of the high-GC content region or the low-GC content region on the correction result by predicting the predicted value of the unique comparison ratio through the regression model based on the content ratio of the unique comparison ratio of each chromosome of the sample to be detected to the unique comparison ratio of each chromosome of the negative reference set and correcting the unique comparison ratio of each window according to the adjustment ratio determined based on the actually measured unique comparison ratio and the predicted unique comparison ratio.

A method for detecting copy number variation according to an embodiment of the present invention will be described below with reference to fig. 5. FIG. 5 shows a flow diagram of a method 500 for detecting copy number variation, according to an embodiment of the invention. It should be understood that the method 500 may be performed, for example, at the electronic device 600 depicted in fig. 6. May also be executed at the computing device 110 depicted in fig. 1. It is to be understood that method 500 may also include additional acts not shown and/or may omit acts shown, as the scope of the invention is not limited in this respect.

At step 502, the computing device 110 performs the aforementioned method 200 to obtain a unique bitlog for each window via normalization and GC correction.

At step 504, the computing device 110 constructs an autosomal negative reference set and a sex chromosome negative reference set.

Regarding the method for constructing the autosomal negative reference set and the sex chromosome negative reference set, it includes, for example: the computing device 110 performs a window-level GC correction and a chromosome-level GC correction for the unique log of each window of the negative reference set to obtain a normalized and GC-corrected unique log of each window of the negative reference set; calculating the mean and standard deviation of the unique log of the ratio over each window; determining a retention interval for each window based on the calculated mean and standard deviation to determine a retention coefficient for each window of the sample to be tested; and constructing an autosomal negative reference set and a sex chromosome negative reference set based on the retention coefficient of each window of the sample to be detected.

For example, the computing device 110 selects some negative samples for copy number variation (i.e., no change in copy number) to construct a negative reference set. The two GC corrections (i.e., the window-level GC correction and the chromosome-level GC correction) shown in method 200 are performed for the unique alignment number for each window of the negative reference set to obtain a unique log of each window of the negative reference set that is normalized and GC corrected. The computing device 110 then calculates the mean and standard deviation of the unique log over each window separately for the normalized and GC corrected unique log of the negative reference set for each window.

The algorithm for calculating the mean and standard deviation of the unique log ratio over each window of the negative reference set is described below in conjunction with equations (9) and (10).

（9）

（10）

In the above formulas (9) and (10),

the mean of only one log over each window representing the negative reference set.

Represents the standard deviation of the unique log of each window of the negative reference set.

The unique log of each window via normalization and GC correction representing the negative reference set.

For each window, a reserve interval is calculated

. The calculation of the reserved interval is described below with reference to equation (11)

The algorithm of (1).

（11）

In the above-mentioned formula (11),

representing the reserved interval. It is understood that the distribution of the next generation sequencing data across the genome is random, assuming it conforms to a normal distribution level. Based on the theory of normal distribution, setting an accuracy of 95% (i.e., 95% of the data is accurate, falling within the retention interval), the corresponding coefficient is 1.96.

Calculation of the retention factor is described below in conjunction with equation (12)

The method of (1). If the only comparison number of the current window belongs to the reserved interval, the reservation coefficient is 1, namely the current window belongs to a normal state, and the current window is reserved; if the only comparison number of the current window belongs to the reserved interval, the reserved coefficient is 0, the current window belongs to an abnormal state, and the current window is filtered.

=

。(12)

In the above-mentioned formula (12),

representing the retention factor.

The calculation of all the windows of chromosomes for the negative reference set is described below in connection with equation (13)

The algorithm of (1). For example, all the windows of chromosomes of the autosomal negative reference set can be constructed by formula (13) for the autosomes

(ii) a And respectively constructing a male negative reference set and a female negative reference set according to sex aiming at sex chromosomes.

（13）

In the above-mentioned formula (13),

representing all windows of chromosomes of a negative reference set

。

At step 506, the computing device 110 determines a normalized copy rate for each window of the sample under test based on the unique log of each window of the sample under test via normalization and GC correction, the constructed unique alignment numbers for each window of the autosomal negative reference set and the sex chromosome negative reference set. By adopting the means, the method can compare the standardized effective data volume of the sample to be detected with the normal reference level, and can obtain the standardized depth value (namely, the copy rate CopyRatio, CR) of each window of the sample to be detected.

The algorithm for calculating the normalized copy rate for each window of the test sample is described below in conjunction with equations (14) and (15). Wherein, the formula (14) indicates the algorithm of the normalized copy rate of the autosome and female sex chromosome of the sample to be tested. Equation (15) indicates the algorithm for the normalized copy rate of the male sex chromosome of the test sample.

（14）

（15）

In the above formulas (14) and (15),

is representative of the sample to be tested

A sample of

First of a chromosome

Normalized copy rate for each window.

Represents the first

A sample of

First of a chromosome

Unique alignment numbers after GC correction for each window.

Representing all windows of chromosomes in the same position in the negative reference set

。

At step 508, the computing device 110 combines the consecutive first predetermined number of windows into one section, so as to take the area where the copy rate of the consecutive second predetermined number of sections meets the predetermined condition as a candidate area.

With respect to the first predetermined number, it is, for example and without limitation, 10. With respect to the second predetermined number, it is, for example and without limitation, greater than or equal to 10. It should be understood that the first predetermined number and the second predetermined number may be the same or different.

For example, the computing device 110 joins 10 consecutive windows into one section (or window); then, with one section as a step size, whether the copy rate (i.e., copy ratio) of consecutive 10 sections or more meets a predetermined condition (e.g., the copy rate of consecutive 10 sections or more is greater than or equal to 1.1, or less than or equal to 0.9) is detected, and if the copy rate of consecutive 10 sections or more meets the predetermined condition, the region is determined as a candidate region. It will be appreciated how many windows are combined into a single segment can be adjusted according to the sequencing depth and the size of the CNV to be detected.

At step 510, the computing device 110 calculates the offset direction for each segment in the candidate region in order to determine copy number variations for segments of the first type having lengths that fall within a predetermined range.

As for the predetermined range, it is, for example, without limitation, 50 kb-1M. The first type of segment whose length falls within the predetermined range is, for example, a large segment.

A method for determining copy number variations of fragments of a first type having a length within a predetermined range, for example, comprising: the calculation device 110 calculates, for a first type segment whose length falls within a predetermined range, a shift direction of each segment in the candidate region using a mean-shift (mean-shift) algorithm; and merging the offset directions for the respective segments in the candidate region to determine the copy number variation of the first type segment. It should be understood that the CNV drift direction in each step is calculated by a mean shift algorithm, and then the current step and the left or right are merged or segmented according to whether the CNV drift directions of the left or right of the current step are the same or opposite, so as to accurately detect the small segment copy number variation.

An algorithm for calculating the drift direction of each section in the candidate region is described below with reference to equations (16) and (17). Wherein equation (16) indicates the algorithm for calculating the estimated density function. Equation (17) indicates an algorithm for inverting the estimated density function in order to determine the drift direction.

（16）

（17）

In the above formulas (16) and (17),

representing the estimated density function.

Represents the first

A window.

Represents and is

A window adjacent to the window. r represents the unique log of the GC corrected window mentioned above (e.g.)

Representing the bandwidth of one point per window, with the center point window extending (in some embodiments,

default to 5 windows). Representing a range of copy numbers (in some embodiments,

default to 0.2).

Representing a constant, for example 2.72.

The algorithm for determining the drift direction is described below in conjunction with equation (18).

（18）

In the above-mentioned formula (18),

representing the drift direction. Some windows representing the current window and the left side do not belong to a segment.

Some windows representing the current window and the right side belong to one segment.

For example, if the computing device 110 finds two adjacent windows with directions of-1 and 1, respectively, if at least 3 of the 4 windows behind the right window are oriented to the right, and at least 3 of the 4 windows in front of the left window are oriented to the left, then the break point is located between the two windows, and the region from the break point to the window with the same direction is a segment.

It will be appreciated that by calculating the copy number of a fragment, it is possible to determine whether there is a copy number variation.

An algorithm for calculating the copy number of a fragment for an autosome is described below in conjunction with equation (19); and a method for determining whether a copy number variation exists based on the copy number of the fragment is described in connection with equation (20).

（20）

In the above-mentioned formulas (19) and (20),

representing the number of copies of the fragment. It should be understood that for male sex chromosomes, equation (19)

The calculation of (c) does not need to be multiplied by 2.

As can be seen from equation (20), if the calculated copy number of the current segment is in the range [0.3,0.7], the current segment is deleted; if the calculated copy number of the current fragment is in the range [1.3, 2.7], then it is determined that the current fragment has a copy number variation of the first type of fragment.

At step 512, the computing device 110 determines copy number variations of the second type of fragments that are greater than a predetermined value in length to determine copy number variations of the test sample based on the copy number variations of the first type of fragments and the copy number variations of the second type of fragments.

For example, the computing device 110 integrates the copy number variation of the first type of fragment and the copy number variation of the second type of fragment as the copy number variation of the sample to be tested.

As for the predetermined value, it is, for example, without limitation, 1 Mb. The second type segments having a length larger than a predetermined value are for example large segments.

A method for determining copy number variation of a second type of fragment having a length greater than a predetermined value, comprising, for example: the computing device 110 detects copy number variations for fragments greater than 1Mb in length using a Cyclic Binary Segmentation (CBS) algorithm. Specifically, the computing device 110 initializes each chromosome of the genome to be detected in a circle; in a way that each window in the area to be detected corresponds to one point, enabling a plurality of (for example, n) windows in the area to be detected to correspond to a plurality of points; two points are arbitrarily selected from a plurality of points (for example, an arbitrary two points are taken as

、

Two points of

) To calculate the degree of deviation variance (i.e., the Z value) of the segment of the region defined by the two selected points; determining whether the calculated degree of deviation variance of the region segment is greater than or equal to a predetermined deviation variance threshold; in response to determining that the calculated degree of deviation variance of the region segment is greater than or equal to a predetermined deviation variance threshold (e.g., a predetermined deviation variance threshold of

) Determining that a breakpoint exists in the region segment defined by the two selected points, so as to select two points again in the region segment to circularly calculate the deviation variance degree of the region segment defined by the two selected points again; the section defined by the two points selected based on the last cycle is determined as a copyAnd the number of copies variation region fragments are used for determining the copy number variation of the second type of fragments based on the copy number variation region fragments so as to determine the copy number variation of the sample to be tested. If it is determined that the calculated degree of deviation variance of the region segment is less than the predetermined deviation variance threshold, determining that no break point exists within the region segment defined by the selected two points, and ending the loop;

an algorithm for calculating the degree of deviation variance of the region segment defined by the selected two points is described below in conjunction with equation (21).

In the above-mentioned formula (21),

to be provided with

、

The Z value of a region segment that is a breakpoint, i.e., the degree to which the region segment deviates from variance.

Representing the total number of windows of the suspect region.

、

Representing 2 arbitrarily selected breakpoint positions, i.e. the first

A window and the second

And (4) a window.

Represents the first

From window to first

Standard deviation of copy rate over a range of regions per window.

The copy rate standard deviation in the region range representing the whole number of windows of the region to be detected.

Represents the first

From window to first

From window to window

Standard deviation of copy rate over a range of regions of each window.

An algorithm for determining the number of cycles of deviation of the degree of variance of the region segment defined by the candidate breakpoints is described below in conjunction with equation (22).

（22）

In the above-mentioned formula (22),

the loop determines a number of loops of a plurality of degrees of deviation from the variance for the segment of the region defined by the plurality of candidate breakpoints.

With respect to a predetermined deviation variance threshold

For example, without limitation, 2.58. The predetermined deviation variance threshold is obtained by setting an α value according to a P-value method based on a standard normal distribution curve. For example when a =0.1,

2.58。

in the scheme, different processes are adopted by the large segment CNV and the small segment CNV, so that the large segment CNV is not fragmented completely and is sensitive to low-proportion embedding. But also can ensure enough sensitivity and accuracy for small segment CNV.

Fig. 6 schematically shows a block diagram of an electronic device 600 suitable for implementing an embodiment of the invention. The electronic device 600 may be a device for implementing the method 200 to 500 shown in fig. 2 to 5. As shown in fig. 6, electronic device 600 includes a central processing unit (i.e., CPU 601) that can perform various appropriate actions and processes in accordance with computer program instructions stored in a read-only memory (i.e., ROM 602) or loaded from storage unit 608 into a random access memory (i.e., RAM 603). In the RAM 603, various programs and data necessary for the operation of the electronic apparatus 600 can also be stored. The CPU 601, ROM 602, and RAM 603 are connected to each other via a bus 604. An input/output interface (i.e., I/O interface 605) is also connected to bus 604.

Various components in the electronic device 600 are connected to the I/O interface 605, including: an input unit 606, an output unit 607, a storage unit 608, and the CPU 601 executes the respective methods and processes described above, for example, the methods 200 to 500. For example, in some embodiments, methods 200-500 may be implemented as a computer software program stored on a machine-readable medium, such as storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into RAM 603 and executed by CPU 601, one or more operations of methods 200 through 500 described above may be performed. Alternatively, in other embodiments, CPU 601 may be configured by any other suitable means (e.g., by way of firmware) to perform one or more acts of methods 200-500.

It should be further appreciated that the present invention may be embodied as methods, apparatus, systems, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therein for carrying out aspects of the present invention.

The computer-readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical encoding device, such as punch cards or in-groove raised structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.

The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.

The computer program instructions for carrying out operations of the present invention may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C + + or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing an electronic circuit, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), with state information of computer-readable program instructions, which can execute the computer-readable program instructions.

Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.

These computer-readable program instructions may be provided to a processor in a voice interaction device, a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

The above description is only an alternative embodiment of the present invention and is not intended to limit the present invention, and various modifications and variations of the present invention are possible to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for correcting sequencing data, comprising:

calculating the position of the read length of each unique alignment in the reference genome and the sequencing frequency of each position based on the sequencing data of the sample to be detected and the alignment result data of the human reference genome sequence;

dividing the genome according to windows with preset lengths so as to count the number of the longer reads in each window than the pairs;

performing GC corrections for the number of read lengths on the alignment within each window to generate a unique log of bits per window via GC corrections at the window level;

obtaining a unique log-to-ratio adjustment ratio based on the unique number-of-comparisons of the chromosome level of the sample to be tested, so as to generate a unique log-to-ratio of each window corrected by the GC of the chromosome level based on the unique log-to-ratio of each window corrected by the GC of the window level and the unique number-of-comparisons adjustment ratio; and

the unique alignment numbers for each window corrected by GC at the chromosome level are normalized to obtain a unique log of each window corrected by normalization and GC.

2. The method of claim 1, wherein comparing the unique alignment numbers for the chromosome level of the test sample comprises: and comparing the predicted unique comparison number of each chromosome of the sample to be detected with the actual unique comparison number of each chromosome of the sample to be detected.

3. The method of claim 1, wherein generating a unique alignment number for each window via GC correction at the window level comprises:

selecting windows with reading lengths on all autosomes in comparison so as to determine the proportion of GC (gas chromatography) bases on each window; and

based on the determined proportion of GC bases on each window, a predicted unique log over each window is obtained via a trained GC correction model for generating a unique log for each window via GC correction at the window level.

4. The method of claim 3, wherein obtaining the predicted unique log of comparisons over each window for generating the unique log of comparisons for each window via GC corrections for the window level comprises:

calculating the median of the normalized unique log of all windows with aligned read lengths on the autosome; and

generating a unique log for each window corrected by GC at the window level based on the predicted unique log over each window and the median of the calculated normalized unique log.

5. The method of claim 1, wherein generating a unique alignment number for each window corrected by GC at the chromosome level comprises:

calculating the proportion of GC bases on each chromosome based on the human reference genome sequence;

calculating the unique alignment number ratio of each chromosome of the negative reference set;

calculating the actual unique comparison number ratio of each chromosome of the sample to be detected so as to calculate the content ratio of the actual unique comparison number ratio of each chromosome of the sample to be detected relative to the unique comparison number ratio of each chromosome of the negative reference set; and

fitting via a linear regression model based on the calculated content ratio, the proportion of GC bases on each chromosome, to determine regression coefficients for generating a predicted unique log for each chromosome of the test sample based on the determined regression coefficients to generate a unique log for each window corrected by GC at the chromosome level.

6. The method of claim 5, wherein generating the predicted unique log-log for each chromosome based on the determined regression coefficients to generate a unique alignment number for each window corrected by GC at the chromosome level comprises:

obtaining a unique comparison number ratio adjustment ratio based on the predicted unique comparison number ratio of each chromosome and the actual unique comparison number ratio of each chromosome of each sample to be detected; and

the unique log-log per window corrected by GC at the chromosome level is generated based on the unique log-per-window and unique alignment scale adjustments of the GC at the window level.

7. A method for detecting copy number variation, comprising:

performing the steps of the method according to any one of claims 1 to 6, so as to obtain a unique one-bit logarithm per window via normalization and GC correction;

constructing an autosomal negative reference set and a sex chromosome negative reference set;

determining a normalized copy rate of each window of the sample to be tested based on the unique log of each window of the sample to be tested through normalization and GC correction, the unique comparison number of each window of the constructed autosomal negative reference set and the sex chromosome negative reference set;

combining continuous first preset number of windows into a section so as to take the area with the copying rate of continuous second preset number of sections meeting the preset condition as a candidate area;

calculating the offset direction of each segment in the candidate region to determine the copy number variation of the first type of segment with the length in a predetermined range; and

and determining the copy number variation of the second type of fragments with the length larger than a preset value so as to determine the copy number variation of the sample to be tested based on the copy number variation of the first type of fragments and the copy number variation of the second type of fragments.

8. The method of claim 7, wherein determining copy number variations for the first type of fragments having lengths within a predetermined range comprises:

calculating the offset direction of each section in the candidate region by using a mean shift algorithm aiming at the first type of segment with the length belonging to a preset range; and

the offset directions for the respective segments in the candidate region are combined to determine the copy number variation of the first type of segment.

9. The method of claim 7, wherein the copy number variation of the second type of fragments having a length greater than a predetermined value is determined by:

copy number variations of the second type of fragments having a length greater than a predetermined value are detected based on a cyclic binary partitioning algorithm.

10. The method of claim 7, wherein constructing the autosomal negative reference set and the sex chromosome negative reference set comprises:

performing a window-level GC correction and a chromosome-level GC correction for the unique log of each window of the negative reference set to obtain a normalized and GC-corrected unique log of each window of the negative reference set;

calculating the mean and standard deviation of the unique log of the ratio over each window;

determining a retention interval for each window based on the calculated mean and standard deviation to determine a retention coefficient for each window of the sample to be tested; and

and constructing an autosomal negative reference set and a sex chromosome negative reference set based on the retention coefficient of each window of the sample to be detected.

11. A computing device, comprising:

at least one processing unit;

at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, the instructions when executed by the at least one processing unit, cause the apparatus to perform the steps of the method of any of claims 1 to 10.

12. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when executed by a machine, implements the method according to any one of claims 1 to 10.