CN116913378A

CN116913378A - Method and system for detecting genome homozygous region based on low-depth sequencing data

Info

Publication number: CN116913378A
Application number: CN202310493830.1A
Authority: CN
Inventors: 徐寒黎; 谢玉婷; 成喜雨; 吕兴; 金怡宸; 马腾跃; 李欣怡
Original assignee: Beijing Jiaotong University
Current assignee: Beijing Jiaotong University
Priority date: 2023-05-05
Filing date: 2023-05-05
Publication date: 2023-10-20

Abstract

The invention discloses a method and a system for detecting a genome homozygous region based on low-depth sequencing data, which relate to the technical field of molecular genetics detection, and are used for calculating a first edge likelihood value of each target SNP locus belonging to a normal region and a second edge likelihood value of each target SNP locus belonging to the genome homozygous region for each chromosome region obtained by carrying out region division on a human reference genome, and respectively calculating a first joint distribution likelihood value and a second joint distribution likelihood value based on the first edge likelihood values and the second edge likelihood values of all target SNP loci in the chromosome region so as to judge whether the chromosome region is the genome homozygous region, thereby accurately detecting the genome homozygous region in a whole genome based on low-depth whole genome sequencing data with the sequencing depth of 0.1X-1X, reducing the dependence on the sequencing depth and reducing the detection cost.

Description

Method and system for detecting genome homozygous region based on low-depth sequencing data

Technical Field

The invention relates to the technical field of molecular genetics detection, in particular to a method and a system for detecting a genome homozygous region based on low-sequencing depth whole genome sequencing data.

Background

A genomic homozygous region (region of homozygosity, ROH) is a genomic region in which allelic heterozygosity has been lost. Most diploid cells, such as human somatic cells, carry two copies of the genome, one from the father and one from the mother. An SNP locus is heterozygous when the bases inherited from the two parents differ. If, owing to some mechanism (e.g., deletion, meiotic error, or consanguinity), all SNP loci in a region carry only one type of genome, from either the father or the mother, that region is a genomic homozygous region. The most typical case is uniparental disomy (UPD), in which a pair of homologous chromosomes, or partial segments of them, is derived from only one parent, with no copy from the other parent.

Detection of genomic homozygous regions has received increasing attention from professional teams at home and abroad in recent years. In 2020, the ACMG (American College of Medical Genetics and Genomics) issued a statement on ROH and UPD that supported the importance of detecting them, and literature reports on ROH detection were published in the same year. At present, ROH detection technologies fall into two types. The first is target-specific homozygosity detection, whose main techniques include short tandem repeat typing and methylation-specific PCR; however, these techniques are only applicable to detecting a specific genomic homozygous region suspected from clinical manifestations. The second is comprehensive detection of genomic homozygous regions across the whole genome, whose main techniques include chromosome microarray analysis and sequencing. Here sequencing usually refers to conventional deep sequencing rather than low-depth whole genome sequencing, and is generally adopted when the specific mutation site of a recessive gene is difficult to predict or needs to be detected. The prior art is described in detail below:

(1) Short tandem repeat (STR) typing: STRs are simple repetitive sequences uniformly distributed in eukaryotic genomes. Because the number of repeats of the repeat unit is highly variable among individuals and the markers are highly heterozygous throughout the genome, STRs can reflect differences in allele frequencies in a population. However, STRs have a high mutation rate, which can interfere with interpretation. Because the loci that can be examined are limited, the requirement on sample quality is high and the amount of information obtained is low, so the technique is not suitable for genome-wide UPD screening and can only detect a preset target homozygous region.

(2) Methylation-specific PCR (MSP): a technique for detecting methylation at specific sites. Methylation directly interferes with the binding of transcription factors to promoter recognition sites. Because a methylated base C cannot be converted into base T, the primers cannot amplify the target gene, so the method can identify whether genomic DNA is methylated. It is of some significance for imprinted-gene diseases, is highly sensitive, requires no special instruments, and is low in cost. However, the DNA sequence of the fragment to be detected must be known in advance and primer design is critical; the biggest disadvantage is that only a preset target UPD can be detected. Furthermore, the bisulfite treatment is critical: incomplete treatment can cause false positives.

(3) Chromosome microarray analysis (CMA): according to probe design, CMA is classified into array-based comparative genomic hybridization (aCGH) and single nucleotide polymorphism microarray (SNP array) technologies. The SNP array adds SNP probes to the non-polymorphic probes, overcomes the limitations of aCGH, and can detect most ROH and triploidy in addition to chromosome copy number abnormalities. CMA scans SNP loci to identify the genotype of each locus and uses this genotype information to find polyploid abnormalities and ROH in the genome. However, the main defects of CMA are that its detection range is easily affected by probe bias, that the chip's requirement on the initial quantity and quality of sample DNA is far higher than that of sequencing, and that the chip's detection cost is several times that of low-depth whole genome sequencing.

(4) Whole exome sequencing (WES): exon-region DNA is enriched with custom probes and subjected to high-throughput sequencing, followed by data analysis and alignment, and the genotype of each SNP locus is determined to estimate whether UPD exists. However, the main defects are that exons account for only 1-2% of the human genome, so regions outside the exons cannot be effectively detected, and that the high sequencing depth makes detection costly.

(5) Low-depth whole genome sequencing (CNV-seq): a chromosome analysis technique that detects genome copy number variation by whole genome sequencing on a second-generation sequencing platform. At extremely low sequencing depth (0.1×-1×), CNV-seq can accurately detect chromosome copy number abnormalities (CNV) at the whole-genome level, down to a length of 100 kb and a mosaic proportion of 10%. Before 2019, professional guidelines held that diagnostic methods based on low-depth sequencing could detect only CNV, not ROH. Subsequently, three studies showed that ROH could be detected from low-depth whole genome sequencing data by observing the B allele frequency of SNP sites, but none described detection performance indicators or parameters. In July 2021, a study detected ROH longer than 5 Mb from samples sequenced at 4× depth by inferring B allele frequencies, confirming the feasibility of low-depth sequencing for invasive detection of uniparental isodisomy. However, detecting ROH by B allele frequency still has the following limitation: the method requires a sample sequencing depth of about 4× and then selects SNP loci with coverage above 5 to infer the B allele frequency, whereas the sequencing depth in current domestic applications is often less than 1×, so most SNP loci cannot be used to infer the B allele frequency and the method fails. Raising the sequencing depth to about 4× would add detection cost and further limit adoption.

Table 1 below shows the advantages and disadvantages of the five genomic homozygous region detection techniques described above.

TABLE 1

Detecting genomic homozygous regions with the CNV-seq technique is a significant challenge because the genotype of each SNP site is unknown. Meeting the need for ROH detection with whole genome sequencing data of low sequencing depth would be of great significance for controlling detection costs. There is therefore a need for a new method that detects ROH from CNV-seq data below 1× depth, to resolve this dilemma of the existing CNV-seq technology.

Disclosure of Invention

The invention aims to provide a method and a system for detecting a genome homozygous region based on low-depth sequencing data, which can accurately detect the genome homozygous region in a whole genome based on low-depth whole genome sequencing data with the sequencing depth of 0.1-1 x, reduce the dependence on the sequencing depth and reduce the detection cost.

In order to achieve the above object, the present invention provides the following solutions:

a method of detecting a homozygous region of a genome based on low depth sequencing data, the method comprising:

sequencing a sample to be tested by using a low-depth whole genome sequencing technology with the sequencing depth of 0.1-1 x to obtain low-depth whole genome sequencing data; the sample to be detected is individual DNA; the low depth whole genome sequencing data comprises a plurality of base sequences;

Respectively comparing each base sequence with a human reference genome to obtain the position and comparison quality of each base sequence in the human reference genome, and selecting the base sequence with the comparison quality higher than or equal to the preset quality as a base sequence for detection;

dividing the human reference genome into regions according to a preset window width and a preset step length to obtain a plurality of chromosome regions;

selecting, for each of the chromosomal regions, a SNP site in the chromosomal region covered with at least one of the base sequences for detection as a target SNP site, based on the positions of all the base sequences for detection, and determining a first number of times the A allele is covered with the base sequence for detection and a second number of times the B allele is covered with the base sequence for detection at each of the target SNP sites; calculating a first edge likelihood value of each target SNP locus belonging to a normal region and a second edge likelihood value of each target SNP locus belonging to a genome homozygous region according to the first times and the second times; calculating a first joint distribution likelihood value based on first edge likelihood values of all the target SNP sites within the chromosome region, and calculating a second joint distribution likelihood value based on second edge likelihood values of all the target SNP sites within the chromosome region; determining whether the chromosomal region is a genomic homozygous region according to the first joint distribution likelihood value and the second joint distribution likelihood value.

In some embodiments, the predetermined quality is that the base sequence and the human reference genome have only one base mismatch.

In some embodiments, the width of each of the chromosome regions is the preset window width, the preset window width is greater than the preset step size, and there is an overlap between adjacent chromosome regions.

In some embodiments, said calculating a first edge likelihood value for each of said target SNP sites belonging to a normal region and a second edge likelihood value for each of said target SNP sites belonging to a homozygous region of the genome based on said first and second times specifically comprises:

for each target SNP locus, taking the genotype distribution probability of each first genotype corresponding to the first times, the second times and the normal region and the frequency of the allele A as inputs, and calculating a first edge likelihood value of the target SNP locus belonging to the normal region by using an edge likelihood value calculation formula; and calculating a second edge likelihood value of the target SNP locus belonging to the genome homozygous region by using an edge likelihood value calculation formula by taking the first times, the second times and the genotype distribution probability of each second genotype corresponding to the genome homozygous region and the frequency of the allele A as inputs.

In some embodiments, the first genotypes corresponding to the normal region comprise AA, AB, and BB; the genotype distribution probability of the first genotype AA is p² + p(1-p)F, and the frequency of allele A is 1-e; the genotype distribution probability of the first genotype AB is 2p(1-p)(1-F), and the frequency of allele A is 1/2; the genotype distribution probability of the first genotype BB is (1-p)² + p(1-p)F, and the frequency of allele A is e; wherein p is the population frequency of the A allele in the ethnic group to which the individual belongs; F is the inbreeding coefficient of the individual; e is the sequencing error rate;

the second genotype corresponding to the homozygous region of the genome comprises AA and BB; the genotype distribution probability of the second genotype AA is p, and the frequency of the allele A is 1-e; the genotype distribution probability of the second genotype BB is 1-p, and the frequency of allele A is e.
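The two genotype models above can be written down as small lookup tables. The following is a minimal sketch, not the patent's implementation; the function names and the example values of p, F, and e are hypothetical and chosen only for illustration.

```python
def normal_genotype_model(p, F, e):
    """Normal region: map genotype -> (P(g), frequency of allele A)."""
    q = 1.0 - p
    return {
        "AA": (p * p + p * q * F, 1.0 - e),
        "AB": (2.0 * p * q * (1.0 - F), 0.5),
        "BB": (q * q + p * q * F, e),
    }


def roh_genotype_model(p, e):
    """Genomic homozygous region: only AA and BB are possible."""
    return {
        "AA": (p, 1.0 - e),
        "BB": (1.0 - p, e),
    }


# Hypothetical example values: population frequency of allele A,
# inbreeding coefficient, and sequencing error rate.
normal = normal_genotype_model(p=0.3, F=0.01, e=0.01)
roh = roh_genotype_model(p=0.3, e=0.01)
```

In both models the genotype distribution probabilities sum to 1, which is a quick sanity check on the expressions.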

In some embodiments, the edge likelihood value calculation formula is:

$$M(F, e) = \sum_{g} P(g)\, B\big(F_A(g),\, C_i\big);$$

wherein $M(F, e)$ is the edge likelihood value; $F$ is the inbreeding coefficient of the individual; $e$ is the sequencing error rate; $P(g)$ is the genotype distribution probability of genotype $g$; $B(F_A(g), C_i)$ is the binomial distribution likelihood value; $F_A(g)$ is the frequency of allele A for genotype $g$; $C_i$ is the set comprising the first number of times and the second number of times;

wherein $x$ is a random variable representing the frequency of allele A; $c_i^A$ is the first number of times; $c_i^B$ is the second number of times.
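The expression for the binomial distribution likelihood value is not reproduced in this text. As a hedged sketch, the code below assumes the standard binomial probability mass function, with x as the frequency of allele A and the first and second numbers of times as the two counts, and sums it over genotypes as in the formula for M(F, e). The function names and example values are hypothetical.

```python
import math


def binomial_likelihood(x, count_a, count_b):
    """Assumed standard binomial form: C(n, a) * x^a * (1 - x)^b."""
    n = count_a + count_b
    return math.comb(n, count_a) * x ** count_a * (1.0 - x) ** count_b


def edge_likelihood(count_a, count_b, genotype_model):
    """Sum over genotypes g of P(g) * B(F_A(g), C_i).

    genotype_model maps genotype -> (P(g), frequency of allele A).
    """
    return sum(
        prior * binomial_likelihood(freq_a, count_a, count_b)
        for prior, freq_a in genotype_model.values()
    )


# Hypothetical parameters: p = 0.5, F = 0, e = 0.01.
p, F, e = 0.5, 0.0, 0.01
normal_model = {
    "AA": (p * p + p * (1 - p) * F, 1 - e),
    "AB": (2 * p * (1 - p) * (1 - F), 0.5),
    "BB": ((1 - p) ** 2 + p * (1 - p) * F, e),
}
roh_model = {"AA": (p, 1 - e), "BB": (1 - p, e)}

# A site where allele A is covered twice and allele B never.
m_normal = edge_likelihood(2, 0, normal_model)
m_roh = edge_likelihood(2, 0, roh_model)
```

Under these assumed values, a site with two concordant reads is somewhat better explained by the homozygous model, while a site with one read of each allele is far better explained by the normal model.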

In some embodiments, the calculating a first joint distribution likelihood value based on the first edge likelihood values of all the target SNP sites within the chromosome region and calculating a second joint distribution likelihood value based on the second edge likelihood values of all the target SNP sites within the chromosome region specifically comprises:

calculating the product of first edge likelihood values of all the target SNP loci in the chromosome region to obtain a first joint distribution likelihood value; and calculating the product of second edge likelihood values of all the target SNP loci in the chromosome region to obtain second joint distribution likelihood values.
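A product over many SNP loci can underflow in floating point, so a practical sketch may accumulate logarithms instead; this is an implementation choice assumed here, not something the text specifies. The function name is hypothetical, and every edge likelihood value is assumed to be strictly positive.

```python
import math


def joint_log_likelihood(edge_likelihoods):
    """Log of the product of per-site edge likelihood values.

    Equivalent to log(prod(values)), computed as a sum of logs for
    numerical stability. Assumes every value is > 0.
    """
    return sum(math.log(value) for value in edge_likelihoods)


# Three target SNP loci in one chromosome region (made-up values).
log_joint = joint_log_likelihood([0.5, 0.2, 0.1])
```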

In some embodiments, the determining whether the chromosomal region is a genomic homozygous region based on the first joint distribution likelihood value and the second joint distribution likelihood value specifically comprises:

calculating a judgment parameter according to the first joint distribution likelihood value and the second joint distribution likelihood value, and if the judgment parameter is larger than a preset threshold value, the chromosome region is a genome homozygous region;

the calculation formula of the judging parameter is as follows:

wherein T is the judgment parameter; L_ROH is the second joint distribution likelihood value; L_Normal is the first joint distribution likelihood value;

or, the calculation formula of the judging parameter is as follows:

wherein T is the judgment parameter; θ_ROH is the prior probability that the chromosome region is a genome homozygous region; θ_Normal is the prior probability that the chromosome region is a normal region; L_ROH is the second joint distribution likelihood value; L_Normal is the first joint distribution likelihood value.
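The two calculation formulas for T are not reproduced in this text, so their exact form is unknown here. Purely as an assumption for illustration, the sketch below uses a log-likelihood ratio for the variant without priors and a prior-weighted posterior probability for the variant with θ_ROH and θ_Normal. The function names are hypothetical and neither form should be read as the patent's formula.

```python
import math


def log_likelihood_ratio(log_l_roh, log_l_normal):
    """Assumed form without priors: log(L_ROH / L_Normal)."""
    return log_l_roh - log_l_normal


def posterior_roh(log_l_roh, log_l_normal, theta_roh, theta_normal):
    """Assumed form with priors:
    theta_ROH * L_ROH / (theta_ROH * L_ROH + theta_Normal * L_Normal),
    evaluated from log-likelihoods to avoid underflow.
    """
    a = math.log(theta_roh) + log_l_roh
    b = math.log(theta_normal) + log_l_normal
    m = max(a, b)
    return math.exp(a - m) / (math.exp(a - m) + math.exp(b - m))


# Made-up joint log-likelihoods for one chromosome region.
t_ratio = log_likelihood_ratio(-10.0, -12.0)
t_post = posterior_roh(-10.0, -12.0, theta_roh=0.5, theta_normal=0.5)
```

Either value would then be compared with a preset threshold; the threshold itself is not given in this section.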

In some embodiments, after determining whether the chromosomal region is a genomic homozygous region, the method further comprises: adjacent chromosomal regions, which are all homozygous regions of the genome, are combined.
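The merging step can be sketched as a single pass over the windows in genomic order. This is a minimal sketch under assumptions: the tuple layout `(chromosome, start, end, is_roh)` and the function name are hypothetical, and windows are assumed to be sorted by chromosome and start position.

```python
def merge_adjacent_roh(windows):
    """Merge consecutive windows that were all called homozygous.

    windows: list of (chrom, start, end, is_roh), sorted by position.
    Returns a list of merged (chrom, start, end) regions.
    """
    merged = []
    prev_chrom, prev_is_roh = None, False
    for chrom, start, end, is_roh in windows:
        if is_roh:
            if prev_is_roh and prev_chrom == chrom:
                # Extend the region opened by the previous window.
                last_chrom, last_start, last_end = merged[-1]
                merged[-1] = (last_chrom, last_start, max(last_end, end))
            else:
                merged.append((chrom, start, end))
        prev_chrom, prev_is_roh = chrom, is_roh
    return merged


# Coordinates in Mb for readability (made-up calls).
calls = [
    ("chr1", 0, 5, True),
    ("chr1", 1, 6, True),
    ("chr1", 2, 7, False),
    ("chr1", 3, 8, True),
    ("chr2", 0, 5, True),
]
regions = merge_adjacent_roh(calls)
```

A window called normal breaks the run even when the surrounding windows overlap, which is one reading of "adjacent"; other merging policies are possible.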

A system for detecting a homozygous region of a genome based on low depth sequencing data, the system comprising:

the sequencing module is used for sequencing the sample to be tested by using a low-depth whole genome sequencing technology with the sequencing depth of 0.1-1 x to obtain low-depth whole genome sequencing data; the sample to be detected is individual DNA; the low depth whole genome sequencing data comprises a plurality of base sequences;

the comparison module is used for respectively comparing each base sequence with a human reference genome to obtain the position and comparison quality of each base sequence in the human reference genome, and selecting the base sequence with the comparison quality higher than or equal to the preset quality as a base sequence for detection;

The division module is used for dividing the human reference genome into areas according to a preset window width and a preset step length to obtain a plurality of chromosome areas;

a detection module for selecting, for each of the chromosomal regions, an SNP site in the chromosomal region covered with at least one of the base sequences for detection as a target SNP site, based on the positions of all the base sequences for detection, and determining a first number of times the A allele is covered with the base sequence for detection and a second number of times the B allele is covered with the base sequence for detection at each of the target SNP sites; calculating a first edge likelihood value of each target SNP locus belonging to a normal region and a second edge likelihood value of each target SNP locus belonging to a genome homozygous region according to the first times and the second times; calculating a first joint distribution likelihood value based on first edge likelihood values of all the target SNP sites within the chromosome region, and calculating a second joint distribution likelihood value based on second edge likelihood values of all the target SNP sites within the chromosome region; determining whether the chromosomal region is a genomic homozygous region according to the first joint distribution likelihood value and the second joint distribution likelihood value.

According to the specific embodiment provided by the invention, the invention discloses the following technical effects:

The invention provides a method and a system for detecting a genome homozygous region based on low-depth sequencing data. First, a sample to be tested is sequenced by a low-depth whole genome sequencing technology with a sequencing depth of 0.1×-1× to obtain low-depth whole genome sequencing data. Then each base sequence in the data is compared with a human reference genome to obtain its position and comparison quality, and the base sequences whose comparison quality is higher than or equal to the preset quality are selected as base sequences for detection. Finally, the human reference genome is divided into a plurality of chromosome regions according to a preset window width and a preset step length. For each chromosome region, the SNP loci covered by at least one base sequence for detection are selected as target SNP loci according to the positions of all base sequences for detection; the first number of times the A allele is covered and the second number of times the B allele is covered by base sequences for detection are determined at each target SNP locus; the first edge likelihood value of each target SNP locus belonging to a normal region and the second edge likelihood value of each target SNP locus belonging to a genome homozygous region are calculated from the first and second numbers of times; the first joint distribution likelihood value is calculated from the first edge likelihood values of all target SNP loci in the chromosome region, and the second joint distribution likelihood value from the second edge likelihood values; and whether the chromosome region is a genome homozygous region is determined from the first and second joint distribution likelihood values. Genome homozygous regions in the whole genome can thereby be accurately detected from low-depth whole genome sequencing data with a sequencing depth of 0.1×-1×, reducing the dependence on sequencing depth and reducing the detection cost.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a method according to embodiment 1 of the present invention;

FIG. 2 is a schematic diagram of a technical route provided in embodiment 1 of the present invention;

FIG. 3 is a schematic block diagram of embodiment 1 of the present invention;

FIG. 4 is a schematic diagram comparing the chip results and the program results for sample number SRR 0632236, provided in embodiment 1 of the present invention;

FIG. 5 is a schematic diagram comparing the chip results and the program results for sample number ERR062966, provided in embodiment 1 of the present invention;

FIG. 6 is a system block diagram provided in embodiment 2 of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.

Example 1:

Before the technical solution of this embodiment is described, some technical terms used in it are explained so that those skilled in the art can better understand the technical solution.

(1) Genomic homozygous region: at the individual level, a region is referred to as a homozygous region of a genome if all SNP sites within that region are homozygous genotypes.

(2) Uniparental disomy: both copies of a pair of homologous chromosomes, or of partial chromosome segments, are derived from the same parent. Its pathogenic mechanisms include imprinted gene expression disorders and recessive homozygous mutations. Uniparental disomy can be categorized into uniparental isodisomy (isodisomy, iUPD) and uniparental heterodisomy (heterodisomy, hUPD): in uniparental isodisomy both copies are derived from the same chromosome of one parent, whereas in uniparental heterodisomy they are derived from the two homologous chromosomes of one parent. This embodiment focuses on uniparental isodisomy.

(3) Imprinted genes: genes for which only the copy from one parent is expressed, while the homologous copy from the other parent is not expressed.

(4) Low-depth whole genome sequencing (low-pass whole genome sequencing) technique: the whole genome DNA of a sample is sequenced at a low depth of 0.1×-1×, followed by bioinformatics-based information extraction and analysis. Because of its superior performance in detecting chromosomal copy number variation (CNV), this technique has been widely used in recent years in the prenatal diagnosis of fetal chromosomal aberrations, and is also known as the CNV-seq technique.

(5) Read heterozygous site: a concept newly defined in this example. When an SNP site is covered by more than 1 read (base sequence) and the bases measured at the SNP site by these reads are not all identical, the SNP site is called a read heterozygous site.

(6) Read homozygous site: a concept newly defined in this example. When an SNP site is covered by more than 1 read and the bases measured at the SNP site by these reads are all the same, the SNP site is called a read homozygous site.

In this example, any SNP site covered by more than 1 read can be classified as a read homozygous site or a read heterozygous site. Notably, an SNP site classified as read homozygous is not necessarily homozygous: the other allele may simply not have been sampled by chance. Likewise, an SNP site classified as read heterozygous is not necessarily heterozygous: the difference may be caused by a random sequencing error. A single SNP site therefore provides only ambiguous information, and its genotype remains unknown. However, if the ambiguous information from many SNP sites in a region is analyzed together by establishing a joint-distribution probability model, the ROH status of the whole region can be inferred. Based on this idea, this embodiment establishes a detection model based on read heterozygosity that addresses the blind spots of existing detection models and aims to recover effective information at extremely low sequencing depth, so that genomic homozygous regions are detected efficiently and accurately.
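The two read-level definitions reduce to a simple check on the bases observed at a site. The following is a minimal sketch; the function name and return labels are hypothetical.

```python
def classify_site(bases):
    """Classify an SNP site from the bases its covering reads report.

    Returns None when fewer than 2 reads cover the site, because the
    definitions apply only to sites covered by more than 1 read.
    """
    if len(bases) < 2:
        return None
    if len(set(bases)) == 1:
        return "read_homozygous"
    return "read_heterozygous"


# Made-up observations at three SNP sites.
labels = [classify_site(obs) for obs in (["A", "A"], ["A", "G", "A"], ["C"])]
```

The label says nothing certain about the true genotype of a single site; it only becomes informative once many sites in a region are combined.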

The present embodiment provides a method for detecting a homozygous region of a genome based on low depth sequencing data, as shown in fig. 1, 2 and 3, the method comprising:

s1: sequencing a sample to be tested by using a low-depth whole genome sequencing technology with the sequencing depth of 0.1-1 x to obtain low-depth whole genome sequencing data; the sample to be detected is individual DNA; the low depth whole genome sequencing data comprises a plurality of base sequences;

The sample to be tested can be obtained as follows: a sample is collected from the individual, for example by blood draw or oral swab, and its DNA is extracted to obtain the DNA of the individual. Any existing DNA extraction method can be used, such as the glass bead method, ultrasonic method, grinding method, or alkaline lysis method. An individual refers to any person.

Before aligning each base sequence with the human reference genome, the method of this embodiment may further comprise preprocessing the low-depth whole genome sequencing data. The preprocessing includes: deleting duplicate base sequences introduced by PCR amplification; deleting base sequences containing one or more bases N, where N denotes a base that could not be determined; deleting base sequences in which a run of consecutive bases has an average sequencing quality below a preset value, where the run contains at least 5 bases and the preset value can be 20, that is, deleting base sequences in which 5 or more consecutive bases have an average sequencing quality below 20; and trimming the adapter-contaminated portion from base sequences contaminated by adapters. Because the adapter is a known sequence, it can be recognized directly in a base sequence and cut off once recognized. This preprocessing performs quality control on the low-depth whole genome sequencing data; the preprocessed sequencing data is used as the new low-depth whole genome sequencing data for the subsequent S2.
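The four preprocessing rules can be sketched on in-memory reads as follows. This is a simplified illustration under several assumptions: reads are `(sequence, qualities)` pairs with integer Phred scores, duplicates are detected by identical sequence, only windows of exactly 5 bases are checked for low quality, the adapter is trimmed from its first occurrence to the end of the read, and the order of the steps is a choice made here. All names and the example adapter are hypothetical.

```python
def has_low_quality_window(quals, window=5, min_mean=20):
    """True if any run of `window` consecutive bases averages below min_mean."""
    for i in range(len(quals) - window + 1):
        if sum(quals[i:i + window]) / window < min_mean:
            return True
    return False


def preprocess_reads(reads, adapter, window=5, min_mean=20):
    """Apply adapter trimming, N filtering, quality filtering, deduplication."""
    seen = set()
    kept = []
    for seq, quals in reads:
        cut = seq.find(adapter)
        if cut != -1:
            # Trim the adapter and everything after it.
            seq, quals = seq[:cut], quals[:cut]
        if not seq or "N" in seq:
            continue
        if has_low_quality_window(quals, window, min_mean):
            continue
        if seq in seen:
            continue
        seen.add(seq)
        kept.append((seq, quals))
    return kept


reads = [
    ("ACGTACGTAC", [30] * 10),
    ("ACGTACGTAC", [30] * 10),                                # duplicate
    ("ACGNACGTAC", [30] * 10),                                # contains N
    ("ACGTACGTAA", [30, 30, 10, 10, 10, 10, 10, 30, 30, 30]),  # low quality
    ("ACGTTTGGGG", [30] * 10),                                # ends with adapter
]
clean = preprocess_reads(reads, adapter="GGGG")
```

In practice these steps are usually performed by dedicated quality-control and duplicate-marking tools on FASTQ or BAM files; the sketch only makes the filtering rules concrete.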

S2: respectively comparing each base sequence with a human reference genome to obtain the position and comparison quality of each base sequence in the human reference genome, and selecting the base sequence with the comparison quality higher than or equal to the preset quality as a base sequence for detection;

In this embodiment, BWA software can be used to align each base sequence remaining after quality control to the human reference genome and determine its position in the human reference genome. The alignment strategy allows at most one base mismatch, and only base sequences that satisfy this strategy are retained as base sequences for detection in the subsequent analysis. That is, the preset quality of this example corresponds to exactly one base mismatch between the base sequence and the human reference genome, and a base sequence is used as a base sequence for detection only when its comparison quality is higher than or equal to the preset quality, i.e., when it has one base mismatch or no base mismatch with the human reference genome. This alignment step selects a plurality of base sequences for detection.
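One way to apply the "at most one mismatch" rule after alignment is to filter SAM records on the standard `NM` tag, which aligners such as BWA typically emit. This is a hedged sketch, not the patent's procedure: `NM` is the edit distance to the reference and therefore also counts inserted and deleted bases, so it is only an approximation of a pure mismatch count. The function name and the example records are hypothetical.

```python
def keep_alignment(sam_line, max_mismatches=1):
    """Return True if a SAM record is mapped and has NM <= max_mismatches."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    if flag & 0x4:
        # Bit 0x4 marks an unmapped read.
        return False
    for tag in fields[11:]:
        if tag.startswith("NM:i:"):
            return int(tag[5:]) <= max_mismatches
    # No NM tag present: reject rather than guess.
    return False


records = [
    "r1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNM:i:1",
    "r2\t0\tchr1\t200\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNM:i:2",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
]
kept = [line.split("\t")[0] for line in records if keep_alignment(line)]
```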

S3: dividing the human reference genome into regions according to a preset window width and a preset step length to obtain a plurality of chromosome regions;

In this embodiment, the width of each chromosome region is the preset window width, and the preset window width is larger than the preset step length, so that adjacent chromosome regions overlap. The window width and step length must be chosen appropriately: if the window is too wide and exceeds the actual extent of the ROH, the effective signal is diluted and a real ROH is difficult to detect; if the window is too narrow, too few SNP loci are available in the window and the sensitivity is low. The window width and step length are therefore determined by weighing sensitivity against the true positive rate. In this embodiment, the preset window width is 5 Mb and the preset step length is 1 Mb. The human reference genome is then divided into a plurality of chromosome regions according to the preset window width and step length; each chromosome region is 5 Mb wide, and adjacent chromosome regions overlap by 4 Mb.
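The region division can be sketched as a sliding window over each chromosome. The function name is hypothetical, coordinates are assumed to be 0-based and half-open, and dropping a trailing window shorter than the full width is an assumption made here; the text does not say how chromosome ends are handled.

```python
def sliding_windows(chrom_length, width=5_000_000, step=1_000_000):
    """Return (start, end) windows of fixed width, advanced by step."""
    windows = []
    start = 0
    while start + width <= chrom_length:
        windows.append((start, start + width))
        start += step
    return windows


# A toy 8 Mb chromosome yields four overlapping 5 Mb windows.
toy_windows = sliding_windows(8_000_000)
overlap = toy_windows[0][1] - toy_windows[1][0]
```

With a 5 Mb window and a 1 Mb step, consecutive windows share 4 Mb, matching the overlap stated above.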

S4: selecting, for each of the chromosomal regions, a SNP site in the chromosomal region covered with at least one of the base sequences for detection as a target SNP site, based on the positions of all the base sequences for detection, and determining a first number of times the A allele is covered with the base sequence for detection and a second number of times the B allele is covered with the base sequence for detection at each of the target SNP sites; calculating a first edge likelihood value of each target SNP locus belonging to a normal region and a second edge likelihood value of each target SNP locus belonging to a genome homozygous region according to the first times and the second times; calculating a first joint distribution likelihood value based on first edge likelihood values of all the target SNP sites within the chromosome region, and calculating a second joint distribution likelihood value based on second edge likelihood values of all the target SNP sites within the chromosome region; determining whether the chromosomal region is a genomic homozygous region according to the first joint distribution likelihood value and the second joint distribution likelihood value.
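The first part of S4, selecting target SNP loci and counting how often each allele is covered, can be sketched as below. The data layout is assumed for illustration: reads are ungapped `(start, sequence)` alignments in 0-based coordinates on one chromosome, and `snps` maps a position to its `(A allele, B allele)` pair. Bases matching neither allele are ignored, which is also an assumption. All names are hypothetical.

```python
def count_alleles(reads, snps):
    """Return {position: (count_a, count_b)} for SNP sites covered by reads."""
    counts = {}
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if pos not in snps:
                continue
            a_allele, b_allele = snps[pos]
            count_a, count_b = counts.get(pos, (0, 0))
            if base == a_allele:
                count_a += 1
            elif base == b_allele:
                count_b += 1
            else:
                # Neither allele: likely an error, skipped in this sketch.
                continue
            counts[pos] = (count_a, count_b)
    return counts


snps = {3: ("A", "G"), 10: ("C", "T"), 50: ("G", "C")}
reads = [(0, "ACGAT"), (2, "TGCC"), (8, "GGT")]
site_counts = count_alleles(reads, snps)
```

Sites that no read covers, such as position 50 here, simply do not appear in the result and so are not target SNP loci.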

In this embodiment, the process of establishing the ROH detection model includes constructing a probability distribution model and a target joint probability distribution model, which is specifically as follows:

Although CNV-seq is low-depth sequencing at less than 1X, the resulting low-depth whole genome sequencing data still contains enough SNP sites covered by more than 1 base sequence for detection, as shown in Table 2. For any target SNP site with a coverage (i.e., the number of base sequences for detection covering the SNP site) greater than 1, this embodiment can define the target SNP site as a read homozygous site or a read heterozygous site.

TABLE 2

Table 2 shows the number of SNP sites covered by 1, 2, and 3 base sequences for detection in a CNV-seq sample under 1X low-depth sequencing. As Table 2 shows, even though the average coverage of the sample is only 1X, enough SNP sites are covered by 1 or more base sequences for detection.

The probability distribution model constructed in this embodiment is a practical model obtained by incorporating the sequencing error rate e and the inbreeding coefficient F of the individual into an ideal model. For biallelic SNP loci, the inbreeding coefficient F directly affects the probabilities of the genotypes AA, BB and AB.

The probability distribution model of the present embodiment specifically includes:

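(1) For the normal region, the genotype distribution probability P(g) of an individual at any SNP site and the frequency $F_A(g)$ of allele A are shown in Table 3 below:

TABLE 3

Genotype g	Genotype distribution probability P(g)	Frequency of allele A $F_A(g)$
AA	$p^2 + p(1-p)F$	1-e
AB	$2p(1-p)(1-F)$	1/2
BB	$(1-p)^2 + p(1-p)F$	e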

(2) For the genomic homozygous region, the genotype distribution probability P(g) of an individual at any SNP site and the frequency $F_A(g)$ of allele A are shown in Table 4 below:

TABLE 4

Genotype g	Genotype distribution probability P(g)	Frequency of allele A $F_A(g)$
AA	p	1-e
BB	1-p	e

Based on the probability distribution model, the target joint probability distribution model of the present embodiment is as follows:

For a target SNP site i covered by 1 or more base sequences for detection, the pair of counts of how many times the two alleles A and B are covered by a base sequence for detection is denoted $C_i = (c_i^A, c_i^B)$. Letting the random variable x be the frequency of allele A, the binomial distribution likelihood value is:

$$B(x, C_i) = \binom{c_i^A + c_i^B}{c_i^A} x^{c_i^A} (1 - x)^{c_i^B} \qquad (1)$$

In formula (1), $B(x, C_i)$ is the binomial distribution likelihood value; x is a random variable representing the frequency of allele A, whose value is known once the genotype g is determined, as given in Tables 3 and 4; $c_i^A$ is the first number of times that allele A at the target SNP site i is covered by a base sequence for detection; $c_i^B$ is the second number of times that allele B at the target SNP site i is covered by a base sequence for detection. Because the coordinates of the SNP site are known, all base sequences for detection covering the target SNP site i can be determined from the alignment step of S2. The number of base sequences for detection whose base at the position of the target SNP site i is allele A is recorded as the first number of times, and the number whose base at that position is allele B is recorded as the second number of times, so both the first number of times and the second number of times are known values. The binomial distribution likelihood value is therefore a known value that can be calculated once the genotype is determined.
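As a minimal sketch of formula (1), assuming it takes the standard binomial form with the binomial coefficient included (the coefficient depends only on the counts, so it cancels in any ratio of likelihoods computed on the same site), the likelihood for one site could be computed as follows. The function and parameter names are illustrative and do not come from the source.

```python
from math import comb


def binomial_likelihood(x: float, count_a: int, count_b: int) -> float:
    """Binomial likelihood of seeing count_a reads of allele A and
    count_b reads of allele B when the allele A frequency is x.

    Assumption: standard binomial form, coefficient included.
    """
    n = count_a + count_b
    return comb(n, count_a) * x ** count_a * (1.0 - x) ** count_b


# A site with one A read and one B read, evaluated at the heterozygous
# allele A frequency of 1/2 and at a homozygous AA frequency of 1 - e.
e = 0.001  # sequencing error rate of about 1/1000, as stated in the text
print(binomial_likelihood(0.5, 1, 1))
print(binomial_likelihood(1.0 - e, 1, 1))
```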

For each target SNP locus i, the binomial distribution likelihood value corresponding to each genotype g is first calculated using formula (1). The edge likelihood value (i.e., the marginal likelihood) of the target SNP locus i is then obtained by summing these values over all genotypes g, weighting each by its genotype distribution probability P(g). The calculation formula of the edge likelihood value is:

$$M(F, e) = \sum_g P(g)\, B(F_A(g), C_i) \qquad (2)$$

In formula (2), M(F, e) is the edge likelihood value of the target SNP locus i; F is the inbreeding coefficient of the individual; e is the sequencing error rate; P(g) is the genotype distribution probability of genotype g; $B(F_A(g), C_i)$ is the binomial distribution likelihood value; $F_A(g)$ is the frequency of allele A under genotype g; $C_i$ is the set comprising the first number of times and the second number of times.
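The weighted sum in formula (2) can be sketched as below for both hypotheses. This is an illustration under assumptions: the binomial term uses the standard form with coefficient, the genotype probabilities follow the values stated for the normal and genomic homozygous regions in this description, and all function names are hypothetical.

```python
from math import comb
from typing import List, Tuple

# Each entry is (P(g), F_A(g)) for one genotype g.
GenotypeModel = List[Tuple[float, float]]


def normal_model(p: float, F: float, e: float) -> GenotypeModel:
    """Genotypes AA, AB, BB for a normal region."""
    return [
        (p * p + p * (1 - p) * F, 1 - e),          # AA
        (2 * p * (1 - p) * (1 - F), 0.5),          # AB
        ((1 - p) ** 2 + p * (1 - p) * F, e),       # BB
    ]


def roh_model(p: float, e: float) -> GenotypeModel:
    """Genotypes AA, BB for a genomic homozygous region."""
    return [(p, 1 - e), (1 - p, e)]


def binomial_likelihood(x: float, count_a: int, count_b: int) -> float:
    # Assumed standard binomial form for formula (1).
    return comb(count_a + count_b, count_a) * x ** count_a * (1 - x) ** count_b


def marginal_likelihood(model: GenotypeModel, count_a: int, count_b: int) -> float:
    """Formula (2): sum over genotypes of P(g) * B(F_A(g), C_i)."""
    return sum(prob * binomial_likelihood(freq, count_a, count_b)
               for prob, freq in model)


# A site with one A read and one B read is far more likely under the
# normal hypothesis than under the homozygous hypothesis.
m_normal = marginal_likelihood(normal_model(0.5, 0.0, 0.001), 1, 1)
m_roh = marginal_likelihood(roh_model(0.5, 0.001), 1, 1)
print(m_normal, m_roh)
```

Keeping the genotype model as a plain list of (probability, frequency) pairs lets the same `marginal_likelihood` function serve both hypotheses.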

The edge likelihood values of all target SNP loci in the chromosome region R are multiplied to obtain the joint distribution likelihood value of the SNP loci in that chromosome region. The calculation formula is:

$$L(F, e) = \prod_{i \in R} M(F, e) \qquad (3)$$

In formula (3), L(F, e) is the joint distribution likelihood value of the chromosome region R.
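A product over many per-site values, each below 1, can underflow double-precision floating point. One common workaround, shown here as a sketch and not as something the source prescribes, is to accumulate the logarithm of the joint distribution likelihood value instead. The function name is hypothetical.

```python
import math
from typing import Iterable


def joint_log_likelihood(marginals: Iterable[float]) -> float:
    """Return log of the product in formula (3) as a sum of logs.

    Assumption: working in log space is an implementation choice made
    for numerical stability; it is not stated in the source.
    """
    total = 0.0
    for m in marginals:
        if m <= 0.0:
            return float("-inf")  # a zero factor makes the product zero
        total += math.log(m)
    return total


# 2000 sites, each with an edge likelihood value of 1e-3: the plain
# product underflows to 0.0, while the log-space sum stays finite.
values = [1e-3] * 2000
print(math.prod(values))
print(joint_log_likelihood(values))
```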

Based on the target joint probability distribution model, joint distribution likelihood values of SNP loci in the chromosome region can be determined.

In the probability distribution model, the population frequency p of the A allele in the ethnic group to which the individual belongs is a known value; the populations include East Asian, European, and so on. The sequencing error rate e is a known value of about 1/1000. This embodiment also provides another way to calculate the sequencing error rate e: select all non-SNP sites (sites other than SNP sites) in the human reference genome that are covered exactly once by a base sequence for detection. Because the base detected at a non-SNP site should be consistent with the human reference genome, a covering base that is inconsistent with the human reference genome is judged to be a sequencing error. The proportion of sites inconsistent with the human reference genome among all non-SNP sites with a coverage of 1 is counted and taken as the sequencing error rate e. The inbreeding coefficient F of the individual is calculated as follows: take the chromosome region R to be the whole human reference genome and treat the whole region as a normal region; determine the first number of times and the second number of times for each target SNP site covered by at least one base sequence for detection in the whole region; for each target SNP site, substitute the first number of times, the second number of times, and the allele A frequency $F_A(g)$ of each genotype in Table 3 into formula (1) to obtain the binomial distribution likelihood value of the target SNP site under each genotype; substitute the genotype distribution probability P(g) of each genotype in Table 3 and these binomial distribution likelihood values into formula (2) to obtain the edge likelihood value of each target SNP site; and substitute the edge likelihood values of all target SNP sites in the whole region into formula (3) to obtain $L_{whole\_genome}$. The only unknown in $L_{whole\_genome}$ is F, which can be solved by the maximum likelihood method: the F that maximizes $L_{whole\_genome}$ is taken as the estimate, so F is then also known.
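The two estimates described above can be sketched as follows. The source states only that F is obtained by the maximum likelihood method; the grid search over F in [0, 1], the log-space accumulation, the standard binomial form for formula (1), and all names below are assumptions made for illustration. The sketch also assumes 0 < p < 1 and 0 < e < 1 so that every per-site value is positive.

```python
import math
from typing import List, Tuple

# One target SNP site: (population frequency p of allele A,
#                       first number of times, second number of times)
Site = Tuple[float, int, int]


def estimate_error_rate(n_mismatch: int, n_sites: int) -> float:
    """Share of coverage-1 non-SNP sites whose base differs from the reference."""
    if n_sites <= 0:
        raise ValueError("need at least one non-SNP site with coverage 1")
    return n_mismatch / n_sites


def _binom(x: float, a: int, b: int) -> float:
    return math.comb(a + b, a) * x ** a * (1 - x) ** b


def log_likelihood_normal(sites: List[Site], F: float, e: float) -> float:
    """Log of L_whole_genome with every site treated as a normal region."""
    total = 0.0
    for p, a, b in sites:
        m = ((p * p + p * (1 - p) * F) * _binom(1 - e, a, b)        # AA
             + 2 * p * (1 - p) * (1 - F) * _binom(0.5, a, b)        # AB
             + ((1 - p) ** 2 + p * (1 - p) * F) * _binom(e, a, b))  # BB
        total += math.log(m)
    return total


def estimate_inbreeding(sites: List[Site], e: float, steps: int = 1000) -> float:
    """Grid-search maximum likelihood estimate of F on [0, 1]."""
    grid = [k / steps for k in range(steps + 1)]
    return max(grid, key=lambda F: log_likelihood_normal(sites, F, e))


e_hat = estimate_error_rate(3, 3000)
# Toy data: sites showing both alleles push the estimate of F down,
# sites showing two reads of the same allele push it up.
f_het = estimate_inbreeding([(0.5, 1, 1)] * 10, e_hat)
f_hom = estimate_inbreeding([(0.5, 2, 0)] * 10, e_hat)
print(e_hat, f_het, f_hom)
```

Under this model a site covered by a single read contributes a term that does not depend on F, so only sites with a coverage of 2 or more inform the estimate.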

Then, based on the above detection model, each chromosome region is evaluated under two hypotheses. Assuming the region is a normal region, P(g) and $F_A(g)$ from Table 3 are substituted into formulas (2) and (3); because all variables are known, the first joint distribution likelihood value of the SNP loci in the chromosome region under the normal condition can be calculated directly and is denoted $L_{Normal}$. Assuming the region is a genomic homozygous region, P(g) and $F_A(g)$ from Table 4 are substituted into formulas (2) and (3); because all variables are known, the second joint distribution likelihood value of the SNP loci in the chromosome region under the genomic homozygous condition can be calculated directly and is denoted $L_{ROH}$. In this embodiment, the step of determining, for each chromosome region, whether the chromosome region is a genomic homozygous region includes:

(1) According to the positions of all the base sequences for detection, SNP sites in the chromosome region that are covered by at least one base sequence for detection are selected as target SNP sites, and for each target SNP site a first number of times that the A allele is covered by a base sequence for detection and a second number of times that the B allele is covered by a base sequence for detection are determined.

(2) A first edge likelihood value of each target SNP locus belonging to the normal region and a second edge likelihood value of each target SNP locus belonging to the genomic homozygous region are calculated according to the first number of times and the second number of times.

For each target SNP locus, the first number of times, the second number of times, and the genotype distribution probability and allele A frequency of each first genotype corresponding to the normal region are taken as inputs, and the first edge likelihood value of the target SNP locus belonging to the normal region is calculated using the edge likelihood value calculation formula. Likewise, the first number of times, the second number of times, and the genotype distribution probability and allele A frequency of each second genotype corresponding to the genomic homozygous region are taken as inputs, and the second edge likelihood value of the target SNP locus belonging to the genomic homozygous region is calculated using the same formula.

The first genotypes corresponding to the normal region comprise AA, AB and BB. The genotype distribution probability of the first genotype AA is $p^2 + p(1-p)F$, and its allele A frequency is 1-e; the genotype distribution probability of the first genotype AB is $2p(1-p)(1-F)$, and its allele A frequency is 1/2; the genotype distribution probability of the first genotype BB is $(1-p)^2 + p(1-p)F$, and its allele A frequency is e. Here p is the population frequency of the A allele in the ethnic group to which the individual belongs, F is the inbreeding coefficient of the individual, and e is the sequencing error rate; see Table 3 for details. The second genotypes corresponding to the genomic homozygous region comprise AA and BB. The genotype distribution probability of the second genotype AA is p, and its allele A frequency is 1-e; the genotype distribution probability of the second genotype BB is 1-p, and its allele A frequency is e; see Table 4 for details.

(3) The first joint distribution likelihood value is calculated based on the first edge likelihood values of all the target SNP sites within the chromosome region, and the second joint distribution likelihood value is calculated based on the second edge likelihood values of all the target SNP sites within the chromosome region.

Specifically, the product of the first edge likelihood values of all target SNP loci in the chromosome region is calculated to obtain the first joint distribution likelihood value, and the product of the second edge likelihood values of all target SNP loci in the chromosome region is calculated to obtain the second joint distribution likelihood value.

(4) Whether the chromosomal region is a genomic homozygous region is determined according to the first joint distribution likelihood value and the second joint distribution likelihood value.

A judgment parameter is calculated from the first joint distribution likelihood value and the second joint distribution likelihood value. If the judgment parameter is larger than a preset threshold value, the chromosome region is a genomic homozygous region; otherwise, the chromosome region is a normal region.

The calculation formula of the judgment parameter is as follows:

$$T = \log \frac{L_{ROH}}{L_{Normal}} \qquad (4)$$

In formula (4), T is the judgment parameter; $L_{ROH}$ is the second joint distribution likelihood value; $L_{Normal}$ is the first joint distribution likelihood value.

Alternatively, the calculation formula of the judgment parameter is:

$$T = \log \frac{\theta_{ROH}\, L_{ROH}}{\theta_{Normal}\, L_{Normal}} \qquad (5)$$

In formula (5), T is the judgment parameter; $\theta_{ROH}$ is the prior probability that the chromosome region is a genomic homozygous region (ROH); $\theta_{Normal}$ is the prior probability that the chromosome region is a normal region; $L_{ROH}$ is the second joint distribution likelihood value; $L_{Normal}$ is the first joint distribution likelihood value. $\theta_{ROH}$ and $\theta_{Normal}$ can be assigned empirically. The value of $\theta_{ROH}$ can range from 1/5000 to 1/3000, and this embodiment sets $\theta_{ROH} = 1/3500$. Because $\theta_{Normal} = 1 - \theta_{ROH}$, once $\theta_{ROH}$ has been chosen within this range, $\theta_{Normal}$ is also known.

This embodiment provides two ways of calculating the judgment parameter. The first directly calculates the log likelihood ratio using formula (4). The second follows a Bayesian framework and uses formula (5) to calculate the log ratio of posterior probabilities, taking the prior probabilities into account. Whereas the first way relies on the current data only, the second considers both prior information and the current data and can therefore better avoid false positives.
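Both ways of computing the judgment parameter can be sketched in one function that works on log likelihoods. Several points are assumptions: the natural logarithm is used because the source does not state the base, the inputs are taken to be already in log space, the names are hypothetical, and the threshold of 0 in the example is a placeholder since the source gives no preset threshold value.

```python
import math
from typing import Optional


def judgment_parameter(log_l_roh: float,
                       log_l_normal: float,
                       theta_roh: Optional[float] = None) -> float:
    """Log likelihood ratio, optionally adjusted by the prior odds.

    With theta_roh=None this is the log likelihood ratio; otherwise the
    log prior odds theta_roh / (1 - theta_roh) are added, giving the
    log ratio of posterior probabilities.
    """
    t = log_l_roh - log_l_normal
    if theta_roh is not None:
        if not 0.0 < theta_roh < 1.0:
            raise ValueError("theta_roh must lie strictly between 0 and 1")
        t += math.log(theta_roh) - math.log(1.0 - theta_roh)
    return t


def is_roh(t: float, threshold: float) -> bool:
    """A region is called ROH when T exceeds the preset threshold."""
    return t > threshold


# Weak evidence for ROH passes a placeholder threshold of 0 on the
# likelihood ratio alone, but not once the 1/3500 prior is applied.
t_plain = judgment_parameter(-10.0, -12.0)
t_bayes = judgment_parameter(-10.0, -12.0, theta_roh=1 / 3500)
print(t_plain, is_roh(t_plain, 0.0))
print(t_bayes, is_roh(t_bayes, 0.0))
```

The example shows why the prior-adjusted form is more conservative: a small prior probability of ROH lowers T by a fixed amount, so stronger evidence is needed before a region is called.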

After one chromosome region has been evaluated, the next chromosome region is evaluated in order, until a judgment result has been obtained for every chromosome region across the whole genome. If no chromosome region in the sample to be tested is judged to be ROH, nothing is reported. If chromosome regions are judged to be ROH, their coordinates are checked for overlap, chromosome regions that overlap are merged, and the merged ROH regions are output. That is, after determining whether each chromosome region is a genomic homozygous region, the method of this embodiment further comprises merging adjacent chromosome regions that are all genomic homozygous regions.
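The merging step can be sketched as a standard interval merge. The tuple layout, the half-open coordinates, and the choice to also merge windows that merely touch are assumptions for illustration; the function name is hypothetical.

```python
from typing import List, Tuple

# (chromosome name, start, end) with half-open coordinates
Region = Tuple[str, int, int]


def merge_roh_windows(windows: List[Region]) -> List[Region]:
    """Merge ROH-called windows that overlap or touch on the same chromosome."""
    merged: List[Region] = []
    for chrom, start, end in sorted(windows):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Extend the previous region instead of starting a new one.
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


# Coordinates in Mb for readability.
calls = [("chr1", 1, 6), ("chr1", 0, 5), ("chr1", 10, 15), ("chr2", 0, 5)]
print(merge_roh_windows(calls))
```

Sorting first means the input windows do not need to arrive in genomic order.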

By establishing a mathematical detection model based on low-depth whole genome sequencing data, this embodiment detects ROH in chromosome regions of 5 Mb and above, reduces the dependence on sequencing depth while maintaining detection accuracy, and widens the diagnostic detection range of the CNV-seq technology without raising the detection cost.

Many genetic imprinting disorders associated with genomic homozygous regions have been reported, with an incidence of about 1/3500 in newborns. They lead to serious birth defects such as dysplasia and mental retardation, so missing a genomic homozygous region before birth has serious consequences. The method of this embodiment can also be applied to prenatal diagnosis to determine fetal ROH. Samples such as chorionic villi, amniotic fluid, and umbilical cord blood are obtained by invasive prenatal testing (for example, chorionic villus sampling or amniocentesis), at the cost of an invasive or minimally invasive procedure. The DNA detected in these samples is mainly fetal DNA, so the method can be used to determine whether an ROH region is present in the fetus.

In this embodiment, the results for two samples, numbered SRR 0632236 and ERR062966, are selected for display. The chip result refers to the UPD segment obtained using the chip, and the program result refers to the UPD segment obtained using the method of this embodiment. As can be seen from Fig. 4 and Fig. 5, the chip result is highly consistent with the program result, which demonstrates the effectiveness and accuracy of the method of this embodiment.

Working from low-depth whole genome sequencing, this embodiment uses a mathematical model to establish, at a sequencing depth of only 1X, distribution models of the genotype distribution probability and the allele A frequency for the normal region and the genomic homozygous region, and analyzes the difference in heterozygosity while accounting for the inbreeding coefficient, the sequencing error rate, and population allele frequency information. The sequencing depth of more than 4X required by existing methods can thereby be reduced to 1X, which greatly reduces detection cost and improves detection accuracy at low depth. The prior art suffers from a high mutation rate, a high proportion of false positives, high price, a limited sequencing range, and similar problems. This embodiment therefore builds on the low-depth whole genome sequencing (CNV-seq) technology, which expands the sequencing range, and introduces a method that works at a sequencing depth of 1X, addressing the current inability to lower the sequencing depth. By using an accurate mathematical model and probabilistic statistics that take the fetal inbreeding coefficient, the sequencing error rate, and population allele frequency information into account, the genomic homozygous region is analyzed effectively. This reduces cost, increases test accuracy and breadth of application, solves the problems of high ROH detection cost and limited application scenarios in the prior art, extends to fields such as UPD detection, and broadens the application range of low-depth whole genome sequencing.

Example 2:

This embodiment provides a system for detecting a genomic homozygous region based on low-depth sequencing data, which, as shown in Fig. 6, comprises:

the sequencing module M1 is used for sequencing a sample to be tested by using a low-depth whole genome sequencing technology with a sequencing depth of 0.1X-1X to obtain low-depth whole genome sequencing data; the sample to be tested is the DNA of an individual; the low-depth whole genome sequencing data comprises a plurality of base sequences;

a comparison module M2 for respectively comparing each base sequence with a human reference genome to obtain the position and comparison quality of each base sequence in the human reference genome, and selecting the base sequence with the comparison quality higher than or equal to the preset quality as a base sequence for detection;

the dividing module M3 is used for dividing the human reference genome into a plurality of chromosome regions according to a preset window width and a preset step length;

a detection module M4 for selecting, for each of the chromosomal regions, a SNP site in the chromosomal region covered by at least one of the detection base sequences as a target SNP site according to the positions of all the detection base sequences, and determining a first number of times the a allele is covered by the detection base sequence and a second number of times the B allele is covered by the detection base sequence at each of the target SNP sites; calculating a first edge likelihood value of each target SNP locus belonging to a normal region and a second edge likelihood value of each target SNP locus belonging to a genome homozygous region according to the first times and the second times; calculating a first joint distribution likelihood value based on first edge likelihood values of all the target SNP sites within the chromosome region, and calculating a second joint distribution likelihood value based on second edge likelihood values of all the target SNP sites within the chromosome region; determining whether the chromosomal region is a genomic homozygous region according to the first joint distribution likelihood value and the second joint distribution likelihood value.

In this specification, each embodiment is described with emphasis on its differences from the other embodiments; for parts that are the same or similar between embodiments, the embodiments may be referred to one another. Because the system disclosed in the embodiment corresponds to the method disclosed in the embodiment, its description is relatively brief, and the relevant points can be found in the description of the method.

The principles and embodiments of the present invention have been described herein with reference to specific examples, and the description of these examples is intended only to assist in understanding the method of the present invention and its core ideas. Meanwhile, those of ordinary skill in the art may, in light of the ideas of the present invention, make changes to the specific embodiments and the scope of application. In view of the foregoing, this description should not be construed as limiting the invention.

Claims

1. A method for detecting a homozygous region of a genome based on low depth sequencing data, the method comprising:

2. The method of claim 1, wherein the predetermined quality is that the base sequence and the human reference genome have only one base mismatch.

3. The method of claim 1, wherein each of the chromosomal regions has a width that is the predetermined window width, the predetermined window width being greater than the predetermined step size, and there is an overlap between adjacent chromosomal regions.

4. The method according to claim 1, wherein calculating a first edge likelihood value of each of the target SNP sites belonging to a normal region and a second edge likelihood value of each of the target SNP sites belonging to a genomic homozygous region based on the first and second times comprises:

5. The method of claim 4, wherein the first genotype corresponding to the normal region comprises AA, AB, and BB; the genotype distribution probability of the first genotype AA is $p^2 + p(1-p)F$, and the frequency of allele A is 1-e; the genotype distribution probability of the first genotype AB is $2p(1-p)(1-F)$, and the frequency of allele A is 1/2; the genotype distribution probability of the first genotype BB is $(1-p)^2 + p(1-p)F$, and the frequency of allele A is e; wherein p is the population frequency of the A allele in the ethnic group to which the individual belongs; F is the inbreeding coefficient of the individual; e is the sequencing error rate;

6. The method of claim 4, wherein the edge likelihood value calculation formula is:

$$M(F, e) = \sum_g P(g)\, B(F_A(g), C_i)$$;

7. The method according to claim 1, wherein said calculating a first joint distribution likelihood value based on first edge likelihood values of all said target SNP sites within said chromosome region and a second joint distribution likelihood value based on second edge likelihood values of all said target SNP sites within said chromosome region specifically comprises:

8. The method of claim 1, wherein said determining whether the chromosomal region is a genomic homozygous region from the first and second joint distribution likelihood values comprises:

The calculation formula of the judging parameter is as follows:

or, the calculation formula of the judging parameter is as follows:

9. The method of claim 3, wherein after determining whether the chromosomal region is a genomic homozygous region, the method further comprises: adjacent chromosomal regions, which are all homozygous regions of the genome, are combined.

10. A system for detecting a homozygous region of a genome based on low depth sequencing data, the system comprising: