CN115910200A - Non-target region genotype filling method based on whole exon sequencing

Info

Publication number: CN115910200A
Application number: CN202211684704.6A
Authority: CN
Inventors: 于晓光; 杜政霖; 邢世来
Original assignee: Wenzhou Puxi Medical Laboratory Co ltd
Current assignee: Wenzhou Puxi Medical Laboratory Co ltd
Priority date: 2022-12-27
Filing date: 2022-12-27
Publication date: 2023-04-04

Abstract

The invention discloses a non-target region genotype filling method, a system, equipment and a computer readable storage medium based on whole exon sequencing, wherein the method comprises the following steps: acquiring whole exon sequencing data of a target queue and a reference whole genome sequencing data set; filtering sites in the reference whole genome sequencing data set, and outputting SNP site information of the reference whole genome sequencing data set; calculating GL of the SNP locus of each sample based on the SNP locus information and the sequencing data of the whole exon, and obtaining GL of each sample; combining the GL of each sample to obtain the GL of all samples; fragmenting and segmenting the reference whole genome sequencing data set to obtain genome fragment information; and estimating the genotype of the non-targeted region of the single sample in the target queue by utilizing a machine learning algorithm based on GL, SNP site information and genome fragment information of all samples to obtain the genotype estimation result of the non-targeted region of the single sample.

Description

Non-target region genotype filling method based on whole exon sequencing

Technical Field

The invention relates to the technical field of gene prediction, in particular to a non-targeted region genotype filling method and a non-targeted region genotype filling system based on whole exon sequencing.

Background

With the rapid development of high-throughput sequencing (NGS), more and more genome-wide association studies (GWAS) attempt to perform genotyping using sequencing technology. Whole genome sequencing (WGS) can effectively cover most genetic variant sites, but it is expensive and therefore cannot readily be applied to large-scale cohort samples. In addition, if the sequencing depth is reduced to save cost, the genotypes of low-frequency variants, which are often located in protein-coding regions and have important biological functions, cannot be accurately identified.

Whole exon sequencing (WES) has been widely used to study the role of protein-coding variants in genetic diseases. However, over 98% of the genome consists of non-coding sequence, which contains many important regulatory elements that can significantly affect gene expression. A great deal of research shows that non-coding variation is closely related to the onset and progression of disease, so accurately measuring genetic variation across the whole genome has become a key bottleneck for researchers exploring disease mechanisms. WES uses probes designed against gene exon regions and sequences those regions after targeted capture, which markedly increases the sequencing depth of exon regions for a given data throughput; however, the technique cannot detect the non-coding variants that occupy most of the genome and often have more significant genetic effects. Studies show that WES data contain not only the sequences covered by the designed probes but also some sequences outside the probe regions that are captured and then read by the sequencer. Compared with the probe regions, the sequencing depth of these non-probe/non-target regions (off-target sequences) is often too low for accurate identification by traditional genotype identification algorithms. With the growing availability of large reference genome panels, newer genotyping algorithms can use a reference genotype panel to genotype low-coverage regions.

However, large reference genome panels are strongly population-specific. Most mainstream databases are currently based on European or admixed populations, and because haplotypes differ between ethnic groups, such panels cannot meet the low-coverage genotyping needs of specific populations in different countries/ethnicities/regions.

Disclosure of Invention

The present invention is directed to solving at least one of the problems of the prior art. To this end, the invention provides a non-target region genotype filling method and system based on whole exon sequencing. Based on whole exon sequencing and a reference whole genome sequencing data set, the method calculates the genotype possibility of the non-target regions in a target queue and obtains their genotype estimation results, integrating low-depth sequencing data with a large reference genome panel to identify the genotypes of genetic variation sites in WES non-target regions efficiently and accurately; the biological patterns hidden behind the sequencing data are mined in depth to address related life science problems.

The application discloses in a first aspect a non-targeted region genotype filling method based on whole exon sequencing, which comprises the following steps:

acquiring whole exon sequencing data and a reference whole genome sequencing data set of a target queue;

filtering the sites in the reference whole genome sequencing data set, and outputting SNP site information of the reference whole genome sequencing data set; the SNP locus information comprises a chromosome number, a genome coordinate, an allele and the crowd genotype information of a locus from which non-SNP is removed;

calculating the genotype possibility result of each SNP locus of each sample in the target queue based on the SNP locus information of the reference whole genome sequencing data set and the whole exon sequencing data of the target queue, and obtaining the genotype possibility result of each sample;

combining the genotype possibility results in each sample to obtain the genotype possibility results in all samples in the target queue;

performing fragmentation and segmentation processing on the reference whole genome sequencing data set to obtain genome fragment information after the fragmentation and segmentation processing;

and estimating the genotype of the non-target region of the single sample in the target queue by utilizing a machine learning algorithm based on the genotype possibility result of all samples in the target queue, the SNP locus information of the reference whole genome sequencing data set and the genome fragment information to obtain the genotype estimation result of the non-target region of the single sample.

The method further comprises the following steps:

calculating the genotype possibility result of each SNP locus of each sample in the target queue based on the SNP locus information of the reference whole genome sequencing data set and the whole exon sequencing data of the target queue, and obtaining the genotype possibility result of a non-target region and/or a target region in each sample;

combining the genotype possibility results of the non-targeted region and/or the targeted region in each sample to obtain the genotype possibility results of the non-targeted region and/or the targeted region in all samples of the target queue;

estimating the genotype of a single sample in the target queue by utilizing a machine learning algorithm based on the genotype possibility result of the non-targeted region and/or the targeted region in all samples in the target queue, the SNP site information of the reference whole genome sequencing data set and the genome fragment information, and estimating the genotype of the non-targeted region and/or the targeted region of the single sample in the target queue to obtain the genotype estimation result of the non-targeted region and/or the targeted region of the single sample;

optionally, the method for calculating the genotype possibility of each SNP site of each sample in the target cohort includes: calculating with mpileup of BCFtools to obtain the genotype possibility of each SNP site based on the sequencing depth;

optionally, the estimation result includes: the imputed genotype dosage, the genotype posterior probabilities and the best-guess genotype.

The machine learning algorithm includes, but is not limited to, the following algorithms for genotype estimation: an iterative optimization algorithm; the iterative optimization algorithm comprises one or more of the following: gradient descent, conjugate gradient, coordinate descent, Newton's method, stepwise regression, least-angle regression, and the method of Lagrange multipliers.

The reference whole genome sequencing dataset is a reference whole genome sequencing dataset of a specific population, and the reference whole genome sequencing dataset of the specific population comprises specific populations of different countries/ethnicities/regions;

optionally, the reference whole genome sequencing dataset of the specific population is a reference whole genome sequencing dataset of a chinese population.

The method further comprises the following steps: integrating genomic fragment information based on the estimation results to obtain a chromosome level result comprising a genotype;

optionally, the method further includes: integrating genomic fragment information based on the estimation results to obtain chromosome level results comprising genotype and haplotype information; the estimation result further includes: haplotype information.

Optionally, the chromosome level result is obtained using GLIMPSE_ligate; the chromosome level result is a VCF file.

The process of filtering the sites in the reference whole genome sequencing dataset and outputting the SNP site information of the reference whole genome sequencing dataset comprises: extracting bi-allelic genetic polymorphic sites (bi-allelic SNPs) from the reference whole genome sequencing dataset to obtain a first dataset and a second dataset comprising SNP site information of the reference whole genome sequencing dataset;

calculating the genotype possibility result of each SNP locus of each sample in the target queue based on the whole exon sequencing data, the first data set and the second data set of the target queue, and obtaining the genotype possibility result of a non-target region and/or a target region in each sample; combining the genotype possibility results of the non-targeted region and/or the targeted region in each sample to obtain the genotype possibility results of the non-targeted region and/or the targeted region in all samples of the target queue;

performing fragmentation and segmentation processing on the reference whole genome sequencing data set to obtain genome fragment information after fragmentation and segmentation processing;

estimating the genotype of a single sample in the target queue by utilizing a machine learning algorithm based on the genotype possibility result of the non-targeted region and/or the targeted region in all samples in the target queue, the first data set and the genome fragment information, and estimating the genotype of the non-targeted region and/or the targeted region of the single sample in the target queue to obtain the estimation result of the non-targeted region and/or the targeted region of the single sample;

optionally, the population genotype information with non-SNP sites removed is the first data set; the chromosome number, genomic coordinates and alleles form the second data set, which contains no genotype information and only the SNP site information of the first data set.

The genomic fragment information includes: the genomic fragment information obtained after the fragmentation and segmentation processing, comprising the chromosome, the start site coordinate and the end site coordinate.

The second aspect of the invention discloses a non-target region genotype filling system based on whole exon sequencing, which comprises:

the acquisition unit is used for acquiring the sequencing data of the whole exons of the target queue and a reference whole genome sequencing data set;

the first processing unit is used for filtering the sites in the reference whole genome sequencing data set and outputting SNP site information of the reference whole genome sequencing data set; the SNP locus information comprises a chromosome number, a genome coordinate, an allele and the crowd genotype information of a locus from which non-SNP is removed;

the second processing unit is used for calculating the genotype possibility result of each SNP locus of each sample in the target queue based on the SNP locus information of the reference whole genome sequencing data set and the whole exon sequencing data of the target queue and obtaining the genotype possibility result of each sample;

the third processing unit is used for combining the genotype possibility results in each sample to obtain the genotype possibility results in all the samples of the target queue;

the fourth processing unit is used for carrying out fragmentation and segmentation processing on the reference whole genome sequencing data set to obtain genome fragment information after fragmentation and segmentation processing;

and the fifth processing unit is used for estimating the genotype of the non-target area of the single sample in the target queue by utilizing a machine learning algorithm based on the genotype possibility result of all samples in the target queue, the SNP locus information of the reference whole genome sequencing data set and the genome fragment information to obtain the genotype estimation result of the non-target area of the single sample.

The third aspect of the invention discloses a non-target region genotype filling device based on whole exon sequencing, which comprises: a memory and a processor;

the memory is to store program instructions; the processor is configured to invoke program instructions that, when executed, perform the non-targeted region genotype filling method based on whole exon sequencing described above.

In a fourth aspect of the invention, a computer readable storage medium is disclosed, on which a computer program is stored, which, when being executed by a processor, implements the non-targeted region genotype filling method based on whole exon sequencing as described above.

The application further discloses use of the device in diagnosing the occurrence and development of diseases; optionally, the development of said disease is associated with changes in the type and number of microorganisms; optionally, the disease comprises an eye disease (such as myopia), and the non-targeted region and/or the targeted region is filled using WES data of myopic individuals.

The application has the following beneficial effects:

1. The application discloses a novel method for filling genotypes of non-target regions. Based on whole exon sequencing and a reference whole genome sequencing data set, the method calculates the genotype possibility of the non-target regions and/or target regions in a target queue and obtains their genotype estimation results, integrating low-depth sequencing data with a large reference genome panel (namely the reference whole genome sequencing data set in this scheme) to identify the genotypes of genetic variation sites in WES non-target regions efficiently and accurately. The biological patterns hidden behind the sequencing data are mined in depth, and the filled genotypes allow the non-target regions to be identified accurately in subsequent processing. This solves the problem that the conventional WES analysis workflow cannot identify the large number of genetic variation sites in non-coding regions. At the same time, using a population-specific reference whole genome sequencing data set solves the problem that existing mainstream databases lack specificity for populations of different countries/ethnicities/regions. When the reference data set is that of the Chinese population, the large reference genotype panel of the Chinese population serves as the haplotype reference, which effectively improves the genotyping of Chinese population cohorts. The invention thereby overcomes both limitations of the existing WES analysis workflow.

2. The application uses an iterative optimization algorithm to estimate sample genotypes, combines bioinformatics with machine learning, makes full use of existing sequencing data and technical means, and solves the problem that low cost and high accuracy cannot currently be achieved together.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic flow chart of a non-targeted region genotype filling method based on whole exon sequencing according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a non-target region genotype filling device based on whole exon sequencing according to an embodiment of the present invention;

FIG. 3 is a schematic flow chart of a non-targeted region genotype-filling system based on whole exon sequencing according to an embodiment of the present invention;

FIG. 4 is a flowchart of the WES non-targeted region genotype fill-in provided by an embodiment of the present invention;

FIG. 5 is a schematic diagram of the distribution of the non-target region post-quality control sites on the genome provided by the embodiment of the invention.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention.

Some of the flows described in this specification, in the claims and in the above figures contain a number of operations that appear in a particular order. It should be clearly understood that these operations may be performed out of the order in which they appear herein, or in parallel; the operation numbers such as 101 and 102 merely distinguish the operations from one another and do not by themselves represent any order of performance. Additionally, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that descriptions such as "first" and "second" in this document are used to distinguish different messages, devices, modules, etc.; they do not represent a sequential order, nor do they require the "first" and "second" items to be of different types.

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Fig. 1 is a schematic flow chart of a non-target region genotype filling method based on whole exon sequencing according to an embodiment of the present invention, specifically, the method includes the following steps:

101: acquiring whole exon sequencing data and a reference whole genome sequencing data set of a target queue;

in one embodiment, the whole exon sequencing data of the target cohort is preprocessed data; the preprocessing comprises: obtaining whole exon sequencing raw data, and human reference genome sequence data (FASTA format file) from the University of California, Santa Cruz genome browser (UCSC Genome Browser); performing quality control on the whole exon sequencing raw data of each sample of the target cohort, and aligning the quality-controlled data to the human reference genome sequence using a genome alignment tool (BWA) to obtain an alignment result (a BAM format file storing the positions of the sequencing reads); the alignment result is the whole exon sequencing data of the target cohort. Detection of variation in the conventional targeted region includes:

in one embodiment, the reference whole genome sequencing dataset is a reference whole genome sequencing dataset for a specific population, the reference whole genome sequencing dataset for the specific population comprising specific populations of different countries/ethnicities/regions; optionally, the reference whole genome sequencing dataset of the specific population is a reference whole genome sequencing dataset of a Chinese population. The dataset comprises 24,114,149 genetic polymorphism sites from about 12,000 Chinese individuals; the whole genome sequencing data of the Chinese population is obtained from a public database, and the genome version is GRCh37.p5;

in one embodiment, the non-targeted regions in the whole exon sequencing data are identified as follows: whole exon sequencing uses probes designed against gene exon regions to capture those regions in a targeted manner, so most of the captured sequence belongs to the targeted region; however, some non-targeted sequences are captured as well.
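One way to picture the distinction between targeted and non-targeted positions is the following Python sketch. It assumes the probe design is available as half-open (start, end) intervals for one chromosome, as in a BED file; the interval values and function names are hypothetical and are not taken from the patent.

```python
from bisect import bisect_right

def build_index(targets):
    """Sort and merge half-open (start, end) probe intervals of one chromosome."""
    merged = []
    for start, end in sorted(targets):
        if merged and start <= merged[-1][1]:
            # Overlapping or touching interval: extend the previous one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def is_targeted(merged, pos):
    """Return True if 0-based position `pos` lies inside a merged probe interval."""
    starts = [s for s, _ in merged]
    i = bisect_right(starts, pos) - 1
    return i >= 0 and pos < merged[i][1]

# Hypothetical probe intervals; the first two overlap and are merged.
probes = build_index([(100, 200), (150, 260), (1000, 1100)])
labels = {
    p: ("targeted" if is_targeted(probes, p) else "non-targeted")
    for p in (120, 259, 260, 500, 1050)
}
print(labels)
```

Reads or SNP sites labelled "non-targeted" here correspond to the off-target sequence whose genotypes the method fills.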

In one embodiment, the target samples are the samples to be genotyped with the present tool: a WES-sequenced cohort that is the subject of the investigator's study, typically comprising disease cases and healthy controls.

102: filtering the sites in the reference whole genome sequencing data set, and outputting SNP site information of the reference whole genome sequencing data set; the SNP locus information comprises a chromosome number, a genome coordinate, an allele and the crowd genotype information of a locus from which non-SNP is removed;

in one embodiment, filtering the sites in the reference whole genome sequencing data set and outputting the SNP site information of the reference whole genome sequencing data set comprises: extracting bi-allelic genetic polymorphic sites (bi-allelic SNPs) from the reference whole genome sequencing dataset to obtain a first dataset and a second dataset comprising the SNP site information of the reference whole genome sequencing dataset. The filtering step serves the GL calculation, which must cover, for every target individual, all variant sites present in the haplotype reference panel used for imputation/filling. The tool used to calculate GL, the variant call file processing tool BCFtools, cannot accurately calculate GL for insertions and deletions in the genome, so the bi-allelic SNPs in the panel need to be extracted. The input is the Variant Call Format (VCF) file of the reference genotype panel, which contains the genotype information of a Chinese population cohort. The `bcftools view` command is used to extract the bi-allelic SNPs with the parameters `-m2 -M2 -v snps`. After extraction, two output files are obtained. One is still in VCF format (first data set) and holds the population genotype information with non-SNP sites removed. The other ends with the 'tsv' suffix (second data set), contains no genotype information, and stores only the SNP site information of the VCF: chromosome number, genome coordinates, and alleles.
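The extraction described above could be scripted roughly as follows. This is a minimal sketch that only assembles the command lines: the file names are hypothetical, and the use of `bcftools query` with the format string shown to produce the site table is an assumption, since the patent names only `bcftools view` and its `-m2 -M2 -v snps` parameters.

```python
import shlex

def site_extraction_commands(panel_vcf, out_vcf, out_tsv):
    """Build (but do not run) the two bcftools calls for the site filter.

    view  : keep bi-allelic SNPs only and write the first data set (VCF).
    query : dump chromosome, position and alleles as the second data set.
            The query step and its format string are assumptions.
    """
    view = ["bcftools", "view", "-m2", "-M2", "-v", "snps",
            panel_vcf, "-Oz", "-o", out_vcf]
    query = ["bcftools", "query", "-f", r"%CHROM\t%POS\t%REF,%ALT\n",
             out_vcf, "-o", out_tsv]
    return view, query

# Hypothetical file names, for illustration only.
view_cmd, query_cmd = site_extraction_commands(
    "panel.vcf.gz", "panel.snps.vcf.gz", "panel.snps.tsv")
print(shlex.join(view_cmd))
print(shlex.join(query_cmd))
```

Passing each list to `subprocess.run(cmd, check=True)` would execute it on a machine where BCFtools is installed; any compression or indexing of the outputs that later steps need is left out of the sketch.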

estimating the genotype of a single sample in the target queue by using a machine learning algorithm based on the genotype possibility result of the non-targeted region and/or the targeted region in all samples in the target queue, the first data set and the genome fragment information, and estimating the genotype of the non-targeted region and/or the targeted region of the single sample in the target queue to obtain the estimation result of the non-targeted region and/or the targeted region of the single sample;

in one embodiment, the population genotype information with non-SNP sites removed is the first data set (VCF file); the chromosome number, genomic coordinates and alleles form the second data set (tsv file), which contains no genotype information and only the SNP site information of the first data set.

In one embodiment, the genome fragment information (text format) comprises: the genome fragment information obtained after the fragmentation and segmentation processing, including chromosome, start site coordinate and end site coordinate. In the fragmentation and segmentation process, the parameters are set as follows: the minimum sliding window interval N is 2 Mb, and the minimum buffer interval M is 200 Kb.

103: calculating the genotype possibility result of each SNP site of each sample in the target queue based on the SNP site information of the reference whole genome sequencing data set and the whole exon sequencing data of the target queue, and obtaining the genotype possibility result of each sample;

in one embodiment, GL is calculated separately for each sample using the mpileup tool of BCFtools; the input files are the BAM file after genome alignment, the first data set (VCF file) and the second data set (TSV file). The parameters used for the GL calculation with BCFtools mpileup are `-I -E -a 'FORMAT/DP'`. The calculation yields a VCF file that stores the genotype likelihood of each site estimated from the sequencing depth. In one example, genotype likelihoods (GL) measure, as probability values, how likely the observed sequencing data are under each possible genotype.
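As a rough sketch of this step, the Python snippet below assembles a per-sample `bcftools mpileup` command line with the parameters named above, and converts Phred-scaled likelihoods (the PL values that mpileup writes) into normalized genotype likelihoods. The file names, the helper names, and the use of `-f` for the reference FASTA and `-T` for the site list are assumptions for illustration; the patent itself specifies only `-I -E -a 'FORMAT/DP'`.

```python
import shlex

def mpileup_command(bam, reference_fasta, sites_vcf, out_vcf):
    """Build (but do not run) a bcftools mpileup call for one sample.

    -I skips indel calling, -E recalculates base alignment quality and
    -a FORMAT/DP adds the per-sample depth. All paths are hypothetical.
    """
    return ["bcftools", "mpileup", "-f", reference_fasta,
            "-I", "-E", "-a", "FORMAT/DP",
            "-T", sites_vcf, bam, "-Oz", "-o", out_vcf]

def pl_to_likelihoods(pl):
    """Convert Phred-scaled likelihoods (PL) to likelihoods that sum to 1.

    PL = -10 * log10(L), so L = 10 ** (-PL / 10); the values are then
    rescaled so the three genotypes (0/0, 0/1, 1/1) sum to one.
    """
    raw = [10 ** (-p / 10) for p in pl]
    total = sum(raw)
    return [r / total for r in raw]

cmd = mpileup_command("sample1.bam", "ref.fa", "panel.snps.vcf.gz",
                      "sample1.gl.vcf.gz")
print(shlex.join(cmd))

# A low-depth site: the data slightly favour 0/0 over 0/1.
low_depth = pl_to_likelihoods([0, 3, 30])
print([round(x, 4) for x in low_depth])
```

At low depth the likelihoods of neighbouring genotypes stay close to each other, which is why the later estimation step borrows information from the reference panel instead of calling genotypes from these values directly.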

104: combining the genotype possibility results in each sample to obtain the genotype possibility results in all samples in the target queue;

in one embodiment, after GL is sequentially calculated for the samples in the target queue, the GL results of the individual samples are merged into an integrated VCF file using merge command of BCFtools for subsequent steps.
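Conceptually, the merge turns one column of per-site values per sample into a single site-by-sample table. The sketch below shows that idea on in-memory dictionaries; in practice the merge command of BCFtools performs it on VCF files. The sample names, positions and PL triples are invented for illustration.

```python
def merge_samples(per_sample):
    """Combine per-sample {site: value} maps into one site-by-sample table.

    Sites missing in a sample are filled with None, mirroring a missing
    entry in a merged VCF.
    """
    samples = sorted(per_sample)
    sites = sorted({site for calls in per_sample.values() for site in calls})
    table = {site: [per_sample[s].get(site) for s in samples] for site in sites}
    return samples, table

# Hypothetical PL triples for two samples on one chromosome.
per_sample = {
    "S1": {("22", 1000): (0, 3, 30), ("22", 2000): (10, 0, 10)},
    "S2": {("22", 1000): (0, 6, 60)},
}
samples, table = merge_samples(per_sample)
for site, values in table.items():
    print(site, dict(zip(samples, values)))
```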

105: performing fragmentation and segmentation processing on the reference whole genome sequencing data set to obtain genome fragment information after the fragmentation and segmentation processing;

in one embodiment, defining small genome segments (chunks) is an important step before constructing haplotypes for genotyping/filling. Regions that are too long greatly increase the running time, while regions that are too short reduce the accuracy of subsequent genotype identification. The genome is divided into regions using the GLIMPSE_chunk program of GLIMPSE, with the minimum sliding window interval set to 2 Mb and the minimum buffer interval set to 200 Kb. The input file is the VCF file of the reference genotype panel (first data set); the output file is in text format, and each line describes one genome fragment, including chromosome and start and end site coordinates.
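The role of the window and buffer sizes can be illustrated with a simplified, fixed-width chunker. This is not the algorithm of GLIMPSE_chunk, which the patent uses and which may size its windows differently; the function below is a hypothetical stand-in that only shows how a 2 Mb core region is padded by a 200 Kb buffer on each side.

```python
def make_chunks(chrom, first_pos, last_pos, window=2_000_000, buffer=200_000):
    """Split [first_pos, last_pos] into fixed core windows with flanking buffers.

    Returns tuples (chrom, buffered_start, buffered_end, core_start, core_end).
    Core regions tile the chromosome without gaps or overlap; buffered
    regions overlap their neighbours by up to 2 * buffer.
    """
    chunks = []
    start = first_pos
    while start <= last_pos:
        end = min(start + window - 1, last_pos)
        buf_start = max(first_pos, start - buffer)
        buf_end = min(last_pos, end + buffer)
        chunks.append((chrom, buf_start, buf_end, start, end))
        start = end + 1
    return chunks

# Hypothetical 5 Mb region on chromosome 22.
chunks = make_chunks("chr22", 1, 5_000_000)
for chunk in chunks:
    print("\t".join(str(field) for field in chunk))
```

The overlap created by the buffers gives the estimation step haplotype context at the edges of every core region.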

106: and estimating the genotype of the non-target region of the single sample in the target queue by utilizing a machine learning algorithm based on the genotype possibility result of all samples in the target queue, the SNP locus information of the reference whole genome sequencing data set and the genome fragment information to obtain the genotype estimation result of the non-target region of the single sample.

In one embodiment, the genotype likelihood results in all samples of the target cohort are preliminary or non-targeted genotype likelihood results.

In one embodiment, the estimation result includes: the imputed genotype dosage, the genotype posterior probabilities and the best-guess genotype; a single output file contains all three results.

In one embodiment, the machine learning algorithm includes, but is not limited to, the following algorithms for genotype estimation: an iterative optimization algorithm; the iterative optimization algorithm comprises one or more of the following: gradient descent, conjugate gradient, coordinate descent, Newton's method, stepwise regression, least-angle regression, and the method of Lagrange multipliers.

In one embodiment, the genotype of the non-targeted region is estimated using the GLIMPSE_phase program, which is optimized for low-coverage sequencing data and estimates genotypes accurately from the similarity between the haplotype structure of the reference genotype panel and that of the target cohort. GLIMPSE_phase is an open-source bioinformatics tool, and the iterative optimization algorithm is a step executed internally by GLIMPSE_phase, so it runs automatically when the program is used. In this embodiment, estimating genotypes with GLIMPSE_phase therefore invokes the iterative optimization algorithm directly; genotype estimation is, however, not limited to the GLIMPSE_phase program or to the iterative optimization algorithm.
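To give a feel for how an iterative procedure can refine estimates starting from genotype likelihoods, the sketch below runs a textbook expectation-maximization loop that estimates the alternate-allele frequency at a single site. It is not the algorithm implemented inside GLIMPSE_phase, which additionally uses the haplotypes of the reference panel; the function name, the Hardy-Weinberg prior and the example likelihoods are assumptions made for illustration.

```python
def em_allele_frequency(likelihoods, iterations=50, start=0.2):
    """Estimate the alternate-allele frequency at one site by EM.

    likelihoods: one (L00, L01, L11) triple per sample.
    E-step: genotype posteriors under a Hardy-Weinberg prior with frequency f.
    M-step: f becomes the mean expected alternate-allele count divided by 2.
    """
    f = start
    for _ in range(iterations):
        f = min(max(f, 1e-6), 1 - 1e-6)  # keep every prior strictly positive
        priors = ((1 - f) ** 2, 2 * f * (1 - f), f ** 2)
        expected_alt = 0.0
        for sample in likelihoods:
            post = [l * p for l, p in zip(sample, priors)]
            total = sum(post)
            expected_alt += (post[1] + 2 * post[2]) / total
        f = expected_alt / (2 * len(likelihoods))
    return f

# Four samples whose genotypes are known with certainty: 0/0, 0/1, 1/1, 0/0.
certain = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 0, 0)]
print(round(em_allele_frequency(certain), 3))

# Completely uninformative data leave the starting value unchanged.
flat = [(1, 1, 1)] * 4
print(round(em_allele_frequency(flat), 3))
```

With certain genotypes the estimate equals the observed allele count (3 of 8 alleles), while flat likelihoods carry no information, which is the situation a reference panel is meant to remedy in low-coverage regions.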

The input data for this step includes: the genotype likelihood results (VCF file) of the non-targeted and/or targeted regions of all samples in the target cohort, the VCF file of the reference genotype panel (first dataset), and the genome fragment information; the output file format is still VCF, and includes the best-guess genotype, the imputed genotype dosage, and the genotype posterior probabilities. In an output record, the first item, in the form 0/0, is the best-guess genotype; the second item, for example 0.038, is the imputed genotype dosage; the third item, for example 0.961,0.038,0, is the genotype posterior probabilities; the fourth item is the haplotype information.
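The three numeric items in the example record are linked: the dosage is the expected number of alternate alleles under the posterior probabilities, and the best-guess genotype is the most probable one. The short Python check below works through the example values; the FORMAT keys `GT`, `DS` and `GP` are the conventional VCF tag names and are an assumption here, since the patent does not name the tags, and the haplotype item is left out.

```python
def parse_sample_field(format_keys, sample_value):
    """Split a VCF sample column into genotype, dosage and posteriors."""
    fields = dict(zip(format_keys.split(":"), sample_value.split(":")))
    gp = [float(x) for x in fields["GP"].split(",")]
    return {"GT": fields["GT"], "DS": float(fields["DS"]), "GP": gp}

def dosage_from_gp(gp):
    """Expected alternate-allele count: 0*P(0/0) + 1*P(0/1) + 2*P(1/1)."""
    return gp[1] + 2 * gp[2]

def best_guess(gp):
    """Most probable unphased genotype under the posterior probabilities."""
    return ("0/0", "0/1", "1/1")[gp.index(max(gp))]

record = parse_sample_field("GT:DS:GP", "0/0:0.038:0.961,0.038,0")
print(record["GT"], record["DS"], record["GP"])
print(dosage_from_gp(record["GP"]), best_guess(record["GP"]))
```

The recomputed dosage matches the second item of the record, and the most probable genotype matches the first; the posteriors in the example sum to 0.999 only because the values are rounded.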

In one embodiment, the method further comprises:

calculating the genotype possibility result of each SNP locus of each sample in the target queue based on the SNP locus information of the reference whole genome sequencing data set and the whole exon sequencing data of the target queue, and obtaining the genotype possibility result of a non-target region and/or a target region in each sample; combining the genotype possibility results of the non-targeted region and/or the targeted region in each sample to obtain the genotype possibility results of the non-targeted region and/or the targeted region in all samples in the target queue;

estimating the genotype of a single sample in the target queue by using a machine learning algorithm based on the genotype possibility result of the non-target region and/or the target region in all samples in the target queue, the SNP locus information of the reference whole genome sequencing data set and the genome fragment information, and estimating the genotype of the non-target region and/or the target region of the single sample in the target queue to obtain the genotype estimation result of the non-target region and/or the target region of the single sample.

In one embodiment, the method further comprises: integrating genomic fragment information based on the estimation results to obtain a chromosome level result comprising a genotype;

optionally, the method further includes: integrating genomic fragment information based on the estimation results to obtain chromosome level results comprising genotype and haplotype information;

optionally, the chromosome-level result is obtained using GLIMPSE_ligate. The chromosome-level result is a VCF file. The estimation result further includes haplotype information. In this step it is important to merge the fragmented genome segments without losing imputed information. The fragments are integrated using the GLIMPSE_ligate program to obtain a chromosome-level VCF file that retains genotype and haplotype information. The input of this step is the estimation results in VCF format, and the output is a single file covering the whole chromosome, also in VCF format.
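The idea of joining per-fragment results into one chromosome-level result can be sketched as a position-ordered concatenation in which a position reported by two overlapping fragments is kept only once. This is a simplified illustration under stated assumptions: it does not reproduce how the ligation program reconciles haplotype phase between fragments, and the `Record` type and `ligate_fragments` function are hypothetical.

```python
from typing import Dict, List, Tuple

Record = Tuple[int, str]  # (position, genotype field); hypothetical minimal record


def ligate_fragments(fragments: List[List[Record]]) -> List[Record]:
    """Concatenate per-fragment records into one position-sorted list.

    When two fragments report the same position, the record from the
    fragment that starts first in genomic order is kept.
    """
    merged: Dict[int, Record] = {}
    ordered = sorted(fragments, key=lambda frag: frag[0][0] if frag else 0)
    for fragment in ordered:
        for record in fragment:
            merged.setdefault(record[0], record)
    return [merged[pos] for pos in sorted(merged)]


# Two fragments that overlap at position 300.
result = ligate_fragments([
    [(300, "0/1"), (400, "1/1")],
    [(100, "0/0"), (300, "0/0")],
])
print(result)
```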

The method first processes the targeted and non-targeted regions together: both are input as a whole to calculate GL, followed by the subsequent series of calculations. The targeted region is then processed separately, with SNP calling performed on it using the conventional GATK pipeline. The non-targeted regions of the target cohort are genotype-filled by the present invention. GATK is a toolset for variant discovery, and the typical GATK workflow consists essentially of 7 steps, grouped into three parts: the first part is sequence alignment, mainly completed by BWA; the second part is data cleaning, which mainly comprises Mark Duplicates, sorting, indel realignment and base recalibration; the third part is variant discovery, mainly by HaplotypeCaller followed by Joint Genotyping and Variant Recalibration.

A gene consists of thousands of nucleotide pairs. The nucleotide sequence constituting a gene can be divided into different segments, which play different roles during gene expression. A segment that can be transcribed into the corresponding messenger RNA and thereby direct protein synthesis (i.e., that can encode a protein) is called a coding region. Segments that do not encode a protein are called non-coding regions. The non-coding regions lie upstream and downstream of the coding region, belong to the gene, and control whether and how strongly the gene is expressed.

The non-coding region cannot encode a protein, but it is essential for the expression of genetic information. It carries nucleotide sequences that regulate the expression of genetic information, such as an RNA polymerase binding site, and therefore has genetic effect. When the non-coding region is mutated, heredity is not affected.

The noncoding region is essential for the gene of interest. The non-coding region has a binding site for RNA polymerase, and has a regulation effect.

Insertions, deletions and substitutions of bases in non-coding regions of genes also belong to gene mutation events, although most studies are limited to coding region mutations. Promoters are located in non-coding regions. Most of the palindromic sequences are located in noncoding regions because of their specific arrangement.

Fig. 2 is a non-target region genotype filling apparatus based on whole exon sequencing, which includes: a memory and a processor; the memory is to store program instructions; the processor is configured to invoke program instructions that, when executed, perform the non-targeted region genotype filling method based on whole exon sequencing described above.

FIG. 3 is a non-target region genotype filling system based on whole exon sequencing, which is provided by the embodiment of the present invention and comprises:

an obtaining unit 301, configured to obtain whole exon sequencing data of the target cohort and a reference whole genome sequencing data set;

a first processing unit 302, configured to filter sites in the reference whole genome sequencing data set and output SNP site information of the reference whole genome sequencing data set; the SNP site information comprises the chromosome number, genomic coordinates, alleles, and the population genotype information of the sites remaining after non-SNP sites are removed;

a second processing unit 303, configured to calculate the genotype likelihood result of each SNP site of each sample in the target cohort based on the SNP site information of the reference whole genome sequencing data set and the whole exon sequencing data of the target cohort, and obtain the genotype likelihood results in each sample;

a third processing unit 304, configured to combine the genotype likelihood results in each sample to obtain the genotype likelihood results in all samples in the target cohort;

a fourth processing unit 305, configured to perform fragmentation and segmentation processing on the reference whole genome sequencing data set to obtain genomic fragment information after the fragmentation and segmentation processing;

a fifth processing unit 306, configured to estimate the genotype of the non-targeted region of a single sample in the target cohort by using a machine learning algorithm, based on the genotype likelihood results in all samples in the target cohort, the SNP site information of the reference whole genome sequencing data set, and the genomic fragment information, so as to obtain the genotype estimation result of the non-targeted region of the single sample.
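The fragmentation performed by the fourth processing unit produces, for each fragment, a chromosome, a start coordinate and an end coordinate. The following is a minimal sketch, assuming fixed-width windows that are each padded by a buffer on both sides; the window and buffer sizes, the `Fragment` type and the `make_fragments` function are hypothetical, and an actual chunking program may instead size fragments by the number of variants they contain.

```python
from typing import List, NamedTuple


class Fragment(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def make_fragments(chrom: str, length: int, window: int, buffer: int) -> List[Fragment]:
    """Split positions 1..length into windows of `window` bp padded by `buffer` bp."""
    if window <= 0 or buffer < 0:
        raise ValueError("window must be positive and buffer non-negative")
    fragments = []
    for core_start in range(1, length + 1, window):
        core_end = min(core_start + window - 1, length)
        fragments.append(
            Fragment(chrom, max(1, core_start - buffer), min(length, core_end + buffer))
        )
    return fragments


# Toy chromosome of 5000 bp, 2000 bp windows, 200 bp buffers.
for fragment in make_fragments("chr22", 5000, 2000, 200):
    print(f"{fragment.chrom}\t{fragment.start}\t{fragment.end}")
```

Neighbouring fragments overlap by their buffers, so that sites near a fragment boundary still have flanking haplotype context when genotypes are estimated.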

The embodiment of the invention also provides a non-targeted region genotype filling system based on whole exon sequencing, which comprises:

a first acquisition unit for acquiring a reference whole genome sequencing dataset;

a first processing unit for extracting bi-allelic genetic polymorphism sites (bi-allelic SNPs) from the reference whole genome sequencing dataset to obtain a first dataset and a second dataset;

a second obtaining unit, configured to obtain full exon sequencing data of the target cohort;

a second processing unit, configured to filter the sites in the reference whole genome sequencing data set and output SNP site information of the reference whole genome sequencing data set; the SNP site information comprises the chromosome number, genomic coordinates, alleles, and the population genotype information of the sites remaining after non-SNP sites are removed; the population genotype information of the sites remaining after non-SNP sites are removed is the first dataset (VCF file); the chromosome number, genomic coordinates and alleles form the second dataset (TSV file), which contains no genotype information, only the SNP site information of the first dataset;

a third processing unit, configured to calculate the genotype likelihood result of each SNP site of each sample in the target cohort based on the whole exon sequencing data of the target cohort, the first dataset and the second dataset, and obtain the genotype likelihood results of the non-targeted region and/or the targeted region in each sample;

a fourth processing unit, configured to combine the genotype likelihood results of the non-targeted region and/or the targeted region in each sample to obtain the genotype likelihood results of the non-targeted region and/or the targeted region in all samples in the target cohort;

a fifth processing unit, configured to perform fragmentation and segmentation processing on the reference whole genome sequencing data set to obtain genomic fragment information after the fragmentation and segmentation processing;

and a sixth processing unit, configured to estimate the genotype of the non-targeted region and/or the targeted region of a single sample in the target cohort by using a machine learning algorithm, based on the genotype likelihood results of the non-targeted region and/or the targeted region in all samples in the target cohort, the first dataset and the genomic fragment information, to obtain the estimation result of the non-targeted region and/or the targeted region of the single sample.
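The relationship between the first dataset (VCF with genotypes) and the second dataset (TSV with site information only) can be sketched as follows. This is a minimal illustration operating on VCF text lines, assuming a bi-allelic SNP is a record with one reference base and exactly one single-base alternate allele; the function name `split_reference_panel` and the four-column TSV layout are hypothetical.

```python
from typing import Iterable, List, Tuple

BASES = {"A", "C", "G", "T"}


def split_reference_panel(vcf_lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (first dataset, second dataset) from VCF text lines.

    first:  header lines plus bi-allelic SNP records, genotypes retained
    second: tab-separated 'chrom, position, ref, alt' lines, no genotypes
    """
    first: List[str] = []
    second: List[str] = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            first.append(line)  # keep the header in the VCF-like output
            continue
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        # A multi-allelic ALT such as "G,T", an indel or a symbolic allele
        # is not a single base and is therefore filtered out.
        if ref in BASES and alt in BASES:
            first.append(line)
            second.append("\t".join((chrom, pos, ref, alt)))
    return first, second


example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1",
    "chr1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0|1",    # bi-allelic SNP: kept
    "chr1\t200\t.\tA\tG,T\t.\tPASS\t.\tGT\t1|2",  # multi-allelic: removed
    "chr1\t300\t.\tAT\tA\t.\tPASS\t.\tGT\t0|1",   # deletion: removed
]
first_dataset, second_dataset = split_reference_panel(example)
print(second_dataset)
```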

Fig. 4 is a WES non-targeted region genotype filling flowchart provided in the embodiment of the present invention, which mainly relates to a process of preprocessing whole exon sequencing data of a target cohort (steps of library construction, comparison with a reference genome, and the like), calculating GL in a targeted region and/or a non-targeted region in each sample of the target cohort based on a large reference genome panel, and merging to finally obtain a genotype estimation result of the targeted region and/or the non-targeted region of a single sample.

The experimental results are as follows:

(1) Counting sequencing coverage of WES non-target areas;

a) Coverage statistics for standard WES sequencing data: the average coverage of the exon probe regions is 77X, and the average coverage of the non-probe regions is 0.87X;

b) In the non-probe regions, 5%, 3.75% and 3.19% of sites were covered at 1X, 3X and 5X, respectively; the number of sites that ultimately reached coverage above 3X was 4,910,578.
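The coverage fractions reported above are of the following kind: for each depth threshold, the share of sites whose sequencing depth reaches that threshold. A minimal sketch is given below, assuming that "covered at nX" means a depth of at least n; the function name and the toy depth values are hypothetical and are not the data of this embodiment.

```python
from typing import Dict, Sequence


def fraction_covered(depths: Sequence[int], thresholds: Sequence[int] = (1, 3, 5)) -> Dict[int, float]:
    """Fraction of sites whose depth is at least each threshold."""
    if not depths:
        raise ValueError("at least one site is required")
    total = len(depths)
    return {t: sum(d >= t for d in depths) / total for t in thresholds}


# Toy per-site depths for six sites.
toy_depths = [0, 0, 1, 3, 5, 10]
print(fraction_covered(toy_depths))
```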

(2) Distribution of sites covered at more than 3X in the Han population panel and in the 1000 Genomes panel;

a) The number of sites with a sequencing depth above 3X is 4910578 according to the Han population genotype panel and 16919510 according to the 1000 Genomes panel, and the number of sites shared by the two is 3419584. Because the 1000 Genomes data cover more diverse populations, they contain more polymorphic sites, but some sites remain specific to the Chinese population.

(3) Correlation of Genotype Likelihood (GL) quality values with sequence coverage;

When the genotype panel of the Chinese population is used, the quality value of the calculated genotype likelihood (GL) is strongly correlated with the sequence coverage (correlation coefficient 0.676).
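The text does not state which correlation coefficient was used. As an illustration only, the following sketch computes the Pearson correlation coefficient between per-site coverage and a per-site quality value; the choice of Pearson, the function name and the toy values are assumptions.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy values: quality rises with coverage, but not perfectly linearly.
coverage = [1, 2, 3, 4, 5]
quality = [3, 6, 8, 9, 15]
print(round(pearson(coverage, quality), 3))
```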

(4) Distribution of the genotype imputation/filling quality value (INFO score) in the non-targeted region;

Using the Chinese population reference genotype panel (whole genome sequencing data set), a total of 23639100 polymorphic sites are obtained in the non-targeted region after imputation; their quality values are distributed as follows:

Region	INFO>0.9	INFO>0.6	INFO>0.3
WES targeting area	99.82%	99.90%	99.91%
WES non-target area	66.91%	73.42%	77.89%

Owing to its higher sequencing coverage, the WES targeted region has higher INFO values, but 66.91% of the sites in the non-targeted region still have an INFO value above 0.9. Applying the thresholds INFO > 0.9 and sequencing coverage > 3X yields 4910579 polymorphic sites.
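The text does not give the formula behind the INFO score. As a hedged illustration, the sketch below computes the widely used IMPUTE-style information measure from per-sample genotype posterior probabilities and then applies the two thresholds named above; whether this embodiment uses exactly this definition is an assumption, and the function names are hypothetical.

```python
from typing import Sequence, Tuple

Posterior = Tuple[float, float, float]  # P(0/0), P(0/1), P(1/1) for one sample


def info_score(posteriors: Sequence[Posterior]) -> float:
    """IMPUTE-style INFO: 1 - observed dosage variance / expected variance."""
    n = len(posteriors)
    if n == 0:
        raise ValueError("at least one sample is required")
    dosages = [p_het + 2 * p_alt for _, p_het, p_alt in posteriors]
    theta = sum(dosages) / (2 * n)  # estimated alternate-allele frequency
    if theta <= 0 or theta >= 1:
        return 1.0  # monomorphic site: the measure is defined as 1
    # Per-sample posterior variance of the allele count: E[g^2] - E[g]^2.
    variance_sum = sum(
        (p_het + 4 * p_alt) - (p_het + 2 * p_alt) ** 2
        for _, p_het, p_alt in posteriors
    )
    return 1 - variance_sum / (2 * n * theta * (1 - theta))


def passes_quality_control(info: float, depth: float) -> bool:
    """Thresholds from the text: INFO > 0.9 and sequencing coverage > 3X."""
    return info > 0.9 and depth > 3


certain = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
uncertain = [(0.25, 0.5, 0.25)] * 3
print(info_score(certain), info_score(uncertain))
```

Fully certain posteriors give a score of 1, while posteriors that carry no more information than the allele frequency give a score of 0.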

(5) Genomic distribution of the non-targeted region sites after quality control;

The genomic distribution of the 4910579 polymorphic sites remaining after quality control was analyzed. Nearly half of the sites were still located in protein coding regions (48.56%), followed by the UTRs (untranslated regions) and promoter regions, and about 8.49% of the sites were located in enhancer regions. FIG. 5 shows the distribution of polymorphic sites among CDS, enhancer, lncRNA, promoter, other, 3'UTR and 5'UTR, respectively.

The invention makes effective use of the low-coverage site information in the non-targeted regions of whole exome sequencing and, combined with the Chinese population reference genotype panel, can markedly increase the number of polymorphic sites found by WES sequencing, cover more genomic intervals and provide more data for genetic association analysis. The analysis shows that, with strict thresholds of sequencing coverage greater than 3X and INFO greater than 0.9, the method can expand the number of genotyped sites of conventional WES sequencing data by 4910579, thereby saving cost and yielding a greater benefit.

A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, implements the above-described non-targeted region genotype filling method based on whole exon sequencing.

The validation results of this validation example show that assigning an intrinsic weight to an indication can moderately improve the performance of the method relative to the default settings.

It can be clearly understood by those skilled in the art that, for convenience and simplicity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable storage medium, and the storage medium may include: read-only memory (ROM), random access memory (RAM), magnetic or optical disks, and the like.

It will be understood by those skilled in the art that all or part of the steps in the method according to the above embodiments may be implemented by hardware that is related to instructions of a program, and the program may be stored in a computer-readable storage medium, where the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk.

While the invention has been described in detail with reference to specific embodiments thereof, it will be apparent to one skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention.

Claims

1. A non-targeted region genotype filling method based on whole exon sequencing comprises the following steps:

2. The method for whole exon sequencing-based genotype filling of non-targeted regions according to claim 1, further comprising:

calculating the genotype likelihood result of each SNP site of each sample in the target cohort based on the SNP site information of the reference whole genome sequencing data set and the whole exon sequencing data of the target cohort, and obtaining the genotype likelihood results of the non-targeted region and/or the targeted region in each sample;

estimating the genotype of the non-targeted region and/or the targeted region of a single sample in the target cohort by using a machine learning algorithm, based on the genotype likelihood results of the non-targeted region and/or the targeted region in all samples in the target cohort, the SNP site information of the reference whole genome sequencing data set and the genomic fragment information, to obtain the genotype estimation result of the non-targeted region and/or the targeted region of the single sample;

optionally, the method for calculating the genotype likelihood of each SNP site in each sample in the target cohort includes: calculating by using mpileup of BCFtools to obtain the genotype likelihood of each SNP site calculated based on the sequencing depth;

optionally, the estimation result includes: genotype dose post-fill, genotype posterior probability, and results of best estimating genotype.

3. The whole exon sequencing-based non-targeted region genotype filling method according to claim 1, wherein the machine learning algorithm includes but is not limited to the following algorithms for genotype estimation: an iterative optimization algorithm; the iterative optimization algorithm comprises one or more of the following: gradient descent, conjugate gradient, coordinate descent, Newton iteration, stepwise regression, least-angle regression, Lagrange multiplier method.

4. The method for filling non-target region genotypes based on sequencing of the whole exons according to claim 1, wherein the reference whole genome sequencing dataset is a reference whole genome sequencing dataset of a specific population, the reference whole genome sequencing dataset of the specific population comprising a specific population of different countries/races/regions;

5. The method for whole exon sequencing-based genotype filling of non-targeted regions according to claim 1, further comprising: integrating genomic fragment information based on the estimation results to obtain a chromosome level result comprising a genotype;

optionally, the chromosome-level result is obtained using GLIMPSE_ligate.

6. The method for filling non-target region genotypes based on whole exon sequencing according to claim 1, wherein the filtering the sites in the reference whole genome sequencing data set and outputting the SNP site information of the reference whole genome sequencing data set comprises: extracting biallelic genetic polymorphic sites from the reference whole genome sequencing dataset to obtain a first dataset and a second dataset comprising SNP site information of the reference whole genome sequencing dataset;

optionally, the information of the genotype of the population from which the non-SNP sites have been removed is a first data set; the chromosome number, genomic coordinates and alleles are a second dataset that contains no genotype information, but only SNP site information in the first dataset.

7. The method of claim 1, wherein the genomic fragment information comprises: the genomic fragment information obtained after the fragmentation and segmentation processing, which comprises the chromosome, the start site coordinates and the end site coordinates.

8. A whole exon sequencing-based non-targeted region genotype filling system comprising:

an obtaining unit, configured to obtain whole exon sequencing data of the target cohort and a reference whole genome sequencing data set;

the fourth processing unit is used for carrying out fragmentation and segmentation processing on the reference whole genome sequencing data set to obtain genome fragment information after the fragmentation and segmentation processing;

9. A non-targeted region genotype filling apparatus based on whole exon sequencing, the apparatus comprising: a memory and a processor;

the memory is to store program instructions; the processor is configured to invoke program instructions that, when executed, perform the non-targeted region genotype filling method based on whole exon sequencing of any one of claims 1 to 7.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the non-targeted region genotype filling method based on whole exon sequencing as claimed in any one of the preceding claims 1 to 7.