CN109994154B

CN109994154B - Screening device for candidate pathogenic genes of monogenic recessive genetic disease

Info

Publication number: CN109994154B
Application number: CN201811633860.3A
Authority: CN
Inventors: 周扬; 刘涛; 蒋德志; 玄兆伶; 李大为; 梁峻彬; 陈重建
Original assignee: Anouta Gene Technology Beijing Co ltd
Current assignee: Anouta Gene Technology Beijing Co ltd
Priority date: 2017-12-30
Filing date: 2018-12-29
Publication date: 2021-07-13
Anticipated expiration: 2038-12-29
Also published as: CN109994154A

Abstract

The invention relates to a screening device for candidate pathogenic genes of monogenic recessive genetic diseases, which comprises a family information analysis module, a variation information acquisition module, a variation information comparison module, a genetic pattern judgment module, a candidate locus annotation module and a result screening module.

Description

Screening device for candidate pathogenic genes of monogenic recessive genetic disease

Technical Field

The invention relates to the field of gene detection, in particular to a screening device and a screening method for candidate pathogenic genes of monogenic recessive genetic diseases.

Background

Diseases caused by variation in a single pair of alleles in the human genome are referred to as monogenic disorders. More than 6,600 such diseases are known, and the number grows by 10-50 per year. Monogenic genetic diseases pose a serious threat to human health; common examples include red-green color blindness, hemophilia and albinism. Human monogenic genetic diseases fall into two major groups: those that follow Mendel's laws of inheritance, including autosomal, X-linked and Y-linked inheritance, and those that do not, including imprinting disorders and mitochondrial diseases. The causative variant can arise in germ cells, in the fertilized egg or in the early embryo, and the altered genetic information, once expressed, produces a disease with characteristic abnormal traits.

An autosomal dominant disease is one that manifests when only one of the two alleles located on an autosome carries a variant. More than 1,700 such diseases are known, such as familial polyposis of the colon, polydactyly and syndactyly. The pedigree is characterized as follows: 1. inheritance is unrelated to sex, and males and females are equally likely to be affected; 2. one of the parents is often a patient; 3. if both parents are unaffected, the children are generally unaffected; 4. the patient is usually heterozygous, and if the patient's partner is unaffected, each child has a 50% probability of being affected; 5. transmission through several successive generations is common. Expression of a dominant pathogenic gene is sometimes modified by the internal and external environment, so that a heterozygous individual carries the dominant pathogenic gene without expressing it, which is known as incomplete penetrance. The penetrance of autosomal dominant hereditary diseases is 60 to 90 percent.

The causative gene of an autosomal recessive genetic disease is a recessive gene located on an autosome, and disease results only when the recessive allele is homozygous. Most patients are the offspring of two carriers, and many are children of consanguineous marriages. More than 1,200 such diseases have been identified, such as congenital deaf-mutism, albinism and phenylketonuria. A heterozygous carrier of a recessive pathogenic gene does not show the corresponding trait but can transmit the pathogenic gene to offspring. Each child of two carriers has a 1/4 probability of being affected, and sons and daughters are equally likely to be affected. For example, in glycogen storage disease type I, the patient lacks glucose-6-phosphatase and therefore cannot convert glucose-6-phosphate to glucose; only individuals with genotype gg develop the disease. When both members of a couple are carriers (i.e., genotype Gg), each of their offspring has a 1/4 probability of being affected, giving a ratio of unaffected individuals to patients of 3:1. Many inherited metabolic diseases are autosomal recessive. According to the concept of "one gene, one enzyme" or "one cistron, one polypeptide", these inherited metabolic diseases involve an abnormal enzyme or protein molecule arising from the corresponding coding gene. Pedigree characteristics of autosomal recessive genetic disorders: 1. males and females are equally likely to be affected, and incidence is independent of sex; 2. when both parents are unaffected carriers, each child has a 25% probability of being affected; 3. transmission through successive generations is not seen, but the disease may skip generations; 4. in consanguineous marriages, the prevalence of recessive genetic disorders in the children is greatly increased because related individuals are more likely to carry the same variant allele. For example, the incidence of phenylketonuria is 1:14,500 for random marriage in the population and 1:1,700 for first-cousin marriage. The incidence of generalized albinism is 1:40,000 in the population and 1:3,600 for first-cousin marriage.
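As a worked illustration of the carrier cross described above, the following minimal Python sketch enumerates the offspring genotypes of two Gg parents. The function name `cross` is hypothetical and is not part of the invention; the sketch assumes a single gene with two alleles and equal gamete probabilities.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def cross(parent1: str, parent2: str) -> dict:
    """Return offspring genotype probabilities for a one-gene cross.

    Each parent is a two-letter genotype such as "Gg"; every allele is
    assumed to be transmitted with probability 1/2.
    """
    # Pair every allele of parent 1 with every allele of parent 2 and
    # normalise the allele order so that "gG" and "Gg" count together.
    counts = Counter("".join(sorted(a + b)) for a, b in product(parent1, parent2))
    total = sum(counts.values())
    return {genotype: Fraction(n, total) for genotype, n in counts.items()}


offspring = cross("Gg", "Gg")
affected = offspring["gg"]          # only gg individuals develop the disease
unaffected = 1 - affected           # GG and Gg individuals are unaffected
print(offspring, affected, unaffected)
```

Under these assumptions the affected fraction is 1/4 and the unaffected fraction is 3/4, which corresponds to the 3:1 ratio stated in the text.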

At present, the analysis method of the candidate gene of the monogenic recessive genetic disease comprises the following steps: 1, determining candidate genes according to characters; 2, obtaining sequence information with candidate genes; 3, analyzing the variation information of the candidate gene; and 4, determining whether the candidate gene is a pathogenic gene of the monogenic recessive genetic disease.

Disclosure of Invention

At present, whether a disease controlled by a single gene follows a recessive inheritance pattern, that is, whether the pathogenic gene is recessive, is generally judged from Mendel's laws of inheritance and the disease status of the family. This approach is inefficient and labor-intensive, is generally suitable only for a few genes, and cannot be run at high throughput.

The main aim of the invention is to provide a method for detecting genes that conform to a recessive inheritance pattern across the whole genome, so that pathogenic genes can be screened with high throughput and high efficiency. To this end, the invention provides an analysis device and a method for locating candidate genes of a monogenic recessive genetic disease in combination with next-generation sequencing.

The invention relates to a screening device of candidate pathogenic genes of monogenic recessive genetic diseases, which comprises:

1. an apparatus for screening candidate causative genes of a monogenic recessive genetic disease, wherein the apparatus comprises:

a family information collection module: collecting the disease condition of each member in the family to be analyzed;

A variation information acquisition module: obtaining exon variation site information of each member in the family to be analyzed;

a variant information comparison module: the system is used for comparing and collecting the mutation site information of each member of the family to be analyzed, which is acquired by the mutation information acquisition module, and judging the genotype of each member in the family;

a genetic pattern determination module: used for judging whether the genetic pattern of each mutation site conforms to the Mendelian recessive inheritance pattern;

candidate site annotation module: determining a gene where a mutation site conforming to a Mendelian recessive inheritance pattern is located, a specific position on the gene and the influence of the mutation on protein coding by using a database;

and the result screening module is used for further screening the screened variant loci which accord with the Mendelian recessive inheritance pattern, so that the screened variant loci are used as the last candidate variant loci, and genes containing the loci are used as target candidate genes.

2. The apparatus according to item 1, wherein the family information analysis module further comprises a family genetic map drawn according to the genetic relationship and the disease condition of each member in the family.

3. The apparatus according to item 1, wherein the variation information obtaining module obtains variation site information of each member of the family to be analyzed by whole genome re-sequencing or whole exome re-sequencing.

4. The apparatus of item 1, wherein the mutation site information in the mutation information alignment module comprises: the chromosome of the mutation, the starting site of the mutation, the ending site of the mutation, the genotype of the mutation on the reference genome, and the base type after the mutation.

5. The apparatus according to item 1, wherein the databases used by the genes at which the variant loci corresponding to the Mendelian recessive inheritance pattern are identified in the candidate locus annotation module are the NCBI database and the UCSC database.

6. The apparatus of item 1, wherein the candidate site annotation module further comprises:

performing disease-related annotation by using COSMIC, OMIM and HGMD databases;

annotation of pathogenicity prediction using SIFT, POLYPHEN, CADD and FATHMM-MKL databases;

conservation prediction using the phyloP20way_mammalian, phastConsElements46way and gerp++elem databases.

7. The apparatus of item 1, wherein the result screening module comprises the steps of:

step A: variations in exon regions and splice site regions are retained;

step B: variant sites whose population frequency in the 1000 Genomes Project database is greater than 0.01 are filtered out, removing sites of common inter-individual polymorphism, and variant sites whose frequency in that database is less than 0.01 are retained as potentially pathogenic rare variations;

step C: synonymous variations are deleted, leaving variations that have an effect on gene expression.

8. The apparatus of item 1, wherein the apparatus further comprises:

target candidate gene function annotation module: for annotating the function of the target candidate gene with a database;

a functional enrichment analysis module: used for performing functional enrichment analysis on candidate genes with the Gene Ontology database, so as to analyze whether the candidate genes have similar functions and are enriched in functional pathways related to the disease; and

a pathway enrichment analysis module: used for performing pathway enrichment analysis on candidate genes with the Kyoto Encyclopedia of Genes and Genomes database and judging whether the candidate genes are enriched in metabolic pathways related to the disease.

9. The apparatus of item 8, further comprising, after the pathway enrichment analysis module:

protein interaction analysis module: a visualized molecular interaction network is constructed by using Cytoscape software to analyze the interaction relationship between proteins coded by candidate genes.

ADVANTAGEOUS EFFECTS OF INVENTION

According to the invention, the candidate loci and the candidate genes related to recessive genetic diseases can be positioned in the whole genome range by directly utilizing data generated by mainstream next-generation sequencing in scientific research.

According to the invention, all variations in the whole genome range are taken into consideration, so that bias and omission caused by personal subjective factors are avoided. A single analysis handles as many as 10⁶ variant sites (or more), and operation is simple: the analyst fills in a few required parameters and the analysis runs in one click, which achieves high throughput while remaining convenient and fast.

The invention searches multiple large databases related to human diseases, can screen genes at several levels, and can use the databases to corroborate one another.

The invention also utilizes a plurality of databases to carry out enrichment analysis, and besides analyzing the functions of basic candidate genes, the invention also considers the effects and enrichment conditions of the basic candidate genes in different paths, thereby providing strong evidence for finally determining the candidate genes.

Detailed Description

Definitions:

Pedigree (family): the number of members of a family, their kinship, and the distribution of the relevant genetic trait or genetic disease within the family, wherein the members comprise a father, a mother and the offspring of the father and the mother, the offspring numbering one or more;

Synonymous mutation: a base change after which, owing to the degeneracy of the genetic code, the affected codon is still translated into the same amino acid;

Splice site region: the sequence at the intron-exon junction boundaries of the RNA precursor that can be recognized by the spliceosome;

Pedigree chart: different symbols are adopted for different situations. Generally □ indicates a male and ○ a female; a horizontal line connecting □ and ○ is called the marriage line and indicates a couple; a vertical line is drawn downwards from near the midpoint of the marriage line and the symbols of the children are attached at its lower end, arranged from left to right in birth order when there are two or more children; ■ and ● indicate affected individuals;

Mendelian recessive inheritance pattern: the causative gene is recessive, only homozygotes show the pathology, and inheritance of the disease follows the Mendelian pattern.

The invention relates to a screening device for candidate pathogenic genes of a monogenic recessive genetic disease, which requires disease information for a nuclear family. The device comprises the following modules:

Family information collection module: the family typically includes a father, a mother and their children, and is not limited to one child. Here, a typical family of two parents and one child is described as an example. After the kinship between the child and the parents has been confirmed, the disease status of each member of the family is collected and checked for accuracy. Preferably, a pedigree chart is drawn according to the genetic relationships and disease status.

A variation information acquisition module: genome-wide (or exome-wide) single nucleotide variations (SNPs) and small insertion/deletion variations (InDels) of the samples are obtained. It is recommended to obtain the variants of all samples of the family to be analyzed by human whole-genome re-sequencing or whole-exome re-sequencing.

The general analysis strategy for next-generation sequencing data is: sequencing data filtering, alignment, variant detection, ANNOVAR annotation, and then additional follow-up analysis. To connect seamlessly with this workflow, the invention starts from the per-individual variant detection results and detects the variant loci that conform to the recessive inheritance pattern, thereby mining candidate genes related to the recessive genetic disease.

A variant information comparison module: the variant results of the individuals in the family are integrated and the variants of the family members are merged. Each variant site occupies one row, whose first columns give the chromosome where the variant is located, the start site of the variant, the end site of the variant, the base type of the site on the reference genome, and the base type after mutation. Next, the genotype of each individual in the pedigree at each site is determined, wherein a site homozygous for the reference allele is represented by 0/0, a site homozygous for the variant by 1/1, and a heterozygous site by 0/1. The invention accomplishes this analysis with a purpose-built program.

A genetic pattern determination module: according to Mendel's laws of inheritance, a recessive homozygous variant site generally appears as follows: both parents are heterozygous at the site, the affected individual is homozygous for the variant, and unaffected siblings have a genotype different from that of the affected individual (an unaffected sibling may be heterozygous, or homozygous for a genotype other than the disease-associated variant genotype). Whether the inheritance pattern of each variant site conforms to the Mendelian recessive pattern is judged on this principle. The variations detected include both SNPs and InDels.
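The rule stated above can be sketched for a single variant site as follows. This is only an illustration under stated assumptions, not the program developed for the invention: the function name `fits_recessive` and its argument layout are hypothetical, and only the biallelic genotype codes 0/0, 0/1 and 1/1 are considered.

```python
from typing import Iterable

HET = "0/1"      # heterozygous site
HOM_ALT = "1/1"  # homozygous for the variant allele


def fits_recessive(father: str, mother: str,
                   affected: Iterable[str],
                   unaffected: Iterable[str] = ()) -> bool:
    """Return True if the genotypes at one site match the recessive pattern."""
    affected = list(affected)
    # Both parents must be heterozygous carriers.
    if father != HET or mother != HET:
        return False
    # At least one affected child is required, and every affected child
    # must be homozygous for the variant.
    if not affected or any(gt != HOM_ALT for gt in affected):
        return False
    # Unaffected siblings may be 0/1 or 0/0, but never 1/1.
    return all(gt != HOM_ALT for gt in unaffected)


# Trio plus one unaffected sibling: the site is consistent with the pattern.
print(fits_recessive("0/1", "0/1", ["1/1"], ["0/1"]))
# A homozygous-reference mother rules the site out.
print(fits_recessive("0/1", "0/0", ["1/1"]))
```

Applying such a check to every row of the merged family table would leave only the sites that are consistent with recessive inheritance.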

Candidate site annotation module: to make the annotation comprehensive, annotation is carried out against both the NCBI and UCSC databases to determine the gene in which each variant site is located, its specific position in the gene, and the effect of the variant on protein coding. In addition, several databases are used to annotate other aspects, such as the disease-related databases COSMIC, OMIM, HGMD, etc.; the pathogenicity-prediction databases SIFT, POLYPHEN, CADD, FATHMM-MKL, etc.; and the conservation-prediction databases phyloP20way_mammalian, phastConsElements46way, gerp++elem, etc.

Result screening module: the recessive variant sites obtained above are further screened to obtain rare variations. The specific screening method is as follows:

1) variations in the exon region (exonic) or splice site region (splicing, 2 bp upstream of the splice site) are retained;

2) variant sites whose population frequency in the 1000 Genomes Project database is greater than 0.01 are filtered out, removing sites of common inter-individual polymorphism, so that variant sites with a frequency lower than 0.01 in the 1000 Genomes database are retained as rare variations that may genuinely be pathogenic;

3) synonymous variations (variations that do not change the encoded amino acid) are deleted, leaving variations that affect the gene expression product (the retained variations include frameshift and non-frameshift InDels). The variant sites that pass screening are taken as the final candidate variant sites, and the genes containing them are taken as the target candidate genes.
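The three screening steps above can be sketched as a single filter. This is a minimal illustration, not the invention's program: the record fields `region`, `af_1000g` and `effect` are hypothetical names, their values are assumed to follow ANNOVAR-style labels, and a site absent from the 1000 Genomes database (frequency `None`) is assumed to be rare. The source does not say how a frequency of exactly 0.01 is handled; the sketch removes it.

```python
from typing import Dict, List, Optional

KEPT_REGIONS = {"exonic", "splicing"}


def screen(variants: List[Dict], max_af: float = 0.01) -> List[Dict]:
    """Apply the three screening steps to annotated variant records."""
    kept = []
    for v in variants:
        # Step 1: keep only exonic or splice-site variants.
        if v["region"] not in KEPT_REGIONS:
            continue
        # Step 2: drop variants that are common in the population.
        af: Optional[float] = v.get("af_1000g")
        if af is not None and af >= max_af:
            continue
        # Step 3: drop synonymous variants.
        if v.get("effect") == "synonymous SNV":
            continue
        kept.append(v)
    return kept


variants = [
    {"id": "v1", "region": "exonic", "af_1000g": 0.001, "effect": "nonsynonymous SNV"},
    {"id": "v2", "region": "intronic", "af_1000g": 0.001, "effect": None},
    {"id": "v3", "region": "exonic", "af_1000g": 0.20, "effect": "nonsynonymous SNV"},
    {"id": "v4", "region": "exonic", "af_1000g": None, "effect": "synonymous SNV"},
    {"id": "v5", "region": "splicing", "af_1000g": None, "effect": None},
]
candidates = screen(variants)
print([v["id"] for v in candidates])
```

The genes containing the surviving records would then be reported as the target candidate genes.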

Here, the screening apparatus further includes:

target candidate gene function annotation module: annotating the target candidate gene function with a database.

A functional enrichment analysis module: the Gene Ontology (GO) database is used to perform functional enrichment analysis on the candidate genes, so as to analyze whether the candidate genes have similar functions and are enriched in functional pathways related to the disease.

A pathway enrichment analysis module: the Kyoto Encyclopedia of Genes and Genomes (KEGG) database is used to perform pathway enrichment analysis on the candidate genes to see whether they are enriched in metabolic pathways related to the disease.
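The text does not state which statistical test the two enrichment modules use. A common choice, assumed here purely for illustration, is the one-sided hypergeometric test: given a background of N genes of which K belong to a GO term or KEGG pathway, it gives the probability that a candidate list of n genes contains k or more members of that term by chance. The function name `enrichment_pvalue` is hypothetical.

```python
from math import comb


def enrichment_pvalue(N: int, K: int, n: int, k: int) -> float:
    """One-sided hypergeometric p-value P(X >= k).

    N: number of background genes
    K: background genes annotated to the term or pathway
    n: number of candidate genes
    k: candidate genes annotated to the term or pathway
    """
    upper = min(K, n)
    # Sum the probabilities of observing k, k+1, ..., upper annotated genes.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Toy numbers: 4 background genes, 2 in the pathway, 2 candidates, both in the pathway.
print(enrichment_pvalue(4, 2, 2, 2))
```

A small p-value would suggest that the candidate genes are over-represented in the term or pathway; in practice a correction for testing many terms would also be needed.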

Further, the method also comprises the following steps:

protein interaction analysis module: a visualized molecular interaction network is constructed with the open-source software Cytoscape to analyze the interactions between the proteins encoded by the candidate genes.

Examples

In this embodiment, regarding the family information collection module: owing to the rapid development of sequencing technology, the current mainstream technology, second-generation sequencing, is fast, accurate and high-throughput, so that analysis of genetic information across the whole genome has become quick and inexpensive, and second-generation sequencing has become a routine tool in many fields of research. The invention is likewise based on second-generation sequencing. The details of the family to be analyzed are determined first; the family generally includes the father, the mother and their offspring, and there may be more than one offspring. For convenience, this specification proceeds with the typical family of two parents and one child as an example. The information that needs to be confirmed is as follows:

First, the kinship between the child and the parents is determined. The following situations are not suitable for analysis by the method of the invention: 1) if the child is adopted, the family formed by the child and the adoptive parents cannot be used for analysis; 2) for a nuclear family of three individuals, analysis is not possible if the genotypes of the parents cannot be obtained or their disease status cannot be determined (for example, the parents are too old, or the disease is of late onset); 3) a family in which genetic information and disease status are available for only one parent and several offspring can be analyzed, but the false-positive rate is higher.

Second, the disease status of each member of the family is determined, i.e., phenotypic identification is performed. For diseases with clinical manifestations, the phenotype can be judged directly; for diseases without clinical manifestations, the phenotype needs to be determined by physicochemical measurement of the relevant substances in the body.

Third, the invention locates candidate genes only for recessive genetic diseases, so it must be established whether the inheritance pattern of the disease is recessive. By Mendel's laws of inheritance, in a recessive genetic disease the parents appear normal while both affected and unaffected individuals may occur among the offspring. If phenotypes are identified for only one generation of offspring, the parents should be phenotypically normal and one of the offspring affected; if phenotypes are identified across several generations, the disease status of those generations can be used to judge, by Mendel's laws, whether the disease follows a recessive inheritance pattern. In this embodiment, a family with phenotypically normal parents and a child with a small-ear deformity (microtia) phenotype is taken as an example.

In the variation information acquisition module, genome-wide variants of the qualifying family members are acquired, including single nucleotide variations (SNPs) and small insertion/deletion variations (InDels). The general procedure is as follows: DNA is first extracted from the blood of each family member (other tissue samples can also be used), with the extraction method chosen according to the condition of the sample; the DNA is then randomly fragmented into short pieces using a focused-acoustics sample processor. Suitable adapters are ligated to the randomly fragmented DNA, the adapter-ligated fragments are enriched by PCR, and the enriched fragments are purified. In the invention the prepared library is sequenced by high-throughput sequencing; the platform can be selected from the Illumina HiSeq 2000, HiSeq 2500, MiSeq, NextSeq 500, X Ten and the like (exome sequencing additionally requires exon capture; there are no special requirements on sequencing type or insert size). Although the sequencing modes of different platforms differ slightly, DNA sequencing data derived from the sample are obtained in every case. The raw sequences produced by the sequencer (Raw Reads) comprise low-quality sequences, adapter-contaminated sequences, sequences in which N bases exceed 5%, and Clean Reads. The higher the proportion of Clean Reads, the better the data quality. To ensure the quality of subsequent analysis, the raw sequences are filtered to obtain high-quality Clean Reads before further analysis. The data processing steps are as follows:

(1) removing Reads contaminated by adapters (more than 5 bp of adapter-contaminated bases; for paired-end sequencing, if one end is contaminated by adapters, the Reads at both ends are removed);

(2) removing low-quality Reads (Reads in which bases with quality Q ≤ 19 account for more than 50% of the total bases; for paired-end sequencing, if one end is low quality, the Reads at both ends are removed);

(3) removing Reads with an N base ratio greater than 5% (for paired-end sequencing, if one end contains more than 5% N, the Reads at both ends are removed).
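The three filtering rules above can be sketched as follows. This is a hedged illustration rather than the pipeline actually used: the `Read` class and the helper names are hypothetical, quality values are assumed to be Phred scores already decoded to integers, and the number of adapter-contaminated bases is assumed to be supplied by an upstream adapter detector.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Read:
    seq: str
    quals: List[int]        # one Phred quality score per base
    adapter_bases: int = 0  # assumed output of an upstream adapter detector


def is_clean(read: Read) -> bool:
    """Return True if a single read passes all three filters."""
    n = len(read.seq)
    if n == 0:
        return False
    # (1) More than 5 bp of adapter contamination.
    if read.adapter_bases > 5:
        return False
    # (2) Bases with Q <= 19 make up more than 50% of the read.
    low_quality = sum(q <= 19 for q in read.quals)
    if low_quality / n > 0.5:
        return False
    # (3) N bases make up more than 5% of the read.
    if read.seq.upper().count("N") / n > 0.05:
        return False
    return True


def keep_pair(read1: Read, read2: Read) -> bool:
    """For paired-end data, a pair is kept only if both ends are clean."""
    return is_clean(read1) and is_clean(read2)


good = Read("ACGT" * 25, [30] * 100)
too_many_n = Read("N" * 6 + "A" * 94, [30] * 100)
print(keep_pair(good, good), keep_pair(good, too_many_n))
```

Because a pair is discarded when either end fails, the sketch mirrors the rule that both ends are removed together.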

After the Reads liable to cause analysis errors have been filtered out, the UCSC hg19 reference genome is used as the reference sequence, and the filtered Clean Reads are aligned to it with alignment software (BWA is recommended) to obtain an alignment result file in BAM format (http://samtools. …). The BAM files are sorted with Samtools (Li et al, 2009) and only uniquely aligned sequences are retained; Picard (http://broadinstitute.github.io/picard) is then used to mark duplicates in the alignment result, and GATK (McKenna, A. et al, 2010) is used to perform local realignment of the sequences around InDels, which reduces false-positive SNP calls and yields a high-accuracy BAM file for variant detection. For exome sequencing analysis, the quality of the sample, library construction and sequencing needs to be evaluated from statistics such as target region capture efficiency, alignment rate and coverage; Uniq Rate directly reflects the proportion of sequences uniquely aligned to the reference genome, and Duplication Rate reflects the quality of library construction. On the basis of the alignment to the reference genome, all potential polymorphic SNP and InDel sites in the whole genome are extracted with the variant analysis software GATK, and further filtering is then performed according to factors such as quality value, depth and repeatability, finally yielding high-confidence SNP and InDel variant data sets.

In the variation information comparison module: the variant sites obtained by the variation information acquisition module are the results of per-individual detection rather than joint detection, so the variants of all samples in the family are integrated in this module. Because GATK outputs only variant sites (non-gVCF) when calling a single individual, and each individual has variant sites of its own, the variant sites of the whole family must be combined before the final genotypes can be determined. First, the variant sites of all samples in the family are merged on several keys, namely chromosome, start position, end position, reference base and variant base, to obtain the total variant site set M1. The genotype of every individual in the pedigree at each variant site in M1 is then determined, wherein a site homozygous for the reference allele is designated 0/0, a site homozygous for the variant 1/1, and a heterozygous site 0/1. The invention accomplishes this analysis with a purpose-built program for fast and efficient execution.
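The merging step described above can be sketched as follows. The sketch is not the invention's program: the function name `merge_family` and the input layout are hypothetical, each individual's calls are assumed to be a mapping from the five-field site key to a genotype, and a site missing from an individual's calls is assumed to be homozygous reference (0/0), which real data would need sequencing coverage to confirm.

```python
from typing import Dict, Tuple

# (chromosome, start, end, reference base, variant base)
Site = Tuple[str, int, int, str, str]


def merge_family(calls: Dict[str, Dict[Site, str]]) -> Dict[Site, Dict[str, str]]:
    """Build the total site set M1 and one genotype per member per site."""
    # M1 is the set of all sites called in at least one family member.
    m1 = sorted(set().union(*calls.values()))
    # Members without a call at a site are assumed to be 0/0 there.
    return {
        site: {member: member_calls.get(site, "0/0")
               for member, member_calls in calls.items()}
        for site in m1
    }


site_a: Site = ("1", 100, 100, "C", "T")
site_b: Site = ("2", 5, 5, "A", "G")
calls = {
    "father": {site_a: "0/1"},
    "mother": {site_a: "0/1", site_b: "0/1"},
    "child": {site_a: "1/1"},
}
merged = merge_family(calls)
for site, genotypes in merged.items():
    print(site, genotypes)
```

Each entry of the result corresponds to one row of the merged table: the five site columns followed by one genotype per family member.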

In the genetic pattern determination module, the inheritance pattern is judged. According to Mendel's laws of inheritance, a recessive homozygous variant site generally appears as follows: both parents are heterozygous at the site, the affected individual is homozygous for the variant, and unaffected siblings have a different genotype (an unaffected sibling may be heterozygous, or homozygous for a genotype other than the disease-associated variant genotype). Whether the inheritance pattern of each variant site conforms to a recessive pattern is judged on this principle. For human whole-genome re-sequencing data, the total number of SNP and InDel variants described above is usually 10⁵-10⁶. Even for whole-exome data, the two variant types together number on the order of 10⁴-10⁵, so the inheritance pattern cannot be judged by inspecting the sites one by one. The invention provides a corresponding program to accomplish this analysis.

In the candidate locus annotation module: the program in the genetic pattern determination module yields all variant loci that fit the recessive inheritance pattern. These are spread over different positions in the genome: 1. a variant may be located in a gene or in an intergenic region; 2. it may be located in an exon, an intron or a splice site; 3. it may be a synonymous or a nonsynonymous mutation. To determine the position of each variant in its gene, its effect on the encoded protein, and so on, ANNOVAR annotation is performed on the variants obtained in step S4. To make the annotation comprehensive, both the NCBI and UCSC databases are used to determine the gene in which each variant site is located and its specific position in the gene. For example, variant site 1 is located on chromosome 1, with start site 120611960, end site 120611960, reference base C and variant base T; it lies in an exon of the gene NOTCH2 and is a nonsynonymous mutation that changes the 21st amino acid, encoded in the first exon of NOTCH2, from A to T. In addition, several databases are used to annotate other aspects, such as the disease-related databases COSMIC, OMIM, HGMD, etc.; the pathogenicity-prediction databases SIFT, POLYPHEN, CADD, FATHMM-MKL, etc.; and the conservation-prediction databases phyloP20way_mammalian, phastConsElements46way, gerp++elem, etc.

As the mutation is the basic source of the heritable variation, the mutation of eukaryote or prokaryote and the mutation of any type have the characteristics of randomness, low frequency, reversibility and the like. Not all mutations cause human disease. And the mutation obtained by screening through the genetic pattern judgment module is further screened in the recessive genetic mutation site obtained by screening to obtain the rare mutation, and the specific screening method is as follows:

1) since the variation located in the exon region and the cleavage site may cause the change of protein function, most diseases are caused by the change of protein function, however, the variation located in the intergenic region and the intron region mostly do not affect the function, in order to further narrow the range of the functional site, the result of S4 is further screened, and the variation of the exon region (exonic) or the cleavage site region (cleaving, 2bp upstream of the cleavage site) is reserved;

2) because the genome database of thousands of people includes the data of normal people, the mutation sites with high frequency in the database can not be the pathogenic sites of monogenic diseases, so the following filtration is carried out: filtering variation sites of a thousand-person Genome database (the frequency in the population is more than 0.01), deleting diversity sites among individuals to obtain rare mutation (rare) which is really possible to cause diseases, and reserving the mutation sites of which the frequency in a 1000Genome database is less than 0.01;

3) synonymous variants (mutations that do not change the encoded amino acid) have no effect on the protein sequence, so they are removed, leaving variants that affect the gene expression product (the remaining mutations include frameshift and non-frameshift InDel mutations).

The screened mutation sites are used as the final candidate mutation sites, and genes containing the mutation sites are used as target candidate genes.
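Screening steps 1) to 3) above can be sketched as a single predicate applied to each annotated variant. This is only an illustrative sketch under stated assumptions: the dictionary keys (`gene`, `region`, `freq_1000g`, `exonic_func`), the toy records and the gene names `GENE_A` to `GENE_E` are hypothetical and do not come from the invention's data. The sketch also assumes that a variant absent from the 1000 Genomes database counts as rare and that a frequency of exactly 0.01 is not retained, since the text keeps only frequencies below 0.01.

```python
from typing import Any, Dict, List

KEPT_REGIONS = {"exonic", "splicing"}  # step 1: exon or splice-site regions
MAX_POP_FREQ = 0.01                    # step 2: population frequency cutoff


def is_candidate(variant: Dict[str, Any]) -> bool:
    """Return True if a variant passes all three screening steps."""
    # Step 1: keep only exonic or splice-site variants.
    if variant["region"] not in KEPT_REGIONS:
        return False
    # Step 2: keep only rare variants (frequency below 0.01).
    # Assumption: None means "not found in the database", treated as rare.
    freq = variant.get("freq_1000g")
    if freq is not None and freq >= MAX_POP_FREQ:
        return False
    # Step 3: remove synonymous variants.
    if variant.get("exonic_func") == "synonymous":
        return False
    return True


# Toy records with hypothetical genes, for illustration only.
variants: List[Dict[str, Any]] = [
    {"gene": "GENE_A", "region": "exonic", "freq_1000g": 0.002,
     "exonic_func": "nonsynonymous"},  # passes all steps
    {"gene": "GENE_B", "region": "intronic", "freq_1000g": 0.001,
     "exonic_func": None},             # removed in step 1
    {"gene": "GENE_C", "region": "exonic", "freq_1000g": 0.2,
     "exonic_func": "nonsynonymous"},  # removed in step 2
    {"gene": "GENE_D", "region": "exonic", "freq_1000g": None,
     "exonic_func": "synonymous"},     # removed in step 3
    {"gene": "GENE_E", "region": "splicing", "freq_1000g": None,
     "exonic_func": None},             # passes all steps
]

candidates = [v for v in variants if is_candidate(v)]
candidate_genes = sorted({v["gene"] for v in candidates})
print(candidate_genes)  # ['GENE_A', 'GENE_E']
```

Keeping the three checks in one function makes the order of the steps explicit, and the genes of the surviving variants form the target candidate gene list.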

The results showed that the sample carried variants in 40 genes, including the 1717(A/C) variant in SIX2 on chromosome 2, the 109268573(T/C) variant in FNDC7 on chromosome 1, the 195452951(C/G) variant in MUC20 on chromosome 3, the 140229086(C/G) variant in PCDHA9 and the 141336264(G/T) variant in PCDH12 on chromosome 5, the 21207000(C/G) variant in IFNA10 on chromosome 7, the 125830970(A/T) variant in the CDON gene on chromosome 11, the 79058445(G/A) variant in ADAMTS7 on chromosome 15, and variants in KIZ and BPIFA2 on chromosome 20; these variants are of significance for studying auricular malformation.

While the foregoing description shows and describes the preferred embodiments of the present invention, it is to be understood that the invention is not limited to the forms disclosed herein. The description is not to be construed as excluding other embodiments; the invention is capable of use in various other combinations, modifications and environments, and is capable of changes within the scope of the inventive concept described herein, commensurate with the above teachings or with the skill or knowledge of the relevant art. Modifications and variations effected by those skilled in the art that do not depart from the spirit and scope of the invention fall within the protection of the appended claims.

Claims

1. An apparatus for screening candidate causative genes of a monogenic recessive genetic disease, comprising:

a variation information comparison module: used for comparing and taking the union of the exon variant site information of each member of the family to be analyzed, as acquired by the variation information acquisition module, and for judging the genotype of each member in the family;

a result screening module: used for further screening the variant loci that conform to the Mendelian recessive inheritance pattern, so that the loci remaining after screening are taken as the final candidate variant loci and the genes containing these loci are taken as target candidate genes;

a functional enrichment analysis module: used for performing functional enrichment analysis on the candidate genes by using the Gene Ontology database, so as to analyze whether the candidate genes have similar functions and whether they are enriched in disease-related functional pathways;

a pathway enrichment analysis module: used for performing pathway enrichment analysis on the candidate genes by using the Kyoto Encyclopedia of Genes and Genomes (KEGG) database, and for judging whether the candidate genes are enriched in disease-related metabolic pathways; and

a protein interaction analysis module: used for constructing a visualized molecular interaction network with Cytoscape software to analyze the interactions between the proteins encoded by the candidate genes;

wherein the variant site information comprises: the chromosome on which the variant is located, the start position of the variant, the end position of the variant, the base type on the reference genome, and the base type after variation;

the comparison and union means that the variant sites of all samples in the family are merged, using chromosome number, start position, end position, reference genome base type and variant base type as indexes, to obtain a total variant site set M1;

the result screening module comprises the following steps:

step A: variants in exon regions and splice site regions are retained;

step B: variant sites with a population frequency greater than 0.01 in the 1000 Genomes database are filtered out, removing polymorphic sites that differ between individuals, so that variant sites with a frequency of less than 0.01 in the 1000 Genomes database are retained as potentially pathogenic rare variants;

step C: synonymous variants are deleted to obtain the variants that have an influence on gene expression;

and the variation information acquisition module acquires variation site information of each member of the family to be analyzed through whole genome re-sequencing or whole exome re-sequencing.

2. The apparatus of claim 1, wherein the family information comparison module further draws a family pedigree chart according to the genetic relationships and the disease status of each member in the family.

3. The apparatus of claim 1, wherein the databases used by the candidate site annotation module to determine the genes in which the variant sites conforming to the Mendelian recessive inheritance pattern are located are the NCBI database and the UCSC database.

4. The apparatus of claim 1, wherein the candidate site annotation module further comprises:

performing disease-related annotation by using COSMIC, OMIM and HGMD databases;