CN111139291A - High-throughput sequencing analysis method for monogenic hereditary diseases - Google Patents

Publication number
CN111139291A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010035599.8A
Other languages
Chinese (zh)
Inventor
秦彦文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING INSTITUTE OF HEART LUNG AND BLOOD VESSEL DISEASES
Beijing Anzhen Hospital
Original Assignee
BEIJING INSTITUTE OF HEART LUNG AND BLOOD VESSEL DISEASES
Beijing Anzhen Hospital
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING INSTITUTE OF HEART LUNG AND BLOOD VESSEL DISEASES, Beijing Anzhen Hospital filed Critical BEIJING INSTITUTE OF HEART LUNG AND BLOOD VESSEL DISEASES
Priority to CN202010035599.8A priority Critical patent/CN111139291A/en
Publication of CN111139291A publication Critical patent/CN111139291A/en
Pending legal-status Critical Current

Classifications

    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12QMEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869Methods for sequencing
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/50Mutagenesis

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Organic Chemistry (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Physics & Mathematics (AREA)
  • Biotechnology (AREA)
  • Analytical Chemistry (AREA)
  • Genetics & Genomics (AREA)
  • Zoology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Wood Science & Technology (AREA)
  • Evolutionary Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Immunology (AREA)
  • Microbiology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biochemistry (AREA)
  • General Engineering & Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The application provides a high-throughput sequencing analysis method for a single-gene hereditary disease. The method comprises the following steps: collecting condition information of a patient to be analyzed who has a target genetic disease; acquiring and storing high-throughput sequencing data of the gene sequence to be detected of the patient to be analyzed; carrying out mutation detection analysis on the high-throughput sequencing data to obtain mutation sites carrying mutation information, wherein the mutation information comprises basic information, harmfulness prediction information, mutation frequency information and pathogenicity information; and screening the mutation sites based on the condition information of the patient to be analyzed and the basic information, the harmfulness prediction information, the mutation frequency information and the pathogenicity information carried by the mutation sites to obtain the pathogenic sites of the target genetic diseases. The single-gene hereditary disease high-throughput sequencing analysis method provided by the application can realize high-efficiency screening of pathogenic sites.

Description

High-throughput sequencing analysis method for monogenic hereditary diseases
Technical Field
The application relates to the technical field of biomedicine, in particular to a high-throughput sequencing analysis method for a single-gene hereditary disease.
Background
Genetic diseases are diseases caused by changes in genetic material or controlled by pathogenic genes. A monogenic genetic disease, also called a Mendelian genetic disease, is a genetic disease caused by mutation of a single gene. Its mode of inheritance follows Mendel's laws, and the mutation may be inherited from a parent or arise in the individual, so it can be passed on to the next generation. According to World Health Organization (WHO) statistics, the cumulative incidence of all monogenic genetic diseases in the global birth population is up to 10/1000.
High-throughput sequencing, also known as next-generation sequencing, is characterized by the ability to sequence hundreds of thousands to millions of DNA molecules in parallel in a single run, and by short read lengths.
Millions of single nucleotide variations (SNV), small insertions and deletions (InDel) and other variations may be found in a human genome, and these variations occur in various combinations of the four bases adenine, guanine, cytosine and thymine. Human genetic diseases are often associated with these variations, but although an individual may carry millions of variations, very few of them are actually associated with disease. With the popularization of high-throughput sequencing technologies, more and more sequencing data are generated. Compared with traditional technologies such as gene chips and PCR, high-throughput sequencing can find more gene variation sites. However, the raw data generated by sequencing usually require a large amount of manual and machine processing, and the pathogenic sites present in the patient are found only after complicated bioinformatics analysis. The workload is heavy and tedious, the efficiency is low, clinical application is very inconvenient, and workers must learn a large amount of bioinformatics and machine learning knowledge. This situation is an urgent problem to be solved.
Disclosure of Invention
In view of the above, the present application provides a high throughput sequencing analysis method for monogenic hereditary diseases, so as to solve the problems in the prior art.
The application provides a high-throughput sequencing analysis method for a single-gene hereditary disease, which comprises the following steps:
step one, collecting disease condition information of a patient to be analyzed who has a target hereditary disease;
step two, acquiring and storing high-throughput sequencing data of the gene sequence to be detected of the patient to be analyzed;
step three, performing mutation detection analysis on the high-throughput sequencing data to obtain mutation sites carrying mutation information, wherein the mutation information comprises basic information, harmfulness prediction information, mutation frequency information and pathogenicity information;
and step four, screening the mutation sites based on the disease condition information of the patient to be analyzed and the basic information, the harmfulness prediction information, the mutation frequency information and the pathogenicity information carried by the mutation sites to obtain the pathogenic sites of the target hereditary disease.
Further, the acquiring and storing high-throughput sequencing data of the gene sequence to be tested of the patient to be analyzed includes:
acquiring high-throughput sequencing data of the gene sequence to be detected of the patient to be analyzed;
determining the fragment index number of the sequencing data based on a preset hierarchical fragment strategy;
storing the sequencing data in a distributed manner in a computer-readable storage medium based on the fragment index number of the sequencing data.
Further, the performing mutation detection analysis on the sequencing data to obtain a mutation site carrying mutation information includes:
carrying out mutation comparison detection on the high-throughput sequencing data to obtain mutation sites in a gene sequence to be detected of a patient to be analyzed;
and annotating the mutation sites based on the mutation information of the mutation sites to obtain the mutation sites carrying the mutation information.
Further, the annotating the mutation site based on mutation information of the mutation site comprises:
annotating basic information on the mutation site based on the position of the mutation site in a genome and the mutation type to obtain a mutation site carrying the basic information;
annotating hazard prediction information on the mutant site based on the influence of the mutant site on protein translation to obtain a mutant site carrying hazard prediction information;
annotating mutation frequency information on the mutation sites based on the allele frequencies of the mutation sites in the population to obtain mutation sites carrying the mutation frequency information;
and annotating pathogenicity information on the mutation site based on the relation between the mutation site and the disease information of the patient to be analyzed, and obtaining the mutation site carrying the pathogenicity information.
Further, the screening of the mutation sites based on the disease condition information of the patient to be analyzed and the basic information, the harmfulness prediction information, the mutation frequency information and the pathogenicity information carried by the mutation sites comprises:
screening and retaining, as pathogenic sites, mutation sites that are already recorded and are associated with the disease condition information of the patient to be analyzed, based on the disease condition information of the patient to be analyzed;
screening and retaining mutation sites that affect gene expression products as pathogenic sites, based on the mutation information carried by the mutation sites;
screening and retaining mutation sites whose mutation frequency is within a preset threshold range as pathogenic sites, based on the mutation information carried by the mutation sites;
and screening and retaining mutation sites whose pathogenicity meets a preset standard as the pathogenic sites of the target genetic disease, based on the mutation information carried by the mutation sites.
Further, the screening and retaining, as pathogenic sites, of mutation sites that are already recorded and are associated with the disease condition information of the patient to be analyzed, based on the disease condition information of the patient to be analyzed, includes:
searching a COSMIC database, a Clinvar database and/or an HGMD database for recorded mutation sites that are associated with the target genetic disease, and taking them as pathogenic sites.
Further, the screening and retaining of mutation sites having an effect on gene expression products as pathogenic sites based on mutation information carried by the mutation sites includes:
selecting, based on the mutation information carried by the mutation sites, a mutation site whose mutation type is any one of non-synonymous mutation, stop-gain mutation, frameshift mutation and nonsense mutation as a pathogenic site.
Further, the screening and retaining of the mutation sites with the mutation frequency within a preset threshold range as pathogenic sites based on the mutation information carried by the mutation sites comprises:
determining the mutation frequency of the mutation sites in a 1000G database, an Exac.E database and/or an ESP database based on the mutation information carried by the mutation sites, and taking the mutation sites whose mutation frequency is within a preset threshold range as pathogenic sites.
Further, the preset threshold range is 0-0.01, and the screening and retaining of the mutation sites with the mutation frequency within the preset threshold range as pathogenic sites based on the mutation information carried by the mutation sites includes:
determining the mutation frequency of the mutation sites in a 1000G database, an Exac.E database and/or an ESP database based on the mutation information carried by the mutation sites, and taking the mutation sites whose mutation frequency is greater than or equal to 0 and less than 0.01 as pathogenic sites.
Further, the screening and retaining of the mutation site with pathogenicity meeting the preset standard as the pathogenic site of the target genetic disease based on the mutation information carried by the mutation site comprises:
scoring the mutation site with any one or more of the SIFT unit, the PP2 unit, the MT unit and the MS unit, based on the degree of pathogenicity of the mutation recorded in the mutation information carried by the mutation site; a mutation site is selected as a pathogenic site when its SIFT unit scoring result is D, its PP2 unit scoring result is P or D, its MT unit scoring result is A or D, or its MS unit scoring result is D, or when it is classified by ACMG as Pathogenic or Likely Pathogenic.
Further, after the obtaining and storing the high-throughput sequencing data of the gene sequence to be tested of the patient to be analyzed, the method further comprises the following steps:
and filtering the high-throughput sequencing data based on the quality of the high-throughput sequencing data, and deleting the high-throughput sequencing data of which the quality does not meet a preset quality standard.
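The quality filtering step can be illustrated with a minimal Python sketch. The application does not state the preset quality standard, so the sketch assumes FASTQ-style reads with Phred+33 quality strings and a hypothetical threshold on the mean Phred score; the names `Read`, `mean_phred`, `filter_reads` and the value 20 are illustrative assumptions, not part of the application.

```python
from typing import List, NamedTuple


class Read(NamedTuple):
    name: str
    sequence: str
    quality: str  # Phred+33 encoded quality string, as used in FASTQ files


def mean_phred(quality: str) -> float:
    """Average Phred score of a Phred+33 encoded quality string."""
    if not quality:
        return 0.0
    return sum(ord(ch) - 33 for ch in quality) / len(quality)


def filter_reads(reads: List[Read], min_mean_quality: float = 20.0) -> List[Read]:
    """Keep reads whose mean quality meets the (assumed) preset standard."""
    return [r for r in reads if mean_phred(r.quality) >= min_mean_quality]


# Hypothetical example reads: "I" encodes Phred 40, "!" encodes Phred 0.
reads = [Read("r1", "ACGT", "IIII"), Read("r2", "ACGT", "!!!!")]
kept = filter_reads(reads)
print([r.name for r in kept])
```

Other quality standards, such as per-base trimming or a minimum read length, would fit the same step equally well.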
In the single-gene hereditary disease high-throughput sequencing analysis method provided by the application, the disease condition information of a patient to be analyzed who has a target hereditary disease is collected, which provides a basis for searching for the pathogenic sites of the target hereditary disease. The acquired high-throughput sequencing data of the gene sequence to be detected are then subjected to mutation detection analysis to obtain mutation sites carrying mutation information, and the mutation sites are screened based on the disease condition information of the patient and the mutation information they carry, to obtain the pathogenic sites of the target hereditary disease. By taking the characteristics of the disease into account, the method effectively improves the comprehensiveness and precision of mutation site screening. It presents all pathogenic site information carried by the patient quickly and with a single click, including homozygous or heterozygous status, database listing status, whether a harmful mutation is present, and the like. The output is detailed and clear, the operation is simple, and pathogenic sites are screened efficiently.
Drawings
FIG. 1 is a schematic diagram of the structure of a genetic disease sequencing analysis system according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of a genetic disease sequencing analysis method according to an embodiment of the present application;
FIG. 3 is a schematic flow diagram of a familial hypercholesterolemia sequencing analysis method according to an embodiment of the present application;
FIG. 4 is a schematic diagram showing a screening process in a genetic disease sequencing method according to an experimental example of the present application;
FIG. 5 is a diagram showing the results of screening in the method for sequencing genetic diseases according to the experimental example of the present application.
Detailed Description
The following description of specific embodiments of the present application refers to the accompanying drawings.
In the present invention, unless otherwise specified, scientific and technical terms used herein have the meanings that are commonly understood by those skilled in the art. Also, the reagents, materials and procedures used herein are those that are widely used in the corresponding fields. Meanwhile, in order to better understand the present invention, the definitions and explanations of related terms are provided below.
Mutation site: a position at which the base in the genome of a human individual differs from the base at the same position in the human reference genome; a mutation site may be a pathogenic site that affects human health or causes disease.
SIFT unit (Sorting Intolerant From Tolerant): a tool unit that predicts the effect of non-synonymous variations based on sequence homology.
PP2 unit (Polymorphism Phenotyping v2, PolyPhen-2): a tool unit for predicting the influence of amino acid substitutions on the structure and function of human proteins.
MT unit: a tool unit for scoring mutation sites based on the pathogenicity of the mutations occurring at the site.
MS unit: a tool unit for scoring mutation sites based on the pathogenicity of the mutations occurring at the site.
Allele frequency: suppose that 1) a particular locus is present on a chromosome, 2) a gene is present at that locus, 3) each individual in a population has n copies of that locus in its somatic cells (e.g., two copies in the cells of a diploid organism), and 4) the gene has an allele or variant; then the allele frequency is the proportion of all copies of that locus in the population that carry the allele.
COSMIC database: a database of cancer-related mutations.
Clinvar database: is a database of human genomic variations associated with disease.
HGMD Database (Human Gene Mutation Database): is a database for collecting and organizing pathogenic sites closely related to human genetic diseases in published documents.
Familial hypercholesterolemia (FH), also known as familial hyperbetalipoproteinemia, is clinically characterized by hypercholesterolemia, characteristic xanthomas, and a family history of early-onset cardiovascular disease. FH is the most common hereditary hyperlipidemia in childhood and the most serious of the lipid metabolism diseases; it can cause various life-threatening cardiovascular complications and is an important risk factor for coronary artery disease.
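The allele frequency defined above can be computed by counting allele copies across a population. The following Python sketch is only an illustration; the function name `allele_frequencies` and the example genotypes are hypothetical and do not come from the application.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def allele_frequencies(genotypes: Iterable[Tuple[str, ...]]) -> Dict[str, float]:
    """Fraction of all copies of one locus in a population carrying each allele.

    Each genotype lists the alleles one individual carries at the locus
    (two entries for a diploid organism).
    """
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {allele: count / total for allele, count in counts.items()}


# Hypothetical population of four diploid individuals at a single locus.
genotypes = [("A", "A"), ("A", "G"), ("G", "G"), ("A", "A")]
freqs = allele_frequencies(genotypes)
print(freqs)
```

Here 5 of the 8 copies carry A and 3 carry G, so the frequencies are 0.625 and 0.375.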
Example 1
As shown in fig. 1, the present embodiment discloses a high throughput sequencing analysis system for monogenic genetic diseases, which includes an information collection module 110, a data acquisition module 120, a mutation annotation module 130, and a site screening module 140.
An information collection module 110 configured to collect condition information of a patient to be analyzed having a target genetic disease.
Specifically, the target genetic disease is a monogenic genetic disease for which mutation analysis of high-throughput sequencing data is to be performed in order to find pathogenic sites. By analyzing the mutation sites in the gene sequencing data of a patient suffering from the target genetic disease, the mutation sites that cause or influence the onset or development of the disease, namely the pathogenic sites, are found.
The patient to be analyzed is a patient suffering from a target genetic disease, for example, assuming that the target genetic disease is familial hypercholesterolemia, the patient to be analyzed is a patient suffering from familial hypercholesterolemia.
The condition information of the patient to be analyzed includes basic health information of the patient to be analyzed and index information related to the target genetic disease, for example, assuming that the target genetic disease is familial hypercholesterolemia, the condition information of the patient to be analyzed may include sex, age, height, weight, and blood lipid levels of the patient to be analyzed, such as triglyceride level, total cholesterol level, low density lipoprotein cholesterol level, high density lipoprotein cholesterol level, and the like.
Collecting the disease condition information of the patient to be analyzed who has the target hereditary disease helps target the subsequent analysis and screening of mutation sites to that disease, improving both the speed and the precision of mutation site screening.
A data acquisition module 120 configured to acquire and store high throughput sequencing data of the gene sequence to be tested of the patient to be analyzed.
Optionally, the data obtaining module 120 is further configured to:
acquiring high-throughput sequencing data of the gene sequence to be detected of the patient to be analyzed; determining the fragment index number of the sequencing data based on a preset hierarchical fragment strategy; distributively storing the sequencing data into a computer-readable storage medium based on the fragmentation index number of the sequencing data.
The high-throughput sequencing data may be in formats such as vcf, fastq and bam. The instructions stored on the computer-readable storage medium accept multiple input file formats, including vcf, fastq and bam.
Specifically, the gene sequence to be tested is a DNA sequence of the patient to be analyzed. The obtained high-throughput sequencing data comprise each sequencing read together with its start position, its end position and its alignment result against the reference genome.
The fragment index number of the high-throughput sequencing data is determined based on a preset hierarchical fragment strategy, which comprises the number of layers, the number of fragments contained in each layer, the data volume contained in each fragment and the index number of the first fragment in each layer. The high-throughput sequencing data are then stored in a distributed manner based on their fragment index numbers. The hierarchical fragment strategy may be determined based on the length of the sequencing reads, the processing capacity of the computing resources and the data volume of the high-throughput sequencing data, which is not limited in this application.
According to the single-gene genetic disease high-throughput sequencing analysis system, the fragment index number of the high-throughput sequencing data is determined, and the sequencing data is stored in a distributed manner based on the fragment index number of the high-throughput sequencing data, so that the storage and query efficiency of the high-throughput sequencing data can be effectively improved, the sequencing data can be displayed more intuitively, and the analysis efficiency of the high-throughput sequencing data can be improved.
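One possible reading of the hierarchical fragment strategy is sketched below in Python. The application lists the parameters of the strategy (number of layers, fragments per layer, data volume per fragment, index number of the first fragment in each layer) but not how a record is mapped to a fragment, so the mapping used here, filling layers in order by record offset, is an assumption, as are the names `Layer` and `fragment_index` and all numeric values.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Layer:
    fragment_count: int   # number of fragments in this layer
    fragment_size: int    # number of records each fragment holds
    first_index: int      # index number of the first fragment in this layer


def fragment_index(record_offset: int, layers: List[Layer]) -> int:
    """Map the ordinal position of a record to a fragment index number.

    Assumed rule: layers are filled in order, and within a layer records
    are packed into consecutive fragments of fixed size.
    """
    if record_offset < 0:
        raise ValueError("record_offset must be non-negative")
    remaining = record_offset
    for layer in layers:
        capacity = layer.fragment_count * layer.fragment_size
        if remaining < capacity:
            return layer.first_index + remaining // layer.fragment_size
        remaining -= capacity
    raise ValueError("record_offset exceeds the capacity of the strategy")


# Hypothetical two-layer strategy: 2 small fragments, then 4 larger ones.
strategy = [
    Layer(fragment_count=2, fragment_size=100, first_index=0),
    Layer(fragment_count=4, fragment_size=1000, first_index=2),
]
print(fragment_index(150, strategy), fragment_index(1250, strategy))
```

The fragment index number can then serve as the key that decides on which storage node or file a record is placed.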
A mutation annotation module 130 configured to perform mutation detection analysis on the high-throughput sequencing data to obtain mutation sites carrying mutation information, wherein the mutation information includes basic information, hazard prediction information, mutation frequency information, and pathogenicity information.
Optionally, the mutation annotation module 130 is further configured to:
carrying out mutation comparison detection on the high-throughput sequencing data to obtain mutation sites in a gene sequence to be detected of a patient to be analyzed; and annotating the mutation sites based on the mutation information of the mutation sites to obtain the mutation sites carrying the mutation information.
In practical application, the mutation site in the gene sequence to be detected of the patient to be analyzed can be obtained by comparing and analyzing the sequencing data of the gene sequence to be detected of the patient to be analyzed with the normal human genome gene sequence.
Optionally, the mutation annotation module 130 comprises a basic annotation module, a predictive annotation module, a frequency annotation module, and a pathogenicity annotation module.
The basic annotation module is configured to annotate basic information of the mutation sites based on the positions of the mutation sites in the genome and the mutation types, and obtain mutation sites carrying the basic information.
Specifically, based on the position of the mutation site in the genome, the basic annotation module annotates the name of the gene in which the mutation site is located and the structural region of that gene, so that it is known on which gene the mutation occurs and in which structural region, such as an exon, an intron or an intergenic region.
The basic annotation module also annotates the mutation type of the mutation site, such as nonsense mutation, missense mutation or synonymous mutation, so that it is known which type of mutation has occurred at the site.
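The position-based part of the basic annotation can be sketched as an interval lookup. In the Python sketch below, the gene model (`GENE_A` and its coordinates), the `Region` type and the function `annotate_position` are hypothetical placeholders; a real annotation unit would read regions from a gene annotation file and use an indexed search rather than a linear scan.

```python
from typing import List, NamedTuple, Tuple


class Region(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    gene: str
    kind: str   # structural region, e.g. "exon" or "intron"


def annotate_position(chrom: str, pos: int, regions: List[Region]) -> Tuple[str, str]:
    """Return (gene name, structural region) for a genomic position.

    Positions outside every listed region are reported as intergenic.
    """
    for region in regions:
        if region.chrom == chrom and region.start <= pos <= region.end:
            return region.gene, region.kind
    return "", "intergenic"


# Hypothetical gene model with made-up coordinates.
regions = [
    Region("chr1", 100, 200, "GENE_A", "exon"),
    Region("chr1", 201, 500, "GENE_A", "intron"),
    Region("chr1", 501, 600, "GENE_A", "exon"),
]
print(annotate_position("chr1", 150, regions))
print(annotate_position("chr1", 900, regions))
```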
The prediction annotation module is configured to annotate the hazard prediction information on the mutation sites based on the influence of the mutation sites on protein translation, and obtain mutation sites carrying hazard prediction information.
Specifically, the prediction annotation module performs hazard prediction on the mutation site based on whether the mutation site affects the change of amino acids in the protein translation process, and the mutation site can be scored through SIFT unit, PP2 unit, MT unit and MS unit.
Taking the SIFT unit and the PP2 unit as examples: for the SIFT unit, a scoring result of D represents a harmful variation and a scoring result of T represents a harmless variation; for the PP2 unit, a scoring result of D represents a harmful variation and P represents a possibly harmful variation.
The frequency annotation module is configured to annotate mutation frequency information on the mutation sites based on the allele frequencies of the mutation sites in the population, and obtain mutation sites carrying the mutation frequency information.
Specifically, the frequency annotation module annotates the allele frequencies of the mutation sites in the global population using a database unit, wherein the database unit can be the 1000G database (1000 Genomes Project database), the ESP database (NHLBI Grand Opportunity Exome Sequencing Project database) or the Exac.E database (Exome Aggregation Consortium database, ExAC-EAS).
The pathogenicity annotation module is configured to annotate pathogenicity information on the mutation site based on the relation between the mutation site and the disease information of the patient to be analyzed, and obtain the mutation site carrying the pathogenicity information.
Specifically, the pathogenicity annotation module can annotate pathogenicity information on mutation sites based on database units containing literature information reported in research, wherein annotation related to target genetic diseases can be carried out by using any one or more databases of a COSMIC database, a Clinvar database and an HGMD database.
For example, assuming a record is found in the HGMD database that mutation site a is a pathogenic site of the target genetic disease, the pathogenicity annotation module annotates the pathogenicity information "the mutation site may cause the generation of the target genetic disease" for mutation site a.
The high-throughput sequencing analysis system for the monogenic hereditary diseases, which is disclosed by the embodiment, can effectively improve the identification degree of mutation sites by performing basic annotation, prediction annotation, frequency annotation and pathogenicity annotation on the mutation sites in high-throughput sequencing data, and is beneficial to quickly and accurately screening the pathogenic sites from a plurality of mutation sites.
And the site screening module 140 is configured to screen the mutation sites based on the disease condition information of the patient to be analyzed and the basic information, the harmfulness prediction information, the mutation frequency information and the pathogenicity information carried by the mutation sites to obtain pathogenic sites related to the target genetic diseases.
Optionally, the site screening module 140 comprises a basic screening module, a predictive screening module, a frequency screening module, and a pathogenicity screening module.
A basic screening module configured to screen and retain, as a pathogenic site, a mutation site that is already present and has an association with the condition information of the patient to be analyzed, based on the condition information of the patient to be analyzed.
Here, a mutation site associated with the disease condition information of the patient to be analyzed is a mutation site that has a certain correlation with the onset or development of the target genetic disease. Mutation sites that are already recorded and have some correlation with the target genetic disease can be screened and retained as pathogenic sites based on the information of the target genetic disease.
Specifically, the COSMIC database, the Clinvar database and the HGMD database store a large amount of information on disease-related mutation sites. In this embodiment, based on the mutation information carried by a mutation site, the same mutation site may be searched for in any one or more of the COSMIC database, the Clinvar database and the HGMD database, and the mutation site is taken as a pathogenic site if the same mutation site is found.
The union of the mutation sites that are recorded in the three databases and are related to the disease condition information of the patient to be analyzed is taken as the pathogenic sites.
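The database lookup and union described above can be sketched with set operations. In this Python sketch the variant key `(chromosome, position, reference base, alternate base)`, the function `known_sites` and all database contents are hypothetical; real COSMIC, Clinvar and HGMD records would have to be loaded from their own files or services.

```python
from typing import Dict, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chromosome, position, ref, alt)


def known_sites(
    patient_variants: Set[Variant],
    databases: Dict[str, Set[Variant]],
) -> Dict[Variant, Set[str]]:
    """Return patient variants recorded in at least one database.

    The keys form the union over all databases; each value names the
    databases in which that variant was found.
    """
    hits: Dict[Variant, Set[str]] = {}
    for name, records in databases.items():
        for variant in patient_variants & records:
            hits.setdefault(variant, set()).add(name)
    return hits


# Hypothetical patient variants and disease-associated database records.
patient = {("chr1", 100, "A", "G"), ("chr2", 200, "C", "T"), ("chr3", 300, "G", "A")}
databases = {
    "COSMIC": {("chr1", 100, "A", "G")},
    "Clinvar": {("chr1", 100, "A", "G"), ("chr2", 200, "C", "T")},
    "HGMD": set(),
}
hits = known_sites(patient, databases)
print(sorted(hits))
```

Recording which database produced each hit keeps the source of the evidence visible in the final report.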
A predictive screening module configured to screen and retain mutation sites having an effect on gene expression products as pathogenic sites based on mutation information carried by the mutation sites.
Wherein the gene expression product is RNA or protein formed in the process of gene expression.
Specifically, a mutation site, of which the mutation type is any one of non-synonymous mutation, stop-gain mutation, frameshift mutation and nonsense mutation, in the mutation information is selected as a pathogenic site based on the mutation information carried by the mutation site.
Among them, a non-synonymous mutation (NSS) is a gene mutation that may change the amino acid sequence of the polypeptide product or the base sequence of a functional RNA; such mutations are mostly harmful or even lethal.
A frameshift mutation is a mutation in which the deletion or insertion of a base at a certain position in a DNA molecule shifts the reading frame and thereby changes the entire downstream coding sequence, so that a gene originally encoding a certain peptide chain comes to encode a completely different peptide chain sequence. A frameshift mutation can change the properties of the protein and thus alter a trait, and in serious cases can cause the death of the individual.
A stop-gain mutation is a mutation in which a codon encoding an amino acid is changed into a stop codon.
A nonsense mutation is a mutation in which a codon representing a certain amino acid is changed into a stop codon by the change of a certain base, so that peptide chain synthesis terminates prematurely.
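The predictive screening step amounts to keeping the sites whose annotated mutation type is in the set listed above. The Python sketch below assumes each annotated site is a dictionary with a `"type"` field and uses type labels chosen for illustration; the actual labels depend on the annotation unit used.

```python
from typing import Dict, List

# Mutation types retained by the predictive screening step (labels assumed).
RETAINED_TYPES = {"nonsynonymous", "stop-gain", "frameshift", "nonsense"}


def screen_by_type(sites: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the sites whose annotated mutation type affects the gene product."""
    return [site for site in sites if site.get("type") in RETAINED_TYPES]


# Hypothetical annotated mutation sites.
sites = [
    {"id": "site1", "type": "synonymous"},
    {"id": "site2", "type": "frameshift"},
    {"id": "site3", "type": "nonsynonymous"},
    {"id": "site4"},  # no type annotation available
]
retained = screen_by_type(sites)
print([site["id"] for site in retained])
```

Sites without a type annotation are dropped in this sketch; a pipeline might instead route them to manual review.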
A frequency screening module configured to screen and retain, as pathogenic sites, mutation sites whose mutation frequency is within a preset threshold range, based on the mutation information carried by the mutation sites.
Specifically, based on the mutation information carried by the mutation sites, the mutation frequency of each mutation site is counted in the 1000G database, the ExAC.E database and/or the ESP database, and the mutation sites whose mutation frequency is within the preset threshold range are taken as pathogenic sites.
Here, the mutation frequency refers to the frequency with which the mutation at the site occurs in the human population, as recorded in a human genome database, a national exome sequencing project database and an exome integration database. A mutation frequency within the preset threshold range means that, if the frequency of the mutation at the site in the population falls within a certain range, the mutation site is taken as a pathogenic site. The preset threshold range may include only an upper limit, only a lower limit, or both an upper limit and a lower limit; its specific values may be determined according to the actual situation and are not limited in the present application.
More specifically, the preset threshold range is 0 to 0.01: based on the mutation information carried by the mutation sites, the mutation frequency of each mutation site is counted in the 1000G database, the ExAC.E database and/or the ESP database, and the mutation sites whose mutation frequency is greater than or equal to 0 and less than 0.01 are taken as pathogenic sites.
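As a hedged sketch of the frequency screen, the function below keeps a site when its frequency in at least one database lies in the half-open range from 0 (inclusive) to 0.01 (exclusive). The function name, the dictionary layout and the choice to skip databases with no recorded frequency are assumptions; taking the union across databases follows the setting used later in Example 4.

```python
from typing import Dict, Optional


def within_frequency_range(
    frequencies: Dict[str, Optional[float]], lower: float = 0.0, upper: float = 0.01
) -> bool:
    """True if lower <= frequency < upper in at least one database.

    Databases with no recorded frequency (None) are skipped; this handling
    of missing values is an assumption, not something stated in the text.
    """
    return any(
        lower <= freq < upper for freq in frequencies.values() if freq is not None
    )


rare = within_frequency_range({"1000G": 0.0055, "ESP": None, "ExAC.E": 0.02})
common = within_frequency_range({"1000G": 0.2, "ESP": 0.3})
boundary = within_frequency_range({"1000G": 0.01})  # upper bound is excluded
```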
A pathogenicity screening module configured to screen and retain, as pathogenic sites, mutation sites whose pathogenicity meets a preset standard, based on the mutation information carried by the mutation sites.
Specifically, based on the degree of pathogenicity recorded in the mutation information carried by the mutation sites, the mutation sites are scored by any one or more of the SIFT unit, the PP2 unit, the MT unit and the MS unit; a mutation site is selected as a pathogenic site if the SIFT unit scoring result is D, the PP2 unit scoring result is P or D, the MT unit scoring result is A or D, the MS unit scoring result is D, or the ACMG classification is Pathogenic or Likely Pathogenic.
Here, a SIFT unit scoring result of D indicates that the mutation site is harmful; a PP2 unit scoring result of P indicates that the mutation site is possibly harmful, and D that it is harmful; an MT unit scoring result of A indicates that the mutation site is spontaneously pathogenic, and D that it is pathogenic; an MS unit scoring result of D indicates that the mutation site is harmful; and an ACMG classification of Pathogenic or Likely Pathogenic indicates that the mutation site is harmful. The union of the mutation sites meeting any one of the above conditions gives the mutation sites whose pathogenicity meets the preset standard, and these are taken as pathogenic sites.
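The label-based rule above can be written as a small lookup. This sketch only encodes the labels named in the text; the dictionary keys, the function name and the treatment of a missing score as "not damaging" are assumptions.

```python
from typing import Dict

# Labels that the text treats as meeting the preset pathogenicity standard.
DAMAGING_LABELS = {
    "SIFT": {"D"},
    "PP2": {"P", "D"},
    "MT": {"A", "D"},
    "MS": {"D"},
    "ACMG": {"Pathogenic", "Likely Pathogenic"},
}


def meets_pathogenicity_standard(scores: Dict[str, str]) -> bool:
    """True if any available unit reports one of its damaging labels (union rule)."""
    return any(scores.get(unit) in labels for unit, labels in DAMAGING_LABELS.items())


# Toy score sets for illustration only ("T" and "B" stand for non-damaging labels).
flagged = meets_pathogenicity_standard({"SIFT": "T", "MT": "A"})
not_flagged = meets_pathogenicity_standard({"SIFT": "T", "PP2": "B"})
```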
It should be noted that the four modules need not be executed in any particular order; all pathogenic sites of the target genetic disease are obtained by collecting the pathogenic sites screened and retained by the four modules.
The pathogenic sites are obtained by screening in four respects: the association between the mutation at a site and the condition information of the patient to be analyzed, the influence of the mutation on gene expression products, the frequency of the mutation in the population, and the pathogenicity prediction for the mutation. This effectively improves the comprehensiveness and precision of screening, allows the true pathogenic sites to be screened out of a large number of mutation sites quickly, comprehensively and accurately, avoids omissions during screening, reduces workload, and improves screening efficiency and accuracy.
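Because the four modules have no fixed execution order, their results can be combined by treating each module as an independent predicate. The sketch below is illustrative only: the predicate names and record fields are hypothetical. The default `combine=any` collects every site retained by at least one module, as the text describes; passing `combine=all` gives a stricter intersection, which is shown purely as an assumed alternative and is not stated in the text.

```python
from typing import Callable, Iterable, List

Screen = Callable[[dict], bool]


def combine_screens(
    sites: Iterable[dict], screens: List[Screen], combine: Callable = any
) -> List[dict]:
    """Keep sites according to the combined verdict of order-independent screens."""
    return [site for site in sites if combine(screen(site) for screen in screens)]


# Hypothetical predicates standing in for the four screening modules.
screens: List[Screen] = [
    lambda s: s["in_database"],
    lambda s: s["type"] in {"nonsynonymous", "stop-gain", "frameshift", "nonsense"},
    lambda s: 0.0 <= s["frequency"] < 0.01,
    lambda s: s["predicted_damaging"],
]

# Toy sites for illustration only.
sites = [
    {"id": "a", "in_database": True, "type": "nonsynonymous", "frequency": 0.001, "predicted_damaging": True},
    {"id": "b", "in_database": False, "type": "synonymous", "frequency": 0.3, "predicted_damaging": False},
    {"id": "c", "in_database": False, "type": "synonymous", "frequency": 0.005, "predicted_damaging": False},
]
union_ids = [s["id"] for s in combine_screens(sites, screens)]
strict_ids = [s["id"] for s in combine_screens(sites, screens, combine=all)]
```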
Optionally, the genetic disease sequencing analysis system of this embodiment further includes a quality control module configured to filter the sequencing data based on the quality of the sequencing data, and delete the sequencing data whose quality does not meet a preset quality standard.
The quality control module may be located between the data obtaining module 120 and the mutation annotation module 130, and after the data obtaining module 120 is executed, the quality control module is executed, and after the quality control module is executed, the mutation annotation module 130 is executed.
The quality control module can perform quality evaluation and statistical integration on the sequencing data, and filter low-quality sequencing data, so that the false positive and false negative of mutation sites are reduced, high-quality gene sequencing data are obtained, and accurate data are provided for accurate judgment of subsequent gene variation. The preset quality standard can be determined according to actual conditions, and the preset quality standard is not limited by the application.
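A quality control step of this kind might be sketched as below. The `qual` field, the threshold of 30 and the summary statistics are assumptions chosen for the example, since the preset quality standard is left open by the text.

```python
from typing import Dict, List, Tuple


def quality_filter(
    records: List[dict], min_quality: float = 30.0
) -> Tuple[List[dict], Dict[str, float]]:
    """Drop records below an assumed quality threshold and summarize the result."""
    kept = [r for r in records if r["qual"] >= min_quality]
    summary = {
        "total": len(records),
        "kept": len(kept),
        "removed": len(records) - len(kept),
        "mean_quality_kept": sum(r["qual"] for r in kept) / len(kept) if kept else 0.0,
    }
    return kept, summary


# Toy records for illustration only.
records = [{"id": 1, "qual": 50.0}, {"id": 2, "qual": 12.0}, {"id": 3, "qual": 40.0}]
kept, summary = quality_filter(records)
```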
The genetic disease sequencing analysis system provided by this embodiment collects the condition information of a patient to be analyzed who has a target genetic disease, which provides a basis for finding the pathogenic sites of the target genetic disease; performs detection and analysis on the acquired high-throughput sequencing data of the gene sequence to be tested to obtain mutation sites carrying mutation information; and then screens the mutation sites based on the condition information of the patient to be analyzed and the mutation information carried by the mutation sites to obtain the pathogenic sites of the target genetic disease. This effectively improves the comprehensiveness and precision of mutation site screening and presents, with one click, all pathogenic site information carried by the patient, including zygosity (homozygous or heterozygous), database inclusion status and whether a harmful mutation is present. The system is detailed, clear and simple to operate, and achieves efficient screening of pathogenic sites.
Example 2
As shown in fig. 2, the present embodiment discloses a single-gene genetic disease high-throughput sequencing analysis method for the genetic disease sequencing analysis system, which includes steps S210 to S240.
S210, collecting disease condition information of the patient to be analyzed with the target genetic disease.
Collecting the condition information of the patient to be analyzed who has the target genetic disease helps the subsequent analysis and screening of mutation sites to be carried out in a targeted manner, which improves the speed and precision of mutation site screening.
S220, acquiring and storing the high-throughput sequencing data of the gene sequence to be detected of the patient to be analyzed.
Specifically, high-throughput sequencing data of the gene sequence to be tested of the patient to be analyzed are acquired; the fragment index number of the sequencing data is determined based on a preset hierarchical fragment strategy; and the sequencing data are stored in a distributed manner in a computer-readable storage medium based on the fragment index number of the sequencing data.
According to the single-gene genetic disease high-throughput sequencing analysis method, the fragment index number of the sequencing data is determined, and the sequencing data is stored in a distributed manner based on the fragment index number of the sequencing data, so that the storage and query efficiency of the high-throughput sequencing data can be effectively improved, the high-throughput sequencing data can be displayed more intuitively, and the analysis efficiency of the high-throughput sequencing data can be improved.
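The text does not specify the hierarchical fragment strategy, so the following is only one possible reading: records are grouped first by chromosome and then by a fixed-width position bin, and the resulting index decides where each record is stored. The function names, the index format and the bin width of 1,000,000 bases are all assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def fragment_index(chrom: str, pos: int, bin_size: int = 1_000_000) -> str:
    """Two-level index: chromosome first, then a position bin within it (assumed scheme)."""
    return f"{chrom}:{pos // bin_size}"


def store_by_fragment(records: List[dict]) -> Dict[str, List[dict]]:
    """Group records by fragment index; each group could live on a separate store."""
    fragments: Dict[str, List[dict]] = defaultdict(list)
    for record in records:
        fragments[fragment_index(record["chrom"], record["pos"])].append(record)
    return dict(fragments)


# Positions reuse the coordinates reported in Example 4; other fields are omitted.
records = [
    {"chrom": "2", "pos": 21231524},
    {"chrom": "2", "pos": 21250914},
    {"chrom": "19", "pos": 11200236},
]
fragments = store_by_fragment(records)
```

Under this assumed scheme, a query for a genomic region only needs to read the fragments whose index covers that region.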
Optionally, this embodiment further includes step S221.
S221, filtering the high-throughput sequencing data based on the quality of the high-throughput sequencing data, and deleting the high-throughput sequencing data of which the quality does not meet a preset quality standard.
By carrying out quality evaluation and statistical integration on the high-throughput sequencing data, low-quality high-throughput sequencing data are filtered, so that false positive and false negative of mutation sites are reduced, high-quality gene sequencing data are obtained, and accurate data are provided for accurate judgment of subsequent genetic variation.
S230, performing mutation detection analysis on the high-throughput sequencing data to obtain mutation sites carrying mutation information, wherein the mutation information comprises basic information, harmfulness prediction information, mutation frequency information and pathogenicity information.
Specifically, mutation comparison detection is carried out on the sequencing data to obtain mutation sites in a gene sequence to be detected of a patient to be analyzed; and annotating the mutation sites based on the mutation information of the mutation sites to obtain the mutation sites carrying the mutation information.
The genetic disease sequencing analysis method provided by the embodiment can effectively improve the identification degree of the mutation sites by performing basic annotation, prediction annotation, frequency annotation and pathogenicity annotation on the mutation sites in sequencing data, and is beneficial to quickly and accurately screening the pathogenic sites from a plurality of mutation sites.
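One way to picture a mutation site "carrying" the four kinds of mutation information is a record with one field per annotation category. The class, field and function names below are hypothetical, and the annotation sources are modelled as plain dictionaries rather than real annotation tools.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

Key = Tuple[str, int, str, str]  # (chromosome, position, ref allele, alt allele)
CATEGORIES = ("basic", "hazard_prediction", "frequency", "pathogenicity")


@dataclass
class AnnotatedSite:
    chrom: str
    pos: int
    ref: str
    alt: str
    basic: dict = field(default_factory=dict)              # position, mutation type
    hazard_prediction: dict = field(default_factory=dict)  # predicted effect on protein
    frequency: dict = field(default_factory=dict)          # population allele frequency
    pathogenicity: dict = field(default_factory=dict)      # relation to disease


def annotate(site: AnnotatedSite, sources: Dict[str, Dict[Key, dict]]) -> AnnotatedSite:
    """Attach whatever each annotation source knows about this site."""
    key = (site.chrom, site.pos, site.ref, site.alt)
    for category in CATEGORIES:
        getattr(site, category).update(sources.get(category, {}).get(key, {}))
    return site


# Toy annotation sources for illustration only.
sources = {
    "basic": {("2", 100, "C", "T"): {"type": "nonsynonymous", "region": "exonic"}},
    "frequency": {("2", 100, "C", "T"): {"1000G": 0.002}},
}
site = annotate(AnnotatedSite("2", 100, "C", "T"), sources)
```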
S240, screening the mutation sites based on the disease condition information of the patient to be analyzed and the basic information, the harmfulness prediction information, the mutation frequency information and the pathogenicity information carried by the mutation sites to obtain the pathogenicity sites related to the target genetic diseases.
Specifically, mutation sites that are already recorded and are associated with the condition information of the patient to be analyzed are screened and retained as pathogenic sites based on the condition information of the patient to be analyzed.
In practical application, based on the mutation information carried by the mutation sites, an identical mutation site is searched for in the COSMIC database, the ClinVar database and/or the HGMD database, and the mutation site is taken as a pathogenic site when an identical mutation site is found.
Specifically, mutation sites having an effect on gene expression products are screened and retained as pathogenic sites based on the mutation information carried by the mutation sites.
In practical application, based on the mutation information carried by the mutation sites, a mutation site whose mutation type is any one of non-synonymous mutation, stop-gain mutation, frameshift mutation and nonsense mutation is selected as a pathogenic site.
Specifically, mutation sites whose mutation frequency is within a preset threshold range are screened and retained as pathogenic sites based on the mutation information carried by the mutation sites.
In practical application, based on the mutation information carried by the mutation sites, the mutation frequency of each mutation site is counted in the 1000G database, the ExAC.E database and/or the ESP database, and the mutation sites whose mutation frequency is within the preset threshold range are taken as pathogenic sites.
The preset threshold range may be 0 to 0.01; that is, a mutation site whose mutation frequency is greater than or equal to 0 and less than 0.01 is taken as a pathogenic site.
Specifically, mutation sites whose pathogenicity meets a preset standard are screened and retained as pathogenic sites of the target genetic disease based on the mutation information carried by the mutation sites.
In practical application, based on the degree of pathogenicity recorded in the mutation information carried by the mutation sites, the mutation sites are scored by any one or more of the SIFT unit, the PP2 unit, the MT unit and the MS unit; a mutation site is selected as a pathogenic site if it meets any one or more of the following: the SIFT unit scoring result is D, the PP2 unit scoring result is P or D, the MT unit scoring result is A or D, the MS unit scoring result is D, or the ACMG classification is Pathogenic or Likely Pathogenic.
The pathogenic sites are obtained by screening in four respects: the association between the mutation at a site and the condition information of the patient to be analyzed, the influence of the mutation on gene expression products, the frequency of the mutation in the population, and the pathogenicity prediction for the mutation. This effectively improves the comprehensiveness and precision of screening, allows the true pathogenic sites to be screened out of a large number of mutation sites quickly, comprehensively and accurately, avoids omissions during screening, reduces workload, and improves screening efficiency and accuracy.
The details of the steps S210 to S240 can be referred to the above embodiments, and are not described herein again.
The genetic disease sequencing analysis method provided by this embodiment collects the condition information of a patient to be analyzed who has a target genetic disease, which provides a basis for finding the pathogenic sites of the target genetic disease; after the high-throughput sequencing data of the gene sequence to be tested are acquired, performs detection and analysis on the data to obtain mutation sites carrying mutation information; and screens the mutation sites based on the condition information of the patient to be analyzed and the mutation information carried by the mutation sites to obtain the pathogenic sites of the target genetic disease. This effectively improves the comprehensiveness and precision of mutation site screening and presents, with one click, all pathogenic site information carried by the patient, including zygosity (homozygous or heterozygous), database inclusion status and whether a harmful mutation is present. The method is detailed, clear and simple to operate, and achieves efficient screening of pathogenic sites.
Example 3
As shown in fig. 3, the present embodiment discloses a familial hypercholesterolemia sequencing analysis method, which includes steps S310 to S370.
S310, collecting the disease information of the patient to be analyzed with the familial hypercholesterolemia.
Familial hypercholesterolemia (FH) is an autosomal dominant genetic disease of abnormal lipoprotein metabolism and a common genetic cause of early-onset coronary heart disease (CHD); it causes vascular injury through elevated plasma low-density lipoprotein cholesterol (LDL-C). The currently recognized causative genes for FH are the LDL receptor (LDLR) gene, the APOB gene, the LDL receptor adaptor protein 1 (LDLRAP1) gene and the PCSK9 gene.
The information on the condition of the patient to be analyzed who has familial hypercholesterolemia includes the age, sex, blood lipid level, etc. of the patient.
S320, acquiring and storing the high-throughput sequencing data of the gene sequence to be tested of the patient to be analyzed.
S330, carrying out mutation detection analysis on the high-throughput sequencing data to obtain mutation sites carrying mutation information, wherein the mutation information comprises basic information, harmfulness prediction information, mutation frequency information and pathogenicity information.
The details of steps S310 to S330 can be found in the above embodiments, and are not described herein again.
S340, screening and reserving existing mutation sites as pathogenic sites based on the disease information of the patient to be analyzed, wherein the mutation sites have correlation with the disease information of the patient to be analyzed.
Specifically, based on the mutation information carried by the mutation sites, mutation sites that are recorded in the COSMIC database, the ClinVar database and/or the HGMD database and are related to the target genetic disease are searched for and taken as pathogenic sites.
S350, screening and reserving the mutation sites having influence on gene expression products as pathogenic sites based on mutation information carried by the mutation sites.
Specifically, a mutation site, of which the mutation type is any one of non-synonymous mutation, stop-gain mutation, frameshift mutation and nonsense mutation, in the mutation information is selected as a pathogenic site based on the mutation information carried by the mutation site.
Since non-synonymous, stop-gain, frameshift and nonsense mutations are all regarded as harmful mutations that are very likely to cause or promote the onset and progression of familial hypercholesterolemia, a site at which any one of these mutations occurs is taken as a pathogenic site.
S360, screening and reserving the mutation sites with the mutation frequency within a preset threshold range as pathogenic sites based on the mutation information carried by the mutation sites.
Specifically, based on the mutation information carried by the mutation sites, the mutation frequency of each mutation site is counted in the 1000G database, the ExAC.E database and/or the ESP database, and the mutation sites whose mutation frequency is within the preset threshold range are taken as pathogenic sites.
If a given mutation at a given site occurs at a very high frequency in the population, the mutation can be regarded as a normal variation that does not cause the onset or progression of disease; if it occurs at a very low frequency, the mutation site is very likely to be a pathogenic site. Therefore, the preset threshold range may be 0 to 0.01; that is, a mutation site whose mutation frequency is greater than or equal to 0 and less than 0.01 is taken as a pathogenic site.
S370, screening and reserving the mutation sites with pathogenicity meeting the preset standard as the pathogenic sites of the target genetic diseases based on the mutation information carried by the mutation sites.
Specifically, based on the degree of pathogenicity recorded in the mutation information carried by the mutation sites, the mutation sites are scored by any one or more of the SIFT unit, the PP2 unit, the MT unit and the MS unit; a mutation site is selected as a pathogenic site if it meets any one or more of the following: the SIFT unit scoring result is D, the PP2 unit scoring result is P or D, the MT unit scoring result is A or D, the MS unit scoring result is D, or the ACMG classification is Pathogenic or Likely Pathogenic.
In the sequencing analysis method for familial hypercholesterolemia of this embodiment, the mutation sites carrying mutation information are screened in four respects, namely "association between the mutation at the site and familial hypercholesterolemia", "influence of the mutation at the site on gene expression products", "frequency of the mutation at the site in the population" and "pathogenicity prediction of the mutation at the site for familial hypercholesterolemia", to obtain the pathogenic sites. By taking the characteristics of familial hypercholesterolemia into account, the method effectively improves the comprehensiveness and precision of pathogenic site screening, so that the true pathogenic sites can be screened out of a large number of mutation sites quickly, omissions during screening are avoided, workload is reduced, and screening efficiency and accuracy are improved.
Example 4
The following experiment was conducted by taking familial hypercholesterolemia as an example of a target hereditary disease.
In advance, the DNA sequence in the blood of the patient to be analyzed was sequenced by targeted capture sequencing to obtain a sequencing data vcf file (the sequencing data vcf file is provided by Novisha Minigence company). The DNA sequence in the blood of the patient to be analyzed serves as the gene sequence to be tested in Examples 1-3, and the sequencing data vcf file serves as the sequencing data of the gene sequence to be tested of the patient to be analyzed in Examples 1-3.
The condition information of the patient to be analyzed is collected by the information collection module, as shown in table 1:
TABLE 1
Item: Value
Sex: Male
Age: 26 years
Triglycerides (TG): 2.45 mmol/L
Total cholesterol (TC): 8.5 mmol/L
Low-density lipoprotein cholesterol (LDL-C): 8.05 mmol/L
High-density lipoprotein cholesterol (HDL-C): 1.02 mmol/L
Past medical history: Coronary heart disease, PCI operation, hyperlipidemia
Family history: Family history of hyperlipidemia
The sequencing data of the gene sequence to be tested of the patient to be analyzed, namely the sequencing data vcf file of the DNA sequence in the patient's blood, were acquired by the data acquisition module and stored in a distributed manner in a computer-readable storage medium.
The sequencing data were subjected to quality assessment and statistical integration by a quality control module, the low quality sequencing data were discarded, and the remaining high quality sequencing data were used as sequencing data in the subsequent steps of examples 1-3.
The sequencing data were compared with a normal human genome by the mutation annotation module to obtain the mutation sites.
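Since the sequencing data in this example are supplied as a vcf file, reading one data line of such a file might look like the sketch below. It relies only on the standard tab-separated VCF column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, then one sample column) and on the GT genotype field; the function name, the returned dictionary layout and the example line are assumptions, and the line is toy data rather than a record from the patient's file.

```python
def parse_vcf_record(line: str) -> dict:
    """Parse one single-sample VCF data line and derive zygosity from GT."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _variant_id, ref, alt = fields[:5]
    # FORMAT (column 9) names the colon-separated values in the sample column.
    sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
    alleles = sample["GT"].replace("|", "/").split("/")
    if "." in alleles:
        zygosity = "unknown"        # genotype not called
    elif all(a == "0" for a in alleles):
        zygosity = "reference"      # no alternate allele present
    elif len(set(alleles)) == 1:
        zygosity = "homozygous"
    else:
        zygosity = "heterozygous"
    return {"chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt, "zygosity": zygosity}


# Toy VCF data line for illustration only.
record = parse_vcf_record("2\t100\t.\tC\tT\t50\tPASS\t.\tGT:DP\t0/1:35\n")
```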
The mutation annotation module annotated each mutation site with basic information based on the position of the mutation site in the genome and its mutation type, to obtain mutation sites carrying basic information; annotated each mutation site with harmfulness prediction information based on the influence of the mutation site on protein translation, to obtain mutation sites carrying harmfulness prediction information; annotated each mutation site with mutation frequency information based on the allele frequency of the mutation site in the population, to obtain mutation sites carrying mutation frequency information; and annotated each mutation site with pathogenicity information based on the relation between the mutation site and the condition information of the patient to be analyzed, to obtain mutation sites carrying pathogenicity information.
As shown in fig. 4, the mutation sites were screened using the condition information of the patient to be analyzed together with any one or a combination of the basic information, harmfulness prediction information, mutation frequency information and pathogenicity information carried by the mutation sites, to obtain the pathogenic sites.
Wherein, for the protein impact setting of the predictive screening module, the mutation types NSS (non-synonymous mutation), SG (stop-gain mutation), SL (stop-loss mutation), FI (frameshift insertion), FD (frameshift deletion) and FBS (block substitution) were selected;
for the population frequency setting of the frequency screening module, the frequency ranges for the 1000G database, the ESP database and the ExAC.E database were all set to 0-0.01, and the union was taken;
for the pathogenicity screening module, SIFT was set to D (harmful), PP2 to D (harmful) and P (possibly harmful), MT to A (spontaneously pathogenic) and D (pathogenic), MS to D (harmful), and the ACMG classification to Pathogenic and Likely Pathogenic, and the union was taken;
for the inclusion setting of the basic screening module, COSMIC inclusion and ClinVar inclusion were selected, together with the HGMD classes DM (pathogenic mutation), DM? (possibly pathogenic mutation), DP (disease-related polymorphic variation) and DFP (possibly disease-related polymorphic variation), and the union was taken.
The patient has familial hypercholesterolemia, for which the currently recognized causative genes are the LDLR gene, the APOB gene, the LDLRAP1 gene and the PCSK9 gene. The patient's sequencing data contain variations at 108 sites in these 4 genes (LDLR, APOB, LDLRAP1 and PCSK9); after the above screening steps, variations at 3 sites in the APOB and LDLR genes were obtained as pathogenic sites, as shown in fig. 5.
The first pathogenic site is a single nucleotide variation at position 21231524 in the APOB gene on chromosome 2: the base at position 8216 is changed from cytosine (C) to thymine (T), so that amino acid P at position 2739 of the protein sequence is replaced by L. The variation is a homozygous, non-synonymous mutation located in an exon region. Its pathogenicity is shown as harmful in the SIFT unit and as possibly harmful in the PP2 unit, suggesting that the variation is very likely to be harmful, and the variation is recorded in the ClinVar unit and the HGMD unit.
The second pathogenic site is a single nucleotide variation at position 21250914 in the APOB gene on chromosome 2: the base at position 1853 is changed from cytosine (C) to thymine (T), so that amino acid P at position 618 of the protein sequence is replaced by V. The variation is a homozygous, non-synonymous mutation located in an exon region. Its pathogenicity is shown as harmless in the SIFT unit but as harmful in the PP2 unit, suggesting that the variation is very likely to be harmful, and the variation is recorded in the ClinVar unit and the HGMD unit.
The third pathogenic site is a single nucleotide variation at position 11200236 in the LDLR gene on chromosome 19: the base at position 12 is changed from guanine (G) to adenine (A), so that amino acid W at position 4 of the protein sequence is replaced by X. The variation is a heterozygous stop-gain mutation located in an exon region. Its pathogenicity is shown as harmless in the SIFT unit but as spontaneously pathogenic in the MT unit, and its mutation frequency is 0.55%, suggesting that the variation is very likely to be harmful; the variation is recorded in the HGMD unit.
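The base positions and amino acid positions reported for the three sites are mutually consistent if the base positions are read as positions in the coding sequence, where each codon spans three bases. That reading is an assumption, since the text does not state the coordinate system of the base positions; under it, the codon number is the base position divided by three and rounded up.

```python
def codon_number(base_position: int) -> int:
    """1-based codon index containing a 1-based coding-sequence base position."""
    if base_position < 1:
        raise ValueError("base position must be 1 or greater")
    return (base_position + 2) // 3  # integer form of ceil(base_position / 3)


# (base position, reported amino acid position) for the three sites in the text.
reported = [(8216, 2739), (1853, 618), (12, 4)]
consistent = all(codon_number(base) == residue for base, residue in reported)
```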
Therefore, the genetic disease sequencing analysis system and method provided by the present application can comprehensively and accurately screen the many mutation sites in sequencing data and present, with one click, all pathogenic site information carried by a patient, including zygosity (homozygous or heterozygous), database inclusion status and whether a harmful mutation is present. They are detailed, clear and simple to operate, and achieve efficient screening of pathogenic sites.
In this document, "upper", "lower", "front", "rear", "left", "right", and the like are used only to indicate relative positional relationships between relevant portions, and do not limit absolute positions of the relevant portions.
In this document, "first", "second", and the like are used only for distinguishing one from another, and do not indicate the degree and order of importance, the premise that each other exists, and the like.
In this context, "equal", "same", etc. are not strictly mathematical and/or geometric limitations, but also include tolerances as would be understood by a person skilled in the art and allowed for manufacturing or use, etc.
Unless otherwise indicated, numerical ranges herein include not only the entire range within its two endpoints, but also several sub-ranges subsumed therein.
The preferred embodiments and examples of the present application have been described in detail with reference to the accompanying drawings, but the present application is not limited to the embodiments and examples described above, and various changes can be made within the knowledge of those skilled in the art without departing from the concept of the present application.

Claims (11)

1. A method for high-throughput sequencing analysis of monogenic genetic diseases, which comprises the following steps:
collecting disease condition information of a patient to be analyzed with a target hereditary disease;
acquiring and storing high-throughput sequencing data of the gene sequence to be detected of the patient to be analyzed;
performing mutation detection analysis on the high-throughput sequencing data to obtain mutation sites carrying mutation information, wherein the mutation information comprises basic information, harmfulness prediction information, mutation frequency information and pathogenicity information;
and screening the mutation sites based on the disease condition information of the patient to be analyzed and the basic information, the harmfulness prediction information, the mutation frequency information and the pathogenicity information carried by the mutation sites to obtain the pathogenic sites of the target genetic disease.
2. The method for high-throughput sequencing analysis of monogenic genetic diseases according to claim 1, wherein said obtaining and storing high-throughput sequencing data of said gene sequence to be tested of said patient to be analyzed comprises:
acquiring high-throughput sequencing data of the gene sequence to be detected of the patient to be analyzed;
determining the fragment index number of the sequencing data based on a preset hierarchical fragment strategy;
storing the sequencing data in a distributed manner in a computer-readable storage medium based on the fragment index number of the sequencing data.
3. The method for high-throughput sequencing analysis of single-gene genetic diseases according to claim 1, wherein the performing mutation detection analysis on the sequencing data to obtain mutation sites carrying mutation information comprises:
carrying out mutation comparison detection on the high-throughput sequencing data to obtain mutation sites in a gene sequence to be detected of a patient to be analyzed;
and annotating the mutation sites based on the mutation information of the mutation sites to obtain the mutation sites carrying the mutation information.
4. The method for high-throughput sequencing analysis of monogenic genetic diseases according to claim 3, wherein said annotating said mutation site based on mutation information of said mutation site comprises:
annotating basic information on the mutation site based on the position of the mutation site in a genome and the mutation type to obtain a mutation site carrying the basic information;
annotating hazard prediction information on the mutant site based on the influence of the mutant site on protein translation to obtain a mutant site carrying hazard prediction information;
annotating mutation frequency information on the mutation sites based on the allele frequencies of the mutation sites in the population to obtain mutation sites carrying the mutation frequency information;
and annotating pathogenicity information on the mutation site based on the relation between the mutation site and the disease information of the patient to be analyzed, and obtaining the mutation site carrying the pathogenicity information.
5. The method for high-throughput sequencing analysis of monogenic genetic diseases according to claim 1, wherein screening the mutation sites based on the disease information of the patient to be analyzed and the basic information, the harmfulness prediction information, the mutation frequency information and the pathogenicity information carried by the mutation sites comprises:
screening and retaining, as pathogenic sites, mutation sites that exist and are associated with the disease information of the patient to be analyzed, based on the disease information of the patient to be analyzed;
screening and retaining, as pathogenic sites, mutation sites that affect gene expression products, based on the mutation information carried by the mutation sites;
screening and retaining, as pathogenic sites, mutation sites whose mutation frequency is within a preset threshold range, based on the mutation information carried by the mutation sites;
and screening and retaining, as pathogenic sites of the target genetic disease, mutation sites whose pathogenicity meets a preset standard, based on the mutation information carried by the mutation sites.
6. The method for high-throughput sequencing analysis of monogenic genetic diseases according to claim 5, wherein screening and retaining, as pathogenic sites, mutation sites that exist and are associated with the disease information of the patient to be analyzed, based on the disease information of the patient to be analyzed, comprises:
searching a COSMIC database, a ClinVar database and/or an HGMD database for recorded mutation sites associated with the target genetic disease, and taking those mutation sites as pathogenic sites.
7. The method for high-throughput sequencing analysis of monogenic genetic diseases according to claim 5, wherein screening and retaining, as pathogenic sites, mutation sites that affect gene expression products, based on the mutation information carried by the mutation sites, comprises:
selecting, based on the mutation information carried by the mutation site, a mutation site whose mutation information includes any one of a non-synonymous mutation, a stop-gain mutation, a frameshift mutation and a nonsense mutation as a pathogenic site.
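The selection in claim 7 amounts to a membership test on the annotated mutation type. The sketch below assumes each site is a dictionary with a hypothetical `"type"` field; the spellings in `DAMAGING_TYPES` are illustrative labels and not the output vocabulary of any particular annotation tool.

```python
# Hypothetical labels for the four mutation types named in claim 7.
DAMAGING_TYPES = {"nonsynonymous", "stopgain", "frameshift", "nonsense"}


def affects_gene_product(mutation_type: str) -> bool:
    """True if the annotated type is one of the four types listed in the claim."""
    return mutation_type.strip().lower() in DAMAGING_TYPES


def filter_by_effect(sites):
    """Keep only the sites whose 'type' field names a listed mutation type."""
    return [s for s in sites if affects_gene_product(s["type"])]
```

For example, a list containing one synonymous and one frameshift site would be reduced to the frameshift site alone.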
8. The method for high-throughput sequencing analysis of monogenic genetic diseases according to claim 5, wherein screening and retaining, as pathogenic sites, mutation sites whose mutation frequency is within a preset threshold range, based on the mutation information carried by the mutation sites, comprises:
determining the mutation frequency of the mutation sites in a 1000G database, an Exac.E database and/or an ESP database based on the mutation information carried by the mutation sites, and taking the mutation sites whose mutation frequency is within the preset threshold range as pathogenic sites.
9. The method for high-throughput sequencing analysis of monogenic genetic diseases according to claim 8, wherein the preset threshold range is 0-0.01, and screening and retaining, as pathogenic sites, mutation sites whose mutation frequency is within the preset threshold range, based on the mutation information carried by the mutation sites, comprises:
determining the mutation frequency of the mutation sites in a 1000G database, an Exac.E database and/or an ESP database based on the mutation information carried by the mutation sites, and taking the mutation sites whose mutation frequency is greater than or equal to 0 and less than 0.01 as pathogenic sites.
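The threshold test of claims 8 and 9 can be sketched as follows. Two points are assumptions made for the example, since the claims do not settle them: a site absent from a database (value `None`) is ignored for that database, and when several databases report a frequency, the highest reported value is the one compared against the threshold. The function name `within_threshold` is hypothetical.

```python
from typing import Dict, Optional


def within_threshold(freqs: Dict[str, Optional[float]], upper: float = 0.01) -> bool:
    """True if the highest observed allele frequency f satisfies 0 <= f < upper.

    Assumption: databases with no record (None) are skipped, and a site
    with no record anywhere is treated as having frequency 0.
    """
    observed = [f for f in freqs.values() if f is not None]
    highest = max(observed, default=0.0)
    return 0.0 <= highest < upper
```

Note that the upper bound is exclusive, matching "less than 0.01" in claim 9, so a frequency of exactly 0.01 is filtered out.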
10. The method for high-throughput sequencing analysis of monogenic genetic diseases according to claim 5, wherein screening and retaining, as pathogenic sites of the target genetic disease, mutation sites whose pathogenicity meets a preset standard, based on the mutation information carried by the mutation sites, comprises:
scoring the mutation site by any one or more of a SIFT unit, a PP2 unit, an MT unit and an MS unit, based on the degree of pathogenicity of the mutation recorded in the mutation information carried by the mutation site; and selecting, as the pathogenic site, a mutation site whose SIFT unit scoring result is D, whose PP2 unit scoring result is P or D, whose MT unit scoring result is A or D, or whose MS unit scoring result is D, with an ACMG classification of Pathogenic or Likely Pathogenic.
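The label rules of claim 10 can be written as a small lookup table. This is a sketch under stated assumptions: the claim wording leaves open how the predictor results and the ACMG classification combine, and the example below takes the permissive reading in which either a damaging label from any one predictor or an ACMG class of Pathogenic or Likely Pathogenic is sufficient. The names `DAMAGING_LABELS`, `predicted_damaging` and `meets_standard` are hypothetical.

```python
from typing import Dict, Optional

# Labels treated as damaging for each scoring unit, as listed in claim 10.
DAMAGING_LABELS = {
    "SIFT": {"D"},
    "PP2": {"P", "D"},
    "MT": {"A", "D"},
    "MS": {"D"},
}
ACMG_PATHOGENIC = {"Pathogenic", "Likely Pathogenic"}


def predicted_damaging(scores: Dict[str, str]) -> bool:
    """True if at least one available predictor reports a damaging label."""
    return any(scores.get(unit) in labels for unit, labels in DAMAGING_LABELS.items())


def meets_standard(scores: Dict[str, str], acmg: Optional[str]) -> bool:
    """Assumed OR combination of predictor labels and ACMG classification."""
    return predicted_damaging(scores) or acmg in ACMG_PATHOGENIC
```

A stricter reading that requires both conditions would replace `or` with `and` in `meets_standard`.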
11. The method for high-throughput sequencing analysis of monogenic genetic diseases according to claim 1, further comprising, after obtaining and storing the high-throughput sequencing data of the gene sequence to be tested of the patient to be analyzed:
filtering the high-throughput sequencing data based on the quality of the high-throughput sequencing data, and deleting the high-throughput sequencing data whose quality does not meet a preset quality standard.
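The quality filtering of claim 11 does not specify a quality standard. As one illustrative possibility, the sketch below drops reads whose mean Phred quality falls below a cutoff. The cutoff of 20, the Phred+33 encoding, the `(name, sequence, quality)` tuple layout, and the function names are all assumptions made for the example.

```python
from typing import List, Tuple

Read = Tuple[str, str, str]  # (read name, base sequence, quality string)


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string; an empty string scores 0."""
    if not quality:
        return 0.0
    return sum(ord(c) - offset for c in quality) / len(quality)


def filter_reads(reads: List[Read], min_mean_q: float = 20.0) -> List[Read]:
    """Keep reads whose mean Phred quality is at least min_mean_q."""
    return [r for r in reads if mean_phred(r[2]) >= min_mean_q]
```

In practice the preset standard might also cover adapter content or read length, which this sketch does not attempt.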
CN202010035599.8A 2020-01-14 2020-01-14 High-throughput sequencing analysis method for monogenic hereditary diseases Pending CN111139291A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010035599.8A CN111139291A (en) 2020-01-14 2020-01-14 High-throughput sequencing analysis method for monogenic hereditary diseases

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010035599.8A CN111139291A (en) 2020-01-14 2020-01-14 High-throughput sequencing analysis method for monogenic hereditary diseases

Publications (1)

Publication Number Publication Date
CN111139291A (en) 2020-05-12

Family

ID=70524849

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010035599.8A Pending CN111139291A (en) 2020-01-14 2020-01-14 High-throughput sequencing analysis method for monogenic hereditary diseases

Country Status (1)

Country Link
CN (1) CN111139291A (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110098193A1 (en) * 2009-10-22 2011-04-28 Kingsmore Stephen F Methods and Systems for Medical Sequencing Analysis
CN103305618A (en) * 2013-06-26 2013-09-18 北京迈基诺基因科技有限责任公司 Screening method of inherited metabolic disorder gene
CN107229841A (en) * 2017-05-24 2017-10-03 重庆金域医学检验所有限公司 A kind of genetic mutation appraisal procedure and system
CN107391965A (en) * 2017-08-15 2017-11-24 上海派森诺生物科技股份有限公司 A kind of lung cancer somatic mutation determination method based on high throughput sequencing technologies
CN107506618A (en) * 2017-07-07 2017-12-22 北京中科晶云科技有限公司 The storage method and querying method of high-flux sequence sequence
CN107750279A (en) * 2015-03-16 2018-03-02 个人基因组诊断公司 Foranalysis of nucleic acids system and method
CN108251520A (en) * 2018-01-31 2018-07-06 杭州同欣基因科技有限公司 A kind of smoking addiction Risk Forecast Method and smoking cessation guidance method based on high throughput sequencing technologies
CN108710781A (en) * 2018-03-30 2018-10-26 北京恒华永力电力工程有限公司 A kind of sort method and device of genetic mutation
CN108920901A (en) * 2018-07-24 2018-11-30 中国医学科学院北京协和医院 A kind of sequencing data mutation analysis system
CN109295198A (en) * 2018-09-03 2019-02-01 安吉康尔(深圳)科技有限公司 For detecting the method, apparatus and terminal device of genetic disease genetic mutation
CN109994154A (en) * 2017-12-30 2019-07-09 安诺优达基因科技(北京)有限公司 A kind of screening plant of single-gene recessive genetic disorder candidate disease causing genes

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111798926A (en) * 2020-06-30 2020-10-20 广州金域医学检验中心有限公司 Pathogenic gene locus database and establishment method thereof
CN111798926B (en) * 2020-06-30 2023-09-29 广州金域医学检验中心有限公司 Pathogenic gene locus database and establishment method thereof
CN113689914A (en) * 2020-12-17 2021-11-23 武汉良培医学检验实验室有限公司 Screening method and chip for single-gene genetic disease expansibility carrier
CN113689914B (en) * 2020-12-17 2024-02-20 武汉良培医学检验实验室有限公司 Single-gene genetic disease expansibility carrier screening method and chip
CN113470747A (en) * 2021-06-29 2021-10-01 首都医科大学附属北京胸科医院 Method and device for acquiring drug resistance analysis result of mycobacterium tuberculosis
CN113470747B (en) * 2021-06-29 2024-04-26 首都医科大学附属北京胸科医院 Method and device for acquiring drug resistance analysis result of tubercle bacillus
CN113628683A (en) * 2021-08-24 2021-11-09 慧算医疗科技(上海)有限公司 High-throughput sequencing mutation detection method, equipment, device and readable storage medium
CN113628683B (en) * 2021-08-24 2024-04-09 慧算医疗科技(上海)有限公司 High-throughput sequencing mutation detection method, device and apparatus and readable storage medium
CN113889188A (en) * 2021-10-22 2022-01-04 赛业(广州)生物科技有限公司 Disease prediction method, system, computer device and medium
WO2023124779A1 (en) * 2021-12-28 2023-07-06 成都齐碳科技有限公司 Third-generation sequencing data analysis method and device for point mutation detection
CN117953968A (en) * 2024-03-27 2024-04-30 北京智因东方转化医学研究中心有限公司 Method and device for sequencing harmfulness of genetic variation sites

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200512
