WO2021248695A1 - Method and system for recommending single-gene disease names based on clinical features and sequence variation - Google Patents

Method and system for recommending single-gene disease names based on clinical features and sequence variation

Info

Publication number
WO2021248695A1
WO2021248695A1 PCT/CN2020/111133 CN2020111133W WO2021248695A1 WO 2021248695 A1 WO2021248695 A1 WO 2021248695A1 CN 2020111133 W CN2020111133 W CN 2020111133W WO 2021248695 A1 WO2021248695 A1 WO 2021248695A1
Authority
WO
WIPO (PCT)
Prior art keywords
gene
feature
standard
clinical
standard single
Prior art date
Application number
PCT/CN2020/111133
Other languages
English (en)
French (fr)
Inventor
马旭
曹宗富
罗敏娜
陈翠霞
蔡瑞琨
喻浴飞
李乾
Original Assignee
国家卫生健康委科学技术研究所
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 国家卫生健康委科学技术研究所
Publication of WO2021248695A1 (zh)

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/40 Population genetics; Linkage disequilibrium
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics

Definitions

  • The present invention relates to the field of medical information technology, and in particular to a method and system for recommending names of single-gene diseases based on clinical features and sequence variation.
  • Single-gene (monogenic) diseases are common. They are caused by mutations in a pair of alleles and are also known as Mendelian genetic diseases. Their characteristics are as follows:
  • the phenotypes of single-gene diseases are complex: the same single-gene disease can be highly heterogeneous in phenotype, and the clinical features of different single-gene diseases overlap with each other;
  • the inheritance patterns of single-gene diseases are diverse: the same single-gene disease may show different inheritance patterns, and different single-gene diseases may show the same inheritance pattern.
  • The purpose of the present invention is to provide a method and system for recommending single-gene disease names based on clinical features and sequence variation, which can accurately recommend a single-gene disease name that matches the patient's condition.
  • One aspect of the present invention provides a method for recommending names of single-gene diseases based on clinical features and sequence variation, including:
  • obtaining case information of the patient, the case information including a gene sequence, feature set I, and a single-gene disease name;
  • the recommended results of the standard single-gene disease names are output.
  • The step of traversing the feature set A corresponding to each standard single-gene disease name in the feature relational database, calculating the set similarity value of each feature set A and feature set I, and outputting the similar standard single-gene disease names and corresponding genes as candidates in descending order of similarity value also includes:
  • the standardized clinical feature phenotype tree is composed of multiple stem nodes and at least one branch node associated with each stem node. Each branch node represents a standardized clinical feature, and each stem node represents an index of the associated standardized clinical features.
  • The method of traversing the feature set A corresponding to each standard single-gene disease name in the feature relational database, calculating the set similarity value of each feature set A and feature set I, and outputting the similar standard single-gene disease names and corresponding genes as candidates in descending order of similarity value includes:
  • traverse the n-th standard single-gene disease name in the feature relational database, and mark the nodes of the standard clinical features in the corresponding feature set A on the standardized clinical feature phenotype tree, where the initial value of n is 1;
  • the best standard clinical feature corresponding to each clinical feature in feature set I is matched from feature set A;
  • the set similarity value of feature set I and the current feature set A is calculated;
  • the set similarity values are then summarized and sorted for candidate output.
  • the method of matching the best standard clinical feature corresponding to each clinical feature in feature set I from feature set A based on the node labels on the standardized clinical feature phenotype tree includes:
  • the feature set I includes multiple clinical features, and the feature set A includes multiple standard clinical features;
  • the method for selecting the standard clinical feature with the highest similarity to the i-th clinical feature from the feature set A includes:
  • the standard clinical feature corresponding to the maximum of the multiple similarity values is selected as the best standard clinical feature corresponding to the i-th clinical feature.
  • the method of comparing the gene sequence with the human reference genome to obtain comparison data, and obtaining the impact score of each genetic variation according to the comparison data includes:
  • when the gene detection mode is the single-sample detection mode, the gene sequence is one set of gene sequences from the person to be tested; when the gene detection mode is the family detection mode, the gene sequences are one set from the person to be tested and at least one set from that person's immediate family members;
  • the mutation type includes SNP variation and indel variation;
  • the mutation function is classified as harmful, mildly harmful, or basically harmless;
  • the clinical significance of each genetic variation is classified into five levels: pathogenic, likely pathogenic, uncertain significance, likely benign, and benign;
  • the impact score of each genetic variant in the gene is calculated.
  • The method of obtaining multiple genes corresponding to the single-gene disease name from a preset gene list file, and calculating the pathogenicity score of each gene based on the impact score of each genetic variation in the gene, the inheritance mode of the genetic variation, the relevance to known diseases, and the similarity value corresponding to the gene, includes:
  • the pathogenicity score of each gene is calculated with the formula $\mathrm{Score}_g = \max(\mathrm{Score}_v) + w_e S_e + w_t S_t + w_{MLS} S_{MLS}$, where:
  • $\max(\mathrm{Score}_v)$ is the highest impact score among the genetic variations in the gene;
  • $S_t$ is the value of the inheritance pattern of the genetic variation;
  • $S_{MLS}$ is the similarity value corresponding to the gene;
  • $w_e$ is the weight assigned to $S_e$;
  • $w_t$ is the weight assigned to $S_t$.
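The gene-level scoring formula above can be illustrated with a minimal Python sketch. The class name `GeneEvidence`, the function name, and the default weights of 1.0 are hypothetical and chosen only for illustration; the text does not state weight values, and it does not define $S_e$ explicitly, so the sketch simply treats it as a numeric input.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class GeneEvidence:
    """Inputs to the gene pathogenicity score (hypothetical container)."""
    variant_scores: Sequence[float]  # impact score of each variant in the gene (Score_v)
    s_e: float                       # S_e term (not defined further in the text)
    s_t: float                       # inheritance-pattern value S_t
    s_mls: float                     # similarity value S_MLS for the gene


def gene_pathogenicity_score(ev: GeneEvidence,
                             w_e: float = 1.0,
                             w_t: float = 1.0,
                             w_mls: float = 1.0) -> float:
    """Score_g = max(Score_v) + w_e*S_e + w_t*S_t + w_MLS*S_MLS.

    The default weights are placeholders, not values from the source.
    """
    if not ev.variant_scores:
        raise ValueError("a gene needs at least one scored variant")
    return (max(ev.variant_scores)
            + w_e * ev.s_e
            + w_t * ev.s_t
            + w_mls * ev.s_mls)


if __name__ == "__main__":
    evidence = GeneEvidence(variant_scores=[0.2, 0.9], s_e=0.5, s_t=1.0, s_mls=0.4)
    print(gene_pathogenicity_score(evidence, w_e=0.5, w_t=0.25, w_mls=2.0))
```

Genes would then be sorted by this score in descending order to produce the candidate list.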
  • before outputting the corresponding standard single-gene disease name candidates in descending order of pathogenicity score, the method further includes:
  • the blacklist method is used to filter out the standard single-gene disease names corresponding to the false positive mutation sites.
  • the single gene disease name recommendation method based on clinical characteristics and sequence variation provided by the present invention has the following beneficial effects:
  • first, a piece of patient case information including the gene sequence, feature set I, and single-gene disease names is obtained; single-gene disease names are then recommended by phenotype-assisted diagnosis based on feature set I, and by gene-assisted diagnosis based on the gene sequence and the single-gene disease names.
  • Finally, based on the intersection of the phenotype-assisted and gene-assisted recommendations, the final recommended standard single-gene disease names are output for the patient.
  • the solution provided by the present invention integrates the clinical characteristics and genetic variation of patients for clinical auxiliary diagnosis, and can help clinicians to accurately diagnose complex single-gene diseases.
  • Another aspect of the present invention provides a single gene disease name recommendation system based on clinical characteristics and sequence variation, including:
  • the input unit is used to obtain the patient's case information, the case information including the gene sequence, feature set I and the name of the single gene disease;
  • the sequence comparison unit is used to compare the gene sequence with the human reference genome to obtain comparison data, and obtain the impact score of each genetic variation according to the comparison data;
  • the phenotypic auxiliary diagnosis unit is used to traverse the feature set A corresponding to each standard single-gene disease name in the feature relational database, calculate the set similarity value of each feature set A and feature set I, and output the similar standard single-gene disease names and corresponding genes as candidates in descending order of similarity value.
  • the candidate standard single-gene disease names are summarized to construct a standard single-gene disease name set P;
  • the genetic auxiliary diagnosis unit is used to obtain multiple genes corresponding to the single-gene disease name from a preset gene list file, and to calculate the pathogenicity score of each gene based on the impact score of each genetic variation in the gene, the inheritance mode of the genetic variation, the relevance to known diseases, and the similarity value corresponding to the gene.
  • the corresponding standard single-gene disease names are output as candidates in descending order of pathogenicity score, and the candidate standard single-gene disease names are summarized to construct a standard single-gene disease name set G;
  • the recommendation output unit is used to output the recommended standard single-gene disease names based on the intersection of the standard single-gene disease name set G and the standard single-gene disease name set P, and on the candidate output order of the standard single-gene disease names.
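As a rough illustration of the recommendation output step, the following Python sketch intersects a phenotype-ranked candidate list (set P) with a gene-ranked candidate list (set G). The text does not specify how the two candidate orders are combined, so ordering by the sum of rank positions, with ties broken by the gene-based rank, is an assumption made here; the function name `recommend` is likewise hypothetical.

```python
from typing import List


def recommend(phenotype_ranked: List[str], gene_ranked: List[str]) -> List[str]:
    """Return disease names present in both ranked lists.

    Both inputs are assumed to be sorted best-first and to contain unique names.
    Ordering rule (an assumption, not from the source): smaller sum of the two
    rank positions first; ties are broken by the gene-based rank.
    """
    p_rank = {name: i for i, name in enumerate(phenotype_ranked)}
    g_rank = {name: i for i, name in enumerate(gene_ranked)}
    common = p_rank.keys() & g_rank.keys()  # intersection of set P and set G
    return sorted(common, key=lambda name: (p_rank[name] + g_rank[name], g_rank[name]))


if __name__ == "__main__":
    set_p = ["Disease A", "Disease B", "Disease C", "Disease D"]
    set_g = ["Disease C", "Disease A", "Disease E"]
    print(recommend(set_p, set_g))
```

A different combination rule, for example keeping only the gene-based order within the intersection, would change the sort key and nothing else.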
  • the beneficial effects of the single-gene disease name recommendation system based on clinical features and sequence variation provided by the present invention are the same as those of the single-gene disease name recommendation method provided by the above technical solution, and will not be repeated here.
  • the third aspect of the present invention provides a computer-readable storage medium, for example a non-volatile computer-readable storage medium, on which computer-readable instructions are stored, the instructions being executed when run by a processor.
  • the beneficial effects of the computer-readable storage medium provided by the present invention are the same as the beneficial effects of the single-gene disease name recommendation method based on clinical features and sequence variation provided by the above technical solutions, and will not be repeated here.
  • Fig. 1 is a schematic flowchart of the method for recommending names of single-gene diseases based on clinical features and sequence variation in the first embodiment;
  • Fig. 2 is an example diagram of node labels on the standardized clinical feature phenotype tree in the first embodiment;
  • Fig. 3 is a structural block diagram of the single-gene disease name recommendation system based on clinical features and sequence variation in the second embodiment;
  • Fig. 4 is an example diagram of the environment architecture in which the single-gene disease name recommendation method based on clinical features and sequence variation is applied in the fourth embodiment.
  • This embodiment provides a method for recommending names of single-gene diseases based on clinical features and sequence variations, including:
  • obtaining the patient's case information, the case information including the gene sequence, feature set I, and the single-gene disease name; comparing the gene sequence with the human reference genome to obtain comparison data, and obtaining the impact score of each genetic variation from the comparison data; traversing the feature set A corresponding to each standard single-gene disease name in the feature relational database, calculating the set similarity value of each feature set A and feature set I, outputting the similar standard single-gene disease names and corresponding genes as candidates in descending order of similarity value, and summarizing the candidate standard single-gene disease names to construct the standard single-gene disease name set P; obtaining multiple genes corresponding to the single-gene disease name from the preset gene list file, calculating the pathogenicity score of each gene based on the impact score of each genetic variation in the gene, the inheritance mode of the genetic variation, the relevance to known diseases, and the similarity value corresponding to the gene, outputting the corresponding standard single-gene disease names as candidates in descending order of pathogenicity score, and summarizing the candidate standard single-gene disease names to construct the standard single-gene disease name set G; and outputting the recommended standard single-gene disease names based on the intersection of set G and set P and on the candidate output order of the standard single-gene disease names.
  • first, a piece of patient case information including the gene sequence, feature set I, and single-gene disease names is obtained; single-gene disease names are then recommended by phenotype-assisted diagnosis based on feature set I, and by gene-assisted diagnosis based on the gene sequence and the single-gene disease names.
  • Finally, based on the intersection of the phenotype-assisted and gene-assisted recommendations, the final recommended standard single-gene disease names are output for the patient.
  • the step of traversing the feature set A corresponding to each standard single-gene disease name in the feature relational database, calculating the set similarity value of each feature set A and feature set I, and outputting the similar standard single-gene disease names and corresponding genes as candidates in descending order of similarity value also includes:
  • translating the foreign-language information in the feature relational database into Chinese with reference to the Chinese Human Phenotype Standard Phrase Alliance, so that Chinese-language medical record data can be identified and matched.
  • the public database is the MedGen database
  • the literature database is the PubMed database.
  • the feature relational database includes matched standard monogenic disease names, foreign-language clinical features, the identifiers of those clinical features in the Human Phenotype Ontology (HPO IDs), and Chinese clinical features. This embodiment can provide clues and theoretical support for the clinical diagnosis and differentiation of monogenic diseases, and data support for further narrowing the scope of genetic testing.
  • the clinical feature relational database established in this example covers more than 8,600 monogenic diseases, more than 11,000 phenotypic clinical features of monogenic diseases, and more than 90,000 relationships between phenotypes and clinical features, and incorporates the latest database versions and literature reports on monogenic disease research.
  • k is the correction factor, with k > 1, and the feature relational database is used as a reference database.
  • Feature set I is the set of clinical feature information.
  • Feature set I can be standardized in two ways through visualization tools. In the first way, the user enters keywords, each keyword being equivalent to a clinical feature; instant search provides a drop-down menu of related standardized phenotype information from which the user can choose, so that standardized clinical feature information is entered.
  • In the second way, the user enters the related standardized clinical feature information directly by clicking on the phenotype tree.
  • the standardized clinical feature phenotype tree consists of multiple stem nodes and at least one branch node associated with each stem node.
  • Each branch node is used to represent a standardized clinical feature
  • each stem node is used to represent an index of the associated standardized clinical feature.
  • HPO refers to the hp.obo file.
  • the method of traversing the feature set A corresponding to each standard single-gene disease name in the feature relational database, calculating the set similarity value of each feature set A and feature set I, and outputting the similar standard single-gene disease names and corresponding genes as candidates in descending order of similarity value includes:
  • matching, from feature set A, the best standard clinical feature corresponding to each clinical feature in feature set I; calculating the set similarity value between feature set I and the current feature set A from the similarity value between each clinical feature and its best standard clinical feature; letting n = n + 1 and traversing the n-th standard single-gene disease name in the feature relational database again, until all standard single-gene disease names in the feature relational database have been traversed; and summarizing and sorting the set similarity values of feature set I with each feature set A.
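The traversal described above can be sketched in Python as a loop over the database followed by a descending sort. The dictionary layout of the database and the function names are hypothetical, and `overlap_count` is only a stand-in for the tree-based set similarity of the text, used so that the sketch runs on its own.

```python
from typing import Callable, Dict, List, Set, Tuple


def overlap_count(features_i: Set[str], features_a: Set[str]) -> float:
    """Stand-in similarity (number of shared features); not the patent's measure."""
    return float(len(features_i & features_a))


def rank_diseases(feature_db: Dict[str, Set[str]],
                  features_i: Set[str],
                  set_similarity: Callable[[Set[str], Set[str]], float]
                  ) -> List[Tuple[str, float]]:
    """Score every disease's feature set A against feature set I, best first."""
    scored = [(name, set_similarity(features_i, features_a))
              for name, features_a in feature_db.items()]  # n = 1, 2, ... over all names
    scored.sort(key=lambda item: item[1], reverse=True)    # descending similarity
    return scored


if __name__ == "__main__":
    db = {"D1": {"a", "b"}, "D2": {"a", "b", "c"}, "D3": {"x"}}
    print(rank_diseases(db, {"a", "b", "c"}, overlap_count))
```

Passing the similarity as a function keeps the ranking loop independent of how individual features are compared.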
  • the method for selecting the standard clinical feature with the highest similarity to the i-th clinical feature from feature set A includes:
  • the length of the directed set IB is the number of nodes in its path, $L_{IB}$;
  • the length of the directed set AB is the number of nodes in its path, $L_{AB}$; the intersection IAB of the directed set IB and the directed set AB is extracted;
  • the length of the intersection IAB is the number of common nodes in the path, $L_{IAB}$;
  • SM represents the similarity value between the j-th standard clinical feature and the i-th clinical feature across multiple levels of the phenotype tree; SI represents the similarity value between the j-th standard clinical feature and the i-th clinical feature at the same level of the phenotype tree; $\omega$ is the weight coefficient.
  • let $B_t$ be the stem node shared by the i-th clinical feature $I_i$ and the j-th standard clinical feature $A_j$.
  • the calculation method is as follows: all nodes on the connecting path between $I_i$ and $B_t$ form a directed set IB; the number of elements in IB is denoted $N_{IB}$, and the length of IB, defined as the number of nodes on the path, is $L_{IB} = N_{IB}$.
  • all nodes on the connecting path between $A_j$ and $B_t$ form a directed set AB; the number of elements in AB is denoted $N_{AB}$, and the length of AB is $L_{AB} = N_{AB}$.
  • the intersection of the directed set IB and the directed set AB is denoted IAB; the number of elements in IAB is denoted $N_{IAB}$, and the length of IAB, defined as the number of nodes on the common path, is $L_{IAB} = N_{IAB}$.
  • $SM = L_{IAB} / \max(L_{AB}, L_{IB})$
  • $SI = 1 / (L_{AB} + L_{IB} - 2L_{IAB} + 1)$
  • $\omega$ is the weighting coefficient, $\omega \in (0, 1)$;
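The SM and SI definitions above can be tried out with a small Python sketch. Several points are assumptions rather than statements from the text: the phenotype tree is modelled as a child-to-parent map with one parent per node, a path includes both the feature node and the stem node, features under different stem nodes score 0, and the two terms are combined as a weighted sum `omega * SM + (1 - omega) * SI` (the combining formula itself is not legible in the source). All function names are hypothetical.

```python
from typing import Dict, List, Optional


def path_to_stem(node: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Nodes from `node` up to its stem node, both endpoints included."""
    path = [node]
    while parent.get(node) is not None:
        node = parent[node]
        path.append(node)
    return path


def node_similarity(i_node: str, a_node: str,
                    parent: Dict[str, Optional[str]],
                    omega: float = 0.5) -> float:
    """Combine SM (multi-level) and SI (same-level) similarity of two tree nodes."""
    ib = path_to_stem(i_node, parent)   # directed set IB
    ab = path_to_stem(a_node, parent)   # directed set AB
    if ib[-1] != ab[-1]:
        return 0.0                      # assumption: no shared stem node, no similarity
    l_ib, l_ab = len(ib), len(ab)
    l_iab = len(set(ib) & set(ab))      # nodes on the common path (IAB)
    sm = l_iab / max(l_ab, l_ib)
    si = 1.0 / (l_ab + l_ib - 2 * l_iab + 1)
    return omega * sm + (1.0 - omega) * si  # assumed combination rule


if __name__ == "__main__":
    # Toy tree: stem B has children X and Y; X has children I1 and A1.
    tree = {"B": None, "X": "B", "Y": "B", "I1": "X", "A1": "X"}
    print(node_similarity("I1", "A1", tree))  # siblings under X
    print(node_similarity("I1", "Y", tree))   # more distant pair
```

With these definitions, identical nodes give SM = SI = 1, and the score falls as the two features diverge closer to the stem node.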
  • the method of calculating the set similarity value between the feature set I and the current feature set A according to the similarity value between each clinical feature and the corresponding best standard clinical feature in the foregoing embodiment includes:
  • for each clinical feature $I_i$, the standard clinical feature $A_j$ with the greatest similarity can be found in feature set A; that is, each clinical feature $I_i$ obtains a similarity value with feature set A, $S_{I_i A} = \max_j S(I_i, A_j)$.
  • the similarity value of feature set I and feature set A is defined as the sum of the similarities between each clinical feature $I_i$ in feature set I and feature set A, calculated as $S_{IA} = \sum_i S_{I_i A}$, where $S_{IA}$ represents the similarity value between feature set I and feature set A.
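A short Python sketch of the best-match-and-sum definition above: for each clinical feature in feature set I it takes the highest pairwise similarity against feature set A and adds these maxima. The pairwise similarity is passed in as a function; the lookup table in the example is made-up data for illustration, and returning 0 for an empty feature set A is an assumption.

```python
from typing import Callable, Iterable


def set_similarity(features_i: Iterable[str],
                   features_a: Iterable[str],
                   pair_sim: Callable[[str, str], float]) -> float:
    """Sum over I_i of max over A_j of pair_sim(I_i, A_j)."""
    features_a = list(features_a)
    if not features_a:
        return 0.0  # assumption: nothing to match against
    return sum(max(pair_sim(i, a) for a in features_a) for i in features_i)


if __name__ == "__main__":
    # Made-up pairwise similarities, for illustration only.
    table = {("i1", "a1"): 0.5, ("i1", "a2"): 0.9,
             ("i2", "a1"): 0.3, ("i2", "a2"): 0.1}
    sim = lambda i, a: table.get((i, a), 0.0)
    print(set_similarity(["i1", "i2"], ["a1", "a2"], sim))
```

Because the result is a plain sum, it grows with the number of features entered for the patient.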
  • the above-mentioned embodiment adopts the multi-level structure similarity algorithm, which has the characteristics of high accuracy in recommending standard single-gene disease names.
  • the method of comparing the gene sequence with the human reference genome to obtain the comparison data, and obtaining the impact score of each genetic variation according to the comparison data includes:
  • when the gene detection mode is the single-sample detection mode, the gene sequence is one set of gene sequences from the person to be tested; when the gene detection mode is the family detection mode, the gene sequences are one set from the person to be tested and at least one set from that person's immediate family members; each set of gene sequences is compared with the human reference genome to obtain a corresponding number of sets of comparison data; the length information, location information, and base change information of each genetic variation are obtained from each set of comparison data, the mutation type is identified from the length information, and the mutation function is predicted from the location information and base change information.
  • the mutation type includes SNP variation and indel variation, and the mutation function is classified as harmful, mildly harmful, or basically harmless; according to the mutation type identified for each genetic variation, the gene in which the variation is located and its population frequency are annotated, and the family inheritance mode is determined in the family detection mode.
  • based on the length information, location information, population frequency, predicted mutation function, and family inheritance mode of each genetic variation, the clinical significance of the variation is classified.
  • the clinical significance classification has five levels: pathogenic, likely pathogenic, uncertain significance, likely benign, and benign.
  • the impact score of each genetic variation in the gene is calculated according to one or more of its clinical significance level, population frequency, clarity of the disease-causing site, and predicted mutation function.
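The length-based identification of the mutation type mentioned above can be illustrated with a minimal Python sketch. It assumes variants are given as VCF-style reference and alternate allele strings, which the text does not state; the function name and the `"other"` fallback for equal-length multi-base substitutions are hypothetical.

```python
def variant_type(ref: str, alt: str) -> str:
    """Classify a variant from allele lengths (VCF-style REF/ALT assumed)."""
    if not ref or not alt:
        raise ValueError("REF and ALT alleles must be non-empty")
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"      # single-base substitution
    if len(ref) != len(alt):
        return "indel"    # insertion or deletion changes the length
    return "other"        # e.g. multi-nucleotide substitution, not covered by the text


if __name__ == "__main__":
    for ref, alt in [("A", "G"), ("AT", "A"), ("A", "ATG"), ("AT", "GC")]:
        print(ref, alt, variant_type(ref, alt))
```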
  • the above embodiment has two gene detection modes.
  • when the gene detection mode is the single-sample detection mode, one set of gene sequences from the person to be tested must be obtained; when the gene detection mode is the family detection mode, one set of gene sequences from the person to be tested and at least one set from that person's immediate relatives must be obtained.
  • first, all genetic variants in the gene sequence are given an impact score; the relevant genes are then obtained from the name of the patient's single-gene disease, the genetic variants in those genes are matched against the variants scored above, and the impact scores of the genetic variants in the relevant genes are obtained.
  • there are many ways to obtain gene sequences.
  • for example, users can import high-throughput sequencing gene sequences through a web interface.
  • the gene sequence data are in gzip-compressed FASTQ format.
  • common import methods are import from a local computer and import through an FTP client. During data import, the integrity of the gene sequence is checked, and a reminder is given for incomplete gene sequence data.
  • the attribute tag information includes file name, sample number, platform, family number, individual number, father number, mother number, gender, phenotype, age, race, place of residence, hometown, disease name, clinical features, medical history data, inheritance mode, and so on.
  • the quality inspection indicators include: total number of sequences, sequence length, per-base quality, per-sequence quality, per-base content, GC content, per-base N content, sequence length distribution, duplicated sequences, overrepresented sequences, adapter sequences, k-mer content, etc.
  • the method for checking the gene sequence in this step is a technical method commonly used by those skilled in the art, and will not be repeated here.
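A few of the quality indicators listed above (total number of sequences, GC content, N content, and sequence length distribution) can be computed with a short Python sketch. This is not the tool used in the embodiment; it assumes well-formed FASTQ with exactly four lines per record, and the function names are hypothetical.

```python
import gzip
from collections import Counter
from typing import Any, Dict, Iterable


def fastq_metrics(lines: Iterable[str]) -> Dict[str, Any]:
    """Basic QC metrics from FASTQ text lines (4 lines per record assumed)."""
    total = bases = gc = n_bases = 0
    lengths: Counter = Counter()
    for idx, line in enumerate(lines):
        if idx % 4 != 1:          # the sequence is the second line of each record
            continue
        seq = line.strip().upper()
        total += 1
        bases += len(seq)
        gc += seq.count("G") + seq.count("C")
        n_bases += seq.count("N")
        lengths[len(seq)] += 1
    return {
        "total_sequences": total,
        "gc_fraction": gc / bases if bases else 0.0,
        "n_fraction": n_bases / bases if bases else 0.0,
        "length_distribution": dict(lengths),
    }


def fastq_gz_metrics(path: str) -> Dict[str, Any]:
    """Same metrics, read from a gzip-compressed FASTQ file."""
    with gzip.open(path, "rt") as handle:
        return fastq_metrics(handle)


if __name__ == "__main__":
    demo = ["@r1", "ACGT", "+", "IIII", "@r2", "GGNN", "+", "IIII"]
    print(fastq_metrics(demo))
```

A truncated upload would typically show up as a record count that is not a whole number of four-line records, which is one simple integrity check that could be added.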
  • the method of sequentially comparing each set of gene sequence data with a human reference genome to obtain a corresponding amount of comparison data includes:
  • the content of the comparison data includes the alignment position of the sequence on the chromosome, the alignment quality, the alignment position of the paired sequence on the chromosome, the insert length, the base composition of the sequence, or the quality of the sequence.
  • the methods for obtaining multiple sets of comparison data after sequentially performing deduplication, indel region correction, and base quality correction operations on the comparison results of each group include:
  • a summary analysis of the comparison data can be performed.
  • the content of the summary analysis includes the quality of the comparison data, the number of raw paired-end sequencing reads, the number of reads aligned to the human reference genome, the average read length, the proportion of indels, and whether the forward and reverse strands are balanced.
  • the sequence coverage of the targeted region can be examined to obtain information such as the genome length, the length of the targeted region, the total number of reads, the number of reads in the targeted region, the number of reads in the non-targeted region, the proportion of reads in the targeted region, and the average sequencing depth of the targeted region.
  • the method of obtaining the length information, location information, and base change information of each genetic variation from each set of comparison data, identifying the mutation type from the length information, and predicting the mutation function from the location information and base change information includes:
  • the Haplotyper Caller algorithm is used to identify each genetic variation as an SNP variation or an indel variation based on its length information in each set of comparison data; when the genetic variation is a missense mutation, SIFT software or Polyphen2 software is used to predict its mutation function; when the genetic variation is a splice site variation, HSF software is used to predict its mutation function.
  • missense mutation is a form of single-nucleotide mutation, which means that the codon encoding an amino acid is changed to a codon encoding another amino acid after a base substitution, so that the amino acid type and sequence of the polypeptide chain are changed.
  • SIFT software can be used to predict whether an amino acid substitution affects protein function, and the prediction for the amino acid change caused by a genetic variation is given as a normalized score.
  • the score range is [0,1]; the lower the score, the greater the harm. Generally, a score ≤ 0.05 is reported as Deleterious and a score > 0.05 as Tolerated. Polyphen2 software, which integrates protein sequence and three-dimensional protein structure features, can also be used.
  • the normalized score range of Polyphen2 is [0,1]; the higher the score, the more likely the variant is to damage protein function. A score of 0.957–1 corresponds to a prediction of probably damaging, a score of 0.453–0.956 to possibly damaging, and a score of 0–0.452 to benign. In addition, a splice site mutation is a mutation that occurs in the region of a gene splice site and may affect the splicing of mRNA.
  • the HSF software can predict whether the mutation will cause a change in splicing; if it does, the mutation is reported as Deleterious, otherwise as Tolerated. The above scoring and function prediction methods are existing methods in the art and are not described further in this embodiment.
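The score thresholds described above can be written as two small Python helpers. These only interpret scores that SIFT or Polyphen2 have already produced; they do not call either tool. The function names are hypothetical, and treating the SIFT cutoff as inclusive (a score of exactly 0.05 counts as deleterious) and the Polyphen2 ranges as lower bounds of 0.957 and 0.453 are assumptions about boundary handling.

```python
def sift_call(score: float) -> str:
    """Interpret a SIFT score in [0, 1]; lower scores are more harmful."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("SIFT score must be in [0, 1]")
    return "deleterious" if score <= 0.05 else "tolerated"


def polyphen2_call(score: float) -> str:
    """Interpret a Polyphen2 score in [0, 1]; higher scores are more harmful."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("Polyphen2 score must be in [0, 1]")
    if score >= 0.957:
        return "probably damaging"
    if score >= 0.453:
        return "possibly damaging"
    return "benign"


if __name__ == "__main__":
    print(sift_call(0.01), sift_call(0.30))
    print(polyphen2_call(0.99), polyphen2_call(0.60), polyphen2_call(0.10))
```

Note that the two scales run in opposite directions, so the scores should not be averaged or compared directly.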
  • the method of annotating the gene in which each genetic variation is located and its population frequency according to the identified mutation type, and of determining the family inheritance mode in the family detection mode, includes:
  • transcripts refer to the NCBI RefSeq transcript database.
  • the transcript containing the most exons is used for annotation.
  • the population frequency information comes from the 1000 Genomes (1000genomes), ESP, and gnomAD databases.
  • if the gene detection mode is the family detection mode, the family inheritance mode must also be determined by analyzing the position information of the genetic variation in each set of comparison data. When the sites of the genetic variation in the sets of comparison data are related, the variation is judged to be family-inherited; otherwise, it is judged not to be family-inherited. If the gene detection mode is the single-sample detection mode, this determination is not needed. Family inheritance can be identified automatically by analyzing multiple sets of gene sequence data with existing instruments, which is not described in detail in this embodiment.
  • In the above embodiment, the method for grading the clinical significance of a genetic variant based on its length information, position information, population frequency, predicted variant function or family inheritance mode includes:
  • The variant is graded with reference to the standards and guidelines for the clinical significance of variants proposed by the American College of Medical Genetics and Genomics (ACMG) and the Association for Molecular Pathology (AMP). The evidence used in ACMG pathogenicity grading includes:
  • PVS1 A null (loss-of-function) variant in a disease whose pathogenic mechanism is loss of function (LOF).
  • PS1 The same amino acid change as a variant previously established as pathogenic.
  • PS2 A de novo variant in a patient with no family history.
  • PS3 In vivo or in vitro functional studies have established that the variant impairs gene function.
  • PS4 The frequency of the variant in affected individuals is significantly higher than in controls.
  • PM1 Located in a mutational hotspot and/or in a critical functional domain known to have no benign variants.
  • PM2 Not found in normal controls in the ESP, 1000 Genomes and ExAC databases.
  • PM3 For a recessive disease, detected in trans with a pathogenic variant.
  • PM4 Protein length change caused by an in-frame insertion/deletion in a non-repeat region or by loss of a stop codon.
  • PM5 A novel missense variant that has not been reported before, at a site where a variant causing a different amino acid change has been confirmed as pathogenic.
  • PM6 A de novo variant not verified with parental samples.
  • PP1 The variant co-segregates with the disease in the family (detected in multiple affected family members).
  • PP2 A novel missense variant in a gene in which missense variants are a cause of disease and benign variants make up only a small proportion.
  • PP3 Multiple statistical methods predict that the variant has a deleterious effect on the gene or gene product, including conservation prediction, evolutionary prediction and splice-site impact.
  • PP4 The phenotype or family history of the variant carrier is highly consistent with a particular single-gene disease.
  • PP5 A report from a reputable source considers the variant pathogenic, but the evidence is not sufficient for independent laboratory evaluation.
  • BA1 Allele frequency > 5% in the ESP, 1000 Genomes or ExAC databases.
  • BS1 Allele frequency greater than the disease incidence.
  • BS2 For a fully penetrant early-onset disease, the variant is found in a healthy adult (homozygous for a recessive disease, heterozygous for a dominant disease, or hemizygous for an X-linked disease).
  • BS3 In vivo or in vitro studies confirm that the variant has no effect on protein function or splicing.
  • BS4 Lack of co-segregation among family members.
  • BP1 A missense variant in a gene for which truncating variants are the known cause of disease.
  • BP2 In a dominant disease, found together with a known pathogenic variant of the same gene on the other chromosome; or, in a disease of any inheritance mode, found together with a known pathogenic variant of the same gene on the same chromosome.
  • BP3 A deletion/insertion in a repeat region of unknown function that does not change the coding frame.
  • BP4 Multiple statistical methods predict that the variant has no effect on the gene or gene product, including conservation prediction, evolutionary prediction and splice-site impact.
  • BP5 A variant found in a case that already has another molecular cause of disease.
  • BP6 A report from a reputable source considers the variant benign, but the evidence is not sufficient to support this.
  • BP7 A synonymous variant predicted not to affect splicing.
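  • The combined rules for grading genetic variants include:
  • Pathogenic, any of i, ii or iii: (i) 1 very strong item (PVS1) together with one or more strong items (PS1-PS4), two or more moderate items (PM1-PM6), 1 moderate and 1 supporting item (PP1-PP5), or ≥2 supporting items; (ii) ≥2 strong items (PS1-PS4); (iii) 1 strong item together with ≥3 moderate items, 2 moderate and ≥2 supporting items, or 1 moderate and ≥4 supporting items.
  • Likely pathogenic, any of i to vi: (i) PVS1 and 1 moderate item; (ii) 1 strong item and 1-2 moderate items; (iii) 1 strong item and ≥2 supporting items; (iv) ≥3 moderate items; (v) 2 moderate and ≥2 supporting items; (vi) 1 moderate and ≥4 supporting items.
  • Benign, either i or ii: (i) 1 stand-alone item (BA1); (ii) ≥2 strong items (BS1-BS4).
  • Likely benign, either i or ii: (i) 1 strong item (BS1-BS4) and 1 supporting item (BP1-BP7); (ii) ≥2 supporting items (BP1-BP7).
  • Uncertain significance: the above criteria are not met, or the benign and pathogenic criteria contradict each other.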
  • Optionally, to ensure that the variant data are valid, some genetic variants may be filtered out under three conditions: first, intronic variants (intron_variant), intergenic variants (intergenic_variant), upstream gene variants (upstream_gene_variant) and downstream gene variants (downstream_gene_variant); second, variant sites with a population frequency greater than 0.1; third, genetic variants that fail quality assessment.
  • In the above embodiment, the method for calculating the impact score of each genetic variant in a gene includes:
  • A value is assigned to each item of evidence for a genetic variant; the evidence includes the clinical significance grade, the population frequency, whether the site is a definite pathogenic locus, the predicted variant function, whether the variant is recorded in a database, and so on;
  • A PolyPhen2 prediction of possibly damaging (possible damage) is assigned 0.5 points, and a prediction of benign is assigned -1 point; if the HSF software predicts that the variant affects splicing, 2 points are added, and if it predicts no effect on splicing, 0 points are added; the clinical significance grade is assigned 3 points for pathogenic, 2 points for likely pathogenic, 1 point for uncertain significance, -2 points for likely benign and -3 points for benign; the databases include the ClinVar database, the UniProt database or a local database, and 1 point is added when the variant is recorded in any of them; 5 points are assigned when the variant site is a definite pathogenic locus.
  • For example, Score_v = S_c + S_p + S_vip + S_sift + S_pph2 + S_HSF, where S_c is the score for the clinical significance grade, S_p is the score for the population frequency, S_vip is the score for a definite pathogenic locus, S_sift is the score for the variant function predicted by the SIFT software, S_pph2 is the score for the variant function predicted by the Polyphen2 software, and S_HSF is the score for the variant function predicted by the HSF software.
  • In the above embodiment, multiple genes corresponding to the single-gene disease name are obtained from a preset gene list file, and the pathogenicity score of each gene is calculated from the impact scores of the genetic variants in the gene, the inheritance mode of the variants, the association with known disease and the similarity value corresponding to the gene. The method includes:
  • The pathogenicity score of each gene is calculated with the formula Score_g = max(Score_v) + w_e·S_e + w_t·S_t + w_MLS·S_MLS, where max(Score_v) is the maximum impact score among the genetic variants in the gene, S_e is the value assigned for the gene's association with known disease, S_t is the value assigned for the inheritance mode of the variant, S_MLS is the similarity value corresponding to the gene, and w_e, w_t and w_MLS are the weights of S_e, S_t and S_MLS respectively.
  • S_MLS is the largest of the similarity values between the single-gene disease name corresponding to the gene and the standard single-gene disease names in the feature relation database. The default values of w_e and w_t are both 1; the default value of w_MLS is 2, with a value range of 1–5.
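The gene-level formula can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and field names are hypothetical, and the values 10 and 5 used for S_e and S_t are the assignments the description gives for a known disease gene and for family inheritance.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GeneEvidence:
    """Inputs for one gene; the field names are illustrative, not from the source."""
    variant_scores: List[float]   # Score_v of every variant found in the gene
    known_disease_gene: bool      # gene is already associated with the disease
    family_inheritance: bool      # variant judged to be inherited within the family
    s_mls: float                  # S_MLS, best phenotype similarity for the gene


def gene_pathogenicity_score(ev: GeneEvidence,
                             w_e: float = 1.0,
                             w_t: float = 1.0,
                             w_mls: float = 2.0) -> float:
    """Score_g = max(Score_v) + w_e*S_e + w_t*S_t + w_MLS*S_MLS."""
    if not ev.variant_scores:
        raise ValueError("gene has no scored variants")
    s_e = 10.0 if ev.known_disease_gene else 0.0   # 10 points for a known disease gene
    s_t = 5.0 if ev.family_inheritance else 0.0    # 5 points for family inheritance
    return max(ev.variant_scores) + w_e * s_e + w_t * s_t + w_mls * ev.s_mls
```

With the default weights, a known disease gene whose best variant scores 7.5 and whose S_MLS is 0.5 receives 7.5 + 10 + 0 + 2 × 0.5 = 18.5.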
  • a genetic analysis and interpretation report is automatically generated.
  • the content of the genetic analysis and interpretation report includes: individual information of genetic sequence data, the results of genetic analysis and interpretation, and the clinical characteristics of related monogenic diseases.
  • Individual information includes: sample number, name, gender, age, native place, place of residence, disease diagnosis, disease description and other information.
  • the results of genetic analysis and interpretation include: physical location of disease-causing mutations, gene names, DNA changes, amino acid changes, frequency of East Asian populations, clinical significance grades, disease and family inheritance patterns.
  • the recommended results of the standard single-gene disease names are output.
  • When the intersection of standard single-gene disease name set G and standard single-gene disease name set P is empty, the names recommended by genetics-assisted diagnosis and by phenotype-assisted diagnosis are completely inconsistent, and no recommended standard single-gene disease name is output. When the intersection contains exactly one name, the two diagnoses share one recommended name, and that single standard single-gene disease name is output.
  • When the intersection contains several names, the names recommended by genetics-assisted diagnosis and by phenotype-assisted diagnosis partly agree, and the several standard single-gene disease names are output in their candidate output order.
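The three intersection cases reduce to one ordered set intersection. The sketch below is a minimal illustration, not the patent's implementation; the function name is hypothetical, and it assumes that "candidate output order" means the descending pathogenicity-score order of set G, which the text does not state explicitly.

```python
from typing import List, Sequence


def recommend_names(g_ranked: Sequence[str], p_ranked: Sequence[str]) -> List[str]:
    """Return the names present in both candidate lists.

    g_ranked: names from genetics-assisted diagnosis, best score first.
    p_ranked: names from phenotype-assisted diagnosis, best similarity first.
    The result keeps the order of g_ranked (an assumption, see lead-in).
    """
    p_names = set(p_ranked)
    seen = set()
    result = []
    for name in g_ranked:
        if name in p_names and name not in seen:
            seen.add(name)
            result.append(name)
    return result   # empty list: no recommendation is output
```

An empty result corresponds to the case where nothing is output, a one-element result to the unique recommendation, and a longer result to the multi-name case.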
  • the method further includes:
  • the blacklist method is used to filter out the standard single-gene disease names corresponding to the false positive mutation sites.
  • the blacklisted sites come from inside the laboratory and are false positive mutation sites for high-throughput sequencing.
  • this embodiment provides a single gene disease name recommendation system based on clinical characteristics and sequence variation, including:
  • the input unit is used to obtain the patient's case information, the case information including the gene sequence, feature set I and the name of the single gene disease;
  • the sequence comparison unit is used to compare the gene sequence with the human reference genome to obtain comparison data, and obtain the impact score of each genetic variation according to the comparison data;
  • the phenotypic auxiliary diagnosis unit is used to traverse the feature set A corresponding to each standard single-gene disease name in the feature relation database, calculate the set similarity value between each feature set A and feature set I, output the similar standard single-gene disease names and their corresponding genes as candidates in descending order of similarity value, and collect the candidate standard single-gene disease names to construct a standard single-gene disease name set P;
  • the genetic assistant diagnosis unit is used to obtain multiple genes corresponding to the single-gene disease name from a preset gene list file, calculate the pathogenicity score of each gene from the impact scores of the genetic variants in the gene, the inheritance mode of the variants, the association with known disease and the similarity value corresponding to the gene, output the corresponding standard single-gene disease names as candidates in descending order of pathogenicity score, and collect the candidate standard single-gene disease names to construct a standard single-gene disease name set G;
  • the recommended output unit based on the intersection result of the standard single-gene disease name set G and the standard single-gene disease name set P, and the candidate output order of the standard single-gene disease names, output the recommended results of the standard single-gene disease names.
  • the aforementioned single-gene disease name recommendation system is applied to a computer device that includes a processor and a memory connected through a system bus.
  • the processor of the single gene disease name recommendation system is used to provide calculation and control capabilities.
  • the memory of the single gene disease name recommendation system includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system and computer readable instructions.
  • the internal memory provides an environment for the operation of the operating system and computer-readable instructions in the non-volatile storage medium.
  • the network interface of the single gene disease name recommendation system is used to communicate with external sensors.
  • The steps of the above method for recommending single-gene disease names based on clinical features and sequence variation are realized, for example, by using the above input unit, sequence comparison unit, phenotypic auxiliary diagnosis unit, genetic assistant diagnosis unit and recommended output unit.
  • The beneficial effects of the single-gene disease name recommendation system provided in this embodiment are the same as those of the recommendation method provided in the first embodiment above, and are not repeated here.
  • This embodiment provides a computer-readable storage medium, for example a non-volatile computer-readable storage medium, on which computer-readable instructions are stored; when run by a processor, the instructions execute the steps of the above method for recommending single-gene disease names based on clinical features and sequence variation.
  • the beneficial effects of the computer-readable storage medium provided in this embodiment are the same as those of the single-gene disease name recommendation method based on clinical features and sequence variation provided by the above technical solutions, and will not be repeated here.
  • FIG. 4 provides a schematic diagram of an environment architecture of an application scenario.
  • An application software can be developed to implement the single gene disease name recommendation method based on clinical features and sequence mutations in the foregoing embodiment, and the application software can be installed in a user terminal, and the user terminal is connected to the server to realize communication.
  • the user terminal may be any smart device such as a computer or a tablet computer, and this embodiment only uses a computer as an example for description.
  • The case information, which includes the gene sequence, feature set I and the single-gene disease name, is entered in the application.
  • The application on the computer sends the gene sequence to the sequence comparison unit, feature set I to the phenotypic auxiliary diagnosis unit, and the single-gene disease name to the genetic assistant diagnosis unit.
  • The sequence comparison unit, the phenotypic auxiliary diagnosis unit and the genetic assistant diagnosis unit can be implemented by the server.
  • The phenotypic auxiliary diagnosis unit uses the multi-level structure similarity algorithm to traverse the feature relation database and calculate the similarity value between feature set I and the feature set A of each standard single-gene disease name, and constructs the standard single-gene disease name set P. The genetic assistant diagnosis unit obtains the multiple genes corresponding to the single-gene disease name from the preset gene list file, calculates the pathogenicity score of each gene with the pathogenicity scoring algorithm, and constructs the standard single-gene disease name set G. Finally, the recommended output unit, for example a display, outputs the recommended standard single-gene disease names based on the intersection of set G and set P and the candidate output order of the names.
  • the above-mentioned inventive method can be implemented by a program instructing relevant hardware.
  • the above-mentioned program can be stored in a computer readable storage medium.
  • the storage medium of the program may be: ROM/RAM, magnetic disk, optical disk, memory card, etc.


Abstract

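A method and system for recommending single-gene disease names based on clinical features and sequence variation, capable of accurately recommending the single-gene disease names that match a patient's condition. The method includes: obtaining the patient's case information; aligning the gene sequence with the human reference genome to obtain an impact score for each genetic variant; traversing the feature set A corresponding to each standard single-gene disease name in a feature relation database, calculating the set similarity value against each feature set A, outputting the similar standard single-gene disease names and their corresponding genes as candidates in descending order, and constructing a standard single-gene disease name set P; obtaining from a preset gene list file the multiple genes corresponding to the single-gene disease name, calculating the pathogenicity score of each gene, outputting the corresponding standard single-gene disease names as candidates in descending order, and constructing a standard single-gene disease name set G; and outputting the recommended standard single-gene disease names based on the intersection of set G and set P.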

Description

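Method and system for recommending single-gene disease names based on clinical features and sequence variation
Technical Field
The present invention relates to the field of medical information technology, and in particular to a method and system for recommending single-gene disease names based on clinical features and sequence variation.
Background
A single-gene disease is a common kind of disease caused by mutation of a pair of alleles, also called a Mendelian disorder. Its characteristics are as follows:
1. Single-gene diseases are numerous: more than 8,000 have been identified so far;
2. Their phenotypes are complex: the same single-gene disease shows strong phenotypic heterogeneity, and the clinical features of different single-gene diseases overlap;
3. Their inheritance modes are diverse: the same single-gene disease may show different inheritance modes, and different single-gene diseases may show the same inheritance mode;
4. Most single-gene diseases have a very low incidence and are rare.
These complicating factors make it hard for clinicians to know the phenotypes of all single-gene diseases, which greatly complicates their clinical diagnosis and treatment.
Summary of the Invention
The object of the present invention is to provide a method and system for recommending single-gene disease names based on clinical features and sequence variation that can accurately recommend the single-gene disease names matching a patient's condition.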
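To achieve the above object, one aspect of the present invention provides a method for recommending single-gene disease names based on clinical features and sequence variation, including:
obtaining the patient's case information, the case information including a gene sequence, a feature set I and a single-gene disease name;
aligning the gene sequence with the human reference genome to obtain alignment data, and obtaining an impact score for each genetic variant from the alignment data;
traversing the feature set A corresponding to each standard single-gene disease name in a feature relation database, calculating the set similarity value between each feature set A and feature set I, outputting the similar standard single-gene disease names and their corresponding genes as candidates in descending order of similarity value, and collecting the candidate standard single-gene disease names to construct a standard single-gene disease name set P;
obtaining from a preset gene list file the multiple genes corresponding to the single-gene disease name, calculating the pathogenicity score of each gene from the impact scores of the genetic variants in the gene, the inheritance mode of the variants, the association with known disease and the similarity value corresponding to the gene, outputting the corresponding standard single-gene disease names as candidates in descending order of pathogenicity score, and collecting the candidate standard single-gene disease names to construct a standard single-gene disease name set G;
outputting the recommended standard single-gene disease names based on the intersection of set G and set P and the candidate output order of the standard single-gene disease names.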
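Preferably, before the step of traversing the feature set A corresponding to each standard single-gene disease name in the feature relation database, calculating the set similarity value between each feature set A and feature set I, and outputting the similar standard single-gene disease names and their corresponding genes as candidates in descending order of similarity value, the method further includes:
obtaining known standard single-gene disease names and their corresponding standard clinical features from public databases and literature databases of single-gene diseases;
building, from the known standard single-gene disease names and their standard clinical features, a feature relation database relating standard single-gene disease names to standard clinical features;
calculating, for each standard single-gene disease name, the contribution c_i of each of its standard clinical features to that disease;
obtaining data from the feature relation database and building a standardized clinical-feature phenotype tree of single-gene diseases based on HPO;
the standardized clinical-feature phenotype tree consists of multiple trunk nodes and at least one branch node associated with each trunk node; each branch node represents one standardized clinical feature, and each trunk node represents the index of its associated standardized clinical features.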
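Preferably, the method of traversing the feature set A corresponding to each standard single-gene disease name in the feature relation database, calculating the set similarity value between each feature set A and feature set I, and outputting the similar standard single-gene disease names and their corresponding genes as candidates in descending order of similarity value includes:
marking the nodes of the clinical features in feature set I on the standardized clinical-feature phenotype tree;
traversing the n-th standard single-gene disease name in the feature relation database and marking the nodes of the standard clinical features in its feature set A on the standardized clinical-feature phenotype tree, the initial value of n being 1;
matching, based on the node marks on the phenotype tree, the best standard clinical feature in feature set A for each clinical feature in feature set I;
calculating the set similarity value between feature set I and the current feature set A from the similarity value between each clinical feature and its best standard clinical feature;
setting n = n + 1 and traversing the n-th standard single-gene disease name again until all standard single-gene disease names in the feature relation database have been traversed, then collecting and sorting the set similarity values between feature set I and each feature set A for candidate output.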
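Further, the method of matching, based on the node marks on the standardized clinical-feature phenotype tree, the best standard clinical feature in feature set A for each clinical feature in feature set I includes:
the feature set I includes multiple clinical features, and the feature set A includes multiple standard clinical features;
traversing the i-th clinical feature in feature set I and selecting from feature set A the standard clinical feature with the highest similarity to it as the best standard clinical feature for the i-th clinical feature, the initial value of i being 1;
setting i = i + 1 and traversing the i-th clinical feature in feature set I again until all clinical features in feature set I have been traversed, so that multiple best standard clinical features, in one-to-one correspondence with the clinical features in feature set I, are selected from the feature set A of the n-th standard single-gene disease name.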
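Further, the method of selecting from feature set A the standard clinical feature with the highest similarity to the i-th clinical feature includes:
traversing the j-th standard clinical feature in feature set A and judging, based on the established index, whether the j-th standard clinical feature and the i-th clinical feature share a trunk node B_t, the initial value of j being 1;
if they do not, the similarity value between the j-th standard clinical feature and the i-th clinical feature is taken to be zero;
if they do, the similarity value between the j-th standard clinical feature and the i-th clinical feature is calculated with the multi-level structure similarity algorithm;
setting j = j + 1, traversing the j-th standard clinical feature in feature set A again and repeating the similarity calculation against the i-th clinical feature until all standard clinical features in feature set A have been traversed, giving multiple similarity values in one-to-one correspondence with the standard clinical features in feature set A;
selecting, from these similarity values, the standard clinical feature with the maximum value as the best standard clinical feature for the i-th clinical feature.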
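Preferably, the method of aligning the gene sequence with the human reference genome to obtain alignment data and obtaining an impact score for each genetic variant from the alignment data includes:
labelling the gene sequences with attributes, where in single-sample detection mode the gene sequence is 1 set from the person under test, and in family detection mode the gene sequences are 1 set from the person under test and at least 1 set from that person's immediate relatives;
aligning each set of gene sequences with the human reference genome to obtain the corresponding number of sets of alignment data;
obtaining the length information, position information and base-change information of each genetic variant from each set of alignment data, identifying the variant type from the length information, and predicting the variant function from the position information and base-change information; the variant types include SNP variants and Indel variants, and the variant function types are deleterious, low-harm or essentially harmless;
annotating, according to the variant-type identification result of each genetic variant, the gene in which the variant lies and its population frequency, and judging the family inheritance mode in family detection mode;
grading the clinical significance of each genetic variant based on its length information, position information, population frequency, predicted variant function and family inheritance mode, the grades being pathogenic, likely pathogenic, uncertain significance, likely benign and benign;
calculating the impact score of each genetic variant in a gene from one or more of its clinical significance grade, population frequency, status as a definite pathogenic locus and predicted variant function.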
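Preferably, the method of obtaining from a preset gene list file the multiple genes corresponding to the single-gene disease name and calculating the pathogenicity score of each gene from the impact scores of the genetic variants in the gene, the inheritance mode of the variants, the association with known disease and the similarity value corresponding to the gene includes:
obtaining the genetic variants in the gene and matching the impact score of each variant;
calculating the pathogenicity score of each gene with the formula Score_g = max(Score_v) + w_e·S_e + w_t·S_t + w_MLS·S_MLS, where max(Score_v) is the maximum impact score among the genetic variants in the gene, S_e is the value assigned for the gene's association with known disease, S_t is the value assigned for the inheritance mode of the variant, S_MLS is the similarity value corresponding to the gene, and w_e, w_t and w_MLS are the weights of S_e, S_t and S_MLS respectively.
Preferably, before outputting the corresponding standard single-gene disease names as candidates in descending order of pathogenicity score, the method further includes:
for the candidate standard single-gene disease names, filtering out with a blacklist the standard single-gene disease names that correspond to false-positive variant sites.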
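Compared with the prior art, the method for recommending single-gene disease names based on clinical features and sequence variation provided by the present invention has the following beneficial effects:
In the method provided by the present invention, case information comprising a gene sequence, a feature set I and a single-gene disease name is first obtained for the patient; a phenotype-assisted recommendation of single-gene disease names is then made from feature set I, and a genetics-assisted recommendation is made from the gene sequence and the single-gene disease name; the final recommended standard single-gene disease names are output to the patient according to the intersection of the two sets of recommendations.
The solution thus combines the patient's clinical features and genetic variants for clinically assisted diagnosis and can help clinicians diagnose complex single-gene diseases accurately.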
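Another aspect of the present invention provides a system for recommending single-gene disease names based on clinical features and sequence variation, including:
an input unit for obtaining the patient's case information, the case information including a gene sequence, a feature set I and a single-gene disease name;
a sequence alignment unit for aligning the gene sequence with the human reference genome to obtain alignment data and obtaining an impact score for each genetic variant from the alignment data;
a phenotype-assisted diagnosis unit for traversing the feature set A corresponding to each standard single-gene disease name in the feature relation database, calculating the set similarity value between each feature set A and feature set I, outputting the similar standard single-gene disease names and their corresponding genes as candidates in descending order of similarity value, and collecting the candidate standard single-gene disease names to construct a standard single-gene disease name set P;
a genetics-assisted diagnosis unit for obtaining from a preset gene list file the multiple genes corresponding to the single-gene disease name, calculating the pathogenicity score of each gene from the impact scores of the genetic variants in the gene, the inheritance mode of the variants, the association with known disease and the similarity value corresponding to the gene, outputting the corresponding standard single-gene disease names as candidates in descending order of pathogenicity score, and collecting the candidate standard single-gene disease names to construct a standard single-gene disease name set G;
a recommendation output unit for outputting the recommended standard single-gene disease names based on the intersection of set G and set P and the candidate output order of the standard single-gene disease names.
Compared with the prior art, the beneficial effects of the system provided by the present invention are the same as those of the method provided by the above technical solution and are not repeated here.
A third aspect of the present invention provides a computer-readable storage medium, for example a non-volatile computer-readable storage medium, on which computer-readable instructions are stored; when run by a processor, the instructions execute the steps of the above method for recommending single-gene disease names based on clinical features and sequence variation.
Compared with the prior art, the beneficial effects of the computer-readable storage medium provided by the present invention are the same as those of the method provided by the above technical solution and are not repeated here.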
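Brief Description of the Drawings
The drawings described here provide a further understanding of the present invention and form part of it; the illustrative embodiments and their descriptions explain the invention and do not unduly limit it. In the drawings:
Figure 1 is a flow diagram of the method for recommending single-gene disease names based on clinical features and sequence variation in Embodiment 1;
Figure 2 is an example of node marking on the standardized clinical-feature phenotype tree in Embodiment 1;
Figure 3 is a structural block diagram of the system for recommending single-gene disease names based on clinical features and sequence variation in Embodiment 2;
Figure 4 is an example of the environment architecture in which the method for recommending single-gene disease names based on clinical features and sequence variation is applied in Embodiment 4.
Detailed Description
To make the above objects, features and advantages of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are evidently only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the invention.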
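Embodiment 1
Referring to Figure 1, this embodiment provides a method for recommending single-gene disease names based on clinical features and sequence variation, including:
obtaining the patient's case information, which includes a gene sequence, a feature set I and a single-gene disease name; aligning the gene sequence with the human reference genome to obtain alignment data, and obtaining an impact score for each genetic variant from the alignment data; traversing the feature set A corresponding to each standard single-gene disease name in the feature relation database, calculating the set similarity value between each feature set A and feature set I, outputting the similar standard single-gene disease names and their corresponding genes as candidates in descending order of similarity value, and collecting the candidate names to construct a standard single-gene disease name set P; obtaining from a preset gene list file the multiple genes corresponding to the single-gene disease name, calculating the pathogenicity score of each gene from the impact scores of the genetic variants in the gene, the inheritance mode of the variants, the association with known disease and the similarity value corresponding to the gene, outputting the corresponding standard single-gene disease names as candidates in descending order of pathogenicity score, and collecting the candidate names to construct a standard single-gene disease name set G; and outputting the recommended standard single-gene disease names based on the intersection of set G and set P and the candidate output order of the names.
In the method provided by the present invention, case information comprising a gene sequence, a feature set I and a single-gene disease name is first obtained for the patient; a phenotype-assisted recommendation of single-gene disease names is then made from feature set I, and a genetics-assisted recommendation is made from the gene sequence and the single-gene disease name; the final recommended standard single-gene disease names are output to the patient according to the intersection of the two sets of recommendations.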
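In the above embodiment, before the step of traversing the feature set A corresponding to each standard single-gene disease name in the feature relation database, calculating the set similarity value between each feature set A and feature set I, and outputting the similar standard single-gene disease names and their corresponding genes as candidates in descending order of similarity value, the method further includes:
obtaining known standard single-gene disease names and their corresponding standard clinical features from public databases and literature databases of single-gene diseases; building from them a feature relation database relating standard single-gene disease names to standard clinical features; calculating, for each standard single-gene disease name, the contribution c_i of each of its standard clinical features to that disease; obtaining data from the feature relation database and building a standardized clinical-feature phenotype tree of single-gene diseases based on HPO. The phenotype tree consists of multiple trunk nodes and at least one branch node associated with each trunk node; each branch node represents one standardized clinical feature, and each trunk node represents the index of its associated standardized clinical features.
Preferably, the foreign-language information in the feature relation database is also translated into Chinese with reference to the Chinese Human Phenotype Ontology Consortium, so that Chinese-language medical records can be recognized and matched.
In a specific implementation, the public database is the MedGen database and the literature database is the PubMed database. The feature relation database contains mutually matched standard single-gene disease names, foreign-language clinical features, the identifiers of the clinical features in the Human Phenotype Ontology database (HPOIDs) and Chinese clinical features. This embodiment can provide clues and theoretical support for the clinical diagnosis and differentiation of single-gene diseases, and data support for further narrowing the scope of genetic testing. The clinical feature relation database built in this embodiment covers more than 8,600 single-gene diseases, more than 11,000 phenotypic clinical features and more than 90,000 phenotype-to-clinical-feature relations, and incorporates the latest database releases and literature reports in single-gene disease research.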
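Specifically, the contribution c_i of each standard clinical feature of a standard single-gene disease name to that disease is calculated as follows:
Suppose the feature relation database contains a kinds of standard clinical feature, which occur N times in total in the database, and that the i-th kind occurs a_i times. The frequency f_i with which each standard clinical feature occurs in the feature relation database is then:
f_i = a_i / N;
For a given standard single-gene disease name in the feature relation database, suppose it has m standard clinical features whose frequencies in the database are f_1, f_2, ..., f_m. The contribution c_i of a standard clinical feature to that single-gene disease is then calculated as:
Figure PCTCN2020111133-appb-000001
In the above formula, k is a correction factor with k > 1, and the feature relation database is used as the reference database.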
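The frequency step f_i = a_i / N can be sketched as below. This is only an illustration: the dictionary layout of the feature relation database and the function name are assumptions, and the contribution formula for c_i is not implemented because it survives in this text only as a figure reference.

```python
from collections import Counter
from typing import Dict, List


def feature_frequencies(feature_db: Dict[str, List[str]]) -> Dict[str, float]:
    """Compute f_i = a_i / N over a disease -> standard-feature mapping.

    feature_db is a hypothetical in-memory stand-in for the feature
    relation database: disease name -> list of standard clinical features.
    """
    # a_i: number of times each standard clinical feature occurs
    counts = Counter(f for feats in feature_db.values() for f in feats)
    # N: total number of occurrences of all features
    total = sum(counts.values())
    return {feature: count / total for feature, count in counts.items()}
```

For two diseases with features ["a", "b"] and ["a", "c"], N is 4 and the frequencies are 0.5 for "a" and 0.25 each for "b" and "c".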
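Feature set I, that is, the set of clinical feature information, can be entered in standardized form through a visual tool in two ways. In the first, keywords are typed in, each keyword corresponding to one clinical feature, and an instant search offers a drop-down menu of related standardized phenotype terms for the user to choose from. In the second, the user clicks the relevant standardized clinical features directly on the phenotype tree.
The method of building the standardized clinical-feature phenotype tree of single-gene diseases in the above embodiment includes:
obtaining data from the feature relation database and building the standardized clinical-feature phenotype tree of single-gene diseases based on HPO, where the tree consists of multiple trunk nodes and at least one branch node associated with each trunk node, each branch node represents one standardized clinical feature, and each trunk node represents the index of its associated standardized clinical features. HPO refers to the hp.obo file.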
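In the above embodiment, the method of traversing the feature set A corresponding to each standard single-gene disease name in the feature relation database, calculating the set similarity value between each feature set A and feature set I, and outputting the similar standard single-gene disease names and their corresponding genes as candidates in descending order of similarity value includes:
marking the nodes of the clinical features in feature set I on the standardized clinical-feature phenotype tree; traversing the n-th standard single-gene disease name in the feature relation database and marking the nodes of the standard clinical features in its feature set A on the tree, the initial value of n being 1; matching, based on the node marks, the best standard clinical feature in feature set A for each clinical feature in feature set I; calculating the set similarity value between feature set I and the current feature set A from the similarity value between each clinical feature and its best standard clinical feature; setting n = n + 1 and traversing the n-th standard single-gene disease name again until all names in the feature relation database have been traversed, then collecting and sorting the set similarity values between feature set I and each feature set A for candidate output.
Specifically, the method of selecting from feature set A the standard clinical feature with the highest similarity to the i-th clinical feature includes:
traversing the j-th standard clinical feature in feature set A and judging, based on the established index, whether it shares a trunk node B_t with the i-th clinical feature, the initial value of j being 1; if not, the similarity value between the two is taken to be zero; if so, the similarity value is calculated with the multi-level structure similarity algorithm; setting j = j + 1, traversing the j-th standard clinical feature again and repeating the similarity calculation until all standard clinical features in feature set A have been traversed, giving multiple similarity values in one-to-one correspondence with the standard clinical features in feature set A; and selecting the standard clinical feature with the maximum similarity value as the best standard clinical feature for the i-th clinical feature.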
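The method in the above embodiment of calculating the similarity value between the j-th standard clinical feature and the i-th clinical feature with the multi-level structure similarity algorithm includes:
based on the node marks on the standardized clinical-feature phenotype tree, obtaining the directed set IB of all nodes on the path connecting the i-th clinical feature to the shared trunk node B_t, and the directed set AB of all nodes on the path connecting the j-th standard clinical feature to the shared trunk node B_t; the length of IB is the number of nodes on its path, L_IB, and the length of AB is the number of nodes on its path, L_AB; extracting the intersection IAB of the nodes in IB and AB, whose length is the number of shared nodes on the paths, L_IAB; and using the formula
Figure PCTCN2020111133-appb-000002
to calculate the similarity value between the j-th standard clinical feature and the i-th clinical feature;
where SM is the similarity between the j-th standard clinical feature and the i-th clinical feature across levels of the phenotype tree, SI is their similarity within the same level of the phenotype tree, and β is a weight coefficient.
In a specific implementation, the feature set A corresponding to a standard single-gene disease name in the feature relation database consists of n elements A_j, namely A_1, A_2, ..., A_n, that is, A = [A_1, A_2, ..., A_j, ..., A_n]; every standard single-gene disease name in the database corresponds to one set A. Suppose the standardized feature set I entered for a patient with a single-gene disease consists of m clinical features I_i, so that I = [I_1, I_2, ..., I_m]. If I_i and A_j do not share a trunk node, their similarity is taken to be 0. If they share a trunk node B_t, as shown in Figure 2, their similarity is calculated as follows: all nodes on the path connecting I_i to B_t form the directed set IB; the number of elements of IB is N_IB, and the length of IB is defined as the number of nodes on that path, L_IB, with L_IB = N_IB.
All nodes on the path connecting A_j to B_t form the directed set AB; the number of elements of AB is N_AB, and the length of AB is defined as the number of nodes on that path, L_AB, with L_AB = N_AB.
The intersection of IB and AB is denoted IAB; the number of its elements is N_IAB, and its length is defined as the number of nodes on the shared path, L_IAB, with L_IAB = N_IAB. Here SM = L_IAB / max(L_AB, L_IB), SI = 1 / (L_AB + L_IB - 2·L_IAB + 1), and β is a weight coefficient with β ∈ (0, 1). The similarity between I_i and A_j takes values in the range
Figure PCTCN2020111133-appb-000003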
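The quantities SM and SI defined above can be computed directly from the two node paths. The sketch below is a hedged illustration: the function name and the representation of a path as a list of node identifiers starting at the trunk node are assumptions, and so is the final combination β·SM + (1 - β)·SI, because the combining formula is available here only as a figure reference.

```python
from typing import Sequence


def feature_similarity(path_i: Sequence[str],
                       path_a: Sequence[str],
                       beta: float = 0.5) -> float:
    """Similarity of a patient feature and a standard feature on the tree.

    path_i / path_a: nodes from the trunk node down to the feature,
    trunk node first; nodes are assumed unique within a path.
    """
    # Features under different trunk nodes have similarity 0 by definition.
    if not path_i or not path_a or path_i[0] != path_a[0]:
        return 0.0
    l_ib, l_ab = len(path_i), len(path_a)
    l_iab = len(set(path_i) & set(path_a))      # shared nodes
    sm = l_iab / max(l_ab, l_ib)                # SM as defined in the text
    si = 1.0 / (l_ab + l_ib - 2 * l_iab + 1)    # SI as defined in the text
    # ASSUMPTION: convex combination of SM and SI with weight beta.
    return beta * sm + (1.0 - beta) * si
```

Two identical paths give SM = SI = 1 and therefore similarity 1 for any β; two siblings two levels below a shared trunk node (paths of length 3 sharing 2 nodes) give SM = 2/3 and SI = 1/3.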
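Further, the method in the above embodiment of calculating the set similarity value between feature set I and the current feature set A from the similarity value between each clinical feature and its best standard clinical feature includes:
weighting, with the contribution c_i of the i-th clinical feature, the maximum similarity value of its best standard clinical feature in feature set A; setting i = i + 1 and weighting the maximum similarity value of the best standard clinical feature for the i-th clinical feature again, until all best standard clinical features selected from feature set A have been weighted; and summing the weighted maximum similarity values of all best standard clinical features in feature set A to obtain the set similarity value between feature set I and the current feature set A.
In a specific implementation, for every entered clinical feature I_i a standard clinical feature A_j with the greatest similarity to it can be found in feature set A, so every clinical feature I_i obtains one similarity value against feature set A. The similarity between feature set I and feature set A is defined as the sum of the similarities between each clinical feature I_i in feature set I and feature set A.
Because clinical features contribute to a single-gene disease to different degrees, the corresponding maximum similarity values must be weighted. The formula is
Figure PCTCN2020111133-appb-000004
where
Figure PCTCN2020111133-appb-000005
denotes the similarity value between clinical feature I_i and feature set A. The similarity value between feature set I and feature set A is defined as the sum of the similarities between each clinical feature I_i in feature set I and feature set A, calculated as
Figure PCTCN2020111133-appb-000006
where S_IA denotes the similarity value between feature set I and feature set A.
The multi-level structure similarity algorithm used in the above embodiment thus recommends standard single-gene disease names with high accuracy.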
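The best-match-then-weighted-sum procedure can be sketched as follows. This is a minimal illustration, not the patent's formula, which is only present as figure references: the function and parameter names are hypothetical, the weight is looked up by the matched standard feature (the text speaks of the contribution c_i of the i-th clinical feature, so this lookup is an assumption), and a missing weight defaults to 1.0.

```python
from typing import Callable, Dict, Sequence


def set_similarity(patient_features: Sequence[str],
                   disease_features: Sequence[str],
                   sim: Callable[[str, str], float],
                   contribution: Dict[str, float]) -> float:
    """S_IA: sum over patient features of weight * best similarity in A."""
    total = 0.0
    for feature in patient_features:
        best_value, best_match = 0.0, None
        # Find the standard feature in A most similar to this patient feature.
        for standard in disease_features:
            value = sim(feature, standard)
            if value > best_value:
                best_value, best_match = value, standard
        if best_match is not None:
            # ASSUMPTION: weight by the contribution of the matched feature.
            total += contribution.get(best_match, 1.0) * best_value
    return total
```

Ranking the diseases then amounts to computing this value for every feature set A in the database and sorting the results in descending order.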
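In the above embodiment, the method of aligning the gene sequence with the human reference genome to obtain alignment data and obtaining an impact score for each genetic variant from the alignment data includes:
labelling the gene sequences with attributes, where in single-sample detection mode the gene sequence is 1 set from the person under test, and in family detection mode the gene sequences are 1 set from the person under test and at least 1 set from that person's immediate relatives; aligning each set of gene sequences with the human reference genome to obtain the corresponding number of sets of alignment data; obtaining the length information, position information and base-change information of each genetic variant from each set of alignment data, identifying the variant type from the length information, and predicting the variant function from the position information and base-change information, the variant types including SNP variants and inDel variants and the variant function types being deleterious, low-harm or essentially harmless; annotating, according to the variant-type identification result, the gene in which each variant lies and its population frequency, and judging the family inheritance mode in family detection mode; grading the clinical significance of each variant based on its length information, position information, population frequency, predicted variant function and family inheritance mode, the grades being pathogenic, likely pathogenic, uncertain significance, likely benign and benign; and calculating the impact score of each genetic variant in a gene from one or more of its clinical significance grade, population frequency, status as a definite pathogenic locus and predicted variant function.
In a specific implementation, the above embodiment has two gene detection modes. In single-sample detection mode, 1 set of gene sequences from the person under test is obtained; in family detection mode, 1 set from the person under test and at least 1 set from that person's immediate relatives are obtained. Each set of gene sequences in either mode is aligned with the human reference genome to obtain the corresponding alignment data. The length information, position information and base-change information of the genetic variants are obtained from the alignment data; the variant type is identified from the length information and the variant function is predicted from the position and base-change information; the gene in which each variant lies and its population frequency are annotated, and in family detection mode it is also judged whether the variant co-segregates within the family. The clinical significance of each variant is then graded from its length information, position information, population frequency, predicted variant function and, where available, family inheritance mode. Once these core items have been collected for each variant (one or more of the clinical significance grade, population frequency, status as a definite pathogenic locus and predicted variant function, plus whether the variant is recorded in a database), an impact score is calculated for every genetic variant in the genes. The relevant genes are then obtained from the name of the patient's single-gene disease, the variants in those genes are matched with the variants already scored, and the impact scores of the variants in the relevant genes are obtained.
It will be understood that gene sequences can be obtained in many ways. For example, the user can import high-throughput sequencing data through a web interface; the data format is gz-compressed fastq, and the usual import routes are from the local computer or through an ftp client. During import the gene sequences are checked for completeness, and a reminder is given for incomplete data. The attribute labels include file name, sample number, platform, family number, individual number, father number, mother number, gender, phenotype, age, ethnicity, place of residence, native place, disease name, clinical features, medical records, inheritance mode and so on.
After attribute labelling, the quality of the gene sequences must also be checked to ensure that they are fit for downstream analysis and interpretation. The quality metrics include: total number of sequences, sequence length, base quality, sequence quality, base content, GC content, per-base N content, sequence length distribution, duplicate sequences, over-represented sequences, adapter sequences, K-mer content and so on. The checking methods in this step are routine for a person skilled in the art and are not described further here.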
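In the above embodiment, the method of aligning each set of gene sequence data with the human reference genome to obtain the corresponding number of sets of alignment data includes:
performing quality checks on the obtained gene sequence data and flagging data that fail; feeding the data that pass into the BWA software and aligning them with the human reference genome hg19 or hg38; and performing deduplication, indel-region correction and base-quality correction on each alignment result in turn to obtain the sets of alignment data. The alignment data include the alignment position of each sequence on the chromosome, the alignment quality, the alignment position of the paired sequence on the chromosome, the insert length, and the base composition or sequence quality of the sequence.
In a specific implementation, the method of performing deduplication, indel-region correction and base-quality correction on each alignment result in turn includes:
deduplicating the alignment results with the Picard MarkDuplicates software; correcting indel regions by generating an indel list with the GATK RealignerTargetCreator software, adding the known indel sites found in the 1000 Genomes database, and locally realigning these indel regions with GATK IndelRealigner; and correcting base quality by using the GATK BaseRecalibrator software with known site information to recalibrate the base quality scores.
After these steps, a summary analysis of the alignment data can be carried out. It covers the quality of the alignment data, the number of raw paired-end reads, the number of reads aligned to the human reference genome, the mean read length, the proportion of indels, whether the forward and reverse strands are balanced, and so on. At this stage the sequence coverage of the target regions can also be examined to obtain the genome length, the length of the target regions, the total number of reads, the numbers of reads in target and non-target regions, the proportion of reads in target regions, the mean sequencing depth of the target regions, and so on.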
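Further, the method in the above embodiment of obtaining the length information, position information and base-change information of each genetic variant from each set of alignment data, identifying the variant type from the length information and predicting the variant function from the position information and base-change information includes:
using the HaplotypeCaller algorithm to identify, from the length information of the variants in each set of alignment data, whether a variant is a SNP variant or an inDel variant; when the variant is a missense mutation, predicting its function with the SIFT software or the Polyphen2 software; and when the variant is a splice-site variant, predicting its function with the HSF software.
In a specific implementation, a missense mutation is a form of single-nucleotide mutation in which a base substitution turns a codon for one amino acid into a codon for another, changing the amino acid type and sequence of the polypeptide chain. To predict its function, the SIFT software can be used to predict whether the amino acid substitution affects protein function; it gives a normalized score in the range [0, 1], where a lower score indicates greater harm: a score < 0.05 usually indicates a deleterious variant (Deleterious) and a score ≥ 0.05 a low-harm variant (tolerated). Alternatively, the Polyphen2 software can be used, which integrates protein sequence and three-dimensional structure features to predict the effect of an amino acid substitution on the structure and function of a human protein. Its normalized score also lies in [0, 1], and a higher score means a greater likelihood of damaging protein function: scores of 0.957–1 are usually predicted as deleterious (probably damaging), 0.453–0.956 as low-harm (possibly damaging) and 0–0.452 as essentially harmless (benign). A splice-site variant is a variant in the splice-site region of a gene that may affect mRNA splicing; the HSF software predicts whether the variant changes splicing, and a variant that does is classed as deleterious (Deleterious), otherwise as low-harm (tolerated). The scoring and function-prediction methods above are existing methods in the art and are not described further in this embodiment.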
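The score cut-offs quoted above translate into two small lookup functions. The sketch below is illustrative only: the function names are hypothetical, the scores are assumed to have been produced already by SIFT and Polyphen2 (neither tool is invoked), and scores falling between the quoted three-decimal bands are assigned to the lower band, which is an assumption.

```python
def sift_label(score: float) -> str:
    """SIFT: a score below 0.05 is deleterious, otherwise tolerated."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("SIFT score must lie in [0, 1]")
    return "deleterious" if score < 0.05 else "tolerated"


def polyphen2_label(score: float) -> str:
    """Polyphen2: 0.957-1 probably damaging, 0.453-0.956 possibly damaging,
    0-0.452 benign (bands as quoted in the description)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("Polyphen2 score must lie in [0, 1]")
    if score >= 0.957:
        return "probably damaging"
    if score >= 0.453:
        return "possibly damaging"
    return "benign"
```

Note that the two tools run in opposite directions: a low SIFT score and a high Polyphen2 score both indicate a harmful substitution.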
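Further, the method in the above embodiment of annotating, according to the variant-type identification result of each genetic variant, the gene in which the variant lies and its population frequency, and of judging the family inheritance mode in family detection mode, includes:
annotating, through public databases and based on the variant-type identification result, the gene in which each variant lies and its population frequency; and, in family detection mode, judging the family inheritance mode by analysing the position information of the variants in each set of alignment data: when the variant positions in the sets are associated, the variant is judged to be inherited within the family, otherwise not.
In a specific implementation, public databases are used to annotate the gene, transcript, exon position, amino acid change and variant type of each genetic variant, and its frequency in different world populations. Transcripts are taken from the NCBI RefSeq transcript database; for a gene with several differently spliced transcripts, the transcript containing the most exons is used for annotation. Population frequency information comes from the 1000 Genomes (1000genomes), ESP and gnomAD databases. In family detection mode, the family inheritance mode must also be judged by analysing the position information of the variants in each set of alignment data: when the variant sites in the sets are associated, the variant is judged to be inherited within the family, otherwise not; in single-sample detection mode this step is unnecessary. Family inheritance can be identified automatically by analysing multiple sets of gene sequence data with existing instruments, which is not described further in this embodiment.
It should be noted that the method in the above embodiment of grading the clinical significance of a genetic variant based on its length information, position information, population frequency, predicted variant function or family inheritance mode includes:
grading the clinical significance of the variant with reference to the standards and guidelines for the clinical significance of variants proposed by the American College of Medical Genetics and Genomics (ACMG) and the Association for Molecular Pathology (AMP). By way of example: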
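The evidence used in ACMG pathogenicity grading includes:
PVS1: a null (loss-of-function) variant in a disease whose pathogenic mechanism is loss of function (LOF).
PS1: the same amino acid change as a variant previously established as pathogenic.
PS2: a de novo variant in a patient with no family history.
PS3: in vivo or in vitro functional studies have established that the variant impairs gene function.
PS4: the frequency of the variant in affected individuals is significantly higher than in controls.
PM1: located in a mutational hotspot and/or in a critical functional domain known to have no benign variants.
PM2: not found in normal controls in the ESP, 1000 Genomes and ExAC databases.
PM3: for a recessive disease, detected in trans with a pathogenic variant.
PM4: protein length change caused by an in-frame insertion/deletion in a non-repeat region or by loss of a stop codon.
PM5: a novel missense variant that has not been reported before, at a site where a variant causing a different amino acid change has been confirmed as pathogenic.
PM6: a de novo variant not verified with parental samples.
PP1: the variant co-segregates with the disease in the family (detected in multiple affected family members).
PP2: a novel missense variant in a gene in which missense variants are a cause of disease and benign variants make up only a small proportion.
PP3: multiple statistical methods predict that the variant has a deleterious effect on the gene or gene product, including conservation prediction, evolutionary prediction and splice-site impact.
PP4: the phenotype or family history of the variant carrier is highly consistent with a particular single-gene disease.
PP5: a report from a reputable source considers the variant pathogenic, but the evidence is not sufficient for independent laboratory evaluation.
BA1: allele frequency > 5% in the ESP, 1000 Genomes or ExAC databases.
BS1: allele frequency greater than the disease incidence.
BS2: for a fully penetrant early-onset disease, the variant is found in a healthy adult (homozygous for a recessive disease, heterozygous for a dominant disease, or hemizygous for an X-linked disease).
BS3: in vivo or in vitro studies confirm that the variant has no effect on protein function or splicing.
BS4: lack of co-segregation among the members of a family.
BP1: a missense variant in a gene for which truncating variants are the known cause of disease.
BP2: in a dominant disease, found together with a known pathogenic variant of the same gene on the other chromosome; or, in a disease of any inheritance mode, found together with a known pathogenic variant of the same gene on the same chromosome.
BP3: a deletion/insertion in a repeat region of unknown function that does not change the coding frame.
BP4: multiple statistical methods predict that the variant has no effect on the gene or gene product, including conservation prediction, evolutionary prediction and splice-site impact.
BP5: a variant found in a case that already has another molecular cause of disease.
BP6: a report from a reputable source considers the variant benign, but the evidence is not sufficient to support this.
BP7: a synonymous variant predicted not to affect splicing.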
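The combined rules for grading genetic variants include:
Pathogenic, any of i, ii or iii:
i. 1 very strong item (PVS1) and any of a-d:
a. one or more strong items (PS1-PS4)
b. two or more moderate items (PM1-PM6)
c. 1 moderate item (PM1-PM6) and 1 supporting item (PP1-PP5)
d. ≥2 supporting items (PP1-PP5);
ii. ≥2 strong items (PS1-PS4);
iii. 1 strong item (PS1) and any of a, b or c:
a. ≥3 moderate items (PM1-PM6)
b. 2 moderate items (PM1-PM6) and ≥2 supporting items (PP1-PP5)
c. 1 moderate item (PM1-PM6) and ≥4 supporting items (PP1-PP5).
Likely pathogenic, any of i–vi:
i. 1 very strong item (PVS1) and 1 moderate item (PM1-PM6);
ii. 1 strong item (PS1-PS4) and 1-2 moderate items (PM1-PM6);
iii. 1 strong item (PS1-PS4) and ≥2 supporting items (PP1-PP5);
iv. ≥3 moderate items (PM1-PM6);
v. 2 moderate items (PM1-PM6) and ≥2 supporting items (PP1-PP5);
vi. 1 moderate item (PM1-PM6) and ≥4 supporting items (PP1-PP5).
Benign, either i or ii:
i. 1 stand-alone item (BA1);
ii. ≥2 strong items (BS1-BS4).
Likely benign, either i or ii:
i. 1 strong item (BS1-BS4) and 1 supporting item (BP1-BP7);
ii. ≥2 supporting items (BP1-BP7).
Uncertain significance, either i or ii:
i. the above criteria are not met; or
ii. the benign and pathogenic criteria contradict each other.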
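The combining rules above are mechanical enough to sketch in code. The example below is an illustration under assumptions, not a validated classifier: the function name is hypothetical, evidence is passed as a set of codes such as "PVS1" or "PM2", rule iii for pathogenic is read as "exactly one strong item of any kind" although the text writes "(PS1)", and deciding which evidence codes actually apply to a variant is outside its scope.

```python
from typing import Iterable


def classify_variant(evidence: Iterable[str]) -> str:
    """Apply the combined grading rules listed above to a set of evidence codes."""
    codes = set(evidence)

    def count(prefix: str) -> int:
        return sum(1 for code in codes if code.startswith(prefix))

    pvs, ps, pm, pp = count("PVS"), count("PS"), count("PM"), count("PP")
    ba, bs, bp = count("BA"), count("BS"), count("BP")

    pathogenic = (
        (pvs >= 1 and (ps >= 1 or pm >= 2 or (pm >= 1 and pp >= 1) or pp >= 2))
        or ps >= 2
        or (ps == 1 and (pm >= 3 or (pm == 2 and pp >= 2) or (pm == 1 and pp >= 4)))
    )
    likely_pathogenic = (
        (pvs >= 1 and pm == 1)
        or (ps == 1 and 1 <= pm <= 2)
        or (ps == 1 and pp >= 2)
        or pm >= 3
        or (pm == 2 and pp >= 2)
        or (pm == 1 and pp >= 4)
    )
    benign = ba >= 1 or bs >= 2
    likely_benign = (bs >= 1 and bp >= 1) or bp >= 2

    # Contradictory pathogenic and benign criteria give uncertain significance.
    if (pathogenic or likely_pathogenic) and (benign or likely_benign):
        return "uncertain significance"
    if pathogenic:
        return "pathogenic"
    if likely_pathogenic:
        return "likely pathogenic"
    if benign:
        return "benign"
    if likely_benign:
        return "likely benign"
    return "uncertain significance"
```

The stronger grade is tested first on each side, so a variant that meets both the pathogenic and the likely pathogenic criteria is reported as pathogenic.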
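Optionally, to ensure that the variant data are valid, some genetic variants may be filtered out under three conditions: first, intronic variants (intron_variant), intergenic variants (intergenic_variant), upstream gene variants (upstream_gene_variant) and downstream gene variants (downstream_gene_variant); second, variant sites with a population frequency greater than 0.1; third, genetic variants that fail quality assessment.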
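The three filter conditions can be expressed as a single predicate. This is a minimal sketch with hypothetical names; it assumes each variant already carries a consequence term, an optional population frequency and a quality flag, and it keeps variants whose frequency is unknown, which the text does not specify.

```python
from typing import Optional

# Consequence terms named in the description as filtered out.
EXCLUDED_CONSEQUENCES = {
    "intron_variant",
    "intergenic_variant",
    "upstream_gene_variant",
    "downstream_gene_variant",
}


def keep_variant(consequence: str,
                 population_frequency: Optional[float],
                 passed_quality: bool,
                 max_frequency: float = 0.1) -> bool:
    """Return True if the variant survives all three filters."""
    if consequence in EXCLUDED_CONSEQUENCES:
        return False
    # ASSUMPTION: an unknown frequency (None) does not exclude the variant.
    if population_frequency is not None and population_frequency > max_frequency:
        return False
    return passed_quality
```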
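In the above embodiment, the method of calculating the impact score of each genetic variant in a gene according to one or more of the clinical significance grade, the population frequency, the definiteness of the pathogenic site and the predicted variant function of each genetic variant includes:
assigning a value to each item of evidence of each genetic variant, the evidence including the clinical significance grade, the population frequency, the definiteness of the pathogenic site, the predicted variant function, whether the variant is included in a database, and the like;
calculating the impact score of each genetic variant separately using the impact score formula for genetic variants in a gene
$$Score_v=\sum_{i=1}^{f} w_i s_i$$
where $f$ is the number of items of evidence, $w_i$ is the weight of the $i$-th item of evidence, and $s_i$ is the value assigned to the $i$-th item of evidence.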
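In a specific implementation, a value of 4 is assigned when the variant type is a missense mutation or a splice-site variant. For the population frequency, 1 is assigned when the frequency is less than or equal to $10^{-4}$ or no information is available, 0.5 is assigned when the frequency is between $10^{-4}$ and $10^{-3}$, and -1 is assigned when the frequency is greater than 0.05. For the variant function predicted by the SIFT software, 1 is assigned when the prediction is deleterious and -1 when the prediction is tolerated. For the variant function predicted by the Polyphen2 software, 1 is assigned when the prediction is probably damaging, 0.5 when the prediction is possibly damaging, and -1 when the prediction is benign. For the variant function predicted by the HSF software, 2 is added when the prediction is that splicing is affected and 0 when the prediction is that splicing is not affected. For the clinical significance grade, 3 is assigned for pathogenic, 2 for likely pathogenic, 1 for uncertain significance, -2 for likely benign and -3 for benign. The databases include the ClinVar database, the UniProt database or a local database, and 1 is added when the genetic variant is included in any one of these databases. When the genetic variant site is a confirmed pathogenic site, 5 is assigned.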
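By way of example, $Score_v=S_c+S_p+S_{vip}+S_{sift}+S_{pph2}+S_{HSF}$, where $S_c$ is the score for the clinical significance grade, $S_p$ is the score for the population frequency, $S_{vip}$ is the score for the definiteness of the pathogenic site, $S_{sift}$ is the score for the variant function predicted by the SIFT software, $S_{pph2}$ is the score for the variant function predicted by the Polyphen2 software, and $S_{HSF}$ is the score for the variant function predicted by the HSF software.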
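The six-term example sum can be sketched with the values assigned in the specific implementation above. This is a minimal sketch under stated assumptions: all function, parameter and table names are hypothetical; every weight is taken as 1; a frequency between $10^{-3}$ and 0.05, for which no value is stated, is scored 0; and an unrecognized label is scored 0.

```python
CLINICAL_SCORES = {
    "pathogenic": 3,
    "likely pathogenic": 2,
    "uncertain significance": 1,
    "likely benign": -2,
    "benign": -3,
}
SIFT_SCORES = {"deleterious": 1, "tolerated": -1}
PPH2_SCORES = {"probably damaging": 1, "possibly damaging": 0.5, "benign": -1}


def frequency_score(freq):
    """Score the population frequency; None means no frequency information."""
    if freq is None or freq <= 1e-4:
        return 1.0
    if freq <= 1e-3:
        return 0.5
    if freq > 0.05:
        return -1.0
    return 0.0  # assumption: the range (1e-3, 0.05] is not assigned a value


def variant_impact_score(clinical, freq, is_confirmed_site, sift, pph2, hsf_affects_splicing):
    """Sum the six evidence scores S_c, S_p, S_vip, S_sift, S_pph2 and S_HSF."""
    s_c = CLINICAL_SCORES.get(clinical, 0)
    s_p = frequency_score(freq)
    s_vip = 5 if is_confirmed_site else 0
    s_sift = SIFT_SCORES.get(sift, 0)
    s_pph2 = PPH2_SCORES.get(pph2, 0)
    s_hsf = 2 if hsf_affects_splicing else 0
    return s_c + s_p + s_vip + s_sift + s_pph2 + s_hsf
```

The variant-type value and the database-inclusion value are not part of the six-term example and are therefore left out of this sketch.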
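In the above embodiment, the method of obtaining, from a preset gene list file, a plurality of genes corresponding to the monogenic disease name, and calculating the pathogenicity score of each gene based on the impact score of each genetic variant in the gene, the inheritance pattern of the genetic variants, the association with known diseases and the similarity value corresponding to the gene, includes:
obtaining the genetic variants in the gene and matching the impact score of each genetic variant; and calculating the pathogenicity score of each gene separately using the pathogenicity score formula $Score_g=\max(Score_v)+w_eS_e+w_tS_t+w_{MLS}S_{MLS}$, where $\max(Score_v)$ is the maximum impact score over all genetic variants in the gene, $S_e$ is the value assigned to the association of the gene with known diseases, $S_t$ is the value assigned to the inheritance pattern of the genetic variants, $S_{MLS}$ is the similarity value corresponding to the gene, $w_e$ is the weight of $S_e$, $w_t$ is the weight of $S_t$, and $w_{MLS}$ is the weight of $S_{MLS}$.
In a specific implementation, a plurality of genes corresponding to the patient's monogenic disease name are obtained from the preset gene list file, and the genetic variants in each gene are extracted and matched against the genetic variants whose impact scores have already been calculated, giving the impact score of each genetic variant in these genes. The pathogenicity score of each gene is then calculated using the pathogenicity score formula $Score_g=\max(Score_v)+w_eS_e+w_tS_t+w_{MLS}S_{MLS}$. Here $S_e$ is the value assigned to the association of the gene with the disease: 10 is assigned when the gene is a known gene associated with the disease, and 0 is assigned to other genes. $S_t$ is the value assigned to the inheritance pattern of the genetic variants: 5 is assigned when the inheritance pattern is family inheritance, and 0 otherwise. $S_{MLS}$ is the largest of the similarity values between the monogenic disease name corresponding to the gene and the standard monogenic disease names in the feature relationship database. The default value of both $w_e$ and $w_t$ is 1, and the default value of $w_{MLS}$ is 2, with a value range of 1 to 5; $w_e$, $w_t$ and $w_{MLS}$ can be adjusted in practice as circumstances require.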
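The pathogenicity score formula can be sketched as follows, using the default weights and assigned values stated above. The function and parameter names are hypothetical, and raising an error for a gene without scored variants is an assumption of this sketch, since that case is not described.

```python
def gene_pathogenicity_score(
    variant_scores,          # impact scores Score_v of the variants in the gene
    is_known_disease_gene,   # True if the gene is a known gene for the disease
    is_family_inheritance,   # True if the inheritance pattern is family inheritance
    similarity,              # S_MLS, the similarity value corresponding to the gene
    w_e=1.0,
    w_t=1.0,
    w_mls=2.0,
):
    """Compute Score_g = max(Score_v) + w_e*S_e + w_t*S_t + w_MLS*S_MLS."""
    if not variant_scores:
        # Assumption: a gene with no scored variant cannot be ranked.
        raise ValueError("the gene has no scored genetic variants")
    s_e = 10 if is_known_disease_gene else 0
    s_t = 5 if is_family_inheritance else 0
    return max(variant_scores) + w_e * s_e + w_t * s_t + w_mls * similarity


def rank_genes(gene_scores):
    """Sort (gene, score) pairs in descending order of pathogenicity score."""
    return sorted(gene_scores, key=lambda pair: pair[1], reverse=True)
```

Only the highest-scoring variant of a gene contributes to the gene's score, so adding further low-impact variants to a gene does not change its rank.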
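It should be added that the method provided in this embodiment can also obtain and display, for a gene sequence, the gene and exon position of a genetic variant, the reference genome sequence, the coverage on both sides of the genetic variant, the alignment quality on both sides of the genetic variant, the distribution of variants on both sides, and so on. After the genetic variants that may cause a monogenic disease have been checked manually, a genetic analysis interpretation report is generated automatically. The report includes the individual information of the gene sequence data, the genetic analysis interpretation results and the clinical features of the related monogenic disease. The individual information includes the sample number, name, sex, age, native place, place of residence, disease diagnosis, disease description and similar information. The genetic analysis interpretation results include the physical position of the pathogenic mutation, the gene name, the DNA change, the amino acid change, the frequency in the East Asian population, the clinical significance grade, the disease and the family inheritance pattern.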
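In the above embodiment, the recommendation result of standard monogenic disease names is output based on the intersection of the standard monogenic disease name set G and the standard monogenic disease name set P, and on the candidate output order of the standard monogenic disease names.
In a specific implementation, when the intersection of set G and set P is empty, the standard monogenic disease names recommended by genetic-assisted diagnosis are completely inconsistent with those recommended by phenotype-assisted diagnosis, and no recommendation result is output. When the intersection of set G and set P contains one name, the two recommendation results have exactly one name in common, and this single standard monogenic disease name is output as the recommendation result. When the intersection of set G and set P contains several names, the two recommendation results partly coincide, and the several standard monogenic disease names are output as the recommendation result in their candidate output order.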
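The three cases reduce to one ordered intersection. The sketch below is illustrative only: the function name is hypothetical, and it assumes that the candidate output order used for the final result is the order of the genetic-assisted candidates (set G), which this document does not state explicitly.

```python
def recommend_disease_names(candidates_g, candidates_p):
    """Return the names present in both candidate lists, in the order of candidates_g.

    candidates_g: names output by genetic-assisted diagnosis, best first.
    candidates_p: names output by phenotype-assisted diagnosis, best first.
    An empty list means that no recommendation is output.
    """
    in_p = set(candidates_p)
    seen = set()
    result = []
    for name in candidates_g:
        # Keep each shared name once, at its first position in candidates_g.
        if name in in_p and name not in seen:
            seen.add(name)
            result.append(name)
    return result
```

An empty result corresponds to the first case, a single-element result to the second, and a longer result to the third.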
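Further, in the above embodiment, before the corresponding standard monogenic disease names are output as candidates in descending order of pathogenicity score, the method also includes:
for the candidate standard monogenic disease names, filtering out, by means of a blacklist, the standard monogenic disease names that correspond to false-positive variant sites. The sites in the blacklist come from within the laboratory and are false-positive variant sites of high-throughput sequencing.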
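Embodiment 2
Referring to FIG. 3, this embodiment provides a system for recommending monogenic disease names based on clinical features and sequence variation, including:
an input unit, configured to obtain the patient's case information, the case information including a gene sequence, a feature set I and a monogenic disease name;
a sequence alignment unit, configured to align the gene sequence with the human reference genome to obtain alignment data, and to obtain the impact score of each genetic variant from the alignment data;
a phenotype-assisted diagnosis unit, configured to traverse the feature set A corresponding to each standard monogenic disease name in the feature relationship database, calculate the set similarity value between each feature set A and the feature set I, output the similar standard monogenic disease names and their corresponding genes as candidates in descending order of similarity value, and collect the candidate standard monogenic disease names to construct a standard monogenic disease name set P;
a genetic-assisted diagnosis unit, configured to obtain, from a preset gene list file, a plurality of genes corresponding to the monogenic disease name, calculate the pathogenicity score of each gene based on the impact score of each genetic variant in the gene, the inheritance pattern of the genetic variants, the association with known diseases and the similarity value corresponding to the gene, output the corresponding standard monogenic disease names as candidates in descending order of pathogenicity score, and collect the candidate standard monogenic disease names to construct a standard monogenic disease name set G;
a recommendation output unit, configured to output the recommendation result of standard monogenic disease names based on the intersection of set G and set P and on the candidate output order of the standard monogenic disease names.
In one embodiment, the above system for recommending monogenic disease names is applied to a computer device that includes a processor and a memory connected through a system bus. The processor of the system provides computing and control capability. The memory of the system includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and computer-readable instructions. The internal memory provides an environment for running the operating system and the computer-readable instructions stored in the non-volatile storage medium. The network interface of the system is used to communicate with external sensors. When executed by the processor, the computer-readable instructions implement the steps of the above method for recommending monogenic disease names based on clinical features and sequence variation, for example by means of the above input unit, sequence alignment unit, phenotype-assisted diagnosis unit, genetic-assisted diagnosis unit and recommendation output unit.
Compared with the prior art, the beneficial effects of the system provided in this embodiment are the same as those of the method provided in Embodiment 1 and are not repeated here.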
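Embodiment 3
This embodiment provides a computer-readable storage medium, for example a non-volatile computer-readable storage medium, on which computer-readable instructions are stored; when run by a processor, the computer-readable instructions execute the steps of the above method for recommending monogenic disease names based on clinical features and sequence variation.
Compared with the prior art, the beneficial effects of the computer-readable storage medium provided in this embodiment are the same as those of the method provided in the above technical solution and are not repeated here.
Embodiment 4
Based on the above embodiments, and referring to FIG. 4, a schematic diagram of the environment architecture of an application scenario is provided.
An application program can be developed to implement the method for recommending monogenic disease names based on clinical features and sequence variation of the above embodiments. The application program can be installed on a user terminal, and the user terminal is connected to a server to communicate with it.
The user terminal can be any smart device such as a computer or a tablet; this embodiment uses a computer only as an example.
For example, the user opens the application program on the smart device and uses an input unit such as a keyboard or mouse to enter the patient's case information, which includes a gene sequence, a feature set I and a monogenic disease name. The application program on the computer sends the gene sequence to the sequence alignment unit, the feature set I to the phenotype-assisted diagnosis unit, and the monogenic disease name to the genetic-assisted diagnosis unit; the sequence alignment unit, the phenotype-assisted diagnosis unit and the genetic-assisted diagnosis unit can be implemented on the server. The phenotype-assisted diagnosis unit uses the multi-level structure similarity algorithm to traverse the feature relationship database, calculates the similarity value between the feature set A corresponding to each standard monogenic disease name and the feature set I, and constructs the standard monogenic disease name set P. The genetic-assisted diagnosis unit obtains, from the preset gene list file, a plurality of genes corresponding to the monogenic disease name, calculates the pathogenicity score of each gene using the pathogenicity scoring algorithm, and constructs the standard monogenic disease name set G. Finally, the recommendation output unit, such as a display, outputs the recommendation result of standard monogenic disease names based on the intersection of set G and set P and on the candidate output order of the standard monogenic disease names.
Those of ordinary skill in the art will understand that all or part of the steps of the above method can be carried out by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, includes the steps of the method of the above embodiments. The storage medium can be ROM/RAM, a magnetic disk, an optical disc, a memory card, or the like.
The above are only specific embodiments of the present invention, and the scope of protection of the present invention is not limited to them. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. The scope of protection of the present invention shall therefore be determined by the scope of protection of the claims.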

Claims (11)

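  1. A method for recommending monogenic disease names based on clinical features and sequence variation, characterized by comprising:
    obtaining a patient's case information, the case information including a gene sequence, a feature set I and a monogenic disease name;
    aligning the gene sequence with the human reference genome to obtain alignment data, and obtaining the impact score of each genetic variant from the alignment data;
    traversing the feature set A corresponding to each standard monogenic disease name in a feature relationship database, calculating the set similarity value between each feature set A and the feature set I, outputting the similar standard monogenic disease names and their corresponding genes as candidates in descending order of similarity value, and collecting the candidate standard monogenic disease names to construct a standard monogenic disease name set P;
    obtaining, from a preset gene list file, a plurality of genes corresponding to the monogenic disease name, calculating the pathogenicity score of each gene based on the impact score of each genetic variant in the gene, the inheritance pattern of the genetic variants, the association with known diseases and the similarity value corresponding to the gene, outputting the corresponding standard monogenic disease names as candidates in descending order of pathogenicity score, and collecting the candidate standard monogenic disease names to construct a standard monogenic disease name set G; and
    outputting the recommendation result of standard monogenic disease names based on the intersection of the standard monogenic disease name set G and the standard monogenic disease name set P, and on the candidate output order of the standard monogenic disease names.
  2. The method according to claim 1, characterized in that, before the step of traversing the feature set A corresponding to each standard monogenic disease name in the feature relationship database, calculating the set similarity value between each feature set A and the feature set I, and outputting the similar standard monogenic disease names and their corresponding genes as candidates in descending order of similarity value, the method further comprises:
    obtaining known standard monogenic disease names and their corresponding standard clinical features from public databases and literature databases of monogenic diseases;
    establishing, based on the known standard monogenic disease names and their corresponding standard clinical features, the feature relationship database of standard monogenic disease names and standard clinical features;
    calculating, for each standard monogenic disease name, the contribution $c_i$ of each corresponding standard clinical feature to that monogenic disease; and
    obtaining data from the feature relationship database and constructing, based on HPO, a standardized clinical feature phenotype tree for monogenic diseases;
    wherein the standardized clinical feature phenotype tree consists of a plurality of stem nodes and at least one branch node associated with each stem node, each branch node representing one standardized clinical feature and each stem node representing an index of the associated standardized clinical features.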
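  3. The method according to claim 1 or 2, characterized in that the method of traversing the feature set A corresponding to each standard monogenic disease name in the feature relationship database, calculating the set similarity value between each feature set A and the feature set I, and outputting the similar standard monogenic disease names and their corresponding genes as candidates in descending order of similarity value comprises:
    marking the nodes of the clinical features in the feature set I on the standardized clinical feature phenotype tree;
    traversing to the n-th standard monogenic disease name in the feature relationship database and marking the nodes of the standard clinical features in its corresponding feature set A on the standardized clinical feature phenotype tree, the initial value of n being 1;
    matching, based on the node marks on the standardized clinical feature phenotype tree, the best standard clinical feature in the feature set A for each clinical feature in the feature set I;
    calculating the set similarity value between the feature set I and the current feature set A from the similarity value between each clinical feature and its corresponding best standard clinical feature; and
    setting n=n+1 and traversing to the n-th standard monogenic disease name in the feature relationship database again, until all standard monogenic disease names in the feature relationship database have been traversed, and then collecting and sorting the set similarity values between the feature set I and each feature set A for candidate output.
  4. The method according to claim 3, characterized in that the method of matching, based on the node marks on the standardized clinical feature phenotype tree, the best standard clinical feature in the feature set A for each clinical feature in the feature set I comprises:
    the feature set I including a plurality of clinical features, and the feature set A including a plurality of standard clinical features;
    traversing to the i-th clinical feature in the feature set I and selecting from the feature set A the standard clinical feature with the highest similarity to the i-th clinical feature as the best standard clinical feature corresponding to the i-th clinical feature, the initial value of i being 1;
    setting i=i+1 and traversing to the i-th clinical feature in the feature set I again, until all clinical features in the feature set I have been traversed, so that a plurality of best standard clinical features in one-to-one correspondence with the clinical features in the feature set I are selected from the feature set A corresponding to the n-th standard monogenic disease name.
  5. The method according to claim 4, characterized in that the method of selecting from the feature set A the standard clinical feature with the highest similarity to the i-th clinical feature comprises:
    traversing to the j-th standard clinical feature in the feature set A and determining, based on the established index, whether the j-th standard clinical feature and the i-th clinical feature share the same stem node $B_t$, the initial value of j being 1;
    if not, taking the similarity value between the j-th standard clinical feature and the i-th clinical feature to be zero;
    if so, calculating the similarity value between the j-th standard clinical feature and the i-th clinical feature using the multi-level structure similarity algorithm;
    setting j=j+1 and traversing to the j-th standard clinical feature in the feature set A again, and continuing the similarity calculation between the j-th standard clinical feature and the i-th clinical feature, until all standard clinical features in the feature set A have been traversed, so that a plurality of similarity values in one-to-one correspondence with the standard clinical features in the feature set A are obtained;
    selecting, from the plurality of similarity values, the standard clinical feature corresponding to the maximum value as the best standard clinical feature corresponding to the i-th clinical feature.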
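  6. The method according to any one of claims 1 to 5, characterized in that the method of aligning the gene sequence with the human reference genome to obtain alignment data, and obtaining the impact score of each genetic variant from the alignment data comprises:
    labeling the gene sequences with attributes, wherein in the single-sample detection mode the gene sequences are one group of gene sequences of the person to be tested, and in the family detection mode the gene sequences are one group of gene sequences of the person to be tested and at least one group of gene sequences of a lineal relative of the person to be tested;
    aligning each group of gene sequences with the human reference genome to obtain a corresponding number of sets of alignment data;
    obtaining the length information, position information and base change information of the genetic variants from each set of alignment data, identifying the variant type based on the length information of the genetic variant, and predicting the variant function based on the position information and base change information of the genetic variant, the variant types including SNP variants and Indel variants, and the variant function categories including deleterious, mildly deleterious and essentially harmless;
    annotating, for the identified variant type of each genetic variant, the gene in which the genetic variant is located and its population frequency, and determining its family inheritance pattern in the family detection mode;
    grading the clinical significance of each genetic variant based on its length information, position information, population frequency, predicted variant function and family inheritance pattern, the clinical significance grades comprising five levels: pathogenic, likely pathogenic, uncertain significance, likely benign and benign; and
    calculating the impact score of each genetic variant in a gene according to one or more of the clinical significance grade, the population frequency, the definiteness of the pathogenic site and the predicted variant function of each genetic variant.
  7. The method according to any one of claims 1 to 6, characterized in that the method of obtaining, from a preset gene list file, a plurality of genes corresponding to the monogenic disease name, and calculating the pathogenicity score of each gene based on the impact score of each genetic variant in the gene, the inheritance pattern of the genetic variants, the association with known diseases and the similarity value corresponding to the gene comprises:
    obtaining the genetic variants in the gene and matching the impact score of each genetic variant; and
    calculating the pathogenicity score of each gene separately using the pathogenicity score formula $Score_g=\max(Score_v)+w_eS_e+w_tS_t+w_{MLS}S_{MLS}$, where $\max(Score_v)$ is the maximum impact score of the genetic variants in the gene, $S_e$ is the value assigned to the association of the gene with known diseases, $S_t$ is the value assigned to the inheritance pattern of the genetic variants, $S_{MLS}$ is the similarity value corresponding to the gene, $w_e$ is the weight of $S_e$, $w_t$ is the weight of $S_t$, and $w_{MLS}$ is the weight of $S_{MLS}$.
  8. The method according to any one of claims 1 to 7, characterized in that, before the corresponding standard monogenic disease names are output as candidates in descending order of pathogenicity score, the method further comprises:
    for the candidate standard monogenic disease names, filtering out, by means of a blacklist, the standard monogenic disease names that correspond to false-positive variant sites.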
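  9. A system for recommending monogenic disease names based on clinical features and sequence variation, comprising:
    an input unit, configured to obtain a patient's case information, the case information including a gene sequence, a feature set I and a monogenic disease name;
    a sequence alignment unit, configured to align the gene sequence with the human reference genome to obtain alignment data, and to obtain the impact score of each genetic variant from the alignment data;
    a phenotype-assisted diagnosis unit, configured to traverse the feature set A corresponding to each standard monogenic disease name in a feature relationship database, calculate the set similarity value between each feature set A and the feature set I, output the similar standard monogenic disease names and their corresponding genes as candidates in descending order of similarity value, and collect the candidate standard monogenic disease names to construct a standard monogenic disease name set P;
    a genetic-assisted diagnosis unit, configured to obtain, from a preset gene list file, a plurality of genes corresponding to the monogenic disease name, calculate the pathogenicity score of each gene based on the impact score of each genetic variant in the gene, the inheritance pattern of the genetic variants, the association with known diseases and the similarity value corresponding to the gene, output the corresponding standard monogenic disease names as candidates in descending order of pathogenicity score, and collect the candidate standard monogenic disease names to construct a standard monogenic disease name set G; and
    a recommendation output unit, configured to output the recommendation result of standard monogenic disease names based on the intersection of the standard monogenic disease name set G and the standard monogenic disease name set P, and on the candidate output order of the standard monogenic disease names.
  10. A non-volatile computer-readable storage medium storing computer-readable instructions, wherein the computer-readable instructions, when run by a processor, execute the steps of the method according to any one of claims 1 to 8.
  11. A computer device, comprising a memory and one or more processors, the memory storing computer-readable instructions, wherein the computer-readable instructions, when executed by the processors, cause the one or more processors to execute the steps of the method according to any one of claims 1 to 8.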
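
The matching loop recited in claims 3 to 5 can be sketched as follows. This is a hedged illustration, not the claimed implementation: the multi-level structure similarity algorithm is passed in as a caller-supplied function `similarity`; the stem-node index is modeled as a hypothetical dictionary `stem_of` that maps each feature to a single stem node; and aggregating the best-match similarities by their mean is an assumption, because the aggregation is not defined in this section.

```python
def best_match_similarity(feature, standard_features, stem_of, similarity):
    """Highest similarity between one clinical feature and any feature of set A."""
    best = 0.0
    for standard in standard_features:
        stem = stem_of.get(feature)
        if stem is None or stem != stem_of.get(standard):
            sim = 0.0  # no shared stem node: similarity is taken to be zero
        else:
            sim = similarity(feature, standard)
        best = max(best, sim)
    return best


def set_similarity(feature_set_i, feature_set_a, stem_of, similarity):
    """Aggregate the best-match similarities (assumption: arithmetic mean)."""
    if not feature_set_i:
        return 0.0
    total = sum(
        best_match_similarity(f, feature_set_a, stem_of, similarity)
        for f in feature_set_i
    )
    return total / len(feature_set_i)


def rank_disease_names(feature_set_i, database, stem_of, similarity):
    """Score every standard disease name and sort in descending order.

    database maps each standard disease name to its feature set A.
    """
    scored = [
        (name, set_similarity(feature_set_i, feature_set_a, stem_of, similarity))
        for name, feature_set_a in database.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Checking the stem-node index first avoids calling the similarity function for feature pairs that lie in unrelated parts of the phenotype tree.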
PCT/CN2020/111133 2020-06-08 2020-08-25 Method and system for recommending monogenic disease names based on clinical features and sequence variation WO2021248695A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010511089.3A CN111883210B (zh) 2020-06-08 2020-06-08 Method and system for recommending monogenic disease names based on clinical features and sequence variation
CN202010511089.3 2020-06-08

Publications (1)

Publication Number Publication Date
WO2021248695A1 true WO2021248695A1 (zh) 2021-12-16

Family

ID=73154061

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/111133 WO2021248695A1 (zh) 2020-06-08 2020-08-25 Method and system for recommending monogenic disease names based on clinical features and sequence variation

Country Status (2)

Country Link
CN (1) CN111883210B (zh)
WO (1) WO2021248695A1 (zh)

Also Published As

Publication number Publication date
CN111883210A (zh) 2020-11-03
CN111883210B (zh) 2021-05-25


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20940035

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20940035

Country of ref document: EP

Kind code of ref document: A1