WO2021248694A1 - Method and system for interpreting reports of structural variations in patient sample data - Google Patents

Method and system for interpreting reports of structural variations in patient sample data

Info

Publication number
WO2021248694A1
WO2021248694A1 PCT/CN2020/111132 CN2020111132W WO2021248694A1 WO 2021248694 A1 WO2021248694 A1 WO 2021248694A1 CN 2020111132 W CN2020111132 W CN 2020111132W WO 2021248694 A1 WO2021248694 A1 WO 2021248694A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
clinical
feature set
disease
standard
Prior art date
Application number
PCT/CN2020/111132
Other languages
English (en)
French (fr)
Inventor
马旭
蔡瑞琨
曹宗富
喻浴飞
陈翠霞
Original Assignee
国家卫生健康委科学技术研究所
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 国家卫生健康委科学技术研究所
Publication of WO2021248694A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H15/00 ICT specially adapted for medical reports, e.g. generation or transmission thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures

Definitions

  • The present invention relates to the technical field of medical information, and in particular to a method and system for interpreting reports of structural variations in patient sample data.
  • The application of next-generation sequencing technology in research on disease-causing mutations and in medical practice is becoming increasingly widespread.
  • Whole-genome sequencing and whole-exome sequencing are among the ideal methods for disease genomics research, identification of pathogenic mutations, and molecular diagnosis of patients.
  • However, many problems remain in data analysis and clinical interpretation based on next-generation sequencing technology. They are not conducive to the development of precision medicine and hinder etiological research on diseases related to structural variations.
  • These problems include: first, the analysis process for identifying structural variations is too complicated to be used by staff of medical institutions and other non-bioinformatics personnel; second, identifying pathogenic mutations requires extensive manual screening, in which structural variations are checked and confirmed one by one, so efficiency is very low.
  • The purpose of the present invention is to provide a report interpretation method and system for structural variation in patient sample data that can accurately interpret patient sample data and improve interpretation efficiency while lowering the threshold for report interpretation.
  • one aspect of the present invention provides a report interpretation method for structural variations in patient sample data, including:
  • obtaining the patient's sample data to be tested, where the sample data to be tested includes a gene sequence, a disease name, and feature set I;
  • the construction method of the reference baseline includes:
  • if the population gene sequences are whole-exome sequencing data, inputting the gene sequences of multiple people with normal phenotypes into ExomeDepth software to construct the reference baseline.
  • the method for annotating the structural variation and obtaining the pathogenicity classification of the structural variation according to the annotation result includes:
  • the annotation results include one or more of the population frequency, the genes contained in the structural variation and their corresponding disease names, the variant type, and the pathogenicity of the variant;
  • the structural variations are classified by pathogenicity according to the annotation results, and the pathogenicity classification includes three types: pathogenic or likely pathogenic; pathogenic or likely pathogenic, but with annotation results that also include benign annotations; and other cases.
  • the method for constructing a gene recommendation list by grabbing relevant disease genes from public databases and literature databases according to the disease name and/or the clinical features in the feature set I includes:
  • the step of traversing the feature set A corresponding to each standard disease name in the feature relational database, and separately calculating the set similarity value of each feature set A and the feature set I further includes:
  • the standardized clinical feature phenotype tree is composed of multiple stem nodes and at least one branch node associated with each stem node. Each branch node represents a standardized clinical feature, and each stem node represents the index of its associated standardized clinical features.
  • traversing the feature set A corresponding to each standard disease name in the feature relational database, respectively calculating the set similarity value of each feature set A and feature set I, and recommending multiple standard disease names according to the similarity value includes:
  • traversing the n-th standard disease name in the feature relational database, and marking the nodes of the standard clinical features in the corresponding feature set A on the standardized clinical feature phenotype tree, where the initial value of n is 1;
  • the best standard clinical feature corresponding to each clinical feature in feature set I is matched from feature set A;
  • the set similarity value of feature set I and current feature set A is calculated;
  • the method of matching the best standard clinical feature corresponding to each clinical feature in feature set I from feature set A based on the node labels on the standardized clinical feature phenotype tree includes:
  • the feature set I includes multiple clinical features, and the feature set A includes multiple standard clinical features;
  • the method for selecting the standard clinical feature with the highest similarity to the i-th clinical feature from the feature set A includes:
  • the standard clinical feature corresponding to the maximum of the multiple similarity values is selected as the best standard clinical feature corresponding to the i-th clinical feature.
  • the method for outputting multiple structural variations in descending order based on the importance of the influencing elements corresponding to each structural variation includes:
  • the report interpretation method for structural variation in patient sample data provided by the present invention has the following beneficial effects:
  • In the report interpretation method for structural variation in patient sample data, sample data to be tested that includes a gene sequence, a disease name, and feature set I is first obtained. The gene sequence is compared with a reference baseline to detect and annotate the structural variations in the sample data to be tested, and each structural variation is then classified and scored for pathogenicity based on the annotation results. Relevant disease genes are then retrieved from public databases and literature databases based on the disease name and/or feature set I to construct the gene recommendation list.
  • In addition, the feature set A corresponding to each standard disease name in the feature relational database is traversed, the set similarity value of each feature set A and feature set I is calculated, and multiple standard disease names are recommended according to the similarity value. Finally, the structural variations are output in descending order of the importance of the influencing factors corresponding to each structural variation, and an interpretation report is generated.
  • The influencing factors include one or more of: the pathogenicity classification of the structural variation, the consistency between the disease name and the disease name in the annotation result, whether a gene contained in the structural variation in the annotation result is a known disease-causing gene in the gene recommendation list, the maximum similarity value between feature set I and feature set A, the population frequency, and the location of the variation. That is, the present invention evaluates the pathogenicity of structural variations from multiple dimensions and can therefore accurately interpret the patient's sample to be tested.
  • The report interpretation method provided by the present invention can automate the whole process from the sample data to be tested to the recommendation of pathogenic structural variations, which greatly reduces the workload of manual interpretation and analysis of structural variation data and improves the efficiency of structural variation analysis and clinical interpretation.
  • Another aspect of the present invention provides a report interpretation system for structural variations in patient sample data, including:
  • the input unit is used to obtain the patient's sample data to be tested, where the sample data to be tested includes a gene sequence, a disease name, and feature set I;
  • the annotation unit is used to compare the gene sequence with the reference baseline, detect multiple structural variations in the sample data to be tested and annotate them one by one, and classify the pathogenicity of each structural variation according to the annotation results;
  • the recommendation list generating unit is configured to construct a gene recommendation list by retrieving relevant disease genes from public databases and literature databases according to the disease name and/or the clinical features in feature set I;
  • the disease name recommendation unit is used to traverse the feature set A corresponding to each standard disease name in the feature relational database, calculate the set similarity value of each feature set A and feature set I, and recommend multiple standard disease names according to the similarity value;
  • the report output unit outputs the structural variations in descending order of the importance of the influencing factors corresponding to each structural variation, and generates an interpretation report.
  • the influencing factors include one or more of: the pathogenicity classification of the structural variation, the consistency between the disease name and the disease name in the annotation result, whether a gene contained in the structural variation in the annotation result is a known disease-causing gene in the gene recommendation list, the maximum similarity value between feature set I and feature set A, the population frequency, and the location of the variation.
  • the beneficial effect of the report interpretation system for structural variation in patient sample data provided by the present invention is the same as the beneficial effect of the report interpretation method for structural variation in patient sample data provided by the above technical solution, and will not be repeated here.
  • A third aspect of the present invention provides a computer-readable storage medium, for example a non-volatile computer-readable storage medium, on which computer-readable instructions are stored, the computer-readable instructions being executed by a processor.
  • the beneficial effects of the computer-readable storage medium provided by the present invention are the same as the beneficial effects of the report interpretation method for structural variation in patient sample data provided by the above technical solutions, and will not be repeated here.
  • FIG. 1 is a schematic flowchart of the report interpretation method for structural variation in patient sample data in Embodiment 1;
  • FIG. 2 is an example diagram of node labels on the standardized clinical feature phenotype tree in Embodiment 1;
  • FIG. 3 is a structural block diagram of the report interpretation system for structural variation in patient sample data in Embodiment 2;
  • FIG. 4 is an example diagram of the application environment architecture of the report interpretation method for structural variation in patient sample data in Embodiment 4.
  • this embodiment provides a method for interpreting a report of structural variation in patient sample data, which is characterized in that it includes:
  • obtaining the patient's sample data to be tested, which includes a gene sequence, a disease name, and feature set I; comparing the gene sequence with the reference baseline to detect multiple structural variations in the sample data to be tested and annotating them one by one, while classifying the pathogenicity of each structural variation according to the annotation results; retrieving relevant disease genes from public databases and literature databases according to the disease name and/or the clinical features in feature set I to construct a gene recommendation list; traversing the feature set A corresponding to each standard disease name in the feature relational database, calculating the set similarity value of each feature set A and feature set I, and recommending multiple standard disease names according to the similarity value; outputting the structural variations in descending order of the importance of the influencing factors corresponding to each structural variation and generating an interpretation report.
  • The influencing factors include one or more of: the pathogenicity classification of the structural variation, the consistency between the disease name and the disease name in the annotation result, whether a gene contained in the structural variation in the annotation result is a known disease-causing gene in the gene recommendation list, the maximum similarity value between feature set I and feature set A, the population frequency, and the location of the variation.
  • In this method, sample data to be tested that includes a gene sequence, a disease name, and feature set I is first obtained. The gene sequence is compared with a reference baseline to detect and annotate the structural variations in the sample data to be tested, and each structural variation is then classified and scored for pathogenicity according to the annotation results. Relevant disease genes are then retrieved from public databases and literature databases based on the disease name and/or feature set I to construct a gene recommendation list. In addition, the feature set A corresponding to each standard disease name in the feature relational database is traversed, the set similarity value of each feature set A and feature set I is calculated, and multiple standard disease names are recommended according to the similarity value. Finally, the structural variations are output in descending order of the importance of their influencing factors to generate an interpretation report.
  • That is, this embodiment evaluates the pathogenicity of structural variations from multiple dimensions and can therefore accurately interpret the patient's sample to be tested.
  • The report interpretation method provided in this embodiment can automate the whole process from the sample data to be tested to the recommendation of pathogenic structural variations, which greatly reduces the workload of manual interpretation and analysis of structural variation data and improves the efficiency of structural variation analysis and clinical interpretation.
  • The quality check indicators include: total number of sequences, sequence length, per-base quality, per-sequence quality, base content, GC content, per-base N content, sequence length distribution, duplicated sequences, overrepresented sequences, adapter sequences, K-mer content, etc.
  • A quality check is performed on the gene sequence of the sample data to be tested and on the gene sequences of the normal-phenotype population, and gene sequences that fail the quality check are marked. The gene sequences of the sample data to be tested and of the normal-phenotype population that pass the quality check are input into BWA software for alignment against the human reference genome hg19 or hg38. The alignment results are preprocessed, for example by deduplication, indel region correction, and base quality correction, to obtain the alignment data;
  • The alignment data include the position of the sequence on the chromosome, the alignment quality, the position of the paired (mate) sequence on the chromosome, the insert length, the base composition of the sequence, or the sequence quality.
  • Picard MarkDuplicates is used to deduplicate the alignment results. Indel regions are corrected by using GATK RealignerTargetCreator to generate the indel list, adding known indel loci from the 1000 Genomes Project database, and using GATK IndelRealigner to locally realign these indel regions. Base quality is corrected by using GATK BaseRecalibrator to recalibrate base quality scores in combination with known site information.
  • a summary analysis of the comparison data can be performed.
  • The summary analysis covers the quality of the alignment data, the number of raw paired-end reads, the number of reads aligned to the human reference genome, the average read length, the proportion of indels, and whether the forward and reverse strands are balanced.
  • The sequence coverage of the targeted region can be examined to obtain information such as the genome length, the length of the targeted region, the total number of reads, the number of reads in the targeted region, the number of reads in the non-targeted region, the proportion of reads in the targeted region, and the average sequencing depth of the targeted region.
  • The method for constructing a reference baseline includes: acquiring the gene sequences of multiple people with normal phenotypes from the same batch as the sample data to be tested; if the population gene sequences are whole-genome sequencing data, inputting them into CNVKit software to construct the reference baseline; if the population gene sequences are whole-exome sequencing data, inputting them into ExomeDepth software to construct the reference baseline.
  • The process of identifying structural variation in the sample data to be tested is as follows: the sequencing depth inside and outside the target regions of the sample to be tested is calculated, its ratio relative to the reference baseline is computed, and the relative ratio is converted into an absolute copy number. If the absolute copy number is not 2, the region is recognized as a structural variation.
  • If the sample data to be tested is whole-genome sequencing data, structural variations can be identified with CNVKit software; if it is whole-exome sequencing data, structural variations can be identified with ExomeDepth software.
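The depth-ratio step described above can be illustrated with a minimal sketch. It is not how CNVKit or ExomeDepth work internally (those tools apply bias correction, segmentation, and statistical models); it only assumes that depths are already normalized, that the normal copy number is 2, and that the absolute copy number is obtained by rounding 2 × ratio. The function and parameter names are hypothetical.

```python
def absolute_copy_number(sample_depth, baseline_depth, ploidy=2):
    """Convert a depth ratio against the reference baseline into a copy number.

    Assumption: copy number is approximated as round(ploidy * ratio).
    """
    if baseline_depth <= 0:
        raise ValueError("baseline depth must be positive")
    ratio = sample_depth / baseline_depth
    return round(ploidy * ratio)


def call_structural_variants(sample_depths, baseline_depths, ploidy=2):
    """Return (region, copy_number) for every region whose copy number is not 2."""
    calls = []
    for region, depth in sample_depths.items():
        copy_number = absolute_copy_number(depth, baseline_depths[region], ploidy)
        if copy_number != ploidy:
            calls.append((region, copy_number))
    return calls


# Illustrative, made-up depths for three regions.
sample = {"region_1": 100.0, "region_2": 52.0, "region_3": 148.0}
baseline = {"region_1": 100.0, "region_2": 100.0, "region_3": 100.0}
variant_calls = call_structural_variants(sample, baseline)
```

In this toy input, region_2 looks like a single-copy loss and region_3 like a single-copy gain, while region_1 stays at the normal copy number and is not reported.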
  • The input gene sequence is compared against a reference baseline established from the gene sequences of multiple people with normal phenotypes in the same batch as the sample to be tested.
  • The sequencing depth inside and outside the target regions is calculated for the gene sequence of each person with a normal phenotype, all control samples are merged, and systematic errors such as GC content are corrected to construct the basic reference baseline.
  • CNVKit software is used for whole-genome sequencing data and ExomeDepth software for whole-exome sequencing data. Establishing the reference baseline from multiple normal-phenotype gene sequences of the same batch reduces deviations in the comparison results caused by systematic errors.
  • the method for annotating the structural variation and obtaining the pathogenicity classification of the structural variation according to the annotation result includes:
  • the AnnotSV software is used to annotate each structural variation separately.
  • the annotation results include one or more of the population frequency, the genes contained in the structural variation and their corresponding disease names, the variant type, and the pathogenicity of the variant; the structural variations are classified by pathogenicity according to the annotation results.
  • the pathogenicity classification includes three types: pathogenic or likely pathogenic; pathogenic or likely pathogenic, but with annotation results that also include benign annotations; and other cases.
  • AnnotSV software uses the classification standards defined by the American College of Medical Genetics and Genomics (ACMG) for pathogenicity classification. The specific classification steps are well known to those skilled in the art and are not repeated in this embodiment.
  • the method for constructing a gene recommendation list by grabbing relevant disease genes from public databases and literature databases according to the disease name and/or clinical features in the feature set I includes:
  • according to the disease name, retrieving the relevant first disease genes from public databases and literature databases; according to the clinical features in feature set I, traversing the sets of clinical features corresponding to each disease in public databases and literature databases; using the clinical feature enrichment analysis algorithm to calculate the significance value of feature set I against the set corresponding to each disease; matching and outputting the second disease genes corresponding to the multiple significance values; and combining the first disease genes and the second disease genes to build the gene recommendation list.
  • There are two sources of gene data in the gene recommendation list.
  • One is to retrieve the first disease genes related to the disease name in the sample data to be tested from the disease-gene association databases in the public database and the literature database;
  • the other is to traverse the disease-gene association databases in public databases and literature databases according to the clinical features in feature set I to obtain all disease names, where the clinical features corresponding to each disease form a standard set; the number of standard sets X is counted, and each standard set is labeled in sequence.
  • the clinical feature enrichment analysis algorithm is used to calculate the significance value of feature set I and each standard set; the specific algorithm is as follows:
  • Step S1: select the Y-th standard set from the X standard sets as the to-be-processed set B, with the initial value of Y set to 1;
  • Step S2: use the Jaccard similarity algorithm to calculate the similarity coefficient between the to-be-processed set B and feature set I;
  • Step S3: use the Jaccard distance algorithm to calculate the distance vector between feature set I and the to-be-processed set B based on the similarity coefficient;
  • Step S4: calculate the clinical feature enrichment factor coefficient f of feature set I and set B from the counts a, b, c, and d, where a is the number of clinical features in feature set I that are included in set B, b is the number of clinical features in the disease-gene association database that are included in set B, c is the number of clinical features in feature set I that are not included in set B, and d is the number of clinical features in the disease-gene association database that are not included in set B;
  • Step S5: based on the value of the distance vector and the clinical feature enrichment factor coefficient f, filter the to-be-processed set B, so that a set B that has not been eliminated proceeds to step S6;
  • a table can be used to assist the calculation of the significance value, with the counts a, b, c, and d defined as in step S4.
  • the Jaccard similarity coefficient is used to measure the similarity between two sets. It is defined as the number of elements in the intersection of the two sets divided by the number of elements in their union.
  • the corresponding calculation formula is $J(I, B) = \dfrac{|I \cap B|}{|I \cup B|}$
  • the Jaccard distance is $d_J(I, B) = 1 - J(I, B)$, where I is feature set I and B is the to-be-processed set B.
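Steps S2 and S3 can be sketched in a few lines. This is a minimal illustration using Python sets; the function names and the example clinical features are hypothetical, and returning 0 for two empty sets is a convention chosen here, not something stated in the text.

```python
def jaccard_similarity(feature_set_i, candidate_set_b):
    """|I ∩ B| / |I ∪ B|; defined here as 0.0 when both sets are empty."""
    union = feature_set_i | candidate_set_b
    if not union:
        return 0.0
    return len(feature_set_i & candidate_set_b) / len(union)


def jaccard_distance(feature_set_i, candidate_set_b):
    """Jaccard distance is one minus the Jaccard similarity coefficient."""
    return 1.0 - jaccard_similarity(feature_set_i, candidate_set_b)


# Illustrative, made-up clinical features.
feature_set_i = {"seizure", "microcephaly", "hypotonia"}
standard_set_b = {"seizure", "hypotonia", "ataxia", "nystagmus"}
similarity = jaccard_similarity(feature_set_i, standard_set_b)  # 2 shared / 5 total
distance = jaccard_distance(feature_set_i, standard_set_b)
```

Exact string matching is used for set membership here; in practice the clinical features would need to be normalized to the same vocabulary before the comparison is meaningful.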
  • step S5 includes: when the value of the distance vector is less than the first threshold and the value of the clinical feature enrichment factor coefficient f is greater than the second threshold, the set B is retained, otherwise the set B to be processed is eliminated.
  • the first threshold and the second threshold can be freely set by the user, and the default first threshold is 1 and the second threshold is 0.
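The filtering rule of step S5 can be written as a small predicate. This is a sketch only: the enrichment factor coefficient f is taken as an already computed input because its formula is not reproduced in this text, and the function and parameter names are hypothetical.

```python
def keep_candidate_set(distance, enrichment_f, first_threshold=1.0, second_threshold=0.0):
    """Retain set B only if distance < first threshold and f > second threshold.

    The defaults mirror the stated defaults (first threshold 1, second threshold 0).
    """
    return distance < first_threshold and enrichment_f > second_threshold


# Illustrative (distance, f) pairs for three candidate standard sets.
candidates = {"set_1": (0.6, 1.5), "set_2": (1.0, 1.5), "set_3": (0.6, 0.0)}
retained = [name for name, (dist, f) in candidates.items() if keep_candidate_set(dist, f)]
```

Because a Jaccard distance never exceeds 1 and equals 1 only when the two sets share no element, the default first threshold of 1 removes exactly those standard sets that have no clinical feature in common with feature set I.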
  • the multiple significance values are sorted from low to high, and the matched second disease genes are output in order.
  • the public database is the MedGen database
  • the literature database is the PubMed database.
  • the step of traversing the feature set A corresponding to each standard disease name in the feature relational database, and separately calculating the set similarity value of each feature set A and the feature set I further includes:
  • the calculation method of the contribution $c_i$ of each standard clinical feature corresponding to each disease name to the disease is as follows:
  • k is the correction factor, with $k > 1$, and the feature relational database is used as the reference database.
  • traversing the feature set A corresponding to each standard disease name in the feature relational database, respectively calculating the set similarity value of each feature set A and feature set I, and recommending multiple standard disease names according to the similarity value includes:
  • the method of matching the best standard clinical feature corresponding to each clinical feature in feature set I from feature set A based on the node labels on the standardized clinical feature phenotype tree includes:
  • the methods for screening the standard clinical features with the highest similarity to the i-th clinical feature from feature set A include:
  • the length of the directed set IB is the number of nodes in its path, $L_{IB}$;
  • the length of the directed set AB is the number of nodes in its path, $L_{AB}$; the intersection IAB of the directed set IB and the directed set AB is extracted;
  • the length of the intersection IAB is the number of common nodes in the two paths, $L_{IAB}$;
  • SM represents the similarity value between the j-th standard clinical feature and the i-th clinical feature across multiple levels of the phenotype tree; SI represents the similarity value between the j-th standard clinical feature and the i-th clinical feature at the same level of the phenotype tree; a weight coefficient balances the two.
  • the stem node shared by the i-th clinical feature $I_i$ and the j-th standard clinical feature $A_j$ is $B_t$.
  • the calculation method is: all nodes on the connecting path between $I_i$ and $B_t$ form a directed set IB; the number of elements in IB is denoted $N_{IB}$, and the length of IB is defined as the number of nodes on the path, $L_{IB} = N_{IB}$.
  • All nodes on the connecting path between $A_j$ and $B_t$ form a directed set AB.
  • the number of elements in the directed set AB is denoted $N_{AB}$, and $L_{AB} = N_{AB}$.
  • the intersection of the directed set IB and the directed set AB is denoted IAB
  • the number of elements in the intersection IAB is denoted $N_{IAB}$
  • the length of the set IAB is defined as the number of nodes on the common path, denoted $L_{IAB}$
  • $L_{IAB} = N_{IAB}$
  • $SM = L_{IAB} / \max(L_{AB}, L_{IB})$
  • $SI = 1 / (L_{AB} + L_{IB} - 2L_{IAB} + 1)$
  • the weight coefficient takes a value in the open interval (0, 1);
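The path-based quantities above can be sketched on a small tree. Several points are assumptions, since the text does not fix them: the tree is stored as a child-to-parent map in which stem nodes have parent `None`; each path includes both the feature node and the stem node; features under different stem nodes get similarity 0; and SM and SI are combined as `omega * SM + (1 - omega) * SI`, which is one plausible reading because the combining formula itself is not reproduced here. The name `omega` and the example tree are hypothetical.

```python
def path_to_stem(node, parent):
    """Nodes on the path from a feature node up to its stem node, both included."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def feature_similarity(feature_i, feature_a, parent, omega=0.5):
    """Similarity of clinical feature I_i and standard clinical feature A_j."""
    path_ib = path_to_stem(feature_i, parent)
    path_ab = path_to_stem(feature_a, parent)
    if path_ib[-1] != path_ab[-1]:
        return 0.0  # assumption: no shared stem node means no similarity
    l_ib, l_ab = len(path_ib), len(path_ab)
    l_iab = len(set(path_ib) & set(path_ab))  # nodes common to both paths
    sm = l_iab / max(l_ab, l_ib)
    si = 1.0 / (l_ab + l_ib - 2 * l_iab + 1)
    # Assumed combination of the multi-level (SM) and same-level (SI) terms.
    return omega * sm + (1.0 - omega) * si


# Illustrative tree: stems "B" and "C"; "y" and "z" are siblings under "x".
parent = {"B": None, "x": "B", "y": "x", "z": "x", "w": "B", "C": None, "q": "C"}
same = feature_similarity("y", "y", parent)
siblings = feature_similarity("y", "z", parent)
unrelated = feature_similarity("y", "q", parent)
```

With these definitions, identical features give SM = SI = 1, and the score falls as the two paths diverge closer to the stem node.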
  • the method of calculating the set similarity value between the feature set I and the current feature set A includes:
  • for each clinical feature $I_i$ in feature set I, the standard clinical feature $A_j$ with the greatest similarity can be found in feature set A; that is, each clinical feature $I_i$ obtains a similarity value with feature set A, $S(I_i, A) = \max_j S(I_i, A_j)$, where $S(I_i, A)$ indicates the similarity value between the clinical feature $I_i$ and feature set A.
  • the similarity value of feature set I and feature set A is defined as the sum of the similarities between each clinical feature $I_i$ in feature set I and feature set A: $S_{IA} = \sum_i S(I_i, A)$, where $S_{IA}$ represents the similarity value between feature set I and feature set A.
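The best-match-then-sum definition above can be sketched as follows. The pairwise similarity is passed in as a function so the sketch does not depend on any particular tree implementation; the exact-match similarity used in the example, the disease names, and all function names are hypothetical stand-ins.

```python
def set_similarity(features_i, features_a, pair_similarity):
    """Sum over I_i of the best similarity to any A_j; also return the best matches."""
    if not features_a:
        return 0.0, {}
    total = 0.0
    best_matches = {}
    for feature_i in features_i:
        scored = [(pair_similarity(feature_i, feature_a), feature_a) for feature_a in features_a]
        best_score, best_feature = max(scored, key=lambda item: item[0])
        best_matches[feature_i] = best_feature
        total += best_score
    return total, best_matches


def recommend_diseases(features_i, disease_features, pair_similarity, top_n=3):
    """Rank standard disease names by set similarity, highest first."""
    ranked = sorted(
        ((set_similarity(features_i, feats, pair_similarity)[0], name)
         for name, feats in disease_features.items()),
        key=lambda item: item[0],
        reverse=True,
    )
    return [name for _, name in ranked[:top_n]]


def exact_match(a, b):
    """Stand-in pairwise similarity: 1.0 for identical features, else 0.0."""
    return 1.0 if a == b else 0.0


patient_features = ["seizure", "hypotonia"]
reference = {
    "disease_x": ["seizure", "hypotonia", "ataxia"],
    "disease_y": ["hypotonia"],
    "disease_z": ["nystagmus"],
}
top_two = recommend_diseases(patient_features, reference, exact_match, top_n=2)
```

Note that the sum grows with the number of patient features, so scores are comparable across candidate diseases for one patient but not across patients with different numbers of features.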
  • the above-mentioned embodiment adopts the multi-level structure similarity algorithm, which has the characteristics of high accuracy of standard disease name recommendation.
  • the method of outputting multiple structural variations in descending order based on the importance of the influencing elements corresponding to each structural variation includes:
  • The formula $S = \sum_{i=1}^{f} w_i s_i$ is used to calculate the importance score of each structural variation, where f is the number of influencing factors, $w_i$ is the weight of the i-th influencing factor, and $s_i$ is the value assigned to the i-th influencing factor; the importance scores of all structural variations are collected, and the corresponding structural variations are output in descending order of score.
  • The pathogenicity classification score $s_c$ is set as follows: 5 points for the pathogenic or likely pathogenic level; 3 points for the pathogenic or likely pathogenic level whose annotation results also include benign annotations; 0 points for all other cases. The default value of $w_c$ is 1 and can be adjusted as needed.
  • The score $s_e$ for whether a gene contained in the structural variation in the annotation result is a known disease-causing gene in the gene recommendation list is set as follows: 10 points if such a gene is in the list, 0 points otherwise. The default value of $w_e$ is 1 and can be adjusted as needed.
  • The maximum similarity value between feature set I and feature set A is used as the score $s_h$; the default value of $w_h$ is 5 and can be adjusted as needed.
  • The population frequency score $s_p$ is set as follows: 2 points when the maximum population frequency (MAX_AF) is less than or equal to $10^{-3}$ or no frequency information is available; 0 points when the maximum population frequency is between $10^{-3}$ and 0.05; -5 points when the population frequency is greater than 0.05. The default value of $w_p$ is 1 and can be adjusted as needed.
  • The variation location score $s_q$ is set as follows: 0 points if the structural variation region contains protein-coding regions or other important functional elements (such as splice site regulatory regions), otherwise -2 points. The default value of $w_q$ is 1 and can be adjusted as needed.
  • the recommended information also includes The location of structural variation, the name of the gene covered, the exons, the score results, the location map of the mutation at the chromosome level, and the related phenotypic information and variation frequency information are included.
  • this embodiment provides a report interpretation system for structural variation in patient sample data, including:
  • the input unit is used to obtain the patient's sample data to be tested, the sample data to be tested includes gene sequence, disease name, and feature set I;
  • the annotation unit is used to compare the gene sequence with the reference baseline, detect multiple structural variations in the sample data to be tested and annotate them one by one, and classify the pathogenicity of each structural variation according to the annotation results;
  • the recommendation list generating unit is used to construct a gene recommendation list by retrieving related disease genes from public databases and literature databases according to the disease name and/or the clinical features in the feature set I;
  • the disease name recommendation unit is used to traverse the feature set A corresponding to each standard disease name in the feature relation database, calculate the set similarity value between each feature set A and feature set I, and recommend multiple standard disease names according to the similarity values;
  • the report output unit outputs the multiple structural variations in descending order based on the importance of the influencing elements corresponding to each structural variation, and generates an interpretation report.
  • the influencing elements include one or more of: the pathogenicity classification corresponding to the structural variation; the consistency between the disease name and the disease name in the annotation result; whether a gene contained in the structural variation in the annotation result is a known disease-causing gene in the gene recommendation list; the maximum similarity value between feature set I and feature set A; the population frequency; and the location of the variation.
  • the above-mentioned report interpretation system for structural variations in patient sample data is applied to a computer device, and the computer device includes a processor and a memory connected via a system bus.
  • the processor of the report interpretation system for structural variations in the patient sample data is used to provide calculation and control capabilities.
  • the memory of the report interpretation system for structural variations in the patient sample data includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system and computer readable instructions.
  • the internal memory provides an environment for the operation of the operating system and computer-readable instructions in the non-volatile storage medium.
  • the network interface of the report interpretation system for structural variations in the patient sample data is used to communicate with external sensors.
  • when the computer-readable instructions are executed by the processor, the steps of the above-mentioned report interpretation method for structural variations in patient sample data are implemented, for example by the above-mentioned input unit, annotation unit, recommendation list generating unit, disease name recommendation unit, and report output unit.
  • the beneficial effects of the report interpretation system for structural variations in patient sample data provided in this embodiment are the same as those of the report interpretation method provided in the first embodiment, and are not repeated here.
  • This embodiment provides a computer-readable storage medium, for example a non-volatile computer-readable storage medium, on which computer-readable instructions are stored; when run by a processor, the computer-readable instructions perform the steps of the above-mentioned report interpretation method for structural variations in patient sample data.
  • the beneficial effects of the computer-readable storage medium provided in this embodiment are the same as the beneficial effects of the report interpretation method for structural variation in patient sample data provided by the above technical solutions, and will not be repeated here.
  • FIG. 4 provides a schematic diagram of an environment architecture of an application scenario.
  • An application software can be developed to implement the report interpretation method for structural variation in patient sample data in the foregoing embodiment, and the application software can be installed in a user terminal, and the user terminal is connected to the server to realize communication.
  • the user terminal may be any smart device such as a computer or a tablet computer, and this embodiment only uses a computer as an example for description.
  • for example, the user opens the relevant application on the smart device and enters the patient's sample data to be tested through an input unit such as a keyboard or mouse; the sample data to be tested includes the gene sequence, disease name, and feature set I.
  • the application in the user terminal sends the gene sequence to the annotation unit, sends the disease name and feature set I to the recommendation list generating unit, and sends the feature set I to the disease name recommendation unit.
  • the annotation unit, the recommendation list generating unit and the disease name recommendation unit can all be implemented by the server.
  • finally, a report output unit such as a display outputs the multiple structural variations in descending order based on the importance of the influencing elements corresponding to each structural variation, and generates an interpretation report.
  • the above-mentioned inventive method can be implemented by a program instructing relevant hardware.
  • the above-mentioned program can be stored in a computer readable storage medium.
  • the storage medium of the program may be: ROM/RAM, magnetic disk, optical disk, memory card, etc.


Abstract

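A method and system for interpreting reports of structural variations in patient sample data, which can accurately interpret a patient's sample data, lowering the threshold for report interpretation while improving interpretation efficiency. The method includes: obtaining the patient's sample data to be tested; comparing the gene sequence with a reference baseline, detecting multiple structural variations in the sample data to be tested and annotating them one by one, and obtaining the pathogenicity classification of each structural variation from the annotation results; retrieving related disease genes from public databases and literature databases according to the disease name and/or the clinical features in feature set I to construct a gene recommendation list; traversing the feature set A corresponding to each standard disease name in the feature relation database, calculating the set similarity value between each feature set A and feature set I, and recommending multiple standard disease names according to the similarity values; and outputting the multiple structural variations in descending order based on the importance of the influencing elements corresponding to each structural variation and generating an interpretation report.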

Description

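Method and system for interpreting reports of structural variations in patient sample data
Technical Field
The present invention relates to the technical field of medical information, and in particular to a method and system for interpreting reports of structural variations in patient sample data.
Background
Next-generation sequencing is increasingly used in research on disease-causing mutations and in medical practice. In recent years, a large body of research has confirmed that whole-genome sequencing and whole-exome sequencing are among the ideal methods for disease genomics research and for identifying pathogenic mutations and making molecular diagnoses in patients.
However, at the level of precision medicine applications, data analysis and clinical interpretation based on next-generation sequencing still face many problems, which hinder the development of precision medicine and the etiological study of diseases related to structural variations detected by next-generation sequencing. First, the analysis workflow for identifying structural variations is too complex for staff of medical institutions and other non-bioinformaticians to master and use. Second, identifying pathogenic mutations requires a great deal of manual work to check and confirm the screened structural variations one by one, which is very inefficient.
Summary of the Invention
The purpose of the present invention is to provide a method and system for interpreting reports of structural variations in patient sample data, which can accurately interpret a patient's sample data, lowering the threshold for report interpretation while improving interpretation efficiency.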
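To achieve the above purpose, one aspect of the present invention provides a method for interpreting reports of structural variations in patient sample data, including:
obtaining the patient's sample data to be tested, the sample data to be tested including a gene sequence, a disease name and a feature set I;
comparing the gene sequence with a reference baseline, detecting multiple structural variations in the sample data to be tested and annotating them one by one, and classifying the pathogenicity of each structural variation according to the annotation results;
retrieving related disease genes from public databases and literature databases according to the disease name and/or the clinical features in the feature set I to construct a gene recommendation list;
traversing the feature set A corresponding to each standard disease name in the feature relation database, calculating the set similarity value between each feature set A and feature set I, and recommending multiple standard disease names according to the similarity values;
outputting the multiple structural variations in descending order based on the importance of the influencing elements corresponding to each structural variation and generating an interpretation report, the influencing elements including one or more of: the pathogenicity classification corresponding to the structural variation; the consistency between the disease name and the disease name in the annotation result; whether a gene contained in the structural variation in the annotation result is a known disease-causing gene in the gene recommendation list; the maximum similarity value between feature set I and feature set A; the population frequency; and the location of the variation.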
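Preferably, the method for constructing the reference baseline includes:
obtaining the gene sequences of multiple phenotypically normal individuals from the same batch as the sample data to be tested;
if the population gene sequences are whole-genome sequencing data, inputting the multiple phenotypically normal gene sequences into the CNVKit software to construct the reference baseline;
if the population gene sequences are whole-exome sequencing data, inputting the multiple phenotypically normal gene sequences into the ExomeDepth software to construct the reference baseline.
Preferably, the method for annotating the structural variations and obtaining their pathogenicity classification from the annotation results includes:
annotating each structural variation with the AnnotSV software, the annotation result including one or more of: the population frequency, the genes contained in the structural variation and the corresponding disease names, the variation type, and the pathogenicity status of the variation;
classifying the pathogenicity of the structural variation according to the annotation result, the pathogenicity classification comprising three types: pathogenic or likely pathogenic; pathogenic or likely pathogenic but with an annotation result that also contains a benign annotation; and other cases.
Preferably, the method for retrieving related disease genes from public databases and literature databases according to the disease name and/or the clinical features in the feature set I to construct a gene recommendation list includes:
retrieving related first disease genes from public databases and literature databases according to the disease name;
traversing, according to the clinical features in the feature set I, the multiple sets of clinical features corresponding to each disease in the public databases and literature databases;
using a clinical feature enrichment analysis algorithm to calculate the significance value of the feature set I against the set corresponding to each disease in the public databases and literature databases;
matching and outputting the second disease genes corresponding to the multiple significance values;
combining the first disease genes and the second disease genes to construct the gene recommendation list.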
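Preferably, before the step of traversing the feature set A corresponding to each standard disease name in the feature relation database and calculating the set similarity value between each feature set A and feature set I, the method further includes:
obtaining known standard disease names and their corresponding standard clinical features from public disease databases and literature databases;
establishing, based on the known standard diseases and their corresponding standard clinical features, a feature relation database of standard disease names and standard clinical features;
calculating, for each disease, the contribution c_i of each of its standard clinical features to that disease;
obtaining data from the feature relation database and constructing a standardized clinical feature phenotype tree of the diseases based on HPO;
the standardized clinical feature phenotype tree consists of multiple trunk nodes and at least one branch node associated with each trunk node, each branch node representing one standardized clinical feature and each trunk node representing the index of its associated standardized clinical features.
Preferably, the method for traversing the feature set A corresponding to each standard disease name in the feature relation database, calculating the set similarity value between each feature set A and feature set I, and recommending multiple standard disease names according to the similarity values includes:
marking the nodes of the clinical features in feature set I on the standardized clinical feature phenotype tree;
traversing the n-th standard disease name in the feature relation database and marking the nodes of the standard clinical features in its corresponding feature set A on the standardized clinical feature phenotype tree, the initial value of n being 1;
matching, based on the node marks on the standardized clinical feature phenotype tree, the best standard clinical feature in feature set A for each clinical feature in feature set I;
calculating the set similarity value between feature set I and the current feature set A from the similarity value between each clinical feature and its best standard clinical feature;
setting n=n+1 and traversing the n-th standard disease name in the feature relation database again, until all standard disease names in the feature relation database have been traversed; then collecting the set similarity values between feature set I and each feature set A, and recommending multiple standard disease names in descending order of similarity value.
Further, the method for matching, based on the node marks on the standardized clinical feature phenotype tree, the best standard clinical feature in feature set A for each clinical feature in feature set I includes:
the feature set I includes multiple clinical features, and the feature set A includes multiple standard clinical features;
traversing the i-th clinical feature in the feature set I, and selecting from the feature set A the standard clinical feature with the highest similarity to the i-th clinical feature as the best standard clinical feature corresponding to the i-th clinical feature, the initial value of i being 1;
setting i=i+1 and traversing the i-th clinical feature in the feature set I again, until all clinical features in feature set I have been traversed, so that multiple best standard clinical features in one-to-one correspondence with the clinical features in feature set I are selected from the feature set A corresponding to the n-th standard disease name.
Further, the method for selecting from the feature set A the standard clinical feature with the highest similarity to the i-th clinical feature includes:
traversing the j-th standard clinical feature in the feature set A, and determining from the established index whether the j-th standard clinical feature and the i-th clinical feature share a trunk node B_t, the initial value of j being 1;
if not, the similarity value between the j-th standard clinical feature and the i-th clinical feature is taken to be zero;
if so, calculating the similarity value between the j-th standard clinical feature and the i-th clinical feature with the multi-level structure similarity algorithm;
setting j=j+1 and traversing the j-th standard clinical feature in the feature set A again, continuing the similarity calculation between the j-th standard clinical feature and the i-th clinical feature until all standard clinical features in the feature set A have been traversed, thereby obtaining multiple similarity values in one-to-one correspondence with the standard clinical features in the feature set A;
selecting, from the multiple similarity values, the standard clinical feature with the maximum value as the best standard clinical feature corresponding to the i-th clinical feature.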
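Preferably, the method for outputting the multiple structural variations in descending order based on the importance of the influencing elements corresponding to each structural variation includes:
using the formula
$$\mathrm{Score} = \sum_{i=1}^{f} w_i s_i$$
to calculate the importance score of each structural variation, where f is the number of influencing elements, w_i is the weight of the i-th influencing element, and s_i is the value assigned to the i-th influencing element;
collecting the importance scores of all structural variations and outputting the corresponding structural variations in descending order of score.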
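Compared with the prior art, the method for interpreting reports of structural variations in patient sample data provided by the present invention has the following beneficial effects:
In the method provided by the present invention, a set of sample data to be tested, including a gene sequence, a disease name and a feature set I, is first obtained. The gene sequence is compared with the reference baseline to detect and annotate the structural variations in the sample data, and each structural variation is then classified for pathogenicity and scored according to the annotation results. Next, related disease genes are retrieved from public databases and literature databases based on the disease name and/or feature set I to construct a gene recommendation list. In addition, the feature set A corresponding to each standard disease name in the feature relation database is traversed, the set similarity value between each feature set A and feature set I is calculated, and multiple standard disease names are recommended according to the similarity values. Finally, the multiple structural variations are output in descending order based on the importance of their influencing elements to generate the interpretation report.
Because the influencing elements include one or more of the pathogenicity classification corresponding to the structural variation, the consistency between the disease name and the disease name in the annotation result, whether a gene contained in the structural variation in the annotation result is a known disease-causing gene in the gene recommendation list, the maximum similarity value between feature set I and feature set A, the population frequency, and the location of the variation, the present invention evaluates the pathogenicity of structural variations along multiple dimensions and can therefore interpret the patient's sample accurately. Moreover, the method automates the whole process from the sample data to be tested to the recommendation of pathogenic structural variations, which greatly reduces the manual workload of interpreting and analyzing structural variation data and improves the efficiency of structural variation analysis and clinical interpretation.
Another aspect of the present invention provides a system for interpreting reports of structural variations in patient sample data, including:
an input unit for obtaining the patient's sample data to be tested, the sample data to be tested including a gene sequence, a disease name and a feature set I;
an annotation unit for comparing the gene sequence with a reference baseline, detecting multiple structural variations in the sample data to be tested and annotating them one by one, and classifying the pathogenicity of each structural variation according to the annotation results;
a recommendation list generating unit for retrieving related disease genes from public databases and literature databases according to the disease name and/or the clinical features in the feature set I to construct a gene recommendation list;
a disease name recommendation unit for traversing the feature set A corresponding to each standard disease name in the feature relation database, calculating the set similarity value between each feature set A and feature set I, and recommending multiple standard disease names according to the similarity values;
a report output unit for outputting the multiple structural variations in descending order based on the importance of the influencing elements corresponding to each structural variation and generating an interpretation report, the influencing elements including one or more of: the pathogenicity classification corresponding to the structural variation; the consistency between the disease name and the disease name in the annotation result; whether a gene contained in the structural variation in the annotation result is a known disease-causing gene in the gene recommendation list; the maximum similarity value between feature set I and feature set A; the population frequency; and the location of the variation.
Compared with the prior art, the beneficial effects of the report interpretation system provided by the present invention are the same as those of the report interpretation method provided by the above technical solution, and are not repeated here.
A third aspect of the present invention provides a computer-readable storage medium, for example a non-volatile computer-readable storage medium, on which computer-readable instructions are stored; when run by a processor, the computer-readable instructions perform the steps of the above method for interpreting reports of structural variations in patient sample data.
Compared with the prior art, the beneficial effects of the computer-readable storage medium provided by the present invention are the same as those of the report interpretation method provided by the above technical solution, and are not repeated here.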
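Brief Description of the Drawings
The drawings described here are provided for a further understanding of the present invention and form part of it. The exemplary embodiments of the present invention and their descriptions are used to explain the invention and do not unduly limit it. In the drawings:
FIG. 1 is a schematic flowchart of the method for interpreting reports of structural variations in patient sample data in Embodiment 1;
FIG. 2 is an example of node marking on the standardized clinical feature phenotype tree in Embodiment 1;
FIG. 3 is a structural block diagram of the system for interpreting reports of structural variations in patient sample data in Embodiment 2;
FIG. 4 is an example of the application environment architecture of the method for interpreting reports of structural variations in patient sample data in Embodiment 4.
Detailed Description
To make the above purposes, features and advantages of the present invention clearer and easier to understand, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative work fall within the scope of protection of the present invention.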
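Embodiment 1
Referring to FIG. 1, this embodiment provides a method for interpreting reports of structural variations in patient sample data, characterized in that it includes:
obtaining the patient's sample data to be tested, the sample data to be tested including a gene sequence, a disease name and a feature set I; comparing the gene sequence with a reference baseline, detecting multiple structural variations in the sample data to be tested and annotating them one by one, and classifying the pathogenicity of each structural variation according to the annotation results; retrieving related disease genes from public databases and literature databases according to the disease name and/or the clinical features in feature set I to construct a gene recommendation list; traversing the feature set A corresponding to each standard disease name in the feature relation database, calculating the set similarity value between each feature set A and feature set I, and recommending multiple standard disease names according to the similarity values; and outputting the multiple structural variations in descending order based on the importance of the influencing elements corresponding to each structural variation and generating an interpretation report, the influencing elements including one or more of: the pathogenicity classification corresponding to the structural variation; the consistency between the disease name and the disease name in the annotation result; whether a gene contained in the structural variation in the annotation result is a known disease-causing gene in the gene recommendation list; the maximum similarity value between feature set I and feature set A; the population frequency; and the location of the variation.
In the method provided by this embodiment, a set of sample data to be tested, including a gene sequence, a disease name and a feature set I, is first obtained. The gene sequence is compared with the reference baseline to detect and annotate the structural variations in the sample data, and each structural variation is then classified for pathogenicity and scored according to the annotation results. Next, related disease genes are retrieved from public databases and literature databases based on the disease name and/or feature set I to construct a gene recommendation list. In addition, the feature set A corresponding to each standard disease name in the feature relation database is traversed, the set similarity value between each feature set A and feature set I is calculated, and multiple standard disease names are recommended according to the similarity values. Finally, the multiple structural variations are output in descending order based on the importance of their influencing elements to generate the interpretation report.
Because the influencing elements include one or more of the pathogenicity classification corresponding to the structural variation, the consistency between the disease name and the disease name in the annotation result, whether a gene contained in the structural variation in the annotation result is a known disease-causing gene in the gene recommendation list, the maximum similarity value between feature set I and feature set A, the population frequency, and the location of the variation, this embodiment evaluates the pathogenicity of structural variations along multiple dimensions and can therefore interpret the patient's sample accurately. Moreover, the method provided by this embodiment automates the whole process from the sample data to be tested to the recommendation of pathogenic structural variations, which greatly reduces the manual workload of interpreting and analyzing structural variation data and improves the efficiency of structural variation analysis and clinical interpretation.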
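Before the gene sequence is compared with the reference baseline, the quality of the gene sequence of the sample data to be tested and of the gene sequences of the phenotypically normal population must be checked, to ensure that the sequences used are of acceptable quality for downstream analysis and interpretation. The quality-check metrics include: total number of sequences, sequence length, base quality, sequence quality, base content, GC content, per-base N content, sequence length distribution, duplicated sequences, overrepresented sequences, adapter sequences, K-mer content, and so on.
In a specific implementation, quality checks are run on the gene sequence of the sample data to be tested and on the gene sequences of the phenotypically normal population, and sequences that fail are flagged. The sequences that pass are input into the BWA software and aligned against the human reference genome hg19 or hg38. The alignment results are then preprocessed, for example by duplicate removal, indel region realignment and base quality recalibration, to obtain the alignment data. The alignment data include the alignment position of each sequence on the chromosome, the alignment quality, the alignment position of the paired sequence on the chromosome, the insert size, and the base composition or sequence quality of the sequence.
In a specific implementation, duplicates are removed from the alignment results with Picard MarkDuplicates. Indel regions are corrected by generating an indel list with GATK RealignerTargetCreator, adding the known indel sites found in the 1000 Genomes database, and locally realigning these indel regions with GATK IndelRealigner. Base quality is corrected by recalibrating the base quality scores with GATK BaseRecalibrator in combination with known site information.
After these steps, a summary analysis of the alignment data can be performed. It covers the quality of the alignment data, the number of raw paired-end reads, the number of reads aligned to the human reference genome, the average read length, the proportion of indels, whether the forward and reverse strands are balanced, and similar information. At this stage the sequence coverage of the target regions can also be examined, to obtain the genome length, the length of the target regions, the total number of reads, the number of reads in target regions, the number of reads in non-target regions, the proportion of reads in target regions, the average sequencing depth of the target regions, and similar information.
Finally, the quality-checked gene sequence of the sample data to be tested and that of each phenotypically normal individual are each output as data in BAM format for subsequent analysis.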
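In the above embodiment, the method for constructing the reference baseline includes: obtaining the gene sequences of multiple phenotypically normal individuals from the same batch as the sample data to be tested; if the population gene sequences are whole-genome sequencing data, inputting the multiple phenotypically normal gene sequences into the CNVKit software to construct the reference baseline; if the population gene sequences are whole-exome sequencing data, inputting them into the ExomeDepth software to construct the reference baseline.
Structural variations in the sample data to be tested are identified as follows. The sequencing depth of the sample inside and outside the target regions is calculated, the ratio of each depth to the reference baseline is computed, and the ratio is converted into an absolute copy number; any region whose absolute copy number is not 2 is identified as a structural variation. Likewise, if the sample data to be tested are whole-genome sequencing data, the structural variations can be identified with the CNVKit software; if they are whole-exome sequencing data, with the ExomeDepth software.
In a specific implementation, the input gene sequence is compared against the gene sequences of multiple phenotypically normal individuals from the same batch as the sample to be tested, to establish a reference baseline. Specifically, the sequencing depth inside and outside the target regions is calculated for each phenotypically normal gene sequence, all control samples are merged, and systematic errors such as GC content are corrected to construct the reference baseline of the phenotypically normal population. This is done with the CNVKit software for whole-genome sequencing data and with the ExomeDepth software for whole-exome sequencing data. Building the reference baseline from multiple phenotypically normal sequences of the same batch reduces the bias in the comparison caused by systematic errors.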
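The depth-ratio idea described above can be sketched in a few lines. This is only an illustration under stated assumptions, not how CNVKit or ExomeDepth work internally: the function names `build_baseline` and `call_copy_numbers` are hypothetical, depths are assumed to be already normalized for library size and GC content, the baseline is taken as a per-bin median, and the ratio is converted to a copy number by multiplying by the diploid value 2 and rounding.

```python
from statistics import median


def build_baseline(normal_depths):
    """Per-bin median depth across phenotypically normal samples (assumed baseline)."""
    n_bins = len(normal_depths[0])
    return [median(sample[b] for sample in normal_depths) for b in range(n_bins)]


def call_copy_numbers(sample_depths, baseline, ploidy=2):
    """Return (bin index, copy number) for bins whose copy number differs from ploidy."""
    calls = []
    for b, (depth, ref) in enumerate(zip(sample_depths, baseline)):
        if ref <= 0:
            continue  # no usable baseline depth for this bin
        ratio = depth / ref                  # depth relative to the baseline
        copy_number = round(ploidy * ratio)  # assumed ratio-to-copy-number conversion
        if copy_number != ploidy:
            calls.append((b, copy_number))
    return calls


normals = [[100, 100, 100], [110, 90, 100], [90, 110, 100]]
baseline = build_baseline(normals)
calls = call_copy_numbers([100, 50, 150], baseline)
```

In this toy input the second bin has half the baseline depth and the third has one and a half times it, so they are reported as a one-copy and a three-copy region respectively.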
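In the above embodiment, the method for annotating the structural variations and obtaining their pathogenicity classification from the annotation results includes:
annotating each structural variation with the AnnotSV software, the annotation result including one or more of: the population frequency, the genes contained in the structural variation and the corresponding disease names, the variation type, and the pathogenicity status of the variation; and classifying the pathogenicity of the structural variation according to the annotation result, the pathogenicity classification comprising three types: pathogenic or likely pathogenic; pathogenic or likely pathogenic but with an annotation result that also contains a benign annotation; and other cases.
For the identified structural variations in the above embodiment, the AnnotSV software can be used to annotate, from public databases, the start and end positions of the variation, the genes covered, the variation type, the worldwide population frequency, and the pathogenicity status of known variations in the DGV, 1000 Genomes, dbVar and OMIM databases. The software classifies pathogenicity using the criteria defined by the American College of Medical Genetics and Genomics (ACMG). The specific steps of pathogenicity classification are well known to those skilled in the art and are not described further here.
In the above embodiment, the method for retrieving related disease genes from public databases and literature databases according to the disease name and/or the clinical features in feature set I to construct a gene recommendation list includes:
retrieving related first disease genes from public databases and literature databases according to the disease name; traversing, according to the clinical features in feature set I, the multiple sets of clinical features corresponding to each disease in the public databases and literature databases; using a clinical feature enrichment analysis algorithm to calculate the significance value of feature set I against the set corresponding to each disease in the public databases and literature databases; matching and outputting the second disease genes corresponding to the multiple significance values; and combining the first disease genes and the second disease genes to construct the gene recommendation list.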
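In a specific implementation, the gene data in the gene recommendation list come from two sources. The first is the first disease genes related to the disease name in the sample data to be tested, retrieved from the disease-gene association databases in the public databases and literature databases. The second is obtained from the clinical features in feature set I: the disease-gene association databases in the public databases and literature databases are traversed to obtain all disease names, the clinical features corresponding to each disease form one standard set, the number X of standard sets is counted, and the standard sets are labelled in order. The clinical feature enrichment analysis algorithm is then used to calculate the significance value of feature set I against each standard set. The algorithm is as follows:
Step S1: select the Y-th of the X standard sets as the set B to be processed, with the initial value of Y set to 1;
Step S2: calculate the similarity coefficient between the set B to be processed and feature set I with the Jaccard similarity algorithm;
Step S3: based on the similarity coefficient, calculate the distance vector between feature set I and the set B to be processed with the Jaccard distance algorithm;
Step S4: calculate the clinical feature enrichment factor coefficient f of feature set I and set B with the formula given in Figure PCTCN2020111132-appb-000002 of the original filing, where a is the number of clinical features of feature set I that are contained in set B, b is the number of clinical features of the disease-gene association database that are contained in set B, c is the number of clinical features of feature set I that are not contained in set B, and d is the number of clinical features of the disease-gene association database that are not contained in set B;
Step S5: based on the value of the distance vector and the clinical feature enrichment factor coefficient f, filter the set B to be processed, so that only sets that are not removed proceed to step S6;
Step S6: calculate the significance value of feature set I and set B with the formula given in Figure PCTCN2020111132-appb-000003 of the original filing, where n=a+b+c+d;
Step S7: when Y<X, set Y=Y+1 and return to step S1, until all X standard sets have been selected and processed.
In a specific implementation, a table method can be used to simplify the calculation of the significance value, with a, b, c and d as defined in step S4. The Jaccard similarity coefficient measures the similarity between two sets and is defined as the number of elements in their intersection divided by the number of elements in their union:
$$J(A,B) = \frac{|A \cap B|}{|A \cup B|}$$
The Jaccard distance is
$$d_J(A,B) = 1 - J(A,B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}$$
where |A| is the number of clinical features in feature set I, |B| is the number of clinical features in set B, |A∩B| is the number of clinical features in the intersection of feature set I and set B, and |A∪B| is the number of clinical features in their union. J(A,B) takes values in [0,1]; the smaller the value of the distance vector, the more similar the two sets. If feature set I and set B are both empty, J(A,B)=1. The significance value P of feature set I against each standard set is finally obtained; the smaller P is, the more similar the two sets are.
For example, step S5 includes: when the value of the distance vector is less than a first threshold and the value of the clinical feature enrichment factor coefficient f is greater than a second threshold, the set B is kept; otherwise the set B to be processed is discarded. The first and second thresholds can be set freely by the user; by default the first threshold is 1 and the second threshold is 0.
Finally, the significance values are sorted from low to high and the matched second disease genes are output in that order. The smaller the significance value, the better the corresponding second disease gene matches the clinical features presented in feature set I.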
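The set-based parts of steps S2 to S5 can be illustrated with a short sketch. It covers only what the text defines explicitly: the Jaccard coefficient, the Jaccard distance, the counts a, b, c, d, and the threshold filter. The enrichment factor f and the significance value are given only as formula images in the filing, so here f is simply passed in as an argument; the function names and the toy HPO-style identifiers are hypothetical.

```python
def jaccard_similarity(set_i, set_b):
    """|I ∩ B| / |I ∪ B|; defined as 1 when both sets are empty, as in the text."""
    if not set_i and not set_b:
        return 1.0
    return len(set_i & set_b) / len(set_i | set_b)


def jaccard_distance(set_i, set_b):
    """Jaccard distance, 1 - J(I, B)."""
    return 1.0 - jaccard_similarity(set_i, set_b)


def contingency_counts(set_i, set_b, database_features):
    """Counts a, b, c, d as worded in step S4."""
    a = len(set_i & set_b)              # features of I contained in B
    b = len(database_features & set_b)  # database features contained in B
    c = len(set_i - set_b)              # features of I not contained in B
    d = len(database_features - set_b)  # database features not contained in B
    return a, b, c, d


def keep_set(distance, enrichment_factor, first_threshold=1.0, second_threshold=0.0):
    """Step S5 filter: keep B only if distance < first and f > second threshold."""
    return distance < first_threshold and enrichment_factor > second_threshold


feature_set_i = {"HP:1", "HP:2", "HP:3"}
standard_set_b = {"HP:2", "HP:3", "HP:4"}
database = {"HP:1", "HP:2", "HP:3", "HP:4", "HP:5", "HP:6"}
distance = jaccard_distance(feature_set_i, standard_set_b)
counts = contingency_counts(feature_set_i, standard_set_b, database)
```

With the default thresholds of 1 and 0, the filter removes exactly those standard sets that share no feature with feature set I (distance equal to 1) or whose enrichment factor is not positive.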
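For example, the public database is the MedGen database and the literature database is the PubMed database.
In the above embodiment, before the step of traversing the feature set A corresponding to each standard disease name in the feature relation database and calculating the set similarity value between each feature set A and feature set I, the method further includes:
obtaining known standard disease names and their corresponding standard clinical features from public disease databases and literature databases; establishing, based on the known standard diseases and their corresponding standard clinical features, a feature relation database of standard disease names and standard clinical features; calculating, for each disease, the contribution c_i of each of its standard clinical features to that disease; and obtaining data from the feature relation database and constructing a standardized clinical feature phenotype tree of the diseases based on HPO. The standardized clinical feature phenotype tree consists of multiple trunk nodes and at least one branch node associated with each trunk node; each branch node represents one standardized clinical feature, and each trunk node represents the index of its associated standardized clinical features.
In a specific implementation, the contribution c_i of each standard clinical feature of a disease to that disease is calculated as follows:
Suppose the feature relation database contains a kinds of standard clinical features, which occur N times in total in the database, and that the i-th standard clinical feature occurs a_i times. The frequency f_i with which each standard clinical feature occurs in the feature relation database is then:
f_i = a_i / N;
For a given standard disease name in the feature relation database, suppose it has m standard clinical features whose distribution frequencies in the feature relation database are f_1, f_2, ..., f_m. The contribution c_i of a given standard clinical feature to the disease is then calculated with the formula
given in Figure PCTCN2020111132-appb-000006 of the original filing.
In that formula, k is a correction factor with k>1, and the feature relation database is used as the reference database.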
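The frequency f_i = a_i / N is straightforward to compute from a disease-to-features mapping. The sketch below assumes the feature relation database can be represented as a plain dictionary (the name `disease_to_features` and the toy entries are hypothetical). It stops at f_i, because the contribution formula for c_i is available only as an image in the filing.

```python
from collections import Counter


def feature_frequencies(disease_to_features):
    """Return f_i = a_i / N for every standard clinical feature in the database."""
    # a_i: how many times each feature occurs across all diseases
    counts = Counter(
        feature
        for features in disease_to_features.values()
        for feature in features
    )
    total = sum(counts.values())  # N: total number of feature occurrences
    return {feature: n / total for feature, n in counts.items()}


feature_db = {
    "disease_1": ["seizure", "ataxia"],
    "disease_2": ["seizure", "short stature"],
}
frequencies = feature_frequencies(feature_db)
```

Here "seizure" occurs twice among four feature occurrences, so its frequency is 0.5, while the two disease-specific features each have frequency 0.25.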
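Further, the method for traversing the feature set A corresponding to each standard disease name in the feature relation database, calculating the set similarity value between each feature set A and feature set I, and recommending multiple standard disease names according to the similarity values includes:
marking the nodes of the clinical features in feature set I on the standardized clinical feature phenotype tree; traversing the n-th standard disease name in the feature relation database and marking the nodes of the standard clinical features in its corresponding feature set A on the standardized clinical feature phenotype tree, the initial value of n being 1; matching, based on the node marks on the standardized clinical feature phenotype tree, the best standard clinical feature in feature set A for each clinical feature in feature set I; calculating the set similarity value between feature set I and the current feature set A from the similarity value between each clinical feature and its best standard clinical feature; setting n=n+1 and traversing the n-th standard disease name in the feature relation database again, until all standard disease names in the feature relation database have been traversed; then collecting the set similarity values between feature set I and each feature set A, and recommending multiple standard disease names in descending order of similarity value.
Specifically, the method for matching, based on the node marks on the standardized clinical feature phenotype tree, the best standard clinical feature in feature set A for each clinical feature in feature set I includes:
feature set I includes multiple clinical features and feature set A includes multiple standard clinical features; traversing the i-th clinical feature in feature set I, and selecting from feature set A the standard clinical feature with the highest similarity to the i-th clinical feature as the best standard clinical feature corresponding to the i-th clinical feature, the initial value of i being 1; setting i=i+1 and traversing the i-th clinical feature in feature set I again, until all clinical features in feature set I have been traversed, so that multiple best standard clinical features in one-to-one correspondence with the clinical features in feature set I are selected from the feature set A corresponding to the n-th standard disease name.
The method for selecting from feature set A the standard clinical feature with the highest similarity to the i-th clinical feature includes:
traversing the j-th standard clinical feature in feature set A, and determining from the established index whether the j-th standard clinical feature and the i-th clinical feature share a trunk node B_t, the initial value of j being 1; if not, the similarity value between the j-th standard clinical feature and the i-th clinical feature is taken to be zero; if so, calculating the similarity value between the j-th standard clinical feature and the i-th clinical feature with the multi-level structure similarity algorithm; setting j=j+1 and traversing the j-th standard clinical feature in feature set A again, continuing the similarity calculation until all standard clinical features in feature set A have been traversed, thereby obtaining multiple similarity values in one-to-one correspondence with the standard clinical features in feature set A; and selecting, from the multiple similarity values, the standard clinical feature with the maximum value as the best standard clinical feature corresponding to the i-th clinical feature.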
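In the above embodiment, the method for calculating the similarity value between the j-th standard clinical feature and the i-th clinical feature with the multi-level structure similarity algorithm includes:
based on the node marks on the standardized clinical feature phenotype tree, obtaining the directed set IB of all nodes on the path connecting the i-th clinical feature to the shared trunk node B_t, and the directed set AB of all nodes on the path connecting the j-th standard clinical feature to the shared trunk node B_t, where the length of directed set IB is the number of nodes on its path, L_IB, and the length of directed set AB is the number of nodes on its path, L_AB; extracting the intersection IAB of the nodes in directed sets IB and AB, where the length of IAB is the number of nodes shared by the two paths, L_IAB; and using the formula
given in Figure PCTCN2020111132-appb-000007 of the original filing
to calculate the similarity value between the j-th standard clinical feature and the i-th clinical feature;
where SM is the similarity between the j-th standard clinical feature and the i-th clinical feature across levels of the phenotype tree, SI is their similarity within the same level of the phenotype tree, and β is a weight coefficient.
In a specific implementation, the feature set A corresponding to a given standard disease name in the feature relation database consists of n elements A_j, namely A_1, A_2, ..., A_n, that is, A=[A_1, A_2, ..., A_j, ..., A_n]; every standard disease name in the feature relation database corresponds to one set A. Suppose the standardized feature set I entered for a patient consists of m clinical features I_i, so that I=[I_1, I_2, ..., I_m]. If I_i and A_j do not share a trunk node, their similarity is taken to be 0. If they do share a trunk node, B_t, as shown in FIG. 2, their similarity is calculated as follows. All nodes on the path connecting I_i to B_t form the directed set IB; its number of elements is N_IB and its length, defined as the number of nodes on the path, is L_IB, with L_IB=N_IB.
All nodes on the path connecting A_j to B_t form the directed set AB; its number of elements is N_AB and its length, defined as the number of nodes on the path, is L_AB, with L_AB=N_AB.
The intersection of directed sets IB and AB is denoted IAB; its number of elements is N_IAB and its length, defined as the number of nodes on the shared path, is L_IAB, with L_IAB=N_IAB. Then SM=L_IAB/max(L_AB, L_IB) and SI=1/(L_AB+L_IB-2L_IAB+1), where β is a weight coefficient with β∈(0,1). The range of the similarity between I_i and A_j is
given in Figure PCTCN2020111132-appb-000008 of the original filing.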
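The two components SM and SI can be computed directly from the two node paths. In the sketch below each path is assumed to be a list of node identifiers running from the feature up to the shared trunk node B_t, inclusive at both ends (the text does not say whether the end points are counted, so this is an assumption). The function name is hypothetical, and the final combination of SM and SI with β is deliberately left out because that formula is available only as an image in the filing.

```python
def path_similarity_components(path_i, path_a):
    """Return (SM, SI) for two node paths that end at the same trunk node."""
    l_ib = len(path_i)                       # L_IB: nodes on the path of I_i
    l_ab = len(path_a)                       # L_AB: nodes on the path of A_j
    l_iab = len(set(path_i) & set(path_a))   # L_IAB: nodes shared by both paths
    sm = l_iab / max(l_ab, l_ib)             # SM = L_IAB / max(L_AB, L_IB)
    si = 1 / (l_ab + l_ib - 2 * l_iab + 1)   # SI = 1 / (L_AB + L_IB - 2*L_IAB + 1)
    return sm, si


# I_i hangs two levels below B_t; A_j hangs three levels below and shares node "X".
components = path_similarity_components(["I", "X", "Bt"], ["A", "Y", "X", "Bt"])
identical = path_similarity_components(["I", "X", "Bt"], ["I", "X", "Bt"])
```

Both components equal 1 when the two features are the same node, and both shrink as the paths share fewer nodes or diverge by more steps.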
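Further, the method for calculating the set similarity value between feature set I and the current feature set A includes:
using the contribution c_i of the i-th clinical feature to weight the maximum similarity value of its best standard clinical feature in feature set A; setting i=i+1 and again weighting the maximum similarity value of the best standard clinical feature in feature set A corresponding to the i-th clinical feature, until all best standard clinical features selected from feature set A have been weighted; and summing the weighted maximum similarity values of all best standard clinical features in feature set A to obtain the set similarity value between feature set I and the current feature set A.
In a specific implementation, for each input clinical feature I_i, a standard clinical feature A_j with the greatest similarity to it can be found in feature set A. In other words, each clinical feature I_i yields one similarity value with respect to feature set A, and the similarity between feature set I and feature set A is defined as the sum of the similarities between each clinical feature I_i in feature set I and feature set A.
Because clinical features contribute to a disease to different degrees, the corresponding maximum similarity values are weighted with the formula
given in Figure PCTCN2020111132-appb-000009 of the original filing,
where the symbol shown in Figure PCTCN2020111132-appb-000010 of the original filing, written here as S_{I_i A},
denotes the similarity value between clinical feature I_i and feature set A. The similarity value between feature set I and feature set A is defined as the sum of the similarities between each clinical feature I_i in feature set I and feature set A:
$$S_{IA} = \sum_{i=1}^{m} S_{I_i A}$$
where S_{IA} denotes the similarity value between feature set I and feature set A.
It can be seen that, by adopting the multi-level structure similarity algorithm, the above embodiment recommends standard disease names with high accuracy.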
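The best-match and summation logic can be sketched as follows. Two points are assumptions rather than statements of the filing: the weighting is implemented as a plain multiplication of the best-match similarity by c_i (the exact weighting formula is an image in the filing), and the contributions are looked up in a single dictionary keyed by feature, with a default of 1.0. The pairwise `similarity` function is supplied by the caller, and all names are hypothetical.

```python
def set_similarity(features_i, features_a, similarity, contribution):
    """S_IA: sum over I_i of c_i times the best similarity of I_i within set A."""
    total = 0.0
    for feature_i in features_i:
        # best standard clinical feature in A for this input feature
        best = max((similarity(feature_i, fa) for fa in features_a), default=0.0)
        # assumed weighting: multiply by the contribution c_i (default 1.0)
        total += contribution.get(feature_i, 1.0) * best
    return total


def rank_diseases(features_i, disease_to_features, similarity, contribution):
    """Score every standard disease name and sort by set similarity, descending."""
    scores = {
        disease: set_similarity(features_i, features_a, similarity, contribution)
        for disease, features_a in disease_to_features.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def exact_match(x, y):
    """Toy pairwise similarity used only for this example."""
    return 1.0 if x == y else 0.0


relation_db = {"D1": ["a", "b"], "D2": ["a", "c"], "D3": ["z"]}
ranking = rank_diseases(["a", "b"], relation_db, exact_match, {"a": 0.5, "b": 2.0})
```

In practice `similarity` would be the multi-level structure similarity on the phenotype tree; the exact-match function here only keeps the example self-contained.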
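In the above embodiment, the method for outputting the multiple structural variations in descending order based on the importance of the influencing elements corresponding to each structural variation includes:
using the formula
$$\mathrm{Score} = \sum_{i=1}^{f} w_i s_i$$
to calculate the importance score of each structural variation, where f is the number of influencing elements, w_i is the weight of the i-th influencing element, and s_i is the value assigned to the i-th influencing element; then collecting the importance scores of all structural variations and outputting the corresponding structural variations in descending order of score.
The scoring item for the pathogenicity classification, s_c, is set as follows: 5 points for the pathogenic or likely pathogenic level; 3 points for the level that is pathogenic or likely pathogenic but whose annotation result also contains a benign annotation; 0 points for all other cases. The default value of w_c is 1 and can be adjusted according to the actual situation.
The scoring item for the consistency between the disease name and the disease name in the annotation result, s_d, is set as follows: 5 points if the disease name is consistent with the disease name in the annotation result, 0 points if it is not. The default value of w_d is 1 and can be adjusted according to the actual situation.
The scoring item for whether a gene contained in the structural variation in the annotation result is a known disease-causing gene in the gene recommendation list, s_e, is set as follows: 10 points if the gene is a known disease-causing gene in the gene recommendation list, 0 points if it is not. The default value of w_e is 1 and can be adjusted according to the actual situation.
The maximum similarity value between feature set I and feature set A is s_h; the default value of w_h is 5 and can be adjusted according to the actual situation.
The scoring item for the population frequency, s_p, is set as follows: 2 points when the maximum frequency in the population, MAX_AF, is less than or equal to 10^-3 or no information is available; 0 points when the maximum frequency in the population lies between 10^-3 and 0.05; -5 points when the population frequency is greater than 0.05. The default value of w_p is 1 and can be adjusted according to the actual situation.
The scoring item for the location of the variation, s_q, is set as follows: 0 points if the structural variation region contains a protein-coding region or other important functional elements (such as splice-site regulatory regions), otherwise -2 points. The default value of w_q is 1 and can be adjusted according to the actual situation.
The larger the score, the higher the importance. Finally, the multiple structural variations and their corresponding gene names are output in the interpretation report in descending order of score, thereby recommending the pathogenic variations. The recommended information also includes the location of the structural variation, the names of the genes covered, the exons, the score, a map of the variant's position at the chromosome level, and the related phenotype and variant frequency information.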
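The scoring rules above translate into a small weighted sum. The sketch below follows the stated point values and default weights; the dictionary keys describing a structural variation (`pathogenicity`, `disease_name_matches`, and so on) and the level labels are hypothetical, since the filing does not define a data format.

```python
DEFAULT_WEIGHTS = {"c": 1, "d": 1, "e": 1, "h": 5, "p": 1, "q": 1}


def score_pathogenicity(level):
    """s_c: 5 for pathogenic/likely pathogenic, 3 if a benign annotation is also present."""
    return {"pathogenic": 5, "pathogenic_with_benign": 3}.get(level, 0)


def score_population_frequency(max_af):
    """s_p: 2 if MAX_AF <= 1e-3 or unknown, 0 up to 0.05, -5 above 0.05."""
    if max_af is None or max_af <= 1e-3:
        return 2
    if max_af <= 0.05:
        return 0
    return -5


def importance_score(sv, weights=DEFAULT_WEIGHTS):
    """Weighted sum of the six influencing elements for one structural variation."""
    values = {
        "c": score_pathogenicity(sv["pathogenicity"]),
        "d": 5 if sv["disease_name_matches"] else 0,
        "e": 10 if sv["gene_in_recommendation_list"] else 0,
        "h": sv["max_set_similarity"],
        "p": score_population_frequency(sv.get("max_af")),
        "q": 0 if sv["overlaps_functional_region"] else -2,
    }
    return sum(weights[key] * values[key] for key in values)


def rank_variations(variations, weights=DEFAULT_WEIGHTS):
    """Sort structural variations by importance score, highest first."""
    return sorted(variations, key=lambda sv: importance_score(sv, weights), reverse=True)


sv_high = {"pathogenicity": "pathogenic", "disease_name_matches": True,
           "gene_in_recommendation_list": True, "max_set_similarity": 0.8,
           "max_af": None, "overlaps_functional_region": True}
sv_low = {"pathogenicity": "other", "disease_name_matches": False,
          "gene_in_recommendation_list": False, "max_set_similarity": 0.1,
          "max_af": 0.2, "overlaps_functional_region": False}
ranked = rank_variations([sv_low, sv_high])
```

Keeping the weights in a separate dictionary mirrors the statement that each default weight can be adjusted according to the actual situation.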
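Embodiment 2
Referring to FIG. 3, this embodiment provides a system for interpreting reports of structural variations in patient sample data, including:
an input unit for obtaining the patient's sample data to be tested, the sample data to be tested including a gene sequence, a disease name and a feature set I;
an annotation unit for comparing the gene sequence with a reference baseline, detecting multiple structural variations in the sample data to be tested and annotating them one by one, and classifying the pathogenicity of each structural variation according to the annotation results;
a recommendation list generating unit for retrieving related disease genes from public databases and literature databases according to the disease name and/or the clinical features in the feature set I to construct a gene recommendation list;
a disease name recommendation unit for traversing the feature set A corresponding to each standard disease name in the feature relation database, calculating the set similarity value between each feature set A and feature set I, and recommending multiple standard disease names according to the similarity values;
a report output unit for outputting the multiple structural variations in descending order based on the importance of the influencing elements corresponding to each structural variation and generating an interpretation report, the influencing elements including one or more of: the pathogenicity classification corresponding to the structural variation; the consistency between the disease name and the disease name in the annotation result; whether a gene contained in the structural variation in the annotation result is a known disease-causing gene in the gene recommendation list; the maximum similarity value between feature set I and feature set A; the population frequency; and the location of the variation.
In one embodiment, the above system is applied to a computer device that includes a processor and a memory connected via a system bus. The processor provides computing and control capabilities. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and computer-readable instructions. The internal memory provides an environment for running the operating system and the computer-readable instructions stored in the non-volatile storage medium. The network interface of the system is used to communicate with external sensors. When the computer-readable instructions are executed by the processor, the steps of the above method for interpreting reports of structural variations in patient sample data are implemented, for example by the above input unit, annotation unit, recommendation list generating unit, disease name recommendation unit and report output unit.
Compared with the prior art, the beneficial effects of the report interpretation system provided by this embodiment are the same as those of the report interpretation method provided by Embodiment 1, and are not repeated here.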
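Embodiment 3
This embodiment provides a computer-readable storage medium, for example a non-volatile computer-readable storage medium, on which computer-readable instructions are stored; when run by a processor, the computer-readable instructions perform the steps of the above method for interpreting reports of structural variations in patient sample data.
Compared with the prior art, the beneficial effects of the computer-readable storage medium provided by this embodiment are the same as those of the report interpretation method provided by the above technical solution, and are not repeated here.
Embodiment 4
Based on the above embodiments, FIG. 4 provides a schematic diagram of the environment architecture of an application scenario.
Application software can be developed to implement the method for interpreting reports of structural variations in patient sample data of the above embodiments. The application software can be installed on a user terminal, and the user terminal is connected to a server for communication.
The user terminal may be any smart device such as a computer or a tablet; this embodiment uses a computer only as an example.
For example, the user opens the relevant application on the smart device and enters the patient's sample data to be tested through an input unit such as a keyboard or mouse. The sample data to be tested includes the gene sequence, the disease name and feature set I. The application in the user terminal sends the gene sequence to the annotation unit, sends the disease name and feature set I to the recommendation list generating unit, and sends feature set I to the disease name recommendation unit; the annotation unit, the recommendation list generating unit and the disease name recommendation unit can all be implemented by the server. Finally, a report output unit such as a display outputs the multiple structural variations in descending order based on the importance of the influencing elements corresponding to each structural variation and generates the interpretation report.
A person of ordinary skill in the art will understand that all or some of the steps of the above inventive method can be carried out by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, includes the steps of the methods of the above embodiments. The storage medium may be ROM/RAM, a magnetic disk, an optical disc, a memory card, and so on.
The above are only specific embodiments of the present invention, and the scope of protection of the present invention is not limited to them. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. The scope of protection of the present invention shall therefore be determined by the scope of protection of the claims.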

Claims (12)

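  1. A method for interpreting reports of structural variations in patient sample data, comprising:
    obtaining the patient's sample data to be tested, the sample data to be tested including a gene sequence, a disease name and a feature set I;
    comparing the gene sequence with a reference baseline, detecting multiple structural variations in the sample data to be tested and annotating them one by one, and classifying the pathogenicity of each structural variation according to the annotation results;
    retrieving related disease genes from public databases and literature databases according to the disease name and/or the clinical features in the feature set I to construct a gene recommendation list;
    traversing the feature set A corresponding to each standard disease name in the feature relation database, calculating the set similarity value between each feature set A and feature set I, and recommending multiple standard disease names according to the similarity values; and
    outputting the multiple structural variations in descending order based on the importance of the influencing elements corresponding to each structural variation and generating an interpretation report, the influencing elements including one or more of: the pathogenicity classification corresponding to the structural variation; the consistency between the disease name and the disease name in the annotation result; whether a gene contained in the structural variation in the annotation result is a known disease-causing gene in the gene recommendation list; the maximum similarity value between feature set I and feature set A; the population frequency; and the location of the variation.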
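  2. The method according to claim 1, wherein the method for constructing the reference baseline comprises:
    obtaining the gene sequences of multiple phenotypically normal individuals from the same batch as the sample data to be tested;
    if the population gene sequences are whole-genome sequencing data, inputting the multiple phenotypically normal gene sequences into the CNVKit software to construct the reference baseline; and
    if the population gene sequences are whole-exome sequencing data, inputting the multiple phenotypically normal gene sequences into the ExomeDepth software to construct the reference baseline.
  3. The method according to claim 1 or 2, wherein the method for annotating the structural variations and obtaining their pathogenicity classification from the annotation results comprises:
    annotating each structural variation with the AnnotSV software, the annotation result including one or more of: the population frequency, the genes contained in the structural variation and the corresponding disease names, the variation type, and the pathogenicity status of the variation; and
    classifying the pathogenicity of the structural variation according to the annotation result, the pathogenicity classification comprising three types: pathogenic or likely pathogenic; pathogenic or likely pathogenic but with an annotation result that also contains a benign annotation; and other cases.
  4. The method according to any one of claims 1 to 3, wherein the method for retrieving related disease genes from public databases and literature databases according to the disease name and/or the clinical features in the feature set I to construct a gene recommendation list comprises:
    retrieving related first disease genes from public databases and literature databases according to the disease name;
    traversing, according to the clinical features in the feature set I, the multiple sets of clinical features corresponding to each disease in the public databases and literature databases;
    using a clinical feature enrichment analysis algorithm to calculate the significance value of the feature set I against the set corresponding to each disease in the public databases and literature databases;
    matching and outputting the second disease genes corresponding to the multiple significance values; and
    combining the first disease genes and the second disease genes to construct the gene recommendation list.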
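  5. The method according to any one of claims 1 to 4, wherein before the step of traversing the feature set A corresponding to each standard disease name in the feature relation database and calculating the set similarity value between each feature set A and feature set I, the method further comprises:
    obtaining known standard disease names and their corresponding standard clinical features from public disease databases and literature databases;
    establishing, based on the known standard diseases and their corresponding standard clinical features, a feature relation database of standard disease names and standard clinical features;
    calculating, for each disease, the contribution c_i of each of its standard clinical features to that disease; and
    obtaining data from the feature relation database and constructing a standardized clinical feature phenotype tree of the diseases based on HPO;
    wherein the standardized clinical feature phenotype tree consists of multiple trunk nodes and at least one branch node associated with each trunk node, each branch node representing one standardized clinical feature and each trunk node representing the index of its associated standardized clinical features.
  6. The method according to claim 5, wherein the method for traversing the feature set A corresponding to each standard disease name in the feature relation database, calculating the set similarity value between each feature set A and feature set I, and recommending multiple standard disease names according to the similarity values comprises:
    marking the nodes of the clinical features in feature set I on the standardized clinical feature phenotype tree;
    traversing the n-th standard disease name in the feature relation database and marking the nodes of the standard clinical features in its corresponding feature set A on the standardized clinical feature phenotype tree, the initial value of n being 1;
    matching, based on the node marks on the standardized clinical feature phenotype tree, the best standard clinical feature in feature set A for each clinical feature in feature set I;
    calculating the set similarity value between feature set I and the current feature set A from the similarity value between each clinical feature and its best standard clinical feature; and
    setting n=n+1 and traversing the n-th standard disease name in the feature relation database again, until all standard disease names in the feature relation database have been traversed; then collecting the set similarity values between feature set I and each feature set A, and recommending multiple standard disease names in descending order of similarity value.
  7. The method according to claim 6, wherein the method for matching, based on the node marks on the standardized clinical feature phenotype tree, the best standard clinical feature in feature set A for each clinical feature in feature set I comprises:
    the feature set I includes multiple clinical features, and the feature set A includes multiple standard clinical features;
    traversing the i-th clinical feature in the feature set I, and selecting from the feature set A the standard clinical feature with the highest similarity to the i-th clinical feature as the best standard clinical feature corresponding to the i-th clinical feature, the initial value of i being 1; and
    setting i=i+1 and traversing the i-th clinical feature in the feature set I again, until all clinical features in feature set I have been traversed, so that multiple best standard clinical features in one-to-one correspondence with the clinical features in feature set I are selected from the feature set A corresponding to the n-th standard disease name.
  8. The method according to claim 7, wherein the method for selecting from the feature set A the standard clinical feature with the highest similarity to the i-th clinical feature comprises:
    traversing the j-th standard clinical feature in the feature set A, and determining from the established index whether the j-th standard clinical feature and the i-th clinical feature share a trunk node B_t, the initial value of j being 1;
    if not, the similarity value between the j-th standard clinical feature and the i-th clinical feature is taken to be zero;
    if so, calculating the similarity value between the j-th standard clinical feature and the i-th clinical feature with the multi-level structure similarity algorithm;
    setting j=j+1 and traversing the j-th standard clinical feature in the feature set A again, continuing the similarity calculation between the j-th standard clinical feature and the i-th clinical feature until all standard clinical features in the feature set A have been traversed, thereby obtaining multiple similarity values in one-to-one correspondence with the standard clinical features in the feature set A; and
    selecting, from the multiple similarity values, the standard clinical feature with the maximum value as the best standard clinical feature corresponding to the i-th clinical feature.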
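  9. The method according to any one of claims 1 to 8, wherein the method for outputting the multiple structural variations in descending order based on the importance of the influencing elements corresponding to each structural variation comprises:
    using the formula
    $$\mathrm{Score} = \sum_{i=1}^{f} w_i s_i$$
    to calculate the importance score of each structural variation, where f is the number of influencing elements, w_i is the weight of the i-th influencing element, and s_i is the value assigned to the i-th influencing element;
    collecting the importance scores of all structural variations and outputting the corresponding structural variations in descending order of score.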
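  10. A system for interpreting reports of structural variations in patient sample data, comprising:
    an input unit for obtaining the patient's sample data to be tested, the sample data to be tested including a gene sequence, a disease name and a feature set I;
    an annotation unit for comparing the gene sequence with a reference baseline, detecting multiple structural variations in the sample data to be tested and annotating them one by one, and classifying the pathogenicity of each structural variation according to the annotation results;
    a recommendation list generating unit for retrieving related disease genes from public databases and literature databases according to the disease name and/or the clinical features in the feature set I to construct a gene recommendation list;
    a disease name recommendation unit for traversing the feature set A corresponding to each standard disease name in the feature relation database, calculating the set similarity value between each feature set A and feature set I, and recommending multiple standard disease names according to the similarity values; and
    a report output unit for outputting the multiple structural variations in descending order based on the importance of the influencing elements corresponding to each structural variation and generating an interpretation report, the influencing elements including one or more of: the pathogenicity classification corresponding to the structural variation; the consistency between the disease name and the disease name in the annotation result; whether a gene contained in the structural variation in the annotation result is a known disease-causing gene in the gene recommendation list; the maximum similarity value between feature set I and feature set A; the population frequency; and the location of the variation.
  11. A non-volatile computer-readable storage medium storing computer-readable instructions, wherein the computer-readable instructions, when run by a processor, perform the steps of the method according to any one of claims 1 to 9.
  12. A computer device comprising a memory and one or more processors, the memory storing computer-readable instructions, wherein the computer-readable instructions, when executed by the processor, cause the one or more processors to perform the steps of the method according to any one of claims 1 to 9.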
PCT/CN2020/111132 2020-06-11 2020-08-25 患者样本数据中结构变异的报告解读方法及系统 WO2021248694A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010529411.5 2020-06-11
CN202010529411.5A CN111883223B (zh) 2020-06-11 2020-06-11 患者样本数据中结构变异的报告解读方法及系统

Publications (1)

Publication Number Publication Date
WO2021248694A1 true WO2021248694A1 (zh) 2021-12-16

Family

ID=73157983

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/111132 WO2021248694A1 (zh) 2020-06-11 2020-08-25 患者样本数据中结构变异的报告解读方法及系统

Country Status (2)

Country Link
CN (1) CN111883223B (zh)
WO (1) WO2021248694A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116453591A (zh) * 2023-05-08 2023-07-18 上海信诺佰世医学检验有限公司 基于RNA-seq数据分析、变异评级和报告生成系统及方法
CN117373696A (zh) * 2023-12-08 2024-01-09 神州医疗科技股份有限公司 一种基于文献证据库的遗传病自动解读系统及方法

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113671164A (zh) * 2021-09-26 2021-11-19 吾征智能技术(北京)有限公司 一种基于大便颜色及气味判断疾病的系统、设备及介质
CN116547391A (zh) * 2021-10-28 2023-08-04 京东方科技集团股份有限公司 疾病预测方法及装置、电子设备、计算机可读存储介质
CN113793638B (zh) * 2021-11-15 2022-03-25 北京橡鑫生物科技有限公司 一种同源重组修复基因变异的解读方法
CN114300044B (zh) * 2021-12-31 2023-04-18 深圳华大医学检验实验室 基因评估方法、装置、存储介质及计算机设备

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6909971B2 (en) * 2001-06-08 2005-06-21 Licentia Oy Method for gene mapping from chromosome and phenotype data
CN109086571A (zh) * 2018-08-03 2018-12-25 国家卫生计生委科学技术研究所 一种单基因病遗传变异智能解读及报告的方法和系统
CN109119132A (zh) * 2018-08-03 2019-01-01 国家卫生计生委科学技术研究所 基于病历特征匹配单基因病名称的方法及系统
CN111341458A (zh) * 2020-02-27 2020-06-26 国家卫生健康委科学技术研究所 基于多层级结构相似度的单基因病名称推荐方法和系统

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8693788B2 (en) * 2010-08-06 2014-04-08 Mela Sciences, Inc. Assessing features for classification
CN110544537A (zh) * 2019-07-29 2019-12-06 北京荣之联科技股份有限公司 单基因遗传病基因分析报告的生成方法及其电子设备
CN111026841B (zh) * 2019-11-27 2023-04-18 云知声智能科技股份有限公司 一种基于检索和深度学习的自动编码方法及装置


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI, JIANHUA ET AL.: "Review on the Research Progress of Mining of OMIM Data", JOURNAL OF BIOMEDICAL ENGINEERING, vol. 31, no. 6, 31 December 2014 (2014-12-31), pages 1400 - 1404, XP055840474 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116453591A (zh) * 2023-05-08 2023-07-18 上海信诺佰世医学检验有限公司 基于RNA-seq数据分析、变异评级和报告生成系统及方法
CN117373696A (zh) * 2023-12-08 2024-01-09 神州医疗科技股份有限公司 一种基于文献证据库的遗传病自动解读系统及方法
CN117373696B (zh) * 2023-12-08 2024-03-01 神州医疗科技股份有限公司 一种基于文献证据库的遗传病自动解读系统及方法

Also Published As

Publication number Publication date
CN111883223A (zh) 2020-11-03
CN111883223B (zh) 2021-05-25


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20940381

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20940381

Country of ref document: EP

Kind code of ref document: A1