WO2023070422A1 - Disease prediction method and device, electronic equipment, and computer-readable storage medium - Google Patents

Disease prediction method and device, electronic equipment, and computer-readable storage medium

Info

Publication number
WO2023070422A1
Authority
WO
WIPO (PCT)
Prior art keywords
variation
site
score
disease
mutation
Prior art date
Application number
PCT/CN2021/126970
Other languages
English (en)
French (fr)
Inventor
刘梦佳
Original Assignee
京东方科技集团股份有限公司
成都京东方光电科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 京东方科技集团股份有限公司, 成都京东方光电科技有限公司
Priority to PCT/CN2021/126970 (WO2023070422A1)
Priority to CN202180003144.0A (CN116547391A)
Publication of WO2023070422A1

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment

Definitions

  • The present disclosure relates to the field of display technology, and in particular to a disease prediction method, a disease prediction device, electronic equipment, and a computer-readable storage medium.
  • Gene mutation refers to base changes (SNV) and sequence changes (indel) in the genomic DNA molecule.
  • Mitochondria are organelles involved in energy metabolism and are indispensable to many life processes such as cell survival and cell death; abnormal oxidative phosphorylation in the respiratory chain is related to many human diseases.
  • Common mitochondrial diseases include subacute necrotizing encephalomyelopathy (Leigh syndrome), deafness, encephalomyopathy, dystonia, etc.
  • The mutations underlying these mitochondrial diseases include point mutations, deletions, etc., and the regions involved include rRNA/tRNA regions as well as coding and non-coding regions.
  • The present disclosure proposes a disease prediction method, a disease prediction device, electronic equipment, and a computer-readable storage medium.
  • the present disclosure provides a disease prediction method, including:
  • the disease annotation is performed on the variation site, and the mitochondrial disease corresponding to the variation site is obtained;
  • when the first mitochondrial disease corresponding to the mutation-related information of the mutation site is recorded in the preset disease database,
  • the first mitochondrial disease is used as the mitochondrial disease corresponding to the mutation site;
  • when the first mitochondrial disease is not recorded in the preset disease database and the impact degree score is greater than the preset threshold,
  • the second mitochondrial disease corresponding to the adjacent site of the mutation site is obtained from the preset disease database,
  • and the second mitochondrial disease is used as the mitochondrial disease corresponding to the mutation site.
  • the variation-related information includes: variation type, variation region, variation position, and variation leading to CDS and protein changes.
  • predicting the impact degree score of the variant site on gene function according to the variation-related information of the variant site specifically includes:
  • the influence degree score of the variation site on the gene function is determined.
  • the influence degree scores of different variation-related information on gene functions are respectively predicted to obtain multi-faceted influence degree scores, specifically including:
  • the various influence degree scores include: the first score, the second score and the third score;
  • Determining the score of the degree of influence of the variation site on the function of the gene according to the scores of the degree of influence in various aspects, specifically including: determining the score of the degree of influence of the variation site on the function of the gene according to the following formula:
  • the first and second coefficients in the formula are each between 0.15 and 0.25, and the third coefficient is between 0.5 and 0.7.
  • the degree of influence of the variation site on protein conservation and physicochemical properties is obtained, specifically including:
  • the average value of the multiple reference influence degree scores is used as the first score.
  • obtaining the score of the degree of influence of the variation type of the variation site on the gene function includes:
  • according to the preset first mapping relationship, the score of the degree of influence of the variation type of the variation site on the gene function is determined; wherein the first mapping relationship records the scores of the degree of influence of multiple different variation types on the function of the gene.
  • the variation position includes the position number n of the variation position in the protein sequence
  • the third score is determined according to the following formula:
  • the third score is 0.
  • obtaining the gene sequencing data of the test sample includes:
  • the initial gene sequencing data is filtered to obtain the gene sequencing data.
  • obtaining initial gene sequencing data of a test sample includes:
  • using Nanopore sequencing technology or targeted enrichment sequencing technology to obtain the initial gene sequencing data of the test sample.
  • data analysis is performed on the gene sequencing data to obtain the variation sites in the gene sequencing data, specifically including:
  • the comparison result data including: the position of the gene sequencing data in the reference mitochondrial genome
  • Variation detection is performed on the comparison result data to determine the variation sites in the comparison result data.
  • the variation detection is performed on the comparison result data, and the variation sites in the comparison result data are determined, specifically including:
  • the first detection result including: SNV sites included in the comparison result data;
  • the second detection result including: the indel site included in the comparison result data;
  • the variation site includes the SNV site and the indel site.
  • An embodiment of the present disclosure also provides a disease prediction device, including:
  • the data acquisition module is configured to acquire the gene sequencing data of the detection sample
  • the analysis module is configured to perform data analysis on the gene sequencing data to obtain the variation sites in the gene sequencing data;
  • the variation annotation module is configured to perform variation annotation on the variation site, and obtain variation-related information of the variation site;
  • the prediction module is configured to predict the impact degree score of the variation site on gene function according to the variation-related information of the variation site;
  • the disease annotation module is configured to perform disease annotation on the variant site according to the impact degree score of the variant site on gene function and the preset disease database, and obtain the mitochondrial disease corresponding to the variant site;
  • when the first mitochondrial disease corresponding to the mutation-related information of the mutation site is recorded in the preset disease database,
  • the first mitochondrial disease is used as the mitochondrial disease corresponding to the mutation site;
  • when the first mitochondrial disease is not recorded in the preset disease database and the impact degree score is greater than the preset threshold,
  • the second mitochondrial disease corresponding to the adjacent site of the mutation site is obtained from the preset disease database,
  • and the second mitochondrial disease is used as the mitochondrial disease corresponding to the mutation site.
  • An embodiment of the present disclosure further provides an electronic device, including a memory and a processor, where a computer program is stored in the memory, wherein the computer program implements the above method when executed by the processor.
  • An embodiment of the present disclosure also provides a computer-readable storage medium, on which a computer program is stored, wherein the above-mentioned method is implemented when the computer program is executed by a processor.
  • Fig. 1 is a schematic diagram of a disease prediction method provided in some embodiments of the present disclosure.
  • Fig. 2 is a schematic diagram of a disease prediction method provided in other embodiments of the present disclosure.
  • FIG. 3 is a statistical diagram of the read length distribution of off-machine data provided in an example of the present disclosure.
  • FIG. 4A is a base composition plot of the first 100 bases of the off-machine data reads provided in an example of the present disclosure.
  • FIG. 4B is a base composition plot of the last 100 bases of the off-machine data reads provided in an example of the present disclosure.
  • FIG. 5A is a plot of the average base quality of the first 100 bases of the off-machine data provided in an example of the present disclosure.
  • FIG. 5B is a plot of the average base quality of the last 100 bases of the off-machine data provided in an example of the present disclosure.
  • FIG. 6 is a schematic diagram of a visualized output result after statistics of the comparison result data in step S2a provided in an example of the present disclosure.
  • Fig. 7 is a schematic diagram of an optional implementation manner of step S4 provided in some embodiments of the present disclosure.
  • Fig. 8 is a schematic diagram of a disease prediction device provided in some embodiments of the present disclosure.
  • FIG. 9 is a schematic diagram of an electronic device provided in some embodiments of the present disclosure.
  • FIG. 10 is a schematic diagram of a computer-readable storage medium provided in some embodiments of the present disclosure.
  • High-throughput sequencing: also known as "next generation" sequencing technology (Next Generation Sequencing, NGS).
  • Sequencing refers to the analysis of the base sequence of a specific DNA fragment, that is, the arrangement of adenine (A), thymine (T), cytosine (C) and guanine (G).
  • Read length (read): the sequence obtained from high-throughput sequencing, including the sequenced base information and quality value information.
  • Nanopore: nanopore single-molecule sequencing technology, which uses electrical signals and endonucleases for sequencing.
  • Its reads are very long, with average lengths usually ranging from a dozen kbp to dozens of kbp.
  • Genome: in the fields of molecular biology and genetics, a genome is the sum total of all the genetic material of an organism. This genetic material includes DNA or RNA. The genome includes coding and non-coding DNA, mitochondrial DNA and chloroplast DNA.
  • Gene mutation: in biology, a change in the genetic material of a cell (usually the deoxyribonucleic acid present in the nucleus). It includes point mutations caused by single base changes, as well as deletions, duplications and insertions of multiple bases. Causes can be errors in the replication of genetic material when cells divide, or the effects of chemicals, genotoxicity, radiation, or viruses.
  • SNV: single nucleotide variation.
  • Indel: an insertion or deletion mutation in DNA.
  • HGVS: Human Genome Variation Society.
  • The traditional method for identifying mitochondrial diseases relies mainly on clinical biochemistry, but it places high demands on doctors, is prone to misjudgments and missed diagnoses, and has difficulty with relatively rare mitochondrial diseases. It also suffers from low throughput, complicated operation, and long cycle times.
  • The method of detecting mitochondrial diseases through gene sequencing can only judge known variations; it cannot identify variations that have not been reported but still affect gene and protein function, that is, it cannot determine which disease such a variation causes.
  • Fig. 1 is a schematic diagram of a disease prediction method provided in some embodiments of the present disclosure, and the disease prediction method is especially suitable for predicting mitochondrial diseases through genetic testing.
  • disease prediction methods include:
  • the detection sample may be a DNA sample of a patient suffering from mitochondrial disease, such as plasma or serum of the patient.
  • the gene sequencing data of the test samples can be obtained by sequencing with a third-generation sequencer.
  • A variation site is a position in the gene sequencing data of the test sample whose base differs from the base at the same position in the reference genome; such variation sites may be pathogenic sites that affect human health or cause human diseases.
  • The data analysis of the gene sequencing data in step S2 may include: performing quality control and filtering on the gene sequencing data to obtain high-quality data, and performing genetic testing on the filtered data to determine the mutation sites.
  • In the genetic testing, for example, SNV mutation sites can be detected by means of SNV detection.
  • the mutation-related information may include: mutation type, mutation position, and changes in CDS bases and proteins caused by the mutation.
  • the influence degree score is used to indicate the influence degree of the mutation site on the gene function.
  • the degree of influence on the gene function may be different; when the variation positions of the mutation sites are different, the degree of influence on the gene function may also be different. Therefore, the score of the degree of influence of the variant site on the gene function can be predicted based on the variation-related information of the variant site.
  • the disease annotation is performed on the variant site to obtain the mitochondrial disease corresponding to the variant site.
  • The preset disease database records the correspondence between known variation-related information and mitochondrial diseases; that is, for the variation-related information of some mutation sites, the corresponding mitochondrial disease can be determined by searching the preset disease database.
  • For other mutation sites, the disease directly corresponding to the mutation-related information of the mutation site cannot be found in the preset disease database.
  • In that case, the mitochondrial disease corresponding to the mutation site can be determined according to the impact degree score.
  • When the first mitochondrial disease corresponding to the mutation-related information of the mutation site is recorded in the preset disease database, the first mitochondrial disease is used as the mitochondrial disease corresponding to the mutation site.
  • When the first mitochondrial disease is not recorded in the preset disease database and the impact degree score obtained in step S4 is greater than the preset threshold, the second mitochondrial disease corresponding to the adjacent site of the mutation site is obtained from the preset disease database and used as the mitochondrial disease corresponding to the mutation site.
  • When the first mitochondrial disease is not recorded in the preset disease database and the impact degree score obtained in step S4 is not greater than the preset threshold, the mutation site is considered to have little effect on protein function and to be insufficient to cause disease.
  • the adjacent site refers to the site closest to the mutation site among all sites satisfying the following two conditions.
  • the first condition is: the mitochondrial disease corresponding to the mutation-related information is recorded in the preset disease database; the second condition is: it is located in the same gene and protein as the mutation site.
  • For example, in step S2 the mutation site is determined to be the 2nd site on a certain protein sequence, and in step S4 the impact degree score of the mutation site on gene function is determined to be greater than the preset threshold. The preset disease database does not record a mitochondrial disease directly corresponding to the mutation-related information of this mutation site, but it does record mitochondrial diseases corresponding to the 4th and 10th sites on the same protein. In this case, the mitochondrial disease corresponding to the 4th site is used as the mitochondrial disease corresponding to the mutation site.
  • the preset threshold may be between 0.4 and 0.5, for example, the preset threshold is 0.4, or 0.45, or 0.5.
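The three-way decision described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the `DiseaseRecord` type, the function name, the gene label `GENE_X`, and the disease strings are hypothetical, and the database lookup is simplified to a (gene, site number) key instead of the full variation-related information.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DiseaseRecord:
    """One entry of a (hypothetical) preset disease database."""
    gene: str       # gene/protein the recorded site belongs to
    position: int   # site number within that protein sequence
    disease: str    # mitochondrial disease recorded for the site


def annotate_disease(gene: str, position: int, score: float,
                     database: List[DiseaseRecord],
                     threshold: float = 0.45) -> Optional[str]:
    """Return the disease assigned to a site, or None if none is assigned."""
    # Candidate sites must lie in the same gene and protein as the query site.
    same_gene = [r for r in database if r.gene == gene]

    # Case 1: the site itself is recorded, so use the "first" disease.
    for record in same_gene:
        if record.position == position:
            return record.disease

    # Case 2: not recorded, but the impact score exceeds the threshold,
    # so use the disease of the nearest recorded site (the "second" disease).
    # On a distance tie, min() keeps the record that appears first.
    if score > threshold and same_gene:
        nearest = min(same_gene, key=lambda r: abs(r.position - position))
        return nearest.disease

    # Case 3: not recorded and score not above threshold: no disease assigned.
    return None


db = [
    DiseaseRecord("GENE_X", 4, "disease recorded at site 4"),
    DiseaseRecord("GENE_X", 10, "disease recorded at site 10"),
]
print(annotate_disease("GENE_X", 2, 0.6, db))
```

With the sites 2, 4, and 10 from the example above, the call selects the record at site 4 because it is the closest recorded site on the same protein.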
  • Variation annotation is performed on the variation site to obtain its variation-related information, and the impact degree score of the variation site on gene function is then predicted from that information. Disease annotation is performed on the variation site according to the impact degree score and the preset disease database: when the first mitochondrial disease directly corresponding to the variation-related information of the variation site is recorded in the preset disease database, the first mitochondrial disease is taken as the mitochondrial disease corresponding to the variation site; when the first mitochondrial disease is not recorded in the preset disease database and the impact degree score is greater than the preset threshold, the second mitochondrial disease corresponding to the adjacent site of the variation site recorded in the preset disease database is used as the mitochondrial disease corresponding to the variation site. Therefore, the embodiments of the present disclosure can determine mitochondrial diseases not only when known mutations occur in the genome but also when unknown mutations occur.
  • Fig. 2 is a schematic diagram of a disease prediction method provided in other embodiments of the present disclosure, and Fig. 2 is a specific implementation scheme of Fig. 1 .
  • the disease prediction methods include:
  • step S1 includes step S11 and step S12.
  • In step S11, the detection sample may be subjected to gene sequencing using Nanopore sequencing technology or targeted enrichment sequencing technology to obtain the initial gene sequencing data.
  • Compared with NGS sequencing, Nanopore sequencing technology has a longer read length and has clear advantages in genome assembly of animals and plants.
  • Step S12 may specifically include: analyzing, quality-controlling, and filtering the initial gene sequencing data to obtain high-quality data, providing accurate data for subsequent bioinformatics analysis.
  • Analysis, quality control, and filtering are performed using the NanoPack analysis software package.
  • the filtering parameters are designed according to the actual situation.
  • For example, the filtering parameter is "-Q 7 -l 1000 --headcrop 100 --tailcrop 100", that is, sequences shorter than 1000 bases and sequences whose average quality value over the entire sequence is below Q7 are filtered out, and 100 bp is trimmed from the head and the tail of each sequence.
  • See Table 1 for specific filtering information.
  • Mean read length: average read length
  • Mean read quality: average read quality
  • Median read length: median of read length
  • Median read quality: median of read quality
  • Number of reads: number of reads; Read length N50: the N50 value of the read length
  • STDEV read length: the standard deviation of the read length
  • Q5-Q15: Nanopore quality value statistics; the statistical content is the number of reads, the percentage of the total number, and the total number of bases.
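The filtering rule given for step S12 (minimum length 1000, minimum average quality Q7, 100 bp trimmed from each end) can be illustrated in plain Python. This is a sketch under stated assumptions rather than the behaviour of the filtering software itself: the function names are hypothetical, the average quality is taken as a simple arithmetic mean of per-base Phred scores, and length is checked before trimming.

```python
from typing import List, Sequence, Tuple

Read = Tuple[str, Sequence[int]]  # (base sequence, per-base Phred qualities)


def mean_quality(qualities: Sequence[int]) -> float:
    """Arithmetic mean of per-base Phred scores (a simplification)."""
    return sum(qualities) / len(qualities)


def filter_and_crop(reads: List[Read], min_quality: float = 7,
                    min_length: int = 1000, headcrop: int = 100,
                    tailcrop: int = 100) -> List[Read]:
    """Drop short or low-quality reads, then trim both ends of the rest."""
    kept = []
    for seq, quals in reads:
        # Discard empty reads, reads below the length cutoff,
        # and reads whose mean quality is below the quality cutoff.
        if not seq or len(seq) < min_length or mean_quality(quals) < min_quality:
            continue
        # Trim `headcrop` bases from the start and `tailcrop` from the end.
        end = len(seq) - tailcrop
        kept.append((seq[headcrop:end], quals[headcrop:end]))
    return kept


reads = [
    ("A" * 1200, [10] * 1200),  # kept, trimmed to 1000 bases
    ("C" * 500, [20] * 500),    # dropped: shorter than 1000
    ("G" * 1500, [5] * 1500),   # dropped: mean quality below 7
]
print([len(seq) for seq, _ in filter_and_crop(reads)])
```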
  • FIG. 3 is a statistical diagram of the read length distribution of off-machine data provided in an example of the present disclosure.
  • The off-machine data is the initial gene sequencing data. The horizontal axis represents the read length, and the vertical axis represents the number of reads (Number of reads). As shown in FIG. 3, the read lengths are mostly distributed around 1000.
  • FIG. 4A is a base composition plot of the first 100 bases of the off-machine data reads provided in an example of the present disclosure, and FIG. 4B is a base composition plot of the last 100 bases of the off-machine data reads provided in an example of the present disclosure.
  • The vertical axes of FIG. 4A and FIG. 4B represent the frequency of each nucleotide in the reads (frequency of nucleotide in read); the horizontal axis of FIG. 4A represents the position from the start of the read (position in read from start), and the horizontal axis of FIG. 4B represents the position from the end of the read (position in read from end).
  • FIG. 5A is an average base quality plot of the first 100 bases of the off-machine data provided in an example of the present disclosure, and FIG. 5B is an average base quality plot of the last 100 bases of the off-machine data provided in an example of the present disclosure.
  • The vertical axes of FIG. 5A and FIG. 5B represent the average base quality (Mean quality score of base calls); the horizontal axis of FIG. 5A represents the position from the start of the read (position in read from start), and the horizontal axis of FIG. 5B represents the position from the end of the read (position in read from end).
  • step S2 includes step S21 and step S22:
  • In step S21, the position of the gene sequencing data in the reference mitochondrial genome can be determined.
  • The minimap2 tool can be used for the comparison, with the comparison parameter "-ax map-ont"; the output of minimap2 is in SAM format.
  • The samtools tool is used to convert the comparison result data from SAM format to BAM format and to sort the resulting BAM file. The flagstat and stats commands of samtools are then used to compute comparison statistics, and the plot-bamstats program of samtools is used to visualize the comparison results.
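The comparison and statistics steps can be sketched as a list of command lines. This assumes the comparison parameter refers to minimap2's `-ax map-ont` preset for Nanopore reads; the file names and the helper names `build_alignment_steps` and `run_steps` are hypothetical, and the exact options used in the disclosure are not given beyond that parameter.

```python
import subprocess
from typing import List, Optional, Tuple

# Each step is (argv, file that receives stdout or None).
Step = Tuple[List[str], Optional[str]]


def build_alignment_steps(reference: str, reads: str, prefix: str) -> List[Step]:
    """Build the align, sort, statistics, and plotting commands."""
    sam = f"{prefix}.sam"
    bam = f"{prefix}.sorted.bam"
    stats = f"{prefix}.stats.txt"
    return [
        # Align Nanopore reads to the reference; output is SAM.
        (["minimap2", "-ax", "map-ont", "-o", sam, reference, reads], None),
        # Convert SAM to a coordinate-sorted BAM file.
        (["samtools", "sort", "-o", bam, sam], None),
        # Summary alignment statistics, written to text files.
        (["samtools", "flagstat", bam], f"{prefix}.flagstat.txt"),
        (["samtools", "stats", bam], stats),
        # Plot the output of `samtools stats`.
        (["plot-bamstats", "-p", f"{prefix}_plots/", stats], None),
    ]


def run_steps(steps: List[Step]) -> None:
    """Run each command in order, redirecting stdout where requested."""
    for argv, stdout_path in steps:
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "w") as handle:
                subprocess.run(argv, check=True, stdout=handle)


for argv, _ in build_alignment_steps("chrM.fa", "reads.fastq", "sample"):
    print(" ".join(argv))
```

Building the commands separately from running them keeps the sketch inspectable without the tools installed; `run_steps` is only needed when minimap2 and samtools are available on the path.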
  • FIG. 6 is a schematic diagram of the visualized output after statistics of the comparison result data in step S2a provided in an example of the present disclosure.
  • Panel (a) of FIG. 6 is a coverage plot obtained from the statistics of the comparison result data; the horizontal axis represents coverage (Coverage), and the vertical axis represents the number of bases that can be compared with the reference mitochondrial genome (Number of mapped bases).
  • Panel (b) of FIG. 6 is a GC distribution plot based on the statistics of the comparison results; the horizontal axis represents the GC content (GC content), and the vertical axis represents the normalized frequency (Normalized Frequency).
  • Panel (c) of FIG. 6 is a quality distribution plot of the reads that can be compared with the reference mitochondrial genome; the horizontal axis represents the length of the reads that can be compared with the reference mitochondrial genome, namely Cycle (fwd reads), and the vertical axis represents the quality value (Quality).
  • step S22 specifically includes:
  • S22a: Perform SNV detection on the comparison result data to obtain a first detection result, where the first detection result includes the SNV sites in the comparison result data.
  • The longshot tool can be used for SNV detection.
  • This tool is a variation detection tool that can detect variations accurately in error-prone read data.
  • The tool takes a BAM file as input and outputs a VCF file containing variation information and genotype information. mitoDel V3.0 can be used to detect indels in the mitochondrial genome; the indel output includes the read count for the indel, the starting position of the indel, the position of the indel, and whether the quality filter was passed.
  • The mutation sites detected in the above step S2 include SNV sites and indel sites.
  • The order of step S22a and step S22b is not particularly limited, as long as SNV detection and indel detection are performed separately. Performing SNV detection and indel detection separately can improve detection accuracy.
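Since the two detection results are produced independently, combining them into one list of variation sites is order-insensitive. A minimal sketch, assuming each detection result has already been parsed into dictionaries with a `position` key (the record layout and the example values are hypothetical, not the output format of either tool):

```python
from typing import Dict, List


def merge_variant_sites(snv_sites: List[Dict], indel_sites: List[Dict]) -> List[Dict]:
    """Tag each site with its kind and return all sites sorted by position."""
    merged = [{"kind": "SNV", **site} for site in snv_sites]
    merged += [{"kind": "indel", **site} for site in indel_sites]
    return sorted(merged, key=lambda site: site["position"])


# Placeholder positions for illustration only.
snvs = [{"position": 300, "ref": "A", "alt": "G"}]
indels = [{"position": 120, "length": 2}]
for site in merge_variant_sites(snvs, indels):
    print(site["position"], site["kind"])
```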
  • The variation-related information may specifically include: variation type, variation region, variation position, and the changes in CDS and protein caused by the variation, where CDS (coding sequence) is the sequence on a gene that encodes a protein.
  • The variation type can be: nonsense mutation, frameshift mutation, synonymous mutation, etc.
  • The variation region can be: non-gene region, gene region, control region, etc.
  • The variation position can be mapped to a position on the specific gene and protein sequence, expressed as a change in a specific transcript, CDS, or protein of the gene.
  • Step S4 includes: separately predicting the impact degree scores of different variation-related information on gene function to obtain multi-faceted impact degree scores; and determining the impact degree score of the variation site on gene function according to the multi-faceted impact degree scores. For example, the impact degree scores of information such as mutation type and mutation position on gene function are predicted separately, so as to obtain the multi-faceted impact degree scores.
  • Fig. 7 is a schematic diagram of an optional implementation manner of step S4 provided in some embodiments of the present disclosure.
  • step S4 specifically includes steps S41 to S44.
  • step S41 may specifically include step S41a and step S41b.
  • Multiple prediction tools, including PANTHER, PolyPhen-2, and SIFT, are each used for prediction, and each prediction tool can predict the degree of influence of the mutation site on protein conservation and physicochemical properties.
  • The degree of influence can be one of four: no impact, possible impact, harmful, or unpredictable.
  • When the degree of influence of the mutation site on protein conservation and physicochemical properties is no impact, the reference impact score output by the prediction tool is 0; when it is possible impact, the reference impact score is 0.5; when it is harmful, the reference impact score is 1; when the degree of influence cannot be predicted, the reference impact score is NA (no score).
  • The reference impact score obtained by PANTHER is recorded as $S_{\text{PANTHER}}$,
  • the reference impact score obtained by PolyPhen-2 is recorded as $S_{\text{PolyPhen-2}}$,
  • and the reference impact score obtained by SIFT is recorded as $S_{\text{SIFT}}$.
  • The first score is then $S_i = (S_{\text{PANTHER}} + S_{\text{PolyPhen-2}} + S_{\text{SIFT}})/N$,
  • where N is the number of prediction tools that returned a score, i.e., the number of prediction tools remaining after those that returned NA are removed.
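The averaging rule for the first score can be sketched as below. The category labels and function name are hypothetical, NA is represented as `None`, and returning `None` when every tool gives NA is an assumption, since that case is not described here.

```python
from typing import Dict, Iterable, Optional

# Reference impact score per predicted degree of influence; None stands for NA.
IMPACT_TO_SCORE: Dict[str, Optional[float]] = {
    "no impact": 0.0,
    "possible impact": 0.5,
    "harmful": 1.0,
    "unpredictable": None,
}


def first_score(impacts: Iterable[str]) -> Optional[float]:
    """Average the reference scores over the tools that returned a score."""
    scores = [IMPACT_TO_SCORE[label] for label in impacts]
    scored = [s for s in scores if s is not None]  # drop NA results
    if not scored:
        return None  # assumption: no tool produced a score
    return sum(scored) / len(scored)  # N = number of tools with a score


# One label per tool, e.g. PANTHER, PolyPhen-2, SIFT (placeholder predictions).
print(first_score(["harmful", "possible impact", "unpredictable"]))
```

In this call two tools return scores (1 and 0.5) and one returns NA, so N is 2 and the result is 0.75.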
  • Step S42 may specifically include: determining, according to the preset first mapping relationship, the score of the degree of influence of the variation type of the mutation site on gene function, where the first mapping relationship records the scores of the degree of influence of multiple different variation types on gene function.
  • Synonymous mutations and intergenic region mutations usually have no effect on gene function; therefore, when the variation type of the variation site is a synonymous mutation or an intergenic region mutation, the second score is 0. Non-synonymous mutations and non-frameshift mutations may affect protein function; therefore, when the variation type of the mutation site is a non-synonymous mutation or a non-frameshift mutation, the second score is 0.5. Nonsense mutations and frameshift mutations have a great impact on protein function; therefore, when the variation type of the mutation site is a nonsense mutation or a frameshift mutation, the second score is 1.
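The first mapping relationship amounts to a lookup table from variation type to score. A minimal sketch follows; the dictionary name and key strings are hypothetical labels that use the standard English terms (synonymous, frameshift) for the variation types listed above.

```python
# Hypothetical encoding of the first mapping relationship:
# variation type -> second score.
SECOND_SCORE_BY_TYPE = {
    "synonymous": 0.0,       # usually no effect on gene function
    "intergenic": 0.0,
    "non-synonymous": 0.5,   # may affect protein function
    "non-frameshift": 0.5,
    "nonsense": 1.0,         # large impact on protein function
    "frameshift": 1.0,
}


def second_score(variation_type: str) -> float:
    """Look up the second score; unknown types raise KeyError."""
    return SECOND_SCORE_BY_TYPE[variation_type]


print(second_score("frameshift"))
```

Raising on an unknown type, instead of defaulting to 0, is a design choice of this sketch so that unmapped variation types are noticed.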
  • the variation position may include: the position number n of the variation position in the protein sequence, that is, the variation position is located at the nth amino acid position in the protein sequence.
  • Si is the first score
  • St is the second score
  • Sp is the third score
  • the above-mentioned “respectively predict the influence degree scores of different mutation-related information on gene functions, and obtain multi-faceted influence degree scores” includes the above steps S41-S43, and the multi-faceted influence degree scores include: the first score, the second score and Third score.
  • the above-mentioned “determining the score of the influence degree of the mutation site on the function of the gene according to the influence degree scores of various aspects” includes the above-mentioned step S44.
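The formula used in step S44 is not reproduced in this text. Purely as an illustration, the sketch below assumes it is a weighted sum in which three coefficients, stated elsewhere in this document to lie in 0.15-0.25, 0.15-0.25, and 0.5-0.7, multiply Si, St, and Sp in that order. The function name, the parameter names `w1`, `w2`, `w3`, the default values, and the pairing of coefficients with scores are all assumptions.

```python
def overall_score(s_i: float, s_t: float, s_p: float,
                  w1: float = 0.2, w2: float = 0.2, w3: float = 0.6) -> float:
    """Assumed weighted sum of the first, second, and third scores."""
    # Check each coefficient against the range stated for it.
    for name, value, low, high in (("w1", w1, 0.15, 0.25),
                                   ("w2", w2, 0.15, 0.25),
                                   ("w3", w3, 0.5, 0.7)):
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside [{low}, {high}]")
    return w1 * s_i + w2 * s_t + w3 * s_p


print(round(overall_score(1.0, 0.5, 0.0), 3))
```

The resulting value would then be compared with the preset threshold during disease annotation.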
  • the disease annotation is performed on the variant site to obtain the mitochondrial disease corresponding to the variant site.
  • When the first mitochondrial disease corresponding to the mutation-related information of the mutation site is recorded in the preset disease database, the first mitochondrial disease is used as the mitochondrial disease corresponding to the mutation site; when the first mitochondrial disease is not recorded in the preset disease database and the impact degree score is greater than the preset threshold, the second mitochondrial disease corresponding to the adjacent site of the mutation site is obtained from the preset disease database, and the second mitochondrial disease is used as the mitochondrial disease corresponding to the mutation site.
  • When step S5 is performed, the mutation and disease data on MITOMAP can be downloaded and a database in tsv format (called mitoDisease) can be constructed, which serves as the preset disease database.
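A tsv-format database of this kind can be loaded with the standard `csv` module. The sketch below is illustrative only: the column names `gene`, `position`, and `disease`, the function name, and the sample rows are hypothetical and do not reflect the actual layout of the downloaded data or of mitoDisease.

```python
import csv
import io
from typing import Dict, List, Tuple


def load_disease_database(tsv_text: str) -> Dict[Tuple[str, int], List[str]]:
    """Index diseases by (gene, position) from tab-separated text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    database: Dict[Tuple[str, int], List[str]] = {}
    for row in reader:
        key = (row["gene"], int(row["position"]))
        # A site may be associated with more than one disease.
        database.setdefault(key, []).append(row["disease"])
    return database


# Placeholder rows for illustration only.
example_tsv = (
    "gene\tposition\tdisease\n"
    "GENE_X\t4\tDisease A\n"
    "GENE_X\t10\tDisease B\n"
)
print(load_disease_database(example_tsv)[("GENE_X", 4)])
```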
  • the preset threshold may be 0.5.
  • Table 2 shows some mitochondrial diseases and the corresponding variation-related information.
  • Mutation Num represents the variation annotation number; Position represents the position of the variant site on the mitochondrial genome; Mutation(hgvs.g), Mutation(hgvs.c), and Mutation(hgvs.p) represent the variation in standard hgvs format at the genome, CDS, and protein level, respectively; S indicates the impact score of the variant site on gene function; IF (mitoDisease) indicates whether the current variation is in the mitoDisease database, where 1 indicates that it exists in the database and 0 indicates that it does not; if the current variation is in the database, mitoDisease Num is the number of the current variation in the mitoDisease database; if it is not, mitoDisease Num is the number, in the mitoDisease database, of the variation at the neighboring site of the current variant site; mitoDisease-gene is the gene name in mitoDisease corresponding to the current variation; Disease is the name of the disease associated with the variation; Status indicates whether the current variation has been reported
  • Fig. 8 is a schematic diagram of a disease prediction device provided in some embodiments of the present disclosure, which is used to implement the above-mentioned disease prediction method.
  • the disease prediction device includes: a data acquisition module 10 , an analysis module 20 , a variation annotation module 30 , a prediction module 40 and a disease annotation module 50 .
  • the data acquisition module 10 is configured to acquire gene sequencing data of the test sample.
  • the analysis module 20 is configured to perform data analysis on the gene sequencing data to obtain mutation sites in the gene sequencing data.
  • the variation annotation module 30 is configured to perform variation annotation on the variant site to obtain variation-related information of the variant site.
  • the prediction module 40 is configured to predict the score of the degree of influence of the variation site on gene function according to the variation-related information of the variation site.
  • the disease annotation module 50 is configured to perform disease annotation on the variant site according to the impact degree score of the variant site on gene function and the preset disease database, and obtain the mitochondrial disease corresponding to the variant site.
  • when the first mitochondrial disease corresponding to the variation-related information of the variant site is recorded in the preset disease database, the first mitochondrial disease is used as the mitochondrial disease corresponding to the variant site.
  • when the first mitochondrial disease is not recorded in the preset disease database and the impact score is greater than the preset threshold, the second mitochondrial disease corresponding to the neighboring site of the variant site is obtained from the preset disease database and used as the mitochondrial disease corresponding to the variant site.
  • FIG. 9 is a schematic diagram of an electronic device provided in some embodiments of the present disclosure.
  • the electronic device 100 includes: a memory 101 and a processor 102, and a computer program is stored on the memory 101, wherein, when the computer program is executed by the processor 102, the above-mentioned disease prediction method is realized, for example, steps S1 to S4 in FIG. 1 are realized.
  • the electronic device 100 may be computing devices such as desktop computers, notebooks, palmtop computers, and cloud servers.
  • the electronic device 100 may include, but not limited to, a processor 102 and a memory 101 .
  • FIG. 9 is only an example of the electronic device 100 and does not constitute a limitation on the electronic device 100, which may include more or fewer components than those shown in the figure, combine certain components, or use different components.
  • the electronic device 100 may further include an input and output device, a network access device, a bus, and the like.
  • the processor 102 can be a central processing unit (Central Processing Unit, CPU), and can also be other general-purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), off-the-shelf Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • the general purpose processor 102 may be a microprocessor or the processor may be any conventional processor or the like.
  • the memory 101 may be an internal storage unit of the electronic device 100, such as a hard disk or internal memory of the electronic device 100.
  • the memory 101 can also be an external storage device of the electronic device 100, such as a plug-in hard disk equipped on the electronic device 100, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, flash card (Flash Card), etc.
  • the memory 101 may also include both an internal storage unit of the electronic device 100 and an external storage device.
  • the memory 101 is used to store the computer program and other programs and data required by the electronic device.
  • the memory 101 can also be used to temporarily store data that has been output or will be output.
  • FIG. 10 is a schematic diagram of a computer-readable storage medium provided in some embodiments of the present disclosure.
  • a computer program 201 is stored on the computer-readable storage medium 200, wherein the computer program 201, when executed by a processor, implements the above disease prediction method, for example, steps S1 to S4 in FIG. 1.
  • the computer-readable storage medium 200 includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disk (DVD) or other optical disk storage, magnetic cassette, magnetic tape, magnetic disk storage or other magnetic storage device, or any other medium that can be used to store desired information and that can be accessed by a computer.
  • communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.

Abstract

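Provided are a disease prediction method, a disease prediction apparatus, an electronic device (100), and a computer-readable storage medium (200). The disease prediction method includes: acquiring gene sequencing data of a test sample; performing data analysis on the gene sequencing data to obtain variant sites in the gene sequencing data; performing variation annotation on a variant site to obtain variation-related information of the variant site; predicting, according to the variation-related information of the variant site, an impact score of the variant site on gene function; and performing disease annotation on the variant site according to the impact score of the variant site on gene function and a preset disease database, to obtain the mitochondrial disease corresponding to the variant site.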

Description

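Disease Prediction Method and Apparatus, Electronic Device, and Computer-Readable Storage Medium
Technical Field
The present disclosure relates to the field of display technology, and in particular to a disease prediction method, a disease prediction apparatus, an electronic device, and a computer-readable storage medium.
Background
A gene mutation is a change in the bases of a genomic DNA molecule (SNV) or a change in their arrangement order (indel). Gene mutations vary greatly in their effect on an organism. Mutations in non-gene regions and non-regulatory regions have essentially no effect on the organism; mutations in gene regulatory regions can prevent normal transcription of the gene; and mutations in exons or at intron-exon junctions may cause mRNA degradation, or may affect normal protein translation, the three-dimensional structure of the protein, its subcellular localization, its normal membrane spanning, enzyme activity, and so on.
Mitochondria are organelles involved in energy metabolism and are indispensable to many life processes, including cell survival and cell death. Abnormal oxidative phosphorylation in the respiratory chain is associated with many human diseases. Common mitochondrial diseases include subacute necrotizing encephalomyelopathy (Leigh syndrome), deafness, encephalomyopathy, and dystonia. The mutations behind these mitochondrial diseases include point mutations and deletions, and the affected regions include rRNA/tRNA regions as well as coding and non-coding regions.
Summary
The present disclosure proposes a disease prediction method, a disease prediction apparatus, an electronic device, and a computer-readable storage medium.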
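The present disclosure provides a disease prediction method, including:
acquiring gene sequencing data of a test sample;
performing data analysis on the gene sequencing data to obtain variant sites in the gene sequencing data;
performing variation annotation on the variant site to obtain variation-related information of the variant site;
predicting, according to the variation-related information of the variant site, an impact score of the variant site on gene function;
performing disease annotation on the variant site according to the impact score of the variant site on gene function and a preset disease database, to obtain the mitochondrial disease corresponding to the variant site;
wherein, when the preset disease database records a first mitochondrial disease corresponding to the variation-related information of the variant site, the first mitochondrial disease is taken as the mitochondrial disease corresponding to the variant site; and when the preset disease database does not record the first mitochondrial disease and the impact score is greater than a preset threshold, a second mitochondrial disease corresponding to a neighboring site of the variant site is obtained from the preset disease database and taken as the mitochondrial disease corresponding to the variant site.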
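In some embodiments, the variation-related information includes: the variation type, the variation region, the variation position, and the changes in the CDS and the protein caused by the variation.
In some embodiments, predicting the impact score of the variant site on gene function according to the variation-related information of the variant site specifically includes:
separately predicting the impact scores of different items of variation-related information on gene function, to obtain impact scores for multiple aspects;
determining the impact score of the variant site on the gene function according to the impact scores for the multiple aspects.
In some embodiments, separately predicting the impact scores of different items of variation-related information on gene function, to obtain impact scores for multiple aspects, specifically includes:
obtaining an impact score of the variant site on protein conservation and physicochemical properties, as a first score;
obtaining an impact score of the variation type of the variant site on gene function, as a second score;
obtaining an impact score of the variation position of the variant site on gene function, as a third score;
wherein the impact scores for the multiple aspects include: the first score, the second score, and the third score;
determining the impact score of the variant site on the gene function according to the impact scores for the multiple aspects specifically includes: determining the impact score of the variant site on gene function according to the following formula:
S=λ1*Si+λ2*St+λ3*Sp, where S is the impact score of the variant site on gene function, Si is the first score, St is the second score, and Sp is the third score; λ1, λ2, and λ3 are preset weights, and λ1+λ2+λ3=1.
In some embodiments, λ1 and λ2 are each between 0.15 and 0.25, and λ3 is between 0.5 and 0.7.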
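In some embodiments, obtaining the impact score of the variant site on protein conservation and physicochemical properties specifically includes:
analyzing the variation-related information of the variant site with each of multiple prediction tools, to predict multiple reference impact scores of the variant site on protein conservation and physicochemical properties;
taking the average of the multiple reference impact scores as the first score.
In some embodiments, obtaining the impact score of the variation type of the variant site on gene function includes:
determining the impact score of the variation type of the variant site on gene function according to a preset first mapping; wherein the first mapping records the impact scores of multiple different variation types on gene function.
In some embodiments, the variation position includes the position number n of the variant site in the protein sequence;
obtaining the impact score of the variation position of the variant site on gene function specifically includes:
when the variation type of the variant site is a frameshift mutation or a nonsense mutation, determining the third score according to the following formula:
Sp=1-n/L, where L is the length of the protein sequence;
when the variation type of the variant site is any type other than the frameshift mutation and the nonsense mutation, determining that the third score is 0.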
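In some embodiments, acquiring the gene sequencing data of the test sample includes:
acquiring initial gene sequencing data of the test sample;
filtering the initial gene sequencing data to obtain the gene sequencing data.
In some embodiments, acquiring the initial gene sequencing data of the test sample includes:
acquiring the initial gene sequencing data of the test sample using Nanopore sequencing or targeted enrichment sequencing.
In some embodiments, performing data analysis on the gene sequencing data to obtain the variant sites in the gene sequencing data specifically includes:
aligning the gene sequencing data against the reference sequence of a reference mitochondrial genome to determine alignment result data, the alignment result data including: the sites of the gene sequencing data in the reference mitochondrial genome;
performing variation detection on the alignment result data to determine the variant sites in the alignment result data.
In some embodiments, performing variation detection on the alignment result data to determine the variant sites in the alignment result data specifically includes:
performing SNV detection on the alignment result data to obtain a first detection result, the first detection result including: the SNV sites included in the alignment result data;
performing indel detection on the alignment result data to obtain a second detection result, the second detection result including: the indel sites included in the alignment result data;
wherein the variant sites include the SNV sites and the indel sites.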
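Embodiments of the present disclosure further provide a disease prediction apparatus, including:
a data acquisition module, configured to acquire gene sequencing data of a test sample;
an analysis module, configured to perform data analysis on the gene sequencing data to obtain variant sites in the gene sequencing data;
a variation annotation module, configured to perform variation annotation on the variant site to obtain variation-related information of the variant site;
a prediction module, configured to predict, according to the variation-related information of the variant site, an impact score of the variant site on gene function;
a disease annotation module, configured to perform disease annotation on the variant site according to the impact score of the variant site on gene function and a preset disease database, to obtain the mitochondrial disease corresponding to the variant site;
wherein, when the preset disease database records a first mitochondrial disease corresponding to the variation-related information of the variant site, the first mitochondrial disease is taken as the mitochondrial disease corresponding to the variant site; and when the preset disease database does not record the first mitochondrial disease and the impact score is greater than a preset threshold, a second mitochondrial disease corresponding to a neighboring site of the variant site is obtained from the preset disease database and taken as the mitochondrial disease corresponding to the variant site.
Embodiments of the present disclosure further provide an electronic device, including a memory and a processor, the memory storing a computer program, wherein the computer program, when executed by the processor, implements the above method.
Embodiments of the present disclosure further provide a computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the above method.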
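Brief Description of the Drawings
The accompanying drawings provide a further understanding of the present disclosure and form part of the specification. Together with the detailed description below, they serve to explain the present disclosure but do not limit it. In the drawings:
FIG. 1 is a schematic diagram of a disease prediction method provided in some embodiments of the present disclosure.
FIG. 2 is a schematic diagram of a disease prediction method provided in other embodiments of the present disclosure.
FIG. 3 is a read length distribution chart of the raw sequencing output provided in an example of the present disclosure.
FIG. 4A shows the composition of the first 100 bases of the reads in the raw sequencing output provided in an example of the present disclosure.
FIG. 4B shows the composition of the last 100 bases of the reads in the raw sequencing output provided in an example of the present disclosure.
FIG. 5A shows the mean base quality of the first 100 bases of the raw sequencing output provided in an example of the present disclosure.
FIG. 5B shows the mean base quality of the last 100 bases of the raw sequencing output provided in an example of the present disclosure.
FIG. 6 is a schematic diagram of the visualized output of the statistics on the alignment result data of step S21 provided in an example of the present disclosure.
FIG. 7 is a schematic diagram of an optional implementation of step S4 provided in some embodiments of the present disclosure.
FIG. 8 is a schematic diagram of a disease prediction apparatus provided in some embodiments of the present disclosure.
FIG. 9 is a schematic diagram of an electronic device provided in some embodiments of the present disclosure.
FIG. 10 is a schematic diagram of a computer-readable storage medium provided in some embodiments of the present disclosure.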
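Detailed Description
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described here only illustrate and explain the present disclosure and do not limit it.
Explanations of some terms used in the specific embodiments of the present disclosure:
High-throughput sequencing, also called "next-generation" sequencing (Next Generation Sequencing, NGS), is characterized by sequencing hundreds of thousands to millions of DNA molecules in parallel in a single run, and generally by short read lengths. Sequencing means determining the base sequence of a particular DNA fragment, that is, the order of adenine (A), thymine (T), cytosine (C), and guanine (G). The emergence of fast DNA sequencing methods has greatly advanced research and discovery in biology and medicine.
Read: a sequence obtained from high-throughput sequencing, containing the sequenced base information and quality value information.
Nanopore: nanopore single-molecule sequencing, which sequences using electrical signals and an endonuclease. Reads are very long, with average lengths typically ranging from over ten Kbp to several tens of Kbp.
Genome: in molecular biology and genetics, the genome is the sum of all the genetic material of an organism. This genetic material consists of DNA or RNA. The genome includes coding and non-coding DNA, mitochondrial DNA, and chloroplast DNA.
Gene mutation: in biology, a change in the genetic material of a cell (usually the deoxyribonucleic acid in the cell nucleus). It includes point mutations caused by a single base change, and deletions, duplications, and insertions of multiple bases. Causes include errors in copying the genetic material during cell division, and the effects of chemicals, genotoxicity, radiation, or viruses.
SNV (single nucleotide variation): a change in a single DNA base.
Indel (insertion or deletion): a mutation type in which bases are inserted into or deleted from the DNA.
hgvs (human genome variation society): the standard format for describing human genome variations.
Traditional identification of mitochondrial diseases relies mainly on clinical biochemistry. This places high demands on physicians, can lead to misdiagnosis or missed diagnosis, and makes relatively rare mitochondrial diseases hard to judge; it also suffers from low throughput, complicated operation, and long turnaround. Methods that detect mitochondrial diseases through gene sequencing can usually assess only known variations. For variations that have not been reported but likewise affect gene and protein function, they have no identification capability, that is, they cannot determine which disease such a variation leads to.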
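FIG. 1 is a schematic diagram of a disease prediction method provided in some embodiments of the present disclosure. The method is particularly suitable for predicting mitochondrial diseases through genetic testing. As shown in FIG. 1, the disease prediction method includes:
S1. Acquire gene sequencing data of a test sample.
The test sample may be a DNA sample from a patient with a mitochondrial disease, for example the patient's plasma or serum. The gene sequencing data of the test sample can be obtained by sequencing on a third-generation sequencer.
S2. Perform data analysis on the gene sequencing data to obtain the variant sites in the gene sequencing data.
A variant site is a position in the gene sequencing data of the test sample at which the base differs from the base at the same position in the reference genome. Such variant sites may be pathogenic sites that affect human health or cause disease.
In some embodiments, the data analysis in step S2 may include: performing quality control and filtering on the gene sequencing data to obtain high-quality data, and performing variation detection on the filtered data to determine the variant sites. Several types of detection are possible; for example, sites with an SNV mutation can be found by SNV detection.
S3. Perform variation annotation on the variant site to obtain a variation annotation result, which includes the variation-related information of the variant site.
In some embodiments, the variation-related information may include: the variation type, the variation position, and the changes in the CDS bases and the protein caused by the variation.
S4. Predict the impact score of the variant site on gene function according to the variation-related information of the variant site.
The impact score expresses the degree to which the variant site affects gene function. The higher the impact score, the greater the effect of the variant site on gene function; the lower the impact score, the smaller the effect.
Variant sites of different variation types may affect gene function to different degrees, and so may variant sites at different variation positions. The impact score of a variant site on gene function can therefore be predicted from its variation-related information.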
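S5. Perform disease annotation on the variant site according to the impact score of the variant site on gene function and a preset disease database, to obtain the mitochondrial disease corresponding to the variant site.
The preset disease database records the correspondence between known variation-related information and mitochondrial diseases. In other words, for the variation-related information of some variant sites, the corresponding mitochondrial disease can be determined by looking it up in the preset disease database. In other cases, however, no disease directly corresponding to the variation-related information of the variant site can be found in the preset disease database; the mitochondrial disease corresponding to the variant site can then be determined according to the impact score.
For example, when the preset disease database records a first mitochondrial disease corresponding to the variation-related information of the variant site, the first mitochondrial disease is taken as the mitochondrial disease corresponding to the variant site. When the preset disease database does not record a first mitochondrial disease corresponding to the variation-related information of the variant site, and the impact score calculated in step S4 is greater than the preset threshold, a second mitochondrial disease corresponding to a neighboring site of the variant site is obtained from the preset disease database and taken as the mitochondrial disease corresponding to the variant site. When the preset disease database does not record a first mitochondrial disease corresponding to the variation-related information of the variant site, and the impact score obtained in step S4 is not greater than the preset threshold, the variant site is considered to have little effect on protein function and to be insufficient to cause disease.
The neighboring site is the site closest to the variant site among all sites that satisfy the following two conditions. First, the mitochondrial disease corresponding to its variation-related information is recorded in the preset disease database. Second, it is located in the same gene and the same protein as the variant site. For example, suppose step S2 determines that the variant site is the 2nd site of a protein sequence, and step S4 determines that its impact score on gene function is greater than the preset threshold. The preset disease database records no mitochondrial disease directly corresponding to the variation-related information of the variant site, but does record the mitochondrial diseases corresponding to the 4th and 10th sites of the same protein. In that case, the mitochondrial disease corresponding to the 4th site is taken as the mitochondrial disease corresponding to the variant site.
In some examples, the preset threshold may be between 0.4 and 0.5, for example 0.4, 0.45, or 0.5.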
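The three-way decision of step S5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the `DiseaseRecord` fields, the function name `annotate_disease`, matching on an exact variant string, and measuring distance by amino-acid position are all hypothetical choices, and the default threshold of 0.5 is only one of the example values given above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DiseaseRecord:
    """One row of a hypothetical preset disease database."""
    gene: str         # gene (and hence protein) the recorded variation belongs to
    protein_pos: int  # amino-acid position of the recorded variation
    variant: str      # variation description, e.g. an hgvs protein-level string
    disease: str      # mitochondrial disease recorded for this variation


def annotate_disease(gene: str, protein_pos: int, variant: str, score: float,
                     database: List[DiseaseRecord],
                     threshold: float = 0.5) -> Optional[str]:
    """Return the disease assigned to a variant site, or None if none is assigned."""
    # Case 1: the database records this exact variation (first mitochondrial disease).
    for rec in database:
        if rec.gene == gene and rec.variant == variant:
            return rec.disease
    # Case 3: not recorded and score not above the threshold: treated as not pathogenic.
    if score <= threshold:
        return None
    # Case 2: not recorded, score above threshold: use the nearest recorded site
    # on the same gene/protein (second mitochondrial disease).
    candidates = [rec for rec in database if rec.gene == gene]
    if not candidates:
        return None
    # Ties are broken by database order here; the source does not specify a rule.
    nearest = min(candidates, key=lambda rec: abs(rec.protein_pos - protein_pos))
    return nearest.disease


# Mirrors the example in the text: variant at site 2, recorded sites 4 and 10.
db = [
    DiseaseRecord("GENE_X", 4, "p.X4Y", "disease A"),
    DiseaseRecord("GENE_X", 10, "p.X10Y", "disease B"),
]
print(annotate_disease("GENE_X", 2, "p.X2Y", 0.6, db))
```

With the example database, the unrecorded variant at site 2 is assigned the disease of site 4, the nearer of the two recorded sites.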
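In the embodiments of the present disclosure, after the variant sites in the gene sequencing data are determined, variation annotation is performed on a variant site to obtain its variation-related information. The impact score of the variant site on gene function is then predicted from the variation-related information, and disease annotation is performed on the variant site according to the impact score and the preset disease database. When the preset disease database records a first mitochondrial disease directly corresponding to the variation-related information of the variant site, that first mitochondrial disease is taken as the mitochondrial disease corresponding to the variant site. When the preset disease database records no such first mitochondrial disease and the impact score is greater than the preset threshold, the second mitochondrial disease recorded in the preset disease database for a neighboring site of the variant site is taken as the mitochondrial disease corresponding to the variant site. The embodiments of the present disclosure can therefore determine the mitochondrial disease not only when a known variation occurs in the genome, but also when an unknown variation occurs.
FIG. 2 is a schematic diagram of a disease prediction method provided in other embodiments of the present disclosure; FIG. 2 is a specific implementation of FIG. 1. As shown in FIG. 2, the disease prediction method includes:
S1. Acquire gene sequencing data of a test sample.
In some embodiments, step S1 includes step S11 and step S12.
S11. Acquire initial gene sequencing data of the test sample.
In some embodiments, in step S11, the test sample may be sequenced using Nanopore sequencing or targeted enrichment sequencing to obtain the initial gene sequencing data.
For example, Nanopore sequencing may be used. Compared with NGS, Nanopore sequencing produces longer reads and has unmatched advantages in assembling animal and plant genomes.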
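S12. Filter the initial gene sequencing data to obtain the gene sequencing data.
In some embodiments, step S12 may specifically include: analyzing, quality-controlling, and filtering the initial gene sequencing data, so as to obtain high-quality data for the subsequent bioinformatics analysis and provide accurate data for later processing.
In one example, the nanopack software suite is used for analysis, quality control, and filtering. The nanoQC software computes nucleotide composition and base quality statistics; the NanoStat software provides additional statistics and generates an html file of the results; the NanoPlot software produces visual plots of the data. Filtering is then performed with the NanoFilt software, which can remove sequences that are too short.
The filtering parameters are set according to the actual situation. For example, with the filtering parameters "-q 7 -l 1000 --headcrop 100 --tailcrop 100", sequences shorter than 1000 bases and sequences whose overall mean quality is below Q7 are filtered out, and 100 bp are trimmed from both the head and the tail of each sequence. In one example, the detailed filtering information is shown in Table 1.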
Table 1
Metric                 Before filtering             After filtering
Mean read length       1921.5                       2490.2
Mean read quality      9.9                          10.2
Median read length     1442.5                       1686
Median read quality    10.1                         10.3
Number of reads        110192                       65246
Read length N50        2336                         2870
STDEV read length      2132.9                       2457.6
Total bases            211728431                    162473151
Q5                     110184 (100.0%) 211.7 Mb     65246 (100.0%) 162.5 Mb
Q7                     110080 (99.9%) 211.3 Mb      65245 (100.0%) 162.5 Mb
Q10                    61612 (55.9%) 100.5 Mb       35441 (54.3%) 72.8 Mb
Q12                    18846 (17.1%) 32.3 Mb        13382 (20.5%) 27.1 Mb
Q15                    6 (0.0%) 0.0 Mb              16 (0.0%) 0.0 Mb
In Table 1, Mean read length is the mean length of the reads; Mean read quality is the mean quality of the reads; Median read length is the median read length; Median read quality is the median read quality; Number of reads is the total number of reads; Read length N50 is the read length at the N50 value; STDEV read length is the standard deviation of the read length; Q5 to Q15 are the Nanopore quality value statistics, giving in turn the number of reads, their percentage of the total, and the total number of bases.
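The filtering rule and two of the Table 1 statistics can be illustrated with a short sketch. It is a simplified stand-in, not NanoFilt or NanoStat: it assumes each read is given as a (sequence, mean quality) pair with the mean quality already computed, it applies the length check before cropping (an assumption about ordering), and the function names are hypothetical.

```python
from typing import List, Tuple

Read = Tuple[str, float]  # (base sequence, precomputed mean quality of the read)


def filter_reads(reads: List[Read], min_length: int = 1000,
                 min_quality: float = 7.0,
                 headcrop: int = 100, tailcrop: int = 100) -> List[Read]:
    """Drop short or low-quality reads, then trim both ends of the survivors."""
    kept = []
    for seq, mean_q in reads:
        if len(seq) < min_length or mean_q < min_quality:
            continue  # read fails the length or quality criterion
        kept.append((seq[headcrop:len(seq) - tailcrop], mean_q))
    return kept


def mean_read_length(reads: List[Read]) -> float:
    """Total bases divided by the number of reads."""
    return sum(len(seq) for seq, _ in reads) / len(reads)


def read_length_n50(lengths: List[int]) -> int:
    """Length L such that reads of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


reads = [("A" * 1200, 9.0), ("A" * 800, 12.0), ("A" * 1500, 6.0)]
kept = filter_reads(reads)
print(len(kept), len(kept[0][0]))          # one read survives, trimmed to 1000 bases
print(read_length_n50([2, 3, 4, 5, 6]))    # 5
```

The same relation between total bases and read count holds in Table 1: 211728431 / 110192 is approximately 1921.5, the mean read length before filtering.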
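FIG. 3 is a read length distribution chart of the raw sequencing output provided in an example of the present disclosure; the raw sequencing output is the initial gene sequencing data. The horizontal axis is the read length, and the vertical axis is the number of reads. As shown in FIG. 3, read lengths are mostly distributed around 1000.
FIG. 4A shows the composition of the first 100 bases of the reads in the raw sequencing output provided in an example of the present disclosure, and FIG. 4B shows the composition of the last 100 bases. The vertical axis in FIG. 4A and FIG. 4B is the nucleotide frequency (Frequency of nucleotide in read); the horizontal axis in FIG. 4A is the position from the start of the read (position in read from start), and the horizontal axis in FIG. 4B is the position from the end of the read (position in read from end).
FIG. 5A shows the mean base quality of the first 100 bases of the raw sequencing output provided in an example of the present disclosure, and FIG. 5B shows the mean base quality of the last 100 bases. The vertical axis in FIG. 5A and FIG. 5B is the mean base quality (Mean quality score of base calls); the horizontal axis in FIG. 5A is the position from the start of the read (position in read from start), and the horizontal axis in FIG. 5B is the position from the end of the read (position in read from end).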
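S2. Perform data analysis on the gene sequencing data to obtain the variant sites in the gene sequencing data.
In some embodiments, step S2 includes step S21 and step S22:
S21. Align the gene sequencing data against the reference sequence of the reference mitochondrial genome to determine alignment result data, the alignment result data including: the sites of the gene sequencing data in the reference mitochondrial genome.
Step S21 determines where the gene sequencing data is located in the mitochondrial genome.
In one example, the alignment can be performed with the minimap2 tool using the parameters "-ax map-ont"; minimap2 produces output in SAM format. The samtools tool converts the SAM alignment result data to BAM format and sorts the resulting BAM file. The flagstat and stats commands of samtools are then used to compute alignment statistics, and the plot-bamstats program of samtools is used to visualize the alignment results. FIG. 6 is a schematic diagram of the visualized output of the statistics on the alignment result data of step S21 provided in an example of the present disclosure. Panel (a) of FIG. 6 is a coverage plot derived from the alignment result data; the horizontal axis is the coverage, and the vertical axis is the number of bases that map to the reference mitochondrial genome (Number of mapped bases). Panel (b) of FIG. 6 is the GC distribution derived from the alignment result data; the horizontal axis is the GC content, and the vertical axis is the normalized frequency (Normalized Frequency). Panel (c) of FIG. 6 is the quality distribution of the reads that map to the reference mitochondrial genome; the horizontal axis is the length of the mapped reads, that is, Cycle (fwd reads), and the vertical axis is the quality value (Quality).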
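The tool chain of this example can be written down as a sequence of command lines. The sketch below only builds the argument lists and does not run anything; the file names and the function name `alignment_commands` are hypothetical, and the exact options used in the disclosure beyond minimap2's Nanopore preset (`-ax map-ont`) are not stated in the source, so the remaining flags are an assumption about a typical invocation.

```python
from typing import List


def alignment_commands(reference: str, reads: str, prefix: str) -> List[List[str]]:
    """Build (but do not execute) the alignment and statistics commands."""
    sam = f"{prefix}.sam"
    bam = f"{prefix}.sorted.bam"
    stats = f"{prefix}.stats.txt"
    return [
        # Align Nanopore reads to the reference; output is SAM.
        ["minimap2", "-ax", "map-ont", "-o", sam, reference, reads],
        # Convert SAM to BAM and sort in one step.
        ["samtools", "sort", "-o", bam, sam],
        # Summary counts of mapped reads (printed to stdout).
        ["samtools", "flagstat", bam],
        # Detailed statistics; stdout is assumed to be saved to `stats`.
        ["samtools", "stats", bam],
        # Plot the saved statistics into an output directory.
        ["plot-bamstats", "-p", f"{prefix}_plots/", stats],
    ]


for cmd in alignment_commands("chrM.fa", "filtered_reads.fastq", "sample"):
    print(" ".join(cmd))
```

Each list could be passed to `subprocess.run`, with the output of `samtools stats` redirected to the statistics file before plotting.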
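S22. Perform variation detection on the alignment result data to determine the variant sites in the alignment result data.
In some embodiments, step S22 specifically includes:
S22a. Perform SNV detection on the alignment result data to obtain a first detection result, the first detection result including: the SNV sites included in the alignment result data.
S22b. Perform indel detection on the alignment result data to obtain a second detection result, the second detection result including: the indel sites included in the alignment result data.
Illustratively, SNV detection can be performed with the longshot tool, a variant caller that detects variants accurately in error-prone read data. It takes a BAM file as input and outputs a VCF file containing variant and genotype information. Indels on the mitochondrial genome can be detected with mitoDel V3.0; the indel output includes the read count of the indel, the start position of the indel, the position of the indel, and whether it passed quality filtering.
It should be understood that the variant sites detected in step S2 include the SNV sites and the indel sites.
It should be noted that the order of step S22a and step S22b is not particularly limited, as long as SNV detection and indel detection are performed separately. Performing them separately can improve detection accuracy.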
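S3. Perform variation annotation on the variant site to obtain the variation-related information of the variant site.
In some embodiments, the variation-related information may specifically include: the variation type, the variation region, and the variation position, as well as the changes in the CDS and the protein caused by the variation, where the CDS is the protein-coding sequence of a gene. The variation type may be a nonsense mutation, a frameshift mutation, a synonymous mutation, and so on. The variation region may be a non-gene region, a gene region, a control region, and so on. The variation position is the specific position of the variant site in the genome; it can be mapped to a position in a specific gene and protein sequence and is expressed as the position changed in a specific transcript, CDS, or protein of the gene.
S4. Predict the impact score of the variant site on gene function according to the variation-related information.
In some embodiments, step S4 includes: separately predicting the impact scores of different items of variation-related information on gene function, to obtain impact scores for multiple aspects; and determining the impact score of the variant site on the gene function according to the impact scores for the multiple aspects. For example, the impact scores of the variation type, the variation position, and other information on gene function are predicted separately to obtain the impact scores for the multiple aspects.
FIG. 7 is a schematic diagram of an optional implementation of step S4 provided in some embodiments of the present disclosure. In some embodiments, step S4 specifically includes steps S41 to S44.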
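S41. Obtain the impact score of the variant site on protein conservation and physicochemical properties, as the first score.
Step S41 may specifically include step S41a and step S41b.
S41a. Analyze the variation-related information of the variant site with each of multiple prediction tools, to predict multiple reference impact scores of the variant site on protein conservation and physicochemical properties.
S41b. Take the average of the multiple reference impact scores as the first score.
For example, the multiple prediction tools include PANTHER, PolyPhen-2, and SIFT, each of which makes its own prediction. Each tool predicts the degree to which the variant site affects protein conservation and physicochemical properties, which is one of four outcomes: no effect, possibly affecting, harmful, or unpredictable. When the outcome is no effect, the reference impact score output by the tool is 0; when it is possibly affecting, the reference impact score is 0.5; when it is harmful, the reference impact score is 1; and when the outcome cannot be predicted, the reference impact score is NA (no score).
Denote the reference impact score from PANTHER as S_PANTHER, the score from PolyPhen-2 as S_PolyPhen-2, and the score from SIFT as S_SIFT. The first score is then Si=(S_PANTHER+S_PolyPhen-2+S_SIFT)/N, where N is the number of prediction tools that produced a score, that is, the number of tools remaining after those that returned NA are removed.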
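The averaging in steps S41a and S41b can be sketched as follows. The outcome labels, the dictionary input, and the function name `first_score` are illustrative assumptions; the sketch takes each tool's categorical outcome as given and does not call PANTHER, PolyPhen-2, or SIFT. Returning `None` when every tool reports NA is also an assumption, since the source does not cover that case.

```python
from typing import Dict, Optional

# Mapping of the four outcomes to reference impact scores (None stands for NA).
OUTCOME_TO_SCORE: Dict[str, Optional[float]] = {
    "no_effect": 0.0,
    "possibly_affecting": 0.5,
    "harmful": 1.0,
    "unpredictable": None,
}


def first_score(outcomes: Dict[str, str]) -> Optional[float]:
    """Average the reference impact scores over the tools that produced a score."""
    scores = [OUTCOME_TO_SCORE[outcome] for outcome in outcomes.values()]
    scored = [s for s in scores if s is not None]  # drop NA results
    if not scored:
        return None  # no tool produced a score
    return sum(scored) / len(scored)  # N = number of tools with a score


si = first_score({
    "PANTHER": "harmful",
    "PolyPhen-2": "possibly_affecting",
    "SIFT": "unpredictable",
})
print(si)  # (1 + 0.5) / 2 = 0.75
```

Here SIFT returns NA, so N is 2 rather than 3 and only the PANTHER and PolyPhen-2 scores enter the sum.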
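S42. Obtain the impact score of the variation type of the variant site on gene function, as the second score.
In some embodiments, step S42 may specifically include: determining the impact score of the variation type of the variant site on gene function according to a preset first mapping, where the first mapping records the impact scores of multiple different variation types on gene function.
For example, synonymous mutations and intergenic-region mutations usually have no effect on gene function, so when the variation type of the variant site is a synonymous mutation or an intergenic-region mutation, the second score is 0. Non-synonymous mutations and non-frameshift mutations may affect protein function, so when the variation type is a non-synonymous mutation or a non-frameshift mutation, the second score is 0.5. Nonsense mutations and frameshift mutations strongly affect protein function, so when the variation type is a nonsense mutation or a frameshift mutation, the second score is 1.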
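The first mapping of step S42 is naturally a lookup table. The sketch below encodes the example values from the paragraph above; the string keys and the function name `second_score` are hypothetical, and raising an error for an unlisted type is a design choice made here, not something the source specifies.

```python
# Example first mapping: variation type -> impact score on gene function.
TYPE_TO_SECOND_SCORE = {
    "synonymous": 0.0,
    "intergenic": 0.0,
    "non_synonymous": 0.5,
    "non_frameshift": 0.5,
    "nonsense": 1.0,
    "frameshift": 1.0,
}


def second_score(variation_type: str) -> float:
    """Look up the second score St for a variation type."""
    try:
        return TYPE_TO_SECOND_SCORE[variation_type]
    except KeyError:
        # Fail loudly rather than guess a score for an unlisted type.
        raise ValueError(f"no score defined for variation type {variation_type!r}")


print(second_score("non_synonymous"))  # 0.5
```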
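S43. Obtain the impact score of the variation position of the variant site on gene function, as the third score.
In some embodiments, the variation position may include: the position number n of the variant site in the protein sequence, that is, the variant site is located at the nth amino acid position in the protein sequence.
S43 may specifically include: when the variation type of the variant site is a frameshift mutation or a nonsense mutation, the third score is Sp=1-n/L, where L is the length of the protein sequence. When the variation type of the variant site is any type other than a frameshift mutation or a nonsense mutation, the third score is 0. For example, if a protein sequence is 200 amino acids long, the variant site is at the 20th amino acid position, and the variation type is a frameshift mutation or a nonsense mutation, then Sp=1-20/200=0.9.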
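The position rule of step S43 reduces to a few lines. As a sketch, with the type labels and the function name `third_score` chosen for illustration and the range check on n added as an assumption:

```python
TRUNCATING_TYPES = {"frameshift", "nonsense"}


def third_score(variation_type: str, n: int, protein_length: int) -> float:
    """Third score Sp: 1 - n/L for frameshift or nonsense mutations, else 0."""
    if not 1 <= n <= protein_length:
        raise ValueError("position n must lie within the protein sequence")
    if variation_type in TRUNCATING_TYPES:
        # The earlier the site, the larger the affected part of the protein.
        return 1 - n / protein_length
    return 0.0


print(third_score("nonsense", 20, 200))        # 0.9, as in the worked example
print(third_score("non_synonymous", 20, 200))  # 0.0
```

A frameshift or nonsense mutation near the start of the protein therefore scores close to 1, and one at the last position scores 0.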
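S44. Determine the impact score S of the variant site on gene function according to the following formula (1):
S=λ1*Si+λ2*St+λ3*Sp
where Si is the first score, St is the second score, and Sp is the third score; λ1, λ2, and λ3 are preset weights, and λ1+λ2+λ3=1. The larger S is, the greater the effect of the variant site on the protein. S equal to 0 means there is no effect at all on protein function; S equal to 1 means the variation is harmful to the protein and the protein cannot perform its function at all.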
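Formula (1) is a weighted sum and can be checked with a small sketch. The default weights 0.2, 0.2, and 0.6 are an illustrative choice that satisfies λ1+λ2+λ3=1 and falls within the ranges given in the summary (λ1 and λ2 between 0.15 and 0.25, λ3 between 0.5 and 0.7); the source does not state which values were actually used, and the function name is hypothetical.

```python
import math


def impact_score(si: float, st: float, sp: float,
                 lam1: float = 0.2, lam2: float = 0.2, lam3: float = 0.6) -> float:
    """Combine the three partial scores: S = lam1*Si + lam2*St + lam3*Sp."""
    if not math.isclose(lam1 + lam2 + lam3, 1.0):
        raise ValueError("weights must sum to 1")
    return lam1 * si + lam2 * st + lam3 * sp


# Nonsense mutation at position 20 of a 200-residue protein (Sp = 0.9, St = 1),
# with a first score of 0.75 from the prediction tools.
s = impact_score(si=0.75, st=1.0, sp=0.9)
print(round(s, 3))  # 0.2*0.75 + 0.2*1.0 + 0.6*0.9 = 0.89
```

Because every partial score lies in [0, 1] and the weights sum to 1, S also lies in [0, 1], which matches the interpretation of 0 as no effect and 1 as complete loss of function.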
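The above "separately predicting the impact scores of different items of variation-related information on gene function, to obtain impact scores for multiple aspects" comprises steps S41 to S43, and the impact scores for the multiple aspects include: the first score, the second score, and the third score. The above "determining the impact score of the variant site on the gene function according to the impact scores for the multiple aspects" comprises step S44.
S5. Perform disease annotation on the variant site according to the impact score of the variant site on gene function and the preset disease database, to obtain the mitochondrial disease corresponding to the variant site. When the preset disease database records a first mitochondrial disease corresponding to the variation-related information of the variant site, the first mitochondrial disease is taken as the mitochondrial disease corresponding to the variant site. When the preset disease database does not record the first mitochondrial disease and the impact score is greater than the preset threshold, a second mitochondrial disease corresponding to a neighboring site of the variant site is obtained from the preset disease database and taken as the mitochondrial disease corresponding to the variant site.
In some examples, when step S5 is performed, the variant-disease data on mitomap can be downloaded to build a database in tsv format (called mitoDisease); this database is the preset disease database. In some examples, the preset threshold may be 0.5.
Table 2 shows some mitochondrial diseases and the corresponding variation-related information.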
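Table 2
Mutation Num.       19                                            32
Position            9025                                          15672
Mutation(hgvs.g)    m.9025G>A                                     m.15672T>G
Mutation(hgvs.c)    c.499G>A                                      c.926T>G
Mutation(hgvs.p)    p.G167S                                       p.I309R
S                   0.549                                         0.41
IF(mitoDisease)     1                                             0
mitoDisease Num.    242                                           475
mitoDisease-gene    ATP6                                          CYB
Disease             Motor neuropathy, Leigh-like, colon cancer    LHON
Status              Reported                                      -
GB Freq             0.06%                                         -
In Table 2, Mutation Num is the variation annotation number; Position is the position of the variant site on the mitochondrial genome; Mutation(hgvs.g), Mutation(hgvs.c), and Mutation(hgvs.p) give the variation in standard hgvs format at the genome, CDS, and protein level, respectively; S is the impact score of the variant site on gene function; IF(mitoDisease) indicates whether the current variation is in the mitoDisease database, where 1 means the database contains the current variation and 0 means it does not. If the database contains the current variation, mitoDisease Num is the number of the current variation in the mitoDisease database; if it does not, mitoDisease Num is the number, in the mitoDisease database, of the variation at the neighboring site of the current variant site. mitoDisease-gene is the gene name in mitoDisease corresponding to the current variation; Disease is the name of the disease associated with the variation; Status indicates whether the current variation has been reported, with Reported meaning that it has; GB Freq is the frequency of the current variation in the human mitochondrial database of GenBank.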
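FIG. 8 is a schematic diagram of a disease prediction apparatus provided in some embodiments of the present disclosure; the apparatus is used to carry out the above disease prediction method. As shown in FIG. 8, the disease prediction apparatus includes: a data acquisition module 10, an analysis module 20, a variation annotation module 30, a prediction module 40, and a disease annotation module 50.
The data acquisition module 10 is configured to acquire gene sequencing data of a test sample.
The analysis module 20 is configured to perform data analysis on the gene sequencing data to obtain the variant sites in the gene sequencing data.
The variation annotation module 30 is configured to perform variation annotation on the variant site to obtain the variation-related information of the variant site.
The prediction module 40 is configured to predict the impact score of the variant site on gene function according to the variation-related information of the variant site.
The disease annotation module 50 is configured to perform disease annotation on the variant site according to the impact score of the variant site on gene function and the preset disease database, to obtain the mitochondrial disease corresponding to the variant site. When the preset disease database records a first mitochondrial disease corresponding to the variation-related information of the variant site, the first mitochondrial disease is taken as the mitochondrial disease corresponding to the variant site. When the preset disease database does not record the first mitochondrial disease and the impact score is greater than the preset threshold, a second mitochondrial disease corresponding to a neighboring site of the variant site is obtained from the preset disease database and taken as the mitochondrial disease corresponding to the variant site.
For the functions of the modules, refer to the description of the disease prediction method above; they are not repeated here.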
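FIG. 9 is a schematic diagram of an electronic device provided in some embodiments of the present disclosure. As shown in FIG. 9, the electronic device 100 includes a memory 101 and a processor 102. The memory 101 stores a computer program which, when executed by the processor 102, implements the above disease prediction method, for example steps S1 to S4 in FIG. 1.
The electronic device 100 may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. The electronic device 100 may include, but is not limited to, the processor 102 and the memory 101. Those skilled in the art will understand that FIG. 9 is merely an example of the electronic device 100 and does not limit it; the device may include more or fewer components than shown, combine certain components, or use different components. For example, the electronic device 100 may further include input and output devices, network access devices, a bus, and the like.
The processor 102 may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor 102 may be a microprocessor, or the processor may be any conventional processor.
The memory 101 may be an internal storage unit of the electronic device 100, such as a hard disk or internal memory of the electronic device 100. The memory 101 may also be an external storage device of the electronic device 100, such as a plug-in hard disk, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, or a flash card (Flash Card) fitted to the electronic device 100. Further, the memory 101 may include both an internal storage unit of the electronic device 100 and an external storage device. The memory 101 is used to store the computer program and the other programs and data required by the electronic device. The memory 101 may also be used to temporarily store data that has been output or is about to be output.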
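Those skilled in the art will clearly understand that, for convenience and brevity of description, the division into the functional units and modules above is only an example. In practice, the functions may be assigned to different functional units and modules as needed; that is, the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the functions described above. The functional units and modules in the embodiments may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit; the integrated unit may be implemented in hardware or as a software functional unit. In addition, the specific names of the functional units and modules only serve to distinguish them from one another and do not limit the scope of protection of the present application. For the specific working process of the units and modules in the above system, refer to the corresponding process in the foregoing method embodiments; it is not repeated here.
FIG. 10 is a schematic diagram of a computer-readable storage medium provided in some embodiments of the present disclosure. As shown in FIG. 10, the computer-readable storage medium 200 stores a computer program 201 which, when executed by a processor, implements the above disease prediction method, for example steps S1 to S4 in FIG. 1. The computer-readable storage medium 200 includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disk (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and can be accessed by a computer. In addition, as is well known to those of ordinary skill in the art, communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.
It should be understood that the above embodiments are merely exemplary embodiments used to illustrate the principles of the present disclosure, and the present disclosure is not limited to them. Those of ordinary skill in the art can make various modifications and improvements without departing from the spirit and essence of the present disclosure, and these modifications and improvements are also regarded as falling within the scope of protection of the present disclosure.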

Claims (15)

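  1. A disease prediction method, comprising:
    acquiring gene sequencing data of a test sample;
    performing data analysis on the gene sequencing data to obtain variant sites in the gene sequencing data;
    performing variation annotation on the variant site to obtain variation-related information of the variant site;
    predicting, according to the variation-related information of the variant site, an impact score of the variant site on gene function;
    performing disease annotation on the variant site according to the impact score of the variant site on gene function and a preset disease database, to obtain the mitochondrial disease corresponding to the variant site;
    wherein, when the preset disease database records a first mitochondrial disease corresponding to the variation-related information of the variant site, the first mitochondrial disease is taken as the mitochondrial disease corresponding to the variant site; and when the preset disease database does not record the first mitochondrial disease and the impact score is greater than a preset threshold, a second mitochondrial disease corresponding to a neighboring site of the variant site is obtained from the preset disease database and taken as the mitochondrial disease corresponding to the variant site.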
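  2. The disease prediction method according to claim 1, wherein the variation-related information comprises: the variation type, the variation region, the variation position, and the changes in the CDS and the protein caused by the variation.
  3. The disease prediction method according to claim 2, wherein predicting the impact score of the variant site on gene function according to the variation-related information of the variant site specifically comprises:
    separately predicting the impact scores of different items of variation-related information on gene function, to obtain impact scores for multiple aspects;
    determining the impact score of the variant site on the gene function according to the impact scores for the multiple aspects.
  4. The disease prediction method according to claim 3, wherein separately predicting the impact scores of different items of variation-related information on gene function, to obtain impact scores for multiple aspects, specifically comprises:
    obtaining an impact score of the variant site on protein conservation and physicochemical properties, as a first score;
    obtaining an impact score of the variation type of the variant site on gene function, as a second score;
    obtaining an impact score of the variation position of the variant site on gene function, as a third score;
    wherein the impact scores for the multiple aspects comprise: the first score, the second score, and the third score;
    determining the impact score of the variant site on the gene function according to the impact scores for the multiple aspects specifically comprises: determining the impact score of the variant site on gene function according to the following formula:
    S=λ1*Si+λ2*St+λ3*Sp, where S is the impact score of the variant site on gene function, Si is the first score, St is the second score, and Sp is the third score; λ1, λ2, and λ3 are preset weights, and λ1+λ2+λ3=1.
  5. The disease prediction method according to claim 4, wherein λ1 and λ2 are each between 0.15 and 0.25, and λ3 is between 0.5 and 0.7.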
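  6. The disease prediction method according to claim 4, wherein obtaining the impact score of the variant site on protein conservation and physicochemical properties specifically comprises:
    analyzing the variation-related information of the variant site with each of multiple prediction tools, to predict multiple reference impact scores of the variant site on protein conservation and physicochemical properties;
    taking the average of the multiple reference impact scores as the first score.
  7. The disease prediction method according to claim 4, wherein obtaining the impact score of the variation type of the variant site on gene function comprises:
    determining the impact score of the variation type of the variant site on gene function according to a preset first mapping; wherein the first mapping records the impact scores of multiple different variation types on gene function.
  8. The disease prediction method according to claim 4, wherein the variation position comprises the position number n of the variant site in the protein sequence;
    obtaining the impact score of the variation position of the variant site on gene function specifically comprises:
    when the variation type of the variant site is a frameshift mutation or a nonsense mutation, determining the third score according to the following formula:
    Sp=1-n/L, where L is the length of the protein sequence;
    when the variation type of the variant site is any type other than the frameshift mutation and the nonsense mutation, determining that the third score is 0.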
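  9. The disease prediction method according to any one of claims 1 to 8, wherein acquiring the gene sequencing data of the test sample comprises:
    acquiring initial gene sequencing data of the test sample;
    filtering the initial gene sequencing data to obtain the gene sequencing data.
  10. The disease prediction method according to claim 9, wherein acquiring the initial gene sequencing data of the test sample comprises:
    acquiring the initial gene sequencing data of the test sample using Nanopore sequencing or targeted enrichment sequencing.
  11. The disease prediction method according to any one of claims 1 to 8, wherein performing data analysis on the gene sequencing data to obtain the variant sites in the gene sequencing data specifically comprises:
    aligning the gene sequencing data against the reference sequence of a reference mitochondrial genome to determine alignment result data, the alignment result data comprising: the sites of the gene sequencing data in the reference mitochondrial genome;
    performing variation detection on the alignment result data to determine the variant sites in the alignment result data.
  12. The disease prediction method according to claim 11, wherein
    performing variation detection on the alignment result data to determine the variant sites in the alignment result data specifically comprises:
    performing SNV detection on the alignment result data to obtain a first detection result, the first detection result comprising: the SNV sites included in the alignment result data;
    performing indel detection on the alignment result data to obtain a second detection result, the second detection result comprising: the indel sites included in the alignment result data;
    wherein the variant sites comprise the SNV sites and the indel sites.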
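  13. A disease prediction apparatus, comprising:
    a data acquisition module, configured to acquire gene sequencing data of a test sample;
    an analysis module, configured to perform data analysis on the gene sequencing data to obtain variant sites in the gene sequencing data;
    a variation annotation module, configured to perform variation annotation on the variant site to obtain variation-related information of the variant site;
    a prediction module, configured to predict, according to the variation-related information of the variant site, an impact score of the variant site on gene function;
    a disease annotation module, configured to perform disease annotation on the variant site according to the impact score of the variant site on gene function and a preset disease database, to obtain the mitochondrial disease corresponding to the variant site;
    wherein, when the preset disease database records a first mitochondrial disease corresponding to the variation-related information of the variant site, the first mitochondrial disease is taken as the mitochondrial disease corresponding to the variant site; and when the preset disease database does not record the first mitochondrial disease and the impact score is greater than a preset threshold, a second mitochondrial disease corresponding to a neighboring site of the variant site is obtained from the preset disease database and taken as the mitochondrial disease corresponding to the variant site.
  14. An electronic device, comprising a memory and a processor, the memory storing a computer program, wherein the computer program, when executed by the processor, implements the method according to any one of claims 1 to 12.
  15. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1 to 12.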
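PCT/CN2021/126970 2021-10-28 2021-10-28 Disease prediction method and apparatus, electronic device, and computer-readable storage medium WO2023070422A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2021/126970 WO2023070422A1 (zh) 2021-10-28 2021-10-28 Disease prediction method and apparatus, electronic device, and computer-readable storage medium
CN202180003144.0A CN116547391A (zh) 2021-10-28 2021-10-28 Disease prediction method and apparatus, electronic device, and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/126970 WO2023070422A1 (zh) 2021-10-28 2021-10-28 Disease prediction method and apparatus, electronic device, and computer-readable storage medium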

Publications (1)

Publication Number Publication Date
WO2023070422A1 true WO2023070422A1 (zh) 2023-05-04

Family

ID=86158806

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/126970 WO2023070422A1 (zh) 2021-10-28 2021-10-28 Disease prediction method and apparatus, electronic device, and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN116547391A (zh)
WO (1) WO2023070422A1 (zh)

Also Published As

Publication number Publication date
CN116547391A (zh) 2023-08-04

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 202180003144.0

Country of ref document: CN

WWE Wipo information: entry into national phase

Ref document number: 17922017

Country of ref document: US

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21961789

Country of ref document: EP

Kind code of ref document: A1