EP4022646A1 - A prioritization and scoring method - Google Patents
A prioritization and scoring method
Info
- Publication number
- EP4022646A1 (application number EP20907928.4A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- features
- feature
- variant
- score
- new
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H40/00—ICT specially adapted for the management or administration of healthcare resources or facilities; ICT specially adapted for the management or operation of medical equipment or devices
- G16H40/20—ICT specially adapted for the management or administration of healthcare resources or facilities; ICT specially adapted for the management or operation of medical equipment or devices for the management or administration of healthcare resources or facilities, e.g. managing hospital staff or surgery rooms
Definitions
- The invention relates to a variant prioritization and scoring method that uses machine learning to facilitate the interpretation of genetic variants (in the VCF file produced by the bioinformatics pipeline) in the analysis of Next-Generation Sequencing (NGS) data.
- NGS: Next-Generation Sequencing
- The VCF file output by the bioinformatics pipeline contains many variants (for example, whole exome sequencing may yield an average of 20,000 quality-confirmed variants), most of which do not have a pathogenic effect. For diagnosis, it is essential to identify the small number of candidate variants among them that might cause disease. The filtering and prioritization needed to determine whether these variants are associated with a disease are carried out manually in the clinic, which is a difficult and lengthy process. Finding a small number of candidate variants automatically is therefore crucial for faster diagnosis.
- Correct classification and prioritization of variants from next-generation sequencing data is one of the most important steps in clinical diagnosis, and it consists of manually filtering tens of thousands of variants according to certain features. However, in most cases the correct variant might not be found, since the filtering method is not standardized and the filtering parameters are user-dependent.
- ClinVar, one of the most widely used variant databases, recommends the use of the American College of Medical Genetics and Genomics (ACMG) guideline to improve the clinical classification of variants in the human genome.
- ACMG: American College of Medical Genetics and Genomics guideline
- Using these rules as features for the machine learning model is very important for improving the classification of variants.
- In current applications these criteria are applied to the variants as present/absent (binary) features, whereas the invention creates new rules from these criteria and uses them as features.
- The invention, motivated by the current state of the art, aims to overcome the above-mentioned disadvantages.
- The main goal of the invention is to shorten the time required for genetic diagnosis, compared to existing systems, by determining the candidate variants that could be associated with a disease.
- The invention provides an algorithm based on machine learning methods that calculates pathogenicity scores for single nucleotide variants (SNVs). Novel features that have not previously been used in the literature for variant scoring are combined with some of the existing scoring models (for example FATHMM, M-CAP, CERENKOV2, SIFT, PolyPhen, ClinPred, CADD, DANN and MutationTaster) to develop a variant scoring system for SNV-type variants.
- SNVs: single nucleotide variants
- The ACMG guideline criteria and rules and the family segregation information mentioned in the state of the art are used as features (that is, as factors that affect pathogenicity in machine learning models) in the method.
- The ExAC pLI score of the gene region in which the relevant variant occurs is also used as a feature in the method. The pLI score gives the probability that a given gene is intolerant to loss of function, based on the number of protein-truncating variants.
- The invention comprises constructing new features (feature generation/construction) from the existing features. The main aim is to capture the relations between different features by applying mathematical operations (division, multiplication, etc.) to them.
- Feature construction methods such as ExploreKit, AutoLearn, Iterative Feature Construction and Association Rule Mining are used.
- The workload that diagnosis places on the user (usually a medical geneticist) is significantly reduced by automatically scoring SNP-type variants.
- To evaluate the variants for diagnosis, the user may require detailed information about how the variant scores are generated.
- Machine learning models are generally complex and their results are not always easy to interpret.
- Additional information about the decision process (which consists of complex machine learning models) is provided to the user by means of machine learning interpretability methods.
- SHAP values, Permutation Importance and LIME are used.
- The complex models are also presented as a single decision tree (a Decision Tree Surrogate) built with Quinlan's C4.5 algorithm.
- The invention is a prioritization and scoring method that uses machine learning to facilitate the interpretation of genetic variants (in the VCF file produced by the bioinformatics pipeline) in the analysis of next-generation sequencing data. It comprises the following process steps:
- Figure 1 illustrates the process steps for generating novel features for the variant scoring model.
- Figure 2 illustrates the general structure of the variant scoring model.
- Figure 3 illustrates the structure that shows the complete system.
- Figure 4 illustrates the position of the invention within the system.
- In Figure 3, the system is shown within its complete structure.
- In Figure 4, the position of the invention within the system is shown.
- The complete system starts with a physician examining a patient who exhibits various symptoms.
- The physician requests a genetic test if he or she finds it appropriate.
- The blood sample taken from the patient is prepared for DNA sequencing by the laboratory.
- The prepared sample is processed in the laboratory by the sequencing device, and the patient's digital DNA data (raw data) is obtained. Since variant information relevant to the disease cannot be obtained directly from the raw data, the data must be processed computationally with bioinformatics tools to obtain the variant information.
- A variant report that shows the relation of the variant to the disease is created.
- The bioinformatics pipeline starts with the raw data obtained from the sequencing device.
- The raw data contains reads from different parts of the patient's DNA.
- The reads are aligned to the human reference genome and saved in SAM/BAM format in order to determine the regions they come from.
- Variant information that does not conform to the human reference genome is extracted from the processed SAM/BAM files and written to the VCF file.
- Specific filters are applied to determine the candidate variants associated with the disease among the many variants in the VCF file. The variants that remain after filtering are reported, and the patient's genetic diagnosis report is created.
- The variants are scored in the variant interpretation step, which follows the creation of the VCF file.
- The feature construction model (20) creates a new feature space (30) from the original features (10) in the data set, to be used by the variant scoring model (40).
- The features (10) are taken as input by the feature construction model (20).
- The new feature creation module (21) creates new features from the received features (10) via mathematical operators. The feature ranking module (22) ranks the new features and the original features (10) according to criteria such as consistency and information gain. The feature selection module (23) selects a predetermined number of features from the ranked features.
- The selected features form a new feature space (30) to be used by the variant scoring model (40).
- The new feature space (30), obtained after all stages of the feature construction model (20) have been carried out, is used as the input of the variant scoring model (40).
- The variant scoring model (40) is trained with machine learning methods on the variant data set containing the new feature space (30).
- The scoring model (40) generates the variant score (51) by scoring the variant according to its input values.
- Even when models use features from the same feature space, the weights of those features may differ. This also holds for the variant scoring model.
- When a user evaluates the variant score (51), he or she may wish to give an expert opinion by referring to information on which features were considered and to what extent.
- To present this information, the feature coefficients of the score (53) are calculated using SHAP values and the LIME method. The user can thus see how the score was obtained.
- The variant scoring model (40) is a complex model, so its results and the way it operates may not be easy to interpret fully. For this reason, machine learning model interpretability methods are used to form the scoring model summary (52), which gives the user additional information about how the complex variant scoring model (40) reaches its decision and helps the user make a more accurate evaluation. Methods such as Permutation Importance and Decision Tree Surrogate models are used to create the scoring model summary (52).
- After a variant is given as input to the scoring model (40), the score (51), the feature coefficients of the score (53) and the scoring model summary (52) are displayed to the user via an interface. The user can therefore see the decision process specific to the variant, along with the variant score (51).
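The step in which variants are read from the VCF file and filtered before scoring can be sketched as follows. This is a minimal illustration, not the pipeline described in the application: the function name `parse_vcf_snvs`, the `min_qual` threshold and the example records are assumptions made here, and only the first six standard VCF columns (CHROM, POS, ID, REF, ALT, QUAL) are used.

```python
def parse_vcf_snvs(lines, min_qual=30.0):
    """Return (chrom, pos, ref, alt) tuples for SNVs whose QUAL is at least min_qual.

    min_qual is a hypothetical quality cut-off chosen for this sketch.
    """
    snvs = []
    for line in lines:
        # Skip blank lines, meta lines ("##...") and the column header ("#CHROM...").
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt, qual = line.rstrip("\n").split("\t")[:6]
        # A "." QUAL means the value is missing; this sketch drops such records.
        if qual == "." or float(qual) < min_qual:
            continue
        # ALT may list several comma-separated alleles; test each one separately.
        for allele in alt.split(","):
            if len(ref) == 1 and len(allele) == 1 and ref in "ACGT" and allele in "ACGT":
                snvs.append((chrom, int(pos), ref, allele))
    return snvs


# Made-up records for illustration only.
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t12345\t.\tA\tG\t50\tPASS\t.",    # SNV, kept
    "1\t22222\t.\tAT\tA\t60\tPASS\t.",   # deletion, not an SNV
    "2\t33333\t.\tC\tT\t10\tPASS\t.",    # SNV below the quality cut-off
    "X\t44444\t.\tG\tA,T\t99\tPASS\t.",  # two alternate alleles, both SNVs
]
candidates = parse_vcf_snvs(example_vcf)
print(candidates)
```

Only the SNV records that pass the cut-off remain, which mirrors the idea that the scoring model is applied to a reduced set of SNV-type variants.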
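The feature construction model (20) described above (create new features with mathematical operators, rank them, select a fixed number) can be sketched as below. This is an illustrative assumption rather than the method of the application: it uses only pairwise multiplication and division, ranks by the information gain of the best single-threshold split, and all names (`construct_features`, `build_feature_space`, `f1`, `f2`) and the toy data are hypothetical. It does not reproduce ExploreKit, AutoLearn or the other named methods.

```python
import math
from itertools import combinations


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def information_gain(values, labels):
    """Information gain of the best single-threshold split on one numeric feature."""
    base = entropy(labels)
    best = 0.0
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        cond = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        best = max(best, base - cond)
    return best


def construct_features(columns):
    """Create pairwise product and ratio features from the original columns."""
    new = {}
    for a, b in combinations(sorted(columns), 2):
        new[f"{a}*{b}"] = [x * y for x, y in zip(columns[a], columns[b])]
        # A small constant avoids division by zero (an illustrative choice).
        new[f"{a}/{b}"] = [x / (y + 1e-9) for x, y in zip(columns[a], columns[b])]
    return new


def build_feature_space(columns, labels, k):
    """Rank original plus constructed features by information gain and keep the top k."""
    candidates = dict(columns)
    candidates.update(construct_features(columns))
    ranked = sorted(
        candidates,
        key=lambda name: information_gain(candidates[name], labels),
        reverse=True,
    )
    return {name: candidates[name] for name in ranked[:k]}


# Toy data: neither f1 nor f2 separates the classes alone, but their ratio does.
columns = {
    "f1": [2, 4, 6, 8, 3, 6, 9, 12],
    "f2": [2, 4, 6, 8, 1, 2, 3, 4],
}
labels = [0, 0, 0, 0, 1, 1, 1, 1]
space = build_feature_space(columns, labels, k=1)
print(list(space))
```

In this toy case the constructed ratio carries more information about the label than either original feature, which is the kind of relation between features that construction is meant to expose.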
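Of the interpretability methods named above, permutation importance is the simplest to illustrate: shuffle one feature at a time and measure how much the model's accuracy drops. The sketch below is a plain-Python approximation under stated assumptions. `toy_model` is a hypothetical stand-in for the trained variant scoring model (40), the feature names `ratio` and `noise` and the data are invented, and SHAP, LIME and the decision tree surrogate are not shown.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model's prediction equals the label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, n_repeats=20, seed=0):
    """Mean drop in accuracy when each feature's values are shuffled across rows."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = {}
    for name in rows[0]:
        drops = []
        for _ in range(n_repeats):
            shuffled = [r[name] for r in rows]
            rng.shuffle(shuffled)
            # Copy each row, replacing only the feature under test.
            permuted = [dict(r, **{name: v}) for r, v in zip(rows, shuffled)]
            drops.append(baseline - accuracy(model, permuted, labels))
        importances[name] = sum(drops) / n_repeats
    return importances


def toy_model(row):
    """Hypothetical scoring model: predicts 1 when 'ratio' is high, ignores 'noise'."""
    return int(row["ratio"] > 2.0)


ratios = [1.0, 1.0, 1.0, 1.0, 3.0, 3.0, 3.0, 3.0]
noise = [5, 1, 4, 2, 3, 6, 0, 7]
rows = [{"ratio": r, "noise": n} for r, n in zip(ratios, noise)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

importances = permutation_importance(toy_model, rows, labels)
for name, value in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.3f}")
```

A feature the model ignores gets an importance of zero, while a feature the decision depends on gets a positive value; a summary of this kind is what lets the user see which inputs the score actually relies on.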
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- General Health & Medical Sciences (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Theoretical Computer Science (AREA)
- Biophysics (AREA)
- Epidemiology (AREA)
- Business, Economics & Management (AREA)
- Biotechnology (AREA)
- Public Health (AREA)
- Evolutionary Biology (AREA)
- General Business, Economics & Management (AREA)
- Bioinformatics & Computational Biology (AREA)
- Data Mining & Analysis (AREA)
- Software Systems (AREA)
- Evolutionary Computation (AREA)
- Databases & Information Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Chemical & Material Sciences (AREA)
- Analytical Chemistry (AREA)
- Genetics & Genomics (AREA)
- Molecular Biology (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Bioethics (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Primary Health Care (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TR201921589 | 2019-12-25 | ||
PCT/TR2020/051374 WO2021133351A1 (en) | 2019-12-25 | 2020-12-24 | A prioritization and scoring method |
Publications (2)
Publication Number | Publication Date |
---|---|
EP4022646A1 (en) | 2022-07-06 |
EP4022646A4 (en) | 2022-11-02 |
Family
ID=76576076
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP20907928.4A (Pending, EP4022646A4 (en)) | A prioritization and scoring method | 2019-12-25 | 2020-12-24 |
Country Status (2)
Country | Link |
---|---|
EP (1) | EP4022646A4 (en) |
WO (1) | WO2021133351A1 (en) |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP6073902B2 (en) * | 2011-10-06 | 2017-02-01 | Sequenom, Inc. | Methods and processes for non-invasive assessment of genetic variation |
ES2875892T3 (en) * | 2013-09-20 | 2021-11-11 | Spraying Systems Co | Spray nozzle for fluidized catalytic cracking |
US20210074378A1 (en) * | 2018-01-26 | 2021-03-11 | The Trustees Of Princeton University | Methods for Analyzing Genetic Data to Classify Multifactorial Traits Including Complex Medical Disorders |
CN109295198A (en) * | 2018-09-03 | 2019-02-01 | 安吉康尔(深圳)科技有限公司 | Method, apparatus and terminal device for detecting gene mutations of genetic diseases |
2020
- 2020-12-24: WO application PCT/TR2020/051374 (WO2021133351A1), status unknown
- 2020-12-24: EP application EP20907928.4A (EP4022646A4), active, pending
Also Published As
Publication number | Publication date |
---|---|
EP4022646A4 (en) | 2022-11-02 |
WO2021133351A1 (en) | 2021-07-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11037685B2 (en) | Method and process for predicting and analyzing patient cohort response, progression, and survival | |
Manni et al. | BUSCO: assessing genomic data quality and beyond | |
Moreau et al. | Computational tools for prioritizing candidate genes: boosting disease gene discovery | |
Baele et al. | Bayesian evolutionary model testing in the phylogenomics era: matching model complexity with computational efficiency | |
US7324928B2 (en) | Method and system for determining phenotype from genotype | |
JP2015527635A (en) | System and method for generating biomarker signatures using an integrated dual ensemble and generalized simulated annealing technique | |
Castillo-Secilla et al. | KnowSeq R-Bioc package: The automatic smart gene expression tool for retrieving relevant biological knowledge | |
Fung et al. | Automation of QIIME2 metagenomic analysis platform | |
US20070173700A1 (en) | Disease risk information display device and program | |
CN114424287A (en) | Single cell RNA-SEQ data processing | |
JP2019530098A (en) | Method and apparatus for coordinated mutation selection and treatment match reporting | |
Zhang et al. | MaLAdapt reveals novel targets of adaptive introgression from Neanderthals and Denisovans in worldwide human populations | |
Hawinkel et al. | Model-based joint visualization of multiple compositional omics datasets | |
JP5067417B2 (en) | Molecular network analysis support program, molecular network analysis support device, and molecular network analysis support method | |
Wen et al. | OmicsEV: a tool for comprehensive quality evaluation of omics data tables | |
Gaynor et al. | Identification of differentially expressed gene sets using the Generalized Berk–Jones statistic | |
Zhang et al. | VEF: a variant filtering tool based on ensemble methods | |
EP4022646A1 (en) | A prioritization and scoring method | |
May et al. | ClearCNV: CNV calling from NGS panel data in the presence of ambiguity and noise | |
Reimand et al. | Pathway enrichment analysis of-omics data | |
Ahmad et al. | A review of genetic variant databases and machine learning tools for predicting the pathogenicity of breast cancer | |
KR102483880B1 (en) | disease profiling information providing system based on multiple database information and method therefor | |
JP2001178463A (en) | Method for extracting similar expression pattern and method for extracting related biopolymer | |
Albrecht et al. | Machine Learning in Quality Assessment of Early Stage Next-Generation Sequencing Data | |
Bruno et al. | AIM in Medical Informatics |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| STAA | Information on the status of an EP patent application or granted EP patent | Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE |
| PUAI | Public reference made under Article 153(3) EPC to a published international application that has entered the European phase | Free format text: ORIGINAL CODE: 0009012 |
| STAA | Information on the status of an EP patent application or granted EP patent | Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE |
| 17P | Request for examination filed | Effective date: 20220328 |
| AK | Designated contracting states | Kind code of ref document: A1; Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
| A4 | Supplementary search report drawn up and despatched | Effective date: 20221005 |
| RIC1 | Information provided on IPC code assigned before grant | Ipc: G16B 40/20 20190101ALI20220928BHEP; Ipc: G16B 20/20 20190101ALI20220928BHEP; Ipc: G16H 50/20 20180101AFI20220928BHEP |
| DAV | Request for validation of the European patent (deleted) | |
| DAX | Request for extension of the European patent (deleted) | |