EP4022646A1 - Prioritization and scoring method

Prioritization and scoring method

Info

Publication number
EP4022646A1
Authority
EP
European Patent Office
Prior art keywords
features
feature
variant
score
new
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
EP20907928.4A
Other languages
German (de)
English (en)
Other versions
EP4022646A4 (fr)
Inventor
Kazim Kivanç EREN
Yağmur Ceren DARDAĞAN
Orçun TAŞAR
Muhammed AKTOLUN
Esra ÇINAR
Irmak TÜRKOĞLU ÖZTORUN
Cüneyt Öksüz
Bahadir ONAY
Hüseyin ONAY
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Idea Teknoloji Coezuemleri Bilgisayar Sanayi Ve Ticaret AS
Original Assignee
Idea Teknoloji Coezuemleri Bilgisayar Sanayi Ve Ticaret AS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Idea Teknoloji Coezuemleri Bilgisayar Sanayi Ve Ticaret AS
Publication of EP4022646A1
Publication of EP4022646A4

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H40/00 ICT specially adapted for the management or administration of healthcare resources or facilities; ICT specially adapted for the management or operation of medical equipment or devices
    • G16H40/20 ICT specially adapted for the management or administration of healthcare resources or facilities; ICT specially adapted for the management or operation of medical equipment or devices for the management or administration of healthcare resources or facilities, e.g. managing hospital staff or surgery rooms

Definitions

  • The invention relates to a variant prioritization and scoring method that uses machine learning to facilitate the interpretation of genetic variants (in the VCF file produced by the bioinformatics pipeline) in the analysis of Next-Generation Sequencing (NGS) data.
  • NGS: Next-Generation Sequencing
  • The VCF file output by the bioinformatics pipeline contains many variants (for example, whole exome sequencing may yield an average of 20,000 quality-confirmed variants), most of which do not have a pathogenic effect. For diagnosis, it is essential to identify the small number of candidate variants among them that might cause disease. The filtering and prioritization needed to determine whether these variants are associated with a disease are carried out manually in the clinic, which is a difficult and lengthy process. Finding a small number of candidate variants automatically is therefore crucial for faster diagnosis.
  • Correct classification and prioritization of variants from next-generation sequencing data is one of the most important steps in clinical diagnosis. It currently consists of manually filtering tens of thousands of variants according to certain features. In most cases, however, the correct variant might not be found, because the filtering method is not standardized and the filtering parameters are user-dependent.
  • ClinVar, one of the most widely used variant databases, recommends using the American College of Medical Genetics and Genomics (ACMG) guideline to improve the clinical classification of variants in the human genome.
  • ACMG: American College of Medical Genetics and Genomics
  • Using these rules as features for the machine learning model is very important for increasing the success of variant classification.
  • Current applications apply these criteria to variants as present/absent (binary) features, whereas the invention creates new rules from these criteria and uses them as features.
  • The invention aims to solve the abovementioned disadvantages, motivated by the current conditions.
  • The main goal of the invention is to shorten the time required for genetic diagnosis, compared to existing systems, by determining the candidate variants that could be associated with a disease.
  • The invention provides an algorithm based on machine learning methods that calculates pathogenicity scores for single nucleotide variants (SNVs). Novel features that have not previously been used in the literature for variant scoring, together with some existing scoring models (for example FATHMM, M-CAP, CERENKOV2, SIFT, PolyPhen, ClinPred, CADD, DANN, MutationTaster), are used to develop a variant scoring system for SNV-type variants.
  • SNVs: single nucleotide variants
  • The ACMG guideline criteria and rules and the family segregation information mentioned in the state of the art are used as features (factors that affect pathogenicity in machine learning models) in the method.
  • The ExAC pLI score of the gene region in which the relevant variant occurs is also used as a feature in the method. The pLI score gives the probability that a given gene is intolerant to loss of function, based on the number of protein-truncating variants.
  • The invention comprises constructing new features (feature generation/construction) from the existing features. The main aim is to uncover relations between different features by applying mathematical operations (division, multiplication, etc.) to the existing features.
  • Feature construction methods such as ExploreKit, AutoLearn, Iterative Feature Construction and Association Rule Mining are used.
  • Automatically scoring SNP-type variants significantly reduces the workload that diagnosis places on the user (usually a medical geneticist).
  • To evaluate the variants for diagnosis, the user may require detailed information on how the variant scores are generated.
  • Machine learning models are generally complex and their results are not always easy to interpret.
  • Additional information about the decision process (which consists of complex machine learning models) is provided to the user through Machine Learning Interpretability methods.
  • SHAP Values, Permutation Importance and LIME methods are used.
  • The complex models are presented in the form of a single decision tree, as a Decision Tree Surrogate, using Quinlan’s C4.5 Algorithm.
  • The invention is a prioritization and scoring method that uses machine learning to facilitate the interpretation of genetic variants (in the VCF file produced by the bioinformatics pipeline) in the analysis of next-generation sequencing data. It comprises the following process steps;
  • Figure 1 illustrates the process steps for generating novel features for the variant scoring model.
  • Figure 2 illustrates the general structure of the variant scoring model.
  • Figure 3 illustrates the structure that shows the complete system.
  • Figure 4 illustrates the position of the invention within the system.
  • Figure 3 shows the system as a complete structure.
  • Figure 4 shows the position of the invention within the system.
  • The complete system starts with a physician examining a patient who exhibits various symptoms.
  • The physician requests a genetic test if he/she finds it appropriate.
  • The laboratory prepares the blood sample taken from the patient for DNA sequencing.
  • The prepared sample is processed by the sequencing device in the laboratory, and the digital DNA data (raw data) of the patient is obtained. Since variant information regarding the disease cannot be obtained directly from the raw data, the data must be processed computationally with bioinformatics tools to reach the variant information.
  • A variant report that shows the relation of the variant to the disease is created.
  • The bioinformatics pipeline starts with the raw data obtained from the sequencing device.
  • The raw data contains reads from different parts of the patient's DNA.
  • The reads are aligned to the human reference genome, to determine which regions they come from, and saved in the SAM/BAM format.
  • Variant information that does not conform to the human reference genome is extracted from the processed SAM/BAM files and written to the VCF file.
  • Specific filters are applied to determine the candidate variants that are associated with the disease among the many variants in the VCF file. The variants that remain after filtering are reported and the genetic diagnosis report of the patient is created.
  • The variants are scored in the variant interpretation step, which follows the creation of the VCF file.
  • A new feature space (30) is created by the feature construction model (20) from the original features (10) in the data set, to be used by the variant scoring model (40).
  • The features (10) are taken as input to the feature construction model (20).
  • The new feature creation module (21) creates new features from the received features (10) via mathematical operators. The feature ranking module (22) ranks the new features and the original features (10) according to criteria such as consistency and information gain. The feature selection module (23) selects a predetermined number of features from the ranked features.
  • A new feature space (30) is created from the selected features, to be used by the variant scoring model (40).
  • The new feature space (30), obtained after all the stages in the feature construction model (20) are carried out, provides the input parameters of the variant scoring model (40).
  • The variant scoring model (40) is trained with machine learning methods on the variant data set containing the new feature space (30).
  • The scoring model (40) generates the variant score (51) by scoring the variant according to the input values.
  • Even when machine learning models use features from the same feature space, the weights of those features may differ. This also holds for the variant scoring model.
  • When a user evaluates the variant score (51), he/she may wish to give an expert opinion by referring to information on which features were considered and to what extent.
  • To present such information, the feature coefficients of the score (53) are calculated using SHAP Values and the LIME method. The user can thus see how the underlying process that produces the score is carried out.
  • The variant scoring model (40) is a complex model, so its results and its operation may not be easy to interpret fully. For this reason, the scoring model summary (52) is formed using Machine Learning Model Interpretability methods, to give the user additional information on how the complex variant scoring model (40) makes its decision and to help the user make a more accurate evaluation. Methods such as Permutation Importance and Decision Tree Surrogate models are used to create the scoring model summary (52).
  • After a variant is applied as input to the scoring model (40), the score (51), the feature coefficients of the score (53) and the scoring model summary (52) are displayed to the user via an interface. The user can therefore see the underlying decision process specific to the variant, along with the variant score (51).
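
The feature construction stages described above (creating new features with mathematical operators, ranking them by information gain, and selecting a predetermined number) can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the application: the feature names `f1` and `f2`, the toy data, the median-split form of information gain, and the helper names `construct_features`, `information_gain` and `select_top_k` are all hypothetical, and none of the named methods (ExploreKit, AutoLearn, etc.) is reproduced.

```python
import math
from itertools import combinations


def construct_features(rows, names):
    """Add pairwise product and ratio features to each row (dict: name -> float)."""
    expanded_rows = []
    for row in rows:
        new_row = dict(row)
        for a, b in combinations(names, 2):
            new_row[f"{a}*{b}"] = row[a] * row[b]
            # Assumption: a zero denominator maps to 0.0 (an arbitrary placeholder).
            new_row[f"{a}/{b}"] = row[a] / row[b] if row[b] != 0 else 0.0
        expanded_rows.append(new_row)
    return expanded_rows


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    result = 0.0
    for value in set(labels):
        p = labels.count(value) / n
        result -= p * math.log2(p)
    return result


def information_gain(values, labels):
    """Information gain of a single split at the (upper) median of `values`."""
    threshold = sorted(values)[len(values) // 2]
    left = [y for v, y in zip(values, labels) if v < threshold]
    right = [y for v, y in zip(values, labels) if v >= threshold]
    n = len(labels)
    remainder = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - remainder


def select_top_k(rows, labels, k):
    """Rank every feature by information gain and keep the k best names."""
    scores = {
        name: information_gain([r[name] for r in rows], labels)
        for name in rows[0]
    }
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    return ranked[:k], scores


# Toy data (hypothetical): the label depends only on the ratio f1 / f2.
rows = [
    {"f1": 1.0, "f2": 0.5},
    {"f1": 4.0, "f2": 2.0},
    {"f1": 1.0, "f2": 2.0},
    {"f1": 4.0, "f2": 8.0},
    {"f1": 2.0, "f2": 1.0},
    {"f1": 2.0, "f2": 4.0},
]
labels = [1, 1, 0, 0, 1, 0]

expanded = construct_features(rows, ["f1", "f2"])
top, scores = select_top_k(expanded, labels, k=2)
print(top)
```

In this toy set neither original feature separates the classes on its own, while the constructed ratio does, so it ranks first. That is the kind of relation between features that construction by division or multiplication is meant to expose.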

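The Permutation Importance idea named among the interpretability methods can be illustrated with a small from-scratch sketch: shuffle one feature column at a time and measure how much the model's accuracy drops. Everything here is an assumption made for illustration: the stand-in `model` (a fixed threshold rule, not a trained variant scoring model), the toy data, and the helper names `accuracy` and `permutation_importance`. It does not use or reproduce the SHAP or LIME libraries.

```python
import random


def accuracy(model, X, y):
    """Fraction of rows for which the model's prediction equals the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the data set with only column j permuted.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical stand-in model: uses feature 0 only and ignores feature 1.
def model(row):
    return int(row[0] > 0.5)


X = [
    [0.9, 0.1], [0.8, 0.7], [0.7, 0.3], [0.6, 0.9],
    [0.4, 0.2], [0.3, 0.8], [0.2, 0.4], [0.1, 0.6],
]
y = [1, 1, 1, 1, 0, 0, 0, 0]

importances = permutation_importance(model, X, y)
print(importances)
```

Because the stand-in model never reads feature 1, shuffling that column leaves accuracy unchanged and its importance is exactly zero, while shuffling feature 0 lowers accuracy. A summary of this kind tells the user which inputs the score actually depends on.
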
Abstract

The invention relates to a prioritization and scoring method that facilitates the user's interpretation of genetic variants (in a VCF file produced by the bioinformatics pipeline), using machine learning for the analysis of next-generation sequencing (NGS) data.
EP20907928.4A 2019-12-25 2020-12-24 Prioritization and scoring method Pending EP4022646A4 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
TR201921589 2019-12-25
PCT/TR2020/051374 WO2021133351A1 (fr) 2019-12-25 2020-12-24 Prioritization and scoring method

Publications (2)

Publication Number Publication Date
EP4022646A1 (fr) 2022-07-06
EP4022646A4 (fr) 2022-11-02

Family

ID=76576076

Family Applications (1)

Application Number Title Priority Date Filing Date
EP20907928.4A Pending EP4022646A4 (fr) 2019-12-25 2020-12-24 Prioritization and scoring method

Country Status (2)

Country Link
EP (1) EP4022646A4 (fr)
WO (1) WO2021133351A1 (fr)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DK2764459T3 (da) * 2011-10-06 2021-08-23 Sequenom Inc Methods and processes for non-invasive assessment of genetic variations
ES2875892T3 (es) * 2013-09-20 2021-11-11 Spraying Systems Co Spray nozzle for fluidized catalytic cracking
WO2019148141A1 (fr) * 2018-01-26 2019-08-01 The Trustees Of Princeton University Methods for analyzing genetic data for the classification of multifactorial traits including complex pathologies
CN109295198A (zh) * 2018-09-03 2019-02-01 安吉康尔(深圳)科技有限公司 Method, device and terminal equipment for detecting gene variants of hereditary diseases

Also Published As

Publication number Publication date
WO2021133351A1 (fr) 2021-07-01
EP4022646A4 (fr) 2022-11-02

Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20220328

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

A4 Supplementary search report drawn up and despatched

Effective date: 20221005

RIC1 Information provided on ipc code assigned before grant

Ipc: G16B 40/20 20190101ALI20220928BHEP

Ipc: G16B 20/20 20190101ALI20220928BHEP

Ipc: G16H 50/20 20180101AFI20220928BHEP

DAV Request for validation of the european patent (deleted)
DAX Request for extension of the european patent (deleted)