EP4022646A1 - A prioritization and scoring method

Info

Publication number: EP4022646A1
Application number: EP20907928.4A
Authority: EP
Inventors: Kazim Kivanç EREN; Yağmur Ceren DARDAĞAN; Orçun TAŞAR; Muhammed AKTOLUN; Esra ÇINAR; Irmak TÜRKOĞLU ÖZTORUN; Cüneyt Öksüz; Bahadir ONAY; Hüseyin ONAY
Original assignee: Idea Teknoloji Coezuemleri Bilgisayar Sanayi Ve Ticaret AS
Current assignee: Idea Teknoloji Coezuemleri Bilgisayar Sanayi Ve Ticaret AS
Priority date: 2019-12-25
Filing date: 2020-12-24
Publication date: 2022-07-06
Also published as: EP4022646A4; WO2021133351A1

Abstract

The invention relates to a prioritization and scoring method that uses machine learning in the analysis of Next-Generation Sequencing (NGS) data to make the genetic variants (in the VCF file produced by the bioinformatics pipeline) easier for the user to interpret.

Description

DESCRIPTION

A PRIORITIZATION AND SCORING METHOD

Technical Field

The invention relates to a variant prioritization and scoring method that uses machine learning in the analysis of Next-Generation Sequencing (NGS) data to facilitate the interpretation of the genetic variants (in the VCF file produced by the bioinformatics pipeline).

State of the Art

Today, the most time-consuming stage in the DNA sequencing data analysis process is filtering the records in the variant list and then interpreting the remaining variants. The VCF file produced by the bioinformatics pipeline contains many variants (for example, whole exome sequencing yields on average about 20,000 variants of confirmed quality), most of which do not have a pathogenic effect. For diagnosis, it is essential to identify the small number of candidate variants among them that might cause disease. The filtering and prioritization needed to determine whether these variants are associated with a disease are carried out manually in the clinic, which is a difficult and lengthy process. Finding a small number of candidate variants automatically is therefore crucial for faster diagnosis.

Correct classification and prioritization of the variants from next-generation sequencing data is one of the most important steps in clinical diagnosis, and it consists of manually filtering tens of thousands of variants according to certain features. However, in most cases the correct variant might not be obtained, since the filtering method is not standardized and the filtering parameters are user-dependent.

There are several methods that calculate a pathogenicity score for each variant by feeding different variant features (such as allele frequency, functional effect and conservation scores) into machine learning methods. The results of these models, which use different algorithms, features and training data, may conflict with each other. Thus, no consensus has been reached on how to classify variants according to their pathogenicity. Moreover, the presence of a variant in other members of the family (segregation information) is very important for clinical diagnosis. Family segregation data, which may increase the success of model prediction, has not been used as a feature in the existing algorithms. On the other hand, ClinVar (one of the most widely used variant databases) recommends the use of the American College of Medical Genetics and Genomics (ACMG) guidelines to improve the clinical classification of variants in the human genome. Using these rules as features for the machine learning model is very important for increasing the success of variant classification. In current applications these criteria are applied to the variants as present/absent (binary) features, whereas the invention creates new rules from these criteria and takes them as features.

The research conducted found the patent documents US20160357903A1, US20150066378A1, US20130332081A1 and EP3061020. These applications disclose systems and methods that generate a priority score for a variant of a gene in order to evaluate the potential significance of said variant in a disease. This invention aims to generalize the prioritization process to any variant.

As a result, due to the abovementioned disadvantages and the insufficiency of the current solutions regarding the subject matter, further developments are needed in the relevant technical field.

Aim of the Invention

The invention aims to overcome the abovementioned disadvantages of the current state of the art.

The main goal of the invention is to shorten the time required for genetic diagnosis, compared to the existing systems, by determining the candidate variants that could be associated with a disease. The invention provides an algorithm based on machine learning methods that calculates pathogenicity scores for single nucleotide variants (SNVs). Novel features that have not previously been used in the literature for variant scoring, together with some of the existing scoring models (for example FATHMM, M-CAP, CERENKOV2, SIFT, PolyPhen, ClinPred, CADD, DANN, MutationTaster), are used to develop a variant scoring system for SNV type variants.

Features: The ACMG guideline criteria and rules and the family segregation information mentioned in the state of the art are used as features (as factors that affect pathogenicity in machine learning models) in the method. The ExAC pLI score of the gene in which the relevant variant occurs is also used as a feature. The pLI score gives the probability that a given gene is intolerant to loss of function, based on the number of protein-truncating variants. In addition to these features, the invention comprises constructing new features from the existing ones (feature generation/construction). The main aim here is to uncover the relations between different features by applying mathematical operations (division, multiplication etc.) to the existing features. Feature construction methods such as ExploreKit, AutoLearn, Iterative Feature Construction and Association Rule Mining are used for this purpose.
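The idea of constructing new features from existing ones via mathematical operations can be sketched as follows. This is a minimal illustration only, not the method of ExploreKit or AutoLearn: the function name `construct_features`, the feature names and the choice of pairwise products and ratios are assumptions made for the example.

```python
from itertools import combinations


def construct_features(row):
    """Return the original features plus pairwise products and ratios.

    `row` maps a feature name to a numeric value for one variant.
    The feature names used below are hypothetical.
    """
    out = dict(row)
    for a, b in combinations(sorted(row), 2):
        out[f"{a}*{b}"] = row[a] * row[b]
        # Guard against division by zero; 0.0 is an arbitrary fallback.
        out[f"{a}/{b}"] = row[a] / row[b] if row[b] != 0 else 0.0
    return out


variant = {"allele_freq": 0.001, "conservation": 0.9, "pli": 0.98}
constructed = construct_features(variant)
```

With three original features the sketch yields nine features in total: the three originals plus a product and a ratio for each of the three pairs. In practice the candidate set grows quickly, which is why a ranking and selection step follows construction.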

With the invention, the workload that diagnosis places on the user (usually a medical geneticist) is significantly reduced by automatically scoring SNP type variants.

To evaluate the variants for diagnosis, the user may require detailed information on how the variant scores are generated. Machine learning models are generally complex and their results are not always easy to interpret. For this reason, the invention uses Machine Learning Interpretability methods to provide the user with additional information about the decision process (which consists of complex machine learning models). SHAP values, Permutation Importance and LIME are used for this purpose. In addition, by presenting the complex models in the form of a single decision tree, a Decision Tree Surrogate (built with Quinlan’s C4.5 algorithm), the decision mechanism of the method is rendered more intuitive.

Machine learning methods:

Different machine learning methods are used to score variants. Some of these methods are as follows: Random Forest, XGBoost, CatBoost Classifier, Support Vector Machines, Deep Learning Models, Gaussian Mixture Modeling.

Innovative features of the invention can be listed as follows:

• New features are created from the features that are used in the literature (feature construction). Relevant features are obtained for the machine learning models for variant scoring.

• Using Machine Learning Interpretability methods, when a score is assigned to a variant, an explanation is provided of how much each feature contributed to the scoring decision made by the machine learning model. Therefore, unlike the other scoring models in the literature, a more directly interpretable score is provided to the user by displaying what the scoring decision is based on for each variant score.

In order to fulfill the abovementioned goals, the invention is a prioritization and scoring method which facilitates the interpretation of the genetic variants (in the VCF file produced by the bioinformatics pipeline), using machine learning for the analysis of next-generation sequencing data. It comprises the following process steps:

• taking the features as input to the feature construction model,

• creating new features from the received features via mathematical operators with the new feature creation module,

• ranking features (constructed and original features) according to their consistency and information gain with the feature ranking module,

• selecting a predetermined number of features from the listed features via the feature selection module and creating a new feature space,

• generating features in the new feature space by using feature calculation module,

• applying the generated features as input to the variant scoring model and obtaining variant score,

• calculating coefficients of the features by using SHAP Values and LIME method,

• creating scoring model summary by using Permutation Importance and Decision Tree Surrogate Models,

• displaying the obtained score, the feature coefficients and the scoring model summary via a user interface.

The structural and characteristic features of the present invention will be understood clearly from the following drawings and the detailed description made with reference to these drawings, and therefore the evaluation shall be made by taking these figures and the detailed description into consideration.

Figures Clarifying the Invention

Figure 1 illustrates the process steps for generating novel features for the variant scoring model.

Figure 2 illustrates the general structure of the variant scoring model.

Figure 3 illustrates the structure that shows the complete system.

Figure 4 illustrates the position of the invention within the system.

Description of the Part References

10. Feature

20. Feature construction model

21. New feature creation module

22. Feature ranking module

23. Feature selection module

30. New feature space

31. Feature calculation module

40. Scoring model

50. Score monitor

51. Score

52. Scoring model summary

53. Feature coefficients of the score

Detailed Description of the Invention

In this detailed description, the preferred embodiments of the inventive prioritization and scoring method are described by means of examples only, to clarify the subject matter.

In Figure 1, the flowchart for constructing the novel features for the variant scoring model is shown.

In Figure 2, the general structure of the variant scoring model is shown.

In Figure 3, the view of the system within a complete structure is given.

In Figure 4, the position of the invention within the system is shown.

The complete system starts when a physician examines a patient who exhibits various symptoms. The physician requests a genetic test if he/she finds it appropriate. The blood sample taken from the patient is prepared for DNA sequencing by the laboratory. The prepared sample is processed in the laboratory by the sequencing device, and the digital DNA data (raw data) of the patient is obtained. Since variant information regarding the disease cannot be obtained directly from the raw data, this data must be processed computationally with bioinformatics tools to reach the variant information. After the variants are obtained and examined by the user, a variant report that shows the relation of the variant to the disease is created.

The bioinformatics pipeline starts with the raw data obtained from the sequencing device. The raw data contains reads from different parts of the patient's DNA. The reads are aligned to the human reference genome and saved in SAM/BAM format so as to determine the regions they come from. Subsequently, the variants, i.e. the positions that do not conform to the human reference genome, are obtained from the processed SAM/BAM files and written to the VCF file. Specific filters are applied to determine the candidate variants that are associated with the disease among the many variants in the VCF file. The variants that remain after filtering are reported and the genetic diagnosis report of the patient is created. The scoring of the variants is carried out in the variant interpretation step, which follows the creation of the VCF file.
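As an illustration of where scoring attaches to the pipeline, the sketch below reads VCF records and keeps only single nucleotide variants, the variant type that the scoring model targets. It is a simplified reader written for this example: the function name `iter_snvs` and the sample records are assumptions, and only the first five tab-separated VCF columns (CHROM, POS, ID, REF, ALT) are used.

```python
def iter_snvs(vcf_lines):
    """Yield (chrom, pos, ref, alt) for every single nucleotide variant.

    Header lines start with '#'. Multi-allelic records list several ALT
    alleles separated by commas; each allele is checked separately.
    """
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        for allele in alt.split(","):
            is_snv = (
                len(ref) == 1 and len(allele) == 1
                and ref in "ACGT" and allele in "ACGT"
            )
            if is_snv:
                yield chrom, int(pos), ref, allele


# Hypothetical records, for illustration only.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t12345\t.\tA\tG\t50\tPASS\t.",
    "1\t22222\t.\tAT\tA\t50\tPASS\t.",   # deletion, skipped
    "2\t333\t.\tC\tT,CA\t50\tPASS\t.",   # only the C>T allele is an SNV
]
snvs = list(iter_snvs(example))
```

A production pipeline would normally rely on an established VCF library rather than hand-written parsing; the sketch only shows the record-level logic.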

Preliminary steps performed to construct the scoring model: New feature space (30) is created by the feature construction model (20) based on the original features (10) in the data set, to be used for the variant scoring model (40). As a first step, the features (10) are taken as input to the feature construction model (20). Then, the new feature creation module (21) creates new features from the received features (10) via mathematical operators. New features and original features (10) are ranked according to criteria, such as consistency and information gain with the feature ranking module (22). A predetermined number of features are selected among the ranked features via the feature selection module (23). A new feature space (30) is created by the selected features to be used by the variant scoring model (40). New feature space (30) that is obtained after all the stages in the feature construction model (20) are carried out, is used as input parameters of the variant scoring model (40). The variant scoring model (40) is trained with machine learning methods by using the variant data set containing new feature space (30).
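The ranking and selection steps can be sketched as follows, assuming information gain is estimated by splitting each numeric feature at its median. The text also names consistency as a ranking criterion, which this sketch does not cover. The function names, the median split and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Entropy reduction from splitting one numeric feature at its median."""
    threshold = sorted(values)[len(values) // 2]
    left = [y for v, y in zip(values, labels) if v < threshold]
    right = [y for v, y in zip(values, labels) if v >= threshold]
    gain = entropy(labels)
    for part in (left, right):
        if part:
            gain -= len(part) / len(labels) * entropy(part)
    return gain


def select_top_k(columns, labels, k):
    """Rank feature columns by information gain and keep the best k names."""
    ranked = sorted(
        columns,
        key=lambda name: information_gain(columns[name], labels),
        reverse=True,
    )
    return ranked[:k]


# Toy data: 1 = pathogenic, 0 = benign (hypothetical values).
labels = [1, 1, 0, 0]
columns = {
    "informative": [0.9, 0.8, 0.1, 0.2],
    "noise": [0.5, 0.1, 0.6, 0.2],
    "constant": [1.0, 1.0, 1.0, 1.0],
}
selected = select_top_k(columns, labels, k=1)
```

The selected names define the new feature space; only those features then need to be computed for each variant before scoring.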

Scoring:

Features in the new feature space (30) (which is obtained in the preprocessing step from original features (10)) are calculated using feature calculation module (31) and are applied as an input to the variant scoring model (40). The scoring model (40) generates variant score (51) by scoring the variant according to the input values.

When a complex model is applied to an individual input to make a decision, the features it relies on (from the same feature space) and their weights may differ from one input to another. This also holds for the variant scoring model. When a user evaluates the variant score (51), he/she may wish to form an expert opinion by referring to information on which features were considered and to what extent. The feature coefficients of the score (53) are calculated using SHAP values and the LIME method in order to present this information. The user can thus see how the score was obtained.
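SHAP values are based on Shapley values from cooperative game theory. The sketch below computes exact Shapley values by enumerating all feature subsets, which is only feasible for a handful of features; it does not use or reproduce the API of the SHAP library, which relies on faster approximations. The scoring function `toy_score`, the feature names and the all-zero baseline are assumptions made for the example.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline)."""
    names = list(x)
    n = len(names)

    def value(subset):
        # Features in `subset` take the variant's values, the rest the baseline's.
        mixed = {f: (x[f] if f in subset else baseline[f]) for f in names}
        return model(mixed)

    phi = {}
    for f in names:
        others = [g for g in names if g != f]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_f = value(set(subset) | {f})
                without_f = value(set(subset))
                total += weight * (with_f - without_f)
        phi[f] = total
    return phi


def toy_score(v):
    """Hypothetical scoring model with one interaction term."""
    return 0.5 * v["conservation"] + 0.3 * v["pli"] + 0.2 * v["conservation"] * v["pli"]


x = {"conservation": 1.0, "pli": 1.0}
baseline = {"conservation": 0.0, "pli": 0.0}
phi = shapley_values(toy_score, x, baseline)
```

The contributions sum to the difference between the variant's score and the baseline score, which is the property that makes such coefficients usable as a per-variant explanation.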

The variant scoring model (40) is a complex model, so it may not be easy to fully interpret its results and how it operates. For this reason, the scoring model summary (52) is formed using Machine Learning Model Interpretability methods, so as to provide the user with additional information on how the complex variant scoring model (40) reaches its decisions and thereby help the user make a more accurate evaluation. Methods such as Permutation Importance and Decision Tree Surrogate Models are used to create the scoring model summary (52).
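Permutation Importance can be sketched as follows: one feature column is shuffled at a time and the resulting drop in accuracy is recorded. This is a minimal standard-library illustration; the function names, the toy classifier and the data are assumptions, and the Decision Tree Surrogate part of the summary is not covered here.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the given label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(labels)


def permutation_importance(model, rows, labels, n_repeats=20, seed=0):
    """Mean accuracy drop when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    base = accuracy(model, rows, labels)
    importances = {}
    for feature in rows[0]:
        drops = []
        for _ in range(n_repeats):
            shuffled = [r[feature] for r in rows]
            rng.shuffle(shuffled)
            # Replace only this feature's values, keeping the others intact.
            permuted = [dict(r, **{feature: v}) for r, v in zip(rows, shuffled)]
            drops.append(base - accuracy(model, permuted, labels))
        importances[feature] = sum(drops) / n_repeats
    return importances


def toy_classifier(row):
    """Hypothetical model: calls a variant pathogenic on conservation alone."""
    return int(row["conservation"] > 0.5)


rows = [
    {"conservation": 0.9, "noise": 0.3},
    {"conservation": 0.8, "noise": 0.7},
    {"conservation": 0.1, "noise": 0.2},
    {"conservation": 0.2, "noise": 0.9},
]
labels = [1, 1, 0, 0]
importances = permutation_importance(toy_classifier, rows, labels)
```

Because the toy classifier ignores the `noise` feature, shuffling it never changes a prediction and its importance is exactly zero, while shuffling `conservation` lowers the accuracy.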

After a variant is applied as an input to the scoring model (40), score (51), feature coefficients of the score (53) and the scoring model summary (52) are displayed to the user via an interface. Therefore, the user can see the underlying decision process specific to the variant, along with the variant score (51).

Claims

1. A prioritization and scoring method that facilitates the interpretation of the genetic variants (in the VCF file produced by the bioinformatics pipeline) by using machine learning for the analysis of next-generation sequencing data, characterized by comprising the following steps:

• taking the features (10) as an input to the feature construction model (20),

• creating, by the new feature creation module (21), new features by associating the received features (10) via mathematical operators,

• ranking the new features and the original features (10) according to their consistency and information gain with the feature ranking module (22),

• selecting a predetermined number of consistent features from the listed features via the feature selection module (23) and creating a new feature space (30),

• generating features in the new feature space (30) by using feature calculation module (31),

• applying the generated features as an input to the variant scoring model (40) and obtaining the variant score (51),

• calculating the feature coefficients of the score (53) by using SHAP Values and LIME method,

• creating the scoring model summary (52) by using Permutation Importance and Decision Tree Surrogate Models,

• displaying the obtained score (51), the feature coefficients of the score (53) and the scoring model summary (52) to the user via an interface.