WO2022229964A1

WO2022229964A1 - Method of generating a diseases database, usage of the diseases database, and system therefor

Info

Publication number: WO2022229964A1
Application number: PCT/IL2022/050441
Authority: WO
Inventors: Tzach Itzchak DAVIDI; Noam ALON
Original assignee: Impilo Ltd.
Priority date: 2021-04-29
Filing date: 2022-04-28
Publication date: 2022-11-03

Abstract

A computer-implemented method for generating a diseases database is provided. The method includes obtaining medical data, including multiple clinical findings, each having associated timing information; processing the medical data to correlate the disease with one or more clinical findings common to a majority of patients, and to identify stages of development of the disease, based on the respective clinical findings that are common to the majority of the patients and their associated timing information; and maintaining a database mapping the disease to its respective stages of development and clinical findings. There is also provided a computer-implemented method for retrieving data from a diseases database, which includes accessing and searching the diseases database to identify a disease and at least one respective stage of development of the disease corresponding to medical data pertaining to a patient.

Description

METHOD OF GENERATING A DISEASES DATABASE, USAGE OF THE DISEASES DATABASE, AND SYSTEM THEREFOR

TECHNICAL FIELD

The presently disclosed subject matter relates to generating a diseases database and to usage of the diseases database to diagnose a disease, and, more particularly, to generating a diseases database for rare diseases, and usage thereof.

BACKGROUND

The total number of patients around the world with rare diseases adds up to about 10% of the world's population. Yet, despite progress of the technology that enables medicine to diagnose more and more diseases, with respect to diagnosis of rare diseases, current technology still suffers from reduced and limited capabilities. The unique nature of low prevalence of rare diseases, on the one hand, and the high variability and complexity of such diseases on the other hand, pose significant challenges to practitioners and health care systems to identify and diagnose rare diseases of patients. In practice, the low ability to diagnose rare diseases results in poor statistical outcomes, estimated such that, on average, each patient with a rare disease will experience an unacceptable seven years' delay in diagnosis, eight different specialist consultations, and three misdiagnoses.

In an effort to formulate a differential diagnosis for patients suffering from more common diseases, and accurately detect the conditions that lead to a patient's symptoms, physicians frequently use deductive reasoning, based on known models and procedures. By asking appropriately targeted questions relating to the patient's complaints and clinical findings, the physician either confirms or cancels the match to various models, until a differential diagnosis is determined. Known Decision Support Systems (DSS) use similar logic, and assist a physician to reach a differential diagnosis with calculation speed, memory ability, and pointing out human errors in the process. However, such currently available diagnostic tools are designed to cope with more common diseases, and are not suitable for dealing with the special nature of rare diseases. Either the tools are designed to diagnose specific diseases, requiring a high volume of accurate information and knowledge that is not available in the rare diseases field, or these tools are indifferent to the progression and development of such diseases over time. Also, current tools are indifferent to the interrelationships between the high volume of unrelated symptoms. The medical and technological challenges related to reaching a differential diagnosis in rare and complex patients with such diseases present further challenges. This includes a lack of agreed models, procedures, and protocols to be applied for a patient experiencing a set of given symptoms. In addition, this may include fragmented, irregular, and multi-participant processes. For example, the patient sees multiple medical experts, each with his own expertise and perception, at different times, in different organizations (community care, hospitals), with different procedures and protocols, depending on physicians' experience and organization policy. 
Moreover, rare diseases are characterized by multifaceted, complex, and mild symptoms, along with compositions of various symptoms, which make it difficult for current models to detect them.

It is therefore desired to detect or predict rare diseases in a more efficient and precise manner, using technological tools.

GENERAL DESCRIPTION

Known tools for diagnosis of common diseases are indifferent to the special characteristics of rare diseases. Low prevalence of each rare disease in the population results in a lack of available accurate information and knowledge. On the other hand, such rare diseases are characterized by fast progression and development of comorbidities. In many cases, a patient suffering from a rare disease will experience a high volume of mild and unrelated symptoms, making the disease difficult for a medical professional to detect. Also, a significant factor, which should be considered when wishing to diagnose a rare disease, is the time dimension of development of the disease. For example, in Niemann-Pick disease type C, a patient may experience symptoms of cognitive impairment, enlarged spleen, and lack of vertical eye movement, each of which could easily be diagnosed separately, without connecting the symptoms to one disease. The symptoms might appear at different times, which makes the diagnosis process even more difficult.

Moreover, current tools rely on characteristics of common diseases and ignore the special characteristics of rare diseases. This results in difficulties in implementing current diagnostic tools and differential diagnostic tools for patients experiencing rare diseases. Therefore, when wishing to detect rare diseases, high sensitivity should be given to the time dimension of the development of the disease, by referring to the various stages of development of the disease, to the accumulation of symptoms, and to the effects of the various interventions. For example, some patients experience adverse reactions to X-ray screening, while others react in a different manner to certain medications. These might be clues to the occurrence of the disease and should be considered during the diagnosis.

It is to be noted that the following description is provided for rare diseases, in particular, and the development of such diseases. Those skilled in the art will readily appreciate that the teachings of the presently disclosed subject matter are, likewise, applicable to other, more common diseases, sharing only partial characteristics with rare diseases, such as a high volume of mild and unrelated symptoms. For example, a specific type of intrauterine device that is inserted into a woman's body may result in a side effect of pain in the breasts. The woman suffering from breast pain will not relate the intrauterine device to being the reason for the pain in the breast, since the two separate organs do not seem to connect to each other. Moreover, the woman may have had some other diseases during that time, such as influenza, which may seem to her as being more relevant. The procedure of insertion of the intrauterine device was done by a gynecologist. However, the woman will approach a breast specialist to complain about the pain in the breasts, who will rarely be able to connect between the intrauterine device and the pains. Detection of the intrauterine device being the source of the pain, and the treatment (replacing the intrauterine device), can be made only by a physician who specializes in both areas of gynecology and breasts, which is not common in the medical community. Providing a diseases database that addresses the problem of identifying a disease involving unrelated symptoms may be advantageous, and may assist in detecting the disease in an efficient manner.

In accordance with certain embodiments of the presently disclosed subject matter, clinical findings may include the acquisition of subjective or objective medical information pertaining to a patient, from various sources. Such medical information can be obtained, for example, from the patient himself/herself, or from medical records pertaining to the patient. Clinical findings can include diagnosis of the patient and symptoms experienced by the patient. For example, clinical findings can include the age of the patient, or lab test results. In order to detect a rare disease, a connection between the disease and its clinical findings is made. For example, Hepatomegaly can be connected to Niemann-Pick Disease Type C. In addition, a connection between the disease and its stages of development should be made. For example, Niemann-Pick Disease Type C is characterized by the following stages of development: "Minimal to Asymptomatic", "Mild", "Moderate" and "Severe/Advanced". These stages of development can be connected to Niemann-Pick Disease Type C.

Therefore, in accordance with certain embodiments of the presently disclosed subject matter, in order to detect rare diseases that have unrelated symptoms and that are characterized by development of the diseases, a diseases database that maps between diseases, their clinical findings, and the stages of development can be provided. The diseases database can comprise a plurality of disease graphs. Each disease graph can pertain to one disease, and can map the disease to its respective stages of development and clinical findings. The connection between two separate diseases, sharing some of their clinical findings, can be reflected in the diseases database by connecting the disease graphs of the two diseases. The disease graphs of two diseases that share the same clinical findings can be connected by sharing nodes of the shared clinical findings. Sharing nodes of clinical findings between several graphs relating to several diseases is advantageous, as it assists, when searching the graphs to identify a disease and a disease stage, in navigating between the various disease graphs, and in reaching and identifying the correct disease pertaining to the clinical findings. In some cases, the clinical findings of a patient can be associated with timing information including, e.g., information pertaining to the patient, such as the age of the patient, or can relate to timing information of another clinical finding, such as the time that the patient started taking a prescribed medication, or the time that has passed from the point of taking the medication until the patient experienced a certain symptom. In some examples, the timing information can be used to identify stages of development of a disease, thereby facilitating generation of a diseases database that maps not only between diseases and their clinical findings, but also between the diseases, their clinical findings, and stages of development of the disease.
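By way of non-limiting illustration only, the arrangement of disease graphs that share clinical-finding nodes might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the disclosed system: the class name `DiseasesDatabase`, its methods, and the placeholder disease, stage, and finding labels are all hypothetical and do not appear in the specification.

```python
from collections import defaultdict


class DiseasesDatabase:
    """Hypothetical in-memory store of disease graphs that share finding nodes."""

    def __init__(self):
        # disease -> stage of development -> set of clinical findings
        self.graphs = defaultdict(lambda: defaultdict(set))
        # clinical finding -> set of (disease, stage) nodes connected to it;
        # a finding referenced by several diseases is a single shared node
        self.finding_index = defaultdict(set)

    def add_mapping(self, disease, stage, finding):
        """Connect a clinical finding to a stage of a disease."""
        self.graphs[disease][stage].add(finding)
        self.finding_index[finding].add((disease, stage))

    def shared_findings(self, disease_a, disease_b):
        """Return the findings whose node is shared by both disease graphs."""
        def findings_of(disease):
            if disease not in self.graphs:
                return set()
            return set().union(*self.graphs[disease].values())

        return findings_of(disease_a) & findings_of(disease_b)


# Usage with placeholder labels (assumed, for illustration only)
db = DiseasesDatabase()
db.add_mapping("Disease A", "Mild", "finding_1")
db.add_mapping("Disease A", "Moderate", "finding_2")
db.add_mapping("Disease B", "Mild", "finding_1")
print(db.shared_findings("Disease A", "Disease B"))
```

Indexing each finding once, with back-references to every disease and stage that uses it, is one simple way to realize shared nodes; other graph representations would serve equally well.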

In some cases, the clinical findings of a patient can be used to search the diseases database and identify a disease and a disease stage of development. The various disease graphs can be searched to find nodes of clinical findings that correspond to the clinical findings of the patient. Using the edges between the nodes, the search can proceed from the node that was found to match the clinical findings, to other nodes of diseases and diseases stages of development, in order to identify the disease and the disease stage of the patient.
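The search described above, which proceeds from matching clinical-finding nodes along edges to disease and stage nodes, might be sketched as follows. This is an illustrative sketch only; the edge table `FINDING_EDGES`, the function `identify_candidates`, and the hit-counting heuristic are assumptions introduced here and are not taken from the specification.

```python
from collections import Counter

# Hypothetical edges: clinical-finding node -> connected (disease, stage) nodes
FINDING_EDGES = {
    "finding_1": [("Disease A", "Mild")],
    "finding_2": [("Disease A", "Moderate"), ("Disease B", "Mild")],
    "finding_3": [("Disease A", "Moderate")],
}


def identify_candidates(patient_findings, finding_edges=FINDING_EDGES):
    """Follow edges from matched finding nodes and count hits per (disease, stage)."""
    hits = Counter()
    for finding in patient_findings:
        # Findings with no matching node in any graph are simply skipped
        for node in finding_edges.get(finding, []):
            hits[node] += 1
    # Most frequently reached (disease, stage) nodes first
    return hits.most_common()


candidates = identify_candidates(["finding_2", "finding_3"])
print(candidates)
```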

According to a first aspect of the presently disclosed subject matter there is provided a computer-implemented method for generating a diseases database comprising: obtaining medical data pertaining to a plurality of patients having a disease common to all the patients, wherein the medical data includes, for each of the patients, multiple clinical findings each having associated timing information; processing the medical data to correlate the disease with one or more clinical findings common to a majority of the patients, and to identify stages of development of the disease, based on the respective clinical findings that are common to the majority of the patients and their associated timing information; and maintaining a database mapping the disease to its respective stages of development and clinical findings.
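As a hedged illustration of the correlating step, the clinical findings common to a majority of the patients might be selected as in the following Python sketch. The function name `findings_common_to_majority` and the interpretation of "majority" as strictly more than half of the patients are assumptions made here for illustration.

```python
from collections import Counter


def findings_common_to_majority(patients):
    """Return findings present in more than half of the patients.

    patients: list of collections of clinical findings, one per patient,
    all patients having the same disease.
    """
    # Count each finding at most once per patient
    counts = Counter(f for findings in patients for f in set(findings))
    threshold = len(patients) / 2  # assumed meaning of "majority"
    return {f for f, n in counts.items() if n > threshold}


cohort = [{"a", "b"}, {"a"}, {"a", "c"}, {"b", "c"}]
print(findings_common_to_majority(cohort))
```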

In addition to the above features, the computer-implemented method according to this aspect of the presently disclosed subject matter can optionally comprise in some examples one or more of features (i) to (xvi) below, in any technically possible combination or permutation:

i. Wherein the medical data comprises a plurality of electronic health records (EHRs), each EHR pertaining to a patient.
ii. Wherein obtaining the medical data further comprises: deidentifying and/or cleansing the medical data.
iii. Wherein each clinical finding has a form of structured data, unstructured data or a combination thereof.
iv. Wherein processing the medical data further comprises: for each clinical finding, identifying at least one predefined medical category that matches each clinical finding and associating the timing information to the respective at least one predefined medical category; and processing the at least one predefined medical category with the respective timing information to identify the stages of development of the disease.
v. Wherein processing each clinical finding having the form of unstructured data to identify at least one predefined medical category is performed using a machine learning model.
vi. Wherein the timing information pertains to the patient's clinical findings.
vii. Wherein the medical data includes, for a given patient of the patients, first and second clinical findings, wherein the timing information of the second clinical finding relates to the timing information of the first clinical finding.
viii. Wherein processing the medical data comprises: clustering the clinical findings into multiple latent features clusters; and processing the clusters to identify the stages of development of the disease.
ix. Wherein, prior to clustering the clinical findings, the method further comprises: for each patient, generating a respective timeline of the clinical findings based on the associated timing information; and, based on the timelines of the patients, clustering the clinical findings.
x. Wherein processing the clusters is performed using an AI model.
xi. Wherein processing the clusters comprises determining for each stage of development one or more of: respective probabilities to reach the stage from other stages, a respective weight of latent features pertaining to the stage and an average duration of the stage.
xii. The method further comprising: calculating, for at least one disease, a probability of occurrence and/or a disease average duration.
xiii. Wherein the database is organized as a plurality of disease graphs, each disease graph having a tree structure and comprising a plurality of connected nodes of stages of development and the clinical findings.
xiv. Wherein at least two disease graphs share at least one node of clinical findings.
xv. Wherein an edge between first and second nodes is associated with a transition probability indicative of a probability to transfer from the first node into the second node, and wherein the method further comprises associating at least one transition probability to at least one edge.
xvi. The method further comprising: obtaining medical literature, wherein the medical literature includes multiple clinical findings pertaining to the disease; and processing the medical data with the medical literature to correlate the disease with the one or more clinical findings and identify the stages of development.
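Features (ix), (xi) and (xv) above refer to per-patient timelines and to probabilities of reaching one stage from another. One simple way such quantities might be estimated is sketched below in Python. This is an assumption-laden sketch: the function names, the representation of a timeline as (finding, time) pairs, and the frequency-count estimator are introduced here for illustration and are not stated in the specification, which leaves the model unspecified.

```python
from collections import Counter, defaultdict


def build_timeline(findings):
    """Order one patient's (finding, time) pairs chronologically."""
    return sorted(findings, key=lambda item: item[1])


def transition_probabilities(stage_sequences):
    """Estimate P(next stage | current stage) by counting observed transitions.

    stage_sequences: list of per-patient lists of stage labels in time order.
    """
    counts = defaultdict(Counter)
    for seq in stage_sequences:
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    return {
        current: {nxt: n / sum(c.values()) for nxt, n in c.items()}
        for current, c in counts.items()
    }


# Placeholder data (assumed): times are ages in years
timeline = build_timeline([("finding_2", 7.5), ("finding_1", 3.0)])
probs = transition_probabilities(
    [["Mild", "Moderate", "Severe"], ["Mild", "Severe"], ["Mild", "Moderate"]]
)
print(timeline, probs)
```

The resulting probabilities could then be attached to edges between stage nodes, as feature (xv) contemplates.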

The presently disclosed subject matter further comprises a computer system comprising a processing circuitry that comprises at least one processor and a computer memory, the processing circuitry being configured to execute a method as described above with reference to the first aspect, and may optionally further comprise one or more of the features (i) to (xvi) listed above, mutatis mutandis, in any technically possible combination or permutation. The presently disclosed subject matter further comprises a non-transitory computer readable storage medium tangibly embodying a program of instructions that, when executed by a computer, cause the computer to perform a method as described above with reference to the first aspect, and may optionally further comprise one or more of the features (i) to (xvi) listed above, mutatis mutandis, in any technically possible combination or permutation.

According to a second aspect of the presently disclosed subject matter there is provided a disease data storage and retrieval system for a computer having a processing and memory circuit (PMC), comprising: a processor of the PMC for configuring the memory of the PMC to store a diseases database comprising a plurality of disease graphs; wherein each disease graph pertains to a disease and maps the disease to its respective stages of development and clinical findings.

In addition to the above features, the system according to this aspect of the presently disclosed subject matter can comprise one or more of features (i) to (xiv) listed below, in any desired combination or permutation which is technically possible:

i. Wherein the processor is configured to generate the stages of development based on timing information associated with the clinical findings.
ii. Wherein the processor is configured to generate each disease graph by: obtaining medical data pertaining to a plurality of patients having a disease common to all the patients, wherein the medical data includes, for each of the patients, multiple clinical findings each having associated timing information; processing the medical data to correlate the disease with one or more clinical findings common to a majority of the patients, and identify stages of development of the disease, based on the respective clinical findings that are common to the majority of the patients and their associated timing information; and generating the disease graph by mapping the disease to its respective stages of development and clinical findings.
iii. Wherein the medical data comprises a plurality of electronic health records (EHRs), each EHR pertaining to a patient.
iv. Wherein the medical data is deidentified and/or cleansed.
v. Wherein each clinical finding has a form of structured data, unstructured data, or a combination thereof.
vi. Wherein the timing information pertains to the patient's clinical findings.
vii. Wherein each disease graph is further generated by: clustering the clinical findings into multiple latent features clusters; and processing the clusters to identify the stages of development of the disease.
viii. Wherein at least one disease graph of the plurality of disease graphs is associated with a probability of occurrence and/or a disease average duration.
ix. Wherein at least one of the disease graphs has a tree structure and comprises a plurality of connected nodes of the stages of development and the clinical findings.
x. Wherein at least two of the disease graphs share at least one node of clinical findings.
xi. Wherein an edge between first and second nodes is associated with a transition probability indicative of a probability to transfer from the first node into the second node, and wherein at least one edge in a disease tree is associated with a transition probability.
xii. Wherein at least one disease graph includes at least one node of latent features cluster generated by clustering the clinical findings pertaining to the disease.
xiii. Wherein at least one node of latent features cluster is connected by an edge to at least one node of stage of the disease, and wherein the at least one node of stage of the disease is associated with a respective weight of the latent features pertaining to the stage.
xiv. Wherein at least one node in the graph is associated with medical data pertaining to a patient, based on which the node was generated.

According to a third aspect of the presently disclosed subject matter there is provided a computer-implemented method for retrieving data from a diseases database comprising: accessing a diseases database comprising a plurality of disease graphs, wherein each disease graph pertains to a disease and maps the disease to its respective stages of development and clinical findings; and searching the diseases database to identify a disease and at least one respective stage of development of the disease corresponding to medical data pertaining to a patient, wherein the medical data includes multiple clinical findings and related timing information.

In addition to the above features, the method according to this aspect of the presently disclosed subject matter can comprise one or more of features (i) to (vi) listed below, in any desired combination or permutation which is technically possible:

i. Wherein at least one of the disease graphs has a tree structure and comprises a plurality of connected nodes of stages of development and the clinical findings, wherein an edge between connected first and second nodes is associated with a respective transition probability indicative of a probability to transfer from the first node into the second node, and wherein searching the diseases database comprises: searching the clinical findings nodes to identify one or more nodes matching the clinical findings included in the medical data; and, based on transition probabilities from the one or more matching nodes to other nodes, navigating the disease graphs to identify the at least one associated disease and the respective stage of development.
ii. The method further comprising: calculating at least one disease probability for the at least one identified disease.
iii. The method further comprising: identifying the disease having a maximum disease probability.
iv. The method further comprising: in response to identifying the stage of development, searching the graph to identify second clinical findings nodes that are associated with an earlier stage of development of the identified disease; and determining one or more clinical findings in the medical data that match the second clinical findings.
v. The method further comprising: updating at least one disease graph based on newly acquired medical data.
vi. The method further comprising: in response to identifying the stage of development, searching the graph to identify second clinical findings nodes that are associated with an earlier stage of development of the identified disease; and determining one or more clinical findings in the medical data that match the second clinical findings.

The presently disclosed subject matter further comprises a computer system comprising a processing circuitry that comprises at least one processor and a computer memory, the processing circuitry being configured to execute a method as described above with reference to the third aspect, and may optionally further comprise one or more of the features (i) to (vi) listed above, mutatis mutandis, in any technically possible combination or permutation.
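Features (i) to (iii) of the third aspect refer to navigating by transition probabilities, calculating a disease probability, and identifying the disease having the maximum probability. A minimal Python sketch of one possible scoring scheme follows. The scoring rule (summing the transition probabilities of the edges leaving the matched finding nodes and normalizing across diseases) is an assumption chosen for simplicity; the specification does not define how the disease probability is calculated, and the function name `rank_diseases` and the edge format are hypothetical.

```python
from collections import defaultdict


def rank_diseases(patient_findings, edges):
    """Rank diseases reachable from the patient's matched finding nodes.

    edges: finding -> list of (disease, stage, transition_probability).
    Returns a list of (disease, normalized score, most probable stage),
    highest score first.
    """
    scores = defaultdict(float)
    best_stage = {}  # disease -> (stage, highest transition probability seen)
    for finding in patient_findings:
        for disease, stage, p in edges.get(finding, []):
            scores[disease] += p
            if disease not in best_stage or p > best_stage[disease][1]:
                best_stage[disease] = (stage, p)
    total = sum(scores.values())
    if total == 0:
        return []
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(d, s / total, best_stage[d][0]) for d, s in ranked]


# Placeholder edges and probabilities (assumed, for illustration only)
EDGES = {
    "finding_1": [("Disease A", "Mild", 0.6), ("Disease B", "Mild", 0.2)],
    "finding_2": [("Disease A", "Moderate", 0.8)],
}
ranking = rank_diseases(["finding_1", "finding_2"], EDGES)
print(ranking[0])  # the disease with the maximum score comes first
```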

The presently disclosed subject matter further comprises a non-transitory computer readable storage medium tangibly embodying a program of instructions that, when executed by a computer, cause the computer to perform a method as described above with reference to the third aspect, and may optionally further comprise one or more of the features (i) to (vi) listed above, mutatis mutandis, in any technically possible combination or permutation.

BRIEF DESCRIPTION OF THE DRAWINGS

In order to understand the invention and to see how it can be carried out in practice, embodiments will be described, by way of non-limiting examples, with reference to the accompanying drawings, in which:

Fig. 1 illustrates a functional diagram of a disease data storage and retrieval system 100, in accordance with certain embodiments of the presently disclosed subject matter;

Fig. 2 illustrates a generalized flow chart 200 of a method for generating a diseases database, according to certain embodiments of the presently disclosed subject matter;

Fig. 3 illustrates one example of a disease graph 300 pertaining to a disease 310, in accordance with certain embodiments of the presently disclosed subject matter;

Fig. 4 illustrates an example of one disease graph 400 of Niemann-Pick Disease Type C, in accordance with certain embodiments of the presently disclosed subject matter; and

Fig. 5 illustrates a generalized flow chart 500 of a method for retrieving data from a diseases database, according to certain embodiments of the presently disclosed subject matter.

DETAILED DESCRIPTION

In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the invention. However, it will be understood by those skilled in the art that the presently disclosed subject matter may be practiced without these specific details. In other instances, well-known methods, procedures, components and circuits have not been described in detail so as not to obscure the presently disclosed subject matter.

Unless specifically stated otherwise, as apparent from the following discussions, it is appreciated that throughout the specification discussions utilizing terms such as "obtaining", "extracting", "processing", "updating", "deidentifying", "cleansing", "generating", "using", "transitioning", "receiving", "conducting", "identifying", "calculating", "providing", "prioritizing", "associating", "maintaining", "clustering", "accessing", "searching", "navigating", or the like, refer to the action(s) and/or process(es) of a computer that manipulate and/or transform data into other data, said data represented as physical, such as electronic, quantities and/or said data representing the physical objects. The terms computer/computer device/computerized system, or the like, should be expansively construed to include any kind of hardware-based electronic device with a processing circuitry (e.g., digital signal processor (DSP), a GPU, a TPU, a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), microcontroller, microprocessor etc.), including, by way of non-limiting example, computerized systems or devices such as disease data storage and retrieval system 100 disclosed in the present application. The processing circuitry can comprise, for example, one or more processors operatively connected to computer memory, loaded with executable instructions for executing operations as further described below.

The terms "non-transitory memory" and "non-transitory storage medium" used herein should be expansively construed to cover any volatile or non-volatile computer memory suitable to the presently disclosed subject matter.

The operations in accordance with the teachings herein may be performed by a computer specially constructed for the desired purposes, or by a general-purpose computer specially configured for the desired purpose by a computer program stored in a non-transitory computer-readable storage medium.

Usage of conditional language, such as "may", "might", or variants thereof, should be construed as conveying that one or more examples of the subject matter may include, while one or more other examples of the subject matter may not necessarily include, certain methods, procedures, components, and features. Thus, such conditional language is not generally intended to imply that a particular described method, procedure, component or circuit is necessarily included in all examples of the subject matter. Moreover, the usage of non-conditional language does not necessarily imply that a particular described method, procedure, component, or circuit, is necessarily included in all examples of the subject matter. Also, reference in the specification to "one case", "some cases", "other cases", or variants thereof, means that a particular feature, structure, or characteristic described in connection with the embodiment(s), is included in at least one embodiment of the presently disclosed subject matter. Thus, the appearance of the phrase "one case", "some cases", "other cases" or variants thereof does not necessarily refer to the same embodiment(s).

It is appreciated that certain features of the presently disclosed subject matter, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the presently disclosed subject matter, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable sub-combination.

Despite existing systems for diagnosis of diseases, current systems for diagnosis of rare diseases still lack advanced capabilities, and perform poorly. Current tools lack both the ability to identify the progression of the disease, and distinguish between the different stages of development of a disease, and do not provide adequate weight to the accumulation of symptoms and the effects of various interventions.

In accordance with certain embodiments of the presently disclosed subject matter, there is provided a computer-implemented method, which can be performed by a processor and a memory circuitry, for generating a diseases database, e.g., for rare diseases, that will enhance diagnosis of certain diseases using the database. The database maps between diseases, their stages of development, and medical information pertaining to patients, such as clinical findings.

Bearing this in mind, attention is drawn to Fig. 1 illustrating a functional diagram of a disease data storage and retrieval system 100, in accordance with certain embodiments of the presently disclosed subject matter. System 100 for a computer comprises several components that operatively communicate with each other, and is configured both to store medical data and to enable retrieval of the stored medical data.

System 100 comprises a processor and memory circuitry (PMC) 110 comprising a processor 120 and a memory 130. System 100 further comprises a communication interface 140 enabling system 100 to operatively communicate with external devices and storages, such as an external electronic health record (EHR) database (DB) 150 and a workstation 160. The processor 120 is configured to execute several functional modules in accordance with computer-readable instructions implemented on a non-transitory computer-readable storage medium such as memory 130. Such functional modules are referred to hereinafter as comprised in the processor 120. The processor 120 can comprise an obtaining module 121, extraction module 122, clustering module 123, stages module 124, generation module 125 and query module 126. In some cases, obtaining module 121 is configured to obtain medical data pertaining to a plurality of patients. The patients may have a disease common to all the patients. For example, obtaining module 121 is configured to obtain the medical data from memory 130, from an external database such as external electronic health record (EHR) DB 150, or to receive it from a user such as a physician using workstation 160. The medical data can include at least one EHR of a patient. In some examples, the medical data can include a plurality of EHRs of a plurality of patients. Extraction module 122 is configured to receive the medical data obtained by obtaining module 121 and to extract clinical findings pertaining to the patients. For example, if the medical data comprises EHRs of a plurality of patients, extraction module 122 can extract, from each EHR of a patient, multiple clinical findings. In some cases, the clinical findings can be associated with timing information.

The timing information can include information pertaining to the patient, such as an age of the patient, or can relate to other clinical findings of the patient, such as a time that has passed from another clinical finding. Further details of the timing information are described below with respect to Fig. 2. Extraction module 122 is further configured to match the clinical findings from the medical data to predefined medical categories, and to associate the timing information of a clinical finding with the respective medical categories. Clustering module 123 is configured to receive the extracted clinical findings and the medical categories, and to cluster the clinical findings into feature clusters of similar clinical findings, such as, in some examples, clinical findings that relate to similar organs of the body or to the same body system. Stages module 124 is configured to receive the clusters and to identify stages of the disease. Generation module 125 is configured to generate the diseases database 132 comprising a plurality of disease graphs, or to update the diseases database 132 with newly acquired data, and to store the diseases database 132 in memory 130. Query module 126 is configured to query the diseases database in order to retrieve data.

Memory 130 comprised in PMC 110 can store a diseases database. The diseases database 132 may comprise a plurality of disease graphs, where each disease graph pertains to a certain disease. Each disease graph maps the disease to its respective stages of development and clinical findings. An example of the disease data is further described with respect to Figs. 3 and 4. Memory 130 may further store predefined categories 134 including predefined codes that match clinical findings pertaining to the same disease.

Fig. 1 further illustrates an external EHR DB 150 and workstation 160. The external EHR DB 150 may store medical data pertaining to a plurality of patients, including EHRs of patients. The medical data may be a source of data for generating the diseases database. Workstation 160 can be operated by a user (not illustrated), such as a physician or a medical professional. The user can communicate with the system 100 through communication interface 140. The user workstation 160 can comprise several components (not shown in Fig. 1), which operatively communicate with each other, such as a processor, a memory, and a communication interface. The user can operate user workstation 160 to communicate new medical data pertaining to a patient to system 100, and to search the diseases database stored in system 100 to identify a disease and stages of a disease that correspond to the new medical data.

It is noted that the teachings of the presently disclosed subject matter are not bound by the system 100 described with reference to Fig. 1. Equivalent and/or modified functionality can be consolidated or divided in another manner, and can be implemented in any appropriate combination of software with firmware and/or hardware and executed on a suitable device. The system 100 can be a standalone network entity, or integrated, fully or partly, with other network entities. Those skilled in the art will also readily appreciate that the data repositories can be consolidated or divided in another manner; databases such as the diseases database 132 can be shared with other systems, or be provided by other systems, including third party equipment.

Reference is made to Fig. 2 illustrating a generalized flow chart 200 of a method for generating a diseases database, according to certain embodiments of the presently disclosed subject matter. The following flowchart operations can be executed by elements of the computer system 100 and the PMC 110 including processor 120 described in Fig. 1. However, this is by no means binding, and the operations can be executed by elements other than those described herein.

In some cases, medical data pertaining to a plurality of patients is obtained (block 210) e.g., by obtaining module 121. The plurality of patients have a disease common to all the patients, and the medical data can include an indication of the disease common to all patients. For example, medical data can pertain to a plurality of patients that have Niemann-Pick Disease Type C. The medical data can include, for each of the patients, multiple clinical findings. In some examples, the medical data can comprise at least one EHR of the patient. The medical data pertaining to a plurality of patients can comprise a plurality of EHRs, each pertaining to a patient, where each EHR can comprise multiple clinical findings. For example, an EHR of a patient can comprise the following clinical findings: parameters of the patient, such as age and sex, tests and lab results (blood tests, etc.), summaries of doctors' visits, and drug prescriptions. A drug prescription can include a type of medicine, dosage, the method of intake, etc. A physician summary might include personal and family medical history, patient's complaints, the physician's analysis of the possible diagnoses, and text remarks by expert physicians.

Obtaining module 121 can obtain the medical data from the patients themselves, or from physicians treating the patients, e.g., by receiving medical data uploaded by physicians using workstation 160. Obtaining module 121 can also obtain the medical data by retrieving it from external EHR DB 150 including a plurality of EHRs of patients.

In some cases, clinical findings in the EHRs of patients can be associated with timing information. The timing information of a clinical finding can appear in the EHR near the clinical finding, and can include, for example, a timestamp of adding the clinical finding to the EHR. For example, the timing information can include the date the physician saw the patient. In some examples, the timing information can be retrieved from free text appearing in the EHR. For example, a summary of a physician can include time indication relative to the patient's birth date, and relative to the onset of the disease or symptoms, or can relate to other clinical findings of the patient, such as time relative to previous visits or consultations, and time relative to the disease stage, the time that the patient started taking a prescribed medication, or the time that has passed from the point of taking the medication until the patient experienced a certain symptom. A clinical finding along with the timing information can be regarded as an "event". An EHR can therefore comprise multiple sequential events, where each event pertains to a clinical finding and the respective timing information. An example of an EHR with events relating to clinical findings along with associated timing information is described further below.

In some examples, obtaining module 121 can process the obtained medical data by deidentifying and/or cleansing it (blocks 211 and 212). Medical data, such as EHRs of patients, includes identifying details such as the IDs of the patients, medical insurance details, contact details, and other identifying data that are less relevant to the process of mapping diseases to their respective clinical findings and stages. In addition, the medical data may include duplicate data, such as if the same visit to the physician was recorded in several medical systems, and was later integrated into a single system. The medical data may also include corrupted data. Obtaining module 121 can deidentify the medical data to reduce leakage of personally identifiable information (PII) into the database (block 211). For example, the medical data can be masked to break the link between the medical data and the patient with whom the data is initially associated. The medical data can further be cleansed (block 212) to identify and correct or remove corrupt, incomplete, duplicated, incorrect, and irrelevant data in the medical data. For example, the medical data can be cleansed by removing duplicates, fixing structural errors, handling missing data, etc. Deidentifying and/or cleansing the medical data may be performed using known methods.

In some cases, the medical data is processed to correlate the disease with one or more clinical findings common to a majority of the patients. The clinical findings that are common to the majority of the patients, and their associated timing information, are then used to identify stages of development of the disease (block 220). Then, the diseases database 132 mapping a disease to its respective stages of development and clinical findings can be maintained (block 230). Following is a description of processing the medical data, to correlate the disease with clinical findings, and to identify the stages of development.
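By way of non-limiting illustration only, the deidentifying (block 211) and cleansing (block 212) operations may be sketched as follows. The field names, the salted-hash pseudonym and the choice of duplicate key are assumptions made for this sketch and are not taken from the present disclosure; a salted hash alone is not presented here as sufficient deidentification.

```python
import hashlib

# Hypothetical field names; a real EHR export will differ.
IDENTIFYING_FIELDS = {"patient_id", "name", "insurance", "phone"}


def deidentify(record: dict, salt: str) -> dict:
    """Mask a record: drop identifying fields, keep a salted-hash pseudonym."""
    digest = hashlib.sha256((salt + str(record["patient_id"])).encode("utf-8"))
    masked = {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}
    masked["pseudonym"] = digest.hexdigest()[:12]
    return masked


def cleanse(records: list) -> list:
    """Drop incomplete records and exact duplicates (same patient, finding, date)."""
    seen = set()
    kept = []
    for rec in records:
        if not rec.get("finding") or not rec.get("date"):
            continue  # incomplete record
        key = (rec["pseudonym"], rec["finding"], rec["date"])
        if key in seen:
            continue  # duplicate, e.g. the same visit recorded in two systems
        seen.add(key)
        kept.append(rec)
    return kept


# Illustrative records only.
raw = [
    {"patient_id": 17, "name": "Patient A", "finding": "ataxia", "date": "2020-03-01"},
    {"patient_id": 17, "name": "Patient A", "finding": "ataxia", "date": "2020-03-01"},
    {"patient_id": 18, "name": "Patient B", "finding": "", "date": "2020-04-02"},
]
clean = cleanse([deidentify(r, salt="demo-salt") for r in raw])
```

In this sketch the duplicated visit collapses to a single record and the record lacking a clinical finding is dropped.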

Each clinical finding may have a form of structured data, unstructured data, or a combination thereof. Structured data can include e.g., the age of the patient, formal and semi-formal medical terminologies and taxonomies such as SNOMED codes and ICD codes, lab test results coded as LOINC codes, drug prescriptions coded as NDC codes, and symptoms and diagnoses coded as ICD-10 or SNOMED CT codes. Unstructured data can include summaries of physician visits written in free text, such that they include a description of the symptoms of the patient. Some of the clinical findings can include a combination of structured and unstructured data, such as a summary of a physician visit including both unstructured data, such as free text relating to symptoms of the patient, and structured data, such as a code relating to the drug that was prescribed.

In order to process the medical data to correlate the disease with clinical findings and to identify disease stages, in some examples, extraction module 122 is configured to receive the medical data from obtaining module 121, including the clinical findings having structured and unstructured data form, where each clinical finding is associated with its respective timing information, together constituting an event. For each clinical finding, extraction module 122 can identify at least one predefined medical category that matches the clinical finding (block 221). For example, a clinical finding having a structured data form that specifies codes of HPO or ICD-10 can be matched against a list of standard codes to identify at least one standard code as a matching medical category to the clinical finding. Extraction module 122 is further configured to process the clinical findings having unstructured data form to identify one or more predefined categories. For example, extraction module 122 can execute a machine learning (ML) model, such as a Natural Language Processing (NLP) algorithm. The NLP algorithm can extract phenotype codes entered in the HPO standard, normalize each code (e.g., by removing hyphens and white space), and then match the code with a standard HPO code. For example, a specific HPO ID code might be written in free text as "HP:0000007", "HP : 0000007", "HP:7", "HP-7" etc. The NLP algorithm can match the free text to the standard formal code ("HP:0000007"). In some examples, the free text does not include codes at all, but a human readable name, such as "Autosomal recessive inheritance". The human readable name can also be processed by the NLP algorithm and be matched to the formal HPO code ("HP:0000007").
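The code normalization described above can be illustrated, in a non-limiting manner, by the following sketch. It is a simple rule-based stand-in and not the NLP algorithm itself; the function name `normalize_hpo` and the one-entry name lexicon are assumptions made for this illustration only.

```python
import re

# Hypothetical mini-lexicon mapping human readable names to HPO codes.
NAME_TO_HPO = {"autosomal recessive inheritance": "HP:0000007"}

# "HP", optional white space, a colon or hyphen, optional white space, 1-7 digits.
_HPO_PATTERN = re.compile(r"\bHP\s*[:\-]\s*(\d{1,7})\b", re.IGNORECASE)


def normalize_hpo(text: str) -> list:
    """Return the formal HPO codes ("HP:" plus seven digits) found in free text."""
    codes = []
    for match in _HPO_PATTERN.finditer(text):
        code = f"HP:{int(match.group(1)):07d}"  # zero-pad to seven digits
        if code not in codes:
            codes.append(code)
    lowered = text.lower()
    for name, code in NAME_TO_HPO.items():
        if name in lowered and code not in codes:
            codes.append(code)
    return codes


variants = ["HP:0000007", "HP : 0000007", "HP:7", "HP-7"]
normalized = [normalize_hpo(v) for v in variants]
by_name = normalize_hpo("Family history: Autosomal recessive inheritance.")
```

Each of the four written variants, as well as the human readable name, resolves to the same formal code in this sketch.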

The predefined categories can be retrieved from published standard medical lists of terminologies and taxonomies. In addition, the predefined categories can also be retrieved from predefined categories 134 in memory 130. Predefined categories 134 can store standard terminologies and taxonomies (such as HPO, ICD-10), but can also store predefined learned or manually entered categories. For example, an EHR including free text which includes the wording "genu valgum" can be processed by the NLP algorithm to identify a match to a predefined category (SNOMED: 299330008 or ICD-10 code M21.062). If no match is found, an expert physician can determine whether or not the clinical finding is relevant to the disease, and, if so, can add a new internal code for a category that matches the clinical finding. The new internal code will be added as a category in predefined categories 134 and can be used by the extraction module 122 in future matching. The predefined medical categories that were identified can be used by processor 120 to identify the stages of development of the disease.
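As a non-limiting sketch of the lookup described above, predefined categories 134 can be pictured as a term-to-code store that falls back to a newly added internal code when no standard category matches. The class name `CategoryStore` and the `INTERNAL:` code format are hypothetical and serve this illustration only.

```python
class CategoryStore:
    """Maps lower-cased terms to category codes; internal codes are hypothetical."""

    def __init__(self, standard_terms: dict):
        self._codes = {term.lower(): code for term, code in standard_terms.items()}
        self._next_internal = 1

    def match(self, term: str):
        """Return the matching category code, or None if no match is found."""
        return self._codes.get(term.strip().lower())

    def add_internal(self, term: str) -> str:
        """Register a new internal category, e.g. following expert review."""
        code = f"INTERNAL:{self._next_internal:04d}"
        self._next_internal += 1
        self._codes[term.strip().lower()] = code
        return code


store = CategoryStore({"genu valgum": "SNOMED:299330008"})
first = store.match("Genu valgum")            # found among the standard terms
missing = store.match("runs slower")          # no match: referred to an expert
new_code = store.add_internal("runs slower")  # usable in future matching
```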

In some examples, the timing information of each clinical finding in the EHR is associated with the respective predefined medical category that was identified as matching to the clinical finding (block 222). The predefined medical categories that were identified as matching with the respective timing information, can be processed, as described further below, to identify the stages of development of the disease.

To illustrate the above, consider the following example of an EHR of a patient including medical data. The EHR includes three records for a certain patient: an initial outpatient pediatrician visit on a specified date, an outpatient psychiatric consultation on a specified date, and an emergency department visit on a specified date. Assume that the record of the initial outpatient pediatrician visit includes the following text:

"Carol Smith is an 8-year [timing information] old girl, who was brought in by her parents to the pediatrician with main complaint of them noticing that she is somewhat "slower" [symptom].

HPI [history of present illness]: Over the past month, her parents noticed that the patient has been running slower [symptoms] than usual. She often seems "spacy, not focused", [symptom] somewhat slower to respond [symptom]. They denied any recent significant illness, such as upper respiratory tract infection, diarrhea, fevers. They denied any recent trauma, weight loss, change in diet, recent travels. Her 5-year-old sibling has no such symptoms. They approached her school teacher who reported a slight decrease in concentration [symptom], otherwise no significant social withdrawal."

From the above text, the following clinical findings can be extracted:

1. slow

2. spacy, not focused

3. slower to respond

4. slight decrease in concentration.

The timing information associated with each of the above clinical findings is, in this case, identical for all of them, and pertains to Carol's age being 8.

The clinical findings for these records will be processed to extract formal codes for diseases, symptoms and treatments (e.g., ICD-10 code: "ADD F90.9") and informal terms ("decrease in concentration", "slow", "spacy, not focused"). These informal terms can be matched by an NLP algorithm to the formal codes, if such codes exist.

In some examples, in order to identify stages of development of the diseases, clustering module 123 can receive the clinical findings extracted from the EHR and their respective timing information, and cluster them into clusters. Stages of the disease can then be identified, based on the clusters. In some examples, in order to cluster the clinical findings, clustering module 123 can generate, for each patient, a respective timeline of the clinical findings included in the EHR based on the associated timing information (block 223). The clinical findings timeline is indicative of a chronological order of the clinical findings of the patient. In cases where one or more medical categories were identified as matching the clinical findings, then clustering module 123 can generate the respective timeline of the clinical findings based on medical categories and the associated timing information. Using associated timing information for each clinical finding is advantageous, as a clinical finding appearing at a first point in time during the development of a disease, should not be regarded in the same manner when appearing at a second point in time during the disease. The timeline generated based on the timing information, facilitates generating a diseases database that maps not only between diseases and their clinical findings, but also between the diseases, their clinical findings, and stages of development of the disease.
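A minimal sketch of generating a timeline (block 223) is given below, assuming for simplicity that the timing information of every event is expressed as the age of the patient in years. The `Event` class, the `build_timeline` function and the age of the third event are hypothetical and are used for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A clinical finding (or its matched category) plus timing information."""
    finding: str
    age_years: float  # timing expressed as patient age; an assumption of this sketch


def build_timeline(events):
    """Return the events in chronological order (stable for equal ages)."""
    return sorted(events, key=lambda event: event.age_years)


events = [
    Event("emergency department visit", 9.5),   # hypothetical age
    Event("slower to respond", 8.0),
    Event("slight decrease in concentration", 8.0),
]
timeline = build_timeline(events)
```

Because the sort is stable, events having identical timing information keep the order in which they appear in the EHR.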

In some examples, based on the clinical findings and the timelines of the patients, clustering module 123 can cluster the clinical findings into multiple latent feature clusters (block 224). For example, clustering module 123 can apply a statistical or a machine learning model on the plurality of timelines of the patients that were generated. For example, Principal Component Analysis (PCA) can be used, e.g. using Bayesian latent variable analysis, or other known methods. The PCA process can summarize the timelines, each including a large number of clinical findings, into a smaller set of similar clinical findings. The result of the PCA includes a plurality of latent feature clusters. The latent features can be the clinical findings, where each cluster may contain a homogeneous group of similar clinical findings, considering their timeline. In some examples, two clinical findings can be clustered together if they are similar, e.g., relate to the same organ or body system in the human body, and appear at substantially the same position on their timelines. For example, two symptoms relating to the digestive system, appearing substantially at the beginning/initial stages on the timeline, can be clustered together, whereas the same symptom, appearing both at the beginning of the timelines of some patients, and towards the end of the timelines of some other patients, will not be clustered together.
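The clustering criterion described above (same organ or body system, and substantially the same position on the timeline) can be pictured with the following simplified sketch. It is a plain grouping rule used as a stand-in and is not PCA or any other latent variable model; the three-way split of a normalized timeline and the sample observations are assumptions made for this illustration.

```python
from collections import defaultdict


def phase(relative_time: float) -> str:
    """Bucket a position on a normalized timeline (0.0 = first event, 1.0 = last)."""
    if relative_time < 1 / 3:
        return "early"
    if relative_time < 2 / 3:
        return "middle"
    return "late"


def cluster_findings(observations) -> dict:
    """Group (finding, body_system, relative_time) tuples by system and phase."""
    clusters = defaultdict(set)
    for finding, body_system, relative_time in observations:
        clusters[(body_system, phase(relative_time))].add(finding)
    return dict(clusters)


# Hypothetical observations pooled over several patient timelines.
observations = [
    ("nausea", "digestive", 0.05),
    ("vomiting", "digestive", 0.10),
    ("tremor", "nervous", 0.10),   # early for some patients
    ("tremor", "nervous", 0.90),   # late for other patients
]
clusters = cluster_findings(observations)
```

In this sketch the two early digestive symptoms share a cluster, whereas the early and late occurrences of the same symptom fall into separate clusters.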

In examples where the clinical findings are matched with categories, clustering module 123 can cluster the categories into multiple latent feature clusters instead of the clinical findings. Each feature cluster may be associated with timing information, similar to the timing information of the clinical findings belonging to the cluster.

In some examples, stages module 124 can receive the clusters with their associated timing information, and can process them to identify the stages of development of the disease (block 225). For example, stages module 124 can execute a statistical model such as a Continuous Time Hidden Markov Model (CTHMM) to model the disease progression and to identify the stages of development. For example, two different symptoms that have identical timing information will not be considered in the same feature cluster if they are not associated with the same organ, even though they can belong to the same stage of disease. In addition, the statistical model can identify additional data pertaining to the disease, including the probability of occurrence of the disease, an average duration of the disease, an average duration of a disease stage, the time difference between two stages, the transition probability matrix between the stages, and the respective weight of latent features pertaining to a stage. The resulting information from the CTHMM, i.e., the disease, clinical findings, features, disease stages of development, and the additional data pertaining to the disease, can be used to generate a disease graph (block 231). A diseases database 132 mapping the disease to its respective stages of development and clinical findings, can then be maintained (block 230). In some cases, the disease graph may have a tree structure and may comprise a plurality of connected nodes, including stages of development nodes and clinical findings nodes. An edge connecting two nodes indicates an optional transition from one node to another. An edge can be associated with a transition probability indicative of a probability to transfer from one node to the other. While generating the graph, at least one edge can be associated with a transition probability.

Reference is made to Fig. 3 illustrating one example of a disease graph 300 pertaining to a disease 310, in accordance with certain embodiments of the presently disclosed subject matter. Disease graph 300 may constitute one disease graph of a plurality of disease graphs stored in diseases database 132. As illustrated, disease graph 300 has a tree structure and comprises a plurality of connected nodes of the disease 310, stages of development 320, features 330 constituting the feature clusters, and clinical findings 340. The nodes of disease graph 300, as well as the edges, can be generated by generation module 125 based on the information received from the CTHMM. As illustrated, disease node 310 is connected to a number of stage nodes 320. Stage nodes 320 can be connected to each other, and each stage node 320 can also be connected to one or several feature nodes 330 constituting the latent feature clusters. Each feature node 330 is a cluster of several similar clinical findings and can be connected to the respective similar clinical findings nodes 340. As illustrated, Feature 2 cluster contains clinical finding 2 and clinical finding 3 that were found to be similar. In some examples, the clinical findings nodes 340 can include the predefined medical categories. Some edges of Fig. 3 are also marked with probabilities, such as 0.3 from feature 1 to clinical finding 1. Nodes in the graph may also be connected to medical data pertaining to an anonymous patient #350 (e.g., the medical records in the EHR of a patient). The medical records of patient #350 include clinical findings and their respective timing information (events 1-N). The edge from the first clinical finding of the patient #350 (CF event 1) to clinical finding 3 included in disease graph 300 indicates that patient #350 has a clinical finding that is similar to clinical finding 3 stored in the disease graph 300.
The connection to the medical data pertaining to the patients enables processing the connected medical data with new data, when such is acquired.
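One possible in-memory representation of such a disease graph is sketched below in a non-limiting manner. The `DiseaseGraph` class and its method names are hypothetical, and the connections between the stage node and the feature nodes are assumed for the sake of the example; only the 0.3 probability on the edge from feature 1 to clinical finding 1, and the membership of clinical findings 2 and 3 in feature 2, follow the description above.

```python
from dataclasses import dataclass, field


@dataclass
class DiseaseGraph:
    """Nodes keyed by name with a kind; edges optionally carry a probability."""
    nodes: dict = field(default_factory=dict)   # name -> kind
    edges: dict = field(default_factory=dict)   # (source, target) -> probability or None

    def add_node(self, name: str, kind: str) -> None:
        self.nodes[name] = kind

    def add_edge(self, source: str, target: str, probability=None) -> None:
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges[(source, target)] = probability

    def neighbours(self, name: str, kind: str) -> list:
        """Names of nodes of the given kind that `name` has an edge to."""
        return [t for (s, t) in self.edges if s == name and self.nodes[t] == kind]


graph = DiseaseGraph()
for node_name, node_kind in [
    ("disease", "disease"), ("stage 1", "stage"),
    ("feature 1", "feature"), ("feature 2", "feature"),
    ("clinical finding 1", "finding"),
    ("clinical finding 2", "finding"),
    ("clinical finding 3", "finding"),
]:
    graph.add_node(node_name, node_kind)

graph.add_edge("disease", "stage 1")
graph.add_edge("stage 1", "feature 1")   # assumed connection
graph.add_edge("stage 1", "feature 2")   # assumed connection
graph.add_edge("feature 1", "clinical finding 1", probability=0.3)
graph.add_edge("feature 2", "clinical finding 2")
graph.add_edge("feature 2", "clinical finding 3")
```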

In some examples, prior knowledge can be used to generate the disease graph. For example, Niemann-Pick disease has several known suspicion indexes used by physicians to diagnose the disease and the disease stage. One such suspicion index might include the following clinical findings: Ataxia before age 25, and vertical supranuclear gaze palsy (VSGP), or Dystonia. A latent feature cluster including these clinical findings can be added to the disease graph. Another example pertains to data from literature. For example, if, according to medical literature, it is known that patients in a specific disease stage suffer both from VSGP and Ataxia, these clinical findings can be added to the disease graph as clinical findings nodes and/or as a latent feature node.

To exemplify the above, reference is made to Fig. 4 illustrating an example of one disease graph 400 of Niemann-Pick Disease Type C, in accordance with certain embodiments of the presently disclosed subject matter. As illustrated, the disease node of Niemann-Pick Disease Type C 410 is connected to four stages of development nodes 420 ("Minimal to Asymptomatic", "Mild", "Moderate" and "Severe/Advanced"). Each of these stages can be connected to feature nodes 430 ("Cognitive impairments", "Respiratory failure", "Liver function", etc.). Each feature node 430 may be connected to the respective clinical findings nodes 440 ("Hepatomegaly", "Inspiratory rales", "ADHD", "Jaundice", "Splenomegaly" etc.). The ADHD clinical finding node may be connected to an ADHD event representing a clinical finding of a patient 450.

Those versed in the art will realize that the disease graph illustrated in accordance with certain embodiments of the presently disclosed subject matter can include fewer or additional nodes and/or edges. For example, each clinical findings node can be associated with a plurality of categories that correspond to the clinical finding, e.g. from ICD-10, SNOMED CT, etc. Also, those versed in the art will realize that the term database is meant in the broad sense, and may be implemented as several physical databases, using several technologies, for example, both a graph database and a relational database. In some cases, in a similar manner to that described above with respect to Fig. 2, a plurality of disease graphs, such as disease graph 300, can be generated for a plurality of diseases. The diseases database 132 can store a plurality of such generated disease graphs. Some of the diseases may share the same clinical findings. For example, Pompe disease and von-Gierke disease both have the symptoms of diarrhea and an enlarged liver. The diseases database 132, mapping the diseases to their respective stages of development and clinical findings, enables a physician or other medical professional to search the diseases database 132 based on clinical findings experienced by the patient, in order to identify the disease that is relevant to the patient. It is advantageous that the diseases database enables a physician who inputs a certain set of clinical findings that are shared between two separate diseases to identify both diseases as potential candidates. In order to reflect sharing of similar clinical findings between two separate diseases, and enable the search in the graph to reach both diseases from the same set of clinical findings, in some examples, the disease graphs of such two diseases that share similar clinical findings are connected.
The connection between the two graphs can be effected by sharing the same clinical findings node in their trees. As such, a single clinical findings node will be common to a first disease graph relating to a first disease, and to a second disease graph pertaining to a second disease.

Sharing of the nodes between two separate disease graphs is advantageous as it enables a physician searching the database to navigate from one clinical finding node to two separate disease graphs, e.g., depending on transition probabilities of the edges connecting to the two disease graphs.
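The effect of a shared clinical findings node can be illustrated, in a non-limiting manner, by walking upward from that node until disease nodes are reached. The intermediate feature and stage names below are invented for this sketch, as is the `reachable_diseases` function; only the two diseases and the shared symptom are taken from the example above.

```python
# Hypothetical upward adjacency: child node -> parent nodes.
PARENTS = {
    "enlarged liver": ["Pompe/feature A", "von-Gierke/feature B"],
    "Pompe/feature A": ["Pompe/stage 1"],
    "von-Gierke/feature B": ["von-Gierke/stage 1"],
    "Pompe/stage 1": ["Pompe disease"],
    "von-Gierke/stage 1": ["von-Gierke disease"],
}


def reachable_diseases(finding: str, parents: dict) -> set:
    """Walk upward from a clinical findings node; nodes without parents are diseases."""
    roots, stack, visited = set(), [finding], set()
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        ups = parents.get(node, [])
        if not ups:
            roots.add(node)
        stack.extend(ups)
    return roots


candidates = reachable_diseases("enlarged liver", PARENTS)
```

Both disease graphs are reached from the single shared node, so both diseases are returned as candidates.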

In some examples, once generated, a disease graph may be updated based on newly acquired medical data. For example, new EHRs of patients can be processed in a manner similar to that described above with respect to Fig. 2, and the disease graph can be updated by updating data pertaining to nodes or the edges in the graph. For example, a transition probability to reach a certain stage or clinical finding node can be updated based on data obtained from the new EHRs. Furthermore, in some examples, additional medical data may enrich a disease graph, by updating the disease graph and data pertaining to nodes or the edges. For example, the published Orphanet database for orphan diseases, statistics gathered from textbooks and research articles, and a NORD database can be used to enrich a disease graph. Additionally, previous knowledge of experts in the field can be used to determine disease stages and symptoms. Therefore, obtaining module 121 may obtain medical literature including e.g., published databases and literature pertaining to experts' knowledge. The medical literature may include multiple clinical findings pertaining to a certain disease, disease stages, weighted clinical findings, and probabilities of the disease and/or disease stages based on the clinical findings. Medical data pertaining to the patients sharing that certain disease can be processed together with the medical literature to correlate the disease with the one or more clinical findings, and identify the stages of development, while enriching the graph with the medical literature. Alternatively or additionally, the medical literature can be used to update a disease graph that was generated based on the EHRs. In such cases, a disease graph may be generated using EHRs pertaining to patients. Medical literature can then be used to update the disease graph.
Last, medical literature, including experts' knowledge, may be used to update nodes and edges or data pertaining to the nodes or the edges, in the disease graph. Generating the graph, and then updating it, instead of using aggregated data from the various sources to generate the graph, may be advantageous, as aggregated data might lose some of its statistical properties, which might be beneficial to the algorithm performance. Those versed in the art will realize that the order of generating and updating the graph based on medical data is not bound by the order illustrated above, and may be done in separate stages of generation, or in a reverse order. For example, a disease graph can be generated based on medical literature, and can then be updated, or be enriched with medical data pertaining to patients. Generating a basic graph based on medical literature, and then enriching it with medical data based on EHRs, may be beneficial in rare diseases databases, where due to the problem of a low quantity of data in rare diseases, there might not be enough training data to use regular machine learning tools to generate the graph from scratch.
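One simple way to picture the updating of a transition probability based on newly acquired EHRs is to keep observation counts per edge and to recompute the probability as a relative frequency. This estimator, the `EdgeCounts` class and the counts used below are assumptions made for this sketch; the disclosure does not prescribe a particular update rule.

```python
from collections import Counter


class EdgeCounts:
    """Observation counts for the edges leaving one node."""

    def __init__(self):
        self.counts = Counter()

    def observe(self, target: str, n: int = 1) -> None:
        """Record n observed transitions to the target node."""
        self.counts[target] += n

    def probability(self, target: str) -> float:
        """Relative frequency of transitions to the target node."""
        total = sum(self.counts.values())
        return self.counts[target] / total if total else 0.0


stage_1 = EdgeCounts()
stage_1.observe("stage 2", 3)             # hypothetical counts from initial EHRs
stage_1.observe("stage 3", 1)
before = stage_1.probability("stage 2")   # 3 of 4 transitions
stage_1.observe("stage 3", 4)             # hypothetical counts from new EHRs
after = stage_1.probability("stage 2")    # 3 of 8 transitions
```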

Reference is made to Fig. 5 illustrating a generalized flow chart 500 of a method for retrieving data from a diseases database such as diseases database 132, according to certain embodiments of the presently disclosed subject matter. The following flowchart operations can be executed by elements of system 100 and the PMC 110 described in Fig. 1. However, this is by no means binding, and the operations can be executed by elements other than those described herein.

In some cases, query module 126 can access a diseases database (block 510), such as diseases database 132 stored in memory 130. The diseases database 132 comprises a plurality of disease graphs, wherein each disease graph pertains to a disease. Each disease graph maps the disease to the stages of development of the disease and clinical findings pertaining to the disease. Query module 126 can further search the diseases database 132 to identify a disease and a stage of development of the disease that correspond to medical data of a patient (block 520). In some examples, the medical data can be obtained from an external EHR DB 150 storing medical data of a plurality of patients. The medical data can also be obtained by receiving medical data inputted by a user using workstation 160, e.g., through communication interface 140. The physician can input medical data pertaining to a patient. The medical data used by query module 126 to search the diseases database 132 pertains to a patient and may include multiple clinical findings and related timing information. In addition, the user, such as the physician, can input additional medical data to search the diseases database 132. For example, the physician can add additional symptoms which he believes to be relevant. Although the patient did not experience these symptoms, the physician noticed that the symptoms experienced by the patient are usually accompanied by other additional symptoms, and he would like to search the system 100 with the additional symptoms.
As illustrated for example in Fig. 3, each disease graph in diseases database 132 may have a tree structure and may comprise a plurality of connected nodes of the stages of development and the clinical findings. Nodes in the disease graph may be connected by edges, where one or more edges may be associated with respective transition probabilities indicative of a probability to transfer between nodes. The transition probabilities can be used when searching the graph to identify a disease and disease stage. For example, calculation of the probabilities during the search can be done in the following manner: the clinical findings from the EHR can be taken with a value of "1", multiplied by the probabilities associated with the edges leading to the features, and then again, in a similar manner, by the probabilities associated with the edges leading to the disease stages. In some examples, in order to identify a disease and a disease stage, query module 126 can search clinical findings nodes of various disease graphs included in diseases database 132 to identify one or more nodes matching the clinical findings included in the medical data of the patient (block 530). For example, the clinical findings in the medical data that have timing information that is most recent, e.g., that pertains to the oldest age of the patient, can first be searched in the diseases database 132 for a match. In some examples, a clinical finding node in the graph matches the clinical finding of a patient if it is identical. If a matching clinical finding node is identified, query module 126 can proceed to navigate the next levels of nodes in the disease graphs based on transition probabilities from the matching node to other nodes, until reaching a disease node, to identify the at least one associated disease and the respective stage of progression (block 540).
For example, query module 126 can navigate to the next level of feature nodes 330 to identify matching feature nodes, e.g., feature nodes that are connected to the matching clinical findings nodes. Then, from the matching feature nodes, query module 126 can navigate to the next level to identify matching stage nodes 320, and then to a matching disease node. In some examples, navigating the graph is done based on the highest aggregated transition probabilities of the edges. In such examples, if there are several nodes in the next level that are connected to nodes at the current level, then an aggregation of the transition probabilities on the edges connecting the nodes in the current level to the nodes in the next level is calculated. Navigation proceeds to the node having the highest aggregated transition probability. With reference to Fig. 3, assume for example that three clinical findings nodes, clinical finding 2, clinical finding 3 and clinical finding 4, are identified as matching the three clinical findings of the patient. Query module 126 can navigate to the feature node 330 connected to the three matching nodes, namely feature 3. In case more than one feature node is connected to all three matching nodes, an aggregation of the transition probabilities on the edges connecting the clinical findings nodes to the feature nodes 330 is calculated, and navigation proceeds to the feature node that has the highest aggregated transition probability. Alternatively, or additionally, the combined probabilities of all the features can be used to calculate the disease stage probability, and the disease stage with the highest probability is then selected.

Navigation can then proceed from the feature node having the highest probability to the stage nodes 320 connected to that feature node and, in a similar manner, from the selected stage node 320 to a disease node 310.
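The level-by-level navigation described above can be sketched as a greedy walk that, at each level, moves to the node with the highest aggregated transition probability. The sketch below is illustrative only: the edge probabilities are invented, the aggregation is assumed to be a plain sum, and the names `next_node`, `navigate` and the node labels are hypothetical.

```python
def next_node(current_nodes, edges):
    """Return the next-level node with the highest aggregated probability.

    Aggregation is assumed here to be the sum of the probabilities on all
    edges leading from the current nodes to a given next-level node.
    """
    totals = {}
    for (src, dst), prob in edges.items():
        if src in current_nodes:
            totals[dst] = totals.get(dst, 0.0) + prob
    if not totals:
        return None
    return max(totals, key=totals.get)


def navigate(matching_findings, levels):
    """Walk findings -> feature -> stage -> disease, one level at a time."""
    path = []
    current = set(matching_findings)
    for edges in levels:
        node = next_node(current, edges)
        if node is None:
            break  # no connected node at the next level
        path.append(node)
        current = {node}
    return path


# Hypothetical graph loosely following the Fig. 3 example.
FINDINGS_TO_FEATURES = {
    ("clinical_finding_2", "feature_3"): 0.6,
    ("clinical_finding_3", "feature_3"): 0.7,
    ("clinical_finding_4", "feature_3"): 0.5,
    ("clinical_finding_2", "feature_2"): 0.9,
    ("clinical_finding_3", "feature_2"): 0.2,
}
FEATURES_TO_STAGES = {
    ("feature_3", "stage_1"): 0.3,
    ("feature_3", "stage_2"): 0.6,
}
STAGES_TO_DISEASES = {
    ("stage_1", "disease_x"): 1.0,
    ("stage_2", "disease_x"): 1.0,
}
LEVELS = [FINDINGS_TO_FEATURES, FEATURES_TO_STAGES, STAGES_TO_DISEASES]
```

With these invented values, `navigate(["clinical_finding_2", "clinical_finding_3", "clinical_finding_4"], LEVELS)` selects feature 3 over feature 2 because its aggregated probability is higher, and then continues to a stage node and a disease node.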

In some examples, the history of the patient can be used to navigate the diseases database 132. In the above example, a disease and a disease stage were found based on the three clinical findings of the patient. The three clinical findings may be associated with the most recent timing information, such as that which pertains to the oldest age of the patient. Assuming the disease stage that was found is an advanced stage, e.g., stage 2 of the disease, it is now possible to search the disease graph in the opposite direction, based on earlier stages of the disease. The purpose is to identify the clinical findings that are connected to stage 1 of the disease. Once stage 1 clinical findings are identified, the historical medical data of the patient, i.e., clinical findings having an earlier date than those three clinical findings, can be searched for a match to these stage 1 clinical findings. If a match is found, then the likelihood that the patient has the identified disease increases.
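The backward search over earlier stages can be sketched as follows. This is a hedged illustration: the stage ordering, the findings attached to each stage, the function name `earlier_stage_support` and the use of a cutoff date to separate historical findings from the most recent ones are all assumptions of the example.

```python
from datetime import date

# Hypothetical mapping of stages to the clinical findings connected to them.
STAGE_FINDINGS = {
    "stage_1": {"fatigue", "joint pain"},
    "stage_2": {"rash", "fever", "swelling"},
}
STAGE_ORDER = ["stage_1", "stage_2"]  # earliest stage first


def earlier_stage_support(identified_stage, history, cutoff):
    """Return historical findings that match stages earlier than the one found.

    history: iterable of (finding, date) pairs from the patient record.
    cutoff:  findings dated before this are treated as historical.
    """
    index = STAGE_ORDER.index(identified_stage)
    expected = set()
    for stage in STAGE_ORDER[:index]:
        expected |= STAGE_FINDINGS[stage]
    observed = {finding for finding, when in history if when < cutoff}
    return expected & observed


history = [
    ("fatigue", date(2019, 3, 1)),
    ("rash", date(2021, 5, 1)),
    ("joint pain", date(2021, 6, 1)),
]
support = earlier_stage_support("stage_2", history, cutoff=date(2021, 1, 1))
```

A non-empty result indicates that the record contains earlier findings consistent with stage 1, which, per the description, increases the likelihood of the identified disease.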

In some examples, during navigation of the diseases database 132, suggestions of additional clinical findings can be provided to the physician. For example, assume that a certain stage of development of a disease was identified based on certain clinical findings that were inputted by the physician. Any clinical findings that are connected to this stage of development in the graph, but are missing from the clinical findings that were inputted by the physician, can be provided to the physician, so that he can question the patient on whether he has experienced these clinical findings as well. An indication from the patient that he has experienced these clinical findings can increase the likelihood that the patient is suffering from the identified disease.
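The suggestion step amounts to a set difference between the findings connected to the identified stage and the findings already entered. A minimal sketch, assuming a hypothetical stage-to-findings mapping and a hypothetical function name `suggest_findings`:

```python
def suggest_findings(stage, entered_findings, stage_findings):
    """List findings connected to the stage that were not yet entered."""
    connected = stage_findings.get(stage, set())
    return sorted(connected - set(entered_findings))


# Hypothetical findings connected to one stage node in a disease graph.
EXAMPLE_STAGE_FINDINGS = {"stage_2": {"rash", "fever", "swelling"}}

suggestions = suggest_findings("stage_2", ["rash"], EXAMPLE_STAGE_FINDINGS)
```

The returned list could be displayed to the physician as follow-up questions; sorting is used here only to make the output order deterministic.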

The identified disease and stages of the disease, or additional information obtained based on the navigation, such as additional proposed clinical findings, can be provided to the user, e.g., to be displayed on a display included in workstation 160. Navigating the disease graph based on a set of clinical findings sharing the same timing information is advantageous, as it enables a diagnosis of a probable disease and a disease stage to be provided. A second search of different clinical findings, whose timing information differs from that of the first clinical findings, assists in providing assurance that the identified disease is indeed the correct disease.

In some examples, at least one disease graph can be updated based on newly acquired medical data, including the medical data pertaining to the patient, as used to search and retrieve data from diseases database 132. For example, a disease graph can be updated by updating one or more transition probabilities, disease stage probabilities, probability of occurrence of the disease, or the disease average duration.
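The description does not specify how a transition probability is recalculated when new medical data arrives. One simple possibility, shown here purely as an assumption, is to keep observation counts per edge and re-estimate each probability as a relative frequency; the class name `EdgeCounts` and its methods are hypothetical.

```python
from collections import Counter


class EdgeCounts:
    """Keeps per-edge counts so transition probabilities can be re-estimated."""

    def __init__(self):
        self.pair_counts = Counter()    # (source, target) -> observations
        self.source_counts = Counter()  # source -> total observations

    def observe(self, source, target):
        """Record one newly observed transition from source to target."""
        self.pair_counts[(source, target)] += 1
        self.source_counts[source] += 1

    def probability(self, source, target):
        """Relative frequency of the transition; 0.0 if source was never seen."""
        total = self.source_counts[source]
        if total == 0:
            return 0.0
        return self.pair_counts[(source, target)] / total
```

Under this assumption, each new patient record used in a search would call `observe` for the edges it traversed, and subsequent searches would read the refreshed values through `probability`.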

It will also be understood that the system according to the invention may be, at least partly, implemented on a suitably programmed computer. Likewise, the invention contemplates a computer program being readable by a computer for executing the method of the invention. The invention further contemplates a non-transitory computer-readable memory tangibly embodying a program of instructions executable by the computer for executing the method of the invention.

It is noted that the teachings of the presently disclosed subject matter are not bound by the flow charts illustrated in Figs. 2 and 5, and the illustrated operations can occur in a different order to that illustrated. For example, operations 211 and 212 shown in succession can be executed substantially concurrently, or in the reverse order. In addition, operation 223 can be an integral part of operation 224.

It is to be understood that the invention is not limited in its application to the details set forth in the description contained herein or illustrated in the drawings. The invention is capable of other embodiments and of being practiced and carried out in various ways. Hence, it is to be understood that the phraseology and terminology employed herein are for the purpose of description and should not be regarded as limiting. As such, those skilled in the art will appreciate that the conception upon which this disclosure is based may readily be utilized as a basis for designing other structures, methods, and systems for carrying out the several purposes of the presently disclosed subject matter.

Those skilled in the art will readily appreciate that various modifications and changes can be applied to the embodiments of the invention as hereinbefore described without departing from its scope, defined in and by the appended claims.

Claims

CLAIMS:

1. A computer-implemented method for generating a diseases database comprising: obtaining medical data pertaining to a plurality of patients having a disease common to all the patients, wherein the medical data includes, for each of the patients, multiple clinical findings each having associated timing information; processing the medical data to correlate the disease with one or more clinical findings common to a majority of the patients, and to identify stages of development of the disease, based on the respective clinical findings that are common to the majority of the patients and their associated timing information; and maintaining a database mapping the disease to its respective stages of development and clinical findings.

2. The method of claim 1, wherein the medical data comprises a plurality of electronic health records (EHRs), each EHR pertaining to a patient.

3. The method of claim 1 or 2, wherein obtaining the medical data further comprises: deidentifying and/or cleansing the medical data.

4. The method of any one of the preceding claims, wherein each clinical finding has a form of structured data, unstructured data or a combination thereof.

5. The method of claim 4, wherein processing the medical data further comprises: for each clinical finding, identifying at least one predefined medical category that matches each clinical finding and associating the timing information to the respective at least one predefined medical category; and processing the at least one predefined medical category with the respective timing information to identify the stages of development of the disease.

6. The method of claim 5, wherein processing each clinical finding having the form of unstructured data to identify at least one predefined medical category is performed using a machine learning model.

7. The method of any one of the preceding claims, wherein the timing information pertains to the patient's clinical findings.

8. The method of any one of the preceding claims, wherein the medical data includes, for a given patient of the patients, first and second clinical findings, wherein the timing information of the second clinical finding relates to the timing information of the first clinical finding.

9. The method of any one of the preceding claims, wherein processing the medical data comprises: clustering the clinical findings into multiple latent features clusters; and processing the clusters to identify the stages of development of the disease.

10. The method of claim 9, wherein the method further comprises, prior to clustering the clinical findings: for each patient, generating a respective timeline of the clinical findings based on the associated timing information; and wherein the clinical findings are clustered based on the timelines of the patients.

11. The method of claim 9 or 10, wherein processing the clusters is performed using an AI model.

12. The method of claim 10, wherein processing the clusters comprises determining for each stage of development one or more of: respective probabilities to reach the stage from other stages, a respective weight of latent features pertaining to the stage and an average duration of the stage.

13. The method of any one of the preceding claims, further comprising: calculating, for at least one disease, a probability of occurrence and/or a disease average duration.

14. The method of claim 1, wherein the database is organized as a plurality of disease graphs, each disease graph having a tree structure and comprising a plurality of connected nodes of stages of development and the clinical findings.

15. The method of claim 14, wherein at least two disease graphs share at least one node of clinical findings.

16. The method of claim 14 or 15, wherein an edge between first and second nodes is associated with a transition probability indicative of a probability to transfer from the first node into the second node, and wherein the method further comprises associating at least one transition probability to at least one edge.

17. The method of any one of the preceding claims, further comprising: obtaining medical literature, wherein the medical literature includes multiple clinical findings pertaining to the disease; and processing the medical data with the medical literature to correlate the disease with the one or more clinical findings and identify the stages of development.

18. A computer system comprising a processing circuitry that comprises at least one processor and computer memory, the processing circuitry being configured to execute the method as described in any one of claims 1 to 17.

19. A non-transitory computer readable storage medium tangibly embodying a program of instructions that, when executed by a computer, cause the computer to perform a method as described in any one of claims 1 to 17.

20. A disease data storage and retrieval system for a computer having a processing and memory circuit (PMC), comprising: a processor of the PMC for configuring the memory of the PMC to store a diseases database comprising a plurality of disease graphs; wherein each disease graph pertains to a disease and maps the disease to its respective stages of development and clinical findings.

21. The system of claim 20, wherein the processor is configured to generate the stages of development based on timing information associated with the clinical findings.

22. The system of claim 20 or 21, wherein the processor is configured to generate each disease graph by: obtaining medical data pertaining to a plurality of patients having a disease common to all the patients, wherein the medical data includes, for each of the patients, multiple clinical findings each having associated timing information; processing the medical data to correlate the disease with one or more clinical findings common to a majority of the patients, and to identify stages of development of the disease, based on the respective clinical findings that are common to the majority of the patients and their associated timing information; and generating the disease graph by mapping the disease to its respective stages of development and clinical findings.

23. The system of claim 22, wherein the medical data comprises a plurality of electronic health records (EHRs), each EHR pertaining to a patient.

24. The system of claim 22 or 23, wherein the medical data is deidentified and/or cleansed.

25. The system of any one of claims 22 to 24, wherein each clinical finding has a form of structured data, unstructured data, or a combination thereof.

26. The system of any one of claims 22 to 25, wherein the timing information pertains to the patient's clinical findings.

27. The system of any one of claims 22 to 26, wherein each disease graph is further generated by: clustering the clinical findings into multiple latent features clusters; and processing the clusters to identify the stages of development of the disease.

28. The system of any one of claims 22 to 27, wherein at least one disease graph of the plurality of disease graphs is associated with a probability of occurrence and/or a disease average duration.

29. The system of any one of claims 22 to 28, wherein at least one of the disease graphs has a tree structure and comprises a plurality of connected nodes of the stages of development and the clinical findings.

30. The system of claim 29, wherein at least two of the disease graphs share at least one node of clinical findings.

31. The system of claim 29 or 30, wherein an edge between first and second nodes is associated with a transition probability indicative of a probability to transfer from the first node into the second node, and wherein at least one edge in a disease tree is associated with a transition probability.

32. The system of any one of claims 29 to 31, wherein at least one disease graph includes at least one node of latent features cluster generated by clustering the clinical findings pertaining to the disease.

33. The system of claim 31 or 32, wherein at least one node of latent features cluster is connected by an edge to at least one node of stage of the disease, and wherein the at least one node of stage of the disease is associated with a respective weight of the latent features pertaining to the stage.

34. The system of any one of claims 29 to 33, wherein at least one node in the graph is associated with medical data pertaining to a patient, based on which the node was generated.

35. A computer-implemented method for retrieving data from a diseases database comprising: accessing a diseases database comprising a plurality of disease graphs, wherein each disease graph pertains to a disease and maps the disease to its respective stages of development and clinical findings; and searching the diseases database to identify a disease and at least one respective stage of development of the disease corresponding to medical data pertaining to a patient, wherein the medical data includes multiple clinical findings and related timing information.

36. The method of claim 35, wherein at least one of the disease graphs has a tree structure and comprises a plurality of connected nodes of stages of development and the clinical findings, wherein an edge between connected first and second nodes is associated with a respective transition probability indicative of a probability to transfer from the first node into the second node, and wherein searching the diseases database comprises: searching the clinical findings nodes to identify one or more nodes matching the clinical findings included in the medical data; and, based on transition probabilities from the one or more matching nodes to other nodes, navigating the disease graphs to identify the at least one associated disease and the respective stage of development.

37. The method of claim 36, further comprising: calculating at least one disease probability for the at least one identified disease.

38. The method of claim 37, further comprising: identifying the disease having a maximum disease probability.

39. The method of any one of claims 36 to 38, further comprising: in response to identifying the stage of development, searching the graph to identify second clinical findings nodes that are associated with an earlier stage of development of the identified disease; and determining one or more matching clinical findings in the medical data to the second clinical findings.

40. The method of any one of claims 35 to 39, further comprising: updating at least one disease graph based on newly acquired medical data.

41. A computer system comprising a processing circuitry that comprises at least one processor and computer memory, the processing circuitry being configured to execute the method as described in any one of claims 35 to 40.

42. A non-transitory computer readable storage medium tangibly embodying a program of instructions that, when executed by a computer, cause the computer to perform a method as described in any one of claims 35 to 40.