CN111370131B

CN111370131B - Method and system for screening biomarkers via disease trajectories

Info

Publication number: CN111370131B
Application number: CN201811602403.8A
Authority: CN
Inventors: 陈治平; 白敦文; 洪健中; 蔡元皓
Original assignee: National Taipei University of Technology
Current assignee: National Taipei University of Technology
Priority date: 2018-12-26
Filing date: 2018-12-26
Publication date: 2023-06-09
Anticipated expiration: 2038-12-26
Also published as: CN111370131A

Abstract

The present invention relates to a biomarker screening system, and a method thereof, for screening biomarkers via the disease trajectories of a target disease. The system comprises a medical database containing a plurality of medical information; a disease window including at least one disease information; an operation module; and a comparison module. The operation module executes a method comprising the following steps: (1) obtaining at least one pre-disease of the target disease from the medical information and the corresponding at least one disease information from the disease window to provide a pre-disease information; and (2) performing an exploration process on the pre-disease information of step (1) to generate a disease trace result comprising a plurality of associated diseases associated with the target disease. The comparison module executes a step of selecting at least one biomarker from a comorbid gene group of the target disease and the associated diseases according to the disease trace result.

Description

Method and system for screening biomarkers via disease trajectories

Technical Field

The present invention relates to a method for screening biomarkers for a target disease, and more particularly to a method for screening biomarkers for a target disease using the disease trajectories (also referred to herein as the "disease trace") of the target disease.

Background

With the arrival of the information age, the huge amounts of data (also called big data) collected in various types of medical records are becoming a focus of the related technical fields. Following the recent passage of the 21st Century Cures Act, the U.S. Food and Drug Administration (FDA) has identified "precision medicine" as a future priority. The specific concept is to apply analysis techniques to data collected in the real world (e.g., electronic medical records and insurance records), also called "real world data" (RWD), in order to construct valid "real world evidence" (RWE). This concept goes beyond the design thinking of traditional clinical drug development: it is less constrained by the sample group, and it is expected to reduce the time and cost required to develop new medical products.

Current methods for screening or searching for biomarkers still rely mainly on verifying, one large-scale clinical study at a time, the correlation between a specific disease and the physiological indices of the individual patients in a sample group. However, such existing screening methods are often limited by the number of samples and are insufficiently efficient.

In view of the foregoing, there is a need in the art for a biomarker screening method based on a new approach, one that effectively screens biomarkers highly correlated with a specific disease population so that the specific disease can be predicted or diagnosed in that population.

Disclosure of Invention

In view of the foregoing, it is an object of the present invention to provide a method for effectively screening biomarkers of a specific target disease from a huge amount of data, thereby strengthening the analysis of such data, obtaining accurate biomarkers from the analysis results, and in turn improving the quality of medical care.

One aspect of the present disclosure relates to a method for screening biomarkers for a target disease. In certain embodiments, the method comprises: (a) providing a medical database containing medical information of a plurality of individuals; (b) providing a disease window comprising at least one disease information; (c) obtaining, from the medical database of step (a), at least one pre-disease of the individuals suffering from the target disease, and obtaining the at least one disease information of the target disease and of the at least one pre-disease based on the disease window of step (b) to form a pre-disease information, wherein the at least one pre-disease occurs within a predetermined time before the target disease occurs; (d) performing a sequential pattern exploration (sequential pattern mining) on the pre-disease information of step (c) to generate a disease trace result, wherein the disease trace result comprises a plurality of associated diseases associated with the target disease; and (e) selecting at least one biomarker from a comorbid gene group of the target disease and the plurality of associated diseases according to the disease trace result of step (d), wherein the at least one biomarker can be used to detect the target disease.

According to certain embodiments of the present disclosure, step (c) of the aforementioned method comprises: (c-1) recording the occurrence time of the at least one pre-disease in each individual to form a time series data table; and (c-2) sorting the at least one pre-disease according to the time sequence data table of step (c-1) to generate at least one disease time sequence.

According to certain embodiments of the present disclosure, step (d) comprises: (d-1) performing the sequential pattern exploration on the at least one disease timing sequence of step (c-2) to generate the disease trace result; and (d-2) calculating an average time interval between any two consecutive pre-diseases of the at least one pre-disease according to the time sequence data table before outputting the disease trace result.
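As an illustration of step (d-2), the following minimal Python sketch computes the average time interval between consecutive diseases across individuals. The function name `average_intervals`, the `(code, day)` tuple layout, the relative-day time unit and the sample values are all assumptions made for illustration; the disclosure does not prescribe a data layout.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

TimedSequence = List[Tuple[str, int]]  # [(disease code, relative day), ...] sorted by day


def average_intervals(timed_sequences: List[TimedSequence]) -> Dict[Tuple[str, str], float]:
    """Average the gap, in days, between each pair of consecutive diseases."""
    gaps: Dict[Tuple[str, str], List[int]] = defaultdict(list)
    for seq in timed_sequences:
        # Pair every disease with the one that immediately follows it.
        for (first, day_first), (second, day_second) in zip(seq, seq[1:]):
            gaps[(first, second)].append(day_second - day_first)
    return {pair: mean(days) for pair, days in gaps.items()}


# Hypothetical time sequence data for two individuals (codes are placeholders).
timed = [
    [("401", 0), ("250", 100), ("410", 160)],
    [("401", 0), ("250", 200), ("410", 240)],
]
intervals = average_intervals(timed)
```

Here `intervals[("401", "250")]` is 150 days and `intervals[("250", "410")]` is 50 days, which is the kind of per-segment value the disease trace result could carry.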

According to certain embodiments of the present disclosure, the sequential pattern exploration process in step (d) is achieved by performing the following steps: (i) excluding the disease timing sequences whose tail (suffix) is not the target disease; and (ii) sequentially searching for a target subsequence in the disease timing sequences remaining after the exclusion of step (i), wherein, when the target subsequence is the tail of the previous target subsequence, the target subsequence is searched for only in the previous target subsequences whose tail contains the head (prefix) of the target subsequence.

In some optional embodiments, the method further comprises, between step (d) and step (e), providing an odds ratio threshold so that only the associated diseases in the disease trajectory result whose odds ratio is greater than the threshold are presented. Preferably, the odds ratio threshold is 4.

According to certain embodiments of the present disclosure, the target disease is selected from the group consisting of cardiovascular disease, premature labor, endocrine related disease, metabolic disease, dermatological disease, and respiratory disease.

According to certain embodiments of the present disclosure, the biomarker is selected from the group consisting of a nucleic acid, an amino acid, a peptide, a protein, a monosaccharide, a disaccharide, a polysaccharide, a glycoprotein, and combinations thereof.

Another aspect of the present disclosure relates to a biomarker screening system comprising a medical database, a disease window, an operation module, and a comparison module configured to implement the above method.

Specifically, the medical database contains a plurality of medical information. The disease window includes at least one disease information. The operation module is programmed with instructions to execute a method comprising the following steps: (1) providing a pre-disease information according to at least one pre-disease of a target disease obtained from the plurality of medical information, and according to the at least one disease information corresponding to the target disease and the at least one pre-disease obtained from the disease window, wherein the at least one pre-disease occurs within a predetermined time before the target disease occurs; and (2) performing a sequential pattern exploration process on the pre-disease information of step (1) to generate a disease trace result, wherein the disease trace result comprises a plurality of associated diseases associated with the target disease. The comparison module is programmed with instructions to select at least one biomarker from a comorbid gene group of the target disease and the plurality of associated diseases according to the disease trace result. One of ordinary skill in the art or a clinician may then detect the target disease in one or more individuals in need thereof by using the at least one biomarker output by the comparison module.

In certain embodiments of the present disclosure, the disease trace results comprise a plurality of representative sequences, each of the representative sequences having at least one segment between any two consecutive associated diseases.

According to certain embodiments of the present disclosure, the disease trace results comprise the average time interval between any two consecutive associated diseases, and the odds ratio of each such representative sequence.

Through the above configuration, biomarkers can be screened rapidly, accurately and efficiently from huge amounts of data, so that high-performance gene detection kits can be designed and developed. Such detection kits can be used to predict the risk index of the specific disease, thereby achieving precision medicine and precision prevention for individuals.

The basic spirit and other objects of the present invention, as well as the means and embodiments employed by the present invention will be readily apparent to those skilled in the art from consideration of the following description.

Drawings

The foregoing and other objects, features, advantages and embodiments of the invention will be apparent from the following description taken in conjunction with the accompanying drawings in which:

Fig. 1 schematically depicts a biomarker screening system S according to an embodiment of the present disclosure.

Fig. 2 is a flow chart illustrating a method 200 according to an embodiment of the present disclosure.

Fig. 3 is a network diagram illustrating disease trace results at an exemplary odds ratio greater than 5, in accordance with an embodiment of the present disclosure.

Fig. 4 is a network diagram illustrating disease trajectory results at an exemplary odds ratio greater than 6, in accordance with an embodiment of the present disclosure.

In accordance with conventional practice, the various components and features in the drawings are not drawn to scale, so as to best illustrate the specific features and components relevant to the invention. In addition, like components/parts are referred to by the same or similar reference numerals among the different drawings.

Reference numerals:

System S, medical database 10, disease window 20, operation module 30, comparison module 40, method 200, steps S210-S260.

Detailed Description

To make the description of the present disclosure more detailed and complete, the following provides an illustrative description of embodiments and specific examples of the invention with reference to the appended drawings; this is not the only form in which the invention may be practiced or implemented. The description covers the features of the embodiments as well as the method steps and sequences for constructing and operating them. However, other embodiments may also be used to achieve the same or equivalent functions and sequences of steps.

For convenience of description, specific terms employed in the specification, examples and appended claims are collected and defined here. Unless defined otherwise herein, the scientific and technical terms used herein have the meanings commonly understood and used by one of ordinary skill in the art. Furthermore, as used in this specification, a singular noun encompasses the plural form of that noun unless this conflicts with the context, and a plural noun likewise encompasses the singular form. In particular, as used herein and in the appended claims, the singular forms "a," "an," and "the" include plural referents unless the context clearly dictates otherwise. Furthermore, in the present specification and claims, "at least one" is meant to include one, two, three or more.

I. Definitions

The term "disease trace" (disease trajectory) as used herein refers to the course and progression of all diseases occurring in a specific subject (a single individual or a group) within a predetermined period of time, from which the progression of each disease and the average duration of each stage can be obtained. A disease trace is typically presented graphically.

As used in this disclosure, a "medical database" refers to any database or sample population that contains medical information. Medical information includes, but is not limited to, medical records, drug administration records, disease history, diagnostic content, biochemical data for medical procedures and health examinations, and the like. In particular, a database storing medical information of a large number of individuals, regardless of the source or the manner in which it is established, is included in the definition of medical database as used herein. The medical database may be a single database or may be an integrated database across units. Medical databases in the context of the present invention are particularly databases containing vast amounts of medical information. For example, the medical database includes, but is not limited to, electronic medical records of all patients of a public or private medical institution from which personal data has been removed, customer medical insurance information of a private insurance company (from which personal information has been removed), a national health insurance database provided by a government entity, and sample classification files of the foregoing national health insurance database. In general, the information in the general medical database can be used as a parent population, or a collection with specific information can be obtained from the medical database as a parent population for subsequent analysis or investigation for different purposes.

In the "disease window" of the present disclosure, each disease may have corresponding disease information. The disease information is a representation in a specific language format obtained by encoding an actual disease. Specifically, in the disease window, each disease has a code that represents the disease, and this code constitutes the "disease information" of the present invention. The encoding method is not particularly limited, and actual diseases may be encoded according to conventional logic.

As used herein, "pre-disease information" refers to the integrated information obtained by converting the pre-diseases into a specific language format. In the context of the present invention, the pre-disease information may comprise a classification code for each specific disease (pre-disease), such as the aforementioned disease information, and may additionally or alternatively comprise timing information for the occurrence of these pre-diseases.

As used herein, a "disease timing sequence" refers to an ordered sequence, having a chronological relationship, obtained after all diseases occurring within a predetermined time period have been converted into the format language. The sequence is composed of at least one, and preferably a plurality of, events/elements. Each sequence contains a series of event/element/item combinations. A "subsequence" refers to a single event/element/item, or a subset of more than one event, element, or item, of the original disease timing sequence. In the context of the present invention, an "event", "element" or "item" means in particular a specific illness or visit record retrieved from a medical database. In a "disease timing sequence" or "subsequence", the event/element/item that is first in the sequence is referred to as the "head" (prefix), and the event/element/item that is last in the sequence is referred to as the "tail" (suffix), regardless of the length of the sequence.
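The terms defined above can be made concrete with a short Python sketch. The helper names `head`, `tail` and `is_subsequence` are hypothetical, the sample codes are placeholders, and the sketch assumes that a subsequence keeps the original order while allowing gaps, which is one reasonable reading of "subset" in this definition.

```python
from typing import Sequence


def head(seq: Sequence[str]) -> str:
    """Return the first item (head/prefix) of a sequence."""
    return seq[0]


def tail(seq: Sequence[str]) -> str:
    """Return the last item (tail/suffix) of a sequence."""
    return seq[-1]


def is_subsequence(candidate: Sequence[str], sequence: Sequence[str]) -> bool:
    """True if `candidate` appears in `sequence` in the same order (gaps allowed)."""
    remaining = iter(sequence)
    # `item in remaining` advances the iterator, so order is enforced.
    return all(item in remaining for item in candidate)


# Hypothetical disease timing sequence made of three-digit classification codes.
timing_sequence = ["272", "401", "250", "410"]
```

With this sequence, `head(timing_sequence)` is `"272"`, `tail(timing_sequence)` is `"410"`, and `["401", "410"]` is a subsequence while `["410", "401"]` is not.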

As used herein, a "comorbid gene group" refers to the intersection of two or more disease-associated gene groups. In particular, comorbidity refers to the occurrence of one or more additional diseases or conditions, either simultaneously or sequentially, in the presence of a primary diagnosis (disease). The lesions, physiological symptoms and/or physiological changes of these diseases may be positively correlated, negatively correlated or uncorrelated with one another. In the present disclosure, the correlation between the diseases (i.e., the disease trace) is established by data mining techniques. In the present invention, the intersection of the gene groups associated with each disease in a disease trace that has a significant degree of association and a temporal order is called a comorbid gene group.
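Because the comorbid gene group is defined as an intersection, it can be sketched as a plain set operation. In the minimal Python example below, the function name `comorbid_gene_group`, the disease labels and the gene identifiers (`GENE_A` and so on) are placeholders invented for illustration; they are not real disease-gene associations and do not come from the disclosure.

```python
from functools import reduce
from typing import Dict, Iterable, Set


def comorbid_gene_group(gene_sets: Dict[str, Set[str]], diseases: Iterable[str]) -> Set[str]:
    """Intersect the associated gene sets of every disease on one disease trace."""
    sets = [gene_sets[disease] for disease in diseases]
    if not sets:
        return set()
    return reduce(set.intersection, sets)


# Placeholder gene groups for two associated diseases and the target disease.
disease_genes = {
    "disease_X": {"GENE_A", "GENE_B", "GENE_C"},
    "disease_Y": {"GENE_B", "GENE_C", "GENE_D"},
    "target": {"GENE_B", "GENE_C", "GENE_E"},
}
comorbid = comorbid_gene_group(disease_genes, ["disease_X", "disease_Y", "target"])
```

The result here is `{"GENE_B", "GENE_C"}`, the genes shared by all three diseases; candidate biomarkers would then be chosen from this set.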

As used herein, "biomarker" refers specifically to an objective indicator that a clinician or researcher can observe from outside the patient. In general, the corresponding measurement method should be accurate and repeatable. According to the conventional definition in the art, "any chemical substance, biological structure or biological process that can be measured and used to predict a likely outcome or a likely disease" is referred to as a biomarker. Biomarkers should be able to reflect the progress or presence of a particular disease, a particular physiological condition, a particular tissue, or a particular cell (e.g., cancer cell). In the context of the present invention, substances known to those skilled in the art as biomarkers include, but are not limited to, nucleic acids, amino acids, peptides, proteins, monosaccharides, disaccharides, polysaccharides, glycoproteins, and combinations thereof.

Embodiment of the invention

The digital medical record information now available and the falling cost of personal gene sequencing enable biologists and medical researchers to search more accurately for gene biomarkers related to various diseases. However, because conventional data exploration techniques cannot be applied to the huge accumulated volume of medical record data, the present invention aims to develop a data exploration technique applicable to huge amounts of medical data, and to accurately screen out the biomarkers of the pre-disease traces and comorbidities of a specific disease, so as to solve the above-mentioned problems. In addition, the invention also aims to develop a biomarker screening system to meet the needs of precision medicine.

Specifically, the data exploration technique used in the present disclosure is designed for the requirements of a huge database; clustering and timing concepts are introduced into the sequential pattern exploration technique, so that comprehensive analysis results are generated regarding the occurrence times of the associated diseases and their disease traces.

Biomarker screening system and screening method thereof

Referring to Figs. 1 and 2 together, an aspect of the present disclosure relates to a system S for screening biomarkers for a target disease and a screening method thereof. Fig. 1 schematically depicts a biomarker screening system S according to an embodiment of the present disclosure; Fig. 2 is a flow chart showing the screening of biomarkers using the system S. The biomarker screening system S of the present invention comprises at least a medical database 10, a disease window 20, an operation module 30 and a comparison module 40.

The medical database 10 of the present disclosure contains medical information of a plurality of individuals. Specifically, the medical database may be a single database or an integrated database across units, and it includes medical information of each individual such as medical records, drug administration records, disease history, diagnostic content, and biochemical data from medical procedures and health examinations. The medical database 10 is primarily intended to provide the vast amount of medical information that serves as real world data and as the basis for the subsequent analysis. Specifically, the medical database 10 may be the electronic medical record files of a public or private medical institution, the customer medical insurance information of a private insurance company, a national health insurance database of a government entity, or the like. According to a specific embodiment of the present invention, the medical database 10 stores a huge amount of actual medical record data from which personal data and the specific dates of individual visits have been removed, so that only the diagnostic content linked in relative chronological order and the relative time interval between any two consecutive diseases are retained.

The disease window (classification sheet) 20 of the present disclosure includes at least one disease information. Specifically, each disease has a corresponding code, or a representation in a specific format language, and these constitute the disease information. The encoding method is not particularly limited, and actual diseases may be encoded according to conventional logic. Preferably, the disease information is generated by grouping and classifying the diseases and assigning each class a corresponding code. Accordingly, the disease window 20 is a window that presents a classification of particular diseases or of all known diseases. The foregoing grouping may be performed according to common knowledge in the art; that is, diseases are classified and arranged on the basis of the etiology and clinical characteristics recognized in the art. Specifically, when diseases are classified and grouped, anatomical structure serves as the main classification axis, and each disease is assigned a classification number according to the site of occurrence, the cause, and/or the type of injury. One skilled in the art can obtain partial information about a specific disease by interpreting its classification number. According to some embodiments, the disease window 20 of the present disclosure may be a publicly available classification commonly known to those skilled in the art, or may be a classification system established by the relevant practitioner. Known and widely used classification systems include, for example, the versions of the International Classification of Diseases (ICD) published by the World Health Organization (WHO).
According to a particular embodiment, the disease window 20 of the present disclosure is a grouped classification window presented according to the classification system of the ICD, Ninth Revision, Clinical Modification (ICD-9-CM).
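A grouped classification window of this kind can be pictured as a lookup from a classification code to its group. The Python sketch below covers only four ICD-9-CM chapter ranges with shortened labels, and the names `ICD9_CHAPTERS` and `chapter_of` are hypothetical; the actual grouping used in the embodiment is not specified at this level of detail.

```python
from typing import List, Optional, Tuple

# A few ICD-9-CM chapter ranges (three-digit categories); labels are shortened
# and the list is deliberately incomplete.
ICD9_CHAPTERS: List[Tuple[int, int, str]] = [
    (240, 279, "endocrine, nutritional and metabolic diseases"),
    (390, 459, "diseases of the circulatory system"),
    (460, 519, "diseases of the respiratory system"),
    (680, 709, "diseases of the skin and subcutaneous tissue"),
]


def chapter_of(icd9_code: str) -> Optional[str]:
    """Map a numeric ICD-9-CM code such as '410.9' to its chapter label."""
    category = icd9_code.split(".")[0]
    if not category.isdigit():  # V and E codes are outside this sketch
        return None
    number = int(category)
    for low, high, label in ICD9_CHAPTERS:
        if low <= number <= high:
            return label
    return None  # chapter not listed in this partial table
```

For example, `chapter_of("410.9")` falls in the circulatory range and `chapter_of("250")` in the endocrine and metabolic range, so both a detailed code and its three-digit category resolve to the same group.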

According to some embodiments of the present disclosure, the medical database 10 and the operation module 30 are communicably connected to each other; the disease window 20 and the operation module 30 are also communicably connected to each other. In the exemplary embodiment depicted in Fig. 1, the medical database 10 and the disease window 20 may be stored in the same or different storage devices connected to the operation module 30 via a cable connection or a wireless network, or may be stored in a storage device contained in the biomarker screening system S. In this manner, the operation module 30 can receive the medical information from the medical database 10 and the specific disease information from the disease window 20 and, as programmed by instructions, perform the subsequent steps of screening biomarkers.

The operation module 30 of the present disclosure may be a computer (e.g., a desktop, notebook, or laptop computer), a handheld computing device, a mobile device, a supercomputer, a workstation, or a server, or another type of special or general purpose computing device appropriate for a given environment. The operation module 30 includes one or more processing units. In particular, the method of screening biomarkers of the present disclosure may be performed using a general or special purpose processing unit (e.g., a microprocessor, arithmetic chip, controller or control logic) programmed with instructions. The operation module 30 may also include one or more memory units, such as random access memory (RAM), read-only memory (ROM), flash memory, or other dynamic storage devices, which may be used to store information and the program instructions to be executed by the processing unit of the operation module 30. The operation module 30 may also include one or more information storage media, such as a hard disk, a flash drive, and an optical disk.

The comparison module 40 of the present invention is communicably connected to the operation module 30 and is programmed with instructions to perform the step of selecting a biomarker from a comorbid gene group according to the disease trace result obtained after the operation process.

The comparison module 40 of the present disclosure includes one or more processing units. In particular, the method of screening biomarkers from a comorbid gene group of the present disclosure may be performed using a general or special purpose processing unit (e.g., a microprocessor, arithmetic chip, controller or control logic) programmed with instructions. The comparison module 40 may also include one or more memory units, such as the aforementioned RAM, ROM, flash memory or other dynamic storage devices, which may be used to store information and the program instructions to be executed by the processing unit of the comparison module 40. The comparison module 40 may further comprise one or more information storage media, such as a hard disk, a flash drive, and an optical disk.

According to some embodiments, the biomarker screening system S of the present disclosure may be connected to a user interface (not shown) to output the data exploration and comparison results of the operation module 30 and the comparison module 40 to a desired user.

Referring to Fig. 2, because the objective of the biomarker screening method of the present invention is to screen out the biomarkers relevant to a specific target disease, a target disease is first selected when the method is performed (step S210), the aforementioned medical database 10 and disease window 20 are provided (step S220), and the disease information corresponding to the target disease is obtained from the disease window 20. The target disease may be any disease. In the context of the present invention, the target disease is selected from the group consisting of cardiovascular disease, premature labor, endocrine related disease, metabolic disease, dermatological disease and respiratory disease. Meanwhile, the medical database 10 may receive an external instruction to retrieve, from the medical database 10, at least one pre-disease of all the individuals suffering from the target disease, wherein the at least one pre-disease occurs within a predetermined time before the individuals develop the target disease (step S230). The external instruction may comprise various parameters for selecting the at least one pre-disease, such as the selection of a particular patient group (by age, gender, etc.). The external instruction may also include the setting of the aforementioned predetermined time, such as 3 months, 6 months, 9 months, 1 year, 3 years, 5 years, 10 years, or the entire time before the occurrence of the target disease. After the at least one pre-disease is selected in step S230, the operation module 30 converts the at least one pre-disease into corresponding disease information in a specific language format according to the at least one disease information provided by the disease window 20, and integrates it with the disease information of the target disease to form a pre-disease information (step S240).
Alternatively, the medical information of the plurality of individuals in the medical database 10 of the present disclosure also includes at least one disease information of the disease window 20, whereby the results of the format-language conversion (e.g., pre-disease information) can be directly obtained when the target disease is selected and at least one pre-disease is selected.
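The selection of pre-diseases in step S230 can be sketched in Python as a time-window filter over visit records. The record layout `(individual id, disease code, relative day)`, the function name `select_pre_diseases`, the use of the first occurrence of the target disease as the reference point, and the sample data are assumptions made for illustration only.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, str, int]  # (individual id, disease code, relative day of diagnosis)


def select_pre_diseases(
    records: Iterable[Record], target: str, window_days: int
) -> Dict[str, List[Tuple[str, int]]]:
    """Keep, per individual, the diseases diagnosed within `window_days` before the target."""
    by_person: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for person, code, day in records:
        by_person[person].append((code, day))

    selected: Dict[str, List[Tuple[str, int]]] = {}
    for person, visits in by_person.items():
        target_days = [day for code, day in visits if code == target]
        if not target_days:
            continue  # this individual never developed the target disease
        onset = min(target_days)  # assumption: first occurrence defines onset
        selected[person] = [
            (code, day)
            for code, day in visits
            if code != target and onset - window_days <= day < onset
        ]
    return selected


# Hypothetical de-identified records; codes and days are placeholders.
records = [
    ("p1", "401", 10), ("p1", "250", 200), ("p1", "410", 300),
    ("p2", "401", 5),
    ("p3", "272", 0), ("p3", "410", 400),
]
pre = select_pre_diseases(records, target="410", window_days=365)
```

In this example individual `p2` is dropped for never having the target disease, and the single earlier diagnosis of `p3` falls outside the one-year window, so only `p1` contributes pre-diseases.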

In particular, according to embodiments of the present disclosure, the disease window 20 contains a classification code for each disease according to the aforementioned ICD, Ninth Revision, Clinical Modification (ICD-9-CM); the pre-disease information likewise contains the classification code of each specific disease (pre-disease).

According to certain embodiments of the present disclosure, the pre-disease information may include timing information for the occurrence of these pre-diseases. Specifically, when sorting the at least one pre-disease, the operation module 30 may additionally record the time at which each pre-disease occurred in each individual, and form a time sequence data table in the converted format language (e.g., ICD-9-CM classification codes). The at least one pre-disease of each individual is then ordered according to the time sequence data table to form a disease timing sequence. It should be noted that each individual suffering from the target disease has a corresponding disease timing sequence, so the pre-disease information formed by the above steps includes at least one disease timing sequence, and preferably a plurality of disease timing sequences.
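The two operations described here, building a time sequence data table and ordering each individual's pre-diseases into a disease timing sequence, might look as follows in Python. The function names, the row layout and the choice to append the target disease as the last item of every sequence are illustrative assumptions rather than details stated in the disclosure.

```python
from typing import Dict, List, Tuple

Occurrence = Tuple[str, int]    # (classification code, relative day)
Row = Tuple[str, str, int]      # (individual id, classification code, relative day)


def build_timing_table(pre_diseases: Dict[str, List[Occurrence]]) -> List[Row]:
    """Flatten per-individual occurrences into rows sorted by individual, then time."""
    rows = [
        (person, code, day)
        for person, occurrences in pre_diseases.items()
        for code, day in occurrences
    ]
    return sorted(rows, key=lambda row: (row[0], row[2]))


def to_timing_sequences(table: List[Row], target: str) -> Dict[str, List[str]]:
    """Read the sorted table into one ordered code sequence per individual."""
    sequences: Dict[str, List[str]] = {}
    for person, code, _day in table:
        sequences.setdefault(person, []).append(code)
    # Assumption: the target disease closes every sequence as its tail.
    return {person: codes + [target] for person, codes in sequences.items()}


# Hypothetical pre-disease occurrences, deliberately out of order for p1.
pre_diseases = {"p1": [("250", 200), ("401", 10)], "p2": [("272", 30)]}
table = build_timing_table(pre_diseases)
sequences = to_timing_sequences(table, target="410")
```

After sorting, `p1` yields the sequence `["401", "250", "410"]` and `p2` yields `["272", "410"]`. Individuals without any pre-disease produce no row and therefore no sequence in this sketch.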

Referring again to Figs. 1 and 2, in order to find the associated diseases related to the target disease from the pre-disease information (including the disease timing sequences), the operation module 30 of the present disclosure is programmed to execute step S250; that is, a data mining process is performed on the pre-disease information to obtain a disease trace result.

In particular, the data exploration technique suitable for the present invention may be any technique known in the art, the general goal of which is to extract information from a massive database and convert it into an understandable structure for further use. Data mining can build six kinds of models, namely classification, clustering, regression, time series forecasting, association, and sequential pattern models. According to the preferred embodiment of the present invention, the sequential pattern exploration (sequential pattern mining) model is used as the basis for technical improvement and functional augmentation, and the concepts of time sequence and time interval are additionally introduced. Thus, not only are the relative sequence characteristics of each individual's data maintained, but the sequence combinations in the data can also be identified at the same time. According to the present invention, the optimized sequential pattern exploration technique adds to the overall pre-disease information the group classification of the event items (e.g., the disease window) and the time interval between event occurrences (e.g., the time sequence data table), thereby improving operating efficiency and yielding analysis results of greater real-world clinical significance.

As stated above, the sequential pattern exploration process is performed to find, from a number of different timing sequences, the subsequence combinations that occur more frequently. According to the embodiment of the present disclosure, the operation module 30 is programmed to perform sequential pattern exploration on the pre-disease information (preferably including at least one disease timing sequence) to obtain the diseases that occur more frequently among the pre-diseases and the sequential relationships among those diseases. It should be noted that the subsequences described herein may be composed of a single event or more than one event (i.e., may contain one disease or multiple diseases).

To achieve the foregoing object, the sequential pattern exploration process of the present disclosure is performed as follows: (1) excluding the disease timing sequences whose tail item (suffix) is not the target disease; and (2) sequentially searching for target subsequences in the disease timing sequences remaining after process (1). When the current target subsequence is the tail item of the previous target subsequence, the current target subsequence is searched for only in those previous target subsequences whose tail item has the head (prefix) of the current target subsequence.

Specifically, process (1) rapidly excludes irrelevant original sequences (e.g., disease timing sequences), while process (2) avoids repeated searches and increases the efficiency of identifying all target subsequences.
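The two processes above can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names and the toy ICD-style codes are hypothetical, and process (2) is reduced to counting how many retained sequences contain one given candidate subsequence in order.

```python
def keep_target_tail(sequences, target):
    """Process (1): keep only sequences whose tail item is the target disease."""
    return [seq for seq in sequences if seq and seq[-1] == target]


def contains_in_order(sequence, pattern):
    """True if `pattern` occurs in `sequence` as an order-preserving subsequence."""
    remaining = iter(sequence)
    # `item in remaining` advances the iterator, so order is enforced.
    return all(item in remaining for item in pattern)


def support(sequences, pattern):
    """Process (2), simplified: number of sequences containing the pattern."""
    return sum(contains_in_order(seq, pattern) for seq in sequences)
```

A candidate subsequence would be treated as frequent when its support exceeds a chosen minimum; that minimum is an assumption here, since the text does not state one.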

Also in step S250, the sequential pattern exploration process performed in the above steps yields a disease trace result, which includes a plurality of associated diseases related to the target disease. According to some embodiments, an odds ratio threshold may be provided when outputting the disease trace result, so that only the corresponding associated diseases are presented. Specifically, the disease trace result obtained after sequential pattern exploration is an integrated result drawn from a very large number of samples (individuals), and the degree of correlation of the diseases in it is not uniform. To facilitate interpretation by users and clinical staff, an odds ratio threshold may be given when presenting the disease trace result, so that only the associated diseases with larger odds ratios (i.e., with a particular degree of correlation) are presented. According to embodiments of the present disclosure, the odds ratio threshold may be 4, which means that the presented disease trace result consists of results whose odds ratio is greater than 4. The odds ratio threshold may also be 5, 6, 7, 8, 9, or 10. In a particular embodiment, the odds ratio threshold is 5; in another particular embodiment, the odds ratio threshold is 6.
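As an illustration of the thresholding described above, the following Python sketch computes an odds ratio from a 2x2 table and keeps only the trajectories above a threshold. The text does not give the formula it uses; the standard definition (odds of exposure among cases divided by odds of exposure among controls) is assumed here, and all function and parameter names are hypothetical.

```python
import math


def odds_ratio(case_with, case_without, control_with, control_without):
    """Standard 2x2 odds ratio (assumed definition, not stated in the text).

    case_*: individuals with the target disease, with/without the trajectory.
    control_*: individuals without the target disease, with/without it.
    """
    denominator = case_without * control_with
    if denominator == 0:
        return math.inf  # no finite estimate when a denominator cell is empty
    return (case_with * control_without) / denominator


def filter_by_odds_ratio(trajectory_ratios, threshold=4.0):
    """Keep trajectories whose odds ratio is strictly greater than the threshold."""
    return {t: r for t, r in trajectory_ratios.items() if r > threshold}
```

The default of 4 mirrors the threshold named in the text; raising it to 5 or 6 narrows the output to more strongly associated trajectories.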

According to some embodiments, the average time interval between any two consecutive pre-diseases of the at least one pre-disease may be calculated from the aforementioned time series data table before the disease trace result is output. Accordingly, the disease trace result obtained after the sequential pattern exploration process includes not only a plurality of associated diseases with a specific degree of correlation but also the average time interval between the associated diseases. This facilitates analysis of the overall disease trace of the target disease, such as the evolution of each associated disease and the average duration of each stage from the occurrence of the first disease event to the final occurrence of the target disease.
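The average-interval calculation can be sketched as follows. This is only an assumed reading: each individual's row of the time series data table is modeled as a list of (disease code, day offset) pairs sorted by day, and the interval is measured from the first occurrence of the earlier disease to the first later occurrence of the following disease. The text does not specify which occurrences are paired, and the names are hypothetical.

```python
from statistics import mean


def average_interval(records, first, second):
    """Mean number of days from `first` to the next `second`, over all individuals.

    records: one list per individual of (code, day) pairs, sorted by day.
    Returns None when no individual shows `first` followed by `second`.
    """
    gaps = []
    for visits in records:
        start = next((day for code, day in visits if code == first), None)
        if start is None:
            continue
        end = next(
            (day for code, day in visits if code == second and day > start), None
        )
        if end is not None:
            gaps.append(end - start)
    return mean(gaps) if gaps else None
```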

The disease trace results of the present disclosure may be visualized with a variety of data visualization tools well known to those skilled in the art. For example, the data visualization tools include, but are not limited to, statistical charts (e.g., bar charts, line charts, pie charts, donut charts, etc.), scatter plots, and network maps (network diagrams). According to one embodiment, the preferred presentation is a network map. Tools for converting result data into network maps are common knowledge in the art, and exemplary software suitable for use in the present invention includes, but is not limited to: Visual Paradigm Online, Cytoscape, SmartDraw, Lucidchart, SolarWinds Network Topology Mapper, Intermapper, CADE, Dia, Diagram Designer, eDraw, LanFlow, NetProbe, Network Notepad, and Microsoft Visio.

According to some embodiments, the disease trace results of the present disclosure are presented to a user as a network map. According to some embodiments, the network map may be presented to the user via a user interface (not shown). Specifically, the disease trace result is composed of a plurality of representative sequences, each of which has at least one line segment between any two consecutive associated diseases. In other words, the line segments connect any two consecutive associated diseases to represent the occurrence or progression of the associated diseases. Preferably, in combination with the odds ratio threshold described above, the odds ratio of each representative sequence can be read from the corresponding representative sequence.
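A network map of this kind reduces to a directed edge list. The sketch below, under the assumption that each representative sequence is an ordered list of disease labels, counts how many representative sequences pass through each pair of consecutive diseases; the resulting dictionary could then be exported to a tool such as Cytoscape. The function name and labels are hypothetical.

```python
from collections import defaultdict


def build_edges(representative_sequences):
    """Map each (source, destination) pair of consecutive diseases to a count."""
    edges = defaultdict(int)
    for sequence in representative_sequences:
        # zip pairs every disease with its immediate successor.
        for source, destination in zip(sequence, sequence[1:]):
            edges[(source, destination)] += 1
    return dict(edges)
```

The count on each edge could drive line thickness in the rendered map, and further attributes (for example an average interval in days) could be stored alongside it.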

After the disease trace result is obtained, the comparison module 40 is programmed to execute step S260 to select at least one biomarker according to the disease trace result. The comparison module 40 performs a comorbid gene group search based on the plurality of associated diseases that are related to the target disease at a specific odds ratio, as presented by the disease trace result. Specifically, the comparison module 40 may be programmed to look up the genes associated with the target disease and with each of the associated diseases in one or more gene databases or literature databases, and to find from them the largest intersecting gene group of these associated genes, i.e., the comorbid gene group. Specific databases applicable to the present disclosure include, but are not limited to: Online Mendelian Inheritance in Man (OMIM), humsvar, ClinVar, dbPTB, and the like.
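The intersection step can be sketched with Python sets. This assumes the gene lists have already been retrieved from a database as plain sets of gene symbols; no database API is modeled, and the function names and placeholder symbols are hypothetical. The second function counts, for each target-disease gene, how many associated diseases share it, which is one possible way to rank candidates when the full intersection is empty.

```python
from collections import Counter


def shared_with_target(target_genes, associated_genes):
    """For each associated disease, the genes it shares with the target disease."""
    return {
        disease: genes & target_genes
        for disease, genes in associated_genes.items()
    }


def sharing_counts(target_genes, associated_genes):
    """Number of associated diseases that share each target-disease gene."""
    counts = Counter()
    for genes in associated_genes.values():
        counts.update(genes & target_genes)
    return dict(counts)
```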

Based on the comorbid gene group obtained by the comparison module 40, an objective gene expression indicator can be selected, by means of common knowledge in the art, as an effective biomarker for the target disease. For example, candidate substances that can serve as biomarkers (e.g., nucleic acids, amino acids, peptides, proteins, monosaccharides, disaccharides, polysaccharides, glycoproteins, and combinations thereof) may be selected from public literature or databases by methods including, but not limited to, gene chips, whole genome sequencing, exome sequencing, DNA/RNA sequencing of specific gene groups, and sequence alignment. The expression levels of these candidate biomarkers can then be measured, for example by protein expression detection techniques such as Western blot, immunohistochemistry (IHC), and immunoprecipitation (IP), or by real-time polymerase chain reaction (real-time PCR), fluorescence in situ hybridization (FISH), and RNA sequencing, and functional analyses can be performed at the cell, tissue, and organism levels. The biomarkers selected by the foregoing method are responsive to the initially selected target disease and can therefore be used to detect the presence or the potential risk of the target disease in an individual.

Therefore, the biomarker screening method of the invention provides a disease window to obtain disease information, performs sequential pattern exploration on the disease information to generate disease trace results, and narrows the search to a few important candidate genes, drawn from the associated diseases presented by the disease trace results, as biomarkers. In this way it can greatly improve the efficiency of biomarker screening, reduce the time required for detection, and improve the accuracy of risk prediction.

The following examples are set forth to illustrate certain embodiments of the invention and to facilitate the operation of the invention by those skilled in the art. These examples should not be construed as limiting the scope of the invention. Without further elaboration, it is believed that one skilled in the art can, based on the description herein, utilize the present invention to its fullest extent. All publications cited herein are incorporated by reference in their entirety.

Example 1: screening of comorbid genes and biomarkers for premature delivery and related diseases

1.1 huge volumes of medical record data

The medical data used in this example comprise the medical records of 116,918 individuals with full-term or premature delivery. Personal data and the specific dates of visits were removed, so each medical record file contains only the diagnosis results in chronological order and the relative time interval between any two consecutive diseases. Disease records can be selected according to a user-defined look-back period, and new disease code records can be generated according to the defined disease groups for use in the subsequent pattern exploration analysis of disease traces.

1.2 disease grouping and establishment of disease window therefor

The etiology and clinical characterization defined in the International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) were used as the basis for disease classification and for categorizing and organizing all diseases in the medical database. The currently known diseases listed in ICD-9-CM can be classified by definition into 17 major disease groups, as shown in Table 1.

Table 1: disease window established according to ICD-9-CM

Each of the 17 major disease categories may be further divided into sub-categories according to the relatedness of the diseases. After this classification, the thousands of diseases known to date can be divided into 147 classified disease groups. Taking the 11th major category, "complications of pregnancy, childbirth, and the puerperium", as an example, it can be further grouped into: ectopic pregnancy and other pregnancy outcomes (codes: 630-639), complications mainly related to pregnancy (codes: 640-649), normal delivery and other pregnancy and delivery care (codes: 650-659), complications occurring mainly during labor and delivery (codes: 660-669), complications of the puerperium (codes: 670-677), other maternal and fetal complications (codes: 678-679), and the like.
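A disease window of this kind can be modeled as a range lookup. The sketch below covers only the sub-groups of the 11th major category listed above, and it assumes that codes are stored as strings without a decimal point whose first three digits give the ICD-9-CM category; the table name and function name are hypothetical.

```python
# Sub-groups of the pregnancy-related major category, as listed in the text.
PREGNANCY_WINDOW = [
    (630, 639, "ectopic pregnancy and other pregnancy outcomes"),
    (640, 649, "complications mainly related to pregnancy"),
    (650, 659, "normal delivery and other pregnancy and delivery care"),
    (660, 669, "complications occurring mainly during labor and delivery"),
    (670, 677, "complications of the puerperium"),
    (678, 679, "other maternal and fetal complications"),
]


def group_of(code, window=PREGNANCY_WINDOW):
    """Return the sub-group label for an ICD-9-CM code string, or None."""
    prefix = code[:3]
    if len(prefix) < 3 or not prefix.isdigit():
        return None  # e.g. V or E codes, which this sketch does not cover
    category = int(prefix)
    for low, high, label in window:
        if low <= category <= high:
            return label
    return None
```

Replacing each diagnosis code by its group label before mining is what lets trajectories be reported at a coarser classification level.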

1.3 data exploration to obtain disease trajectory results

The target disease selected for this example is premature delivery. To analyze the disease traces of premature delivery and its associated diseases, a pattern exploration analysis of disease traces was performed using the medical database of section 1.1 and the disease window of section 1.2. Premature delivery was identified by the ICD-9-CM codes 6440, 6441, 6442, 64011, 64081, or 64091; full-term delivery was identified by the ICD-9-CM codes 640x1-649x1 (excluding 64011, 64081, and 64091), 650, or 651x1-659x1. The delivery samples of this example totaled 116,918, including 111,163 full-term deliveries and 5,755 premature deliveries; the average age was 28.6 years for women with full-term delivery and 29.7 years for women with premature delivery. The date of the hospital stay is regarded as the delivery date and serves as the reference date on which premature delivery occurred for a specific individual; the relevant disease information preceding it is traced back from this reference date.
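The cohort labeling by code can be sketched as below. Two points are assumptions rather than statements from the text: the four-digit codes 6440, 6441, and 6442 are treated as prefixes (so a five-digit code such as 64421 counts as premature delivery), and the premature-delivery check runs first because the full-term pattern 640x1-649x1 would otherwise also match such codes. The names are hypothetical.

```python
import re

PRETERM_PREFIXES = ("6440", "6441", "6442")
PRETERM_EXACT = {"64011", "64081", "64091"}
# 640x1-649x1, 650, or 651x1-659x1, where x is any digit.
TERM_PATTERN = re.compile(r"^(64\d\d1|650|65[1-9]\d1)$")


def classify_delivery(code):
    """Return 'preterm', 'term', or None for an ICD-9-CM code string."""
    if code in PRETERM_EXACT or code.startswith(PRETERM_PREFIXES):
        return "preterm"
    if TERM_PATTERN.match(code):
        return "term"
    return None
```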

As described above, the time range is set relative to the day of premature or full-term delivery: it covers the estimated 260 days before the reference date plus one further year before that, and all visit records within this interval are retrieved. The diagnosis codes in each individual's record are ordered by time of occurrence and expressed as ICD-9-CM codes, forming a plurality of disease timing sequences. The disease timing sequences are analyzed by a sequential pattern exploration technique, specifically one based on the PrefixSpan algorithm. The basic logic of the PrefixSpan algorithm is publicly known in the art; see J. Pei et al., "PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth," Proceedings 17th International Conference on Data Engineering (ICDE), Heidelberg, Germany, 2001, pp. 0215.
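The window selection and ordering can be sketched as follows. Because the records carry only relative time information, each visit is modeled here as a pair (days before delivery, code); "one further year" is taken as 365 days, which is an assumption, and the function name is hypothetical.

```python
LOOKBACK_DAYS = 260 + 365  # estimated pregnancy plus one prior year (assumed 365)


def build_sequence(visits, target_code, lookback=LOOKBACK_DAYS):
    """Build one disease timing sequence ending with the target event.

    visits: iterable of (days_before_delivery, code) pairs.
    """
    in_window = [(days, code) for days, code in visits if 0 < days <= lookback]
    # Larger offsets are earlier in time, so sort in descending order of days.
    in_window.sort(key=lambda visit: -visit[0])
    return [code for _, code in in_window] + [target_code]
```

For example, visits at 400, 100, and 30 days before delivery are kept in that order, while a visit 700 days before delivery falls outside the window and is dropped.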

To overcome the shortcomings of the known PrefixSpan algorithm when sequence analysis is combined with time intervals, the following modifications are made in this example. Since in this embodiment the tail item (suffix) of every disease timing sequence is the selected target disease (for example, premature delivery), the algorithm is designed to begin its search from the tail end of each sequence and to exclude disease timing sequences whose tail item is not a premature delivery event. The subsequence lengths and occurrence frequencies of the disease timing sequences differ according to the chronological order of the actual diagnosis records. The disease timing sequences remaining after this exclusion are used as the search objects, and high-frequency target subsequences are searched for in sequence. In a conventional search, the relation array of each target subsequence is repeatedly matched against the original data to find again the positions and number of occurrences of the target subsequence. To avoid repeatedly searching the same disease timing sequences, the invention adds a check on the target subsequence to the algorithm. The specific concept is that if the currently computed target subsequence is located at the tail end of the sequence or is a tail item of the previous target subsequence, a prefix item is added to the tail item of the previous target subsequence, so that the current target subsequence is searched for only among those previous target subsequences whose tail item has the head (prefix) of the current target subsequence. That is, instead of rebuilding a new relation array from the original data, the algorithm derives the relation array of the current round from the relation array of the previous target subsequence. This increases the efficiency of finding the new relation array, so that all target subsequences can be identified rapidly. Finally, the combinations of all the different disease traces in all delivery records (including premature and full-term delivery) are identified from the huge amount of time series data.
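The reuse of the previous relation array can be illustrated with a small suffix-growth miner in Python. This is a sketch of the general idea (a mirror image of PrefixSpan's projected databases), not the patented algorithm: patterns grow backward from the target disease, and each longer pattern is searched only at the positions recorded for the pattern it extends. The names and the `min_support` parameter are hypothetical.

```python
from collections import defaultdict


def mine_trajectories(sequences, target, min_support=2):
    """Return {pattern tuple: support} for patterns ending at the target."""
    # Exclude sequences whose tail item is not the target disease.
    kept = [seq for seq in sequences if seq and seq[-1] == target]
    results = {}

    def grow(pattern, relation):
        # relation: (sequence index, position of the pattern's first item).
        results[pattern] = len(relation)
        candidates = defaultdict(list)
        for index, position in relation:
            rightmost = {}
            for j in range(position):  # look only before the current match
                rightmost[kept[index][j]] = j  # later j overwrites earlier j
            for item, j in rightmost.items():
                candidates[item].append((index, j))
        for item, new_relation in candidates.items():
            if len(new_relation) >= min_support:
                grow((item,) + pattern, new_relation)

    grow((target,), [(i, len(seq) - 1) for i, seq in enumerate(kept)])
    return results
```

Keeping the rightmost earlier occurrence of each item leaves the longest possible remainder for further extension, so no supporting sequence is missed while the original data are never rescanned from scratch.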

To generate more reliable information, the times at which the diseases occurred are additionally recorded while the disease traces are identified, forming a time series data table. In addition, the average interval between every two consecutive associated diseases of each target subsequence is calculated from the disease trace results found, and the odds ratio of each disease trace occurring in women with premature delivery relative to women with full-term delivery is calculated. Table 2 presents the top three diseases, and the classifications to which they belong, in women with premature delivery at an odds ratio greater than 2. Across the three classification levels (large, medium, and small) presented in Table 2, the three most frequent diseases at the small classification level are diseases directly related to pregnancy and delivery; once the classification level is raised, however, a highly significant association between endocrine gland related diseases and premature delivery can be observed. Hypertensive and heart diseases likewise affect a relatively high proportion of women with premature delivery and have relatively high odds ratios.

Table 2: diseases significantly associated with premature delivery after analysis by sequential pattern mining

Further analysis of the overall disease trace of the target disease reveals the evolution from the first disease occurrence to the final premature delivery and the average duration of each stage. The network map of the disease trace results can be rendered in the data visualization tool Cytoscape, together with the disease classification information provided by the disease window of section 1.2. When a network map is presented, an odds ratio threshold may be given so that only the representative sequences of disease traces meeting that threshold are presented, from which it can be seen which associated diseases the representative sequences comprise.

As shown in FIG. 3 and FIG. 4, this example presents network maps of the disease trace results covering the period from the year before pregnancy, through pregnancy, to delivery, with premature delivery as the target disease. Each shape represents a disease or event; in general, the target disease and the associated diseases can be given different shapes to distinguish them, and color, shape, or text may be used to indicate the major classification to which a disease belongs. In this example, circles are premature delivery events, and diamonds represent associated disease events that occur before premature delivery, labeled with ICD-9-CM codes. The arrow direction indicates the order of occurrence of the diseases (i.e., the disease trace). Different line thicknesses or colors may also be used to indicate how many women with premature delivery follow a given disease trace, while the number on each line indicates the average time interval (in days) between the two consecutive disease events. FIG. 3 shows the disease trace results with an odds ratio greater than 5, presented at the medium classification level of the diseases, and includes several representative sequences (disease traces). The disease trace results present 11 different medium-classification diseases, which fall within 9 major disease categories. If the odds ratio threshold is raised, the associated diseases more strongly correlated with premature delivery can be further screened. FIG. 4 shows the disease trace results with an odds ratio greater than 6, likewise presented at the medium classification level. Two representative sequences are taken arbitrarily from the disease trace results of FIG. 4 for illustration: (I) female genital disease (617-629) → other inflammation of skin and subcutaneous tissue (690-698) → hypertensive disease (401-405) → premature delivery, with an odds ratio of 6.654; and (II) other endocrine gland diseases (249-259) → hypertensive disease (401-405) → premature delivery, with an odds ratio of 6.076. Disease progression can also be read from the disease trace results. In trace (I), the average time from female genital disease to other inflammation of the skin and subcutaneous tissue is 169.1 days, the average time from other inflammation of the skin and subcutaneous tissue to hypertensive disease is 135.5 days, and the average interval from hypertensive disease to premature delivery is 89.8 days. In trace (II), the average interval from other endocrine gland diseases to hypertensive disease is 223.2 days, and the average interval from hypertensive disease to premature delivery is 89.8 days.

1.4 analysis of Co-morbid Gene populations

Having obtained from the previous steps the disease trace results highly correlated with premature delivery, biomarkers suitable for detecting premature delivery are then sought in the comorbid gene groups shared by premature delivery and these associated diseases. The disease trace analysis of section 1.3 shows that multiple pregnancy, hypertension, metabolic disease, gastrointestinal disease, diabetes, respiratory system related diseases, and the like are significantly correlated with premature delivery and form part of its disease traces. Cross-comparison of the comorbid gene groups of these disease traces in the OMIM database, the dbPTB database, and published exome sequencing literature on premature delivery shows that the gene group of premature delivery is highly correlated with genes related to hypertensive diseases (such as the adducin 1 (ADD1) gene, the angiotensinogen (AGT) gene, the follicle stimulating hormone receptor (FSHR) gene, and the nitric oxide synthase 2 (NOS2) gene) and with genes related to endocrine diseases (such as the angiotensin I converting enzyme (ACE) gene, the lipase C, hepatic type (LIPC) gene, and the peroxisome proliferator activated receptor gamma (PPARG) gene). Starting from the intersection of these comorbid gene groups, specimens from women with premature delivery are collected to test the mutation sites of these genes. The gene expression levels of these specific gene groups are analyzed by polymerase chain reaction to determine the genetic variation sites that differ between women with full-term and premature delivery, and the most statistically significant biomarkers are selected. These biomarkers can be used to predict or detect the relative risk and probability of premature delivery in a woman who plans to become pregnant.

In summary, the biomarker screening system and biomarker screening method of the present invention can rapidly identify, from a huge amount of data and particularly from medical data comprising many medical records, the associated diseases highly related to a target disease. This facilitates screening biomarkers of the target disease from the comorbid gene group of these diseases and thereby improves the efficiency and accuracy of detection and data interpretation.

It will be understood that the foregoing description of the embodiments is given by way of example only, and that various modifications may be made by those skilled in the art. The above specification, examples and experimental results provide a complete description of the structure and use of exemplary embodiments of the invention. Although various embodiments of the present invention have been disclosed in the foregoing description, it will be appreciated by those skilled in the art that changes and modifications may be made without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims.

Claims

1. A method for screening biomarkers for a disease of interest, wherein the method comprises:

(a) Providing a medical database containing medical information of a plurality of individuals;

(b) Providing a disease window comprising at least one disease information;

(c) Obtaining at least one pre-disease of the individuals suffering from the target disease from the medical database of step (a), and obtaining the at least one disease information of the target disease and the at least one pre-disease based on the disease window of step (b) to form a pre-disease information, wherein the at least one pre-disease occurs within a predetermined time before the target disease occurs;

(d) Performing a sequential pattern exploration process on the pre-disease information of step (c) to generate a disease trace result, wherein the disease trace result comprises a plurality of associated diseases related to the target disease; and

(e) Selecting at least one biomarker from a co-morbid gene population of the target disease and the plurality of related diseases according to the disease trajectory result of step (d), wherein the at least one biomarker is useful for detecting the target disease.

2. The method of claim 1, wherein the step (c) comprises:

(c-1) recording the time of occurrence of said at least one pre-disease in each of said individuals to form a time series data table; and

(c-2) sorting the at least one pre-disease in chronological order according to the chronological data table of step (c-1) to generate at least one disease timing sequence.

3. The method of claim 2, wherein the step (d) comprises:

(d-1) performing the sequential pattern exploration process on the at least one disease timing sequence of step (c-2) to produce the disease trace result; and

(d-2) calculating an average time interval between any two consecutive pre-diseases of the at least one pre-disease according to the time series data table before outputting the disease trace result.

4. A method as claimed in claim 3, wherein the sequential pattern exploration process of step (d) is achieved by performing the steps of:

(i) Excluding the disease timing sequences whose tail item is not the target disease; and

(ii) Sequentially searching target subsequences from the disease time sequence remained after the step (i),

and when the current target subsequence is the tail item of the previous target subsequence, searching for the current target subsequence only in the previous target subsequences whose tail item has the head of the current target subsequence.

5. The method of claim 1, further comprising providing an odds ratio threshold between step (d) and step (e) to present the plurality of associated diseases in the disease trace result whose odds ratio is greater than the odds ratio threshold.

6. The method of claim 5, wherein the odds ratio threshold is 4.

7. The method of claim 1, wherein the disease of interest is selected from the group consisting of cardiovascular disease, premature labor, endocrine related disease, metabolic disease, dermatological disease, and respiratory disease.

8. The method of claim 1, wherein the biomarker is selected from the group consisting of a nucleic acid, an amino acid, a peptide, a protein, a monosaccharide, a disaccharide, a polysaccharide, a glycoprotein, and combinations thereof.

9. A biomarker screening system, wherein the system comprises:

a medical database containing a plurality of medical information;

a disease window including at least one disease information;

an arithmetic module programmed with instructions to perform a method, wherein the method comprises:

(1) Providing pre-disease information according to at least one pre-disease of a target disease obtained from the plurality of medical information and the at least one disease information corresponding to the target disease and the at least one pre-disease obtained from the disease window, wherein the at least one pre-disease occurs within a predetermined time before the target disease occurs; and

(2) Performing a sequential pattern exploration process on the pre-disease information of step (1) to generate a disease trace result, wherein the disease trace result comprises a plurality of associated diseases related to the target disease;

a comparison module programmed to select at least one biomarker from a comorbid gene group of the target disease and the plurality of associated diseases according to the disease trace result.

10. The system of claim 9, wherein the disease trace result comprises a plurality of representative sequences, each of the representative sequences having at least one line segment between any two consecutive associated diseases.

11. The system of claim 10, wherein the disease trace result comprises an average time interval between any two consecutive associated diseases, and an odds ratio for each of the representative sequences.