WO2021214787A1 - System and method for performing multi-omics data integration - Google Patents

System and method for performing multi-omics data integration

Info

Publication number
WO2021214787A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
omics
processing circuitry
datasets
analysis
Prior art date
Application number
PCT/IN2021/050390
Other languages
French (fr)
Inventor
Gowhar SHAFI
Krithika SRINIVASAN
Shruti DESAI
Mohan UTTARWAR
Original Assignee
Indx Technology (India) Private Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Indx Technology (India) Private Limited
Publication of WO2021214787A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/20 Heterogeneous data integration

Definitions

  • the present disclosure relates generally to analysis of biological data and, more specifically, to a system and method for performing multi-omics data integration.
  • a drone is required to gather data from multiple sources during a flight, including but not limited to, using on-board sensors for gathering data along a pre-determined flying route of the drone, using remote servers capable of providing accurate temperature and climate-related data, third-party sources capable of transmitting air traffic-related data and so forth.
  • an on-board data processing unit or a remote data processing unit analyses such data to generate an output, such as, a change to the pre-determined flying route.
  • the present disclosure seeks to provide a system and method for performing multi-omics data integration.
  • an embodiment of the present disclosure provides a system for performing multi-omics data integration, the system comprising: a data storing unit configured to receive a plurality of datasets from various data sources and store the plurality of datasets; a processing circuitry configured to implement an artificial intelligence platform, the processing circuitry is communicatively coupled to the data storing unit for receiving data from the plurality of datasets, the processing circuitry is further configured to: analyze a behavior and a distribution associated with the data; process the data to determine a trend associated with the data, for identifying and removing high-dimensional noise associated with the data; identify at least one missing value associated with the data; impute the at least one missing value associated with the data; generate a prediction score for the data; and generate a visual representation of the data comprising the imputed at least one missing value therein, wherein the visual representation of the data corresponds to a plurality of parameters; and a display unit communicatively coupled to the processing circuitry, wherein the display unit receives the visual representation of the data from
  • the processing circuitry is configured to analyze the behavior and the distribution associated with the data by employing partial least squares discriminant analysis (PLS-DA) technique.
  • the processing circuitry is configured to determine the trend associated with the data by determining correlations between data of the dataset and wherein data associated with less than 50% correlation is identified as the high-dimensional noise.
  • the processing circuitry is configured to normalize the data based on analysis of the distribution associated with the data.
  • the processing circuitry is configured to train a model using at least one training dataset and store the trained model having the at least one training dataset in the data storing unit.
  • the processing circuitry is configured to impute the at least one missing value by one or more of single-omics data imputation and multi-omics data imputation.
  • an embodiment of the present disclosure provides a method for performing multi-omics data integration, the method comprising: receiving a plurality of datasets from various data sources; implementing an artificial intelligence platform on a data of the received plurality of datasets, wherein the implementing comprises: analyzing a behavior and a distribution associated with the data; processing the data to determine a trend associated with the data, for identifying high-dimensional noise associated with the data; removing the high-dimensional noise associated with the data; identifying at least one missing value associated with the data; imputing the at least one missing value associated with the data; generating a prediction score for the data; and generating a visual representation of the data comprising the imputed at least one missing value therein, wherein the visual representation of the data corresponds to a plurality of parameters; and displaying the visual representation of the data on a graphical user interface.
  • the method comprises analyzing the behavior and the distribution associated with the data by employing partial least squares discriminant analysis (PLS-DA) technique.
  • the method comprises determining the trend associated with the data by determining correlations between data of the plurality of datasets and wherein data associated with less than 50% correlation is identified as the high-dimensional noise.
  • the method comprises normalizing the data based on analysis of the distribution associated with the data.
  • the method comprises training a model using at least one training dataset; and storing the trained model having the at least one training dataset in the data storing unit.
  • the method comprises imputing the at least one missing value by one or more of single-omics data imputation and multi-omics data imputation.
  • FIG. 1 is a block diagram of a system for performing multi-omics data integration, in accordance with an embodiment of the present disclosure.
  • FIG. 2 is a flowchart illustrating steps of a method for performing multi-omics data integration, in accordance with an embodiment of the present disclosure.
  • FIG. 3 shows an exemplary block diagram of a client-server configuration for performing multi-omics data integration, in accordance with an example embodiment of the present disclosure.
  • FIG. 4 shows a method flowchart for performing normalization of the plurality of data, in accordance with an embodiment of the present disclosure.
  • an underlined number is employed to represent an item over which the underlined number is positioned or an item to which the underlined number is adjacent.
  • a non-underlined number relates to an item identified by a line linking the non-underlined number to the item. When a number is non-underlined and accompanied by an associated arrow, the non-underlined number is used to identify a general item at which the arrow is pointing.
  • spatially relative terms such as “inner,” “outer,” “beneath,” “below,” “lower,” “above,” “upper,” and the like may be used herein for ease of description, to describe an element’s or a feature's relationship to another element(s) or feature(s) as illustrated in the figures.
  • spatially relative terms may be intended to encompass different orientations of the device in use or in operation, in addition to one or more orientations depicted in the figures. For example, if the device in the figures is turned over, elements described as “below” or “beneath” other elements or features would then be oriented “above” the other elements or features.
  • the example term “below” can encompass both an orientation of above and below. It will be appreciated that the device may be otherwise oriented (rotated 90 degrees or at other orientations) and the spatially relative descriptors used herein should be interpreted accordingly.
  • first, second, third, etc. may be used herein to describe various elements, components, regions, layers and/or sections, these elements, components, regions, layers and/or sections should not be limited by these terms. These terms may be only used to distinguish one element, component, region, layer or section from another region, layer or section. Terms such as “first,” “second,” and other numerical terms when used herein do not imply a sequence or order unless clearly indicated by the context. Thus, a first element, component, region, layer or section discussed below could be termed a second element, component, region, layer or section without departing from the teachings of the example embodiments.
  • the term “user” relates to at least one individual that uses or operates the system or arrangement or device (or other variants thereof) as claimed, such as, by interacting with at least one component of the system or arrangement or device (or other variants thereof).
  • the system allows simultaneous, reliable and efficient analysis of multiple types of data (such as, multi-omics data).
  • the system comprising the artificial intelligence platform implemented on the data processing unit, allows robust identification of high-dimensional noise associated with the data and subsequent reduction thereof.
  • the system enables generation of the visual representation of the data, thereby, improving user-friendly attributes associated with use of the system for analysis of the data, such as, for allowing users with minimal/no training to use the system to perform simultaneous analysis of multiple types of data.
  • embodiments of the present disclosure are concerned with systems and methods for performing multi-omics data integration.
  • FIG. 1 is a block diagram of a system 100 for performing multi-omics data integration, in accordance with an embodiment of the present disclosure.
  • the system 100 comprises a data storing unit 102 for storage of a plurality of datasets.
  • the data storing unit 102 can be implemented as a computer-readable medium for storage of data therein.
  • the data storing unit 102 can be implemented as transitory or non-transitory forms of computer-readable media, including but not limited to, a hard disk, a random access memory (RAM), a read only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fibre, a portable compact disc read-only memory (CD-ROM), an optical storage device, a digital versatile disk (DVD), a static random access memory (SRAM), a memory stick, a floppy disk and so forth.
  • the data storing unit 102 can be implemented using one or more databases (such as, an arrangement of more than one database communicatively coupled to each other), for example, as a cloud-based database.
  • the data storing unit 102 stores the plurality of datasets.
  • the plurality of datasets comprises patient-related data.
  • patient-related data can comprise data that is captured at different layers (corresponding to multilayer-omics data), for example, DNA-level data (associated with mutation, methylation and so forth), RNA-level data (such as, expression of genes, microRNAs, and the like), protein-related data (including intracellular protein expression, surface protein expression and suchlike), imaging data (such as, pathological images, radiological images and the like), clinical data (drug-response data, adverse-event data and suchlike) and so forth.
  • the system 100 comprises a data processing unit 104 implementing an artificial intelligence platform 106 therein.
  • the data processing unit 104 can be implemented as a device capable of performing computations and operations on general or specific data, such as, by receiving the data as an input and performing analysis of the data to yield one or more outputs.
  • the data processing unit 104 can be implemented as a dedicated processor, a processing circuitry, a portion of a processor, a virtual processor, a portion of a virtual processor, portion of a virtual device, or a virtual device.
  • the virtual processor may correspond to one or more parts of a physical processor.
  • the data processing unit 104 can implement therein, various software platforms executing instructions or logic, such that the instructions or logic may be distributed and executed across one or more processors, virtual or physical, to execute the instructions or logic. Such execution of the instructions or logic by the data processing unit 104 via the software platforms allows the data processing unit 104 to yield the output by processing the data.
  • the artificial intelligence platform 106 is one such software platform.
  • the data processing unit 104 can comprise specialized hardware for allowing implementation and execution of the artificial intelligence platform 106.
  • the artificial intelligence platform 106 can be implemented as a software-based tool for identification of clinically-relevant biomarker features from the patient-related data.
  • the clinically-relevant biomarker features can be a combination of biomarkers (such as, DNA sequences) that are associated with susceptibility to a particular disease within a patient population.
  • such biomarker features would be absent among healthy population, such as, a population of people not having developed the particular disease.
  • the patient population is associated with people having developed a specific form of cancer by a given age (such as, before an age of 30 years).
  • the data processing unit 104 is communicatively coupled to the data storing unit 102 for receiving data from the plurality of datasets using a network 112.
  • the network 112 may comprise suitable logic, circuitry, and interfaces that may be configured to provide a plurality of network ports and a plurality of communication channels for transmission and reception of data.
  • Each network port may correspond to a virtual address (or a physical machine address) for transmission and reception of the communication data.
  • the virtual address may be an Internet Protocol Version 4 (IPv4) address (or an IPv6 address) and the physical address may be a Media Access Control (MAC) address.
  • the network 112 may be associated with an application layer for implementation of communication protocols based on one or more communication requests from at least one of the one or more communication devices.
  • the communication data may be transmitted or received, via the communication protocols.
  • wired and wireless communication protocols may include, but are not limited to, Transmission Control Protocol and Internet Protocol (TCP/IP), User Datagram Protocol (UDP), Hypertext Transfer Protocol (HTTP), File Transfer Protocol (FTP), ZigBee, EDGE, infrared (IR), IEEE 802.11, 802.16, cellular communication protocols, and/or Bluetooth (BT) communication protocols.
  • Examples of the network 112 may include, but are not limited to, a wireless channel, a wired channel, and a combination of wireless and wired channels.
  • the wireless or wired channel may be associated with a network standard which may be defined by one of a Local Area Network (LAN), a Personal Area Network (PAN), a Wireless Local Area Network (WLAN), a Wireless Sensor Network (WSN), a Wide Area Network (WAN), a Wireless Wide Area Network (WWAN), a Long Term Evolution (LTE) network, a plain old telephone service (POTS), and a Metropolitan Area Network (MAN).
  • the wired channel may be selected on the basis of bandwidth criteria. For example, an optical fiber channel may be used for a high bandwidth communication. Further, a coaxial cable-based or Ethernet-based communication channel may be used for moderate bandwidth communication.
  • the data processing unit 104 receives the patient-related data from the data storing unit 102 and optionally, performs integration of the data.
  • the patient-related data can comprise multi-omics data, such as, genomic-data and phenomic-data. Consequently, the data processing unit 104 receives such patient-related data comprising the genomic-data and the phenomic-data associated with the patients from the data storing unit 102 and provides the data to the artificial intelligence platform 106.
  • the artificial intelligence platform 106 performs integration of the genomic-data and the phenomic-data (referred to as “multi-omics data” hereinafter) for the patient population.
  • the artificial intelligence platform 106 analyzes a behavior and a distribution associated with the data.
  • the artificial intelligence platform 106 analyzes the integrated multi-omics data for determining the behavior and distribution associated with the data. It will be appreciated that such an analysis of the integrated multi-omics data using the behavior and distribution thereof, instead of using physical parameters associated with the data, enables efficient, faster and reliable analysis of the multi-omics data as compared to conventional techniques that individually analyze the data based on physical parameters thereof.
  • the artificial intelligence platform 106 employs one or more other characteristic details associated with the data.
  • the distribution associated with the data corresponds to normal distribution.
  • the artificial intelligence platform 106 implemented on the data processing unit 104 analyzes the behavior and the distribution associated with the data by employing a partial least squares discriminant analysis (PLS-DA) technique.
  • the artificial intelligence platform 106 employs the PLS-DA technique to identify associations between different features from the data, to identify the behavior and the distribution associated with the data.
  • the artificial intelligence platform 106 identifies associations between biomarkers from the data corresponding to mutation from DNA-level data and expression from RNA-level data.
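  • as an illustration of the PLS-DA technique named above, the following is a minimal single-component sketch in plain Python. The disclosure does not specify the implementation; the function name plsda_one_component, the 0/1 class coding and the 0.5 decision threshold are assumptions made here for illustration only. The weight vector shows which features co-vary most strongly with the class labels.

```python
import math

def plsda_one_component(X, labels):
    """One-component PLS regression on 0/1 class labels (a basic PLS-DA).

    X: list of subjects, each a list of feature values.
    labels: 0 or 1 per subject (hypothetical coding, e.g. healthy vs patient).
    """
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(labels) / n
    # Centre the features and the class indicator.
    Xc = [[row[j] - col_means[j] for j in range(p)] for row in X]
    yc = [y - y_mean for y in labels]
    # Weight vector: direction in feature space with maximal covariance with y.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = math.sqrt(sum(v * v for v in w))
    if norm == 0:
        raise ValueError("features carry no covariance with the labels")
    w = [v / norm for v in w]
    # Latent scores of each subject on the single component.
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    # Regression of the centred labels on the scores.
    q = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)

    def predict(row):
        score = sum((row[j] - col_means[j]) * w[j] for j in range(p))
        return 1 if y_mean + q * score > 0.5 else 0

    return w, t, predict

# Toy data: feature 0 separates the classes, feature 1 does not.
X = [[1.0, 5.0], [1.2, 4.0], [0.8, 6.0], [3.0, 5.5], [3.2, 4.5], [2.8, 5.0]]
labels = [0, 0, 0, 1, 1, 1]
weights, scores, predict = plsda_one_component(X, labels)
```

  • in this toy example the first feature receives almost all of the weight, which is the kind of feature-to-outcome association the passage describes. A production system would more likely rely on an established library and several components.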
  • the artificial intelligence platform 106 implemented on the data processing unit 104 normalizes the data based on analysis of the distribution associated with the data. It will be appreciated that each type of data can be different from each other, based on a source of the data (such as, data that is acquired using different devices, different labs employing various protocols and so forth).
  • the artificial intelligence platform 106 identifies a pattern corresponding to the distribution associated with the data and employs an optimal normalization technique based on the distribution.
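  • the disclosure does not state which normalization techniques are chosen for which distributions. The sketch below assumes one simple, hypothetical rule: strongly skewed, non-negative data are log-transformed before z-scoring, while roughly symmetric data are z-scored directly. The names skewness and normalize and the skew_threshold parameter are illustrative only.

```python
import math
import statistics

def skewness(values):
    """Population skewness of a list of numbers."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return 0.0
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

def normalize(values, skew_threshold=1.0):
    """Choose a normalization from the shape of the distribution.

    Returns the normalized values and the name of the method applied.
    The selection rule is an assumption, not the method of the disclosure.
    """
    method = "zscore"
    if abs(skewness(values)) > skew_threshold and min(values) >= 0:
        # Compress a long right tail before standardizing.
        values = [math.log1p(v) for v in values]
        method = "log1p+zscore"
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0 for _ in values], method
    return [(v - mean) / sd for v in values], method

symmetric, symmetric_method = normalize([1.0, 2.0, 3.0, 4.0, 5.0])
skewed, skewed_method = normalize([1.0, 1.0, 2.0, 2.0, 3.0, 1000.0])
```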
  • the artificial intelligence platform 106 processes the data to determine a trend associated with the data, for identifying high-dimensional noise associated with the data.
  • the multi-omics data comprising the genomic-data and phenomic-data will be associated with high dimensional noise therein.
  • the artificial intelligence platform 106 performs correlation for determining the high-dimensional noise associated with the data, by using the behavior and distribution of the data.
  • the artificial intelligence platform 106 employs one or more additional algorithms for determination of the high-dimensional noise.
  • the artificial intelligence platform 106 removes the high-dimensional noise associated with the data. It will be appreciated that all parameters associated within the integrated multi-omics data will not be responsible for determination of clinically relevant outcomes.
  • for example, when the clinically relevant outcome corresponds to determination of susceptibility of a population to a certain disease, and a specific genomic feature is known not to be associated with the susceptibility to that disease, the genomic feature is removed as high-dimensional noise.
  • the artificial intelligence platform 106 removes the determined high-dimensional noise associated with the data.
  • the artificial intelligence platform 106 implemented on the data processing unit 104 determines the trend associated with the data by determining correlations between data of the dataset and wherein data associated with less than 50% correlation is identified as the high-dimensional noise.
  • the artificial intelligence platform 106 performs pattern-based clustering of the data and subsequently, performs correlation of the data.
  • the artificial intelligence platform 106 performs correlation and distance-matrix analysis of the data to identify associations between closely-related biomarkers from the data. Thereafter, the data that is associated with less than 50% correlation (such as, data that is associated with 25% correlation therebetween) is identified as the high-dimensional noise and removed.
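  • the 50% correlation threshold described above can be sketched as follows. The disclosure does not say how the correlation of a feature is summarized, so this example assumes that a feature is treated as noise when its strongest absolute Pearson correlation with any other feature is below 0.5. The names pearson and filter_noise are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def filter_noise(features, threshold=0.5):
    """Split features into kept features and noise.

    features: dict mapping a feature name to its values across subjects.
    A feature is kept when its strongest absolute correlation with any
    other feature reaches the threshold (an assumed reading of "50%").
    """
    kept, noise = {}, []
    for name, values in features.items():
        best = max(
            (abs(pearson(values, other))
             for other_name, other in features.items() if other_name != name),
            default=0.0,
        )
        if best >= threshold:
            kept[name] = values
        else:
            noise.append(name)
    return kept, noise

features = {
    "gene_a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "gene_b": [2.0, 4.0, 6.0, 8.0, 10.0, 13.0],
    "probe_x": [3.0, 1.0, 4.0, 1.0, 5.0, 2.0],
}
kept, noise = filter_noise(features)
```

  • here gene_a and gene_b are strongly correlated and are kept, while probe_x correlates weakly with both and is flagged as noise.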
  • the artificial intelligence platform 106 identifies at least one missing value associated with the data.
  • the artificial intelligence platform 106 employs the behavior associated with the data to identify the at least one missing value associated with the data or one or more values not applicable for the data. For example, when clinically relevant biomarkers for determining disease susceptibility to lung cancer are to be identified, the artificial intelligence platform 106 employs the behavior associated with the data to determine population of individuals associated with habits such as drinking, smoking and so forth, to identify the at least one missing value that may correspond to a number of years a person was associated with smoking behavior. Thereafter, the artificial intelligence platform 106 imputes the at least one missing value associated with the data. The artificial intelligence platform 106 generates a prediction score for the data.
  • the artificial intelligence platform 106 identifies features from the data, such that the features are highly predictive of clinical outcomes.
  • the clinical outcomes correspond to susceptibility to a specific disease within the population and the identified features enable determination of the disease susceptibility within the population.
  • the features can comprise genomic features, such as, presence of a certain gene within the patient population but absent within the healthy population and phenomic features, such as, physical traits associated with the patient population.
  • the artificial intelligence platform 106 generates a prediction score for the data, based on the identified features. It will be appreciated that, when the identified features are highly predictive of a specific clinical outcome, the prediction score will be high. Alternatively, when the identified features are comparatively less predictive of the specific clinical outcome, the prediction score will be comparatively low.
  • the artificial intelligence platform 106 generates a visual representation of the data comprising the imputed at least one missing value therein.
  • the visual representation of the data can correspond to a graph that is generated using the data comprising the at least one missing value, such that a user can manipulate the graph (such as, by hiding one or more portions of the graph, highlighting one or more portions of the graph and so forth) to enable efficient visualization of the data.
  • the visual representation of the data corresponds to a plurality of parameters.
  • the visual representation of the data enables classification of patients into different patient groups by considering different types of data such as, multi-omics data.
  • the system 100 comprises a display unit 108 communicatively coupled to the data processing unit 104.
  • the display unit 108 can be implemented as a touchscreen of a smartphone, a screen of a laptop computer and suchlike.
  • the display unit 108 receives the visual representation of the data from the data processing unit 104 and displays the visual representation of the data on a graphical user interface 110 associated therewith.
  • the graphical user interface 110 is presented on the display unit 108 and allows manipulation of the visual representation of the data by the user.
  • the graphical user interface 110 presents one or more buttons, options, sliders and so forth for allowing the user to show/hide certain portions of the visual representation of the data, display specific portions of the visual representation of the data using different colors, enlarge/reduce a size of the visual representation of the data on the graphical user interface 110 and so forth.
  • the system 100 will enable easy and efficient application thereof in various applications including but not limited to, data analysis, data classification and generation of prediction models, biomarker discovery and the like.
  • the system 100 can be employed in hospitals, clinical studies and trials; by physicians, oncologists, patients, scientists, researchers and so forth for easy accessibility to data cleaning and data normalization.
  • the system 100 can be employed for classifying clinical outcomes, visual interpretation of research outcomes, classification of patient groups, modality-based stratification and the like.
  • the system 100 can be employed in other applications that correspond to big data analysis and big data integration.
  • the system 100 improves ease of utilization by experts of various technical and non-technical fields without requiring programming knowledge, such as, for advancing clinical translation, for performing molecular and biomarker discoveries, using multi-omics analytical approach and so forth.
  • the artificial intelligence platform 106 implemented on the data processing unit 104 is trained using at least one training dataset and wherein the data storing unit 102 stores the at least one training dataset.
  • the at least one training dataset comprises data associated with patients that responded to a particular treatment, patients associated with a specific expression (such as protein expression), patients associated with a particular mutation pattern and suchlike.
  • the training data is retrieved from the data storing unit 102 by the data processing unit 104 and used for training the artificial intelligence platform 106.
  • such training allows the artificial intelligence platform 106 to predict a response as seen in patients, based on presence of similar expression (such as protein expression) and mutation pattern found in other patients.
  • the system 100 can be employed within a hospital for prediction of response to treatment, potential adverse-events, survival, hot- and cold-tumor classification and so forth, for various patients receiving treatment within the hospital.
  • FIG. 2 is a flowchart 200 illustrating steps of a method for performing multi-omics data integration, in accordance with an embodiment of the present disclosure.
  • the method is performed by the data processing unit 104.
  • a behaviour and a distribution associated with the data are analysed.
  • the data is processed to determine a trend associated with the data, for identifying high-dimensional noise-to-signal ratio associated with the data.
  • the high-dimensional noise associated with the data is removed.
  • at least one missing value associated with the data is identified.
  • the at least one missing value associated with the data is imputed.
  • a prediction score for the data is generated.
  • a visual representation of the data comprising the imputed at least one missing value therein is generated.
  • the visual representation of the data corresponds to a plurality of parameters.
  • the generated visual representation of the data is presented, such as, on a graphical user interface.
  • FIG. 3 discloses an exemplary block diagram 300 of a client-server configuration for performing multi-omics data integration, in accordance with an example embodiment of the present disclosure.
  • the client side comprises an application interface 304 that can be accessed by a user 302.
  • the server side comprises an Artificial Intelligence Platform (AIP) 306 deployed in the server.
  • the artificial intelligence platform 306 is similar to the artificial intelligence platform 106 explained in FIG. 1.
  • the AIP 306 deployed in the server may communicate with the application interface 304 in the client-side through a communication network 308.
  • the AIP 306 may comprise a plurality of data 312, a data processing module 314, a data exploration and analysis module 316, a data integration module 318, and a prediction module 320.
  • the user 302 may access the AIP 306 by providing a client ID 310 to the application interface 304 wherein each user 302 is provided a unique client ID 310.
  • the AIP 306 may perform multi-omics data integration on the server side or on the client side based on the deployment.
  • the AIP 306 is deployed on a portable media device like a smartphone of the user 302.
  • the AIP 306 is deployed in the server side.
  • the AIP 306 may initially ingest the plurality of data 312 from various sources.
  • the plurality of data 312 comprises different biological data.
  • the data ingestion process is built by taking into consideration variability of format, data types and vendors.
  • the AIP 306 identifies primary objects from the data such as study, patient (subject), samples, assay, and storage. Each of these objects can be associated with any attribute, thereby allowing for flexible ingestion of data into the system.
  • Each of these objects is interconnected, thereby maintaining a consistent and interconnected data lake environment for data management.
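  • the object model described above (study, patient, samples, assay and storage, each carrying arbitrary attributes and linked to one another) could be represented roughly as follows. The class names DataObject and DataLake, their methods and the example attributes are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass, field

@dataclass
class DataObject:
    kind: str                                   # "study", "patient", "sample", "assay" or "storage"
    object_id: str
    attributes: dict = field(default_factory=dict)   # free-form, format-agnostic attributes
    links: list = field(default_factory=list)        # ids of connected objects

class DataLake:
    """In-memory registry of interconnected primary objects."""

    def __init__(self):
        self.objects = {}

    def add(self, kind, object_id, **attributes):
        obj = DataObject(kind, object_id, dict(attributes))
        self.objects[object_id] = obj
        return obj

    def link(self, first_id, second_id):
        # Links are stored in both directions so either object can be a starting point.
        self.objects[first_id].links.append(second_id)
        self.objects[second_id].links.append(first_id)

    def neighbours(self, object_id, kind):
        return [self.objects[i] for i in self.objects[object_id].links
                if self.objects[i].kind == kind]

lake = DataLake()
lake.add("study", "ST1", title="example cohort")
lake.add("patient", "P1", age=42)
lake.add("sample", "S1", tissue="blood", vendor="lab_a")
lake.link("ST1", "P1")
lake.link("P1", "S1")
patient_samples = lake.neighbours("P1", "sample")
```

  • because attributes are free-form, data from different vendors and formats can be attached without changing the schema, which matches the flexibility the passage describes.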
  • the AIP 306 may perform preprocessing on the ingested plurality of data 312 by using the preprocessing module 314.
  • the preprocessing module 314 performs a systematic preprocessing of data to make it fit for any further analysis.
  • the preprocessing module 314 performs data cleansing on the plurality of data 312.
  • the data preprocessing module 314 may perform data cleansing by removing highly sparse variables, identification of outliers for de-noising and finally imputing the dataset with correct values so that the plurality of data 312 can be made ready for analysis.
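  • a minimal sketch of the first two cleansing steps (removing highly sparse variables and identifying outliers) is given below. The disclosure gives no thresholds or outlier criterion; the 50% missingness cut-off and the median-absolute-deviation rule with a cut-off of 3.5 are assumptions, and the names drop_sparse and flag_outliers are hypothetical. Missing values are represented as None.

```python
import statistics

def drop_sparse(table, max_missing=0.5):
    """Keep variables whose fraction of missing values is at most max_missing."""
    return {
        name: values
        for name, values in table.items()
        if sum(v is None for v in values) / len(values) <= max_missing
    }

def flag_outliers(values, cutoff=3.5):
    """Return positions of outliers using the modified z-score (median and MAD)."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    mad = statistics.median([abs(v - median) for v in observed])
    if mad == 0:
        return []
    return [
        i for i, v in enumerate(values)
        if v is not None and 0.6745 * abs(v - median) / mad > cutoff
    ]

table = {
    "gene_a": [1.0, 1.2, 0.9, 1.1, 9.0, None],
    "gene_b": [None, None, None, None, 2.0, None],
}
dense = drop_sparse(table)
outlier_positions = flag_outliers(dense["gene_a"])
```

  • in this toy table gene_b is dropped as highly sparse and the value 9.0 in gene_a is flagged at position 4; the remaining missing value would then be passed on to imputation.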
  • the preprocessing module 314 performs data imputation on the plurality of data 312.
  • the data imputation takes advantage of the correlation across different omics datasets, on the assumption that a missing feature in one type of omics data can be explained by neighboring features of the same omics data as well as by features from other omics data.
  • the imputation algorithm implemented by the preprocessing module 314 works under the following assumption.
  • the index i indicates the type of omics data
  • $p_i$ is the number of rows of each matrix, corresponding to the features of that omics type (e.g., gene expression), and n is the number of columns, corresponding to the subjects.
  • the preprocessing module 314 may perform single omics imputation and multi omics imputation.
  • the target gene contains missing values located in the first s subjects.
  • $g_t^{miss} \in \mathbb{R}^{1 \times s}$ is the missing vector in the target gene and $g_t^{c} \in \mathbb{R}^{1 \times (n-s)}$ is the complete vector containing the non-missing values.
  • $d_{t,j}$, the distance between the target gene t and another gene j (or eigengene j), is computed.
  • the top k closest genes are used for imputation.
  • KNN imputation estimates $g_t^{miss}$ by a weighted average of the values of neighbouring genes or eigengenes, while the other methods tend to use linear regression.
  • instead of imputing each omics data separately, the preprocessing module 314 combines information from multiple omics data, such as microRNA (G1), mRNA (G2) and DNA methylation (G3), that are identified to be correlated with each other in their elements or components.
  • the basic models are generated based on three types of imputations, i.e., self-imputation and cross-imputation by G2 and by G3, respectively.
  • the self-imputation imputes G1 by itself using the single-omics imputation method as explained in “Single omics imputation”.
  • the cross-imputation imputes G1 by other omics data, e.g., G2.
  • Each missing feature in G1 is imputed individually by exploiting the correlated information from G2.
  • $g_t = [g_t^{miss}, g_t^{c}]$ in G1
  • it is combined with correlated features in G2 to obtain a new missing matrix H.
  • the matrix H is then imputed by self-imputation methods to estimate $g_t^{miss}$.
  • three imputation outputs are obtained for all missing values in G1 by the different omics data, denoted by $G_{1 \leftarrow 1}$, $G_{1 \leftarrow 2}$ and $G_{1 \leftarrow 3}$, respectively.
  • the data preprocessing module 314 implements a least square regression model to combine the outputs from diverse models.
  • the preprocessing module 314 performs data normalization on the plurality of data 312. Normalization ensures that values are transformed into a common scale without distorting differences in the range of the values. In a multi-omics analysis, this step is essential for comparing and contrasting heterogeneous datasets and for running cross-assay analyses.
  • the preprocessing module 314 applies various normality tests such as the chi-squared ($\chi^2$) goodness-of-fit test with its variants, the Kolmogorov-Smirnov (KS) one-sample cumulative probability test, the Shapiro-Wilk (SW) test, the D’Agostino-Pearson (DP) test and the Jarque-Bera (JB) test, as well as tests based on the empirical distribution function such as the Kuiper test, the Watson test, the Cramér-von Mises (CvM) test, and the Anderson-Darling (AD) test.
  • the preprocessing module 314 applies several standardization techniques such as log10, median, min_max, mean_standardization, MRN, mean_centering and the like to normalize a given dataset. The best normalization based on skewness, kurtosis and cumulants is then suggested to the user as a default option.
  • a graphical representation of the data before and after normalization then informs the user's choice of normalization.
  • the arithmetic mean is defined by: and standardized mean by
  • FIG. 4 illustrates a method flowchart for performing normalization of the plurality of data 312 in accordance with an embodiment of the present disclosure.
  • the method 400 is performed by the data preprocessing module 314.
  • the method starts at step 402.
  • the data preprocessing module 314 receives input data.
  • the input data comprises a plurality of input variables such as x1, x2, x3, ..., xn.
  • at step 406, the data preprocessing module 314 calculates a p-value and the value of a test statistic. Based on the calculated p-value and test statistic, at step 408, the data preprocessing module 314 checks whether the data is normalized. If the data is not normalized, the method proceeds to step 410. At step 410, standardizations are applied to the data to transform and normalize it. Subsequently, the method returns to step 406 to recalculate the p-value and test statistic. If it is determined at step 408 that the data is normalized, the method proceeds to step 412. At step 412, the method displays a chart of the data distribution. The chart enables the user to visualize the plurality of data 312 and also enables combining data from various sources.
  • the AIP 306 may subsequently perform data exploration and analysis on the preprocessed data by using the data exploration and analysis module 316.
  • the data exploration and analysis module 316 may perform data exploration and identify correlations in the data.
  • the data exploration and analysis module 316 may analyze the behavior and perform multi omics analysis with the data by employing partial least squares discriminant analysis (PLS-DA) technique. Further, the data exploration and analysis module 316 may determine the trend associated with the data by determining correlations between data of the plurality of datasets.
  • the data exploration and analysis module 316 may perform single omics advanced analysis and multi omics advanced analysis.
  • the single omics advanced analysis combines multiple algorithms to provide an effective way to cluster the data based on the response attribute in question.
  • the main techniques integrated into the framework include correlation with target, binomial regression, logistic regression, linear modelling and DESeq based on the negative binomial distribution (used especially for analysing gene expression). Each of these algorithms is run and the resultant biomarker signature is used for clustering the input dataset based on the response variable.
  • the data exploration and analysis module 316 enables the user to reiterate and flexibly change these algorithms to obtain the best analysis output.
  • the multi omics advanced analysis performed by the data exploration and analysis module 316 is based on the partial least squares discriminant analysis (PLS-DA) technique.
  • the technique includes searching for latent variables with a maximum covariance with the Y-variables (containing the membership information).
  • relevant sources of data variability are modelled by the so-called Latent Variables (LVs), which are linear combinations of the original variables.
  • PLS-DA works to ensure that it maximizes the covariance between the individual components.
  • the AIP 306 may subsequently perform integration of the plurality of data from various sources by implementing the data integration module 318.
  • the data integration module 318 may implement a multi-omics data analysis to integrate the data coming from different platforms and modalities to obtain a biomarker signature which can uniquely define the response variable.
  • the data integration module 318 ensures that the data across modalities, across studies and across time points are harmonized, brought together and scaled to the same level to enable an inter-modality comparison.
  • the integration module is based on partial least squares discriminant analysis (PLS-DA), which ensures that the contributions of multiple modalities towards the parameter in question are accounted for and analyzed to provide the unique biomarker signature.
  • the AIP 306 may subsequently perform prediction on the data by implementing the prediction module 320.
  • the prediction module 320 is configured to train a model using at least one training dataset.
  • the training dataset is the data obtained after performing preprocessing and data exploration and analysis on the plurality of data 312.
  • the prediction module 320 implements various machine learning (ML) algorithms such as PLS-DA, decision trees, random forest and Support Vector Machine (SVM) to train the model for enabling prediction.
  • the AIP 306 may implement the prediction module 320 to predict diseases and body side effects in various use cases including but not limited to hot/cold tumour prediction, drug response / toxicity prediction, survival and recurrence prediction.
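
By way of a non-limiting illustration, the single-omics KNN imputation described above (a distance between the target gene and other genes, selection of the top k closest genes, and a weighted average of their values) may be sketched as follows. The function name `knn_impute`, the Euclidean distance, the inverse-distance weights and the example values are assumptions made for illustration only and are not taken from the disclosure.

```python
import math

def knn_impute(target, candidates, k=2):
    """Fill the None entries of `target` from its k nearest candidate rows.

    target: values of one gene across subjects, with None where missing.
    candidates: complete rows (other genes) over the same subjects.
    Hypothetical helper: distance and weighting choices are assumptions.
    """
    observed = [j for j, v in enumerate(target) if v is not None]
    missing = [j for j, v in enumerate(target) if v is None]

    # Distance between the target gene and another gene, computed only
    # over the subjects for which the target gene is observed.
    def distance(row):
        return math.sqrt(sum((target[j] - row[j]) ** 2 for j in observed))

    nearest = sorted(candidates, key=distance)[:k]
    # Inverse-distance weights; the epsilon avoids division by zero.
    weights = [1.0 / (distance(row) + 1e-9) for row in nearest]
    total = sum(weights)

    imputed = list(target)
    for j in missing:
        imputed[j] = sum(w * row[j] for w, row in zip(weights, nearest)) / total
    return imputed

# Target gene with missing values in the first s = 2 subjects.
gene_t = [None, None, 2.0, 4.0, 6.0]
others = [
    [1.0, 3.0, 2.1, 4.1, 6.1],
    [1.2, 2.8, 1.9, 3.9, 5.9],
    [9.0, 9.0, 9.0, 0.0, 0.0],
]
result = knn_impute(gene_t, others, k=2)
```

In this sketch the third candidate gene is far from the target over the observed subjects, so it is excluded and the two missing values are estimated from the two closest genes only.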

Abstract

Disclosed is a method and system for performing multi-omics data integration. The system comprises a data storing unit for storage of a plurality of datasets. The system also comprises a data processing circuitry implementing an artificial intelligence platform. The artificial intelligence platform analyzes behavior and distribution of the data and processes it to determine associated trend, for identification and removal of noise in high-dimensional data. The artificial intelligence platform identifies at least one missing value associated with the data and imputes it. Subsequently, the artificial intelligence platform generates a prediction score followed by a visual representation for the respective data, which is displayed on a graphical user interface presented on the display unit communicatively coupled to the data processing unit.

Description

SYSTEM AND METHOD FOR PERFORMING MULTI-OMICS DATA INTEGRATION
TECHNICAL FIELD
[0001] The present disclosure relates generally to analysis of biological data and, more specifically, to a system and method for performing multi-omics data integration.
BACKGROUND
[0002] In recent times, many technical processes require gathering of data from different sources and subsequent analysis thereof. For example, a drone is required to gather data from multiple sources during a flight, including but not limited to, using on-board sensors for gathering data along a pre-determined flying route of the drone, using remote servers capable of providing accurate temperature and climate-related data, third-party sources capable of transmitting air traffic-related data and so forth. Subsequently, an on-board data processing unit (or a remote data processing unit) analyses such data to generate an output, such as, a change to the pre-determined flying route.
[0003] Conventionally, in biological science, techniques employed for cleaning and pre-processing of data focus primarily on individual datasets, i.e., single-modality data, and are thus incapable of analyzing data from multiple sources, such as for generating holistic insights from the data. Furthermore, such techniques are highly time-consuming, trial-based and generally prone to errors. For example, cleaning and pre-processing of the data takes, on average, 80% of the time allotted for analysis of the data before the analysis itself can commence.
[0004] Furthermore, in applications concerning big data, the high dimensionality of the data gives rise to various problems. Consequently, such applications require reduction of the dimensionality of the data to be able to accurately process the data. However, conventional systems and methods do not allow satisfactory reduction of dimensionality associated with such big data (such as multi-omics datasets), and there is no single system to address integration of different modalities. Such systems and techniques focus on analysis of individual datasets and are generally incapable of being applied to any type of data. Additionally, such techniques offer limited user-friendliness and thus become tedious to use, as the user has limited choice in combining several types of data. As there are several methods to analyze basic RNA data, and algorithms are regularly updated with better methodologies and techniques, there is a need for a system which allows the user to dynamically choose and combine various biological data to get better insights from the same.
[0005] Therefore, in light of the foregoing discussion, there exists a need to overcome various problems associated with conventional systems and techniques of simultaneous analysis of multiple data types such as multi-omics data.
SUMMARY
[0006] This section provides a general summary of the disclosure, and is not a comprehensive disclosure of its full scope or all of its features.
[0007] The present disclosure seeks to provide a system and method for performing multi-omics data integration.
[0008] According to an aspect, an embodiment of the present disclosure provides a system for performing multi-omics data integration, the system comprising: a data storing unit configured to receive a plurality of datasets from various data sources and store the plurality of datasets; a processing circuitry configured to implement an artificial intelligence platform, the processing circuitry is communicatively coupled to the data storing unit for receiving data from the plurality of datasets, the processing circuitry is further configured to: analyze a behavior and a distribution associated with the data; process the data to determine a trend associated with the data, for identifying and removing high-dimensional noise associated with the data; identify at least one missing value associated with the data; impute the at least one missing value associated with the data; generate a prediction score for the data; and generate a visual representation of the data comprising the imputed at least one missing value therein, wherein the visual representation of the data corresponds to a plurality of parameters; and a display unit communicatively coupled to the processing circuitry, wherein the display unit receives the visual representation of the data from the processing circuitry and is configured to display the visual representation of the data on a graphical user interface.
[0009] In an embodiment, the processing circuitry is configured to analyze the behavior and the distribution associated with the data by employing partial least squares discriminant analysis (PLS-DA) technique.
[0010] In an embodiment, the processing circuitry is configured to determine the trend associated with the data by determining correlations between data of the dataset and wherein data associated with less than 50% correlation is identified as the high-dimensional noise.
[0011] In an embodiment, the processing circuitry is configured to normalize the data based on analysis of the distribution associated with the data.
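By way of a non-limiting illustration, normalization based on analysis of the distribution may be sketched as a loop that computes a test statistic and p-value, checks normality, and otherwise applies a candidate transformation and tests again. The sketch below assumes the Jarque-Bera test (whose asymptotic p-value has a closed form), a significance level of 0.05, and hypothetical implementations of `min_max`, `mean_standardization` and `log10`; none of these specifics is stated in the disclosure.

```python
import math

def jarque_bera(values):
    """Return (statistic, p_value) for the Jarque-Bera normality test."""
    n = len(values)
    mean = sum(values) / n
    # Central moments (biased estimators); assumes the values are not all equal.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    statistic = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    # Under normality the statistic is asymptotically chi-squared with
    # 2 degrees of freedom, whose survival function is exp(-x / 2).
    return statistic, math.exp(-statistic / 2.0)

def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def mean_standardization(values):
    # Assumed here to mean a z-score: subtract the mean, divide by the
    # standard deviation.
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

def log10(values):
    # Only defined for strictly positive values.
    return [math.log10(v) for v in values]

CANDIDATES = [("none", list), ("min_max", min_max),
              ("mean_standardization", mean_standardization), ("log10", log10)]

def suggest_normalization(values, alpha=0.05):
    """Return (name, transformed, p_value) of the first candidate that is
    not rejected as non-normal, else the one with the lowest statistic."""
    best = None
    for name, transform in CANDIDATES:
        transformed = transform(values)
        statistic, p_value = jarque_bera(transformed)
        if p_value > alpha:
            return name, transformed, p_value
        if best is None or statistic < best[3]:
            best = (name, transformed, p_value, statistic)
    return best[:3]

# Right-skewed example values that become symmetric on a log scale.
raw = [10 ** z for z in (-1.5, -1.0, -0.5, -0.25, 0.0, 0.25, 0.5, 1.0, 1.5)]
name, normalized, p_value = suggest_normalization(raw)
```

Rescalings such as min-max and mean standardization leave skewness and kurtosis unchanged, so in this sketch only a shape-changing transformation such as the logarithm can alter the outcome of the normality check. The asymptotic p-value is a coarse approximation for small samples.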
[0012] In an embodiment, the processing circuitry is configured to train a model using at least one training dataset and store the trained model having the at least one training dataset in the data storing unit.
[0013] In an embodiment, the processing circuitry is configured to impute the at least one missing value by one or more of single-omics data imputation and multi-omics data imputation.
[0014] According to an aspect, an embodiment of the present disclosure provides a method for performing multi-omics data integration, the method comprising: receiving a plurality of datasets from various data sources; implementing an artificial intelligence platform on a data of the received plurality of datasets, wherein the implementing comprises: analyzing a behavior and a distribution associated with the data; processing the data to determine a trend associated with the data, for identifying high-dimensional noise associated with the data; removing the high-dimensional noise associated with the data; identifying at least one missing value associated with the data; imputing the at least one missing value associated with the data; generating a prediction score for the data; and generating a visual representation of the data comprising the imputed at least one missing value therein, wherein the visual representation of the data corresponds to a plurality of parameters; and displaying the visual representation of the data on a graphical user interface.
[0015] In an embodiment, the method comprises analyzing the behavior and the distribution associated with the data by employing partial least squares discriminant analysis (PLS-DA) technique.
[0016] In an embodiment, the method comprises determining the trend associated with the data by determining correlations between data of the plurality of datasets and wherein data associated with less than 50% correlation is identified as the high-dimensional noise.
[0017] In an embodiment, the method comprises normalizing the data based on analysis of the distribution associated with the data.
[0018] In an embodiment, the method comprises training a model using at least one training dataset; and storing the trained model having the at least one training dataset in the data storing unit.
[0019] In an embodiment, the method comprises imputing the at least one missing value by one or more of single-omics data imputation and multi-omics data imputation.
[0020] Further areas of applicability will become apparent from the description provided herein. It will be appreciated that features of the present disclosure are susceptible to being combined in various combinations without departing from the scope of the present disclosure as defined by the appended claims. The description and specific examples in this summary are intended for purposes of illustration only and are not intended to limit the scope of the present disclosure.
DESCRIPTION OF THE DRAWINGS
[0021] The summary above, as well as the following detailed description of illustrative embodiments, is better understood when read in conjunction with the appended drawings. For the purpose of illustrating the present disclosure, exemplary constructions of the disclosure are shown in the drawings. However, the present disclosure is not limited to specific methods and instrumentalities disclosed herein. Moreover, those in the art will understand that the drawings are not to scale. Wherever possible, like elements have been indicated by identical numbers.
[0022] Embodiments of the present disclosure will now be described, by way of example only, with reference to the following diagrams wherein:
[0023] FIG. 1 is a block diagram of a system for performing multi-omics data integration, in accordance with an embodiment of the present disclosure;
[0024] FIG. 2 is a flowchart illustrating steps of a method for performing multi-omics data integration, in accordance with an embodiment of the present disclosure;
[0025] FIG. 3 shows an exemplary block diagram of a client server configuration for performing multi-omics data integration, in accordance with an example embodiment of the present disclosure; and
[0026] FIG. 4 shows a method flowchart for performing normalization of the plurality of data, in accordance with an embodiment of the present disclosure.
[0027] In the accompanying drawings, an underlined number is employed to represent an item over which the underlined number is positioned or an item to which the underlined number is adjacent. A non-underlined number relates to an item identified by a line linking the non-underlined number to the item. When a number is non-underlined and accompanied by an associated arrow, the non-underlined number is used to identify a general item at which the arrow is pointing.
DESCRIPTION OF EMBODIMENTS
[0028] Exemplary embodiments will now be described more fully with reference to the accompanying drawings.
[0029] The exemplary embodiments are provided so that this disclosure will be thorough, and will fully convey the scope to persons skilled in the art. Numerous specific details are set forth such as examples of specific components, devices, and methods, to provide a thorough understanding of embodiments of the present disclosure. It will be appreciated by persons skilled in the art that specific details need not be employed. Exemplary embodiments may be embodied in many different forms. Thus, neither should be construed to limit the scope of the disclosure. In some example embodiments, well-known processes, well-known device structures, and well-known technologies are not described in detail.
[0030] The terminology used herein is for the purpose of describing particular exemplary embodiments only and is not intended to be limiting. As used herein, singular forms such as “a,” “an,” and “the” may be intended to include corresponding plural forms as well, unless the context clearly indicates otherwise. Furthermore, terms akin to “comprises,” “comprising,” “including,” and “having,” are inclusive and therefore, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
[0031] When an element or layer is referred to as being “on,” “engaged to,” “connected to,” or “coupled to” another element or layer, it may be disposed directly on, engaged, connected or coupled to the other element or layer, or intervening elements or layers may be present therein. However, when an element is referred to as being “directly on,” “directly engaged to,” “directly connected to,” or “directly coupled to” another element or layer, there may be no intervening elements or layers present. Other words used to describe a relationship between elements should be interpreted in a like manner (e.g., “between” versus “directly between,” “adjacent” versus “directly adjacent,” etc.). As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items.
[0032] Spatially relative terms such as “inner,” “outer,” “beneath,” “below,” “lower,” “above,” “upper,” and the like may be used herein for ease of description, to describe an element’s or a feature's relationship to another element(s) or feature(s) as illustrated in the figures. Furthermore, spatially relative terms may be intended to encompass different orientations of the device in use or in operation, in addition to one or more orientations depicted in the figures. For example, if the device in the figures is turned over, elements described as “below” or “beneath” other elements or features would then be oriented “above” the other elements or features. Thus, the example term “below” can encompass both an orientation of above and below. It will be appreciated that the device may be otherwise oriented (rotated 90 degrees or at other orientations) and the spatially relative descriptors used herein should be interpreted accordingly.
[0033] Although the terms first, second, third, etc. may be used herein to describe various elements, components, regions, layers and/or sections, these elements, components, regions, layers and/or sections should not be limited by these terms. These terms may be only used to distinguish one element, component, region, layer or section from another region, layer or section. Terms such as “first,” “second,” and other numerical terms when used herein do not imply a sequence or order unless clearly indicated by the context. Thus, a first element, component, region, layer or section discussed below could be termed a second element, component, region, layer or section without departing from the teachings of the example embodiments.
[0034] The term “user” relates to at least one individual that uses or operates the system or arrangement or device (or other variants thereof) as claimed, such as, by interacting with at least one component of the system or arrangement or device (or other variants thereof).
[0035] Moreover, if any method steps, processes, and operations are described, they are not to be construed as necessarily requiring their performance in the particular order discussed or illustrated, unless specifically identified as an order of performance. It is also to be understood that additional or alternative steps may be employed.
[0036] The system allows simultaneous, reliable and efficient analysis of multiple types of data (such as, multi-omics data). The system comprising the artificial intelligence platform implemented on the data processing unit, allows robust identification of high-dimensional noise associated with the data and subsequent reduction thereof. The system enables generation of the visual representation of the data, thereby, improving user-friendly attributes associated with use of the system for analysis of the data, such as, for allowing users with minimal/no training to use the system to perform simultaneous analysis of multiple types of data.
[0037] In overview, embodiments of the present disclosure are concerned with systems and methods for performing multi-omics data integration.
[0038] FIG. 1 is a block diagram of a system 100 for performing multi-omics data integration, in accordance with an embodiment of the present disclosure. The system 100 comprises a data storing unit 102 for storage of a plurality of datasets. The data storing unit 102 can be implemented as a computer-readable medium for storage of data therein. For example, the data storing unit 102 can be implemented as transitory or non-transitory forms of computer-readable media, including but not limited to, a hard disk, a random access memory (RAM), a read only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fibre, a portable compact disc read-only memory (CD-ROM), an optical storage device, a digital versatile disk (DVD), a static random access memory (SRAM), a memory stick, a floppy disk and so forth. Alternatively, the data storing unit 102 can be implemented using one or more databases (such as, an arrangement of more than one database communicatively coupled to each other), for example, as a cloud-based database.
[0039] The data storing unit 102 stores the plurality of datasets. In an embodiment, the plurality of datasets comprises patient-related data. Such patient-related data can comprise data that is captured at different layers (corresponding to multilayer-omics data), for example, DNA-level data (associated with mutation, methylation and so forth), RNA-level data (such as, expression of genes, microRNAs, and the like), protein-related data (including intracellular protein expression, surface protein expression and suchlike), imaging data (such as, pathological images, radiological images and the like), clinical data (drug-response data, adverse-event data and suchlike) and so forth.
[0040] Furthermore, the system 100 comprises a data processing unit 104 implementing an artificial intelligence platform 106 therein. The data processing unit 104 can be implemented as a device capable of performing computations and operations on general or specific data, such as, by receiving the data as an input and performing analysis of the data to yield one or more outputs. For example, the data processing unit 104 can be implemented as a dedicated processor, a processing circuitry, a portion of a processor, a virtual processor, a portion of a virtual processor, portion of a virtual device, or a virtual device. It will be appreciated that the virtual processor may correspond to one or more parts of a physical processor. Furthermore, the data processing unit 104 can implement therein, various software platforms executing instructions or logic, such that the instructions or logic may be distributed and executed across one or more processors, virtual or physical, to execute the instructions or logic. Such execution of the instructions or logic by the data processing unit 104 via the software platforms allows the data processing unit 104 to yield the output by processing the data. The artificial intelligence platform 106 is one such software platform. Furthermore, the data processing unit 104 can comprise specialized hardware for allowing implementation and execution of the artificial intelligence platform 106.
[0041] The artificial intelligence platform 106 can be implemented as a software-based tool for identification of clinically-relevant biomarker features from the patient-related data. For example, the clinically-relevant biomarker features can be a combination of biomarkers (such as, DNA sequences) that are associated with susceptibility to a particular disease within a patient population. In such an example, such biomarker features would be absent among healthy population, such as, a population of people not having developed the particular disease. In one example, the patient population is associated with people having developed a specific form of cancer by a given age (such as, before an age of 30 years).
[0042] The data processing unit 104 is communicatively coupled to the data storing unit 102 for receiving data from the plurality of datasets using a network 112. The network 112 may comprise suitable logic, circuitry, and interfaces that may be configured to provide a plurality of network ports and a plurality of communication channels for transmission and reception of data. Each network port may correspond to a virtual address (or a physical machine address) for transmission and reception of the communication data. For example, the virtual address may be an Internet Protocol Version 4 (IPv4) (or an IPv6 address) and the physical address may be a Media Access Control (MAC) address. The network 112 may be associated with an application layer for implementation of communication protocols based on one or more communication requests from at least one of the one or more communication devices. The communication data may be transmitted or received, via the communication protocols. Examples of such wired and wireless communication protocols may include, but are not limited to, Transmission Control Protocol and Internet Protocol (TCP/IP), User Datagram Protocol (UDP), Hypertext Transfer Protocol (HTTP), File Transfer Protocol (FTP), ZigBee, EDGE, infrared (IR), IEEE 802.11, 802.16, cellular communication protocols, and/or Bluetooth (BT) communication protocols.
[0043] Examples of the network 112 may include, but are not limited to, a wireless channel, a wired channel, or a combination thereof. The wireless or wired channel may be associated with a network standard which may be defined by one of a Local Area Network (LAN), a Personal Area Network (PAN), a Wireless Local Area Network (WLAN), a Wireless Sensor Network (WSN), a Wide Area Network (WAN), a Wireless Wide Area Network (WWAN), a Long Term Evolution (LTE) network, a plain old telephone service (POTS), and a Metropolitan Area Network (MAN). Additionally, the wired channel may be selected on the basis of bandwidth criteria. For example, an optical fiber channel may be used for high-bandwidth communication. Further, a coaxial cable-based or Ethernet-based communication channel may be used for moderate-bandwidth communication.
[0044] The data processing unit 104 receives the patient-related data from the data storing unit 102 and, optionally, performs integration of the data. It will be appreciated that the patient-related data can comprise multi-omics data, such as, genomic-data and phenomic-data. Consequently, the data processing unit 104 receives such patient-related data comprising the genomic-data and the phenomic-data associated with the patients from the data storing unit 102 and provides the data to the artificial intelligence platform 106. The artificial intelligence platform 106 performs integration of the genomic-data and the phenomic-data (referred to as “multi-omics data” hereinafter) for the patient population.
[0045] The artificial intelligence platform 106 analyzes a behavior and a distribution associated with the data. The artificial intelligence platform 106 analyzes the integrated multi-omics data for determining the behavior and distribution associated with the data. It will be appreciated that such an analysis of the integrated multi-omics data using the behavior and distribution thereof, instead of using physical parameters associated with the data, enables efficient, faster and reliable analysis of the multi-omics data as compared to conventional techniques that individually analyze the data based on physical parameters thereof. Optionally, the artificial intelligence platform 106 employs one or more other characteristic details associated with the data. In one example, the distribution associated with the data corresponds to normal distribution.
[0046] In an embodiment, the artificial intelligence platform 106 implemented on the data processing unit 104 analyzes the behavior and the distribution associated with the data by employing a partial least squares discriminant analysis (PLS-DA) technique. For example, the artificial intelligence platform 106 employs the PLS-DA technique to identify associations between different features from the data, to identify the behavior and the distribution associated with the data. For example, the artificial intelligence platform 106 identifies associations between biomarkers from the data corresponding to mutation from DNA-level data and expression from RNA-level data.
[0047] In one embodiment, the artificial intelligence platform 106 implemented on the data processing unit 104 normalizes the data based on analysis of the distribution associated with the data. It will be appreciated that each type of data can be different from each other, based on a source of the data (such as, data that is acquired using different devices, different labs employing various protocols and so forth). The artificial intelligence platform 106 identifies a pattern corresponding to the distribution associated with the data and employs an optimal normalization technique based on the distribution.
[0048] Furthermore, the artificial intelligence platform 106 processes the data to determine a trend associated with the data, for identifying high-dimensional noise associated with the data. It will be appreciated that the multi-omics data comprising the genomic-data and phenomic-data will be associated with high-dimensional noise therein. Correspondingly, the artificial intelligence platform 106 performs correlation for determining the high-dimensional noise associated with the data, by using the behavior and distribution of the data. Optionally, the artificial intelligence platform 106 employs one or more additional algorithms for determination of the high-dimensional noise. Moreover, the artificial intelligence platform 106 removes the high-dimensional noise associated with the data. It will be appreciated that not all parameters within the integrated multi-omics data will be responsible for determination of clinically relevant outcomes. For example, when the clinically relevant outcome corresponds to determination of susceptibility of a population to a certain disease and a specific genomic feature is known with certainty not to be associated with the susceptibility to that disease, the genomic feature is removed as high-dimensional noise. Correspondingly, the artificial intelligence platform 106 removes the determined high-dimensional noise associated with the data.
[0049] Optionally, the artificial intelligence platform 106 implemented on the data processing unit 104 determines the trend associated with the data by determining correlations between data of the dataset, wherein data associated with less than 50% correlation is identified as the high-dimensional noise. Optionally, the artificial intelligence platform 106 performs pattern-based clustering of the data and subsequently performs correlation of the data. For example, the artificial intelligence platform 106 performs correlation and distance-matrix analysis of the data to identify associations between closely-related biomarkers from the data. Thereafter, the data that is associated with less than 50% correlation (such as, data that is associated with 25% correlation therebetween) is identified as the high-dimensional noise and removed.
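The 50% correlation criterion described above can be illustrated with a minimal sketch. The disclosure does not state how the correlation of a feature is aggregated, so the sketch assumes that a feature is treated as noise when its strongest absolute Pearson correlation with any other feature is below 0.5. The function names and the toy feature values are hypothetical and are not part of the disclosure.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant feature carries no correlation signal
    return sxy / math.sqrt(sxx * syy)

def drop_weakly_correlated(features, threshold=0.5):
    """Split features into (kept, noise).

    Assumption: a feature is noise when its strongest absolute correlation
    with any other feature is below the threshold.
    """
    kept, noise = {}, {}
    for name, values in features.items():
        best = max(
            (abs(pearson(values, other))
             for other_name, other in features.items() if other_name != name),
            default=0.0,
        )
        (kept if best >= threshold else noise)[name] = values
    return kept, noise

# Hypothetical toy data: five subjects, three features.
features = {
    "gene_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "gene_b": [2.1, 3.9, 6.2, 8.1, 9.8],  # follows gene_a closely
    "gene_c": [3.0, 1.0, 3.0, 1.0, 3.0],  # essentially uncorrelated with both
}
kept, noise = drop_weakly_correlated(features)
```

In this toy input, gene_a and gene_b are retained and gene_c is flagged as noise. A production system would likely apply the same rule after the pattern-based clustering mentioned above, which this sketch omits.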
[0050] The artificial intelligence platform 106 identifies at least one missing value associated with the data. The artificial intelligence platform 106 employs the behavior associated with the data to identify the at least one missing value associated with the data or one or more values not applicable for the data. For example, when clinically relevant biomarkers for determining disease susceptibility to lung cancer are to be identified, the artificial intelligence platform 106 employs the behavior associated with the data to determine population of individuals associated with habits such as drinking, smoking and so forth, to identify the at least one missing value that may correspond to a number of years a person was associated with smoking behavior. Thereafter, the artificial intelligence platform 106 imputes the at least one missing value associated with the data. The artificial intelligence platform 106 generates a prediction score for the data. The artificial intelligence platform 106 identifies features from the data, such that the features are highly predictive of clinical outcomes. For example, the clinical outcomes correspond to susceptibility to a specific disease within the population and the identified features enable determination of the disease susceptibility within the population. In such an example, the features can comprise genomic features, such as, presence of a certain gene within the patient population but absent within the healthy population and phenomic features, such as, physical traits associated with the patient population. Subsequently, the artificial intelligence platform 106 generates a prediction score for the data, based on the identified features. It will be appreciated that, when the identified features are highly predictive of a specific clinical outcome, the prediction score will be high. 
Alternatively, when the identified features are comparatively less predictive of the specific clinical outcome, the prediction score will be comparatively low.
[0051] Moreover, the artificial intelligence platform 106 generates a visual representation of the data comprising the imputed at least one missing value therein. For example, the visual representation of the data can correspond to a graph that is generated using the data comprising the imputed at least one missing value, such that a user can manipulate the graph (such as, by hiding one or more portions of the graph, highlighting one or more portions of the graph and so forth) to enable efficient visualization of the data. The visual representation of the data corresponds to a plurality of parameters. For example, the visual representation of the data enables classification of patients into different patient-groups by considering different types of data, such as multi-omics data.
[0052] Furthermore, the system 100 comprises a display unit 108 communicatively coupled to the data processing unit 104. For example, the display unit 108 can be implemented as a touchscreen of a smartphone, a screen of a laptop computer and suchlike. The display unit 108 receives the visual representation of the data from the data processing unit 104 and displays the visual representation of the data on a graphical user interface 110 associated therewith. The graphical user interface 110 is presented on the display unit 108 and allows manipulation of the visual representation of the data by the user. For example, the graphical user interface 110 presents one or more buttons, options, sliders and so forth for allowing the user to show/hide certain portions of the visual representation of the data, display specific portions of the visual representation of the data using different colors, enlarge/reduce a size of the visual representation of the data on the graphical user interface 110 and so forth.
[0053] It will be appreciated that the system 100 can be applied easily and efficiently in various applications including, but not limited to, data analysis, data classification and generation of prediction models, biomarker discovery and the like. For example, the system 100 can be employed in hospitals, clinical studies and trials; by physicians, oncologists, patients, scientists, researchers and so forth for easy accessibility to data cleaning and data normalization. In such an example, the system 100 can be employed for classifying clinical outcomes, visual interpretation of research outcomes, classification of patient groups, modality-based stratification and the like. Furthermore, the system 100 can be employed in other applications that correspond to big data analysis and big data integration. It will be appreciated that the system 100 improves ease of utilization by experts of various technical and non-technical fields without requiring programming knowledge, such as, for advancing clinical translation, for performing molecular and biomarker discoveries, using a multi-omics analytical approach and so forth.
[0054] Optionally, the artificial intelligence platform 106 implemented on the data processing unit 104 is trained using at least one training dataset, wherein the data storing unit 102 stores the at least one training dataset. For example, the at least one training dataset comprises data associated with patients that responded to a particular treatment, patients associated with a specific expression (such as protein expression), patients associated with a particular mutation pattern and suchlike. The training data is retrieved from the data storing unit 102 by the data processing unit 104 and used for training the artificial intelligence platform 106. For example, such training allows the artificial intelligence platform 106 to predict a response as seen in patients, based on presence of similar expression (such as protein expression) and mutation pattern found in other patients. In one example, the system 100 can be employed within a hospital for prediction of response to treatment, potential adverse-events, survival, hot- and cold-tumor classification and so forth, for various patients receiving treatment within the hospital.
[0055] FIG. 2 is a flowchart 200 illustrating steps of a method for performing multi-omics data integration, in accordance with an embodiment of the present disclosure. The method is performed by the data processing unit 104. At a step 202, a behaviour and a distribution associated with the data are analysed. At a step 204, the data is processed to determine a trend associated with the data, for identifying high-dimensional noise-to-signal ratio associated with the data. At a step 206, the high-dimensional noise associated with the data is removed. At a step 208, at least one missing value associated with the data is identified. At a step 210, the at least one missing value associated with the data is imputed. At a step 212, a prediction score for the data is generated. At a step 214, a visual representation of the data comprising the imputed at least one missing value therein is generated. The visual representation of the data corresponds to a plurality of parameters. Subsequently, the generated visual representation of the data is presented, such as, on a graphical user interface.
[0056] An exemplary multi-omics data integration in a client-server configuration is explained with reference to FIG. 3.
[0057] FIG. 3 discloses an exemplary block diagram 300 of a client-server configuration for performing multi-omics data integration, in accordance with an example embodiment of the present disclosure. The client side comprises an application interface 304 that can be accessed by a user 302. The server side comprises an Artificial Intelligence Platform (AIP) 306 deployed in the server. The artificial intelligence platform 306 is similar to the artificial intelligence platform 106 explained with reference to FIG. 1. The AIP 306 deployed in the server may communicate with the application interface 304 on the client side through a communication network 308. The AIP 306 may comprise a plurality of data 312, a data preprocessing module 314, a data exploration and analysis module 316, a data integration module 318, and a prediction module 320. The user 302 may access the AIP 306 by providing a client ID 310 to the application interface 304, wherein each user 302 is provided a unique client ID 310.
[0058] The AIP 306 may perform multi-omics data integration on the server side or on the client side based on the deployment. In an example embodiment, the AIP 306 is deployed on a portable media device, such as a smartphone of the user 302. In another example embodiment, the AIP 306 is deployed on the server side. The AIP 306 may initially ingest the plurality of data 312 from various sources. The plurality of data 312 comprises different biological data. The data ingestion process is built by taking into consideration variability of format, data types and vendors. The AIP 306 identifies primary objects from the data such as study, patient (subject), samples, assay, and storage. Each of these objects can be associated with any attributes, thereby allowing for flexible ingestion of data into the system. The objects are interconnected, thereby maintaining a consistent and interconnected data lake environment for data management.
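The object model described above (primary objects that carry arbitrary attributes and are interconnected) can be sketched as a small data structure. This is only an illustrative assumption about how such objects might be represented; the class name Entity, the helper find_linked and all identifiers and attribute values below are hypothetical and do not appear in the disclosure.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass(eq=False)  # identity comparison, so cyclic links are safe
class Entity:
    """One primary object: study, patient, sample, assay or storage."""
    kind: str
    entity_id: str
    attributes: Dict[str, Any] = field(default_factory=dict)
    links: List["Entity"] = field(default_factory=list, repr=False)

    def link(self, other: "Entity") -> None:
        """Connect two objects in both directions."""
        if other not in self.links:
            self.links.append(other)
        if self not in other.links:
            other.links.append(self)

def find_linked(start: Entity, kind: str) -> List[Entity]:
    """Walk the links from `start` and return every reachable object of `kind`."""
    seen, queue, found = {id(start)}, [start], []
    while queue:
        current = queue.pop(0)
        if current.kind == kind and current is not start:
            found.append(current)
        for neighbour in current.links:
            if id(neighbour) not in seen:
                seen.add(id(neighbour))
                queue.append(neighbour)
    return found

# Hypothetical records with free-form attributes.
study = Entity("study", "ST-1", {"title": "example cohort"})
patient = Entity("patient", "P-001", {"smoker": True, "years_smoking": 12})
sample = Entity("sample", "S-001", {"tissue": "blood"})
study.link(patient)
patient.link(sample)

samples_in_study = find_linked(study, "sample")
```

Because attributes are stored in a free-form dictionary, a new vendor format can add fields without a schema change, which is one way to obtain the flexible ingestion described above.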
[0059] The AIP 306 may perform preprocessing on the ingested plurality of data 312 by using the preprocessing module 314. The preprocessing module 314 performs a systematic preprocessing of data to make it fit for any further analysis. The preprocessing module 314 performs data cleansing on the plurality of data 312. The data preprocessing module 314 may perform data cleansing by removing highly sparse variables, identifying outliers for de-noising and finally imputing the dataset with correct values so that the plurality of data 312 can be made ready for analysis. In an example embodiment, the preprocessing module 314 performs data imputation on the plurality of data 312. The data imputation takes advantage of the correlation across different omic datasets, with an assumption that a missing feature from one type of omic data can be explained by its neighboring features of the same omics data as well as by features from other omics data. The imputation algorithm implemented by the preprocessing module 314 proceeds through the following steps.
A: Initialize by replacing all missing values in all matrices $G_i$, $i = 1, 2, 3$, using self-imputation methods to obtain complete matrices $\{G_i^{(0)}\}$.
B: Self-impute $G_1$ based on $G_1^{(h-1)}$; cross-impute $G_1$ by $G_2^{(h-1)}$ and $G_3^{(h-1)}$ using the multi-omics imputation method to obtain $G_1^{(h)}$.
C: Integrate the multiple imputation models; a least squares regression model is used to combine the outputs.
Figure imgf000019_0001
Where ‘i’ indicates the type of omics data, $p_i$ is the number of rows of each matrix corresponding to different types of features (e.g., gene expression) and $n$ is the number of columns corresponding to different subjects. The missing point at the m-th feature on the l-th subject is denoted by $G_i^{m,l}$, $m = 1, 2, \dots, p_i$, $l = 1, 2, \dots, n$. The preprocessing module 314 may perform single omics imputation and multi omics imputation. In the single omics imputation, for each single omics data matrix, global methods (e.g., BPCA and SVDimpute) and local methods (e.g., KNN-impute, LLS and iLLS) are implemented to explore neighbouring global or local features to impute missing features. Consider a target gene $g_t \in \mathbb{R}^{n}$ in $G_1 \in \mathbb{R}^{p_1 \times n}$.
Without loss of generality, it is assumed that the target gene contains missing values located in the first s subjects. Hence,
$g_t = [g_t^{miss}, g_t^{c}]$
where $g_t^{miss} \in \mathbb{R}^{1 \times s}$ is the missing vector in the target gene and $g_t^{c} \in \mathbb{R}^{1 \times (n - s)}$ is the complete vector containing non-missing values.
[0060] To estimate the missing vector $g_t^{miss}$, firstly, the Euclidean distance $d_{t,j}$ between the target gene t and each other gene j (or eigengene j) is computed. Subsequently, the top k closest genes are used for imputation. Specifically, KNN-impute estimates $g_t^{miss}$ by averaging the weighted values of the neighbouring genes or eigengenes, while the other methods tend to use linear regression.
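The nearest-neighbour estimate described above can be sketched as follows. The disclosure does not specify the weighting scheme, so the sketch assumes inverse-distance weights and assumes that distances are computed only over the subjects where the target gene is observed. The function name knn_impute_target and the toy values are hypothetical.

```python
import math

def knn_impute_target(target, others, k=2):
    """Fill None entries of `target` from its k nearest fully observed genes.

    target: list of floats with None at the missing subjects.
    others: list of complete gene vectors with the same length as target.
    Assumption: neighbours are weighted by inverse Euclidean distance.
    """
    observed = [i for i, v in enumerate(target) if v is not None]
    missing = [i for i, v in enumerate(target) if v is None]

    def distance(gene):
        # Euclidean distance restricted to the subjects observed in the target.
        return math.sqrt(sum((target[i] - gene[i]) ** 2 for i in observed))

    neighbours = sorted(others, key=distance)[:k]
    weights = [1.0 / (distance(g) + 1e-9) for g in neighbours]
    total = sum(weights)

    filled = list(target)
    for i in missing:
        filled[i] = sum(w * g[i] for w, g in zip(weights, neighbours)) / total
    return filled

# Hypothetical toy data: the first s = 2 subjects are missing in the target gene.
target_gene = [None, None, 3.0, 4.0, 5.0]
other_genes = [
    [1.0, 2.0, 3.0, 4.0, 5.0],  # identical on the observed subjects
    [1.2, 2.2, 3.1, 4.1, 5.1],  # close neighbour
    [9.0, 9.0, 0.0, 0.0, 0.0],  # distant gene, excluded when k = 2
]
imputed = knn_impute_target(target_gene, other_genes, k=2)
```

Because the first neighbour matches the target exactly on the observed subjects, its weight dominates and the imputed values are very close to 1.0 and 2.0. The regression-based local methods named above (LLS, iLLS) would replace the weighted average with a fitted linear model.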
[0061] In the multi omics imputation, instead of imputing each omics data separately, the preprocessing module 314 combines information from various omics data such as microRNA ($G_1$), mRNA ($G_2$) and DNA methylation ($G_3$), which are identified to be correlated with each other in their elements or components. The basic models are generated based on three types of imputation, i.e., self-imputation, cross-imputation by $G_2$ and cross-imputation by $G_3$. The self-imputation is to impute $G_1$ by itself using the single-omics imputation method explained above under single omics imputation. The cross-imputation is to impute $G_1$ by other omics data, e.g., $G_2$. Each missing feature in $G_1$ is imputed individually by exploiting the correlated information from $G_2$. Each target gene $g_t = [g_t^{miss}, g_t^{c}]$ in $G_1$ is combined with correlated features in $G_2$ to obtain a new missing matrix H. The matrix H is then imputed by self-imputation methods to estimate $g_t^{miss}$. Eventually, three imputation outputs are obtained for all missing values in $G_1$ from the different omics data, denoted by $G_{1 \leftarrow 1}$, $G_{1 \leftarrow 2}$ and $G_{1 \leftarrow 3}$ respectively. Subsequently, to integrate the multiple imputation models, the data preprocessing module 314 implements a least squares regression model to combine the outputs from the diverse models.
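The least squares combination of the three imputation outputs can be sketched as below. The disclosure does not say on which entries the regression is fitted, so the sketch assumes that combination weights are learned on held-out entries whose true values are known (for example, observed values that were masked on purpose) and are then applied to the estimates at the truly missing positions. All function names and numbers are hypothetical.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_combination_weights(estimates, truth):
    """Least squares weights w minimising sum((truth - sum_j w_j * est_j)^2).

    estimates: one list of estimates per imputation model, all at the same
    held-out positions; truth: the known values at those positions.
    Solved through the normal equations (E^T E) w = E^T truth.
    """
    p = len(estimates)
    gram = [[sum(a * b for a, b in zip(estimates[i], estimates[j]))
             for j in range(p)] for i in range(p)]
    rhs = [sum(a * t for a, t in zip(estimates[i], truth)) for i in range(p)]
    return solve_linear(gram, rhs)

def combine(estimates_at_missing, weights):
    """Weighted sum of the model outputs at each missing position."""
    return [sum(w * e for w, e in zip(weights, column))
            for column in zip(*estimates_at_missing)]

# Hypothetical held-out estimates from self-imputation and two cross-imputations.
self_est = [1.0, 2.0, 3.0, 4.0]
cross_g2 = [1.5, 1.8, 3.4, 3.9]
cross_g3 = [0.7, 2.5, 2.6, 4.4]
truth = [0.5 * a + 0.3 * b + 0.2 * c
         for a, b, c in zip(self_est, cross_g2, cross_g3)]

weights = fit_combination_weights([self_est, cross_g2, cross_g3], truth)
final = combine([[2.0], [2.2], [1.9]], weights)  # one truly missing entry
```

The toy truth is constructed as an exact mixture of the three models, so the fitted weights recover that mixture. With real data the fit would be approximate, and an intercept or non-negativity constraint might be added.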
[0062] In an example embodiment, the preprocessing module 314 performs data normalization on the plurality of data 312. Normalization ensures that values are transformed into a common scale without distorting differences in the range of the values. In a multi-omics analysis, this is a very important step because it allows heterogeneous datasets to be compared and contrasted and cross-assay analysis to be run. The preprocessing module 314 applies various normality tests such as the χ² goodness-of-fit test with its variants, the Kolmogorov-Smirnov (KS) one-sample cumulative probability test, the Shapiro-Wilk (SW) test, the D’Agostino-Pearson (DP) test and the Jarque-Bera (JB) test, and tests based on the empirical distribution function such as the Kuiper test, the Watson test, the Cramer-von Mises (CvM) test, and the Anderson-Darling (AD) test. The preprocessing module 314 performs several standardization techniques such as log10, median, min_max, mean_standardization, MRN, mean_centering and the like to normalize a given dataset. The best normalization based on skewness, kurtosis and cumulants is then suggested to the user as a default option. A graphical representation of the data before and after normalization then informs the user's choice of normalization.
Let $X_1, X_2, \dots, X_n$ be a random sample from a distribution with finite mean $\mu = E(X_1)$. The arithmetic mean is defined by:
Figure imgf000021_0001
and standardized mean by
Figure imgf000021_0002
For n = 2, this depends on only two random variables, $X_1$ and $X_2$, which provides many realizations of the statistic, so that a long trajectory is obtained by dividing the data into blocks of length 2. The process for normalization of the plurality of data 312 by the data preprocessing module 314 is explained with reference to FIG. 4.
[0063] FIG. 4 illustrates a method flowchart for performing normalization of the plurality of data 312, in accordance with an embodiment of the present disclosure. The method 400 is performed by the data preprocessing module 314. The method starts at step 402. At step 404, the data preprocessing module 314 receives input data. The input data comprises a plurality of input variables such as x1, x2, x3, ..., xn. At step 406, the data preprocessing module 314 calculates the p-value and the value of the test statistic. Based on the calculated p-value and test statistic, at step 408, the data preprocessing module 314 checks whether the data is normalized. If the data is not normalized, the method proceeds to step 410. At step 410, standardizations are applied to the data to transform and normalize it. Subsequently, the method returns to step 406 to recalculate the p-value and test statistic. At step 408, if it is determined that the data is normalized, the method proceeds to step 412. At step 412, the method displays a chart comprising the data distribution. The chart enables the user to visualize the plurality of data 312 and also enables combining data from various sources.
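The test, transform and retest loop of FIG. 4 can be sketched with one of the normality tests named in paragraph [0062]. The sketch assumes the Jarque-Bera test, a significance level of 0.05, and three of the listed standardizations; the function names, the significance level and the synthetic input are assumptions made for illustration only.

```python
import math
from statistics import NormalDist

def jarque_bera(xs):
    """Return (statistic, p_value) of the Jarque-Bera normality test."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    stat = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    # Asymptotically chi-square with 2 degrees of freedom under normality;
    # the survival function of that distribution is exp(-x / 2).
    return stat, math.exp(-stat / 2.0)

def min_max(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def mean_centering(xs):
    mean = sum(xs) / len(xs)
    return [x - mean for x in xs]

# Candidate standardizations. Only log10 changes the shape of the
# distribution; the other two rescale or shift the values.
TRANSFORMS = {
    "log10": lambda xs: [math.log10(x) for x in xs],  # needs positive values
    "min_max": min_max,
    "mean_centering": mean_centering,
}

def normalize(xs, alpha=0.05):
    """Loop of FIG. 4: test (406), check (408), transform (410), retest."""
    _, p = jarque_bera(xs)                    # step 406
    if p >= alpha:                            # step 408: already acceptable
        return "none", xs, p
    best = ("none", xs, p)
    for name, transform in TRANSFORMS.items():
        candidate = transform(xs)             # step 410
        _, p = jarque_bera(candidate)         # back to step 406
        if p >= alpha:                        # step 408
            return name, candidate, p
        if p > best[2]:
            best = (name, candidate, p)
    return best                               # no transform passed the test

# Synthetic right-skewed input: powers of ten of evenly spaced normal quantiles.
quantiles = [NormalDist().inv_cdf((i + 0.5) / 30) for i in range(30)]
raw = [10 ** z for z in quantiles]
chosen, transformed, p_value = normalize(raw)
```

For this synthetic input the raw values fail the test and the log10 transform is selected. A fuller version would rank the candidates by skewness, kurtosis and cumulants as described above, and would pass the before and after distributions to the chart of step 412.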
[0064] Referring to FIG. 3, the AIP 306 may subsequently perform data exploration and analysis on the preprocessed data by using the data exploration and analysis module 316. The data exploration and analysis module 316 may perform data exploration and identify correlations in the data. In an example embodiment, the data exploration and analysis module 316 may analyze the behavior and perform multi omics analysis with the data by employing the partial least squares discriminant analysis (PLS-DA) technique. Further, the data exploration and analysis module 316 may determine the trend associated with the data by determining correlations between data of the plurality of datasets. The data exploration and analysis module 316 may perform single omics advanced analysis and multi omics advanced analysis. The single omics advanced analysis is based on a combination of multiple algorithms to provide an effective way to finally cluster the data based on the response attribute in question. The main techniques integrated into the framework include correlation with target, binomial regression, logistic regression, linear modelling and DESeq, which is based on the negative binomial distribution (used especially for analysing gene expression). Each of these algorithms is run and the resultant biomarker signature is used for clustering the input dataset based on the response variable. The data exploration and analysis module 316 enables the user to iterate and flexibly change these algorithms to obtain the best analysis output. The multi omics advanced analysis performed by the data exploration and analysis module 316 is based on the partial least squares discriminant analysis (PLS-DA) technique. The technique includes searching for latent variables with a maximum covariance with the Y-variables (containing the membership information). In case of PLS-DA, relevant sources of data variability are modelled by the so-called Latent Variables (LVs), which are linear combinations of the original variables.
For example, consider data coming from three modalities: mRNA ($X_1$), proteins ($X_2$) and metabolites ($X_3$). These belong to different samples for which we are interested in understanding the classification of their subtype (stored in class variable Y).
For each component $h = 1, 2, 3, \dots, H$:
Figure imgf000023_0001
where the parameter indexed by $(q)$ is the penalizing parameter for each block of data, and $a_h^{(q)}$ is the loading vector on component $h$ associated with the residual (deflated) matrix $X_h^{(q)}$.
PLS-DA works to ensure that it maximizes the covariance between the individual components.
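The covariance-maximizing idea behind PLS-DA can be illustrated on a single data block. The sketch below is a simplification and not the multi-block, penalized formulation referred to above: it assumes one block, two classes coded 0 and 1, no penalty, and only the first latent variable, whose unit-length weight vector is proportional to the product of the centred data matrix (transposed) and the centred class vector. The function names and the toy values are hypothetical.

```python
import math

def centre_columns(rows):
    """Subtract the column mean from every entry."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]

def first_pls_component(x_rows, labels):
    """Return (w, scores) for the first latent variable of a two-class PLS-DA.

    w is the unit-length weight vector maximizing the covariance between the
    scores X w and the centred class vector y; it is proportional to X^T y.
    """
    xc = centre_columns(x_rows)
    y_mean = sum(labels) / len(labels)
    yc = [y - y_mean for y in labels]
    raw = [sum(row[j] * y for row, y in zip(xc, yc))
           for j in range(len(xc[0]))]
    norm = math.sqrt(sum(v * v for v in raw))
    w = [v / norm for v in raw]
    scores = [sum(a * b for a, b in zip(row, w)) for row in xc]
    return w, scores

# Hypothetical toy block: six samples, three features, two subtypes.
x_block = [
    [1.0, 5.0, 2.0],
    [1.2, 4.0, 2.1],
    [0.9, 6.0, 1.9],
    [3.0, 5.1, 2.6],
    [3.2, 4.2, 2.4],
    [2.9, 5.8, 2.5],
]
subtype = [0, 0, 0, 1, 1, 1]
w, scores = first_pls_component(x_block, subtype)
```

In the toy block the first feature separates the subtypes, so it receives the largest weight, and the scores of the two subtypes fall on opposite sides of zero. Further components would be obtained by deflating the block with the fitted scores and repeating the step, and a multi-block analysis would additionally link the score vectors of the different modalities.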
[0065] When we compare the outcome of PLS-DA with individual analysis, we can see that the prediction through the multiomics signature is far superior in the ensemble model when compared to individual model.
[0066] The AIP 306 may subsequently perform integration of the plurality of data from various sources by implementing the data integration module 318. The data integration module 318 may implement a multi-omic data analysis to integrate the data coming from different platforms and modalities to obtain a biomarker signature which can uniquely define the response variable. The data integration module 318 ensures that the data across modalities, across studies and across time points are harmonized, brought together and scaled to the same level to enable an inter-modality comparison. The integration module is based on the partial least squares discriminant analysis (PLS-DA), which ensures that the contribution of multiple modalities is accounted for and that their contribution towards the parameter in question is analyzed to provide the unique biomarker signature.
[0067] The AIP 306 may subsequently perform prediction on the data by implementing the prediction module 320. The prediction module 320 is configured to train a model using at least one training dataset. The training dataset is the data obtained after performing preprocessing and data exploration and analysis on the plurality of data 312. The prediction module 320 implements various machine learning (ML) algorithms such as PLS-DA, decision trees, random forest and Support Vector Machine (SVM) to train the model for enabling prediction. The AIP 306 may implement the prediction module 320 to predict diseases and side effects in various use cases including, but not limited to, hot/cold tumour prediction, drug response/toxicity prediction, and survival and recurrence prediction.
[0068] The foregoing description of the embodiments has been provided for purposes of illustration and description. Modifications to embodiments of the invention described in the foregoing are possible without departing from the scope of the invention as defined by the accompanying claims. It is not intended to be exhaustive or to limit the disclosure. Individual elements or features of a particular embodiment are generally not limited to that particular embodiment, but, where applicable, are interchangeable and can be used in a selected embodiment, even if not specifically shown or described. The same may also be varied in many ways. Such variations are not to be regarded as a departure from the disclosure, and all such modifications are intended to be included within the scope of the disclosure.

Claims

1. A system for performing multi-omics data integration, the system comprising: a data storing unit configured to receive a plurality of datasets from various data sources and store the plurality of datasets; a processing circuitry configured to implement an artificial intelligence platform, wherein the processing circuitry is communicatively coupled to the data storing unit for receiving data from the plurality of datasets, wherein the processing circuitry is further configured to: analyze a behavior and a distribution associated with the data; process the data to determine a trend associated with the data, for identifying high-dimensional noise associated with the data; remove the high-dimensional noise associated with the data; identify at least one missing value associated with the data; impute the at least one missing value associated with the data; generate a prediction score for the data; and generate a visual representation of the data comprising the imputed at least one missing value, wherein the visual representation of the data corresponds to a plurality of parameters; and a display unit communicatively coupled to the processing circuitry, wherein the display unit receives the visual representation of the data from the processing circuitry and is configured to display the visual representation of the data on a graphical user interface.
2. The system as claimed in claim 1, wherein the processing circuitry is configured to analyze the behavior and perform multi omics analysis with the data by employing partial least squares discriminant analysis (PLS-DA) technique.
3. The system as claimed in claim 1, wherein the processing circuitry is configured to determine the trend associated with the data by determining correlations between data of the plurality of datasets.
4. The system as claimed in claim 1, wherein the processing circuitry is configured to normalize the data based on analysis of the distribution associated with the data.
5. The system as claimed in claim 1, wherein the processing circuitry is configured to: train a model using at least one training dataset; and store the trained model having the at least one training dataset in the data storing unit.
6. The system as claimed in claim 1, wherein the processing circuitry is configured to impute the at least one missing value by one or more of single-omics data imputation and multi-omics data imputation.
7. A method for performing multi-omics data integration, the method comprising: receiving a plurality of datasets from various data sources; implementing an artificial intelligence platform on a data of the received plurality of datasets, wherein the implementing comprises: analyzing a behavior and a distribution associated with the data; processing the data to determine a trend associated with the data, for identifying high-dimensional noise associated with the data; removing the high-dimensional noise associated with the data; identifying at least one missing value associated with the data; imputing the at least one missing value associated with the data; generating a prediction score for the data; and generating a visual representation of the data comprising the imputed at least one missing value, wherein the visual representation of the data corresponds to a plurality of parameters; and displaying the visual representation of the data on a graphical user interface.
8. The method as claimed in claim 7, wherein the method comprises analyzing the behavior and performing multi omics analysis with the data by employing partial least squares discriminant analysis (PLS-DA) technique.
9. The method as claimed in claim 7, wherein the method comprises determining the trend associated with the data by determining correlations between data of the plurality of datasets.
10. The method as claimed in claim 7, wherein the method comprises normalizing the data based on analysis of the distribution associated with the data.
11. The method as claimed in claim 7, wherein the method comprises: training a model using at least one training dataset; and storing the trained model having the at least one training dataset in the data storing unit.
12. The method as claimed in claim 7, wherein the method comprises imputing the at least one missing value by one or more of single-omics data imputation and multi-omics data imputation.
PCT/IN2021/050390 2020-04-20 2021-04-20 System and method for performing multi-omics data integration WO2021214787A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IN202021016959 2020-04-20

Publications (1)

Publication Number Publication Date
WO2021214787A1 true WO2021214787A1 (en) 2021-10-28

Family

ID=78270401

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IN2021/050390 WO2021214787A1 (en) 2020-04-20 2021-04-20 System and method for performing multi-omics data integration

Country Status (1)

Country Link
WO (1) WO2021214787A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130226838A1 (en) * 2012-02-23 2013-08-29 International Business Machines Corporation Missing value imputation for predictive models

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
BALLABIO DAVIDE, CONSONNI VIVIANA: "Classification tools in chemistry. Part 1: linear models. PLS-DA", ANALYTICAL METHODS, ROYAL SOCIETY OF CHEMISTRY, GB, vol. 5, no. 16, 6 June 2013 (2013-06-06), GB , pages 3790 - 3798, XP055867465, ISSN: 1759-9660, DOI: 10.1039/c3ay40582f *

Karađuzović-Hadžiabdić et al. Artificial intelligence in clinical decision-making for diagnosis of cardiovascular disease using epigenetics mechanisms
Patel et al. Big data analytics of genomic and clinical data for diagnosis and prognosis of cancer
Awotunde et al. Big data analytics enabled deep convolutional neural network for the diagnosis of cancer

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21792230

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21792230

Country of ref document: EP

Kind code of ref document: A1