BG112772A

BG112772A - Method for adaptive in silico knowledge discovery based on big genomic data analytics

Info

Publication number: BG112772A
Application number: BG112772A
Authority: BG
Inventors: Десислава Иванова; Иванова Боровска Пламенка; Пламенка Боровска; Антонова Иванова Десислава
Original assignee: Технически Университет - София
Priority date: 2018-07-12
Filing date: 2018-07-12
Publication date: 2020-01-31
Also published as: BG67367B1

Abstract

A method for adaptive in silico knowledge discovery and decision-making based on big genomic data analytics, comprising two parallel and correlated computational phases, a machine learning phase and an operational phase, which overlap and run simultaneously while exchanging information. Each phase operates over four types of computational workflows for data analysis: descriptive, diagnostic, predictive and prescriptive. The descriptive, diagnostic and predictive differentiated workflows are model-based, while the prescriptive differentiated workflows use models to construct recommendations for personalized therapy and rules for classifying the patient into a target group for precision therapy. The input data comprise four sets: sequenced genomes, clinical test results, parameters of the patient's individual lifestyle, and environmental factors. Within the machine learning phase, parallel differentiated workflows are built, and those with the highest scores for accuracy, precision and sensitivity are used to construct an integrated knowledge discovery workflow with the following outputs: mapping of cancer-related genes, detection of mutations, personalized cancer diagnosis, and recommendation of a target group for precision therapy of the patient. The output data are verified by an expert; if necessary, the data sets are modified and the differentiated workflows are executed again in the machine learning phase. The process is repeated iteratively until the expert oncologist verifies the results. The method is adaptive with respect to genetic, biological and medical aspects, as well as computational aspects: scalability and reconfiguration of hardware and software resources.

Description

METHOD FOR ADAPTIVE IN SILICO KNOWLEDGE DISCOVERY AND DECISION-MAKING BASED ON BIG GENOMIC DATA ANALYTICS

FIELD OF THE INVENTION

The field is interdisciplinary and covers bioinformatics, computer science, artificial intelligence and precision medicine. The focus is on discovering new in silico knowledge from the analysis of big genomic data for the purposes of computational biology and of personalized and precision medicine. In silico medicine, personalized medicine and precision medicine are "hot" areas of modern research. In silico medicine, also known as computational medicine, is the application of in silico research to problems in healthcare and medicine. It is the direct use of computer models and simulations in the diagnosis, treatment or prevention of a disease. Personalized medicine refers to tailoring medical treatment to the individual characteristics of each patient. It does not imply creating drugs or medical devices that are unique to a patient, but rather the ability to classify individuals into subpopulations that differ in their susceptibility to a particular disease, in the biology and/or prognosis of the diseases they may develop, or in their response to a specific treatment.

BACKGROUND OF THE INVENTION

Big Data is currently regarded as a revolution in research and one of the most promising trends in IT. This has driven the intensive development of methods and technologies for processing large data sets in recent years and has led to radical changes in research paradigms. The preceding research paradigm is computational science. It encompasses computer models and simulations (in silico experimentation), which became necessary because theoretical analysis is extremely complex and in many cases inapplicable. Computer simulations generate a huge amount of experimental data. A disadvantage is that statistical sampling is required to reduce the volume of data to be processed.

TECHNICAL ESSENCE

The method comprises two parallel and correlated computational phases, a machine learning phase and an operational phase, which overlap and run simultaneously while exchanging information. Both phases are based on models and rules and apply classification and clustering methods in parallel when analyzing the data. Each phase functions as a computational pipeline with three main components: (1) data pre-processing; (2) in silico knowledge discovery and automated decision-making; and (3) post-processing of the results, namely visualization and evaluation of the usefulness of the discovered knowledge.

In the data pre-processing stage, features are selected using metaheuristic algorithms for combinatorial search together with principal component analysis, and the feature set is reduced by iteratively executing the machine learning method. Post-processing covers verification, validation and evaluation of the usefulness and applicability of the discovered knowledge, as well as visualization of the results.
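The iterative reduction of the feature set can be illustrated with a minimal sketch. The description does not name a specific metaheuristic, so a simple stochastic local search stands in for it here; the function names, the feature names and the toy scoring function are all hypothetical, and in practice the score would come from training and validating a model on the candidate subset.

```python
import random
from typing import Callable, FrozenSet, Sequence


def local_search_feature_selection(
    features: Sequence[str],
    score: Callable[[FrozenSet[str]], float],
    iterations: int = 200,
    seed: int = 0,
) -> FrozenSet[str]:
    """Stochastic local search over feature subsets (a stand-in metaheuristic).

    Starts from the full feature set, toggles one random feature per
    iteration, and keeps the change only if the score strictly improves.
    """
    rng = random.Random(seed)
    current = frozenset(features)
    best_score = score(current)
    for _ in range(iterations):
        toggled = rng.choice(list(features))
        candidate = current ^ {toggled}  # add the feature if absent, drop it if present
        if not candidate:
            continue  # never evaluate an empty feature set
        candidate_score = score(candidate)
        if candidate_score > best_score:
            current, best_score = candidate, candidate_score
    return current


# Hypothetical toy score: reward informative features, penalize subset size.
INFORMATIVE = frozenset({"BRCA1_variant", "BRCA2_variant"})


def toy_score(subset: FrozenSet[str]) -> float:
    return len(subset & INFORMATIVE) - 0.1 * len(subset)


if __name__ == "__main__":
    all_features = ["BRCA1_variant", "BRCA2_variant", "shoe_size", "postcode"]
    print(sorted(local_search_feature_selection(all_features, toy_score)))
```

Principal component analysis, which the description pairs with the combinatorial search, is not shown; it would typically be applied to the numeric features before or after this selection step.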

Each phase operates over four types of computational workflows for data analysis: descriptive, diagnostic, predictive and prescriptive. A computational workflow is a template that defines a consistent implementation of processes, or a flow of tasks, that are scheduled and coordinated according to a systematic plan. Scientific workflows provide a method for defining the objectives of an experiment at a high level, modeled as a workflow of scientific tasks, especially where the output of one task serves as the input to the next.
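The idea that the output of one task serves as the input to the next can be sketched as a chain of callables. This is only an illustration under assumed names (`run_workflow`, `normalize`, `gc_fraction` are hypothetical); the description does not prescribe any particular workflow engine.

```python
from functools import reduce
from typing import Any, Callable, Sequence

Task = Callable[[Any], Any]


def run_workflow(tasks: Sequence[Task], initial_input: Any) -> Any:
    """Run tasks in order, feeding each task's output into the next task."""
    return reduce(lambda data, task: task(data), tasks, initial_input)


def normalize(sequence: str) -> str:
    """Strip whitespace and upper-case a nucleotide string."""
    return sequence.strip().upper()


def gc_fraction(sequence: str) -> float:
    """Fraction of G and C bases in a non-empty nucleotide string."""
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


if __name__ == "__main__":
    print(run_workflow([normalize, gc_fraction], " acgt\n"))
```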

In the case of breast cancer, the descriptive analytical workflow is responsible for identifying the genes associated with breast cancer (BRCA1 and BRCA2) and mapping them in the patient's genome. The purpose of the diagnostic analytical workflow is to detect possible mutations in the genes associated with breast cancer. The predictive analytical workflow takes as input the results of the descriptive analytical workflow and data on the type of cancer cells, and determines the type of cancer, its malignancy and an estimate of life expectancy. The prescriptive analytical workflow constructs recommendations for personalized therapy from the output of the diagnostic workflow, data on the patient's individual lifestyle and environmental factors, and then classifies the patient into a target group for precision therapy.
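As a rough illustration of what the diagnostic step involves, the sketch below reports single-base substitutions between a reference gene sequence and a patient sequence. It assumes the two sequences have already been aligned to equal length and ignores insertions and deletions; the function name and the toy sequences are hypothetical and are not real BRCA1 or BRCA2 data.

```python
from typing import List, Tuple

Substitution = Tuple[int, str, str]  # (1-based position, reference base, sample base)


def find_substitutions(reference: str, sample: str) -> List[Substitution]:
    """List positions where two pre-aligned, equal-length sequences differ."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be pre-aligned to equal length")
    return [
        (position, ref_base, sample_base)
        for position, (ref_base, sample_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != sample_base
    ]


if __name__ == "__main__":
    # Toy strings for illustration only.
    print(find_substitutions("ATGGATTTATC", "ATGGACTTATC"))
```

A real diagnostic workflow would rely on sequence alignment software and annotated variant databases rather than a position-by-position comparison.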

The descriptive, diagnostic and predictive differentiated workflows are model-based because of the enormous diversity of genomes and of their intergenic regions, chiefly the structural specifics of promoters and enhancers. The prescriptive differentiated workflow uses models to construct recommendations for personalized therapy and rules for classifying the patient into a target group for precision therapy.

During the machine learning phase, a repository of synthesized collections of models and rules is built; these are used in the operational phase to build an integrated workflow. The machine learning phase runs offline on the training and validation data sets. Its main computational units are four collections of differentiated workflows, each containing differentiated workflows of a single type: descriptive, diagnostic, predictive or prescriptive. The operational phase runs online and processes input data streams: the patient's sequenced genome, clinical test data, data on the patient's individual lifestyle, and environmental factors. The main computational unit in the operational phase is an integrated knowledge discovery workflow composed of four differentiated workflows (descriptive, diagnostic, predictive and prescriptive).

In the machine learning phase, four packages of differentiated workflows are processed in parallel, using different machine learning models, including classification and clustering methods. The training data sets are of four types: (1) the patient's genetic data, (2) clinical test results, (3) parameters of the patient's individual lifestyle, and (4) environmental factors.
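The parallel processing of the four packages can be sketched with Python's standard thread pool. This is a simplified stand-in for a cluster platform; `train_package`, `train_all_packages` and the candidate names are hypothetical, and the "training" is a placeholder that only records which model would be built.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

WORKFLOW_TYPES = ("descriptive", "diagnostic", "predictive", "prescriptive")


def train_package(workflow_type: str, candidates: List[str]) -> Dict[str, str]:
    """Placeholder training: a real system would fit one model per candidate."""
    return {name: f"{workflow_type}-model-from-{name}" for name in candidates}


def train_all_packages(packages: Dict[str, List[str]]) -> Dict[str, Dict[str, str]]:
    """Train every package concurrently and collect the results by workflow type."""
    with ThreadPoolExecutor(max_workers=max(1, len(packages))) as pool:
        futures = {
            workflow_type: pool.submit(train_package, workflow_type, candidates)
            for workflow_type, candidates in packages.items()
        }
        return {workflow_type: future.result() for workflow_type, future in futures.items()}


if __name__ == "__main__":
    packages = {t: ["classifier", "clustering"] for t in WORKFLOW_TYPES}
    for workflow_type, models in train_all_packages(packages).items():
        print(workflow_type, models)
```

Threads are used here only to keep the sketch self-contained; CPU-bound training would normally be distributed across processes or cluster nodes.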

The machine learning phase takes a long time to execute, so the packages of differentiated workflows are processed in parallel to speed it up. Each differentiated workflow builds a model, which is stored in the workflow repository. The output of each differentiated workflow is validated and verified by an expert in molecular biology, a geneticist or an oncologist, respectively. If validation and verification succeed, the workflow is saved in the repository and then scored for accuracy, precision and sensitivity. The workflows in the repository are compared on accuracy, precision and sensitivity, and the optimal workflows are selected from each package to build the integrated workflow in the operational phase.
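The three scores and the selection step can be sketched from confusion-matrix counts, using the standard definitions: accuracy = (TP + TN) / total, precision = TP / (TP + FP), sensitivity = TP / (TP + FN). The description does not state how the three scores are combined when workflows are compared, so the unweighted mean used below is an assumption, as are all names in the code.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives


def _ratio(numerator: int, denominator: int) -> float:
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def accuracy(c: ConfusionCounts) -> float:
    return _ratio(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn)


def precision(c: ConfusionCounts) -> float:
    return _ratio(c.tp, c.tp + c.fp)


def sensitivity(c: ConfusionCounts) -> float:
    return _ratio(c.tp, c.tp + c.fn)


def combined_score(c: ConfusionCounts) -> float:
    """Assumed combination rule: unweighted mean of the three scores."""
    return (accuracy(c) + precision(c) + sensitivity(c)) / 3


def select_best(package: Dict[str, ConfusionCounts]) -> str:
    """Return the name of the highest-scoring workflow in one package."""
    return max(package, key=lambda name: combined_score(package[name]))


if __name__ == "__main__":
    diagnostic_package = {
        "wf_a": ConfusionCounts(tp=40, fp=10, tn=40, fn=10),
        "wf_b": ConfusionCounts(tp=45, fp=5, tn=45, fn=5),
    }
    print(select_best(diagnostic_package))
```

Applying `select_best` once per package yields the four differentiated workflows from which the integrated workflow is assembled.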

Where verification and validation of the workflow results fail, the data sets are modified and updated and are used again in the machine learning phase.

Within the operational phase, a single integrated workflow is executed, composed of the four optimal differentiated workflows from the repository, one from each package. The operational phase runs online and processes the patient's streaming data: genetic data, clinical test results, parameters of the patient's individual lifestyle, and environmental factors. The operational phase generates the following output data: the patient's genetic specifics, in particular those of the genes associated with breast cancer; mutations; a personalized diagnosis of the patient's cancer; estimates of the cancer's malignancy and of life expectancy; recommendations for personalized therapy; and the target group for precision therapy.

The knowledge discovered in the operational phase is evaluated by an expert oncologist, whose assessment can be "confirmation", "rejection" or "modification". If the assessment is "rejection" or "modification", the differentiated workflows used within the integrated workflow in the operational phase are marked as "invalid" in the repository, the corresponding training and validation data sets are modified and updated, and a new round of training is started in the machine learning phase. The process is repeated iteratively until the expert oncologist issues a "confirmation".
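The adaptive loop driven by the expert's assessment can be sketched as follows. The four callables and the `max_rounds` cap are assumptions introduced for illustration (the description iterates until confirmation and sets no limit); in a real system `review` would be a human decision, not a function.

```python
from enum import Enum
from typing import Any, Callable, List, Tuple


class Verdict(Enum):
    CONFIRM = "confirmation"
    REJECT = "rejection"
    MODIFY = "modification"


def adapt_until_confirmed(
    train: Callable[[Any], Any],          # machine learning phase: data sets -> workflow
    operate: Callable[[Any], Any],        # operational phase: workflow -> discovered knowledge
    review: Callable[[Any], Verdict],     # expert oncologist's assessment
    update: Callable[[Any], Any],         # modify and update the data sets
    datasets: Any,
    max_rounds: int = 10,                 # assumed safety cap, not part of the description
) -> Tuple[Any, List[Any]]:
    """Retrain and re-run until the expert confirms; return (knowledge, invalid workflows)."""
    invalid: List[Any] = []
    for _ in range(max_rounds):
        workflow = train(datasets)
        knowledge = operate(workflow)
        if review(knowledge) is Verdict.CONFIRM:
            return knowledge, invalid
        invalid.append(workflow)          # mark the rejected workflow as invalid
        datasets = update(datasets)
    raise RuntimeError("no confirmation within max_rounds")


if __name__ == "__main__":
    knowledge, invalid = adapt_until_confirmed(
        train=lambda version: f"wf-v{version}",
        operate=lambda workflow: workflow.upper(),
        review=lambda k: Verdict.CONFIRM if k == "WF-V3" else Verdict.REJECT,
        update=lambda version: version + 1,
        datasets=1,
    )
    print(knowledge, invalid)
```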

DESCRIPTION OF THE ATTACHED FIGURES

Figure 1 Scheme of the method with its two phases, the machine learning phase and the operational phase

Figure 2 Scheme of the machine learning phase of the method

Figure 3 Configuration of the integrated workflow in the operational phase from the differentiated workflows created in the machine learning phase that have the highest scores for accuracy, precision and sensitivity

Figure 4 Conceptual model of a smart digital consultant for breast cancer, implemented on the basis of the method

EXAMPLES OF IMPLEMENTATION

The proposed method is applicable to the design, implementation and development of software for a smart digital consultant that supports oncologists (for breast cancer) in processing, managing and interpreting the vast amount of information involved in diagnosing the disease.

The software is written in Python and developed in the environment of the Apache Spark cluster platform, using the corresponding machine learning and numerical processing libraries, as well as software for multiple alignment of biological sequences.

Claims

A method for adaptive in silico knowledge discovery and decision-making based on big genomic data analytics, comprising two parallel and correlated computational phases, a machine learning phase and an operational phase, which overlap and run simultaneously while exchanging information. Each phase operates over four types of computational workflows for data analysis: descriptive, diagnostic, predictive and prescriptive. The descriptive, diagnostic and predictive differentiated workflows are model-based, while the prescriptive differentiated workflows use models to construct recommendations for personalized therapy and rules for classifying the patient into a target group for precision therapy. The input data comprise four sets: sequenced genomes, clinical test results, parameters of the patient's individual lifestyle, and environmental factors. Within the machine learning phase, parallel differentiated workflows are built, and those with the highest scores for accuracy, precision and sensitivity are used to construct an integrated knowledge discovery workflow with the following outputs: mapping of cancer-related genes, detection of mutations, personalized cancer diagnosis, and recommendation of a target group for precision therapy of the patient. The output data are verified by an expert; if necessary, the data sets are modified and the differentiated workflows are executed again in the machine learning phase. The process is repeated iteratively until the oncologist verifies the results. The method is adaptive in terms of genetic, biological and medical aspects