WO2024042341A1 - Method and system for the automated valuation of biological data - Google Patents

Method and system for the automated valuation of biological data

Info

Publication number
WO2024042341A1
Authority
WO
WIPO (PCT)
Prior art keywords
natural language
data
artificial intelligence
report
models
Application number
PCT/GR2023/000042
Other languages
French (fr)
Inventor
Antonios Salakidis
Christos Karapiperis
Original Assignee
Dnasequence Srl Hellas
Priority date
2022-08-25
Filing date
2023-08-02
Publication date
2024-02-29
Application filed by Dnasequence Srl Hellas
Publication of WO2024042341A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/338 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/383 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/30 Unsupervised data analysis

Abstract

The invention relates to a method for the automated evaluation of biological data and a system for implementing the same. Based on a data set produced after bioinformatics analysis of databases and a query in natural language, artificial intelligence networks produce a limited and sorted subset of results that satisfy the query and which are used to automatically generate a natural language report. There are also additional artificial intelligence networks to examine a number of parameters of the results and the accuracy of the generated natural language report.

Description

DESCRIPTION
METHOD AND SYSTEM FOR AUTOMATED VALUATION OF BIOLOGICAL DATA
Background of the invention
To date, the existing technology for microbiome analysis comprises taking a sample, preparing it, analyzing it on a next-generation sequencing device to extract molecular sequences in digital form, identifying the microorganisms, analyzing the results statistically and then manually searching, analyzing and assessing the relevance of the results to the question that motivated the procedure.
The interpretation stage is, to a large extent, a laborious and time-consuming process. It involves searching databases for information on each organism or molecular sequence found in the examined sample, then correlating the results and drawing conclusions based on the existing literature. Furthermore, because it is manual, it is prone to producing incorrect, incomplete, inaccurate and misleading conclusions.
These problems prevent modern methods of gene analysis from being used to their full potential. This holds back their commercial exploitation and their use in addressing serious issues related to the genomic imprint and the microbiome in a multitude of fields such as clinical research, food safety and the bio-security of facilities. The proposed system includes a cognitive method which fully automates the interpretation of the results obtained from biological experiments and analyses.
The aim of this invention is to provide a method and a system for fully automating the process of explaining the results of biological experiments.
The method, as well as the system that applies it, solves two important problems. The first concerns the extraction of cognitive data and metadata from public and proprietary databases, as well as the possibility of processing them automatically for knowledge mining. The second concerns the interpretation of the results by producing cognitive data in report form.
Brief description of the drawings
Figure 1 shows a logic diagram with the steps of the method for the automated evaluation of biological data.
Figure 2 shows the flow diagram of the method, together with the evaluation of the results produced i) by the semantic search and ii) by the report text generation algorithm.
Description of the invention
The proposed system implements a method based on cognitive technology. It consists of cognitive artificial intelligence models that form distinct structures (parts) working in a specific order: the input layer, the main processing layer and the output layer.
The process starts with a dataset produced by the bioinformatic analysis, which includes at least the taxonomic identifier (taxonomic code) and quantitative information, such as the number of reads or the Operational Taxonomic Units (OTUs). Based on the taxonomic code, the available bibliographic databases are searched and all publications mentioning the specific microorganisms are found. These publications constitute the input data set of the method and, in combination with the user's query, form the input of the semantic search over the publications. The data set is searched using a first set of transformer trained models, such as BERT, BioBERT, XLNet or RoBERTa. For each query given as input by the user, a vector (embedding) is calculated, and the vectors are then compared by calculating their distance, here measured by their inner product. To reduce the execution time of the algorithm, and alongside the use of an already trained model, the embeddings of the available literature may be calculated in advance. With this approach, the execution time of the models from the start of the query to the results of the search is significantly reduced. An example of a public database used to train the models is the PubMed database, with about 33 million scientific publications, while the data generated by specific queries amount to about 50,000 to 100,000 scientific publications.
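By way of illustration only, the comparison step of the first stage can be sketched as follows. The sketch assumes that the publication embeddings have already been calculated and stored; the three-dimensional vectors, the publication identifiers and the function names are hypothetical placeholders and are not part of the described method, which would obtain the vectors from a transformer model.

```python
from typing import Dict, List, Sequence, Tuple


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def rank_publications(
    query_vec: Sequence[float],
    doc_vecs: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Score every publication against the query, highest score first."""
    scored = [(doc_id, inner_product(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical pre-computed embeddings (assumption: real ones come from a trained model).
PRECOMPUTED = {
    "pub-A": [0.9, 0.1, 0.0],
    "pub-B": [0.2, 0.8, 0.1],
    "pub-C": [0.5, 0.5, 0.5],
}

# Hypothetical embedding of the user's natural language query.
query = [1.0, 0.0, 0.0]

ranking = rank_publications(query, PRECOMPUTED)
for doc_id, score in ranking:
    print(f"{doc_id}\t{score:.2f}")
```

Because the publication vectors are stored ahead of time, only the query has to be embedded when a search is run, which is the source of the reduction in execution time mentioned above.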
In the second stage, the result of the search is a list showing each publication and its semantic search score in descending order, so that the most relevant documents, with the highest scores, appear at the top of the list. The results of the semantic analysis, i.e. the classified and limited data set, are again subjected to natural language processing by artificial intelligence networks, and some of them then feed new queries with the aim of improving or evaluating the search results in an automated way and without human intervention. The second artificial intelligence network used at this stage is based on recurrent neural network methods or comparable sequence and classification models, such as a recurrent neural network, a hidden Markov model, a maximum-entropy Markov model (MEMM) or a naive Bayes classifier. Specifically, the search results are rated based on their semantic analysis in relation to the query posed by the user. Recurrent neural networks examine a series of parameters (metrics) such as Accuracy (correct decisions/total decisions), Precision, Recall, F1 score, Mean Reciprocal Rank (MRR), Mean Average Precision (MAP), Root Mean Squared Error (RMSE), Perplexity, etc. In this way the network is trained continuously, and only the results that exceed a threshold, which can be set parametrically in advance, are forwarded to the next stage.
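By way of illustration only, some of the metrics named for the second stage, together with the threshold filter, can be sketched as follows. The sketch uses the standard textbook definitions of Precision, Recall, F1 score and Mean Reciprocal Rank; the function names and the example data are hypothetical, and the way in which the described network combines these metrics is not specified here.

```python
from typing import Iterable, List, Sequence, Set, Tuple


def precision_recall_f1(retrieved: Iterable[str], relevant: Iterable[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall and F1 score of a retrieved list."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / position of the first relevant item, or 0.0 if none is found."""
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs: Sequence[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over several (ranking, relevant set) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / len(runs)


def filter_by_threshold(scored: Sequence[Tuple[str, float]], threshold: float) -> List[Tuple[str, float]]:
    """Keep only the results whose score strictly exceeds the preset threshold."""
    return [(doc_id, score) for doc_id, score in scored if score > threshold]


# Hypothetical example data.
p, r, f1 = precision_recall_f1(["a", "b", "c", "d"], {"a", "c", "e"})
mrr = mean_reciprocal_rank([(["b", "a"], {"a"}), (["c"], {"c"})])
forwarded = filter_by_threshold([("a", 0.9), ("b", 0.4)], threshold=0.5)
```

The strict comparison in `filter_by_threshold` follows the wording "exceed a threshold"; whether results exactly at the threshold should pass is a design choice left open by the text.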
In the third stage, the result of the semantic analysis is a new classified and limited dataset which, together with the user's natural language description of the problem, is the input to a third artificial intelligence network. This network produces the final result, i.e. the report in natural language format, through transformer trained models such as GPT-2 or GPT-3. Both models are pre-trained, but their training is further optimized with data from academic publications. GPT models, and especially GPT-3, are powerful models for natural language text generation based on the transformer architecture; they are pre-trained, and their training is done without supervision (unsupervised training). Such a model works by predicting the next token given a sequence of tokens, and it can do this for natural language processing (NLP) tasks on which it has not been trained. Additionally, through fine-tuning, the model has been trained on publications from the domains where -omics technologies are applied, in order to improve its performance.
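By way of illustration only, the principle of predicting the next token from the preceding sequence can be sketched with a toy bigram count model and greedy decoding. This is a deliberate simplification and is not a transformer: a GPT-type model conditions on the whole preceding sequence with learned weights, whereas this sketch looks only at the last token. The corpus and the function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def train_bigram(tokens: List[str]) -> Dict[str, Counter]:
    """Count, for every token, which tokens follow it in the training text."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for previous, following in zip(tokens, tokens[1:]):
        counts[previous][following] += 1
    return counts


def generate(counts: Dict[str, Counter], start: str, max_new_tokens: int) -> List[str]:
    """Greedy autoregressive generation: repeatedly append the most likely next token."""
    output = [start]
    for _ in range(max_new_tokens):
        followers = counts.get(output[-1])
        if not followers:
            break  # no continuation was observed for the last token
        # Highest count wins; ties are broken alphabetically so the result is deterministic.
        next_token = min(followers, key=lambda token: (-followers[token], token))
        output.append(next_token)
    return output


# Hypothetical training text.
corpus = "the gut microbiome shapes the gut barrier and the gut microbiome".split()
model = train_bigram(corpus)
generated = generate(model, start="the", max_new_tokens=3)
print(" ".join(generated))
```

Each generated token is appended to the sequence and becomes part of the context for the next prediction, which is the same loop a large language model runs when it writes a report.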
In an alternative embodiment of the invention and with the aim of fully automating the method, an additional stage of evaluating the results via neural networks is added. Fig. 2 shows the flow diagram including the evaluation of the results produced i) by the semantic search and ii) by the report text generation algorithm.
This automatic evaluation stage concerns the report produced by the text generation stage, i.e. the third stage. The stage results are evaluated and scored.
Evaluation models such as latent semantic analysis or semantic hashing are used to validate the natural language results produced. The validation procedure comprises preprocessing, weighting, singular value decomposition (SVD), rating, adjustments and accuracy calculation. If the final accuracy result exceeds a threshold, which can be set parametrically in advance, the final report is accepted and made available to the system user. Otherwise, the text generation process is repeated.
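By way of illustration only, the accept-or-repeat logic of the evaluation stage can be sketched as follows. The scoring function is a plain term-frequency cosine similarity used as a stand-in: it omits the weighting and SVD steps of latent semantic analysis. The generator callback, the reference text, the threshold value and the limit on the number of attempts are all hypothetical assumptions, since the text does not state how many repetitions are allowed.

```python
import math
from collections import Counter
from typing import Callable, Optional


def cosine_tf(text_a: str, text_b: str) -> float:
    """Cosine similarity of raw term-frequency vectors (stand-in for an LSA score)."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def generate_until_accepted(
    generate: Callable[[int], str],
    reference: str,
    threshold: float,
    max_attempts: int = 3,  # assumption: the source does not bound the repetitions
) -> Optional[str]:
    """Repeat text generation until a report scores above the threshold."""
    for attempt in range(1, max_attempts + 1):
        report = generate(attempt)
        if cosine_tf(report, reference) > threshold:
            return report  # accepted and made available to the user
    return None  # no draft passed within the allowed attempts


# Hypothetical drafts returned by successive runs of the text generation stage.
drafts = {
    1: "unrelated text",
    2: "lactobacillus supports gut health",
}
reference = "lactobacillus supports gut barrier health"

accepted = generate_until_accepted(lambda n: drafts.get(n, ""), reference, threshold=0.8)
```

In this example the first draft shares no terms with the reference and is rejected, while the second scores about 0.89 and is accepted.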
In this way, cognitive data are created immediately in the form of conceptually meaningful reports on the analyzed samples, which allow conclusions to be drawn at once. The end result of the method is the automated interpretation of biological data. The cognitive data produced are focused on conclusions, while the final report produced as output is fully supported by academic publications.

Claims

1. A method for the automated valuation of biological data in three stages, where in the first stage a data set produced after bioinformatic analysis of databases is taken as input and includes at least the taxonomic code for searching in the databases and a natural language query, so as to produce as output a limited set of data that includes the specific taxonomic code and where based on the user query in natural language, using a first set of transformer trained models, calculation of a vector and comparison between the vectors is performed by calculating their distance so that a list is formed showing each publication and the semantic search score in descending order, in the second stage the sorted and limited data set is fed to an artificial intelligence network based on recurrent neural network methods to examine a series of parameters and forward to the next layer only the results that exceed a certain, predefined threshold in parameter values, in the third stage the sorted and limited data set that exceeded the threshold in parameter values is used as input to a third artificial intelligence network, where using a third set of trained transformer models, a report is produced in natural language format.
2. A method for the automated valuation of biological data, according to claim 1, wherein the report in natural language format resulting from the third stage is evaluated by evaluation models on a series of data so that if the final accuracy result exceeds a certain, predetermined threshold, the report is accepted and forwarded to the user.
3. A system for the automated valuation of biological data, consisting of a first artificial intelligence network that uses a first set of transformer trained models to semantically analyze and classify a set of scientific data based on a user's query in natural language and produces a list including every related publication and the semantic search score in descending order, a second artificial intelligence network, which uses recurrent neural network methods, examines the list of the sorted and limited dataset for a set of parameters and forwards to the next layer only the results that pass a specific, predefined threshold in parameter values, and a third artificial intelligence network, that uses a second set of transformer trained models, which takes as input the list of the relevant publications that exceeded the threshold in parameter values and produces as output the final report in natural language form.
4. The system for the automated valuation of biological data according to claim 3, wherein a fourth artificial intelligence network, using evaluation models, evaluates a series of data in the final report in natural language format and if the final accuracy result exceeds a certain, predetermined threshold, the final report is accepted.
PCT/GR2023/000042 2022-08-25 2023-08-02 Method and system for the automated valuation of biological data WO2024042341A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GR20220100706A GR1010503B (en) 2022-08-25 2022-08-25 Method and system for automated evaluation of biological data
GR20220100706 2022-08-25

Publications (1)

Publication Number Publication Date
WO2024042341A1 (en) 2024-02-29

Family

ID=87887939

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GR2023/000042 WO2024042341A1 (en) 2022-08-25 2023-08-02 Method and system for the automated valuation of biological data

Country Status (2)

Country Link
GR (1) GR1010503B (en)
WO (1) WO2024042341A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190078142A1 (en) * 2015-06-30 2019-03-14 uBiome, Inc. Method and system for characterization for female reproductive system-related conditions associated with microorganisms
US20210038654A1 (en) * 2018-03-16 2021-02-11 Persephone Biosciences Compositions for modulating gut microflora populations, enhancing drug potency and treating cancer, and methods for making and using same

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009111581A1 (en) * 2008-03-04 2009-09-11 Nextbio Categorization and filtering of scientific data
CN109448793B (en) * 2018-10-15 2021-04-20 智慧芽信息科技(苏州)有限公司 Method and system for labeling, searching and information labeling of right range of gene sequence
US11003701B2 (en) * 2019-04-30 2021-05-11 International Business Machines Corporation Dynamic faceted search on a document corpus
WO2021195133A1 (en) * 2020-03-23 2021-09-30 Sorcero, Inc. Cross-class ontology integration for language modeling

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
PARK YESOL ET AL: "Discovering microbe-disease associations from the literature using a hierarchical long short-term memory network and an ensemble parser model", SCIENTIFIC REPORTS, vol. 11, no. 1, 24 February 2021 (2021-02-24), XP093005872, Retrieved from the Internet <URL:https://www.nature.com/articles/s41598-021-83966-8> DOI: 10.1038/s41598-021-83966-8 *

Also Published As

Publication number Publication date
GR1010503B (en) 2023-07-07

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23798498

Country of ref document: EP

Kind code of ref document: A1