WO2024042341A1 - Method and system for the automated valuation of biological data

Info

Publication number: WO2024042341A1
Application number: PCT/GR2023/000042
Authority: WO
Inventors: Antonios Salakidis; Christos Karapiperis
Original assignee: Dnasequence Srl Hellas
Priority date: 2022-08-25
Filing date: 2023-08-02
Publication date: 2024-02-29
Also published as: GR1010503B

Abstract

The invention relates to a method for the automated evaluation of biological data and a system for implementing the same. Based on a data set produced after bioinformatics analysis of databases and a query in natural language, artificial intelligence networks produce a limited and sorted subset of results that satisfy the query and which are used to automatically generate a natural language report. There are also additional artificial intelligence networks to examine a number of parameters of the results and the accuracy of the generated natural language report.

Description

METHOD AND SYSTEM FOR AUTOMATED VALUATION OF BIOLOGICAL DATA

Background of the invention

To date, existing technology for microbiome analysis involves taking a sample, preparing it, analyzing it with a next-generation sequencing device to extract molecular sequences in digital form, identifying the microorganisms and statistically analyzing the results. This is followed by a manual search, analysis and assessment of the relevance of the results to the question that was posed and that motivated the procedure.

The interpretation stage is, to a large extent, a laborious and time-consuming process. It involves searching databases for information regarding each organism or molecular sequence found in the examined sample and then correlating the results and drawing conclusions based on the existing literature. Furthermore, due to its manual nature, it is prone to producing incorrect, incomplete, inaccurate and misleading conclusions.

These problems prevent modern methods of gene analysis from being used to their full potential. This hinders their commercial exploitation and their use in solving serious issues related to the genomic imprint and microbiome in a multitude of fields such as clinical research, food safety, bio-security of facilities, etc. The proposed system includes a cognitive method, which fully automates the process of interpreting the results obtained from biological experiments and analyses.

The aim of this invention is to provide a method and a system for fully automating the process of explaining the results of biological experiments.

The method, as well as the system that applies it, solves two important problems. The first concerns the extraction of cognitive data and metadata from public and proprietary databases, as well as the possibility of their automated processing for knowledge mining. The second concerns the process of interpreting the results by producing cognitive data in report form.

Brief description of the drawings

Figure 1 shows a logic diagram with the steps of the method for the automated evaluation of biological data.

Figure 2 shows the flow diagram of the method, together with the evaluation of the results produced i) by the semantic search and ii) by the report text generation algorithm.

Description of the invention

The proposed system implements a method based on cognitive technology. The cognitive artificial intelligence models make up the different structures, or parts, of the system, which work in a specific order: the input layer, the main processing layer and the output layer.

The process starts with a dataset produced after the bioinformatic analysis, which includes at least the taxonomic identification code and quantitative information, such as the number of reads or the Operational Taxonomic Units (OTUs). Based on the taxonomic code, a search is made in the available bibliographic databases and all publications mentioning the specific microorganisms are found. These publications constitute the input data set of the method and, in combination with the user's query, form the input of the semantic search. The data set is searched using a first set of transformer trained models, such as BERT, BioBERT, XLNet or RoBERTa. For each query given by the user, a vector (embedding) is calculated and then compared with the vectors of the publications by calculating their distance (inner product). To improve the execution time of the algorithm, the vectors (embeddings) of the available literature may be calculated in advance, alongside the already trained model. With this approach, the execution time of the models from the start of the query to the results of the search is significantly reduced. An example of a public database used to train the models is the PubMed database with about 33 million scientific publications, while the data set generated by specific queries comprises about 50,000 to 100,000 scientific publications.
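The ranking step of the first stage can be illustrated with a minimal Python sketch. It assumes that the embeddings have already been produced by one of the transformer models named above; the three-dimensional vectors, the publication identifiers and the function names are hypothetical and serve only to show the inner-product comparison and the descending sort.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def inner_product(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_publications(
    query_vec: Vector, doc_vecs: Dict[str, Vector], top_k: int
) -> List[Tuple[str, float]]:
    """Score every publication against the query and keep the top_k,
    sorted by score in descending order."""
    scored = [(doc_id, inner_product(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical embeddings, computed once in advance so that only the
# query has to be embedded when a search is run.
PRECOMPUTED = {
    "pub_A": [0.9, 0.1, 0.0],
    "pub_B": [0.2, 0.8, 0.1],
    "pub_C": [0.5, 0.5, 0.5],
}

query_embedding = [1.0, 0.0, 0.2]  # hypothetical embedding of the user query
ranking = rank_publications(query_embedding, PRECOMPUTED, top_k=2)
```

Here `ranking` holds the limited and sorted subset: the two publications with the highest inner product, with the best match first.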

In the second stage, the result of the search is a list showing each publication and its semantic search score in descending order, so that the relevant documents with the highest score appear at the top of the list. The results of the semantic analysis, i.e. the classified and limited data set, are again subjected to natural language processing by artificial intelligence networks, and some of them then feed new queries with the aim of improving or evaluating the search results in an automated way and without human intervention. The second artificial intelligence network used at this stage is based on recurrent neural network methods, such as a recurrent neural network, a hidden Markov model, a maximum-entropy Markov model (MEMM) or a naive Bayes classifier. Specifically, the search results are rated based on their semantic analysis in relation to the query posed by the user. Recurrent neural networks examine a series of parameters (metrics) such as Accuracy (correct decisions/total decisions), Precision, Recall, F1 score, Mean Reciprocal Rank (MRR), Mean Average Precision (MAP), Root Mean Squared Error (RMSE), Perplexity, etc. In this way, the network is continuously trained, and only the results that exceed a certain threshold, which can be set parametrically in advance, are forwarded to the next stage.
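Some of the metrics listed above, and the threshold-based forwarding, can be sketched as follows. This is only an illustration using the standard definitions of Precision, Recall, F1 score and MRR; the function names and the relevance labels are hypothetical, and the sketch does not model the recurrent network itself.

```python
from typing import List, Set, Tuple


def precision_recall_f1(
    retrieved: List[str], relevant: Set[str]
) -> Tuple[float, float, float]:
    """Standard Precision, Recall and F1 for one result list."""
    hits = sum(1 for doc in retrieved if doc in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def mean_reciprocal_rank(
    rankings: List[List[str]], relevant_sets: List[Set[str]]
) -> float:
    """Average of 1/rank of the first relevant result over all queries."""
    total = 0.0
    for ranking, relevant in zip(rankings, relevant_sets):
        for position, doc in enumerate(ranking, start=1):
            if doc in relevant:
                total += 1.0 / position
                break
    return total / len(rankings) if rankings else 0.0


def forward_above_threshold(
    scored: List[Tuple[str, float]], threshold: float
) -> List[Tuple[str, float]]:
    """Keep only results whose score exceeds the preset threshold."""
    return [(doc, score) for doc, score in scored if score > threshold]


# Hypothetical example data.
p, r, f1 = precision_recall_f1(["a", "b", "c", "d"], {"a", "c", "e"})
mrr = mean_reciprocal_rank([["x", "a"], ["c", "y"]], [{"a"}, {"c"}])
kept = forward_above_threshold([("a", 0.9), ("b", 0.4), ("c", 0.7)], 0.5)
```

The threshold is passed as a parameter, mirroring the statement that it can be set parametrically in advance.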

In the third stage, the result of the semantic analysis is a new classified and limited dataset, which, together with the user's description of the problem in natural language, is the input to a third artificial intelligence network. This network produces the final result, i.e. the report in natural language format, through transformer trained models such as GPT-2 or GPT-3. Both models are pre-trained, but their training is further optimized with data from academic publications. GPT models, and especially GPT-3, are powerful models for natural language text generation based on the transformer architecture; they are pre-trained without supervision (unsupervised training). Such a model works by predicting the next token given a sequence of tokens, and it can do this for natural language processing (NLP) tasks on which it has not been trained. Additionally, to improve its performance, the model has been fine-tuned with publications related to the domains where the -omics technologies are applied.
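The principle of generating text by repeatedly predicting the next token can be shown with a deliberately simple stand-in. The sketch below is not a GPT model: it assumes a bigram frequency table in place of a transformer and greedy selection of the most frequent continuation, and the toy corpus and function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def train_bigram(tokens: List[str]) -> Dict[str, Counter]:
    """Count, for every token, which tokens follow it in the corpus."""
    table: Dict[str, Counter] = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        table[prev][nxt] += 1
    return table


def generate(
    table: Dict[str, Counter], prompt: List[str], max_new_tokens: int
) -> List[str]:
    """Greedy generation: append the most frequent next token each step."""
    out = list(prompt)
    if not out:
        return out
    for _ in range(max_new_tokens):
        candidates = table.get(out[-1])
        if not candidates:
            break  # no known continuation for the last token
        out.append(candidates.most_common(1)[0][0])
    return out


# Toy corpus standing in for the academic publications used in training.
corpus = (
    "the sample contains bacteria and the sample contains fungi "
    "and the sample contains bacteria"
).split()

table = train_bigram(corpus)
generated = generate(table, ["the"], max_new_tokens=3)
```

A transformer replaces the frequency table with a learned distribution conditioned on the whole preceding sequence, but the generation loop has the same shape.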

In an alternative embodiment of the invention and with the aim of fully automating the method, an additional stage of evaluating the results via neural networks is added. Fig. 2 shows the flow diagram including the evaluation of the results produced i) by the semantic search and ii) by the report text generation algorithm.

This automatic evaluation stage concerns the report produced by the text generation stage, i.e. the third stage. The stage results are evaluated and scored.

Evaluation models such as Latent semantic analysis or Semantic hashing are used to validate the natural language results produced. The procedure followed to validate the results comprises preprocessing, weighting, singular value decomposition (SVD), rating, adjustments and accuracy. If the final accuracy result exceeds a certain threshold that can be set parametrically in advance, then the final report is accepted and made available to the system user. Otherwise, the text generation process is repeated.
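A validation of this kind could be sketched with Latent semantic analysis as follows. The sketch rests on several assumptions that the description does not fix: TF-IDF is used for the weighting step, cosine similarity in the reduced space is used for the rating, the accuracy is taken as the best similarity to any source publication, and the number of attempts is capped by a hypothetical `max_attempts` parameter. It requires NumPy.

```python
import math
from collections import Counter
from typing import Callable, List, Optional

import numpy as np


def tokenize(text: str) -> List[str]:
    """Preprocessing: lowercase and keep alphabetic words only."""
    return [w for w in text.lower().split() if w.isalpha()]


def tfidf_matrix(docs: List[List[str]]) -> np.ndarray:
    """Weighting (assumed TF-IDF): term-by-document matrix."""
    vocab = sorted({w for d in docs for w in d})
    index = {w: i for i, w in enumerate(vocab)}
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))
    m = np.zeros((len(vocab), n))
    for j, doc in enumerate(docs):
        for w, count in Counter(doc).items():
            m[index[w], j] = count * (math.log((1 + n) / (1 + df[w])) + 1)
    return m


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 1e-12 else 0.0


def lsa_accuracy(report: str, sources: List[str], k: int = 2) -> float:
    """SVD and rating: best cosine similarity between the report and any
    source publication in a k-dimensional latent space."""
    docs = [tokenize(s) for s in sources] + [tokenize(report)]
    _, s, vt = np.linalg.svd(tfidf_matrix(docs), full_matrices=False)
    k = min(k, len(s))
    latent = (s[:k, None] * vt[:k]).T  # one row per document
    return max(cosine(latent[-1], src) for src in latent[:-1])


def accept_or_regenerate(
    generate: Callable[[], str], sources: List[str],
    threshold: float, max_attempts: int = 3,
) -> Optional[str]:
    """Accept the first report whose accuracy exceeds the threshold;
    otherwise repeat text generation (capped at max_attempts)."""
    for _ in range(max_attempts):
        report = generate()
        if lsa_accuracy(report, sources) > threshold:
            return report
    return None
```

The `generate` callable stands for the third-stage text generation network; returning `None` after the last attempt is a design choice of this sketch, not something the description specifies.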

In this way, cognitive data are created immediately in the form of conceptually meaningful reports on the analyzed samples, which allow conclusions to be drawn immediately. The end result of the method is the automated interpretation of biological data. The cognitive data produced are specific to conclusions, while the final report produced as an output is fully supported by academic publications.

Claims

1. A method for the automated valuation of biological data in three stages, where in the first stage a data set produced after bioinformatic analysis of databases is taken as input and includes at least the taxonomic code for searching in the databases and a natural language query, so as to produce as output a limited set of data that includes the specific taxonomic code and where based on the user query in natural language, using a first set of transformer trained models, calculation of a vector and comparison between the vectors is performed by calculating their distance so that a list is formed showing each publication and the semantic search score in descending order, in the second stage the sorted and limited data set is fed to an artificial intelligence network based on recurrent neural network methods to examine a series of parameters and forward to the next layer only the results that exceed a certain, predefined threshold in parameter values, in the third stage the sorted and limited data set that exceeded the threshold in parameter values is used as input to a third artificial intelligence network, where using a third set of transformer trained models, a report is produced in natural language format.

2. A method for the automated valuation of biological data, according to claim 1, wherein the report in natural language format resulting from the third stage is evaluated by evaluation models on a series of data so that if the final accuracy result exceeds a certain, predetermined threshold, the report is accepted and forwarded to the user.

3. A system for the automated valuation of biological data, consisting of a first artificial intelligence network that uses a first set of transformer trained models to semantically analyze and classify a set of scientific data based on a user's query in natural language and produces a list including every related publication and the semantic search score in descending order, a second artificial intelligence network, which uses recurrent neural network methods, examines the list of the sorted and limited dataset for a set of parameters and forwards to the next layer only the results that pass a specific, predefined threshold in parameter values, and a third artificial intelligence network, that uses a second set of transformer trained models, which takes as input the list of the relevant publications that exceeded the threshold in parameter values and produces as output the final report in natural language form.

4. The system for the automated valuation of biological data according to claim 3, wherein a fourth artificial intelligence network, using evaluation models, evaluates a series of data in the final report in natural language format and if the final accuracy result exceeds a certain, predetermined threshold, the final report is accepted.