WO2022258866A1

WO2022258866A1 - Method of genomic analysis on a bioinformatics platform

Info

Publication number: WO2022258866A1
Application number: PCT/ES2022/070351
Authority: WO
Inventors: Javier ECHEVARRIA CARRERES
Original assignee: Veritas Intercontinental, S.L.
Priority date: 2021-06-10
Filing date: 2022-06-06
Publication date: 2022-12-15
Also published as: ES2930699A1

Abstract

The present invention relates to a method of genomic analysis that is implemented on a remote bioinformatics platform configured for the automated genomic analysis and filtering of undescribed variants in healthy individuals, comprising the steps of biological sample input (1, 2) and sequencing (3) of the DNA from the biological sample, after which the data is structured into three file types (fastq, sam/bam and vcf), characterised in that it implements a first biopipeline sub-process (4) configured for collecting the data from a DNA sequencer and transforming the data into elements that are understandable to a second veribench sub-process (6) configured for inspecting the data imported from the first sub-process of data collection and transformation of the sequencer, a third sub-process configured for curating (7) and interpreting a genomic variant, and a fourth sub-process for generating and distributing reports (8, 9).

Description

DESCRIPTION

GENOMIC ANALYSIS METHOD IN A BIOINFORMATIC PLATFORM

Technical field

The present invention refers to a genomic analysis platform that allows rapid and efficient analysis of raw data from Human Genome sequencing systems, facilitating the interpretation of variants and the generation of a personalized report.

State of the art

The sequencing of the nucleotides that make up the human DNA molecules allows the identification of variants in the genetic material. In this regard, "Sanger" sequencing in the 1970s was a milestone in the analysis of Human Genetics and is considered the origin of the genomic era.

After the discovery of sequencing, high-throughput or next-generation sequencing (NGS) platforms emerged, which can analyze millions of DNA fragments massively in parallel in a single sequencing process. This technology increases performance and reduces the cost of the analysis, providing additional advantages over previous genomic sequencing systems.

From that moment, in order to improve the performance of genetic diagnosis, analysis laboratories began to develop NGS panels that allow parallel analysis of multiple genes or selected regions of DNA related to similar or overlapping phenotypes. These panels provide a first method of genetic diagnosis and, in cases where no alteration is detected in the analyzed genes, the physician determines whether to extend the study by performing whole exome sequencing (WES) or whole genome sequencing (WGS).

NGS sequencing techniques mainly generate three types of files: FASTQ, SAM/BAM (alignment) and VCF (variant calls and annotation). These files are large and difficult to handle, so a tool that automates their processing and interpretation is essential in order to extract clinically useful data from large numbers of samples. An example of this type of system is described in US2020/0042736A1, in which genomic data is stored or transmitted as a compressed genomic data set structured in a file or in a genomic data stream. Selective access to data, or subsets of data, corresponding to specific genomic regions is achieved through user-defined tags based on data classification and a specific indexing mechanism.

Corresponding methods and systems for anonymizing genetic data obtained from a patient are described in US2020/0035332A1. Ancestry data can be masked by identifying ancestry information marker (AIM) regions in the genetic data. Each AIM region may include one or more single nucleotide polymorphism (SNP) alleles associated with a population of patients belonging to a certain ancestry. Once the AIM regions are identified, one or more regions that include clinically relevant data can be identified. Clinically relevant data may be data having one or more genetic variants associated with a specific disease or disorder. Genetic data can be anonymized by masking or removing AIM regions that do not include clinically relevant data.

Finally, document US2019/0304571A1 describes systems and methods for managing biological data that can preserve alternative interpretations of the data and can implement multi-level encryption and privacy management. Systems and methods for managing biological data may include cell level architecture, bank and block level architecture, and/or multi-tier architecture. Systems and methods for managing biological data may incorporate definitions, rules, and directives and/or employ a two-dimensional or two-dimensional data structure.

Explanation of the invention

It is an object of the present invention to provide a cloud-based analysis platform that simplifies the analysis of genome and exome sequencing data and that allows comprehensive management of sequencing files. To that end, the present invention is configured to manage the files from the moment they are generated in the sequencer, proceeding to the identification and filtering of the variants, their interpretation, and report generation in different languages. In addition, the platform does not require any local software installation, since it runs in the cloud. This object is achieved with the platform according to claim 1. Particular solutions of the invention are described in the dependent claims.

More specifically, the invention describes a cloud platform on which data from massive DNA sequencing is analyzed. The platform integrates adapted tools for the analysis and interpretation of variations in such data. It is oriented towards the analysis and interpretation of genomic data from whole exome sequencing (WES - Whole Exome Sequencing) and whole genome sequencing (WGS - Whole Genome Sequencing). These data come from massive or next-generation sequencing (NGS - Next Generation Sequencing) of DNA extracted from biological samples. After sequencing, the invention filters the list of variants present in the patient's DNA against the reference human genome, reducing the number of variants that require manual interpretation.

The present invention is configured as an open source platform to manage, process, share, and interpret genomic data. The system automates complex genomic interpretation and classification processes while remaining flexible and modular.

One of the advantages of the invention is that it is optimized for handling large amounts of data from exome or whole genome sequencing. The files handled through the platform of the invention are large (greater than 100 Gb of data), and the platform is developed to handle a plurality of files simultaneously, reaching total amounts of data ranging from tens of terabytes to petabytes.

Brief explanation of the drawings

To complement the description and to help a better understanding of the characteristics of the invention, a set of drawings is attached as an integral part of said description, in which, with an illustrative and non-limiting nature, the following has been represented:

Figure 1 shows a block diagram of the genomic analysis process executed with the present invention.

Detailed explanation of an embodiment of the invention

As previously discussed, the present invention describes a platform that automates the screening of non-described variants in healthy people. This process is carried out efficiently according to the method and system described below, which is configured for the management and interpretation of genomic and clinical information. It is therefore configured as a single system for the filtering and simultaneous analysis of non-described variants and of genotypes relating to ancestry and drug metabolism, among others. In addition, it allows real-time data analysis.

Figure 1 shows the block diagram of the invention, which comprises a first stage in which the entry order (1) is created in the platform, including the compilation of the documentation and the reception and registration of the sample. Next, in the laboratory information management sub-process (2), the sample is admitted, and the DNA is extracted and sequenced (3). The sequenced DNA data (3) is structured in a BIOPIPELINE sub-process (4).

The BIOPIPELINE sub-process (4) is therefore configured to structure the DNA sequencing data (3): the raw data from the DNA sequencing machine is converted into FASTQ files, assigning the barcoded sequences to individual samples in a demultiplexing process. The FASTQ format is a text-based format for storing both a biological sequence (generally a nucleotide sequence) and its corresponding quality scores. Each sequence letter and each quality score is encoded with a single ASCII character for brevity. The FASTQ files containing the sequences are then aligned to the hg19 and hg38 reference genomes. This results in a BAM file, which is the binary form of the SAM file, a text file containing the tab-separated genome alignment data. This BAM file therefore contains the structured and ordered data for import (5) by the VERIBENCH process (6).
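The two text formats described above can be illustrated with a minimal Python sketch. It is not the platform's implementation: the function and type names are hypothetical, the quality offset of 33 (Phred+33) is an assumption that the source does not state, and only the 11 mandatory SAM columns are handled.

```python
from typing import Dict, Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]  # one Phred score per base


def decode_phred33(quality_line: str) -> List[int]:
    """Map each ASCII quality character to a score (offset 33 is assumed)."""
    return [ord(ch) - 33 for ch in quality_line]


def parse_fastq(lines: List[str]) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text laid out as four lines per read."""
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = (ln.rstrip("\n") for ln in lines[i:i + 4])
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch at line {i + 1}")
        yield FastqRecord(header[1:], seq, decode_phred33(qual))


# The 11 mandatory, tab-separated columns of a SAM alignment line.
SAM_FIELDS = ("qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual")


def parse_sam_line(line: str) -> Dict[str, object]:
    """Split one SAM alignment line into its mandatory fields."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) < len(SAM_FIELDS):
        raise ValueError("a SAM alignment line needs at least 11 columns")
    record: Dict[str, object] = dict(zip(SAM_FIELDS, parts[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(parts[SAM_FIELDS.index(key)])
    return record
```

For example, `list(parse_fastq(open(path).readlines()))` would load a small uncompressed FASTQ file into memory; real sequencing output is far larger and is normally streamed and handled with dedicated tools.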

The VERIBENCH sub-process (6) is configured to review the data imported (5) from the BIOPIPELINE sub-process (4). Three types of data are loaded into the VERIBENCH sub-process (6) for manual inspection. The first group of data loaded is the Type I variants that pass all the thresholds described. The second is the Type II variants for each product: PGX, Risk, and Traits. The third is information on sample quality and identity verification, which we will now describe in detail.

A parallel lab process is used for client identity verification to ensure that the connected variant information is distributed to the correct patient. Genotyping chips are used to provide a second source of physical data that is derived from the patient's DNA.

The present invention contains a process, which we will call chlpd, to ensure correct identity. The input to this verification method is the data from the Illumina iScan machine. Briefly, the iScan machine performs genotyping in a similar way to microarray analysis. The raw chip data is standardized to a particular format, ensuring that the columns maintain a particular order. The first step converts the chip data to a vcf format, using custom scripts. The customer-specific generated vcf is then compared (also with custom scripts) to the vcf that was derived from the chip data.
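As an illustration of the conversion step, the following Python sketch turns standardized chip rows into a single-sample vcf. The custom scripts themselves are not disclosed, so everything here is an assumption: the input column names (`chrom`, `pos`, `rsid`, `ref`, `alt`, `allele1`, `allele2`) are hypothetical, and any allele that matches neither the reference nor the alternate allele is treated as a no-call.

```python
from typing import Dict, Iterable, List


def genotype_code(allele1: str, allele2: str, ref: str, alt: str) -> str:
    """Encode two called alleles as a VCF GT string (0 = ref, 1 = alt)."""
    codes: List[str] = []
    for allele in (allele1, allele2):
        if allele == ref:
            codes.append("0")
        elif allele == alt:
            codes.append("1")
        else:
            return "./."  # assumption: anything else is a no-call
    return "/".join(sorted(codes))  # unphased, so 1/0 is written as 0/1


def chip_rows_to_vcf(rows: Iterable[Dict[str, str]], sample: str) -> str:
    """Build minimal VCF text from rows with hypothetical column names."""
    lines = [
        "##fileformat=VCFv4.2",
        '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
        "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL",
                   "FILTER", "INFO", "FORMAT", sample]),
    ]
    for row in rows:
        gt = genotype_code(row["allele1"], row["allele2"], row["ref"], row["alt"])
        # QUAL, FILTER and INFO are left as missing values (".").
        lines.append("\t".join([row["chrom"], row["pos"], row["rsid"],
                                row["ref"], row["alt"], ".", ".", ".", "GT", gt]))
    return "\n".join(lines) + "\n"
```

Keeping the columns in a fixed, named order, as the standardization step requires, is what lets a conversion like this stay a simple row-by-row mapping.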

To calculate identity, the data is classified as true positives (TP), false positives (FP), true negatives (TN) or false negatives (FN). Next, the agreement is expressed with the equation:

TP / (TP + FP + TN + FN)
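The agreement calculation can be sketched in Python as follows. This is an illustration under stated assumptions, not the custom scripts of the invention: the denominator is taken to be the total of the four classes, the chip-derived calls are treated as the reference, each site is reduced to a boolean "variant present" flag, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict


def classify_sites(seq_calls: Dict[str, bool],
                   chip_calls: Dict[str, bool]) -> Counter:
    """Count TP/FP/TN/FN over the sites present in both call sets.

    Assumption: the chip call is the reference, and True means that a
    variant was called at that site.
    """
    counts = Counter({"TP": 0, "FP": 0, "TN": 0, "FN": 0})
    for site in seq_calls.keys() & chip_calls.keys():
        seq_has, chip_has = seq_calls[site], chip_calls[site]
        if seq_has and chip_has:
            counts["TP"] += 1
        elif seq_has and not chip_has:
            counts["FP"] += 1
        elif not seq_has and not chip_has:
            counts["TN"] += 1
        else:
            counts["FN"] += 1
    return counts


def agreement(counts: Counter) -> float:
    """Agreement as given in the text: TP divided by all classified sites."""
    total = counts["TP"] + counts["FP"] + counts["TN"] + counts["FN"]
    if total == 0:
        raise ValueError("no shared sites to compare")
    return counts["TP"] / total
```

Only sites present in both call sets are compared, so positions that the chip does not cover do not affect the result.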

This check guarantees that the generated vcf is the true data. Identity is equivalent to agreement.

The Veribench operation protocol comprises the following steps:

a) Access the remote platform with the appropriate credentials.

b) Start the analysis and provide the necessary parameters.

c) Once the BIOPIPELINE process (4) has started, its start is shown in the current execution window. After the job is complete, the color indicator turns green and the job status is set to successful.

d) Once the job has been completed, the data processing has started successfully and its progress can be monitored by remote login to the platform.

The curation process (7) is an independent system that supports the curation and interpretation of a variant. In addition, it allows the approval of the laboratory director to be obtained. What a curator does is create a connection between the pieces to create something greater than the sum of the individual pieces. The connection of the pieces with a context creates a story and, hence, a set element.

The creation of the report (8) is a process that is activated when all the necessary requirements of the previous steps (curation, classification and interpretation) have been met. In addition, the signature of the laboratory director and the validation of the entire process are recorded unequivocally in each individual final report, in compliance with the regulatory standards applicable in each case. The platform has been designed to support any language and alphabet.

The distribution (9) is an independent system that allows reports, files, and notifications to be sent (or distributed) through different channels. In the present case, the enabled channels are email, cloud repository, pdf report and real-time report on the web.

Claims

1.- A genomic analysis method implemented in a remote bioinformatics platform configured for automated genomic analysis and filtering of non-described variants in healthy people, comprising the stages of inputting a biological sample (1, 2) and sequencing (3) the DNA of the biological sample, after which the data is structured into three file types, fastq, sam/bam and vcf, characterized in that it implements a first biopipeline sub-process (4) configured to collect the data from a DNA sequencer and transform the data into elements comprehensible to a second veribench sub-process (6) configured to inspect the data imported from the first sub-process of data collection and transformation of the sequencer; a third sub-process configured for the curation (7) and interpretation of a genomic variant; and a fourth sub-process for report generation and distribution (8, 9).

2.- The method of genomic analysis according to claim 1, wherein the BIOPIPELINE sub-process (4) is configured to structure the DNA sequencing data (3), wherein the raw data from the DNA sequencing machine is converted into fastq files, assigning the barcoded sequences to the individual samples in a demultiplexing process; and wherein the fastq files are aligned to the hg19 and hg38 reference genomes, resulting in a bam file, which is the binary form of the sam file, a text file containing the tab-separated genome alignment data.

3.- The method of genomic analysis according to any one of claims 1 or 2, wherein the veribench process (6) comprises a process configured to ensure the correct identity of the sample in parallel by means of genotyping chips that provide a second data source derived from the patient's DNA.

4.- The method according to claim 3, wherein the raw data from the genotyping chip is standardized to ensure that the columns maintain a specific order, and wherein in a first step the chip data is converted to a vcf format using custom scripts; and wherein the generated customer-specific vcf is then compared, also with custom scripts, to the vcf that was derived from the genotyping chip data.

5.- A remote bioinformatics platform characterized in that it includes means configured to execute the method according to any one of claims 1 to 4.