WO2022258866A1 - Method of genomic analysis on a bioinformatics platform - Google Patents
- Publication number
- WO2022258866A1 (application PCT/ES2022/070351)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- thread
- dna
- genomic analysis
- sequencing
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
Definitions
- The present invention relates to a genomic analysis platform that allows rapid and efficient analysis of raw data from human genome sequencing systems, facilitating the interpretation of variants and the generation of a personalized report.
- NGS: next-generation sequencing
- NGS panels allow parallel analysis of multiple genes or selected DNA regions that are related to similar or overlapping phenotypes. These panels provide a first method of genetic diagnosis and, in cases where no alteration is detected in the analyzed genes, the physician determines whether to extend the study by performing whole exome sequencing (WES) or whole genome sequencing (WGS).
- WES: whole exome sequencing
- WGS: whole genome sequencing
- NGS techniques mainly generate three types of files: FASTQ, SAM/BAM (alignment) and VCF (annotation). These files are large and difficult to handle, so a tool that automates their processing and interpretation as far as possible is essential for extracting clinically useful data from large numbers of samples.
- An example of this type of system is described in US2020/0042736A1, in which genomic data is stored or transmitted as a compressed genomic data set structured in a file or in a genomic data stream. Selective access to data, or subsets of data, corresponding to specific genomic regions is achieved through user-defined tags based on data classification and a specific indexing mechanism.
- Ancestry data can be masked by identifying ancestry informative marker (AIM) regions in the genetic data.
- Each AIM region may include one or more single nucleotide polymorphism (SNP) alleles associated with a population of patients belonging to a certain ancestry.
- SNP: single nucleotide polymorphism
- One or more regions can be identified that include clinically relevant data.
- Clinically relevant data may be data having one or more genetic variants associated with a specific disease or disorder.
- Genetic data can be anonymized by masking or removing AIM regions that do not include clinically relevant data.
- The platform incorporates adapted, integrated tools for the analysis and interpretation of variants in data from massive DNA sequencing.
- The platform is oriented towards the analysis and interpretation of genomic data from whole exome sequencing (WES) and whole genome sequencing (WGS). These data come from massive or next-generation sequencing (NGS) of DNA extracted from biological samples.
- The present invention is configured as an open source platform to manage, process, share, and interpret genomic data.
- The system automates complex genomic interpretation and classification processes and is flexible and modular.
- One of the advantages of the invention is that it is optimized for handling large amounts of data from exome or whole genome sequencing.
- The files handled through the platform of the invention are large (greater than 100 Gb of data), and the platform is developed to handle a plurality of files simultaneously, reaching total amounts of data ranging from tens of terabytes to petabytes.
- Figure 1 shows a block diagram of the genomic analysis process executed with the present invention.
- The present invention describes a unique platform that automates the filtering of undescribed variants in healthy people. This process is carried out efficiently according to the method and system described below, which is configured for the management and interpretation of genomic and clinical information. It is therefore configured as a single system for the simultaneous filtering and analysis of undescribed variants and of genotypes for ancestry and drug metabolism, among others. In addition, it allows real-time data analysis.
- Figure 1 shows the block diagram of the invention, which comprises a first stage of creating the entry order (1) in the platform; this includes compiling the documentation and receiving and registering the sample.
- In the laboratory information management sub-process (2), the sample is admitted and the DNA is extracted and sequenced (3).
- The sequenced DNA data (3) is structured in a BIOPIPELINE sub-process (4).
- The BIOPIPELINE sub-process (4) is therefore configured to structure the DNA sequencing data (3): the raw data from the DNA sequencing machine is converted into FASTQ files, and barcoded sequences are assigned to individual samples in a demultiplexing process.
- FASTQ is a text-based format for storing both a biological sequence, generally a nucleotide sequence, and its corresponding quality scores. Each sequence letter and each quality score is encoded with a single ASCII character for brevity.
- The FASTQ files are then aligned to the hg19 and hg38 reference genomes. This yields a BAM file, the binary form of a SAM file, which is a tab-separated text file containing the genome alignment data.
- The BAM file therefore contains the structured and ordered data for import (5) by the VERIBENCH process (6).
- The VERIBENCH sub-process (6) is configured to review the data imported (5) from the BIOPIPELINE sub-process (4). Three types of data are loaded into the VERIBENCH sub-process (6) for manual inspection. The first group of data loaded comprises the Type I variants that pass all the described thresholds. The second comprises the Type II variants for each product: PGX, Risk and Traits. The third is information on sample quality and identity verification, which is described in detail below.
- A parallel laboratory process is used for client identity verification to ensure that the correct variant information is distributed to the correct patient.
- Genotyping chips are used to provide a second source of physical data derived from the patient's DNA.
- The present invention contains a process, which we will call chipId, to ensure correct identity.
- The input to this verification method is the data from the Illumina iScan machine. Briefly, the iScan machine performs genotyping in a similar way to microarray analysis. The raw chip data is standardized to a particular format, ensuring that the columns maintain a particular order. The first step converts the chip data to VCF format using custom scripts. The VCF generated for the client is then compared, again with custom scripts, with the VCF derived from the chip data.
- The VERIBENCH operating protocol comprises the following steps: a) access a remote platform with the appropriate credentials; b) start the analysis and provide the necessary parameters; c) once the BIOPIPELINE process (4) has started, its start can be seen in the current execution window, and when the job is complete the color indicator turns green and the job status is set to successful; d) once the job has completed, data processing has started successfully and progress can be monitored by logging in to the platform remotely.
- The curation process (7) is an independent system that supports the curation and interpretation of a variant. In addition, it allows the approval of the laboratory director to be obtained. What a curator does is create a connection between the pieces to create something greater than the sum of the individual pieces. Connecting the pieces with a context creates a story and, hence, a set element.
- The creation of the report (8) is a process that is activated when all the necessary requirements of the previous steps (curation, classification and interpretation) have been met.
- The laboratory director's signature is recorded in the final report, unequivocally validating the entire process for each individual report, in compliance with the regulatory standards applicable in each case.
- The platform has been designed to support any language and alphabet.
- The distribution (9) is an independent system that allows reports, files, and notifications to be sent (or distributed) through different means.
Abstract
The present invention relates to a method of genomic analysis that is implemented on a remote bioinformatics platform configured for the automated genomic analysis and filtering of undescribed variants in healthy individuals, comprising the steps of biological sample input (1, 2) and sequencing (3) of the DNA from the biological sample, after which the data is structured into three fastq, sam/bam and vcf files, characterised in that it implements a first biopipeline sub-process (4) configured for collecting the data from a DNA sequencer and transforming the data into elements that are understandable to a second veribench sub-process (6) configured for inspecting the data imported from the first sub-process of data collection and transformation of the sequencer, a third sub-process configured for curing (7) and interpreting a genomic variant, and a fourth sub-process for generating and distributing reports (8, 9).
Description
METHOD OF GENOMIC ANALYSIS ON A BIOINFORMATICS PLATFORM
Technical field
The present invention relates to a genomic analysis platform that allows rapid and efficient analysis of raw data from human genome sequencing systems, facilitating the interpretation of variants and the generation of a personalized report.
State of the art
Sequencing the nucleotides that make up human DNA molecules allows variants in the genetic material to be identified. In this regard, Sanger sequencing in the 1970s was a milestone in the analysis of human genetics and is considered the origin of the genomic era.
After the discovery of sequencing, high-throughput or next-generation sequencing (NGS) platforms emerged, which can analyze millions of DNA fragments massively and in parallel in a single sequencing run. This technology increases throughput and reduces the cost of analysis, providing additional advantages over previous genomic sequencing systems.
From that point on, in order to improve the yield of genetic diagnosis, analysis laboratories began to develop NGS panels that allow parallel analysis of multiple genes or selected DNA regions that are related to similar or overlapping phenotypes. These panels provide a first method of genetic diagnosis and, in cases where no alteration is detected in the analyzed genes, the physician determines whether to extend the study by performing whole exome sequencing (WES) or whole genome sequencing (WGS).
NGS techniques mainly generate three types of files: FASTQ, SAM/BAM (alignment) and VCF (annotation). These files are large and difficult to handle, so a tool that automates their processing and interpretation as far as possible is essential for extracting clinically useful data from large numbers of samples.
An example of this type of system is described in US2020/0042736A1, in which genomic data is stored or transmitted as a compressed genomic data set structured in a file or in a genomic data stream. Selective access to data, or subsets of data, corresponding to specific genomic regions is achieved through user-defined tags based on data classification and a specific indexing mechanism.
US2020/0035332A1 describes methods and corresponding systems for anonymizing genetic data obtained from a patient. Ancestry data can be masked by identifying ancestry informative marker (AIM) regions in the genetic data. Each AIM region may include one or more single nucleotide polymorphism (SNP) alleles associated with a population of patients belonging to a certain ancestry. Once the AIM regions are identified, one or more regions that include clinically relevant data can be identified. Clinically relevant data may be data having one or more genetic variants associated with a specific disease or disorder. Genetic data can be anonymized by masking or removing AIM regions that do not include clinically relevant data.
Finally, US2019/0304571A1 describes systems and methods for managing biological data that can preserve alternative interpretations of the data and can implement multi-level encryption and privacy management. These systems and methods may include a cell-level architecture, a bank-and-block-level architecture, and/or a multi-tier architecture. They may incorporate definitions, rules, and directives and/or employ a two-dimensional data structure.
Explanation of the invention
It is an object of the present invention to provide a cloud-based analysis platform that simplifies the analysis of genome and exome sequencing data and allows comprehensive management of sequencing files. The present invention is therefore configured to manage the files from the moment they are generated in the sequencer, proceeding to the identification and filtering of variants, interpretation, and report generation in different languages. In addition, the platform does not require any local software installation, since it runs in the cloud. This object is achieved with the platform according to claim 1. Particular solutions of the invention are described in the dependent claims.
More specifically, the invention describes a cloud platform on which data from massive DNA sequencing is analyzed. The platform incorporates adapted, integrated tools for the analysis and interpretation of variants in that data. It is oriented towards the analysis and interpretation of genomic data from whole exome sequencing (WES) and whole genome sequencing (WGS); these data come from massive or next-generation sequencing (NGS) of DNA extracted from biological samples. After sequencing, the invention filters the list of variants present in the patient's DNA against the human reference genome, reducing the number of variants that require manual interpretation.
The present invention is configured as an open source platform to manage, process, share, and interpret genomic data. The system automates complex genomic interpretation and classification processes and is flexible and modular.
One of the advantages of the invention is that it is optimized for handling large amounts of data from exome or whole genome sequencing. The files handled through the platform of the invention are large (greater than 100 Gb of data), and the platform is developed to handle a plurality of files simultaneously, reaching total amounts of data ranging from tens of terabytes to petabytes.
Brief explanation of the drawings
To complement the description and to aid understanding of the characteristics of the invention, a set of drawings is attached as an integral part of the description, in which the following is shown for illustrative and non-limiting purposes:
Figure 1 shows a block diagram of the genomic analysis process executed with the present invention.
Detailed explanation of an embodiment of the invention
As discussed above, the present invention describes a unique platform that automates the filtering of undescribed variants in healthy people. This process is carried out efficiently according to the method and system described below, which is configured for the management and interpretation of genomic and clinical information. It is therefore configured as a single system for the simultaneous filtering and analysis of undescribed variants and of genotypes for ancestry and drug metabolism, among others. In addition, it allows real-time data analysis.
Figure 1 shows the block diagram of the invention, which comprises a first stage of creating the entry order (1) in the platform; this includes compiling the documentation and receiving and registering the sample. Next, in the laboratory information management sub-process (2), the sample is admitted and the DNA is extracted and sequenced (3). The sequenced DNA data (3) is structured in a BIOPIPELINE sub-process (4).
The BIOPIPELINE sub-process (4) is therefore configured to structure the DNA sequencing data (3): the raw data from the DNA sequencing machine is converted into FASTQ files, and barcoded sequences are assigned to individual samples in a demultiplexing process. FASTQ is a text-based format for storing both a biological sequence, generally a nucleotide sequence, and its corresponding quality scores. Each sequence letter and each quality score is encoded with a single ASCII character for brevity. The FASTQ files are then aligned to the hg19 and hg38 reference genomes. This yields a BAM file, the binary form of a SAM file, which is a tab-separated text file containing the genome alignment data. The BAM file therefore contains the structured and ordered data for import (5) by the VERIBENCH process (6).
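To make the FASTQ encoding concrete, the following is a minimal Python sketch of reading four-line FASTQ records and decoding the per-base quality characters into numeric scores. It is not taken from the patent: the names `FastqRecord` and `parse_fastq` are hypothetical, and the Phred+33 offset is an assumption (it is the common convention for current sequencers, but the document does not state which encoding the platform uses).

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    """One read: identifier, bases, and one numeric quality score per base."""
    read_id: str
    sequence: str
    qualities: List[int]


def parse_fastq(handle: TextIO, phred_offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-read FASTQ stream.

    phred_offset=33 is an assumed encoding (Phred+33), not stated in the source.
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # Each quality score is a single ASCII character; subtract the offset.
        scores = [ord(char) - phred_offset for char in quality]
        yield FastqRecord(header[1:], sequence, scores)
```

With this sketch, a read whose quality string is `IIIIII#` decodes to six scores of 40 followed by one score of 2, since `I` is ASCII 73 and `#` is ASCII 35.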
The VERIBENCH sub-process (6) is configured to review the data imported (5) from the BIOPIPELINE sub-process (4). Three types of data are loaded into the VERIBENCH sub-process (6) for manual inspection. The first group of data loaded comprises the Type I variants that pass all the described thresholds. The second comprises the Type II variants for each product: PGX, Risk and Traits. The third is information on sample quality and identity verification, which is described in detail below.
To verify the client's identity, a parallel laboratory process is used to ensure that the correct variant information is delivered to the correct patient. Genotyping chips provide a second source of physical data derived from the patient's DNA.
The present invention includes a process, referred to here as chipId, to ensure correct identity. The input to this verification method is the data from the Illumina iScan machine. Briefly, the iScan machine performs genotyping in a manner similar to microarray analysis. The raw chip data are standardized to a particular format, which ensures that the columns keep a fixed order. The first step converts the chip data to VCF format using custom scripts. The generated client-specific VCF is then compared, also with custom scripts, to the VCF derived from the chip data.
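A comparison of two VCF files could begin by reducing each file to a map from genomic position to genotype. The sketch below is not the custom script mentioned in the text, which the source does not disclose. It assumes single-sample VCF text with the standard tab-separated columns and a GT entry in the FORMAT field; `load_genotypes` and the example lines are hypothetical.

```python
from typing import Dict, Iterable, Tuple

Site = Tuple[str, int]


def load_genotypes(vcf_lines: Iterable[str]) -> Dict[Site, str]:
    """Map (chromosome, position) to the GT string of the first sample."""
    genotypes: Dict[Site, str] = {}
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/meta lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # no FORMAT/sample columns, so no genotype to read
        chrom, pos = fields[0], int(fields[1])
        keys = fields[8].split(":")
        if "GT" not in keys:
            continue
        gt = fields[9].split(":")[keys.index("GT")]
        # Treat phased ("|") and unphased ("/") separators alike.
        genotypes[(chrom, pos)] = gt.replace("|", "/")
    return genotypes


# Hypothetical example lines (not taken from the source).
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE",
    "chr1\t1000\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:30",
    "chr2\t2000\t.\tC\tT\t60\tPASS\t.\tGT\t1|1",
]
calls = load_genotypes(example_vcf)
```

Keying by chromosome and position lets the two sources be compared only at the sites they share.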
To calculate identity, the data are classified as true positives (TP), false positives (FP), true negatives (TN) or false negatives (FN). The concordance is then expressed by the equation
TP / (TP + FP + TN + FN)
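The concordance calculation can be sketched as follows, assuming the denominator is the sum of the four classes TP, FP, TN and FN. How each site is assigned to a class is not stated in the source; the sketch assumes that the chip genotype is taken as the reference and that a call is "positive" when it contains a non-reference allele. All names and example values are hypothetical.

```python
from typing import Dict, Tuple

Site = Tuple[str, int]


def is_variant(gt: str) -> bool:
    """A genotype counts as a variant call if any allele is non-reference."""
    return any(a not in ("0", ".") for a in gt.replace("|", "/").split("/"))


def classify(seq: Dict[Site, str], chip: Dict[Site, str]) -> Dict[str, int]:
    """Count TP/FP/TN/FN over the sites genotyped by both sources."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for site in seq.keys() & chip.keys():
        s, c = is_variant(seq[site]), is_variant(chip[site])
        if s and c:
            counts["TP"] += 1
        elif s and not c:
            counts["FP"] += 1
        elif not s and c:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts


def concordance(counts: Dict[str, int]) -> float:
    """TP divided by the sum of all four classes, as in the equation above."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no shared sites to compare")
    return counts["TP"] / total


# Hypothetical genotypes: sequencing-derived calls versus chip-derived calls.
seq_calls = {("chr1", 100): "0/1", ("chr1", 200): "0/0",
             ("chr2", 300): "1/1", ("chr2", 400): "0/0"}
chip_calls = {("chr1", 100): "0/1", ("chr1", 200): "0/0",
              ("chr2", 300): "0/0", ("chr2", 400): "0/1"}
counts = classify(seq_calls, chip_calls)
score = concordance(counts)
```

In this example one site falls into each class, so the concordance is 1 / 4.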
This ensures that the generated VCF is the true data. Identity is equivalent to concordance. The Veribench operating protocol comprises the following steps:
a) Access the remote platform with the appropriate credentials.
b) Start the analysis and provide the necessary parameters.
c) Once the BIOPIPELINE process (4) has been launched, its start can be seen in the current execution window. When the job is complete, the colour indicator turns green and the job status is set to successful.
d) Completion of the job means that data processing has started successfully; progress can then be monitored by logging in to the platform remotely.
The curation process (7) is an independent system that supports the curation and interpretation of a variant. It also allows the approval of the laboratory director to be obtained. A curator creates connections between the pieces to produce something greater than the sum of the individual pieces. Connecting the pieces with a context creates a story and, therefore, a unified whole.
Report creation (8) is a process triggered once all the requirements of the previous steps (curation, classification and interpretation) have been met. In addition, the signature of the laboratory director is recorded in the final report, and the entire process is validated unambiguously for each individual report, in compliance with the regulations applicable in each case. The platform has been designed to support any language and alphabet.
Distribution (9) is an independent system that allows reports, files and notifications to be sent through different channels. In the present case, the enabled channels are email, a cloud repository, a PDF report and a real-time web report.
Claims
1.- A genomic analysis method implemented on a remote bioinformatics platform configured for automated genomic analysis and the filtering of variants not described in healthy people, comprising the stages of input of a biological sample (1,2) and sequencing (3) of the DNA of the biological sample, after which the data are structured into three files, fastq, sam/bam and vcf, characterized in that it implements a first biopipeline thread (4) configured to collect the data from a DNA sequencer and transform the data into elements comprehensible to a second veribench thread (6) configured for the inspection of the data imported from the first thread, which collects and transforms the sequencer data; a third thread configured for the curation (7) and interpretation of a genomic variant; and a fourth thread for report generation and distribution (8,9).
2.- The genomic analysis method according to claim 1, wherein the BIOPIPELINE thread (4) is configured to structure the DNA sequencing data (3), wherein the raw data from the DNA sequencing machine are converted into fastq files, assigning the barcoded sequences to the individual samples in a demultiplexing process; and wherein the fastq files are aligned to the hg19 and hg38 reference genomes, resulting in a bam file, which is the binary form of the sam file, which is a tab-separated text file containing the genome alignment data.
3.- The genomic analysis method according to either of claims 1 or 2, wherein the veribench process (6) comprises a process configured to ensure the correct identity of the sample in parallel by means of genotyping chips that provide a second source of physical data derived from the patient's DNA.
4.- The method according to claim 3, wherein the raw data from the genotyping chip are standardized to ensure that the columns keep a fixed order, and wherein in a first step the chip data are converted to vcf format using custom scripts; and wherein the generated client-specific vcf is then compared, also with custom scripts, to the vcf derived from the genotyping chip data.
5.- A remote bioinformatics platform characterized in that it comprises means configured to execute the method according to any one of claims 1 to 4.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
ES202130535A ES2930699A1 (en) | 2021-06-10 | 2021-06-10 | GENOMIC ANALYSIS METHOD IN A BIOINFORMATIC PLATFORM (Machine-translation by Google Translate, not legally binding) |
ESP202130535 | 2021-06-10 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022258866A1 (en) | 2022-12-15 |
Family
ID=84425760
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/ES2022/070351 WO2022258866A1 (en) | 2021-06-10 | 2022-06-06 | Method of genomic analysis on a bioinformatics platform |
Country Status (2)
Country | Link |
---|---|
ES (1) | ES2930699A1 (en) |
WO (1) | WO2022258866A1 (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2013070634A1 (en) * | 2011-11-07 | 2013-05-16 | Ingenuity Systems, Inc. | Methods and systems for identification of causal genomic variants |
US20150286495A1 (en) * | 2014-04-02 | 2015-10-08 | International Business Machines Corporation | Metadata-driven workflows and integration with genomic data processing systems and techniques |
US20150379193A1 (en) * | 2014-06-30 | 2015-12-31 | QIAGEN Redwood City, Inc. | Methods and systems for interpretation and reporting of sequence-based genetic tests |
US20160191076A1 (en) * | 2014-08-29 | 2016-06-30 | Bonnie Berger Leighton | Compressively-accelerated read mapping framework for next-generation sequencing |
US20190026425A1 (en) * | 2015-12-24 | 2019-01-24 | YouGene, Inc. | Curated genetic database for in silico testing, licensing and payment |
US20200042735A1 (en) * | 2016-10-11 | 2020-02-06 | Genomsys Sa | Method and system for selective access of stored or transmitted bioinformatics data |
US20200244283A1 (en) * | 2019-01-30 | 2020-07-30 | International Business Machines Corporation | Managing compression and storage of genomic data |
- 2021-06-10 ES ES202130535A patent/ES2930699A1/en active Pending
- 2022-06-06 WO PCT/ES2022/070351 patent/WO2022258866A1/en unknown
Also Published As
Publication number | Publication date |
---|---|
ES2930699A1 (en) | 2022-12-20 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 22819687 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |