WO2022258866A1 - Method of genomic analysis on a bioinformatics platform - Google Patents

Method of genomic analysis on a bioinformatics platform

Info

Publication number
WO2022258866A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
thread
dna
genomic analysis
sequencing
Prior art date
Application number
PCT/ES2022/070351
Other languages
Spanish (es)
French (fr)
Inventor
Javier ECHEVARRIA CARRERES
Original Assignee
Veritas Intercontinental, S.L.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Veritas Intercontinental, S.L.
Publication of WO2022258866A1

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10: Sequence alignment; Homology search
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20: Supervised data analysis
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00: ICT programming tools or database systems specially adapted for bioinformatics

Definitions

  • The present invention refers to a genomic analysis platform that allows rapid and efficient analysis of raw data from human genome sequencing systems, facilitating the interpretation of variants and the generation of a personalized report.
  • NGS: next-generation sequencing
  • NGS panels allow parallel analysis of multiple genes or selected regions of DNA that are related to similar or overlapping phenotypes. These panels provide a first method of genetic diagnosis and, in cases where no alteration is detected in the analyzed genes, the physician will determine whether to extend the study by performing whole exome sequencing (WES) or whole genome sequencing (WGS).
  • WES: whole exome sequencing
  • WGS: whole genome sequencing
  • NGS sequencing techniques generate mainly three types of files: FASTQ, SAM/BAM (alignment) and VCF (annotation). These files are large and difficult to handle, so a tool that maximizes the automation of their processing and interpretation is essential in order to extract data of high clinical utility from large numbers of samples.
  • An example of this type of system is described in US2020/0042736A1, in which the storage or transmission of genomic data is performed using a compressed genomic data set structured in a file or in a genomic data stream. Selective access to data, or subsets of data, corresponding to specific genomic regions is achieved through user-defined tags based on data classification and a specific indexing mechanism.
  • Ancestry data can be masked by identifying ancestry information marker (AIM) regions in the genetic data.
  • Each AIM region may include one or more single nucleotide polymorphism (SNP) alleles associated with a population of patients belonging to a certain ancestry.
  • SNP: single nucleotide polymorphism
  • One or more regions can be identified that include clinically relevant data.
  • Clinically relevant data may be data having one or more genetic variants associated with a specific disease or disorder.
  • Genetic data can be anonymized by masking or removing AIM regions that do not include clinically relevant data.
  • The platform integrates adapted tools for the analysis and interpretation of variants in data from massive DNA sequencing.
  • The platform is oriented towards the analysis and interpretation of genomic data from whole exome sequencing (WES) and whole genome sequencing (WGS). These data come from massive or next-generation sequencing (NGS) of DNA extracted from biological samples.
  • WES: Whole Exome Sequencing
  • WGS: Whole Genome Sequencing
  • NGS: Next Generation Sequencing
  • The present invention is configured as an open source platform to manage, process, share, and interpret genomic data.
  • The system automates complex genomic interpretation and classification processes, and is flexible and modular.
  • One of the advantages of the invention is that it is optimized for handling large amounts of data from exome or whole genome sequencing.
  • The files handled through the platform of the invention are large (greater than 100 Gb of data), and the platform is developed to handle a plurality of files simultaneously, reaching total amounts of data ranging from tens of terabytes to petabytes.
  • Figure 1 shows a block diagram of the genomic analysis process executed with the present invention.
  • The present invention describes a unique platform that automates the filtering of undescribed variants in healthy people. This process is carried out efficiently according to the method and system described below, which is configured for the management and interpretation of genomic and clinical information. It is therefore configured as a single system for the simultaneous filtering and analysis of undescribed variants and of ancestry genotype and drug metabolism, among others. In addition, it allows real-time data analysis.
  • Figure 1 shows the block diagram of the invention, which comprises a first stage of creation of the entry order (1) in the platform, including the compilation of the documentation and the reception and registration of the sample.
  • Next, in the laboratory information management sub-process (2), the sample is admitted and the DNA is extracted and sequenced (3).
  • The sequenced DNA data (3) is structured in a BIOPIPELINE sub-process (4).
  • The BIOPIPELINE sub-process (4) is therefore configured to structure the DNA sequencing data (3): the raw data from the DNA sequencing machine is converted into FASTQ files.
  • The FASTQ format is a text-based format for storing both a biological sequence (generally a nucleotide sequence) and its corresponding quality scores. Both the sequence letter and the quality score are encoded with a single ASCII character for brevity. The barcoded sequences are assigned to individual samples in a demultiplexing process.
  • The sequence-filled FASTQ files are then aligned to the hg19 and hg38 reference genomes. This results in a BAM file, which is the binary form of the SAM file, a tab-separated text file containing the genome alignment data.
  • This BAM file therefore contains the structured and ordered data for import (5) by the VERIBENCH process (6).
  • The VERIBENCH sub-process (6) is configured to review the data imported (5) from the BIOPIPELINE sub-process (4). Three types of data are loaded into the VERIBENCH sub-process (6) for manual inspection. The first group of data loaded is the Type I variants that pass all the thresholds described. The second is the Type II variants for each product: PGX, Risk, and Traits. The third is information on sample quality and identity verification, which is described in detail below.
  • A parallel laboratory process is used for client identity verification to ensure that the correct variant information is distributed to the correct patient.
  • Genotyping chips are used to provide a second source of physical data derived from the patient's DNA.
  • The present invention contains a process, which we will call chipId, to ensure correct identity.
  • The input to this verification method is the data from the Illumina iScan machine. Briefly, the iScan machine performs genotyping in a similar way to microarray analysis. The raw chip data are standardized to a particular format, ensuring that the columns maintain a particular order. The first step converts the chip data to VCF format using custom scripts. The VCF generated specifically for the customer is then compared (also with custom scripts) to the VCF derived from the chip data.
  • The Veribench operation protocol follows these steps: a) Access the remote platform with the appropriate credentials. b) Start the analysis and provide the necessary parameters. c) Once the BIOPIPELINE process (4) has started, its start can be seen in the current execution window. After the job is complete, the color indicator turns green and the job status is set to successful. d) Once the job has completed, the data processing has started successfully and its progress can be monitored by logging in remotely to the platform.
  • The curation process (7) is an independent system that supports the curation and interpretation of a variant. In addition, it allows the approval of the laboratory director to be obtained. What a curator does is create a connection between the pieces to create something greater than the sum of the individual pieces. Connecting the pieces with a context creates a story and, hence, a joint element.
  • The creation of the report (8) is a process that is activated when all the necessary requirements of the previous steps (curation, classification and interpretation) have been met.
  • The signature of the laboratory director is recorded in the final report, unequivocally validating the entire process for each individual report, in compliance with the regulatory standards applicable in each case.
  • The platform has been designed to support any language and alphabet.
  • The distribution (9) is an independent system that allows reports, files, and notifications to be sent (or distributed) through different means.

Abstract

The present invention relates to a method of genomic analysis that is implemented on a remote bioinformatics platform configured for the automated genomic analysis and filtering of undescribed variants in healthy individuals, comprising the steps of biological sample input (1, 2) and sequencing (3) of the DNA from the biological sample, after which the data is structured into three fastq, sam/bam and vcf files, characterised in that it implements a first biopipeline sub-process (4) configured for collecting the data from a DNA sequencer and transforming the data into elements that are understandable to a second veribench sub-process (6) configured for inspecting the data imported from the first sub-process of data collection and transformation of the sequencer, a third sub-process configured for curing (7) and interpreting a genomic variant, and a fourth sub-process for generating and distributing reports (8, 9).

Description

GENOMIC ANALYSIS METHOD ON A BIOINFORMATICS PLATFORM
Technical field
The present invention refers to a genomic analysis platform that allows rapid and efficient analysis of raw data from human genome sequencing systems, facilitating the interpretation of variants and the generation of a personalized report.
State of the art
The sequencing of the nucleotides that make up human DNA molecules allows the identification of variants in the genetic material. In this regard, Sanger sequencing in the 1970s was a milestone in the analysis of human genetics and is considered the origin of the genomic era.
After the discovery of sequencing, high-throughput or next-generation sequencing (NGS) platforms emerged, which can analyze millions of DNA fragments massively and in parallel in a single sequencing process. This new technology increases throughput and reduces the cost of analysis, providing additional advantages over previous genomic sequencing systems.
From then on, in order to improve the yield of genetic diagnosis, analysis laboratories began to develop NGS panels that allow parallel analysis of multiple genes or selected regions of DNA that are related to similar or overlapping phenotypes. These panels provide a first method of genetic diagnosis and, in cases where no alteration is detected in the analyzed genes, the physician will determine whether to extend the study by performing whole exome sequencing (WES) or whole genome sequencing (WGS).
NGS sequencing techniques generate mainly three types of files: FASTQ, SAM/BAM (alignment) and VCF (annotation). These files are large and difficult to handle, so a tool that maximizes the automation of their processing and interpretation is essential in order to extract data of high clinical utility from large numbers of samples. An example of this type of system is described in US2020/0042736A1, in which the storage or transmission of genomic data is performed using a compressed genomic data set structured in a file or in a genomic data stream. Selective access to data, or subsets of data, corresponding to specific genomic regions is achieved through user-defined tags based on data classification and a specific indexing mechanism.
Document US2020/0035332A1 describes methods and corresponding systems for anonymizing genetic data obtained from a patient. Ancestry data can be masked by identifying ancestry information marker (AIM) regions in the genetic data. Each AIM region may include one or more single nucleotide polymorphism (SNP) alleles associated with a population of patients belonging to a certain ancestry. Once the AIM regions are identified, one or more regions that include clinically relevant data can be identified. Clinically relevant data may be data having one or more genetic variants associated with a specific disease or disorder. Genetic data can be anonymized by masking or removing the AIM regions that do not include clinically relevant data.
Finally, document US2019/0304571A1 describes systems and methods for managing biological data that can preserve alternative interpretations of the data and can implement multi-level encryption and privacy management. These systems and methods may include a cell-level architecture, a bank-and-block-level architecture, and/or a multi-tier architecture. They may incorporate definitions, rules, and directives and/or employ a two-dimensional data structure.
Explanation of the invention
It is an object of the present invention to provide a cloud-based analysis platform that simplifies the analysis of genome and exome sequencing data and allows comprehensive management of sequencing files. The present invention is therefore configured to manage the files from the moment they are generated in the sequencer, proceeding to the identification and filtering of variants, their interpretation, and the generation of reports in different languages. In addition, the platform does not require any local software installation, since it runs in the cloud. This object is achieved with the platform according to claim 1. Particular solutions of the invention are described in the dependent claims.
More specifically, the invention describes a cloud platform on which data from massive DNA sequencing is analyzed. The platform integrates adapted tools for the analysis and interpretation of variants in data from massive DNA sequencing. The platform is oriented towards the analysis and interpretation of genomic data from whole exome sequencing (WES) and whole genome sequencing (WGS). These data come from massive or next-generation sequencing (NGS) of DNA extracted from biological samples. After sequencing, the invention filters the list of variants present in the patient's DNA against the reference human genome, reducing the number of variants that require manual interpretation.
The present invention is configured as an open source platform to manage, process, share, and interpret genomic data. The system automates complex genomic interpretation and classification processes, and is flexible and modular.
One of the advantages of the invention is that it is optimized for handling large amounts of data from exome or whole genome sequencing. The files handled through the platform of the invention are large (greater than 100 Gb of data), and the platform is developed to handle a plurality of files simultaneously, reaching total amounts of data ranging from tens of terabytes to petabytes.
Brief explanation of the drawings
To complement the description and to aid a better understanding of the characteristics of the invention, a set of drawings is attached as an integral part of the description, in which the following is represented for illustrative and non-limiting purposes:
Figure 1 shows a block diagram of the genomic analysis process executed with the present invention.
Detailed explanation of an embodiment of the invention
As previously discussed, the present invention describes a unique platform that automates the filtering of undescribed variants in healthy people. This process is carried out efficiently according to the method and system described below, which is configured for the management and interpretation of genomic and clinical information. It is therefore configured as a single system for the simultaneous filtering and analysis of undescribed variants and of ancestry genotype and drug metabolism, among others. In addition, it allows real-time data analysis.
Figure 1 shows the block diagram of the invention, which comprises a first stage of creation of the entry order (1) in the platform, including the compilation of the documentation and the reception and registration of the sample. Next, in the laboratory information management sub-process (2), the sample is admitted and the DNA is extracted and sequenced (3). The sequenced DNA data (3) is structured in a BIOPIPELINE sub-process (4).
The BIOPIPELINE sub-process (4) is therefore configured to structure the DNA sequencing data (3): the raw data from the DNA sequencing machine is converted into FASTQ files. The FASTQ format is a text-based format for storing both a biological sequence (generally a nucleotide sequence) and its corresponding quality scores. Both the sequence letter and the quality score are encoded with a single ASCII character for brevity. The barcoded sequences are assigned to individual samples in a demultiplexing process. The sequence-filled FASTQ files are then aligned to the hg19 and hg38 reference genomes. This results in a BAM file, which is the binary form of the SAM file, a tab-separated text file containing the genome alignment data. This BAM file therefore contains the structured and ordered data for import (5) by the VERIBENCH process (6).
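As an illustration of the single-character encoding described above, the following minimal Python sketch reads four-line FASTQ records and decodes each quality character into a numeric score. It is not the BIOPIPELINE implementation, which the text does not disclose. The names `FastqRecord` and `parse_fastq` are hypothetical, and the ASCII offset of 33 is an assumption (a common convention) that the description does not state.

```python
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    """One read: its name, its bases, and one quality score per base."""
    name: str
    sequence: str
    qualities: List[int]


def parse_fastq(lines: List[str], offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ entries (header, sequence, '+', quality).

    The offset of 33 is an assumed encoding convention, not taken from the source.
    """
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = (line.strip() for line in lines[i:i + 4])
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch at line {i + 1}")
        # Each quality character encodes one score: its ASCII code minus the offset.
        yield FastqRecord(header[1:], seq, [ord(c) - offset for c in qual])


# Illustrative record, not real sequencing data.
example = ["@read1", "GATTACA", "+", "IIIIII#"]
records = list(parse_fastq(example))
```

Keeping the sequence and the quality string the same length is what lets a single character per base carry the score, so the sketch rejects records where the two lengths differ.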
The VERIBENCH sub-process (6) is configured to review the data imported (5) from the BIOPIPELINE sub-process (4). Three types of data are loaded into the VERIBENCH sub-process (6) for manual inspection. The first group of data loaded is the Type I variants that pass all the thresholds described. The second is the Type II variants for each product: PGX, Risk, and Traits. The third is information on sample quality and identity verification, which is described in detail below.
For client identity verification, a parallel laboratory process is used to ensure that the correct variant information is distributed to the correct patient. Genotyping chips are used to provide a second source of physical data derived from the patient's DNA.
The present invention contains a process, which we will call chipId, to ensure correct identity. The input to this verification method is the data from the Illumina iScan machine. Briefly, the iScan machine performs genotyping in a similar way to microarray analysis. The raw chip data are standardized to a particular format, ensuring that the columns maintain a particular order. The first step converts the chip data to VCF format using custom scripts. The VCF generated specifically for the customer is then compared (also with custom scripts) to the VCF derived from the chip data.
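The custom comparison scripts are not disclosed, so the following Python sketch only illustrates one way two single-sample VCF files could be compared site by site. The function names `read_genotypes` and `compare_calls` are hypothetical, the example records are invented for illustration, and the sketch assumes that GT is the first FORMAT key and that only the first sample column is of interest.

```python
from typing import Dict, Tuple

Site = Tuple[str, int]            # (chromosome, position)
Genotype = Tuple[str, ...]        # sorted called alleles, e.g. ("A", "G")


def read_genotypes(vcf_text: str) -> Dict[Site, Genotype]:
    """Map each site to the alleles called for the first sample column."""
    calls: Dict[Site, Genotype] = {}
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip meta-information and header lines
        fields = line.split("\t")
        chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        alleles = [ref] + alt.split(",")
        # Assumption: GT is the first FORMAT key, e.g. "0/1" or "1|1".
        gt = fields[9].split(":")[0].replace("|", "/")
        if "." in gt:
            continue  # no-call, nothing to compare
        calls[(chrom, pos)] = tuple(sorted(alleles[int(i)] for i in gt.split("/")))
    return calls


def compare_calls(a: Dict[Site, Genotype], b: Dict[Site, Genotype]) -> Tuple[int, int]:
    """Return (matching sites, shared sites) for two genotype maps."""
    shared = a.keys() & b.keys()
    matches = sum(1 for site in shared if a[site] == b[site])
    return matches, len(shared)


# Invented example records, not real patient data.
HEADER = "##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\n"
sequencing_vcf = HEADER + "1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0/1\n1\t200\t.\tC\tT\t.\tPASS\t.\tGT\t1/1\n"
chip_vcf = HEADER + "1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t1|0\n1\t200\t.\tC\tT\t.\tPASS\t.\tGT\t0/1\n1\t300\t.\tG\tA\t.\tPASS\t.\tGT\t0/0\n"

result = compare_calls(read_genotypes(sequencing_vcf), read_genotypes(chip_vcf))
```

Sorting the called alleles makes the comparison independent of phasing, so `0/1` and `1|0` count as the same genotype. Only sites present in both files are compared, since a chip typically covers far fewer positions than sequencing.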
To calculate identity, each data point is classified as a true positive (TP), false positive (FP), true negative (TN) or false negative (FN). Concordance is then expressed by the equation

concordance = TP / (TP + FP + TN + FN)
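A minimal sketch of this calculation is given below. It rests on several assumptions that the patent does not state: the sequencing-derived vcf is treated as the call set under test and the chip-derived vcf as the reference, each site is reduced to a boolean (variant present or not), and the denominator is read as covering all four classes, that is, the repeated FP in the printed formula is taken to mean FN. The function and variable names are hypothetical.

```python
from collections import Counter


def classify(seq_has_variant, chip_has_variant):
    """Label one shared site, taking the chip call as the reference (assumption)."""
    if seq_has_variant and chip_has_variant:
        return "TP"
    if seq_has_variant:
        return "FP"
    if chip_has_variant:
        return "FN"
    return "TN"


def concordance(seq_calls, chip_calls):
    """Return (TP / (TP + FP + TN + FN), per-class counts) over shared sites.

    seq_calls and chip_calls map a site key such as (chrom, pos) to
    True (variant called) or False (reference call).
    """
    shared = seq_calls.keys() & chip_calls.keys()
    counts = Counter(classify(seq_calls[s], chip_calls[s]) for s in shared)
    total = counts["TP"] + counts["FP"] + counts["TN"] + counts["FN"]
    if total == 0:
        raise ValueError("no shared sites to compare")
    return counts["TP"] / total, counts


if __name__ == "__main__":
    seq = {("1", 100): True, ("1", 200): True, ("1", 300): False}
    chip = {("1", 100): True, ("1", 200): False, ("1", 300): False}
    value, counts = concordance(seq, chip)
    print(value, dict(counts))
```

Only sites present in both call sets are compared, since the chip covers far fewer positions than sequencing; whether the original scripts do the same is not stated in the source.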
The concordance check ensures that the generated vcf is the true data. Identity is equivalent to concordance.

The Veribench operating protocol consists of the following steps:
a) Access the remote platform with the appropriate credentials.
b) Start the analysis and provide the required parameters.
c) Once the BIOPIPELINE process (4) has started, its launch is shown in the current execution window. When the job completes, the color indicator turns green and the job status is set to successful.
d) Completion of the job means that data processing has started successfully; progress can then be monitored by logging in to the platform remotely.

The curation process (7) is an independent system that supports the curation and interpretation of a variant. It also allows the approval of the laboratory director to be obtained. A curator creates connections between the pieces to build something greater than the sum of the individual pieces. Connecting the pieces within a context creates a story and, therefore, a unified whole.
Report creation (8) is triggered once all the requirements of the preceding steps (curation, classification and interpretation) have been met. The laboratory director's signature is recorded in the final report, and the whole process is validated uniquely for each individual report, in compliance with the regulations applicable in each case. The platform is designed to support any language and alphabet.
Distribution (9) is an independent system that sends (distributes) reports, files and notifications through various channels. In the present case, the enabled channels are email, cloud repository, PDF report and real-time web report.

Claims

1.- A genomic analysis method implemented on a remote bioinformatics platform configured for automated genomic analysis and for filtering variants not described in healthy people, comprising the steps of receiving a biological sample (1,2) and sequencing (3) the DNA of the biological sample, after which the data are structured into three files, fastq, sam/bam and vcf, characterized in that it implements a first biopipeline subprocess (4) configured to collect the data from a DNA sequencer and transform the data into elements comprehensible to a second veribench subprocess (6) configured to inspect the data imported from the first subprocess of collecting and transforming the sequencer data; a third subprocess configured for the curation (7) and interpretation of a genomic variant; and a fourth subprocess of report generation and distribution (8,9).
2.- The genomic analysis method according to claim 1, wherein the BIOPIPELINE subprocess (4) is configured to structure the DNA sequencing data (3), wherein the raw data from the DNA sequencing machine are converted into fastq files, assigning the barcoded sequences to the individual samples in a demultiplexing process; and wherein the fastq files are aligned to the hg19 and hg38 reference genomes, yielding a bam file, which is the binary form of the sam file, a tab-delimited text file containing the genome alignment data.
3.- The genomic analysis method according to either of claims 1 or 2, wherein the veribench process (6) comprises a process configured to ensure the correct identity of the sample in parallel, using genotyping chips to provide a second source of physical data derived from the patient's DNA.
4.- The method according to claim 3, wherein the raw data from the genotyping chip are standardized to ensure that the columns keep a fixed order; wherein, in a first step, the chip data are converted to vcf format using custom scripts; and wherein the vcf generated specifically for the client is then compared, also with custom scripts, against the vcf derived from the genotyping chip data.
5.- A remote bioinformatics platform characterized in that it comprises means configured to execute the method according to any one of claims 1 to 4.
PCT/ES2022/070351 2021-06-10 2022-06-06 Method of genomic analysis on a bioinformatics platform WO2022258866A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
ES202130535A ES2930699A1 (en) 2021-06-10 2021-06-10 GENOMIC ANALYSIS METHOD IN A BIOINFORMATIC PLATFORM (Machine-translation by Google Translate, not legally binding)
ESP202130535 2021-06-10

Publications (1)

Publication Number Publication Date
WO2022258866A1 true WO2022258866A1 (en) 2022-12-15

Family

ID=84425760

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/ES2022/070351 WO2022258866A1 (en) 2021-06-10 2022-06-06 Method of genomic analysis on a bioinformatics platform

Country Status (2)

Country Link
ES (1) ES2930699A1 (en)
WO (1) WO2022258866A1 (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013070634A1 (en) * 2011-11-07 2013-05-16 Ingenuity Systems, Inc. Methods and systems for identification of causal genomic variants
US20150286495A1 (en) * 2014-04-02 2015-10-08 International Business Machines Corporation Metadata-driven workflows and integration with genomic data processing systems and techniques
US20150379193A1 (en) * 2014-06-30 2015-12-31 QIAGEN Redwood City, Inc. Methods and systems for interpretation and reporting of sequence-based genetic tests
US20160191076A1 (en) * 2014-08-29 2016-06-30 Bonnie Berger Leighton Compressively-accelerated read mapping framework for next-generation sequencing
US20190026425A1 (en) * 2015-12-24 2019-01-24 YouGene, Inc. Curated genetic database for in silico testing, licensing and payment
US20200042735A1 (en) * 2016-10-11 2020-02-06 Genomsys Sa Method and system for selective access of stored or transmitted bioinformatics data
US20200244283A1 (en) * 2019-01-30 2020-07-30 International Business Machines Corporation Managing compression and storage of genomic data

Also Published As

Publication number Publication date
ES2930699A1 (en) 2022-12-20

Similar Documents

Publication Publication Date Title
Pérez-Cobas et al. Metagenomic approaches in microbial ecology: an update on whole-genome and marker gene sequencing analyses
Singh et al. Integrative toxicogenomics: Advancing precision medicine and toxicology through artificial intelligence and OMICs technology
Butler The future of forensic DNA analysis
Hemani et al. Retracted article: Detection and replication of epistasis influencing transcription in humans
Bragg et al. Metagenomics using next-generation sequencing
Ellegren Sequencing goes 454 and takes large‐scale genomics into the wild
Tripathi et al. Next-generation sequencing revolution through big data analytics
Korpelainen et al. RNA-seq data analysis: a practical approach
US20150211054A1 (en) Haplotype resolved genome sequencing
Roy et al. SeqReporter: automating next-generation sequencing result interpretation and reporting workflow in a clinical laboratory
Furlani et al. Sequencing of Nucleic Acids: from the First Human Genome to Next Generation Sequencing in COVID-19 Pandemic.
WO2022258866A1 (en) Method of genomic analysis on a bioinformatics platform
Jiang et al. DRAMS: A tool to detect and re-align mixed-up samples for integrative studies of multi-omics data
Muscarella et al. Automated workflow for somatic and germline next generation sequencing analysis in routine clinical cancer diagnostics
JP2022544991A (en) Methods for control of sequencing devices
Nimmy et al. Investigation of DNA discontinuity for detecting tuberculosis
Budowle et al. The forensic genomics toolbox is expanding
Salehin et al. Prenet: Predictive network from ATAC-SEQ data
Baßler et al. A Bioinformatic Toolkit for Single-Cell mRNA Analysis
Mangalea et al. Assembly and Annotation of Viral Metagenomes from Short-Read Sequencing Data
US20230317211A1 (en) Method and system for encrypting genetic data of a subject
CN106599612B (en) Fingerprint identification method based on high-throughput sequencing data
Latham Next-generation sequencing of formalin-fixed, paraffin-embedded tumor biopsies: navigating the perils of old and new technology to advance cancer diagnosis
US20160154930A1 (en) Methods for identification of individuals
Zaaijer et al. Rapid DNA re-identification for cell line authentication and forensics

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22819687

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE