CN112435712A

CN112435712A - Method and system for analyzing gene sequencing data

Info

Publication number: CN112435712A
Application number: CN202011314466.0A
Authority: CN
Inventors: 郎继东; 田埂; 梁乐彬; 杨家亮
Original assignee: Geneis Technology Suzhou Co ltd
Current assignee: Geneis Technology Suzhou Co ltd
Priority date: 2020-11-20
Filing date: 2020-11-20
Publication date: 2021-03-02

Abstract

The invention discloses a method and a system for analyzing gene sequencing data. The method makes use of all detected variation information and adds a visualization step: the unfiltered variation information is converted into a simulated image in which the distribution and density of variation can be seen directly, and the image is then processed with image comparison or image recognition techniques to find differences. This greatly reduces analysis complexity, shortens the analysis time severalfold compared with conventional analysis methods, and makes the analysis process more concise and intuitive.

Description

Method and system for analyzing gene sequencing data

Technical Field

The present invention relates to the field of bioinformatic analysis, and in particular to a method and system for analyzing gene sequencing data.

Background

With advances in technology, the cost of gene sequencing has fallen rapidly, so that large amounts of sequencing data are being generated and the demand for analyzing these data is growing and becoming more fine-grained. As a result, the use of sequencing to detect cancer biomarkers is becoming increasingly routine and personalized. At present, the most widely applied solutions are based on next-generation sequencing (NGS) technologies. For example, whole genome sequencing, whole exome sequencing, high-depth target region sequencing, transcriptome sequencing, and methylation sequencing are applied to real-time monitoring and targeted medication of cancer patients, and can also be applied to large-scale cohort data to discover new cancer-specific biomarkers for the development of new drugs or novel therapies. In recent years, third-generation sequencing technologies (such as PacBio and Oxford Nanopore) have also developed rapidly and are increasingly applied in the clinic in combination with second-generation sequencing, making detection results more and more accurate.

In recent years, breakthroughs in key technologies such as image recognition, deep learning, and neural networks have driven the rapid development of artificial intelligence. The combination of artificial intelligence and medicine is also advancing quickly, and machine-learning-assisted diagnosis, treatment, and analysis are widely applied. For example, the 2017 Nature cover article "Dermatologist-level classification of skin cancer with deep neural networks" showed that a deep neural network can classify skin cancer at the level of dermatologists, with an accuracy of more than 91%; likewise, for the novel coronavirus now spreading worldwide, nucleic acid testing combined with CT scanning is treated as the "gold standard" for detection. Meanwhile, because machine learning methods differ from traditional statistical methods and offer good generalization and accuracy, they are again being applied to big-data mining. For example, several major studies have re-analyzed the sequencing data of The Cancer Genome Atlas (TCGA) with machine learning methods, solving many problems that traditional statistical methods could not and producing many important research results.

However, machine-learning-based analysis of sequencing data currently starts from the variation results (SNV, Indel, SV, CNV, and the like) produced by analysis software or devices, applies certain filtering conditions, and uses the filtered results for downstream modeling. Because modeling complexity grows exponentially with the number of sites, models generally cannot use too many sites; otherwise large amounts of computing resources and time are consumed. Moreover, the filtering conditions are usually set according to the analyst's experience, which introduces considerable subjectivity: conditions that are too strict or too loose introduce many false negatives or false positives and make the results inaccurate. In addition, machine learning and deep learning are widely regarded as "black box" methods, much of whose behavior cannot yet be explained by existing theory, so the choice and application of a method must be based on practical consideration of the problem to be solved.

The information in this background is only for the purpose of illustrating the general background of the invention and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art that is known to a person skilled in the art.

Disclosure of Invention

To address at least some of the technical problems in the prior art, the invention adopts a new approach to variation analysis: starting from the primary analysis results obtained after variation detection, it first converts those results into a simulated image, then searches for differences directly by image comparison, and finally converts the differences back into the corresponding variation information (such as chromosome position, base type, and base structure) to obtain the difference result. Specifically, the present invention includes the following.

In a first aspect of the invention, there is provided a method for analysing gene sequencing data, comprising the steps of:

(1) aligning the gene sequencing data to be analyzed to a standard genome to obtain the variation information in the gene sequencing data to be analyzed, and arranging the variation information to obtain a mutation consensus sequence consisting of A, T, G and C;

(2) converting the mutation consensus sequence into a digital matrix consisting of 0 and 1, and, taking the elements of the digital matrix as pixels, converting the mutation consensus sequence into a simulated image according to a preset rule;

(3) comparing the simulated image with a reference simulated image and searching for image differences between them, wherein the reference simulated image is obtained by converting reference gene sequencing data; and

(4) confirming, according to the pixel coordinates of the image differences, the mutation difference sites or difference regions between the gene corresponding to the gene sequencing data to be analyzed and the gene corresponding to the reference gene sequencing data.

In certain embodiments of the method for analyzing gene sequencing data according to the present invention, the reference simulated image is obtained by converting reference gene sequencing data by the same method as in steps (1) and (2).

In certain embodiments of the method, the standard genome is a human genome.

In certain embodiments of the method, the variation information comprises at least one of a point mutation, a structural variation, and a methylation level site.

In certain embodiments of the method, the variation information is ordered according to chromosome number or chromosome position, and chromosomal loci without variation are filled with the base types at the corresponding positions of the standard genome.

In certain embodiments of the method, the simulated image is compared with the reference simulated image by visual inspection or by image recognition techniques.

In a second aspect of the invention, there is provided a system for analyzing gene sequencing data, comprising:

a. an input device for receiving gene sequencing data to be analyzed;

b. a memory having a database for storing at least the information of the reference simulated image and the gene sequencing data to be analyzed that is input through the input device;

c. a processor capable of communicating with the memory and configured to: retrieve the gene sequencing data to be analyzed from the memory, convert the gene sequencing data into a simulated image, and compare the simulated image with the reference simulated image to obtain the mutation difference sites or difference regions between the gene corresponding to the gene sequencing data to be analyzed and the gene corresponding to the reference gene sequencing data; and

d. an output or display device for outputting or displaying information on the mutation difference sites or difference regions.

In certain embodiments of the system for analyzing gene sequencing data according to the present invention, the converting comprises aligning the gene sequencing data to be analyzed to a standard genome to obtain the variation information in the gene sequencing data to be analyzed, arranging the variation information to obtain a mutation consensus sequence consisting of A, T, G and C, converting the mutation consensus sequence into a digital matrix consisting of 0 and 1, and, taking the elements of the digital matrix as pixels, converting the mutation consensus sequence into a simulated image according to a predetermined rule.

In certain embodiments of the system, the information of the standard genome is pre-stored in the memory or is retrieved by the system from a database over a network.

The invention makes use of all detected variation information and adds a visualization step: the unfiltered variation information is converted into a simulated image in which the distribution and density of variation can be seen directly, and image comparison or image recognition techniques are applied to the image to find differences. This greatly reduces analysis complexity and analysis time, and makes both the analysis process and its results more concise and intuitive.

For example, to compare the tumor mutation burden (TMB) and the cancer-specific differential mutations of lung cancer and pancreatic cancer, the conventional approach first obtains somatic mutation results with a variant detection method (e.g., GATK), then applies filtering conditions to obtain a more "accurate" result, and then screens for differences by modeling and comparison using statistical methods such as clustering and principal component analysis, or machine learning methods such as neural networks, logistic regression, and classifiers. This consumes large amounts of analysis resources and time, and the confidence of the result depends on the experience and ability of the analyst. With the present method, once the somatic mutation results are obtained, the two result sets are converted directly into two simulated images. The level of TMB can be judged intuitively from the distribution and density of the mutations, and the differing points or blocks of the two images are then found directly with image recognition or image comparison techniques, which also eliminates inaccuracies caused by systematic errors introduced by sequencing or experiments. Because the result is obtained in a single analysis pass, the analysis time is shortened severalfold compared with conventional analysis methods.
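The idea that the mutation burden can be read from the distribution and density of mutations can be illustrated with a small numeric analogue. The following Python sketch is not part of the original disclosure: it assumes a consensus string and a reference string of equal length, and the function name `mutation_density`, the toy sequences, and the window size are hypothetical choices made for illustration only.

```python
from typing import List


def mutation_density(consensus: str, reference: str, window: int) -> List[float]:
    """Fraction of positions that differ from the reference in each window."""
    if len(consensus) != len(reference):
        raise ValueError("sequences must have equal length")
    if window <= 0:
        raise ValueError("window must be positive")
    densities = []
    for start in range(0, len(reference), window):
        ref_chunk = reference[start:start + window]
        con_chunk = consensus[start:start + window]
        mismatches = sum(1 for a, b in zip(con_chunk, ref_chunk) if a != b)
        densities.append(mismatches / len(ref_chunk))
    return densities


# Toy data (assumed, not from the patent): two mutations in the first window.
toy_reference = "AAAAAAAA"
toy_consensus = "ATTAAAAA"
toy_density = mutation_density(toy_consensus, toy_reference, window=4)
```

A window with a higher value corresponds to a denser region in the simulated image; the patent itself relies on visual inspection or image comparison rather than on this particular statistic.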

Drawings

FIG. 1 is a diagram illustrating the flow of an analysis method of the present invention;

FIG. 2 shows the analysis process and results of the embodiment of the present invention.

Detailed Description

Reference will now be made in detail to various exemplary embodiments of the invention. The detailed description should not be construed as limiting the invention, but as a more detailed description of certain aspects, features, and embodiments of the invention.

It is to be understood that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. Further, for numerical ranges in this disclosure, it is understood that the upper and lower limits of the range, and each intervening value therebetween, is specifically disclosed. Every smaller range between any stated value or intervening value in a stated range and any other stated or intervening value in a stated range is encompassed within the invention. The upper and lower limits of these smaller ranges may independently be included or excluded in the range.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although only preferred methods and materials are described herein, any methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present invention. All documents mentioned in this disclosure are incorporated by reference for the purpose of disclosing and describing the methods and/or materials associated with the documents. In case of conflict with any incorporated document, the present disclosure controls. Unless otherwise indicated, "%" is percent by weight.

In a first aspect of the invention, there is provided a method for analyzing gene sequencing data, sometimes referred to simply as the method of the invention. It is typically used to confirm differences or changes between genes from two different populations, for example to detect cancer-specific variations from sequencing data. The method mainly comprises the following steps:

The gene sequencing data of the present invention is not particularly limited, and may include second generation sequencing data or third generation sequencing data, and specifically may be data obtained by whole genome sequencing, whole exome sequencing, high-depth target region sequencing, transcriptome sequencing, methylation sequencing, or a combination thereof.

The standard genome of the present invention is a genome that embodies the genetic information of a species; an example is the human genome, which consists of 23 pairs of chromosomes: 22 pairs of autosomes and 1 pair of sex chromosomes. The information of the standard genome is preferably known information commonly used in the art. Human genome information is available through the internet; for example, hg19 can be obtained from http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz. The variation information present in the sequencing data can be found by alignment to the standard genome.

The variation information of the present invention refers to information that differs from the standard genome. These differences include, but are not limited to, point mutations (SNP/SNV/InDel), structural variations (SV/CNV), and methylation level loci (CpG). The variation information of the present invention also includes combinations of at least two of the above.

In the present invention, the method of arranging the variation information is not particularly limited, and any known method may be used. The arrangement is generally performed with reference to the standard genome. In one exemplary embodiment, the variation information is ordered by chromosome number. In further exemplary embodiments, the variation information is ordered by chromosome position. Arranging the variation information yields the mutation consensus sequence consisting of A, T, C and G.
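One way to picture the arrangement step is the following minimal Python sketch. It is an illustration under stated assumptions, not the disclosed implementation: only single-base substitutions are handled (insertions and deletions would shift coordinates and need separate treatment), and the toy reference, the variant tuples, and the names `chromosome_rank` and `mutation_consensus` are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical toy reference: chromosome name -> sequence (not real hg19 data).
REFERENCE: Dict[str, str] = {
    "chr1": "ACGTACGT",
    "chr2": "TTGCA",
}

# Hypothetical variant calls: (chromosome, 1-based position, alternate base).
VARIANTS: List[Tuple[str, int, str]] = [
    ("chr2", 3, "A"),
    ("chr1", 5, "G"),
    ("chr1", 2, "T"),
]


def chromosome_rank(name: str) -> int:
    """Sort key placing chr1..chr22 first, then chrX, then chrY."""
    label = name[3:] if name.startswith("chr") else name
    if label.isdigit():
        return int(label)
    return {"X": 23, "Y": 24}[label]


def mutation_consensus(
    reference: Dict[str, str], variants: List[Tuple[str, int, str]]
) -> Dict[str, str]:
    """Substitute variant bases into the reference, chromosome by chromosome.

    Positions without a variant keep the reference base.
    """
    consensus = {chrom: list(seq) for chrom, seq in reference.items()}
    ordered = sorted(variants, key=lambda v: (chromosome_rank(v[0]), v[1]))
    for chrom, pos, alt in ordered:
        consensus[chrom][pos - 1] = alt
    return {chrom: "".join(bases) for chrom, bases in consensus.items()}


consensus = mutation_consensus(REFERENCE, VARIANTS)
```

Sorting by chromosome number and then by position mirrors the two ordering options named in the text; for substitutions the order does not change the result, but it keeps the output deterministic.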

In the present invention, in order to convert the mutation consensus sequence expressed in A, T, C and G into a simulated image, the four bases are first digitally encoded. The encoding rule is not particularly limited as long as each of A, T, C and G is represented by a unique code. For example, A may be encoded as 10, T as 01, C as 00, and G as 11. On this basis, a person skilled in the art can freely choose other encodings as required without affecting the object of the invention. After encoding, a digital matrix consisting of 0 and 1 is obtained. Next, the elements of the 0/1 matrix are used as the pixels of an image, so that the mutation consensus sequence, represented by a plurality of pixels, is converted into a simulated image. To facilitate comparison, the pixels are preferably arranged according to a predetermined rule. In general, the arrangement rule of the simulated image obtained from the gene sequencing data to be analyzed is identical to that of the reference simulated image. In an exemplary embodiment, the pixels are first grouped into sub-images corresponding to the individual chromosomes, and the sub-images are then arranged chromosome by chromosome. For example, for the 24 human chromosomes, the corresponding sub-images may be arranged in a 4x6 or 6x4 grid.
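The encoding and arrangement described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the example code A=10, T=01, C=00, G=11 from the text. The helper names `encode_sequence`, `to_sub_image`, and `tile_grid`, the fixed sub-image width, and the zero padding of the last row are assumptions made here; a real implementation would need to record the true sequence length, because padded zeros are indistinguishable from the code for C.

```python
from typing import Dict, List

# Encoding taken from the example in the text: A=10, T=01, C=00, G=11.
BASE_BITS: Dict[str, str] = {"A": "10", "T": "01", "C": "00", "G": "11"}


def encode_sequence(sequence: str) -> List[int]:
    """Turn a base string into a flat list of 0/1 pixels (two per base)."""
    return [int(bit) for base in sequence for bit in BASE_BITS[base]]


def to_sub_image(pixels: List[int], width: int) -> List[List[int]]:
    """Wrap a flat pixel list into rows of `width`, padding the last row with 0."""
    padded = pixels + [0] * (-len(pixels) % width)
    return [padded[i:i + width] for i in range(0, len(padded), width)]


def tile_grid(sub_images: List[List[List[int]]], columns: int) -> List[List[int]]:
    """Place equally sized sub-images left to right, `columns` per grid row."""
    image: List[List[int]] = []
    for start in range(0, len(sub_images), columns):
        group = sub_images[start:start + columns]
        for row_index in range(len(group[0])):
            image.append([px for sub in group for px in sub[row_index]])
    return image


# 24 toy "chromosomes" with the same short sequence, arranged 4 rows x 6 columns.
sub_images = [to_sub_image(encode_sequence("ATCG"), width=4) for _ in range(24)]
grid_image = tile_grid(sub_images, columns=6)
```

Passing `columns=4` instead gives the 6x4 arrangement. Equal sub-image sizes are assumed here for simplicity, whereas real chromosomes differ in length.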

In the present invention, the reference simulated image is the image used for comparison, generally a simulated image converted from sequencing data other than that of the gene to be analyzed. The data conversion method or process used to obtain the simulated image and the reference simulated image is generally the same. The reference simulated image may be an image obtained by conversion in advance, or another simulated image obtained by processing together with the gene sequencing data to be analyzed. For example, given a first set and a second set of gene sequencing data, the image obtained from the first set can be used as the simulated image and the image obtained from the second set as the reference simulated image, or vice versa.

In the present invention, after the simulated image is obtained, the comparison between the simulated image and the reference simulated image may be performed by a known method, such as visual inspection or a known image recognition technique. Examples of tools for image recognition include, but are not limited to, Python, OpenCV, and scikit-image.

After the image difference is obtained by image comparison, the mutation difference site or difference region of the gene to be analyzed and the reference gene can be confirmed by the difference. This allows further differentiation between the gene to be analyzed and the reference gene.
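The comparison step can be illustrated with a dependency-free Python sketch. The text names OpenCV and scikit-image as possible tools; the version below deliberately uses plain lists instead, so it is a stand-in for those libraries rather than a reproduction of their APIs. The names `difference_mask` and `difference_runs`, the toy images, and the choice of 255 for white are assumptions for illustration.

```python
from typing import List, Tuple

Image = List[List[int]]


def difference_mask(image: Image, reference: Image) -> Image:
    """Return a mask with 255 (white) where pixels differ and 0 (black) elsewhere."""
    if len(image) != len(reference) or any(
        len(a) != len(b) for a, b in zip(image, reference)
    ):
        raise ValueError("images must have identical dimensions")
    return [
        [255 if a != b else 0 for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(image, reference)
    ]


def difference_runs(mask: Image) -> List[Tuple[int, int, int]]:
    """Group white pixels in each row into runs: (row, first_col, last_col)."""
    runs = []
    for r, row in enumerate(mask):
        start = None
        for c, value in enumerate(row + [0]):  # sentinel closes a trailing run
            if value and start is None:
                start = c
            elif not value and start is not None:
                runs.append((r, start, c - 1))
                start = None
    return runs


# Toy 2x4 binary images (assumed data).
sample_image = [[1, 0, 0, 1], [0, 0, 1, 1]]
reference_image = [[1, 0, 1, 1], [1, 1, 1, 1]]
mask = difference_mask(sample_image, reference_image)
runs = difference_runs(mask)
```

A run of length one corresponds to an isolated difference point, and a longer run to a difference block within a row; merging runs across adjacent rows into two-dimensional blocks is left out of this sketch.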

The analysis method of the present invention is exemplified below with reference to fig. 1. It should be noted that fig. 1 is only used for illustrative purposes and is not intended to limit the scope of the present invention. As shown in fig. 1, the analysis method of the present invention mainly includes the following:

aligning the sequencing data to the human genome (http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz) and performing variation detection;

arranging the variation results by chromosome, or by chromosome position, to obtain a mutation consensus sequence, wherein chromosomal sites without variation are filled with the reference sequence base at the corresponding position;

converting the bases of the mutation consensus sequence into a string containing only 0 and 1, the conversion rule being that base A is replaced by 10, T by 01, C by 00, and G by 11, which yields a 0/1 digital matrix; the elements of the 0/1 matrix are then taken as the pixels of an image and converted into a simulated image;

arranging the simulated images: since humans have 22 pairs of autosomes and one pair of sex chromosomes (X and Y), and each chromosome is converted independently, the simulated image converted from a long sequence is itself long, so the chromosome simulated images are arranged to make the overall image "regular"; and

searching for difference points or blocks between the images using image comparison or image recognition techniques, and finding the corresponding chromosome positions and mutation sites or mutation regions, namely the mutation difference sites or difference regions, from the pixel coordinates of the image difference points or blocks.
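The last step above, going from pixel coordinates back to chromosome positions, can be sketched as follows. The sketch assumes one particular layout that the text does not prescribe: each chromosome is encoded as a single-row barcode with two pixels per base, and the barcodes are joined end to end in one row. The names `build_offsets` and `pixel_to_locus` and the toy chromosome lengths are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Tuple

BITS_PER_BASE = 2  # two pixels per base under the A=10, T=01, C=00, G=11 rule


def build_offsets(chrom_lengths: List[Tuple[str, int]]) -> List[int]:
    """Starting pixel column of each chromosome, plus the total row width."""
    widths = [length * BITS_PER_BASE for _, length in chrom_lengths]
    return [0] + list(accumulate(widths))


def pixel_to_locus(
    column: int, chrom_lengths: List[Tuple[str, int]], offsets: List[int]
) -> Tuple[str, int]:
    """Map a pixel column to (chromosome, 1-based base position)."""
    if not 0 <= column < offsets[-1]:
        raise IndexError("pixel column lies outside the spliced row")
    index = bisect_right(offsets, column) - 1
    chrom = chrom_lengths[index][0]
    position = (column - offsets[index]) // BITS_PER_BASE + 1
    return chrom, position


# Toy chromosome lengths in bases (assumed, far shorter than real chromosomes).
toy_lengths = [("chr1", 8), ("chr2", 5)]
toy_offsets = build_offsets(toy_lengths)
locus = pixel_to_locus(25, toy_lengths, toy_offsets)
```

Both pixels of a base map to the same position, so a difference in either bit identifies the same site. A multi-row layout would additionally need the row index to select the correct group of chromosomes.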

In the second aspect of the present invention, there is provided a system for analyzing gene sequencing data, sometimes referred to simply as the system of the present invention. It may be implemented as a computer or an analysis instrument, and its form is not particularly limited as long as it can carry out the analysis method of the first aspect of the present invention.

The system of the present invention generally comprises:

a. an input device for receiving gene sequencing data to be analyzed;

In the system of the present invention, the memory is used to store at least the information of the reference simulated image and the gene sequencing data to be analyzed that is input through the input device. Such storage covers the case where a reference simulated image is imported into the memory of the system from outside, through the input device or the like, before the system is operated and is stored there long-term or permanently, as well as the case where sequencing data other than that of the gene to be analyzed is converted into a simulated image by operating the system and is stored temporarily.

The information on the standard genome of the present invention may be stored in a memory in advance, or may be retrieved from a database by a system via a network.

Examples

As an example, 45 patients with renal cancer (kidney chromophobe) and 256 patients with prostate adenocarcinoma were selected from the TCGA database. The method of the present invention was used to search for the specific differential genes and mutations of the two cancer types (shown in FIG. 2). All samples in this example have publicly available result data sets, which makes it convenient to compare consistency between methods.

The raw TCGA sequencing data of the 301 samples were aligned to the human reference genome with BWA to obtain alignment files in SAM format; the SAM files were then sorted and deduplicated with Samtools to obtain files in BAM format; finally, the somatic mutation results of all samples were obtained with VarScan, the set of all somatic mutations of the renal cancer patients being denoted A and that of the prostate adenocarcinoma patients being denoted B;

all mutation positions of set A and set B were merged in ascending chromosome order (chr1, chr2, chr3, ..., chrX, chrY), and the mutated base types of each sample in set A and set B were arranged by chromosome position (with samples in random order), positions without mutation being filled with the base type of the reference genome, to obtain the "mutation consensus sequence" matrix M1 for the renal cancer patients and the "mutation consensus sequence" matrix M2 for the prostate adenocarcinoma patients;

the "mutation consensus sequence" matrices M1 and M2 were each converted to 0/1 data, the rule being that base A is replaced by 10, T by 01, C by 00, and G by 11, to obtain the 0/1 digital matrices D1 and D2; the renal cancer matrix D1 was then up-sampled by sample count, that is, randomly sampled from its 45 samples up to 256 samples, the number of prostate adenocarcinoma patients;

matrices D1 and D2 were converted into simulated images using the ImageIO library for Python, denoted F1 and F2, respectively;

the simulated barcode images of F1 and F2 were each arranged in ascending chromosome order, namely chr1-chr6 in the first row, chr7-chr12 in the second row, chr13-chr18 in the third row, and chr19-chrY in the fourth row; because the chromosomes differ in length, this step used direct splicing, that is, the barcodes were joined with no blank space between them, to obtain new simulated images NF1 and NF2 for the two cancer types;

NF1 and NF2 were compared using Python, OpenCV, and scikit-image to obtain the difference points and difference blocks of the images, where black marks excluded positions (that is, identical results) and white marks differing positions;

the pixel coordinates of the difference points and difference blocks were mapped back to the corresponding positions of the "mutation consensus sequence" matrices M1 and M2 to find the differential genes and differential sites of renal cancer and prostate adenocarcinoma;

the differential sites obtained were compared with the officially published TCGA differential analysis results for renal cancer and prostate adenocarcinoma; the concordance rate was 100%, and the time required to analyze the sequencing data of the 301 samples was reduced from about 1200 hours to about 82 hours (roughly a 15-fold reduction), which shows that the method is feasible.
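The up-sampling of D1 from 45 to 256 rows in the example can be pictured with the short Python sketch below. The text only states that rows are randomly sampled up to the larger cohort size; keeping every original row and drawing the additional rows with replacement is an assumption made here, as are the function name `upsample_rows`, the fixed seed, and the toy matrix.

```python
import random
from typing import List, Sequence


def upsample_rows(
    rows: Sequence[List[int]], target: int, seed: int = 0
) -> List[List[int]]:
    """Keep every original row and add randomly drawn copies up to `target` rows."""
    if not rows:
        raise ValueError("cannot up-sample an empty matrix")
    if target < len(rows):
        raise ValueError("target must be at least the current number of rows")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    extra = [list(rng.choice(rows)) for _ in range(target - len(rows))]
    return [list(row) for row in rows] + extra


# Toy stand-in for D1: 45 sample rows of two pixels each (assumed data).
toy_d1 = [[i % 2, (i + 1) % 2] for i in range(45)]
toy_d1_upsampled = upsample_rows(toy_d1, target=256)
```

After this step both matrices have the same number of rows, so the two simulated images have the same height and can be compared pixel by pixel.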

While the present invention has been described with reference to exemplary embodiments, it is to be understood that the invention is not limited to the disclosed exemplary embodiments. Various modifications or changes may be made to the exemplary embodiments of the present invention without departing from the scope or spirit of the present invention. The scope of the claims is to be accorded the broadest interpretation so as to encompass all modifications and equivalent structures and functions.

Claims

1. A method for analyzing gene sequencing data, comprising the steps of:

2. The method for analyzing gene sequencing data of claim 1, wherein the reference simulated image is obtained by converting the reference gene sequencing data by the same method as in steps (1) and (2).

3. The method for analyzing gene sequencing data of claim 1, wherein the standard genome is a human genome.

4. The method for analyzing gene sequencing data of claim 1, wherein the variation information comprises at least one of a point mutation, a structural variation, and a methylation level site.

5. The method for analyzing gene sequencing data of claim 1, wherein the variation information is ordered according to chromosome number or chromosome position, and the chromosomal locus without variation is replaced with the base type of the corresponding position of the standard genome.

6. The method for analyzing gene sequencing data of claim 1, wherein the simulated image is compared to a reference simulated image by visual inspection or image recognition techniques.

7. A system for analyzing gene sequencing data, comprising:

a. an input device for receiving gene sequencing data to be analyzed;

8. The system for analyzing gene sequencing data of claim 7, wherein the converting comprises aligning the gene sequencing data to be analyzed to a standard genome to obtain the variation information in the gene sequencing data to be analyzed, arranging the variation information to obtain a mutation consensus sequence consisting of A, T, G and C, converting the mutation consensus sequence into a digital matrix consisting of 0 and 1, and, taking the elements of the digital matrix as pixels, converting the mutation consensus sequence into a simulated image according to a predetermined rule.

9. The system for analyzing gene sequencing data of claim 7, wherein the information of the standard genome is pre-stored in the memory or retrieved by the system from a database over a network.