CN112435712A - Method and system for analyzing gene sequencing data - Google Patents

Publication number
CN112435712A
Authority
CN
China
Prior art keywords
sequencing data
image
gene sequencing
mutation
difference
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011314466.0A
Other languages
Chinese (zh)
Inventor
郎继东
田埂
梁乐彬
杨家亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Geneis Technology Suzhou Co ltd
Original Assignee
Geneis Technology Suzhou Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Geneis Technology Suzhou Co ltd
Priority to CN202011314466.0A
Publication of CN112435712A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10: Sequence alignment; Homology search
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis
    • G06T7/0002: Inspection of images, e.g. flaw detection
    • G06T7/0012: Biomedical image inspection
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Analytical Chemistry (AREA)
  • Artificial Intelligence (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Chemical & Material Sciences (AREA)
  • Radiology & Medical Imaging (AREA)
  • Quality & Reliability (AREA)
  • General Physics & Mathematics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioethics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a method and a system for analyzing gene sequencing data. The method uses all detected variation information and adds a visual element: unfiltered variation information is rendered as a simulated image in which the distribution and density of variations can be seen directly, and image comparison or image recognition techniques are applied to the image to find differences. This greatly reduces analysis complexity, shortens analysis time several-fold compared with traditional analysis methods, and makes the analysis process more concise and intuitive.

Description

Method and system for analyzing gene sequencing data
Technical Field
The present invention relates to the field of bioinformatic analysis, and in particular to a method and system for analyzing gene sequencing data.
Background
As technology has progressed, the cost of gene sequencing has fallen rapidly, generating large volumes of gene sequencing data; demand for analyzing these data keeps growing, and the analyses are becoming increasingly fine-grained. As a result, the use of sequencing technologies to detect biomarkers in cancer is becoming increasingly routine and personalized. At present, the most widely applied solutions still center on next-generation sequencing (NGS) technologies: whole genome sequencing, whole exome sequencing, high-depth target region sequencing, transcriptome sequencing, methylation sequencing and the like are applied to real-time monitoring and targeted medication of cancer patients, and can further be applied to large-scale cohort data to discover new cancer-specific biomarkers for the development of new drugs or novel therapies. In recent years, third-generation sequencing technologies (such as PacBio and Oxford Nanopore) have also developed rapidly and are increasingly applied in the clinic in combination with second-generation sequencing, making detection results increasingly accurate.
In recent years, breakthroughs in key technologies such as image recognition, deep learning and neural networks have driven the rapid development of artificial intelligence, and the application of artificial intelligence to medicine has advanced just as quickly; machine-learning-assisted diagnosis, treatment and analysis in particular are widely used. For example, the 2017 Nature cover article "Dermatologist-level classification of skin cancer with deep neural networks" showed that a deep neural network can classify skin cancer at the level of dermatologists, with an accuracy above 91%; likewise, for detection of the novel coronavirus now spreading worldwide, nucleic acid testing combined with CT scanning is treated as the "gold standard". Meanwhile, because machine learning methods differ from traditional statistical methods and offer good generalization and accuracy, they are again being applied to big-data mining. For example, several major studies have re-analyzed The Cancer Genome Atlas (TCGA) sequencing database using machine learning, solving many problems that traditional statistical methods could not and producing many significant results.
However, machine-learning-based analysis of sequencing data currently starts from the variation results (SNV/Indel/SV/CNV and the like) produced by analysis software or devices and applies filtering conditions to obtain a filtered set for downstream modeling. Because modeling complexity grows exponentially with the number of sites, models generally cannot include too many sites without consuming large amounts of computing resources and time. In addition, filtering conditions are usually set according to the analyst's experience, which introduces substantial subjectivity: conditions that are too strict or too loose introduce many false negatives or false positives, respectively, and make the results inaccurate. Furthermore, machine learning and deep learning are widely regarded as "black box" methods, much of whose behavior cannot yet be explained by existing theory, so the choice and application of a method must be based on the practical problem to be solved.
The information in this background is only for the purpose of illustrating the general background of the invention and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art that is known to a person skilled in the art.
Disclosure of Invention
To address at least some of the technical problems in the prior art, the invention adopts a new approach to analyzing variation: after variation detection, the primary analysis result is first converted into a simulated image, differences are then found directly by image comparison, and the differences are finally converted back into the corresponding variation information (such as chromosome position, base type and base structure) to obtain the difference result. Specifically, the present invention includes the following.
In a first aspect of the invention, there is provided a method for analysing gene sequencing data, comprising the steps of:
(1) aligning the gene sequencing data to be analyzed to a standard genome to obtain variation information in the gene sequencing data to be analyzed, and arranging the variation information to obtain a mutation consensus sequence composed of the bases A, T, G and C;
(2) converting the mutation consensus sequence into a digital matrix consisting of 0s and 1s, using the entries of the digital matrix as pixels, and thereby converting the mutation consensus sequence into a simulated image according to a predetermined rule;
(3) comparing the simulated image with a reference simulated image to find image differences between them, wherein the reference simulated image is obtained by converting reference gene sequencing data; and
(4) identifying, from the pixel coordinates of the image differences, the mutation difference sites or difference regions between the gene corresponding to the gene sequencing data to be analyzed and the gene corresponding to the reference gene sequencing data.
In certain embodiments of the method for analyzing gene sequencing data according to the present invention, the reference simulated image is obtained by converting reference gene sequencing data by the same method as in steps (1) and (2).
In certain embodiments of the method for analyzing gene sequencing data according to the present invention, the standard genome is a human genome.
In certain embodiments of the method for analyzing gene sequencing data according to the present invention, the variation information comprises at least one of a point mutation, a structural variation, and a methylation level site.
In certain embodiments of the method for analyzing gene sequencing data according to the present invention, the variation information is ordered by chromosome number or chromosomal position, and chromosomal sites without variation are filled with the base type at the corresponding position of the standard genome.
In certain embodiments of the method for analyzing gene sequencing data according to the present invention, the simulated image is compared with the reference simulated image by visual inspection or by image recognition techniques.
In a second aspect of the invention, there is provided a system for analyzing gene sequencing data, comprising:
a. an input device for receiving gene sequencing data to be analyzed;
b. a memory having a database for storing at least the reference simulated image information and the gene sequencing data to be analyzed that is input through the input device;
c. a processor capable of communicating with the memory and configured to: retrieve the gene sequencing data to be analyzed from the memory, convert it into a simulated image, and compare the simulated image with the reference simulated image to obtain the mutation difference sites or difference regions between the gene corresponding to the gene sequencing data to be analyzed and the gene corresponding to the reference gene sequencing data;
d. an output or display device for outputting or displaying information on the mutation difference sites or difference regions.
In some embodiments of the system for analyzing gene sequencing data according to the present invention, the converting comprises aligning the gene sequencing data to be analyzed to a standard genome to obtain variation information in the gene sequencing data to be analyzed, arranging the variation information to obtain a mutation consensus sequence composed of the bases A, T, G and C, converting the mutation consensus sequence into a digital matrix consisting of 0s and 1s, using the entries of the digital matrix as pixels, and thereby converting the mutation consensus sequence into a simulated image according to a predetermined rule.
In certain embodiments of the system for analyzing gene sequencing data according to the present invention, the information of the standard genome is pre-stored in the memory or retrieved by the system from a database over a network.
The invention uses all detected variation information and also adds a visual element: unfiltered variation information is rendered as a simulated image in which the distribution and density of variations can be seen directly, and image comparison or image recognition techniques are applied to the image to find differences. This greatly reduces analysis complexity, lowers the time cost of analysis, and makes both the analysis process and the result more concise and intuitive.
For example, to compare the tumor mutation burden (TMB) and the cancer-specific differential mutations of lung cancer and pancreatic cancer, the conventional approach first obtains somatic mutation results with a mutation detection method (e.g., GATK), then sets filtering conditions to obtain a more "accurate" result, and then performs modeling and comparison using statistical methods such as clustering and principal component analysis, or machine learning methods such as neural networks, logistic regression and classifiers, before screening out the differences. This consumes substantial analysis resources and time, and the confidence of the result depends on the analyst's experience and ability. With the present invention, once the somatic mutation results are obtained, the two results are directly rendered as two simulated images; the TMB level can be judged intuitively from the distribution and density of mutations, and the differing points or blocks of the two images are then found directly using image recognition or image comparison techniques, which also eliminates inaccuracies caused by systematic errors introduced by sequencing or experiments. Because the result is obtained in a single analysis, analysis time is shortened several-fold compared with the traditional analysis method.
Drawings
FIG. 1 is a diagram illustrating the flow of an analysis method of the present invention;
FIG. 2 shows the analysis process and results of the embodiment of the present invention.
Detailed Description
Reference will now be made in detail to various exemplary embodiments of the invention. The detailed description should not be construed as limiting the invention, but as a more detailed description of certain aspects, features and embodiments of the invention.
It is to be understood that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. Further, for numerical ranges in this disclosure, it is understood that the upper and lower limits of the range, and each intervening value therebetween, is specifically disclosed. Every smaller range between any stated value or intervening value in a stated range and any other stated or intervening value in a stated range is encompassed within the invention. The upper and lower limits of these smaller ranges may independently be included or excluded in the range.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although only preferred methods and materials are described herein, any methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present invention. All documents mentioned in this disclosure are incorporated by reference for the purpose of disclosing and describing the methods and/or materials associated with the documents. In case of conflict with any incorporated document, the present disclosure controls. Unless otherwise indicated, "%" is percent by weight.
In a first aspect of the invention, there is provided a method for analyzing gene sequencing data, sometimes referred to simply as the method of the invention. It is typically used to identify differences or changes between genes from two different populations, for example to detect cancer-specific variations from sequencing data. The method mainly comprises the following steps:
(1) aligning the gene sequencing data to be analyzed to a standard genome to obtain variation information in the gene sequencing data to be analyzed, and arranging the variation information to obtain a mutation consensus sequence composed of the bases A, T, G and C;
(2) converting the mutation consensus sequence into a digital matrix consisting of 0s and 1s, using the entries of the digital matrix as pixels, and thereby converting the mutation consensus sequence into a simulated image according to a predetermined rule;
(3) comparing the simulated image with a reference simulated image to find image differences between them, wherein the reference simulated image is obtained by converting reference gene sequencing data; and
(4) identifying, from the pixel coordinates of the image differences, the mutation difference sites or difference regions between the gene corresponding to the gene sequencing data to be analyzed and the gene corresponding to the reference gene sequencing data.
The gene sequencing data of the present invention is not particularly limited, and may include second generation sequencing data or third generation sequencing data, and specifically may be data obtained by whole genome sequencing, whole exome sequencing, high-depth target region sequencing, transcriptome sequencing, methylation sequencing, or a combination thereof.
The standard genome of the present invention consists of the genes that carry the genetic information of a species. An example is the human genome, which consists of 23 pairs of chromosomes: 22 pairs of autosomes and 1 pair of sex chromosomes. The information of the standard genome is preferably known information commonly used in the art. Human genome information is available over the internet; for example, hg19 can be obtained from http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz. The variation information present in the sequencing data can be found by alignment to the standard genome.
The variation information of the present invention refers to information that differs from the standard genome. These differences include, but are not limited to, point mutations (SNP/SNV/InDel), structural variations (SV/CNV) and methylation level sites (CpG). The variation information also includes combinations of at least two of the above.
In the present invention, the method of arranging the variation information is not particularly limited, and any known method may be used. The arrangement is generally performed with reference to the standard genome. In one exemplary embodiment, the variation information is ordered by chromosome number. In further exemplary embodiments, it is ordered by chromosomal position. Arranging the variation information yields the mutation consensus sequence composed of the bases A, T, C and G.
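The arrangement described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it handles single-base substitutions only, and the names `build_consensus`, `order_chromosomes`, `reference` and `sample_variants`, as well as the toy sequences, are hypothetical and do not come from the patent.

```python
from typing import Dict, Iterable, List


def build_consensus(reference: str, variants: Dict[int, str]) -> str:
    """Return the reference with each single-base variant substituted in.

    `variants` maps a 0-based position to the observed base; positions
    without a variant keep the reference base.
    """
    bases = list(reference.upper())
    for pos, alt in variants.items():
        if len(alt) != 1 or alt not in "ATCG":
            raise ValueError(f"only single-base substitutions are handled: {alt!r}")
        bases[pos] = alt
    return "".join(bases)


def order_chromosomes(names: Iterable[str]) -> List[str]:
    """Sort chromosome names as chr1..chr22, chrX, chrY."""
    rank = {"X": 23, "Y": 24}

    def key(name: str) -> int:
        label = name[3:] if name.startswith("chr") else name
        return rank.get(label) or int(label)

    return sorted(names, key=key)


# Toy data (hypothetical): reference bases and per-chromosome variants.
reference = {"chr2": "GGCA", "chr1": "ATCG", "chrX": "TTAA"}
sample_variants = {"chr1": {1: "G"}, "chrX": {0: "C", 3: "G"}}

# One consensus string per chromosome, visited in chromosome order.
consensus = {
    chrom: build_consensus(reference[chrom], sample_variants.get(chrom, {}))
    for chrom in order_chromosomes(reference)
}
```

Insertions, deletions and structural variants would need a richer representation than a per-position substitution table; the sketch leaves that open, as the text does.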
In the present invention, to convert the mutation consensus sequence expressed in A, T, C and G into a simulated image, the four bases are first digitally encoded. The encoding rule is not particularly limited as long as each base is represented by a unique code. For example, A may be encoded as 10, T as 01, C as 00 and G as 11. A person skilled in the art can freely choose other encodings as required without affecting the object of the invention. After encoding, a digital matrix consisting of 0s and 1s is obtained. The entries of this matrix are then used as image pixels, so that the mutation consensus sequence, now represented by a plurality of pixels, is converted into a simulated image. To facilitate comparison, the pixels are preferably arranged according to a predetermined rule. In general, the arrangement rule of the simulated image obtained from the gene sequencing data to be analyzed is identical to that of the reference simulated image. In an exemplary embodiment, the pixels are first grouped into one sub-image per chromosome, and the per-chromosome sub-images are then arranged. For example, for the 24 human chromosomes, the sub-images may be arranged in a 4x6 or 6x4 grid.
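As a minimal sketch of the encoding step, the following Python uses the example code table from the text (A=10, T=01, C=00, G=11) and treats each bit as one pixel. The mapping of 0 to black and 1 to white, the choice of the PGM file format, and all function names are assumptions made for illustration; the patent itself does not fix them.

```python
# Example code table from the text: two bits per base.
CODE = {"A": (1, 0), "T": (0, 1), "C": (0, 0), "G": (1, 1)}


def encode_sequence(seq):
    """Flatten a base string into a list of 0/1 values, two per base."""
    bits = []
    for base in seq.upper():
        bits.extend(CODE[base])
    return bits


def to_pixels(rows):
    """Map a 0/1 matrix to 8-bit grayscale: 0 -> 0 (black), 1 -> 255 (white)."""
    return [[255 * bit for bit in row] for row in rows]


def write_pgm(path, pixels):
    """Write the pixel matrix as a binary PGM (P5) file using only the standard library."""
    height, width = len(pixels), len(pixels[0])
    with open(path, "wb") as fh:
        fh.write(f"P5\n{width} {height}\n255\n".encode("ascii"))
        for row in pixels:
            fh.write(bytes(row))


# One row per sample (hypothetical toy sequences of equal length).
samples = ["ATCG", "AGCG"]
matrix = [encode_sequence(s) for s in samples]
pixels = to_pixels(matrix)
```

Calling `write_pgm("sample.pgm", pixels)` would store the result as a viewable grayscale image; any other image writer could be substituted.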
In the present invention, the reference simulated image is the image used for comparison, generally a simulated image converted from sequencing data other than the gene sequencing data to be analyzed. The data conversion method or process used to obtain the simulated image and the reference simulated image is generally the same. The reference simulated image may be an image converted in advance, or another simulated image obtained while processing the gene sequencing data to be analyzed. For example, given a first and a second set of gene sequencing data, the image obtained from the first set can be used as the simulated image and the image obtained from the second set as the reference simulated image, or vice versa.
In the present invention, once the simulated image is obtained, it may be compared with the reference simulated image by a known method, such as visual inspection or a known image recognition technique. Examples of tools for image recognition include, but are not limited to, Python, OpenCV and scikit-image.
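A pixel-wise comparison of two equally sized images can be sketched as follows. This is a plain-Python stand-in, not the patent's implementation: with real image arrays one would more likely use something like OpenCV's `cv2.absdiff` or scikit-image's `structural_similarity`. The function names and toy images here are hypothetical.

```python
def diff_mask(image, reference):
    """Return a mask with 255 where the two images differ and 0 where they match."""
    same_shape = len(image) == len(reference) and all(
        len(a) == len(b) for a, b in zip(image, reference)
    )
    if not same_shape:
        raise ValueError("images must have identical dimensions")
    return [
        [255 if a != b else 0 for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(image, reference)
    ]


def diff_coordinates(mask):
    """List the (row, column) coordinates of all differing pixels."""
    return [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]


# Toy 2x4 grayscale images that differ in a single pixel.
image = [[255, 0, 0, 255], [255, 0, 255, 255]]
reference = [[255, 0, 0, 255], [255, 0, 0, 255]]

mask = diff_mask(image, reference)
coords = diff_coordinates(mask)
```

In the mask, matching pixels are black and differing pixels are white, which is one simple way to make differences visible at a glance.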
After the image difference is obtained by image comparison, the mutation difference site or difference region of the gene to be analyzed and the reference gene can be confirmed by the difference. This allows further differentiation between the gene to be analyzed and the reference gene.
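Mapping a differing pixel back to a genomic site can be sketched as below, assuming the two-bits-per-base example encoding and an image whose columns follow the ordered list of sites. The `positions` list, its coordinate values and the function names are hypothetical and serve only to illustrate the idea.

```python
BITS_PER_BASE = 2  # follows the example encoding, e.g. A=10, T=01


def pixel_to_site(column, positions):
    """Translate a pixel column into the (chromosome, position) it encodes."""
    return positions[column // BITS_PER_BASE]


def differing_sites(coords, positions):
    """Collapse differing pixel coordinates into unique sites, kept in genome order."""
    sites = {pixel_to_site(col, positions) for _row, col in coords}
    return sorted(sites, key=positions.index)


# Ordered sites represented by the image columns (made-up coordinates).
positions = [("chr1", 1001), ("chr1", 20456), ("chr2", 512)]

# Differing pixels as (row, column); rows would correspond to samples.
coords = [(0, 2), (0, 3), (4, 5)]
sites = differing_sites(coords, positions)
```

Because each base occupies two adjacent pixels, two differing pixels in the same pair collapse to a single site.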
The analysis method of the present invention is exemplified below with reference to fig. 1. It should be noted that fig. 1 is only used for illustrative purposes and is not intended to limit the scope of the present invention. As shown in fig. 1, the analysis method of the present invention mainly includes the following:
Sequencing data are aligned to the human genome (available at http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz) and variant detection is performed.
The variation results are arranged by chromosome or chromosomal position to obtain the mutation consensus sequence; chromosomal sites without variation are filled with the reference base at the corresponding position.
Each base of the mutation consensus sequence is converted into a string containing only 0 and 1 according to the rule: A is replaced by 10, T by 01, C by 00 and G by 11, giving a 0/1 digital matrix. The entries of this matrix are used as image pixels and converted into a simulated image.
The simulated images are arranged. Humans have 22 pairs of autosomes and one pair of sex chromosomes (X and Y), and each chromosome is converted independently; if the simulated image converted from a sequence is long, the per-chromosome simulated images are arranged so that the overall image has a regular layout.
Differing points or blocks between the images are found using image comparison or image recognition techniques, and the pixel coordinates of these points or blocks are used to find the corresponding chromosome positions and mutation sites or regions, that is, the mutation difference sites or difference regions.
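The arrangement step and the reverse lookup can be sketched together. The code below assumes six chromosome barcodes per image row, spliced side by side without gaps; it treats each layout row as a single band, whereas in a real image a band would be as tall as the number of samples. The widths are made up and every name is hypothetical.

```python
CHROMS = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY"]


def layout_rows(widths, per_row=6):
    """Assign each chromosome a (row, start column) when barcodes are spliced without gaps."""
    placement = {}
    cursor = 0
    for index, chrom in enumerate(CHROMS):
        row, slot = divmod(index, per_row)
        if slot == 0:
            cursor = 0  # a new image row starts at column 0
        placement[chrom] = (row, cursor)
        cursor += widths[chrom]
    return placement


def locate(row, column, placement, widths):
    """Find which chromosome a pixel belongs to and its column offset inside that chromosome."""
    for chrom, (r, start) in placement.items():
        if r == row and start <= column < start + widths[chrom]:
            return chrom, column - start
    raise KeyError("pixel lies outside every chromosome barcode")


# Made-up barcode widths in pixels: chr1 -> 4, chr2 -> 5, ..., chrY -> 27.
widths = {chrom: 4 + i for i, chrom in enumerate(CHROMS)}
placement = layout_rows(widths)
```

Keeping the placement table alongside the image is what allows a pixel coordinate in the combined image to be traced back to a chromosome and then to a site.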
In a second aspect of the present invention, there is provided a system for analyzing gene sequencing data, sometimes referred to simply as the system of the present invention. It may be designed as a computer or an analysis instrument; its form is not particularly limited as long as it can perform the analysis method of the first aspect of the present invention.
The system of the present invention generally comprises:
a. an input device for receiving gene sequencing data to be analyzed;
b. a memory having a database for storing at least the reference simulated image information and the gene sequencing data to be analyzed that is input through the input device;
c. a processor capable of communicating with the memory and configured to: retrieve the gene sequencing data to be analyzed from the memory, convert it into a simulated image, and compare the simulated image with the reference simulated image to obtain the mutation difference sites or difference regions between the gene corresponding to the gene sequencing data to be analyzed and the gene corresponding to the reference gene sequencing data;
d. an output or display device for outputting or displaying information on the mutation difference sites or difference regions.
In the system of the present invention, the memory stores at least the reference simulated image information and the gene sequencing data to be analyzed that is input through the input device. Storage covers two cases: a reference simulated image that was imported into the system's memory from outside, through an input device or the like, before the system is run and is stored long-term or permanently; and a simulated image that the system itself produces by converting sequencing data other than the gene sequencing data to be analyzed and stores temporarily.
The information on the standard genome of the present invention may be stored in the memory in advance, or may be retrieved by the system from a database over a network.
Examples
As an example, 45 patients with kidney cancer (kidney chromophobe) and 256 patients with prostate adenocarcinoma in the TCGA database were selected. The method of the present invention was used to search for the specific differential genes and mutations of the two cancer types (shown in FIG. 2). All samples in this example have public result data sets, which makes consistency comparison between methods convenient.
The original TCGA sequencing data of the 301 samples were aligned to the human reference genome with BWA to obtain alignment files in SAM format; the SAM files were then sorted and deduplicated with Samtools to obtain files in BAM format; finally, somatic mutation results for all samples were obtained with VarScan. The set of all somatic mutations of the kidney cancer patients is denoted A, and that of the prostate adenocarcinoma patients is denoted B;
all mutation positions of set A and set B were merged in ascending chromosome order (chr1, chr2, chr3 ... chrX, chrY), and the mutated base types of each sample in set A and set B were arranged by chromosomal position (samples in random order), with positions lacking a mutation filled with the base type of the reference genome, giving the "mutation consensus sequence" matrix M1 for the kidney cancer patients and the "mutation consensus sequence" matrix M2 for the prostate adenocarcinoma patients;
the "mutation consensus sequence" matrices M1 and M2 were each converted to 0/1 data according to the rule that A is replaced by 10, T by 01, C by 00 and G by 11, giving the 0/1 digital matrices D1 and D2; the kidney cancer matrix D1 was then up-sampled in the sample dimension by random sampling, from 45 samples to 256, to match the number of prostate adenocarcinoma patients;
matrices D1 and D2 were rendered as simulated images using the imageio library for Python, denoted F1 and F2, respectively;
the simulated barcode images F1 and F2 were each arranged in ascending chromosome order: the first row holds chr1-chr6, the second row chr7-chr12, the third row chr13-chr18 and the fourth row chr19-chrY. Because the chromosomes differ in length, this step uses direct splicing, that is, the barcodes are joined without blank space between them, giving new simulated images NF1 and NF2 for the two cancer types;
NF1 and NF2 were compared using Python with OpenCV and scikit-image to obtain the differing points and blocks of the images, where black marks identical positions (excluded from the result) and white marks differing positions;
the pixel coordinates of the differing points and blocks were mapped to the corresponding positions in the "mutation consensus sequence" matrices M1 and M2 to find the differential genes and differential sites of kidney cancer and prostate adenocarcinoma;
the differential sites obtained were compared with the differential analysis results for kidney cancer and prostate adenocarcinoma officially published by TCGA. The concordance rate was 100%, and the time to analyze the sequencing data of the 301 samples fell from about 1200 hours to about 82 hours (roughly a 15-fold reduction), which shows that the method is feasible.
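The up-sampling of D1 from 45 to 256 rows in the steps above can be sketched as follows. The text only says that samples are randomly sampled up to 256; keeping all original rows and drawing the additional rows with replacement is an assumption of this sketch, and `upsample_rows` and the toy matrix are hypothetical.

```python
import random


def upsample_rows(matrix, target, seed=0):
    """Keep all original rows and append randomly drawn copies until `target` rows exist."""
    if target < len(matrix):
        raise ValueError("target must be at least the current number of rows")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    extra = rng.choices(matrix, k=target - len(matrix))
    return [list(row) for row in matrix] + [list(row) for row in extra]


# Toy stand-in for D1: 45 samples with two pixels each.
d1 = [[i % 2, (i // 2) % 2] for i in range(45)]
d1_up = upsample_rows(d1, 256)
```

After this step both matrices have the same number of rows, so the resulting images have equal height and can be compared pixel by pixel.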
While the present invention has been described with reference to exemplary embodiments, it is to be understood that the invention is not limited to the disclosed exemplary embodiments. Various modifications or changes may be made to the exemplary embodiments of the present invention without departing from the scope or spirit of the present invention. The scope of the claims is to be accorded the broadest interpretation so as to encompass all modifications and equivalent structures and functions.

Claims (9)

1. A method for analyzing gene sequencing data, comprising the steps of:
(1) comparing the sequencing data of the gene to be analyzed with a standard genome to obtain variation information in the sequencing data of the gene to be analyzed, and arranging the variation information to obtain a mutation consistency sequence consisting of ATGC;
(2) converting the mutation consistency sequence into a digital matrix consisting of 0 and 1, and, taking the elements of the digital matrix as pixels, converting the mutation consistency sequence into a simulation image according to a preset rule;
(3) comparing the simulation image with a reference simulation image, and searching for image difference between the simulation image and the reference simulation image, wherein the reference simulation image is obtained by converting reference gene sequencing data; and
(4) confirming mutation difference sites or difference areas between the genes corresponding to the gene sequencing data to be analyzed and the genes corresponding to the reference gene sequencing data according to the pixel coordinates of the image difference.
2. The method for analyzing gene sequencing data of claim 1, wherein the reference simulation image is obtained by converting the reference gene sequencing data by the same method as that of steps (1) and (2).
3. The method for analyzing gene sequencing data of claim 1, wherein the standard genome is a human genome.
4. The method for analyzing gene sequencing data of claim 1, wherein the variation information comprises at least one of a point mutation, a structural variation, and a methylation level site.
5. The method for analyzing gene sequencing data of claim 1, wherein the variation information is ordered according to chromosome number or chromosome position, and the chromosomal locus without variation is replaced with the base type of the corresponding position of the standard genome.
6. The method for analyzing gene sequencing data of claim 1, wherein the simulation image is compared with the reference simulation image by visual inspection or image recognition techniques.
7. A system for analyzing gene sequencing data, comprising:
a. an input device for receiving gene sequencing data to be analyzed;
b. a memory having a database for storing at least information of a reference simulation image and the gene sequencing data to be analyzed input through the input device;
c. a processor capable of communicating with the memory and configured to: retrieve the gene sequencing data to be analyzed from the memory, convert the gene sequencing data into a simulation image, and compare the simulation image with the reference simulation image to obtain mutation difference sites or difference areas between genes corresponding to the gene sequencing data to be analyzed and genes corresponding to reference gene sequencing data;
d. an output or display device for outputting or displaying information of the mutation difference site or the difference region.
8. The system according to claim 7, wherein the converting comprises comparing the gene sequencing data to be analyzed with a standard genome to obtain variation information in the gene sequencing data to be analyzed, arranging the variation information to obtain a mutation consistency sequence consisting of ATGC, converting the mutation consistency sequence into a digital matrix consisting of 0 and 1, and, taking the elements of the digital matrix as pixels, converting the mutation consistency sequence into a simulation image according to a preset rule.
9. The system for analyzing gene sequencing data of claim 7, wherein the information of the standard genome is pre-stored in the memory or retrieved by the system from a database over a network.
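Steps (1) and (2) of claim 1, together with the filling rule of claim 5, can be illustrated with the following minimal sketch. The claims specify only "a preset rule" for the conversion; the two-bit code per base used here (A=00, C=01, G=10, T=11), the fixed row width, and all function names are assumptions chosen for illustration and are not the rule of the invention.

```python
from typing import Dict, List

# Assumed two-bit code per base; the source only refers to "a preset rule".
BASE_BITS = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}


def consensus_sequence(reference: str, variants: Dict[int, str]) -> str:
    """Use the variant base where one is recorded, otherwise the reference base."""
    return "".join(variants.get(i, base) for i, base in enumerate(reference))


def encode(sequence: str) -> List[int]:
    """Turn an ATGC string into a flat list of 0/1 values, two per base."""
    bits: List[int] = []
    for base in sequence:
        bits.extend(BASE_BITS[base])
    return bits


def to_rows(bits: List[int], width: int) -> List[List[int]]:
    """Cut the flat bit list into rows of a fixed width to form a pixel matrix."""
    return [bits[i:i + width] for i in range(0, len(bits), width)]


# Toy example: a 4-base reference with one substitution at 0-based position 2.
sequence = consensus_sequence("ACGT", {2: "T"})
bits = encode(sequence)
matrix = to_rows(bits, width=4)
print(sequence, matrix)
```

Each entry of the resulting matrix can then be drawn as one black or white pixel, which gives the barcode-like image compared in step (3).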
CN202011314466.0A 2020-11-20 2020-11-20 Method and system for analyzing gene sequencing data Pending CN112435712A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011314466.0A CN112435712A (en) 2020-11-20 2020-11-20 Method and system for analyzing gene sequencing data

Publications (1)

Publication Number Publication Date
CN112435712A true CN112435712A (en) 2021-03-02

Family

ID=74693313

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011314466.0A Pending CN112435712A (en) 2020-11-20 2020-11-20 Method and system for analyzing gene sequencing data

Country Status (1)

Country Link
CN (1) CN112435712A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114820460A (en) * 2022-04-02 2022-07-29 南京航空航天大学 Method and device for analyzing correlation of single gene locus and time sequence brain image
CN114820460B (en) * 2022-04-02 2023-09-29 南京航空航天大学 Method and device for correlation analysis of single gene locus and time sequence brain image
CN116564415A (en) * 2023-07-10 2023-08-08 深圳华大基因科技服务有限公司 Stream sequencing analysis method, device, storage medium and computer equipment
CN116564415B (en) * 2023-07-10 2023-10-17 深圳华大基因科技服务有限公司 Stream sequencing analysis method, device, storage medium and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination