CN117153248B

CN117153248B - Gene region variation detection and visualization method and system based on pan genome

Info

Publication number: CN117153248B
Application number: CN202311133414.7A
Authority: CN
Inventors: 焦成智; 陈力杨; 高丹
Original assignee: Tianjin Jizhi Gene Technology Co ltd
Current assignee: Tianjin Jizhi Gene Technology Co ltd
Priority date: 2023-09-05
Filing date: 2023-09-05
Publication date: 2024-05-07
Anticipated expiration: 2043-09-05
Also published as: CN117153248A

Abstract

The invention belongs to the technical field of gene information detection and discloses a pan-genome-based method and system for detecting and visualizing variation in gene regions. In the method, the preliminary annotation result obtained by mapping is filtered and screened, a coverage value and an identity value are calculated from the alignment, the optimal alignment result is selected according to these two values, whether the gene lies in the same collinear region on the different genomes is judged, and the final standard annotation result is determined from that judgment. The gene region and its upstream and downstream sequences are then extracted from each genome according to the final standard annotation result of the annotation file; the extracted sequences are aligned in a specific order to detect variation between genomes; and the variation detection results are filtered to remove N-containing fragments. By visualizing variation among multiple genomes, the invention makes it easier and more intuitive to see how variation between genomes affects each functional region of a gene.

Description

Gene region variation detection and visualization method and system based on pan genome

Technical Field

The invention belongs to the technical field of gene information detection, and particularly relates to a pan-genome-based gene region variation detection and visualization method and system.

Background

With the falling cost of high-throughput sequencing, a growing number of species now have a pan genome. Comparing genomic differences and variations between individuals can reveal the genetic diversity and evolutionary history within a species, which is important for understanding differences in phenotype, adaptability, and disease susceptibility among individuals. Pan-genome research helps reveal the genetic differences and genomic variation among individuals of the same species, and is significant for studying evolutionary processes, phenotypic differences, and related questions. Visualizing the variation detected across a pan genome can further improve the understanding of these differences. The value of such visualization is that complex genomic variation information is displayed in an intuitive, easily understood graphical interface, so that researchers can analyze the data and draw conclusions conveniently. With the variation differences between genomes displayed visually, the genetic characteristics of and differences between individuals of the same species can be better understood, and more associations can be found.

In biological research there are often particular genes of interest. By analyzing and comparing the genome information of individuals carrying these genes, genetic variation associated with specific individuals can be found, revealing possible genetic risks, biological characteristics, gene functions, and so on. This information is important for life science research, medical diagnosis, personalized treatment, and other fields. However, the variant file format VCF is not convenient for producing summary statistics of variants. Given this complexity and difficulty, visual analysis allows researchers or doctors to further analyze and understand the characteristics of variation in the gene region of interest, such as the variation type, distribution, frequency, and position in a specific sample, providing more comprehensive support for subsequent work. Visualizing variation detection in the region of a gene of interest is therefore of great importance.

Current pan-genome variation visualization tools display variation across the whole genome; none offers a presentation focused on a particular gene region.

Through the above analysis, the problems and defects existing in the prior art are as follows:

(1) Current pan-genome variation visualization tools display variation across the whole genome. The display is macroscopic and cannot focus on how variation is distributed in a specific gene region. No existing tool presents multiple genomes, variation detection results, and gene structure at the same time. Prior-art displays usually show the distribution or content of all variation across a genome or a chromosome; because genomic sequences are long, they can only show at a macroscopic level which regions contain many variations and which contain few. Researchers, however, usually want to know whether a given gene region carries variation on different genomes, what type of variation it is, and where it is located. Existing tools therefore increase the workload, and the accuracy of the visually presented information suffers to some extent.

(2) Because genomes and annotations are currently provided by many different sources, the quality and standards of genome annotation are not uniform. As a result, the specific genes to be analyzed often have incomplete gene structures or are not annotated at all, which affects the accurate determination of candidate gene regions.

Disclosure of Invention

In order to overcome the problems in the related art, the invention provides a pan-genome-based gene region variation detection and visualization method and system. One aim of the invention is to display the variation of a gene region at a microscopic level. Another aim is to standardize the annotation results, so that a specific region of the genome does not go unlocated because the annotation results are non-standard.

The technical scheme of the invention is as follows: the genetic region variation detection and visualization method based on the pan genome comprises the following steps:

S1, mapping the cds sequence of the gene of interest onto the genome sequences by alignment to obtain a preliminary annotation result;

S2, filtering and screening the preliminary annotation result obtained by mapping, calculating a coverage value and an identity value from the alignment, selecting the optimal alignment result according to the coverage value and the identity value, judging whether the gene is located in the same collinear region on different genomes, and determining a final standard annotation result according to the judgment result;

S3, extracting the gene region and the upstream and downstream sequences from each genome according to the final standard annotation result of the annotation file;

S4, aligning the extracted sequences and detecting variation between genomes;

S5, filtering the variation detection result, removing N-containing fragments, and analyzing and classifying the variation types;

S6, visually displaying the classified variation types using svg.

In step S1, the preliminary annotation result is the gene structure information on each genome, including the chromosome on which the gene is located and its specific position, and the information and lengths of the gene region, the exon regions, and the intron regions.

In step S2, calculating the coverage value and the identity value comprises: aligning the specific gene sequence to the genome with gmap software, determining the specific position of the gene on the genome, and calculating a coverage value and an identity value from the alignment result. The calculation formulas are as follows:

coverage = (length of the aligned sequence / total length of the gene) × 100;

identity = (length of the base-identical sequence / length of the aligned sequence) × 100.
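The two formulas can be illustrated with a minimal Python sketch. The function name `coverage_and_identity` and its parameter names are hypothetical and chosen only for this illustration; how the three lengths are read from the gmap output is not specified in the text and is left out here.

```python
from typing import Tuple


def coverage_and_identity(aligned_length: int,
                          gene_length: int,
                          identical_bases: int) -> Tuple[float, float]:
    """Return (coverage, identity) as percentages.

    aligned_length  : length of the gene sequence that aligned to the genome
    gene_length     : total length of the gene sequence
    identical_bases : number of aligned bases identical to the genome
    """
    if gene_length <= 0 or aligned_length <= 0:
        raise ValueError("lengths must be positive")
    # coverage = aligned length / total gene length x 100
    coverage = 100 * aligned_length / gene_length
    # identity = base-identical length / aligned length x 100
    identity = 100 * identical_bases / aligned_length
    return coverage, identity


# Illustrative numbers only: 950 of 1000 bases aligned, 931 of them identical.
cov, ident = coverage_and_identity(950, 1000, 931)
print(f"coverage={cov:.1f}% identity={ident:.1f}%")
```

Multiplying by 100 before dividing keeps the arithmetic on integers for as long as possible, which avoids small rounding artifacts in the reported percentages.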

Further, the same gene is aligned to the different genomes to determine its position and gene structure on each genome; the alignment results are ranked first by coverage value and then by identity value, and the best-ranked mapping result is retained as the final annotation result.

In step S3, extracting the gene region and the upstream and downstream sequences from each genome according to the final standard annotation result of the annotation file comprises: writing a script that extracts the sequence at the corresponding position according to the final standard annotation result.

Further, the upstream and downstream sequences are the 5 kb upstream and the 5 kb downstream of the gene region, respectively.
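The extraction script itself is not disclosed, so the following is only a sketch of what such a script might do for a single gene. It assumes 1-based inclusive coordinates, a chromosome sequence already loaded as a Python string, and a forward-strand gene; `extract_region` and `FLANK` are hypothetical names introduced for this illustration.

```python
FLANK = 5000  # 5 kb on each side of the gene region, as stated in the text


def extract_region(chrom_seq: str, gene_start: int, gene_end: int,
                   flank: int = FLANK) -> str:
    """Return the gene region plus its flanking sequence.

    Coordinates are 1-based and inclusive (an assumption of this sketch).
    Flanks are clipped at the chromosome ends, so a gene close to an end
    yields a shorter flank instead of an error.
    """
    if not (1 <= gene_start <= gene_end <= len(chrom_seq)):
        raise ValueError("gene coordinates fall outside the chromosome")
    start = max(1, gene_start - flank)
    end = min(len(chrom_seq), gene_end + flank)
    # Convert the 1-based inclusive interval to a Python slice.
    return chrom_seq[start - 1:end]


# Toy chromosome: 100 A, then a 50-base "gene" of C, then 100 G.
toy_chrom = "A" * 100 + "C" * 50 + "G" * 100
print(extract_region(toy_chrom, 101, 150, flank=10))
```

A gene on the reverse strand would additionally need its extracted sequence reverse-complemented; that case is deliberately not covered here.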

In step S4, aligning the extracted sequences and detecting variation between genomes comprises: aligning the extracted sequences with mummer software, and detecting variation between genomes pairwise, following the order in which the genomes are arranged.

Further, the variation detection method comprises: after the genomes are aligned with mummer software, a linear delta.1 records file is generated, and a script is written according to the content of that file to extract the variation information.

Another object of the present invention is to provide a pan-genome-based gene region variation detection and visualization system that implements the pan-genome-based gene region variation detection and visualization method, the system comprising:

a preliminary annotation result acquisition unit, used for mapping the cds sequence of the gene of interest onto the genome sequences by alignment to obtain a preliminary annotation result;

a final standard annotation result determining unit, used for filtering and screening the preliminary annotation result obtained by mapping, calculating a coverage value and an identity value from the alignment, selecting the optimal alignment result according to the coverage value and the identity value, judging whether the gene is located in the same collinear region on different genomes, and determining a final standard annotation result according to the judgment result;

a genome gene region and upstream and downstream sequence extraction unit, used for extracting the gene region and the upstream and downstream sequences from each genome according to the final standard annotation result of the annotation file;

an inter-genome variation detection unit, used for aligning the extracted sequences and detecting variation between genomes;

a mutation type filtering and classifying unit, used for filtering the variation detection result, removing N-containing fragments, and analyzing and classifying the variation types;

and a display unit, used for visually displaying the classified variation types using svg.

Further, the system is deployed on a bioinformatics analysis platform and executes the corresponding functions of each unit.

Taken together, the technical schemes above give the invention the following advantages and positive effects: by providing standard annotation information for a specific gene in different genomes and providing variation detection results for that gene region, the invention visually displays the variation of the specific gene region between different genomes.

The invention provides a visual display of multi-genome gene region variation detection results, which suits the growing trend toward whole-genome assembly of large numbers of samples as the cost of whole-genome sequencing falls. By visualizing variation among multiple genomes, the invention makes it easier and more intuitive to see how variation between genomes affects each functional region of a gene.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure;

FIG. 1 is a diagram of a method for detecting and visualizing genetic region variation based on a pan genome;

FIG. 2 is a schematic diagram of a genetic region variation detection and visualization method based on the pan genome;

FIG. 3 is a schematic diagram of the pan-genome-based gene region variation detection and visualization system provided by the invention.

In the figure: 1. a preliminary annotation result acquisition unit; 2. a final standard annotation result determination unit; 3. each genome gene region and upstream and downstream sequence extraction units; 4. an inter-genome variation detection unit; 5. a mutation type filtering and classifying unit; 6. and a display unit.

Detailed Description

In order that the above objects, features and advantages of the invention will be readily understood, a more particular description of the invention will be rendered by reference to the appended drawings. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The invention may be embodied in many other forms than described herein and similarly modified by those skilled in the art without departing from the spirit or scope of the invention, which is therefore not limited to the specific embodiments disclosed below.

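Embodiment 1: as shown in FIG. 1, the pan-genome-based gene region variation detection and visualization method provided by this embodiment of the invention comprises the following steps:

S1, mapping the cds sequence of the gene of interest onto the genome sequences by alignment to obtain a preliminary annotation result;

S2, filtering and screening the preliminary annotation result obtained by mapping, calculating a coverage value and an identity value from the alignment, selecting the optimal alignment result according to the coverage value and the identity value, judging whether the gene is located in the same collinear region on different genomes, and determining a final standard annotation result according to the judgment result;

S3, extracting the gene region and the upstream and downstream sequences from each genome according to the final standard annotation result of the annotation file;

S4, aligning the extracted sequences and detecting variation between genomes;

S5, filtering the variation detection result, removing N-containing fragments, and analyzing and classifying the variation types;

S6, visually displaying the classified variation types using svg.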

In step S1, the preliminary annotation result is the gene structure information of the relevant genes on each genome, including the chromosome on which the gene is located and its specific position, and the information and lengths of the gene region, the exon regions, the intron regions, and so on.

Because genomes and annotations are currently provided by many different sources, the quality and standards of genome annotation are not uniform; the specific genes to be analyzed therefore often have incomplete gene structures or are not annotated at all, which affects the determination of candidate gene regions.

In step S2, the specific gene sequence is aligned to the genome with gmap software to determine the specific position of the gene on the genome; at the same time, a coverage value and an identity value are calculated from the alignment result. The calculation formulas are as follows:

coverage = (length of the aligned sequence / total length of the gene) × 100;

identity = (length of the base-identical sequence / length of the aligned sequence) × 100.

Because some genes are highly homologous, different genes may align to the same genomic region. When this happens, the alignment results are ranked first by coverage value and then by identity value, and the gene mapping result with the better alignment is retained as the final annotation result.
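The ranking rule described above (coverage first, identity as the tie-breaker) can be sketched as follows. The `Hit` record and its field names are hypothetical; the source does not say how the gmap results are stored, and the gene names in the example are invented.

```python
from typing import Iterable, NamedTuple


class Hit(NamedTuple):
    """One alignment of a gene onto a genome (hypothetical record layout)."""
    gene: str
    chrom: str
    start: int
    end: int
    coverage: float  # percent
    identity: float  # percent


def best_hit(hits: Iterable[Hit]) -> Hit:
    """Keep the hit with the highest coverage; break ties by identity."""
    # Tuples compare element by element, so coverage is decisive and
    # identity only matters when two hits have equal coverage.
    return max(hits, key=lambda h: (h.coverage, h.identity))


# Three competing hits on the same region (illustrative values only).
competing = [
    Hit("geneA", "chr1", 1000, 2500, coverage=98.0, identity=90.0),
    Hit("geneB", "chr1", 1000, 2500, coverage=98.0, identity=99.0),
    Hit("geneC", "chr1", 1010, 2400, coverage=95.0, identity=100.0),
]
print(best_hit(competing).gene)
```

Note that a hit with perfect identity but lower coverage loses to a hit with higher coverage, which is what "coverage first" implies.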

In step S3, the gene region and the upstream and downstream sequences are extracted from each genome according to the final standard annotation result of the annotation file: a script is written to extract the sequence at the corresponding position according to the final standard annotation result.

In step S4, the extracted sequences are aligned in a specific order with mummer software and variation between genomes is detected; that is, variation is detected between genomes pairwise, following the order in which the genomes are arranged, for example 1 versus 2, 2 versus 3, 3 versus 4, ..., 8 versus 9, and 9 versus 10.
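The pairing order in the example (1 versus 2, 2 versus 3, and so on) amounts to comparing each genome with its successor in the list. A minimal sketch, with the hypothetical helper `consecutive_pairs` and placeholder genome names:

```python
from typing import List, Sequence, Tuple


def consecutive_pairs(genomes: Sequence[str]) -> List[Tuple[str, str]]:
    """Pair each genome with the next one in the given order."""
    return list(zip(genomes, genomes[1:]))


# Ten genomes in their display order (placeholder names).
genome_order = [f"genome{i}" for i in range(1, 11)]
for reference, query in consecutive_pairs(genome_order):
    # Each pair would be handed to the aligner in turn; the actual
    # alignment command is not shown because the source does not give it.
    print(reference, "vs", query)
```

For n genomes this produces n - 1 comparisons rather than the n(n - 1)/2 of an all-against-all design, which matches a stacked display in which each track is compared only with its neighbor.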

The detection method comprises the following steps: after the genomes are aligned with mummer software, a cooling.delta.1copies file is generated, and a script is written according to the content of that file to extract the variation information.
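Step S5 then filters the extracted variation records and classifies them. The source does not list the variation types or the record layout, so the sketch below rests on assumptions: each record is a hypothetical `Variant` tuple holding the two alleles as strings, and the four classes (SNP, insertion, deletion, substitution) are chosen only for illustration.

```python
from typing import Dict, Iterable, List, NamedTuple


class Variant(NamedTuple):
    """One variation between two genomes (hypothetical record layout)."""
    position: int  # 1-based position on the first genome of the pair
    ref: str       # allele on the first genome
    alt: str       # allele on the second genome


def contains_n(variant: Variant) -> bool:
    """True if either allele contains an ambiguous base N."""
    return "N" in variant.ref.upper() or "N" in variant.alt.upper()


def classify(variant: Variant) -> str:
    """Assign an illustrative variation type based on allele lengths."""
    if len(variant.ref) == 1 and len(variant.alt) == 1:
        return "SNP"
    if len(variant.ref) < len(variant.alt):
        return "insertion"
    if len(variant.ref) > len(variant.alt):
        return "deletion"
    return "substitution"  # equal length, longer than one base


def filter_and_classify(variants: Iterable[Variant]) -> Dict[str, List[Variant]]:
    """Drop N-containing records and group the rest by type."""
    grouped: Dict[str, List[Variant]] = {}
    for variant in variants:
        if contains_n(variant):
            continue
        grouped.setdefault(classify(variant), []).append(variant)
    return grouped


records = [Variant(10, "A", "G"), Variant(20, "A", "ATT"), Variant(40, "N", "T")]
print({kind: len(items) for kind, items in filter_and_classify(records).items()})
```

Removing N-containing records first keeps assembly gaps from being reported as real variation.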

Embodiment 2: as another embodiment of the present invention, as shown in FIG. 2, the pan-genome-based gene region variation detection and visualization method comprises:

mapping the cds of the gene of interest onto the pan genome sequences;

using gmap software to obtain a preliminary annotation result;

filtering and screening to obtain a standard annotation result;

extracting the genes and their upstream and downstream sequences to obtain candidate region sequences;

aligning the extracted sequences with mummer software to obtain a variation detection result;

filtering and classifying the variation detection result, and visually displaying it using svg.

Embodiment 3: as shown in FIG. 3, the pan-genome-based gene region variation detection and visualization system provided by this embodiment of the invention comprises:

A preliminary annotation result acquisition unit 1, configured to map a cds sequence of a gene of interest onto a genome sequence in a manner of alignment, so as to obtain a preliminary annotation result;

The final standard annotation result determining unit 2 is used for filtering and screening the preliminary annotation result obtained by mapping, calculating a coverage value and an identity value from the alignment, selecting the optimal alignment result according to the coverage value and the identity value, judging whether the gene is located in the same collinear region on different genomes, and determining a final standard annotation result according to the judgment result;

the genome gene region and upstream and downstream sequences extracting unit 3 is used for extracting the genome gene region and upstream and downstream sequences according to the final standard annotation result of the annotation file;

wherein the upstream and downstream sequences are the 5 kb upstream and the 5 kb downstream of the gene region, respectively;

An inter-genome variation detection unit 4 for performing sequence alignment on the extracted sequences and detecting variation between genomes;

The mutation type filtering and classifying unit 5 is used for filtering the mutation detection result, removing the N-containing fragments and analyzing and classifying the mutation types;

And the display unit 6 is used for visually displaying the classified mutation types by utilizing svg.

Embodiment 4: the pan-genome-based gene region variation detection and visualization system of Embodiment 3 is mounted on a high-throughput sequencer and performs the corresponding functions of each unit.

In the foregoing embodiments, each embodiment is described with its own emphasis; for parts not described or illustrated in detail in a particular embodiment, reference may be made to the related descriptions of the other embodiments.

Because the information exchange and execution processes between the above devices and units are based on the same concept as the method embodiments of the present invention, their specific functions and technical effects can be found in the method embodiment section and are not repeated here.

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, the specific names of the functional units and modules are only for distinguishing from each other, and are not used for limiting the protection scope of the present invention. For specific working processes of the units and modules in the system, reference may be made to corresponding processes in the foregoing method embodiments.

Based on the technical solutions described in the embodiments of the present invention, the following application examples may be further proposed.

According to an embodiment of the present invention, there is also provided a computer apparatus including: at least one processor, a memory, and a computer program stored in the memory and executable on the at least one processor, which when executed by the processor performs the steps of any of the various method embodiments described above.

Embodiments of the present invention also provide a computer readable storage medium storing a computer program which, when executed by a processor, performs the steps of the respective method embodiments described above.

The embodiment of the invention also provides an information data processing terminal which, when implemented on an electronic device, provides a user input interface to implement the steps in the method embodiments; the information data processing terminal is not limited to a mobile phone, a computer, or a switch.

The embodiment of the invention also provides a server which, when executed on an electronic device, provides a user input interface to implement the steps in the method embodiments.

Embodiments of the present invention also provide a computer program product which, when run on an electronic device, causes the electronic device to perform the steps of the method embodiments described above.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on this understanding, all or part of the flow of the method of the above embodiments may be implemented by a computer program instructing the related hardware; the computer program may be stored in a computer readable storage medium and, when executed by a processor, implements the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer readable medium may include at least: any entity or device capable of carrying the computer program code to a photographing device or terminal apparatus, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, for example a USB flash drive, a removable hard disk, a magnetic disk, or an optical disk.

While the invention has been described with respect to what is presently considered to be the most practical and preferred embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, but on the contrary, is intended to cover various modifications, equivalents, and alternatives falling within the spirit and scope of the invention.

Claims

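1. A pan-genome-based gene region variation detection and visualization method, characterized by comprising the following steps:

S1, mapping the cds sequence of the gene of interest onto the genome sequences by alignment to obtain a preliminary annotation result;

S2, filtering and screening the preliminary annotation result obtained by mapping, calculating a coverage value and an identity value from the alignment, selecting the optimal alignment result according to the coverage value and the identity value, judging whether the gene is located in the same collinear region on different genomes, and determining a final standard annotation result according to the judgment result;

S3, extracting the gene region and the upstream and downstream sequences from each genome according to the final standard annotation result of the annotation file;

S4, aligning the extracted sequences and detecting variation between genomes;

S5, filtering the variation detection result, removing N-containing fragments, and analyzing and classifying the variation types;

S6, visually displaying the classified variation types using svg;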

in step S1, the preliminary annotation result is the gene structure information on each genome, including the chromosome on which the gene is located and its specific position, and the information and lengths of the gene region, the exon regions, and the intron regions;

in step S2, selecting the optimal alignment result according to the coverage value and the identity value, judging whether the gene is located in the same collinear region on different genomes, and determining the final standard annotation result according to the judgment result comprises: aligning the same gene to the different genomes, determining its position and gene structure on each genome, ranking the alignment results first by coverage value and then by identity value, and retaining the gene mapping result with the better alignment as the final annotation result.

2. The pan-genome-based gene region variation detection and visualization method according to claim 1, wherein in step S2, calculating the coverage value and the identity value comprises:

aligning the specific gene sequence to the genome with gmap software, determining the specific position of the gene on the genome, and calculating a coverage value and an identity value from the alignment result; the calculation formulas are as follows:

coverage = (length of the aligned sequence / total length of the gene) × 100;

identity = (length of the base-identical sequence / length of the aligned sequence) × 100.

3. The pan-genome-based gene region variation detection and visualization method according to claim 1, wherein in step S3, extracting the gene region and the upstream and downstream sequences from each genome according to the final standard annotation result of the annotation file comprises: writing a script that extracts the sequence at the corresponding position according to the final standard annotation result.

4. The pan-genome-based gene region variation detection and visualization method according to claim 3, wherein the upstream and downstream sequences are the 5 kb upstream and the 5 kb downstream of the gene region, respectively.

5. The pan-genome-based gene region variation detection and visualization method according to claim 1, wherein in step S4, aligning the extracted sequences and detecting variation between genomes comprises: aligning the extracted sequences with mummer software, and detecting variation between genomes pairwise, following the order in which the genomes are arranged.

6. The pan-genome-based gene region variation detection and visualization method according to claim 5, wherein the variation detection method comprises: after the genomes are aligned with mummer software, a linear delta.1 records file is generated, and a script is written according to the content of that file to extract the variation information.

7. A pan-genome-based gene region variation detection and visualization system, characterized in that it implements the pan-genome-based gene region variation detection and visualization method according to any one of claims 1 to 6, the system comprising:

A preliminary annotation result acquisition unit (1) for mapping the cds sequence of the gene of interest onto the genome sequences by alignment to obtain a preliminary annotation result; the preliminary annotation result is the gene structure information on each genome, including the chromosome on which the gene is located and its specific position, and the information and lengths of the gene region, the exon regions, and the intron regions;

The final standard annotation result determining unit (2) is configured to filter and screen the preliminary annotation result obtained by mapping, calculate a coverage value and an identity value from the alignment, select the optimal alignment result according to the coverage value and the identity value, judge whether the gene is located in the same collinear region on different genomes, and determine a final standard annotation result according to the judgment result, which specifically includes: aligning the same gene to the different genomes, determining its position and gene structure on each genome, ranking the alignment results first by coverage value and then by identity value, and retaining the gene mapping result with the better alignment as the final annotation result;

the genome gene region and upstream and downstream sequences extraction unit (3) is used for extracting the genome gene region and upstream and downstream sequences according to the final standard annotation result of the annotation file;

An inter-genome variation detection unit (4) for aligning the extracted sequences and detecting variation between genomes;

The mutation type filtering and classifying unit (5) is used for filtering mutation detection results, removing N-containing fragments and analyzing and classifying mutation types;

And the display unit (6) is used for visually displaying the classified mutation types by utilizing svg.

8. The pan-genome-based gene region variation detection and visualization system of claim 7, wherein the system is deployed on a bioinformatics analysis platform and performs the corresponding functions of each unit.