DNA sequence polymorphism analysis method and system
Technical Field
The invention relates to the field of DNA sequence characteristic analysis, in particular to a DNA sequence polymorphism analysis method and a DNA sequence polymorphism analysis system.
Background
With the rapid development of bioinformatics, the techniques of gene sequencing and sequence alignment have been advanced, mathematical statistics is performed on polymorphisms of DNA sequences, and accordingly comparative studies are being conducted as one of important methods for understanding DNA sequence evolution mechanisms, reconstructing phylogenetic relationships, and recognizing protein-encoding exons at a molecular level. The number of synonymous substitutions ds and the number of non-synonymous substitutions dn (Ks and Ka, respectively, in the coding region) between two sequences are one of the statistical indicators that are widely used to evaluate the degree of evolutionary divergence of DNA sequences between different units. Ks and Ka are defined as the nonsensical substitution numbers of the sum of synonymous (silent) substitution numbers of each homologous site per year or generation, respectively. It is believed that Ka > Ks, ka=ks, and Ka < Ks represent positive selection, neutral mutation, and negative selection, respectively. The statistics of Ka/Ks can also be used to detect adaptive evolution, assessing whether it is experiencing positive selection pressure or undergoing rapid evolution. However, considering dynamic characteristics of DNA sequence evolution, such as transformation/conversion rate bias, nucleotide frequency bias, and abnormal speed substitution, the Ka and Ks need to be estimated under different substitution models.
The prior art is primarily directed to anatomical analysis of the evolutionary constraints of protein-encoding genes, such as sliding window analysis programs (swascs) that detect selective constraints, by estimating the nucleotide substitution rate for specific codon regions in each branch of the phylogenetic tree, using several sets of simulated sequence permutations to estimate the probability of synonymous and non-synonymous nucleotide substitutions. And a statistical analysis of the simulated sequence is performed to determine an optimal window size.
The web-based tool WSPMake (Window-sliding Selection pressure Plot Maker) was used to calculate the selection pressure (estimated in Ka/Ks) for two protein-encoding DNA sequence (CDSs) subregions. By analyzing protein-encoding DNA sequences using a window of definable length, the overall/specific region selectivity constraints of both sequences are calculated and demonstrated. Domain information from the Pfam HMM model was used to detect highly conserved bases in homologous proteins. However, the prior art has some defects, such as sequence characteristics of nucleotide polymorphism, bias and the like, and sequence evolution characteristics of non-uniformity of base substitution rate and the like, which have relatively large influence on statistics, and the prior art does not fully consider the influence, so that analysis results have certain deviation, and meanwhile, parameters of optimal gamma distribution can be added in the methods in consideration of the heterogeneity of the base substitution rate among sites. There is a certain difference between these different methods, which to some extent affects the estimation of the evolution information.
Disclosure of Invention
The invention mainly solves the technical problem of providing a DNA sequence polymorphism analysis method and a system, which can optimize a series of software packages such as KaKs-Caculer and the like which need to be executed through command lines on other platforms, integrate upstream and downstream analysis tools for inputting and analyzing various data types such as genome, gene sequence and the like, greatly reduce the operation complexity of the traditional comparative evolution relationship and simultaneously improve the calculation efficiency of the evolution relationship.
In order to solve the technical problems, the invention provides a DNA sequence polymorphism analysis method, which comprises the following steps:
s1: inputting a biological gene file, the biological gene file comprising: CDS, rRNA;
s2: extracting the genes of the coding region of the organism from the organism gene file to form a group of organism same-position gene sequence files;
s3: inputting the gene sequence file at the same position of the organism into MAFFT software, and comparing the selected gene sequence file one by one through the MAFFT software to generate an equivalent gene matrix, thereby obtaining a complete gene sequence file;
s4: obtaining biological species information, selecting a reference genetic code sub-table for a corresponding biological type, the biological species information comprising: biological name, coding region gene length;
s5, judging whether the biological species information exists in a system database, if so, acquiring the size and the step length of a sliding window, and if not, inputting data such as the size and the step length of the sliding window by a user, and storing the input data in the system database;
s6: inputting the size and the step length of the sliding window, carrying out sliding window statistics of Ka/Ks values on the complete gene sequence file, and generating the Ka/Ks values of the single genes in a chart form;
s7: and carrying out GC content statistics on the complete gene sequence file, and generating the GC content value of the single gene in a graph form.
Further, the coding region gene consists of a DNA sequence, and the gene sequences specifically included in the gb file include: CDS, and rRNA, wherein the coding region genes include: 13 CDS gene queues.
Furthermore, the MAFFT software is used for carrying out multiple sequence alignment of genes in the biological field.
Furthermore, the same-position gene sequence file refers to a fas file composed of the same-position gene sequences of three organisms, wherein the gene file of one organism contains 13 coding region sequence information, and finally 13 fas files are generated for storing the coding region gene sequences of the three organisms.
Further, the reference genetic code table is given by the system;
the genetic code sub-table comprises: a 5-Arthropoda genetic code table and a 2-chord genetic code table; wherein the 5-Arthropoda genetic code table refers to an invertebrate genetic code table; the 2-chord data genetic code table refers to a vertebrate genetic code table.
Further, the sliding window statistics of the Ka/Ks value is carried out by selecting a reference sequence set by referring to a genetic code table to compare the ith sequence with the reference sequence one by one, and carrying out statistics of the Ka/Ks values of the two sequences until one cycle is completed for all non-reference sequences, and generating the Ka/Ks values of i single genes in a graph form;
further, the statistics of the GC content refers to statistics of the ratio of guanine and cytosine in the complete gene sequence file.
The beneficial effects of the invention are as follows: according to the invention, the professional complex upstream and downstream analysis software is integrated through the biological evolution system to carry out sequential processing, so that the mathematical statistics efficiency of the polymorphism of the DNA sequence is improved, and meanwhile, a windowed operation platform is provided to avoid command line operation, so that the operation difficulty of the biological evolution system is reduced; a series of software packages such as KaKs-Caculer and the like which need to be executed through command lines on other platforms are optimized; the result is suitable for application of multiple scenes such as interactive databases, analysis software, platforms, publication publishing and the like; the traditional comparison evolution relation operation is integrated, so that the operation complexity is greatly reduced, and meanwhile, the calculation efficiency of the evolution relation is improved.
Drawings
FIG. 1 is a flow chart of a method of DNA sequence polymorphism analysis;
FIG. 2 is a workflow diagram of a method for DNA sequence polymorphism analysis;
FIG. 3 is a schematic diagram of a visual result of a DNA sequence polymorphism analysis method;
FIG. 4 is a system block diagram of a DNA sequence polymorphism analysis system.
Detailed Description
The preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings so that the advantages and features of the present invention can be more easily understood by those skilled in the art, thereby making clear and defining the scope of the present invention.
Referring to fig. 1, 2, 3 and 4, an embodiment of the present invention includes:
a DNA sequence polymorphism analysis method comprising the steps of:
s1: inputting a bio-gene file, the bio-gene file comprising: CDS, rRNA;
s2: extracting the genes of the coding region of the organism from the organism gene file to form a group of organism same-position gene sequence files;
s3: inputting the gene sequence file at the same position of the organism into MAFFT software, and comparing the selected gene sequence file one by one through the MAFFT software to generate an equivalent gene matrix, thereby obtaining a complete gene sequence file;
s4: obtaining biological species information, selecting a reference genetic code sub-table for a corresponding biological type, the biological species information comprising: biological name, coding region gene length;
s5: judging whether the biological species information exists in a system database, if so, acquiring the size and the step length of a sliding window, if not, inputting data such as the size and the step length of the sliding window by a user, and storing the input data in the system database;
s6: inputting the size and the step length of the sliding window, carrying out sliding window statistics of Ka/Ks values on the complete gene sequence file, and generating the Ka/Ks values of the single genes in a chart form;
s7: and carrying out GC content statistics on the complete gene sequence file, and generating the GC content value of the single gene in a graph form.
As shown in fig. 2, a user inputs gene files of 3 organisms of 10_nesodiprion_japonicas_nad2, 55_nesodiprion_biremis_nsd2 and 360_dentathalia_scutellariae on a DNA sequence characteristic analysis platform; extracting genes of coding regions of organisms from the three organism gene files to form a same group of organism same-position gene sequence files, and then carrying out successive comparison by MAFFT software to generate an equivalent gene matrix so as to obtain three organism complete gene sequence files; let 360_Dentathalia_scutellariae be species 0, 10_Nesodiprion_japonicas_Nad2 be species 1, 55_Nesodiprion_biremis_Nd2 be species 2, compare the complete gene sequence files of species 0 and species 1, species 0 and species 2 one by selecting the reference sequence set by referring to the genetic code table, sliding window set to 60, step size set to 3, and count the Ka/Ks values of the two sequences until one cycle is completed for all non-reference sequences, generating the Ka/Ks value of the single gene in the form of a graph.
As shown in FIG. 3, the solid lines are 360_Dentathalia_scutellariae and 10_Nesodiprion_japonius_Nad2, and statistics of Ka/Ks values of the two sequences are performed until one cycle is completed for all non-reference sequences, and the Ka/Ks values of the single genes are generated in a graph form; the dashed lines are 360_Dentathalia_scutellariae and 55_Nesodiprion_biremis_Nsd2, and statistics of Ka/Ks values of the two sequences are performed until one cycle is completed for all non-reference sequences, and the Ka/Ks values of the single genes are generated in a graph form.
Further, the coding region gene consists of a DNA sequence, and the gene sequences specifically included in the gb file include: CDS, and rRNA, wherein the coding region genes include: 13 CDS gene queues.
Furthermore, the MAFFT software is used for carrying out multiple sequence alignment of genes in the biological field.
Furthermore, the same-position gene sequence file refers to a fas file composed of the same-position gene sequences of three organisms, wherein the gene file of one organism contains 13 coding region sequence information, and finally 13 fas files are generated for storing the coding region gene sequences of the three organisms.
Further, the reference genetic code table is given by the system;
the genetic code sub-table comprises: a 5-Arthropoda genetic code table and a 2-chord genetic code table; wherein the 5-Arthropoda genetic code table refers to an invertebrate genetic code table; the 2-chord data genetic code table refers to a vertebrate genetic code table.
Further, the sliding window statistics of the Ka/Ks value is carried out by selecting a reference sequence set by referring to a genetic code table to compare the ith sequence with the reference sequence one by one, and carrying out statistics of the Ka/Ks values of the two sequences until one cycle is completed for all non-reference sequences, and generating the Ka/Ks values of i single genes in a graph form;
further, the statistics of the GC content refers to statistics of the ratio of guanine and cytosine in the complete gene sequence file.
As shown in fig. 4, a DNA sequence polymorphism analysis system, comprising:
the file input module is used for inputting the gene file to be tested into the system;
the gene extraction module is used for extracting the coding region genes from the input gene files to form the gene sequence files at the same position;
the gene comparison module is used for comparing the gene sequence files at the same position to form an equivalent gene matrix, so as to obtain a complete gene sequence file;
the information acquisition module is used for acquiring biological species information of the complete gene sequence file;
the data storage module is used for storing the historical species information in the system and the sliding window size and step length set by the historical species information;
the Ka/ks value calculation module is used for calculating the Ka/ks value in the complete gene sequence and generating a single gene Ka/ks value in a chart form;
the GC content statistics module is used for counting the GC content in the complete gene sequence and generating single-gene GC content values in a chart form.
The foregoing description is only illustrative of the present invention and is not intended to limit the scope of the invention, and all equivalent structures or equivalent processes or direct or indirect application in other related technical fields are included in the scope of the present invention.