CN116130002A - DNA sequence polymorphism analysis method and system - Google Patents

Publication number
CN116130002A
Authority
CN
China
Prior art keywords
gene
file
sequence
gene sequence
genetic code
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211691737.3A
Other languages
Chinese (zh)
Inventor
马勇
江兴鸿
牛耕耘
简文欣
戴梦轩
郑文胜
肖艺
李敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangxi Normal University
Original Assignee
Jiangxi Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangxi Normal University
Priority to CN202211691737.3A
Publication of CN116130002A
Current legal status: Pending

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/40 Population genetics; Linkage disequilibrium

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Engineering & Computer Science (AREA)
  • Biotechnology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Analytical Chemistry (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Ecology (AREA)
  • Physiology (AREA)
  • Molecular Biology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a DNA sequence polymorphism analysis method and system. The method comprises the following steps: inputting the gene files of the organisms; extracting the coding region genes of the organisms from the gene files to form a group of same-position gene sequence files; inputting the same-position gene sequence files into MAFFT software, which aligns them one by one to generate an equivalent gene matrix, thereby obtaining a complete gene sequence file; selecting the reference genetic code table for the corresponding biological type; obtaining the size and step size of the sliding window used in the sliding window statistics; and performing sliding window statistics of Ka/Ks values and GC content statistics on the complete gene sequence file, generating the Ka/Ks values and GC content values of each single gene in chart form. The method integrates the operations of traditional comparative evolutionary analysis, reduces operational complexity, and improves the efficiency of computing evolutionary relationships.

Description

DNA sequence polymorphism analysis method and system
Technical Field
The invention relates to the field of DNA sequence characteristic analysis, in particular to a DNA sequence polymorphism analysis method and a DNA sequence polymorphism analysis system.
Background
With the rapid development of bioinformatics, gene sequencing and sequence alignment techniques have advanced. Performing mathematical statistics on the polymorphisms of DNA sequences, and carrying out comparative studies on that basis, has become one of the important methods for understanding the mechanisms of DNA sequence evolution, reconstructing phylogenetic relationships, and recognizing protein-coding exons at the molecular level. The number of synonymous substitutions (dS) and the number of nonsynonymous substitutions (dN) between two sequences (Ks and Ka, respectively, in coding regions) are among the statistical indicators widely used to evaluate the degree of evolutionary divergence of DNA sequences between different units. Ks and Ka are defined, respectively, as the number of synonymous (silent) substitutions and the number of nonsynonymous substitutions per homologous site per year or generation. It is generally held that Ka > Ks, Ka = Ks, and Ka < Ks indicate positive selection, neutral mutation, and negative selection, respectively. The Ka/Ks statistic can also be used to detect adaptive evolution, by assessing whether a sequence is under positive selection pressure or undergoing rapid evolution. However, given the dynamic characteristics of DNA sequence evolution, such as transition/transversion rate bias, nucleotide frequency bias, and unequal substitution rates, Ka and Ks need to be estimated under different substitution models.
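The interpretation rule stated above can be written down as a small sketch. The function name `classify_selection` and the tolerance `tol` used to treat two floating-point estimates as equal are assumptions made for illustration; they do not come from the patent.

```python
def classify_selection(ka: float, ks: float, tol: float = 1e-9) -> str:
    """Label a pair of Ka and Ks estimates using the rule from the text.

    `tol` is a hypothetical tolerance for treating the two values as equal.
    """
    if abs(ka - ks) <= tol:
        return "neutral mutation"      # Ka = Ks
    if ka > ks:
        return "positive selection"    # Ka > Ks
    return "negative selection"        # Ka < Ks


if __name__ == "__main__":
    for ka, ks in [(0.30, 0.10), (0.05, 0.05), (0.02, 0.40)]:
        print(ka, ks, classify_selection(ka, ks))
```

In practice the estimates carry sampling error, so a real analysis would normally pair this comparison with a statistical test rather than a fixed tolerance.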
The prior art mainly dissects the evolutionary constraints acting on protein-coding genes. One example is SWAPSC, a sliding window analysis program for detecting selective constraints. It estimates the nucleotide substitution rate for specific codon regions in each branch of the phylogenetic tree and uses several sets of simulated sequences to estimate the probability of synonymous and nonsynonymous nucleotide substitutions. A statistical analysis of the simulated sequences is then performed to determine an optimal window size.
The web-based tool WSPMaker (Window-sliding Selection pressure Plot Maker) calculates the selection pressure (estimated as Ka/Ks) in subregions of two protein-coding DNA sequences (CDSs). By analyzing the sequences with a window of definable length, it calculates and displays the overall and region-specific selective constraints of the two sequences. Domain information from Pfam HMM models is used to detect highly conserved positions in homologous proteins. However, the prior art has some shortcomings. Sequence characteristics such as nucleotide polymorphism and bias, and evolutionary characteristics such as non-uniform base substitution rates, have a relatively large influence on the statistics, and the prior art does not fully account for them, so the analysis results deviate to some extent. In addition, a gamma distribution parameter can be added to these methods to account for the heterogeneity of base substitution rates among sites. The differences among these methods affect, to some extent, the estimation of evolutionary information.
Disclosure of Invention
The main technical problem solved by the invention is to provide a DNA sequence polymorphism analysis method and system that can optimize a series of software packages, such as KaKs_Calculator, that must be run from the command line on other platforms; integrate upstream and downstream analysis tools for inputting and analyzing multiple data types such as genomes and gene sequences; greatly reduce the operational complexity of traditional comparative evolutionary analysis; and improve the efficiency of computing evolutionary relationships.
To solve the above technical problem, the invention provides a DNA sequence polymorphism analysis method comprising the following steps:
S1: inputting a biological gene file, the biological gene file comprising CDS and rRNA;
S2: extracting the coding region genes of the organisms from the biological gene file to form a group of same-position gene sequence files;
S3: inputting the same-position gene sequence files into MAFFT software, which aligns the selected gene sequence files one by one to generate an equivalent gene matrix, thereby obtaining a complete gene sequence file;
S4: obtaining biological species information and selecting the reference genetic code table for the corresponding biological type, the biological species information comprising the biological name and the coding region gene length;
S5: determining whether the biological species information exists in the system database; if so, retrieving the stored sliding window size and step size; if not, having the user input the sliding window size and step size, and storing the input in the system database;
S6: using the sliding window size and step size to perform sliding window statistics of Ka/Ks values on the complete gene sequence file, and generating the Ka/Ks values of each single gene in chart form;
S7: performing GC content statistics on the complete gene sequence file, and generating the GC content value of each single gene in chart form.
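The window size and step size used in S5 and S6 determine which stretches of the aligned sequence are scored. The following is a minimal sketch of how such windows could be enumerated; the function name `sliding_windows` and the choice to discard a trailing partial window are assumptions, since the text does not say how the end of the sequence is handled.

```python
from typing import Iterator, Tuple


def sliding_windows(length: int, window: int, step: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) index pairs of full windows over a sequence.

    Indices are 0-based and `end` is exclusive. A trailing window shorter
    than `window` is dropped (an assumption made for this sketch).
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    start = 0
    while start + window <= length:
        yield start, start + window
        start += step


if __name__ == "__main__":
    # Window 60 and step 3 are the values used in the embodiment.
    for start, end in sliding_windows(length=72, window=60, step=3):
        print(start, end)
```

Choosing a step that is a multiple of 3, as the embodiment does, keeps every window aligned to codon boundaries when the sequence starts in frame.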
Further, the coding region genes consist of DNA sequences. The gene sequences contained in the gb file include CDS and rRNA, and the coding region genes comprise 13 CDS gene sequences.
Furthermore, the MAFFT software is used for carrying out multiple sequence alignment of genes in the biological field.
Furthermore, a same-position gene sequence file is a fas file composed of the gene sequences at the same position in three organisms. The gene file of one organism contains the sequence information of 13 coding regions, so 13 fas files are ultimately generated to store the coding region gene sequences of the three organisms.
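The regrouping described here, from one gene file per organism to one fas file per gene, can be sketched as follows. The function name `build_fas_files`, the input layout (organism name mapped to a dictionary of gene name and sequence), and the file naming are assumptions for illustration; parsing of the gb files themselves is not shown.

```python
from typing import Dict


def build_fas_files(organisms: Dict[str, Dict[str, str]]) -> Dict[str, str]:
    """Return FASTA text per gene, each holding that gene from every organism.

    `organisms` maps an organism name to its {gene name: sequence} mapping
    (a hypothetical in-memory form of the extracted coding region genes).
    Only genes present in every organism are kept.
    """
    if not organisms:
        return {}
    shared = set.intersection(*(set(genes) for genes in organisms.values()))
    files: Dict[str, str] = {}
    for gene in sorted(shared):
        records = [f">{name}\n{genes[gene]}" for name, genes in organisms.items()]
        files[f"{gene}.fas"] = "\n".join(records) + "\n"
    return files


if __name__ == "__main__":
    demo = {
        "species_0": {"nad2": "ATGAAA", "cox1": "ATGCCC"},
        "species_1": {"nad2": "ATGAAG", "cox1": "ATGCCT"},
    }
    for filename, text in build_fas_files(demo).items():
        print(filename)
        print(text)
```

With three organisms and 13 coding regions each, this yields 13 entries, one per gene, each containing three records.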
Further, the reference genetic code table is provided by the system;
the genetic code tables comprise a 5-Arthropoda genetic code table and a 2-Chordata genetic code table, wherein the 5-Arthropoda genetic code table is the invertebrate genetic code table and the 2-Chordata genetic code table is the vertebrate genetic code table.
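A minimal sketch of the table selection follows. The numbers 5 and 2 coincide with the NCBI translation table numbers for the invertebrate mitochondrial and vertebrate mitochondrial codes; treating them as NCBI table numbers is an assumption here, since the text does not name NCBI. The names `GENETIC_CODE_TABLES` and `select_code_table` are hypothetical.

```python
# Hypothetical lookup from biological type to genetic code table number.
GENETIC_CODE_TABLES = {
    "Arthropoda": 5,  # invertebrate table, as named in the text
    "Chordata": 2,    # vertebrate table, as named in the text
}


def select_code_table(biological_type: str) -> int:
    """Return the table number for a biological type, or raise if unknown."""
    try:
        return GENETIC_CODE_TABLES[biological_type]
    except KeyError:
        raise ValueError(f"no genetic code table registered for {biological_type!r}")


if __name__ == "__main__":
    print(select_code_table("Arthropoda"))
```

Raising on an unknown type, rather than falling back to a default table, avoids silently translating codons with the wrong code.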
Further, the sliding window statistics of Ka/Ks values are performed as follows: a reference sequence is selected with reference to the genetic code table; the ith sequence is compared with the reference sequence, and the Ka/Ks values of the two sequences are computed; this is repeated until all non-reference sequences have been processed; and the Ka/Ks values of the i single genes are generated in chart form;
further, the GC content statistics compute the proportion of guanine and cytosine in the complete gene sequence file.
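The GC content statistic is simple enough to show directly. This is a minimal sketch; the function name `gc_content` and the decision to ignore alignment gaps and ambiguous bases in the denominator are assumptions, as the text does not specify how such characters are counted.

```python
def gc_content(sequence: str) -> float:
    """Return the fraction of G and C among the A, C, G, T bases of a sequence.

    Gap characters and ambiguous bases are skipped (an assumption made for
    this sketch). An input with no countable bases gives 0.0.
    """
    bases = [base for base in sequence.upper() if base in "ACGT"]
    if not bases:
        return 0.0
    return sum(base in "GC" for base in bases) / len(bases)


if __name__ == "__main__":
    print(gc_content("ATGC"))
    print(gc_content("gg--cc"))
```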
The beneficial effects of the invention are as follows. The biological evolution system integrates specialized and complex upstream and downstream analysis software and runs it in sequence, which improves the efficiency of mathematical statistics on DNA sequence polymorphisms. A windowed operating platform is provided so that command-line operation is avoided, which lowers the difficulty of operating the system. A series of software packages, such as KaKs_Calculator, that must be run from the command line on other platforms are optimized. The results are suitable for multiple application scenarios such as interactive databases, analysis software, platforms, and publications. The operations of traditional comparative evolutionary analysis are integrated, which greatly reduces operational complexity and improves the efficiency of computing evolutionary relationships.
Drawings
FIG. 1 is a flow chart of a method of DNA sequence polymorphism analysis;
FIG. 2 is a workflow diagram of a method for DNA sequence polymorphism analysis;
FIG. 3 is a schematic diagram of a visual result of a DNA sequence polymorphism analysis method;
FIG. 4 is a system block diagram of a DNA sequence polymorphism analysis system.
Detailed Description
The preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings so that the advantages and features of the present invention can be more easily understood by those skilled in the art, thereby making clear and defining the scope of the present invention.
Referring to fig. 1, 2, 3 and 4, an embodiment of the present invention includes:
a DNA sequence polymorphism analysis method comprising the steps of:
S1: inputting a biological gene file, the biological gene file comprising CDS and rRNA;
S2: extracting the coding region genes of the organisms from the biological gene file to form a group of same-position gene sequence files;
S3: inputting the same-position gene sequence files into MAFFT software, which aligns the selected gene sequence files one by one to generate an equivalent gene matrix, thereby obtaining a complete gene sequence file;
S4: obtaining biological species information and selecting the reference genetic code table for the corresponding biological type, the biological species information comprising the biological name and the coding region gene length;
S5: determining whether the biological species information exists in the system database; if so, retrieving the stored sliding window size and step size; if not, having the user input the sliding window size and step size, and storing the input in the system database;
S6: using the sliding window size and step size to perform sliding window statistics of Ka/Ks values on the complete gene sequence file, and generating the Ka/Ks values of each single gene in chart form;
S7: performing GC content statistics on the complete gene sequence file, and generating the GC content value of each single gene in chart form.
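Step S5 amounts to a lookup with a fallback to user input. The sketch below illustrates that logic with Python's built-in `sqlite3` module. The patent does not say which database the system uses, so the storage engine, the table name `window_params`, its columns, and the `ask_user` callback are all assumptions.

```python
import sqlite3
from typing import Callable, Tuple


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the (hypothetical) parameter store and create its table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS window_params ("
        "species TEXT PRIMARY KEY, window_size INTEGER NOT NULL, step_size INTEGER NOT NULL)"
    )
    return conn


def get_window_params(
    conn: sqlite3.Connection,
    species: str,
    ask_user: Callable[[str], Tuple[int, int]],
) -> Tuple[int, int]:
    """Return (window size, step size) for a species, as in step S5.

    Stored values are reused; otherwise `ask_user` supplies them and the
    answer is saved for later runs.
    """
    row = conn.execute(
        "SELECT window_size, step_size FROM window_params WHERE species = ?",
        (species,),
    ).fetchone()
    if row is not None:
        return row[0], row[1]
    window_size, step_size = ask_user(species)
    conn.execute(
        "INSERT INTO window_params (species, window_size, step_size) VALUES (?, ?, ?)",
        (species, window_size, step_size),
    )
    conn.commit()
    return window_size, step_size


if __name__ == "__main__":
    store = open_store()
    print(get_window_params(store, "example species", lambda name: (60, 3)))
```

Passing the prompt as a callback keeps the lookup logic independent of whether input arrives from a graphical form or from a script.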
As shown in FIG. 2, a user inputs the gene files of three organisms, 10_Nesodiprion_japonicas_Nad2, 55_Nesodiprion_biremis_Nsd2 and 360_Dentathalia_scutellariae, on the DNA sequence characteristic analysis platform. The coding region genes of the organisms are extracted from the three gene files to form a group of same-position gene sequence files, which MAFFT software then aligns one by one to generate an equivalent gene matrix, yielding the complete gene sequence files of the three organisms. Let 360_Dentathalia_scutellariae be species 0, 10_Nesodiprion_japonicas_Nad2 be species 1, and 55_Nesodiprion_biremis_Nsd2 be species 2. With the reference sequence selected with reference to the genetic code table, the sliding window set to 60 and the step size set to 3, the complete gene sequence files of species 0 and species 1, and then of species 0 and species 2, are compared in turn, and the Ka/Ks values of each pair of sequences are computed until all non-reference sequences have been processed. The Ka/Ks values of the single gene are then generated in chart form.
As shown in FIG. 3, the solid line shows the Ka/Ks values computed for 360_Dentathalia_scutellariae and 10_Nesodiprion_japonicas_Nad2, and the dashed line shows the Ka/Ks values computed for 360_Dentathalia_scutellariae and 55_Nesodiprion_biremis_Nsd2. In both cases the statistics are computed until all non-reference sequences have been processed, and the Ka/Ks values of the single gene are generated in chart form.
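The loop structure of this example (one reference sequence, every other sequence scanned against it window by window) can be sketched as follows. The sketch does not compute Ka/Ks: estimating Ka and Ks requires a substitution model, so a placeholder statistic, the fraction of differing codons, stands in for it and can be swapped out through the `statistic` parameter. All function names and the toy sequences are hypothetical.

```python
from typing import Callable, Dict, List


def codon_difference(ref_window: str, other_window: str) -> float:
    """Placeholder statistic (NOT Ka/Ks): fraction of codons that differ."""
    usable = min(len(ref_window), len(other_window))
    pairs = [
        (ref_window[i:i + 3], other_window[i:i + 3])
        for i in range(0, usable - 2, 3)
    ]
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)


def scan_against_reference(
    aligned: Dict[str, str],
    reference: str,
    window: int,
    step: int,
    statistic: Callable[[str, str], float] = codon_difference,
) -> Dict[str, List[float]]:
    """Score every non-reference sequence against the reference, per window.

    `aligned` maps sequence names to aligned sequences of equal length.
    The result holds one list of window values per non-reference sequence,
    i.e. one curve per comparison.
    """
    ref_seq = aligned[reference]
    curves: Dict[str, List[float]] = {}
    for name, seq in aligned.items():
        if name == reference:
            continue
        curves[name] = [
            statistic(ref_seq[start:start + window], seq[start:start + window])
            for start in range(0, len(ref_seq) - window + 1, step)
        ]
    return curves


if __name__ == "__main__":
    toy = {
        "species_0": "ATGAAACCCGGG",
        "species_1": "ATGAAGCCCGGG",
        "species_2": "ATGAAACCCGGA",
    }
    print(scan_against_reference(toy, "species_0", window=6, step=3))
```

Each returned list corresponds to one curve of the kind plotted in FIG. 3, with species 0 as the shared reference.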
Further, the coding region genes consist of DNA sequences. The gene sequences contained in the gb file include CDS and rRNA, and the coding region genes comprise 13 CDS gene sequences.
Furthermore, the MAFFT software is used for multiple sequence alignment of genes in the biological field.
Furthermore, a same-position gene sequence file is a fas file composed of the gene sequences at the same position in three organisms. The gene file of one organism contains the sequence information of 13 coding regions, so 13 fas files are ultimately generated to store the coding region gene sequences of the three organisms.
Further, the reference genetic code table is provided by the system;
the genetic code tables comprise a 5-Arthropoda genetic code table and a 2-Chordata genetic code table, wherein the 5-Arthropoda genetic code table is the invertebrate genetic code table and the 2-Chordata genetic code table is the vertebrate genetic code table.
Further, the sliding window statistics of Ka/Ks values are performed as follows: a reference sequence is selected with reference to the genetic code table; the ith sequence is compared with the reference sequence, and the Ka/Ks values of the two sequences are computed; this is repeated until all non-reference sequences have been processed; and the Ka/Ks values of the i single genes are generated in chart form;
further, the GC content statistics compute the proportion of guanine and cytosine in the complete gene sequence file.
As shown in fig. 4, a DNA sequence polymorphism analysis system, comprising:
the file input module is used for inputting the gene file to be tested into the system;
the gene extraction module is used for extracting the coding region genes from the input gene files to form the gene sequence files at the same position;
the gene comparison module is used for comparing the gene sequence files at the same position to form an equivalent gene matrix, so as to obtain a complete gene sequence file;
the information acquisition module is used for acquiring biological species information of the complete gene sequence file;
the data storage module is used for storing the historical species information in the system, together with the sliding window size and step size set for each historical species;
the Ka/Ks value calculation module is used for calculating the Ka/Ks values in the complete gene sequence and generating the single-gene Ka/Ks values in chart form;
the GC content statistics module is used for counting the GC content in the complete gene sequence and generating single-gene GC content values in a chart form.
The foregoing description is only illustrative of the present invention and is not intended to limit the scope of the invention, and all equivalent structures or equivalent processes or direct or indirect application in other related technical fields are included in the scope of the present invention.

Claims (8)

1. A method for analyzing a polymorphism in a DNA sequence, comprising the steps of:
S1: inputting a biological gene file, the biological gene file comprising CDS and rRNA;
S2: extracting the coding region genes of the organisms from the biological gene file to form a group of same-position gene sequence files;
S3: inputting the same-position gene sequence files into MAFFT software, which aligns the selected gene sequence files one by one to generate an equivalent gene matrix, thereby obtaining a complete gene sequence file;
S4: obtaining biological species information and selecting the reference genetic code table for the corresponding biological type, the biological species information comprising the biological name and the coding region gene length;
S5: determining whether the biological species information exists in the system database; if so, retrieving the stored sliding window size and step size; if not, having the user input the sliding window size and step size, and storing the input in the system database;
S6: using the sliding window size and step size to perform sliding window statistics of Ka/Ks values on the complete gene sequence file, and generating the Ka/Ks values of each single gene in chart form;
S7: performing GC content statistics on the complete gene sequence file, and generating the GC content value of each single gene in chart form.
2. The method of claim 1, wherein the coding region genes in step S2 consist of DNA sequences, specifically 13 CDS gene sequences.
3. The method of claim 1, wherein the MAFFT software in step S3 is used for multiple sequence alignment of genes in biological fields.
4. The method of claim 1, wherein the gene sequence file at the same location in step S3 is a fas file composed of gene sequences at the same location of three organisms, wherein the gene file of one organism contains 13 coding region sequence information, and finally 13 fas files are generated to store the coding region gene sequences of the three organisms.
5. The method of claim 1, wherein the reference genetic code table in step S4 comprises: a 5-Arthropoda genetic code table and a 2-Chordata genetic code table;
wherein the 5-Arthropoda genetic code table is the invertebrate genetic code table, and the 2-Chordata genetic code table is the vertebrate genetic code table.
6. The method of claim 1, wherein the sliding window statistics of Ka/Ks values in step S6 is performed by selecting a reference sequence set by referring to a genetic code table, comparing the ith sequence with the reference sequence one by one, and performing statistics of Ka/Ks values of two sequences until one cycle is completed for all non-reference sequences, and generating Ka/Ks values of i single genes in a graph form.
7. The method of claim 1, wherein the GC content statistics in step S7 compute the proportion of guanine and cytosine in the complete gene sequence file.
8. A system for applying the DNA sequence polymorphism analysis method as set forth in any one of claims 1 to 7, comprising:
the file input module is used for inputting the gene file to be tested into the system;
the gene extraction module is used for extracting the coding region genes from the input gene files to form the gene sequence files at the same position;
the gene comparison module is used for comparing the gene sequence files at the same position to form an equivalent gene matrix, so as to obtain a complete gene sequence file;
the information acquisition module is used for acquiring biological species information of the complete gene sequence file;
the data storage module is used for storing the historical species information in the system, together with the sliding window size and step size set for each historical species;
the Ka/Ks value calculation module is used for calculating the Ka/Ks values in the complete gene sequence and generating the single-gene Ka/Ks values in chart form;
the GC content statistics module is used for counting the GC content in the complete gene sequence and generating single-gene GC content values in a chart form.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211691737.3A CN116130002A (en) 2022-12-28 2022-12-28 DNA sequence polymorphism analysis method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211691737.3A CN116130002A (en) 2022-12-28 2022-12-28 DNA sequence polymorphism analysis method and system

Publications (1)

Publication Number Publication Date
CN116130002A (en) 2023-05-16

Family

ID=86303866

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211691737.3A Pending CN116130002A (en) 2022-12-28 2022-12-28 DNA sequence polymorphism analysis method and system

Country Status (1)

Country Link
CN (1) CN116130002A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118038991A (en) * 2024-04-12 2024-05-14 宁波甬恒瑶瑶智能科技有限公司 Gene sequence processing method, system, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
Inventor after: Ma Yong; Jiang Xinghong; Niu Gengyun; Jian Wenxin; Dai Mengxuan; Zheng Wensheng; Xiao Yi; Li Min
Inventor before: Ma Yong; Jiang Xinghong; Niu Gengyun; Jian Wenxin; Dai Mengxuan; Zheng Wensheng; Xiao Yi; Li Min