CN118430645A - Full-gene DNA data redefinition method - Google Patents
- Publication number
- CN118430645A; CN202410880242.8A
- Authority
- CN
- China
- Prior art keywords
- file
- genome
- gene
- data
- region
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6876—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
- C12Q1/6888—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms
- C12Q1/6895—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms for plants, fungi or algae
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q2600/00—Oligonucleotides characterized by their use
- C12Q2600/156—Polymorphic or mutational markers
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The invention relates to the technical field of genome data analysis, and in particular to a method for redefining whole-gene DNA data. The method comprises the following steps: acquiring a genome region file, a reference genome file and a genome variation site file of a species to be tested, the genome variation site file being determined by comparison against the reference genome file; extracting data information on the positions of the genes of the species to be tested from the genome region file; determining mutation sites from the genome variation site file; classifying the data information in the genome region file; taking the gene coding region as the weight, scoring the other regions from high to low according to their role in the biological processes of the species to be tested and their distance from the mutation site, the region whose position is nearest to the mutation site being given 10 points and the remaining regions successively lower scores; and defining the species to be tested according to the scores. The advantages are simple calculation, time saving and wide applicability.
Description
Technical Field
The invention relates to the technical field of genome data analysis, and in particular to a method for redefining whole-gene DNA data.
Background
A gene is the entire nucleotide sequence required to produce a single polypeptide chain or functional RNA, and a DNA fragment with genetic information is called a gene.
The genome region file is an important file type. By reading it, one can learn the coding regions, non-coding regions, gene structures, protein coding regions, promoter regions, transcription factor binding sites and other important information in a genomic DNA sequence. This information plays a key role in the subsequent functional analysis of a species. A common example is the GFF format. The genome variation site file records all sites in the genome that were assessed against a reference genome, whether or not they match it; a common example is the VCF file.
The GFF format was defined by the Sanger Institute. It is a simple and convenient data format for describing features of DNA, RNA and protein sequences, and is currently a common format for sequence annotation. GFF stands for "General Feature Format", a text file format that describes genes, transcripts, exons, introns and other sequence features in biological sequences. These features are typically used in applications such as genome annotation, gene recognition, sequence alignment and gene function prediction. Besides the position of each feature, a GFF file can record information such as the name, role and references of the feature, so that all feature information in the sequence is described more fully.
The CDS (coding sequence) is the sequence that codes for a protein product. DNA is transcribed into mRNA, and the mRNA is translated into protein after processing such as splicing. The CDS is the DNA sequence that corresponds one-to-one with the protein sequence: it contains no sequence that does not correspond to the protein, and sequence changes that occur during mRNA processing are not considered. In short, the CDS corresponds exactly to the codons of the protein. Studying the CDS reveals the amino acid sequence and function of the protein encoded by a gene and allows the evolution and variation of the gene to be investigated. Analysis and modification of CDS sequences are also of great value in fields such as gene editing and gene therapy.
A protein-protein interaction (PPI) network is a graphical representation of the interaction relationships between proteins. In biology, proteins do not exist only as individual molecules; they also interact to form complex network structures. These interactions may be direct, such as physical binding between proteins, or indirect, such as through a common signaling pathway. Systematic analysis of the interactions among large numbers of proteins in a biological system is of great significance for understanding how proteins work in that system, for understanding the response mechanisms of biological signaling and of energy and substance metabolism under special physiological states such as disease, and for understanding the functional relationships among proteins. It is often used in screening for key genes.
At the whole-gene level, DNA data are important biological data, and studying DNA data and interpreting their meaning are major research tasks of the genome era. DNA data are widely used: for storing arbitrary digital information, building DNA databases, determining genotypes, performing gene sequencing and carrying out subsequent analysis. However, in the prior art a species cannot be characterized directly from its GFF files and vcf files.
In molecular biology and genetics, the genome is an important body of biological data, and studying the genome and interpreting its meaning are major research tasks of these fields. Because a change in a gene causes a change in the protein it encodes, and thereby in the structure and function of that protein, the importance of studying genes is self-evident. Identifying genes is not easy, however; it requires the computing power of a computer and algorithms designed on the basis of biological knowledge. In view of this, the present invention provides a method of visualizing a genome and redefining the genome. Known or unknown data patterns and comparison differences are identified conveniently and intuitively by scoring the most basic genomic DNA sequences, the annotation data and other genome-related analysis data, together with annotation files annotated with the number of protein network nodes of each gene.
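As an illustration of the feature description above, the following minimal sketch splits one GFF3 record into its nine tab-separated columns. The record itself is hypothetical: only the CDS coordinates 335869 to 337498 are taken from the rice example later in this document, while the sequence name, source, strand, phase and attribute values are placeholders, and `GffFeature` and `parse_gff3_line` are names chosen here for illustration.

```python
from typing import Dict, NamedTuple


class GffFeature(NamedTuple):
    """One GFF3 record: the nine standard columns."""
    seqid: str
    source: str
    type: str
    start: int
    end: int
    score: str
    strand: str
    phase: str
    attributes: Dict[str, str]


def parse_gff3_line(line: str) -> GffFeature:
    """Split one tab-delimited GFF3 record into named fields."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, got {len(cols)}")
    # Column 9 holds semicolon-separated key=value pairs.
    attributes = {}
    for item in cols[8].split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attributes[key.strip()] = value.strip()
    return GffFeature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                      cols[5], cols[6], cols[7], attributes)


# Hypothetical record; only the coordinates come from the rice example.
example = "Chr1\texample\tCDS\t335869\t337498\t.\t+\t0\tID=cds1;Parent=mRNA1"
feature = parse_gff3_line(example)
print(feature.type, feature.start, feature.end, feature.attributes["Parent"])
```

The third column (`type`) is the field on which the classification step of the method operates.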
Disclosure of Invention
The invention provides a whole-gene DNA data redefinition method for solving the problems.
The invention aims to provide a method for redefining whole-gene DNA data, which specifically comprises the following steps:
S1, acquiring files: acquiring a genome region file, a reference genome file and a genome variation site file of a species to be tested; the genome variation site file is determined by comparison against the reference genome file;
S2, data information processing: extracting data information on the positions of the genes of the species to be tested from the genome region file; determining mutation sites from the genome variation site file;
S3, classifying data information: classifying the data information in the genome region file;
S4, scoring and sorting: taking the gene coding region as the weight, scoring the mRNA, gene, exon, UTR, QTL and/or methylation regions from high to low according to their role in the biological processes of the species to be tested and their distance from the mutation site; the region whose position is nearest to the mutation site is given 10 points, and regions within 1000 bp upstream and downstream of the mutation site are given successively decreasing scores; and defining the species to be tested according to the scores.
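The distance part of step S4 can be sketched as follows. This is only an illustration under stated assumptions: the text fixes the top score (10 points) and the 1000 bp window but not the size of the decrement, so a step of 1 point per rank is assumed here, and the weighting by biological role is left out. The function names and all coordinates except the CDS interval from the rice example are hypothetical.

```python
from typing import Dict, List, Tuple

Region = Tuple[str, int, int]  # (label, start, end), 1-based inclusive


def distance_to_site(start: int, end: int, site: int) -> int:
    """Distance in bp from a region to a site; 0 if the site lies inside."""
    if start <= site <= end:
        return 0
    return start - site if site < start else site - end


def score_regions(regions: List[Region], site: int, window: int = 1000,
                  top: int = 10, step: int = 1) -> Dict[str, int]:
    """Give the nearest region `top` points and farther ones less.

    Assumption: each further rank loses `step` points, floored at 0.
    Regions farther than `window` bp from the site are not scored.
    """
    nearby = [(distance_to_site(s, e, site), label) for label, s, e in regions]
    nearby = [item for item in nearby if item[0] <= window]
    nearby.sort()  # nearest first; ties are broken by label
    return {label: max(top - rank * step, 0)
            for rank, (_, label) in enumerate(nearby)}


regions = [
    ("CDS", 335869, 337498),             # interval from the rice example
    ("exon_b", 337600, 337900),          # hypothetical
    ("UTR_b", 337950, 338100),           # hypothetical
    ("five_prime_UTR", 335700, 335868),  # hypothetical, outside the window
    ("gene_c", 339000, 340000),          # hypothetical, outside the window
]
scores = score_regions(regions, site=337000)
print(scores)
```

A different decrement, or a multiplier for the coding region, could be substituted by changing `step` or post-processing the returned dictionary.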
Preferably, the genome region file in step S1 includes a genome structure annotation file and a protein annotation file.
Preferably, the classification in step S3 specifically includes the following steps:
S31, extracting the rows of the genome region file whose type column is the gene coding region (CDS), and likewise the rows for mRNA, gene, exon and/or UTR, and classifying them using an awk command in Linux;
S32, selecting all of the classified data and screening the data by region type;
S33, importing the genome region file and the reference genome file using the software TBtools, outputting the files by type, and completing the classification;
S34, constructing a protein interaction network: pairing the proteins in the protein annotation file with homologous Arabidopsis proteins, constructing a protein interaction network from an existing protein interaction database after pairing, and calculating the number of nodes connected to each protein.
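The node count in step S34 can be illustrated with a small sketch that counts, for each protein, the number of distinct proteins it is connected to in an edge list. The edge list and protein names below are hypothetical placeholders; the text does not specify which interaction database or file layout is used.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def count_connected_nodes(edges: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Return the number of distinct interaction partners of each protein."""
    neighbours = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # ignore self-interactions
        neighbours[a].add(b)
        neighbours[b].add(a)
    return {protein: len(partners) for protein, partners in neighbours.items()}


# Hypothetical interactions; the duplicate P1-P2 edge is counted only once.
edges = [("P1", "P2"), ("P1", "P3"), ("P2", "P3"), ("P3", "P4"), ("P1", "P2")]
degrees = count_connected_nodes(edges)
print(degrees)
```

Using sets makes the count robust to an interaction being listed more than once or in both directions.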
Preferably, step S4 further includes: scoring the regions from high to low according to their role in the biological processes of the species to be tested and their distance from the mutation site to obtain region base scores, and calculating node scores from the number of nodes of each gene in the protein interaction network, each node counting as 1 point;
The scoring formula is:
Node score = [formula not preserved in the source text];
The region base score and the node score are added to obtain the score.
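The combination of the two scores can be sketched as below. The node-score formula itself is not legible in the text, so this sketch assumes the simplest reading of "each node counts as 1 point", namely that the node score equals the number of connected nodes; the function names and the sample numbers are illustrative only.

```python
def node_score(n_nodes: int, points_per_node: int = 1) -> int:
    """Assumed reading: each connected node contributes 1 point."""
    if n_nodes < 0:
        raise ValueError("node count cannot be negative")
    return n_nodes * points_per_node


def total_score(region_base_score: int, n_nodes: int) -> int:
    """Final score = region base score + node score."""
    return region_base_score + node_score(n_nodes)


# A region with base score 10 whose gene has 3 connected nodes.
print(total_score(10, 3))  # 13
```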
Preferably, the awk command is:
[command not preserved in the source text];
where X is CDS, mRNA, gene, exon, UTR, QTL or a methylation region.
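For readers who prefer not to use awk, the same selection by region type X can be sketched in Python. This is an equivalent of filtering on the third GFF column, not the command of the invention; `filter_by_type` and the sample records are hypothetical.

```python
from typing import Iterable, List


def filter_by_type(lines: Iterable[str], feature_type: str) -> List[str]:
    """Keep the GFF records whose third column equals `feature_type`."""
    selected = []
    for line in lines:
        if line.startswith("#"):
            continue  # skip header and comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 3 and cols[2] == feature_type:
            selected.append(line.rstrip("\n"))
    return selected


# Hypothetical records for illustration.
records = [
    "##gff-version 3",
    "Chr1\texample\tgene\t335700\t337900\t.\t+\t.\tID=gene1",
    "Chr1\texample\tmRNA\t335700\t337900\t.\t+\t.\tID=mRNA1;Parent=gene1",
    "Chr1\texample\tCDS\t335869\t337498\t.\t+\t0\tID=cds1;Parent=mRNA1",
]
for record in filter_by_type(records, "CDS"):
    print(record)
```

Calling the function once per value of X yields one list per region type.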
Preferably, the reference genome file is a FASTA sequence file; the genomic variation site file is a vcf file.
Preferably, the method for acquiring the vcf file in step S1 specifically includes:
Removing sequencing adapters from the raw sequencer output with the fastp data quality-control software to obtain the sequencing data; aligning the sequencing data to the reference genome with the bwa sequence alignment software, and sorting the aligned data by preset genome position with samtools; removing PCR duplicates from the sorted data with the picard toolkit for high-throughput sequencing data formats; and performing genome variation analysis on the filtered data with GATK to finally obtain the vcf file.
Preferably, the genome structure annotation file is the GFF format file of a given gene to which the number of nodes has been added as an annotation.
Preferably, the GFF format file is obtained by annotation using the reference genome file and the sequencing data.
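The order of the tools named above can be sketched as a list of command lines built in Python. The text names only the tools (fastp, bwa, samtools, picard, GATK); the specific subcommands (`bwa mem`, `samtools sort`, `MarkDuplicates`, `HaplotypeCaller`), paired-end input, file names and read-group string are assumptions made for this sketch. Indexing of the reference and BAM files, which a real run would also need, is omitted, and nothing is executed here.

```python
from typing import List


def build_vcf_pipeline(sample: str, ref: str, fq1: str, fq2: str) -> List[List[str]]:
    """Return the commands, in order, for one paired-end sample (assumed layout)."""
    clean1, clean2 = f"{sample}.clean_1.fq.gz", f"{sample}.clean_2.fq.gz"
    sam = f"{sample}.sam"
    sorted_bam = f"{sample}.sorted.bam"
    dedup_bam = f"{sample}.dedup.bam"
    # bwa expands the literal \t sequences in the read-group string.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}"
    return [
        # 1. Adapter removal and quality control.
        ["fastp", "-i", fq1, "-I", fq2, "-o", clean1, "-O", clean2],
        # 2. Alignment to the reference genome.
        ["bwa", "mem", "-R", read_group, "-o", sam, ref, clean1, clean2],
        # 3. Sorting by genome position.
        ["samtools", "sort", "-o", sorted_bam, sam],
        # 4. Marking PCR duplicates.
        ["picard", "MarkDuplicates", f"I={sorted_bam}", f"O={dedup_bam}",
         f"M={sample}.dup_metrics.txt"],
        # 5. Variant calling, producing the vcf file.
        ["gatk", "HaplotypeCaller", "-R", ref, "-I", dedup_bam,
         "-O", f"{sample}.vcf.gz"],
    ]


for command in build_vcf_pipeline("sample1", "ref.fa", "s1_1.fq.gz", "s1_2.fq.gz"):
    print(" ".join(command))
```

Each inner list could be passed to `subprocess.run(command, check=True)` once the required indexes exist.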
Preferably, the test species is rice or soybean.
Compared with the prior art, the invention has the following beneficial effects:
(1) Biological significance is assigned to GFF files, whose format carries no biological meaning in itself;
(2) Only the GFF files and vcf files of the species to be tested are processed, then scored and sorted accordingly; no heavy computation is needed, which saves time and simplifies subsequent analysis;
(3) Classification and sorting require only the genome region file, the reference genome file and the genome variation site file of the species to be analyzed, and the method can be applied in many areas, such as genome-wide association studies (GWAS), genomic selection (GS) breeding, QTL mapping, species evolution, large-population screening, standards for identifying new species, and radiation mutagenesis screening.
(4) The present invention provides a method for visualizing a genome and redefining the genome. Known or unknown data patterns and comparison differences are identified conveniently and intuitively by scoring the most basic genomic DNA sequences, the annotation data and other genome-related analysis data, together with annotation files annotated with the number of protein network nodes of each gene.
Drawings
Fig. 1 is a flowchart of a whole-gene DNA data redefinition method provided according to an embodiment of the present invention.
Detailed Description
Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings. In the following description, like modules are denoted by like reference numerals. In the case of the same reference numerals, their names and functions are also the same. Therefore, a detailed description thereof will not be repeated.
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not to be construed as limiting the invention.
Example 1
In this embodiment, taking rice as an example, a method for redefining whole-gene DNA data is provided, which specifically includes the following steps:
S1, acquiring files: obtaining the genome region (GFF) file, the reference genome file and the genome variation site (vcf) file for chromosome 1 of 533 rice samples; the genome variation site (vcf) file is determined by comparison against the reference genome file;
The genome region (GFF) files include a genome structure annotation file and a protein annotation file. These are obtained by combining the GFF file of the gene starting at position 261504 on chromosome 1 in the 533 rice samples with an annotation file to which the number of protein network nodes of each gene, derived from the protein interaction network of the genes, has been added as an annotation;
The GFF file is obtained by the following steps:
looking up the Latin name of the species (rice) and using it to search and download from databases such as NCBI, Ensembl, UCSC and GENCODE;
or checking and downloading the corresponding GFF files in the published literature;
or by annotation using the reference genome file and sequencing data;
The reference genome file is obtained by the following steps:
Downloading from databases such as NCBI, Ensembl, UCSC and GENCODE: look up the Latin name of the species, open the database website and search by the Latin name to obtain the reference genome file of the species;
or viewing and downloading the corresponding reference genome file in the published literature;
For species without a reference genome, sequencing data can be obtained by a sequencing means and genome assembly is carried out to obtain a reference genome file;
vcf file is a commonly used bioinformatics file format for storing genomic or transcriptome variant information; the acquisition mode is as follows:
The raw sequencer output of the male and female parents and of the individual hybrid progeny (F1) plants is processed with the fastp data quality-control software to remove sequencing adapters, giving the sequencing data of the parents and of the hybrid progeny individuals;
The sequencing data of the parents and of the hybrid progeny individuals are then aligned to the reference genome with the bwa sequence alignment software, and the aligned data of the parents and of the progeny are each sorted by preset genome position with samtools;
PCR duplicates in the sorted data of the parents and of the hybrid progeny individuals are then removed with the picard toolkit for high-throughput sequencing data formats;
Finally, genome variation analysis is performed on the filtered sequencing data of the parents and of the hybrid progeny individuals with GATK to obtain the vcf file.
S2, data information processing: extracting the data information of each region, such as the gene coding region, mRNA, gene, exon and five_prime_UTR, from the genome region (GFF) file; determining mutation sites from the vcf file;
after determining the mutation site, extracting genome position coordinate information of the mutation site;
Specifically, the data information includes specific information of position coordinates of the gene structure and genomic position coordinate information of the mutation site.
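Extracting the genome position coordinates of the mutation sites in step S2 amounts to reading the first columns of each vcf record. The following is a minimal sketch under the assumption of a plain-text, tab-delimited vcf file with the standard column order (CHROM, POS, ID, REF, ALT, ...); the sample records and the function name are hypothetical.

```python
from typing import Iterable, List, Tuple


def read_variant_sites(lines: Iterable[str]) -> List[Tuple[str, int, str, str]]:
    """Return (chromosome, position, reference allele, alternate allele) per record."""
    sites = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip meta-information, the header row and blank lines
        cols = line.rstrip("\n").split("\t")
        sites.append((cols[0], int(cols[1]), cols[3], cols[4]))
    return sites


# Hypothetical records for illustration.
vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "Chr1\t336000\t.\tA\tG\t50\tPASS\t.",
    "Chr1\t337000\t.\tC\tT\t60\tPASS\t.",
]
print(read_variant_sites(vcf_lines))
```

The returned positions can then be compared with the region coordinates taken from the GFF file.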
S3, classifying data information: the data information extracted in the step S2 is classified, and the specific method comprises the following steps:
S31, extracting from the GFF file only the rows for the gene coding region (CDS region), classifying with an awk command in Linux, for example: awk 'BEGIN { FS = OFS = "\t" } $3 == "CDS" { print $0 }' wuzhong.gff3 > cds.txt; to extract another category, replace CDS inside the quotation marks with that category;
S32, opening the file to be processed in Excel, selecting all data, filtering on the third column (the GFF file format is fixed, and the third column is Type), and classifying the data according to the information in that column;
S33, importing the GFF file and the FASTA sequence file using the existing software TBtools, outputting the files by type, and completing the classification;
Specifically, open the TBtools software and select Sequence Toolkit > GFF3/GTF Manipulate > GXF Sequences Extract; import the GFF file and the FASTA sequence file, click Initialize, select the category in the Feature Tag drop-down list, choose the output location, set the output file name, and click Start;
the results are shown in Table 1 below;
TABLE 1 data information for each region of chromosome I in Rice data
Taking a CDS region with a start position 335869 and an end position 337498 on chromosome 1 of rice as an example, vcf files are shown in the following table:
TABLE 2 Rice vcf File
S34, constructing a protein interaction network: pairing is carried out according to a protein annotation file of a species to be detected and homologous arabidopsis proteins, a protein interaction network is constructed according to the existing protein interaction database after pairing, and the number of nodes connected with each protein is calculated;
S4, scoring and sorting: taking the gene coding region as the weight, scoring the mRNA, gene, exon, UTR, QTL and/or methylation regions from high to low according to their role in the biological processes of the species to be tested and their distance from the mutation site; the region whose position is nearest to the mutation site is given 10 points, and regions within 1000 bp upstream and downstream of the mutation site are given successively decreasing scores; the scoring results are shown in Table 3;
TABLE 3 scoring results for rice
The regions are scored from high to low according to their role in the biological processes of the species to be tested and their distance from the mutation site to obtain region base scores, and node scores are calculated from the number of nodes of each gene in the protein interaction network, each node counting as 1 point; the scoring formula is:
Node score = [formula not preserved in the source text];
Nodes in the protein interaction network represent proteins, and edges represent interaction relationships between proteins;
The region base score and the node score are added to obtain the score; the species to be tested is defined according to the score; finally, the DNA data of the rice are redefined according to the scoring results, and high-value sites are estimated.
Example 2
Genes whose start positions on chromosome 1 in the soybean data are 319784 and 353520 are selected (the specific procedure is the same as in Example 1); the data information for each region of chromosome 1 (Gm01) in the soybean data is shown in Table 4;
TABLE 4 data information for each region of chromosome I in soybean data
Scoring results are shown in table 5;
Table 5 scoring results for soybean
The species to be tested is defined according to the score; finally, the DNA data of the soybean are redefined according to the scoring results, and high-value sites are estimated.
The method of the invention is a new way of defining conventional single nucleotide polymorphisms. A SNP (single nucleotide polymorphism) is a DNA sequence polymorphism caused by variation of a single nucleotide at the genome level. Some SNPs located inside genes are likely to affect protein structure or expression level directly, so studying SNP mutation sites in combination with the various regions of the GFF file is highly representative.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.
Claims (10)
1. A method for redefining whole-gene DNA data, characterized by comprising the following steps:
S1, acquiring files: acquiring a genome region file, a reference genome file and a genome variation site file of a species to be tested; the genome variation site file is determined by comparison against the reference genome file;
S2, data information processing: extracting data information on the positions of the genes of the species to be tested from the genome region file; determining mutation sites from the genome variation site file;
S3, classifying data information: classifying the data information in the genome region file;
S4, scoring and sorting: taking the gene coding region as the weight, scoring the mRNA, gene, exon, UTR, QTL and/or methylation regions from high to low according to their role in the biological processes of the species to be tested and their distance from the mutation site; the region whose position is nearest to the mutation site is given 10 points, and regions within 1000 bp upstream and downstream of the mutation site are given successively decreasing scores; and defining the species to be tested according to the scores.
2. The method for redefining whole-gene DNA data according to claim 1, wherein: the genomic region file in the step S1 includes a genomic structure annotation file and a protein annotation file.
3. The method for redefining whole-gene DNA data according to claim 2, wherein: the classification in the step S3 specifically includes the following steps:
S31, extracting the column for the gene coding region (CDS) in the genome region file, extracting the columns for mRNA, gene, exon, UTR and/or function, and classifying them with an awk command in Linux;
S32, selecting all the classified data, and screening the data according to the different types of regions;
S33, importing the genome region file and the reference genome file with the software TBtools, outputting the types of the files, and completing the classification;
S34, constructing a protein interaction network: pairing the protein annotation file with homologous Arabidopsis proteins, constructing a protein interaction network from an existing protein interaction database after pairing, and calculating the number of nodes connected to each protein.
4. The method for redefining whole-gene DNA data according to claim 3, wherein: the step S4 further includes: scoring in order from high to low according to the biological process of the species to be tested and the distance from the mutation site to obtain a region base score, and calculating a node score from the number of nodes of each gene in the protein interaction network, each node counting as 1 point; the scoring formula is:
Node score = ;
The region base score and the node score are added to obtain the score.
5. The method for redefining whole-gene DNA data according to claim 4, wherein: the awk command is:
;
The X is CDS, mRNA, gene, exon, UTR, QTL or a methylation region.
6. The method for redefining whole-gene DNA data according to claim 5, wherein: the reference genome file is a FASTA sequence file; the genomic variation site file is a vcf file.
7. The method for redefining whole-gene DNA data according to claim 6, wherein: the method for acquiring the vcf file in the step S1 specifically includes:
Removing sequencing adapters from the raw off-machine data with the fastp data quality control software to obtain sequencing data; aligning the sequencing data to the reference genome with the bwa sequence alignment software, and sorting the aligned sequencing data with samtools according to preset genome position information; filtering PCR duplicates from the sorted sequencing data with the picard toolkit for high-throughput sequencing data formats; and performing genome variation analysis on the filtered sequencing data with GATK to finally obtain the vcf file.
8. The method for redefining whole-gene DNA data according to claim 7, wherein: the genome structure annotation file is an annotation file obtained by adding an annotation of the node number to the GFF format file of a given gene.
9. The method for redefining whole-gene DNA data according to claim 8, wherein: the GFF format file is obtained by annotating the reference genome file and sequencing data.
10. The method for redefining whole-gene DNA data according to claim 9, wherein: the species to be detected is rice or soybean.
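For illustration only, the tool chain named in claim 7 (fastp, bwa, samtools, picard, GATK) can be laid out as an ordered list of shell commands. The Python sketch below only builds the command strings and does not run them. It is not the applicant's pipeline: paired-end input, the `bwa mem` algorithm, the `HaplotypeCaller` tool, the read-group string, the reference indexing steps and all file names are assumptions, and `build_vcf_pipeline` is a hypothetical helper.

```python
import shlex


def build_vcf_pipeline(ref, fq1, fq2, sample, prefix):
    """Return the shell commands, in order, for raw reads to a vcf file.

    Assumes paired-end reads and that fastp, bwa, samtools, picard and
    gatk are available on PATH under those names.
    """
    q = shlex.quote
    clean1 = f"{prefix}.clean_1.fq.gz"
    clean2 = f"{prefix}.clean_2.fq.gz"
    sorted_bam = f"{prefix}.sorted.bam"
    dedup_bam = f"{prefix}.dedup.bam"
    metrics = f"{prefix}.dup_metrics.txt"
    vcf = f"{prefix}.vcf.gz"
    # GATK requires read groups; this ID/SM/PL choice is an assumption.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    return [
        # Reference indexes needed by bwa and GATK (not named in the claim).
        f"bwa index {q(ref)}",
        f"samtools faidx {q(ref)}",
        f"gatk CreateSequenceDictionary -R {q(ref)}",
        # Adapter removal and quality control.
        f"fastp -i {q(fq1)} -I {q(fq2)} -o {q(clean1)} -O {q(clean2)}",
        # Alignment to the reference, then sorting by genome position.
        f"bwa mem -R {q(read_group)} {q(ref)} {q(clean1)} {q(clean2)}"
        f" | samtools sort -o {q(sorted_bam)} -",
        # Filtering of PCR duplicates.
        f"picard MarkDuplicates I={q(sorted_bam)} O={q(dedup_bam)}"
        f" M={q(metrics)} REMOVE_DUPLICATES=true",
        f"samtools index {q(dedup_bam)}",
        # Variant analysis producing the vcf file.
        f"gatk HaplotypeCaller -R {q(ref)} -I {q(dedup_bam)} -O {q(vcf)}",
    ]


if __name__ == "__main__":
    for command in build_vcf_pipeline(
        "ref.fa", "sample_1.fq.gz", "sample_2.fq.gz", "sample", "out"
    ):
        print(command)
```

Returning the commands instead of executing them lets the order of the steps be inspected, or written to a script, before any tool is run.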
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2023109707163 | 2023-08-03 | ||
CN202310970716.3A CN116705155A (en) | 2023-08-03 | 2023-08-03 | Definition method of whole-gene DNA data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118430645A (en) | 2024-08-02 |
Family
ID=87837808
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310970716.3A Pending CN116705155A (en) | 2023-08-03 | 2023-08-03 | Definition method of whole-gene DNA data |
CN202410880242.8A Pending CN118430645A (en) | 2023-08-03 | 2024-07-02 | Full-gene DNA data redefinition method |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310970716.3A Pending CN116705155A (en) | 2023-08-03 | 2023-08-03 | Definition method of whole-gene DNA data |
Country Status (1)
Country | Link |
---|---|
CN (2) | CN116705155A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105008599A (en) * | 2013-02-07 | 2015-10-28 | 中国种子集团有限公司 | Rice whole genome breeding chip and application thereof |
US20190311785A1 (en) * | 2013-03-15 | 2019-10-10 | The Scripps Research Institute | Systems and methods for genomic annotation and distributed variant interpretation |
CN112349350A (en) * | 2020-11-09 | 2021-02-09 | 山西大学 | Method for strain identification based on Dunaliella core genome sequence |
CN112542215A (en) * | 2020-12-21 | 2021-03-23 | 成都基因坊科技有限公司 | Gene annotation file format and analysis tool for same |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104450898B (en) * | 2014-11-26 | 2017-03-29 | 中华人民共和国常州出入境检验检疫局 | A kind of species discrimination method of Euproctis insecticide |
CN112927755B (en) * | 2021-02-09 | 2022-03-25 | 北京博奥医学检验所有限公司 | Method and system for identifying cfDNA (cfDNA) variation source |
IL310649A (en) * | 2021-08-05 | 2024-04-01 | Grail Llc | Somatic variant cooccurrence with abnormally methylated fragments |
CN115838808A (en) * | 2022-07-29 | 2023-03-24 | 江苏省家禽科学研究所科技创新有限公司 | Molecular marker for identifying Wenshang Luhua chicken variety and application thereof |
CN116426647A (en) * | 2023-03-10 | 2023-07-14 | 江苏省家禽科学研究所 | Molecular marker combination for identifying Tianjin monkey chicken variety and application thereof |
Non-Patent Citations (1)
Title |
---|
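Wang Qianwen (王倩文): "Mining and identification of genes related to genic male sterility in Chinese cabbage" (大白菜核雄性不育相关基因挖掘及鉴定), China Master's Theses Full-text Database (《中国优秀硕士学位论文全文数据库》), 1 May 2023 (2023-05-01) * |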
Also Published As
Publication number | Publication date |
---|---|
CN116705155A (en) | 2023-09-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Minnoye et al. | Chromatin accessibility profiling methods | |
Mathelier et al. | Identification of altered cis-regulatory elements in human disease | |
US20220101944A1 (en) | Methods for detecting copy-number variations in next-generation sequencing | |
Yao et al. | A comparison of experimental assays and analytical methods for genome-wide identification of active enhancers | |
CN116042833A (en) | Alignment and variant sequencing analysis pipeline | |
Weighill et al. | Data integration in poplar:‘omics layers and integration strategies | |
Pool et al. | Recovery of missing single-cell RNA-sequencing data with optimized transcriptomic references | |
Keel et al. | Recent developments and future directions in meta-analysis of differential gene expression in livestock RNA-Seq | |
Yang et al. | SoyMD: a platform combining multi-omics data with various tools for soybean research and breeding | |
CN109524060B (en) | Genetic disease risk prompting gene sequencing data processing system and processing method | |
Lian et al. | inGAP-family: accurate detection of meiotic recombination loci and causal mutations by filtering out artificial variants due to genome complexities | |
Sezerman et al. | Bioinformatics workflows for genomic variant discovery, interpretation and prioritization | |
CN117612600A (en) | Analysis method, storage medium and equipment of full-length transcriptome sequencing data based on PacBio sequencing | |
Pool et al. | Enhanced recovery of single-cell RNA-sequencing reads for missing gene expression data | |
WO2019132010A1 (en) | Method, apparatus and program for estimating base type in base sequence | |
CN118430645A (en) | Full-gene DNA data redefinition method | |
Mishra et al. | Genome assembly and annotation | |
CN111028885B (en) | Method and device for detecting yak RNA editing site | |
D’Agaro | New advances in NGS technologies | |
JP2008161056A (en) | Dna sequence analyzer and method and program for analyzing dna sequence | |
Sudigyo et al. | Bioinformatics pathway analysis pipeline for NGS transcriptome profile data on nasopharyngeal carcinoma | |
CN105787294B (en) | Determine method, the kit and application thereof of probe collection | |
Oh et al. | PIC-Me: paralogs and isoforms classifier based on machine-learning approaches | |
Lin et al. | Reference-based identification of long noncoding RNAs in plants with strand-specific RNA-sequencing data | |
WO2017025925A1 (en) | Method and system for filtering whole exome sequence variants |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||