CN118430645A

CN118430645A - Whole-gene DNA data redefinition method

Info

Publication number: CN118430645A
Application number: CN202410880242.8A
Authority: CN
Inventors: 夏志强; 罗璇; 田阳阳; 江思容; 赵龙; 夏成材; 李子璇; 邹枚伶
Original assignee: Sanya Nanfan Research Institute Of Hainan University; Sanya Research Institute of Hainan University
Current assignee: Sanya Nanfan Research Institute Of Hainan University; Sanya Research Institute of Hainan University
Priority date: 2023-08-03
Filing date: 2024-07-02
Publication date: 2024-08-02
Also published as: CN116705155A

Abstract

The invention relates to the technical field of genome data analysis, and in particular to a method for redefining whole-gene DNA data. The method comprises the following steps: acquiring a genome region file, a reference genome file and a genome variation site file of a species to be detected, the genome variation site file being determined by comparison against the reference genome file; extracting data information on the positions of the genes of the species to be detected from the genome region file; determining mutation sites from the genome variation site file; classifying the data information in the genome region file; taking the gene coding region as a weight, scoring the other regions in order from high to low according to the role they play in the biological processes of the species to be detected and their distance from the mutation site, with the region nearest to the mutation site scored as 10 points and the scores of the other regions decreasing in sequence; and defining the species to be detected according to the scores. The advantages are simple calculation, time savings and wide applicability.

Description

Whole-gene DNA data redefinition method

Technical Field

The invention relates to the technical field of genome data analysis, and in particular to a method for redefining whole-gene DNA data.

Background

A gene is the entire nucleotide sequence required to produce a single polypeptide chain or a functional RNA; in other words, a gene is a DNA fragment that carries genetic information.

The genome region file is an important file type. By reading it, one can learn important information about a genomic DNA sequence, such as its coding regions, non-coding regions, gene structures, protein-coding regions, promoter regions and transcription factor binding sites. This information can play a key role in the subsequent functional analysis of a species. The GFF format commonly used in analyses is one example. The genome variation site file covers all variation sites of the genome; whether or not a site is identical to the reference genome can be read from this file. The VCF file commonly used in analyses is one example.

The GFF format was defined by the Sanger Institute. It is a simple and convenient data format for describing features of DNA, RNA and protein sequences, and is now a common format for sequence annotation. GFF stands for "General Feature Format"; it is a text file format that describes genes, transcripts, exons, introns and other features of biological sequences. These features are typically used in applications such as genome annotation, gene identification, sequence alignment and gene function prediction. Besides the positions of features, a GFF file can record information such as their names, roles and references, and so describes the features of a sequence more completely.
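To make the column layout concrete, the following minimal Python sketch splits one GFF3 record into its nine standard tab-separated columns. The helper name `parse_gff_line`, the `source` value and the attribute IDs are illustrative assumptions; only the CDS coordinates are borrowed from the rice example later in this document.

```python
from typing import NamedTuple

class GffFeature(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int
    end: int
    score: str
    strand: str
    phase: str
    attributes: dict

def parse_gff_line(line: str) -> GffFeature:
    """Split one tab-separated GFF3 record into its nine standard columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("a GFF3 record has exactly nine tab-separated columns")
    # Column 9 holds semicolon-separated key=value pairs such as ID=...;Parent=...
    attrs = dict(pair.split("=", 1) for pair in cols[8].split(";") if "=" in pair)
    return GffFeature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                      cols[5], cols[6], cols[7], attrs)

# Hypothetical record, for illustration only.
example = "Chr1\texample\tCDS\t335869\t337498\t.\t+\t0\tID=cds1;Parent=mrna1"
feature = parse_gff_line(example)
print(feature.type, feature.start, feature.end, feature.attributes["ID"])
```

The third column (`type`) is the one the later classification steps rely on.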

A CDS (coding sequence) is the sequence that codes for a protein product. DNA is transcribed into mRNA, and the mRNA is translated into protein after processing such as splicing. The CDS is the DNA sequence that corresponds one-to-one with the protein sequence: it contains no sequence that does not correspond to the protein, and sequence changes that occur during mRNA processing are not considered. In short, the CDS corresponds exactly to the codons of the protein. Studying the CDS reveals the amino acid sequence and function of the protein a gene encodes, and supports research on the evolution and variation of the gene. The analysis and modification of CDS sequences is also of great value in fields such as gene editing and gene therapy.
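The codon-by-codon correspondence between a CDS and its protein can be illustrated with a short Python sketch. It assumes the standard genetic code and a CDS given on the coding strand; the function name `translate_cds` and the toy sequence are hypothetical and not part of the described method.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate_cds(cds: str) -> str:
    """Translate a CDS codon by codon, stopping at the first stop codon."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    protein = []
    for pos in range(0, len(cds), 3):
        amino_acid = CODON_TABLE[cds[pos:pos + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate_cds("ATGGCCTAA"))  # toy sequence: ATG GCC TAA
```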

A protein-protein interaction (PPI) network is a graph that describes the interaction relationships among proteins. In biology, proteins do not exist only as individual molecules; they also interact to form complex network structures. These interactions may be direct, such as physical binding between proteins, or indirect, such as participation in a common signaling pathway. Systematically analyzing the interactions among large numbers of proteins in a biological system is of great significance for understanding how proteins work in that system, the mechanisms of biological signaling and of energy and substance metabolism under special physiological states such as disease, and the functional relationships among proteins; it is also often used in screening for key genes.

At the whole-gene level, DNA data is important biological data, and studying DNA data and interpreting its meaning are major research tasks of the genomic era. DNA data has a wide range of applications, including storing arbitrary digital information, building DNA databases, determining genotypes, gene sequencing and subsequent analyses. In the prior art, however, a species cannot be resolved directly from GFF files and VCF files.

In molecular biology and genetics, the genome is important biological data, and studying the genome and interpreting its meaning are major research tasks of these fields. Because a change in a gene changes the protein it encodes, and with it the protein's structure and function, the importance of studying genes is self-evident. Identifying genes is not easy, however; it requires the computing power of computers and algorithms designed on the basis of biological knowledge. In view of this, the present invention provides a method for visualizing and redefining a genome. By scoring the most basic genomic DNA sequences, the annotation data, other genome-related analysis data, and annotation files annotated with the number of protein network nodes of each gene, known or unknown data patterns and comparative differences can be identified conveniently and intuitively.
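As a small illustration of the graph view, the Python sketch below treats proteins as nodes and interactions as undirected edges, and counts the distinct partners of each protein. The protein names and the edge list are made up for illustration.

```python
from collections import defaultdict

def count_interaction_partners(edges):
    """Return, for every protein, the number of distinct proteins it interacts with."""
    partners = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # ignore self-interactions in this sketch
        partners[a].add(b)
        partners[b].add(a)
    return {protein: len(found) for protein, found in partners.items()}

# Hypothetical interactions; ("P2", "P1") duplicates ("P1", "P2") and is counted once.
edges = [("P1", "P2"), ("P1", "P3"), ("P2", "P3"), ("P3", "P4"), ("P2", "P1")]
degree = count_interaction_partners(edges)
print(degree)
```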

Disclosure of Invention

The invention provides a whole-gene DNA data redefinition method for solving the problems.

The invention aims to provide a redefinition method of whole-gene DNA data, which specifically comprises the following steps:

S1, acquiring files: acquiring a genome region file, a reference genome file and a genome variation site file of a species to be detected; the genome variation site file is determined by comparison against the reference genome file;

S2, data information processing: extracting data information on the positions of the genes of the species to be detected from the genome region file; determining mutation sites from the genome variation site file;

S3, classifying data information: classifying the data information in the genome region file;

S4, scoring and sorting: taking the gene coding region as a weight, scoring the mRNA, gene, exon, UTR, QTL and/or methylation regions in order from high to low according to the role each plays in the biological processes of the species to be detected and its distance from the mutation site; the region whose position is nearest to the mutation site is scored as 10 points, and the regions within 1000 bp upstream and downstream of the mutation site are scored in decreasing order; and defining the species to be detected according to the scores.
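The distance criterion in step S4 can be sketched in Python as follows. This is only an illustration under stated assumptions: distance is taken as 0 when the mutation site lies inside a region and otherwise as the gap to the nearer boundary, and the mutation position and every region other than the CDS are hypothetical.

```python
def distance_to_region(pos, start, end):
    """Distance in bp from a mutation site to a region (0 if the site is inside it)."""
    if start <= pos <= end:
        return 0
    return start - pos if pos < start else pos - end

def regions_near_site(pos, regions, window=1000):
    """Return (distance, name) pairs for regions within `window` bp, nearest first."""
    hits = []
    for name, start, end in regions:
        distance = distance_to_region(pos, start, end)
        if distance <= window:
            hits.append((distance, name))
    return sorted(hits)

# Hypothetical regions on one chromosome: (name, start, end).
regions = [
    ("CDS", 335869, 337498),
    ("five_prime_UTR", 335000, 335868),
    ("distant_gene", 340000, 341000),
]
near = regions_near_site(336000, regions)
print(near)
```

The region at distance 0 would be the one scored as 10 points; regions farther than 1000 bp are left out of the ranking.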

Preferably, the genome region file in step S1 includes a genome structure annotation file and a protein annotation file.

Preferably, the classification in step S3 specifically includes the following steps:

S31, extracting from the genome region file the entries for the gene coding region (CDS), and the entries for mRNA, gene, exon, UTR and/or functional regions, and classifying them with the awk command in Linux;

S32, selecting all the classified data, and screening the data by region type;

S33, importing the genome region file and the reference genome file into the TBtools software, outputting the files by type, and completing the classification;

S34, constructing a protein interaction network: pairing the proteins in the protein annotation file with homologous Arabidopsis proteins, constructing a protein interaction network from an existing protein interaction database after pairing, and calculating the number of nodes connected to each protein.
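One possible reading of step S34 is sketched below in Python: each protein of the species is paired with an Arabidopsis homolog, known Arabidopsis interactions are projected back onto the species' proteins, and the connected nodes are counted. The projection rule, the function name and all identifiers are assumptions made for illustration; the text does not specify how the pairing or the database lookup is implemented.

```python
from collections import defaultdict

def node_counts_from_homologs(homolog_of, arabidopsis_edges):
    """Count interaction partners per protein via its paired Arabidopsis homolog."""
    # Invert the pairing: Arabidopsis ID -> proteins of the species paired with it.
    by_homolog = defaultdict(set)
    for protein, ath_id in homolog_of.items():
        by_homolog[ath_id].add(protein)
    partners = defaultdict(set)
    for a, b in arabidopsis_edges:
        for pa in by_homolog.get(a, ()):
            for pb in by_homolog.get(b, ()):
                if pa != pb:
                    partners[pa].add(pb)
                    partners[pb].add(pa)
    # Proteins without any projected interaction get a node count of 0.
    return {protein: len(partners[protein]) for protein in homolog_of}

# Hypothetical pairing and interaction list (placeholder identifiers).
homolog_of = {"Os_A": "AT1", "Os_B": "AT2", "Os_C": "AT3", "Os_D": "AT9"}
arabidopsis_edges = [("AT1", "AT2"), ("AT1", "AT3"), ("AT2", "AT7")]
node_counts = node_counts_from_homologs(homolog_of, arabidopsis_edges)
print(node_counts)
```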

Preferably, step S4 further includes: scoring in order from high to low, according to the role played in the biological processes of the species to be detected and the distance from the mutation site, to obtain a region base score; and calculating a node score from the number of nodes of each gene in the protein interaction network, with each node counted as 1 point;

The scoring formula is:

Node score = [formula not reproduced in the source text];

The region base score and the node score are added to obtain the score.
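The way the two parts combine can be sketched in Python. Because the node score formula itself is not reproduced in this text, the sketch assumes the simplest reading of "each node counted as 1 point", namely node score = node count × 1; the actual formula may differ. The function names and the example values are hypothetical.

```python
POINTS_PER_NODE = 1  # assumption: one point per connected node

def node_score(node_count: int) -> int:
    """Assumed node score: the number of connected nodes times the points per node."""
    return node_count * POINTS_PER_NODE

def total_score(region_base_score: int, node_count: int) -> int:
    """Total score = region base score + node score."""
    return region_base_score + node_score(node_count)

# Hypothetical example: a region scored 10 whose gene has 3 connected nodes.
print(total_score(10, 3))
```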

Preferably, the awk command is:

[awk command not reproduced in the source text];

where X is CDS, mRNA, gene, exon, UTR, QTL or a methylation region.
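For readers who prefer Python to awk, the following sketch performs the same kind of selection as the awk command given in Example 1: it keeps the records whose third column equals X. The function name and the sample records are hypothetical.

```python
def filter_gff_by_type(lines, feature_type):
    """Yield GFF records whose third (type) column equals feature_type."""
    for line in lines:
        if line.startswith("#"):
            continue  # skip header and comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 3 and cols[2] == feature_type:
            yield line.rstrip("\n")

# Hypothetical records, for illustration only.
gff_lines = [
    "##gff-version 3",
    "Chr1\texample\tgene\t335000\t338000\t.\t+\t.\tID=gene1",
    "Chr1\texample\tmRNA\t335000\t338000\t.\t+\t.\tID=mrna1;Parent=gene1",
    "Chr1\texample\tCDS\t335869\t337498\t.\t+\t0\tID=cds1;Parent=mrna1",
]
cds_records = list(filter_gff_by_type(gff_lines, "CDS"))
print(cds_records)
```

Passing "mRNA", "gene" or "exon" instead of "CDS" selects the other categories.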

Preferably, the reference genome file is a FASTA sequence file; the genome variation site file is a VCF file.

Preferably, the method for acquiring the VCF file in step S1 specifically includes:

removing sequencing adapters from the raw sequencer output with the fastp data quality-control software to obtain sequencing data; aligning the sequencing data to the reference genome with the bwa sequence alignment software, and sorting the aligned data by the preset genomic position information with the samtools software; filtering out PCR duplicates from the sorted data with the picard toolkit for high-throughput sequencing data formats; and performing genome variation analysis on the filtered sequencing data with GATK, finally obtaining the VCF file.
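The tool chain above can be written down as an ordered list of shell commands. The Python sketch below only assembles the command strings and does not run them. The text names the tools but not the subcommands or options, so the choices here (`bwa mem`, `samtools sort`, Picard `MarkDuplicates`, GATK `HaplotypeCaller`), the read-group string, the `picard` wrapper name and all file names are assumptions; an indexed reference (for example via `bwa index`, `samtools faidx` and a sequence dictionary) is also presumed to exist.

```python
def build_variant_calling_commands(sample, ref, r1, r2):
    """Return the shell commands for one paired-end sample, in execution order."""
    clean1, clean2 = f"{sample}.clean_R1.fq.gz", f"{sample}.clean_R2.fq.gz"
    # GATK expects read-group information; bwa mem can add it with -R.
    read_group = rf"'@RG\tID:{sample}\tSM:{sample}\tPL:ILLUMINA'"
    return [
        # 1. Adapter removal and quality control.
        f"fastp -i {r1} -I {r2} -o {clean1} -O {clean2}",
        # 2. Alignment to the reference genome.
        f"bwa mem -R {read_group} {ref} {clean1} {clean2} > {sample}.sam",
        # 3. Sorting by genomic position.
        f"samtools sort -o {sample}.sorted.bam {sample}.sam",
        # 4. Filtering PCR duplicates.
        f"picard MarkDuplicates I={sample}.sorted.bam O={sample}.dedup.bam "
        f"M={sample}.dup_metrics.txt REMOVE_DUPLICATES=true",
        f"samtools index {sample}.dedup.bam",
        # 5. Variant calling, producing the VCF file.
        f"gatk HaplotypeCaller -R {ref} -I {sample}.dedup.bam -O {sample}.vcf.gz",
    ]

commands = build_variant_calling_commands(
    "sample1", "reference.fa", "sample1_R1.fq.gz", "sample1_R2.fq.gz")
for command in commands:
    print(command)
```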

Preferably, the genome structure annotation file is the GFF-format file of a given gene to which an annotation of the node count has been added.

Preferably, the GFF-format file is obtained by annotation using the reference genome file and sequencing data.

Preferably, the species to be detected is rice or soybean.

Compared with the prior art, the invention has the following beneficial effects:

(1) Biological significance is assigned to GFF files, whose format by itself carries no biological meaning;

(2) Only the GFF files and VCF files of the species to be detected are processed, scored and sorted; no excessive computation is needed, which saves time and simplifies subsequent analysis;

(3) The files can be classified and sorted using only the genome region file, the reference genome file and the genome variation site file of the species to be analyzed, and the method can be applied in many areas, such as genome-wide association studies (GWAS), genomic selection (GS) breeding, QTL mapping, species evolution, large-population screening, new species identification standards and radiation mutagenesis screening;

(4) The present invention provides a method for visualizing and redefining a genome. By scoring the most basic genomic DNA sequences, the annotation data, other genome-related analysis data, and annotation files annotated with the number of protein network nodes of each gene, known or unknown data patterns and comparative differences can be identified conveniently and intuitively.

Drawings

Fig. 1 is a flowchart of a whole-gene DNA data redefinition method provided according to an embodiment of the present invention.

Detailed Description

Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings. In the following description, like modules are denoted by like reference numerals. In the case of the same reference numerals, their names and functions are also the same. Therefore, a detailed description thereof will not be repeated.

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not to be construed as limiting the invention.

Example 1

In this embodiment, taking rice as an example, a method for redefining whole-gene DNA data is provided, which specifically includes the following steps:

S1, acquiring files: obtaining the genome region (GFF) file, the reference genome file and the genome variation site (VCF) file for chromosome 1 from the data of 533 rice samples; the genome variation site (VCF) file is determined by comparison against the reference genome file;

The genome region (GFF) file includes a genome structure annotation file and a protein annotation file; the GFF file of the gene whose start position is 261504 on chromosome 1 in the 533-sample rice data is combined with an annotation file that records, for each gene, the number of protein network nodes derived from the gene's protein interaction network, to obtain the genome structure annotation file and the protein annotation file;

The GFF file is obtained in one of the following ways:

looking up the Latin name of rice and using it to search for and download the file from a database such as NCBI, Ensembl, UCSC or GENCODE;

or finding and downloading the corresponding GFF file from the published literature;

or by annotation using the reference genome file and sequencing data;

The reference genome file is obtained in one of the following ways:

downloading it from a database such as NCBI, Ensembl, UCSC or GENCODE: look up the Latin name of the species, open the database website, and search by the Latin name to obtain the reference genome file of the species;

or finding and downloading the corresponding reference genome file from the published literature;

for a species without a reference genome, sequencing data can be obtained by sequencing and assembled into a genome to obtain a reference genome file;

A VCF file is a commonly used bioinformatics file format for storing genomic or transcriptomic variant information; it is obtained as follows:

removing sequencing adapters from the raw sequencer output of the male parent, the female parent and the individual hybrid progeny (F1) plants with the fastp data quality-control software, to obtain sequencing data for the parents and the individual hybrid progeny;

aligning the sequencing data of the parents and the individual hybrid progeny to the reference genome with the bwa sequence alignment software, and sorting the aligned data of the parents and of the individual hybrid progeny by the preset genomic position information with the samtools software;

filtering out PCR duplicates from the sorted data of the parents and the individual hybrid progeny with the picard toolkit for high-throughput sequencing data formats;

performing genome variation analysis on the filtered sequencing data of the parents and the individual hybrid progeny with GATK, finally obtaining the VCF file.

S2, data information processing: extracting from the genome region (GFF) file the data information of each region, such as the gene coding region, mRNA, gene, exon and five_prime_UTR; determining mutation sites from the VCF file;

after determining the mutation site, extracting genome position coordinate information of the mutation site;

Specifically, the data information includes specific information of position coordinates of the gene structure and genomic position coordinate information of the mutation site.
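Extracting the genomic coordinates of mutation sites from a VCF file can be sketched in Python as follows. The sketch reads only the first five fixed VCF columns (CHROM, POS, ID, REF, ALT); the function name and the sample records are hypothetical.

```python
def read_variant_sites(vcf_lines):
    """Collect (chromosome, position, REF, ALT) from the data lines of a VCF."""
    sites = []
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta-information and the header row
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        sites.append((chrom, int(pos), ref, alt))
    return sites

# Hypothetical VCF content, for illustration only.
vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "Chr1\t336000\t.\tA\tG\t50\tPASS\t.",
    "Chr1\t337100\t.\tC\tCT\t60\tPASS\t.",
]
sites = read_variant_sites(vcf_lines)
print(sites)
```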

S3, classifying data information: the data information extracted in the step S2 is classified, and the specific method comprises the following steps:

S31, extracting from the GFF file only the entries for the gene coding region (CDS region), and classifying them with the awk command in Linux, for example: awk 'BEGIN { FS=OFS="\t" } $3=="CDS" { print $0 }' wuzhong.gff3 > cds.txt; to extract another category, replace CDS inside the double quotation marks;

S32, opening the file to be processed in Excel, selecting all the data, filtering on the third column (the GFF file format is fixed, and the third column is Type), and classifying the data according to the information in that column;

S33, importing the GFF file and the FASTA sequence file into the existing software TBtools, outputting the files by type, and completing the classification;

Specifically, open the TBtools software and click Sequence Toolkit - GFF3/GTF Manipulate - GXF Sequences Extract; import the GFF file and the FASTA sequence file, click Initialize, select the category in the Feature Tag drop-down list, choose the output location, set the output file name, and click Start;

the results are shown in Table 1 below;

TABLE 1 data information for each region of chromosome I in Rice data
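The per-type classification summarized in Table 1 can also be reproduced programmatically. The Python sketch below groups GFF records by their third column and keeps the start and end coordinates of each; apart from the CDS coordinates, the sample records are hypothetical.

```python
from collections import defaultdict

def classify_features(gff_lines):
    """Group (start, end) coordinates by the feature type in column 3."""
    groups = defaultdict(list)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.rstrip("\n").split("\t")
        groups[cols[2]].append((int(cols[3]), int(cols[4])))
    return dict(groups)

# Hypothetical records, for illustration only.
gff_records = [
    "##gff-version 3",
    "Chr1\texample\tgene\t335000\t338000\t.\t+\t.\tID=gene1",
    "Chr1\texample\tmRNA\t335000\t338000\t.\t+\t.\tID=mrna1;Parent=gene1",
    "Chr1\texample\tfive_prime_UTR\t335000\t335868\t.\t+\t.\tParent=mrna1",
    "Chr1\texample\tCDS\t335869\t337498\t.\t+\t0\tParent=mrna1",
]
by_type = classify_features(gff_records)
for feature_type, intervals in by_type.items():
    print(feature_type, intervals)
```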

Taking a CDS region with a start position 335869 and an end position 337498 on chromosome 1 of rice as an example, vcf files are shown in the following table:

TABLE 2 Rice vcf File

S34, constructing a protein interaction network: pairing the proteins in the protein annotation file of the species to be detected with homologous Arabidopsis proteins, constructing a protein interaction network from an existing protein interaction database after pairing, and calculating the number of nodes connected to each protein;

S4, scoring and sorting: taking the gene coding region as a weight, scoring the mRNA, gene, exon, UTR, QTL and/or methylation regions in order from high to low according to the role each plays in the biological processes of the species to be detected and its distance from the mutation site; the region whose position is nearest to the mutation site is scored as 10 points, and the regions within 1000 bp upstream and downstream of the mutation site are scored in decreasing order; the scoring results are shown in Table 3;

TABLE 3 scoring results for rice
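The proximity rule of step S4 (10 points for the nearest region, decreasing scores for the others) can be sketched in Python. The text does not state the size of the decrement, so the sketch assumes a step of 1 point per distance rank, equal scores for equal distances and a floor of 0; the distances used are hypothetical and are not the values of Table 3.

```python
def score_by_proximity(distances, top_score=10, step=1):
    """Score regions by distance rank: the nearest gets top_score, then decreasing."""
    ranked = sorted(set(distances.values()))
    rank_of = {distance: rank for rank, distance in enumerate(ranked)}
    return {
        name: max(top_score - step * rank_of[distance], 0)
        for name, distance in distances.items()
    }

# Hypothetical distances (bp) from one mutation site to each region.
distances = {"CDS": 0, "exon": 0, "five_prime_UTR": 132, "mRNA": 640}
region_scores = score_by_proximity(distances)
print(region_scores)
```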

Scoring is performed in order from high to low, according to the role played in the biological processes of the species to be detected and the distance from the mutation site, to obtain a region base score; a node score is calculated from the number of nodes of each gene in the protein interaction network, with each node counted as 1 point; the scoring formula is:

Node score = [formula not reproduced in the source text];

Nodes in the protein interaction network represent proteins, and edges represent interaction relations between proteins;

The region base score and the node score are added to obtain the score; the species to be detected is defined according to the score; finally, the DNA data of rice is redefined according to the scoring results, and high-value sites are estimated.
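Once every site has a total score, estimating high-value sites amounts to ranking. The Python sketch below sorts sites by descending score and breaks ties by genomic coordinate; the tie-breaking rule, the cut-off `top_n` and the scores are assumptions made for illustration.

```python
def rank_sites(site_scores, top_n=3):
    """Return the top_n (site, score) pairs, highest score first."""
    ordered = sorted(site_scores.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:top_n]

# Hypothetical total scores keyed by (chromosome, position).
site_scores = {
    ("Chr1", 336000): 13,
    ("Chr1", 337100): 11,
    ("Chr1", 261504): 13,
    ("Chr1", 340500): 4,
}
best = rank_sites(site_scores, top_n=2)
print(best)
```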

Example 2

Genes whose start positions on chromosome 1 in the soybean data are 319784 and 353520 are selected (the procedure is the same as in Example 1); the data information of each region of chromosome 1 (Gm01) in the soybean data is shown in Table 4;

TABLE 4 data information for each region of chromosome I in soybean data

Scoring results are shown in table 5;

Table 5 scoring results for soybean

The species to be detected is defined according to the score; finally, the DNA data of soybean is redefined according to the scoring results, and high-value sites are estimated.

The method of the invention is a new way of defining conventional single nucleotide polymorphisms. A single nucleotide polymorphism (SNP) is a DNA sequence polymorphism caused by variation of a single nucleotide at the genome level. Some SNPs located inside genes are likely to affect protein structure or expression level directly, so studying mutation sites by combining SNPs with the various regions of the GFF file is highly representative.
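Whether a VCF record is a SNP in this sense can be checked from its REF and ALT fields. The Python sketch below treats a record as a SNP when the reference allele and every alternate allele are single nucleotides; the function name and the example alleles are hypothetical.

```python
def is_snp(ref: str, alt: str) -> bool:
    """True if REF and every comma-separated ALT allele are single bases."""
    bases = set("ACGT")
    alleles = alt.split(",")
    return (
        len(ref) == 1
        and ref.upper() in bases
        and all(len(a) == 1 and a.upper() in bases for a in alleles)
    )

print(is_snp("A", "G"))    # single-base substitution
print(is_snp("C", "CT"))   # insertion, not a SNP
```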

It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.

The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims

1. A redefinition method of whole-gene DNA data is characterized by comprising the following steps:

2. The method for redefining whole-gene DNA data according to claim 1, wherein: the genomic region file in the step S1 includes a genomic structure annotation file and a protein annotation file.

3. The method for redefining whole-gene DNA data according to claim 2, wherein: the classification in the step S3 specifically includes the following steps:

4. The method for redefining whole-gene DNA data according to claim 3, wherein: the step S4 further includes: scoring in order from high to low, according to the role played in the biological processes of the species to be detected and the distance from the mutation site, to obtain a region base score; and calculating a node score from the number of nodes of each gene in the protein interaction network, with each node counted as 1 point; the scoring formula is:

Node score = [formula not reproduced in the source text];

The region base score and the node score are added to obtain the score.

5. The method for redefining whole-gene DNA data according to claim 4, wherein: the awk command is:

[awk command not reproduced in the source text];

where X is CDS, mRNA, gene, exon, UTR, QTL or a methylation region.

6. The method for redefining whole-gene DNA data according to claim 5, wherein: the reference genome file is a FASTA sequence file; the genomic variation site file is a vcf file.

7. The method for redefining whole-gene DNA data according to claim 6, wherein: the method for acquiring the vcf file in the step S1 specifically includes:

8. The method for redefining whole-gene DNA data according to claim 7, wherein: the genome structure annotation file is the GFF-format file of a given gene to which an annotation of the node count has been added.

9. The method for redefining whole-gene DNA data according to claim 8, wherein: the GFF format file is obtained by annotating the reference genome file and sequencing data.

10. The method for redefining whole-gene DNA data according to claim 9, wherein: the species to be detected is rice or soybean.