CN118430645A - Full-gene DNA data redefinition method - Google Patents

Full-gene DNA data redefinition method

Info

Publication number
CN118430645A
Authority
CN
China
Prior art keywords
file
genome
gene
data
region
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410880242.8A
Other languages
Chinese (zh)
Inventor
夏志强
罗璇
田阳阳
江思容
赵龙
夏成材
李子璇
邹枚伶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sanya Nanfan Research Institute Of Hainan University
Sanya Research Institute of Hainan University
Original Assignee
Sanya Nanfan Research Institute Of Hainan University
Sanya Research Institute of Hainan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sanya Nanfan Research Institute Of Hainan University, Sanya Research Institute of Hainan University

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6888 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms
    • C12Q1/6895 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms for plants, fungi or algae
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00 Oligonucleotides characterized by their use
    • C12Q2600/156 Polymorphic or mutational markers
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Chemical & Material Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Organic Chemistry (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Genetics & Genomics (AREA)
  • Software Systems (AREA)
  • Immunology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Databases & Information Systems (AREA)
  • Botany (AREA)
  • Mycology (AREA)
  • Epidemiology (AREA)
  • Microbiology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioethics (AREA)
  • Artificial Intelligence (AREA)
  • Biochemistry (AREA)
  • General Engineering & Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention relates to the technical field of genome data analysis, and in particular to a method for redefining whole-gene DNA data. The method comprises the following steps: acquiring a genome region file, a reference genome file and a genome variation site file of a species to be tested, the genome variation site file being determined by comparison against the reference genome file; extracting the position information of the genes of the species to be tested from the genome region file; determining mutation sites from the genome variation site file; classifying the data information in the genome region file; taking the gene coding region as the weight, scoring the other regions from high to low according to the role they play in the biological processes of the species to be tested and their distance from the mutation site, with the region whose position is nearest to the mutation site scored as 10 points and the scores of the other regions decreasing in turn; and defining the species to be tested according to the scores. The advantages are simple calculation, time savings and wide applicability.

Description

Full-gene DNA data redefinition method
Technical Field
The invention relates to the technical field of genome data analysis, in particular to a redefinition method of whole-gene DNA data.
Background
A gene is the entire nucleotide sequence required to produce a single polypeptide chain or functional RNA; a DNA fragment that carries genetic information is called a gene.
The genome region file is an important file format. Reading it reveals key information about a genomic DNA sequence, such as coding regions, non-coding regions, gene structures, protein-coding regions, promoter regions and transcription factor binding sites. This information can play a key role in the subsequent functional analysis of a species. A common example is the GFF format. The genome variation site file covers the variation sites of the genome; whether or not a site is identical to the reference genome can be recorded in this file. A common example is the VCF file.
The GFF format was defined by the Sanger Institute. It is a simple and convenient data format for describing features of DNA, RNA and protein sequences, and is now a common format for sequence annotation. GFF stands for "General Feature Format", a text file format that describes genes, transcripts, exons, introns and other sequence features in biological sequences. These features are typically used in applications such as genome annotation, gene identification, sequence alignment and gene function prediction. Besides the position of each feature, a GFF file can record information such as the feature's name, role and references, giving a fuller description of all the features in a sequence.
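As an illustration of the format described above, the following Python sketch parses one tab-separated GFF3 record into its nine standard columns. It is a minimal sketch rather than part of the patented method: the names `GffFeature` and `parse_gff3_line`, the source label and the attribute values in the sample record are hypothetical; only the CDS coordinates 335869 and 337498 are taken from Example 1.

```python
from typing import Dict, NamedTuple


class GffFeature(NamedTuple):
    """The nine standard GFF3 columns (hypothetical helper type)."""
    seqid: str
    source: str
    type: str
    start: int
    end: int
    score: str
    strand: str
    phase: str
    attributes: Dict[str, str]


def parse_gff3_line(line: str) -> GffFeature:
    """Split one GFF3 record on tabs and decode the key=value attributes."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    attributes: Dict[str, str] = {}
    for item in cols[8].split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attributes[key.strip()] = value.strip()
    return GffFeature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                      cols[5], cols[6], cols[7], attributes)


# Illustrative record; IDs and source label are made up for the example.
example = "Chr1\texample\tCDS\t335869\t337498\t.\t+\t0\tID=cds1;Parent=mRNA1"
feature = parse_gff3_line(example)
print(feature.type, feature.start, feature.end, feature.attributes["Parent"])
```

The third column (`type`) is the one the classification step later filters on.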
CDS (coding sequence) is the sequence that codes for a protein product. DNA is transcribed into mRNA, and the mRNA is translated into protein after processing such as splicing. The CDS is the DNA sequence that corresponds one-to-one with the protein sequence: it contains no sequence that does not correspond to the protein, and it disregards sequence changes introduced during mRNA processing. In short, the CDS corresponds exactly to the codons of the protein. Studying the CDS reveals the amino acid sequence and function of the protein a gene encodes, and supports research on the evolution and variation of the gene. Analysis and modification of CDS sequences are also of great value in fields such as gene editing and gene therapy.
A protein-protein interaction (PPI) network is a graphical representation of the interaction relationships between proteins. In biology, proteins exist not only as individual molecules; they also interact to form complex network structures. These interactions may be direct, such as physical binding between proteins, or indirect, such as participation in a common signaling pathway. Systematic analysis of the interactions among the many proteins in a biological system is of great significance for understanding how proteins work in that system, understanding the reaction mechanisms of biological signaling and energy metabolism under special physiological states such as disease, and understanding the functional relationships among proteins; it is often used in screening for key genes.
At the whole-gene level, DNA data is important biological data, and studying DNA data and interpreting its meaning are major research tasks of the genomic era. DNA data has wide application: storing arbitrary digital information, building DNA databases, determining genotypes, gene sequencing, downstream analysis and so on. However, in the prior art a species cannot be resolved directly from its GFF and VCF files.
In molecular biology and genetics, the genome is important biological data, and studying the genome and interpreting its meaning are main research tasks of these fields. Because a change in a gene changes the protein, and thereby the protein's structure and function, the importance of studying genes is self-evident. Identifying genes is not easy, however; it requires the computing power of a computer and algorithms designed on the basis of biological knowledge. In view of this, the present invention provides a method of visualizing and redefining a genome. By scoring the most basic genomic DNA sequences, annotation data and other genome-related analysis data, together with annotation files recording the number of protein network nodes of each gene, known or unknown data patterns and comparative differences can be identified conveniently and intuitively.
Disclosure of Invention
To solve the above problems, the invention provides a method for redefining whole-gene DNA data.
The object of the invention is to provide a method for redefining whole-gene DNA data, which specifically comprises the following steps:
S1, acquiring files: acquiring a genome region file, a reference genome file and a genome variation site file of a species to be tested; the genome variation site file is determined by comparison against the reference genome file;
S2, processing data information: extracting the position information of the genes of the species to be tested from the genome region file; determining mutation sites from the genome variation site file;
S3, classifying data information: classifying the data information in the genome region file;
S4, scoring and sorting: taking the gene coding region as the weight, scoring the mRNA, gene, exon, UTR, QTL and/or methylation regions from high to low according to their role in the biological processes of the species to be tested and their distance from the mutation site; the region whose position is nearest to the mutation site is scored as 10 points, and the regions within 1000 bp upstream and downstream of the mutation site are scored in turn with decreasing scores; and defining the species to be tested according to the scores.
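The scoring rule in step S4 can be read in more than one way. The Python sketch below shows one possible interpretation, under stated assumptions: regions are ranked by their distance to the mutation site, ties are broken by the order in which the region types are listed in S4 (CDS first), the nearest region receives 10 points, each following region receives one point less, and regions farther than 1000 bp from the site are ignored. The one-point decrement, the tie-break rule, the function names and all coordinates except the CDS interval from Example 1 are assumptions made for illustration.

```python
from typing import List, Tuple

# Assumed tie-break order, taken from the order of region types listed in S4.
TYPE_PRIORITY = ["CDS", "mRNA", "gene", "exon", "UTR", "QTL", "methylation"]

Region = Tuple[str, int, int]  # (type, start, end), 1-based inclusive


def distance_to_site(start: int, end: int, pos: int) -> int:
    """Distance in bp from a site to a region; 0 if the site lies inside it."""
    if start <= pos <= end:
        return 0
    return start - pos if pos < start else pos - end


def score_regions(regions: List[Region], pos: int,
                  window: int = 1000, top_score: int = 10):
    """Rank regions within `window` bp of `pos`; the nearest gets `top_score`."""
    nearby = []
    for region_type, start, end in regions:
        dist = distance_to_site(start, end, pos)
        if dist <= window:
            nearby.append((dist, region_type, start, end))

    def sort_key(item):
        dist, region_type = item[0], item[1]
        if region_type in TYPE_PRIORITY:
            rank = TYPE_PRIORITY.index(region_type)
        else:
            rank = len(TYPE_PRIORITY)
        return (dist, rank)

    nearby.sort(key=sort_key)
    # Scores decrease by one per rank and never drop below zero.
    return [(region_type, start, end, max(top_score - i, 0))
            for i, (_, region_type, start, end) in enumerate(nearby)]


regions = [
    ("gene", 335000, 338000),   # hypothetical coordinates
    ("CDS", 335869, 337498),    # CDS interval from Example 1
    ("exon", 336900, 337100),   # hypothetical coordinates
    ("mRNA", 340000, 341000),   # hypothetical, outside the 1000 bp window
]
scored = score_regions(regions, pos=336000)
for row in scored:
    print(row)
```

With the hypothetical mutation site at position 336000, the CDS and gene both contain the site, the exon lies 900 bp away, and the mRNA is excluded because it is more than 1000 bp away.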
Preferably, the genome region file in step S1 includes a genome structure annotation file and a protein annotation file.
Preferably, the classification in step S3 specifically includes the following steps:
S31, extracting the records for the gene coding region (CDS) from the genome region file, extracting the records for mRNA, gene, exon, UTR and/or function, and classifying them with the awk command in Linux;
S32, selecting all the classified data and filtering it by region type;
S33, importing the genome region file and the reference genome file into the TBtools software and outputting the file for each type to complete the classification;
S34, constructing a protein interaction network: pairing the proteins in the protein annotation file with homologous Arabidopsis proteins, constructing a protein interaction network from an existing protein interaction database after pairing, and calculating the number of nodes connected to each protein.
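Step S34 amounts to computing, for each protein, how many distinct partners it has in the interaction network of its Arabidopsis homolog. The Python sketch below illustrates this under simple assumptions: the interaction database is available as a list of protein pairs, and the homolog pairing is available as a dictionary. All identifiers (`P1`, `AtA` and so on) and both function names are hypothetical placeholders, not real database entries.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def count_connected_nodes(edges: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Number of distinct interaction partners for every protein in the network."""
    neighbours = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # ignore self-interactions in this sketch
        neighbours[a].add(b)
        neighbours[b].add(a)
    return {protein: len(partners) for protein, partners in neighbours.items()}


def node_counts_for_species(homolog_map: Dict[str, str],
                            arabidopsis_edges: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Transfer node counts from Arabidopsis homologs to the species' proteins."""
    degree = count_connected_nodes(arabidopsis_edges)
    return {protein: degree.get(homolog, 0)
            for protein, homolog in homolog_map.items()}


# Hypothetical interaction pairs; the duplicate pair is counted only once.
edges = [("AtA", "AtB"), ("AtA", "AtC"), ("AtB", "AtA")]
# Hypothetical pairing of the species' proteins with Arabidopsis homologs.
homolog_map = {"P1": "AtA", "P2": "AtC", "P3": "AtZ"}

counts = node_counts_for_species(homolog_map, edges)
print(counts)
```

A protein whose homolog does not appear in the network (here `P3`) receives a node count of zero in this sketch; the source does not state how such cases are handled.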
Preferably, step S4 further includes: scoring the regions from high to low according to their role in the biological processes of the species to be tested and their distance from the mutation site to obtain a region base score; calculating a node score from the number of nodes of each gene in the protein interaction network, with each node counting as 1 point;
The scoring formula is:
Node score =
The region base score and the node score are added to obtain the score.
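The combination of the two scores can be sketched in Python as follows. The node-score formula itself is not reproduced in the text above, so this sketch assumes the simplest reading of "each node counting as 1 point", namely node score = number of nodes × 1. That reading, the function names and the gene labels and values in the example are assumptions for illustration only.

```python
from typing import Dict, List, Tuple

POINTS_PER_NODE = 1  # stated in the text: each node counts as 1 point


def node_score(node_count: int) -> int:
    """Assumed reading of the formula: one point per connected node."""
    if node_count < 0:
        raise ValueError("node_count must be non-negative")
    return node_count * POINTS_PER_NODE


def total_score(region_base_score: int, node_count: int) -> int:
    """Region base score plus node score, as described in the text."""
    return region_base_score + node_score(node_count)


def rank_genes(base_scores: Dict[str, int],
               node_counts: Dict[str, int]) -> List[Tuple[str, int]]:
    """Sort genes from highest to lowest total score (ties broken by name)."""
    totals = {gene: total_score(score, node_counts.get(gene, 0))
              for gene, score in base_scores.items()}
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical genes and values.
ranking = rank_genes({"geneA": 10, "geneB": 9, "geneC": 8},
                     {"geneA": 0, "geneB": 3, "geneC": 1})
print(ranking)
```

In this example `geneB` overtakes `geneA` because its three network nodes add more than the one-point difference in region base score.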
Preferably, the awk command is:
where X is CDS, mRNA, gene, exon, UTR, QTL or a methylation region.
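For readers who prefer not to use awk, the same selection by region type can be sketched in Python. This is an equivalent illustration and not the command used in the method: it keeps every record whose third tab-separated column equals the requested type X. The function name and the sample records are hypothetical.

```python
from typing import Iterable, List


def filter_gff_by_type(lines: Iterable[str], feature_type: str) -> List[str]:
    """Keep GFF records whose third column (feature type) equals feature_type."""
    selected = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment lines and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 3 and cols[2] == feature_type:
            selected.append(line.rstrip("\n"))
    return selected


# Hypothetical GFF3 content; only the CDS coordinates come from Example 1.
gff_lines = [
    "##gff-version 3",
    "Chr1\texample\tgene\t335000\t338000\t.\t+\t.\tID=gene1",
    "Chr1\texample\tmRNA\t335000\t338000\t.\t+\t.\tID=mRNA1;Parent=gene1",
    "Chr1\texample\tCDS\t335869\t337498\t.\t+\t0\tID=cds1;Parent=mRNA1",
]
cds_records = filter_gff_by_type(gff_lines, "CDS")
print(len(cds_records), "CDS record(s)")
```

To process a file on disk, the same function can be given an open file handle, since it only iterates over lines.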
Preferably, the reference genome file is a FASTA sequence file and the genome variation site file is a VCF file.
Preferably, the VCF file in step S1 is obtained as follows:
Sequencing adapters are removed from the raw sequencer output with the fastp quality-control software to obtain sequencing data; the sequencing data is aligned to the reference genome with the bwa sequence alignment software, and the aligned data is sorted by preset genomic position with samtools; PCR duplicates in the sorted data are filtered out with the Picard toolkit for high-throughput sequencing data; and variant calling is performed on the filtered data with GATK to obtain the final VCF file.
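The order of tools in this step can be sketched as a list of command lines. The Python function below only builds and prints the commands; it does not run them. It is a sketch under several assumptions: paired-end reads, the `bwa mem` aligner, Picard's legacy `KEY=value` argument style, GATK's `HaplotypeCaller` as the variant caller, and a reference that has already been indexed (`bwa index`, `samtools faidx` and a sequence dictionary). The file names, the read-group string and the function name are hypothetical, and the source does not state which subcommands or options were actually used.

```python
import shlex
from typing import List


def build_variant_calling_commands(sample: str, reference: str,
                                   fastq1: str, fastq2: str) -> List[List[str]]:
    """Return the command lines for adapter trimming, alignment, sorting,
    duplicate removal and variant calling, in the order given in the text."""
    clean1 = f"{sample}_1.clean.fq.gz"
    clean2 = f"{sample}_2.clean.fq.gz"
    # Read group required by GATK; bwa expands the literal "\t" sequences.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    return [
        ["fastp", "-i", fastq1, "-I", fastq2, "-o", clean1, "-O", clean2],
        ["bwa", "mem", "-R", read_group, "-o", f"{sample}.sam",
         reference, clean1, clean2],
        ["samtools", "sort", "-o", f"{sample}.sorted.bam", f"{sample}.sam"],
        ["picard", "MarkDuplicates", f"I={sample}.sorted.bam",
         f"O={sample}.dedup.bam", f"M={sample}.metrics.txt",
         "REMOVE_DUPLICATES=true"],
        ["samtools", "index", f"{sample}.dedup.bam"],
        ["gatk", "HaplotypeCaller", "-R", reference,
         "-I", f"{sample}.dedup.bam", "-O", f"{sample}.vcf.gz"],
    ]


commands = build_variant_calling_commands(
    "sample1", "reference.fa", "sample1_1.fq.gz", "sample1_2.fq.gz")
for command in commands:
    print(shlex.join(command))
```

Keeping the commands as argument lists makes it straightforward to pass each one to `subprocess.run` once the tools are installed and the options have been checked against the installed versions.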
Preferably, the genome structure annotation file is an annotation file formed by adding node-count annotations to the GFF file of a given gene.
Preferably, the GFF file is obtained by annotation using the reference genome file and sequencing data.
Preferably, the species to be tested is rice or soybean.
Compared with the prior art, the invention has the following beneficial effects:
(1) Biological significance is assigned to GFF files, whose format carries no biological significance of its own;
(2) Only the GFF and VCF files of the species to be tested are processed, scored and sorted; no heavy computation is needed, which saves time and simplifies subsequent analysis;
(3) Classification and sorting require only the genome region file, the reference genome file and the genome variation site file of the species to be analyzed, and the method can be applied in many areas, such as genome-wide association studies (GWAS), genomic selection (GS) breeding, QTL mapping, species evolution, large-population screening, standards for identifying new species and radiation mutagenesis screening;
(4) The invention provides a method of visualizing and redefining a genome. By scoring the most basic genomic DNA sequences, annotation data and other genome-related analysis data, together with annotation files recording the number of protein network nodes of each gene, known or unknown data patterns and comparative differences can be identified conveniently and intuitively.
Drawings
Fig. 1 is a flowchart of a whole-gene DNA data redefinition method provided according to an embodiment of the present invention.
Detailed Description
Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings. In the following description, like modules are denoted by like reference numerals. In the case of the same reference numerals, their names and functions are also the same. Therefore, a detailed description thereof will not be repeated.
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not to be construed as limiting the invention.
Example 1
This embodiment takes rice as an example and provides a method for redefining whole-gene DNA data, which specifically comprises the following steps:
S1, acquiring files: obtaining the genome region (GFF) file, the reference genome file and the genome variation site (VCF) file for chromosome 1 of the 533-sample rice data; the genome variation site (VCF) file is determined by comparison against the reference genome file;
The genome region (GFF) file includes a genome structure annotation file and a protein annotation file; these are obtained by combining the GFF file of the gene with start position 261504 on chromosome 1 in the 533-sample rice data with an annotation file that records, for each gene, the number of protein network nodes derived from the gene's protein interaction network relationships;
The GFF file is obtained in one of the following ways:
looking up the Latin name of rice and using it to search and download from a database such as NCBI, Ensembl, UCSC or GENCODE;
or finding and downloading the corresponding GFF file from the published literature;
or annotating with the reference genome file and sequencing data;
The reference genome file is obtained in one of the following ways:
downloading from a database such as NCBI, Ensembl, UCSC or GENCODE: look up the Latin name of the species, open the database website and search by the Latin name to obtain the reference genome file of the species;
or finding and downloading the corresponding reference genome file from the published literature;
for a species without a reference genome, sequencing data can be obtained by sequencing and assembled into a genome to obtain the reference genome file;
The VCF file is a common bioinformatics file format for storing genomic or transcriptomic variant information; it is obtained as follows:
removing sequencing adapters from the raw sequencer output of the parents and of individual hybrid progeny (F1) plants with the fastp quality-control software to obtain the sequencing data of the parents and the individual hybrid progeny;
aligning the sequencing data of the parents and the individual hybrid progeny to the reference genome with the bwa sequence alignment software, and sorting the aligned data of each by preset genomic position with samtools;
filtering PCR duplicates out of the sorted data of the parents and the individual hybrid progeny with the Picard toolkit for high-throughput sequencing data;
performing variant calling on the filtered sequencing data of the parents and the individual hybrid progeny with GATK to obtain the final VCF file.
S2, processing data information: extracting the data information of each region, such as the gene coding region, mRNA, gene, exon and five_prime_UTR, from the genome region (GFF) file; determining mutation sites from the VCF file;
after the mutation sites are determined, extracting their genomic position coordinates;
Specifically, the data information includes the position coordinates of the gene structures and the genomic position coordinates of the mutation sites.
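Extracting the genomic coordinates of the mutation sites in step S2 can be sketched in Python as follows. The sketch reads the fixed leading columns of a VCF record (CHROM, POS, ID, REF, ALT) and ignores the rest. The function name and the sample records, including their positions and alleles, are hypothetical and are not taken from the rice data.

```python
from typing import Iterable, List, Tuple

Site = Tuple[str, int, str, str]  # (chromosome, position, REF allele, ALT allele)


def read_variant_sites(vcf_lines: Iterable[str]) -> List[Site]:
    """Collect chromosome, 1-based position, REF and ALT from VCF records."""
    sites = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip meta-information, the header line and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5:
            raise ValueError("a VCF record needs at least CHROM, POS, ID, REF, ALT")
        sites.append((cols[0], int(cols[1]), cols[3], cols[4]))
    return sites


# Hypothetical VCF content for illustration only.
vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "Chr1\t336000\t.\tA\tG\t50\tPASS\t.",
    "Chr1\t337200\t.\tC\tT\t60\tPASS\t.",
]
sites = read_variant_sites(vcf_lines)
print(sites)
```

The resulting positions can then be compared with the start and end coordinates of the regions extracted from the GFF file.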
S3, classifying data information: the data information extracted in step S2 is classified as follows:
S31, extracting only the records for the gene coding region (CDS) from the GFF file and classifying with the awk command in Linux, for example: awk 'BEGIN { FS=OFS="\t" } $3 == "CDS" { print $0 }' wuzhong.gff3 > cds.txt; to extract another category, replace the value inside the double quotes;
S32, opening the file to be processed in Excel, selecting all the data, filtering on the third column (the GFF format is fixed, and the third column is Type), and classifying the data according to that column;
S33, importing the GFF file and the FASTA sequence file into the existing TBtools software and outputting the file for each type to complete the classification;
In practice: open TBtools and click Sequence Toolkit > GFF3/GTF Manipulate > GXF Sequences Extract; import the GFF file and the FASTA sequence file, click Initialize, select the category from the Feature Tag drop-down, choose the output location, set the output file name and click Start;
the results are shown in Table 1 below;
Table 1. Data information for each region of chromosome 1 in the rice data
Taking the CDS region with start position 335869 and end position 337498 on rice chromosome 1 as an example, the VCF file is shown in the following table:
Table 2. Rice VCF file
S34, constructing a protein interaction network: pairing the proteins in the protein annotation file of the species to be tested with homologous Arabidopsis proteins, constructing a protein interaction network from an existing protein interaction database after pairing, and calculating the number of nodes connected to each protein;
S4, scoring and sorting: taking the gene coding region as the weight, scoring the mRNA, gene, exon, UTR, QTL and/or methylation regions from high to low according to their role in the biological processes of the species to be tested and their distance from the mutation site; the region whose position is nearest to the mutation site is scored as 10 points, and the regions within 1000 bp upstream and downstream of the mutation site are scored in turn with decreasing scores; the scoring results are shown in Table 3;
Table 3. Scoring results for rice
The regions are scored from high to low according to their role in the biological processes of the species to be tested and their distance from the mutation site to obtain a region base score; a node score is calculated from the number of nodes of each gene in the protein interaction network, with each node counting as 1 point; the scoring formula is:
Node score =
Nodes in the protein interaction network represent proteins, and edges represent the interactions between proteins;
The region base score and the node score are added to obtain the score; the species to be tested is defined according to the score; finally, the DNA data of rice is redefined according to the scoring results and high-value sites are estimated.
Example 2
Genes with start positions 319784 and 353520 on chromosome 1 of the soybean data were selected (the procedure is the same as in Example 1); the data information for each region of chromosome 1 (Gm01) in the soybean data is shown in Table 4;
Table 4. Data information for each region of chromosome 1 in the soybean data
The scoring results are shown in Table 5;
Table 5. Scoring results for soybean
The species to be tested is defined according to the score; finally, the DNA data of soybean is redefined according to the scoring results and high-value sites are estimated.
The method of the invention is a new way of defining conventional single nucleotide polymorphisms. A single nucleotide polymorphism (SNP) is a DNA sequence polymorphism caused by variation of a single nucleotide at the genomic level. Some SNPs located inside genes are likely to affect protein structure or expression level directly, so studying SNP mutation sites in combination with the regions defined in the GFF file is highly representative.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (10)

1. A method for redefining whole-gene DNA data, characterized by comprising the following steps:
S1, acquiring files: acquiring a genome region file, a reference genome file and a genome variation site file of a species to be tested; the genome variation site file is determined by comparison against the reference genome file;
S2, processing data information: extracting the position information of the genes of the species to be tested from the genome region file; determining mutation sites from the genome variation site file;
S3, classifying data information: classifying the data information in the genome region file;
S4, scoring and sorting: taking the gene coding region as the weight, scoring the mRNA, gene, exon, UTR, QTL and/or methylation regions from high to low according to their role in the biological processes of the species to be tested and their distance from the mutation site; the region whose position is nearest to the mutation site is scored as 10 points, and the regions within 1000 bp upstream and downstream of the mutation site are scored in turn with decreasing scores; and defining the species to be tested according to the scores.
2. The method for redefining whole-gene DNA data according to claim 1, wherein the genome region file in step S1 includes a genome structure annotation file and a protein annotation file.
3. The method for redefining whole-gene DNA data according to claim 2, wherein the classification in step S3 specifically includes the following steps:
S31, extracting the records for the gene coding region (CDS) from the genome region file, extracting the records for mRNA, gene, exon, UTR and/or function, and classifying them with the awk command in Linux;
S32, selecting all the classified data and filtering it by region type;
S33, importing the genome region file and the reference genome file into the TBtools software and outputting the file for each type to complete the classification;
S34, constructing a protein interaction network: pairing the proteins in the protein annotation file with homologous Arabidopsis proteins, constructing a protein interaction network from an existing protein interaction database after pairing, and calculating the number of nodes connected to each protein.
4. The method for redefining whole-gene DNA data according to claim 3, wherein step S4 further includes: scoring the regions from high to low according to their role in the biological processes of the species to be tested and their distance from the mutation site to obtain a region base score; calculating a node score from the number of nodes of each gene in the protein interaction network, with each node counting as 1 point; the scoring formula is:
Node score =
The region base score and the node score are added to obtain the score.
5. The method for redefining whole-gene DNA data according to claim 4, wherein the awk command is:
where X is CDS, mRNA, gene, exon, UTR, QTL or a methylation region.
6. The method for redefining whole-gene DNA data according to claim 5, wherein the reference genome file is a FASTA sequence file and the genome variation site file is a VCF file.
7. The method for redefining whole-gene DNA data according to claim 6, wherein the VCF file in step S1 is obtained as follows:
Sequencing adapters are removed from the raw sequencer output with the fastp quality-control software to obtain sequencing data; the sequencing data is aligned to the reference genome with the bwa sequence alignment software, and the aligned data is sorted by preset genomic position with samtools; PCR duplicates in the sorted data are filtered out with the Picard toolkit for high-throughput sequencing data; and variant calling is performed on the filtered data with GATK to obtain the final VCF file.
8. The method for redefining whole-gene DNA data according to claim 7, wherein the genome structure annotation file is an annotation file formed by adding node-count annotations to the GFF file of a given gene.
9. The method for redefining whole-gene DNA data according to claim 8, wherein the GFF file is obtained by annotation using the reference genome file and sequencing data.
10. The method for redefining whole-gene DNA data according to claim 9, wherein the species to be tested is rice or soybean.
CN202410880242.8A 2023-08-03 2024-07-02 Full-gene DNA data redefinition method Pending CN118430645A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2023109707163 2023-08-03
CN202310970716.3A CN116705155A (en) 2023-08-03 2023-08-03 Definition method of whole-gene DNA data

Publications (1)

Publication Number Publication Date
CN118430645A true CN118430645A (en) 2024-08-02

Family

ID=87837808

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202310970716.3A Pending CN116705155A (en) 2023-08-03 2023-08-03 Definition method of whole-gene DNA data
CN202410880242.8A Pending CN118430645A (en) 2023-08-03 2024-07-02 Full-gene DNA data redefinition method

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202310970716.3A Pending CN116705155A (en) 2023-08-03 2023-08-03 Definition method of whole-gene DNA data

Country Status (1)

Country Link
CN (2) CN116705155A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105008599A (en) * 2013-02-07 2015-10-28 中国种子集团有限公司 Rice whole genome breeding chip and application thereof
US20190311785A1 (en) * 2013-03-15 2019-10-10 The Scripps Research Institute Systems and methods for genomic annotation and distributed variant interpretation
CN112349350A (en) * 2020-11-09 2021-02-09 山西大学 Method for strain identification based on Dunaliella core genome sequence
CN112542215A (en) * 2020-12-21 2021-03-23 成都基因坊科技有限公司 Gene annotation file format and analysis tool for same

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104450898B (en) * 2014-11-26 2017-03-29 中华人民共和国常州出入境检验检疫局 A species identification method for Euproctis insects
CN112927755B (en) * 2021-02-09 2022-03-25 北京博奥医学检验所有限公司 Method and system for identifying cfDNA (cfDNA) variation source
IL310649A (en) * 2021-08-05 2024-04-01 Grail Llc Somatic variant cooccurrence with abnormally methylated fragments
CN115838808A (en) * 2022-07-29 2023-03-24 江苏省家禽科学研究所科技创新有限公司 Molecular marker for identifying Wenshang Luhua chicken variety and application thereof
CN116426647A (en) * 2023-03-10 2023-07-14 江苏省家禽科学研究所 Molecular marker combination for identifying Tianjin monkey chicken variety and application thereof

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王倩文: "大白菜核雄性不育相关基因挖掘及鉴定", 《中国优秀硕士学位论文全文数据库》, 1 May 2023 (2023-05-01) *

Also Published As

Publication number Publication date
CN116705155A (en) 2023-09-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination