CN117637028A - Method for obtaining orthologous gene by combining transcriptome and resequencing - Google Patents

Method for obtaining orthologous gene by combining transcriptome and resequencing

Info

Publication number
CN117637028A
Authority
CN
China
Prior art keywords
species
data
transcriptome
resequencing
gene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311604097.2A
Other languages
Chinese (zh)
Inventor
许吾琴
张秋云
郑进芳
陈广勇
王平安
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab
Priority to CN202311604097.2A
Publication of CN117637028A


Abstract

The invention discloses a method for obtaining orthologous genes by combining transcriptome and resequencing data. Using an algorithm written in C++ and a data processing workflow, the method combines genomic DNA sequence information from resequencing data with gene expression information from transcriptome data so as to identify single-copy orthologous genes (SOGs) more accurately and efficiently. The method comprises the following steps: assembling the raw transcriptome data, searching for SOGs, and aligning the SOGs to a reference genome to obtain their specific positions on the genome; performing genetic variant site detection on the resequencing data to obtain a VCF file containing information on all sites; and, using the FindSOG tool, extracting the SOG sequences from the resequencing data according to the SOG position information and the VCF file, then aligning the SOGs from the transcriptome and the resequencing data, the resulting sequence matrix being usable for subsequent evolutionary analysis. The invention integrates two different types of data sets, can provide genetic information for more species more comprehensively, and can be widely applied in bioinformatics research.

Description

Method for obtaining orthologous gene by combining transcriptome and resequencing
Technical Field
The invention relates to the field of bioinformatics, in particular to a method for acquiring orthologous genes by combining transcriptomes and resequencing.
Background
In current bioinformatics research, the integration and analysis of omics data is one of the key challenges. Single-copy orthologous genes (SOGs) play an important role in bioinformatics analysis: the genetic information they provide can be used not only to study the evolutionary relationships of species but also for functional annotation of gene diversity. Researchers currently use OrthoFinder software (Emms & Kelly, 2019) to find inter-species SOGs based on transcriptome data. OrthoFinder, a mature tool, uses whole-protein alignment and cluster analysis to identify SOGs present in different species, as follows:
1) Protein sequence acquisition: OrthoFinder first takes protein sequence data for a number of species; these sequences can be obtained from genome sequencing and transcriptome data.
2) Whole-protein alignment: the protein sequences of the species are aligned over their full length to find similarity and homology among the sequences.
3) Clustering of homologous genes: based on protein sequence similarity, OrthoFinder groups the proteins into homologous gene clusters. Proteins in a cluster have similar sequence characteristics, indicating that they are likely to be homologous genes.
4) Identifying SOGs: OrthoFinder analyzes these homologous gene clusters to determine which genes are present as exactly one copy in each species. That is, only homologous genes with a single copy in every species are selected.
5) Providing classification information: based on the analysis in the steps above, OrthoFinder provides a sorted list of inter-species SOGs, helping the user carry out inter-species comparison and functional annotation.
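The selection rule in step 4) can be illustrated with a minimal sketch. It assumes the homologous gene clusters are already available as a mapping from cluster ID to per-species gene lists; the function name `single_copy_orthogroups` and the toy data are hypothetical and are not part of OrthoFinder.

```python
def single_copy_orthogroups(orthogroups, species):
    """Return IDs of clusters that contain exactly one gene in every species."""
    return [
        og_id
        for og_id, members in orthogroups.items()
        if all(len(members.get(sp, [])) == 1 for sp in species)
    ]


# Hypothetical toy input: cluster ID -> species -> gene IDs
orthogroups = {
    "OG1": {"A": ["a1"], "B": ["b1"], "C": ["c1"]},
    "OG2": {"A": ["a2", "a3"], "B": ["b2"], "C": ["c2"]},  # two copies in A
    "OG3": {"A": ["a4"], "B": ["b3"]},                      # absent from C
}
sogs = single_copy_orthogroups(orthogroups, ["A", "B", "C"])
```

Only the cluster with one copy in each of the three species is retained; clusters with duplicated or missing members are discarded.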
OrthoFinder is currently applicable only to transcriptome data. Transcriptome sequencing, however, typically requires fresh tissue samples, and RNA extraction places high demands on experimental conditions, which are difficult to meet for rare biological samples and in laboratories with ordinary facilities. Resequencing places lower demands on the experimental material than transcriptome sequencing: dried tissue specimens are usually sufficient and DNA extraction is comparatively easy, so species that cannot undergo transcriptome sequencing can still be resequenced. The two kinds of data complement each other and would allow researchers to obtain more comprehensive genetic information for analyzing the evolutionary relationships of species, yet no existing tool can use both data sets simultaneously to identify SOGs among species sequenced by different means.
The invention identifies SOGs based on OrthoFinder, finds the corresponding genes in the resequencing data set by means of a whole-genome alignment search, and integrates the genes shared by all species for subsequent analysis.
Disclosure of Invention
The present invention aims at overcoming the defects of the prior art and providing a method for obtaining orthologous genes by combining transcriptomes and resequencing.
To achieve the above object, the present invention provides a method for obtaining orthologous genes by combining transcriptome and resequencing, comprising the steps of:
(1) Assembling and splicing the transcriptome original data of the species A, B, C to obtain a transcript file; performing open reading frame prediction on each transcript, and identifying a sequence region with potential protein coding capability; using protein coding genes of a plurality of species as input files, searching single-copy orthologous gene sequences of the species A, B, C, and then comparing the single-copy orthologous genes of the species A, B, C to a reference genome to obtain specific positions of the single-copy orthologous genes of the species A, B, C on the reference genome;
(2) Genetic variation locus detection is carried out on the resequenced data of the species D, E, F, and a VCF file containing all locus information of the species D, E, F is obtained;
(3) Extracting a single copy orthologous gene sequence in the resequencing data of the species D, E, F according to the position information of the single copy orthologous gene of the species A, B, C obtained in the step (1) and the VCF file of the species D, E, F obtained in the step (2) by using a FindSOG tool; the single copy orthologous gene sequence of species A, B, C transcriptome data was aligned with the single copy orthologous gene sequence of species D, E, F resequencing data to obtain a single copy orthologous gene sequence matrix common to species A, B, C, D, E, F.
Further, the step (1) specifically comprises the following steps: assembling and splicing the transcriptome original data of the species A, B, C, and then removing redundant sequences by using a CD-HIT tool to obtain a transcript file; performing open reading frame prediction on each transcript, and identifying a sequence region with potential protein coding capability; using protein coding genes of a plurality of species as input files, using OrthoFinder software to find single copy orthologous genes of the species A, B, C, selecting annotation files of reference genomes of target species, and comparing the single copy orthologous genes of the species A, B, C to the reference genomes by using BLAST software to obtain specific positions of the single copy orthologous genes of the species A, B, C on the reference genomes.
Further, according to the annotation information in the annotation file, the chromosome number and the start-stop position of the gene are obtained.
Further, in the step (2), the VCF file records the position and base information of each nucleotide position in the species D, E, F for extracting the corresponding gene according to the position information of SOGs.
Further, the FindSOG tool uses a program written in the C++ language to extract single copy orthologous genes from VCF files based on gene start and stop positions.
Further, the working process of the FindSOG tool includes: obtaining the position information and VCF file data of the single copy orthologous genes of the original whole genome, and preprocessing the data to obtain gene fragments; establishing a hash table for storing index information of genes, storing the preprocessed gene fragments in the hash table, and classifying according to genome names; processing the gene information according to the rows, and establishing initial and end position indexes of the gene fragments; and extracting context information according to the starting and ending positions of the gene segments, and recording species information, function information and context information of the gene information.
Further, the preprocessing operation comprises data cleaning and format conversion.
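The hash-table index described above can be sketched as follows. This is an illustrative sketch only, written in Python rather than the C++ of the FindSOG tool; it assumes gene fragments are given as (gene, chromosome, start, end) records, and the names `build_gene_index` and `genes_at` are hypothetical.

```python
from collections import defaultdict


def build_gene_index(records):
    """Group (gene, chrom, start, end) records by chromosome in a dict (hash table)."""
    index = defaultdict(list)
    for gene, chrom, start, end in records:
        index[chrom].append((start, end, gene))
    for chrom in index:
        index[chrom].sort()  # order fragments by start position
    return dict(index)


def genes_at(index, chrom, pos):
    """Return the genes whose inclusive [start, end] interval contains pos."""
    return [gene for start, end, gene in index.get(chrom, []) if start <= pos <= end]


# Hypothetical toy records
index = build_gene_index([
    ("g1", "chr1", 100, 200),
    ("g2", "chr1", 150, 300),
    ("g3", "chr2", 10, 20),
])
hits = genes_at(index, "chr1", 160)
```

Keying the table by chromosome means each VCF row only has to be compared against the fragments of its own chromosome rather than against every gene.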
To achieve the above object, the present invention also provides an apparatus for obtaining orthologous genes by combining transcriptome and resequencing data, comprising one or more processors for implementing the above method for obtaining orthologous genes by combining transcriptome and resequencing data.
To achieve the above object, the present invention also provides an electronic device including a memory and a processor, the memory being coupled to the processor; wherein the memory is configured to store program data and the processor is configured to execute the program data to implement the above-described method of combining transcriptome and resequencing data to obtain orthologous genes.
To achieve the above object, the present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described method of obtaining orthologous genes in combination with transcriptomes and resequencing.
The beneficial effects of the invention are as follows: the invention integrates two different types of data sets, can provide genetic information for more species more comprehensively, and can be widely applied in bioinformatics research, in particular in biodiversity, evolutionary biology, and disease gene research. In addition, the scheme can support fields such as biotechnology and drug development. It provides a more comprehensive and efficient genome analysis method and is expected to have an important influence in the field of bioinformatics.
Because of the huge information content of VCF files containing a plurality of individual whole genome loci, the processing speed is very slow, and the method adopts the following strategies to improve the efficiency:
1) Efficient C++ algorithm: an efficient algorithm written in the C++ language improves calculation speed and resource utilization.
2) Multithreading parallel computing: the data are processed in parallel by utilizing a multithreading technology, so that the extraction process of the genetic loci of the whole genome is accelerated.
3) Distributed computing architecture: the distributed system is designed and realized, tasks are effectively managed and distributed, and the expandability and the calculation efficiency of the system are improved.
4) Performance improvement over 10-fold: compared with the traditional CPU calculation mode, the method can improve the performance by more than 10 times in a large-scale whole genome calculation task.
Drawings
FIG. 1 is a SOGs lookup flow chart provided by the present invention;
FIG. 2 is a schematic representation of BLAST results provided by the present invention;
FIG. 3 is a schematic view of a genome annotation file provided by the present invention;
FIG. 4 is a schematic diagram of the mappingrate0.8.variant.raw.vcf file provided by the present invention;
FIG. 5 is a schematic view of a find_blast.cds file provided by the present invention;
FIG. 6 is a schematic representation of a phylogenetic tree constructed using 338 SOGs provided by the present invention;
FIG. 7 is a schematic view of a device according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the invention. Rather, they are merely examples of apparatus and methods consistent with aspects of the invention as detailed in the accompanying claims.
FIG. 1 is the SOGs search flow chart provided by the invention. The study object is usually a group comprising a plurality of species; in the figure, species A, B, C are assumed to have transcriptome data and species D, E, F are assumed to have resequencing data. The left side of the flow chart shows the transcriptome analysis flow and the right side shows the resequencing analysis flow; the dashed box FindSOG in the middle represents the tool created by the present invention, and the resulting SOGs common to A, B, C, D, E, F are available for downstream analysis. This procedure requires the reference genome sequence and annotation file of any one of the species A, B, C, D, E, F. FIG. 2 is a schematic diagram of BLAST results, in which the columns represent: species number, gene name, percent identity, length of aligned sequence, number of mismatches, number of gaps, alignment start position in the query sequence, alignment end position in the query sequence, alignment start position in the target sequence, alignment end position in the target sequence, E value (expected value), and alignment score (bit score). FIG. 3 is a genome annotation file, in which the columns represent: chromosome number, source, feature type, start position, end position, score, orientation, phase, and annotation. FIG. 4 is a VCF file, in which the columns represent: CHROM (chromosome), POS (position), ID (identifier), REF (reference base), ALT (alternate base), QUAL (quality value), FILTER (filter tag), INFO (other information), FORMAT (format), and the sample columns (Samples). FIG. 5 is the find_blast.cds file: the first column is the gene number and the second column is the gene name, followed by the start and stop positions of the gene on the genome, where multiple start and stop positions represent multiple coding-region segments. FIG. 6 is a phylogenetic tree of the family Altingiaceae constructed using 338 SOGs; the species names in the figure use abbreviated genus names, with L., A., and S. standing for Liquidambar, Altingia, and Semiliquidambar, respectively; a black triangle after the species name indicates transcriptome sequencing, and species without a mark were resequenced; the values at the tree nodes and the blue bars represent the estimated divergence times and their 95% confidence intervals, respectively, corresponding to the time scale at the bottom of the tree.
The present invention will be described in detail with reference to the accompanying drawings. The features of the examples and embodiments described below may be combined with each other without conflict.
The technical solutions in the embodiments of the present invention are described clearly and completely below.
Example 1
As shown in FIG. 1, the method for obtaining orthologous genes by combining transcriptome and resequencing provided by the invention comprises the following steps:
(1) Assembling and splicing the transcriptome original data of the species A, B, C to obtain a transcript file; performing open reading frame prediction on each transcript, and identifying a sequence region with potential protein coding capability; using protein-coding genes of multiple species as input files, searching for the single copy orthologous genes of species A, B, C with OrthoFinder software, and aligning the single copy orthologous genes of species A, B, C to the reference genome with BLAST software to obtain the specific positions of the single copy orthologous genes of species A, B, C on the reference genome.
Specifically, the original files (i.e., transcriptome raw data) resulting from transcriptome sequencing (here assuming transcriptome sequencing of species A, B, C) are assembled and spliced using Trinity software (Grabherr et al., 2011), followed by removal of redundant sequences using the CD-HIT tool (Fu et al., 2012) to yield the final transcript file. Open reading frame prediction is performed on each transcript using TransDecoder software (http://transdecoder.sourceforge.net/), identifying sequence regions in the transcript that have potential protein-coding capability. Using the protein-coding genes of multiple species as input files, OrthoFinder software is used to search for the single-copy orthologous genes (SOGs) of species A, B, C among the protein-coding genes; the result is named gene set 1. An annotation file of the reference genome of a target species is selected, and gene set 1 is aligned to the reference genome using BLAST software (Johnson et al., 2008) to obtain the matched gene names (see column 2 of FIG. 2), that is, the specific positions of the SOGs on the reference genome. BLAST (Basic Local Alignment Search Tool) is a tool that searches for the best-matching region based on the similarity between the input sequence and the sequences in a database.
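As an illustration of how the matched gene names might be read from the BLAST output, the following minimal sketch keeps the highest-scoring hit per query. It assumes the 12-column tab-separated layout listed for FIG. 2, with the bit score in the last column; the function name `best_hits` and the example rows are hypothetical.

```python
def best_hits(blast_lines):
    """Map each query to the subject with the highest bit score (column 12)."""
    best = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


# Hypothetical rows: query, subject, identity, length, mismatches, gaps,
# qstart, qend, sstart, send, e-value, bit score
rows = [
    "\t".join(["sog1", "geneX", "98.5", "300", "4", "0", "1", "300", "1", "300", "1e-150", "540"]),
    "\t".join(["sog1", "geneY", "80.0", "120", "24", "1", "5", "124", "9", "128", "1e-20", "95.3"]),
    "\t".join(["sog2", "geneZ", "99.0", "200", "2", "0", "1", "200", "1", "200", "1e-100", "365"]),
]
hits = best_hits(rows)
```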
According to the annotation information in the annotation file, the chromosome numbers and the start and stop positions of the genes can be obtained. For example, in FIG. 3 the first column is the chromosome number, the fourth and fifth columns are the start and stop positions, and the first entry shown is the gene numbered evm.model.HIC_ASM_12.688 found in FIG. 2.
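The lookup of chromosome number and start-stop positions can be sketched as below. The sketch assumes a GFF-style annotation with the nine columns listed for FIG. 3 and assumes that coding segments are rows of feature type CDS whose last column carries a `Parent=` attribute naming the gene; that attribute layout, the function name `cds_segments`, and the sample rows are assumptions made for illustration.

```python
def cds_segments(gff_lines, wanted):
    """Return gene -> (chromosome, sorted [(start, end), ...]) for CDS rows of wanted genes."""
    segments = {}
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "CDS":
            continue
        # Assumed attribute format: key=value pairs separated by ';'
        attrs = dict(kv.split("=", 1) for kv in fields[8].split(";") if "=" in kv)
        gene = attrs.get("Parent")
        if gene in wanted:
            _, segs = segments.setdefault(gene, (fields[0], []))
            segs.append((int(fields[3]), int(fields[4])))
    for _, segs in segments.values():
        segs.sort()
    return segments


# Hypothetical annotation rows
gff = [
    "##gff-version 3",
    "\t".join(["chr12", "EVM", "gene", "100", "400", ".", "+", ".", "ID=geneX"]),
    "\t".join(["chr12", "EVM", "CDS", "300", "400", ".", "+", "0", "ID=cds2;Parent=geneX"]),
    "\t".join(["chr12", "EVM", "CDS", "100", "200", ".", "+", "0", "ID=cds1;Parent=geneX"]),
    "\t".join(["chr3", "EVM", "CDS", "50", "90", ".", "-", "0", "ID=cds3;Parent=geneY"]),
]
segments = cds_segments(gff, {"geneX"})
```

Collecting all segments of one gene under a single key mirrors the idea of placing multiple coding regions of a gene on the same row.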
(2) Genetic variation locus detection is carried out on the resequenced data of the species D, E, F, and a VCF file containing all locus information of the species D, E, F is obtained;
Specifically, the resequencing data (here assuming resequencing of species D, E, F) are aligned to the reference genome using BWA-MEM (Li, 2013) and filtered using SAMtools software (Li et al., 2009). Genetic variant site detection is performed on the resequencing data using the HaplotypeCaller function of GATK software (McKenna et al., 2010) to obtain a VCF file (genetic variation matrix) containing all sites of species D, E, F. The VCF (Variant Call Format) file is a standard file format for storing variant information in a genome. The VCF file records the position and base information of each nucleotide site in species D, E, F and can subsequently be used to extract the corresponding genes according to the SOG position information. This VCF file is named mappingrate0.8.variant.raw.vcf; see FIG. 4.
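To make the per-site content of such a file concrete, the following sketch decodes one VCF data row into a base per sample. It follows the column order listed for FIG. 4 and the standard GT genotype field. Encoding heterozygous calls with IUPAC ambiguity codes and masking missing or non-SNP calls as N are assumptions of this sketch, not something the description specifies; the function name `sample_bases` is hypothetical.

```python
# IUPAC ambiguity codes for heterozygous two-base calls
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def sample_bases(vcf_line, sample_names):
    """Decode one VCF data row into (chrom, pos, {sample: base})."""
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
    alleles = [ref] + [a for a in alt.split(",") if a != "."]
    gt_index = fields[8].split(":").index("GT")
    bases = {}
    for name, column in zip(sample_names, fields[9:]):
        gt = column.split(":")[gt_index].replace("|", "/").split("/")
        if "." in gt or any(alleles[int(i)] not in ("A", "C", "G", "T") for i in gt):
            bases[name] = "N"  # assumption: mask missing calls and non-SNP alleles
            continue
        called = {alleles[int(i)] for i in gt}
        bases[name] = called.pop() if len(called) == 1 else IUPAC.get(frozenset(called), "N")
    return chrom, pos, bases


# Hypothetical row with four samples
row = "\t".join(["chr1", "101", ".", "A", "G", "50", "PASS", ".", "GT:DP",
                 "0/0:10", "0/1:8", "1/1:9", "./.:0"])
result = sample_bases(row, ["D", "E", "F", "G"])
```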
(3) Extracting SOGs sequences in species D, E, F resequencing data according to SOGs position information of species A, B, C obtained in the step (1) and the VCF file of species D, E, F obtained in the step (2) by utilizing the FindSOG tool, and comparing SOGs sequences of species A, B, C transcriptome data with SOGs sequences of species D, E, F resequencing data to obtain a SOGs sequence matrix shared by the species A, B, C, D, E, F; the sequence matrix can be used for subsequent evolutionary analysis.
Extraction of SOGs by the FindSOG tool: since one gene may contain a plurality of coding regions (e.g., the evm.model.HIC_ASM_12.637 gene in FIG. 3), the start and stop positions of the multiple regions are first merged into a single row and the resulting file is named find_blast.cds (gene set), as shown in FIG. 5, for subsequent processing. The find_blast.cds file and the mappingrate0.8.variant.raw.vcf file (VCF file) are then taken as inputs, and a C++ program extracts the SOGs from the VCF file according to the start and stop positions of the genes.
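The extraction by start and stop position can be illustrated with a short sketch, written here in Python rather than the C++ of the actual tool. It assumes the bases of one sample have already been decoded from the VCF file into a mapping keyed by (chromosome, position), and that sites missing from the mapping are written as N; strand orientation is ignored. The function name `extract_gene_sequence` and the toy data are hypothetical.

```python
def extract_gene_sequence(chrom, segments, calls):
    """Concatenate the bases of all coding segments of one gene, in genomic order.

    segments: list of inclusive (start, end) positions
    calls:    dict mapping (chrom, pos) -> base for one sample
    """
    bases = []
    for start, end in sorted(segments):
        for pos in range(start, end + 1):
            bases.append(calls.get((chrom, pos), "N"))  # assumption: absent site -> N
    return "".join(bases)


# Hypothetical toy data: a gene with two coding segments on chr1
calls = {
    ("chr1", 5): "A", ("chr1", 6): "C", ("chr1", 7): "G",
    ("chr1", 10): "T", ("chr1", 11): "T",
}
sequence = extract_gene_sequence("chr1", [(10, 12), (5, 7)], calls)
```

Position 12 has no call in the toy data, so the second segment ends in N.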
Through the FindSOG tool, the invention preprocesses the genome data, builds an index table, extracts context information, and applies several performance optimizations, thereby extracting SOGs by gene position from a genome VCF file containing a very large amount of information. With this original tool, the invention integrates transcriptome orthologous genes with whole-genome resequencing data and so provides a more comprehensive analysis of genomic information.
The FindSOG tool works as follows:
S1, data preprocessing: the SOG position information file and the VCF file data of the original whole genome are acquired and preprocessed as necessary to obtain gene fragments; the preprocessing operations include data cleaning, format conversion, and the like. This provides clean, well-formatted data for the subsequent steps.
S2, establishing an index table of gene information: an efficient algorithm is written in the C++ language to improve calculation speed and optimize resource utilization. A hash table is established to store the index information of the genes; the preprocessed gene fragments are stored in the hash table and classified by genome name, which accelerates information extraction in large-scale calculation.
S3, extracting the context in a single thread: the gene information is processed row by row, and start and end position indexes of the gene fragments are established. Context information is extracted according to the start and end positions of the gene fragments, and context information such as species information, functional information, and gene information is recorded.
S4, multithreaded parallel computing: using multithreading, a computing task is divided into multiple subtasks that are executed simultaneously on multiple CPU cores to speed up data processing and computation.
S5, designing a distributed computing architecture: relying on a distributed computing architecture in the cloud, computing tasks are distributed to a plurality of cloud computing nodes, with task coordination and data exchange between them, to improve computing efficiency and the capability of processing large-scale whole-genome data.
S6, performance optimization: the algorithm and the architecture are optimized according to the data scale so that the system achieves efficient and stable performance when processing large-scale whole-genome data. Hash table usage, file reading and writing, and string processing are also optimized.
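The task division in S4 can be sketched as follows: one subtask per chromosome, dispatched to a pool of worker threads. This is an illustration of the partitioning only, not of the FindSOG implementation. In CPython, threads mainly benefit I/O-bound work, whereas a C++ program as described would run such subtasks on separate cores. The function names and the toy tasks are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def count_sites_in_genes(task):
    """Count the positions of one chromosome that fall inside any gene interval."""
    chrom, positions, intervals = task
    hits = sum(1 for p in positions if any(s <= p <= e for s, e in intervals))
    return chrom, hits


def run_parallel(tasks, workers=4):
    """Run one subtask per chromosome on a thread pool and collect the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(count_sites_in_genes, tasks))


# Hypothetical tasks: (chromosome, site positions, gene intervals)
tasks = [
    ("chr1", [1, 5, 10, 50], [(1, 10)]),
    ("chr2", [3, 4], [(10, 20)]),
]
counts = run_parallel(tasks)
```

Because each chromosome is processed independently, the same partitioning could in principle be handed to separate machines in the distributed arrangement of S5.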
Example 2
To verify whether the invention is feasible, the inventors selected species of the family Altingiaceae as the study subject. The family Altingiaceae comprises three genera: Liquidambar, Altingia, and Semiliquidambar. The inventors selected 7 plants for transcriptome sequencing (black triangles in FIG. 6, corresponding to A, B, C ... in FIG. 1) and the remaining 10 plants for resequencing (D, E, F ... in FIG. 1). The specific flow is as follows:
(1) Transcriptome data processing: the transcriptome data are assembled using Trinity software; after redundancy removal, protein-coding sequences are predicted using TransDecoder software, and the transcriptome SOGs are then searched for using OrthoFinder software.
(2) Resequencing data processing: the resequencing data of the 10 individuals are aligned to the sweetgum reference genome using the BWA-MEM tool, and variant detection is performed using GATK software to obtain the VCF file.
(3) Obtaining common SOGs: the transcriptome SOGs are aligned to the sweetgum reference genome using BLAST software to obtain genome position information; after this information is organized, the FindSOG tool extracts the sites of the transcriptome SOGs from the resequencing VCF file according to the position information. The SOGs common to the 17 plants can thus be extracted, collated, and aligned for subsequent analysis.
Applying this flow to the two data sets yielded 338 SOGs in total. To verify the reliability of these SOGs, the inventors used them to construct a species phylogenetic tree. The results show that the support value of every node exceeds 95% and that the topology is consistent with previous research results, indicating that the 338 SOGs contain reliable genetic information and can be used for subsequent evolutionary analysis.
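The step of keeping only the SOGs common to all samples and arranging them into one sequence matrix can be sketched as follows. The sketch assumes that the sequences of each gene are already aligned to equal length across samples; the function name `shared_gene_matrix` and the toy sequences are hypothetical.

```python
def shared_gene_matrix(seqs_by_sample):
    """Keep genes present in every sample and concatenate them per sample.

    seqs_by_sample: sample -> {gene: aligned sequence}
    Returns (sorted shared gene names, sample -> concatenated sequence).
    """
    shared = sorted(set.intersection(*(set(genes) for genes in seqs_by_sample.values())))
    matrix = {
        sample: "".join(genes[g] for g in shared)
        for sample, genes in seqs_by_sample.items()
    }
    return shared, matrix


# Hypothetical toy data: one transcriptome sample (A) and one resequenced sample (D)
seqs = {
    "A": {"g1": "ACG", "g2": "TT"},
    "D": {"g1": "ACA", "g3": "GG"},
}
shared, matrix = shared_gene_matrix(seqs)
```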
Example 3
In accordance with the foregoing embodiments of a method for obtaining orthologous genes in combination with transcriptome and resequencing data, embodiments of an apparatus for obtaining orthologous genes in combination with transcriptome and resequencing data are also provided.
Referring to FIG. 7, an apparatus for obtaining orthologous genes from a combined transcriptome and resequenced data according to an embodiment of the present invention includes one or more processors configured to implement the method for obtaining orthologous genes from a combined transcriptome and resequenced data according to the above embodiment.
The embodiments of the apparatus for obtaining orthologous genes by combining transcriptome and resequencing data of the present invention may be applied to any device having data processing capability, such as a computer. The apparatus embodiments may be implemented by software, by hardware, or by a combination of hardware and software. Taking software implementation as an example, the apparatus in a logical sense is formed by the processor of the device in which it is located reading the corresponding computer program instructions from nonvolatile memory into memory and running them. In terms of hardware, FIG. 7 shows a hardware structure diagram of a device having data processing capability in which the apparatus for obtaining orthologous genes by combining transcriptome and resequencing data is located. In addition to the processor, memory, network interface, and nonvolatile memory shown in FIG. 7, the device in which the apparatus of the embodiment is located may generally include other hardware according to its actual function, which is not described here.
The implementation process of the functions and roles of each unit in the above device is specifically shown in the implementation process of the corresponding steps in the above method, and will not be described herein again.
For the device embodiments, reference is made to the description of the method embodiments for the relevant points, since they essentially correspond to the method embodiments. The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purposes of the present invention. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
Example 4
Corresponding to the embodiments of the aforementioned method for obtaining orthologous genes by combining transcriptome and resequencing data, embodiments of the present application also provide an electronic device comprising: one or more processors; and a memory for storing one or more programs; the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method for obtaining orthologous genes by combining transcriptome and resequencing data as described above. FIG. 8 shows a hardware structure diagram of a device having data processing capability in which the method of the embodiment of the present application is implemented. In addition to the processor, memory, DMA controller, disk, and nonvolatile memory shown in FIG. 8, the device in the embodiment may generally include other hardware according to its actual function, which is not described here.
Example 5
In accordance with the foregoing embodiments of the method for combining transcriptome and resequencing data to obtain an orthologous gene, embodiments of the present invention also provide a computer-readable storage medium having a program stored thereon that, when executed by a processor, implements the method for combining transcriptome and resequencing data to obtain an orthologous gene of the foregoing embodiments.
The computer-readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any device having data processing capability described in any of the previous embodiments. The computer-readable storage medium may also be an external storage device of such a device, for example a plug-in hard disk, a Smart Media Card (SMC), an SD card, or a flash memory card (Flash Card) provided on the device. Further, the computer-readable storage medium may include both an internal storage unit and an external storage device of the device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the device, and may also be used for temporarily storing data that has been output or is to be output.
The foregoing describes only preferred embodiments of the invention and is not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.
The above embodiments are merely for illustrating the design concept and features of the present invention, and are intended to enable those skilled in the art to understand the content of the present invention and implement the same, the scope of the present invention is not limited to the above embodiments. Therefore, all equivalent changes or modifications according to the principles and design ideas of the present invention are within the scope of the present invention.

Claims (10)

1. A method for obtaining orthologous genes in combination with transcriptome and resequencing data comprising the steps of:
(1) Assembling and splicing the transcriptome original data of the species A, B, C to obtain a transcript file; performing open reading frame prediction on each transcript, and identifying a sequence region with potential protein coding capability; using protein coding genes of a plurality of species as input files, searching single-copy orthologous gene sequences of the species A, B, C, and then comparing the single-copy orthologous genes of the species A, B, C to a reference genome to obtain specific positions of the single-copy orthologous genes of the species A, B, C on the reference genome;
(2) Genetic variation locus detection is carried out on the resequenced data of the species D, E, F, and a VCF file containing all locus information of the species D, E, F is obtained;
(3) Using the FindSOG tool, extracting the single-copy orthologous gene sequences from the resequencing data of the species D, E, F according to the position information of the single-copy orthologous genes of the species A, B, C obtained in step (1) and the VCF file of the species D, E, F obtained in step (2); and aligning the single-copy orthologous gene sequences from the transcriptome data of the species A, B, C with the single-copy orthologous gene sequences from the resequencing data of the species D, E, F to obtain a single-copy orthologous gene sequence matrix shared by the species A, B, C, D, E, F.
2. The method for obtaining orthologous genes by combining transcriptome and resequencing data according to claim 1, wherein said step (1) is specifically: assembling and splicing the transcriptome original data of the species A, B, C, and then removing redundant sequences by using a CD-HIT tool to obtain a transcript file; performing open reading frame prediction on each transcript, and identifying a sequence region with potential protein coding capability; using protein coding genes of a plurality of species as input files, using orthogene software to find single copy orthologous genes of the species A, B, C, selecting annotation files of reference genomes of target species, and comparing the single copy orthologous genes of the species A, B, C to the reference genomes by using BLAST software to obtain specific positions of the single copy orthologous genes of the species A, B, C on the reference genomes.
3. The method for obtaining orthologous genes by combining transcriptome and resequencing data according to claim 2, wherein the chromosome number and the start-stop position of the gene are obtained according to the annotation information in the annotation file.
4. The method for obtaining orthologous genes by combining transcriptome and resequencing data according to claim 1, wherein in the step (2), the VCF file records the position and base information of each nucleotide site in the species D, E, F, and is used for extracting the corresponding genes according to the position information of the single-copy orthologous genes (SOGs).
5. The method for obtaining orthologous genes by combining transcriptome and resequencing data according to claim 1, wherein the FindSOG tool uses a program written in C++ to extract the single-copy orthologous genes from the VCF file according to the start and stop positions of the genes.
6. The method for obtaining orthologous genes by combining transcriptome and resequencing data according to claim 1, wherein the operation of the FindSOG tool comprises: obtaining the position information of the single-copy orthologous genes and the VCF file data of the original whole genome, and preprocessing the data to obtain gene fragments; establishing a hash table for storing index information of the genes, storing the preprocessed gene fragments in the hash table, and classifying them by genome name; processing the gene information row by row, and establishing start and end position indexes of the gene fragments; and extracting context information according to the start and end positions of the gene fragments, and recording the species information, function information, and context information of the gene information.
7. The method for combining transcriptome and resequencing data to obtain an ortholog gene according to claim 6, wherein the preprocessing operation comprises data cleansing, format conversion.
8. An apparatus for obtaining orthologous genes in combination with transcriptome and resequencing data comprising one or more processors configured to implement the method of obtaining orthologous genes in combination with transcriptome and resequencing of any one of claims 1-7.
9. An electronic device comprising a memory and a processor, wherein the memory is coupled to the processor; wherein the memory is for storing program data and the processor is for executing the program data to implement the method of obtaining orthologous genes in combination with transcriptome and resequencing data of any one of the preceding claims 1-7.
10. A computer readable storage medium having stored thereon a computer program, wherein the program when executed by a processor implements the method of combining transcriptomes and resequencing to obtain orthologous genes according to any one of claims 1-7.
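The extraction recited in claims 3 to 6 can be illustrated with a short sketch. The following Python code is not the FindSOG tool, which claim 5 describes as a C++ program; it is a minimal illustration under stated assumptions. The function names `build_gene_index`, `called_base`, and `extract_genes` are hypothetical. The sketch assumes a tab-separated VCF with a record for every site and a single sample in the tenth column, gene coordinates that are 1-based and inclusive, and a simplified calling rule in which heterozygous, multi-base, and missing sites are written as `N`.

```python
from collections import defaultdict


def build_gene_index(genes):
    """Hash table keyed by chromosome name (assumed layout, for illustration).

    genes: iterable of (gene_id, chrom, start, end), 1-based inclusive.
    """
    index = defaultdict(list)
    for gene_id, chrom, start, end in genes:
        index[chrom].append((start, end, gene_id))
    return index


def called_base(ref, alt, sample_field):
    """Choose one base for a site from REF, ALT and the sample genotype.

    Simplifying assumption: only single-base sites are resolved; anything
    else (heterozygous calls, indels, missing genotypes) becomes 'N'.
    """
    alleles = sample_field.split(":")[0].replace("|", "/").split("/")
    if len(ref) != 1:
        return "N"
    if alleles == ["0", "0"]:
        return ref
    if alleles == ["1", "1"] and len(alt) == 1 and alt != ".":
        return alt
    return "N"


def extract_genes(vcf_lines, index, sample_col=9):
    """Return {gene_id: sequence} by walking the VCF records line by line."""
    bases = defaultdict(dict)  # gene_id -> {position: base}
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        for start, end, gene_id in index.get(chrom, ()):
            if start <= pos <= end:
                bases[gene_id][pos] = called_base(ref, alt, fields[sample_col])
    sequences = {}
    for chrom_genes in index.values():
        for start, end, gene_id in chrom_genes:
            # Sites absent from the VCF are filled with 'N'.
            sequences[gene_id] = "".join(
                bases[gene_id].get(p, "N") for p in range(start, end + 1)
            )
    return sequences
```

A linear scan over the genes of each chromosome keeps the sketch short; for genome-scale inputs, a sorted interval lookup would likely be preferable. The sequences returned for each resequenced species could then be aligned with the transcriptome-derived sequences as recited in step (3) of claim 1.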
CN202311604097.2A 2023-11-28 2023-11-28 Method for obtaining orthologous gene by combining transcriptome and resequencing Pending CN117637028A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311604097.2A CN117637028A (en) 2023-11-28 2023-11-28 Method for obtaining orthologous gene by combining transcriptome and resequencing


Publications (1)

Publication Number Publication Date
CN117637028A (en) 2024-03-01

Family

ID=90037054

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311604097.2A Pending CN117637028A (en) 2023-11-28 2023-11-28 Method for obtaining orthologous gene by combining transcriptome and resequencing

Country Status (1)

Country Link
CN (1) CN117637028A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination