LU501941B1 - Method for rapidly obtaining target gene family of genome-free species based on transcriptome - Google Patents

Method for rapidly obtaining target gene family of genome-free species based on transcriptome

Info

Publication number
LU501941B1
Authority
LU
Luxembourg
Prior art keywords
species
gene family
sequences
software
amino acid
Application number
LU501941A
Other languages
French (fr)
Inventor
Xueyi He
Cheng Pan
Yuqun Niu
Original Assignee
Univ Jiliang China
Application filed by Univ Jiliang China
Priority to LU501941A
Application granted
Publication of LU501941B1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B10/00 ICT specially adapted for evolutionary bioinformatics, e.g. phylogenetic tree construction or analysis
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics

Landscapes

  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Animal Behavior & Ethology (AREA)
  • Physiology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The present invention provides a method for rapidly obtaining a target gene family of a genome-free species based on a transcriptome. More particularly, it relates to a visual method for rapidly, efficiently, accurately and comprehensively obtaining all sequences of the target gene family of a genome-free species, which requires neither professional bioinformatics knowledge nor the writing of code in R or on other software platforms. The method includes the following steps: selecting a reference species and a series of typological species, and, based on the target transcriptome data, using in combination the sequence search and alignment software BioEdit, the phylogenetic tree construction software MEGA6, and the online analysis software MEME. Using this method, all sequences of the PEBP gene family of Chrysanthemum morifolium Ramat. were successfully obtained and were verified by first-generation sequencing with an accuracy rate of 100%.

Description

METHOD FOR RAPIDLY OBTAINING TARGET GENE FAMILY OF GENOME-FREE SPECIES BASED ON TRANSCRIPTOME
TECHNICAL FIELD
The present invention relates to the field of transcriptome sequencing data analysis, more particularly to a method for analyzing the transcriptome of a genome-free species, and in particular to a method for rapidly obtaining a target gene family of a genome-free species based on transcriptome data produced by second-generation sequencing.
BACKGROUND ART
There are generally two methods for obtaining the sequences of a target gene family of a genome-free species. The first method finds the gene sequences of a typological species, designs degenerate primers, and obtains the sequences of the target gene family of the genome-free species by PCR cloning and first-generation sequencing. The second method obtains high-throughput sequencing data, such as a transcriptome, via second-generation sequencing, and searches directly for the sequences of the target gene family of the genome-free species with reference to the annotation information supplied by the sequencing company.
The first method requires reagents, equipment and a long time; it carries more uncertainty and higher costs, and results are difficult to obtain. The second method depends on the annotation information supplied by the sequencing company, which draws on seven commonly used major databases. Some of these, such as the nr database, hold a large amount of data because submissions are open, but contain more erroneous annotations. KEGG is more accurate, but it annotates only the more thoroughly studied genes or proteins. Other databases have further problems. Moreover, since bioinformatics analysts usually lack professional knowledge of the specific research object, results obtained with conventional parameters are poorly targeted. How to obtain the target gene family efficiently, accurately and comprehensively is therefore a difficulty in systematically studying a gene family.
However, researchers in biology are often unable to obtain accurate information directly from a transcriptome (especially a reference-free transcriptome), either because they lack the computing expertise needed to process transcriptome or genome data (writing code in R or on other software platforms) or because they lack the high-performance equipment needed to run the analysis. Meanwhile, most bioinformatics analysts have computer-related professional backgrounds and often lack biological expertise at the gene or protein level for a specific function. When processing data, they often rely on information from public databases to tune algorithms or parameters for screening target genes, so detailed studies of a particular gene family often contain many errors and omissions. In addition, because gene family members are highly similar in sequence, the assembly and annotation of a reference-free transcriptome readily produce incomplete sequences, misassemblies and annotation errors for them.
Therefore, there is an urgent need for a method that suits both biological researchers and bioinformatics analysts, requires neither professional bioinformatics knowledge nor the writing of code in R or on other software platforms, and can conveniently, accurately and rapidly obtain all sequences of the target gene family of a genome-free species from transcriptome data.
DETAILED DESCRIPTION OF THE INVENTION
In order to solve the above problems, the present invention provides a method for rapidly obtaining a target gene family of a genome-free species based on a transcriptome. More particularly, it relates to a visual method for rapidly, efficiently, accurately and comprehensively obtaining all sequences of the target gene family of a genome-free species, which requires neither professional bioinformatics knowledge nor the writing of code in R or on other software platforms. The method includes the following steps: selecting a reference species and a series of typological species, and, based on transcriptome data, using in combination the sequence search and alignment software BioEdit, the phylogenetic tree construction software MEGA6, and the online analysis software MEME. The method successfully obtains all sequences of the PEBP gene family of Chrysanthemum morifolium Ramat., and is verified by first-generation sequencing with an accuracy rate of 100%.
In one aspect, the present invention provides a method for rapidly obtaining a target gene family of a genome-free species, including the following steps:
(1) Selecting a reference species and a series of typological species, and, using whole-genome data of the reference species and the series of typological species together with transcriptome data of a species to be tested, constructing a local database of the target gene family of the series of typological species and the species to be tested with BioEdit software, the species to be tested being a genome-free species;
(2) Constructing a phylogenetic tree with MEGA6 software, so that the amino acid sequences in the local database of the target gene family of the species to be tested are classified according to their grouping on the phylogenetic tree;
(3) Performing motif analysis with MEME software;
(4) Aligning the amino acid sequences individually.
The series of typological species mentioned in the present invention refers to related species selected for the target gene family, covering different evolutionary stages, different morphological characteristics and the like.
In some embodiments, the specific version of the BioEdit software used in the present invention is BioEdit 7.2.6.1.
In some embodiments, the specific version of the MEGA6 software used in the present invention is MEGA 6.06.
In some embodiments, the specific version of the MEME software used in the present invention is MEME 5.4.1.
In some embodiments, the target gene family is the PEBP (phosphatidylethanolamine-binding protein) gene family.
In some embodiments, the series of typological species needs to be selected specifically for the PEBP gene family. Since the function of PEBP genes centers on flower bud differentiation and flower organ development, the series of typological species includes non-flowering plants as well as perennial and annual plants among the dicotyledonous and monocotyledonous plants; plants at different evolutionary stages may also be selected according to the degree of evolution of the PEBP gene family.
In some embodiments, the species to be tested is Chrysanthemum morifolium Ramat. At present, no genome of Chrysanthemum morifolium Ramat. has been published on a public website. The present invention obtains the transcriptome data of Chrysanthemum morifolium Ramat. by second-generation sequencing, selects a suitable reference species and series of typological species for the PEBP gene family, and then uses in combination the sequence search and alignment software BioEdit, the phylogenetic tree construction software MEGA6 and the online analysis software MEME to successfully obtain all sequences of the PEBP gene family of Chrysanthemum morifolium Ramat. The result is verified by first-generation sequencing, with an accuracy rate of 100%. It can be seen that the method provided by the present invention can quickly and accurately obtain all sequences of the target gene family of a genome-free species at a very low cost (for a single sample, library construction, second-generation sequencing and assembly cost only 600-800 yuan). Compared with first-generation sequencing, the present invention saves considerable manpower, materials and money.
It can be understood that, for any target gene family and any genome-free species, any method that, as provided by the present invention, obtains all sequences of the target gene family of the genome-free species quickly, accurately and at low cost, can be operated visually throughout without professional expertise, and does not require writing code in R or on other software platforms, falls within the protection scope of the present invention.
Further, step (1) further comprises: determining the member types of the target gene family of the reference species; the series of typological species comprises at least ten species that have whole-genome data and possess the target gene family; and the reference species is one of the series of typological species.
In some embodiments, for the PEBP gene family, researchers can first consult the relevant published literature to learn which species have been studied more extensively for the PEBP gene family and have relatively complete genomes, and thereby determine that Arabidopsis, rice and other well-studied species with relatively complete genomes can be selected as the reference species. Several publications were combined to confirm the member types of the PEBP gene family of Arabidopsis or rice (including FT, TFL1, MFT, etc.) and the structural domain PF00161. Meanwhile, the corresponding sequences uploaded by the authors of those publications are downloaded for reference.
In some embodiments, upon review of the published literature, the PEBP gene family can be divided into three categories: FT-like, TFL-like, and MFT-like.
In some embodiments, the present invention selects Arabidopsis as the reference species.
In some embodiments, the present invention searches for the members of the PEBP gene family by querying the Arabidopsis TAIR database and an online website, and using keywords PEBP, FT, TFL1, MFT, PF00161 and other PEBP gene family-related information to determine that the PEBP gene family of Arabidopsis has 6 members (that is, 6 protein sequences): AtFT (AT1G65480), AtTSF (AT4G20370), AtTFL1 (AT5G03840), AtATC (AT2G27550), AtBFT (AT5G62040) and AtMFT (AT1G18100).
In some embodiments, since the function of the PEBP gene family centers on flower bud differentiation and flower organ development, the typological species can be selected from the perennial and annual plants among the dicotyledonous and monocotyledonous plants. Meanwhile, plants representing different evolutionary stages of the PEBP gene family are selected. A non-flowering plant (such as Selaginella moellendorffii) and a moss (such as Physcomitrella patens) with a lower degree of evolution can also be selected. Therefore, in order to obtain the PEBP gene family sequences of Chrysanthemum morifolium Ramat., the series of typological species is selected as follows: a total of 13 species including Arabidopsis, Citrus clementina, Citrus sinensis, Cucumis sativus, Glycine max, Medicago truncatula, Oryza sativa, Physcomitrella patens, Populus trichocarpa, Prunus persica, Selaginella moellendorffii, Sorghum bicolor and Zea mays, of which Arabidopsis, as the reference species, is also one of the series of typological species.
Further, constructing the local database with the BioEdit software in step (1) comprises: obtaining the amino acid sequences and corresponding nucleotide sequences of the target gene family from the genome data of the series of typological species with the blastp program of the BioEdit software, and obtaining the amino acid sequences and the corresponding nucleotide sequences of the target gene family from the transcriptome data of the species to be tested with the blastx program and the blastp program of the BioEdit software.
The present invention comprehensively searches the complete genome data of the reference species and the series of typological species via the BioEdit software, and thereby obtains a more complete set of sequences of the target gene family. In some embodiments, since the amino acid sequence directly determines a protein's structure and function, the present invention determines the gene family data mainly from amino acid sequences, by constructing a complete amino acid sequence database.
For the species to be tested, which has no genome data, the complete amino acid sequence database is constructed mainly by translating the transcriptome data.
The blastp program can be used to search and compare protein sequences. The blastx program can be used to translate nucleic acid sequences and align the translations against amino acid sequences. Therefore, the amino acid sequences of the reference species and the series of typological species can be searched and aligned by blastp. The amino acid sequences of the species to be tested can be obtained by the blastx program, and blastp can be used to assist the search of the amino acid sequences of the species to be tested.
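The translation half of what blastx does can be illustrated with a short script. The following is a minimal sketch only, assuming the standard genetic code and plain DNA strings; it is not the BLAST implementation and performs no alignment. The function names are hypothetical and chosen for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: no
# alternative codon tables are needed for the transcripts being translated).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate a DNA string codon by codon; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Translate both strands in all three reading frames."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translations("ATGGCCTAA")
print(frames["+1"])  # forward strand, first frame
```

Each of the six translated frames could then be compared against the reference protein sequences, which is the step that blastx itself carries out.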
Further, when using the blastp program, the parameter of expectation value needs to be set as 10, and the homology is set as 40%. Search results need to be deduplicated, so that each amino acid sequence is retained only once under a unique number.
When using the blastp program to search for amino acid sequences, a higher expectation value and a lower homology are set so that more amino acid sequences are found. Repeated experiments show that setting the parameter of expectation value as 10“ and the homology as 40% retrieves the greatest number of amino acid sequences, thus ensuring that all amino acid sequences of the target gene family can be obtained comprehensively and completely. When the parameter of expectation value is set as 10™, or the homology is set as 80%, the number of amino acid sequences obtained is significantly reduced, which may make it difficult to obtain comprehensive gene family data.
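The filtering and deduplication rule described above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the BioEdit workflow itself: it assumes the hits have been exported in the 12-column tabular layout of NCBI BLAST+ (query, subject, percent identity, ..., e-value, bit score), and it treats "homology" as percent identity. The function name `filter_hits` is hypothetical, and the expectation-value cutoff is supplied by the caller; the `1e-5` in the usage lines is purely illustrative and is not taken from the text.

```python
def filter_hits(tabular_text, max_evalue, min_identity=40.0):
    """Return unique subject IDs passing both cutoffs, best e-value first.

    Assumes 12 tab-separated columns per line: subject ID in column 2,
    percent identity in column 3, e-value in column 11.
    """
    best_evalue = {}
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        subject = fields[1]
        identity = float(fields[2])
        evalue = float(fields[10])
        if evalue <= max_evalue and identity >= min_identity:
            # Deduplicate: a subject hit by several queries is kept once.
            if subject not in best_evalue or evalue < best_evalue[subject]:
                best_evalue[subject] = evalue
    return sorted(best_evalue, key=best_evalue.get)


# Hypothetical hits; the subject IDs are made up for illustration.
example = "\n".join([
    "AtFT\tseqA\t85.0\t170\t25\t0\t1\t170\t1\t170\t1e-90\t300",
    "AtTSF\tseqA\t80.0\t170\t34\t0\t1\t170\t1\t170\t1e-80\t280",
    "AtFT\tseqB\t35.0\t100\t65\t0\t1\t100\t1\t100\t1e-8\t50",
    "AtFT\tseqC\t45.0\t120\t66\t0\t1\t120\t1\t120\t0.5\t30",
    "AtMFT\tseqD\t55.0\t160\t72\t0\t1\t160\t1\t160\t1e-40\t150",
])
candidates = filter_hits(example, max_evalue=1e-5)
print(candidates)
```

Keying the result on the subject ID is what enforces the rule that each amino acid sequence retains only one unique number, even when several of the six query proteins hit it.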
Further, after constructing the local database with the BioEdit software, Pfam31.0 needs to be used to check the integrity of the protein domain, so that the protein domains of the target gene families of the series of typological species and of the species to be tested are 100% complete.
Verifying the integrity of the domain of the PEBP family helps to verify the integrity of the amino acid sequences of the PEBP family.
In some embodiments, the integrity of the domains of the target gene family can also be identified through SMART online website.
In some embodiments, annotations of similar sequences can also be viewed through the NCBI online website to help further judge and identify the integrity of the amino acid sequences.
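One way to read "the protein domain reaches 100%" is as full coverage of the domain model by the match. The sketch below illustrates that reading only; it is an assumption, not a description of how the Pfam website reports results. The inputs (`hmm_from`, `hmm_to`, `model_length`) mirror the coordinates commonly reported for profile-HMM domain hits, and all names and numbers are hypothetical.

```python
def domain_coverage(hmm_from, hmm_to, model_length):
    """Fraction of the domain model covered by a hit (1.0 means complete)."""
    if not (1 <= hmm_from <= hmm_to <= model_length):
        raise ValueError("expected 1 <= hmm_from <= hmm_to <= model_length")
    return (hmm_to - hmm_from + 1) / model_length


def split_by_completeness(hits, model_length, threshold=1.0):
    """Split (seq_id, hmm_from, hmm_to) hits into complete and incomplete IDs."""
    complete, incomplete = [], []
    for seq_id, hmm_from, hmm_to in hits:
        covered = domain_coverage(hmm_from, hmm_to, model_length)
        (complete if covered >= threshold else incomplete).append(seq_id)
    return complete, incomplete


# Hypothetical hits against a model of length 100.
hits = [("seq1", 1, 100), ("seq2", 11, 100), ("seq3", 1, 60)]
complete, incomplete = split_by_completeness(hits, model_length=100)
print(complete, incomplete)
```

Sequences that land in the incomplete list are the ones that would be rechecked for splicing or assembly errors before being accepted or discarded.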
Further, constructing the phylogenetic tree with the MEGA6 software in step (2) comprises: first importing the amino acid sequences of the series of typological species and the species to be tested, selecting Neighbor-Joining Tree under Phylogeny, and setting the bootstrap parameter as 1000.
Bootstrap is a resampling test used to assess the confidence level of each branch of the calculated phylogenetic tree. The larger the number of bootstrap replicates, the more reliable the estimated branch support. Setting the bootstrap parameter as 1000 is enough to meet the needs of the present invention for constructing the phylogenetic tree, and yields support values with a high confidence level.
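The resampling step behind a bootstrap test can be illustrated as follows. This is a minimal sketch assuming an alignment stored as a dictionary of equal-length strings; it only generates the resampled alignments and does not build Neighbor-Joining trees or compute branch support, which MEGA6 does internally. The function names are hypothetical.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement (one bootstrap replicate)."""
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[name]) != length for name in names):
        raise ValueError("all aligned sequences must have the same length")
    # The same column indices are applied to every sequence, so each
    # resampled column is still a true column of the original alignment.
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(alignment[name][i] for i in columns) for name in names}


def bootstrap_replicates(alignment, n_replicates=1000, seed=0):
    """Generate n_replicates resampled alignments with a fixed seed."""
    rng = random.Random(seed)
    return [bootstrap_replicate(alignment, rng) for _ in range(n_replicates)]


toy_alignment = {"a": "MKTA", "b": "MRTA"}  # hypothetical aligned sequences
replicates = bootstrap_replicates(toy_alignment, n_replicates=1000)
print(len(replicates))
```

A tree is built from each replicate, and the branch value reported on the final tree is the percentage of replicate trees in which that grouping appears.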
The phylogenetic tree is constructed by the MEGA6 software, and all amino acid sequences can thereby be clustered and analyzed. More similar amino acid sequences are clustered at closer bifurcation positions, and less similar amino acid sequences are clustered at more distant bifurcation positions.
Further, when the phylogenetic tree is constructed by the MEGA6 software described in step (2), if the branch value of two amino acid sequences is 100 and both belong to the same species, the two amino acid sequences need to be aligned with each other separately. If their similarity exceeds 99%, one of the amino acid sequences is removed. If their similarity does not reach 99%, both amino acid sequences are retained.
Further, after deleting the amino acid sequences with a similarity of over 99%, the remaining amino acid sequences are imported into the MEGA6 software again to reconstruct the phylogenetic tree.
In some embodiments, according to the literature searched in step (1), the PEBP gene family can be divided into three categories: FT-like, TFL-like and MFT-like. The phylogenetic tree is constructed, and the classes are determined according to the distribution of the Arabidopsis sequences. The amino acid sequences of the species to be tested, Chrysanthemum morifolium Ramat., and of the other series typological species are thereby preliminarily classified.
In some embodiments, when the phylogenetic tree is constructed from the amino acid sequences of the series of typological species and divided according to the Arabidopsis members, the entire phylogenetic tree can be divided into 3 major groups and 5 minor groups, where group A is FT-like, group B is TFL-like, and group C is MFT-like.
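The 99% rule above can be sketched as a small filter. This is a minimal sketch with simplifying assumptions: sequences are already aligned to equal length, similarity is computed as percent identity over aligned columns, and the additional condition that the two sequences share a branch value of 100 on the tree is not modeled. The function names and record layout are hypothetical.

```python
def percent_identity(a, b):
    """Percent identity of two aligned sequences of equal length."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    # Ignore columns where both sequences have a gap.
    pairs = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


def remove_near_duplicates(records, threshold=99.0):
    """Keep the first of any same-species pair whose identity exceeds threshold.

    records: list of (seq_id, species, aligned_sequence) tuples.
    """
    kept = []
    for seq_id, species, seq in records:
        is_duplicate = any(
            species == kept_species and percent_identity(seq, kept_seq) > threshold
            for _, kept_species, kept_seq in kept
        )
        if not is_duplicate:
            kept.append((seq_id, species, seq))
    return kept


# Hypothetical records for illustration.
records = [
    ("cm1", "chrysanthemum", "MKTAYIAKQR"),
    ("cm2", "chrysanthemum", "MKTAYIAKQR"),  # identical to cm1, same species
    ("cm3", "chrysanthemum", "MKTAYIAKQL"),  # 90% identical to cm1
    ("at1", "arabidopsis", "MKTAYIAKQR"),    # identical, but another species
]
print([seq_id for seq_id, _, _ in remove_near_duplicates(records)])
```

Restricting the comparison to sequences of the same species keeps true orthologs from different species in the dataset even when they are identical.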
Further, performing motif analysis with the MEME software in step (3) comprises: taking the reference species as the reference object, identifying and classifying the sequences by their conserved motifs, and distinguishing the types of the internal members of the target gene family of the species to be tested.
The completeness check of the PBP domain by Pfam31.0 in step (1) and the simple classification by the phylogenetic tree in step (2) may admit genes that are merely similar to the target genes, and cannot resolve incomplete transcriptome sequences or erroneous data arising from assembly. Therefore, it is also necessary to use the MEME online database for motif analysis. Through MEME, motif analysis can be performed on each class of sequences after preliminary classification. Further identification and classification can then be performed to distinguish the internal members of the gene family, while excluding sequences with large functional differences.
In some embodiments, sequences with large functional differences may not be target sequences, or may be misassembled genes, etc.; these can be removed by motif analysis.
Further, aligning the amino acid sequences individually in step (4) means the following: amino acid sequences that lie close together on the phylogenetic tree are expected to show consistent conserved motifs in the motif analysis. Where they do not, the amino acid sequences need to be aligned individually, sequence integrity is checked, and splicing errors are excluded.
In order to further improve the accuracy of the identification of PEBP members and the accuracy of their classification, the obtained PEBP members need to be further aligned and analyzed at the level of the amino acid sequence, on the basis of the major categories of the phylogenetic tree and the error elimination achieved by the MEME motif analysis.
In yet another aspect, the present invention provides use of a software combination for rapidly obtaining the target gene family of a genome-free species. The software combination includes BioEdit software, MEGA6 software, and MEME online software.
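The decision of which sequences to align individually can be sketched as a comparison of motif patterns within each tree group. This is a minimal sketch assuming that the motif order of each sequence has already been read off the MEME result by hand; it does not parse MEME output, and all function names, sequence IDs, group labels and motif labels are hypothetical.

```python
def flag_for_individual_alignment(motif_orders, group_of, reference_of_group):
    """Return IDs whose motif order differs from their group's reference.

    motif_orders: seq_id -> list of motif labels in sequence order.
    group_of: seq_id -> group label assigned from the phylogenetic tree.
    reference_of_group: group label -> seq_id of the reference member.
    """
    flagged = []
    for seq_id, motifs in motif_orders.items():
        reference_id = reference_of_group[group_of[seq_id]]
        if seq_id == reference_id:
            continue  # the reference defines the expected pattern
        if list(motifs) != list(motif_orders[reference_id]):
            flagged.append(seq_id)
    return sorted(flagged)


# Hypothetical motif patterns for illustration.
motif_orders = {
    "AtFT": ["m1", "m2", "m3"],
    "cmA": ["m1", "m2", "m3"],   # consistent with the reference
    "cmB": ["m1", "m3"],         # missing a motif: check integrity
    "AtMFT": ["m1", "m4"],
    "cmC": ["m1", "m4"],
}
group_of = {"AtFT": "A", "cmA": "A", "cmB": "A", "AtMFT": "C", "cmC": "C"}
reference_of_group = {"A": "AtFT", "C": "AtMFT"}
print(flag_for_individual_alignment(motif_orders, group_of, reference_of_group))
```

Only the flagged sequences then need the manual, sequence-by-sequence alignment in which integrity is checked and splicing errors are excluded.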
The present invention has the following beneficial effects:
(1) Constructing a visual method for rapidly, efficiently, accurately and comprehensively obtaining all sequences of the target gene family of a genome-free species, which requires neither professional bioinformatics knowledge nor the writing of code in R or on other software platforms;
(2) Successfully obtaining all sequences of the PEBP gene family of Chrysanthemum morifolium Ramat., the method being verified by first-generation sequencing with an accuracy rate of 100%;
(3) Quickly obtaining the target family genes of a genome-free species by simple and efficient means, by combining the sequence search and alignment functions of the BioEdit software with the classification and deduplication functions of the MEGA6 software and the MEME software, and finally checking the integrity of individual amino acid sequences by alignment and excluding splicing errors;
(4) Bridging the large information gap between traditional biological techniques and modern bioinformatics techniques for obtaining genetic information, which saves much of the time required by traditional biological techniques and corrects the limiting factors of modern bioinformatics techniques, and obtaining the target gene family of Chrysanthemum morifolium Ramat. comprehensively and quickly, without a reference genome and without experiments;
(5) Not using the annotation information of the target transcriptome, so as to avoid errors introduced by bioinformatics processing or by public databases. All software used in the process is visual bioinformatics software or online websites, so little expertise in bioinformatics is required. Complete sequences can be searched for and identified on the basis of a small amount of accurate literature information, thus enabling front-line researchers to get started quickly.
DESCRIPTION OF THE ATTACHED DRAWINGS
FIG. 1 is a setting interface in which the blastp program of BioEdit software in Example 1 sets a parameter of expectation value as 10 for a query object;
FIG. 2 is a search interface of the blastp program of BioEdit software in Example 1;
FIG. 3 is a search result interface of the blastp program of BioEdit software in Example 1;
FIG. 4 is an operation interface of identifying the domain integrity of a PEBP gene family by Pfam31.0 in Example 1;
FIG. 5 is an interface showing the Pfam31.0 result in which the protein domain of the PEBP gene family of Chrysanthemum morifolium Ramat. is identified as 100% complete in Example 1;
FIG. 6 is a phylogenetic tree of the PEBP gene families of Chrysanthemum morifolium Ramat. and the series typological species constructed by MEGA6 software in Example 1; and
FIG. 7 is a schematic diagram showing the results of motif analysis of the PEBP family of Chrysanthemum morifolium Ramat. with Arabidopsis as the reference object in Example 1.
DETAILED DESCRIPTION OF EMBODIMENTS
Preferred examples of the present invention will be described in further detail below with reference to the accompanying drawings. It should be noted that the following examples are intended to facilitate the understanding of the present invention, and do not limit it in any way. The raw materials and equipment used in the specific examples of the present invention are all known, commercially available products.
Example 1: Quickly obtaining a PEBP gene family of Chrysanthemum morifolium Ramat.
In this example, based on transcriptome data of Chrysanthemum morifolium Ramat., all amino acid sequences of the PEBP gene family of Chrysanthemum morifolium Ramat. are quickly obtained. A specific method includes the following steps:
I Checking existing data
1. Obtaining the transcriptome data of Chrysanthemum morifolium Ramat.: In this example, the transcriptome data of Chrysanthemum morifolium Ramat. were obtained by second-generation sequencing performed by Wuhan Huada Gene Technology Co., Ltd. The cost of second-generation sequencing of a single sample of Chrysanthemum morifolium Ramat. was 800 yuan. A total of 223,732 clean transcriptome data entries were obtained.
2. Understanding the PEBP gene family: After reviewing the published literature, the PEBP gene family could be divided into three subfamilies: FT-like, TFL-like and MFT-like. Different family members had different biological functions, such as acting as systemic signaling factors that regulate the flowering time of a plant, and participating in the development of inflorescence meristems and flower organs, in seed germination, and in other sexual or asexual developmental processes. In general, FT-like subfamily genes promoted flowering: FT was expressed in leaves as a florigen and transported to the meristem, where it interacted with FLOWERING LOCUS D (FD), so that the floral meristem gene AP1 was activated and expressed. Overexpression of FT-like genes advanced flowering time. TFL1 delayed flowering time, so that the plant maintained the characteristics of indeterminate growth. MFT-like subfamily genes mainly played a role in the development and germination of seeds.
3. Selecting the reference species: The relevant published literature was checked to learn which species had been studied more extensively for the PEBP gene family and had relatively complete genomes. In this example, Arabidopsis was selected as the reference species, and all the genome data of Arabidopsis were downloaded. The members of the PEBP gene family were searched for by querying the Arabidopsis TAIR database and an online website, using the keywords PEBP, FT, TFL1, MFT, PF00161 and other PEBP gene family-related information, which determined that the PEBP gene family of Arabidopsis had 6 members (that is, 6 protein sequences): AtFT (AT1G65480), AtTSF (AT4G20370), AtTFL1 (AT5G03840), AtATC (AT2G27550), AtBFT (AT5G62040) and AtMFT (AT1G18100).
4. Selecting the series typological species: Since the function of the PEBP gene family centers on flower bud differentiation and flower organ development, the typological species can be selected from the perennial and annual plants among the dicotyledonous and monocotyledonous plants. Meanwhile, plants representing different evolutionary stages of the PEBP gene family are selected. A non-flowering plant (such as Selaginella moellendorffii) and a moss (such as Physcomitrella patens) with a lower degree of evolution can also be selected. Therefore, in order to obtain the PEBP gene family sequences of Chrysanthemum morifolium Ramat., the series typological species were selected as follows: a total of 13 species including Arabidopsis, Citrus clementina, Citrus sinensis, Cucumis sativus, Glycine max, Medicago truncatula, Oryza sativa, Physcomitrella patens, Populus trichocarpa, Prunus persica, Selaginella moellendorffii, Sorghum bicolor and Zea mays, of which Arabidopsis, as the reference species, is also one of the series typological species. All genomic data for the 13 series typological species were downloaded.
II Constructing a local database of the PEBP family via BioEdit software
The specific version of the BioEdit software used in this example was BioEdit 7.2.6.1, a visual bioinformatics program built with Borland C++Builder 3.0. Only some basic biological knowledge was needed to operate it; computer expertise such as programming languages was not required.
1. All genomic data of Arabidopsis and 13 series typological species were taken as the local database. With the blastp program, each of six amino acid sequences (6 family members) of the PEBP family of Arabidopsis was set as “a query object”. The parameter of expectation value was set as 10°, as shown in FIG. 1. Obtained results were checked. Sequences with more than 40% homology were taken as candidate sequences. The corresponding genome number (FIG. 2, FIG. 3) was recorded. Search results need to be deduplicated, so that each amino acid sequence only retained a unique number. A local database of PEBP family genes of Arabidopsis was constructed. The obtained results were compared with the sequences reviewed and obtained from existing data. The consistency of the number of members, the consistency of the protein sequences of the same gene, etc were compared. If an increase in members was provided, the identification and analysis of a single gene were performed. Three online website databases of Pfam31.0, SMART and NCBI were used for identification.
The priority of PEBP family members in this example mainly took the priority of the integrity of a domain PBP in Pfam31.0 as an identification basis. Each member of each gene family had a unified structural domain. Structural domains of the PEBP families of different species were also consistent. The integrity of the amino acid series of the PEBP family was verified by verifying the integrity of the domain of the PEBP family.
The identification result was that the protein domain of the PEBP family of Arabidopsis reached 100%. The specific verification process was shown in FIG. 4, where (1) represented a complete domain, and (2)~(5) represented incomplete domains. If the incomplete domain was provided, it is necessary to check whether a gene splicing error was provided. If the sequences were wrong, the sequences needed to be adjusted until the complete domain was finally obtained.
Finally, the numbers and amino acid sequences of all PEBP members in Arabidopsis were obtained. The 6 amino acid sequences of Arabidopsis finally obtained in this example were completely consistent with the 6 PEBP family members obtained by querying the existing data.
After the domains of the PEBP families of the 13 species were verified, the protein domains were found to be 100% complete. Where a sequence contained an error, it was adjusted until the complete domain sequence was obtained, so that the final protein domain integrity reached 100%.
2. With the blastp program, each of the six amino acid sequences of the PEBP family of Arabidopsis was set as “the query object”. The expectation value parameter was set as 10⁻⁷ and the homology threshold as 40%. The genome sequences of the other species among the 13 series typological species were searched by the method above. The search results were deduplicated so that each amino acid sequence retained only one unique number. Three online databases, Pfam31.0, SMART and NCBI, were used for identification. It was verified that the protein domains of the PEBP gene families of the 13 series typological species reached 100%, and a local database of the PEBP families of the 13 series typological species was successfully constructed.
3. The transcriptome data of Chrysanthemum morifolium Ramat. were translated into amino acid sequences with the blastx program. Then, with the blastp program, each of the six amino acid sequences of the PEBP family of Arabidopsis was set in turn as “the query object”. The expectation value parameter was set as 10* and the homology threshold as 40%. The amino acid sequences of Chrysanthemum morifolium Ramat. were searched by the method above. The search results were deduplicated so that each amino acid sequence retained only one unique number. The online database Pfam31.0 was used for identification, and the consistent structural domain of Arabidopsis was used to verify the PEBP family domain of Chrysanthemum morifolium Ramat. After verification, the protein domain integrity of the PEBP gene family of Chrysanthemum morifolium Ramat. was 100% (FIG. 5). The local database of the PEBP family of Chrysanthemum morifolium Ramat. was thus successfully constructed (at this point, 43 sequences of the PEBP gene family of Chrysanthemum morifolium Ramat. had been obtained in total).
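The filtering and deduplication applied to the search results in steps 1 to 3 can be sketched in a few lines of Python. This is only an illustration, not the procedure BioEdit itself performs: it assumes the hits have been exported in the standard 12-column tab-separated BLAST layout (query, subject, percent identity, ..., E-value, bit score), and the function name `filter_hits` and the default cutoff of 1e-7 are hypothetical choices that should be replaced by the values actually used.

```python
from typing import Iterable


def filter_hits(
    lines: Iterable[str],
    max_evalue: float = 1e-7,    # assumed cutoff, adjust to the value actually used
    min_identity: float = 40.0,  # percent identity threshold taken from the text
) -> list[str]:
    """Return unique subject IDs that pass both thresholds, in first-seen order."""
    seen: set[str] = set()
    kept: list[str] = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        subject = fields[1]            # column 2: subject sequence ID
        identity = float(fields[2])    # column 3: percent identity
        evalue = float(fields[10])     # column 11: E-value
        if evalue <= max_evalue and identity >= min_identity and subject not in seen:
            seen.add(subject)          # deduplicate: one entry per sequence ID
            kept.append(subject)
    return kept
```

Running the function once over the pooled hits of all six query sequences yields one entry per candidate sequence, which corresponds to the requirement that each amino acid sequence retain only one unique number.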
III Constructing a phylogenetic tree via MEGA6 software
The version of the MEGA6 software used in this example was MEGA 6.06 (Molecular Evolutionary Genetics Analysis), a visual bioinformatics program. Only basic biological knowledge is needed to operate it; computer expertise such as programming languages is not required.
The basic operation of constructing the phylogenetic tree with the MEGA6 software was as follows. The retrieved PEBP protein sequences of the 13 series typological species and of Chrysanthemum morifolium Ramat. were stored in a txt file. The sequences were imported into the MEGA6 software, and Align by ClustalW was selected in Alignment. File was clicked to open the stored XX.mas file. Analyze was clicked to select Neighbor-Joining Tree in Phylogeny, and the bootstrap parameter was set as 1000. The results were exported for viewing. If the branch value (confidence level) of two sequences was 100 and the two sequences belonged to the same species, the two sequences were aligned separately. If their similarity exceeded 99%, one of the two sequences was removed. If their similarity did not reach 99%, both sequences were retained (the two sequences were considered to be different).
The deduplicated amino acid sequences, together with the amino acid sequences of all series typological species, were imported into the MEGA6 software again to construct the phylogenetic tree, so as to preliminarily classify the remaining sequences and rename the genes of Chrysanthemum morifolium Ramat. and the series typological species.
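The 99% similarity rule used for deduplication can be illustrated with a short Python sketch. It assumes the input holds only sequences that already meet the other two conditions (same species, branch value of 100) and that they come from one alignment, so all strings have equal length with `-` marking gaps. The names `percent_identity` and `drop_near_duplicates` are hypothetical, and identity over aligned columns is one possible reading of "similarity".

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity over aligned columns, ignoring columns that are gaps in both."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue  # column carries no information for this pair
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def drop_near_duplicates(aligned: dict[str, str], threshold: float = 99.0) -> list[str]:
    """Keep a sequence only if it is not more than `threshold` percent identical
    to a sequence that has already been kept."""
    kept: list[str] = []
    for name, seq in aligned.items():
        if all(percent_identity(seq, aligned[k]) <= threshold for k in kept):
            kept.append(name)
    return kept
```

Of two near-identical sequences, the one listed first is retained; which member of a pair to keep is a design choice the text leaves open.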
According to the existing literature, the PEBP gene family can be divided into three categories: FT-like, TFL-like and MFT-like. A phylogenetic tree was constructed, and classes were determined according to the distribution of the Arabidopsis sequences. The amino acid sequences of the species to be tested, Chrysanthemum morifolium Ramat., and of the other series typological species were preliminarily classified. The phylogenetic tree is shown in FIG. 6. The entire tree could be divided into 3 major groups and 5 minor groups, of which group A was FT-like, group B was TFL-like, and group C was MFT-like. Cm was Chrysanthemum morifolium Ramat. At was Arabidopsis. Cc was Citrus clementina. Csi was Citrus sinensis. Csa was Cucumis sativus. Gm was Glycine max. Mt was Medicago truncatula. Os was Oryza sativa. Ppa was Physcomitrella patens. Pt was Populus trichocarpa. Ppe was Prunus persica. Sm was Selaginella moellendorffii. Sb was Sorghum bicolor. Zm was Zea mays. In addition, 23 PEBP gene family sequences were obtained from Chrysanthemum morifolium Ramat.
IV Performing motif analysis with MEME software
The version of the MEME software used in this example was MEME 5.4.1, a visual bioinformatics program developed by Timothy Bailey from the Department of Pharmacology of the University of Nevada, Reno and William Stafford Noble from the Department of Genome Sciences of the University of Washington. Only basic biological knowledge is needed to operate it; computer expertise such as programming languages is not required.
Completeness identification of the PBP domain with Pfam31.0 and the simple classification obtained by constructing the phylogenetic tree with the MEGA6 software may still admit genes that merely resemble the target genes, and they cannot resolve incomplete transcriptome sequences or erroneous source data introduced during assembly. It was therefore also necessary to use the MEME online database for motif analysis. With MEME, motif analysis can be performed on each class of sequences after the preliminary classification, so that the internal members of the gene family are further identified and classified, while sequences with large functional differences are excluded.
Sequences with large functional differences may not be target sequences (for example, wrongly assembled genes), and these can be removed by motif analysis. However, sequences with large functional differences may also be different members of the same gene family. For example, the MFT genes in the PEBP gene family are quite different from the other two types of family members. It was therefore necessary to first divide the sequences into classes by phylogenetic tree classification and then perform motif analysis on each class separately, so that MEME would not directly exclude the MFT genes, which differ greatly from the other two types.
In this example, Arabidopsis was taken as the reference object to perform motif analysis for the PEBP family of Chrysanthemum morifolium Ramat. According to the domain sequences of the PEBP family, the maximum number of motifs was set as 15, and each motif was shorter than 50 amino acids. The results are shown in FIG. 7.
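The example used the MEME online service. For readers who prefer the command-line distribution of the MEME Suite, the same settings could be expressed roughly as follows. This is a hedged sketch: the helper names and file names are hypothetical, the options `-protein`, `-oc`, `-nmotifs` and `-maxw` are the MEME command-line options as understood here, and a maximum width of 50 only approximates the "shorter than 50 amino acids" setting. The code only builds the argument lists, one per class from the phylogenetic classification, and does not run MEME.

```python
def build_meme_command(
    fasta_path: str,
    out_dir: str,
    max_motifs: int = 15,  # maximum number of motifs, from the text
    max_width: int = 50,   # approximate upper bound on motif width
) -> list[str]:
    """Build an argument list for a protein motif search with command-line MEME."""
    return [
        "meme", fasta_path,
        "-protein",                 # input sequences are amino acid sequences
        "-oc", out_dir,             # output directory, overwritten if present
        "-nmotifs", str(max_motifs),
        "-maxw", str(max_width),
    ]


def commands_for_classes(class_to_fasta: dict[str, str]) -> dict[str, list[str]]:
    """One MEME run per class, so that divergent classes such as MFT-like
    are analyzed on their own rather than together with the other classes."""
    return {
        label: build_meme_command(path, f"meme_{label}")
        for label, path in class_to_fasta.items()
    }
```

Each argument list can be passed to `subprocess.run` on a machine where MEME is installed.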
It can be seen from FIG. 7 that a total of 9 PEBP gene family sequences were obtained from Chrysanthemum morifolium Ramat.
V Aligning individually with the amino acid sequences
Aligning the amino acid sequences individually means the following: amino acid sequences that lie close together on the phylogenetic tree are expected to show consistent conserved motifs in the motif analysis. Where they do not, the amino acid sequences need to be compared separately, sequence integrity checked, and splicing errors excluded.
In this example, in order to further improve the accuracy of the identification of PEBP members and of their classification, the obtained PEBP members were further aligned and analyzed at the level of amino acid arrangement, building on the major categories of the phylogenetic tree and the error elimination of the MEME motif analysis. According to the published literature and the PEBP members of Arabidopsis, the domain structure in Pfam31.0 and the conserved binding sites were annotated. Judgment was made according to the conserved sequences and the conserved binding sites in the domains. If a plurality of genes from a plurality of species lay close together in the phylogenetic tree, the conserved sequences and conserved binding sites of these genes were expected to be consistent or similar. Otherwise, the gene sequences might be wrong or incomplete, or the transcriptome might have been assembled incorrectly. The genes of the target species that were relatively close in the phylogenetic tree were aligned individually to check sequence integrity and exclude splicing errors.
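The check on conserved sites can be illustrated with a small Python sketch. It assumes the sequences of one tree cluster come from a single alignment (equal lengths, `-` for gaps) and that the positions of the conserved binding sites are already known as 0-based column indices. The function name, the site list and the majority-residue criterion are all hypothetical simplifications of the manual judgment described above.

```python
from collections import Counter


def deviating_sequences(aligned: dict[str, str], sites: list[int]) -> list[str]:
    """Return names of sequences whose residue differs from the majority residue
    at any of the given alignment columns (0-based)."""
    lengths = {len(seq) for seq in aligned.values()}
    if len(lengths) > 1:
        raise ValueError("all sequences must come from the same alignment")
    flagged: list[str] = []
    for name, seq in aligned.items():
        for site in sites:
            column = [other[site] for other in aligned.values()]
            majority = Counter(column).most_common(1)[0][0]
            if seq[site] != majority:
                flagged.append(name)  # candidate for a closer individual alignment
                break
    return flagged
```

A flagged sequence is not necessarily wrong. It is only a candidate for the individual alignment and integrity check.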
After the amino acid sequences were aligned individually, a total of 9 PEBP gene family sequences were obtained and were named CmFT1, CmFT2, CmFT3, CmTFL1, CmTFL2, CmTFL3, CmTFL4, CmMFT1 and CmMFT2 according to their classification and order. The phylogenetic tree was reconstructed using the obtained PEBP gene family of Chrysanthemum morifolium Ramat. and the PEBP gene family sequences of the series typological species, and the sequences were aligned and analyzed.
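The naming scheme (species prefix, class label, running number within the class) can be written down as a short Python sketch. The function name and the placeholder sequence identifiers are hypothetical.

```python
def name_members(prefix: str, members_by_class: dict[str, list[str]]) -> dict[str, str]:
    """Map each sequence ID to a gene name of the form <prefix><class><index>."""
    names: dict[str, str] = {}
    for class_label, sequence_ids in members_by_class.items():
        for index, sequence_id in enumerate(sequence_ids, start=1):
            names[sequence_id] = f"{prefix}{class_label}{index}"
    return names
```

With three FT-like, four TFL-like and two MFT-like sequences and the prefix `Cm`, this produces the nine names listed above.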
Example 2: Effect of parameter setting of BioEdit software on search results
In this example, according to the method provided in Example 1, the amino acid sequences of Chrysanthemum morifolium Ramat. were searched with the blastp program. AtFT of Arabidopsis was taken as “a query object”, and the expectation value parameter was set as 10°, 10%, 10°, and 10°, respectively. The Identities standard was set as 40%. The results are shown in Table 1.
Table 1: Effect of parameter setting of BioEdit software on search results
The number of sequences | 20 (with a large number of sequences by other “query objects”) 12 5 ee 1 T-
AtMFT of Arabidopsis was taken as “the query object”, and the expectation value parameter was set as 10°, 10*, 10%, and 10°, respectively. The Identities standard was set as 40%. The results are shown in Table 2.
Table 2: Effect of parameter setting of BioEdit software on search results
The number of sequences | 18 (with a large number of sequences by other “query objects”) 10 4 3 el 1
From the results in Table 1 and Table 2, it can be seen that the expectation value should not be set too low. When the expectation value was selected to be lower than 10“ and the Identities standard was 40%, more comprehensive sequences could be obtained without too many retrieval results overlapping those of other query objects. When the expectation value was selected as 10“ or 10, the search was more likely to be incomplete. However, when the expectation value was selected as 10°, many amino acid sequences with a non-similar domain were retrieved. That is, these sequences were also found when searching with other query objects, which brought a lot of repetitive work to subsequent operations.
Therefore, the expectation value parameter needed to be set as 10⁻⁷ and the Identities standard as 40%, so as to obtain as many amino acid sequences as possible, prevent omissions, and avoid too many repeated operations.
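The trade-off examined in this example, between completeness of the search and overlap with the results of other query objects, can be tabulated with a small Python sketch. The data layout (a list of `(subject, evalue, identity)` tuples per query), the function name `sweep` and the cutoffs passed to it are hypothetical placeholders and do not reproduce the values tested in Tables 1 and 2.

```python
def sweep(
    hits_by_query: dict[str, list[tuple[str, float, float]]],
    query: str,
    thresholds: list[float],
    min_identity: float = 40.0,
) -> dict[float, tuple[int, int]]:
    """For each E-value cutoff, return (number of hits for `query`,
    number of those hits that other queries also retrieve)."""

    def passing(q: str, cutoff: float) -> set[str]:
        return {
            subject
            for subject, evalue, identity in hits_by_query[q]
            if evalue <= cutoff and identity >= min_identity
        }

    summary: dict[float, tuple[int, int]] = {}
    for cutoff in thresholds:
        own = passing(query, cutoff)
        others: set[str] = set()
        for q in hits_by_query:
            if q != query:
                others |= passing(q, cutoff)
        summary[cutoff] = (len(own), len(own & others))
    return summary
```

A permissive cutoff tends to raise both numbers, while a strict cutoff lowers the overlap at the risk of missing family members, which is the pattern the two tables describe.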
Example 3: Effect of different identification methods on obtaining sequences of a PEBP gene family
In this example, according to the method provided in Example 1, after a local database was constructed by BioEdit software, three methods were used for classification and identification: 1. constructing a phylogenetic tree with only MEGA6 software; 2. performing motif analysis with only MEME software; 3. constructing the phylogenetic tree with MEGA6 software first, and then performing motif analysis with MEME software; and comparing individually the amino acid sequences.
The results are shown in Table 3.
Table 3: Effect of different identification methods on obtaining sequences of a PEBP gene family
Classification and identification methods | The number of sequences of a PEBP gene family
Method 1 | 13 (four mismatched genes presented)
Method 2 | 7 (deletion of two MFT genes)
Method 3 | 8
As can be seen from Table 3, the different classification and identification methods directly affected the accuracy of the PEBP family gene sequences obtained for Chrysanthemum morifolium Ramat.
When the phylogenetic tree was constructed with only the MEGA6 software, wrongly spliced sequences were difficult to find even if the repeated sequences were later compared by amino acid sequence.
When motif analysis was performed with only the MEME software, the MFT genes, which differ greatly from the other two types of family members, were easily excluded directly as erroneous genes, so accurate results could not be obtained either.
Accurate results could be guaranteed only when the combined analysis method was used, in which the phylogenetic tree was first constructed with the MEGA6 software and motif analysis was then performed with the MEME software.
Example 4: Verification for sequences of a PEBP gene family of Chrysanthemum morifolium Ramat.
In this example, the 9 sequences of the PEBP gene family of Chrysanthemum morifolium Ramat. identified in Example 1 were verified: CmFT1, CmFT2, CmFT3, CmTFL1, CmTFL2, CmTFL3, CmTFL4, CmMFT1, and CmMFT2. Full-length verification primers were designed at both ends of the predicted ORF.
After PCR cloning, gel cutting recovery, ligation transformation and colony culture, a single clone copy was verified by first-generation sequencing.
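Designing primers at both ends of a predicted ORF follows a simple rule: the forward primer reads the start of the ORF as written, and the reverse primer is the reverse complement of its end. The Python sketch below shows only this rule. The function names and the fixed primer length are hypothetical, and a real design would also weigh melting temperature, GC content and secondary structure, none of which is covered here.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def end_primers(orf: str, length: int = 20) -> tuple[str, str]:
    """Return (forward, reverse) primers taken from the two ends of an ORF.

    The forward primer is the first `length` bases of the ORF; the reverse
    primer is the reverse complement of the last `length` bases.
    """
    orf = orf.upper()
    if len(orf) < length:
        raise ValueError("ORF is shorter than the requested primer length")
    return orf[:length], reverse_complement(orf[-length:])
```

For a toy ORF `ATGAAACCCGGGTTTTAA` and a length of 6, the forward primer is `ATGAAA` and the reverse primer is `TTAAAA`.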
The designed primers are shown in Table 4, and the first-generation sequencing results are shown in Table 5.
Table 4: Primer sequences designed for identification of 9 sequences of a PEBP gene family of Chrysanthemum morifolium Ramat.
Genes | Forward primer sequence (5'-3') | Reverse primer sequence (5'-3')
CmFT1 | ATGTCTTGCAATAGGGAGGGT | TATCTTCTTCTGGCGGCATT
CmFT2 | ATGCCGAGGGAAAGGGA | TCCGTCTTCCACCAAATCTA
CmFT3 | ATGCCGAGGGAAAGGGA | TCCGTCTTCTACCAAATCCA
CmTFL1 | ATGTCGCTTGCAATAGGG | TCTTCTTCGTGCGGCATT
CmTFL2 | ATGTCTTGCAATAGGGAGGGT | TTATCTTCTTCTGGCGGCATT
CmTFL3 | ATGGCAAGATTAACTTCGGG | TCGTCTTCTGGGAGCTGTTT
CmTFL4 | ATGTCAAGAATGAATGAGCCA | ATCTTCTACGGGCTGCATTT
CmMFT1 | ATGGCAAGATTAACTTCGGG | TCGTCTTCTGGGAGCTGT
CmMFT2 | ATGOOTICAACGOCATTITA | AACCGAGAITCTICCTCTICCTAG
Table 5: Comparison with the first-generation sequencing results
Gene names | Similarity between the sequences obtained by the proprietary method and first-generation sequencing, per sample
CmFT1 | 100% 100% 100% 100% 100%
CmFT2 | 100% 100% 100% 100% 100%
CmFT3 | 100% 100% 100% 100% 100%
CmTFL1 | 100% 100% 100% 100% 100%
CmTFL2 | 100% 100% 100% 100% 100%
CmTFL3 | 100% 100% 100% 100% 100%
CmTFL4 | 100% 100% 100% 100% 100%
CmMFT1 | 100% 100% 100% 100% 100%
CmMFT2 | 100% 100% 100% 100% 100%
According to the first-generation sequencing verification, whose results are shown in Table 5, the nine amino acid sequences of the PEBP gene family of Chrysanthemum morifolium Ramat. obtained by combined analysis of the transcriptome data with the BioEdit, MEGA6 and MEME software were 100% correct.
Although the present invention is disclosed above, it is not limited thereto. For example, the present invention can be extended according to its range of application in molecular biology. Those skilled in the art can make changes and modifications without departing from the spirit and scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the scope defined by the claims.

Claims (4)

CLAIMS What is claimed is:
1. A method for rapidly obtaining a target gene family of a genome-free species, comprising the following steps: (1) selecting a reference species and a series typological species, using whole-genome data of the reference species and the series typological species, and transcriptome data of species to be tested, and constructing a local database of a target gene family of the series typological species and the species to be tested with BioEdit software; the species to be tested being a genome-free species; (2) constructing a phylogenetic tree with MEGA6 software, so that amino acid sequences in the local database of the target gene family of the species to be tested are classified with sorting rules of the phylogenetic tree; (3) performing motif analysis with MEME software; and (4) aligning individually the amino acid sequences.
2. The method according to claim 1, wherein step (1) further comprises determining a member type of the target gene family of the reference species; the series typological species comprise at least ten species with the whole-genome data, all of which have the target gene family; and the reference species is one of the series typological species.
3. The method according to claim 2, wherein constructing the local database with the BioEdit software in step (1) comprises: obtaining the amino acid sequences and corresponding nucleotide sequences of the target gene family from the genome data of the series typological species with blastp program of the BioEdit software, and obtaining the amino acid sequences and the corresponding nucleotide sequences of the target gene family from the transcriptome data of the species to be tested with blastx program and the blastp program of the BioEdit software.
4. The method according to claim 3, wherein constructing the phylogenetic tree with the MEGA6 software in step (2) comprises: necessarily first importing amino acid sequences of the series typological species and the species to be tested, selecting Neighbor-Joining Tree in Phylogeny, and setting a parameter of bootstrap as 1000.
LU501941A 2022-04-26 2022-04-26 Method for rapidly obtaining target gene family of genome-free species based on transcriptome LU501941B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
LU501941A LU501941B1 (en) 2022-04-26 2022-04-26 Method for rapidly obtaining target gene family of genome-free species based on transcriptome

Publications (1)

Publication Number Publication Date
LU501941B1 true LU501941B1 (en) 2022-10-26

Family

ID=83852951

Country Status (1)

Country Link
LU (1) LU501941B1 (en)

Legal Events

Date Code Title Description
FG Patent granted

Effective date: 20221026