CN109584968B

CN109584968B - Method for screening new genes involved in biological process regulation

Info

Publication number: CN109584968B
Application number: CN201811428144.1A
Authority: CN
Inventors: 赵磊; 何欣叶; 尚钰轩; 姚婷婷; 宓东; 孙野青
Original assignee: Dalian Maritime University
Current assignee: Dalian Maritime University
Priority date: 2018-11-27
Filing date: 2018-11-27
Publication date: 2022-09-23
Anticipated expiration: 2038-11-27
Also published as: CN109584968A

Abstract

The invention discloses a method for screening new genes involved in the regulation of a biological process. Using a bioinformatics approach, the method retrieves the semantic relations of a specific biological process with a Gene Ontology search tool; retrieves, from the genome annotation databases of different species, the gene set information annotated to those GO terms in each species; performs homology analysis on the gene set information of a specific species to identify its orthologous genes in the species under study; and compares these orthologous genes with the reference genes involved in the biological process in the species under study, thereby screening new genes involved in the regulation of the biological process in that species. Based on the fact that specific biological processes are highly conserved among different species, the invention establishes a method for screening new genes involved in biological process regulation, supports further reconstruction of the gene regulatory network of a system, and is significant for early diagnosis, personalized treatment, and drug research and development.

Description

Method for screening new gene participating in regulation and control of biological process

Technical Field

The invention belongs to the technical field of bioinformatics and relates to a method for screening new genes involved in the regulation of a biological process.

Background

A gene regulatory network (GRN) is a complex network that represents the interactions between genes. Gene regulatory networks make it possible to analyze and understand these interactions systematically and to understand the mechanisms of cellular life activities, and thereby to find the key genes that cause disease. This is crucial for the treatment of complex diseases (e.g., radiation-induced cancer), and research in this field has become one of the hot spots in bioinformatics and systems biology. In recent years, gene regulatory networks have been increasingly applied to disease gene prediction, drug target screening, and related research, and they play an important supporting role in early diagnosis, personalized treatment, and drug research and development. Therefore, how to reconstruct the gene regulatory network of a specific biological process, i.e., how to identify the transcription-level expression regulation relationships between the genes involved in that process as well as the new genes participating in the regulatory network, is one of the important and urgent problems in the field, and is of both theoretical and practical value.

Traditionally, gene regulatory networks are reconstructed from biological or medical experiments, which requires studying gene expression regulation relationships under different experimental conditions. This approach has many disadvantages, such as long duration and high cost, and in most cases it is difficult to identify new genes participating in a specific gene regulatory network, which limits the reconstruction of gene regulatory networks and their application. Screening new genes involved in the regulation of specific biological processes by bioinformatics methods has therefore become one of the most closely watched approaches in current research, and is important for subsequently building systems biology models and predicting the behavior of gene regulatory networks.

The Gene Ontology (GO) is a gene function annotation tool widely used in bioinformatics. It covers three aspects of gene function: molecular function, biological process, and cellular component. A molecular function describes the activity of an individual molecule at the biological level, such as catalytic or binding activity. A biological process describes an ordered, multi-step biological cascade carried out by a series of molecules with specific molecular functions. Moreover, biological processes are generally highly conserved among different species, which is of great significance for identifying new genes participating in a gene regulatory network when that network is reconstructed.

Disclosure of Invention

The present invention aims to overcome the above shortcomings of the prior art and provides a bioinformatics method for screening new genes involved in the regulation of a biological process. By finding new genes participating in a specific gene regulatory network, the method supports further reconstruction of the gene regulatory network of a system. The invention is significant for early diagnosis, personalized treatment, and drug research and development.

To achieve this purpose, the invention adopts the following technical solution: a method for screening new genes involved in the regulation of a biological process, comprising the following steps:

Step 1: search the semantic relations of the biological process to be studied in a Gene Ontology (GO) search tool to obtain the Gene Ontology term (GO term) of the biological process to be studied;

the biological process to be studied is the subject of the research. There are many types of biological processes, such as DNA repair processes (including base excision repair, nucleotide excision repair, mismatch repair, homologous recombination repair, non-homologous end joining repair, etc.) and metabolic processes (including phosphate metabolic process, protein metabolic process, lipid metabolic process, etc.); the process is usually selected according to the needs of the research.

Step 2: using the GO term of the biological process to be studied obtained in step 1 as a filter condition, search a gene annotation database and retrieve the information of the gene sets A_i and A_j annotated to that GO term in the specific species i and j, respectively; the gene set A_i and A_j information includes gene number (gene stable ID), gene name, protein number (protein stable ID), protein reference sequence information (RefSeq peptide ID), GO term number (GO term accession), GO term name, etc.;

the species i and j to be studied may be of many kinds, whether well known in biology or not, as long as they are contained in the database; in the examples, Arabidopsis thaliana and human (Homo sapiens) are used.

Step 3: load the protein reference sequence information (RefSeq peptide ID) of the gene set A_j obtained in step 2 into a homologous protein search tool, look up the ortholog group number (Ortho Group ID), and, based on the Ortho Group ID, retrieve and download the protein molecules of the orthologous genes, in the species i to be studied, of each gene in gene set A_j, to obtain the information of the corresponding protein set B_i; the protein set B_i information includes protein number (protein accession), protein name, etc.;

Step 4: load the protein numbers (protein accession) of the protein set B_i obtained in step 3 into a gene annotation database, and retrieve and download the gene corresponding to each protein molecule in protein set B_i, to obtain the information of the corresponding orthologous gene set C_i; the gene set C_i information includes gene number (gene stable ID), gene name, GO term number (GO term accession), GO term name, etc.; here C_i denotes the set of orthologous genes in species i.

Step 5: analyze the gene set A_i obtained in step 2 and the orthologous gene set C_i obtained in step 4 in statistical analysis software, and compute the information of the set D_i of new genes involved in the regulation of the biological process of step 1 in species i, according to the formula D_i = A_i ∪ C_i - A_i; the gene set D_i information includes gene number (gene stable ID), gene name, GO term number (GO term accession), GO term name, etc.
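The set operation of the final step can be illustrated with a minimal Python sketch. The gene identifiers are made-up placeholders, not data from the invention, and Python is only one possible tool (the description names Excel and SPSS as examples). Because the subtraction removes every element of A_i from the union, the result is the same as the orthologs in C_i that are not already annotated in A_i.

```python
def screen_new_genes(a_i, c_i):
    """Return D_i = (A_i union C_i) minus A_i for two collections of gene IDs."""
    a_i, c_i = set(a_i), set(c_i)
    return (a_i | c_i) - a_i


# Hypothetical placeholder identifiers, for illustration only.
# A_i: genes already annotated to the GO term in species i.
A_i = {"GENE_001", "GENE_002", "GENE_003"}
# C_i: orthologs in species i of the genes annotated in species j.
C_i = {"GENE_002", "GENE_003", "GENE_104", "GENE_105"}

D_i = screen_new_genes(A_i, C_i)
print(sorted(D_i))  # ['GENE_104', 'GENE_105']
```

Using sets rather than spreadsheet columns makes duplicates irrelevant and keeps the union and subtraction explicit.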

Further, in the above technical solution, the method specifically includes the following steps:

a) in step 1, the Gene Ontology search tool may be the QuickGO database (https://www.ebi.ac.uk/QuickGO/); with QuickGO, the lower-level terms of the biological process to be studied can be searched repeatedly until a semantic relation graph meeting the selection criteria is obtained;

the "selection criteria" above are generally set by the researcher. For example, within the "DNA repair process", if the researcher only wants to study "homologous recombination repair" and "non-homologous end joining repair", only these two terms need to be selected when determining the semantic relations, rather than retrieving the complete semantic relation graph of the "DNA repair process" (the semantic relation graphs of some biological processes are quite complex).

b) in step 2, the gene annotation database may be accessed with the BioMart tool of the animal genome annotation database Ensembl (http://asia.ensembl.org/index.html) or of the plant genome annotation database Ensembl Plants (http://plants.ensembl.org/index.html);

c) the homologous protein search tool may be the protein sequence search tool (identity protein sequence) of the OrthoMCL database (http://orthomcl.org/orthomcl/) (http://orthomcl.org/orthomcl/showQuestion.do?questionFullName=SequenceQuestions.ByIdList);

d) in step 4, the gene annotation database is selected, according to the type of species, from the animal genome database Ensembl (http://asia.ensembl.org/index.html), the plant genome database Ensembl Plants (http://plants.ensembl.org/index.html), and the NCBI database (https://www.ncbi.nlm.nih.gov/);

selecting the gene annotation database according to the species means, in the context of this method, that Arabidopsis thaliana, being a plant, is annotated with the plant gene annotation database Ensembl Plants, whereas human, being an animal, is annotated with the animal gene annotation database Ensembl; alternatively, the Gene database of the National Center for Biotechnology Information (NCBI) may be used.

e) in step 5, the statistical analysis software may be any statistical software, such as Microsoft Office Excel or SPSS 17.0. The statistical analysis consists of the set union and set subtraction operations on the gene sets in step 5.
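The repeated search of lower-level terms described in item a) can be sketched as a graph walk followed by a filter on the researcher's selection criteria. This is only an illustration under stated assumptions: the child relations below are a small hand-written subset of the ontology rather than a QuickGO query result, and the function names are hypothetical.

```python
from collections import deque

# Simplified, hand-written subset of GO child relations, for illustration
# only; the real ontology contains many more terms and relations.
CHILDREN = {
    "GO:0006281": [  # DNA repair
        "GO:0006284",  # base-excision repair
        "GO:0006289",  # nucleotide-excision repair
        "GO:0006298",  # mismatch repair
        "GO:0006302",  # double-strand break repair
    ],
    "GO:0006302": [
        "GO:0000724",  # double-strand break repair via homologous recombination
        "GO:0006303",  # double-strand break repair via nonhomologous end joining
    ],
}


def descendants(root, children):
    """Breadth-first walk collecting every lower-level term below root."""
    seen, queue = [], deque([root])
    while queue:
        term = queue.popleft()
        for child in children.get(term, []):
            if child not in seen:
                seen.append(child)
                queue.append(child)
    return seen


def select_terms(root, children, wanted):
    """Keep only the descendants that meet the selection criteria."""
    return [term for term in descendants(root, children) if term in wanted]


# Selection criteria from the example in the text: only HR and NHEJ.
wanted = {"GO:0000724", "GO:0006303"}
selected = select_terms("GO:0006281", CHILDREN, wanted)
print(selected)  # ['GO:0000724', 'GO:0006303']
```

In practice the child relations would come from the search tool itself; the walk and the filter stay the same.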

Further, according to the method of claim 1, the specific species i and j are species with known genomic information contained in the gene annotation database; the relationship between species i and j may be i = j or i ≠ j, adjusted as required. That is, i and j may be the same species, for example using human itself as the reference to find new genes, or different species, with one serving as the reference for finding new genes in the other.

Further, according to the method of claim 1 or 2, the method for screening new genes involved in the regulation of a biological process is a bioinformatics screening algorithm based on the fact that the biological process to be studied is highly conserved among different species.

According to the technical scheme, the invention has the following beneficial effects:

Based on the fact that specific biological processes are highly conserved among different species, the invention uses a bioinformatics approach to retrieve the semantic relations of a specific biological process with a Gene Ontology search tool; it then retrieves, from the genome annotation databases of different species, the gene set information annotated to those GO terms in each species; performs homology analysis on the gene set information of a specific species to identify its orthologous genes in the species under study; and compares these orthologous genes with the reference genes of the species under study, thereby screening new genes involved in the regulation of the biological process in that species. The invention thus establishes a method for screening new genes involved in biological process regulation, supports further reconstruction of the gene regulatory network of a system, and is significant for early diagnosis, personalized treatment, and drug research and development.

Drawings

The invention has 2 accompanying drawings:

FIG. 1 is a flow chart of an implementation of the method of the present invention;

FIG. 2 is a semantic graph of the radiation response obtained using QuickGO.

Detailed Description

The invention is described in detail below with reference to the accompanying drawings.

FIG. 1 is a flow chart of a method for screening for novel genes involved in the regulation of biological processes according to the present invention.

In this example, as shown in FIG. 1, the invention is applied as a method for screening new genes involved in the regulation of the DNA repair process in human, where species i is human (Homo sapiens) and species j is Arabidopsis thaliana; it comprises the following steps:

S1: in the Gene Ontology (GO) search tool QuickGO (https://www.ebi.ac.uk/QuickGO/), the semantic relations of DNA repair are retrieved and 5 different types of DNA repair GO terms are selected, namely "base-excision repair" (GO:0006284), "nucleotide-excision repair" (GO:0006289), "mismatch repair" (GO:0006298), "double-strand break repair via homologous recombination" (GO:0000724), and "double-strand break repair via nonhomologous end joining" (GO:0006303), abbreviated as "BER", "NER", "MMR", "HR", and "NHEJ", respectively; the resulting semantic relation graph for DNA repair is shown in FIG. 2;

S2: the 5 different types of DNA repair GO terms obtained in step S1 are each used as a filter condition and loaded into the BioMart tool of the animal genome database Ensembl (version: Release 93) (http://asia.ensembl.org/index.html), and the information data of the gene sets A_i (i denoting a specific DNA repair GO term) annotated to these GO terms in human (H. sapiens) are retrieved, 243 genes in total (the specific gene set information is omitted); the numbers of genes annotated to the "BER", "NER", "MMR", "HR", and "NHEJ" terms are 35, 43, 25, 79, and 61, respectively, and the results are shown in Table 1;

S3: the 5 different types of DNA repair GO terms obtained in step S1 are each used as a filter condition and loaded into the BioMart tool of the plant genome database Ensembl Plants (version: Release 40) (http://plants.ensembl.org/index.html), and the information data of a set of 151 genes annotated to these GO terms in Arabidopsis thaliana (A. thaliana), a species contained in the database, are retrieved (the specific gene set information is omitted); the numbers of genes annotated to the "BER", "NER", "MMR", "HR", and "NHEJ" terms are 34, 25, 17, 64, and 14, respectively, and the results are shown in Table 2;

TABLE 1 number of genes in human (Homo sapiens) annotated to 5 DNA repair semantics, respectively

TABLE 2 number of genes in Arabidopsis thaliana (Arabidopsis thaliana) annotated to 5 DNA repair semantics respectively
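The per-term gene counts reported in Tables 1 and 2 come from filtering annotation records by GO term and counting distinct genes. A minimal sketch of that tally follows, assuming the annotation export is available as (gene stable ID, GO term accession) pairs; the rows and gene identifiers are invented placeholders, not values from the tables.

```python
from collections import defaultdict

# Hypothetical annotation rows (gene stable ID, GO term accession), as they
# might be exported from a gene annotation database; IDs are placeholders.
rows = [
    ("GENE_001", "GO:0006284"),
    ("GENE_002", "GO:0006284"),
    ("GENE_002", "GO:0006289"),
    ("GENE_003", "GO:0000724"),
    ("GENE_003", "GO:0000724"),  # duplicate row, counted once
    ("GENE_004", "GO:0008152"),  # metabolic process, not a selected term
]

# Selected GO terms used as the filter condition, with their abbreviations.
selected_terms = {"GO:0006284": "BER", "GO:0006289": "NER", "GO:0000724": "HR"}


def genes_per_term(rows, selected_terms):
    """Group distinct gene IDs by the abbreviation of each selected GO term."""
    gene_sets = defaultdict(set)
    for gene_id, go_id in rows:
        if go_id in selected_terms:
            gene_sets[selected_terms[go_id]].add(gene_id)
    return dict(gene_sets)


gene_sets = genes_per_term(rows, selected_terms)
counts = {name: len(genes) for name, genes in gene_sets.items()}
print(counts)  # {'BER': 2, 'NER': 1, 'HR': 1}
```

A gene annotated to more than one term, like GENE_002 here, is counted under each term, so per-term counts need not sum to the number of distinct genes.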

S4: the protein reference sequence information (RefSeq peptide ID) of the gene sets obtained in step S3 is loaded into the homologous protein search tool OrthoMCL (http://orthomcl.org/orthomcl/), the ortholog group number (Ortho Group ID) is looked up, and, according to the Ortho Group ID, the protein molecules of the orthologous genes in human (H. sapiens) of each gene in the gene sets of step S3 are retrieved and downloaded, giving the information data of the corresponding protein molecule sets (the specific protein molecule set information is omitted);

S5: the protein numbers (protein accession) of the protein sets obtained in step S4 are loaded into the gene annotation database Ensembl (http://asia.ensembl.org/index.html), and the gene corresponding to each protein molecule in the protein molecule sets of step S4 is retrieved and downloaded, giving the corresponding gene sets C_i (i denoting a specific DNA repair GO term) (the specific orthologous gene set information is omitted); 75 orthologous genes are obtained in total, and the results are shown in Table 3;

TABLE 3 number of orthologous genes in human (Homo sapiens) of the genes annotated to the 5 DNA repair semantics in Arabidopsis thaliana
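Steps S4 and S5 amount to a chain of lookups: RefSeq peptide ID in species j, to ortholog group, to member proteins in species i, to genes in species i. The sketch below shows that chain with plain dictionaries. All identifiers and the three lookup tables are hypothetical placeholders standing in for the search tool and annotation database results.

```python
# All identifiers below are hypothetical placeholders.
# RefSeq peptide ID (species j) -> ortholog group ID
peptide_to_group = {
    "PEP_J_0001": "OG_10",
    "PEP_J_0002": "OG_11",
    "PEP_J_0003": "OG_12",  # group with no member in species i
}
# ortholog group ID -> protein accessions of the group members in species i
group_to_proteins_i = {
    "OG_10": ["PROT_I_A"],
    "OG_11": ["PROT_I_B", "PROT_I_C"],
}
# protein accession (species i) -> gene stable ID (species i)
protein_to_gene_i = {
    "PROT_I_A": "GENE_I_1",
    "PROT_I_B": "GENE_I_2",
    "PROT_I_C": "GENE_I_2",  # two proteins encoded by the same gene
}


def orthologous_genes(peptides_j):
    """Map species-j peptides to protein set B_i and gene set C_i."""
    b_i = set()
    for peptide in peptides_j:
        group = peptide_to_group.get(peptide)
        b_i.update(group_to_proteins_i.get(group, []))
    c_i = {protein_to_gene_i[p] for p in b_i if p in protein_to_gene_i}
    return b_i, c_i


B_i, C_i = orthologous_genes(["PEP_J_0001", "PEP_J_0002", "PEP_J_0003"])
print(sorted(B_i))  # ['PROT_I_A', 'PROT_I_B', 'PROT_I_C']
print(sorted(C_i))  # ['GENE_I_1', 'GENE_I_2']
```

Collecting results in sets means a gene reached through several proteins or several ortholog groups appears only once in C_i.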

S6: the gene sets obtained in step S2 and the orthologous gene sets obtained in step S5 are processed in the statistical analysis software Microsoft Office Excel according to the formula D_i = A_i ∪ C_i - A_i (i denoting a specific DNA repair GO term) to obtain the new genes D_i in human (H. sapiens) involved in the regulation of the DNA repair process of step S1; a total of 16 new genes involved in the regulation of the DNA repair process are found, as shown in Table 4; information on the sets of new genes involved in the regulation of the DNA repair process is shown in Tables 5 and 6.

S7: the new genes involved in the regulation of the DNA repair process obtained in step S6 are compared with published domestic and international literature (Molecular Cell, 2017, 68(1): 61-75; Proceedings of the National Academy of Sciences of the United States of America, 2016, 113(13): 3515-3520; EMBO Journal, 2007, 26: 2094-2103; Critical Reviews in Biochemistry and Molecular Biology, 2017, 52(6): 696-714); the method of the invention can be used to screen new genes involved in the regulation of biological processes, supports further reconstruction of the gene regulatory network of a system, and provides a new reference technique for the early diagnosis and treatment of diseases and for drug research and development.

TABLE 4 number of novel genes involved in the DNA repair process in humans (Homo sapiens)

TABLE 5 novel genes involved in the repair Process by Homologous Recombination (HR) in humans (Homo sapiens)

TABLE 6 novel genes involved in the repair Process of non-homologous end joining (NHEJ) in humans (Homo sapiens)

Although the present invention has been described in detail with reference to specific embodiments, it is not limited to these details and examples; various changes and modifications can be made without departing from the spirit and scope of the invention.

Claims

1. A method for screening for a novel gene involved in regulation of a biological process, comprising the steps of:

Step 1: searching the semantic relations of the biological process to be studied in a Gene Ontology (GO) search tool to obtain the GO term of the biological process to be studied;

Step 2: using the Gene Ontology term (GO term) of the biological process to be studied obtained in step 1 as a filter condition, searching a gene annotation database and retrieving the information of the gene sets A_i and A_j annotated to that GO term in the species i and j to be studied, respectively; the gene set A_i and A_j information includes gene number (gene stable ID), gene name, protein number (protein stable ID), protein reference sequence information (RefSeq peptide ID), GO term number (GO term accession), and GO term name;

Step 3: loading the protein reference sequence information (RefSeq peptide ID) of the gene set A_j obtained in step 2 into a homologous protein search tool, looking up the ortholog group number (Ortho Group ID) of each entry, and downloading the search results to obtain the protein molecules of the orthologous genes, in the species i to be studied, of each gene in gene set A_j, thereby obtaining the information of the corresponding protein set B_i; the protein set B_i information includes protein number (protein accession) and protein name;

Step 4: loading the protein numbers (protein accession) of the protein set B_i obtained in step 3 into a gene annotation database, and retrieving and downloading the gene corresponding to each protein molecule in protein set B_i, to obtain the information of the corresponding orthologous gene set C_i; the gene set C_i information includes gene number (gene stable ID), gene name, GO term number (GO term accession), and GO term name;

Step 5: analyzing the gene set A_i obtained in step 2 and the orthologous gene set C_i obtained in step 4 in statistical analysis software, and computing the set of new genes involved in the regulation of the biological process of step 1 in the species i to be studied, named set D_i, according to the formula D_i = A_i ∪ C_i - A_i; the gene set D_i information includes gene number (gene stable ID), gene name, GO term number (GO term accession), and GO term name.

2. The method of claim 1, wherein: the Gene Ontology search tool in step 1 may be the QuickGO database; the lower-level terms of the biological process to be studied are searched repeatedly with the QuickGO database until a semantic relation graph meeting the selection criteria is obtained.

3. The method of claim 1, wherein: the gene annotation database in step 2 or step 4 may be the animal genome annotation database Ensembl or the plant genome annotation database Ensembl Plants; the tool used is the BioMart tool.

4. The method of claim 1, wherein: the homologous protein search tool in step 3 may be the protein sequence search tool (identity protein sequence) of the OrthoMCL database.

5. The method of claim 1, wherein: in step 4, the gene annotation database is selected according to the species from the animal genome database Ensembl, the plant genome database Ensembl Plants, or the Gene database of the National Center for Biotechnology Information (NCBI).

6. The method of claim 1, wherein: the statistical analysis software in step 5 is Microsoft Office Excel or SPSS 17.0 statistical software.

7. The method of claim 1, wherein: the specific species i and j are species with known genome information contained in a gene annotation database; the relationship between the specific species i and j may be i = j or i ≠ j, adjusted as required.