CN115223657B - Medicinal plant transcriptional regulation map prediction method - Google Patents

Publication number
CN115223657B
Authority
CN
China
Prior art keywords
medicinal plant
predicted
transcription
regulation
medicinal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211140336.9A
Other languages
Chinese (zh)
Other versions
CN115223657A (en
Inventor
任艳姣
兰杰
梁艳春
于合龙
李高阳
柴世民
郭宏亮
高一萌
马丽
韩烨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jilin Agricultural University
Original Assignee
Jilin Agricultural University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jilin Agricultural University
Priority to CN202211140336.9A
Publication of CN115223657A
Application granted
Publication of CN115223657B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10 Gene or protein expression profiling; Expression-ratio estimation or normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/30 Prediction of properties of chemical compounds, compositions or mixtures

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Genetics & Genomics (AREA)
  • Chemical & Material Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Mathematical Physics (AREA)

Abstract

A method for predicting the transcriptional regulation map of a medicinal plant, belonging to the technical field of bioinformatics. To address the unclear genetic background of current medicinal plants, the method collects de novo genome sequencing data and transcriptional regulatory networks for medicinal plants such as ginseng, and uses sequence alignment to migrate, by analogy, the genome-wide transcriptional regulatory networks of closely related species to the incompletely annotated genome of the medicinal plant. A deep network learning architecture is built with a graph convolutional network and an attention mechanism to predict co-expression transcriptional regulatory network modules. A community discovery algorithm then identifies differential co-expression network modules, revealing the main components of the medicinal plant to be predicted and their biosynthesis pathways. The visual database platform for medicinal plant transcriptional regulatory networks built with this method supports queries of medicinal plant transcription factors, transcription factor regulatory networks and biological pathways.

Description

Medicinal plant transcription regulation and control map prediction method
Technical Field
The invention belongs to the technical field of bioinformatics, and particularly relates to a medicinal plant transcriptional control map prediction method.
Background
In recent years, the development of high-throughput sequencing technology has produced a steady stream of multi-omics data for medicinal plants. Many multi-omics machine learning analysis methods have emerged in the field of human precision medicine; these methods can be applied to studying how the active components of medicinal plants are synthesized, and they make it easier for computational researchers to analyze the secondary metabolites of medicinal plants systematically from an omics perspective. From a biological perspective, however, current omics research on medicinal plants faces two problems: on the one hand, the complexity of sequencing polyploid organisms means that little public data is available; on the other hand, the unclear genetic background of non-model organisms leaves annotation incomplete.
Transcriptome sequencing (RNA-Seq) is an efficient and rapid means of transcriptome research. It can study gene function and structure at the whole-transcriptome level and reveal specific biological processes and the molecular mechanisms involved in disease. At present, most transcriptome research consists of biologists applying software implementations of bioinformatics models such as differential expression and co-expression analysis. Algorithm research focuses mainly on model animals and plants; for non-model medicinal plants, particularly in the field of traditional Chinese medicine, it is still at an early stage and remains relatively shallow and narrow. Computer scientists are urgently needed to explore more efficient algorithms and to build molecular mechanism analysis models from a systems perspective.
Disclosure of Invention
To solve the problems that the genetic background of current medicinal plants is unclear, that the synthesis pathways and regulation processes of medicinal molecules are poorly characterized, and that new annotations need to be introduced, the invention provides a medicinal plant transcriptional regulation map prediction method based on a graph convolutional network, comprising the following steps:
S1. Obtain gene expression data and pre-construct the transcriptional regulatory relationships:
Download the transcriptome sequencing samples, genome file and gene structure annotation file of the medicinal plant to be predicted, then preprocess the data and align it to the reference genome to obtain the gene expression data of the medicinal plant to be predicted. Download the genomes and transcription factor gene sets of the closely related species of the medicinal plant to be predicted, and perform sequence alignment with BLAST to obtain the transcription factor set of the medicinal plant to be predicted. Download the regulatory network relationships of the closely related species, migrate the mapped relationships to the genome of the medicinal plant to be predicted, and merge duplicate relationships and delete unimportant ones to obtain the transcriptional regulatory relationships between the transcription factors and other genes of the medicinal plant to be predicted. After checking and filtering, refine these relationships with the WGCNA algorithm to obtain more accurate transcriptional regulatory relationships between transcription factors and other genes.
S2. Construct a co-expression regulatory network recognition algorithm based on a deep graph convolutional network model:
Randomly mask 10% of the genes other than transcription factors in the gene expression data of the medicinal plant to be predicted. Use the expression levels of the remaining 90% of genes and of the transcription factors, together with the transcriptional regulatory relationships obtained in S1, as the model input for graph convolution training, and add an attention mechanism to the graph convolution process to ensure a reasonable distribution of weights. Use the mean square error (MSE) as the loss function for predicting the expression levels of the masked 10% of genes, and update through the whole propagation and aggregation process to obtain a weight matrix of the regulatory relationships between transcription factors and other genes. Then cluster the resulting weight matrix with the Louvain algorithm to find co-expression modules, compute the differential co-expression modules between different growth conditions or between tissues and organs, and perform functional annotation enrichment analysis on them to reveal the main components of the medicinal plant to be predicted and their biosynthesis pathways.
Further, the medicinal plants include notoginseng, ginseng, polygonatum, licorice, rehmannia, pinellia, codonopsis, aloe, ginkgo and yew (Taxus).
Further, the data preprocessing in S1 uses the sratools toolkit to download, split, quality-control and filter the transcriptome sequencing samples of the medicinal plant to be predicted; the reference genome is prepared with the gffread software.
Further, the alignment to the reference genome in S1 uses the hisat2 software and is divided into three steps: first, build an index for the reference genome and run the alignment with the --dta parameter; second, compress and sort the alignment results and build an index; third, compute the FPKM or TPM expression matrix of the transcripts with the transcript assembly software stringtie.
Further, the closely related species in S1 are selected by finding the full taxonomic lineage of the medicinal plant to be predicted in the NCBI Taxonomy Browser database and searching, from nearest to most distant, for species that have transcriptional regulatory relationships in the PlantTFDB database.
Further, when the medicinal plant to be predicted is notoginseng, the closely related species selected are carrot, potato, raccoon, tomato, boea densiflora, sesame and eggplant;
when the medicinal plant to be predicted is ginseng, the closely related species selected are pseudo-ginseng, carrot, potato, spirulina, tomato, coffee cherry, tobacco and Arabidopsis thaliana;
when the medicinal plant to be predicted is polygonatum, the closely related species selected are corn, potato, highland barley, sorghum, wheat, swan goose grass and small fruit wild banana;
when the medicinal plant to be predicted is licorice, the closely related species selected are red clover, chickpea, Medicago truncatula, lotus vein fern, mung bean and red bean;
when the medicinal plant to be predicted is rehmannia, the closely related species selected are spirulina, civetta, boea, sesame, potato and coffea canescens;
when the medicinal plant to be predicted is pinellia ternata, the closely related species selected are corn, barley, sorghum, wheat, wild rice, Leersia hexandra and Wularch pattern wheat;
when the medicinal plant to be predicted is codonopsis pilosula, the closely related species selected are carrot, potato, spirulina, petunia, tobacco and tomato;
when the medicinal plant to be predicted is aloe, the closely related species selected are Phalaenopsis amabilis, corn, wheat, barley, pineapple and sorghum;
when the medicinal plant to be predicted is ginkgo, the closely related species selected are musa minor, hop, musella rotundifolia, peach, fuji apple, papaya and Arabidopsis thaliana;
when the medicinal plant to be predicted is taxus chinensis, the closely related species selected are douglas fir, musa minor, corn, musella rotundifolia, hop, Arabidopsis thaliana and barley.
Further, the Louvain algorithm in S2 clusters the weight matrix in two stages, local movement of nodes and network aggregation. The co-expression sub-network of each transcription factor is determined from the distribution of its weight contributions when predicting gene expression values obtained under multiple conditions. The differences between the co-expression modules obtained for the expression data of different tissues after aggregation by the graph convolutional network are used to query and identify specific modules, and these differences can be identified by a statistical test based on the hypergeometric distribution.
The invention also provides an electronic device, which comprises at least one processor and a memory which is connected with the at least one processor in a communication way; wherein the memory stores instructions that, when executed by the at least one processor, cause the at least one processor to implement the method for predicting a transcription regulation profile of a medicinal plant as described above.
The present invention also provides a computer-readable storage medium storing computer-executable instructions for causing a computer to perform the above-mentioned medicinal plant transcriptional regulatory map prediction method.
The invention also provides a visual database platform for medicinal plant transcription factor regulatory networks, built with the above medicinal plant transcriptional regulation map prediction method; the platform provides medicinal plant transcription factor query, sequence alignment, and transcription factor regulatory network and biological pathway query services.
The beneficial effects of the invention are as follows:
First, de novo genome sequencing data and transcriptional regulatory networks of medicinal plants are screened and collected from public databases and an extensive literature survey. After comparing the transcriptional regulatory networks and transcription factors of several closely related species, sequence alignment is used to migrate the transcriptional regulatory network, by analogy, to the incompletely annotated medicinal plant genome. Second, a deep network learning framework is built with a graph convolutional network and an attention mechanism to predict the transcriptional regulatory network. Finally, a community discovery algorithm identifies the differential co-expression network modules, and studying these modules reveals the main components of the medicinal plant to be predicted and their biosynthesis pathways.
The invention addresses the technical problem that medicinal plants such as ginseng, which are regionally characteristic traditional Chinese medicinal materials, have an unclear genetic background and require a new approach to annotation. By analogy with closely related species, it obtains the transcription factor sets of medicinal plants such as ginseng and the gene mapping information between species, achieves high-level annotation of non-model organisms at the transcriptome level, and explores a new annotation approach for non-model organisms.
Furthermore, based on the constructed transcriptional regulatory network and organ-specific or condition-specific transcriptome sequencing data, the invention builds a graph convolutional neural network model, identifies organ-specific or condition-specific co-expression modules of the transcriptional regulatory network, infers the specific transcriptional regulatory pathways related to drug molecules, and provides a highly reliable reference for the biological study and analysis of medicinal plants.
At present, deep learning models are applied to gene expression data mostly for regression and classification problems, and algorithms for network reconstruction are lacking. The invention provides a deep network reconstruction method designed for the ginseng transcriptional regulation map, which mines transcriptional regulation information from gene expression data and offers a distinctive strategy for the deep mining of medicinal plant transcriptome data.
Drawings
FIG. 1 is a technical scheme of the prediction method of the transcription regulation map of medicinal plants according to the present invention;
FIG. 2 is a diagram of a process for generating a ginseng transcriptional control network;
FIG. 3 is a schematic diagram of a graph convolution network algorithm based on an attention mechanism;
FIG. 4 is a diagram of a transcriptional control network aggregation model based on graph convolutional network; wherein, the thickness of the intergenic connecting line in the transcriptional control network in FIG. 4 represents the connection weight, and the thicker the intergenic connecting line is, the closer the relationship between the two genes is;
FIG. 5 is an effect diagram of the transcriptional regulation map of Panax ginseng;
FIG. 6 is an effect diagram of the transcriptional regulation map of Codonopsis pilosula;
FIG. 7 is an effect diagram of the transcriptional regulation map of Panax notoginseng;
FIG. 8 is an effect diagram of the transcriptional regulation map of Pinellia ternata;
FIG. 9 is an effect diagram of the transcriptional regulation map of Rehmannia glutinosa Libosch.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to specific embodiments and the accompanying drawings.
Example 1: Obtaining the ginseng transcriptional regulation map with the medicinal plant transcriptional regulation map prediction method based on a graph convolutional network
A medicinal plant transcriptional regulation map prediction method based on a graph convolutional network (the technical route is shown in FIG. 1) comprises the following steps:
Step one: obtain gene expression data and pre-construct the transcriptional regulatory relationships
(1) Conversion of ginseng transcriptome sequencing data into expression data
Transcriptome sequencing data, i.e. RNA-Seq data, initially take the form of sequence files in FASTA format. Expression data are the gene expression abundance matrix calculated from the RNA-Seq data and are named after the normalization method used, e.g. an FPKM matrix. First, ginseng transcriptome sequencing samples were downloaded from the NCBI SRA database under project accession numbers SRP049125, SRP066368, SRP336641 and SRP345858, containing 18, 50, 39 and 18 samples respectively, and the sratools toolkit was used to download, split, quality-control and filter the data. The ginseng genome file and gene structure annotation file were downloaded from the Korean ginseng genome database (snu.ac.kr), and the reference genome was prepared with the gffread software. Alignment to the reference genome with the hisat2 software was divided into three steps: first, an index was built for the reference genome and the alignment was run with the --dta parameter; second, the alignment results were compressed and sorted and an index was built; third, the FPKM expression matrix of the transcripts was computed with the transcript assembly software stringtie. This yielded about 3975 ginseng genes.
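The method allows either an FPKM or a TPM expression matrix. As an illustration of how the two normalizations relate, the following minimal sketch rescales each sample (column) of an FPKM matrix so that it sums to one million, which is the standard FPKM-to-TPM conversion. The function name `fpkm_to_tpm` and the toy values are hypothetical and are not taken from the patent.

```python
import numpy as np

def fpkm_to_tpm(fpkm):
    """Convert a genes x samples FPKM matrix to TPM, one sample (column) at a time."""
    fpkm = np.asarray(fpkm, dtype=float)
    totals = fpkm.sum(axis=0, keepdims=True)
    # Guard against all-zero samples so the division below is always defined.
    totals[totals == 0] = 1.0
    return fpkm / totals * 1e6

# Toy matrix: 3 genes (rows) x 2 samples (columns); the values are made up.
fpkm = np.array([[10.0, 0.0],
                 [30.0, 5.0],
                 [60.0, 15.0]])
tpm = fpkm_to_tpm(fpkm)
```

After conversion every sample column sums to 1e6, so expression values are comparable across samples as proportions of the total.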
(2) Near source species selection and download
The full taxonomic lineage of ginseng was found in the NCBI Taxonomy Browser database, and species with transcriptional regulatory relationships in the PlantTFDB database were searched from nearest to most distant. The closely related species selected for ginseng are pseudo-ginseng, carrot, potato, spirulina, tomato, Chinese coffee, tobacco and Arabidopsis thaliana. The genomes and transcription factor gene sets of these species were downloaded, and the transcription factor (TF) sequence information for pseudo-ginseng, a close relative of ginseng, was downloaded from the iTAK database.
(3) Determination of ginseng transcription factors (TFs) by comparison with closely related species
The genomes of the closely related species were aligned against the downloaded ginseng genome with BLAST. In the BLAST results, an E-value of 1e-30 and the top 4 hits by bit score were used as screening thresholds to find the correspondence between all ginseng genes and the genes of the other species, giving about 2000 genes corresponding to transcription factors of the seven species. In parallel, the transcription factor gene set of pseudo-ginseng was aligned against ginseng, giving about 3000 genes matching pseudo-ginseng TFs. The intersection of the two gene sets was taken as the ginseng transcription factor set.
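The screening described above can be sketched as follows, assuming the BLAST results are in the standard 12-column tabular layout (query ID in column 1, subject ID in column 2, E-value in column 11, bit score in column 12) and that ginseng genes are the queries. The gene identifiers, the helper names `row`, `filter_hits` and `candidate_tfs`, and the toy hits are hypothetical illustrations, not data from the patent.

```python
from collections import defaultdict

EVALUE_MAX = 1e-30  # E-value threshold stated in the text
TOP_N = 4           # keep the 4 best hits per query by bit score

def row(query, subject, evalue, bitscore):
    # Only columns 1, 2, 11 and 12 matter here; the other columns are dummy values.
    return "\t".join([query, subject, "90.0", "300", "5", "0",
                      "1", "300", "1", "300", evalue, bitscore])

def filter_hits(lines, evalue_max=EVALUE_MAX, top_n=TOP_N):
    """Return {query: [subject, ...]} keeping hits that pass both thresholds."""
    hits = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue <= evalue_max:
            hits[query].append((bitscore, subject))
    return {q: [s for _, s in sorted(h, key=lambda x: -x[0])[:top_n]]
            for q, h in hits.items()}

def candidate_tfs(hits, known_tfs):
    """Queries whose retained hits include a known TF of a related species."""
    return {q for q, subjects in hits.items()
            if any(s in known_tfs for s in subjects)}

lines = [
    row("Pg1", "Dc_TF1", "1e-80", "400"),
    row("Pg1", "Sl_g9", "1e-10", "80"),    # fails the E-value threshold
    row("Pg2", "Sl_g7", "1e-50", "300"),   # passes, but Sl_g7 is not a TF
    row("Pg3", "Sl_TF2", "1e-60", "350"),
]
related_species_tfs = {"Dc_TF1", "Sl_TF2"}
from_related = candidate_tfs(filter_hits(lines), related_species_tfs)
from_pseudo_ginseng = {"Pg1", "Pg2"}       # assumed result of the second comparison
ginseng_tfs = from_related & from_pseudo_ginseng
```

Taking the intersection of the two candidate sets keeps only genes supported by both comparisons, which is the conservative choice the text describes.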
(4) Migration mapping transcriptional regulatory relationships
The regulatory network relationships of each closely related species of ginseng were downloaded from PlantRegMap, the latest integrated resource of the PlantTFDB database, and the mapped relationships were migrated to the ginseng genome to obtain the transcriptional regulatory relationships between ginseng transcription factors and other genes. After merging duplicate relationships and deleting unimportant ones, the final regulatory relationships between transcription factors and other genes were obtained.
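A minimal sketch of this migration step is given below. It assumes that the BLAST correspondence is available as a mapping from each related-species gene to its ginseng counterparts, merges duplicate edges by counting how many source relationships support them, and treats low support as the criterion for an "unimportant" relationship. That criterion, the function name `migrate_edges`, the `min_support` parameter and all identifiers are assumptions made for illustration; the patent does not specify them.

```python
from collections import Counter

def migrate_edges(source_edges, ortholog_map, min_support=1):
    """Map (TF, target) edges of related species onto the target species.

    Duplicate migrated edges are merged by counting their support; edges with
    support below min_support are dropped (an assumed importance criterion).
    """
    support = Counter()
    for tf, target in source_edges:
        for tf_new in ortholog_map.get(tf, ()):
            for target_new in ortholog_map.get(target, ()):
                support[(tf_new, target_new)] += 1
    return {edge: n for edge, n in support.items() if n >= min_support}

# Hypothetical regulatory edges from two related species.
source_edges = [("Dc_TF1", "Dc_g5"), ("Sl_TF2", "Sl_g7"), ("Sl_TF2", "Sl_g8")]
# Hypothetical correspondence between related-species genes and ginseng genes.
ortholog_map = {
    "Dc_TF1": ["Pg1"], "Sl_TF2": ["Pg1"],
    "Dc_g5": ["Pg20"], "Sl_g7": ["Pg20"], "Sl_g8": ["Pg21"],
}
all_edges = migrate_edges(source_edges, ortholog_map)
strong_edges = migrate_edges(source_edges, ortholog_map, min_support=2)
```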
(5) Network correction based on co-expression modules
To further confirm the regulatory relationships between transcription factors and other genes, the invention uses the WGCNA algorithm to compute co-expression modules and correct the regulation map. First, the initial expression matrix is checked for scale-free network characteristics by calculating whether R² reaches the 0.9 reference line. If the scale-free characteristics required by the algorithm are satisfied, a soft threshold is determined and the co-expression network is constructed. Using the computed co-expression modules, the transcription factors are compared with the genes in each module: pairs in which the transcription factor and the gene appear in the same co-expression module are retained, and genes that do not appear in the same module as the transcription factor are deleted, giving a more accurate regulatory relationship between transcription factors and other genes. The process of generating the transcriptional regulatory relationships between ginseng transcription factors and other genes is shown in FIG. 2. To avoid confusion between the deep network model and the gene interaction network graph, in what follows "graph" refers to the gene regulatory network graph and "network" refers to the graph convolutional network algorithm.
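The module-based correction can be illustrated with a small sketch. It assumes that the WGCNA module assignment has already been computed elsewhere and is available as a mapping from gene to module label; the function name `filter_by_module`, the module labels and the gene identifiers are hypothetical.

```python
def filter_by_module(edges, module_of):
    """Keep (TF, gene) edges whose two ends share a co-expression module."""
    kept = []
    for tf, gene in edges:
        m_tf, m_gene = module_of.get(tf), module_of.get(gene)
        # Genes without a module assignment are treated as not co-expressed.
        if m_tf is not None and m_tf == m_gene:
            kept.append((tf, gene))
    return kept

# Hypothetical module assignment and migrated edges.
module_of = {"Pg1": "blue", "Pg20": "blue", "Pg21": "brown"}
edges = [("Pg1", "Pg20"), ("Pg1", "Pg21"), ("Pg1", "Pg99")]
corrected = filter_by_module(edges, module_of)
```

Only the edge whose transcription factor and target fall in the same module survives; edges to genes in other modules or to unassigned genes are removed.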
Step two: construct the co-expression regulatory network recognition algorithm based on a deep graph convolutional network model
(1) Energy aggregation based on the attention mechanism
The invention builds a deep network learning framework with a graph convolutional network and an attention mechanism. The spatial graph convolutional network model performs convolutional aggregation mainly using the information of neighbor nodes. The attention mechanism lets the neural network attend only to the nodes and edges most relevant to the task, improving training effectiveness and test accuracy. Building on spatial graph convolution, the graph attention network (GAT) stacks masked self-attention layers into the aggregation function of the spatial graph convolution (see FIG. 3).
The input to the attention layer in GAT is a set of node features, $\mathbf{h} = \{\vec{h}_1, \vec{h}_2, \ldots, \vec{h}_N\}$, $\vec{h}_i \in \mathbb{R}^{F}$, where $N$ is the number of nodes and $F$ is the number of features per node. The layer generates a new set of node features (possibly of different cardinality $F'$), $\mathbf{h}' = \{\vec{h}'_1, \vec{h}'_2, \ldots, \vec{h}'_N\}$, $\vec{h}'_i \in \mathbb{R}^{F'}$, as its output. To obtain sufficient expressive power to transform the input features into higher-level features, at least one learnable linear transformation is required. To this end, as a first step, a shared linear transformation parameterized by a weight matrix $\mathbf{W} \in \mathbb{R}^{F' \times F}$ and a shared attention mechanism $a : \mathbb{R}^{F'} \times \mathbb{R}^{F'} \to \mathbb{R}$ are applied to every node. For node $i$, the energy aggregation coefficient of a neighboring node is calculated as:
$e_{ij} = a(\mathbf{W}\vec{h}_i, \mathbf{W}\vec{h}_j), \quad j \in \mathcal{N}_i$ (1)
Formula (1) represents the degree of correlation of node $j$ to node $i$, i.e. the degree of energy aggregation; $\mathcal{N}_i$ is the set of neighbor nodes of node $i$, $\mathbf{W}$ is the weight matrix, and $a$ is a single-layer feedforward neural network whose parameters are a weight vector $\vec{a} \in \mathbb{R}^{2F'}$, using LeakyReLU with a negative half-axis slope of 0.2 as the nonlinear activation function:
$e_{ij} = \mathrm{LeakyReLU}\left(\vec{a}^{T}\left[\mathbf{W}\vec{h}_i \,\|\, \mathbf{W}\vec{h}_j\right]\right)$ (2)
where $\|$ denotes the concatenation operation and $\cdot^{T}$ denotes transposition. To make the coefficients easy to compare between different nodes, they are normalized over all choices of $j$ with the softmax function:
$\alpha_{ij} = \mathrm{softmax}_j(e_{ij}) = \dfrac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(e_{ik})}$ (3)
After the normalized coefficients $\alpha_{ij}$ are obtained, the linear combination of the corresponding features is computed and passed through the nonlinear activation function $\sigma$; the final output feature vector of each node is:
$\vec{h}'_i = \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij} \mathbf{W}\vec{h}_j\right)$ (4)
Suppose a graph $G = (V, E, A)$ is given, in which $V$ is the set of vertices or nodes, $E$ is the set of edges, and $A$ is the weight matrix of the edges. For each node $v_i \in V$, the propagation/aggregation process of the graph convolutional network (GCN) can be represented by equation (4).
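The attention-based aggregation of equations (1) to (4) can be sketched in NumPy as a single-head layer. This is a minimal illustration under stated assumptions, not the patent's implementation: the parameters are random rather than trained, tanh stands in for the unspecified activation, every node is given a self-loop so that its neighbor set is never empty, and the names `gat_layer`, `leaky_relu`, `W`, `a` and `adj` are chosen here for illustration.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    # LeakyReLU with negative-half-axis slope 0.2, as in equation (2).
    return np.where(x > 0, x, slope * x)

def gat_layer(h, adj, W, a, activation=np.tanh):
    """Single-head graph attention layer.

    h:   (N, F) node features
    adj: (N, N) boolean, adj[i, j] is True when j is a neighbor of i
         (assumed to include self-loops so every row has a neighbor)
    W:   (F_out, F) shared linear transformation
    a:   (2 * F_out,) attention weight vector
    """
    z = h @ W.T                                  # row i holds W h_i
    f_out = z.shape[1]
    # a^T [W h_i || W h_j] splits into a term for i plus a term for j.
    e = leaky_relu((z @ a[:f_out])[:, None] + (z @ a[f_out:])[None, :])
    e = np.where(adj, e, -np.inf)                # restrict to the neighbor set
    e = e - e.max(axis=1, keepdims=True)         # numerical stability for softmax
    w = np.exp(e)
    alpha = w / w.sum(axis=1, keepdims=True)     # equation (3)
    return activation(alpha @ z), alpha          # equation (4)

rng = np.random.default_rng(0)
N, F, F_OUT = 4, 3, 2
h = rng.normal(size=(N, F))
W = rng.normal(size=(F_OUT, F))
a = rng.normal(size=2 * F_OUT)
adj = np.eye(N, dtype=bool)                      # self-loops (assumption)
for i, j in [(0, 1), (1, 2), (2, 3)]:            # toy path graph
    adj[i, j] = adj[j, i] = True
h_new, alpha = gat_layer(h, adj, W, a)
```

Each row of `alpha` sums to one and is zero outside the neighbor set, so `alpha` can be read as a weighted version of the input regulation graph.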
(2) Construction of the deep graph convolutional network model
The invention first randomly masks 10% of the genes other than transcription factors (TFs) in the gene expression data. The expression levels of the remaining 90% of genes and of the transcription factors, combined with the regulatory relationships obtained in step one (that is, the node features of the graph and the structural features of the graph), are used as the model input for graph convolution training, and attention is added to the convolution process to ensure a reasonable distribution of weights. The mean square error (MSE) is used as the loss function for predicting the expression of the masked 10% of genes, and the edge weight matrix $A$ is updated through the whole propagation and aggregation process.
When training the graph convolutional neural network, the invention introduces an attention module to learn and optimize the network structure; the attention mechanism allows dependencies to be modeled without regard to their distance in the input or output sequences. Regardless of the amount of input, the attention mechanism keeps attending to the most relevant parts. When an attention mechanism is used to learn the representation of a single sequence, it is usually called intra-attention or self-attention. The weights are captured implicitly by the end-to-end neural network framework, so that more important nodes are assigned larger weights. The resulting weight matrix is then clustered to find the co-expression modules in the regulatory network.
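The masking scheme and the loss can be sketched as follows. The sketch only shows how the 10% of non-TF genes might be held out and how the MSE on their predicted expression would be computed; the model itself is omitted. The function names `split_mask` and `mse`, the fixed random seed and the toy gene identifiers are hypothetical.

```python
import numpy as np

def split_mask(gene_ids, tf_ids, frac=0.10, seed=0):
    """Hold out a random fraction of non-TF genes; TFs are never masked."""
    rng = np.random.default_rng(seed)
    non_tf = np.array([g for g in gene_ids if g not in tf_ids])
    n_mask = int(round(frac * len(non_tf)))
    masked = set(rng.choice(non_tf, size=n_mask, replace=False).tolist())
    visible = [g for g in gene_ids if g not in masked]
    return visible, sorted(masked)

def mse(pred, target):
    """Mean square error between predicted and observed expression values."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.mean((pred - target) ** 2))

gene_ids = [f"g{i}" for i in range(25)]
tf_ids = {"g0", "g1", "g2", "g3", "g4"}          # hypothetical TF set
visible, masked = split_mask(gene_ids, tf_ids)
loss = mse([1.0, 2.0], [1.0, 4.0])               # toy predictions vs. targets
```

In training, the visible genes and the TFs would supply the node features, and the loss would be evaluated only on the masked genes.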
The input X in FIG. 4 comprises the gene expression data matrix and the transcriptional regulation graph that represents the regulatory relationships between genes. Passing it through the graph convolution layers (gcnv) and the loss function calculation yields a new weight matrix, that is, a new transcriptional regulation graph with different edge weights. Clustering this weighted network yields the co-expression modules of the transcriptional regulatory network.
The invention uses the Louvain algorithm to aggregate the weighted co-expression network obtained by training, that is, to perform community discovery, so as to identify and organize related network modules. The Louvain algorithm optimizes modularity in two basic phases: (1) local moving of nodes and (2) network aggregation. In the local moving phase, a single node is moved to the community that gives the largest increase in the quality function. In the aggregation phase, an aggregate network is created from the partition obtained in the local moving phase, and each community in the partition becomes a node of the aggregate network. The two phases are repeated until the quality function cannot be increased further. Specifically, the co-expression sub-network of each transcription factor is determined from the weight contribution of that transcription factor when gene expression values are predicted under multiple conditions. Differences between the co-expression modules obtained from the expression data of different tissues after GCN aggregation are used to query and identify specific modules, and these differences can be identified by a statistical test based on the hypergeometric distribution.
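The local moving phase can be illustrated with a deliberately naive sketch that recomputes modularity after every trial move. It covers phase (1) only and is not an efficient or complete Louvain implementation; the edge dictionary format, the function names and the toy network are assumptions.

```python
from collections import defaultdict

def modularity(edges, community):
    """Modularity Q of a partition; edges is {(u, v): weight}, one entry per edge."""
    m = sum(edges.values())
    degree = defaultdict(float)
    internal = defaultdict(float)
    for (u, v), w in edges.items():
        degree[u] += w
        degree[v] += w
        if community[u] == community[v]:
            internal[community[u]] += w
    total = defaultdict(float)
    for node, d in degree.items():
        total[community[node]] += d
    return sum(internal[c] / m - (total[c] / (2 * m)) ** 2 for c in total)

def local_moving(edges, nodes):
    """Move single nodes to the neighbouring community that raises Q the most."""
    community = {n: n for n in nodes}           # start with singleton communities
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    improved = True
    while improved:
        improved = False
        for node in nodes:
            original = community[node]
            best_label, best_q = original, modularity(edges, community)
            for label in sorted({community[n] for n in neighbours[node]} - {original}):
                community[node] = label          # trial move
                q = modularity(edges, community)
                if q > best_q + 1e-12:
                    best_label, best_q = label, q
            community[node] = best_label         # keep the best move, if any
            if best_label != original:
                improved = True
    return community

# Toy weighted network: two triangles joined by one weak edge.
edges = {("a", "b"): 1.0, ("b", "c"): 1.0, ("a", "c"): 1.0,
         ("d", "e"): 1.0, ("e", "f"): 1.0, ("d", "f"): 1.0,
         ("c", "d"): 0.1}
partition = local_moving(edges, list("abcdef"))
```

On this toy network the two triangles end up as separate communities. In a full Louvain run, each community would then be collapsed into a single node and the phase repeated on the aggregate network.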
The main components of ginseng and their biosynthesis pathways are revealed by identifying the differential co-expression modules under different growth conditions or between tissues and organs, and then performing functional annotation enrichment analysis on these modules.
The ginseng transcriptional regulation map obtained in this example is shown in FIG. 5.
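The hypergeometric test mentioned for comparing co-expression modules can be sketched as an upper-tail probability. The description names only the distribution; applying it to the overlap between a module and a reference gene set, as well as the function and parameter names, is an assumption.

```python
from math import comb

def hypergeom_pvalue(population, successes, draws, overlap):
    """P(X >= overlap) for a hypergeometric variable.

    population : number of genes in the background
    successes  : background genes that belong to the reference set
    draws      : number of genes in the module being tested
    overlap    : module genes that also belong to the reference set
    """
    upper = min(successes, draws)
    total = comb(population, draws)
    tail = sum(comb(successes, k) * comb(population - successes, draws - k)
               for k in range(overlap, upper + 1))
    return tail / total

# Toy numbers: 10 background genes, 5 in the reference set, module of 5 genes.
p_all_shared = hypergeom_pvalue(10, 5, 5, 5)
p_any = hypergeom_pvalue(10, 5, 5, 0)
```

A small p-value indicates that the overlap is larger than expected by chance, which is how a module would be flagged as specific to a tissue or condition under this reading.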
Example 2: codonopsis pilosula transcription regulation and control map obtained by using medicinal plant transcription regulation and control map prediction method based on graph convolution network
The transcription regulation and control map of codonopsis pilosula is predicted by referring to the prediction method of the transcription regulation and control map of the medicinal plant described in example 1. The sequencing sample of the codonopsis pilosula transcriptome is downloaded from an SRA database of NCBI, platform numbers are SRP235322, SRP219399, SRP287212, SRP314669, SRP314990 and SRP315291, a whole genus pedigree of the codonopsis pilosula genome is found from a Taxinomy Browser database of NCBI, species analogy with a transcription regulation relation in a plantaTFDB database from near to far is searched, and the selected codonopsis pilosula close-source species are carrot, potato, spirulina platensis, petunia, tobacco and tomato. The obtained Codonopsis pilosula transcription control map effect is shown in FIG. 6.
Example 3: Obtaining the Panax notoginseng transcriptional regulation map using the graph convolutional network-based medicinal plant transcriptional regulation map prediction method
The transcriptional regulation map of Panax notoginseng was predicted following the method described in Example 1. Panax notoginseng transcriptome sequencing samples were downloaded from the NCBI SRA database, with study accession numbers SRP045917, SRP082250, SRP082995, SRP091819, SRP100663, SRP105080, SRP316468, SRP151328, SRP293477 and SRP300593. The full taxonomic lineage of Panax notoginseng was looked up in the NCBI Taxonomy Browser, and species with transcriptional regulation relationships in the PlantTFDB database were searched from the closest relatives outward. The closely related species selected for Panax notoginseng were carrot, potato, spirulina, tomato, boea densita, sesame and eggplant. The resulting Panax notoginseng transcriptional regulation map is shown in FIG. 7.
Example 4: pinellia ternata transcription regulation map obtained by using medicinal plant transcription regulation map prediction method based on graph convolution network
Referring to the method for predicting the transcription regulation map of medicinal plants described in example 1, the transcription regulation map of pinellia ternata was predicted. Downloading pinellia ternate transcriptome sequencing samples from an SRA database of NCBI, wherein platform numbers are SRP233223, SRP255725, SRP126709, SRP182920 and SRP215828, finding a whole genus pedigree of a pinellia ternate genome from a Taxonomy Browser database of NCBI, performing analogy search on species which have a transcription regulation and control relationship with a plantatTFDB database from near to far, and selecting pinellia ternate near-source species of corn, barley, sorghum, wheat, wild rice, leersia hexandra and Wularg map wheat. The final obtained pinellia ternata transcriptional control map effect is shown in figure 8.
Example 5: Obtaining the Rehmannia glutinosa transcriptional regulation map using the graph convolutional network-based medicinal plant transcriptional regulation map prediction method
The transcriptional regulation map of Rehmannia glutinosa was predicted following the method described in Example 1. Rehmannia glutinosa transcriptome sequencing samples were downloaded from the NCBI SRA database, with study accession numbers SRP348149, SRP103641, SRP168980, SRP242653, SRP036150 and SRP261047. The full taxonomic lineage of Rehmannia glutinosa was looked up in the NCBI Taxonomy Browser, and species with transcriptional regulation relationships in the PlantTFDB database were searched from the closest relatives outward. The closely related species selected for Rehmannia glutinosa were spirulina, leoparus mitochondrus, boea rosea, sesame, potato and coffea canephora. The resulting Rehmannia glutinosa transcriptional regulation map is shown in FIG. 9.
Example 6: Obtaining the Polygonatum sibiricum transcriptional regulation map using the graph convolutional network-based medicinal plant transcriptional regulation map prediction method
The transcriptional regulation map of Polygonatum sibiricum was predicted following the method described in Example 1. Polygonatum sibiricum transcriptome sequencing samples were downloaded from the NCBI SRA database, with study accession numbers SRP380100, SRP193176 and SRP239571. The full taxonomic lineage of Polygonatum sibiricum was looked up in the NCBI Taxonomy Browser, and species with transcriptional regulation relationships in the PlantTFDB database were searched from the closest relatives outward. The closely related species selected for Polygonatum sibiricum were corn, potato, highland barley, sorghum, wheat, velvet grass and small fruit plantain.
Example 7: Obtaining the licorice transcriptional regulation map using the graph convolutional network-based medicinal plant transcriptional regulation map prediction method
The transcriptional regulation map of licorice was predicted following the method described in Example 1. Licorice transcriptome sequencing samples were downloaded from the NCBI SRA database, with study accession numbers SRP332963, SRP341206, SRP349355, SRP360528, SRP215420 and SRP215914. The full taxonomic lineage of licorice was looked up in the NCBI Taxonomy Browser, and species with transcriptional regulation relationships in the PlantTFDB database were searched from the closest relatives outward. The closely related species selected for licorice were red clover, chickpea, medicago truncatula, crowtoe, mung bean and ormosia.
Example 8: Obtaining the aloe transcriptional regulation map using the graph convolutional network-based medicinal plant transcriptional regulation map prediction method
The transcriptional regulation map of aloe was predicted following the method described in Example 1. Aloe transcriptome sequencing samples were downloaded from the NCBI SRA database, with study accession numbers SRP263487, SRP265189, SRP075487, SRP096304 and SRP169103. The full taxonomic lineage of aloe was looked up in the NCBI Taxonomy Browser, and species with transcriptional regulation relationships in the PlantTFDB database were searched from the closest relatives outward. The closely related species selected for aloe were phalaenopsis miniata, corn, wheat, barley, pineapple and sorghum.
Example 9: Obtaining the ginkgo transcriptional regulation map using the graph convolutional network-based medicinal plant transcriptional regulation map prediction method
The transcriptional regulation map of Ginkgo biloba was predicted following the method described in Example 1. Ginkgo transcriptome sequencing samples were downloaded from the NCBI SRA database, with study accession numbers SRP156472, SRP226212, SRP239440, SRP242114, SRP246776, SRP274053, SRP275669, SRP278694, SRP292448, SRP322251, SRP068183, SRP007598, SRP149113 and SRP180427. The full taxonomic lineage of ginkgo was looked up in the NCBI Taxonomy Browser, and species with transcriptional regulation relationships in the PlantTFDB database were searched from the closest relatives outward. The closely related species selected for ginkgo were canna minor, hops, arabidopsis thaliana, peaches, apple, papaya and Arabidopsis thaliana.
Example 10: Obtaining the Taxus chinensis transcriptional regulation map using the graph convolutional network-based medicinal plant transcriptional regulation map prediction method
The transcriptional regulation map of Taxus chinensis was predicted following the method described in Example 1. Taxus transcriptome sequencing samples were downloaded from the NCBI SRA database, with study accession numbers SRP284525, SRP063030 and SRP127697. The full taxonomic lineage of Taxus was looked up in the NCBI Taxonomy Browser, and species with transcriptional regulation relationships in the PlantTFDB database were searched from the closest relatives outward. The closely related species selected for Taxus were douglas fir, small fruit canna, corn, arabidopsis thaliana, hop, arabidopsis thaliana and barley.
Example 11: A visual database platform for medicinal plant transcription factor regulatory networks, constructed using the medicinal plant transcriptional regulation map prediction method of the invention.
Although the present invention has been described with reference to the preferred embodiments, it should be understood that various changes and modifications can be made therein by one skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (10)

1. A medicinal plant transcription regulation map prediction method based on a graph convolution network is characterized by comprising the following steps:
S1, acquiring gene expression data and pre-constructing transcriptional regulation relationships:
downloading transcriptome sequencing samples, a genome file and a gene structure annotation file of the medicinal plant to be predicted, and performing data preprocessing and alignment to the reference genome to obtain gene expression data of the medicinal plant to be predicted; downloading the genomes of closely related species of the medicinal plant to be predicted, aligning the genome of the medicinal plant to be predicted against the genome of each closely related species using BLAST, filtering the alignment results by evalue and bitscore to obtain, for each closely related species, the set of genes of the medicinal plant to be predicted that match the transcription factors of that species, and taking the intersection of these gene sets to obtain the transcription factor set of the medicinal plant to be predicted; downloading the regulatory network relationships of the closely related species, transferring them by mapping to the genome of the medicinal plant to be predicted, and merging duplicate relationships and deleting unimportant relationships to obtain the transcriptional regulation relationships among the genes, including transcription factors, of the medicinal plant to be predicted; after checking and filtering, further obtaining accurate transcriptional regulation relationships among the genes, including transcription factors, of the medicinal plant to be predicted using the WGCNA algorithm;
S2, constructing a co-expression regulatory network recognition algorithm based on a deep graph convolutional network model:
randomly masking 10% of the genes other than transcription factors in the gene expression data of the medicinal plant to be predicted, using the expression values of the remaining 90% of genes and of the transcription factors, together with the transcriptional regulation relationships obtained in S1, as the model input for graph convolution training, and adding an attention mechanism during graph convolution to ensure a reasonable distribution of weights; computing the loss function as the mean squared error (MSE) of the predicted expression of the masked 10% of genes, and updating through the whole propagation/aggregation process to obtain a weight matrix of the regulatory relationships among the genes, including transcription factors; then aggregating the obtained weight matrix using the Louvain algorithm to find co-expression modules, and revealing the main components of the medicinal plant to be predicted and their biosynthesis pathways by identifying the differential co-expression modules under different growth conditions or between tissues and organs and performing functional annotation enrichment analysis on these modules.
2. The method of predicting a transcription regulation profile of a medicinal plant according to claim 1, wherein the medicinal plant comprises Panax notoginseng, panax ginseng, polygonatum sibiricum, glycyrrhiza uralensis, rehmannia glutinosa, pinellia ternate, codonopsis pilosula, aloe vera, ginkgo biloba, and Taxus chinensis.
3. The medicinal plant transcriptional regulation map prediction method according to claim 1, wherein the data preprocessing of S1 downloads, splits, quality-controls and filters the transcriptome sequencing samples of the medicinal plant to be predicted using the sratools toolkit; and reference genome preparation is performed using the gffread software.
4. The medicinal plant transcriptional regulation map prediction method according to claim 1, wherein the alignment to the reference genome in S1 uses the hisat2 software and comprises the following three steps: first, building an index for the reference genome and performing the alignment with the --dta parameter; second, compressing and sorting the alignment results and building an index; third, computing the FPKM or TPM expression matrix of the transcripts using the transcript assembly software StringTie.
5. The medicinal plant transcriptional regulation map prediction method according to claim 1, wherein the closely related species of S1 are selected by looking up the full taxonomic lineage of the medicinal plant to be predicted in the NCBI Taxonomy Browser and searching, from the closest relatives outward, for species that have transcriptional regulation relationships in the PlantTFDB database.
6. The method for predicting a transcription regulation profile of a medicinal plant according to claim 5,
when the medicinal plant to be predicted is pseudo-ginseng, the selected close-source species are carrot, potato, raccoon, tomato, boea densiflora, sesame and eggplant;
when the medicinal plant to be predicted is ginseng, the selected closely sourced species are pseudo-ginseng, carrot, potato, spirulina, tomato, coffee cherry, tobacco and arabidopsis thaliana;
when the medicinal plant to be predicted is polygonatum, the selected near-source species are corn, potato, highland barley, sorghum, wheat, swan goose grass and small fruit wild bananas;
when the medicinal plant to be predicted is liquorice, the selected close-source species are red clover, chickpea, medicago truncatula, lotus vein fern, mung bean and red bean;
when the medicinal plant to be predicted is rehmannia, the selected closely-sourced species are spirulina, civetta, boea, sesame, potato and coffea canescens;
when the medicinal plant to be predicted is pinellia ternate, the selected closely-sourced species are corn, barley, sorghum, wheat, wild rice, leersia hexandra and Wularg wheat;
when the medicinal plant to be predicted is codonopsis pilosula, the selected closely sourced species are carrots, potatoes, spirulina, petunia, tobacco and tomatoes;
when the medicinal plant to be predicted is aloe, the selected closely-sourced species are Phalaenopsis amabilis, corn, wheat, barley, pineapple and sorghum;
when the medicinal plant to be predicted is ginkgo, the selected closely-sourced species are musa minor, hops, musella rotundifolia, peaches, fuji apples, papayas and Arabidopsis;
when the medicinal plant to be predicted is taxus chinensis, the selected close-source species are douglas fir, canna microphylla, corn, arabidopsis thaliana, hop, arabidopsis thaliana and barley.
7. The medicinal plant transcriptional regulation map prediction method according to claim 1, wherein the Louvain algorithm of S2 aggregates the weight matrix in two phases, local moving of nodes and network aggregation: a co-expression sub-network of each transcription factor is determined from the weight contribution of that transcription factor when gene expression values are predicted under multiple conditions, the differences between the co-expression modules obtained from the expression data of different tissues after graph convolutional network aggregation are used to query and identify specific modules, and these differences can be identified by a statistical test based on the hypergeometric distribution.
8. An electronic device, comprising at least one processor, and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions for execution by the at least one processor to cause the at least one processor to perform a method of predicting a transcription regulatory profile of a medicinal plant according to any one of claims 1 to 7.
9. A computer-readable storage medium storing computer-executable instructions for causing a computer to perform the method for predicting a transcription regulatory profile of a medicinal plant according to any one of claims 1 to 7.
10. A medicinal plant transcription factor regulation network visual database platform constructed by using the medicinal plant transcription regulation and control map prediction method of any one of claims 1 to 7 is characterized in that the medicinal plant transcription factor regulation and control network visual database platform provides medicinal plant transcription factor query, sequence comparison, transcription factor regulation and control network and biological pathway query services.
CN202211140336.9A 2022-09-20 2022-09-20 Medicinal plant transcriptional regulation map prediction method Active CN115223657B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211140336.9A CN115223657B (en) 2022-09-20 2022-09-20 Medicinal plant transcriptional regulation map prediction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211140336.9A CN115223657B (en) 2022-09-20 2022-09-20 Medicinal plant transcriptional regulation map prediction method

Publications (2)

Publication Number Publication Date
CN115223657A CN115223657A (en) 2022-10-21
CN115223657B true CN115223657B (en) 2022-12-06

Family

ID=83616886

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211140336.9A Active CN115223657B (en) 2022-09-20 2022-09-20 Medicinal plant transcriptional regulation map prediction method

Country Status (1)

Country Link
CN (1) CN115223657B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117344053B (en) * 2023-12-05 2024-03-19 中国农业大学 Method for evaluating physiological development process of plant tissue

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112364880A (en) * 2020-11-30 2021-02-12 腾讯科技(深圳)有限公司 Omics data processing method, device, equipment and medium based on graph neural network
CN112992267A (en) * 2021-04-13 2021-06-18 中国人民解放军军事科学院军事医学研究院 Single-cell transcription factor regulation network prediction method and device
CN114022693A (en) * 2021-09-29 2022-02-08 西安热工研究院有限公司 Double-self-supervision-based single-cell RNA-seq data clustering method

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109637588B (en) * 2018-12-29 2022-07-15 北京百迈客生物科技有限公司 Method for constructing gene regulation network based on whole transcriptome high-throughput sequencing
EP4193235A1 (en) * 2020-04-13 2023-06-14 Aiberry, Inc. Multimodal analysis combining monitoring modalities to elicit cognitive states and perform screening for mental disorders
CN112435714B (en) * 2020-11-03 2021-07-02 北京科技大学 Tumor immune subtype classification method and system
CN113470741B (en) * 2021-07-28 2023-07-18 腾讯科技(深圳)有限公司 Drug target relation prediction method, device, computer equipment and storage medium
CN114093422B (en) * 2021-11-23 2024-06-25 湖南大学 Prediction method and system for interaction between miRNA and gene based on multiple relationship graph rolling network
CN114496092B (en) * 2022-02-09 2024-05-03 中南林业科技大学 MiRNA and disease association relation prediction method based on graph rolling network

Also Published As

Publication number Publication date
CN115223657A (en) 2022-10-21

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant