CN115798602A

CN115798602A - Gene regulation and control network construction method, device, equipment and storage medium

Info

Publication number: CN115798602A
Application number: CN202310054444.2A
Authority: CN
Inventors: 赵纪永; 王维玉
Original assignee: Beijing Lingxun Pharmaceutical Technology Co ltd
Current assignee: Beijing Lingxun Pharmaceutical Technology Co ltd
Priority date: 2023-02-03
Filing date: 2023-02-03
Publication date: 2023-03-14

Abstract

The invention discloses a method, a device, equipment and a storage medium for constructing a gene regulation network. The method comprises the following steps: obtaining source gene data and screening the source gene data to obtain target gene data, wherein the source gene data is gene expression profile data of ovarian cancer patients; performing Bootstrap resampling on the target gene data to obtain a first preset number of gene data sets; obtaining the first preset number of target Bayesian network models based on the gene data sets; and performing confidence estimation on the first preset number of target Bayesian network models to obtain a target gene regulation network. According to the invention, target gene data related to ovarian cancer can be screened from the source gene data, and a Bayesian network model constructed from the target gene data by Bootstrap resampling and confidence estimation is used as the target gene regulation network, so that ovarian cancer related genes can be determined and the regulatory relationships among them can be revealed.

Description

Gene regulation and control network construction method, device, equipment and storage medium

Technical Field

The invention relates to the technical field of electronic digital data processing, in particular to a method, a device, equipment and a storage medium for constructing a gene regulation and control network.

Background

Ovarian cancer is a clinically common gynecological cancer. Because patients are typically diagnosed at a late stage, ovarian cancer has a high fatality rate, the highest among gynecological tumors. Therefore, there is an urgent need to study the genes associated with ovarian cancer and their intrinsic mechanisms in order to prevent the occurrence of ovarian cancer.

Researchers now routinely take precise measurements of the genes, proteins and metabolites of ovarian cancer patients in order to understand the potential mechanisms of action of ovarian cancer related genes. These measurements generate massive omics data, which researchers often analyze with complex statistical models. However, most current statistical methods are constrained by the limited number of patient cases and by the diversity of patients. These studies also involve many confounding factors, such as different treatment regimens, tumor stages and subtypes, all of which may significantly affect clinical therapeutic efficacy. If these confounders are not properly addressed, it is difficult to obtain reliable results in subsequent analyses. In addition, there are complex regulatory relationships among the genes associated with ovarian cancer, which traditional statistical methods have difficulty identifying and therefore cannot represent. Traditional statistical methods thus cannot meet current analysis requirements, and a statistical model that determines ovarian cancer related genes and reveals their regulatory relationships is needed to better study the intrinsic mechanism of ovarian cancer.

The above is only for the purpose of assisting understanding of the technical aspects of the present invention, and does not represent an admission that the above is prior art.

Disclosure of Invention

The invention mainly aims to provide a method, a device, equipment and a storage medium for constructing a gene regulation and control network, and aims to solve the technical problem that the existing statistical method cannot determine ovarian cancer related genes and reveal the regulation and control relation of the ovarian cancer related genes. In order to achieve the above object, the present invention provides a method for constructing a gene regulatory network, the method comprising:

obtaining source gene data, and screening the source gene data to obtain target gene data, wherein the source gene data are gene expression profile data of an ovarian cancer patient;

performing Bootstrap resampling on the target gene data to obtain a first preset number of gene data sets;

obtaining the first preset number of target Bayesian network models based on each gene data set;

and performing confidence estimation on the first preset number of target Bayesian network models to obtain a target gene regulation network.

Optionally, the step of performing confidence estimation on the first preset number of target Bayesian network models to obtain a target gene regulatory network includes:

averaging the target Bayesian network models of the first preset number to obtain an intermediate gene regulation network;

performing confidence estimation on the intermediate gene regulation and control network to obtain the connection probability between any two nodes in the intermediate gene regulation and control network;

and outputting the connecting edge which is not lower than a preset probability threshold value in the intermediate gene regulation network based on the connection probability of the intermediate gene regulation network to obtain the target gene regulation network.

Optionally, the step of screening the source gene data to obtain target gene data includes:

performing a preset permutation test on the source gene data to obtain differentially expressed gene data whose confidence value is lower than a preset screening confidence;

and performing KEGG pathway enrichment analysis on the differentially expressed gene data to obtain target gene data enriched in a preset signaling pathway.

Optionally, the step of obtaining the first preset number of target Bayesian network models based on each gene data set includes:

learning each gene data set through a constraint-based algorithm to obtain a preliminary network structure corresponding to each gene data set;

and learning each preliminary network structure through a search-and-score algorithm to obtain the first preset number of target Bayesian network models.

Optionally, the step of learning each preliminary network structure through a search-and-score algorithm to obtain the first preset number of target Bayesian network models includes:

obtaining scores of all the preliminary network structures through a BIC score function;

traversing the scores of the preliminary network structures by a greedy hill climbing method, and taking the network structure with the highest BIC score as a target Bayesian network model.

Optionally, after traversing the scores of the preliminary network structures by a greedy hill-climbing method and taking the network structure with the highest BIC score as a target Bayesian network model, the method further includes:

and when detecting that the greedy hill-climbing method is trapped in a local optimum, performing a random restart search on each preliminary network structure a second preset number of times, and taking the network with the highest BIC score among the Bayesian networks obtained after the restarts as the target Bayesian network model.

Optionally, after obtaining the target gene regulatory network, the method further comprises:

randomly shuffling the order of all measured values of each gene in the target gene regulatory network to generate a new gene data set;

constructing a random restart network based on the new gene data set, and acquiring the confidence of the random restart network;

verifying the reliability of the target gene regulation network based on the confidence of the random restart network.

In addition, in order to achieve the above object, the present invention also provides a gene regulatory network constructing apparatus, comprising:

the data screening module is used for acquiring source gene data and screening the source gene data to acquire target gene data;

the sampling module is used for performing Bootstrap resampling on the target gene data to obtain a first preset number of gene data sets;

a Bayesian network model construction module for obtaining the first preset number of target Bayesian network models based on each gene data set;

and the gene regulation and control network construction module is used for carrying out confidence estimation on the first preset number of target Bayesian network models to obtain a target gene regulation and control network.

In addition, in order to achieve the above object, the present invention also provides a gene regulatory network constructing apparatus, comprising: a memory, a processor and a gene regulatory network construction program stored on the memory and executable on the processor, the gene regulatory network construction program configured to implement the steps of the gene regulatory network construction method as described above.

In addition, to achieve the above object, the present invention further provides a storage medium having a gene regulatory network construction program stored thereon, wherein the gene regulatory network construction program, when executed by a processor, implements the steps of the gene regulatory network construction method as described above.

The invention discloses a method, a device, equipment and a storage medium for constructing a gene regulation network. The method comprises the following steps: obtaining source gene data, and performing a preset permutation test on the source gene data to obtain differentially expressed gene data whose confidence value is lower than a preset screening confidence, wherein the source gene data is gene expression profile data of ovarian cancer patients; performing KEGG pathway enrichment analysis on the differentially expressed gene data to obtain target gene data enriched in a preset signaling pathway; performing Bootstrap resampling on the target gene data to obtain a first preset number of gene data sets; obtaining the first preset number of target Bayesian network models based on the gene data sets; averaging the first preset number of target Bayesian network models to obtain an intermediate gene regulation network; performing confidence estimation on the intermediate gene regulation network to obtain the connection probability between any two nodes in the intermediate gene regulation network; and outputting, based on the connection probabilities, the connecting edges of the intermediate gene regulation network whose connection probability is not lower than a preset probability threshold, to obtain the target gene regulation network. Unlike existing statistical methods, which cannot express gene regulatory relationships, the invention screens target gene data related to ovarian cancer from the source gene data through the preset permutation test and KEGG pathway enrichment analysis, obtains the first preset number of gene data sets by Bootstrap resampling of the target gene data, obtains the first preset number of target Bayesian network models from the gene data sets, and finally performs confidence estimation over the plurality of target Bayesian network models. The result is a target gene regulation network in which every retained connection between two nodes has a connection probability not lower than the preset probability threshold, each node represents a gene related to ovarian cancer, and each connecting edge represents a regulatory relationship between genes. The invention therefore provides a statistical model for determining ovarian cancer related genes and revealing their regulatory relationships.

Drawings

FIG. 1 is a schematic structural diagram of a gene regulatory network construction device of a hardware operating environment according to an embodiment of the present invention;

FIG. 2 is a schematic flow chart showing a method of constructing a gene regulatory network according to a first embodiment of the present invention;

FIG. 3 is a schematic view of a target gene regulatory network in a first embodiment of the method for constructing a gene regulatory network according to the present invention;

FIG. 4 is a schematic flow chart showing a method of constructing a gene regulatory network according to a second embodiment of the present invention;

FIG. 5 is a schematic flow chart showing a method for constructing a gene regulatory network according to a third embodiment of the present invention;

FIG. 6 is a schematic diagram showing the comparison of the number of directed edges between a target gene regulatory network and a random restart network under different predetermined probability thresholds according to the third embodiment of the method for constructing a gene regulatory network of the present invention;

FIG. 7 is a block diagram showing the construction of a first embodiment of the apparatus for constructing a gene regulatory network according to the present invention.

The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

Referring to fig. 1, fig. 1 is a schematic structural diagram of a gene regulatory network construction device of a hardware operating environment according to an embodiment of the present invention.

As shown in FIG. 1, the gene regulatory network constructing apparatus may include: a processor 1001, such as a Central Processing Unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to implement connection and communication among these components. The user interface 1003 may include a display screen and an input unit such as a keyboard; optionally, the user interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless-Fidelity (Wi-Fi) interface). The memory 1005 may be a high-speed Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as a disk memory. Optionally, the memory 1005 may also be a storage device separate from the processor 1001.

Those skilled in the art will appreciate that the architecture shown in FIG. 1 does not constitute a limitation of the gene regulatory network construction apparatus, which may include more or fewer components than shown, combine some components, or use a different arrangement of components.

As shown in fig. 1, the memory 1005, which is a storage medium, may include therein an operating system, a data storage module, a network communication module, a user interface module, and a gene regulatory network constructing program.

In the gene regulatory network constructing apparatus shown in fig. 1, the network interface 1004 is mainly used for data communication with a network server; the user interface 1003 is mainly used for data interaction with a user; the processor 1001 and the memory 1005 of the gene regulatory network construction device of the present invention may be provided in the gene regulatory network construction device, and the gene regulatory network construction device invokes the gene regulatory network construction program stored in the memory 1005 through the processor 1001 and executes the gene regulatory network construction method provided by the embodiment of the present invention.

The embodiment of the invention provides a method for constructing a gene regulation network, and referring to fig. 2, fig. 2 is a schematic flow chart of a first embodiment of the method for constructing the gene regulation network.

In this embodiment, the method for constructing the gene regulation network includes the following steps:

Step S10: obtaining source gene data, and screening the source gene data to obtain target gene data, wherein the source gene data are gene expression profile data of an ovarian cancer patient;

It should be noted that the execution subject of the method of this embodiment may be a computing service device with data processing, network communication and program running functions, such as a mobile phone, a television, a tablet computer or a personal computer, or another electronic device capable of implementing the same or similar functions. The gene regulatory network construction method provided in this embodiment and in each of the following embodiments is described with reference to the above gene regulatory network construction device (referred to simply as the network construction device).

It should be understood that, in the present embodiment, the gene regulatory network constructed by the network construction device is used for determining the ovarian cancer related genes and revealing the relationships among them, so the source gene data acquired by the network construction device may be gene expression profile data of ovarian cancer patients (which can be downloaded from the TCGA database). The TCGA database is an information database for gene sequencing. It stores data on glioblastoma multiforme and ovarian cancer, including clinical information on each case, such as basic data, treatment process, clinical stage, tumor pathology and survival.

It can be understood that whole gene expression profile data can contain expression values for tens of thousands of genes, so the amount of source gene data acquired by the network construction device is too large to model directly. In this embodiment, the genes related to ovarian cancer (i.e., the target gene data) are therefore screened out first, and the Bayesian network is then constructed based on the target gene data. Screening the source gene data improves modeling efficiency, makes the constructed network more reasonable, and helps to explain the relationships among the ovarian cancer related genes.

Further, in this embodiment, step S10 includes:

Step S101: performing a preset permutation test on the source gene data to obtain differentially expressed gene data whose confidence value is lower than a preset screening confidence;

It should be noted that the preset permutation test may be a permutation test based on the Wilcoxon rank-sum test or a permutation test based on the two-independent-sample t-test. The permutation test is performed on the source gene data and a control gene data set using a test statistic. Specifically, the samples of the source gene data and the control gene data set are randomly reordered (permuted) and the test statistic is recalculated, and this process is repeated multiple times; for example, 1000 permutations based on the Wilcoxon rank-sum test or on the two-independent-sample t-test may be performed. The recalculated statistics form an empirical distribution of the test statistic, from which the confidence value (P value) of each gene is calculated. The confidence value of each gene is then compared with the preset screening confidence, and the genes whose confidence value is lower than the preset screening confidence are taken as the differentially expressed gene data. The data of the control gene data set may also be downloaded from the TCGA database. The preset screening confidence may be set according to related experience, and its specific value is not limited in this embodiment.
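As an illustration of the permutation test described above, the following is a minimal sketch for a single gene, assuming the rank-sum of the patient group is used as the test statistic and a two-sided comparison is made. The function names `rank_sum` and `permutation_p_value`, the add-one correction and the fixed seed are assumptions made for this example and are not specified in the source.

```python
import random


def rank_sum(pooled, n_a):
    """Sum of the (tie-averaged) ranks of the first n_a values in the pooled sample."""
    order = sorted(range(len(pooled)), key=lambda i: pooled[i])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and pooled[order[j + 1]] == pooled[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return sum(ranks[:n_a])


def permutation_p_value(patient, control, n_perm=1000, seed=0):
    """Two-sided permutation P value for one gene (hypothetical helper)."""
    rng = random.Random(seed)
    pooled = list(patient) + list(control)
    n_a = len(patient)
    expected = n_a * (len(pooled) + 1) / 2  # rank-sum expected under no difference
    observed = abs(rank_sum(pooled, n_a) - expected)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly reassign samples to the two groups
        if abs(rank_sum(pooled, n_a) - expected) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive (an assumption).
    return (hits + 1) / (n_perm + 1)
```

A gene would then be kept as differentially expressed when the returned value is below the preset screening confidence, for example 0.05.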

Step S102: performing KEGG pathway enrichment analysis on the differentially expressed gene data to obtain target gene data enriched in a preset signaling pathway.

It should be understood that the KEGG database is a resource database widely used as a reference for biological knowledge such as biological pathways and cellular processes. In this embodiment, the KEGG pathway data is added to the analysis as background knowledge, so as to improve the accuracy and reliability of the research result.

It is understood that the preset signaling pathway may be configured according to the subject of analysis. For example, if the subject to be analyzed is ovarian cancer, the preset signaling pathway may be the p53 signaling pathway; if the subject to be analyzed is sensitivity to ovarian cancer chemotherapy drugs, the preset signaling pathways may be the Wnt signaling pathway and the lysosomal pathway. The type and number of the preset signaling pathways are therefore not limited in this embodiment. It is readily understood that, after KEGG pathway enrichment analysis, genes significantly associated with ovarian cancer or with sensitivity to ovarian cancer chemotherapy drugs will be significantly enriched in the preset signaling pathway.
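The source does not state which statistical test the KEGG pathway enrichment analysis uses. A common choice for this kind of over-representation analysis is the one-sided hypergeometric test, and the sketch below assumes it. The function name `enrichment_p_value` and its parameters are hypothetical, and the gene counts in the usage line are made-up illustration values, not results.

```python
from math import comb


def enrichment_p_value(n_background, n_pathway, n_selected, n_overlap):
    """P(X >= n_overlap) when n_selected genes are drawn without replacement
    from n_background genes, of which n_pathway belong to the pathway."""
    upper = min(n_pathway, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Illustrative numbers only: 20000 background genes, 70 in the pathway,
# 500 differentially expressed genes, 12 of which fall in the pathway.
p = enrichment_p_value(20000, 70, 500, 12)
```

A small value of `p` would indicate that the pathway contains more differentially expressed genes than expected by chance.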

Step S20: performing Bootstrap resampling on the target gene data to obtain a first preset number of gene data sets;

It should be noted that Bootstrap resampling may be performed by drawing a sample from the target gene data and then sampling from that sample with replacement, thereby obtaining a plurality of Bootstrap data sets. This embodiment can therefore obtain a plurality of gene data sets from the target gene data: each round of Bootstrap resampling yields one gene data set, and every gene data set contains the same amount of gene data. The first preset number is thus both the number of Bootstrap resampling rounds and the total number of gene data sets, and its specific value is not limited in this embodiment.
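The resampling step can be sketched as follows, assuming the target gene data is held as a list of per-patient expression vectors and that each Bootstrap data set has the same size as the original. The function name `bootstrap_datasets` and the fixed seed are assumptions made for this example.

```python
import random


def bootstrap_datasets(samples, n_datasets, seed=0):
    """Return n_datasets resampled data sets, each drawn from `samples`
    with replacement and each the same size as `samples`."""
    rng = random.Random(seed)
    n = len(samples)
    return [
        [samples[rng.randrange(n)] for _ in range(n)]
        for _ in range(n_datasets)
    ]


# Three patients, two genes each (made-up values); 1000 would match the
# number of resampling rounds used in the worked example of the text.
data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
gene_data_sets = bootstrap_datasets(data, n_datasets=4)
```

Here `n_datasets` plays the role of the first preset number.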

Step S30: obtaining the first preset number of target Bayesian network models based on each gene data set;

It should be understood that conventional statistical methods do not account for nonlinear relationships between variables or make use of prior knowledge. These problems can be handled well by a probabilistic graphical model, which can represent causal connections between variables rather than simple correlations. The most representative probabilistic graphical model is the Bayesian network model, which intuitively reflects the joint probability distribution of the variables in the form of a graph. Specifically, the Bayesian network model represents the relationships between variables as a directed acyclic graph, in which the nodes represent variables and the edges between nodes represent dependency relationships between the variables. A target Bayesian network model can therefore be generated from each gene data set, and a target gene regulation network can then be constructed.
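For reference, the way a Bayesian network encodes a joint probability distribution can be written as the standard factorization below. The symbols are assumptions introduced for this illustration: X_i denotes the expression of gene i, and Pa(X_i) denotes the parents of X_i in the directed acyclic graph.

```latex
P(X_1, \dots, X_n) = \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{Pa}(X_i)\bigr)
```

For a three-gene chain A → B → C, for example, this gives P(A, B, C) = P(A) P(B | A) P(C | B).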

Since the number of gene data sets is more than one, the number of target bayesian network models generated based on each gene data set is also more than one, and the number of target bayesian network models matches the number of gene data sets.

Step S40: and performing confidence estimation on the first preset number of target Bayesian network models to obtain a target gene regulation network.

It is understood that the nodes in the target Bayesian network model can represent different molecules (proteins, compounds, enzymes, etc.), the edges can represent different relationships between the nodes, such as activation and inhibition, and gene networks, compound networks and protein networks can be extracted from the metabolic pathways composed of these nodes and edges. Unlike traditional omics data analysis, the target Bayesian network model intuitively reflects the relationships between variables, and specifically their joint probability distribution, in the form of a graph. This embodiment can therefore perform confidence estimation on the target Bayesian network models to obtain the connection probability between nodes, and from it the relationships between the genes. The target gene regulatory network is thus described by nodes and the connecting edges between them, wherein each node represents a gene related to ovarian cancer and each connecting edge represents a regulatory relationship between genes, such as activation or inhibition.

Further, in this embodiment, step S40 includes:

Step S401: averaging the first preset number of target Bayesian network models to obtain an intermediate gene regulation network;

Step S402: performing confidence estimation on the intermediate gene regulation network to obtain the connection probability between any two nodes in the intermediate gene regulation network;

Step S403: outputting, based on the connection probabilities of the intermediate gene regulation network, the connecting edges whose connection probability is not lower than a preset probability threshold, to obtain the target gene regulation network.

It should be noted that, according to the law of large numbers, if the amount of statistical data is large enough, the observed frequency of an event approaches its expectation (mean value). That is, if a connection between two nodes really exists in the true network structure, the confidence (or connection probability) of the corresponding edge should be close to 1; if it does not exist, the confidence should be close to 0.

It should be understood that the confidence estimation may be Bootstrap confidence estimation. Based on the above principle, the network construction device in this embodiment may perform confidence estimation by averaging over the first preset number of target Bayesian network models obtained from the Bootstrap gene data sets. Specifically, for any two nodes, whether they are connected is recorded in the target Bayesian network model obtained from each round of Bootstrap resampling, and these records are averaged over all rounds. The average is the estimated connection probability (or confidence) between the two nodes, and the intermediate gene regulation network is constructed from the estimated connection probabilities.

It should be noted that some spurious edges may exist in the intermediate gene regulatory network. In this embodiment, the network construction device is therefore also configured with a preset probability threshold. It screens the network structure against this threshold and outputs the connecting edges of the intermediate gene regulatory network whose connection probability is not lower than the preset probability threshold, thereby obtaining the target gene regulatory network. It is easy to understand that the preset probability threshold should not be set too high, otherwise real edges are easily missed. In addition, if the number of connections between a node and the other nodes exceeds a preset connection number, the gene represented by that node is determined to be regulated by a plurality of genes and closely related to ovarian cancer, and can be identified as a pivot (hub) gene.
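Steps S401 to S403 and the pivot gene rule can be sketched as follows, assuming each target Bayesian network model is reduced to a set of directed edges written as (regulator, target) pairs. The function names and the gene labels G1 to G4 are placeholders for this example and do not come from the source.

```python
from collections import Counter


def edge_confidence(models):
    """Fraction of Bootstrap models in which each directed edge appears."""
    counts = Counter(edge for edges in models for edge in set(edges))
    return {edge: c / len(models) for edge, c in counts.items()}


def threshold_network(confidence, threshold=0.8):
    """Keep the edges whose connection probability is not lower than the threshold."""
    return {edge: p for edge, p in confidence.items() if p >= threshold}


def pivot_genes(network, preset_connections):
    """Genes whose number of connections exceeds the preset connection number."""
    degree = Counter()
    for regulator, target in network:
        degree[regulator] += 1
        degree[target] += 1
    return sorted(g for g, d in degree.items() if d > preset_connections)


# Five toy models standing in for the first preset number of networks.
models = [
    {("G1", "G2"), ("G1", "G3")},
    {("G1", "G2"), ("G1", "G3")},
    {("G1", "G2"), ("G1", "G3"), ("G4", "G1")},
    {("G1", "G2"), ("G1", "G3")},
    {("G1", "G2")},
]
confidence = edge_confidence(models)          # intermediate network
target_network = threshold_network(confidence, 0.8)
```

In this toy input the edge from G4 to G1 appears in only one of five models, so it is dropped at a threshold of 0.8, while both edges leaving G1 are retained.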

In a specific implementation, for convenience of understanding, FIG. 3 is taken as an example. FIG. 3 is a schematic diagram of a target gene regulatory network in the first embodiment of the method for constructing a gene regulatory network according to the present invention. The source gene data may be gene expression profile data of ovarian cancer patients, and the control gene data set may be gene expression profile data of a healthy population. The network construction device may first perform 1000 permutation tests based on the Wilcoxon rank-sum test on the gene expression profile data of the ovarian cancer patients and of the healthy population, and screen out the genes with P < 0.05 as the differentially expressed gene data, where 0.05 is the corresponding preset screening confidence. KEGG pathway enrichment analysis is then performed on these differentially expressed gene data, and the genes significantly enriched in the p53 signaling pathway (the corresponding preset signaling pathway) are taken as the target gene data. Target Bayesian networks are then constructed from the target gene data, and confidence estimation is performed on their network features in combination with Bootstrap resampling to obtain an intermediate gene regulation network; the number of resampling rounds can be set to 1000 to ensure the stability of the result. Finally, the preset probability threshold is set to 0.8 and the network is screened against it, yielding the target gene regulation network shown in FIG. 3. Each node in FIG. 3 may represent a gene enriched in the p53 signaling pathway, and a dark node may represent a pivot (hub) gene regulated by a plurality of genes. A dotted line may represent an edge whose connection probability (confidence) is greater than 0.8 and less than 1, and a solid line represents an edge whose connection probability is equal to 1.

In this embodiment, source gene data is obtained and a preset permutation test is performed on it to obtain differentially expressed gene data below a preset screening confidence level, where the source gene data is gene expression profile data of ovarian cancer patients; KEGG pathway enrichment analysis is performed on the differentially expressed gene data to obtain target gene data enriched in a preset signaling pathway; Bootstrap resampling is performed on the target gene data to obtain a first preset number of gene data sets; a first preset number of target Bayesian network models is obtained based on the gene data sets; the first preset number of target Bayesian network models are averaged to obtain an intermediate gene regulatory network; confidence estimation is performed on the intermediate gene regulatory network to obtain the connection probability between any two nodes in it; and the connecting edges whose connection probability is not lower than a preset probability threshold are output to obtain the target gene regulatory network. Unlike existing statistical methods, which cannot express gene regulatory relationships, this embodiment screens target gene data related to ovarian cancer from the source gene data through the preset permutation test and KEGG pathway enrichment analysis, obtains a first preset number of gene data sets by performing Bootstrap resampling on the target gene data, and obtains a first preset number of target Bayesian network models based on the gene data sets.
Finally, confidence estimation is performed based on the plurality of target Bayesian network models, thereby constructing a target gene regulatory network in which the connection probability between any two connected nodes is not lower than the preset probability threshold, where each node in the target gene regulatory network represents a gene related to ovarian cancer and each connecting edge between nodes represents a regulatory relationship between the corresponding genes.
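The resampling and confidence-estimation steps summarized above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function names, the `learn_structure` callable and the toy networks are all hypothetical, and the connection probability of an edge is assumed here to be the fraction of bootstrap networks that contain it.

```python
import random
from collections import Counter


def bootstrap_datasets(samples, n_sets, rng):
    """Draw n_sets data sets, each resampled with replacement to the original size."""
    n = len(samples)
    return [[samples[rng.randrange(n)] for _ in range(n)] for _ in range(n_sets)]


def edge_confidence(networks):
    """Connection probability of each directed edge: the share of networks containing it."""
    counts = Counter(edge for net in networks for edge in set(net))
    return {edge: count / len(networks) for edge, count in counts.items()}


def build_network(samples, learn_structure, n_sets, threshold, seed=0):
    """Bootstrap, learn one network per data set, keep edges at or above the threshold.

    learn_structure is a hypothetical callable (an assumption of this sketch) that
    maps a data set to an iterable of (regulator, target) edges.
    """
    rng = random.Random(seed)
    networks = [learn_structure(d) for d in bootstrap_datasets(samples, n_sets, rng)]
    confidence = edge_confidence(networks)
    return {edge: p for edge, p in confidence.items() if p >= threshold}


# Hand-written toy networks standing in for four learned Bayesian networks.
toy_networks = [
    {("A", "B"), ("B", "C")},
    {("A", "B")},
    {("A", "B"), ("C", "B")},
    {("A", "B"), ("B", "C")},
]
toy_confidence = edge_confidence(toy_networks)
print(toy_confidence[("A", "B")], toy_confidence[("B", "C")], toy_confidence[("C", "B")])
```

With a threshold of 0.8, only the edge from A to B would be kept in this toy case, since it is the only edge present in every network.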

Referring to fig. 4, fig. 4 is a schematic flow chart of a second embodiment of the method for constructing a gene regulatory network according to the present invention, and the second embodiment of the method for constructing a gene regulatory network according to the present invention is provided based on the embodiment shown in fig. 2.

It can be understood that the reason why the conventional statistical method cannot meet the requirement of the analysis is that the conventional method does not consider the problem of the nonlinear relationship between the variables and does not utilize the prior knowledge, and therefore, in the embodiment, the qualitative relationship between the variables is integrated into the bayesian network analysis in the form of the prior knowledge, and a more complete and true biological network can be constructed.

Further, in this embodiment, step S30 specifically includes:

step S301: learning each gene data set through a constraint-based algorithm to obtain a preliminary network structure corresponding to each gene data set;

step S302: learning each preliminary network structure through a search-scoring algorithm to obtain the first preset number of target Bayesian network models.

It should be noted that current Bayesian network structure learning methods can be classified as constraint-based algorithms or search-scoring algorithms. A constraint-based algorithm judges dependence and independence between variables through conditional independence tests (CI tests) and reflects these relations intuitively through the constructed network. The constraint-based algorithm has an advantage in computation speed and is therefore well suited to analyzing high-dimensional omics data. However, its independence tests rely on a significance level set in advance, and higher-order independence tests require larger sample sizes. For high-dimensional omics data, the sample size is often insufficient to obtain reliable high-order independence test results, which greatly increases the number of false positives and reduces the accuracy of the predicted biological network. A search-scoring algorithm evaluates how well networks with different structures fit the data by defining a scoring function: the higher the score, the better the network structure fits the data, so the algorithm selects the network structure with the highest score as the optimal network structure. A search-scoring algorithm can produce a more accurate structure than a constraint-based algorithm, is suitable for high-dimensional, small-sample omics data, and can identify structures that some constraint-based algorithms cannot. Its disadvantage is that learning is relatively slow, especially as the network grows, because the number of possible structures grows exponentially with the number of nodes.

It can be understood that this embodiment can combine the two algorithms so that each compensates for the other's weaknesses. In this embodiment, a constraint-based algorithm may first be used to learn, in a relatively short time, an undirected preliminary network structure corresponding to each gene data set; a search-scoring algorithm may then be used to further learn the network starting from the preliminary network structure, that is, to determine a direction for each edge in the preliminary network structure, so as to obtain the final network structure, namely the target Bayesian network model.
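The two-stage idea can be illustrated with a deliberately simplified sketch. It makes a strong assumption that is not in the original text: the constraint-based stage is replaced by a plain Pearson-correlation cutoff (an order-zero dependence check only, with no conditional independence tests), and the names `learn_skeleton`, `candidate_arcs` and the toy expression values are hypothetical. The point is only to show how an undirected skeleton narrows the set of directed edges that the scoring stage has to consider.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def learn_skeleton(expression, cutoff):
    """Stage 1 (simplified): undirected edges between genes with |correlation| >= cutoff."""
    return {
        frozenset((g1, g2))
        for g1, g2 in combinations(sorted(expression), 2)
        if abs(pearson(expression[g1], expression[g2])) >= cutoff
    }


def candidate_arcs(skeleton):
    """Stage 2 input: both orientations of every skeleton edge, for the scoring search."""
    arcs = set()
    for pair in skeleton:
        a, b = sorted(pair)
        arcs.update({(a, b), (b, a)})
    return arcs


# Toy expression values, invented for illustration only.
expression = {
    "g1": [1, 2, 3, 4, 5],
    "g2": [2, 4, 6, 8, 10],
    "g3": [1, -1, 0, -1, 1],
}
skeleton = learn_skeleton(expression, cutoff=0.8)
print(sorted(candidate_arcs(skeleton)))
```

In this toy case only g1 and g2 are linked, so the scoring stage would only need to decide between the two orientations of that single edge.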

Further, in this embodiment, step S302 specifically includes:

step S3021: obtaining scores of all the preliminary network structures through a BIC score function;

It should be understood that the score function used in the search-scoring algorithm measures how well a network structure matches the data. According to their basic principle, score functions can be divided into two categories: Bayesian scores, such as the BDe score, and information-theoretic scores, such as the AIC score and the BIC score. In practical applications, the BDe score learns better than the AIC and BIC scores. The AIC score penalizes complex networks more loosely, so the resulting network contains more false positives (that is, spurious edges are judged to be true edges). Although the BDe score performs best, it requires a parameter to be set, namely the equivalent sample size, whereas the BIC score differs little from the BDe score in effect and requires no parameter. Therefore, this embodiment may use the BIC score to measure how well the network structure matches the data.
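As a rough illustration of why the AIC penalty is looser, the commonly used textbook forms of the two scores can be compared. The symbols below are assumptions of this note and do not appear in the original text: the maximized likelihood is written as \hat{L}, the number of free parameters as d, and the sample size as N, with higher scores being better.

```latex
\begin{aligned}
\mathrm{AIC} &= \log \hat{L} - d, \\
\mathrm{BIC} &= \log \hat{L} - \tfrac{d}{2}\,\log N, \\
\tfrac{d}{2}\,\log N > d &\iff N > e^{2} \approx 7.39 .
\end{aligned}
```

Under these forms, once the sample size is 8 or more, each additional parameter costs more under BIC than under AIC, so BIC tends to favor sparser networks.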

It should be noted that the BIC score function can fully incorporate prior knowledge about the network structure. For example, if the prior joint probability of each preliminary network structure $D$ is expressed as $P(D)$, then, according to the Bayesian formula, given the training sample set $S$, the posterior joint probability of structure $D$ is

$$P(D \mid S) = \frac{P(S \mid D)\,P(D)}{P(S)}$$

Then, the BIC score function corresponding to each preliminary network structure is:

$$\mathrm{BIC}(D \mid S) = \log P(D \mid S) = \log P(S \mid D) + \log P(D) - \log P(S)$$

In the above expression, $P(S)$ is independent of the network structure $D$; $P(S \mid D)$, which is the marginal likelihood, is the average of the local conditional probabilities of all nodes comprised by the network structure $D$; and $P(D)$ is the prior joint probability distribution of the network structure $D$.
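For concreteness, a BIC-style score for discrete data can be computed as in the sketch below. It relies on assumptions that go beyond the text: the marginal likelihood is replaced by the usual large-sample approximation (maximized log-likelihood minus half the parameter count times the log of the sample size), the structure prior is taken as uniform so its term is dropped, and the function name `bic_score` and the toy rows are hypothetical.

```python
import math
from collections import Counter


def bic_score(rows, parents):
    """Approximate BIC of a discrete Bayesian network (higher is better).

    rows: list of dicts mapping variable name -> observed state.
    parents: dict mapping every variable to a tuple of its parent variables.
    """
    n = len(rows)
    cardinality = {v: len({r[v] for r in rows}) for v in parents}
    total = 0.0
    for var, pa in parents.items():
        # Counts of (parent configuration, state) and of parent configuration alone.
        joint = Counter((tuple(r[p] for p in pa), r[var]) for r in rows)
        config = Counter(tuple(r[p] for p in pa) for r in rows)
        log_lik = sum(c * math.log(c / config[cfg]) for (cfg, _), c in joint.items())
        # Free parameters: (states - 1) per parent configuration.
        n_params = (cardinality[var] - 1) * math.prod(cardinality[p] for p in pa)
        total += log_lik - 0.5 * n_params * math.log(n)
    return total


# Toy data, invented for illustration: B copies A in one set, is unrelated in the other.
copied = [{"A": 0, "B": 0}] * 4 + [{"A": 1, "B": 1}] * 4
unrelated = [{"A": a, "B": b} for a in (0, 1) for b in (0, 1)] * 2
with_edge = {"A": (), "B": ("A",)}
no_edge = {"A": (), "B": ()}

print(bic_score(copied, with_edge) > bic_score(copied, no_edge))
print(bic_score(unrelated, no_edge) > bic_score(unrelated, with_edge))
```

Both comparisons come out true here: the edge is rewarded when B depends on A, and penalized as unnecessary complexity when it does not.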

Step S3022: traversing the scores of the preliminary network structures by a greedy hill climbing method, and taking the network structure with the highest BIC score as a target Bayesian network model.

Step S3023: when the greedy hill climbing method is detected to be trapped in a local optimum, performing a random restart search on each preliminary network structure a second preset number of times, and taking the network with the highest BIC score among the Bayesian networks obtained after the restarts as the target Bayesian network model.

It should be understood that the BIC score function is asymptotically consistent under the assumed common prior joint distribution; that is, the true network structure can always be recovered when the sample size of the data is large enough. If all possible network structures are given a certain prior joint probability distribution and the sample size is sufficient, Bayesian network structure learning amounts to finding, based on the data, the best network structure $D^{*}$. This is the network structure at which the BIC score function $\mathrm{BIC}(D \mid S)$, evaluated on the training sample set $S$, attains its maximum value, and the optimal network $D^{*}$ obtained in this way from the preliminary network structures is taken as the target Bayesian network model.

It should be noted that, in this embodiment, a greedy hill climbing method may be used to obtain the network structure with the highest BIC score starting from each preliminary network structure. When the greedy hill climbing method is detected to be trapped in a local optimum, the network construction device may further perform a random restart search on each preliminary network structure a second preset number of times; that is, the network construction device may randomly perturb edges in each preliminary network structure (by addition, deletion, or reversal) and search again with the greedy hill climbing method. After the second preset number of restarts, restarting stops, and the network with the highest BIC score among the Bayesian networks obtained after each restart is selected as the target Bayesian network model.
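The search procedure described above can be sketched as follows. This is an illustrative outline under stated assumptions rather than the actual implementation: the score is passed in as an arbitrary callable (a toy score is used in the example instead of BIC), the number of perturbation moves per restart (`n_moves`) is a hypothetical parameter, and all function names are invented for this sketch.

```python
import random
from itertools import permutations


def is_acyclic(nodes, edges):
    """Kahn-style check that the directed graph has no cycle."""
    indegree = {v: 0 for v in nodes}
    for _, child in edges:
        indegree[child] += 1
    queue = [v for v in nodes if indegree[v] == 0]
    seen = 0
    while queue:
        v = queue.pop()
        seen += 1
        for parent, child in edges:
            if parent == v:
                indegree[child] -= 1
                if indegree[child] == 0:
                    queue.append(child)
    return seen == len(nodes)


def neighbours(nodes, edges):
    """All acyclic graphs one edge addition, deletion, or reversal away."""
    for a, b in permutations(nodes, 2):
        if (a, b) in edges:
            yield edges - {(a, b)}                       # deletion
            reversed_graph = (edges - {(a, b)}) | {(b, a)}
            if is_acyclic(nodes, reversed_graph):
                yield reversed_graph                     # reversal
        elif (b, a) not in edges:
            added = edges | {(a, b)}
            if is_acyclic(nodes, added):
                yield added                              # addition


def hill_climb(nodes, start, score):
    """Move to the best-scoring neighbour until no neighbour improves the score."""
    current = frozenset(start)
    best = score(current)
    while True:
        top = max(neighbours(nodes, current), key=score, default=None)
        if top is None or score(top) <= best:
            return current, best                         # local optimum reached
        current, best = top, score(top)


def perturb(nodes, edges, n_moves, rng):
    """Apply n_moves random single-edge changes to the given structure."""
    edges = frozenset(edges)
    for _ in range(n_moves):
        options = list(neighbours(nodes, edges))
        if options:
            edges = rng.choice(options)
    return edges


def hill_climb_with_restarts(nodes, start, score, n_restarts, n_moves=3, seed=0):
    """Greedy search plus n_restarts searches from perturbed copies of the start."""
    rng = random.Random(seed)
    best_graph, best_score = hill_climb(nodes, start, score)
    for _ in range(n_restarts):
        graph, value = hill_climb(nodes, perturb(nodes, start, n_moves, rng), score)
        if value > best_score:
            best_graph, best_score = graph, value
    return best_graph, best_score


# Toy score (not BIC): negative number of edges that differ from a known target.
nodes = ["A", "B", "C"]
target = {("A", "B"), ("B", "C")}
toy_score = lambda g: -len(set(g) ^ target)
best_graph, best_value = hill_climb_with_restarts(nodes, frozenset(), toy_score, n_restarts=3)
print(sorted(best_graph), best_value)
```

In practice the toy score would be replaced by a BIC score evaluated on a gene data set, and the starting structure by the preliminary network from the constraint-based stage.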

This embodiment learns each gene data set through a constraint-based algorithm to obtain a preliminary network structure corresponding to each gene data set; obtains the score of each preliminary network structure through a BIC score function; and traverses the scores of the preliminary network structures by a greedy hill climbing method, taking the network structure with the highest BIC score as the target Bayesian network model. When the greedy hill climbing method is detected to be trapped in a local optimum, a random restart search is performed on each preliminary network structure a second preset number of times, and the network with the highest BIC score among the Bayesian networks obtained after the restarts is used as the target Bayesian network model. By constructing the target Bayesian network model through a constraint-based algorithm together with a search-scoring algorithm based on the BIC score function, the greedy hill climbing method and random restart search, the method integrates the qualitative relations among the variables in the network into Bayesian network structure learning in the form of prior knowledge, constructs a more complete and realistic biological network, and improves how well the Bayesian network structure fits the data.

Referring to fig. 5, fig. 5 is a schematic flow chart of a third embodiment of the method for constructing a gene regulatory network according to the present invention, which is proposed based on the embodiment shown in fig. 2 or 3, and fig. 4 is an example based on the embodiment shown in fig. 1.

It is understood that after the target gene regulatory network is constructed, its reliability can be checked to prevent errors. To this end, the network construction apparatus can generate a new gene data set by randomly rearranging the measured values of each gene. In such a data set the genes are independent of each other, so true edges are not expected to be found in it.

Further, in this embodiment, after step S40, the method further includes:

step S50: randomly shuffling the order of all measured values of each gene in the target gene regulatory network to generate a new gene data set;

step S60: constructing a random restart network based on the new gene data set, and acquiring the confidence of the random restart network;

step S70: verifying the reliability of the target gene regulatory network based on the confidence of the random restart network.

It should be noted that, in this embodiment, the random restart network is constructed from the new gene data set in the same way as in the first and second embodiments: Bootstrap resampling is performed on the new gene data set to construct Bayesian networks, and the connecting edges in the network structure of the constructed Bayesian networks are screened by probability through Bootstrap confidence estimation and the preset probability threshold, so as to obtain the random restart network.
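Generating the rearranged data set can be sketched as below, assuming the expression data is held as a mapping from gene name to a list of per-sample measurements. The function name `shuffle_each_gene`, the data layout and the toy values are assumptions made for illustration.

```python
import random


def shuffle_each_gene(expression, seed=0):
    """Independently permute each gene's measured values across samples.

    expression: dict mapping gene name -> list of measured values (one per sample).
    Each gene keeps its own value distribution, but the sample-wise pairing
    between genes is broken, so dependencies between genes are destroyed.
    """
    rng = random.Random(seed)
    shuffled = {}
    for gene, values in expression.items():
        permuted = list(values)      # copy so the input is left untouched
        rng.shuffle(permuted)
        shuffled[gene] = permuted
    return shuffled


# Toy values, invented for illustration.
real = {"g1": [0.1, 0.5, 0.9, 1.3], "g2": [0.2, 1.0, 1.8, 2.6]}
null = shuffle_each_gene(real, seed=42)
```

The shuffled data set would then be passed through the same Bootstrap and structure-learning pipeline as the real data.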

It should be understood that, after the random restart network is obtained, the reliability of the target gene regulatory network is verified according to the confidence of the random restart network. Specifically, for each of several preset probability thresholds, the number of directed edges whose connection probability exceeds the threshold may be counted in both the random restart network and the target gene regulatory network, and the two counts compared, where the preset probability threshold represents a confidence threshold of the network structure.

For ease of understanding, fig. 6 is taken as an example. Fig. 6 is a schematic diagram comparing the numbers of directed edges of the target gene regulatory network and the random restart network under different preset probability thresholds in the third embodiment of the method for constructing a gene regulatory network according to the present invention. In fig. 6, the solid line indicates the number of directed edges under different preset probability thresholds in the target gene regulatory network constructed from real data, the dotted line indicates the number of directed edges under different preset probability thresholds in the random restart network constructed from randomly rearranged data, the horizontal axis indicates the preset probability threshold (or confidence threshold), and the vertical axis indicates the number of directed edges in the network whose connection probability is greater than or equal to the corresponding threshold. As shown in fig. 6, the confidence of the edges in the network constructed from the randomly rearranged data is generally lower, and comparing the edge counts of the real data set and the random data set at different confidence levels shows that the edge-count distribution of the real data has a longer and heavier tail in the high-confidence region. When the confidence is greater than 0.2, a gap opens between the two curves and widens as the confidence increases, which indicates that the network relations obtained from the real data have a certain degree of confidence, that is, the target gene regulatory network can recover a large number of network relations.
Conversely, if the confidence of the edges in the network constructed from the randomly rearranged data is high, or differs little from the confidence of the edges in the target gene regulatory network, the confidence of the target gene regulatory network obtained from the real data is not high, and the target gene regulatory network needs to be reconstructed.
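The comparison described above can be tabulated with a few lines of code. The sketch assumes each network is summarized as a mapping from directed edge to connection probability; the function names and the probability values are invented for illustration and are not results from the patent.

```python
def edges_at_or_above(confidence, threshold):
    """Number of directed edges whose connection probability is >= threshold."""
    return sum(1 for p in confidence.values() if p >= threshold)


def compare_edge_counts(real_confidence, null_confidence, thresholds):
    """Rows of (threshold, edge count on real data, edge count on shuffled data)."""
    return [
        (t, edges_at_or_above(real_confidence, t), edges_at_or_above(null_confidence, t))
        for t in thresholds
    ]


# Made-up connection probabilities for a real-data network and a shuffled-data network.
real_confidence = {("A", "B"): 1.0, ("B", "C"): 0.9, ("C", "D"): 0.6, ("A", "D"): 0.3}
null_confidence = {("A", "B"): 0.3, ("D", "C"): 0.25, ("B", "D"): 0.1}

for row in compare_edge_counts(real_confidence, null_confidence, [0.2, 0.5, 0.8]):
    print(row)
```

A gap between the two counts that persists or widens at higher thresholds is the pattern the text associates with a reliable target network; similar counts would suggest the network should be rebuilt.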

In this embodiment, a new gene data set is generated by randomly shuffling the order of all measured values of each gene in the target gene regulatory network; a random restart network is constructed based on the new gene data set, and the confidence of the random restart network is acquired; and the reliability of the target gene regulatory network is verified based on the confidence of the random restart network. That is, a random restart network is constructed from a gene data set generated by random rearrangement, and the numbers of directed edges of the random restart network and the target gene regulatory network under different preset probability thresholds are then compared, so that the confidence distributions of the two networks are obtained and the confidence of the target gene regulatory network can be judged. This embodiment can therefore verify the stability and reliability of the target gene regulatory network, which improves its credibility.

In addition, a storage medium is provided in an embodiment of the present invention, where the storage medium stores a gene regulatory network constructing program, and the gene regulatory network constructing program is executed by a processor to implement the steps of the gene regulatory network constructing method described above.

Referring to FIG. 7, FIG. 7 is a block diagram showing the construction of a first embodiment of the apparatus for constructing a gene regulatory network according to the present invention.

As shown in fig. 7, the gene regulatory network constructing apparatus provided in the embodiment of the present invention includes:

the data screening module 701 is used for acquiring source gene data and screening the source gene data to acquire target gene data, wherein the source gene data is gene expression profile data of an ovarian cancer patient;

a sampling module 702, configured to perform Bootstrap resampling on the target gene data to obtain a first preset number of gene data sets;

a bayesian network model constructing module 703 configured to obtain the first preset number of target bayesian network models based on each gene data set;

and a gene regulatory network constructing module 704, configured to perform confidence estimation on the first preset number of target bayesian network models to obtain a target gene regulatory network.

In this embodiment, source gene data is obtained and a preset permutation test is performed on it to obtain differentially expressed gene data below a preset screening confidence level, where the source gene data is gene expression profile data of ovarian cancer patients; Bootstrap resampling is performed on the target gene data to obtain a first preset number of gene data sets; and confidence estimation is performed on the first preset number of target Bayesian network models to obtain a target gene regulatory network. Unlike existing statistical methods, which cannot express gene regulatory relationships, this embodiment can screen target gene data related to ovarian cancer from the source gene data, perform Bootstrap resampling on the target gene data to obtain a first preset number of gene data sets, and obtain a first preset number of target Bayesian network models based on the gene data sets. Finally, confidence estimation is performed based on the plurality of target Bayesian network models to construct the target gene regulatory network, so that this embodiment provides a statistical model for determining ovarian cancer related genes and revealing the regulatory relationships among them.

Other embodiments or specific implementation manners of the gene regulatory network construction device of the invention can refer to the above method embodiments, and are not described herein again.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional like elements in a process, method, article, or system that comprises the element.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

Through the description of the foregoing embodiments, it is clear to those skilled in the art that the method of the foregoing embodiments may be implemented by software plus a necessary general hardware platform, and certainly may also be implemented by hardware, but in many cases, the former is a better implementation. Based on such understanding, the technical solution of the present invention or the portions contributing to the prior art may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) as described above and includes several instructions for enabling a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.

The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention, and all equivalent structures or equivalent processes performed by the present invention or directly or indirectly applied to other related technical fields are also included in the scope of the present invention.

Claims

1. A gene regulation network construction method is characterized by comprising the following steps:

obtaining source gene data, and screening the source gene data to obtain target gene data, wherein the source gene data is gene expression profile data of an ovarian cancer patient;

2. The method for constructing a gene regulatory network of claim 1, wherein the step of performing confidence estimation on the first predetermined number of target bayesian network models to obtain a target gene regulatory network comprises:

3. The method for constructing a gene regulatory network according to claim 1, wherein the step of screening the source gene data to obtain the target gene data comprises:

4. The method of constructing a gene regulatory network of claim 1, wherein the step of obtaining the first preset number of target bayesian network models based on each gene data set comprises:

and learning each initial network structure through a search scoring algorithm to obtain the first preset number of target Bayesian network models.

5. The method of constructing a gene regulatory network of claim 4, wherein the step of learning each initial network structure by a search scoring algorithm to obtain the first preset number of target Bayesian network models comprises:

obtaining the score of each preliminary network structure through a BIC score function;

and traversing the scores of the preliminary network structures by a greedy hill climbing method, and taking the network structure with the highest BIC score as a target Bayesian network model.

6. The method for constructing a gene regulatory network of claim 5, wherein after traversing the scores of the preliminary network structures by a greedy hill climbing method and using the network structure with the highest BIC score as the target Bayesian network model, the method further comprises:

and when the greedy hill climbing method is detected to be trapped in a local optimum, performing a random restart search on each preliminary network structure a second preset number of times, and taking the network with the highest BIC score among the Bayesian networks obtained after the restarts as the target Bayesian network model.

7. The method for constructing a gene regulatory network of claim 1, wherein after obtaining the target gene regulatory network, the method further comprises:

randomly disturbing the sequence of all measured values of each gene in the target gene regulation and control network to generate a new gene data set;

8. A gene regulatory network constructing apparatus, comprising:

9. A gene regulatory network construction apparatus, comprising: a memory, a processor, and a gene regulatory network construction program stored on the memory and executable on the processor, the gene regulatory network construction program configured to implement the steps of the gene regulatory network construction method of any one of claims 1 to 7.

10. A storage medium having stored thereon a gene regulatory network construction program which, when executed by a processor, implements the steps of the gene regulatory network construction method according to any one of claims 1 to 7.