CN114093426B

CN114093426B - Marker screening method based on gene regulation network construction

Info

Publication number: CN114093426B
Application number: CN202111330308.9A
Authority: CN
Inventors: 黄晓然; 林晓惠; 东坤杰
Original assignee: Dalian University of Technology
Current assignee: Dalian University of Technology
Priority date: 2021-11-11
Filing date: 2021-11-11
Publication date: 2024-05-07
Anticipated expiration: 2041-11-11
Also published as: CN114093426A

Abstract

The invention discloses a marker screening method based on gene regulatory network construction, which screens biomarkers by constructing a difference network and belongs to the technical field of biological data analysis. The method first executes a genetic algorithm in a global scope and clusters the tasks according to the optimal individual corresponding to each task, so that similar tasks fall into the same class. It then executes the genetic algorithm within each class and carries out information migration within each task to further optimize the inference of regulatory relationships, yielding the final regulatory network. Finally, regulatory networks are constructed separately on normal and disease samples to obtain a difference network, and genes are screened as markers through the difference network. The core of the invention is to mine the inherent relationships between different genes hidden in gene expression data by combining a floating-point-coded genetic algorithm with a multi-objective optimization mode, to establish an effective gene relationship inference model, to construct a gene regulatory network, and to screen markers through the differential regulatory network.

Description

Marker screening method based on gene regulation network construction

Technical Field

The invention belongs to the technical field of biological data analysis. It discloses a method that analyzes the relationships among gene expression data to infer potential regulatory relationships among genes, constructs gene regulatory networks from them, builds a difference network, and screens biomarkers.

Background

In biological cells, genes are expressed through transcription and translation, and the expression products can activate or inhibit the expression of other genes; this is regulation among genes. The regulatory relationships between genes differ between physiological states such as health, disease, and post-operative intervention. Analyzing the regulatory relationships among genes under different physiological states and finding where they differ therefore makes it possible to identify the genes responsible for the difference in physiological state, namely biomarkers, which is of great significance for disease treatment, drug research and development, and related work.

Methods for constructing gene regulatory networks are diverse. For example: a co-expression network can be established from the Pearson correlation coefficient by calculating the degree of correlation between genes in the gene expression data; the partial correlation coefficient measures the correlation of two genes after removing the effects of other genes; mutual information measures the relationship between two genes in expression data containing nonlinear relationships; and so on. In addition, the genetic algorithm (Genetic Algorithm, GA) can also be used for gene regulatory network inference. Assuming that n genes are present, the GA treats each gene in turn as the regulated gene (i.e., the target gene), and then searches the other n-1 genes for the combination of genes that regulate it through a series of operations such as crossover, mutation, and screening. This approach achieves good results in many gene regulatory network construction problems.

The invention starts from the perspective of multi-task optimization: the inference of the regulatory relationships of each target gene is treated as a separate task, and the similarity between tasks is fully considered and exploited while the gene regulatory network is inferred, so as to improve the accuracy and efficiency of inference. Meanwhile, the idea of transfer learning is introduced when the genetic algorithm is used to solve the resulting multi-task optimization problem, so that information is transferred more effectively between tasks and the gene regulatory network is inferred more accurately. Finally, the method is used to construct regulatory networks separately on normal and disease samples, a differential regulatory network is obtained from the difference between the two networks, and the corresponding markers are screened through that network.

Disclosure of Invention

The invention aims to infer a gene regulatory network with a genetic algorithm based on the idea of multi-task optimization, and thereby to construct a difference network and screen biomarkers. In the implementation, the inference of the regulatory relationships of each target gene is regarded as one task, and the genetic algorithms of the individual tasks, which run in parallel, are optimized.

In order to achieve the above object, the present invention adopts the following technical scheme:

The marker screening method based on gene regulation network construction comprises the following steps:

Let $F = \{f_1, f_2, \dots, f_n\}$ represent the gene set, where n is the number of genes, and let $X = \{x_1, x_2, \dots, x_m\}$ represent the sample set, where m is the number of samples. Let $V = (V_1, V_2, \dots, V_i, \dots, V_n)$ denote the expression levels of the n genes over the m samples, where $V_i = (v_{i1}, v_{i2}, \dots, v_{im})$ denotes the expression level of the i-th gene over the m samples.

(1) Randomly initializing a population of size N, i.e. generating a number N of individuals, each individual comprising the following information:

① Factor cost (Factorial Cost): the factor cost of individual $p_i$ on task $T_j$ is the objective function value of $p_i$ on $T_j$;

② Factor rank (Factorial Rank): for task $T_j$, the individuals are sorted by factor cost, and the factor rank of individual $p_i$ on task $T_j$ is defined as the index of $p_i$ in the sorted list;

③ Skill factor (Skill Factor): among all tasks, the index of the task on which individual $p_i$ performs best is defined as the skill factor $\tau_i$ of $p_i$, also known as its "primary task" (main task), i.e., the task on which the factor rank of $p_i$ is smallest.

In addition, the dimension of an individual is the number of genes n. Unlike the common binary coding, the value in each dimension is floating-point coded: the value of a dimension represents the weight of the gene corresponding to that dimension, and the larger the weight, the more likely a regulatory relationship exists between that gene and the target gene of the current individual. Factor cost, factor rank, and skill factor are updated after each iteration;
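As an illustration of how factor rank and skill factor follow from factor costs, the following Python sketch ranks the individuals on each task and then picks each individual's main task. The cost table is made-up illustrative data, the function names `factorial_ranks` and `skill_factors` are hypothetical, and sorting costs in ascending order and breaking ties by the lowest task index are assumptions not stated in the text.

```python
from typing import List


def factorial_ranks(costs: List[List[float]]) -> List[List[int]]:
    """costs[i][j] is the factor cost of individual i on task j.

    Returns ranks[i][j], the 1-based position of individual i when all
    individuals are sorted by ascending cost on task j (assumed ordering).
    """
    n_individuals = len(costs)
    n_tasks = len(costs[0])
    ranks = [[0] * n_tasks for _ in range(n_individuals)]
    for j in range(n_tasks):
        order = sorted(range(n_individuals), key=lambda i: costs[i][j])
        for position, i in enumerate(order, start=1):
            ranks[i][j] = position
    return ranks


def skill_factors(ranks: List[List[int]]) -> List[int]:
    """1-based index of the task on which each individual ranks best.

    Ties go to the lowest task index (an assumption for this sketch).
    """
    return [min(range(len(r)), key=lambda j: r[j]) + 1 for r in ranks]


# Made-up factor costs: 3 individuals (rows) on 3 tasks (columns).
costs = [
    [0.9, 0.1, 0.5],
    [0.4, 0.3, 0.2],
    [0.6, 0.8, 0.7],
]
ranks = factorial_ranks(costs)
taus = skill_factors(ranks)
```

In this toy table the first individual has the lowest cost on the second task, so its skill factor is 2.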

(2) Randomly extracting parent individuals from the current population for crossover and mutation: a certain number of parent pairs are randomly drawn from all individuals. For a pair of parents $p_a$, $p_b$, a value h is randomly generated in (0, 1). If $\tau_a = \tau_b$ or h is smaller than the balance factor between crossover and mutation, the two individuals are crossed to generate offspring $c_a$, $c_b$, each of which retains the main task of its own parent; otherwise, the two individuals are mutated to generate offspring $c_a$, $c_b$, each of which likewise retains the main task of its own parent.

Crossover:

$$c_a = \beta p_b + (1-\beta)\,p_a \qquad (1)$$

$$c_b = \beta p_a + (1-\beta)\,p_b \qquad (2)$$

where β is the crossover constant, taking a value in (0, 1).
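The crossover formulas (1) and (2), together with the rule for choosing between crossover and mutation, can be sketched in Python as follows. The function names and the `balance` parameter name are hypothetical; the parent vectors and β = 0.3 are those used in the worked example of the Detailed Description.

```python
import random


def crossover(p_a, p_b, beta):
    """Arithmetic crossover following formulas (1) and (2)."""
    c_a = [beta * b + (1 - beta) * a for a, b in zip(p_a, p_b)]
    c_b = [beta * a + (1 - beta) * b for a, b in zip(p_a, p_b)]
    return c_a, c_b


def use_crossover(tau_a, tau_b, balance, rng=random):
    """Decide between crossover (True) and mutation (False).

    Crossover is chosen when both parents share a skill factor, or when a
    random h drawn from [0, 1) is below the balance factor.
    """
    h = rng.random()
    return tau_a == tau_b or h < balance


p1 = [0.2, 0.4, 0.1, 0.3, 0.2]
p2 = [0.3, 0.3, 0.2, 0.5, 0.1]
c1, c2 = crossover(p1, p2, beta=0.3)
```

Up to floating-point rounding, this gives c1 = (0.23, 0.37, 0.13, 0.36, 0.17) and c2 = (0.27, 0.33, 0.17, 0.44, 0.13), the values stated in the embodiment.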

Mutation:

In mutation formula (3), k is the mutation constant, taking a value in (0, 1); $p_{max}$ is the upper limit of the value of each dimension of an individual and $p_{min}$ is the lower limit; and r is a random number generated by rand();

(3) Screening individuals in the current population to produce offspring: the objective function value of each individual on its main task τ is calculated. The objective function value is the sum, over the m samples, of the squared residuals between the expression value of the target gene of the task and the value obtained by multiplying the expression values of the remaining n-1 genes by their corresponding weights and summing:

$$\left\lVert V_{ma} - p_i^{\setminus ma}\, V^{\setminus ma} \right\rVert_2^2 \qquad (4)$$

Let the main task of individual $p_i$ be $\tau_i = T_{ma}$. Formula (4) is then the objective function value of individual $p_i$ on its main task $T_{ma}$, where $V^{\setminus ma} = (V_1, V_2, \dots, V_{ma-1}, V_{ma+1}, \dots, V_n)$ is the matrix formed by the expression value vectors of the n-1 genes other than the ma-th gene, and $p_i^{\setminus ma}$ denotes the weights of $p_i$ for those n-1 genes. The smaller the objective function value of an individual on a task, the more accurate that individual's inference of the regulatory relationships of the task's target gene. For the optimal individual corresponding to each task, an identical copy is generated and added to the current population while the original is kept; the main task of the copy is replaced with a random one of the remaining tasks, and the objective function value of the copy on its new main task is calculated. The N individuals with the highest fitness, i.e., the smallest objective function values, are selected from the current population to form the offspring, and the remaining individuals are eliminated.
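The objective described in step (3), followed by the selection of the N fittest individuals, might be sketched in Python as below. This is a minimal sketch rather than the patented implementation: the names `objective` and `select_best` are hypothetical, gene indices are 0-based, and the small expression matrix is made-up data chosen so that the residuals are easy to check by hand.

```python
def objective(p, V, target):
    """Sum of squared residuals for one individual on one task.

    p      : weight vector with one entry per gene (length n)
    V      : V[g][s] is the expression of gene g in sample s (n x m)
    target : 0-based index of the target gene of the task
    The weight stored in the target gene's own dimension is ignored.
    """
    n, m = len(V), len(V[0])
    total = 0.0
    for s in range(m):
        predicted = sum(p[g] * V[g][s] for g in range(n) if g != target)
        total += (V[target][s] - predicted) ** 2
    return total


def select_best(population, costs, N):
    """Keep the N individuals with the smallest objective function values."""
    order = sorted(range(len(population)), key=lambda i: costs[i])
    return [population[i] for i in order[:N]]


# Made-up data: 3 genes x 2 samples; gene 0 is exactly 2 x gene 1.
V = [[1.0, 2.0],
     [0.5, 1.0],
     [2.0, 4.0]]
exact = [0.0, 2.0, 0.0]   # reproduces gene 0 perfectly
poor = [0.0, 0.0, 0.25]   # predicts 0.5 and 1.0 instead of 1.0 and 2.0
costs = [objective(poor, V, 0), objective(exact, V, 0)]
survivors = select_best([poor, exact], costs, N=1)
```

Here `exact` has an objective value of 0 and `poor` of 1.25, so only `exact` survives the selection.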

Steps (2) and (3) are iterated. After each round, the accuracy of the currently inferred regulatory relationships is measured by the mean of the objective function values of the current optimal individuals of all tasks. If this mean reaches the set threshold or the number of iterations reaches the maximum, the iteration stops and step (4) is executed;
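The stopping rule can be written as a short Python check. This is only a sketch: the function name is hypothetical, and reading "reaches the set threshold" as "is less than or equal to the threshold" is an assumption, made because smaller objective values are better.

```python
def should_stop(best_costs, threshold, iteration, max_iterations):
    """Stop when the mean best objective value is low enough or time is up.

    best_costs holds the objective function value of the current optimal
    individual of each task.
    """
    mean_best = sum(best_costs) / len(best_costs)
    # Assumption: the threshold is "reached" when the mean drops to or below it.
    return mean_best <= threshold or iteration >= max_iterations
```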

(4) Clustering the tasks: clustering optimal individuals corresponding to each task by using a K-Means algorithm, so that n tasks are divided into several major classes;
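To make the clustering step concrete, the sketch below runs a plain K-Means (Lloyd's iteration) in standard-library Python over the optimal individuals of the tasks. It is an illustration under stated assumptions, not the patented implementation: the function names are hypothetical, k = 2 is an arbitrary choice, the initial centroids are fixed by hand so the run is reproducible, and the five vectors are the optimal individuals listed in step (7) of the Detailed Description.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Plain K-Means; returns one cluster label per point.

    init may supply k starting centroids; otherwise k points are sampled.
    """
    rng = random.Random(seed)
    start = init if init is not None else rng.sample(points, k)
    centroids = [list(c) for c in start]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(pt, centroids[c]))
            for pt in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the run has converged
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Optimal individuals of tasks T1..T5 from the embodiment.
best = [
    [0.15, 0.69, 0.45, 0.88, 0.06],
    [0.02, 0.12, 0.45, 0.71, 0.36],
    [0.66, 0.11, 0.56, 0.03, 0.30],
    [0.78, 0.11, 0.44, 0.01, 0.23],
    [0.02, 0.12, 0.05, 0.81, 0.16],
]
# Hand-picked starting centroids (assumption): the T1 and T3 individuals.
labels = kmeans(best, k=2, init=[best[0], best[2]])
```

With these starting centroids the tasks T1, T2 and T5 end up in one class and T3 and T4 in the other. K-Means can settle on a different partition from other starting points.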

(5) Extracting parent individuals from the populations corresponding to similar tasks to execute the genetic evolution process: step (4) clusters all tasks so that tasks within the same class are highly similar and tasks in different classes are as dissimilar as possible. In this step, steps (2) and (3) are executed again, except that the algorithm is executed only within a class, and the algorithms of the different major classes are executed in parallel;

(6) Information migration among different dimensions of the same task: from the individuals with the same main task, an individual $p_r$ and a group of n-1 individuals are randomly extracted, the n-1 individuals forming an individual set S. A dimension l is selected, and the l-th-dimension values of the individuals in S form an (n-1)-dimensional vector. The values of this vector are used in turn to replace the n-1 dimensions of $p_r$ other than the dimension of the target gene of its main task, generating a new individual $q_r$. During replacement, if replacing a given dimension improves the fitness of $p_r$, i.e., reduces the objective function value, the replacement is made; otherwise it is not made and the process moves on to the next dimension. The specific process is illustrated in FIG. 1.

The above process is described in pseudocode as follows:

where $P_i$ is the current population corresponding to task $T_i$, and the value taken from individual $p_j$ is the one in its l-th dimension;
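One way to read the information-migration step is the Python sketch below. It is a sketch under stated assumptions, not the pseudocode of the original: the function names are hypothetical, indices are 0-based, the expression matrix `V` is made-up data, and each accepted replacement is assumed to stay in place while later dimensions are tried. The vectors `p_r` and `S` and the choice of the third dimension are taken from step (6) of the Detailed Description.

```python
def objective(p, V, target):
    """Sum of squared residuals of individual p on the task of gene `target`."""
    n, m = len(V), len(V[0])
    return sum(
        (V[target][s] - sum(p[g] * V[g][s] for g in range(n) if g != target)) ** 2
        for s in range(m)
    )


def migrate(p_r, S, l, V, target):
    """Greedy dimension-wise migration from the set S into p_r.

    z collects the l-th value of every individual in S (n-1 values). Each
    value is tried in turn on the non-target dimensions of p_r and kept
    only if it lowers the objective function value.
    """
    z = [individual[l] for individual in S]
    q = list(p_r)
    best = objective(q, V, target)
    free_dims = [d for d in range(len(p_r)) if d != target]
    for d, value in zip(free_dims, z):
        previous = q[d]
        q[d] = value
        cost = objective(q, V, target)
        if cost < best:
            best = cost          # improvement: keep the replacement
        else:
            q[d] = previous      # no improvement: undo and move on
    return q


# Made-up expression data: 5 genes x 3 samples.
V = [[1.0, 2.0, 3.0],
     [0.5, 1.5, 2.5],
     [2.0, 1.0, 0.5],
     [1.0, 1.0, 2.0],
     [3.0, 2.0, 1.0]]
p_r = [0.2, 0.4, 0.1, 0.3, 0.2]
S = [[0.02, 0.12, 0.45, 0.71, 0.36],
     [0.66, 0.11, 0.56, 0.03, 0.30],
     [0.78, 0.11, 0.44, 0.01, 0.23],
     [0.02, 0.12, 0.05, 0.81, 0.16]]
q_r = migrate(p_r, S, l=2, V=V, target=0)  # l=2 is the third dimension
```

Because `V` is invented, the resulting `q_r` need not equal the one in the embodiment. What the sketch guarantees is that `q_r` is never worse than `p_r` on the task and that the target gene's own dimension is left untouched.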

(7) Acquiring the regulatory relationship matrix: the individual with the optimal objective function value on each task is selected as the final screening result. Because each task has a corresponding target gene, arranging these individuals in the order of their target genes, with the dimensions of every individual in the same gene order, yields a matrix D formed from the individual vectors. The larger the value in row i, column j of the matrix, the stronger the regulatory effect of gene j on gene i, and the more likely an edge exists between the two in the constructed regulatory network;

(8) Constructing a difference network and screening markers: steps (1)-(7) are executed separately on the normal samples and the disease samples to obtain two regulatory relationship matrices $D_n$ and $D_i$; the difference of the two matrices is taken element by element in absolute value to obtain the difference regulation matrix $D_s$:

D_s＝|D_n-D_i| (5)

The larger the value in row i, column j of $D_s$, the greater the difference in the regulation of gene i by gene j before and after the disease, and the more likely an edge exists between the two in the constructed differential regulatory network. The greater the degree of a gene in the difference network, the more likely it is to be an important disease-related gene, and thus the more likely it is to be screened as a marker.
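Formula (5) and the degree-based screening can be illustrated with the Python sketch below. The text does not say how an edge is decided, so the fixed `threshold` is an assumption of this sketch, as are the function names, the exclusion of the diagonal, the counting of each directed edge for both of its endpoints, and the made-up 3 x 3 matrices.

```python
def difference_matrix(D_n, D_i):
    """Element-wise |D_n - D_i|, as in formula (5)."""
    return [[abs(a - b) for a, b in zip(row_n, row_i)]
            for row_n, row_i in zip(D_n, D_i)]


def rank_genes_by_degree(D_s, threshold):
    """Count edges per gene and rank genes by descending degree.

    Assumption: an edge j -> i exists when D_s[i][j] >= threshold, and it
    adds one to the degree of both genes. Diagonal entries are skipped.
    """
    n = len(D_s)
    degree = [0] * n
    for i in range(n):
        for j in range(n):
            if i != j and D_s[i][j] >= threshold:
                degree[i] += 1
                degree[j] += 1
    order = sorted(range(n), key=lambda g: -degree[g])
    return degree, order


# Made-up regulatory matrices for normal (D_n) and disease (D_i) samples.
D_n = [[0.0, 0.9, 0.1],
       [0.2, 0.0, 0.1],
       [0.1, 0.8, 0.0]]
D_i = [[0.0, 0.1, 0.1],
       [0.2, 0.0, 0.7],
       [0.1, 0.1, 0.0]]
D_s = difference_matrix(D_n, D_i)
degree, order = rank_genes_by_degree(D_s, threshold=0.5)
```

In this toy case the second gene (index 1) takes part in all three differential edges and is ranked first as the marker candidate.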

The sample generally refers to tissues and organs under different physiological states, and the data refers to the expression levels of a plurality of genes in different samples.

The beneficial effects of the invention are as follows. Encoding the individuals of the genetic algorithm with continuous values during gene regulatory network inference allows the strength of regulatory relationships to be inferred more accurately. The similarity between the regulatory relationships of different genes is fully considered: migrating optimal individuals between tasks enlarges the search range of the algorithm, reduces the probability of falling into a local optimum, and improves inference accuracy. In addition, before the higher-level genetic algorithm is executed, the tasks are clustered using the results of the lower-level genetic algorithm, so that similar problems are placed in the same group. This increases the similarity of tasks within a group and the difference between groups, so that the tasks sharing information in the next stage are more similar to one another, knowledge sharing is more effective, and convergence is faster. Finally, the difference network makes the change in gene regulatory activity between the two physiological states before and after the disease directly visible, improving the accuracy and efficiency of finding key disease-related gene markers.

Drawings

FIG. 1 is a diagram of information sharing among individual dimensions within a task.

Detailed Description

The following describes embodiments of the present invention further in conjunction with a technical scheme and a set of simulation data, which are only for the purpose of illustrating the present invention and are not limiting.

Table 1 shows the simulation data of the invention, which carries two kinds of labels, gene f and sample x; the number of genes is 5 and the number of samples is 3.

TABLE 1 Gene expression simulation data

(1) A population is randomly initialized. Taking individual $p_1 = (0.2, 0.4, 0.1, 0.3, 0.2)$ in the population as an example, its factor cost on task $T_1$ is obtained from formula (4); the factor cost of individual $p_2 = (0.3, 0.3, 0.2, 0.5, 0.1)$ on $T_1$ is obtained in the same way. Comparing the two, the factor rank of $p_1$ on $T_1$ is smaller than that of $p_2$. If the factor rank of $p_1$ on $T_2$ is no larger than its factor rank on any task $T_j$ (j = 1, 2, …, 5), then the skill factor of $p_1$ is $\tau_1 = 2$, i.e., its primary task is the regulatory relationship inference problem with $f_2$ as the target gene;

(2) Taking $p_1$ and $p_2$ from step (1) as an example, with β = 0.3, k = 0.2, r = 0.6, and rand() % 2 = 0: performing the crossover operation, formulas (1) and (2) give $c_1 = (0.23, 0.37, 0.13, 0.36, 0.17)$ and $c_2 = (0.27, 0.33, 0.17, 0.44, 0.13)$; performing the mutation operation, formula (3) gives $c_1 = (0.224, 0.4, 0.136, 0.312, 0.52)$;

(3) The objective function value of each individual in the current population on its main task is calculated according to formula (4) (see step (1)); new individuals are generated by copying the optimal individual of each task and replacing the main task of the copy; after each iteration, the N individuals with the highest fitness, i.e., the smallest objective function values, are selected from the population as offspring, and the remaining individuals are eliminated;

(4) (5) The optimal individuals corresponding to each task, i.e., the individuals with the smallest objective function values, are clustered using the K-Means algorithm, thereby dividing the tasks into several major classes; steps (2) and (3) are executed again within each major class, and the algorithms of the different major classes are executed in parallel;

(6) Assume that the individual $p_r = (0.2, 0.4, 0.1, 0.3, 0.2)$ is randomly extracted from all individuals whose main task is $T_1$, and that the randomly extracted individual set is $S = \{(0.02, 0.12, 0.45, 0.71, 0.36), (0.66, 0.11, 0.56, 0.03, 0.3), (0.78, 0.11, 0.44, 0.01, 0.23), (0.02, 0.12, 0.05, 0.81, 0.16)\}$. When $l = 3$, the third dimension of each individual in S is taken to form the vector $z = (0.45, 0.56, 0.44, 0.05)$. The objective function value of $p_r$ is computed from formula (4). If 0.45 in z replaces 0.4 in $p_r$, the objective function value becomes larger, so the replacement is not made; similarly, 0.56 in z does not replace 0.1 in $p_r$, 0.44 does not replace 0.3, and 0.05 replaces 0.2. Then $q_r = (0.2, 0.4, 0.1, 0.3, 0.05)$;

(7) The individual with the optimal objective function value on each task is selected as the final screening result. Suppose that, at this point, the individual with the smallest objective function value among those whose main task is $T_1$ is $p_3 = (0.15, 0.69, 0.45, 0.88, 0.06)$, the optimal individual with main task $T_2$ is $p_9 = (0.02, 0.12, 0.45, 0.71, 0.36)$, the optimal individual with main task $T_3$ is $p_2 = (0.66, 0.11, 0.56, 0.03, 0.3)$, the optimal individual with main task $T_4$ is $p_{26} = (0.78, 0.11, 0.44, 0.01, 0.23)$, and the optimal individual with main task $T_5$ is $p_{15} = (0.02, 0.12, 0.05, 0.81, 0.16)$. The following matrix is then obtained:

$$D = \begin{pmatrix} 0.15 & 0.69 & 0.45 & 0.88 & 0.06 \\ 0.02 & 0.12 & 0.45 & 0.71 & 0.36 \\ 0.66 & 0.11 & 0.56 & 0.03 & 0.3 \\ 0.78 & 0.11 & 0.44 & 0.01 & 0.23 \\ 0.02 & 0.12 & 0.05 & 0.81 & 0.16 \end{pmatrix}$$

The columns of the matrix correspond in order to the 5 genes $\{f_1, f_2, f_3, f_4, f_5\}$. Taking the element 0.88 in the first row and fourth column as an example, its large value means that gene $f_4$ has a strong regulatory effect on gene $f_1$, so an edge between the two is more likely in the constructed regulatory network;

(8) Assuming that $x_1$ is a normal sample and $x_2$, $x_3$ are disease samples, steps (1)-(7) are performed on the two types of samples respectively to obtain two matrices $D_n$ and $D_i$, and the following is obtained according to formula (5):

Taking the element 0.82 in the third row and fourth column as an example, its large value means that the regulatory effect of gene $f_4$ on gene $f_3$ differs greatly before and after the disease, so an edge between the two is more likely in the constructed differential regulatory network. If a node has a high degree in the differential regulatory network, the gene corresponding to that node tends to be screened as a marker associated with the disease.

Claims

1. The marker screening method based on gene regulation network construction is characterized by comprising the following steps:

let $F = \{f_1, f_2, \dots, f_n\}$ represent the gene set, where n is the number of genes, and let $X = \{x_1, x_2, \dots, x_m\}$ represent the sample set, where m is the number of samples; let $V = (V_1, V_2, \dots, V_i, \dots, V_n)$ represent the expression levels of the n genes over the m samples, where $V_i = (v_{i1}, v_{i2}, \dots, v_{im})$ represents the expression level of the i-th gene over the m samples;

① Factor cost: the factor cost of individual $p_i$ on task $T_j$ is the objective function value of $p_i$ on $T_j$;

② Factor rank: for task $T_j$, the individuals are sorted by factor cost, and the factor rank of individual $p_i$ on task $T_j$ is defined as the index of $p_i$ in the sorted list;

③ Skill factor: among all tasks, the index of the task on which individual $p_i$ performs best is defined as the skill factor $\tau_i$ of $p_i$, also known as its "primary task" (main task), i.e., the task on which the factor rank of $p_i$ is smallest;

the dimension of an individual is the number of genes n, and the value in each dimension is floating-point coded; the value of a dimension represents the weight of the gene corresponding to that dimension, and the larger the weight, the more likely a regulatory relationship exists between that gene and the target gene of the current individual; factor cost, factor rank, and skill factor are updated after each iteration;

(2) Randomly extracting parent individuals from the current population for crossover and mutation: a certain number of parent pairs are randomly drawn from all individuals; for a pair of parents $p_a$, $p_b$, a value h is randomly generated in (0, 1); if $\tau_a = \tau_b$ or h is smaller than the balance factor between crossover and mutation, the two individuals are crossed to generate offspring $c_a$, $c_b$, each retaining the main task of its own parent; otherwise, the two individuals are mutated to generate offspring $c_a$, $c_b$, each retaining the main task of its own parent; the crossover and mutation modes are as follows:

Crossover:

$$c_a = \beta p_b + (1-\beta)\,p_a \qquad (1)$$

$$c_b = \beta p_a + (1-\beta)\,p_b \qquad (2)$$

wherein β is the crossover constant, taking a value in (0, 1);

Mutation:

Let the main task of individual $p_i$ be $\tau_i = T_{ma}$; then formula (4) gives the objective function value of individual $p_i$ on its main task $T_{ma}$, and $V^{\setminus ma} = (V_1, V_2, \dots, V_{ma-1}, V_{ma+1}, \dots, V_n)$ represents the matrix formed by the expression value vectors of the n-1 genes other than the ma-th gene; the smaller the objective function value of an individual on a task, the more accurate that individual's inference of the regulatory relationships of the task's target gene; for the optimal individual corresponding to each task, an identical copy is generated and added to the current population while the original is kept, the main task of the copy is replaced with a random one of the remaining tasks, and the objective function value of the copy on its new main task is calculated; the N individuals with the highest fitness, i.e., the smallest objective function values, are selected from the current population to form the offspring, and the remaining individuals are eliminated;

(5) Extracting parent individuals from the populations corresponding to similar tasks to execute the genetic evolution process: step (4) clusters all tasks so that tasks within the same class are highly similar and tasks in different classes are as dissimilar as possible; in this step, steps (2) and (3) are executed again, except that the algorithm is executed only within a class, and the algorithms of the different major classes are executed in parallel;

(6) Information migration among different dimensions of the same task: from the individuals with the same main task, an individual $p_r$ and a group of n-1 individuals are randomly extracted, the n-1 individuals forming an individual set S; a dimension l is selected, and the l-th-dimension values of the individuals in S form an (n-1)-dimensional vector; the values of this vector are used in turn to replace the n-1 dimensions of $p_r$ other than the dimension of the target gene of its main task, generating a new individual $q_r$; during replacement, if replacing a given dimension improves the fitness of $p_r$, i.e., reduces the objective function value, the replacement is made; otherwise it is not made and the process moves on to the next dimension;

(7) Acquiring the regulatory relationship matrix: the individual with the optimal objective function value on each task is selected as the final screening result; because each task has a corresponding target gene, arranging these individuals in the order of their target genes, with the dimensions of every individual in the same gene order, yields a matrix D formed from the individual vectors; the larger the value in row i, column j of the matrix, the stronger the regulatory effect of gene j on gene i, and the more likely an edge exists between the two in the constructed regulatory network;

D_s＝|D_n-D_i| (5)