CN112735514B - Training and visualization method and system for neural network extraction regulation and control DNA combination mode - Google Patents

Info

Publication number
CN112735514B
Authority
CN
China
Prior art keywords
dna sequence
neural network
dna
convolutional neural
specific function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110063192.0A
Other languages
Chinese (zh)
Other versions
CN112735514A (en)
Inventor
汪小我
魏征
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202110063192.0A
Publication of CN112735514A
Application granted
Publication of CN112735514B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/12 Computing arrangements based on biological models using genetic models
    • G06N3/126 Evolutionary algorithms, e.g. genetic algorithms or genetic programming
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics

Abstract

The invention discloses a training and visualization method and system for extracting regulatory DNA combination patterns with a neural network. The method comprises the following steps: obtaining DNA sequences that have a specific function and DNA sequences that do not have the specific function; labeling the two kinds of DNA sequences and representing them with one-hot encoding; building a convolutional neural network, taking the one-hot encoding of each labeled DNA sequence as input and the label of that sequence as the target value to be fitted by the network output, and training the convolutional neural network so that it can identify the DNA sequences; and decoupling the trained convolutional neural network with the NeuronMotif algorithm to obtain gene regulatory element combination modules, which are represented and stored as regulatory element syntax trees. The method provides a general neural network interpretation algorithm, NeuronMotif, that decouples a convolutional neural network in order to discover and visualize the patterns the network has learned to recognize.

Description

Training and visualization method and system for neural network extraction regulation and control DNA combination mode
Technical Field
The invention relates to the technical field of gene regulation, and in particular to a training and visualization method and system for extracting regulatory DNA combination patterns with a neural network.
Background
Gene expression and its regulation determine the growth and differentiation of cells. Controlling the transcriptional regulation of a gene controls its expression level to a certain extent and thereby the state of the cell. The logic by which regulatory elements are combined and arranged on genomic DNA is one of the most critical factors in transcriptional regulation. In gene editing and engineering applications, regulatory elements can be designed and adjusted according to the requirements of a specific gene function, taking into account the base preference, spacing, order, and copy number of multiple regulatory elements, so as to control the transcription level of the gene. Such complex regulatory modules and logic are difficult to extract and represent with current shallow machine learning methods and models. Deep learning models perform very well on many genome function annotation tasks because of their expressive power and automatic feature extraction, but the gene regulatory element combination modules they learn are difficult to interpret and extract.
In recent years, considerable work has gone into developing methods for extracting gene regulatory element combination modules from neural networks. Some progress has been made, but the problem remains unsolved. Current approaches to interpreting neural networks for DNA sequence prediction share the same basic idea: they study the relationship between the input bases and the output of a neuron. Most are adapted from computer vision and can equally be applied to visualizing neural networks in computer vision or other fields. These methods fall into three broad categories: (1) perturbing the input and observing the change in the output value; (2) gradient backpropagation; (3) estimating the distribution of input sequences that maximize the activation value. They explain the neural network to some degree, but they overlook the fact that a neural network is a mixture model, and none of them attempts to open the black box of the network to address this.
A typical example of the first category, perturbing the input and observing the change in the output value, is DeepSEA. The advantage of this method is that it is the simplest, most direct, and easiest to understand. If changing an input base leaves the output neuron unchanged, the base is not critical; otherwise, the base is important. The main disadvantage is the computational cost: the number of combinations of base changes across positions grows exponentially with the length of the DNA sequence. The method is mainly suited to studying single nucleotide polymorphisms, where only the effect of mutations at a few sites on sequence function matters rather than the effect at every base position, and for that purpose it is generally adequate. However, this kind of analysis does not fully reveal what the neural network has learned, and most work on interpreting neural networks does not rely on it alone, so the method is not particularly widely used.
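The perturbation idea described above can be sketched as a single-base scan. This is a minimal illustration, not DeepSEA itself: `toy_score` is a hypothetical stand-in for a trained network, and `single_base_effects` is a name introduced here. A scan of all single substitutions costs 3L model evaluations for a sequence of length L, whereas jointly varying every position would require 4^L, which is the exponential growth noted above.

```python
BASES = "ACGT"

def toy_score(seq: str) -> float:
    """Hypothetical stand-in for a trained model: best fractional match to 'TATA'."""
    motif = "TATA"
    best = 0
    for i in range(len(seq) - len(motif) + 1):
        matches = sum(1 for a, b in zip(seq[i:i + len(motif)], motif) if a == b)
        best = max(best, matches)
    return best / len(motif)

def single_base_effects(seq, score=toy_score):
    """Map (position, substituted base) -> change in score relative to the original."""
    reference = score(seq)
    effects = {}
    for pos, ref_base in enumerate(seq):
        for alt in BASES:
            if alt == ref_base:
                continue
            mutated = seq[:pos] + alt + seq[pos + 1:]
            effects[(pos, alt)] = score(mutated) - reference
    return effects

effects = single_base_effects("GGTATAGG")
# Substitutions outside the motif leave the score unchanged;
# substitutions inside it lower the score.
```

A base whose substitutions all give a change of zero would be treated as non-critical under this scheme.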
The other two categories both borrow techniques that have become common in the image domain in recent years to assess the importance of every base in each sample. Both are implemented with gradient backpropagation, but they use it differently. Saliency Map and DeepLIFT are typical examples of analysis methods based on gradient backpropagation. They use the partial derivative of a neuron's output with respect to its input, or a similar variant, as a measure of the importance of each input position. Because this quantity is easily computed by backpropagation, the method can be applied to any neuron: the user provides a sequence of interest, feeds it into the network, performs one forward pass, computes a gradient with one backward pass, and obtains an importance annotation for every position in the sequence. The low computational cost makes this approach relatively widely used, but it also has considerable problems. One is that it cannot directly compute a Motif. A Motif is a probability distribution over bases at each position, estimated from many sequences, whereas this method scores the importance of positions in a single sequence and therefore has no statistical meaning. To address this, a research group built TF-MoDISco on top of the DeepLIFT algorithm. Its basic idea is to post-process the key subsequences of many sequences of interest through matching and alignment, trimming, clustering, and similar steps, and finally to merge the importance scores at each base position across sequences. However, the importance scores at corresponding positions are not comparable between sequences, their relative magnitudes have no absolute meaning, the computation depends heavily on manually chosen settings, and the results are not particularly stable. As a result, the "Motifs" computed or discovered in this way are not widely used.
The third category, estimating the distribution of input sequences that maximize the activation value, is motivated by the nature of the neural network itself. A neuron can influence neurons in the next layer only when it is activated, which means that the sequences a neuron recognizes are the sequences that activate it. Once such sequences are collected, a position probability matrix (PPM) and a position weight matrix (PWM) can be computed from the sequence set. This approach also has many problems, for example how to choose the activation threshold for selecting sequences, for which no reasonable justification is given. In the interpretation of the Basset model, to explain the motifs learned by first-layer neurons, the authors scanned all samples with the corresponding convolution kernel, selected the subsequences whose activation value exceeded half of the maximum value obtained as the activated sequence set, computed a PWM from this set, and drew the corresponding WebLogo. The resulting motifs are satisfactorily similar to motifs in standard databases, but why the threshold should be half of the maximum value is not explained, and other work has similar problems. Although the Basset model gives good results for first-layer neurons, little work to date has used this approach to reasonably resolve what Motifs are learned by neurons in the second and deeper layers. This suggests that the approach may not be directly applicable beyond the first layer.
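The half-maximum thresholding described above for first-layer kernels can be sketched as follows. This is an illustrative reading of the procedure, not the Basset code: the kernel representation (a list of per-position base weights), the function names, and the `fraction` parameter are all assumptions made for this example.

```python
from collections import Counter

BASES = "ACGT"

def kernel_activation(window, kernel):
    """Activation of one kernel on one window: sum of the weights of the observed bases."""
    return sum(kernel[i][base] for i, base in enumerate(window))

def ppm_from_activated_windows(sequences, kernel, fraction=0.5):
    """Collect windows whose activation exceeds fraction * max, then count bases per position."""
    width = len(kernel)
    scored = []
    for seq in sequences:
        for start in range(len(seq) - width + 1):
            window = seq[start:start + width]
            scored.append((kernel_activation(window, kernel), window))
    threshold = fraction * max(act for act, _ in scored)
    activated = [window for act, window in scored if act > threshold]
    if not activated:
        raise ValueError("no window exceeds the threshold")
    ppm = []
    for pos in range(width):
        counts = Counter(window[pos] for window in activated)
        ppm.append({base: counts[base] / len(activated) for base in BASES})
    return ppm, activated

# Hypothetical kernel that prefers the subsequence "TAT".
kernel = [
    {"A": 0, "C": 0, "G": 0, "T": 1},
    {"A": 1, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 0, "T": 1},
]
ppm, activated = ppm_from_activated_windows(["GTATC", "CTACG"], kernel)
```

Changing `fraction` changes which windows enter the set and therefore the resulting matrix, which is exactly the unexplained choice the text criticizes.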
In summary, all three categories of current methods for extracting gene regulatory element combination modules from neural networks have reached a bottleneck, and a better method is needed to extract the modules that a neural network has learned.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, one object of the present invention is to provide a training and visualization method for extracting regulatory DNA combination patterns with a neural network. The method provides a general interpretation algorithm, NeuronMotif, for decoupling a neural network. It can decouple a convolutional neural network model that annotates whether DNA has a specific function, and discover and visualize the gene regulatory element combination modules identified by the model. It can also be used to discover and visualize the patterns identified by any convolutional neural network in other problems or fields.
Another object of the invention is to provide a training and visualization system for extracting regulatory DNA combination patterns with a neural network.
To achieve the above objects, an embodiment of one aspect of the invention provides a training and visualization method for extracting regulatory DNA combination patterns with a neural network, which includes:
S1, obtaining DNA sequences with a specific function and DNA sequences without the specific function;
S2, labeling the two kinds of DNA sequences, and representing the DNA sequences with the specific function and the DNA sequences without the specific function with one-hot encoding;
S3, building a convolutional neural network, taking the one-hot encoding of each labeled DNA sequence as input and the label of that DNA sequence as the target value to be fitted by the output of the convolutional neural network, and training the convolutional neural network so that it can identify the DNA sequences;
and S4, decoupling the trained convolutional neural network with the NeuronMotif algorithm to obtain gene regulatory element combination modules, and representing and storing them as regulatory element syntax trees.
In the training and visualization method of this embodiment, DNA sequences with and without a specific function are obtained, labeled, and one-hot encoded, and a convolutional neural network is trained on them as in steps S1 to S3. The trained convolutional neural network is then decoupled with the purpose-designed NeuronMotif algorithm, so that the Motif and the Motif combination modules corresponding to each neuron are discovered, the gene regulatory element combination modules are obtained, and they are represented and stored as regulatory element syntax trees. This provides a new approach to extracting gene regulatory element combination modules from neural networks.
In addition, the training and visualization method according to the above embodiment of the present invention may further have the following additional technical features:
Further, in an embodiment of the present invention, S1 further includes:
S11, extracting DNA sequence fragments with the specific function and DNA sequence fragments without the specific function from a genome that has been annotated by biological experiments.
Further, in an embodiment of the present invention, S1 further includes:
S12, artificially synthesizing DNA sequence fragments, performing any type of biological function verification experiment, and determining which fragments have the specific function and which do not.
Further, in one embodiment of the present invention, the labeling of the DNA sequences comprises:
labeling the DNA sequences with the specific function as positive samples, and labeling the DNA sequences without the specific function as negative samples.
Further, in an embodiment of the present invention, S4 further includes:
S41, for a neuron in the convolutional neural network, collecting a new DNA sequence set in which different DNA sequences produce neuron activation values of various magnitudes;
S42, for each DNA sequence in the new DNA sequence set, calculating the activation values of all neurons in each layer of the network that can affect the neuron;
S43, partitioning the new DNA sequence set to obtain a plurality of DNA sequence subsets;
and S44, calculating the mathematical representation of the gene functional element combination module corresponding to each DNA sequence subset, and representing and storing the module as a regulatory element syntax tree.
Further, in an embodiment of the present invention, S41 further includes:
randomly generating DNA sequences whose length matches the receptive field of the neuron, and optimizing them with a genetic algorithm whose objective is the neuron activation value of each sequence. In the genetic algorithm, mutations are sampled with probabilities derived from the neuron activation value and the magnitude of its gradient with respect to the one-hot encoded input. In addition to the usual crossover between sequences, cyclic shifts are applied according to the pooling layer structure of the network. Intermediate sequences produced during the optimization are sampled without repetition, and the sampled sequences form a set of DNA sequences covering a range of activation values.
Further, in an embodiment of the present invention, S43 further includes:
S431, for the new DNA sequence set, searching from the layer where the neuron is located toward shallower layers; when a max pooling layer is encountered, clustering the set into K classes with the K-means algorithm, where K is the pooling size and the features are the activation values, for each sequence, of the neurons in the layer immediately shallower than the pooling layer; each class corresponds to one DNA sequence subset;
S432, taking each resulting DNA sequence subset as a new DNA sequence set and continuing the search toward shallower layers from the layer where the clustering occurred; when a max pooling layer is encountered, clustering the set into K classes in the same way, each class corresponding to one DNA sequence subset;
and S433, repeating step S432 until the first layer is reached, to obtain a plurality of DNA sequence subsets.
Further, in an embodiment of the present invention, the gene functional element combination module is computed as E[E(X | Y)], where X is the random variable corresponding to the one-hot encoding of a sampled sequence and Y is the random variable corresponding to the activation value of that sequence. The relationship Y = f(X) between Y and X is determined by the corresponding neuron. The distribution of Y must be specified: Y is treated as the free variable, and X depends on Y.
To achieve the above objects, an embodiment of another aspect of the present invention provides a training and visualization system for extracting regulatory DNA combination patterns with a neural network, including:
an acquisition module for obtaining DNA sequences with a specific function and DNA sequences without the specific function;
a labeling module for labeling the two kinds of DNA sequences and representing the DNA sequences with the specific function and the DNA sequences without the specific function with one-hot encoding;
a training module for building a convolutional neural network, taking the one-hot encoding of each labeled DNA sequence as input and the label of that DNA sequence as the target value to be fitted by the output of the convolutional neural network, and training the convolutional neural network so that it can identify the DNA sequences;
and a decoupling module for decoupling the trained convolutional neural network with the NeuronMotif algorithm to obtain gene regulatory element combination modules, and representing and storing them as regulatory element syntax trees.
In the training and visualization system of this embodiment, DNA sequences with and without a specific function are obtained, labeled, and one-hot encoded, and a convolutional neural network is trained on them. The trained convolutional neural network is then decoupled with the purpose-designed NeuronMotif algorithm, so that the Motif and the Motif combination modules corresponding to each neuron are discovered, the gene regulatory element combination modules are obtained, and they are represented and stored as regulatory element syntax trees. This provides a new approach to extracting gene regulatory element combination modules from neural networks.
In addition, the training and visualization system according to the above embodiment of the present invention may further have the following additional technical features:
Further, in one embodiment of the present invention, obtaining DNA sequences with a specific function and DNA sequences without the specific function includes:
extracting DNA sequence fragments with the specific function and DNA sequence fragments without the specific function from a genome that has been annotated by biological experiments; or
artificially synthesizing DNA sequence fragments, performing any type of biological function verification experiment, and determining which fragments have the specific function and which do not.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flow chart of a training and visualization method for extracting regulatory DNA combination patterns with a neural network according to one embodiment of the present invention;
FIG. 2 is a diagram of a PPM in mathematical form according to one embodiment of the present invention;
FIG. 3 is a schematic diagram of transcription factor matching according to one embodiment of the present invention;
FIG. 4 is a diagram of a syntax tree according to one embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a training and visualization system for extracting regulatory DNA combination patterns with a neural network according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the drawings are illustrative, are intended to explain the invention, and are not to be construed as limiting it.
The training and visualization method and system for extracting regulatory DNA combination patterns with a neural network according to embodiments of the invention are described below with reference to the accompanying drawings.
The method is described first.
FIG. 1 is a flow chart of a training and visualization method for extracting regulatory DNA combination patterns with a neural network according to an embodiment of the present invention.
As shown in FIG. 1, the method comprises the following steps:
In step S1, DNA sequences with a specific function and DNA sequences without the specific function are obtained.
Further, in embodiments of the present invention, two ways of collecting DNA sequences are provided. First, DNA sequence fragments with and without the function are extracted from a genome that has been annotated by various biological experiments, for example DNA sequences of open and non-open chromatin regions annotated by ATAC-seq, or DNA sequences with and without nucleosome modifications or transcription factor binding sites annotated by ChIP-seq.
Second, DNA sequence fragments are artificially synthesized and subjected to any type of biological function verification experiment to determine which fragments have the function and which do not. For example, with the SELEX technique, designed DNA sequences are synthesized, and the sequences that bind the protein are separated from those that do not.
In step S2, the two kinds of DNA sequences are labeled, and the DNA sequences with the specific function and the DNA sequences without the specific function are represented with one-hot encoding.
Further, labeling the DNA sequences includes:
labeling a fixed-length DNA sequence with the specific function as a positive sample, recorded numerically as 1, and labeling a DNA sequence without the specific function as a negative sample, recorded numerically as 0. A DNA sequence may have multiple functions and may therefore carry multiple labels, each indicating whether the corresponding function is present.
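The encoding and labeling just described can be illustrated with a short sketch. The base order A, C, G, T, the function names, and the example function names `open_chromatin` and `tf_binding` are assumptions chosen for illustration; the source does not fix them.

```python
BASES = "ACGT"  # assumed column order for the one-hot encoding

def one_hot(seq):
    """Encode a DNA string as a list of rows, one 4-element row per base."""
    rows = []
    for base in seq.upper():
        if base not in BASES:
            raise ValueError(f"unexpected base: {base!r}")
        rows.append([1 if base == b else 0 for b in BASES])
    return rows

def label_vector(functions_present, all_functions):
    """One 0/1 label per function: 1 if the sequence has that function, else 0."""
    return [1 if name in functions_present else 0 for name in all_functions]

x = one_hot("ACGT")
y = label_vector({"open_chromatin"}, ["open_chromatin", "tf_binding"])
```

Here `x` has shape (sequence length) x 4 and `y` has one entry per function, which matches the multi-label case mentioned above.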
In step S3, a convolutional neural network is built, the one-hot encoding of each labeled DNA sequence is taken as input and the label of that DNA sequence as the target value to be fitted by the output of the convolutional neural network, and the convolutional neural network is trained so that it can identify the DNA sequences.
Specifically, DNA sequences with the function are labeled as positive samples, DNA sequences without the function are labeled as negative samples, and the DNA sequences are represented with one-hot encoding. Any convolutional neural network may be built; its structure can include convolutional layers, pooling layers, fully connected layers, and so on. The input dimension should match the DNA sequence length and the one-hot encoding format, and the output dimension should match the number of DNA functions. The one-hot encoding of each DNA sequence is used as input and its label as the target value fitted by the network output, and the network is trained to identify as accurately as possible whether a DNA sequence is a positive sample.
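To make the dimension matching above concrete, the following is a minimal forward pass only, written without any deep learning framework; training is omitted, and in practice a framework would be used. The layer sizes, the hand-set kernel that responds to "TA", and all function names are assumptions for illustration.

```python
import math

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as L rows of 4 values in A, C, G, T order."""
    return [[1 if base == b else 0 for b in BASES] for base in seq]

def conv1d_relu(x, kernels, biases):
    """x: L x 4 input; kernels: list of W x 4 matrices. Returns (L - W + 1) x n_kernels."""
    width = len(kernels[0])
    out = []
    for start in range(len(x) - width + 1):
        row = []
        for kernel, bias in zip(kernels, biases):
            s = bias + sum(kernel[j][c] * x[start + j][c]
                           for j in range(width) for c in range(4))
            row.append(max(0.0, s))  # ReLU
        out.append(row)
    return out

def max_pool(features, size):
    """Non-overlapping max pooling along the sequence axis."""
    pooled = []
    for start in range(0, len(features) - size + 1, size):
        block = features[start:start + size]
        pooled.append([max(column) for column in zip(*block)])
    return pooled

def dense_sigmoid(features, weights, bias):
    """Flatten, apply one fully connected unit, and squash to a probability."""
    flat = [v for row in features for v in row]
    z = bias + sum(w * v for w, v in zip(weights, flat))
    return 1.0 / (1.0 + math.exp(-z))

# One hypothetical kernel of width 2 that responds to "TA".
kernels = [[[0, 0, 0, 1], [1, 0, 0, 0]]]
conv = conv1d_relu(one_hot("GTATAC"), kernels, biases=[-1.0])
pooled = max_pool(conv, size=2)
prob = dense_sigmoid(pooled, weights=[1.0, 1.0], bias=-1.0)
```

The input has one row per base and four columns, and the single output unit corresponds to a single function; a multi-function model would have one output unit per label.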
In step S4, the trained convolutional neural network is decoupled with the NeuronMotif algorithm to obtain gene regulatory element combination modules, which are represented and stored as regulatory element syntax trees.
The NeuronMotif algorithm is a general algorithm for decoupling a convolutional neural network. It can decouple a convolutional neural network model that annotates whether DNA has a specific function, and discover and visualize the gene regulatory element combination modules identified by the model. It can also be used to discover and visualize the patterns identified by convolutional neural networks in other applications.
Further, S4 further includes:
S41, for a neuron in the convolutional neural network, collecting a new DNA sequence set in which different DNA sequences produce neuron activation values of various magnitudes;
S42, for each DNA sequence in the new DNA sequence set, calculating the activation values of all neurons in each layer of the network that can affect the neuron;
S43, partitioning the new DNA sequence set to obtain a plurality of DNA sequence subsets;
and S44, calculating the mathematical representation of the gene functional element combination module corresponding to each DNA sequence subset, and representing and storing the module as a regulatory element syntax tree.
It will be appreciated that steps S41 to S44 are carried out for each neuron in the convolutional neural network, proceeding from the shallow layers to the deep layers.
Further, in an embodiment of the present invention, S41 further includes:
randomly generating a DNA sequence according to the size of a neuron receiving domain, optimizing the DNA sequence by using a genetic algorithm, wherein the optimization target is a neuron activation value of the DNA sequence, mutation of the DNA sequence in the genetic algorithm is sampled according to the neuron activation value and the gradient size of the one-hot coding input of the DNA sequence as probability, besides cross interchange of the DNA sequence is kept, cyclic displacement is required according to a neural network pooling layer structure, the DNA sequence of an intermediate result optimized by the genetic algorithm is sampled, the sampled DNA sequence is not repeated, and the sampled DNA sequence forms various activated DNA sequence sets.
It is understood that if the number of sampled DNA sequences is too large, the range of activation values up to the maximum activation value is divided into 20 or more activation value intervals, DNA sequences are randomly selected without repetition within each interval, and the unselected DNA sequence samples are discarded.
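The interval-based thinning described above can be sketched as follows. The function name, the per-interval quota per_bin and the handling of non-positive activation values are assumptions made for illustration; only the division into 20 or more intervals and the non-repeating random selection come from the description.

```python
import random

def subsample_by_activation(seq_to_act, n_bins=20, per_bin=50, seed=0):
    """Keep at most `per_bin` randomly chosen sequences in each of `n_bins`
    equal-width activation intervals between 0 and the maximum activation."""
    rng = random.Random(seed)
    a_max = max(seq_to_act.values())
    bins = [[] for _ in range(n_bins)]
    for seq in sorted(seq_to_act):  # sorted for reproducibility
        act = seq_to_act[seq]
        if a_max <= 0:
            idx = 0
        else:
            # The maximum activation itself falls into the last interval.
            idx = min(int(n_bins * act / a_max), n_bins - 1)
        bins[max(idx, 0)].append(seq)
    kept = {}
    for members in bins:
        if len(members) > per_bin:
            members = rng.sample(members, per_bin)  # no repeats
        for seq in members:
            kept[seq] = seq_to_act[seq]
    return kept  # sequences not selected are discarded
```

Thinning per interval rather than over the whole set keeps weakly and strongly activating sequences both represented, which the later interval-wise averaging relies on.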
Further, in an embodiment of the present invention, S43 further includes:
S431, for the new DNA sequence set, starting from the layer where the neuron is located and examining the layers from deep to shallow, when a max pooling layer is encountered, clustering the new DNA sequence set into K classes, K being the pooling size, by using the K-means algorithm on the activation values, for each sequence in the set, of the neurons in the layer immediately shallower than the pooling layer, wherein each class corresponds to a divided DNA sequence subset;
S432, taking the divided DNA sequence subsets as new DNA sequence sets, continuing to examine the layers from deep to shallow starting from the layer at which the clustering took place, and, when a max pooling layer is encountered, clustering each new DNA sequence set into K classes, K being the pooling size, by using the K-means algorithm on the activation values, for each sequence in the set, of the neurons in the layer immediately shallower than the pooling layer, wherein each class corresponds to a divided DNA sequence subset;
and S433, repeating step S432 until the first layer is reached, to obtain a plurality of DNA sequence subsets.
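Steps S431 to S433 can be sketched as a repeated K-means split, one round per max pooling layer. The sketch below is an assumption-laden illustration: it uses a plain Lloyd's K-means written with the standard library, it treats each existing subset separately when splitting, and feature_fn is a hypothetical callback that returns, for a sequence, the activation values of the neurons just below the pooling layer.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's K-means; returns one cluster index per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])))
            for p in points
        ]
        # Move every non-empty cluster's center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def partition(sequences, pooling_layers):
    """Split a sequence set once per max pooling layer, from deep to shallow.

    pooling_layers: list of (K, feature_fn) pairs ordered deep to shallow,
    where K is the pooling size and feature_fn(seq) is a hypothetical helper
    returning the activation features below that pooling layer.
    """
    subsets = [list(sequences)]
    for pool_size, feature_fn in pooling_layers:
        next_subsets = []
        for subset in subsets:
            k = min(pool_size, len(subset))
            if k <= 1:
                next_subsets.append(subset)
                continue
            labels = kmeans([feature_fn(s) for s in subset], k)
            for c in range(k):
                group = [s for s, lab in zip(subset, labels) if lab == c]
                if group:
                    next_subsets.append(group)
        subsets = next_subsets
    return subsets
```

With L max pooling layers of size K between the neuron and the input, this yields up to K to the power L subsets.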
Further, in an embodiment of the present invention, the computational expression of the gene functional element combination module is E[E(X|Y)], where X is the random variable corresponding to the one-hot encoding of a sampled sequence, Y is the random variable corresponding to the activation value of the sampled sequence, and the relationship Y = f(X) between Y and X is determined by the corresponding neuron. The distribution of the random variable Y is a free choice that needs to be specified, and the random variable X depends on the random variable Y. It is recommended to select the probability density function p(Y) = 2Y/A^2 for the distribution of Y, where A is the maximum of the activation values of all sequences in the DNA sequence set; this density is proportional to Y and integrates to 1 over the interval [0, A]. E[E(X|Y)] represents the expected value of the one-hot encoding of the sampled DNA sequences under the given distribution of activation values. Calculating E[E(X|Y)] for each divided DNA sequence subset yields the gene functional element combination modules represented by the neuron.
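The following short derivation relates the expectation E[E(X|Y)] to the interval-wise weighted average used in the specific calculation method later in this description. It assumes that the recommended density is the normalized linear density p(y) = 2y/A^2 on [0, A], that [0, A] is split into N equal intervals, and that the mean activation V_i of interval i is close to the interval midpoint m_i; these are assumptions of the sketch rather than statements of the source.

```latex
\begin{aligned}
E[E(X \mid Y)] &= \int_0^A E[X \mid Y = y]\, p(y)\, dy,
  \qquad p(y) = \frac{2y}{A^2},\\[4pt]
P\!\left(\tfrac{A(i-1)}{N} \le Y \le \tfrac{A i}{N}\right)
  &= \int_{A(i-1)/N}^{A i/N} \frac{2y}{A^2}\, dy
   = \frac{i^2 - (i-1)^2}{N^2} = \frac{2i-1}{N^2},\\[4pt]
\frac{m_i}{\sum_{j=1}^{N} m_j}
  &= \frac{A(2i-1)/(2N)}{AN/2} = \frac{2i-1}{N^2},
  \qquad m_i = \frac{A(2i-1)}{2N},\\[4pt]
E[E(X \mid Y)] &\approx \sum_{i=1}^{N} \mathrm{PPM}_i \,
  \frac{V_i}{\sum_{j=1}^{N} V_j} \quad \text{when } V_i \approx m_i .
\end{aligned}
```

Under these assumptions, weighting each interval's average matrix by its mean activation reproduces the probability that the density assigns to that interval.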
For all the resulting gene functional element combination modules, the set of functional element patterns they share is found, denoted {A, B, C, …}, and the length, arrangement and relative position of the functional elements in each combination module are determined. In general, the arrangement of the elements is fixed across the combination modules, e.g., all are ABCABC, while the number of bases separating two adjacent elements may differ. If the distance between two adjacent elements is a fixed number of bases in all the combination modules, that number of bases is inserted between the two elements in the arrangement, e.g., A-6N-B, where 6N represents a spacing of 6 bases; if the distance is uncertain, the spacing can be written as an interval set off by brackets, e.g., A-[6±2N]-B. Functional submodules whose elements are relatively fixed or functionally related can be marked by brackets, and modules can be nested within modules, such as [A-6N-[B-C]]-[A-6N-[B-C]]; the syntax tree of a functional module can then be read off from the brackets.
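The spacing notation described above can be produced from observed gaps as in the following sketch. The function names are illustrative, and the rule for an odd-width range (widening the interval outward so that it still covers every observed gap) is an assumption, since the description does not specify it.

```python
def spacer_notation(gaps):
    """Summarize the gaps (in bases) observed between two adjacent elements.

    A constant gap becomes e.g. "6N"; a varying gap becomes a bracketed
    interval such as "[6±2N]" that covers every observed value.
    """
    lo, hi = min(gaps), max(gaps)
    if lo == hi:
        return "%dN" % lo
    center = (lo + hi + 1) // 2   # rounds up when the range has odd width
    radius = center - lo          # large enough to reach both lo and hi
    return "[%d±%dN]" % (center, radius)

def module_notation(elements, gap_lists):
    """Join element names with the spacer observed between each adjacent pair.

    elements:  element names in their fixed order, e.g. ["A", "B", "C"]
    gap_lists: one list of observed gaps per adjacent pair of elements
    """
    parts = [elements[0]]
    for name, gaps in zip(elements[1:], gap_lists):
        parts.append(spacer_notation(gaps))
        parts.append(name)
    return "-".join(parts)
```

For example, module_notation(["A", "B"], [[6, 6, 6]]) gives "A-6N-B", and module_notation(["A", "B"], [[4, 6, 8]]) gives "A-[6±2N]-B".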
Specifically, a method of decoupling any convolutional neuron in the neural network is described below:
1) For a neuron, a new DNA sequence set is collected that contains a sufficient number of DNA sequences at various activation values.
DNA sequences are randomly generated according to the size of the receptive field of the neuron and optimized by using a genetic algorithm, the optimization target being the neuron activation value of the DNA sequence. Mutations of the DNA sequence in the genetic algorithm are sampled with probabilities given by the magnitude of the gradient of the neuron activation value with respect to the one-hot encoded input of the DNA sequence, and, in addition to retaining crossover between DNA sequences, cyclic shifts are applied according to the pooling layer structure of the neural network. The DNA sequences of the intermediate results of the genetic algorithm optimization are sampled, repeated DNA sequences not being allowed, to obtain a DNA sequence set covering various activation values. If the number of sequences is too large, the range of activation values up to the maximum activation value is divided into 20 or more activation value intervals, DNA sequences are randomly selected without repetition within each interval, and the unselected DNA sequence samples are discarded.
2) For each DNA sequence in the new DNA sequence set, the activation values of all neurons in each layer of the neural network that can affect the neuron are calculated.
In the layer where the neuron is located and in each shallower layer, only some of the neurons can affect the activation of this neuron. The activation values of all of these neurons are calculated for each DNA sequence.
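Which positions in the shallower layers can affect a given neuron follows from the layer geometry. The sketch below assumes a one-dimensional network made only of unpadded stride-1 convolutions and non-overlapping max pooling layers; the layer description format and the function name are illustrative assumptions, and all channels at the returned positions are taken to be relevant.

```python
def influencing_ranges(layers, position=0):
    """Trace the positions that can affect one neuron, layer by layer.

    layers:   list of ("conv", kernel_size) or ("pool", pool_size) tuples,
              ordered from the input side up to the neuron's own layer
    position: index of the neuron along the sequence axis in its layer
    Returns inclusive (lo, hi) position ranges, starting at the neuron's
    layer and ending at the input sequence.
    """
    lo = hi = position
    ranges = [(lo, hi)]
    for kind, size in reversed(layers):
        if kind == "conv":      # output i reads inputs i .. i + size - 1
            hi = hi + size - 1
        elif kind == "pool":    # output i reads inputs i*size .. i*size + size - 1
            lo, hi = lo * size, hi * size + size - 1
        else:
            raise ValueError("unknown layer type: %r" % (kind,))
        ranges.append((lo, hi))
    return ranges
```

For a hypothetical stack of a size-7 convolution, a size-2 max pooling layer and a size-3 convolution, the last range returned for position 0 is (0, 11), i.e. a receptive field of 12 bases.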
3) The new DNA sequence set is divided.
For the new DNA sequence set, the layers are examined from deep to shallow starting from the layer where the neuron is located. When a max pooling layer is encountered, the new DNA sequence set is clustered into K classes, K being the pooling size, by using the K-means algorithm on the activation values, for each sequence in the set, of the neurons in the layer immediately shallower than the pooling layer; each class corresponds to a divided DNA sequence subset. The divided DNA sequence subsets are then taken as new DNA sequence sets, and the examination continues from deep to shallow starting from the layer at which the clustering took place; when a max pooling layer is encountered, each new DNA sequence set is again clustered into K classes in the same way, each class corresponding to a divided DNA sequence subset. This process is repeated until the first layer is reached, yielding a large number of subsets.
4) Finally, the mathematical expression of the gene functional element combination module corresponding to each subset is calculated.
The computational expression of the position probability matrix (PPM) is E[E(X|Y)], the expected value of the one-hot encoding of the sampled sequences conditioned on the activation value. X is the random variable corresponding to the one-hot encoding of a sampled sequence, Y is the random variable corresponding to the activation value of the sampled sequence, and the relationship Y = f(X) between Y and X is determined by the corresponding neuron. The distribution of the random variable Y is a free choice that needs to be specified, and the random variable X depends on the random variable Y. It is recommended to select the probability density function p(Y) = 2Y/A^2 for the distribution of Y, where A is the maximum of the activation values of all sequences in the DNA sequence set; this density is proportional to Y and integrates to 1 over the interval [0, A]. E[E(X|Y)] represents the expected value of the one-hot encoding of the sampled DNA sequences under the given distribution of activation values. Calculating E[E(X|Y)] for each divided DNA sequence subset yields the gene functional element combination modules represented by the neuron. The specific calculation method is as follows:
for any DNA sequence subset, the activation values of the neuron are calculated to obtain the maximum activation value A; the range from 0 to A is divided equally into N intervals, and the following operations are performed in each interval i (i = 1, 2, …, N), [A(i-1)/N, A·i/N]:
finding the sequences of the subset whose activation values fall in the interval [A(i-1)/N, A·i/N];
calculating the average value Vi of the activation values of these sequences;
calculating the average of the one-hot encodings of these sequences at each position to obtain an average matrix PPMi;
after the calculation is completed in every interval, the genome functional element module of the subset is calculated as:
PPM = (PPM1 × V1 + PPM2 × V2 + … + PPMN × VN)/(V1 + V2 + … + VN); PPM is the genome functional element module corresponding to the subset. The PPM in this mathematical form can be plotted as a WebLogo graph, as shown in FIG. 2.
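The interval-wise calculation above can be sketched as follows. The function names are illustrative; skipping empty intervals and sequences with negative activation values is an assumption of the sketch, since the description does not say how they are handled.

```python
BASES = "ACGT"

def mean_one_hot(seqs):
    """Average the one-hot encodings of equal-length sequences per position."""
    length = len(seqs[0])
    counts = [[0.0] * len(BASES) for _ in range(length)]
    for seq in seqs:
        for i, base in enumerate(seq):
            counts[i][BASES.index(base)] += 1.0
    return [[c / len(seqs) for c in row] for row in counts]

def subset_ppm(seq_to_act, n_bins=10):
    """Activation-weighted PPM of one DNA sequence subset.

    seq_to_act: dict mapping each sequence to the neuron's activation value
    n_bins:     number N of equal intervals between 0 and the maximum A
    """
    a_max = max(seq_to_act.values())
    if a_max <= 0:
        raise ValueError("the subset contains no activating sequence")
    bins = [[] for _ in range(n_bins)]
    for seq, act in seq_to_act.items():
        if act < 0:
            continue  # assumption: negative activations are ignored
        bins[min(int(n_bins * act / a_max), n_bins - 1)].append(seq)
    length = len(next(iter(seq_to_act)))
    weighted = [[0.0] * len(BASES) for _ in range(length)]
    total_v = 0.0
    for members in bins:
        if not members:
            continue  # assumption: empty intervals contribute nothing
        v_i = sum(seq_to_act[s] for s in members) / len(members)   # Vi
        ppm_i = mean_one_hot(members)                              # PPMi
        for i, row in enumerate(ppm_i):
            for j, value in enumerate(row):
                weighted[i][j] += value * v_i
        total_v += v_i
    return [[value / total_v for value in row] for row in weighted]
```

Because every PPMi has rows summing to 1 and the weights Vi are normalized by their sum, every row of the resulting PPM also sums to 1, so it can be passed directly to a sequence logo tool.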
5) The syntax is summarized from the PPMs calculated for all DNA sequence subsets of this neuron and stored.
For all the obtained gene functional element combination modules, the set of functional element patterns they share is found, denoted {A, B, C, …}, and the length, arrangement and relative position of the functional elements in each combination module are determined. In general, the arrangement of the elements is fixed across the combination modules, e.g., ABCABCBC, while the number of bases separating two adjacent elements may differ. If the distance between two adjacent elements is a fixed number of bases in all the combination modules, that number of bases is inserted between the two elements in the arrangement, e.g., A-6N-B, where 6N represents a spacing of 6 bases; if the distance is uncertain, the spacing can be written as an interval set off by brackets, e.g., A-[6±2N]-B. Functional submodules whose elements are relatively fixed or functionally related can be marked by brackets, and modules can be nested within modules, such as [A-6N-[B-C]]-[A-6N-[B-C]]; the syntax tree of the functional module can then be read off from the brackets.
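Reading a syntax tree off the bracket notation can be sketched with a small recursive parser. The tree representation (nested lists for modules, tuples for elements and spacers) and the function name are illustrative assumptions; the description does not prescribe a storage format.

```python
import re

# A spacer is "6N" (fixed) or "6±2N" (variable); anything else is an element.
GAP = re.compile(r"^(\d+)(?:±(\d+))?N$")

def parse_module_syntax(text):
    """Parse notation such as "[A-6N-[B-C]]-[A-6N-[B-C]]" into a nested tree.

    Modules become lists, elements become ("element", name) and spacers
    become ("gap", min_bases, max_bases).
    """
    tokens = re.findall(r"\[|\]|[^\[\]-]+", re.sub(r"\s+", "", text))
    pos = 0

    def leaf(token):
        match = GAP.match(token)
        if match:
            center, radius = int(match.group(1)), int(match.group(2) or 0)
            return ("gap", center - radius, center + radius)
        return ("element", token)

    def module():
        nonlocal pos
        children = []
        while pos < len(tokens) and tokens[pos] != "]":
            token = tokens[pos]
            pos += 1
            if token == "[":
                child = module()
                if pos >= len(tokens):
                    raise ValueError("missing closing bracket")
                pos += 1  # consume the matching "]"
                # A bracket holding only a spacer, e.g. [6±2N], is the spacer.
                if (len(child) == 1 and isinstance(child[0], tuple)
                        and child[0][0] == "gap"):
                    child = child[0]
                children.append(child)
            else:
                children.append(leaf(token))
        return children

    tree = module()
    if pos != len(tokens):
        raise ValueError("unexpected closing bracket")
    return tree
```

Each nested list corresponds to one bracketed submodule, so the nesting depth of the lists mirrors the depth of the syntax tree.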
In one embodiment, the patterns may be matched against the motifs in a known database to obtain the basic elements corresponding to the patterns. Some of the elements are known and present in the database, while others are unknown and absent from it. As shown in FIG. 3, the elements include CTCF, DDIT3::CEBPA, ZEB1, and some unknown transcription factors.
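Matching a combination module against known motifs can be sketched as an ungapped sliding comparison of PPMs. Everything below is an illustrative assumption: the similarity measure (one minus the total variation distance per column), the threshold, the function names and the toy database are not taken from the description, and reverse-complement matches are ignored for brevity.

```python
def column_similarity(p, q):
    """1 minus the total variation distance between two base distributions."""
    return 1.0 - 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def best_match(module_ppm, motif_ppm):
    """Best ungapped placement of a motif inside a longer module PPM.
    Returns (offset, score), or None if the motif is longer than the module."""
    best = None
    for offset in range(len(module_ppm) - len(motif_ppm) + 1):
        score = sum(
            column_similarity(module_ppm[offset + i], column)
            for i, column in enumerate(motif_ppm)
        ) / len(motif_ppm)
        if best is None or score > best[1]:
            best = (offset, score)
    return best

def annotate(module_ppm, database, threshold=0.8):
    """List (offset, name, score) for every database motif found in the module.

    database: dict mapping a motif name to its PPM (hypothetical format)
    Regions of the module not covered by any hit correspond to elements
    that are unknown to the database.
    """
    hits = []
    for name, motif_ppm in database.items():
        match = best_match(module_ppm, motif_ppm)
        if match is not None and match[1] >= threshold:
            hits.append((match[0], name, match[1]))
    return sorted(hits)
```

The sorted offsets of the hits give the relative positions of the known elements within the module, from which the spacing between adjacent elements can be read.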
From these basic elements and their relative positions in the WebLogo graph, the relationship [CTCF-[6N]-DDIT3::CEBPA]-[59±1N]-[CTCF-[6N]-DDIT3::CEBPA] can be summarized, and the syntax tree representation shown in FIG. 4 is generated according to the bracket relationships.
If the result is not satisfactory, steps 3), 4) and 5) can be repeated on the basis of the existing subsets until the result is satisfactory. Similar operations are performed on each subset of each neuron, so that a large number of genome functional element modules expressed as PPMs can be extracted.
It will be appreciated that embodiments of the present invention provide NeuronMotif, a general algorithm for decoupling convolutional neural networks. It decouples a convolutional neural network model that annotates whether DNA has a specific function, discovers and visualizes the gene regulatory element combination modules recognized by the model, and is also useful for discovering and visualizing the patterns recognized by convolutional neural networks in other applications. In the NeuronMotif algorithm, the mathematical-statistical form of the motif corresponding to a neuron is first defined. Each neuron is then regarded as a latent variable model, and the sources and meanings of the latent variables in the model are analyzed by category. Based on this new analysis and understanding of neural networks and neurons, NeuronMotif is designed to decouple the mixture model of a neuron, so that the motifs and motif combination modules (represented by PPMs) corresponding to each neuron are discovered; these are the representation of the gene regulatory element combination modules. A theoretical basis is thereby established for extracting gene regulatory element combination modules from a neural network.
According to the training and visualization method for extracting regulatory DNA combination patterns by a neural network provided by the embodiment of the invention, DNA sequences with a specific function and DNA sequences without the specific function are obtained; the two kinds of DNA sequences are labeled, and both are represented by one-hot encoding; a convolutional neural network is built, the one-hot encoding of the labeled DNA sequences is taken as input, the labels of the corresponding DNA sequences are taken as the target values to be fitted by the output of the convolutional neural network, and the convolutional neural network is trained so that it can identify the DNA sequences; and the trained convolutional neural network is decoupled by the NeuronMotif algorithm designed for this purpose, so that the motifs and motif combination modules corresponding to each neuron are discovered, the gene regulatory element combination modules are obtained and are expressed and stored by using a regulatory element syntax tree, providing a new approach and scheme for extracting gene regulatory element combination modules from a neural network.
Next, a training and visualization system for a neural network extraction regulatory DNA combination pattern proposed according to an embodiment of the present invention will be described with reference to the drawings.
FIG. 5 is a schematic diagram of a training and visualization system for neural network extraction regulatory DNA combination patterns, according to an embodiment of the present invention.
As shown in FIG. 5, the training and visualization system for neural network extraction of regulatory DNA combination patterns includes: an acquisition module 201, a labeling module 202, a training module 203, and a decoupling module 204.
The acquisition module 201 is configured to obtain DNA sequences with a specific function and DNA sequences without the specific function.
The labeling module 202 is configured to label the two kinds of DNA sequences and to represent the DNA sequences with the specific function and the DNA sequences without the specific function by one-hot encoding.
The training module 203 is configured to build a convolutional neural network, take the one-hot encoding of the labeled DNA sequences as input and the labels of the corresponding DNA sequences as the target values to be fitted by the output of the convolutional neural network, and train the convolutional neural network so that it can identify the DNA sequences.
The decoupling module 204 is configured to decouple the trained convolutional neural network through the NeuronMotif algorithm to obtain gene regulatory element combination modules, and to express and store the gene regulatory element combination modules by using a regulatory element syntax tree.
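The data produced by the acquisition and labeling modules can be illustrated with a small sketch of one-hot encoding and labeling. The base order A, C, G, T, the all-zero row for any other character and the labels 1 and 0 for sequences with and without the specific function are assumptions made for illustration.

```python
BASES = "ACGT"

def one_hot_encode(seq):
    """Encode a DNA sequence as one row per base with a single 1 per row.
    Characters other than A, C, G, T (e.g. N) give an all-zero row."""
    return [[1 if base == column else 0 for column in BASES]
            for base in seq.upper()]

def build_dataset(functional, non_functional):
    """Pair each encoded sequence with its label: 1 for sequences with the
    specific function, 0 for sequences without it."""
    samples = [(one_hot_encode(seq), 1) for seq in functional]
    samples += [(one_hot_encode(seq), 0) for seq in non_functional]
    return samples
```

The encoded matrices serve as the network input and the labels as the target values to be fitted by the network output during training.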
Further, in one embodiment of the present invention, obtaining a DNA sequence having a specific function and a DNA sequence not having a specific function includes:
extracting, from a biological genome annotated by biological experimental means, DNA sequence fragments with the specific function and DNA sequence fragments without the specific function; or
artificially synthesizing DNA sequence fragment molecules, performing any type of biological function verification experiment, and determining the fragment molecules with the specific function and the fragment molecules without the specific function.
It should be noted that the foregoing explanation of the method embodiment is also applicable to the system of this embodiment, and is not repeated here.
According to the training and visualization system for extracting regulatory DNA combination patterns by a neural network provided by the embodiment of the invention, DNA sequences with a specific function and DNA sequences without the specific function are obtained; the two kinds of DNA sequences are labeled, and both are represented by one-hot encoding; a convolutional neural network is built, the one-hot encoding of the labeled DNA sequences is taken as input, the labels of the corresponding DNA sequences are taken as the target values to be fitted by the output of the convolutional neural network, and the convolutional neural network is trained so that it can identify the DNA sequences; and the trained convolutional neural network is decoupled by the NeuronMotif algorithm designed for this purpose, so that the motifs and motif combination modules corresponding to each neuron are discovered, the gene regulatory element combination modules are obtained and are expressed and stored by using a regulatory element syntax tree, providing a new approach and scheme for extracting gene regulatory element combination modules from a neural network.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, various embodiments or examples and features of different embodiments or examples described in this specification can be combined by one skilled in the art without contradiction.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (8)

1. A training and visualization method for a neural network extraction regulation and control DNA combination mode is characterized by comprising the following steps:
S1, obtaining a DNA sequence with a specific function and a DNA sequence without the specific function;
S2, labeling the two kinds of DNA sequences, and representing the DNA sequence with the specific function and the DNA sequence without the specific function by one-hot encoding;
S3, building a convolutional neural network, taking the one-hot encoding of the labeled DNA sequence as input, taking the label of the corresponding DNA sequence as the target value to be fitted by the output of the convolutional neural network, and training the convolutional neural network to enable the convolutional neural network to identify the DNA sequence;
S4, decoupling the trained convolutional neural network through a NeuronMotif algorithm to obtain a gene regulatory element combination module, and expressing and storing the gene regulatory element combination module by using a regulatory element syntax tree;
S41, for a neuron in the convolutional neural network, collecting a new DNA sequence set, wherein the DNA sequences in the new DNA sequence set produce neuron activation values of various magnitudes;
S42, for each DNA sequence in the new DNA sequence set, calculating the activation values of all neurons in each layer of the neural network that can affect the neuron;
S43, dividing the new DNA sequence set to obtain a plurality of DNA sequence subsets;
S44, calculating the mathematical expression of the gene functional element combination module corresponding to each DNA sequence subset, and expressing and storing the gene functional element combination module by using a regulatory element syntax tree;
S431, for the new DNA sequence set, starting from the layer where the neuron is located and examining the layers from deep to shallow, when a max pooling layer is encountered, clustering the new DNA sequence set into K classes, K being the pooling size, by using the K-means algorithm on the activation values, for each sequence in the set, of the neurons in the layer immediately shallower than the pooling layer, wherein each class corresponds to a divided DNA sequence subset;
S432, taking the divided DNA sequence subsets as new DNA sequence sets, continuing to examine the layers from deep to shallow starting from the layer at which the clustering took place, and, when a max pooling layer is encountered, clustering each new DNA sequence set into K classes, K being the pooling size, by using the K-means algorithm on the activation values, for each sequence in the set, of the neurons in the layer immediately shallower than the pooling layer, wherein each class corresponds to a divided DNA sequence subset;
and S433, repeating step S432 until the first layer is reached, to obtain a plurality of DNA sequence subsets.
2. The method of claim 1, wherein S1 further comprises:
S11, extracting, from a biological genome annotated by biological experimental means, DNA sequence fragments with the specific function and DNA sequence fragments without the specific function.
3. The method of claim 1, wherein S1 further comprises:
S12, artificially synthesizing DNA sequence fragment molecules, carrying out any type of biological function verification experiment, and determining the fragment molecules with the specific function and the fragment molecules without the specific function.
4. The method of claim 1, wherein labeling the two kinds of DNA sequences comprises:
labeling the DNA sequence with the specific function as a positive sample, and labeling the DNA sequence without the specific function as a negative sample.
5. The method of claim 1, wherein S41 further comprises:
randomly generating DNA sequences according to the size of the receptive field of the neuron, and optimizing the DNA sequences by using a genetic algorithm, wherein the optimization target is the neuron activation value of the DNA sequence; mutations of the DNA sequence in the genetic algorithm are sampled with probabilities given by the magnitude of the gradient of the neuron activation value with respect to the one-hot encoded input of the DNA sequence; in addition to retaining crossover between DNA sequences, cyclic shifts are applied according to the pooling layer structure of the neural network; and the DNA sequences of the intermediate results of the genetic algorithm optimization are sampled without repetition, the sampled DNA sequences forming a DNA sequence set covering various activation values.
6. The method according to claim 1, wherein the computational expression of the gene functional element combination module is E[E(X|Y)], where X is the random variable corresponding to the one-hot encoding of a sampled sequence, Y is the random variable corresponding to the activation value of the sampled sequence, and the relationship Y = f(X) between Y and X is determined by the corresponding neuron, wherein the distribution of the random variable Y is a free choice that needs to be specified, and the random variable X depends on the random variable Y.
7. A training and visualization system for a neural network extraction regulation and control DNA combination mode, characterized in that the system is used for implementing the training and visualization method for a neural network extraction regulation and control DNA combination mode of any one of claims 1 to 5, and comprises:
an acquisition module for obtaining a DNA sequence with a specific function and a DNA sequence without the specific function;
a labeling module for labeling the two kinds of DNA sequences and representing the DNA sequence with the specific function and the DNA sequence without the specific function by one-hot encoding;
a training module for building a convolutional neural network, taking the one-hot encoding of the labeled DNA sequence as input, taking the label of the corresponding DNA sequence as the target value to be fitted by the output of the convolutional neural network, and training the convolutional neural network so that the convolutional neural network can identify the DNA sequence;
and a decoupling module for decoupling the trained convolutional neural network through a NeuronMotif algorithm to obtain a gene regulatory element combination module, and expressing and storing the gene regulatory element combination module by using a regulatory element syntax tree.
8. The system of claim 7, wherein obtaining a DNA sequence having a specific function and a DNA sequence not having the specific function comprises:
extracting, from a biological genome annotated by biological experimental means, DNA sequence fragments with the specific function and DNA sequence fragments without the specific function; or
artificially synthesizing DNA sequence fragment molecules, performing any type of biological function verification experiment, and determining the fragment molecules with the specific function and the fragment molecules without the specific function.
CN202110063192.0A 2021-01-18 2021-01-18 Training and visualization method and system for neural network extraction regulation and control DNA combination mode Active CN112735514B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110063192.0A CN112735514B (en) 2021-01-18 2021-01-18 Training and visualization method and system for neural network extraction regulation and control DNA combination mode

Publications (2)

Publication Number Publication Date
CN112735514A CN112735514A (en) 2021-04-30
CN112735514B true CN112735514B (en) 2022-09-16

Family

ID=75593414

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110063192.0A Active CN112735514B (en) 2021-01-18 2021-01-18 Training and visualization method and system for neural network extraction regulation and control DNA combination mode

Country Status (1)

Country Link
CN (1) CN112735514B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113314187B (en) * 2021-05-27 2022-05-10 广州大学 Data storage method, decoding method, system, device and storage medium
CN114399033B (en) * 2022-03-25 2022-07-19 浙江大学 Brain-like computing system and method based on neuron instruction coding
CN116153404B (en) * 2023-02-28 2023-08-15 成都信息工程大学 Single-cell ATAC-seq data analysis method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109825583A (en) * 2019-03-01 2019-05-31 清华大学 Marker and its application of people's repeat element DNA methylation as hepatocarcinoma early diagnosis
CN109858506A (en) * 2018-05-28 2019-06-07 哈尔滨工程大学 A kind of visualized algorithm towards convolutional neural networks classification results
CN111312329A (en) * 2020-02-25 2020-06-19 成都信息工程大学 Transcription factor binding site prediction method based on deep convolution automatic encoder

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019018693A2 (en) * 2017-07-19 2019-01-24 Altius Institute For Biomedical Sciences Methods of analyzing microscopy images using machine learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
OmarYáñez-Cuna 等.Deciphering the transcriptional cis-regulatory code.《Trends in Genetics》.2013, *
表达相关的转录因子相互作用模式;冯琳璎等;《内蒙古大学学报(自然科学版)》;20180921(第05期);全文 *

Also Published As

Publication number Publication date
CN112735514A (en) 2021-04-30

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant