CN116612816A - Whole genome nucleosome density prediction method, whole genome nucleosome density prediction system and electronic equipment - Google Patents
- Publication number
- CN116612816A (application CN202310415049.2A)
- Authority
- CN
- China
- Prior art keywords
- layer
- whole genome
- cnn
- coding sequence
- coding
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/30—Detection of binding sites or motifs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The application relates to a whole genome nucleosome density prediction method, a whole genome nucleosome density prediction system and electronic equipment. The method comprises the following steps: acquiring DNA sequences of whole genome chromosomes, and performing a first encoding and a second encoding respectively to obtain a first coding sequence and a second coding sequence; constructing and training a DeepNDP model to obtain a trained DeepNDP model; inputting the first coding sequence and the second coding sequence into the trained DeepNDP model to obtain a whole genome nucleosome density result, wherein the DeepNDP model comprises a feature extraction network, a Concatenate layer, a Transformer layer, a Flatten layer and two fully connected layers which are sequentially connected. By encoding the DNA sequence in two forms, the application improves the generalization capability of the model, and can identify the distribution of nucleosomes more efficiently and accurately without costly biological experiments.
Description
Technical Field
The application relates to the technical field of bioinformatics, in particular to a whole genome nucleosome density prediction method, a whole genome nucleosome density prediction system and electronic equipment.
Background
Nucleosome density prediction refers to the use of computational methods to predict the nucleosome signal intensity at each base site, resulting in a continuous nucleosome density across the genome. Nucleosomes are key participants in genetic processes as the basic units of chromatin, whose precise locations can regulate genomic accessibility to DNA binding proteins, thereby effecting regulation of gene expression, DNA replication and repair. Thus, identifying the location of nucleosomes on the genome may help one to study various biological processes in depth.
In past studies, many DNA sequence-based calculation methods have been proposed to determine nucleosome position in DNA sequences, for example:
(1) iNuc-PseKNC: a method for locating nucleosomes. A DNA sequence with a length of 147 bp is input, a feature vector consisting of pseudo k-tuple nucleotide compositions with 6 local DNA structural properties is extracted, and the features are then fed into an SVM classifier to predict whether the sequence is a nucleosome sequence.
(2) DLNN: a method for locating nucleosomes. A DNA sequence with a length of 147 bp is input and encoded in one-hot form, the sequence is modeled and analyzed using a convolutional network and a recurrent network, and the method predicts whether the sequence is a nucleosome sequence.
(3) Routhier et al.: a method for predicting nucleosome density. DNA sequences on the whole chromosome are obtained with a sliding window, and the nucleosome density at the central site of the input sequence is predicted using three sequentially stacked convolution layers.
In the prior art, the nucleosome positioning methods can only capture context information within 147 bp, cannot learn long-range interactions between bases, and cannot rapidly predict and analyze a whole chromosome sequence.
The deep-learning-based nucleosome density prediction method proposed by Routhier et al. has low recognition accuracy, and its prediction performance still has room for improvement.
Disclosure of Invention
Therefore, the technical problem to be solved by the application is the low recognition accuracy of the nucleosome density prediction methods in the prior art.
In order to solve the technical problems, the application provides a whole genome nucleosome density prediction method, which comprises the following steps:
step S1: acquiring DNA sequences of whole genome chromosomes, and respectively performing first coding and second coding to obtain a first coding sequence and a second coding sequence;
meanwhile, constructing and training a DeepNDP model to obtain a trained DeepNDP model;
step S2: inputting the first coding sequence and the second coding sequence into the trained DeepNDP model for prediction to obtain a whole genome nucleosome density result;
the DeepNDP model comprises a feature extraction network, a Concatenate layer, a Transformer layer, a Flatten layer and two fully connected layers which are connected in sequence;
the feature extraction network is used for extracting a first local feature of the first coding sequence and extracting a second local feature of the second coding sequence; the Concatenate layer is used for concatenating the first local feature and the second local feature to obtain a concatenated feature; the Transformer layer is used for extracting global features of the concatenated feature; the Flatten layer is used for changing the dimension of the output of the Transformer layer; the fully connected layers are used to predict whole genome nucleosome density.
In one embodiment of the present application, the feature extraction network in the step S2 includes a feature extraction module ResNet for extracting a first local feature of the first coding sequence and a feature extraction module CNNNet for extracting a second local feature of the second coding sequence.
In one embodiment of the present application, the feature extraction module ResNet includes a first CNN layer, three ResBlock layers, a second CNN layer, a third CNN layer, and a first Reshape layer, which are sequentially connected, where the first Reshape layer is used to change the dimension of the output of the third CNN layer.
In one embodiment of the present application, the ResBlock layer includes a first-column CNN unit and a second-column CNN unit;
the first-column CNN unit comprises a fourth CNN layer, a fifth CNN layer and a sixth CNN layer which are sequentially connected, wherein the convolution kernel sizes of the fourth CNN layer, the fifth CNN layer and the sixth CNN layer are 5, 16 and 16, respectively;
the second-column CNN unit comprises a seventh CNN layer, an eighth CNN layer and a ninth CNN layer which are sequentially connected, wherein the convolution kernel sizes of the seventh CNN layer, the eighth CNN layer and the ninth CNN layer are 3, 8 and 8, respectively;
the output of the sixth CNN layer, the output of the ninth CNN layer and the input of the current ResBlock layer are added together.
In one embodiment of the application, all CNN layers in the feature extraction module ResNet are followed by a ReLU activation function.
In an embodiment of the present application, the feature extraction module CNNNet includes a tenth CNN layer, an eleventh CNN layer, a twelfth CNN layer, and a second Reshape layer, which are sequentially connected, and the second Reshape layer is used to change a dimension of an output of the twelfth CNN layer.
In one embodiment of the present application, the method for obtaining the DNA sequence of the whole genome chromosome in step S1 and performing the first encoding and the second encoding respectively to obtain the first encoding sequence and the second encoding sequence includes:
obtaining a DNA sequence of a whole genome chromosome;
and performing one-hot encoding on the DNA sequence of the whole genome chromosome to obtain a one-hot coding sequence, and simultaneously performing nucleotide chemical property encoding on the DNA sequence of the whole genome chromosome to obtain a nucleotide coding sequence, wherein the one-hot coding sequence is the first coding sequence, and the nucleotide coding sequence is the second coding sequence.
In order to solve the technical problems, the application provides a whole genome nucleosome density prediction system, which comprises:
an encoding and construction module, used for obtaining a DNA sequence of a whole genome chromosome, and performing a first encoding and a second encoding respectively to obtain a first coding sequence and a second coding sequence;
and also used for constructing and training a DeepNDP model to obtain a trained DeepNDP model;
a prediction module, used for inputting the first coding sequence and the second coding sequence into the trained DeepNDP model for prediction to obtain a whole genome nucleosome density result;
the DeepNDP model comprises a feature extraction network, a Concatenate layer, a Transformer layer, a Flatten layer and two fully connected layers which are connected in sequence;
the feature extraction network is used for extracting a first local feature of the first coding sequence and extracting a second local feature of the second coding sequence; the Concatenate layer is used for concatenating the first local feature and the second local feature to obtain a concatenated feature; the Transformer layer is used for extracting global features of the concatenated feature; the Flatten layer is used for changing the dimension of the output of the Transformer layer; the fully connected layers are used to predict whole genome nucleosome density.
In order to solve the technical problems, the application provides electronic equipment, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the steps of the whole genome nucleosome density prediction method when executing the computer program.
To solve the above technical problem, the present application provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the whole genome nucleosome density prediction method as described above.
Compared with the prior art, the technical scheme of the application has the following advantages:
according to the application, the DNA sequence is encoded in two forms, so that the constructed deep learning model can learn more information from the DNA sequence, and the method can identify the distribution of nucleosomes across the whole genome more efficiently and accurately without time-consuming, labor-intensive and costly biological experiments;
the DeepNDP model provided by the application can be used across different species, has strong generalization capability, and avoids the complexity of maintaining separate models for separate species;
the DeepNDP model can be used for detecting the distribution of nucleosomes in biological research, thereby helping researchers to study in depth various biological processes such as gene expression, DNA replication and repair.
Drawings
In order that the application may be more readily understood, a more particular description of the application will be rendered by reference to specific embodiments thereof that are illustrated in the appended drawings.
FIG. 1 is a flow chart of the method of the present application;
FIG. 2 is a schematic diagram of the DeepNDP model of the present application;
FIG. 3 is a diagram comparing the DeepNDP model with the chemical method, using Saccharomyces cerevisiae as an example, in an embodiment of the present application;
FIG. 4 is a diagram comparing the DeepNDP model with existing models on Saccharomyces cerevisiae in an embodiment of the present application;
FIG. 5 is a schematic diagram of the effect of the DeepNDP model with the NCP encoding removed from the input in an embodiment of the present application;
FIG. 6 is a graph comparing the performance of the DeepNDP model with the chemical method, using mice as an example, in an embodiment of the present application.
Detailed Description
The present application will be further described with reference to the accompanying drawings and specific examples, which are not intended to be limiting, so that those skilled in the art will better understand the application and practice it.
Example 1
Referring to FIG. 1, the whole genome nucleosome density prediction method of the present application comprises:
step S1: obtaining DNA sequences of whole genome chromosomes and respectively performing first coding (One-hot coding) and second coding (NCP coding) to obtain a first coding sequence and a second coding sequence;
meanwhile, constructing and training a DeepNDP model to obtain a trained DeepNDP model;
step S2: inputting the first coding sequence and the second coding sequence into the trained DeepNDP model for prediction to obtain a whole genome nucleosome density result;
the DeepNDP model comprises a feature extraction network, a Concatenate layer, a Transformer layer, a Flatten layer and two fully connected layers (namely Dense layers) which are connected in sequence;
the feature extraction network is used for extracting a first local feature of the first coding sequence and extracting a second local feature of the second coding sequence; the Concatenate layer is used for concatenating the first local feature and the second local feature to obtain a concatenated feature; the Transformer layer is used for extracting global features of the concatenated feature; the Flatten layer is used for changing the dimension of the output of the Transformer layer (namely, flattening the output of the Transformer layer); the fully connected layers (i.e., the Dense layers) are used to predict whole genome nucleosome density.
The feature extraction network comprises a feature extraction module ResNet and a feature extraction module CNNNet, wherein the feature extraction module ResNet is used for extracting first local features of a first coding sequence, and the feature extraction module CNNNet is used for extracting second local features of a second coding sequence.
According to the application, the DNA sequence is encoded in two forms, so that the constructed deep learning model can learn more information from the DNA sequence, and the method can identify the distribution of nucleosomes across the whole genome more efficiently and accurately without time-consuming, labor-intensive and costly biological experiments; the DeepNDP model can be used for detecting the distribution of nucleosomes in biological research, thereby helping researchers to study in depth various biological processes such as gene expression, DNA replication and repair.
The present application is described in detail below:
in the step S1, dividing the DNA sequences in the data set into a training set, a verification set and a test set according to chromosome numbers; specifically, taking Saccharomyces cerevisiae as an example, the genome of Saccharomyces cerevisiae comprises 16 chromosomes, the 1 st to 13 th chromosomes are used as training sets, the 14 th and 15 th chromosomes are used as verification sets, and the 16 th chromosome is used as a test set;
the one-hot encoding represents the four bases A, T, C and G in the DNA sequence and the unknown site N as the binary vectors (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1) and (0, 0, 0, 0), respectively;
the nucleotide chemical property encoding (NCP encoding) represents A, C, G, T and the unknown site N in the DNA sequence as (1, 1, 1), (0, 1, 0), (1, 0, 0), (0, 0, 1) and (0, 0, 0), respectively, according to three chemical properties of the base: ring structure, chemical functional group and hydrogen bond strength.
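The two encodings described above can be sketched in a few lines of Python. This is a minimal illustration, not the applicant's implementation: the names `ONE_HOT`, `NCP`, `encode` and `encode_both` are hypothetical, the vector tables follow the common one-hot and NCP conventions (an assumption wherever the text is ambiguous), and treating any unrecognized character as N is also an assumption.

```python
# Lookup tables for the two encodings (assumed to follow the usual conventions).
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "T": (0, 1, 0, 0),
    "C": (0, 0, 1, 0),
    "G": (0, 0, 0, 1),
    "N": (0, 0, 0, 0),
}

# NCP columns: ring structure, chemical functional group, hydrogen bond strength.
NCP = {
    "A": (1, 1, 1),
    "C": (0, 1, 0),
    "G": (1, 0, 0),
    "T": (0, 0, 1),
    "N": (0, 0, 0),
}


def encode(sequence, table):
    """Map each base to its vector; unknown characters fall back to N (assumption)."""
    fallback = table["N"]
    return [table.get(base, fallback) for base in sequence.upper()]


def encode_both(sequence):
    """Return the (first, second) coding sequences for one DNA string."""
    return encode(sequence, ONE_HOT), encode(sequence, NCP)


first, second = encode_both("ACGTN")
```

For a window of length L, `first` has shape L x 4 and `second` has shape L x 3, which is why the two inputs are handled by separate feature extraction modules before being concatenated.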
In step S1, as shown in A in FIG. 2 (left part of FIG. 2), the DeepNDP model contains two input ports (a one-hot encoding input port and an NCP encoding input port), two different feature extraction modules ResNet and CNNNet, a Transformer layer, a Flatten layer and two fully connected layers (Dense layers);
further, the structure of the feature extraction module ResNet of the present embodiment is shown in B in FIG. 2 (middle part of FIG. 2). It is configured to extract local features in the data, and includes a first CNN layer, three ResBlock (i.e., residual module) layers, a second CNN layer, a third CNN layer, and a first Reshape layer, which are sequentially connected, where the first Reshape layer is configured to change the dimension of the output of the third CNN layer; the ResBlock layer comprises a first-column CNN unit and a second-column CNN unit, wherein the first-column CNN unit is used for extracting abstract features, and the second-column CNN unit is used for extracting detail features; the first-column CNN unit comprises a fourth CNN layer, a fifth CNN layer and a sixth CNN layer which are sequentially connected, wherein the convolution kernel sizes of the fourth CNN layer, the fifth CNN layer and the sixth CNN layer are 5, 16 and 16, respectively; the second-column CNN unit comprises a seventh CNN layer, an eighth CNN layer and a ninth CNN layer which are sequentially connected, wherein the convolution kernel sizes of the seventh CNN layer, the eighth CNN layer and the ninth CNN layer are 3, 8 and 8, respectively; an addition (Add) operation is performed on the output (X) of the sixth CNN layer, the output (X) of the ninth CNN layer and the input (X_shortcut) of the current ResBlock layer, and the result is taken as the output of the current ResBlock layer.
The feature extraction module CNNNet of the present embodiment is shown in C in FIG. 2 (right part of FIG. 2). It is also used to extract local features in the data, and is composed of three sequentially stacked convolution layers (i.e., the tenth CNN layer, the eleventh CNN layer and the twelfth CNN layer) and a second Reshape layer.
The Transformer layer is an architecture based on the self-attention mechanism that integrates a residual design and a multi-head attention mechanism, and is used for extracting global features of the data. It comprises two parts: a self-attention sublayer and a feed-forward neural network sublayer. The self-attention sublayer calculates the correlation between the representation vector at each position in the input sequence and those at the other positions, so as to capture long-distance dependencies in the sequence. The feed-forward neural network sublayer applies a nonlinear transformation to the output of the self-attention sublayer, increasing the expressive capacity of the model. Each sublayer is followed by a residual connection and a layer normalization operation to improve the stability and convergence speed of the model. The fully connected layers (i.e., the Dense layers) are used to predict the output result.
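To make the role of the self-attention sublayer concrete, the following is a minimal single-head sketch in plain Python. It is a simplification under stated assumptions: queries, keys and values are all taken to be the input itself (no learned projection matrices), only one head is used, and layer normalization and the feed-forward sublayer are omitted. The function names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x (simplifying assumption)."""
    d = len(x[0])
    out = []
    for q in x:
        # Similarity of this position to every position in the sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        # Each output position is a weighted mix of all positions.
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out


def attention_sublayer(x):
    """Self-attention followed by the residual connection (layer norm omitted)."""
    attended = self_attention(x)
    return [[a + b for a, b in zip(ra, rx)] for ra, rx in zip(attended, x)]


features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = attention_sublayer(features)
```

Because every output position mixes information from all positions, distant bases can influence each other in a single step, which the 147 bp convolution-only approaches discussed in the Background cannot do.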
It should be noted that the ResNet of this embodiment fuses multi-scale convolution with a residual network, since convolution layers with different kernel sizes can extract features at different scales. As shown in the ResBlock layer in B in FIG. 2, the number of convolution kernels of each CNN layer is set to 16, which ensures that the three features can subsequently be added. The channel in the first-column CNN unit, with convolution kernel sizes of 5, 16 and 16, extracts more abstract features from the sequence matrix, while the channel in the second-column CNN unit, with convolution kernel sizes of 3, 8 and 8, extracts more detailed features from the sequence matrix. The Add function is then used to add the outputs X of the two columns of CNN units to the input X_shortcut of the current ResBlock layer. This design extracts features of the sequence matrix at different scales and makes it easier for the neural network to learn the identity mapping, thereby mitigating information loss and gradient attenuation in the deep network.
It should be noted that, in this embodiment, a ReLU activation function follows each convolution layer (CNN layer) in the ResNet. It removes the negative values in the convolution result while keeping the positive values unchanged, which alleviates the vanishing gradient problem, accelerates the convergence of gradient descent and improves computational efficiency.
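The two-column residual structure described above can be traced with a small, dependency-free sketch. It is illustrative only: the weights are random stand-ins for trained parameters, zero "same" padding is an assumption (the text does not state the padding, but the three tensors must have equal length to be added), and the names `conv1d_same` and `res_block` are hypothetical.

```python
import random


def conv1d_same(x, kernel_size, out_channels, rng):
    """1-D convolution with zero 'same' padding followed by ReLU.

    x is a list of positions, each a list of channel values.
    """
    in_channels = len(x[0])
    # Random weights stand in for trained parameters.
    w = [[[rng.uniform(-0.1, 0.1) for _ in range(out_channels)]
          for _ in range(in_channels)]
         for _ in range(kernel_size)]
    left = (kernel_size - 1) // 2
    out = []
    for pos in range(len(x)):
        row = [0.0] * out_channels
        for k in range(kernel_size):
            src = pos + k - left
            if 0 <= src < len(x):  # positions outside the sequence count as zeros
                for ci in range(in_channels):
                    xv = x[src][ci]
                    for co in range(out_channels):
                        row[co] += xv * w[k][ci][co]
        out.append([max(0.0, v) for v in row])  # ReLU
    return out


def res_block(x, rng, channels=16):
    """Two parallel columns of three convolutions plus the shortcut input."""
    coarse = x
    for k in (5, 16, 16):  # first column: larger kernels
        coarse = conv1d_same(coarse, k, channels, rng)
    fine = x
    for k in (3, 8, 8):    # second column: smaller kernels
        fine = conv1d_same(fine, k, channels, rng)
    # Add: first-column output + second-column output + X_shortcut.
    return [[a + b + c for a, b, c in zip(ra, rb, rc)]
            for ra, rb, rc in zip(coarse, fine, x)]


rng = random.Random(0)
x = [[rng.random() for _ in range(16)] for _ in range(40)]  # 40 positions, 16 channels
y = res_block(x, rng)
```

Keeping every layer at 16 channels and preserving the sequence length is what makes the element-wise addition well defined, so the block's output has the same shape as its input and three such blocks can be stacked directly.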
Further, the parameters of the convolution layers of the feature extraction module CNNNet of the present embodiment (i.e., the tenth, eleventh and twelfth CNN layers) are set as follows: the tenth CNN layer has 64 convolution kernels of size 3, the eleventh CNN layer has 16 convolution kernels of size 8, and the twelfth CNN layer has 8 convolution kernels of size 80.
The parameters of the two fully connected layers of this embodiment are set as follows: the output sizes are 256 and 1, respectively.
The two inputs input1 and input2 of the DeepNDP model are the DNA sequence represented in the two encoding forms; in the step S2, the two inputs are fed into ResNet and CNNNet respectively to extract local features, which are then concatenated horizontally by the Concatenate layer; the Transformer layer integrates the concatenated local features and learns global features; the fully connected layers are then used to output a number between 0 and 1 representing the nucleosome density at the central site of the input DNA sequence.
Further, in this embodiment, the DeepNDP model is trained using the training set and the verification set, and its performance is verified using the test set; a sliding window with a window size of 2001 bp and a step length of 1 bp is applied to the DNA sequence, and each 2001 bp DNA sequence is used as a model input sequence, so that the DNA sequence of the whole chromosome is read; the loss function is used to calculate the difference between the prediction and the actual data, and gradient updates are performed to update the parameters of the DeepNDP model;
during training, the dropout rate is set to 0.2, and the loss function is defined in terms of the following quantities:
ŷ is the model prediction, y is the true value, MAE is the mean absolute error between the two, and corr is the Pearson correlation coefficient between the two.
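The two quantities named in the loss definition, MAE and the Pearson correlation, are straightforward to compute. The sketch below also shows one plausible way of combining them, MAE + (1 - corr); that combination is an assumption made here for illustration, since the formula itself is not reproduced in this text. All function names are hypothetical.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted densities."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_corr(y_true, y_pred):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    denom = math.sqrt(var_t * var_p)
    return cov / denom if denom > 0 else 0.0


def combined_loss(y_true, y_pred):
    # ASSUMED combination: low MAE and high correlation both reduce the loss.
    return mean_absolute_error(y_true, y_pred) + (1.0 - pearson_corr(y_true, y_pred))


loss_perfect = combined_loss([0.1, 0.5, 0.9], [0.1, 0.5, 0.9])
loss_inverted = combined_loss([0.1, 0.5, 0.9], [0.9, 0.5, 0.1])
```

Under this assumed form, the MAE term penalizes errors in absolute density while the correlation term rewards predictions that rise and fall with the measured signal, even when their overall scale is slightly off.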
Further, the present embodiment encodes the test set to obtain feature sequences in two forms, one-hot encoding and nucleotide chemical property encoding; the two feature sequences are fed into the two feature extraction modules to extract local features, which are then concatenated horizontally; the Transformer layer integrates the concatenated local features and learns the global features of the sequence; the two fully connected layers then extract the nonlinear relations among the global features and output the prediction result;
wherein the output is mapped to a final predicted density, representing the nucleosome density of the central site of the input sequence, using softmax as an activation function in the fully connected layer.
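Since each prediction refers to the central site of a 2001 bp window moved 1 bp at a time, reading a chromosome amounts to enumerating windows together with the index of their center. A minimal sketch follows; the function name `sliding_windows` is hypothetical, and skipping the chromosome ends where no full window fits is an assumption, as the text does not say how the ends are handled.

```python
def sliding_windows(sequence, window=2001, step=1):
    """Yield (center_index, subsequence) for every full window on the sequence."""
    half = window // 2
    for start in range(0, len(sequence) - window + 1, step):
        yield start + half, sequence[start:start + window]


# Toy example with a short window so the result is easy to inspect.
toy = "ACGT" * 3  # 12 bases
windows = list(sliding_windows(toy, window=5, step=1))
```

With the default settings, a chromosome of length N yields N - 2000 windows, and the model output for each window is assigned to the position returned as `center_index`.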
The nucleosome density predicted by the method of this example was compared with that obtained by biological experiment; the results are shown in A, B and C in FIG. 3. A in FIG. 3 is a scatter diagram of the DeepNDP model prediction against the biological experiment result, giving a quantitative comparison of the two: the X axis is the computational prediction and the Y axis is the biological experiment result. The closer the black area in the diagram lies to the y=x direction, the stronger the positive correlation between the two signals, and otherwise the correlation is negative; the darker the black area, the more concentrated the data. The diagram shows that the distribution of the values predicted by the DeepNDP model is consistent with the true values of the biological experiment. B and C in FIG. 3 show the variation along the DNA sequence of the predicted nucleosome density and of the nucleosome density from the biological experiment, respectively. The graphs show that the predicted high-density and low-density regions are consistent with the biological experiment result, which means that the DeepNDP model can accurately identify nucleosome-dense regions and nucleosome-depleted regions on the DNA sequence.
The prediction results of the method of this example were compared with those of existing methods on the same dataset. FIG. 4 shows, from left to right, the Pearson correlation coefficients obtained on the sixteenth chromosome of the Saccharomyces cerevisiae genome by DeepNDP, Routhier et al., DLNN and LeNup. The comparison shows that the correlation coefficient between the nucleosome density predicted by DeepNDP and the MNase-seq signal reaches 0.723, while the Pearson correlation coefficient obtained by the method proposed by Routhier et al. is 0.68; models designed for nucleosome positioning, such as DLNN and LeNup, achieve 0.43 and 0.40, respectively, when used to predict whole genome nucleosome density. Therefore, the DeepNDP model of this embodiment trains better than the previous models, and its predictions correlate better with the experimental signal.
It should be noted that in this example the DNA sequence of the whole genome chromosome is subjected to both One-hot encoding and NCP encoding. One-hot encoding alone is not used, because combining One-hot encoding with NCP encoding gives better results. Specifically, again taking Saccharomyces cerevisiae as an example, the DNA sequence was encoded with One-hot encoding only, in order to verify the effectiveness of nucleotide chemical property (NCP) encoding in nucleosome density prediction. The results show that when the DeepNDP model uses One-hot encoding alone, the Pearson correlation of the prediction is 0.703, compared with 0.723 when both encodings are used. FIG. 5 shows the distribution curve of nucleosome density on a fragment of Saccharomyces cerevisiae chr16: the solid line is the prediction obtained with nucleotide chemical property encoding, the dotted line is the prediction obtained without it, and the dashed line is the biological experimental result. After nucleotide chemical property encoding is added, the prediction on the Saccharomyces cerevisiae chr16 fragment fits the biological experimental result more closely, and the performance of the DeepNDP model improves to a certain extent.
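The two encodings can be sketched as follows. The One-hot table is the usual four-channel scheme. The NCP table is an assumption: it is a commonly used nucleotide chemical property scheme (ring structure, functional group, hydrogen bonding) and is not confirmed by this passage to be the exact table of the embodiment. The helper name `encode` and the all-zero handling of ambiguous bases are likewise illustrative choices.

```python
import numpy as np

# One-hot encoding: one channel per base, in the order A, C, G, T.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}

# ASSUMPTION: a commonly used NCP table with three binary properties:
# (purine ring, amino group, weak hydrogen bonding).
NCP = {
    "A": (1, 1, 1),
    "C": (0, 1, 0),
    "G": (1, 0, 0),
    "T": (0, 0, 1),
}

def encode(sequence, table):
    """Encode a DNA string as a (length, channels) float array.

    Bases missing from the table (for example N) become all-zero rows.
    """
    width = len(next(iter(table.values())))
    out = np.zeros((len(sequence), width), dtype=np.float32)
    for i, base in enumerate(sequence.upper()):
        if base in table:
            out[i] = table[base]
    return out

first_coding = encode("ACGTN", ONE_HOT)   # first coding sequence, shape (5, 4)
second_coding = encode("ACGTN", NCP)      # second coding sequence, shape (5, 3)
```

Keeping the two encodings as separate arrays matches the design in which each coding sequence is passed to its own feature extraction branch before the features are concatenated.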
In addition to Saccharomyces cerevisiae, this example also uses the DNA sequences within 2000 bp upstream and downstream of mouse transcription start sites as a test set, which is input into the DeepNDP model to verify the effectiveness of the model for cross-species prediction. The results are shown in FIG. 6: the vertical axes of the upper and lower panels are the nucleosome density predicted by the DeepNDP model and the NCP_score (nucleosome center positioning score) obtained by the chemical method, respectively, where NCP_score indicates the signal intensity at the nucleosome center position. As FIG. 6 shows, the mouse nucleosome density predicted by the DeepNDP model has a periodicity similar to that of the NCP_score obtained by the chemical method, which indicates that the DeepNDP model generalizes well across species.
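Extracting the test window around a transcription start site can be sketched as below. The function name `tss_window`, the 0-based coordinate convention and the clipping at chromosome ends are assumptions made for illustration; strand orientation is ignored in this sketch, and the passage does not state how the embodiment handles these details.

```python
def tss_window(chrom_seq, tss, flank=2000):
    """Return the sequence from `flank` bp upstream to `flank` bp downstream
    of a transcription start site.

    `tss` is a 0-based index into `chrom_seq`. The window is clipped at the
    chromosome ends, so it may be shorter than 2 * flank + 1.
    """
    if not 0 <= tss < len(chrom_seq):
        raise ValueError("tss lies outside the chromosome sequence")
    start = max(0, tss - flank)
    end = min(len(chrom_seq), tss + flank + 1)
    return chrom_seq[start:end]

# Hypothetical chromosome of 10,000 bp with a start site at index 5000.
toy_chrom = "ACGT" * 2500
window = tss_window(toy_chrom, 5000)
```

Each extracted window would then be encoded and passed to the trained model in the same way as the Saccharomyces cerevisiae sequences.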
Example Two
This example provides a whole genome nucleosome density prediction system, comprising:
an encoding and construction module, configured to obtain the DNA sequence of a whole genome chromosome and to perform first encoding and second encoding on it, respectively, to obtain a first coding sequence and a second coding sequence;
and also configured to construct and train a DeepNDP model to obtain a trained DeepNDP model;
a prediction module, configured to input the first coding sequence and the second coding sequence into the trained DeepNDP model for prediction to obtain a whole genome nucleosome density result;
wherein the DeepNDP model comprises a feature extraction network, a Concatenate layer, a Transformer layer, a Flatten layer and two fully connected layers, connected in sequence;
the feature extraction network is used to extract a first local feature of the first coding sequence and a second local feature of the second coding sequence; the Concatenate layer is used to concatenate the first local feature and the second local feature to obtain a concatenated feature; the Transformer layer is used to extract global features from the concatenated feature; the Flatten layer is used to change the dimension of the output of the Transformer layer; and the fully connected layers are used to predict whole genome nucleosome density.
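The data flow after the feature extraction network can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the local features are random placeholders standing in for the outputs of the feature extraction network, a single self-attention step stands in for the Transformer layer (no layer normalization or feed-forward sublayer), the concatenation axis is assumed to be the feature axis, and all sizes and weights are hypothetical and untrained.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    """Scaled dot-product self-attention over the position axis."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores, axis=-1) @ v

def dense(x, w, b, relu=False):
    """Fully connected layer with an optional ReLU."""
    y = x @ w + b
    return np.maximum(y, 0.0) if relu else y

def forward(first_local, second_local, params):
    # Concatenate layer: join the two local features (assumed: feature axis).
    spliced = np.concatenate([first_local, second_local], axis=-1)
    # Stand-in for the Transformer layer: extract global features.
    global_feat = self_attention(spliced, *params["attn"])
    # Flatten layer: (positions, features) -> 1-D vector.
    flat = global_feat.reshape(-1)
    # Two fully connected layers produce the density prediction.
    hidden = dense(flat, *params["fc1"], relu=True)
    return dense(hidden, *params["fc2"])

# Hypothetical sizes: 10 positions, 8 + 4 feature channels, 16 hidden units.
n_pos, d1, d2, n_hidden, n_out = 10, 8, 4, 16, 1
d = d1 + d2
params = {
    "attn": [rng.normal(scale=0.1, size=(d, d)) for _ in range(3)],
    "fc1": (rng.normal(scale=0.1, size=(n_pos * d, n_hidden)), np.zeros(n_hidden)),
    "fc2": (rng.normal(scale=0.1, size=(n_hidden, n_out)), np.zeros(n_out)),
}
density = forward(rng.normal(size=(n_pos, d1)), rng.normal(size=(n_pos, d2)), params)
```

The sketch only demonstrates how tensor shapes change from one layer to the next; a practical implementation would use a deep learning framework and trained parameters.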
Example Three
The present embodiment provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the steps of the whole genome nucleosome density prediction method according to the first embodiment when executing the computer program.
Example Four
The present embodiment provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the whole genome nucleosome density prediction method of the first embodiment.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The schemes in the embodiments of the application can be implemented in various computer languages, such as the object-oriented programming language Java and the interpreted scripting language JavaScript.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
Clearly, the above examples are given by way of illustration only and do not limit the embodiments. Other variations and modifications in different forms will be apparent to those of ordinary skill in the art in light of the foregoing description. It is neither necessary nor possible to enumerate all embodiments here. Obvious variations or modifications derived therefrom remain within the scope of protection of the present application.
Claims (10)
1. A whole genome nucleosome density prediction method, characterized by comprising the following steps:
step S1: obtaining the DNA sequence of a whole genome chromosome, and performing first encoding and second encoding on it, respectively, to obtain a first coding sequence and a second coding sequence;
meanwhile, constructing and training a DeepNDP model to obtain a trained DeepNDP model;
step S2: inputting the first coding sequence and the second coding sequence into the trained DeepNDP model for prediction to obtain a whole genome nucleosome density result;
wherein the DeepNDP model comprises a feature extraction network, a Concatenate layer, a Transformer layer, a Flatten layer and two fully connected layers, connected in sequence;
the feature extraction network is used to extract a first local feature of the first coding sequence and a second local feature of the second coding sequence; the Concatenate layer is used to concatenate the first local feature and the second local feature to obtain a concatenated feature; the Transformer layer is used to extract global features from the concatenated feature; the Flatten layer is used to change the dimension of the output of the Transformer layer; and the fully connected layers are used to predict whole genome nucleosome density.
2. The whole genome nucleosome density prediction method according to claim 1, wherein: the feature extraction network in step S2 includes a feature extraction module ResNet and a feature extraction module CNNNet, where the feature extraction module ResNet is used to extract the first local feature of the first coding sequence, and the feature extraction module CNNNet is used to extract the second local feature of the second coding sequence.
3. The whole genome nucleosome density prediction method according to claim 2, characterized in that: the feature extraction module ResNet comprises a first CNN layer, three ResBlock layers, a second CNN layer, a third CNN layer and a first Reshape layer which are sequentially connected, wherein the first Reshape layer is used for changing the output dimension of the third CNN layer.
4. The whole genome nucleosome density prediction method according to claim 3, wherein: the ResBlock layer comprises a first-column CNN unit and a second-column CNN unit;
the first-column CNN unit comprises a fourth CNN layer, a fifth CNN layer and a sixth CNN layer which are sequentially connected, wherein the convolution kernels adopted by the fourth CNN layer, the fifth CNN layer and the sixth CNN layer are 5, 16 and 16, respectively;
the second-column CNN unit comprises a seventh CNN layer, an eighth CNN layer and a ninth CNN layer which are sequentially connected, wherein the convolution kernels adopted by the seventh CNN layer, the eighth CNN layer and the ninth CNN layer are 3, 8 and 8, respectively;
and the output of the sixth CNN layer, the output of the ninth CNN layer and the input of the current ResBlock layer are added together.
5. The whole genome nucleosome density prediction method according to claim 4, wherein: every CNN layer in the feature extraction module ResNet is followed by a ReLU activation function.
6. The whole genome nucleosome density prediction method according to claim 2, characterized in that: the feature extraction module CNNNet comprises a tenth CNN layer, an eleventh CNN layer, a twelfth CNN layer and a second Reshape layer which are sequentially connected, wherein the second Reshape layer is used for changing the dimension output by the twelfth CNN layer.
7. The whole genome nucleosome density prediction method according to claim 1, wherein: in step S1, obtaining the DNA sequence of the whole genome chromosome and performing the first encoding and the second encoding, respectively, to obtain the first coding sequence and the second coding sequence comprises:
obtaining the DNA sequence of the whole genome chromosome;
and performing One-hot encoding on the DNA sequence of the whole genome chromosome to obtain a One-hot coding sequence, and at the same time performing nucleotide encoding on the DNA sequence of the whole genome chromosome to obtain a nucleotide coding sequence, wherein the One-hot coding sequence is the first coding sequence, and the nucleotide coding sequence is the second coding sequence.
8. A whole genome nucleosome density prediction system, characterized by comprising:
an encoding and construction module, configured to obtain the DNA sequence of a whole genome chromosome and to perform first encoding and second encoding on it, respectively, to obtain a first coding sequence and a second coding sequence;
and also configured to construct and train a DeepNDP model to obtain a trained DeepNDP model;
a prediction module, configured to input the first coding sequence and the second coding sequence into the trained DeepNDP model for prediction to obtain a whole genome nucleosome density result;
wherein the DeepNDP model comprises a feature extraction network, a Concatenate layer, a Transformer layer, a Flatten layer and two fully connected layers, connected in sequence;
the feature extraction network is used to extract a first local feature of the first coding sequence and a second local feature of the second coding sequence; the Concatenate layer is used to concatenate the first local feature and the second local feature to obtain a concatenated feature; the Transformer layer is used to extract global features from the concatenated feature; the Flatten layer is used to change the dimension of the output of the Transformer layer; and the fully connected layers are used to predict whole genome nucleosome density.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized by: the processor, when executing the computer program, implements the steps of the whole genome nucleosome density prediction method according to any one of claims 1 to 7.
10. A computer-readable storage medium having stored thereon a computer program, characterized by: the computer program, when executed by a processor, performs the steps of the whole genome nucleosome density prediction method according to any one of claims 1 to 7.
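For illustration only, the two-column residual block recited in claim 4, with the ReLU of claim 5, can be sketched in NumPy as follows. Reading the values 5, 16, 16 and 3, 8, 8 as kernel lengths, keeping the channel count constant so that the three terms can be added, using zero padding that preserves sequence length, and the helper names `conv1d_same`, `make_column` and `res_block` are all assumptions of this sketch rather than features confirmed by the claims.

```python
import numpy as np

def conv1d_same(x, w, b):
    """1-D convolution with zero padding that preserves length, then ReLU.

    x: (length, c_in); w: (kernel, c_in, c_out); b: (c_out,).
    """
    k = w.shape[0]
    left = (k - 1) // 2
    xp = np.pad(x, ((left, k - 1 - left), (0, 0)))
    out = np.stack([
        np.tensordot(xp[i:i + k], w, axes=([0, 1], [0, 1]))
        for i in range(x.shape[0])
    ])
    return np.maximum(out + b, 0.0)

def make_column(kernel_sizes, channels, rng):
    """Random, untrained weights for one column of CNN layers."""
    return [
        (rng.normal(scale=0.1, size=(k, channels, channels)), np.zeros(channels))
        for k in kernel_sizes
    ]

def res_block(x, column_a, column_b):
    """Sum of both column outputs and the block input."""
    a = x
    for w, b in column_a:
        a = conv1d_same(a, w, b)
    c = x
    for w, b in column_b:
        c = conv1d_same(c, w, b)
    return a + c + x

rng = np.random.default_rng(0)
x = rng.normal(size=(20, 4))                       # hypothetical input: 20 positions, 4 channels
first_column = make_column((5, 16, 16), 4, rng)    # fourth, fifth and sixth CNN layers
second_column = make_column((3, 8, 8), 4, rng)     # seventh, eighth and ninth CNN layers
y = res_block(x, first_column, second_column)
```

Because the block input is added back to the column outputs, the block reduces to the identity mapping when all convolution weights are zero.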
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310415049.2A CN116612816A (en) | 2023-04-18 | 2023-04-18 | Whole genome nucleosome density prediction method, whole genome nucleosome density prediction system and electronic equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116612816A (en) | 2023-08-18 |
Family
ID=87673676
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310415049.2A Pending CN116612816A (en) | 2023-04-18 | 2023-04-18 | Whole genome nucleosome density prediction method, whole genome nucleosome density prediction system and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116612816A (en) |
-
2023
- 2023-04-18 CN CN202310415049.2A patent/CN116612816A/en active Pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||