CN116612816B - Whole genome nucleosome density prediction method, whole genome nucleosome density prediction system and electronic equipment - Google Patents

Whole genome nucleosome density prediction method, whole genome nucleosome density prediction system and electronic equipment

Info

Publication number
CN116612816B
Authority
CN
China
Prior art keywords
layer
whole genome
cnn
coding sequence
coding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310415049.2A
Other languages
Chinese (zh)
Other versions
CN116612816A (en
Inventor
吴庭芳
周昳婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University filed Critical Suzhou University
Priority to CN202310415049.2A priority Critical patent/CN116612816B/en
Publication of CN116612816A publication Critical patent/CN116612816A/en
Application granted granted Critical
Publication of CN116612816B publication Critical patent/CN116612816B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30Detection of binding sites or motifs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Mathematical Physics (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Genetics & Genomics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Biotechnology (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a whole genome nucleosome density prediction method, a whole genome nucleosome density prediction system and electronic equipment. The method comprises the following steps: acquiring DNA sequences of whole genome chromosomes, and performing first coding and second coding on them respectively to obtain a first coding sequence and a second coding sequence; constructing and training a DeepNDP model to obtain a trained DeepNDP model; and inputting the first coding sequence and the second coding sequence into the trained DeepNDP model to obtain a whole genome nucleosome density result. The DeepNDP model comprises a feature extraction network, a Concatenate layer, a Transformer layer, a Flatten layer and two fully connected layers which are sequentially connected. By encoding the DNA sequence in two forms, the invention improves the generalization capability of the model and can identify the distribution of nucleosomes more efficiently and accurately without costly biological experiments.

Description

Whole genome nucleosome density prediction method, whole genome nucleosome density prediction system and electronic equipment
Technical Field
The invention relates to the technical field of bioinformatics, in particular to a whole genome nucleosome density prediction method, a whole genome nucleosome density prediction system and electronic equipment.
Background
Nucleosome density prediction refers to the use of computational methods to predict the nucleosome signal intensity at each base site, resulting in a continuous nucleosome density across the genome. Nucleosomes are key participants in genetic processes as the basic units of chromatin, whose precise locations can regulate genomic accessibility to DNA binding proteins, thereby effecting regulation of gene expression, DNA replication and repair. Thus, identifying the location of nucleosomes on the genome may help one to study various biological processes in depth.
In past studies, many DNA sequence-based calculation methods have been proposed to determine nucleosome position in DNA sequences, for example:
(1) iNuc-PseKNC: a method for locating nucleosomes. A DNA sequence with the length of 147bp is input, a characteristic vector consisting of pseudo k-tuple nucleotides with 6 local DNA structural characteristics is extracted, and then the characteristics are input into an SVM classifier to predict whether the sequence is a nucleosome sequence.
(2) DLNN: a method for locating nucleosomes. A DNA sequence with a length of 147 bp is input and encoded in one-hot form; the sequence is then modeled and analyzed using a convolutional network and a recurrent network to predict whether it is a nucleosome sequence.
(3) Routhier et al: a method for predicting the density of nucleosomes. DNA sequences on the whole chromosome were obtained in the form of sliding windows, and the nucleosome density at the central site of the input sequence was predicted using three sequentially stacked convolution layers.
In the prior art, nucleosome positioning methods can only capture context information within 147 bp, cannot learn long-range interactions between bases, and cannot rapidly predict and analyze a whole chromosome sequence.
Moreover, the deep-learning-based nucleosome density prediction method proposed by Routhier et al. has low recognition accuracy, and its prediction performance still has room for improvement.
Disclosure of Invention
Therefore, the technical problem to be solved by the invention is the low recognition accuracy of nucleosome density prediction methods in the prior art.
In order to solve the technical problems, the invention provides a whole genome nucleosome density prediction method, which comprises the following steps:
Step S1: acquiring DNA sequences of whole genome chromosomes, and respectively performing first coding and second coding to obtain a first coding sequence and a second coding sequence;
Simultaneously constructing and training a DeepNDP model to obtain a trained DeepNDP model;
Step S2: inputting the first coding sequence and the second coding sequence into the trained DeepNDP model for prediction to obtain a whole genome nucleosome density result;
the DeepNDP model comprises a feature extraction network, a Concatenate layer, a Transformer layer, a Flatten layer and two fully connected layers which are sequentially connected;
the feature extraction network is used for extracting a first local feature of the first coding sequence and a second local feature of the second coding sequence; the Concatenate layer is used for splicing the first local feature and the second local feature to obtain a spliced feature; the Transformer layer is used for extracting global features of the spliced feature; the Flatten layer is used for changing the dimension of the output of the Transformer layer; the fully connected layers are used to predict whole genome nucleosome density.
In one embodiment of the present invention, the feature extraction network in the step S2 includes a feature extraction module ResNet and a feature extraction module CNNNet, where the feature extraction module ResNet is configured to extract a first local feature of a first coding sequence and the feature extraction module CNNNet is configured to extract a second local feature of a second coding sequence.
In one embodiment of the present invention, the feature extraction module ResNet includes a first CNN layer, three ResBlock layers, a second CNN layer, a third CNN layer, and a first Reshape layer, which are sequentially connected, where the first Reshape layer is used to change a dimension of an output of the third CNN layer.
In one embodiment of the present invention, each ResBlock layer includes a first-column CNN unit and a second-column CNN unit;
The first-column CNN unit comprises a fourth CNN layer, a fifth CNN layer and a sixth CNN layer which are sequentially connected, wherein the convolution kernel sizes of the fourth CNN layer, the fifth CNN layer and the sixth CNN layer are 5, 16 and 16, respectively;
The second-column CNN unit comprises a seventh CNN layer, an eighth CNN layer and a ninth CNN layer which are sequentially connected, wherein the convolution kernel sizes of the seventh CNN layer, the eighth CNN layer and the ninth CNN layer are 3, 8 and 8, respectively;
The output of the sixth CNN layer, the output of the ninth CNN layer and the input of the current ResBlock layer are added together.
In one embodiment of the invention, all CNN layers in the feature extraction module ResNet are followed by a ReLU activation function.
In one embodiment of the present invention, the feature extraction module CNNNet includes a tenth CNN layer, an eleventh CNN layer, a twelfth CNN layer, and a second Reshape layer connected in sequence, where the second Reshape layer is used to change a dimension of an output of the twelfth CNN layer.
In one embodiment of the present invention, the method for obtaining the DNA sequence of the whole genome chromosome in step S1 and performing the first encoding and the second encoding respectively to obtain the first encoding sequence and the second encoding sequence includes:
obtaining a DNA sequence of a whole genome chromosome;
One-hot coding is performed on the DNA sequence of the whole genome chromosome to obtain a One-hot coding sequence, and nucleotide coding is performed on the same DNA sequence to obtain a nucleotide coding sequence, wherein the One-hot coding sequence is the first coding sequence and the nucleotide coding sequence is the second coding sequence.
In order to solve the technical problems, the invention provides a whole genome nucleosome density prediction system, which comprises:
an encoding and construction module, configured to obtain a DNA sequence of a whole genome chromosome and to perform first coding and second coding respectively to obtain a first coding sequence and a second coding sequence;
and further configured to construct and train a DeepNDP model to obtain a trained DeepNDP model;
a prediction module, configured to input the first coding sequence and the second coding sequence into the trained DeepNDP model for prediction to obtain a whole genome nucleosome density result;
the DeepNDP model comprises a feature extraction network, a Concatenate layer, a Transformer layer, a Flatten layer and two fully connected layers which are sequentially connected;
the feature extraction network is used for extracting a first local feature of the first coding sequence and a second local feature of the second coding sequence; the Concatenate layer is used for splicing the first local feature and the second local feature to obtain a spliced feature; the Transformer layer is used for extracting global features of the spliced feature; the Flatten layer is used for changing the dimension of the output of the Transformer layer; the fully connected layers are used to predict whole genome nucleosome density.
In order to solve the technical problems, the invention provides electronic equipment, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the steps of the whole genome nucleosome density prediction method when executing the computer program.
To solve the above technical problem, the present invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the whole genome nucleosome density prediction method as described above.
Compared with the prior art, the technical scheme of the invention has the following advantages:
According to the invention, the DNA sequence is encoded into two forms, so that the constructed deep learning model can learn more information from the DNA sequence, and the method can more efficiently and accurately identify the distribution of nucleosomes of the whole genome without time-consuming and labor-consuming biological experiments with high cost;
The DeepNDP model provided by the invention can be used across different species and has strong generalization capability, avoiding the complexity of building separate models for separate species;
the DeepNDP model of the invention can be used for detecting the distribution of nucleosomes in biological research, thereby helping researchers to deeply study various biological processes such as gene expression, DNA replication, repair and the like.
Drawings
In order that the invention may be more readily understood, a more particular description of the invention will be rendered by reference to specific embodiments thereof that are illustrated in the appended drawings.
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of the structure of the DeepNDP model of the present invention;
FIG. 3 is a diagram comparing the DeepNDP model with the chemical method, taking Saccharomyces cerevisiae as an example, in an embodiment of the present invention;
FIG. 4 is a diagram comparing the DeepNDP model with existing models, taking Saccharomyces cerevisiae as an example, in an embodiment of the present invention;
FIG. 5 is a diagram showing the effect of the DeepNDP model when the NCP coding is removed from the input in an embodiment of the present invention;
FIG. 6 is a graph comparing the performance of the DeepNDP model with the chemical method, taking mice as an example, in an embodiment of the present invention.
Detailed Description
The present invention will be further described with reference to the accompanying drawings and specific examples, which are not intended to be limiting, so that those skilled in the art will better understand the invention and practice it.
Example 1
Referring to FIG. 1, the whole genome nucleosome density prediction method of the present invention comprises:
Step S1: obtaining DNA sequences of whole genome chromosomes and respectively performing first coding (One-hot coding) and second coding (NCP coding) to obtain a first coding sequence and a second coding sequence;
Simultaneously constructing and training a DeepNDP model to obtain a trained DeepNDP model;
Step S2: inputting the first coding sequence and the second coding sequence into the trained DeepNDP model for prediction to obtain a whole genome nucleosome density result;
The DeepNDP model comprises a feature extraction network, a Concatenate layer, a Transformer layer, a Flatten layer and two fully connected layers (namely Dense layers) which are connected in sequence;
The feature extraction network is used for extracting a first local feature of the first coding sequence and a second local feature of the second coding sequence; the Concatenate layer is used for splicing the first local feature and the second local feature to obtain a spliced feature; the Transformer layer is used for extracting global features of the spliced feature; the Flatten layer is used for changing the dimension of the output of the Transformer layer (namely, flattening the output of the Transformer layer); the fully connected layers (i.e., the Dense layers) are used to predict whole genome nucleosome density.
The feature extraction network includes a feature extraction module ResNet and a feature extraction module CNNNet, the feature extraction module ResNet being configured to extract a first local feature of a first coding sequence and the feature extraction module CNNNet being configured to extract a second local feature of a second coding sequence.
According to the invention, the DNA sequence is encoded into two forms, so that the constructed deep learning model can learn more information from the DNA sequence, and the method can more efficiently and accurately identify the distribution of nucleosomes of the whole genome without time-consuming and labor-consuming biological experiments with high cost; the DeepNDP model of the invention can be used for detecting the distribution of nucleosomes in biological research, thereby helping researchers to deeply study various biological processes such as gene expression, DNA replication, repair and the like.
The present invention is described in detail below:
In step S1, the DNA sequences in the data set are divided into a training set, a verification set and a test set according to chromosome number. Taking Saccharomyces cerevisiae as an example, its genome comprises 16 chromosomes: chromosomes 1 to 13 are used as the training set, chromosomes 14 and 15 as the verification set, and chromosome 16 as the test set.
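The chromosome-level split described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name split_by_chromosome and the dictionary layout (chromosome number mapped to sequence) are assumptions introduced here, not part of the described method.

```python
def split_by_chromosome(chromosomes):
    """Split a yeast genome into training, verification and test sets.

    chromosomes: dict mapping chromosome number (1..16) to its DNA sequence.
    The layout of this dict is a hypothetical choice for the sketch.
    """
    # Chromosomes 1-13 form the training set.
    train = {n: s for n, s in chromosomes.items() if 1 <= n <= 13}
    # Chromosomes 14 and 15 form the verification set.
    valid = {n: s for n, s in chromosomes.items() if n in (14, 15)}
    # Chromosome 16 is held out as the test set.
    test = {n: s for n, s in chromosomes.items() if n == 16}
    return train, valid, test
```

Splitting by whole chromosome rather than by window keeps overlapping sliding windows from appearing in both the training and the test data.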
One-hot coding represents the four bases A, T, C and G in the DNA sequence and the unknown site N as the binary vectors (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1) and (0, 0, 0, 0), respectively.
Nucleotide chemical property coding (NCP coding) represents A, C, G, T and the unknown site N in the DNA sequence as (1, 1, 1), (0, 1, 0), (1, 0, 0), (0, 0, 1) and (0, 0, 0), respectively, according to three chemical properties of a base: ring structure, chemical function and hydrogen bond.
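The two codings can be illustrated with a short Python sketch. It assumes the usual 4-dimensional one-hot vectors (A, T, C, G in that order, N as all zeros) and the usual 3-dimensional NCP vectors (ring structure, chemical function, hydrogen bond); the names ONE_HOT, NCP, encode and encode_both are hypothetical and introduced only for this example.

```python
# One-hot table: A, T, C, G as unit vectors, unknown site N as all zeros.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "T": (0, 1, 0, 0),
    "C": (0, 0, 1, 0),
    "G": (0, 0, 0, 1),
    "N": (0, 0, 0, 0),
}

# NCP table: (ring structure, chemical function, hydrogen bond) per base.
NCP = {
    "A": (1, 1, 1),
    "C": (0, 1, 0),
    "G": (1, 0, 0),
    "T": (0, 0, 1),
    "N": (0, 0, 0),
}


def encode(sequence, table):
    """Map each base to its vector; symbols outside the table are treated as N."""
    unknown = table["N"]
    return [table.get(base, unknown) for base in sequence.upper()]


def encode_both(sequence):
    """Return the (one-hot, NCP) pair that forms the two model inputs."""
    return encode(sequence, ONE_HOT), encode(sequence, NCP)
```

For a sequence of length L this yields an L x 4 matrix and an L x 3 matrix, one for each input port of the model.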
In step S1, as shown in part A of FIG. 2 (left part of FIG. 2), the DeepNDP model contains two input ports (a One-hot coding input port and an NCP coding input port), two different feature extraction modules ResNet and CNNNet, a Transformer layer, a Flatten layer and two fully connected layers (Dense layers);
Further, the structure of the feature extraction module ResNet in this embodiment is shown in part B of FIG. 2 (middle part of FIG. 2). It is used for extracting local features in the data and includes a first CNN layer, three ResBlock (i.e. residual module) layers, a second CNN layer, a third CNN layer and a first Reshape layer, which are sequentially connected, where the first Reshape layer is used for changing the dimension of the output of the third CNN layer. Each ResBlock layer comprises a first-column CNN unit, used for extracting abstract features, and a second-column CNN unit, used for extracting detail features. The first-column CNN unit comprises a fourth CNN layer, a fifth CNN layer and a sixth CNN layer which are sequentially connected, with convolution kernel sizes of 5, 16 and 16, respectively; the second-column CNN unit comprises a seventh CNN layer, an eighth CNN layer and a ninth CNN layer which are sequentially connected, with convolution kernel sizes of 3, 8 and 8, respectively. The output (X) of the sixth CNN layer, the output (X) of the ninth CNN layer and the input (x_shortcut) of the current ResBlock layer are added (Add operation), and the sum is the output of the current ResBlock layer.
The feature extraction module CNNNet of this embodiment is shown in part C of FIG. 2 (right part of FIG. 2). It is also used to extract local features in the data and is composed of three sequentially stacked convolutional layers (i.e., the tenth, eleventh and twelfth CNN layers) and a second Reshape layer.
The Transformer layer is an architecture based on the self-attention mechanism that integrates a residual design and multi-head attention and is used for extracting global features of the data. It comprises two parts: a self-attention sub-layer and a feed-forward neural network sub-layer. The self-attention sub-layer computes the correlation between the representation vector at each position of the input sequence and those at the other positions, thereby capturing long-distance dependencies in the sequence. The feed-forward neural network sub-layer applies a nonlinear transformation to the output of the self-attention sub-layer to increase the expressive capacity of the model. Each sub-layer is followed by a residual connection and a layer normalization operation to improve the stability and convergence speed of the model. The fully connected layers (i.e., the Dense layers) are used to predict the output result.
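The core of the self-attention sub-layer, in which every position is compared with every other position, can be sketched as single-head scaled dot-product attention. This is a simplified illustration under stated assumptions: queries, keys and values are taken to be the input itself (no learned projections), and the multi-head split, residual connection, layer normalization and feed-forward sub-layer are omitted. The function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head scaled dot-product self-attention with identity projections.

    x: list of positions, each a list of d feature values.
    Returns a list of the same shape in which each position is a weighted
    average of all positions, weighted by similarity to that position.
    """
    d = len(x[0])
    out = []
    for q in x:
        # Similarity of this position to every position, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        # Weighted average of all positions' feature vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out
```

Because the weights for a position span the whole sequence, a feature at one end of the window can influence the representation at the other end in a single step, which is what lets this layer capture long-range dependencies that stacked local convolutions reach only slowly.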
It should be noted that the ResNet of this embodiment fuses multi-scale convolution with a residual network: convolution layers with different kernel sizes extract features at different scales. As shown for the ResBlock layer in part B of FIG. 2, the number of convolution kernels of every CNN layer is set to 16, which ensures that the three feature maps can be added afterwards. The channel whose kernel sizes are set to 5, 16 and 16 (the first-column CNN unit) extracts more abstract features from the sequence matrix, while the channel whose kernel sizes are set to 3, 8 and 8 (the second-column CNN unit) extracts more detailed features. The Add function then adds the outputs X of the two columns of CNN units to the input x_shortcut of the current ResBlock layer. This design extracts features of the sequence matrix at different scales and makes it easier for the neural network to learn the identity mapping, thereby mitigating information loss and gradient attenuation in the deep network.
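The two-column residual block can be sketched in plain Python to show why fixing the number of kernels at 16 makes the three-way addition possible. This is a minimal illustration, not the described implementation: 'same' zero padding, a 16-channel block input, random placeholder weights, and the helper names conv1d_same, random_weights and res_block are all assumptions introduced here.

```python
import random


def conv1d_same(x, weights):
    """1D convolution with 'same' zero padding (assumed), followed by ReLU.

    x: list of L positions, each a list of c_in values.
    weights: list of k taps, each a c_in x c_out matrix (nested lists).
    """
    k = len(weights)
    c_in = len(x[0])
    c_out = len(weights[0][0])
    left = (k - 1) // 2  # padding on the left side of the window
    out = []
    for i in range(len(x)):
        acc = [0.0] * c_out
        for t in range(k):
            j = i + t - left
            if 0 <= j < len(x):  # positions outside the sequence count as zero
                for ci in range(c_in):
                    xv = x[j][ci]
                    row = weights[t][ci]
                    for co in range(c_out):
                        acc[co] += xv * row[co]
        out.append([max(0.0, a) for a in acc])  # ReLU
    return out


def random_weights(k, c_in, c_out, rng):
    """Placeholder weights; a trained model would learn these."""
    return [[[rng.uniform(-0.1, 0.1) for _ in range(c_out)]
             for _ in range(c_in)] for _ in range(k)]


def res_block(x, rng, filters=16):
    """Add the outputs of the two CNN columns to the block input (shortcut)."""
    if len(x[0]) != filters:
        raise ValueError("shortcut needs the same channel count as the columns")
    a = x
    for k in (5, 16, 16):  # first-column kernel sizes
        a = conv1d_same(a, random_weights(k, len(a[0]), filters, rng))
    b = x
    for k in (3, 8, 8):  # second-column kernel sizes
        b = conv1d_same(b, random_weights(k, len(b[0]), filters, rng))
    return [[p + q + s for p, q, s in zip(ra, rb, rx)]
            for ra, rb, rx in zip(a, b, x)]
```

With 'same' padding both columns preserve the sequence length, and with 16 kernels per layer both produce 16 channels, so their outputs and the shortcut have identical shapes and can be summed element by element.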
It should be noted that, in the ResNet of this embodiment, each convolution layer (CNN layer) is followed by a ReLU activation function, which sets negative values in the convolution result to zero and leaves positive values unchanged. This alleviates the vanishing gradient problem, accelerates the convergence of gradient descent and improves computational efficiency.
Further, the parameters of the convolution layers of the feature extraction module CNNNet of this embodiment are set as follows: the tenth CNN layer has 64 convolution kernels of size 3, the eleventh CNN layer has 16 convolution kernels of size 8, and the twelfth CNN layer has 8 convolution kernels of size 80.
The parameters of the two fully connected layers of this embodiment are set as follows: the output sizes are 256 and 1, respectively.
The two inputs of the DeepNDP model, input1 and input2, are the DNA sequence represented in the two coding modes. In step S2, the two inputs are fed into ResNet and CNNNet respectively to extract local features, which are then horizontally spliced together by the Concatenate layer; the Transformer layer integrates the spliced local features and learns global features; the fully connected layers then output a number between 0 and 1 representing the nucleosome density at the central site of the input DNA sequence.
Further, in this embodiment the training set and the verification set are used for training the DeepNDP model, and the test set is used for verifying the performance of the DeepNDP model. A sliding window with a window size of 2001 bp and a step length of 1 bp is moved along the DNA sequence, and each 2001 bp subsequence is used as a model input sequence, so that the DNA sequence of the whole chromosome is read. The loss function measures the difference between the prediction and the actual data, and gradient updates are performed to update the parameters of the DeepNDP model.
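The sliding-window reading of a chromosome can be sketched as a Python generator. The window size and step come from the text; the function name sliding_windows and the choice to skip positions that lack a full window (rather than pad the chromosome ends) are assumptions made for this illustration.

```python
WINDOW = 2001  # window size in bp, from the text
STEP = 1       # step length in bp, from the text


def sliding_windows(sequence, window=WINDOW, step=STEP):
    """Yield (center_index, subsequence) for every full window on the sequence.

    The center index is the site whose nucleosome density the model predicts.
    Positions closer than window // 2 to either end get no window here
    (an assumption; padding the ends would be an alternative).
    """
    half = window // 2
    for start in range(0, len(sequence) - window + 1, step):
        yield start + half, sequence[start:start + window]
```

With an odd window of 2001 bp, each window has exactly 1000 bp of context on either side of the predicted site.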
The random dropout rate is set to 0.2 during training. The loss function is defined in terms of the mean absolute error (MAE) and the Pearson correlation coefficient (corr) between the model prediction ŷ and the true value y.
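A loss built from the two quantities named above, MAE and the Pearson correlation, can be sketched as follows. The specific combination MAE + (1 - corr) is an assumption made for illustration; it is chosen only because it decreases both as the error falls and as the correlation rises, and it should not be read as the exact formula of the described method. The function names are hypothetical.

```python
import math


def mae(y_pred, y_true):
    """Mean absolute error between predictions and true values."""
    return sum(abs(p - t) for p, t in zip(y_pred, y_true)) / len(y_true)


def pearson(y_pred, y_true):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(y_true)
    mean_p = sum(y_pred) / n
    mean_t = sum(y_true) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(y_pred, y_true))
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    denom = math.sqrt(var_p * var_t)
    return cov / denom if denom > 0 else 0.0


def mae_corr_loss(y_pred, y_true):
    """Assumed combination: MAE + (1 - corr). Zero for a perfect prediction."""
    return mae(y_pred, y_true) + (1.0 - pearson(y_pred, y_true))
```

The MAE term penalizes errors in the absolute density level, while the correlation term rewards matching the shape of the density profile along the sequence.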
Further, this embodiment encodes the test set to obtain feature sequences in two forms, One-hot coding and nucleotide chemical property coding. The two feature sequences are input into the two feature extraction modules to extract local features, which are then horizontally spliced; the Transformer layer integrates the spliced local features and learns the global features of the sequence; the two fully connected layers then extract the nonlinear relations among the global features and output the prediction result;
wherein softmax is used as the activation function in the fully connected layer to map the output to the final predicted density, which represents the nucleosome density at the central site of the input sequence.
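The final fully connected layer has an output size of 1. A softmax taken over a single unit always returns 1, so the short sketch below contrasts it with an element-wise sigmoid, which does map one value into the interval (0, 1). The use of a sigmoid here is an assumption for illustration and is not stated in the text.

```python
import math


def softmax(logits):
    """Softmax over a list of logits (normalizes across the list)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Element-wise logistic function; maps any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))
```

Softmax normalizes across units, so with a single output unit the result carries no information about the input, whereas the sigmoid varies with the input and stays between 0 and 1.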
The nucleosome density predicted by the method of this example was compared with the results of the biological experimental method; the results are shown in panels A, B and C of FIG. 3. Panel A is a scatter plot of the DeepNDP model predictions against the biological experiment results, giving a quantitative comparison of the two: the X axis is the computational result and the Y axis is the biological experiment result. The closer the black area lies to the line y = x, the stronger the positive correlation between the two signals, and otherwise a negative correlation is indicated; the darker the black area, the more concentrated the data. The plot shows that the distribution of the DeepNDP predictions is consistent with the true values from the biological experiment. Panels B and C show the predicted nucleosome density and the nucleosome density of the biological sample, respectively, as they vary along the DNA sequence. The high- and low-density regions of the prediction agree with the biological experiment results, which means that the DeepNDP model can accurately identify nucleosome-dense regions and nucleosome-depleted regions on the DNA sequence.
The predictions of the method of this example were compared with those of existing methods on the same data set. FIG. 4 shows, from left to right, the Pearson correlation coefficients obtained on the sixteenth chromosome of the Saccharomyces cerevisiae genome by DeepNDP, Routhier et al., DLNN and LeNup. The correlation coefficient between the nucleosome density predicted by DeepNDP and the MNase-seq signal reached 0.723, and the Pearson correlation coefficient obtained by the method of Routhier et al. was 0.68, whereas models designed for nucleosome positioning, DLNN and LeNup, reached only 0.43 and 0.40, respectively, when used to predict whole genome nucleosome density. Therefore, the DeepNDP model of this embodiment is superior to the previous models not only in training results but also in the correlation of its prediction results.
It should be noted that, in this example, the DNA sequence of the whole genome chromosome is subjected to both One-hot coding and NCP coding rather than One-hot coding alone, because the combination of the two gives better results. Specifically, again taking Saccharomyces cerevisiae as an example, the DNA sequence was encoded in One-hot form only in order to verify the effectiveness of nucleotide chemical property (NCP) coding in nucleosome density prediction: when the DeepNDP model uses One-hot coding alone, the Pearson correlation of its predictions is 0.703, compared with 0.723 when both codings are used. FIG. 5 shows distribution curves of nucleosome density on a fragment of Saccharomyces cerevisiae chr16, in which the solid line represents the prediction with nucleotide chemical property coding, the dotted line represents the prediction without it, and the dashed line represents the biological experiment result. After nucleotide chemical property coding is added, the prediction on the chr16 fragment fits the biological experiment result better, and the performance of the DeepNDP model is improved to a certain extent.
In addition to Saccharomyces cerevisiae, this example also uses DNA sequences within 2000 bp upstream and downstream of mouse transcription start sites as a test set input to the DeepNDP model, to verify the effectiveness of the model for cross-species recognition. The result is shown in FIG. 6: the ordinates of the upper and lower graphs are, respectively, the nucleosome density predicted by the DeepNDP model and the chemically derived NCP_score (nucleosome centering score), which represents the signal intensity at the nucleosome center site. As can be seen from FIG. 6, the mouse nucleosome density predicted by the DeepNDP model has a periodicity similar to that of the NCP_score obtained by the chemical method, which indicates that the DeepNDP model has good cross-species generalization ability.
Example II
This embodiment provides a whole genome nucleosome density prediction system, comprising:
an encoding and construction module, configured to acquire the DNA sequence of a whole genome chromosome and perform first encoding and second encoding on it, respectively, to obtain a first coding sequence and a second coding sequence;
and also configured to construct and train a DeepNDP model to obtain a trained DeepNDP model;
a prediction module, configured to input the first coding sequence and the second coding sequence into the trained DeepNDP model for prediction to obtain a whole genome nucleosome density result;
wherein the DeepNDP model comprises a feature extraction network, a Concatenate layer, a Transformer layer, a Flatten layer and two fully connected layers which are sequentially connected;
the feature extraction network is used for extracting a first local feature of the first coding sequence and extracting a second local feature of the second coding sequence; the Concatenate layer is used for splicing the first local feature and the second local feature to obtain a spliced feature; the Transformer layer is used for extracting global features of the spliced feature; the Flatten layer is used for changing the dimension of the output of the Transformer layer; the fully connected layers are used to predict the whole genome nucleosome density.
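The layer order described above can be made concrete with a small shape-bookkeeping sketch. It does not implement any layer; it only tracks how the tensor shape would change from the two encoded inputs to the final output. All sizes (`window`, `res_channels`, `cnn_channels`, `hidden_units`, `out_units`) are hypothetical, and the sketch assumes that each branch keeps one feature vector per position, that splicing happens along the channel axis, and that the second encoding has three channels per position.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shape:
    steps: int     # positions along the sequence axis
    channels: int  # features per position


def feature_extraction(encoded: Shape, out_channels: int) -> Shape:
    # Assumption: the branch keeps one feature vector per input position.
    return Shape(encoded.steps, out_channels)


def concatenate(first: Shape, second: Shape) -> Shape:
    # Assumption: the local features are spliced along the channel axis.
    if first.steps != second.steps:
        raise ValueError("branches must cover the same number of positions")
    return Shape(first.steps, first.channels + second.channels)


def transformer(spliced: Shape) -> Shape:
    # Assumption: self-attention mixes positions and leaves the shape unchanged.
    return spliced


def flatten(features: Shape) -> int:
    # Collapse (steps, channels) into a single feature vector.
    return features.steps * features.channels


def deepndp_shapes(window, res_channels, cnn_channels, hidden_units, out_units):
    first_local = feature_extraction(Shape(window, 4), res_channels)   # first coding sequence
    second_local = feature_extraction(Shape(window, 3), cnn_channels)  # second coding sequence
    spliced = concatenate(first_local, second_local)
    global_features = transformer(spliced)
    return {
        "spliced": (spliced.steps, spliced.channels),
        "transformer": (global_features.steps, global_features.channels),
        "flatten": flatten(global_features),
        "dense_1": hidden_units,  # first fully connected layer
        "dense_2": out_units,     # second fully connected layer (prediction)
    }


shapes = deepndp_shapes(window=200, res_channels=16, cnn_channels=8,
                        hidden_units=64, out_units=1)
```

The Flatten step is what allows the fully connected layers to consume the Transformer output: the two-dimensional feature map is turned into one vector whose length is the product of its two dimensions.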
Example III
The present embodiment provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the steps of the whole genome nucleosome density prediction method according to the first embodiment when executing the computer program.
Example IV
The present embodiment provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the whole genome nucleosome density prediction method of the first embodiment.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The solutions in the embodiments of the application can be implemented in various computer languages, such as the object-oriented programming language Java and the interpreted scripting language JavaScript.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It is apparent that the above examples are given by way of illustration only and do not limit the embodiments. Other variations and modifications of the present invention will be apparent to those of ordinary skill in the art in light of the foregoing description. It is neither necessary nor possible to enumerate all embodiments here. Obvious variations or modifications derived therefrom still fall within the scope of the present invention.

Claims (10)

1. A whole genome nucleosome density prediction method, characterized by comprising the following steps:
Step S1: acquiring the DNA sequence of a whole genome chromosome, and performing first encoding and second encoding on it, respectively, to obtain a first coding sequence and a second coding sequence;
simultaneously constructing and training a DeepNDP model to obtain a trained DeepNDP model;
Step S2: inputting the first coding sequence and the second coding sequence into the trained DeepNDP model for prediction to obtain a whole genome nucleosome density result;
the DeepNDP model comprises a feature extraction network, a Concatenate layer, a Transformer layer, a Flatten layer and two fully connected layers which are sequentially connected;
the feature extraction network is used for extracting a first local feature of the first coding sequence and extracting a second local feature of the second coding sequence; the Concatenate layer is used for splicing the first local feature and the second local feature to obtain a spliced feature; the Transformer layer is used for extracting global features of the spliced feature; the Flatten layer is used for changing the dimension of the output of the Transformer layer; the fully connected layers are used to predict the whole genome nucleosome density.
2. The whole genome nucleosome density prediction method according to claim 1, wherein: the feature extraction network in step S2 includes a feature extraction module ResNet and a feature extraction module CNNNet, where the feature extraction module ResNet is configured to extract a first local feature of a first coding sequence and the feature extraction module CNNNet is configured to extract a second local feature of a second coding sequence.
3. The whole genome nucleosome density prediction method according to claim 2, characterized in that: the feature extraction module ResNet includes a first CNN layer, three ResBlock layers, a second CNN layer, a third CNN layer, and a first Reshape layer, which are sequentially connected, where the first Reshape layer is used to change the dimension of the output of the third CNN layer.
4. The whole genome nucleosome density prediction method according to claim 3, wherein: each ResBlock layer comprises a first column of CNN units and a second column of CNN units;
the first column of CNN units comprises a fourth CNN layer, a fifth CNN layer and a sixth CNN layer which are sequentially connected, wherein the convolution kernels adopted by the fourth CNN layer, the fifth CNN layer and the sixth CNN layer are 5, 16 and 16 in sequence;
the second column of CNN units comprises a seventh CNN layer, an eighth CNN layer and a ninth CNN layer which are sequentially connected, wherein the convolution kernels adopted by the seventh CNN layer, the eighth CNN layer and the ninth CNN layer are 3, 8 and 8 in sequence;
the output of the sixth CNN layer, the output of the ninth CNN layer and the input of the current ResBlock layer are added together.
5. The whole genome nucleosome density prediction method according to claim 4, wherein: every CNN layer in the feature extraction module ResNet is followed by a ReLU activation function.
6. The whole genome nucleosome density prediction method according to claim 2, characterized in that: the feature extraction module CNNNet includes a tenth CNN layer, an eleventh CNN layer, a twelfth CNN layer, and a second Reshape layer that are sequentially connected, where the second Reshape layer is used to change the dimension of the output of the twelfth CNN layer.
7. The whole genome nucleosome density prediction method according to claim 1, wherein: in step S1, acquiring the DNA sequence of the whole genome chromosome and performing the first encoding and the second encoding, respectively, to obtain the first coding sequence and the second coding sequence comprises:
acquiring the DNA sequence of the whole genome chromosome;
performing One-hot encoding on the DNA sequence of the whole genome chromosome to obtain a One-hot coding sequence, and simultaneously performing nucleotide encoding on the DNA sequence of the whole genome chromosome to obtain a nucleotide coding sequence, wherein the One-hot coding sequence is the first coding sequence, and the nucleotide coding sequence is the second coding sequence.
8. A whole genome nucleosome density prediction system, characterized by comprising:
an encoding and construction module, configured to acquire the DNA sequence of a whole genome chromosome and perform first encoding and second encoding on it, respectively, to obtain a first coding sequence and a second coding sequence;
and also configured to construct and train a DeepNDP model to obtain a trained DeepNDP model;
a prediction module, configured to input the first coding sequence and the second coding sequence into the trained DeepNDP model for prediction to obtain a whole genome nucleosome density result;
the DeepNDP model comprises a feature extraction network, a Concatenate layer, a Transformer layer, a Flatten layer and two fully connected layers which are sequentially connected;
the feature extraction network is used for extracting a first local feature of the first coding sequence and extracting a second local feature of the second coding sequence; the Concatenate layer is used for splicing the first local feature and the second local feature to obtain a spliced feature; the Transformer layer is used for extracting global features of the spliced feature; the Flatten layer is used for changing the dimension of the output of the Transformer layer; the fully connected layers are used to predict the whole genome nucleosome density.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized by: the processor, when executing the computer program, implements the steps of the whole genome nucleosome density prediction method according to any one of claims 1 to 7.
10. A computer-readable storage medium having stored thereon a computer program, characterized by: the computer program, when executed by a processor, performs the steps of the whole genome nucleosome density prediction method according to any one of claims 1 to 7.
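For illustration only, the two-column residual structure recited in claims 4 and 5 can be sketched on a single-channel signal: each column applies three convolutions, each followed by ReLU, and the outputs of both columns are added to the block input. This is a toy sketch under stated assumptions, not the claimed implementation: the helper names are hypothetical, zero padding is assumed so that lengths match for the addition, and the small kernels used in the example do not reproduce the values 5, 16, 16 and 3, 8, 8 given in claim 4.

```python
def relu(values):
    """ReLU activation applied element by element."""
    return [max(0.0, v) for v in values]


def conv1d_same(signal, kernel):
    """Single-channel 1D convolution (cross-correlation form) with zero
    padding, so the output has the same length as the input."""
    left = (len(kernel) - 1) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, weight in enumerate(kernel):
            idx = i + j - left
            if 0 <= idx < len(signal):
                acc += weight * signal[idx]
        out.append(acc)
    return out


def cnn_column(signal, kernels):
    """One column of the block: convolutions in sequence, each followed by ReLU."""
    out = signal
    for kernel in kernels:
        out = relu(conv1d_same(out, kernel))
    return out


def res_block(signal, first_column_kernels, second_column_kernels):
    """Add the outputs of both columns and the block input."""
    first = cnn_column(signal, first_column_kernels)
    second = cnn_column(signal, second_column_kernels)
    return [a + b + x for a, b, x in zip(first, second, signal)]


# Hypothetical toy kernels: identity kernels pass the signal through unchanged.
identity = [0.0, 1.0, 0.0]
block_output = res_block([1.0, 2.0, 3.0], [identity] * 3, [identity] * 3)
```

Adding the block input to the column outputs gives the block a direct path from input to output, which is the usual motivation for residual connections in deep convolutional networks.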
CN202310415049.2A 2023-04-18 2023-04-18 Whole genome nucleosome density prediction method, whole genome nucleosome density prediction system and electronic equipment Active CN116612816B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310415049.2A CN116612816B (en) 2023-04-18 2023-04-18 Whole genome nucleosome density prediction method, whole genome nucleosome density prediction system and electronic equipment

Publications (2)

Publication Number Publication Date
CN116612816A (en) 2023-08-18
CN116612816B (en) 2024-06-21

Family

ID=87673676

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310415049.2A Active CN116612816B (en) 2023-04-18 2023-04-18 Whole genome nucleosome density prediction method, whole genome nucleosome density prediction system and electronic equipment

Country Status (1)

Country Link
CN (1) CN116612816B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant