CN116612816A - Whole genome nucleosome density prediction method, whole genome nucleosome density prediction system and electronic equipment - Google Patents

Info

Publication number
CN116612816A
Authority
CN
China
Prior art keywords
layer
whole genome
cnn
coding sequence
coding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310415049.2A
Other languages
Chinese (zh)
Inventor
吴庭芳
周昳婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University
Priority to CN202310415049.2A
Publication of CN116612816A
Legal status: Pending

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The application relates to a whole genome nucleosome density prediction method, a whole genome nucleosome density prediction system and electronic equipment. The method comprises the following steps: acquiring DNA sequences of whole genome chromosomes and applying a first coding and a second coding to them to obtain a first coding sequence and a second coding sequence; constructing and training a DeepNDP model to obtain a trained DeepNDP model; and inputting the first coding sequence and the second coding sequence into the trained DeepNDP model to obtain a whole genome nucleosome density result. The DeepNDP model comprises a feature extraction network, a Concatenate layer, a Transformer layer, a Flatten layer and two fully connected layers which are sequentially connected. By encoding the DNA sequence in two forms, the application improves the generalization capability of the model and can identify the distribution of nucleosomes more efficiently and accurately without costly biological experiments.

Description

Whole genome nucleosome density prediction method, whole genome nucleosome density prediction system and electronic equipment
Technical Field
The application relates to the technical field of bioinformatics, in particular to a whole genome nucleosome density prediction method, a whole genome nucleosome density prediction system and electronic equipment.
Background
Nucleosome density prediction refers to the use of computational methods to predict the nucleosome signal intensity at each base site, resulting in a continuous nucleosome density across the genome. Nucleosomes are key participants in genetic processes as the basic units of chromatin, whose precise locations can regulate genomic accessibility to DNA binding proteins, thereby effecting regulation of gene expression, DNA replication and repair. Thus, identifying the location of nucleosomes on the genome may help one to study various biological processes in depth.
In past studies, many DNA sequence-based calculation methods have been proposed to determine nucleosome position in DNA sequences, for example:
(1) iNuc-PseKNC: a method for locating nucleosomes. A DNA sequence with the length of 147bp is input, a characteristic vector consisting of pseudo k-tuple nucleotides with 6 local DNA structural characteristics is extracted, and then the characteristics are input into an SVM classifier to predict whether the sequence is a nucleosome sequence.
(2) DLNN: a method for locating nucleosomes. A DNA sequence with a length of 147 bp is input and encoded in one-hot form; the sequence is then modeled and analyzed with a convolutional network and a recurrent network to predict whether it is a nucleosome sequence.
(3) Routhier et al: a method for predicting the density of nucleosomes. DNA sequences on the whole chromosome were obtained in the form of sliding windows, and the nucleosome density at the central site of the input sequence was predicted using three sequentially stacked convolution layers.
In the prior art, the nucleosome positioning method can only capture the context information within 147bp, cannot learn the long-range interaction relation between bases, and cannot rapidly predict and analyze the whole chromosome sequence.
The deep-learning-based nucleosome density prediction method proposed by Routhier et al. has low recognition accuracy, and its prediction performance still has room for improvement.
Disclosure of Invention
Therefore, the technical problem to be solved by the application is the low recognition accuracy of nucleosome density prediction methods in the prior art.
In order to solve the technical problems, the application provides a whole genome nucleosome density prediction method, which comprises the following steps:
step S1: acquiring DNA sequences of whole genome chromosomes, and respectively performing first coding and second coding to obtain a first coding sequence and a second coding sequence;
meanwhile, constructing and training a DeepNDP model to obtain a trained DeepNDP model;
step S2: inputting the first coding sequence and the second coding sequence into the trained DeepNDP model for prediction to obtain a whole genome nucleosome density result;
the DeepNDP model comprises a feature extraction network, a Concatenate layer, a Transformer layer, a Flatten layer and two fully connected layers which are connected in sequence;
the feature extraction network is used for extracting a first local feature of the first coding sequence and a second local feature of the second coding sequence; the Concatenate layer is used for concatenating the first local feature and the second local feature to obtain a concatenated feature; the Transformer layer is used for extracting global features of the concatenated feature; the Flatten layer is used for changing the dimension of the output of the Transformer layer; the fully connected layers are used to predict whole genome nucleosome density.
In one embodiment of the present application, the feature extraction network in step S2 includes a feature extraction module ResNet for extracting a first local feature of the first coding sequence and a feature extraction module CNNNet for extracting a second local feature of the second coding sequence.
In one embodiment of the present application, the feature extraction module ResNet includes a first CNN layer, three ResBlock layers, a second CNN layer, a third CNN layer and a first Reshape layer, which are sequentially connected, where the first Reshape layer is used to change the dimension of the output of the third CNN layer.
In one embodiment of the present application, the ResBlock layer includes a first-column CNN unit and a second-column CNN unit;
the first-column CNN unit comprises a fourth CNN layer, a fifth CNN layer and a sixth CNN layer which are sequentially connected, with convolution kernel sizes of 5, 16 and 16 respectively;
the second-column CNN unit comprises a seventh CNN layer, an eighth CNN layer and a ninth CNN layer which are sequentially connected, with convolution kernel sizes of 3, 8 and 8 respectively;
the output of the sixth CNN layer, the output of the ninth CNN layer and the input of the current ResBlock layer are added together.
In one embodiment of the application, every CNN layer in the feature extraction module ResNet is followed by a ReLU activation function.
In an embodiment of the present application, the feature extraction module CNNNet includes a tenth CNN layer, an eleventh CNN layer, a twelfth CNN layer, and a second Reshape layer, which are sequentially connected, and the second Reshape layer is used to change a dimension of an output of the twelfth CNN layer.
In one embodiment of the present application, the method for obtaining the DNA sequence of the whole genome chromosome in step S1 and performing the first encoding and the second encoding respectively to obtain the first encoding sequence and the second encoding sequence includes:
obtaining a DNA sequence of a whole genome chromosome;
and carrying out One-hot coding on the DNA sequence of the whole genome chromosome to obtain a One-hot coding sequence, and simultaneously carrying out nucleotide chemical property coding on the DNA sequence of the whole genome chromosome to obtain a nucleotide coding sequence, wherein the One-hot coding sequence is the first coding sequence and the nucleotide coding sequence is the second coding sequence.
In order to solve the technical problems, the application provides a whole genome nucleosome density prediction system, which comprises:
an encoding and construction module: used for obtaining a DNA sequence of a whole genome chromosome and applying a first coding and a second coding to it to obtain a first coding sequence and a second coding sequence;
and also used for constructing and training a DeepNDP model to obtain a trained DeepNDP model;
a prediction module: used for inputting the first coding sequence and the second coding sequence into the trained DeepNDP model for prediction to obtain a whole genome nucleosome density result;
the DeepNDP model comprises a feature extraction network, a Concatenate layer, a Transformer layer, a Flatten layer and two fully connected layers which are connected in sequence;
the feature extraction network is used for extracting a first local feature of the first coding sequence and a second local feature of the second coding sequence; the Concatenate layer is used for concatenating the first local feature and the second local feature to obtain a concatenated feature; the Transformer layer is used for extracting global features of the concatenated feature; the Flatten layer is used for changing the dimension of the output of the Transformer layer; the fully connected layers are used to predict whole genome nucleosome density.
In order to solve the technical problems, the application provides electronic equipment, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the steps of the whole genome nucleosome density prediction method when executing the computer program.
To solve the above technical problem, the present application provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the whole genome nucleosome density prediction method as described above.
Compared with the prior art, the technical scheme of the application has the following advantages:
according to the application, the DNA sequence is encoded in two forms, so that the constructed deep learning model can learn more information from the DNA sequence, and the method can identify the distribution of nucleosomes across the whole genome more efficiently and accurately without time-consuming, labor-intensive and costly biological experiments;
the DeepNDP model provided by the application can be used across different species and has strong generalization capability, which avoids the complexity of maintaining separate models for different species;
the DeepNDP model can be used for detecting the distribution of nucleosomes in biological research, thereby helping researchers to study in depth various biological processes such as gene expression, DNA replication and repair.
Drawings
In order that the application may be more readily understood, a more particular description of the application will be rendered by reference to specific embodiments thereof that are illustrated in the appended drawings.
FIG. 1 is a flow chart of the method of the present application;
FIG. 2 is a schematic diagram of the DeepNDP model of the present application;
FIG. 3 is a diagram comparing the DeepNDP model with the chemical method, using Saccharomyces cerevisiae as an example, in an embodiment of the present application;
FIG. 4 is a diagram comparing the DeepNDP model with existing models on Saccharomyces cerevisiae in an embodiment of the present application;
FIG. 5 is a schematic diagram of the effect of removing the NCP coding from the input of the DeepNDP model in an embodiment of the present application;
FIG. 6 is a graph comparing the performance of the DeepNDP model with the chemical method, using mice as an example, in an embodiment of the present application.
Detailed Description
The present application will be further described with reference to the accompanying drawings and specific examples, which are not intended to be limiting, so that those skilled in the art will better understand the application and practice it.
Example 1
Referring to FIG. 1, the whole genome nucleosome density prediction method of the present application comprises:
step S1: obtaining DNA sequences of whole genome chromosomes and respectively performing first coding (One-hot coding) and second coding (NCP coding) to obtain a first coding sequence and a second coding sequence;
meanwhile, constructing and training a DeepNDP model to obtain a trained DeepNDP model;
step S2: inputting the first coding sequence and the second coding sequence into the trained DeepNDP model for prediction to obtain a whole genome nucleosome density result;
the DeepNDP model comprises a feature extraction network, a Concatenate layer, a Transformer layer, a Flatten layer and two fully connected layers (namely Dense layers) which are connected in sequence;
the feature extraction network is used for extracting a first local feature of the first coding sequence and a second local feature of the second coding sequence; the Concatenate layer is used for concatenating the first local feature and the second local feature to obtain a concatenated feature; the Transformer layer is used for extracting global features of the concatenated feature; the Flatten layer is used for changing the dimension of the output of the Transformer layer (namely, flattening the output of the Transformer layer); the fully connected layers (i.e., the Dense layers) are used to predict whole genome nucleosome density.
The feature extraction network comprises a feature extraction module ResNet and a feature extraction module CNNNet, wherein the feature extraction module ResNet is used for extracting first local features of a first coding sequence, and the feature extraction module CNNNet is used for extracting second local features of a second coding sequence.
According to the application, the DNA sequence is encoded in two forms, so that the constructed deep learning model can learn more information from the DNA sequence, and the method can identify the distribution of nucleosomes across the whole genome more efficiently and accurately without time-consuming, labor-intensive and costly biological experiments; the DeepNDP model can be used for detecting the distribution of nucleosomes in biological research, thereby helping researchers to study in depth various biological processes such as gene expression, DNA replication and repair.
The present application is described in detail below:
in step S1, the DNA sequences in the data set are divided into a training set, a validation set and a test set according to chromosome number; specifically, taking Saccharomyces cerevisiae as an example, its genome comprises 16 chromosomes, of which chromosomes 1 to 13 are used as the training set, chromosomes 14 and 15 as the validation set, and chromosome 16 as the test set;
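By way of illustration only, the chromosome-level split of the Saccharomyces cerevisiae example can be sketched as follows. The function name `split_by_chromosome` and the toy sequences are hypothetical and are not part of the application; only the chromosome numbers come from the text.

```python
# Chromosome-level split from the yeast example:
# chromosomes 1-13 -> training, 14-15 -> validation, 16 -> test.
SPLIT = {
    "train": range(1, 14),
    "validation": range(14, 16),
    "test": range(16, 17),
}


def split_by_chromosome(sequences):
    """Group a {chromosome_number: sequence} dict into the three subsets."""
    return {
        name: {c: sequences[c] for c in chroms if c in sequences}
        for name, chroms in SPLIT.items()
    }


# Toy stand-in data (hypothetical): one short fake sequence per chromosome.
toy_genome = {c: "ACGT" * c for c in range(1, 17)}
parts = split_by_chromosome(toy_genome)
print({name: sorted(subset) for name, subset in parts.items()})
```

Splitting by whole chromosomes, rather than by random windows, keeps overlapping sliding windows from the same region out of both the training and the test data.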
One-hot coding represents the four bases A, T, C and G in the DNA sequence and the unknown site N as the binary vectors (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1) and (0, 0, 0, 0), respectively;
nucleotide chemical property coding (NCP coding) represents A, C, G, T and the unknown site N in the DNA sequence as (1, 1, 1), (0, 1, 0), (1, 0, 0), (0, 0, 1) and (0, 0, 0), respectively, according to three chemical properties of the base: ring structure, chemical functional group and hydrogen bonding.
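The two codings can be sketched as simple lookup tables. This is a minimal illustration, not the applicant's implementation: the vector assignments below follow the commonly used one-hot and NCP conventions and are an assumption that should be checked against the original filing, and the helper names `encode` and `encode_both` are hypothetical.

```python
# ASSUMED lookup tables (common conventions; verify against the original filing).
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "T": (0, 1, 0, 0),
    "C": (0, 0, 1, 0),
    "G": (0, 0, 0, 1),
    "N": (0, 0, 0, 0),
}
# NCP order: (ring structure, functional group, hydrogen bonding).
NCP = {
    "A": (1, 1, 1),
    "C": (0, 1, 0),
    "G": (1, 0, 0),
    "T": (0, 0, 1),
    "N": (0, 0, 0),
}


def encode(seq, table):
    """Map each base to its vector; any unrecognised symbol is treated as N."""
    unknown = table["N"]
    return [table.get(base, unknown) for base in seq.upper()]


def encode_both(seq):
    """Return (first coding sequence, second coding sequence) for one DNA string."""
    return encode(seq, ONE_HOT), encode(seq, NCP)


first, second = encode_both("ACGTN")
print(first)
print(second)
```

For a window of length L the first coding is an L x 4 matrix and the second an L x 3 matrix, which is why the model has two separate input ports.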
As shown in part A of FIG. 2 (left part of FIG. 2), the DeepNDP model constructed in step S1 contains two input ports (a One-hot coding input port and an NCP coding input port), two different feature extraction modules ResNet and CNNNet, a Transformer layer, a Flatten layer and two fully connected layers (Dense layers);
Further, the structure of the feature extraction module ResNet of this embodiment is shown in part B of FIG. 2 (middle part of FIG. 2). It is used to extract local features in the data and includes a first CNN layer, three ResBlock (i.e. residual module) layers, a second CNN layer, a third CNN layer and a first Reshape layer, which are sequentially connected, where the first Reshape layer is used to change the dimension of the output of the third CNN layer. The ResBlock layer comprises a first-column CNN unit, used for extracting abstract features, and a second-column CNN unit, used for extracting detail features. The first-column CNN unit comprises a fourth CNN layer, a fifth CNN layer and a sixth CNN layer which are sequentially connected, with convolution kernel sizes of 5, 16 and 16 respectively; the second-column CNN unit comprises a seventh CNN layer, an eighth CNN layer and a ninth CNN layer which are sequentially connected, with convolution kernel sizes of 3, 8 and 8 respectively. An addition (Add) operation is performed on the output (X) of the sixth CNN layer, the output (X) of the ninth CNN layer and the input (X_shortcut) of the current ResBlock layer, and the result is taken as the output of the current ResBlock layer.
The feature extraction module CNNNet of this embodiment is shown in part C of FIG. 2 (right part of FIG. 2). It is also used to extract local features in the data and is composed of three sequentially stacked convolution layers (i.e., the tenth CNN layer, the eleventh CNN layer and the twelfth CNN layer) and a second Reshape layer.
The Transformer layer is an architecture based on the self-attention mechanism that integrates residual design and multi-head attention and is used for extracting global features of the data. It comprises two parts: a self-attention sublayer and a feed-forward neural network sublayer. The self-attention sublayer computes the correlation between the representation vector of each position in the input sequence and those of the other positions, thereby capturing long-distance dependencies in the sequence. The feed-forward neural network sublayer applies a nonlinear transformation to the output of the self-attention sublayer to increase the expressive capacity of the model. Each sublayer is followed by a residual connection and a layer normalization operation to improve the stability and convergence speed of the model. The fully connected layers (i.e., the Dense layers) are used to predict the output result.
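The self-attention sublayer and the residual connection with layer normalization described above can be illustrated with a deliberately simplified sketch. It uses a single head and sets the queries, keys and values equal to the input (no learned projection matrices, no learned scale or shift in the normalization); these simplifications, and the function names, are assumptions made for brevity and do not describe the actual Transformer layer of the application.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head scaled dot-product attention with Q = K = V = x.

    x is a list of positions, each a list of d feature values.
    Returns (attended features, attention weights).
    """
    d = len(x[0])
    scores = [
        [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        for q in x
    ]
    weights = [softmax(row) for row in scores]
    out = [
        [sum(w * v[j] for w, v in zip(w_row, x)) for j in range(d)]
        for w_row in weights
    ]
    return out, weights


def add_and_norm(x, sub):
    """Residual connection followed by layer normalisation per position."""
    result = []
    for xi, si in zip(x, sub):
        r = [a + b for a, b in zip(xi, si)]
        mean = sum(r) / len(r)
        var = sum((v - mean) ** 2 for v in r) / len(r)
        result.append([(v - mean) / math.sqrt(var + 1e-6) for v in r])
    return result


# Toy input (hypothetical): 4 sequence positions with 3 features each.
features = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [1.0, 1.0, 0.0], [0.2, 0.1, 0.9]]
attended, weights = self_attention(features)
normed = add_and_norm(features, attended)
print(len(normed), len(normed[0]))
```

Because every position attends to every other position, the cost grows quadratically with the number of positions, which is one reason local features are first condensed by the convolutional modules.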
It should be noted that the ResNet of this embodiment fuses multi-scale convolution with a residual network: convolution layers with different kernel sizes can extract features at different scales. As shown in the ResBlock layer in part B of FIG. 2, the number of convolution kernels of every CNN layer is set to 16 to ensure that the three features can subsequently be added. The channel in the first-column CNN unit, whose kernel sizes are set to 5, 16 and 16, can extract more abstract features from the sequence matrix, while the channel in the second-column CNN unit, whose kernel sizes are set to 3, 8 and 8, can extract more detailed features from the sequence matrix. The Add function is then used to add the outputs X of the two columns of CNN units to the input X_shortcut of the current ResBlock layer. This design extracts features of the sequence matrix at different scales and makes it easier for the neural network to learn the identity mapping, which helps avoid information loss and gradient attenuation in deep networks.
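The multi-scale residual idea can be sketched on a toy one-dimensional signal. The sketch below is an illustration under stated assumptions only: it uses a single channel instead of 16 kernels per layer, fixed averaging kernels in place of learned weights, and zero "same" padding so that the three tensors have equal length and can be added; none of these details is confirmed by the application.

```python
def conv1d_same(x, kernel):
    """1-D cross-correlation with zero padding so that len(output) == len(x)."""
    k = len(kernel)
    left, right = (k - 1) // 2, k // 2
    padded = [0.0] * left + list(x) + [0.0] * right
    return [
        sum(padded[i + j] * kernel[j] for j in range(k))
        for i in range(len(x))
    ]


def relu(x):
    """Set negative values to zero and keep positive values unchanged."""
    return [max(0.0, v) for v in x]


def branch(x, kernel_sizes):
    """Stack of conv + ReLU layers; averaging kernels stand in for learned weights."""
    for k in kernel_sizes:
        x = relu(conv1d_same(x, [1.0 / k] * k))
    return x


def res_block(x):
    """Add the wide branch, the narrow branch and the shortcut input."""
    wide = branch(x, (5, 16, 16))    # first-column unit: kernel sizes 5, 16, 16
    narrow = branch(x, (3, 8, 8))    # second-column unit: kernel sizes 3, 8, 8
    return [a + b + s for a, b, s in zip(wide, narrow, x)]


signal = [float(i % 7) for i in range(40)]
out = res_block(signal)
print(len(signal), len(out))
```

The element-wise addition only works because both branches preserve the sequence length and channel count of the input, which is the role the text assigns to using the same number of kernels in every layer.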
It should be noted that, in this embodiment, a ReLU activation function follows each convolution layer (CNN layer) in the ResNet. ReLU sets the negative values in the convolution result to zero and leaves positive values unchanged, which alleviates the vanishing gradient problem, accelerates the convergence of gradient descent and improves computational efficiency.
Further, the parameters of the convolution layers of the feature extraction module CNNNet of this embodiment are set as follows: the tenth CNN layer has 64 convolution kernels of size 3, the eleventh CNN layer has 16 convolution kernels of size 8, and the twelfth CNN layer has 8 convolution kernels of size 80.
The two fully connected layers of this embodiment have output sizes of 256 and 1, respectively.
The two inputs of the DeepNDP model, input1 and input2, are the DNA sequence expressed in the two coding forms; in step S2, the two inputs are fed into ResNet and CNNNet respectively to extract local features, which are then horizontally concatenated by the Concatenate layer; the Transformer layer integrates the concatenated local features and learns global features; the fully connected layers then output a number between 0 and 1 representing the nucleosome density at the central site of the input DNA sequence.
Further, in this embodiment the DeepNDP model is trained using the training set and the validation set, and its performance is verified using the test set; a sliding window with a size of 2001 bp and a step of 1 bp is moved along the DNA sequence, and each 2001 bp window is used as a model input sequence, so that the DNA sequence of the whole chromosome is read; the loss function measures the difference between the prediction and the actual data, and its gradient is used to update the parameters of the DeepNDP model;
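The sliding-window reading of a chromosome can be sketched as a generator that yields each window together with the position of its central site, which is the site whose density is predicted. The function name `sliding_windows` is hypothetical, and how the application treats the first and last 1000 bp of a chromosome (for example by padding) is not stated, so this sketch simply skips them.

```python
WINDOW = 2001  # window size in bp, from the embodiment
STEP = 1       # step in bp, from the embodiment


def sliding_windows(seq, window=WINDOW, step=STEP):
    """Yield (centre_position, subsequence) for every full window along seq.

    Positions closer than window // 2 to either end get no window here;
    whether the original method pads the ends is not specified.
    """
    half = window // 2
    for start in range(0, len(seq) - window + 1, step):
        yield start + half, seq[start:start + window]


# Small demonstration with a 5 bp window on a toy sequence.
toy = "ACGT" * 3
windows = list(sliding_windows(toy, window=5))
print(windows[0], windows[-1])
```

With a 1 bp step, a chromosome of length n yields n - 2000 windows, one prediction per central site.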
during training, the dropout rate is set to 0.2; the loss function is computed from MAE(y, ŷ) and corr(y, ŷ) (the formula itself is not reproduced in this text),
wherein ŷ is the model prediction, y is the true value, MAE is the mean absolute error between the two, and corr is the Pearson correlation coefficient between the two.
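Since the loss formula is not available in this text, the following is only a sketch of one plausible way to combine the two quantities named above, namely MAE + (1 - corr). This particular combination is an assumption and may differ from the formula in the original filing; the function names are likewise hypothetical.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson(y_true, y_pred):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    sd_t = math.sqrt(sum((t - mean_t) ** 2 for t in y_true))
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in y_pred))
    if sd_t == 0.0 or sd_p == 0.0:
        return 0.0
    return cov / (sd_t * sd_p)


def combined_loss(y_true, y_pred):
    """ASSUMED combination: MAE + (1 - corr). Not confirmed by the source."""
    return mae(y_true, y_pred) + 1.0 - pearson(y_true, y_pred)


y = [0.1, 0.5, 0.9]
print(combined_loss(y, y))
print(combined_loss(y, [0.9, 0.5, 0.1]))
```

Under this assumed form, the MAE term penalizes errors in absolute level while the correlation term rewards predictions whose peaks and troughs follow the experimental profile.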
Further, this embodiment encodes the test set to obtain feature sequences in two forms, One-hot coding and nucleotide chemical property coding; the two feature sequences are input into the two feature extraction modules, and the extracted local features are then horizontally concatenated; the Transformer layer integrates the concatenated local features and learns the global features of the sequence; the two fully connected layers then extract the nonlinear relations among the global features and output the prediction result;
wherein the output is mapped to a final predicted density, representing the nucleosome density of the central site of the input sequence, using softmax as an activation function in the fully connected layer.
The nucleosome density predicted by the method of this example is compared with that obtained by biological experiment, and the results are shown in parts A, B and C of FIG. 3. Part A of FIG. 3 is a scatter diagram of the DeepNDP prediction against the biological experiment result, giving a quantitative comparison of the two: the X axis is the computational prediction and the Y axis is the biological experiment result. The closer the dark region lies to the line y = x, the stronger the positive correlation between the two signals; otherwise the correlation is negative. The darker the region, the more concentrated the data. The diagram shows that the distribution of the values predicted by the DeepNDP model is consistent with the true values from the biological experiment. Parts B and C of FIG. 3 show the variation along the DNA sequence of the predicted nucleosome density and of the experimentally measured nucleosome density, respectively. The predicted regions of high and low nucleosome density are consistent with the biological experiment result, which means that the DeepNDP model can accurately identify nucleosome-dense regions and nucleosome-depleted regions on the DNA sequence.
The prediction results of the method of this example are compared with those of existing methods on the same data set. FIG. 4 shows, from left to right, the Pearson correlation coefficients obtained by DeepNDP, the method of Routhier et al., DLNN and LeNup on the sixteenth chromosome of the Saccharomyces cerevisiae genome. The comparison shows that the correlation coefficient between the nucleosome density predicted by DeepNDP and the MNase-seq signal reaches 0.723, while the Pearson correlation coefficient obtained by the method of Routhier et al. is 0.68; models designed for nucleosome positioning, such as DLNN and LeNup, reach only 0.43 and 0.40, respectively, when used to predict whole genome nucleosome density. Therefore, the DeepNDP model of this embodiment trains better than the previous models, and its predictions correlate better with the experimental data.
It should be noted that this example applies both One-hot coding and NCP coding to the DNA sequence of the whole genome chromosome, rather than One-hot coding alone, because the combination of the two performs better. Specifically, again taking Saccharomyces cerevisiae as an example, the DNA sequence was encoded only in One-hot form in order to verify the effectiveness of nucleotide chemical property (NCP) coding in nucleosome density prediction: when the DeepNDP model uses One-hot coding only, the Pearson correlation of the prediction is 0.703, compared with 0.723 when both codings are used. FIG. 5 shows the distribution curve of nucleosome density on a Saccharomyces cerevisiae chr16 fragment, with the solid line representing the prediction using nucleotide chemical property coding, the dotted line representing the prediction without nucleotide chemical property coding, and the dashed line representing the biological experiment result. Clearly, after nucleotide chemical property coding is added, the prediction on the Saccharomyces cerevisiae chr16 fragment fits the biological experiment result better, and the performance of the DeepNDP model is improved to some extent.
In addition to Saccharomyces cerevisiae, this example also inputs DNA sequences within 2000 bp upstream and downstream of mouse transcription start sites into the DeepNDP model as a test set, in order to verify the effectiveness of the model for cross-species recognition. The result is shown in FIG. 6: the ordinates of the upper and lower graphs are the nucleosome density predicted by the DeepNDP model and the NCP_score (nucleosome center positioning score) obtained by a chemical method, respectively, where NCP_score indicates the signal intensity at the nucleosome center site. FIG. 6 shows that the mouse nucleosome density predicted by the DeepNDP model has a periodicity similar to that of the NCP_score obtained by the chemical method, which indicates that the DeepNDP model has good cross-species generalization capability.
Example 2
The application provides a whole genome nucleosome density prediction system, comprising:
a coding module: used for obtaining a DNA sequence of a whole genome chromosome and applying a first coding and a second coding to it to obtain a first coding sequence and a second coding sequence;
and also used for constructing and training a DeepNDP model to obtain a trained DeepNDP model;
a prediction module: used for inputting the first coding sequence and the second coding sequence into the trained DeepNDP model for prediction to obtain a whole genome nucleosome density result;
the DeepNDP model comprises a feature extraction network, a Concatenate layer, a Transformer layer, a Flatten layer and two fully connected layers which are connected in sequence;
the feature extraction network is used for extracting a first local feature of the first coding sequence and a second local feature of the second coding sequence; the Concatenate layer is used for concatenating the first local feature and the second local feature to obtain a concatenated feature; the Transformer layer is used for extracting global features of the concatenated feature; the Flatten layer is used for changing the dimension of the output of the Transformer layer; the fully connected layers are used to predict whole genome nucleosome density.
Example III
The present embodiment provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the steps of the whole genome nucleosome density prediction method according to the first embodiment when executing the computer program.
Example IV
The present embodiment provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the whole genome nucleosome density prediction method of the first embodiment.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The scheme in the embodiments of the application can be implemented in various computer languages, such as the object-oriented programming language Java and the interpreted scripting language JavaScript.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It is apparent that the above examples are given by way of illustration only and do not limit the embodiments. Other variations and modifications in different forms will be apparent to those of ordinary skill in the art in light of the foregoing description. It is neither necessary nor possible to enumerate all embodiments here. Obvious variations or modifications derived therefrom remain within the scope of protection of the present application.

Claims (10)

1. A whole genome nucleosome density prediction method, characterized by comprising the following steps:
step S1: obtaining the DNA sequence of a whole genome chromosome, and performing a first coding and a second coding on it respectively to obtain a first coding sequence and a second coding sequence;
meanwhile, constructing and training a DeepNDP model to obtain a trained DeepNDP model;
step S2: inputting the first coding sequence and the second coding sequence into the trained DeepNDP model for prediction to obtain a whole genome nucleosome density result;
the DeepNDP model comprises a feature extraction network, a Concatenate layer, a Transformer layer, a Flatten layer and two fully connected layers which are connected in sequence;
the feature extraction network is used for extracting a first local feature of the first coding sequence and a second local feature of the second coding sequence; the Concatenate layer is used for splicing the first local feature and the second local feature to obtain a spliced feature; the Transformer layer is used for extracting global features of the spliced feature; the Flatten layer is used for changing the dimension of the output of the Transformer layer; the fully connected layers are used for predicting the whole genome nucleosome density.
2. The whole genome nucleosome density prediction method according to claim 1, wherein: the feature extraction network in step S2 includes a feature extraction module ResNet and a feature extraction module CNNNet, where the feature extraction module ResNet is used to extract the first local feature of the first coding sequence, and the feature extraction module CNNNet is used to extract the second local feature of the second coding sequence.
3. The whole genome nucleosome density prediction method according to claim 2, characterized in that: the feature extraction module ResNet comprises a first CNN layer, three ResBlock layers, a second CNN layer, a third CNN layer and a first Reshape layer which are sequentially connected, wherein the first Reshape layer is used for changing the output dimension of the third CNN layer.
4. The whole genome nucleosome density prediction method according to claim 3, wherein: the ResBlock layer comprises a first-column CNN unit and a second-column CNN unit;
the first-column CNN unit comprises a fourth CNN layer, a fifth CNN layer and a sixth CNN layer which are sequentially connected, wherein the convolution kernels adopted by the fourth CNN layer, the fifth CNN layer and the sixth CNN layer are 5, 16 and 16, respectively;
the second-column CNN unit comprises a seventh CNN layer, an eighth CNN layer and a ninth CNN layer which are sequentially connected, wherein the convolution kernels adopted by the seventh CNN layer, the eighth CNN layer and the ninth CNN layer are 3, 8 and 8, respectively;
the output of the sixth CNN layer, the output of the ninth CNN layer and the input of the current ResBlock layer are added together.
5. The whole genome nucleosome density prediction method according to claim 4, wherein: a ReLU activation function follows every CNN layer in the feature extraction module ResNet.
6. The whole genome nucleosome density prediction method according to claim 2, characterized in that: the feature extraction module CNNNet comprises a tenth CNN layer, an eleventh CNN layer, a twelfth CNN layer and a second Reshape layer which are sequentially connected, wherein the second Reshape layer is used for changing the dimension output by the twelfth CNN layer.
7. The whole genome nucleosome density prediction method according to claim 1, wherein: obtaining the DNA sequence of the whole genome chromosome in step S1 and performing the first coding and the second coding respectively to obtain the first coding sequence and the second coding sequence comprises the following steps:
obtaining the DNA sequence of a whole genome chromosome;
performing one-hot coding on the DNA sequence of the whole genome chromosome to obtain a one-hot coding sequence, and also performing nucleotide coding on the DNA sequence of the whole genome chromosome to obtain a nucleotide coding sequence, wherein the one-hot coding sequence is the first coding sequence, and the nucleotide coding sequence is the second coding sequence.
8. A whole genome nucleosome density prediction system, characterized by comprising:
an encoding and construction module, configured to obtain the DNA sequence of a whole genome chromosome and to perform a first coding and a second coding on it respectively, so as to obtain a first coding sequence and a second coding sequence;
the encoding and construction module is also configured to construct and train a DeepNDP model to obtain a trained DeepNDP model;
a prediction module, configured to input the first coding sequence and the second coding sequence into the trained DeepNDP model for prediction to obtain a whole genome nucleosome density result;
the DeepNDP model comprises a feature extraction network, a Concatenate layer, a Transformer layer, a Flatten layer and two fully connected layers which are connected in sequence;
the feature extraction network is used for extracting a first local feature of the first coding sequence and a second local feature of the second coding sequence; the Concatenate layer is used for splicing the first local feature and the second local feature to obtain a spliced feature; the Transformer layer is used for extracting global features of the spliced feature; the Flatten layer is used for changing the dimension of the output of the Transformer layer; the fully connected layers are used for predicting the whole genome nucleosome density.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized by: the processor, when executing the computer program, implements the steps of the whole genome nucleosome density prediction method according to any one of claims 1 to 7.
10. A computer-readable storage medium having stored thereon a computer program, characterized by: the computer program, when executed by a processor, performs the steps of the whole genome nucleosome density prediction method according to any one of claims 1 to 7.
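As an illustration of the one-hot coding named in claim 7 as the first coding, the following minimal sketch maps each base of a DNA sequence to a four-element vector. The channel order A, C, G, T and the all-zero vector for any other symbol (such as N) are assumptions made here; the claims do not fix either choice, and the second (nucleotide) coding is not sketched.

```python
BASES = "ACGT"  # assumed channel order; the claims do not specify one
BASE_INDEX = {base: i for i, base in enumerate(BASES)}


def one_hot_encode(sequence):
    """Return one 4-element row per base, with a single 1 marking the base.

    Assumption: symbols other than A, C, G, T (for example N) become all zeros.
    """
    encoded = []
    for base in sequence.upper():
        row = [0] * len(BASES)
        if base in BASE_INDEX:
            row[BASE_INDEX[base]] = 1
        encoded.append(row)
    return encoded


# Hypothetical five-base fragment, including one ambiguous base.
example = one_hot_encode("acgtN")
```

Each row has at most one nonzero entry, so a sequence of length L becomes an L x 4 matrix that convolutional layers can consume directly.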
CN202310415049.2A 2023-04-18 2023-04-18 Whole genome nucleosome density prediction method, whole genome nucleosome density prediction system and electronic equipment Pending CN116612816A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310415049.2A CN116612816A (en) 2023-04-18 2023-04-18 Whole genome nucleosome density prediction method, whole genome nucleosome density prediction system and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310415049.2A CN116612816A (en) 2023-04-18 2023-04-18 Whole genome nucleosome density prediction method, whole genome nucleosome density prediction system and electronic equipment

Publications (1)

Publication Number Publication Date
CN116612816A true CN116612816A (en) 2023-08-18

Family

ID=87673676

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310415049.2A Pending CN116612816A (en) 2023-04-18 2023-04-18 Whole genome nucleosome density prediction method, whole genome nucleosome density prediction system and electronic equipment

Country Status (1)

Country Link
CN (1) CN116612816A (en)

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination