CN115019876A - Gene expression prediction method and device - Google Patents
- Publication number
- CN115019876A CN202210613683.2A
- Authority
- CN
- China
- Prior art keywords
- gene
- prediction
- gene sequence
- predicted
- expression
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B35/00—ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
- G16B35/20—Screening of libraries
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B5/00—ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
Abstract
The embodiment of the invention provides a method and a device for predicting gene expression, wherein the method comprises the following steps: obtaining a gene sequence segment to be predicted; inputting the gene sequence segment to be predicted into a prediction model to obtain a prediction result output by the prediction model; the prediction model is constructed based on a multi-head self-attention mechanism and is obtained by training based on a gene sequence fragment sample and a prediction label, and the prediction label is a nucleotide corresponding to the gene sequence fragment sample. According to the gene expression prediction method and device provided by the embodiment of the invention, the gene expression is predicted through the prediction model constructed based on the multi-head self-attention mechanism, so that the gene prediction time is reduced and the gene prediction efficiency is improved; moreover, the relation among a plurality of nucleotides in the gene sequence can be obtained by calculating the mutual influence among the nucleotides, so that the accuracy of the gene prediction is improved.
Description
Technical Field
The invention relates to the technical field of computers, in particular to a method and a device for predicting gene expression.
Background
The gene prediction is to predict the gene structure and function of unknown sequence by computer simulation and calculation by using the existing theory and the known information of gene sequence. The gene prediction can be applied to a plurality of fields such as agriculture, medical treatment, biology and the like, and has important effect on social development.
The gene prediction methods in the prior art mainly include homologous prediction and de novo prediction. The homology prediction requires multiple visits to the target gene sequence to match the reference gene, and the time consumption is large. The de novo prediction is based on the given sequence characteristics, and the existing gene prediction method based on the gene data extraction characteristics has the problem of low accuracy.
Disclosure of Invention
The invention provides a gene expression prediction method, which is used for overcoming the defect of low gene prediction accuracy in the prior art and improving the gene prediction accuracy.
In a first aspect, the present invention provides a method for predicting gene expression, comprising:
obtaining a gene sequence segment to be predicted;
inputting the gene sequence segment to be predicted into a prediction model to obtain a prediction result output by the prediction model;
the prediction model is constructed based on a multi-head self-attention mechanism and is obtained by training based on a gene sequence fragment sample and a prediction label, and the prediction label is a nucleotide corresponding to the gene sequence fragment sample.
Optionally, the inputting the gene sequence segment to be predicted into a prediction model to obtain a prediction result output by the prediction model includes:
performing characteristic extraction on the gene sequence segment to be predicted to obtain standard gene characteristics;
performing multi-head self-attention weight calculation on the standard gene characteristics to obtain attention expression;
predicting the gene sequence segment to be predicted based on the attention expression, and obtaining a prediction result, wherein the prediction result comprises a nucleotide type and a probability corresponding to the nucleotide type.
Optionally, the performing multi-head self-attention weight calculation on the standard gene characteristics to obtain an attention expression includes:
obtaining a splicing expression based on the standard gene characteristics and pre-stored historical gene characteristics;
obtaining a query vector, a key vector and a value vector corresponding to the standard gene feature based on the standard gene feature and the splicing expression;
obtaining an initial attention representation based on the query vector, the key vector, and the value vector;
and carrying out a standardization operation on the initial attention representation to obtain the attention representation.
Optionally, the performing feature extraction on the gene sequence segment to be predicted to obtain a standard gene feature includes:
coding the gene sequence segment to be predicted to obtain coding representation;
performing initial feature extraction on the coding expression to obtain initial gene features;
performing maximum pooling operation on the initial gene characteristics to obtain pooled gene characteristics;
and carrying out standardized operation on the pooled gene characteristics to obtain standard gene characteristics.
Optionally, the performing a normalization operation includes:
performing batch standardization operation based on a preset batch standardization formula;
the preset batch standardization formula is as follows:
$$y_i = a \cdot \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} + b$$
wherein $x_i$ is the data to be normalized, $y_i$ is the normalized data, $\mu$ is the mean parameter, $\sigma^2$ is the variance parameter, the mean parameter and the variance parameter are determined based on the gene sequence fragment sample, $\epsilon$ is a hyperparameter, $a$ is a first model parameter, and $b$ is a second model parameter.
Optionally, the method further comprises:
and carrying out arithmetic coding on the basis of the prediction result and the gene sequence segment to be predicted to obtain a compressed gene sequence.
Optionally, the method further comprises:
and performing arithmetic data decoding on the compressed gene sequence based on the prediction result to obtain a decoded gene sequence.
In a second aspect, the present invention also provides a gene expression prediction apparatus comprising:
an acquisition unit for acquiring a gene sequence segment to be predicted;
the prediction unit is used for inputting the gene sequence segment to be predicted into a prediction model to obtain a prediction result output by the prediction model;
the prediction model is constructed based on a multi-head self-attention mechanism, the prediction model is obtained based on a gene sequence fragment sample and a prediction tag through training, and the prediction tag is a nucleotide corresponding to the gene sequence fragment sample.
In a third aspect, the present invention also provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the method for predicting gene expression according to the first aspect.
In a fourth aspect, the present invention also provides a non-transitory computer readable storage medium, on which a computer program is stored, which computer program, when executed by a processor, implements the gene expression prediction method according to the first aspect.
According to the gene expression prediction method and device provided by the embodiment of the invention, the gene expression is predicted through the prediction model constructed based on the multi-head self-attention mechanism; no reference gene is needed and the target gene sequence only needs to be accessed once, so that the gene prediction time is reduced and the gene prediction efficiency is improved. In addition, the multi-head self-attention mechanism connects at least two single-head self-attention mechanisms and extracts features from the gene data from at least two directions, and the self-attention mechanism can obtain the relation among a plurality of nucleotides in a gene sequence by calculating the mutual influence among the nucleotides, so that the prediction accuracy of the next nucleotide is improved.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of a method for predicting gene expression according to an embodiment of the present invention;
FIG. 2 is a second schematic flow chart of a method for predicting gene expression according to an embodiment of the present invention;
FIG. 3 is a third schematic flow chart of a method for predicting gene expression according to an embodiment of the present invention;
FIG. 4 is a fourth schematic flow chart of a method for predicting gene expression according to an embodiment of the present invention;
FIG. 5 is a schematic structural view of a gene expression prediction apparatus according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The technical terms related to the invention are described as follows:
Gene compression: techniques for encoding gene sequences into data that requires less storage space and that can be decoded into the original gene sequences.
Nucleotide: nucleic acids include ribonucleic acid (RNA) and deoxyribonucleic acid (DNA); DNA is the genetic material of most organisms, and RNA performs multiple complex functions. The monomers constituting nucleic acids are called nucleotides, which are classified into deoxyribonucleotides and ribonucleotides according to whether the five-carbon sugar is deoxygenated.
Deoxynucleotide: deoxynucleotides (deoxyribotides) are basic units of deoxyribonucleic acid (DNA for short), are small molecular compounds consisting of purine or pyrimidine bases, deoxyribose and phosphate, and are material bases for forming the DNA of genetic materials of organisms. The diversity of organisms is determined by the difference in the arrangement order of adenine (adenine, abbreviated as A), thymine (thymine, abbreviated as T), cytosine (cytosine, abbreviated as C) and guanine (guanine, abbreviated as G) among deoxynucleotides. Four bases are arranged along the inside of the long DNA strand, and the sequence of the four bases stores genetic information.
Embedding (Embedding): embedding is a feature extracted from the raw data, i.e., the low-dimensional vector after it has been mapped through the neural network.
Encoder (Encoder): encoder, as its name implies, encodes input data and converts the input data into an intermediate representation through a non-linear transformation.
Attention (attention): attention is a very common but often overlooked phenomenon. For example, when a bird flies across the sky, human attention tends to follow the bird, and the sky naturally becomes background information in the human visual system. The basic idea of the attention mechanism in computer vision is to let the system learn to focus on important information and ignore background information.
For gene prediction methods based on homology prediction, the target gene sequence must be visited multiple times in order to match the reference gene, which is time-consuming. For de novo gene prediction methods, such as those based on the LSTM model or the bidirectional LSTM model, the features of the gene sequence can only be observed from one direction or two directions because the model is not sufficiently complex. Experiments have shown that the present solution can outperform LSTM-based models under the same conditions.
In order to solve the above problems, embodiments of the present invention provide a method for predicting gene expression, which predicts a gene sequence based on a multi-head self-attention mechanism, does not need a reference gene, shortens the prediction time of gene expression, and improves the accuracy of gene expression prediction.
The method for predicting gene expression provided by the embodiment of the present invention will be described below with reference to FIGS. 1 to 4.
FIG. 1 is a schematic flow chart of a gene expression prediction method provided in an embodiment of the present invention. As shown in FIG. 1, the embodiment of the present invention provides a gene expression prediction method, including:
obtaining a gene sequence segment to be predicted;
Specifically, the gene sequence segment to be predicted may be a gene sequence consisting of N consecutive nucleotides in one single strand of a double-stranded DNA, wherein N is a positive integer. For example, a gene sequence segment to be predicted includes 10 consecutive deoxynucleotides: ACTGAGTCCG. Optionally, N is 64.
It should be understood that the embodiment of the present invention does not limit the specific obtaining manner of the gene sequence segment to be predicted, for example, the gene sequence segment to be predicted may be stored in a set region (e.g., a database), and the gene sequence segment to be predicted may be obtained by accessing the set region. In other embodiments, the gene sequence segment to be predicted may be obtained by a gene collecting device, and the gene collecting device may include a blood collector, a saliva collector, a skin collector, and the like. Other methods can be used to obtain the gene sequence segment to be predicted by those skilled in the art, and are not described herein.
inputting the gene sequence segment to be predicted into a prediction model to obtain a prediction result output by the prediction model;
the prediction model is constructed based on a multi-head self-attention mechanism and is obtained by training based on a gene sequence fragment sample and a prediction label, and the prediction label is a nucleotide corresponding to the gene sequence fragment sample.
Specifically, the prediction result may be the next nucleotide of the gene sequence segment to be predicted. A gene sequence fragment sample is a gene sequence fragment, obtained in advance, whose next nucleotide is known, and the prediction label is that next nucleotide. Illustratively, given a known gene sequence of length 126 (126 consecutive nucleotides), the 0th to 63rd deoxynucleotides can be used as a first gene sequence fragment sample, with the 64th deoxynucleotide as its prediction label; the 1st to 64th deoxynucleotides can be used as a second gene sequence fragment sample, with the 65th deoxynucleotide as its prediction label; and so on, so that a plurality of gene sequence fragment samples can be obtained from the gene sequence of length 126.
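The sample construction described above can be sketched as a sliding window. This is a minimal illustration only: the function name `make_samples` and the repeated-motif test sequence are hypothetical, and the window length of 64 follows the optional value of N given in the description.

```python
def make_samples(sequence, window=64):
    """Slide a fixed-size window over the sequence.

    Each sample is (fragment, label), where the label is the nucleotide
    that immediately follows the fragment.
    """
    samples = []
    for start in range(len(sequence) - window):
        fragment = sequence[start:start + window]
        label = sequence[start + window]
        samples.append((fragment, label))
    return samples


# Hypothetical 126-nucleotide sequence built by repeating a short motif.
sequence = ("ACTGAGTCCG" * 13)[:126]
samples = make_samples(sequence)

first_fragment, first_label = samples[0]
print(len(samples), first_fragment[:10], first_label)
```

With a sequence of length 126 and a window of 64, the last usable window covers positions 61 to 124 and is labelled by position 125.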
The self-attention mechanism is a variant of the attention mechanism, can capture the internal correlation of data, and can be applied to gene prediction to obtain the relation among a plurality of nucleotides in a gene sequence by calculating the mutual influence among the nucleotides. The multi-head self-attention mechanism is characterized in that at least two single-head self-attention mechanisms are connected, and characteristics are extracted from gene data from at least two directions.
It is understood that the prediction result may be appended to the gene sequence segment to be predicted, and the resulting segment may be input to the prediction model cyclically until the complete gene sequence is obtained. In one embodiment, the first gene sequence segment to be predicted comprises the 0th to 63rd deoxynucleotides and is input into the prediction model to obtain the corresponding prediction result, namely the 64th deoxynucleotide; in the next round, the 1st to 64th deoxynucleotides are used as the second gene sequence segment to be predicted and are input into the prediction model to obtain the corresponding prediction result, namely the 65th deoxynucleotide; and so on, until the complete gene sequence is obtained.
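The cyclic prediction above can be sketched as a loop that always feeds the most recent window back into the model. The function `predict_next` is a hypothetical stand-in for the trained prediction model (here it simply repeats the nucleotide seen ten positions earlier), and `extend` is likewise an illustrative name.

```python
def predict_next(fragment):
    """Hypothetical stand-in for the trained model.

    It returns the nucleotide seen ten positions earlier, which is only
    correct for the periodic toy sequence used below.
    """
    return fragment[-10]


def extend(seed, steps, window=64):
    """Append one predicted nucleotide per round, re-using the last window."""
    sequence = seed
    for _ in range(steps):
        fragment = sequence[-window:]        # e.g. nucleotides 1..64 in round two
        sequence += predict_next(fragment)   # predicted next nucleotide
    return sequence


seed = ("ACTGAGTCCG" * 7)[:64]
extended = extend(seed, steps=6)
print(extended[64:])
```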
According to the gene expression prediction method provided by the embodiment of the invention, the gene expression is predicted through the prediction model constructed based on the multi-head self-attention mechanism, a reference gene is not needed, and only one access is needed for a target gene sequence, so that the gene prediction time is reduced, and the gene prediction efficiency is improved; moreover, the characteristics of the gene data can be extracted from multiple directions, and the relation between nucleotides in the gene sequence segment to be predicted can be obtained, so that the prediction accuracy of the next nucleotide is improved.
In the following, a possible implementation manner of the above steps in a specific embodiment is further described.
Optionally, the inputting the gene sequence segment to be predicted into a prediction model to obtain a prediction result output by the prediction model includes:
Step 121, extracting the characteristics of the gene sequence segment to be predicted to obtain standard gene characteristics;
After the gene sequence segment to be predicted is obtained, a feature extraction operation can be performed on it to obtain gene features. Optionally, the gene features are standardized: for the samples in the training set (i.e., all or part of the gene sequence fragment samples used in the training process), the column mean is subtracted from the data and the result is divided by the column standard deviation, so that the data are centered around 0 with a variance of 1. Standardization can improve the convergence rate of model optimization in the training stage and can also prevent samples with large variance from having an excessive influence on model training.
Step 122, performing multi-head self-attention weight calculation on the standard gene characteristics to obtain attention expression;
Specifically, the multi-head self-attention weight calculation refers to calculating the weights corresponding to the standard gene features by a multi-head self-attention mechanism. Multi-head self-attention means that, for each feature element, its corresponding attention weight is found from multiple directions. The attention representation refers to the feature obtained by weighting the standard gene features with the multi-head self-attention weights.
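As a rough illustration of the idea, the following sketch splits each feature vector into heads, applies scaled dot-product self-attention within each head, and concatenates the results. It is not the formula of this embodiment: the scaled dot-product form and the identity query/key/value projections are assumptions made to keep the example short, and all function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def single_head_attention(rows):
    """Scaled dot-product self-attention with identity projections (assumption)."""
    dim = len(rows[0])
    output = []
    for query in rows:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in rows]
        weights = softmax(scores)
        output.append([sum(w * value[c] for w, value in zip(weights, rows))
                       for c in range(dim)])
    return output


def multi_head_self_attention(features, num_heads):
    """Split the channel dimension into heads, attend per head, concatenate."""
    dim = len(features[0])
    assert dim % num_heads == 0, "channel size must be divisible by num_heads"
    head_dim = dim // num_heads
    heads = []
    for h in range(num_heads):
        chunk = [row[h * head_dim:(h + 1) * head_dim] for row in features]
        heads.append(single_head_attention(chunk))
    return [sum((head[i] for head in heads), []) for i in range(len(features))]


# Three positions with four channels each (toy values).
features = [[1.0, 0.0, 0.0, 1.0],
            [0.0, 1.0, 1.0, 0.0],
            [1.0, 0.0, 0.0, 1.0]]
attended = multi_head_self_attention(features, num_heads=2)
print(attended)
```

Each head sees a different slice of the channels, which is one simple way to read "finding the attention weight from multiple directions".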
Step 123, predicting the gene sequence segment to be predicted based on the attention representation to obtain the prediction result, wherein the prediction result comprises a nucleotide type and a probability corresponding to the nucleotide type.
Specifically, a fully connected layer can be used as a classifier: the attention representation is input into the fully connected layer, and the output is passed through a softmax layer to generate the probability of the next nucleotide.
Here, the output of the fully connected layer is the attention representation after classification; $z_i$ denotes the vector corresponding to the $i$-th nucleotide in the attention representation; $w_j$ denotes the matrix of the one-dimensional convolution kernel; and $b_j$ denotes the bias matrix.
Illustratively, the gene sequence segment to be predicted is input into the prediction model, and the obtained prediction result is as follows: A, 80%; T, 14%; C, 4%; G, 2%.
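The softmax step that turns classifier scores into per-nucleotide probabilities can be sketched as follows. The four input scores are made-up values standing in for the output of the fully connected layer, and `nucleotide_probabilities` is a hypothetical helper name.

```python
import math

NUCLEOTIDES = ("A", "T", "C", "G")


def nucleotide_probabilities(logits):
    """Apply a numerically stable softmax to four classifier scores."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return {name: e / total for name, e in zip(NUCLEOTIDES, exps)}


# Made-up scores standing in for the fully connected layer output.
probs = nucleotide_probabilities([2.0, 0.3, -1.0, -1.7])
predicted = max(probs, key=probs.get)
print(predicted, probs)
```

The probabilities sum to 1, and the nucleotide type with the largest probability is the predicted next nucleotide.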
Optionally, gene expression prediction may be ended by defining a window, i.e., by limiting the gene sequence segments that are input into the model. Illustratively, to stop at a certain nucleotide, the 64 nucleotides preceding that nucleotide are input as the last window, and no further input is made.
According to the gene expression prediction method provided by the embodiment of the invention, the gene expression is predicted through the prediction model constructed based on the multi-head self-attention mechanism; no reference gene is needed and the target gene sequence only needs to be accessed once, so that the gene prediction time is reduced and the gene prediction efficiency is improved. Moreover, features can be extracted from the gene data from multiple directions, and the relation between nucleotides in the gene sequence segment to be predicted is obtained, so that the prediction accuracy of the next nucleotide is improved. Standardization prevents samples with large variance from having an excessive influence on the model. The predicted nucleotide type and the probability corresponding to the nucleotide type can be obtained through the fully connected layer and the softmax layer, and can be used for gene sequence compression.
FIG. 2 is a second schematic flow chart of a gene expression prediction method provided in an embodiment of the present invention, and a possible implementation manner of the above steps in an embodiment is further described with reference to FIG. 2.
Optionally, the performing multi-head self-attention weight calculation on the standard gene feature to obtain an attention representation includes:
Step 1221, obtaining a splicing expression based on the standard gene characteristics and pre-stored historical gene characteristics;
The splicing process can be shown as follows:
$$\tilde{h} = [\mathrm{SG}(m) \circ h]$$
wherein $h$ is the standard gene feature, i.e., the representation corresponding to the gene sequence segment to be predicted that is input at the current moment; $\mathrm{SG}(\cdot)$ denotes stop gradient; $\circ$ denotes the splicing operation in the channel dimension; $m$ is the historical gene feature, i.e., the representation corresponding to the historical gene sequence segment input at the previous moment; and $\tilde{h}$ is the spliced representation.
Illustratively, in the first round, the standard gene characteristics corresponding to the gene sequence segment to be predicted in the first round, that is, the gene characteristics corresponding to the 0th to 63rd nucleotides, are input at the current time; since this is the first round of gene expression prediction, the historical gene sequence segment may be empty. In the second round, the standard gene characteristics corresponding to the gene sequence segment to be predicted in the second round, that is, the gene characteristics corresponding to the 1st to 64th nucleotides, are input at the current time, and the historical gene characteristics are the gene characteristics corresponding to the 0th to 63rd nucleotides input in the previous round.
Step 1222, based on the standard gene feature and the splicing expression, obtaining a query vector, a key vector and a value vector corresponding to the standard gene feature;
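The process can be represented by the following formula:
$$q = h W_q^{\top}, \quad k = \tilde{h} W_k^{\top}, \quad v = \tilde{h} W_v^{\top}$$
wherein $h$ is the standard gene feature and $\tilde{h}$ is the spliced representation; $q$ represents the query vector, $k$ represents the key vector, and $v$ represents the value vector; $W_q$ represents the matrix used for generating the query vector, $W_k$ represents the matrix used for generating the key vector, and $W_v$ represents the matrix used for generating the value vector.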
A step 1223 of obtaining an initial attention representation based on the query vector, the key vector and the value vector;
the process can be represented by the following formula:
wherein R is i-j Is a relative position code, the relative position relation uses a position code matrixThe ith row represents a position vector with a relative position interval of i, and is generated by a sine function. A. the i,j Indicating the result after position encoding.
It is understood that relative position encoding refers to determining the encoding based on the relative position (e.g., the relationship of several nucleotides apart) between the individual nucleotides in the gene sequence. Illustratively, when the process data is a vector corresponding to nucleotide a, row i represents a vector corresponding to nucleotides at positions i apart (i nucleotides apart) from the relative position of a.
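A position encoding matrix of this kind can be sketched as below. The description only states that the rows are generated by a sine function; the alternating sine/cosine layout and the base of 10000 are assumptions borrowed from the commonly used sinusoidal scheme, and `relative_position_matrix` is a hypothetical name.

```python
import math


def relative_position_matrix(max_distance, dim):
    """Row i is the position vector for a relative interval of i nucleotides.

    Assumption: even channels use sine and odd channels use cosine, with
    frequencies following the usual 10000 ** (2k / dim) schedule.
    """
    matrix = []
    for distance in range(max_distance):
        row = []
        for channel in range(dim):
            exponent = (channel - channel % 2) / dim
            angle = distance / (10000 ** exponent)
            row.append(math.sin(angle) if channel % 2 == 0 else math.cos(angle))
        matrix.append(row)
    return matrix


R = relative_position_matrix(max_distance=64, dim=8)
print(R[0])
print(R[3][:4])
```

Because the row depends only on the distance between two nucleotides, the same row is reused wherever that distance occurs in the window.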
Step 1224 of normalizing the initial attention representation to obtain the attention representation.
Optionally, the standardization operation is batch standardization.
The batch normalization process may be represented by the following formula:
$$y_i = a \cdot \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} + b$$
wherein $x_i$ is the data to be normalized (here, the initial attention representation) and $y_i$ is the normalized data; $\mu$ is the mean in the channel dimension, namely the mean parameter; $\sigma^2$ is the variance in the channel dimension, namely the variance parameter; $\epsilon$ is a hyperparameter; and $a$ and $b$ are the first and second model parameters. It is to be understood that the mean parameter and the variance parameter are determined on the basis of the gene sequence fragment samples: during training they are determined by a sub-training set (batch) of all training samples, and in the trained model they are determined by all training samples.
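A minimal sketch of the batch normalization step follows, assuming the standard form in which the mean is subtracted, the result is divided by the square root of the variance plus epsilon, and the learned parameters a and b scale and shift the output. The function name and the toy values are hypothetical; the default epsilon of 1e-05 follows the optional value mentioned in the description.

```python
import math


def batch_normalize(values, a=1.0, b=0.0, eps=1e-5):
    """Normalize one channel of a batch: a * (x - mean) / sqrt(var + eps) + b."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [a * (v - mean) / math.sqrt(variance + eps) + b for v in values]


# Toy batch of values for a single channel.
normalized = batch_normalize([2.0, 4.0, 6.0, 8.0])
print(normalized)
```

Here the statistics come from the batch itself, as during training; in a trained model the mean and variance would instead be the values determined from all training samples.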
Alternatively, the normalized initial attention representation may be input to the fully-connected layer, resulting in an attention representation of the fully-connected layer output. The full-connection layer can improve the nonlinear expression capability of the prediction model, so that the learning capability and the expression capability of the model are improved.
In one embodiment, the standard gene features are input into a Transformer-XL encoder, which may be composed of two parts: a self-attention module and a fully connected layer.
The process of the self-attention module is as follows:
wherein $h$ is the standard gene feature, i.e., the representation corresponding to the gene sequence segment to be predicted that is input at the current moment; $\mathrm{SG}(\cdot)$ denotes stop gradient; $\circ$ denotes the splicing operation in the channel dimension; $m$ is the historical gene feature, i.e., the representation corresponding to the historical gene sequence segment input at the previous moment, and the result of the splicing is the spliced representation; $q$ represents the query vector, $k$ represents the key vector, and $v$ represents the value vector; $W_q$ represents the matrix for generating the query vector, $W_k$ represents the matrix for generating the key vector, $W_v$ represents the matrix for generating the value vector, and $W_{k,R}$ represents the location-based vector; $R_{i-j}$ is the relative position code, and the relative position relation uses a position encoding matrix whose $i$-th row represents the position vector for a relative position interval of $i$ and is generated by a sine function; $A_{i,j}$ denotes the result after position encoding. $\mu$ is the mean in the channel dimension, namely the mean parameter; $\sigma^2$ is the variance in the channel dimension, namely the variance parameter; $\epsilon$ is a hyperparameter with a very small value, which may range from 1e-04 to 1e-06 and is optionally 1e-05; $a$ is the first model parameter and $b$ is the second model parameter, and $a$ and $b$ are updated through the model training process. The result of applying the layer normalization processing to $A_{i,j}$ is the normalized initial attention representation.
The full connection layer is a full connection layer containing a hidden layer, so that the embedding of data is transformed in the channel dimension, and the expression capability of the model is enhanced.
The process is shown by the following formula:
$$z_{i,j} = \delta\left(w_j \odot [a_i, a_{i+1}, \ldots, a_{i+k-1}] + b_j\right)$$
wherein $z_{i,j}$ is the attention representation; $a_i$ represents the vector corresponding to the $i$-th nucleotide in the normalized initial attention representation; $\delta$ represents the activation function, which may optionally be a relu function; $w_j$ represents the matrix of the one-dimensional convolution kernel; and $b_j$ represents the bias matrix.
It should be understood that, on the basis of the embodiment of the present invention, adjusting the internal structure of the Transformer-XL encoder (for example, adjusting the number of nodes of the fully connected layer or the number of attention heads), or replacing Transformer-XL with other similar Transformer variants (for example, the vanilla Transformer or the Compressive Transformer), should also fall within the protection scope of the present invention.
According to the gene expression prediction method provided by the embodiment of the invention, the gene expression is predicted through the prediction model constructed based on the multi-head self-attention mechanism, the relation between nucleotides in the gene sequence fragment to be predicted is obtained, the overlarge influence on the model caused by a sample with large variance is avoided through standardization, and in addition, the expression capability of the model is improved through the full-connection layer, so that the prediction accuracy of the next nucleotide is improved.
Optionally, performing feature extraction on the gene sequence segment to be predicted to obtain the standard gene features includes:
Step 1211, encoding the gene sequence segment to be predicted to obtain a coded representation;
Specifically, the gene sequence segment to be predicted may be in FASTA format. Illustratively, a gene sequence segment to be predicted, such as GGCTA…, is read from a gene sequence file in FASTA format and encoded with one-hot encoding to obtain the coded representation.
Illustratively, the encoding is as follows: A = {1,0,0,0}; C = {0,1,0,0}; G = {0,0,1,0}; T = {0,0,0,1}. It should be understood that the above is an example to facilitate understanding and does not limit the present invention in any way; for example, the correspondence above may be replaced by another one, as long as the encoding distinguishes the four nucleotides A, T, C and G.
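As a minimal sketch of step 1211, the following reads a sequence from FASTA-formatted text and applies the illustrative one-hot mapping given above. The helper names `read_fasta_sequence` and `one_hot_encode` and the record name `example_record` are hypothetical; handling of ambiguous bases such as N is not covered by the text and is left out.

```python
# One-hot table following the illustrative correspondence in the text.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}

def read_fasta_sequence(text):
    """Join all sequence lines of FASTA text, skipping '>' header lines."""
    return "".join(
        line.strip().upper()
        for line in text.splitlines()
        if line.strip() and not line.startswith(">")
    )

def one_hot_encode(sequence):
    """Map each nucleotide to its 4-element one-hot vector."""
    return [ONE_HOT[base] for base in sequence]

fasta_text = ">example_record\nGGCTA\n"
encoded = one_hot_encode(read_fasta_sequence(fasta_text))
```

Any other one-to-one assignment of the four vectors would serve equally well, as the text notes.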
Step 1212, performing initial feature extraction on the coded representation to obtain initial gene features;
Optionally, the coded representation is mapped into a high-dimensional space by one-dimensional convolution for feature extraction. Taking the one-hot coded representation as input, the gene sequence after one-dimensional convolution, i.e. the initial gene features, is:
$o_{i,j} = \delta\left(w_j \odot [x_i, x_{i+1}, \ldots, x_{i+k-1}] + b_j\right)$;
where $o_{i,j}$ denotes the initial gene features; $w_j$ is the matrix of the one-dimensional convolution kernel; $b_j$ is the bias matrix; $\delta$ is an activation function that adds a non-linear transformation to the network, optionally the ReLU function in embodiments of the present invention, which outputs 0 when the input is less than 0 and outputs the original value when the input is greater than or equal to 0; and $x_i$ refers to the $i$-th nucleotide in the gene sequence segment to be predicted.
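The convolution formula above can be illustrated with a small pure-Python sketch. It assumes stride 1 and no padding, neither of which is specified in the text, and the single kernel used in the example is a hand-made detector for the pair "GC" rather than a trained weight.

```python
def relu(value):
    """Return 0 for negative inputs and the input itself otherwise."""
    return value if value >= 0 else 0.0

def conv1d_relu(x, kernels, biases):
    """Compute o[i][j] = relu(sum(w_j * [x_i .. x_{i+k-1}]) + b_j).

    x:       list of L vectors, each with C channels
    kernels: list of J kernels, each a k x C matrix
    biases:  list of J scalars
    Returns an (L - k + 1) x J matrix (stride 1, no padding: an assumption).
    """
    k = len(kernels[0])
    out = []
    for i in range(len(x) - k + 1):
        row = []
        for w_j, b_j in zip(kernels, biases):
            total = b_j
            for offset in range(k):
                for c, x_val in enumerate(x[i + offset]):
                    total += w_j[offset][c] * x_val
            row.append(relu(total))
        out.append(row)
    return out

# One-hot representation of "GGCTA" (channels ordered A, C, G, T).
x = [(0, 0, 1, 0), (0, 0, 1, 0), (0, 1, 0, 0), (0, 0, 0, 1), (1, 0, 0, 0)]
# Hypothetical kernel of width 2 that responds to G followed by C.
gc_kernel = [[0, 0, 1, 0], [0, 1, 0, 0]]
o = conv1d_relu(x, kernels=[gc_kernel], biases=[-1.0])
```

With this kernel and bias, only the window "GC" yields a positive activation; every other window is clipped to 0 by the ReLU.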
Step 1213, performing maximum pooling operation on the initial gene characteristics to obtain pooled gene characteristics;
Specifically, the initial gene features obtained from the one-dimensional convolution are passed through a max pooling layer to obtain the pooled gene features. The pooled gene features are shorter than the initial gene features in data length, which reduces the computational complexity of the model.
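A minimal sketch of the max pooling step follows. Non-overlapping windows and dropping an incomplete trailing window are assumptions; the text does not give the pool size or stride, and the feature values are made up for the example.

```python
def max_pool1d(features, pool_size):
    """Max-pool an L x J feature matrix along the sequence axis.

    Windows do not overlap (stride == pool_size); a trailing window shorter
    than pool_size is dropped. Both choices are assumptions.
    """
    pooled = []
    for start in range(0, len(features) - pool_size + 1, pool_size):
        window = features[start:start + pool_size]
        # Take the per-channel maximum inside the window.
        pooled.append([max(column) for column in zip(*window)])
    return pooled

features = [[0.0, 2.0], [1.0, 0.5], [0.0, 0.0], [3.0, 1.0], [0.2, 0.1]]
pooled = max_pool1d(features, pool_size=2)
```

Here five positions shrink to two, which is the reduction in data length that the step relies on.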
Step 1214, standardizing the pooled gene features to obtain standard gene features.
The normalization operation and the batch normalization operation are described in step 121 and step 1221, and will not be described in detail here.
Optionally, the method further comprises:
performing arithmetic coding based on the prediction result and the gene sequence segment to be predicted to obtain a compressed gene sequence.
Fig. 3 is a third schematic flowchart of the gene expression prediction method provided in the embodiment of the present invention. As shown in Fig. 3, in an embodiment, applying the gene expression prediction method to the encoding process may include three parts: model training, inference and arithmetic coding.
Model training: gene sequence data related to the gene sequence segment to be predicted (e.g., gene sequences of the same species as the gene sequence segment to be predicted) are collected as a data set, processed into FASTA format, and divided into a training set, a validation set and a test set in a ratio of 7:2:1. The constructed deep learning model is trained on the training set until it converges on the training set and performs well on the validation set.
Inference: the segment of the gene sequence preceding the target nucleotide (i.e., the gene sequence segment to be predicted) is taken as input; the trained prediction model predicts the type of the next nucleotide and the probability corresponding to each type, and the occurrence probability of the gene sequence consisting of all currently known nucleotides is calculated.
Arithmetic coding: the probability finally obtained is the result of the arithmetic coding of the gene sequence.
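The link between the sequence probability and the compressed length can be made concrete with a short sketch. It relies on the standard information-theoretic fact that an arithmetic code for a sequence of probability $P$ needs about $-\log_2 P$ bits; the per-step probabilities below are invented values standing in for model outputs.

```python
import math

def ideal_code_length_bits(step_probabilities):
    """Sum -log2(p) over the probabilities assigned to the observed nucleotides."""
    return sum(-math.log2(p) for p in step_probabilities)

# Hypothetical probabilities the model assigned to the nucleotides that occurred.
probs = [0.5, 0.25, 0.5, 0.125]

sequence_probability = math.prod(probs)        # probability of the whole sequence
bits = ideal_code_length_bits(probs)           # approximate compressed size in bits
bits_per_nucleotide = bits / len(probs)
```

The better the model predicts each next nucleotide, the larger the sequence probability and the fewer bits are needed.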
According to the gene expression prediction method provided by the embodiment of the present invention, the prediction model based on the multi-head self-attention mechanism predicts the type of the next nucleotide with higher accuracy, so that the coded representation of the gene sequence is shorter and the compression ratio is higher.
Optionally, the method further comprises:
performing arithmetic decoding on the compressed gene sequence based on the prediction result to obtain a decoded gene sequence.
Fig. 4 is a fourth schematic flowchart of the gene expression prediction method provided in an embodiment of the present invention. As shown in Fig. 4, in an embodiment, applying the gene expression prediction method to the decoding process may include two parts: inference and arithmetic decoding.
Inference: the short segment of the sequence preceding the next nucleotide (i.e., the gene sequence segment to be predicted) is taken as input, and the trained deep learning model predicts the occurrence probability of the next nucleotide.
Arithmetic decoding: the nucleotide at the next position is decoded according to the probability calculated by the model.
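The encoding and decoding loops described above can be sketched with exact rational arithmetic. This is an illustration only: `toy_model` is a hypothetical stand-in for the trained prediction model, the code is returned as an interval rather than a bit string, and the sequence length is passed to the decoder explicitly, none of which is specified in the text.

```python
from fractions import Fraction

ALPHABET = "ACGT"

def toy_model(context):
    """Hypothetical predictor: a repeat of the previous base is more likely."""
    if not context:
        return {base: Fraction(1, 4) for base in ALPHABET}
    last = context[-1]
    return {base: Fraction(2, 5) if base == last else Fraction(1, 5)
            for base in ALPHABET}

def encode(sequence, model):
    """Narrow [low, low + width) once per nucleotide using the model's probabilities."""
    low, width = Fraction(0), Fraction(1)
    for i, symbol in enumerate(sequence):
        probs = model(sequence[:i])
        for base in ALPHABET:
            if base == symbol:
                width *= probs[base]
                break
            low += width * probs[base]
    return low, width  # any number inside the interval identifies the sequence

def decode(code, length, model):
    """Recover the sequence by replaying the same predictions on the decoded prefix."""
    decoded = ""
    low, width = Fraction(0), Fraction(1)
    for _ in range(length):
        probs = model(decoded)
        for base in ALPHABET:
            span = width * probs[base]
            if code < low + span:
                decoded += base
                width = span
                break
            low += span
    return decoded

low, width = encode("GGCTA", toy_model)
code = low + width / 2            # a point inside the final interval
restored = decode(code, 5, toy_model)
```

The final interval width equals the probability the model assigns to the whole sequence, and the decoder only works because it queries the same model with the same prefix at every step.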
The gene compression method based on the prediction model provided by the embodiment of the present invention can effectively improve the compression of gene sequences. Table 1 shows the experimental results provided by the embodiments of the present invention: under the same conditions, the solution proposed by the embodiments of the present invention reaches a bpb of 0.011, which is lower than that of gene compression methods based on LSTM or bidirectional LSTM.
The gene expression prediction device provided by the present invention is described below; the gene expression prediction device described below and the gene expression prediction method described above may be referred to in correspondence with each other.
Fig. 5 is a schematic structural diagram of a gene expression prediction apparatus according to an embodiment of the present invention, and as shown in fig. 5, the gene expression prediction apparatus according to the embodiment of the present invention includes an obtaining unit 510 and a prediction unit 520:
an obtaining unit 510, configured to obtain a gene sequence segment to be predicted;
the prediction unit 520 is configured to input the gene sequence segment to be predicted into a prediction model, and obtain a prediction result output by the prediction model;
the prediction model is constructed based on a multi-head self-attention mechanism, the prediction model is obtained based on a gene sequence fragment sample and a prediction tag through training, and the prediction tag is a nucleotide corresponding to the gene sequence fragment sample.
Optionally, the predicting unit 520 is configured to input the gene sequence segment to be predicted into a prediction model, and obtain a prediction result output by the prediction model, and includes:
the prediction unit 520 is used for extracting the characteristics of the gene sequence segment to be predicted to obtain standard gene characteristics;
a prediction unit 520, configured to perform multi-head self-attention weight calculation on the standard gene features to obtain an attention representation;
a prediction unit 520, configured to predict the gene sequence segment to be predicted based on the attention representation, and obtain the prediction result, where the prediction result includes a nucleotide type and a probability corresponding to the nucleotide type.
Optionally, the predicting unit 520 is configured to perform multi-head self-attention weight calculation on the standard gene feature to obtain an attention representation, and includes:
a prediction unit 520, configured to obtain a spliced representation based on the standard gene feature and a pre-stored historical gene feature;
a prediction unit 520, configured to obtain a query vector, a key vector, and a value vector corresponding to the standard gene feature based on the standard gene feature and the spliced representation;
a prediction unit 520 for obtaining an initial attention representation based on the query vector, the key vector and the value vector;
a prediction unit 520, configured to perform a normalization operation on the initial attention representation to obtain the attention representation.
Optionally, the predicting unit 520 is configured to perform feature extraction on the gene sequence segment to be predicted to obtain a standard gene feature, and includes:
the prediction unit 520 is used for coding the gene sequence segment to be predicted to obtain coding representation;
a prediction unit 520, configured to perform initial feature extraction on the encoded representation to obtain initial gene features;
a prediction unit 520, configured to perform a maximal pooling operation on the initial gene features to obtain pooled gene features;
and the prediction unit 520 is used for carrying out standardization operation on the pooled gene characteristics to obtain standard gene characteristics.
Optionally, the prediction unit 520 is configured to perform a normalization operation, and includes:
a prediction unit 520, configured to perform batch normalization operations based on a preset batch normalization formula;
the preset batch standardization formula is as follows:
wherein $x_i$ is the data to be normalized, $\mu$ is the mean parameter, $\sigma^2$ is the variance parameter, the mean parameter and the variance parameter are determined based on the gene sequence fragment sample, $\epsilon$ is a hyperparameter, $a$ is a first model parameter, and $b$ is a second model parameter.
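A sketch of a batch standardization step that uses exactly the symbols defined above is given here. It assumes the standard batch normalization form, $a \cdot (x_i - \mu) / \sqrt{\sigma^2 + \epsilon} + b$, which is an assumption rather than a formula taken from this document; the function name and the input values are likewise illustrative.

```python
import math

def batch_normalize(values, a=1.0, b=0.0, eps=1e-5):
    """Apply a * (x - mu) / sqrt(var + eps) + b to every value.

    mu and var are computed from the given batch here; the text says they are
    determined from the gene sequence fragment samples, so a deployed model
    would presumably reuse statistics gathered during training.
    """
    mu = sum(values) / len(values)
    var = sum((x - mu) ** 2 for x in values) / len(values)
    return [a * (x - mu) / math.sqrt(var + eps) + b for x in values]

normalized = batch_normalize([1.0, 2.0, 3.0, 4.0])
```

With `a = 1` and `b = 0` the output has approximately zero mean and unit variance; training would adjust `a` and `b` to rescale and shift it.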
Optionally, the gene expression prediction device further comprises:
and the coding unit is used for carrying out arithmetic coding on the basis of the prediction result and the gene sequence segment to be predicted to obtain a compressed gene sequence.
Optionally, the gene expression prediction device further comprises:
and the decoding unit is used for performing arithmetic decoding on the compressed gene sequence based on the prediction result to obtain a decoded gene sequence.
It should be noted that, the apparatus provided in the embodiment of the present invention can implement all the method steps implemented by the method embodiment and achieve the same technical effect, and detailed descriptions of the same parts and beneficial effects as the method embodiment in this embodiment are omitted here.
Fig. 6 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 6: a processor (processor)610, a communication Interface (Communications Interface)620, a memory (memory)630 and a communication bus 640, wherein the processor 610, the communication Interface 620 and the memory 630 communicate with each other via the communication bus 640. The processor 610 may invoke logic instructions in the memory 630 to perform a method of gene expression prediction comprising: obtaining a gene sequence segment to be predicted; inputting the gene sequence segment to be predicted into a prediction model to obtain a prediction result output by the prediction model; the prediction model is constructed based on a multi-head self-attention mechanism, the prediction model is obtained based on a gene sequence fragment sample and a prediction tag through training, and the prediction tag is a nucleotide corresponding to the gene sequence fragment sample.
In addition, the logic instructions in the memory 630 may be implemented in the form of software functional units and, when sold or used as an independent product, stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention in essence, or the part of it that contributes over the prior art, or a part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, wherein, when the computer program is executed by a processor, the computer is capable of executing the gene expression prediction method provided by the above methods, the method comprising: obtaining a gene sequence segment to be predicted; inputting the gene sequence segment to be predicted into a prediction model to obtain a prediction result output by the prediction model; the prediction model is constructed based on a multi-head self-attention mechanism, the prediction model is obtained by training based on a gene sequence fragment sample and a prediction label, and the prediction label is a nucleotide corresponding to the gene sequence fragment sample.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program, which when executed by a processor, implements a method for predicting gene expression provided by the above methods, including: obtaining a gene sequence segment to be predicted; inputting the gene sequence segment to be predicted into a prediction model to obtain a prediction result output by the prediction model; the prediction model is constructed based on a multi-head self-attention mechanism, the prediction model is obtained based on a gene sequence fragment sample and a prediction tag through training, and the prediction tag is a nucleotide corresponding to the gene sequence fragment sample.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment may be implemented by software plus a necessary general hardware platform, and may also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, and not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (10)
1. A method for predicting gene expression, comprising:
obtaining a gene sequence segment to be predicted;
inputting the gene sequence segment to be predicted into a prediction model to obtain a prediction result output by the prediction model;
the prediction model is constructed based on a multi-head self-attention mechanism, the prediction model is obtained by training based on a gene sequence fragment sample and a prediction label, and the prediction label is a nucleotide corresponding to the gene sequence fragment sample.
2. The method for predicting gene expression according to claim 1, wherein the inputting the gene sequence segment to be predicted into a prediction model to obtain a prediction result output by the prediction model comprises:
performing characteristic extraction on the gene sequence segment to be predicted to obtain standard gene characteristics;
performing multi-head self-attention weight calculation on the standard gene features to obtain an attention representation;
predicting the gene sequence segment to be predicted based on the attention representation to obtain the prediction result, wherein the prediction result comprises a nucleotide type and a probability corresponding to the nucleotide type.
3. The method for predicting gene expression according to claim 2, wherein the performing multi-head self-attention weight calculation on the standard gene features to obtain an attention representation comprises:
obtaining a spliced representation based on the standard gene features and pre-stored historical gene features;
obtaining a query vector, a key vector and a value vector corresponding to the standard gene features based on the standard gene features and the spliced representation;
obtaining an initial attention representation based on the query vector, the key vector, and the value vector;
and carrying out a standardization operation on the initial attention representation to obtain the attention representation.
4. The method for predicting gene expression according to claim 2, wherein the extracting the characteristics of the gene sequence segment to be predicted to obtain the standard gene characteristics comprises:
encoding the gene sequence segment to be predicted to obtain a coded representation;
performing initial feature extraction on the coded representation to obtain initial gene features;
performing a max pooling operation on the initial gene features to obtain pooled gene features;
and performing a standardization operation on the pooled gene features to obtain the standard gene features.
5. The method of predicting gene expression according to claim 3 or 4, wherein the performing of the normalization operation comprises:
performing batch standardization operation based on a preset batch standardization formula;
the preset batch standardization formula is as follows:
wherein $x_i$ is the data to be normalized, $\mu$ is the mean parameter, $\sigma^2$ is the variance parameter, the mean parameter and the variance parameter are determined based on the gene sequence fragment sample, $\epsilon$ is a hyperparameter, $a$ is a first model parameter, and $b$ is a second model parameter.
6. The method of predicting gene expression according to any one of claims 1 to 4, further comprising:
and carrying out arithmetic coding on the basis of the prediction result and the gene sequence segment to be predicted to obtain a compressed gene sequence.
7. The method of predicting gene expression of claim 6, further comprising:
and performing arithmetic decoding on the compressed gene sequence based on the prediction result to obtain a decoded gene sequence.
8. A gene expression prediction apparatus comprising:
an acquisition unit, configured to acquire a gene sequence segment to be predicted;
the prediction unit is used for inputting the gene sequence segment to be predicted into a prediction model to obtain a prediction result output by the prediction model;
the prediction model is constructed based on a multi-head self-attention mechanism, the prediction model is obtained based on a gene sequence fragment sample and a prediction tag through training, and the prediction tag is a nucleotide corresponding to the gene sequence fragment sample.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the gene expression prediction method according to any one of claims 1 to 7 when executing the program.
10. A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the gene expression prediction method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210613683.2A CN115019876A (en) | 2022-05-31 | 2022-05-31 | Gene expression prediction method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115019876A true CN115019876A (en) | 2022-09-06 |
Family
ID=83070998
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210613683.2A Pending CN115019876A (en) | 2022-05-31 | 2022-05-31 | Gene expression prediction method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115019876A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116580767A (en) * | 2023-04-26 | 2023-08-11 | 之江实验室 | Gene phenotype prediction method and system based on self-supervision and Transformer |
CN116580767B (en) * | 2023-04-26 | 2024-03-12 | 之江实验室 | Gene phenotype prediction method and system based on self-supervision and Transformer |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||