CN115019876A - Gene expression prediction method and device - Google Patents

Publication number
CN115019876A
Authority
CN
China
Prior art keywords
gene
prediction
gene sequence
predicted
expression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210613683.2A
Other languages
Chinese (zh)
Inventor
王岩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202210613683.2A
Publication of CN115019876A
Legal status: Pending

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G16B35/00 ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
    • G16B35/20 Screening of libraries
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B5/00 ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks

Abstract

The embodiment of the invention provides a method and a device for predicting gene expression. The method comprises: obtaining a gene sequence segment to be predicted; and inputting the gene sequence segment to be predicted into a prediction model to obtain a prediction result output by the prediction model. The prediction model is constructed based on a multi-head self-attention mechanism and is obtained by training on gene sequence fragment samples and prediction labels, where each prediction label is the nucleotide corresponding to a gene sequence fragment sample. Because the gene expression is predicted by a prediction model built on a multi-head self-attention mechanism, the gene prediction time is reduced and the gene prediction efficiency is improved; moreover, the relation among multiple nucleotides in the gene sequence can be obtained by calculating the mutual influence among the nucleotides, which improves the accuracy of the gene prediction.

Description

Gene expression prediction method and device
Technical Field
The invention relates to the technical field of computers, in particular to a method and a device for predicting gene expression.
Background
Gene prediction uses existing theory and known gene sequence information to predict, by computer simulation and calculation, the gene structure and function of an unknown sequence. Gene prediction can be applied in many fields such as agriculture, medicine and biology, and plays an important role in social development.
The gene prediction methods in the prior art mainly include homology-based prediction and de novo prediction. Homology-based prediction requires multiple passes over the target gene sequence to match it against a reference gene, which is time-consuming. De novo prediction relies on the characteristics of the given sequence, and existing methods that extract features from gene data suffer from low accuracy.
Disclosure of Invention
The invention provides a gene expression prediction method, which is used for overcoming the defect of low gene prediction accuracy in the prior art and improving the gene prediction accuracy.
In a first aspect, the present invention provides a method for predicting gene expression, comprising:
obtaining a gene sequence segment to be predicted;
inputting the gene sequence segment to be predicted into a prediction model to obtain a prediction result output by the prediction model;
the prediction model is constructed based on a multi-head self-attention mechanism, the prediction model is obtained by training based on gene sequence fragment samples and prediction labels, and the prediction label is the nucleotide corresponding to the gene sequence fragment sample.
Optionally, the inputting the gene sequence segment to be predicted into a prediction model to obtain a prediction result output by the prediction model includes:
performing characteristic extraction on the gene sequence segment to be predicted to obtain standard gene characteristics;
performing multi-head self-attention weight calculation on the standard gene characteristics to obtain attention expression;
predicting the gene sequence segment to be predicted based on the attention expression, and obtaining a prediction result, wherein the prediction result comprises a nucleotide type and a probability corresponding to the nucleotide type.
Optionally, the performing multi-head self-attention weight calculation on the standard gene features to obtain an attention representation includes:
obtaining a splicing expression based on the standard gene characteristics and pre-stored historical gene characteristics;
obtaining a query vector, a key vector and a value vector corresponding to the standard gene feature based on the standard gene feature and the splicing expression;
obtaining an initial attention representation based on the query vector, the key vector, and the value vector;
and carrying out a standardization operation on the initial attention representation to obtain the attention representation.
Optionally, the performing feature extraction on the gene sequence segment to be predicted to obtain a standard gene feature includes:
coding the gene sequence segment to be predicted to obtain coding representation;
performing initial feature extraction on the coding expression to obtain initial gene features;
performing maximum pooling operation on the initial gene characteristics to obtain pooled gene characteristics;
and carrying out standardized operation on the pooled gene characteristics to obtain standard gene characteristics.
Optionally, the performing a normalization operation includes:
performing batch standardization operation based on a preset batch standardization formula;
the preset batch standardization formula is as follows:
$$y_i = a \cdot \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} + b$$
wherein $x_i$ is the data to be normalized, $y_i$ is the normalized output, $\mu$ is the mean parameter, $\sigma^2$ is the variance parameter, the mean parameter and the variance parameter are determined based on the gene sequence fragment samples, $\epsilon$ is a hyperparameter, $a$ is a first model parameter, and $b$ is a second model parameter.
Optionally, the method further comprises:
and carrying out arithmetic coding on the basis of the prediction result and the gene sequence segment to be predicted to obtain a compressed gene sequence.
Optionally, the method further comprises:
and performing arithmetic data decoding on the compressed gene sequence based on the prediction result to obtain a decoded gene sequence.
In a second aspect, the present invention also provides a gene expression prediction apparatus comprising:
an acquisition unit for acquiring a gene sequence segment to be predicted;
the prediction unit is used for inputting the gene sequence segment to be predicted into a prediction model to obtain a prediction result output by the prediction model;
the prediction model is constructed based on a multi-head self-attention mechanism, the prediction model is obtained by training based on gene sequence fragment samples and prediction labels, and the prediction label is the nucleotide corresponding to the gene sequence fragment sample.
In a third aspect, the present invention also provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the method for predicting gene expression according to the first aspect.
In a fourth aspect, the present invention also provides a non-transitory computer readable storage medium, on which a computer program is stored, which computer program, when executed by a processor, implements the gene expression prediction method according to the first aspect.
According to the gene expression prediction method and device provided by the embodiment of the invention, the gene expression is predicted through the prediction model constructed based on the multi-head self-attention mechanism, a reference gene is not needed, and only one access is needed for a target gene sequence, so that the gene prediction time is reduced, and the gene prediction efficiency is improved; and the multi-head self-attention mechanism is used for connecting at least two single-head self-attention mechanisms and extracting characteristics from at least two directions to gene data, and the self-attention mechanism can obtain the relation among a plurality of nucleotides in a gene sequence by calculating the mutual influence among the nucleotides, so that the prediction accuracy of the next nucleotide is improved.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of a method for predicting gene expression according to an embodiment of the present invention;
FIG. 2 is a second schematic flow chart of a method for predicting gene expression according to an embodiment of the present invention;
FIG. 3 is a third schematic flow chart of a method for predicting gene expression according to an embodiment of the present invention;
FIG. 4 is a fourth flowchart of a method for predicting gene expression according to an embodiment of the present invention;
FIG. 5 is a schematic structural view of a gene expression prediction apparatus according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The technical terms related to the invention are described as follows:
Gene compression: techniques for encoding gene sequences into data that requires less storage space and that can be decoded back into the original gene sequences.
Nucleotide: nucleic acids include ribonucleic acid (RNA) and deoxyribonucleic acid (DNA); DNA is the genetic material of most organisms, and RNA performs multiple complex functions. The monomers constituting nucleic acids are called nucleotides, which are classified into deoxyribonucleotides and ribonucleotides according to whether the five-carbon sugar is deoxygenated.
Deoxynucleotide: deoxynucleotides (deoxyribonucleotides) are the basic units of deoxyribonucleic acid (DNA). They are small-molecule compounds consisting of a purine or pyrimidine base, deoxyribose and phosphate, and are the material basis of DNA, the genetic material of organisms. The diversity of organisms is determined by differences in the order of adenine (A), thymine (T), cytosine (C) and guanine (G) among the deoxynucleotides. The four bases are arranged along the inside of the long DNA strand, and their sequence stores the genetic information.
Embedding (Embedding): embedding is a feature extracted from the raw data, i.e., the low-dimensional vector after it has been mapped through the neural network.
Encoder (Encoder): encoder, as its name implies, encodes input data and converts the input data into an intermediate representation through a non-linear transformation.
Attention (attention): attention is a very common but often overlooked phenomenon. For example, when a bird flies across the sky, human attention tends to follow the bird, and the sky naturally becomes background information in the human visual system. The basic idea of the attention mechanism in computer vision is to let the system learn to focus on the important information and ignore the background information.
For homology-based gene prediction, multiple passes over the target gene sequence are needed to match the reference gene, which is time-consuming. For de novo gene prediction methods, such as those based on the LSTM model or the bidirectional LSTM model, the features of the gene sequence can only be observed from one direction or two directions because the model is insufficiently complex. Experiments have shown that the scheme proposed here can outperform the LSTM-based model under the same conditions.
In order to solve the above problems, embodiments of the present invention provide a method for predicting gene expression, which predicts a gene sequence based on a multi-head self-attention mechanism, does not need a reference gene, shortens the prediction time of gene expression, and improves the accuracy of gene expression prediction.
The method for predicting gene expression provided by the embodiment of the present invention will be described below with reference to FIGS. 1 to 4.
Fig. 1 is a schematic flow chart of a gene expression prediction method provided in an embodiment of the present invention, and as shown in fig. 1, the embodiment of the present invention provides a gene expression prediction method, including:
step 110, obtaining a gene sequence segment to be predicted;
Specifically, the gene sequence segment to be predicted may be a gene sequence consisting of N consecutive nucleotides in one single strand of a double-stranded DNA, wherein N is a positive integer. For example, a gene sequence segment to be predicted may include 10 consecutive deoxynucleotides: ACTGAGTCCG. Optionally, N is 64.
It should be understood that the embodiment of the present invention does not limit the specific obtaining manner of the gene sequence segment to be predicted, for example, the gene sequence segment to be predicted may be stored in a set region (e.g., a database), and the gene sequence segment to be predicted may be obtained by accessing the set region. In other embodiments, the gene sequence segment to be predicted may be obtained by a gene collecting device, and the gene collecting device may include a blood collector, a saliva collector, a skin collector, and the like. Other methods can be used to obtain the gene sequence segment to be predicted by those skilled in the art, and are not described herein.
Step 120, inputting the gene sequence segment to be predicted into a prediction model to obtain a prediction result output by the prediction model;
the prediction model is constructed based on a multi-head self-attention mechanism, the prediction model is obtained by training based on gene sequence fragment samples and prediction labels, and the prediction label is the nucleotide corresponding to the gene sequence fragment sample.
Specifically, the prediction result may be the next nucleotide of the gene sequence segment to be predicted. A gene sequence fragment sample is a previously obtained gene sequence fragment whose next nucleotide is known, and the prediction label is that next nucleotide. Illustratively, given a known gene sequence of length 126 (126 consecutive nucleotides), the 0th to 63rd deoxynucleotides can be used as a first gene sequence fragment sample, with the 64th deoxynucleotide as its prediction label; the 1st to 64th deoxynucleotides can be used as a second gene sequence fragment sample, with the 65th deoxynucleotide as its prediction label; and so on, so that a plurality of gene sequence fragment samples can be obtained from the gene sequence of length 126.
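The sample construction described above can be sketched as a sliding window over a known sequence. This is a minimal illustration only: the function name `make_samples` and the repeated-motif test sequence are hypothetical and do not come from the patent.

```python
def make_samples(sequence, window=64):
    """Return (fragment, next_nucleotide) pairs from one known sequence."""
    samples = []
    # Each window of `window` nucleotides is paired with the nucleotide
    # that immediately follows it, which serves as the prediction label.
    for start in range(len(sequence) - window):
        fragment = sequence[start:start + window]
        label = sequence[start + window]
        samples.append((fragment, label))
    return samples


# Hypothetical 126-nucleotide sequence built by repeating a short motif.
sequence = ("ACTGAGTCCG" * 13)[:126]
samples = make_samples(sequence)
first_fragment, first_label = samples[0]
```

With a window of 64 and a sequence of length 126, the first sample covers positions 0 to 63 with position 64 as its label, matching the example in the text.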
The self-attention mechanism is a variant of the attention mechanism, can capture the internal correlation of data, and can be applied to gene prediction to obtain the relation among a plurality of nucleotides in a gene sequence by calculating the mutual influence among the nucleotides. The multi-head self-attention mechanism is characterized in that at least two single-head self-attention mechanisms are connected, and characteristics are extracted from gene data from at least two directions.
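To make the idea of connecting several single-head self-attention mechanisms concrete, the following is a simplified sketch in plain Python. It assumes scaled dot-product attention and identity projections (query, key and value are all the head's slice of the input); a trained model would use learned projection matrices, and the patent's variant additionally uses relative position terms that are omitted here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def single_head_attention(q, k, v):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(q[0])
    outputs, weights = [], []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors, one output channel at a time.
        outputs.append([sum(wj * vj[c] for wj, vj in zip(w, v))
                        for c in range(len(v[0]))])
    return outputs, weights


def multi_head_self_attention(x, num_heads):
    """Split channels into heads, attend within each head, concatenate."""
    d_model = len(x[0])
    assert d_model % num_heads == 0
    d_head = d_model // num_heads
    head_outputs = []
    for h in range(num_heads):
        part = [row[h * d_head:(h + 1) * d_head] for row in x]
        # Assumption: identity projections, so q = k = v = part.
        out, _ = single_head_attention(part, part, part)
        head_outputs.append(out)
    # Concatenate the per-head outputs position by position.
    return [sum((head[i] for head in head_outputs), []) for i in range(len(x))]


# Hypothetical features for three positions with four channels each.
x = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
y = multi_head_self_attention(x, num_heads=2)
```

Each head sees a different slice of the channels, so each head can weight the other positions differently before the results are joined.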
It is understood that gene sequence segments to be predicted, together with their prediction results, may be input to the prediction model cyclically until the complete gene sequence is obtained. In one embodiment, the first gene sequence segment to be predicted comprises the 0th to 63rd deoxynucleotides and is input into the prediction model to obtain its prediction result, namely the 64th deoxynucleotide; in the next round, the 1st to 64th deoxynucleotides are input into the prediction model as the second gene sequence segment to be predicted to obtain its prediction result, namely the 65th deoxynucleotide; and so on, until the complete gene sequence is obtained.
According to the gene expression prediction method provided by the embodiment of the invention, the gene expression is predicted through the prediction model constructed based on the multi-head self-attention mechanism, a reference gene is not needed, and only one access is needed for a target gene sequence, so that the gene prediction time is reduced, and the gene prediction efficiency is improved; moreover, the characteristics of the gene data can be extracted from multiple directions, and the relation between nucleotides in the gene sequence segment to be predicted can be obtained, so that the prediction accuracy of the next nucleotide is improved.
In the following, a possible implementation manner of the above steps in a specific embodiment is further described.
Optionally, the inputting the gene sequence segment to be predicted into a prediction model to obtain a prediction result output by the prediction model includes:
step 121, extracting the characteristics of the gene sequence segment to be predicted to obtain standard gene characteristics;
After the gene sequence segment to be predicted is obtained, a feature extraction operation can be performed on it to obtain gene features. Optionally, the gene features are standardized: based on column statistics of the samples in the training set (i.e., all or part of the gene sequence fragment samples used in the training process), the mean is subtracted from the data and the result is divided by the standard deviation, so that the data is centered around 0 with a variance of 1. Standardization can speed up convergence during model optimization in the training stage and can prevent samples with large variance from having an excessive influence on model training.
Step 122, performing multi-head self-attention weight calculation on the standard gene characteristics to obtain attention expression;
Specifically, multi-head self-attention weight calculation refers to calculating the weights corresponding to the standard gene features by a multi-head self-attention mechanism. Multi-head self-attention means that, for each feature element, its corresponding attention weight is found from multiple directions. The attention representation is the feature obtained by weighting the standard gene features with the multiple self-attention weights.
And 123, predicting the gene sequence segment to be predicted based on the attention expression, and obtaining the prediction result, wherein the prediction result comprises a nucleotide type and a probability corresponding to the nucleotide type.
Specifically, a fully connected layer can be used as the classifier: the attention representation is input to the fully connected layer and then passed through the softmax layer to generate the probability of the next nucleotide.
Figure BDA0003672721360000091
Figure BDA0003672721360000092
where
Figure BDA0003672721360000093
denotes the attention representation after classification through the fully connected layer, $z_i$ denotes the vector corresponding to the i-th nucleotide in the attention representation, $w_j$ denotes the matrix of a one-dimensional convolution kernel, and $b_j$ denotes the bias matrix,
Figure BDA0003672721360000094
Illustratively, the gene sequence segment to be predicted is input into the prediction model, and the obtained prediction result is as follows: A, 80%; T, 14%; C, 4%; G, 2%.
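The step from classifier outputs to per-nucleotide probabilities can be illustrated with a plain softmax. The logit values below are hypothetical, chosen only so that A comes out as the most likely nucleotide; they are not taken from the patent.

```python
import math

NUCLEOTIDES = ("A", "T", "C", "G")


def softmax(logits):
    """Numerically stable softmax: outputs are positive and sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_distribution(logits):
    """Map four classifier outputs to a nucleotide -> probability dict."""
    return dict(zip(NUCLEOTIDES, softmax(logits)))


# Hypothetical outputs of the fully connected layer for one position.
logits = [2.0, 0.3, -1.0, -1.7]
distribution = predict_distribution(logits)
most_likely = max(distribution, key=distribution.get)
```

The resulting dictionary has the same shape as the prediction result described above: one nucleotide type per key, with its probability as the value.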
Optionally, the end of gene expression prediction may be controlled by defining a window, i.e., by limiting the gene sequence segments that are input into the model. Illustratively, if prediction is to stop at a certain nucleotide, the 64 nucleotides preceding that nucleotide are input as the final window, and no further segments are input afterwards.
According to the gene expression prediction method provided by the embodiment of the invention, the gene expression is predicted through the prediction model constructed based on the multi-head self-attention mechanism, a reference gene is not needed, and only one pass over the target gene sequence is needed, so that the gene prediction time is reduced and the gene prediction efficiency is improved; moreover, features can be extracted from the gene data from multiple directions, and the relation between nucleotides in the gene sequence segment to be predicted is obtained, so that the prediction accuracy of the next nucleotide is improved; standardization prevents samples with large variance from having an excessive influence on the model; and the predicted nucleotide types and their corresponding probabilities, obtained through the fully connected layer and the softmax layer, can be used for gene sequence compression.
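How predicted probabilities can drive compression may be illustrated with a toy arithmetic coder. This is a sketch under stated assumptions, not the patent's implementation: it uses exact rational arithmetic instead of a practical bit-level coder, and `toy_model` is a hypothetical stand-in for the trained prediction model.

```python
from fractions import Fraction

SYMBOLS = "ACGT"


def toy_model(prefix):
    """Hypothetical predictor: favours repeating the last nucleotide."""
    if not prefix:
        return {s: Fraction(1, 4) for s in SYMBOLS}
    probs = {s: Fraction(1, 6) for s in SYMBOLS}
    probs[prefix[-1]] = Fraction(1, 2)
    return probs


def encode(sequence, model):
    """Narrow the interval [low, low + width) once per nucleotide."""
    low, width = Fraction(0), Fraction(1)
    for i, symbol in enumerate(sequence):
        probs = model(sequence[:i])
        for s in SYMBOLS:
            if s == symbol:
                width *= probs[s]
                break
            low += width * probs[s]
    # Any number inside the final interval identifies the sequence.
    return low + width / 2


def decode(code, length, model):
    """Replay the same model to recover the nucleotides from the code."""
    low, width = Fraction(0), Fraction(1)
    decoded = ""
    for _ in range(length):
        probs = model(decoded)
        for s in SYMBOLS:
            span = width * probs[s]
            if low <= code < low + span:
                width = span
                decoded += s
                break
            low += span
    return decoded


original = "AACCGGTTAA"
code = encode(original, toy_model)
restored = decode(code, len(original), toy_model)
```

The better the model predicts each next nucleotide, the wider the final interval stays and the fewer bits are needed to name a number inside it, which is why prediction accuracy translates into compression ratio.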
FIG. 2 is a second schematic flow chart of a gene expression prediction method provided in an embodiment of the present invention, and a possible implementation manner of the above steps in an embodiment is further described with reference to FIG. 2.
Optionally, the performing multi-head self-attention weight calculation on the standard gene feature to obtain an attention representation includes:
step 1221, obtaining a splicing expression based on the standard gene characteristics and pre-stored historical gene characteristics;
the splicing process can be shown as follows:
$$\tilde{h} = [\mathrm{SG}(m) \circ h]$$
where $h$ is the standard gene feature, i.e., the representation of the gene sequence segment to be predicted that is input at the current moment; $\mathrm{SG}(\cdot)$ denotes the stop-gradient operation; $\circ$ denotes the splicing (concatenation) operation in the channel dimension; $m$ is the historical gene feature, i.e., the representation of the historical gene sequence segment input at the previous moment; and $\tilde{h}$ is the spliced representation.
Illustratively, in the first round, the standard gene characteristics corresponding to the gene sequence segments to be predicted in the first round, that is, the gene characteristics corresponding to the 0 th nucleotide to the 63 rd nucleotide, are input at the current time, and since the gene expression prediction is performed in the first round, the historical gene sequence segments may be empty; in the second round, the standard gene characteristics corresponding to the gene sequence segment to be predicted in the second round, that is, the gene characteristics corresponding to the 1 st nucleotide to the 64 th nucleotide, are input at the current time, and the historical gene characteristics are the gene characteristics corresponding to the 0 th nucleotide to the 63 rd nucleotide input in the previous round.
Step 1222, based on the standard gene feature and the splicing expression, obtaining a query vector, a key vector and a value vector corresponding to the standard gene feature;
the process can be represented by the following formula:
$$q = h W_q^{\top}$$
$$k = \tilde{h} W_k^{\top}$$
$$v = \tilde{h} W_v^{\top}$$
wherein $q$ represents the query vector, $k$ represents the key vector, and $v$ represents the value vector; $h$ is the standard gene feature and $\tilde{h}$ is the spliced representation; $W_q$ represents the matrix used for generating the query vector, $W_k$ represents the matrix used for generating the key vector, and $W_v$ represents the matrix used for generating the value vector.
A step 1223 of obtaining an initial attention representation based on the query vector, the key vector and the value vector;
the process can be represented by the following formula:
Figure BDA0003672721360000111
wherein $R_{i-j}$ is the relative position encoding. The relative position relation uses a position encoding matrix
Figure BDA0003672721360000112
whose i-th row represents the position vector for a relative position interval of i and is generated by a sine function. $A_{i,j}$ denotes the result after position encoding.
It is understood that relative position encoding refers to determining the encoding based on the relative position between the individual nucleotides in the gene sequence (e.g., how many nucleotides apart they are). Illustratively, when the vector being processed corresponds to a nucleotide a, row i represents the vector for nucleotides located i positions away from a.
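A position encoding matrix of this kind can be sketched as follows. The patent only states that the rows are generated by a sine function; the interleaved sine/cosine layout and the base of 10000 used here are assumptions borrowed from the standard Transformer sinusoidal encoding, and the function name is hypothetical.

```python
import math


def relative_position_matrix(max_distance, dim):
    """Row i holds the position vector for a relative distance of i."""
    table = []
    for i in range(max_distance + 1):
        row = []
        for c in range(dim):
            # Channels 2k and 2k+1 share one frequency (assumed layout).
            freq = 1.0 / (10000 ** ((c - c % 2) / dim))
            angle = i * freq
            row.append(math.sin(angle) if c % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


# Distances 0..63 cover every pair of positions inside a 64-nucleotide window.
R = relative_position_matrix(max_distance=63, dim=8)
```

Because row i depends only on the distance i, two nucleotide pairs that are the same number of positions apart receive the same position vector regardless of where they sit in the sequence.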
Step 1224 of normalizing the initial attention representation to obtain the attention representation.
Optionally, the standardization operation is batch standardization.
The batch normalization process may be represented by the following formula:
$$y_i = a \cdot \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} + b$$
where $x_i$ is the data to be normalized and $y_i$ is the normalized output; $\mu$ is the mean in the channel dimension, i.e., the mean parameter; $\sigma^2$ is the variance in the channel dimension, i.e., the variance parameter; $\epsilon$ is a hyperparameter; and $a$ and $b$ are the first and second model parameters. The mean parameter and the variance parameter are determined based on the gene sequence fragment samples: during training they are computed from a subset (batch) of the training samples, and in the trained model they are computed from all training samples.
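The batch standardization step can be sketched in a few lines, assuming the standard batch-normalization form $y_i = a (x_i - \mu) / \sqrt{\sigma^2 + \epsilon} + b$. The function names and the sample values are hypothetical; `a` and `b` would be learned during training rather than fixed.

```python
import math


def batch_statistics(values):
    """Mean and (population) variance of one channel over a batch."""
    mu = sum(values) / len(values)
    var = sum((x - mu) ** 2 for x in values) / len(values)
    return mu, var


def batch_normalize(values, mu, var, a=1.0, b=0.0, eps=1e-5):
    """Apply y = a * (x - mu) / sqrt(var + eps) + b to each value."""
    return [a * (x - mu) / math.sqrt(var + eps) + b for x in values]


# Training: statistics come from the current batch.
batch = [2.0, 4.0, 6.0, 8.0]
mu, var = batch_statistics(batch)
normalized = batch_normalize(batch, mu, var)

# Inference: the same fixed statistics are reused for new data.
new_values = batch_normalize([5.0, 10.0], mu, var)
```

The small constant `eps` only guards against division by zero when a channel has almost no variance, which is why its exact value matters little.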
Alternatively, the normalized initial attention representation may be input to the fully-connected layer, resulting in an attention representation of the fully-connected layer output. The full-connection layer can improve the nonlinear expression capability of the prediction model, so that the learning capability and the expression capability of the model are improved.
In one embodiment, the standard gene features are input into a Transformer-XL encoder, which may be composed of two parts: a self-attention module and a fully connected layer.
The process of the self-attention module is as follows:
Figure BDA0003672721360000114
Figure BDA0003672721360000115
Figure BDA0003672721360000121
Figure BDA0003672721360000122
Figure BDA0003672721360000123
Figure BDA0003672721360000124
where $h$ is the standard gene feature, i.e., the representation of the gene sequence segment to be predicted that is input at the current moment; $\mathrm{SG}(\cdot)$ denotes the stop-gradient operation; $\circ$ denotes the splicing operation in the channel dimension; $m$ is the historical gene feature, i.e., the representation of the historical gene sequence segment input at the previous moment;
Figure BDA0003672721360000125
is the spliced representation; $q$ represents the query vector, $k$ represents the key vector, and $v$ represents the value vector; $W_q$ denotes the matrix used to generate the query vector, $W_k$ denotes the matrix used to generate the key vector, $W_v$ denotes the matrix used to generate the value vector, and $W_{k,R}$ denotes a location-based vector; $R_{i-j}$ is the relative position encoding, and the relative position relation uses a position encoding matrix
Figure BDA0003672721360000126
whose i-th row represents the position vector for a relative position interval of i and is generated by a sine function. $A_{i,j}$ denotes the result after position encoding. $\mu$ is the mean in the channel dimension, i.e., the mean parameter; $\sigma^2$ is the variance in the channel dimension, i.e., the variance parameter; $\epsilon$ is a hyperparameter with a very small value, which may range from 1e-06 to 1e-04 and is optionally 1e-05; $a$ is a first model parameter and $b$ is a second model parameter, and both are updated during model training.
Figure BDA0003672721360000127
denotes the result of applying layer normalization to $A_{i,j}$, namely the normalized initial attention representation.
The fully connected layer contains one hidden layer; it transforms the embedding of the data in the channel dimension, which enhances the expressive capability of the model.
The process is given by the following formula:
$z_{i,j} = \delta(w_j \odot [a_i, a_{i+1}, \ldots, a_{i+k-1}] + b_j)$
where $z_{i,j}$ is the attention representation; $a_i$ is the vector corresponding to the i-th nucleotide in the normalized initial attention representation; $\delta$ is the activation function, which may optionally be the ReLU function; $w_j$ is the matrix of a one-dimensional convolution kernel; and $b_j$ is the bias matrix.
It should be understood that, on the basis of the embodiments of the present invention, adjusting the internal structure of the Transformer-XL encoder (for example, the number of nodes in the fully connected layer or the number of attention heads), or replacing Transformer-XL with a similar Transformer variant such as the vanilla Transformer or the Compressive Transformer, also falls within the scope of protection of the present invention.
According to the gene expression prediction method provided by the embodiment of the invention, gene expression is predicted by a prediction model constructed based on a multi-head self-attention mechanism, which captures the relationships between nucleotides in the gene sequence segment to be predicted. Normalization prevents samples with large variance from having an excessive influence on the model, and the fully connected layer improves the expressive power of the model, so that the accuracy of predicting the next nucleotide is improved.
Optionally, the performing feature extraction on the gene sequence segment to be predicted to obtain a standard gene feature includes:
Step 1211, encoding the gene sequence segment to be predicted to obtain an encoded representation;
Specifically, the gene sequence segment to be predicted may be in FASTA format. Illustratively, a gene sequence fragment to be predicted, such as GGCTA……, is read from a gene sequence file in FASTA format and one-hot encoded to obtain the encoded representation.
Illustratively, the encoding is as follows: A = {1,0,0,0}; C = {0,1,0,0}; G = {0,0,1,0}; T = {0,0,0,1}. It should be understood that the above is merely an example to facilitate understanding of the present invention and should not limit it in any way; for example, the correspondence in the above example may be changed as long as the encoding distinguishes the four nucleotides A, T, C and G.
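The encoding step can be sketched as follows. This is a minimal illustration that uses the example mapping above; the helper names `read_fasta_sequence` and `one_hot_encode` and the sample record are hypothetical, and the parser handles only a single simple FASTA record.

```python
# One-hot mapping taken from the example in the text.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def read_fasta_sequence(text):
    """Join all sequence lines of a FASTA record, skipping '>' header lines."""
    lines = (line.strip() for line in text.splitlines())
    return "".join(l for l in lines if l and not l.startswith(">")).upper()


def one_hot_encode(sequence):
    """Map each nucleotide to its 4-dimensional one-hot vector."""
    return [ONE_HOT[base] for base in sequence]


fasta_text = ">example_record\nGGCTA\n"
encoded = one_hot_encode(read_fasta_sequence(fasta_text))
```

Each nucleotide becomes one row of four channels, so a fragment of length L yields an L x 4 encoded representation.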
Step 1212, performing initial feature extraction on the coded representation to obtain initial gene features;
Optionally, the encoded representation is mapped into a high-dimensional space by one-dimensional convolution for feature extraction. Let
Figure BDA0003672721360000131
denote the one-hot encoded representation; the gene sequence after one-dimensional convolution, i.e., the initial gene feature, is then:
$o_{i,j} = \delta(w_j \odot [x_i, x_{i+1}, \ldots, x_{i+k-1}] + b_j)$;
where $o_{i,j}$ denotes the initial gene feature; $w_j$ is the matrix of a one-dimensional convolution kernel; $b_j$ is the bias matrix; $\delta$ is an activation function that adds a non-linear transformation to the network, optionally the ReLU function in embodiments of the invention, which outputs 0 when the input is less than 0 and outputs the input unchanged when the input is greater than or equal to 0; and $x_i$ is the i-th nucleotide in the gene sequence segment to be predicted.
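The convolution formula can be illustrated with a small pure-Python sketch. It assumes that the product of the kernel with the window is summed into one scalar per kernel, as in a standard one-dimensional convolution; the function names, the toy kernel, and the bias value are hypothetical.

```python
def relu(value):
    """ReLU: 0 for negative input, the input itself otherwise."""
    return value if value >= 0 else 0.0


def conv1d_relu(x, kernels, biases):
    """Compute o[i][j] = relu(sum(w_j * [x_i, ..., x_{i+k-1}]) + b_j).

    x       : list of L input vectors (here, one-hot rows of 4 channels)
    kernels : list of J kernels, each a k x channels matrix
    biases  : list of J scalar biases
    """
    k = len(kernels[0])
    out = []
    for i in range(len(x) - k + 1):          # slide the window over the sequence
        window = x[i:i + k]
        row = []
        for w, b in zip(kernels, biases):
            s = sum(w[t][c] * window[t][c]
                    for t in range(k) for c in range(len(window[t])))
            row.append(relu(s + b))
        out.append(row)
    return out


# One-hot rows for "GGCTA" (A, C, G, T channel order).
x = [(0, 0, 1, 0), (0, 0, 1, 0), (0, 1, 0, 0), (0, 0, 0, 1), (1, 0, 0, 0)]
# A toy kernel of width 2 that responds to the dinucleotide "GC".
gc_kernel = [[0.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
features = conv1d_relu(x, [gc_kernel], [-1.0])
```

In this toy case the only non-zero output is at the window covering "GC". A trained model would learn many kernels rather than use a hand-written one.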
Step 1213, performing maximum pooling operation on the initial gene characteristics to obtain pooled gene characteristics;
Specifically, the max pooling layer retains the largest values of the initial gene features obtained after one-dimensional convolution to obtain the pooled gene features. The pooled gene features have a shorter data length than the initial gene features, which reduces the computational complexity of the model.
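A minimal sketch of the max pooling step is shown below. The source does not state the pooling window or stride, so the pool size of 2 and the default stride equal to the pool size are assumptions, and `max_pool1d` is a hypothetical name.

```python
def max_pool1d(features, pool_size, stride=None):
    """Keep the per-channel maximum of each window along the sequence axis."""
    stride = stride or pool_size              # assumed default: non-overlapping windows
    n_channels = len(features[0])
    pooled = []
    for start in range(0, len(features) - pool_size + 1, stride):
        window = features[start:start + pool_size]
        pooled.append([max(row[c] for row in window) for c in range(n_channels)])
    return pooled


initial = [[0.0, 0.5], [1.0, 0.2], [0.0, 0.9], [0.3, 0.1]]
pooled = max_pool1d(initial, pool_size=2)
```

Here four positions are reduced to two, which shows how pooling shortens the sequence that the later attention layers have to process.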
Step 1214, standardizing the pooled gene features to obtain standard gene features.
The normalization operation and the batch normalization operation are described in step 121 and step 1221, and will not be described in detail here.
Optionally, the method further comprises:
performing arithmetic coding based on the prediction result and the gene sequence segment to be predicted to obtain a compressed gene sequence.
Fig. 3 is a third schematic flow chart of the gene expression prediction method provided in the embodiment of the present invention, and as shown in fig. 3, in an embodiment, the application of the gene expression prediction method to the encoding process may include three parts, i.e., model training, inference and arithmetic coding.
Model training: gene sequence data related to the gene sequence fragment to be predicted (e.g., gene sequences of the same species as the gene sequence fragment to be predicted) are collected as a data set, processed into FASTA format, and divided into a training set, a validation set and a test set at a ratio of 7:2:1. The constructed deep learning model is trained on the training set until it converges on the training set and performs well on the validation set.
Inference: the segment of the gene sequence preceding the target nucleotide to be predicted (i.e., the gene sequence segment to be predicted) is taken as input. The trained prediction model predicts the type of the next nucleotide of the gene sequence segment to be predicted and the probability corresponding to each type, and the occurrence probability of the gene sequence consisting of all currently known nucleotides is calculated.
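The 7:2:1 division can be sketched as follows. Shuffling before the split and the fixed seed are assumptions that the source does not specify, and `split_dataset` and the record names are hypothetical.

```python
import random


def split_dataset(records, ratios=(7, 2, 1), seed=0):
    """Shuffle records and divide them into training, validation and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)     # assumed: random split with a fixed seed
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_val = len(shuffled) * ratios[1] // total
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]         # remainder goes to the test set
    return train, val, test


records = [f"seq_{i}" for i in range(20)]
train, val, test = split_dataset(records)
```

For 20 records this yields 14, 4 and 2 records respectively, and every record lands in exactly one subset.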
Arithmetic coding: the probability finally obtained is the result of the arithmetic coding of the gene sequence.
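The relationship between the predicted probabilities and the length of the coded sequence can be sketched as follows. The sketch relies on the standard fact that an arithmetic coder spends about -log2(p) bits on a sequence of probability p; the per-step probabilities are made-up values and `sequence_code_length` is a hypothetical name.

```python
import math


def sequence_code_length(step_probs, sequence):
    """Return (sequence probability, ideal code length in bits).

    step_probs[t] maps each nucleotide to the probability predicted at step t.
    """
    log2_p = sum(math.log2(step_probs[t][base]) for t, base in enumerate(sequence))
    return 2.0 ** log2_p, -log2_p


# Made-up predictions, identical at every step for simplicity.
probs = [{"A": 0.5, "C": 0.25, "G": 0.125, "T": 0.125}] * 3
p, bits = sequence_code_length(probs, "AAC")
```

The more probability the model assigns to the nucleotides that actually occur, the larger p becomes and the fewer bits are needed, which is why prediction accuracy translates into compression ratio.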
According to the gene expression prediction method provided by the embodiment of the invention, the prediction model based on the multi-head self-attention mechanism predicts the type of the next nucleotide with higher accuracy, so that the encoded representation of the gene sequence is shorter and the compression ratio is higher.
Optionally, the method further comprises:
performing arithmetic decoding on the compressed gene sequence based on the prediction result to obtain a decoded gene sequence.
Fig. 4 is a fourth schematic flowchart of a gene expression prediction method provided in an embodiment of the present invention, and as shown in fig. 4, in an embodiment, the application of the gene expression prediction method to the decoding process may include two parts, namely, inference and arithmetic decoding.
Inference: the short segment of the sequence preceding the next nucleotide (i.e., the gene sequence segment to be predicted) is taken as input, and the trained deep learning model predicts the occurrence probability of the next nucleotide.
Arithmetic decoding: the nucleotide at the next position is decoded according to the probability calculated by the model.
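The encoding and decoding loops described above can be illustrated with a toy round trip. This is a sketch only: `toy_model` is a hypothetical stand-in for the trained prediction model, exact fractions replace the finite-precision arithmetic of a practical coder, and the sequence length is passed to the decoder explicitly.

```python
from fractions import Fraction

ALPHABET = "ACGT"


def toy_model(context):
    """Hypothetical predictor: favors repeating the previous nucleotide."""
    if not context:
        return {b: Fraction(1, 4) for b in ALPHABET}
    return {b: Fraction(5, 8) if b == context[-1] else Fraction(1, 8)
            for b in ALPHABET}


def encode(sequence, model):
    """Narrow [low, low + width) once per nucleotide; width ends as P(sequence)."""
    low, width = Fraction(0), Fraction(1)
    for i, base in enumerate(sequence):
        probs = model(sequence[:i])
        for b in ALPHABET:
            if b == base:
                width *= probs[b]
                break
            low += width * probs[b]           # skip sub-intervals of earlier symbols
    return low, width


def decode(code, length, model):
    """Replay the same predictions and pick the sub-interval containing code."""
    low, width = Fraction(0), Fraction(1)
    decoded = ""
    for _ in range(length):
        probs = model(decoded)
        for b in ALPHABET:
            sub = width * probs[b]
            if low <= code < low + sub:
                width = sub
                decoded += b
                break
            low += sub
    return decoded


low, width = encode("GGCTA", toy_model)
restored = decode(low, 5, toy_model)
```

Because the decoder queries the same model with the nucleotides it has already recovered, it reproduces the encoder's intervals exactly and the original sequence is restored.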
The gene compression method based on the prediction model provided by the embodiment of the invention can effectively improve the compression of gene sequences. Table 1 shows the experimental results provided by the embodiment of the present invention. As shown in Table 1, the experiments show that, under the same conditions, the solution proposed by the embodiment of the present invention reduces bpb to 0.011, which is lower than that of gene compression methods based on LSTM or bidirectional LSTM.
Figure BDA0003672721360000151
The gene expression predicting device provided by the present invention will be described below, and the gene expression predicting device described below and the gene expression predicting method described above may be referred to each other correspondingly.
Fig. 5 is a schematic structural diagram of a gene expression prediction apparatus according to an embodiment of the present invention, and as shown in fig. 5, the gene expression prediction apparatus according to the embodiment of the present invention includes an obtaining unit 510 and a prediction unit 520:
an obtaining unit 510, configured to obtain a gene sequence segment to be predicted;
the prediction unit 520 is configured to input the gene sequence segment to be predicted into a prediction model, and obtain a prediction result output by the prediction model;
the prediction model is constructed based on a multi-head self-attention mechanism, the prediction model is obtained based on a gene sequence fragment sample and a prediction tag through training, and the prediction tag is a nucleotide corresponding to the gene sequence fragment sample.
Optionally, the predicting unit 520 is configured to input the gene sequence segment to be predicted into a prediction model, and obtain a prediction result output by the prediction model, and includes:
the prediction unit 520 is used for extracting the characteristics of the gene sequence segment to be predicted to obtain standard gene characteristics;
a prediction unit 520, configured to perform multi-head self-attention weight calculation on the standard gene features to obtain an attention representation;
a predicting unit 520, configured to predict the gene sequence segment to be predicted based on the attention expression, and obtain the prediction result, where the prediction result includes a nucleotide type and a probability corresponding to the nucleotide type.
Optionally, the predicting unit 520 is configured to perform multi-head self-attention weight calculation on the standard gene feature to obtain an attention representation, and includes:
a prediction unit 520, configured to obtain a splicing expression based on the standard gene feature and a pre-stored historical gene feature;
a prediction unit 520, configured to obtain a query vector, a key vector, and a value vector corresponding to the standard gene feature based on the standard gene feature and the splicing expression;
a prediction unit 520 for obtaining an initial attention representation based on the query vector, the key vector and the value vector;
a prediction unit 520, configured to perform a normalization operation on the initial attention representation to obtain the attention representation.
Optionally, the predicting unit 520 is configured to perform feature extraction on the gene sequence segment to be predicted to obtain a standard gene feature, and includes:
the prediction unit 520 is used for coding the gene sequence segment to be predicted to obtain coding representation;
a prediction unit 520, configured to perform initial feature extraction on the encoded representation to obtain initial gene features;
a prediction unit 520, configured to perform a maximal pooling operation on the initial gene features to obtain pooled gene features;
and the prediction unit 520 is used for carrying out standardization operation on the pooled gene characteristics to obtain standard gene characteristics.
Optionally, the prediction unit 520 is configured to perform a normalization operation, and includes:
a prediction unit 520, configured to perform batch normalization operations based on a preset batch normalization formula;
the preset batch standardization formula is as follows:
Figure BDA0003672721360000171
where $x_i$ is the data to be normalized, $\mu$ is the mean parameter, and $\sigma^2$ is the variance parameter; the mean parameter and the variance parameter are determined based on the gene sequence fragment sample; $\epsilon$ is a hyperparameter, a is a first model parameter, and b is a second model parameter.
Optionally, the gene expression prediction device further comprises:
and the coding unit is used for carrying out arithmetic coding on the basis of the prediction result and the gene sequence segment to be predicted to obtain a compressed gene sequence.
Optionally, the gene expression prediction device further comprises:
a decoding unit, configured to perform arithmetic decoding on the compressed gene sequence based on the prediction result to obtain a decoded gene sequence.
It should be noted that, the apparatus provided in the embodiment of the present invention can implement all the method steps implemented by the method embodiment and achieve the same technical effect, and detailed descriptions of the same parts and beneficial effects as the method embodiment in this embodiment are omitted here.
Fig. 6 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 6: a processor (processor)610, a communication Interface (Communications Interface)620, a memory (memory)630 and a communication bus 640, wherein the processor 610, the communication Interface 620 and the memory 630 communicate with each other via the communication bus 640. The processor 610 may invoke logic instructions in the memory 630 to perform a method of gene expression prediction comprising: obtaining a gene sequence segment to be predicted; inputting the gene sequence segment to be predicted into a prediction model to obtain a prediction result output by the prediction model; the prediction model is constructed based on a multi-head self-attention mechanism, the prediction model is obtained based on a gene sequence fragment sample and a prediction tag through training, and the prediction tag is a nucleotide corresponding to the gene sequence fragment sample.
In addition, the logic instructions in the memory 630 may be implemented in the form of software functional units and stored in a computer readable storage medium when the logic instructions are sold or used as independent products. Based on such understanding, the technical solution of the present invention or a part thereof which substantially contributes to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk, and various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product, the computer program product comprising a computer program, the computer program being stored on a non-transitory computer-readable storage medium, wherein when the computer program is executed by a processor, the computer is capable of executing a method for predicting gene expression provided by the above methods, the method comprising: obtaining a gene sequence segment to be predicted; inputting the gene sequence segment to be predicted into a prediction model to obtain a prediction result output by the prediction model; the prediction model is constructed based on a multi-head self-attention mechanism, the prediction model is obtained based on a gene sequence fragment sample and a prediction label training, and the prediction label is a nucleotide corresponding to the gene sequence fragment sample.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program, which when executed by a processor, implements a method for predicting gene expression provided by the above methods, including: obtaining a gene sequence segment to be predicted; inputting the gene sequence segment to be predicted into a prediction model to obtain a prediction result output by the prediction model; the prediction model is constructed based on a multi-head self-attention mechanism, the prediction model is obtained based on a gene sequence fragment sample and a prediction tag through training, and the prediction tag is a nucleotide corresponding to the gene sequence fragment sample.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment may be implemented by software plus a necessary general hardware platform, and may also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, and not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for predicting gene expression, comprising:
obtaining a gene sequence segment to be predicted;
inputting the gene sequence segment to be predicted into a prediction model to obtain a prediction result output by the prediction model;
the prediction model is constructed based on a multi-head self-attention mechanism, the prediction model is obtained based on a gene sequence fragment sample and a prediction label training, and the prediction label is a nucleotide corresponding to the gene sequence fragment sample.
2. The method for predicting gene expression according to claim 1, wherein the inputting the gene sequence segment to be predicted into a prediction model to obtain a prediction result output by the prediction model comprises:
performing characteristic extraction on the gene sequence segment to be predicted to obtain standard gene characteristics;
performing multi-head self-attention weight calculation on the standard gene characteristics to obtain attention expression;
predicting the gene sequence segment to be predicted based on the attention expression, and obtaining a prediction result, wherein the prediction result comprises a nucleotide type and a probability corresponding to the nucleotide type.
3. The method of predicting gene expression according to claim 2, wherein said performing a multi-head self-attention weight calculation on said standard gene signature to obtain an attention representation comprises:
obtaining a splicing expression based on the standard gene characteristics and pre-stored historical gene characteristics;
obtaining a query vector, a key vector and a value vector corresponding to the standard gene feature based on the standard gene feature and the splicing expression;
obtaining an initial attention representation based on the query vector, the key vector, and the value vector;
and carrying out a standardization operation on the initial attention representation to obtain the attention representation.
4. The method for predicting gene expression according to claim 2, wherein the extracting the characteristics of the gene sequence segment to be predicted to obtain the standard gene characteristics comprises:
coding the gene sequence segment to be predicted to obtain coding representation;
performing initial feature extraction on the coding expression to obtain initial gene features;
performing maximum pooling operation on the initial gene characteristics to obtain pooled gene characteristics;
and carrying out standardized operation on the pooled gene characteristics to obtain standard gene characteristics.
5. The method of predicting gene expression according to claim 3 or 4, wherein the performing of the normalization operation comprises:
performing batch standardization operation based on a preset batch standardization formula;
the preset batch standardization formula is as follows:
Figure FDA0003672721350000021
where $x_i$ is the data to be normalized, $\mu$ is the mean parameter, and $\sigma^2$ is the variance parameter; the mean parameter and the variance parameter are determined based on the gene sequence fragment sample; $\epsilon$ is a hyperparameter, a is a first model parameter, and b is a second model parameter.
6. The method of predicting gene expression according to any one of claims 1 to 4, further comprising:
and carrying out arithmetic coding on the basis of the prediction result and the gene sequence segment to be predicted to obtain a compressed gene sequence.
7. The method of predicting gene expression of claim 6, further comprising:
performing arithmetic decoding on the compressed gene sequence based on the prediction result to obtain a decoded gene sequence.
8. A gene expression prediction apparatus comprising:
an acquisition unit, configured to acquire a gene sequence segment to be predicted;
the prediction unit is used for inputting the gene sequence segment to be predicted into a prediction model to obtain a prediction result output by the prediction model;
the prediction model is constructed based on a multi-head self-attention mechanism, the prediction model is obtained based on a gene sequence fragment sample and a prediction tag through training, and the prediction tag is a nucleotide corresponding to the gene sequence fragment sample.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the gene expression prediction method according to any one of claims 1 to 7 when executing the program.
10. A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the gene expression prediction method according to any one of claims 1 to 7.
CN202210613683.2A 2022-05-31 2022-05-31 Gene expression prediction method and device Pending CN115019876A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210613683.2A CN115019876A (en) 2022-05-31 2022-05-31 Gene expression prediction method and device

Publications (1)

Publication Number Publication Date
CN115019876A true CN115019876A (en) 2022-09-06

Family

ID=83070998

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210613683.2A Pending CN115019876A (en) 2022-05-31 2022-05-31 Gene expression prediction method and device

Country Status (1)

Country Link
CN (1) CN115019876A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116580767A (en) * 2023-04-26 2023-08-11 之江实验室 Gene phenotype prediction method and system based on self-supervision and transducer
CN116580767B (en) * 2023-04-26 2024-03-12 之江实验室 Gene phenotype prediction method and system based on self-supervision and transducer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination