CN112884087A

CN112884087A - Biological enhancer and identification method for type thereof

Info

Publication number: CN112884087A
Application number: CN202110375106.XA
Authority: CN
Inventors: 杨润涛; 吴峰; 张承进; 张丽娜
Original assignee: Shandong University
Current assignee: Shandong University
Priority date: 2021-04-07
Filing date: 2021-04-07
Publication date: 2021-06-01

Abstract

The application provides a method for identifying biological enhancers and their types, addressing the problem of improving the identification performance for enhancers and enhancer types. The method for identifying a biological enhancer comprises the following steps: pre-segmenting the sequences in the reference data set into words according to an n-gram to obtain pre-segmented word sequences, wherein the enhancer sequences are pre-segmented to obtain first pre-segmented word sequences and the non-enhancer sequences are pre-segmented to obtain second pre-segmented word sequences; training a Seq-GAN network model on the first pre-segmented word sequences to generate a first artificial sequence, and on the second pre-segmented word sequences to generate a second artificial sequence; fusing the first artificial sequence with the original positive sample data set to obtain an amplified positive sample data set; fusing the second artificial sequence with the original negative sample data set to obtain an amplified negative sample data set; and performing statistics-based sequence word segmentation on the amplified positive sample data set to obtain a first word segmentation result and a word segmentation model. The application improves the performance of enhancer identification and enhancer type identification.

Description

Biological enhancer and identification method for type thereof

Technical Field

The application relates to the technical field of biotechnology and life science, in particular to a biological enhancer and a method for identifying types of the biological enhancer.

Background

Enhancers were first discovered in 1981, when Banerji et al. found a 140 bp sequence in SV40 DNA that increased the expression level of an SV40 DNA/rabbit β-hemoglobin fusion gene. Since then, the position, function, and properties of enhancers have been studied ever more widely, and enhancers have been found to differ in strength (strong and weak). In 2018, Liu et al. constructed two complex ensemble learning models (iEnhancer-EL) to identify enhancers and their types. In 2019, Nguyen et al. used FastText, a word embedding technique, to extract word vectors for biological words; the method performed quite well on a reference data set and an independent test set, but leaves considerable room for improvement in practice, particularly because its classifier is a traditional machine learning algorithm.

In the prior art, Khanal et al. proposed iEnhancer-CNN in 2020, a model based on word embedding and a convolutional neural network that applies word embedding and deep learning to this research topic together. A word vector is trained for each word in the DNA sequence, so that a two-dimensional matrix representation is obtained, and classification experiments are carried out with a two-layer convolutional neural network. High-precision identification of enhancers and their types is achieved, which demonstrates the effectiveness of combining word embedding with deep learning for this topic. In addition, Cai et al. constructed an ensemble learning predictor, iEnhancer-XG, based on five DNA sequence features and the XGBoost algorithm, and obtained performance figures of 81.1% and 65.7%, respectively.

First, methods based on machine learning classification algorithms build feature vectors directly from the nucleotide composition of the DNA sequence; such hand-designed features cannot capture deep features in the sequence, and even the methods that introduce word embedding still rely on mechanical sliding-window word segmentation. Second, machine learning algorithms perform worse overall than deep learning algorithms; however, when a deep learning algorithm is used directly to improve recognition, the deep model is prone to under-fitting because the amount of data is insufficient.

Disclosure of Invention

The application provides a method for identifying biological enhancers and their types. The data volume is expanded with a Seq-GAN network model, and the effectiveness and interpretability of word segmentation are improved through statistics-based sequence word segmentation, so that the identification performance for biological enhancers and their types can be improved.

The first aspect of the present application provides a method for identifying a biological enhancer, comprising:

obtaining a reference data set, the reference data set comprising a training data set, and the training data set comprising an original positive sample data set and an original negative sample data set, wherein the original positive sample data set is an enhancer data set, that is, a set containing a plurality of enhancer sequences, and the original negative sample data set is a non-enhancer data set, that is, a set containing a plurality of non-enhancer sequences; the method uses the nucleotide composition characteristics of DNA sequences to fit the biological sequences from the data, generates reasonable and effective artificial sequences, expands the data volume, and provides richer data features.

pre-segmenting the sequences in the reference data set into words according to an n-gram to obtain pre-segmented word sequences, wherein the enhancer sequences are pre-segmented into first pre-segmented word sequences and the non-enhancer sequences are pre-segmented into second pre-segmented word sequences; although the Seq-GAN network takes its input in sequence form, it cannot segment a sequence into words itself, and its input format is an English-like sentence with spaces as segmentation points, so the sequences need to be pre-segmented. If single nucleotides (A, C, G, T) were used as words, the vocabulary would be too limited, which affects the quality of the generated sequences; pre-segmentation by dinucleotide, that is, n = 2, is therefore selected.

training a Seq-GAN network model on the first pre-segmented word sequence to generate a first artificial sequence, and on the second pre-segmented word sequence to generate a second artificial sequence; fusing the first artificial sequence with the original positive sample data set to obtain an amplified positive sample data set; fusing the second artificial sequence with the original negative sample data set to obtain an amplified negative sample data set; by these technical means the data volume is expanded, richer data features are provided, and the problem that the deep model is prone to under-fitting is addressed.

performing statistics-based sequence word segmentation on the amplified positive sample data set to obtain a first word segmentation result and a word segmentation model; the quality of the word segmentation determines the quality of the sequence feature representation, so changing the segmentation mode is also an important aspect. Mechanical word segmentation cannot fully extract the nucleotide composition characteristics of a DNA sequence, whereas the statistical approach maximizes the occurrence probability of a given DNA sequence, which makes the word segmentation more principled.

Performing word segmentation on the amplified negative sample data set according to a word segmentation model to obtain a second word segmentation result; and training the first word segmentation result and the second word segmentation result to obtain a trained classification model.

According to this technical scheme, the identification method of the application addresses the shortcomings of the prior art in terms of data volume, word segmentation mode, feature extraction mode, and classification algorithm, and improves the effectiveness and interpretability of word segmentation by replacing mechanical n-gram word segmentation with statistics-based sequence word segmentation. The application can effectively improve the accuracy of the enhancer identification task.

Preferably, the method further comprises the following steps: and performing Word vector training on the first Word segmentation result according to a Word2vec network model to obtain a Word embedding matrix.

Each word is looked up in the word embedding matrix to obtain its word vector, which is concatenated with the word vector of the next word; the resulting two-dimensional concatenation forms the word embedding layer. The application uses word embedding to convert the segmented words into vectors that a computer can process, and these serve as the word embedding layer of the convolutional neural network that finally performs enhancer identification and enhancer type identification. Word embedding avoids the shortcomings of hand-designed features and mines the intrinsic feature relations between words. In addition, compared with summing or averaging, two-dimensional concatenation represents the sequence features better and does not cause features to overlap or mask one another. The convolutional neural network is a deep neural network and can mine deep features, so that its neurons learn features that are not interpretable by humans. The convolution operation also fuses the features of two biological words effectively, and feature extraction continues throughout the classification process.

Preferably, the method further comprises the following steps: acquiring a data set to be tested; performing word segmentation on the data set to be tested according to the word segmentation model to obtain a word segmentation result to be tested; and obtaining the identification result of the data set to be tested according to the word segmentation result to be tested and the word embedding matrix.

Preferably, the step of performing sequence segmentation on the amplified positive sample data set based on statistics to obtain a first segmentation result and a segmentation model includes:

calculating the probability $p(x_i)$ using the maximum likelihood method, where $p(x_i)$ is the probability that the word appearing at the i-th position of sequence X appears in the data set;

the likelihood function L is:

$$L = \sum_{s=1}^{|D|} \log p\left(X^{(s)}\right) = \sum_{s=1}^{|D|} \log \left( \sum_{x \in S\left(X^{(s)}\right)} p(x) \right)$$

wherein D represents the corpus, $X^{(s)}$ represents the s-th sample in the corpus, and $S(X^{(s)})$ represents all candidate compositions of the s-th sample;

calculating the joint probability $p(X)$ of the sequence X according to $p(x_i)$, where for a composition of X into the words $x_1, \ldots, x_M$:

$$p(X) = \prod_{i=1}^{M} p(x_i), \qquad x_i \in V \ \text{for all } i, \qquad \sum_{x \in V} p(x) = 1$$

wherein V represents the dictionary, and the probabilities of all words in the dictionary sum to 1;

and selecting the composition corresponding to the highest value of $p(X)$ and recording it as the word segmentation result of the sequence X.

Preferably, the step of training a Seq-GAN network model on the first pre-segmented word sequence to generate a first artificial sequence, and on the second pre-segmented word sequence to generate a second artificial sequence, includes:

a sequence completion step: generating a complete sequence for each action according to a Monte Carlo tree search algorithm, wherein the action refers to a character or word that may be generated at the next time point based on the current state, and the current state refers to a sequence fragment generated by the generator based on the first pre-segmented word sequence and the second pre-segmented word sequence;

an artificial sequence generation step: the Seq-GAN network model evaluates the complete sequence and generates a reward; the reward is transmitted back to the generator of the Seq-GAN network model; the generator and the state are updated, wherein the state refers to the words or characters generated so far; whether a preset condition is reached is judged, the preset condition comprising a training target or a maximum number of iterations; and if so, the artificial sequence of the pre-segmented word sequence is obtained.

Preferably, the step of training a Seq-GAN network model on the first pre-segmented word sequence to generate a first artificial sequence, and on the second pre-segmented word sequence to generate a second artificial sequence, includes: performing word vector training on the pre-segmented word sequence according to a Word2vec network model to obtain pre-trained word vectors; and training the Seq-GAN network model with the pre-trained word vectors to generate the artificial sequence.

The original Seq-GAN network trains and generates data starting from randomly initialized word vectors, and these word vectors have to be learned during network training, which slows training and gives no guarantee on the quality of the generated sequences. In practical use, therefore, Word2vec is trained on the pre-segmented data set to obtain pre-trained word vectors for the 16 dinucleotides, which reduces the computational complexity of the Seq-GAN network and improves both the training speed and the quality of the generated sequences.

A second aspect of the present application provides a method for identifying the type of a biological enhancer, comprising: acquiring a strong enhancer data set, the strong enhancer data set being a set containing a plurality of strong enhancer sequences; performing statistics-based sequence word segmentation on the strong enhancer sequences to obtain a word segmentation result; and performing word vector training on the word segmentation result to obtain a word embedding matrix.

According to this technical scheme, a method for identifying biological enhancers and their types is provided. The data volume is first expanded using a sequence generation technique; statistics-based sequence word segmentation then replaces mechanical n-gram word segmentation, improving the effectiveness and interpretability of word segmentation; word embedding then converts the segmented words into vectors that a computer can process, and these vectors serve as the word embedding layer of the convolutional neural network that performs enhancer identification and enhancer type identification.

Drawings

In order to more clearly explain the technical solution of the present application, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious to those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a schematic flow chart of an embodiment of a method for identifying a biological enhancer of the present application;

FIG. 2 is a schematic diagram of the overall recognition process of a biological enhancer recognition method according to the present application;

FIG. 3 is a schematic diagram of the structure of a Seq-GAN network model in the method for identifying a bio-enhancer of the present application;

FIG. 4 is a schematic diagram of a convolutional neural network for a method of identifying a bio-enhancer of the present application;

FIG. 5 is a diagram illustrating the content analysis of artificial sequence and natural sequence mononucleotide in a biological enhancer and its type identification method according to the present application;

FIG. 6 is a diagram illustrating the analysis of the content of artificial non-enhancer double codons in a method for identifying a biological enhancer and its type according to the present application;

FIG. 7 is a schematic diagram of the content analysis of triple nucleotides of an artificial weak enhancer in a method for identifying a biological enhancer and its type according to the present application;

FIG. 8 is a schematic diagram of the physicochemical property analysis of the artificial sequence and the natural sequence in the method for identifying a bio-enhancer and its type according to the present application.

Detailed Description

Reference will now be made in detail to embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following examples do not represent all embodiments consistent with the present application; they are merely examples of systems and methods consistent with certain aspects of the application, as recited in the claims.

FIG. 1 is a schematic flow chart of an embodiment of a method for identifying a biological enhancer of the present application. As shown in fig. 1, the present application provides a method for identifying a biological enhancer, comprising:

S1, obtaining a reference data set, the reference data set comprising a training data set, and the training data set comprising an original positive sample data set and an original negative sample data set, wherein the original positive sample data set is an enhancer data set, that is, a set containing a plurality of enhancer sequences, and the original negative sample data set is a non-enhancer data set, that is, a set containing a plurality of non-enhancer sequences;

The reference data set used in this embodiment was constructed by Liu et al. It combines chromatin state information for nine cell lines: H1ES, K562, GM12878, HepG2, HUVEC, HSMM, NHLF, NHEK, and HMEC. The chromatin state information was annotated by ChromHMM from genome-wide maps of several histone marks, such as H3K4me1, H3K4me3, and H3K27ac. The non-enhancer data set of the reference data set contains 1484 non-enhancer sequences, and the enhancer data set contains two types of enhancer sequences: 742 strong enhancer sequences and 742 weak enhancer sequences.

When identifying whether a sequence is an enhancer sequence, the non-enhancer dataset is a negative sample set and the sum of the strong and weak enhancer datasets is a positive sample set.

S2, pre-segmenting words of the sequences in the reference data set according to n-grams to obtain pre-segmented word sequences, wherein the enhancer sequences obtain first pre-segmented word sequences through pre-segmentation, and the non-enhancer sequences obtain second pre-segmented word sequences through pre-segmentation;

The n-gram is a word segmentation method commonly used in natural language processing. In this embodiment n = 2, which means that, starting from the head of the sequence, the sequence is split into consecutive words of two nucleotides each. For example, when the enhancer sequence is GGTGTGGAAAGG, the pre-segmented word sequence is GG TG TG GA AA GG, that is, an English-like sentence with spaces as segmentation points.
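The pre-segmentation described above can be sketched in a few lines of Python. This is a minimal illustration only; the function name `pre_segment` is hypothetical and does not come from the application, and the handling of a trailing fragment (kept as a shorter last word) is an assumption, since the example sequences have even length.

```python
def pre_segment(sequence: str, n: int = 2) -> str:
    """Split a DNA sequence into consecutive, non-overlapping n-letter words.

    The words are joined by single spaces, giving the English-like
    sentence form that a sequence-generation model can consume.
    If len(sequence) is not a multiple of n, the last word is shorter
    (an assumption of this sketch, not stated in the source).
    """
    words = [sequence[i:i + n] for i in range(0, len(sequence), n)]
    return " ".join(words)


if __name__ == "__main__":
    print(pre_segment("GGTGTGGAAAGG"))  # GG TG TG GA AA GG
```

Because the windows do not overlap, a 200 bp sequence yields 100 dinucleotide words.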

It should be noted that, in the generation process, the non-enhancer reference data set, the strong enhancer reference data set, and the weak enhancer reference data set are each trained separately, so the first pre-segmented word sequence includes a strong enhancer pre-segmented word sequence and a weak enhancer pre-segmented word sequence; only when enhancers are distinguished from non-enhancers is the union of the strong and weak enhancer data sets used as the positive sample set.

S3, training a Seq-GAN network model on the first pre-segmented word sequence to generate a first artificial sequence, and on the second pre-segmented word sequence to generate a second artificial sequence;

An artificial sequence is generated with the Seq-GAN network model, that is, words are recombined with one another to expand the strong enhancer data set and the weak enhancer data set. For the reference data set of step S1, the Seq-GAN network model generates artificial sequences to expand the non-enhancer, strong enhancer, and weak enhancer data sets, producing 20,000 artificial non-enhancer sequences, 10,000 artificial strong enhancer sequences, and 10,000 artificial weak enhancer sequences.

Preferably, because the amplified data sets may contain sequences that are highly similar to the original data, they cannot be applied directly to the recognition task. Redundant sequences with similarity higher than 80% are therefore removed using the CD-HIT software.
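To make the idea of redundancy removal concrete, the following sketch filters generated sequences greedily against the original data. It is not CD-HIT and does not reproduce CD-HIT's clustering algorithm or its identity measure; it swaps in Python's standard `difflib.SequenceMatcher` ratio purely as a stand-in similarity score. The function names and the toy sequences are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; NOT the identity measure used by CD-HIT."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def remove_redundant(originals, generated, threshold=0.8):
    """Keep a generated sequence only if it is at most `threshold` similar
    to every original sequence and to every generated sequence kept so far."""
    kept = []
    for seq in generated:
        references = list(originals) + kept
        if all(similarity(seq, ref) <= threshold for ref in references):
            kept.append(seq)
    return kept


if __name__ == "__main__":
    originals = ["ACGTACGTAC"]
    generated = ["ACGTACGTAC", "TTTTTTTTTT"]
    print(remove_redundant(originals, generated))
```

The pairwise comparison is quadratic in the number of sequences, so for tens of thousands of sequences a dedicated tool is the practical choice; the sketch only shows what the 80% threshold filters out.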

The Seq-GAN network model is described here. It is a GAN network model for sequence generation that adds reinforcement learning ideas to the GAN network in order to generate high-quality discrete data such as speech sequences, text sequences, and time series.

FIG. 3 is a schematic structural diagram of a Seq-GAN network model in the method for identifying a biological enhancer of the present application. As shown in FIG. 3, like a general GAN network, the Seq-GAN network includes a generator (G) and a discriminator (D); the modification of the word embedding layer can also be seen in FIG. 3.

In some embodiments, the step of training a Seq-GAN network model on the pre-segmented word sequence to generate an artificial sequence of the pre-segmented word sequence includes:

S301, a sequence completion step: generating a complete sequence for each action according to a Monte Carlo tree search algorithm, wherein the action refers to a character or word that may be generated at the next time point based on the current state, and the current state refers to a sequence fragment generated by the generator based on the first pre-segmented word sequence and the second pre-segmented word sequence;

The generator draws on the words obtained by pre-segmenting the original data set and selects one word to append at each step. As shown in FIG. 3, in a specific embodiment the current state is a short sequence fragment composed of three words (the circles in FIG. 3, each circle representing one word). In this embodiment the action is the next possible word, where the words are the 16 kinds of dinucleotides, i.e., AA, AC, and so on. Non-enhancer, strong enhancer, and weak enhancer sequences are all 200 bp long, so whichever artificial sequence is to be generated, the "state" in Seq-GAN is only ever a sequence fragment of length 1-199. The N most likely complete sequences for each action are found by Monte Carlo tree search.
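The state and action spaces described above are small enough to enumerate directly. The sketch below builds the 16 dinucleotide actions and lists the states reachable from a three-word state; the names `ACTIONS` and `candidate_next_states` are illustrative and not taken from the application.

```python
from itertools import product

NUCLEOTIDES = "ACGT"

# The 16 possible actions: every ordered pair of nucleotides (AA, AC, ..., TT).
ACTIONS = ["".join(pair) for pair in product(NUCLEOTIDES, repeat=2)]


def candidate_next_states(state):
    """Given a state (list of words generated so far), return the states
    reached by appending each of the 16 possible actions."""
    return [state + [action] for action in ACTIONS]


if __name__ == "__main__":
    state = ["GG", "TG", "TG"]  # a three-word state, as in the FIG. 3 example
    successors = candidate_next_states(state)
    print(len(ACTIONS), ACTIONS[:2], successors[0])
```

Each successor is still an incomplete sequence, which is why the completion step is needed before the discriminator can score it.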

S302, an artificial sequence generation step:

S3021, the discriminator D of the Seq-GAN network model evaluates the complete sequence and generates a reward;

S3022, transmitting the reward back to the generator of the Seq-GAN network model;

S3023, updating the generator and updating the state, wherein the state refers to the words or characters generated so far;

S3024, determining whether a preset condition is reached, the preset condition comprising a training target or a maximum number of iterations;

S3025, if so, obtaining the artificial sequence of the pre-segmented word sequence.

In addition, the original Seq-GAN network trains and generates data starting from randomly initialized word vectors, and these word vectors have to be learned during network training, which slows training and gives no guarantee on the quality of the generated sequences.

In practical use, therefore, Word2vec is trained on the pre-segmented data set, which includes the following steps:

S31, performing word vector training on the first pre-segmented word sequence according to a Word2vec network model to obtain a first pre-trained word vector, and performing word vector training on the second pre-segmented word sequence to obtain a second pre-trained word vector;

S32, training the Seq-GAN network model with the first pre-trained word vector to generate a first artificial sequence, and with the second pre-trained word vector to generate a second artificial sequence.

Pre-trained word vectors for the 16 dinucleotides are thus obtained, which reduces the computational complexity of the Seq-GAN network and improves both the training speed and the quality of the generated sequences.

From a network architecture perspective, the goal of the generator G is to maximize the expected reward J(θ):

$$J(\theta) = \mathbb{E}\left[R_T \mid s_0, \theta\right] = \sum_{y_1 \in \mathcal{Y}} G_\theta(y_1 \mid s_0)\, Q_{D_\phi}^{G_\theta}(s_0, y_1)$$

The formula can be described as the expected reward $R_T$ of generating a complete sequence, given the initial state $s_0$ and the generator parameters θ; $\mathcal{Y}$ is the vocabulary and $D_\phi$ is the discriminator with parameters φ. Q is the action-value function of the sequence, that is, the expected reward obtained by selecting action a (a = $y_1$) in state s and following the policy thereafter.

The formula can be understood intuitively: the probability that G generates a certain $y_1$ is multiplied by the Q value of that choice, and the products are summed. The key to calculating the expectation is therefore the calculation of the Q value.

In Seq-GAN, D acts as the environment of this reinforcement learning system, and the probability output by D serves as the reward. When a complete sequence of length T is evaluated, the Q value is defined as:

$$Q_{D_\phi}^{G_\theta}\left(s = Y_{1:T-1},\, a = y_T\right) = D_\phi\left(Y_{1:T}\right)$$

The reward of an incomplete sequence has no practical meaning. Given $y_1$ to $y_{t-1}$, the Q value of generating $y_t$ therefore cannot be calculated directly once $y_t$ has been produced, unless $y_t$ is the last word of the whole sequence. To solve this problem, the positions after the current time step need to be filled in, and the method used is Monte Carlo tree search. The reward of each possible completed sequence is calculated, and the rewards are averaged to obtain the current Q value.

The main idea of Monte Carlo tree search is based on the Upper Confidence Bound applied to Trees (UCT) algorithm, and the UCT value is calculated as follows:

$$\mathrm{UCT}(v') = \frac{Q(v')}{N(v')} + C \sqrt{\frac{2 \ln N(v)}{N(v')}}$$

where v' represents the current tree node, v represents its parent node, Q represents the cumulative quality value of a tree node, N represents the number of times a tree node has been visited, and C is a constant.

Each node thus receives a value that is used in the selection step below. The value has two parts. The left part is the average return of the node: the higher it is, the higher the expected return of the node and the more the node is worth selecting. The right part grows with the total visit count of the parent node relative to the visit count of the child node: the fewer times a child has been visited, the larger this part is and the more the child is worth selecting. The formula therefore balances exploration and exploitation.
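A small numeric sketch can show how the two parts of the UCT value trade off. It assumes the common form Q(v')/N(v') + C·sqrt(2·ln N(v)/N(v')); the exact constants used in the application are not recoverable from this text, so the factor 2 and the default C = 1 are assumptions, as are the function names and toy numbers.

```python
import math


def uct_value(child_quality, child_visits, parent_visits, c=1.0):
    """UCT score of a child node: average return plus an exploration bonus.

    Unvisited children get +inf so that they are always tried first.
    """
    if child_visits == 0:
        return math.inf
    exploit = child_quality / child_visits
    explore = c * math.sqrt(2.0 * math.log(parent_visits) / child_visits)
    return exploit + explore


def select_child(children, parent_visits, c=1.0):
    """children: list of (name, cumulative_quality, visit_count) tuples.
    Returns the name of the child with the largest UCT score."""
    best = max(children, key=lambda ch: uct_value(ch[1], ch[2], parent_visits, c))
    return best[0]


if __name__ == "__main__":
    children = [("AA", 9.0, 10), ("AC", 0.5, 1)]
    print(select_child(children, parent_visits=11, c=0.0))  # pure exploitation
    print(select_child(children, parent_visits=11, c=1.0))  # with exploration
```

With C = 0 the frequently visited child with the higher average return is chosen; with C = 1 the rarely visited child wins because its exploration bonus dominates.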

The complete steps of the Monte Carlo tree search comprise the following four steps:

Selection: find the node in the tree most worth exploring; the general strategy is to select child nodes that have not yet been explored first and, if all have been explored, to select the child node with the largest UCT value;

Expansion: create a new child node one step below the previously selected node; the general strategy is to randomly perform an operation that does not duplicate an existing child node;

Simulation: after expansion, the algorithm randomly selects a child node and simulates a random game starting from it until the final state of the game is reached, so that a score for the expanded node is obtained;

Backpropagation: feed the node's score back to all of its ancestor nodes and update their quality values and visit counts, for use in later UCT calculations.

After N Monte Carlo tree searches, N samples are obtained, and the Q value at the current time step is obtained as the mean of their rewards.
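The averaging step can be illustrated with a deliberately simplified sketch: an incomplete sequence is completed N times, each completion is scored, and the mean score is taken as the Q value. Two simplifications are assumed here and are not the application's method: completions are plain uniform random rollouts rather than a full tree search guided by the generator, and `toy_discriminator` (the fraction of G/C letters) is a hypothetical stand-in for the trained discriminator D.

```python
import random

ACTIONS = [a + b for a in "ACGT" for b in "ACGT"]  # the 16 dinucleotide words


def toy_discriminator(words):
    """Hypothetical stand-in for D: returns a score in [0, 1]."""
    letters = "".join(words)
    return sum(ch in "GC" for ch in letters) / len(letters)


def rollout(prefix, total_words, rng):
    """Complete a partial sequence with uniformly random words."""
    missing = total_words - len(prefix)
    return prefix + [rng.choice(ACTIONS) for _ in range(missing)]


def estimate_q(state, action, total_words, n_rollouts, rng):
    """Estimate Q(state, action) as the mean reward over n_rollouts completions.
    If the action completes the sequence, the reward is computed directly."""
    prefix = state + [action]
    if len(prefix) == total_words:
        return toy_discriminator(prefix)
    rewards = [toy_discriminator(rollout(prefix, total_words, rng))
               for _ in range(n_rollouts)]
    return sum(rewards) / len(rewards)


if __name__ == "__main__":
    rng = random.Random(0)
    print(estimate_q(["GG", "TG"], "GC", total_words=100, n_rollouts=16, rng=rng))
```

Passing an explicit `random.Random` instance keeps the rollouts reproducible, which helps when comparing Q estimates across actions.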

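Using the generated data, D is retrained:

$$\min_{\phi}\; -\mathbb{E}_{Y \sim p_{\text{data}}}\left[\log D_\phi(Y)\right] - \mathbb{E}_{Y \sim G_\theta}\left[\log\left(1 - D_\phi(Y)\right)\right]$$

That is, the log-probability that D judges real data to be real plus the log-probability that D judges generated data to be fake is maximized, which is the same as minimizing the negative of their sum. Here $p_{\text{data}}$ is the distribution of the real sequences, $G_\theta$ is the generator, and $D_\phi$ is the discriminator with parameters φ.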

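After one or more rounds of training D, G needs to be updated using D, and the method used is the policy gradient (which can be regarded as a gradient ascent process, since J(θ) is maximized).

The partial derivative with respect to θ is calculated:

$$\nabla_\theta J(\theta) \simeq \sum_{t=1}^{T} \mathbb{E}_{y_t \sim G_\theta(y_t \mid Y_{1:t-1})}\left[\nabla_\theta \log G_\theta\left(y_t \mid Y_{1:t-1}\right)\, Q_{D_\phi}^{G_\theta}\left(Y_{1:t-1}, y_t\right)\right]$$

so that the model parameters can be updated:

$$\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta)$$

where α is the learning rate. The model parameters can thus be updated according to the gradient, and after the generator is updated the training steps are repeated until the training target or the maximum number of iterations is reached.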
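The effect of one policy-gradient update can be checked on a toy problem. The sketch below is not the Seq-GAN generator: it assumes a single-step softmax policy over a handful of actions with fixed, made-up Q values, and it uses the exact gradient of J = Σ π(a)·Q(a) instead of a sampled estimate. All names and numbers are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_reward(logits, q_values):
    """J = sum over actions of pi(a) * Q(a)."""
    return sum(p * q for p, q in zip(softmax(logits), q_values))


def policy_gradient_step(logits, q_values, lr=0.5):
    """One gradient-ascent step on J for a softmax policy.

    For softmax logits z, dJ/dz_k = pi_k * (Q_k - J), which equals
    sum_a pi_a * Q_a * d(log pi_a)/dz_k, the policy-gradient form.
    """
    probs = softmax(logits)
    j = sum(p * q for p, q in zip(probs, q_values))
    grads = [p * (q - j) for p, q in zip(probs, q_values)]
    return [z + lr * g for z, g in zip(logits, grads)]


if __name__ == "__main__":
    logits = [0.0, 0.0, 0.0, 0.0]
    q_values = [1.0, 0.2, 0.2, 0.2]  # action 0 is rewarded most
    updated = policy_gradient_step(logits, q_values)
    print(expected_reward(logits, q_values), expected_reward(updated, q_values))
```

The update raises the logit of the action with above-average Q and lowers the others, so the expected reward increases after the step.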

S4, fusing the first artificial sequence and the original positive sample data set to obtain an amplified positive sample data set;

In this example, 5000 strong enhancer sequences and 5000 weak enhancer sequences were fused.

S5, fusing the second artificial sequence and the original negative sample data set to obtain an amplified negative sample data set;

in this example, 10000 sequences of the negative sample data set were amplified.

S6, performing statistics-based sequence word segmentation on the amplified positive sample data set to obtain a first word segmentation result and a word segmentation model;

A word segmentation model is trained on the positive sample set using the statistical sequence word segmentation method, with the number of words set to 150.

DNA is composed of only four kinds of deoxynucleotides, A, C, G, and T, which can be combined into many deoxynucleotide combinations that can serve as words. To find the optimal word segmentation of a DNA sequence, a statistical probability analysis is performed. Statistics-based sequence word segmentation of the artificial sequence to obtain the word segmentation result and the word segmentation model specifically comprises the following steps:

s601 calculating probability p (x) using maximum likelihood function method_i) Said p (x)_i) The method comprises the following steps: the probability that a word appearing at the ith position of sequence X appears in the dataset;

the likelihood function L is:

wherein D represents a corpus and X^(s)Represents the S-th sample in the corpus, S (X)^(s)) All candidate compositions representing the s-th sample; for a given corpus, when calculating the ACG, it is equivalent to deduct the ACG in the current sequence, and the AC in the ACG does not participate in the likelihood function calculation when continuing to count the AC. When the likelihood function is maximum, the dictionary constitution with the maximum whole corpus composition probability can be obtained, and p (x) is calculated_i)。

The key to calculating the optimal combination is p (x)_i) And (4) obtaining. The frequency and frequency are not reasonable by simple calculations because the frequency statistics are not independent. Assuming that counting is used for frequency calculation, the ACG content and the AC content are not completely independent when calculated, since AC is part of the ACG. In fact, after counting the ACG at the current time, the ACs contained in the ACG cannot continue to participate in the statistics. Therefore, p (x) is calculated using the maximum likelihood function method_i)。

S602: calculating, according to p(x_i), the combined probability p(X) of the sequence X, where p(X) is:

$$p(X) = \prod_{i=1}^{M} p(x_i)$$

where M is the number of words in the composition of X under consideration;

S603: selecting the composition with the highest value of p(X) and recording it as the word segmentation result of the sequence X.

The optimal word segmentation result of the sequence, x^*, is the word combination with the highest P(x):

$$x^* = \arg\max_{x \in S(X)} P(x)$$

where S(X) is the set of all possible segmentations of the sequence X.
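Steps S602 and S603 can be illustrated with a small dynamic-programming sketch. It assumes the word probabilities p(x_i) are already available as a dictionary; the function name `segment` and the toy probabilities are hypothetical and are not taken from the application.

```python
import math


def segment(sequence, word_probs):
    """Return (words, log_prob) of the composition with the highest
    product of word probabilities, found by dynamic programming."""
    max_len = max(len(w) for w in word_probs)
    n = len(sequence)
    best = [-math.inf] * (n + 1)  # best[i]: best log-probability of sequence[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last word ending at i
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            p = word_probs.get(sequence[start:end])
            if p is None or best[start] == -math.inf:
                continue  # unknown word, or prefix cannot be segmented
            score = best[start] + math.log(p)
            if score > best[end]:
                best[end] = score
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("sequence cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end to recover the words.
    words = []
    pos = n
    while pos > 0:
        words.append(sequence[back[pos]:pos])
        pos = back[pos]
    words.reverse()
    return words, best[n]


# Toy probabilities, for illustration only.
toy_probs = {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.1,
             "AC": 0.2, "GT": 0.1, "ACG": 0.3}
words, log_p = segment("ACGT", toy_probs)
print(words, math.exp(log_p))
```

Working in log-probabilities avoids numerical underflow on long sequences; the composition that maximizes the sum of logs is the same one that maximizes the product.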

S7, performing word segmentation on the amplified negative sample data set according to a word segmentation model to obtain a second word segmentation result;

S8, training on the first word segmentation result and the second word segmentation result to obtain a trained classification model.

Based on ten-fold cross validation, positive and negative samples are repeatedly fed into the convolutional neural network and the network parameters are iteratively updated, thereby training the classification model.
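The ten-fold split mentioned above can be sketched as follows. This is only an illustration of how sample indices might be partitioned; the helper name `k_fold_indices` and the fixed seed are assumptions, and the actual training of the convolutional neural network is not shown.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at offset i.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


for fold_no, (train, validation) in enumerate(k_fold_indices(100), start=1):
    print(fold_no, len(train), len(validation))
```

Each sample is used for validation exactly once and for training in the other nine folds.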

With reference to the foregoing embodiments, in some feasible embodiments, word vector training is performed on the first word segmentation result according to a Word2vec network model to obtain a word embedding matrix.

Word vector training is performed on the positive sample set using the Word2vec algorithm to obtain a 300-dimensional vector for each biological word, forming a word embedding matrix of size 150 × 300. This matrix is subsequently used to construct the word embedding layer of the convolutional neural network shown in FIG. 4. Word2vec converts text into numeric vector representations that a computer can process more readily.

Based on the above embodiment, other enhancers to be tested can be identified by performing the following steps:
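The role of the 150 × 300 matrix as a lookup table can be sketched as below. The random values merely stand in for trained Word2vec vectors, and the names `build_embedding_matrix` and `embed` are hypothetical; only the sizes 150 and 300 come from the text.

```python
import random

VOCAB_SIZE = 150   # number of words, as stated in the text
EMBED_DIM = 300    # word vector dimension, as stated in the text


def build_embedding_matrix(vocab, dim=EMBED_DIM, seed=0):
    """Return (word_to_index, matrix). The values are random placeholders,
    NOT trained Word2vec vectors."""
    rng = random.Random(seed)
    word_to_index = {word: i for i, word in enumerate(vocab)}
    matrix = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
    return word_to_index, matrix


def embed(words, word_to_index, matrix):
    """Map a segmented sequence to its rows of the embedding matrix."""
    return [matrix[word_to_index[word]] for word in words]


vocab = ["w%d" % i for i in range(VOCAB_SIZE)]  # placeholder word list
word_to_index, matrix = build_embedding_matrix(vocab)
vectors = embed(["w3", "w7"], word_to_index, matrix)
print(len(matrix), len(matrix[0]), len(vectors))
```

A segmented sequence of n words thus becomes an n × 300 input for the network.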

S9, acquiring a data set to be tested;

The data set to be tested comprises an enhancer data set and a non-enhancer data set; the enhancer data set further includes strong enhancers and weak enhancers.

S10, performing word segmentation on the data set to be tested according to the word segmentation model to obtain a word segmentation result to be tested;

The statistical sequence word segmentation model is trained on the enhancer sequences, and the trained model is then used to segment the data set to be tested.

S11, obtaining the recognition result of the data set to be tested according to the word segmentation result to be tested and the word embedding matrix.

The vectors in the vector set are identified using a convolutional neural network with an embedding layer to obtain the identification result of the enhancer.

The application also provides a method for identifying the enhancer type, which comprises the following steps:

acquiring a strong enhancer data set, wherein the strong enhancer data set is a set containing a plurality of strong enhancer sequences;

acquiring an enhancer type identification reference data set, wherein the enhancer type identification reference data set comprises a strong enhancer data set and a weak enhancer data set, the strong enhancer data set is a set containing a plurality of strong enhancer sequences, and the weak enhancer data set is a set containing a plurality of weak enhancer sequences.

Performing statistical sequence word segmentation on the strong enhancer sequence to obtain word segmentation results;

and carrying out word vector training on the word segmentation result to obtain a word embedding matrix.

In one possible embodiment, the vocabulary size is set to 150 and the word vector dimension is set to 300.

It should be noted that, in the method for identifying the enhancer type, the sequence generation method is the same as that described for the enhancer identification method in the above embodiments and is not repeated here.

FIG. 2 is a schematic diagram of the overall recognition process of a biological enhancer recognition method according to the present application. As shown in fig. 2, the reference data set includes: a training data set and an independent test set, the training data set comprising: an original positive sample data set and an original negative sample data set.

In the enhancer identification method, the original positive sample data set is the enhancer data set, and the original negative sample data set is the non-enhancer data set.

In the enhancer type identification method, the goal is to identify whether an enhancer is strong or weak, so the original positive sample data set is the strong enhancer data set, and the original negative sample data set is the weak enhancer data set.

The independent test set is used to test the recognition capability, i.e., the predictive performance, of the model. First, the strong enhancer data set and the weak enhancer data set in the independent test set are merged to obtain a positive sample independent test set of 200 sequences; the negative sample independent test set is the non-enhancer independent test set. The positive sample independent test set and the negative sample independent test set are then segmented using the trained word segmentation model. The resulting word segmentation can be input directly into the trained convolutional neural network, which outputs a prediction for each test sequence. The classification model may be evaluated using a plurality of evaluation indices.
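The evaluation indices reported in the results (Acc, Sn, Sp and MCC) can be computed from the confusion-matrix counts. The sketch below uses the standard definitions of accuracy, sensitivity, specificity and Matthews correlation coefficient; the function name `evaluate` and the example counts are illustrative assumptions, not values from the application.

```python
import math


def evaluate(tp, tn, fp, fn):
    """Compute Acc, Sn, Sp and MCC from confusion-matrix counts.
    Assumes there is at least one positive and one negative sample."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sn = tp / (tp + fn)   # sensitivity: fraction of positives recovered
    sp = tn / (tn + fp)   # specificity: fraction of negatives recovered
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Acc": acc, "Sn": sn, "Sp": sp, "MCC": mcc}


# Made-up counts for a 200-sequence test set (100 positive, 100 negative).
metrics = evaluate(tp=80, tn=70, fp=30, fn=20)
print(metrics)
```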

Based on the overall recognition flow diagram as shown in fig. 2, the effectiveness of the recognition method of the bio-enhancer and the type thereof proposed in the present application is analyzed as follows.

The reference data set contains 1484 enhancer sequences and 1484 non-enhancer sequences. There are two types of enhancer sequences, including 742 strong enhancer sequences and 742 weak enhancer sequences. In addition, an independent test set was constructed, which contained 200 enhancer sequences (divided into 100 strong enhancer sequences and 100 weak enhancer sequences) and 200 non-enhancer sequences.

First, in order to verify the validity of the artificial sequence, the matching degree of each attribute of the artificial sequence with the natural sequence was analyzed according to the nucleotide composition and the sequence physicochemical properties, as shown in fig. 5 to 8.

FIG. 5 compares the content of the four mononucleotides. The nucleotide contents of the non-enhancers, strong enhancers and weak enhancers differ; for example, the A and T contents of the non-enhancers are relatively high. The nucleotide content distributions of the strong and weak enhancers are similar, which indirectly reflects how difficult it is to distinguish strong enhancers from weak enhancers. The nucleotide contents of the non-enhancers and the artificial non-enhancers are almost the same, and the same holds for the strong enhancers and the weak enhancers.

FIG. 6 compares the content of the 16 dinucleotides between the non-enhancers and the artificial non-enhancers. The gap between the artificial and natural sequences is small, with little difference except in TT content. The likely reason is that the non-enhancers consist of randomly selected sequences other than enhancers, which makes such sequences more difficult to generate. The difference in mononucleotide content is negligible, but as the sequence composition unit grows, the drawback of a randomly composed non-enhancer data set becomes apparent. Nevertheless, the overall compositions remain very close, which supports the effectiveness of the sequence generation model.

FIG. 7 compares the content of the 64 trinucleotides between the weak enhancers and the artificial weak enhancers; the gap between the artificial and natural sequences is again small.

FIG. 8 is a schematic diagram of the physicochemical property analysis of the artificial sequences and the natural sequences in the method for identifying a biological enhancer and its type according to the present application. FIG. 8 compares six trinucleotide-based physicochemical properties between the artificial and natural sequences. Because some properties have large absolute values and small relative differences, the actual value of each property is listed below the figure. The differences between the artificial sequences and the natural sequences are small, and no physicochemical property differs by more than 5%.
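The mono-, di- and trinucleotide content comparisons can be reproduced with a simple k-mer frequency count. This is a sketch under the assumption that content means the relative frequency of overlapping k-mers over a set of sequences; the function names and the toy sequences are hypothetical.

```python
from itertools import product


def kmer_content(sequences, k):
    """Relative frequency of every overlapping k-mer over the alphabet ACGT."""
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in counts:  # skip k-mers containing other symbols
                counts[kmer] += 1
    total = sum(counts.values())
    return {kmer: (c / total if total else 0.0) for kmer, c in counts.items()}


def max_content_gap(natural, artificial, k):
    """Largest absolute difference in k-mer content between two sequence sets."""
    a = kmer_content(natural, k)
    b = kmer_content(artificial, k)
    return max(abs(a[kmer] - b[kmer]) for kmer in a)


natural = ["ACGTACGT", "TTGACA"]     # toy sequences, for illustration only
artificial = ["ACGTTCGT", "TTGACC"]
for k in (1, 2, 3):
    print(k, len(kmer_content(natural, k)), max_content_gap(natural, artificial, k))
```

For k = 1, 2 and 3 the dictionaries have 4, 16 and 64 entries, matching the numbers of units compared in FIGS. 5 to 7.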

In conclusion, the artificial sequences generated by the Seq-GAN network closely resemble the natural sequences in both statistical and physicochemical properties, which verifies the validity and usability of the artificial sequences. Accordingly, data randomly drawn from the CD-hit sequences are combined with the original reference data set, so that the final non-enhancer data set contains 10,000 sequences and the strong enhancer and weak enhancer data sets each contain 5,000 sequences. To ensure the reliability and soundness of the experiment, the test set is not expanded.
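The fusion of randomly drawn artificial sequences with the original data can be sketched as follows. The function name `amplify`, the seed and the toy data are assumptions; the sketch only shows drawing without replacement until a target size is reached.

```python
import random


def amplify(original, artificial_pool, target_size, seed=0):
    """Return the original sequences plus randomly drawn artificial
    sequences (without replacement) so that the result has target_size items."""
    needed = target_size - len(original)
    if needed < 0 or needed > len(artificial_pool):
        raise ValueError("target size cannot be reached with this pool")
    drawn = random.Random(seed).sample(artificial_pool, needed)
    return list(original) + drawn


original = ["ACGT", "TTGA", "CCGA"]                 # toy natural sequences
pool = ["AC" * (i + 2) for i in range(10)]          # toy artificial sequences
amplified = amplify(original, pool, target_size=8)
print(len(amplified))
```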

Secondly, to demonstrate the effectiveness of the other technical features of the application, an original model is introduced as a baseline. The original model is built on the reference data set using 3-gram word segmentation, word2vec and a CNN with an embedding layer. The 3-gram segmentation is overlapping; for example, the sequence ACGT is divided into ACG and CGT. Word2vec likewise trains 300-dimensional vectors. A default-mode CNN means that the word embedding layer in the CNN does not participate in training and serves only as a lookup.

Table 1 compares, on the independent test set, the prediction performance of the improved model based on statistical sequence word segmentation with that of the original model, both built on the reference data set. As shown in Table 1, the model with statistical sequence word segmentation performs considerably better on the independent test set, that is, its generalization performance is improved. For example, in the enhancer identification task, the model with statistical sequence word segmentation raises Acc, Sp, Sn and MCC from 75.23%, 72.40%, 78.05% and 0.5390 to 77.23%, 74.55%, 79.90% and 0.5775, respectively. Likewise, in the enhancer type identification task, Acc, Sn and MCC rise by 0.6%, 0.74% and 0.0526, respectively. These results verify the effectiveness of statistical sequence word segmentation.
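The overlapping 3-gram segmentation used by the original model can be written in a few lines. The function name `ngram_words` is hypothetical; the ACGT example is the one given in the text.

```python
def ngram_words(sequence, n=3):
    """Split a sequence into overlapping n-grams with a stride of one."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


print(ngram_words("ACGT"))  # ['ACG', 'CGT']
```

A sequence of length m yields m - n + 1 words, so neighbouring words share n - 1 nucleotides.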

TABLE 1

Finally, to verify the validity of the combined model based on statistical sequence word segmentation and Seq-GAN sequence generation, Table 2 compares the prediction performance of the combined improved model with that of the original model and the single improved models on the reference data set. As shown in Table 2, in both the enhancer identification and enhancer type identification tasks, every performance index of the combined model is the best, which indicates both the strong identification capability of the combined model and the effectiveness of statistical sequence word segmentation.

TABLE 2

Table 3 compares the prediction performance of the combined improved model with that of the original model and the single improved models on the independent test sets. As shown in Table 3, on the independent test set for enhancer identification, the combined model achieves the highest Acc and Sn, while its Sp and MCC do not exceed those of the model based on statistical sequence word segmentation alone. Taking all evaluation indices together, the combined model is the best on two or three indices, so its generalization performance is stronger.

TABLE 3

In conclusion, combining statistical sequence word segmentation with Seq-GAN sequence generation greatly improves model performance: it exploits the advantages of both schemes, compensates for their respective shortcomings, and achieves both strong recognition capability and strong generalization capability. Therefore, a deep learning model based on statistical sequence word segmentation and Seq-GAN sequence generation is adopted as the optimal model for enhancer and enhancer type identification.

The embodiments provided in the present application are only a few examples of the general concept of the present application and do not limit its scope. Any other embodiment that a person skilled in the art derives from the scheme of the present application without inventive effort falls within the scope of protection of the present application.

Claims

1. A method of identifying a biological enhancer, comprising:

obtaining a reference data set, the reference data set comprising a training data set, the training data set comprising an original positive sample data set and an original negative sample data set, wherein the original positive sample data set is an enhancer data set, the original negative sample data set is a non-enhancer data set, the enhancer data set is a set containing a plurality of enhancer sequences, and the non-enhancer data set is a set containing a plurality of non-enhancer sequences;

pre-segmenting the sequences in the reference data set into words according to n-gram to obtain pre-segmented word sequences, wherein the enhancer sequences are pre-segmented into first pre-segmented word sequences, and the non-enhancer sequences are pre-segmented into second pre-segmented word sequences;

training the first pre-segmented word sequence according to a Seq-GAN network model to generate a first artificial sequence, and training the second pre-segmented word sequence to generate a second artificial sequence;

fusing the first artificial sequence and the original positive sample data set to obtain an amplified positive sample data set;

fusing the second artificial sequence and the original negative sample data set to obtain an amplified negative sample data set;

performing sequence word segmentation on the amplified positive sample data set based on statistics to obtain a first word segmentation result and a word segmentation model;

performing word segmentation on the amplified negative sample data set according to a word segmentation model to obtain a second word segmentation result;

and training the first word segmentation result and the second word segmentation result to obtain a trained classification model.

2. The identification method according to claim 1, further comprising:

and performing Word vector training on the first Word segmentation result according to a Word2vec network model to obtain a Word embedding matrix.

3. The identification method according to claim 2, further comprising:

acquiring a data set to be tested;

performing word segmentation on the data set to be tested according to the word segmentation model to obtain a word segmentation result to be tested;

and obtaining the recognition result of the data set to be tested according to the word segmentation result to be tested and the word embedding matrix.

4. The method according to claim 1, wherein the step of performing sequence segmentation on the amplified positive sample data set based on statistics to obtain a first segmentation result and a segmentation model comprises:

the likelihood function L is:

5. The identification method according to claim 1, wherein the step of training the first pre-segmented word sequence according to a Seq-GAN network model to generate a first artificial sequence and training the second pre-segmented word sequence to generate a second artificial sequence comprises:

an artificial sequence generation step: identifying the complete sequence and generating a reward;

transmitting the reward back to the generator;

updating the generator and updating the state, wherein the state refers to the existing words or characters;

judging whether a preset condition is reached, wherein the preset condition comprises the following steps: a training target or maximum number of iterations;

and if so, obtaining the artificial sequence of the pre-segmented word sequence.

6. The identification method according to claim 1, wherein the step of training the first pre-segmented word sequence according to a Seq-GAN network model to generate a first artificial sequence and training the second pre-segmented word sequence to generate a second artificial sequence comprises:

performing word vector training on the first pre-segmented word sequence according to a Word2vec network model to obtain a first pre-training word vector, and performing word vector training on the second pre-segmented word sequence to obtain a second pre-training word vector;

and training the first pre-training word vector according to a Seq-GAN network model to generate a first artificial sequence, and training the second pre-training word vector to generate a second artificial sequence.

7. A method for identifying a type of bio-enhancer, comprising: