CN108009402A

CN108009402A - Method for a microbial gene sequence classification model based on a dynamic convolutional network

Info

Publication number: CN108009402A
Application number: CN201710609781.8A
Authority: CN
Inventors: 段大高; 赵振东; 韩忠明
Original assignee: Beijing Technology and Business University
Current assignee: Beijing Technology and Business University
Priority date: 2017-07-25
Filing date: 2017-07-25
Publication date: 2018-05-08

Abstract

The present invention is a method for a microbial gene sequence classification model based on a dynamic convolutional network. Step 1: obtain microbial gene sequence data with existing classification results. Step 2: preprocess the data. Step 3: build the dynamic convolutional network structure. Step 4: feed the prepared data into the dynamic convolutional network built in Step 3 and train it with backpropagation and stochastic gradient descent for 100 iterations, using multi-class cross entropy as the cost function, to obtain the classification model. Step 5: input the segmented sequence to be classified into the dynamic convolutional network model trained in Step 4 to obtain the classification result. The method requires no manual data processing or feature extraction: the model extracts abstract features and completes the classification task automatically, with high efficiency and accuracy, and can be applied effectively to bioinformatics analysis and processing.

Description

Method for a microbial gene sequence classification model based on a dynamic convolutional network

Technical field

The present invention relates to a method for a microbial gene sequence classification model based on a dynamic convolutional network. It is applied to the classification and identification of microbial gene sequences and belongs to the technical fields of data mining and bioinformatics.

Background art

DNA sequence data is one of the main objects of study in bioinformatics. Analyzing DNA sequences reveals the underlying structure of sequences and the functional relationships between them. The volume of DNA sequence data is growing exponentially, and using modern computers to analyze this data would greatly help our understanding of DNA. DNA sequence classification follows the principle that sequences with similar structure also have similar function. Traditionally, sequence similarity is established with sequence alignment methods such as BLAST and FASTA. This choice rests on two main assumptions: (1) functional elements share common sequence features, and (2) the relative order of functional elements is conserved between different sequences. Although these assumptions hold in many cases, they are not universal. Moreover, the key limitation on the application of alignment methods remains their computational time complexity. Alignment-free methods developed recently have therefore become effective tools for genome analysis. In alignment-free methods a sequence is treated as a collection of k-mers; effective features are found by analyzing the k-mer distribution in each sequence, and the sequences are then classified with traditional classification methods. In gene sequence classification, however, feature selection and analysis are often time-consuming and laborious, and their effectiveness is uncertain.

At present, deep learning models have achieved very good results in fields such as image recognition and natural language processing and are receiving increasing attention. This method classifies sequences mainly by means of a dynamic convolutional network from deep learning. Because deep learning can itself extract high-level abstract features, it eliminates the feature engineering step of conventional machine learning algorithms, which greatly improves problem-solving efficiency while also reaching a very high level of accuracy.

Summary of the invention

1. Purpose:

The purpose of the present invention is to provide a method for a microbial gene sequence classification model based on a dynamic convolutional network that can classify microbial gene sequences effectively, thereby improving the efficiency and level of microbiological analysis.

The principle of the present invention is as follows. The gene sequence is processed first: the gene sequence text of a microorganism is segmented into words, and the segmentation result is the input to the algorithm model. The model first converts the segmented sequence into a vector matrix using word embedding. In the convolutional layer, one-dimensional convolution kernels are convolved with the word embedding matrix; the first convolutional layer is set to 12 channels. After the convolutional layer the data enters the dynamic pooling layer, which determines the size of the pooling region from the length of the input sequence and the current convolutional layer number so as to retain as much of the sequence's useful information as possible. Finally, the folding layer reduces the matrix dimensionality, and the abstract sequence features extracted by the dynamic convolutional network are classified in the fully connected layer.

2. Technical solution: The technical solution provided by the invention is as follows:

The present invention is a method for a microbial gene sequence classification model based on a dynamic convolutional network, as shown in Fig. 2. The method comprises the following steps:

Step 1: Obtain microbial gene sequence data with existing classification results.

Step 2: Data preprocessing:

1) delete illegal characters from the gene sequence;

2) one-hot encode the different class category attributes;

3) segment the gene sequence into a word sequence of 8-character words;

4) divide the prepared sequence text set by the corresponding classification labels into phylum, class, order and family, giving data at four classification levels.
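Sub-steps 1) and 2) above can be sketched as follows. This is a minimal illustration, not the inventors' code: the text does not say which characters count as illegal, so the legal alphabet `acgt` is an assumption, and the function names `clean_sequence` and `one_hot_labels` are hypothetical.

```python
import re


def clean_sequence(seq, alphabet="acgt"):
    """Lower-case the sequence and drop every character outside `alphabet`.

    The legal alphabet is an assumption; the patent only says that
    illegal characters are deleted.
    """
    return re.sub(f"[^{alphabet}]", "", seq.lower())


def one_hot_labels(labels):
    """Return the sorted class list and one one-hot row per input label."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    rows = [
        [1 if index[name] == j else 0 for j in range(len(classes))]
        for name in labels
    ]
    return classes, rows


if __name__ == "__main__":
    print(clean_sequence("GTTN-ta\n"))
    print(one_hot_labels(["Firmicutes", "Actinobacteria", "Firmicutes"]))
```

Sorting the class names keeps the label-to-index mapping reproducible between runs, which matters when the same encoding must be reused at prediction time.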

Step 3: Build the dynamic convolutional network structure, as shown in Fig. 1:

1) word embedding layer: the word embedding layer vectorizes the input sequence words. Different sequence words map to different vectors, and similar sequence words lie closer together in the mapping space. The word vector length is set to 48 here;

2) input layer: the input layer receives the sequence vectors output by the word embedding layer and passes the data matrix to the next layer;

3) convolutional layer: mainly one-dimensional convolution, in which the convolution kernels are convolved with the input sequence data matrix to produce the output; there are two convolutional layers, named the first convolutional layer and the second convolutional layer;

4) dynamic pooling layer: the pooling layer selects the pooling parameter dynamically according to the convolutional layer number and the length of the input sequence, so as to retain as much useful information as possible. The dynamic pooling parameter is selected according to formula (1):

where K_l is the selected value of the pooling parameter k for layer l, K_top is the pooling parameter of the top layer, L is the total number of convolutional layers in the network, l is the current layer number, and s is the length of the sequence, in bp;

5) folding layer: the folding layer merges every two adjacent (upper and lower) rows of the output matrices of the first convolutional layer and the second convolutional layer by element-wise addition;

6) fully connected layer: a fully connected layer with 1024 neurons, which extracts the deep abstract features learned by the neural network;

7) output layer: the number of output-layer neurons is set according to the number of classes; the layer outputs the classification result learned by the neural network;
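The dynamic pooling and folding layers can be sketched in NumPy as follows. Formula (1) is not reproduced in this text; the symbol definitions match the dynamic k-max pooling rule of the dynamic convolutional neural network of Kalchbrenner et al. (2014), k_l = max(k_top, ceil((L - l) / L * s)), so the sketch assumes that rule. The function names are hypothetical, and the sketch only illustrates the two operations.

```python
import math

import numpy as np


def dynamic_k(layer, total_layers, seq_len, k_top):
    """Pooling parameter for `layer` (assumed form of formula (1))."""
    return max(k_top, math.ceil((total_layers - layer) / total_layers * seq_len))


def k_max_pool(x, k):
    """Keep the k largest values of each row of x, in their original order.

    x has shape (rows, length); the result has shape (rows, min(k, length)).
    """
    k = min(k, x.shape[1])
    idx = np.argpartition(x, -k, axis=1)[:, -k:]  # positions of the k largest
    idx = np.sort(idx, axis=1)                    # restore sequence order
    return np.take_along_axis(x, idx, axis=1)


def fold(x):
    """Add each pair of adjacent rows: (rows, length) -> (rows // 2, length)."""
    return x[0::2] + x[1::2]


if __name__ == "__main__":
    print(dynamic_k(layer=1, total_layers=2, seq_len=18, k_top=5))
    print(k_max_pool(np.array([[1, 5, 2, 4, 3]]), 3))
    print(fold(np.array([[1, 2], [3, 4], [5, 6], [7, 8]])))
```

Under this assumed rule the last convolutional layer (l = L) always pools to K_top values, so the fully connected layer receives a fixed-size input regardless of sequence length. With a word vector length of 48, folding halves the row count to 24.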

Step 4: Feed the prepared data into the dynamic convolutional network built in Step 3 and train it with backpropagation and stochastic gradient descent for 500 iterations. Multi-class cross entropy is used as the cost function, and the classification model is finally obtained.

Step 5: Input the segmented sequence to be classified into the dynamic convolutional network model trained in Step 4 and output the classification result.

3. Advantages and effects: With the method for a microbial gene sequence classification model based on a dynamic convolutional network provided by the present invention, the classification process requires no manual data processing or feature extraction. The model extracts abstract features and completes the classification task automatically, its efficiency and accuracy are both relatively high, and it can be applied effectively to bioinformatics analysis and processing.

Brief description of the drawings:

Fig. 1 shows the dynamic convolutional network structure built in Step 3 of the present invention. In the figure, V1, V2, V3, V4, V5, V6 and V7 denote 7 word embedding vectors. The input word vectors form the training data matrix, and the convolutional layers, pooling layers, folding layer and fully connected layer extract abstract sequence features for accurate classification.

Fig. 2 shows the flow chart of the method of the present invention.

Detailed description:

The technical solution of the present invention is described further below with reference to the accompanying drawings and an embodiment.

The method for a microbial gene sequence classification model based on a dynamic convolutional network of the present invention specifically includes the following steps:

Step 1: Obtain microbial gene sequence data, taking the RDP rRNA gene data as an example (download address: http://rdp.cme.msu.edu/; this dataset contains 3,356,809 annotated microbial 16S rRNA sequences), and randomly sample 300,000 sequences from it for the subsequent model training.

Step 2: Data preprocessing:

1) delete illegal characters from the gene sequence;

2) one-hot encode the different class category attributes;

3) segment the gene sequence into a word sequence of 8-character words;

For example, the sequence fragment gtttataagggcttgccctt can be segmented into: gtttataa, tttataag, ttataagg, tataaggg, ataagggc, taagggct, aagggctt; every 8 bp is treated as one word.

Take the following sequence:

gtttataagggcttgcccttatagatagtggcgaacgggtgcgtaacacgtgagcaacctgccccaaagtttggaat
aacaccgggaaaccgatgctaataccaaatatgctcacactatcacaagatagagtgaggaaagtttttcgctttgg
gaggggctcgcggcctatcagcttgttggtgaggtaacggctcaccaaggcatcgacgggtagctggtctgagagga
cgatcagccacactgggactgagacacggcccagactcctacgggaggcagcagtggggaatattgcgcaatgggcg
aaagcctgacgcagcaacgccgcgtggaggatgaaggccttagggtcgtaaactcctttcagcaggaacgaaaatga
cggtacctgcagaagaagctccggccaactacgtgccagcagccgcggtaatacgtagggagcaagcgttgtccgga
tttattgggcgtaaagagctcgtaggcggcttggcaagtcggatgtgaaacccccaggcttaacctggggccgccat
tcgatactgctatggcttgagttcggtaggggattgtggaattcccggtgtagcggtgaaatgcgcagatatcggga
ggaac

Splitting it into 8-character words yields the sequence text set.

4) divide the prepared sequence text set by the corresponding classification labels into phylum, class, order and family, giving data at four classification levels;
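Sub-steps 3) and 4) can be sketched as below. The worked example in the text lists overlapping words that advance one base at a time, so the sketch assumes a sliding window with stride 1. The function names and the record layout (a sequence string plus a dictionary of taxonomy labels) are hypothetical.

```python
RANKS = ("phylum", "class", "order", "family")


def segment(seq, k=8, stride=1):
    """Split a sequence into overlapping k-character words."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def split_by_rank(records):
    """Build one (words, label) dataset per taxonomic rank.

    `records` is an iterable of (sequence, labels) pairs where `labels`
    maps each rank name to its class name (hypothetical layout).
    """
    datasets = {rank: [] for rank in RANKS}
    for seq, labels in records:
        words = segment(seq)
        for rank in RANKS:
            datasets[rank].append((words, labels[rank]))
    return datasets


if __name__ == "__main__":
    print(segment("gtttataagggcttgccctt")[:7])
```

Run on the 20-base fragment from the example, `segment` returns 13 words, the first seven of which are the words listed in the text.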

Step 3: Build the dynamic convolutional network structure, as shown in Fig. 1. The network parameters are set as follows: K_top = 5, the word embedding vector length is 48, the first convolutional layer has 12 channels, the second convolutional layer has 8 channels, and the number of output-layer neurons depends on the number of classes in each classification task; in this experiment, phylum: 50, class: 114, order: 256, family: 1386.
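The parameter values above can be collected into a short NumPy sketch of the embedding lookup and the first convolutional layer. The text states only that the convolution is one-dimensional. The row-wise wide convolution used here follows the published dynamic convolutional neural network design and is an assumption, as are the kernel width, the random weights and the function names.

```python
import numpy as np

# Values stated in the embodiment.
EMBED_DIM, CONV1_CHANNELS, CONV2_CHANNELS, K_TOP = 48, 12, 8, 5
OUTPUT_UNITS = {"phylum": 50, "class": 114, "order": 256, "family": 1386}

# Assumptions, not stated in the patent.
KERNEL_WIDTH = 7      # hypothetical kernel width
VOCAB_SIZE = 4 ** 8   # number of distinct 8-letter words over a, c, g, t

rng = np.random.default_rng(0)


def embed(word_ids, table):
    """Look up word vectors and return a (EMBED_DIM, seq_len) matrix."""
    return table[word_ids].T


def conv1d_rows(x, kernels):
    """Row-wise wide 1-D convolution.

    x: (d, s); kernels: (channels, d, m) -> (channels, d, s + m - 1).
    """
    return np.stack([
        np.stack([np.convolve(x[r], kern[r], mode="full") for r in range(x.shape[0])])
        for kern in kernels
    ])


table = rng.normal(size=(VOCAB_SIZE, EMBED_DIM))
x = embed(np.array([3, 1, 4, 1, 5]), table)
y = conv1d_rows(x, rng.normal(size=(CONV1_CHANNELS, EMBED_DIM, KERNEL_WIDTH)))
print(x.shape, y.shape)  # (48, 5) (12, 48, 11)
```

A separate output layer size per rank suggests one classifier per taxonomic level, which is how `OUTPUT_UNITS` is laid out here.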

Step 4: Feed the prepared data into the dynamic convolutional network built in Step 3 and train it with backpropagation and stochastic gradient descent for 500 iterations, with a batch size of 200 training samples each time. Multi-class cross entropy is used as the cost function, and the classification model is finally obtained. Experiments show that the model essentially converges at around 500 iterations and the classification accuracy reaches about 99.6%, a substantial improvement over conventional machine learning algorithms.
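The training procedure can be illustrated with a reduced NumPy sketch that applies mini-batch stochastic gradient descent and multi-class cross entropy to a single softmax output layer. The full network would backpropagate the same loss through every layer. Whether the 500 iterations are passes over the data or single updates is not stated, so the sketch treats them as passes; the learning rate and all function names are hypothetical.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def cross_entropy(probs, onehot):
    """Mean multi-class cross entropy over a batch."""
    return float(-np.mean(np.sum(onehot * np.log(probs + 1e-12), axis=1)))


def sgd_step(W, b, feats, onehot, lr):
    """One gradient update of the softmax layer; returns the pre-update loss."""
    probs = softmax(feats @ W + b)
    grad = (probs - onehot) / len(feats)
    W -= lr * feats.T @ grad   # in-place update
    b -= lr * grad.sum(axis=0)
    return cross_entropy(probs, onehot)


def train(feats, onehot, epochs=500, batch_size=200, lr=0.1, seed=0):
    """Mini-batch SGD; records the loss of the last batch of every pass."""
    rng = np.random.default_rng(seed)
    W = np.zeros((feats.shape[1], onehot.shape[1]))
    b = np.zeros(onehot.shape[1])
    losses = []
    for _ in range(epochs):
        order = rng.permutation(len(feats))
        for start in range(0, len(feats), batch_size):
            batch = order[start:start + batch_size]
            loss = sgd_step(W, b, feats[batch], onehot[batch], lr)
        losses.append(loss)
    return W, b, losses
```

Shuffling the sample order on every pass is a common choice for stochastic gradient descent; the patent does not say how batches are drawn.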

Step 5: Input the segmented sequence to be classified into the dynamic convolutional network model trained in Step 4 to obtain the following classification result.

phylum: Actinobacteria

class: Actinobacteria

order: Acidimicrobiales

family: Acidimicrobiaceae
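A result of this form could be produced from the output-layer scores as sketched below, assuming one trained model per taxonomic rank and a stored class list for each rank. The function name `decode` and the toy class lists and scores are hypothetical.

```python
def decode(scores_by_rank, classes_by_rank):
    """Pick the highest-scoring class name for every taxonomic rank."""
    result = {}
    for rank, scores in scores_by_rank.items():
        best = max(range(len(scores)), key=scores.__getitem__)
        result[rank] = classes_by_rank[rank][best]
    return result


if __name__ == "__main__":
    # Toy values for illustration only.
    classes = {"phylum": ["Actinobacteria", "Firmicutes"]}
    scores = {"phylum": [0.9, 0.1]}
    print(decode(scores, classes))
```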

Claims

1. A method for a microbial gene sequence classification model based on a dynamic convolutional network, characterized in that the steps of the method are as follows:

Step 1: obtain microbial gene sequence data with existing classification results;

Step 2: data preprocessing:

1) delete illegal characters from the gene sequence;

2) one-hot encode the different class category attributes;

3) segment the gene sequence into a word sequence of 8-character words;

4) divide the prepared sequence text set by the corresponding classification labels into phylum, class, order and family, giving data at four classification levels;

Step 3: build the dynamic convolutional network structure:

1) word embedding layer: the word embedding layer vectorizes the input sequence words; different sequence words map to different vectors, and similar sequence words lie closer together in the mapping space; the word vector length is set to 48 here;

3) convolutional layer: mainly one-dimensional convolution, in which the convolution kernels are convolved with the input sequence data matrix to produce the output; there are two convolutional layers, named the first convolutional layer and the second convolutional layer;

4) dynamic pooling layer: the pooling layer selects the pooling parameter dynamically according to the convolutional layer number and the length of the input sequence, so as to retain as much useful information as possible; the dynamic pooling parameter is selected according to formula (1):

where K_l is the selected value of the pooling parameter k for layer l, K_top is the pooling parameter of the top layer, L is the total number of convolutional layers in the network, l is the current layer number, and s is the length of the sequence, in bp;

5) folding layer: the folding layer merges every two adjacent (upper and lower) rows of the output matrices of the first convolutional layer and the second convolutional layer by element-wise addition;

6) fully connected layer: a fully connected layer with 1024 neurons, which extracts the deep abstract features learned by the neural network;

7) output layer: the number of output-layer neurons is set according to the number of classes; the layer outputs the classification result learned by the neural network;

Step 4: feed the prepared data into the dynamic convolutional network built in Step 3 and train it with backpropagation and stochastic gradient descent for 100 iterations; multi-class cross entropy is used as the cost function, and the classification model is finally obtained;

Step 5: input the segmented sequence to be classified into the dynamic convolutional network model trained in Step 4 to obtain the classification result.