CN108009402A - A method for a microbial gene sequence classification model based on a dynamic convolutional network

Info

Publication number
CN108009402A
Authority
CN
China
Prior art keywords: layer, dynamic, sequence, convolutional, data
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710609781.8A
Other languages
Chinese (zh)
Inventor
段大高
赵振东
韩忠明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Technology and Business University
Original Assignee
Beijing Technology and Business University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2017-07-25
Filing date
2017-07-25
Publication date
2018-05-08
Application filed by Beijing Technology and Business University
Priority to CN201710609781.8A
Publication of CN108009402A

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Abstract

The invention provides a method for a microbial gene sequence classification model based on a dynamic convolutional network. Step 1: obtain microbial gene sequence data with known classification results. Step 2: preprocess the data. Step 3: build the dynamic convolutional network architecture. Step 4: feed the prepared data into the dynamic convolutional network built in Step 3 and train it by backpropagation with stochastic gradient descent for 100 iterations, using multi-class cross-entropy as the cost function, to obtain the classification model. Step 5: input the segmented word sequences to be classified into the dynamic convolutional network model trained in Step 4 to obtain the classification results. The method requires no manual data processing or feature extraction: the model extracts abstract features and completes the classification task automatically, with high efficiency and accuracy, and can be applied effectively to the analysis and processing of biological information.

Description

A method for a microbial gene sequence classification model based on a dynamic convolutional network
Technical field
The present invention relates to a method for a microbial gene sequence classification model based on a dynamic convolutional network. It is applied to the classification and identification of microbial gene sequences and belongs to the technical fields of data mining and bioinformatics.
Background
DNA sequence data is one of the main objects of study in bioinformatics. Analyzing DNA sequences helps reveal the underlying structure of sequences and the functional relationships between them. The volume of DNA sequence data is growing exponentially, and it would be very valuable if modern computers could analyze these huge data sets to help us understand DNA. DNA sequence classification follows the principle that sequences with similar structure also have similar function. Traditionally, sequence similarity is established with sequence alignment methods (such as BLAST and FASTA). This choice rests on two main assumptions: (1) functional elements share common sequence features, and (2) the relative order of functional elements is conserved between different sequences. Although these assumptions hold in a wide range of cases, they are not universal. Apart from these issues, the key problem that severely limits the application of alignment methods is still their computational time complexity. Recently developed alignment-free methods have therefore become an effective approach to genome analysis. In alignment-free methods, a sequence is treated as a collection of k-mers; effective features are found by analyzing the distribution of k-mers in each sequence, and the sequences are then classified with traditional classification methods. In gene sequence classification, however, feature selection and analysis are often time-consuming and laborious, and their effect is uncertain.
At present, model algorithms based on deep learning have achieved very good results in fields such as image recognition and natural language processing, and are receiving increasing attention. The present method performs sequence classification mainly with a dynamic convolutional network from deep learning. Because deep learning can itself extract high-level abstract features, it eliminates the feature engineering step of conventional machine learning algorithms, which greatly improves the efficiency of solving the problem while reaching a very high level of accuracy.
Summary of the invention
1. Purpose:
The purpose of the present invention is to provide a method for a microbial gene sequence classification model based on a dynamic convolutional network that can classify microbial gene sequences effectively, thereby improving the efficiency and level of microbiological analysis.
The principle of the present invention is as follows. The gene sequence is processed first: the gene sequence text of a microorganism is segmented into words, and the segmentation result is the input to the model. The model first converts the segmented sequence into a matrix of vectors using word embedding. In the convolutional layers, one-dimensional convolution kernels are convolved over the word embedding matrix; the first convolutional layer is set to 12 channels. After a convolutional layer, the data enters a dynamic pooling layer, which determines the size of the pooling region from the length of the input sequence and the index of the current convolutional layer, so as to retain as much of the sequence's useful information as possible. Finally, the folding layer reduces the dimension of the matrix, and the abstract features that the dynamic convolutional network has extracted from the sequence are classified in the fully connected layer.
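To make the convolution and folding steps of this data flow concrete, the following pure-Python sketch applies a one-dimensional convolution to each row of a toy embedding matrix and then folds adjacent rows. The function names `conv1d_wide` and `fold_rows`, the kernel values, the toy matrix, and the use of zero-padded ("wide") convolution are assumptions made for illustration; the source does not state the kernel width or the padding scheme.

```python
from typing import List

Matrix = List[List[float]]  # rows = embedding dimensions, columns = word positions


def conv1d_wide(row: List[float], kernel: List[float]) -> List[float]:
    """Wide one-dimensional convolution of one embedding row with one kernel.

    Zero padding of width len(kernel) - 1 on both sides is an assumption;
    the output has len(row) + len(kernel) - 1 columns.
    """
    m = len(kernel)
    padded = [0.0] * (m - 1) + row + [0.0] * (m - 1)
    return [
        sum(kernel[m - 1 - j] * padded[i + j] for j in range(m))
        for i in range(len(row) + m - 1)
    ]


def fold_rows(matrix: Matrix) -> Matrix:
    """Folding: add every two adjacent rows element-wise, halving the row count."""
    if len(matrix) % 2 != 0:
        raise ValueError("folding needs an even number of rows")
    return [
        [a + b for a, b in zip(matrix[r], matrix[r + 1])]
        for r in range(0, len(matrix), 2)
    ]


# Toy embedding matrix: 4 embedding dimensions x 3 words
# (the patent uses a word vector length of 48).
embedded = [
    [1.0, 2.0, 3.0],
    [0.0, 1.0, 0.0],
    [2.0, 2.0, 2.0],
    [1.0, 0.0, 1.0],
]
kernel = [1.0, -1.0]  # hypothetical kernel of width 2

convolved = [conv1d_wide(row, kernel) for row in embedded]
folded = fold_rows(convolved)
```

In this sketch the folding step halves the number of rows (4 to 2) while leaving the number of columns unchanged, which is the dimension reduction the principle describes.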
2. Technical solution: the technical solution provided by the invention is as follows.
The present invention is a method for a microbial gene sequence classification model based on a dynamic convolutional network, as shown in Fig. 2. The specific steps of the method are as follows:
Step 1: Obtain microbial gene sequence data with known classification results.
Step 2: Data preprocessing:
1) delete illegal characters from the gene sequences;
2) one-hot encode the different class labels;
3) cut each gene sequence into a sequence of words of 8 characters each;
4) divide the prepared sequence text set, according to the corresponding classification labels, by phylum, class, order and family into data for four classification levels.
Step 3: Build the dynamic convolutional network architecture, as shown in Fig. 1:
1) Word embedding layer: vectorizes the input sequence words. Different sequence words map to different vectors, and similar sequence words lie closer together in the embedding space. The word vector length is set to 48 here;
2) Input layer: receives the sequence of vectors output by the word embedding layer and passes the data matrix to the next layer;
3) Convolutional layers: mainly one-dimensional convolution; the convolution kernels are convolved over the input sequence data matrix to produce the output. There are two convolutional layers, named the first convolutional layer and the second convolutional layer;
4) Dynamic pooling layer: the pooling layer selects its pooling parameter dynamically from the index of the convolutional layer and the length of the input sequence, so as to retain the most useful information. The dynamic pooling parameter is selected according to formula (1):
K_l is the selected value of the pooling parameter k at layer l, K_top is the pooling parameter of the topmost pooling layer, L is the total number of convolutional layers in the network, l is the index of the current layer, and s is the length of the sequence, measured in bp;
5) Folding layer: merges every two adjacent (upper and lower) rows of the output matrices of the first and second convolutional layers by element-wise addition;
6) Fully connected layer: fully connected with 1024 neurons; extracts the deep abstract features learned by the neural network;
7) Output layer: the number of neurons is set to the number of classes of the specific classification task; outputs the classification result learned by the neural network;
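Formula (1) itself is not reproduced in this text. The symbol definitions given for it match the dynamic k-max pooling rule known from the dynamic convolutional neural network literature, k_l = max(K_top, ceil((L - l) / L * s)), so the sketch below assumes that rule; this is an inference, not a confirmed reading of the patent's formula. The function names `dynamic_k` and `k_max_pool`, the sequence length of 100 words, and the value K_top = 5 used in the example are illustrative choices.

```python
import math
from typing import List


def dynamic_k(layer: int, total_layers: int, seq_len: int, k_top: int) -> int:
    """Assumed form of formula (1): k_l = max(k_top, ceil((L - l) / L * s))."""
    return max(k_top, math.ceil((total_layers - layer) * seq_len / total_layers))


def k_max_pool(row: List[float], k: int) -> List[float]:
    """Keep the k largest values of a row, preserving their original order."""
    if k >= len(row):
        return list(row)
    # Indices of the k largest values, then restored to sequence order.
    top = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
    return [row[i] for i in sorted(top)]


# Two convolutional layers (L = 2); K_top = 5 and a sequence of 100 words
# are example values chosen for this sketch.
k1 = dynamic_k(layer=1, total_layers=2, seq_len=100, k_top=5)  # first layer
k2 = dynamic_k(layer=2, total_layers=2, seq_len=100, k_top=5)  # top layer
pooled = k_max_pool([0.1, 0.9, 0.3, 0.7, 0.2, 0.8], k=3)
```

Under this assumed rule the lower layer keeps half of the positions of a 100-word sequence, while the top layer always keeps exactly K_top values, so the fully connected layer receives a fixed-size input regardless of sequence length.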
Step 4: Feed the prepared data into the dynamic convolutional network built in Step 3 and train it by backpropagation with stochastic gradient descent for 500 iterations, using multi-class cross-entropy as the cost function. This yields the classification model.
Step 5: Input the segmented word sequence into the dynamic convolutional network model trained in Step 4 and output the classification result.
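The multi-class cross-entropy cost named in Step 4 can be sketched as follows for a single sample. The softmax output activation is an assumption (the source names only the cost function), and the function names and the 3-class logits are hypothetical values chosen for illustration.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over the output layer activations."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: List[float], one_hot: List[int]) -> float:
    """Multi-class cross-entropy for one sample with a one-hot target."""
    eps = 1e-12  # guards against log(0)
    return -sum(t * math.log(p + eps) for p, t in zip(probs, one_hot))


# Hypothetical 3-class example; the real output layer has one neuron per class.
probs = softmax([2.0, 0.5, -1.0])
loss_correct = cross_entropy(probs, [1, 0, 0])  # target is the top-scoring class
loss_wrong = cross_entropy(probs, [0, 0, 1])    # target is the lowest-scoring class
```

The loss is small when the one-hot target coincides with the class the network scores highest and grows when it does not, which is the signal that backpropagation and stochastic gradient descent minimize during training.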
3. Advantages and effects: With the method for a microbial gene sequence classification model based on a dynamic convolutional network provided by the present invention, the classification process requires no manual data processing or feature extraction; the model extracts abstract features and completes the classification task automatically. Both efficiency and accuracy are relatively high, and the method can be applied effectively to the analysis and processing of biological information.
Brief description of the drawings:
Fig. 1 shows the structure of the dynamic convolutional network built in Step 3 of the present invention. In the figure, V1, V2, V3, V4, V5, V6 and V7 denote seven word embedding vectors. The input word vectors form the training data matrix, which passes through the convolutional layers, pooling layers, folding layer and fully connected layer to extract abstract sequence features for accurate classification.
Fig. 2 shows the flow chart of the method of the present invention.
Detailed description of the embodiment:
The technical solution of the present invention is described further below with reference to the drawings and an embodiment.
The method for a microbial gene sequence classification model based on a dynamic convolutional network of the present invention specifically includes the following steps:
Step 1: Obtain microbial gene sequence data, taking the RDP rRNA gene data as an example (download address: http://rdp.cme.msu.edu/). This data set contains 3,356,809 annotated microbial 16S rRNA sequences, from which 300,000 sequences are randomly sampled for the subsequent model training.
Step 2: Data preprocessing:
1) delete illegal characters from the gene sequences;
2) one-hot encode the different class labels;
3) cut each gene sequence into a sequence of words of 8 characters each;
For example, the sequence fragment gtttataagggcttgccctt can be segmented as: gtttataa, tttataag, ttataagg, tataaggg, ataagggc, taagggct, aagggctt, and so on; every 8 bp are treated as one word.
Take the following sequence:
gtttataagggcttgcccttatagatagtggcgaacgggtgcgtaacacgtgagcaacctgccccaaagtttggaataacaccgggaaaccgatgctaataccaaatatgctcacactatcacaagatagagtgaggaaagtttttcgctttgggaggggctcgcggcctatcagcttgttggtgaggtaacggctcaccaaggcatcgacgggtagctggtctgagaggacgatcagccacactgggactgagacacggcccagactcctacgggaggcagcagtggggaatattgcgcaatgggcgaaagcctgacgcagcaacgccgcgtggaggatgaaggccttagggtcgtaaactcctttcagcaggaacgaaaatgacggtacctgcagaagaagctccggccaactacgtgccagcagccgcggtaatacgtagggagcaagcgttgtccggatttattgggcgtaaagagctcgtaggcggcttggcaagtcggatgtgaaacccccaggcttaacctggggccgccattcgatactgctatggcttgagttcggtaggggattgtggaattcccggtgtagcggtgaaatgcgcagatatcgggaggaac
Splitting it every 8 characters gives the following sequence text set:
4) divide the prepared sequence text set, according to the corresponding classification labels, by phylum, class, order and family into data for four classification levels;
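The preprocessing steps above can be sketched in a few lines of Python. The window moves one base at a time because that is what the worked example in item 3) shows; treating only a, c, g and t as legal characters is an assumption, since the source does not list the illegal characters. The function names and the second label "Firmicutes" are hypothetical and serve only as illustration.

```python
import re
from typing import Dict, List


def clean_sequence(raw: str) -> str:
    """Lower-case the sequence and drop every character that is not a, c, g or t.

    Treating only a/c/g/t as legal is an assumption; the source does not
    list the illegal characters.
    """
    return re.sub(r"[^acgt]", "", raw.lower())


def split_into_words(seq: str, k: int = 8) -> List[str]:
    """Slide a window of k characters one base at a time, as in the worked example."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def one_hot_labels(labels: List[str]) -> Dict[str, List[int]]:
    """Map each distinct class label to a one-hot vector (sorted label order)."""
    classes = sorted(set(labels))
    return {c: [1 if i == j else 0 for j in range(len(classes))]
            for i, c in enumerate(classes)}


words = split_into_words(clean_sequence("Gtttataagggcttgccctt"))
encoding = one_hot_labels(["Actinobacteria", "Firmicutes", "Actinobacteria"])
```

For the 20-base fragment from the example this yields 13 overlapping words, the first seven of which are the ones listed in item 3).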
Step 3: Build the dynamic convolutional network architecture; the network structure is shown in Fig. 1. The network parameters are set as follows: Ktop = 5, word embedding vector length 48, 12 channels in the first convolutional layer and 8 channels in the second convolutional layer. The number of output layer neurons depends on the number of classes in each classification task; in this experiment: phylum: 50, class: 114, order: 256, family: 1386.
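The hyperparameters stated in this description can be gathered in one configuration object, sketched below. The class name `DcnnConfig` and its field names are hypothetical; the numeric values are the ones given in the text. Looking up one output size per taxonomic rank reflects a reading in which each rank is a separate classification task, which the source implies but does not spell out.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class DcnnConfig:
    """Hyperparameters stated in the description; field names are illustrative."""
    word_length: int = 8          # characters per sequence word
    embedding_dim: int = 48       # word embedding vector length
    conv1_channels: int = 12      # first convolutional layer
    conv2_channels: int = 8       # second convolutional layer
    k_top: int = 5                # pooling parameter of the top pooling layer
    dense_units: int = 1024       # fully connected layer
    output_units: Dict[str, int] = field(default_factory=lambda: {
        "phylum": 50, "class": 114, "order": 256, "family": 1386,
    })

    def output_size(self, rank: str) -> int:
        """Number of output neurons for the classifier of one taxonomic rank."""
        return self.output_units[rank]


config = DcnnConfig()
sizes = [config.output_size(r) for r in ("phylum", "class", "order", "family")]
```

Kernel widths, learning rate and other training settings are not given in the text, so they are deliberately left out of this sketch.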
Step 4: Feed the prepared data into the dynamic convolutional network built in Step 3 and train it by backpropagation with stochastic gradient descent for 500 iterations, with a batch size of 200 training samples per iteration, using multi-class cross-entropy as the cost function. This yields the classification model. Experiments show that the model has essentially converged after about 500 iterations and that the classification accuracy reaches about 99.6%, a large improvement over conventional machine learning algorithms.
Step 5: Input the segmented word sequence to be classified into the dynamic convolutional network model trained in Step 4 to obtain the following classification result.
phylum:Actinobacteria
class:Actinobacteria
order:Acidimicrobiales
family:Acidimicrobiaceae.

Claims (1)

1. A method for a microbial gene sequence classification model based on a dynamic convolutional network, characterized in that the steps of the method are as follows:
Step 1: Obtain microbial gene sequence data with known classification results;
Step 2: Data preprocessing:
1) delete illegal characters from the gene sequences;
2) one-hot encode the different class labels;
3) cut each gene sequence into a sequence of words of 8 characters each;
4) divide the prepared sequence text set, according to the corresponding classification labels, by phylum, class, order and family into data for four classification levels;
Step 3: Build the dynamic convolutional network architecture:
1) Word embedding layer: vectorizes the input sequence words; different sequence words map to different vectors, and similar sequence words lie closer together in the embedding space; the word vector length is set to 48 here;
2) Input layer: receives the sequence of vectors output by the word embedding layer and passes the data matrix to the next layer;
3) Convolutional layers: mainly one-dimensional convolution; the convolution kernels are convolved over the input sequence data matrix to produce the output; there are two convolutional layers, named the first convolutional layer and the second convolutional layer;
4) Dynamic pooling layer: the pooling layer selects its pooling parameter dynamically from the index of the convolutional layer and the length of the input sequence, so as to retain the most useful information; the dynamic pooling parameter is selected according to formula (1):
K_l is the selected value of the pooling parameter k at layer l, K_top is the pooling parameter of the topmost pooling layer, L is the total number of convolutional layers in the network, l is the index of the current layer, and s is the length of the sequence, measured in bp;
5) Folding layer: merges every two adjacent (upper and lower) rows of the output matrices of the first and second convolutional layers by element-wise addition;
6) Fully connected layer: fully connected with 1024 neurons; extracts the deep abstract features learned by the neural network;
7) Output layer: the number of neurons is set to the number of classes of the specific classification task; outputs the classification result learned by the neural network;
Step 4: Feed the prepared data into the dynamic convolutional network built in Step 3 and train it by backpropagation with stochastic gradient descent for 100 iterations, using multi-class cross-entropy as the cost function, to obtain the classification model;
Step 5: Input the segmented word sequence to be classified into the dynamic convolutional network model trained in Step 4 to obtain the classification result.
CN201710609781.8A 2017-07-25 2017-07-25 A method for a microbial gene sequence classification model based on a dynamic convolutional network Pending CN108009402A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710609781.8A CN108009402A (en) 2017-07-25 2017-07-25 A method for a microbial gene sequence classification model based on a dynamic convolutional network

Publications (1)

Publication Number Publication Date
CN108009402A true CN108009402A (en) 2018-05-08

Family

ID=62047663

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710609781.8A Pending CN108009402A (en) 2017-07-25 2017-07-25 A method for a microbial gene sequence classification model based on a dynamic convolutional network

Country Status (1)

Country Link
CN (1) CN108009402A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101401101A (en) * 2006-03-10 2009-04-01 皇家飞利浦电子股份有限公司 Methods and systems for identification of DNA patterns through spectral analysis
US20160070854A1 (en) * 2013-04-17 2016-03-10 Andrew Ka-Ching WONG Aligning and clustering sequence patterns to reveal classificatory functionality of sequences
CN106295245A (en) * 2016-07-27 2017-01-04 广州麦仑信息科技有限公司 Caffe-based method for extracting gene information features with a stacked denoising autoencoder
US20170046480A1 (en) * 2015-08-14 2017-02-16 Tetracore, Inc. Device and method for detecting the presence or absence of nucleic acid amplification
CN106547885A (en) * 2016-10-27 2017-03-29 桂林电子科技大学 Text classification system and method
CN106599618A (en) * 2016-12-23 2017-04-26 吉林大学 Unsupervised classification method for metagenome contigs
CN106874378A (en) * 2017-01-05 2017-06-20 北京工商大学 Method for building a knowledge graph by entity extraction and relation mining based on a rule model

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
GIOSUE LO BOSCO et al.: "Deep Learning Architectures for DNA Sequence Classification", INTERNATIONAL WORKSHOP ON FUZZY LOGIC AND SOFT COMPUTING APPLICATIONS *
NGOC GIANG NGUYEN et al.: "DNA Sequence Classification by Convolutional Neural Network", J. BIOMEDICAL SCIENCE AND ENGINEERING *
ZHONGMING HAN et al.: "Reconstructing Genetic Regulation Network: Problems and Methods", 2009 2ND IEEE INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE AND INFORMATION TECHNOLOGY *
张添龙 (Zhang Tianlong): "Text classification algorithm based on convolutional neural networks with multi-type pooling" (基于多类型池化的卷积神经网络的文本分类算法), Computer Knowledge and Technology (电脑知识与技术) *
杨铁军 (Yang Tiejun): Industry Patent Analysis Report, Vol. 33: Intelligent Recognition (产业专利分析报告 第33册 智能识别), 30 June 2015, Intellectual Property Publishing House (知识产权出版社) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111564179A (en) * 2020-05-09 2020-08-21 厦门大学 Species biology classification method and system based on triple neural network
CN111564179B (en) * 2020-05-09 2022-04-29 厦门大学 Species biology classification method and system based on triple neural network
CN112151119A (en) * 2020-09-01 2020-12-29 阿里云计算有限公司 Gene vector model training method, method for analyzing gene data, and respective devices


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180508
