CN108009402A - A method for a microbial gene sequence classification model based on a dynamic convolutional network - Google Patents
A method for a microbial gene sequence classification model based on a dynamic convolutional network
- Publication number
- CN108009402A (application CN201710609781.8A)
- Authority
- CN
- China
- Prior art keywords
- layer
- dynamic
- sequence
- convolutional
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
Abstract
The invention discloses a method for a microbial gene sequence classification model based on a dynamic convolutional network. Step 1: obtain microbial gene sequence data with known classification results. Step 2: preprocess the data. Step 3: build the dynamic convolutional network architecture. Step 4: feed the prepared data into the dynamic convolutional network built in Step 3 and train it by backpropagation with stochastic gradient descent for 100 iterations, using multi-class cross entropy as the cost function, to obtain the final classification model. Step 5: input the segmented sequences to be classified into the dynamic convolutional network model trained in Step 4 to obtain the classification results. The method requires no manual data processing or feature extraction: the model extracts abstract features and completes the classification task automatically, with high efficiency and accuracy, and can be applied effectively to the analysis and processing of biological information.
Description
Technical field
The present invention relates to a method for a microbial gene sequence classification model based on a dynamic convolutional network. It is applied to the classification and identification of microbial gene sequences and belongs to the technical fields of data mining and bioinformatics.
Background art
DNA sequence data is one of the main objects of study in bioinformatics. Analyzing DNA sequences helps reveal the underlying structure of sequences and the functional relationships between them. The volume of DNA sequence data is growing exponentially, and being able to analyze these huge datasets with modern computers would greatly help our understanding of DNA. DNA sequence classification follows the principle that sequences with similar structure also have similar function. Traditionally, sequence similarity is established with sequence alignment methods (such as BLAST and FASTA). This choice rests on two main assumptions: (1) functional elements share common sequence features, and (2) the relative order of functional elements is conserved between different sequences. Although these assumptions hold in a wide range of cases, they are not universal. In any case, the key problem that seriously limits the application of alignment methods remains their computational time complexity. Alignment-free methods developed in recent years have therefore become an effective approach to genome analysis. In alignment-free methods, a sequence is treated as a collection of k-mers; effective features are found by analyzing the distribution of k-mers in each sequence, and the sequences are then classified with traditional classification techniques. In gene sequence classification, however, feature selection and analysis are often time-consuming and laborious, and their effect is uncertain.
At present, model algorithms based on deep learning have achieved very good results in fields such as image recognition and natural language processing, and they are receiving growing attention. The present method classifies sequences mainly by means of a dynamic convolutional network from deep learning. Because deep learning can itself extract high-level abstract features, it eliminates the feature engineering step of conventional machine learning algorithms, which greatly improves the efficiency of problem solving, and the accuracy also reaches a very high level.
Summary of the invention
1. Purpose:
The purpose of the present invention is to provide a method for a microbial gene sequence classification model based on a dynamic convolutional network that can classify microbial gene sequences effectively, thereby improving the efficiency and level of microbiological analysis.
The principle of the present invention is as follows. The gene sequence is processed first: the gene sequence text of a microorganism is segmented into words, and the segmentation result is used as the input of the algorithm model. The model first converts the gene sequence words into a vector matrix by word embedding. In the convolutional layer, one-dimensional convolution kernels are convolved over the word embedding matrix; the first convolutional layer is set to 12 channels. After the convolutional layer the data enter the dynamic pooling layer, which determines the size of the pooling region from the length of the input sequence and the index of the current convolutional layer, so as to retain as much of the sequence's useful information as possible. Finally, the folding layer reduces the dimensionality of the matrix, and the abstract sequence features extracted by the dynamic convolutional network are classified in the fully connected layer.
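The word embedding step described above can be sketched as a table lookup. This is a minimal illustration only: the helper names (`build_vocab`, `init_embeddings`, `embed`) are hypothetical, and the vectors are randomly initialised here as an assumption, whereas in a trained network they would be learned. Only the vector length of 48 comes from the text.

```python
import random

EMBED_DIM = 48  # word vector length stated in the text


def build_vocab(words):
    """Assign each distinct word an integer index in order of first appearance."""
    vocab = {}
    for word in words:
        vocab.setdefault(word, len(vocab))
    return vocab


def init_embeddings(vocab_size, dim=EMBED_DIM, seed=0):
    """Create a randomly initialised embedding table (assumption: learned later)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)]


def embed(words, vocab, table):
    """Turn a word sequence into a matrix with one embedding vector per word."""
    return [table[vocab[word]] for word in words]


words = ["gtttataa", "tttataag", "ttataagg", "gtttataa"]
vocab = build_vocab(words)
table = init_embeddings(len(vocab))
matrix = embed(words, vocab, table)
```

Identical words map to the same vector, so the first and last rows of `matrix` are equal; the matrix has one row per word and 48 columns.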
2. Technical solution: the technical solution provided by the invention is as follows.
The present invention is a method for a microbial gene sequence classification model based on a dynamic convolutional network, as shown in Fig. 2. The specific steps of the method are as follows:
Step 1: Obtain microbial gene sequence data with known classification results.
Step 2: Data preprocessing:
1) delete illegal characters from the gene sequences;
2) one-hot encode the different classification category attributes;
3) cut each gene sequence into a word sequence of 8-character words;
4) divide the prepared sequence text set by the corresponding classification labels into phylum, class, order and family, giving data at four classification levels.
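The one-hot encoding of category labels in item 2) can be sketched as follows. The function name `one_hot_encode` and the alphabetical ordering of classes are assumptions made for illustration; the text does not say how the class indices are assigned.

```python
def one_hot_encode(labels):
    """Return the sorted class list and one one-hot vector per input label."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    vectors = []
    for label in labels:
        vector = [0] * len(classes)
        vector[index[label]] = 1  # mark the position of this label's class
        vectors.append(vector)
    return classes, vectors


# Example with phylum-level labels (the label values are illustrative)
classes, vectors = one_hot_encode(["Actinobacteria", "Firmicutes", "Actinobacteria"])
```

In the method, one such encoding would be built separately for each of the four levels (phylum, class, order, family), since each level has its own set of categories.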
Step 3: Build the dynamic convolutional network architecture, as shown in Fig. 1:
1) Word embedding layer: vectorizes the input sequence words. Different sequence words are mapped to different vectors, and similar sequence words lie closer together in the mapping space. The word vector length chosen here is 48;
2) Input layer: receives the sequence vectors output by the word embedding layer and passes the data matrix to the next layer;
3) Convolutional layer: mainly one-dimensional convolution; the convolution kernels are convolved with the input sequence data matrix to produce the output. There are two convolutional layers, named the first convolutional layer and the second convolutional layer;
4) Dynamic pooling layer: the pooling layer dynamically selects the pooling parameter according to the convolutional layer index and the input sequence length, so as to retain as much useful information as possible. The dynamic pooling parameter is selected according to formula (1), in which K_l is the selected value of the pooling parameter k at layer l, K_top is the pooling parameter of the topmost layer, L is the total number of convolutional layers in the network, l is the index of the current layer, and s is the length of the sequence, measured in bp;
5) Folding layer: merges every two adjacent (upper and lower) rows of the output matrices of the first and second convolutional layers by element-wise addition;
6) Fully connected layer: fully connected with 1024 neurons; it extracts the deep abstract features learned by the neural network;
7) Output layer: the number of neurons is set according to the number of classification categories; it outputs the classification result learned by the neural network;
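Formula (1) is not reproduced in this text. The symbols defined in item 4) match the dynamic k-max pooling rule of the Dynamic Convolutional Neural Network (Kalchbrenner et al., 2014), which reads k_l = max(k_top, ceil((L - l) / L * s)). The sketch below assumes that this is the intended formula; that is an assumption, not something the text confirms. The function names are hypothetical.

```python
def dynamic_k(layer, total_layers, seq_len, k_top=5):
    """Pooling size for `layer` (1-based), assuming the DCNN rule
    k_l = max(k_top, ceil((L - l) / L * s))."""
    # Integer ceiling of (L - l) * s / L, avoiding floating-point rounding
    scaled = -(-(total_layers - layer) * seq_len // total_layers)
    return max(k_top, scaled)


def k_max_pool(values, k):
    """Keep the k largest values of a feature row in their original order."""
    k = min(k, len(values))
    top = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    return [values[i] for i in sorted(top)]


# With two convolutional layers and a 100-word input:
k_first = dynamic_k(1, 2, 100)   # pooling size after the first layer
k_second = dynamic_k(2, 2, 100)  # the topmost layer falls back to k_top
pooled = k_max_pool([1, 5, 2, 8, 3], 3)
```

Under this assumption, longer inputs keep more values in the lower layer, while the topmost layer always outputs a fixed number of values, which is what allows a fixed-size fully connected layer to follow.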
Step 4: Feed the prepared data into the dynamic convolutional network built in Step 3 and train the network by backpropagation with stochastic gradient descent for 500 iterations, using multi-class cross entropy as the cost function. This finally yields the classification model.
Step 5: Input the segmented sequences into the dynamic convolutional network model trained in Step 4 and output the classification results.
3. Advantages and effects: with the method for a microbial gene sequence classification model based on a dynamic convolutional network provided by the present invention, the classification process requires no manual data processing or feature extraction. The model extracts abstract features and completes the classification task automatically, the efficiency and accuracy of the algorithm are both relatively high, and the method can be applied effectively to the analysis and processing of biological information.
Brief description of the drawings:
Fig. 1 shows the dynamic convolutional network structure built in Step 3 of the present invention. In the figure, V1, V2, V3, V4, V5, V6 and V7 denote seven word embedding vectors. The input word vectors form the training data matrix, from which the convolutional layers, pooling layers, folding layer and fully connected layer extract abstract sequence features for accurate classification.
Fig. 2 shows the flow chart of the method of the present invention.
Detailed description of the embodiments:
The technical solution of the present invention is further described below with reference to the drawings and an embodiment.
The method for a microbial gene sequence classification model based on a dynamic convolutional network of the present invention comprises the following steps:
Step 1: Obtain microbial gene sequence data. Taking the RDP rRNA gene data as an example (download address: http://rdp.cme.msu.edu/; this dataset contains 3,356,809 annotated microbial 16S rRNA sequences), 300,000 sequences are randomly sampled from it for subsequent model training.
Step 2: Data preprocessing:
1) delete illegal characters from the gene sequences;
2) one-hot encode the different classification category attributes;
3) cut each gene sequence into a word sequence of 8-character words;
For example, the sequence fragment gtttataagggcttgccctt can be segmented into gtttataa, tttataag, ttataagg, tataaggg, ataagggc, taagggct, aagggctt, and so on; every 8 bp are regarded as one word, and consecutive words are offset by one base.
Take the following sequence:
gtttataagggcttgcccttatagatagtggcgaacgggtgcgtaacacgtgagcaacctgccccaaagtttggaat
aacaccgggaaaccgatgctaataccaaatatgctcacactatcacaagatagagtgaggaaagtttttcgctttgg
gaggggctcgcggcctatcagcttgttggtgaggtaacggctcaccaaggcatcgacgggtagctggtctgagagga
cgatcagccacactgggactgagacacggcccagactcctacgggaggcagcagtggggaatattgcgcaatgggcg
aaagcctgacgcagcaacgccgcgtggaggatgaaggccttagggtcgtaaactcctttcagcaggaacgaaaatga
cggtacctgcagaagaagctccggccaactacgtgccagcagccgcggtaatacgtagggagcaagcgttgtccgga
tttattgggcgtaaagagctcgtaggcggcttggcaagtcggatgtgaaacccccaggcttaacctggggccgccat
tcgatactgctatggcttgagttcggtaggggattgtggaattcccggtgtagcggtgaaatgcgcagatatcggga
ggaac
Splitting it into 8-character words in this way yields the sequence text set.
4) divide the prepared sequence text set by the corresponding classification labels into phylum, class, order and family, giving data at four classification levels;
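The cleaning and segmentation of items 1) and 3) can be sketched as below. The worked example in the text shows consecutive 8-character words offset by one base, so a stride of 1 is used. Treating every character other than a, c, g and t as illegal is an assumption, since the text does not list the illegal characters; the function names are hypothetical.

```python
import re


def clean_sequence(seq, alphabet="acgt"):
    """Lower-case the sequence and drop characters outside the assumed alphabet."""
    return re.sub(f"[^{alphabet}]", "", seq.lower())


def to_words(seq, k=8, stride=1):
    """Cut a sequence into overlapping k-character words."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


fragment = clean_sequence("gtttataagggcttgccctt")
words = to_words(fragment)
```

For the 20-base fragment from the text this produces 13 words, the first seven of which are the ones listed in the example. A sequence shorter than 8 characters produces no words.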
Step 3: Build the dynamic convolutional network architecture; the network structure is shown in Fig. 1. The network parameters are set as follows: K_top = 5, the word embedding vector length is 48, the first convolutional layer has 12 channels, and the second convolutional layer has 8 channels. The number of output layer neurons depends on the number of categories in the classification task; in this experiment: phylum: 50, class: 114, order: 256, family: 1386.
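The folding layer, which merges every two adjacent rows by addition, can be sketched as follows. It is assumed here, as in the original dynamic convolutional network design, that the rows of the feature matrix correspond to embedding dimensions, so a 48-row map would fold to 24 rows; the text does not state the orientation explicitly. The function name `fold` is hypothetical.

```python
def fold(matrix):
    """Sum each pair of adjacent rows element-wise, halving the row count."""
    if len(matrix) % 2 != 0:
        raise ValueError("folding needs an even number of rows")
    return [
        [a + b for a, b in zip(matrix[i], matrix[i + 1])]
        for i in range(0, len(matrix), 2)
    ]


# A toy feature map with 4 rows and 2 columns
folded = fold([[1, 2], [3, 4], [5, 6], [7, 8]])

# With the stated embedding length of 48, folding would leave 24 rows
rows_after_fold = len(fold([[0.0] * 10 for _ in range(48)]))
```

Folding adds no parameters; it only lets neighbouring feature rows interact before the fully connected layer.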
Step 4: Feed the prepared data into the dynamic convolutional network built in Step 3 and train it by backpropagation with stochastic gradient descent for 500 iterations, with a batch size of 200 training samples per iteration, using multi-class cross entropy as the cost function. This finally yields the classification model. Experiments show that the model essentially converges after about 500 iterations and that the classification accuracy reaches about 99.6%, a large improvement over conventional machine learning algorithms.
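The cost function and the batching in Step 4 can be sketched as below. This shows only the multi-class cross entropy for one sample and the splitting of the training set into batches of 200; the network itself and the gradient update are omitted. Applying a softmax to the output layer before the cross entropy is an assumption (it is the usual pairing, but the text does not name it), and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw output-layer scores into class probabilities."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Multi-class cross entropy for one sample with a one-hot target."""
    return -math.log(softmax(logits)[target_index])


def batches(samples, batch_size=200):
    """Yield consecutive batches; the last one may be smaller."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


loss_uniform = cross_entropy([0.0, 0.0, 0.0, 0.0], 2)
batch_sizes = [len(b) for b in batches(list(range(450)))]
```

An untrained model that assigns equal scores to all classes has a loss of ln(number of classes); training by stochastic gradient descent drives this value down.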
Step 5: Input the segmented sequences to be classified into the dynamic convolutional network model trained in Step 4 to obtain the following classification result.
phylum:Actinobacteria
class:Actinobacteria
order:Acidimicrobiales
family:Acidimicrobiaceae.
Claims (1)
1. A method for a microbial gene sequence classification model based on a dynamic convolutional network, characterized in that the steps of the method are as follows:
Step 1: Obtain microbial gene sequence data with known classification results;
Step 2: Data preprocessing:
1) delete illegal characters from the gene sequences;
2) one-hot encode the different classification category attributes;
3) cut each gene sequence into a word sequence of 8-character words;
4) divide the prepared sequence text set by the corresponding classification labels into phylum, class, order and family, giving data at four classification levels;
Step 3: Build the dynamic convolutional network architecture:
1) Word embedding layer: vectorizes the input sequence words; different sequence words are mapped to different vectors, and similar sequence words lie closer together in the mapping space; the word vector length chosen here is 48;
2) Input layer: receives the sequence vectors output by the word embedding layer and passes the data matrix to the next layer;
3) Convolutional layer: mainly one-dimensional convolution; the convolution kernels are convolved with the input sequence data matrix to produce the output; there are two convolutional layers, named the first convolutional layer and the second convolutional layer;
4) Dynamic pooling layer: the pooling layer dynamically selects the pooling parameter according to the convolutional layer index and the input sequence length, so as to retain as much useful information as possible; the dynamic pooling parameter is selected according to formula (1), in which K_l is the selected value of the pooling parameter k at layer l, K_top is the pooling parameter of the topmost layer, L is the total number of convolutional layers in the network, l is the index of the current layer, and s is the length of the sequence, measured in bp;
5) Folding layer: merges every two adjacent (upper and lower) rows of the output matrices of the first and second convolutional layers by element-wise addition;
6) Fully connected layer: fully connected with 1024 neurons; it extracts the deep abstract features learned by the neural network;
7) Output layer: the number of neurons is set according to the number of classification categories; it outputs the classification result learned by the neural network;
Step 4: Feed the prepared data into the dynamic convolutional network built in Step 3 and train it by backpropagation with stochastic gradient descent for 100 iterations, using multi-class cross entropy as the cost function, to obtain the final classification model;
Step 5: Input the segmented sequences to be classified into the dynamic convolutional network model trained in Step 4 to obtain the classification results.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710609781.8A CN108009402A (en) | 2017-07-25 | 2017-07-25 | A method for a microbial gene sequence classification model based on a dynamic convolutional network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108009402A true CN108009402A (en) | 2018-05-08 |
Family
ID=62047663
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710609781.8A Pending CN108009402A (en) | A method for a microbial gene sequence classification model based on a dynamic convolutional network | 2017-07-25 | 2017-07-25 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108009402A (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101401101A (en) * | 2006-03-10 | 2009-04-01 | 皇家飞利浦电子股份有限公司 | Methods and systems for identification of DNA patterns through spectral analysis |
US20160070854A1 (en) * | 2013-04-17 | 2016-03-10 | Andrew Ka-Ching WONG | Aligning and clustering sequence patterns to reveal classificatory functionality of sequences |
US20170046480A1 (en) * | 2015-08-14 | 2017-02-16 | Tetracore, Inc. | Device and method for detecting the presence or absence of nucleic acid amplification |
CN106295245A (en) * | 2016-07-27 | 2017-01-04 | 广州麦仑信息科技有限公司 | The method of storehouse noise reduction own coding gene information feature extraction based on Caffe |
CN106547885A (en) * | 2016-10-27 | 2017-03-29 | 桂林电子科技大学 | A kind of Text Classification System and method |
CN106599618A (en) * | 2016-12-23 | 2017-04-26 | 吉林大学 | Non-supervision classification method for metagenome contigs |
CN106874378A (en) * | 2017-01-05 | 2017-06-20 | 北京工商大学 | The entity of rule-based model extracts the method that knowledge mapping is built with relation excavation |
Non-Patent Citations (5)
Title |
---|
GIOSUE LO BOSCO等: ""Deep Learning Architectures for DNA Sequence Classification"", 《INTERNATIONAL WORKSHOP ON FUZZY LOGIC AND SOFT COMPUTING APPLICATIONS》 * |
NGOC GIANG NGUYEN等: ""DNA Sequence Classification by Convolutional Neural Network"", 《J. BIOMEDICAL SCIENCE AND ENGINEERING》 * |
ZHONGMING HAN等: ""Reconstructing Genetic Regulation Network: Problems and Methods"", 《2009 2ND IEEE INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE AND INFORMATION TECHNOLOGY》 * |
张添龙: ""基于多类型池化的卷积神经网络的文本分类算法"", 《电脑知识与技术》 * |
杨铁军: "《产业专利分析报告 第33册 智能识别》", 30 June 2015, 知识产权出版社 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111564179A (en) * | 2020-05-09 | 2020-08-21 | 厦门大学 | Species biology classification method and system based on triple neural network |
CN111564179B (en) * | 2020-05-09 | 2022-04-29 | 厦门大学 | Species biology classification method and system based on triple neural network |
CN112151119A (en) * | 2020-09-01 | 2020-12-29 | 阿里云计算有限公司 | Gene vector model training method, method for analyzing gene data, and respective devices |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180508 |