WO2018119882A1 - Method and device for classifying metagenomic data - Google Patents

Method and device for classifying metagenomic data

Info

Publication number
WO2018119882A1
Authority
WO
WIPO (PCT)
Prior art keywords
cluster
sequence
sequenced
category
genomic
Prior art date
Application number
PCT/CN2016/113029
Other languages
English (en)
French (fr)
Inventor
郭宁
魏彦杰
滕彦宁
葛健秋
张慧玲
Original Assignee
中国科学院深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院深圳先进技术研究院 filed Critical 中国科学院深圳先进技术研究院
Priority to PCT/CN2016/113029 priority Critical patent/WO2018119882A1/zh
Publication of WO2018119882A1 publication Critical patent/WO2018119882A1/zh

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations

Definitions

  • The present invention relates to the field of gene data processing, and in particular to a method and device for classifying metagenomic data.
  • DNA-based metagenomics in theory covers all microorganisms in an environmental sample, so it reflects the composition of the microbial community more completely and faithfully, and it greatly expands the sources from which new genes or biologically active substances can be screened. Depending on the strategy used, metagenomics research is either sequence-driven or function-driven. Sequence-driven research analyzes the structure and function of microbial communities through sequencing. Function-driven research builds a metagenomic library and screens it for new genes or new substances.
  • The goal of metagenomics research is to study the structural composition of microbial communities. For example, sequencing marine samples can reveal the diversity of that environment; similarly, studying human samples can reveal the relationship between human-associated microbes and human health.
  • Once a metagenomic sample has been sequenced, the first task is to identify the microbial species present in it. Many tools are now available that assign metagenomic reads to known species based on alignment and sequence composition.
  • A metagenomic classification method based on sequence composition classifies sequences using the compositional features of the sequences themselves.
  • The usual procedure is to sample the data statistically, use the selected feature representation to abstract each sequence into a biologically meaningful feature vector, assemble these feature vectors into a feature matrix, select a suitable classifier model, and classify the biological sequences.
  • Karlin studied the genomic sequences of many microorganisms and found that base composition (for example, GC content) is similar within a species, whereas base usage bias differs greatly between species. On this basis, Teeling et al. developed the TETRA tool, and Chan et al. developed a tool based on a self-organizing growth algorithm.
  • As for features, microbial species abundance, gene function, metabolic pathways, phylogenetic relationships, and the like can serve as features of a community or sample for sample classification.
  • David et al. used phenotypic characteristics of whole microbial genomes (GC content, genome size, energy source, habitat humidity, and oxygen consumption) as sample features and classified metagenomic sequences with an R-SVM classifier.
  • Commonly used classifiers include the naive Bayes classification model, the expectation-maximization model, the maximum likelihood estimation model, and Markov models.
  • One kind of metagenomic classifier is supervised: it takes sequence features related to composition, applies them to sequences with known class labels, extracts feature information, feeds it to a classifier to train a classification model, and finally classifies sequences whose labels are unknown.
  • CARMA is a supervised metagenomic classification tool that uses hidden Markov models and classifies short sequences of about 80 bp (base pairs) well.
  • TACOA uses a kernel-based kNN algorithm and can predict sequences with read lengths greater than 800 bp.
  • The software can keep its reference genome database updated in real time and can use IMMs (Interpolated Markov Models) for modeling; it classifies sequences longer than 100 bp with high accuracy.
  • NBC applies the naive Bayes classification algorithm to metagenomic classification and provides an online web service, so classification results can be displayed conveniently and quickly on a web page.
  • Zhang Xuegong et al. proposed a supervised metagenomic classification algorithm that needs no reference sequence and uses the R-SVM algorithm; a feature selection algorithm screens out the useful features in the sequence composition information to improve classification accuracy.
  • An object of the present invention is to provide a method and device for classifying metagenomic data that improve genome classification accuracy at a small time cost.
  • A first aspect of the present invention provides a method for classifying metagenomic data, the method comprising: calculating a feature vector of the sequence to be sequenced; clustering the feature vectors to obtain M clusters G_1 to G_M containing reads, where M is an integer not less than 1; obtaining the central set K_i of each of the clusters G_1 to G_M; and determining the genomic category of each cluster by comparing each read in the cluster's central set K_i with a reference gene sequence.
  • A second aspect of the present invention provides a metagenomic data classification device, the device comprising:
  • a calculation module configured to calculate a feature vector of the sequence to be sequenced;
  • a clustering module configured to cluster the feature vectors to obtain M clusters G_1 to G_M containing reads, where M is an integer not less than 1;
  • an obtaining module configured to obtain the central set K_i of each of the clusters G_1 to G_M;
  • a category judging module configured to determine the genomic category of each cluster by comparing each read in the cluster's central set K_i with a reference gene sequence.
  • FIG. 1 is a schematic flowchart of the metagenomic data classification method according to Embodiment 1 of the present invention.
  • FIG. 2 is a schematic structural diagram of the metagenomic data classification device according to Embodiment 2 of the present invention.
  • FIG. 3 is a schematic structural diagram of the metagenomic data classification device according to Embodiment 3 of the present invention.
  • FIG. 4 is a schematic structural diagram of the metagenomic data classification device according to Embodiment 4 of the present invention.
  • FIG. 5-a is a schematic structural diagram of the metagenomic data classification device according to Embodiment 5 of the present invention.
  • FIG. 5-b is a schematic structural diagram of the metagenomic data classification device according to Embodiment 6 of the present invention.
  • FIG. 5-c is a schematic structural diagram of the metagenomic data classification device according to Embodiment 7 of the present invention.
  • FIG. 6-a is a schematic structural diagram of the metagenomic data classification device according to Embodiment 8 of the present invention.
  • FIG. 6-b is a schematic structural diagram of the metagenomic data classification device according to Embodiment 9 of the present invention.
  • FIG. 6-c is a schematic structural diagram of the metagenomic data classification device according to Embodiment 10 of the present invention.
  • FIG. 7 is a schematic structural diagram of the metagenomic data classification device according to Embodiment 11 of the present invention.
  • An embodiment of the present invention provides a method for classifying metagenomic data, the method comprising: calculating a feature vector of the sequence to be sequenced; clustering the feature vectors to obtain M clusters G_1 to G_M containing reads, where M is an integer not less than 1; obtaining the central set K_i of each of the clusters G_1 to G_M; and determining the genomic category of each cluster by comparing each read in the cluster's central set K_i with a reference gene sequence.
  • Embodiments of the present invention also provide corresponding metagenomic data classification devices. Each is described in detail below.
  • FIG. 1 is a schematic flowchart of the metagenomic data classification method according to Embodiment 1 of the present invention. The method mainly includes the following steps S101 to S104.
  • S101: Calculate a feature vector of the sequence to be sequenced.
  • In one embodiment, the feature vector may be calculated through the following steps S1011 and S1012.
  • S1011: Divide the sequence to be sequenced into L-k+1 k-mers of length k, where L is the length of the sequence.
  • In genomics, a k-mer is a substring of length k, typically k consecutive bases starting at some position in the sequence. In this embodiment, fragments may be taken from the sequence with k = 3, 4, or 6; each fragment is a k-mer, so a sequence of length L can be divided into L-k+1 k-mers of length k in total.
  • S1012: Count the frequency of each k-mer among the L-k+1 k-mers obtained in step S1011, and use the vector formed by these frequencies as the feature vector of the sequence.
  • Specifically, the frequency of each distinct k-mer is counted, and the k-mers are then encoded: A (adenine), T (thymine), C (cytosine), and G (guanine) are represented by the digits 0, 1, 2, and 3, and each k-mer is read as a base-4 (quaternary) number. That number is the k-mer's index in the vector, and the k-mer's frequency is the value at that index. Because there are 4^k possible k-mers of length k, the resulting vector has 4^k dimensions; it is the feature vector of the sequence.
  • To reduce the computation and/or complexity of later processing, and hence the running time, the feature vector may be dimensionally reduced, specifically by mutual-information-based feature selection.
  • S102: Cluster the feature vectors calculated in step S101 to obtain M clusters G_1 to G_M containing reads, where M is an integer not less than 1.
  • Specifically, the k-means algorithm in the VLFeat clustering toolbox may be used to cluster the feature vectors calculated in step S101, giving M clusters of reads, numbered here G_1, G_2, ..., G_i, ..., G_{M-1}, G_M.
  • S103: Obtain the central set K_i of each of the clusters G_1 to G_M.
  • Within each cluster obtained in step S102, many reads may share overlapping bases. In this embodiment, all reads in a cluster may be made into a graph in which each read is a vertex; the maximum independent set of the graph is then computed, and the reads it contains form the central set K_i of the cluster.
  • S104: Determine the genomic category of each cluster by comparing each read in the cluster's central set K_i with a reference gene sequence.
  • In one embodiment, this may be done through the following steps S1041 and S1042.
  • S1041: Compare each read in each cluster's central set K_i with the reference gene sequence, and tally the genomic category of each read.
  • Specifically, the tool BLAST may be used to compare each read in K_i with the reference gene sequence and tally its genomic category. Note that the technical solution does not compare every read of a cluster with the reference gene sequence; it compares only the reads in the cluster's central set. This narrows the search for each cluster's category and reduces the number of comparisons, and therefore the time overhead.
  • S1042: If the frequency of genomic category C_i for any read R_i in the central set K_i is not less than a preset threshold, confirm C_i as the genomic category of the cluster to which R_i belongs.
  • During the comparison, the same read may be assigned to different genomic categories; in that case, its category can be determined from how often each category occurs. For example, suppose the preset threshold is 70%. If the comparison shows that read R_i is assigned to category C'_i with a frequency of 30%, to C''_i with a frequency of 43%, and to C_i with a frequency of 75%, then the genomic category of R_i is determined to be C_i, and C_i is confirmed as the genomic category of the central set K_i, or of the cluster, to which R_i belongs.
  • To further improve classification accuracy, after step S104 a classifier trained by multiple kernel learning may be used to classify again the clusters whose genomic category has been confirmed. Specifically, a proportion of the reads, for example 60%, may be randomly selected from the central set of each such cluster as a training set, and the multiple kernel learning tool Shogun used to train the classification model. The remaining reads, for example 40%, serve as a test set; they are classified by the trained classifier, and reads in each central set that the earlier clustering assigned incorrectly are filtered out.
  • FIG. 2 is a schematic structural diagram of the metagenomic data classification device according to Embodiment 2 of the present invention. For ease of description, FIG. 2 shows only the parts related to this embodiment.
  • The device illustrated in FIG. 2 may be the execution subject of the method illustrated in FIG. 1.
  • The device illustrated in FIG. 2 mainly includes a calculation module 201, a clustering module 202, an obtaining module 203, and a category judging module 204, wherein:
  • the calculation module 201 is configured to calculate a feature vector of the sequence to be sequenced;
  • the clustering module 202 is configured to cluster the feature vectors calculated by the calculation module 201 to obtain M clusters G_1 to G_M containing reads, where M is an integer not less than 1.
  • Specifically, the clustering module 202 may use the k-means algorithm in the VLFeat clustering toolbox to cluster the feature vectors calculated by the calculation module 201, giving M clusters of reads, numbered here G_1, G_2, ..., G_i, ..., G_{M-1}, G_M.
  • The obtaining module 203 is configured to obtain the central set K_i of each of the clusters G_1 to G_M.
  • Within each cluster, many reads may share overlapping bases. In this embodiment, the obtaining module 203 may make all reads in a cluster into a graph in which each read is a vertex and then compute the maximum independent set of the graph; the reads it contains form the central set K_i of the cluster.
  • The category judging module 204 is configured to determine the genomic category of each cluster by comparing each read in the cluster's central set K_i with the reference gene sequence.
  • The division into functional modules above is only an example. In practice the functions may be assigned to different functional modules as needed, for example according to hardware configuration requirements or software considerations; that is, the internal structure of the device may be divided into different functional modules to perform all or part of the functions described above.
  • Moreover, the functional modules in this embodiment may be implemented by corresponding hardware, or by corresponding hardware executing corresponding software.
  • For example, the clustering module may be hardware that clusters the feature vectors calculated by the calculation module to obtain the clusters G_1 to G_M, such as a clusterer, or it may be a general-purpose processor or other hardware device capable of executing a corresponding computer program to perform that function.
  • Likewise, the category judging module may be hardware that determines genomic categories by comparing each read in each cluster's central set K_i with a reference gene sequence, such as a category determiner, or it may be a general-purpose processor or other hardware device capable of executing a corresponding computer program to perform that function. (This principle applies to all embodiments in this specification.)
  • The calculation module 201 illustrated in FIG. 2 may include a dividing unit 301 and a statistics unit 302, as in the device of Embodiment 3 shown in FIG. 3, wherein:
  • the dividing unit 301 is configured to divide the sequence to be sequenced into L-k+1 k-mers of length k, where L is the length of the sequence.
  • A k-mer is a substring of length k, typically k consecutive bases starting at some position in the sequence. A sequence of length L can be divided into L-k+1 k-mers of length k in total.
  • The statistics unit 302 is configured to count the frequency of each k-mer among the L-k+1 k-mers and to confirm the vector formed by these frequencies as the feature vector of the sequence.
  • Specifically, the statistics unit 302 counts the frequency of each distinct k-mer and then encodes the k-mers: A (adenine), T (thymine), C (cytosine), and G (guanine) are represented by 0, 1, 2, and 3, and each k-mer is read as a base-4 (quaternary) number. That number is the k-mer's index in the vector, and the k-mer's frequency is the value at that index. Because there are 4^k possible k-mers of length k, the resulting vector has 4^k dimensions; it is the feature vector of the sequence.
  • The category judging module 204 illustrated in FIG. 2 may include a comparing unit 401 and a determining unit 402, as in the device of Embodiment 4 shown in FIG. 4, wherein:
  • the comparing unit 401 is configured to compare each read in each cluster's central set K_i with a reference gene sequence and tally the genomic category of each read.
  • Specifically, the comparing unit 401 may use the tool BLAST to compare each read in K_i with the reference gene sequence and tally its genomic category. Note that the technical solution does not compare every read of a cluster with the reference gene sequence; it compares only the reads in the cluster's central set. This narrows the search for each cluster's category and reduces the number of comparisons, and therefore the time overhead.
  • The determining unit 402 is configured to confirm the genomic category C_i of a read R_i as the genomic category of the cluster to which R_i belongs if the frequency of C_i for any read R_i in the central set K_i is not less than a preset threshold.
  • During the comparison, the same read may be assigned to different genomic categories; in that case, its category can be determined from how often each category occurs. For example, suppose the preset threshold is 70%. If the comparison shows that read R_i is assigned to category C'_i with a frequency of 30%, to C''_i with a frequency of 43%, and to C_i with a frequency of 75%, then the genomic category of R_i is determined to be C_i, and C_i is confirmed as the genomic category of the central set K_i, or of the cluster, to which R_i belongs.
  • The metagenomic data classification device of any of FIGS. 2 to 4 may further include a dimensionality reduction module 501, as in the devices of Embodiments 5 to 7 shown in FIGS. 5-a to 5-c.
  • The dimensionality reduction module 501 is configured to reduce the dimensionality of the feature vector of the sequence to be sequenced before the clustering module 202 clusters the feature vectors into the M clusters G_1 to G_M, specifically by mutual-information-based feature selection. This reduces the computation and/or complexity of later processing, and hence the running time.
  • The metagenomic data classification device of any of FIGS. 2 to 4 may further include a reclassification module 601, as in the devices of Embodiments 8 to 10 shown in FIGS. 6-a to 6-c.
  • The reclassification module 601 is configured to classify again, using a classifier trained by multiple kernel learning, the clusters whose genomic category has been confirmed, after the category judging module 204 has determined the genomic category of each cluster by comparing each read in the cluster's central set K_i with the reference gene sequence.
  • Specifically, the reclassification module 601 randomly selects a proportion of the reads, for example 60%, from the central set of each cluster with a confirmed genomic category as a training set and uses the multiple kernel learning tool Shogun to train the classification model. The remaining reads, for example 40%, serve as a test set; they are classified by the trained classifier, and reads in each central set that the earlier clustering assigned incorrectly are filtered out.
  • A thirteenth embodiment of the present invention provides a schematic diagram of a metagenomic data classification device 700.
  • The device 700 may be a computer device or a functional unit within a computer device; the embodiments of the present invention do not limit its specific implementation.
  • The device 700 includes a processor 710, a communication interface 720, a memory 730, and a bus 740.
  • The processor 710, the communication interface 720, and the memory 730 communicate with each other through the bus 740.
  • The communication interface 720 is configured to communicate with external devices, such as a personal computer or a server.
  • The processor 710 is configured to execute the program 732.
  • The program 732 may include program code, and the program code includes computer operating instructions.
  • The processor 710 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention.
  • The memory 730 is configured to store the program 732.
  • The memory 730 may include high-speed RAM and may also include non-volatile memory, such as at least one disk memory.
  • The program 732 may specifically include:
  • a calculation module 733 configured to calculate a feature vector of the sequence to be sequenced;
  • a clustering module 744 configured to cluster the feature vectors to obtain M clusters G_1 to G_M containing reads, where M is an integer not less than 1;
  • an obtaining module 755 configured to obtain the central set K_i of each of the clusters G_1 to G_M;
  • a category judging module 766 configured to determine the genomic category of each cluster by comparing each read in the cluster's central set K_i with a reference gene sequence.
  • For the specific implementation of each unit in the program 732, refer to the corresponding unit in the embodiment shown in FIG. 2; details are not repeated here.
  • The disclosed systems, devices, and methods may be implemented in other ways.
  • The device embodiments described above are merely illustrative.
  • For example, the division into units is only a division by logical function; an actual implementation may divide them differently. Multiple units or components may be combined or integrated into another system, and some features may be omitted or not executed.
  • In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some communication interface, device, or unit, and may be electrical, mechanical, or of another form.
  • Units described as separate components may or may not be physically separate, and components displayed as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment.
  • The functional units in the embodiments of the present invention may be integrated into one processing unit, may each exist physically on their own, or two or more units may be integrated into one unit.
  • If the functions are implemented as software functional units and sold or used as separate products, they may be stored in a computer-readable storage medium.
  • On this understanding, the technical solution of the present invention, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied as a software product stored in a storage medium and including instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present invention.
  • The storage medium includes a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, and other media that can store program code.
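The random 60%/40% split that this section describes for the reclassification step can be sketched as follows. This is only an illustration of the split: the function name `split_reads`, the fixed seed, and the rounding rule are assumptions, and training the multiple kernel learning classifier itself is not shown.

```python
import random


def split_reads(reads, train_fraction=0.6, seed=0):
    """Randomly split one central set into a training part and a test part.

    train_fraction=0.6 mirrors the 60%/40% example in the text; the seed
    and the rounding rule are illustrative assumptions.
    """
    shuffled = list(reads)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

The training part would then be passed to whatever classifier is being trained, and the test part to the trained classifier for filtering.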

Landscapes

  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Chemical & Material Sciences (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Analytical Chemistry (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

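A method for classifying metagenomic data, comprising: calculating a feature vector of a sequence to be sequenced (S101); clustering the feature vectors to obtain M clusters G_1 to G_M containing reads, where M is an integer not less than 1 (S102); obtaining the central set K_i of each of the clusters G_1 to G_M (S103); and determining the genomic category of each cluster by comparing each read in the cluster's central set K_i with a reference gene sequence (S104). The method improves the accuracy of genome classification and addresses the slow speed and low accuracy of gene sequence classification in the prior art.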

Description

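Method and device for classifying metagenomic data
Technical Field
[0001] The present invention relates to the field of gene data processing, and in particular to a method and device for classifying metagenomic data.
Background Art
[0002] DNA-based metagenomics in theory covers all microorganisms in an environmental sample, so it reflects the composition of the microbial community more completely and faithfully, and it greatly expands the sources from which new genes or biologically active substances can be screened. Depending on the strategy used, metagenomics research is either sequence-driven or function-driven: sequence-driven research analyzes the structure and function of microbial communities through sequencing, and function-driven research builds a metagenomic library and screens it for new genes or new substances.
[0003] The goal of metagenomics research is to study the structural composition of microbial communities. For example, sequencing marine samples can reveal the diversity of that environment; similarly, studying human samples can reveal the relationship between human-associated microbes and human health. Once a metagenomic sample has been sequenced, the first task is to identify the microbial species present in it. Many tools are now available that assign metagenomic reads to known species based on alignment and sequence composition.
[0004] A metagenomic classification method based on sequence composition classifies sequences using the compositional features of the sequences themselves. The usual procedure is to sample the data statistically, use the selected feature representation to abstract each sequence into a biologically meaningful feature vector, assemble these feature vectors into a feature matrix, select a suitable classifier model, and classify the biological sequences. Karlin studied the genomic sequences of many microorganisms and found that base composition (for example, GC content) is similar within a species, whereas base usage bias differs greatly between species. On this basis, Teeling et al. developed the TETRA tool, and Chan et al. developed a tool based on a self-organizing growth algorithm. As for features, microbial species abundance, gene function, metabolic pathways, phylogenetic relationships, and the like can serve as features of a community or sample for sample classification. David et al. used phenotypic characteristics of whole microbial genomes (GC content, genome size, energy source, habitat humidity, and oxygen consumption) as sample features and classified metagenomic sequences with an R-SVM classifier.
[0005] Commonly used classifiers include the naive Bayes classification model, the expectation-maximization model, the maximum likelihood estimation model, and Markov models. One kind of metagenomic classifier is supervised: it takes sequence features related to composition, applies them to sequences with known class labels, extracts feature information, feeds it to a classifier to train a classification model, and finally classifies sequences whose labels are unknown. CARMA is a supervised metagenomic classification tool that uses hidden Markov models and classifies short sequences of about 80 bp (base pairs) well. TACOA uses a kernel-based kNN algorithm and can predict sequences with read lengths greater than 800 bp; the software can keep its reference genome database updated in real time and can use IMMs (Interpolated Markov Models) for modeling, classifying sequences longer than 100 bp with high accuracy. NBC applies the naive Bayes classification algorithm to metagenomic classification and provides an online web service, so classification results can be displayed conveniently and quickly on a web page. Zhang Xuegong et al. proposed a supervised metagenomic classification algorithm that needs no reference sequence and uses the R-SVM algorithm; a feature selection algorithm screens out the useful features in the sequence composition information to improve classification accuracy.
[0006] However, because of their feature extraction methods and classifier model performance, the existing supervised classification algorithms above achieve relatively low accuracy on large-scale metagenomic classification problems that involve low taxonomic levels and many species, and their running time overhead is too high.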
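Technical Problem
[0007] An object of the present invention is to provide a method and device for classifying metagenomic data that improve genome classification accuracy at a small time cost.
Solution to Problem
Technical Solution
[0008] A first aspect of the present invention provides a method for classifying metagenomic data, the method comprising:
[0009] calculating a feature vector of the sequence to be sequenced;
[0010] clustering the feature vectors to obtain M clusters G_1 to G_M containing reads, where M is an integer not less than 1;
[0011] obtaining the central set K_i of each of the clusters G_1 to G_M;
[0012] determining the genomic category of each cluster by comparing each read in the cluster's central set K_i with a reference gene sequence.
[0013] A second aspect of the present invention provides a metagenomic data classification device, the device comprising:
[0014] a calculation module configured to calculate a feature vector of the sequence to be sequenced;
[0015] a clustering module configured to cluster the feature vectors to obtain M clusters G_1 to G_M containing reads, where M is an integer not less than 1;
[0016] an obtaining module configured to obtain the central set K_i of each of the clusters G_1 to G_M;
[0017] a category judging module configured to determine the genomic category of each cluster by comparing each read in the cluster's central set K_i with a reference gene sequence.
Advantageous Effects of Invention
Advantageous Effects
[0018] As the above technical solution shows, the feature vectors of the sequences to be sequenced are clustered into several clusters of reads, and a central set is obtained for each cluster. Because only the reads in each cluster's central set are compared with the reference gene sequence to determine the cluster's genomic category, the technical solution, compared with the prior art, both reduces the time spent on classification (that is, increases computation speed) and significantly improves the accuracy with which sequenced sequences are assigned to genomic categories.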
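Brief Description of Drawings
Description of Drawings
[0019] FIG. 1 is a schematic flowchart of the metagenomic data classification method according to Embodiment 1 of the present invention;
[0020] FIG. 2 is a schematic structural diagram of the metagenomic data classification device according to Embodiment 2 of the present invention;
[0021] FIG. 3 is a schematic structural diagram of the metagenomic data classification device according to Embodiment 3 of the present invention;
[0022] FIG. 4 is a schematic structural diagram of the metagenomic data classification device according to Embodiment 4 of the present invention;
[0023] FIG. 5-a is a schematic structural diagram of the metagenomic data classification device according to Embodiment 5 of the present invention;
[0024] FIG. 5-b is a schematic structural diagram of the metagenomic data classification device according to Embodiment 6 of the present invention;
[0025] FIG. 5-c is a schematic structural diagram of the metagenomic data classification device according to Embodiment 7 of the present invention;
[0026] FIG. 6-a is a schematic structural diagram of the metagenomic data classification device according to Embodiment 8 of the present invention;
[0027] FIG. 6-b is a schematic structural diagram of the metagenomic data classification device according to Embodiment 9 of the present invention;
[0028] FIG. 6-c is a schematic structural diagram of the metagenomic data classification device according to Embodiment 10 of the present invention;
[0029] FIG. 7 is a schematic structural diagram of the metagenomic data classification device according to Embodiment 11 of the present invention.
Embodiments of the Invention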
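[0030] To make the objects, technical solutions, and advantageous effects of the present invention clearer, the present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.
[0031] An embodiment of the present invention provides a metagenomic data classification method, the method comprising: computing feature vectors of the sequences to be sequenced; clustering the feature vectors to obtain M clusters G_1 to G_M containing reads, where M is an integer not less than 1; obtaining the center set K_i of each of the clusters G_1 to G_M; and determining the genome category of each cluster by comparing each read in the center set K_i of that cluster with reference gene sequences. An embodiment of the present invention also provides a corresponding metagenomic data classification apparatus. Each is described in detail below.
[0032] Referring to FIG. 1, a schematic flowchart of the metagenomic data classification method provided by Embodiment 1 of the present invention, the method mainly comprises the following steps S101 to S104, described in detail as follows: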
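[0033] S101: compute the feature vector of a sequence to be sequenced.
[0034] As one embodiment of the present invention, the feature vector of a sequence to be sequenced can be computed through the following steps S1011 and S1012:
[0035] S1011: split the sequence to be sequenced into L-k+1 k-mers of length k, where L is the length of the sequence to be sequenced.
[0036] In genomics, a k-mer is a substring of length k, generally the k consecutive bases starting at some position of a sequence. Assuming the sequence has length L, in this embodiment the sequence to be sequenced can be cut into fragments of length k = 3, 4, and 6 in turn; each fragment is a k-mer, so a sequence of length L can be split into a total of L-k+1 k-mers of length k.
[0037] S1012: count the frequency of each k-mer among the L-k+1 k-mers obtained in step S1011, and take the vector of dimension 4^k formed by these k-mer frequencies as the feature vector of the sequence to be sequenced.
[0038] Specifically, for a sequence to be sequenced that has been split into L-k+1 k-mers of length k, the frequencies of the distinct k-mers are counted, and the k-mers are then encoded: A (adenine), T (thymine), C (cytosine), and G (guanine) are represented by the digits 0, 1, 2, and 3 respectively, and each k-mer is encoded as a base-4 number. The numeric value of each k-mer serves as its index in the vector and the frequency of that k-mer as the value at that index, giving a vector of dimension 4^k. This vector is the feature vector of the sequence to be sequenced that was split into L-k+1 k-mers of length k.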
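The k-mer counting and base-4 indexing of steps S1011 and S1012 can be illustrated with a minimal Python sketch. The function names are hypothetical, and two details are assumptions not stated in the description: frequencies are normalized by the total number of k-mers (L-k+1), and k-mers containing characters other than A, T, C, G are skipped.

```python
# Digit assigned to each base, following the description: A=0, T=1, C=2, G=3.
BASE_CODE = {"A": 0, "T": 1, "C": 2, "G": 3}


def kmer_index(kmer):
    """Base-4 encoding of a k-mer, used as its index in the feature vector."""
    index = 0
    for base in kmer:
        index = index * 4 + BASE_CODE[base]
    return index


def kmer_feature_vector(sequence, k):
    """Return the 4**k-dimensional k-mer frequency vector of a sequence."""
    total = len(sequence) - k + 1  # number of k-mers, L-k+1
    if total <= 0:
        raise ValueError("sequence is shorter than k")
    counts = [0] * (4 ** k)
    for start in range(total):
        kmer = sequence[start:start + k]
        # Assumption: k-mers with ambiguous bases (e.g. N) are ignored.
        if all(base in BASE_CODE for base in kmer):
            counts[kmer_index(kmer)] += 1
    # Assumption: frequency means count divided by the number of k-mers.
    return [count / total for count in counts]
```

For k = 3, 4, and 6 the vectors have 64, 256, and 4096 entries respectively; how the three are combined is not specified in the description, so the sketch handles one value of k at a time.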
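[0039] It should be noted that, to reduce the amount and/or complexity of computation in subsequent processing and thus the running time overhead, the feature vectors of the sequences to be sequenced may be reduced in dimension in this embodiment; specifically, selection based on mutual information may be used for the dimensionality reduction.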
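[0040] S102: cluster the feature vectors computed in step S101 to obtain M clusters G_1 to G_M containing reads, where M is an integer not less than 1.
[0041] Specifically, the kmeans algorithm in the vlfeat clustering toolbox can be used to cluster the feature vectors computed in step S101, yielding M clusters containing reads, numbered here G_1, G_2, ..., G_i, ..., G_M-1, G_M.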
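To make the clustering step concrete, the following is a plain-Python sketch of Lloyd's k-means iteration. It is only a stand-in for the vlfeat kmeans routine named above, not that library's interface; the function names, the random initialization from the data points, and the `seed` and `iterations` parameters are assumptions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, m, iterations=100, seed=0):
    """Cluster vectors into m groups; return one cluster label per vector."""
    rng = random.Random(seed)
    # Assumption: initial centers are m distinct data points chosen at random.
    centers = [list(v) for v in rng.sample(vectors, m)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest center.
        new_labels = [
            min(range(m), key=lambda j: squared_distance(v, centers[j]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(m):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

Reads whose feature vectors receive label i then form cluster G_i. The value of M has to be supplied by the caller; the description does not say how it is chosen.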
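[0042] S103: obtain the center set K_i of each of the clusters G_1 to G_M.
[0043] In the clusters obtained in step S102, many of the reads in each cluster may share overlapping bases. In this embodiment, all the reads in a cluster can be organized into a graph in which each read is a vertex; the maximum independent set of the graph is then computed, and the reads contained in this maximum independent set form the center set K_i of the cluster.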
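The center-set construction can be sketched as follows. Computing a true maximum independent set is NP-hard in general, so this sketch swaps in a greedy minimum-degree heuristic that returns a maximal (not necessarily maximum) independent set. The edge rule is also an assumption: two reads are joined when a suffix of one equals a prefix of the other over at least `min_overlap` bases, since the description does not define how overlap is measured. All names are hypothetical.

```python
def overlaps(a, b, min_overlap):
    """True if a suffix of one read equals a prefix of the other."""
    for x, y in ((a, b), (b, a)):
        for n in range(min_overlap, min(len(x), len(y)) + 1):
            if x[-n:] == y[:n]:
                return True
    return False


def center_set(reads, min_overlap=3):
    """Greedy approximation of a maximum independent set of the overlap graph."""
    n = len(reads)
    # Build the graph: one vertex per read, one edge per overlapping pair.
    neighbors = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if overlaps(reads[i], reads[j], min_overlap):
                neighbors[i].add(j)
                neighbors[j].add(i)
    remaining = set(range(n))
    chosen = []
    while remaining:
        # Take the vertex with the fewest remaining neighbours
        # (ties broken by lowest index), then drop it and its neighbours.
        v = min(remaining, key=lambda i: (len(neighbors[i] & remaining), i))
        chosen.append(v)
        remaining -= neighbors[v] | {v}
    return [reads[i] for i in sorted(chosen)]
```

The pairwise overlap test is quadratic in the number of reads per cluster, which is acceptable for a sketch but would need an index (for example on k-mer prefixes) for large clusters.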
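[0044] S104: determine the genome category of each cluster by comparing each read in the center set K_i of that cluster with reference gene sequences.
[0045] As one embodiment of the present invention, determining the genome category of each cluster by comparing each read in its center set K_i with reference gene sequences can be implemented through the following steps S1041 and S1042:
[0046] S1041: compare each read in the center set K_i of each cluster with the reference gene sequences, and tally the genome category of each read in the center set K_i.
[0047] Specifically, each read in the center set K_i of each cluster can be compared with the reference gene sequences using the tool BLAST, and the genome category of each read in the center set K_i tallied. It should be noted that the technical solution of the present invention does not compare all the reads of each cluster with the reference gene sequences; it compares only the reads in the center set K_i of each cluster. This narrows the search range for the category of each cluster and reduces the number of comparisons, thereby reducing the time overhead.
[0048] S1042: if the frequency of occurrence of the genome category C_i of any read R_i in the center set K_i is not less than a preset threshold, confirm the genome category C_i of read R_i as the genome category of the cluster to which read R_i belongs.
[0049] While each read in the center set K_i of each cluster is compared with the reference gene sequences, the tally may show that the same read belongs to different genome categories. In that case, the genome category of the read can be determined from the frequency with which each category occurs. For example, suppose the preset threshold is 70%. If the comparison and tally show that read R_i is assigned to category C'_i with frequency 30%, to category C''_i with frequency 43%, and to category C_i with frequency 75%, then the genome category of read R_i is determined to be C_i, and C_i is confirmed as the genome category of the center set K_i, or of the cluster, to which read R_i belongs.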
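The threshold rule of step S1042 can be sketched in a few lines of Python. The input here is a hypothetical list of category labels, one per alignment hit, standing in for whatever the BLAST comparison returns; parsing of BLAST output is not shown. The function name and the convention of returning None when no category reaches the threshold are assumptions.

```python
from collections import Counter


def assign_category(hit_categories, threshold=0.7):
    """Return the category whose share of hits is >= threshold, else None."""
    if not hit_categories:
        return None
    # Most frequent category and its number of occurrences.
    category, count = Counter(hit_categories).most_common(1)[0]
    if count / len(hit_categories) >= threshold:
        return category
    return None
```

With a threshold above 50%, at most one category can qualify, so checking only the most frequent category is sufficient.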
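[0050] To remove sequences that were wrongly or mistakenly assigned in step S104 and improve the overall accuracy of metagenomic data classification, in this embodiment a classifier trained by multiple kernel learning may further be used after step S104 to classify again the clusters whose genome categories have been confirmed. Specifically, a certain proportion of the reads, for example 60%, can be selected at random from the center set K_i of a cluster with a confirmed genome category as the training set, and a classification model trained with the multiple kernel learning tool shogun; the remaining proportion, for example 40% of the reads, serves as the test set and is classified by the trained classifier, so that reads in each center set K_i that were misjudged because of the clustering in the preceding steps are filtered out.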
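The 60%/40% split and the filtering step can be sketched as below. The multiple kernel learning model itself is not reproduced: `predict` is a hypothetical callable standing in for a classifier trained with shogun, and the rule that a read is dropped when the prediction disagrees with its assigned category is an assumption about how the filtering is carried out.

```python
import random


def split_reads(items, train_fraction=0.6, seed=0):
    """Randomly split items into a training part and a test part."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def filter_misassigned(test_items, predict):
    """Keep (vector, label) pairs whose label matches the prediction.

    `predict` is a hypothetical stand-in for the trained classifier:
    it takes a feature vector and returns a category label.
    """
    return [(vec, label) for vec, label in test_items if predict(vec) == label]
```

A typical use would split the labelled reads of the center sets, train the classifier on the first part, and pass its prediction function to `filter_misassigned` together with the second part.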
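[0051] As can be seen from the metagenomic data classification method illustrated in FIG. 1, the feature vectors of the sequences to be sequenced are clustered into several clusters containing reads, and the center set of each cluster is obtained from them. Because only the reads in the center set of each cluster are compared with the reference gene sequences to determine the genome category of each cluster, the technical solution provided by the present invention, compared with the prior art, both reduces the time overhead of classification (that is, increases computation speed) and significantly improves the accuracy with which sequenced sequences are assigned to genome categories.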
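[0052] Refer to FIG. 2, a schematic structural diagram of the metagenomic data classification apparatus provided by Embodiment 2 of the present invention. For ease of description, FIG. 2 shows only the parts relevant to this embodiment. The apparatus illustrated in FIG. 2 may be the entity that executes the metagenomic data classification method illustrated in FIG. 1. It mainly comprises a computing module 201, a clustering module 202, an acquisition module 203, and a category determination module 204, wherein:
[0053] the computing module 201 is configured to compute feature vectors of the sequences to be sequenced.
[0054] The clustering module 202 is configured to cluster the feature vectors computed by the computing module 201 to obtain M clusters G_1 to G_M containing reads, where M is an integer not less than 1.
[0055] Specifically, the clustering module 202 can use the kmeans algorithm in the vlfeat clustering toolbox to cluster the feature vectors computed by the computing module 201, yielding M clusters containing reads, numbered here G_1, G_2, ..., G_i, ..., G_M-1, G_M.
[0056] The acquisition module 203 is configured to obtain the center set K_i of each of the clusters G_1 to G_M.
[0057] In the clusters produced by the clustering module 202, many of the reads in each cluster may share overlapping bases. In this embodiment, the acquisition module 203 can organize all the reads in a cluster into a graph in which each read is a vertex, then compute the maximum independent set of the graph; the reads contained in this maximum independent set form the center set K_i of the cluster.
[0058] The category determination module 204 is configured to determine the genome category of each cluster by comparing each read in the center set K_i of that cluster with reference gene sequences.
[0059] It should be noted that, in the above implementation of the metagenomic data classification apparatus illustrated in FIG. 2, the division into functional modules is only an example. In practice, the above functions may be assigned to different functional modules as needed, for example to meet hardware configuration requirements or for convenience of software implementation; that is, the internal structure of the apparatus may be divided into different functional modules to perform all or part of the functions described above. Moreover, in practice, a functional module in this embodiment may be implemented by corresponding hardware, or by corresponding hardware executing corresponding software. For example, the clustering module may be hardware that clusters the feature vectors computed by the computing module (or calculator) into M clusters G_1 to G_M containing reads, such as a clusterer, or it may be a general-purpose processor or other hardware device capable of executing a corresponding computer program to perform that function. Likewise, the category determination module may be hardware that determines the genome category of each cluster by comparing each read in the center set K_i of that cluster with reference gene sequences, such as a category determiner, or it may be a general-purpose processor or other hardware device capable of executing a corresponding computer program to perform that function (this principle applies to every embodiment provided in this specification).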
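[0060] The computing module 201 illustrated in FIG. 2 may comprise a segmentation unit 301 and a counting unit 302, as in the metagenomic data classification apparatus provided by Embodiment 3 of the present invention and shown in FIG. 3, wherein:
[0061] the segmentation unit 301 is configured to split the sequence to be sequenced into L-k+1 k-mers of length k, where L is the length of the sequence to be sequenced.
[0062] In genomics, a k-mer is a substring of length k, generally the k consecutive bases starting at some position of a sequence. Assuming the sequence has length L, in this embodiment the segmentation unit 301 can cut the sequence to be sequenced into fragments of length k = 3, 4, and 6 in turn; each fragment is a k-mer, so a sequence of length L can be split into a total of L-k+1 k-mers of length k.
[0063] The counting unit 302 is configured to count the frequency of each k-mer among the L-k+1 k-mers and to take the vector of dimension 4^k formed by these k-mer frequencies as the feature vector of the sequence to be sequenced.
[0064] Specifically, for a sequence to be sequenced that has been split into L-k+1 k-mers of length k, the counting unit 302 counts the frequencies of the distinct k-mers and then encodes the k-mers: A (adenine), T (thymine), C (cytosine), and G (guanine) are represented by the digits 0, 1, 2, and 3 respectively, and each k-mer is encoded as a base-4 number. The numeric value of each k-mer serves as its index in the vector and the frequency of that k-mer as the value at that index, giving a vector of dimension 4^k. This vector is the feature vector of the sequence to be sequenced that was split into L-k+1 k-mers of length k.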
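[0065] The category determination module 204 illustrated in FIG. 2 may comprise a comparison unit 401 and a determination unit 402, as in the metagenomic data classification apparatus provided by Embodiment 4 of the present invention and shown in FIG. 4, wherein:
[0066] the comparison unit 401 is configured to compare each read in the center set K_i of each cluster with the reference gene sequences and to tally the genome category of each read in the center set K_i.
[0067] Specifically, the comparison unit 401 can compare each read in the center set K_i of each cluster with the reference gene sequences using the tool BLAST and tally the genome category of each read in the center set K_i. It should be noted that the technical solution of the present invention does not compare all the reads of each cluster with the reference gene sequences; it compares only the reads in the center set K_i of each cluster. This narrows the search range for the category of each cluster and reduces the number of comparisons, thereby reducing the time overhead.
[0068] The determination unit 402 is configured to take the genome category C_i of any read R_i in the center set K_i as the genome category of the cluster to which read R_i belongs if the frequency of occurrence of C_i is not less than a preset threshold.
[0069] While the comparison unit 401 compares each read in the center set K_i of each cluster with the reference gene sequences, the tally may show that the same read belongs to different genome categories. In that case, the genome category of the read can be determined from the frequency with which each category occurs. For example, suppose the preset threshold is 70%. If the comparison and tally show that read R_i is assigned to category C'_i with frequency 30%, to category C''_i with frequency 43%, and to category C_i with frequency 75%, then the determination unit 402 determines the genome category of read R_i to be C_i and confirms C_i as the genome category of the center set K_i, or of the cluster, to which read R_i belongs.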
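[0070] The metagenomic data classification apparatus illustrated in any of FIGS. 2 to 4 may further comprise a dimensionality reduction module 501, as in the apparatus provided by Embodiments 5 to 7 of the present invention and shown in FIGS. 5-a to 5-c. After the computing module 201 computes the feature vectors of the sequences to be sequenced and before the clustering module 202 clusters them into M clusters G_1 to G_M containing reads, the dimensionality reduction module 501 reduces the dimension of the feature vectors; specifically, selection based on mutual information may be used. The dimensionality reduction performed by module 501 lowers the amount and/or complexity of computation in subsequent processing and thus the running time overhead.
[0071] The metagenomic data classification apparatus illustrated in any of FIGS. 2 to 4 may further comprise a reclassification module 601, as in the apparatus provided by Embodiments 8 to 10 of the present invention and shown in FIGS. 6-a to 6-c. After the category determination module 204 has determined the genome category of each cluster by comparing each read in its center set K_i with the reference gene sequences, the reclassification module 601 uses a classifier trained by multiple kernel learning to classify again the clusters whose genome categories have been confirmed.
[0072] To remove sequences that were wrongly or mistakenly assigned by the category determination module 204 and improve the overall accuracy of metagenomic data classification, in this embodiment, after the category determination module 204 has determined the genome category of each cluster by comparing each read in its center set K_i with the reference gene sequences, the reclassification module 601 further uses a classifier trained by multiple kernel learning to classify again the clusters whose genome categories have been confirmed. Specifically, the reclassification module 601 can select at random a certain proportion of the reads, for example 60%, from the center set K_i of a cluster with a confirmed genome category as the training set and train a classification model with the multiple kernel learning tool shogun; the remaining proportion, for example 40% of the reads, serves as the test set and is classified by the trained classifier, so that reads in each center set K_i that were misjudged because of the clustering in the preceding steps are filtered out.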
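[0073] Referring to FIG. 7, Embodiment 11 of the present invention provides a schematic diagram of a metagenomic data classification apparatus 700. The apparatus 700 may be a computer device or a functional unit within a computer device; the specific embodiments of the present invention do not limit its concrete implementation. The metagenomic data classification apparatus 700 comprises:
[0074] a processor 710, a communications interface 720, a memory 730, and a bus 740.
[0075] The processor 710, the communications interface 720, and the memory 730 communicate with one another through the bus 740.
[0076] The communications interface 720 is configured to communicate with external devices, for example personal computers and servers.
[0077] The processor 710 is configured to execute a program 732.
[0078] Specifically, the program 732 may comprise program code, and the program code comprises computer operation instructions.
[0079] The processor 710 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention.
[0080] The memory 730 is configured to store the program 732. The memory 730 may comprise high-speed RAM and may also comprise non-volatile memory, for example at least one disk storage device. The program 732 may specifically comprise:
[0081] a computing module 733, configured to compute feature vectors of the sequences to be sequenced;
[0082] a clustering module 744, configured to cluster the feature vectors to obtain M clusters G_1 to G_M containing reads, where M is an integer not less than 1;
[0083] an acquisition module 755, configured to obtain the center set K_i of each of the clusters G_1 to G_M;
[0084] a category determination module 766, configured to determine the genome category of each cluster by comparing each read in the center set K_i of that cluster with reference gene sequences.
[0085] For the specific implementation of each unit in the program 732, see the corresponding unit in the embodiment shown in FIG. 2; details are not repeated here.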
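[0086] Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the apparatus and units described above may be found in the corresponding processes of the foregoing method embodiments and are not repeated here.
[0087] In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through communications interfaces, apparatuses, or units, and may be electrical, mechanical, or of other forms.
[0088] The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
[0089] In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, may each exist physically on their own, or two or more units may be integrated into one unit.
[0090] If the functions are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. On this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and comprises several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present invention. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.
[0091] The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.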

Claims

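[Claim 1] A metagenomic data classification method, characterized in that the method comprises:
computing feature vectors of the sequences to be sequenced;
clustering the feature vectors to obtain M clusters G_1 to G_M containing reads, where M is an integer not less than 1;
obtaining the center set K_i of each of the clusters G_1 to G_M;
determining the genome category of each cluster by comparing each read in the center set K_i of that cluster with reference gene sequences.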
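[Claim 2] The method according to claim 1, characterized in that computing the feature vector of a sequence to be sequenced comprises:
splitting the sequence to be sequenced into L-k+1 k-mers of length k, where L is the length of the sequence to be sequenced;
counting the frequency of each k-mer among the L-k+1 k-mers, and taking the vector of dimension 4^k formed by these k-mer frequencies as the feature vector of the sequence to be sequenced.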
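[Claim 3] The method according to claim 1, characterized in that determining the genome category of each cluster by comparing each read in the center set K_i of that cluster with reference gene sequences comprises:
comparing each read in the center set K_i of each cluster with the reference gene sequences, and tallying the genome category of each read in the center set K_i;
if the frequency of occurrence of the genome category C_i of any read R_i in the center set K_i is not less than a preset threshold, confirming the genome category C_i of the read R_i as the genome category of the cluster to which the read R_i belongs.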
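[Claim 4] The method according to any one of claims 1 to 3, characterized in that, after computing the feature vectors of the sequences to be sequenced and before clustering the feature vectors to obtain M clusters G_1 to G_M containing reads, the method further comprises:
reducing the dimension of the feature vectors of the sequences to be sequenced.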
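[Claim 5] The method according to any one of claims 1 to 3, characterized in that, after determining the genome category of each cluster by comparing each read in the center set K_i of that cluster with reference gene sequences, the method further comprises:
using a classifier trained by multiple kernel learning to classify again the clusters whose genome categories have been confirmed.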
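[Claim 6] A metagenomic data classification apparatus, characterized in that the apparatus comprises:
a computing module, configured to compute feature vectors of the sequences to be sequenced;
a clustering module, configured to cluster the feature vectors to obtain M clusters G_1 to G_M containing reads, where M is an integer not less than 1;
an acquisition module, configured to obtain the center set K_i of each of the clusters G_1 to G_M;
a category determination module, configured to determine the genome category of each cluster by comparing each read in the center set K_i of that cluster with reference gene sequences.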
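[Claim 7] The apparatus according to claim 6, characterized in that the computing module comprises:
a segmentation unit, configured to split the sequence to be sequenced into L-k+1 k-mers of length k, where L is the length of the sequence to be sequenced;
a counting unit, configured to count the frequency of each k-mer among the L-k+1 k-mers and to take the vector of dimension 4^k formed by these k-mer frequencies as the feature vector of the sequence to be sequenced.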
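[Claim 8] The apparatus according to claim 6, characterized in that the category determination module comprises:
a comparison unit, configured to compare each read in the center set K_i of each cluster with the reference gene sequences and to tally the genome category of each read in the center set K_i;
a determination unit, configured to confirm the genome category C_i of any read R_i in the center set K_i as the genome category of the cluster to which the read R_i belongs if the frequency of occurrence of C_i is not less than a preset threshold.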
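[Claim 9] The apparatus according to any one of claims 6 to 8, characterized in that the apparatus further comprises:
a dimensionality reduction module, configured to reduce the dimension of the feature vectors of the sequences to be sequenced after the computing module computes the feature vectors and before the clustering module clusters them to obtain M clusters G_1 to G_M containing reads.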
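[Claim 10] The apparatus according to any one of claims 6 to 8, characterized in that the apparatus further comprises:
a reclassification module, configured to use a classifier trained by multiple kernel learning to classify again the clusters whose genome categories have been confirmed, after the category determination module has determined the genome category of each cluster by comparing each read in the center set K_i of that cluster with reference gene sequences.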
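[Claim 11] A metagenomic data classification apparatus, characterized in that the apparatus comprises: a processor, a communications interface, a memory, and a bus; wherein the processor, the communications interface, and the memory communicate with one another through the bus;
the communications interface is configured to communicate with external devices;
the processor is configured to execute a program;
the memory is configured to store the program;
the program comprises:
a computing module, configured to compute feature vectors of the sequences to be sequenced;
a clustering module, configured to cluster the feature vectors to obtain M clusters G_1 to G_M containing reads, where M is an integer not less than 1;
an acquisition module, configured to obtain the center set K_i of each of the clusters G_1 to G_M;
a category determination module, configured to determine the genome category of each cluster by comparing each read in the center set K_i of that cluster with reference gene sequences.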
PCT/CN2016/113029 2016-12-29 2016-12-29 Metagenomic data classification method and apparatus WO2018119882A1 (zh)


Publications (1)

Publication Number Publication Date
WO2018119882A1 true WO2018119882A1 (zh) 2018-07-05


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2390810A2 (en) * 2010-05-26 2011-11-30 Tata Consultancy Services Limited Taxonomic classification of metagenomic sequences
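CN103246829A (zh) * 2012-02-10 2013-08-14 塔塔咨询服务有限公司 (Tata Consultancy Services Limited) Assembly of metagenomic sequences
CN103955629A (zh) * 2014-02-18 2014-07-30 吉林大学 (Jilin University) Metagenomic fragment clustering method based on fuzzy k-means
WO2016172643A2 (en) * 2015-04-24 2016-10-27 University Of Utah Research Foundation Methods and systems for multiple taxonomic classification
CN106156791A (zh) * 2016-06-15 2016-11-23 北京京东尚科信息技术有限公司 (Beijing Jingdong Shangke Information Technology Co., Ltd.) Service data classification method and apparatus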


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHEN, BO ET AL.: "Features Extraction and Dimensions Reduction in Metagenomic Binning Problem", COMPUTER SYSTEMS & APPLICATIONS, vol. 24, no. 11, 31 December 2015 (2015-12-31), pages 31 - 37 *


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16925366

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16925366

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 02/10/2019)
