CN114664383A - Metagenome component classification method and system combining reference library prior knowledge - Google Patents


Publication number
CN114664383A
Authority
CN
China
Prior art keywords
binning
clustering
species
feasible
target sequence
Prior art date
Legal status
Pending
Application number
CN202210335489.2A
Other languages
Chinese (zh)
Inventor
宋闻欢
Current Assignee
Weihai Xinyuan Fruit Industry Technology Service Co ltd
Original Assignee
Weihai Xinyuan Fruit Industry Technology Service Co ltd
Priority date
Filing date
Publication date
Application filed by Weihai Xinyuan Fruit Industry Technology Service Co ltd
Priority to CN202210335489.2A
Publication of CN114664383A

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/30 Unsupervised data analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/40 Population genetics; Linkage disequilibrium
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Ecology (AREA)
  • Bioethics (AREA)
  • Physiology (AREA)
  • Molecular Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of metagenomics and data science, and provides a metagenomic binning method and system combining prior knowledge from a reference library. The method comprises the following steps: acquiring a target sequence data set, extracting features from each sequence sample in the target sequence data set, and obtaining a set of binning feature vectors after feature transformation; comparing the target sequence data set with a reference library to obtain an estimate of the number of species and the confidence that each sequence sample belongs to each species, from which a feasible interval for the number of bins and a confidence cluster center for each species are derived and taken as prior knowledge; and, for each feasible number of bins in the feasible interval, clustering with a clustering method, selecting the optimal clustering result, and binning the target sequence data set according to that result. Compared with the prior art, the invention addresses the problems that existing metagenomic binning either cannot handle sequences of unknown species or has insufficient binning accuracy.

Description

Metagenome component classification method and system combining reference library prior knowledge
Technical Field
The application relates to the technical field of metagenomics and data science, in particular to a metagenomics binning method and a metagenomics binning system combining prior knowledge of a reference library.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
Metagenomics directly studies the genetic material of microorganisms in natural environmental samples. It provides an effective way to study the real microbial world and avoids the bias introduced by laboratory culture. Metagenomic binning classifies gene sequences in order to distinguish different microbial species or subspecies. Because the binning result directly affects the accuracy of metagenomic research, binning has become a key problem in the field.
At present, metagenomic binning research falls mainly into two categories: contig binning and long-read binning. A contig is a long gene fragment assembled from short reads that are joined through overlapping terminal sequences, whereas long reads are long gene sequences generated by third-generation sequencing (TGS) technology. Because both are longer than short reads and contain more genetic features, both are better suited to binning.
As for the binning method, existing metagenomic binning methods are roughly divided into two categories: reference-based binning and reference-free binning. Reference-based binning classifies a target sequence data set by comparing it with a reference database of known species. It can achieve high binning accuracy on gene sequences of known species but cannot handle unknown species. Reference-free binning does not depend on a reference library; it classifies sequences by applying feature engineering and clustering to the distinguishing features of the gene sequences. Such methods can classify unknown species, but their binning accuracy is typically low, especially when the distinguishing features differ little or the number of species is large.
In recent years, with the continuous discovery and registration of new species, reference libraries have been greatly supplemented and improved and can provide a convenient assessment of the species information in a target sequence data set. Although this assessment is not accurate enough, it is valuable prior knowledge, and making full use of it can greatly improve binning accuracy. Some researchers have combined the two binning approaches by dividing the binning process into two relatively independent stages: a reference-free binning method is first used for preliminary binning, and a reference-based method is then used to re-bin the sequences whose binning quality is insufficient. In essence, this approach uses reference-based binning as a supplement to reference-free binning, and the two binning methods are not truly fused. When the distinguishing features differ little or the number of species is large, the binning performance of this approach depends on the reference-based re-binning, which impairs the identification of sequences from unknown species.
Disclosure of Invention
In order to overcome the defects of the prior art, the application provides a metagenomic binning method and system combining prior knowledge of a reference library, which make full use of the prior knowledge provided by existing reference libraries and incorporate it into reference-free metagenomic binning. Compared with the prior art, the method fuses the two binning approaches in essence, addresses the problems that existing metagenomic binning either cannot handle sequences of unknown species or has insufficient binning accuracy, can bin gene sequences of unknown species, and achieves better binning performance than reference-free binning methods.
In order to achieve the above object, one or more embodiments of the present invention provide the following technical solutions:
a metagenomic binning method combining prior knowledge of a reference library, the reference library storing standard gene sequences of known species, comprising the steps of:
acquiring a target sequence data set, performing feature extraction on each sequence sample in the target sequence data set to obtain an initial distinguishing feature vector of each sequence sample, and performing feature transformation to obtain a binning feature vector of each sequence sample;
comparing the target sequence data set with the reference library to obtain an estimate of the number of species contained in the target sequence data set and the confidence that each sequence sample belongs to each species; generating a feasible interval for the number of bins according to the estimated number of species, wherein every feasible number of bins in the interval is greater than or equal to the estimated number of species; obtaining a prior cluster center set according to the confidence that each sequence sample belongs to each species, and taking the feasible interval for the number of bins and the prior cluster center set as prior knowledge;
determining a corresponding cluster center set for each feasible number of bins in the feasible interval, wherein the cluster center set comprises the prior cluster center set and an amplified cluster center set; and, for each feasible number of bins, performing cluster analysis on the binning feature vector set based on the corresponding cluster center set, selecting the optimal clustering result, and obtaining the sequence sample set corresponding to each cluster in the optimal clustering result, which is the binning result.
Further, the feature transformation is used to obtain a low-dimensional representation of the initial discriminating feature vector, i.e. a binned feature vector.
Further, the feature transformation adopts any one of the following methods:
(1) a deep learning model VAE; (2) a dimension reduction model UMAP; (3) a deep learning model VAE applied first, followed by a dimension reduction model UMAP applied to the resulting latent vectors.
Further, obtaining a priori clustering center set according to the confidence degrees that the sequence samples belong to different species includes:
for each species, selecting a plurality of benchmark sequence samples of the species according to the confidence that each sequence sample belongs to the species, and looking up the binning feature vectors of the benchmark samples in the binning feature vector set, wherein the center of these binning feature vectors is the prior cluster center of the species.
Further, the amplified cluster centers are selected at random from the binning feature vector set, and the distance between any two amplified cluster centers and the distance between any amplified cluster center and any prior cluster center are not smaller than a set threshold.
Further, the clustering results corresponding to all feasible numbers of bins are compared according to the silhouette coefficient or the Calinski-Harabasz (CH) index, and the optimal clustering result is selected.
Further, the obtaining of the sequence sample set corresponding to each cluster in the optimal clustering result includes:
letting $X$ be the target sequence data set and, for any sequence sample $x \in X$, letting $F'_x$ denote the binning feature vector corresponding to $x$; if the optimal clustering result is $\{c^*_1, c^*_2, \ldots, c^*_{K^*}\}$, computing $B_j = \{x \mid x \in X \text{ and } F'_x \in c^*_j\}$, $j = 1, 2, \ldots, K^*$, to obtain the sequence sample set corresponding to each cluster, $\{B_1, B_2, \ldots, B_{K^*}\}$, which is the binning result.
One or more embodiments provide a metagenomic binning system combining prior knowledge of a reference library, the reference library storing standard gene sequences of known species, comprising:
the characteristic extraction module is used for acquiring a target sequence data set, extracting the characteristic of each sequence sample in the target sequence data set to obtain an initial distinguishing characteristic vector of each sequence sample, and obtaining a box-dividing characteristic vector of each sequence after characteristic transformation;
the priori knowledge acquisition module is used for comparing a target sequence data set with a reference database to obtain an estimated value of the number of species contained in the target sequence data set and confidence degrees of different species of each sequence sample; generating feasible bin number intervals according to the estimated species number, wherein in the interval, the feasible bin number is more than or equal to the estimated species number; obtaining a prior clustering center set according to the confidence coefficient that each sequence sample belongs to different species, and taking the feasible bin quantity intervals and the prior clustering center set as prior knowledge;
the binning module is used for determining a corresponding clustering center set for each feasible binning quantity in the binning quantity feasible interval, wherein the clustering center set comprises a priori clustering center set and an amplified clustering center set; and for each feasible binning quantity, performing clustering analysis on the binning feature vector set based on the corresponding clustering center set, selecting an optimal clustering result, and obtaining a sequence sample set corresponding to each cluster in the optimal clustering result, namely a binning result.
One or more embodiments provide an electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the metagenomic binning method combining prior knowledge of a reference library when executing the program.
One or more embodiments provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the metagenomic binning method in conjunction with reference library a priori knowledge.
The above one or more technical solutions have the following beneficial effects:
The method first estimates the number of species based on a reference library and selects confident sequence samples for each species, taking the center of the binning feature vectors of these samples as the prior cluster center of the species. It then adds cluster centers for unknown species on top of the prior cluster centers, clusters the binning feature vector set based on the prior cluster centers and the added cluster centers, and finally bins the target sequence data set according to the optimal clustering result. The method and the device make full use of the prior knowledge provided by existing reference libraries, incorporate it into reference-free metagenomic binning, and fuse the two binning approaches in essence.
Compared with reference-based binning, the method and the device can bin sequences of unknown species; compared with existing reference-free binning methods, they achieve better binning performance because species information from the reference library is used as prior knowledge.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.
Fig. 1 is a schematic flow chart illustrating main steps of a metagenomic component classification method in combination with prior knowledge of a reference library in one or more embodiments of the present application.
Detailed Description
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise; it should further be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
Example one
The embodiment provides a metagenomic component binning method combining prior knowledge of a reference library, which comprises three stages of feature processing, prior knowledge acquisition and binning. The method comprises the following steps:
Step 1: Acquire a target sequence data set, perform feature extraction on each sequence sample in the target sequence data set to obtain an initial distinguishing feature vector of each sequence sample, and perform feature transformation to obtain a binning feature vector of each sequence sample.
The step 1 corresponds to a feature processing stage, and specifically includes:
Step 1-1: Acquire a target sequence data set, denoted $X = \{x_1, x_2, \ldots, x_n\}$, where $n$ is the number of sequence samples.
Step 1-2: Perform feature extraction on all sequence samples in the target sequence data set to obtain an initial distinguishing feature set. Specifically, for any sequence sample $x_i \in X$, compute its initial distinguishing features to obtain the initial distinguishing feature vector $F_i = (f_{i,1}, f_{i,2}, \ldots, f_{i,m})$, where $m$ is the feature dimension; this yields the initial distinguishing feature set $\{F_1, F_2, \ldots, F_n\}$ corresponding to the target sequence data set $X$. For long reads and contigs, the initial distinguishing features mainly comprise composition and coverage information.
Step 1-3: Perform feature transformation on the initial distinguishing feature set to obtain the binning feature vector set. Specifically, a machine learning or deep learning model is applied to the initial distinguishing feature set $\{F_1, F_2, \ldots, F_n\}$ to obtain the binning feature vector set $\{F'_1, F'_2, \ldots, F'_n\}$, where $F'_i = (f'_{i,1}, f'_{i,2}, \ldots, f'_{i,s})$ is the binning feature vector of sequence sample $x_i$, $i = 1, 2, \ldots, n$, and $s$ is the dimension of the binning features.
The feature transformation aims to obtain a low-dimensional representation of the initial distinguishing feature vector, namely the binning feature vector, and can be realized with a deep learning model VAE or a dimension reduction model UMAP. In a preferred implementation, a deep learning model VAE is applied first, and a dimension reduction model UMAP is then applied to the resulting latent vectors. The VAE retains as much of the salient information in the initial distinguishing feature set as possible while reducing dimensionality, and UMAP then produces the final low-dimensional representation. This preserves the differences among species in the binning feature vector set and helps improve the efficiency of the subsequent clustering.
Taking a long-read data set as an example, let the long-read data set be $X = \{x_1, x_2, \ldots, x_n\}$, where $n$ is the number of long-read samples. For any long-read sample $x_i \in X$, the trinucleotide composition and the k-mer coverage are computed as initial distinguishing features, giving the initial distinguishing feature set $\{F_1, F_2, \ldots, F_n\}$. The initial distinguishing feature set is then transformed with the deep learning model VAE and the dimension reduction model UMAP to obtain the binning feature vector set $\{F'_1, F'_2, \ldots, F'_n\}$, where $F'_i = (f'_{i,1}, f'_{i,2}, \ldots, f'_{i,s})$, $i = 1, 2, \ldots, n$, and $s$ is the dimension of the binning feature vector. In this example, $s = 2$.
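As an illustration only, the composition part of the initial distinguishing features in the example above might be computed as in the following Python sketch. The function name `trinucleotide_composition`, the fixed ordering of the 64 trinucleotides, and the choice to skip windows containing ambiguous bases are assumptions made here and are not specified in the application; the k-mer coverage features and the VAE and UMAP transformation are not shown.

```python
from itertools import product

# All 64 possible trinucleotides in a fixed order, so every vector has the same layout.
TRINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=3)]
INDEX = {kmer: i for i, kmer in enumerate(TRINUCLEOTIDES)}


def trinucleotide_composition(sequence):
    """Return the 64-dimensional relative frequency vector of 3-mers in `sequence`."""
    seq = sequence.upper()
    counts = [0] * len(TRINUCLEOTIDES)
    for i in range(len(seq) - 2):
        kmer = seq[i:i + 3]
        if kmer in INDEX:  # windows containing N or other ambiguity codes are skipped
            counts[INDEX[kmer]] += 1
    total = sum(counts)
    # A sequence with no valid window yields an all-zero vector.
    return [c / total for c in counts] if total else [0.0] * len(counts)
```

Each sequence sample then contributes one such vector to the initial distinguishing feature set.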
Step 2: comparing a target sequence data set with a reference library to obtain an estimated value of the number of species contained in the target sequence data set and confidence degrees that each sequence sample belongs to different species, generating feasible bin number intervals according to the estimated value of the number of species, wherein the number of each feasible bin is greater than or equal to the estimated value of the number of species in the interval, obtaining a prior clustering center set according to the confidence degrees that each sequence sample belongs to different species, and taking the feasible bin number intervals and the prior clustering center set as prior knowledge.
The step 2 corresponds to a stage of obtaining the prior knowledge, and specifically includes:
Step 2-1: Compare the target sequence data set $X$ with the reference library to obtain an estimate $K^{\#}$ of the number of species contained in $X$ and the confidence that each sequence sample belongs to each species.
Step 2-2: Set the feasible interval $G$ for the number of bins [formula image BDA0003576630990000071 in the original publication], where $\alpha$ is the interval width factor, an empirical value greater than 1; in this embodiment, $\alpha = 1.5$.
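The exact formula for the feasible interval is given only as an image in the publication. The sketch below therefore assumes one plausible form, the integers from the estimated species count up to that count multiplied by the width factor and rounded up; this form, the function name `feasible_bin_counts`, and the rounding rule are assumptions rather than the application's definition. It does respect the two stated constraints: every candidate is at least the estimated species count, and the width factor is greater than 1.

```python
import math


def feasible_bin_counts(k_est, alpha=1.5):
    """Candidate bin counts from k_est up to ceil(alpha * k_est) (assumed form of G)."""
    if k_est < 1 or alpha <= 1:
        raise ValueError("expected k_est >= 1 and alpha > 1")
    upper = math.ceil(alpha * k_est)  # assumed upper end of the interval
    return list(range(k_est, upper + 1))
```

With the embodiment's width factor of 1.5 and an estimate of 4 species, this assumed form would give the candidates 4, 5 and 6.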
Step 2-3: for each type of species, selecting a plurality of benchmarking samples of the species according to the confidence degree of the sequence samples belonging to the type of species, and acquiring the box-separating feature vectors of the benchmarking samples based on the box-separating feature vector set, wherein the centers of the box-separating feature vectors are the prior clustering centers corresponding to the type of species. In this embodiment, the mean value of the plurality of bin feature vectors is used as the center of the plurality of bin feature vectors.
Specifically, based on the confidence, obtained in step 2-1, that each sequence sample belongs to each species, for each species $v$, $v = 1, 2, \ldots, K^{\#}$, the $t$ sequence samples with the highest confidence are selected as the benchmark samples of that species. The binning feature vectors of these benchmark samples are then looked up in $\{F'_1, F'_2, \ldots, F'_n\}$, and their center is computed as the confidence cluster center $cx_v$ of species $v$. This yields the prior cluster center set of all species, $\{cx_1, cx_2, \ldots, cx_{K^{\#}}\}$.
Taking a long-read data set as an example, the target sequence data set $X$ is first compared with the reference library to obtain the estimate $K^{\#}$ of the number of species contained in $X$ and the confidence that each sequence sample belongs to each species. Then, for each species $v$, $v = 1, 2, \ldots, K^{\#}$, the 50 long-read samples with the highest confidence are selected as benchmark samples. The binning feature vectors of these 50 benchmark samples are looked up in $\{F'_1, F'_2, \ldots, F'_n\}$, and their center is computed as the confidence cluster center $cx_v$ of species $v$. This yields the prior cluster center set of all species, $\{cx_1, cx_2, \ldots, cx_{K^{\#}}\}$.
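The selection of benchmark samples and the computation of the prior cluster centers can be sketched as follows. The data layout (a per-sample list of confidences indexed by species) and the name `prior_cluster_centers` are assumptions for illustration; the mean is used as the center, as stated for this embodiment.

```python
def prior_cluster_centers(confidences, features, t):
    """Return one center per species: the mean binning feature vector of the
    t samples with the highest confidence for that species.

    confidences[i][v] is the confidence that sample i belongs to species v;
    features[i] is the binning feature vector of sample i.
    """
    n_species = len(confidences[0])
    dim = len(features[0])
    centers = []
    for v in range(n_species):
        # Indices of the t samples with the highest confidence for species v.
        top = sorted(range(len(confidences)),
                     key=lambda i: confidences[i][v], reverse=True)[:t]
        center = [sum(features[i][d] for i in top) / len(top) for d in range(dim)]
        centers.append(center)
    return centers
```

In the long-read example the value of `t` would be 50.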
Step 3: Determine a corresponding cluster center set for each feasible number of bins in the feasible interval, wherein the cluster center set comprises the prior cluster center set and the amplified cluster centers. For each feasible number of bins, perform cluster analysis on the binning feature vector set based on the corresponding cluster center set, select the optimal clustering result, and obtain the sequence sample set corresponding to each cluster in the optimal clustering result, which is the binning result.
Step 3 corresponds to the binning stage and specifically includes:
Step 3-1: For each feasible number of bins $K \in G$, select the cluster center set $U_K$ corresponding to that number of bins [formula image BDA0003576630990000083 in the original publication], which comprises the prior cluster centers from the reference library [formula image BDA0003576630990000086] and the amplified cluster centers [formula image BDA0003576630990000087]. The amplified cluster centers are selected at random from the binning feature vector set, subject to the condition that the distance between any two amplified cluster centers, and between any amplified cluster center and any prior cluster center, is not less than a threshold $D$. The distance between binning feature vectors may be computed with the Euclidean distance, cosine similarity, or the like, and is not limited here.
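One way to realize the random selection under the distance constraint is a greedy pass over a shuffled copy of the binning feature vectors, as sketched below. The greedy strategy, the Euclidean distance, the `seed` parameter, and the name `select_amplified_centers` are assumptions for illustration; the application only states that the selection is random and that the threshold must be respected.

```python
import math
import random


def select_amplified_centers(features, prior_centers, n_extra, min_dist, seed=0):
    """Pick n_extra feature vectors at random so that each is at least min_dist
    away from every prior center and from every center already chosen."""
    rng = random.Random(seed)
    candidates = list(features)
    rng.shuffle(candidates)  # random visiting order
    chosen = []
    for vec in candidates:
        if len(chosen) == n_extra:
            break
        if all(math.dist(vec, c) >= min_dist for c in list(prior_centers) + chosen):
            chosen.append(vec)
    if len(chosen) < n_extra:
        raise ValueError("not enough candidates satisfy the distance threshold")
    return chosen
```

For a candidate number of bins larger than the number of prior centers, the extra centers requested here would make up the difference.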
Step 3-2: Select a center-based clustering model, such as the k-means model, and, with $U_K$ as the cluster center set, cluster the binning feature vector set $\{F'_1, F'_2, \ldots, F'_n\}$ to obtain the clustering result $\{c_1, c_2, \ldots, c_K\}$ corresponding to each feasible number of bins $K$.
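A minimal k-means that starts from a supplied cluster center set, rather than from random centers, might look like the sketch below. It is plain Lloyd iteration with Euclidean distance; treating the supplied set as the initial centers, the `max_iter` limit, and the handling of empty clusters are assumptions, since the application does not fix these details.

```python
import math


def seeded_kmeans(features, init_centers, max_iter=100):
    """Lloyd's k-means started from the given centers; returns (labels, centers)."""
    centers = [list(c) for c in init_centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each vector goes to its nearest center.
        new_labels = [min(range(len(centers)), key=lambda j: math.dist(x, centers[j]))
                      for x in features]
        if new_labels == labels:
            break  # assignments are stable, so the centers will not move either
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(len(centers)):
            members = [x for x, lab in zip(features, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers
```

Running this once per candidate number of bins, each time with that candidate's cluster center set, gives one clustering result per candidate.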
Step 3-3: Compare the clustering results corresponding to all feasible numbers of bins $K \in G$, select the optimal clustering result, and denote it $\{c^*_1, c^*_2, \ldots, c^*_{K^*}\}$, where $K^*$ is the optimal number of bins.
The clustering results can be evaluated with indices such as the silhouette coefficient or the Calinski-Harabasz (CH) index; the clustering results are compared on these indices, and the optimal clustering result is obtained by comprehensive evaluation.
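As a sketch of the comparison, the mean silhouette coefficient can be computed for each candidate clustering and the candidate with the highest value kept. Using the silhouette coefficient alone, the Euclidean distance, and the names `silhouette_score` and `select_best` are assumptions here; the application also allows the CH index and a combined evaluation.

```python
import math


def silhouette_score(features, labels):
    """Mean silhouette coefficient over all samples (higher is better)."""
    clusters = {}
    for x, lab in zip(features, labels):
        clusters.setdefault(lab, []).append(x)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for x, lab in zip(features, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # by convention a singleton cluster contributes 0
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(x, y) for y in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(x, y) for y in pts) / len(pts)
                for other, pts in clusters.items() if other != lab)
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(features)


def select_best(features, results):
    """results maps each bin count K to its label list; return the best-scoring K."""
    return max(results, key=lambda k: silhouette_score(features, results[k]))
```

The pairwise distance computation is quadratic in the number of samples, so for large read sets a sampled or library implementation would be preferable.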
Step 3-4: For any sequence sample $x \in X$, let $F'_x$ denote the binning feature vector corresponding to $x$. According to the optimal clustering result $\{c^*_1, c^*_2, \ldots, c^*_{K^*}\}$, compute $B_j = \{x \mid x \in X \text{ and } F'_x \in c^*_j\}$, $j = 1, 2, \ldots, K^*$, to obtain the sequence sample set corresponding to each cluster, $\{B_1, B_2, \ldots, B_{K^*}\}$, which is the binning result.
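Mapping the optimal clustering back to sequence samples amounts to grouping sample identifiers by cluster label. The sketch below assumes that the clustering result is available as one label per sample, in the same order as the samples; the name `bins_from_labels` is hypothetical.

```python
def bins_from_labels(sample_ids, labels):
    """Group sample identifiers by cluster label: bins[j] holds the samples of cluster j."""
    bins = {}
    for sample_id, lab in zip(sample_ids, labels):
        bins.setdefault(lab, []).append(sample_id)
    return bins
```

Each value of the returned dictionary corresponds to one sequence sample set in the binning result.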
Example two
Based on the method described in the first embodiment, the present embodiment provides a metagenomic binning system combining prior knowledge of a reference library, where the reference library stores standard gene sequences of known species, and the system includes:
the characteristic extraction module is used for acquiring a target sequence data set, extracting the characteristic of each sequence sample in the target sequence data set to obtain an initial distinguishing characteristic vector of each sequence sample, and obtaining a box-dividing characteristic vector of each sequence after characteristic transformation;
the priori knowledge acquisition module is used for comparing a target sequence data set with a reference database to obtain an estimated value of the number of species contained in the target sequence data set and confidence degrees of different species of each sequence sample; generating feasible bin number intervals according to the estimated species number, wherein in the interval, the feasible bin number is more than or equal to the estimated species number; obtaining a prior clustering center set according to the confidence coefficient that each sequence sample belongs to different species, and taking the feasible bin quantity intervals and the prior clustering center set as prior knowledge;
the binning module is used for determining a corresponding clustering center set for each feasible binning quantity in the binning quantity feasible interval, wherein the clustering center set comprises a priori clustering center set and an amplified clustering center set; and for each feasible binning quantity, performing clustering analysis on the binning feature vector set based on the corresponding clustering center set, selecting an optimal clustering result, and obtaining a sequence sample set corresponding to each cluster in the optimal clustering result, namely a binning result.
Example three
The embodiment aims to provide an electronic device.
An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of the first embodiment when executing the program.
Example four
An object of the present embodiment is to provide a computer-readable storage medium.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method of the first embodiment.
The second to fourth embodiments correspond to the first embodiment of the method, and the detailed description thereof can be found in the related description of the first embodiment. The term "computer-readable storage medium" should be taken to include a single medium or multiple media containing one or more sets of instructions; it should also be understood to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by a processor and that cause the processor to perform any of the methods of the present invention.
One or more of the above embodiments exploit the prior knowledge provided by existing reference libraries and incorporate it into metagenomic binning. Compared with the prior art, the method addresses the problems that existing metagenomic binning either cannot handle sequences of unknown species or has insufficient binning accuracy, can bin gene sequences of unknown species, and achieves better binning performance than reference-free binning methods.
Those skilled in the art will appreciate that the modules or steps of the present invention described above can be implemented with a general-purpose computing device. Alternatively, they can be implemented as program code executable by a computing device, so that they are stored in a storage device and executed by the computing device; or they can each be fabricated as a separate integrated circuit module; or several of the modules or steps can be fabricated as a single integrated circuit module. The present invention is not limited to any specific combination of hardware and software.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, this description is not intended to limit the scope of the present invention. Those skilled in the art should understand that various modifications and variations that can be made without inventive effort on the basis of the technical solution of the present invention remain within its scope.

Claims (10)

1. A metagenomic binning method combining prior knowledge of a reference library, the reference library storing standard gene sequences of known species, the method comprising the steps of:
acquiring a target sequence data set, performing feature extraction on each sequence sample in the target sequence data set to obtain an initial binning feature vector of the target sequence, and performing feature transformation to obtain a binning feature vector of the target sequence;
comparing the target sequence data set with the reference library to obtain an estimate of the number of species contained in the target sequence data set and the confidence that each sequence sample belongs to each species; generating a feasible bin number interval according to the estimated species number, wherein every feasible bin number in the interval is greater than or equal to the estimated species number; obtaining a prior clustering center set according to the confidence that each sequence sample belongs to each species, and taking the feasible bin number interval and the prior clustering center set as the prior knowledge;
determining a corresponding clustering center set for each feasible bin number in the feasible bin number interval, wherein the clustering center set comprises the prior clustering center set and an amplified clustering center set; and, for each feasible bin number, performing clustering analysis on the binning feature vector set based on the corresponding clustering center set, selecting the optimal clustering result, and obtaining the sequence sample set corresponding to each cluster in the optimal clustering result, namely the binning result.
2. The metagenomic binning method combining reference library prior knowledge according to claim 1, wherein the feature transformation is used to obtain a low-dimensional representation of the initial binning feature vector, namely the binning feature vector.
3. The metagenomic binning method combining reference library prior knowledge according to claim 2, wherein the feature transformation employs any one of the following methods:
(1) the deep learning model VAE; (2) the dimension reduction model UMAP; (3) applying the deep learning model VAE first, and then applying the dimension reduction model UMAP to the resulting latent vectors.
4. The metagenomic binning method combining reference library prior knowledge according to claim 1, wherein obtaining the prior clustering center set according to the confidence that each sequence sample belongs to each species comprises:
for each species, selecting a plurality of benchmark samples of the species according to the confidence that each sequence sample belongs to that species, and acquiring the binning feature vectors of the benchmark samples from the binning feature vector set, wherein the center of these binning feature vectors is the prior clustering center corresponding to that species.
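For illustration only, the derivation of prior clustering centers can be sketched as below. The sketch assumes that benchmark samples are the highest-confidence samples of a species above a cut-off and that the "center" is the arithmetic mean; the names `prior_cluster_centers`, `top_n` and `min_conf` and their default values are hypothetical and are not specified in the claims.

```python
def prior_cluster_centers(confidences, features, top_n=3, min_conf=0.9):
    """Return {species: center} computed from high-confidence benchmark samples.

    confidences: {sample_id: {species: confidence in [0, 1]}}
    features: {sample_id: binning feature vector (list of floats)}
    top_n, min_conf: hypothetical selection rule for benchmark samples.
    """
    species_names = {s for conf in confidences.values() for s in conf}
    centers = {}
    for species in sorted(species_names):
        # Rank all samples by their confidence of belonging to this species.
        ranked = sorted(
            ((conf.get(species, 0.0), sid) for sid, conf in confidences.items()),
            reverse=True,
        )
        benchmarks = [sid for c, sid in ranked[:top_n] if c >= min_conf]
        if not benchmarks:
            continue  # no sample is confidently assigned to this species
        vecs = [features[sid] for sid in benchmarks]
        # The prior center is the component-wise mean of the benchmark vectors.
        centers[species] = [sum(col) / len(vecs) for col in zip(*vecs)]
    return centers


confidences = {
    "r1": {"A": 0.99, "B": 0.01},
    "r2": {"A": 0.95},
    "r3": {"B": 0.97},
    "r4": {"A": 0.5, "B": 0.5},  # ambiguous read, never used as a benchmark
}
features = {"r1": [0.0, 0.0], "r2": [2.0, 0.0], "r3": [10.0, 10.0], "r4": [5.0, 5.0]}
centers = prior_cluster_centers(confidences, features)
```

Restricting the mean to confidently assigned samples keeps ambiguous reads, such as `r4` in the toy data, from pulling a prior center away from its species.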
5. The metagenomic binning method combining reference library prior knowledge according to claim 1, wherein the amplified clustering centers are selected randomly from the binning feature vector set, and the distance between any two amplified clustering centers, and between any amplified clustering center and any prior clustering center, is not less than a set threshold.
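A minimal sketch of this selection rule follows, assuming Euclidean distance and a simple shuffle-and-filter strategy. The name `pick_amplified_centers`, the `seed` parameter and the error behaviour are hypothetical; the claim fixes only the random selection and the distance threshold.

```python
import math
import random


def pick_amplified_centers(features, prior_centers, count, min_dist, seed=0):
    """Randomly pick `count` feature vectors that are mutually, and relative to
    every prior center, at least `min_dist` apart (Euclidean distance assumed)."""
    rng = random.Random(seed)
    candidates = list(features)
    rng.shuffle(candidates)  # random visiting order implements the random choice
    chosen = []
    for cand in candidates:
        if len(chosen) == count:
            break
        # Accept the candidate only if it respects the threshold everywhere.
        if all(math.dist(cand, c) >= min_dist for c in list(prior_centers) + chosen):
            chosen.append(cand)
    if len(chosen) < count:
        raise ValueError("not enough candidates satisfy the distance threshold")
    return chosen


features = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [9.0, 0.0]]
amplified = pick_amplified_centers(features, [[0.0, 0.0]], count=2, min_dist=3.0)
```

The threshold keeps an amplified center from duplicating a species already represented by a prior center, so the additional clusters are free to capture sequences the reference library does not explain.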
6. The metagenomic binning method combining reference library prior knowledge according to claim 1, wherein the clustering results corresponding to all feasible bin numbers are compared according to the silhouette coefficient or the Calinski-Harabasz (CH) index, and the optimal clustering result is selected.
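The comparison across feasible bin numbers can be illustrated with the silhouette coefficient, as in the sketch below. It assumes Euclidean distance and the usual convention that a single-member cluster contributes a silhouette of zero; the names `mean_silhouette` and `select_best` and the toy labelings are hypothetical.

```python
import math


def mean_silhouette(vectors, labels):
    """Mean silhouette coefficient of a clustering (Euclidean distance)."""
    clusters = {}
    for v, lab in zip(vectors, labels):
        clusters.setdefault(lab, []).append(v)
    if len(clusters) < 2:
        raise ValueError("the silhouette coefficient needs at least two clusters")
    total = 0.0
    for v, lab in zip(vectors, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton cluster: silhouette is taken as 0
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(v, u) for u in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(v, u) for u in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(vectors)


def select_best(vectors, labelings):
    """labelings: {bin number: labels}. Returns the bin number with the best score."""
    return max(labelings, key=lambda k: mean_silhouette(vectors, labelings[k]))


vectors = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
labelings = {2: [0, 0, 1, 1], 3: [0, 0, 1, 2]}
best_k = select_best(vectors, labelings)
```

In the toy data, splitting the second group into two singleton bins lowers the mean silhouette, so the two-bin result is preferred.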
7. The metagenomic binning method combining reference library prior knowledge according to claim 1, wherein obtaining the sequence sample set corresponding to each cluster in the optimal clustering result comprises:
letting X be the target sequence data set and, for any sequence sample x ∈ X, denoting by [formula image FDA0003576630980000021] the binning feature vector corresponding to x; if the optimal clustering result is [formula image FDA0003576630980000023], then computing [formula images FDA0003576630980000025 and FDA0003576630980000022] to obtain the sequence sample set corresponding to each cluster, [formula image FDA0003576630980000024], namely the binning result.
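The formulas of this claim are available only as images, so the following sketch rests on an assumption: each sequence sample is placed in the cluster whose center is nearest, in Euclidean distance, to the sample's binning feature vector. The function name `bins_from_centers` and the toy data are hypothetical.

```python
import math


def bins_from_centers(features, centers):
    """Group sample ids by nearest cluster center (assumed membership rule).

    features: {sample_id: binning feature vector}
    centers: list of cluster centers from the optimal clustering result.
    Returns {cluster index: [sample ids]}.
    """
    bins = {j: [] for j in range(len(centers))}
    for sid, vec in features.items():
        nearest = min(range(len(centers)), key=lambda j: math.dist(vec, centers[j]))
        bins[nearest].append(sid)
    return bins


features = {"r1": [0.0, 0.0], "r2": [0.5, 0.0], "r3": [9.0, 9.0]}
bins = bins_from_centers(features, [[0.0, 0.0], [10.0, 10.0]])
```

Each value of the returned mapping is one sequence sample set, that is, one bin.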
8. A metagenomic binning system combining prior knowledge of a reference library, the reference library storing standard gene sequences of known species, the system comprising:
a feature extraction module, used for acquiring a target sequence data set, performing feature extraction on each sequence sample in the target sequence data set to obtain an initial binning feature vector of the target sequence, and performing feature transformation to obtain a binning feature vector of the target sequence;
a prior knowledge acquisition module, used for comparing the target sequence data set with the reference library to obtain an estimate of the number of species contained in the target sequence data set and the confidence that each sequence sample belongs to each species; generating a feasible bin number interval according to the estimated species number, wherein every feasible bin number in the interval is greater than or equal to the estimated species number; obtaining a prior clustering center set according to the confidence that each sequence sample belongs to each species, and taking the feasible bin number interval and the prior clustering center set as the prior knowledge;
a binning module, used for determining a corresponding clustering center set for each feasible bin number in the feasible bin number interval, wherein the clustering center set comprises the prior clustering center set and an amplified clustering center set; and, for each feasible bin number, performing clustering analysis on the binning feature vector set based on the corresponding clustering center set, selecting the optimal clustering result, and obtaining the sequence sample set corresponding to each cluster in the optimal clustering result, namely the binning result.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the metagenomic binning method combining reference library prior knowledge according to any one of claims 1 to 7.
10. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the metagenomic binning method combining reference library prior knowledge according to any one of claims 1 to 7.
CN202210335489.2A 2022-03-31 2022-03-31 Metagenome component classification method and system combining reference library prior knowledge Pending CN114664383A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210335489.2A CN114664383A (en) 2022-03-31 2022-03-31 Metagenome component classification method and system combining reference library prior knowledge

Publications (1)

Publication Number Publication Date
CN114664383A true CN114664383A (en) 2022-06-24

Family

ID=82033091

Country Status (1)

Country Link
CN (1) CN114664383A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination