CN115985400B - Method for reassigning metagenome multiple comparison sequences and application - Google Patents

Method for reassigning metagenome multiple comparison sequences and application Download PDF

Info

Publication number
CN115985400B
CN115985400B CN202211545623.8A CN202211545623A CN115985400B CN 115985400 B CN115985400 B CN 115985400B CN 202211545623 A CN202211545623 A CN 202211545623A CN 115985400 B CN115985400 B CN 115985400B
Authority
CN
China
Prior art keywords
sequence
comparison
abundance
species
reference sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211545623.8A
Other languages
Chinese (zh)
Other versions
CN115985400A (en
Inventor
宋纯
张岩
郭祖乐
刘俊锋
郭昊
李诗濛
任用
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Xiansheng Medical Diagnosis Co ltd
Nanjing Xiansheng Diagnostic Technology Co ltd
Jiangsu Xiansheng Medical Devices Co ltd
Original Assignee
Jiangsu Xiansheng Medical Diagnosis Co ltd
Nanjing Xiansheng Diagnostic Technology Co ltd
Jiangsu Xiansheng Medical Devices Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Xiansheng Medical Diagnosis Co ltd, Nanjing Xiansheng Diagnostic Technology Co ltd, Jiangsu Xiansheng Medical Devices Co ltd filed Critical Jiangsu Xiansheng Medical Diagnosis Co ltd
Priority to CN202211545623.8A priority Critical patent/CN115985400B/en
Publication of CN115985400A publication Critical patent/CN115985400A/en
Application granted granted Critical
Publication of CN115985400B publication Critical patent/CN115985400B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Abstract

The application belongs to the technical field of raw information analysis, and particularly relates to a macro-genome multiple alignment sequence reassignment method and application. According to the method, the metagenome sequencing data are rapidly classified, a small metagenome reference sequence database is extracted based on a classification tree report, the minimum classification unit and the independent subset of the comparison sequence are rapidly compared and divided, the unique comparison rate is calculated in a simulation mode, a multiple comparison sequence redistribution probability model is constructed, and further rapid and accurate species sequence abundance assessment is achieved.

Description

Method for reassigning metagenome multiple comparison sequences and application
Technical Field
The application belongs to the technical field of bioinformatics, and particularly relates to a method for reassigning a metagenome multiple alignment sequence and application thereof.
Technical Field
Metagenomic sequencing (metagenomics next generation sequencing, mNGS) is a novel diagnostic technique for rapid and accurate detection of pathogens by unbiased sampling independent of culture. mNGS has obvious advantages in detecting unknown causes, difficult culture and co-infection pathogens, and is an important tool in the pathogen detection field in the next ten years.
Species identification and abundance estimation are key links in mNGS bioinformatics analysis, and accuracy of the mNGS bioinformatics analysis directly influences sensitivity and specificity of pathogen detection. Because the mNSS pathogen comparison database is a macro-reference genome, the comparison result contains a large number of multiple comparison sequences, and therefore, effective redistribution of the multiple comparison sequences is not only a serious difficulty of mNSS pathogen analysis, but also an optimization point for pathogen detection efficiency improvement.
Currently, common species identification and abundance estimation methods include: blast, kraken2+braicken, metaPhIAn2, etc. The Blast method is based on local comparison, has high accuracy, low operation efficiency and no multiple comparison sequence redistribution function; krake 2 is based on Kmer and LCA mapping, operating efficiently, but at a species level with low resolution, a large number of multiple aligned sequences are identified above the species level; the Bracken sequence abundance re-estimation module has certain drawbacks, such as: miss detection occurs for species with low abundance or no unique alignment sequence, and bottom-up (plant-to-species level) sequence reassignment does not depend on a probability assignment model; metaPhIAn2 is only dependent on Marker gene sequences, and has high specificity but low sensitivity.
Therefore, the macro genome reference sequence database and mNGS unique comparison sequence information are fully mined, and the multiple comparison sequences are redistributed, so that species identification and abundance estimation accuracy can be effectively improved.
In view of this, the present application is presented.
Disclosure of Invention
In order to solve the technical problems, a set of macro genome multiple comparison sequence redistribution method is established through bioinformatics analysis and research, and the accuracy of species identification and abundance estimation can be remarkably improved, and the pathogen detection efficiency is effectively improved.
Specifically, the application proposes the following technical scheme:
the application firstly provides a method for reassigning a metagenome multiple alignment sequence, which comprises the following steps:
1) Quick classification: based on a reference sequence database, rapidly classifying metagenome sequencing data to obtain a classification tree report;
2) Extracting a small library: extracting a target reference sequence from a complete metagenome reference sequence database based on the classification tree report to obtain a small metagenome reference sequence database;
3) And (3) fast comparison: based on a small metagenome reference sequence database, carrying out rapid comparison on metagenome sequencing data to obtain sequence comparison information;
4) Dividing the subsets: based on the sequence comparison information, dividing the minimum classification unit and an independent subset with closed comparison sequences;
5) And (3) constructing a model: based on the small macro genome reference sequence database and the independent subset, simulating and calculating the unique comparison rate of each minimum classification unit in the independent subset, and constructing a multiple comparison sequence reassignment probability model;
6) Sequence reassignment: based on the multiple comparison sequence reassignment probability model and the sequence comparison information, calculating posterior assignment probability of each multiple comparison sequence, and reassigning the minimum classification unit identifier for the sequence by taking the posterior assignment probability as a random function parameter;
preferably, the method further comprises the steps of:
step 7) abundance estimation: species abundance is assessed based on the unique alignment and multiple alignment reassignment results.
Further, the 1) specifically includes: based on a complete metagenome reference sequence database, rapidly classifying metagenome sequencing data by using Krake 2 to obtain a classification tree report;
preferably, the reference sequence database is a complete metagenome reference sequence database; the complete metagenome reference sequence database is a redundancy-removing and low-quality nt database, a professional virus library and a professional parasite library.
More preferably, the metagenomic sequencing data is sequencing data from which a human sequence is removed and from which a low-quality sequence is removed;
further, the Kraken2 parameter is preferably set to confidence=0.5;
more preferably, the Kraken2 classification tree report information includes: each classification hierarchy level classification tree information (classification name, classification number), the total number of sequences classified to the hierarchy level and below, the number of sequences classified to the hierarchy level, and the number of sequences classified to the hierarchy level and below.
Further, in the 2), the extraction is performed according to a metagenome reference sequence inclusion standard; preferably, the reference sequence inclusion criteria include:
a. the reference sequences listed in the classification tree report for all species listed in the complete metagenome reference sequence database;
b. all species subtypes, serotypes and strains encompassed by all the species listed above are listed in the complete metagenome reference sequence database;
c. all species, species subtypes, serotypes and strains contained in the family listed in the classification tree report with unassigned sequence numbers >1000 are listed in the complete metagenomic reference sequence database;
d. all species, species subtypes, serotypes and strains contained in the genus listed in the classification tree report with unassigned sequence numbers >100 are listed in the complete metagenomic reference sequence database.
Further, the comparison in 3) is to use minimap2 to rapidly compare the metagenomic sequencing data;
preferably, the sequence alignment information includes:
a. results of sequence alignment, comprising: the comparison sequence name, the reference sequence name, the minimum classification unit name, the comparison position, the comparison quality and the detailed comparison information;
b. a minimum classification unit comprising: species, species subtype, serotype and strain.
Further, the independent subset of 4) includes any one or more of the following features:
a. the minimum classification unit between the independent subsets has no intersection;
b. the independent subset and the sequence between the independent subsets have no intersection;
c. the union of all independent subsets is the sequence alignment information.
In the step 4), the rule of dividing the subset is that the minimum classification unit and the comparison sequence are closed, and the division of the subset depends on the sequence comparison information and is irrelevant to the classification tree structure.
Further, the analog computation in the 5) includes the steps of:
a. extracting a reference sequence of each minimum classification unit in the independent subset, and generating an independent subset reference sequence database;
b. simulating to generate a sequencing sequence based on the independent subset reference sequence database;
c. comparing the simulation sequences to the reference sequences of the minimum classification units in the independent subsets through the minimap2 to obtain sequence comparison results of the simulation sequences in the independent subsets;
d. based on the comparison of the simulated sequences, a unique comparison rate for each minimum taxon within the independent subset is calculated.
Further, the construction in 5) includes any one or more of the following features:
a. modeling within each independent subset;
b. estimating the sequence abundance of each minimum taxa in the independent subset based on the unique comparison sequence and the unique comparison ratio;
c. estimating the abundance of the multiple aligned sequences for each minimum taxa based on the unique alignment and the abundance of each minimum taxa;
d. the abundance of the multiple comparison sequences of each minimum classification unit is used as a priori reassignment probability;
e. generating an observation probability of each minimum classification unit based on the comparison result of each multiple comparison sequence;
f. and constructing a multiple comparison sequence reassignment probability model based on the priori reassignment probability and the observed value probability.
Further, the estimating in 7) includes the steps of:
a. firstly, estimating the sequence abundance of each minimum classification unit, namely the sum of the unique comparison sequence number and the multiple comparison and redistribution sequence number;
b. constructing hierarchical relations of the minimum classification units based on the classification tree, wherein the root node of each hierarchical relation is a species; and counting the abundance of the sequences at the species level, and outputting the result as the abundance of the sequences of the species.
The application also provides a system for reassigning the metagenome multiple alignment sequences, which comprises the following components:
component 1) quick sort component: the method is used for rapidly classifying metagenome sequencing data by using Krake 2 based on a complete metagenome reference sequence database to obtain a classification tree report;
component 2) extract repository component: the method comprises the steps of extracting a target reference sequence from a complete metagenome reference sequence database based on a classification tree report to obtain a small metagenome reference sequence database;
assembly 3) fast alignment assembly: the method is used for rapidly comparing the metagenome sequencing data based on the small metagenome reference sequence database to obtain sequence comparison information;
component 4) divide subset component: based on the sequence comparison information, dividing the minimum classification unit and the independent subset with closed comparison sequences;
component 5) build model component: the method is used for simulating and calculating the unique comparison rate of each minimum classification unit in the independent subset based on the small metagenome reference sequence database and the independent subset, and constructing a multiple comparison sequence reassignment probability model;
component 6) sequence reassignment component: the method comprises the steps of calculating posterior distribution probability of each multiple comparison sequence based on a multiple comparison sequence reassignment probability model and sequence comparison information, and reassigning a minimum classification unit identifier for the sequence by taking the posterior distribution probability as a random function parameter;
preferably, the method further comprises the steps of:
component 7) abundance estimation component: for estimating species abundance based on the unique alignment and multiple alignment reassignment results.
Further, the detailed definitions in the specific components refer to the above-described method.
The application also provides a computer storage medium storing a computer program comprising program instructions which, when executed by a processor, implement any of the methods described above.
The application also provides an electronic device comprising a memory and a processor, the memory being connected to the processor, the memory being for storing a computer program, the processor being for invoking the computer program to implement any of the methods described above.
The beneficial technical effect of this application:
1) The method fully integrates the advantages of Krake 2 rapid pre-classification and minimap2 rapid comparison, and obtains detailed comparison information of each sequence in the range of a small metagenome reference sequence database;
2) According to the method, the independent subsets are divided based on each piece of sequence detailed comparison information, multiple comparison sequence reassignment is carried out in the independent subset range, and the influence of classification tree structure errors on sequence reassignment is effectively reduced;
3) According to the method, data simulation and probability modeling are carried out on the basis of a small metagenome reference sequence database, the unique comparison rate is calculated in a fitting mode, and the influence of pathogen comparison on sequence redistribution caused by unbalance of the database is effectively reduced;
4) According to the method, the classification tree structure is fully excavated, pathogen comparison databases and sequence comparison information are redistributed to each multiple comparison sequence based on the probability model, false negative and false positive are effectively reduced, and pathogen detection accuracy is further improved.
Drawings
FIG. 1 is a schematic diagram of a macro genome multiple alignment sequence reassignment method and application;
FIG. 2 comparison of results of different abundance estimation methods for E.coli and Shigella flexneri simulated data sets in example 1;
FIG. 3 results of different abundance estimation methods for A.baumannii and A.weii simulated data sets in example 2 are compared.
Detailed Description
Embodiments of the present application will be described in detail below with reference to examples, but it will be understood by those skilled in the art that the following examples are only for illustration of the present application and should not be construed as limiting the scope of the present application. The specific conditions are not noted in the examples and are carried out according to conventional conditions or conditions recommended by the manufacturer. The reagents or apparatus used were conventional products commercially available without the manufacturer's attention.
Some definitions of terms unless defined otherwise below, all technical and scientific terms used in the detailed description of the present application are intended to have the same meaning as commonly understood by one of ordinary skill in the art. While the following terms are believed to be well understood by those skilled in the art, the following definitions are set forth to better explain the present application.
The term "about" in this application means a range of accuracy that one skilled in the art can understand that still guarantees the technical effect of the features in question. The term generally means a deviation of + -10%, preferably + -5%, from the indicated value.
As used in this application, the terms "comprising," "including," "having," "containing," or "involving" are inclusive or open-ended and do not exclude additional unrecited elements or method steps. The term "consisting of …" is considered to be a preferred embodiment of the term "comprising". If a certain group is defined below to contain at least a certain number of embodiments, this should also be understood to disclose a group that preferably consists of only these embodiments.
Furthermore, the terms first, second, third, (a), (b), (c), and the like in the description and in the claims, are used for distinguishing between similar elements and not necessarily for describing a sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and that the embodiments described herein are capable of operation in other sequences than described or illustrated herein.
The present application is described below in conjunction with specific embodiments.
Experimental example: method system establishment of the application
Through the trust analysis optimization, a set of multi-comparison sequence reassignment method suitable for the metagenome is initially established, and the method specifically comprises the following steps:
step 1) rapid classification: based on a complete metagenome reference sequence database, rapidly classifying metagenome sequencing data by using Krake 2 to obtain a classification tree report;
wherein the reference sequence database is a complete metagenome reference sequence database; the complete metagenome reference sequence database is preferably a redundancy-removing, low-quality nt database + professional virus library + professional parasite library. The metagenome sequencing data is preferably sequencing data from which human sources are removed and low quality is removed; the Kraken2 parameter is preferably set to confidence=0.5; the Krake 2 classification tree report information includes: each classification hierarchy level classification tree information (classification name, classification number), the total number of sequences classified to the hierarchy level and below, the number of sequences classified to the hierarchy level, and the number of sequences classified to the hierarchy level and below.
Step 2) extracting a small library: based on a Krake 2 classification tree report, extracting a target reference sequence from a complete metagenome reference sequence database according to a metagenome reference sequence inclusion standard to obtain a small metagenome reference sequence database;
wherein the reference sequence inclusion criteria include:
a. the reference sequences listed in the classification tree report for all species listed in the complete metagenome reference sequence database;
b. all species subtypes, serotypes and strains encompassed by all the species listed above are listed in the complete metagenome reference sequence database;
c. all species, species subtypes, serotypes and strains contained in the family listed in the classification tree report with unassigned sequence numbers >1000 are listed in the complete metagenomic reference sequence database;
d. all species, species subtypes, serotypes and strains contained in the genus listed in the classification tree report with unassigned sequence numbers >100 are listed in the complete metagenomic reference sequence database.
Wherein the minigenome reference sequence database is preferably a subset of the complete metagenome reference sequence database.
Step 2) may further comprise performing a secondary extraction of the pool based on the sequencing data pool type and the reference sequence source information.
Step 3) fast comparison: based on a minigenome reference sequence database, performing rapid comparison on metagenome sequencing data by adopting a minimap2 to obtain sequence comparison information;
the results of the sequence alignment included: comparing sequence names, reference sequence names, minimum classification unit names, comparison positions, comparison quality, and detailed comparison information (M: matching, S: soft comparison, I: insertion, D: deletion); the minimum classifying unit includes: species, species subtype, serotype and strain. In addition, step 3) may further perform contrast quality filtering on the sequence alignment result based on the sequence alignment information.
Step 4) partitioning the subsets: based on the sequence comparison information, dividing the minimum classification unit and an independent subset with closed comparison sequences; the input of the divided subsets is the sequence comparison information of the rapid comparison, and the output is an independent subset with the minimum classification unit and the comparison sequence closed.
Wherein, the independent subset characteristics that minimum classification unit and alignment sequence are all closed include:
1) The minimum classification unit between the independent subsets has no intersection;
2) The independent subset and the sequence between the independent subsets have no intersection;
3) The union of all independent subsets is the sequence alignment information.
Preferably, the rule for dividing the subset is that the minimum classification unit and the comparison sequence are closed, and the division of the subset depends on sequence comparison information and is irrelevant to the classification tree structure.
Step 5) building a model: based on the small macro genome reference sequence database and the independent subset, simulating and calculating the unique comparison rate of each minimum classification unit in the independent subset, and constructing a multiple comparison sequence reassignment probability model;
the simulation calculation of the unique comparison rate of each minimum classification unit in the independent subset comprises the following steps:
1) Extracting a reference sequence of each minimum classification unit in the independent subset, and generating an independent subset reference sequence database;
2) Simulating to generate a sequencing sequence based on the independent subset reference sequence database;
3) Comparing the simulation sequences to the reference sequences of all minimum classification units in the independent subsets through the minimap2 to obtain sequence comparison results of the simulation sequences in the independent subsets;
4) Based on the comparison of the simulated sequences, a unique comparison rate for each minimum taxon within the independent subset is calculated.
The multi-alignment sequence reassignment probability model construction comprises the following steps:
1) Modeling within each independent subset; estimating the sequence abundance of each minimum taxa in the independent subset based on the unique comparison sequence and the unique comparison ratio;
2) Estimating the abundance of the multiple aligned sequences for each minimum taxa based on the unique alignment and the abundance of each minimum taxa;
3) The abundance of the multiple comparison sequences of each minimum classification unit is used as a priori reassignment probability; generating an observation probability of each minimum classification unit based on the comparison result of each multiple comparison sequence;
4) And constructing a multiple comparison sequence reassignment probability model based on the priori reassignment probability and the observed value probability.
In addition, in step 5), simulation of generating sequencing sequences based on the independent subset reference sequence database can set simulated data throughput and read length with reference to species and sequence information within the subset.
Step 6) sequence reassignment: based on the multiple comparison sequence reassignment probability model and the sequence comparison information, calculating posterior assignment probability of each multiple comparison sequence, and reassigning the minimum classification unit identifier for the sequence by taking the posterior assignment probability as a random function parameter;
the multiple alignment sequence reassignment includes:
1) Reassigning each multiple alignment sequence;
2) Based on the multiple comparison sequence reassignment probability model, generating posterior reassignment probability of each multiple comparison sequence;
3) And the posterior reassignment probability is used as a parameter of a random assignment function to reassign the minimum classification unit identification to each multiple comparison sequence.
Step 7) abundance estimation: species abundance is estimated based on the unique alignment and multiple alignment reassignment results.
The step of estimating the abundance of the species based on the unique alignment and the multiple alignment reassignment results comprises:
1) Estimating the sequence abundance of each minimum sorting unit, namely the sum of the unique comparison sequence number and the multiple comparison and redistribution sequence number;
2) Constructing hierarchical relations of the minimum classification units based on the classification tree, wherein the root node of each hierarchical relation is a species;
3) And counting the abundance of the sequences at the species level, and outputting the result as the abundance of the sequences of the species.
Example 1: abundance estimation of E.coli and Shigella flexneri
1) Coli and log He Ganjun sequencing data simulation
The E.coli reference genome (GCF_000005845.2_ASM584v2_genomic. Fna) and the Shigella flexneri reference genome (GCF_000006925.2_AS692v2_genomic. Fna) were downloaded from NCBI functional networks. An art_illumina simulator is adopted to generate a simulation data set according to different sequence gradients, the sequence reading length is 75bp, the sequencing mode is single-ended, and the sequencing platform is HS2500.
2) Species detection and abundance estimation were performed on the above simulated data sets using the multiple alignment reassignment methods of Kraken2, kraken2+bracken, respectively.
3) And (5) calculating abundance estimation results.
As shown in the statistical results and fig. 2, the estimated abundance of Kraken2 under different sequence gradients was significantly underestimated (estimated abundance 0.04 times, 0.05 times, 0.039 times and 0.036 times, respectively) compared to the theoretical value of the e.coli simulated sequence.
Compared with the theoretical value of the escherichia coli simulation sequence, the Kraken < 2+ > Bracken has omission when the abundance is low (100 simulation sequences); at medium and low abundance (500 simulated sequences), significant overestimation (abundance estimate 1.874 times theoretical); at medium high abundance (2500 analog sequences) and high abundance (12500 analog sequences), some underestimation occurs (abundance estimates 0.797 and 0.626 times theoretical, respectively).
Compared with the theoretical value of the escherichia coli simulation sequence, the novel method has the advantages that when the method is in low abundance, medium-high abundance and high abundance, the estimated abundance value is close to the theoretical value (the estimated abundance value is 1.03 times, 0.975 times and 967 times of the theoretical value respectively).
Compared with the theoretical value of the Shigella flexneri analog sequence, the Krake 2 has serious underestimation on the abundance estimation under different sequence gradients (the abundance estimation values are respectively 0.03 times, 0.016 times, 0.012 times and 0.016 times of the theoretical value).
Compared with the theoretical value of Shigella flexneri analog sequences, kraken2+Bracken has omission at low abundance (100 analog sequences) and medium-low abundance (500 analog sequences); slight overestimation occurs at medium and high abundance (abundance estimates 1.14 and 1.29 times theoretical, respectively).
Compared with the theoretical value of the Shigella flexneri analog sequence, the method has underestimation at low abundance and medium low abundance (the estimated abundance value is 0.27 times and 0.268 times of the theoretical value respectively); at medium and high abundances, the abundance estimate is highly close to the theoretical value (abundance estimates 0.972 and 0.967 times theoretical, respectively).
Compared with Kraken2 and Kraken2+Bracken, the method has the advantages that under the abundance gradients of all sequences, the abundance estimated values and theoretical values of escherichia coli and shigella flexneri are more similar.
In example 1, the E.coli and Shigella genomes were as similar as 98% but were incorrectly assigned to different genera of the same family on the classification tree.
The Kraken2 pre-classification is based on Kmer and LCA mapping, and due to the above-described classification tree classification errors for e.coli and album bacteria, a large number of e.coli and shigella sequences can be identified as LCA at the family level, resulting in a severe underestimation of the abundance of the sequences at the species level.
In addition, when the abundance of E.coli and Shigella input sequences is low, kraken2 and or Bracken can suffer from false negatives or underestimation of abundance.
According to the embodiment 1, the multiple alignment sequence reassignment method provided by the application effectively avoids or reduces the omission caused by the classification tree error through extracting the small library alignment and multiple alignment sequence reassignment, and the underabundance estimation and the false negative.
Example 2: abundance estimation of Acinetobacter baumannii and Acinetobacter weii
The acinetobacter baumannii reference genome (gcf_008632635.1_asm 863263v1_genomic.fna) and the acinetobacter baumannii reference genome (gcf_000368585.1_achn_vene_cip_110063_v1_genomic.fna) were downloaded from NCBI functional nets. An art_illumina simulator is adopted to generate a simulation data set according to different sequence gradients, the sequence reading length is 75bp, the sequencing mode is single-ended, and the sequencing platform is HS2500.
2) Species detection and abundance estimation were performed on the above simulated data sets using the multiple alignment reassignment methods of Kraken2, kraken2+bracken, respectively.
3) And (5) calculating abundance estimation results.
As shown in the statistical results and fig. 3, compared with the theoretical value of the analog sequence of acinetobacter baumannii, the estimated abundance of Kraken2 is significantly underestimated under different sequence abundance gradients (the estimated abundance values are 0.09 times, 0.136 times, 0.124 times and 0.122 times of the theoretical value respectively).
Compared with the theoretical value of the analog sequence of Acinetobacter baumannii, kraken & lt2+ & gt Bragg appears in the low abundance (100 analog sequences), and the estimated abundance value is close to the theoretical value (the estimated abundance value is 1.032 times, 1.039 times and 1.019 times of the theoretical value respectively) in the middle-low abundance, the middle-high abundance and the high abundance.
Compared with the analog sequence theoretical value of Acinetobacter baumannii, the novel method has the advantages that under different sequence gradients, the abundance estimation is close to the theoretical value (the abundance estimation values are respectively 0.95 times, 1 time, 0.819 times and 0.975 times of the theoretical value).
Compared with the theoretical value of the analog sequence of the acinetobacter jejuni, the Krake 2 has obviously low abundance estimation under different sequence abundance gradients (the abundance estimation values are respectively 0.44 times, 0.42 times, 0.439 times and 0.423 times of the theoretical value).
Compared with the theoretical value of the analog sequence of the acinetobacter jejuni, the Kraken & lt2+ & gt Bracken has obvious overestimation at low abundance (the estimated abundance value is 1.44 times of the theoretical value), and has obvious underestimation at medium-low abundance, medium-high abundance and high abundance (the estimated abundance values are 0.494 times, 0.52 times and 0.503 times of the theoretical value respectively).
Compared with the theoretical value of the analog sequence of the acinetobacter jejuni, under the condition of different sequence abundance gradients, the novel method has underestimation on the abundance estimation value (the abundance estimation values are respectively 0.630 times, 0.732 times, 0.717 times and 0.695 times of the theory).
Compared with Kraken2 and Kraken2+Bracken, the novel method has obvious advantages in estimating the abundance of the low-abundance Acinetobacter baumannii sequences.
The novel method of the present application is closer to theory in the estimation of the abundance of a. Weise sequence than that of kraken2+braicken.
In example 2, acinetobacter baumannii and acinetobacter weibull were homologous and heterogeneous, and classification information was correct on classification trees, but in the metagenome reference sequence database, the number of reference sequences recorded in the species level and strain level of acinetobacter baumannii was significantly higher than Yu Weishi acinetobacter baumannii.
Since acinetobacter baumannii and acinetobacter weibull are highly unbalanced in the number of reference sequence recordings, the Kraken2 method maps a large number of sequences from acinetobacter baumannii to the genus level or strain level, resulting in a significantly lower number of sequences mapped to the species level than acinetobacter weibull.
minimal ap2 will also identify a number of sequences from Acinetobacter baumannii as multiple alignment sequences.
The Bracken abundance estimate consists of two parts: the subordinate to species estimation uses a bayesian probabilistic model (multiplication) and the plant to species estimation uses a simple upward recursion (addition).
Because the species and strain levels have corresponding reference sequences, unbalanced numbers and structures of the species and strain reference sequences can cause unbalanced proportions of the Bragg probability model multiplication and the upward recursion addition, and a certain estimation bias is caused.
According to the novel method, species and plants are regarded as minimum classification units at the same level, each minimum classification unit is provided with independent reference sequences, the redistribution of multiple comparison sequences is carried out in independent closed subsets, and the redistribution bias caused by the upper and lower hierarchical structures of classification trees, the reference sequence data and the proportion imbalance is effectively avoided or reduced.
On the other hand, the reference sequences recorded by acinetobacter weigii are relatively less or not complete than acinetobacter baumannii, and the unique comparison rate calculation is based on a complete metagenome reference sequence library in the reassignment process of the Bracken, so that the estimation of the unique comparison rate and the bias of the abundance estimation can be caused.
The calculation of the unique comparison rate in the novel method is based on the independent subset and the small metagenome reference sequence library, so that the influence of other unrelated species reference sequences is reduced, and the simulation calculation of the unique comparison rate and the sequence abundance estimation are closer to a theoretical value or a true value.
According to the embodiment 2, the multiple alignment sequence reassignment method provided by the application effectively avoids or reduces the omission or underabundance estimation possibly caused by the unbalance of the reference sequences recorded in the metagenome reference sequence database by extracting the small library alignment and the multiple alignment sequence reassignment.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions from the scope of the technical solutions of the embodiments of the present application.

Claims (12)

1. A method for reassigning a metagenomic multiple alignment sequence, comprising the steps of:
1) Quick classification: based on a reference sequence database, rapidly classifying metagenome sequencing data to obtain a classification tree report;
2) Extracting a small library: extracting a target reference sequence from a complete metagenome reference sequence database based on the classification tree report to obtain a small metagenome reference sequence database;
3) And (3) fast comparison: based on a small metagenome reference sequence database, carrying out rapid comparison on metagenome sequencing data to obtain sequence comparison information;
4) Dividing the subsets: based on the sequence comparison information, dividing the minimum classification unit and an independent subset with closed comparison sequences;
5) And (3) constructing a model: based on the small macro genome reference sequence database and the independent subset, simulating and calculating the unique comparison rate of each minimum classification unit in the independent subset, and constructing a multiple comparison sequence reassignment probability model;
6) Sequence reassignment: based on the multiple comparison sequence reassignment probability model and the sequence comparison information, calculating posterior assignment probability of each multiple comparison sequence, and reassigning minimum classification unit identification for each sequence by taking the posterior assignment probability as a random function parameter;
7) Abundance estimation: species abundance is assessed based on the unique alignment and multiple alignment reassignment results.
2. The method of claim 1, wherein in 1) the rapid classification uses Kraken2 to rapidly classify metagenomic sequencing data.
3. The method of claim 1, wherein in 1) the reference sequence database is a complete metagenome reference sequence database; the complete metagenome reference sequence database is a redundancy-removing and low-quality nt database, a professional virus library and a professional parasite library.
4. A method according to any one of claims 1 to 3, wherein in said 2) said extracting is performed according to a metagenomic reference sequence inclusion criterion;
the reference sequence inclusion criteria include:
a. the reference sequences listed in the classification tree report for all species listed in the complete metagenome reference sequence database;
b. all species subtypes, serotypes and strains contained in all species listed in the classification tree report are listed in the complete metagenome reference sequence database;
c. all species, species subtypes, serotypes and strains contained in the family listed in the classification tree report with unassigned sequence numbers >1000 are referenced in the complete metagenomic reference sequence database;
d. all species, species subtypes, serotypes and strains contained in the genus listed in the classification tree report with unassigned sequence numbers >100 are referenced in the complete metagenomic reference sequence database.
5. The method of any one of claims 1-3, wherein in 3) the alignment is a rapid alignment of metagenomic sequencing data using miniap 2 to obtain sequence alignment information.
6. The method of claim 5, wherein in 3), the sequence alignment information comprises:
a. the sequence comparison result comprises comparison sequence names, reference sequence names, minimum classification unit names, comparison positions, comparison quality and detailed comparison information;
b. minimum taxa include species, species subtypes, serotypes, and strains.
7. A method according to any one of claims 1 to 3, wherein in said 4), said independent subset comprises any one or more of the following features:
a. the minimum classification unit between the independent subsets has no intersection;
b. the independent subset and the sequence between the independent subsets have no intersection;
c. the union of all independent subsets is the sequence alignment information.
8. A method according to any one of claims 1-3, characterized in that in said 5), said simulation calculation comprises the steps of:
a. extracting a reference sequence of each minimum classification unit in the independent subset, and generating an independent subset reference sequence database;
b. simulating to generate a sequencing sequence based on the independent subset reference sequence database;
c. comparing the simulation sequences to the reference sequences of the minimum classification units in the independent subsets through the minimap2 to obtain sequence comparison results of the simulation sequences in the independent subsets;
d. based on the comparison of the simulated sequences, a unique comparison rate for each minimum taxon within the independent subset is calculated.
9. A method according to any one of claims 1-3, wherein in said 5), said constructing comprises the steps of:
a. modeling within each independent subset;
b. estimating the sequence abundance of each minimum taxa in the independent subset based on the unique comparison sequence and the unique comparison ratio;
c. estimating the abundance of the multiple aligned sequences for each minimum taxa based on the unique alignment and the abundance of each minimum taxa;
d. the abundance of the multiple comparison sequences of each minimum classification unit is used as a priori reassignment probability;
e. generating an observation probability of each minimum classification unit based on the comparison result of each multiple comparison sequence;
f. and constructing a multiple comparison sequence reassignment probability model based on the priori reassignment probability and the observed value probability.
10. A method according to any one of claims 1-3, characterized in that the evaluation in 7) comprises the steps of:
a. firstly, estimating the sequence abundance of each minimum classification unit, namely the sum of the unique comparison sequence number and the multiple comparison and redistribution sequence number;
b. constructing hierarchical relations of the minimum classification units based on the classification tree, wherein the root node of each hierarchical relation is a species; and counting the abundance of the sequences at the species level, and outputting the result as the abundance of the sequences of the species.
11. A computer storage medium, characterized in that the computer storage medium stores a computer program comprising program instructions which, when executed by a processor, implement the method of any of claims 1-10.
12. An electronic device comprising a memory and a processor, the memory being coupled to the processor, the memory being for storing a computer program, the processor being for invoking the computer program to implement the method of any of claims 1-10.
CN202211545623.8A 2022-12-02 2022-12-02 Method for reassigning metagenome multiple comparison sequences and application Active CN115985400B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211545623.8A CN115985400B (en) 2022-12-02 2022-12-02 Method for reassigning metagenome multiple comparison sequences and application

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211545623.8A CN115985400B (en) 2022-12-02 2022-12-02 Method for reassigning metagenome multiple comparison sequences and application

Publications (2)

Publication Number Publication Date
CN115985400A CN115985400A (en) 2023-04-18
CN115985400B true CN115985400B (en) 2024-03-15

Family

ID=85961495

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211545623.8A Active CN115985400B (en) 2022-12-02 2022-12-02 Method for reassigning metagenome multiple comparison sequences and application

Country Status (1)

Country Link
CN (1) CN115985400B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105849555A (en) * 2013-12-18 2016-08-10 加利福尼亚太平洋生物科学股份有限公司 Iterative clustering of sequence reads for error correction
CN107368705A (en) * 2011-04-14 2017-11-21 考利达基因组股份有限公司 The processing and analysis of complex nucleic acid sequence data
CN111951895A (en) * 2020-07-09 2020-11-17 苏州协云基因科技有限公司 Pathogen analysis method, analysis device, apparatus and storage medium based on metagenomics
CN112687344A (en) * 2021-01-21 2021-04-20 予果生物科技(北京)有限公司 Human adenovirus molecule typing and tracing method and system based on metagenome
CN113223618A (en) * 2021-05-26 2021-08-06 予果生物科技(北京)有限公司 Method and system for detecting virulence genes of clinically important pathogenic bacteria based on metagenome
CN113265452A (en) * 2021-05-14 2021-08-17 北京大学人民医院 Bioinformatics pathogen detection method based on Nanopore metagenome RNA-seq
WO2022159838A1 (en) * 2021-01-22 2022-07-28 Idbydna Inc. Methods and systems for metagenomics analysis

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102349921B1 (en) * 2018-09-05 2022-01-12 주식회사 천랩 taxonomy profiling method for microorganism in sample

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107368705A (en) * 2011-04-14 2017-11-21 考利达基因组股份有限公司 The processing and analysis of complex nucleic acid sequence data
CN105849555A (en) * 2013-12-18 2016-08-10 加利福尼亚太平洋生物科学股份有限公司 Iterative clustering of sequence reads for error correction
CN111951895A (en) * 2020-07-09 2020-11-17 苏州协云基因科技有限公司 Pathogen analysis method, analysis device, apparatus and storage medium based on metagenomics
CN112687344A (en) * 2021-01-21 2021-04-20 予果生物科技(北京)有限公司 Human adenovirus molecule typing and tracing method and system based on metagenome
WO2022159838A1 (en) * 2021-01-22 2022-07-28 Idbydna Inc. Methods and systems for metagenomics analysis
CN113265452A (en) * 2021-05-14 2021-08-17 北京大学人民医院 Bioinformatics pathogen detection method based on Nanopore metagenome RNA-seq
CN113223618A (en) * 2021-05-26 2021-08-06 予果生物科技(北京)有限公司 Method and system for detecting virulence genes of clinically important pathogenic bacteria based on metagenome

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
宏基因组样本数据的分析比较与分类;程福东;丁啸;李晟;孙啸;;生物技术通报(第05期);全文 *

Also Published As

Publication number Publication date
CN115985400A (en) 2023-04-18

Similar Documents

Publication Publication Date Title
Zielezinski et al. Alignment-free sequence comparison: benefits, applications, and tools
Antonelli et al. Toward a self-updating platform for estimating rates of speciation and migration, ages, and relationships of taxa
CN103186716B (en) Metagenomics-based unknown pathogeny rapid identification system and analysis method
EP2963575B1 (en) Data analysis device and method therefor
CN107004141A (en) To the efficient mark of large sample group
CN105279397A (en) Method for identifying key proteins in protein-protein interaction network
CN111429980A (en) Automatic acquisition method for material crystal structure characteristics
CN114817575B (en) Large-scale electric power affair map processing method based on extended model
CN111710364A (en) Method, device, terminal and storage medium for acquiring flora marker
Avino et al. Tree shape‐based approaches for the comparative study of cophylogeny
Gruber et al. Introduction to dartR
CN114491081A (en) Electric power data tracing method and system based on data blood relationship graph
CN107748755A (en) Synonym method for digging, device, equipment and computer-readable recording medium
Dimopoulos et al. HAYSTAC: A Bayesian framework for robust and rapid species identification in high-throughput sequencing data
CN115985400B (en) Method for reassigning metagenome multiple comparison sequences and application
CN111091194B (en) Operation system identification method based on CAVWBB _ KL algorithm
Oyebanji et al. Re-evaluation of the phylogenetic relationships and species delimitation of two closely related families (Lamiaceae and Verbenaceae) using two DNA barcode markers
Van Etten et al. A k-mer-based approach for phylogenetic classification of taxa in environmental genomic data
Zhang et al. A program plagiarism detection model based on information distance and clustering
CN117171676B (en) Decision tree-based soil microorganism identification analysis method, system and storage medium
CN113761034B (en) Data processing method and device
CN116153411B (en) Design method and application of multi-pathogen probe library combination
Onodera et al. Data on the solution and processing time reached when constructing a phylogenetic tree using a quantum-inspired computer
CN116895328B (en) Evolution event detection method and system for modularized gene structure
CN117251532B (en) Large-scale literature mechanism disambiguation method based on dynamic multistage matching

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant