WO2024077568A1 - Reference sequence construction method, metagenomic data compression method and electronic device - Google Patents

Reference sequence construction method, metagenomic data compression method and electronic device

Info

Publication number
WO2024077568A1
WO2024077568A1 PCT/CN2022/125204 CN2022125204W WO2024077568A1 WO 2024077568 A1 WO2024077568 A1 WO 2024077568A1 CN 2022125204 W CN2022125204 W CN 2022125204W WO 2024077568 A1 WO2024077568 A1 WO 2024077568A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
reference sequence
read
metagenomic data
database
Prior art date
Application number
PCT/CN2022/125204
Other languages
English (en)
French (fr)
Inventor
周雁
丁仁鹏
何时绪
王琳琪
史旭莲
侯勇
Original Assignee
深圳华大智造科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳华大智造科技股份有限公司
Priority to PCT/CN2022/125204
Publication of WO2024077568A1

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations

Definitions

  • the present disclosure relates to the technical field of biological data compression, and in particular to a reference sequence construction method, a metagenomic data compression method and an electronic device.
  • Metagenome is the sum of all microbial genomes in the environment. Metagenomics is a new microbial research method that uses the genome of microbial populations in environmental samples as the research object, functional gene screening and/or sequencing analysis as the research method, and microbial diversity, population structure, evolutionary relationships, functional activity, mutual cooperation and the relationship with the environment as the research purpose. The study of metagenomic data allows researchers to break free from species boundaries, more effectively develop multi-species genetic resources and reveal the laws of life movement at a higher and more complex level.
  • embodiments of the present disclosure provide a method for constructing a reference sequence for metagenomic data compression, a metagenomic data compression method, a metagenomic data compression device, an electronic device, a non-transitory computer-readable storage medium, a computer program product, and a computer program.
  • the first aspect of the present disclosure proposes a method for constructing a reference sequence for metagenomic data compression, comprising: constructing a basic reference sequence database according to the sample source of the metagenomic data; constructing an index of the basic reference sequence database based on the basic reference sequence database; comparing a first read sequence with the basic reference sequence database according to the index of the basic reference sequence database to obtain a comparison result, wherein the first read sequence is a read sequence of a portion of samples randomly selected from the metagenomic data to be compressed; and determining the sequence abundance distribution of the first read sequence according to the comparison result to construct the reference sequence for metagenomic data compression.
  • a basic reference sequence database is constructed, including: based on the sample source of the metagenomic data, corresponding reference genomes are obtained from public databases and summarized to obtain the basic reference sequence database.
  • an index of the basic reference sequence database is constructed, including: a single reference genome in the basic reference sequence database includes a first subsequence and a second subsequence, the first subsequence and the second subsequence are merged, and the number of the reference genome is retained to obtain a subsequence merged reference genome; based on the subsequence merged reference genome, an index of the basic reference sequence database is constructed.
  • the first read sequence is compared with the basic reference sequence database, including: based on the index of the basic reference sequence database, the first read sequence is compared to each of the subsequence merged reference genomes; based on the first read sequence being compared to the subsequence merged reference genome, the number of the reference genome to which the read sequence is compared is recorded.
  • determining the sequence abundance distribution of the first read sequence and constructing the reference sequence for metagenomic data compression includes: counting the number of the first read sequence aligned to each of the reference genomes in the comparison results to obtain the sequence abundance distribution of the first read sequence; sorting the reference genomes according to the sequence abundance, and selecting the top X reference genomes to construct the reference sequence for metagenomic data compression.
  • X can be 1000.
  • constructing the reference sequence for metagenome data compression further comprises: selecting, according to the sorting, a reference genome whose sum of sequence abundance percentages is greater than Y% to construct the reference sequence for metagenome data compression.
  • Y can be 80.
  • the method for constructing a reference sequence for metagenomic data compression also includes: splitting the basic reference sequence database into sub-basic reference sequence databases; constructing indexes of the sub-reference sequence databases based on the split sub-basic reference sequence databases; based on the indexes of the sub-reference sequence databases, comparing the first read sequence with each of the sub-basic reference sequence databases to obtain a second comparison result, wherein the second comparison result includes sub-result files based on each of the sub-basic reference sequence databases.
  • the method for constructing a reference sequence for metagenomic data compression also includes: respectively counting the number of the first read sequence in each of the sub-result files that is aligned to each of the sub-basic reference sequence databases to obtain the sequence abundance distribution of the first read sequence in each of the sub-result files; performing a first sorting of the reference genome according to the sequence abundance in each of the sub-result files, and selecting the reference genomes in the top X positions in the sequence abundance in each of the sub-result files to construct a sub-reference sequence database; performing a second sorting of the reference genomes in the sub-reference sequence database according to the sequence abundance; and selecting the reference genomes in the top X positions in the sequence abundance distribution in the sub-reference sequence database to construct the reference sequence for metagenomic data compression.
  • constructing the reference sequence for metagenome data compression further comprises: selecting, according to the first sorting, a reference genome whose sum of sequence abundance proportions in each of the sub-result files is greater than Y% to construct the sub-reference sequence database; and selecting a reference genome whose sum of sequence abundance percentages in the sub-reference sequence database is greater than Y% to construct the reference sequence for metagenome data compression.
  • Y can be 80.
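The two-stage selection over sub-result files described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function name `two_stage_top_x` and the representation of each sub-result file as a dictionary (reference genome number mapped to aligned read count) are hypothetical.

```python
def rank_by_abundance(abundance):
    """Sort (genome number, read count) pairs by descending read count."""
    # Ties are broken by genome number so the ordering is deterministic.
    return sorted(abundance.items(), key=lambda kv: (-kv[1], kv[0]))


def two_stage_top_x(sub_results, x):
    """First sorting: top x genomes per sub-result file.
    Second sorting: top x genomes of the pooled sub-reference database."""
    pooled = {}
    for abundance in sub_results:
        for genome_id, count in rank_by_abundance(abundance)[:x]:
            pooled[genome_id] = pooled.get(genome_id, 0) + count
    return [genome_id for genome_id, _ in rank_by_abundance(pooled)[:x]]


# Toy data: two sub-result files (B = 2), keep X = 2 genomes.
sub_results = [
    {"G1": 90, "G2": 10, "G3": 40},
    {"G4": 70, "G5": 5, "G6": 60},
]
selected = two_stage_top_x(sub_results, x=2)
print(selected)
```

The same structure applies to the Y% variant: the `[:x]` cut would be replaced by a cumulative-share cut at each stage.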
  • the method for constructing a reference sequence for metagenomic data compression further comprises: performing a first and/or second screening on the alignment result, wherein the first screening comprises: selecting the read sequence without insertion and/or deletion in the alignment result; and the second screening comprises: selecting the read sequence below a mismatch threshold.
  • the mismatch threshold may be 3.
  • the second aspect of the present disclosure proposes a method for compressing metagenomic data, which includes: constructing a reference sequence for metagenomic data compression according to the method for constructing a reference sequence for metagenomic data compression proposed in any embodiment of the first aspect of the present disclosure above; aligning a second read sequence with the reference sequence and recording the alignment result to obtain compressed data of the metagenomic data, wherein the second read sequence is a read sequence of a sample to be compressed in the metagenomic data.
  • the second read sequence is aligned with the reference sequence and the alignment result is recorded as follows: when the number of mismatched bases between the second read sequence and the reference sequence is less than R1, the position of the second read sequence on the reference sequence is recorded; when the number of mismatched bases is greater than R1 and less than R2, the position on the reference sequence of the matched bases in the second read sequence is recorded, together with the base information of the mismatched bases; when the number of mismatched bases is greater than R2, the second read sequence itself is recorded.
  • R1, R2, and R3 are all integers greater than or equal to 0.
  • R1 is 0 to 5, and R2 is 3 to 10.
  • R1 is 0 to 2 and R2 is 3 to 8.
  • R1 is 0 and R2 is 3.
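The three recording cases governed by R1 and R2 can be sketched as below. The sketch assumes an ungapped alignment at a known position, and it resolves the boundaries as "at most R1" and "at most R2", because the text does not state which case applies when the mismatch count equals a threshold. The record layout and the names `encode_read` and `decode_read` are hypothetical.

```python
def encode_read(read, reference, position, r1=0, r2=3):
    """Encode one read against the reference using the three cases."""
    window = reference[position:position + len(read)]
    if len(window) < len(read):
        return ("raw", read)  # read runs off the reference: keep it verbatim
    mismatches = [
        (offset, base)
        for offset, (base, ref_base) in enumerate(zip(read, window))
        if base != ref_base
    ]
    if len(mismatches) <= r1:
        return ("pos", position, len(read))  # position only
    if len(mismatches) <= r2:
        return ("pos+mm", position, len(read), mismatches)  # position + differing bases
    return ("raw", read)  # too many mismatches: store the read itself


def decode_read(record, reference):
    """Rebuild the original read from its record."""
    if record[0] == "raw":
        return record[1]
    position, length = record[1], record[2]
    bases = list(reference[position:position + length])
    if record[0] == "pos+mm":
        for offset, base in record[3]:
            bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTACGT"
records = [encode_read(r, reference, 2) for r in ("GTAC", "GTTC", "CACA")]
print(records)
```

Each record decodes back to its read, so the base information is preserved losslessly in all three cases.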
  • the metagenomic data compression method further includes degenerating the quality value of the metagenomic data.
  • degenerating the quality values of the metagenomic data comprises: performing statistics on the base quality values in the metagenomic data to obtain the distribution of the quality values within M quality value ranges; and mapping the quality values within the M ranges to M mapping values respectively to degenerate the quality values of the metagenomic data.
  • M is an integer greater than 0.
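Quality value degeneration over M ranges can be illustrated with a small lookup. The range boundaries and mapping values below (M = 4) are invented for the example and are not taken from the disclosure; the Phred+33 encoding of the quality string is also an assumption.

```python
import bisect
from collections import Counter

# Hypothetical table with M = 4 ranges: [0,10), [10,20), [20,30), [30, inf).
RANGE_UPPER_BOUNDS = [10, 20, 30]
MAPPING_VALUES = [5, 15, 25, 37]


def range_index(quality):
    """Index of the range that a quality value falls into."""
    return bisect.bisect_right(RANGE_UPPER_BOUNDS, quality)


def range_histogram(qualities):
    """Distribution of quality values over the M ranges."""
    return Counter(range_index(q) for q in qualities)


def degenerate_quality(quality):
    """Replace a quality value with the mapping value of its range."""
    return MAPPING_VALUES[range_index(quality)]


def degenerate_quality_string(qual, offset=33):
    """Apply the mapping to an ASCII-encoded quality string."""
    return "".join(chr(degenerate_quality(ord(c) - offset) + offset) for c in qual)


histogram = range_histogram([2, 12, 25, 40, 38])
print(dict(histogram))
print(degenerate_quality_string("I#5"))
```

Collapsing many distinct quality values into M symbols lowers the entropy of the quality stream, which is what makes this lossy step helpful for compression.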
  • the metagenomic data compression method also includes: when the proportion of bases with quality values lower than Q accounts for less than a set proportion N of all bases in the metagenomic data, mapping the quality values of all bases in the metagenomic data to degenerate the quality values of the metagenomic data.
  • the metagenomic data compression method also includes: when the proportion of bases with quality values lower than Q accounts for a proportion of all bases in the metagenomic data that is higher than or equal to a set proportion N, mapping the quality values of the bases with quality values higher than Q in the metagenomic data to degenerate the quality values of the metagenomic data.
  • the metagenomic data compression method further includes: retaining the original quality values of the bases with quality values lower than Q in the metagenomic data when the proportion of bases with quality values lower than Q accounts for a proportion of all bases in the metagenomic data that is higher than or equal to a set proportion N.
  • Q is a quality value corresponding to a base error probability of 0.01% to 1%.
  • N is greater than or equal to 10%. In some embodiments, N is greater than or equal to 20%.
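The conditional rule built on Q and N can be sketched as follows. The sketch assumes Phred-scaled quality values, for which Q = -10·log10(p), so error probabilities of 1% and 0.01% correspond to Q20 and Q40. It also assumes that bases with a quality exactly equal to Q are mapped, which the text leaves open. The names `conditional_degenerate` and `coarse` are hypothetical.

```python
import math


def phred_from_error_probability(p):
    """Phred quality corresponding to a base error probability p."""
    return -10 * math.log10(p)


def conditional_degenerate(qualities, q_threshold, n_fraction, mapper):
    """Map all bases when low-quality bases are rare; otherwise keep
    the original values of bases below q_threshold and map the rest."""
    if not qualities:
        return []
    low = sum(1 for q in qualities if q < q_threshold)
    if low / len(qualities) < n_fraction:
        return [mapper(q) for q in qualities]
    return [q if q < q_threshold else mapper(q) for q in qualities]


def coarse(q):
    """Hypothetical two-level mapping used only for this example."""
    return 30 if q >= 20 else 10


mostly_good = [35, 36, 12, 38, 37, 34, 33, 39, 36, 35]  # 10% below Q20
mostly_poor = [12, 8, 35, 36, 15]                        # 60% below Q20
print(conditional_degenerate(mostly_good, 20, 0.2, coarse))
print(conditional_degenerate(mostly_poor, 20, 0.2, coarse))
```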
  • the third aspect of the present disclosure provides a metagenomic data compression device, the device comprising: a reference sequence construction module, used to construct a reference sequence for metagenomic data compression according to the construction method of the reference sequence for metagenomic data compression described in any embodiment of the first aspect of the present disclosure; and a data compression module, used to compare the read sequences in the metagenomic data with the reference sequence and record the comparison result to obtain compressed data of the metagenomic data.
  • the device further comprises: a quality value degeneration module, configured to degenerate the quality value of the metagenomic data.
  • An embodiment of the fourth aspect of the present disclosure proposes an electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the processor executes the computer program, a method for constructing a reference sequence for metagenomic data compression as described in any embodiment of the first aspect of the present disclosure is implemented, the method comprising: constructing a basic reference sequence database according to a sample source of the metagenomic data; constructing an index of the basic reference sequence database based on the basic reference sequence database; comparing a first read sequence with the basic reference sequence database according to the index of the basic reference sequence database to obtain a comparison result, wherein the first read sequence is a read sequence of a randomly selected portion of samples in the metagenomic data to be compressed; and determining the sequence abundance distribution of the first read sequence according to the comparison result to construct the reference sequence for metagenomic data compression.
  • An embodiment of the fifth aspect of the present disclosure proposes a non-transitory computer-readable storage medium having a computer program stored thereon, wherein when the computer program is executed by a processor, the method for constructing a reference sequence for metagenomic data compression as described in any embodiment of the first aspect of the present disclosure is implemented.
  • the sixth aspect of the present disclosure provides a computer program product, which includes a computer program. When executed by a processor, the computer program implements the method for constructing a reference sequence for metagenomic data compression as described in any embodiment of the first aspect of the present disclosure.
  • the method proposed in the present disclosure for constructing a metagenomic reference sequence and compressing metagenomic data based on that sequence can construct an effective reference sequence for metagenomic data compression.
  • By using index-dependent compression tools, the compression efficiency of metagenomic data can be greatly improved (the average compression ratio achieved is nearly 4 times that of traditional compression), thereby alleviating the storage and transmission pressure of metagenomic data with large sample sizes.
  • FIG1 is a diagram of a method for constructing a reference sequence for metagenome data compression according to an embodiment of the present disclosure
  • FIG2 is a technical solution diagram of constructing a reference sequence for metagenome data compression according to an embodiment of the present disclosure
  • FIG3 is a diagram of a method for constructing a reference sequence based on a reference genome with high sequence abundance according to an embodiment of the present disclosure
  • FIG4 is a flow chart of a method for compressing metagenomic data according to an embodiment of the present disclosure
  • FIG5 is a flow chart of metagenomic data compression based on reference sequences according to an embodiment of the present disclosure
  • FIG6 is an example diagram of a quality value mapping table according to an embodiment of the present disclosure.
  • FIG7 is a quality value degeneration flow chart according to an embodiment of the present disclosure.
  • FIG8 is a flow chart of conditional quality value degeneration according to one embodiment of the present disclosure.
  • FIG9 is a flow chart of conditional quality value degeneration according to another embodiment of the present disclosure.
  • FIG10 is a diagram of a method for compressing metagenomic data according to another embodiment of the present disclosure.
  • FIG11 is a structural diagram of a metagenomic data compression device according to an embodiment of the present disclosure.
  • FIG12 is a block diagram showing an exemplary computer device suitable for implementing embodiments of the present disclosure.
  • FIG13 is a specific data compression ratio distribution diagram according to an embodiment of the present disclosure.
  • FIG14 is a statistical graph of the Pearson correlation coefficient of the species composition of 233 samples before and after quality value degeneration.
  • Index-dependent compression tools are used for the compression of metagenomic data.
  • For index-dependent compression tools for metagenomic data, the construction of reference sequences and the compression of data are usually achieved by the following two methods.
  • Method 1: Construct a universal reference sequence based on a public database. For example, for data with a clear source such as intestinal microorganisms, the reference sequence can be constructed by summarizing all possible species genomes in the database.
  • Method 2: Construct sample-specific reference sequences based on species composition and sequence assembly.
  • Tools such as MetaCRAM (Kim, M. et al., 2016) and MCUIUC (Ligo, J. G. et al., 2013) use metagenomic species identification tools to quickly identify the species composition of the data. Based on the species identification results, users select species with abundance (Species Abundance) higher than a specific threshold as reference genome sources for constructing appropriate reference genomes, and assemble Reads that failed to align de novo to construct new reference sequences. Finally, the metagenomic data is compressed based on both the reference sequences selected from the database and the reference sequences constructed de novo.
  • the method for constructing a reference sequence for metagenome data compression proposed in the embodiment of the present disclosure realizes index-dependent efficient data compression of metagenome data by constructing a project-specific reference sequence and combining it with conditional quality value lossy compression.
  • the method for constructing a reference sequence for metagenome data compression proposed in the embodiment of the present disclosure and the metagenome data compression method based on the constructed reference sequence greatly improve the compression efficiency of metagenome data and effectively alleviate the storage pressure and transmission pressure of metagenome data with large sample sizes.
  • the first embodiment of the present disclosure proposes a method for constructing a reference sequence for metagenomic data compression.
  • Fig. 1 is a schematic diagram of a method for constructing a reference sequence for metagenome data compression according to an embodiment of the present disclosure. As shown in Fig. 1, the method may include Steps 101 to 104.
  • Step 101: Construct a basic reference sequence database based on the sample source of the metagenomic data.
  • the "sample source" is the extraction environment of the metagenome data sample to be compressed.
  • the sample can be intestinal microorganisms, water source microorganisms, soil microorganisms, etc.
  • the sample source can be the intestine, water source, soil, etc.
  • the corresponding public database can be selected and the commonly used sequences can be downloaded and aggregated as the basic reference sequence library for the construction of the comparison index.
  • the intestinal microorganism database can be GMrepo (Dai, D. et al., 2022), gutMEGA (Zhang, Q. et al., 2021) or UHGG (Almeida, A. et al., 2021).
  • Step 102: Based on the basic reference sequence database, construct an index of the basic reference sequence database.
  • an index-dependent alignment software or script is used to construct an index for a basic reference sequence database.
  • the index-dependent alignment software can be bwa (Burrows-Wheeler Aligner, Li H. and Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics, 25: 1754-60. [PMID: 19451168]), Bowtie (Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10: R25), and Bowtie2 (Langmead B, Salzberg S. Fast gapped-read alignment with Bowtie 2. Nature Methods. 2012, 9: 357-359).
  • Step 103: According to the index of the basic reference sequence database, the first read sequence is compared with the basic reference sequence database to obtain a comparison result, wherein the first read sequence is a read sequence of a portion of samples randomly selected from the metagenomic data to be compressed.
  • data of some samples (i.e., read sequences, Reads) in the sample to be compressed can be randomly selected for comparison with the basic reference sequence database. It can be understood that compared with the large sample size of the whole sample, by randomly selecting a specific number of samples for preliminary comparison, the comparison efficiency can be effectively improved and computing resources can be saved.
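Random selection of the test samples can be sketched as follows. The sample identifiers, the fixed seed and the function name `pick_test_samples` are assumptions for illustration; the disclosure does not specify how the random draw is performed.

```python
import random


def pick_test_samples(sample_ids, a, seed=0):
    """Randomly choose `a` samples whose reads serve as the first read sequence."""
    if a > len(sample_ids):
        raise ValueError("cannot select more samples than are available")
    rng = random.Random(seed)  # a fixed seed keeps the selection reproducible
    return rng.sample(sorted(sample_ids), a)


all_samples = [f"sample_{i:03d}" for i in range(50)]
test_samples = pick_test_samples(all_samples, a=10)
print(test_samples)
```

Only the reads of these samples are aligned against the basic reference sequence database, which is what keeps the preliminary alignment inexpensive.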
  • Step 104: Determine the sequence abundance distribution of the first read sequence according to the alignment result, and construct a reference sequence for metagenomic data compression.
  • Sequence abundance refers to the number of sample Reads aligned to each reference genome in the alignment.
  • The sequence abundance distribution of the Reads of the selected portion of the samples is obtained by counting, in the alignment result output by Step 103, the number of Reads (i.e., the first read sequence) aligned to each reference genome. The reference genomes in the basic reference sequence database are then sorted by sequence abundance, the top-ranked reference genomes are selected according to the user's computing configuration, compression ratio requirements or other individual needs, and the reference sequence for compressing the metagenomic data of all samples is constructed from them.
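Counting sequence abundance and keeping the top-ranked reference genomes can be sketched as follows. The input is assumed to be the list of reference genome numbers recorded for the aligned Reads, one entry per Read; the function names are hypothetical.

```python
from collections import Counter


def abundance_distribution(aligned_genome_ids):
    """Count how many Reads aligned to each reference genome number."""
    return Counter(aligned_genome_ids)


def select_top_x(abundance, x):
    """Return the x genome numbers with the highest read counts."""
    # Ties are broken by genome number so the ordering is deterministic.
    ranked = sorted(abundance.items(), key=lambda kv: (-kv[1], kv[0]))
    return [genome_id for genome_id, _ in ranked[:x]]


# One entry per aligned Read: the number of the genome it aligned to.
hits = ["G2", "G1", "G2", "G3", "G2", "G1"]
abundance = abundance_distribution(hits)
top = select_top_x(abundance, 2)
print(top)
```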
  • FIG. 2 is a technical scheme diagram for constructing a reference sequence for metagenome data compression according to an embodiment of the present disclosure.
  • the construction method of the reference sequence for metagenome data compression proposed in the embodiment of the present disclosure may include determining a microbial database of a specific major category from a public microbial database according to project information, and obtaining a basic reference sequence database from the microbial database of a specific major category; using a portion of the samples (i.e., test samples) in all the samples to be compressed to compare with the basic reference sequence database, obtaining the sequence abundance of each reference genome in the basic reference sequence database of the partial samples and sorting them to obtain the sequence abundance distribution of the partial samples; selecting the reference genome of a high-abundance species (i.e., the reference genome in the basic reference sequence database with the top sequence abundance ranking) for merging, thereby obtaining a project-specific (i.e., for the project) reference sequence for subsequent index-dependent data compression of the metagenome data.
  • the method proposed in the embodiment of the present disclosure can effectively improve the alignment efficiency and save computing resources by randomly selecting a specific number of samples for preliminary alignment and constructing a reference sequence according to the sequence abundance.
  • the data volume of the reference sequence is greatly reduced, which is conducive to efficient alignment and compression in the later stage.
  • Step 102 may also include: a single reference genome in the basic reference sequence database includes a first subsequence and a second subsequence, the first subsequence and the second subsequence are merged, and the number of the reference genome is retained to obtain a subsequence merged reference genome; based on the subsequence merged reference genome, an index of the basic reference sequence database is constructed.
  • the first subsequence or the second subsequence can be a fragment sequence in a Fastq file of a single reference genome, such as the sequence of each chromosome in the reference genome.
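Merging the subsequences of a reference genome while retaining only its number can be sketched as follows. The sketch assumes the subsequences have already been parsed into (genome number, sequence) pairs and joins them by plain concatenation; the helper names and the FASTA-style output are illustrative choices, not details given in the text.

```python
def merge_subsequences(records):
    """Concatenate all subsequences that share a reference genome number."""
    parts = {}
    for genome_id, sequence in records:
        parts.setdefault(genome_id, []).append(sequence)
    return {genome_id: "".join(seqs) for genome_id, seqs in parts.items()}


def to_fasta(merged, width=60):
    """Write one record per genome, with the genome number as the header."""
    lines = []
    for genome_id, sequence in merged.items():
        lines.append(f">{genome_id}")
        lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"


records = [("G1", "ACGT"), ("G2", "TT"), ("G1", "GGCC")]
merged = merge_subsequences(records)
print(to_fasta(merged), end="")
```

With one record per genome, an aligner reports the genome number directly as the reference name, so counting Reads per genome needs no lookup from chromosome or contig names.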
  • the basic reference sequence database can also be split into several sub-basic reference sequence databases and the index of the sub-reference sequence database can be constructed based on the split sub-basic reference sequence databases respectively; based on the index of the sub-reference sequence database, the read sequence (i.e., the first read sequence) of the randomly selected part of the samples is respectively compared with each sub-basic reference sequence database to obtain a second comparison result, wherein the second comparison result includes a sub-result file based on each of the sub-basic reference sequence databases.
  • the computing configuration of some users is not sufficient to perform operations based on the larger basic reference sequence database.
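Splitting the basic reference sequence database into B sub-databases can be sketched as follows. The text does not say how genomes are distributed, so the greedy size-balancing rule below is purely an assumption, as are the function name and the use of genome lengths as the balancing criterion.

```python
import heapq


def split_database(genome_lengths, b):
    """Assign genomes to b sub-databases, always filling the smallest one."""
    heap = [(0, index) for index in range(b)]  # (total length, sub-database index)
    heapq.heapify(heap)
    parts = [[] for _ in range(b)]
    # Placing the longest genomes first gives a better balance.
    ordered = sorted(genome_lengths.items(), key=lambda kv: (-kv[1], kv[0]))
    for genome_id, length in ordered:
        total, index = heapq.heappop(heap)
        parts[index].append(genome_id)
        heapq.heappush(heap, (total + length, index))
    return parts


genome_lengths = {"G1": 5, "G2": 4, "G3": 3, "G4": 2}
parts = split_database(genome_lengths, b=2)
print(parts)
```

Each sub-database can then be indexed separately, so the peak memory needed for index construction is bounded by the largest part rather than by the whole database.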
  • FIG3 is a diagram of a reference sequence construction method based on a reference genome with high sequence abundance according to an embodiment of the present disclosure.
  • A is the number of randomly selected partial samples to be compressed (i.e., the number of test samples).
  • B is the number of basic reference sequence databases (i.e., the number of basic sequence index files).
  • B ≥ 2 corresponds to splitting the basic reference sequence database into several sub-basic reference sequence databases.
  • X is the sequence abundance selection threshold selected by the user for the reference genome.
  • the single overall basic reference sequence database can be compared and sequence abundance screened to determine the reference sequence database for metagenome data compression.
  • the read sequences (i.e., the first read sequence) of a randomly selected portion of the samples can be aligned to each subsequence merged reference genome based on the index of the basic reference sequence database; when a read sequence of that portion of the samples aligns to a subsequence merged reference genome, the number of the reference genome to which the read sequence is aligned is recorded.
  • the alignment software can be Bwa, Bowtie, Bowtie2 or a locally written index-dependent script or software.
  • the number of the read sequences of the portion of the sample in the alignment result that are aligned to the numbers of each reference genome is counted to obtain the sequence abundance distribution of the read sequences of the portion of the sample; the reference genome is sorted according to the sequence abundance, and the reference genomes in the top X positions are selected to construct reference sequences for metagenome data compression.
  • each sample in the test samples (i.e., the portion of the samples, A in number) is compared with the basic reference sequence database; the sequence abundances of the A samples for each reference genome in the basic reference sequence database are merged and sorted to obtain the overall sequence abundance distribution of the A test samples; the first X reference genomes are selected to construct a reference sequence for metagenomic data compression.
  • the sub-basic reference sequence database after the split is compared and the sequence abundance is screened to determine the reference sequence database for the metagenome data compression.
  • each sample in the test samples (i.e., the portion of the samples, A in number) is compared with each sub-basic reference sequence database to obtain a sub-result file;
  • the number of read sequences of the test samples aligned to each sub-basic reference sequence database is counted separately in each sub-result file to obtain the sequence abundance distribution of the read sequences of the test samples in each sub-result file, wherein there are B sub-result files, including A*B sequence abundance distributions;
  • the reference genomes are first sorted according to the sequence abundance in the B sub-result files, and the reference genomes in the top X positions of sequence abundance in each of the B sub-result files are selected to construct the sub-reference sequence database.
  • the subsequences within a single reference genome in the basic reference sequence database can be merged before the splitting, and the number of the reference genome can be retained for subsequent comparison; or after the splitting, the subsequences within a single reference genome in the sub-basic reference sequence databases obtained by the splitting can be merged, and the number of the reference genome can be retained for subsequent comparison.
  • the sequence abundance selection threshold value X may be selected by the user according to data conditions, personal computing resources, or requirements for compression ratio, etc. In some embodiments, X may be between 200 and 5000. In some embodiments, X may be between 500 and 3000. In some embodiments, X may be 1000.
  • a reference genome whose sum of sequence abundance ratios is greater than Y% may be selected based on the statistical and sorting results of sequence abundance to construct a reference sequence for metagenomic data compression, wherein the ratio is the ratio of the sequence abundance corresponding to a certain reference genome to the total sequence abundance. Selecting a reference genome whose sum of sequence abundance ratios is greater than Y% means selecting the top several reference genomes according to the statistics and sorting of sequence abundance, so that the sum of the sequence abundance ratios of the top several selected reference genomes is greater than Y%.
  • Y can be determined based on the sample size, the expected compression ratio, and the user's computing resources. In some embodiments, Y can be 20 to 80. In some embodiments, Y can be 40 to 80. In some embodiments, Y can be 80. In the disclosed embodiments, compared with the use of all reference genomes in the basic reference sequence database, the use of representative reference genomes does not affect the accuracy of subsequent data compression, that is, the compression performed based on the index constructed by all reference genomes in the basic reference sequence database with a huge amount of data, and the data composition after compression is highly correlated with the data composition after compression using the representative reference genome in the disclosed embodiments. Therefore, by selecting a representative reference genome with a high ranking in sequence abundance to construct a compressed index, the volume of the compressed index is effectively reduced, the amount of subsequent compression operations is greatly reduced, and the high fidelity of the compressed data is guaranteed.
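Selection by cumulative sequence abundance share can be sketched as follows: genomes are taken in descending order of abundance until their combined share first exceeds Y%. The function name and the toy counts are hypothetical.

```python
def select_by_cumulative_share(abundance, y_percent):
    """Take top-ranked genomes until their summed share exceeds y_percent."""
    total = sum(abundance.values())
    if total == 0:
        return []
    ranked = sorted(abundance.items(), key=lambda kv: (-kv[1], kv[0]))
    selected, running = [], 0
    for genome_id, count in ranked:
        selected.append(genome_id)
        running += count
        # Integer comparison avoids floating-point rounding at the boundary.
        if running * 100 > y_percent * total:
            break
    return selected


abundance = {"G1": 50, "G2": 30, "G3": 15, "G4": 5}
print(select_by_cumulative_share(abundance, 80))
print(select_by_cumulative_share(abundance, 40))
```

In this toy case G1 and G2 together reach exactly 80%, which is not greater than Y = 80, so G3 is included as well.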
  • The alignment results may be subjected to a first and/or second screening, wherein the first screening includes selecting read sequences without insertions and/or deletions in the alignment results, and the second screening includes selecting read sequences below a mismatch threshold.
  • The first and/or second screening can be applied to the generated result file (e.g., in Bam or Sam format) to perform quality control on the alignment results.
  • For the first screening, Reads without insertions and/or deletions can be selected according to the CIGAR (Concise Idiosyncratic Gapped Alignment Report) value in the result file, wherein the absence of insertions and/or deletions is represented by 100M or 150M (100 and 150 are the Read lengths of 100 bp and 150 bp, M stands for Match, and 100M or 150M means that the full 100 bp or 150 bp of the Read is aligned to the reference sequence end to end, without insertions, deletions, or clipping).
  • For the second screening, Reads whose number of mismatches is lower than the mismatch threshold can be selected according to the NM value (the NM tag) of the result file. In some embodiments, the mismatch threshold can be 1 to 10.
  • In some embodiments, the mismatch threshold can be 1 to 5. In some embodiments, the mismatch threshold can be 3. It can be understood that screening the Reads in the alignment results removes Reads with more mismatches, which improves the overall credibility of the Reads and makes the selection of reference genomes based on the sequence abundance distribution of the screened high-confidence Reads more accurate.
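The two screenings might be sketched in Python on plain-text Sam records as shown below. This is only an illustration under stated assumptions: the function name is hypothetical, the mismatch count is read from the `NM:i:` optional field, and any CIGAR consisting of a single `M` operation is accepted rather than only 100M or 150M. For such gap-free alignments the NM edit distance equals the number of mismatched bases.

```python
import re

# A CIGAR made of one M operation: no insertion, deletion, or clipping.
FULL_MATCH = re.compile(r"^\d+M$")


def passes_screens(sam_line: str, mismatch_threshold: int = 3) -> bool:
    """Apply the first (no indel) and second (mismatch) screening to one Sam record."""
    fields = sam_line.rstrip("\n").split("\t")
    if len(fields) < 11:
        return False  # not a complete alignment record
    flag, cigar = int(fields[1]), fields[5]
    if flag & 0x4:
        return False  # read is unmapped
    # First screening: keep only gap-free, full-length alignments.
    if not FULL_MATCH.match(cigar):
        return False
    # Second screening: keep reads whose mismatch count is below the threshold.
    for tag in fields[11:]:
        if tag.startswith("NM:i:"):
            return int(tag[5:]) < mismatch_threshold
    return False  # no NM field, so the mismatch count cannot be checked
```

A production pipeline would more likely read Bam files through a dedicated library; plain string parsing is used here only to keep the sketch self-contained.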
  • The method for constructing a reference sequence for metagenomic data compression proposed in the embodiments of the present disclosure merges the subsequences of each single reference genome in the basic reference sequence database while retaining only the genome's number, and/or splits the basic reference sequence database into multiple sub-basic reference sequence databases. This effectively solves the problem that the basic reference sequence database is so large that users of small computing clusters or personal computers cannot build, in one pass, the alignment index for a single Fastq file containing tens of thousands of reference genomes. At the same time, the method randomly selects a subset of samples for preliminary alignment and reference sequence construction, which ensures that the constructed reference sequence covers as much of the data to be compressed as possible while greatly reducing data input and output during alignment, improving the efficiency of reference sequence construction, and saving computing and storage resources.
  • FIG. 4 is a flow chart of a method for compressing metagenomic data according to an embodiment of the present disclosure. As shown in FIG. 4, the method includes:
  • Step 201: construct a reference sequence for metagenomic data compression according to the method for constructing a reference sequence for metagenomic data compression described in any one of the embodiments of the first aspect above;
  • Step 202: align the second read sequence with the reference sequence and record the alignment result to obtain compressed data of the metagenomic data, wherein the second read sequence is the read sequence of the sample to be compressed in the metagenomic data.
  • the read sequences of some or all of the samples in the metagenome data can be compressed based on the reference sequence, that is, the second read sequence is compressed. It is understandable that the second read sequence may be the same as or different from the first read sequence.
  • all or part of the samples in the metagenome data can be selected for compression according to user needs, thereby achieving efficient compression while improving the flexibility of compression.
  • Fig. 5 is a flow chart of the compression of metagenomic data based on a reference sequence according to an embodiment of the present disclosure. As shown in Fig. 5, after the reference sequence is constructed according to any embodiment of the first aspect of the present disclosure, the Reads (Fastq file) in the metagenomic data to be compressed are input and compared with the constructed reference sequence.
  • When the number of mismatched bases between the read sequence (i.e., the second read sequence) in the metagenomic data and the reference sequence is less than R1, the position of the read sequence on the reference sequence is recorded; when the number of mismatched bases is greater than R1 and less than R2, the positions of the matched bases of the read sequence on the reference sequence are recorded, together with the base information of the mismatched bases in the read; when the number of mismatched bases is greater than R2, the read sequence itself is recorded.
  • In some embodiments, R1, R2, and R3 are all integers greater than or equal to 0.
  • In some embodiments, R1 is 0 to 5, and R2 is 3 to 10.
  • In some embodiments, R1 is 0 to 2, and R2 is 3 to 8. In some embodiments, R1 is 0 and R2 is 3.
  • In step iii, when there is a mismatch between the Reads and the reference sequence and the number of mismatched bases is greater than 5 (i.e., R2 ≥ 5), the sequence information of the Reads is recorded, that is, the actual base information of the Reads is retained.
  • In some embodiments, the number of mismatched bases in step iii can be a positive integer greater than 3 (i.e., R2 > 3).
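The three recording cases might be sketched as follows. This is a simplified illustration, not the disclosed implementation: the record layout and function names are hypothetical, alignment is assumed to be gap-free, and the boundary cases (a mismatch count exactly equal to R1 or R2) are resolved here by treating both thresholds as inclusive upper bounds, which the text above does not specify.

```python
from typing import List, Tuple


def encode_read(read: str, reference: str, pos: int, r1: int = 0, r2: int = 3) -> tuple:
    """Encode one read aligned without gaps at `pos` on `reference`."""
    window = reference[pos:pos + len(read)]
    if len(window) < len(read):
        return ("raw", read)  # read runs off the reference: keep it verbatim
    mismatches: List[Tuple[int, str]] = [
        (i, base) for i, (base, ref_base) in enumerate(zip(read, window)) if base != ref_base
    ]
    if len(mismatches) <= r1:
        return ("pos", pos, len(read))  # position only
    if len(mismatches) <= r2:
        return ("pos+mm", pos, len(read), mismatches)  # position plus mismatched bases
    return ("raw", read)  # too many mismatches: keep the actual bases


def decode_read(record: tuple, reference: str) -> str:
    """Rebuild the original read from its record and the reference."""
    if record[0] == "raw":
        return record[1]
    pos, length = record[1], record[2]
    bases = list(reference[pos:pos + length])
    if record[0] == "pos+mm":
        for offset, base in record[3]:
            bases[offset] = base
    return "".join(bases)
```

Decoding each record back against the same reference returns the original bases, which is why the sequence part of this scheme is lossless.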
  • the metagenomic data compression method further includes: degenerating the quality value of the metagenomic data.
  • Metagenomic data are mostly stored as Fastq files.
  • Each Fastq record consists of 4 lines, and the characters in the 4th line encode the probability that each base in the sequence was misidentified, that is, the base quality value (Quality Score, Q-score).
  • the quality value of the base is divided into 0 to 40 according to the possibility of base error, where 0 represents an error probability of 100% and 40 represents an error probability of 0.01%.
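The relationship between a quality value and its error probability follows the standard Phred definition, P = 10^(-Q/10), and can be illustrated as below. The ASCII offset of 33 used to decode Fastq quality characters is an assumption (it is the common Phred+33 encoding); the function names are chosen for this sketch only.

```python
def phred_to_error_probability(q: int) -> float:
    """Convert a Phred quality value to the probability that the base call is wrong."""
    return 10 ** (-q / 10)


def fastq_char_to_phred(ch: str, offset: int = 33) -> int:
    """Decode one character of the 4th Fastq line, assuming Phred+33 encoding."""
    return ord(ch) - offset
```

For example, a quality value of 20 corresponds to an error probability of 1%, and the character `I` decodes to a quality value of 40 under the assumed offset.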
  • the quality values of the metagenomic data are degenerated, including: counting the base quality values in the metagenomic data to obtain the distribution of the quality values within M quality value ranges; and mapping the quality values within the M ranges to M mapping values respectively to degenerate the quality values of the metagenomic data.
  • M quality value ranges are set according to different base error probabilities, and corresponding M specific mapping values are set to map the base quality values to complete degeneration, wherein M can be an integer greater than 0, such as any integer from 1 to 100.
  • the M specific mapping values can be adjusted by the user according to the actual data situation, and the present disclosure does not limit this.
  • Figure 6 is an example diagram of a quality value mapping table according to an embodiment of the present disclosure. As shown in Figure 6, the quality values 0 to 40 can be divided into M quality value ranges, and Q1, Q2...QM are used as corresponding specific mapping values.
  • FIG7 is a flow chart of quality value degeneration according to an embodiment of the present disclosure.
  • the quality values of the Reads to be compressed are counted and different threshold ranges are defined, such as [a, b], [c, d], [e, f], ..., etc., a total of M, where a-f represent different quality values.
  • the bases falling into the same threshold range are mapped to the same specific mapping value, so that the Reads to be compressed are degenerated, thereby reducing the volume of the data to be compressed and reducing the amount of redundant calculations.
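Mapping every quality value in a range to one mapping value might be sketched with a lookup table as follows. The three ranges and their mapping values in the usage note are purely hypothetical, as are the function names and the Phred+33 offset; quality values above `max_q` are not handled in this sketch.

```python
from typing import List, Tuple

Bin = Tuple[int, int, int]  # (lowest quality, highest quality, mapping value)


def build_lookup(bins: List[Bin], max_q: int = 40) -> List[int]:
    """Build a table so that lookup[q] is the mapping value for quality q."""
    lookup = [-1] * (max_q + 1)
    for low, high, mapped in bins:
        for q in range(low, high + 1):
            lookup[q] = mapped
    if -1 in lookup:
        raise ValueError("bins must cover every quality value from 0 to max_q")
    return lookup


def degenerate_quality_string(qual: str, lookup: List[int], offset: int = 33) -> str:
    """Replace each quality character by the character of its mapping value."""
    return "".join(chr(lookup[ord(c) - offset] + offset) for c in qual)
```

With hypothetical bins `[(0, 9, 5), (10, 24, 18), (25, 40, 35)]`, the quality values 2, 12, 30 and 40 become 5, 18, 35 and 35, so the quality line uses at most M distinct symbols and compresses better.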
  • the embodiments of the present disclosure also propose a technical solution for conditionally degenerating metagenomic data quality values, so as to reduce the impact of lossy compression of quality values on downstream analysis.
  • In some embodiments, after the quality values of the Reads to be compressed are counted and before the Reads are degenerated, the method further includes: when the bases with quality values lower than Q account for less than a set proportion N of all bases in the metagenomic data, mapping the quality values of all bases in the metagenomic data to degenerate the quality values of the metagenomic data.
  • In some embodiments, when the bases with quality values lower than Q account for a proportion of all bases in the metagenomic data that is higher than or equal to the set proportion N, the quality values of the bases with quality values higher than Q in the metagenomic data are mapped to degenerate the quality values of the metagenomic data.
  • In some embodiments, when the bases with quality values lower than Q account for a proportion of all bases in the metagenomic data that is higher than or equal to the set proportion N, the original quality values of the bases with quality values lower than Q in the metagenomic data are retained.
  • Q can be determined according to the actual quality value distribution of the metagenomic data and the desired degeneracy.
  • Q can be any integer from 0 to 40, that is, corresponding to the range of 100% to 0.01% of the base error probability.
  • Q can be the quality value corresponding to the base error probability of 0.01% to 1%.
  • Q can be the quality value corresponding to the base error probability of 0.1% to 1%.
  • the ratio N is set to be greater than or equal to 20%. In other embodiments, N is greater than or equal to 10%.
  • Fig. 8 is a flow chart of conditional quality value degeneration according to an embodiment of the present disclosure.
  • In this flow, the quality values of the bases range from 0 to 40, and the mapping values corresponding to the 4 quality value ranges are Q1, Q2, Q3 and Q4 respectively.
  • Fig. 9 is a flow chart of conditional quality value degeneration according to another embodiment of the present disclosure. As shown in Fig. 9, the difference between this flow and the flow shown in Fig. 8 is that if R1%+R2%+R3% is greater than or equal to N%, the original quality values of all bases in the data to be compressed are retained without degeneration.
  • Figure 10 is a diagram of a method for compressing metagenomic data according to an embodiment of the present disclosure. As shown in Figure 10, the method may include construction of an index for compression, conditional quality value degeneration of the data to be compressed (Fastq file), and data compression based on the constructed reference index.
  • The metagenomic data compression method proposed in the second-aspect embodiments of the present disclosure uses a reference sequence constructed by the method for constructing a reference sequence for metagenomic data compression described in any embodiment of the first aspect of the present disclosure, and quickly aligns the Reads to be compressed against the constructed reference sequence. If a Read aligns exactly to a position, only the position of that Read on the reference sequence needs to be recorded; if there are a small number of mismatches, the information of the mismatched bases is retained while the positions of the remaining matched bases are recorded; for Reads that cannot be accurately aligned to the reference sequence, the full sequence information is recorded. This greatly improves the compression efficiency of metagenomic data and relieves the storage pressure of metagenomic data with large sample sizes.
  • In addition, the metagenomic data compression method proposed in the embodiments of the present disclosure conditionally degenerates the base quality values before compression: by setting a threshold, it degenerates the bases with high quality values and retains the original quality values of the bases with medium and low quality values, thereby simplifying and reducing the data to be compressed without affecting the subsequent alignment; compressing data whose quality values have been degenerated further improves the compression efficiency.
  • the third aspect embodiment of the present disclosure proposes a metagenomic data compression device.
  • Figure 11 is a structural diagram of a metagenomic data compression device according to an embodiment of the present disclosure.
  • the metagenomic data compression device 90 may include: a reference sequence construction module 901, which is used to construct a reference sequence for metagenomic data compression according to the construction method of the reference sequence for metagenomic data compression described in any embodiment of the first aspect embodiment; and a data compression module 902, which is used to compare the read sequence in the metagenomic data with the reference sequence and record the comparison results to obtain compressed data of the metagenomic data.
  • the device 90 may further include: a quality value degeneration module 903, which is used to degenerate the quality values of the metagenomic data.
  • The metagenomic data compression device proposed in the third-aspect embodiments of the present disclosure uses a reference sequence constructed by the method for constructing a reference sequence for metagenomic data compression described in any embodiment of the first aspect of the present disclosure, and quickly aligns the Reads to be compressed against the constructed reference sequence. If a Read aligns exactly to a position, only the position of that Read on the reference sequence needs to be recorded; if there are a small number of mismatches, the information of the mismatched bases is retained while the positions of the remaining matched bases are recorded; for Reads that cannot be accurately aligned to the reference sequence, the full sequence information is recorded. This greatly improves the compression efficiency of metagenomic data and relieves the storage pressure of metagenomic data with large sample sizes.
  • In addition, the metagenomic data compression device proposed in the embodiments of the present disclosure conditionally degenerates the base quality values before compression: by setting a threshold, it degenerates the bases with high quality values and retains the original quality values of the bases with medium and low quality values, thereby simplifying and reducing the data to be compressed without affecting the subsequent alignment; compressing data whose quality values have been degenerated further improves the compression efficiency.
  • The embodiments of the present disclosure also propose an electronic device, including: a memory, a processor, and a computer program stored in the memory and executable on the processor.
  • When the processor executes the program, it implements the method for constructing a reference sequence for metagenomic data compression as proposed in the first aspect of the embodiment of the present disclosure or the method for compressing metagenomic data as proposed in the second aspect of the embodiment of the present disclosure.
  • The embodiments of the present disclosure also propose a non-transitory computer-readable storage medium, on which a computer program is stored.
  • When the program is executed by a processor, it implements the method for constructing a reference sequence for metagenomic data compression as proposed in the first aspect of the embodiment of the present disclosure or the method for compressing metagenomic data as proposed in the second aspect of the embodiment of the present disclosure.
  • The embodiments of the present disclosure also propose a computer program product.
  • When the instructions in the computer program product are executed by a processor, the method for constructing a reference sequence for metagenomic data compression proposed in the embodiment of the first aspect of the present disclosure or the method for compressing metagenomic data proposed in the embodiment of the second aspect of the present disclosure is performed.
  • The embodiments of the present disclosure also propose a computer program, which includes computer program code.
  • When the computer program code is run on a computer, it enables the computer to execute the method for constructing a reference sequence for metagenomic data compression as proposed in the first aspect of the embodiment of the present disclosure or the method for compressing metagenomic data as proposed in the second aspect of the embodiment of the present disclosure.
  • Fig. 12 shows a block diagram of an exemplary computer device suitable for implementing the embodiments of the present disclosure.
  • the electronic device 12 shown in Fig. 12 is only an example and should not bring any limitation to the functions and scope of use of the embodiments of the present disclosure.
  • the electronic device 12 is in the form of a general-purpose computing device.
  • the components of the electronic device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 connecting different system components (including the system memory 28 and the processing unit 16).
  • Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, a graphics acceleration port, a processor or a local bus using any of a variety of bus structures.
  • These architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
  • the electronic device 12 typically includes a variety of computer system readable media. These media can be any available media that can be accessed by the electronic device 12, including volatile and non-volatile media, removable and non-removable media.
  • the memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32.
  • the electronic device 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media.
  • The storage system 34 may be used to read and write non-removable, non-volatile magnetic media (not shown in FIG. 12, commonly referred to as a “hard drive”).
  • Although not shown in FIG. 12, a disk drive for reading and writing a removable non-volatile magnetic disk (e.g., a “floppy disk”) and an optical disk drive for reading and writing a removable non-volatile optical disk (e.g., a Compact Disc Read Only Memory (hereinafter referred to as CD-ROM), a Digital Video Disc Read Only Memory (hereinafter referred to as DVD-ROM), or other optical media) may be provided.
  • In these cases, each drive may be connected to the bus 18 via one or more data medium interfaces.
  • the memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to perform the functions of the various embodiments of the present disclosure.
  • a program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in the memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which or some combination may include an implementation of a network environment.
  • the program modules 42 generally perform the functions and/or methods of the embodiments described in the present disclosure.
  • the electronic device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), one or more devices that enable a user to interact with the electronic device 12, and/or any device that enables the electronic device 12 to communicate with one or more other computing devices (e.g., network card, modem, etc.). Such communication may be performed through an input/output (I/O) interface 22.
  • the electronic device 12 may also communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network, such as the Internet) through a network adapter 20. As shown, the network adapter 20 communicates with other modules of the electronic device 12 through a bus 18.
  • The processing unit 16 executes various functional applications and data processing by running the programs stored in the system memory 28, such as implementing the methods mentioned in the above embodiments.
  • Any process or method description in a flowchart or otherwise described herein may be understood to represent a module, segment or portion of code that includes one or more executable instructions for implementing the steps of a specific logical function or process, and the scope of the preferred embodiments of the present disclosure includes alternative implementations in which functions may not be performed in the order shown or discussed, including performing functions in a substantially simultaneous manner or in the reverse order depending on the functions involved, which should be understood by those skilled in the art to which the embodiments of the present disclosure belong.
  • each functional unit in each embodiment of the present disclosure may be integrated into a processing module, or each unit may exist physically separately, or two or more units may be integrated into one module.
  • the above-mentioned integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
  • the storage medium mentioned above can be a read-only memory, a magnetic disk or an optical disk, etc.
  • This example uses data from a gut microbiome project published on the China National Gene Bank Big Data Platform (db.cngb.org/search/project/CNP0000497/) to describe a specific implementation.
  • The project contains 233 samples and 466 files, with a total raw data volume of 6.32 TB (2.25 TB after gzip compression).
  • This example uses the reference data set provided by Metaphlan3 as the source of the basic reference sequence database (github.com/biobakery/MetaPhlAn/wiki/MetaPhlAn-3.0).
  • the reference genome number of the marker gene source in NCBI was obtained by back-tracing the microbial marker genes in mpa_v30_CHOCOPhlAn_201901_marker_info.txt.bz2.
  • the corresponding ftp link was obtained from the website ftp.ncbi.nih.gov/genomes/genbank/bacteria/assembly_summary.txt, so as to batch download the reference genome sequence.
  • a total of 25435 reference genomes were downloaded.
  • the subsequences within the reference genome were merged by Python script, and the merging rules were as follows:
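The specific merging rules are not reproduced in this text. As a rough sketch only, assuming that merging means concatenating all subsequences of one genome (for example contigs or plasmids in a multi-record Fasta file) into a single record whose header keeps only the genome number, the step might look like this. The function names, the `spacer` parameter, and the genome identifier in the usage note are hypothetical.

```python
from typing import Iterable, List, Tuple


def merge_subsequences(genome_id: str, fasta_lines: Iterable[str], spacer: str = "") -> Tuple[str, str]:
    """Concatenate all records of one genome's Fasta file into a single sequence."""
    parts: List[str] = []
    current: List[str] = []
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new subsequence begins: close the previous one, drop its header.
            if current:
                parts.append("".join(current))
                current = []
        else:
            current.append(line)
    if current:
        parts.append("".join(current))
    return genome_id, spacer.join(parts)


def to_fasta(genome_id: str, sequence: str, width: int = 60) -> str:
    """Format the merged genome as one Fasta record headed by the genome number only."""
    lines = [">" + genome_id]
    lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"
```

Calling `merge_subsequences("GCA_000000001.1", lines)` on a file with two records would yield one sequence under the single (hypothetical) identifier, so that every aligned Read can be attributed to a genome number rather than to an individual contig.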
  • This example uses the alignment software Bwa (Li, H. and Durbin, R., 2009) to align the Fastq files of 50 randomly selected test samples, counts the number of reads aligned to each genome sequence, and sorts the reference genomes by the number of aligned reads.
  • this embodiment selects the top 1000 reference genomes in terms of sequence abundance to construct project-specific compressed reference sequences.
  • the specific selection criteria refer to Figures 2 and 3.
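Counting the reads aligned to each genome and keeping the top-ranked genomes might be sketched as follows. The sketch assumes plain-text Sam records whose reference name (3rd column) is the retained genome number; the function names are hypothetical, and secondary or supplementary alignments are skipped so that each read is counted once.

```python
from collections import Counter
from typing import Iterable, List


def count_reads_per_genome(sam_lines: Iterable[str]) -> Counter:
    """Count primary alignments per reference genome number."""
    counts: Counter = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        flag, genome_id = int(fields[1]), fields[2]
        # Skip unmapped reads (0x4) and secondary/supplementary alignments (0x100, 0x800).
        if flag & 0x904 or genome_id == "*":
            continue
        counts[genome_id] += 1
    return counts


def top_x_genomes(counts: Counter, x: int = 1000) -> List[str]:
    """Return the genome numbers ranked in the top x by aligned-read count."""
    return [genome_id for genome_id, _ in counts.most_common(x)]
```

With X = 1000 as in this example, `top_x_genomes(counts)` would give the list of genomes from which the project-specific reference sequence is assembled.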
  • the size of the final constructed Fastq file is 1.7GB, which is only 1.6% of the basic reference sequence.
  • the quality value mapping scheme is: 0-3 is merged into 0, 4-19 is simplified into 11, 20-29 is simplified into 23, and 30-40 is simplified into 37;
  • the judgment condition for low-quality reads is: when the proportion of bases in a read with quality values in the range of 4 to 29 is greater than or equal to 20%, the bases in the range of 4 to 29 in the read will not be degenerated in terms of quality values, and the remaining bases will be degenerated according to the original rules.
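The mapping scheme and the per-read condition of this example might be sketched as follows. The four ranges, the mapping values 0, 11, 23 and 37, and the 20% condition are taken from the text above; the function names are hypothetical, and treating quality values above 40 like the 30 to 40 range is an assumption of the sketch.

```python
from typing import List


def map_quality(q: int) -> int:
    """Map a quality value to its range's mapping value (0-3, 4-19, 20-29, 30-40)."""
    if q <= 3:
        return 0
    if q <= 19:
        return 11
    if q <= 29:
        return 23
    return 37


def degenerate_read(quals: List[int], low: int = 4, high: int = 29,
                    min_fraction: float = 0.20) -> List[int]:
    """Conditionally degenerate the quality values of one read."""
    if not quals:
        return []
    in_range = sum(1 for q in quals if low <= q <= high)
    # Low-quality read: bases in the 4 to 29 range keep their original values.
    keep_original = in_range / len(quals) >= min_fraction
    return [
        q if keep_original and low <= q <= high else map_quality(q)
        for q in quals
    ]
```

A read with one base in the 4 to 29 range out of ten is fully degenerated, whereas a read with two such bases out of ten (20%) keeps those two original values and has only the remaining bases mapped.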
  • this embodiment uses the index-dependent open source compression tool genozip (genozip.Readthedocs.io/) to perform compression tests on all samples of the project (i.e., the above-mentioned total data volume of 6.32TB of raw data).
  • Other similar tools include GTZ (github.com/Genetalks/gtz), LW_FQZIP (github.com/Zhuzxlab/LW-FQZip2), etc.
  • Figure 13 shows a specific data compression ratio distribution diagram, where GZIP compression refers to directly compressing all sample data; Genozip index-free compression refers to using the Genozip tool to compress all sample data without using the project-specific compressed reference sequence constructed in step (3) above; Genozip indexed compression refers to using the Genozip tool to compress all sample data using the project-specific compressed reference sequence constructed in step (3) above.
  • This example uses the Metaphlan-based species identification process (github.com/MGI-EU/MMHP_SOP_rmhost) to obtain the species composition of each sample, using the Fastq files before and after quality value degeneration as input, and then performs correlation statistics on the analysis results obtained before and after quality value degeneration for each sample.
  • the statistical method is as follows:
  • FIG. 14 is a statistical graph of the Pearson correlation coefficients of the species composition of the 233 samples before and after quality value degeneration. As shown in FIG. 14, the correlation coefficients of the species composition of all samples before and after quality value degeneration are all greater than 0.999, indicating that the lossy compression scheme used in this embodiment has almost no effect on the downstream species composition analysis. Therefore, the reference index in the embodiments of the present disclosure and the compression scheme based on that index achieve efficient compression without affecting the composition of the data, that is, the compressed data retain high integrity, high accuracy and high fidelity.
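For reference, the Pearson correlation coefficient between two species composition profiles of the same sample can be computed as sketched below. The function name is chosen for this sketch, and the abundance vectors in the usage note are hypothetical rather than taken from the project data.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long abundance vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length with at least 2 entries")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)
```

Each vector holds the relative abundance of the same ordered list of species, once computed from the original Fastq file and once from the file with degenerated quality values; a coefficient close to 1 means the two profiles are nearly identical.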
  • The terms “first” and “second” are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of the indicated technical features.
  • a feature defined as “first” or “second” may explicitly or implicitly include one or more of the features.
  • the meaning of “plurality” is two or more, unless otherwise clearly and specifically defined.

Abstract

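A method for constructing a reference sequence for metagenomic data compression is proposed, comprising: constructing a basic reference sequence database according to the sample source of the metagenomic data; constructing an index of the basic reference sequence database based on the basic reference sequence database; aligning a first read sequence against the basic reference sequence database according to the index of the basic reference sequence database to obtain an alignment result, wherein the first read sequence is the read sequence of a randomly selected subset of samples in the metagenomic data to be compressed; and determining the sequence abundance distribution of the first read sequence according to the alignment result, and constructing the reference sequence for metagenomic data compression.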

Description

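Reference sequence construction method, metagenomic data compression method and electronic device
Technical Field
The present disclosure relates to the technical field of biological data compression, and in particular to a reference sequence construction method, a metagenomic data compression method and an electronic device.
Background
The metagenome is the sum of all microbial genomes in an environment. Metagenomics is a new microbial research method that takes the genomes of the microbial populations in environmental samples as its research object, uses functional gene screening and/or sequencing analysis as its research means, and aims to study microbial diversity, population structure, evolutionary relationships, functional activity, mutual cooperation, and relationships with the environment. The study of metagenomic data frees researchers from species boundaries, allowing them to develop multi-species genetic resources more effectively and to reveal the laws of life at a higher and more complex level.
The rapid decline in the cost of high-throughput sequencing has greatly increased the output of genomic data, posing major challenges for data storage and transmission. Genetic data are mainly stored in the Fastq format; the distributions of their sequence information and quality values are highly random, so general-purpose compression software such as gzip cannot compress them efficiently. Index-based Fastq compression tools in the related art align short read sequences (Reads) to a reference genome and convert sequence information into position information, thereby improving compression efficiency. This strategy depends heavily on the completeness of the reference gene sequence, whereas the species composition of metagenomic data is complex, so a stable reference sequence cannot deliver a significant improvement in compression efficiency.
Therefore, there is an urgent need for a method of constructing an effective metagenomic reference sequence, and a metagenomic data compression method based on that sequence, in order to improve data compression efficiency.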
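Summary of the Invention
To this end, embodiments of the present disclosure provide a method for constructing a reference sequence for metagenomic data compression, a metagenomic data compression method, a metagenomic data compression device, an electronic device, a non-transitory computer-readable storage medium, a computer program product, and a computer program.
Embodiments of the first aspect of the present disclosure propose a method for constructing a reference sequence for metagenomic data compression, comprising: constructing a basic reference sequence database according to the sample source of the metagenomic data; constructing an index of the basic reference sequence database based on the basic reference sequence database; aligning a first read sequence against the basic reference sequence database according to the index of the basic reference sequence database to obtain an alignment result, wherein the first read sequence is the read sequence of a randomly selected subset of samples in the metagenomic data to be compressed; and determining the sequence abundance distribution of the first read sequence according to the alignment result, and constructing the reference sequence for metagenomic data compression.
In some embodiments, constructing the basic reference sequence database according to the sample source of the metagenomic data comprises: obtaining the corresponding reference genomes from public databases according to the sample source of the metagenomic data and pooling them to obtain the basic reference sequence database.
In some embodiments, constructing the index of the basic reference sequence database based on the basic reference sequence database comprises: where a single reference genome in the basic reference sequence database comprises a first subsequence and a second subsequence, merging the first subsequence and the second subsequence and retaining the number of the reference genome to obtain a subsequence-merged reference genome; and constructing the index of the basic reference sequence database based on the subsequence-merged reference genome.
In some embodiments, aligning the first read sequence against the basic reference sequence database according to the index of the basic reference sequence database comprises: aligning the first read sequence to each subsequence-merged reference genome based on the index of the basic reference sequence database; and, where the first read sequence aligns to a subsequence-merged reference genome, recording the number of the reference genome to which the read sequence aligned.
In some embodiments, determining the sequence abundance distribution of the first read sequence according to the alignment result and constructing the reference sequence for metagenomic data compression comprises: counting, in the alignment result, the number of first read sequences aligned to the number of each reference genome, to obtain the sequence abundance distribution of the first read sequence; and ranking the reference genomes according to the sequence abundance and selecting the top X reference genomes to construct the reference sequence for metagenomic data compression. In some embodiments, X may be 1000.
In some embodiments, constructing the reference sequence for metagenomic data compression further comprises: according to the ranking, selecting the reference genomes whose sequence abundance proportions sum to more than Y% to construct the reference sequence for metagenomic data compression. In some embodiments, Y may be 80.
In some embodiments, the method for constructing a reference sequence for metagenomic data compression further comprises: splitting the basic reference sequence database into sub-basic reference sequence databases; constructing an index of each sub-reference sequence database based on the respective split-out sub-basic reference sequence database; and, based on the indexes of the sub-reference sequence databases, aligning the first read sequence against each of the sub-basic reference sequence databases to obtain a second alignment result, wherein the second alignment result comprises a sub-result file for each of the sub-basic reference sequence databases.
In some embodiments, the method for constructing a reference sequence for metagenomic data compression further comprises: counting, in each sub-result file, the number of first read sequences aligned to each sub-basic reference sequence database, to obtain the sequence abundance distribution of the first read sequence in each sub-result file; performing a first ranking of the reference genomes according to the sequence abundance in each sub-result file, and selecting the reference genomes ranked in the top X by sequence abundance in each sub-result file to construct a sub-reference sequence database; performing a second ranking of the reference genomes in the sub-reference sequence database according to the sequence abundance; and selecting the reference genomes ranked in the top X by sequence abundance in the sub-reference sequence database to construct the reference sequence for metagenomic data compression.
In some embodiments, constructing the reference sequence for metagenomic data compression further comprises: according to the first ranking, selecting the reference genomes whose sequence abundance proportions sum to more than Y% in each sub-result file to construct the sub-reference sequence database; and, according to the second ranking, selecting the reference genomes whose sequence abundance proportions sum to more than Y% in the sub-reference sequence database to construct the reference sequence for metagenomic data compression. In some embodiments, Y may be 80.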
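As a non-limiting illustration, the two rounds of top-X selection across sub-result files might be sketched as follows. The function name and the data layout (one dictionary of per-genome read counts per sub-result file) are assumptions for this sketch, which also assumes that each reference genome belongs to exactly one sub-basic reference sequence database.

```python
from typing import Dict, List


def two_round_selection(sub_results: List[Dict[str, int]], x: int) -> List[str]:
    """Select the top x genomes per sub-result file, then the top x overall."""
    candidates: Dict[str, int] = {}
    for counts in sub_results:
        # First ranking: top x genomes within one sub-result file.
        ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
        for genome_id, count in ranked[:x]:
            candidates[genome_id] = count
    # Second ranking: top x genomes among all first-round candidates.
    final = sorted(candidates.items(), key=lambda item: item[1], reverse=True)
    return [genome_id for genome_id, _ in final[:x]]
```

Because each sub-database is indexed and aligned separately, the peak memory needed at any one time is bounded by the largest sub-database rather than by the whole basic reference sequence database.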
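In some embodiments, the method for constructing a reference sequence for metagenomic data compression further comprises: performing a first and/or second screening on the alignment result, wherein the first screening comprises selecting the read sequences without insertions and/or deletions in the alignment result, and the second screening comprises selecting the read sequences below a mismatch threshold. In some embodiments, the mismatch threshold may be 3.
Embodiments of the second aspect of the present disclosure propose a metagenomic data compression method, the method comprising: constructing a reference sequence for metagenomic data compression according to the method for constructing a reference sequence for metagenomic data compression proposed in any embodiment of the first aspect of the present disclosure above; and aligning a second read sequence with the reference sequence and recording the alignment result to obtain compressed data of the metagenomic data, wherein the second read sequence is the read sequence of the samples to be compressed in the metagenomic data.
In some embodiments, aligning the second read sequence with the reference sequence and recording the alignment result comprises: where the number of mismatched bases between the second read sequence and the reference sequence is less than R1, recording the position of the second read sequence on the reference sequence; where the number of mismatched bases between the second read sequence and the reference sequence is greater than R1 and less than R2, recording the positions of the matched bases of the second read sequence on the reference sequence, and recording the base information of the mismatched bases; and where the number of mismatched bases between the second read sequence and the reference sequence is greater than R2, recording the second read sequence. In some embodiments, R1, R2 and R3 are all integers greater than or equal to 0. In some embodiments, R1 is 0 to 5 and R2 is 3 to 10. In some embodiments, R1 is 0 to 2 and R2 is 3 to 8. In some embodiments, R1 is 0 and R2 is 3.
In some embodiments, the metagenomic data compression method further comprises degenerating the quality values of the metagenomic data.
In some embodiments, degenerating the quality values of the metagenomic data comprises: counting the base quality values in the metagenomic data to obtain the distribution of the quality values within M quality value ranges; and mapping the quality values within the M ranges to M mapping values respectively, so as to degenerate the quality values of the metagenomic data. In some embodiments, M is an integer greater than 0.
In some embodiments, the metagenomic data compression method further comprises: where the bases with quality values lower than Q account for less than a set proportion N of all bases in the metagenomic data, mapping the quality values of all bases in the metagenomic data to degenerate the quality values of the metagenomic data.
In some embodiments, the metagenomic data compression method further comprises: where the bases with quality values lower than Q account for a proportion of all bases in the metagenomic data that is higher than or equal to the set proportion N, mapping the quality values of the bases with quality values higher than Q in the metagenomic data, so as to degenerate the quality values of the metagenomic data.
In some embodiments, the metagenomic data compression method further comprises: where the bases with quality values lower than Q account for a proportion of all bases in the metagenomic data that is higher than or equal to the set proportion N, retaining the original quality values of the bases with quality values lower than Q in the metagenomic data.
In some embodiments, Q is the quality value corresponding to a base error probability of 0.01% to 1%. In some embodiments, N is greater than or equal to 10%. In some embodiments, N is greater than or equal to 20%.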
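Embodiments of the third aspect of the present disclosure propose a metagenomic data compression device, the device comprising: a reference sequence construction module, configured to construct a reference sequence for metagenomic data compression according to the method for constructing a reference sequence for metagenomic data compression described in any embodiment of the first aspect of the present disclosure; and a data compression module, configured to align the read sequences in the metagenomic data with the reference sequence and record the alignment result to obtain compressed data of the metagenomic data.
In some embodiments, the device further comprises: a quality value degeneration module, configured to degenerate the quality values of the metagenomic data.
Embodiments of the fourth aspect of the present disclosure propose an electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method for constructing a reference sequence for metagenomic data compression described in any embodiment of the first aspect of the present disclosure, the method comprising: constructing a basic reference sequence database according to the sample source of the metagenomic data; constructing an index of the basic reference sequence database based on the basic reference sequence database; aligning a first read sequence against the basic reference sequence database according to the index of the basic reference sequence database to obtain an alignment result, wherein the first read sequence is the read sequence of a randomly selected subset of samples in the metagenomic data to be compressed; and determining the sequence abundance distribution of the first read sequence according to the alignment result, and constructing the reference sequence for metagenomic data compression.
Embodiments of the fifth aspect of the present disclosure propose a non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method for constructing a reference sequence for metagenomic data compression described in any embodiment of the first aspect of the present disclosure.
Embodiments of the sixth aspect of the present disclosure propose a computer program product comprising a computer program which, when executed by a processor, implements the method for constructing a reference sequence for metagenomic data compression described in any embodiment of the first aspect of the present disclosure.
The embodiments of the present disclosure achieve the following beneficial effects:
The method for constructing an effective metagenomic reference sequence and the metagenomic data compression method based on that sequence, as proposed in the present disclosure, can construct an effective reference sequence for metagenomic data compression. With the aid of an index-dependent compression tool, they can greatly improve the compression efficiency of metagenomic data (the average compression ratio achieved is nearly 4 times the conventional compression ratio) and relieve the storage and transmission pressure of metagenomic data with large sample sizes.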
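Brief Description of the Drawings
To explain the technical solutions in the embodiments of the present disclosure more clearly, the drawings required in the embodiments are briefly introduced below. Obviously, the drawings described below are some embodiments of the present disclosure, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a diagram of a method for constructing a reference sequence for metagenomic data compression according to an embodiment of the present disclosure;
FIG. 2 is a diagram of a technical solution for constructing a reference sequence for metagenomic data compression according to an embodiment of the present disclosure;
FIG. 3 is a diagram of a reference sequence construction method based on reference genomes with high sequence abundance according to an embodiment of the present disclosure;
FIG. 4 is a flow chart of a metagenomic data compression method according to an embodiment of the present disclosure;
FIG. 5 is a flow chart of metagenomic data compression based on a reference sequence according to an embodiment of the present disclosure;
FIG. 6 is an example diagram of a quality value mapping table according to an embodiment of the present disclosure;
FIG. 7 is a flow chart of quality value degeneration according to an embodiment of the present disclosure;
FIG. 8 is a flow chart of conditional quality value degeneration according to one embodiment of the present disclosure;
FIG. 9 is a flow chart of conditional quality value degeneration according to another embodiment of the present disclosure;
FIG. 10 is a diagram of a metagenomic data compression method according to another embodiment of the present disclosure;
FIG. 11 is a structural diagram of a metagenomic data compression device according to an embodiment of the present disclosure;
FIG. 12 shows a block diagram of an exemplary computer device suitable for implementing the embodiments of the present disclosure;
FIG. 13 is a distribution diagram of specific data compression ratios according to an embodiment of the present disclosure;
FIG. 14 is a statistical graph of the Pearson correlation coefficients of the species composition of 233 samples before and after quality value degeneration.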
具体实施方式
下面结合具体实施方式对本公开进行进一步的详细描述,给出的实施例仅为了阐明本公开,并非限制本公开的范围。以下提供的实施例可作为本技术领域普通技术人员进行进一步改进的指南,并不以任何方式构成对本公开的限制。
本公开是基于发明人的以下认识做出的:
相关技术中基于索引(本公开实施例中也称索引依赖)的压缩工具被用于宏基因组数据的压缩。在针对宏基因组数据的索引依赖的压缩工具中,通常通过以下两种方法实现参考序列的构建和数据的压缩。
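Method 1: build a universal reference sequence from public databases. For data of well-defined origin, such as gut microbes, the reference sequence can be built by pooling the genomes of all possible species in the database.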
方法2:基于物种组成及序列组装构建样本特异性参考序列。MetaCRAM(Kim,M.et al.,2016)及MCUIUC(Ligo,J.G.et al.,2013)首先通过宏基因组物种鉴定工具,对数据的物种组成进行快速鉴定,基于物种鉴定结果,用户选择丰度(Species Abundance)高于特定阈值的物种作为参考基因组来源,用于构建合适的参考基因组,并将比对失败的Reads进行从头组装,用于构建新的参考序列。最后分别基于从数据库中选择的参考序列及从头构建的参考序列,实现对宏基因组数据的压缩。
然而,方法1中的基于公共数据库构建通用参考序列的策略虽能够通过扩大参考基因组的数目而覆盖尽可能多的物种,但是由于微生物种类繁多,使得最后构建完成的参考序列文件极大,对计算机的配置(尤其是内存)有非常高的要求,不利于使用小规模计算集群或个人计算机的用户进行操作。
方法2中基于物种组成及序列组装构建样本特异性参考序列的策略虽能在获得理想压缩效率的同时,将内存需求控制在可接受范围内,但是在实际操作中,物种鉴定、序列从头组装均需要消耗大量的时间,最终导致数据压缩速度较慢。以MetaCRAM为例,压缩8,230MB的Fastq文件,需耗时73分钟。
本公开实施例提出的用于宏基因组数据压缩的参考序列的构建方法,通过构建项目特异性的参考序列,并结合条件性质量值有损压缩,对宏基因组数据实现了索引依赖的高效数据压缩。本公开实施例提出的用于宏基因组数据压缩的参考序列的构建方法及基于所构建的参考序列的宏基因组数据压缩方法,大幅度提升了宏基因组数据的压缩效率,有效缓解了大样本量的宏基因组数据的储存压力和传输压力。
本公开第一方面实施例提出了一种用于宏基因组数据压缩的参考序列的构建方法。
图1为根据本公开实施例的用于宏基因组数据压缩的参考序列的构建方法示意图。如图1所示,该方法可以包括:步骤101-104。
步骤101:根据宏基因组数据的样本来源,构建基础参考序列数据库。
本公开实施例中,“样本来源”为待压缩的宏基因组数据样本的提取环境。在本公开实施例中,样本可以为肠道微生物,水源微生物,土壤微生物等,样本来源对应可以为肠道、水源、土壤等。
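In embodiments of the present disclosure, based on the project background information (e.g. gut microbes, water microbes, soil microbes) or the sample source, a corresponding public database can be selected, and its commonly used sequences downloaded and merged to form the base reference sequence database used for building the alignment index. In embodiments of the present disclosure, the gut microbial database may be GMrepo (Dai, D. et al., 2022), gutMEGA (Zhang, Q. et al., 2021) or uhgg (Almeida, A. et al., 2021).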
步骤102:基于基础参考序列数据库,构建基础参考序列数据库的索引。
在本公开实施例中,使用索引依赖的比对软件或脚本对基础参考序列数据库进行索引构建。在一些实施例中,索引依赖的比对软件可以为bwa(Burrows-Wheeler Aligner,Li H.and Durbin R.(2009)Fast and accurate short read alignment with Burrows-Wheeler Transform.Bioinformatics,25:1754-60.[PMID:19451168])、Bowtie(Langmead B,Trapnell C,Pop M,Salzberg SL.Ultrafast and memory-efficient alignment of short DNA sequences to the human genome.Genome Biol 10:R25)、Bowtie2(Langmead B,Salzberg S.Fast gapped-read alignment with Bowtie 2.Nature Methods.2012,9:357-359)。
步骤103:根据基础参考序列数据库的索引,将第一读长序列与基础参考序列数据库进行比对,获得比对结果,其中第一读长序列为待压缩的宏基因组数据中随机选择的部分样本的读长序列。
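In embodiments of the present disclosure, the data (i.e. read sequences, Reads) of a randomly selected subset of the samples to be compressed can be aligned to the base reference sequence database. Compared with aligning the full, large set of samples, running the preliminary alignment on a fixed number of randomly selected samples improves alignment efficiency and saves computing resources.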
步骤104:根据比对结果确定第一读长序列的序列丰度分布,构建用于宏基因组数据压缩的参考序列。
本公开实施例中,序列丰度是指在比对中,样本Reads分别比对到各个参考基因组的数量。在本公开实施例中,通过统计步骤103产出的比对结果中选定的部分样本的Reads(即第一读长序列)比对到各个参考基因组的数目,获得部分样本的Reads的序列丰度分布;根据序列丰度对基础参考序列数据库中的参考基因组进行排序,并根据用户自身的计算配置、对压缩比的要求或其它个性化需求,选择排名靠前的参考基因组,构建用于所有样本宏基因组数据压缩的参考序列。
图2为根据本公开实施例的构建用于宏基因组数据压缩的参考序列的技术方案图。如图2所示,本公开实施例提出的用于宏基因组数据压缩的参考序列的构建方法可以包括根据项目信息从公共微生物数据库中确定特定大类的微生物数据库,并从特定大类的微生物数据库中获取基础参考序列数据库;使用待压缩的全部样本中的部分样本(即测试样本)与基础参考序列数据库进行比对,获得部分样本的Reads在基础参考序列数据库中的各个参考基因组的序列丰度并排序,以获得部分样本的序列丰度分布;选择高丰度物种的参考基因组(即序列丰度排名靠前的基础参考序列数据库中的参考基因组)进行合并,由此获得项目特异性(即针对该项目)的参考序列,以用于后续宏基因组数据的索引依赖的数据压缩。本公开实施例提出的方法,通过随机选择特定数目的样本进行前期比对并根据序列丰度构建参考序列,能够有效提升比对效率,节约运算资源。同时,通过基于部分样本的序列丰度分布选择排名靠前的代表性参考基因组构建用于压缩的参考序列,大大降低了参考序列的数据量,有利于后期的高效比对和压缩。
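In embodiments of the present disclosure, step 102 may further include: where a single reference genome in the base reference sequence database comprises a first subsequence and a second subsequence, merging the first and second subsequences while retaining the identifier of the reference genome, to obtain a subsequence-merged reference genome; and building the index of the base reference sequence database based on the subsequence-merged reference genomes.
In embodiments of the present disclosure, the first or second subsequence may be a fragment sequence in the FASTA file of a single reference genome, such as the sequence of each chromosome of that genome. By merging the subsequences in the FASTA file of a single reference genome and keeping only the genome identifier as the sole sequence description line, the size of the base reference sequence database is reduced and the later tallying of alignment results is simplified.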
在本公开实施例中,还可将基础参考序列数据库拆分为若干个子基础参考序列数据库并分别基于拆分出的子基础参考序列数据库构建子参考序列数据库的索引;基于子参考序列数据库的索引,将随机选择的部分样本的读长序列(即第一读长序列)分别与每个子基础参考序列数据库进行比对,以获得第二比对结果,其中第二比对结果包括基于各个所述子基础参考序列数据库的子结果文件。可以理解的是,在实际应用中,部分用户的运算配置不足以基于体积较大的基础参考序列数据库进行运算,因此通过将基础参考序列数据库拆分为子基础参考序列数据库并分别基于该子基础参考序列数据库进行运算,有效降低了对用户运算配置的要求,使本公开实施例提出的构建索引方法的应用门槛降低,使其应用范围更加广泛。
图3为根据本公开实施例的基于高序列丰度的参考基因组的参考序列构建方法图。如图3所示,随机选定的部分待压缩样本数(即测试样本数)为A,基础参考序列数据库的个数(即基础序列索引文件 数目)为B,其中B=1时对应于不对基础参考序列数据库进行拆分;B≥2时对应于将基础参考序列数据库拆分为若干个子基础参考序列数据库。X为用户选定的针对参考基因组的序列丰度选择阈值。
在本公开实施例中,可以在不拆分基础参考序列数据库的情况下,针对单个整体的基础参考序列数据库进行比对和序列丰度筛选以确定用于宏基因组数据压缩的参考序列数据库。具体地,在合并了单个参考基因组的子序列并仅保留了单个基因组的编号的情况下,可以基于基础参考序列数据库的索引,将随机选择的部分样本的读长序列(即第一读长序列)比对至每个子序列合并参考基因组上;在部分样本的读长序列比对到子序列合并参考基因组的情况下,记录读长序列比对到的参考基因组的编号。其中比对软件可以为Bwa、Bowtie、Bowtie2或本地编写的索引依赖的脚本或软件。在本公开实施例中,在得到比对结果后,统计比对结果中部分样本的读长序列比对到各个参考基因组的编号的数目,以获得部分样本的读长序列的序列丰度分布;根据序列丰度对参考基因组进行排序,选择前X位的参考基因组构建用于宏基因组数据压缩的参考序列。
如图3所示,当基础参考序列数据库的个数B为1时,将测试样本(即部分样本,数目为A)中的每个样本与该基础参考序列数据库进行比对,得到A个样本在该基础参考序列数据库中每个参考基因组的序列丰度;将A个样本的序列丰度合并并排序,以得到A个测试样本的整体序列丰度分布;选择前X个参考基因组构建用于宏基因组数据压缩的参考序列。
在本公开实施例中,可以在拆分基础参考序列数据库的情况下,针对拆分后的子基础参考序列数据库进行比对和序列丰度筛选以确定用于宏基因组数据压缩的参考序列数据库。具体地,如图3所示,在B≥2时,将测试样本(即部分样本,数目为A)中的每个样本与每个子基础参考序列数据库进行比对以得到子结果文件;分别统计各个子结果文件中测试样本的读长序列比对至每个子基础参考序列数据库的数目,以获得测试样本的读长序列在各个子结果文件中的序列丰度分布,其中子结果文件中共B个,包含A*B个序列丰度分布;根据B个子结果文件中的序列丰度对参考基因组进行第一排序,分别选择B个子结果文件中序列丰度前X位的参考基因组构建子参考序列数据库,即子参考序列数据库中包括B*X个参考基因组;根据序列丰度,对所述子参考序列数据库中的B*X个参考基因组进行第二排序,并选择子参考序列数据库中序列丰度分布前X位的参考基因组构建用于宏基因组数据压缩的参考序列。
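The two-round selection described for B≥2 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, each reference genome is assumed to belong to exactly one sub-database, and the per-sub-database read counts are assumed to have already been summed over the A test samples.

```python
def top_x(counts, x):
    """Return the x genomes with the highest read counts as an ordered dict."""
    # Sort by descending count; break ties by genome ID so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:x])


def two_stage_select(per_subdb_counts, x):
    """Select X genomes from B sub-databases (hypothetical helper).

    per_subdb_counts: list of B dicts mapping genome ID -> aligned read count.
    """
    candidates = {}
    for counts in per_subdb_counts:
        # First ranking: keep the top X genomes of each sub-result (up to B*X in total).
        candidates.update(top_x(counts, x))
    # Second ranking: keep the top X genomes among the pooled candidates.
    return list(top_x(candidates, x))
```

Because only the top X of each sub-database survive the first round, the second round never has to rank more than B*X genomes, which keeps memory use small even when the full database is large.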
可以理解的是,在本公开实施例中,在将基础参考序列数据库拆分为若干个子基础参考序列数据库的情况下,也可在拆分前将基础参考序列数据库中的单个参考基因组内的子序列合并,并保留参考基因组的编号以进行后续比对;或者在拆分后,将拆分所得的子基础参考序列数据库中的单个参考基因组内的子序列合并,并保留参考基因组的编号以进行后续比对。
在本公开实施例中,序列丰度选择阈值X可以是用户根据数据情况、个人计算资源或对压缩比的需求等选定的。在一些实施例中,X可以为200至5000。在一些实施例中,X可以为500至3000。在一些实施例中,X可以为1000。
在本公开实施例中,还可以根据序列丰度的统计和排序结果,选择序列丰度占比之和大于Y%的参考基因组构建用于宏基因组数据压缩的参考序列,其中占比为某个参考基因组对应的序列丰度占总序列丰度的比例,选择序列丰度占比之和大于Y%的参考基因组即为按照序列丰度的统计和排序,选择排名前若干个参考基因组,使选定的前若干个参考基因组的序列丰度的占比之和大于Y%。
可以理解的是,在本公开实施例中,Y可以根据样本量、期望压缩比和用户运算资源确定。在一些 实施例中,Y可以为20至80。在一些实施例中,Y可以为40至80。在一些实施例中,Y可以为80。在本公开实施例中,与使用基础参考序列数据库中的全部参考基因组相比,代表性参考基因组的使用并不影响后续数据压缩的准确性,即基于数据量庞大的基础参考序列数据库中的全部参考基因组所构建的索引进行的压缩,其压缩后的数据构成与本公开实施例中使用代表性参考基因组压缩后的数据构成相关性极高。因此,通过选择序列丰度排名靠前的具有代表性的参考基因组进行压缩索引的构建,有效降低了压缩索引的体积,大幅减少了后续压缩运算量,并保证了压缩数据的高保真性。
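The two selection criteria above (top X genomes, or the smallest leading set whose share of aligned reads exceeds Y%) can be sketched as follows. The function names are hypothetical, and the input is assumed to be a flat list holding one reference genome ID per aligned read.

```python
from collections import Counter


def rank_genomes(hit_genome_ids):
    """Count reads per genome and return (genome_id, count) pairs, most abundant first."""
    counts = Counter(hit_genome_ids)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def select_top_x(ranked, x):
    """Keep the X genomes with the highest sequence abundance."""
    return [genome_id for genome_id, _ in ranked[:x]]


def select_by_cumulative_share(ranked, y_percent):
    """Keep leading genomes until their summed share of reads exceeds Y%."""
    total = sum(count for _, count in ranked)
    chosen, running = [], 0
    for genome_id, count in ranked:
        chosen.append(genome_id)
        running += count
        if running * 100 > y_percent * total:
            break
    return chosen
```

For example, with 6, 3 and 1 reads aligned to three genomes, `select_by_cumulative_share(ranked, 80)` stops after the second genome, because 9 of 10 reads (90%) are then covered.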
在本公开实施例中,在使用随机选择的部分样本与基础参考序列数据库或子基础参考序列数据库比对后,可以对比对结果进行第一和/或第二筛选,其中第一筛选包括:在比对结果中选择无插入和/或缺失的读长序列;第二筛选包括:选择低于错配阈值的读长序列。
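In embodiments of the present disclosure, after the alignment result has been obtained with Bwa, Bowtie2 or a script of equivalent function, the first and/or second filter can be applied to the result file produced by the alignment (for example in Bam or Sam format) as quality control. In some embodiments, in the first filter, reads without insertions and/or deletions can be selected according to the CIGAR (Concise Idiosyncratic Gapped Alignment Report) value in the result file, where the absence of insertions and deletions is indicated by 100M or 150M (100 and 150 are read lengths of 100 bp and 150 bp; M denotes an alignment match, so 100M or 150M means the full 100 bp or 150 bp of the read is aligned to the reference without gaps, while mismatched bases inside an M run are handled by the second filter). In some embodiments, in the second filter, reads whose number of mismatches is below the mismatch threshold can be selected according to the NM value in the result file. In some embodiments, the mismatch threshold may be 1 to 10. In some embodiments, the mismatch threshold may be 1 to 5. In some embodiments, the mismatch threshold may be 3. Filtering the reads in the alignment result removes reads with a high degree of mismatch and so raises the overall reliability of the reads, which in turn makes the choice of reference genomes based on the sequence abundance distribution of the filtered reads more accurate.
The method for constructing a reference sequence for metagenomic data compression provided by embodiments of the present disclosure merges the subsequences of each single reference genome in the base reference sequence database and keeps only its identifier, and/or splits the base reference sequence database into several sub-databases. This addresses the problem that the base reference sequence database is large and that users of small compute clusters or personal computers cannot build, in one pass, the alignment index for a single FASTA file containing tens of thousands of reference genomes. At the same time, by randomly selecting a subset of samples for the preliminary alignment and reference construction, the method keeps the coverage of the constructed reference over the data to be compressed as high as possible while greatly reducing the data input and output during alignment, improving the efficiency of reference construction and saving computing and storage resources.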
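A minimal sketch of the two filters on a single SAM record is shown below. It assumes a tab-separated SAM text line carrying an `NM:i:` tag, treats "below the threshold" as a strict comparison, and uses a hypothetical function name; it does not represent the scripts actually used in the disclosure.

```python
import re

# One CIGAR operation: a length followed by an operation code from the SAM format.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def passes_filters(sam_line, mismatch_threshold=3):
    """Return True if an alignment has no indels and fewer mismatches than the threshold."""
    fields = sam_line.rstrip("\n").split("\t")
    flag, cigar = int(fields[1]), fields[5]
    if flag & 0x4 or cigar == "*":
        return False  # unmapped read
    ops = CIGAR_OP.findall(cigar)
    # First filter: a single M run (e.g. 100M or 150M), i.e. no insertion, deletion or clipping.
    if len(ops) != 1 or ops[0][1] != "M":
        return False
    # Second filter: the NM tag holds the edit distance, which equals the
    # mismatch count once indels have been excluded.
    nm = None
    for tag in fields[11:]:
        if tag.startswith("NM:i:"):
            nm = int(tag[5:])
    return nm is not None and nm < mismatch_threshold
```

Reads that pass can then be tallied per reference genome using the RNAME column (the third field), which holds the retained genome identifier when subsequences have been merged.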
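An embodiment of the second aspect of the present disclosure provides a metagenomic data compression method. Figure 4 is a flowchart of the metagenomic data compression method according to an embodiment of the present disclosure. As shown in Figure 4, the method includes: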
步骤201:根据上述第一方面实施例中的任一实施例所述的用于宏基因组数据压缩的参考序列的构建方法,构建用于宏基因组数据压缩的参考序列;
步骤202:将第二读长序列与参考序列进行比对并记录比对结果,以获得宏基因组数据的压缩数据,其中第二读长序列为宏基因组数据中待压缩样本的读长序列。
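In embodiments of the present disclosure, once the reference sequence for metagenomic data compression has been built from the first read sequences (the read sequences of a randomly selected subset of the samples in the metagenomic data to be compressed), the read sequences of some or all samples in the metagenomic data, i.e. the second read sequences, can be compressed against that reference sequence. The second read sequences may be the same as or different from the first read sequences. Depending on the user's needs, all or only part of the samples in the metagenomic data can be compressed against the constructed reference, which makes the compression flexible as well as efficient.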
图5为根据本公开实施例的基于参考序列的宏基因组数据压缩流程图。如图5所示,在根据本公开第一方面实施例中的任一实施例构建好参考序列后,将待压缩的宏基因组数据中的Reads(Fastq文件)输入并与构建好的参考序列进行比对。
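In some embodiments, when the number of mismatched bases between a read sequence in the metagenomic data (i.e. a second read sequence) and the reference sequence is less than R1, the position of the read sequence on the reference sequence is recorded; when the number of mismatched bases is greater than R1 and less than R2, the positions on the reference sequence of the matched bases of the read sequence are recorded, together with the base information of the mismatched bases of the read; when the number of mismatched bases is greater than R2, the read sequence itself is recorded. In some embodiments, R1 and R2 are both integers greater than or equal to 0. In some embodiments, R1 is 0 to 5 and R2 is 3 to 10. In some embodiments, R1 is 0 to 2 and R2 is 3 to 8. In some embodiments, R1 is 0 and R2 is 3.
In some embodiments: i. when a read sequence in the metagenomic data matches the reference sequence exactly (i.e. R1=0), the position of the read sequence on the reference sequence is recorded; ii. when the number of mismatched bases between the read sequence and the reference sequence is at least 1 and at most 3 (i.e. R2=3), the positions of the matched bases on the reference sequence are recorded, together with the base information of the mismatched bases; iii. when the read sequence cannot be matched to the reference sequence (i.e. there are more than 3 mismatched bases), the read sequence itself is recorded.
In some embodiments, in step ii, when the Reads mismatch the reference sequence and the number of mismatched bases is fewer than 5 (i.e. 1 to 4 mismatched bases), the positions of the matched bases of the Reads on the reference sequence are recorded, that is, the matched bases are converted to position information for storage, and the actual base information of the mismatched bases is recorded. In some embodiments, the number of mismatched bases in step ii may be 1 to 3.
In some embodiments, in step iii, when the Reads mismatch the reference sequence and the number of mismatched bases is 5 or more, the sequence information of the Reads is recorded, that is, the actual base information of the Reads is kept. In some embodiments, the number of mismatched bases in step iii may be any positive integer greater than 3.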
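The three recording cases can be illustrated with the following sketch. It assumes an ungapped alignment at a known 0-based position, reads the R1=0, R2=3 embodiment as "exact match", "1 to 3 mismatches" and "more than 3 mismatches", and uses hypothetical record layouts and function names; real index-dependent compressors store this information in their own formats.

```python
def encode_read(read, reference, pos, max_mismatches=3):
    """Encode a read against the reference at 0-based position pos (ungapped)."""
    window = reference[pos:pos + len(read)]
    if len(window) < len(read):
        return ("raw", read)  # read runs past the end of the reference
    mismatches = [(i, base) for i, (base, ref) in enumerate(zip(read, window)) if base != ref]
    if not mismatches:
        # Case i: exact match, store only position and length.
        return ("pos", pos, len(read))
    if len(mismatches) <= max_mismatches:
        # Case ii: few mismatches, store position plus the differing bases.
        return ("pos+mm", pos, len(read), mismatches)
    # Case iii: too many mismatches, store the read itself.
    return ("raw", read)


def decode_read(record, reference):
    """Rebuild the original read from a record produced by encode_read."""
    if record[0] == "raw":
        return record[1]
    pos, length = record[1], record[2]
    bases = list(reference[pos:pos + length])
    if record[0] == "pos+mm":
        for offset, base in record[3]:
            bases[offset] = base
    return "".join(bases)
```

The decoder shows why the scheme is lossless for the bases: every case keeps enough information to rebuild the read exactly, and only the storage cost differs.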
在本公开实施例中,宏基因组数据压缩方法还包括:对宏基因组数据的质量值进行简并。
可以理解的是,宏基因组数据多以Fastq文件的形式存储。Fastq格式共分为4行,其中第4行中的字符对应代表该序列中每一位碱基的被识别错误的概率,即碱基质量值(Quality Score,Q-score)。也即,碱基质量值是碱基识别出错概率的整数映射,可以是Q=-10*lgP,其中P为碱基识别出错的概率。
碱基质量值根据不同测序平台,具有不同的表示体系,例如Phred33体系和Phred64体系等,这些体系中使用不同的字符表示碱基的质量值,但均可以通过Q=-10*lgP这一公式换算为碱基的错误概率。在本公开实施例中,依据碱基出错的可能性,碱基的质量值被划分为0至40,其中0代表错误概率为100%,40代表错误概率为0.01%。
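The relationship Q=-10*lgP and the character encoding of quality values can be made concrete with a short sketch. The function names are illustrative, and the default offset of 33 corresponds to the Phred33 system mentioned above (64 would be used for Phred64).

```python
import math


def phred_to_error_prob(q):
    """Convert a quality value Q to the base-calling error probability P."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P to the nearest integer quality value Q."""
    return round(-10 * math.log10(p))


def decode_quality_line(line, offset=33):
    """Turn the 4th line of a Fastq record into a list of integer quality values."""
    return [ord(ch) - offset for ch in line]
```

With these definitions, a quality value of 40 corresponds to an error probability of 0.0001 (0.01%) and a quality value of 0 to a probability of 1 (100%), matching the 0 to 40 scale used in this disclosure.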
在本公开实施例中,对宏基因组数据的质量值进行简并,包括:对宏基因组数据中的碱基质量值进行统计,以获得质量值在M个质量值范围内的分布;分别将M个范围内的质量值对应映射到M个映射值上,以简并宏基因组数据的质量值。
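In embodiments of the present disclosure, M quality value ranges are set according to different base error probabilities, together with M corresponding mapped values, and base quality values are mapped onto them to complete the binning, where M may be an integer greater than 0, for example any integer from 1 to 100. In one embodiment of the present disclosure, all quality values are divided into 4 tiers (i.e. M=4) according to the error probability each represents: 0 to 3 (error probability >50%), 4 to 19 (error probability 1% to 40%), 20 to 29 (error probability 0.1% to 1%) and 30 to 40 (error probability 0.01% to 0.1%). In another embodiment, all quality values are divided into 3 tiers (i.e. M=3). The specific value of M and the M specific ranges can be determined and adjusted as needed.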
在本公开实施例中,M个具体映射值可以按照实际的数据情况由用户进行调整,本公开对此不作限制。图6为根据本公开实施例的质量值映射表的示例图。如图6所示,可以将质量值0至40划分为M个质量值范围,并以Q1、Q2……QM作为对应的具体映射值。
图7为根据本公开实施例的质量值简并流程图。如图7所示,通过将待压缩Reads的质量值进行统计,并划定不同的阈值范围,如[a,b]、[c,d]、[e,f]、……等,共M个,其中a-f分别代表不同的质量值。例如当碱基的质量值被划分为0至40而M=3时,[a,b]可为0至10;[c,d]可为11至20;[e,f]可为21至40。在将待压缩Reads的碱基质量值分别划入M个阈值范围后,将落入同一阈值范围的碱基映射到同一具体映射值上,从而对待压缩Reads进行简并,由此缩小待压缩数据的体积、减少了冗余的运算量。
本公开发明人在具体运算时发现,在整体质量值较低的宏基因组数据中,中低水平的质量值的波动会影响到部分比对软件的比对质量值(如使用Bowtie2时,比对质量值以MAPQ表示),从而影响下游分析,因此本公开实施例在宏基因组数据质量值简并中,还提出了对宏基因组数据质量值进行条件性简并的技术方案,以减小质量值的有损压缩对于下游分析的影响。
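Specifically, in embodiments of the present disclosure, after the quality values of the Reads to be compressed have been tallied and before the Reads are binned, the method further includes: when the proportion of bases with a quality value below Q among all bases in the metagenomic data is below the set proportion N, mapping the quality values of all bases in the metagenomic data so as to bin the quality values of the metagenomic data.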
在本公开实施例中,在质量值低于Q的碱基的比例占宏基因组数据中所有碱基的比例高于或等于设定比例N的情况下,将宏基因组数据中的质量值高于Q的碱基的质量值进行映射,以简并宏基因组数据的所述质量值。
在本公开实施例中,在质量值低于Q的碱基的比例占宏基因组数据中所有碱基的比例高于或等于设定比例N的情况下,保留宏基因组数据中的质量值低于Q的碱基的原始质量值。
可以理解的是,在本公开实施例中,可以根据宏基因组数据的实际质量值分布以及期望的简并情况确定Q,例如当碱基的质量值被划分为0至40时,Q可以为0至40中的任一整数,即对应为碱基错误概率为100%至0.01%的范围。在本公开实施例中,Q可以为碱基错误概率为0.01%至1%对应的质量值。在一些实施例中,Q可以为碱基错误概率为0.1%至1%对应的质量值。
在本公开实施例中,设定比例N大于或等于20%。在另一些实施例中,N大于或等于10%。
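Figure 8 is a flowchart of conditional quality value binning according to one embodiment of the present disclosure. As shown in Figure 8, base quality values range from 0 to 40 and are divided into 4 quality value ranges (i.e. M=4): 0 to 3 (error probability >50%), 4 to 19 (error probability 1% to 40%), 20 to 29 (error probability 0.1% to 1%) and 30 to 40 (error probability 0.01% to 0.1%), with corresponding mapped values Q1, Q2, Q3 and Q4. Following Figure 7, quality value statistics are computed for the Reads to be compressed, giving the distribution over the 4 quality value ranges, i.e. R1, R2, R3 and R4 (here R1 to R4 denote the proportions in each range, not the mismatch thresholds R1 and R2 used for alignment). It is then determined whether the total proportion of bases in the metagenomic data with a quality value less than or equal to Q=29 is greater than or equal to the set proportion N, i.e. whether R1%+R2%+R3% is greater than or equal to N%. If not, all bases in the data to be compressed are binned according to the quality value mapping table: quality values 0 to 3 are mapped to Q1, 4 to 19 to Q2, 20 to 29 to Q3, and 30 to 40 to Q4. If R1%+R2%+R3% is greater than or equal to N%, bases with a quality value less than or equal to Q=29 are not binned, i.e. their original quality values are kept, and bases with a quality value greater than Q=29 (i.e. greater than or equal to 30) are binned according to the quality value mapping table.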
图9为根据本公开另一实施例的条件性质量值简并的流程图。如图9所示,该流程与图8所示流程不同点仅在于,若R1%+R2%+R3%大于或等于N%,则保留待压缩数据中所有碱基的原有质量值而不进行简并。
图10为根据本公开实施例的宏基因组数据压缩方法图。如图10所示,该方法可以包括用于压缩的索引的构建、待压缩数据(Fastq文件)的条件性质量值简并和基于构建的参考索引的数据压缩。
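In the metagenomic data compression method provided by the embodiment of the second aspect of the present disclosure, a reference sequence is built by the construction method of any embodiment of the first aspect, and the Reads to be compressed are rapidly aligned to it. If a Read aligns exactly to a position, only its position on the reference sequence needs to be recorded; if there are a few mismatches, the positions of the remaining matched bases are recorded together with the mismatched bases; for a Read that cannot be aligned accurately to the reference sequence, the full sequence information is recorded. This greatly improves the compression efficiency of metagenomic data and relieves the storage pressure of large-sample metagenomic data. In addition, the method conditionally bins base quality values before compression: using a set threshold, high quality values are binned while the original values of medium and low quality bases are kept. This simplifies and shrinks the data to be compressed without affecting later alignment, and compressing data with binned quality values further improves compression efficiency.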
本公开第三方面实施例提出了一种宏基因组数据压缩装置。图11为根据本公开实施例的宏基因组数据压缩装置结构图。如图11所示,宏基因组数据压缩装置90可以包括:参考序列构建模块901,用于根据上述第一方面实施例中的任一实施例所述的用于宏基因组数据压缩的参考序列的构建方法,构建用于宏基因组数据压缩的参考序列;和数据压缩模块902,用于将宏基因组数据中的读长序列与参考序列进行比对并记录比对结果,以获得宏基因组数据的压缩数据。
在本公开实施例中,该装置90还可以包括:质量值简并模块903,用于对宏基因组数据的质量值进行简并。
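In the metagenomic data compression device provided by the embodiment of the third aspect of the present disclosure, a reference sequence is built by the construction method of any embodiment of the first aspect, and the Reads to be compressed are rapidly aligned to it. If a Read aligns exactly to a position, only its position on the reference sequence needs to be recorded; if there are a few mismatches, the positions of the remaining matched bases are recorded together with the mismatched bases; for a Read that cannot be aligned accurately to the reference sequence, the full sequence information is recorded. This greatly improves the compression efficiency of metagenomic data and relieves the storage pressure of large-sample metagenomic data. In addition, the device conditionally bins base quality values before compression: using a set threshold, high quality values are binned while the original values of medium and low quality bases are kept. This simplifies and shrinks the data to be compressed without affecting later alignment, and compressing data with binned quality values further improves compression efficiency.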
为了实现上述实施例,本公开实施例还提出一种电子设备,包括:存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,处理器执行程序时,实现如本公开第一方面实施例提出的用于宏基因组数据压缩的参考序列的构建方法或如本公开第二方面实施例提出的宏基因组数据压缩方法。
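To implement the above embodiments, an embodiment of the present disclosure further provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements the method for constructing a reference sequence for metagenomic data compression provided by the embodiment of the first aspect of the present disclosure, or the metagenomic data compression method provided by the embodiment of the second aspect of the present disclosure.
To implement the above embodiments, an embodiment of the present disclosure further provides a computer program product; when the instructions in the computer program product are executed by a processor, the method for constructing a reference sequence for metagenomic data compression provided by the embodiment of the first aspect of the present disclosure, or the metagenomic data compression method provided by the embodiment of the second aspect of the present disclosure, is performed.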
为了实现上述实施例,本公开实施例还提出一种计算机程序,该计算机程序包括计算机程序代码,当该计算机程序代码在计算机上运行时,使得计算机执行如本公开第一方面实施例提出的用于宏基因组数据压缩的参考序列的构建方法或如本公开第二方面实施例提出的宏基因组数据压缩方法。
图12示出了适于用来实现本公开实施方式的示例性计算机设备的框图。图12显示的电子设备12仅仅是一个示例,不应对本公开实施例的功能和使用范围带来任何限制。
如图12所示,电子设备12以通用计算设备的形式表现。电子设备12的组件可以包括但不限于:一个或者多个处理器或者处理单元16,系统存储器28,连接不同系统组件(包括系统存储器28和处理单元16)的总线18。
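Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures. For example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus and the Peripheral Component Interconnect (PCI) bus.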
电子设备12典型地包括多种计算机系统可读介质。这些介质可以是任何能够被电子设备12访问的可用介质,包括易失性和非易失性介质,可移动的和不可移动的介质。
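The memory 28 may include computer-system-readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32. The electronic device 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, the storage system 34 may be used to read from and write to non-removable, non-volatile magnetic media (not shown in Figure 12, commonly called a "hard disk drive").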
尽管图12中未示出,可以提供用于对可移动非易失性磁盘(例如“软盘”)读写的磁盘驱动器,以及对可移动非易失性光盘(例如:光盘只读存储器(Compact Disc Read Only Memory;以下简称:CD-ROM)、数字多功能只读光盘(Digital Video Disc Read Only Memory;以下简称:DVD-ROM)或者其它光介质)读写的光盘驱动器。在这些情况下,每个驱动器可以通过一个或者多个数据介质接口与总线18相连。存储器28可以包括至少一个程序产品,该程序产品具有一组(例如至少一个)程序模块,这些程序模块被配置以执行本公开各实施例的功能。
具有一组(至少一个)程序模块42的程序/实用工具40,可以存储在例如存储器28中,这样的程序模块42包括但不限于操作系统、一个或者多个应用程序、其它程序模块以及程序数据,这些示例中的每一个或某种组合中可能包括网络环境的实现。程序模块42通常执行本公开所描述的实施例中的功能和/或方法。
电子设备12也可以与一个或多个外部设备14(例如键盘、指向设备、显示器24等)通信,还可与 一个或者多个使得用户能与该电子设备12交互的设备通信,和/或与使得该电子设备12能与一个或多个其它计算设备进行通信的任何设备(例如网卡,调制解调器等等)通信。这种通信可以通过输入/输出(I/O)接口22进行。并且,电子设备12还可以通过网络适配器20与一个或者多个网络(例如局域网(Local Area Network;以下简称:LAN),广域网(Wide Area Network;以下简称:WAN)和/或公共网络,例如因特网)通信。如图所示,网络适配器20通过总线18与电子设备12的其它模块通信。应当明白,尽管图中未示出,可以结合电子设备12使用其它硬件和/或软件模块,包括但不限于:微代码、设备驱动器、冗余处理单元、外部磁盘驱动阵列、RAID系统、磁带驱动器以及数据备份存储系统等。
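The processing unit 16 performs various functional applications and data processing by running programs stored in the system memory 28, for example implementing the method for constructing a reference sequence for metagenomic data compression or the metagenomic data compression method described in the foregoing embodiments.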
需要说明的是,前述对用于宏基因组数据压缩的参考序列的构建方法和宏基因组数据压缩方法实施例的解释说明也适用于上述实施例中的装置、电子设备、非瞬时计算机可读存储介质、计算机程序产品和计算机程序,此处不再赘述。
本领域技术人员在考虑说明书及实践这里公开的发明后,将容易想到本公开的其它实施方案。本公开旨在涵盖本公开的任何变型、用途或者适应性变化,这些变型、用途或者适应性变化遵循本公开的一般性原理并包括本公开未公开的本技术领域中的公知常识或惯用技术手段。说明书和实施例仅被视为示例性的,本公开的真正范围和精神由下面的权利要求指出。
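It should be understood that the present disclosure is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes can be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.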
需要说明的是,在本公开的描述中,术语“第一”、“第二”等仅用于描述目的,而不能理解为指示或暗示相对重要性。此外,在本公开的描述中,除非另有说明,“多个”的含义是两个或两个以上。
流程图中或在此以其他方式描述的任何过程或方法描述可以被理解为,表示包括一个或更多个用于实现特定逻辑功能或过程的步骤的可执行指令的代码的模块、片段或部分,并且本公开的优选实施方式的范围包括另外的实现,其中可以不按所示出或讨论的顺序,包括根据所涉及的功能按基本同时的方式或按相反的顺序,来执行功能,这应被本公开的实施例所属技术领域的技术人员所理解。
应当理解,本公开的各部分可以用硬件、软件、固件或它们的组合来实现。在上述实施方式中,多个步骤或方法可以用存储在存储器中且由合适的指令执行系统执行的软件或固件来实现。例如,如果用硬件来实现,和在另一实施方式中一样,可用本领域公知的下列技术中的任一项或他们的组合来实现:具有用于对数据信号实现逻辑功能的逻辑门电路的离散逻辑电路,具有合适的组合逻辑门电路的专用集成电路,可编程门阵列(PGA),现场可编程门阵列(FPGA)等。
本技术领域的普通技术人员可以理解实现上述实施例方法携带的全部或部分步骤是可以通过程序来指令相关的硬件完成,所述的程序可以存储于一种计算机可读存储介质中,该程序在执行时,包括方法实施例的步骤之一或其组合。
此外,在本公开各个实施例中的各功能单元可以集成在一个处理模块中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个模块中。上述集成的模块既可以采用硬件的形式实现,也可以采用软件功能模块的形式实现。所述集成的模块如果以软件功能模块的形式实现并作为独立的产品销售或使用时,也可以存储在一个计算机可读取存储介质中。
上述提到的存储介质可以是只读存储器,磁盘或光盘等。
下述实施例中的实验方法,如无特殊说明,均为常规方法,按照本领域内的文献所描述的技术或条件或者按照产品说明书进行。
如无特殊说明,以下实施例中的定量试验,均设置三次重复实验,结果取平均值。
实施例
本实施例以发布于中国国家基因库生命大数据平台的一个肠道微生物项目数据(db.cngb.org/search/project/CNP0000497/)为例,进行具体方案实施的描述。该项目共包含233个样本,文件数目共466个,原始数据总数据量为6.32TB,gzip文件压缩后为2.25TB。
(1)基础参考序列数据库及其索引的构建
本实施例使用Metaphlan3提供的参考数据集作为基础参考序列数据库的来源(github.com/biobakery/MetaPhlAn/wiki/MetaPhlAn-3.0)。
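By tracing back the microbial marker genes in mpa_v30_CHOCOPhlAn_201901_marker_info.txt.bz2, the NCBI accession numbers of the reference genomes from which the marker genes originate are obtained. Based on these NCBI genome accession numbers, the corresponding ftp links are obtained from ftp.ncbi.nih.gov/genomes/genbank/bacteria/assembly_summary.txt, so that the reference genome sequences can be downloaded in batch. In this example, 25435 reference genomes were downloaded in total. The subsequences within each reference genome were merged with a Python script according to the following rules: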
a.首先根据“>”的数目判断基因组文件内的子序列(通常为contig或scaffold)数目,如仅有1个,就将“>”后的内容改为基因组编号(通常为GCA开头);
b.若“>”数目大于1个,则首先在每个子序列末尾添加10个“N”字符作为分割符,然后删除第一个以外的所有“>”所在的行,将子序列进行合并,并将保留的第一个“>”后的内容改为基因组编号。
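After the internal subsequences of each genome have been merged, the cat command in the shell is used to concatenate all reference genomes into one overall FASTA file, giving the final base reference sequence file used for alignment.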
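Rules a and b can be sketched as a small Python function. The original script is not reproduced in this document, so the function name, the in-memory string interface and the single-line output sequence are all assumptions made for illustration.

```python
def merge_genome(fasta_text, genome_id, spacer="N" * 10):
    """Merge all records of one genome's FASTA text into a single record."""
    records, current = [], None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Each ">" line starts a new subsequence (contig or scaffold).
            current = []
            records.append(current)
        elif current is not None:
            current.append(line)
    sequences = ["".join(parts) for parts in records]
    if len(sequences) == 1:
        # Rule a: a single subsequence only gets its description line replaced.
        body = sequences[0]
    else:
        # Rule b: append 10 "N" characters to every subsequence, then join them
        # under one description line.
        body = "".join(seq + spacer for seq in sequences)
    return ">" + genome_id + "\n" + body + "\n"
```

The "N" spacer keeps reads from aligning across the artificial junction between two contigs, since aligners treat N as an ambiguous base.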
(2)序列比对
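Based on the base reference sequences built in (1), this example uses the alignment software Bwa (Li, H. and Durbin, R., 2009) to align the Fastq files of 50 randomly selected test samples, counts the number of Reads aligned to each genome sequence, and ranks the reference genomes by the number of aligned Reads.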
(3)项目特异性压缩参考序列构建
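Based on the statistics from (2), this example selects the 1000 reference genomes with the highest sequence abundance to build the project-specific compression reference sequence. The selection criteria follow Figures 2 and 3. The final FASTA file is 1.7 GB, only 1.6% of the size of the base reference sequences.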
(4)数据压缩测试
本实施例设定的质量值简并参数如下:
a. Quality value mapping scheme: 0~3 are binned to 0, 4~19 to 11, 20~29 to 23, and 30~40 to 37;
b.低质量Read判断条件为:当一条Read的所有碱基中,质量值在4~29范围内的比例大于等于20%,则不对该条Read中4~29范围内的碱基进行质量值简并,剩余碱基按照原定规则进行简并。
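Parameters a and b of this example can be expressed as the following sketch, which works on a list of integer quality values for one Read. The function names are hypothetical, and leaving out-of-range values unchanged is an assumption not stated in the text.

```python
# (lower bound, upper bound, mapped value) for each of the M=4 ranges in parameter a.
BINS = [(0, 3, 0), (4, 19, 11), (20, 29, 23), (30, 40, 37)]


def bin_quality(q):
    """Map one quality value onto the representative value of its range."""
    for low, high, mapped in BINS:
        if low <= q <= high:
            return mapped
    return q  # assumption: values outside 0~40 are left untouched


def bin_read_qualities(quals, low=4, high=29, min_fraction=0.20):
    """Apply parameter b: keep original 4~29 values when they make up >= 20% of the Read."""
    if not quals:
        return []
    n_mid = sum(1 for q in quals if low <= q <= high)
    keep_mid = n_mid / len(quals) >= min_fraction
    return [q if keep_mid and low <= q <= high else bin_quality(q) for q in quals]
```

For a Read with qualities `[40, 12, 25, 2, 35]`, two of five values fall in 4~29, so those two are kept and the result is `[37, 12, 25, 0, 37]`; a Read with only one such value in ten is binned in full.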
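After the compression reference sequence has been built, this example uses the index-dependent open-source compression tool genozip (genozip.Readthedocs.io/) to run a compression test on all samples of the project (i.e. the 6.32 TB of raw data described above); similar tools include GTZ (github.com/Genetalks/gtz) and LW_FQZIP (github.com/Zhuzxlab/LW-FQZip2). Figure 13 shows the distribution of compression ratios, where GZIP compression means compressing all sample data directly; Genozip compression without index means compressing all sample data with Genozip without the project-specific compression reference sequence built in step (3) above; and Genozip compression with index means compressing all sample data with Genozip using that project-specific compression reference sequence. As shown in Figure 13, with the compression scheme designed in the present disclosure, the average compression ratio over the 233 samples (466 files in total) is 10.46, which is 3.72 times that of gzip (2.81). Compared with not using a reference sequence (6.73), the average compression ratio is about 55% higher, which corresponds to compressed output that is about 35% smaller. The reference index proposed in embodiments of the present disclosure, and the compression scheme based on it, can therefore compress the data efficiently.
(5) Evaluating the effect of quality value binning on species composition analysis
In this example, the Fastq files before and after quality value binning are each used as input to a Metaphlan-based species identification pipeline (github.com/MGI-EU/MMHP_SOP_rmhost) to obtain the species composition of every sample; the analysis results for each sample before and after binning are then compared by correlation. The statistical method is as follows: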
a.首先对每个样本的物种丰度进行log转化,以使数据满足正态分布。
b.使用Python模块scipy中的pearsonr功能,计算皮尔森相关系数。
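Steps a and b can be illustrated with the dependency-free sketch below. The example itself uses `pearsonr` from scipy; the hand-written `pearson` function here computes the same coefficient so that the sketch runs without third-party packages. The pseudocount added before the log transform is an assumption, needed only because species with zero abundance cannot be log-transformed.

```python
import math


def log_transform(abundances, pseudocount=1e-6):
    """Step a: log10-transform species abundances (pseudocount avoids log of zero)."""
    return [math.log10(a + pseudocount) for a in abundances]


def pearson(x, y):
    """Step b: Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)
```

For one sample, the coefficient would be obtained as `pearson(log_transform(before), log_transform(after))`, where `before` and `after` list the abundances of the same species in the same order.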
图14为质量值简并前和简并后233个样本物种组成的皮尔森相关系数统计图。如图14所示,所有样本的质量值简并前后,物种组成的相关系数均>0.999,表明本实施例中的所采用的有损压缩方案,几乎不影响下游的物种组成分析。由此,本公开实施例中的参考索引以及基于该索引的压缩方案在实现高效压缩的基础上,并不会影响数据的构成,即实现了数据压缩后信息的高完整性、高准确性和高保真性。
在本说明书的描述中,参考术语“一个实施例”、“一些实施例”、“示例”、“具体示例”、或“一些示例”等的描述意指结合该实施例或示例描述的具体特征、结构、材料或者特点包含于本公开的至少一个实施例或示例中。在本说明书中,对上述术语的示意性表述不必须针对的是相同的实施例或示例。而且,描述的具体特征、结构、材料或者特点可以在任一个或多个实施例或示例中以合适的方式结合。此外,在不相互矛盾的情况下,本领域的技术人员可以将本说明书中描述的不同实施例或示例以及不同实施例或示例的特征进行结合和组合。
此外,术语“第一”、“第二”仅用于描述目的,而不能理解为指示或暗示相对重要性或者隐含指明所指示的技术特征的数量。由此,限定有“第一”、“第二”的特征可以明示或者隐含地包括一个或者更多个该特征。在本公开的描述中,“多个”的含义是两个或两个以上,除非另有明确具体的限定。
尽管上面已经示出和描述了本公开的实施例,可以理解的是,上述实施例是示例性的,不能理解为对本公开的限制,本领域的普通技术人员在本公开的范围内可以对上述实施例进行变化、修改、替换和变型。

Claims (20)

  1. 一种用于宏基因组数据压缩的参考序列的构建方法,包括:
    根据所述宏基因组数据的样本来源,构建基础参考序列数据库;
    基于所述基础参考序列数据库,构建基础参考序列数据库的索引;
    根据所述基础参考序列数据库的所述索引,将第一读长序列与所述基础参考序列数据库进行比对,获得比对结果,其中所述第一读长序列为待压缩的宏基因组数据中随机选择的部分样本的读长序列;和
    根据所述比对结果,确定所述第一读长序列的序列丰度分布,构建所述用于宏基因组数据压缩的参考序列。
  2. 根据权利要求1所述的方法,其中根据所述宏基因组数据的样本来源,构建基础参考序列数据库,包括:
    根据所述宏基因组数据的所述样本来源,从公共数据库中获取对应的参考基因组并汇总,以获得所述基础参考序列数据库。
  3. 根据权利要求2所述的方法,其中基于所述基础参考序列数据库,构建基础参考序列数据库的索引,包括:
    所述基础参考序列数据库中的单个参考基因组包括第一子序列和第二子序列,将所述第一子序列和第二子序列合并,并保留所述参考基因组的编号,以得到子序列合并参考基因组;
    基于所述子序列合并参考基因组,构建所述基础参考序列数据库的索引。
  4. 根据权利要求3所述的方法,其中根据所述基础参考序列数据库的所述索引,将第一读长序列与所述基础参考序列数据库进行比对,包括:
    基于所述基础参考序列数据库的所述索引,将所述第一读长序列比对至每个所述子序列合并参考基因组上;
    基于所述第一读长序列比对到所述子序列合并参考基因组,记录所述读长序列比对到的所述参考基因组的所述编号。
  5. 根据权利要求4所述的方法,其中根据比对结果,确定所述第一读长序列的序列丰度分布,构建所述用于宏基因组数据压缩的参考序列,包括:
    统计所述比对结果中,所述第一读长序列比对到各个所述参考基因组的所述编号的数目,以获得所述第一读长序列的所述序列丰度分布;
    根据所述序列丰度对所述参考基因组进行排序,选择前X位的参考基因组构建所述用于宏基因组数据压缩的参考序列。
  6. 根据权利要求5所述的方法,其中构建所述用于宏基因组数据压缩的参考序列,还包括:
    根据所述排序,选择所述序列丰度占比之和大于Y%的参考基因组构建所述用于宏基因组数据压缩的参考序列。
  7. 根据权利要求1至6中任一项所述的方法,所述方法还包括:
    将所述基础参考序列数据库拆分为子基础参考序列数据库;
    分别基于拆分出的所述子基础参考序列数据库构建子参考序列数据库的索引;
    基于所述子参考序列数据库的索引,将所述第一读长序列分别与每个所述子基础参考序列数据库进行比对,以获得第二比对结果,其中所述第二比对结果包括基于各个所述子基础参考序列数据库的子结果文件。
  8. 根据权利要求7所述的方法,所述方法还包括:
    分别统计各个所述子结果文件中所述第一读长序列比对至每个所述子基础参考序列数据库的数目,以获得所述第一读长序列在各个所述子结果文件中的所述序列丰度分布;
    根据各个所述子结果文件中的所述序列丰度对所述参考基因组进行第一排序,选择各个所述子结果文件中所述序列丰度前X位的参考基因组构建子参考序列数据库;
    根据所述序列丰度,对所述子参考序列数据库中的参考基因组进行第二排序;
    选择子参考序列数据库中所述序列丰度分布前X位的参考基因组构建所述用于宏基因组数据压缩的参考序列。
  9. 根据权利要求8所述的方法,其中构建所述用于宏基因组数据压缩的参考序列,还包括:根据所述第一排序,选择各个所述子结果文件中所述序列丰度占比之和大于Y%的参考基因组构建所述子参考序列数据库,并且
    根据所述第二排序,选择所述子参考序列数据库中所述序列丰度占比之和大于Y%的参考基因组构建所述用于宏基因组数据压缩的参考序列。
  10. 根据权利要求1至9中任一项所述的方法,所述方法还包括:
    对所述比对结果进行第一和/或第二筛选,其中
    所述第一筛选包括:在所述比对结果中选择无插入和/或缺失的所述读长序列;
    所述第二筛选包括:选择低于错配阈值的所述读长序列。
  11. 一种宏基因组数据压缩方法,所述方法包括:
    根据权利要求1所述的用于宏基因组数据压缩的参考序列的构建方法,构建用于宏基因组数据压缩的参考序列;
    将第二读长序列与所述参考序列进行比对并记录比对结果,以获得所述宏基因组数据的压缩数据,其中所述第二读长序列为宏基因组数据中待压缩样本的读长序列。
  12. 根据权利要求11所述的方法,其中将第二读长序列与所述参考序列进行比对并记录比对结果,包括:
    在所述第二读长序列与所述参考序列的错配碱基个数小于R1的情况下,记录所述第二读长序列在所述参考序列上的位置;
    在所述第二读长序列与所述参考序列的错配碱基个数大于R1且小于R2的情况下,记录所述第二读长序列中配对碱基在所述参考序列上的位置,并记录错配碱基的碱基信息;
    在所述第二读长序列与所述参考序列的错配碱基个数大于R2的情况下,记录所述第二读长序列。
  13. 根据权利要求11所述的方法,还包括对所述宏基因组数据的质量值进行简并,所述简并包括:
    对所述宏基因组数据中的碱基质量值进行统计,以获得所述质量值在M个质量值范围内的分布;
    分别将所述M个范围内的所述质量值对应映射到M个映射值上,以简并所述宏基因组数据的所述质量值。
  14. 根据权利要求13所述的方法,所述方法还包括:在所述质量值低于Q的碱基的比例占所述宏基因组数据中所有碱基的比例低于设定比例N的情况下,将所述宏基因组数据中的所有碱基的质量值进行映射以简并所述宏基因组数据的所述质量值。
  15. 根据权利要求14所述的方法,所述方法还包括:
    在所述质量值低于Q的碱基的比例占所述宏基因组数据中所有碱基的比例高于或等于设定比例N的情况下,将所述宏基因组数据中的所述质量值高于Q的所述碱基的质量值进行映射,以简并所述宏基因组数据的所述质量值。
  16. 根据权利要求15所述的方法,所述方法还包括:
    在所述质量值低于Q的碱基的比例占所述宏基因组数据中所有碱基的比例高于或等于设定比例N的情况下,保留所述宏基因组数据中的所述质量值低于Q的所述碱基的原始质量值。
  17. The method according to claim 16, wherein Q is the quality value corresponding to a base error probability of 0.01% to 1%.
  18. 一种电子设备,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,其中所述处理器执行所述计算机程序时,实现如权利要求1所述的用于宏基因组数据压缩的参考序列的构建方法,所述方法包括:
    根据所述宏基因组数据的样本来源,构建基础参考序列数据库;
    基于所述基础参考序列数据库,构建基础参考序列数据库的索引;
    根据所述基础参考序列数据库的所述索引,将第一读长序列与所述基础参考序列数据库进行比对, 获得比对结果,其中所述第一读长序列为待压缩的宏基因组数据中随机选择的部分样本的读长序列;和
    根据所述比对结果,确定所述第一读长序列的序列丰度分布,构建所述用于宏基因组数据压缩的参考序列。
  19. 一种非瞬时性计算机可读存储介质,其上存储有计算机程序,其中,所述计算机程序被处理器执行时实现如权利要求1所述的用于宏基因组数据压缩的参考序列的构建方法。
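  20. A computer program product comprising a computer program which, when executed by a processor, implements the method for constructing a reference sequence for metagenomic data compression according to claim 1.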
PCT/CN2022/125204 2022-10-13 2022-10-13 参考序列的构建方法、宏基因组数据压缩方法和电子设备 WO2024077568A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/125204 WO2024077568A1 (zh) 2022-10-13 2022-10-13 参考序列的构建方法、宏基因组数据压缩方法和电子设备

Publications (1)

Publication Number Publication Date
WO2024077568A1 true WO2024077568A1 (zh) 2024-04-18

Family

ID=90668445

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/125204 WO2024077568A1 (zh) 2022-10-13 2022-10-13 参考序列的构建方法、宏基因组数据压缩方法和电子设备

Country Status (1)

Country Link
WO (1) WO2024077568A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104699998A (zh) * 2013-12-06 2015-06-10 国际商业机器公司 用于对基因组进行压缩和解压缩的方法和装置
WO2022028624A1 (zh) * 2020-08-07 2022-02-10 西安中科茵康莱医学检验有限公司 通过测序获取微生物物种及相关信息的方法、装置、计算机可读存储介质和电子设备
CN114930724A (zh) * 2019-12-31 2022-08-19 深圳华大智造科技股份有限公司 创建基因突变词典及利用基因突变词典压缩基因组数据的方法和装置
CN114974411A (zh) * 2022-06-28 2022-08-30 杭州杰毅医学检验实验室有限公司 宏基因组病原微生物基因组数据库及其构建方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22961757

Country of ref document: EP

Kind code of ref document: A1