CN107526942B

CN107526942B - Reverse retrieval method of life omics sequence data

Info

Publication number: CN107526942B
Application number: CN201710586828.3A
Authority: CN
Inventors: 李伟忠
Original assignee: Sun Yat Sen University
Current assignee: Sun Yat Sen University
Priority date: 2017-07-18
Filing date: 2017-07-18
Publication date: 2021-04-20
Anticipated expiration: 2037-07-18
Also published as: CN107526942A

Abstract

The invention relates to a reverse retrieval method for life omics sequence data, comprising the following steps: S1, comprehensively indexing unknown sequence data generated by sequencing to construct an integrated index database group; and S2, determining known or annotated sequence data as the query sequence set required for retrieval, and then using the query sequence set to search the index database group.

Description

Reverse retrieval method of life omics sequence data

Technical Field

The invention relates to the technical field of biomedicine, in particular to a reverse retrieval method of life omics sequence data.

Background

Existing methods for searching life omics sequence data, such as NCBI BLAST (Camacho et al. 2009) and FASTA (Pearson et al. 1991), index known or annotated sequence data to create an index database group, and then submit unknown or unannotated sequence data for alignment search, as shown in Fig. 1. The search results provide information on a number of matching sequences, and the user may annotate the queried unknown sequence with the information of the best match.

This forward search mode focuses on the unknown sequence being queried. It can be used to characterize or predict individual sequences or genes, and it suits cases where the submitted data is much smaller than the database being searched. However, with the rapid development of sequencing technologies and the steadily falling cost of sequencing, the volume of unknown sequence data generated daily is now many times greater than that of known or annotated sequence data, reaching the TB and even PB scale. Under this trend, the efficiency of the forward search method keeps falling.

The current forward search mode, which relies on reference genomes and on unknown genomes obtained by splicing and assembly, faces various limitations, some of them insurmountable, when dealing with big omics data. The two main problems are:

(1) Submitting a large number of unknown sequences to search known or annotated sequence data becomes less efficient as the number of submitted unknown sequences grows. The reason is that such methods scan the searched sequence database from beginning to end for each unknown sequence. One query requires one scan, and n queries require n scans. The searched database is therefore scanned repeatedly a large number of times, and search efficiency is low.

(2) Since the unknown sequences to be searched are usually data generated by genome sequencing, the next stage of searching can only be performed after the sequencing data has been spliced and assembled. Splicing and assembly merges overlapping short sequences into a representative long sequence, and short sequences with no overlap or insufficient overlap are discarded. The process consumes a large amount of computing resources and inevitably loses part of the data, so the complete data cannot be obtained and the gene information cannot be analyzed and used comprehensively and accurately.
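The repeated scanning described in point (1) can be illustrated with a minimal sketch. It assumes a naive search in which exact substring matching stands in for alignment; the names `forward_search`, `known_db` and `reads` are hypothetical and are not taken from BLAST or FASTA.

```python
def forward_search(unknown_queries, known_db):
    """Naive forward search: one full pass over the known database per query.

    Exact substring matching is a stand-in for alignment (an assumption made
    only to keep the sketch short).
    """
    scans = 0
    hits = {}
    for query in unknown_queries:
        scans += 1  # every query triggers a complete scan of known_db
        hits[query] = [name for name, seq in known_db.items() if query in seq]
    return hits, scans


# Hypothetical toy data: two annotated sequences and three unknown reads.
known_db = {"geneA": "ACGTACGTGG", "geneB": "TTGACCATGA"}
reads = ["ACGT", "CCAT", "GGGG"]

hits, scans = forward_search(reads, known_db)
print(scans)  # number of full database scans equals the number of reads
print(hits)
```

With n unknown reads the database is scanned n times, which is the cost that grows with sequencing output.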

Disclosure of Invention

The invention provides a reverse retrieval method of life omics sequence data, aiming at solving the technical defects of low retrieval efficiency and incomplete data caused by splicing and assembling of unknown sequences in the forward retrieval method provided by the prior art.

To achieve this purpose, the technical solution is as follows:

S1, comprehensively indexing unknown sequence data generated by sequencing to construct an integrated index database group;

S2, determining known or annotated sequence data as the query sequence set required for retrieval, and then using the query sequence set to search the index database group.
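Steps S1 and S2 can be sketched in miniature as follows. This is only an illustration under stated assumptions: a plain in-memory k-mer inverted index replaces the compressed indexes named later in the description, exact k-mer sharing replaces alignment, and all function and variable names are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(unknown_reads, k):
    """S1 (toy version): map each k-mer to the IDs of the reads containing it."""
    index = defaultdict(set)
    for read_id, seq in unknown_reads.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(read_id)
    return index


def reverse_search(index, query_set, k):
    """S2 (toy version): look up the k-mers of each known query in the index."""
    results = {}
    for name, seq in query_set.items():
        matched = set()
        for i in range(len(seq) - k + 1):
            matched |= index.get(seq[i:i + k], set())
        results[name] = sorted(matched)
    return results


# Hypothetical toy data: unassembled reads and one annotated query sequence.
unknown_reads = {"read1": "ACGTTGCA", "read2": "GGGCCCAA", "read3": "TTGCAGGG"}
known_queries = {"virus_fragment": "GTTGCAG"}

index = build_kmer_index(unknown_reads, k=5)
results = reverse_search(index, known_queries, k=5)
print(results)
```

The reads are indexed once, and each known query then costs only a handful of lookups, regardless of how many reads were indexed.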

Compared with the prior art, the invention has the beneficial effects that:

(1) Retrieval efficiency for life omics data is improved by orders of magnitude: the reverse search method uses known sequence data as the query sequences and the large volume of unknown sequence data as the searched database, so the number of database scans is reduced by orders of magnitude, which improves overall search efficiency.

(2) Speed, with savings in computing and storage resources: the reverse search method requires no reference genome, whereas the existing forward search method must assemble the unknown sequences and align them against a reference genome, which consumes a large amount of computing resources, storage resources and running time.

(3) All valuable data are retained: because neither genome splicing and assembly nor a reference genome is needed, the reverse search method scans all unknown sequence data, so that all of it can be used for rapid hypothesis verification and analytical mining, achieving comprehensive command and use of the data.

Drawings

Fig. 1 is a schematic diagram of a forward search method.

Fig. 2 is a schematic diagram of a reverse search method provided in the present invention.

Fig. 3 is a diagram illustrating a specific implementation process of the reverse search method provided in the present invention.

Detailed Description

The drawings are for illustrative purposes only and are not to be construed as limiting the patent.

The invention is further illustrated below with reference to the figures and examples.

Example 1

The overall technical route and the specific procedure of the reverse retrieval method for life omics sequence data provided by the invention are shown in Figs. 2 and 3, and comprise the following steps:

(1) Comprehensively indexing the unknown sequence data generated by sequencing to construct an integrated index database group (Fig. 3, ①)

In this embodiment, the Sequence Bloom Trees algorithm, the FM-index algorithm or the Population BWT algorithm is used to perform high-speed compressed indexing of PB-scale omics data. For big data storage, distributed storage of PB-scale omics data is realized using Hadoop data integration technology or Spark data integration technology. For big data operation, parallel processing and automated build technologies such as LSF, PBS, SGE, SLURM and Make are used to run the background PB-scale data processing in a distributed, automated, seamless and efficient manner.
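To make the FM-index option concrete, the following is a minimal, uncompressed sketch of an FM-index with backward search over a single text. It is a teaching illustration only: production FM-indexes use compressed occurrence tables and sampled suffix arrays, and the function names here are hypothetical.

```python
def build_fm_index(text):
    """Build a toy FM-index (suffix array, C table, occurrence table)."""
    text += "$"  # sentinel; sorts before A, C, G, T
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # character preceding each suffix
    alphabet = sorted(set(text))
    # c_table[c]: number of characters in text strictly smaller than c
    c_table, total = {}, 0
    for c in alphabet:
        c_table[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, c_table, occ


def fm_search(pattern, sa, c_table, occ):
    """Backward search: return sorted start positions of pattern in the text."""
    lo, hi = 0, len(sa)  # half-open interval of matching suffix-array rows
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


sa, c_table, occ = build_fm_index("ACGTACGA")
print(fm_search("ACG", sa, c_table, occ))
print(fm_search("CGT", sa, c_table, occ))
```

Search time depends on the pattern length rather than on the size of the indexed text, which is why such indexes suit a large, fixed body of reads queried many times.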

(2) Determining the query sequence set required for retrieval (Fig. 3, ②)

Reverse search accepts any biomolecular sequence as query input. Known or annotated sequence data, such as exogenous disease genomes (for example those of viruses and bacteria), rare-disease-related genes, sequences of interest obtained by comparing gene variation data of interest against genes, or indeed the sequence of any known gene, are integrated through an internal automated process into a sequence data set that is submitted as the initial query sequences for the search study. An in-house program converts different types of genomic data into query sequences that meet the input requirements of reverse search.
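The in-house conversion program is not described further, so the following is only a guess at what such a normalization step might look like. It assumes that the query input format is FASTA, which the text does not state; the function name `to_query_fasta` and its parameters are hypothetical.

```python
def to_query_fasta(records, line_width=60):
    """Normalize (identifier, sequence) pairs into FASTA text.

    Assumption: FASTA is the accepted query format. Whitespace is removed,
    bases are upper-cased, and empty sequences are skipped.
    """
    lines = []
    for ident, seq in records:
        cleaned = "".join(seq.split()).upper()
        if not cleaned:
            continue  # nothing to submit for this record
        lines.append(f">{ident}")
        for i in range(0, len(cleaned), line_width):
            lines.append(cleaned[i:i + line_width])
    return "".join(line + "\n" for line in lines)


# Hypothetical input: one messy sequence and one empty record.
fasta_text = to_query_fasta([("q1", "acg t\nAC"), ("empty", "  ")], line_width=4)
print(fasta_text, end="")
```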

(3) Searching the index database group of unknown sequence data with known or annotated sequence data (Fig. 3, ③)

An index database group is constructed from the high-throughput, unannotated unknown sequence data generated by sequencing, using indexing tools such as makeblastdb and FM-index. Search over the resulting index group is offered through multiple methods, using high-speed sequence search tools such as PSISearch, SBTblast and smartSearch, as well as various open-source tools from China and abroad such as SBT Search, PSISearch2, megaBLAST, BLAT and Compressive BLAST.
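As one possible instance of this step, the sketch below builds the command lines for indexing reads with makeblastdb and searching them with megaBLAST (run through `blastn -task megablast` in NCBI BLAST+). The file names, the database name and the hit limit of 100 are illustrative assumptions, and the commands are only printed, not executed.

```python
import shlex


def makeblastdb_cmd(reads_fasta, db_name):
    """Command to index unknown nucleotide reads as a BLAST database."""
    return ["makeblastdb", "-in", reads_fasta, "-dbtype", "nucl", "-out", db_name]


def megablast_cmd(query_fasta, db_name, out_tsv, max_hits=100):
    """Command to search the read database with known query sequences."""
    return [
        "blastn", "-task", "megablast",
        "-query", query_fasta,
        "-db", db_name,
        "-outfmt", "6",  # tabular output
        "-max_target_seqs", str(max_hits),
        "-out", out_tsv,
    ]


# Hypothetical file and database names.
index_cmd = makeblastdb_cmd("unknown_reads.fasta", "unknown_reads_db")
search_cmd = megablast_cmd("known_queries.fasta", "unknown_reads_db", "hits.tsv")
print(shlex.join(index_cmd))
print(shlex.join(search_cmd))
```

Passing each list to `subprocess.run(cmd, check=True)` would execute it on a machine where BLAST+ is installed. Note that the roles are reversed relative to ordinary BLAST usage: the sequencing reads form the database and the annotated sequences form the query.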

The index database to be searched may be a single database or multiple databases, even hundreds or more. The underlying search algorithms include the rigorous Smith-Waterman local alignment algorithm, the local-global alignment algorithm, the global-global alignment algorithm and the like. The search may be a single search or an iterative search, and the latter may run for up to tens of iterations. Parameters default to the default search parameter values of the underlying tool, and reverse search exposes the main parameters for adjustment. Search results are presented as a ranked list of matches together with pairwise alignments of the matched sequences, and by default the top 100 matches are returned. For parallel processing of searches, a job scheduling system such as LSF, PBS or SLURM is used to achieve high-speed search of PB-scale omics big data.
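For reference, a minimal Smith-Waterman local alignment with a linear gap penalty can be sketched as follows. The scoring values (match 2, mismatch -1, gap -2) are arbitrary illustrative choices, not parameters taken from any of the tools named above.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    # Fill the scoring matrix; cells never drop below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, aligned_a, aligned_b = smith_waterman("TTACGTTT", "GGACGTCC")
print(score, aligned_a, aligned_b)
```

The algorithm is exact but quadratic in the sequence lengths, so it is typically applied to candidate matches already narrowed down by an index rather than to the whole database.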

(4) Multiple sequence alignment of reverse search results (Fig. 3, ④)

The sequence search result gives only a preliminary characterization of the unknown sequence group. Multiple sequence alignment allows an overall analysis of the similar sequences listed in the search result and provides clues for downstream analyses such as genotype identification. The search outputs a large number of pairwise alignments of similar sequences; sequences are selected and extracted on the basis of the similarity shown in these pairwise alignments and subjected to multiple sequence alignment, and the genotype and genetic variation are then analyzed qualitatively from the multiple alignment result. Combining reverse search with the multiple sequence alignment algorithms Clustal Omega, T-Coffee, MUSCLE and Kalign allows rapid identification of genotype information, such as gene variation, in genomes that have not been spliced and assembled.
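Once a multiple sequence alignment is available, candidate variant positions can be read off column by column. The sketch below assumes the alignment has already been produced by an external aligner and is supplied as equal-length strings with `-` marking gaps; the function name and the toy alignment are hypothetical.

```python
def variable_columns(alignment):
    """Return (column, {name: character}) for every column that is not uniform.

    alignment: dict mapping sequence name to its aligned sequence.
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("aligned sequences must all have the same length")
    variants = []
    for col in range(length):
        column = {name: alignment[name][col] for name in names}
        if len(set(column.values())) > 1:  # substitution or gap in this column
            variants.append((col, column))
    return variants


# Hypothetical pre-aligned sequences: a known query and two retrieved reads.
alignment = {"query": "ACGTAC", "read1": "ACGTAC", "read2": "ACCTA-"}
variants = variable_columns(alignment)
for col, column in variants:
    print(col, column)
```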

(5) Genotype identification and in-depth analysis of exogenous disease genomes (Fig. 3, ⑤)

Gene variation is identified in the multiple sequence alignment result, or genotype analysis is carried out using the latest advances in artificial-intelligence deep learning. In addition, reverse search is applied to the analysis of exogenous disease genome data, for example to study the relationship between cancer genomes and the genes of exogenous species such as viruses and bacteria. The developed method provides high-speed search of PB-scale genome data, and the results can be used for genotype analysis (including single-nucleotide insertions, deletions and modifications), analysis of structural variation independent of a reference genome, combined analysis of genetic variation and new molecular subtypes, analysis of exogenous pathogenic genomes and human genes, identification of the relationship between cancer and different viruses, and the like.

In terms of methods, genotype identification can use machine learning and deep learning tools, such as convolutional neural network (CNN) model tools like Eigen, DeepSEA and DeepBind. In terms of hardware utilization, GPU programming interfaces such as OpenCL, CUDA and Brook are used to implement bioinformatics applications in which the CPU and GPU operate cooperatively. In terms of software programming, libraries such as Google TensorFlow, Keras and Caffe are used to implement and train deep learning convolutional neural network (CNN) models.
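CNN models for sequence data generally take a numeric encoding of the bases as input. A common choice, shown here as a plain-Python sketch with no deep learning library involved, is one-hot encoding; treating any base outside A, C, G, T as an all-zero row is an assumption made for this example.

```python
def one_hot_encode(seq, alphabet="ACGT"):
    """Encode a DNA sequence as a list of one-hot rows, one row per base.

    Bases not in the alphabet (for example N) become all-zero rows.
    """
    index = {base: i for i, base in enumerate(alphabet)}
    encoded = []
    for base in seq.upper():
        row = [0] * len(alphabet)
        if base in index:
            row[index[base]] = 1
        encoded.append(row)
    return encoded


encoded = one_hot_encode("acN")
print(encoded)
```

The resulting matrix (sequence length by 4) can be converted to a tensor and fed to a convolutional layer in whichever framework is used.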

It should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the claims of the present invention.

Claims

1. A reverse retrieval method for life omics sequence data, characterized by comprising the following steps:

S2, determining known or annotated sequence data as the query sequence set required for retrieval, and then using the query sequence set to search the index database group;

the known or annotated sequence data is any known gene sequence, an exogenous disease genome, a rare disease gene, or a sequence of interest obtained by comparing gene variation data of interest against genes;

step S2 is performed using the PSISearch sequence search tool, the SBTblast sequence search tool, the smartSearch sequence search tool, the SBT Search sequence search tool, the PSISearch2 sequence search tool, the megaBLAST sequence search tool, the BLAT sequence search tool, or the Compressive BLAST sequence search tool.

2. The reverse retrieval method for life omics sequence data according to claim 1, characterized in that: step S1 performs high-speed compressed indexing of the unknown sequence data using the Sequence Bloom Trees algorithm, the FM-index algorithm, the Population BWT algorithm, or the makeblastdb tool, and then constructs the index database group.

3. The reverse retrieval method for life omics sequence data according to claim 2, characterized in that: the index database group is stored in a distributed manner using Hadoop data integration technology or Spark data integration technology.

4. The reverse retrieval method for life omics sequence data according to claim 1, characterized in that: the retrieval algorithm used in step S2 is the Smith-Waterman local alignment algorithm, a local-global alignment algorithm, or a global-global alignment algorithm.

5. The reverse retrieval method for life omics sequence data according to claim 4, characterized in that: the retrieval in step S2 may be a single retrieval or an iterative retrieval.

6. The reverse retrieval method for life omics sequence data according to any one of claims 1 to 5, characterized in that: after the search result is output in step S2, the output search result is subjected to multiple sequence alignment, and genotype analysis, genetic variation identification, or exogenous pathogenic genome analysis is performed based on the result of the multiple sequence alignment.