CN107526942B - Reverse retrieval method of life omics sequence data - Google Patents

Reverse retrieval method of life omics sequence data

Info

Publication number
CN107526942B
Authority
CN
China
Prior art keywords
sequence
data
search
retrieval
reverse
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710586828.3A
Other languages
Chinese (zh)
Other versions
CN107526942A (en)
Inventor
李伟忠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University
Priority to CN201710586828.3A
Publication of CN107526942A
Application granted
Publication of CN107526942B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00: ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression

Abstract

The invention relates to a reverse retrieval method of life omics sequence data, which comprises the following steps: S1, comprehensively indexing unknown sequence data generated by sequencing to construct an integrated index database group; and S2, determining known or annotated sequence data as the query sequence set required for retrieval, and then using the query sequence set to search the index database group.

Description

Reverse retrieval method of life omics sequence data
Technical Field
The invention relates to the technical field of biomedicine, in particular to a reverse retrieval method of life omics sequence data.
Background
Existing methods for searching life omics sequence data, such as NCBI BLAST (Camacho et al. 2009) and FASTA (Pearson et al. 1991), index known or annotated sequence data to create an index database group, and unknown or unannotated sequence data are then submitted as queries for alignment search, as shown in Fig. 1. The search results provide information on multiple matching sequences, and the user can annotate the queried unknown sequence with the information of the best match.
This forward search mode focuses on the unknown sequence being queried. It can be used to characterize or predict individual sequences or genes, and it suits cases where the submitted data are much smaller than the database being searched. However, with the rapid development of sequencing technologies and the ever-decreasing cost of sequencing, the volume of unknown sequence data generated daily is now many times greater than that of known or annotated sequence data, reaching the TB and even PB scale. Under this trend, the efficiency of the forward search method keeps falling.
The current forward search mode, which relies on reference genomes and on assembling the unknown genomes, faces various limiting or even insurmountable problems when handling big omics data, mainly in two respects:
(1) Submitting a large number of unknown sequences to search known or annotated sequence data becomes less efficient as the number of submitted sequences increases. The reason is that such methods scan the searched sequence database from beginning to end for every unknown query. Querying one unknown sequence requires one scan; querying n sequences requires n scans. The searched database is therefore scanned repeatedly many times, and the search efficiency is low.
(2) Since the unknown sequences to be searched are usually data generated by genome sequencing, the search stage can begin only after the sequencing data have been assembled. Assembly merges overlapping short sequences into a representative long sequence, and short sequences with no overlap or insufficient overlap are discarded. The assembly process consumes a large amount of computing resources and inevitably loses part of the data, so comprehensive data cannot be obtained and gene information cannot be analyzed and used comprehensively and accurately.
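The scan-count argument in problem (1) can be illustrated with a small calculation. The sketch below assumes, as the text does, one full pass over the searched database per submitted query; the two data set sizes are hypothetical figures chosen only for illustration and do not come from the patent.

```python
def scans_needed(num_queries: int) -> int:
    """One full pass over the searched database is assumed per submitted query."""
    return num_queries


# Hypothetical sizes, for illustration only
unknown_reads = 1_000_000_000   # unknown sequences produced by sequencing
known_queries = 1_000           # known or annotated sequences of interest

# Forward search: every unknown read is a query against the known database
forward_scans = scans_needed(unknown_reads)
# Reverse search: every known sequence is a query against the unknown-data index
reverse_scans = scans_needed(known_queries)

reduction_factor = forward_scans // reverse_scans
print(f"forward: {forward_scans} scans, reverse: {reverse_scans} scans, "
      f"reduction: {reduction_factor}x")
```

Under these assumed sizes the number of scans falls by six orders of magnitude; the actual gain depends on the real ratio of unknown to known sequences and on the cost of a single scan.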
Disclosure of Invention
The invention provides a reverse retrieval method of life omics sequence data, aiming at solving the technical defects of low retrieval efficiency and incomplete data caused by splicing and assembling of unknown sequences in the forward retrieval method provided by the prior art.
To achieve this purpose, the technical solution is as follows:
S1, comprehensively indexing unknown sequence data generated by sequencing to construct an integrated index database group;
and S2, determining known or annotated sequence data as the query sequence set required for retrieval, and then using the query sequence set to search the index database group.
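Steps S1 and S2 can be sketched in a few lines. The example below is a minimal illustration, not the patented implementation: it swaps in a plain in-memory k-mer inverted index for the compressed indexes named later in the description, and the read names, query name, sequences and the value of `k` are all hypothetical.

```python
from collections import defaultdict


def build_kmer_index(reads, k):
    """S1 (sketch): map every k-mer to the IDs of the unknown reads containing it."""
    index = defaultdict(set)
    for read_id, seq in reads.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(read_id)
    return index


def reverse_search(index, queries, k):
    """S2 (sketch): for each known query, count the distinct k-mers shared with each read."""
    hits = {}
    for query_id, seq in queries.items():
        shared = defaultdict(int)
        distinct_kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        for kmer in distinct_kmers:
            for read_id in index.get(kmer, ()):
                shared[read_id] += 1
        hits[query_id] = dict(shared)
    return hits


# Hypothetical unknown reads (the searched database) and one known query
reads = {"read1": "ACGTACGTGA", "read2": "TTTTGGGGCC", "read3": "GTACGTGACC"}
queries = {"virus_gene": "TACGTGAC"}

index = build_kmer_index(reads, k=5)
hits = reverse_search(index, queries, k=5)
print(hits)
```

The index over the unknown reads is built once and reused for every known query, which is the property the reverse method relies on.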
Compared with the prior art, the invention has the beneficial effects that:
(1) The retrieval efficiency of life omics data is improved by orders of magnitude: the reverse search method uses known sequence data as the query sequences and the large volume of unknown sequence data as the searched database, so the number of search scans is reduced by orders of magnitude, which improves the overall search efficiency.
(2) Fast, and saves computing and storage resources: the reverse search method requires no reference genome, whereas the existing forward search method needs to assemble the unknown sequences and align them to reference genomes, which consumes a large amount of computing resources, storage resources and running time.
(3) All valuable data are retained: because neither genome assembly nor a reference genome is required, the reverse retrieval method scans all of the unknown sequence data, so that all of it can be used for rapid hypothesis verification and analytical mining, achieving comprehensive command and use of the data.
Drawings
Fig. 1 is a schematic diagram of a forward search method.
Fig. 2 is a schematic diagram of a reverse search method provided in the present invention.
Fig. 3 is a diagram illustrating a specific implementation process of the reverse search method provided in the present invention.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent.
The invention is further illustrated below with reference to the figures and examples.
Example 1
The overall technical route and the specific method of the reverse retrieval method of life omics sequence data provided by the invention are shown in Fig. 2 and Fig. 3, and specifically comprise the following steps:
(1) Comprehensively indexing the unknown sequence data generated by sequencing to construct an integrated index database (FIG. 3 ①)
In this embodiment, the Sequence Bloom Trees algorithm, the FM-index algorithm or the Population BWT algorithm is used to perform high-speed compressed indexing of PB-scale omics data. For big data storage, distributed storage of PB-scale omics data is realized with Hadoop or Spark data integration technology. For big data operation, parallel processing and automated build technologies such as LSF, PBS, SGE, SLURM and Make are used to run the PB-scale data in the background in a distributed, automated, seamless and efficient manner.
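To make the FM-index option concrete, the following is a minimal, uncompressed sketch of an FM-index with backward search. It assumes a single short text and builds the suffix array by naive sorting, so it shows only the lookup principle; production indexes of the kind the description targets use sampled, compressed structures. The function names and the example sequence are hypothetical.

```python
def build_fm_index(text):
    """Build a toy FM-index: suffix array, first-column offsets, and occurrence table."""
    text += "$"  # unique terminator that sorts before A, C, G, T
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix
    bwt = "".join(text[i - 1] for i in suffix_array)
    alphabet = sorted(set(text))
    # first_pos[c]: number of characters in the text that sort before c
    first_pos, total = {}, 0
    for c in alphabet:
        first_pos[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return suffix_array, first_pos, occ


def locate(fm_index, pattern):
    """Backward search: return the sorted start positions of pattern in the text."""
    suffix_array, first_pos, occ = fm_index
    lo, hi = 0, len(suffix_array)  # half-open interval of matching suffixes
    for c in reversed(pattern):
        if c not in first_pos:
            return []
        lo = first_pos[c] + occ[c][lo]
        hi = first_pos[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(suffix_array[lo:hi])


fm = build_fm_index("ACGTACGTGA")  # hypothetical unknown sequence
print(locate(fm, "ACGT"))
print(locate(fm, "GA"))
```

Each query costs time proportional to the pattern length rather than to the size of the indexed text, which is why this family of indexes suits a large, fixed collection of unknown reads.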
(2) Determining the query sequence set required for retrieval (FIG. 3 ②)
Reverse search accepts any biomolecular sequence as query input. Known or annotated sequence data, such as the genomes of exogenous pathogens (for example viruses and bacteria), rare-disease-related genes, sequences of interest obtained by comparing gene variation data of interest against genes, and even the sequence of any known gene, are integrated by an internal automated process into a sequence data set that is submitted as the initial query sequences for the search. An in-house program converts the different types of genomic data into query sequences that meet the input criteria of reverse search.
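The in-house conversion program is not described further, so the following is only an assumed illustration of what such a normalization step might look like: it takes (name, sequence) pairs from arbitrary sources, strips whitespace, upper-cases the residues, and emits a FASTA-formatted query set. The function name, the record names and the 60-column line width are hypothetical choices.

```python
def to_query_fasta(records, line_width=60):
    """Normalize (name, sequence) pairs into FASTA text; empty sequences are skipped."""
    lines = []
    for name, seq in records:
        cleaned = "".join(seq.split()).upper()  # drop spaces and line breaks
        if not cleaned:
            continue
        lines.append(f">{name}")
        for start in range(0, len(cleaned), line_width):
            lines.append(cleaned[start:start + line_width])
    return "".join(line + "\n" for line in lines)


# Hypothetical known sequences collected from different sources
records = [
    ("viral_fragment", "acgt acgt\nttga"),
    ("empty_entry", "   "),
]
query_fasta = to_query_fasta(records)
print(query_fasta, end="")
```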
(3) Searching the index database group of unknown sequence data with known or annotated sequence data (FIG. 3 ③)
An index database group is constructed from the high-throughput, unannotated unknown sequence data generated by sequencing, using indexing tools such as makeblastdb and FM-index. The established index group is then searched with high-speed sequence retrieval tools, such as PSISearch, SBTblast and smartSearch, and with various open-source tools from China and abroad, such as SBT Search, PSISearch2, megaBLAST, BLAT and Compressive BLAST, so that multiple retrieval approaches are available.
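For the makeblastdb and megaBLAST route, the two stages map onto two NCBI BLAST+ command lines. The sketch below only assembles the argument lists, so it can be inspected without BLAST+ installed; the file and database names are hypothetical, and the choice of tabular output (`-outfmt 6`) is an assumption rather than something the description specifies.

```python
def makeblastdb_command(reads_fasta, db_name):
    """Argument list that indexes the unknown reads as a nucleotide BLAST database."""
    return ["makeblastdb", "-in", reads_fasta, "-dbtype", "nucl", "-out", db_name]


def megablast_command(query_fasta, db_name, max_hits=100):
    """Argument list that searches the read database with the known query set."""
    return [
        "blastn", "-task", "megablast",
        "-query", query_fasta,
        "-db", db_name,
        "-outfmt", "6",                     # tabular output, one hit per line
        "-max_target_seqs", str(max_hits),  # cap on reported database sequences
    ]


# Hypothetical file names
index_cmd = makeblastdb_command("unknown_reads.fasta", "unknown_reads_db")
search_cmd = megablast_command("known_queries.fasta", "unknown_reads_db")
print(" ".join(index_cmd))
print(" ".join(search_cmd))
```

Where BLAST+ is installed, each list can be passed to `subprocess.run(cmd, check=True)`; on a cluster the same commands would typically be wrapped in LSF, PBS or SLURM job scripts, one per index database.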
One index database may be searched, or several, even more than hundreds. The underlying retrieval algorithms include the rigorous Smith-Waterman local alignment algorithm, the local-global alignment algorithm, the global-global alignment algorithm and so on. Retrieval may be a single search or an iterative search, and the latter may run for up to tens of iterations. Tool parameters default to each search tool's own default values, and reverse search exposes the main parameters for adjustment. Search results are presented as a ranked list of matches and as pairwise alignments of the matched sequences, and by default the top 100 matches are reported. For parallel processing of searches, job scheduling systems such as LSF, PBS and SLURM are used to achieve high-speed search of PB-scale omics big data.
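The Smith-Waterman scoring and the top-100 default can be sketched as follows. This is a minimal illustration with a linear gap penalty and hypothetical scoring values (match 2, mismatch -1, gap -2); the tools named above use their own scoring matrices, affine gaps and heuristics. The read names and sequences are also hypothetical.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (linear gap penalty, two-row DP)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def top_hits(query, reads, limit=100):
    """Score every read against the query and keep the best `limit` non-zero hits."""
    scored = [(smith_waterman_score(query, seq), read_id)
              for read_id, seq in reads.items()]
    scored = [hit for hit in scored if hit[0] > 0]
    scored.sort(key=lambda hit: (-hit[0], hit[1]))  # best score first, ties by ID
    return scored[:limit]


# Hypothetical query and unknown reads
reads = {"r1": "TTACGTTT", "r2": "CCCC", "r3": "ACGA"}
ranked = top_hits("ACGT", reads, limit=2)
print(ranked)
```

An iterative search would feed the sequences of the retained hits back in as additional queries for the next round; that loop is omitted here.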
(4) Multiple sequence alignment of reverse search results (FIG. 3 ④)
The sequence retrieval result gives only a preliminary characterization of the unknown sequence set. Multiple sequence alignment allows the similar sequences listed in the retrieval result to be analyzed as a whole and provides clues for downstream analyses such as genotype identification. The retrieval result outputs a large number of pairwise alignments of similar sequences; sequences are selected and extracted according to the similarity of these pairwise alignments and submitted to multiple sequence alignment, and genotypes and genetic variation are then analyzed from the multiple alignment result. Combining the reverse retrieval technique with the multiple sequence alignment algorithms Clustal Omega, T-Coffee, MUSCLE and Kalign makes it possible to rapidly identify genotype information, such as gene variation, in genomes that have not been assembled.
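Once an aligner such as Clustal Omega or MUSCLE has produced a multiple alignment, candidate variation can be read off column by column. The sketch below assumes the alignment is already available as equal-length gapped strings; it does not run any aligner, and the sequence names and the alignment itself are hypothetical.

```python
from collections import Counter


def variant_columns(aligned):
    """Return (column, consensus residue, residue counts) for every non-uniform column."""
    seqs = list(aligned.values())
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    variants = []
    for col in range(length):
        counts = Counter(s[col] for s in seqs)
        if len(counts) > 1:  # more than one residue (or a gap) observed here
            consensus = counts.most_common(1)[0][0]
            variants.append((col, consensus, dict(counts)))
    return variants


# Hypothetical alignment of a known query with three retrieved reads
alignment = {
    "query": "ACGTACGT",
    "read1": "ACGTACGT",
    "read2": "ACCTACGT",  # substitution at column 2
    "read3": "ACGTA-GT",  # deletion at column 5
}
variants = variant_columns(alignment)
print(variants)
```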
(5) Genotype identification and in-depth analysis of exogenous disease genome (FIG. 3 ⑤)
Gene variation is identified in the multiple sequence alignment result, or genotype analysis is performed with the latest artificial intelligence deep learning methods. In addition, reverse search is applied to the analysis of exogenous disease genome data, for example to study the relationship between cancer genomes and the genes of exogenous species such as viruses and bacteria. The developed method provides high-speed search of PB-scale genome data, and the results can be used for genotype analysis (including single-nucleotide insertion, deletion and modification), for structural variation analysis independent of a reference genome, for combined analysis of genetic variation and new molecular subtypes, for analysis of exogenous pathogenic genomes together with human genes, for identifying the relationship between cancer and different viruses, and so on.
In terms of methods, genotype identification can use machine learning and deep learning tools, such as the convolutional neural network (CNN) model tools Eigen, DeepSEA and DeepBind; in terms of hardware utilization, GPU programming interface technologies such as OpenCL, CUDA and Brook are used to realize bioinformatics applications in which the CPU and GPU operate cooperatively; in terms of software programming, libraries such as Google TensorFlow, Keras and Caffe are used to implement and train deep learning convolutional neural network (CNN) models.
It should be understood that the above embodiments are merely examples given to illustrate the invention clearly and do not limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to list all embodiments exhaustively here. Any modification, equivalent replacement or improvement made within the spirit and principle of the invention shall fall within the protection scope of the claims.
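As a hedged illustration of how sequence data reach a CNN, the sketch below one-hot encodes a DNA string and slides a single filter over it, which is the basic operation of a first convolutional layer. It uses plain Python instead of TensorFlow, Keras or Caffe, involves no training, and the filter is simply the one-hot encoding of a hypothetical two-base motif; it is not the model of any tool named above.

```python
BASE_CHANNELS = {"A": 0, "C": 1, "G": 2, "T": 3}


def one_hot_encode(seq):
    """Encode DNA as a list of 4-channel rows; unknown bases such as N become all zeros."""
    rows = []
    for base in seq.upper():
        row = [0, 0, 0, 0]
        channel = BASE_CHANNELS.get(base)
        if channel is not None:
            row[channel] = 1
        rows.append(row)
    return rows


def motif_scan(encoded, kernel):
    """Slide one filter along the sequence (valid 1D convolution without flipping)."""
    width = len(kernel)
    return [
        sum(encoded[i + j][c] * kernel[j][c] for j in range(width) for c in range(4))
        for i in range(len(encoded) - width + 1)
    ]


encoded = one_hot_encode("acgcg")   # hypothetical read
kernel = one_hot_encode("CG")       # hypothetical motif used as a fixed filter
scores = motif_scan(encoded, kernel)
print(scores)
```

In a trained network the filter weights are learned rather than fixed, and many filters run in parallel, which is where the GPU interfaces mentioned above come in.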

Claims (6)

1. A reverse retrieval method of life omics sequence data, characterized by comprising the following steps:
S1, comprehensively indexing unknown sequence data generated by sequencing to construct an integrated index database group;
S2, determining known or annotated sequence data as the query sequence set required for retrieval, and then searching the index database group with the query sequence set;
the known or annotated sequence data being the sequence of any known gene, an exogenous disease genome, a rare disease gene, or a sequence of interest obtained by comparing gene variation data of interest against genes;
step S2 being performed using the PSISearch, SBTblast, smartSearch, SBT Search, PSISearch2, megaBLAST, BLAT or Compressive BLAST sequence search tool.
2. The reverse retrieval method of life omics sequence data according to claim 1, characterized in that: in step S1, the unknown sequence data are indexed with high-speed compression using the Sequence Bloom Trees algorithm, the FM-index algorithm, the Population BWT algorithm or the makeblastdb tool, and the index database group is then constructed.
3. The reverse retrieval method of life omics sequence data according to claim 2, characterized in that: the index database group is stored in a distributed manner using Hadoop data integration technology or Spark data integration technology.
4. The reverse retrieval method of life omics sequence data according to claim 1, characterized in that: the retrieval algorithm used in step S2 is the Smith-Waterman local alignment algorithm, a local-global alignment algorithm, or a global-global alignment algorithm.
5. The reverse retrieval method of life omics sequence data according to claim 4, characterized in that: the retrieval in step S2 is a single retrieval or an iterative retrieval.
6. The reverse retrieval method of life omics sequence data according to any one of claims 1 to 5, characterized in that: after the search result is output in step S2, the output search result is subjected to multiple sequence alignment, and genotype analysis, genetic variation identification, or exogenous pathogenic genome analysis is performed based on the result of the multiple sequence alignment.
CN201710586828.3A 2017-07-18 2017-07-18 Reverse retrieval method of life omics sequence data Active CN107526942B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710586828.3A CN107526942B (en) 2017-07-18 2017-07-18 Reverse retrieval method of life omics sequence data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710586828.3A CN107526942B (en) 2017-07-18 2017-07-18 Reverse retrieval method of life omics sequence data

Publications (2)

Publication Number Publication Date
CN107526942A CN107526942A (en) 2017-12-29
CN107526942B true CN107526942B (en) 2021-04-20

Family

ID=60749088

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710586828.3A Active CN107526942B (en) 2017-07-18 2017-07-18 Reverse retrieval method of life omics sequence data

Country Status (1)

Country Link
CN (1) CN107526942B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1268178A (en) * 1997-05-06 2000-09-27 人体基因组科学有限公司 i (Enterococcus faecalis) polynucleotides and polypeptides
CN1665840A (en) * 2002-04-30 2005-09-07 阿雷斯贸易股份有限公司 Immunoglobulin-domain containing cell surface recognition molecules
CN101600340A (en) * 2006-12-15 2009-12-09 农业经济有限责任公司 The production of plant with oil, albumen or fiber content of change
CN101113474A (en) * 2007-07-27 2008-01-30 中国人民解放军军事医学科学院卫生学环境医学研究所 Whole-genom sifting method for BPDE carcinogen related gene
CN105229651A (en) * 2013-05-23 2016-01-06 皇家飞利浦有限公司 DNA sequence dna fast and the retrieval of safety
CN105912649A (en) * 2016-04-08 2016-08-31 南京邮电大学 Database fuzzy retrieval method and system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Query-seeded iterative sequence similarity searching improves selectivity 5-20-fold;Pearson, WR 等;《NUCLEIC ACIDS RESEARCH》;20170420;全文 *
基于概率的反向K最近邻高效查询算法研究;任长安 等;《计算机应用研究》;20140827;第391页第2段-第396页倒数第2段 *
通过构建蛋白质结构域功能模版库做基于序列的蛋白质功能位点预测;安雄博;《中国优秀硕士学位论文全文数据库基础科学辑》;20150315;全文 *

Also Published As

Publication number Publication date
CN107526942A (en) 2017-12-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant