CN106650315A - SIFT parallel algorithm based on CPU+MIC heterogeneous platform - Google Patents

Publication number
CN106650315A
CN106650315A CN201611081510.1A CN201611081510A CN106650315A CN 106650315 A CN106650315 A CN 106650315A CN 201611081510 A CN201611081510 A CN 201611081510A CN 106650315 A CN106650315 A CN 106650315A
Authority
CN
China
Prior art keywords
sequence
mic
sift
cpu
parallel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611081510.1A
Other languages
Chinese (zh)
Other versions
CN106650315B (en)
Inventor
董昊
龚湛
张清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Wave Intelligent Technology Co Ltd
Original Assignee
Zhengzhou Yunhai Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou Yunhai Information Technology Co Ltd
Priority to CN201611081510.1A
Publication of CN106650315A
Application granted
Publication of CN106650315B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Biophysics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a SIFT parallel algorithm based on a CPU+MIC heterogeneous platform. The core module of the SIFT algorithm is accelerated on MIC. Using a message passing mechanism, the current multi-sequence input is split into multiple single protein sequences, each protein sequence is processed in parallel and aligned against a database, and the available parallelism is exploited. The efficiency of the whole algorithm is greatly improved, and problems such as the low application performance and low production efficiency of traditional CPU computing methods and systems are addressed.

Description

A SIFT parallel algorithm based on a CPU+MIC heterogeneous platform
Technical field
The present invention relates to techniques for implementing the SIFT algorithm in parallel on heterogeneous platforms, and in particular to a parallel acceleration method based on CPU+MIC heterogeneous computing.
Background
SIFT is a tool for predicting whether an amino acid change affects the function of a protein. It applies both to mutations occurring in nature and to variations artificially induced in the laboratory.
SIFT first searches for homologous sequences with BLAST, then selects closely related sequences with PSI-BLAST, and finally calculates whether the amino acid substitution affects the function of the protein; the detailed flow is shown in Figure 1. SIFT thus consists of the PSIBLAST (PSI-BLAST) algorithm and other post-processing steps. PSIBLAST is the core algorithm of SIFT. It is a database similarity search tool based on local sequence alignment, a heuristic search algorithm whose core is seeding and extending. The flow is as follows:
1. Build the list of query words (make lookup table):
a) Divide the query sequence into words of length W (typically 3 for proteins) and build the list of words of length W;
b) Find all neighbor words whose alignment score against a query word (according to the scoring matrix) exceeds threshold T, and add them to the query word list.
2. Search the database for hits (seeding stage): scan the database; each exact match with a word in the query word list forms a hit, which serves as a seed for the next step.
3. Extend the seeds (extending stage): extend each seed in both the left and right directions according to the scoring matrix until the score falls below threshold S; the result is called an HSP.
4. Trace back according to the scoring matrix to obtain the aligned result sequences.
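The four steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the patent's or NCBI's implementation: the toy scoring function (+2 match, -1 mismatch), the function names, and the default values of `w`, `t`, and `s` are all assumptions, and the traceback of step 4 is omitted.

```python
from itertools import product

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def score(a, b):
    # Toy scoring (assumption): real searches use a matrix such as BLOSUM62.
    return 2 if a == b else -1

def build_word_list(query, w=3, t=4):
    """Step 1: map every query word and its neighbor words scoring above t to query positions."""
    table = {}
    for pos in range(len(query) - w + 1):
        word = query[pos:pos + w]
        for cand in product(ALPHABET, repeat=w):
            if sum(score(x, y) for x, y in zip(word, cand)) > t:
                table.setdefault("".join(cand), []).append(pos)
    return table

def find_seeds(table, subject, w=3):
    """Step 2: every exact match between a subject word and the word list is a hit (seed)."""
    seeds = []
    for spos in range(len(subject) - w + 1):
        for qpos in table.get(subject[spos:spos + w], []):
            seeds.append((qpos, spos))
    return seeds

def _extend(query, subject, qi, si, step, base, s):
    """Walk in one direction from the seed; stop when the running score drops below s."""
    run, best, best_len, n = base, base, 0, 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        run += score(query[qi], subject[si])
        n += 1
        if run < s:
            break
        if run > best:
            best, best_len = run, n
        qi += step
        si += step
    return best - base, best_len

def extend_seed(query, subject, qpos, spos, w=3, s=1):
    """Step 3: ungapped extension; each direction is extended independently from the seed score.
    Returns (query_start, subject_start, length, score) of the resulting segment pair."""
    base = sum(score(query[qpos + i], subject[spos + i]) for i in range(w))
    gain_r, len_r = _extend(query, subject, qpos + w, spos + w, +1, base, s)
    gain_l, len_l = _extend(query, subject, qpos - 1, spos - 1, -1, base, s)
    return (qpos - len_l, spos - len_l, w + len_l + len_r, base + gain_l + gain_r)
```

For example, `find_seeds(build_word_list("MKVLAAGIT"), "GGMKVLAAGITGG")` yields one seed per shared word, and extending any of them recovers the full nine-residue match.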
The basic BLAST algorithm does not consider gap insertion. However, insertions and deletions are common mutations during biological evolution, so alignment results often contain several gap-free but discontiguous regions. If these high-scoring segments can be joined through segments of lower similarity that contain gaps, longer or more biologically meaningful alignments can be formed. The improved BLAST algorithm therefore allows gaps: among the multiple HSPs it finds the best, maximal-scoring segment pair (MSP), extends that segment toward both ends of the sequences using dynamic programming, and finally produces an optimal alignment with a higher score that may contain gaps.
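To illustrate why allowing gaps can raise the score, the following sketch computes a gapped local alignment score by dynamic programming in the Smith-Waterman style. It is a simplification chosen for illustration: the linear gap penalty and the scoring values are assumptions, and it fills the whole matrix rather than extending outward from a seed as a gapped BLAST extension would.

```python
def gapped_local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Either align the two residues, or open a gap in one of the sequences.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best
```

With these values, "MKVLAAGIT" against "MKVLGIT" scores 8 without gaps (the shared "MKVL") but 10 when a two-residue gap joins "MKVL" and "GIT".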
Flow of the improved (gapped) algorithm:
(1) Two-hit stage: two neighboring hits that lie within distance A of each other and each score above T are joined to form a seed for the next step;
(2) Two-step extension: the seed is first extended without gaps to form an HSP (as in the original version of BLAST), and then extended with gaps.
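The two-hit rule can be sketched as follows. The text only requires two neighboring hits within distance A; requiring both hits to lie on the same diagonal and not to overlap is an assumption taken from common descriptions of gapped BLAST, and the function name and defaults are hypothetical.

```python
from collections import defaultdict

def two_hit_seeds(hits, a=40, w=3):
    """hits: iterable of (qpos, spos) word hits.
    Returns pairs of non-overlapping hits on one diagonal that are less than a apart."""
    by_diag = defaultdict(list)
    for qpos, spos in hits:
        by_diag[spos - qpos].append(qpos)
    seeds = []
    for diag, positions in by_diag.items():
        positions.sort()
        last = None
        for q in positions:
            if last is not None and w <= q - last < a:
                seeds.append(((last, last + diag), (q, q + diag)))
            # A hit overlapping the previous one is ignored; otherwise it becomes the new reference.
            if last is None or q - last >= w:
                last = q
    return seeds
```

Only the seeds returned here would proceed to the two-step extension, which reduces the number of extensions compared with extending every single hit.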
PSI-BLAST (Position-Specific Iterative BLAST) is a position-specific iterative BLAST search, mainly for protein sequences, used chiefly to find proteins distantly related to the protein of interest. After the first BLAST search, a PSSM (position-specific scoring matrix) is built from the most similar sequences in the results. A second BLAST search is then performed with this matrix, the matrix is adjusted again, another search is run, and so on iteratively. Compared with the blastp program, this improves the sensitivity of the search. (Traditional BLAST depends heavily on the scoring matrix: the score of an HSP depends on a fixed scoring matrix, whereas the PSSM makes it possible to align distantly related proteins that could not otherwise be found.) The PSIBLAST flow chart is shown in Figure 2.
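The matrix-building step of one iteration can be illustrated with a minimal sketch. It assumes gap-free aligned sequences of equal length, a uniform background frequency, and a fixed pseudocount; these are simplifications for illustration, and production PSI-BLAST uses more elaborate sequence weighting and pseudocount schemes.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned, pseudocount=1.0):
    """Build one log-odds column per alignment position from equal-length sequences."""
    background = 1.0 / len(ALPHABET)  # uniform background (assumption)
    pssm = []
    for col in range(len(aligned[0])):
        counts = {aa: pseudocount for aa in ALPHABET}
        for seq in aligned:
            counts[seq[col]] += 1
        total = sum(counts.values())
        pssm.append({aa: math.log2(counts[aa] / total / background) for aa in ALPHABET})
    return pssm

def score_with_pssm(pssm, segment):
    """Score a segment position by position instead of with a fixed substitution matrix."""
    return sum(column[aa] for column, aa in zip(pssm, segment))
```

Residues that are frequent in a column receive positive scores at that position only, which is what lets later iterations pick up more distant relatives than a fixed matrix would.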
Computing speed is critical in high-performance computing, which is moving toward multi-core and many-core architectures and uses heterogeneous parallelism to raise computing speed. CPU+MIC is currently a very mature heterogeneous cooperative computing model, suited to highly parallel applications or algorithms such as bioinformatics, computational fluid dynamics, and FFT computation. However, MIC coprocessors pose major challenges in programming efficiency, fine-grained parallel algorithm design, and large-scale parallel performance. With the official release of Intel KNL (Knights Landing), CPU+MIC becomes a good choice for high-performance computing. KNL's many-core design, its binary compatibility with programs for traditional CPU platforms, and its other features make it particularly suitable for highly parallel algorithms, and this architecture can greatly improve programming efficiency while raising application performance. Combining MIC with CPUs can resolve more application performance bottlenecks, although performance remains a challenge for memory-intensive applications with a low degree of vectorization. The combination can largely meet the computing performance requirements of different applications.
Summary of the invention
The technical problem to be solved by the present invention is to exploit the characteristics of the SIFT algorithm and the ideas of parallel processing, using the advantages of the CPU+MIC heterogeneous platform and in particular the features of Intel's new-generation KNL coprocessor, so as to greatly improve the efficiency of the whole algorithm and to address problems such as the low application performance and low production efficiency of traditional CPU computing methods and systems.
To solve the above technical problem, the present invention provides a SIFT parallel algorithm based on a CPU+MIC heterogeneous platform. It has the advantages of greatly improving algorithm efficiency and of addressing the low application performance and low production efficiency of traditional CPU computing methods and systems.
The invention provides a parallel processing scheme for the SIFT algorithm implemented on a CPU+MIC heterogeneous platform, together with a cluster system that uses MPI for the parallel processing of protein sequence alignment. The system is divided into a hardware system and a PSIBLAST algorithm software system, where the hardware system includes:
A system of one main processor platform and multiple coprocessor acceleration platforms, where the main processor computes on pure CPU chips and the cooperating acceleration platforms are KNL platforms based on Intel MIC technology; and a dedicated OPA high-speed network that connects the nodes of the MIC cluster so that the nodes can communicate with one another at high speed. The hardware architecture is shown in Figure 3.
To achieve the above object, the present invention adopts the following technical solution.
A SIFT parallel algorithm based on a CPU+MIC heterogeneous platform: the core module of the SIFT algorithm is accelerated on MIC; using a message passing mechanism, the current multi-sequence input is split into multiple single protein sequences; each protein sequence is processed with parallel acceleration and aligned against the database, exploiting the available parallelism.
The SIFT parallel algorithm based on the CPU+MIC heterogeneous platform specifically includes:
Step 1: initialize the parameters, including query length, scoring matrix, threshold T, alignment sequence database, number of iterations, expectation value for significance analysis, alignment length, and so on, for parallel computation in the sequence alignment module PSIBLAST.
Step 2: split the protein multi-sequence data according to bioinformatics standards.
Step 3: distribute the protein sequences using the MPI message passing mechanism, sending them to the MIC cluster for accelerated computation.
Step 4: output the alignment results, count the alignments, select protein sequences according to the seed median, sort the sequences, generate the prediction information, and merge the data.
The PSIBLAST parallel algorithm first decomposes the whole sequence population into several subgroups and then runs the sequence alignment within each subgroup concurrently. Because the protein sequence data in different subgroups are mutually independent and uncorrelated, the alignment is well suited to parallel computation. The subgroups can therefore be distributed across multiple processor nodes or multiple threads and processed in parallel, which improves the running efficiency of the whole PSIBLAST algorithm, allows the computation to scale, and meets the requirements of high-performance applications.
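Because the subgroups are independent, they can be mapped onto a pool of workers without any coordination between them. The sketch below shows the pattern with Python threads; `align_subgroup` is a hypothetical stand-in for the real alignment of one subgroup, and a real deployment would use processes, MPI ranks, or native threads rather than Python threads for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor

def align_subgroup(subgroup):
    """Hypothetical stand-in: 'align' each (name, sequence) pair by reporting its length."""
    return [(name, len(seq)) for name, seq in subgroup]

def run_subgroups(subgroups, workers=4):
    """Process independent subgroups concurrently and merge the results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(align_subgroup, subgroups))
    return [item for group in results for item in group]
```

Since no subgroup reads another subgroup's data, the only synchronization point is the final merge.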
Building on the model of parallel partitioning of the protein sequence database, the present invention proposes a query-sequence splitting algorithm. It uses bioinformatics protein splitting techniques to split the multi-query protein sequence data effectively and reasonably. The input of the query-sequence splitting algorithm is a single text file containing multiple protein query sequences awaiting alignment; the output is the data set for sequence alignment, which feeds the MPI-based PSIBLAST algorithm flow.
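A minimal sketch of such a splitting step is shown below. It assumes the input text is in FASTA format (a `>` header line followed by sequence lines); the patent does not name the file format, so this is an assumption, as is the function name.

```python
def split_queries(text):
    """Split multi-sequence text into a list of (header, sequence) records."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records
```

Each returned record is a single protein query that can be dispatched independently to a worker.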
To better accelerate the SIFT algorithm, the algorithm of the invention is divided into three parts: (1) the core sequence alignment module PSIBLAST of the SIFT algorithm is accelerated on the MIC platform, using vectorization and multithreading to achieve a good speedup; (2) the other steps of the SIFT algorithm run on the CPU, taking advantage of the CPU platform's high clock frequency, and cooperate heterogeneously with the PSIBLAST computation; (3) the multi-protein sequence data are split effectively and the SIFT computations are run in parallel, further exploiting the parallelism of the SIFT algorithm; to make this more efficient, a parallel computing model based on MPI is adopted according to the characteristics of the data flow, improving algorithm efficiency.
Beneficial effects of the present invention: the present invention greatly improves the efficiency of the whole algorithm and addresses problems such as the low application performance and low production efficiency of traditional CPU computing methods and systems.
Description of the drawings
Fig. 1. Schematic flow diagram of the SIFT algorithm.
Fig. 2. Schematic flow diagram of the PSIBLAST algorithm.
Fig. 3. CPU+MIC heterogeneous platform system.
Fig. 4. Schematic diagram of the SIFT parallel algorithm on the CPU+MIC heterogeneous platform.
Detailed description
The invention is further described below with reference to the accompanying drawings and an embodiment.
As shown in Figures 3 and 4, a SIFT parallel algorithm based on a CPU+MIC heterogeneous platform performs heterogeneous accelerated computation. The p53 data are sent to the MIC platform for accelerated execution. To align the multi-protein sequence data in parallel, the protein sequence data text is split and sent, with load balancing, via the message passing mechanism to the MIC cluster for algorithm acceleration. After the accelerated run completes, the results are analyzed and combined. The concrete implementation steps are as follows.
Step 1: Initialize the parameters, including query length, scoring matrix, threshold T, alignment sequence database, number of iterations, expectation value for significance analysis, alignment length, and so on. Because the parallel sequence alignment module PSIBLAST is used, the number of threads must also be set.
Step 2: Because the SIFT flow must be run on multi-sequence protein data, the protein multi-sequence data must first be split according to bioinformatics standards.
Step 3: Distribute the protein sequences using the MPI message passing mechanism, sending them to the MIC cluster for accelerated computation.
The MPI pseudocode is as follows:
MPI_Init()
{
    MPI_Comm_size(n);
    MPI_Comm_rank(CurRank);
    if (CurRank == 0) // master node: initialize the query sequences and distribute the computing tasks
    {
        MPI_Scatterv(sequence[]);
        ……
        MPI_Gatherv(result[]);
    }
    else // CurRank is not 0: worker node, receives tasks from the master node and performs the sequence alignment
    {
        PSIBLAST();
        ……
    }
    MPI_Finalize();
}
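The pseudocode leaves open how the master decides which sequences go to which worker. One possible scheme, shown here purely as an assumption, is a greedy longest-first assignment that balances the total residue count per rank before the scatter. The sketch is plain Python and does not call MPI; note also that in actual MPI programs `MPI_Scatterv` and `MPI_Gatherv` are collective operations that every rank in the communicator calls, not only the root.

```python
import heapq

def assign_to_ranks(records, n_ranks):
    """Greedy longest-first assignment of (name, sequence) records to worker ranks.
    Returns one list of record names per rank."""
    heap = [(0, rank) for rank in range(n_ranks)]  # (current load, rank)
    heapq.heapify(heap)
    plan = [[] for _ in range(n_ranks)]
    for name, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        load, rank = heapq.heappop(heap)  # least-loaded rank so far
        plan[rank].append(name)
        heapq.heappush(heap, (load + len(seq), rank))
    return plan
```

The per-rank lists produced this way could then be turned into the counts and displacements that a scatter with variable block sizes expects.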
Step 4: Output the alignment results, count the alignments, select protein sequences according to the seed median, sort the sequences, generate the prediction information, and merge the data. Because the other steps of the SIFT algorithm take little time and have a simple flow, they run on the CPU side, where the CPU's high clock frequency allows them to be processed quickly.
Although specific embodiments of the present invention have been described above with reference to the accompanying drawings, they do not limit the scope of protection of the invention. Those of ordinary skill in the art should understand that various modifications or variations that can be made on the basis of the technical solution of the invention without creative effort remain within the scope of protection of the invention.

Claims (2)

1. A SIFT parallel algorithm based on a CPU+MIC heterogeneous platform, characterized in that the core module of the SIFT algorithm is accelerated on MIC; using a message passing mechanism, the current multi-sequence input is split into multiple single protein sequences; each protein sequence is processed with parallel acceleration and aligned against the database, exploiting the available parallelism.
2. The SIFT parallel algorithm based on a CPU+MIC heterogeneous platform according to claim 1, characterized in that it specifically includes: Step 1, initializing the parameters, including query length, scoring matrix, threshold T, alignment sequence database, number of iterations, expectation value for significance analysis, alignment length, and so on, for parallel computation in the sequence alignment module PSIBLAST;
Step 2, splitting the protein multi-sequence data according to bioinformatics standards;
Step 3, distributing the protein sequences using the MPI message passing mechanism, sending them to the MIC cluster for accelerated computation;
Step 4, outputting the alignment results, counting the alignments, selecting protein sequences according to the seed median, sorting the sequences, generating the prediction information, and merging the data.
CN201611081510.1A 2016-11-30 2016-11-30 SIFT parallel processing method based on CPU + MIC heterogeneous platform Active CN106650315B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611081510.1A CN106650315B (en) 2016-11-30 2016-11-30 SIFT parallel processing method based on CPU + MIC heterogeneous platform

Publications (2)

Publication Number Publication Date
CN106650315A true CN106650315A (en) 2017-05-10
CN106650315B CN106650315B (en) 2020-01-03

Family

ID=58813598

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611081510.1A Active CN106650315B (en) 2016-11-30 2016-11-30 SIFT parallel processing method based on CPU + MIC heterogeneous platform

Country Status (1)

Country Link
CN (1) CN106650315B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103279391A (en) * 2013-06-09 2013-09-04 浪潮电子信息产业股份有限公司 Load balancing optimization method based on CPU (central processing unit) and MIC (many integrated core) framework processor cooperative computing
CN103294639A (en) * 2013-06-09 2013-09-11 浪潮电子信息产业股份有限公司 CPU+MIC mixed heterogeneous cluster system for achieving large-scale computing
CN104375807A (en) * 2014-12-09 2015-02-25 中国人民解放军国防科学技术大学 Three-level flow sequence comparison method based on many-core co-processor

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
MARCO ALDINUCCI ET AL.: "Parallel stochastic systems biology in the cloud", 《BRIEFINGS IN BIOINFORMATICS》 *
YINGBO CUI ET AL.: "B-MIC: an Ultrafast Three-level Parallel Sequence Aligner Using MIC", 《INTERDISCIP SCI COMPUT LIFE SCI》 *
叶笑春 等: "蛋白质序列比对算法在众核结构上的并行优化", 《软件学报》 *
戴洛 等: "应用点突变预测程序(SIFT)检查MLH1蛋白质中的结肠癌相关点突变", 《基础研究》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111599403A (en) * 2020-05-22 2020-08-28 电子科技大学 Parallel drug-target correlation prediction method based on sequencing learning
CN111599403B (en) * 2020-05-22 2023-03-14 电子科技大学 Parallel drug-target correlation prediction method based on sequencing learning

Also Published As

Publication number Publication date
CN106650315B (en) 2020-01-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20191210

Address after: 215100 No. 1 Guanpu Road, Guoxiang Street, Wuzhong Economic Development Zone, Suzhou City, Jiangsu Province

Applicant after: Suzhou Wave Intelligent Technology Co., Ltd.

Address before: 450000 Henan province Zheng Dong New District of Zhengzhou City Xinyi Road No. 278 16 floor room 1601

Applicant before: Zhengzhou Yunhai Information Technology Co. Ltd.

GR01 Patent grant