CN106650315A - SIFT parallel algorithm based on CPU+MIC heterogeneous platform - Google Patents
SIFT parallel algorithm based on CPU+MIC heterogeneous platform
- Publication number: CN106650315A
- Application number: CN201611081510.1A
- Authority: CN (China)
- Prior art keywords: sequence, MIC, SIFT, CPU, parallel
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
Landscapes
- Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Chemical & Material Sciences (AREA)
- Analytical Chemistry (AREA)
- Biophysics (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Health & Medical Sciences (AREA)
- Engineering & Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Theoretical Computer Science (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a SIFT parallel algorithm based on a CPU+MIC heterogeneous platform. The core module of the SIFT algorithm is accelerated on MIC. A message passing mechanism is used to split the current multi-sequence input into multiple individual protein sequences; each protein sequence is then processed in parallel and aligned against the database, which exploits the available parallelism. The efficiency of the whole algorithm is greatly improved, and the low application performance and low production efficiency of traditional CPU computing methods and systems are overcome.
Description
Technical field
The present invention relates to the implementation of a parallel SIFT algorithm on a heterogeneous platform, and more particularly to a parallel acceleration method based on heterogeneous computing on a CPU+MIC platform.
Background art
SIFT is a tool for predicting whether an amino acid change affects the function of a protein. It applies both to naturally occurring mutations and to variants induced artificially in the laboratory.
SIFT first searches for homologous sequences with BLAST, then selects closely related sequences with PSI-BLAST, and finally calculates whether the amino acid substitution affects the function of the protein. The flow is shown in Fig. 1. SIFT therefore consists of the PSI-BLAST algorithm plus a number of subsequent processing steps. PSI-BLAST is the core algorithm of SIFT. It is a database similarity search tool based on local sequence alignment, a heuristic search algorithm whose core consists of seeding and extending. The flow is as follows:
1. Build the list of query words (make lookup table):
a) split the query sequence into words of length W (usually 3 for proteins) and build the list of W-length words;
b) find all neighbor words whose alignment score against a query word (according to the scoring matrix) is greater than threshold T, and add them to the query word list.
2. Search for hits in the database (seeding stage): scan the database; every exact match with a word in the query word list forms a hit, which serves as the seed for the next step.
3. Extend the seeds (extending stage): each seed is extended in both the left and right directions according to the scoring matrix until the score falls below threshold S. The result is called an HSP (high-scoring segment pair).
4. Trace back according to the score matrix to obtain the aligned result sequence.
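Steps 1 to 3 above can be illustrated with a short Python sketch. It is not the patented implementation: the match/mismatch scoring function stands in for a real substitution matrix, the extension stops when the running score drops a fixed amount below the best score seen (a common stand-in for the threshold test described above), and all function and parameter names are assumptions made for illustration.

```python
from itertools import product

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def score(a, b, match=5, mismatch=-2):
    """Toy substitution score (assumption); real searches use a matrix such as BLOSUM62."""
    return match if a == b else mismatch


def word_score(u, v):
    return sum(score(a, b) for a, b in zip(u, v))


def build_lookup(query, w=3, t=7):
    """Step 1: map each query word and each neighbor scoring > t to its query offsets."""
    table = {}
    for i in range(len(query) - w + 1):
        word = query[i:i + w]
        table.setdefault(word, []).append(i)            # a) the query word itself
        for cand in map("".join, product(ALPHABET, repeat=w)):
            if cand != word and word_score(word, cand) > t:
                table.setdefault(cand, []).append(i)    # b) neighbor words
    return table


def find_seeds(table, subject, w=3):
    """Step 2: every subject word found in the table is a hit (query_pos, subject_pos)."""
    return [(q, s)
            for s in range(len(subject) - w + 1)
            for q in table.get(subject[s:s + w], [])]


def _extend_dir(query, subject, q, s, step, drop):
    """Walk from (q, s) in direction step; return (best score gain, length at best)."""
    best = cur = best_len = n = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        cur += score(query[q], subject[s])
        n += 1
        if cur > best:
            best, best_len = cur, n
        elif best - cur > drop:
            break                                       # score fell too far: stop
        q += step
        s += step
    return best, best_len


def extend(query, subject, q, s, w=3, drop=6):
    """Step 3: ungapped extension of one seed to the left and to the right."""
    seed = word_score(query[q:q + w], subject[s:s + w])
    right, r_len = _extend_dir(query, subject, q + w, s + w, +1, drop)
    left, l_len = _extend_dir(query, subject, q - 1, s - 1, -1, drop)
    return {"q_start": q - l_len, "s_start": s - l_len,
            "length": l_len + w + r_len, "score": left + seed + right}
```

Enumerating all 20^W candidate words per query word is only practical for small W and short queries; production implementations prune this search, which is outside the scope of this sketch.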
The basic BLAST algorithm does not consider gap insertion, yet base insertions and deletions are common mutations during biological evolution. An alignment result therefore often contains several gap-free but discontinuous regions. If these high-scoring segments are joined through lower-similarity segments that contain gaps, longer or biologically more meaningful alignments can be formed. The improved BLAST algorithm therefore allows gaps: among the multiple HSPs it finds the maximal-scoring segment pair (MSP), uses it as the base, extends it in both directions of the sequence by dynamic programming, and finally produces an optimal alignment with a higher score that may contain gaps.
Flow of the improved algorithm (with gaps):
(1) Two-hit stage: two adjacent hits whose distance is less than A and whose scores are greater than T are joined to form a seed that enters the next step;
(2) Two-step extension: the seed is first extended without gaps to form an HSP (as in the original version of BLAST), and a gapped extension is performed afterwards.
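The two-hit stage can be sketched as follows. The sketch assumes, as in common descriptions of gapped BLAST, that the two hits must lie on the same diagonal (equal subject_pos - query_pos) and must not overlap; the text above only states the distance bound A. The function name and the default values are hypothetical.

```python
from collections import defaultdict


def two_hit_seeds(hits, w=3, a=40):
    """Pair up non-overlapping hits on the same diagonal that lie within distance a.

    hits: iterable of (query_pos, subject_pos) word hits of length w.
    Returns a list of (first_hit, second_hit) pairs.
    """
    by_diag = defaultdict(list)
    for q, s in hits:
        by_diag[s - q].append((q, s))       # hits on one diagonal share s - q
    seeds = []
    for diag_hits in by_diag.values():
        diag_hits.sort()
        last = None
        for hit in diag_hits:
            if last is not None:
                gap = hit[0] - last[0]
                if gap < w:
                    continue                # overlaps the previous hit: skip it
                if gap < a:
                    seeds.append((last, hit))
            last = hit
    return seeds
```

Requiring two nearby hits before extending reduces the number of extensions that have to be attempted, which is the usual motivation for this stage.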
PSI-BLAST (Position-Specific Iterated BLAST) is a position-specific iterative BLAST search. It targets protein sequences and is mainly used to find proteins that are distantly related to the protein of interest. After the first BLAST search, the most similar sequences in the results are used to build a PSSM (position-specific scoring matrix). A second BLAST search is then run with this matrix, the matrix is adjusted again, another search is run, and so on iteratively. Compared with the blastp program, the sensitivity of the search is improved. (Traditional BLAST depends heavily on the scoring matrix: the score of an HSP depends on a fixed scoring matrix, whereas the PSSM makes it possible to align distantly related proteins that could not otherwise be found.) The PSI-BLAST flow chart is shown in Fig. 2.
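The iteration described above can be sketched in Python. This is a simplified illustration under stated assumptions: the matrix is built from equal-length, gap-free aligned strings, the background frequency is uniform (1/20), and simple additive pseudocounts are used, whereas PSI-BLAST itself applies sequence weighting and more elaborate pseudocounts. The `search` callback is a hypothetical stand-in for one BLAST round.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Column-wise log-odds scores from equal-length aligned sequences."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("aligned sequences must have equal length")
    background = 1.0 / len(ALPHABET)        # assumption: uniform background
    pssm = []
    for col in range(length):
        counts = Counter(s[col] for s in aligned if s[col] in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        pssm.append({aa: math.log2((counts[aa] + pseudocount) / total / background)
                     for aa in ALPHABET})
    return pssm


def pssm_score(pssm, segment):
    """Score a segment against the matrix, one column per residue."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


def iterate(query, search, rounds=3):
    """Search, rebuild the matrix from the hits, and repeat until it stops changing.

    search(pssm) is a hypothetical callback returning aligned hit strings.
    """
    pssm = build_pssm([query])
    for _ in range(rounds):
        new = build_pssm([query] + list(search(pssm)))
        if new == pssm:
            break                           # converged: no new information
        pssm = new
    return pssm
```

Because each column carries its own scores, a residue that is conserved among the hits is rewarded more strongly than a fixed matrix would allow, which is what makes distant relatives detectable in later rounds.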
Computation speed is critical for high-performance computing (HPC). HPC is moving toward multi-core and many-core architectures and uses heterogeneous parallelism to raise computation speed. CPU+MIC is currently a highly mature heterogeneous cooperative computing model that suits highly parallel applications and algorithms such as bioinformatics, computational fluid dynamics and FFT computation. However, MIC coprocessors pose considerable challenges in programming efficiency, fine-grained parallel algorithm design and large-scale parallel performance. With the official release of Intel KNL (Knights Landing), CPU+MIC becomes a good choice for HPC. The many-core design of KNL, its binary compatibility with programs for traditional CPU platforms and its other technical features make it particularly suitable for highly parallel algorithms. This architecture can greatly improve programming efficiency while raising application performance. Combining MIC with CPU can resolve more application performance bottlenecks and goes a long way toward meeting the computing performance requirements of different applications, although memory-intensive applications with a low degree of vectorization still face performance challenges.
Summary of the invention
The technical problem to be solved by the present invention is to combine the characteristics of the SIFT algorithm and the idea of parallel processing with the advantages of the CPU+MIC heterogeneous platform, especially the characteristics of Intel's new-generation KNL coprocessor, so as to substantially improve the efficiency of the whole algorithm and to overcome the low application performance and low production efficiency of traditional CPU computing methods and systems.
To solve this technical problem, the present invention provides a SIFT parallel algorithm based on a CPU+MIC heterogeneous platform. It has the advantages that algorithm efficiency is substantially improved and that the low application performance and low production efficiency of traditional CPU computing methods and systems are overcome.
The invention provides a parallel processing scheme for the SIFT algorithm on a CPU+MIC heterogeneous platform, together with a cluster system that uses MPI for parallel computation and for the parallel processing of protein sequence alignment. The system is divided into a hardware system and a PSI-BLAST algorithm software system, where the hardware system includes:
a main processor platform and multiple coprocessor acceleration platforms, where the main processor computes on pure CPU chips and the coprocessor acceleration platforms use KNL platforms based on Intel MIC technology; and a dedicated OPA high-speed network that connects the nodes of the MIC cluster so that all nodes can communicate with each other at high speed. The hardware architecture is shown in Fig. 3.
To achieve the above object, the present invention adopts the following technical scheme.
A SIFT parallel algorithm based on a CPU+MIC heterogeneous platform: the core module of the SIFT algorithm is accelerated on MIC; using a message passing mechanism, the current multi-sequence input is split into multiple individual protein sequences; each protein sequence is processed with parallel acceleration and aligned against the database, which exploits the available parallelism.
The SIFT parallel algorithm based on the CPU+MIC heterogeneous platform specifically comprises:
First step: initialize the parameters, including query length, scoring matrix, threshold T, alignment sequence database, number of iterations, expected value for the significance analysis, alignment length and so on, for the parallel computation of the sequence alignment module PSI-BLAST.
Second step: split the protein multi-sequence data according to bioinformatics standards.
Third step: distribute the multiple protein sequences using the MPI message passing mechanism, sending them to the MIC cluster for accelerated computation.
Fourth step: from the alignment results, output the alignment statistics, select protein sequences according to the seed median, sort the sequences, generate the prediction information and merge the data.
The PSI-BLAST parallel algorithm first decomposes the whole sequence population into several subgroups and then runs the sequence alignment algorithm inside every subgroup at the same time. Because the protein sequence data in the subgroups are mutually independent and unrelated, sequence alignment on them is very well suited to parallel computation. The subgroups can therefore be distributed across multi-processor nodes or across threads so that the operations run in parallel. This improves the running efficiency of the whole PSI-BLAST algorithm, allows the computation to scale, and meets the requirements of high-performance applications.
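The subgroup decomposition can be sketched with Python's standard thread pool. This is an illustration only: the `align` callback is a hypothetical placeholder for the per-sequence alignment, the round-robin split is one simple choice among many, and the function names are not taken from the patent.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_subgroups(sequences, n_groups):
    """Round-robin split so that subgroup sizes differ by at most one."""
    return [sequences[i::n_groups] for i in range(n_groups)]


def align_subgroup(subgroup, align):
    """Run the placeholder alignment on every sequence of one subgroup."""
    return [(seq, align(seq)) for seq in subgroup]


def parallel_align(sequences, align, n_groups=4):
    """Process independent subgroups concurrently and collect all results."""
    groups = split_into_subgroups(sequences, n_groups)
    with ThreadPoolExecutor(max_workers=n_groups) as pool:
        per_group = pool.map(align_subgroup, groups, [align] * n_groups)
        return [item for group in per_group for item in group]
```

In CPython, threads do not speed up pure-Python computation; for CPU-bound work a process pool or the MPI-based distribution described in this document would be the natural replacement. The sketch only shows that independent subgroups need no coordination beyond collecting the results.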
Building on the model of parallel partitioning of the protein sequence database, the present invention proposes a partitioning algorithm based on the query sequences. The query sequence partitioning algorithm uses bioinformatics protein segmentation techniques to split the multi-query protein data effectively and reasonably. Its input is a single text file that contains multiple protein query sequences awaiting alignment; its output is the data set for sequence alignment, which feeds the MPI-based PSI-BLAST flow.
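A minimal sketch of such a query splitter is shown below. It assumes that the input text is in multi-FASTA format (a header line starting with `>` followed by sequence lines); the source only says "a single text" and does not name the format, so this is an assumption, as is the function name.

```python
def split_fasta(text):
    """Split multi-FASTA text into a list of (header, sequence) records."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                                    # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []       # start a new record
        elif header is not None:
            chunks.append(line)                         # sequence may span lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

Each returned record is an independent query, so the list can be handed directly to whatever distribution scheme is used for the alignment stage.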
To better accelerate the SIFT algorithm, the algorithm of the invention is divided into three parts: (1) the core sequence alignment module of SIFT, PSI-BLAST, is accelerated on the MIC platform, using vectorization and multithreading to reach an ideal acceleration effect; (2) the other stages of SIFT are computed on the CPU, taking advantage of the high clock frequency of the CPU platform, and cooperate well with the PSI-BLAST stage for heterogeneous acceleration; (3) the multi-protein sequence data are split effectively and the SIFT computations are run in parallel, which further exploits the parallelism of the SIFT algorithm. To exploit this parallelism more efficiently, an MPI-based parallel computing model is adopted according to the characteristics of the data flow, which improves algorithm efficiency.
Beneficial effects of the present invention: the invention substantially improves the efficiency of the whole algorithm and overcomes the low application performance and low production efficiency of traditional CPU computing methods and systems.
Description of the drawings
Fig. 1. Schematic flow diagram of the SIFT algorithm.
Fig. 2. Schematic flow diagram of the PSI-BLAST algorithm.
Fig. 3. CPU+MIC heterogeneous platform system.
Fig. 4. Schematic diagram of the SIFT parallel algorithm on the CPU+MIC heterogeneous platform.
Detailed description of the embodiments
The invention is further described below with reference to the drawings and the embodiments.
As shown in Fig. 3 and Fig. 4, in a SIFT parallel algorithm based on a CPU+MIC heterogeneous platform, heterogeneous accelerated computation is performed by sending p53 data to the MIC platform for accelerated execution. To align multiple protein sequences in parallel, the protein sequence data text is split and sent with load balancing, through the message passing mechanism, to the MIC cluster for algorithm acceleration. The results of the accelerated run are then analyzed and combined. The concrete implementation steps are as follows.
First step: initialize the parameters, including query length, scoring matrix, threshold T, alignment sequence database, number of iterations, expected value for the significance analysis, alignment length and so on. Because the parallel sequence alignment module PSI-BLAST is used, the number of threads also has to be set.
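The parameter set of the first step could be grouped as in the following sketch. The field names mirror the parameters listed above, but the class name, the default values and the validation rules are all hypothetical placeholders; the source gives no concrete values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchParams:
    """Parameters initialized in the first step (all defaults are placeholders)."""
    database: str                      # sequence database to align against
    query_length: int                  # length of the query
    scoring_matrix: str = "BLOSUM62"   # name of the substitution matrix
    threshold_t: int = 11              # neighbor-word threshold T
    iterations: int = 3                # number of PSI-BLAST rounds
    expect_value: float = 10.0         # expected value for significance analysis
    alignment_length: int = 50         # alignment length
    num_threads: int = 4               # threads used by the alignment module

    def __post_init__(self):
        if self.query_length < 1:
            raise ValueError("query_length must be positive")
        if self.iterations < 1:
            raise ValueError("iterations must be at least 1")
        if self.num_threads < 1:
            raise ValueError("num_threads must be at least 1")
        if self.expect_value <= 0:
            raise ValueError("expect_value must be positive")
```

Making the object immutable means the same parameter set can be shared by all worker threads or broadcast to all ranks without risk of accidental modification.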
Second step: because the SIFT flow has to be run on multi-sequence protein data, the protein multi-sequence data must first be split according to bioinformatics standards.
Third step: distribute the multiple protein sequences using the MPI message passing mechanism, sending them to the MIC cluster for accelerated computation.
The MPI pseudo-code is as follows:
MPI_Init();
MPI_Comm_size(n);
MPI_Comm_rank(CurRank);
if (CurRank == 0)   // master node: initializes the query sequences and distributes the computing tasks
{
    MPI_Scatterv(sequence[]);
    ……
    MPI_Gatherv(result[]);
}
else                // CurRank is not 0: slave node, receives tasks from the master node and performs sequence alignment
{
    // note: in real MPI code, MPI_Scatterv and MPI_Gatherv are collective calls that every rank must make
    PSIBLAST();
    ……
}
MPI_Finalize();
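The scatter and gather pattern of the pseudo-code can be simulated serially in Python without an MPI installation. The sketch below computes the per-rank counts and displacements that a scatter with variable counts needs, runs a placeholder `work` function on each chunk, and also shows one possible load-balancing rule (longest sequence first, to the least loaded rank). The function names and the balancing heuristic are assumptions for illustration, not the patented scheme.

```python
import heapq


def scatterv_layout(n_items, n_ranks):
    """Return (counts, displs) giving each rank a contiguous, near-equal share."""
    base, extra = divmod(n_items, n_ranks)
    counts = [base + (1 if r < extra else 0) for r in range(n_ranks)]
    displs = [sum(counts[:r]) for r in range(n_ranks)]
    return counts, displs


def simulate(sequences, n_ranks, work):
    """Serial stand-in for scatter, per-rank work, and gather in rank order."""
    counts, displs = scatterv_layout(len(sequences), n_ranks)
    gathered = []
    for rank in range(n_ranks):
        chunk = sequences[displs[rank]:displs[rank] + counts[rank]]  # scatter
        gathered.extend(work(seq) for seq in chunk)                  # work + gather
    return gathered


def balance_by_length(sequences, n_ranks):
    """Greedy balancing: the next-longest sequence goes to the lightest rank."""
    heap = [(0, r) for r in range(n_ranks)]          # (current load, rank)
    bins = [[] for _ in range(n_ranks)]
    for seq in sorted(sequences, key=len, reverse=True):
        load, r = heapq.heappop(heap)
        bins[r].append(seq)
        heapq.heappush(heap, (load + len(seq), r))
    return bins
```

Splitting by item count alone can leave one rank with all the long sequences; balancing by total residue count is one simple way to even out the work, under the assumption that alignment time grows with sequence length.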
Fourth step: from the alignment results, output the alignment statistics, select protein sequences according to the seed median, sort the sequences, generate the prediction information and merge the data. Because the other steps of the SIFT algorithm take little time and their flow is simple, they are run on the CPU side, where the high clock frequency of the CPU allows them to be processed quickly.
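The merging and sorting part of the fourth step can be sketched as follows. The record layout `(query_id, subject_id, score)` and the function names are hypothetical; the median-based selection and the prediction itself are not shown because the source does not specify them in enough detail.

```python
from collections import Counter
from itertools import chain


def merge_results(per_rank):
    """Flatten per-rank lists of (query_id, subject_id, score), best score first."""
    merged = list(chain.from_iterable(per_rank))
    # Sort by descending score; ties are broken by the identifiers for determinism.
    merged.sort(key=lambda hit: (-hit[2], hit[0], hit[1]))
    return merged


def hits_per_query(merged):
    """Simple alignment statistic: number of hits reported for each query."""
    return Counter(query_id for query_id, _, _ in merged)
```

This kind of post-processing is cheap and sequential, which fits the statement above that it is left on the CPU side.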
Although the specific embodiments of the present invention have been described above with reference to the drawings, they do not limit the scope of protection of the invention. Those of ordinary skill in the art should understand that various modifications or variations that can be made on the basis of the technical scheme of the invention without creative effort still fall within the scope of protection of the invention.
Claims (2)
1. A SIFT parallel algorithm based on a CPU+MIC heterogeneous platform, characterized in that the core module of the SIFT algorithm is accelerated on MIC; using a message passing mechanism, the current multi-sequence input is split into multiple individual protein sequences; each protein sequence is processed with parallel acceleration and aligned against the database, which exploits the available parallelism.
2. The SIFT parallel algorithm based on a CPU+MIC heterogeneous platform according to claim 1, characterized in that it specifically comprises:
First step: initialize the parameters, including query length, scoring matrix, threshold T, alignment sequence database, number of iterations, expected value for the significance analysis, alignment length and so on, for the parallel computation of the sequence alignment module PSI-BLAST;
Second step: split the protein multi-sequence data according to bioinformatics standards;
Third step: distribute the multiple protein sequences using the MPI message passing mechanism, sending them to the MIC cluster for accelerated computation;
Fourth step: from the alignment results, output the alignment statistics, select protein sequences according to the seed median, sort the sequences, generate the prediction information and merge the data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611081510.1A CN106650315B (en) | 2016-11-30 | 2016-11-30 | SIFT parallel processing method based on CPU + MIC heterogeneous platform |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611081510.1A CN106650315B (en) | 2016-11-30 | 2016-11-30 | SIFT parallel processing method based on CPU + MIC heterogeneous platform |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106650315A (en) | 2017-05-10 |
CN106650315B (en) | 2020-01-03 |
Family
ID=58813598
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN201611081510.1A | Active | CN106650315B (en) | 2016-11-30 | 2016-11-30 | SIFT parallel processing method based on CPU + MIC heterogeneous platform |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106650315B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111599403A (en) * | 2020-05-22 | 2020-08-28 | 电子科技大学 | Parallel drug-target correlation prediction method based on sequencing learning |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103279391A (en) * | 2013-06-09 | 2013-09-04 | 浪潮电子信息产业股份有限公司 | Load balancing optimization method based on CPU (central processing unit) and MIC (many integrated core) framework processor cooperative computing |
CN103294639A (en) * | 2013-06-09 | 2013-09-11 | 浪潮电子信息产业股份有限公司 | CPU+MIC mixed heterogeneous cluster system for achieving large-scale computing |
CN104375807A (en) * | 2014-12-09 | 2015-02-25 | 中国人民解放军国防科学技术大学 | Three-level flow sequence comparison method based on many-core co-processor |
Non-Patent Citations (4)
Title |
---|
MARCO ALDINUCCI ET AL.: "Parallel stochastic systems biology in the cloud", 《BRIEFINGS IN BIOINFORMATICS》 * |
YINGBO CUI ET AL.: "B-MIC: an Ultrafast Three-level Parallel Sequence Aligner Using MIC", 《INTERDISCIP SCI COMPUT LIFE SCI》 * |
叶笑春 等: "蛋白质序列比对算法在众核结构上的并行优化", 《软件学报》 * |
戴洛 等: "应用点突变预测程序(SIFT)检查MLH1蛋白质中的结肠癌相关点突变", 《基础研究》 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111599403A (en) * | 2020-05-22 | 2020-08-28 | 电子科技大学 | Parallel drug-target correlation prediction method based on sequencing learning |
CN111599403B (en) * | 2020-05-22 | 2023-03-14 | 电子科技大学 | Parallel drug-target correlation prediction method based on sequencing learning |
Also Published As
Publication number | Publication date |
---|---|
CN106650315B (en) | 2020-01-03 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | TA01 | Transfer of patent application right | Effective date of registration: 2019-12-10. Address after: 215100 No. 1 Guanpu Road, Guoxiang Street, Wuzhong Economic Development Zone, Suzhou City, Jiangsu Province. Applicant after: Suzhou Wave Intelligent Technology Co., Ltd. Address before: 450000 Henan province Zheng Dong New District of Zhengzhou City Xinyi Road No. 278 16 floor room 1601. Applicant before: Zhengzhou Yunhai Information Technology Co. Ltd. |
 | GR01 | Patent grant | |