CN102411680B - Large-scale distributed parallel acceleration method and system for protein identification - Google Patents

Publication number
CN102411680B
Authority
CN
China
Prior art keywords
spectrogram
peptide
sequence
mass
piece
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201010292032.5A
Other languages
Chinese (zh)
Other versions
CN102411680A (en)
Inventor
王乐珩
王文平
迟浩
吴妍洁
周郴
付岩
孙瑞祥
贺思敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Landscapes

  • Other Investigation Or Analysis Of Materials By Electrical Means (AREA)

Abstract

The invention relates to a large-scale distributed parallel acceleration method and system for protein identification. The method comprises the following steps: 1, performing theoretical enzymatic digestion on protein sequences to obtain peptide sequences, then sorting the peptide sequences and removing redundant ones to build peptide index file blocks; 2, sorting the mass spectra by a parallel processing method and dividing the sorted mass spectra equally into a plurality of spectrum data blocks; 3, distributing the spectrum data blocks evenly among a plurality of master processes, each master process sorting the spectrum data blocks assigned to it and dispatching them in turn to idle slave processes for peptide-spectrum matching identification; and 4, summarizing the identification results by the parallel processing method, inferring the corresponding protein sequences from the identified peptide sequences, and generating an output file. With the method and system, protein identification achieves satisfactory acceleration efficiency even when the number of processor cores reaches several hundred or exceeds one thousand.

Description

Large-scale distributed parallel acceleration method and system for protein identification
Technical field
The present invention relates to a distributed parallel acceleration method for large-scale protein identification, and in particular to a method and system that use distributed parallel techniques to share the search task effectively across a plurality of computing nodes and thereby increase the speed of protein identification.
Background art
The "proteome" denotes all the proteins expressed in a particular biological sample at a given time and under given conditions. As the name suggests, proteomics is the study of the proteome. Its most basic task is to determine which proteins are expressed in an organism, how much of each is expressed, and what post-translational modifications and protein-protein interactions occur, so as to obtain an overall and comprehensive understanding, at the protein level, of processes such as disease development and cellular metabolism. In current proteomics research, protein identification based on tandem mass spectrometry is one of the most widely used techniques; it is described in further detail in reference 1: Aebersold, R. and Mann, M. Mass spectrometry-based proteomics. Nature, 2003, 422:198-207.
The basic steps of protein identification based on tandem mass spectrometry are as follows. First, the mixed protein sample is enzymatically digested into peptides, which are separated by liquid chromatography and fed into a mass spectrometer to obtain the experimental tandem mass spectra of the peptides. The mass spectra are then analyzed to obtain the corresponding peptide sequences. Finally, the peptides are mapped back to proteins to obtain the list of proteins in the mixed sample, which achieves the goal of identifying the proteins. To identify the peptide sequence that produced an experimental tandem mass spectrum, database searching is widely used. Identification of peptide sequences by database searching is described in detail in reference 2: Eng, J.K., McCormack, A.L. and Yates, J.R. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J Am Soc Mass Spectrom, 1994, 5:976-989; reference 3: Perkins, D.N., Pappin, D.J., Creasy, D.M. and Cottrell, J.S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis, 1999, 20:3551-3567; and reference 4: Field, H.I., Fenyö, D. and Beavis, R.C. RADARS, a bioinformatics solution that automates proteome mass spectral analysis, optimises protein identification, and archives data in a relational database. Proteomics, 2002, 2:36-47.
Protein identification through peptide sequence identification by database searching mainly comprises the following steps: first, the protein sequences in a protein database are cut into peptide sequences by simulating the cleavage rules of the enzyme; next, the mass of each resulting peptide sequence is calculated; finally, the precursor ion mass of each mass spectrum, together with a mass error window, is used to look up the peptide sequences whose masses fall within that range, and the matching peptide sequences are passed to a scoring function to identify the peptide sequence.
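The three database search steps above can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the search engine of any cited reference: the function names `digest`, `peptide_mass` and `candidates` are hypothetical, the cleavage rule is a simplified trypsin rule (cut after K or R unless followed by P, no missed cleavages), and the scoring function is left out. The residue masses are the standard monoisotopic values.

```python
import bisect

# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # mass of H2O added to the residue sum of a peptide


def digest(protein):
    """Hypothetical helper: cut after K or R, except when followed by P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # trailing peptide without K/R
    return peptides


def peptide_mass(peptide):
    """Theoretical neutral mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def candidates(index, precursor_mass, tolerance):
    """Return peptides whose mass lies within precursor_mass +/- tolerance.

    index is a list of (mass, peptide) tuples sorted by mass, so the
    window is located with two binary searches instead of a full scan.
    """
    lo = bisect.bisect_left(index, (precursor_mass - tolerance, ""))
    hi = bisect.bisect_right(index, (precursor_mass + tolerance, "\uffff"))
    return [peptide for _, peptide in index[lo:hi]]
```

Keeping the peptide list sorted by mass is what makes the window lookup cheap, which is also why the sorted peptide index plays a central role in the method described later.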
In recent years protein databases have kept growing, and so has the demand for identifying peptides produced by non-specific enzymatic cleavage, so the number of peptide sequences to search keeps increasing. At the same time, mass spectrometry data are being generated at an ever faster rate. Both trends place higher demands on the speed of protein identification. The database search method described above is not efficient enough, and therefore needs to be accelerated.
In recent years, as commodity clusters have become cheap and widespread, large-scale parallel computing has become the mainstream solution for accelerating scientific and industrial computation. A cluster is a group of computers interconnected by a network and scheduled and coordinated as a whole to achieve efficient parallel computing. Unlike early supercomputers with a unified address space, each node in a cluster has its own central processing unit, memory and necessary peripherals. Processes in a cluster can run in parallel on a large scale, but communication between them is comparatively expensive. This means that serial or multithreaded programs written for an ordinary computer are not inherently scalable: porting a stand-alone program to a cluster does not directly yield a speedup. Existing algorithms must be redesigned to make full use of the hardware. Even an algorithm that shows an obvious speedup on small and medium-sized clusters sees its acceleration efficiency decline steadily as the cluster grows. Most existing industrial computing software cannot reach linear speedup beyond a hundred processor cores, and software that reaches linear speedup beyond a thousand cores is rarer still. Besides speed, the use of clusters also involves space. High-performance computing on clusters usually involves very large data sets (for example, biological protein sequences and massive numbers of mass spectra to be identified). Such data sets cannot even undergo some routine operations (for example, being loaded into memory and sorted with a common in-memory sorting algorithm) on an ordinary computer or on a single cluster node, and have to be processed with cluster hardware and specially designed software algorithms.
Most existing protein identification search engines have parallel versions, as described in reference 5: Sadygov, R.G., Eng, J., Durr, E., Saraf, A., McDonald, H., MacCoss, M.J., Yates, J.R. 3rd. Code developments to improve the efficiency of automated MS/MS spectra interpretation. J Proteome Res, 2002, 1:211-215; reference 6: Duncan, D.T., Craig, R., Link, A.J. Parallel tandem: a program for parallel processing of tandem mass spectra using PVM or MPI and X!Tandem. J Proteome Res, 2005, 4:1842-1847; reference 7: Bjornson, R.D., Carriero, N.J., Colangelo, C., Shifman, M., Cheung, K.H., Miller, P.L., Williams, K. X!!Tandem, an improved method for running X!tandem in parallel on collections of commodity computers. J Proteome Res, 2008, 7:293-299; reference 8: Halligan, B.D., Geiger, J.F., Vallejos, A.K., Greene, A.S., Twigger, S.N. Low Cost, Scalable Proteomics Data Analysis Using Amazon's Cloud Computing Services and Open Source Search Algorithms. J Proteome Res, 2009, 8:3148-3153; and reference 9: Leheng Wang, Wenping Wang, Hao Chi, Yanjie Wu, You Li, Yan Fu, Chen Zhou, Ruixiang Sun, Haipeng Wang, Chao Liu, Zuofei Yuan, Liyun Xiu, He, Si-Min. An efficient parallelization of phosphorylated peptide. Rapid Commun Mass Spectrom, 2010, 24:1791-1798. However, all of these methods are suitable only for small clusters. Once the number of processors reaches several hundred, or exceeds a thousand cores, acceleration efficiency drops markedly and additional hardware investment no longer yields a higher speedup. Given the shortcomings of existing methods on large clusters, providing an effective distributed parallel acceleration method is of great practical significance.
Summary of the invention
The object of the present invention is to provide a large-scale distributed parallel acceleration method and system for protein identification, in order to solve the problem in the prior art that acceleration efficiency is poor when running in parallel on hundreds or even thousands of processor cores.
To achieve this object, the invention provides a large-scale distributed parallel acceleration method for protein identification, characterized by comprising:
Step 1, inputting protein sequences, performing theoretical enzymatic digestion on the protein sequences to obtain peptide sequences, sorting the peptide sequences by theoretical precursor ion mass and removing redundancy to create peptide index file blocks, and generating a peptide index metadata file from the peptide index file blocks;
Step 2, inputting mass spectra, sorting the mass spectra by experimental precursor ion mass using a parallel processing method, dividing the sorted mass spectra equally to obtain a plurality of spectrum data blocks, and generating a mass spectrum metadata file from the spectrum data blocks;
Step 3, distributing the spectrum data blocks evenly among a plurality of master processes, each master process managing a plurality of slave processes, each master process sorting the spectrum data blocks assigned to it and dispatching them in turn to idle slave processes for peptide-spectrum matching identification; when there is more than one peptide index file block, the same spectrum data block is dispatched to a plurality of slave processes, each of which traverses a single peptide index file block to perform peptide-spectrum matching identification;
Step 4, summarizing the identification results using a parallel processing method, inferring the corresponding protein sequences from the identified peptide sequences, and generating an output file.
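By way of non-limiting illustration, the index construction of step 1 (sort by theoretical precursor ion mass, remove redundancy, cut into blocks, record metadata) might be sketched in Python as follows. This is a single-process sketch under stated assumptions: the function name `build_peptide_index`, the `block_size` parameter and the metadata fields (`block`, `count`, `min_mass`, `max_mass`) are hypothetical, since the text does not specify the layout of the peptide index metadata file.

```python
def build_peptide_index(peptides_with_mass, block_size):
    """Build sorted, non-redundant peptide index blocks plus metadata.

    peptides_with_mass is an iterable of (mass, sequence) tuples and may
    contain duplicates, e.g. the same peptide cut from two proteins.
    """
    # set() removes redundant entries; sorted() orders by mass, then sequence.
    unique = sorted(set(peptides_with_mass))
    blocks = [unique[i:i + block_size] for i in range(0, len(unique), block_size)]
    # Assumed metadata: enough for a scheduler to know each block's mass range.
    metadata = [
        {"block": n, "count": len(b), "min_mass": b[0][0], "max_mass": b[-1][0]}
        for n, b in enumerate(blocks)
    ]
    return blocks, metadata
```

Choosing `block_size` so that one block fits in the memory of a node matches the stated aim that a single peptide index file block can be loaded and traversed in memory.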
In the large-scale distributed parallel acceleration method for protein identification,
said step 2 further comprises:
21, parsing the mass spectra and dividing them equally into a plurality of raw data blocks, the size of each raw data block being smaller than the local storage space of a cluster node;
22, processing each raw data block with one spectrum mapping processor process, which reads in each mass spectrum of the raw data block in turn, places the mass spectrum in the queue corresponding to its mass range, and then stores each queue in a separate spectrum intermediate file;
23, processing each mass range with one spectrum reduction processor process, the spectrum reduction processor processes running in parallel and independently of one another; each spectrum reduction processor process reads in all spectrum intermediate files of its mass range, sorts the mass spectra first by experimental precursor ion mass and, where the experimental precursor ion masses are equal, by the dictionary order of the spectrum titles, and then stores the sorted mass spectra successively in a plurality of spectrum data blocks, each block containing the same number of mass spectra;
24, collecting the information of all the spectrum data blocks and generating the mass spectrum metadata file from that information.
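Sub-steps 21 to 23 follow a map-and-reduce pattern, which the following Python sketch illustrates in a single process. It is an assumption-laden illustration rather than the claimed implementation: the names `map_block`, `reduce_range` and `sort_spectra` are hypothetical, a spectrum is reduced to a `(precursor_mass, title)` tuple, mass ranges are fixed-width bins of `bin_width`, in-memory lists stand in for the spectrum intermediate files, and the last block of a mass range may hold fewer spectra than the others.

```python
from collections import defaultdict


def map_block(raw_block, bin_width):
    """Map phase: route each spectrum of one raw block to a mass-range queue."""
    queues = defaultdict(list)
    for spectrum in raw_block:  # spectrum = (precursor_mass, title)
        queues[int(spectrum[0] // bin_width)].append(spectrum)
    return queues


def reduce_range(intermediate_queues, block_size):
    """Reduce phase: sort one mass range and cut it into spectrum data blocks.

    Tuple comparison sorts by precursor mass first and by title second,
    which gives the dictionary-order tie-break on spectrum titles.
    """
    spectra = sorted(s for queue in intermediate_queues for s in queue)
    return [spectra[i:i + block_size] for i in range(0, len(spectra), block_size)]


def sort_spectra(raw_blocks, bin_width, block_size):
    by_range = defaultdict(list)
    for raw_block in raw_blocks:  # each raw block could go to its own process
        for key, queue in map_block(raw_block, bin_width).items():
            by_range[key].append(queue)
    blocks = []
    for key in sorted(by_range):  # each mass range could go to its own process
        blocks.extend(reduce_range(by_range[key], block_size))
    return blocks
```

Because the mass ranges do not overlap, concatenating the per-range results in range order yields a globally sorted sequence without any single process having to hold all spectra at once.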
In the large-scale distributed parallel acceleration method for protein identification,
said step 22 further comprises:
when the number of raw data blocks is greater than the number of processor cores in the cluster, or greater than the number of spectrum mapping processor processes, processing the raw data blocks in multiple rounds: a spectrum mapping processor process that has finished its task goes on to take a new one, first come first served, until all the raw data blocks have been processed.
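The multi-round, first come first served handling described above amounts to a shared task queue drained by a fixed set of workers. A minimal Python sketch follows, with the caveat that threads stand in for the cluster processes and that the name `run_first_come_first_served` and its parameters are hypothetical.

```python
import queue
import threading


def run_first_come_first_served(tasks, worker_count, handle):
    """Process every task exactly once with worker_count workers.

    A worker that finishes a task immediately takes the next pending one,
    so more tasks than workers simply results in several rounds.
    """
    pending = queue.Queue()
    for task in tasks:
        pending.put(task)
    results, lock = [], threading.Lock()

    def worker(worker_id):
        while True:
            try:
                task = pending.get_nowait()  # take the next unclaimed task
            except queue.Empty:
                return  # nothing left: this worker is done
            outcome = handle(task)
            with lock:
                results.append((worker_id, task, outcome))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

For example, `run_first_come_first_served(range(10), 3, lambda t: t * t)` handles ten tasks with three workers; which worker handles which task depends on timing, but every task is handled once.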
In the large-scale distributed parallel acceleration method for protein identification,
said step 23 further comprises:
when the number of mass ranges is greater than the number of processor cores in the cluster, or greater than the number of spectrum reduction processor processes, processing the mass ranges in multiple rounds: a spectrum reduction processor process that has finished its task goes on to take a new one, first come first served, until all the spectrum intermediate files have been processed.
In the large-scale distributed parallel acceleration method for protein identification,
in said step 3, the step in which the master process dispatches work to idle slave processes for peptide-spectrum matching identification comprises:
the master process reads in the mass spectrum metadata file and the peptide index metadata file and, based on the statistics obtained, sorts the spectrum data blocks it is responsible for identifying by mass range from high to low and dispatches them in turn to the slave processes; if there are multiple peptide index file blocks, the same spectrum data block is dispatched multiple times, each time paired with one peptide index file block; the slave processes take tasks on a first come, first served basis; whenever a slave process completes a task, it stores an identification result sub-block, communicates with the master process to send back the file name of that identification result sub-block, and requests the spectrum data block and peptide index file block information for its next task, until all spectrum data blocks have been identified.
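The task list that each master process builds in this step can be sketched as below. The sketch is illustrative only and rests on assumptions: `assign_blocks_to_masters` and `build_task_list` are hypothetical names, a spectrum data block is represented by a `(block_id, max_mass)` pair taken from the metadata, and a round-robin split is used as one simple way to give each master about the same number of blocks.

```python
def assign_blocks_to_masters(spectrum_blocks, master_count):
    """Split the spectrum data blocks evenly among the master processes."""
    return [spectrum_blocks[m::master_count] for m in range(master_count)]


def build_task_list(my_blocks, peptide_block_ids):
    """Order one master's blocks by mass, high to low, and expand to tasks.

    Each spectrum data block is paired with every peptide index file block,
    so it is dispatched once per peptide index file block.
    """
    ordered = sorted(my_blocks, key=lambda block: block[1], reverse=True)
    return [
        (block_id, peptide_id)
        for block_id, _ in ordered
        for peptide_id in peptide_block_ids
    ]
```

A slave process would then repeatedly request the next `(spectrum block, peptide index block)` pair from its master, for instance through a shared queue such as the one used for first come first served processing.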
In the large-scale distributed parallel acceleration method for protein identification,
in said step 3, the step in which a slave process performs peptide-spectrum matching identification comprises:
the slave process reads in the peptide index file block, computes the possible modified forms of each original peptide sequence, uses the precursor ion masses of the spectrum data block to be identified, together with the mass error window, to look up the modified peptide sequences that fall within the set mass range, and passes the matching modified peptide sequences to the peptide-spectrum matching scoring algorithm to identify the peptide sequence.
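How the possible modified forms of a peptide might be enumerated and then filtered by the mass error window is sketched below. This is a hedged illustration, not the scoring pipeline of the invention: the names `modified_forms` and `within_window` are hypothetical, a single variable modification (phosphorylation on S, T or Y, a mass shift of about 79.96633 Da) is assumed as the example, and `max_mods` is an assumed cap on the number of modified sites.

```python
from itertools import combinations

PHOSPHO = 79.96633  # monoisotopic mass shift of phosphorylation (HPO3), in Da


def modified_forms(peptide, base_mass, sites="STY", delta=PHOSPHO, max_mods=2):
    """Yield (mass, modified_positions) for every placement of 0..max_mods modifications."""
    positions = [i for i, aa in enumerate(peptide) if aa in sites]
    for k in range(0, min(max_mods, len(positions)) + 1):
        for chosen in combinations(positions, k):
            yield base_mass + k * delta, chosen


def within_window(forms, precursor_mass, tolerance):
    """Keep the forms whose mass lies inside the precursor mass error window."""
    return [form for form in forms if abs(form[0] - precursor_mass) <= tolerance]
```

Only the forms returned by `within_window` would be handed to the peptide-spectrum matching scoring algorithm, which is outside the scope of this sketch.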
In the large-scale distributed parallel acceleration method for protein identification,
said step 4 further comprises:
41, summarizing the identification results: all identification result sub-blocks corresponding to each spectrum data block are handled by one spectrum identification result summarizing processor process, the summarizing processor processes running in parallel and independently of one another; each such process reads in all identification result sub-blocks of the spectrum data block assigned to it, sorts the candidate peptide sequences of each mass spectrum by the score given by the peptide-spectrum matching scoring algorithm, keeps the top-ranked peptide sequences and their scores, and stores them in a block summary file;
42, reading in all block summary files, filtering and de-duplicating the peptide sequences identified for each mass spectrum, and dividing the resulting non-redundant peptide sequences equally into multiple groups; each group is handled by one protein query processor process, which looks up the corresponding protein identifiers and sequences, the protein query processor processes running in parallel and independently of one another; a peptide-to-protein inference algorithm is applied to the lookup results and an output file is generated.
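Sub-steps 41 and 42 can be illustrated with the following Python sketch. It is a simplified, single-process illustration under stated assumptions: the names `summarize_block`, `unique_peptides` and `split_into_groups` are hypothetical, an identification result is a `(spectrum_id, peptide, score)` tuple, a higher score is assumed to be better, and the filtering criteria, protein lookup and peptide-to-protein inference are omitted.

```python
import heapq
from collections import defaultdict


def summarize_block(result_sub_blocks, top_n):
    """Merge all result sub-blocks of one spectrum data block.

    There is one sub-block per peptide index file block; for each spectrum
    only the top_n highest-scoring (score, peptide) pairs are kept.
    """
    per_spectrum = defaultdict(list)
    for sub_block in result_sub_blocks:
        for spectrum_id, peptide, score in sub_block:
            per_spectrum[spectrum_id].append((score, peptide))
    return {sid: heapq.nlargest(top_n, hits) for sid, hits in per_spectrum.items()}


def unique_peptides(summaries):
    """Collect the non-redundant peptide sequences across all block summaries."""
    return sorted({
        peptide
        for summary in summaries
        for hits in summary.values()
        for _, peptide in hits
    })


def split_into_groups(items, group_count):
    """Divide the peptides into groups, one per protein query process."""
    return [items[g::group_count] for g in range(group_count)]
```

Each group produced by `split_into_groups` could then be handed to an independent process that looks up the proteins containing those peptides.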
In the large-scale distributed parallel acceleration method for protein identification,
said step 41 further comprises:
when the number of spectrum data blocks is greater than the number of processor cores in the cluster, or greater than the number of spectrum identification result summarizing processor processes, processing the identification result sub-blocks in multiple rounds: a summarizing processor process that has finished its task goes on to take a new one, first come first served, until all the identification result sub-blocks have been processed.
In the large-scale distributed parallel acceleration method for protein identification,
said step 42 further comprises:
when the number of non-redundant peptide sequence groups is greater than the number of processor cores in the cluster, or greater than the number of protein query processor processes, processing the non-redundant peptide sequence groups in multiple rounds: a protein query processor process that has finished its task goes on to take a new one, first come first served, until all the non-redundant peptide sequences have been processed.
To achieve the above object, the present invention also provides a large-scale distributed parallel acceleration system for protein identification, characterized by comprising:
a peptide sequence index module, configured to perform theoretical enzymatic digestion on the input protein sequences to obtain peptide sequences, to sort the peptide sequences by theoretical precursor ion mass and remove redundancy so as to create peptide index file blocks, and to generate a peptide index metadata file from the peptide index file blocks;
a spectrum data processing module, configured to sort the input mass spectra by experimental precursor ion mass using a parallel processing method, to divide the sorted mass spectra equally to obtain a plurality of spectrum data blocks, and to generate a mass spectrum metadata file from the spectrum data blocks;
a peptide-spectrum matching identification module, connected to the peptide sequence index module and the spectrum data processing module, configured to distribute the spectrum data blocks evenly among the master processes, each master process managing a plurality of slave processes, each master process sorting the spectrum data blocks assigned to it and dispatching them in turn to idle slave processes for peptide-spectrum matching identification; when there is more than one peptide index file block, the same spectrum data block is dispatched to a plurality of slave processes, each of which traverses a single peptide index file block to perform peptide-spectrum matching identification;
a result summarizing and output module, connected to the peptide-spectrum matching identification module, configured to summarize the identification results using a parallel processing method, to infer the corresponding protein sequences from the identified peptide sequences, and to generate an output file.
In the large-scale distributed parallel acceleration system for protein identification,
the spectrum data processing module further comprises:
a spectrum division module, configured to parse the mass spectra and divide them equally into a plurality of raw data blocks, the size of each raw data block being smaller than the local storage space of a cluster node;
a spectrum mapping module, connected to the spectrum division module, configured to process each raw data block with one spectrum mapping processor process, which reads in each mass spectrum of the raw data block in turn, places the mass spectrum in the queue corresponding to its mass range, and then stores each queue in a separate spectrum intermediate file;
a spectrum reduction module, connected to the spectrum mapping module, configured to process each mass range with one spectrum reduction processor process, the spectrum reduction processor processes running in parallel and independently of one another; each spectrum reduction processor process reads in all spectrum intermediate files of its mass range, sorts the mass spectra first by experimental precursor ion mass and, where the experimental precursor ion masses are equal, by the dictionary order of the spectrum titles, and then stores the sorted mass spectra successively in a plurality of spectrum data blocks, each block containing the same number of mass spectra;
a mass spectrum metadata file generation module, connected to the spectrum reduction module, configured to collect the information of all the spectrum data blocks and to generate the mass spectrum metadata file from that information.
In the large-scale distributed parallel acceleration system for protein identification,
the spectrum mapping module is further configured, when the number of raw data blocks is greater than the number of processor cores in the cluster, or greater than the number of spectrum mapping processor processes, to process the raw data blocks in multiple rounds: a spectrum mapping processor process that has finished its task goes on to take a new one, first come first served, until all the raw data blocks have been processed.
In the large-scale distributed parallel acceleration system for protein identification,
the spectrum reduction module is further configured, when the number of mass ranges is greater than the number of processor cores in the cluster, or greater than the number of spectrum reduction processor processes, to process the mass ranges in multiple rounds: a spectrum reduction processor process that has finished its task goes on to take a new one, first come first served, until all the spectrum intermediate files have been processed.
In the large-scale distributed parallel acceleration system for protein identification,
the peptide-spectrum matching identification module is further configured so that the master process reads in the mass spectrum metadata file and the peptide index metadata file and, based on the statistics obtained, sorts the spectrum data blocks it is responsible for identifying by mass range from high to low and dispatches them in turn to the slave processes; if there are multiple peptide index file blocks, the same spectrum data block is dispatched multiple times, each time paired with one peptide index file block; the slave processes take tasks on a first come, first served basis; whenever a slave process completes a task, it stores an identification result sub-block, communicates with the master process to send back the file name of that identification result sub-block, and requests the spectrum data block and peptide index file block information for its next task, until all spectrum data blocks have been identified.
In the large-scale distributed parallel acceleration system for protein identification,
the peptide-spectrum matching identification module is further configured so that the slave process reads in the peptide index file block, computes the possible modified forms of each original peptide sequence, uses the precursor ion masses of the spectrum data block to be identified, together with the mass error window, to look up the modified peptide sequences that fall within the set mass range, and passes the matching modified peptide sequences to the peptide-spectrum matching scoring algorithm to identify the peptide sequence.
In the large-scale distributed parallel acceleration system for protein identification,
the result summarizing and output module further comprises:
a summarizing module, in which all identification result sub-blocks corresponding to each spectrum data block are handled by one spectrum identification result summarizing processor process, the summarizing processor processes running in parallel and independently of one another; each such process reads in all identification result sub-blocks of the spectrum data block assigned to it, sorts the candidate peptide sequences of each mass spectrum by the score given by the peptide-spectrum matching scoring algorithm, keeps the top-ranked peptide sequences and their scores, and stores them in a block summary file;
a filtering, inference and output module, connected to the summarizing module, configured to read in the block summary files, to filter and de-duplicate the peptide sequences identified for each mass spectrum, and to divide the resulting non-redundant peptide sequences into multiple groups; each group is handled by one protein query processor process, which looks up the corresponding protein identifiers and sequences, the protein query processor processes running in parallel and independently of one another; a peptide-to-protein inference algorithm is applied to the lookup results and an output file is generated.
In the large-scale distributed parallel acceleration system for protein identification,
the summarizing module is further configured, when the number of spectrum data blocks is greater than the number of processor cores in the cluster, or greater than the number of spectrum identification result summarizing processor processes, to process the identification result sub-blocks in multiple rounds: a summarizing processor process that has finished its task goes on to take a new one, first come first served, until all the identification result sub-blocks have been processed.
In the large-scale distributed parallel acceleration system for protein identification,
the filtering, inference and output module is further configured, when the number of non-redundant peptide sequence groups is greater than the number of processor cores in the cluster, or greater than the number of protein query processor processes, to process the non-redundant peptide sequence groups in multiple rounds: a protein query processor process that has finished its task goes on to take a new one, first come first served, until all the non-redundant peptide sequences have been processed.
Compared with the prior art, the beneficial technical effects of the present invention are:
1, the invention processes the protein sequence database in a distributed parallel manner, so that massive sets of protein sequences beyond the capacity of a single machine can be efficiently subjected to theoretical enzymatic digestion, de-duplicated, sorted and partitioned into peptide index file blocks, and a single peptide index file block can be loaded into memory and traversed efficiently.
2, the invention organizes proteins and peptide sequences in an ordered, non-redundant, distributed peptide sequence index. Compared with searching the protein sequences directly, this not only greatly reduces redundant computation but also merges the peptide-spectrum matching operations shared by spectra whose precursor ion masses are equal or close, which greatly improves the efficiency of identification.
3, the invention processes the mass spectra in a distributed parallel manner, so that massive numbers of mass spectra beyond the capacity of a single machine can be efficiently sorted and partitioned into spectrum data blocks. The resulting spectrum data blocks lend themselves to dynamically scheduled parallel processing.
4, in the invention a plurality of master processes share the communication load of a large number of slave processes, which reduces blocking and waiting and greatly improves acceleration efficiency when the cluster reaches several hundred or even more than a thousand processor cores.
5, the invention summarizes the identification results, looks up the proteins to which the peptide sequences belong, and performs peptide-to-protein inference in parallel, which greatly speeds up this stage.
Brief description of the drawings
Fig. 1 is a flow chart of the large-scale distributed parallel acceleration method for protein identification of the present invention;
Fig. 2 is a structural diagram of the large-scale distributed parallel acceleration system for protein identification of the present invention.
Detailed description of the embodiments
The present invention is described below with reference to the drawings and specific embodiments, which are not to be taken as limiting the invention.
Fig. 1 shows the flow chart of the large-scale distributed parallel acceleration method for protein identification of the present invention. The flow accelerates protein identification through the following steps:
Step 101: set the necessary search parameters;
Step 102: input the protein sequences and use multiple processor processes in the cluster to perform theoretical enzymatic digestion on them; sort the resulting peptide sequences by theoretical precursor ion mass and remove redundancy; finally create the peptide index file blocks and generate a peptide index metadata file from them;
Step 103: parse the input mass spectra and use multiple processor processes in the cluster to sort them by experimental precursor ion mass; store the sorted spectra in order into multiple spectrum data blocks, each holding the same number of spectra; then generate a mass spectrum metadata file from the spectrum data blocks;
Step 104: start several master processes, each of which manages its own set of slave processes, and distribute the spectrum data blocks evenly among the master processes. Each master process sorts the spectrum data blocks assigned to it from high to low by mass range and dynamically dispatches them to idle slave processes for peptide-spectrum matching identification. If there is more than one peptide index file block, the same spectrum data block is dispatched to several slave processes, each of which traverses a single peptide index file block to perform the peptide-spectrum matching identification;
Step 105: gather the identification results in parallel, use the identified peptide sequences to look up the corresponding protein sequences, perform peptide-to-protein inference, and generate the output file.
For step 102, the common and comparatively inefficient approach at present is as follows. The protein sequences are read in one after another and digested theoretically one by one to obtain peptide sequences, which are stored block by block as first-order temporary peptide sequence blocks. The first-order temporary blocks are then read in, every K blocks are merged, redundancy is removed, and the result, sorted by theoretical precursor ion mass, is written out as second-order temporary peptide sequence blocks. The second-order temporary blocks are read in and merged K at a time in the same way to produce third-order temporary blocks, and so on, iterating until all data have been merged together. Finally, the temporary peptide sequence blocks of the last round are read in turn to create the peptide index file blocks, and the information on all peptide index file blocks is collected and used to generate the peptide index metadata file.
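The K-way merge at the heart of this conventional approach can be sketched as follows. This is only an illustration: the function name `merge_sorted_blocks` is hypothetical, the blocks are held in memory rather than on disk, and the masses in the sample data are made-up placeholders.

```python
import heapq
from itertools import groupby

def merge_sorted_blocks(blocks):
    """Merge K sorted blocks of (mass, sequence) pairs into one sorted,
    duplicate-free list. Each input block must already be sorted."""
    merged = heapq.merge(*blocks)               # lazy K-way merge
    return [key for key, _ in groupby(merged)]  # equal neighbours collapse to one

# Three temporary blocks (hypothetical data) merged in one pass.
blocks = [
    [(400.15, "EVDG"), (512.30, "PEPTK")],
    [(400.15, "EVDG"), (450.20, "AAGK")],
    [(450.20, "AAGK"), (600.40, "LLSR")],
]
merged = merge_sorted_blocks(blocks)
print(merged)
```

Because every round rereads and rewrites all data sequentially, the number of passes grows with the data volume, which is the inefficiency the parallel scheme is meant to avoid.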
For step 103, the common and comparatively inefficient approach at present is as follows. The mass spectra are parsed, read in one after another, and stored block by block as first-order temporary spectrum data blocks. The first-order temporary blocks are then read in turn, every K blocks are merged, and the result, sorted by experimental precursor ion mass, is written out as second-order temporary spectrum data blocks. The second-order temporary blocks are read in turn and merged further to produce third-order temporary blocks, and so on, iterating until all data have been merged together. Finally, the temporary spectrum data blocks of the last round are read in turn and stored into a number of spectrum data blocks, each containing the same number of spectra, this number being specified by an input parameter. The information on all spectrum data blocks is then collected and used to generate the mass spectrum metadata file.
For step 104, the common and comparatively inefficient approach at present is as follows. A single master process dispatches the spectrum data blocks to the slave processes one after another, and the slave processes take tasks on a first-come, first-served basis. After receiving the numbers of the designated peptide index file blocks, a slave process reads in all of those peptide index file blocks in turn, computes the possible modified forms of the original peptide sequences, uses the precursor ion mass tolerance window of the spectra in the spectrum data block to be identified to find the modified peptide sequences that fall within the set mass range, and feeds the qualifying modified peptide sequences to the peptide-spectrum match scoring algorithm to identify the peptide sequences. Each time a task is completed, the slave process stores an identification result sub-block, communicates with the master process to send back the file name of that sub-block, and requests the spectrum data block and peptide index file block information for its next task, until all spectrum data blocks have been identified.
Further, step 102 comprises:
Step 1021: read in the protein sequences and divide them evenly into multiple protein sequence sub-files. The number of sub-files may exceed the number of processor cores in the cluster, and each sub-file must be smaller than the local storage space of a cluster node;
Step 1022: start one peptide index mapping processor process (Peptide Map process for short) for each protein sequence sub-file; the Peptide Map processes run independently and in parallel. Each Peptide Map process digests every protein sequence in its sub-file in turn to obtain peptide sequences, assigns the peptide sequences to queues by mass range, removes redundant peptide sequences, and stores each queue in a different peptide sequence intermediate file;
Step 1023: for the different mass ranges, each mass range is handled by one peptide index reduction processor process (Peptide Reduce process for short); the Peptide Reduce processes run independently and in parallel. Each Peptide Reduce process reads in the peptide sequences from all peptide sequence intermediate files in its mass range and sorts them, first by theoretical precursor ion mass and, when the theoretical precursor ion masses are equal, by the ordinary lexicographic order of the peptide sequence strings. After sorting, redundancy is removed and the peptide index file block is created;
Step 1024: this step is optional. One option is to generate an inverted index from peptides to proteins. For a specific algorithm for creating the inverted index, see reference 10 (You Li, Hao Chi, Le-Heng Wang, Hai-Peng Wang, Yan Fu, Zuo-Fei Yuan, Su-Jun Li, Yan-Sheng Liu, Rui-Xiang Sun, Rong Zeng, Si-Min He. "Speeding up tandem mass spectrometry based database searching by peptide and spectrum indexing." Rapid Communications in Mass Spectrometry, 2010, 24:807-814) and patent application No. 200810223683.1, "Index acceleration method and corresponding system for large-scale protein identification";
Step 1025: collect the information on all peptide index file blocks and generate the peptide index metadata file from it.
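Steps 1022 and 1023 can be sketched in a few lines. The sketch below is a simplified illustration, not the patented implementation: the cleavage rule (cut after every K or R, no missed cleavages) is an assumed trypsin-like rule that the text does not specify, the mass table covers only a few residues, the 100 Da queue width is borrowed from the worked example, and all function names are hypothetical.

```python
from collections import defaultdict

# Monoisotopic residue masses (Da) for a few amino acids; a real table covers all 20.
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841,
                "L": 113.08406, "D": 115.02694, "K": 128.09496, "E": 129.04259,
                "R": 156.10111}
WATER = 18.01056
BIN_WIDTH = 100.0  # width of one mass range, in Da (assumed)

def digest(protein):
    """Assumed trypsin-like rule: cleave after every K or R, no missed cleavages."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        if aa in "KR":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides

def peptide_mass(peptide):
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER

def peptide_map(proteins):
    """Map: digest one sub-file and group its distinct peptides by mass range."""
    queues = defaultdict(set)  # a set removes redundancy within the sub-file
    for protein in proteins:
        for pep in digest(protein):
            mass = peptide_mass(pep)
            queues[int(mass // BIN_WIDTH)].add((mass, pep))
    return queues

def peptide_reduce(intermediate_sets):
    """Reduce: merge one mass range across sub-files, sorted by (mass, sequence)."""
    return sorted(set().union(*intermediate_sets))

map_a = peptide_map(["GASKVLDR", "EEK"])   # first sub-file
map_b = peptide_map(["GASKAAR"])           # second sub-file
index_block = peptide_reduce([map_a[3], map_b[3]])  # the 300-400 Da range
print([pep for _, pep in index_block])
```

In the cluster setting each `peptide_map` call would run in its own process and write its queues to intermediate files; here the queues are simply kept in memory.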
In a preferred embodiment, in step 1022, when the number of protein sequence sub-files exceeds the number of processor cores in the cluster, or exceeds the number of Peptide Map processes, the sub-files are processed in multiple rounds: a Peptide Map process that has finished its task fetches a new one on a first-come, first-served basis, until all protein sequence sub-files have been processed.
In a preferred embodiment, in step 1023, when the number of mass ranges exceeds the number of processor cores in the cluster, or exceeds the number of Peptide Reduce processes, the mass ranges are processed in multiple rounds: a Peptide Reduce process that has finished its task fetches a new one on a first-come, first-served basis, until all peptide sequence intermediate files have been processed.
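The multi-round, first-come, first-served policy amounts to a shared task queue drained by a fixed pool of workers. A minimal sketch follows, with threads standing in for the cluster's processor processes (an assumption made so that the example is self-contained); the names `run_first_come_first_served` and `handler` are hypothetical.

```python
import queue
import threading

def run_first_come_first_served(tasks, num_workers, handler):
    """Each worker repeatedly takes the next pending task until none remain."""
    pending = queue.Queue()
    for task in tasks:
        pending.put(task)
    results, lock = {}, threading.Lock()

    def worker():
        while True:
            try:
                task = pending.get_nowait()
            except queue.Empty:
                return  # no work left, so this worker retires
            value = handler(task)
            with lock:
                results[task] = value

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# Ten sub-files handled by three workers, so several rounds are needed.
done = run_first_come_first_served(range(10), 3, lambda i: f"subfile-{i} indexed")
print(len(done))
```

A worker that finishes early simply picks up the next task, so faster workers handle more tasks and no fixed round boundary is required.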
Further, step 103 comprises:
Step 1031: parse the mass spectra and divide them evenly into multiple raw data blocks. The number of raw data blocks may exceed the number of processor cores in the cluster, and each raw data block must be smaller than the local storage space of a cluster node;
Step 1032: each raw data block is handled by one Spectra Map process. The Spectra Map process reads in each mass spectrum of its raw data block in turn, assigns the spectra to queues by mass range, and stores each queue in a different spectrum intermediate file;
Step 1033: for the different mass ranges, each mass range is handled by one Spectra Reduce process; the Spectra Reduce processes run independently and in parallel. Each Spectra Reduce process reads in all spectrum intermediate files in its mass range and sorts the input spectra, first by experimental precursor ion mass and, when the experimental precursor ion masses are equal, by the ordinary lexicographic order of the spectrum titles. After sorting, the spectra are stored in order into a number of spectrum data blocks, each containing the same number of spectra, this number being specified by an input parameter;
Step 1034: collect the information on all spectrum data blocks and generate the mass spectrum metadata file from it.
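The sorting and blocking done by one Spectra Reduce process in step 1033 can be sketched as follows. The sketch assumes each spectrum is reduced to a (precursor mass, title) pair and uses the hypothetical name `make_spectrum_blocks`; the sample spectra are synthetic.

```python
def make_spectrum_blocks(spectra, block_size):
    """Sort (precursor_mass, title) pairs and cut them into blocks of block_size.
    Tuples compare by mass first and by title second, which matches the tie rule."""
    ordered = sorted(spectra)
    return [ordered[i:i + block_size] for i in range(0, len(ordered), block_size)]

# 7,000 synthetic spectra in one mass range, 200 spectra per block.
spectra = [(400.0 + (i % 1000) * 0.1, f"scan_{i:05d}") for i in range(7000)]
blocks = make_spectrum_blocks(spectra, 200)
print(len(blocks))
```

When the number of spectra is not a multiple of the block size, the last block in this sketch is simply shorter; how the original system handles that remainder is not stated in the text.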
In a preferred embodiment, in step 1032, when the number of raw data blocks exceeds the number of processor cores in the cluster, or exceeds the number of Spectra Map processes, the raw data blocks are processed in multiple rounds: a Spectra Map process that has finished its task fetches a new one on a first-come, first-served basis, until all raw data blocks have been processed.
In a preferred embodiment, in step 1033, when the number of mass ranges exceeds the number of processor cores in the cluster, or exceeds the number of Spectra Reduce processes, the mass ranges are processed in multiple rounds: a Spectra Reduce process that has finished its task fetches a new one on a first-come, first-served basis, until all spectrum intermediate files have been processed.
In a preferred embodiment, in step 104, the dynamic dispatch operation comprises the following. The master process reads in the mass spectrum metadata file and the peptide index metadata file and, using the statistics obtained from them, sorts the spectrum data blocks it is responsible for identifying from high to low by mass range and dispatches them to the slave processes in that order. If there are multiple peptide index file blocks, the same spectrum data block is dispatched multiple times, once for each peptide index file block. The slave processes take tasks on a first-come, first-served basis. Each time a task is completed, the slave process stores an identification result sub-block, communicates with the master process to send back the file name of that sub-block, and requests the spectrum data block and peptide index file block information for its next task, until all spectrum data blocks have been identified.
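The order in which a master process issues tasks can be sketched as a sorted cross product. The tuple layout `(block_id, min_mass, max_mass)` and the name `build_task_list` are assumptions made for illustration; the actual metadata format is not given in the text.

```python
from itertools import product

def build_task_list(spectrum_blocks, peptide_index_blocks):
    """spectrum_blocks: list of (block_id, min_mass, max_mass) taken from metadata.
    Returns one (spectrum block, peptide index block) task per pairing,
    with spectrum blocks ordered from the highest mass range to the lowest."""
    ordered = sorted(spectrum_blocks, key=lambda b: b[2], reverse=True)
    return [(block_id, index_id)
            for (block_id, _, _), index_id in product(ordered, peptide_index_blocks)]

tasks = build_task_list(
    [(0, 400, 500), (1, 900, 1000), (2, 600, 700)],  # hypothetical metadata
    ["idx0", "idx1"],
)
print(tasks)
```

The resulting list is what an idle slave process would draw from, one task at a time.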
In a preferred embodiment, in step 104, the peptide-spectrum matching identification operation comprises the following. The slave process reads in the peptide index file block, computes the possible modified forms of the original peptide sequences, uses the precursor ion mass tolerance window of the spectra in the spectrum data block to be identified to find the modified peptide sequences that fall within the set mass range, and feeds the qualifying modified peptide sequences to the peptide-spectrum match scoring algorithm to identify the peptide sequences. For a specific implementation of the peptide-spectrum match scoring algorithm, see reference 11 (Y. Fu, Q. Yang, R. Sun, D. Li, R. Zeng, C.X. Ling, and W. Gao, "Exploiting the kernel trick to correlate fragment ions for peptide identification via tandem mass spectrometry," Bioinformatics, 2004, 20:1948-1954) and patent ZL200410088779.3, "A method for identifying peptides using tandem mass spectrometry data".
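Because a peptide index file block is sorted by mass, the candidates inside a precursor mass tolerance window can be found by binary search. The sketch below ignores modifications and scoring, uses made-up masses and placeholder peptide labels, and the name `candidates_in_window` is hypothetical.

```python
from bisect import bisect_left, bisect_right

def candidates_in_window(index_masses, index_peptides, precursor_mass, tol):
    """Return the peptides whose mass lies within precursor_mass +/- tol.
    index_masses must be sorted ascending and aligned with index_peptides."""
    lo = bisect_left(index_masses, precursor_mass - tol)
    hi = bisect_right(index_masses, precursor_mass + tol)
    return index_peptides[lo:hi]

masses = [400.10, 400.15, 400.90, 401.50]      # hypothetical index block
peptides = ["pepA", "EVDG", "pepB", "pepC"]
hits = candidates_in_window(masses, peptides, 400.20, 0.08)
print(hits)
```

Spectra with equal or close precursor masses select overlapping slices of the index, which is why sorting both the spectra and the peptides by mass lets their matching work be shared.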
Further, step 105 comprises:
Step 1051: gather the identification results. All identification result sub-blocks corresponding to one spectrum data block are handled by one spectrum identification result aggregation processor process (Results Gather process for short); the Results Gather processes run independently and in parallel. Each Results Gather process reads in all identification result sub-blocks of the spectrum data block assigned to it, sorts the peptide sequences of all identification results of each mass spectrum by the score given by the peptide-spectrum match scoring algorithm, keeps the top-ranked peptide sequences and their scores, and stores them in a block summary file.
Step 1052: read in all block summary files, filter the peptide sequences in the identification results of each mass spectrum, and remove redundancy. The resulting non-redundant peptide sequences are divided evenly into a number of groups, and each group is handled by one protein query processor process (Protein Select process for short), which looks up the corresponding protein numbers and sequences; the Protein Select processes run independently and in parallel. After the lookup, the peptide-to-protein inference algorithm is run and the output file is generated. For a specific implementation of the peptide-to-protein inference algorithm, see reference 12 (A.I. Nesvizhskii and R. Aebersold. "Interpretation of shotgun proteomic data: the protein inference problem." Mol Cell Proteomics, 2005, 4:1419-1440).
In a preferred embodiment, in step 1051, when the number of spectrum data blocks exceeds the number of processor cores in the cluster, or exceeds the number of Results Gather processes, the identification result sub-blocks are processed in multiple rounds: a Results Gather process that has finished its task fetches a new one on a first-come, first-served basis, until all identification result sub-blocks have been processed.
In a preferred embodiment, in step 1052, when the number of non-redundant peptide sequence groups exceeds the number of processor cores in the cluster, or exceeds the number of Protein Select processes, the groups are processed in multiple rounds: a Protein Select process that has finished its task fetches a new one on a first-come, first-served basis, until all non-redundant peptide sequences have been processed.
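The merging done by one Results Gather process in step 1051 can be sketched as a per-spectrum top-k selection. The record layout `(spectrum_title, peptide, score)`, the parameter `keep`, and the scores themselves are hypothetical.

```python
import heapq
from collections import defaultdict

def gather_results(sub_blocks, keep=2):
    """sub_blocks: iterable of lists of (spectrum_title, peptide, score).
    Returns, for each spectrum, its `keep` best (score, peptide) pairs."""
    by_spectrum = defaultdict(list)
    for sub_block in sub_blocks:
        for title, peptide, score in sub_block:
            by_spectrum[title].append((score, peptide))
    return {title: heapq.nlargest(keep, hits) for title, hits in by_spectrum.items()}

# Two sub-blocks of the same spectrum data block, one per peptide index block.
sub_blocks = [
    [("scan_1", "EVDG", 0.91), ("scan_2", "AAGK", 0.40)],
    [("scan_1", "LLSR", 0.35), ("scan_1", "GASK", 0.77)],
]
summary = gather_results(sub_blocks)
print(summary["scan_1"])
```

Each spectrum data block is summarized independently of the others, which is what allows the Results Gather processes to run in parallel.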
For ease of understanding, a concrete example is given below.
First, the protein sequences are processed and the peptide index file blocks are created. Suppose the protein sequence database contains 3,000,000 protein sequences and the cluster has 1000 processor cores.
In the first step, all protein sequences are divided into 1000 protein sequence sub-files, each containing 3,000 protein sequences.
In the second step, 1000 Peptide Map processes are started in parallel. Each Peptide Map process reads in one protein sequence sub-file, digests its 3,000 protein sequences in turn to obtain peptide sequences, and assigns the peptide sequences to queues by mass range. For example, if the queues are 100 Da wide, the peptide sequence EVDG with a mass of 400.15 is stored in the 400-500 Da queue. After redundant peptide sequences are removed, each queue is stored in a different peptide sequence intermediate file.
In the third step, Peptide Reduce processes are started in parallel for the different mass ranges, one process per mass range, running independently of one another. The number of Peptide Reduce processes is determined by the preset upper and lower mass limits of the peptide sequences and the width of a mass range. In this example the mass limits are 400-10000 Da and the width is 100 Da, so 96 Peptide Reduce processes are needed ((10000 - 400) / 100 = 96). Each Peptide Reduce process reads in the peptide sequences from all peptide sequence intermediate files in its mass range (for example 400-500 Da), which in this embodiment means 1000 intermediate files per process, sorts them by theoretical precursor ion mass, removes redundancy, and creates a peptide index file block. In this embodiment 96 peptide index file blocks are generated in total. A peptide index file block contains the mass, the peptide sequence, and the missed cleavage sites. An optional operation at this point is to generate an inverted index from peptides to proteins. Each row of the inverted index has the following form: first the number (size_t) of the peptide sequence, then the number (size_t) of the protein sequence to which this peptide sequence belongs; if the same peptide sequence belongs to several protein sequences, their numbers follow one after another.
In the fourth step, once the above work is finished, the Peptide Meta process collects the information on all peptide index file blocks and generates the peptide index metadata file from it. This information mainly includes the number of index file blocks, the size of each file block, the corresponding mass range, the number of peptide sequence entries stored, the amino acid mass table used to compute peptide masses, the creation time, and so on.
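The row format of the optional inverted index can be illustrated with a short sketch. Numbering peptides in sorted string order is an assumption made here for simplicity (the text does not say how peptide numbers are assigned), and `build_inverted_index` is a hypothetical name.

```python
from collections import defaultdict

def build_inverted_index(peptides_per_protein):
    """peptides_per_protein[i] lists the peptides digested from protein number i.
    Returns the numbered peptide list and rows of [peptide_id, protein_id, ...]."""
    proteins_of = defaultdict(list)
    for protein_id, peptides in enumerate(peptides_per_protein):
        for pep in set(peptides):               # count each peptide once per protein
            proteins_of[pep].append(protein_id)
    ordered = sorted(proteins_of)               # assumed numbering: sorted order
    rows = [[pep_id] + proteins_of[pep] for pep_id, pep in enumerate(ordered)]
    return ordered, rows

peptide_list, rows = build_inverted_index([["GASK", "VLDR"], ["GASK", "AAR"]])
print(rows)
```

Here the peptide GASK occurs in both proteins, so its row carries two protein numbers after the peptide number.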
Next, the mass spectra are processed and the spectrum data blocks are created. Suppose there are 5,000,000 mass spectra and the cluster has 1000 processor cores.
In the first step, the mass spectra are parsed and divided evenly into 1,000 raw data blocks, each containing 5,000 mass spectra.
In the second step, 1000 Spectra Map processes are started, one per raw data block. Each Spectra Map process reads in each mass spectrum of its raw data block in turn and assigns the spectra to queues by mass range. For example, with a window of 100 Da, a spectrum with a mass of 400.15 is stored in the 400-500 Da queue. Each queue is then stored in a different spectrum intermediate file.
In the third step, Spectra Reduce processes are started in parallel for the different mass ranges, one process per mass range, running independently of one another. The number of Spectra Reduce processes is determined by the preset upper and lower mass limits of the peptide sequences and the width of a mass range. In this example the mass limits are 400-10000 Da and the width is 100 Da, so 96 Spectra Reduce processes are needed ((10000 - 400) / 100 = 96). Each Spectra Reduce process reads all spectrum intermediate files in its mass range (for example 400-500 Da), sorts the input spectra by experimental precursor ion mass, and stores them in order into a number of spectrum data blocks, each containing the same number of spectra as specified by an input parameter. In this embodiment there are 7,000 spectra in the 400-500 Da range; after sorting by experimental precursor ion mass, every 200 spectra are stored in one block, giving 35 blocks in total, where the value 200 is set by the input parameter.
In the fourth step, once the above work is finished, the Spectra Meta process collects the information on all spectrum data blocks and generates the mass spectrum metadata file from it. This information mainly includes the number of spectrum data blocks, the number of spectra in each data block, the creation time, and so on.
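The arithmetic of this example can be checked with a few helper functions. The function names are hypothetical; the numbers are those of the example.

```python
import math

def num_mass_ranges(lower_da, upper_da, width_da):
    """Number of reduce processes: one per mass range between the limits."""
    return math.ceil((upper_da - lower_da) / width_da)

def mass_range_of(mass_da, lower_da, width_da):
    """Return the (start, end) of the mass range that contains mass_da."""
    start = lower_da + int((mass_da - lower_da) // width_da) * width_da
    return start, start + width_da

def num_blocks(num_spectra, spectra_per_block):
    return math.ceil(num_spectra / spectra_per_block)

print(num_mass_ranges(400, 10000, 100))   # reduce processes needed
print(mass_range_of(400.15, 400, 100))    # queue for a 400.15 Da spectrum
print(num_blocks(7000, 200))              # blocks in the 400-500 Da range
```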
Next, identification begins. Several master processes are started, each of which manages its own set of slave processes. In this example there are 1000 processes in total; the ten processes No. 0, No. 100, No. 200, ..., No. 900 are designated as master processes and all the others are slave processes. Each master process manages the 99 slave processes numbered immediately after it; for example, slave process No. 123 is managed by master process No. 100. The spectrum data blocks produced in the previous stage are distributed evenly among the master processes. Each master process reads in the mass spectrum metadata file and the peptide index metadata file and, using the statistics obtained from them, sorts the spectrum data blocks assigned to it from high to low by mass range and dynamically dispatches them in that order to idle slave processes for peptide-spectrum matching identification. If more than one peptide index file block was generated earlier, the same spectrum data block is dispatched to several slave processes, each of which traverses a single peptide index file block to perform the peptide-spectrum matching identification. In this embodiment, suppose the preceding stages produced 50,000 spectrum data blocks and 96 peptide index file blocks. Each master process is assigned 5,000 data blocks, taken at regular intervals for load balancing (for example, master process No. 0 is assigned data blocks No. 0, No. 10, No. 20, ..., No. 49,990), and each master process dispatches 5000*96 subtasks to its slave processes. In total, 10*5000*96 identification result sub-blocks are produced.
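The process layout and the strided block assignment of this example can be sketched as follows. The constants come from the example (1000 processes, groups of 100, 50,000 spectrum data blocks, 96 peptide index file blocks); the function names are hypothetical.

```python
NUM_PROCESSES = 1000
GROUP_SIZE = 100                         # one master plus 99 slaves per group
NUM_MASTERS = NUM_PROCESSES // GROUP_SIZE

def master_of(rank):
    """Process number of the master that manages process `rank`."""
    return (rank // GROUP_SIZE) * GROUP_SIZE

def blocks_for_master(master_index, num_blocks):
    """Strided assignment: master k takes blocks k, k + 10, k + 20, ..."""
    return range(master_index, num_blocks, NUM_MASTERS)

blocks0 = blocks_for_master(0, 50_000)   # blocks of the first master
subtasks_per_master = len(blocks0) * 96  # one subtask per peptide index block
print(master_of(123), len(blocks0), subtasks_per_master)
```

Assigning blocks at a stride gives every master a mix of low and high mass ranges, which is one plausible reading of the load-balancing remark in the example.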
Finally, the identification results are gathered. Suppose the preceding stage left 50,000 spectrum data blocks and 10*5000*96 identification result sub-blocks, and the cluster has 1,000 processor cores.
In the first step, 1,000 Results Gather processes are started and, over multiple rounds, process the identification result sub-blocks of the 50,000 spectrum data blocks. Each time, a Results Gather process reads in the 96 identification result sub-blocks corresponding to its designated spectrum data block, merges and sorts them, keeps the top-ranked candidate peptides, and stores them in a block summary file.
In the second step, all 50,000 block summary files are read in, and the peptide sequences in the identification results of each mass spectrum are filtered and de-duplicated. Suppose 70,000 non-redundant peptide sequences are obtained and divided evenly into 700 groups; 700 Protein Select processes are started, one per group. Each process looks up the protein sequences corresponding to its non-redundant peptide sequences: if the inverted index was generated in the earlier optional step, the lookup is a direct table lookup; otherwise the original protein sequence database is searched directly. After the lookup, the peptide-to-protein inference algorithm is run, yielding information on 180 identified proteins, and the output file is generated. The output file contains, for each spectrum, the peptide sequence of the identification result, the modification information, the precursor ion mass, and the score, as well as the name, number, and sequence of each identified protein.
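The grouping and lookup in the second step can be sketched as follows. The sketch assumes the inverted index is available as a dictionary from peptide to protein numbers and falls back to a plain substring scan otherwise; both function names are hypothetical, and the inference algorithm itself is not shown.

```python
def proteins_for_peptide(peptide, proteins, inverted_index=None):
    """Return the numbers of the proteins that contain `peptide`.
    inverted_index: optional dict mapping peptide -> list of protein numbers."""
    if inverted_index is not None and peptide in inverted_index:
        return inverted_index[peptide]           # direct table lookup
    # Fallback: scan the original protein sequence database.
    return [i for i, seq in enumerate(proteins) if peptide in seq]

def split_into_groups(items, num_groups):
    """Divide items evenly into num_groups consecutive groups."""
    size = -(-len(items) // num_groups)          # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]

proteins = ["GASKVLDR", "GASKAAR"]               # hypothetical database
groups = split_into_groups(list(range(70_000)), 700)
print(proteins_for_peptide("GASK", proteins), len(groups), len(groups[0]))
```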
Fig. 2 shows the structural diagram of the large-scale distributed parallel acceleration system for protein identification of the present invention. The system 200 comprises:
A peptide sequence index module 21, configured to perform theoretical enzymatic digestion on the input protein sequences in parallel to obtain peptide sequences, to sort the peptide sequences by theoretical precursor ion mass and remove redundancy so as to create the peptide index file blocks, and to generate a peptide index metadata file from the peptide index file blocks;
A spectrum data processing module 22, configured to sort the input mass spectra in parallel by experimental precursor ion mass, to divide the sorted spectra evenly into multiple spectrum data blocks, and to generate a mass spectrum metadata file from the spectrum data blocks;
A peptide-spectrum matching identification module 23, connected to the peptide sequence index module 21 and the spectrum data processing module 22, configured to distribute the spectrum data blocks evenly among the master processes, each of which manages multiple slave processes. Each master process sorts the spectrum data blocks assigned to it and dispatches them in turn to idle slave processes for peptide-spectrum matching identification; when there is more than one peptide index file block, the same spectrum data block is dispatched to several slave processes, each of which traverses a single peptide index file block to perform the peptide-spectrum matching identification;
A result gathering and output module 24, connected to the peptide-spectrum matching identification module 23, configured to gather the identification results in parallel, to look up the corresponding protein sequences from the identified peptide sequences, to perform peptide-to-protein inference, and to generate the output file.
In a conventional implementation, the peptide sequence index module 21 reads in the protein sequences one after another and digests them theoretically one by one to obtain peptide sequences, which are stored block by block as first-order temporary peptide sequence blocks. The first-order temporary blocks are then read in, every K blocks are merged, redundancy is removed, and the result, sorted by theoretical precursor ion mass, is written out as second-order temporary peptide sequence blocks. The second-order temporary blocks are read in and merged K at a time in the same way to produce third-order temporary blocks, and so on, iterating until all data have been merged together. Finally, the temporary peptide sequence blocks of the last round are read in turn to create the peptide index file blocks, and the information on all peptide index file blocks is collected and used to generate the peptide index metadata file.
In a conventional implementation, the spectrum data processing module 22 parses the mass spectra, reads them in one after another, and stores them block by block as first-order temporary spectrum data blocks. The first-order temporary blocks are then read in turn, every K blocks are merged, and the result, sorted by experimental precursor ion mass, is written out as second-order temporary spectrum data blocks. The second-order temporary blocks are read in turn and merged further to produce third-order temporary blocks, and so on, iterating until all data have been merged together. Finally, the temporary spectrum data blocks of the last round are read in turn and stored into a number of spectrum data blocks, each containing the same number of spectra, this number being specified by an input parameter. The information on all spectrum data blocks is then collected and used to generate the mass spectrum metadata file.
In a conventional implementation of the peptide-spectrum matching identification module 23, a single master process dispatches the spectrum data blocks to the slave processes one after another, and the slave processes take tasks on a first-come, first-served basis. After receiving the numbers of the designated peptide index file blocks, a slave process reads in all of those peptide index file blocks in turn, computes the possible modified forms of the original peptide sequences, uses the precursor ion mass tolerance window of the spectra in the spectrum data block to be identified to find the modified peptide sequences that fall within the set mass range, and feeds the qualifying modified peptide sequences to the peptide-spectrum match scoring algorithm to identify the peptide sequences. Each time a task is completed, the slave process stores an identification result sub-block, communicates with the master process to send back the file name of that sub-block, and requests the spectrum data block and peptide index file block information for its next task, until all spectrum data blocks have been identified.
Further, the peptide sequence index module 21 comprises:
A protein sequence division module 211, configured to divide the protein sequences evenly into multiple protein sequence sub-files. The number of sub-files may exceed the number of processor cores in the cluster, and each sub-file must be smaller than the local storage space of a cluster node;
A peptide mapping module 212, connected to the protein sequence division module 211, configured to start one peptide index mapping processor process (Peptide Map process for short) for each protein sequence sub-file; the Peptide Map processes run independently and in parallel. Each Peptide Map process digests every protein sequence in its sub-file in turn to obtain peptide sequences, assigns the peptide sequences to queues by mass range, removes redundant peptide sequences, and stores each queue in a different peptide sequence intermediate file;
A peptide reduction module 213, connected to the peptide mapping module 212, configured so that, for the different mass ranges, each mass range is handled by one peptide index reduction processor process (Peptide Reduce process for short); the Peptide Reduce processes run independently and in parallel. Each Peptide Reduce process reads in the peptide sequences from all peptide sequence intermediate files in its mass range and sorts them, first by theoretical precursor ion mass and, when the masses of the peptide sequences are equal, by the ordinary lexicographic order of the peptide sequence strings. After sorting, redundancy is removed and the peptide index file block is created;
A peptide index metadata file generation module 214, connected to the peptide reduction module 213, configured to collect the information on all peptide index file blocks and to generate the peptide index metadata file from it.
In a preferred embodiment, the peptide mapping module 212 is further configured so that, when the number of protein sequence sub-files exceeds the number of processor cores in the cluster, or exceeds the number of Peptide Map processes, the protein sequence sub-files are processed in multiple rounds: a Peptide Map process that has finished its task takes a new one on a first-come, first-served basis, until all protein sequence sub-files have been processed.
In a preferred embodiment, the peptide reduce module 213 is further configured so that, when the number of mass ranges exceeds the number of processor cores in the cluster, or exceeds the number of Peptide Reduce processes, the mass ranges are processed in multiple rounds: a Peptide Reduce process that has finished its task takes a new one on a first-come, first-served basis, until all peptide sequence intermediate files have been processed.
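The multi-round, first-come, first-served behaviour can be illustrated with a small worker pool. This is only a sketch: threads on one machine stand in for the cluster processes of the source, and the function name and arguments are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_first_come_first_served(tasks, worker, num_processes):
    """Process more tasks than workers: whichever worker becomes free first
    picks up the next pending task, until every task is done."""
    results = {}
    with ThreadPoolExecutor(max_workers=num_processes) as pool:
        futures = {pool.submit(worker, task): task for task in tasks}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results
```

With ten sub-files and three workers, for example, the pool needs several rounds, and no worker waits for a slower peer before taking its next sub-file.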
Further, the spectrum data processing module 22 comprises:
A spectrum partitioning module 221, configured to parse the input mass spectra and divide them evenly into a plurality of raw data blocks; the number of raw data blocks may exceed the number of processor cores in the cluster, and the size of each raw data block must be smaller than the local storage space of a cluster node;
A spectrum mapping module 222, connected to the spectrum partitioning module 221, configured to have each raw data block processed by one spectrum map process (Spectra Map process for short); each Spectra Map process reads the mass spectra in its raw data block in turn, places them into queues according to mass range, and then stores each queue in a separate spectrum intermediate file;
A spectrum reduce module 223, connected to the spectrum mapping module 222, configured to have each mass range processed by one spectrum reduce process (Spectra Reduce process for short), the Spectra Reduce processes running in parallel and independently of one another; each Spectra Reduce process reads all spectrum intermediate files in its assigned mass range and sorts the input mass spectra, first by experimental precursor ion mass and, when experimental precursor ion masses are equal, by the standard English lexicographic order of the spectrum title; after sorting, the spectra are stored in sequence into a number of spectrum data blocks, each containing the same number of mass spectra, this number being specified by an input parameter;
A spectrum data metadata file generation module 224, connected to the spectrum reduce module 223, configured to collect information on all spectrum data blocks and to generate the mass spectrum metadata file from that information.
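A minimal sketch of the spectrum map, reduce, and metadata steps follows. The `Spectrum` record, the 500 Da bin width, and the metadata fields are assumptions made for illustration; the source does not specify the file formats. Note also that in this sketch the last block of a mass range may hold fewer spectra than the others.

```python
from collections import namedtuple

# Hypothetical record: only the two fields used for sorting are modelled.
Spectrum = namedtuple("Spectrum", ["title", "precursor_mass"])

def spectra_map(spectra, bin_width=500.0):
    """Map step for one raw data block: route spectra to mass-range queues."""
    queues = {}
    for s in spectra:
        queues.setdefault(int(s.precursor_mass // bin_width), []).append(s)
    return queues

def spectra_reduce(intermediate_queues, spectra_per_block):
    """Reduce step for one mass range: sort by (precursor mass, title), then cut
    the sorted list into blocks of spectra_per_block spectra."""
    merged = [s for queue in intermediate_queues for s in queue]
    merged.sort(key=lambda s: (s.precursor_mass, s.title))
    return [merged[i:i + spectra_per_block]
            for i in range(0, len(merged), spectra_per_block)]

def block_metadata(blocks):
    """Collect per-block statistics of the kind a metadata file could hold."""
    return [{"block": i, "count": len(b),
             "min_mass": b[0].precursor_mass, "max_mass": b[-1].precursor_mass}
            for i, b in enumerate(blocks)]
```

The block size `spectra_per_block` plays the role of the input parameter mentioned above.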
In a preferred embodiment, the spectrum mapping module 222 is further configured so that, when the number of raw data blocks exceeds the number of processor cores in the cluster, or exceeds the number of Spectra Map processes, the raw data blocks are processed in multiple rounds: a Spectra Map process that has finished its task takes a new one on a first-come, first-served basis, until all raw data blocks have been processed.
In a preferred embodiment, the spectrum reduce module 223 is further configured so that, when the number of mass ranges exceeds the number of processor cores in the cluster, or exceeds the number of Spectra Reduce processes, the mass ranges are processed in multiple rounds: a Spectra Reduce process that has finished its task takes a new one on a first-come, first-served basis, until all spectrum data blocks have been processed.
In a preferred embodiment, the assignment operation performed by the peptide-spectrum matching identification module 23 comprises: the master process reads the mass spectrum metadata file and the peptide index metadata file and, using the statistics obtained, sorts the spectrum data blocks it is responsible for identifying by mass range from high to low and assigns them to the slave processes in that order; if there are multiple peptide index file blocks, the same spectrum data block is assigned multiple times, once for each peptide index file block. The slave processes take tasks on a first-come, first-served basis; each time a slave process completes an assigned task, it stores an identification result sub-block, communicates with the master process to send back the file name of that sub-block, and requests the spectrum data block and peptide index file block of its next task, until all spectrum data blocks have been identified.
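The assignment logic of one master process can be sketched as a task queue. This is a simplified, single-machine illustration: block names, the `(name, max_mass)` tuple layout, and the list of slave requests are hypothetical, and the inter-process communication of the source is reduced to a plain loop.

```python
from collections import deque

def build_task_queue(spectrum_blocks, peptide_index_blocks):
    """spectrum_blocks: list of (block_name, max_mass) owned by this master.
    Blocks are ordered by mass from high to low; each block yields one task
    per peptide index file block."""
    ordered = sorted(spectrum_blocks, key=lambda b: b[1], reverse=True)
    return deque((name, index_block)
                 for name, _ in ordered
                 for index_block in peptide_index_blocks)

def serve_slaves(tasks, slave_requests):
    """Hand the next task to each requesting slave, in order of arrival."""
    assignments = []
    for slave in slave_requests:
        if not tasks:
            break
        assignments.append((slave, tasks.popleft()))
    return assignments
```

A slave that finishes early simply appears again in `slave_requests`, which is how the first-come, first-served policy shows up in this sketch.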
In a preferred embodiment, the peptide-spectrum matching identification operation performed by the peptide-spectrum matching identification module 23 comprises: the slave process reads the peptide index file block, computes the possible variable-modification forms of each original peptide sequence, uses the precursor ion mass tolerance window of the spectra to be identified in the spectrum data block to search for modified peptide sequences that fall within the specified mass range, and passes the qualifying modified peptide sequences to the peptide-spectrum match scoring algorithm to identify the peptide sequence.
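Because the peptide index is sorted by mass, candidates inside a precursor mass tolerance window can be found by binary search. The sketch below assumes a single variable modification (a phosphorylation-like shift of 79.96633 Da on serine) and an index held as a list of `(mass, peptide)` pairs; both are illustrative assumptions, and the scoring algorithm itself is not shown.

```python
import bisect

def match_candidates(index, precursor_mass, tolerance,
                     mod_residue="S", mod_delta=79.96633, max_mods=1):
    """index: list of (unmodified_mass, peptide) sorted by mass.
    Returns (peptide, number_of_modifications) pairs whose modified mass lies
    within precursor_mass +/- tolerance."""
    masses = [mass for mass, _ in index]
    hits = []
    for n in range(max_mods + 1):
        # A peptide carrying n modifications matches if its unmodified mass
        # lies in the window shifted down by n * mod_delta.
        target = precursor_mass - n * mod_delta
        lo = bisect.bisect_left(masses, target - tolerance)
        hi = bisect.bisect_right(masses, target + tolerance)
        for _, peptide in index[lo:hi]:
            if peptide.count(mod_residue) >= n:  # enough modifiable sites
                hits.append((peptide, n))
    return hits
```

Each returned pair would then be handed to the scoring step for the spectrum in question.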
Further, the result summarization and output module 24 comprises:
A summarizing module 241, configured to have all identification result sub-blocks corresponding to each spectrum data block processed by one identification result gather process (Results Gather process for short), the Results Gather processes running in parallel and independently of one another; each Results Gather process reads all identification result sub-blocks of the spectrum data block assigned to it, sorts the peptide sequences of all identification results of each mass spectrum by the score from the peptide-spectrum match scoring algorithm, retains the top-ranked peptide sequences and their scores, and stores them in a block summary file;
A filtering, inference and output module 242, connected to the summarizing module 241, configured to read all block summary files, filter and de-duplicate the peptide sequences of the identification results of each mass spectrum, and divide the resulting non-redundant peptide sequences evenly into several groups; each group of non-redundant peptide sequences is handled by one protein query process (Protein Select process for short), which looks up the corresponding protein identifiers and sequences, the Protein Select processes running in parallel and independently of one another; after the lookup, the peptide-to-protein inference algorithm is run and the output file is generated.
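The gathering step for one spectrum data block amounts to merging per-spectrum hit lists and keeping the best-scoring entries. A minimal sketch, assuming each identification result sub-block is a dictionary from spectrum title to a list of `(score, peptide)` pairs and that a higher score is better; the data layout and the `top_k` parameter are hypothetical.

```python
import heapq

def gather_block(sub_blocks, top_k):
    """Merge all result sub-blocks of one spectrum data block and keep the
    top_k highest-scoring peptides for every spectrum."""
    merged = {}
    for sub_block in sub_blocks:
        for title, hits in sub_block.items():
            merged.setdefault(title, []).extend(hits)
    # nlargest returns the hits in descending score order.
    return {title: heapq.nlargest(top_k, hits) for title, hits in merged.items()}
```

There is one sub-block per peptide index file block searched, so a spectrum's final ranking is only known after all of its sub-blocks have been merged.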
In a preferred embodiment, the summarizing module 241 is further configured so that, when the number of spectrum data blocks exceeds the number of processor cores in the cluster, or exceeds the number of Results Gather processes, the identification result sub-blocks are processed in multiple rounds: a Results Gather process that has finished its task takes a new one on a first-come, first-served basis, until all identification result sub-blocks have been processed.
In a preferred embodiment, the filtering, inference and output module 242 is further configured so that, when the number of non-redundant peptide sequence groups exceeds the number of processor cores in the cluster, or exceeds the number of Protein Select processes, the non-redundant peptide sequence groups are processed in multiple rounds: a Protein Select process that has finished its task takes a new one on a first-come, first-served basis, until all non-redundant peptide sequences have been processed.
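The grouping and protein lookup can be sketched as below. The substring search is a stand-in chosen for brevity, and the function names and the dictionary of protein sequences are hypothetical; the source does not describe how the lookup is implemented, and the inference algorithm that follows it is not shown.

```python
def split_evenly(items, num_groups):
    """Split items into num_groups contiguous groups whose sizes differ by at most one."""
    base, extra = divmod(len(items), num_groups)
    groups, start = [], 0
    for g in range(num_groups):
        size = base + (1 if g < extra else 0)
        groups.append(items[start:start + size])
        start += size
    return groups

def protein_select(peptide_group, proteins):
    """proteins: dict mapping protein id to sequence.
    Returns, for each peptide, the sorted ids of the proteins that contain it."""
    return {peptide: sorted(pid for pid, seq in proteins.items() if peptide in seq)
            for peptide in peptide_group}
```

Each group returned by `split_evenly` corresponds to the work of one Protein Select process.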
The present invention proposes a large-scale distributed parallel acceleration method and system for protein identification that addresses the poor acceleration efficiency of the prior art at parallel scales of hundreds or even thousands of processor cores; in particular, satisfactory acceleration efficiency can still be obtained when the number of processor cores reaches several hundred or even exceeds one thousand.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the embodiments, those of ordinary skill in the art should understand that modifications or equivalent substitutions of the technical solution that do not depart from its spirit and scope shall all fall within the scope of the claims of the present invention.

Claims (16)

1. A large-scale distributed parallel acceleration method for protein identification, characterized by comprising:
Step 1: inputting protein sequences, performing theoretical enzymatic digestion on the protein sequences to obtain peptide sequences, sorting the peptide sequences by theoretical precursor ion mass and removing redundancy to create peptide index file blocks, and generating a peptide index metadata file from the peptide index file blocks;
Step 2: inputting mass spectra, sorting the mass spectra by experimental precursor ion mass using a parallel processing method, dividing the sorted mass spectra evenly to obtain a plurality of spectrum data blocks, and generating a mass spectrum metadata file from the spectrum data blocks;
Step 3: distributing the spectrum data blocks evenly among a plurality of master processes, each master process managing a plurality of slave processes; each master process sorts the spectrum data blocks distributed to it and assigns them in turn to idle slave processes for peptide-spectrum matching identification; when there is more than one peptide index file block, the same spectrum data block is assigned to a plurality of slave processes, each of which traverses a single peptide index file block to perform peptide-spectrum matching identification;
Step 4: using a parallel processing method, summarizing the identification results, inferring the corresponding protein sequences from the identified peptide sequences, and generating an output file;
wherein step 2 further comprises:
21. parsing the mass spectra and dividing them evenly into a plurality of raw data blocks, the size of each raw data block being smaller than the local storage space of a cluster node;
22. having each raw data block processed by one spectrum map process, wherein the spectrum map process reads the mass spectra in its raw data block in turn, places them into queues according to mass range, and then stores each queue in a separate spectrum intermediate file;
23. for the different mass ranges, having each mass range processed by one spectrum reduce process, the spectrum reduce processes running in parallel and independently of one another, wherein the spectrum reduce process reads all spectrum intermediate files in its mass range and sorts the input mass spectra first by experimental precursor ion mass and, when experimental precursor ion masses are equal, by the standard English lexicographic order of the spectrum title; after sorting, the spectra are stored in sequence into a plurality of spectrum data blocks, each containing the same number of mass spectra;
24. collecting information on all the spectrum data blocks and generating the mass spectrum metadata file from that information.
2. The large-scale distributed parallel acceleration method for protein identification according to claim 1, characterized in that
step 22 further comprises:
when the number of raw data blocks exceeds the number of processor cores in the cluster, or exceeds the number of spectrum map processes, processing the raw data blocks in multiple rounds, wherein a spectrum map process that has finished its task takes a new one on a first-come, first-served basis, until all the raw data blocks have been processed.
3. The large-scale distributed parallel acceleration method for protein identification according to claim 1, characterized in that
step 23 further comprises:
when the number of mass ranges exceeds the number of processor cores in the cluster, or exceeds the number of spectrum reduce processes, processing the mass ranges in multiple rounds, wherein a spectrum reduce process that has finished its task takes a new one on a first-come, first-served basis, until all the spectrum intermediate files have been processed.
4. The large-scale distributed parallel acceleration method for protein identification according to claim 1, 2 or 3, characterized in that
in step 3, the step in which the master process assigns idle slave processes to perform peptide-spectrum matching identification comprises:
the master process reads the mass spectrum metadata file and the peptide index metadata file and, using the statistics obtained, sorts the spectrum data blocks it is responsible for identifying by mass range from high to low and assigns them to the slave processes in that order; if there are multiple peptide index file blocks, the same spectrum data block is assigned multiple times, once for each peptide index file block; the slave processes take tasks on a first-come, first-served basis, and each time a slave process completes an assigned task, it stores an identification result sub-block, communicates with the master process to send back the file name of the identification result sub-block, and requests the spectrum data block and peptide index file block of its next task, until all spectrum data blocks have been identified.
5. The large-scale distributed parallel acceleration method for protein identification according to claim 4, characterized in that
in step 3, the step in which the slave process performs peptide-spectrum matching identification comprises:
the slave process reads the peptide index file block, computes the possible variable-modification forms of each original peptide sequence, uses the precursor ion mass tolerance window of the spectra to be identified in the spectrum data block to search for modified peptide sequences that fall within the specified mass range, and passes the qualifying modified peptide sequences to the peptide-spectrum match scoring algorithm to identify the peptide sequence.
6. The large-scale distributed parallel acceleration method for protein identification according to claim 1, 2, 3 or 5, characterized in that
step 4 further comprises:
41. summarizing the identification results, wherein all identification result sub-blocks corresponding to each spectrum data block are processed by one identification result gather process, the identification result gather processes running in parallel and independently of one another; the identification result gather process reads all identification result sub-blocks of the spectrum data block assigned to it, sorts the peptide sequences of all identification results of each mass spectrum by the score from the peptide-spectrum match scoring algorithm, retains the top-ranked peptide sequences and their scores, and stores them in a block summary file;
42. reading all block summary files, filtering and de-duplicating the peptide sequences of the identification results of each mass spectrum, and dividing the resulting non-redundant peptide sequences evenly into a plurality of groups, wherein each group of non-redundant peptide sequences is handled by one protein query process, which looks up the corresponding protein identifiers and sequences, the protein query processes running in parallel and independently of one another; applying the peptide-to-protein inference algorithm to the lookup results and generating the output file.
7. The large-scale distributed parallel acceleration method for protein identification according to claim 6, characterized in that
step 41 further comprises:
when the number of spectrum data blocks exceeds the number of processor cores in the cluster, or exceeds the number of identification result gather processes, processing the identification result sub-blocks in multiple rounds, wherein an identification result gather process that has finished its task takes a new one on a first-come, first-served basis, until all the identification result sub-blocks have been processed.
8. The large-scale distributed parallel acceleration method for protein identification according to claim 6, characterized in that
step 42 further comprises:
when the number of non-redundant peptide sequence groups exceeds the number of processor cores in the cluster, or exceeds the number of protein query processes, processing the non-redundant peptide sequence groups in multiple rounds, wherein a protein query process that has finished its task takes a new one on a first-come, first-served basis, until all non-redundant peptide sequences have been processed.
9. A large-scale distributed parallel acceleration system for protein identification, characterized by comprising:
a peptide sequence index module, configured to perform theoretical enzymatic digestion on the input protein sequences to obtain peptide sequences, to sort the peptide sequences by theoretical precursor ion mass and remove redundancy so as to create peptide index file blocks, and to generate a peptide index metadata file from the peptide index file blocks;
a spectrum data processing module, configured to sort the input mass spectra by experimental precursor ion mass using a parallel processing method, to divide the sorted mass spectra evenly to obtain a plurality of spectrum data blocks, and to generate a mass spectrum metadata file from the spectrum data blocks;
a peptide-spectrum matching identification module, connected to the peptide sequence index module and the spectrum data processing module, configured to distribute the spectrum data blocks evenly among the master processes, each master process managing a plurality of slave processes; each master process sorts the spectrum data blocks distributed to it and assigns them in turn to idle slave processes for peptide-spectrum matching identification; when there is more than one peptide index file block, the same spectrum data block is assigned to a plurality of slave processes, each of which traverses a single peptide index file block to perform peptide-spectrum matching identification;
a result summarization and output module, connected to the peptide-spectrum matching identification module, configured to summarize the identification results using a parallel processing method, to infer the corresponding protein sequences from the identified peptide sequences, and to generate an output file;
wherein the spectrum data processing module comprises:
a spectrum partitioning module, configured to parse the mass spectra and divide them evenly into a plurality of raw data blocks, the size of each raw data block being smaller than the local storage space of a cluster node;
a spectrum mapping module, connected to the spectrum partitioning module, configured to have each raw data block processed by one spectrum map process, wherein the spectrum map process reads the mass spectra in its raw data block in turn, places them into queues according to mass range, and then stores each queue in a separate spectrum intermediate file;
a spectrum reduce module, connected to the spectrum mapping module, configured to have each of the different mass ranges processed by one spectrum reduce process, the spectrum reduce processes running in parallel and independently of one another, wherein the spectrum reduce process reads all spectrum intermediate files in its mass range and sorts the input mass spectra first by experimental precursor ion mass and, when experimental precursor ion masses are equal, by the standard English lexicographic order of the spectrum title; after sorting, the spectra are stored in sequence into a plurality of spectrum data blocks, each containing the same number of mass spectra;
a mass spectrum metadata file generation module, connected to the spectrum reduce module, configured to collect information on all the spectrum data blocks and to generate the mass spectrum metadata file from that information.
10. The large-scale distributed parallel acceleration system for protein identification according to claim 9, characterized in that
the spectrum mapping module is further configured so that, when the number of raw data blocks exceeds the number of processor cores in the cluster, or exceeds the number of spectrum map processes, the raw data blocks are processed in multiple rounds, wherein a spectrum map process that has finished its task takes a new one on a first-come, first-served basis, until all raw data blocks have been processed.
11. The large-scale distributed parallel acceleration system for protein identification according to claim 9, characterized in that
the spectrum reduce module is further configured so that, when the number of mass ranges exceeds the number of processor cores in the cluster, or exceeds the number of spectrum reduce processes, the mass ranges are processed in multiple rounds, wherein a spectrum reduce process that has finished its task takes a new one on a first-come, first-served basis, until all spectrum intermediate files have been processed.
12. The large-scale distributed parallel acceleration system for protein identification according to claim 9, 10 or 11, characterized in that
the peptide-spectrum matching identification module is further configured so that the master process reads the mass spectrum metadata file and the peptide index metadata file and, using the statistics obtained, sorts the spectrum data blocks it is responsible for identifying by mass range from high to low and assigns them to the slave processes in that order; if there are multiple peptide index file blocks, the same spectrum data block is assigned multiple times, once for each peptide index file block; the slave processes take tasks on a first-come, first-served basis, and each time a slave process completes an assigned task, it stores an identification result sub-block, communicates with the master process to send back the file name of the identification result sub-block, and requests the spectrum data block and peptide index file block of its next task, until all spectrum data blocks have been identified.
13. The large-scale distributed parallel acceleration system for protein identification according to claim 12, characterized in that
the peptide-spectrum matching identification module is further configured so that the slave process reads the peptide index file block, computes the possible variable-modification forms of each original peptide sequence, uses the precursor ion mass tolerance window of the spectra to be identified in the spectrum data block to search for modified peptide sequences that fall within the specified mass range, and passes the qualifying modified peptide sequences to the peptide-spectrum match scoring algorithm to identify the peptide sequence.
14. The large-scale distributed parallel acceleration system for protein identification according to claim 9, 10, 11 or 13, characterized in that
the result summarization and output module comprises:
a summarizing module, configured to have all identification result sub-blocks corresponding to each spectrum data block processed by one identification result gather process, the identification result gather processes running in parallel and independently of one another, wherein the identification result gather process reads all identification result sub-blocks of the spectrum data block assigned to it, sorts the peptide sequences of all identification results of each mass spectrum by the score from the peptide-spectrum match scoring algorithm, retains the top-ranked peptide sequences and their scores, and stores them in a block summary file;
a filtering, inference and output module, connected to the summarizing module, configured to read the block summary files, filter and de-duplicate the peptide sequences of the identification results of each mass spectrum, and divide the resulting non-redundant peptide sequences into a plurality of groups, wherein each group of non-redundant peptide sequences is handled by one protein query process, which looks up the corresponding protein identifiers and sequences, the protein query processes running in parallel and independently of one another; the peptide-to-protein inference algorithm is applied to the lookup results and the output file is generated.
15. The large-scale distributed parallel acceleration system for protein identification according to claim 14, characterized in that
the summarizing module is further configured so that, when the number of spectrum data blocks exceeds the number of processor cores in the cluster, or exceeds the number of identification result gather processes, the identification result sub-blocks are processed in multiple rounds, wherein an identification result gather process that has finished its task takes a new one on a first-come, first-served basis, until all identification result sub-blocks have been processed.
16. The large-scale distributed parallel acceleration system for protein identification according to claim 14, characterized in that
the filtering, inference and output module is further configured so that, when the number of non-redundant peptide sequence groups exceeds the number of processor cores in the cluster, or exceeds the number of protein query processes, the non-redundant peptide sequence groups are processed in multiple rounds, wherein a protein query process that has finished its task takes a new one on a first-come, first-served basis, until all non-redundant peptide sequences have been processed.
CN201010292032.5A 2010-09-26 2010-09-26 Large-scale distributed parallel acceleration method and system for protein identification Active CN102411680B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201010292032.5A CN102411680B (en) 2010-09-26 2010-09-26 Large-scale distributed parallel acceleration method and system for protein identification

Publications (2)

Publication Number Publication Date
CN102411680A CN102411680A (en) 2012-04-11
CN102411680B true CN102411680B (en) 2014-03-26
