CN102411680A - Large-scale distributed parallel acceleration method and system for protein identification

Publication number
CN102411680A
Authority
CN
China
Legal status
Granted
Application number
CN2010102920325A
Other languages
Chinese (zh)
Other versions
CN102411680B (en)
Inventor
王乐珩
王文平
迟浩
吴妍洁
周郴
付岩
孙瑞祥
贺思敏
Current Assignee
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS

Abstract

The invention relates to a large-scale distributed parallel acceleration method and system for protein identification. The method comprises the following steps: 1, performing theoretical enzymatic digestion on protein sequences to obtain peptide sequences, then sorting the peptide sequences and removing redundant ones to build peptide index file blocks; 2, sorting the mass spectra with a parallel processing method and dividing the sorted spectra evenly into a plurality of spectrum data blocks; 3, distributing the spectrum data blocks evenly among a plurality of master processes, each of which sorts its assigned spectrum data blocks and dispatches them in turn to idle slave processes for peptide-spectrum matching; and 4, gathering the identification results with a parallel processing method, inferring the corresponding protein sequences from the identified peptide sequences, and generating an output file. With the method and system, protein identification achieves satisfactory acceleration efficiency even when the number of processor cores reaches several hundred or exceeds one thousand.

Description

A large-scale distributed parallel acceleration method and system for protein identification
Technical field
The present invention relates to a distributed parallel acceleration method for large-scale protein identification, and in particular to a method and system that use distributed parallel techniques to share the search task effectively across multiple computing nodes, thereby increasing the speed of protein identification.
Background art
The "proteome" is the complete set of proteins expressed in a particular biological sample at a given time and under given conditions. As the name suggests, proteomics is the study of the proteome. Its most basic task is to determine which proteins are expressed in an organism, their expression levels, their post-translational modifications, and protein-protein interactions, and thereby to obtain an overall and comprehensive understanding, at the protein level, of processes such as disease development and cellular metabolism. In current proteomics research, protein identification based on tandem mass spectrometry is one of the most widely used techniques; it is described in detail in reference 1, "Aebersold, R. and Mann, M. Mass spectrometry-based proteomics. Nature, 2003, 422:198-207".
The basic steps of protein identification based on tandem mass spectrometry are as follows. First, the mixed protein sample is digested into peptides, which are separated by liquid chromatography and enter the mass spectrometer, yielding experimental tandem mass spectra of the peptides. The spectra are then analyzed to obtain the corresponding peptide sequences. Finally, peptide-to-protein inference yields the list of proteins in the mixed sample, which accomplishes the identification. Database searching is widely used to identify the peptide sequences that produced the experimental tandem mass spectra. Peptide identification by database search is described in detail in reference 2, "Eng, J.K., McCormack, A.L. and Yates, J.R. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J Am Soc Mass Spectrom, 1994, 5:976-989"; reference 3, "Perkins, D.N., Pappin, D.J., Creasy, D.M. and Cottrell, J.S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis, 1999, 20:3551-3567"; and reference 4, "Field, H.I., Fenyö, D. and Beavis, R.C. RADARS, a bioinformatics solution that automates proteome mass spectral analysis, optimises protein identification, and archives data in a relational database. Proteomics, 2002, 2:36-47".
Adopt the method for database search to identify that through peptide sequence the method that realizes identification of proteins mainly may further comprise the steps: at first, the enzyme in the simulation biology is cut rule the protein sequence in the Protein Data Bank is cut into peptide sequence; Calculate the quality of each peptide sequence that cutting obtains then; Utilize the parent ion quality error window in the mass spectrometric data to seek the peptide sequence that meets in the certain mass scope at last, satisfactory peptide sequence is inputed to scoring functions to realize the evaluation to peptide sequence.
In recent years, protein databases have kept growing and the demand for identifying non-specifically digested peptides has kept rising, so the number of peptide sequences to be searched keeps increasing. At the same time, mass spectrometry data are being generated ever faster, which places higher demands on the speed of protein identification. The identification method described above is not efficient enough, so the database search method needs to be accelerated.
In recent years, as commodity clusters have become cheap and widespread, large-scale parallel computing has become the mainstream solution for accelerating scientific and industrial computation. A cluster is a group of computers interconnected by a network and scheduled and coordinated as a whole to achieve efficient parallel computing. Unlike early supercomputers with a unified address space, each node in a cluster has its own central processing unit, memory, and necessary peripherals. Processes in a cluster can run in parallel on a large scale, but communication between them is expensive. This means that a serial or multithreaded program written for an ordinary computer does not scale naturally; that is, porting a single-machine program to a cluster does not directly yield a speedup. The existing algorithm must be redesigned to make full use of the hardware. Even an algorithm with a clear speedup on small and medium-sized clusters sees that speedup decline as the cluster grows. Most existing industrial computing software cannot achieve linear speedup beyond a hundred processor cores, and software that achieves linear speedup at a thousand cores or more is rarer still. Besides speed, the use of clusters also involves space. High-performance computing on clusters often deals with very large data sets (for example, the protein sequences of an organism and massive numbers of mass spectra to be identified). On an ordinary computer or a single cluster node, some routine operations cannot even run on such data sets (for example, an ordinary in-memory sort that loads all the data into memory), so the data must be handled with cluster hardware and specially designed software algorithms.
Most existing protein identification search engines have parallel versions. These are described in reference 5, "Sadygov, R.G., Eng, J., Durr, E., Saraf, A., McDonald, H., MacCoss, M.J., Yates, J.R. 3rd. Code developments to improve the efficiency of automated MS/MS spectra interpretation. J Proteome Res, 2002, 1:211-215"; reference 6, "Duncan, D.T., Craig, R., Link, A.J. Parallel tandem: a program for parallel processing of tandem mass spectra using PVM or MPI and X!Tandem. J Proteome Res, 2005, 4:1842-1847"; reference 7, "Bjornson, R.D., Carriero, N.J., Colangelo, C., Shifman, M., Cheung, K.H., Miller, P.L., Williams, K. X!!Tandem, an improved method for running X!Tandem in parallel on collections of commodity computers. J Proteome Res, 2008, 7:293-299"; reference 8, "Halligan, B.D., Geiger, J.F., Vallejos, A.K., Greene, A.S., Twigger, S.N. Low Cost, Scalable Proteomics Data Analysis Using Amazon's Cloud Computing Services and Open Source Search Algorithms. J Proteome Res, 2009, 8:3148-3153"; and reference 9, "Leheng Wang, Wenping Wang, Hao Chi, Yanjie Wu, You Li, Yan Fu, Chen Zhou, Ruixiang Sun, Haipeng Wang, Chao Liu, Zuofei Yuan, Liyun Xiu, Si-Min He. An efficient parallelization of phosphorylated peptide. Rapid Commun Mass Spectrom, 2010, 24:1791-1798". However, all of these methods are suitable only for relatively small clusters. Once the number of processor cores reaches several hundred or exceeds a thousand, acceleration efficiency drops markedly, and additional hardware investment no longer buys a higher speedup. Given the shortcomings of existing methods on large clusters, an effective distributed parallel acceleration method is of great practical value.
Summary of the invention
The object of the present invention is to provide a large-scale distributed parallel acceleration method and system for protein identification that solve the prior-art problem of poor acceleration efficiency when running in parallel on a hundred, or even more than a thousand, processor cores.
To achieve this object, the present invention provides a large-scale distributed parallel acceleration method for protein identification, characterized by comprising:
Step 1: input protein sequences, perform theoretical digestion on the protein sequences to obtain peptide sequences, sort the peptide sequences by theoretical precursor ion mass and remove redundancy to create peptide index file blocks, and generate a peptide index metadata file from the peptide index file blocks;
Step 2: input mass spectra, sort the mass spectra by experimental precursor ion mass using a parallel processing method, divide the sorted mass spectra evenly to obtain a plurality of spectrum data blocks, and generate a spectrum metadata file from the spectrum data blocks;
Step 3: distribute the spectrum data blocks evenly among a plurality of master processes, each of which manages a plurality of slave processes; each master process sorts the spectrum data blocks assigned to it and dispatches them in turn to idle slave processes for peptide-spectrum matching; when there is more than one peptide index file block, the same spectrum data block is dispatched to a plurality of slave processes, each of which traverses a single peptide index file block to perform peptide-spectrum matching;
Step 4: gather the identification results using a parallel processing method, infer the corresponding protein sequences from the identified peptide sequences, and generate an output file.
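Step 1 can be sketched in a few lines of Python. This is a single-process illustration under stated assumptions, not the patented distributed implementation: the cleavage rule (cut after K or R, no missed cleavages) is an assumed trypsin-like rule, the residue mass table is a small illustrative subset, and the function names and block size are hypothetical.

```python
# Monoisotopic residue masses in daltons (illustrative subset).
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "K": 128.09496, "R": 156.10111}
WATER = 18.01056


def peptide_mass(seq):
    """Theoretical neutral mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER


def digest(protein, cleave_after="KR"):
    """Theoretical digestion: cut after every residue in `cleave_after`."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        if aa in cleave_after:
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal remainder
    return peptides


def build_index_blocks(proteins, block_size):
    """Digest, remove duplicate peptides, sort by mass, and split into blocks."""
    unique = {pep for protein in proteins for pep in digest(protein)}
    ordered = sorted(unique, key=lambda pep: (peptide_mass(pep), pep))
    return [ordered[i:i + block_size] for i in range(0, len(ordered), block_size)]


# Two hypothetical proteins that share the peptide GGR.
blocks = build_index_blocks(["GAKGGR", "GGRAASR"], block_size=2)
print(blocks)
```

The shared peptide appears only once in the index, and each block holds a contiguous mass range, so a block can be loaded on its own and searched by mass.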
In the large-scale distributed parallel acceleration method for protein identification described above,
step 2 further comprises:
21: parse the mass spectra and divide them evenly into a plurality of raw data blocks, each of which is smaller than the local storage space of a cluster node;
22: each raw data block is processed by one spectrum mapping process, which reads the mass spectra in its raw data block one by one, places each mass spectrum in the queue for its mass range, and then stores each queue in a separate spectrum intermediate file;
23: each mass range is processed by one spectrum reduction process, and the spectrum reduction processes run in parallel independently of one another; a spectrum reduction process reads all spectrum intermediate files for its mass range, sorts the input mass spectra first by experimental precursor ion mass and, where the masses are equal, by the lexicographic order of the spectrum titles, and stores the sorted spectra in sequence in a plurality of spectrum data blocks, each containing the same number of mass spectra;
24: collect the information on all spectrum data blocks and generate the spectrum metadata file from it.
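The map and reduce phases of steps 21 to 24 can be sketched as follows. The sketch runs in one process and keeps everything in memory, whereas the method described above uses separate processes and intermediate files; the bin width, the `(mass, title)` tuple representation of a spectrum, and all function names are assumptions made for illustration. As a simplification, the sorted ranges are concatenated before being cut into equal-sized blocks.

```python
from collections import defaultdict


def map_chunk(chunk, bin_width):
    """Map phase: route each (mass, title) spectrum to the queue of its mass range."""
    queues = defaultdict(list)
    for mass, title in chunk:
        queues[int(mass // bin_width)].append((mass, title))
    return queues


def reduce_range(intermediates):
    """Reduce phase: merge the intermediates of one mass range and sort them.

    Tuples compare by mass first and, on ties, by title in lexicographic order.
    """
    return sorted(s for part in intermediates for s in part)


def sort_spectra(chunks, bin_width, block_size):
    """Run map on every raw chunk, reduce every mass range, then cut equal blocks."""
    per_range = defaultdict(list)
    for chunk in chunks:
        for key, queue in map_chunk(chunk, bin_width).items():
            per_range[key].append(queue)
    ordered = []
    for key in sorted(per_range):  # mass ranges in ascending order
        ordered.extend(reduce_range(per_range[key]))
    return [ordered[i:i + block_size] for i in range(0, len(ordered), block_size)]


# Two hypothetical raw data blocks of (precursor mass, title) pairs.
raw = [[(1200.5, "b"), (800.2, "a")], [(1200.5, "a"), (950.0, "c")]]
spectrum_blocks = sort_spectra(raw, bin_width=500, block_size=2)
print(spectrum_blocks)
```

Because each mass range is sorted independently, no single process ever has to hold or sort the full spectrum set.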
In the large-scale distributed parallel acceleration method for protein identification described above,
step 22 further comprises:
when the number of raw data blocks exceeds the number of processor cores in the cluster, or exceeds the number of spectrum mapping processes, the raw data blocks are processed in multiple rounds: a spectrum mapping process that finishes its task takes a new one on a first-come, first-served basis until all raw data blocks have been processed.
In the large-scale distributed parallel acceleration method for protein identification described above,
step 23 further comprises:
when the number of mass ranges exceeds the number of processor cores in the cluster, or exceeds the number of spectrum reduction processes, the mass ranges are processed in multiple rounds: a spectrum reduction process that finishes its task takes a new one on a first-come, first-served basis until all spectrum intermediate files have been processed.
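The multi-round, first-come, first-served rule can be sketched with a shared task queue. Here threads stand in for the cluster processes, which is an assumption made purely so the example is self-contained; the function and variable names are hypothetical.

```python
import queue
import threading


def run_first_come_first_served(tasks, handle, n_workers):
    """Process more tasks than workers: each worker that finishes takes the next task."""
    pending = queue.Queue()
    for index, task in enumerate(tasks):
        pending.put((index, task))
    results = [None] * len(tasks)

    def worker():
        while True:
            try:
                index, task = pending.get_nowait()
            except queue.Empty:
                return  # nothing left to take: this worker is done
            results[index] = handle(task)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Ten hypothetical tasks handled by three workers.
squares = run_first_come_first_served(list(range(10)), lambda x: x * x, n_workers=3)
print(squares)
```

No worker is bound to a fixed share of the tasks, so a worker that draws short tasks simply handles more of them.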
In the large-scale distributed parallel acceleration method for protein identification described above,
in step 3, the dispatching by the master process to idle slave processes for peptide-spectrum matching comprises:
the master process reads the spectrum metadata file and the peptide index metadata file; based on the statistics obtained, it sorts the spectrum data blocks it is responsible for by mass range from high to low and dispatches them in turn to the slave processes; if there are multiple peptide index file blocks, the same spectrum data block is dispatched multiple times, once for each peptide index file block; the slave processes take tasks on a first-come, first-served basis; each time a slave process completes a task, it stores an identification result sub-block, communicates with the master process to return the file name of the sub-block, and requests the spectrum data block and peptide index file block information for its next task, until all spectrum data blocks have been identified.
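The task list a master process hands out can be sketched as below. The block names, the use of each block's maximum precursor mass as its sort key, and the function name are assumptions for illustration. The source does not state why blocks are ordered from high to low mass; one plausible reading is that heavier blocks tend to take longer, so starting them first helps balance the load.

```python
def build_task_list(spectrum_blocks, n_index_blocks):
    """Order spectrum blocks by mass from high to low and pair each with every index block.

    `spectrum_blocks` is a list of (block_name, max_precursor_mass) pairs, as might be
    read from the spectrum metadata file.
    """
    ordered = sorted(spectrum_blocks, key=lambda block: block[1], reverse=True)
    return [(name, k) for name, _ in ordered for k in range(n_index_blocks)]


# Three hypothetical spectrum blocks and two peptide index blocks.
tasks = build_task_list([("b0", 900.0), ("b1", 2500.0), ("b2", 1400.0)], n_index_blocks=2)
print(tasks)
```

Each `(block, index)` pair is one unit of work that a slave process requests when it becomes idle.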
In the large-scale distributed parallel acceleration method for protein identification described above,
in step 3, the peptide-spectrum matching performed by a slave process comprises:
the slave process reads the peptide index file block, computes the possible modified forms of each original peptide sequence, uses the precursor ion mass tolerance window of the spectra in the spectrum data block to be identified to find the modified peptide sequences within the set mass range, and passes the matching modified peptide sequences to the peptide-spectrum match scoring algorithm to identify the peptide sequence.
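A minimal sketch of enumerating modified forms and filtering them by the precursor mass window is given below. The choice of phosphorylation on S, T, or Y (+79.96633 Da) as the variable modification, the cap on the number of modifications, and the function names are assumptions; the patent text does not fix any of them, and scoring is left out.

```python
from itertools import combinations


def modified_forms(seq, base_mass, mod_sites="STY", mod_delta=79.96633, max_mods=2):
    """Enumerate (modified positions, mass) for every way of placing up to max_mods modifications."""
    sites = [i for i, aa in enumerate(seq) if aa in mod_sites]
    forms = []
    for n in range(min(max_mods, len(sites)) + 1):
        for chosen in combinations(sites, n):
            forms.append((chosen, base_mass + n * mod_delta))
    return forms


def within_window(forms, precursor_mass, tolerance):
    """Keep only the forms whose mass falls inside the precursor tolerance window."""
    return [form for form in forms if abs(form[1] - precursor_mass) <= tolerance]


# Hypothetical peptide GASK (unmodified mass 361.19612 Da) with one possible site, S at index 2.
forms = modified_forms("GASK", 361.19612)
matches = within_window(forms, 441.162, 0.01)
print(matches)
```

Only the forms that survive the window filter would be passed on to the peptide-spectrum match scoring algorithm.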
In the large-scale distributed parallel acceleration method for protein identification described above,
step 4 further comprises:
41: gather the identification results; all identification result sub-blocks corresponding to each spectrum data block are processed by one result aggregation process, and the result aggregation processes run in parallel independently of one another; a result aggregation process reads all identification result sub-blocks of the spectrum data block assigned to it, ranks the peptide sequences of all identification results for each mass spectrum by the score of the peptide-spectrum match scoring algorithm, keeps the top-ranked peptide sequences and their scores, and stores them in a block summary file;
42: read all block summary files; filter and de-duplicate the peptide sequences identified for each mass spectrum; divide the resulting non-redundant peptide sequences evenly into multiple groups, each of which is handled by one protein lookup process that finds the corresponding protein identifiers and sequences, the protein lookup processes running in parallel independently of one another; apply a peptide-to-protein inference algorithm to the lookup results and generate the output file.
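Steps 41 and 42 can be sketched as a top-k merge followed by de-duplication and an even split. The `(title, peptide, score)` record layout, the value of k, and the function names are assumptions; the filtering criterion and the peptide-to-protein inference itself are not specified in this passage and are omitted.

```python
import heapq
from collections import defaultdict


def summarize(sub_blocks, top_k):
    """Merge the result sub-blocks of one spectrum data block, keeping the top_k hits per spectrum."""
    per_spectrum = defaultdict(list)
    for sub_block in sub_blocks:
        for title, peptide, score in sub_block:
            per_spectrum[title].append((score, peptide))
    return {title: heapq.nlargest(top_k, hits) for title, hits in per_spectrum.items()}


def peptide_groups(summaries, n_groups):
    """De-duplicate the kept peptides and split them into n_groups of near-equal size."""
    unique = sorted({pep for s in summaries for hits in s.values() for _, pep in hits})
    size, extra = divmod(len(unique), n_groups)
    groups, start = [], 0
    for g in range(n_groups):
        end = start + size + (1 if g < extra else 0)
        groups.append(unique[start:end])
        start = end
    return groups


# Two hypothetical sub-blocks for the same spectrum data block (one per peptide index block).
sub_blocks = [[("s1", "GASK", 3.0), ("s2", "GGR", 1.5)],
              [("s1", "AASR", 4.2), ("s1", "GAK", 0.7)]]
summary = summarize(sub_blocks, top_k=2)
groups = peptide_groups([summary], n_groups=2)
print(summary, groups)
```

Each group would then go to its own protein lookup process, so the lookups run in parallel without sharing state.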
In the large-scale distributed parallel acceleration method for protein identification described above,
step 41 further comprises:
when the number of spectrum data blocks exceeds the number of processor cores in the cluster, or exceeds the number of result aggregation processes, the identification result sub-blocks are processed in multiple rounds: a result aggregation process that finishes its task takes a new one on a first-come, first-served basis until all identification result sub-blocks have been processed.
In the large-scale distributed parallel acceleration method for protein identification described above,
step 42 further comprises:
when the number of non-redundant peptide sequence groups exceeds the number of processor cores in the cluster, or exceeds the number of protein lookup processes, the groups are processed in multiple rounds: a protein lookup process that finishes its task takes a new one on a first-come, first-served basis until all non-redundant peptide sequences have been processed.
To achieve the above object, the present invention also provides a large-scale distributed parallel acceleration system for protein identification, characterized by comprising:
a peptide sequence index module, configured to perform theoretical digestion on the input protein sequences to obtain peptide sequences, sort the peptide sequences by theoretical precursor ion mass and remove redundancy to create peptide index file blocks, and generate a peptide index metadata file from the peptide index file blocks;
a spectrum data processing module, configured to sort the input mass spectra by experimental precursor ion mass using a parallel processing method, divide the sorted mass spectra evenly to obtain a plurality of spectrum data blocks, and generate a spectrum metadata file from the spectrum data blocks;
a peptide-spectrum matching module, connected to the peptide sequence index module and the spectrum data processing module, configured to distribute the spectrum data blocks evenly among the master processes, each of which manages a plurality of slave processes; each master process sorts the spectrum data blocks assigned to it and dispatches them in turn to idle slave processes for peptide-spectrum matching; when there is more than one peptide index file block, the same spectrum data block is dispatched to a plurality of slave processes, each of which traverses a single peptide index file block to perform peptide-spectrum matching;
a result aggregation and output module, connected to the peptide-spectrum matching module, configured to gather the identification results using a parallel processing method, infer the corresponding protein sequences from the identified peptide sequences, and generate an output file.
In the large-scale distributed parallel acceleration system for protein identification described above,
the spectrum data processing module further comprises:
a spectrum partition module, configured to parse the mass spectra and divide them evenly into a plurality of raw data blocks, each of which is smaller than the local storage space of a cluster node;
a spectrum mapping module, connected to the spectrum partition module, in which each raw data block is processed by one spectrum mapping process; the spectrum mapping process reads the mass spectra in its raw data block one by one, places each mass spectrum in the queue for its mass range, and then stores each queue in a separate spectrum intermediate file;
a spectrum reduction module, connected to the spectrum mapping module, in which each mass range is processed by one spectrum reduction process and the spectrum reduction processes run in parallel independently of one another; a spectrum reduction process reads all spectrum intermediate files for its mass range, sorts the input mass spectra first by experimental precursor ion mass and, where the masses are equal, by the lexicographic order of the spectrum titles, and stores the sorted spectra in sequence in a plurality of spectrum data blocks, each containing the same number of mass spectra;
a spectrum metadata file generation module, connected to the spectrum reduction module, configured to collect the information on all spectrum data blocks and generate the spectrum metadata file from it.
In the large-scale distributed parallel acceleration system for protein identification described above,
the spectrum mapping module is further configured so that, when the number of raw data blocks exceeds the number of processor cores in the cluster, or exceeds the number of spectrum mapping processes, the raw data blocks are processed in multiple rounds: a spectrum mapping process that finishes its task takes a new one on a first-come, first-served basis until all raw data blocks have been processed.
In the large-scale distributed parallel acceleration system for protein identification described above,
the spectrum reduction module is further configured so that, when the number of mass ranges exceeds the number of processor cores in the cluster, or exceeds the number of spectrum reduction processes, the mass ranges are processed in multiple rounds: a spectrum reduction process that finishes its task takes a new one on a first-come, first-served basis until all spectrum intermediate files have been processed.
In the large-scale distributed parallel acceleration system for protein identification described above,
the peptide-spectrum matching module is further configured so that the master process reads the spectrum metadata file and the peptide index metadata file; based on the statistics obtained, it sorts the spectrum data blocks it is responsible for by mass range from high to low and dispatches them in turn to the slave processes; if there are multiple peptide index file blocks, the same spectrum data block is dispatched multiple times, once for each peptide index file block; the slave processes take tasks on a first-come, first-served basis; each time a slave process completes a task, it stores an identification result sub-block, communicates with the master process to return the file name of the sub-block, and requests the spectrum data block and peptide index file block information for its next task, until all spectrum data blocks have been identified.
In the large-scale distributed parallel acceleration system for protein identification described above,
the peptide-spectrum matching module is further configured so that the slave process reads the peptide index file block, computes the possible modified forms of each original peptide sequence, uses the precursor ion mass tolerance window of the spectra in the spectrum data block to be identified to find the modified peptide sequences within the set mass range, and passes the matching modified peptide sequences to the peptide-spectrum match scoring algorithm to identify the peptide sequence.
In the large-scale distributed parallel acceleration system for protein identification described above,
the result aggregation and output module further comprises:
an aggregation module, in which all identification result sub-blocks corresponding to each spectrum data block are processed by one result aggregation process and the result aggregation processes run in parallel independently of one another; a result aggregation process reads all identification result sub-blocks of the spectrum data block assigned to it, ranks the peptide sequences of all identification results for each mass spectrum by the score of the peptide-spectrum match scoring algorithm, keeps the top-ranked peptide sequences and their scores, and stores them in a block summary file;
a filtering, inference and output module, connected to the aggregation module, configured to read the block summary files; filter and de-duplicate the peptide sequences identified for each mass spectrum; divide the resulting non-redundant peptide sequences into multiple groups, each of which is handled by one protein lookup process that finds the corresponding protein identifiers and sequences, the protein lookup processes running in parallel independently of one another; apply a peptide-to-protein inference algorithm to the lookup results and generate the output file.
In the large-scale distributed parallel acceleration system for protein identification described above,
the aggregation module is further configured so that, when the number of spectrum data blocks exceeds the number of processor cores in the cluster, or exceeds the number of result aggregation processes, the identification result sub-blocks are processed in multiple rounds: a result aggregation process that finishes its task takes a new one on a first-come, first-served basis until all identification result sub-blocks have been processed.
In the large-scale distributed parallel acceleration system for protein identification described above,
the filtering, inference and output module is further configured so that, when the number of non-redundant peptide sequence groups exceeds the number of processor cores in the cluster, or exceeds the number of protein lookup processes, the groups are processed in multiple rounds: a protein lookup process that finishes its task takes a new one on a first-come, first-served basis until all non-redundant peptide sequences have been processed.
Compared with the prior art, the beneficial technical effects of the present invention are:
1. The present invention processes the protein sequence database in a distributed, parallel manner, so that massive protein sequence sets beyond the capacity of a single machine can be efficiently digested, de-duplicated, sorted, and partitioned into peptide index file blocks; a single peptide index file block can be loaded into memory and traversed efficiently.
2. The present invention organizes proteins and peptide sequences in an ordered, non-redundant, distributed peptide sequence index. Compared with searching protein sequences directly, this not only greatly reduces redundant computation but also merges the overlapping peptide-spectrum matching operations of spectra with identical or similar precursor ion masses, which greatly improves the efficiency of the identification workflow.
3. The present invention processes the mass spectra in a distributed, parallel manner, so that massive spectrum sets beyond the capacity of a single machine can be efficiently sorted and partitioned into spectrum data blocks. The resulting spectrum data blocks lend themselves to dynamically scheduled parallel processing.
4. The present invention uses multiple master processes to share the heavy communication load, which reduces the time slave processes spend blocked and waiting and greatly improves acceleration efficiency when the cluster reaches several hundred or more than a thousand processor cores.
5. The present invention gathers the identification results in parallel, looks up the proteins to which the peptide sequences belong, and performs peptide-to-protein inference in parallel, which greatly speeds up this stage.
Brief description of the drawings
Fig. 1 is a flowchart of the large-scale distributed parallel acceleration method for protein identification of the present invention;
Fig. 2 is a structural diagram of the large-scale distributed parallel acceleration system for protein identification of the present invention.
Detailed description of the embodiments
The present invention is described below with reference to the drawings and specific embodiments, which are not to be taken as limiting the invention.
Fig. 1 is a flowchart of the large-scale distributed parallel acceleration method for protein identification of the present invention. The flow accelerates protein identification in a large-scale, distributed, parallel manner through the following steps:
Step 101, first set the necessary search parameters;
Step 102, then input the protein sequences, use multiple processor processes in the cluster to perform theoretical digestion on the protein sequences, sort the resulting peptide sequences by theoretical precursor ion mass and remove redundancy, finally create the peptide index file blocks, and generate a peptide index metadata file from the peptide index file blocks;
Step 103, next parse the input mass spectra, use multiple processor processes in the cluster to sort the mass spectra by experimental precursor ion mass, store the sorted mass spectra in order in a plurality of spectrum data blocks, each spectrum data block holding the same number of mass spectra, and generate a mass spectrum metadata file from the spectrum data blocks;
Step 104, then start several master processes, each of which manages multiple slave processes, and distribute the spectrum data blocks evenly among the master processes. Each master process sorts the spectrum data blocks assigned to it by mass range from high to low and dynamically dispatches them to idle slave processes for peptide-spectrum matching. If there is more than one peptide index file block, the same spectrum data block is dispatched to several slave processes, each of which traverses a single peptide index file block to perform peptide-spectrum matching;
Step 105, use parallel processing to aggregate the identification results, look up the protein sequences corresponding to the identified peptide sequences, perform peptide-to-protein inference, and generate the output file.
A currently common but relatively inefficient way of performing step 102 is as follows. The protein sequences are read in one after another and theoretically digested one by one to obtain peptide sequences. The peptide sequences are stored in blocks as first-order temporary peptide sequence blocks. The first-order temporary blocks are then read in, and every K blocks are merged, deduplicated and sorted by theoretical precursor ion mass, and written out as second-order temporary peptide sequence blocks. The second-order temporary blocks are read in, and every K blocks are again merged, deduplicated and sorted by theoretical precursor ion mass, and written out as third-order temporary peptide sequence blocks, and so on, repeating until all data have been merged together. Finally, the temporary peptide sequence blocks of the last round are read in turn, the peptide index file blocks are created, the information of all peptide index file blocks is collected, and the peptide index metadata file is generated from this information.
A currently common but relatively inefficient way of performing step 103 is as follows. The mass spectra are parsed, read in one after another and stored in blocks as first-order temporary spectrum data blocks. The first-order temporary blocks are then read in turn, and every K blocks are merged, sorted by experimental precursor ion mass and written out as second-order temporary spectrum data blocks. The second-order temporary blocks are read in turn, further merged, sorted by experimental precursor ion mass and written out as third-order temporary spectrum data blocks, and so on, repeating until all data have been merged together. Finally, the temporary spectrum data blocks of the last round are read in turn and stored as a number of spectrum data blocks, each containing the same number of mass spectra, this number being specified by an input parameter. The information of all spectrum data blocks is then collected, and the mass spectrum metadata file is generated from this information.
A currently common but relatively inefficient way of performing step 104 is as follows. A single master process dispatches the spectrum data blocks to the slave processes in turn, and the slave processes take tasks first come first served. After receiving the number of the assigned peptide index file block, a slave process reads in all peptide index file blocks in turn, computes the possible modifications on the basis of the original peptide sequences, uses the precursor ion mass error window of the spectra in the spectrum data block to be identified to find the modified peptide sequences that fall within the set mass range, and feeds the qualifying modified peptide sequences to the peptide-spectrum match scoring algorithm to identify the peptide sequences. Whenever a slave process completes a task, it stores an identification result sub-block, communicates with the master process, sends back the file name of the identification result sub-block and requests the spectrum data block and peptide index file block information for its next task, until all spectrum data blocks have been identified.
Further, step 102 comprises:
Step 1021, read in the protein sequences and divide them evenly into a plurality of protein sequence sub-files; the number of protein sequence sub-files may be greater than the number of processor cores in the cluster, and the size of each protein sequence sub-file must be smaller than the local storage space of a cluster node;
Step 1022, start one peptide index mapping processor process (Peptide Map process for short) for each protein sequence sub-file; the Peptide Map processes run independently and in parallel. Each Peptide Map process performs theoretical digestion on every protein sequence in its sub-file in turn to obtain peptide sequences, places the peptide sequences into the corresponding queues according to mass range, removes redundant peptide sequences, and stores each queue in a separate peptide sequence intermediate file;
Step 1023, for the different mass ranges, each mass range is handled by one peptide index reduction processor process (Peptide Reduce process for short); the Peptide Reduce processes run independently and in parallel. Each Peptide Reduce process reads in the peptide sequences from all peptide sequence intermediate files in its mass range and sorts them. In the sort, peptide sequences are ordered first by theoretical precursor ion mass and, when the theoretical precursor ion masses are equal, by the lexicographic (dictionary) order of the peptide sequence string. After sorting, redundancy is removed and the peptide index file block is created;
Step 1024, which is optional, generates the peptide-to-protein inverted index. For the concrete implementation of the inverted index creation algorithm, see reference 10 (You Li, Hao Chi, Le-Heng Wang, Hai-Peng Wang, Yan Fu, Zuo-Fei Yuan, Su-Jun Li, Yan-Sheng Liu, Rui-Xiang Sun, Rong Zeng, Si-Min He. "Speeding up tandem mass spectrometry based database searching by peptide and spectrum indexing." Rapid Communications in Mass Spectrometry, 2010, 24:807-814) and patent application No. 200810223683.1, "An index acceleration method in large-scale protein identification and corresponding system";
Step 1025, collect the information of all peptide index file blocks and generate the peptide index metadata file from this information.
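Steps 1022 and 1023 can be illustrated with the following minimal Python sketch. It is an illustration under stated assumptions, not the implementation of the invention: the function names, the simplified trypsin-like cleavage rule (cleave after K or R, no missed cleavages) and the small residue mass table are all hypothetical choices made for the example.

```python
from collections import defaultdict

# Approximate monoisotopic residue masses in Da for a few amino acids
# (assumption: a real table would cover all residues and modifications).
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841,
                "L": 113.08406, "D": 115.02694, "K": 128.09496,
                "E": 129.04259, "R": 156.10111}
WATER = 18.01056


def peptide_mass(peptide):
    """Theoretical mass of a peptide: residue masses plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def digest(protein):
    """Simplified theoretical digestion: cleave after every K or R."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        if aa in "KR":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_map(proteins, width=100.0):
    """Map phase for one sub-file: digest, bucket by mass range, deduplicate."""
    queues = defaultdict(set)  # queue number -> set of peptides (set removes duplicates)
    for protein in proteins:
        for pep in digest(protein):
            queues[int(peptide_mass(pep) // width)].add(pep)
    return queues


def peptide_reduce(intermediate_sets):
    """Reduce phase for one mass range: merge all intermediate sets,
    deduplicate across sub-files, sort by (mass, sequence string)."""
    merged = set().union(*intermediate_sets)
    return sorted(merged, key=lambda p: (peptide_mass(p), p))
```

In this sketch each call to `peptide_map` stands for one Peptide Map process and each call to `peptide_reduce` for one Peptide Reduce process; writing the queues to intermediate files is omitted.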
In a preferred embodiment, in step 1022, when the number of protein sequence sub-files is greater than the number of processor cores in the cluster, or greater than the number of Peptide Map processes, the protein sequence sub-files are processed in multiple rounds: a Peptide Map process that finishes its task takes a new task, first come first served, until all protein sequence sub-files have been processed.
In a preferred embodiment, in step 1023, when the number of mass ranges is greater than the number of processor cores in the cluster, or greater than the number of Peptide Reduce processes, the mass ranges are processed in multiple rounds: a Peptide Reduce process that finishes its task takes a new task, first come first served, until all peptide sequence intermediate files have been processed.
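The multi-round, first-come-first-served handling described above can be sketched as follows. This is a hypothetical single-machine illustration in which a thread pool stands in for the cluster processor processes; the function name and the use of threads are assumptions of the sketch, not part of the disclosure.

```python
from concurrent.futures import ThreadPoolExecutor


def run_first_come_first_served(tasks, handler, n_workers):
    """Process more tasks than there are workers.

    Each worker picks up the next pending task as soon as it finishes
    its current one, so the tasks are consumed over several rounds
    until none remain. Results are returned in input order.
    """
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(handler, tasks))
```

For example, ten sub-files handled by three workers would be consumed in roughly four rounds, without any worker sitting idle while tasks remain.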
Further, step 103 comprises:
Step 1031, parse the mass spectra and divide them evenly into a plurality of raw data blocks; the number of raw data blocks may be greater than the number of processor cores in the cluster, and the size of each raw data block must be smaller than the local storage space of a cluster node;
Step 1032, each raw data block is handled by one Spectra Map process. The Spectra Map process reads in every mass spectrum in its raw data block in turn, places the mass spectra into the corresponding queues according to mass range, and stores each queue in a separate spectrum intermediate file;
Step 1033, for the different mass ranges, each mass range is handled by one Spectra Reduce process; the Spectra Reduce processes run independently and in parallel. Each Spectra Reduce process reads all spectrum intermediate files in its mass range and sorts the input mass spectra. In the sort, spectra are ordered first by experimental precursor ion mass and, when the experimental precursor ion masses are equal, by the lexicographic (dictionary) order of the spectrum title. After sorting, the spectra are stored in turn in a number of spectrum data blocks, each containing the same number of mass spectra, this number being specified by an input parameter;
Step 1034, collect the information of all spectrum data blocks and generate the mass spectrum metadata file from this information.
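Steps 1032 and 1033 can be sketched in Python as follows. This is a minimal illustration, assuming that a spectrum is represented only by a `(precursor_mass, title)` tuple; the function names and the in-memory representation are hypothetical, and the last block of a mass range may hold fewer spectra than the others.

```python
from collections import defaultdict


def spectra_map(spectra, width=100.0):
    """Map phase: place each (precursor_mass, title) spectrum into the
    queue of its mass range (queue number = mass // width)."""
    queues = defaultdict(list)
    for mass, title in spectra:
        queues[int(mass // width)].append((mass, title))
    return queues


def spectra_reduce(spectra, block_size):
    """Reduce phase for one mass range: sort by experimental precursor
    mass, break ties by title, then cut into blocks of block_size."""
    ordered = sorted(spectra, key=lambda s: (s[0], s[1]))
    return [ordered[i:i + block_size]
            for i in range(0, len(ordered), block_size)]
```

With a block size of 200, a mass range holding 7,000 spectra yields 35 spectrum data blocks.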
In a preferred embodiment, in step 1032, when the number of raw data blocks is greater than the number of processor cores in the cluster, or greater than the number of Spectra Map processes, the raw data blocks are processed in multiple rounds: a Spectra Map process that finishes its task takes a new task, first come first served, until all raw data blocks have been processed.
In a preferred embodiment, in step 1033, when the number of mass ranges is greater than the number of processor cores in the cluster, or greater than the number of Spectra Reduce processes, the mass ranges are processed in multiple rounds: a Spectra Reduce process that finishes its task takes a new task, first come first served, until all spectrum intermediate files have been processed.
In a preferred embodiment, in step 104, the dynamic dispatch operation comprises the following. The master process reads in the mass spectrum metadata file and the peptide index metadata file and, using the statistics obtained, sorts the spectrum data blocks assigned to it by mass range from high to low and dispatches them in turn to the slave processes responsible for identification. If there are several peptide index file blocks, the same spectrum data block is dispatched several times, once for each peptide index file block. The slave processes take tasks first come first served. Whenever a slave process completes a task, it stores an identification result sub-block, communicates with the master process, sends back the file name of the identification result sub-block and requests the spectrum data block and peptide index file block information for its next task, until all spectrum data blocks have been identified.
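The task list kept by one master process can be sketched as follows. This is a hypothetical illustration: the tuple layout `(block_id, min_mass, max_mass)` and the function names are assumptions, and the message passing between master and slave processes is reduced to a function call.

```python
from collections import deque


def build_task_queue(spectrum_blocks, n_index_blocks):
    """Build the pending tasks of one master process.

    spectrum_blocks: list of (block_id, min_mass, max_mass) tuples.
    Blocks are ordered by mass range from high to low, and every block
    is paired once with each peptide index file block.
    """
    ordered = sorted(spectrum_blocks, key=lambda b: b[2], reverse=True)
    return deque((block_id, index_id)
                 for block_id, _, _ in ordered
                 for index_id in range(n_index_blocks))


def next_task(queue):
    """Called when a slave reports a finished task and asks for the next
    one; returns None once all tasks have been handed out."""
    return queue.popleft() if queue else None
```

Whichever slave process asks first receives the task at the head of the queue, which gives the first-come-first-served behaviour described above.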
In a preferred embodiment, in step 104, the peptide-spectrum match identification operation comprises the following. The slave process reads in a peptide index file block, computes the possible modifications on the basis of the original peptide sequences, uses the precursor ion mass error window of the spectra in the spectrum data block to be identified to find the modified peptide sequences that fall within the set mass range, and feeds the qualifying modified peptide sequences to the peptide-spectrum match scoring algorithm to identify the peptide sequences. For the concrete implementation of the peptide-spectrum match scoring algorithm, see reference 11 (Y. Fu, Q. Yang, R. Sun, D. Li, R. Zeng, C. X. Ling, and W. Gao, "Exploiting the kernel trick to correlate fragment ions for peptide identification via tandem mass spectrometry," Bioinformatics, 2004, 20:1948-1954) and patent ZL200410088779.3, "A method for identifying peptides using tandem mass spectrometry data".
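Because a peptide index file block is sorted by theoretical precursor ion mass, the candidates inside a mass error window can be located by binary search. The sketch below is a hypothetical illustration of that lookup only; it ignores modifications and the scoring algorithm, and the function name and the symmetric tolerance in Da are assumptions.

```python
from bisect import bisect_left, bisect_right


def candidates_in_window(index_masses, precursor_mass, tolerance):
    """Return the slice bounds [lo, hi) of the entries of an ascending
    mass list that lie within precursor_mass +/- tolerance."""
    lo = bisect_left(index_masses, precursor_mass - tolerance)
    hi = bisect_right(index_masses, precursor_mass + tolerance)
    return lo, hi
```

Spectra with identical or close precursor masses map to overlapping slices, so a slave process that walks its spectrum data block in mass order can reuse most of the candidates from one spectrum to the next.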
Further, step 105 comprises:
Step 1051, aggregate the identification results. All identification result sub-blocks corresponding to each spectrum data block are handled by one spectrum identification result aggregation processor process (Results Gather process for short); the Results Gather processes run independently and in parallel. Each Results Gather process reads in all identification result sub-blocks of the spectrum data block assigned to it, sorts the peptide sequences of all identification results of each mass spectrum by the score given by the peptide-spectrum match scoring algorithm, keeps the top-ranked peptide sequence information and scores, and stores them in a per-block summary file.
Step 1052, read in all per-block summary files, filter the peptide sequences of the identification results of each mass spectrum and remove redundancy, and divide the resulting non-redundant peptide sequences evenly into several groups. Each group of non-redundant peptide sequences is handled by one protein query processor process (Protein Select process for short), which looks up the corresponding protein numbers and sequences; the Protein Select processes run independently and in parallel. After the lookup, the peptide-to-protein inference algorithm is run, and the output file is finally generated. For the concrete implementation of the peptide-to-protein inference algorithm, see reference 12 (A. I. Nesvizhskii and R. Aebersold, "Interpretation of shotgun proteomic data: the protein inference problem," Mol Cell Proteomics, 2005, 4:1419-1440).
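The aggregation in step 1051 amounts to a per-spectrum top-k selection over several result sub-blocks. The following Python sketch illustrates this under stated assumptions: each identification result is modelled as a `(spectrum_id, peptide, score)` tuple and higher scores are taken to be better; the function name and the `keep` parameter are hypothetical.

```python
import heapq
from collections import defaultdict


def gather_results(sub_blocks, keep=3):
    """Merge the identification result sub-blocks of one spectrum data
    block and keep the `keep` best-scoring peptides of each spectrum.

    sub_blocks: iterable of lists of (spectrum_id, peptide, score).
    Returns {spectrum_id: [(score, peptide), ...]} in descending score order.
    """
    hits = defaultdict(list)
    for block in sub_blocks:
        for spectrum_id, peptide, score in block:
            hits[spectrum_id].append((score, peptide))
    return {sid: heapq.nlargest(keep, scored) for sid, scored in hits.items()}
```

Each call corresponds to the work of one Results Gather process on one spectrum data block; writing the per-block summary file is omitted.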
In a preferred embodiment, in step 1051, when the number of spectrum data blocks is greater than the number of processor cores in the cluster, or greater than the number of Results Gather processes, the identification result sub-blocks are processed in multiple rounds: a Results Gather process that finishes its task takes a new task, first come first served, until all identification result sub-blocks have been processed.
In a preferred embodiment, in step 1052, when the number of non-redundant peptide sequence groups is greater than the number of processor cores in the cluster, or greater than the number of Protein Select processes, the groups are processed in multiple rounds: a Protein Select process that finishes its task takes a new task, first come first served, until all non-redundant peptide sequences have been processed.
For ease of understanding, the method is explained below with a concrete example:
First, the protein sequences are processed and the peptide index file blocks are created. Suppose the protein sequence database contains 3,000,000 protein sequences and the cluster has 1000 processor cores. In the first step, all protein sequences are divided into 1000 protein sequence sub-files of 3,000 protein sequences each. In the second step, 1000 Peptide Map processes are started in parallel. Each Peptide Map process reads in one protein sequence sub-file, performs theoretical digestion on each of its 3,000 protein sequences in turn to obtain peptide sequences, and places the peptide sequences into the corresponding queues according to mass range. For example, if the queues for the different mass ranges are 100 Da wide, the peptide sequence EVDG with a mass of 400.15 is placed in the 400-500 Da queue. After redundant peptide sequences are removed, each queue is stored in a separate peptide sequence intermediate file. In the third step, Peptide Reduce processes are started in parallel for the different mass ranges; each mass range is handled by one Peptide Reduce process, and the Peptide Reduce processes run independently and in parallel. The number of Peptide Reduce processes to start is determined by the preset upper and lower mass limits of the peptide sequences and by the mass range width. In this example the mass limits are 400-10000 Da and the mass range width is 100 Da, so 96 Peptide Reduce processes are needed ((10000-400)/100 = 96). Each Peptide Reduce process reads in the peptide sequences from all peptide sequence intermediate files within its mass range (for example 400-500 Da), which in this example means 1000 peptide sequence intermediate files per process; it sorts the peptide sequences by theoretical precursor ion mass, removes redundancy and creates a peptide index file block. In this example, 96 peptide index file blocks are finally generated. Each peptide index file block contains the mass, the peptide sequence and the missed cleavage sites. An optional operation is to generate the peptide-to-protein inverted index at the same time. Each row of the inverted index has the following format: first the number of the peptide sequence (size_t), then the number of the protein sequence to which the peptide sequence belongs (size_t); if the same peptide sequence belongs to several protein sequences, the further protein numbers follow in order. In the fourth step, after the above work is finished, the Peptide Meta process collects the information of all peptide index file blocks and generates the peptide index metadata file from this information. The information mainly comprises the number of index file blocks, the size of each file block, the corresponding mass range, the number of peptide sequence entries stored, the amino acid mass table used to compute peptide sequence masses, the creation time, and so on.
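The arithmetic of this example (96 mass ranges, and the 400-500 Da queue for a mass of 400.15) can be checked with a short Python sketch. The function names are hypothetical, and masses are assumed to lie within the configured limits.

```python
import math


def mass_range_count(lower, upper, width):
    """Number of mass ranges, hence Reduce processes, between the limits."""
    return math.ceil((upper - lower) / width)


def mass_range_of(mass, lower, width):
    """Bounds (start, end) of the mass range that contains `mass`."""
    start = lower + int((mass - lower) // width) * width
    return start, start + width
```

With limits of 400-10000 Da and a width of 100 Da, `mass_range_count` gives 96, and a peptide with a mass of 400.15 falls into the range returned as `(400, 500)`.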
Next, the mass spectra are processed and the spectrum data blocks are created. Suppose there are 5,000,000 mass spectra and the cluster has 1000 processor cores. In the first step, the mass spectra are parsed and divided evenly into 1,000 raw data blocks of 5000 mass spectra each. In the second step, 1000 Spectra Map processes are started, and each raw data block is handled by one Spectra Map process. The Spectra Map process reads in every mass spectrum in its raw data block in turn and places the mass spectra into the corresponding queues according to mass range. For example, with a window of 100 Da, a spectrum with a mass of 400.15 is placed in the 400-500 Da queue. Each queue is then stored in a separate spectrum intermediate file. In the third step, Spectra Reduce processes are started in parallel for the different mass ranges; each mass range is handled by one Spectra Reduce process, and the Spectra Reduce processes run independently and in parallel. The number of Spectra Reduce processes to start is determined by the preset upper and lower mass limits of the peptide sequences and by the mass range width. In this example the mass limits are 400-10000 Da and the mass range width is 100 Da, so 96 Spectra Reduce processes are needed ((10000-400)/100 = 96). Each Spectra Reduce process reads all spectrum intermediate files within its mass range (for example 400-500 Da), sorts the input mass spectra by experimental precursor ion mass, and stores them in turn in a number of spectrum data blocks, each containing the same number of mass spectra, this number being specified by an input parameter. In this example, 7,000 spectra fall within the 400-500 Da range; after sorting by experimental precursor ion mass, every 200 spectra are stored in one block, giving 35 blocks in total, the value 200 being set by the input parameter. In the fourth step, after the above work is finished, the Spectra Meta process collects the information of all spectrum data blocks and generates the mass spectrum metadata file from this information. The information mainly comprises the number of spectrum data blocks, the number of spectra in each data block, the creation time, and so on.
Then identification begins. Several master processes are started, each of which manages a number of slave processes. In this example there are 1000 processes in total; the ten processes numbered 0, 100, 200, ..., 900 are designated master processes and all the others are slave processes. Each master process manages the 99 slave processes numbered immediately after it; for example, slave process No. 123 is managed by process No. 100. The spectrum data blocks produced in the previous step are distributed evenly among the master processes. Each master process reads in the mass spectrum metadata file and the peptide index metadata file and, using the statistics obtained, sorts the spectrum data blocks assigned to it by mass range from high to low and dynamically dispatches them in turn to idle slave processes for peptide-spectrum matching. If more than one peptide index file block was generated earlier, the same spectrum data block is dispatched to several slave processes, each of which traverses a single peptide index file block to perform peptide-spectrum matching. In this example, suppose the preceding steps produced 50,000 spectrum data blocks and 96 peptide index file blocks. Each master process is then assigned 5,000 data blocks (assigned at regular intervals for load balancing; for example, node 0 is assigned data blocks 0, 10, 20, ..., 49,990), and each master node dispatches 5000×96 tasks to its slave nodes. In total, 10×5000×96 identification result sub-blocks are produced.
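The numbering scheme of this example can be written down in a few lines of Python. This is an illustrative sketch; the function names and the default group size of 100 are assumptions taken from the example figures rather than from any stated implementation.

```python
def master_of(rank, group_size=100):
    """Rank of the master process that manages process `rank`
    (masters are the ranks 0, 100, 200, ... in the example)."""
    return rank - rank % group_size


def blocks_for_master(master_index, n_blocks, n_masters):
    """Interleaved assignment of spectrum data blocks: master i gets
    blocks i, i + n_masters, i + 2 * n_masters, ..."""
    return range(master_index, n_blocks, n_masters)
```

With 50,000 spectrum data blocks and ten masters, master 0 receives blocks 0, 10, 20, ..., 49,990, that is 5,000 blocks; paired with 96 peptide index file blocks this gives 5000×96 = 480,000 tasks per master and 4,800,000 identification result sub-blocks overall.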
Finally, the identification results are aggregated. Suppose there are 50,000 spectrum data blocks and 10×5000×96 identification result sub-blocks, and the cluster has 1,000 processor cores. In the first step, 1,000 Results Gather processes are started and handle the identification result sub-blocks of the 50,000 spectrum data blocks over multiple rounds. Each time, a Results Gather process reads in the 96 identification result sub-blocks corresponding to its assigned spectrum data block, merges and sorts them, keeps the top-ranked candidate peptide results, and stores them in a per-block summary file. In the second step, all 50,000 per-block summary files are read in, and the peptide sequences in the identification results of each mass spectrum are filtered and deduplicated. Suppose this yields 70,000 non-redundant peptide sequences. They are divided evenly into 700 groups, and 700 Protein Select processes are started, one for each group of non-redundant peptide sequences. Each process looks up the protein sequences corresponding to its non-redundant peptide sequences (if the inverted index was generated in the optional step above, the result is obtained directly by table lookup; otherwise the original protein sequence database is searched directly). After the lookup, the peptide-to-protein inference algorithm is run, yielding information on 180 identified proteins, and the output file is finally generated. The output file contains, for the identification result of each spectrum, the peptide sequence, the modification information, the precursor ion mass and the score, together with the name, number and protein sequence of each identified protein, and so on.
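The grouping and lookup of the second step can be sketched as follows. This is a hypothetical illustration: the inverted index is modelled as a dictionary from peptide to protein identifiers, the fallback is a plain substring scan that ignores enzyme specificity, and all names are assumptions of the sketch.

```python
def split_into_groups(items, n_groups):
    """Divide items into at most n_groups groups of nearly equal size."""
    size = max(1, -(-len(items) // n_groups))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def proteins_for_peptide(peptide, inverted_index, protein_db):
    """Look up the proteins that contain a peptide.

    inverted_index: dict peptide -> list of protein ids, or None if the
                    optional inverted index was not generated.
    protein_db:     dict protein id -> protein sequence.
    """
    if inverted_index is not None:
        # Inverted index available: direct table lookup.
        return inverted_index.get(peptide, [])
    # No index: scan the original protein sequence database.
    return [pid for pid, seq in protein_db.items() if peptide in seq]
```

Splitting 70,000 non-redundant peptide sequences into 700 groups gives 100 peptides per group, one group per Protein Select process.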
Fig. 2 is a structural diagram of the large-scale distributed parallel acceleration system for protein identification of the present invention. The system 200 comprises:
Peptide sequence index module 21, configured to perform theoretical digestion on the input protein sequences by parallel processing to obtain peptide sequences, sort the peptide sequences by theoretical precursor ion mass and remove redundancy so as to create the peptide index file blocks, and generate the peptide index metadata file from the peptide index file blocks;
Spectrum data processing module 22, configured to sort the input mass spectra by experimental precursor ion mass by parallel processing, divide the sorted mass spectra evenly to obtain a plurality of spectrum data blocks, and generate the mass spectrum metadata file from the spectrum data blocks;
Peptide-spectrum match identification module 23, connected to the peptide sequence index module 21 and the spectrum data processing module 22, configured to distribute the spectrum data blocks evenly among the master processes, each master process managing multiple slave processes; each master process sorts the spectrum data blocks assigned to it and dispatches them in turn to idle slave processes for peptide-spectrum matching, and, when there is more than one peptide index file block, dispatches the same spectrum data block to several slave processes, each of which traverses a single peptide index file block to perform peptide-spectrum matching;
Result aggregation and output module 24, connected to the peptide-spectrum match identification module 23, configured to aggregate the identification results by parallel processing, look up the protein sequences corresponding to the identified peptide sequences, perform peptide-to-protein inference, and generate the output file.
In a conventional embodiment, the peptide sequence index module 21 reads in the protein sequences one after another and theoretically digests them one by one to obtain peptide sequences. The peptide sequences are stored in blocks as first-order temporary peptide sequence blocks. The first-order temporary blocks are then read in, and every K blocks are merged, deduplicated, sorted by theoretical precursor ion mass and written out as second-order temporary peptide sequence blocks. The second-order temporary blocks are read in, and every K blocks are again merged, deduplicated, sorted by theoretical precursor ion mass and written out as third-order temporary peptide sequence blocks, and so on, repeating until all data have been merged together. Finally, the temporary peptide sequence blocks of the last round are read in turn, the peptide index file blocks are created, the information of all peptide index file blocks is collected, and the peptide index metadata file is generated from this information.
In a conventional embodiment, the spectrum data processing module 22 parses the mass spectra, reads them in one after another and stores them in blocks as first-order temporary spectrum data blocks. The first-order temporary blocks are then read in turn, and every K blocks are merged, sorted by experimental precursor ion mass and written out as second-order temporary spectrum data blocks. The second-order temporary blocks are read in turn, further merged, sorted by experimental precursor ion mass and written out as third-order temporary spectrum data blocks, and so on, repeating until all data have been merged together. Finally, the temporary spectrum data blocks of the last round are read in turn and stored as a number of spectrum data blocks, each containing the same number of mass spectra, this number being specified by an input parameter. The information of all spectrum data blocks is then collected, and the mass spectrum metadata file is generated from this information.
In a conventional embodiment of the peptide-spectrum match identification module 23, a single master process dispatches the spectrum data blocks to the slave processes in turn, and the slave processes take tasks first come first served. After receiving the number of the assigned peptide index file block, a slave process reads in all peptide index file blocks in turn, computes the possible modifications on the basis of the original peptide sequences, uses the precursor ion mass error window of the spectra in the spectrum data block to be identified to find the modified peptide sequences that fall within the set mass range, and feeds the qualifying modified peptide sequences to the peptide-spectrum match scoring algorithm to identify the peptide sequences. Whenever a slave process completes a task, it stores an identification result sub-block, communicates with the master process, sends back the file name of the identification result sub-block and requests the spectrum data block and peptide index file block information for its next task, until all spectrum data blocks have been identified.
Further, the peptide sequence index module 21 comprises:
A protein sequence partition module 211, used to divide the protein sequences evenly into a plurality of protein sequence sub-files. The number of protein sequence sub-files may be greater than the number of processor cores in the cluster, and the size of each protein sequence sub-file must be smaller than the local storage space of a cluster node;
A peptide map module 212, connected to the protein sequence partition module 211, used to start one peptide index map processor process (Peptide Map process for short) for each protein sequence sub-file. The Peptide Map processes run independently and in parallel. Each Peptide Map process performs theoretical enzymatic digestion on each protein sequence in its sub-file in turn to obtain peptide sequences, assigns the peptide sequences to the corresponding queues according to mass range, removes redundant peptide sequences, and then stores each queue in a separate peptide sequence intermediate file;
A peptide reduce module 213, connected to the peptide map module 212, used to process the different mass ranges, each mass range being handled by one peptide index reduce processor process (Peptide Reduce process for short). The Peptide Reduce processes run independently and in parallel. Each Peptide Reduce process reads the peptide sequences of all peptide sequence intermediate files that fall within its set mass range and sorts them by theoretical precursor ion mass. In the sort, peptide sequences are ordered first by theoretical precursor ion mass; peptide sequences of equal mass are ordered by the dictionary (lexicographic) order of their sequence strings. After sorting, redundant sequences are removed and the peptide index file blocks are created;
A peptide index metadata file generation module 214, connected to the peptide reduce module 213, used to collect the information of all peptide index file blocks and to generate the peptide index metadata file from this information.
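As a non-limiting illustration, the map and reduce stages of the peptide sequence index module may be sketched as follows. The sketch is not taken from the patent: the trypsin-like cleavage rule, the fixed-width mass bins (`bin_width`), and all function names are assumptions made for the example, and the residue masses are the standard monoisotopic values.

```python
from collections import defaultdict

# Monoisotopic residue masses in Da (standard values, not taken from the patent).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water molecule is added per peptide


def digest(protein):
    """Assumed trypsin-like rule: cleave after K or R unless the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        following = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and following != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide):
    return sum(RESIDUE_MASS[r] for r in peptide) + WATER


def peptide_map(proteins, bin_width):
    """One Peptide Map task: digest, then bin peptides by mass range.

    Each queue is a set, so duplicates inside one task are dropped.
    """
    queues = defaultdict(set)
    for protein in proteins:
        for peptide in digest(protein):
            queues[int(peptide_mass(peptide) // bin_width)].add(peptide)
    return queues


def peptide_reduce(intermediate_sets):
    """One Peptide Reduce task: merge one mass range, deduplicate, sort by (mass, sequence)."""
    merged = set().union(*intermediate_sets)
    return sorted(merged, key=lambda p: (round(peptide_mass(p), 5), p))
```

In this sketch each call to `peptide_map` stands for one Peptide Map process and each call to `peptide_reduce` for one Peptide Reduce process; in a real deployment the queues would be written to intermediate files rather than returned in memory.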
In a preferred embodiment, the peptide map module 212 is further used, when the number of protein sequence sub-files is greater than the number of processor cores in the cluster, or greater than the number of Peptide Map processes, to process the protein sequence sub-files in multiple rounds: a Peptide Map process that has finished its task goes on to fetch a new task, first come, first served, until all protein sequence sub-files have been processed.
In a preferred embodiment, the peptide reduce module 213 is further used, when the number of mass ranges is greater than the number of processor cores in the cluster, or greater than the number of Peptide Reduce processes, to process the mass ranges in multiple rounds: a Peptide Reduce process that has finished its task goes on to fetch a new task, first come, first served, until all peptide sequence intermediate files have been processed.
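The multi-round, first-come, first-served processing described above can be illustrated with a shared task queue from which a fixed number of workers pull tasks until none remain. The following is only a sketch under stated assumptions: threads stand in for the cluster processes, and `run_first_come_first_served` and `handle` are hypothetical names, not part of the patent.

```python
import queue
import threading


def run_first_come_first_served(tasks, num_workers, handle):
    """Process more tasks than workers: each idle worker fetches the next pending task."""
    pending = queue.Queue()
    for task in tasks:
        pending.put(task)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                task = pending.get_nowait()  # whoever is free first takes the task
            except queue.Empty:
                return  # nothing left to fetch: this worker stops
            outcome = handle(task)
            with lock:
                results[task] = outcome

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

For example, `run_first_come_first_served(range(10), 3, lambda t: t * t)` handles ten tasks with three workers; a worker that finishes early simply takes more tasks, so no worker waits for a round boundary.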
Further, the spectral data processing module 22 comprises:
A spectrum partition module 221, used to parse the input mass spectra and divide them evenly into a plurality of raw data blocks. The number of raw data blocks may be greater than the number of processor cores in the cluster, and the size of each raw data block must be smaller than the local storage space of a cluster node;
A spectrum map module 222, connected to the spectrum partition module 221, used to have each raw data block handled by one spectrum map processor process (Spectra Map process for short). Each Spectra Map process reads in every mass spectrum of its raw data block in turn, assigns the mass spectra to the corresponding queues according to mass range, and then stores each queue in a separate spectrum intermediate file;
A spectrum reduce module 223, connected to the spectrum map module 222, used to have each of the different mass ranges handled by one spectrum reduce processor process (Spectra Reduce process for short). The Spectra Reduce processes run independently and in parallel. Each Spectra Reduce process reads all spectrum intermediate files that fall within its set mass range and sorts the input mass spectra by precursor ion mass. In the sort, mass spectra are ordered first by experimental precursor ion mass; mass spectra of equal experimental precursor ion mass are ordered by the dictionary (lexicographic) order of their spectrum titles. After sorting, the mass spectra are stored in turn in a number of spectral data blocks; every block contains the same number of mass spectra, and this number is specified by an input parameter;
A spectral data metadata file generation module 224, connected to the spectrum reduce module 223, used to collect the information of all spectral data blocks and to generate the mass spectrum metadata file from this information.
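As a non-limiting illustration, the sort, block division, and metadata collection performed for one mass range may be sketched as follows. The representation of a mass spectrum as a `(precursor_mass, title)` tuple, the metadata fields, and the function names are assumptions made for the example; the patent does not specify them. When the number of mass spectra is not a multiple of `block_size`, the last block in this sketch is shorter.

```python
def reduce_spectra(spectra, block_size):
    """Sort by experimental precursor mass, then by title, and cut into fixed-size blocks."""
    ordered = sorted(spectra, key=lambda s: (s[0], s[1]))
    return [ordered[i:i + block_size] for i in range(0, len(ordered), block_size)]


def block_metadata(blocks):
    """Collect per-block statistics of the kind a metadata file could record."""
    return [
        {"block": i, "count": len(b), "min_mass": b[0][0], "max_mass": b[-1][0]}
        for i, b in enumerate(blocks)
    ]
```

Python compares strings by code point, which matches dictionary order only for titles that use a consistent letter case; a production implementation would need to fix the collation explicitly.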
In a preferred embodiment, the spectrum map module 222 is further used, when the number of raw data blocks is greater than the number of processor cores in the cluster, or greater than the number of Spectra Map processes, to process the raw data blocks in multiple rounds: a Spectra Map process that has finished its task goes on to fetch a new task, first come, first served, until all raw data blocks have been processed.
In a preferred embodiment, the spectrum reduce module 223 is further used, when the number of mass ranges is greater than the number of processor cores in the cluster, or greater than the number of Spectra Reduce processes, to process the mass ranges in multiple rounds: a Spectra Reduce process that has finished its task goes on to fetch a new task, first come, first served, until all spectral data blocks have been processed.
In a preferred embodiment, the assignment operation performed by the peptide-spectrum match identification module 23 comprises the following. The master process reads in the mass spectrum metadata file and the peptide index metadata file. According to the statistics obtained, it sorts the spectral data blocks it is responsible for identifying by mass range from high to low and assigns them in turn to the slave processes. If there are multiple peptide index file blocks, the same spectral data block is assigned multiple times, each time paired with one peptide index file block. The slave processes fetch tasks on a first-come, first-served basis. Whenever a task is completed, the slave process stores an identification result sub-block, communicates with the master process, returns the file name of the identification result sub-block, and requests the spectral data block and peptide index file block information for the next task, until all spectral data blocks have been identified.
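The assignment operation above can be sketched, without limitation, as a task list built by the master process plus a request call that slave processes use both to report a finished result file and to fetch the next task. The class name `Master`, the `name` and `max_mass` fields, and the in-memory call in place of inter-process communication are all assumptions made for the example.

```python
from collections import deque


def build_task_list(spectral_blocks, peptide_blocks):
    """Order spectral blocks from high to low mass; pair each with every peptide index block."""
    ordered = sorted(spectral_blocks, key=lambda b: b["max_mass"], reverse=True)
    return [(s["name"], p) for s in ordered for p in peptide_blocks]


class Master:
    def __init__(self, spectral_blocks, peptide_blocks):
        self.pending = deque(build_task_list(spectral_blocks, peptide_blocks))
        self.result_files = []

    def request(self, finished_result_file=None):
        """Called by a slave: record its last result file, hand out the next task or None."""
        if finished_result_file is not None:
            self.result_files.append(finished_result_file)
        return self.pending.popleft() if self.pending else None
```

A slave would loop on `request` until it receives `None`. With two spectral data blocks and two peptide index file blocks, four tasks are issued, and the heavier spectral data block is issued first.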
In a preferred embodiment, the peptide-spectrum match identification operation performed by the peptide-spectrum match identification module 23 comprises the following. The slave process reads in the peptide index file block; computes, on the basis of the original peptide sequences, the modified forms that may occur; uses the precursor ion mass error window of the spectral data block to be identified to search for modified peptide sequences that fall within the set mass range; and passes the qualifying modified peptide sequences to the peptide-spectrum match scoring algorithm to identify the peptide sequences.
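As a non-limiting illustration, the precursor mass window search over a sorted peptide index may be sketched with a binary search. The patent does not state which modifications are considered; the sketch assumes a single variable modification, methionine oxidation, applied to at most `max_mods` residues, and the names `candidates`, `tol`, and `max_mods` are hypothetical.

```python
from bisect import bisect_left, bisect_right

OXIDATION = 15.99491  # monoisotopic mass shift of methionine oxidation, in Da


def candidates(index, precursor_mass, tol, max_mods=2):
    """Return (sequence, number_of_oxidations) pairs whose modified mass fits the window.

    `index` is a list of (unmodified_mass, sequence) tuples sorted by mass.
    """
    masses = [mass for mass, _ in index]
    hits = []
    for k in range(max_mods + 1):
        # A peptide carrying k oxidations matches if its unmodified mass is near this target.
        target = precursor_mass - k * OXIDATION
        low = bisect_left(masses, target - tol)
        high = bisect_right(masses, target + tol)
        for _, sequence in index[low:high]:
            if sequence.count("M") >= k:  # needs at least k modifiable residues
                hits.append((sequence, k))
    return hits
```

Because the index is sorted by mass, each window lookup costs a binary search rather than a scan of the whole block. The returned candidates would then be passed to the scoring algorithm, which is outside the scope of this sketch.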
Further, the result summary and output module 24 comprises:
A summary module 241, used to have all identification result sub-blocks corresponding to each spectral data block handled by one spectrum identification result gather processor process (Results Gather process for short). The Results Gather processes run independently and in parallel. Each Results Gather process reads in all identification result sub-blocks of the spectral data block assigned to it, sorts the peptide sequences of all identification results of each mass spectrum by the score given by the peptide-spectrum match scoring algorithm, keeps the top-ranked peptide sequences and their scores, and stores them in a block summary file;
A filtering and inference output module 242, connected to the summary module 241, used to read in all block summary files, filter the peptide sequences of the identification results of each mass spectrum, and remove redundancy. The resulting non-redundant peptide sequences are divided evenly into a number of groups, and for each group one protein query processor process (Protein Select process for short) looks up the corresponding protein numbers and sequences. The Protein Select processes run independently and in parallel. After the lookup, the peptide-to-protein inference algorithm is run, and finally the output file is generated.
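The gathering and grouping steps above can be illustrated as follows, as a sketch only. It assumes that each identification result is a `(spectrum_title, peptide, score)` tuple and that a higher score is better; neither assumption comes from the patent, and the function names are hypothetical. Filtering and protein inference are omitted.

```python
import heapq
from collections import defaultdict


def gather(sub_blocks, top_n):
    """One Results Gather task: merge all sub-blocks, keep the top_n hits per spectrum."""
    per_spectrum = defaultdict(list)
    for block in sub_blocks:
        for title, peptide, score in block:
            per_spectrum[title].append((score, peptide))
    return {title: heapq.nlargest(top_n, hits) for title, hits in per_spectrum.items()}


def split_nonredundant(summary, num_groups):
    """Deduplicate the retained peptides and deal them into num_groups near-equal groups."""
    peptides = sorted({peptide for hits in summary.values() for _, peptide in hits})
    return [peptides[i::num_groups] for i in range(num_groups)]
```

Each group returned by `split_nonredundant` would be handed to one Protein Select process; dealing the peptides out by stride keeps the group sizes within one element of each other.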
In a preferred embodiment, the summary module 241 is further used, when the number of spectral data blocks is greater than the number of processor cores in the cluster, or greater than the number of Results Gather processes, to process the identification result sub-blocks in multiple rounds: a Results Gather process that has finished its task goes on to fetch a new task, first come, first served, until all identification result sub-blocks have been processed.
In a preferred embodiment, the filtering and inference output module 242 is further used, when the number of non-redundant peptide sequence groups is greater than the number of processor cores in the cluster, or greater than the number of Protein Select processes, to process the non-redundant peptide sequence groups in multiple rounds: a Protein Select process that has finished its task goes on to fetch a new task, first come, first served, until all non-redundant peptide sequences have been processed.
The present invention proposes a large-scale distributed parallel acceleration method and system for protein identification. It solves the problem that the prior art achieves poor acceleration efficiency when parallelized at a scale of hundreds or even more than a thousand processor cores; in particular, satisfactory acceleration efficiency can still be obtained when the number of processor cores reaches several hundred or even exceeds one thousand.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the embodiments, those of ordinary skill in the art should understand that modifications or equivalent substitutions of the technical solution of the present invention that do not depart from its spirit and scope shall all be covered by the scope of the claims of the present invention.

Claims (18)

1. A large-scale distributed parallel acceleration method for protein identification, characterized by comprising:
Step 1: inputting protein sequences, performing theoretical enzymatic digestion on said protein sequences to obtain peptide sequences, sorting said peptide sequences by theoretical precursor ion mass and removing redundancy to create peptide index file blocks, and generating a peptide index metadata file from said peptide index file blocks;
Step 2: inputting mass spectra, sorting said mass spectra by experimental precursor ion mass using a parallel processing method, dividing the sorted mass spectra evenly to obtain a plurality of spectral data blocks, and generating a mass spectrum metadata file from said spectral data blocks;
Step 3: distributing said spectral data blocks evenly among a plurality of master processes, each master process managing a plurality of slave processes; each master process sorting the spectral data blocks allocated to it and assigning them in turn to idle slave processes for peptide-spectrum match identification; and, when there is more than one said peptide index file block, assigning the same said spectral data block to a plurality of slave processes, each of which traverses a single said peptide index file block to perform peptide-spectrum match identification;
Step 4: summarizing the identification results using a parallel processing method, inferring the corresponding protein sequences from the identified peptide sequences, and generating an output file.
2. The large-scale distributed parallel acceleration method for protein identification according to claim 1, characterized in that
said step 2 further comprises:
21: parsing said mass spectra and dividing said mass spectra evenly into a plurality of raw data blocks, the size of each said raw data block being smaller than the local storage space of a cluster node;
22: having each said raw data block handled by one spectrum map processor process, wherein said spectrum map processor process reads in every mass spectrum of its raw data block in turn, assigns said mass spectra to the corresponding queues according to mass range, and then stores each queue in a separate spectrum intermediate file;
23: for the different mass ranges, having each mass range handled by one spectrum reduce processor process, said spectrum reduce processor processes running independently and in parallel, wherein said spectrum reduce processor process reads all spectrum intermediate files within its mass range, orders the input mass spectra first by experimental precursor ion mass and, where the experimental precursor ion masses are equal, by the dictionary (lexicographic) order of the spectrum titles, and after sorting stores the mass spectra in turn in a plurality of spectral data blocks, every block containing the same number of mass spectra;
24: collecting the information of all said spectral data blocks, and generating said mass spectrum metadata file from said information.
3. The large-scale distributed parallel acceleration method for protein identification according to claim 2, characterized in that
said step 22 further comprises:
when the number of said raw data blocks is greater than the number of processor cores in the cluster, or greater than the number of said spectrum map processor processes, processing said raw data blocks in multiple rounds, wherein a spectrum map processor process that has finished its task goes on to fetch a new task, first come, first served, until all said raw data blocks have been processed.
4. The large-scale distributed parallel acceleration method for protein identification according to claim 2, characterized in that
said step 23 further comprises:
when the number of said mass ranges is greater than the number of processor cores in the cluster, or greater than the number of said spectrum reduce processor processes, processing said mass ranges in multiple rounds, wherein a spectrum reduce processor process that has finished its task goes on to fetch a new task, first come, first served, until all said spectrum intermediate files have been processed.
5. The large-scale distributed parallel acceleration method for protein identification according to claim 1, 2, 3 or 4, characterized in that
in said step 3, the step in which said master process assigns idle slave processes to perform peptide-spectrum match identification comprises:
said master process reading in said mass spectrum metadata file and said peptide index metadata file; according to the statistics obtained, sorting the said spectral data blocks it is responsible for identifying by mass range from high to low and assigning them in turn to said slave processes; if there are multiple said peptide index file blocks, assigning the same said spectral data block multiple times, each time paired with one peptide index file block; said slave processes fetching tasks on a first-come, first-served basis; and, whenever a task is completed, the slave process storing an identification result sub-block, communicating with said master process, returning the file name of said identification result sub-block, and requesting the spectral data block and peptide index file block information for the next task, until all spectral data blocks have been identified.
6. The large-scale distributed parallel acceleration method for protein identification according to claim 5, characterized in that
in said step 3, the step in which said slave process performs peptide-spectrum match identification comprises:
said slave process reading in said peptide index file block; computing, on the basis of the original peptide sequences, the modified forms that may occur; using the precursor ion mass error window of the spectral data block to be identified to search for modified peptide sequences that fall within the set mass range; and passing the qualifying modified peptide sequences to the peptide-spectrum match scoring algorithm to identify the peptide sequences.
7. The large-scale distributed parallel acceleration method for protein identification according to claim 1, 2, 3, 4 or 6, characterized in that
said step 4 further comprises:
41: summarizing the identification results, wherein all identification result sub-blocks corresponding to each said spectral data block are handled by one spectrum identification result gather processor process, said spectrum identification result gather processor processes running independently and in parallel; said spectrum identification result gather processor process reads in all identification result sub-blocks of the said spectral data block assigned to it, sorts the peptide sequences of all identification results of each mass spectrum by the score given by the peptide-spectrum match scoring algorithm, keeps the top-ranked peptide sequences and their scores, and stores them in a block summary file;
42: reading in all block summary files; filtering the peptide sequences of the identification results of each mass spectrum and removing redundancy; dividing the resulting non-redundant peptide sequences evenly into a plurality of groups, wherein for each group one protein query processor process looks up the corresponding protein numbers and sequences, said protein query processor processes running independently and in parallel; and applying the peptide-to-protein inference algorithm to the lookup results to generate the output file.
8. The large-scale distributed parallel acceleration method for protein identification according to claim 7, characterized in that
said step 41 further comprises:
when the number of said spectral data blocks is greater than the number of processor cores in the cluster, or greater than the number of said spectrum identification result gather processor processes, processing said identification result sub-blocks in multiple rounds, wherein a spectrum identification result gather processor process that has finished its task goes on to fetch a new task, first come, first served, until all said identification result sub-blocks have been processed.
9. The large-scale distributed parallel acceleration method for protein identification according to claim 7, characterized in that
said step 42 further comprises:
when the number of said non-redundant peptide sequence groups is greater than the number of processor cores in the cluster, or greater than the number of said protein query processor processes, processing said non-redundant peptide sequence groups in multiple rounds, wherein a protein query processor process that has finished its task goes on to fetch a new task, first come, first served, until all non-redundant peptide sequences have been processed.
10. A large-scale distributed parallel acceleration system for protein identification, characterized by comprising:
a peptide sequence index module, used to perform theoretical enzymatic digestion on the input protein sequences to obtain peptide sequences, to sort said peptide sequences by theoretical precursor ion mass and remove redundancy to create peptide index file blocks, and to generate a peptide index metadata file from said peptide index file blocks;
a spectral data processing module, used to sort the input mass spectra by experimental precursor ion mass using a parallel processing method, to divide the sorted mass spectra evenly to obtain a plurality of spectral data blocks, and to generate a mass spectrum metadata file from said spectral data blocks;
a peptide-spectrum match identification module, connected to said peptide sequence index module and said spectral data processing module, used to distribute said spectral data blocks evenly among the master processes, each master process managing a plurality of slave processes; each master process sorts the spectral data blocks allocated to it and assigns them in turn to idle slave processes for peptide-spectrum match identification; and, when there is more than one said peptide index file block, the same said spectral data block is assigned to a plurality of slave processes, each of which traverses a single said peptide index file block to perform peptide-spectrum match identification;
a result summary and output module, connected to said peptide-spectrum match identification module, used to summarize the identification results using a parallel processing method, to infer the corresponding protein sequences from the identified peptide sequences, and to generate an output file.
11. The large-scale distributed parallel acceleration system for protein identification according to claim 10, characterized in that
said spectral data processing module comprises:
a spectrum partition module, used to parse said mass spectra and divide said mass spectra evenly into a plurality of raw data blocks, the size of each said raw data block being smaller than the local storage space of a cluster node;
a spectrum map module, connected to said spectrum partition module, used to have each said raw data block handled by one spectrum map processor process, wherein said spectrum map processor process reads in every mass spectrum of its raw data block in turn, assigns said mass spectra to the corresponding queues according to mass range, and then stores each queue in a separate spectrum intermediate file;
a spectrum reduce module, connected to said spectrum map module, used to have each of the different mass ranges handled by one spectrum reduce processor process, said spectrum reduce processor processes running independently and in parallel, wherein said spectrum reduce processor process reads all spectrum intermediate files within its mass range, orders the input mass spectra first by experimental precursor ion mass and, where the experimental precursor ion masses are equal, by the dictionary (lexicographic) order of the spectrum titles, and after sorting stores the mass spectra in turn in a plurality of spectral data blocks, every block containing the same number of mass spectra;
a mass spectrum metadata file generation module, connected to said spectrum reduce module, used to collect the information of all said spectral data blocks and to generate said mass spectrum metadata file from said information.
12. The large-scale distributed parallel acceleration system for protein identification according to claim 11, characterized in that
said spectrum map module is further used, when the number of said raw data blocks is greater than the number of processor cores in the cluster, or greater than the number of said spectrum map processor processes, to process said raw data blocks in multiple rounds, wherein a spectrum map processor process that has finished its task goes on to fetch a new task, first come, first served, until all raw data blocks have been processed.
13. The large-scale distributed parallel acceleration system for protein identification according to claim 11, characterized in that
said spectrum reduce module is further used, when the number of said mass ranges is greater than the number of processor cores in the cluster, or greater than the number of said spectrum reduce processor processes, to process said mass ranges in multiple rounds, wherein a spectrum reduce processor process that has finished its task goes on to fetch a new task, first come, first served, until all spectrum intermediate files have been processed.
14. The large-scale distributed parallel acceleration system for protein identification according to claim 10, 11, 12 or 13, characterized in that
said peptide-spectrum match identification module is further used to have said master process read in said mass spectrum metadata file and said peptide index metadata file; according to the statistics obtained, to sort the said spectral data blocks it is responsible for identifying by mass range from high to low and assign them in turn to said slave processes; if there are multiple said peptide index file blocks, to assign the same said spectral data block multiple times, each time paired with one peptide index file block; said slave processes fetch tasks on a first-come, first-served basis; and, whenever a task is completed, the slave process stores an identification result sub-block, communicates with said master process, returns the file name of said identification result sub-block, and requests the spectral data block and peptide index file block information for the next task, until all spectral data blocks have been identified.
15. The large-scale distributed parallel acceleration system for protein identification according to claim 14, characterized in that
said peptide-spectrum match identification module is further used to have said slave process read in the peptide index file block; compute, on the basis of the original peptide sequences, the modified forms that may occur; use the precursor ion mass error window of the spectral data block to be identified to search for modified peptide sequences that fall within the set mass range; and pass the qualifying modified peptide sequences to the peptide-spectrum match scoring algorithm to identify the peptide sequences.
16. The large-scale distributed parallel acceleration system for protein identification according to claim 10, 11, 12, 13 or 15, characterized in that
said result summary and output module comprises:
a summary module, used to have all identification result sub-blocks corresponding to each said spectral data block handled by one spectrum identification result gather processor process, said spectrum identification result gather processor processes running independently and in parallel, wherein said spectrum identification result gather processor process reads in all identification result sub-blocks of the said spectral data block assigned to it, sorts the peptide sequences of all identification results of each mass spectrum by the score given by the peptide-spectrum match scoring algorithm, keeps the top-ranked peptide sequences and their scores, and stores them in a block summary file;
a filtering and inference output module, connected to said summary module, used to read in said block summary files; to filter the peptide sequences of the identification results of each mass spectrum and remove redundancy; to divide the resulting non-redundant peptide sequences into a plurality of groups, wherein for each group one protein query processor process looks up the corresponding protein numbers and sequences, said protein query processor processes running independently and in parallel; and to apply the peptide-to-protein inference algorithm to the lookup results to generate the output file.
17. The large-scale distributed parallel acceleration system for protein identification according to claim 16, characterized in that
said summary module is further used, when the number of said spectral data blocks is greater than the number of processor cores in the cluster, or greater than the number of said spectrum identification result gather processor processes, to process said identification result sub-blocks in multiple rounds, wherein a spectrum identification result gather processor process that has finished its task goes on to fetch a new task, first come, first served, until all identification result sub-blocks have been processed.
18. The large-scale distributed parallel acceleration system for protein identification according to claim 16, characterized in that
said filtering and inference output module is further used, when the number of said non-redundant peptide sequence groups is greater than the number of processor cores in the cluster, or greater than the number of said protein query processor processes, to process said non-redundant peptide sequence groups in multiple rounds, wherein a protein query processor process that has finished its task goes on to fetch a new task, first come, first served, until all non-redundant peptide sequences have been processed.
CN201010292032.5A 2010-09-26 2010-09-26 Large-scale distributed parallel acceleration method and system for protein identification Active CN102411680B (en)

Publications (2)

Publication Number Publication Date
CN102411680A true CN102411680A (en) 2012-04-11
CN102411680B CN102411680B (en) 2014-03-26

Family

ID=45913751

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201010292032.5A Active CN102411680B (en) 2010-09-26 2010-09-26 Large-scale distributed parallel acceleration method and system for protein identification

Country Status (1)

Country Link
CN (1) CN102411680B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101158952A (en) * 2007-11-22 2008-04-09 中国人民解放军国防科学技术大学 Biological sequence data-base searching multilayered accelerating method based on flow process
CN101714187A (en) * 2008-10-07 2010-05-26 中国科学院计算技术研究所 Index acceleration method and corresponding system in scale protein identification

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
LEHENG WANG等: "An efficient parallelization of phosphorylated peptide and protein identification", 《RAPID COMMUNICATIONS IN MASS SPECTROMETRY》, vol. 24, no. 12, 30 June 2010 (2010-06-30), pages 1791 - 1798 *
于长永: "一种基于信息论的蛋白质数据库搜索鉴定算法", 《东北大学学报(自然科学版)》, vol. 30, no. 1, 31 January 2009 (2009-01-31), pages 50 - 53 *
杨兵等: "规模化蛋白质鉴定中的串联质谱数据评价方法", 《生命的化学》, vol. 25, no. 5, 15 October 2005 (2005-10-15), pages 407 - 410 *
涂强等: "InsPecT的2种并行优化方案", 《计算机工程》, vol. 36, no. 6, 31 March 2010 (2010-03-31), pages 100 - 101 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103810200A (en) * 2012-11-12 2014-05-21 中国科学院计算技术研究所 Database searching method and database searching system for open type protein identification
CN103810200B (en) * 2012-11-12 2016-03-30 中国科学院计算技术研究所 The database search method of opened protein matter qualification and system thereof
CN107346350A (en) * 2016-05-06 2017-11-14 中国科学院微电子研究所 Integrated circuit layout data handles distribution method, device and the group system of task
CN107346350B (en) * 2016-05-06 2020-08-28 中国科学院微电子研究所 Distribution method, device and cluster system for integrated circuit layout data processing tasks
CN114242163A (en) * 2020-09-09 2022-03-25 复旦大学 Processing system of mass spectrum data of proteomics
CN114242163B (en) * 2020-09-09 2024-01-30 复旦大学 Processing system for mass spectrometry data of proteomics

Also Published As

Publication number Publication date
CN102411680B (en) 2014-03-26

Similar Documents

Publication Publication Date Title
CN102411679B (en) Large-scale distributed parallel acceleration method and system for protein identification
CN102411666B (en) Large-scale distributed parallel acceleration method and system for protein identification
CN107122443B (en) Distributed full-text search system and method based on Spark SQL
Murata et al. Simultaneous comparison of three protein sequences.
CN102722553B (en) Distributed type reverse index organization method based on user log analysis
CN102799486B (en) Data sampling and partitioning method for MapReduce system
CN105550225B (en) Index structuring method, querying method and device
CN105550274B (en) Query method and device for dual-replica parallel database
CN105975617A (en) Multi-partition-table inquiring and processing method and device
CN103577474B (en) Database update method and system
CN105224658A (en) Real-time query method and system for big data
Ngu et al. B+-tree construction on massive data with Hadoop
CN108241627A (en) Heterogeneous data storage and query method and system
CN111159180A (en) Data processing method and system based on data resource directory construction
CN101714187B (en) Index acceleration method and corresponding system in large-scale protein identification
CN102411680B (en) Large-scale distributed parallel acceleration method and system for protein identification
CN109061020A (en) Data analysis system based on gas-phase and liquid-phase chromatography-mass spectrometry platform
CN108733781A (en) Cluster temporal data indexing method based on in-memory computing
CN109145225A (en) Data processing method and device
Xu et al. A near-storage framework for boosted data preprocessing of mass spectrum clustering
EP1524599B1 (en) A method of reassigning objects to processing units
CN111428140B (en) High concurrency data retrieval method, device, equipment and storage medium
CN108776698A (en) Skew-resistant data fragmentation method based on Spark
CN108197275A (en) Distributed document row storage indexing method
CN103412942A (en) Voltage dip data analysis method based on cloud computing technology

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant