CN101495652A

CN101495652A - Computer-implemented biological sequence identifier system and method

Info

Publication number: CN101495652A
Application number: CNA2006800293307A
Authority: CN
Inventors: 安东尼·P.·马拉诺斯基; 林宝川; 乔尔·M.·施努尔; 大卫·A.·斯滕格
Original assignee: US Government
Current assignee: US Government; US Department of Navy
Priority date: 2005-06-16
Filing date: 2006-06-09
Publication date: 2009-07-29
Also published as: CN101273143A

Abstract

The invention provides a method comprising: submitting reference sequences to a taxonomic database to produce taxonomic results; and reporting a taxonomic identification based on the taxonomic results. The reference sequences are the output of genetic database queries that return a score for each reference sequence. Also provided is a method for processing a biological sequence obtained from an assay, comprising: converting the base calls located in a predetermined list of positions within the biological sequence to N; and determining the ratio of single nucleotide polymorphisms in the biological sequence relative to a reference sequence. Each entry in the predetermined list of positions represents the capability of a substance to hybridize to the microarray used to generate the biological sequence. The substance is not the nucleic acid of a target pathogen.

Description

Computer-implemented biological sequence identifier system and method

This application claims priority to the following patent applications: U.S. Provisional Patent Application No. 60/691,768, filed June 16, 2005; U.S. Provisional Patent Application No. 60/735,876, filed November 14, 2005; U.S. Patent Application No. 11/422,425, filed June 6, 2006; and U.S. Patent Application No. 11/422,431, filed June 6, 2006. This application is a continuation-in-part of the following patent applications: U.S. Patent Application No. 11/177,647, filed July 2, 2005; and U.S. Patent Application No. 11/177,646, filed November 7, 2005. These non-provisional applications claim priority to the following U.S. provisional patent applications: No. 60/590,931, filed July 2, 2004; No. 60/609,918, filed September 15, 2004; No. 60/626,500, filed November 5, 2004; No. 60/631,437, filed September 29, 2004; and No. 60/631,460, filed September 29, 2004.

Technical field

The present invention relates generally to the processing of biological sequences.

Background technology

For both surveillance and diagnostic applications, fine-scale pathogen identification and near-neighbor discrimination are important; an assay that monitors at this level of specificity is therefore desirable in clinical settings (1-3). To use any DNA- or RNA-based detection method successfully, these assays must be coupled with large databases of nucleic acid sequence information, both to design the assay so that it provides the desired information and to interpret the raw data. Many well-established techniques, such as real-time PCR, use short unique sections of sequenced genomes to provide good specificity (4). These techniques can provide fine-scale identification of genetically close organisms by selecting a sufficient number of sections. However, although the selected sections are specific when first chosen, they are found to be less specific as more organisms are sequenced. This is a particular problem for pathogens that belong to families with high mutation rates and for pathogens for which relatively few near neighbors have been characterized. In addition, real-time PCR cannot detect the presence of new significant mutations or resolve the details of a base sequence. Similarly, advances in other detection technologies provide ways to identify pathogens but suffer from some or all of the problems of PCR (5-8).

High-density resequencing microarrays produce segments of direct sequence information of variable length, from 10^2 to 10^5 base pairs (bp). They have been used successfully to detect single nucleotide polymorphisms (SNPs) and genetic variants in viral, bacterial and eukaryotic genomes (9-16). Their use for SNP detection has clearly established their ability to provide sequence information of reliable quality. In most cases, the microarrays were designed to study a limited number of genetically similar target pathogens, and in many cases the detection method relied only on recognizing hybridization patterns for identification (12,14,15,17,18). Taking advantage of the ability of resequencing microarrays to resolve the contiguous bases required for SNP detection, resequencing has more recently been applied successfully, using a different approach, to pathogen identification for a variety of bacterial and viral pathogens, while allowing fine discrimination among closely related pathogens and tracking of mutations within the target pathogens (19-21). This new methodology differs from earlier work in that it uses the bases resolved from the observed hybridization as queries in a similarity search of DNA databases, to identify the most likely species and variants that match the base calls. The system can test for 26 pathogens simultaneously and can detect the presence of multiple pathogens. A software program, Resequencing Pathogen Identifier (REPI), was used to simplify the analysis of genetic database similarity searches performed with the Basic Local Alignment Search Tool (BLAST) (22). The REPI program used the BLAST default settings and returned only sequences that might represent hybridization, that is, those with an expect value lower than 10^-9, where the expect value is a quantity calculated by the BLAST program that indicates the likelihood that the sequence match found would occur by chance in the database. This filtered out all cases with insufficient signal; however, the final determination of which pathogen (or pathogens) was detected, and of what level of discrimination was possible, required manual examination of the returned results. This method successfully allowed fine discrimination of various adenoviruses and strain identification of Flu A and Flu B consistent with conventional sequencing results (19,20). Two significant advantages of this approach are that the information is always recovered at the most detailed level possible, and that it can still recognize organisms with recent mutations. The method also remains specific, because it does not rely on the uniqueness of short sequences, which is eroded as more organisms are sequenced.

Although this analysis method is useful, it has several shortcomings: it is time consuming, it is not optimized to maximize sensitivity, its results are complicated, it is suitable only for experts, and it presents redundant or duplicate information. The process is time consuming because only the initial screening is automated; the remaining steps require manual interpretation before detection is complete. Because a simple criterion (an expect value cutoff of 10^-9) and non-optimized BLAST parameters were used to decide which pathogens were detected, the REPI algorithm provided a list of candidate organisms without reaching a simple final conclusion, and it did not relate the results of one prototype sequence to those of the others. Manual processing was used instead to make the final decision; but because REPI presented all similar results and the commonly used nucleic acid databases contain redundant entries, a large amount of useless data was presented to the user. In addition, with manual processing it could not be determined whether the algorithm developed was general to all organisms for which resolved nucleic acid base sequence information is provided.

Summary of the invention

One method of the invention comprises: submitting a plurality of reference sequences as queries to a taxonomic database to obtain a plurality of taxonomic results; and reporting a taxonomic identification based on the taxonomic results. The reference sequences are the output of a genetic database query that returns a score for each reference sequence.

Another method of the invention, for processing a biological sequence obtained from an assay, comprises: converting the base calls located in a predetermined list of positions in the biological sequence to N; and determining the ratio of single nucleotide polymorphisms in the biological sequence relative to a reference sequence. Each entry in the predetermined list of positions represents the capability of a substance to hybridize to the microarray used to generate the biological sequence. The substance is not the nucleic acid of a target pathogen.
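The two steps of this method can be illustrated with a minimal Python sketch. It assumes that the biological sequence and the reference sequence are equal-length strings over A, C, G, T and N, that positions are 0-based, and that the ratio is taken as SNPs divided by unique (non-N) base calls, the definition given for UniRate in the filtering description. The function names are hypothetical and are not taken from the patent.

```python
def mask_positions(hybseq, masked_positions):
    """Return a copy of hybseq with the listed 0-based positions set to 'N'."""
    bases = list(hybseq)
    for pos in masked_positions:
        if 0 <= pos < len(bases):
            bases[pos] = "N"
    return "".join(bases)


def snp_ratio(hybseq, reference):
    """Ratio of SNPs to unique (non-N) base calls; 0.0 when nothing was called."""
    if len(hybseq) != len(reference):
        raise ValueError("sequences must have the same length")
    unique_calls = 0
    snps = 0
    for called, ref in zip(hybseq.upper(), reference.upper()):
        if called == "N":
            continue  # ambiguous call: counts neither as unique nor as a SNP
        unique_calls += 1
        if called != ref:
            snps += 1
    return snps / unique_calls if unique_calls else 0.0


reference = "ACGTACGTAC"
hybseq = mask_positions("ACGAACGTTC", [8])  # position 8 is assumed to be primer-derived
ratio = snp_ratio(hybseq, reference)
```

In this example the mismatch at position 8 is masked before the ratio is computed, so only the mismatch at position 3 counts as a SNP.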

Description of drawings

A more complete appreciation of the invention will be readily obtained by reference to the following description of the example embodiments and the accompanying drawings.

Fig. 1 is a schematic of the algorithm, showing the logical relationship between the three main tasks and the subtasks associated with each task. Task I performs filtering and subsequence selection, then determines which database record each prototype sequence most resembles. Task II determines whether the prototype sequence identifications support a common organism identification. Task III performs the final check and determines which organisms were detected from the microarray data. ProSeq: prototype sequence; SubSeq: subsequence; HybSeq: hybridized sequence.

Fig. 2 is a detailed schematic of the filtering subtask of Task I. For each ProSeq, primer regions are masked as N (ambiguous) calls, and UniRate is then calculated from the HybSeq. For ProSeq that meet the UniRate requirement, a revised sliding window algorithm is used to generate SubSeq that can be used as BLAST queries. The identity (start position within the ProSeq and length) of each successfully generated SubSeq is placed in a file for batch querying by BLAST.

Fig. 3 is a detailed schematic of the Task I subtask that determines the organism for each SubSeq. Each SubSeq sent to BLAST returns a list of possible matches, held in the Return array, which is sorted by finding the best bit score/expect value pair (MaxScore). If MaxScore is higher than MIN (10^-6), all returns that have this best score are sorted into a new array, Rank1. The detailed decision process described in the methods section then identifies the organism for the SubSeq.

Fig. 4 is a detailed schematic of the Task I subtask that determines the organism for a ProSeq from the results found for its SubSeq. All SubSeq of a particular ProSeq are compared with one another to determine the two best-scoring SubSeq. If there is only one SubSeq, or if one scores much higher than the others, the ProSeq inherits the characteristics of that SubSeq. Otherwise, a common taxonomic class is determined as described in the text.

Fig. 5 is an alignment of the influenza A NA1 ProSeq with the A/Weiss/43 and A/PuertoRico/8/34 strains. The raw and filtered hybridization results for A/PuertoRico/8/34 are also shown. * indicates a fully matching sequence.

Modes for carrying out the invention

In the following description, for purposes of explanation and not limitation, specific details are set forth in order to provide a thorough understanding of the present invention. However, it will be apparent to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known methods and devices are omitted so as not to obscure the description of the present invention with unnecessary detail.

As used herein, the term "sequence" refers to a nucleic acid sequence, such as DNA or RNA, or to a protein sequence. "Base" and "base call" refer to nucleotide bases or amino acids. The term "taxonomic" may refer to any level of identification or classification of a pathogen, including but not limited to genus, species, strain and sub-strain. The term "reporting" includes transmitting a signal from one system to another as well as producing any form of human-readable report. All disclosed methods may be computer-implemented on a device having means for performing the method.

Disclosed is a new software expert system, Computer-Implemented Biological Sequence Identifier system 2.0 (CIBSI 2.0), which successfully uses resolved base sequence information from custom-designed Affymetrix resequencing microarrays to provide a simple list of the organisms detected. The algorithm addresses the shortcomings of the earlier method by incorporating new features into fully automated pathogen identification. The program correctly identifies, with improved sensitivity, all 26 pathogens included on the RPM v1 microarray, whether they are present alone or in combination (19,20,23). Although the program is currently applied to resequencing microarrays, the method developed is general, because only the first part of the algorithm handles issues specific to microarrays; the remainder handles any sequence suitable as a query for the BLAST algorithm. In developing the general identification algorithm, the issues specific to resequencing microarrays that complicate their use were identified and resolved. Because the entire process of determining what was detected is automated, it is straightforward to test whether the rules used for identification are rigorous and apply to all pathogens. With this efficient program, resequencing-based assays can provide a competitive method for testing for many possible pathogens simultaneously, with output that can be interpreted by non-experts.

Amplification, hybridization and sequence determination

Details of the RPM v1 microarray design and the experimental methods were discussed in earlier work (19,20,23). The experimental microarray data used in this analysis were obtained from various purified templates and clinical samples using random and multiplex amplification schemes. GCOS software v1.3 (Affymetrix Inc., Santa Clara, CA) was used to align and scan the hybridized microarrays to determine the intensity of each probe in each probe set. Base calls were made from the intensity data of each probe set using GDAS v3.0.2.8 software (Affymetrix Inc., Santa Clara, CA), which uses an implementation of the ABACUS algorithm (11). The data were provided in FASTA format for the subsequent analysis steps.

The resequencing microarray (RPM v.1) was designed previously for the detection and sequence typing of 20 common respiratory pathogens and 6 CDC category A biothreat pathogens known to cause febrile respiratory illness; it is based on ProSeq and does not rely on predetermined hybridization patterns (19,20,23). About 4000 RPM v.1 experiments, performed with different amplification schemes, single and multiple pathogen targets, purified nucleic acids and clinical samples, were examined to develop the pathogen identification algorithm. The results of applying the algorithm to clinical samples, identified pathogens and purified nucleic acids are discussed in detail in other work (19,20,23). In all cases, the algorithm correctly identified the organism at the species or strain level, depending on the length of the ProSeq represented on RPM v.1. Some specific examples are discussed below to illustrate how the algorithm performs under various conditions.

The CIBSI 2.0 program handles a hierarchy of three tasks (Fig. 1): (I) determining which database record each detected organism most resembles, (II) determining whether the identifications from the individual targets support a common organism identification, and (III) determining whether the detected organism belongs to the group of targets the assay was designed to detect or is instead a genetic near neighbor. Target pathogens are the organisms that the assay was specifically designed to detect. As used herein, a probe set representing a reference sequence selected from a target pathogen genome is called a prototype sequence, or "ProSeq". The set of resolved bases that results from hybridizing genomic material to a ProSeq is called the hybridized sequence, or "HybSeq". The HybSeq is divided into possible subsequences, or "SubSeq". The part of the algorithm that handles ProSeq-based organism identification (Task I) operates in 3 steps: initial filtering of each HybSeq into SubSeq suitable for similarity comparison, a database query for each SubSeq, and taxonomic comparison of the BLAST returns for each SubSeq. At the next level, the ProSeq are compared to determine whether they support the same identified organism. In the final step, the detected organisms are compared with the list of target pathogens the assay was designed to detect, to determine whether any were positively detected. The level of discrimination supported by a specific sample is determined automatically.

Filtering

An initial filtering algorithm, the Resequencing Pathogen Identifier (REPI) (20), was developed previously, and its general concept is incorporated, with revisions, into the (automated detection) algorithm currently used in CIBSI 2.0. Filtering and subsequence selection are used to remove possible bias caused by the choice of reference sequence and by other sources (primers), and to divide the HybSeq into meaningful fragments for faster searching. This is the first subtask of Task I in Fig. 1 and is shown schematically in detail in Fig. 2. If PCR amplification is used, microarrays are hybridized in the presence of primers alone to determine the positions at which the primers cause hybridization. All portions of the ProSeq that hybridize with primers are masked as N calls, so that the HybSeq contains no biased information. For each ProSeq, UniRate, the ratio of the total number of SNPs to the number of unique base calls, is calculated from the HybSeq. To eliminate HybSeq with insufficient hybridization, the ProSeq is considered negative for the target organism if UniRate >= 20% (the SNP threshold). A UniRate of 20% corresponds to 5 SNPs per 25 bp. It is unrealistic to expect a 25 bp probe to hybridize when an organism related to the target pathogen class differs at this frequency from the reference sequence on which the ProSeq is based. This ends the filtering subtask, and the algorithm returns to the Task I loop to examine the next ProSeq. ProSeq with a rate < 20% are examined in more detail. At each position of the HybSeq, a revised sliding window algorithm (20) is used to generate SubSeq that can serve as BLAST queries. First, the 20 bases (the initial length) following the position are examined. If fewer than 60% of these bases are ambiguous (N), the SubSeq enters the extension phase. The SubSeq is extended one base at a time until the total content of unique base calls falls below 40% (the unique base threshold), or until a sliding window containing the last 21 bases has fewer than 4 unique base calls. This differs from the REPI algorithm, which used only a 20-base sliding window and stopped SubSeq generation when unique base calls fell below 40% of the window. At this point the SubSeq is examined and trailing N calls are removed. At least one stretch of 7 consecutive unique base calls is required, to satisfy the BLAST word size parameter, before the SubSeq is saved for further analysis. SubSeq longer than 100 bases are accepted. SubSeq of 30 bases or fewer need at least 95% unique base calls (not "N") to be accepted. SubSeq of 30 to 100 bases need at least VARI = (("SubSeq length" - 30) * 0.2857 + 70)% unique bases to be accepted. For SubSeq of 80 bases or more, the BLAST word size parameter is changed to 11 if the SubSeq contains at least 11 consecutive bases. The identity (start position within the ProSeq and length) of each successfully generated SubSeq is placed in an entry in the SubSeq array, which holds the information related to each SubSeq. The identity and the SubSeq are placed in a file for batch querying by BLAST. The procedure is repeated, continuing from the end of the previous successful SubSeq or, after a failure, from the point at which window generation began, until the end of the HybSeq is reached. When finished, the algorithm returns to the Task I loop and proceeds to the BLAST subtask.
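The acceptance rules applied to a candidate SubSeq after extension can be sketched in Python as follows. This is a literal reading of the thresholds stated above, not the patented implementation. It assumes that a length of exactly 30 falls under the 95% rule, that "consecutive bases" means consecutive unique (non-N) calls, and that the word size otherwise stays at 7. All function names are hypothetical.

```python
def longest_unique_run(subseq):
    """Length of the longest stretch of consecutive non-N base calls."""
    longest = current = 0
    for base in subseq.upper():
        current = current + 1 if base != "N" else 0
        longest = max(longest, current)
    return longest


def required_unique_percent(length):
    """Minimum percentage of unique calls for a SubSeq of the given length."""
    if length > 100:
        return 0.0   # longer than 100 bases: accepted regardless of content
    if length <= 30:
        return 95.0  # assumption: the boundary value 30 uses the 95% rule
    return (length - 30) * 0.2857 + 70  # VARI, as stated in the text


def accept_subseq(subseq):
    """Apply the trailing-N trim, the 7-base run rule and the VARI rule."""
    subseq = subseq.upper().rstrip("N")
    if longest_unique_run(subseq) < 7:
        return False
    unique = sum(1 for base in subseq if base != "N")
    return 100.0 * unique / len(subseq) >= required_unique_percent(len(subseq))


def blast_word_size(subseq):
    """Word size 11 for SubSeq of 80 or more bases with an 11-base unique run."""
    if len(subseq) >= 80 and longest_unique_run(subseq) >= 11:
        return 11
    return 7
```

For example, a 40-base SubSeq needs at least (40 - 30) * 0.2857 + 70 = 72.857% unique calls, so a candidate with 30 unique calls (75%) and a run of at least 7 unique bases would pass under this reading.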

Data base querying

The BLAST subtask performs a batch similarity search of the database using the SubSeq as queries. The BLAST program used is NCBI blastall -p blastn, version 2.12, with a defined set of parameters. Low-complexity regions are masked during the seeding phase to speed up the query; however, low-complexity repeats are included in the actual scoring. The complete nucleotide database obtained on February 7, 2006 is used as the reference database. (Note that earlier images of this database were used during development, but all experiments were rerun with the algorithm against the image of the database obtained on that date.) The default gap penalties and nucleotide match score are used. The nucleotide mismatch penalty, -q, is set to -1 rather than the default value. The results of any BLAST query with an expect value of less than 0.0001 are returned from the blastall program in tabular format. The information about each return (bit score, expect value, mismatches, match length) is placed in the Return{hash key}{info} hash, which uses the SubSeq as the hash key, for further analysis.
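A hedged Python sketch of this step follows. It assumes the legacy NCBI blastall command line with its 12-column tabular output (-m 8), in which column 1 is the query identifier, column 4 the alignment length, column 5 the number of mismatches, column 11 the expect value and column 12 the bit score. The file name, the database name "nt" and the function names are hypothetical, and the seeding-phase masking option is omitted because the text does not give its exact setting.

```python
def blastall_command(query_fasta, database):
    """Argument list for a batch blastn query (built only, not executed here)."""
    return [
        "blastall", "-p", "blastn",
        "-d", database,
        "-i", query_fasta,
        "-q", "-1",      # nucleotide mismatch penalty set to -1
        "-e", "0.0001",  # expect value cutoff
        "-m", "8",       # tabular output
    ]


def parse_tabular(lines, max_expect=1e-4):
    """Group tabular BLAST returns by query (SubSeq) identifier."""
    returns = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a complete tabular record
        expect = float(fields[10])
        if expect >= max_expect:
            continue
        returns.setdefault(fields[0], []).append({
            "subject": fields[1],
            "match_length": int(fields[3]),
            "mismatches": int(fields[4]),
            "expect": expect,
            "bit_score": float(fields[11]),
        })
    return returns


command = blastall_command("subseq_queries.fasta", "nt")
```

The dictionary returned by `parse_tabular` plays the role of the hash keyed by SubSeq: each key maps to the list of returns for that query, ready for the ranking step.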

Taxonomy-based pathogen identification for a ProSeq from its SubSeq

The next subtask of Task I determines the state of each SubSeq(), as shown in Fig. 3. To provide simple data and easy decision handling, the information about each SubSeq is summarized by two parameters. "Identified organism" represents the taxonomic class of the organism, and "organism uniqueness" represents the quality of the organism identification. For each individual SubSeq() of a ProSeq, the elements in the Return hash are examined and sorted by a score array. The score array contains a pair of parameters, bit score and expect value, which have a fixed relationship for a given database. Depending on the situation, it is appropriate to rank by a score that reflects the size of the database (expect value) or by one that does not (bit score). Elements in the Return hash can have identical scores, so all elements with the maximum bit score/minimum expect value (MaxScore) are retained in a separate array, Rank1. The full taxonomic classification of each element in Rank1 is obtained from the NCBI taxonomy database, which was also obtained on February 7, 2006 (see the note above). If the MaxScore expect value is higher than MAX (currently 10^-6), the SubSeq() is updated with null identified organism and organism uniqueness information. If MaxScore is sufficiently small, the returns placed in Rank1 are examined. If Rank1 contains a single element, the SubSeq is assigned an organism uniqueness of SeqUniqu. If Rank1 contains multiple elements, the SubSeq is assigned an organism uniqueness of TaxUnique when all returns belong to the same taxonomic class; otherwise the organism uniqueness of the SubSeq is set to TaxAmbig (taxonomically ambiguous). The tasks listed in Fig. 3 are applied to each SubSeq() of the ProSeq. In all cases, each SubSeq() is assigned an identified organism, which represents the taxonomic class of the common parent of all elements in the top-ranked array.
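The decision for a single SubSeq can be sketched in Python as below. The sketch assumes that each return carries a bit score, an expect value and a lineage (a list of taxon names ordered from the root to the most specific class), and that two returns belong to the same taxonomic class when their lineages are identical. The data layout and function names are illustrative assumptions, and the taxon names in the example are placeholders rather than real database records.

```python
def common_parent(lineages):
    """Deepest taxon shared by every lineage, or None if nothing is shared."""
    shared = None
    for level in zip(*lineages):
        if len(set(level)) != 1:
            break
        shared = level[0]
    return shared


def classify_subseq(returns, max_expect=1e-6):
    """Return (identified organism, organism uniqueness) for one SubSeq."""
    if not returns:
        return None, None
    best = max(returns, key=lambda r: (r["bit_score"], -r["expect"]))
    if best["expect"] > max_expect:
        return None, None  # best return is too weak: null result
    # Keep every return that ties with the best score pair.
    top_returns = [r for r in returns
                   if r["bit_score"] == best["bit_score"]
                   and r["expect"] == best["expect"]]
    organism = common_parent([r["lineage"] for r in top_returns])
    if len(top_returns) == 1:
        uniqueness = "SeqUniqu"
    elif len({tuple(r["lineage"]) for r in top_returns}) == 1:
        uniqueness = "TaxUnique"
    else:
        uniqueness = "TaxAmbig"
    return organism, uniqueness


species = ["Chlamydiaceae", "C. pneumoniae"]
hits = [
    {"bit_score": 182.0, "expect": 1e-45, "lineage": species + ["AR39"]},
    {"bit_score": 182.0, "expect": 1e-45, "lineage": species + ["J138"]},
    {"bit_score": 150.0, "expect": 1e-30, "lineage": species},
]
result = classify_subseq(hits)
```

Here two strains tie for the best score, so the SubSeq is taxonomically ambiguous and the identified organism falls back to their common parent, the species.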

After each SubSeq has been examined, the algorithm moves to its next objective, which is to determine the identified organism of the ProSeq from its SubSeq (Fig. 4). If the identified organism values of all SubSeq elements are null, the ProSeq is negative and the next ProSeq is examined. If the SubSeq array contains only one entry for the ProSeq, or if all entries in SubSeq have the same identified organism, a Result1 entry is generated for the identified organism; its organism uniqueness is TaxUnique (taxonomically unique) if there are multiple SubSeq entries, or the state of the single SubSeq entry otherwise. If there are multiple entries in SubSeq with different identified organisms, further analysis is performed. The SubSeq are re-sorted by MaxScore (bit score), so that the two elements with the best scores are SubSeq(1) and SubSeq(2). If the score of SubSeq(1) is greater than the score of SubSeq(2) by 30% or more (the score ratio threshold), the ProSeq inherits the characteristics of SubSeq(1), including its identified organism. Otherwise, the organism state of the ProSeq is TaxAmbig, and the identified organism is the common parent taxonomic class of all the subsequences. If all the subsequences are contained in only two taxonomic classes that are direct child and parent, the identified organism is that of the subsequences in the child class. This completes the subtask of Fig. 4, and the Task I loop continues. A list of the ProSeq with detected organisms is built in the Result1 array.
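The ProSeq-level decision can be sketched in Python as follows. The sketch reads the 30% score ratio threshold as "the best bit score is at least 1.3 times the second best", which is an assumption about wording that the text leaves open. Each SubSeq is represented by a hypothetical dictionary holding its lineage (root to most specific class, or None for a null result), its organism uniqueness and its bit score.

```python
def shared_prefix(lineages):
    """Longest common root-to-leaf prefix of the given lineages."""
    prefix = []
    for level in zip(*lineages):
        if len(set(level)) != 1:
            break
        prefix.append(level[0])
    return prefix


def identify_proseq(subseqs, score_ratio=0.30):
    """Return the ProSeq result as a dict, or None if the ProSeq is negative."""
    hits = [s for s in subseqs if s["lineage"]]
    if not hits:
        return None
    lineages = {tuple(s["lineage"]) for s in hits}
    if len(lineages) == 1:
        # One organism only: inherit a lone SubSeq state, else TaxUnique.
        uniqueness = hits[0]["uniqueness"] if len(hits) == 1 else "TaxUnique"
        return {"lineage": list(hits[0]["lineage"]), "uniqueness": uniqueness}
    ranked = sorted(hits, key=lambda s: s["bit_score"], reverse=True)
    first, second = ranked[0], ranked[1]
    if first["bit_score"] >= (1 + score_ratio) * second["bit_score"]:
        return {"lineage": list(first["lineage"]),
                "uniqueness": first["uniqueness"]}
    if len(lineages) == 2:
        parent, child = sorted(lineages, key=len)
        if len(child) == len(parent) + 1 and child[:-1] == parent:
            # Direct parent and child classes only: report the child.
            return {"lineage": list(child), "uniqueness": "TaxAmbig"}
    return {"lineage": shared_prefix(lineages), "uniqueness": "TaxAmbig"}
```

Representing each organism by its full lineage keeps the parent and child test to a simple prefix comparison; a production system would presumably resolve lineages from taxonomy identifiers instead of names.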

Overall pathogen identification and positive calls

After Task I is complete, Task II (see Fig. 1) examines the identified organism values listed in Result1 and groups them together if they identify the same taxonomic class. Each entry in Result1 is examined, and a new entry is created in Result2 if its identified organism does not already appear in that list. In most cases, the Result2 entries represent the individual organisms detected, but they may still contain redundant information. Among the Result2 entries with identified organisms, if one is the taxonomic parent of another, the entries may in fact represent the same pathogen. An identical identification may not have been reached because, for a variety of possible reasons, a genomic target may not hybridize equally well to both ProSeq. Alternatively, two different but closely related organisms may both have hybridized to the microarray.

Although it is difficult to relate the results from the individual ProSeq to one another, Task III handles the final checks and decisions currently performed. The preceding tasks are carried out generically, so information about which ProSeq were expected to be detected has not yet been used. This allows these lower levels to recognize not only positive and negative cases but also ambiguous ones. In the final task, the algorithm considers whether each ProSeq identified the organism it was designed to detect. Both clearly negative ProSeq and indeterminate ProSeq are considered negative for the target pathogen. For this purpose, the ProSeq are grouped on the basis of the grouping performed in Task II. The Result2 entries are then processed in a loop. The ProSeq of an entry are used to look up the pathogen in the table of targets. If the identified organism of the Result2 entry is the same taxonomic class as the target pathogen, or a child of it, the Pathogen() array is updated with a positive entry for that target pathogen. If the Pathogen() array is null for this pathogen, the organism identified at the pathogen level is that of the Result2() entry. If an entry has already been placed in Pathogen(), further comparison is needed, and the Result2() and Pathogen entries are compared. If they have a direct child-parent relationship, the identified organism of the pathogen is the child taxonomic class. Otherwise, the common parent taxonomic class is reported as the positively identified organism. In most cases, if the ProSeq for a pathogen hybridize well, fine-level discrimination is reported. But if one or more ProSeq hybridize poorly, the positive target pathogen reported is identified only at the genus or species level. The results of all three tasks are reported, so that manual review is possible. Note that identified organisms from Task II that do not belong to a target pathogen are reported as non-target positive returns. In these cases, the Task II level results need to be examined for the details of what was identified.
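The final merging step of Task III can be sketched in Python as below, under the same illustrative assumption used for lineages elsewhere: each identified organism is a list of taxon names from the root to its most specific class, and the table of targets maps a pathogen name to the lineage of its target class. The names and data layout are hypothetical, and the taxon names in the example are placeholders.

```python
def merge_identification(current, new):
    """Combine an existing pathogen-level call with a new grouped entry."""
    if current is None or current == new:
        return new
    parent, child = sorted((current, new), key=len)
    if len(child) == len(parent) + 1 and child[:-1] == parent:
        return child  # direct child and parent: keep the more specific class
    prefix = []
    for level in zip(current, new):
        if level[0] != level[1]:
            break
        prefix.append(level[0])
    return prefix     # otherwise report the common parent class


def call_targets(grouped_entries, targets):
    """Return (pathogen calls by target name, non-target positive entries)."""
    pathogen = {name: None for name in targets}
    non_target = []
    for lineage in grouped_entries:
        matched = False
        for name, target_lineage in targets.items():
            # Same class as the target, or a descendant of it.
            if lineage[:len(target_lineage)] == target_lineage:
                pathogen[name] = merge_identification(pathogen[name], lineage)
                matched = True
        if not matched:
            non_target.append(lineage)
    return pathogen, non_target


targets = {"C. pneumoniae": ["Chlamydiaceae", "C. pneumoniae"]}
entries = [["Chlamydiaceae", "C. pneumoniae"],
           ["Chlamydiaceae", "C. pneumoniae", "AR39"],
           ["Adenoviridae", "HAdV-4"]]
calls, extras = call_targets(entries, targets)
```

In this example the species-level and strain-level entries are direct parent and child, so the target is called at the strain level, while the unrelated entry is kept as a non-target positive for manual review.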

Causal Agent Identification

Chlamydia pneumoniae (C. pneumoniae) samples containing 10 to 1000 genome copies (prepared by the method of reference 21) were selected to illustrate how pathogen detection and identification are carried out when multiple ProSeqs target the same pathogen (21). RPM v.1 has three highly conserved ProSeqs for this pathogen, selected from the genes encoding the major outer membrane protein (VD2 and VD4) and the DNA-directed RNA polymerase (rpoB). The HybSeqs from the different samples differed only in the number of unique base calls, as shown in Table 1. The percentage of each ProSeq that produced unique calls ranged from 80% to 100%, except for one case at a concentration of 10 genome copies in which only 11% of the rpoB ProSeq produced unique calls, which indicates that the detection limit of the assay lies above this concentration. Table 1 lists the determinations made for each SubSeq and the decision reached at the end of each task for the various samples. The ProSeqs from the different cases produced the same number of SubSeqs. The SubSeqs from different samples reported different bit scores for the same top-ranked BLAST returns. In fact, VD2 and VD4 produced identical results. The NCBI taxonomy database classified the returns into four distinct groups, representing the C. pneumoniae taxonomic group and three child strain groups. AE001652, AE002167, AE017159 and BA000008 appeared in the returns of every ProSeq for each sample, because they are the database entries for completely sequenced genomes. One rpoB SubSeq produced a SeqUniqu result for its organism. All other SubSeqs were TaxAmbig, with multiple returns from different taxonomic classes. Because the VD2 and VD4 ProSeqs each had a single SubSeq, Task I assigned the state of the SubSeq to the ProSeq. For the rpoB ProSeq, the bit score of the SubSeq was large enough that the algorithm assigned the identification of the SubSeq to the ProSeq. Task II grouped all three ProSeqs together, because they all had the same identified organism, and assigned TaxAmbig. The result of Task III was positive for the target pathogen C. pneumoniae; this decision was straightforward because all ProSeqs agreed with one another and belonged to the same target pathogen taxonomic class. Although the rpoB ProSeq was SeqUniqu, this was not the final conclusion of Task II, because the SeqUniqu ProSeq did not belong to a child taxonomic group and the other ProSeqs were TaxAmbig. The three recognized substrains received the same score, which shows that the sequences selected for the ProSeqs are highly conserved and cannot discriminate between strains.
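The SubSeq-level decision discussed above can be illustrated with a short Python sketch. It assumes that a SubSeq is SeqUniqu when exactly one BLAST return holds the best bit score and TaxAmbig when the tied best returns come from more than one taxonomic class. The function name, the tuple layout and the bit scores are hypothetical and are not taken from the source listing; only the accession numbers and strain names come from the footnotes of Table 1.

```python
def subseq_state(returns):
    """Classify one SubSeq from its BLAST returns.

    returns: list of (accession, taxon, bit_score) tuples (hypothetical layout).
    """
    if not returns:
        return "no returns"
    best = max(score for _, _, score in returns)
    # Taxa of every return that ties for the best bit score.
    top_taxa = [taxon for _, taxon, score in returns if score == best]
    if len(top_taxa) == 1:
        return "SeqUniqu"
    if len(set(top_taxa)) > 1:
        return "TaxAmbig"
    # Several tied returns within a single taxon: not covered by this sketch.
    return "tie within one taxon"


# Illustrative bit scores only; they are not measured values.
tied_returns = [
    ("BA000008", "J138", 200.0),
    ("AE002167", "AR39", 200.0),
    ("AE017159", "Tw-183", 200.0),
]
unique_return = [
    ("BA000008", "J138", 210.0),
    ("AE002167", "AR39", 200.0),
]

print(subseq_state(tied_returns))
print(subseq_state(unique_return))
```

In this simplified form the decision depends only on the best-scoring returns, which is why SubSeqs with different absolute bit scores can still receive the same state.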

Table 1 - Algorithm decisions in Tasks I, II and III for the SubSeqs of Chlamydia pneumoniae (C. pneumoniae) at several concentrations

(G1) J138 (BA000008), AR39 (AE002167), Tw-183 (AE017159), Cpne (M69230, AF131889, AY555078, M64064, AF131229, AF131230)

(G2) Cpne (S83995)

(G3) J138 (BA000008), AR39 (AE002167), Tw-183 (AE017159)

SU is the abbreviation of SeqUniqu

TA is the abbreviation of TaxAmbig

Influenza and human adenovirus (HAdV) were the only pathogens for which the selected ProSeqs permit detailed strain-level discrimination, as discussed in previous work (19, 20, 21). That work, which relied on manual analysis, found the microarray results to be in good agreement with conventional sequencing results for clinical samples. The results of running the CIBSI 2.0 program on the original microarray results with an updated NCBI database were compared with the previous findings (Table 2). The identified organisms were not identical to the original findings because of differences in the databases used. In fact, the conventional sequencing result provided by that work was found, for every sample, among the returns with the best score. For 8 of the 13 influenza A cases and 3 of the 12 influenza B cases, Tasks I and II found the conventional sequence to be the single best return, and it was therefore the identified organism. Because the database contains a large number of individual sequences for the hemagglutinin gene, it is not surprising that a single unique entry was not found in some cases. In each of the remaining 5 influenza A samples, the other returned sequences differed from the conventional sequence by less than 0.2%. For influenza B, fewer samples had a unique isolate identification because an older reference sequence was used for the ProSeq, which resulted in less hybridization (19). This also means that when multiple sequences were returned for a sample, they represented greater genetic variation, up to 2%. This comparison covers only the Task I level analysis for the hemagglutinin (HA) ProSeq, since hemagglutinin was the only region sequenced conventionally. No comparison can be made for the Task III results, because the previous work did not derive a consensus from multiple ProSeqs. As a result of the method currently used for Task III identification, the organism reported at this level was less specific (H3N2 or Flu B) for each sample (Supplementary Tables 1A and 1B). For the HAdV samples, the algorithm also reproduced the finer-scale discrimination previously made by manual methods (not shown).

Table 2 - Pathogens identified from the HA ProSeq for influenza A and influenza B clinical samples

* Multiple returns tied with this return

S HybSeq was split into multiple SubSeqs

The next example, detection of Mycoplasma pneumoniae (M. pneumoniae), illustrates a case in which only one ProSeq exists for the target pathogen. The organism identified in Task I therefore automatically becomes the result of Task II, and it is also the only ProSeq considered for the target pathogen in Task III. This ProSeq is not optimal for fine discrimination, because it was selected from a highly conserved region (345 bp) of the cytadhesin P1 gene. Forty microarrays were tested with the same stock of purified nucleic acid; in each case, M. pneumoniae or one of its substrain taxonomic database entries was the identified organism, tied for MaxScore. To understand these returns better, the database sequences were examined and divided into three groups, A, B and C, according to how they matched the reference sequence used to make the ProSeq. The placement of the database entries in these three groups was determined from a CLUSTAL alignment of the sequences of this gene. The alignment confirmed that the database entries differ from one another more significantly in regions not represented by the ProSeq, and that those regions contain enough variability to allow finer discrimination. The members of group A match the ProSeq exactly and cannot be distinguished from one another on the microarray. Similarly, the members of group B match the ProSeq except at position 199, where the base call is C rather than T. Group C contains several database entries that are more variable and may be distinguishable from the other entries within the ProSeq. In the 40 experimental tests of M. pneumoniae, up to 95% of the ProSeq hybridized, but only 65% of the results contained an unambiguous base call at position 199. When the call was unambiguous, it always matched the group B sequences. When an N base call was made at position 199, the sequences of groups A and B were returned with the same score. Regardless of this, the target pathogen was positively identified as M. pneumoniae for every sample tested.
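The effect of an N call at the discriminating position can be shown with a toy Python sketch. It replaces the BLAST bit score with a simple count of mismatches that skips N calls, which is an assumption made only for illustration. The eight-base sequences are invented, and index 4 stands in for position 199 of the real ProSeq.

```python
def mismatches(hybseq, reference):
    """Count positions where an unambiguous call differs from the reference.

    N calls carry no information and are skipped (toy scoring, not BLAST).
    """
    return sum(1 for h, r in zip(hybseq, reference) if h != "N" and h != r)


# Invented references: they differ only at index 4 (T in group A, C in group B).
group_a = "ACGTTGCA"
group_b = "ACGTCGCA"

clear_call = "ACGTCGCA"   # unambiguous C at the discriminating position
n_call = "ACGTNGCA"       # N at the discriminating position

print(mismatches(clear_call, group_a), mismatches(clear_call, group_b))
print(mismatches(n_call, group_a), mismatches(n_call, group_b))
```

With the unambiguous call, only the group B reference matches completely. With the N call, both references match equally well, which corresponds to the tie between groups A and B described above.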

These examples show that decision-making does not depend on whether a single ProSeq or multiple ProSeqs are used for a target pathogen. They also show that the achievable level of discrimination is determined by the quality of the selected ProSeqs. For some pathogens, fine-level discrimination is not required, and the ProSeqs currently selected for RPM v.1 provide satisfactory information. The CIBSI 2.0 algorithm demonstrated its ability to report automatically the highest level of discrimination that the HybSeq information supports.

Genetic Near Neighbors

To demonstrate how the algorithm handles genetically close species, an example involving a non-target pathogen was considered. For the biothreat agent Variola major virus, runs on RPM v.1 confirmed that Variola major virus template is usually positively identified when it is detected. The array carries two ProSeqs for detecting Variola major virus, taken from the hemagglutinin gene (VMVHA, about 500 bp) and the cytokine response modifier B gene (VMVcrmB, about 300 bp). Table 3 shows the results of 18 runs for each ProSeq, in which the near neighbor vaccinia virus was spiked into nasal wash at various concentrations. The percentage of each ProSeq that hybridized was substantial, so if only the hybridized regions were considered, one might conclude that these regions identify the presence of the target. This would suggest that the selected reference sequences were not the best choice. However, when the algorithm was applied, no sample was in fact identified as Variola major virus or Variola minor virus. Vaccinia was always among the Orthopoxvirus species listed with the highest score for the VMVcrmB ProSeq, but in only 7 cases was it uniquely identified as the probable species detected. In the three samples with the lowest concentration in which fragments of VMVcrmB hybridized, the ProSeq identified Variola major virus as one of several Orthopoxvirus species that could have caused the hybridization. For the amplification method used, the lower limit of detection lies between this concentration and the next higher one. The VMVHA ProSeq identified Orthopoxvirus species in only two experiments, and listed Variola major virus as a tied best-scoring return. In two cases, the VMVcrmB ProSeq specifically identified vaccinia virus as the best match. The percentage of each ProSeq that hybridized correlated with the concentration of the sample.

Table 3 - Organisms identified by the Variola major virus ProSeqs from vaccinia samples

* Only near neighbors within Orthopoxvirus, and not Variola major virus or Variola minor virus

CFU - colony forming unit
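The way a tie among several species leads to a less specific report can be sketched in Python as a search for the deepest taxon shared by all tied returns. This is an illustrative assumption about how a taxonomy lookup could be used, not the logic of the source listing. The lineages below are shortened to three levels and the function name is hypothetical; real NCBI taxonomy lineages contain more ranks.

```python
def common_taxon(lineages):
    """Return the deepest taxon shared by every lineage.

    Each lineage is a list ordered from the broadest group to the most specific.
    Returns None when the lineages share nothing (or the list is empty).
    """
    shared = None
    for level in zip(*lineages):
        if len(set(level)) != 1:
            break
        shared = level[0]
    return shared


# Shortened, illustrative lineages for tied best-scoring returns.
tied = [
    ["Poxviridae", "Orthopoxvirus", "Variola virus"],
    ["Poxviridae", "Orthopoxvirus", "Vaccinia virus"],
    ["Poxviridae", "Orthopoxvirus", "Camelpox virus"],
]

print(common_taxon(tied))
```

When the tied returns span several species of one genus, the sketch reports the genus rather than any single species, which mirrors the behaviour described for the near-neighbor samples.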

Filtering

This example demonstrates the importance of the filtering portion of the algorithm by considering the HybSeqs for the ProSeqs of the H1N1 neuraminidase (NA1) and matrix genes of the human influenza A/Puerto Rico/8/34 (H1N1) strain. Filtering is necessary because sending the HybSeq of a ProSeq as a single query biases the scoring against strains that have insertions or deletions relative to the ProSeq, particularly when BLAST parameters that maximize the use of base calls are applied. The sliding window test is the part of the algorithm that controls filtering. With filtering turned off, the whole HybSeq is used as a single subsequence for each of the two influenza ProSeqs that showed significant hybridization. From the HybSeq of the NA1 ProSeq, the A/Weiss/43 (H1N1) strain was identified as the most probable strain, whereas the HybSeq of the matrix ProSeq correctly identified A/Puerto Rico/8/34. To understand the source of the bias better, Figure 5 shows a CLUSTAL alignment of the NA1 genes of the two strains and the reference sequence used to obtain the ProSeq. The two strains show 95% identity (67 mismatches) over the 1362 aligned bases; however, relative to A/Puerto Rico/8/34 (SEQ ID NO. 3), both A/Weiss/43 (SEQ ID NO. 2) and the NA1 ProSeq (SEQ ID NO. 1) contain an inserted stretch of 45 bases. With default filtering, the NA1 ProSeq was split into 5 SubSeqs, because the algorithm encountered large stretches without base calls. In Task I, the algorithm determined that the 3 shorter SubSeqs had H1N1 as the identified organism, because many isolates including A/Puerto Rico/8/34 had the same best score, and that the other two SubSeqs had only the A/Puerto Rico/8/34 strain as the identified organism, because it was the best match. The organism identified by the NA1 ProSeq was A/Puerto Rico/8/34, because one of these SubSeqs had a much higher score. This ProSeq supported the same strain identification made by the matrix ProSeq. The organism was identified as A/Puerto Rico/8/34 because it was the only organism detected by both ProSeqs. With filtering, the correct target pathogen was detected; without filtering, the identification level of the target pathogen would have been influenza A (H1N1 subtype), because two organisms, A/Puerto Rico/8/34 and A/Weiss/43, would have been detected. Splitting a HybSeq into SubSeqs to remove bias can reduce the level of identification, as occurred in this case for 3 of the 5 SubSeqs. The earlier near-neighbor example is another case in which the wrong species (camelpox or Callithrix jacchus) would have been identified without filtering. The clinical samples in Table 2 show that HybSeqs split into multiple SubSeqs can still be identified very specifically.
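The idea of splitting a HybSeq into SubSeqs at large stretches without base calls can be sketched in Python as follows. This is a deliberate simplification: the source listing uses a sliding window test with score thresholds, whereas the sketch simply cuts at runs of N. The function name and the thresholds `max_n_run` and `min_len` are hypothetical.

```python
import re


def split_hybseq(hybseq, max_n_run=10, min_len=20):
    """Split a HybSeq at runs of N that are at least max_n_run long.

    Returns a list of (start_offset, subsequence) pairs, keeping only
    fragments of at least min_len characters.
    """
    subseqs = []
    start = 0
    for gap in re.finditer(r"N{%d,}" % max_n_run, hybseq):
        piece = hybseq[start:gap.start()]
        if len(piece) >= min_len:
            subseqs.append((start, piece))
        start = gap.end()
    tail = hybseq[start:]
    if len(tail) >= min_len:
        subseqs.append((start, tail))
    return subseqs


# Invented HybSeq: two called regions separated by long no-call stretches.
example = "ACGT" * 6 + "N" * 15 + "TTGCA" * 5 + "N" * 12 + "ACG"

for offset, piece in split_hybseq(example):
    print(offset, len(piece))
```

Each fragment would then be submitted as its own query, so that an insertion or deletion between fragments no longer penalizes the score of the correct strain.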

As a minor point, if amplification is performed with the multiplex strategy described in the methods rather than with a generic strategy, additional filtering is needed to remove potential bias caused by the specific primers. Figure 5 shows the raw results of an A/Puerto Rico/8/34 hybridization (SEQ ID NO. 4) and the results after masking filtering (SEQ ID NO. 5) as an example of this interference. In addition to the bias described earlier, a stretch of 18 base calls appears in the raw results at a position where the primers interact; these calls become N after filtering. If these base calls were included in the constructed subsequence, the query for this ProSeq would return an incorrect strain.
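The masking step can be sketched in Python as the conversion of listed positions to N. The sketch assumes that each entry gives a 1-based start position and a length, which parallels the `beg` and `str` fields read from "primerhyb.dat" in the source listing; the function name and the example sequence are hypothetical.

```python
def mask_primer_sites(hybseq, sites):
    """Replace base calls at primer interaction sites with N.

    sites: list of (start, length) pairs, with start counted from 1 (assumed).
    Ranges that run past the end of the sequence are clipped.
    """
    calls = list(hybseq)
    for start, length in sites:
        begin = max(start - 1, 0)
        end = min(start - 1 + length, len(calls))
        for i in range(begin, end):
            calls[i] = "N"
    return "".join(calls)


print(mask_primer_sites("ACGTACGTAC", [(3, 4)]))
```

Masking before subsequence construction means that calls produced by primer carry-over cannot contribute to any query, at the cost of discarding a few genuine calls at those positions.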

The algorithm successfully identifies pathogens at the most detailed level possible (species or strain), which depends on the quality of each ProSeq. This capability requires minimal input about the identity of the pathogens, so non-experts can also use it. The key feature that allows full automation is the use of a taxonomy database, which arranges organisms into ordered groups and provides the relationships between organism entries. This makes it possible to eliminate redundancy, to compare different related prototype sequences, and to simplify the presentation of the data. It also allows the largest database, NCBI, to be used successfully, even though that database is redundant and minimally curated, because it steadily receives updated and new sequence information. Although the algorithm has been demonstrated only with the NCBI database, other databases or user-built databases could readily be used, which might improve performance. The algorithm provides accurate identification at all analysis levels for pathogens that are less variable or that are represented by highly conserved ProSeqs. For more variable or rapidly mutating pathogens, such as influenza A virus, Tasks I and II still provide detailed identification, but Task III cannot report fine-scale discrimination. The comparison with conventionally sequenced influenza virus genes shows that the algorithm adjusts automatically to an updated database. The algorithm demonstrated that it correctly distinguishes hybridization on a ProSeq caused by genetically close (near neighbor) strains from hybridization caused by the specific pathogen without producing an incorrect identification, which removes one possible cause of false positives. The raw hybridization results are filtered to reduce computing time, to account for potential primer interference and, more importantly, to reduce potential bias. This simple integrated algorithm provides complete and accurate identification and can be used directly with RPM v.1 or with similar resequencing arrays and assays.

In addition to demonstrating the success of CIBSI 2.0, this work showed that developing the algorithm allows the importance of ProSeq selection to be observed closely. RPM v.1 is the first resequencing array designed specifically for pathogen detection that uses database similarity searching, and it served as the prototype for this application. It demonstrated that a single ProSeq as small as 100 bp can be sufficient to identify an organism unambiguously when it is designed correctly. However, it is clear that several longer ProSeqs provide better confirmation and more detailed information about a pathogen. With attention to this point in the design, the approach becomes generally applicable to any pathogen. Improving the performance of Task III may require more information about each pathogen, and may have to be developed for each specific pathogen or class of pathogens. The algorithm may also need this information to determine which differences between a sample and the database entries represent significant mutations. This level of data analysis can readily be incorporated and built on the analysis already performed. Correctly designed resequencing microarrays, together with this automated detection algorithm, make it possible to develop assays that monitor many organisms simultaneously while providing accurate strain-level discrimination, which yields detailed strain identification, antibiotic resistance markers and pathogenicity information. Partial sequence information from multiple organisms can be used in applications such as differential diagnosis of illnesses with multiple possible causes (for example, febrile respiratory illness), tracking the emergence of pathogens, distinguishing biothreat agents from harmless genetic near neighbors in surveillance applications, and tracking co-infections or superinfections. The concept of classification, and of reporting identifications to a degree that depends on the quality of the sample and of the target sequence set, is not limited to resequencing microarrays. It applies more generally to any platform that returns sequence-level calls that can be used to query a reference DNA database. As the trend toward assays that test for multiple pathogens grows, automated analysis tools such as the one described here become more important for providing rapid identification in a simple form that non-experts can use routinely.

Source code

The following is a Perl source code listing of one embodiment of the disclosed method. "overclinical" is the top-level program that runs the other programs. "fstorepi" performs the filtering, prepares the subsequences and writes the query files. It uses the input file "primerhyb.dat", which contains the predetermined list of positions that are converted to N. "runblast" performs the BLAST queries. "dbparse" performs the taxonomic analysis. It uses the input file "chip1pathogengroups", which contains the list of target pathogens for each ProSeq.

overclinical

#!/usr/bin/perl
use File::Copy;

my $win = 20;
my $debug = 0;
my $nq = "-usequ y";
$unam = "";

# Parse command line options: -file <name> and -debug <level>.
$i = 0;
while ($i <= $#ARGV) {
    if (index($ARGV[$i], "-") + 1) {
        if (substr($ARGV[$i], 1) eq "file") {
            $unam = $ARGV[$i+1]; $i++;
        } elsif (substr($ARGV[$i], 1) eq "debug") {
            $debug = $ARGV[$i+1]; $i++;
        }
        $i++;
    } else {
        $i++;
    }
}

if (-s $unam) {
    # Strip the extension to obtain the base name used for all derived files.
    if (($l = index($unam, ".")) + 1) {
        $inp = substr($unam, 0, $l);
    } else {
        $inp = $unam;
    }
    $rmfst = 0;
    if (!($unam eq join("", $inp, ".fst"))) {
        copy($unam, join("", $inp, ".fst"));
        $rmfst = 1;
    }
    $time = "/usr/bin/ltime";
    $ftime = join("", $inp, ".tim");
    unlink $ftime;
    # Filter the hybridization results and build the query files.
    system "./fstorepi -inp $inp -debug $debug";
    # Run BLAST on both query sets, with word sizes 11 and 7.
    $inp2 = join("", $inp, "b1");
    system "echo Blastinq $inp2 >> $ftime";
    system "$time -a -o $ftime ./runblast $nq -inp $inp2 -W 11";
    $inp2 = join("", $inp, "b2");
    system "echo Blastinq $inp2 >> $ftime";
    system "$time -a -o $ftime ./runblast $nq -inp $inp2 -W 7";
    # Taxonomic analysis of the BLAST returns.
    system "echo Identifying $inp >> $ftime";
    system "$time -a -o $ftime ./dbparse -inp $inp -debug $debug -targ 0 > $inp.err 2>&1";
    unlink(join("", $inp, ".fst")) if ($rmfst);
} else {
    print "Filename not found: $unam\n";
}
exit;

fstorepi

#!/usr/bin/perl

my $win = 20;
my $probe = 21;
$debug = 0;
$unam = "";
$mci = 0;

# Parse command line options: -inp <name>, -debug <level>, -snp <flag>.
$i = 0;
while ($i <= $#ARGV) {
    if (index($ARGV[$i], "-") + 1) {
        if (substr($ARGV[$i], 1) eq "inp") {
            $unam = $ARGV[$i+1]; $i++;
        } elsif (substr($ARGV[$i], 1) eq "debug") {
            $debug = $ARGV[$i+1]; $i++;
        } elsif (substr($ARGV[$i], 1) eq "snp") {
            $mci = $ARGV[$i+1]; $i++;
        }
        $i++;
    } else {
        $i++;
    }
}

$ch = join("", $unam, ".fst");
if (!(-s $ch)) { print "Filename not Found: $ch\n"; exit; }

# Read the predetermined list of positions to be converted to N.
# The field delimiter was lost in extraction; whitespace is assumed.
%hmi = ();
if (open(INP, "primerhyb.dat")) {
    while ($line = <INP>) {
        @a = split(" ", $line);
        $hmi{$a[0]}{$a[3]}{str} = $a[1];
        $hmi{$a[0]}{$a[3]}{beg} = $a[2];
    }
    close(INP);
}

$ch = join("", $unam, ".fst");
system("/bin/ls $ch > flist");
open(INP, "flist");
@mast = <INP>;
close(INP);
unlink "flist";
chomp @mast;

# Read the reference (tile) sequences of the chip.
$search = ">";
$Input = "chip1tile.fasta";
open(INP, $Input) || die;
@inplist = ();
while ($line = <INP>) { $line =~ tr/\cM//; push(@inplist, $line); }
close(INP);
chomp @inplist;
@inchip = filterparse(\@inplist);

for $Input (@mast) {
    print "parsing $Input\n" if ($debug);
    $Output = join("", ">", substr($Input, 0, -4), "b1.blai");
    $Outpu2 = join("", ">", substr($Input, 0, -4), "b2.blai");
    open(INP, $Input);
    @inplist = ();
    while ($line = <INP>) { $line =~ tr/\cM//; push(@inplist, $line); }
    close(INP);
    chomp @inplist;
    @in = filterparse(\@inplist);
    frcdet($mci, substr($Input, 0, -4), \%hmi, @inchip, @in);
    ($olist, $olis2) = filter($win, $probe, \%hmi, @in);
    open(OUT, $Output);
    print OUT $olist;
    close(OUT);
    open(OUT, $Outpu2);
    print OUT $olis2;
    close(OUT);
}
exit;

# Parse FASTA-style records from the global @inplist into name and sequence lists.
sub filterparse {
    my ($inplist) = @_;
    $j = 0;
    my @name = ();
    my @val = ();
    while ($j < $#inplist + 1) {
        # if (index($inplist[$j], "AFFX-TagIQ-EX") + 1) { $j++; }
        # if (index($inplist[$j], "ATTIM") + 1) { $j++; }
        # if (index($inplist[$j], "ATNAC") + 1) { $j++; }
        if (index($inplist[$j], $search) + 1) {
            $i = 0;
            # The split pattern was lost in extraction; a single space is assumed.
            push(@name, join("`", split(/ /, $inplist[$j])));
            $j++;
            $temp = "";
            # Concatenate sequence lines until the next header.
            while (!(index($inplist[$j], $search) + 1) && $j < $#inplist + 1) {
                $inplist[$j] =~ tr/\cM//;
                for ($inplist[$j]) {
                    s/^\s+//; s/\s+$//; s/\s+//g;
                }
                $temp2 = join("", $temp, $inplist[$j]);
                $temp = $temp2;
                $j++;
            }
            $temp =~ tr/catgN/CATGn/;
            push(@val, $temp);
        } else {
            $j = $j + 1;
        }
    }
    return (\@name, \@val);
}

# Mask primer positions, flag calls that differ from the reference,
# and write the fraction of each ProSeq that produced base calls.
sub frcdet {
    my ($mci, $inp, $hmi, $nam3, $val3, $name, $val) = @_;
    my $Output = join("", ">", $inp, ".frc");
    if ($mci) { open(OUT, join("", ">", $inp, ".Bfl")); }
    for ($j = 0; $j <= $#{$name}; $j = $j + 1) {
        $l = index($$name[$j], ":");
        $current = substr($$name[$j], 1, $l - 1);
        if (!(index($current, "AFFX") + 1)) {
            print OUT "$current\n" if ($mci);
            # Find the matching reference sequence.
            for ($i = 0; $i <= $#{$nam3}; $i = $i + 1) {
                if ((index($$nam3[$i], $current) + 1)) { $ku = $i; $i = $#{$nam3} + 1; }
            }
            @ma = split(//, $$val3[$ku]);
            @mc = split(//, $$val[$j]);
            $mkey = substr($$name[$j], 1, index($$name[$j], ":") - 1);
            if (exists ${$hmi}{$mkey}) {
                for $skey (keys %{${$hmi}{$mkey}}) {
                    $len = length(${$hmi}{$mkey}{$skey}{str});
                    if ($len > 2) {
                        $beg = ${$hmi}{$mkey}{$skey}{beg} - 1;
                        $end = $beg + $len;
                        # The source listing clamps to $#use here, which is not
                        # defined in this sub; $#mc is the array being masked.
                        if ($end > $#mc) { $end = $#mc; }
                        for ($jm = $beg; $jm < $end; $jm++) { $mc[$jm] = "N"; }
                    }
                }
            }
            # Lower-case every call that differs from the reference.
            for ($i = 0; $i <= $#ma; $i++) {
                if (!(index($mc[$i], $ma[$i]) + 1)) {
                    $mc[$i] =~ tr/CATG/catg/;
                    if ($mc[$i] ne "n" && $mc[$i] ne "N") {
                        print OUT $i + 1, "$mc[$i]\n" if ($mci);
                    }
                }
            }
            $line = join("", @mc);
            $$val[$j] = $line;
        }
    }
    close(OUT) if ($mci);
    if ($mci) {
        open(OUT, join("", ">", $inp, ".Bfst"));
        $j = 0;
        while ($j < $#{$name} + 1) {
            # print "$j $$name[$j]\n";
            print OUT "$$name[$j]\n";
            print OUT "$$val[$j]\n";
            $j++;
        }
        close(OUT);
    }
    open(OUT, $Output);
    open(OUT2, join("", ">", $inp, ".frc2"));
    $tocj = 0; $toci = 0;
    $dir = 0;
    $outline = "";
    $outlin2 = "$Input\n";
    for ($j = 0; $j <= $#{$name}; $j = $j + 1) {
        $l = index($$name[$j], ":");
        $current = substr($$name[$j], 1, $l - 1);
        $cou = 0; $cou2 = 0; $bcou = 0; $l = length($$val[$j]);
        $cou = $$val[$j] =~ tr/n/n/;       # no-calls
        $cou2 = $$val[$j] =~ tr/N/N/;      # masked positions
        $bcou = $$val[$j] =~ tr/c/c/;      # calls that differ from the reference
        $bcou = $bcou + ($$val[$j] =~ tr/t/t/);
        $bcou = $bcou + ($$val[$j] =~ tr/a/a/);
        $bcou = $bcou + ($$val[$j] =~ tr/g/g/);
        $l = $l - $cou2;
        $cou = $l - $cou;
        $fr = $cou / $l * 100;
        $temp = join("", $outline, sprintf("%s`%s`%f`%g`%g`%g\n", $inp, $current, $fr, $cou, $bcou, $l));
        $outline = $temp;
        if ($fr > 20 && 100 * $bcou / ($fr * $l) < .25) {
            $temp = join("", $outlin2, sprintf("%14s %5.1f %6g %6g\n", $current, $fr, $bcou, $l));
            $outlin2 = $temp;
        }
    }
    print OUT $outline;
    print OUT2 $outlin2;
    close(OUT);
    close(OUT2);
    unlink join("", $inp, ".frc2");
    return;
}

# Sliding window filter: split each HybSeq into SubSeqs and build the query lists.
# Several closing braces were lost in extraction; their placement is reconstructed.
sub filter {
    my ($win, $probe, $hmi, $name, $val) = @_;
    my $olist = ""; my $olis2 = "";
    for ($jm = 0; $jm <= $#{$name}; $jm++) {
        my $tmolist = ""; my $tmolis2 = "";
        # Blank out control sequences.
        if (index($$name[$jm], "AFFX-TagIQ-EX") + 1) { $$val[$jm] = ""; }
        if (index($$name[$jm], "ATTIM") + 1) { $$val[$jm] = ""; }
        if (index($$name[$jm], "ATNAC") + 1) { $$val[$jm] = ""; }
        my @nuse = ();
        $$val[$jm] =~ tr/CTAGn/ctagN/;
        my @use = split(//, $$val[$jm]);
        my $nummat = 0;
        $mkey = substr($$name[$jm], 1, index($$name[$jm], ":") - 1);
        # Mask the primer positions listed in primerhyb.dat.
        if (exists ${$hmi}{$mkey}) {
            for $skey (keys %{$$hmi{$mkey}}) {
                $len = length(${$hmi}{$mkey}{$skey}{str});
                if ($len > 2) {
                    $beg = ${$hmi}{$mkey}{$skey}{beg} - 1;
                    $end = $beg + $len;
                    if ($end > $#use) { $end = $#use; }
                    for ($j = $beg; $j < $end; $j++) { $use[$j] = "N"; }
                }
            }
        }
        # 1 marks a base call, 0 marks an N.
        for ($j = 0; $j <= $#use; $j++) {
            if (!($use[$j] eq "N")) {
                push(@nuse, "1"); $nummat++;
            } else {
                push(@nuse, "0");
            }
        }
        my $pri = 0;
        if ($nummat > 20) {
            for ($j = 0; $j <= $#use; $j++) {
                # if ($nuse[$j] == 1 && $nuse[$j+1] == 1) {
                if ($nuse[$j] == 1) {
                    my $current = substr($$val[$jm], $j, $win);
                    my $score = 1;
                    my $zscore;
                    for ($k = $j + 1; $k < $j + $win; $k++) { $score = $score + $nuse[$k]; }
                    # if ($score > 4 && !(index($current, "NNNN") + 1)) {
                    if ($score > 4) {
                        my $len = $win;
                        $k = $j + ($win - 1);
                        # The source listing reads "$in" here; $win is assumed.
                        if ($probe > $win) {
                            while ($len < $probe && $score / $len > .4) {
                                $k++; $len++;
                                $score = $score + $nuse[$k];
                            }
                            $zscore = $score;
                        } else {
                            $zscore = $score;
                            for ($jl = $k - $win; $jl < $k - $probe; $jl++) { $zscore = $zscore - $nuse[$jl]; }
                        }
                        # Extend the window while enough positions are called.
                        while ($score / $len > .4 && $zscore > 3) {
                            $k++; $len++;
                            $score = $score + $nuse[$k];
                            $zscore = $zscore - $nuse[$k - $probe] + $nuse[$k];
                        }
                        # Trim trailing N positions and isolated trailing calls.
                        $len--; $jl = $j + $len - 1;
                        if ($nuse[$jl] == 1 && ($nuse[$jl-1] + $nuse[$jl-2] + $nuse[$jl-3] + $nuse[$jl-4]) < 1) {
                            $jl--; $len--;
                        }
                        while ($nuse[$jl] < 1) {
                            $jl--; $len--;
                            if ($nuse[$jl] == 1 && ($nuse[$jl-1] + $nuse[$jl-2] + $nuse[$jl-3] + $nuse[$jl-4]) < 1) {
                                $jl--; $len--;
                            }
                        }
                        $current = substr($$val[$jm], $j, $len);
                        if ($current =~ m/[ctgaCTGA]{7}/) {
                            if ($len > 100) {
                                $pri++;
                                if ($score > 80 && $current =~ m/[ctgaCTGA]{11}/) {
                                    $tmolist = join("", $tmolist, $$name[$jm], ":", $j, "\n", $current, "\n");
                                } else {
                                    $tmolis2 = join("", $tmolis2, $$name[$jm], ":", $j, "\n", $current, "\n");
                                }
                                $j = $k - 1;
                            } elsif ($len > 30) {
                                if ($score / $len >= (.95 - .25 * ($len - 30) / (70.))) {
                                    $pri++;
                                    # Both branch bodies are empty in the source listing.
                                    if ($score > 80 && $current =~ m/[ctgaCTGA]{11}/) {
                                    } else {
                                    }
                                    $j = $k - 1;
                                }
                            } elsif ($len > 20) {
                                if ($score / $len >= .95) {
                                    $pri++;
                                    $j = $k - 1;
                                }
                            } else {
                            }
                            # Statement order as in the source listing.
                            $j = $k - ($win - 1);
                        }
                    }
                }
            }
        }
        if ($pri > 1) {
            $olist = join("", $olist, $tmolist);
            $olis2 = join("", $olis2, $tmolis2);
        } elsif ($pri > 0) {
            $olist = join("", $olist, $tmolist);
            my @a = split("\n", $tmolis2);
            if ($a[1] =~ m/[ctgaCTGA]{11}/) {
                $olist = join("", $olist, $tmolis2);
            }
        }
    }
    return ($olist, $olis2);
}

runblast

#!/usr/bin/perl
# -*-perl-*-
#
# perl script to handle filtering originally implemented in JAVA REPI and
# carry out blastall
# Created 7/12/2005 - For Xserve cluster so execution of
#   blast search done through SGE job queue
#
use File::Copy;
{
    $user = getlogin();
    my $debug = 0;
    my $noque = 1;
    my $srcdir = "./";
    my $tmpdir = "/common/scratch/repi/";
    my (@in, $eltime, $Input, $infile, $outfile, $insge, $outsge, $errsge, $idsge, $cmde, $cmdf,
        $cmdw, $cmda, $cmdo, $cmd, $cmdb, $cmdfi, $err);
    $ENV{SGE_ROOT} = "/common/sge";
    # Extract the command line options for this script the ones to be passed to blastall.
    my %param = ();
    $cmdblast = "/common/bin/blastall";
    $param{e} = "0.0001";
    $param{F} = join("", "\\\"", "m D", "\\\"");
    $param{W} = "7";
    $param{m} = "8";
    $param{r} = "1";
    $param{q} = "-1";
    $param{b} = "750";
    $param{d} = "/common/data/nt";
    my $php = 0;
    my $repiparse = 0;
    my $win = 0;
    $i = 0;
    while ($i <= $#ARGV) {
        if (index($ARGV[$i], "-inp") + 1) {
            $Input = $ARGV[$i+1]; $i++;
        } elsif (index($ARGV[$i], "-debug") + 1) {
            $debug = $ARGV[$i+1]; $i++;
        } elsif (index($ARGV[$i], "-usequ") + 1) {
            $noque = 0 if (index($ARGV[$i+1], "y") + 1 || index($ARGV[$i+1], "Y") + 1); $i++;
        } elsif (index($ARGV[$i], "-") + 1) {
            # Any other option is passed through to blastall.
            if (substr($ARGV[$i], 1) eq "F") {
                if ($ARGV[$i+2] eq "-U") {
                    $param{substr($ARGV[$i], 1)} = join("", "\\\"", $ARGV[$i+1], "\\\" ", $ARGV[$i+2]); $i = $i + 2;
                } else {
                    $param{substr($ARGV[$i], 1)} = join("", "\\\"", $ARGV[$i+1], "\\\""); $i++;
                }
            } else {
                $param{substr($ARGV[$i], 1)} = $ARGV[$i+1]; $i++;
            }
        }
        $i++;
    }
    print "beginning repiv2 with input filename $Input\n" if ($debug);
    if ($noque) {
        # Run blastall directly.
        $infile = join("", $Input, ".blai");
        $outfile = join("", $Input, ".blao");
        $cmd = "$cmdblast -p blastn -i $infile -o $outfile";
        for $mkey (keys %param) {
            $cmd = join("", $cmd, " -", $mkey, " ", $param{$mkey});
        }
        # Run command
        system "$cmd";
    } else {
        # construct names of required files and remove preexisting versions of files
        # files for final report
        my $errfile = join("", $srcdir, $Input, ".err");
        unlink $errfile;
        unlink join("", $srcdir, $Input, ".blao");
        # files for blast command
        $infile = join("", $tmpdir, $Input, ".itmp");
        $outfile = join("", $tmpdir, $Input, ".otmp");
        unlink($infile, $outfile);
        # files for batch queue
        $insge = join("", "cgisge-", $Input);
        opendir(DIR, $tmpdir);
        $a = join("", "cgisge-", $Input);
        @files = grep { /$a/ } readdir(DIR);
        unlink($tmpdir . $_) for (@files);
        closedir(DIR);
        $insge = join("", $tmpdir, $insge);
        # This line is garbled in the source listing; a ".id" suffix is assumed.
        $idsge = join("", $insge, ".id");
        # check for error in parameters list
        # errordie($errfile, $err, $php) if ($err);
        if (!(-s join("", $srcdir, $Input, ".blai"))) {
            copy(join("", $srcdir, $Input, ".blai"), join("", $srcdir, $Input, ".blao"));
            exit;
        }
        # create input blast file
        copy(join("", $srcdir, $Input, ".blai"), $infile);
        # join list of variable parameters to Basic command with input-output file names
        $cmd = "$cmdblast -p blastn -i $infile -o $outfile";
        for $mkey (keys %param) {
            $cmd = join("", $cmd, " -", $mkey, " ", $param{$mkey});
        }
        # Run command
        # additional code required to generate results working through sge
        open(OUT, ">$insge") or errordie($errfile, "Failed to write batch queue script\n $insge\n", $php);
        print OUT "#!/usr/bin/perl\n";
        print OUT "system(\"$cmd\");\n";
        print OUT "open(OUT,\">$idsge\");\n";
        print OUT "print OUT \"\$ENV{JOB_ID}\";\n";
        print OUT "close OUT;\n";
        close OUT;
        ($sec, $min, $hour, $day, $month, $year, $dweek, $dyear, $dsave) = localtime(time);
        $eltime = join("", "Summary of results\n refer to .blao file for raw blast results\n Start ", $hour, ":", $min, ":", $sec);
        $i = `/common/sge/bin/darwin/qsub -sync y -o $tmpdir -e $tmpdir -s /usr/bin/perl $insge`;
        @a = split(" ", $i); $id = $a[2];
        $outsge = join("", $insge, ".o", $id);
        $errsge = join("", $insge, ".e", $id);
        unlink($insge, $idsge);
        $eltime = join("", $eltime, " END ", $hour, ":", $min, ":", $sec, "\n");
        # print the output to the file
        if (-e $outfile) {
            move($outfile, join("", $srcdir, $Input, ".blao"));
            # remove input and output files
            unlink($outsge, $errsge, $infile, $outfile);
        } else {
            open(OUTFILE, "$errsge"); my @errput = <OUTFILE>; close OUTFILE;
            $err = join "", @errput;
            # remove input and output files
            errordie($errfile, $err, $php);
        }
    }
    exit;
}

sub errordie {
    my ($file, $err, $php) = @_;
    open(INP, ">$file") or die "can\'t open file error file";
    print INP "ERROR:\n $err";
    exit;
}

dbparse

#!/usr/bin/perl
# -*-perl-*-
#
use strict;
use warnings;
use DBI;

my $usetarsuc = 1;
my $bound = 1e-6;
my $debug = 0;
my $unam = "";

# Parse command line options: -inp <name>, -debug <level>, -targ <flag>.
my $i = 0;
while ($i <= $#ARGV) {
    if (index($ARGV[$i], "-") + 1) {
        if (substr($ARGV[$i], 1) eq "inp") {
            $unam = $ARGV[$i+1]; $i++;
        } elsif (substr($ARGV[$i], 1) eq "debug") {
            $debug = $ARGV[$i+1]; $i++;
        } elsif (substr($ARGV[$i], 1) eq "targ") {
            $usetarsuc = $ARGV[$i+1]; $i++;
        }
        $i++;
    } else {
        $i++;
    }
}

# List of target pathogens for each ProSeq.
my $chippatho = "chip1pathogengroups";
open(INP, $chippatho) or die "Failed to open input file $chippatho\n";
my %patho = (); my $line;
while ($line = <INP>) {
    $line =~ tr/\cM//;
    chomp $line;
    my @a = split("`", $line);
    for (my $i = 1; $i <= $#a; $i++) {
        my @b = split(",", $a[$i]);
        my $c = join("", ",", $b[1], ",");
        $patho{$a[0]}{$c}{name} = $b[0];
        $patho{$a[0]}{$c}{stat} = $b[2];
    }
}
close(INP);

# Expected targets for individual runs.
$chippatho = "targets.runs";
open(INP, $chippatho) or die "Failed to open input file $chippatho\n";
my $temptar = 0;
my %target = ();
while ($line = <INP>) {
    $line =~ tr/\cM//;
    chomp $line;
    my @a = split("`", $line);
    for (my $i = 1; $i <= $#a; $i++) {
        my @b = split(",", $a[$i]);
        my $c = join("", ",", $b[1], ",");
        $target{$a[0]}{$c}{name} = $b[0];
        $temptar = 1 if (index($unam, $a[0]) + 1);
    }
}
if ($usetarsuc && !$temptar) {
    $usetarsuc = 0; print "have not indicated target for $unam\n";
} elsif (!$usetarsuc && $temptar) {
    print "have indicated target for $unam but not using\n";
}
close(INP);

# Optional list of taxonomy exemptions.
$chippatho = "chip1taxexempt";
my %taxexempt = ();
if (open(INP, $chippatho)) {
    while ($line = <INP>) {
        $line =~ tr/\cM//;
        chomp $line;
        my @a = split("`", $line);
        $taxexempt{$a[0]} = $a[1];
    }
    close(INP);
} else {
    print "No $chippatho file\n";
}

my @mast = (join("", $unam, ".blao"));
my $dbh = openconnection();
for my $Input (@mast) {
    my @a = split("/", substr($Input, 0, -5));
    my $SInput = $a[$#a];
    if (!exists $target{$SInput}) {
        $target{$SInput}{1}{taxid} = -999;
        $target{$SInput}{1}{name} = "unknown";
    }
    print "parsing $Input \n" if ($debug);
    outfilter(substr($Input, 0, -5), $SInput, my $eltime, \%patho, \%target, \%taxexempt, $bound);
}
closeconnection($dbh);
exit;

sub outfilter {
    use strict;
    my ($Input, $SInput, $eltime, $patho, $target, $taxexempt, $bound) = @_;
    my %ha = ();
    my %iha = ();
    my %liac = ();
    my $Output = join("", ">", $Input, ".rdb");
    my %frclist = (); my $line;
    # Read the per-sequence base-call statistics, if present.
    if (open(INP, join("", $Input, ".frc"))) {
        while ($line = <INP>) {
            $line =~ tr/\cM//d;    # strip carriage returns
            chomp $line;
            my @a = split("`", $line);
            $frclist{$a[1]}{per} = $a[2];
            $frclist{$a[1]}{mat} = $a[3];
            $frclist{$a[1]}{mis} = $a[4];
            $frclist{$a[1]}{len} = $a[5];
        }
        close(INP);
    }

    # Gather the BLAST input (query) files for this run.
    unlink "flist";
    unlink "flisterr";
    my $ch = join("", $Input, "b?.blai ", $Input, ".blai");
    system("(/bin/ls $ch > flist) >& flisterr");
    open(INP, "flist");
    my @mast = <INP>;
    close(INP);
    unlink "flist";
    unlink "flisterr";
    chomp @mast;
    my @inplist = ();
    if (!@mast) { print "Files for $Input not found\n"; exit; }
    for my $Inp (@mast) {
        open(INP, $Inp);
        while ($line = <INP>) {
            $line =~ tr/\cM//d;    # strip carriage returns
            push(@inplist, $line);
        }
        close(INP);
    }
    chomp @inplist;

    # Parse the FASTA-formatted queries into %iha, keyed by prototype name
    # and subsequence start offset.
    my $j = 0;
    my $sear = ">";
    while ($j <= $#inplist) {
        if (index($inplist[$j], ">") + 1) {
            my $mkey = substr($inplist[$j], 1, index($inplist[$j], ":") - 1);
            my @a = split(" ", $inplist[$j]);    # delimiter assumed to be whitespace
            my @b = split(":", $a[0]);
            my $mnkey = 0;
            $mnkey = $b[2] if ($#b > 1);
            $j++;
            my $temp = "";
            my $temp2 = $temp;
            while ($j < $#inplist + 1 && !(index($inplist[$j], $sear) + 1)) {
                $inplist[$j] =~ tr/\cM//d;    # strip carriage returns
                for ($inplist[$j]) {
                    s/^\s+//; s/\s+$//; s/\s+//g;
                }
                $temp2 = join("", $temp, $inplist[$j]);
                $j++;
                $temp = $temp2;
            }
            $temp =~ tr/catgN/CATGn/;
            $iha{$mkey}{$mnkey} = $temp;
        } else {
            $j++;
        }
    }

    # Gather the BLAST output files for this run.
    $ch = join("", $Input, "b?.blao ", $Input, ".blao");
    system("(/bin/ls $ch > flist) >& flisterr");
    open(INP, "flist");
    @mast = <INP>;
    close(INP);
    unlink "flist";
    unlink "flisterr";
    chomp @mast;
    @inplist = ();
    for my $Inp (@mast) {
        open(INP, $Inp);
        while ($line = <INP>) {
            $line =~ tr/\cM//d;    # strip carriage returns
            push(@inplist, $line);
        }
        close(INP);
    }
    chomp @inplist;

    # Parse each tabular BLAST hit into %ha.
    $j = 0;
    while ($j <= $#inplist) {
        my $mkey = substr($inplist[$j], 0, index($inplist[$j], ":"));
        my @a = split(" ", $inplist[$j]);    # delimiter assumed to be whitespace
        my @b = split(":", $a[0]);
        my $mnkey = 0;
        $mnkey = $b[2] if ($#b > 1);
        my @temp = split(/\|/, $a[1]);
        my $skey = $temp[$#temp];
        $line = $skey;
        my $gi = -1;
        if (index($a[1], "gi\|") + 1) {
            $gi = $temp[1];
            $line = $temp[3];
            $skey = $gi;
            if ((my $i = index($skey, "\.")) + 1) { $skey = substr($skey, 0, $i); }
        }
        chomp $skey;
        $ha{$mkey}{$mnkey}{"$skey"}{line} = $line;
        $ha{$mkey}{$mnkey}{"$skey"}{scor} = $a[11];
        $ha{$mkey}{$mnkey}{"$skey"}{eval} = $a[10];
        $ha{$mkey}{$mnkey}{"$skey"}{gi} = $gi;
        # Aligned portion of the query, used to count N base calls.
        my $temp = substr($iha{$mkey}{$mnkey}, ($a[6] - 1), ($a[7] - $a[6] + 1));
        $a[6] = $a[6] + $mnkey;
        $a[7] = $a[7] + $mnkey;
        shift @a;
        $a[0] = $a[1]; $a[1] = $a[2]; $a[2] = $a[3];
        $a[3] = $temp =~ tr/nN/nN/;    # number of N calls in the aligned region
        $a[3] = $a[2] - $a[3];         # mismatches not attributable to N calls
        $ha{$mkey}{$mnkey}{"$skey"}{out} = join(" ", "", @a);    # delimiter assumed to be whitespace
        $j++;
    }

    # Look up taxonomy information for every GI number not yet resolved.
    foreach my $mkey (keys %ha) {
        my ($sec, $min, $hour, $day, $month, $year, $dweek, $dyear, $dsave);
        my @low = ();
        foreach my $mnkey (sort keys %{$ha{$mkey}}) {
            foreach $line (sort keys %{$ha{$mkey}{$mnkey}}) {
                if ($ha{$mkey}{$mnkey}{$line}{gi} < 0) {
                    $liac{$line} = "-1`No database entry";
                } else {
                    push(@low, $ha{$mkey}{$mnkey}{$line}{gi}) if (!exists $liac{$line});
                }
            }
        }
        if (1) {
            # Local taxonomy database and local nt database.
            for my $giid (@low) {
                my $taxid = taxfromgi($giid);
                if ($taxid > -1) {
                    my $ch = taxparents($taxid);
                    if (!(index($ch, ",28384,") + 1)) {
                        my @a = split(/,/, substr($ch, 1, -1));
                        my $comname = `/common/bin/fastacmd -d /common/data/nt -s $giid -L 0,3`;
                        @a = split(" ", $comname); shift @a; $comname = join(" ", @a);    # delimiter assumed to be whitespace
                        @a = split(";", $comname);
                        my @b = split(",", $a[0]);
                        $comname = $b[0];
                        $liac{$giid} = join("`", $taxid, taxname($taxid), $comname, $ch);
                    } else {
                        $liac{$giid} = -2;
                    }
                } else {
                    # print "no tax for gi $giid\n";
                    $liac{$giid} = "-1`No database entry";
                }
            }
        } else {
            # Alternative path: query GenBank through BioPerl.
            use Bio::Seq;
            use Bio::DB::GenBank;
            my $gb = new Bio::DB::GenBank();
            my $seqio = ();
            $seqio = $gb->get_stream_by_id([@low]);
            while ((my $seq1 = $seqio->next_seq())) {
                if ($seq1->alphabet() ne 'protein') {
                    $line = $seq1->primary_id();
                    my $spe = $seq1->species();
                    my $ch = join(";", $spe->classification());
                    if (!(index($ch, "artificial") + 1)) {
                        my @a = split(";", $seq1->desc());
                        my @b = split(",", $a[0]);
                        my $comname = $b[0];
                        my $taxid = $spe->ncbi_taxid();
                        $liac{$line} = join("`", $taxid, $spe->common_name(), $comname);
                    } else {
                        $liac{$line} = -2;
                    }
                }
            }
            for $line (@low) {
                if (!(exists $liac{$line})) {
                    $liac{$line} = "-1`No database entry";
                }
            }
        }
    }

    open(OUT, $Output);
    my ($sec, $min, $hour, $day, $month, $year, $dweek, $dyear, $dsave) = localtime(time);
    $eltime = join("", "Time ", $hour, ":", $min, ":", $sec, " Date ", $month + 1, "/", $day, "/", $year + 1900, "\n");
    my $ltime = sprintf("%02g:%02g:%02g;%02g/%02g/%4g", ($hour, $min, $sec, $month + 1, $day, $year + 1900));
    print OUT $eltime;
    print OUT "(Probe name`Experiment name`run version`taxid`Common name`Description`pathogen match`target match`Accession number`gi number`%identity`alignment length`mismatches`nonN mismatches`gaps`st.start`st.end`fs.start`fs.end`e-value`,bit score`rank`time;Date)\n";

    my %subtile = (); my %tilerslt = ();
    foreach my $mkey (keys %ha) {
        foreach my $mnkey (sort keys %{$ha{$mkey}}) {
            my %alkey = ();
            my $currank = 2000000000000; my $rank = 0;
            my $maxrank = 0; my $scale = 1.0;
            # Highest bit score for this subsequence.
            foreach $line (sort { $ha{$mkey}{$mnkey}{$b}{scor} <=> $ha{$mkey}{$mnkey}{$a}{scor} } keys %{$ha{$mkey}{$mnkey}}) {
                $maxrank = $ha{$mkey}{$mnkey}{$line}{scor}; last;
            }
            # Walk the hits from highest to lowest score and assign ranks.
            foreach $line (sort { $ha{$mkey}{$mnkey}{$b}{scor} <=> $ha{$mkey}{$mnkey}{$a}{scor} } keys %{$ha{$mkey}{$mnkey}}) {
                my @a = split(" ", $ha{$mkey}{$mnkey}{$line}{out});    # delimiter assumed to be whitespace
                my $out1 = "$a[1]`$a[2]`$a[3]`$a[5]`$a[6]`$a[7]`$a[8]`$a[$#a-1]`$a[$#a]";
                my $out2 = join("`", @a);
                my @taxinfo = split("`", $liac{$line});    # fields were joined with "`"
                my $tilsuc = -1;
                my $tarsuc = 0;
                if (exists $liac{$line} && $taxinfo[0] > -1) {
                    my $taxin = $taxinfo[3];
                    for my $tarkey (sort { ${$patho}{$mkey}{$a}{stat} <=> ${$patho}{$mkey}{$b}{stat} } keys %{${$patho}{$mkey}}) {
                        if (index($taxinfo[3], $tarkey) + 1) {
                            $tilsuc = ${$patho}{$mkey}{$tarkey}{stat};
                            # $taxin = ${$patho}{$mkey}{$tarkey}{name};
                            last;
                        }
                    }
                    if ($usetarsuc) {
                        for my $tarkey (keys %{${$target}{$SInput}}) {
                            if (index($taxinfo[3], $tarkey) + 1) {
                                $tarsuc = 1 if (substr($tarkey, 1, -1) > 0);
                                last;
                            }
                        }
                    }
                    if ($ha{$mkey}{$mnkey}{$line}{scor} < $currank) {
                        $rank++;
                        if ($currank * $scale < $ha{$mkey}{$mnkey}{$line}{scor}) {
                            $currank = $currank * $scale;
                        } else {
                            $currank = $ha{$mkey}{$mnkey}{$line}{scor};
                        }
                    }
                    if ($rank == 1) { $rank++ if ($ha{$mkey}{$mnkey}{$line}{eval} > $bound); }
                    my $uni = 0;
                    if (exists $alkey{$rank}) {
                        if (exists $alkey{$rank}{$taxin}) {
                            $alkey{$rank}{$taxin}{count}++;
                            $alkey{$rank}{$taxin}{out} = join(",", $alkey{$rank}{$taxin}{out}, $ha{$mkey}{$mnkey}{$line}{line});
                        } else {
                            $uni = 1;
                            $alkey{$rank}{$taxin}{count}++;
                            $alkey{$rank}{$taxin}{out} = "$taxinfo[1]`$taxinfo[2]`$taxinfo[3]`$tilsuc`$out1`$ha{$mkey}{$mnkey}{$line}{line}";
                        }
                    } else {
                        $uni = 1;
                        $alkey{$rank}{$taxin}{count}++;
                        $alkey{$rank}{$taxin}{out} = "$taxinfo[1]`$taxinfo[2]`$taxinfo[3]`$tilsuc`$out1`$ha{$mkey}{$mnkey}{$line}{line}";
                    }
                    print OUT "$mkey`$SInput`primary`$taxinfo[0]`$taxinfo[1]`$taxinfo[2]`$tilsuc`$tarsuc`$ha{$mkey}{$mnkey}{$line}{line}`$ha{$mkey}{$mnkey}{$line}{gi}$out2`$rank`$uni`$ltime\n";
                }
            }
            # subsequence determinations
            $rank = 1;
            if (exists $alkey{$rank}) {
                my $detect = "-1";
                my @clip = (); my $tcomp;
                for my $key2 (keys %{$alkey{$rank}}) {
                    my @a = split(/`/, $alkey{$rank}{$key2}{out});
                    my @b = split(/,/, $a[2]);
                    if (!exists ${$taxexempt}{$b[$#b]}) {
                        if (@clip) {
                            my $i = 0;
                            if ($#b <= $#clip) {
                                if (($detect ne "TAXAMBIG") && ($b[$#b] == $clip[$#b])) {
                                    $detect = "TAXUNIQU";
                                } else {
                                    # Trim @clip to the lineage shared with this hit.
                                    while (($i <= $#b && $i <= $#clip) && $b[$i] eq $clip[$i]) { $i++; }
                                    splice(@clip, $i);
                                    $detect = "TAXAMBIG";
                                }
                            } else {
                                if ($detect ne "TAXAMBIG" && ($b[$#clip] == $clip[$#clip])) {
                                    $detect = "TAXUNIQU";
                                    @clip = split(/,/, $a[2]);
                                } else {
                                    while (($i <= $#b && $i <= $#clip) && $b[$i] eq $clip[$i]) { $i++; }
                                    splice(@clip, $i);
                                    $detect = "TAXAMBIG";
                                }
                            }
                        } else {
                            @clip = split(/,/, $a[2]);
                            $detect = "SEQUNIQU";
                            $detect = "TAXUNIQU" if ($alkey{$rank}{$key2}{count} > 1);
                        }
                    }
                }
                $subtile{$mkey}{$mnkey}{detect} = $detect;
                $subtile{$mkey}{$mnkey}{tax} = join("", join(",", @clip), ",");
                $subtile{$mkey}{$mnkey}{out} = join("", "Subseq. of $mkey $detect ", taxname($clip[$#clip]), " St:$mnkey\n");
                $subtile{$mkey}{$mnkey}{out} = join("", $subtile{$mkey}{$mnkey}{out}, " Rank $rank");
                my $tempout = "";
                my $tempou2 = "";
                for my $key2 (keys %{$alkey{$rank}}) {
                    my @a = split(/`/, $alkey{$rank}{$key2}{out});
                    $subtile{$mkey}{$mnkey}{name} = $a[1] if (!exists $subtile{$mkey}{$mnkey}{name});
                    $subtile{$mkey}{$mnkey}{bit} = $a[12] if (!exists $subtile{$mkey}{$mnkey}{bit});
                    $tempou2 = "Bases:$a[4] MIS:$a[5] SNP:$a[6] Range:$a[7]-$a[8] E:$a[11] Bit:$a[12]\n";
                    my $name = $a[0]; my $out = "Match:$a[3] Range:$a[9]-$a[10]\n $a[13]";
                    $tempout = join("", $tempout, " $name Num:$alkey{$rank}{$key2}{count} $out\n");
                }
                $subtile{$mkey}{$mnkey}{out} = join("", $subtile{$mkey}{$mnkey}{out}, $tempou2, $tempout);
                $rank = 2;
                $subtile{$mkey}{$mnkey}{out} = join("", $subtile{$mkey}{$mnkey}{out}, " Rank $rank ");
                $tempout = "";
                for my $key2 (keys %{$alkey{$rank}}) {
                    my @a = split(/`/, $alkey{$rank}{$key2}{out});
                    $tempou2 = "Bases:$a[4] MIS:$a[5] SNP:$a[6] Range:$a[7]-$a[8] E:$a[11] Bit:$a[12]\n";
                    my $name = $a[0]; my $out = "Match:$a[3] Range:$a[9]-$a[10]\n $a[13]";
                    $tempout = join("", $tempout, "$name Num:$alkey{$rank}{$key2}{count} $out\n");
                }
                $subtile{$mkey}{$mnkey}{out} = join("", $subtile{$mkey}{$mnkey}{out}, $tempou2, $tempout);
            }
        }
    }

    # Combine the subsequence results of each prototype sequence.
    for my $mkey (sort keys %subtile) {
        my $bit = 0; my $uskey;
        $tilerslt{$mkey}{num} = scalar(keys(%{$subtile{$mkey}}));
        for my $mnkey (keys %{$subtile{$mkey}}) {
            if ($subtile{$mkey}{$mnkey}{bit} > $bit) {
                $bit = $subtile{$mkey}{$mnkey}{bit};
                $uskey = $mnkey;
            }
        }
        if ($tilerslt{$mkey}{num} == 1) {
            $tilerslt{$mkey}{sub} = $uskey;
            $tilerslt{$mkey}{detect} = $subtile{$mkey}{$uskey}{detect};
        } else {
            # my $i = 0;
            my $i = 1;
            # Does any other subsequence score within 60% of the best one?
            for my $mnkey (keys %{$subtile{$mkey}}) {
                if ($mnkey ne $uskey) {
                    my $dif = ($subtile{$mkey}{$mnkey}{bit}) / $bit;
                    if ($dif > 0.60) { $i = 0; last; }
                }
            }
            $tilerslt{$mkey}{sub} = $uskey;
            if ($i > 0) {
                $tilerslt{$mkey}{detect} = $subtile{$mkey}{$uskey}{detect};
            } else {
                my $j = 1; my $us2 = $uskey;
                my @clip = split(/,/, $subtile{$mkey}{$uskey}{tax});
                for my $mnkey (keys %{$subtile{$mkey}}) {
                    if ($subtile{$mkey}{$us2}{tax} ne $subtile{$mkey}{$mnkey}{tax}) {
                        my @b = split(/,/, $subtile{$mkey}{$mnkey}{tax});
                        my $i = 0;
                        while (($i <= $#b && $i <= $#clip)) {
                            if ($b[$i] ne $clip[$i]) {
                                $i = $#b; $j = 0;
                            }
                            $i++;
                        }
                        last if (!$j);
                        if ($#b > $#clip) {
                            $us2 = $mnkey;
                            @clip = split(/,/, $subtile{$mkey}{$us2}{tax});
                        }
                    }
                }
                if ($j) {
                    $tilerslt{$mkey}{detect} = $subtile{$mkey}{$us2}{detect};
                    $tilerslt{$mkey}{sub} = $us2;
                } else {
                    $tilerslt{$mkey}{detect} = "TAXAMBIG";
                }
            }
        }
        my @clip = split(/,/, $subtile{$mkey}{$tilerslt{$mkey}{sub}}{tax});
        $tilerslt{$mkey}{taxcom} = taxname($clip[$#clip]);
    }

    open(OUT2, join("", ">", $Input, ".n4rpt"));
    open(OUT3, join("", ">", $Input, ".n4pats"));
    open(OUT4, join("", ">", $Input, ".n4path"));
    open(OUT5, join("", ">", $Input, ".n4tile"));
    my $tottile = scalar(keys(%tilerslt));
    my %subpatho = (); my %maipatho = ();
    # Group the prototype-sequence results by identified organism.
    for my $mkey (sort keys %tilerslt) {
        my $pathogen = $tilerslt{$mkey}{taxcom};
        for my $tarkey (keys %{${$patho}{$mkey}}) {
            if (substr($tarkey, 1, -1) < 0) {
                if (${$patho}{$mkey}{$tarkey}{stat} == 0) {
                    $pathogen = ${$patho}{$mkey}{$tarkey}{name};
                    last;
                }
            }
        }
        if (exists $subpatho{$pathogen}{num}) {
            $subpatho{$pathogen}{num} = $subpatho{$pathogen}{num} + 1;
            $subpatho{$pathogen}{list} = join(",", $subpatho{$pathogen}{list}, $mkey);
            if (index($tilerslt{$mkey}{detect}, "SEQUNIQU") + 1 &&
                index($subpatho{$pathogen}{detect}, "TAXUNI") + 1) {
                $subpatho{$pathogen}{detect} = "SEQUNIQU";
            } elsif (index($tilerslt{$mkey}{detect}, "TAXUNIQU") + 1 &&
                index($subpatho{$pathogen}{detect}, "SEQUNIQU") + 1) {
                $subpatho{$pathogen}{detect} = "SEQUNIQU";
            }
        } else {
            $subpatho{$pathogen}{list} = $mkey;
            $subpatho{$pathogen}{num} = 1;
            $subpatho{$pathogen}{detect} = $tilerslt{$mkey}{detect};
        }
        print OUT5 "$SInput``$tottile`$mkey`$tilerslt{$mkey}{num}`$tilerslt{$mkey}{taxcom}`$tilerslt{$mkey}{detect}\n";
    }
    close(OUT5);

    $tottile = scalar(keys(%subpatho)); my %maizzz = ();
    # Assign each identified organism to its target pathogen group, if any.
    for my $pathogen (sort keys %subpatho) {
        print OUT4 "$SInput``$tottile`$pathogen`$subpatho{$pathogen}{num}`$subpatho{$pathogen}{detect}`$subpatho{$pathogen}{list}\n";
        my $mkey;
        if ((my $l = index($subpatho{$pathogen}{list}, ",")) + 1) {
            $mkey = substr($subpatho{$pathogen}{list}, 0, $l);
        } else {
            $mkey = $subpatho{$pathogen}{list};
        }
        my $pathotax; my $pathogem;
        for my $tarkey (sort { ${$patho}{$mkey}{$b}{stat} <=> ${$patho}{$mkey}{$a}{stat} } keys %{${$patho}{$mkey}}) {
            $pathotax = $tarkey;
            $pathogem = ${$patho}{$mkey}{$tarkey}{name};
            last;
        }
        if (substr($pathotax, 1, -1) > 0) {
            if (index($subtile{$mkey}{$tilerslt{$mkey}{sub}}{tax}, $pathotax) + 1) {
                my @a = split(",", $subtile{$mkey}{$tilerslt{$mkey}{sub}}{tax});
                my $j; for ($j = 1; $j <= $#a; $j++) { last if ($a[$j] == substr($pathotax, 1, -1)); }
                my $list = $a[$j];
                for (my $i = $j + 1; $i <= $#a; $i++) { $list = join(",", $list, $a[$i]); }
                $j = $#a - $j;
                if (exists $maipatho{$pathogem}{num}) {
                    $maipatho{$pathogem}{num} = $maipatho{$pathogem}{num} + 1;
                    if (exists $maipatho{$pathogem}{l}{$j}{$list}) {
                        $maipatho{$pathogem}{l}{$j}{$list} = join(",", $maipatho{$pathogem}{l}{$j}{$list},
                            join(":", $pathogen, $subpatho{$pathogen}{num}));
                    } else {
                        $maipatho{$pathogem}{l}{$j}{$list} = join(":", $pathogen, $subpatho{$pathogen}{num});
                    }
                } else {
                    $maipatho{$pathogem}{num} = 1;
                    $maipatho{$pathogem}{l}{$j}{$list} = join(":", $pathogen, $subpatho{$pathogen}{num});
                }
            } else {
                if (exists $maizzz{num}) {
                    $maizzz{num} = $maizzz{num} + 1;
                    $maizzz{list} = join(",", $maizzz{list}, join(":", $pathogen, $subpatho{$pathogen}{num}));
                } else {
                    $maizzz{num} = 1;
                    $maizzz{list} = join(":", $pathogen, $subpatho{$pathogen}{num});
                }
            }
        }
    }
    close(OUT4);

    $tottile = scalar(keys(%maipatho));
    for my $pathogen (sort keys %maipatho) {
        my $low; my @clip = ();
        # Reduce @clip to the lineage common to all entries of this pathogen.
        for my $mkey (sort { $b <=> $a } keys %{$maipatho{$pathogen}{l}}) {
            for my $mnkey (keys %{$maipatho{$pathogen}{l}{$mkey}}) {
                if (@clip) {
                    my @b = split(",", $mnkey);
                    my $i = 0;
                    while (($i <= $#b && $i <= $#clip) && $b[$i] eq $clip[$i]) { $i++; }
                    splice(@clip, $i);
                } else {
                    @clip = split(",", $mnkey);
                }
            }
        }
        if ($#clip == 0) {
            print OUT3 "$SInput``$tottile`$pathogen`$maipatho{$pathogen}{num}";
        } else {
            $low = taxname($clip[$#clip]);
            print OUT3 "$SInput``$tottile`$low`$maipatho{$pathogen}{num}";
        }
        for my $mkey (sort keys %{$maipatho{$pathogen}{l}}) {
            $low = "";
            print OUT3 "`$mkey;";
            for my $mnkey (keys %{$maipatho{$pathogen}{l}{$mkey}}) {
                $low = join("", $low, ",", $maipatho{$pathogen}{l}{$mkey}{$mnkey});
            }
            print OUT3 substr($low, 1);
        }
        print OUT3 "\n";
    }
    if (exists $maizzz{num}) {
        print OUT3 "$SInput``$tottile`Non-target`$maizzz{num}`-1;$maizzz{list}\n";
    }
    close(OUT3);

    # Report the hybridization controls.
    print OUT2 "$SInput\n";
    print OUT2 "\n Controls\n";
    my $detect = "PASS"; my $val = "AFFX-TAGIQ-EX"; my $pval = "AFFY";
    if (exists $frclist{"AFFX-TAGIQ-EX"}) {
        $val = "AFFX-TAGIQ-EX";
    } elsif (exists $frclist{"AFFX-TagIQ-EX"}) {
        $val = "AFFX-TagIQ-EX";
    }
    if (exists $frclist{$val}) {
        $detect = "FAIL" if ($frclist{$val}{per} < 20);
        print OUT2 "$detect $frclist{$val}{per}\% AFFY $frclist{$val}{mat} bases called of $frclist{$val}{len} with $frclist{$val}{mis} misidentified\n";
    } else {
        print "didn't find $val\n";
    }
    $detect = "PASS"; my $tilsuc = 1; $val = "ATTIM"; $pval = $val;
    if (exists $frclist{$val}) {
        if ($frclist{$val}{per} < 20) { $detect = "FAIL"; $tilsuc = -2; }
        print OUT2 "$detect $frclist{$val}{per}\% $pval $frclist{$val}{mat} bases called of $frclist{$val}{len} with $frclist{$val}{mis} misidentified\n";
        print OUT "$val`$SInput`primary`3702`Control1`Arabidopsis thaliana`$tilsuc`0`AY087893`21406667`$frclist{$val}{per}`499`0`0`1`499`426`924```1`1`$ltime\n";
    } else {
        print "didn't find $val\n";
    }
    $detect = "PASS"; $tilsuc = 1; $val = "ATNAC1"; $pval = $val;
    if (exists $frclist{$val}) {
        if ($frclist{$val}{per} < 20) { $detect = "FAIL"; $tilsuc = -2; }
        print OUT2 "$detect $frclist{$val}{per}\% $pval $frclist{$val}{mat} bases called of $frclist{$val}{len} with $frclist{$val}{mis} misidentified\n";
        print OUT "$val`$SInput`primary`3702`Control 2`Arabidopsis thaliana`$tilsuc`0`AC009894`5902358`$frclist{$val}{per}`519`0`0`1`519`82488`83006```1`1`$ltime\n";
    } else {
        print "didn't find $pval\n";
    }

    # Summary section of the report.
    for my $pathogen (sort keys %subpatho) {
        print OUT2 "$subpatho{$pathogen}{detect} $pathogen in $subpatho{$pathogen}{num}\n";
        my @a = split(",", $subpatho{$pathogen}{list});
        for my $mkey (sort @a) {
            print OUT2 "ProSeq $mkey $tilerslt{$mkey}{detect} from $tilerslt{$mkey}{num} subseq.\n";
            for my $mnkey (sort keys %{$subtile{$mkey}}) {
                my @b = split("\n", $subtile{$mkey}{$mnkey}{out});
                my @c = split(" ", $b[1]);    # delimiter assumed to be whitespace
                print OUT2 "$b[0] $c[2] $c[3] $c[4] $c[7]\n";
            }
        }
    }
    # Detailed section of the report.
    print OUT2 "DETAILS\n";
    for my $pathogen (sort keys %subpatho) {
        print OUT2 "\n$subpatho{$pathogen}{detect} $pathogen in $subpatho{$pathogen}{num}\n";
        my @a = split(",", $subpatho{$pathogen}{list});
        for my $mkey (sort @a) {
            print OUT2 "ProSeq $mkey $tilerslt{$mkey}{detect} from $tilerslt{$mkey}{num} subseq.\n";
            for my $mnkey (sort keys %{$subtile{$mkey}}) {
                print OUT2 $subtile{$mkey}{$mnkey}{out};
            }
        }
    }
    close(OUT);
    close(OUT2);
}

sub openconnection
{
    my ($gi) = @_;
    my $username = 'ncbi';
    my $password = 'ncbi';
    my $database = 'bio';
    my $connect = "dbi:Oracle:$database";
    my $sqlldrconnection = "$username/$password\@$database";
    my $dbh = DBI->connect($connect, $username, $password,
        {
            RaiseError => 1,
            AutoCommit => 0
        }
    )
    || die "Database connection not made: $DBI::errstr";
    return $dbh;
}

sub closeconnection
{
    $dbh->disconnect;
    return;
}

# Return the taxonomy ID recorded for a GI number, or -1 if none is found.
sub taxfromgi
{
    my ($gi) = @_;
    my $sth;
    my $taxid = -1;
    my $sql = "select tax_id from gi_taxid_nucl where gi=$gi";
    $sth = $dbh->prepare($sql);
    $sth->execute();
    $sth->bind_columns(undef, \$taxid);
    $sth->fetch();
    my $out = $taxid;
    $sth->finish();
    return ($out);
}

# Return the scientific name for a taxonomy ID.
sub taxname {
    my ($taxid) = @_;
    my ($sth, $com);
    my $sql = "select name_txt from Taxonomy_names where tax_id=$taxid and name_class='scientific name'";
    $sth = $dbh->prepare($sql);
    $sth->execute();
    $sth->bind_columns(undef, \$com);
    $sth->fetch();
    $sth->finish();
    return ($com);
}

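# Return the lineage of a taxonomy ID as a comma-delimited string running
# from the root to the ID itself, with leading and trailing commas.
sub taxparents
{
    my ($parentid) = @_;
    my $sth;
    my $sql = ""; my $tax;
    my $ch2 = join("", ",", $parentid, ",");
    $sql = "select tax_id AS tax_id from nodes start with tax_id=$parentid connect by NOCYCLE tax_id=PRIOR parent_tax_id";
    $sth = $dbh->prepare($sql);
    $sth->execute();
    $sth->bind_columns(undef, \$tax);
    $sth->fetch();    # skip the first row, which is the starting ID itself
    while ($sth->fetch()) {
        $ch2 = join("", ",", $tax, $ch2);
    }
    $sth->finish();
    return ($ch2);
}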

Obviously, many modifications and variations of the present invention are possible in light of the above teachings. It is therefore to be understood that the claimed invention may be practiced otherwise than as specifically described. Any reference to claim elements in the singular, for example using the articles "a," "an," "the," or "said," is not to be construed as limiting the element to the singular.


<110>United States Government, as represented by the Secretary of the Navy

<120>Computer-implemented biological sequence identifier system and method

<130>PIUS0712388

<150>60/691,768

<151>2005-06-16

<150>60/735,876

<151>2005-11-14

<150>60/735,824

<151>2005-11-14

<150>60/743,977

<151>2006-03-30

<150>11/177,647

<151>2005-07-02

<150>11/177,646

<151>2005-07-02

<150>11/268,373

<151>2005-11-07

<150>11/422,425

<151>2006-06-06

<150>11/422,431

<151>2006-06-06

<160>5

<170>PatentIn version 3.3

<210>1

<211>61

<212>DNA

<213>human influenza virus A

<220>

<221>gene

<222>(1)..(61)

<223>NA1

<400>1

ctgggtaaat caaacatatg tcaatattaa caacactaac gttgttgctg gaaaggacac 60

a 61

<210>2

<211>61

<212>DNA

<213>human influenza virus A

<220>

<221>gene

<222>(1)..(61)

<223>NA1

<400>2

ctgggtaaat caaacatatg ttaatattag caacactaac gttgttgctg gaaaaggcac 60

a 61

<210>3

<211>16

<212>DNA

<213>human influenza virus A

<220>

<221>gene

<222>(1)..(16)

<223>NA1

<400>3

ctgggtaaag gacaca 16

<210>4

<211>61

<212>DNA

<213>Unknown

<220>

<223>raw data

<220>

<221>misc_feature

<222>(1)..(61)

<223>n is a, c, g, or t

<400>4

ctgggnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnngnc gttgttgctg gaaaggncac 60

a 61

<210>5

<211>61

<212>DNA

<213>Unknown

<220>

<223>filtered data

<220>

<221>misc_feature

<222>(1)..(61)

<223>n is a, c, g, or t

<400>5

ctgggnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnac 60

a 61

Claims

1. A method, comprising:

submitting a plurality of reference sequences in a query to a taxonomic database to produce a plurality of taxonomic results;

wherein the reference sequences are the output of a genetic database query, and the genetic database query returns a score for each reference sequence; and

reporting a taxonomic identification based on the taxonomic results.

2. The method according to claim 1, wherein reporting the taxonomic identification uses one of the following criteria steps, the step used being the first step whose criterion is satisfied:

when the taxonomic results contain only a single taxonomic class, reporting that single taxonomic class;

when the highest score, compared with the score of the reference sequence having the second-highest score, exceeds a predetermined score ratio threshold, reporting the taxonomic result of the reference sequence having the highest score among the plurality of reference sequences;

when all of the taxonomic results contain only a child taxonomic class and the direct parent taxonomic class of that child taxonomic class, reporting the child taxonomic class; and

in all other cases, reporting the common parent taxonomic class.

3. The method according to claim 2, wherein the score ratio threshold is about 30%.
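
By way of non-limiting illustration, the ordered criteria of claims 2 and 3 might be sketched in Python as follows. Representing each taxonomic class as a root-to-node tuple of taxonomy identifiers, the names `common_parent` and `report_identification`, and the reading of the score ratio as (highest score minus second-highest score) divided by the highest score are all assumptions made for this sketch; none of them is taken from the program listing.

```python
from typing import List, Sequence, Tuple

# Assumed representation: a taxonomic class is its lineage, root first.
Lineage = Tuple[int, ...]


def common_parent(lineages: Sequence[Lineage]) -> Lineage:
    """Return the longest root-to-node prefix shared by all lineages."""
    prefix = list(lineages[0])
    for lineage in lineages[1:]:
        n = 0
        while n < len(prefix) and n < len(lineage) and prefix[n] == lineage[n]:
            n += 1
        del prefix[n:]
    return tuple(prefix)


def report_identification(hits: List[Tuple[float, Lineage]],
                          score_ratio_threshold: float = 0.30) -> Lineage:
    """Apply the four criteria in order; hits are (score, lineage) pairs."""
    ranked = sorted(hits, key=lambda hit: hit[0], reverse=True)
    classes = sorted({lineage for _, lineage in ranked}, key=len)

    # Criterion 1: the results contain only a single taxonomic class.
    if len(classes) == 1:
        return classes[0]

    # Criterion 2 (assumed reading): the best score leads the runner-up
    # by more than the threshold fraction of the best score.
    best, second = ranked[0][0], ranked[1][0]
    if best > 0 and (best - second) / best > score_ratio_threshold:
        return ranked[0][1]

    # Criterion 3: only a child class and its direct parent class.
    if len(classes) == 2 and classes[1][:-1] == classes[0]:
        return classes[1]

    # Criterion 4: otherwise report the common parent class.
    return common_parent(classes)
```

Under these assumptions, two hits of nearly equal score in sibling classes `(1, 2, 3)` and `(1, 2, 4)` would be reported as the shared parent `(1, 2)`.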

4. A method, comprising:

performing the method of claim 1 two or more times using different pluralities of reference sequences to produce a plurality of taxonomic identifications;

wherein the input to each genetic database query is a subsequence associated with a predetermined prototype sequence from a target pathogen; and

combining identical taxonomic identifications together with their corresponding prototype sequences.

5. The method according to claim 4, further comprising:

reporting each target pathogen as a positive result when one of the taxonomic identifications corresponding to one of the prototype sequences of that target pathogen is identical to the target pathogen or is a child of the target pathogen.

6. The method according to claim 5, further comprising:

reporting a final taxonomic identification, wherein:

when the positive results contain only a single taxonomic class, the final taxonomic identification is that single taxonomic class;

when all of the positive results contain only a child taxonomic class and the direct parent taxonomic class of that child taxonomic class, the final taxonomic identification is the child taxonomic class; or

in all other cases, the final taxonomic identification is the common parent taxonomic class.

7. The method according to claim 1, wherein the method is computer-implemented.

8. An apparatus, comprising:

means for performing the method of claim 7.

9. A method of processing a biological sequence obtained from an assay, the method comprising:

converting base calls located in a predetermined list of positions within the biological sequence to N;

wherein each entry in the predetermined list of positions represents the capability of a substance hybridizing to a microarray used to generate the biological sequence;

wherein the substance is not the nucleic acid of a target pathogen; and

determining the ratio of single nucleotide polymorphisms in the biological sequence relative to a reference sequence.

10. The method according to claim 9, wherein the substance is a PCR primer.
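
By way of non-limiting illustration, the two operations of claim 9 might be sketched in Python as follows. The function names `mask_positions` and `snp_ratio`, the use of 0-based positions, and the choice to compute the ratio over called (non-N) positions of a sequence already aligned to its reference are assumptions made for this sketch; the claim does not fix how the ratio is normalized.

```python
from typing import Iterable


def mask_positions(sequence: str, positions: Iterable[int]) -> str:
    """Convert the base calls at the listed 0-based positions to 'N'."""
    bases = list(sequence)
    for pos in positions:
        if 0 <= pos < len(bases):
            bases[pos] = "N"
    return "".join(bases)


def snp_ratio(sequence: str, reference: str) -> float:
    """Fraction of called (non-N) positions that differ from the reference.

    Assumes both strings are aligned position by position.
    """
    if len(sequence) != len(reference):
        raise ValueError("sequence and reference must have the same length")
    called = 0
    mismatched = 0
    for base, ref in zip(sequence.upper(), reference.upper()):
        if base == "N":
            continue  # masked or uncalled positions are not compared
        called += 1
        if base != ref:
            mismatched += 1
    return mismatched / called if called else 0.0
```

For example, masking positions 0 and 1 of `ACGTACGA` gives `NNGTACGA`, and comparing that result with the reference `ACGTACGT` leaves one mismatch among six called positions.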

11. The method according to claim 9, further comprising:

when the ratio of single nucleotide polymorphisms is below an SNP threshold, selecting a possible subsequence of an initial length from the biological sequence; and

calculating the ratio of unique base calls in the possible subsequence.

12. The method according to claim 11, wherein the SNP threshold is about 20%.

13. The method according to claim 11, further comprising:

when the ratio of unique base calls is above a unique base call threshold, extending the possible subsequence by one base;

recalculating the ratio of unique base calls in the extended possible subsequence;

repeating the extending and recalculating until the ratio of unique base calls is below the unique base call threshold;

removing trailing N base calls from the extended possible subsequence to form a query subsequence; and

calculating the ratio of unique base calls in the query subsequence.

14. The method according to claim 13, wherein the unique base call threshold is about 40%.

15. The method according to claim 13, wherein repeating the extending and recalculating is stopped when the last 21 positions of the extended possible subsequence contain fewer than 4 unique base calls.
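
By way of non-limiting illustration, the extension loop of claims 13 to 15 might be sketched in Python as follows. The names `unique_ratio` and `grow_query_subsequence`, the treatment of a unique base call as any call other than N, and the initial length of 20 bases are assumptions made for this sketch; the 40% threshold and the 21-position, 4-call stopping rule are the values recited in claims 14 and 15.

```python
from typing import Tuple


def unique_ratio(subseq: str) -> float:
    """Fraction of positions holding a unique (assumed: non-N) base call."""
    if not subseq:
        return 0.0
    return sum(1 for base in subseq.upper() if base != "N") / len(subseq)


def grow_query_subsequence(sequence: str, start: int,
                           initial_length: int = 20,   # hypothetical value
                           unique_threshold: float = 0.40,
                           tail_window: int = 21,
                           tail_min_unique: int = 4) -> Tuple[str, float]:
    """Extend a window one base at a time, then trim trailing N calls."""
    end = min(start + initial_length, len(sequence))
    while end < len(sequence) and unique_ratio(sequence[start:end]) > unique_threshold:
        end += 1  # extend the possible subsequence by one base
        tail = sequence[max(start, end - tail_window):end]
        tail_unique = sum(1 for base in tail.upper() if base != "N")
        if len(tail) == tail_window and tail_unique < tail_min_unique:
            break  # the last 21 positions carry too few unique calls
    query = sequence[start:end].rstrip("Nn")
    return query, unique_ratio(query)
```

In this sketch a sequence of 40 called bases followed by a long run of N stops growing once the trailing 21-position window holds only three called bases, and the trimmed query is the 40 called bases.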

16. The method according to claim 13, further comprising:

when the length of the query subsequence and the ratio of unique base calls in the query subsequence satisfy predetermined requirements, submitting the query subsequence in a query to a genetic database to identify the biological sequence.

17. The method according to claim 16, wherein the requirements comprise:

the query subsequence contains at least 7 consecutive unique base calls; and

the length of the query subsequence exceeds 100 bases; the length of the query subsequence is 30 to 100 bases and the ratio of unique base calls in the query subsequence is at least about ((length of the query subsequence - 30) * 0.2857 + 70)%; or the length of the query subsequence is below 30 bases and the ratio of unique base calls in the query subsequence is at least about 95%.
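
By way of non-limiting illustration, the requirements of claim 17 might be checked in Python as follows. The names `has_unique_run` and `meets_requirements`, the treatment of a unique base call as any call other than N, and passing the ratio as a percentage between 0 and 100 are assumptions made for this sketch. The linear term rises from 70% at 30 bases to about 90% at 100 bases, since 0.2857 is approximately 20/70.

```python
import re


def has_unique_run(query: str, run: int = 7) -> bool:
    """True if the query holds at least `run` consecutive non-N base calls."""
    return re.search(r"[^Nn]{%d,}" % run, query) is not None


def meets_requirements(query: str, unique_percent: float) -> bool:
    """Apply the run-length requirement and the length-dependent threshold."""
    if not has_unique_run(query):
        return False
    length = len(query)
    if length > 100:
        return True  # no ratio requirement is recited for this case
    if length >= 30:
        return unique_percent >= (length - 30) * 0.2857 + 70
    return unique_percent >= 95
```

Under these assumptions a 50-base query needs at least about 75.7% unique base calls, so a query of that length with only 14% would not be submitted.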

18. The method according to claim 9, wherein the method is computer-implemented.

19. An apparatus, comprising:

means for performing the method of claim 18.