KR20120083521A - Computer-implemented biological sequence identifier system and method

Info

Publication number: KR20120083521A
Application number: KR1020127014945A
Authority: KR
Inventors: Anthony P. Malanoski; Baochuan Lin; Joel M. Schnur; David A. Stenger
Original assignee: United States of America (administering department: Department of the Navy)
Priority date: 2005-06-16
Filing date: 2006-06-09
Publication date: 2012-07-25
Also published as: JP2008547090A; JP4910104B2

Abstract

The present invention relates to a method comprising: submitting reference sequences to a taxonomic database to produce taxonomic results; and recording a taxonomic identification based on the taxonomic results. The reference sequences are the output of a genetic database query that returns a score for each reference sequence. The invention also relates to a method of processing a biological sequence obtained from an assay by: converting the base calls at a predetermined list of positions in the biological sequence to N; and determining the ratio of single nucleotide polymorphisms in the biological sequence relative to a reference sequence. Each entry in the predetermined position list represents the capacity of a substance to hybridize to the microarray used to generate the biological sequence. The substance is not the nucleic acid of a target pathogen.

Description

Computer-implemented biological sequence identifier system and method {COMPUTER-IMPLEMENTED BIOLOGICAL SEQUENCE IDENTIFIER SYSTEM AND METHOD}

Related applications

This application claims priority to US Provisional Patent Application No. 60/691,768, filed June 16, 2005; US Provisional Patent Application No. 60/735,876, filed November 14, 2005; US Provisional Patent Application No. 60/735,824, filed November 14, 2005; US Provisional Patent Application No. 60/743,639, filed May 22, 2006; US Patent Application No. 11/422,425, filed June 6, 2006; and US Patent Application No. 11/422,431, filed June 6, 2006. This application is a continuation-in-part of US Patent Application No. 11/177,647, filed July 2, 2005; US Patent Application No. 11/177,646, filed July 2, 2005; and US Patent Application No. 11/268,373, filed November 7, 2005. These nonprovisional applications claim priority to US Provisional Patent Application No. 60/590,931, filed July 2, 2004; US Provisional Patent Application No. 60/609,918, filed September 15, 2004; US Provisional Patent Application No. 60/626,500, filed November 5, 2004; US Provisional Patent Application No. 60/631,437, filed November 29, 2004; and US Provisional Patent Application No. 60/631,460, filed November 29, 2004.

The present invention relates generally to methods of processing biological sequences.

For both surveillance and diagnostic applications, fine-scale pathogen identification and near-neighbor discrimination are important, so an assay that monitors at this highly specific level is desirable in clinical settings (1-3). To use any method based on DNA or RNA detection successfully, the assay must be coupled with large databases of nucleic acid sequence information, both for assay design, to ensure that the desired information is provided, and for interpretation of the raw data. Some well-established techniques, such as real-time PCR, use short, unique stretches of sequenced genomes to provide good specificity (4). By selecting a sufficient number of fragments, these techniques can provide fine-scale identification of a few genetically close organisms. However, fragments that were specific when first selected are often found to be less specific later, as more organisms are sequenced. This is particularly problematic for pathogens that belong to families with high mutation rates and for pathogens for which relatively few near neighbors have been identified. In addition, real-time PCR cannot detect the presence of new, significant mutations or resolve base sequence details. Similarly, other detection techniques have improved how pathogen identification is obtained, but they suffer from some or all of the problems of PCR (5-6).

High-density resequencing microarrays can produce variable-length segments of 10^2 to 10^5 base pairs (bp) of direct sequence information. They have been used successfully to detect single nucleotide polymorphisms (SNPs) and genetic variants from viral, bacterial, and eukaryotic genomes (9-16). Their use for SNP detection has clearly established their ability to provide reliable, high-quality sequence information. In most cases, the microarrays were designed to study a limited number of genetically similar target pathogens, and in many cases the detection methods relied solely on recognizing hybridization patterns for identification (12, 14, 15, 17, 18). Taking advantage of the base-by-base resolving power that resequencing microarrays need for SNP detection, resequencing was recently adapted, using a different approach, to identify a large number of bacterial and viral pathogens while still allowing fine discrimination of closely related pathogens and tracking of mutations within the targeted pathogens (19-21). The new methodology differed from earlier work by using the resolved bases as queries in similarity searches of DNA databases to identify the species and variants most consistent with the base calls from the observed hybridization. The system could test for 26 pathogens simultaneously and could detect the presence of multiple pathogens. A software program, the REsequencing Pathogen Identifier (REPI), was used to simplify data analysis by performing similarity searches of a genetic database with the Basic Local Alignment Search Tool (BLAST) (22). REPI used the BLAST default settings and returned a sequence as indicating hybridization only when the expect value, a quantity calculated by BLAST that indicates the likelihood that the observed sequence match would occur in the database by random chance, was less than 10^-9. This screened out all cases with insufficient signal, but the final decision about which pathogen was detected, and what level of discrimination was possible, required manual examination of the returned results. The method successfully enabled strain identification of Flu A and B samples and fine discrimination of various adenoviruses, in agreement with conventional sampling results (19, 20).

Two important advantages of this approach were that information was always recovered at the most detailed level possible and that organisms carrying recent mutations could still be recognized. The approach also maintained specificity well, because it did not depend on the uniqueness of short sequences, which erodes continually as more organisms are sequenced.

Although useful, this analysis method had several drawbacks: it was time-consuming, was not tuned to maximize sensitivity, produced complex results, was suitable only for experts, and included redundant or duplicate information. The process was time-consuming because only the initial screening was handled automatically, while the remaining steps required manual interpretation before a detection analysis could be concluded. Because a simple criterion (an expect value cutoff of 10^-9) and non-optimized BLAST parameters were used to decide whether a pathogen was detected, the REPI algorithm provided a list of candidate organisms but could neither reach a simple final conclusion nor relate the result of one prototype sequence to that of another. Instead, a manual process was used for the final decision. Because REPI presented all similar results and queried public nucleic acid databases that contain redundant entries, the user was given a large amount of data that was not useful. In addition, with the manual process it was not possible to establish that the developed algorithm was generally applicable to any organism for which sequence information with resolved nucleic acid bases was provided.

One method of the invention comprises: submitting a plurality of reference sequences as queries to a taxonomic database to produce a plurality of taxonomic results; and recording a taxonomic identification based on the taxonomic results. The reference sequences are the output of a genetic database query that returns a score for each reference sequence.

Another method of the invention, for processing a biological sequence obtained from an assay, comprises: converting the base calls at a predetermined list of positions in the biological sequence to N; and determining the ratio of single nucleotide polymorphisms in the biological sequence relative to a reference sequence. Each entry in the predetermined position list represents the capacity of a substance to hybridize to the microarray used to generate the biological sequence. The substance is not the nucleic acid of a target pathogen.
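The two steps of this method can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the 0-based position convention, and the choice to divide the SNP count by the number of non-N base calls are assumptions made for illustration.

```python
def mask_positions(hyb_seq, masked_positions):
    """Return a copy of hyb_seq with the base call at each listed
    0-based position replaced by N (positions outside the sequence are ignored)."""
    bases = list(hyb_seq.upper())
    for pos in masked_positions:
        if 0 <= pos < len(bases):
            bases[pos] = "N"
    return "".join(bases)


def snp_ratio(hyb_seq, ref_seq):
    """Fraction of unique (non-N) base calls that differ from the reference.

    Assumes the two sequences are already aligned position by position.
    """
    if len(hyb_seq) != len(ref_seq):
        raise ValueError("sequences must have the same length")
    unique = 0
    snps = 0
    for called, ref in zip(hyb_seq.upper(), ref_seq.upper()):
        if called == "N":
            continue  # ambiguous calls carry no information
        unique += 1
        if called != ref:
            snps += 1
    return snps / unique if unique else 0.0


# Hypothetical data: positions 0 and 1 are known to hybridize to non-target material.
reference = "ACGTACGTAC"
called = "ACGAACGTTC"
masked = mask_positions(called, [0, 1])
ratio = snp_ratio(masked, reference)
```

Masking before counting means that base calls produced by non-target material, such as primers, contribute neither to the SNP count nor to the denominator.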

Fig. 1 is a schematic of the algorithm showing the relationship of the three main tasks and the logic of the subtasks associated with each task. Task I performs filtering and subsequence selection and then determines which database records each prototype sequence most closely resembles. Task II determines whether the prototype sequence identifications support a common organism identification. Task III makes the final examination and determination of the organisms detected from the microarray data. ProSeq: prototype sequence; SubSeq: subsequence; HybSeq: hybridized sequence.
Fig. 2 is a detailed schematic of the filtering subtask of Task I. For each ProSeq, the primer regions were masked as N (ambiguous) calls, and UniRate was then calculated from the HybSeq. For a ProSeq that passed the UniRate requirement, a revised sliding window algorithm was applied to grow SubSeqs that could be used as BLAST queries. The identity of each successfully grown SubSeq (start position within the ProSeq and length) was placed in a file for batch querying with BLAST.
Fig. 3 is a detailed schematic of the Task I subtask that handles organism identification for individual SubSeqs. Each SubSeq sent to BLAST returned a list of possible matches, held in the Return array, which was searched for the best bit score/expect value pair (MaxScore). If MaxScore was greater than MIN (10^-6), all returns with that best score were sorted into a new array, Rank1. The detailed decision process is described in the Methods section herein; the organism for the SubSeq was then identified.
Fig. 4 is a schematic of the Task I subtask that determines the organism for a ProSeq from the results found for its SubSeqs. All SubSeqs of a given ProSeq are compared with one another to find the two SubSeqs with the best scores. If there is a single SubSeq, or one SubSeq scores much better than the others, the ProSeq inherits the properties of that SubSeq. Otherwise, the common taxonomic class is determined as described in the specification.
Fig. 5 is an alignment of the influenza A NA1 ProSeq with strains A/Weiss/43 and A/Puerto Rico/8/34. The raw and filtered hybridization chip results for A/Puerto Rico/8/34 are also shown. * indicates a perfectly matched sequence.

A more complete understanding of the invention is readily obtained by reference to the following description of example embodiments and the accompanying drawings.

In the following description, for purposes of explanation and not limitation, specific details are set forth in order to provide a thorough understanding of the present invention. However, it will be apparent to one skilled in the art that the invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known methods and devices are omitted so as not to obscure the description of the invention with unnecessary detail.

As used herein, the term "sequence" means a nucleic acid sequence, such as DNA or RNA, or a protein sequence. As used herein, "base" and "base call" can refer to a nucleotide base or an amino acid. As used herein, the term "taxonomic" can refer to any level or class of identification of a pathogen, including but not limited to genus, species, strain, and substrain. As used herein, the term "recording" can include transmitting a signal from one system to another and generating any form of human-readable report. All of the disclosed methods can be computer-implemented on an apparatus having means for performing the method.

Disclosed is the Computer-Implemented Biological Sequence Identifier system 2.0 (CIBSI 2.0), a new software expert system that successfully uses the base sequence information determined from Affymetrix resequencing microarrays and is designed to provide a simple list of the organisms detected. The algorithm addresses the shortcomings of the previous method by incorporating new features that fully automate pathogen identification. This single program, with improved sensitivity, makes correct decisions for all 26 pathogens represented on the RPM v1 microarray, whether they are detected alone or in combination (19, 20, 23). Although the program is currently applied to resequencing microarrays, the methodology is generally applicable, because only the first part of the algorithm deals with issues specific to microarrays, while the remainder handles any sequences suitable for use as queries to the BLAST algorithm. In developing a general identification algorithm, issues specific to resequencing microarrays that complicate their use were identified and analyzed. Because the entire decision process for what is detected has been automated, it is straightforward to test whether the rules used for identification are correct and applicable to any pathogen. With this efficient program, resequencing-based assays can provide a competitive method for testing for many possible pathogens simultaneously while producing output that non-experts can interpret.

Amplification, hybridization, and sequence determination

Details of the RPM v1 microarray design and the experimental methods have been discussed in prior publications (19, 20, 23). The experimental microarray data used in the present analysis were obtained from a variety of purified templates and from clinical samples, using any of a variety of amplification schemes. GCOS software v1.3 (Affymetrix, Santa Clara, CA, USA) was used to align and scan the hybridized microarrays and to determine the intensity of each probe in every probe set. Base calls were made from the intensity data of each probe set using GDAS v3.0.2.8 software (Affymetrix, Santa Clara, CA, USA), which implements the ABACUS algorithm (11). The sequences were written in FASTA format for the later analysis steps.

The resequencing microarray (RPM v1) was previously designed for detection and sequence typing of 20 common respiratory pathogens known to cause febrile respiratory illness and 6 CDC Category A biothreat pathogens, on the basis of ProSeqs rather than predetermined hybridization patterns. Approximately 4000 RPM v1 experiments, performed with various amplification schemes, single and multiple pathogen targets, purified nucleic acids, and clinical samples, were examined to develop the pathogen identification algorithm. The results of applying the algorithm to clinical samples, identified pathogens, and purified nucleic acids are discussed in detail elsewhere (19, 20, 23). In all cases, the algorithm correctly identified the organism at the species or strain level, depending on the length of the ProSeqs represented on RPM v1. Several specific examples are discussed below to illustrate how the algorithm performs under various conditions.

The CIBSI 2.0 program handled a hierarchy of three tasks (Fig. 1): (I) determining which database records the detected organisms most closely resemble; (II) determining whether identifications from separate targets support a common organism identification; and (III) determining whether the detected organisms belong to the set of targets that the assay was specifically designed to detect or are close genetic near neighbors. Target pathogens are the organisms that the assay was specifically designed to detect. As used herein, a set of probes representing a reference sequence selected from a target pathogen genome is referred to as a prototype sequence, or "ProSeq" for short. The set of resolved bases that results from hybridization of genomic material to a ProSeq is referred to as the hybridized sequence, or "HybSeq". A HybSeq is divided into possible subsequences, or "SubSeqs".

One part of the algorithm handles organism identification based on a ProSeq and proceeds in three steps: initial filtering of each HybSeq into SubSeqs suitable for sequence similarity comparison, database querying with each SubSeq, and taxonomic comparison of the BLAST returns for each SubSeq. In the next step, the ProSeqs were compared to determine whether they supported the same identified organism. In the final step, the detected organisms were compared with the list of target pathogens for which the assay was designed, to determine whether any was positively detected. The level of discrimination that a particular sample supported was determined automatically.

Filtering

An initial filtering algorithm, the REsequencing Pathogen Identifier (REPI), was developed first (20), and its general concepts, with revisions, were incorporated into the current automated detection algorithm used in the CIBSI 2.0 program. Filtering and subsequence selection were used to remove potential biasing caused by the choice of reference sequence and by other sources (primers), and to divide the HybSeq into fragments that are meaningful for faster searching. This was the first subtask of Task I in Fig. 1 and is shown schematically in detail in Fig. 2. When PCR amplification was used, a microarray was hybridized in the presence of the primers alone to determine the locations where the primers produced hybridization. Every portion of a ProSeq that hybridized with the primers was masked as N calls so that the HybSeq contained no biased information.

For each ProSeq, UniRate, the ratio of SNPs to the total number of unique base calls, was calculated from the HybSeq. To eliminate HybSeqs with insufficient hybridization, a ProSeq with a UniRate ≥ 20% (the SNP threshold) was considered negative for detection of the target organism. A UniRate of 20% corresponds to an average of 5 SNPs per 25 bp. With that frequency of differences between an organism similar to the target pathogen and the reference sequence on which the ProSeq is based, it is unrealistic to expect significant specific hybridization to 25 bp probes. In that case the filtering subtask ended and control returned to the Task I loop, which examined the next ProSeq. A ProSeq with a rate < 20% was examined in more detail.

At each position of the HybSeq, the revised sliding window algorithm (20) was applied to grow SubSeqs that could be used as BLAST queries. Initially, the first 20 bases (the initial length) following a position were examined as a potential subsequence. If fewer than 60% of these bases were ambiguous (N), the SubSeq entered the extension phase. The SubSeq was extended one base at a time until the overall content of unique base calls fell to 40% or less (the unique base call threshold) or until a sliding window containing the last 21 bases had fewer than 4 unique base calls. This differs from the REPI algorithm, which used only a 20-base sliding window and stopped growing the SubSeq when fewer than 40% of the window's content was unique base calls. At this point the SubSeq was examined and trailing N calls were removed.

At least one location with 7 consecutive unique base calls was required to satisfy the BLAST word size parameter and keep the SubSeq for further analysis. SubSeqs longer than 100 bases were accepted. To be accepted, SubSeqs of ≤ 30 bases required at least 95% unique base calls (not "N"). For SubSeqs of 30 to 100 bases, acceptance required at least VARI% unique bases, where VARI = ("SubSeq length" - 30) * 0.2857 + 70. For SubSeqs of ≥ 80 bases, the BLAST word size parameter was changed to 11 if the SubSeq contained at least 11 consecutive bases. The identity of each successfully grown SubSeq (start position within the ProSeq and length) was placed in an entry of the SubSeq array, which holds the information associated with each SubSeq. These identities and SubSeqs were placed in a file for batch querying with BLAST. The procedure was repeated, continuing from the end of the previous successful SubSeq or, on failure, from the point where the window was first grown, until the end of the HybSeq was reached. On completion, the algorithm returned to the Task I loop and performed the BLAST subtask.
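The growth and acceptance rules above can be sketched in Python as follows. This is one possible reading of the description, not the CIBSI 2.0 source: all function and constant names are hypothetical, the choice to skip start positions that are N and to advance by one base after a failed attempt are assumptions, and the acceptance formula is taken verbatim from the text.

```python
INITIAL_LENGTH = 20          # bases examined before extension starts
MAX_N_FRACTION = 0.60        # the initial window must be less than 60% N
MIN_UNIQUE_FRACTION = 0.40   # growth stops when unique calls fall to 40% or less
TAIL_WINDOW = 21             # sliding window over the last 21 bases
MIN_TAIL_UNIQUE = 4          # growth stops when the tail has fewer than 4 unique calls
MIN_UNIQUE_RUN = 7           # run of unique calls needed for the BLAST word size


def unique_count(seq):
    """Number of non-N base calls."""
    return sum(1 for base in seq if base != "N")


def longest_unique_run(seq):
    """Length of the longest stretch of consecutive non-N base calls."""
    best = run = 0
    for base in seq:
        run = run + 1 if base != "N" else 0
        best = max(best, run)
    return best


def required_unique_percent(length):
    """Minimum percentage of unique calls for a SubSeq of this length."""
    if length > 100:
        return 0.0           # longer SubSeqs are accepted outright
    if length <= 30:
        return 95.0
    return (length - 30) * 0.2857 + 70.0   # VARI, as stated in the text


def accept(sub):
    """Apply the run-length and unique-content acceptance rules."""
    if longest_unique_run(sub) < MIN_UNIQUE_RUN:
        return False
    return 100.0 * unique_count(sub) / len(sub) >= required_unique_percent(len(sub))


def grow(hyb, start):
    """Try to grow one SubSeq from start; return it, or None on failure."""
    window = hyb[start:start + INITIAL_LENGTH]
    if len(window) < INITIAL_LENGTH:
        return None
    if (INITIAL_LENGTH - unique_count(window)) / INITIAL_LENGTH >= MAX_N_FRACTION:
        return None
    end = start + INITIAL_LENGTH
    while end < len(hyb):
        candidate = hyb[start:end + 1]          # SubSeq with one more base
        if unique_count(candidate) / len(candidate) <= MIN_UNIQUE_FRACTION:
            break
        if unique_count(candidate[-TAIL_WINDOW:]) < MIN_TAIL_UNIQUE:
            break
        end += 1
    sub = hyb[start:end].rstrip("N")            # drop trailing N calls
    return sub if accept(sub) else None


def find_subseqs(hyb):
    """Return (start, length) identities of the accepted SubSeqs in a HybSeq."""
    hyb = hyb.upper()
    found, pos = [], 0
    while pos + INITIAL_LENGTH <= len(hyb):
        sub = None if hyb[pos] == "N" else grow(hyb, pos)
        if sub is None:
            pos += 1                            # assumption: retry at the next base
        else:
            found.append((pos, len(sub)))
            pos += len(sub)                     # continue after the accepted SubSeq
    return found
```

For a hypothetical HybSeq of 40 resolved bases followed by 60 N calls, `find_subseqs("ACGT" * 10 + "N" * 60)` yields a single identity covering the 40 resolved bases, because the tail-window rule stops growth shortly after the N run begins and the trailing N calls are then stripped.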

Database query

The BLAST subtask performed a batch similarity search of the database using the SubSeqs as queries. The BLAST program used was NCBI Blastall -p blastn version 2.12 with a defined set of parameters. Masking of low-complexity regions was applied for the seeding phase to speed up the query, but low-complexity repeats were included in the actual scoring. The entire nucleotide database from NCBI, acquired on February 7, 2006, was used as the reference database. (Note that earlier images of the database were used during development, but all experiments were rerun with the algorithm as described against the database image acquired on this date.) The default gap penalty and nucleotide match score were used. The nucleotide mismatch penalty, the -q parameter, was set to -1 rather than the default. The results of any BLAST query with an expect value < 0.0001 were returned from the blastall program in tabular format. The information for each return (bit score, expect value, mismatches, length of match) was placed in the Return hash, Return{hash key}{info}, with the SubSeq identity as the hash key, for further analysis.
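As an illustration of how tabular BLAST returns could be loaded into a structure like the Return hash, the following Python sketch parses tab-separated hit lines. It assumes the standard 12-column tabular layout (query id, subject id, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, expect value, bit score). The function name, the dictionary keys, and the query identifier format are hypothetical.

```python
from collections import defaultdict

MAX_EXPECT = 1e-4  # returns are kept only when the expect value is below this


def parse_tabular(lines, max_expect=MAX_EXPECT):
    """Group tabular BLAST hits by query id, keeping the fields used later."""
    returns = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blank and comment lines
        fields = line.split("\t")
        if len(fields) < 12:
            continue                      # not a complete hit line
        expect = float(fields[10])
        if expect >= max_expect:
            continue
        returns[fields[0]].append({
            "subject": fields[1],
            "match_length": int(fields[3]),
            "mismatches": int(fields[4]),
            "expect": expect,
            "bit_score": float(fields[11]),
        })
    return dict(returns)


# Hypothetical output lines; the query id encodes ProSeq name, start, and length.
example_lines = [
    "NA1:10:40\tsubjectA\t100.00\t40\t0\t0\t1\t40\t5\t44\t2e-15\t79.8",
    "NA1:10:40\tsubjectB\t90.00\t40\t4\t0\t1\t40\t5\t44\t0.5\t20.1",
]
parsed = parse_tabular(example_lines)
```

Keying the result by the query identifier mirrors the use of the SubSeq identity as the hash key, so the hits for one SubSeq can be ranked without rescanning the whole output.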

Taxonomy-based pathogen identification for the ProSeq from SubSeq

The next subtask of Task I determined the state of each SubSeq() and is shown in FIG. 3. To present the data simply and to facilitate the decision process, the information for every SubSeq was summarized by two parameters: the "identified organism" represented the taxonomic class of the organism, and the "organism specificity" represented the quality of the organism identification. The entries in the Return hash were examined and ranked by score for each distinct SubSeq() of the ProSeq. The score array contained a pair of parameters that have a fixed relationship for a given database, namely the bit score and the expect value. Depending on the situation, it was appropriate to rank by the score that accounts for the size of the database (expect value) or by the one that does not (bit score). Entries in the Return hash could have the same score, and all entries with the highest bit score/lowest expect value (MaxScore) were held in a separate array, Rank1. The full taxonomic classification of every entry in Rank1 was retrieved from the NCBI taxonomy database, also obtained on February 7, 2006 (see the note above). If the MaxScore expect value was greater than MAX (currently 10^-6), both the identified organism and the organism specificity of the SubSeq() were set to null. If the MaxScore expect value was small enough, the returns placed in Rank1 were examined. If Rank1 contained a single entry, the SubSeq was assigned an organism specificity of SeqUniqu. If Rank1 contained multiple entries, the SubSeq was assigned an organism specificity of TaxUnique when all returns belonged to the same taxonomic class; otherwise, it was set to TaxAmbig. The task outlined in FIG. 3 was applied to each SubSeq() of the ProSeq. In all cases, the identified organism assigned to each SubSeq() was the taxonomic class that is the common parent of all entries in Rank1.
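The SubSeq state determination described above can be sketched in Python as follows. This is an illustrative reading rather than the actual implementation: each return is assumed to carry a root-first `lineage` list (a hypothetical representation of its taxonomic classification), ranking uses the bit score only, and the function names are invented for the example.

```python
MAX_EXPECT = 1e-6  # the MAX cutoff quoted in the text


def common_parent(lineages):
    """Deepest taxon shared by all lineages (each listed root first)."""
    shared = None
    for level in zip(*lineages):
        if len(set(level)) != 1:
            break
        shared = level[0]
    return shared


def subseq_state(returns, max_expect=MAX_EXPECT):
    """Return (identified_organism, organism_specificity) for one SubSeq."""
    if not returns:
        return None, None
    best = max(r["bit_score"] for r in returns)
    rank1 = [r for r in returns if r["bit_score"] == best]  # tied top returns
    if min(r["expect"] for r in rank1) > max_expect:
        return None, None
    lineages = [r["lineage"] for r in rank1]
    organism = common_parent(lineages)
    if len(rank1) == 1:
        return organism, "SeqUniqu"
    if all(lin == lineages[0] for lin in lineages):
        return organism, "TaxUnique"
    return organism, "TaxAmbig"
```

Because the identified organism is always the common parent of Rank1, a single return or a set of returns from one taxonomic class resolves to that class itself, while returns from sibling classes resolve to their shared parent.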

After each SubSeq was examined, the algorithm moved to the next task, determining the identified organism of the ProSeq from its SubSeq (FIG. 4). If every entry in SubSeq had an identified organism of null, the ProSeq was negative and the next ProSeq was examined. If there was only a single entry in SubSeq for the ProSeq, or if every entry in SubSeq had the same identified organism, a Result1 entry was created for that identified organism; its organism specificity was TaxUnique for multiple SubSeq entries, or was inherited from the single SubSeq entry. If SubSeq contained multiple entries with different identified organisms, further analysis was performed. The SubSeq were sorted by MaxScore (bit score) so that the two entries with the highest scores became SubSeq(1) and SubSeq(2). If the score of SubSeq(1) exceeded that of SubSeq(2) by ≥ 30% (the score ratio threshold), the ProSeq inherited the organism specificity and identified organism of SubSeq(1). Otherwise, the organism specificity of the ProSeq was TaxAmbig and the identified organism was the taxonomic class that is the common parent of all the subsequences. If all subsequences fell within only two taxonomic classes that were a direct child class and its parent class, the identified organism was that of the subsequence in the child class. This ended the subtask shown in FIG. 4, and the Task I loop continued. A list of the ProSeq with detected organisms was built in the Result1 array.
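The ProSeq-level decision of FIG. 4 might be sketched as below. Several points are assumptions made for illustration only: SubSeq with a null identification are ignored when counting entries, the 30% threshold is read as "top score at least 1.3 times the second score", organisms are represented as root-first lineage lists, and all names are hypothetical.

```python
SCORE_RATIO_THRESHOLD = 0.30  # the "score ratio threshold" from the text


def common_lineage(lineages):
    """Longest shared prefix of root-first lineages (path to the common parent)."""
    shared = []
    for level in zip(*lineages):
        if len(set(level)) != 1:
            break
        shared.append(level[0])
    return shared


def proseq_result(subseqs, threshold=SCORE_RATIO_THRESHOLD):
    """Return (lineage, specificity) for a ProSeq, or None if it is negative.

    subseqs: list of dicts with 'lineage' (list or None), 'specificity',
    and 'score' (the MaxScore bit score of that SubSeq).
    """
    hits = [s for s in subseqs if s["lineage"] is not None]
    if not hits:
        return None  # every SubSeq was null: negative ProSeq
    if len(hits) == 1:
        return hits[0]["lineage"], hits[0]["specificity"]
    lineages = [s["lineage"] for s in hits]
    if all(lin == lineages[0] for lin in lineages):
        return lineages[0], "TaxUnique"
    ranked = sorted(hits, key=lambda s: s["score"], reverse=True)
    if ranked[0]["score"] >= (1 + threshold) * ranked[1]["score"]:
        return ranked[0]["lineage"], ranked[0]["specificity"]
    # Exactly two classes in a direct child/parent relationship: keep the child.
    distinct = sorted({tuple(lin) for lin in lineages}, key=len)
    if len(distinct) == 2 and distinct[1][:-1] == distinct[0]:
        return list(distinct[1]), "TaxAmbig"
    return common_lineage(lineages), "TaxAmbig"
```

The score comparison is placed before the child/parent check so that a clearly dominant SubSeq decides the result on its own, which matches the order in which the rules are stated.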

Overall pathogen identification and positive calls

After Task I was completed, Task II (see FIG. 1) examined the identified organism values listed in Result1 and grouped them together when they identified the same taxonomic class. Each entry in Result1 was examined, and a new entry was created in Result2 if its identified organism did not already appear in that list. In most cases, the entries in Result2 represented the individual organisms detected, but they could still contain redundant information. Entries in Result2 whose identified organisms were related as taxonomic parent and child could in fact represent the same pathogen. An identical identification might not have been obtained because the genomic target did not hybridize well to all of the ProSeq, for any of a variety of possible reasons. Alternatively, two different but closely related organisms could both have hybridized to the microarray.
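The Task II grouping step amounts to collecting Result1 entries under their identified organism. A minimal Python sketch, assuming each Result1 entry is a (ProSeq name, identified organism, organism specificity) tuple (a hypothetical layout chosen for the example):

```python
def group_result1(result1):
    """Group Result1 entries by identified organism.

    result1: list of (proseq_name, identified_organism, specificity) tuples.
    Returns result2: identified_organism -> list of (proseq_name, specificity).
    """
    result2 = {}
    for proseq, organism, specificity in result1:
        if organism not in result2:
            result2[organism] = []  # first time this organism is seen
        result2[organism].append((proseq, specificity))
    return result2
```

Grouping on the exact organism only, without merging parent and child classes, leaves the redundancy described above in place for the next task to resolve.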

Correlating the results from separate ProSeq was difficult, but Task III handled the final examination and decisions as currently implemented. The previous tasks were deliberately implemented so that information about what a ProSeq was intended to detect was not considered. This allowed these lower-level tasks to recognize indeterminate cases as well as positive and negative ones. In the final task, the algorithm considered whether a ProSeq identified the organism it was designed to detect. Both clearly negative ProSeq and indeterminate ProSeq were considered negative for the target pathogen. The grouping of ProSeq for this purpose was based on the grouping already performed in Task II. The entries of Result2 were looped over. The ProSeq of each entry were used to look up the targeted pathogen in a table. If the identified organism of the Result2 entry was the same as, or a child of, the taxonomic class of the target pathogen, the Pathogen() array was updated with a positive entry for that targeted pathogen. If the Pathogen() array was null for the pathogen, the pathogen-level identified organism was that of the Result2() entry. If an entry had already been placed for the pathogen, a further comparison was required: the Result2() and Pathogen entries were compared. If they were in a direct child-parent relationship, the identified organism of the Pathogen was the child taxonomic class. Otherwise, the common parent taxonomic class was recorded as the positively identified organism. In most cases, when all ProSeq for a pathogen hybridized well, a fine level of discrimination was reported. However, if one or more ProSeq did not hybridize well, the reported positive target pathogen was identified only at the genus or species level. The results of all three tasks were recorded to allow manual review. Note that organisms identified in Task II that did not belong to a target pathogen were reported as non-target positive returns. Details of what was identified in such cases required examination of the Task II level results.
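The Task III update of the Pathogen() array could look roughly like the Python sketch below. It is a simplified illustration under stated assumptions: each Result2 entry is reduced to one (ProSeq name, lineage) pair, lineages are root-first lists, the `targets` table mapping a ProSeq to its target pathogen is hypothetical, and non-target positives are simply skipped rather than logged.

```python
def is_same_or_child(lineage, target):
    """True when lineage equals target or descends from it."""
    return lineage[:len(target)] == target


def is_direct_child(child, parent):
    """True when child sits exactly one level below parent."""
    return len(child) == len(parent) + 1 and child[:-1] == parent


def common_lineage(a, b):
    """Shared root-first prefix of two lineages."""
    shared = []
    for x, y in zip(a, b):
        if x != y:
            break
        shared.append(x)
    return shared


def task3(result2, targets):
    """Return pathogen name -> reported lineage for positive target pathogens.

    result2: list of (proseq_name, lineage) pairs.
    targets: proseq_name -> (pathogen_name, target_lineage).
    """
    pathogen = {}
    for proseq, lineage in result2:
        if proseq not in targets:
            continue
        name, target = targets[proseq]
        if not is_same_or_child(lineage, target):
            continue  # non-target positive, left for manual review
        current = pathogen.get(name)
        if current is None or is_direct_child(lineage, current):
            pathogen[name] = lineage  # first entry, or the more specific child
        elif lineage == current or is_direct_child(current, lineage):
            pass  # existing entry is already at least as specific
        else:
            pathogen[name] = common_lineage(current, lineage)
    return pathogen
```

Falling back to the common parent when two entries disagree is what causes a poorly hybridized ProSeq to lower the reported level to species or genus.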

Pathogen identification

Chlamydia pneumoniae samples with 10 to 1000 genome copies (prepared by the method of reference 21) were chosen to illustrate how pathogen detection and identification were performed when multiple ProSeq were targeted to the same pathogen (21). RPM v1 carries three highly conserved ProSeq selected from the genes encoding the major outer membrane protein (VD2 and VD4) and the DNA-dependent RNA polymerase (rpoB). The HybSeq from the different samples differed only in how many unique base calls they contained, as shown in Table 1 below. The percentage of each ProSeq that was called varied from 80 to 100%, except in one case at a concentration of 10 in which the rpoB ProSeq produced only 11% unique calls, indicating that the detection limit of the assay had been reached. The determinations made for the SubSeq and at the end of each task for the various samples are listed in Table 1 below. The ProSeq from the different cases produced the same number of SubSeq. These SubSeq from the different samples recorded different bit scores for the same top-ranked returns from BLAST. In fact, VD2 and VD4 produced exactly the same results. The NCBI taxonomy database classified the returns into four distinct groups, representing the C. pneumoniae taxonomic group and three child strain groups. AE001652, AE002167, AE017159, and BA000008 appeared among the returns of every ProSeq for each sample because they are the database entries of completely sequenced genomes. One rpoB SubSeq produced an organism specificity of SeqUniqu. All other SubSeq were TaxAmbig because multiple returns came from different taxonomic classes. Because the VD2 and VD4 ProSeq each had a single SubSeq, Task I assigned the state of that SubSeq to the ProSeq. For the rpoB ProSeq, the bit score of one SubSeq was large enough that the algorithm assigned the identification of that SubSeq to the ProSeq. Task II of the algorithm grouped all three ProSeq together because they all had the same identified organism, and assigned TaxAmbig. The Task III result was positive for the target pathogen C. pneumoniae, and this decision was straightforward because all ProSeq agreed with each other and belonged to the same taxonomic class as the target pathogen. Although the rpoB ProSeq was SeqUniqu, this was not the final determination for Task II, because the SeqUniqu ProSeq was not of a child taxonomic group and the other ProSeq were TaxAmbig. The three recognized substrains scored identically, indicating that the sequences selected for the ProSeq were highly conserved and did not allow discrimination among these strains.

Table 1. Algorithm determinations for C. pneumoniae at several concentrations for SubSeq and Tasks I, II, and III

| Genome copies | ProSeq | Unique calls | # SubSeq | SubSeq identified organism, specificity, bit score | Task I | Task II | Task III |
|---|---|---|---|---|---|---|---|
| 1000 | VD2 | 89% | 1 | (G1) C.pne, TA, 145 | C.pne TA | C.pne TA | Positive C.pne |
| | VD4 | 91% | 1 | (G1) C.pne, TA, 145 | C.pne TA | | |
| | rpoB | 80% | 2 | (G2) C.pne, SU, 307; (G3) C.pne, TA, 73 | C.pne TA | | |
| 100 | VD2 | 100% | 1 | (G1) C.pne, TA, 164 | C.pne TA | C.pne TA | |
| | VD4 | 97% | 1 | (G1) C.pne, TA, 156 | C.pne TA | | |
| | rpoB | 80% | 2 | (G2) C.pne, SU, 343; (G3) C.pne, TA, 87 | C.pne TA | | |
| 100 | VD2 | 83% | 1 | (G1) C.pne, TA, 136 | C.pne TA | C.pne TA | |
| | VD4 | 91% | 1 | (G1) C.pne, TA, 145 | C.pne TA | | |
| | rpoB | 84% | 2 | (G2) C.pne, SU, 318; (G3) C.pne, TA, 82 | C.pne TA | | |
| 10 | VD2 | 100% | 1 | (G1) C.pne, TA, 164 | C.pne TA | C.pne TA | |
| | VD4 | 97% | 1 | (G1) C.pne, TA, 156 | C.pne TA | | |
| | rpoB | 90% | 2 | (G2) C.pne, SU, 340; (G3) C.pne, TA, 89 | C.pne TA | | |
| 10 | VD2 | 100% | 1 | (G1) C.pne, TA, 164 | C.pne TA | C.pne TA | |
| | VD4 | 93% | 1 | (G1) C.pne, TA, 148 | C.pne TA | | |
| | rpoB | 11% | 0 | Null, Null | Null, Null | | |

(G1) J138 (BA000008), AR39 (AE002167), Tw-183 (AE017159), Cpne (M69230, AF131889, AY555078, M64064, AF131229, AF131230)
(G2) Cpne (S83995)
(G3) J138 (BA000008), AR39 (AE002167), Tw-183 (AE017159)
SU: abbreviation for SeqUniqu
TA: abbreviation for TaxAmbig

Influenza and human adenovirus (HAdV) were the only pathogens with ProSeq selected to permit fine, strain-level discrimination, as discussed in previous studies (19, 20, 21). Those studies, which used manual analysis, found that the microarray results agreed very well with conventional sequencing results for clinical samples. The results of running the CIBSI 2.0 program, with the updated NCBI database, on the raw microarray results were compared with the earlier results (Table 2). The identified organisms were not the same as in the original results because of differences in the database used. In fact, for every sample the conventional sequencing result from that work, which had been submitted to NCBI, was found among the returns with the best score. For 8 of the 13 influenza A samples and 3 of the 12 influenza B samples, the results of Tasks I and II showed that the conventional sequence was the single best return and was therefore the identified organism. Given the large number of isolate sequences in the database for the hemagglutinin gene, it was not surprising that a single unique entry was not found in some cases. For each of the remaining 5 influenza A samples, the other sequences returned differed from the conventional sequence by less than 0.2%. Fewer influenza B samples had a unique isolate identification because an older reference sequence was used for the ProSeq, which resulted in less hybridization (19). This also meant that, when multiple sequences were returned for a sample, they showed greater genetic variation, up to 2%. This comparison covered only the algorithm's Task I level analysis of the hemagglutinin (HA) ProSeq, which was the only region that had been conventionally sequenced. Because the previous studies did not attempt to reach a consensus from multiple ProSeq, no comparison could be made for the Task III results. As a result of the current method of masking Task III level identifications, the organisms reported at this level were less specific (H3N2 or Flu B) for all samples (Appendix Tables 1A and 1B). For the HAdV samples, the algorithm also reproduced the finer-scale discrimination that had previously been made by manual methods (not shown).

Table 2. Pathogens identified from HA ProSeqs of influenza A and B clinical samples

| Sample name | Strain identification by prior literature | GenBank accession number | HA3 ProSeq identification by CIBSI 2.0 | GenBank accession number |
|---|---|---|---|---|
| A/Colorado/360/05 | A/Nepal/1679/2004 (H3N2) | AY945284 | A/Colorado/360/05^* | DQ265717 |
| A/Qatar/2039/05 | A/Nepal/1727/2004 (H3N2) | AY945272 | A/Qatar/2039/05 | DQ265707 |
| A/Guam/362/05 | A/Nepal/1679/2004 (H3N2) | AY945264 | A/Guam/362/05 | DQ265715 |
| A/Italy/384/05 | A/Nepal/1727/2004 (H3N2) | AY945272 | A/Italy/384/05 | DQ265713 |
| A/Turkey/2108/05 | A/Nepal/1664/2004 (H3N2) | AY945265 | A/Turkey/2108/05 | DQ265718 |
| A/Korea/298/05 | A/Nepal/1727/2004 (H3N2) | AY945273 | A/Korea/298/05 | DQ265710 |
| A/Japan/1337/05 | A/Malaysia/2256/2004 (H3N2) | ISDN110616 | A/Japan/1337/05^* | DQ265712 |
| A/Japan/1383/05 | A/Malaysia/2256/2004 (H3N2) | ISDN110616 | A/Japan/1383/05 | DQ265711 |
| A/Ecuador/1968/04 | A/New York/17/2003 (H3N2) | CY001053 | A/Ecuador/1968/04^* | DQ265716 |
| A/Iraq/34/05 | A/Christchurch/178/2004 (H3N2) | ISDN110530 | A/Iraq/34/05 | DQ265714 |
| A/Peru/166/05 | A/Macau/103/2004 (H3N2) | ISDN64772 | A/Peru/166/05 | DQ265708 |
| A/New York/2782/04 | A/New York/391/2005 (H3N2) | CY002056 | A/New York/2782/04^S* | DQ265709 |
| A/UK/400/05 | A/New York/227/2003 (H1N1) | CY002536 | A/UK/2005^* | DQ265706 |

Table 2 (continued)

| Sample name | Strain identification by prior literature | GenBank accession number | HA3 ProSeq identification by CIBSI 2.0 | GenBank accession number |
|---|---|---|---|---|
| B/Peru/1324/04 | B/Milan/66/04 | AJ842082 | B/Peru/1324/04^S* | DQ265728 |
| B/Peru/1364/04 | B/Milan/66/04 | AJ842082 | B/Peru/1364/04^S* | DQ265726 |
| B/Colorado/2597/04 | B/Texas/3/2002 | AY139049 | B/Colorado/2597/04^S* | DQ265724 |
| B/Japan/1905/05 | B/Texas/3/2002 | AY139049 | B/Japan/1905/05^S* | DQ265727 |
| B/Japan/1224/05 | B/Texas/3/2002 | AY139049 | B/Japan/1224/05^S* | DQ265719 |
| B/Alaska/1777/05 | B/Texas/3/2002 | AY139049 | B/Alaska/1777/05^S | DQ265730 |
| B/UK/1716/05 | B/Texas/3/2002 | AY139049 | B/UK/1716/05^S | DQ265723 |
| B/UK/2054/05 | B/Texas/3/2002 | AY139049 | B/UK/2054/05^* | DQ265722 |
| B/Hawaii/1990/04 | B/Tehran/80/02 | AJ784042 | B/Hawaii/1990/04 | DQ265721 |
| B/Hawaii/1993/04 | B/Tehran/80/02 | AJ784042 | B/Hawaii/1993/04^S* | DQ265720 |
| B/Arizona/148/04 | B/Tehran/80/02 | AJ784042 | B/Arizona/148/04^* | DQ265725 |
| B/Arizona/146/04 | B/Tehran/80/02 | AJ784042 | B/Arizona/146/04^* | DQ265729 |

^* Multiple returns tied with this return.
^S HybSeq was split into multiple SubSeq.

The next detection example, for the Mycoplasma pneumoniae pathogen, illustrates a case in which there was only a single ProSeq for the target pathogen. This meant that the identified organism from Task I of the algorithm was automatically the result of Task II and was the only ProSeq considered in Task III for this targeted pathogen. This ProSeq was also not suited to fine discrimination, because it was selected from a highly conserved region (345 bp) of the cytadhesin P1 gene. Forty microarrays were tested with the same purified nucleic acid stock, and in every case the returns that tied for MaxScore were database entries recognized as M. pneumoniae or one of its substrains. To understand these returns better, the database sequences were examined and subdivided into three groups, A, B, and C, according to how well each matched the reference sequence used to generate the ProSeq. The placement of database entries in the three groups was determined from a CLUSTAL alignment of this gene sequence. The alignment confirmed that the database entries differed from each other more significantly in regions that were not represented by the ProSeq and that contained enough variability to allow finer discrimination. The members of group A matched the ProSeq exactly and could not be distinguished from each other on the microarray. Similarly, the members of group B matched the ProSeq except at position 199, where the called base was C rather than T. The group C sequences were more variable and included a few database entries that could be distinguished from the other entries within the ProSeq region. In the 40 experimental tests of M. pneumoniae, in which 95% of the ProSeq hybridized, only 65% of the results had an unambiguous base call at position 199. When the base call was unambiguous, it always matched the group B sequences. When an N base call was made at position 199, the group A and group B sequences were both returned with the same score. Regardless, the positively identified target pathogen was M. pneumoniae for all samples tested.

These examples showed how determinations were made regardless of whether a single ProSeq or multiple ProSeq were used for a target pathogen. They also illustrated that the achievable level of discrimination is strongly determined by the quality of the ProSeq selected. Fine-level discrimination may not be required for some pathogens, and the selections currently tested on RPM v1 will provide satisfactory information. The CIBSI 2.0 algorithm demonstrated its ability to automatically report the maximum level of discrimination that the HybSeq information could support.

Genetic Near Neighbors

To illustrate how the algorithm handles closely related genetic species, samples of a non-targeted pathogen were considered. For variola major virus, one of the biothreat pathogens on RPM v1, validation runs demonstrated that the virus was always positively identified when variola major virus DNA template was present. The array carries two ProSeq for variola major virus detection, derived from the cytokine response modifier B (VMVcrmB, ~300 bp) and hemagglutinin (VMVHA, ~500 bp) genes. Table 3 below shows the results for each ProSeq from 18 runs in which a close near neighbor, vaccinia virus, was spiked into nasal wash at various concentrations. The percentage of each ProSeq that hybridized was high enough that, if only the hybridization pattern were considered, one might infer that these tiles identified the presence of their target. This indicated that the reference sequences selected were not the best choice. However, when the algorithm was applied, none of the samples was in fact identified as variola major or variola minor virus. Vaccinia virus was always one of the listed orthopoxvirus species with the highest score for the VMVcrmB ProSeq, but it was detected as the only possible species in just 7 cases. In the three samples with the lowest concentration and the lowest fraction of VMVcrmB hybridized, this ProSeq identified variola major virus as one of several orthopoxvirus species that could have caused the hybridization. The lower limit of detection for the amplification method used lay between this concentration and the next higher one. The VMVHA ProSeq identified orthopoxvirus species in only two experiments and listed variola major virus as one of the returns tied for the best score. In both of those cases, the VMVcrmB ProSeq specifically identified vaccinia virus as the best match. The percentage of each ProSeq that hybridized correlated with the concentration of the sample.

Table 3. Organism identification from vaccinia samples on the variola major virus ProSeq

| CFU | VMVcrmB % | VMVcrmB identification | VMVHA % | VMVHA identification |
|---|---|---|---|---|
| 5*10⁷ | 77.9 | Vaccinia virus | 29.4 | Orthopoxvirus |
| 5*10⁷ | 79.8 | Vaccinia virus | 25.7 | Orthopoxvirus |
| 1.6*10⁷ | 79.4 | Vaccinia virus | 14.8 | - |
| 1.6*10⁷ | 77.5 | Orthopoxvirus^* | 24.5 | - |
| 1.6*10⁷ | 76.8 | Vaccinia virus | 21.6 | - |
| 1.6*10⁷ | 74.5 | Orthopoxvirus^* | 17.3 | - |
| 5*10⁶ | 77.9 | Vaccinia virus | 25.7 | - |
| 5*10⁶ | 78.3 | Orthopoxvirus^* | 22.0 | - |
| 5*10⁶ | 73.0 | Vaccinia virus | 13.0 | - |
| 5*10⁶ | 73.4 | Orthopoxvirus^* | 7.8 | - |
| 1.6*10⁶ | 75.3 | Orthopoxvirus^* | 8.6 | - |
| 1.6*10⁶ | 49.8 | Vaccinia virus | 6.6 | - |
| 1.6*10⁶ | 65.5 | Orthopoxvirus^* | 10.0 | - |
| 1.6*10⁶ | 62.9 | Orthopoxvirus^* | 8.2 | - |
| 5*10⁵ | 58.4 | Orthopoxvirus^* | 9.0 | - |
| 5*10⁵ | 56.2 | Orthopoxvirus | 8.0 | - |
| 5*10⁵ | 49.0 | Orthopoxvirus | 9.3 | - |
| 5*10⁵ | 44.6 | Orthopoxvirus | 7.8 | - |

^* Only near-neighbor species within Orthopoxvirus, not variola major or variola minor virus.
CFU: colony forming units.

Filtering

The example above illustrated the importance of the filtering portion of the algorithm by considering the HybSeqs of the ProSeqs for the H1N1 neuraminidase (NA1) and matrix genes from the human influenza A/Puerto Rico/8/34 (H1N1) strain. Filtering is required because submitting the HybSeq of a ProSeq to BLAST as a single query, particularly with BLAST parameters that maximize the use of base calls, can bias the scores toward strains that have insertions or deletions relative to the ProSeq. The sliding window test was the part of the algorithm that controlled filtering. With filtering turned off, the entire HybSeq was used as a single subsequence for each of the two influenza ProSeqs that showed significant hybridization. The A/Weiss/43 (H1N1) strain was identified as the most likely strain from the HybSeq of the NA1 ProSeq, whereas the HybSeq of the matrix ProSeq correctly identified A/Puerto Rico/8/34.

To better understand the source of the bias, a CLUSTAL alignment of the NA1 genes of the two strains and of the reference sequence used to generate the ProSeq is shown in FIG. 5. The two strains showed 95% identity (67 mismatches among 1362 aligned bases), but both A/Weiss/43 (SEQ ID NO: 2) and the NA1 ProSeq (SEQ ID NO: 1) contained an inserted stretch of 45 bases relative to A/Puerto Rico/8/34 (SEQ ID NO: 3). With default filtering, the NA1 ProSeq was split into five SubSeqs because the algorithm encountered a large stretch with no calls. In Task I, the algorithm determined that the three shorter SubSeqs had H1N1 as their identified organism, with several isolates, including the A/Puerto Rico/8/34 strain, tied for the best score, while the other two SubSeqs had only the A/Puerto Rico/8/34 strain as their identified organism and closest match. The organism identified by the NA1 ProSeq was A/Puerto Rico/8/34 because one of the SubSeqs had a much higher score. This ProSeq supported the same strain identification made by the matrix ProSeq. The identified organism was A/Puerto Rico/8/34 because both ProSeqs detected only that organism.

With filtering, the exact target pathogen was detected, whereas without filtering the level of identification of the target pathogen was influenza A (H1N1 subtype), because two organisms, A/Puerto Rico/8/34 and A/Weiss/43, were detected. Splitting a HybSeq into SubSeqs to eliminate bias can reduce the level of identification, as happened here for three of the five SubSeqs. The preceding vaccinia example was another case in which identification of the wrong species (Camel Pox or Callithrix jacchus) could occur if filtering was not used. The clinical samples in Table 2 showed that a HybSeq split into multiple SubSeqs can still give a very specific identification.
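The drop from strain level to subtype level when two strains are detected behaves like taking the deepest group shared by all candidate organisms. The following is a minimal sketch of that idea, not the patent's `dbparse` program: the function name `common_level` and the hand-written lineages are assumptions made for illustration, and real NCBI taxonomy lineages contain different and more numerous ranks.

```python
from typing import List, Sequence


def common_level(lineages: Sequence[Sequence[str]]) -> List[str]:
    """Return the longest shared prefix of the given lineages.

    Each lineage is ordered from the broadest group to the most specific one.
    """
    if not lineages:
        return []
    shared: List[str] = []
    for ranks in zip(*lineages):
        # Stop at the first rank where the candidates disagree.
        if any(rank != ranks[0] for rank in ranks):
            break
        shared.append(ranks[0])
    return shared


# Illustrative lineages only (assumption); not copied from the NCBI taxonomy.
PR8 = ["Influenza A virus", "H1N1", "A/Puerto Rico/8/34"]
WEISS = ["Influenza A virus", "H1N1", "A/Weiss/43"]

# Both ProSeqs point at the same strain: identification stays at strain level.
print(common_level([PR8, PR8])[-1])
# Two different strains detected: identification falls back to the shared subtype.
print(common_level([PR8, WEISS])[-1])
```

A longest-shared-prefix rule is only one way to combine candidates. It ignores scores, so it cannot reproduce the case above in which one SubSeq wins because its score is much higher.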

As a secondary point, when a strategy other than generic amplification was used, additional filtering as described in this method was needed to remove potential bias from the specific primers. FIG. 5 includes the raw (SEQ ID NO: 4) and mask-filtered (SEQ ID NO: 5) results for hybridization of A/Puerto Rico/8/34 as an example of this interference. Apart from the bias problem described above, the raw result contains a sequence of 18 bases that becomes N after filtering because it lies at positions that interact with a primer. If these base calls were included in the constructed subsequence, the query for the ProSeq would favor the incorrect strain.
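The masking step can be illustrated with the two sequences given in the sequence listing. This is a hedged sketch rather than the `fstorepi` program itself: the helper names are hypothetical, the position list is simply the span over which SEQ ID NO: 4 and SEQ ID NO: 5 differ (the real contents of "primerhyb.dat" are not shown in this document), and a SNP is assumed to be a non-N call that differs from the reference at the same position, which presumes the sequences are already aligned position by position.

```python
from typing import Iterable


def join_listing(blocks: str) -> str:
    """Remove the spaces that the sequence listing puts between 10-base blocks."""
    return "".join(blocks.split())


def mask_positions(sequence: str, positions: Iterable[int]) -> str:
    """Replace the base calls at the given 1-based positions with 'n'."""
    bases = list(sequence)
    for pos in positions:
        if 1 <= pos <= len(bases):
            bases[pos - 1] = "n"
    return "".join(bases)


def unique_calls(sequence: str) -> int:
    """Count base calls that are not the ambiguous call N."""
    return sum(1 for base in sequence.lower() if base in "acgt")


def snp_ratio(sequence: str, reference: str) -> float:
    """SNPs divided by unique calls (assumes position-by-position alignment)."""
    total = unique_calls(sequence)
    if total == 0:
        return 0.0
    snps = sum(
        1
        for call, ref in zip(sequence.lower(), reference.lower())
        if call in "acgt" and call != ref
    )
    return snps / total


# SEQ ID NO: 1 (NA1 ProSeq), NO: 4 (raw) and NO: 5 (filtered) from the listing.
REF = join_listing("ctgggtaaat caaacatatg tcaatattaa caacactaac gttgttgctg gaaaggacac a")
RAW = join_listing("ctgggnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnngnc gttgttgctg gaaaggncac a")
FILTERED = join_listing("ctgggnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnac a")

# Hypothetical mask list: positions 38 to 58, where the two listed results differ.
PRIMER_POSITIONS = range(38, 59)

masked = mask_positions(RAW, PRIMER_POSITIONS)
print(masked == FILTERED)
print(unique_calls(RAW), unique_calls(masked))
print(snp_ratio(RAW, REF), snp_ratio(masked, REF))
```

Masking by position rather than by base content means the same list can be applied to every sample run on the same chip with the same primers, regardless of which calls happen to appear in the masked region.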

The algorithm successfully provided pathogen identification to the maximum level of detail possible (species or strain), depending on the quality of each ProSeq. This identification capability requires minimal input about the identity of the possible pathogens, which makes it suitable for use by non-experts. An important feature was the use of a taxonomic database that classifies organisms into ordered groups and provides relationships between organism entries. This allowed the removal of redundancy, the comparison of different related prototype sequences, and the simplification of data presentation, and so made full automation possible. It also allowed a database, namely NCBI, that is redundant and subject to minimal curation, but that continually receives new sequences, to be used successfully. Although the method was demonstrated using only the NCBI databases, other databases or custom ones could readily be used, and this might improve performance.

The algorithm was able to provide accurate identification at all levels of analysis for pathogens represented by less variable or highly conserved ProSeqs. For more variable or rapidly mutating pathogens, such as influenza A virus, Tasks I and II still provided accurate detailed identification, but Task III could not report fine-scale differences. Comparison with conventionally sequenced influenza virus gene sequences showed that the algorithm adjusts automatically to updates in the database. The algorithm demonstrated its ability to properly distinguish hybridization on a ProSeq caused by the specific pathogen from hybridization caused by genetically close (near-neighbor) strains, and it made no incorrect identifications, which removes one potential source of false positives. Filtering the raw hybridization results served to reduce computation time, to account for potential primer interference, and, more importantly, to reduce potential bias. This simple integrated algorithm provides sufficient and accurate identification for immediate use with RPM v1 or similar resequencing arrays and assays.

In addition to demonstrating the success of the CIBSI 2.0 program, the work involved in developing the algorithm gave insight into the importance of proper ProSeq selection. RPM v1 was the first resequencing array designed specifically for multiple-pathogen detection using database similarity searching, and it served as the prototype for this application. It was shown that, when designed correctly, a single ProSeq as short as 100 bp can be sufficient to identify an organism unambiguously. It is clear, however, that several longer ProSeqs provide better confirmation and more detailed information about a pathogen. The emphasis of the design so far has been on capabilities that apply generally to any pathogen. Improving the performance of Task III may require more information about individual pathogens and may have to be developed separately for each specific pathogen or class of pathogens. Such information may also require the algorithm to recognize when differences between a sample and the database entries represent significant mutations. The hierarchical design of the data analysis should make it easy to incorporate analyses that build on those already performed.

The use of properly designed resequencing microarrays together with this automated detection algorithm can point the way toward assays that test for many organisms simultaneously while providing fine strain-level discrimination and giving access to detailed strain recognition, antibiotic resistance markers, and information on pathogenicity. This will allow the analysis of partial sequence information from multiple organisms for applications such as differential diagnosis of illnesses with multiple potential causes (i.e., febrile respiratory illness), tracking of emerging pathogens, distinguishing biological threats from harmless genetic near neighbors in surveillance applications, and tracking the impact of co-infections or superinfections. The concept of reporting and categorizing different degrees of identification depending on the set of target sequences and the quality of the sample is not limited to resequencing microarrays; it applies more generally to any platform that can return sequence-level calls usable for querying a reference DNA database. As the trend toward assays that test for multiple pathogens grows, automated analysis tools such as this one become more important for rapid identification in a simple format that is useful to non-experts in day-to-day work.

Source code

Below is a listing of the PERL source code of an embodiment of the disclosed method. The "overclinical" program is the top-level program that runs the other programs. "fstorepi" performs filtering, subsequence generation, and query file generation. It uses an input file, "primerhyb.dat", which contains the predetermined list of positions to be changed to N. "runblast" performs the BLAST query. "dbparse" performs the taxonomic analysis. It uses an input file, "chip1pathogengroups", which contains the list of target pathogens for each ProSeq.

Obviously, many modifications and variations of the present invention are possible in light of the above teachings. It is therefore to be understood that the claimed invention may be practiced otherwise than as specifically described. Any reference to claim elements in the singular, for example using the articles "a," "an," "the," or "said," is not to be construed as limiting the element to the singular.

SEQUENCE LISTING

<110> Malanoski, Anthony P
      Lin, Baochuan
      Schnur, Joel M
      Stenger, David A
<120> COMPUTER-IMPLEMENTED BIOLOGICAL SEQUENCE IDENTIFIER SYSTEM AND METHOD
<130> 97748US2
<150> 60/691,768
<151> 2005-06-16
<150> 60/735,876
<151> 2005-11-14
<150> 60/735,824
<151> 2005-11-14
<150> 60/743,977
<151> 2006-03-30
<150> 11/177,647
<151> 2005-07-02
<150> 11/177,646
<151> 2005-07-02
<150> 11/268,373
<151> 2005-11-07
<150> 11/422,425
<151> 2006-06-06
<150> 11/422,431
<151> 2006-06-06
<160> 5
<170> PatentIn version 3.3

<210> 1
<211> 61
<212> DNA
<213> Human Influenza A
<220>
<221> gene
<222> (1)..(61)
<223> NA1
<400> 1
ctgggtaaat caaacatatg tcaatattaa caacactaac gttgttgctg gaaaggacac 60
a 61

<210> 2
<211> 61
<212> DNA
<213> Human Influenza A
<220>
<221> gene
<222> (1)..(61)
<223> NA1
<400> 2
ctgggtaaat caaacatatg ttaatattag caacactaac gttgttgctg gaaaaggcac 60
a 61

<210> 3
<211> 16
<212> DNA
<213> Human Influenza A
<220>
<221> gene
<222> (1)..(16)
<223> NA1
<400> 3
ctgggtaaag gacaca 16

<210> 4
<211> 61
<212> DNA
<213> Unknown
<220>
<223> Raw data
<220>
<221> misc_feature
<222> (1)..(61)
<223> n is a, c, g, or t
<400> 4
ctgggnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnngnc gttgttgctg gaaaggncac 60
a 61

<210> 5
<211> 61
<212> DNA
<213> Unknown
<220>
<223> Filtered data
<220>
<221> misc_feature
<222> (1)..(61)
<223> n is a, c, g, or t
<400> 5
ctgggnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnac 60
a 61

Claims

A method of processing a biological sequence obtained from an analysis, the method comprising:
masking the base calls located at a predetermined list of positions in the biological sequence as ambiguous (N) base calls, wherein the unique base calls in the biological sequence are the base calls that are not ambiguous (N) base calls, and each entry in the predetermined list of positions indicates the ability of a substance to hybridize to the microarray used to generate the biological sequence, wherein the substance is not a nucleic acid of the target pathogen; and
determining the ratio of single nucleotide polymorphisms in the biological sequence, relative to a reference sequence, to the total number of unique base calls.

The method of claim 1, wherein the substance is a PCR primer.

The method of claim 1, further comprising:
selecting a potential subsequence of an initial length from the biological sequence when the ratio of single nucleotide polymorphisms is below a SNP (single nucleotide polymorphism) threshold; and
calculating the total content of unique base calls in the potential subsequence.

The method of claim 3, wherein the SNP threshold is about 20%.

The method of claim 3, further comprising:
extending the potential subsequence by one base if the total content of unique base calls is above a unique base call threshold;
recalculating the total content of unique base calls in the extended potential subsequence;
repeating the extending and recalculating until the total content of unique base calls is below the unique base call threshold;
removing trailing ambiguous (N) base calls from the extended potential subsequence to form a query subsequence; and
calculating the total content of unique base calls in the query subsequence.

The method of claim 5, wherein the unique base call threshold is about 40%.

The method of claim 5, wherein the repeating of the extending and recalculating stops if the last 21 positions of the extended potential subsequence contain fewer than 4 unique base calls.

The method of claim 5, further comprising identifying the biological sequence by submitting the query subsequence as a query to a genetic database if the length of the query subsequence and the total content of unique base calls in the query subsequence meet a predetermined requirement.

The method of claim 8, wherein the requirement is that:
the query subsequence comprises at least seven consecutive unique base calls; and
the length of the query subsequence is greater than 100 bases; or the length of the query subsequence is 30 to 100 bases and the total content of unique base calls in the query subsequence is at least about ((length of the query subsequence - 30) * 0.2857 + 70)%; or the length of the query subsequence is less than 30 bases and the total content of unique base calls in the query subsequence is at least about 95%.
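The window extension and the acceptance test recited in the claims above can be sketched as follows. This is an illustrative reading of the claim language, not the patent's PERL listing: the function names are hypothetical, the initial window length of 20 is an assumed value (the claims say only "initial length"), a window whose content exactly equals the threshold is assumed to keep growing, and the SNP-threshold gate of about 20% is left out.

```python
import re


def count_unique(seq: str) -> int:
    """Number of unique (non-N) base calls in seq."""
    return sum(1 for base in seq.lower() if base in "acgt")


def percent_unique(seq: str) -> float:
    """Total content of unique base calls, as a percentage of the length."""
    return 100.0 * count_unique(seq) / len(seq) if seq else 0.0


def grow_window(hybseq: str, start: int, initial_len: int = 20,
                call_threshold: float = 40.0) -> str:
    """Extend a window one base at a time and return the query subsequence.

    initial_len is an assumption; call_threshold follows the "about 40%" claim.
    """
    end = min(start + initial_len, len(hybseq))
    while end < len(hybseq) and percent_unique(hybseq[start:end]) >= call_threshold:
        # Stop when the last 21 positions hold fewer than 4 unique calls.
        if end - start >= 21 and count_unique(hybseq[end - 21:end]) < 4:
            break
        end += 1
    # Drop trailing ambiguous calls to form the query subsequence.
    return hybseq[start:end].rstrip("nN")


def meets_requirement(query: str) -> bool:
    """Apply the length and content requirement as worded in the last claim."""
    length = len(query)
    percent = percent_unique(query)
    # At least seven consecutive unique base calls are needed in every case.
    if not re.search(r"[acgt]{7}", query.lower()):
        return False
    if length > 100:
        return True
    if length >= 30:
        return percent >= (length - 30) * 0.2857 + 70
    return percent >= 95


# Synthetic example (assumption): 40 calls, a 40-base no-call stretch, 20 calls.
hybseq = "acgt" * 10 + "n" * 40 + "acgt" * 5
query = grow_window(hybseq, 0)
print(len(query), meets_requirement(query))
```

In this synthetic case the no-call stretch ends the window through the 21-position rule long before the overall content falls to 40%, which is the behavior that splits a HybSeq into several SubSeqs. How the start of the next window is chosen is not specified in this section and is not modeled here.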