KR20220157773A

KR20220157773A - Bacteria identification apparatus and bacteria identification method

Info

Publication number: KR20220157773A
Application number: KR1020210065726A
Authority: KR
Inventors: 권창혁; 오귀영; 김우진; 김동인
Original assignee: 의료법인 이원의료재단
Priority date: 2021-05-21
Filing date: 2021-05-21
Publication date: 2022-11-29

Abstract

An apparatus for identifying bacteria and a method for identifying bacteria are provided. The method for identifying bacteria comprises: (a) a step of converting raw sequence file fragments from sequences generated by next-generation sequencing equipment into one file; (b) a step of removing a specific primer sequence from the converted file using a bioinformatics analysis tool; (c) a step of removing sequence duplication and errors using an analysis tool in the file from which the specific primer sequence has been removed; (d) a step of identifying bacteria by comparing the sequences of the file from which duplicates and errors in the sequences have been removed with sequences stored in a database using a bioinformatics analysis tool, and (e) a step of analyzing the identified bacteria using a statistical tool.

Description

Bacteria identification apparatus and bacteria identification method

The present invention relates to an apparatus and a method for identifying bacteria.

Over the past decade, predicting the taxonomic composition of metagenomic samples has been a difficult problem. Determining which microbial taxa a given sample contains can yield considerable insight into the role microbes play in their environment. Adding the new genomes published each year to the reference database enables more accurate and more detailed classification. However, this process is computationally demanding: millions of reads from a sample must be compared against thousands of reference genomes, which generally requires a large CPU cluster.

As the number of publicly available genomes has grown in recent years, the "exact k-mer match" approach has become sufficiently reliable, and faster computers have made it practical. Homology search methods, by contrast, are slow because of the large number of comparisons they must perform, and inaccurate because related genomes have similar sequence composition. To avoid this inaccuracy and reduce computation time, some homology search methods use genetic markers (sequences that occur only once across several species or genera) to reduce the number of comparisons.

A drawback of marker-based methods is that bacterial genome size and gene frequency are highly irregular (some species or genera contain more markers than others), and the markers must be recomputed whenever another species or genus is added to the reference database. If an existing marker is found in a newly discovered, entirely different taxon, that marker can no longer be used for the original taxon.

Korean Patent Application Publication No. 10-2020-0027900 (published March 13, 2020)

One object of the present invention is to provide a bacteria identification apparatus and a bacteria identification method capable of identifying bacteria in an optimized manner.

Another object of the present invention is to provide a bacteria identification apparatus and a bacteria identification method capable of automatically identifying gut bacteria at the phylum level and below.

The problems to be solved by the present invention are not limited to those mentioned above, and other problems not mentioned will be clearly understood by those skilled in the art from the description below.

A bacteria identification method according to an embodiment of the present invention for solving the above problems includes: (a) converting the raw sequence file fragments of the sequences generated by next-generation sequencing equipment into a single file; (b) removing a specific primer sequence from the converted file using a bioinformatics analysis tool; (c) removing sequence duplicates and errors, using an analysis tool, from the file from which the specific primer sequence has been removed; (d) identifying bacteria by comparing, using a bioinformatics analysis tool, the sequences of the file from which duplicates and errors have been removed with sequences stored in a database; and (e) analyzing the identified bacteria using a statistical tool.

In an embodiment, step (a) converts the sequence fragments that exist for each sample into a single file by assembling or compressing them into one file.

In an embodiment, step (b) removes the specific primer sequence by removing, from the single file converted in step (a), the primer sequence corresponding to a specific bacterial region.

In an embodiment, step (c) removes sequence errors and duplicates by locating the positions of erroneous sequences in the forward and reverse sequences.

In an embodiment, step (c) removes sequence errors using the ASV (Amplicon Sequence Variant) method.

In an embodiment, the sequences generated by the next-generation sequencing equipment are sequence files produced by the sequencing equipment.

In an embodiment, the sequences generated by the next-generation sequencing equipment are sequences of bacteria present in the human intestine.

In an embodiment, the sequences generated by the next-generation sequencing equipment are sequences obtained by next-generation sequencing (NGS) analysis of bacteria extracted from feces excreted from the human intestine.

A bacteria identification apparatus according to another embodiment of the present invention includes: a next-generation sequencing device that generates sequences through next-generation sequencing analysis; and a control unit that converts the raw sequence file fragments of the sequences generated by the next-generation sequencing device into a single file, removes a specific primer sequence from the converted file using a bioinformatics analysis tool, removes sequence duplicates and errors, using an analysis tool, from the file from which the specific primer sequence has been removed, identifies bacteria by comparing, using a bioinformatics analysis tool, the sequences of the file from which duplicates and errors have been removed with sequences stored in a database, and analyzes the identified bacteria using a statistical tool.

A bacteria identification program according to another embodiment of the present invention for solving the above problems is combined with a computer, which is hardware, and stored in a medium in order to perform any one of the methods described above.

In addition, other methods and other systems for implementing the present invention, and a computer-readable recording medium storing a computer program for executing the method, may further be provided.

According to the present invention as described above, a bacteria identification apparatus and method capable of identifying bacteria in an optimized manner can be provided.

The present invention can also provide a novel bacteria identification apparatus and method capable of automatically identifying gut bacteria at the phylum level and below.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the description below.

FIG. 1 is a conceptual diagram illustrating a bacteria identification apparatus according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating a bacteria identification method according to an embodiment of the present invention.
FIGS. 3, 4, 5A, 5B, 6A, 6B, 7, 8, 9, and 10 are conceptual diagrams illustrating the bacteria identification method described with reference to FIG. 2.

The advantages and features of the present invention, and the methods of achieving them, will become clear from the embodiments described in detail below together with the accompanying drawings. The present invention is not, however, limited to the embodiments disclosed below and may be implemented in various different forms. These embodiments are provided only to make the disclosure of the present invention complete and to fully convey the scope of the invention to those skilled in the art to which the invention pertains; the invention is defined only by the scope of the claims.

The terminology used herein is for describing the embodiments and is not intended to limit the present invention. In this specification, singular forms include plural forms unless the context specifically states otherwise. As used herein, "comprises" and/or "comprising" do not exclude the presence or addition of one or more elements other than those recited. Like reference numerals refer to like elements throughout the specification, and "and/or" includes each of the recited elements and every combination of one or more of them. Although "first", "second", and the like are used to describe various elements, these elements are not limited by these terms, which serve only to distinguish one element from another. Accordingly, a first element mentioned below may also be a second element within the technical spirit of the present invention.

Unless otherwise defined, all terms (including technical and scientific terms) used herein have the meaning commonly understood by those skilled in the art to which the present invention pertains. Terms defined in commonly used dictionaries are not to be interpreted in an idealized or excessive manner unless expressly so defined.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings.

Before the description, the meaning of the terms used in this specification is briefly explained. Note, however, that these explanations are intended to aid understanding of the specification and, unless explicitly stated as limiting the invention, are not used to limit its technical spirit.

In this specification, a 'bacteria identification apparatus' encompasses any of the various devices that can perform computation and provide results to a user.

For example, the bacteria identification apparatus may be a computer, a terminal, a desktop PC, or a notebook computer, as well as a smartphone, a tablet PC, a cellular phone, a PCS (Personal Communication Service) phone, a synchronous/asynchronous IMT-2000 (International Mobile Telecommunication-2000) mobile terminal, a Palm PC (Palm Personal Computer), or a personal digital assistant (PDA).

The bacteria identification apparatus may also communicate with a server that receives requests from clients and performs information processing.

A bacteria identification apparatus according to an embodiment of the present invention may be implemented to include at least one of the components described with reference to FIG. 1.

FIG. 1 is a conceptual diagram illustrating a bacteria identification apparatus according to an embodiment of the present invention.

The bacteria identification apparatus according to an embodiment of the present invention may include a next-generation sequencing device 110 and a control unit 130.

The next-generation sequencing device 110 may generate sequences through next-generation sequencing (NGS) analysis.

Here, the sequences generated by the next-generation sequencing device 110 may be sequence files produced by sequencing equipment.

The sequences generated by the next-generation sequencing device 110 may also be sequences of bacteria present in the human intestine.

The sequences generated by the next-generation sequencing device 110 may also be sequences obtained by NGS analysis of bacteria extracted from feces excreted from the human intestine.

FIG. 2 is a flowchart illustrating a bacteria identification method according to an embodiment of the present invention.

Referring to FIG. 2, the control unit 130 may convert the raw sequence file fragments of the sequences generated by the next-generation sequencing equipment into a single file (S210).

The control unit 130 may convert the sequence fragments that exist for each sample into a single file by assembling or compressing them into one file.
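Step S210 can be illustrated with a minimal sketch. It assumes the per-sample fragments are FASTQ chunks, either plain or gzip-compressed, and that simple concatenation in file-name order is acceptable; the function name `merge_fastq_chunks` and the file layout are hypothetical and are not taken from the specification.

```python
import gzip
import shutil
from pathlib import Path


def merge_fastq_chunks(chunk_paths, merged_path):
    """Concatenate one sample's FASTQ chunks into a single gzip file.

    Chunks are processed in sorted path order; both plain and .gz
    chunks are accepted (an assumption made for this sketch).
    """
    merged_path = Path(merged_path)
    with gzip.open(merged_path, "wb") as out:
        for chunk in sorted(Path(p) for p in chunk_paths):
            # Pick the opener from the file suffix.
            opener = gzip.open if chunk.suffix == ".gz" else open
            with opener(chunk, "rb") as src:
                shutil.copyfileobj(src, out)
    return merged_path
```

Writing the merged output as gzip covers both the "assemble" and the "compress" variants described for this step in one pass.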

The control unit 130 may remove a specific primer sequence from the converted file using a bioinformatics analysis tool (S220).

Specifically, the control unit 130 may remove the primer sequence corresponding to a specific bacterial region from the converted file.
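The specification does not name the primer or the trimming tool used in S220. The following sketch only illustrates the idea of 5' primer removal; it assumes the primer may contain IUPAC degenerate bases and tolerates a small number of mismatches. The function `trim_primer` and the `max_mismatch` parameter are hypothetical.

```python
# IUPAC nucleotide codes mapped to the bases each one accepts.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def trim_primer(read, qual, primer, max_mismatch=1):
    """Remove `primer` from the 5' end of `read` when it matches.

    Returns (read, qual, trimmed_flag). The quality string is cut at
    the same position so that bases and qualities stay aligned.
    """
    if len(read) < len(primer):
        return read, qual, False
    mismatches = sum(base not in IUPAC[p] for base, p in zip(read, primer))
    if mismatches > max_mismatch:
        return read, qual, False
    return read[len(primer):], qual[len(primer):], True
```

Reads whose 5' end does not match the primer are returned unchanged with the flag set to `False`, so a caller can decide whether to keep or discard them.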

The control unit 130 may remove sequence duplicates and errors, using an analysis tool, from the file from which the specific primer sequence has been removed (S230).

The control unit 130 may locate the positions of erroneous sequences in the forward and reverse sequences and remove the sequence errors and duplicates.

Here, the control unit 130 may remove sequence errors using the ASV (Amplicon Sequence Variant) method, which is one of the bioinformatics error-correction methods.

An ASV may refer to a single DNA sequence recovered from a high-throughput marker gene analysis.

ASVs can be used to classify groups of species based on DNA sequences, to detect biological and environmental variation, and to determine ecological patterns.

ASV methods are broadly applicable through algorithms that build an error model fitted to each individual sequencing run and use that model to distinguish true biological sequences from those produced by errors.
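As a rough illustration of duplicate and error removal, the sketch below dereplicates reads and then applies a crude abundance heuristic: a sequence is folded into a same-length sequence that differs at exactly one position and is at least `min_ratio` times more abundant. This heuristic is a stand-in chosen for brevity. It is not the run-specific error model that ASV methods fit, and the names and the default ratio are assumptions.

```python
from collections import Counter


def hamming(a, b):
    """Number of differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def dereplicate(reads):
    """Collapse identical reads into {sequence: abundance}."""
    return Counter(reads)


def absorb_rare_variants(counts, min_ratio=10):
    """Fold likely error reads into their abundant one-mismatch neighbour.

    Sequences are visited from most to least abundant, so a candidate
    parent is always already present in `kept`.
    """
    kept = {}
    for seq, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        parent = next(
            (k for k in kept
             if len(k) == len(seq)
             and hamming(k, seq) == 1
             and kept[k] >= min_ratio * n),
            None,
        )
        if parent is None:
            kept[seq] = n
        else:
            kept[parent] += n
    return kept
```

A real ASV workflow would replace `absorb_rare_variants` with a denoiser that learns substitution error rates from the quality scores of the run.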

The control unit 130 may identify bacteria by comparing, using a bioinformatics analysis tool, the sequences of the file from which duplicates and errors have been removed with sequences stored in a database (S240).

Here, the bioinformatics analysis tool may include a bioinformatics algorithm, for example the Needleman-Wunsch algorithm.
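For reference, a compact version of Needleman-Wunsch global alignment is sketched below. The scoring values (match +1, mismatch -1, linear gap -1) and the helper `best_reference`, which picks the highest-scoring entry from a small in-memory dictionary, are illustrative assumptions; the specification does not give scoring parameters or a database format.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


def best_reference(query, references):
    """Name of the reference sequence with the highest alignment score."""
    return max(references, key=lambda name: needleman_wunsch(query, references[name])[0])
```

The dynamic-programming table costs time and memory proportional to the product of the two sequence lengths, which is workable for amplicon-length reads but is one reason large databases are usually pre-filtered before full alignment.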

The result data of the bacteria identification may include the read count, the read percentage, and the like.

The result data may also be divided into seven data sets according to taxonomic rank: species, genus, family, order, class, phylum, and kingdom.

The control unit 130 may output the result data as a .tsv or .csv file.
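The result tables described above can be sketched as follows. The column names and the input shape (one dictionary of rank-to-taxon labels per read) are assumptions made for illustration; only the read count, the read percentage, the seven ranks, and the .tsv output come from the description.

```python
import csv
from collections import Counter

# The seven taxonomic ranks, from broadest to most specific.
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def summarize_rank(assignments, rank):
    """Return (taxon, read_count, read_percentage) rows for one rank.

    `assignments` is a list with one dict per read, mapping each rank
    name to the taxon label assigned at that rank.
    """
    counts = Counter(a[rank] for a in assignments)
    total = sum(counts.values())
    return [(taxon, n, round(100.0 * n / total, 2))
            for taxon, n in counts.most_common()]


def write_rank_tsv(rows, handle):
    """Write one rank's summary rows as tab-separated text."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["taxon", "read_count", "read_percentage"])
    writer.writerows(rows)
```

Calling `summarize_rank` once per entry in `RANKS` and writing each result to its own file yields the seven per-rank data sets.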

The control unit 130 may analyze the identified bacteria using a statistical tool (S250).

The statistical tool may include a phylogenetic tree, PCA, a heatmap, and the like.

FIGS. 3, 4, 5, 6, 7, 8, 9, and 10 are conceptual diagrams illustrating the bacteria identification method described with reference to FIG. 2.

The following describes an experiment in which bacteria were identified by applying the bacteria identification method of FIG. 2.

1) Preparation of experimental samples

Using 12 microorganisms and 2 fungi commonly found in the human intestine, 35 microorganism and fungus sets were prepared and sequenced. Tables 1 and 2 list the strain names and the types of the prepared strains.

2) Criteria for distinguishing forward and reverse reads in ATP, and discriminative power

FIG. 3 shows FastQC graphs, arranged as paired forward and reverse plots, for two samples subjected to amplicon sequencing. FIG. 3 is a typical FASTQ example; the left side shows the forward reads and the right side shows the reverse reads.

Referring to FIG. 4, a comparison of No_Filter (0-0), ATP (the method of the present invention), F210-R170, Figaro (277-246), and VL-QIIME2 (283-237) shows that the number of reads passing the filter differs by more than a factor of two depending on the options and truncation positions.

FIG. 4 shows the number of reads that passed the filter; the left panel is based on the upper graph of FIG. 3 and the right panel on the lower graph of FIG. 3.

FIGS. 5A and 5B show the mapped regions and their distribution. FIG. 5A is based on the upper graph of FIG. 3 and FIG. 5B on the lower graph of FIG. 3.

Even when many reads pass the filter, the length at which the forward and reverse reads form a single consensus sequence is limited to a specific range, 400 to 430 bp, as shown in FIG. 5.

In the graph of FIG. 3, F210-R170 has by far the largest number of reads passing the filter, but, as the left graph of FIG. 6 shows, almost no reads map to a species, so species can hardly be distinguished.

FIGS. 6A and 6B show the number of reads from which the final genus can be determined. FIG. 6A corresponds to the upper graph of FIG. 3 and FIG. 6B to the lower graph of FIG. 3.

The trailing part of the forward read and the leading part of the reverse read produce common sequences, and these common parts overlap to generate a single new, longer read. Even if the number of short forward and reverse reads is maximized, the power to discriminate species and genera weakens if the common parts fail to overlap and no new long read is generated.

For this reason, the present invention uses the ATP (Auto Truncation Position) method, which can generate an appropriate consensus sequence between reads and therefore has excellent discriminative power.

With this method, the number of reads passing the filter is not the largest, as the left and right graphs of FIG. 3 show, but the number of mapped species is far greater, as the results in FIG. 6 show.

By adjusting the overlap size between 10 and 60, securing as many high-quality forward reads as possible, and then selecting the reverse read length, the control unit 130 obtains, from the same reads, greater species and genus discriminative power and a larger number of usable reads.

In addition, when the ATP method is applied to the low-quality lower sample of FIG. 100, 7.74% of reads are mapped, as shown in FIG. 6, and accuracy is confirmed to improve.

3) The ATP algorithm

The ATP algorithm finds the optimal overlapping region by adjusting the k-mer size, the quality score, and the mismatch fraction.

Its basic principle is to search for the optimal region by shifting a 17 bp k-mer window from the right end of the forward read toward the left.

By default, regions with a quality score of 19 or higher are selected, and only high-quality regions with a mismatch fraction of 0.3 or lower are selected, in order to find the assembly, that is, the overlapping region.
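The description gives only the default parameters (a 17 bp k-mer, quality score 19 or higher, mismatch fraction 0.3 or lower) and the right-to-left window shift. The sketch below is one possible reading of that outline, not the patented implementation: it assumes the reverse read has already been reverse-complemented, that qualities are Phred integers, and that a window qualifies when its lowest base quality meets the threshold. The name `find_overlap` and its return shape are hypothetical.

```python
def find_overlap(fwd, fwd_qual, rev_rc, k=17, min_q=19, max_mismatch_fraction=0.3):
    """Locate the overlap between a forward read and a reverse-complemented reverse read.

    A k-mer window slides from the 3' end of `fwd` toward its 5' end.
    The first high-quality window that also occurs in `rev_rc` seeds a
    candidate overlap, which is accepted when its mismatch fraction is
    at or below the threshold. Returns (offset, overlap_length), where
    `offset` is the position in `fwd` at which `rev_rc` begins, or None.
    """
    for start in range(len(fwd) - k, -1, -1):
        # Skip windows that contain a low-quality base.
        if min(fwd_qual[start:start + k]) < min_q:
            continue
        pos = rev_rc.find(fwd[start:start + k])
        if pos == -1:
            continue
        offset = start - pos
        if offset < 0:
            continue
        overlap_len = min(len(fwd) - offset, len(rev_rc))
        pairs = zip(fwd[offset:offset + overlap_len], rev_rc[:overlap_len])
        mismatches = sum(a != b for a, b in pairs)
        if mismatches / overlap_len <= max_mismatch_fraction:
            return offset, overlap_len
    return None
```

Because the seed must match exactly, a sequencing error near the 3' end of the forward read simply causes the window to shift further left until it clears the erroneous base.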

FIG. 7 shows the ATP algorithm, which finds the optimal region by window shifting.

FIG. 8 is a flowchart for finding consensus reads using the overlap region.

Referring to FIG. 8, the control unit 130 converts the r2 sequence to its reverse complement and builds a k-mer count table. The control unit 130 may then detect the region where the sequences overlap and process the mismatched bases in that region. Finally, the control unit 130 generates a consensus read from the r1 and r2 reads.
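The individual operations in the FIG. 8 flow can be sketched as small helpers. How mismatched bases are resolved is not stated in the description; the rule used here, keeping the base with the higher quality score, is an assumption, as are the function names and the requirement that the overlap offset is already known.

```python
from collections import Counter

# Translation table for complementing bases; N stays N.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def kmer_counts(seq, k=17):
    """Count every k-mer in `seq` (empty when the sequence is shorter than k)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def consensus(r1, q1, r2_rc, q2_rc, offset):
    """Merge r1 with the reverse-complemented r2 into one consensus read.

    `r2_rc` is assumed to start at position `offset` of r1 and to be at
    least as long as the overlap. `q2_rc` must be the r2 quality list
    reversed to match `r2_rc`. Where the two reads disagree, the base
    with the higher quality is kept.
    """
    merged = list(r1[:offset])
    overlap = len(r1) - offset
    for i in range(overlap):
        a, b = r1[offset + i], r2_rc[i]
        merged.append(a if a == b or q1[offset + i] >= q2_rc[i] else b)
    merged.extend(r2_rc[overlap:])
    return "".join(merged)
```

Note that reverse-complementing r2 also reverses the order of its quality scores, which is why the merge takes a separate `q2_rc` argument.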

4) Species and genus comparison experiments using ATP

Using the ATP (Auto Truncation Position) function, which automatically finds the truncation position according to read quality, the discrimination of species and genera was confirmed to be far better than that of existing companies or programs.

At the genus level, as shown in FIG. 9, discrimination is possible with an accuracy of 20% or more. In the species comparison of FIG. 10 there is an improvement of at least 25%, and certain tools had no species-level discriminative power at all.

위에서 설명한 박테리아 동정 장치의 동작 및 기능은, 박테리아 동정 방법에 동일/유사하게 유추적용될 수 있다.The operation and function of the bacteria identification device described above may be analogously applied in the same/similar manner to the method for identifying bacteria.

이상에서 전술한 본 발명의 일 실시예에 따른 방법은, 하드웨어인 서버와 결합되어 실행되기 위해 프로그램(또는 어플리케이션)으로 구현되어 매체에 저장될 수 있다.The method according to an embodiment of the present invention described above may be implemented as a program (or application) to be executed in combination with a server, which is hardware, and stored in a medium.

상기 전술한 프로그램은, 상기 컴퓨터가 프로그램을 읽어 들여 프로그램으로 구현된 상기 방법들을 실행시키기 위하여, 상기 컴퓨터의 프로세서(CPU)가 상기 컴퓨터의 장치 인터페이스를 통해 읽힐 수 있는 C, C++, JAVA, 기계어 등의 컴퓨터 언어로 코드화된 코드(Code)를 포함할 수 있다. 이러한 코드는 상기 방법들을 실행하는 필요한 기능들을 정의한 함수 등과 관련된 기능적인 코드(Functional Code)를 포함할 수 있고, 상기 기능들을 상기 컴퓨터의 프로세서가 소정의 절차대로 실행시키는데 필요한 실행 절차 관련 제어 코드를 포함할 수 있다. 또한, 이러한 코드는 상기 기능들을 상기 컴퓨터의 프로세서가 실행시키는데 필요한 추가 정보나 미디어가 상기 컴퓨터의 내부 또는 외부 메모리의 어느 위치(주소 번지)에서 참조되어야 하는지에 대한 메모리 참조관련 코드를 더 포함할 수 있다. 또한, 상기 컴퓨터의 프로세서가 상기 기능들을 실행시키기 위하여 원격(Remote)에 있는 어떠한 다른 컴퓨터나 서버 등과 통신이 필요한 경우, 코드는 상기 컴퓨터의 통신 모듈을 이용하여 원격에 있는 어떠한 다른 컴퓨터나 서버 등과 어떻게 통신해야 하는지, 통신 시 어떠한 정보나 미디어를 송수신해야 하는지 등에 대한 통신 관련 코드를 더 포함할 수 있다.The aforementioned program is C, C++, JAVA, machine language, etc. It may include a code coded in a computer language of. Such codes may include functional codes related to functions defining necessary functions for executing the methods, and include control codes related to execution procedures necessary for the processor of the computer to execute the functions according to a predetermined procedure. can do. In addition, these codes may further include memory reference related code for determining where (address address) of the computer's internal or external memory the additional information or media required for the computer's processor to execute the functions should be referenced. have. In addition, when the processor of the computer needs to communicate with any other remote computer or server in order to execute the functions, the code uses the communication module of the computer to determine how to communicate with any other remote computer or server. It may further include communication-related codes for whether to communicate, what kind of information or media to transmit/receive during communication, and the like.

상기 저장되는 매체는, 레지스터, 캐쉬, 메모리 등과 같이 짧은 순간 동안 데이터를 저장하는 매체가 아니라 반영구적으로 데이터를 저장하며, 기기에 의해 판독(reading)이 가능한 매체를 의미한다. 구체적으로는, 상기 저장되는 매체의 예로는 ROM, RAM, CD-ROM, 자기 테이프, 플로피디스크, 광 데이터 저장장치 등이 있지만, 이에 제한되지 않는다. 즉, 상기 프로그램은 상기 컴퓨터가 접속할 수 있는 다양한 서버 상의 다양한 기록매체 또는 사용자의 상기 컴퓨터상의 다양한 기록매체에 저장될 수 있다. 또한, 상기 매체는 네트워크로 연결된 컴퓨터 시스템에 분산되어, 분산방식으로 컴퓨터가 읽을 수 있는 코드가 저장될 수 있다.The storage medium is not a medium that stores data for a short moment, such as a register, cache, or memory, but a medium that stores data semi-permanently and is readable by a device. Specifically, examples of the storage medium include ROM, RAM, CD-ROM, magnetic tape, floppy disk, optical data storage device, etc., but are not limited thereto. That is, the program may be stored in various recording media on various servers accessible by the computer or various recording media on the user's computer. In addition, the medium may be distributed to computer systems connected through a network, and computer readable codes may be stored in a distributed manner.

The steps of a method or algorithm described in connection with an embodiment of the present invention may be implemented directly in hardware, in a software module executed by hardware, or in a combination of the two. A software module may reside in random access memory (RAM), read only memory (ROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), flash memory, a hard disk, a removable disk, a CD-ROM, or any form of computer-readable recording medium well known in the art to which the present invention pertains.

Although embodiments of the present invention have been described above with reference to the accompanying drawings, those of ordinary skill in the art to which the present invention pertains will understand that the present invention may be practiced in other specific forms without changing its technical spirit or essential features. Therefore, the embodiments described above should be understood as illustrative in all respects and not restrictive.

Claims

A method for identifying bacteria, performed by a device, the method comprising:
(a) converting raw sequence file fragments from sequences generated by next-generation sequencing equipment into a single file;
(b) removing a specific primer sequence from the converted file using a bioinformatics analysis tool;
(c) removing sequence duplicates and errors, using an analysis tool, from the file from which the specific primer sequence has been removed;
(d) identifying bacteria by comparing the sequences of the file from which duplicates and errors have been removed with sequences stored in a database using a bioinformatics analysis tool; and
(e) analyzing the identified bacteria using a statistical tool.

The method of claim 1,
wherein, in step (a),
the nucleotide sequence fragments present in each sample are converted into a single file by being assembled or compressed into one file.
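
As a rough illustration of step (a), the following is a minimal sketch that concatenates several sequence file fragments into one gzip-compressed file. It assumes the fragments are FASTQ files, either plain or gzip-compressed. The function name `merge_fastq_fragments` and the file names are hypothetical and do not come from this document, which does not specify how the conversion is implemented.

```python
import gzip
import shutil
from pathlib import Path


def merge_fastq_fragments(fragment_paths, merged_path):
    """Concatenate FASTQ fragments (plain or .gz) into one gzip file.

    Hypothetical helper for illustration only. Fragments are appended
    in the order given, and record contents are not modified.
    """
    with gzip.open(merged_path, "wb") as out:
        for path in fragment_paths:
            path = Path(path)
            # Assumption: a ".gz" suffix marks a gzip-compressed fragment.
            opener = gzip.open if path.suffix == ".gz" else open
            with opener(path, "rb") as src:
                shutil.copyfileobj(src, out)
    return merged_path
```

For example, `merge_fastq_fragments(["lane1.fastq.gz", "lane2.fastq.gz"], "sample.fastq.gz")` would yield one compressed file per sample, which matches the "assembled or compressed" wording of the claim only in a loose sense.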

The method of claim 2,
wherein, in step (b),
the specific primer sequence is removed by removing the primer sequence corresponding to a bacteria-specific region from the single file converted in step (a).
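
The primer removal of step (b) can be sketched as follows, under simplifying assumptions: the primer sits at the 5' end of the read, no mismatches are tolerated, and degenerate primer positions use standard IUPAC nucleotide codes. The names `starts_with_primer` and `trim_primer` and the example primer are hypothetical. Dedicated trimming tools usually also tolerate mismatches and handle reverse primers, which this sketch does not.

```python
# Standard IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def starts_with_primer(seq, primer):
    """Return True if seq begins with the (possibly degenerate) primer."""
    if len(seq) < len(primer):
        return False
    return all(base in IUPAC[code] for base, code in zip(seq, primer))


def trim_primer(record, primer):
    """Trim the primer from a (header, sequence, quality) record.

    Returns the trimmed record, or None when the read does not start
    with the primer (an illustrative choice: such reads are dropped).
    """
    header, seq, qual = record
    if not starts_with_primer(seq, primer):
        return None
    n = len(primer)
    return header, seq[n:], qual[n:]


# Made-up primer "ACGTNW": N matches any base, W matches A or T.
print(trim_primer(("@r1", "ACGTGATTTT", "IIIIIIIIII"), "ACGTNW"))
```

The quality string is trimmed by the same length as the sequence so that the record stays a valid FASTQ entry.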

The method of claim 3,
wherein, in step (c),
sequence errors and duplicates are removed by locating erroneous sequences in the forward and reverse sequences.
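
The duplicate-removal part of step (c) can be illustrated with a simple dereplication sketch that collapses identical reads into unique sequences with abundance counts. This is only an assumption about what removing duplicates might look like. It does not model sequencing errors and is not the error-correction procedure of this document. The function name `dereplicate` and the `min_abundance` parameter are hypothetical.

```python
from collections import Counter


def dereplicate(sequences, min_abundance=1):
    """Collapse identical sequences into (sequence, count) pairs.

    Pairs are sorted by decreasing count, then alphabetically.
    Sequences seen fewer than min_abundance times are discarded, a
    crude stand-in for filtering out likely erroneous reads.
    """
    counts = Counter(sequences)
    kept = [(seq, n) for seq, n in counts.items() if n >= min_abundance]
    kept.sort(key=lambda item: (-item[1], item[0]))
    return kept


reads = ["ACGT", "ACGT", "ACGA", "ACGT", "TTTT", "TTTT"]
print(dereplicate(reads, min_abundance=2))  # [('ACGT', 3), ('TTTT', 2)]
```

Keeping the counts matters because abundance information is typically what later error-removal and statistical steps work from.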

The method of claim 4,
wherein step (c) removes sequence errors by an Amplicon Sequence Variant (ASV) method.

The method of claim 1,
wherein the sequence generated by the next-generation sequencing equipment is a sequence file created by the sequencing equipment.

The method of claim 1,
wherein the sequence generated by the next-generation sequencing equipment is a sequence of bacteria present in the human intestine.

The method of claim 1,
wherein the sequence generated by the next-generation sequencing equipment is a sequence generated by next-generation sequencing (NGS) analysis of bacteria extracted from feces excreted from the human intestine.

A bacteria identification apparatus comprising:
next-generation sequencing equipment that generates sequences through next-generation sequencing; and
a control unit that converts raw sequence file fragments from the sequences generated by the next-generation sequencing equipment into a single file, removes a specific primer sequence from the converted file using a bioinformatics analysis tool, removes sequence duplicates and errors, using an analysis tool, from the file from which the specific primer sequence has been removed, identifies bacteria by comparing the sequences of the file from which duplicates and errors have been removed with sequences stored in a database using a bioinformatics analysis tool, and analyzes the identified bacteria using a statistical tool.

A program stored in a recording medium which, in combination with a computer that is hardware, executes the method of any one of claims 1 to 8.