KR20200126715A - Protein Toxicity Prediction System and Method Using Artificial Neural Network

Info

Publication number: KR20200126715A
Application number: KR1020190050703A
Authority: KR
Inventors: 김연수; 김건준; 성동렬
Original assignee: 주식회사 엘지화학
Priority date: 2019-04-30
Filing date: 2019-04-30
Publication date: 2020-11-09

Abstract

An information providing system is provided. According to an embodiment of the present application, the information providing system may comprise: a database storing data on a plurality of proteins whose toxicity has been determined according to preset criteria; an amino acid sequence calculation module that calculates sequence data of the amino acids contained in a protein to be analyzed; an amino acid hydrophobicity calculation module that uses the amino acid sequence data calculated by the amino acid sequence calculation module to calculate, by a preset method, hydrophobicity data for the amino acid sequence; and a toxicity prediction module that compares the hydrophobicity data calculated by the amino acid hydrophobicity calculation module with the hydrophobicity data of the plurality of proteins stored in the database and determines whether the protein to be analyzed is toxic based on first comparison result data obtained from the comparison.

Description

Protein Toxicity Prediction System and Method Using Artificial Neural Network

The present invention relates to a protein toxicity prediction system using an artificial neural network, a method of constructing the system, and a method of predicting the toxicity of a protein using the system. It also relates to a computer program stored in a computer-readable storage medium for executing the method on a computer.

In general, genetically modified crops may be distributed and sold on the market after passing a safety assessment under the Food Sanitation Act. Korea evaluates safety in the same way as Europe and Japan: specifically, the characteristics of the inserted gene, toxicity, allergenicity, and nutritional properties are compared with those of conventional agricultural products to demonstrate substantial equivalence.

However, leading foreign companies search the literature using a variety of search terms, analyze similarity to factors registered in up-to-date databases, and use safety prediction programs developed in-house to pre-screen for toxicity and allergenicity at the gene discovery stage.

In particular, protein toxicity prediction is a very important screening step in the bioproduct development pipeline. It is carried out by bioinformatic analysis of protein sequences, and the classification model most commonly used is the SVM (Support Vector Machine).

The Support Vector Machine is a representative nonlinear classification model that determines, based on a given data set, which category new data belongs to. However, as the amount of data to be classified grows, the model becomes more complex, analysis takes longer, and the accuracy and robustness of the model decline. This is because an SVM selects the key support vectors from the data set and uses them to classify the remaining data; in the worst case every data point is treated as a support vector, and the size of the model grows linearly with the size of the data set.

Accordingly, the present inventors sought to develop a protein toxicity prediction method with improved analysis speed and accuracy using an artificial neural network model. As a result, they confirmed that an artificial neural network (ANN) that uses the relationship with amino acid frequency, hydrophobicity, or sequence similarity predicts protein toxicity with improved speed and accuracy compared with the conventional SVM approach, and thereby completed the present invention.

One object of the present invention is to provide a protein toxicity prediction system.

Another object of the present invention is to provide a method for predicting protein toxicity.

Yet another object of the present invention is to provide a computer program stored in a computer-readable storage medium for executing the protein toxicity prediction method on a computer.

Hereinafter, the present invention will be described in more detail.

Each description and embodiment disclosed herein may also be applied to every other description and embodiment. That is, all combinations of the various elements disclosed herein fall within the scope of the present invention. Furthermore, the scope of the present invention is not limited by the specific description given below.

In addition, those of ordinary skill in the art will recognize, or be able to ascertain using no more than routine experimentation, many equivalents to the specific aspects of the invention described in this application. Such equivalents are intended to be encompassed by the present invention.

To solve the above problems, an embodiment of the present application provides an information providing system comprising: a database storing data on a plurality of proteins whose toxicity has been determined according to preset criteria; an amino acid sequence calculation module that calculates sequence data of the amino acids contained in a protein to be analyzed; an amino acid hydrophobicity calculation module that uses the amino acid sequence data calculated by the amino acid sequence calculation module to calculate, by a preset method, hydrophobicity data for the amino acid sequence; and a toxicity prediction module that compares the hydrophobicity data calculated by the amino acid hydrophobicity calculation module with the hydrophobicity data of the plurality of proteins stored in the database and determines whether the protein to be analyzed is toxic based on first comparison result data obtained from the comparison.

In one embodiment, the system further comprises: an amino acid frequency calculation module that uses the amino acid sequence data to calculate, by a preset method, frequency data for each amino acid contained in the protein to be analyzed; and an amino acid sequence similarity calculation module that compares the amino acid sequence data with the amino acid sequence data of the plurality of proteins stored in the database and calculates sequence similarity data for the protein to be analyzed based on second comparison result data obtained from the comparison. The toxicity prediction module may compare the frequency data calculated by the amino acid frequency calculation module for each amino acid contained in the protein to be analyzed with the frequency data for each amino acid contained in the plurality of proteins stored in the database, and may determine whether the protein to be analyzed is toxic further based on third comparison result data obtained from that comparison and on the sequence similarity data calculated by the amino acid sequence similarity calculation module.

In one embodiment, the hydrophobicity data calculated by the amino acid hydrophobicity calculation module is calculated using Equation 1 below: [Equation 1] $\text{hydrophobicity data} = \sum_i R_i \times I_i$. In Equation 1, R is any amino acid contained in the protein to be analyzed, and I may be a hydrophobicity index value predetermined for each amino acid.

In one embodiment, the sequence similarity data calculated by the amino acid sequence similarity calculation module is calculated using Equation 2 below: [Equation 2] $\text{sequence similarity data} = \text{aligned query protein sequence length} / \text{total query protein sequence length}$. In Equation 2, the total query protein sequence length is the number of amino acid residues in the sequence of the protein to be analyzed, and the aligned query protein sequence length may be the number of residues that are identical between the amino acid sequence data of the protein to be analyzed and the amino acid sequence data of the plurality of proteins stored in the database.

In one embodiment, the amino acid frequency data calculated by the amino acid frequency calculation module is calculated using Equation 3 below: [Equation 3] $\text{amino acid frequency data} = R_i / N \times 100$. In Equation 3, N may be the number of amino acid residues in the sequence of the protein to be analyzed.

In one embodiment, the toxicity prediction module may input the hydrophobicity data, the sequence similarity data, and the amino acid frequency data into an artificial neural network (ANN) algorithm and determine whether the protein to be analyzed is toxic using the output data produced from that input.

In one embodiment, the system further comprises a separate database distinct from the above database, and the toxicity result determined by the toxicity prediction module may be stored in the separate database.

The present application also provides a method of providing information using a system that provides information including whether a protein to be analyzed is toxic, wherein the system comprises a database storing data on a plurality of proteins whose toxicity has been determined according to preset criteria, and the method comprises: (a) calculating, by an amino acid sequence calculation module, amino acid sequence data contained in the protein to be analyzed; (b) calculating, by an amino acid hydrophobicity calculation module, hydrophobicity data for the amino acid sequence by a preset method using the amino acid sequence data calculated in step (a); and (c) comparing, by a toxicity prediction module, the hydrophobicity data calculated in step (b) with the hydrophobicity data of the plurality of proteins stored in the database, and determining whether the protein to be analyzed is toxic based on first comparison result data obtained from the comparison.

In one embodiment, step (b) further comprises: (b1) comparing, by an amino acid sequence similarity calculation module, the amino acid sequence data calculated in step (a) with the amino acid sequence data of the plurality of proteins stored in the database, and calculating sequence similarity data for the protein to be analyzed based on second comparison result data obtained from the comparison; and (b2) calculating, by an amino acid frequency calculation module, frequency data for each amino acid contained in the protein to be analyzed by a preset method using the amino acid sequence data calculated in step (a). Step (c) may further comprise comparing, by the toxicity prediction module, the frequency data calculated in step (b2) with the frequency data for each amino acid contained in the plurality of proteins stored in the database, and determining whether the protein to be analyzed is toxic further based on third comparison result data obtained from that comparison and on the sequence similarity data calculated in step (b1).

In one embodiment, the hydrophobicity data calculated in step (b) is calculated using Equation 1 below: [Equation 1] $\text{hydrophobicity data} = \sum_i R_i \times I_i$. In Equation 1, R is any amino acid contained in the protein to be analyzed, and I may be a hydrophobicity index value predetermined for each amino acid.

In one embodiment, the sequence similarity data calculated in step (b1) is calculated using Equation 2 below: [Equation 2] $\text{sequence similarity data} = \text{aligned query protein sequence length} / \text{total query protein sequence length}$. In Equation 2, the total query protein sequence length is the number of amino acid residues in the sequence of the protein to be analyzed, and the aligned query protein sequence length may be the number of residues that are identical between the amino acid sequence data of the protein to be analyzed and the amino acid sequence data of the plurality of proteins stored in the database.

In one embodiment, the amino acid frequency data calculated in step (b2) is calculated using Equation 3 below: [Equation 3] $\text{amino acid frequency data} = R_i / N \times 100$. In Equation 3, N may be the number of amino acid residues in the sequence of the protein to be analyzed.

In one embodiment, step (c) may further comprise inputting, by the toxicity prediction module, the hydrophobicity data, the sequence similarity data, and the amino acid frequency data into an artificial neural network (ANN) algorithm, and determining whether the protein to be analyzed is toxic using the output data produced from that input.

The present application also provides a computer program stored in a computer-readable storage medium for executing the above method on a computer.

An information providing system according to an embodiment of the present application is described in more detail below with reference to FIG. 1.

Referring to FIG. 1, the information providing system according to an embodiment of the present application may include a database (D₁), an input module (10), an amino acid sequence calculation module (20), an amino acid hydrophobicity calculation module (30), an amino acid frequency calculation module (40), an amino acid sequence similarity calculation module (50), and a toxicity prediction module (60).

The database (D) stores data on a plurality of proteins whose toxicity has been determined in advance according to preset criteria. One example is the SwissProt database, which contains data in which protein properties (for example, toxicity) have been verified experimentally. The database is not limited to this, however. Any database may be used as long as it stores data on a plurality of proteins whose properties have been determined according to some criterion, including databases such as TrEMBL that contain properties predicted from sequence similarity and the like rather than verified by experiment.

The input module (10) is the component through which the protein whose toxicity is to be predicted (hereinafter, the protein to be analyzed) is entered, and may be any of various input devices such as a keyboard or a mouse. The protein to be analyzed may be entered through the input module (10) by its name or by the amino acid sequence it contains, but the input is not limited to these; any method that can identify the protein to be analyzed may be used.

The amino acid sequence calculation module (20) calculates the amino acid sequence data of the protein to be analyzed that was entered through the input module (10). For example, when a given protein to be analyzed is entered, the amino acid sequence calculation module (20) may calculate the amino acid sequence data contained in that protein as Met-Gly-Arg-Arg-Ile-Ser-Gly-Gly.

In one example, the amino acid sequence calculation module (20) may find the open reading frame of the gene sequence of the protein to be analyzed and calculate the amino acid sequence data by six-frame translation based on it. The module is not limited to this, and various methods of calculating the amino acid sequence may be applied.
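As an illustration of the six-frame translation mentioned above, the following is a minimal sketch in Python. It is not the module's actual implementation: the function names, the use of the standard genetic code, and the rule of taking the longest stretch that starts at Met are all assumptions made for the example, and the input is assumed to contain only the bases A, C, G, and T.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse-complement strand of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons starting at position 0; '*' marks a stop."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna: str) -> list[str]:
    """Translate three reading frames on each of the two strands."""
    dna = dna.upper()
    strands = (dna, reverse_complement(dna))
    return [translate(strand[offset:]) for strand in strands for offset in range(3)]


def longest_orf(dna: str) -> str:
    """Hypothetical ORF rule: longest stretch from the first Met of a segment
    up to the next stop codon (or the end of the sequence)."""
    best = ""
    for protein in six_frame_translation(dna):
        for segment in protein.split("*"):
            start = segment.find("M")
            if start != -1 and len(segment) - start > len(best):
                best = segment[start:]
    return best


# Illustrative DNA encoding Met-Gly-Arg-Arg-Ile-Ser-Gly-Gly followed by a stop codon.
print(longest_orf("ATGGGTCGTCGTATTTCTGGTGGTTAA"))  # MGRRISGG
```

The one-letter string `MGRRISGG` corresponds to the Met-Gly-Arg-Arg-Ile-Ser-Gly-Gly example used throughout the description.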

The amino acid hydrophobicity calculation module (30) uses the amino acid sequence data calculated by the amino acid sequence calculation module (20) to calculate, by a preset method, hydrophobicity data for the amino acid sequence.

The hydrophobicity data may be calculated by Equation 1 below.

[Equation 1]

$$\text{Hydrophobicity data} = \sum_i R_i \times I_i$$

In Equation 1, R is any amino acid contained in the protein to be analyzed, and I is a hydrophobicity index value predetermined for each amino acid.

There are 20 kinds of amino acids, and each has a predetermined hydrophobicity index value. For example, Arg may have a hydrophobicity index value of -4.50, and Ile may have a hydrophobicity index value of +4.50.

In other words, the hydrophobicity data is the sum of the hydrophobicity index values accumulated as the number of residues in the amino acid sequence increases. FIG. 5 is a diagram illustrating the hydrophobicity data.
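The calculation in Equation 1 can be sketched as follows. This is only an illustration under stated assumptions: the index table is the Kyte & Doolittle scale (which gives -4.5 for Arg and +4.5 for Ile, matching the values quoted above), sequences are written in one-letter code, and the function names are hypothetical.

```python
from itertools import accumulate

# Kyte & Doolittle hydropathy index per residue (one-letter code).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobicity_sum(sequence: str) -> float:
    """Equation 1: sum of the index values of all residues in the sequence."""
    return sum(KYTE_DOOLITTLE[residue] for residue in sequence)


def hydrophobicity_profile(sequence: str) -> list[float]:
    """Running sum of index values as the residue count increases."""
    return list(accumulate(KYTE_DOOLITTLE[residue] for residue in sequence))


# Met-Gly-Arg-Arg-Ile-Ser-Gly-Gly, the example sequence from the description.
total = hydrophobicity_sum("MGRRISGG")      # approximately -4.6
profile = hydrophobicity_profile("MGRRISGG")
```

The running-sum profile is one way to read "the sum accumulated as the number of residues increases"; whether the plotted data uses a running sum or a windowed average is not stated in the text.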

Specifically, FIG. 5 is a hydropathy plot comparing the hydrophobicity of a toxic protein and a non-toxic protein. Thionin-2.1 (THI2.1) of Arabidopsis thaliana (UniProt ID: Q42596), whose Molecular Function is classified as toxic/toxin activity, was selected as the toxic protein, and the B3 domain-containing protein At1g16640 of the same Arabidopsis species (UniProt ID: Q9FX77) was used as the non-toxic protein. For the comparison in FIG. 5, the hydrophobicity of each protein was calculated using the Kyte & Doolittle hydrophobicity index, one of several available indices. Unlike other index scales, the Kyte & Doolittle hydrophobicity index expresses the hydrophobicity of each individual amino acid residue. The largest differences appear in the region of residues 10 to 25 and the region of residues 100 to 125.

This analysis confirmed that toxic and non-toxic proteins can be readily classified by using the difference in hydrophobicity between proteins known to be toxic and proteins known to be non-toxic.

The amino acid sequence similarity calculation module (40) compares the amino acid sequence data calculated by the amino acid sequence calculation module (20) with the amino acid sequence data of the plurality of proteins stored in the database (D) and, based on the comparison result, calculates sequence similarity data for the protein to be analyzed.

The sequence similarity data may be calculated by Equation 2 below.

[Equation 2]

$$\text{Sequence similarity data} = \frac{\text{aligned query protein sequence length}}{\text{total query protein sequence length}}$$

In Equation 2, the total query protein sequence length is the number of amino acid residues in the sequence of the protein to be analyzed, and the aligned query protein sequence length is the number of residues that are identical between the amino acid sequence data of the protein to be analyzed and the amino acid sequence data of the plurality of proteins stored in the database.

For example, suppose the amino acid sequence of the protein to be analyzed is Met-Gly-Arg-Arg-Ile-Ser-Gly-Gly and the amino acid sequence of one of the proteins stored in the database (D) is Met-Arg-Arg-Arg-Ile-Ser-Gly-Gly. The two sequences differ only at the second residue, so the aligned query protein sequence length is 7 and the total query protein sequence length is 8, and the sequence similarity data is calculated as 7/8 = 0.875. This rests on the premise that proteins with similar amino acid sequences will also have similar properties.
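A minimal sketch of Equation 2 for the worked example is given below. It assumes the two sequences are already aligned position by position without gaps; how the alignment itself is obtained is not specified in the text, and a real system would presumably rely on a sequence alignment tool. The function name is hypothetical.

```python
def sequence_similarity(query: str, reference: str) -> float:
    """Equation 2: identical aligned positions divided by total query length.

    Assumes `query` and `reference` are pre-aligned, gap-free, one-letter strings.
    """
    if not query:
        raise ValueError("query sequence must not be empty")
    aligned = sum(q == r for q, r in zip(query, reference))
    return aligned / len(query)


# Query Met-Gly-Arg-Arg-Ile-Ser-Gly-Gly against Met-Arg-Arg-Arg-Ile-Ser-Gly-Gly.
print(sequence_similarity("MGRRISGG", "MRRRISGG"))  # 0.875
```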

The amino acid frequency calculation module (50) uses the amino acid sequence data calculated by the amino acid sequence calculation module (20) to calculate, by a preset method, frequency data for each amino acid contained in the protein to be analyzed.

The frequency data may be calculated by Equation 3 below.

[Equation 3]

$$\text{Amino acid frequency data} = \frac{R_i}{N} \times 100\ (\%)$$

In Equation 3, N is the number of amino acid residues in the sequence of the protein to be analyzed.

For example, when the amino acid sequence of the protein to be analyzed is Met-Gly-Arg-Arg-Ile-Ser-Gly-Gly, the amino acid frequency data for Arg is calculated as 2/8 x 100(%) = 25%, and the amino acid frequency data for Ile is calculated as 1/8 x 100(%) = 12.5%.
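Equation 3 can be sketched as below. The sketch reads R_i as the number of occurrences of amino acid i in the sequence, which is what the worked example implies but is an interpretation rather than a definition given in the text. The function name and the one-letter alphabet constant are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter code


def amino_acid_frequencies(sequence: str) -> dict[str, float]:
    """Equation 3: percentage of the sequence made up by each amino acid."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    counts = Counter(sequence)
    n = len(sequence)
    return {aa: counts[aa] / n * 100 for aa in AMINO_ACIDS}


freqs = amino_acid_frequencies("MGRRISGG")
print(freqs["R"], freqs["I"])  # 25.0 12.5
```

Returning all 20 entries, including zeros, gives a fixed-length vector, which is convenient if the frequencies are later compared between proteins or fed to a classifier.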

FIG. 6 is a bar graph comparing the amino acid frequencies of a toxic protein and a non-toxic protein. Thionin-2.1 of Arabidopsis thaliana (UniProt ID: Q42596), whose Molecular Function is classified as toxic/toxin activity, was selected as the toxic protein, and the B3 domain-containing protein At1g16640 of the same Arabidopsis species (UniProt ID: Q9FX77) was used as the non-toxic protein. The amino acids showing the largest differences are Asp, Cys, Ser, Glu, and Phe. This confirmed that amino acid frequency is also an important factor in determining whether a protein is toxic or non-toxic.

The toxicity prediction module (60) predicts the toxicity of the protein to be analyzed using the hydrophobicity data calculated by the amino acid hydrophobicity calculation module (30), the sequence similarity data calculated by the amino acid sequence similarity calculation module (40), and the amino acid frequency data calculated by the amino acid frequency calculation module (50). The prediction may use all three of the hydrophobicity data, the sequence similarity data, and the amino acid frequency data, but it may equally use any one or more of them.

More specifically, the toxicity prediction module (60) may compare the hydrophobicity data with the hydrophobicity data of the plurality of proteins stored in the database (D) and determine whether the protein to be analyzed is toxic based on the first comparison result data obtained from the comparison. The database (D) stores data on proteins whose toxicity has already been determined; the more closely the hydrophobicity pattern of the protein to be analyzed resembles that of a given toxic protein, the more likely it is that the protein to be analyzed is itself toxic. For example, the hydrophobicity data of the protein to be analyzed may be compared with the hydrophobicity data of a comparison protein stored in the database (D), and if the similarity is at or above a predetermined threshold, the protein to be analyzed may be predicted to have the same toxicity status as the comparison protein.

Likewise, the toxicity prediction module (60) may use the sequence similarity data and, if it is at or above a predetermined threshold, predict that the protein to be analyzed has the same toxicity status as the comparison protein.

The toxicity prediction module (60) may also compare the amino acid frequency data with the amino acid frequency data of a comparison protein stored in the database (D) and, if the similarity is at or above a predetermined threshold, predict that the protein to be analyzed has the same toxicity status as the comparison protein.
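The threshold-based comparison described in the preceding paragraphs can be sketched as follows. The text does not say how similarity between two sets of feature data is measured or what the threshold is, so the cosine similarity, the default threshold of 0.9, the function names, and the toy reference vectors are all assumptions made for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def predict_by_comparison(query, references, threshold=0.9):
    """Inherit the toxicity label of the most similar reference protein.

    `references` is a list of (feature_vector, is_toxic) pairs standing in for
    the database (D). Returns True/False, or None if no reference is similar
    enough to support a prediction.
    """
    best_score, best_label = -1.0, None
    for vector, is_toxic in references:
        score = cosine_similarity(query, vector)
        if score > best_score:
            best_score, best_label = score, is_toxic
    return best_label if best_score >= threshold else None


# Toy reference data (not real measurements): truncated frequency vectors.
references = [([25.0, 12.5, 37.5], True), ([5.0, 40.0, 5.0], False)]
print(predict_by_comparison([24.0, 13.0, 38.0], references))  # True
```

Returning `None` below the threshold keeps "no sufficiently similar reference" distinct from a non-toxic prediction.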

The toxicity prediction module 60 may predict the toxicity of the protein to be analyzed by comparison with the comparison proteins stored in the database D, but it may also do so by means of an artificial neural network algorithm.

Specifically, the toxicity prediction module 60 inputs the hydrophobicity data, sequence similarity data, and amino acid frequency data of the protein to be analyzed into an artificial neural network (ANN) algorithm, and determines whether the protein to be analyzed is toxic using the output data produced for that input.

The artificial neural network algorithm is trained on the protein data stored in the database D, so that when the hydrophobicity data, sequence similarity data, and amino acid frequency data of the protein to be analyzed are input, it outputs a toxic/non-toxic result.

The toxicity status predicted by the toxicity prediction module 60 for the protein to be analyzed may be stored back in the database D. As toxicity predictions are repeated, data on protein toxicity accumulates in the database D, which enables more accurate toxicity prediction.

By using an artificial neural network (ANN) that exploits the relationship with amino acid frequency, hydrophobicity, or sequence similarity, the present invention improves both analysis speed and accuracy in protein toxicity prediction compared with the conventional SVM approach.

FIG. 1 is a block diagram illustrating an information providing system according to an embodiment of the present application.
FIG. 2 is a flowchart illustrating an information providing method according to an embodiment of the present application.
FIG. 3 is a diagram showing the overall analysis flow of an information providing system according to an embodiment of the present application.
FIG. 4 is a diagram showing the flow of the SVM analysis conventionally used in protein toxicity prediction.
FIG. 5 is a hydropathy plot comparing the hydrophobicity of toxic and non-toxic proteins.
FIG. 6 is a bar graph comparing the amino acid frequencies of toxic and non-toxic proteins.

Hereinafter, the present application is described in more detail by way of examples. The following examples are merely preferred embodiments given to illustrate the present application and are not intended to limit its scope. Technical matters not described in this specification can be sufficiently understood and readily implemented by a person skilled in the technical field of the present application or a similar field.

Example 1: Obtaining plant protein sequence data

The data used for machine learning was assembled from the UniProtKB/SwissProt and TrEMBL databases.

Specifically, SwissProt is a curated database whose entries describe protein properties that have been verified experimentally. TrEMBL, by contrast, is a protein sequence database whose entries have not yet been verified experimentally; their properties are predicted from sequence similarity, so TrEMBL is less reliable than SwissProt.

The data sets for machine learning modeling built from the SwissProt and TrEMBL databases are as follows.

1) 7,227 protein sequences (Positive) returned by a UniProtKB/SwissProt search for the keyword toxin. These 7,227 sequences are proteins experimentally proven to be toxic. Search query: keyword:toxin AND reviewed: yes

2) 50,000 protein sequences (Easy) randomly sampled from the 550,765 sequences, out of the 557,992 protein sequences registered in SwissProt, that do not carry the keyword toxin. Random sampling was performed in Python using numpy.random.

3) 7,229 protein sequences (Medium) selected by building a BLAST database from the 550,765 SwissProt protein sequences without the toxin keyword and running BLASTp with an e-value cutoff of 10.

4) 6,652 protein sequences (Hard) selected in the same manner from the 120,213,916 TrEMBL protein sequences without the toxin keyword, using BLASTp with an e-value cutoff of 1.
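The random sampling step used for the Easy data set can be sketched as below. The specification states that numpy.random was used; this sketch swaps in Python's standard `random` module so that it has no dependencies, and the function name, the seed, and the ID lists are hypothetical.

```python
import random

def sample_negative_ids(all_ids, toxin_ids, n_samples, seed=0):
    """Randomly sample `n_samples` protein IDs that are not annotated as toxins.

    `all_ids` and `toxin_ids` are hypothetical lists of accession strings.
    """
    toxin_set = set(toxin_ids)
    candidates = [pid for pid in all_ids if pid not in toxin_set]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(candidates, n_samples)  # sampling without replacement
```

Sampling without replacement guarantees that no sequence appears twice in the negative set.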

The data used to compare machine learning performance was taken from https://github.com/rgacesa/ToxClassifier/tree/master/datasets.

That data set was built from UniProtKB/SwissProt search results for animal toxins and venom, as follows.

1) 8,093 protein sequences (Comp_Positive) obtained by searching UniProtKB/SwissProt for animal toxin and venom keywords and removing duplicates. Search query: taxonomy:"Metazoa [33208]" (keyword:toxin OR annotation:(type:"tissue specificity" venom))

2) 47,144 protein sequences (Comp_Easy) randomly sampled from the 557,992 protein sequences registered in SwissProt, excluding sequences that share an ID with a sequence in data set 1.

3) 8,034 protein sequences (Comp_Medium) obtained by comparing data set 1 against the UniProtKB/SwissProt database with BLASTp, selecting hits at a threshold of 1.0e-10, and then removing protein sequences that duplicate entries in data sets 1 and 2.

4) 7,403 protein sequences (Comp_Hard) matched against the TrEMBL database in the same manner as data set 3, after removing protein sequences that duplicate entries in data sets 1, 2, and 3.

Example 2: Feature Extraction

The features used for machine learning in the present invention are single amino acid frequency, sequence similarity, and hydrophobicity. These features are significant factors for determining toxicity or non-toxicity from an amino acid sequence, and were identified through the preliminary analysis shown in FIGS. 5 and 6. In the present invention, hydrophobicity, which was not previously known as a factor for distinguishing toxic from non-toxic proteins, was identified as such a factor, and its performance was compared with that of amino acid frequency, a previously known factor. The features specified above were calculated from the data sets constructed in Example 1 to form a tabular data set, which was used to train the artificial intelligence model and to evaluate its performance.

The single amino acid frequency was extracted as follows.

[Equation 1]

Amino acid frequency = R_i / N × 100 (%)

In Equation 1, R_i is the number of occurrences of residue (amino acid) i in the amino acid sequence, and

N is the length of the amino acid sequence of the protein to be analyzed.
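A minimal sketch of Equation 1, assuming that R_i is the count of amino acid i in the sequence and that the 20 standard one-letter codes are used. The name `amino_acid_frequency` and the constant `AMINO_ACIDS` are illustrative and do not come from the specification.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code (assumed alphabet).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def amino_acid_frequency(sequence):
    """Return the percentage frequency of each standard amino acid (Equation 1)."""
    seq = sequence.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)
    # R_i / N x 100 for every amino acid i, including those that do not occur.
    return {aa: counts.get(aa, 0) / n * 100 for aa in AMINO_ACIDS}
```

For instance, `amino_acid_frequency("AACD")` gives 50.0 for A and 25.0 each for C and D, so each protein yields a fixed-length vector of 20 values regardless of its length.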

Sequence similarity is extracted from the result of comparing the protein sequence against the Positive data set (SwissProt keyword toxin) with BLASTp.

Sequence similarity is obtained as the proportion of the protein sequence that is similar to a known toxin sequence. It was calculated from the alignment with the protein sequence having the lowest e-value.

[Equation 2]

Sequence similarity = aligned query protein sequence length / total query protein sequence length
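Equation 2 could be computed from parsed BLASTp hits roughly as follows. The sketch assumes each hit has already been parsed into a dictionary with `evalue`, `qstart`, and `qend` keys (named after the corresponding BLAST+ tabular output fields) and that the aligned query length is the span `qend - qstart + 1` of the best hit; the parsing step and the function name are hypothetical.

```python
def sequence_similarity(hits, query_length):
    """Fraction of the query covered by its best (lowest e-value) alignment.

    `hits` is a list of dicts with keys "evalue", "qstart", "qend"
    (1-based, inclusive query coordinates). Returns 0.0 when there is no hit.
    """
    if not hits:
        return 0.0
    best = min(hits, key=lambda h: h["evalue"])    # lowest e-value wins
    aligned_length = best["qend"] - best["qstart"] + 1
    return aligned_length / query_length
```

Returning 0.0 for a query with no hit is a design choice of this sketch; it lets sequences unrelated to any known toxin still receive a numeric feature value.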

Hydrophobicity is calculated from a hydrophobicity index and the amino acid frequency: the hydrophobicity contribution of each amino acid is calculated first, and the contributions are then summed to give the hydrophobicity of the protein sequence.

[Equation 3]

Hydrophobicity = Σ_i (R_i × hydrophobicity index_i)
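One possible reading of Equation 3 is sketched below. The specification does not name the hydrophobicity index it uses, so the Kyte-Doolittle hydropathy scale is used here purely as an assumption, and R_i is taken to be the percentage frequency from Equation 1. The names `KYTE_DOOLITTLE` and `hydrophobicity` are illustrative.

```python
from collections import Counter

# Kyte-Doolittle hydropathy values (assumed index; not named in the specification).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydrophobicity(sequence, index=KYTE_DOOLITTLE):
    """Sum over amino acids of (percentage frequency x hydrophobicity index)."""
    seq = sequence.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)
    total = 0.0
    for aa, idx in index.items():
        freq_percent = counts.get(aa, 0) / n * 100  # R_i as in Equation 1
        total += freq_percent * idx
    return total
```

Under these assumptions a sequence made of equal parts alanine and isoleucine scores 50 × 1.8 + 50 × 4.5 = 315.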

Features were extracted using Python 3 and NCBI BLAST+. The local database for BLASTp was built from the Positive data set, that is, the protein sequences annotated as toxins in UniProtKB/SwissProt.

Example 3: Model Training

Both the SVM and the ANN were trained in Python 3 using the scikit-learn package. An SVM is a statistical learning method based on the principle of structural risk minimization. The input to the SVM is a feature vector representing amino acid properties; the features extracted in Example 2, which are important for characterizing protein families, were used. The input vectors are analyzed in a high-dimensional feature space to construct the optimal separating hyperplane, and the SVM parameters are learned in the process. The optimal separating hyperplane maximizes the margin between toxic and non-toxic data and classifies the data as toxic or non-toxic. The SVM used in the present invention is a kernel support vector machine: the kernel trick with a radial basis function (RBF) kernel was used for nonlinear classification.

RBF Function
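For reference, the RBF kernel in its commonly used form, K(x, y) = exp(-gamma × ||x - y||²), can be written as the short sketch below. The value of gamma used in the specification is not stated, so the default of 1.0 and the function name `rbf_kernel` are assumptions.

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """Radial basis function kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)
```

The kernel equals 1 for identical feature vectors and decays toward 0 as they move apart, which is what lets the SVM draw a nonlinear boundary in the original feature space.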

An ANN is a statistical learning algorithm inspired by biological neural networks. Artificial neurons (nodes) are joined by synapses to form a network, and learning changes the connection strengths (parameters) of those synapses. An artificial intelligence model that acquires the ability to recognize a given problem by repeating this process is called an ANN. In detail, several layers are created, neurons are created in each layer, and the neurons of the layers are connected. Each neuron multiplies its incoming signals by weights and sums them (weight x input), compares the result with a threshold (weight x input + b), and passes its output to the connected neurons in the next layer; through this series of steps the network acquires cognitive ability (the ability to solve the problem).
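The per-neuron computation described above (weighted sum, bias, threshold comparison) can be illustrated with a single artificial neuron. The step activation and the name `neuron_output` are assumptions made for clarity; practical multi-layer perceptrons normally use smooth or piecewise-linear activations instead.

```python
def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through a step activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias  # weight x input + b
    return 1 if z >= 0 else 0  # fire only when the threshold is reached
```

Here the bias acts as the negative of the firing threshold: the neuron outputs 1 exactly when the weighted sum is at least `-bias`.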

The ANN used was a multi-layer perceptron (MLP) classifier. The weights were updated by backpropagation, with stochastic gradient descent used to compute the weight updates.

Backpropagation Function
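As a rough illustration of a stochastic gradient descent weight update, the sketch below performs one update for a single sigmoid neuron trained with cross-entropy loss. It is a simplified stand-in, not the multi-layer backpropagation of the MLP classifier; the function names and the learning rate are assumptions.

```python
import math

def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def sgd_step(weights, bias, x, y, learning_rate=0.1):
    """One stochastic gradient descent update on a single training example.

    For a sigmoid output with cross-entropy loss, the gradient of the loss
    with respect to the pre-activation is (prediction - label).
    """
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    error = sigmoid(z) - y
    new_weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
    new_bias = bias - learning_rate * error
    return new_weights, new_bias
```

In a multi-layer network the same kind of update is applied to every layer, with backpropagation supplying the error term for the hidden layers.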

The training set was built by randomly sampling 75% of each data set. The samples were merged into a single training set on which each model was trained, and the remaining 25% was used to evaluate model performance.
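The 75%/25% split described above can be sketched as follows. The specification does not say which routine performed the split, so this sketch uses Python's standard `random` module; the function names, the seed, and the data set layout (a dictionary mapping a data set name to a list of records) are assumptions.

```python
import random

def split_dataset(records, train_fraction=0.75, seed=0):
    """Shuffle one data set and split it into training and test portions."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def build_train_test(datasets, train_fraction=0.75, seed=0):
    """Split every data set separately, then merge only the training portions.

    Returns (combined_training_records, {dataset_name: test_records}).
    """
    train, tests = [], {}
    for name, records in datasets.items():
        train_part, test_part = split_dataset(records, train_fraction, seed)
        train.extend(train_part)
        tests[name] = test_part
    return train, tests
```

Keeping the held-out portions separate per data set makes it possible to report accuracy for Positive, Easy, Medium, and Hard individually as well as for the combined test set.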

FIG. 3 shows the overall analysis flow of the protein toxicity prediction system of the present invention. The input data is an amino acid sequence. To extract meaningful features from this sequence, the amino acid frequency, sequence similarity, and hydrophobicity are calculated, and the resulting values are used as the input to the artificial neural network. The artificial neural network is trained by backpropagation to produce a classifier. The resulting classifier then predicts whether a given protein sequence is toxic or non-toxic.

Example 4: Model Evaluation

The performance of the SVM and the ANN was compared using the following values.

▷ True Positive (TP) - number of toxins correctly predicted as toxins

▷ True Negative (TN) - number of non-toxins correctly predicted as non-toxins

▷ False Positive (FP) - number of non-toxins incorrectly predicted as toxins

▷ False Negative (FN) - number of toxins incorrectly predicted as non-toxins

▷ Accuracy - proportion of all predictions that are correct. Accuracy = (TP+TN) / (TP+TN+FP+FN)

▷ Specificity - proportion of non-toxins correctly predicted as non-toxins. Specificity = TN / (TN+FP)

▷ Sensitivity - proportion of toxins correctly predicted as toxins. Sensitivity = TP / (TP+FN)

▷ NPV (negative predictive value; termed Recall in this specification) - proportion of non-toxin predictions that are actually non-toxins. NPV = TN / (TN+FN)

▷ PPV (Precision) - proportion of toxin predictions that are actually toxins. PPV = TP / (TP+FP)

▷ F-score (F1) - harmonic mean of precision (PPV) and sensitivity. F1 = 2*TP / (2*TP+FP+FN)
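The evaluation values listed above follow directly from the four confusion-matrix counts, as in the sketch below. The function name `classification_metrics` is illustrative; the formulas are the ones given in the list.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the evaluation metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "specificity": tn / (tn + fp),
        "sensitivity": tp / (tp + fn),
        "npv": tn / (tn + fn),
        "precision": tp / (tp + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }
```

As a check, the counts reported in this example for the amino acid frequency ANN (TP 5,710, TN 61,150, FP 1,431, FN 2,383) give an accuracy of about 0.946 and an F1 of about 0.75, matching the reported values.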

[Table 1] Amino acid frequency (single)
Metric | Test | Positive | Easy | Medium | Hard
ANN_Accuracy | 94.08% | 70.55% | 99.19% | 96.71% | 89.40%
ANN_Run Time | 0.247 s | 0.046 s | 0.745 s | 0.143 s | 0.089 s
SVM_Accuracy | 89.60% | 78.02% | 97.51% | 86.31% | 58.27%
SVM_Run Time | 23.6 s | 10.8 s | 61 s | 10.9 s | 9.66 s

[Table 2] Amino acid frequency, hydrophobicity, and sequence similarity
Metric | Test | Positive | Easy | Medium | Hard
ANN_Accuracy | 99.38% | 96.91% | 99.95% | 99.86% | 98.33%
ANN_Run Time | 1.07 s | 0.312 s | 1.43 s | 0.319 s | 0.184 s
SVM_Accuracy | 96.22% | 92.73% | 99.78% | 99.89% | 97.47%
SVM_Run Time | 786 s | 356 s | 1957 s | 618 s | 299 s

[Table 3] Evaluation on the Comp data sets
Evaluation index | Amino acid frequency ANN | Amino acid frequency SVM | Hydrophobicity ANN | Hydrophobicity SVM
TP | 5,710 | 6,314 | 7,843 | 7,505
TN | 61,150 | 57,218 | 62,423 | 62,281
FP | 1,431 | 5,363 | 158 | 300
FN | 2,383 | 1,779 | 250 | 588
Accuracy | 0.946 | 0.899 | 0.994 | 0.987
Specificity | 0.977 | 0.914 | 0.997 | 0.995
Sensitivity | 0.706 | 0.78 | 0.969 | 0.927
NPV | 0.962 | 0.97 | 0.996 | 0.991
Precision | 0.8 | 0.541 | 0.98 | 0.962
F1 | 0.75 | 0.639 | 0.975 | 0.944

FIG. 4 shows the flow of the SVM analysis commonly used in protein toxicity prediction. Starting from the same input data as in FIG. 3, namely an amino acid sequence, the amino acid frequency is calculated and used as the input to the SVM analysis. The SVM kernel is usually a radial basis function (RBF). An SVM with an RBF kernel selects the key vectors (support vectors) from the data set and uses them to classify the remaining data.

As shown in Tables 1 and 2, both analysis speed and accuracy were higher than with the conventional SVM (Support Vector Machine) method.

Specifically, when protein toxicity was predicted using hydrophobicity, accuracy on the Test data set was about 9.78 percentage points higher than with the conventional SVM method (99.38% in Table 2 versus 89.60% in Table 1), and the prediction ran about 22 times faster (1.07 s versus 23.6 s).

The difference in accuracy was especially large for the Hard data set, which consists of non-toxic protein sequences with high sequence similarity to toxic proteins: accuracy was 40.06 percentage points higher than with the conventional SVM method (98.33% versus 58.27%), and the analysis ran 52.5 times faster (0.184 s versus 9.66 s).

In addition, as shown in Table 3, when the Comp data sets prepared for comparing machine learning performance were used, the overall performance of the ANN was again better than that of the conventional SVM method.

The largest improvement is in precision. Even in the analysis using amino acid frequency alone, the ANN achieved a precision of about 80.0%, whereas the SVM achieved 54.1%. The hydrophobicity-based ANN analysis and the conventional amino acid frequency SVM analysis differ in precision by about 43.9 percentage points. Higher precision means that a larger proportion of the proteins predicted to be toxic are actually toxic, that is, toxicity is predicted more reliably.

For accuracy, the hydrophobicity-based ANN analysis and the conventional amino acid frequency SVM analysis differ by about 9.5 percentage points. Higher accuracy indicates a better overall prediction rate.

For specificity, the hydrophobicity-based ANN analysis and the conventional amino acid frequency SVM analysis differ by about 8.3 percentage points. Higher specificity means that a larger proportion of non-toxic proteins are predicted to be non-toxic.

For sensitivity, the hydrophobicity-based ANN analysis and the conventional amino acid frequency SVM analysis differ by about 18.9 percentage points. Higher sensitivity means that a larger proportion of toxic proteins are predicted to be toxic.

For NPV (termed recall in this specification), the hydrophobicity-based ANN analysis and the conventional amino acid frequency SVM analysis differ by about 2.6 percentage points. A higher NPV means that a larger proportion of the proteins predicted to be non-toxic are actually non-toxic.

For the F1 score, the hydrophobicity-based ANN analysis and the conventional amino acid frequency SVM analysis differ by about 33.6 percentage points. Because the F1 score is the harmonic mean of precision and sensitivity, a higher F1 score indicates better machine learning performance. The F1 score of the hydrophobicity-based ANN analysis was 33.6 percentage points higher than that of the amino acid frequency SVM analysis, 22.5 percentage points higher than that of the amino acid frequency ANN analysis, and 3.1 percentage points higher than that of the hydrophobicity-based SVM analysis. This confirms that both the use of hydrophobicity and the use of an ANN improve the F1 score, that is, the performance of machine learning for protein toxicity prediction.

From the above description, those skilled in the art to which the present invention pertains will understand that the present invention can be implemented in other specific forms without changing its technical spirit or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not limiting. The scope of the present application should be construed as being defined by the claims set out below rather than by the detailed description above, and as including all changes or modifications derived from the meaning and scope of the claims and their equivalents.

D: database
10: input module
20: amino acid sequence calculation module
30: amino acid hydrophobicity calculation module
40: amino acid sequence similarity calculation module
50: amino acid frequency calculation module
60: toxicity prediction module

Claims

An information providing system comprising:
a database storing data related to a plurality of proteins whose toxicity has been determined according to preset criteria;
an amino acid sequence calculation module for calculating sequence data of the amino acids contained in a protein to be analyzed;
an amino acid hydrophobicity calculation module for calculating, by a preset method, hydrophobicity data corresponding to the amino acid sequence, using the amino acid sequence data calculated by the amino acid sequence calculation module; and
a toxicity prediction module for comparing the hydrophobicity data calculated by the amino acid hydrophobicity calculation module with hydrophobicity data of the plurality of proteins stored in the database, and determining whether the protein to be analyzed is toxic based on first comparison result data resulting from the comparison.

The system of claim 1, further comprising:
an amino acid sequence similarity calculation module that compares the amino acid sequence data with amino acid sequence data of the plurality of proteins stored in the database, and calculates sequence similarity data of the protein to be analyzed based on second comparison result data resulting from the comparison; and
an amino acid frequency calculation module for calculating, by a preset method, frequency data for each amino acid contained in the protein to be analyzed, using the amino acid sequence data,
wherein the toxicity prediction module
compares the frequency data for each amino acid contained in the protein to be analyzed, calculated by the amino acid frequency calculation module, with frequency data for each amino acid contained in the plurality of proteins stored in the database, and determines whether the protein to be analyzed is toxic further based on third comparison result data resulting from the comparison and on the sequence similarity data calculated by the amino acid sequence similarity calculation module.

The system of claim 2,
wherein the hydrophobicity data calculated by the amino acid hydrophobicity calculation module is calculated using Equation 1 below:
[Equation 1]
Hydrophobicity data = Σ_i (R_i × I_i)
where, in Equation 1, R is any amino acid contained in the protein to be analyzed, and I is a hydrophobicity index value predetermined for that amino acid.

The system of claim 3,
wherein the sequence similarity data calculated by the amino acid sequence similarity calculation module is calculated using Equation 2 below:
[Equation 2]
Sequence similarity data = aligned query protein sequence length / total query protein sequence length
where, in Equation 2, the total query protein sequence length is the number of amino acids in the sequence of the protein to be analyzed, and the aligned query protein sequence length is the number of amino acids in the amino acid sequence data of the protein to be analyzed that are identical to the amino acid sequence data of the plurality of proteins stored in the database.

The system of claim 4,
wherein the amino acid frequency data calculated by the amino acid frequency calculation module is calculated using Equation 3 below:
[Equation 3]
Amino acid frequency data = R_i / N × 100 (%)
where, in Equation 3, N is the number of amino acids in the sequence of the protein to be analyzed.

The system of claim 5,
wherein the toxicity prediction module inputs the hydrophobicity data, the sequence similarity data, and the amino acid frequency data into an artificial neural network (ANN) algorithm, and determines whether the protein to be analyzed is toxic using the output data produced for that input.

The system of claim 6,
further comprising a separate database different from the database,
wherein the protein toxicity determined by the toxicity prediction module is stored in the separate database.

A method of providing information using a system that provides information including whether a protein to be analyzed is toxic,
wherein the system
includes a database storing data related to a plurality of proteins whose toxicity has been determined according to preset criteria,
the method comprising:
(a) calculating, by an amino acid sequence calculation module, amino acid sequence data of the protein to be analyzed;
(b) calculating, by an amino acid hydrophobicity calculation module, hydrophobicity data corresponding to the amino acid sequence by a preset method, using the amino acid sequence data calculated in step (a); and
(c) comparing, by a toxicity prediction module, the hydrophobicity data calculated in step (b) with hydrophobicity data of the plurality of proteins stored in the database, and determining whether the protein to be analyzed is toxic based on first comparison result data resulting from the comparison.

The method of claim 8,
wherein step (b) includes:
(b1) comparing, by an amino acid sequence similarity calculation module, the amino acid sequence data calculated in step (a) with amino acid sequence data of the plurality of proteins stored in the database, and calculating sequence similarity data of the protein to be analyzed based on second comparison result data resulting from the comparison; and
(b2) calculating, by an amino acid frequency calculation module, frequency data for each amino acid contained in the protein to be analyzed by a preset method, using the amino acid sequence data calculated in step (a),
and wherein step (c)
further includes comparing, by the toxicity prediction module, the frequency data calculated in step (b2) with frequency data for each amino acid contained in the plurality of proteins stored in the database, and determining whether the protein to be analyzed is toxic based on third comparison result data resulting from the comparison and on the sequence similarity data calculated in step (b1).

The method of claim 9,
wherein the hydrophobicity data calculated in step (b) is calculated using Equation 1 below:
[Equation 1]
Hydrophobicity data = Σ_i (R_i × I_i)
where, in Equation 1, R is any amino acid contained in the protein to be analyzed, and I is a hydrophobicity index value predetermined for that amino acid.

The method of claim 10,
wherein the sequence similarity data calculated in step (b1) is calculated using Equation 2 below:
[Equation 2]
Sequence similarity data = aligned query protein sequence length / total query protein sequence length
where, in Equation 2, the total query protein sequence length is the number of amino acids in the sequence of the protein to be analyzed, and the aligned query protein sequence length is the number of amino acids in the amino acid sequence data of the protein to be analyzed that are identical to the amino acid sequence data of the plurality of proteins stored in the database.

The method of claim 11,
wherein the amino acid frequency data calculated in step (b2) is calculated using Equation 3 below:
[Equation 3]
Amino acid frequency data = R_i / N × 100 (%)
where, in Equation 3, N is the number of amino acids in the sequence of the protein to be analyzed.

The method of claim 12,
wherein step (c)
further includes inputting, by the toxicity prediction module, the hydrophobicity data, the sequence similarity data, and the amino acid frequency data into an artificial neural network (ANN) algorithm, and determining whether the protein to be analyzed is toxic using the output data produced for that input.

A computer program stored in a computer readable storage medium for executing the method of any one of claims 8 to 13 using a computer.