KR20120085137A

KR20120085137A - Multiple linear regression-artificial neural network hybrid model predicting formation energy of ideal gas of pure organic compound

Info

Publication number: KR20120085137A
Application number: KR1020110100796A
Authority: KR
Inventors: 성애리; 권오형; 권윤경; 김양수; 전정재; 정원천; 조준혁; 박태윤
Original assignee: 주식회사 켐에쎈
Priority date: 2011-10-04
Filing date: 2011-10-04
Publication date: 2012-07-31
Also published as: KR101325097B1

Abstract

PURPOSE: A multiple linear regression (MLR)-artificial neural network (ANN) hybrid model for predicting the formation energy of the ideal gas of a pure organic compound is provided. The ANN receives the molecular descriptors included in an optimal multiple linear regression model (MLRM) as inputs and outputs a standard state absolute entropy, thereby improving prediction performance. CONSTITUTION: Optimal molecular descriptors are extracted for an experimental data set. The experimental data are split into a training set and a test set. An optimal MLRM is searched for on the training set. If its performance on the test set satisfies the criteria, it is adopted as the optimal MLRM. The ideal-gas formation energy values for the experimental data set are then obtained through the MLRM found.

Description

Multiple Linear Regression-Artificial Neural Network Hybrid Model Predicting Formation Energy of Ideal Gas of Pure Organic Compound

The present invention belongs to property prediction, a branch of physical chemistry, and relates to a method for predicting with high accuracy the formation energy of the ideal gas, one of the many physical properties of organic compounds.

Knowing the exact values of the physical properties of an organic compound is critical to decisions made across the whole cycle of production and consumption: assessing whether the substance suits its intended use, designing synthesis and purification processes, and setting the methods and conditions for storage, transport, use, and disposal. It is therefore a very important problem both industrially and academically. Experiment is of course the most accurate way to obtain the value of a property of interest for a compound of interest, but it costs considerable time and money in many respects, such as preparing purified samples and building an environment for accurate measurement, and in some cases it is impossible. As an alternative, many researchers have long worked to predict the values of the various properties of organic compounds accurately. Property prediction thus has a long history, new prediction methods keep appearing, and several prediction models that differ in accuracy and range of application now coexist for each property.

Several prediction models have also been proposed for the formation energy of the ideal gas, the property of interest in the present invention. The formation energy of the ideal gas (Enthalpy of Formation for Ideal Gas at 298K) refers to the ideal-gas formation energy of a pure substance at 298.15 K (absolute temperature). Earlier studies on its prediction are briefly introduced in [Poling B. E., Prausnitz J. M., O'Connell J. P., The Properties of Gases and Liquids (5 ed.), New York: McGraw Hill (2000)]. The well-known and widely used models for predicting the formation energy of the ideal gas rely mainly on group contribution methods and quantum mechanical calculations. Table 1 lists, in chronological order, the major group contribution models proposed so far for predicting the formation energy of the ideal gas.

Table 1
Year    Proponent
1968    Benson
1969    Benson & Buss
1971    Thinh, T.-P. & T. K. Trong
1984    Joback
1992    Ratkey & Harrison
1994    Constantinou & Gani
1994    Mavrovouniotis & Constantinou
1997    Forsythe & Karp
1997    Steele

A typical group contribution model works as follows. To obtain the property value, the molecule of the compound of interest is first split according to a predefined set of fragment types, and the number of fragments of each type is counted. Each count is then multiplied by the coefficient assigned to that type, and the sum of these products is the predicted value. The coefficients are determined by statistical methods from compounds with experimental values so that the model performs as well as possible.
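The sum of counts times coefficients described above can be sketched in a few lines of Python. This is only an illustration: the fragment names, the decomposition, and the coefficient values are placeholders chosen for the example and are not taken from Benson, Joback, or any other published model.

```python
def group_contribution_estimate(fragment_counts, coefficients):
    """Sum count * coefficient over every fragment type found in the molecule."""
    unknown = [name for name in fragment_counts if name not in coefficients]
    if unknown:
        # A fragment type without a coefficient means the model cannot be applied.
        raise KeyError(f"no coefficient for fragment type(s): {unknown}")
    return sum(n * coefficients[name] for name, n in fragment_counts.items())


# Placeholder coefficients, NOT values from any published model.
coefficients = {"CH3": -10.0, "CH2": -5.0}
# Hypothetical decomposition of a three-carbon chain: two CH3 groups and one CH2 group.
fragment_counts = {"CH3": 2, "CH2": 1}
estimate = group_contribution_estimate(fragment_counts, coefficients)
```

The explicit `KeyError` makes the case where a molecule cannot be decomposed into known fragment types visible instead of silently returning a partial sum.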

Group contribution methods have had a measure of success, but they lack a theoretical basis, and the split into fragment types is sometimes not unique or does not exist at all, in which case the value cannot be calculated. In addition, the more a model is refined to raise its predictive performance, the more complicated and harder to handle it becomes.

One alternative to the group contribution method for building a prediction model is the QSPR (quantitative structure-property relationship) method. It starts from the assumption that the physical properties of a compound are functions of the structural characteristics of its molecule, and it uses a variety of molecular descriptors that reflect different structural characteristics. Thousands of molecular descriptors have been proposed to date, and calculation methods exist for a great many of them, from simple ones such as the number of carbon or hydrogen atoms in a molecule to complex ones such as molecular shape, connectivity, and electrochemical characteristics [Todeschini R., Consonni V., Molecular Descriptors for Chemoinformatics: Second, Revised and Enlarged Edition: Volume I/II, Wiley-VCH, 2009]. A QSPR prediction model is presented as a function whose independent variables include some of these molecular descriptors and, in some cases, some of the compound's other physicochemical properties (which are themselves functions of structural characteristics).

Another alternative to the group contribution method for building a prediction model is quantum mechanical calculation. High-level quantum mechanical calculations give accurate predictions, but they have the major drawback of taking far too much calculation time. We therefore chose to run a quantum mechanical calculation at a moderate level and then correct the error in the predicted value that results from the lower level of theory. The QSPR (quantitative structure-property relationship) method was used to correct this error. It starts from the assumption that the physical properties of a compound are functions of the structural characteristics of its molecule, and it uses a variety of molecular descriptors that reflect different structural characteristics. Thousands of molecular descriptors have been proposed to date, and calculation methods exist for a great many of them, from simple ones such as the number of carbon or hydrogen atoms in a molecule to complex ones such as molecular shape, connectivity, and electrochemical characteristics [Todeschini R., Consonni V., Molecular Descriptors for Chemoinformatics: Second, Revised and Enlarged Edition: Volume I/II, Wiley-VCH, 2009]. The QSPR prediction model is presented as a function whose independent variables include some of these molecular descriptors and, in some cases, some of the hybrid orbital types within the atoms, taken from the compound's AOMG (Atomic Orbital Molecular Graph) molecular descriptors (these too are functions of structural characteristics).

The form most frequently adopted for such a function is a linear combination of the descriptors, and the coefficients are determined from experimental data, mainly by multiple linear regression analysis.
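As an illustration of fitting such a linear combination, the sketch below solves the least-squares normal equations in plain Python. The inclusion of an intercept term, the function names, and the toy data are assumptions made for the example; the patent does not state which software or numerical procedure was used.

```python
def fit_mlr(X, y):
    """Return [b0, b1, ..., bk] minimizing the squared error of y ~ b0 + sum(bj * xj)."""
    rows = [[1.0] + [float(v) for v in r] for r in X]  # prepend the intercept column
    k = len(rows[0])
    # Normal equations (A^T A) b = A^T y, stored as an augmented matrix.
    M = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda idx: abs(M[idx][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("descriptors are linearly dependent")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(k):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [M[i][k] / M[i][i] for i in range(k)]


def predict_mlr(coeffs, x):
    """Evaluate the fitted linear combination for one descriptor vector x."""
    return coeffs[0] + sum(b * v for b, v in zip(coeffs[1:], x))


# Toy data generated from y = 1 + 2*x1 + 3*x2 (made up for the example).
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
y = [1, 3, 4, 6, 8]
coeffs = fit_mlr(X, y)
```

For real descriptor tables a numerically more robust solver (for example a QR-based least-squares routine) would normally be preferred over the normal equations.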

Using the types of hybrid orbitals within an atom as molecular descriptors adapts an idea taken from the AOMG (Atomic Orbital Molecular Graph) molecular descriptors: the energy that each atom contributes to the formation energy of the ideal gas is expected to differ with the type of its hybrid orbitals. The AOMG patterns are explained in detail later in this document, in the detailed description of the invention.

Another way to build a QSPR model is to use an artificial neural network. The artificial neural network is one outcome of long-standing efforts to build artificially intelligent machines by modeling the nerve cells of intelligent humans; it first appeared in the mid-20th century and is now an information processing technique applied in many fields. FIG. 2 shows a typical example of an artificial neural network. As shown there, an artificial neural network has an input layer that receives the input data, an output layer that produces the output data, and a hidden layer located between them, and each layer consists of one or more nodes. Each node of the hidden layer is connected to the nodes of the input and output layers, and each connection is assigned a quantity called a weight. Each node of the hidden and output layers receives inputs from the nodes of the preceding layer through these connections and processes them into an output value by applying a function called an activation function. To put such a network to practical use, it must first be trained with a sample set that pairs various input values with their corresponding output values. Training means optimizing the weight of each connection with the back propagation algorithm so that the difference between the network's output for a given input and the desired output is minimized. A network trained in this way establishes general rules by itself through learning, without being given the rules or knowledge needed to solve the problem, and returns reasonable outputs even for unseen inputs. It is therefore widely used as a very useful tool in fields that lack an underlying theory, such as the prediction of the physical properties of compounds.
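The layer structure and the back propagation update described above can be illustrated with a very small network. The sketch assumes one hidden layer with a sigmoid activation, a single linear output node, a squared-error loss, and plain gradient descent; the layer sizes, starting weights, and learning rate are arbitrary values chosen for the example, not parameters from the patent.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W1, b1, W2, b2):
    """Input layer -> sigmoid hidden layer -> single linear output node."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    output = sum(w * h for w, h in zip(W2, hidden)) + b2
    return hidden, output


def backprop_step(x, target, W1, b1, W2, b2, lr=0.1):
    """One gradient-descent update on the loss 0.5 * (output - target) ** 2."""
    hidden, output = forward(x, W1, b1, W2, b2)
    err = output - target  # derivative of the loss with respect to the output
    # Hidden-layer gradients are computed with the output weights before the update.
    delta_hidden = [err * w * h * (1.0 - h) for w, h in zip(W2, hidden)]
    new_W2 = [w - lr * err * h for w, h in zip(W2, hidden)]
    new_b2 = b2 - lr * err
    new_W1 = [[w - lr * d * xi for w, xi in zip(row, x)]
              for row, d in zip(W1, delta_hidden)]
    new_b1 = [b - lr * d for b, d in zip(b1, delta_hidden)]
    return new_W1, new_b1, new_W2, new_b2


# Arbitrary starting weights for a 2-input, 2-hidden-node network (illustrative only).
params = ([[0.1, -0.2], [0.3, 0.4]], [0.0, 0.0], [0.5, -0.5], 0.0)
x, target = [0.5, -1.0], 1.0
for _ in range(200):
    params = backprop_step(x, target, *params)
```

Training on a single sample only shows the mechanics of the weight update; a usable model would loop over the whole training set and monitor the error on held-out data.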

Few QSPR prediction models that use AOMG molecular descriptors have been proposed for the formation energy of the ideal gas, and those that exist either rely on a small number of sample compounds or apply only to a particular family of compounds. The reference [Andres Mercader, Eduardo A. Castro, Andrey A. Toropov, QSPR modeling of the enthalpy of formation from elements by means of correlation weighting of local invariants of atomic orbital molecular graphs, Chemical Physics Letters 330 (2000) 612-623] reports a model with a coefficient of determination of 0.99102 built from data for 65 compounds.
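For readers unfamiliar with the coefficient of determination quoted above, the following sketch computes it in its common form, one minus the ratio of the residual sum of squares to the total sum of squares. Whether the cited paper used exactly this definition is an assumption, and the function name is illustrative.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values are all identical")
    return 1.0 - ss_res / ss_tot
```

A value of 1 means the predictions reproduce the observed values exactly, while a model that always predicts the mean of the observed values scores 0.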

The technical problem to be solved by the present invention is to overcome the shortcomings of the existing models mentioned above and to build a multiple linear regression-artificial neural network hybrid model, with broader applicability and more accurate predictive performance, for the formation energy of the ideal gas of pure organic compounds whose molecules are composed of at most five elements, namely hydrogen (H), carbon (C), nitrogen (N), oxygen (O), and sulfur (S), and contain 25 or fewer atoms other than hydrogen.
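The scope restriction stated above lends itself to a simple check. The sketch below assumes that a molecule is supplied as a dictionary of element counts; that input format and the function name are choices made for the example.

```python
ALLOWED_ELEMENTS = {"H", "C", "N", "O", "S"}
MAX_HEAVY_ATOMS = 25  # atoms other than hydrogen


def in_scope(element_counts):
    """True if the molecule uses only H, C, N, O, S and has at most 25 non-hydrogen atoms."""
    if not set(element_counts) <= ALLOWED_ELEMENTS:
        return False
    heavy_atoms = sum(n for element, n in element_counts.items() if element != "H")
    return heavy_atoms <= MAX_HEAVY_ATOMS


ethanol = {"C": 2, "H": 6, "O": 1}      # inside the stated range
chloroform = {"C": 1, "H": 1, "Cl": 3}  # contains an element outside the range
```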

We achieved this goal by building, on the foundation of quantum mechanical calculations of the formation energy of the ideal gas, a multiple linear regression model that draws on more experimental data and considers a wider variety of molecular descriptors. The range of applicable organic compounds is restricted as stated above mainly because some of the molecular descriptors used require quantum mechanical calculation, and with current technology such calculations run into difficulties of accuracy and calculation time for compounds outside the stated range. Even within this range, however, there are a great many compounds, including a large share of the industrially important ones, so the present invention is expected to be of great benefit to society.

Humanity today depends on a vast range of organic compounds, including plastics, fibers, rubber, paints, fertilizers, medicines, and fuels, and this dependence is expected to deepen. According to the American Chemical Society (ACS), more than 54,000,000 compounds had been registered as of July 2010. By contrast, the number of compounds for which even a single property value is known experimentally is at most a few tens of thousands. Property values of compounds are essential to a better material life: the development of new materials and drugs, the optimal design of chemical plants, higher productivity of existing facilities, the development and conservation of resources, safety, and environmental protection. In particular, the formation energy of the ideal gas is an important property whose accurate value is urgently needed by commercial programs such as AspenPlus and Pro/II, which are well known as optimal design programs for chemical plants. Yet the number of compounds for which its experimental value is known is a few thousand at most, and for some compounds obtaining data by experiment is extremely difficult because of toxicity, instability, or difficulty of purification. From this point of view, the present invention, which yields highly accurate ideal-gas formation energy values for a large number of compounds from molecular information alone and without experiment, not only saves the cost and time of experiments but also provides an estimate where experiment is impossible. It thereby facilitates research and development in the related industries and, further, supplies sound information to academia, government, and anywhere else the value is needed, so that their work can proceed more smoothly.

FIG. 1 is a flowchart showing the process of building the multiple linear regression-artificial neural network hybrid model for the formation energy of the ideal gas provided by the present invention.
FIG. 2 shows the AOMG types that a pure compound can have, as used in the multiple linear regression-artificial neural network hybrid model for the formation energy of the ideal gas provided by the present invention.
FIG. 3 shows the AOMG types arising from covalent bonds that a pure compound can have, as used in the multiple linear regression-artificial neural network hybrid model for the formation energy of the ideal gas provided by the present invention.
FIG. 4 shows the structure of the artificial neural network used in the present invention.
FIG. 5 is a parity plot for 1407 experimental data points of the Joback model, a group contribution model for the formation energy of the ideal gas.
FIG. 6 is a parity plot for 1318 experimental data points of the Gani model, a group contribution model for the formation energy of the ideal gas.
FIG. 7 is a parity plot for 1536 experimental data points of the multiple linear regression-artificial neural network hybrid model for the formation energy of the ideal gas provided by the present invention.
FIG. 8 is a histogram for 1407 experimental data points of the Joback model.
FIG. 9 is a histogram for 1318 experimental data points of the Gani model.
FIG. 10 is a histogram for 1536 experimental data points of the multiple linear regression-artificial neural network hybrid model.

FIG. 1 is a flowchart that summarizes the process of building the multiple linear regression-artificial neural network hybrid model for the formation energy of the ideal gas.

The first task in building the model is, as specified in step 1, to collect, review, and classify the experimental data. For the present invention, an extensive survey was made of all available literature and data, including papers, books, and Internet sites, and ideal-gas formation energy data were collected for a total of 4887 compounds that meet the conditions of the present invention. The collected data were examined from several angles to judge whether they were valid values for building a model. Data were closely analyzed and then corrected or deleted where the value was not experimental, where the data were recorded incorrectly, where values for the same compound differed widely, or where a value was implausibly far from those of closely related compounds. Data for a total of 2041 compounds were finally selected.
Earlier experience in building property prediction models showed that predictive performance is better when the sample compounds are classified into pure molecules and radical-containing molecules and modeled separately, and likewise when they are classified into hydrocarbons, which consist only of carbon and hydrogen, and non-hydrocarbons. The compounds were therefore classified into 524 radical-containing molecules, 663 hydrocarbons, and 854 non-hydrocarbons, and a model was established for each. Among the molecules with a radical center, modeling separately those in which a resonance structure interacts with the radical center and those that simply contain a radical center improves prediction accuracy, so they were further divided into these two groups. In the present invention, 'organic compound' or 'compound' refers to a substance whose molecules are composed of at most five elements, namely hydrogen (H), carbon (C), nitrogen (N), oxygen (O), and sulfur (S), and contain 25 or fewer atoms other than hydrogen.
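The three-way classification described above can be expressed as a short function. The sketch assumes each compound is given as a dictionary of element counts plus a flag marking a radical center; both the input format and the class labels are illustrative, and the further split of the radical class by resonance interaction is not covered.

```python
def classify(element_counts, has_radical_center):
    """Assign a compound to one of the three model classes named in the text."""
    if has_radical_center:
        return "radical"
    if set(element_counts) <= {"C", "H"}:
        return "hydrocarbon"
    return "non-hydrocarbon"


# Class sizes reported in the text; they add up to the 2041 selected compounds.
class_sizes = {"radical": 524, "hydrocarbon": 663, "non-hydrocarbon": 854}
total_selected = sum(class_sizes.values())
```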

The next step is to prepare predicted values for these compounds by quantum mechanical calculation. To calculate the electronic structure of a molecule, the Schrödinger equation is normally solved by a purely theoretical method to obtain the electronic energy. For many-electron systems, the solution is obtained with the Hartree-Fock (HF) method [C. C. J. Roothaan, Rev. Mod. Phys. 23, 69 (1951)], which applies an approximation that neglects electron correlation. Because this approximation introduces an inherent error into the results, post-Hartree-Fock methods [C. Møller and M. S. Plesset, Phys. Rev. 46, 618 (1934)], which add higher-order theoretical perturbation terms, are used to obtain more accurate solutions, but they require vastly more computation. Calculating large molecules this way is impractical in terms of cost and time.

In addition, the Gaussian methods [L. A. Curtiss, K. Raghavachari, G. W. Trucks, and J. A. Pople, J. Chem. Phys. 94, 7221 (1991); L. A. Curtiss, K. Raghavachari, P. C. Redfern, V. Rassolov, and J. A. Pople, J. Chem. Phys. 109, 7764 (1998)], which combine Hartree-Fock and post-Hartree-Fock calculations, show very small errors in energy prediction but require still more computation because they carry out energy calculations with several post-Hartree-Fock methods.

To account for the correlation between electrons in a many-electron molecule, the calculations applied density functional theory [R. Seeger and J. A. Pople, J. Chem. Phys. 66, 3045 (1977)], which finds the ground state from a functional of the total energy expressed in terms of the electron density function, instead of a wave function augmented with higher-order perturbation terms. The advantage of density functional theory is that only the electron density needs to be considered, so more accurate results can be obtained at a computational cost comparable to that of the Hartree-Fock method. The exchange-correlation energy of the electrons is calculated with a combination of exchange functionals and correlation functionals, which improves the results further without increasing the amount of computation.

The theories tried beforehand in order to select the best quantum mechanical calculation method were the Hartree-Fock method, various post-Hartree-Fock methods, the Gaussian (G2, G3) methods, and density functional theory with various combinations of functionals, all mentioned above. From these, one density functional theory method, which gave the best performance for its calculation time, was selected.

Accordingly, in the present invention, a commercial quantum mechanical calculation program is used to perform geometry optimization and frequency calculation of the molecular structure with the specified density functional theory method.

The next step is to prepare the AOMG values used to correct the error of the quantum mechanical calculation. We first explain how the types of hybrid orbitals are distinguished for compounds consisting of pure molecules. Because the target molecules consist only of C, H, N, O, and S, the combinations of constituent atoms and bonding forms found in general organic compounds are those shown in FIGS. 2 and 3. The patterns in FIG. 3 are those contained in covalent bonds of the carbon, nitrogen, and oxygen atoms in a molecule. Because the presence or absence of such a bond gives the total energy of the molecule a particular stabilization, these are also treated as separate patterns through which each atom in the molecule contributes to the energy. Thus 17 AOMG patterns can be defined for all molecules consisting only of C, H, N, O, and S. These 17 AOMG patterns are applied to all pure molecular compounds, whereas molecular compounds with a radical center use 15 AOMG patterns, which include only the C5 pattern from among the AOMG patterns that can appear in the covalent bonds shown in FIG. 3. Judging from the predictive performance obtained up to this point, the hydrocarbons among the pure molecules performed quite well, so the subsequent steps were not carried out for them.
For the non-hydrocarbons and for the molecules without interaction between a radical center and a resonance structure, the predictive performance was good but not yet satisfactory. Rather than adding more molecular descriptors, predictive performance was therefore raised by using an artificial neural network to take into account the nonlinearity between the molecular descriptors and the predicted property.

The next step is to prepare the values of the molecular descriptors for the compounds in which a resonance structure interacts with a radical center. In this document, the preceding quantum mechanical calculation and the calculation of the AOMG pattern values are together referred to as the step of preparing the molecular descriptor values. Values for a total of 1978 different molecular descriptors were computed in batch from files containing the molecular information of each compound. Unsuitable descriptors, that is, those whose value is the same for every sample compound and which therefore cannot serve as independent variables of the model, were then removed. Reducing the number of molecular descriptors in this way shortens the computation time needed to find the optimal model.
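The removal of descriptors that take the same value for every sample compound can be sketched as follows. The table layout (descriptor name mapped to a list of per-compound values) and the descriptor names are assumptions made for the example.

```python
def drop_constant_descriptors(table):
    """Keep only descriptors whose values differ for at least two sample compounds.

    table maps a descriptor name to the list of its values, one per sample compound.
    """
    return {name: values for name, values in table.items() if len(set(values)) > 1}


# Hypothetical descriptor table for three compounds.
descriptors = {
    "n_carbon": [2, 3, 4],
    "n_fluorine": [0, 0, 0],  # constant, so it cannot act as an independent variable
    "mol_weight": [46.07, 60.10, 74.12],
}
usable = drop_constant_descriptors(descriptors)
```

Descriptors that are nearly constant or strongly correlated with one another would pass this filter; handling them is outside what the text describes.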

In step 4, the sample compounds are divided into two parts: a training set used to search for the prediction model and a test set used to test the predictive performance of the chosen model. The sample compounds used for this model were first divided broadly into those with a radical center and pure non-hydrocarbon molecules without a radical center. The molecules with a radical center were then divided into those in which the radical center interacts with a resonance structure and those in which it does not. Each of these three sets of sample compounds was split into a training set and a test set at a ratio between 5:5 and 8:2, preferably 6:4, taking care that similar molecules were not concentrated in only one of the two parts.

Next, using the training set of the molecules in which the radical center interacts with a resonance structure, the best multiple linear regression model is sought with a genetic algorithm [Judson, "Genetic Algorithms and Their Uses in Chemistry", Reviews in Computational Chemistry, Lipkowitz & Boyd, Eds., Vol. 10, pp. 1-73 (VCH Publishers, NY, 1997)]. Here 'best' is meant in a relative sense: a model that can be obtained in a relatively short time and whose performance is very close to that of the optimal solution in the absolute sense. The optimal solution is not sought directly because of the long computation time it would require. For example, if 1700 of the 1978 molecular descriptors are suitable, the total number of different multiple linear regression models that can be built by choosing 5 of them is

$$\binom{1700}{5} = \frac{1700!}{5!\,1695!} \approx 1.18 \times 10^{14}$$

and it is practically impossible to examine them all.
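The size of this search space can be checked directly. The short sketch below counts the ways of choosing 5 descriptors out of 1700 with the Python standard library; the variable name `n_models` is only illustrative.

```python
import math

# Number of distinct 5-descriptor models that can be drawn from 1700 descriptors.
n_models = math.comb(1700, 5)
print(n_models)           # 117626840087840
print(f"{n_models:.3e}")  # 1.176e+14
```

Fitting on the order of 10^14 regression models one by one is what makes an exhaustive search impractical and motivates a heuristic search.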

To obtain useful results within a limited time, the present invention adopts a genetic algorithm, which works as follows. First, a population is created that consists of many multiple linear regression models, each built from a fixed number of molecular descriptors drawn at random from the pool of molecular descriptors. Suppose, for example, that a population of 1000 different multiple linear regression models is created, each built from 5 descriptors drawn at random from the 1700 suitable molecular descriptors.
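As a rough sketch of this initialization, the code below draws 1000 distinct models of 5 descriptors each from a pool of 1700. The function name `make_population`, the fixed seed, and the choice to store each model as a sorted tuple of descriptor indices are assumptions made for the example.

```python
import random


def make_population(n_descriptors, model_size, pop_size, seed=0):
    """Draw `pop_size` distinct models, each a tuple of descriptor indices."""
    rng = random.Random(seed)
    population = set()
    while len(population) < pop_size:
        # rng.sample draws without replacement, so no index repeats in a model.
        indices = rng.sample(range(1, n_descriptors + 1), model_size)
        population.add(tuple(sorted(indices)))
    return list(population)


population = make_population(n_descriptors=1700, model_size=5, pop_size=1000)
print(len(population), population[0])
```

Collecting the models in a set guarantees that the 1000 models differ from one another, as the example in the text requires.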

Each individual, called a chromosome, is encoded by combining the index numbers of the selected molecular descriptors. For example, the chromosome of the multiple linear regression model built from the 45th, 167th, 684th, 1033rd, and 1502nd of the 1700 molecular descriptors can be written as (45, 167, 684, 1033, 1502). Two parent chromosomes are selected from the population created in this way and crossed over to produce children. In the present invention, the Roulette Wheel method is adopted as the selection technique for choosing the parent chromosomes.

The Roulette Wheel method is the most commonly used selection algorithm. Each chromosome is assigned a sector of the roulette wheel in proportion to its fitness, and the wheel is then spun to select the chromosome of the sector on which it stops. Thus the higher the fitness of an individual, the higher its probability of being selected. The fitness of each chromosome, which determines its selection probability, was calculated from the coefficient of determination ($R^2$) or the average absolute error (AAE) of the regression model. That is, a larger coefficient of determination or a smaller average absolute error gives a higher selection probability.
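Fitness-proportional selection of this kind can be sketched as below. The function name `roulette_select` and the toy fitness values are assumptions; the patent does not state how an AAE is turned into a fitness, so a caller would have to supply some decreasing transform such as 1/AAE, which is likewise an assumption.

```python
import random


def roulette_select(population, fitness, rng):
    """Pick one individual with probability proportional to its fitness."""
    total = sum(fitness)
    pick = rng.uniform(0, total)  # where the wheel stops
    running = 0.0
    for individual, fit in zip(population, fitness):
        running += fit
        if pick <= running:
            return individual
    return population[-1]  # guard against floating-point round-off


rng = random.Random(42)
chromosomes = [(45, 167, 684, 1033, 1502), (24, 262, 343, 789, 1290)]
r2_values = [0.90, 0.99]  # made-up coefficients of determination
parent = roulette_select(chromosomes, r2_values, rng)
print(parent)
```

Because the sector widths are the fitness values themselves, two models with nearly equal $R^2$ are selected almost equally often; rescaling the fitness would sharpen the preference for the better model.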

Single-point crossover was adopted as the crossover method. It is the most common crossover method: one crossover point is chosen at random on the parent chromosomes, and the chromosome segments on either side of that point are exchanged to produce the children. For example, if the parent chromosomes are (24, 262, 343, 789, 1290) and (38, 454, 554, 1322, 1449) and the crossover point falls between the third and fourth elements, the child chromosomes are (24, 262, 343, 1322, 1449) and (38, 454, 554, 789, 1290).
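The worked example above can be reproduced with a few lines of code. The function name `single_point_crossover` is illustrative; the parent chromosomes and the crossover point are the ones given in the text.

```python
def single_point_crossover(parent_a, parent_b, point):
    """Swap the tails of two chromosomes after position `point`."""
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


a = (24, 262, 343, 789, 1290)
b = (38, 454, 554, 1322, 1449)
print(single_point_crossover(a, b, 3))
# ((24, 262, 343, 1322, 1449), (38, 454, 554, 789, 1290))
```

If both parents happened to contain the same descriptor, a child could end up with that descriptor twice. The text does not say how such a case is handled, so this sketch leaves it untreated.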

Once the children have been produced, part of their chromosomes is mutated with a certain probability. Mutation replaces a few randomly chosen elements with entirely new values. Because it introduces new information that is not present in the current population, it allows the search to reach regions outside the initial gene combinations and thus compensates for the case in which no suitable solution exists among the combinations of the initial population.
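A minimal sketch of such a mutation step follows. The function name `mutate`, the per-element mutation rate, and the rule that a replacement must not duplicate a descriptor already in the chromosome are assumptions introduced for this example.

```python
import random


def mutate(chromosome, n_descriptors, rate, rng):
    """Replace each element, with probability `rate`, by an unused descriptor index."""
    genes = list(chromosome)
    for i in range(len(genes)):
        if rng.random() < rate:
            # Candidates exclude every index currently in the chromosome.
            candidates = [d for d in range(1, n_descriptors + 1) if d not in genes]
            genes[i] = rng.choice(candidates)
    return tuple(genes)


rng = random.Random(7)
child = (24, 262, 343, 1322, 1449)
print(mutate(child, n_descriptors=1700, rate=0.1, rng=rng))
```

A low rate keeps most children intact while still injecting descriptor indices that no member of the current population carries.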

Some or all of the existing population is replaced with the individuals newly obtained in this way to create the population of the next generation. This process is repeated, and when the number of generations reaches a predetermined value (usually chosen between 10 and 1000), the individual with the highest fitness, that is, the regression model with the best predictive performance, is selected and the search ends.

Once the best multiple linear regression model has been selected, its validity is examined in the next step. If a problem is found, for example a poor t-test value for a molecular descriptor in the model, the procedure returns to the previous step to find another model. For example, when the number of sample compounds is 1005 and the selected model consists of 5 molecular descriptors, a t-test value of 3.3 or more for one of the descriptors means that the probability of that descriptor being unrelated to the property is 0.1% or less. In the present invention, if a molecular descriptor with a t-test value below about 3 was present, the selected model was discarded and another model was sought. A model was likewise discarded if the values of one of its molecular descriptors were identical for all but a few of the sample compounds, since such a model cannot be considered reliable. In general, increasing the number of molecular descriptors in the model improves the predictive performance but gives rise to these problems, so the final model is usually obtained by repeating these steps through several rounds of trial and error while varying the number of molecular descriptors. When the selected model shows no further problems, the procedure moves on to the next step.
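The link between a t-test value of 3.3 and a probability of about 0.1% can be checked numerically. With 1005 samples and 5 descriptors there are roughly 999 degrees of freedom, so the sketch below assumes that the t distribution may be approximated by a standard normal distribution; that approximation and the function name `two_sided_p` are assumptions of this example, not statements from the patent.

```python
from statistics import NormalDist


def two_sided_p(t_value):
    """Two-sided tail probability under a standard normal approximation."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t_value)))


print(two_sided_p(3.3) < 0.001)  # True
print(two_sided_p(3.0) < 0.001)  # False
```

Under this approximation a value of 3.3 falls just inside the 0.1% level, whereas the working threshold of about 3 corresponds to a tail probability of a few tenths of a percent.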

In step 7, the predictive performance of the model is evaluated with the test set, which took no part in building the model. If a problem is found, such as predictive performance much worse than on the training set or samples whose predictions deviate greatly, the procedure returns to step 4, the training set and test set are readjusted, and the subsequent steps are carried out again. The predictive performance is judged satisfactory if the difference between the training set and the test set does not exceed 20% of the average absolute error (AAE) obtained for the training set.
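The acceptance rule can be written as a one-line check. The sketch below reads "the difference between the training set and the test set" as the difference between their AAE values, which is an interpretation rather than something the text states explicitly; the function name `passes_test_check` and the numbers are hypothetical.

```python
def passes_test_check(aae_train, aae_test, tolerance=0.20):
    """True if the test-set AAE is within `tolerance` of the training-set AAE."""
    return abs(aae_test - aae_train) <= tolerance * aae_train


print(passes_test_check(1.00, 1.15))  # True: gap of 0.15 is within 20% of 1.00
print(passes_test_check(1.00, 1.30))  # False: gap of 0.30 exceeds 20% of 1.00
```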

Once the multiple linear regression models of the standard formation energy have been established for the three sets of sample compounds (the group with interaction between a resonance structure and the radical center, the group without such interaction, and the non-hydrocarbon group), the artificial neural network model is built. The molecular descriptor data and the experimental standard formation energy data are first standardized, that is, the mean of the data is subtracted from each value and the result is divided by the standard deviation. The entire standardized sample is then divided into a training set, a validation set, and a test set at a ratio of approximately 6:2:2.
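The two preparation steps, standardization and the 6:2:2 split, can be sketched as follows. The function names `standardize` and `split_622` are illustrative. The use of the population standard deviation and of a plain random shuffle are assumptions, since the text specifies neither.

```python
import random
from statistics import mean, pstdev


def standardize(values):
    """Subtract the mean and divide by the standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def split_622(items, seed=0):
    """Shuffle and split into training, validation, and test parts (about 6:2:2)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(0.6 * len(shuffled))
    n_val = round(0.2 * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


z = standardize([1.0, 2.0, 3.0, 4.0])
train, val, test = split_622(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```

In practice the same mean and standard deviation would have to be stored so that predictions can be converted back to the original scale.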

These sets are then used to search for the best artificial neural network model. The search was restricted to networks with one hidden layer between the input layer and the output layer, as in FIG. 2, in which the three layers are connected only in the feed-forward direction, that is, from input to output. The input layer consists of as many nodes as there are molecular descriptors in the already established multiple linear regression model, each node receiving the value of one descriptor, and the output layer consists of a single node that outputs the ideal-gas formation energy. The sigmoid function

$$f(x) = \frac{1}{1 + e^{-x}}$$

was adopted as the activation function of the hidden layer, and the linear function

$$g(x) = x$$

as the activation function of the output layer. If the input values received by the nodes of the input layer are denoted

$$x_1, x_2, \ldots, x_n,$$

the output value of the j-th node of the hidden layer is

$$h_j = f\left(\sum_{i=1}^{n} w_{ij} x_i + w_{0j}\right),$$

and when the hidden layer consists of $m$ nodes, the final output value of the output node is

$$y = g\left(\sum_{j=1}^{m} v_j h_j + v_0\right) = \sum_{j=1}^{m} v_j h_j + v_0.$$

Here $w_{0j}$ and $v_0$ denote the threshold weights.
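The forward pass of the network described above (one sigmoid hidden layer, one linear output node, a threshold weight on every node) can be sketched in plain Python. All names here (`sigmoid`, `forward`, `w_hidden`, `b_hidden`, `w_out`, `b_out`) and the toy weights are assumptions chosen for the illustration.

```python
import math


def sigmoid(x):
    """Hidden-layer activation."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Output of a one-hidden-layer feed-forward network with a linear output node.

    w_hidden[j] holds the input weights of hidden node j, and b_hidden[j] is its
    threshold weight; w_out and b_out play the same roles for the output node.
    """
    hidden = [sigmoid(sum(w * xi for w, xi in zip(w_row, x)) + b)
              for w_row, b in zip(w_hidden, b_hidden)]
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out


# Two inputs, two hidden nodes, zero hidden weights: each hidden output is 0.5.
y = forward([1.0, 2.0], [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0], [2.0, 4.0], 1.0)
print(y)  # 4.0
```

With standardized inputs and targets, the value returned here would still need to be converted back to the original energy scale.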

The search starts with one hidden node and proceeds by increasing the number of hidden nodes one at a time, usually until it reaches twice the number of input nodes; if no satisfactory model has been found by then, the search continues further. The detailed procedure is as follows. For each number of hidden nodes, various sets of initial weight values (usually no more than 1000 sets) are generated with a random number generator, and the network initialized with each set is trained repeatedly on the training set with the backpropagation algorithm to find the optimized values of the weights. The weights are considered optimized at the point where the model defined by the weights updated after each training pass gives the minimum mean square error when applied to the validation set. This point is usually reached within 3000 to 5000 training iterations. The optimized neural network model obtained for each set of initial values is then applied to the training set, the validation set, and the test set, and only those models whose mean square errors on all three sets are smaller than those of the multiple linear regression model are retained. If several such models exist, the best one is selected on the basis of the coefficient of determination, the average absolute error, or a similar measure.
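The final filtering and selection step can be sketched as below. The record layout, the function name `select_network`, the use of AAE as the tie-breaking measure, and all numbers are hypothetical; the training itself is not shown.

```python
def select_network(candidates, mlr_mse):
    """Keep networks that beat the regression model on all three sets; pick lowest AAE."""
    better = [c for c in candidates
              if all(c["mse"][s] < mlr_mse[s] for s in ("train", "val", "test"))]
    if not better:
        return None  # no network outperforms the regression model everywhere
    return min(better, key=lambda c: c["aae"])


mlr_mse = {"train": 4.0, "val": 4.5, "test": 4.4}  # made-up regression errors
candidates = [
    {"hidden": 3, "mse": {"train": 3.0, "val": 3.5, "test": 4.6}, "aae": 1.0},
    {"hidden": 5, "mse": {"train": 2.5, "val": 3.0, "test": 3.2}, "aae": 1.1},
    {"hidden": 8, "mse": {"train": 2.0, "val": 3.1, "test": 3.0}, "aae": 0.9},
]
best = select_network(candidates, mlr_mse)
print(best["hidden"])  # 8
```

The first candidate is rejected even though its AAE is low, because its test-set error is worse than that of the regression model.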

Once the artificial neural network model has been selected, an overfitting prevention criterion is set as the final step. This is a safeguard against the instability in which an over-trained neural network gives a wildly wrong answer for an unseen input. A reference value is fixed; if the absolute value of the difference between the predictions of the artificial neural network model and the multiple linear regression model exceeds this reference value, the prediction of the multiple linear regression model is adopted, and if it is smaller, the prediction of the artificial neural network model is adopted.
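This fallback rule is what makes the model a hybrid, and it reduces to a single comparison. In the sketch below the function name `hybrid_predict` and the example predictions are hypothetical; the text does not say which value is used when the difference equals the reference value exactly, so returning the neural network value in that case is an assumption.

```python
def hybrid_predict(mlr_value, ann_value, threshold):
    """Return the network prediction unless it strays too far from the regression one."""
    if abs(ann_value - mlr_value) > threshold:
        return mlr_value  # network output is suspect, fall back to regression
    return ann_value


print(hybrid_predict(-25.0, -24.2, threshold=20))  # -24.2
print(hybrid_predict(-25.0, 30.0, threshold=20))   # -25.0
```

The regression model thus acts as a sanity bound on the network: the network can refine a prediction but cannot move it by more than the reference value.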

The multiple linear regression-artificial neural network hybrid models for the ideal-gas formation energy established through this procedure are summarized in Table 2 (main details of the QSPR prediction model for the ideal-gas formation energy of hydrocarbons), Table 3 (main details of the QSPR prediction model for the ideal-gas formation energy of non-hydrocarbons), Table 4 (main details of the QSPR prediction model for the ideal-gas formation energy of molecules without interaction between a resonance structure and the radical center), and Table 5 (main details of the QSPR prediction model for the ideal-gas formation energy of molecules with interaction between a resonance structure and the radical center).

Table 2. QSPR prediction model for the ideal-gas formation energy of hydrocarbons
Number of sample compounds: 663
Number of molecular descriptors: 6
Names of molecular descriptors:
P₁: Enthalpy of Formation for Ideal Gas at 298K from quantum mechanics
P₂: Atomic Orbital Molecular Graph pattern -H1
P₃: Atomic Orbital Molecular Graph pattern -C2
P₄: Atomic Orbital Molecular Graph pattern -C3
P₅: Atomic Orbital Molecular Graph pattern -C4
P₆: Atomic Orbital Molecular Graph pattern -C5
Coefficient of determination: 0.998878
Standard error: 1.250625 kcal/mol
Mean absolute error: 0.758519 kcal/mol
Model, ideal-gas formation energy: (equation not reproduced in this text)

Table 3. QSPR prediction model for the ideal-gas formation energy of non-hydrocarbons
Number of sample compounds: 854
Number of molecular descriptors: 12
Names of molecular descriptors:
P₁: Enthalpy of Formation for Ideal Gas at 298K from quantum mechanics
P₂: Atomic Orbital Molecular Graph pattern -N1
P₃: Atomic Orbital Molecular Graph pattern -N2
P₄: Atomic Orbital Molecular Graph pattern -N3
P₅: Atomic Orbital Molecular Graph pattern -N4
P₆: Atomic Orbital Molecular Graph pattern -O1
P₇: Atomic Orbital Molecular Graph pattern -O2
P₈: Atomic Orbital Molecular Graph pattern -O3
P₉: Atomic Orbital Molecular Graph pattern -S1
P₁₀: Atomic Orbital Molecular Graph pattern -S2
P₁₁: Atomic Orbital Molecular Graph pattern -S3
P₁₂: Atomic Orbital Molecular Graph pattern -S4
Regression model coefficient of determination: 0.997871
Regression model mean absolute error: 1.85253
Regression model, ideal-gas formation energy: (equation not reproduced in this text)
Artificial neural network coefficient of determination: 0.997871
Artificial neural network mean absolute error: 0.827643
Artificial neural network model, ideal-gas formation energy: (equation not reproduced in this text)
Overfitting prevention criterion: 20

Table 4. QSPR prediction model for the ideal-gas formation energy of molecules without interaction between a resonance structure and the radical center
Number of sample compounds: 358
Number of molecular descriptors: 15
Names of molecular descriptors:
P₁: Enthalpy of Formation for Ideal Gas at 298K from quantum mechanics
P₂: Atomic Orbital Molecular Graph pattern -H1
P₃: Atomic Orbital Molecular Graph pattern -C2
P₄: Atomic Orbital Molecular Graph pattern -C3
P₅: Atomic Orbital Molecular Graph pattern -C4
P₆: Atomic Orbital Molecular Graph pattern -C5
P₇: Atomic Orbital Molecular Graph pattern -N1
P₈: Atomic Orbital Molecular Graph pattern -N2
P₉: Atomic Orbital Molecular Graph pattern -N3
P₁₀: Atomic Orbital Molecular Graph pattern -O1
P₁₁: Atomic Orbital Molecular Graph pattern -O2
P₁₂: Atomic Orbital Molecular Graph pattern -S1
P₁₃: Atomic Orbital Molecular Graph pattern -S2
P₁₄: Atomic Orbital Molecular Graph pattern -S3
P₁₅: Atomic Orbital Molecular Graph pattern -S4
Regression model coefficient of determination: 0.991293
Regression model mean absolute error: 1.730095
Regression model, ideal-gas formation energy: (equation not reproduced in this text)
Artificial neural network coefficient of determination: 0.995293
Artificial neural network mean absolute error: 1.016031
Artificial neural network model, ideal-gas formation energy: (equation not reproduced in this text)
Overfitting prevention criterion: 20

Table 5. QSPR prediction model for the ideal-gas formation energy of molecules with interaction between a resonance structure and the radical center
Number of sample compounds: 524
Number of molecular descriptors: 16
Names of molecular descriptors:
P₁: Enthalpy of Formation for Ideal Gas at 298K from quantum mechanics
P₂: Atomic Orbital Molecular Graph pattern -C2
P₃: Atomic Orbital Molecular Graph pattern -C3
P₄: Atomic Orbital Molecular Graph pattern -C4
P₅: Atomic Orbital Molecular Graph pattern -C5
P₆: Atomic Orbital Molecular Graph pattern -N2
P₇: Atomic Orbital Molecular Graph pattern -N3
P₈: Atomic Orbital Molecular Graph pattern -O1
P₉: Atomic Orbital Molecular Graph pattern -O2
P₁₀: Atomic Orbital Molecular Graph pattern -S1
P₁₁: Atomic Orbital Molecular Graph pattern -S2
P₁₂: Atomic Orbital Molecular Graph pattern -S3
P₁₃: Atomic Orbital Molecular Graph pattern -S4
P₁₄: Balaban index
P₁₅: Area-weighted surface charge fraction of hydrogen-bonding-dependent hydrogen-bond donor atoms (HA dependent HDCA-2/TMSA)
P₁₆: Kier flexibility index
Regression model coefficient of determination: 0.993733
Mean absolute error: 1.350155
Regression model, ideal-gas formation energy: (equation not reproduced in this text)

To show that the present invention is superior to existing techniques, the predictive performance of the multiple linear regression-artificial neural network hybrid model established in this way was compared with that of two widely used group contribution models, the Joback model [Joback K. G., Reid R. C., Estimation of pure-component properties from group-contributions, Chem. Eng. Comm., 57: 233 (1987).] and the Gani model [Constantinou, L., Gani R., New Group Contribution Method for Estimating Properties of Pure Compounds, AIChE., 40: 1697 (1994).], using the data of 1535 compounds with known experimental values. The Joback model produced predictions for only 1407 of the compounds, with a coefficient of determination of 0.985271 and an average absolute error of 8.022368. The Gani model produced predictions for only 1318 of the compounds, with a coefficient of determination of 0.991044 and an average absolute error of 2.625765. In contrast, the multiple linear regression-artificial neural network hybrid model produced predictions for all 1536 compounds, with a coefficient of determination of 0.996184 and an average absolute error of 1.892156, and thus outperformed the other two models. FIGS. 5, 6, and 7 are parity plots showing the predictive performance of each model, and they confirm visually that the hybrid model performs better than the other two. Among the experimental data for the 1536 compounds, the mean error of those with a known experimental error is about 3, and FIGS. 8, 9, and 10 are histograms of the error between the experimental and predicted values drawn with respect to this value. These figures show that the ideal-gas formation energy is predicted within the mean experimental error with a probability of 66.09% by the Joback model, 76.66% by the Gani model, and 86.25% by the multiple linear regression-artificial neural network hybrid model, which demonstrates that the hybrid model is more accurate than the other two.
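The three figures of merit used in this comparison (coefficient of determination, average absolute error, and the share of predictions falling within the experimental error) can be computed as sketched below. The function names and the toy data are hypothetical, and the patent does not state which definition of the coefficient of determination was used, so the common form 1 - SS_res/SS_tot is assumed.

```python
def aae(y_true, y_pred):
    """Average absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination, assumed here to be 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fraction_within(y_true, y_pred, tol):
    """Share of predictions whose absolute error does not exceed `tol`."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) <= tol)
    return hits / len(y_true)


experimental = [0.0, 10.0, 20.0, 30.0]  # made-up values
predicted = [1.0, 10.0, 18.0, 36.0]
print(aae(experimental, predicted))                 # 2.25
print(round(r_squared(experimental, predicted), 3)) # 0.918
print(fraction_within(experimental, predicted, 3))  # 0.75
```

A model that returns no prediction for some compounds, as reported for the two group contribution models, would have to be scored only on the compounds it covers, which should be kept in mind when comparing the numbers.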

The present invention is not limited to the embodiments described above. Any person of ordinary skill in the art to which the invention pertains may make various modifications without departing from the gist of the invention as claimed, and such modifications fall within the scope of the claims.

Claims

A method for obtaining the ideal-gas formation energy of a hydrocarbon organic compound through a multiple linear regression model, the method comprising:
a first step of inputting the experimental data for hydrocarbons from among the experimental data of collected sample organic compounds;
a second step of preparing molecular descriptor values for the ideal-gas formation energy of the hydrocarbon compounds from the experimental data input in the first step;
a third step of extracting optimal molecular descriptors for each experimental data set input in the first step;
a fourth step of separating the experimental data input in the first step into a training set and a test set;
a fifth step of searching for an optimal multiple linear regression model for the training set;
a sixth step of examining the validity of the selected model;
a seventh step of repeating the fifth and sixth steps if the model is found invalid in the sixth step, and testing the predictive performance of the model on the test set if it is valid;
an eighth step of repeating the fourth to seventh steps if the performance does not satisfy the criterion in the test of the seventh step, and determining the optimal multiple linear regression model if the criterion is satisfied; and
a ninth step of obtaining the ideal-gas formation energy value for the experimental data set of the collected sample organic compounds using the optimal multiple linear regression model that passed the performance test in the eighth step.

The method of claim 1, wherein the optimal molecular descriptors in the third step are independent molecular descriptors whose values are not identical for all sample compounds.

The method of claim 1, wherein in the fourth step the training set and the test set are divided at a ratio of 5:5 to 8:2.

The method of claim 1, wherein in the fifth step the multiple linear regression model is searched for by applying a genetic algorithm to the training set.

The method of claim 4, wherein the genetic algorithm comprises: creating a population consisting of a plurality of multiple linear regression models, each built from a fixed number of molecular descriptors drawn at random from a pool of molecular descriptors; encoding each individual by combining the index numbers of the selected molecular descriptors; selecting two parent chromosomes from the created population by the Roulette Wheel method and producing children by single-point crossover; and mutating part of the chromosomes of the children with a certain probability and then replacing part of the existing population with them to create a new population.

The method of claim 1, wherein in the fifth step the predictive performance is determined from the coefficient of determination or the average absolute error of the regression model.

The method of claim 1, wherein the validity in the sixth step is determined from the t-test value.

The method of claim 1, wherein in the eighth step the multiple linear regression model is determined if the predictive performance on the test set is similar to that on the training set, and the training set and the test set are re-divided if the predictive performance on the test set differs from that on the training set.

A method for obtaining the ideal-gas formation energy of a non-hydrocarbon organic compound through a multiple linear regression-artificial neural network model, the method comprising:
a first step of inputting the experimental data for non-hydrocarbons from among the collected sample organic compounds;
a second step of preparing molecular descriptor values for the ideal-gas formation energy of the non-hydrocarbon organic compounds among the sample organic compounds;
a third step of extracting optimal molecular descriptors;
a fourth step of separating the experimental data into a training set and a test set;
a fifth step of searching for an optimal multiple linear regression model for the training set;
a sixth step of examining the validity of the selected model;
a seventh step of repeating the fifth and sixth steps if the model is found invalid in the sixth step, and testing the predictive performance of the model on the test set if it is valid;
an eighth step of repeating the fourth to seventh steps if the performance does not satisfy the criterion in the test of the seventh step, and standardizing the samples and dividing them into three sets if the criterion is satisfied;
a ninth step of searching for an optimal artificial neural network model after the entire sample has been divided into the three sets;
a tenth step of comparing, with a preset overfitting prevention reference value, the absolute value of the difference between the ideal-gas formation energy predicted by the optimal multiple linear regression model that passed the performance test in the eighth step and the ideal-gas formation energy predicted by the optimal artificial neural network model found in the ninth step; and
an eleventh step of adopting, as the ideal-gas formation energy value, the prediction of the multiple linear regression model obtained in the eighth step if the difference is larger than the overfitting prevention reference value, and the prediction of the artificial neural network model if the difference is smaller.

10. The method of claim 9, wherein the optimal molecular descriptors extracted in the third step are independent molecular descriptors whose values are not identical across all the sample compounds.
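
A minimal sketch of the "not identical across all the sample compounds" condition follows. It only removes descriptors that are constant over the sample; any further independence check (for example against collinearity) is not shown. The table layout and the function name are assumptions made for the example.

```python
def drop_constant_descriptors(table: dict[str, list[float]]) -> dict[str, list[float]]:
    """Keep only descriptors whose value varies across the sample compounds.

    `table` maps a descriptor name to its values, one per compound.
    A descriptor with the same value for every compound carries no
    information for regression and is removed.
    """
    return {name: values for name, values in table.items() if len(set(values)) > 1}


# Hypothetical descriptor table for three compounds.
table = {"P1": [1.0, 2.0, 3.0], "P2": [0.0, 0.0, 0.0]}
print(list(drop_constant_descriptors(table)))  # ['P1']
```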

11. The method of claim 9, wherein in the fourth step the experimental data are divided into the training set and the test set at a ratio of 5:5 to 8:2.
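
The split can be sketched as follows. The claim only fixes the ratio range; shuffling with a seeded random generator, the function name and the default fraction are assumptions of this example.

```python
import random


def split_train_test(samples, train_fraction=0.8, seed=0):
    """Divide samples into a training set and a test set.

    `train_fraction` must lie between 0.5 and 0.8, which corresponds
    to training:test ratios from 5:5 to 8:2.
    """
    if not 0.5 <= train_fraction <= 0.8:
        raise ValueError("train_fraction must be between 0.5 and 0.8")
    rng = random.Random(seed)       # seeded so the split is reproducible
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train, test = split_train_test(range(10), train_fraction=0.7)
print(len(train), len(test))  # 7 3
```

Changing `seed` gives a different partition, which is one way to realize the re-division that the eighth step calls for when the test performance is unsatisfactory.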

12. The method of claim 9, wherein in the fifth step the optimal multiple linear regression model is searched for by applying a genetic algorithm to the training set.

13. The method of claim 12, wherein the genetic algorithm comprises: generating a population consisting of a plurality of multiple linear regression models whose molecular descriptors are drawn at random from a pool of molecular descriptors; encoding each individual by combining the numbers of the drawn molecular descriptors; selecting two parent chromosomes from the population by the roulette wheel method and generating offspring by single-point crossover; and mutating part of the chromosomes of the offspring with a certain probability and then replacing part of the existing population with the offspring to create a new population.
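
The genetic operators named above can be sketched as follows. This is an illustration under several assumptions that the claim does not state: a chromosome is a fixed-length list of descriptor numbers, fitness values are non-negative (for example a coefficient of determination), the least-fit individuals are the ones replaced, and duplicate descriptor numbers inside one chromosome are not filtered out. All function names are hypothetical.

```python
import random


def roulette_select(population, fitness, rng):
    """Pick one individual with probability proportional to its fitness."""
    pick = rng.uniform(0, sum(fitness))
    running = 0.0
    for individual, score in zip(population, fitness):
        running += score
        if pick <= running:
            return individual
    return population[-1]  # guard against floating-point round-off


def single_point_crossover(a, b, rng):
    """Swap the tails of two equal-length chromosomes at one random point."""
    point = rng.randrange(1, len(a))  # requires chromosomes of length >= 2
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(chromosome, pool_size, rate, rng):
    """Replace each gene by a random descriptor number with probability `rate`."""
    return [rng.randrange(pool_size) if rng.random() < rate else gene
            for gene in chromosome]


def next_generation(population, fitness, pool_size, n_children, rate, rng):
    """Create offspring and let them replace the least-fit individuals."""
    children = []
    while len(children) < n_children:
        parent1 = roulette_select(population, fitness, rng)
        parent2 = roulette_select(population, fitness, rng)
        for child in single_point_crossover(parent1, parent2, rng):
            children.append(mutate(child, pool_size, rate, rng))
    children = children[:n_children]
    ranked = sorted(zip(fitness, population), key=lambda pair: pair[0], reverse=True)
    survivors = [individual for _, individual in ranked[:len(population) - n_children]]
    return survivors + children


rng = random.Random(1)
population = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [1, 3, 5]]  # descriptor numbers
fitness = [0.9, 0.5, 0.7, 0.2]                              # hypothetical scores
print(next_generation(population, fitness, pool_size=10, n_children=2, rate=0.1, rng=rng))
```

In a full search, each chromosome would be decoded into a regression model, fitted on the training set and scored, and `next_generation` would be called repeatedly until the score stops improving.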

14. The method of claim 9, wherein the fifth step includes evaluating predictive performance by the coefficient of determination or the mean absolute error of the regression model.
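
The two performance measures can be computed as in the sketch below, which uses their standard textbook definitions. The function names and the sample values are assumptions of the example.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def mean_absolute_error(observed, predicted):
    """Average of the absolute prediction errors."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


observed = [1.0, 2.0, 3.0]      # hypothetical experimental values
predicted = [1.0, 2.0, 3.0]     # a perfect model, for illustration
print(r_squared(observed, predicted), mean_absolute_error(observed, predicted))  # 1.0 0.0
```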

15. The method of claim 9, wherein the validity in the sixth step is determined by a t-test value.
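
The claim does not say which t-test is meant. One common reading for a regression model is the t-statistic of each fitted coefficient, shown below in its standard form. The symbols are assumptions of this note: $\hat{\beta}_j$ is the fitted coefficient of descriptor $j$, $X$ is the design matrix including the intercept column, $n$ is the number of training compounds and $p$ is the number of descriptors.

```latex
t_j = \frac{\hat{\beta}_j}{\operatorname{SE}(\hat{\beta}_j)}, \qquad
\operatorname{SE}(\hat{\beta}_j) = \sqrt{\hat{\sigma}^2 \left[ (X^{\top} X)^{-1} \right]_{jj}}, \qquad
\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - p - 1}
```

Under this reading, a descriptor whose $|t_j|$ falls below the critical value for $n - p - 1$ degrees of freedom would make the model fail the validity check.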

16. The method of claim 9, wherein in the eighth step the multiple linear regression model is finalized if the predictive performance on the test set is similar to that on the training set, and the data are divided into a training set and a test set again if the predictive performance on the test set differs from that on the training set.

17. The method of claim 9, wherein in the ninth step the search range of the artificial neural network is limited to networks that have one hidden layer between the input layer and the output layer and are connected only in the feed-forward direction.

18. The method of claim 17, wherein a sigmoid function is used as the activation function of the hidden layer.
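
A minimal sketch of the network shape recited above: one hidden layer, feed-forward connections only, and a sigmoid activation in the hidden layer. The linear output unit, the weight layout and all names are assumptions of this example; training is not shown.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid, used as the hidden-layer activation."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass: inputs -> sigmoid hidden layer -> single linear output.

    `hidden_weights` holds one row of input weights per hidden node.
    """
    hidden = [sigmoid(sum(w * x for w, x in zip(row, inputs)) + bias)
              for row, bias in zip(hidden_weights, hidden_biases)]
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias


# Two normalized descriptor inputs and two hidden nodes; weights are hypothetical.
value = forward([1.0, 2.0],
                hidden_weights=[[0.0, 0.0], [0.0, 0.0]],
                hidden_biases=[0.0, 0.0],
                output_weights=[2.0, 4.0],
                output_bias=1.0)
print(value)  # 4.0, since both hidden nodes output sigmoid(0) = 0.5
```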

19. The method of claim 9, wherein in the tenth step the overfitting-prevention reference value is 20.

20. The method of claim 1, wherein the optimal molecular descriptors extracted in the third step comprise:
P₁: Energy of Formation for Ideal Gas at 298K from quantum mechanics
P₂: Atomic Orbital Molecular Graph Pattern-H1
P₃: Atomic Orbital Molecular Graph Pattern-C2
P₄: Atomic Orbital Molecular Graph Pattern-C3
P₅: Atomic Orbital Molecular Graph Pattern-C4
P₆: Atomic Orbital Molecular Graph Pattern-C5

21. The method of claim 9, wherein the optimal molecular descriptors extracted in the third step comprise:
P₁: Energy of Formation for Ideal Gas at 298K from quantum mechanics
P₂: Atomic Orbital Molecular Graph Pattern-N1
P₃: Atomic Orbital Molecular Graph Pattern-N2
P₄: Atomic Orbital Molecular Graph Pattern-N3
P₅: Atomic Orbital Molecular Graph Pattern-N4
P₆: Atomic Orbital Molecular Graph Pattern-O1
P₇: Atomic Orbital Molecular Graph Pattern-O2
P₈: Atomic Orbital Molecular Graph Pattern-O3
P₉: Atomic Orbital Molecular Graph Pattern-S1
P₁₀: Atomic Orbital Molecular Graph Pattern-S2
P₁₁: Atomic Orbital Molecular Graph Pattern-S3
P₁₂: Atomic Orbital Molecular Graph Pattern-S4

22. A method of obtaining the energy of formation of the ideal gas of a hydrocarbon-based organic compound through a multiple linear regression model, wherein the multiple linear regression model comprises the following molecular descriptors:
P₁: Energy of Formation for Ideal Gas at 298K from quantum mechanics
P₂: Atomic Orbital Molecular Graph Pattern-H1
P₃: Atomic Orbital Molecular Graph Pattern-C2
P₄: Atomic Orbital Molecular Graph Pattern-C3
P₅: Atomic Orbital Molecular Graph Pattern-C4
P₆: Atomic Orbital Molecular Graph Pattern-C5
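
For orientation, a multiple linear regression model built on these six descriptors has the general form below. The fitted coefficient values are not given in this section, so the symbols $a_0, \dots, a_6$ (and the presence of an intercept $a_0$) are assumptions of this note; $\hat{E}$ denotes the predicted energy of formation of the ideal gas.

```latex
\hat{E} = a_0 + \sum_{i=1}^{6} a_i P_i
```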

23. A method of obtaining the energy of formation of the ideal gas of a non-hydrocarbon-based organic compound through a multiple linear regression-artificial neural network model, wherein the model comprises the following molecular descriptors:
P₁: Energy of Formation for Ideal Gas at 298K from quantum mechanics
P₂: Atomic Orbital Molecular Graph Pattern-N1
P₃: Atomic Orbital Molecular Graph Pattern-N2
P₄: Atomic Orbital Molecular Graph Pattern-N3
P₅: Atomic Orbital Molecular Graph Pattern-N4
P₆: Atomic Orbital Molecular Graph Pattern-O1
P₇: Atomic Orbital Molecular Graph Pattern-O2
P₈: Atomic Orbital Molecular Graph Pattern-O3
P₉: Atomic Orbital Molecular Graph Pattern-S1
P₁₀: Atomic Orbital Molecular Graph Pattern-S2
P₁₁: Atomic Orbital Molecular Graph Pattern-S3
P₁₂: Atomic Orbital Molecular Graph Pattern-S4

24. A computer-readable storage medium on which is recorded a computer program for executing, on a computer, the method of obtaining the energy of formation of the ideal gas of a hydrocarbon-based organic compound according to any one of claims 1 to 8, 20, and 22.

25. A computer-readable storage medium on which is recorded a computer program for executing, on a computer, the method of obtaining the energy of formation of the ideal gas of a non-hydrocarbon-based organic compound according to any one of claims 9 to 19, 21, and 23.

A first step of inputting experimental data of radical molecules having an interaction between a resonance structure and a radical center among experimental data of sample organic compounds collected;
A second step of preparing, from the experimental data input in the first step, molecular descriptor values for the energy of formation of the ideal gas of the radical molecules having an interaction between the resonance structure and the radical center;
A third step of extracting optimal molecular descriptors for each experimental data set input in the first step;
A fourth step of separating the experimental data input in the first step into a training set and a test set;
A fifth step of searching for an optimal multiple linear regression model for the training set;
A sixth step of examining the validity of the selected model;
A seventh step of repeating the fifth and sixth steps if there is no validity in the sixth step, and testing the predictive performance of the model with respect to the test set if it is valid;
An eighth step of repeating the fourth to seventh steps if the performance on the test set does not satisfy the criterion in the seventh step, and finalizing the optimal multiple linear regression model if the criterion is satisfied; and
A ninth step of obtaining the energy of formation of the ideal gas for the experimental data set of the collected sample organic compounds by means of the optimal multiple linear regression model that satisfied the performance test in the eighth step; the above steps constituting a method of obtaining the energy of formation of the ideal gas of a radical molecule having an interaction between a resonance structure and a radical center through a multiple linear regression model.
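
The control flow of the fourth to eighth steps can be sketched as two nested loops. Every callable passed in (`split`, `fit`, `is_valid`, `score`, `acceptable`) is a hypothetical placeholder for the corresponding step, and the iteration limits are additions of this example, since the claim does not bound the number of repetitions.

```python
def search_model(samples, split, fit, is_valid, score, acceptable,
                 max_splits=50, max_fits=50):
    """Repeat splitting and model search until a model passes both checks."""
    for _ in range(max_splits):                  # eighth step loops back to the fourth
        train, test = split(samples)             # fourth step: training/test split
        for _ in range(max_fits):                # seventh step loops back to the fifth
            model = fit(train)                   # fifth step: search for an MLR model
            if is_valid(model):                  # sixth step: validity check
                break
        else:
            continue                             # no valid model found: re-split (assumption)
        # seventh and eighth steps: compare test performance with the criterion
        if acceptable(score(model, train), score(model, test)):
            return model
    raise RuntimeError("no acceptable model found within the iteration limits")


# Stub callables, only to show the loop running end to end.
calls = {"fit": 0}

def stub_fit(train):
    calls["fit"] += 1
    return calls["fit"]          # the "model" is just a counter here

model = search_model(
    [1, 2, 3, 4],
    split=lambda s: (s[:3], s[3:]),
    fit=stub_fit,
    is_valid=lambda m: m >= 2,   # the first candidate is rejected
    score=lambda m, data: 1.0,
    acceptable=lambda train_score, test_score: abs(train_score - test_score) < 0.1,
)
print(model)  # 2
```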

27. The method of claim 26, wherein in the third step the optimal molecular descriptors are independent molecular descriptors whose values are not identical across all the sample compounds.

28. The method of claim 26, wherein in the fourth step the experimental data are divided into the training set and the test set at a ratio of 5:5 to 8:2.

29. The method of claim 26, wherein in the fifth step the optimal multiple linear regression model is searched for by applying a genetic algorithm to the training set.

30. The method of claim 29, wherein the genetic algorithm comprises: generating a population consisting of a plurality of multiple linear regression models whose molecular descriptors are drawn at random from a pool of molecular descriptors; encoding each individual by combining the numbers of the drawn molecular descriptors; selecting two parent chromosomes from the population by the roulette wheel method and generating offspring by single-point crossover; and mutating part of the chromosomes of the offspring with a certain probability and then replacing part of the existing population with the offspring to create a new population.

31. The method of claim 26, wherein the fifth step includes evaluating predictive performance by the coefficient of determination or the mean absolute error of the regression model.

32. The method of claim 26, wherein the validity in the sixth step is determined by a t-test value.

33. The method of claim 26, wherein in the eighth step the multiple linear regression model is finalized if the predictive performance on the test set is similar to that on the training set, and the data are divided into a training set and a test set again if the predictive performance on the test set differs from that on the training set.

A first step of inputting experimental data of radical molecules having no interaction between a resonance structure and a radical center among the collected sample organic compounds;
A second step of preparing molecular descriptor values for the energy of formation of the ideal gas of the radical molecules, among the sample organic compounds, having no interaction between the resonance structure and the radical center;
A third step of extracting optimal molecular descriptors;
A fourth step of separating the experimental data into a training set and a test set;
A fifth step of searching for an optimal multiple linear regression model for the training set;
A sixth step of examining the validity of the selected model;
A seventh step of repeating the fifth and sixth steps if there is no validity in the sixth step, and testing the predictive performance of the model with respect to the test set if it is valid;
An eighth step of repeating the fourth to seventh steps if the performance on the test set does not satisfy the criterion in the seventh step, and normalizing the samples and dividing them into three sets if the criterion is satisfied;
A ninth step of searching for an optimal neural network model after the entire sample has been divided into three sets;
A tenth step of comparing, with a preset overfitting-prevention reference value, the absolute value of the difference between the energy of formation of the ideal gas predicted by the optimal multiple linear regression model that satisfied the performance test in the eighth step and the energy of formation of the ideal gas predicted by the optimal neural network model found in the ninth step; and
An eleventh step of adopting the value predicted by the multiple linear regression model obtained in the eighth step as the energy of formation of the ideal gas if the difference is larger than the overfitting-prevention reference value, and otherwise adopting the value predicted by the artificial neural network model; the above steps constituting a method of obtaining the energy of formation of the ideal gas of a radical molecule having no interaction between a resonance structure and a radical center through a multiple linear regression-artificial neural network model.

35. The method of claim 34, wherein in the third step the optimal molecular descriptors are independent molecular descriptors whose values are not identical across all the sample compounds.

36. The method of claim 34, wherein the fourth step divides the experimental data into the training set and the test set.

37. The method of claim 34, wherein in the fifth step the optimal multiple linear regression model is searched for by applying a genetic algorithm to the training set.

38. The method of claim 37, wherein the genetic algorithm comprises: generating a population consisting of a plurality of multiple linear regression models whose molecular descriptors are drawn at random from a pool of molecular descriptors; encoding each individual by combining the numbers of the drawn molecular descriptors; selecting two parent chromosomes from the population by the roulette wheel method and generating offspring by single-point crossover; and mutating part of the chromosomes of the offspring and then replacing part of the existing population with the offspring to create a new population.

39. The method of claim 34, wherein the fifth step includes evaluating predictive performance by the coefficient of determination or the mean absolute error of the regression model.

40. The method of claim 34, wherein the validity in the sixth step is determined by a t-test value.

41. The method of claim 34, wherein in the eighth step the multiple linear regression model is finalized if the predictive performance on the test set is similar to that on the training set, and the data are divided into a training set and a test set again if the predictive performance on the test set differs from that on the training set.

42. The method of claim 34, wherein in the ninth step the search range of the artificial neural network is limited to networks that have one hidden layer between the input layer and the output layer and are connected only in the feed-forward direction.

43. The method of claim 42, wherein a sigmoid function is used as the activation function of the hidden layer.

44. The method of claim 34, wherein in the tenth step the overfitting-prevention reference value is 20.

45. The method of claim 26, wherein the optimal molecular descriptors extracted in the third step comprise:
P₁: Energy of Formation for Ideal Gas at 298K from quantum mechanics
P₂: Atomic Orbital Molecular Graph Pattern-C2
P₃: Atomic Orbital Molecular Graph Pattern-C3
P₄: Atomic Orbital Molecular Graph Pattern-C4
P₅: Atomic Orbital Molecular Graph Pattern-C5
P₆: Atomic Orbital Molecular Graph Pattern-N2
P₇: Atomic Orbital Molecular Graph Pattern-N3
P₈: Atomic Orbital Molecular Graph Pattern-O1
P₉: Atomic Orbital Molecular Graph Pattern-O2
P₁₀: Atomic Orbital Molecular Graph Pattern-S1
P₁₁: Atomic Orbital Molecular Graph Pattern-S2
P₁₂: Atomic Orbital Molecular Graph Pattern-S3
P₁₃: Atomic Orbital Molecular Graph Pattern-S4
P₁₄: Balaban Index
P₁₅: Area weighted surface charge fraction of hydrogen bond dependent hydrogen bond base atoms (HA dependent HDCA-2/TMSA)
P₁₆: Kier flexibility index

46. The method of claim 34, wherein the optimal molecular descriptors extracted in the third step comprise:
P₁: Energy of Formation for Ideal Gas at 298K from quantum mechanics
P₂: Atomic Orbital Molecular Graph Pattern-H1
P₃: Atomic Orbital Molecular Graph Pattern-C2
P₄: Atomic Orbital Molecular Graph Pattern-C3
P₅: Atomic Orbital Molecular Graph Pattern-C4
P₆: Atomic Orbital Molecular Graph Pattern-C5
P₇: Atomic Orbital Molecular Graph Pattern-N1
P₈: Atomic Orbital Molecular Graph Pattern-N2
P₉: Atomic Orbital Molecular Graph Pattern-N3
P₁₀: Atomic Orbital Molecular Graph Pattern-O1
P₁₁: Atomic Orbital Molecular Graph Pattern-O2
P₁₂: Atomic Orbital Molecular Graph Pattern-S1
P₁₃: Atomic Orbital Molecular Graph Pattern-S2
P₁₄: Atomic Orbital Molecular Graph Pattern-S3
P₁₅: Atomic Orbital Molecular Graph Pattern-S4

47. A method of obtaining the energy of formation of the ideal gas of a radical molecule having an interaction between a resonance structure and a radical center through a multiple linear regression model, wherein the multiple linear regression model comprises the following molecular descriptors:
P₁: Energy of Formation for Ideal Gas at 298K from quantum mechanics
P₂: Atomic Orbital Molecular Graph Pattern-C2
P₃: Atomic Orbital Molecular Graph Pattern-C3
P₄: Atomic Orbital Molecular Graph Pattern-C4
P₅: Atomic Orbital Molecular Graph Pattern-C5
P₆: Atomic Orbital Molecular Graph Pattern-N2
P₇: Atomic Orbital Molecular Graph Pattern-N3
P₈: Atomic Orbital Molecular Graph Pattern-O1
P₉: Atomic Orbital Molecular Graph Pattern-O2
P₁₀: Atomic Orbital Molecular Graph Pattern-S1
P₁₁: Atomic Orbital Molecular Graph Pattern-S2
P₁₂: Atomic Orbital Molecular Graph Pattern-S3
P₁₃: Atomic Orbital Molecular Graph Pattern-S4
P₁₄: Balaban Index
P₁₅: Area weighted surface charge fraction of hydrogen bond dependent hydrogen bond base atoms (HA dependent HDCA-2/TMSA)
P₁₆: Kier flexibility index

48. A method of obtaining the energy of formation of the ideal gas of a radical molecule having no interaction between a resonance structure and a radical center through a multiple linear regression-artificial neural network model, wherein the model comprises the following molecular descriptors:
P₁: Energy of Formation for Ideal Gas at 298K from quantum mechanics
P₂: Atomic Orbital Molecular Graph Pattern-H1
P₃: Atomic Orbital Molecular Graph Pattern-C2
P₄: Atomic Orbital Molecular Graph Pattern-C3
P₅: Atomic Orbital Molecular Graph Pattern-C4
P₆: Atomic Orbital Molecular Graph Pattern-C5
P₇: Atomic Orbital Molecular Graph Pattern-N1
P₈: Atomic Orbital Molecular Graph Pattern-N2
P₉: Atomic Orbital Molecular Graph Pattern-N3
P₁₀: Atomic Orbital Molecular Graph Pattern-O1
P₁₁: Atomic Orbital Molecular Graph Pattern-O2
P₁₂: Atomic Orbital Molecular Graph Pattern-S1
P₁₃: Atomic Orbital Molecular Graph Pattern-S2
P₁₄: Atomic Orbital Molecular Graph Pattern-S3
P₁₅: Atomic Orbital Molecular Graph Pattern-S4

49. A computer-readable storage medium on which is recorded a computer program for executing, on a computer, the method of obtaining the energy of formation of the ideal gas of a radical molecule having an interaction between a resonance structure and a radical center according to any one of claims 26 to 33, 45, and 47.

50. A computer-readable storage medium on which is recorded a computer program for executing, on a computer, the method of obtaining the energy of formation of the ideal gas of a radical molecule having no interaction between a resonance structure and a radical center according to any one of claims 34 to 44, 46, and 48.