KR20190067717A

KR20190067717A - Apparatus and system for machine learning using coding technique

Info

Publication number: KR20190067717A
Application number: KR1020180155770A
Authority: KR
Inventors: 서창호; 이강욱; 김훈; 황경조
Original assignee: 한국과학기술원
Priority date: 2017-12-07
Filing date: 2018-12-06
Publication date: 2019-06-17
Also published as: KR102180617B1

Abstract

The present invention discloses a machine learning apparatus. The apparatus comprises: an encoded data receiving unit that receives, from a client, an encoded data set composed of feature values and labels; a gradient correction unit that calculates a gradient from the encoded data set and removes the bias from the gradient; a model generation unit that determines the model and parameters that minimize the loss by training with stochastic gradient descent based on the bias-removed gradient; and a model delivery unit that delivers the determined loss-minimizing model and parameters to the client. The invention can significantly improve the security and performance of conventional machine learning optimization techniques.

Description

APPARATUS AND SYSTEM FOR MACHINE LEARNING USING CODING TECHNIQUE

The present invention relates to a machine learning apparatus and system using a coding technique, and more particularly to a machine learning apparatus and system that operate in environments such as cloud computing and distributed computing.

Machine learning has recently advanced to the point where accurate prediction is possible in a variety of fields. The data required for machine learning can be broadly divided into data that requires privacy protection and general data; medical data and cloud-based personal data fall into the former category.

Stochastic gradient descent, one of the techniques used to optimize machine learning algorithms, stochastically approximates the gradient of the objective function to be optimized and repeatedly moves in the downhill direction until an extremum is reached. Stochastic gradient descent needs data to approximate the gradient, and a problem arises when that data, such as patients' personal information, is difficult to take outside the hospital. For example, when a hospital wishes to use a cloud service hosting a third party's learning model to predict readmission risk from its internal patient data, hospital regulations prohibit the patients' personal data from leaving the hospital.

When a medical institution holding patient information attempts to perform machine learning itself, it has difficulty analyzing the data despite storing a large amount of information, owing to factors such as a lack of computing equipment and an insufficient understanding of training algorithms for machine learning.

Efforts have recently been made to provide machine-learning-based services in third-party cloud environments such as those of Amazon, Microsoft, and Google. However, in applications where information security must be maintained, access to the data itself is difficult; for example, services involving hospital patient data are restricted by legal requirements and ethical concerns grounded in privacy protection. The usefulness of such techniques has therefore been very limited.

Meanwhile, it has recently been demonstrated that coding, a technique that converts information into codes to protect the information or to increase the efficiency of its transmission, can be applied not only to linear algorithms but also to nonlinear algorithms such as deep learning. To build an optimization technique on this coding technique that preserves information security, a systematic algorithm is needed that can accurately approximate the gradient of a function while keeping the information secure.

The present invention is intended to solve the above problems of the prior art, and its object is to improve security and performance by performing machine learning on randomly mixed data without sharing the original data.

To solve the above problem, the present invention discloses a machine learning apparatus comprising: an encoded data receiving unit that receives, from a client, an encoded data set composed of feature values and labels; a gradient correction unit that calculates a gradient from the encoded data set and removes the bias from the gradient; a model generation unit that determines the model and parameters that minimize the loss by training with stochastic gradient descent based on the bias-removed gradient; and a model delivery unit that delivers the determined loss-minimizing model and parameters to the client.

The encoded data set is generated by (a) extracting a data set composed of n feature values and n labels corresponding to the n feature values, and (b) encoding the data set composed of the n feature values and n labels to generate the encoded data set.

The encoding of the data set composed of the n feature values and n labels is performed according to Equation 1 below.

The removal of the bias from the gradient is performed according to Equation 2 below.

The determination of the loss-minimizing parameters by training with stochastic gradient descent based on the bias-removed gradient is performed according to Equation 3 below.

The present invention also discloses a machine learning system comprising: a client that extracts a data set composed of n feature values and n labels corresponding to the n feature values, and encodes that data set to generate an encoded data set; and a machine learning server comprising an encoded data receiving unit that receives, from the client, the encoded data set composed of feature values and labels, a gradient correction unit that calculates a gradient from the encoded data set and removes the bias from the gradient, a model generation unit that determines the model and parameters that minimize the loss by training with stochastic gradient descent based on the bias-removed gradient, and a model delivery unit that delivers the determined loss-minimizing model and parameters to the client.

The client trains the model delivered from the model delivery unit using the original data.

The effects of the machine learning apparatus and system using a coding technique according to embodiments of the present invention are as follows.

The present invention can significantly improve the security and performance of existing machine learning optimization techniques.

In addition, a client can achieve high machine learning performance without sharing original data containing sensitive customer information.

However, the effects achievable by the machine learning apparatus and system using a coding technique according to embodiments of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood from the description below by those of ordinary skill in the art to which the present invention pertains.

The accompanying drawings, which are included as part of the detailed description to aid understanding of the present invention, provide embodiments of the invention and, together with the detailed description, explain its technical idea.
FIG. 1 schematically illustrates how a service provider provides an artificial intelligence service based on machine learning.
FIG. 2 is a schematic diagram of a machine learning system 1000 according to the present invention.
FIG. 3 is a flowchart illustrating a process of deriving a machine learning model through the machine learning system 1000 according to the present invention.
FIG. 4 illustrates original data and encoded data according to an embodiment of the present invention.
FIG. 5 is a block diagram showing the configuration of a machine learning server 100 according to the present invention.
FIG. 6 schematically illustrates the process by which the gradient correction unit 120 corrects the gradient, according to an embodiment of the present invention.

The present invention may be modified in various ways and may have various embodiments; specific embodiments are described in detail below with reference to the accompanying drawings.

The following embodiments are provided to aid a comprehensive understanding of the methods, apparatuses, and/or systems described herein. They are merely examples, and the present invention is not limited to them.

In describing the embodiments, detailed descriptions of known art related to the invention are omitted where they would unnecessarily obscure the gist of the invention. The terms used below are defined in consideration of their functions in the present invention and may vary according to the intention or custom of users or operators; their definitions should therefore be based on the content of this specification as a whole. The terms used in the detailed description are intended only to describe embodiments of the invention and should not be construed as limiting. Unless clearly used otherwise, singular expressions include the plural. In this description, expressions such as "comprising" or "having" indicate certain features, numbers, steps, operations, elements, or parts or combinations thereof, and should not be construed as excluding the presence or possibility of one or more other features, numbers, steps, operations, elements, or parts or combinations thereof.

Terms such as "first" and "second" may be used to describe various components, but the components are not limited by these terms; the terms are used only to distinguish one component from another. Hereinafter, a machine learning apparatus and system using a coding technique according to an embodiment of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 schematically illustrates how a service provider provides an artificial intelligence service based on machine learning.

The present invention relates to a machine learning apparatus and system using a coding technique, and provides a machine learning model that does not require the original data to be shared, so that users can use the services they want.

Machine learning is a field of artificial intelligence research and refers to technologies and techniques that realize, on a computer, functions such as the human ability to learn. As used here, machine learning encompasses a variety of approaches, including clustering, deep learning, various artificial neural networks, representation learning, reinforcement learning, and Bayesian networks. A machine learning model is generally one that is trained using training data, and here refers to an already-trained model.

The machine learning server 100 is an entity that receives data from a plurality of clients 200, that is, data providers, and provides a given service. The machine learning server 100 may be understood as a machine learning apparatus owned by a service provider, and the service provider may be, for example, Google Cloud or Amazon Web Services.

The service provider can provide a service by analyzing the data collected from the clients 200 with a machine learning apparatus owned by the service provider (or a third party) and presenting an optimal model to each client.

Examples of such services include medical analysis services and learning analysis services. For example, a medical analysis service may analyze the data of many patients to predict the likelihood of readmission or the risk of developing a particular disease.

To receive such a medical analysis service, the user must send data containing personal information to the service provider, which may raise concerns about the user's privacy.

FIG. 2 is a schematic diagram of the machine learning system 1000 according to the present invention.

The machine learning system 1000 according to the present invention includes a client 200 and a machine learning server 100. The client 200 may be, for example, a hospital that holds patient data.

The client 200 extracts a data set composed of n feature values and n labels corresponding to the n feature values, and encodes this data set to generate an encoded data set.

The machine learning server 100 receives the encoded data set, composed of feature values and labels, from the client 200, calculates a gradient from the encoded data set, and performs a correction that removes the bias from the gradient. It then trains with stochastic gradient descent based on the bias-removed gradient to determine the model and parameters that minimize the loss, and delivers the loss-minimizing model and parameters to the client 200.

The client 200 can increase the accuracy of the model by training the model delivered from the machine learning server 100 again with the original data.

FIG. 3 is a flowchart illustrating a process of deriving a machine learning model through the machine learning system 1000 according to the present invention.

Referring to FIG. 3, the machine learning system 1000 according to the present invention derives a machine learning model through a step of generating an encoded data set (S100), a step of calculating a gradient from the encoded data set and then removing the bias (S200), a step of training with stochastic gradient descent to determine the optimal model and parameters (S300), and a step of training the model using the original data set (S400).

Step S100 is the step in which the client 200 generates the encoded data set. As mentioned above, the encoded data set is generated by extracting a data set composed of n feature values and n corresponding labels, and then encoding that data set.

The encoding of the data set composed of the n feature values and n labels may be performed according to Equation 1 below.

[Equation 1: not reproduced in this text]

The definitions accompanying Equation 1 identify, in order: the randomly drawn encoding coefficients and the expression that defines them; a Bernoulli distribution; the encoding width; a normal distribution; the feature values; the labels; an identity matrix; and the size of the matrix. The symbols are rendered as images in the original publication and do not survive in this text.

The encoding process randomly mixes the original data, thereby anonymizing the privacy-sensitive personal information it contains.

The present invention additionally uses Gaussian noise when encoding the original data set; two of the terms in Equation 1 correspond to this Gaussian noise. Gaussian noise is the output of a random process based on a Gaussian distribution, that is, noise that follows a normal distribution.
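The random mixing and Gaussian noise described above can be illustrated with a minimal sketch. Equation 1 is not reproduced in this text, so the code below is an assumption-based illustration and not the patented formula: each encoded row is taken to be a sum of original rows selected by Bernoulli-distributed coefficients, plus zero-mean Gaussian noise. The names `encode_dataset`, `width`, `p`, and `sigma` are hypothetical.

```python
import random

def encode_dataset(features, labels, width, p=0.5, sigma=0.1, seed=0):
    """Mix the rows of (features, labels) at random and add Gaussian noise.

    Assumption: each of the `width` encoded rows is a Bernoulli(p)-weighted
    sum of the original rows plus N(0, sigma^2) noise. This only illustrates
    "random mixing plus Gaussian noise"; it is not Equation 1 of the patent.
    """
    rng = random.Random(seed)
    n, d = len(features), len(features[0])
    coded_x, coded_y = [], []
    for _ in range(width):
        # One row of mixing coefficients: 1.0 with probability p, else 0.0.
        g = [1.0 if rng.random() < p else 0.0 for _ in range(n)]
        x_row = [sum(g[i] * features[i][j] for i in range(n)) + rng.gauss(0.0, sigma)
                 for j in range(d)]
        y_val = sum(g[i] * labels[i] for i in range(n)) + rng.gauss(0.0, sigma)
        coded_x.append(x_row)
        coded_y.append(y_val)
    return coded_x, coded_y

# Three original samples with two features each (toy values).
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
y = [1.0, 0.0, 1.0]
coded_X, coded_y = encode_dataset(X, y, width=4)
```

In this sketch only `coded_X` and `coded_y` would leave the client; the mixing coefficients and the noise draws stay local.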

FIG. 4 illustrates original data and encoded data according to an embodiment of the present invention. The left side shows the original data before encoding, and the right side shows the data obtained by encoding the original data.

Referring to FIG. 4, the encoded data on the right completely hides the content of the original data on the left. By adding Gaussian noise during encoding, the present invention makes it impossible to derive the original data from the encoded data, and so protects personal information effectively.

The client 200 encodes the original data and transmits it to the machine learning server 100. The subsequent steps S200 and S300 are performed by the machine learning server 100 and are described in detail with reference to FIG. 5.

FIG. 5 is a block diagram showing the configuration of the machine learning server 100 according to the present invention.

The machine learning server 100 according to the present invention includes an encoded data receiving unit 110 that receives, from the client, an encoded data set composed of feature values and labels; a gradient correction unit 120 that calculates a gradient from the encoded data set and removes the bias from the gradient; a model generation unit 130 that determines the model and parameters that minimize the loss by training with stochastic gradient descent based on the bias-removed gradient; and a model delivery unit 140 that delivers the determined loss-minimizing model and parameters to the client.

Step S200 is the step in which the encoded data receiving unit 110 of the machine learning server 100 receives the encoded data set from the client 200, the gradient is calculated, and the bias is removed from the gradient.

The removal of the bias from the gradient may be performed according to Equation 2 below.

[Equation 2: not reproduced in this text]

The definitions accompanying Equation 2 identify, in order: the gradient and the expression that defines it; the loss function; the encoded features; the encoded labels; the parameters of a given layer of a deep linear neural network; and the bias. Two further relations complete the definitions. The symbols are rendered as images in the original publication and do not survive in this text.

Because the present invention generates the model from randomly mixed data, that is, the encoded data set, rather than from the original data set, its accuracy is inevitably lower than that of a model built from the original data. To compensate for this difference, the gradient correction unit 120 removes the bias from the gradient calculated from the encoded data set.

FIG. 6 schematically illustrates the process by which the gradient correction unit 120 corrects the gradient, according to an embodiment of the present invention. The diagram marks the optimal parameters of the model, to which training should finally converge, and the current parameters.

In FIG. 6, the green arrow is the gradient vector obtained with the original, unencoded data, and the red arrow is the gradient vector obtained with the encoded data; the blue arrow, the difference between the two vectors, is the bias. This bias arises when encoded data with added Gaussian noise is used. Removing the bias yields a value close to the gradient that the original data would give, so the model can be trained more accurately.
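Why added noise biases the gradient, and how the bias can be subtracted, can be shown on a much simpler case than the one the patent treats. The sketch below assumes a single-layer linear model with squared loss and additive Gaussian noise on the features only; under those assumptions the expected noisy gradient exceeds the clean one by `sigma**2 * theta`. It is not Equation 2, which concerns a deep linear neural network and is not reproduced in this text, and all function names are hypothetical.

```python
import random

def raw_gradient(theta, x, y):
    """Gradient of 0.5 * (x.theta - y)^2 with respect to theta, for one sample."""
    residual = sum(t * xi for t, xi in zip(theta, x)) - y
    return [residual * xi for xi in x]

def debiased_gradient(theta, x_noisy, y, sigma):
    """Subtract the expected bias sigma^2 * theta caused by feature noise."""
    g = raw_gradient(theta, x_noisy, y)
    return [gi - sigma ** 2 * t for gi, t in zip(g, theta)]

def mean_gradient(grad_fn, theta, x, y, sigma, trials=20000, seed=0):
    """Average a gradient estimate over many independent noise draws."""
    rng = random.Random(seed)
    total = [0.0] * len(theta)
    for _ in range(trials):
        x_noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        g = grad_fn(theta, x_noisy, y)
        total = [a + b for a, b in zip(total, g)]
    return [a / trials for a in total]

theta, x, y, sigma = [1.0, -2.0], [0.5, 1.5], 0.3, 0.5
clean = raw_gradient(theta, x, y)                         # gradient from clean data
biased = mean_gradient(raw_gradient, theta, x, y, sigma)  # average noisy gradient
corrected = mean_gradient(
    lambda t, xn, yy: debiased_gradient(t, xn, yy, sigma), theta, x, y, sigma
)                                                         # average after correction
```

In the terms of FIG. 6, `clean` plays the role of the green arrow, `biased` the red arrow, and the subtracted term `sigma**2 * theta` the blue arrow.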

Step S300 is described with reference again to FIGS. 4 and 5. In step S300, the model and parameters that minimize the loss are determined by training with stochastic gradient descent based on the bias-removed gradient calculated in step S200. This step is performed by the model generation unit 130 of the machine learning server 100. The stochastic gradient descent used here is a known algorithm, and a detailed description of it is omitted.

The process by which the model generation unit 130 determines the parameters by training with stochastic gradient descent based on the bias-removed gradient may be performed according to Equation 3 below.

[Equation 3: not reproduced in this text]

The definitions accompanying Equation 3 state one relation and identify the learning rate. The symbols are rendered as images in the original publication and do not survive in this text.
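For orientation, the textbook gradient descent update that uses a learning rate can be sketched as follows. Equation 3 is not reproduced in this text, so the code assumes the standard form in which the new parameters equal the old parameters minus the learning rate times the gradient; in the patent's setting the gradient passed in would be the bias-removed one. The toy loss and the names `sgd_step` and `loss_gradient` are hypothetical.

```python
def sgd_step(theta, gradient, learning_rate):
    """Return theta - learning_rate * gradient, element by element."""
    return [t - learning_rate * g for t, g in zip(theta, gradient)]

def loss_gradient(theta):
    """Gradient of the toy loss 0.5 * sum((theta_i - 3)^2)."""
    return [t - 3.0 for t in theta]

theta = [0.0, 10.0]
for _ in range(200):
    # Each step moves the parameters a fraction of the way toward the minimum.
    theta = sgd_step(theta, loss_gradient(theta), learning_rate=0.1)
```

After the loop both entries of `theta` are close to 3, the minimizer of the toy loss.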

The model generation unit 130 then repeats step S200 and stochastic gradient descent for m models to search for the optimal model and its parameters.

This process is expressed by Equation 4 below.

[Equation 4: not reproduced in this text]

One symbol in Equation 4 denotes an update. That is, the calculated parameters are updated repeatedly to derive the best-fitting model, and the optimal model and parameters are ultimately determined and provided to the client 200 through the model delivery unit 140.
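The search over m candidate models can be pictured as a loop that trains each candidate and keeps the one with the lowest loss. The sketch below is an illustration under stated assumptions, not Equation 4: the two toy candidates ("constant" and "linear"), the data, and every function name are hypothetical, and plain full-batch gradient descent stands in for the bias-corrected stochastic procedure.

```python
DATA = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # toy (x, y) pairs with y = 2x

def predict(name, params, x):
    # "constant" ignores x; "linear" scales it.
    return params[0] if name == "constant" else params[0] * x

def mean_loss(name, params):
    return sum((predict(name, params, x) - y) ** 2 for x, y in DATA) / len(DATA)

def train(name, params, learning_rate=0.05, steps=500):
    """Fit the single parameter of a candidate model by gradient descent."""
    w = params[0]
    for _ in range(steps):
        grad = 0.0
        for x, y in DATA:
            feature = 1.0 if name == "constant" else x
            grad += 2.0 * (predict(name, [w], x) - y) * feature
        w -= learning_rate * grad / len(DATA)
    return [w]

def select_best_model(candidates, train_fn, loss_fn):
    """Train every candidate and return (name, params, loss) of the best one."""
    best = None
    for name, init_params in candidates.items():
        params = train_fn(name, init_params)
        loss = loss_fn(name, params)
        if best is None or loss < best[2]:
            best = (name, params, loss)
    return best

best_name, best_params, best_loss = select_best_model(
    {"constant": [0.0], "linear": [0.0]}, train, mean_loss
)
```

Here the linear candidate fits the toy data exactly, so it is the one the loop returns together with its parameter.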

Step S400 is the step in which the client 200 trains the model received from the model delivery unit 140 using the original data set. Because the model the client 200 receives from the machine learning server 100 was trained on the encoded data set, it is less accurate than a model trained on the original data. Training the model again on the original data can therefore improve its performance.

In this case, the client 200 may receive both the model and the parameters from the machine learning server 100 and then train using the original data set, or it may receive only the model and derive the parameters anew by training on the original data set. The choice can be made as appropriate for the purpose, environment, and cost of the service.
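The two options for step S400 differ only in the starting point of the client's training. A minimal sketch, assuming a one-parameter linear model and toy data: `server_params` is a hypothetical value standing in for parameters learned from encoded data, and `fine_tune` is a hypothetical helper that runs plain gradient descent on the original data.

```python
def fine_tune(params, data, learning_rate=0.05, steps=3):
    """Continue gradient descent on the client's original (unencoded) data."""
    w = params[0]
    for _ in range(steps):
        grad = sum(2.0 * (w * x - y) * x for x, y in data) / len(data)
        w -= learning_rate * grad
    return [w]

original_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # toy data with y = 2x
server_params = [1.7]  # hypothetical parameters received from the server

# Option 1: model and parameters received, so training starts from them.
warm_start = fine_tune(server_params, original_data)
# Option 2: only the model received, so training starts from scratch.
cold_start = fine_tune([0.0], original_data)
```

With the same small number of steps, the warm start ends nearer the best-fit slope of 2 than the cold start, which is one reason a client might prefer to receive the parameters as well; with enough steps both reach the same result on this toy problem.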

The present invention can also be implemented as computer-readable code on a computer-readable recording medium. The computer-readable recording medium includes every kind of recording device that stores data readable by a computer system. Examples include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices, as well as implementations in the form of a carrier wave (for example, transmission over the Internet). The computer-readable recording medium may also be distributed over computer systems connected by a network, so that the computer-readable code is stored and executed in a distributed manner. Functional programs, code, and code segments for implementing the present invention can be readily inferred by programmers in the technical field to which the invention pertains.

In addition, the method described above and the apparatus using it are not limited to the configurations and methods of the embodiments described; all or some of the embodiments may be selectively combined so that various modifications can be made.

Although representative embodiments of the present invention have been described in detail above, those of ordinary skill in the art to which the invention pertains will understand that various modifications can be made to the embodiments without departing from the scope of the invention. The scope of rights of the present invention should therefore not be limited to the described embodiments, but should be determined by the claims below and their equivalents.

1000: machine learning system
100: machine learning server
110: encoded data receiving unit
120: gradient correction unit
130: model generation unit
140: model delivery unit
200: client

Claims

1. A machine learning apparatus comprising:
an encoded data receiving unit that receives, from a client, an encoded data set composed of feature values and labels;
a gradient correction unit that calculates a gradient from the encoded data set and removes a bias from the gradient;
a model generation unit that determines a model and parameters that minimize a loss by training with stochastic gradient descent based on the bias-removed gradient; and
a model delivery unit that delivers the determined loss-minimizing model and parameters to the client.

2. The machine learning apparatus of claim 1,
wherein the encoded data set is generated by:
(a) extracting a data set composed of n feature values and n labels corresponding to the n feature values; and
(b) encoding the data set composed of the n feature values and n labels to generate the encoded data set.

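3. The machine learning apparatus of claim 2,
wherein the encoding of the data set composed of the n feature values and n labels is performed according to Equation 1 below.
[Equation 1: not reproduced in this text]
(The definitions accompanying Equation 1 identify, in order: the randomly drawn encoding coefficients and the expression that defines them; a Bernoulli distribution; the encoding width; a normal distribution; the feature values; the labels; an identity matrix; and the size of the matrix. The symbols are rendered as images in the original publication and do not survive in this text.)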

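4. The machine learning apparatus of claim 3,
wherein the removal of the bias from the gradient is performed according to Equation 2 below.
[Equation 2: not reproduced in this text]
(The definitions accompanying Equation 2 identify, in order: the gradient and the expression that defines it; the loss function; the encoded features; the encoded labels; the parameters of a given layer of a deep linear neural network; and the bias, followed by two further relations. The symbols are rendered as images in the original publication and do not survive in this text.)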

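5. The machine learning apparatus of claim 4,
wherein the determination of the loss-minimizing parameters by training with stochastic gradient descent based on the bias-removed gradient is performed according to Equation 3 below.
[Equation 3: not reproduced in this text]
(The definitions accompanying Equation 3 state one relation and identify the learning rate. The symbols are rendered as images in the original publication and do not survive in this text.)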

6. A machine learning system comprising:
a client that extracts a data set composed of n feature values and n labels corresponding to the n feature values, and encodes the data set composed of the n feature values and n labels to generate an encoded data set; and
a machine learning server comprising an encoded data receiving unit that receives, from the client, the encoded data set composed of feature values and labels, a gradient correction unit that calculates a gradient from the encoded data set and removes a bias from the gradient, a model generation unit that determines a model and parameters that minimize a loss by training with stochastic gradient descent based on the bias-removed gradient, and a model delivery unit that delivers the determined loss-minimizing model and parameters to the client.

7. The machine learning system of claim 6,
wherein the encoded data set is generated by:
(a) extracting a data set composed of n feature values and n labels corresponding to the n feature values; and
(b) encoding the data set composed of the n feature values and n labels to generate the encoded data set.

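8. The machine learning system of claim 7,
wherein the encoding of the data set composed of the n feature values and n labels is performed according to Equation 1 below.
[Equation 1: not reproduced in this text]
(The definitions accompanying Equation 1 identify, in order: the randomly drawn encoding coefficients and the expression that defines them; a Bernoulli distribution; the encoding width; a normal distribution; the feature values; the labels; an identity matrix; and the size of the matrix. The symbols are rendered as images in the original publication and do not survive in this text.)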

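9. The machine learning system of claim 8,
wherein the removal of the bias from the gradient is performed according to Equation 2 below.
[Equation 2: not reproduced in this text]
(The definitions accompanying Equation 2 identify, in order: the gradient and the expression that defines it; the loss function; the encoded features; the encoded labels; the parameters of a given layer of a deep linear neural network; and the bias, followed by two further relations. The symbols are rendered as images in the original publication and do not survive in this text.)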

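10. The machine learning system of claim 9,
wherein the determination of the loss-minimizing parameters by training with stochastic gradient descent based on the bias-removed gradient is performed according to Equation 3 below.
[Equation 3: not reproduced in this text]
(The definitions accompanying Equation 3 state one relation and identify the learning rate. The symbols are rendered as images in the original publication and do not survive in this text.)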

11. The machine learning system of claim 6,
wherein the client trains the model delivered from the model delivery unit using the original data.