KR20230152962A

KR20230152962A - Apparatus and method for measuring items related personal information

Info

Publication number: KR20230152962A
Application number: KR1020220052566A
Authority: KR
Inventors: 임종호; 정동훈
Original assignee: 연세대학교 산학협력단
Priority date: 2022-04-28
Filing date: 2022-04-28
Publication date: 2023-11-06

Abstract

The present invention relates to an apparatus and method for measuring personal information-related items. A method according to an embodiment of the present invention is performed on a computing-capable device and measures the degree of exposure of personal information and the degree of information retention. The method includes: calculating a first distance, which is the distance between element vectors within an original data set (X), and a second distance, which is the distance between each element vector of the original data set (X) and each element vector of a reproduced data set (Y); deriving a similarity average between the original data set (X) and the reproduced data set (Y) using the first distance and the second distance; and indicating the degree of exposure of personal information and the degree of information retention using the derived similarity average.

Description

Apparatus and method for measuring personal information-related items {APPARATUS AND METHOD FOR MEASURING ITEMS RELATED PERSONAL INFORMATION}

The present invention relates to technology for measuring items related to personal information and, more specifically, to technology for simultaneously measuring the degree of exposure of personal information and the degree of information retention.

Some personal information, such as an address, resident registration number, marital status, school attended, height, or weight, may be harmless even if disclosed to the general public, but most of it is private and should be protected from disclosure.

However, businesses that conduct surveys of specific groups or send direct mail to specific consumer segments need to obtain personal information about their targets, so the required personal information is sometimes traded among such businesses. That is, a company holding a large volume of personal information provided by individuals may supply it to other companies that need it.

Meanwhile, to assess the state of such personal information transactions, various items relating to personal information (hereinafter referred to as "measurement items") are measured. Two representative measurement items are the degree of exposure of personal information and the degree of information retention.

However, the prior art measures only one of these measurement items. No technology yet exists that simultaneously measures both the degree of exposure of personal information and the degree of information retention.

The foregoing merely provides background information on the present invention and does not constitute previously disclosed technology.

To solve the problems of the prior art described above, an object of the present invention is to provide technology that simultaneously measures two measurement items related to personal information: the degree of exposure of personal information and the degree of information retention.

That is, an object of the present invention is to provide technology for deriving a single value from which both the degree of exposure of personal information and the degree of information retention can be determined.

Another object of the present invention is to provide technology for setting the optimal reference value needed to measure the degree of exposure of personal information and the degree of information retention.

A further object of the present invention is to provide technology that makes the degree of exposure of personal information and the degree of information retention easier to grasp intuitively.

However, the problems to be solved by the present invention are not limited to those mentioned above, and other problems not mentioned will be clearly understood by those of ordinary skill in the art to which the present invention pertains from the description below.

A method according to an embodiment of the present invention for solving the above problems is performed on a computing-capable device and measures the degree of exposure of personal information and the degree of information retention. The method includes: calculating a first distance, which is the distance between element vectors within an original data set (X), and a second distance, which is the distance between each element vector of the original data set (X) and each element vector of a reproduced data set (Y); deriving a similarity average between the original data set (X) and the reproduced data set (Y) using the first distance and the second distance; and indicating the degree of exposure of personal information and the degree of information retention using the derived similarity average.

In the deriving step, a first value, which is one of the first distances for the i-th element vector (x_i) of the original data set (X), may be compared with a second value, which is one of the second distances for the same element vector (x_i). According to the comparison result, the value of an i-th similarity may be derived, the i-th similarity being the similarity between the original data set (X) and the reproduced data set (Y) with respect to x_i.

In the deriving step, the i-th similarity may be derived as a first specific value if the first value ≥ the second value, and as a second specific value if the first value < the second value.

In the deriving step, the minimum (min_i1) of the first distances for the i-th element vector (x_i) of the original data set (X) may be compared with the minimum (min_i2) of the second distances for the same element vector (x_i). According to the comparison result, the value of the i-th similarity may be derived, the i-th similarity being the similarity between the original data set (X) and the reproduced data set (Y) with respect to x_i.

In the deriving step, the i-th similarity may be derived as the first specific value if min_i1 ≥ min_i2, and as the second specific value if min_i1 < min_i2.

The first specific value may be greater than the second specific value.

The first specific value may be 1 and the second specific value may be 0.
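The minimum-distance comparison described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of the specification: the function name `nearest_neighbor_similarity`, the sample data, and the choice of Euclidean distance (`math.dist`) are assumptions, since the text above does not fix a distance metric.

```python
import math

def nearest_neighbor_similarity(X, Y, high=1, low=0):
    """Per-element similarities using minimum distances (hypothetical helper)."""
    sims = []
    for i, x_i in enumerate(X):
        # min_i1: smallest first distance (x_i to every other element of X)
        min_i1 = min(math.dist(x_i, x_j) for j, x_j in enumerate(X) if j != i)
        # min_i2: smallest second distance (x_i to every element of Y)
        min_i2 = min(math.dist(x_i, y_j) for y_j in Y)
        # first specific value if min_i1 >= min_i2, otherwise second specific value
        sims.append(high if min_i1 >= min_i2 else low)
    return sims

# Illustrative data only: three original vectors and two reproduced vectors.
X = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
Y = [(0.1, 0.0), (20.0, 20.0)]
sims = nearest_neighbor_similarity(X, Y)
mu = sum(sims) / len(sims)  # similarity average over the elements of X
```

In this sample, the first two original vectors each have a reproduced vector closer than their nearest original neighbor, while the third does not, so the similarities are 1, 1, and 0.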

In the deriving step, the i-th similarity may be derived using the following expression.

$$s(x_i) = I\left(\{d(x_j, x_i)\}^{\langle k \rangle} \ge \{d(y_{j'}, x_i)\}^{\langle k \rangle}\right)$$

(Here, s(x_i) denotes the i-th similarity; x_j and x_i are distinct element vectors selected from the element vectors (x_1, …, x_n) of the original data set (X); y_j' is selected from the element vectors (y_1, …, y_m) of the reproduced data set (Y); d(x_j, x_i) is the first distance between x_j and x_i; d(y_j', x_i) is the second distance between y_j' and x_i; I is an indicator function such that I(a ≥ b) yields the first specific value if the condition a ≥ b is satisfied and the second specific value if it is not; and $A^{\langle k \rangle}$ denotes the k-th smallest element of a set A.)
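As a rough illustration of the expression above, the following Python sketch compares the k-th smallest first distance with the k-th smallest second distance for each x_i and averages the results. The helper names `kth_smallest` and `similarity_mean`, the Euclidean metric, the 1/0 specific values, and the sample data are all assumptions made for this example.

```python
import math

def kth_smallest(values, k):
    """Return the k-th smallest element (1-indexed) of a collection."""
    return sorted(values)[k - 1]

def similarity_mean(X, Y, k=1):
    """Average of s(x_i) over X; requires k <= len(X) - 1 and k <= len(Y)."""
    total = 0
    for i, x_i in enumerate(X):
        # first distances: x_i to every other element vector of X
        first = [math.dist(x_j, x_i) for j, x_j in enumerate(X) if j != i]
        # second distances: x_i to every element vector of Y
        second = [math.dist(y_j, x_i) for y_j in Y]
        # indicator function with first specific value 1 and second specific value 0
        total += 1 if kth_smallest(first, k) >= kth_smallest(second, k) else 0
    return total / len(X)

# Illustrative data only.
X = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
Y = [(0.0, 0.5), (1.0, 0.5), (2.0, 0.5)]
mu_k1 = similarity_mean(X, Y, k=1)
mu_k2 = similarity_mean(X, Y, k=2)
```

With k = 1 this reduces to the minimum-distance comparison; larger k makes the comparison less sensitive to a single unusually close neighbor.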

In the indicating step, the degree of information retention may be indicated as gradually decreasing as the similarity average approaches its minimum value, and the degree of exposure of personal information may be indicated as gradually increasing as the similarity average approaches its maximum value.

In the indicating step, when the similarity average reaches a reference value lying between the minimum value and the maximum value, the degree of information retention and the degree of exposure of personal information may be indicated as being in an optimal state.

In the indicating step, the degree of information retention may be indicated as being in an adequate state if the similarity average exceeds the reference value, and the degree of exposure of personal information may be indicated as being in an adequate state if the similarity average is below the reference value.
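The three cases described above (at, above, and below the reference value) can be sketched as a small classification function. The function name `interpret`, the returned labels, the tolerance parameter `tol`, and the reference value 0.5 used in the example are hypothetical; the text above does not state a numeric reference value.

```python
def interpret(mu, reference, tol=1e-9):
    """Map a similarity average to one of the three indicated states."""
    if abs(mu - reference) <= tol:
        # similarity average has reached the reference value
        return "optimal"
    if mu > reference:
        # above the reference value: information retention is adequate
        return "information retention adequate"
    # below the reference value: exposure level is adequate
    return "exposure level adequate"

# Example with an assumed reference value of 0.5.
states = [interpret(mu, 0.5) for mu in (0.2, 0.5, 0.9)]
```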

The reference value may be the value obtained when the data distribution of the original data set (X) matches that of the reproduced data set (Y) and the original data set (X) and the reproduced data set (Y) are statistically independent.
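As an illustrative calculation that is not taken from the specification, consider the special case in which the minimum distances are compared, the first and second specific values are 1 and 0, the original data set has n elements and the reproduced data set has m elements, and all elements are drawn independently from one continuous distribution so that ties have probability zero. For a fixed x_i, the other n − 1 original vectors and the m reproduced vectors are then exchangeable, so each of them is equally likely to be the nearest to x_i, and s(x_i) = 1 exactly when the nearest one belongs to Y:

```latex
\mathbb{E}[s(x_i)]
  = \Pr\Bigl(\min_{j \ne i} d(x_j, x_i) \ge \min_{j'} d(y_{j'}, x_i)\Bigr)
  = \frac{m}{(n - 1) + m},
\qquad
\mathbb{E}[\mu] = \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}[s(x_i)] = \frac{m}{n + m - 1}.
```

Under these assumptions, data sets of equal size (m = n) give an expected similarity average of n / (2n − 1), which approaches 1/2 as n grows.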

In the indicating step, the similarity average may be represented as a graph of the following expression.

$$f(\mu) = (u, v)$$

(Here, μ denotes the similarity average, u(μ) denotes the degree of information retention determined by μ, and v(μ) denotes the degree of exposure of personal information determined by μ.)

In the indicating step, the degree of information retention determined by the similarity average may be indicated so that it changes greatly in the region below the reference value and changes little in the region above the reference value.

In the indicating step, the degree of exposure of personal information determined by the similarity average may be indicated so that it changes little in the region below the reference value and changes greatly in the region above the reference value.

In the indicating step, the value of u(μ) × v(μ) may be indicated so that it takes its maximum when the similarity average (μ) equals the reference value.

(Here, u(μ) denotes the degree of information retention determined by μ, and v(μ) denotes the degree of exposure of personal information determined by μ.)

An apparatus according to an embodiment of the present invention includes: a memory storing an original data set (X) and a reproduced data set (Y); and a control unit that controls measurement of the degree of exposure and the degree of information retention based on the information stored in the memory.

The control unit may perform control to calculate a first distance, which is the distance between element vectors within the original data set (X), and a second distance, which is the distance between each element vector of the original data set (X) and each element vector of the reproduced data set (Y); to derive a similarity average between the original data set (X) and the reproduced data set (Y) using the first distance and the second distance; and to indicate the degree of exposure of personal information and the degree of information retention using the derived similarity average.

When deriving the similarity average, the control unit may perform control to compare a first value, which is one of the first distances for the i-th element vector (x_i) of the original data set (X), with a second value, which is one of the second distances for the same element vector (x_i), and to derive, according to the comparison result, the value of the i-th similarity, which is the similarity between the original data set (X) and the reproduced data set (Y) with respect to x_i.

When deriving the similarity average, the control unit may perform control to derive the i-th similarity as the first specific value if the first value ≥ the second value, and as the second specific value if the first value < the second value.

The control unit may indicate the degree of information retention as gradually decreasing as the similarity average approaches its minimum value, and the degree of exposure of personal information as gradually increasing as the similarity average approaches its maximum value. If the similarity average exceeds a reference value lying between the minimum value and the maximum value, the control unit may indicate that the degree of information retention is in an adequate state; if the similarity average is below the reference value, it may indicate that the degree of exposure of personal information is in an adequate state.

The present invention, configured as described above, has the advantage of simultaneously measuring two measurement items related to personal information: the degree of exposure of personal information and the degree of information retention.

That is, the present invention has the advantage of deriving a single value from which both the degree of exposure of personal information and the degree of information retention can be determined.

In addition, the present invention has the advantage of providing technology for setting the optimal reference value needed to measure the degree of exposure of personal information and the degree of information retention.

In addition, the present invention has the advantage of making the degree of exposure of personal information and the degree of information retention easier to grasp intuitively through graphs and the like.

The effects obtainable from the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those of ordinary skill in the art to which the present invention pertains from the description below.

Figure 1 shows a block diagram of a device 100 according to an embodiment of the present invention.
Figure 2 shows a flowchart of a measurement method according to an embodiment of the present invention.
Figure 3 shows an example of an original data set (X) and a reproduced data set (Y) of personal information.
Figures 4 and 5 show an example of performing the first and second distance calculations on an original data set (X) and a reproduced data set (Y) composed of element vectors with two attributes (Attribute1, Attribute2).
Figure 6 shows the degree of exposure of personal information and the degree of information retention according to the value of the similarity average (μ).
Figure 7 shows an example of displaying the similarity average (μ) of Figure 6 in graph form.

The above objects and means of the present invention, and the resulting effects, will become clearer from the following detailed description in conjunction with the accompanying drawings, so that those of ordinary skill in the art to which the present invention pertains can readily practice its technical idea. In describing the present invention, detailed descriptions of known technologies related to it are omitted where they might unnecessarily obscure the gist of the invention.

The terminology used in this specification is for describing embodiments and is not intended to limit the present invention. In this specification, singular forms also include plural forms where appropriate, unless the context specifically states otherwise. Terms such as "include", "comprise", "provide", or "have" do not exclude the presence or addition of one or more components other than those mentioned.

In this specification, terms such as "or" and "at least one" may denote one of the words listed together, or a combination of two or more of them. For example, "A or B" and "at least one of A and B" may include only one of A and B, or both A and B.

In this specification, information presented in descriptions following "for example" and the like, such as cited characteristics, variables, or values, may not match exactly. Variations arising from tolerances, measurement errors, limits of measurement accuracy, and other commonly known factors should not be taken to limit the embodiments of the present invention.

In this specification, when a component is described as being "connected" or "coupled" to another component, it may be directly connected or coupled to that component, but it should be understood that other components may exist in between. On the other hand, when a component is described as being "directly connected" or "directly coupled" to another component, it should be understood that no other component exists in between.

In this specification, when a component is described as being "on" or "in contact with" another component, it may be in direct contact with or connected to that component, but it should be understood that another component may exist in between. On the other hand, when a component is described as being "directly on" or "in direct contact with" another component, it may be understood that no other component exists in between. Other expressions describing relationships between components, such as "between" and "directly between", may be interpreted in the same way.

In this specification, terms such as "first" and "second" may be used to describe various components, but the components should not be limited by these terms. These terms should not be interpreted as limiting the order of the components; they are used to distinguish one component from another. For example, a "first component" may be named a "second component", and similarly a "second component" may be named a "first component".

Unless otherwise defined, all terms used in this specification have the meanings commonly understood by those of ordinary skill in the art to which the present invention pertains. Terms defined in commonly used dictionaries are not to be interpreted in an idealized or excessive manner unless explicitly and specifically defined.

Hereinafter, a preferred embodiment of the present invention is described in detail with reference to the accompanying drawings.

Figure 1 shows a block diagram of a device 100 according to an embodiment of the present invention.

The device 100 according to an embodiment of the present invention (hereinafter referred to as "the present device") is a device for simultaneously measuring the degree of exposure of personal information and the degree of information retention, and may be an electronic device capable of computing.

For example, the electronic device may be a general-purpose computing system such as a desktop personal computer (PC), laptop PC, tablet PC, netbook computer, workstation, personal digital assistant (PDA), smartphone, smart pad, or mobile phone, or a dedicated embedded system implemented on Embedded Linux or the like, but is not limited thereto.

As shown in Figure 1, the present device 100 may include an input unit 110, a communication unit 120, a display 130, a memory 140, and a control unit 150.

The input unit 110 generates input data in response to various user inputs and may include various input means.

For example, the input unit 110 may include a keyboard, keypad, dome switch, touch panel, touch key, touch pad, mouse, menu button, and the like, but is not limited thereto.

The communication unit 120 is a component that communicates with other devices. For example, the communication unit 120 may receive information such as the original data set and the reproduced data set from another device, and may transmit the results of the measurement method described later to another device.

For example, the communication unit 120 may perform wireless communication such as 5th generation communication (5G), long term evolution-advanced (LTE-A), long term evolution (LTE), Bluetooth, Bluetooth low energy (BLE), near field communication (NFC), or WiFi, or wired communication such as cable communication, but is not limited thereto.

The display 130 displays various image data on a screen and may be composed of a non-emissive panel or an emissive panel. That is, the display 130 can display various image data arising from the processing performed by the device 100.

For example, the display 130 may include a liquid crystal display (LCD), a light emitting diode (LED) display, an organic light emitting diode (OLED) display, a micro electro mechanical systems (MEMS) display, or an electronic paper display, but is not limited thereto. The display 130 may also be combined with the input unit 110 and implemented as a touch screen or the like.

The memory 140 stores various information necessary for the operation of the device 100. The information stored in the memory 140 may include the original data set, the reproduced data set, information exchanged with other devices through the communication unit 120, information for the control operations of the control unit 150, and program information related to the measurement method described later, but is not limited thereto.

For example, depending on its type, the memory 140 may include a hard disk type, magnetic media type, compact disc read only memory (CD-ROM), optical media type, magneto-optical media type, multimedia card micro type, flash memory type, read only memory (ROM) type, or random access memory (RAM) type, but is not limited thereto. Depending on its purpose or location, the memory 140 may be a cache, a buffer, main memory, auxiliary memory, or a separately provided storage system, but is not limited thereto.

The control unit 150 may perform various control operations of the server 100. In particular, the control unit 150 may control the execution of the measurement method described later. In addition, the control unit 150 may control the operation of the remaining components of the server 100, that is, the input unit 110, the communication unit 120, the display 130, the memory 140, and the like. For example, the control unit 150 may include a processor, which is hardware, or a process, which is software executed on that processor, but is not limited thereto.

Hereinafter, a measurement method according to an embodiment of the present invention will be described.

FIG. 2 shows a flowchart of a measurement method according to an embodiment of the present invention.

The measurement method according to an embodiment of the present invention (hereinafter referred to as "the present measurement method") is a method of simultaneously measuring the degree of exposure of personal information and the degree to which the amount of information is maintained. It is performed in the present device 100 and, in particular, may be performed under the control of the control unit 150. As shown in FIG. 2, the present measurement method includes steps S210 to S230.

FIG. 3 shows an example of an original data set (X) and a reproduced data set (Y) of personal information.

First, S210 is a step of calculating distances between element vectors for the original data set (X) and the reproduced data set (Y) of personal information. That is, the distance between each pair of element vectors in the original data set (X) (hereinafter referred to as the "first distance") is calculated. In addition, the distance between each element vector of the original data set (X) and each element vector of the reproduced data set (Y) (hereinafter referred to as the "second distance") is calculated.

The original data set (X) is a data set of personal information and includes n element vectors (x₁, …, x_n) that can be distinguished by row (where n is a natural number of 2 or more). Each element vector (x₁, …, x_n) of the original data set (X) includes p attributes (A₁, …, A_p) that can be distinguished by column (where p is a natural number of 1 or more). That is, in the original data set (X), the first element vector (x₁) includes the attributes {x₁₁, …, x_1p}, and the n-th element vector (x_n) includes the attributes {x_n1, …, x_np}.

Likewise, the reproduced data set (Y) is a data set of personal information and includes m element vectors (y₁, …, y_m) that can be distinguished by row (where m is a natural number of 2 or more). Each element vector (y₁, …, y_m) of the reproduced data set (Y) includes p attributes (A₁, …, A_p) that can be distinguished by column. That is, in the reproduced data set (Y), the first element vector (y₁) includes the attributes {y₁₁, …, y_1p}, and the m-th element vector (y_m) includes the attributes {y_m1, …, y_mp}.

Specifically, the original data set (X) is a set composed of original data, where original data refers to data related to personal information that an information collector has collected from an information provider. Personal information is information about an individual. It may be information by which an individual can be identified through a name, resident registration number, image, or the like, or information that cannot identify a specific individual by itself but can easily be combined with other information to do so.

In addition, the reproduced data set (Y) is a set composed of reproduced data, where reproduced data is data on information obtained by anonymizing personal information (that is, anonymous information). The present invention is a technology that measures the reproduced data in order to determine, simultaneously, the degree of exposure of personal information and the degree to which the amount of information is maintained in the reproduced data. Anonymous information is information from which an individual can no longer be identified even when other information is used, given reasonable consideration of time, cost, technology, and the like. Accordingly, anonymous information can be used freely without restriction and is not subject to the Personal Information Protection Act.

The first distance calculation computes the distances between the element vectors (x₁, …, x_n) in the original data set (X). That is, the distance between the first element vector (x₁) and each of the remaining element vectors (x₂, …, x_n) is calculated (yielding n-1 distance results), the distance between the second element vector (x₂) and each of the remaining element vectors (x₁, x₃, …, x_n) is calculated (yielding n-1 distance results), and the calculation continues in this manner through the last, n-th element vector. As a result, a total of n×(n-1) distance results can be derived.

This first distance calculation process can be expressed using Equation 1 and Table 1 below.

d(x_j, x_i) (Equation 1)

That is, in Equation 1, d is a distance function, where d(a, b) represents the distance between element vector a and element vector b, and j and i are natural numbers not exceeding n. In addition, x_j and x_i are different element vectors in the original data set (X). That is, x_j is selected from the element vectors (x₁, …, x_n) of the original data set (X) that remain after x_i is excluded.

Here, d(a, b) may use any of various algorithms for obtaining the distance between vectors (for example, the straight-line distance).

For example, when p is 3, the distance between the element vector x₁(x₁₁, x₁₂, x₁₃) and the element vector x₂(x₂₁, x₂₂, x₂₃) may be $d(x_1, x_2) = \sqrt{(x_{11}-x_{21})^2 + (x_{12}-x_{22})^2 + (x_{13}-x_{23})^2}$, but is not limited thereto.

Table 1

| Distance calculation count (cumulative) | x_i | x_j | d(x_j, x_i) |
|---|---|---|---|
| 1 | x₁ | x₂ | d(x₂, x₁) |
| 2 | x₁ | x₃ | d(x₃, x₁) |
| … | … | … | … |
| n-1 | x₁ | x_n | d(x_n, x₁) |
| n | x₂ | x₁ | d(x₁, x₂) |
| n+1 | x₂ | x₃ | d(x₃, x₂) |
| … | … | … | … |
| 2×(n-1) | x₂ | x_n | d(x_n, x₂) |
| … | … | … | … |
| (n-1)×(n-1)+1 | x_n | x₁ | d(x₁, x_n) |
| (n-1)×(n-1)+2 | x_n | x₂ | d(x₂, x_n) |
| … | … | … | … |
| n×(n-1) | x_n | x_n-1 | d(x_n-1, x_n) |
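The enumeration of first distances can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `euclidean` and `first_distances` are hypothetical, and the straight-line (Euclidean) distance is just one of the distance functions the description allows.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length attribute vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def first_distances(X, d=euclidean):
    """For each x_i in X, list d(x_j, x_i) over every j != i (n-1 values per x_i)."""
    return {
        i: [d(X[j], X[i]) for j in range(len(X)) if j != i]
        for i in range(len(X))
    }

# Toy original data set with n = 3 element vectors and p = 2 attributes.
X = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D1 = first_distances(X)
total = sum(len(v) for v in D1.values())  # n * (n - 1) results in total
```

With n = 3 the sketch produces 3 × 2 = 6 distances, matching the n×(n-1) count stated for the first distance calculation.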

The second distance calculation computes the distances between each element vector (x₁, …, x_n) of the original data set (X) and each element vector (y₁, …, y_m) of the reproduced data set (Y). That is, the distance between the first element vector (x₁) of the original data set (X) and each element vector (y₁, …, y_m) of the reproduced data set (Y) is calculated (yielding m distance results), the distance between the second element vector (x₂) of the original data set (X) and each element vector (y₁, …, y_m) of the reproduced data set (Y) is calculated (yielding m distance results), and the calculation continues in this manner through the n-th element vector of the original data set (X). As a result, a total of n×m distance results can be derived.

This second distance calculation process can be expressed using Equation 2 and Table 2 below.

d(y_j', x_i) (Equation 2)

That is, in Equation 2, d is a distance function, where d(a, b) represents the distance between element vector a and element vector b, j' is a natural number not exceeding m, and i is a natural number not exceeding n. In addition, x_i is selected from the element vectors (x₁, …, x_n) of the original data set (X), and y_j' is selected from the element vectors (y₁, …, y_m) of the reproduced data set (Y).

For example, when p is 3, the distance between the element vector y₁(y₁₁, y₁₂, y₁₃) and the element vector x₁(x₁₁, x₁₂, x₁₃) may be $d(y_1, x_1) = \sqrt{(y_{11}-x_{11})^2 + (y_{12}-x_{12})^2 + (y_{13}-x_{13})^2}$, but is not limited thereto.

Table 2

| Distance calculation count (cumulative) | x_i | y_j' | d(y_j', x_i) |
|---|---|---|---|
| 1 | x₁ | y₁ | d(y₁, x₁) |
| 2 | x₁ | y₂ | d(y₂, x₁) |
| … | … | … | … |
| m | x₁ | y_m | d(y_m, x₁) |
| m+1 | x₂ | y₁ | d(y₁, x₂) |
| m+2 | x₂ | y₂ | d(y₂, x₂) |
| … | … | … | … |
| 2×m | x₂ | y_m | d(y_m, x₂) |
| … | … | … | … |
| (n-1)×m+1 | x_n | y₁ | d(y₁, x_n) |
| (n-1)×m+2 | x_n | y₂ | d(y₂, x_n) |
| … | … | … | … |
| n×m | x_n | y_m | d(y_m, x_n) |
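The second distance calculation can be sketched the same way. Again this is a hedged illustration: `second_distances` is a hypothetical name, and the Euclidean distance is assumed only as one permissible choice of d.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length attribute vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def second_distances(X, Y, d=euclidean):
    """Row i holds d(y_j', x_i) for every y_j' in Y (m values per x_i)."""
    return [[d(y, x) for y in Y] for x in X]

# Toy data: n = 2 original vectors, m = 3 reproduced vectors, p = 2 attributes.
X = [(0.0, 0.0), (3.0, 4.0)]
Y = [(0.0, 1.0), (3.0, 0.0), (6.0, 8.0)]
D2 = second_distances(X, Y)
total = sum(len(row) for row in D2)  # n * m results in total
```

Unlike the first distance calculation, no element is excluded here, so every x_i contributes exactly m distances and the total is n×m.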

FIGS. 4 and 5 show an example of performing the first and second distance calculations on an original data set (X) and a reproduced data set (Y) composed of element vectors having two attributes (Attribute1, Attribute2).

For example, referring to FIGS. 4 and 5, the element vectors of the original data set (X) and the reproduced data set (Y) can be plotted on the axes of the respective attributes (Attribute1, Attribute2). That is, each element vector (x₁, …, x_n) of the original data set (X) can be marked with a circle symbol, and each element vector (y₁, …, y_m) of the reproduced data set (Y) can be marked with a triangle symbol.

Here, among the distances between the element vectors (x₁, …, x_n) of the original data set (X) derived by the first distance calculation (n×(n-1) in total), the shortest distance for each element vector (x₁, …, x_n) can be indicated by a blue circle. In addition, among the distances between the element vectors of the original data set (X) and the reproduced data set (Y) derived by the second distance calculation (n×m in total), the shortest distance for each element vector (x₁, …, x_n) of the original data set (X) can be indicated by a red circle. In FIGS. 4 and 5, however, the blue and red circles are shown only for some of the element vectors (x₁, x₂, x₃).

Next, S220 is a step of deriving the similarity average between the original data set (X) and the reproduced data set (Y) using the first distance calculation and the second distance calculation. Here, the similarity average between the original data set (X) and the reproduced data set (Y) can be derived by comparing the first distance calculation result with the second distance calculation result for each element vector (x₁, …, x_n) of the original data set (X).

That is, the first and second distance calculation results for the first element vector (x₁) of the original data set (X) are compared, the first and second distance calculation results for the second element vector (x₂) of the original data set (X) are compared, and the comparison continues in this manner through the last, n-th element vector (x_n) of the original data set (X).

For example, the value (D₁₁) corresponding to the k-th smallest distance (where k is a natural number; when k is 1, this is the minimum distance) among the first distance calculation results for the first element vector (x₁) of the original data set (X) is compared with the value (D₁₂) corresponding to the k-th smallest distance among the second distance calculation results for the first element vector (x₁) of the original data set (X), thereby deriving the similarity between the original data set (X) and the reproduced data set (Y) with respect to the first element vector (x₁) (that is, the first similarity). If D₁₁≥D₁₂, the first similarity is derived as having a first specific value (for example, 1), and if D₁₁<D₁₂, the first similarity is derived as having a second specific value (for example, 0).

In addition, the value (D₂₁) corresponding to the k-th smallest distance among the first distance calculation results for the second element vector (x₂) of the original data set (X) is compared with the value (D₂₂) corresponding to the k-th smallest distance among the second distance calculation results for the second element vector (x₂) of the original data set (X), thereby deriving the similarity between the original data set (X) and the reproduced data set (Y) with respect to the second element vector (x₂) (that is, the second similarity). If D₂₁≥D₂₂, the second similarity is derived as having the first specific value, and if D₂₁<D₂₂, the second similarity is derived as having the second specific value.

In this manner, the first and second distance calculation results are compared through the last, n-th element vector (x_n) of the original data set (X). That is, the value (D_n1) corresponding to the k-th smallest distance among the first distance calculation results for the n-th element vector (x_n) is compared with the value (D_n2) corresponding to the k-th smallest distance among the second distance calculation results for the n-th element vector (x_n), thereby deriving the similarity between the original data set (X) and the reproduced data set (Y) with respect to the n-th element vector (x_n) (that is, the n-th similarity). If D_n1≥D_n2, the n-th similarity is derived as having the first specific value, and if D_n1<D_n2, the n-th similarity is derived as having the second specific value.

In general, the value (D_i1) corresponding to the k-th smallest distance among the first distance calculation results for the i-th element vector (x_i) of the original data set (X) is compared with the value (D_i2) corresponding to the k-th smallest distance among the second distance calculation results for the i-th element vector (x_i), thereby deriving the similarity between the i-th element vector (x_i) of the original data set (X) and the reproduced data set (Y) (that is, the i-th similarity). If D_i1≥D_i2, the i-th similarity is derived as having the first specific value, and if D_i1<D_i2, the i-th similarity is derived as having the second specific value (that is, a value smaller than the first specific value).

In particular, when k is 1, D_i1 and D_i2 each correspond to the minimum distance.

That is, in this case (k=1), the minimum value (min₁₁) of the first distance calculation results for the first element vector (x₁) of the original data set (X) is compared with the minimum value (min₁₂) of the second distance calculation results for the first element vector (x₁), thereby deriving the similarity between the original data set (X) and the reproduced data set (Y) with respect to the first element vector (x₁) (that is, the first similarity). If min₁₁≥min₁₂, the first similarity is derived as having the first specific value (for example, 1), and if min₁₁<min₁₂, the first similarity is derived as having the second specific value (for example, 0).

In addition, the minimum value (min₂₁) of the first distance calculation results for the second element vector (x₂) of the original data set (X) is compared with the minimum value (min₂₂) of the second distance calculation results for the second element vector (x₂), thereby deriving the similarity between the original data set (X) and the reproduced data set (Y) with respect to the second element vector (x₂) (that is, the second similarity). If min₂₁≥min₂₂, the second similarity is derived as having the first specific value, and if min₂₁<min₂₂, the second similarity is derived as having the second specific value.

In this manner, the first and second distance calculation results are compared through the last, n-th element vector (x_n) of the original data set (X). That is, the minimum value (min_n1) of the first distance calculation results for the n-th element vector (x_n) is compared with the minimum value (min_n2) of the second distance calculation results for the n-th element vector (x_n), thereby deriving the similarity between the original data set (X) and the reproduced data set (Y) with respect to the n-th element vector (x_n) (that is, the n-th similarity). If min_n1≥min_n2, the n-th similarity is derived as having the first specific value, and if min_n1<min_n2, the n-th similarity is derived as having the second specific value.

In general, the minimum value (min_i1) of the first distance calculation results for the i-th element vector (x_i) of the original data set (X) is compared with the minimum value (min_i2) of the second distance calculation results for the i-th element vector (x_i), thereby deriving the similarity between the original data set (X) and the reproduced data set (Y) with respect to the i-th element vector (x_i) (that is, the i-th similarity). If min_i1≥min_i2, the i-th similarity is derived as having the first specific value, and if min_i1<min_i2, the i-th similarity is derived as having the second specific value (that is, a value smaller than the first specific value).

In summary, the comparison of the first distance calculation result with the second distance calculation result for every element vector (x₁, …, x_n) of the original data set (X), that is, the i-th similarity, can be carried out using Equation 3 below.

s(x_i) = I({d(x_j, x_i)}^<k> ≥ {d(y_j', x_i)}^<k>) (Equation 3)

That is, s(x_i) represents the i-th similarity, d(x_j, x_i) is given by Equation 1, and d(y_j', x_i) is given by Equation 2. In addition, I is an indicator function: I(a≥b) yields the first specific value (for example, 1) if the condition a≥b is satisfied (that is, true), and yields the second specific value (for example, 0) if the condition a≥b is not satisfied (that is, false, meaning a<b). In addition, A^<k> denotes the k-th smallest element of the set A. That is, when k is 1, it denotes the element of the set A having the minimum value.
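The indicator comparison of Equation 3 can be illustrated with a minimal Python sketch. The names `kth_smallest` and `similarity` are hypothetical, the Euclidean distance is an assumed choice of d, and 1 and 0 are used for the first and second specific values as in the examples given in the description.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length attribute vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def kth_smallest(values, k):
    """A^<k>: the k-th smallest element of a collection (k = 1 is the minimum)."""
    return sorted(values)[k - 1]

def similarity(i, X, Y, k=1, d=euclidean):
    """s(x_i): 1 if the k-th smallest first distance >= the k-th smallest second distance, else 0."""
    first = [d(X[j], X[i]) for j in range(len(X)) if j != i]
    second = [d(y, X[i]) for y in Y]
    return 1 if kth_smallest(first, k) >= kth_smallest(second, k) else 0

X = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
Y = [(0.0, 3.0), (10.0, 0.0), (20.0, 0.0)]
scores = [similarity(i, X, Y) for i in range(len(X))]
```

In this toy data only x₃ has a reproduced vector at least as close as its nearest original neighbour, so with k = 1 the scores are 0, 0, and 1. Note that k must not exceed n-1 or m, otherwise no k-th smallest distance exists.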

Here, the similarity between the original data set (X) and the reproduced data set (Y) may correspond to the average of the first to n-th similarities derived for all element vectors (x₁, …, x_n) of the original data set (X). This value is referred to as the "similarity average" and is denoted by μ.

For example, when min_i1≥min_i2 is satisfied for all element vectors (x₁, …, x_n) of the original data set (X), so that the first specific value is derived in every case, the original data set (X) and the reproduced data set (Y) have practically almost identical data (that is, element vectors), and the similarity average takes its maximum value. That is, since the first to n-th similarities are all derived as the first specific value and the first specific value is thus derived n times, the similarity average (μ), which is the similarity between the original data set (X) and the reproduced data set (Y), takes its maximum value, namely the first specific value. If the first specific value is 1, the maximum value of the similarity average (μ) is 1.

Conversely, when min_i1<min_i2 holds for all element vectors (x₁, …, x_n) of the original data set (X), so that the second specific value is derived in every case, the original data set (X) and the reproduced data set (Y) have practically entirely different data (that is, element vectors), and the similarity average takes its minimum value. That is, since the first to n-th similarities are all derived as the second specific value and the second specific value is thus derived n times, the similarity average (μ), which is the similarity between the original data set (X) and the reproduced data set (Y), takes its minimum value, namely the second specific value. If the second specific value is 0, the minimum value of the similarity average is 0.

Therefore, the similarity (that is, the similarity average) (μ) between the original data set (X) and the reproduced data set (Y) can be derived using Equation 4 below.

$\mu = \frac{1}{n}\sum_{i=1}^{n} s(x_i)$ (Equation 4)

Here, s(x_i) represents the similarity between the i-th element vector (x_i) of the original data set (X) and the reproduced data set (Y), that is, the i-th similarity. In other words, the similarity average (μ) is the average of the first to n-th similarity values.
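The similarity average and its two extreme cases can be sketched as follows. This is an illustrative sketch under assumptions: `similarity_average` and the other function names are hypothetical, the Euclidean distance stands in for d, and 1 and 0 are used as the first and second specific values.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length attribute vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def similarity(i, X, Y, k=1, d=euclidean):
    """s(x_i): indicator comparing the k-th smallest first and second distances."""
    first = sorted(d(X[j], X[i]) for j in range(len(X)) if j != i)
    second = sorted(d(y, X[i]) for y in Y)
    return 1 if first[k - 1] >= second[k - 1] else 0

def similarity_average(X, Y, k=1, d=euclidean):
    """mu: the mean of s(x_1), ..., s(x_n)."""
    return sum(similarity(i, X, Y, k, d) for i in range(len(X))) / len(X)

X = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
mu_same = similarity_average(X, list(X))  # Y is a copy of X
mu_far = similarity_average(X, [(100.0, 100.0), (200.0, 200.0), (300.0, 300.0)])
mu_mixed = similarity_average(X, [(0.0, 3.0), (10.0, 0.0), (20.0, 0.0)])
```

When Y duplicates X, every second minimum distance is zero and μ reaches its maximum of 1; when Y lies far from every x_i, μ falls to its minimum of 0; the mixed case lands in between.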

Meanwhile, if the data distribution (probability distribution) of the original data set (X) and the data distribution (probability distribution) of the reproduced data set (Y) are identical (hereinafter referred to as "the first condition being satisfied"), and the original data set (X) and the reproduced data set (Y) are statistically independent (hereinafter referred to as "the second condition being satisfied"), then the probability of the inequality {d(x_j, x_i)}^<k> ≥ {d(y_j', x_i)}^<k> contained in I in Equation 3 (hereinafter denoted "P(≥)") can be expressed as Equation 5 below.

P(≥) = n/(2n-1) (Equation 5)

In particular, when the first condition is satisfied, the fact that the data distribution of the original data set (X) and the data distribution of the reproduced data set (Y) are identical means that, statistically, the amount of information in the original data set (X) and the amount of information in the reproduced data set (Y) are the same. In addition, when the second condition is satisfied, the fact that the original data set (X) and the reproduced data set (Y) are statistically independent means that the data of the reproduced data set (Y) were not generated using the data of the original data set (X). That is, the reproduced data set (Y) does not contain the personal information of the original data set (X). Accordingly, when both the first and second conditions are satisfied, P(≥) for Equation 3 satisfies Equation 5, and the similarity average (μ) takes the value of Equation 6 below.

$\mu_{true} = \frac{n}{2n-1}$ (Equation 6)

That is, μ_true is the reference value (optimal value), namely the similarity average (μ) obtained when the first and second conditions are satisfied.
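The probability in Equation 5 can be checked with a short counting argument. The sketch below is not taken from the description; it assumes k = 1, m = n, continuous attribute values (so that ties occur with probability zero), and 1 and 0 as the first and second specific values.

```latex
% Conditioned on x_i, the (n-1) first distances d(x_j, x_i), j \neq i, and the
% n second distances d(y_{j'}, x_i) are independent and identically distributed
% under the first and second conditions, so each of the (n-1) + n = 2n-1 values
% is equally likely to be the smallest.
\begin{aligned}
P(\geq) &= P\Big(\min_{j \neq i} d(x_j, x_i) \;\geq\; \min_{j'} d(y_{j'}, x_i)\Big)
         = \frac{n}{(n-1) + n} = \frac{n}{2n-1}, \\
E[\mu]  &= \frac{1}{n}\sum_{i=1}^{n} E[s(x_i)]
         = \frac{1}{n}\sum_{i=1}^{n} P(\geq)
         = \frac{n}{2n-1} \;\longrightarrow\; \frac{1}{2} \quad (n \to \infty).
\end{aligned}
```

Under these assumptions the reference value sits slightly above one half and approaches it as the number of element vectors grows.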

Next, S230 is a step of indicating the degree of exposure of personal information and the degree to which the amount of information is maintained, using the similarity average (μ) derived in S220.

FIG. 6 shows the degree of exposure of personal information and the degree to which the amount of information is maintained according to the value of the similarity average (μ).

Here, the similarity average (μ) derived in S220 takes a value from 0 (minimum) to 1 (maximum) according to Equation 4, and, as shown in FIG. 6, the degree of exposure of personal information and the degree to which the amount of information is maintained vary with that value. That is, the degree of exposure of personal information and the degree to which the amount of information is maintained can be indicated differently depending on whether the similarity average (μ) is close to 0, close to 1, or close to μ_true.

Referring to FIG. 6, the closer μ is to 0 (minimum), the lower the degree of quality preservation (that is, the degree to which the amount of information on personal information is maintained), whereas when μ exceeds the reference value (optimal value) μ_true (that is, a in FIG. 6), data reproduction has succeeded and the amount of information on personal information is maintained to an appropriate degree. In addition, the closer μ is to 1 (maximum), the higher the degree of exposure of personal information, whereas when μ is less than μ_true, exposure control has succeeded and the degree of exposure of personal information is appropriate.

For example, in FIG. 6, c is below a (μ_true) but close to it, so the degree of information retention for personal information has fallen only slightly, whereas b is very close to 0, so the degree of information retention has fallen considerably. In both c and b, however, the degree of exposure of personal information is successful (appropriate). Likewise, d exceeds a (μ_true) but is close to it, so the degree of exposure has risen only slightly, whereas e is very close to 1, so the degree of exposure has risen considerably. In both d and e, however, the degree of information retention is successful (appropriate).
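The reading of FIG. 6 described above can be sketched as a small helper. This is only an illustrative sketch: the function name, the returned keys, and the numeric positions chosen for a to e are assumptions introduced here, and treating μ = μ_true as appropriate on both counts is likewise an assumption made for the boundary case.

```python
def interpret_similarity_average(mu: float, mu_true: float) -> dict:
    """Classify a similarity average against the reference value mu_true.

    Hypothetical helper: names and return structure are illustrative only.
    """
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu must lie in [0, 1]")
    if not 0.0 < mu_true < 1.0:
        raise ValueError("mu_true must lie strictly between 0 and 1")
    return {
        # Retention is appropriate once mu reaches or exceeds mu_true.
        "retention_appropriate": mu >= mu_true,
        # Exposure is appropriate while mu stays at or below mu_true.
        "exposure_appropriate": mu <= mu_true,
        # How far retention has fallen short (0 when retention is appropriate).
        "retention_shortfall": max(0.0, mu_true - mu),
        # How far exposure has risen (0 when exposure is appropriate).
        "exposure_excess": max(0.0, mu - mu_true),
    }


# Hypothetical positions for the points a to e of FIG. 6.
points = {"a": 0.5, "b": 0.05, "c": 0.45, "d": 0.55, "e": 0.95}
for name, mu in points.items():
    print(name, interpret_similarity_average(mu, mu_true=points["a"]))
```

With these assumed positions, c and b are flagged as exposure-appropriate with a small and a large retention shortfall respectively, while d and e are flagged as retention-appropriate with a small and a large exposure excess.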

FIG. 7 shows an example in which the similarity average (μ) of FIG. 6 is displayed in graph form.

Meanwhile, the similarity average (μ), which takes a value from 0 to 1, can be represented in the graph form of FIG. 7. The graph can be expressed as the function f(μ)=(u, v), where u(μ) represents the data quality determined by μ (that is, the degree of information retention for personal information) and v(μ) represents the degree of exposure of personal information determined by μ.

In particular, by rendering the changes in u(μ) and v(μ) more dramatically, changes in the degree of information retention and the degree of exposure of personal information can be expressed more clearly. To this end, u(μ) is expressed so that it changes greatly in the region below μ_true and changes little in the region above μ_true. Conversely, v(μ) is expressed so that it changes little in the region below μ_true and changes greatly in the region above μ_true.

Accordingly, f(μ) can be expressed as in Equation 7 below, and its graph can be drawn as in FIG. 7. Here, t in the function f is a tuning parameter that determines how large the change in u(μ) is in the region below μ_true and how large the change in v(μ) is in the region above μ_true. In addition, g(μ) in Equation 7 can be expressed as in Equation 8. Because μ_true need not equal 0.5, g is a function that rescales μ so that μ_true maps to 0.5, and u(μ) and v(μ) can be calculated from the similarity average (μ) adjusted in this way.

(Equation 7)

(Equation 8)

In particular, when the similarity average (μ) equals the optimal value μ_true (that is, a), the product u(μ)×v(μ) takes its maximum value.
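The bodies of Equations 7 and 8 are not reproduced in this text, so the following is not those equations. It is a hypothetical sketch of one pair of functions that has the properties stated above: g rescales μ so that μ_true maps to 0.5, u changes only below μ_true, v changes only above μ_true, t controls how sharp those changes are, and u×v peaks at μ_true. The piecewise-linear form of g, the power-law form of u and v, and the orientation of v (a higher value meaning better-controlled exposure, chosen so that the product peaks at μ_true) are all assumptions made for illustration.

```python
def g(mu: float, mu_true: float) -> float:
    """Assumed piecewise-linear rescaling: 0 -> 0, mu_true -> 0.5, 1 -> 1."""
    if not 0.0 < mu_true < 1.0:
        raise ValueError("mu_true must lie strictly between 0 and 1")
    if mu <= mu_true:
        return 0.5 * mu / mu_true
    return 0.5 + 0.5 * (mu - mu_true) / (1.0 - mu_true)


def f(mu: float, mu_true: float, t: float = 2.0) -> tuple:
    """Return an illustrative (u, v) pair; not the patent's Equation 7."""
    if t <= 0:
        raise ValueError("t must be positive")
    z = g(mu, mu_true)
    # u rises from 0 to 1 on [0, mu_true] and stays at 1 afterwards.
    u = min(1.0, 2.0 * z) ** t
    # v stays at 1 on [0, mu_true] and falls from 1 to 0 afterwards.
    v = min(1.0, 2.0 * (1.0 - z)) ** t
    return u, v


# Example with an assumed reference value of 0.3.
for mu in (0.0, 0.15, 0.3, 0.65, 1.0):
    u, v = f(mu, mu_true=0.3)
    print(f"mu={mu:.2f}  u={u:.3f}  v={v:.3f}  u*v={u * v:.3f}")
```

In this sketch a larger t makes u fall away more steeply below μ_true and v fall away more steeply above it, which is one way to realize the role the text assigns to the tuning parameter.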

The present invention, configured as described above, has the advantage of simultaneously measuring two items related to personal information: the degree of exposure of personal information and the degree of information retention. That is, the present invention can derive a single value from which both the degree of exposure and the degree of information retention can be checked at once. The present invention also provides a technique for setting the optimal reference value needed to measure the degree of exposure and the degree of information retention. Furthermore, the present invention allows the degree of exposure and the degree of information retention to be grasped more intuitively through a graph or the like.

Although specific embodiments have been described in the detailed description of the present invention, various modifications are of course possible without departing from the scope of the present invention. Therefore, the scope of the present invention is not limited to the described embodiments and should be defined by the claims set out below and their equivalents.

100: device 110: input unit
120: communication unit 130: display
140: memory 150: control unit

Claims

A method performed on a device capable of computing, for measuring the degree of exposure of personal information and the degree of information retention, the method comprising:
calculating a first distance, which is the distance between element vectors within an original data set (X), and calculating a second distance, which is the distance between each element vector of the original data set (X) and each element vector of a reproduced data set (Y);
deriving a similarity average between the original data set (X) and the reproduced data set (Y) using the first distance and the second distance; and
indicating the degree of exposure of personal information and the degree of information retention using the derived similarity average.

The method of claim 1,
wherein the deriving step compares a first value, which is one of the first distances for the i-th element vector (x_i) of the original data set (X), with a second value, which is one of the second distances for the i-th element vector (x_i) of the original data set (X), and derives, according to the comparison result, the value of an i-th similarity, which is the similarity between the original data set (X) and the reproduced data set (Y) with respect to the i-th element vector (x_i).

The method of claim 2,
wherein the deriving step derives the i-th similarity as a first specific value if the first value ≥ the second value, and derives the i-th similarity as a second specific value if the first value < the second value.

The method of claim 1,
wherein the deriving step compares the minimum value (min_i1) of the first distances for the i-th element vector (x_i) of the original data set (X) with the minimum value (min_i2) of the second distances for the i-th element vector (x_i) of the original data set (X), and derives, according to the comparison result, the value of an i-th similarity, which is the similarity between the original data set (X) and the reproduced data set (Y) with respect to the i-th element vector (x_i).

The method of claim 4,
wherein the deriving step derives the i-th similarity as a first specific value if min_i1 ≥ min_i2, and derives the i-th similarity as a second specific value if min_i1 < min_i2.

The method of claim 3 or 5,
wherein the first specific value is greater than the second specific value.

The method of claim 3 or 5,
wherein the first specific value is 1 and the second specific value is 0.

The method of claim 1,
wherein the deriving step derives the i-th similarity using the following equation.
s(x_i) = I({d(x_j, x_i)}^<k> ≥ {d(y_j', x_i)}^<k>)
(Here, s(x_i) represents the i-th similarity; x_j and x_i are different element vectors selected from the element vectors (x_1, ..., x_n) of the original data set (X); y_j' is selected from the element vectors (y_1, ..., y_m) of the reproduced data set (Y); d(x_j, x_i) is the first distance between x_j and x_i; d(y_j', x_i) is the second distance between y_j' and x_i; I(·) yields the first specific value if the condition in parentheses is satisfied and the second specific value otherwise; and A^<k> represents the k-th smallest element of set A.)
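As an illustration only, the i-th similarity and the similarity average recited above can be sketched in Python with NumPy. The function and parameter names are hypothetical, the Euclidean norm is an assumed choice for the distance d, and taking the similarity average as the arithmetic mean of the s(x_i) is likewise an assumption of this sketch.

```python
import numpy as np


def similarity_average(X, Y, k=1, first_value=1.0, second_value=0.0):
    """Return (mu, s): the mean similarity and the per-element similarities.

    Hypothetical sketch; d is assumed to be the Euclidean distance.
    X has shape (n, dim) and Y has shape (m, dim).
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n, m = len(X), len(Y)
    if k < 1 or k > n - 1 or k > m:
        raise ValueError("k must satisfy 1 <= k <= min(n - 1, m)")

    # First distances d(x_j, x_i); the diagonal (j == i) is excluded.
    d_xx = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d_xx, np.inf)

    # Second distances d(y_j', x_i); row i holds the distances from x_i.
    d_xy = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)

    # k-th smallest element of each distance set, i.e. A^<k>.
    kth_xx = np.sort(d_xx, axis=1)[:, k - 1]
    kth_xy = np.sort(d_xy, axis=1)[:, k - 1]

    # Indicator I(.): first_value where the inequality holds, else second_value.
    s = np.where(kth_xx >= kth_xy, first_value, second_value)
    return s.mean(), s


X = [[0.0, 0.0], [10.0, 10.0]]
mu_copy, _ = similarity_average(X, X)  # Y reproduces X exactly
mu_far, _ = similarity_average(X, [[100.0, 100.0], [200.0, 200.0]])
print(mu_copy, mu_far)
```

With k = 1 this reduces to comparing the minimum first distance with the minimum second distance for each x_i, and with the default values 1 and 0 the resulting average lies between 0 and 1.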

The method of claim 1,
wherein the indicating step indicates that the degree of information retention for personal information gradually decreases as the similarity average approaches its minimum value, and that the degree of exposure of personal information gradually increases as the similarity average approaches its maximum value.

The method of claim 9,
wherein the indicating step indicates that the degree of information retention and the degree of exposure of personal information are optimal when the similarity average equals a reference value between the minimum value and the maximum value.

The method of claim 10,
wherein the indicating step indicates that the degree of information retention for personal information is appropriate if the similarity average exceeds the reference value, and that the degree of exposure of personal information is appropriate if the similarity average is below the reference value.

The method of claim 9,
wherein the reference value is the value of the similarity average obtained when the data distribution of the original data set (X) matches the data distribution of the reproduced data set (Y) and the original data set (X) and the reproduced data set (Y) are statistically independent.

The method of claim 1,
wherein the indicating step represents the similarity average as a graph of the following equation.
f(μ)=(u, v)
(Here, μ represents the similarity average, u(μ) represents the degree of information retention for personal information determined by μ, and v(μ) represents the degree of exposure of personal information determined by μ.)

The method of claim 1,
wherein the indicating step represents the degree of information retention for personal information, determined according to the similarity average, so that it changes greatly in the region below a reference value and changes little in the region above the reference value.

The method of claim 1,
wherein the indicating step represents the degree of exposure of personal information, determined according to the similarity average, so that it changes little in the region below a reference value and changes greatly in the region above the reference value.

The method of claim 14 or 15,
wherein the indicating step represents the value of u(μ)×v(μ) so that it is at its maximum when the similarity average (μ) equals the reference value.
(Here, u(μ) represents the degree of information retention for personal information determined by μ, and v(μ) represents the degree of exposure of personal information determined by μ.)

A device comprising:
a memory storing an original data set (X) and a reproduced data set (Y); and
a control unit that controls measurement of the degree of exposure of personal information and the degree of information retention based on the information stored in the memory,
wherein the control unit:
controls calculation of a first distance, which is the distance between element vectors within the original data set (X), and calculation of a second distance, which is the distance between each element vector of the original data set (X) and each element vector of the reproduced data set (Y);
controls derivation of a similarity average between the original data set (X) and the reproduced data set (Y) using the first distance and the second distance; and
controls indication of the degree of exposure of personal information and the degree of information retention using the derived similarity average.

The device of claim 17,
wherein, when deriving the similarity average, the control unit controls so as to compare a first value, which is one of the first distances for the i-th element vector (x_i) of the original data set (X), with a second value, which is one of the second distances for the i-th element vector (x_i) of the original data set (X), and to derive, according to the comparison result, the value of an i-th similarity, which is the similarity between the original data set (X) and the reproduced data set (Y) with respect to the i-th element vector (x_i).

The device of claim 18,
wherein, when deriving the similarity average, the control unit controls so as to derive the i-th similarity as a first specific value if the first value ≥ the second value, and to derive the i-th similarity as a second specific value if the first value < the second value.

The device of claim 17,
wherein the control unit indicates that the degree of information retention for personal information gradually decreases as the similarity average approaches its minimum value and that the degree of exposure of personal information gradually increases as the similarity average approaches its maximum value, and indicates that the degree of information retention is appropriate if the similarity average exceeds a reference value between the minimum value and the maximum value and that the degree of exposure is appropriate if the similarity average is below the reference value.