KR102153161B1

KR102153161B1 - Method and system for learning structure of probabilistic graphical model for ordinal data

Info

Publication number: KR102153161B1
Application number: KR1020170177372A
Authority: KR
Inventors: 양은호; 심하진
Original assignee: 한국과학기술원
Priority date: 2017-12-21
Filing date: 2017-12-21
Publication date: 2020-09-08
Also published as: WO2019124724A1; KR20190075631A

Abstract

Disclosed are a method and system for learning associations in ordinal data based on a probabilistic graph. According to an embodiment, a method of learning associations in ordinal data, performed by an association learning system, may include: predicting the correlations between the variables in the ordinal data; and providing the predicted correlations between the variables as a graph.

Description

Method and system for learning associations in ordinal data based on a probabilistic graph {METHOD AND SYSTEM FOR LEARNING STRUCTURE OF PROBABILISTIC GRAPHICAL MODEL FOR ORDINAL DATA}

The following description relates to a method and system for learning associations in ordinal data based on a probabilistic graph.

An undirected graphical model, also called a Markov random field (MRF), models a multivariate random vector and uses an undirected graph to encode the conditional independence structure among its variables. This conditional independence structure provides useful insight into how the variables interact with one another. As a result, MRFs are widely used in fields such as natural language processing, biology, and medicine.

Korean Patent Laid-Open Publication No. 10-2013-0052432 relates to a pattern recognition method based on a Markov chain hidden conditional random field model. It discloses a configuration in which a feature vector is extracted from a training input signal measured for a specific pattern; a hidden conditional random field model using a mixture of full-covariance Gaussian distributions receives a plurality of pairs of feature vectors and labels indicating the specific pattern and obtains its parameters; and the hidden conditional random field model with those parameters applied receives a feature vector extracted from a test input signal measured for an actual pattern and infers a label indicating the actual pattern.

A method and system for learning associations in ordinal data based on a probabilistic graph can be provided.

A method of learning associations in ordinal data, performed by an association learning system, may include: predicting the correlations between the variables in the ordinal data; and providing the predicted correlations between the variables as a graph.

The predicting of the correlations between the variables in the ordinal data may include specifying node-conditional distributions through a univariate ordinal distribution for each variable, and searching for a joint distribution by analyzing the specified node-conditional distributions.

In the predicting of the correlations between the variables in the ordinal data, given a p-dimensional random vector over an ordinal domain and a graph having a node corresponding to each random variable, when all node-conditional distributions of the random vector follow the univariate cumulative ratio model of Equation 1, the location parameter for each node may be an arbitrary function of the remaining variables.

In the predicting of the correlations between the variables in the ordinal data, given a p-dimensional random vector Y over an ordinal domain and a graph having a node corresponding to each random variable, when all node-conditional distributions of the random vector follow the univariate continuation ratio model of Equation 2, the location parameter for each node may be an arbitrary function of the remaining variables, and, for […], there may exist real-valued parameters for which the specified node-conditional distributions are not consistent with any joint distribution over the random vector Y that is Markov with respect to the graph.

In the predicting of the correlations between the variables in the ordinal data, the consecutive ratio model for an ordinal random variable belongs to an exponential family with sufficient statistics […]; when this univariate ordinal distribution is used to specify the node-conditional distributions of an ordinal random vector, the conditional distribution at each node may be expressed as in Equation 3, and the location parameter may be an arbitrary function of the remaining variables.

The predicting of the correlations between the variables in the ordinal data may include the node-conditional distributions being consistent with a joint distribution. When that joint distribution is Markov with respect to the graph and pairwise, that is, has factors of size at most two, it may be expressed as in Equation 4, and, to estimate the parameters of the consecutive ratio model, a regularized node-conditional log-likelihood maximization problem may be solved at each node.

In the predicting of the correlations between the variables in the ordinal data, a multivariate quantized ordinal distribution whose multivariate latent random vector is multivariate Gaussian is called a multivariate probit model, in which dependencies are expressed through the Gaussian distribution of the latent random vector. In the multivariate probit model, the ordinal random vector may be generated from a latent multivariate Gaussian random vector, with each ordinal component obtained by discretizing the corresponding latent component.

In the predicting of the correlations between the variables in the ordinal data, each ordinal component takes a given level if and only if the corresponding latent component lies between the thresholds for that level; the density function of Y may be given as in Equation 5, where the region of integration may be the hypercube defined by the thresholds.

In the predicting of the correlations between the variables in the ordinal data, when the observations are realizations of a random vector Y drawn from a probit model with given parameters, the regularized maximum likelihood (ML) estimator that learns the parameters from the observations may take the form of Equation 6, where the penalty may be the entrywise norm excluding the diagonal entries.

The predicting of the correlations between the variables in the ordinal data may include, in order to estimate the unknown parameters of the probit graphical model distribution, estimating the thresholds from the univariate marginals and estimating the polychoric correlations from the bivariate marginal distributions.

The predicting of the correlations between the variables in the ordinal data may include, in order to estimate the polychoric correlations from the bivariate marginal distributions, computing a raw estimate from the bivariate marginal likelihoods, and plugging the estimated covariance matrix into a graphical lasso estimator to estimate a sparse latent graph and a smoothed estimate.

In the predicting of the correlations between the variables in the ordinal data, when the correlation for a pair of variables is estimated, the joint distribution of the pair may be multinomial, with probabilities determined by the corresponding pair of latent variables, whose distribution is bivariate normal with mean [0, 0] and a covariance determined by the correlation; the parameter may be estimated through Equation 7 by maximizing the bivariate marginal log-likelihood function.

The predicting of the correlations between the variables in the ordinal data may include, in order to estimate the correlation, replacing the thresholds with their estimator and maximizing the log-likelihood as in Equation 8, where the domain of the correlation may be (-1, 1).

The predicting of the correlations between the variables in the ordinal data may include plugging the estimate into a parametric Gaussian graphical model estimator to obtain the structure of the graph and the final covariance.

The association learning system may include: a prediction unit that predicts the correlations between the variables in ordinal data; and a providing unit that provides the predicted correlations between the variables as a graph.

An association learning system according to an embodiment makes it possible to identify relationships through analysis of ordinal data.

An association learning system according to an embodiment can identify the associations among variables in ordinal data analysis and thereby improve understanding of how the data is generated and structured.

An association learning system according to an embodiment can recommend specific information based on the associations among the variables.

FIG. 1 shows a comparison of various estimators when data is generated from a probit model with a chain graph structure, according to an embodiment.
FIGS. 2 and 3 show data sampled from a Consec model with a 2D grid structure (10 x 5 grid), according to an embodiment.
FIG. 4 shows the latent graph structure corresponding to SmokeNow and socio-demographic indicators, according to an embodiment.
FIG. 5 is a block diagram illustrating the configuration of an association learning system according to an embodiment.
FIG. 6 is a flowchart illustrating a method of performing association learning in the association learning system according to an embodiment.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings.

FIG. 5 is a block diagram illustrating the configuration of an association learning system according to an embodiment, and FIG. 6 is a flowchart illustrating a method of performing association learning in the association learning system according to an embodiment.

The association learning system 100 learns associations in ordinal data based on a probabilistic graph and may include a prediction unit 510 and a providing unit 520. These components may control the association learning system 100 to perform steps 610 to 620 of the association learning method of FIG. 6.

In step 610, the prediction unit 510 may predict the correlations between the variables in the ordinal data, and in step 620, the providing unit 520 may provide the predicted correlations between the variables as a graph. Learning associations in ordinal data based on a probabilistic graph is described in detail below.

Multivariate ratio-based models are described first. A node-conditional distribution is specified through a conventional univariate ordinal distribution, and the corresponding joint distribution is then sought through a Hammersley-Clifford-esque analysis.

- MRFs via Univariate Latent Quantized Ordinal Models

Suppose there is a real-valued latent random variable Z whose CDF has a location parameter. An ordinal random variable Y can then be written as a discretized version of the real-valued variable Z with respect to a set of location (cut point) parameters.

The probability mass function of the ordinal variable Y can be expressed as follows.

Equation 1:

A popular distribution for the latent real-valued variable Z is the univariate logistic distribution, in which case the CDF above becomes the logistic function. The distribution described above can then be expressed more concisely in terms of log-odds ratios.

This class of ordinal distributions is therefore also called the cumulative ratio model.

The univariate ordinal distribution of Equation 1 can be used to specify node-conditional distributions, from which one can attempt to derive a consistent joint distribution.
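The cumulative ratio model described above can be sketched numerically as follows. This is a minimal illustration rather than the patent's own formula: it assumes the common textbook form P(Y <= j) = F(theta_j - eta) with a logistic F, and the names `cut_points`, `eta`, and `cumulative_ratio_pmf` are hypothetical.

```python
import math


def logistic(x):
    """Logistic CDF, assumed here as the latent distribution F."""
    return 1.0 / (1.0 + math.exp(-x))


def cumulative_ratio_pmf(cut_points, eta):
    """PMF of an ordinal Y in {1, ..., M} from M-1 increasing cut points.

    Assumed form: P(Y <= j) = F(theta_j - eta), hence
    P(Y = j) = F(theta_j - eta) - F(theta_{j-1} - eta),
    with theta_0 = -inf and theta_M = +inf.
    """
    cdf = [0.0] + [logistic(t - eta) for t in cut_points] + [1.0]
    return [cdf[j + 1] - cdf[j] for j in range(len(cdf) - 1)]


# Four ordinal levels defined by three cut points.
pmf = cumulative_ratio_pmf([-1.0, 0.0, 1.5], eta=0.3)
# Under this form the cumulative log-odds log(P(Y<=1) / P(Y>1)) equals theta_1 - eta.
log_odds_first = math.log(pmf[0] / (1.0 - pmf[0]))
```

Under this assumed form, shifting `eta` moves probability mass toward higher or lower levels without changing the cut points, which is what makes it a location parameter.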

Let Y be a p-dimensional ordinal random vector. To simplify notation, the domains of the random variables are assumed below to be identical. Let G be a graph with a node corresponding to each random variable.

For each node, the node-conditional distribution can be expressed as in Equation 2.

Equation 2:

Here, the location parameter is an arbitrary function of the remaining variables, and the CDF is the logistic function. The following theorem shows that these node-conditional distributions do not lead to a consistent joint distribution.

Theorem 1: Consider a p-dimensional random vector Y over an ordinal domain, and let G be a graph with a node corresponding to each random variable. Suppose all node-conditional distributions of the random vector follow the univariate cumulative ratio model of Equation 2, where for each node the location parameter is an arbitrary function of the remaining variables.

Then, for […], there exist real-valued parameters for which the specified node-conditional distributions are not consistent with any joint distribution over Y that is Markov with respect to the graph G with factors of size at most two.

- MRFs via Continuation Ratio Models

A modification of the log-odds ratio that is closely related to the cumulative ratio model is considered next.

The resulting class of univariate distributions is called the continuation ratio model. From the log-odds ratio above, the probability mass function (PMF) of the random variable Y can be derived as follows.

Equation 3:

The model can then be modified as follows so that the PMF sums to 1.
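The continuation ratio construction can be illustrated with a short sketch. It assumes the standard continuation-ratio logit form, log(P(Y = j) / P(Y > j)) = theta_j - eta, which may differ in detail from the patent's Equation 3; the function and variable names are hypothetical. The last level receives the remaining mass, which is one way to make the PMF sum to 1.

```python
import math


def logistic(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def continuation_ratio_pmf(thetas, eta):
    """PMF of an ordinal Y in {1, ..., M} under an assumed continuation ratio form.

    Assumption: P(Y = j | Y >= j) = logistic(theta_j - eta) for j < M,
    which is equivalent to log(P(Y = j) / P(Y > j)) = theta_j - eta.
    """
    pmf = []
    remaining = 1.0  # P(Y >= j) for the current level j
    for theta in thetas:
        hazard = logistic(theta - eta)
        pmf.append(remaining * hazard)
        remaining *= 1.0 - hazard
    pmf.append(remaining)  # last level takes the leftover mass
    return pmf


pmf = continuation_ratio_pmf([-0.5, 0.2, 1.0], eta=0.1)
# Check the defining log-odds for the second level: P(Y = 2) / P(Y > 2).
log_ratio_second = math.log(pmf[1] / (pmf[2] + pmf[3]))
```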

In particular, suppose that for each node the node-conditional distribution is given by Equation 4.

Equation 4:

Here, the location parameter is an arbitrary function of the remaining variables. The following theorem shows that these node-conditional distributions do not yield a consistent joint distribution.

Theorem 2: Consider a p-dimensional random vector Y over an ordinal domain, and let G be a graph with a node corresponding to each random variable. Suppose all node-conditional distributions of the random vector follow the univariate continuation ratio model of Equation 4, where for each node the location parameter is an arbitrary function of the remaining variables.

Then, for […], there exist real-valued parameters for which the specified node-conditional distributions are not consistent with any joint distribution over Y that is Markov with respect to the undirected graph G.

- MRFs via a Consecutive Ratio Model

The univariate cumulative ratio and continuation ratio models do not belong to an exponential family and, in particular, lack the regularity needed for a consistent joint distribution to exist for node-conditionals of these forms.

A third class of univariate ordinal distributions, called the consecutive ratio model, is therefore considered, defined as follows.

As shown below, this ordinal distribution belongs to an exponential family, unlike the ordinal distributions described above.

Proposition 1: The consecutive ratio model for an ordinal random variable belongs to an exponential family with sufficient statistics […].

Suppose this univariate ordinal distribution is used to specify the node-conditional distributions of an ordinal random vector. In particular, for each node, the node-conditional distribution can be expressed as in Equation 5.

Equation 5:

Here, the location parameter is an arbitrary function of the remaining variables. Because the node-conditional distributions belong to a univariate exponential family, applying Proposition 1 yields the following theorem.

Theorem 3: The node-conditional distributions in Equation 5 are consistent with a joint distribution. When that joint distribution is Markov with respect to an undirected graph and pairwise, with factors of size at most two, it takes the following form.

The distribution in Theorem 3 can equivalently be rewritten as Equation 6.

Equation 6:

In this form, the ordering of Y enters through the pairwise terms.

To estimate the parameters of the consecutive ratio model of Equation 6, a regularized node-conditional log-likelihood maximization problem is solved at each node.

Here, the observations are the training samples.

Existing results on statistical guarantees for estimators of exponential family graphical models carry over to the consecutive ratio model.

Contrast with discrete/nominal graphical models: The consecutive ratio model of Equation 6 is contrasted with the classical discrete nominal graphical model, which treats the random variable at each node as a nominal variable. For the random vector Y, consider a discrete graphical model of the form of Equation 7.

Equation 7:

Unlike the consecutive ratio model, the discrete graphical model does not share a common edge parameter across the different values of the two variables on an edge. Each edge of the categorical model is parameterized by M² parameters. As a result, the discrete graphical model does not use the ordering of Y and is more complex than the consecutive ratio model. Although this parameterization subsumes that of the consecutive ratio model, its main drawback is that the nominal graphical model has more parameters and therefore a larger sample complexity.
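The difference in parameter count can be made concrete with a small calculation. The sketch below counts edge parameters only and assumes, for illustration, that the ordinal model uses a single shared parameter per edge while the nominal model uses M² per edge, as stated above; node-wise parameters are ignored, and the helper names are hypothetical. The 10 x 5 grid matches the grid size mentioned for FIGS. 2 and 3.

```python
def grid_edges(rows, cols):
    """Number of edges in a 2D grid graph with rows x cols nodes."""
    return rows * (cols - 1) + (rows - 1) * cols


def edge_parameter_counts(num_edges, num_levels):
    """Return (nominal, ordinal) edge-parameter counts.

    Nominal model: M^2 parameters per edge.
    Ordinal model: one shared parameter per edge (an assumption made
    here for illustration).
    """
    nominal = num_edges * num_levels ** 2
    ordinal = num_edges
    return nominal, ordinal


edges = grid_edges(10, 5)
nominal_count, ordinal_count = edge_parameter_counts(edges, num_levels=5)
print(edges, nominal_count, ordinal_count)  # prints: 85 2125 85
```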

- Multivariate Latent Quantized Models

Constructing a multivariate ordinal graphical model directly from a univariate ordinal distribution is now considered. The classical and most popular class of univariate ordinal distributions, based on quantization of a real-valued latent variable, is revisited. A natural class of multivariate distributions is obtained by taking a multivariate latent random vector and quantizing it to obtain a multivariate ordinal random vector.

- Probit Graphical Model

The most common example of a multivariate quantized ordinal distribution is the case in which the multivariate latent random vector is multivariate Gaussian, which is also known as the multivariate probit model. Dependencies are thus expressed through the Gaussian distribution of the latent random vector.

In the probit model, the ordinal random vector is assumed to be generated from a latent multivariate Gaussian random vector. Each ordinal component is obtained by discretizing the corresponding latent component as follows.

Each ordinal component takes a given level if and only if the corresponding latent component lies between the thresholds for that level. The density function of Y is then given by Equation 8.

Equation 8:

Here, the region of integration is the hypercube defined by these thresholds.
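The generative process of the probit model can be sketched as follows. This is an illustrative simulation under stated assumptions, not the patent's procedure: the latent Gaussian is given a chain (AR(1)) dependence with unit variances, all variables share one set of thresholds, and the names `sample_probit_chain`, `rho`, and `thresholds` are hypothetical.

```python
import bisect
import random


def sample_probit_chain(n, p, rho, thresholds, seed=0):
    """Draw n ordinal vectors of length p from a latent Gaussian chain.

    Latent: Z_1 ~ N(0, 1) and Z_s = rho * Z_{s-1} + sqrt(1 - rho^2) * eps_s,
    so every Z_s is standard normal and only neighbours interact directly.
    Observed: Y_s = j iff theta_{j-1} <= Z_s < theta_j, with levels 1..M,
    theta_0 = -inf and theta_M = +inf.
    """
    rng = random.Random(seed)
    scale = (1.0 - rho * rho) ** 0.5
    samples = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        row = []
        for s in range(p):
            if s > 0:
                z = rho * z + scale * rng.gauss(0.0, 1.0)
            # Number of thresholds <= z, plus one, is the ordinal level.
            row.append(bisect.bisect_right(thresholds, z) + 1)
        samples.append(row)
    return samples


data = sample_probit_chain(n=200, p=6, rho=0.6, thresholds=[-0.5, 0.5])
```

Only the ordinal values in `data` would be observed in practice; the latent values are discarded, which is why the thresholds and correlations have to be estimated from marginal frequencies.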

Suppose the observations are realizations of a random vector Y drawn from a probit model with given parameters. Then the regularized maximum likelihood (ML) estimator that learns the parameters from the observations takes the form of Equation 9.

Equation 9:

Here, the penalty is the entrywise norm excluding the diagonal entries. The objective is non-convex and in general hard to optimize. Approximate EM-based approaches have been proposed for learning the model parameters, but they remain computationally demanding and do not provide strong statistical guarantees with respect to the true regularized MLE solution.

- A Direct Estimation Method

An alternative procedure is proposed for estimating the unknown parameters of the probit graphical model distribution in Equation 8. It is a two-step procedure: the first step estimates the thresholds from the univariate marginals, and the second step estimates the polychoric correlations from the bivariate marginal distributions.

- Estimation of Thresholds

The estimator of the thresholds is defined as follows.

Here, the estimator uses the CDF of the standard normal distribution, an indicator function, and the relevant coordinate of each sample vector. This estimator can be shown to estimate the thresholds consistently.
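A threshold estimator of this kind can be sketched as follows. The sketch assumes the usual construction in which each threshold is the standard normal quantile of the empirical cumulative frequency of the corresponding level; the clipping step and the names `estimate_thresholds` and `column` are additions for illustration and are not taken from the source.

```python
from statistics import NormalDist


def estimate_thresholds(column, num_levels):
    """Estimate the M-1 thresholds of one ordinal variable.

    Assumed form: theta_hat_j = Phi^{-1}((1/n) * sum_i 1[y_i <= j]),
    where Phi is the standard normal CDF and 1[.] is the indicator function.
    """
    n = len(column)
    std_normal = NormalDist()
    thresholds = []
    for level in range(1, num_levels):
        frac = sum(1 for y in column if y <= level) / n
        # Clip away from 0 and 1 so the inverse CDF stays finite
        # (an illustrative safeguard, not part of the source).
        frac = min(max(frac, 0.5 / n), 1.0 - 0.5 / n)
        thresholds.append(std_normal.inv_cdf(frac))
    return thresholds


# 25% of samples at level 1, 50% at level 2, 25% at level 3.
column = [1] * 25 + [2] * 50 + [3] * 25
theta_hat = estimate_thresholds(column, num_levels=3)
```

With these frequencies the two estimates are the 25% and 75% quantiles of the standard normal distribution, roughly -0.674 and 0.674.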

- Estimation of Correlations and the Latent Graph Structure

A two-step approach is presented for estimating the correlations. In the first step, a raw estimate is computed from the bivariate marginal likelihoods. In the second step, the estimated covariance matrix is plugged into the graphical lasso estimator to estimate a sparse latent graph and a smoothed estimate.

Step 1: An independent optimization problem is solved to estimate each entry of the correlation matrix. Suppose the correlation for a pair of variables is to be estimated. The joint distribution of the pair is multinomial, with probabilities determined by the corresponding pair of latent variables, whose distribution is bivariate normal with mean [0, 0] and a covariance determined by the correlation.

If the thresholds were known, the unknown parameter could be estimated by maximizing the bivariate marginal log-likelihood function, as follows.

Equation 10:

However, the thresholds are not known. To estimate the correlation, the thresholds are therefore replaced by their estimator, and the following log-likelihood is maximized.

The domain of the correlation is (-1, 1) unless additional restrictions on the covariance are imposed. Because this is a one-dimensional optimization problem, under certain regularity conditions such as smoothness of the objective it can be solved to within a given error, in a correspondingly bounded time, by simply evaluating the objective over a fine grid and selecting the best grid point.
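The grid evaluation described above can be sketched for a single pair of variables. The code below is an illustration under stated assumptions: the cell probabilities are rectangle probabilities of a standard bivariate normal, computed here by simple midpoint integration (an implementation choice, not from the source), and the names `rect_prob`, `polychoric_grid`, and `counts` are hypothetical.

```python
import math


def norm_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def rect_prob(a, b, c, d, rho, steps=200):
    """P(a <= Z1 < b, c <= Z2 < d) for a standard bivariate normal.

    Integrates the density of Z1 times the conditional probability for Z2
    given Z1 with the midpoint rule; infinite limits on Z1 are cut at +/-8.
    """
    lo, hi = max(a, -8.0), min(b, 8.0)
    if hi <= lo:
        return 0.0
    s = math.sqrt(1.0 - rho * rho)
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        z = lo + (i + 0.5) * h
        density = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        total += density * (norm_cdf((d - rho * z) / s) - norm_cdf((c - rho * z) / s))
    return total * h


def polychoric_grid(counts, thr_s, thr_t, grid_size=99):
    """Grid-search the correlation maximizing the bivariate log-likelihood.

    counts[j][k] is the number of samples with (Y_s, Y_t) = (j + 1, k + 1);
    thr_s and thr_t are the (estimated) thresholds of the two variables.
    """
    cuts_s = [-math.inf] + list(thr_s) + [math.inf]
    cuts_t = [-math.inf] + list(thr_t) + [math.inf]
    best_rho, best_ll = 0.0, -math.inf
    for g in range(1, grid_size + 1):
        rho = -1.0 + 2.0 * g / (grid_size + 1)  # grid strictly inside (-1, 1)
        ll = 0.0
        for j, row in enumerate(counts):
            for k, n_jk in enumerate(row):
                if n_jk:
                    p = rect_prob(cuts_s[j], cuts_s[j + 1], cuts_t[k], cuts_t[k + 1], rho)
                    ll += n_jk * math.log(max(p, 1e-300))
        if ll > best_ll:
            best_rho, best_ll = rho, ll
    return best_rho


# Binary case with thresholds at 0: cell (1, 1) has probability 1/4 + asin(rho) / (2*pi),
# which is 1/3 when rho = 0.5.
rho_hat = polychoric_grid([[3333, 1667], [1667, 3333]], thr_s=[0.0], thr_t=[0.0])
```

A finer grid reduces the discretization error at a proportional cost in objective evaluations, which is the trade-off the paragraph above refers to.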

Step 2: The estimate is plugged into a parametric Gaussian graphical model estimator to obtain the graph structure and the final covariance. Any consistent parametric Gaussian estimator (for example, the graphical lasso estimator, CLIME, or the graphical Dantzig selector) can be used to estimate the latent graph structure; the description here is based on the graphical lasso estimator, which solves the following optimization problem.

Equation 11:

Here, <<A, B>> denotes the trace inner product of A and B.
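For reference, the trace inner product and a graphical lasso objective built from it can be evaluated as in the sketch below. The objective shown is the standard graphical lasso form, trace inner product minus log-determinant plus an off-diagonal l1 penalty; it is assumed here to correspond to Equation 11, and the function names and the weight `lam` are hypothetical. The sketch only evaluates the objective and does not perform the minimization.

```python
import math


def trace_inner(a, b):
    """Trace inner product <<A, B>> = trace(A^T B) = sum of entrywise products."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def log_det_spd(a):
    """log det of a symmetric positive definite matrix via Cholesky factorization."""
    n = len(a)
    chol = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                chol[i][j] = math.sqrt(s)
            else:
                chol[i][j] = s / chol[j][j]
    return 2.0 * sum(math.log(chol[i][i]) for i in range(n))


def glasso_objective(theta, sigma_hat, lam):
    """Assumed objective: <<Theta, Sigma_hat>> - log det Theta + lam * ||Theta||_{1,off}."""
    n = len(theta)
    off_l1 = sum(abs(theta[i][j]) for i in range(n) for j in range(n) if i != j)
    return trace_inner(theta, sigma_hat) - log_det_spd(theta) + lam * off_l1


sigma_hat = [[1.0, 0.5], [0.5, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
# Exact inverse of sigma_hat, which minimizes the unpenalized objective.
inverse = [[4.0 / 3.0, -2.0 / 3.0], [-2.0 / 3.0, 4.0 / 3.0]]
obj_identity = glasso_objective(identity, sigma_hat, lam=0.0)
obj_inverse = glasso_objective(inverse, sigma_hat, lam=0.0)
```

With `lam` set to zero the inverse of the input matrix gives the smaller objective value; a positive `lam` trades some of that fit for zeros in the off-diagonal entries, which is how the sparse latent graph arises.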

- Theoretical Properties

The direct estimation method described above is not only simple but also comes with strong statistical guarantees. Specifically, error bounds are provided for the estimate of the inverse covariance, and sparsistency is shown with respect to recovery of the graphical model structure. For simplicity, […] is assumed to be given; the extension to the case where it is unknown should be straightforward.

Some notation is introduced first. A matrix is defined using the Kronecker matrix product; it is the Hessian of -log det(A) evaluated at […]. Let S be the index set of all nonzero entries of […], and let S^c be the complement of S. For notational simplicity, a further quantity is defined using the maximum absolute row sum norm. Let d be the maximum node degree in the latent graph. A population version of the sample loss defined in Equation 10 is also used. The assumptions are stated below.

(c-1) There exists […] such that […].

(c-2) There is a constant […] such that […]. Moreover, the likelihood function […] has a positive […] such that […].

(c-3) The absolute values of the first, second, and third derivatives of […] are bounded above by L1, L2, and L3, respectively, […]. Moreover, a mild regularity property holds: […] has no degenerate critical points in […].

(c-1) is the standard incoherence assumption made to obtain guarantees for the glasso estimator.

(c-2) is a mild condition ensuring that no two latent variables are collinear and that every category of each ordinal variable has nonzero probability.

Theorem 4: Consider the estimator of Equation 11 for the latent Gaussian model with parameters […]. Assume that the conditions c-1 and c-3 are satisfied. Then there exist constants C1, C2, and C3, depending on L1, L2, L3, M, […], and […], such that if […] and n is lower bounded by […], then […], and the estimate […] satisfies the following bound.

Equation 12:

With probability at least […], and provided […], the latent Gaussian graph structure encoded by […] can be consistently recovered by […].

It is first shown that the estimate […] from step 1 satisfies, with high probability, […]. Then, using the consistency property of the glasso, it is shown that the estimate […] from step 2 satisfies Equation 12 with high probability. To this end, the properties of the critical points of the non-convex empirical risk minimization problem are studied.

FIG. 1 compares various estimators when the data are generated from a probit model with a chain graph structure. The top and bottom rows correspond to […] = -0.3 and […] = -0.9, respectively. The two left columns show the ROC curves for n = 50 and n = 100. The three right columns show performance in terms of log-likelihood, Frobenius loss, and entropy loss.

Evaluation metrics: ROC curves, computed by varying the regularization parameter, are used to compare the estimators with the baselines on graph structure recovery. When the data are generated from the probit model, Frobenius loss and entropy loss are used to compare the parameter estimation performance of Oracle, ProbitEM, ProbitEMApprox, and ProbitDirect.

Frobenius loss:

Entropy loss:

Here, […] is the covariance matrix and […] is the estimated covariance matrix.
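For illustration, the two losses can be computed as in the sketch below. The exact formulas are not reproduced in this text, so the sketch assumes widely used definitions: the Frobenius norm of the difference between the true and estimated covariance matrices, and the entropy (Stein) loss tr(Sigma^{-1} Sigma_hat) - log det(Sigma^{-1} Sigma_hat) - p. Any normalization used in the original may differ, and the function names are hypothetical.

```python
import numpy as np

def frobenius_loss(sigma, sigma_hat):
    """Assumed definition: Frobenius norm of (sigma - sigma_hat)."""
    return float(np.linalg.norm(sigma - sigma_hat, ord="fro"))

def entropy_loss(sigma, sigma_hat):
    """Assumed definition: tr(S^-1 S_hat) - log det(S^-1 S_hat) - p."""
    p = sigma.shape[0]
    ratio = np.linalg.solve(sigma, sigma_hat)  # equals inv(sigma) @ sigma_hat
    _, logdet = np.linalg.slogdet(ratio)
    return float(np.trace(ratio) - logdet - p)

# Illustrative values only: the true covariance is the identity and the
# estimate overstates every variance by a factor of two.
sigma_true = np.eye(2)
sigma_est = 2.0 * np.eye(2)
fro = frobenius_loss(sigma_true, sigma_est)
ent = entropy_loss(sigma_true, sigma_est)
```

Both losses are zero when the estimate equals the true covariance and grow as the estimate moves away from it.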

Finally, ProbitEM, ProbitEMApprox, and ProbitDirect are compared on the log-likelihood computed on 500 test samples. For these three metrics, cross-validation is used to select the optimal tuning parameter for each method. For example, the number of nodes in the graph is fixed at 50, and the number of categories of each ordinal variable is set to 5. To reduce variance, the results are averaged over 10 or more runs.

First, ordinal data are generated from the probit model, with the data simulated from a chain graph. The inverse covariance matrix of the latent variables may be chosen as in Equation 13 below.

Equation 13:

Here, […] is chosen, and the thresholds […] at node j may be set as follows:

The covariance matrix is then scaled so that all variances are 1. FIG. 1 shows the results obtained with […] = -0.3 and […] = -0.9. ProbitDirect and ProbitEM perform similarly, but ProbitDirect is 1 to 2 times faster than ProbitEM, and ProbitEMApprox performs very poorly, especially in the low sample complexity setting.
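The simulation setup above can be sketched as follows, under stated assumptions. Equation 13 and the threshold formula are not reproduced in this text, so the sketch uses a hypothetical tridiagonal inverse covariance (ones on the diagonal, a constant `omega` on the first off-diagonals) and hypothetical thresholds placed at equally spaced standard normal quantiles. This simple tridiagonal form is positive definite only for |omega| < 0.5, so it may differ from the matrix actually used.

```python
import numpy as np
from statistics import NormalDist

def chain_precision(p, omega):
    """Hypothetical chain-graph inverse covariance: 1 on the diagonal,
    omega between neighboring nodes, 0 elsewhere."""
    theta = np.eye(p)
    idx = np.arange(p - 1)
    theta[idx, idx + 1] = omega
    theta[idx + 1, idx] = omega
    return theta

def unit_variance_covariance(theta):
    """Invert theta and rescale so that every variance equals 1."""
    sigma = np.linalg.inv(theta)
    sigma = (sigma + sigma.T) / 2.0  # remove numerical asymmetry
    scale = np.sqrt(np.diag(sigma))
    return sigma / np.outer(scale, scale)

def sample_ordinal_probit(sigma, thresholds, n, rng):
    """Draw latent Gaussian vectors and quantize each coordinate
    into ordinal categories 0, 1, ..., len(thresholds)."""
    latent = rng.multivariate_normal(np.zeros(sigma.shape[0]), sigma, size=n)
    return np.digitize(latent, thresholds)

p, n, levels = 50, 100, 5          # 50 nodes and 5 categories, as in the text
sigma = unit_variance_covariance(chain_precision(p, omega=-0.3))
# Assumed thresholds: quantiles that make the categories equally likely.
thresholds = np.array([NormalDist().inv_cdf(k / levels) for k in range(1, levels)])
x = sample_ordinal_probit(sigma, thresholds, n, np.random.default_rng(0))
```

Rescaling the covariance to unit variances multiplies the inverse covariance by a diagonal matrix on each side, so the chain sparsity pattern of the latent graph is preserved.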

FIG. 2 shows results for data sampled from the Consec model with a 2D grid structure (10 x 5 grid). The node-specific parameters ([…]) are sampled uniformly from [-1, 1]. The pairwise interaction terms ([…]) are set to 0.1 for all horizontal edges and to -0.1 for all vertical edges.

FIG. 3 shows results for data sampled from the Consec model with a 2D grid structure (10 x 5 grid). The node-specific parameters ([…]) are sampled uniformly from [-1, 1]. The pairwise interaction terms ([…]) are set to 0.3 for all horizontal edges and to -0.3 for all vertical edges.
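The grid settings described for FIGS. 2 and 3 can be constructed as in the following sketch. It builds only the parameters (node-specific terms and pairwise interaction terms); sampling from the Consec model itself is not shown. Treating the 10 x 5 grid as 10 rows by 5 columns is an assumption, and the function name `grid_parameters` is hypothetical.

```python
import numpy as np

def grid_parameters(rows, cols, strength, rng):
    """Return node-specific parameters drawn uniformly from [-1, 1] and a
    symmetric matrix of pairwise terms: +strength on horizontal edges and
    -strength on vertical edges of a rows x cols grid."""
    p = rows * cols
    node_params = rng.uniform(-1.0, 1.0, size=p)
    pairwise = np.zeros((p, p))
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols:            # horizontal edge to the right neighbor
                j = i + 1
                pairwise[i, j] = pairwise[j, i] = strength
            if r + 1 < rows:            # vertical edge to the neighbor below
                j = i + cols
                pairwise[i, j] = pairwise[j, i] = -strength
    return node_params, pairwise

rng = np.random.default_rng(0)
node_weak, pair_weak = grid_parameters(10, 5, 0.1, rng)      # setting of FIG. 2
node_strong, pair_strong = grid_parameters(10, 5, 0.3, rng)  # setting of FIG. 3
```

Under this layout the grid has 40 horizontal and 45 vertical edges, so the two settings differ only in the magnitude of the interaction terms.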

In this embodiment, data are sampled from the Consec model. FIGS. 2 and 3 present the results on grid graphs, together with the exact parameters used. As FIG. 2 shows, when the interaction between variables is weak, the Consec estimator performs comparably to the other estimators; as FIG. 3 shows, its performance degrades when the interaction is strong. This suggests either that the node-conditional likelihood-based estimator for the continuous ratio model is not efficient, or that latent graphical models such as the probit model are better models than the Consec model.

FIG. 4 shows the latent graph structure corresponding to SmokeNow and the socio-demographic indicators. The graph is generated from the marginal distribution of the corresponding variables. Green and red edges indicate positive and negative partial correlations, respectively, and the edge thickness is proportional to the magnitude of the partial correlation.

As an example, consider the Health Information National Trends Survey (HINTS), a nationwide survey conducted by the National Cancer Institute (NCI). Each survey question is treated as a node of the graph, and each individual's responses are treated as a sample drawn from the graph. A subset of questions relevant to the analysis is selected from the data set, the probit model is fitted to the selected questions using ProbitDirect, and 10-fold cross-validation is used to select the optimal tuning parameter. A 95% confidence interval for each edge strength of the latent graph is then obtained by jackknife resampling. An edge is placed in the graph only if its confidence interval does not intersect [-0.1, 0.1].
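The edge-selection rule above can be sketched as follows. The sketch uses the usual leave-one-out jackknife standard error together with a normal-approximation interval, which is an assumption, since the text does not state how the interval is formed from the jackknife replicates. The names `jackknife_ci` and `keep_edge` are hypothetical, and `statistic` stands in for any function that maps a data set to one edge strength.

```python
import numpy as np
from statistics import NormalDist

def jackknife_ci(samples, statistic, level=0.95):
    """Leave-one-out jackknife confidence interval for statistic(samples).
    Assumes a normal approximation: estimate +/- z * jackknife standard error."""
    n = len(samples)
    full = statistic(samples)
    loo = np.array([statistic(np.delete(samples, i, axis=0)) for i in range(n)])
    se = np.sqrt((n - 1) / n * np.sum((loo - loo.mean()) ** 2))
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return full - z * se, full + z * se

def keep_edge(ci, band=0.1):
    """Keep the edge only if the interval does not intersect [-band, band]."""
    lo, hi = ci
    return hi < -band or lo > band

# Toy check with the sample mean as the statistic (illustrative data only).
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
ci_mean = jackknife_ci(data, np.mean)
```

For the sample mean, the jackknife standard error coincides with the usual standard error of the mean, which gives a quick sanity check of the implementation. In the setting of the text, `statistic` would refit the model on the reduced data and return the strength of one edge.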

FIG. 4 shows how various socio-demographic variables relate to a person's smoking behavior. In particular, SmokeNow has a highly significant association with education: conditioned on all other variables, a well-educated person is less likely to smoke. SmokeNow and FewCigarettesHarmHealth have a positive partial correlation, indicating that, conditioned on the remaining variables, smokers perceive smoking as less harmful than non-smokers do. According to one embodiment, such insights can help in designing efficient strategies for communicating smoking-related health information to the public.

The apparatus described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the devices and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and one or more software applications that execute on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, the description sometimes refers to a single processing device, but a person of ordinary skill in the art will appreciate that a processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied in any type of machine, component, physical device, virtual equipment, computer storage medium, or device so as to be interpreted by the processing device or to provide instructions or data to the processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

The method according to the embodiments may be implemented in the form of program instructions that can be executed by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiments, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code, such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

Although the embodiments have been described with reference to a limited set of embodiments and drawings, a person of ordinary skill in the art can make various modifications and variations from the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or if components of the described systems, structures, devices, circuits, and the like are combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

Claims

A method of learning associations in ordinal data, performed by a computer-implemented association learning system,
the computer-implemented association learning system comprising:
a memory for storing instructions in a computer-readable recording medium; and
a plurality of processors or one processor configured to store and execute instructions stored in the memory,
the method comprising:
predicting, in the association learning system, a correlation between the respective variables from the ordinal data based on a probabilistic graphical model; and
providing, in the association learning system, the predicted correlation between the respective variables as a graph,
wherein, in the predicting of the correlation between the respective variables from the ordinal data based on the probabilistic graphical model,
given a p-dimensional random vector […] with domain […] and a graph […] having nodes corresponding to the respective random variables […],
if all node-conditional distributions of the random vector follow the univariate cumulative proportion model, the specified node-conditional distributions are not consistent with a Markov joint distribution.

The method of claim 1,
wherein the predicting of the correlation between the respective variables from the ordinal data based on the probabilistic graphical model comprises
specifying a node-conditional distribution through a univariate ordinal distribution for each variable, and finding a joint distribution by analyzing the specified node-conditional distributions.

delete

A method of learning associations in ordinal data, performed by a computer-implemented association learning system,
the computer-implemented association learning system comprising:
a memory for storing instructions in a computer-readable recording medium; and
a plurality of processors or one processor configured to store and execute instructions stored in the memory,
the method comprising:
predicting, in the association learning system, a correlation between the respective variables from the ordinal data based on a probabilistic graphical model; and
providing, in the association learning system, the predicted correlation between the respective variables as a graph,
wherein, in the predicting of the correlation between the respective variables from the ordinal data based on the probabilistic graphical model,
given a p-dimensional random vector […] with domain […] and a graph […] having nodes corresponding to the respective random variables […],
if all node-conditional distributions of the random vector follow the univariate continuous ratio model, the specified node-conditional distributions are not consistent with a Markov joint distribution.

A method of learning associations in ordinal data, performed by a computer-implemented association learning system,
the computer-implemented association learning system comprising:
a memory for storing instructions in a computer-readable recording medium; and
a plurality of processors or one processor configured to store and execute instructions stored in the memory,
the method comprising:
predicting, in the association learning system, a correlation between the respective variables from the ordinal data based on a probabilistic graphical model; and
providing, in the association learning system, the predicted correlation between the respective variables as a graph,
wherein, in the predicting of the correlation between the respective variables from the ordinal data based on the probabilistic graphical model,
the continuous ratio model for the ordinal random variable […] belongs to the exponential family, and when a univariate ordinal distribution is used to specify the node-conditional distributions for the ordinal random vector […], the node-conditional distributions belong to the univariate exponential family.

delete

A method of learning associations in ordinal data, performed by a computer-implemented association learning system,
the computer-implemented association learning system comprising:
a memory for storing instructions in a computer-readable recording medium; and
a plurality of processors or one processor configured to store and execute instructions stored in the memory,
the method comprising:
predicting, in the association learning system, a correlation between the respective variables from the ordinal data based on a probabilistic graphical model; and
providing, in the association learning system, the predicted correlation between the respective variables as a graph,
wherein, in the predicting of the correlation between the respective variables from the ordinal data based on the probabilistic graphical model,
in a multivariate quantized ordinal distribution, the model in which the multivariate latent random vector is multivariate Gaussian is called a multivariate probit model, and the dependencies are expressed by the latent random vector through a Gaussian distribution.

delete

The method of claim 7,
wherein, in the predicting of the correlation between the respective variables from the ordinal data based on the probabilistic graphical model,
in order to estimate the unknown parameters of the multivariate probit model, thresholds are estimated from the univariate marginals and polychoric correlations are estimated from the bivariate marginal distributions.

The method of claim 10,
wherein, in the predicting of the correlation between the respective variables from the ordinal data based on the probabilistic graphical model,
in order to estimate the polychoric correlations from the bivariate marginal distributions, a raw estimate is computed from the bivariate marginal likelihood, and the estimated covariance matrix is plugged into a graphical lasso estimator to estimate a sparse latent graph and a smoothed estimate.

delete

The method of claim 11,
wherein, in the predicting of the correlation between the respective variables from the ordinal data based on the probabilistic graphical model,
the estimated covariance matrix is plugged into the graphical lasso estimator, among parametric Gaussian graphical model estimators, to obtain the graph structure and the final covariance.

A computer-implemented association learning system, comprising:
a memory for storing instructions in a computer-readable recording medium; and
a plurality of processors or one processor configured to store and execute instructions stored in the memory,
wherein the plurality of processors or the one processor comprises:
a prediction unit that predicts a correlation between the respective variables from ordinal data based on a probabilistic graphical model; and
a providing unit that provides the predicted correlation between the respective variables as a graph,
wherein, in the prediction unit,
given a p-dimensional random vector […] with domain […] and a graph […] having nodes corresponding to the respective random variables […],
if all node-conditional distributions of the random vector follow the univariate cumulative proportion model, the specified node-conditional distributions are not consistent with a Markov joint distribution.