KR20210130072A

KR20210130072A - Method and apparatus of processing image

Info

Publication number: KR20210130072A
Application number: KR1020200086572A
Authority: KR
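Inventors: Alexander Schwing; Jia-Bin Huang; Colin Graber; Raymond Yeh; Jihye Kim; Jaejoon Han
Original assignee: The Board of Trustees of the University of Illinois; Samsung Electronics Co., Ltd.; Virginia Tech Intellectual Properties, Inc.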
Priority date: 2020-04-21
Filing date: 2020-07-14
Publication date: 2021-10-29

Abstract

An image processing method according to an embodiment includes: defining relationships between entities of a target whose motion is to be predicted, based on feature vectors of the entities in an image at a first time point; estimating a dynamic interaction between the entities at the first time point based on the relationships; predicting, based on the estimated dynamic interaction, the motion of the entities as it changes at a second time point; and outputting a result reflecting the predicted motion at the second time point.

Description

Image processing method and apparatus {METHOD AND APPARATUS OF PROCESSING IMAGE}

The following embodiments relate to an image processing method and apparatus.

For tasks such as forecasting, it is important to understand the interactions between entities such as the joints of a human body or the players on a sports team. However, such interactions are generally not observed directly and are not easy to quantify. In particular, because the relationships between entities generally change over time, it is very difficult to reflect those changes in the modeled interactions.

The background art described above was possessed or acquired by the inventors in the course of deriving the present disclosure, and is not necessarily known art disclosed to the general public before the filing of the present application.

According to an embodiment, an image processing method includes: defining relationships between entities of a target whose motion is to be predicted, based on feature vectors of the entities in an image at a first time point; estimating a dynamic interaction between the entities at the first time point based on the relationships between the entities; predicting, based on the estimated dynamic interaction, the motion of the entities as it changes at a second time point; and outputting a result reflecting the predicted motion at the second time point.

The relationships between the entities may be determined based on at least one of connection relationships between the entities, positions of the entities, postures of the entities, moving directions of the entities, moving speeds of the entities, trajectories of the entities, a rule applied to the entities, a movement pattern of the entities according to the rule, a law applied to the entities, and a movement pattern of the entities according to the law.

The defining of the relationships between the entities may include generating hidden state information corresponding to the relationships between the entities at the first time point by applying the feature vectors to a graph neural network (GNN) that includes nodes corresponding to the entities and edges corresponding to the relationships between the entities.

The graph neural network may include a fully-connected graph neural network for generating, based on the feature vectors, hidden state information corresponding to the states of the relationships between pairs of the entities.

The estimating of the dynamic interaction may include: generating prior information corresponding to the entities based on the hidden state information; generating posterior information predicted for the entities based on the prior information and the hidden state information; and generating a latent variable corresponding to the dynamic interaction between the entities based on the prior information and the posterior information.

The prior information may be determined based on a past history of the relationships between the entities up to a time point before the first time point, and on the feature vectors of the entities input up to the first time point.

The generating of the prior information may include generating the prior information by passing the hidden state information, as forward state information, to a forward long short-term memory (LSTM).

The generating of the posterior information may include generating the posterior information by passing the prior information and the hidden state information, as backward state information, to a backward LSTM.

The generating of the latent variable may include: sampling a result of combining the prior information and the posterior information; and generating, based on the sampling result, a latent variable corresponding to the dynamic interaction between the entities at the first time point.
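
As an illustration of the sampling step described above, the following minimal Python sketch draws a discrete relation type from logits obtained by combining a prior state and a posterior state. The function names, the single linear layer standing in for a learned network, and the use of Gumbel-max sampling are assumptions made for illustration; the source does not specify how the combination or the sampling is implemented.

```python
import math
import random


def combine_states(prior_state, posterior_state, weights):
    """Concatenate the two states and apply one linear layer (hypothetical
    stand-in for a learned network) to obtain one logit per relation type."""
    features = list(prior_state) + list(posterior_state)
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def sample_relation_type(logits, rng):
    """Gumbel-max sampling: the argmax of (logit + Gumbel noise) is distributed
    as Categorical(softmax(logits))."""
    noisy = []
    for logit in logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep log() finite
        noisy.append(logit - math.log(-math.log(u)))
    return max(range(len(noisy)), key=noisy.__getitem__)


rng = random.Random(0)
weights = [[1.0, 0.0, 1.0, 0.0],   # hypothetical weights: logit for type 0
           [0.0, 1.0, 0.0, 1.0]]   # hypothetical weights: logit for type 1
logits = combine_states([1.0, 0.0], [1.0, 0.0], weights)
relation_type = sample_relation_type(logits, rng)
```

In a trained model a differentiable relaxation of this sampling would typically be needed so that gradients can flow through the sample; that part is omitted here.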

The generating of the latent variable may include optimizing the latent variable based on the prior information.

The entities of the target may include at least one of body parts of a single user, joints of a single user, a plurality of pedestrians, a plurality of vehicles, and athletes of a sports team.

The predicting of the motion of the entities may include predicting the motion of the target as it changes at the second time point by decoding the estimated dynamic interaction.

The outputting of the result reflecting the predicted motion may include: processing the image at the first time point into an image at the second time point by applying the predicted motion to the entities included in the image at the first time point; and outputting the image at the second time point.

The outputting of the result reflecting the predicted motion may include: processing the image at the first time point into an image at the second time point by applying the predicted motion to the entities included in the image at the first time point; recognizing, based on the image at the second time point, whether a dangerous situation occurs; and outputting an alarm corresponding to the dangerous situation.
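
The alarm step above could be sketched as follows. This is a minimal Python illustration under stated assumptions: the source does not say how a dangerous situation is recognized, so the proximity criterion, the function names, and the sample coordinates are all hypothetical.

```python
import math


def entities_in_danger(predicted_positions, vehicle_position, safety_radius):
    """Return the indices of entities whose predicted position at the second
    time point lies within safety_radius of the vehicle (assumed criterion)."""
    vx, vy = vehicle_position
    return [i for i, (x, y) in enumerate(predicted_positions)
            if math.hypot(x - vx, y - vy) < safety_radius]


def alarm_for(indices):
    """Build an alarm message, or return None when no entity is in danger."""
    if not indices:
        return None
    return "ALARM: predicted proximity for entities " + ", ".join(map(str, indices))


predicted = [(0.5, 1.0), (6.0, 8.0), (-1.0, 0.0)]  # hypothetical predictions
message = alarm_for(entities_in_danger(predicted, (0.0, 0.0), 2.0))
```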

The image processing method may further include determining the entities of the target whose motion is to be predicted.

According to an embodiment, an image processing apparatus includes: a communication interface that receives an image at a first time point including entities of a target whose motion is to be predicted; a processor that extracts feature vectors of the entities from the image at the first time point, estimates a dynamic interaction between the entities at the first time point from relationships between the entities defined based on the feature vectors, and predicts, from the estimated dynamic interaction, the motion of the entities as it changes at a second time point; and an output device that outputs a result reflecting the predicted motion at the second time point.

The processor may include: a prior that generates, based on the feature vectors, prior information determined from a past history of the relationships between the entities up to a time point before the first time point and from the feature vectors corresponding to the entities input up to the first time point; an encoder that generates a latent variable corresponding to the dynamic interaction between the entities based on the feature vectors and the prior information; and a decoder that predicts, based on the latent variable, the motion of the entities as it changes at the second time point.

The encoder may include: a fully-connected graph neural network (GNN) that generates, based on the feature vectors, hidden state information corresponding to the states of the relationships between pairs of the entities; a forward LSTM that generates, based on the hidden state information, prior information corresponding to the entities of the target in the image at the first time point; a backward LSTM that generates, based on the prior information and the hidden state information, posterior information predicted according to the dynamic interaction between the entities; and a multi-layer perceptron (MLP) that generates a latent variable corresponding to the dynamic interaction between the entities at the first time point based on the prior information passed through the forward LSTM and the posterior information passed through the backward LSTM.

The image processing apparatus may include at least one of a head-up display (HUD) device, a 3D digital information display (DID), a 3D mobile device, and a smart vehicle.

FIG. 1 is a flowchart illustrating an image processing method according to an embodiment.
FIGS. 2 and 3 are diagrams for explaining the components of an image processing apparatus and the operations of those components according to an embodiment.
FIG. 4 is a flowchart illustrating an image processing method according to another embodiment.
FIG. 5 is a diagram illustrating a result reflecting human body motion predicted by an image processing method according to an embodiment.
FIG. 6 is a diagram illustrating movement trajectories reflecting the motion of basketball players predicted by an image processing method according to an embodiment.
FIG. 7 is a diagram for explaining a process of generating an image at a future time point by reflecting the motion of pedestrians predicted by an image processing method according to an embodiment.
FIG. 8 is a block diagram of an image processing apparatus according to an embodiment.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings. However, various changes may be made to the embodiments, so the scope of the patent application is not restricted or limited by these embodiments. It should be understood that all modifications, equivalents, and substitutes of the embodiments are included in the scope of the rights.

The terms used in the embodiments are used for description only and should not be construed as limiting. A singular expression includes the plural unless the context clearly indicates otherwise. In the present specification, terms such as "comprise" or "have" are intended to indicate that a feature, number, step, operation, component, part, or combination thereof described in the specification exists, and should be understood not to preclude the existence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by one of ordinary skill in the art to which the embodiments belong. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless explicitly so defined in the present application.

In addition, in the description with reference to the accompanying drawings, the same components are given the same reference numerals regardless of the figure number, and redundant descriptions thereof are omitted. In describing the embodiments, a detailed description of related known technology is omitted where it is determined that it would unnecessarily obscure the gist of the embodiments.

In addition, terms such as first, second, A, B, (a), and (b) may be used in describing the components of the embodiments. These terms serve only to distinguish one component from another, and the nature, sequence, or order of the components is not limited by them. When a component is described as being "connected", "coupled", or "joined" to another component, it may be directly connected or joined to that other component, but it should be understood that yet another component may also be "connected", "coupled", or "joined" between them.

A component included in one embodiment and a component having a common function are described using the same name in other embodiments. Unless stated otherwise, a description given for one embodiment may also apply to other embodiments, and detailed descriptions are omitted where they would overlap.

FIG. 1 is a flowchart illustrating an image processing method according to an embodiment. Referring to FIG. 1, an image processing apparatus according to an embodiment defines relationships between entities of a target whose motion is to be predicted, based on feature vectors of the entities in an image at a first time point (110). Here, the 'target' whose motion is to be predicted may be understood to include not only living things such as people, animals, and plants, but also inanimate objects of various forms that can move, change their motion, and/or change their shape, such as automobiles, motorcycles, and bicycles. The target may be understood, for example, as a grouped form of the entities described below.

The 'entities' according to an embodiment may be understood as all objects that have an organic connection or an organic association with the target whose motion is to be predicted. The 'entities' may correspond to some components of the target and/or to a part of the target.

The entities of the target may correspond to, for example, body parts of a single user, joints of a single user, a plurality of pedestrians, a plurality of vehicles, or athletes of a sports team, but are not necessarily limited thereto. For example, if the target whose motion is to be predicted is a 'sports team', the entities of the target may correspond to the athletes of that team who are currently playing. Alternatively, if the target is 'a plurality of pedestrians', the entities may correspond to five different pedestrians moving in front of a vehicle. Alternatively, if the target is 'user A' or 'the hand of user B', the entities may correspond to the individual body parts of user A or to the finger joints of user B. In addition, if the target is a substance W, the entities may correspond to the elements that make up the substance W.

The 'feature vector' of an entity may include, for example, the position and velocity of that entity, but is not necessarily limited thereto. The feature vectors of the entities may correspond to, for example, the movement trajectories of the entities.
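
As a concrete illustration of a feature vector made of position and velocity, the following Python sketch derives such vectors from a sequence of 2D positions. The function name, the 2D layout [x, y, vx, vy], and the backward-difference velocity estimate are assumptions for illustration and are not taken from the source.

```python
def feature_vectors(trajectory, dt=1.0):
    """Turn a list of (x, y) positions into [x, y, vx, vy] feature vectors.
    Velocity is a backward difference; the first frame gets zero velocity."""
    features = []
    for k, (x, y) in enumerate(trajectory):
        if k == 0:
            vx, vy = 0.0, 0.0
        else:
            px, py = trajectory[k - 1]
            vx, vy = (x - px) / dt, (y - py) / dt
        features.append([x, y, vx, vy])
    return features


track = [(0.0, 0.0), (1.0, 2.0), (3.0, 2.0)]  # hypothetical positions of one entity
features = feature_vectors(track)
```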

For example, the joints of a human body are constrained in their movement by the skeleton, the players of a sports team move according to practiced formations, and vehicles move according to the traffic laws or traffic rules imposed on them. Accordingly, the 'relationships between entities' may be determined based on, for example, connection relationships between the entities, positions of the entities, postures of the entities, moving directions of the entities, moving speeds of the entities, trajectories of the entities, a rule applied to the entities, a movement pattern of the entities according to the rule, a law applied to the entities, and a movement pattern of the entities according to the law.

In operation 110, the image processing apparatus may generate hidden state information corresponding to the relationships between the entities at the first time point by applying the feature vectors to, for example, the graph neural network (GNN) shown in FIG. 3. The graph neural network may include, for example, nodes corresponding to the entities and edges corresponding to the relationships between the entities. The graph neural network may be, for example, a fully-connected graph neural network for generating, based on the feature vectors, hidden state information corresponding to the states of the relationships between pairs of the entities. The hidden state information that the image processing apparatus generates through the graph neural network may correspond to, for example, the 'relation embeddings' described later.
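
To make the fully-connected graph structure concrete, the following Python sketch enumerates one directed edge per ordered pair of entities and prepares a per-edge input by concatenating the features of the two endpoints. The function names and the concatenation scheme are assumptions; the learned network that would turn each per-edge input into a relation embedding is not shown and is not specified here.

```python
from itertools import permutations


def fully_connected_edges(num_entities):
    """All ordered pairs (i, j) with i != j: one directed edge per entity pair."""
    return list(permutations(range(num_entities), 2))


def edge_inputs(node_features, edges):
    """Concatenate sender and receiver features for each edge. A learned network
    (not shown) would map each concatenation to a relation embedding."""
    return {(i, j): node_features[i] + node_features[j] for i, j in edges}


features = [[0.0, 0.0], [1.0, 0.5], [2.0, -1.0]]  # hypothetical per-entity features
edges = fully_connected_edges(len(features))
inputs = edge_inputs(features, edges)
```

For N entities this yields N(N-1) directed edges, so the number of pairwise relations grows quadratically with the number of entities.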

In an embodiment, a 'hidden state' corresponds to the internal state of the nodes constituting a neural network (for example, a graph neural network), and 'hidden state information' may correspond to information representing that internal state. For example, the hidden state may embed temporal information in which information processed at previous time points is accumulated through the feedback structure of the neural network. The hidden state information may be, for example, information in vector form, such as a hidden state vector. The hidden state information may also include, for example, feature vectors corresponding to the entities of at least one target included in the image frames at the first time point and at a time point before the first time point.

Hereinafter, for convenience of description, the time point before the first time point may be denoted 'time t-1' (the past), the first time point 'time t' (the present), and the second time point 'time t+1' (the future).

The image processing apparatus estimates a dynamic interaction between the entities at the first time point based on the relationships between the entities defined in operation 110 (120). The dynamic interaction may be expressed, for example, in the form of a latent variable representing the relationship between entities.

To understand interactions between entities, it is common to study a surrogate task such as predicting trajectories over time. Specifically, given N entities to be modeled, $x_i^{(t)}$ may denote the feature vector of entity $i$ at time step $t$. The feature vector of an entity may represent, for example, its position and velocity.

In general, the neural relational inference (NRI) framework, which makes the relationships between entities interpretable while predicting system dynamics, can predict trajectories by predicting a set of interactions between entities. Accurately predicted interactions enable accurate trajectory prediction, so the interactions can be used to improve the prediction of future trajectories. However, the NRI method assumes that these relationships remain static over the observed trajectory, whereas in many systems the relationships between entities change over time. Applying NRI in such cases recovers interactions averaged over time, and averaged interactions do not accurately represent the underlying system.

For example, the interactions between entities may be expressed in the form of a latent variable $z_{i,j} \in \{1, \ldots, e\}$ for every pair of entities $i$ and $j$. Here, $e$ may correspond to the number of relation types being modeled. The latent variable may also be called a 'latent relation variable'.
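
One way to picture these latent relation variables is as a typed graph: each ordered pair of entities carries one relation type, and each type defines its own adjacency matrix. The Python sketch below illustrates this view. The function name, the numbering of relation types from 1 to e, and the sample assignment are assumptions for illustration only.

```python
def adjacency_per_type(relations, num_entities, num_types):
    """relations maps an ordered pair (i, j) to a relation type in 1..num_types.
    Returns one num_entities x num_entities 0/1 matrix per relation type."""
    matrices = [[[0] * num_entities for _ in range(num_entities)]
                for _ in range(num_types)]
    for (i, j), relation_type in relations.items():
        if not 1 <= relation_type <= num_types:
            raise ValueError("relation type out of range")
        matrices[relation_type - 1][i][j] = 1
    return matrices


# Hypothetical assignment for N = 3 entities and e = 2 relation types.
relations = {(0, 1): 1, (1, 0): 1, (0, 2): 2, (2, 0): 2, (1, 2): 2, (2, 1): 1}
type_1, type_2 = adjacency_per_type(relations, num_entities=3, num_types=2)
```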

These relations have no predefined meaning, but a model built on relation types can learn to assign a meaning to each type. To predict both the latent variables $z_{i,j}$ and the future trajectories of the entities, the neural relational inference (NRI) method may train a variational auto-encoder (VAE).

The observed variables represent the trajectories x of the entities, and the latent variables represent the relationships z between the entities. Following the conventional variational auto-encoder (VAE), the evidence lower bound (ELBO) may be maximized as in Equation 1 below.

[Equation 1]
$$\mathcal{L}(\phi, \theta) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \mathrm{KL}\left[q_\phi(z \mid x) \,\|\, p(z)\right]$$

Here, $\phi$ and $\theta$ may denote the encoder parameters and the decoder parameters, respectively. Equation 1 consists of three main probability distributions, which are described later.
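
The following Python sketch works through the two terms of a standard evidence lower bound for a single pair of entities: a reconstruction log-likelihood minus a KL divergence between the approximate posterior over relation types and a prior. It is a sketch under assumptions: the Gaussian likelihood with fixed variance, the single-sample estimate of the expectation, the function names, and all numeric values are illustrative and not taken from the source.

```python
import math


def kl_categorical(q, p):
    """KL(q || p) for two discrete distributions over relation types."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)


def gaussian_log_likelihood(target, prediction, variance=1.0):
    """log N(target; prediction, variance * I), summed over dimensions."""
    squared_error = sum((t - m) ** 2 for t, m in zip(target, prediction))
    normalizer = 0.5 * len(target) * math.log(2.0 * math.pi * variance)
    return -squared_error / (2.0 * variance) - normalizer


def elbo_estimate(target, prediction, q, prior, variance=1.0):
    """Single-sample ELBO estimate: reconstruction term minus KL term."""
    return gaussian_log_likelihood(target, prediction, variance) - kl_categorical(q, prior)


value = elbo_estimate(target=[1.0, 2.0], prediction=[1.0, 2.5],
                      q=[0.9, 0.1], prior=[0.5, 0.5])
```

The estimate is largest when the prediction matches the target and the posterior equals the prior; any reconstruction error or divergence from the prior lowers it.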

A variational auto-encoder (VAE) is a generative model whose purpose is to generate data by learning a probability distribution P(x). The encoder of the VAE may receive training data (x) as input and output the parameters of the probability distribution of the latent variable (z). For example, for a Gaussian distribution, it may output μ and σ². Given data, the encoder may find, for example, an ideal probability distribution p(z|x) from which a latent variable (z) can be sampled such that the decoder can reconstruct the original data well. The decoder of the VAE may receive a vector sampled from the probability distribution p(z) of the latent variable and reconstruct the original image from the sampled vector. In other words, the decoder takes the sample drawn by the encoder as input and reconstructs the original from it.
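
For the Gaussian case mentioned above, sampling a latent variable from encoder outputs is commonly done with the reparameterization z = μ + σ·ε, where ε is standard normal noise. The Python sketch below shows this step. The reparameterization itself, the use of log-variance as the encoder output, and the function name are assumptions based on common VAE practice; the source only states that μ and σ² may be output.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), where sigma = exp(log_var / 2)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


rng = random.Random(0)
mu = [1.0, -2.0]                  # hypothetical encoder output (means)
log_var = [math.log(4.0), 0.0]    # hypothetical encoder output (log-variances)
z = sample_latent(mu, log_var, rng)
```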

In summary, a VAE finds, through optimization, the distribution of latent variables that best explains the given data (performed by the encoder) and reconstructs an image close to the original from the latent variable (performed by the decoder).

An embodiment may use a 'dynamic neural relational inference (dNRI)' method, which recovers the interactions between entities at every time point by taking into account the relationships between entities that change over time. In an embodiment, successive latent variables allow a separate relation graph to be predicted for each time step, which addresses the limitation of the (static) neural relational inference (NRI) method described above and improves prediction accuracy.

More specifically, the dynamic neural relational inference (dNRI) method according to an embodiment may estimate the interactions between entities using a latent variable model. Here, the 'latent variable' may represent the strength of the relationship between entities. In an embodiment, the estimated relationship strengths may be used to recover the observed trajectories of the entities as accurately as possible. Unlike the neural relational inference (NRI) method, the dNRI method can estimate the latent variables at every time point.
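
The difference between a single static relation estimate and per-time-step estimates can be seen in a small numeric example. In the Python sketch below, a pair of entities switches relation type halfway through a trajectory; averaging over time yields an estimate that matches neither phase, whereas the per-step view keeps both. The function name and the probability values are hypothetical.

```python
def time_averaged(per_step_probs):
    """Average per-time-step relation probabilities into one static estimate."""
    steps = len(per_step_probs)
    num_types = len(per_step_probs[0])
    return [sum(step[k] for step in per_step_probs) / steps
            for k in range(num_types)]


# Hypothetical pair of entities: relation type 0 for two steps, then type 1.
per_step = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
static_view = time_averaged(per_step)  # one estimate shared by all steps
dynamic_view = [max(range(2), key=step.__getitem__) for step in per_step]
```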

In addition, in an embodiment, a sequential latent-variable model may be applied to the NRI framework in order to learn both a sequential relation prior, which depends on the history of the input trajectory, and an approximate relation posterior, which takes into account both past and future variable states.

As will be described with reference to FIGS. 5 and 6 below, in an embodiment, the dNRI method may be evaluated on challenging motion-capture and sports-trajectory datasets. Compared with the static NRI method, the dNRI method may substantially improve the recovery of observed trajectories and may predict relations that change across different phases of a trajectory, which the static NRI method cannot do.

In operation 120, the image processing apparatus may generate prior information corresponding to the entities based on, for example, hidden state information. The prior information may correspond to, for example, the prior $p_\phi(z \mid x)$ described below. The prior information may indicate, for example, the strengths of the relationships between the entities. The prior information may also be determined based on, for example, the past history of the relationships between the entities up to the time point preceding the first time point and the feature vectors of the entities input up to the first time point. The image processing apparatus may generate the prior information by, for example, passing the hidden state information, as forward state information, to LSTM_prior, the forward long short-term memory (LSTM) shown at 310 of FIG. 3.

In operation 120, the image processing apparatus may generate posterior information predicted for the entities based on the prior information and the hidden state information. The image processing apparatus may generate the posterior information by, for example, passing the prior information and the hidden state information, as backward state information, to LSTM_enc, the backward LSTM shown at 310 of FIG. 3. The posterior information may correspond to, for example, the approximate posterior $q_\phi(z \mid x)$ described below.

In operation 120, the image processing apparatus may generate a latent variable corresponding to the dynamic interaction between the entities based on the prior information and the posterior information. The image processing apparatus may sample from the result of combining the prior information and the posterior information. Based on the sampling result, the image processing apparatus may generate a latent variable $z^t$ corresponding to the dynamic interaction between the entities at the first time point. Here, the image processing apparatus may optimize the latent variable based on the prior information.

Based on the dynamic interaction estimated in operation 120, the image processing apparatus predicts the movement of the entities as it changes at a second time point (130). The image processing apparatus may predict this movement by decoding the estimated dynamic interaction with, for example, the decoder 330 shown in FIG. 3. From the samples of the relation variables generated by the encoder, the decoder may predict, for example, a trajectory distribution $p_\theta(x \mid z)$.

The image processing apparatus outputs a result reflecting the movement predicted for the second time point (140). In operation 140, the image processing apparatus may output this result either implicitly or explicitly. In operation 140, the image processing apparatus may, for example, apply the predicted movement to the entities included in the image of the first time point to produce an image of the second time point, and output the image of the second time point. Alternatively, the image processing apparatus may apply the predicted movement to the entities included in the image of the first time point to produce an image of the second time point, recognize from that image whether a dangerous situation arises, and output an alarm corresponding to the dangerous situation.

According to an embodiment, estimating the dynamic interactions between entities over time from an input image may be used for, for example, object segmentation, tracking, and automated data annotation for deep-learning processing. More specifically, by estimating the dynamic interactions between entities over time from the input image, the image processing apparatus may predict, for example, the future movements of players in a sports match, of children playing or pedestrians in front of an autonomous vehicle, or of other vehicles. Estimating the future movements of players in a sports match makes it possible to detect key scenes, so that video of those scenes can be delivered effectively to the audience watching the match. The image processing apparatus may also help prevent accidents by predicting the movements of vehicles or pedestrians and warning drivers or other relevant parties of an accident that may occur. In addition, the image processing apparatus may predict finger-movement patterns during typing to improve recognition of typing patterns on a virtual keyboard, or may predict the future movements of elements in physics and thereby contribute to related research fields such as the analysis of relationships between materials.

FIG. 2 is a diagram illustrating an example of a computation graph for explaining the operation of an image processing apparatus according to an embodiment. Referring to FIG. 2, an image processing apparatus 200 according to an embodiment may include a prior 210, an encoder 230, and a decoder 250.

In the dNRI method according to an embodiment, the relationships between entities are expected to differ at each time step, so it is important for the prior distribution to capture these changes. To this end, in an embodiment, an auto-regressive model of the prior probabilities of the relationships between entities may be learned.

Based on the input feature vectors (for example, $x^{t-1}$, $x^{t}$, $x^{t+1}$), the prior 210 may generate prior information that is determined by the past history of the relationships between the entities up to the time point preceding the first time point and by the feature vectors corresponding to the entities input up to the first time point.

At each time step t, the prior 210 may be conditioned on the previous relations as well as on the inputs for times 0 through t.

The prior 210 may be expressed as, for example, Equation 2 below.
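The equation image itself is not reproduced in this text. A plausible form, assuming only the auto-regressive conditioning described in the surrounding paragraphs (the symbols $\phi$, $z^{t}$, $x^{\le t}$, and $z^{<t}$ are notation assumed here for illustration), would be:

$$p_\phi(z \mid x) = \prod_{t} p_\phi\left(z^{t} \mid x^{\le t},\, z^{<t}\right)$$

Here $z^{t}$ denotes the relations at time step t, $x^{\le t}$ the inputs up to and including time t, and $z^{<t}$ the relations at earlier time steps, so that each factor depends only on the past.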

The structure of the prior 210 used in an embodiment is described in detail with reference to FIG. 3 below.

For example, when one edge in the graph neural network is hard-coded to represent the absence of interaction, the prior value for that edge, that is, the prior information corresponding to that edge, may be chosen according to the expected sparsity of the relationships for the given problem. The prior information may tune the loss so that the encoder 230 is biased toward a particular sparsity level. The prior information may, for example, guide the encoder 230 to generate hidden state information for each pair of entities at each time step.

The encoder 230 may generate a latent variable corresponding to the dynamic interaction between the entities based on the feature vectors and the prior information generated by the prior 210.

In an embodiment, the role of the encoder 230 is to approximate the distribution of the relations at each time step as a function of the entire input, rather than of the past input history only. The true posterior distribution over the latent variables, $p_\theta(z \mid x)$, may be a function of the future states of the observed variables x. A key component of the encoder 230 is therefore an LSTM that processes the states of the variables in reverse order.

The encoder 230 may be implemented using, for example, a fully connected graph neural network (GNN) with one node per entity, as shown at 310 of FIG. 3. After learning an embedding for each pair of entities, the encoder 230 may be used to produce posterior relation probabilities for every relation type to be predicted. Given the distribution provided by the encoder 230, sampled relations may be used as input to the decoder. For the model weights to be updated through backpropagation, the sampling procedure in the encoder 230 must be differentiable. Standard sampling from a categorical distribution is not differentiable, so samples are instead drawn from a concrete distribution. This distribution is a continuous approximation of the discrete categorical distribution, and sampling from it may be performed in the form of Equation 3 below.

$$z_{(i,j)} = \mathrm{softmax}\left(\frac{h_{(i,j)} + g}{\tau}\right)$$

Here, $h_{(i,j)}$ denotes the predicted posterior logits for $z_{(i,j)}$, $g$ is a vector of samples from the Gumbel(0, 1) distribution, and $\tau$ is a temperature parameter that controls the smoothness of the distribution.

This procedure approximates discrete sampling in a differentiable way, so that gradients can be backpropagated from the decoder reconstruction to the parameters $\phi$ of the encoder 230.
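The sampling step described above can be sketched in a few lines of standard-library Python. This is a minimal illustration of Gumbel-softmax (concrete) sampling under the stated definitions, not the implementation of the embodiment; the function names `sample_gumbel` and `concrete_sample` and the example logits are assumptions made for illustration, and gradients are not computed here.

```python
import math
import random

def sample_gumbel(rng):
    """Draw one Gumbel(0, 1) sample by inverse-transform sampling."""
    u = rng.random()
    # Clamp away from 0 and 1 so that both logarithms stay finite.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))

def concrete_sample(logits, tau, rng):
    """Relaxed one-hot sample: softmax((logits + gumbel_noise) / tau)."""
    scores = [(h + sample_gumbel(rng)) / tau for h in logits]
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical posterior logits for one edge with three relation types.
rng = random.Random(0)
soft = concrete_sample([2.0, 0.5, -1.0], tau=0.5, rng=rng)
```

The result is a vector of non-negative weights that sum to one. Lowering `tau` pushes the sample toward a one-hot vector, while raising it makes the sample smoother.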

The components of the encoder 230 are described in detail with reference to FIG. 3 below.

Based on the latent variable(s) generated by the encoder, the decoder 250 may predict the movement of the entities as it changes at the second time point.

The decoder 250 may be expressed as $p_\theta(x \mid z)$. The decoder 250 may use the latent variables sampled by the encoder 230 to help predict the future states of the variables x. The latent variable z input to the decoder 250 may change at each time step.

The decoder 250 according to an embodiment may be expressed as Equation 4 below.
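The equation image itself is not reproduced in this text. A plausible form, assuming only that each predicted state depends on the earlier states and on the latent variables available up to that time step (the index notation $x^{\le t}$ and $z^{\le t}$ is an assumption made here), would be:

$$p_\theta(x \mid z) = \prod_{t} p_\theta\left(x^{t+1} \mid x^{\le t},\, z^{\le t}\right)$$

Because $z^{t}$ appears with a time index, the relations fed to the decoder are allowed to differ from one time step to the next.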

In practice, this amounts to selecting a GNN edge model for every edge at each time step, rather than using the same model throughout the sequence. This allows the decoder 250 to adapt its predictions to the state of the system, which improves its ability to model dynamic systems.

Like the encoder 230, the decoder 250 may be based on a graph neural network (GNN). Unlike the encoder 230, however, the decoder 250 may learn a separate GNN for every edge type. When a message (or information) is passed along a given edge (i, j), the edge model used may be the one corresponding to the prediction given by the latent variable input to the decoder 250. In addition, in an embodiment, one edge type may be hard-coded to represent 'no interaction', in which case no message is passed along that edge during computation.
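The per-edge model selection described above can be illustrated with a toy message-passing step. This is a sketch under simplifying assumptions (scalar entity states, hand-written edge functions in place of learned networks, a residual state update); the names `decode_step`, `edge_models`, and `edge_types` are hypothetical.

```python
def decode_step(states, edge_types, edge_models):
    """One message-passing step with a per-edge choice of edge model.

    states      : list of floats, one scalar state per entity (toy setting)
    edge_types  : dict mapping (i, j) to the relation type of edge i -> j
    edge_models : list of callables f_k(x_i, x_j); index 0 is reserved for
                  the hard-coded 'no interaction' type and is never called
    """
    n = len(states)
    new_states = []
    for j in range(n):
        incoming = 0.0
        for i in range(n):
            if i == j:
                continue
            k = edge_types[(i, j)]
            if k == 0:
                continue  # 'no interaction': no message crosses this edge
            incoming += edge_models[k](states[i], states[j])
        # Residual update: next state = current state + aggregated messages.
        new_states.append(states[j] + incoming)
    return new_states

# Hypothetical edge models: type 1 pulls j toward i, type 2 pushes j away.
edge_models = [None,
               lambda xi, xj: 0.5 * (xi - xj),
               lambda xi, xj: -0.5 * (xi - xj)]
states = [0.0, 1.0, 4.0]
edge_types = {(0, 1): 1, (1, 0): 1, (0, 2): 0,
              (2, 0): 0, (1, 2): 2, (2, 1): 0}
next_states = decode_step(states, edge_types, edge_models)
print(next_states)  # [0.5, 0.5, 5.5]
```

Passing a different `edge_types` dictionary at each time step is what lets the relations, and therefore the edge models in use, change over the course of a sequence.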

The decoder 250 may be, for example, a Markovian decoder, in which case the GNN of the decoder 250 is simply a function of the previous prediction. In decoders that depend on all previous states, a recurrent hidden state may be updated using the GNN.

FIG. 3 is a diagram illustrating the components of an image processing apparatus according to an embodiment. Referring to FIG. 3, an image processing apparatus 300 according to an embodiment may include a prior and encoder 310 and a decoder 330.

For example, the encoder may receive the trajectory of each entity as input and encode it into latent variables representing the relationships between the entities. The encoded latent variables may be optimized using information from the prior and decoded by the decoder 330 into the trajectories of the entities in the next frame.

The input to the prior and encoder 310 may be fed through a fully connected graph neural network (GNN) to produce an embedding for every pair of entities at every time step.

The input to the prior at each time step may be passed through a graph neural network (GNN) structure expressed by Equations 5 to 8 below to produce an embedding per time step and per edge.

The GNN structure expressed by Equations 5 to 8 may implement a form of neural message passing on a graph. Here, the vertices v may represent the entities, and the edges e may represent the relationships between the entities.
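The equation images themselves are not reproduced in this text. A plausible form, assuming the usual alternation of vertex-to-edge and edge-to-vertex updates in neural message passing (the function names $f_{\mathrm{emb}}$, $f^{1}_{e}$, $f^{1}_{v}$, $f^{2}_{e}$ and the subscripts are notation assumed here), would be:

$$h^{t}_{i,1} = f_{\mathrm{emb}}\left(x^{t}_{i}\right)$$

$$v \to e:\quad h^{t}_{(i,j),1} = f^{1}_{e}\left(\left[h^{t}_{i,1},\, h^{t}_{j,1}\right]\right)$$

$$e \to v:\quad h^{t}_{j,2} = f^{1}_{v}\left(\sum_{i \ne j} h^{t}_{(i,j),1}\right)$$

$$v \to e:\quad h^{t}_{(i,j),\mathrm{emb}} = f^{2}_{e}\left(\left[h^{t}_{i,2},\, h^{t}_{j,2}\right]\right)$$

In this reading, each entity is first embedded on its own, pairs of entity embeddings are combined into edge messages, the messages arriving at each entity are summed, and the updated entity states are combined once more to give the relation embedding for entities i and j at time t.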

In addition, in the GNN expressed by Equations 5 to 8, each function $f$ may be, for example, a two-layer MLP with 256 hidden/output units and ELU activations. The LSTM models used by the prior and the encoder may use 64 hidden units.

In the prior and encoder 310, $f_{\mathrm{prior}}$ and $f_{\mathrm{enc}}$ may each be, for example, a three-layer MLP with 128 hidden units and ReLU activations. In this case, the logits of the encoder may be produced by passing $h_{\mathrm{emb}}$ through a three-layer MLP with 256 hidden units and a number of output units equal to the number of relation types being modeled.

Depending on the embodiment, the image processing apparatus may use a recurrent decoder for both static neural relational inference and dynamic neural relational inference.

In the equations above, each h may represent intermediate hidden state information for the entities or relationships during the computation. The result of this computation may be an embedding that captures the state of the relationship between entity i and entity j at time t. Each of these embeddings may be fed into an LSTM. Intuitively, this LSTM may model how the relationships between the entities evolve over time.

The input to the prior and encoder 310 may be aggregated using a forward LSTM_prior, which encodes the past history of the relationships between the entities, and a backward LSTM_enc, which encodes the future history of the relationships between the entities.

Every model f shown in FIG. 3 may represent a multilayer perceptron (MLP).

The MLP may transform the hidden state at each time step into the logits of the prior distribution. These final two steps may be expressed as, for example, Equations 9 and 10 below.
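The equation images themselves are not reproduced in this text. A plausible form, assuming a forward recurrence over the relation embeddings followed by an MLP and a softmax over relation types (the symbols $h^{t}_{(i,j),\mathrm{emb}}$, $h^{t}_{(i,j),\mathrm{prior}}$, and $f_{\mathrm{prior}}$ are notation assumed here), would be:

$$h^{t}_{(i,j),\mathrm{prior}} = \mathrm{LSTM}_{\mathrm{prior}}\left(h^{t}_{(i,j),\mathrm{emb}},\; h^{t-1}_{(i,j),\mathrm{prior}}\right)$$

$$p_\phi\left(z^{t}_{(i,j)} \mid x^{\le t},\, z^{<t}\right) = \mathrm{softmax}\left(f_{\mathrm{prior}}\left(h^{t}_{(i,j),\mathrm{prior}}\right)\right)$$

The first line carries the history of each edge forward in time, and the second converts the resulting hidden state into probabilities over the relation types for that edge.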

In an embodiment, instead of passing the previous relation predictions to the prior as input, the dependence of the prior on the relations at previous time steps may be encoded in the hidden state $h^{t}_{(i,j),\mathrm{prior}}$ corresponding to time step t.

The encoder may reuse the relation embeddings $h^{t}_{(i,j),\mathrm{emb}}$ described above and pass them through the backward LSTM_enc. Depending on the embodiment, the encoder may be built from a recurrent neural network other than a long short-term memory (LSTM), such as a GRU or a plain recurrent neural network (RNN).

The final approximate posterior of the encoder may be obtained by concatenating this backward state with the forward state provided by the prior and passing the concatenated result to an MLP. This operation of the encoder may be expressed as Equations 11 and 12 below.
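The equation images themselves are not reproduced in this text. A plausible form, assuming a backward recurrence over the same relation embeddings and an MLP applied to the concatenated backward and forward states (the symbols $h^{t}_{(i,j),\mathrm{enc}}$ and $f_{\mathrm{enc}}$ are notation assumed here), would be:

$$h^{t}_{(i,j),\mathrm{enc}} = \mathrm{LSTM}_{\mathrm{enc}}\left(h^{t}_{(i,j),\mathrm{emb}},\; h^{t+1}_{(i,j),\mathrm{enc}}\right)$$

$$q_\phi\left(z^{t}_{(i,j)} \mid x\right) = \mathrm{softmax}\left(f_{\mathrm{enc}}\left(\left[h^{t}_{(i,j),\mathrm{enc}},\; h^{t}_{(i,j),\mathrm{prior}}\right]\right)\right)$$

The backward state summarizes time steps after t and the forward state summarizes time steps up to t, so their concatenation gives the posterior access to the whole trajectory.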

Because the encoder and the prior share parameters, in an embodiment the encoder parameters $\phi$ may be used for both the encoder and the prior.

In the prior and encoder 310, the prior is computed as a function of the past history only, whereas the approximate posterior produced by the encoder may be computed as a function of both the past and the future. A set of edge variables is sampled from the approximate posterior, and these may be used to select the edge models for the decoder graph neural network (GNN).

The decoder 330 may use this GNN and the previous predictions to evolve a hidden state, and may use the hidden state to predict the states of the entities at the next time step.

Before describing the process of training the parameters $\phi$ of the prior and encoder 310 and the parameters $\theta$ of the decoder 330, the training process of the neural relational inference (NRI) method is reviewed below.

First, the encoder may process the current input x to predict the posterior relation probability $q_\phi(z_{(i,j)} \mid x)$ for every pair of entities. Next, the encoder may sample a set of relations from the concrete approximation to this distribution. Given these samples, the final step is to predict the original trajectory x. To improve decoding performance and to ensure that the decoder relies on the predicted edges, the decoder may, for example, be given ground-truth inputs for a limited number of steps (for example, 10) at training time and then predict the remainder of the trajectory as a function of its previous predictions.

The ELBO given in Equation 1 above may contain the following two terms. First, the reconstruction error assumes that the predicted outputs are the means of a Gaussian distribution with a fixed variance parameter σ, and may consequently be expressed in the form of Equation 13 below.

Second, the KL divergence is taken between a uniform prior and the predicted approximate posterior, and may be expressed, for example, in the form of Equation 14 below.

Here, H may denote the entropy function. The constant term is a consequence of the prior being uniform, which leads to the marginalization of one of the encoder terms in the loss.
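The two terms described above can be checked numerically with a small sketch. It assumes the common forms of each term: a squared error scaled by 1 / (2σ²) for a fixed-variance Gaussian, and, for a uniform prior over K relation types, KL(q ‖ uniform) = log K − H(q), so that the KL term differs from the negative entropy only by a constant. The function names and the example numbers are hypothetical.

```python
import math

def gaussian_recon_error(targets, preds, sigma):
    """Squared error scaled by 1 / (2 * sigma**2), constants dropped."""
    sq = sum((x - m) ** 2 for x, m in zip(targets, preds))
    return sq / (2.0 * sigma ** 2)

def entropy(q):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in q if p > 0.0)

def kl_to_uniform(q):
    """KL(q || uniform), computed directly from the definition."""
    k = len(q)
    return sum(p * math.log(p * k) for p in q if p > 0.0)

# Made-up posterior over three relation types for a single edge.
q = [0.7, 0.2, 0.1]
kl = kl_to_uniform(q)
# With a uniform prior, the KL term equals log K minus the entropy of q.
gap = abs(kl - (math.log(len(q)) - entropy(q)))
```

Since log K does not depend on the model parameters, minimizing this KL term is equivalent to maximizing the entropy of the approximate posterior.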

The neural relational inference (NRI) model is an unsupervised model that infers interactions purely from observed data and represents them explicitly. To this end, a variational autoencoder model may be formulated in which the latent code represents the underlying interaction graph in the form of an adjacency matrix. Both the encoder model and the reconstruction model may be based on graph neural networks. Unlike the dNRI model, the static NRI model assumes that interactions stay the same over time; that is, the NRI formulation assumes that the relationships between all entities are static. This assumption is too strong for many applications, because the way entities interact tends to change over time. For example, basketball players may adjust their positions relative to those of their teammates at different points in time.

Accordingly, in an embodiment, the dynamic Neural Relational Inference (dNRI) method may be used to uncover dynamic interactions and to better track entities whose relationships change over time.

More specifically, in an embodiment, separate relations $z^{t}$ may be predicted for every time step t. The separate relations $z^{t}$ allow the model to respond to entities whose relationships change over the course of a trajectory, which may improve the ability to predict future states. Using the dNRI method requires tracking how the relationships between entities evolve over time, which is not needed for static NRI.

In an embodiment, the purpose of each model component may be reconsidered in order to predict separate relations at every time step. As discussed above, in static NRI the prior is in effect a tunable component of the loss function. In contrast, to make the prior more useful in a sequential setting, in an embodiment the image processing apparatus may predict the relationships between the entities at every time point given all previous states. Also, whereas the encoder in the static NRI method uses a single set of edge predictions covering the entire input trajectory, in an embodiment the encoder may be used so that it captures the state of the system at every time point based on both the past and the future. This state information may be passed from the encoder to the prior during training through the KL-divergence term of the loss function, which may help the prior predict future relations better. As a result of the sequential relation prediction, the decoder may also become more flexible, so that in an embodiment a different model may be used at each time point depending on how the system changes. Together, these changes may lead to a more expressive model with improved prediction performance.

For example, an input trajectory x input to the image processing apparatus 300 according to an embodiment may be passed through a graph neural network (GNN) model to produce a relation embedding $h^{t}_{(i,j),\mathrm{emb}}$ for every time t and every pair of entities (i, j). These relation embeddings are passed to the forward LSTM and/or the backward LSTM, from which the prior $p_\phi(z \mid x)$ and the approximate posterior $q_\phi(z \mid x)$ may be computed. The encoder may then generate relation variables $z$ by sampling from the approximate posterior $q_\phi(z \mid x)$. Given these samples, the decoder may predict the trajectory distribution $p_\theta(x \mid z)$.

In an embodiment, unlike in static NRI, ground-truth states may always be provided to the decoder as input during training. At test time, ground-truth states may be provided for a fixed number of steps, after which the predictions are used as input for the remainder of the trajectory.

According to an embodiment, the reconstruction error in the evidence lower bound (ELBO) is computed in the same manner as described for Equation 13, and the KL divergence may be computed as in Equation 15 below.
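As a general illustration of the KL term (not a reproduction of Equation 15), the KL divergence between a categorical approximate posterior q and a categorical prior p can be summed over time steps and entity pairs as sketched below. The function name `kl_categorical_sum`, the nested-list layout, and the clamping constant `eps` are assumptions made for this example.

```python
import math

def kl_categorical_sum(q, p, eps=1e-12):
    """Sum KL(q || p) over all time steps and edges.

    q[t][e] and p[t][e] are probability lists over relation types
    for edge e at time t (assumed layout).
    """
    total = 0.0
    for q_t, p_t in zip(q, p):
        for q_e, p_e in zip(q_t, p_t):
            for qk, pk in zip(q_e, p_e):
                if qk > 0.0:  # terms with q = 0 contribute nothing
                    total += qk * (math.log(qk) - math.log(max(pk, eps)))
    return total

# Two time steps, one edge, two relation types (illustrative values).
q = [[[0.9, 0.1]], [[0.5, 0.5]]]
p = [[[0.5, 0.5]], [[0.5, 0.5]]]
kl = kl_categorical_sum(q, p)
```

The term is zero when the posterior and the prior agree and grows as they diverge, which is how minimizing it can push the prior toward the encoder's relation predictions.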

According to one embodiment, future states of the system are predicted at test time. Because adequate information about the future is not available at that point, the encoder cannot be used to predict the edges. Therefore, in one embodiment, the prior distribution over the relationships is computed given the previous predictions. A relation prediction is then obtained by sampling from the prior, and this relation prediction is used together with the previous predictions to estimate the next state of the variables. This process may continue until the entire trajectory has been predicted.
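The test-time procedure described above can be sketched as a loop. This is a hedged illustration only: `rollout`, `toy_prior`, and `toy_decoder` are hypothetical names, and the toy prior and decoder stand in for the learned models, which are not specified here.

```python
import random

def rollout(history, num_future, prior_fn, decoder_fn, rng):
    """Predict num_future states after the observed history.

    prior_fn(states) -> per-edge probability lists over relation types (assumed)
    decoder_fn(state, relations) -> next state (assumed)
    """
    states = list(history)
    relations_used = []
    for _ in range(num_future):
        edge_probs = prior_fn(states)  # prior over relations given the past
        z = [rng.choices(range(len(p)), weights=p, k=1)[0] for p in edge_probs]
        states.append(decoder_fn(states[-1], z))  # prediction feeds the next step
        relations_used.append(z)
    return states[len(history):], relations_used

def toy_prior(states):
    """Toy stand-in: two 1-D entities interact more often when close."""
    a, b = states[-1]
    p_interact = 0.9 if abs(a - b) < 2.0 else 0.1
    return [[1.0 - p_interact, p_interact]]  # one edge, two relation types

def toy_decoder(state, relations):
    """Toy stand-in: type 1 pulls the entities together, type 0 does nothing."""
    a, b = state
    if relations[0] == 1:
        step = 0.1 * (b - a)
        return (a + step, b - step)
    return (a, b)

preds, rels = rollout([(0.0, 1.0)], 5, toy_prior, toy_decoder, random.Random(0))
```

Each iteration uses only past states, so no encoder (and no future information) is needed.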

FIG. 4 is a flowchart illustrating an image processing method according to another embodiment. Referring to FIG. 4, an image processing apparatus according to an embodiment may determine the entities of a target whose motion is to be predicted in an image at a first time point (410). For example, the image processing apparatus may receive a selection of a target included in the input image directly from a user, or may set the target included in the input image automatically.

The image processing apparatus may extract feature vectors of the entities determined in operation 410 (420).

The image processing apparatus may generate hidden state information corresponding to the relationships between the entities at the first time point by applying the feature vectors extracted in operation 420 to a graph neural network (430).

The image processing apparatus may generate prior information based on the hidden state information generated in operation 430 (440).

The image processing apparatus may generate posterior information predicted for the entities, based on the prior information generated in operation 440 and the hidden state information generated in operation 430 (450).

The image processing apparatus may generate a latent variable corresponding to the dynamic interaction between the entities, based on the prior information generated in operation 440 and the posterior information generated in operation 450 (460).

The image processing apparatus may predict, based on the latent variable generated in operation 460, the motion of the entities that changes at a second time point (470).

The image processing apparatus may output a result reflecting the predicted motion at the second time point (480).
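To make operation 430 more concrete, the sketch below builds one input per ordered pair of entities for a fully connected graph. Forming each edge input by concatenating the two feature vectors is an assumption made for illustration; the disclosure does not state how edge inputs are formed. The function name `build_edge_inputs` is hypothetical.

```python
from itertools import permutations

def build_edge_inputs(features):
    """Return {(i, j): concatenated features} for every ordered pair i != j.

    features[i] is the feature vector (a list of floats) of entity i.
    """
    return {(i, j): features[i] + features[j]
            for i, j in permutations(range(len(features)), 2)}

# Three entities with two-dimensional feature vectors (illustrative values).
features = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
edges = build_edge_inputs(features)
```

With N entities this yields N(N-1) directed edges, each of which would receive its own hidden state in the graph neural network.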

In one embodiment, to demonstrate the strengths of dynamic neural relational inference (dNRI) compared with static neural relational inference (NRI), experiments may be performed on human motion capture datasets and on basketball player trajectory datasets, as shown in FIGS. 5 and 6 below.

FIGS. 5 and 6 additionally visualize several sample trajectories and predicted relationships to show the behavior of the models. In all models, the first edge type may be hard-coded to indicate no interaction. For evaluation, the models are given the n initial time steps of the input and must predict a number of future steps. When evaluating the static model, two different inference procedures may be used in one embodiment.

In FIGS. 5 and 6, the inference procedure labeled 'S NRI, Dyn inf' shows the result of evaluating relation prediction using the most recent n trajectory predictions, and the inference procedure labeled 'Static NRI' shows the result of evaluating relation prediction using the initial n time steps of the provided input to predict the relation types. These relationships may then be used to decode the entire trajectory.
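The difference between the two inference procedures comes down to which window of time steps is fed to the static model's relation predictor. The sketch below illustrates this under the assumption that, while fewer than n predictions exist, the 'S NRI, Dyn inf' window is filled with the latest observed steps. The function names are hypothetical.

```python
def static_window(observed, predicted, n):
    """'Static NRI': always use the initial n observed time steps."""
    return observed[:n]

def dynamic_window(observed, predicted, n):
    """'S NRI, Dyn inf': use the most recent n time steps, including predictions."""
    return (observed + predicted)[-n:]

observed = [0, 1, 2]   # n = 3 provided time steps (stand-in values)
predicted = [3, 4]     # predictions made so far
fixed = static_window(observed, predicted, 3)
sliding = dynamic_window(observed, predicted, 3)
```

Here `fixed` stays the same for the whole trajectory, while `sliding` moves forward as new predictions are appended.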

FIG. 5 is a diagram illustrating results that reflect human body motion predicted by an image processing method according to an embodiment. Referring to FIG. 5, motion predicted for a test trajectory of a motion capture subject, using four relation types, is sampled and shown for a dynamic neural relational inference (dNRI) model 510, a static neural relational inference (Static NRI) model 530, and a static neural relational inference model with "dynamic" inference (S. NRI (Dyn. Inf)) 550. In FIG. 5, the skeleton drawn with solid lines indicates the ground truth state, and the skeleton drawn with dotted lines may correspond to the motion predicted by each inference model. Each of these frames may be predicted at intervals of 20 time steps after the most recent ground truth is provided.

FIG. 5 shows four predicted time steps for the static and dynamic models. All of the models can capture the general jumping motion, but the dNRI model 510 according to an embodiment tracks the positions of the leg and hip joints far more accurately. As FIG. 5 shows, the dNRI model 510 can predict frames into the future of the gait cycle without deviating far from the ground-truth skeleton. The static models 530 and 550, by contrast, begin to make significant errors as frames further into the future are predicted. These errors may arise at the points where significant deformities of the skeleton begin to appear.

FIG. 6 is a diagram illustrating trajectories that reflect the motion of basketball players predicted by an image processing method according to an embodiment. Referring to FIG. 6, the results of running an experiment on basketball player trajectory data with each of the models of FIG. 5 are shown.

The left graph 610 of FIG. 6 may show the prediction error of each model (the dNRI model 510, the Static NRI model 530, and the S. NRI (Dyn. Inf) model 550) on the basketball player trajectory data. The three plots 630, 650, and 670 of FIG. 6 show samples of trajectory predictions obtained with the respective models.

Plot 630 may correspond to the ground truth, plot 650 to the inference result of the static neural relational inference (NRI) model, and plot 670 to the inference result of the dynamic neural relational inference (dNRI) model according to an embodiment.

The trajectories shown in each of the plots 630, 650, and 670 of FIG. 6 may include the two-dimensional positions and velocities of an offensive team of five players. These positions and velocities may be preprocessed into, for example, 49 frames spanning about 8 seconds of play. All models may be trained on the first 40 frames of the training trajectories. At evaluation time, the models are given the first 40 frames as input and are tasked with predicting the next 9 frames. In one embodiment, for example, models that predict two relation types may be trained.
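The 40/9 frame split and a per-step prediction error can be sketched as follows. The use of mean squared error, the flat 20-value frame layout (5 players with x, y, vx, vy each), the synthetic data, and the repeat-last-frame baseline are all assumptions made for illustration. The text above does not name the error metric plotted in graph 610.

```python
def split_trajectory(frames, num_input=40):
    """Split a trajectory into the provided frames and the frames to predict."""
    return frames[:num_input], frames[num_input:]

def per_step_mse(predicted, target):
    """Mean squared error at each predicted time step, averaged over coordinates."""
    errors = []
    for pred_frame, true_frame in zip(predicted, target):
        sq = [(p - t) ** 2 for p, t in zip(pred_frame, true_frame)]
        errors.append(sum(sq) / len(sq))
    return errors

# Synthetic stand-in data: 49 frames of 20 values each.
frames = [[0.1 * t] * 20 for t in range(49)]
inputs, targets = split_trajectory(frames)

# Naive baseline for illustration: repeat the last observed frame.
baseline = [inputs[-1]] * len(targets)
errors = per_step_mse(baseline, targets)
```

Plotting `errors` against the prediction step gives a curve of the same kind as a per-step error graph, one curve per model.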

From the results of FIG. 6, it can be seen that, when predicting the trajectories of basketball players, the trajectories predicted by the dynamic neural relational inference (dNRI) model are more accurate than those predicted by the static neural relational inference (Static NRI) model. Thus, according to the image processing method of an embodiment, the motion of each player may be predicted based on the relationships between players during a sporting match.

FIG. 7 is a diagram illustrating a process of generating an image at a future time point by reflecting the motion of pedestrians predicted by an image processing method according to an embodiment. Referring to FIG. 7, an image 710 is shown that captures, for example, children playing with a ball around another vehicle parked in front of a moving vehicle.

In one embodiment, a dynamic neural relational inference (dNRI) method may be used to handle systems in which the relationships between entities are expected to change over time. Modeling the relationships between dynamic entities can improve performance in both human motion capture and sports trajectory prediction tasks. The model can also be applied to other domains in which dynamic relationships are expected, including traffic systems and biological neural networks.

Alternatively, according to an embodiment, the motion of a target object predicted in a video by the image processing method may be modeled and used in various applications.

The image processing apparatus according to an embodiment may define relationships between entities based on feature vectors, extracted from the image 710, of the entities of the target whose motion is to be predicted (for example, the children playing with a ball). Here, the relationships between the entities may be defined by hidden state information generated to correspond to the states of the relationships between pairs of entities.

Based on the hidden state information, the image processing apparatus may generate prior information corresponding to the entities, such as the image 720. Here, the prior information reflects the motion of the children occluded by the parked vehicle, and may be determined based on the past history of the entities identified from images preceding the image 710 and on the feature vectors of the entities corresponding to the image 710.

The image processing apparatus may generate a latent variable corresponding to the dynamic interaction between the entities, based on the hidden state information generated from the image 710 and on prior information such as the image 720.

Based on the latent variable, the image processing apparatus may generate an image 730 predicted so as to reveal the motion of the children occluded by the parked vehicle in the image 710, or may generate and display predicted future images (images 740 and 750) for time points after the image 710 was captured.

According to an embodiment, by applying the above image processing method to predict a target, or entities of a target, that are not visible in the current frame, prediction information about pedestrians can be provided to the driver during manual or autonomous driving, or the driver can be warned when a danger arises, thereby supporting safe driving.

The image processing method according to an embodiment may be used in various electronic products for video segmentation, such as instance segmentation and amodal segmentation, or for video tracking.

FIG. 8 is a block diagram of an image processing apparatus according to an embodiment. Referring to FIG. 8, an image processing apparatus 800 according to an embodiment includes a communication interface 810, a processor 830, and a memory 850. The communication interface 810, the processor 830, and the memory 850 may communicate with one another via a communication bus 805. The image processing apparatus 800 may correspond to, for example, a head-up display (HUD) device, a 3D digital information display (DID), a 3D mobile device, or a smart vehicle.

The communication interface 810 receives an image at a first time point that includes the entities of a target whose motion is to be predicted.

The processor 830 extracts feature vectors of the entities from the image at the first time point. The processor 830 estimates the dynamic interaction between the entities at the first time point from the relationships between the entities defined based on the feature vectors. The processor 830 predicts, from the estimated dynamic interaction, the motion of the entities that changes at a second time point. The processor 830 may include, for example, a prior, an encoder, and a decoder such as those shown in FIGS. 2 and 3.

The output device 850 outputs a result reflecting the predicted motion at the second time point. The output device 850 may correspond to, for example, a display device such as the display of a HUD, or to an audio device such as a speaker.

Depending on the embodiment, the communication interface 810 may instead receive feature vectors of the entities that have already been extracted from the image at the first time point. In this case, the processor 830 does not perform feature vector extraction. It estimates the dynamic interaction between the entities at the first time point from the relationships between the entities defined based on the feature vectors, and predicts, from the estimated dynamic interaction, the motion of the entities that changes at the second time point.

The memory 850 may store, for example, the image at the first time point received through the communication interface 810, or the feature vectors of the entities in the image at the first time point received through the communication interface 810. The memory 850 may also store the prior information generated by the processor 830, the latent variable corresponding to the dynamic interaction between the entities estimated by the processor 830, and/or the motion of the entities at the second time point predicted by the processor 830.

The memory 850 may also store various pieces of information generated during the processing performed by the processor 830 described above, as well as various other data and programs. The memory 850 may include a volatile memory or a non-volatile memory, and may be provided with a mass storage medium such as a hard disk to store various data.

The processor 830 may also perform at least one of the methods described above with reference to FIGS. 1 to 7, or an algorithm corresponding to at least one of those methods. The processor 830 may be a hardware-implemented data processing device having circuitry with a physical structure for executing desired operations. For example, the desired operations may include code or instructions included in a program. Hardware-implemented data processing devices may include, for example, a microprocessor, a central processing unit, a processor core, a multi-core processor, a multiprocessor, an application-specific integrated circuit (ASIC), and a field-programmable gate array (FPGA).

The method according to an embodiment may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the embodiments, and vice versa.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied, permanently or temporarily, in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by the processing device or to provide instructions or data to the processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

Although the embodiments have been described above with reference to a limited set of drawings, a person of ordinary skill in the art may apply various technical modifications and variations based on the above. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or components of the described system, structure, apparatus, circuit, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

800: image processing apparatus
805: communication bus
810: communication interface
830: processor
850: memory

Claims

1. An image processing method comprising:
defining relationships between entities of a target whose motion is to be predicted, based on feature vectors of the entities in an image at a first time point;
estimating a dynamic interaction between the entities at the first time point based on the relationships between the entities;
predicting, based on the estimated dynamic interaction, motion of the entities that changes at a second time point; and
outputting a result reflecting the predicted motion at the second time point.

2. The image processing method of claim 1, wherein the relationships between the entities are determined based on at least one of: connection relationships between the entities, positions of the entities, postures of the entities, moving directions of the entities, moving speeds of the entities, trajectories of the entities, a rule applied to the entities, a movement pattern of the entities according to the rule, a law applied to the entities, and a movement pattern of the entities according to the law.

3. The image processing method of claim 1, wherein defining the relationships between the entities comprises:
generating hidden state information corresponding to the relationships between the entities at the first time point by applying the feature vectors to a graph neural network (GNN) comprising nodes corresponding to the entities and edges corresponding to the relationships between the entities.

4. The image processing method of claim 3, wherein the graph neural network comprises a fully connected graph neural network that generates, based on the feature vectors, hidden state information corresponding to states of relationships between pairs of the entities.

5. The image processing method of claim 3, wherein estimating the dynamic interaction comprises:
generating prior information corresponding to the entities based on the hidden state information;
generating posterior information predicted for the entities based on the prior information and the hidden state information; and
generating a latent variable corresponding to the dynamic interaction between the entities based on the prior information and the posterior information.

6. The image processing method of claim 5, wherein the prior information is determined based on a past history of the relationships between the entities up to a time point before the first time point, and on feature vectors of the entities input up to the first time point.

7. The image processing method of claim 5, wherein generating the prior information comprises:
generating the prior information by passing the hidden state information to a forward long short-term memory (LSTM) as forward state information.

8. The image processing method of claim 5, wherein generating the posterior information comprises:
generating the posterior information by passing the prior information and the hidden state information to a backward LSTM as backward state information.

9. The image processing method of claim 5, wherein generating the latent variable comprises:
sampling a result of combining the prior information and the posterior information; and
generating, based on a result of the sampling, the latent variable corresponding to the dynamic interaction between the entities at the first time point.

10. The image processing method of claim 9, wherein generating the latent variable comprises:
optimizing the latent variable based on the prior information.

11. The image processing method of claim 1, wherein the entities of the target comprise at least one of: body parts of a single user, joints of a single user, a plurality of pedestrians, a plurality of vehicles, and players of a sports team.

12. The image processing method of claim 1, wherein predicting the motion of the entities comprises:
predicting the motion of the entities that changes at the second time point by decoding the estimated dynamic interaction.

13. The image processing method of claim 1, wherein outputting the result reflecting the predicted motion comprises:
processing the entities included in the image at the first time point into an image at the second time point by reflecting the predicted motion; and
outputting the image at the second time point.

14. The image processing method of claim 1, wherein outputting the result reflecting the predicted motion comprises:
processing the entities included in the image at the first time point into an image at the second time point by reflecting the predicted motion;
recognizing whether a dangerous situation has occurred based on the image at the second time point; and
outputting an alarm corresponding to the dangerous situation.

15. The image processing method of claim 1, further comprising:
determining the entities of the target whose motion is to be predicted.

16. A computer program stored on a computer-readable recording medium and combined with hardware to execute the method of any one of claims 1 to 15.

17. An image processing apparatus comprising:
a communication interface configured to receive an image at a first time point that includes entities of a target whose motion is to be predicted;
a processor configured to extract feature vectors of the entities from the image at the first time point, to estimate a dynamic interaction between the entities at the first time point from relationships between the entities defined based on the feature vectors, and to predict, from the estimated dynamic interaction, motion of the entities that changes at a second time point; and
an output device configured to output a result reflecting the predicted motion at the second time point.

18. The image processing apparatus of claim 17, wherein the processor comprises:
a prior that generates, based on the feature vectors, prior information determined based on a past history of the relationships between the entities up to a time point before the first time point and on feature vectors corresponding to the entities input up to the first time point;
an encoder that generates a latent variable corresponding to the dynamic interaction between the entities based on the feature vectors and the prior information; and
a decoder that predicts, based on the latent variable, the motion of the entities that changes at the second time point.

19. The image processing apparatus of claim 18, wherein the encoder comprises:
a fully connected graph neural network (GNN) that generates, based on the feature vectors, hidden state information corresponding to states of relationships between pairs of the entities;
a forward LSTM that generates, based on the hidden state information, prior information corresponding to the entities of the target in the image at the first time point;
a backward LSTM that generates, based on the prior information and the hidden state information, posterior information predicted according to the dynamic interaction between the entities; and
a multi-layer perceptron (MLP) that generates the latent variable based on the prior information passed through the forward LSTM and the posterior information passed through the backward LSTM.

20. The image processing apparatus of claim 17, wherein the image processing apparatus comprises at least one of a head-up display (HUD) device, a 3D digital information display (DID), a 3D mobile device, and a smart vehicle.