KR102235745B1

KR102235745B1 - Method for training a convolutional recurrent neural network and for semantic segmentation of inputted video using the trained convolutional recurrent neural network

Info

Publication number: KR102235745B1
Application number: KR1020190094417A
Authority: KR
Inventors: Philippe Weinzaepfel (필립 웨인즈에펠)
Original assignee: Naver Corporation (네이버 주식회사)
Priority date: 2018-08-10
Filing date: 2019-08-02
Publication date: 2021-04-02
Also published as: JP6861249B2; JP2020027659A; KR20200018283A

Abstract

A method of training a convolutional recurrent neural network for semantic segmentation of videos includes: (a) training a first convolutional neural network using a set of semantically segmented training images; and (b) training, using a set of semantically segmented training videos, a convolutional recurrent neural network corresponding to the first convolutional neural network, wherein a convolutional layer has been replaced by a recurrent module having a hidden state. Training the convolutional recurrent neural network includes, for each pair of consecutive frames $(t-1, t)$ of a video of the set of semantically segmented training videos, warping the internal state of the recurrent layer according to an estimated optical flow between the frames of the pair, so that the internal state adapts to the motion of pixels between the frames of the pair, and learning at least the parameters of the recurrent module.

Description

METHOD FOR TRAINING A CONVOLUTIONAL RECURRENT NEURAL NETWORK AND FOR SEMANTIC SEGMENTATION OF INPUTTED VIDEO USING THE TRAINED CONVOLUTIONAL RECURRENT NEURAL NETWORK

The present disclosure relates to a method of training a convolutional recurrent neural network and to a method of semantic segmentation of an input video using the trained convolutional recurrent neural network.

Pursuant to 35 U.S.C. § 119(a), this application claims the benefit of priority from the earlier filing date of European patent application EP 18306104.3, filed on August 10, 2018, the entire contents of which are incorporated herein by reference.

Computer vision is a recent field of endeavor that aims to give computers a high-level understanding of digital videos. It seeks to automate tasks that the human visual system can perform.

One of these tasks is so-called "semantic" segmentation. Generally speaking, segmentation is the process of partitioning an image into sets of pixels. When each of these sets corresponds to an entity whose type can be identified (a car, a person, a building, etc.), the segmentation is considered semantic. In practice, semantic segmentation consists of assigning to each pixel a class label from a predetermined set of classes.
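As a minimal sketch of what "assigning a class label to each pixel" means, the snippet below turns per-pixel class scores into a label map by taking the highest-scoring class. The class list, the function name `segment`, and the score values are hypothetical and chosen only for illustration; plain Python lists are used to keep the example dependency-free.

```python
# Hypothetical class list; the text does not fix a particular one.
CLASSES = ["sky", "car", "person"]

def segment(scores):
    """scores[y][x] is a list of per-class scores; return a map of class labels."""
    return [
        [CLASSES[max(range(len(px)), key=px.__getitem__)] for px in row]
        for row in scores
    ]

# A 2 x 2 "image" with three class scores per pixel.
scores = [
    [[0.9, 0.05, 0.05], [0.2, 0.7, 0.1]],
    [[0.1, 0.8, 0.1], [0.1, 0.2, 0.7]],
]
labels = segment(scores)
print(labels)
```

In a real system the scores would come from the last layer of a segmentation network rather than being written by hand.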

Semantic segmentation is of great interest in many applications, including robotics (to understand the scene and identify where the robot can navigate) and autonomous driving. In a similar spirit, semantic segmentation is useful in the context of augmented reality to understand the scene and to find the extent of objects and of regions in which virtual objects can be added.

Semantic segmentation is a challenging task in computer vision because of the large variety of classes, which include stuff (sky, grass, etc.) as well as well-defined objects (cars, bottles, etc.) that can be large (e.g., a building) or thin (e.g., a traffic light). Semantic segmentation outputs are expected to be smooth both in space (neighboring pixels tend to belong to the same class, except at object boundaries) and in time (a real-world point has a constant label over time, which means that its projections do too).

The use of convolutional neural networks for this purpose has been proposed. A convolutional neural network is a type of neural network in which the connection pattern between neurons is inspired by the visual cortex of animals. Convolutional neural networks allow efficient recognition of entities in images and are therefore particularly suitable for video processing.

Thus, after a supervised learning stage in which a convolutional neural network is trained by feeding it a training set of videos that are already segmented, i.e., provided with the label of each pixel in the frames of the videos, the convolutional neural network is able to segment on its own any input video (unlabeled, in particular "fresh" videos from live CCTV).

Most existing methods for semantic segmentation operate at the image level, i.e., an algorithm is applied to every frame of the video independently.

One conventional approach is to transform an image classification convolutional neural network architecture into a fully convolutional version in order to output a dense prediction. Note that, because of the stride of standard computer vision convolutional neural network architectures, feature maps usually have a lower resolution than the input images. The first deep segmentation approaches therefore refined the output either with graphical models such as conditional random fields or with an upsampling strategy, i.e., an encoder-decoder architecture.

More recently, dilated convolutions (also called atrous convolutions) have been introduced; they allow denser feature maps to be extracted from existing classification convolutional neural network architectures.
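To illustrate why dilation helps keep feature maps dense, the following sketch applies the same three-tap kernel with dilation 1 and dilation 2. The receptive field grows from 3 to 5 samples without any striding or downsampling. It is a one-dimensional, dependency-free simplification; the function name `dilated_conv1d` and the sample values are hypothetical.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Valid 1D cross-correlation with a gap of `dilation` between kernel taps."""
    span = (len(kernel) - 1) * dilation  # receptive field size minus one
    return [
        sum(k * signal[i + j * dilation] for j, k in enumerate(kernel))
        for i in range(len(signal) - span)
    ]

signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 0, -1]
dense = dilated_conv1d(signal, kernel, dilation=1)    # each output sees 3 samples
dilated = dilated_conv1d(signal, kernel, dilation=2)  # each output sees 5 samples
print(dense, dilated)
```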

For better accuracy, a recent trend consists of modeling context information at multiple scales in the last layers. For example, this is done by concatenating the outputs of dilated convolutions at multiple dilation factors together with globally pooled image features, or by performing spatial pooling at several grid scales as in PSP-Net.

Applied frame by frame, such methods typically produce an unsatisfactory flickering effect, in which some regions of the real world undergo many changes of semantic label between consecutive frames. This is illustrated in FIG. 2, which shows examples of per-frame estimation on the three consecutive frames of FIG. 1.

Note that noise artifacts are observed in the labels predicted over time (boxes), including a region that flickers between more than two labels (right box).

To improve consistency over time, semantic segmentation should be performed at the video level, i.e., on the frames jointly.

There have been several recent attempts to address semantic video segmentation. The first approaches were based on Markov random fields, Perturb-and-MAP random fields, or conditional random fields in space and time. More recently, it has been proposed to leverage optical flow to model motion in the pairwise potentials between frames. Another method for refining the semantic segmentation of videos consists of using a filtering strategy. None of these approaches, however, produces consistent video segmentation outputs.

More recently, it has been proposed to introduce a NetWarp module to integrate some temporal consistency into semantic segmentation convolutional neural network architectures. The idea is to combine the features of the current frame with features from the previous frame warped according to the optical flow. In practice, features from consecutive frames are aggregated after warping according to the optical flow and are used to produce the final estimate, but the features remain limited to a predefined and fixed number of frames.

Alternatively, a clockwork convolutional neural network architecture has been proposed for semantic video segmentation. Clockwork convolutional neural network architectures reuse intermediate features from previous frames in order to reduce the runtime of video segmentation, at the cost of a drop in accuracy.

Consequently, there is a need for improved segmentation methods using neural networks that directly leverage temporal information.

The following drawings are for the purpose of illustrating various embodiments and are not to be construed as limiting:
FIG. 1 shows an example of three consecutive frames;
FIG. 2 shows the semantic segmentation of the example frames of FIG. 1 using a conventional per-frame technique;
FIG. 3 shows the semantic segmentation of the example frames of FIG. 1 using the convolutional recurrent neural network architecture of FIG. 7;
FIG. 4 illustrates an example of a system for semantic segmentation;
FIG. 5 illustrates how the optical flow displaces pixels on an example of two consecutive frames;
FIG. 6 shows an example of a convolutional neural network architecture for use in semantic segmentation;
FIG. 7 shows an example of a convolutional recurrent neural network architecture for use in semantic segmentation.

As described below, two complementary aspects of semantic video segmentation are presented: (a) a method of training a convolutional recurrent neural network for semantic segmentation of videos; and (b) a method of semantic segmentation of an input video using a convolutional recurrent neural network, advantageously trained according to the first method.

Any video is a sequence of $T$ frames ($T$ depending on its length), numbered from 1 to $T$. In other words, a generic frame of a video will be referred to as frame $t$, with $t \in \{1, \dots, T\}$. Each frame is an image, i.e., a matrix of pixels of a given size, e.g., 321 × 321 pixels.

Semantic segmentation of a video is the classification of each pixel of each frame of the video, i.e., for each frame, the prediction of a label for each pixel, chosen from a predetermined list of labels and defining the type of the entity depicted by that pixel. The frame is thus partitioned into a plurality of sets of pixels with the same label, and each "connected" set of pixels with the same label defines an entity, i.e., a "real-world" object. For example, all the pixels depicting one car should be labeled as pixels of the "car" type.

The two types of methods mentioned above are implemented within a system such as the one illustrated in FIG. 4, by means of a first and/or a second server 1a, 1b. The first server 1a is the learning server (implementing the first method), and the second server 1b is the segmentation server (implementing the second method). These two servers may well be merged into a single server.

Each of these servers 1a, 1b is typically remote computer equipment connected to an extended network 2, such as the Internet, for data exchange. Each comprises data processing means 11a, 11b of the processor type (in particular, the data processing means 11a of the first server have strong computing power, since learning is long and complex compared with ordinary use of the trained convolutional recurrent neural network) and, optionally, storage means 12 such as a computer memory, e.g., a hard disk.

The first server 1a stores or has access to one or more training databases, i.e., sets of already classified data (as opposed to the so-called input data, which is precisely what is to be classified). As will be explained, the data here consists of images and/or videos and includes at least training videos that are already semantically segmented. In other words, the pixels of the training videos are already labeled.

The architecture advantageously comprises one or more items of client equipment 10, which may be any workstation (also connected to the network 2), preferably separate from the servers 1a, 1b but possibly merged with one and/or the other of them. The client equipment 10 holds one or more data items. The operators of the equipment are typically "clients", in the commercial sense of the term, of the service provider operating the first and/or second servers 1a, 1b.

Recurrent neural networks are a type of neural network that leverages sequential information, in the present case sequences of frames. In other words, the output at a given frame depends not only on the features from that frame but also on the estimates from previous frames, thanks to an "internal state", also called hidden state or memory. Recurrent neural networks are well suited to tasks such as handwriting or speech recognition.

Recurrent neural networks have proven effective for sequence modeling in neural networks: the idea is to learn an internal state that accumulates relevant information over time and to base the estimate on the current input and this internal state. Recurrent neural networks are often hard to train because of vanishing gradient issues.

A recurrent neural network includes a "recurrent neural network module", which is a block of one or more layers exhibiting "recurrent" behavior. Recurrent neural network modules are well known to those skilled in the art. A recurrent neural network may include a gated recurrent unit (GRU) and/or a long short-term memory (LSTM). GRUs and LSTMs contain learnable gates for selectively updating the internal state, thereby allowing gradients to be propagated over longer sequences during training.

Generally speaking, in a general setting in which $x_t$, $h_t$, and $o_t$ denote, respectively, the input, the hidden state, and the output of a recurrent neural network (RNN) at frame $t$, the output $o_t$ of the RNN module is a function of the features $x_t$ at the current frame $t$ as well as of the hidden state $h_{t-1}$ at the previous frame $t-1$. At the same time, the hidden state $h_t$ is updated based on the same inputs, so that an RNN module is defined by the following equation:

$$o_t, h_t = \mathrm{RNN}(x_t, h_{t-1})$$

$h_0$ is initialized with a tensor of all zeros ($h_0 = 0$). Given this zero initialization, the training stage is preferably performed with sequences longer than pairs.
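The recurrence above can be sketched as a simple loop that threads the hidden state through the sequence, starting from a zero state. The cell used here (`toy_cell`, an exponential moving average) is a hypothetical stand-in for a learned module, and scalars replace tensors to keep the example short.

```python
def run_rnn(cell, inputs, h0):
    """Apply o_t, h_t = cell(x_t, h_{t-1}) over a sequence, starting from h0."""
    h, outputs = h0, []
    for x in inputs:
        o, h = cell(x, h)
        outputs.append(o)
    return outputs, h

def toy_cell(x, h_prev):
    # Hypothetical cell: exponential moving average of the inputs.
    h = 0.5 * h_prev + 0.5 * x
    return h, h  # the output is the updated hidden state

outputs, h_last = run_rnn(toy_cell, [1.0, 1.0, 1.0], h0=0.0)
print(outputs)
```

With this toy cell, the state needs several steps to move away from its zero initialization, which gives some intuition for why training on sequences longer than pairs is preferred.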

The recurrent neural network module can be adapted to the task of semantic segmentation so as to take into account the inherent motion in videos.

Indeed, $x_t$ and $o_t$ are here feature maps for the task of semantic segmentation, so convolutional operators (in which the input, the internal state, and the output are 3D tensors) can be used inside the recurrent neural network module, for example a convolutional gated recurrent unit.

Such convolutional operators have been used, for example, for frame prediction and video captioning. Convolutional recurrent neural networks have also been applied to other dense video estimation tasks, such as video object segmentation, in which the goal is to output the binary segmentation of an object across a video given its segmentation in the first frame.

For example, it has been proposed to apply a convolutional gated recurrent unit to semantic video segmentation. However, such models lack motion modeling: the estimate at a given pixel is based on the history of estimates at that pixel, whereas this pixel may depict a different real-world point if there is motion.

Indeed, the output $o_t$ depends on a local spatial neighborhood of $x_t$ and $h_{t-1}$, the size of the neighborhood depending, in a known fashion, on the kernel size of the convolutional operators.

Standard convolutional recurrent neural networks could be considered, but only in cases where a given pixel $p_t$ at frame $t$ and the same pixel $p_{t-1}$ at the previous frame $t-1$ ("same pixel" meaning the pixel at exactly the same coordinates) are projections of the same real-world point, i.e., if this point remains static in the real world (or moves along the projection ray). In other words, this would assume that the video is static, i.e., that there is no motion. However, in many applications, such as autonomous driving, this assumption does not hold, and pixels may move significantly between frames.

More precisely, the estimate at pixel $p_t$ of frame $t$ would depend on the features $x_t$ in the local neighborhood $N(p_t)$ of this pixel and on the local neighborhood $N(p_{t-1})$ from the hidden state $h_{t-1}$ at the previous frame $t-1$. Instead, the estimate at $p_t$ should actually be based on the features $x_t$ in $N(p_t)$ and on the local neighborhood $N(p'_{t-1})$ from the hidden state $h_{t-1}$, where $p'_{t-1}$ is the pixel corresponding to the same real-world point as $p_t$.

FIG. 5, which shows examples of consecutive frames $t-1$ (left) and $t$ (right), depicts the pixels $p_{t-1}$ and $q_{t-1}$ on the previous frame $t-1$ at exactly the same coordinates as two pixels $p_t$ and $q_t$ on the current frame $t$. In contrast to $p'_{t-1}$ and $q'_{t-1}$, the pixels $p_{t-1}$ and $q_{t-1}$ do not belong to the same regions as $p_t$ and $q_t$, respectively.

The so-called "optical flow" is a vector field defining the displacement of all pixels from one image to another (in particular, for a pair of consecutive frames $(t-1, t)$, i.e., a previous frame $t-1$ and a current frame $t$).

For example, the optical flow $F_{t-1 \to t}$ defines the displacement of all pixels from the previous frame $t-1$ to the current frame $t$, and the optical flow $F_{t \to t-1}$ ("backward flow") defines the displacement of all pixels from the current frame $t$ to the previous frame $t-1$.

The pixel $p'_{t-1}$ is given by the equation

$$p'_{t-1} = p_t + F_{t \to t-1}(p_t)$$

In other words, the optical flow "shifts" each pixel according to the vector field.
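As a small illustration of this shift, the sketch below looks up the backward flow vector stored at a pixel of the current frame and adds it to the pixel coordinates to obtain the matching position in the previous frame. The function name `matching_pixel`, the (x, y) tuple layout, and the flow values are assumptions made for the example.

```python
def matching_pixel(p_t, backward_flow):
    """Return p'_{t-1} = p_t + F_{t->t-1}(p_t) for a pixel p_t = (x, y)."""
    x, y = p_t
    dx, dy = backward_flow[y][x]
    return (x + dx, y + dy)

# Hypothetical 2 x 3 backward flow: every pixel came from 1.5 px to the left.
backward_flow = [[(-1.5, 0.0)] * 3 for _ in range(2)]
p_prev = matching_pixel((2, 1), backward_flow)
print(p_prev)
```

The result generally has non-integer coordinates, which is why an interpolation scheme is needed when reading values at that position.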

To adapt the feature maps to the motion of all pixels according to the flow, the use of a recurrent module is combined with the use of a "warping" function. Indeed, if $a_{t-1}$ is a feature map at the previous frame $t-1$ and $a_t$ is the corresponding feature map at the current frame $t$ (feature maps being vector fields, i.e., associating a vector with each pixel), the pixel features $a_t(p_t)$ and $a_{t-1}(p'_{t-1}) = a_{t-1}(p_t + F_{t \to t-1}(p_t))$ are to be combined, where $F_{t \to t-1}$ is the backward optical flow. Since this operation would be hard to implement efficiently (because the pixels involved have different coordinates), the so-called warping of the features is computed first, i.e., the vector field $\overrightarrow{a}_{t-1}$ is built, defined as

$$\overrightarrow{a}_{t-1}(p_t) = a_{t-1}(p_t + F_{t \to t-1}(p_t))$$

Thus, the features $a_t(p_t)$ and $\overrightarrow{a}_{t-1}(p_t)$ can be combined directly, pixel by pixel.

To rephrase, knowing the optical flow ($F_{t \to t-1}$), warping is the computation, from any given feature map $a_{t-1}$, of the warped feature map $\overrightarrow{a}_{t-1}$ corresponding to the given feature map $a_{t-1}$, in which each pixel has undergone a displacement according to the optical flow.

Warping does not change the values of the feature map (i.e., the vectors) but only "spatially rearranges" them: in the present case of warping from the previous frame $t-1$ to the frame $t$, the warping function associates with each pixel $p_t$ of the current image $t$ the value of the feature map to be warped (the internal state $h_{t-1}$ associated with the previous image $t-1$) at another pixel $p'_{t-1}$, determined according to the backward optical flow $F_{t \to t-1}$.

In practice, the optical flow contains float values and not only integers. The warping function can be based on bilinear interpolation (e.g., between four values), which is differentiable, except in the exceptional case where the flow values are integers, in which case the gradient is set to zero.
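A minimal, dependency-free sketch of such a warping function is given below. It samples the previous map at the flow-shifted position of every pixel using bilinear interpolation between the four surrounding values. Several choices are assumptions made for the example and are not specified in the text: the maps are single-channel, positions falling outside the map are clamped to the border, and the function names `bilinear` and `warp` are hypothetical.

```python
import math

def bilinear(fmap, x, y):
    """Sample a 2D map of floats at real-valued (x, y), clamping at the border."""
    h, w = len(fmap), len(fmap[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0  # fractional offsets used as interpolation weights
    top = (1 - ax) * fmap[y0][x0] + ax * fmap[y0][x1]
    bottom = (1 - ax) * fmap[y1][x0] + ax * fmap[y1][x1]
    return (1 - ay) * top + ay * bottom

def warp(fmap_prev, backward_flow):
    """Build warped(p_t) = fmap_prev(p_t + F_{t->t-1}(p_t)) for every pixel p_t."""
    return [
        [bilinear(fmap_prev, x + dx, y + dy) for x, (dx, dy) in enumerate(row)]
        for y, row in enumerate(backward_flow)
    ]

fmap_prev = [[0.0, 10.0, 20.0, 30.0]]   # 1 x 4 map from frame t-1
flow = [[(-0.5, 0.0)] * 4]              # every pixel came from 0.5 px to the left
warped = warp(fmap_prev, flow)
print(warped)
```

In a training setting, the same operation would typically be written with a tensor library so that gradients can flow through the interpolation weights.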

A generic warped recurrent neural network module, named "FlowingRNN", can thus be written as a recurrent neural network module in which the internal state is warped between frames according to the optical flow, i.e.:

$$o_t, h_t = \mathrm{RNN}(x_t, \overrightarrow{h}_{t-1})$$

where $\overrightarrow{h}_{t-1}(p_t) = h_{t-1}(p_t + F_{t \to t-1}(p_t))$ is the internal state warped by the backward optical flow $F_{t \to t-1}$.
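The idea of warping the internal state before each recurrent update can be sketched as follows. To stay short and self-contained, the example is one-dimensional, uses integer flows with a nearest-neighbor lookup instead of bilinear interpolation, and replaces the learned module with a hypothetical `toy_cell` that keeps a per-pixel running maximum. All names and values are illustrative assumptions.

```python
def warp_nearest(state, backward_flow):
    """Warp a 1D state: warped[x] = state[x + flow[x]], rounded and clamped."""
    n = len(state)
    return [state[min(max(round(x + f), 0), n - 1)]
            for x, f in enumerate(backward_flow)]

def flowing_rnn(cell, frames, flows):
    """Run o_t, h_t = cell(x_t, warp(h_{t-1})); flows[t] maps frame t back to t-1."""
    h = [0.0] * len(frames[0])  # the initial state is all zeros
    outputs = []
    for t, x in enumerate(frames):
        if t > 0:
            h = warp_nearest(h, flows[t])  # align the memory with the current frame
        o, h = cell(x, h)
        outputs.append(o)
    return outputs

def toy_cell(x, h_prev):
    # Hypothetical cell: per-pixel running maximum of the evidence seen so far.
    h = [max(a, b) for a, b in zip(x, h_prev)]
    return h, h

# A bright "object" moves one pixel to the right between the two frames.
frames = [[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.5, 0.0]]
flows = [None, [-1, -1, -1, -1]]  # backward flow: each pixel came from x - 1
outputs = flowing_rnn(toy_cell, frames, flows)
print(outputs[1])
```

Because the memory follows the object, the strong evidence from the first frame ends up at the object's new position in the second frame instead of staying at its old coordinates.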

In a preferred embodiment, a "FlowingGRU", i.e., a FlowingRNN based on a convolutional gated recurrent unit module, is used. In its equations, $\odot$ denotes element-wise multiplication, $\star$ the convolution operator, $\sigma$ the sigmoid function, ReLU the rectified linear unit non-linearity (see below), and the $W$ and $b$ terms the learnable parameters (weights and biases, respectively).
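As a sketch, assuming one common gated recurrent unit convention, a convolutional GRU whose hidden state is warped and whose candidate state uses ReLU instead of tanh could be written as below. The subscripts on $W$ and $b$, the symbol $n_t$ for the candidate state, and the choice of which term $z_t$ multiplies are assumptions made for illustration; $\overrightarrow{h}_{t-1}$ denotes the hidden state $h_{t-1}$ warped by the backward optical flow.

```latex
\begin{aligned}
z_t &= \sigma\left(W_{hz} \star \overrightarrow{h}_{t-1} + W_{xz} \star x_t + b_z\right) \\
r_t &= \sigma\left(W_{hr} \star \overrightarrow{h}_{t-1} + W_{xr} \star x_t + b_r\right) \\
n_t &= \mathrm{ReLU}\left(W_{hn} \star \left(r_t \odot \overrightarrow{h}_{t-1}\right) + W_{xn} \star x_t + b_n\right) \\
h_t &= \left(1 - z_t\right) \odot \overrightarrow{h}_{t-1} + z_t \odot n_t \\
o_t &= h_t
\end{aligned}
```

Under this convention, $z_t$ close to 0 keeps the warped memory almost unchanged, while $z_t$ close to 1 replaces it with the candidate state computed from the current input.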

Intuitively, the reset gate $r_t$ learns how to combine the input $x_t$ with the previous hidden state $h_{t-1}$, while the update gate $z_t$ learns how much of the previous memory should be kept.

The main difference between the present FlowingGRU and a standard convolutional gated recurrent unit is that the hidden state is warped according to the optical flow, so that the estimate at a given pixel and time step is based on the history of this particular real-world point, even in the case of pixel motion. A minor difference is that a ReLU non-linearity is preferably used instead of the standard tanh, since ReLU is more commonly used in computer vision convolutional neural network architectures.

Those skilled in the art will be able to transpose the described architecture to any other type of recurrent module. In particular, instead of a "FlowingGRU", a "FlowingLSTM", i.e., a FlowingRNN based on a convolutional long short-term memory module, may be used.

In general, FlowingRNN can be plugged into any fully convolutional image segmentation approach so as to perform enhanced semantic segmentation of videos, with increased performance, especially in terms of consistency over time.

FIG. 3 shows examples of semantic segmentation using FlowingRNN on the three consecutive frames of FIG. 1. Compared with FIG. 2, the absence of noise artifacts in the labels predicted over time (boxes) can be observed in FIG. 3.

The convolutional recurrent neural network, the flow estimation, and the warping modules are all differentiable, which makes it possible to train the full network end to end.

제1 측면에 있어서, 훈련 방법은 제1 서버(1a)의 데이터 처리 수단(11a)에 의해 구현된다. 방법은 비디오들의 의미적 세그먼트화를 위해 컨볼루션 순환 신경망을 훈련시킨다. In the first aspect, the training method is implemented by the data processing means 11a of the first server 1a. The method trains a convolutional recurrent neural network for semantic segmentation of videos.

첫 번째 단계(a)에 있어서, 표준적인 컨볼루션 신경망(비순환)은, "제1 컨볼루션 신경망"으로 참조될 것이고, 미리 의미적으로 세그먼트화된 훈련 이미지들의 베이스(예컨대, MS-Coco 데이터 세트)로부터 훈련된다. In the first step (a), the standard convolutional neural network (acyclic) will be referred to as the “first convolutional neural network” and a base of pre-semantically segmented training images (e.g., MS-Coco data set). ) From

In other words, the first convolutional neural network is a classic "per-frame" baseline that does not take into account local dependency over time. Any known architecture can be used for the first convolutional neural network.

A convolutional neural network generally contains four types of layers that process the information:

(a) the convolution layer, which processes blocks of the image one after the other;

(b) the non-linear layer (also called correction layer), which improves the pertinence of the result by applying an "activation function";

(c) the pooling layer, which groups several neurons together into one single neuron;

(d) the fully connected layer, which connects all the neurons of a layer with all the neurons of the preceding layer.

Note that in the case of a fully convolutional network, the fully connected layer(s) no longer exist and are converted into convolutional layers of kernel size 1. This yields an output map of probabilities instead of a single probability distribution for the entire image.

As the activation function of the non-linear layer, the ReLU function (Rectified Linear Unit) is most frequently used, which is equal to f(x) = max(0, x). For the pooling layer (POOL), the MaxPool2×2 function is the most used, which corresponds to the maximum among the four values of a square (four values are pooled into one), but other pooling functions such as AveragePool or SumPool can be used.
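The two operations named above can be illustrated with a minimal sketch in plain Python. The function names and the restriction to a single-channel grid with even height and width are assumptions made for illustration only.

```python
def relu(x):
    """Rectified linear unit: f(x) = max(0, x)."""
    return max(0.0, x)


def max_pool_2x2(feature_map):
    """Pool each non-overlapping 2x2 square of a 2D grid into its maximum.

    Assumes even height and width (an assumption of this sketch).
    """
    height, width = len(feature_map), len(feature_map[0])
    return [
        [
            max(
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            )
            for c in range(0, width, 2)
        ]
        for r in range(0, height, 2)
    ]


# Apply ReLU element-wise, then pool four values into one.
raw = [[-1.0, 2.0, 0.5, -3.0],
       [4.0, -2.0, -0.5, 1.0]]
activations = [[relu(v) for v in row] for row in raw]
pooled = max_pool_2x2(activations)
```

Replacing `max` with an average or a sum in `max_pool_2x2` would give the AveragePool and SumPool alternatives mentioned above.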

The convolution layer (CONV) and the fully connected layer (FC) generally correspond to a scalar product between the neurons of the preceding layer and the weights of the convolutional neural network.

Typical convolutional neural network architectures stack a few pairs of layers CONV → RELU, then add a POOL layer, and repeat this scheme [(CONV → RELU)^p → POOL] until a sufficiently small output vector is obtained, and terminate with one or two fully connected layers.

In the example of FIG. 6, an architecture for the first convolutional neural network is illustrated.

As illustrated in FIG. 6, the first convolution layer and blocks 1 to 4 are taken from ResNet-101, except that blocks 3 and 4 use dilated convolutions with a factor (also called rate) of 2 and 4 respectively, which reduces the stride of the network from 32 to 8.
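The stride arithmetic behind "from 32 to 8" can be checked with a short sketch. The per-stage strides listed below are those of a standard ResNet as commonly described; they are an assumption of this example and are not stated in the text, and the dictionary keys are illustrative names.

```python
def output_stride(layer_strides):
    """Overall stride of a network: the product of the strides of its stages."""
    total = 1
    for stride in layer_strides:
        total *= stride
    return total


# Assumed stage strides of a standard ResNet: conv1, max-pool, blocks 1 to 4.
resnet_strides = {"conv1": 2, "pool1": 2, "block1": 1,
                  "block2": 2, "block3": 2, "block4": 2}

# Dilated variant: blocks 3 and 4 keep stride 1 and use dilation rates 2 and 4,
# so the receptive field is preserved without further downsampling.
dilated_strides = dict(resnet_strides, block3=1, block4=1)
dilation_rates = {"block3": 2, "block4": 4}

standard = output_stride(resnet_strides.values())
dilated = output_stride(dilated_strides.values())
```

Under these assumptions, a 321x321 input yields feature maps of roughly 41x41 at stride 8 rather than 11x11 at stride 32, which is why dilation is attractive for dense prediction.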

Next comes an Atrous Spatial Pyramid Pooling (ASPP) module, in which the results of different convolutional kernels with various dilation factors are stacked. The output of the Atrous Spatial Pyramid Pooling module is fed to a 1x1 convolution layer with 256 filters (with batch normalization and ReLU), before a final 1x1 convolution layer that outputs the class scores. These scores are bilinearly upsampled to the initial image resolution, so as to apply, for example, the per-pixel softmax loss at training time, or to estimate the predicted class by taking the per-pixel argmax at test time.
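As a rough illustration of how the upsampled class scores are used, the sketch below computes a per-pixel softmax loss (training time) and a per-pixel argmax (test time) in plain Python. The function names and the nested-list layout of the score map are assumptions of this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pixel_loss(scores, label):
    """Training time: softmax (cross-entropy) loss for one pixel."""
    return -math.log(softmax(scores)[label])


def predict_labels(score_map):
    """Test time: per-pixel argmax over class scores.

    score_map[r][c] is the list of class scores of pixel (r, c).
    """
    return [[max(range(len(px)), key=px.__getitem__) for px in row]
            for row in score_map]
```

For example, `predict_labels([[[0.1, 2.0, -1.0], [3.0, 0.0, 0.0]]])` labels the first pixel with class 1 and the second with class 0.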

To train the first convolutional neural network, Stochastic Gradient Descent (SGD) with a batch size of 16 can be used. In particular, a so-called 'poly' learning rate schedule can be applied, i.e., the initial learning rate is multiplied, at iteration i out of a total number of N iterations, by a factor that depends on i and N.
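The multiplicative factor itself is not reproduced in the text above. The sketch below uses the form in which the 'poly' schedule is commonly written, (1 - i/N) raised to a fixed power; both this form and the default power of 0.9 are assumptions of the example rather than values taken from this document.

```python
def poly_learning_rate(base_lr, iteration, total_iterations, power=0.9):
    """Learning rate at `iteration` under a 'poly' schedule.

    The initial rate is multiplied by (1 - i / N) ** power. The value
    power=0.9 is an assumption (a common choice), not taken from the text.
    """
    if not 0 <= iteration <= total_iterations:
        raise ValueError("iteration must lie in [0, total_iterations]")
    return base_lr * (1.0 - iteration / total_iterations) ** power
```

With this form the rate starts at `base_lr` for i = 0 and decays smoothly to zero at i = N.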

Training patches are generated from random crops of the training images, having for example a size of 321x321 pixels, while the convolutional neural network is tested on whole images. At training time, before taking the random crop, data augmentation is preferably performed by randomly flipping the images left-right and applying a random scaling factor in the range [0.5, 1.5].
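The augmentation order described above (flip, scale, then crop) can be sketched as a parameter sampler. The function name, the return layout and the handling of images that become smaller than the crop after scaling are assumptions of this example.

```python
import random


def sample_augmentation(height, width, crop=321, rng=random):
    """Draw flip, scale and crop parameters for one training image.

    Returns (flip, scale, top, left, crop_h, crop_w); the crop window is
    expressed in the coordinates of the flipped and rescaled image.
    """
    flip = rng.random() < 0.5                 # random left-right flip
    scale = rng.uniform(0.5, 1.5)             # random scaling factor
    new_h, new_w = round(height * scale), round(width * scale)
    # Assumption: when the scaled image is smaller than the crop, the crop
    # is clipped to the image (padding would be another option).
    crop_h, crop_w = min(crop, new_h), min(crop, new_w)
    top = rng.randint(0, new_h - crop_h)
    left = rng.randint(0, new_w - crop_w)
    return flip, scale, top, left, crop_h, crop_w
```

Passing a seeded `random.Random` instance as `rng` makes the sampled parameters reproducible; the same parameters would have to be applied to the image and to its label map.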

Additionally (step (a)), a further training is performed, namely that of a second convolutional neural network, from a base of training pairs of consecutive frames between which the optical flow is known.

As will be explained, the second convolutional neural network aims at estimating the optical flow in pairs of frames, in particular the backward optical flow $F_{t \to t-1}$ from the current frame t to the previous frame t-1.

The second convolutional neural network may be, for example, FlowNetSimple or FlowNetCorrelation. The FlowNetSimple architecture (generally just called FlowNetS) is preferred and is used in the example of FIG. 6. In this architecture, both input images are stacked together and fed through a network made only of convolutional layers.

The ReLU non-linearity and the 'upconvolution' can also be implemented as convolutions.

Next (step (b)), the convolutional recurrent neural network is trained. The idea is to build the convolutional recurrent neural network from the first convolutional neural network (and from the second convolutional neural network, if any).

The convolutional recurrent neural network corresponds to the first convolutional neural network in which a convolutional layer has been replaced by a recurrent module having a hidden state as explained above (a FlowingRNN module such as a FlowingGRU).

Preferably, as represented in FIG. 6, the convolutional layer of the first convolutional neural network that is replaced by the recurrent module is the penultimate convolutional layer. This position gives excellent results because it directly follows the Atrous Spatial Pyramid Pooling module in the ResNet-101-based example, but the recurrent module may be placed elsewhere in the first convolutional neural network, for instance in place of the last convolutional layer. Also, there could be three convolutional layers after the Atrous Spatial Pyramid Pooling module, and the recurrent module could replace the first of them (i.e., the antepenultimate convolutional layer of the first convolutional neural network).

The training of the first convolutional neural network (step (a)) can be seen as an "initialization" of the parameters of the convolutional recurrent neural network, and the training of the convolutional recurrent neural network (step (b)) can be seen as a "fine-tuning" of these parameters by further training from a base of already semantically segmented training videos (as explained below).

Some of the parameters obtained through the training of the first convolutional neural network may be fixed, i.e., not further learned during the training of the convolutional recurrent neural network, preferably the parameters of each layer of the convolutional recurrent neural network located before the recurrent module (i.e., up to and including the Atrous Spatial Pyramid Pooling module). The parameters that are further learned are then preferably those of the recurrent module, of the last convolutional layer of the convolutional neural network (and generally of each convolutional layer that is not fixed), and possibly of the second convolutional neural network (if any).

The most popular benchmarks for semantic segmentation are limited to images and cannot be used for training the convolutional recurrent neural network. The present method can be trained even if only a few pixels are annotated in the videos, by ignoring in the loss the pixels that are not annotated, but training with dense ground truth is likely to lead to better temporal consistency.
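Ignoring unannotated pixels in the loss can be sketched as a masked cross-entropy. The marker value `IGNORE`, the function name and the choice to average over annotated pixels are assumptions made for illustration.

```python
import math

IGNORE = -1  # hypothetical marker for pixels without annotation


def masked_cross_entropy(probabilities, labels, ignore_label=IGNORE):
    """Mean cross-entropy over annotated pixels only.

    probabilities[i] is the predicted class distribution of pixel i and
    labels[i] its ground-truth class, or `ignore_label` if unannotated.
    """
    losses = [-math.log(probs[label])
              for probs, label in zip(probabilities, labels)
              if label != ignore_label]
    # With no annotated pixel the frame contributes nothing to the loss.
    return sum(losses) / len(losses) if losses else 0.0
```

A frame without any annotation then simply yields a zero loss, so the same routine covers both sparsely annotated pixels and entirely unannotated frames.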

Most real-world datasets have only one annotated frame per sequence, while some synthetic benchmarks are limited by the realism of the data or by the low number of sequences. Therefore, the base of already semantically segmented training videos is preferably the recent Viper dataset, which consists of around 250k frames from almost 200 video sequences captured from the realistic Grand Theft Auto (GTA) V video game.

The synthetic rendering makes it possible to obtain ground truth for multiple tasks, including semantic segmentation. Illumination and weather conditions vary from day to night and from sunny to rainy, snowy or foggy, which makes the dataset challenging. The semantic segmentation task has 23 classes, including stuff (e.g., sky, terrain), vehicles (e.g., car, truck) and small objects (e.g., traffic sign, traffic light).

The training of the convolutional recurrent neural network (step (b)) advantageously starts with estimating (sub-step (b0)), for each pair of successive frames (t-1, t) of a video of the base of already semantically segmented training videos, the optical flow between the frames of the pair (in particular, the backward optical flow $F_{t \to t-1}$ from the current frame t to the previous frame t-1).

This step in training (step (b)) is preferably performed using the trained second convolutional neural network (FlowNetS in FIG. 6), which takes both frames of the pair as input, but the present method is not limited to any particular technique for estimating the optical flow.

Note that the optical flow may already have been estimated between the frames of the training base, so that this step (sub-step (b0)) remains optional in training.

Additionally (sub-step (b1)), as already explained, the internal state of the recurrent layer is warped according to the estimated optical flow, so as to adapt the internal state to the motion of pixels between the frames of the pair.
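A minimal sketch of such a warping is given below for a single-channel state stored as a nested list. Each pixel of the current frame reads the previous state at the position indicated by the backward flow. The use of bilinear interpolation for non-integer positions, the clamping at the image border and the (dy, dx) flow layout are assumptions of this example.

```python
import math


def bilinear_sample(state, y, x):
    """Bilinearly interpolate a 2D grid at a real-valued position.

    Positions are clamped to the grid border (an assumption of this sketch).
    """
    height, width = len(state), len(state[0])
    y = min(max(y, 0.0), height - 1.0)
    x = min(max(x, 0.0), width - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, height - 1), min(x0 + 1, width - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * state[y0][x0] + wx * state[y0][x1]
    bottom = (1 - wx) * state[y1][x0] + wx * state[y1][x1]
    return (1 - wy) * top + wy * bottom


def warp_state(prev_state, backward_flow):
    """Warp the previous internal state onto the current frame.

    backward_flow[r][c] = (dy, dx) points from pixel (r, c) of the current
    frame t to its position in the previous frame t-1.
    """
    return [
        [bilinear_sample(prev_state, r + dy, c + dx)
         for c, (dy, dx) in enumerate(row)]
        for r, row in enumerate(backward_flow)
    ]
```

A real internal state has several channels; the same sampling would then be applied to each channel with the same flow. Because bilinear sampling is differentiable with respect to both the state and the flow, a warping of this kind is compatible with end-to-end training.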

Next (sub-step (b2)), at least the parameters of the recurrent module are learned.

In addition to the recurrent module, since the output of this layer will differ from that of the per-frame baseline, the following layers are also re-learned. As already explained, the parameters that are further learned are then preferably those of the recurrent module, of the last convolutional layer of the convolutional neural network (and generally of each convolutional layer that is not fixed), and possibly of the second convolutional neural network (if any).

To this end, stochastic gradient descent may be used, with backpropagation through time (BPTT), a batch size of 4, training sequences of 12 consecutive frames, and a poly learning rate schedule similar to the one used for the initial training of the first convolutional neural network. A similar data augmentation strategy may also be used. At test time, the hidden state from the previous frame can be used without any limitation on the sequence length.
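One way to organize the training data for such a scheme is sketched below: each video is cut into clips of 12 consecutive frame indices, and the clips are grouped into mini-batches of 4. The non-overlapping split and the dropping of trailing frames are assumptions of this example; the text does not specify how the sequences are sampled.

```python
def training_clips(num_frames, clip_length=12):
    """Split a video into non-overlapping clips of consecutive frame indices.

    Trailing frames that do not fill a whole clip are dropped (an
    assumption of this sketch).
    """
    return [list(range(start, start + clip_length))
            for start in range(0, num_frames - clip_length + 1, clip_length)]


def batches(clips, batch_size=4):
    """Group clips into mini-batches for backpropagation through time."""
    return [clips[i:i + batch_size] for i in range(0, len(clips), batch_size)]


clips = training_clips(50)     # a hypothetical 50-frame video
mini_batches = batches(clips)
```

During training, gradients would be propagated only within a clip, whereas at test time the hidden state is simply carried from frame to frame over the whole video.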

In a second aspect, a method for semantic segmentation of an inputted video is proposed, implemented by the data processing means 11b of the second server 1b. In other words, the method according to the second aspect performs the semantic segmentation of the inputted video, i.e., it labels the pixels of the frames of the inputted video.

The inputted video to be segmented may be received from the client equipment 10.

In a first step (a'), as illustrated in FIG. 7, the training of a convolutional recurrent neural network is performed by the first server 1a (from at least a base of already semantically segmented training videos, and advantageously a base of already semantically segmented training images). Preferably, the training is consistent with the method according to the first aspect; that is, the first convolutional neural network is trained, and a convolutional recurrent neural network is then built from it, in which the penultimate convolutional layer is replaced by a recurrent module of the FlowingRNN type.

Alternatively or in combination, step (a') may further comprise training a convolutional neural network (i.e., the second convolutional neural network of the training method) from a base of training pairs of consecutive frames between which the optical flow is known.

It is to be understood that either the first and second servers 1a, 1b are the same equipment, or step (a') comprises the transmission of the trained convolutional recurrent neural network parameters and weights from the first server 1a to the second server 1b, for storage in the memory 12.

In a second step (b'), the processing means 11b of the second server 1b use the convolutional recurrent neural network for the segmentation.

To this end, step (b') comprises, for each pair of successive frames (t-1, t) of the inputted video:

(b'0) estimating the optical flow between the frames of the pair;

(b'1) warping the internal state of the recurrent layer according to the estimated optical flow, so as to adapt the internal state to the motion of pixels between the frames of the pair; and

(b'2) semantically segmenting each frame with the convolutional recurrent neural network.
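The sub-steps above can be sketched as a frame-by-frame loop. The three callables passed to `segment_video` are hypothetical stand-ins for the flow estimator, the warping function and the recurrent segmentation network; their names and signatures are assumptions of this example, as is the choice to skip flow estimation and warping for the first frame.

```python
def segment_video(frames, estimate_flow, warp, segment_frame, initial_state=None):
    """Run steps (b'0) to (b'2) over a video of any length.

    The three callables are hypothetical stand-ins:
      estimate_flow(current, previous) -> backward optical flow
      warp(state, flow)                -> warped internal state
      segment_frame(frame, state)      -> (labels, new internal state)
    """
    labels_per_frame = []
    state, previous = initial_state, None
    for frame in frames:
        if previous is not None:
            flow = estimate_flow(frame, previous)    # (b'0)
            state = warp(state, flow)                # (b'1)
        labels, state = segment_frame(frame, state)  # (b'2)
        labels_per_frame.append(labels)
        previous = frame
    return labels_per_frame
```

Since only the latest frame and the current internal state are kept, memory use does not grow with the length of the video.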

Sub-steps (b'0) and (b'1) are advantageously equivalent to sub-steps (b0) and (b1) described above, and may include any refinement thereof.

Finally, the classification result obtained may be transmitted back to the client equipment 10.

The efficiency of the present method can be evaluated using the standard per-pixel mean Intersection-over-Union metric (mIoU^P): for each class, the ratio of the number of pixels correctly estimated for this class to the total number of pixels for which either the ground truth or the estimate is this class is computed, and the average over all classes is reported.
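The per-pixel metric just defined can be sketched as follows for flat lists of labels. The function name and the choice to skip classes that appear neither in the prediction nor in the ground truth are assumptions of this example.

```python
def mean_iou(predictions, ground_truth, num_classes):
    """Mean Intersection-over-Union over flat lists of pixel labels.

    Classes that appear in neither list are skipped when averaging
    (an assumption of this sketch).
    """
    ious = []
    for cls in range(num_classes):
        intersection = sum(p == cls and g == cls
                           for p, g in zip(predictions, ground_truth))
        union = sum(p == cls or g == cls
                    for p, g in zip(predictions, ground_truth))
        if union:
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0
```

For instance, with predictions `[0, 0, 1, 1]` and ground truth `[0, 1, 1, 1]`, class 0 has an IoU of 1/2 and class 1 an IoU of 2/3, giving a mean of 7/12.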

However, this does not measure the temporal consistency of the semantic segmentation prediction. To this end, a mean Intersection-over-Union can further be computed at the trajectory level: dense trajectories are extracted over the full videos, with optical flow extracted at half resolution and a density of 8 pixels, and a trajectory has to be consistently labeled across all frames with the ground-truth class in order to be considered correct.

Only the trajectories for which the ground-truth labeling is consistent are kept, in order to filter out noisy trajectories due to errors in the flow estimation. In practice, this filtering stage removes about 8% of the extracted trajectories.

This makes it possible to report a "per-trajectory" mean Intersection-over-Union (mIoU^T), in which trajectories whose pixels have different estimated labels are counted as wrong. The consistency of each class can also be computed as follows: for each class, the ratio of trajectories for which the prediction is consistent is computed among the trajectories whose ground truth is consistently that class. The "mean consistency" (mCons.) can be reported by averaging over all classes.
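The mean consistency can be sketched as follows, with each trajectory represented as a pair of label lists (ground truth and prediction, one label per frame). This representation, the function name, and the reading that a prediction counts as consistent when all its labels agree, whatever the predicted class, are assumptions of this example.

```python
def mean_consistency(trajectories, num_classes):
    """Mean per-class consistency over trajectories.

    Each trajectory is a pair (ground_truth_labels, predicted_labels), one
    label per frame along the trajectory.
    """
    ratios = []
    for cls in range(num_classes):
        # Trajectories whose ground truth is consistently this class.
        kept = [pred for gt, pred in trajectories
                if all(label == cls for label in gt)]
        if kept:
            consistent = sum(len(set(pred)) == 1 for pred in kept)
            ratios.append(consistent / len(kept))
    return sum(ratios) / len(ratios) if ratios else 0.0
```

Trajectories whose ground truth is itself inconsistent never enter `kept`, which mirrors the filtering of noisy trajectories described above.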

First, the impact of the length T of the training sequences on the performance of the preferred FlowingGRU module is studied. At test time, the hidden state of the previous frame is used in all cases, independently of the length of the sequences used at training. As a sanity check, the model is first trained with sequences of one frame and an all-zeros internal state in all cases, so as to validate that the performance is similar to that of the per-frame baseline. T is then increased from 2 to 12. A consistent gain in performance is observed as T increases. When using training sequences of 12 frames, mIoU^P is higher by 1.2%, mIoU^T by 2.1% and mCons. by 3.5% compared to the per-frame baseline.

Next, the impact of the number of output channels C of the convolutions (and thus also of the internal state) of the preferred FlowingGRU module, and of the kernel size K (of all convolutions in the FlowingGRU), is studied. Optimal performance is observed with C = 256 and K = 3, which can be explained by the following facts:

a lower number of channels is not sufficient to carry the relevant information, both spatially from the previous layer and over time from the previous frames, for accurate semantic segmentation;

a kernel with a size greater than 1 makes it possible to integrate information over a local neighborhood, and thus to recover from small errors in the optical flow computation;

a kernel with a size greater than 3 has too many parameters to effectively learn local temporal consistency.

The impact of the FlowingGRU module (with the preferred values T = 12, K = 3 and C = 256) is compared to several variants and to related works in Table 1 (left). Note that none of these works has been tested on the Viper dataset and that they use different backbone architectures. For a fair comparison, the same baseline is used, and each of the proposed modules is plugged in at the same position as the FlowingRNN and follows the same training scheme.

[Table 1]

Compared to the per-frame baseline, the FlowingGRU module (with ReLU non-linearity) shows a clear improvement of 1.22% in mIoU^P, 2.14% in mIoU^T and 3.44% in mCons. This highlights that the FlowingGRU effectively leverages temporal information for dense video prediction tasks while taking pixel motion into account.

The FlowingGRU module with ReLU non-linearity was compared to two variants. The first one uses the standard tanh non-linearity (instead of ReLU); see row 'FlowingGRU (tanh)' in Table 1. A drop of 0.8% in mIoU^P is observed, which shows that ReLU is more suitable for semantic segmentation.

In the second variant, the parameters of FlowNetS are fixed and not fine-tuned during training (i.e., the second convolutional neural network is not further learned in step (b)); see row 'FlowingGRU (fixed flow)' in Table 1. In this case, mIoU^P drops by 0.9%, which shows the importance of training the flow estimation jointly with the FlowingGRU module.

The FlowingGRU module is further compared to other recurrent models. A standard convolutional gated recurrent unit (ConvGRU), without warping of the internal state, as used for instance in the context of semantic video segmentation, is tried first. Results are reported in row 'ConvGRU' of Table 1. The three metrics increase slightly compared to the per-frame baseline, which shows that ConvGRU does learn some temporal consistency. However, ConvGRU performs significantly worse than the preferred FlowingGRU, with for instance an mIoU^P that is 0.8% lower. Indeed, by keeping the internal state as it is between consecutive frames, ConvGRU assumes that pixels at the same location in consecutive frames are projections of the same real-world point, which is wrong in most cases. Similarly, mIoU^T and mCons. also drop significantly, by 1.63% and 2.55% respectively.

Next, the gated recurrent unit is replaced by a long short-term memory (LSTM), in which a ReLU non-linearity is also used instead of tanh; see rows 'ConvLSTM' and 'FlowingLSTM' in Table 1. The performance is significantly lower than with the gated recurrent unit. One explanation is that LSTMs are harder to train than gated recurrent units for vision tasks. One interesting result is that FlowingLSTM performs significantly better than the convolutional LSTM (+4.06% in mIoU^P), which again highlights the importance of warping the internal state.

Finally, the position of the FlowingGRU module was studied by replacing the last convolutional layer instead of the penultimate one. Note that in this case the output goes into a softmax (no non-linearity is used). This variant leads to a decrease of 0.6% in mIoU^P on the Viper dataset. When using ConvGRU, the performance is also lower when the module replaces the last layer. By replacing the penultimate layer, the history is embedded in a latent space from which the final prediction is made, which is more powerful than directly predicting the final segmentation from the hidden state and the current features.

Several experiments were also run on the real-world Cityscapes dataset using the fine annotations only, i.e., 2975 sequences for training and 500 sequences for validation. Each sequence has 30 frames, with annotation on only one frame (the 12th frame).

The training and testing procedures can be kept similar to those used for the Viper dataset. The only difference is that ground truth is annotated on sparse frames. Therefore, during training, the loss is ignored for frames that are not annotated, which makes the video information less effectively leveraged. Noisy trajectories cannot be cleaned with only one annotated frame per sequence, so only mIoU^P can be reported.

Results for the different variants are reported in the right column of Table 1. First, FlowingGRU outperforms the per-frame baseline, with a gain of 1.14% in mIoU^P. In all cases, the 'Flowing' counterpart performs significantly better than the static convolutional module.

In summary, a method for training a convolutional recurrent neural network for semantic segmentation of videos comprises: training a first convolutional neural network from a base of already semantically segmented training images; and training, from a base of already semantically segmented training videos, a convolutional recurrent neural network corresponding to the first convolutional neural network in which a convolutional layer has been replaced by a recurrent module having a hidden state. Said training comprises, for each pair of successive frames of a video of the base of already semantically segmented training videos, warping the internal state of the recurrent layer according to an estimated optical flow between the frames of the pair, so as to adapt the internal state to the motion of pixels between the frames of the pair.

Advantageously, providing a standard convolutional recurrent layer whose internal state is warped between frames according to the optical flow (which can itself be estimated using another convolutional neural network) improves the semantic segmentation of videos, in particular in terms of consistency over time.

The pair of consecutive frames preferably comprises a previous frame and a current frame, and the estimated optical flow is the backward optical flow from the current frame to the previous frame. A warping function is applied to the internal state associated with the previous frame to obtain a warped internal state associated with the previous frame; this warped state corresponds to the internal state in which each pixel has undergone a displacement according to the backward optical flow.

The warping function is

$$\overrightarrow{h_{t-1}}(p_t) = h_{t-1}\big(p_t + F_{t \to t-1}(p_t)\big) \quad \text{for every pixel } p_t,$$

where $p_t$ is a pixel of the current frame $t$, $F_{t \to t-1}$ is the backward optical flow, $h_{t-1}$ is the internal state associated with the previous frame $t-1$, and $\overrightarrow{h_{t-1}}$ is the warped internal state associated with the previous frame $t-1$.
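The warping of the internal state can be sketched in NumPy as follows. The sketch assumes that the warped state at a pixel of the current frame is the previous state sampled at that pixel displaced by the backward flow. Bilinear interpolation, clamping at the image border, and the (dx, dy) channel order of the flow are illustrative assumptions that the text does not specify.

```python
import numpy as np

def warp_hidden_state(h_prev, flow_bwd):
    """Warp the state of frame t-1 to frame t: out(p) = h_prev(p + flow_bwd(p)).

    h_prev:   (C, H, W) internal state associated with frame t-1
    flow_bwd: (2, H, W) backward flow from t to t-1, as (dx, dy) in pixels
    """
    C, H, W = h_prev.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Sampling positions in frame t-1, clamped to the image border (assumption).
    x = np.clip(xs + flow_bwd[0], 0, W - 1)
    y = np.clip(ys + flow_bwd[1], 0, H - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, W - 1), np.minimum(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    # Bilinear interpolation of the four neighbouring state vectors.
    top = (1 - wx) * h_prev[:, y0, x0] + wx * h_prev[:, y0, x1]
    bottom = (1 - wx) * h_prev[:, y1, x0] + wx * h_prev[:, y1, x1]
    return (1 - wy) * top + wy * bottom
```

With a zero flow the state is returned unchanged, which corresponds to a static recurrent layer without warping.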

The optical flow between the frames of a pair may be estimated using a second convolutional neural network, trained from a base of training pairs of consecutive frames for which the optical flow is known. The second convolutional neural network may be a FlowNetSimple network or a FlowNetCorrelation network.
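When the optical flow of the training pairs is known, a flow network can be supervised by comparing its prediction with that known flow. The text does not name a loss; the sketch below uses the average end-point error, a common choice for flow supervision, purely as an assumption.

```python
import numpy as np

def endpoint_error(flow_pred, flow_true):
    """Average Euclidean distance between predicted and known flow vectors.

    Both arrays have shape (2, H, W): horizontal and vertical displacement.
    """
    return float(np.sqrt(((flow_pred - flow_true) ** 2).sum(axis=0)).mean())
```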

The parameters of each layer of the convolutional recurrent neural network that precedes the recurrent module may be fixed.

The convolutional layer of the first convolutional neural network that is replaced by the recurrent module may be the penultimate convolutional layer.

The learned parameters of the convolutional recurrent neural network may be those of the recurrent module and of the last convolutional layer of the first convolutional neural network.

The first convolutional neural network may include an atrous spatial pyramid pooling module before the penultimate convolutional layer.
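The split between fixed and learned parameters described above can be sketched as a simple selection over an ordered list of layers. The layer names are hypothetical and serve only to show the order implied by the text: layers before the recurrent module are frozen, while the recurrent module and the last convolutional layer are learned.

```python
def trainable_layer_names(layer_names, recurrent_name):
    """Return the layers whose parameters are learned during video training.

    Layers before the recurrent module stay fixed; the recurrent module and
    every layer after it (here, the last convolutional layer) are learned.
    """
    start = layer_names.index(recurrent_name)
    return layer_names[start:]

# Hypothetical layer names, listed in forward order.
layers = ["backbone", "aspp", "recurrent_module", "last_conv"]
print(trainable_layer_names(layers, "recurrent_module"))
```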

The recurrent module may include a convolutional gated recurrent unit or a convolutional long short-term memory.
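A convolutional gated recurrent unit can be sketched as below. To keep the example short it uses 1x1 convolutions, no biases, and a tanh candidate activation; these are simplifying assumptions, and the parameter names are hypothetical. GRU conventions also differ on which of z or 1 - z multiplies the previous state. The previous state passed in may be a flow-warped state.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def conv1x1(weight, x):
    """1x1 convolution: mixes channels at every pixel.

    weight: (C_out, C_in), x: (C_in, H, W) -> (C_out, H, W)
    """
    return np.einsum("oc,chw->ohw", weight, x)

def conv_gru_step(x, h_prev, params):
    """One GRU update on feature maps.

    x: (C_in, H, W) input features; h_prev: (C_h, H, W) previous state.
    params: dict with Wz, Wr, Wn of shape (C_h, C_in)
            and Uz, Ur, Un of shape (C_h, C_h).
    """
    z = sigmoid(conv1x1(params["Wz"], x) + conv1x1(params["Uz"], h_prev))  # update gate
    r = sigmoid(conv1x1(params["Wr"], x) + conv1x1(params["Ur"], h_prev))  # reset gate
    n = np.tanh(conv1x1(params["Wn"], x) + conv1x1(params["Un"], r * h_prev))  # candidate
    return (1.0 - z) * h_prev + z * n
```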

A method for semantic segmentation of an input video uses a convolutional recurrent neural network trained by: training a first convolutional neural network using a set of semantically segmented training images; and training, using a set of semantically segmented training videos, a convolutional recurrent neural network corresponding to the first convolutional neural network in which a convolutional layer has been replaced by a recurrent module having a hidden state, wherein training the convolutional recurrent neural network comprises, for each pair of consecutive frames (t-1, t) of a video of the set of semantically segmented training videos, warping the internal state of the recurrent layer according to an estimated optical flow between the frames of the pair, so that the internal state adapts to the motion of pixels between the frames of the pair, and learning at least the parameters of the recurrent module. The segmentation method comprises, for each pair of consecutive frames (t-1, t) of the input video: (a) estimating the optical flow between the frames of the pair; (b) warping the internal state of the recurrent layer according to the estimated optical flow, so that the internal state adapts to the motion of pixels between the frames of the pair; and (c) semantically segmenting each frame with the convolutional recurrent neural network.
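Steps (a) to (c) can be sketched as a loop over the frames of the input video. All callables passed to the function are hypothetical placeholders for trained components, and the zero initial state is an assumption; the sketch only shows the order of operations.

```python
import numpy as np

def segment_video(frames, features, estimate_flow, warp, recurrent_step,
                  classify, state_shape):
    """Segment frames in order, carrying a flow-warped internal state.

    Placeholder callables (not defined by the text):
      features(frame)                  -> feature map fed to the recurrent module
      estimate_flow(current, previous) -> backward flow (current to previous)
      warp(state, flow)                -> state aligned with the current frame
      recurrent_step(x, state)         -> new internal state
      classify(state)                  -> per-pixel label map
    """
    state = np.zeros(state_shape)  # assumed zero initial state
    outputs = []
    previous = None
    for frame in frames:
        if previous is not None:
            flow = estimate_flow(frame, previous)  # step (a)
            state = warp(state, flow)              # step (b)
        state = recurrent_step(features(frame), state)
        outputs.append(classify(state))            # step (c)
        previous = frame
    return outputs
```

The first frame has no predecessor, so no flow is estimated for it and the initial state is used as-is.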

Training the convolutional recurrent neural network may include training a second convolutional neural network using a set of training pairs of consecutive frames for which the optical flow is known, step (a) being performed using the second convolutional neural network.

The second convolutional neural network may be a FlowNetSimple network or a FlowNetCorrelation network.

A method for semantic segmentation of an input video using a convolutional recurrent neural network comprises, for each pair of consecutive frames (t-1, t) of the input video: (a) warping the internal state of the recurrent layer according to the estimated optical flow, so that the internal state adapts to the motion of pixels between the frames of the pair; and (c) semantically segmenting each frame with the convolutional recurrent neural network. The convolutional recurrent neural network is trained by: training a first convolutional neural network using a set of semantically segmented training images; and training, using a set of semantically segmented training videos, a convolutional recurrent neural network corresponding to the first convolutional neural network in which a convolutional layer has been replaced by a recurrent module having a hidden state, wherein training the convolutional recurrent neural network comprises, for each pair of consecutive frames (t-1, t) of a video of the set of semantically segmented training videos, warping the internal state of the recurrent layer according to an estimated optical flow between the frames of the pair, so that the internal state adapts to the motion of pixels between the frames of the pair, and learning at least the parameters of the recurrent module.

Training the convolutional recurrent neural network includes training a second convolutional neural network using a set of training pairs of consecutive frames for which the optical flow is known, step (a) being performed using the second convolutional neural network.

The second convolutional neural network may be a FlowNetSimple network or a FlowNetCorrelation network.

In another aspect, a system for training a convolutional recurrent neural network for semantic segmentation of videos is provided. The system may (a) train a first convolutional neural network using a set of semantically segmented training images, and (b) train, using a set of semantically segmented training videos, a convolutional recurrent neural network corresponding to the first convolutional neural network in which a convolutional layer has been replaced by a recurrent module having a hidden state. Training the convolutional recurrent neural network may include, for each pair of consecutive frames (t-1, t) of a video of the set of semantically segmented training videos, (b1) warping the internal state of the recurrent layer according to an estimated optical flow between the frames of the pair, so that the internal state adapts to the motion of pixels between the frames of the pair, and (b2) learning at least the parameters of the recurrent module.

In yet another aspect, a program stored on a computer-readable recording medium is provided for executing, on a computer, a method of training a convolutional recurrent neural network for semantic segmentation of videos. The method may include (a) training a first convolutional neural network using a set of semantically segmented training images, and (b) training, using a set of semantically segmented training videos, a convolutional recurrent neural network corresponding to the first convolutional neural network in which a convolutional layer has been replaced by a recurrent module having a hidden state. Training the convolutional recurrent neural network may include, for each pair of consecutive frames (t-1, t) of a video of the set of semantically segmented training videos, (b1) warping the internal state of the recurrent layer according to an estimated optical flow between the frames of the pair, so that the internal state adapts to the motion of pixels between the frames of the pair, and (b2) learning at least the parameters of the recurrent module.

It will be appreciated that variations of the embodiments disclosed above, and other features and functions or alternatives thereof, may be desirably combined into many other different systems or applications. In addition, various presently unforeseen or unanticipated alternatives, modifications, variations, or improvements may subsequently be made by those skilled in the art, and these are also intended to be encompassed by the description above and the claims below.

Claims

A computer-implemented method of training a convolutional recurrent neural network for semantic segmentation of videos, the method comprising:
(a) training a first convolutional neural network using a set of semantically segmented training images; and
(b) training, using a set of semantically segmented training videos, a convolutional recurrent neural network built from the first convolutional neural network, wherein the penultimate convolutional layer of the first convolutional neural network is replaced by a recurrent module having a hidden state,
wherein training the convolutional recurrent neural network comprises, for each pair of consecutive frames (t-1, t) of a video of the set of semantically segmented training videos:
(b1) warping the internal state of the recurrent layer according to an estimated optical flow between the frames of the pair, so that the internal state adapts to the motion of pixels between the frames of the pair; and
(b2) learning at least the parameters of the recurrent module,
wherein the pair of consecutive frames (t-1, t) comprises a previous frame (t-1) and a current frame (t), and the estimated optical flow is the backward optical flow ($F_{t \to t-1}$) from the current frame (t) to the previous frame (t-1), and
wherein (b1) comprises applying a warping function to the internal state ($h_{t-1}$) associated with the previous frame (t-1) to obtain a warped internal state ($\overrightarrow{h_{t-1}}$) associated with the previous frame (t-1), the warped internal state corresponding to the internal state ($h_{t-1}$) in which each pixel has undergone a displacement according to the backward optical flow ($F_{t \to t-1}$).

(Deleted)

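The method of claim 1,
wherein the warping function is

$$\overrightarrow{h_{t-1}}(p_t) = h_{t-1}\big(p_t + F_{t \to t-1}(p_t)\big) \quad \text{for every pixel } p_t,$$

where $p_t$ is a pixel of the current frame $t$, $F_{t \to t-1}$ is the backward optical flow, $h_{t-1}$ is the internal state associated with the previous frame $t-1$, and $\overrightarrow{h_{t-1}}$ is the warped internal state associated with the previous frame $t-1$.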

The method of claim 1,
wherein (b1) includes estimating the optical flow between the frames of the pair of consecutive frames using a second convolutional neural network, the second convolutional neural network being trained using a set of training pairs of consecutive frames for which the optical flow is known.

The method of claim 4,
wherein the second convolutional neural network is a FlowNetSimple network.

The method of claim 4,
wherein the second convolutional neural network is a FlowNetCorrelation network.

The method of claim 1,
wherein the parameters of each layer of the convolutional recurrent neural network that precedes the recurrent module are fixed during (b2).

(Deleted)

The method of claim 7,
wherein the parameters of the convolutional recurrent neural network learned in (b2) are those of the recurrent module and of the last convolutional layer of the first convolutional neural network.

The method of claim 4,
wherein (b2) further includes learning the parameters of the second convolutional neural network.

The method of claim 7,
wherein the first convolutional neural network includes an atrous spatial pyramid pooling module before the penultimate convolutional layer.

The method of claim 1,
wherein the recurrent module comprises a convolutional gated recurrent unit.

The method of claim 1,
wherein the recurrent module comprises a convolutional long short-term memory.

A computer-implemented method for semantic segmentation of an input video using a convolutional recurrent neural network trained by: training a first convolutional neural network using a set of semantically segmented training images; and training, using a set of semantically segmented training videos, a convolutional recurrent neural network built from the first convolutional neural network, wherein the penultimate convolutional layer of the first convolutional neural network is replaced by a recurrent module having a hidden state, and wherein training the convolutional recurrent neural network comprises, for each pair of consecutive frames (t-1, t) of a video of the set of semantically segmented training videos, warping the internal state of the recurrent layer according to an estimated optical flow between the frames of the pair, so that the internal state adapts to the motion of pixels between the frames of the pair, and learning at least the parameters of the recurrent module,
wherein the pair of consecutive frames (t-1, t) comprises a previous frame (t-1) and a current frame (t), and the estimated optical flow is the backward optical flow ($F_{t \to t-1}$) from the current frame (t) to the previous frame (t-1),
wherein the warping comprises applying a warping function to the internal state ($h_{t-1}$) associated with the previous frame (t-1) to obtain a warped internal state ($\overrightarrow{h_{t-1}}$) associated with the previous frame (t-1), the warped internal state corresponding to the internal state ($h_{t-1}$) in which each pixel has undergone a displacement according to the backward optical flow ($F_{t \to t-1}$),
the method comprising:
(a) for each pair of consecutive frames (t-1, t) of the input video, estimating the optical flow between the frames of the pair;
(b) for each pair of consecutive frames (t-1, t) of the input video, warping the internal state of the recurrent layer according to the optical flow estimated in (a), so that the internal state adapts to the motion of pixels between the frames of the pair; and
(c) for each pair of consecutive frames (t-1, t) of the input video, semantically segmenting each frame with the convolutional recurrent neural network.

The method of claim 14,
wherein training the convolutional recurrent neural network includes training a second convolutional neural network using a set of training pairs of consecutive frames for which the optical flow is known, and wherein (a) is performed using the second convolutional neural network.

The method of claim 15,
wherein the second convolutional neural network is a FlowNetSimple network.

The method of claim 15,
wherein the second convolutional neural network is a FlowNetCorrelation network.

A computer-implemented method for semantic segmentation of an input video using a convolutional recurrent neural network, the method comprising:
(a) for each pair of consecutive frames (t-1, t) of the input video, warping the internal state of the recurrent layer according to an estimated optical flow, so that the internal state adapts to the motion of pixels between the frames of the pair; and
(c) for each pair of consecutive frames (t-1, t) of the input video, semantically segmenting each frame with the convolutional recurrent neural network,
wherein the convolutional recurrent neural network is trained according to a training method comprising:
(a') training a first convolutional neural network using a set of semantically segmented training images; and
(b') training, using a set of semantically segmented training videos, a convolutional recurrent neural network built from the first convolutional neural network, wherein the penultimate convolutional layer of the first convolutional neural network is replaced by a recurrent module having a hidden state,
wherein training the convolutional recurrent neural network comprises, for each pair of consecutive frames (t-1, t) of a video of the set of semantically segmented training videos:
(b1') warping the internal state of the recurrent layer according to an estimated optical flow between the frames of the pair, so that the internal state adapts to the motion of pixels between the frames of the pair; and
(b2') learning at least the parameters of the recurrent module,
wherein, in the training method, the pair of consecutive frames (t-1, t) comprises a previous frame (t-1) and a current frame (t), and the estimated optical flow is the backward optical flow ($F_{t \to t-1}$) from the current frame (t) to the previous frame (t-1), and
wherein (b1') comprises applying a warping function to the internal state ($h_{t-1}$) associated with the previous frame (t-1) to obtain a warped internal state ($\overrightarrow{h_{t-1}}$) associated with the previous frame (t-1), the warped internal state corresponding to the internal state ($h_{t-1}$) in which each pixel has undergone a displacement according to the backward optical flow ($F_{t \to t-1}$).

The method of claim 18,
wherein training the convolutional recurrent neural network includes training a second convolutional neural network using a set of training pairs of consecutive frames for which the optical flow is known, and wherein (a) is performed using the second convolutional neural network.

The method of claim 19,
wherein the second convolutional neural network is a FlowNetSimple network.

The method of claim 19,
wherein the second convolutional neural network is a FlowNetCorrelation network.

A system for training a convolutional recurrent neural network for semantic segmentation of videos,
the system being configured to:
(a) train a first convolutional neural network using a set of semantically segmented training images; and
(b) train, using a set of semantically segmented training videos, a convolutional recurrent neural network built from the first convolutional neural network, wherein the penultimate convolutional layer of the first convolutional neural network is replaced by a recurrent module having a hidden state,
wherein training the convolutional recurrent neural network comprises, for each pair of consecutive frames (t-1, t) of a video of the set of semantically segmented training videos:
(b1) warping the internal state of the recurrent layer according to an estimated optical flow between the frames of the pair, so that the internal state adapts to the motion of pixels between the frames of the pair; and
(b2) learning at least the parameters of the recurrent module,
wherein the pair of consecutive frames (t-1, t) comprises a previous frame (t-1) and a current frame (t), and the estimated optical flow is the backward optical flow ($F_{t \to t-1}$) from the current frame (t) to the previous frame (t-1), and
wherein (b1) comprises applying a warping function to the internal state ($h_{t-1}$) associated with the previous frame (t-1) to obtain a warped internal state ($\overrightarrow{h_{t-1}}$) associated with the previous frame (t-1), the warped internal state corresponding to the internal state ($h_{t-1}$) in which each pixel has undergone a displacement according to the backward optical flow ($F_{t \to t-1}$).

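A program stored on a computer-readable recording medium for executing, on a computer, a method of training a convolutional recurrent neural network for semantic segmentation of videos,
the method comprising:
(a) training a first convolutional neural network using a set of semantically segmented training images; and
(b) training, using a set of semantically segmented training videos, a convolutional recurrent neural network built from the first convolutional neural network, wherein the penultimate convolutional layer of the first convolutional neural network is replaced by a recurrent module having a hidden state,
wherein training the convolutional recurrent neural network comprises, for each pair of consecutive frames (t-1, t) of a video of the set of semantically segmented training videos:
(b1) warping the internal state of the recurrent layer according to an estimated optical flow between the frames of the pair, so that the internal state adapts to the motion of pixels between the frames of the pair; and
(b2) learning at least the parameters of the recurrent module,
wherein the pair of consecutive frames (t-1, t) comprises a previous frame (t-1) and a current frame (t), and the estimated optical flow is the backward optical flow ($F_{t \to t-1}$) from the current frame (t) to the previous frame (t-1), and
wherein (b1) comprises applying a warping function to the internal state ($h_{t-1}$) associated with the previous frame (t-1) to obtain a warped internal state ($\overrightarrow{h_{t-1}}$) associated with the previous frame (t-1), the warped internal state corresponding to the internal state ($h_{t-1}$) in which each pixel has undergone a displacement according to the backward optical flow ($F_{t \to t-1}$).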