KR101563569B1

KR101563569B1 - Learnable Dynamic Visual Image Pattern Recognition System and Method

Info

Publication number: KR101563569B1
Application number: KR1020140064272A
Authority: KR
Inventors: 타니 준; 정민주; 황중식
Original assignee: 한국과학기술원
Priority date: 2014-05-28
Filing date: 2014-05-28
Publication date: 2015-10-28

Abstract

The present invention relates to a learnable dynamic visual image pattern recognition system and method. The system according to an embodiment of the present invention uses a multiple spatio-temporal scales neural network (MSTNN) to recognize dynamic visual image patterns after prior training on a set of exemplar patterns. A dynamic visual image pattern, given as a sequence of pixel patterns or of visual features, is input to the MSTNN, and the recognition result for the input pattern is obtained at the output unit in a delayed-response manner.

Description

[0001] The present invention relates to a learnable dynamic visual image pattern recognition system and method.

The present invention relates to an information processing method for visual image pattern recognition and, more particularly, to a method for recognizing visual image patterns in a video camera stream using a neural network.

In recent years, learning models have begun to be applied to many vision applications, including human action recognition. Whereas conventional vision methods rely on handcrafted features such as HOG, SIFT, and SURF, learning models can learn features automatically from data.

One of the most widely used learning models is the convolutional neural network (CNN). However, a CNN by itself can handle only static vision, not dynamic vision. Some models that extend the CNN, such as the 3D CNN, can process dynamic vision over short time spans but still have difficulty with dynamic vision over long time spans.

Accordingly, a technique that can overcome these limitations of the CNN is needed.

The present invention has been devised to solve the above problems, and an object of the present invention is to provide a dynamic visual recognition method and system with superior performance that overcomes the limitations of the CNN.

According to an embodiment of the present invention for achieving the above object, a dynamic image pattern recognition method includes: receiving, by a first subnetwork, an image; sequentially convolving, by at least one second subnetwork, the input image; and outputting, by a third subnetwork, an image recognition result obtained through the convolution, wherein the at least one second subnetwork has dynamics of different time scales. Here, each neural unit of the subnetwork at each level is connected to retinotopically nearby local neural units in the subnetwork one level above (top-down) or one level below (bottom-up), and there are lateral connections between local neural units within the same subnetwork. The third subnetwork may consist of a single layer or of several layers.

The time scale of the second subnetwork adjacent to the first subnetwork may be fast, and the time scale of the second subnetwork adjacent to the third subnetwork may be slow.

The second subnetwork may include a plurality of feature maps.

The neural units in the subnetworks may be arranged in a two-dimensional retinotopic manner.

In the subnetworks, as the level increases, the number of neural units and the resolution decrease, while the receptive field of each neural unit becomes wider (that is, each unit sees a larger area).

Each neural unit in the subnetwork at a given level may be connected to retinotopically nearby local neural units in the subnetwork one level above (top-down) or one level below (bottom-up).

Within the same subnetwork, local neural units may be connected laterally (bilateral connections).

Training and recognition by the subnetworks may be performed in a delayed-response manner.

Training of the subnetworks may use an error back-propagation scheme.

According to another embodiment of the present invention, a dynamic image pattern recognition system includes: an input unit that receives an image; a recognition unit that sequentially convolves the image input through the input unit; and an output unit that outputs an image recognition result obtained through the convolution, wherein the recognition unit sequentially convolves the input image with at least one subnetwork having dynamics of different time scales.

As described above, according to embodiments of the present invention, information processing with the self-organized spatio-temporal hierarchy of the MSTNN enables context-aware, robust, and efficient recognition of complex dynamic visual image patterns.

FIG. 1 is a diagram illustrating the MSTNN architecture;
FIG. 2 is a diagram provided to explain the forward dynamics of the MSTNN;
FIG. 3 is a diagram provided to explain the concept of learnable dynamic visual image pattern recognition using the MSTNN;
FIG. 4 is a diagram provided to explain the learning process using the MSTNN;
FIG. 5 is a flowchart of the training process for multiple video sequence patterns;
FIG. 6 is a flowchart of the video sequence recognition process; and
FIG. 7 is a block diagram of the learnable dynamic visual image pattern recognition system.

Hereinafter, the present invention is described in more detail with reference to the drawings.

1. MSTNN Architecture

The Multiple Spatio-Temporal Scales Neural Network (MSTNN) architecture is shown in FIG. 1. As shown in FIG. 1, the MSTNN architecture consists of five layers in total: one input layer (Layer 1), three convolutional layers (Layers 2-4), and one fully-connected layer (Layer 5).

The fully-connected layer (Layer 5) may also be implemented as several layers rather than one; there is no limit on their number.

A standard convolutional neural network (CNN) places subsampling layers between its convolutional layers, which extract features invariant to small distortions and changes, and thereby considerably reduces computational complexity. However, the subsampling operation causes discontinuities in neuron values over time.

In the MSTNN, convolution and subsampling are merged into a single operation performed in the convolutional layer, which eliminates the separate subsampling operation.

Layer 1 is the input layer; it has a single 48x54 feature map that holds the input image.

Layer 2 is a convolutional layer with six 22x22 feature maps, a stride (step size) of 2, which indicates how far the kernel is shifted for the next convolution, and a time constant of 2. The time constant is a parameter that characterizes the time scale: a small time constant corresponds to a short time scale and a large one to a long time scale. In the model according to an embodiment of the present invention, the convolutional layers and the fully-connected layer can each have a specific time constant value; in this embodiment only the convolutional layers have a time scale, but more generally the fully-connected layer may have one as well. Each feature map of Layer 2 is connected to the feature map of Layer 1 through a 6x12 kernel.

Layer 3 is a convolutional layer with fifty 8x8 feature maps, a stride of 2, and a time constant of 5. Each feature map of Layer 3 is connected to the feature maps of Layer 2 through 8x8 kernels.

Layer 4 is a convolutional layer with one hundred 1x1 feature maps, a stride of 1, and a time constant of 100. Each feature map of Layer 4 is connected to the feature maps of Layer 3 through 8x8 kernels.

Layer 5 is a fully-connected layer that uses softmax as its activation function for classification, although another function (for example, tanh) may be used instead. The number of neurons in Layer 5 equals the number of classes in the data set. Each neuron in Layer 5 is connected to all neurons in the 100 feature maps of Layer 4. The activated neuron in Layer 5 gives the classification result.
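The feature map sizes listed for Layers 2-4 can be checked from the kernel sizes and strides. The sketch below assumes an unpadded ("valid") convolution, which the text does not state explicitly but which reproduces the listed sizes; the function and variable names are illustrative only.

```python
def conv_output_size(in_size, kernel, stride):
    """Output length along one axis of an unpadded strided convolution."""
    return (in_size - kernel) // stride + 1

# (name, kernel height, kernel width, stride) for Layers 2-4 as described above.
LAYERS = [("Layer 2", 6, 12, 2), ("Layer 3", 8, 8, 2), ("Layer 4", 8, 8, 1)]

def feature_map_sizes(height=48, width=54):
    """Propagate the 48x54 input size through the three convolutional layers."""
    sizes = [("Layer 1", height, width)]
    for name, kernel_h, kernel_w, stride in LAYERS:
        height = conv_output_size(height, kernel_h, stride)
        width = conv_output_size(width, kernel_w, stride)
        sizes.append((name, height, width))
    return sizes

for name, h, w in feature_map_sizes():
    print(f"{name}: {h}x{w}")
```

Under this assumption the sizes come out as 48x54, 22x22, 8x8, and 1x1, matching the description.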

2. Forward Dynamics

The internal state of the neuron at position $(x, y)$ of the $m$-th feature map of the $l$-th layer at time step $t$ is denoted $u^{txy}_{lm}$ and is computed as follows:

$$
u^{txy}_{lm} =
\begin{cases}
\left(1-\dfrac{1}{\tau_l}\right) u^{(t-1)xy}_{lm} + \dfrac{1}{\tau_l}\left[\displaystyle\sum_{n=1}^{N_{l-1}}\sum_{p=0}^{P_l-1}\sum_{q=0}^{Q_l-1} k^{pq}_{lmn}\, a^{t(S_l x+p)(S_l y+q)}_{(l-1)n} + b_{lm}\right], & l \in I_C \\[2ex]
\displaystyle\sum_{n=1}^{N_{l-1}}\sum_{p=0}^{P_l-1}\sum_{q=0}^{Q_l-1} k^{pq}_{lmn}\, a^{t(S_l x+p)(S_l y+q)}_{(l-1)n} + b_{lm}, & l \in I_F
\end{cases}
$$

Here, $I_C$ is the set of convolutional layer indices, $I_F$ is the set of fully-connected layer indices, $\tau_l$ is the time constant of the $l$-th layer, $N_{l-1}$ is the number of feature maps in the $(l-1)$-th layer, $P_l$ and $Q_l$ are the height and width of the kernel in the $l$-th layer, $S_l$ is the stride of the $l$-th layer, $k^{pq}_{lmn}$ is the value at $(p, q)$ of the kernel connecting the $n$-th feature map of the $(l-1)$-th layer to the current feature map, $a^{txy}_{(l-1)n}$ is the activation value of the neuron at $(x, y)$ of the $n$-th feature map of the $(l-1)$-th layer at time step $t$, and $b_{lm}$ is the bias of the current feature map.

The kernel is shifted by S_l pixels in both the horizontal and vertical directions before each convolution. By treating the neurons and weights of the fully-connected layer as 1x1 feature maps and 1x1 kernels, respectively, and setting S_l to 1, the equations for the convolutional and fully-connected layers can be written in the same form.
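As an illustration of the strided, leaky update described in this section, the following pure-Python sketch computes one time step of the internal state for a single output feature map. It assumes the update blends the decayed previous state (factor 1 - 1/τ) with the strided convolution of the layer below (factor 1/τ); the function name, argument layout, and zero-based indexing are choices made here, not taken from the original.

```python
def leaky_conv_step(u_prev, inputs, kernels, bias, tau, stride):
    """One time step of the internal state of a single output feature map.

    u_prev  : previous internal state, a list of X rows with Y values each
    inputs  : activations of the layer below, one 2-D list per input feature map
    kernels : one P x Q kernel (2-D list) per input feature map
    """
    P, Q = len(kernels[0]), len(kernels[0][0])
    u_new = []
    for x in range(len(u_prev)):
        row = []
        for y in range(len(u_prev[0])):
            drive = bias
            for n, fmap in enumerate(inputs):
                for p in range(P):
                    for q in range(Q):
                        # The kernel window starts at (stride * x, stride * y).
                        drive += kernels[n][p][q] * fmap[stride * x + p][stride * y + q]
            # Decayed previous state plus the scaled new input.
            row.append((1 - 1 / tau) * u_prev[x][y] + drive / tau)
        u_new.append(row)
    return u_new

ones = [[1.0] * 4 for _ in range(4)]
kernel = [[1.0, 1.0], [1.0, 1.0]]
state = [[0.0, 0.0], [0.0, 0.0]]
state = leaky_conv_step(state, [ones], [kernel], bias=0.0, tau=2, stride=2)
print(state)  # [[2.0, 2.0], [2.0, 2.0]]
```

With a 4x4 input of ones, a 2x2 kernel of ones, and a stride of 2, each window sums to 4, so with τ = 2 the state moves halfway from 0 toward 4 in one step.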

Instead of a convolutional firing-rate model, in which the activation of each neuron is determined only by the current input, an integrate-and-fire model is used: the activation of each neuron is computed by convolving the feature maps of the layer one level above (top-down), one level below (bottom-up), or at the same level (bilateral) and propagating the decayed previous internal state.

The time constant τ determines how strongly the history of past inputs affects the current internal state. When τ is large, the activation of a neuron changes slowly, because the internal state is influenced more by the history of past inputs than by the current input.

Conversely, when τ is small, the activation changes quickly, because the current input affects the internal state more strongly, as shown in FIG. 2.
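The effect of τ can be seen in a scalar toy example. The sketch below assumes the leaky update u ← (1 - 1/τ)·u + input/τ and feeds a constant input of 1 to three units whose time constants match those given for Layers 2-4; it illustrates the qualitative behavior only and is not the full network.

```python
def step_response(tau, steps):
    """Internal state of one leaky unit driven by a constant input of 1."""
    u, trace = 0.0, []
    for _ in range(steps):
        u = (1 - 1 / tau) * u + 1.0 / tau
        trace.append(u)
    return trace

for tau in (2, 5, 100):
    print(f"tau={tau}: state after 5 steps = {step_response(tau, 5)[-1]:.4f}")
```

After five steps the unit with τ = 2 has nearly reached the input value, while the unit with τ = 100 has barely moved.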

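The activation value of the neuron, $a^{txy}_{lm}$, is computed from its internal state $u^{txy}_{lm}$ as follows:

$$
a^{txy}_{lm} =
\begin{cases}
f\!\left(u^{txy}_{lm}\right), & 1 < l < L \\[1ex]
\dfrac{\exp\left(u^{txy}_{lm}\right)}{\sum_{n=1}^{N_l} \exp\left(u^{txy}_{ln}\right)}, & l = L
\end{cases}
$$

Here, $f(\cdot)$ is a nonlinear activation function, the second case is the softmax over the neurons of the top layer, and $L$ is the number of layers in the model.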
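Layer 5 is described as using softmax, so the top-layer activation can be sketched as below. Subtracting the maximum before exponentiating is a numerical-stability habit added here, not something stated in the source; it does not change the result.

```python
import math

def softmax(internal_states):
    """Turn the internal states of the output neurons into class confidences."""
    peak = max(internal_states)
    exps = [math.exp(u - peak) for u in internal_states]
    total = sum(exps)
    return [e / total for e in exps]

confidences = softmax([1.0, 2.0, 3.0])
print(confidences)
```

The confidences sum to 1, and the neuron with the largest internal state receives the largest confidence.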

3. Training Method

The error function E is defined using the Kullback-Leibler divergence as follows:

$$
E = \sum_{t=T-d+1}^{T} E_t, \qquad E_t = \sum_{m=1}^{N_L} \hat{y}^{t}_{m} \ln\frac{\hat{y}^{t}_{m}}{y^{t}_{m}}
$$

Here, $T$ is the length of the sequence, $d$ is the length of the label sequence, $N_L$ is the number of output neurons, $y^{t}_{m}$ is the confidence of class $m$ given the sequence at time step $t$, redefined from the output-layer activation to simplify the notation, and $\hat{y}^{t}_{m}$ is the label value.

If the input sequence belongs to class $c$, the label value $\hat{y}^{t}_{c}$ is set to 1 and the remaining label values $\hat{y}^{t}_{m}$ with $m \neq c$ are set to 0. Because the model follows delayed teaching, errors are generated only in the last d time steps of the sequence. A feed-forward neural network model such as a CNN cannot use delayed teaching, because the error E_t generated at time step t is not propagated to earlier time steps.
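A minimal sketch of the delayed error, assuming one-hot labels as described above: with a one-hot label, the Kullback-Leibler divergence at a time step reduces to the negative log of the confidence assigned to the correct class, and only the last d steps contribute. The function name and data layout are illustrative.

```python
import math

def delayed_kl_error(outputs, target_class, d):
    """Sum the KL error over the last d time steps only.

    outputs      : list of T confidence vectors (each sums to 1)
    target_class : index c of the class the sequence belongs to
    """
    total_steps = len(outputs)
    error = 0.0
    for t in range(total_steps - d, total_steps):
        # One-hot label: every term except the target class vanishes.
        error += -math.log(outputs[t][target_class])
    return error

sequence_outputs = [[0.5, 0.5]] * 3 + [[0.9, 0.1]] * 2
print(delayed_kl_error(sequence_outputs, target_class=0, d=2))
```

The first three outputs do not affect the result, which reflects the delayed-teaching idea that no error is generated before the final d steps.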

A recurrent neural network model can propagate errors to earlier time steps, but the error decays quickly through time. The proposed model, however, has a large time constant in its upper layer, which reduces the decay of the propagated error and allows a reasonable share of the error generated at the end of the sequence to be back-propagated to the first time steps. For example, suppose the sequence has 100 frames and the time constant of the top layer is set to 100. In this case, the error $E_{100}$ generated at time step 100 is propagated to time step 1 as

$$
\left(1-\frac{1}{100}\right)^{99} E_{100} \approx 0.37\, E_{100}.
$$
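The decay in this example can be worked out numerically. The sketch assumes that the error reaches earlier steps only through the leak term, so that each step back multiplies it by (1 - 1/τ); contributions through the connection weights are ignored here.

```python
def backpropagated_fraction(tau, steps_back):
    """Fraction of an error that survives `steps_back` steps of leaky decay."""
    return (1 - 1 / tau) ** steps_back

for tau in (2, 5, 100):
    print(f"tau={tau}: fraction reaching step 1 = {backpropagated_fraction(tau, 99):.3g}")
```

With τ = 100, roughly 37% of the error generated at step 100 reaches step 1, whereas with τ = 2 or τ = 5 practically nothing does.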

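In the training phase, every learnable parameter $\theta$ of the model is updated at step $n$ as follows:

$$
\theta(n+1) = \theta(n) - \alpha \frac{\partial E}{\partial \theta}
$$

Here, $\alpha$ is the learning rate. The partial derivatives $\partial E / \partial \theta$ with respect to the learnable parameters, obtained by conventional back-propagation, are given for a convolutional layer $l \in I_C$ as follows (for the fully-connected layer the factor $1/\tau_l$ is absent):

$$
\frac{\partial E}{\partial k^{pq}_{lmn}} = \frac{1}{\tau_l}\sum_{t=1}^{T}\sum_{x=0}^{X_l-1}\sum_{y=0}^{Y_l-1} \delta^{txy}_{lm}\, a^{t(S_l x+p)(S_l y+q)}_{(l-1)n},
\qquad
\frac{\partial E}{\partial b_{lm}} = \frac{1}{\tau_l}\sum_{t=1}^{T}\sum_{x=0}^{X_l-1}\sum_{y=0}^{Y_l-1} \delta^{txy}_{lm}
$$

Here, $k^{pq}_{lmn}$ and $b_{lm}$ are the kernel weights and biases, $a^{txy}_{(l-1)n}$ is the activation of the layer below, $\delta^{txy}_{lm} = \partial E / \partial u^{txy}_{lm}$ is the delta error of the neuron with internal state $u^{txy}_{lm}$, which at the softmax output layer equals the difference between the output confidence and the label value over the last $d$ time steps and zero elsewhere, and $X_l$ and $Y_l$ are the height and width of the feature maps in the $l$-th layer.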

During training, the learning rate $\alpha$ is adjusted through an adaptive rate parameter $\tilde{\alpha}$ that is normalized with respect to the number of output neurons N_L and the length d of the label sequence, because the scale of the delta error depends on both.

The parameter $\tilde{\alpha}$ is set to 0.1. To accelerate training, if the mean squared error at step n is smaller than the mean squared error at step n-1, $\tilde{\alpha}$ is multiplied by 1.05; otherwise, $\tilde{\alpha}$ is divided by 2.
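The adaptation rule for the rate parameter can be sketched in a few lines. Only the constants 0.1, 1.05, and 2 come from the text; the function name and the example error values are hypothetical.

```python
def adapt_rate(rate, mse_now, mse_prev):
    """Grow the rate by 5% when the error dropped, otherwise halve it."""
    return rate * 1.05 if mse_now < mse_prev else rate / 2

rate = 0.1
errors = [0.50, 0.40, 0.45, 0.30]  # hypothetical mean squared errors per step
for prev, now in zip(errors, errors[1:]):
    rate = adapt_rate(rate, now, prev)
    print(f"mse {prev:.2f} -> {now:.2f}: rate = {rate:.5f}")
```

The asymmetry (slow growth, fast shrinkage) lets the rate creep upward while training is improving and back off sharply once the error rises.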

All kernel weights and biases are initialized with values drawn at random from a Gaussian distribution with a standard deviation of 0.05. The initial internal state of the neurons in the convolutional layers, $u^{0xy}_{lm}$, is set to 0.
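A small sketch of this initialization is given below. The standard deviation of 0.05 and the zero initial state come from the text; the zero mean of the Gaussian, the fixed seed, and the helper names are assumptions made for the example.

```python
import random

def init_kernel(height, width, std=0.05, rng=None):
    """Draw a height x width kernel from a Gaussian (mean 0 is assumed here)."""
    rng = rng or random.Random(0)
    return [[rng.gauss(0.0, std) for _ in range(width)] for _ in range(height)]

def zero_state(height, width):
    """Initial internal state of one feature map."""
    return [[0.0] * width for _ in range(height)]

kernel_l2 = init_kernel(6, 12)   # one Layer 2 kernel (6x12)
state_l2 = zero_state(22, 22)    # one Layer 2 feature map (22x22)
print(len(kernel_l2), len(kernel_l2[0]), len(state_l2), len(state_l2[0]))
```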

The remaining parameters, including the number of layers, the number of feature maps, the feature map sizes, the kernel sizes, the strides, and the time constants, are as defined in the architecture section above.

4. Learnable Dynamic Visual Image Pattern Recognition

Using its hierarchical structure, the MSTNN processes temporally fast-changing information with high spatial resolution but narrow receptive fields at the lower levels, and temporally slow-changing information with low spatial resolution but wide receptive fields at the higher levels. As information is processed from the lower to the higher levels, the spatio-temporal information at multiple scales that is inherent in the data can therefore be extracted and processed successfully.

As shown in FIG. 3, the MSTNN consists of several levels of subnetworks whose neural activity is governed by dynamics at different time scales. The time scale of the l-th level is determined by the time constant parameter τ_l set for the neural units at that level. The activation dynamics of each neural unit follow the differential equation below, which uses the time constant parameter τ_l:

$$
\tau_l \frac{du_i}{dt} = -u_i + \sum_j w_{ij} a_j + \sum_k w_{ik} I_k, \qquad a_i = f(u_i)
$$

Here, u_i is the potential of the i-th unit, w_ij is the connection weight from the j-th unit to the i-th unit, a_i is the activation value of the i-th unit, I_k is the k-th external input value, and f() is a nonlinear function such as the sigmoid function.
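One way to relate these continuous-time dynamics to a discrete-time update is an Euler-style step with unit step size, with the input taken at the current step. The derivation below assumes a leaky-integrator form in which $s_i$ is shorthand, introduced here, for the summed weighted input to unit $i$; it is a sketch under that assumption rather than a quotation of the original formula.

```latex
\tau_l \frac{du_i}{dt} = -u_i + s_i(t)
\quad\Longrightarrow\quad
u_i^{t} = u_i^{t-1} + \frac{1}{\tau_l}\left(-u_i^{t-1} + s_i^{t}\right)
        = \left(1 - \frac{1}{\tau_l}\right) u_i^{t-1} + \frac{1}{\tau_l}\, s_i^{t}
```

The result shows why a large τ_l makes the state change slowly: the previous state is kept with weight 1 - 1/τ_l and the new input enters with weight 1/τ_l.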

The lowest level receives the input sequence of video frames.

Above the lowest level, several levels of subnetworks are arranged, from the fastest-dynamics subnetwork with the smallest time constant to the slowest-dynamics subnetwork with the largest time constant.

Each subnetwork consists of several feature maps. An output layer sits on top of the slowest-dynamics subnetwork and outputs the recognition result for the input sequence of video frames. The output layer (output unit) may be a single layer or a multi-layer structure comprising several layers.

In addition to the multiple-time-scale constraints described above, multiple-spatial-scale constraints apply to the whole network, imposed by assigning specific connectivity and retinotopic resolution to each level.

In the subnetwork at each level, the neural units are arranged in a two-dimensional retinotopic manner. Each neural unit in the subnetwork at a given level is connected to retinotopically nearby local neural units in the subnetwork one level above (top-down) or one level below (bottom-up). There are also lateral connections between local neural units within the same subnetwork.

That is, in addition to top-down connections, the architecture can be extended with bottom-up connections and lateral (bilateral) connections within the same layer.

Because the number of neural units decreases as the level increases, the receptive field of each neural unit becomes wider and the resolution becomes correspondingly lower.
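The widening of the receptive field with level can be illustrated with the kernel sizes and strides given for Layers 2-4 in the architecture section. The sketch uses the standard receptive-field recurrence for stacked unpadded convolutions; the function name and tuple layout are illustrative.

```python
def receptive_fields(layers):
    """Receptive field (height, width) in input pixels after each layer.

    layers: list of (kernel_height, kernel_width, stride), lowest layer first.
    """
    rf_h = rf_w = 1
    jump = 1  # spacing, in input pixels, between neighboring units of the current layer
    fields = []
    for kernel_h, kernel_w, stride in layers:
        rf_h += (kernel_h - 1) * jump
        rf_w += (kernel_w - 1) * jump
        jump *= stride
        fields.append((rf_h, rf_w))
    return fields

print(receptive_fields([(6, 12, 2), (8, 8, 2), (8, 8, 1)]))
```

For the listed layers this gives 6x12, 20x26, and 48x54 pixels, so each Layer 4 unit covers the entire 48x54 input.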

The model architecture thus applies the idea of multiple spatio-temporal scales: slow dynamics are assigned to the higher levels, which maintain more global connectivity, and fast dynamics to the lower levels, which maintain local connectivity within small receptive fields.

This organization enables recognition of complex dynamic visual patterns through processing at multiple levels of abstraction.

5. Learning and Recognition Processes

As shown in FIG. 4, the MSTNN is trained on a set of video sequences, each associated with a teaching output sequence. An end marker (for example, a blank image) is appended to each teaching image sequence.

The teaching output sequence is given in the form of a delayed response. This means that, before the end marker arrives, the teaching output may take intermediate values for all output units; after the end marker, the teaching output unit that represents the correct recognition result is activated.
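A delayed-response teaching sequence of this kind might be built as in the sketch below. The use of 1/(number of classes) as the "intermediate" value and the parameter names are assumptions for illustration; the source only says that the outputs may take intermediate values before the end marker.

```python
def teaching_outputs(num_frames, num_classes, target_class, d=1, neutral=None):
    """Targets for num_frames video frames followed by d end-marker steps."""
    if neutral is None:
        neutral = 1.0 / num_classes  # assumed intermediate value
    before = [[neutral] * num_classes for _ in range(num_frames)]
    one_hot = [1.0 if c == target_class else 0.0 for c in range(num_classes)]
    after = [list(one_hot) for _ in range(d)]
    return before + after

targets = teaching_outputs(num_frames=3, num_classes=4, target_class=2, d=2)
for step, target in enumerate(targets, start=1):
    print(step, target)
```

Only the last d rows single out the correct class; the earlier rows carry no class-specific teaching signal.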

After successful training, the network can therefore be expected to recognize visual image sequences of the same category by activating the same output unit after the end marker.

FIG. 5 is a flowchart of the training process for multiple video sequence patterns. As shown in FIG. 5, a training sequence is acquired for the input sequence (S110), an end marker is appended to the end of the training sequence (S120), and BPTT (Back-Propagation Through Time) training is performed (S130). The next sequence is then added (S140), and the process is repeated from step S110.
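The control flow of steps S110 to S140 can be sketched as follows. The BPTT update itself is left to a caller-supplied function; `bptt_update`, `append_end_marker`, and the all-zero blank frame are hypothetical names and choices used only to show the order of the steps.

```python
def append_end_marker(frames, marker_steps=1):
    """Return a new sequence with blank frames appended as the end marker (S120)."""
    height, width = len(frames[0]), len(frames[0][0])
    blanks = [[[0.0] * width for _ in range(height)] for _ in range(marker_steps)]
    return frames + blanks

def train(sequences, bptt_update, epochs=1):
    """sequences: list of (frames, label); bptt_update: callable(frames, label)."""
    for _ in range(epochs):
        for frames, label in sequences:         # S110, then S140 for the next one
            marked = append_end_marker(frames)  # S120
            bptt_update(marked, label)          # S130

frames = [[[1.0, 1.0]], [[1.0, 1.0]]]  # two tiny 1x2 frames
calls = []
train([(frames, 0)], lambda marked, label: calls.append((len(marked), label)))
print(calls)  # [(3, 0)]
```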

After the network has been trained, video sequence recognition is performed as shown in FIG. 6, which is a flowchart of the recognition process. A target sequence is acquired (S210), an end marker is appended to the end of the target sequence (S220), and the network dynamics are computed to obtain the output sequence used to recognize the target sequence (S230).
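Steps S210 to S230 might look like the sketch below. The trained network is represented by a caller-supplied `step_fn` that takes one frame and returns the class confidences while keeping its own internal state; that interface, and reading the answer from the final output, are assumptions made for the example.

```python
def recognize(frames, step_fn, blank_frame, marker_steps=1):
    """Return the index of the most active output unit after the end marker."""
    marked = list(frames) + [blank_frame] * marker_steps  # S220
    outputs = [step_fn(frame) for frame in marked]        # S230
    final = outputs[-1]
    return max(range(len(final)), key=final.__getitem__)

def fake_network(frame):
    """Stand-in for a trained model: undecided until the end marker arrives."""
    return [0.1, 0.9] if frame == "blank" else [0.5, 0.5]

print(recognize(["frame-1", "frame-2"], fake_network, "blank"))  # 1
```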

FIG. 7 is a block diagram of the learnable dynamic visual image pattern recognition system. As shown in FIG. 7, the system according to an embodiment of the present invention includes an input unit 310, a learning unit 320, a recognition unit 330, and an output unit 340.

The input unit 310 receives a video sequence, and the learning unit 320 performs learning according to the algorithm shown in FIG. 5. The recognition unit 330 performs recognition according to the algorithm shown in FIG. 6, and the output unit 340 outputs the learning and recognition results.

Although preferred embodiments of the present invention have been shown and described above, the present invention is not limited to the specific embodiments described. Those of ordinary skill in the art to which the invention pertains may make various modifications without departing from the gist of the invention as claimed in the claims, and such modifications should not be understood separately from the technical idea or outlook of the present invention.

Layer 1: input layer
Layers 2-4: convolutional layers
Layer 5: fully-connected layer

Claims

1. A dynamic image pattern recognition method comprising:
receiving, by a first subnetwork, an image;
convolving, by at least one second subnetwork, the input image with feature maps of top-down, bottom-up, and bilateral layers; and
outputting, by a third subnetwork, an image recognition result obtained through the convolution,
wherein the at least one second subnetwork has dynamics of different time scales, and
wherein the neural units in the subnetworks are arranged in a two-dimensional retinotopic manner.

2. The method of claim 1, wherein the time scale of the second subnetwork adjacent to the first subnetwork is fast, and the time scale of the second subnetwork adjacent to the third subnetwork is slow.

3. The method of claim 1, wherein the second subnetwork comprises a plurality of feature maps.

4. (Deleted)

5. The method of claim 1, wherein, in the subnetworks, the number of neural units and the resolution decrease as the level increases, while the receptive field becomes wider.

6. The method of claim 5, wherein each neural unit in the subnetwork at a given level is connected to retinotopically nearby local neural units in the subnetwork one level above (top-down) or one level below (bottom-up).

7. The method of claim 6, wherein, within the same subnetwork, local neural units are connected laterally.

8. The method of claim 7, wherein training and recognition by the subnetworks are performed in a delayed-response manner.

9. A dynamic image pattern recognition system comprising:
an input unit that receives an image;
a recognition unit that convolves the image input through the input unit with feature maps of top-down, bottom-up, and bilateral layers; and
an output unit that outputs an image recognition result obtained through the convolution,
wherein the recognition unit sequentially convolves the input image with at least one subnetwork having dynamics of different time scales, and
wherein the neural units in the subnetworks are arranged in a two-dimensional retinotopic manner.