KR101959436B1

KR101959436B1 - The object tracking system using recognition of background

Info

Publication number: KR101959436B1
Application number: KR1020180091203A
Authority: KR
Inventors: 김성찬; 김정준
Original assignee: 전북대학교 산학협력단
Priority date: 2018-08-06
Filing date: 2018-08-06
Publication date: 2019-07-02

Abstract

The present invention relates to an object tracking system using background recognition, which detects the background, which is relatively more static than the shape of an object in an image frame, treats the portion other than the background as the shape of the object, and learns that shape. To this end, the object tracking system according to the present invention comprises: a first artificial neural network that receives the current image frame, performs 2D convolution and 2D deconvolution operations, and generates a predicted saliency map for separating the input image into the object to be tracked and the background; a second artificial neural network that receives the current image frame of the input image, performs 2D convolution operations, extracts image features reflecting the predicted saliency map of the first artificial neural network, and classifies them into a plurality of classes through a nonlinear transformation of weighted sums of the image features obtained from the 2D convolution operations; and a bounding box output artificial neural network that receives the output of the second artificial neural network as the input of a bounding box regression algorithm for predicting the position of the tracked object in the current image frame, and calculates and outputs the position predicted by the second artificial neural network as the position information of a rectangle surrounding the tracked object. Even when the shape or position (motion) of the object changes severely, the shape of the object can be identified easily, producing a robust and precise appearance learning model and thereby accurately tracking the object.

Description

The present invention relates to an object tracking system using background recognition. More specifically, an image frame in a video consists of a background and a tracked object, and the background remains relatively static while the appearance and position of the tracked object change considerably. Exploiting this property, the system detects the static background in the image frame, treats the region other than the background as the shape of the object, and learns that shape. Even when the shape or position (motion) of the object changes severely, the shape can be identified easily, yielding a robust and precise appearance learning model that accurately predicts and tracks changes in the shape and position of the tracked object.

In general, in problems of recognizing or tracking an object in video, the analysis is called offline if future image frames are available when the current frame is analyzed, and online otherwise.

Most object tracking studies are based on using a neural network to learn the shape of the object, initially or periodically, and then finding the region of the current image frame that is most similar to the learned shape.

In the offline setting, by contrast, all image frames to be analyzed are known, so all of the remaining frames can be used when making a prediction for a particular frame, and the correlation between temporally adjacent image frames can be used to learn the shape and motion of the object and to predict future changes.

However, when only the current image frame is used, without reference to past image frames, information about the temporal correlation between the current frame and previous frames cannot be exploited. When past image frames are referenced, the object is tracked using a predetermined number of past frames, or features are extracted with 3D convolution under the basic assumption that all past frames are equally important. The temporal correlation between the current frame and previous frames can then be exploited, but if some previous frames have characteristics entirely different from the current frame, inaccurate correlation information may be derived, so past frames are not always helpful. For example, when the tracked object is occluded by the background, relating the previous image frames to the current image frame may introduce errors into the analysis of the object's shape or size.

In addition, conventional object tracking systems and methods learn the appearance of the tracked object itself. In the input videos typically used for object tracking, the background remains relatively static while the appearance and position of the tracked object change considerably. Learning the appearance of the object directly therefore involves various technical difficulties, and the resulting learned model of the object's shape suffers from low accuracy.

KR 10-1040049 B1, registered 2011.06.02.
KR 10-1731243 B1, registered 2017.04.24.
KR 10-1735365 B1, registered 2017.05.08.

Accordingly, the present invention has been devised to solve the above problems. The technical problem to be solved is to provide an object tracking system using background recognition that exploits the fact that the image frames of a video consist of a background and a tracked object: it detects the background, which is relatively more static than the shape of the object, treats the remaining region as the shape of the object, and learns that shape. Even when the shape or position (motion) of the object changes severely, the shape can be identified easily and a robust and precise appearance learning model can be built, so that precise information about the shape and motion of the tracked object is obtained and the object can be tracked accurately.

To achieve the above object, one embodiment of the present invention is an object tracking system using background recognition that receives a current image frame, for which the position of an object to be tracked is to be predicted, from a video composed of image frames provided by a video camera or a video file, predicts the position information of a rectangle surrounding the object, and outputs the tracking result. The system includes: a first artificial neural network that receives the current image frame of the input image, performs 2D convolution operations to extract image features, and performs a 2D deconvolution operation on the results of the 2D convolution operations to generate a predicted saliency map for separating the input image into the object to be tracked and the background; a second artificial neural network that receives the current image frame of the input image, performs 2D convolution operations, extracts image features by reflecting the predicted saliency map of the first artificial neural network in the results of the 2D convolution operations, and classifies into a plurality of classes through a nonlinear transformation of the weighted sum of the image features obtained from those results; and a bounding box output artificial neural network that receives the output of the second artificial neural network as the input of a bounding box regression algorithm for predicting the position of the tracked object in the current image frame, and calculates and outputs the position information of the tracked object predicted by the second artificial neural network as the position information of a rectangle surrounding that object.

According to the present invention, an image frame in a video consists of a background and a tracked object, and the background remains relatively static while the appearance and position of the tracked object change considerably. Exploiting this property, the invention detects the static background in the image frame, treats the region other than the background as the shape of the object, and learns that shape. Even when the shape or position (motion) of the object changes severely, the shape can be identified easily and a robust and precise appearance learning model can be built, so that changes in the shape or position of the tracked object can be accurately predicted and tracked.

FIG. 1 is a schematic diagram illustrating an object tracking system using background recognition according to the present invention.
FIG. 2 is a reference diagram illustrating the training process of the object tracking system using background recognition according to the present invention, in which, so that the first artificial neural network learns to predict the background in an input image, the background predicted by the network is compared with a previously prepared actual background and the error is minimized.

Hereinafter, the configuration and operation of the object tracking system using background recognition according to a preferred embodiment of the present invention, and the resulting effects, are described in detail with reference to the accompanying drawings.

The terms and words used in this specification and the claims should not be construed as limited to their ordinary or dictionary meanings. They should be interpreted with meanings and concepts consistent with the technical idea of the present invention, on the principle that an inventor may appropriately define the concept of a term in order to describe the invention in the best way. Accordingly, the embodiments described in this specification and the configurations shown in the drawings are merely the most preferred embodiment of the present invention, and it should be understood that various equivalents and modifications capable of replacing them may exist at the time of filing.

FIG. 1 is a schematic diagram illustrating an object tracking system using background recognition according to the present invention, and FIG. 2 is a reference diagram illustrating the training process in which, so that the first artificial neural network learns to predict the background in an input image, the background predicted by the network is compared with a previously prepared actual background and the error is minimized. As illustrated in the drawings, the object tracking system using background recognition of the present invention includes a first artificial neural network (200), a second artificial neural network (300), and a bounding box output artificial neural network (400). It detects the static background in a given image frame of a video and learns the shape of the tracked object extracted by excluding that background, so that even when the shape or position (motion) of the object changes severely, the shape can be identified easily and a robust and precise appearance learning model can be built.

The object tracking system of the present invention can be implemented by applying it to an object tracking system that receives the current image frame (110), for which the position of the object to be tracked is to be predicted, from a video composed of image frames provided by a video camera or a video file, predicts the position information of a rectangle surrounding the object, and outputs the tracking result.

The first artificial neural network (200) receives the current image frame (110) of an input image provided by a video camera or a video file, performs 2D convolution operations to extract features of the input image, and performs a 2D deconvolution operation on the results of the 2D convolution operations to generate a predicted saliency map (120) for separating the input image into the object to be tracked and the background. The first artificial neural network (200) may include a first-dimension feature extraction artificial neural network (210) and a predicted saliency map generation artificial neural network (220).

The first-dimension feature extraction artificial neural network (210) consists of one or more layers that perform 2D convolution operations on image frames, and extracts features of the input image by performing 2D convolution operations on the input current image frames (110).

The predicted saliency map generation artificial neural network (220) consists of one or more layers that perform 2D deconvolution operations on image frames. It performs a 2D deconvolution operation on the result of the first-dimension feature extraction artificial neural network (210) to generate a predicted saliency map (120) that separates the input image into the tracked object and the background. The input image and the saliency map have the same size.
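The requirement that the saliency map has the same size as the input image can be checked with the standard output-size formulas for 2D convolution and 2D deconvolution (transposed convolution). The following is a minimal sketch only: the kernel size, stride, padding, layer count, and the 224-pixel input are hypothetical values chosen for illustration, and the patent does not specify them.

```python
def conv2d_out(size, kernel, stride=1, padding=0):
    """Output length along one axis of a 2D convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv2d_out(size, kernel, stride=1, padding=0):
    """Output length along one axis of a 2D transposed convolution."""
    return (size - 1) * stride - 2 * padding + kernel


def encoder_decoder_sizes(size, layers):
    """Trace one axis through the conv layers, then the mirrored deconv layers."""
    trace = [size]
    for kernel, stride, padding in layers:
        trace.append(conv2d_out(trace[-1], kernel, stride, padding))
    for kernel, stride, padding in reversed(layers):
        trace.append(deconv2d_out(trace[-1], kernel, stride, padding))
    return trace


# Hypothetical configuration: three 4x4 convolutions with stride 2 and padding 1.
LAYERS = [(4, 2, 1)] * 3
sizes = encoder_decoder_sizes(224, LAYERS)
print(sizes)
```

With these assumed settings each convolution halves the spatial size and each mirrored deconvolution doubles it, so the final map matches the input resolution.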

The second artificial neural network (300) receives the current image frame (130) of an input image provided by a video camera or a video file, performs 2D convolution operations, extracts image features by reflecting the predicted saliency map (120) of the first artificial neural network (200) in the results of the 2D convolution operations, and classifies into a plurality of classes through a nonlinear transformation of the weighted sum of the image features obtained from those results. The second artificial neural network (300) includes a feature extraction artificial neural network (310) and a fully connected artificial neural network (fully-connected layer) (320).

The feature extraction artificial neural network (310) consists of one or more layers that perform 2D convolution operations on image frames. Reflecting the predicted saliency map (120) generated by the first artificial neural network (200), it performs 2D convolution operations on the input image frames to extract image features of the object to be tracked.
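The text does not state how the saliency map is "reflected" during feature extraction. One common possibility, shown here purely as an assumption, is to weight each pixel (or feature value) by its saliency so that background regions are suppressed. The function name `apply_saliency` and the sample values are hypothetical.

```python
def apply_saliency(frame, saliency):
    """Weight each value of a 2D grid by its saliency in [0, 1].

    Assumption for illustration: element-wise multiplication is one way to
    reflect the saliency map; the source does not specify the mechanism.
    """
    same_rows = len(frame) == len(saliency)
    if not same_rows or any(len(r) != len(s) for r, s in zip(frame, saliency)):
        raise ValueError("frame and saliency map must have the same size")
    return [[p * w for p, w in zip(row, srow)]
            for row, srow in zip(frame, saliency)]


# Hypothetical 2x3 grayscale frame and saliency map (1.0 = object, 0.0 = background).
frame = [[10, 20, 30],
         [40, 50, 60]]
saliency = [[0.0, 1.0, 0.0],
            [0.0, 1.0, 0.5]]
masked = apply_saliency(frame, saliency)
print(masked)
```

Because the input image and the saliency map have the same size, this weighting can be applied directly to the frame; applying it to intermediate feature maps would require resizing the map first.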

The fully connected artificial neural network (fully-connected layer) (320) consists of one or more fully connected layers, and extracts high-level features of the input image through a nonlinear transformation of the weighted sum of the image features obtained from the result of the feature extraction artificial neural network (310).
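A "nonlinear transformation of the weighted sum" is what a single fully connected layer computes: each output is an activation function applied to a weighted sum of the inputs plus a bias. The sketch below uses ReLU as the activation, which is an assumption; the weights, biases, and feature values are made up for illustration.

```python
def relu(x):
    """Rectified linear unit, used here as an assumed nonlinearity."""
    return x if x > 0.0 else 0.0


def dense(features, weights, biases, activation=relu):
    """One fully connected layer: activation(W x + b) for each output unit."""
    if len(weights) != len(biases):
        raise ValueError("one bias is required per output unit")
    outputs = []
    for row, bias in zip(weights, biases):
        if len(row) != len(features):
            raise ValueError("weight row length must match feature length")
        weighted_sum = sum(w * x for w, x in zip(row, features)) + bias
        outputs.append(activation(weighted_sum))
    return outputs


# Hypothetical flattened features and parameters (3 inputs, 2 outputs).
features = [1.0, 2.0, 3.0]
weights = [[0.5, -1.0, 0.5],
           [1.0, 1.0, 1.0]]
biases = [0.0, -1.0]
hidden = dense(features, weights, biases)
print(hidden)
```

Stacking several such layers gives the "one or more fully connected layers" described in the text.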

The bounding box output artificial neural network (400) includes a bounding box regression algorithm for predicting the position of the tracked object in the current image frame. It receives the output of the second artificial neural network (300) as the input of this algorithm and calculates the center coordinates and the width and height of the rectangle that most accurately encloses the object. That is, it converts the position information of the tracked object predicted by the second artificial neural network (300) into the position information of the rectangle surrounding that object and outputs it. The position information of the rectangle is calculated and output as the coordinates of its four vertices, or as its length, width, and center-point coordinates, and the like.
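The two output representations mentioned above (four vertex coordinates, or center point plus width and height) carry the same information for an axis-aligned rectangle. A minimal sketch of the conversion in both directions follows; the function names and the sample box are hypothetical.

```python
def center_to_corners(cx, cy, width, height):
    """Convert (center x, center y, width, height) to four corner points."""
    x0, y0 = cx - width / 2, cy - height / 2
    x1, y1 = cx + width / 2, cy + height / 2
    # Order: top-left, top-right, bottom-right, bottom-left.
    return [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]


def corners_to_center(corners):
    """Convert corner points of an axis-aligned box back to (cx, cy, w, h)."""
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    x0, x1 = min(xs), max(xs)
    y0, y1 = min(ys), max(ys)
    return ((x0 + x1) / 2, (y0 + y1) / 2, x1 - x0, y1 - y0)


corners = center_to_corners(50.0, 40.0, 20.0, 10.0)
box = corners_to_center(corners)
print(corners, box)
```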

The operations of the first artificial neural network (200) and the second artificial neural network (300) described above may of course be performed sequentially or almost simultaneously.

The operation and effects of the object tracking system using background recognition according to the present invention, configured as above, are described below.

First, in the object tracking system using artificial neural networks of the present invention, when an input image is provided by a video camera or a video file, the current image frames (110, 130) of the input image are input simultaneously to the first artificial neural network (200) and the second artificial neural network (300).

When the current image frame (110) of the input image is input to the first artificial neural network (200) in this way, the first-dimension feature extraction artificial neural network (210), which consists of one or more layers that perform 2D convolution operations on image frames, performs 2D convolution operations on the input current image frames (110), extracts the features corresponding to the background portion of the input image, and passes them to the predicted saliency map generation artificial neural network (220).

The predicted saliency map generation artificial neural network (220), which consists of one or more layers that perform 2D deconvolution operations on image frames, then performs a 2D deconvolution operation on the result of the first-dimension feature extraction artificial neural network (210) to generate a predicted saliency map (120) for separating the input image into the tracked object and the background, and passes it to the second artificial neural network (300).

To train the first artificial neural network (200), which generates the predicted saliency map (120) using background detection, a ground-truth saliency map (150) corresponding to the training data (110) composed of natural images is required. The first artificial neural network (200) must be trained so that the difference (600; for example, the binary cross entropy loss) computed between its predicted saliency map (120) and the ground-truth saliency map (150) becomes as small as possible. It is therefore important to prepare accurate ground truth in advance. For this purpose, rather than producing a saliency map of the tracked object directly, it is more accurate and efficient to detect the background (500) of the current image data (110) and regard the portion of the whole image that remains after subtracting the background as the ground-truth saliency map (150) of the tracked object. This is because changes in the appearance or position of the tracked object over time are generally much larger than those of the background. The basic principle for recognizing the background in a given image is that pixels in the outer border of the image are generally likely to be background, so interior regions similar to the outer border are regarded as background.
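The two ideas in this paragraph, deriving a ground-truth saliency map from border-like background pixels and scoring a prediction with binary cross entropy, can be sketched as follows. This is an illustration under stated assumptions, not the patented procedure: "similar to the border" is reduced here to "within a tolerance of the mean border intensity" on a grayscale image, and the tolerance, image, and predictions are hypothetical.

```python
import math


def border_pixels(image):
    """Collect the pixel values on the outer border of a 2D grayscale image."""
    h, w = len(image), len(image[0])
    return [image[r][c] for r in range(h) for c in range(w)
            if r in (0, h - 1) or c in (0, w - 1)]


def ground_truth_saliency(image, tolerance):
    """Mark pixels close to the mean border intensity as background (0.0).

    Assumption: a simple intensity threshold stands in for the similarity
    measure, which the source does not define. Everything else is object (1.0).
    """
    border = border_pixels(image)
    mean = sum(border) / len(border)
    return [[0.0 if abs(p - mean) <= tolerance else 1.0 for p in row]
            for row in image]


def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean binary cross entropy between two same-sized 2D maps."""
    total, count = 0.0, 0
    for prow, trow in zip(pred, target):
        for p, t in zip(prow, trow):
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
            total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
            count += 1
    return total / count


image = [[10, 10, 10, 10],
         [10, 200, 210, 10],
         [10, 205, 190, 10],
         [10, 10, 10, 10]]
target = ground_truth_saliency(image, tolerance=30)

good = [[0.1, 0.1, 0.1, 0.1], [0.1, 0.9, 0.9, 0.1],
        [0.1, 0.9, 0.9, 0.1], [0.1, 0.1, 0.1, 0.1]]
bad = [[1.0 - p for p in row] for row in good]
loss_good = binary_cross_entropy(good, target)
loss_bad = binary_cross_entropy(bad, target)
print(loss_good, loss_bad)
```

A prediction that agrees with the border-derived ground truth yields a much smaller loss than its inverse, which is the quantity training would drive down.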

Here, the predicted saliency map (120) represents the outline of the object. To determine the bounding box that encloses the object, an arbitrary number (for example, 1000) of bounding boxes (the yellow and red boxes inside 130) are generated at random near the outline of the object, and the generated bounding boxes are also input to the second artificial neural network (300).
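Random candidate generation near the object outline can be sketched by taking the tight box around the salient pixels and jittering its edges. The sampling distribution is not given in the text, so uniform jitter, the 0.5 threshold, the jitter range, and the fixed seed are all assumptions; only the count of 1000 comes from the example in the text.

```python
import random


def saliency_box(saliency, threshold=0.5):
    """Tight box (x0, y0, x1, y1) around pixels whose saliency exceeds threshold."""
    coords = [(x, y) for y, row in enumerate(saliency)
              for x, value in enumerate(row) if value > threshold]
    if not coords:
        raise ValueError("saliency map has no pixel above the threshold")
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)


def sample_candidates(box, count, jitter, width, height, seed=0):
    """Draw `count` boxes by uniformly jittering each edge, clipped to the image."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    limits = (width, height, width, height)
    candidates = []
    for _ in range(count):
        x0, y0, x1, y1 = (
            min(max(v + rng.uniform(-jitter, jitter), 0.0), float(lim))
            for v, lim in zip(box, limits)
        )
        candidates.append((min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1)))
    return candidates


saliency = [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
            [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
            [0.0, 0.9, 0.8, 0.9, 0.0, 0.0],
            [0.0, 0.7, 0.9, 0.6, 0.0, 0.0],
            [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
            [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]
tight = saliency_box(saliency)
candidates = sample_candidates(tight, count=1000, jitter=1.0, width=6, height=6)
print(tight, len(candidates))
```

Each candidate would then be scored by the second network, with the bounding box regression step refining the best one.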

Next, when the current image frame (130) of the input image is input to the second artificial neural network (300), the feature extraction artificial neural network (310), which consists of one or more layers that perform 2D convolution operations on image frames, reflects the predicted saliency map (120) generated and passed on by the first artificial neural network (200), performs 2D convolution operations on the input image frames to extract image features of the object to be tracked, and passes them to the fully connected artificial neural network (320).

The fully connected artificial neural network (fully-connected layer) (320), which consists of one or more fully connected layers, extracts high-level features of the input image through a nonlinear transformation of the weighted sum of the image features obtained from the result of the feature extraction artificial neural network (310), and passes the result to the bounding box output artificial neural network (400).

Finally, the bounding box output artificial neural network (400) receives the output of the second artificial neural network (300) as the input of the bounding box regression algorithm, and calculates and outputs (140) the center coordinates and the width and height of the rectangle that most accurately encloses the tracked object in the current image frame (130). That is, it converts the position information of the tracked object predicted by the second artificial neural network (300) into the position information of the rectangle surrounding that object and outputs it. The position information of the rectangle is calculated and output as the coordinates of its four vertices, or as its length, width, and center-point coordinates, and the like.

As described above, the present invention exploits the fact that the image frames of a video consist of a background and a tracked object (see 230 in FIG. 2). The background, which is relatively more static than the shape of the object, is extracted first from the current image frame and then excluded from it, so that the remaining region is taken as the tracked object; the shape and size of the object are learned and its position is predicted on that basis. As long as the background can be identified, the shape of the object can be identified easily even when its shape or motion changes severely, so a robust and precise appearance learning model can be built and changes in the shape of the object can be predicted accurately.

Although the present invention has been described above with reference to limited embodiments and drawings, it is not limited to those embodiments, and a person of ordinary skill in the art to which the invention pertains can make various modifications and variations from this description. Accordingly, the idea of the present invention should be determined only by the claims set out below, and all equal or equivalent modifications thereof fall within the scope of the idea of the invention.

110, 130: current image frame
120: predicted saliency map
140: image frame with bounding box displayed
200: first artificial neural network
210: first-dimension feature extraction artificial neural network
220: predicted saliency map generation artificial neural network
300: second artificial neural network
310: feature extraction artificial neural network
320: fully connected artificial neural network
400: bounding box output artificial neural network

Claims

1. An object tracking system that receives a current image frame (110), for which the position of an object to be tracked is to be predicted, from a video composed of image frames provided by a video camera or a video file, predicts position information of a rectangle surrounding the object to be tracked, and outputs an object tracking result, the system comprising:
a first artificial neural network (200) that receives the current image frame (110) of the input image, performs 2D convolution operations to extract features of the input image, and performs a 2D deconvolution operation on the results of the 2D convolution operations to generate a predicted saliency map (120) for separating the input image into the object to be tracked and the background;
a second artificial neural network (300) that receives the current image frame (130) of the input image, performs 2D convolution operations, extracts features of the image by reflecting the predicted saliency map of the first artificial neural network in the results of the 2D convolution operations, and classifies into a plurality of classes through a nonlinear transformation of the weighted sum of the image features obtained from those results; and
a bounding box output artificial neural network (400) that receives the output of the second artificial neural network (300) as the input of a bounding box regression algorithm for predicting the position of the object to be tracked in the current image frame, and calculates and outputs the position information of the object predicted by the second artificial neural network as position information of a rectangle surrounding the object to be tracked,
wherein the first artificial neural network (200) comprises:
a first-dimension feature extraction artificial neural network (210) that consists of one or more layers performing 2D convolution operations on image frames and extracts features of the input image by performing 2D convolution operations on the input current image frames (110); and
a predicted saliency map generation artificial neural network (220) that consists of one or more layers performing 2D deconvolution operations on image frames and performs a 2D deconvolution operation on the result of the first-dimension feature extraction artificial neural network (210) to generate the predicted saliency map (120) separating the input image into the tracked object and the background,
wherein the second artificial neural network (300) comprises:
a feature extraction artificial neural network (310) that consists of one or more layers performing 2D convolution operations on image frames, reflects the predicted saliency map (120) generated by the first artificial neural network (200), and performs 2D convolution operations on the input image frames to extract image features of the object to be tracked; and
a fully connected artificial neural network (fully-connected layer) (320) that consists of one or more fully connected layers and classifies into a plurality of classes through a nonlinear transformation of the weighted sum of the image features obtained from the result of the feature extraction artificial neural network (310),
wherein the first artificial neural network (200), which generates the predicted saliency map (120) using background detection, is trained using a ground-truth saliency map (150) corresponding to training data (110) composed of natural images, so that the binary cross entropy loss between the predicted saliency map (120) and the ground-truth saliency map (150) becomes as small as possible,
wherein the ground-truth saliency map (150) is given as the saliency map of the tracked object, being the portion of the whole image of the current image data (110) that remains after the background is subtracted, and
wherein the background is recognized by regarding interior regions similar to the pixels of the outer border portions of the image as background.

(Deleted)