KR102138680B1

KR102138680B1 - Apparatus for Video Recognition and Method thereof

Info

Publication number: KR102138680B1
Application number: KR1020180043730A
Authority: KR
Inventors: 변혜란; 조보라; 홍기범; 홍종광; 김호성; 황선희; 기민송; 김태형
Original assignee: 연세대학교 산학협력단
Priority date: 2018-04-16
Filing date: 2018-04-16
Publication date: 2020-07-28
Also published as: KR20190120489A

Abstract

The present invention discloses an image recognition apparatus, in particular one capable of recognizing the behavior of objects in an image using a convolutional neural network capable of video analysis. The image recognition apparatus of the present invention includes: a stream generation unit that generates, from an input image including at least one object, an action stream including motion information on the behavior of each object; and a recognition unit that recognizes the behavior of the objects using a first recognizer, which receives the generated action stream or position information indicating the positional relationship of the action streams in the input image and outputs at least one class vector as an index for classifying the behavior of the object.

Description

Image recognition device and method {Apparatus for Video Recognition and Method thereof}

The present invention relates to an image recognition apparatus. More specifically, it relates to an image recognition apparatus that recognizes the behavior of objects in an image using a neural network.

Behavior recognition can be used to classify human behavior in video that contains both temporal and spatial information. The relationships between a subject in the image and the people around it, between the subject and the object that is the target of the subject's action, and between one object and another can be used as factors for classifying the behavior of a specific object in the image.

An artificial neural network is a statistical learning algorithm, used in machine learning and cognitive science, that is inspired by biological neural networks. Techniques for machine recognition of images have recently been studied actively in fields such as computer vision. A convolutional neural network (CNN) is a neural network composed of one or more convolutional layers with artificial neural network layers stacked on top of them, and is mainly used to analyze two-dimensional input data such as images and speech.

Conventional image recognition techniques using artificial neural networks have mainly been applied only to still images and have been limited to analyzing the subject-object relationship within a still image. In addition, conventional image recognition techniques do not take the passage of time into account when analyzing relationships between objects, including people and things.

Accordingly, there is a need for a technique, based on an artificial neural network capable of video-level analysis, that can recognize an image in consideration of the objects in the image and the contextual interrelationships among those objects.

Korean Laid-Open Patent Publication No. 10-2017-0034226

The present invention has been devised to solve the above problems and discloses an image recognition apparatus, in particular one capable of recognizing the behavior of objects in a video using a neural network.

To achieve the above object, the image recognition apparatus of the present invention includes: a stream generation unit that generates, from an input image including at least one object, an action stream including motion information on the behavior of each object; and a recognition unit that recognizes the behavior of the objects using a first recognizer, which receives the generated action stream or position information indicating the positional relationship of the action streams in the input image and outputs at least one class vector as an index for classifying the behavior of the object.

In the present invention, the stream generation unit may further include a region-of-interest detection unit that divides the input image over time to generate a plurality of frame images and detects, in the generated frame images, a region of interest for each object that includes at least part of that object. The stream generation unit may generate the action stream using the detected regions of interest.

In the present invention, the recognition unit may further include a summation unit that sums, according to a preset method, the class vectors that the first recognizer produces for the action streams generated for each object. The recognition unit may recognize the behavior of the objects using the summed class vector.

In the present invention, the stream generation unit may further include a region-of-interest connection unit that calculates a connection score between the regions of interest detected in adjacent, different frame images and connects the regions of interest detected in those frame images in consideration of the calculated connection score. The stream generation unit may generate the action stream for each object using the connected regions of interest.

In the present invention, the stream generation unit may further include a position information calculation unit that rearranges the regions of interest detected in the frame images on a frame-by-frame basis and calculates, from a combined stream generated by connecting the rearranged regions of interest, position information on the mutual positions of the action streams.

In the present invention, the region-of-interest connection unit may further include a connection score calculation unit that calculates the connection score in consideration of at least one of: a similarity function that takes as input the feature information of the regions of interest detected in each of the adjacent frame images; an intersection-ratio function that outputs the overlap ratio of the regions of interest detected in each of the adjacent frame images; and the similarity of the class information of the regions of interest. The region-of-interest connection unit may connect the regions of interest using the calculated connection score.
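The connection score described above can be sketched as follows. This is only an illustration under stated assumptions: the overlap ratio is taken to be the usual intersection over union of two boxes, the similarity function is taken to be cosine similarity, the class similarity is a dot product of class probability vectors, and the three terms are combined as a weighted sum. The function names, the `(x1, y1, x2, y2)` box format, and the weights are hypothetical and do not come from the patent.

```python
import math
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2); hypothetical box format


def overlap_ratio(a: Box, b: Box) -> float:
    """Intersection area divided by union area of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two feature vectors (assumed similarity function)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0


def connection_score(box_a: Box, box_b: Box,
                     feat_a: Sequence[float], feat_b: Sequence[float],
                     cls_a: Sequence[float], cls_b: Sequence[float],
                     weights: Tuple[float, float, float] = (1.0, 1.0, 1.0)) -> float:
    """Weighted sum of feature similarity, overlap ratio, and class similarity."""
    w_sim, w_overlap, w_cls = weights
    class_similarity = sum(p * q for p, q in zip(cls_a, cls_b))
    return (w_sim * cosine_similarity(feat_a, feat_b)
            + w_overlap * overlap_ratio(box_a, box_b)
            + w_cls * class_similarity)
```

Regions of interest in adjacent frames would then be connected by pairing each region with the candidate that yields the highest score; how ties and unmatched regions are handled is left open here.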

In the present invention, the region-of-interest connection unit may further include a feature information estimation unit that, when there is a frame image in which no region of interest is detected, estimates the feature information of the region of interest in that frame image based on the feature information of the regions of interest detected in the frame images adjacent to it. The region-of-interest connection unit may connect the regions of interest using the estimated feature information.
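One simple way to estimate the missing feature information is to interpolate between the nearest frames in which the region of interest was detected. The sketch below assumes that the feature information is a box `(x1, y1, x2, y2)` and that linear interpolation is acceptable; the patent text does not specify the estimation method, and the function name is hypothetical.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2); hypothetical box format


def fill_missing_boxes(track: List[Optional[Box]]) -> List[Optional[Box]]:
    """Linearly interpolate boxes for frames where detection failed (None).

    Gaps before the first or after the last detection are left as None,
    because there is no neighbor on one side to interpolate from.
    """
    filled: List[Optional[Box]] = list(track)
    known = [i for i, box in enumerate(track) if box is not None]
    for prev, nxt in zip(known, known[1:]):
        gap = nxt - prev
        for i in range(prev + 1, nxt):
            t = (i - prev) / gap  # relative position inside the gap
            filled[i] = tuple((1 - t) * p + t * n
                              for p, n in zip(track[prev], track[nxt]))
    return filled
```

For example, a track with detections at frames 0 and 2 and none at frame 1 receives the midpoint box at frame 1.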

In the present invention, the region-of-interest detection unit may further include a feature information calculation unit that calculates feature information indicating the coordinates at which the region of interest is located in the generated frame image. The region-of-interest detection unit may detect the region of interest using the calculated feature information.

In the present invention, the position information calculation unit may further include: a region division unit that distinguishes, in the frame images, the regions of interest detected for each object from the portions in which no region of interest is detected; and a rearrangement unit that rearranges the distinguished regions of interest by a predetermined combination method. The position information calculation unit may calculate the position information based on the combined stream generated by connecting the rearranged regions of interest.

In the present invention, the feature information calculation unit may further include: a preprocessing unit that divides each of the generated frame images to generate grid cells of a predetermined size; and a calculation unit that calculates the center coordinates of each boundary cell and the probability that the object exists in the boundary cell, using a second recognizer (neural network). The second recognizer takes the generated frame images as input and outputs the center coordinates of a boundary cell, which has its center inside a grid cell and indicates the probability that the object exists, or the probability that the object exists in the boundary cell. The feature information calculation unit may calculate the feature information using the boundary cells for which the probability and the center coordinates have been calculated.

In the present invention, the region-of-interest detection unit may further include a boundary cell removal unit that removes some of the boundary cells that redundantly contain the same object in the frame images, in consideration of whether the probability that the object exists in each such boundary cell is equal to or greater than a preset threshold. The region-of-interest detection unit may detect the region of interest using the boundary cells that remain after removal.
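The removal of redundant boundary cells can be illustrated with a greedy, non-maximum-suppression style filter. This is a sketch under assumptions rather than the patented procedure: cells whose object probability falls below `prob_threshold` are discarded, and among the rest a cell is dropped when it overlaps an already kept, higher-probability cell by at least `overlap_threshold`. Both thresholds, the function names, and the `(x1, y1, x2, y2)` box format are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2); hypothetical box format
Cell = Tuple[Box, float]                 # (boundary cell, object probability)


def overlap_ratio(a: Box, b: Box) -> float:
    """Intersection area divided by union area of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def remove_redundant_cells(cells: List[Cell],
                           prob_threshold: float = 0.5,
                           overlap_threshold: float = 0.5) -> List[Cell]:
    """Keep confident cells, dropping those that duplicate a stronger cell."""
    # Step 1: discard cells whose object probability is below the threshold.
    candidates = sorted((c for c in cells if c[1] >= prob_threshold),
                        key=lambda c: c[1], reverse=True)
    # Step 2: greedily keep the strongest cell and skip heavy overlaps with it.
    kept: List[Cell] = []
    for box, prob in candidates:
        if all(overlap_ratio(box, kept_box) < overlap_threshold for kept_box, _ in kept):
            kept.append((box, prob))
    return kept
```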

In the present invention, the class vector includes a behavior list for classifying the behavior of the object and probability information indicating the probability that the behavior of the objects in the input image corresponds to each entry in the behavior list. The recognition unit may recognize the behavior of the objects using the probability information for each entry of the behavior list in the summed class vector.
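As a minimal sketch of how summed class vectors could yield a recognized behavior, the example below adds the per-stream probability vectors element by element and picks the behavior with the largest total. The plain element-wise sum stands in for the "preset method", which the text does not define; the function name and the behavior labels are hypothetical.

```python
from typing import List, Sequence, Tuple


def recognize_behavior(class_vectors: Sequence[Sequence[float]],
                       behaviors: Sequence[str]) -> Tuple[str, List[float]]:
    """Sum class vectors element-wise and return the most probable behavior."""
    # zip(*...) groups the probabilities of the same behavior across all streams.
    summed = [sum(column) for column in zip(*class_vectors)]
    best_index = max(range(len(summed)), key=summed.__getitem__)
    return behaviors[best_index], summed
```

With two action streams scoring `[0.7, 0.2, 0.1]` and `[0.4, 0.5, 0.1]` over the behaviors walk, run, and sit, the summed vector favors the first behavior even though the second stream alone would favor the second.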

In addition, to achieve the above object, the image recognition method of the present invention includes: generating, from an input image including at least one object, an action stream including motion information on the behavior of each object; and recognizing the behavior of the objects using a first recognizer, which receives the generated action stream or position information indicating the positional relationship of the action streams in the input image and outputs at least one class vector as an index for classifying the behavior of the object.

In the present invention, the generating step may further include dividing the input image over time to generate a plurality of frame images and detecting, in the generated frame images, a region of interest for each object that includes at least part of that object. The action stream may be generated using the detected regions of interest.

In the present invention, the recognizing step may further include summing, according to a preset method, the class vectors that the first recognizer produces for the action streams generated for each object. The behavior of the objects may be recognized using the summed class vector.

In the present invention, the generating step may further include calculating a connection score between the regions of interest detected in adjacent, different frame images and connecting the regions of interest detected in those frame images in consideration of the calculated connection score. The action stream may be generated for each object using the connected regions of interest.

In the present invention, the generating step may further include rearranging the regions of interest detected in the frame images on a frame-by-frame basis and calculating, from a combined stream generated by connecting the rearranged regions of interest, position information on the mutual positions of the action streams.

In the present invention, the connecting step may further include calculating the connection score in consideration of at least one of: a similarity function that takes as input the feature information of the regions of interest detected in each of the adjacent frame images; an intersection-ratio function that outputs the overlap ratio of the regions of interest detected in each of the adjacent frame images; and the similarity of the class information of the regions of interest. The regions of interest may be connected using the calculated connection score.

In the present invention, the connecting step may further include, when there is a frame image in which no region of interest is detected, estimating the feature information of the region of interest in that frame image based on the feature information of the regions of interest detected in the frame images adjacent to it. The regions of interest may be connected using the estimated feature information.

In the present invention, the detecting step may further include calculating feature information indicating the coordinates at which the region of interest is located in the generated frame image. The region of interest may be detected using the calculated feature information.

In the present invention, the step of calculating the position information may further include: distinguishing, in the frame images, the regions of interest detected for each object from the portions in which no region of interest is detected; and rearranging the distinguished regions of interest by a predetermined combination method. The position information may be calculated based on the combined stream generated by connecting the rearranged regions of interest.

In addition, the present invention discloses a computer program stored in a computer-readable recording medium for executing the above image recognition method on a computer.

According to the present invention, the behavior of objects in an image can be recognized.

In particular, images can be recognized using a convolutional neural network capable of video-level image analysis.

FIG. 1 is a block diagram of an image recognition apparatus according to an embodiment of the present invention.
FIG. 2 is an enlarged block diagram of the stream generation unit in the embodiment of FIG. 1.
FIG. 3 is an enlarged block diagram of the region-of-interest detection unit in the embodiment of FIG. 2.
FIG. 4 is an enlarged block diagram of the feature information calculation unit in the embodiment of FIG. 3.
FIG. 5 is a reference diagram illustrating the process by which the region-of-interest detection unit detects a region of interest.
FIG. 6 is an exemplary view showing regions of interest detected by the region-of-interest detection unit.
FIG. 7 is an enlarged block diagram of the region-of-interest connection unit in the embodiment of FIG. 2.
FIG. 8 is a reference diagram illustrating the process by which the region-of-interest connection unit connects regions of interest.
FIG. 9 is an exemplary view illustrating the process by which the feature information estimation unit estimates the feature information of a region of interest.
FIG. 10 is an enlarged block diagram of the position information calculation unit in the embodiment of FIG. 2.
FIG. 11 shows the image recognition process performed by the image recognition apparatus of the present invention.
FIG. 12 is an enlarged block diagram of the recognition unit in the embodiment of FIG. 1.
FIG. 13 is a flowchart of an image recognition method according to an embodiment of the present invention.
FIG. 14 is an enlarged flowchart of the generating step in the embodiment of FIG. 13.
FIG. 15 is an enlarged flowchart of the detecting step in the embodiment of FIG. 14.
FIG. 16 is an enlarged flowchart of the connecting step in the embodiment of FIG. 14.
FIG. 17 is an enlarged flowchart of the step of calculating position information in the embodiment of FIG. 14.
FIG. 18 is an enlarged flowchart of the recognizing step in the embodiment of FIG. 13.

Hereinafter, an embodiment of the present invention will be described in detail with reference to the accompanying drawings.

In the description with reference to the accompanying drawings, identical or corresponding components are given the same reference numerals, and redundant descriptions of them are omitted.

In addition, in describing the present invention, detailed descriptions of related well-known configurations or functions may be omitted when they are judged likely to obscure the subject matter of the present invention.

The terms used in this application are used only to describe specific embodiments and are not intended to be limiting. Singular expressions include plural expressions unless the context clearly indicates otherwise. Each step described below may be provided as one or more software modules, may be implemented in hardware responsible for each function, or may be implemented as a combination of software and hardware.

The specific meaning and examples of each term are described below in the order of the drawings.

Hereinafter, the configuration of the image recognition apparatus 10 according to an embodiment of the present invention will be described in detail with reference to the related drawings.

FIG. 1 is a block diagram of the image recognition apparatus 10 according to an embodiment of the present invention.

The image recognition apparatus 10 includes a stream generation unit 100 and a recognition unit 200. For example, the image recognition apparatus 10 of the present invention may receive an image including at least one object, generate an action stream from the input image, and recognize the behavior of the objects in the image using a first recognizer that takes the generated action stream as input.

The images recognized by the image recognition apparatus 10 of the present invention include, in addition to still images, video in which the image information changes over time. Unlike conventional image recognition techniques, the image recognition apparatus 10 of the present invention can recognize images by performing video-level image analysis. More specifically, the image recognition apparatus 10 of the present invention can recognize an image by classifying the behavior of the objects in it.

The image recognition apparatus 10 of the present invention may be used in an autonomous driving device, such as an autonomous robot, to control autonomous driving, or in a surveillance device, such as CCTV, for automatic analysis of input video. In addition, the image recognition apparatus 10 of the present invention may be provided as a controller for home IoT (Internet of Things) and used to control the operation of household appliances by automatically recognizing human behavior.

In the present invention, the first recognizer and the second recognizer used by the image recognition apparatus 10 may be provided as neural networks trained by supervised learning based on a deep learning algorithm with a deep, multi-layer network structure. Preferably, the first recognizer and the second recognizer are neural network structures trained by supervised learning and may be provided as convolutional neural networks including at least one convolutional layer and a pooling layer. Research on artificial neural networks based on the backpropagation algorithm stagnated for some time once problems such as training delays caused by excessive depth became apparent as the number of layers grew. Performance then improved dramatically once the overfitting problem was addressed through methods such as dropout.

The convolutional neural network, the representative deep learning structure used by the image recognition apparatus 10, was designed to mimic human visual processing and is therefore regarded as a deep learning algorithm well suited to video and image processing. It achieves high performance in image recognition by extracting features that represent an image in abstracted form.

Specifically, a convolutional neural network can hierarchically extract features from an input image and form a feature map from the extracted features. The features distributed over the feature map contain rearranged image information, which allows the convolutional neural network to classify images effectively. By applying an activation function to the features extracted through the convolutional layers and then repeatedly placing pooling layers, the convolutional neural network can resize the feature values extracted by the convolutional layers.
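The convolution, activation, and pooling sequence described above can be illustrated on a single-channel image with plain Python lists. This is a toy sketch, not the network used by the apparatus: ReLU is assumed as the activation function, max pooling as the pooling operation, and the function names are hypothetical. As in most CNN libraries, the "convolution" is computed without flipping the kernel.

```python
from typing import List

Matrix = List[List[float]]


def conv2d(image: Matrix, kernel: Matrix) -> Matrix:
    """Valid (no padding, stride 1) sliding-window product of kernel and image."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def relu(feature_map: Matrix) -> Matrix:
    """Activation function: negative responses are clipped to zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool(feature_map: Matrix, size: int = 2) -> Matrix:
    """Non-overlapping max pooling, which shrinks the feature map by `size`."""
    rows = range(0, len(feature_map) - size + 1, size)
    cols = range(0, len(feature_map[0]) - size + 1, size)
    return [[max(feature_map[r + i][c + j]
                 for i in range(size) for j in range(size))
             for c in cols]
            for r in rows]
```

Applying `max_pool(relu(conv2d(image, kernel)))` to a 5x5 image with a 2x2 kernel gives a 4x4 feature map that pooling reduces to 2x2, which is the resizing effect the paragraph refers to.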

The convolutional neural network used by the image recognition apparatus 10 of the present invention basically has a convolutional neural network structure, but the structure within the network may differ according to the shape, size, and pixel-value information of the input image. That is, the numbers of input, hidden, and output layers, the number of nodes in each layer, the weights applied to the edges between nodes, the learning rate, and the like may be set differently according to the purpose of image recognition.

In addition, the convolutional neural network used by the image recognition apparatus 10 may be a neural network to which an architecture such as VGG-Net, GoogLeNet, or ResNet is applied.

In addition, the neural networks used by the image recognition apparatus 10 of the present invention have a structure in which a fully connected layer is attached to a CNN structure including convolutional layers and pooling layers. Provided that overfitting does not occur, image recognition accuracy can increase as the network becomes deeper. The stream generation unit is described next with reference to FIG. 2.

The stream generation unit 100 includes a region-of-interest detection unit 120, a region-of-interest connection unit 140, and a position information calculation unit 160. For example, the stream generation unit 100 generates, from an input image including at least one object, an action stream including motion information on the behavior of each object. The action stream of the present invention may be generated by preprocessing an action tube, which is generated for each object present in the image, into a form suitable for input to a neural network.

The action tube of the present invention is a set of three-dimensional (X, Y, Z) image data including at least part of an object in the input image and may be generated for each object. The action tube may contain motion information on changes in the object's behavior, including the amount of change in pixel values within the regions of interest over time. The motion information of the present invention is a kind of change in the pixel values within the region of interest over time and may be provided as the derivative of the per-pixel change in pixel value per unit time.
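One way to approximate such motion information is a finite difference between consecutive frames of the tube. The sketch below assumes a grayscale tube indexed as `tube[t][y][x]` and a fixed time step `dt`; it is a simplified stand-in for whatever derivative the apparatus actually computes, and the function name is hypothetical.

```python
from typing import List

Frame = List[List[float]]  # grayscale frame indexed as frame[y][x]


def motion_information(tube: List[Frame], dt: float = 1.0) -> List[Frame]:
    """Per-pixel change between consecutive frames, divided by the time step.

    A tube of T frames yields T - 1 difference frames of the same height and width.
    """
    return [
        [[(after - before) / dt for before, after in zip(row_prev, row_next)]
         for row_prev, row_next in zip(prev, nxt)]
        for prev, nxt in zip(tube, tube[1:])
    ]
```

Pixels that do not change between two frames produce zeros, so regions where the object moves stand out in the difference frames.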

For example, the stream generation unit 100 may divide the input image to generate frame images at a predetermined time interval, detect a region of interest for each object in the generated frame images, generate an action tube by connecting the detected regions of interest, and generate an action stream by preprocessing the generated action tube.

Because the action stream is generated by preprocessing an action tube that connects the regions of interest detected in frame images separated by a predetermined time interval, and that action tube contains the change in pixel values over time, the action stream can represent the change in an object's behavior over time.

The action stream generated for each object by the stream generator 100 of the present invention may have a size of 56*56*3*16. That is, the action stream is generated by connecting regions of interest with a width*height of 56*56, contains pixel values in the RGB color system, and one action stream may include 16 image frames. The size of the action stream generated by the stream generator 100 may vary depending on how the first recognizer is configured. This will be described with reference to FIG. 3.

The region of interest detection unit 120 includes a feature information calculating unit 122 and a boundary cell removing unit 126. For example, the region of interest detection unit 120 may divide the input image over time to generate a plurality of frame images, and may detect, in the generated frame images, regions of interest that are distinguished by object and each contain at least part of the corresponding object. This will be described with reference to FIG. 4.

The feature information calculating unit 122 includes a pre-processing unit 123 and a calculation unit 124. For example, the feature information calculating unit 122 may calculate feature information indicating the coordinates at which the region of interest is located in a generated frame image. In the present invention, the feature information is the coordinates of the region of interest to be detected in the frame image, and may be provided as the center coordinates of the region of interest or as coordinates on its boundary. This will be described with reference to FIG. 5.

The pre-processing unit 123 may divide each generated frame image into grid cells of a predetermined size, as shown in FIG. 5. For example, the pre-processing unit 123 may generate grid cells of 9*6 or 7*7 size. The grid cells generated by the pre-processing unit 123 may include boundary cells that are dependent on the respective grid cells.

The calculation unit 124 calculates the center coordinates of each boundary cell and the probability that the object exists in each boundary cell by using a second recognizer (neural network). The second recognizer takes the generated frame images as input and outputs the center coordinates of boundary cells, each of which has its center inside a grid cell and indicates the probability that the object exists, or the probability that the object exists in the boundary cell.

The second recognizer of the present invention is a neural network structure trained by supervised learning, and may be provided as a convolutional neural network (CNN) including at least one convolutional layer and a pooling layer. Preferably, the second recognizer used by the calculation unit 124 may be provided as an R-CNN (Regions with Convolutional Neural Network) or a Faster R-CNN.

The feature information calculating unit 122 may calculate the feature information using the boundary cells for which the center coordinates and the probability that the object exists have been calculated. For example, the second recognizer used by the calculation unit 124 may include at least one convolutional layer that extracts features and at least one fully connected layer that is connected to one end of the convolutional layer and calculates the center coordinates of each boundary cell and the probability that the object exists in it. In addition to the convolutional layer and the fully connected layer, the second recognizer used by the calculation unit 124 may further include pooling layers arranged alternately with the convolutional layers.

The process by which the feature information calculating unit 122 according to an embodiment of the present invention calculates the feature information is as follows. The feature information calculating unit 122 divides an input image containing at least one object to generate frame images, and generates grid cells at a preset interval in each frame image. Using the second recognizer, the calculation unit 124 generates an arbitrary number of boundary cells (127, 128, 129, 132, 133) that have their center coordinates inside a grid cell and are dependent on that grid cell, and at the same time may calculate the probability that an object exists inside each boundary cell.

The boundary cell removing unit 126 removes some of the boundary cells that redundantly contain the same object in the frame images, taking into account whether the probability that the object exists in each boundary cell is greater than or equal to a preset threshold. For example, among the boundary cells (127, 128, 129, 132, 133, 134) generated by the calculation unit 124, which differ in the probability that an object exists inside them, the boundary cell removing unit 126 may detect the boundary cells (132, 133) with the highest probability of containing an object as regions of interest.

To this end, among the boundary cells that redundantly contain an object in the frame image, the boundary cell removing unit 126 may keep only the boundary cells (132, 133) whose probability of containing the object is greater than or equal to the preset threshold and remove the remaining boundary cells (127, 128, 129, 134). In addition, to remove boundary cells whose probability of containing an object is less than or equal to the preset threshold, the boundary cell removing unit 126 may use a non-maximum suppression (NMS) algorithm to leave only a single boundary cell. This will be described with reference to FIG. 6.
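The threshold filtering and non-maximum suppression described above can be sketched as follows. This is a generic NMS sketch, not the patent's implementation: the `(x1, y1, x2, y2)` box format, the function names `iou` and `suppress`, and the two threshold values of 0.5 are assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def suppress(cells, score_threshold=0.5, iou_threshold=0.5):
    """Keep one boundary cell per object.

    `cells` is a list of (box, probability) pairs. Cells below the score
    threshold are dropped first; the rest are visited in order of
    decreasing probability, and a cell is kept only if it does not
    overlap an already kept cell by `iou_threshold` or more.
    """
    candidates = sorted(
        (c for c in cells if c[1] >= score_threshold),
        key=lambda c: c[1],
        reverse=True,
    )
    kept = []
    for box, prob in candidates:
        if all(iou(box, k[0]) < iou_threshold for k in kept):
            kept.append((box, prob))
    return kept


# Hypothetical boundary cells: two overlap on one object, one covers a
# second object, and one has a low probability.
cells = [
    ((0, 0, 10, 10), 0.9),
    ((1, 1, 11, 11), 0.8),
    ((20, 20, 30, 30), 0.7),
    ((0, 0, 9, 9), 0.3),
]
rois = suppress(cells)
```

In this example the second cell is suppressed because it overlaps the higher-scoring first cell, and the fourth is removed by the probability threshold, leaving one region of interest per object.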

As described above, the region of interest detection unit 120 may use the feature information calculated by the feature information calculating unit 122 to detect, in the frame images generated by dividing the input image, regions of interest containing an image of at least one object, such as a person or a thing. This will be described with reference to FIG. 7.

The region of interest connection unit 140 includes a connection score calculating unit 142 and a feature information estimation unit 144. For example, the region of interest connection unit 140 calculates a connection score between the regions of interest detected in adjacent frame images, and connects the regions of interest detected in the different frame images in consideration of the calculated connection score. This will be described with reference to FIG. 8.

For example, the region of interest connection unit 140 may connect the regions of interest (331, 333, 335, 337, 339, 341, 343) of a first object and the regions of interest (332, 334, 336, 338, 340, 342, 344) of a second object, each detected in the plurality of frame images (302, 304, 306, 308) generated by dividing the input image, to generate a first action tube 361 for the first object and a second action tube 362 for the second object.

In addition, the region of interest connection unit 140 may preprocess the generated first action tube 361 and second action tube 362 to generate action streams for the first object and the second object, respectively. As described above, the action streams of the present invention may include motion information on the behavior of each object over time. In connecting the regions of interest detected in the frame images, the region of interest connection unit 140 connects them object by object to generate action tubes, and may preprocess the generated action tubes to generate an action stream for each object.

The connection score calculating unit 142 calculates the connection score in consideration of at least one of: a similarity function that takes as input the feature information of the regions of interest detected in each of the adjacent frame images; an intersection-over-union function that outputs the overlap ratio of the regions of interest detected in each of the adjacent frame images; and the similarity of the class information of the regions of interest.

Here, sim() is a similarity function that takes as input the feature information of the regions of interest detected in the respective frame images and outputs a value between 0 and 1, and ov() is an intersection-over-union function that outputs the overlap ratio of the regions of interest detected in the adjacent frame images. The remaining terms of Equation 1 denote, respectively, the similarity of the class information used to classify the behavior of the regions of interest detected in the adjacent frame images, the index numbers of the adjacent frame images, an arbitrary scalar value, and the feature information of the region of interest detected in the frame image of each of the two index numbers. As described above, the feature information of a region of interest detected in a frame image includes the coordinates at which the region of interest is located in that frame image. The class information refers to the categories (list) used to classify the behavior of the object detected in the region of interest.

In Equation 1, the ov() function is an intersection-over-union function that outputs an IoU (Intersection over Union) value, taking as input the intersection region and the union region of the regions of interest detected in adjacent frame images. Specifically, the function may output the area of the intersection region divided by the area of the union region.
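The following Python sketch illustrates the ov() function as intersection area divided by union area, together with one possible way of combining the three quantities named above into a connection score. Equation 1 itself is not reproduced in this text, so the additive form in `link_score`, the center-distance definition of `sim`, the dot-product `class_similarity`, and the dictionary layout of a region of interest are all assumptions made for illustration.

```python
import math


def ov(a, b):
    """IoU of two boxes (x1, y1, x2, y2): intersection area / union area."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def sim(a, b):
    """Assumed similarity in (0, 1]: decays with the distance between centers."""
    ax, ay = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    bx, by = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    return 1.0 / (1.0 + math.hypot(ax - bx, ay - by))


def class_similarity(p, q):
    """Assumed class similarity: dot product of class probability vectors."""
    return sum(x * y for x, y in zip(p, q))


def link_score(roi_a, roi_b, lam=1.0):
    """Assumed additive combination; `lam` plays the role of the scalar."""
    return (sim(roi_a["box"], roi_b["box"])
            + lam * ov(roi_a["box"], roi_b["box"])
            + class_similarity(roi_a["cls"], roi_b["cls"]))


# Hypothetical regions of interest in two adjacent frames.
roi_t = {"box": (0, 0, 2, 2), "cls": [1.0, 0.0]}
roi_t1 = {"box": (1, 1, 3, 3), "cls": [1.0, 0.0]}
overlap = ov(roi_t["box"], roi_t1["box"])
score = link_score(roi_t, roi_t1)
```

For the two 2x2 boxes above, the intersection has area 1 and the union has area 7, so ov() returns 1/7.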

The process by which the connection score calculating unit 142 according to an embodiment of the present invention calculates the connection score (link score) is described below using image frames 302 and 304. The connection score calculating unit 142 calculates the connection score between the first region of interest 331, which relates to the first object among the regions of interest (331, 332) detected in the first image frame 302, and the first region of interest 333 in the second image frame 304.

In addition, the connection score calculating unit 142 calculates the connection score between the first region of interest 331 in the first image frame 302 and the second region of interest 334 in the second image frame 304. That is, the connection score calculating unit 142 may calculate a connection score for every possible link between a region of interest in one image frame and a region of interest in an adjacent image frame. The region of interest connection unit 140 may then connect the regions of interest in the image frames so that the connection score is highest. This will be described with reference to FIG. 9.
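One way to picture "scoring every possible link and keeping the highest-scoring connection" is an exhaustive search over assignments between two adjacent frames, sketched below. The function name `best_links`, the brute-force search, and the toy one-dimensional score are assumptions for illustration only; the sketch also assumes the next frame contains at least as many regions of interest as the previous one.

```python
from itertools import permutations


def best_links(prev_rois, next_rois, score):
    """Return (i, j) pairs linking prev_rois[i] to next_rois[j].

    Every one-to-one assignment is scored with `score(prev, next)` and the
    assignment with the highest total is returned. Assumes
    len(next_rois) >= len(prev_rois).
    """
    best, best_total = None, float("-inf")
    for perm in permutations(range(len(next_rois)), len(prev_rois)):
        total = sum(score(prev_rois[i], next_rois[j])
                    for i, j in enumerate(perm))
        if total > best_total:
            best, best_total = perm, total
    return list(enumerate(best))


# Toy example: regions of interest reduced to 1-D positions, scored by
# negative distance so that nearby regions are preferred.
prev_frame = [0, 10]
next_frame = [11, 1]
links = best_links(prev_frame, next_frame, lambda a, b: -abs(a - b))
```

Here the first region of the previous frame is linked to the second region of the next frame and vice versa, because that assignment gives the highest total score. The brute-force search is only practical for the small number of objects per frame assumed in this sketch.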

When there is a frame image in which no region of interest has been detected, the feature information estimation unit 144 estimates the feature information of the region of interest in that frame image 372 based on the feature information of the regions of interest (335, 336, 337, 338, 341, 342, 343, 344) detected in the frame images 375 adjacent to it.

For example, when the region of interest detection unit 120 divides the input image and detects regions of interest in each frame image, it may fail to extract a region of interest properly because of problems such as image quality. That is, as shown in FIG. 9, the frame images 375 may include a frame image 372 in which no region of interest has been detected.

In this case, the feature information estimation unit 144 may generate a linear function that takes as input the feature information of the regions of interest in the frame images (371, 373, 379) adjacent to the frame images (374, 376, 378) in which no region of interest was detected, and may use the generated linear function to estimate the feature information of the regions of interest in those frame images (374, 376, 378).

The feature information of the present invention means the coordinates of a region of interest within a frame image. Based on the positions of the regions of interest in the frame images (371, 373, 379) adjacent to the frame images (374, 376, 378) in which no region of interest was detected, the feature information estimation unit 144 may estimate the positions of the regions of interest in those frame images (374, 376, 378).

That is, the feature information estimation unit 144 estimates the feature information, namely the coordinates of the region of interest, for the frame images in which no region of interest was detected by interpolating from the feature information of the adjacent frame images (371, 373, 379) in which regions of interest were detected. The region of interest connection unit 140 may use the estimated feature information to generate an estimated region of interest in each frame image where none was detected, and may connect the estimated regions of interest with the regions of interest in the adjacent frame images.
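The interpolation step can be sketched as a linear function of the frame index fitted through the two nearest detected regions of interest. The function name `interpolate_box`, the `(x1, y1, x2, y2)` coordinate format, and the sample frame indices are assumptions for illustration; the patent only states that a linear function of the neighboring feature information is used.

```python
def interpolate_box(box_a, t_a, box_b, t_b, t):
    """Linearly interpolate box coordinates for frame index t.

    `box_a` and `box_b` are boxes (x1, y1, x2, y2) detected in frames
    `t_a` and `t_b`, with t_a < t < t_b.
    """
    w = (t - t_a) / (t_b - t_a)
    return tuple(a + w * (b - a) for a, b in zip(box_a, box_b))


# Hypothetical case: detections exist in frames 1 and 3 but not in frame 2.
box_frame1 = (10, 10, 50, 50)
box_frame3 = (30, 10, 70, 50)
box_frame2 = interpolate_box(box_frame1, 1, box_frame3, 3, 2)
```

The estimated box for the missing frame lies halfway between its two neighbors and can then be linked into the action tube like any detected region of interest.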

The location information calculating unit 160 includes a region dividing unit 162 and a rearrangement unit 164. For example, the location information calculating unit 160 rearranges the regions of interest detected in the frame images on a frame-by-frame basis, and calculates location information on the mutual positions of the action streams from a combination stream generated by connecting the rearranged regions of interest.

For example, the location information calculating unit 160 rearranges the regions of interest detected for each object in the frame images according to a predetermined combination method, generates a combination stream from the rearranged regions of interest, and uses the generated combination stream to calculate location information on the mutual positions of the action streams generated for each object. In the present invention, the location information is not appearance information; it refers to the relationship between the mutual positions of the regions of interest detected for each object in a frame image. A specific method of calculating the location information is described later.

For example, for an input image containing a first object (a person) and a second object (a thing), the location information calculating unit 160 may generate a combination stream by arranging the regions of interest of the first object and those of the second object in an interleaved arrangement or a dense arrangement, and may use the generated combination stream to calculate location information on the mutual positions of the action stream of the first object and the action stream of the second object. The method by which the location information calculating unit 160 calculates the positional relationship of the action streams generated for each object is not limited to masking or binary processing, and includes other known techniques for inputting the positional relationship of regions of interest. This will be described with reference to FIG. 11.

The region dividing unit 162 distinguishes, in the frame images, the regions of interest detected for each object from the portions in which no region of interest was detected, and masks them. The following describes, as an example, the case in which the region of interest detection unit 120 has detected regions of interest for a first object and a second object in an input image containing the first object (a person) and the second object (a thing).

In this case, the region dividing unit 162 of this embodiment may include a masking unit (not shown) that distinguishes and masks the portions in which no region of interest was detected.

For example, the masking unit (not shown) may mask the frame images by distinguishing, with two pixel values, the portions in which regions of interest of the first object were detected from the portions that are not regions of interest of the first object. Likewise, it may mask the frame images by distinguishing, with two pixel values, the portions in which regions of interest of the second object were detected from the portions that are not regions of interest of the second object.

In the present invention, the masking by the region dividing unit 162 of the portions of the frame images in which regions of interest were detected and the portions in which they were not may be implemented as binary processing of the frame images. Based on the masking performed by the region dividing unit 162, the location information calculating unit 160 may calculate location information indicating the positional relationship between the regions of interest detected in each frame image.
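The binary masking described above can be sketched as follows: each frame is reduced to a two-valued image that is 1 inside an object's region of interest and 0 elsewhere, so only position, not appearance, is retained. The function name `roi_mask`, the `(x1, y1, x2, y2)` box format with exclusive upper bounds, and the choice of 1 and 0 as the two pixel values are assumptions for illustration.

```python
def roi_mask(height, width, box):
    """Return a height x width binary mask that is 1 inside `box`.

    `box` is (x1, y1, x2, y2) in pixel coordinates, with x2 and y2
    exclusive.
    """
    x1, y1, x2, y2 = box
    return [
        [1 if x1 <= x < x2 and y1 <= y < y2 else 0 for x in range(width)]
        for y in range(height)
    ]


# Hypothetical 4x5 frame with a 2x2 region of interest for one object.
mask = roi_mask(4, 5, (1, 1, 3, 3))
```

One such mask would be produced per object and per frame, which gives the per-object masked regions of interest that are later rearranged into the combination stream.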

The rearrangement unit 164 rearranges the distinguished regions of interest according to a predetermined combination method. For example, the rearrangement unit 164 may combine the masked regions of interest 277 of the first object and the masked regions of interest 279 of the second object according to a predetermined combination method. One such method takes frame-by-frame information into account and interleaves the per-object frame images, in the order: masked region of interest of the first object, masked region of interest of the second object, masked region of interest of the first object, and so on.

Another method by which the rearrangement unit 164 combines the distinguished regions of interest takes per-object information into account, arranging the masked regions of interest of the first object first and then the masked regions of interest of the second object after them.
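The two combination methods can be sketched as simple list operations on the per-object sequences of masked regions of interest. The function names `interleave` and `grouped` and the string labels standing in for masked frames are assumptions for illustration.

```python
def interleave(first, second):
    """Frame-wise order: first object, second object, first object, ..."""
    combined = []
    for a, b in zip(first, second):
        combined.extend([a, b])
    return combined


def grouped(first, second):
    """Object-wise order: all of the first object, then all of the second."""
    return list(first) + list(second)


# Labels stand in for the masked regions of interest of each frame.
first_object = ["A0", "A1", "A2"]
second_object = ["B0", "B1", "B2"]
interleaved_stream = interleave(first_object, second_object)
grouped_stream = grouped(first_object, second_object)
```

Both orders contain the same masked frames; they differ only in whether neighboring entries of the combination stream belong to the same frame or to the same object.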

The location information calculating unit 160 generates a combination stream from the regions of interest rearranged by the predetermined combination method, and may use the generated combination stream to calculate location information on the mutual positions of the action streams generated for each object. The combination stream generated by the location information calculating unit 160 may be provided as a pairwise stream, which consists of a pair of streams: one connecting the masked regions of interest of the first object and one connecting the masked regions of interest of the second object.

The pairwise stream of the present invention contains location information indicating the positional relationship of the per-object action streams, excluding the appearance information contained in those action streams. That is, the location information calculating unit 160 may use the combination stream described above to calculate location information indicating the positional relationship between the action stream of the first object and the action stream of the second object, which allows the image recognition device 10 of the present invention to recognize the behavior of an object based on the positional relationship between the objects in the image.

The recognition unit 200 includes a summing unit 220 and a comparison unit 240. For example, the recognition unit 200 may recognize the behavior of the objects using a first recognizer that takes the generated action stream or the location information of the action stream as input and outputs at least one class vector as an index for classifying the behavior of the object. The class vector used by the recognition unit 200 includes a behavior list for classifying the behavior of the object and probability information indicating the probability that the behavior of the objects in the input image corresponds to each entry in the behavior list. The recognition unit 200 may recognize the behavior of the objects using the probability information for each behavior-list entry of the summed class vector.

The summing unit 220 sums, according to a preset method, the class vectors that the first recognizer outputs for the action streams of the individual objects. For example, the summing unit 220 may sum, per behavior-list entry and according to a preset method, the output values that the first recognizer produces when given the action stream generated for each object by the stream generator 100 and the combination stream containing the positional relationship of those action streams.

For example, the summing unit 220 may sum three class vectors, applying either different weights or an equal weight such as 1/3 to each: a first class vector, which is output when the action stream of the first object (a person) is input to the first recognizer and serves as an index for classifying the behavior of the first object; a second class vector, which is output when the action stream of the second object (a thing) is input to the first recognizer and serves as an index for classifying the behavior of the second object; and a third class vector, which is output when the combination stream containing the positional relationship of the action streams generated for each person or thing is input to the first recognizer. Summing the first, second, and third class vectors with the equal weight of 1/3 corresponds to taking the average of the three class vectors.
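The weighted summation of class vectors can be sketched as below. The function name `fuse`, the three-entry behavior list, and the probability values are hypothetical and chosen only to show the arithmetic; with no weights given, the sketch applies the equal weight 1/3 mentioned above, which is the element-wise average.

```python
def fuse(class_vectors, weights=None):
    """Weighted element-wise sum of class vectors.

    With `weights=None`, every vector gets the weight 1/n, so the result
    is the average of the vectors.
    """
    n = len(class_vectors)
    if weights is None:
        weights = [1.0 / n] * n
    return [
        sum(w * v[k] for w, v in zip(weights, class_vectors))
        for k in range(len(class_vectors[0]))
    ]


# Hypothetical behavior list and class vectors for illustration only.
behaviors = ["walk", "run", "jump"]
first_vector = [0.6, 0.3, 0.1]   # action stream of the first object
second_vector = [0.2, 0.5, 0.3]  # action stream of the second object
third_vector = [0.7, 0.2, 0.1]   # combination (pairwise) stream

fused = fuse([first_vector, second_vector, third_vector])
predicted = behaviors[max(range(len(fused)), key=fused.__getitem__)]
```

Passing an explicit `weights` list, for example `[0.5, 0.2, 0.3]`, corresponds to the case in which different weights are applied to the three class vectors.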

The comparison unit 240 compares the per-entry probability information of the behavior list in the summed class vector with the per-entry probability information for object behaviors stored in a previously prepared ground-truth data set (the UCF-101 Detection Data Set). For example, using a data set that stores the per-entry probability information for known object behaviors, the comparison unit 240 checks whether the object behavior indicated by the class vector summed by the summing unit 220 matches, and can thereby accurately recognize the behavior of the objects in the image.

FIG. 2 is an enlarged block diagram of the stream generating unit in the embodiment of FIG. 1.

The stream generating unit 100 includes a region-of-interest detection unit 120, a region-of-interest connection unit 140, and a location information calculating unit 160. For example, the stream generating unit 100 generates action streams that contain, separately for each object, motion information on the behavior of the objects in the input image. An action stream is generated by connecting, over time, the regions of interest detected in the individual image frames.

The motion information on an object's behavior may include both the change over time in the positions of the regions of interest and the change in pixel values within the regions of interest. Alternatively, it may include only the change in pixel values within the regions of interest, excluding the change in their positions within the frame images.

For example, because the positions and pixel values of the regions of interest in an image frame change over time, the change in position of the regions of interest within the image frame and the change in pixel values within the regions of interest can each constitute motion information on changes in the object's behavior. Preferably, however, as described above, the motion information of the present invention includes only the change over time in pixel values within the regions of interest.

Nevertheless, because the image recognition device 10 of the present invention uses a pairwise stream, it can recognize the behavior of objects in the image while also taking into account the change in position of the regions of interest within the frame images, as described later.

The region-of-interest detection unit 120 divides the input image over time to generate a plurality of frame images and detects, in the generated frame images, regions of interest that are distinguished for each object and each contain at least part of that object. The region-of-interest detection unit 120 may divide the input image into units of 16 frames and, in each divided frame image, detect regions of interest that mainly contain the image of an object.
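As a rough sketch of the 16-frame division, the snippet below chunks a frame sequence into fixed-length units. The function name `split_into_clips` is hypothetical, and keeping a shorter final unit is an assumption; the text does not say how leftover frames are handled.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")


def split_into_clips(frames: Sequence[Frame], clip_len: int = 16) -> List[List[Frame]]:
    """Split a frame sequence into consecutive units of clip_len frames.

    Assumption: a final unit shorter than clip_len is kept rather than dropped.
    """
    if clip_len <= 0:
        raise ValueError("clip_len must be positive")
    return [list(frames[i:i + clip_len]) for i in range(0, len(frames), clip_len)]


# Frame indices stand in for decoded frame images.
clips = split_into_clips(list(range(40)))
clip_lengths = [len(c) for c in clips]
```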

The region-of-interest connection unit 140 connects the regions of interest detected in the frame images divided over time to generate an action tube for each object. As described above, the region-of-interest connection unit 140 may generate an action stream by preprocessing the generated action tubes to match the input format of the first recognizer.

The location information calculating unit 160 rearranges the regions of interest detected in the frame images on a per-frame basis, and calculates location information on the mutual positions of the action streams from the combination stream generated by connecting the rearranged regions of interest. Using the location information calculated by the location information calculating unit 160, the image recognition device 10 of the present invention can recognize the behavior of objects in the image from the positional relationship of the action streams generated for each object.

FIG. 3 is an enlarged block diagram of the region-of-interest detection unit in the embodiment of FIG. 2.

The region-of-interest detection unit 120 includes a feature information calculating unit 122 and a boundary cell removal unit 126. The feature information calculating unit 122 calculates feature information, which is the coordinates of the position of a region of interest to be detected in a frame image. To detect regions of interest in the image, the feature information calculating unit 122 may use a vision recognition algorithm based on a convolutional neural network, such as YOLO (You Only Look Once) or Fast R-CNN.

The boundary cell removal unit 126 removes some of the boundary cells that redundantly contain the same object in the frame images, taking into account whether the probability that the object exists in each boundary cell is equal to or greater than a preset threshold. The method by which regions of interest are detected from the boundary cells that remain after removal is as described above.

FIG. 4 is an enlarged block diagram of the feature information calculating unit in the embodiment of FIG. 3.

The feature information calculating unit 122 includes a preprocessing unit 123 and a calculation unit 124. The preprocessing unit 123 may divide each generated frame image into grid cells of a predetermined size and, as described above, each generated grid cell has boundary cells that are dependent on it.

The second recognizer of the present invention may include at least one convolutional layer that extracts convolutional features, and a fully connected layer that is connected to one end of the convolutional layer and calculates the center coordinates of each boundary cell and the probability that an object exists in the boundary cell.

FIG. 5 is a reference diagram showing the process by which the region-of-interest detection unit detects regions of interest.

The boundary cells (127, 128, 129, 132, 133, 134), which are dependent on the grid cells generated by the preprocessing unit 123, each indicate the probability that part of an object is contained within them. As described above, the region-of-interest detection unit 120 may remove every generated boundary cell whose object-existence probability is below the preset threshold and detect regions of interest using the boundary cells that remain.
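The threshold-based removal can be illustrated with a short sketch. The `BoundaryCell` record, its field names, and the default threshold of 0.5 are all assumptions made for illustration; the text only states that a preset threshold is used.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class BoundaryCell:
    """Hypothetical record: center coordinates, size, and object probability."""
    cx: float
    cy: float
    width: float
    height: float
    prob: float


def remove_low_probability_cells(
    cells: Sequence[BoundaryCell], threshold: float = 0.5
) -> List[BoundaryCell]:
    """Keep only cells whose object probability is at or above the threshold."""
    return [cell for cell in cells if cell.prob >= threshold]


candidates = [
    BoundaryCell(120.0, 80.0, 60.0, 150.0, 0.99),  # e.g. a person
    BoundaryCell(200.0, 140.0, 40.0, 40.0, 0.67),  # e.g. a thing
    BoundaryCell(30.0, 30.0, 20.0, 20.0, 0.12),    # likely background
]
kept = remove_low_probability_cells(candidates)
```

Cells that survive the filter would then be the candidates from which the regions of interest are chosen.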

As shown in FIG. 5, the region-of-interest detection unit 120 may, taking into account the probabilities indicated by the boundary cells, detect a region of interest for the first object (person) with a probability of 99% and a region of interest for the second object (thing) with a probability of 67%. The probability information that the region-of-interest detection unit 120 uses in detecting regions of interest includes the IOU (Intersection over Union) described above.

FIG. 6 is an exemplary view showing regions of interest detected by the region-of-interest detection unit.

As shown in FIG. 6, the region-of-interest detection unit 120 may detect a region of interest for each object in the image. For example, when several people are present in the input image, it may detect regions of interest for a first object (person) and a second object (person); when a person and a thing are present in the input image, it may detect regions of interest for a first object (person) and a second object (thing).

FIG. 7 is an enlarged block diagram of the region-of-interest connection unit in the embodiment of FIG. 2.

The region-of-interest connection unit 140 includes a connection score calculating unit 142 and a feature information estimation unit 144. For example, the connection score calculating unit 142 may calculate a connection score by considering at least one of: a similarity function that takes as input the feature information of the regions of interest detected in each of the adjacent frame images; an intersection-ratio function that outputs the overlap ratio of the regions of interest detected in each of the adjacent frame images; and the similarity of the class information of the regions of interest. The specific method by which the connection score calculating unit 142 calculates the connection score using Equation 1 is as described above and is omitted here.
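To make the three ingredients of the connection score concrete, here is one possible sketch. It is not Equation 1, which is not reproduced in this passage: the unweighted sum, the exponential center-distance similarity, the `scale` parameter, and all function names are assumptions chosen for illustration.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout


def box_area(box: Box) -> float:
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a: Box, b: Box) -> float:
    """Overlap ratio: intersection area divided by union area."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def center_similarity(a: Box, b: Box, scale: float = 50.0) -> float:
    """Assumed similarity: decays with the distance between box centers."""
    ax, ay = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    bx, by = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    return math.exp(-math.hypot(ax - bx, ay - by) / scale)


def connection_score(box_a: Box, class_a: str, box_b: Box, class_b: str) -> float:
    """Assumed combination: plain sum of the three terms named in the text."""
    class_similarity = 1.0 if class_a == class_b else 0.0
    return center_similarity(box_a, box_b) + iou(box_a, box_b) + class_similarity


same = connection_score((0, 0, 2, 2), "person", (0, 0, 2, 2), "person")
far = connection_score((0, 0, 2, 2), "person", (500, 500, 502, 502), "thing")
```

A higher score indicates that two regions in adjacent frames more plausibly belong to the same object.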

When there is a frame image in which no region of interest was detected, the feature information estimation unit 144 estimates the feature information of the region of interest in that frame image based on the feature information of the regions of interest detected in the adjacent frame images. The specific method by which the feature information estimation unit 144 uses interpolation to estimate the position of the region of interest in such a frame image, and connects the regions of interest using the estimated position, is as described above.
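The interpolation step might look like the sketch below. Linear interpolation of center coordinates is an assumption (the text says only that an interpolation method is used), and `fill_missing_centers` is a hypothetical name. Gaps at the very start or end of the sequence have no neighbor on one side and are left unfilled here.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]


def fill_missing_centers(centers: Sequence[Optional[Point]]) -> List[Optional[Point]]:
    """Fill None entries between two known centers by linear interpolation."""
    result: List[Optional[Point]] = list(centers)
    known = [i for i, c in enumerate(centers) if c is not None]
    for left, right in zip(known, known[1:]):
        (x0, y0), (x1, y1) = centers[left], centers[right]
        for i in range(left + 1, right):
            t = (i - left) / (right - left)  # fraction of the way across the gap
            result[i] = (x0 + t * (x1 - x0), y0 + t * (y1 - y0))
    return result


# Frames 1 and 2 had no detection; frames 0 and 3 did.
filled = fill_missing_centers([(0.0, 0.0), None, None, (3.0, 6.0)])
```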

FIG. 8 is a reference diagram showing the process by which the region-of-interest connection unit connects regions of interest.

In addition, as described above, the region-of-interest connection unit 140 may preprocess the generated first action tube 361 and the second action tube 362 for the second object to generate action streams for the first object and the second object, respectively.

FIG. 9 is an exemplary view showing the process by which the feature information estimation unit estimates the feature information of a region of interest.

When there is a frame image in which no region of interest was detected, the feature information estimation unit 144 estimates the feature information of the region of interest in that frame image based on the feature information of the regions of interest detected in the adjacent frame images. The process by which the feature information estimation unit 144 uses interpolation to estimate feature information in such a frame image is as described above and is omitted here.

FIG. 10 is an enlarged block diagram of the location information calculating unit in the embodiment of FIG. 2.

The location information calculating unit 160 includes a region division unit 162 and a rearrangement unit 164. For example, the location information calculating unit 160 rearranges the regions of interest detected in the frame images on a per-frame basis, and calculates location information on the mutual positions of the action streams from the combination stream generated by connecting the rearranged regions of interest. The specific process by which the location information calculating unit 160 calculates the positional relationship of the action streams generated for each object is as described above and is omitted here.

FIG. 11 shows the image recognition process performed by the image recognition device of the present invention.

The image recognition device 10 of the present invention generates per-object action streams (272, 274) from an input image containing at least one dynamic object and, at the same time, combines the action streams generated for each object to generate a combination stream 279. As described above, the combination stream of the present invention represents the positional relationship of the per-object action streams (272, 274).

The image recognition device 10 inputs the per-object action streams (272, 274) and the combination stream (pairwise stream, 279) to the first recognizer, sums the resulting class vectors by a preset method, and recognizes the behavior of the objects in the image using the summed class vector.

FIG. 12 is an enlarged block diagram of the recognition unit in the embodiment of FIG. 1.

The recognition unit 200 includes a summing unit 220 and a comparison unit 240. For example, the recognition unit 200 may recognize the behavior of the objects using a first recognizer that takes as input the generated action stream or the location information of the action stream and outputs at least one class vector as an index for classifying the behavior of the object.

FIG. 13 is a flowchart of an image recognition method according to an embodiment of the present invention.

The image recognition method (10) performed by the image recognition device 10 includes the following steps, which are performed in time series.

In S100, the stream generating unit 100 generates, from an input image containing at least one object, an action stream containing motion information on the behavior of each object. When the input image contains a plurality of objects, the stream generating unit 100 may generate an action stream for each object. In addition, as described above, the stream generating unit 100 generates a combination stream containing location information that represents the positional relationship of the per-object action streams, so that the behavior of objects in the image can be recognized using that positional relationship.

In S200, the recognition unit 200 may recognize the behavior of the objects using a first recognizer that takes as input the generated action stream or the location information of the action stream and outputs at least one class vector as an index for classifying the behavior of the object. For example, the recognition unit 200 sums the outputs of the first recognizer, which receives as separate inputs the per-object action streams and the combination stream representing their positional relationship, to recognize the behavior of the objects in the image; the specific method is as described above.

FIG. 14 is an enlarged flowchart of the generating step in the embodiment of FIG. 13.

In S120, the region-of-interest detection unit 120 divides the input image over time to generate a plurality of frame images and detects, in the generated frame images, regions of interest that are distinguished for each object and each contain at least part of that object. The specific method by which the region-of-interest detection unit 120 detects regions of interest in the plurality of frame images is as described above and is omitted here.

In S140, the region-of-interest connection unit 140 calculates connection scores between the regions of interest detected in different frame images and connects those regions of interest in consideration of the calculated connection scores. The region-of-interest connection unit 140 may connect regions of interest by considering the connection scores between one region of interest detected in a given frame image and all regions of interest detected in another frame image.
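One way to read the linking step is as a greedy search: each tube is extended with the detection in the next frame that has the highest connection score against the tube's latest region. The sketch below assumes that reading. The name `link_regions`, the pluggable `score_fn`, and the one-dimensional toy detections are illustrative assumptions, and the sketch does not prevent two tubes from choosing the same detection.

```python
from typing import Callable, List, Sequence, TypeVar

Det = TypeVar("Det")


def link_regions(
    frames: Sequence[Sequence[Det]],
    score_fn: Callable[[Det, Det], float],
) -> List[List[Det]]:
    """Greedily grow one tube per detection found in the first frame.

    Frames with no detections are skipped here; the text fills such gaps
    by interpolation instead.
    """
    if not frames:
        return []
    tubes: List[List[Det]] = [[det] for det in frames[0]]
    for detections in frames[1:]:
        if not detections:
            continue
        for tube in tubes:
            last = tube[-1]
            # Compare the tube's latest region with every candidate in this frame.
            best = max(detections, key=lambda det: score_fn(last, det))
            tube.append(best)
    return tubes


# Toy example: detections are 1-D positions, score is negative distance.
toy_frames = [[0.0, 10.0], [9.0, 1.0], [2.0, 11.0]]
tubes = link_regions(toy_frames, lambda a, b: -abs(a - b))
```

In practice `score_fn` would combine feature similarity, overlap ratio, and class similarity as described for the connection score calculating unit 142.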

In S160, the location information calculating unit 160 may rearrange the per-object regions of interest detected in the frame images on a per-frame basis, and calculate location information representing the mutual positional relationship of the action streams from the combination stream generated by connecting the rearranged regions of interest. The method by which the location information calculating unit 160 rearranges the regions of interest detected in the frame images is as described above and is omitted here.

FIG. 15 is an enlarged flowchart of the detecting step in the embodiment of FIG. 14.

In S122, the feature information calculating unit 122 may calculate feature information indicating the coordinates at which the region of interest is located in the generated frame image. As described above, in the present invention the feature information is the coordinates of the position of a region of interest to be detected in a frame image, and may be provided as the center coordinates of the region of interest in the frame image.

In S126, the boundary cell removal unit 126 removes some of the boundary cells that redundantly contain the same object in the frame images, taking into account whether the probability that the object exists in each boundary cell is equal to or greater than a preset threshold. The region-of-interest detection unit 120 detects regions of interest using the boundary cells that remain after removal by the boundary cell removal unit 126.

FIG. 16 is an enlarged flowchart of the connecting step in the embodiment of FIG. 14.

In S142, the connection score calculating unit 142 calculates a connection score by considering at least one of: a similarity function that takes as input the feature information of the regions of interest detected in each of the adjacent frame images; an intersection-ratio function that outputs the overlap ratio of the regions of interest detected in each of the adjacent frame images; and the similarity of the class information of the regions of interest. The specific method by which the connection score calculating unit 142 calculates the connection scores of the regions of interest detected in each of the adjacent frame images is as described above and is omitted here.

In S144, when the frame images generated by dividing the input image include a frame image in which no region of interest was detected, the feature information estimation unit 144 may estimate the position of the region of interest in that frame image using the coordinates of the regions of interest detected in the adjacent frame images.

FIG. 17 is an enlarged flowchart of the step of calculating location information in the embodiment of FIG. 14.

In S162, the region division unit 162 distinguishes, in the frame images, the regions of interest detected for each object from the portions in which no region of interest was detected. For example, to obtain the location information of the regions of interest detected in a frame image, the region division unit 162 may mask the image using two pixel values, one for the region of interest and one for everything else.
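A two-valued mask of this kind can be sketched as below. The pixel values 1 and 0, the `(x1, y1, x2, y2)` box layout with exclusive upper bounds, and the name `mask_region` are assumptions for illustration.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2); x2 and y2 are exclusive


def mask_region(
    height: int, width: int, box: Box, inside: int = 1, outside: int = 0
) -> List[List[int]]:
    """Return a height x width mask: `inside` within the box, `outside` elsewhere."""
    x1, y1, x2, y2 = box
    return [
        [inside if (x1 <= x < x2 and y1 <= y < y2) else outside for x in range(width)]
        for y in range(height)
    ]


# A 3 x 4 frame whose region of interest covers columns 1-2 of rows 0-1.
mask = mask_region(3, 4, (1, 0, 3, 2))
```

Because only position survives in such a mask, a stream built from these frames carries where each object is rather than what it looks like.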

In S164, the rearrangement unit 164 rearranges the distinguished regions of interest by a predetermined combination method. For example, the rearrangement unit 164 may combine the masked regions of interest (277) for the first object with the masked regions of interest (279) for the second object by a predetermined combination method. The methods by which the rearrangement unit 164 combines the masked regions of interest may include alternating the regions of interest for the first object with those for the second object, and placing the regions of interest for the first object first, followed by those for the second object.
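The two combination orders mentioned above can be sketched with plain lists. The function names are hypothetical, and the strings stand in for masked region-of-interest frames; the alternating variant assumes both objects have the same number of frames.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def combine_alternating(first: Sequence[T], second: Sequence[T]) -> List[T]:
    """Alternate frames: first[0], second[0], first[1], second[1], ..."""
    if len(first) != len(second):
        raise ValueError("both sequences must have the same number of frames")
    combined: List[T] = []
    for a, b in zip(first, second):
        combined.extend([a, b])
    return combined


def combine_sequential(first: Sequence[T], second: Sequence[T]) -> List[T]:
    """Place all frames of the first object before those of the second."""
    return list(first) + list(second)


first_object = ["p0", "p1", "p2"]   # masked regions for the first object
second_object = ["t0", "t1", "t2"]  # masked regions for the second object
alternating = combine_alternating(first_object, second_object)
sequential = combine_sequential(first_object, second_object)
```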

By distinguishing the regions of interest detected for each object and rearranging them to generate a combination stream, the image recognition device 10 can recognize the behavior of the objects using the positional relationship of the objects in the image.

FIG. 18 is an enlarged flowchart of the recognizing step in the embodiment of FIG. 13.

In S220, the summing unit 220 sums, for each entry in the behavior list and according to a preset method, the class vectors that the first recognizer outputs for each of the per-object action streams. For example, the summing unit 220 may apply weights to and sum the class vectors output by the first recognizer when it receives the per-object action streams and the combination stream representing the positional relationship of those action streams.

In S240, the comparison unit 240 compares the probability information for each entry in the behavior list of the summed class vector with the per-behavior probability information of objects stored in a previously prepared ground-truth data set (UCF-101 Detection Data Set).

All or part of the method of an embodiment of the present invention described above may be implemented in the form of a computer-executable recording medium (or computer program product), such as a program module executed by a computer. Here, the computer-readable medium may include computer storage media (for example, memory, a hard disk, magnetic/optical media, or a solid-state drive (SSD)). The computer-readable medium may be any available medium that can be accessed by a computer, and includes volatile and non-volatile media as well as removable and non-removable media.

In addition, all or part of the method according to an embodiment of the present invention may include computer-executable instructions. The computer program includes programmable machine instructions processed by a processor and may be implemented in a high-level programming language, an object-oriented programming language, assembly language, machine language, or the like.

In this specification, a unit (means) or module may refer to hardware capable of performing the functions and operations associated with its name as described herein, to computer program code capable of performing specific functions and operations, or to an electronic recording medium, for example a processor or microprocessor, loaded with computer program code capable of performing specific functions and operations. In other words, a unit or module may refer to a functional and/or structural combination of hardware for carrying out the technical idea of the present invention and/or software for driving that hardware.

Accordingly, the method according to an embodiment of the present invention may be implemented by executing a computer program as described above on a computing device. The computing device may include at least some of: a processor, memory, a storage device, a high-speed interface connected to the memory and a high-speed expansion port, and a low-speed interface connected to a low-speed bus and the storage device. These components are interconnected using various buses and may be mounted on a common motherboard or in another suitable manner.

The above description merely illustrates the technical idea of the present invention, and a person of ordinary skill in the art to which the present invention pertains may make various modifications, changes, and substitutions without departing from its essential characteristics. Accordingly, the embodiments and accompanying drawings disclosed herein are intended to explain, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments or drawings. The scope of protection of the present invention should be interpreted according to the claims below, and all technical ideas within a scope equivalent to them should be interpreted as falling within the scope of rights of the present invention.

Claims

An image recognition apparatus comprising: a stream generating unit that, from adjacent different frame images including a first object that is an acting subject and at least one second object located around the acting subject, generates a first action stream including motion information on the behavior of the first object, generates a second action stream according to changes in the second object, and combines the first action stream and the second action stream to generate a combination stream; and
a recognition unit that analyzes and recognizes the behavior of the objects using a summed class vector, the recognition unit including a first recognizer that receives the first action stream, the second action stream, and the combination stream and outputs, for each, a class vector serving as an index for classifying the behavior of the object that is the acting subject, and a summation unit that sums the class vectors output from the first recognizer according to a preset method,
wherein the combination stream is based on location information on the mutual positions of the regions of interest detected for each object in the first action stream and the second action stream generated for the first object and the second object, respectively, and
wherein the class vector includes a behavior list for classifying the behavior of the first object and probability information indicating the probability that the behavior of the first object and the second objects in the input image corresponds to the behavior list, and the recognition unit recognizes the behavior of the objects in consideration of the association between the first object and the second object according to the probability information for each entry in the behavior list of the summed class vector.
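
For illustration only, the summation of class vectors recited above can be sketched as follows. The claim does not specify the "preset method", so this sketch assumes an optionally weighted element-wise sum; the function names, the behavior list entries, and the probability values are hypothetical and are not taken from the specification.

```python
def sum_class_vectors(class_vectors, weights=None):
    """Element-wise (optionally weighted) sum of per-stream class vectors.

    Assumption: the 'preset method' is a weighted sum; equal weights by default.
    """
    if weights is None:
        weights = [1.0] * len(class_vectors)
    summed = [0.0] * len(class_vectors[0])
    for weight, vector in zip(weights, class_vectors):
        for i, prob in enumerate(vector):
            summed[i] += weight * prob
    return summed


def recognize_action(action_list, class_vectors, weights=None):
    """Return the behavior-list entry with the highest summed probability."""
    summed = sum_class_vectors(class_vectors, weights)
    best = max(range(len(summed)), key=lambda i: summed[i])
    return action_list[best], summed


# Hypothetical behavior list and per-stream probabilities.
ACTIONS = ["talk", "hug", "handshake"]
first_stream = [0.5, 0.3, 0.2]        # class vector for the first action stream
second_stream = [0.2, 0.6, 0.2]       # class vector for the second action stream
combination_stream = [0.1, 0.7, 0.2]  # class vector for the combination stream

label, summed = recognize_action(
    ACTIONS, [first_stream, second_stream, combination_stream]
)
print(label, summed)
```

In this sketch the first action stream alone would favor a different entry than the summed vector does, which is one way to read the claim's reference to the association between the first object and the second object.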

The image recognition apparatus of claim 1, wherein the stream generating unit further comprises
a region-of-interest detection unit that divides the input image over time to generate a plurality of frame images and, in the generated frame images, detects for each of the first object and the second object a region of interest including at least part of that object,
and generates the action stream using the detected region of interest.

(Deleted)

The image recognition apparatus of claim 2, wherein the stream generating unit further comprises
a region-of-interest connection unit that calculates a connection score between the regions of interest detected in the adjacent frame images and connects the regions of interest detected in the different frame images in consideration of the calculated connection score,
and generates the action stream for each of the first object and the second object using the connected regions of interest.
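
For illustration only, connecting regions of interest across adjacent frames into one stream per object can be sketched as a greedy matching. The claim does not fix the matching strategy or the score, so the greedy assignment, the center-distance score, and all names below are assumptions made for this sketch.

```python
def center(box):
    """Center of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def closeness(box_a, box_b):
    """Placeholder connection score: negative distance between box centers."""
    (ax, ay), (bx, by) = center(box_a), center(box_b)
    return -(((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5)


def link_regions(frames, score_fn=closeness):
    """Greedily extend one stream per region of interest found in the first frame.

    frames: list of per-frame lists of boxes. Each stream takes the
    highest-scoring box that is still unassigned in the next frame.
    """
    streams = [[box] for box in frames[0]]
    for boxes in frames[1:]:
        remaining = list(boxes)
        for stream in streams:
            if not remaining:
                break
            best = max(remaining, key=lambda b: score_fn(stream[-1], b))
            stream.append(best)
            remaining.remove(best)
    return streams


# Two objects over two frames; the detections arrive in a different order per frame.
frames = [
    [(0, 0, 2, 2), (10, 0, 12, 2)],
    [(11, 0, 13, 2), (1, 0, 3, 2)],
]
streams = link_regions(frames)
print(streams)
```

A greedy pass is the simplest choice; a global assignment over all pairs would avoid an early stream taking a box that a later stream matches better.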

The image recognition apparatus of claim 4, wherein the stream generating unit further comprises
a location information calculation unit that rearranges the regions of interest detected in the frame images on a per-frame-image basis and calculates location information on the mutual positions of the action streams from the combination stream generated by connecting the rearranged regions of interest.

The image recognition apparatus of claim 4, wherein the region-of-interest connection unit further comprises
a connection score calculation unit that calculates the connection score in consideration of at least one of: a similarity function that takes as input the feature information of the regions of interest detected in each of the adjacent frame images; an intersection ratio function that outputs the overlap ratio of the regions of interest detected in each of the adjacent frame images; and the similarity of the class information of the regions of interest,
and connects the regions of interest using the calculated connection score.
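
For illustration only, a connection score built from the three terms named above can be sketched as follows. The claim requires only "at least one of" the terms and does not define them, so this sketch assumes cosine similarity for the feature term, intersection over union for the overlap ratio, an exact label match for the class term, and an unweighted sum of all three. The dictionary keys and function names are hypothetical.

```python
import math


def overlap_ratio(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def feature_similarity(u, v):
    """Cosine similarity of two feature vectors (an assumed similarity function)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def connection_score(roi_a, roi_b):
    """Unweighted sum of feature similarity, overlap ratio, and class agreement."""
    class_term = 1.0 if roi_a["label"] == roi_b["label"] else 0.0
    return (
        feature_similarity(roi_a["feature"], roi_b["feature"])
        + overlap_ratio(roi_a["box"], roi_b["box"])
        + class_term
    )


roi_prev = {"box": (0, 0, 2, 2), "feature": [1.0, 0.0], "label": "person"}
roi_next = {"box": (1, 0, 3, 2), "feature": [1.0, 0.0], "label": "person"}
print(connection_score(roi_prev, roi_next))
```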

The image recognition apparatus of claim 6, wherein the region-of-interest connection unit,
when there is a frame image in which the region of interest is not detected,
further comprises a feature information estimator that estimates the feature information of the region of interest in that frame image based on the feature information of the regions of interest detected in the frame images adjacent to it,
and connects the regions of interest using the estimated feature information.
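
For illustration only, estimating the feature information of a missing region of interest from adjacent frames can be sketched as linear interpolation of box coordinates. The claim does not name an estimation method, so the interpolation, the fallback to the nearest detection, and the function name are assumptions made for this sketch.

```python
def fill_missing_regions(boxes):
    """Fill undetected frames (None) from the nearest detected frames.

    boxes: one entry per frame, either (x1, y1, x2, y2) or None.
    Gaps between two detections are linearly interpolated; gaps at either
    end copy the nearest detection.
    """
    known = [i for i, box in enumerate(boxes) if box is not None]
    filled = list(boxes)
    for i, box in enumerate(boxes):
        if box is not None:
            continue
        prev_idx = max((k for k in known if k < i), default=None)
        next_idx = min((k for k in known if k > i), default=None)
        if prev_idx is not None and next_idx is not None:
            t = (i - prev_idx) / (next_idx - prev_idx)
            filled[i] = tuple(
                p + t * (n - p) for p, n in zip(boxes[prev_idx], boxes[next_idx])
            )
        elif prev_idx is not None:
            filled[i] = boxes[prev_idx]
        elif next_idx is not None:
            filled[i] = boxes[next_idx]
    return filled


# The detector missed the object in the middle frame.
track = [(0, 0, 2, 2), None, (4, 0, 6, 2)]
print(fill_missing_regions(track))
```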

The image recognition apparatus of claim 2, wherein the region-of-interest detection unit further comprises
a feature information calculator that calculates feature information indicating the coordinates of the region of interest in the generated frame images,
and detects the region of interest using the calculated feature information.

The image recognition apparatus of claim 5, wherein the location information calculation unit further comprises:
a region dividing unit that, for each of the first object and the second object, distinguishes in the frame images between the region of interest and the portion in which no region of interest is detected; and
a rearrangement unit that rearranges the divided regions of interest according to a predetermined combination method,
and calculates the location information based on the combination stream generated by connecting the rearranged regions of interest.

The image recognition apparatus of claim 8, wherein the feature information calculation unit further comprises:
a pre-processor that divides each of the generated frame images into grid cells of a predetermined size; and
a calculation unit that calculates the center coordinates of each boundary cell and the probability that the object exists in the boundary cell using a second neural network that takes the generated frame images as input and outputs the center coordinates of a boundary cell, which has its center within a grid cell and indicates the probability that the first or second object exists, or the probability that the object exists in the boundary cell,
and calculates the feature information using the boundary cells for which the probability and the center coordinates have been calculated.

The image recognition apparatus of claim 10, wherein the region-of-interest detection unit further comprises
a boundary cell removal unit that removes some of the overlapping boundary cells containing the first object, in consideration of whether, among the boundary cells that overlap on the first object in the frame images, the probability that the first object exists in a boundary cell is equal to or greater than a preset threshold,
and detects the region of interest using the boundary cells that remain after the removal.
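
For illustration only, removing overlapping boundary cells by probability can be sketched as a threshold filter followed by non-maximum suppression. The claim states only that a preset probability threshold is considered, so the use of non-maximum suppression, the overlap threshold, the default values, and all names below are assumptions made for this sketch.

```python
def overlap_ratio(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def remove_boundary_cells(cells, prob_threshold=0.5, overlap_threshold=0.5):
    """Keep confident boundary cells and drop overlapping, less confident ones.

    cells: list of dicts with 'box' (x1, y1, x2, y2) and 'prob' (0..1).
    """
    # Step 1: discard cells whose object probability is below the threshold.
    candidates = sorted(
        (c for c in cells if c["prob"] >= prob_threshold),
        key=lambda c: c["prob"],
        reverse=True,
    )
    # Step 2: among overlapping cells, keep only the most probable one.
    kept = []
    for cell in candidates:
        if all(overlap_ratio(cell["box"], k["box"]) < overlap_threshold for k in kept):
            kept.append(cell)
    return kept


cells = [
    {"box": (0, 0, 10, 10), "prob": 0.9},
    {"box": (1, 1, 11, 11), "prob": 0.8},    # overlaps the first cell heavily
    {"box": (50, 50, 60, 60), "prob": 0.7},  # a separate object
    {"box": (0, 0, 10, 10), "prob": 0.2},    # below the probability threshold
]
remaining = remove_boundary_cells(cells)
print([c["box"] for c in remaining])
```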

(Deleted)

An image recognition method comprising: generating, by a stream generating unit, from adjacent frame images including a first object that is an acting subject and at least one second object located around the acting subject, a first action stream including motion information on the behavior of the first object, generating a second action stream according to changes in the second object, and generating a combination stream by combining the first action stream and the second action stream; and
analyzing and recognizing, by a recognition unit, the behavior of the objects using a summed class vector,
wherein the analyzing and recognizing of the behavior of the objects using the summed class vector includes:
receiving, by a first recognizer, the first action stream, the second action stream, and the combination stream, and outputting, for each, a class vector serving as an index for classifying the behavior of the object that is the acting subject; and summing the class vectors output from the first recognizer according to a preset method,
wherein the combination stream is based on location information on the mutual positions of the regions of interest detected for each object in the first action stream and the second action stream generated for the first object and the second object, respectively, and
wherein the class vector includes a behavior list for classifying the behavior of the first object and probability information indicating the probability that the behavior of the first object and the second objects in the input image corresponds to the behavior list, and the recognition unit recognizes the behavior of the objects in consideration of the association between the first object and the second object according to the probability information for each entry in the behavior list of the summed class vector.

The image recognition method of claim 13, wherein the generating step further comprises
dividing the input image over time to generate a plurality of frame images and detecting, in the generated frame images, for each of the first object and the second object, a region of interest including at least part of that object,
and the action stream is generated using the detected region of interest.

The image recognition method of claim 14, wherein the generating step further comprises
calculating a connection score between the regions of interest detected in adjacent different frame images, and connecting the regions of interest detected in the different frame images in consideration of the calculated connection score,
and the action stream is generated for each of the first object and the second object using the connected regions of interest.

The image recognition method of claim 15, wherein the generating step further comprises
rearranging the regions of interest detected in the frame images on a per-frame-image basis, and calculating location information on the mutual positions of the action streams from the combination stream generated by connecting the rearranged regions of interest.

The image recognition method of claim 15, wherein the connecting step further comprises
calculating the connection score in consideration of at least one of: a similarity function that takes as input the feature information of the regions of interest detected in each of the adjacent frame images; an intersection ratio function that outputs the overlap ratio of the regions of interest detected in each of the adjacent frame images; and the similarity of the class information of the regions of interest,
and the regions of interest are connected using the calculated connection score.

The image recognition method of claim 17, wherein the connecting step,
when there is a frame image in which the region of interest is not detected,
further comprises estimating the feature information of the region of interest in that frame image based on the feature information of the regions of interest detected in the frame images adjacent to it,
and the regions of interest are connected using the estimated feature information.

The image recognition method of claim 16, wherein the step of calculating the location information further comprises:
distinguishing, for each of the first object and the second object, between the region of interest and the portion of the frame images in which no region of interest is detected; and
rearranging the divided regions of interest according to a predetermined combination method,
and the location information is calculated based on the combination stream generated by connecting the rearranged regions of interest.

A program stored in a computer-readable recording medium that, when executed by a processor, implements the image recognition method according to any one of claims 13 to 19.