KR102486083B1

KR102486083B1 - Crowded scenes image real-time analysis apparatus using dilated convolutional neural network and method thereof

Info

Publication number: KR102486083B1
Application number: KR1020200140950A
Authority: KR
Inventors: 백성욱; 이미영; 울라 아민; 칸 노만; 울 하크 이자즈
Original assignee: 세종대학교산학협력단
Priority date: 2020-10-28
Filing date: 2020-10-28
Publication date: 2023-01-09
Also published as: KR20220056399A

Abstract

The present invention relates to an apparatus and method for real-time analysis of crowd scene images. More specifically, it relates to an apparatus and method, using a dilated convolutional neural network, that train a crowd density distribution map model based on an end-to-end dilated convolutional neural network (CNN) on a training dataset containing crowd density distribution map information for the people in crowd scene images, apply a crowd scene image submitted for analysis to the trained model to generate a high-quality real-time crowd density distribution map, and count the number of people in the crowd scene image based on the generated map.

Description

Crowded scenes image real-time analysis apparatus using dilated convolutional neural network and method thereof

In general, closed-circuit television (CCTV) cameras are installed on roads, in plazas, and in key spaces inside buildings to keep citizens safe and to track and monitor offenders.

In addition, various image analysis devices that collect, store, and analyze the images obtained through such CCTV have been developed and deployed.

A typical image analysis device is used to detect people in images collected through CCTV, to analyze and estimate the behavior of the detected people, or to count a large number of people, that is, a crowd.

In particular, image analysis devices that calculate or estimate the size of a crowd, that is, the number of people, from still images and videos in which many people are gathered (hereinafter referred to as "crowd scene images") have recently been developed and applied for security purposes.

Conventional image analysis devices count a crowd using one of two approaches: a direct detection method, which counts people by detecting the whole body or a specific body part (typically the face) in the acquired image using a feature pattern for that body part, and a density distribution map method, which counts the crowd by generating a crowd density distribution map that follows the crowd distribution.

An image analysis device that counts the crowd by direct detection of the whole body has the problem that, when many people overlap in the image, the whole body of each person cannot be detected.

An image analysis device that counts the crowd by direct detection of a body part likewise cannot count accurately, because when many people overlap in the image the relevant body part often goes undetected.

FIG. 1 shows real images containing similar numbers of people and their base density distribution maps in a typical image analysis device.

Referring to FIG. 1, in an image analysis device using the density distribution map method, the first image (1) and the second image (2) in (a) and (b) contain similar numbers of people, yet the corresponding first density distribution map (3) and second density distribution map (4) are not similar at all. In other words, an image analysis device using a conventional density distribution map method cannot accurately count the number of people in a crowd.

Therefore, there is a need for an image analysis device that can accurately count the number of people in a crowd scene image in real time.

Republic of Korea Patent Registration No. 10-1467307 (announced 2014.12.01.)

Accordingly, an object of the present invention is to provide an apparatus and method for real-time analysis of crowd scene images using a dilated convolutional neural network, which train a crowd density distribution map model based on an end-to-end dilated convolutional neural network (CNN) on a training dataset containing crowd density distribution map information for the people in crowd scene images, apply a crowd scene image submitted for analysis to the model to generate a high-quality real-time crowd density distribution map, and count the number of people in the crowd scene image based on the generated map.

To achieve the above object, an apparatus for real-time analysis of crowd scene images using a dilated convolutional neural network according to the present invention comprises: a dataset generation unit that generates a dataset including a plurality of crowd scene images annotated with the number of people; a dataset preprocessing unit that generates base density distribution maps for the crowd scene images of the dataset and augments the data comprising the crowd scene images and base density distribution maps to produce a training dataset; a learning unit that trains a crowd density distribution map model on the training dataset received from the dataset preprocessing unit, the model using an end-to-end convolutional neural network that includes depthwise separable convolution layers and dilated convolution layers to learn a crowd density distribution map of the head count in a crowd scene image; a density distribution map generation unit that applies the trained crowd density distribution map model to generate and output a crowd density distribution map for a crowd scene image input in real time; and a counting unit that receives the crowd density distribution map and counts the number of people in the crowd scene image input in real time.

The dataset preprocessing unit further comprises: a base density distribution map generation unit that generates and outputs a base density distribution map by applying, to each crowd scene image of the dataset, an adaptive geometry kernel expressed by the following equation; and a data augmentation unit that increases the amount of data for the annotated crowd scene images.

[Equation]

Here, s_i denotes each target object, δ is the base density distribution map, d_i is the average distance to the k nearest neighbors, and s denotes the position of a pixel in the input image.

The data augmentation unit augments the data by dividing the crowd scene image into a fixed number of patches, each reduced by a fixed ratio relative to the size of the crowd scene image.

The crowd density distribution map model comprises: a front-end processing unit that extracts and outputs 2D features from the crowd scene images of differing resolutions in the preprocessed training dataset; and a back-end processing unit that performs dilated convolution on the 2D features output from the front-end processing unit to generate and output a crowd density distribution map.

The front-end processing unit comprises: five depthwise separable convolution layer modules, each including three depthwise separable convolution layers with filters of size 3*3, having 16, 32, 64, 128, and 256 filters in sequence and applying ReLU as the activation function; max pooling layers of size 2*2 placed after the second depthwise separable convolution layer module and after the fourth and fifth depthwise separable convolution layer modules, respectively; and a standard convolution layer module including two standard convolution layers with 512 filters of size 3*3 and applying ReLU as the activation function.

The back-end processing unit comprises: three first dilated convolution layer modules, each including two dilated convolution layers with filters of size 3*3, having 512, 256, and 128 filters in sequence; a second dilated convolution layer module including one dilated convolution layer with 64 filters of size 3*3; and a standard convolution layer module including one standard convolution layer with a single filter of size 1*1 and applying ReLU as the activation function.

The dilated convolution layers expand the receptive field by using sparse kernels.
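As an illustration of how a dilated kernel widens the receptive field without adding weights, the following minimal Python sketch computes the effective kernel size for a given dilation rate and applies a 1D dilated convolution. The function names and the 1D simplification are assumptions made for illustration; the patent itself describes 2D layers and does not give this code.

```python
def effective_kernel_size(k, rate):
    """Span covered by a k-tap kernel whose taps are `rate` samples apart."""
    return k + (k - 1) * (rate - 1)


def dilated_conv1d(signal, kernel, rate):
    """'Valid' 1D dilated convolution (cross-correlation form, as used in CNNs).

    Each output sums kernel[j] * signal[i + j * rate]; with rate == 1 this is
    an ordinary convolution, and larger rates skip samples between taps.
    """
    span = (len(kernel) - 1) * rate
    return [
        sum(kernel[j] * signal[i + j * rate] for j in range(len(kernel)))
        for i in range(len(signal) - span)
    ]


sizes = [effective_kernel_size(3, r) for r in (1, 2, 3)]
out = dilated_conv1d([1, 2, 3, 4, 5, 6, 7], [1, 1, 1], rate=2)
```

With a 3-tap kernel, rates 1, 2, and 3 cover spans of 3, 5, and 7 samples, while the number of weights stays at three.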

To achieve the above object, a method for real-time analysis of crowd scene images using a dilated convolutional neural network according to the present invention comprises: a dataset generation process in which a dataset generation unit generates a dataset including a plurality of crowd scene images annotated with the number of people; a dataset preprocessing process in which a dataset preprocessing unit generates base density distribution maps for the crowd scene images of the dataset and augments the data comprising the crowd scene images and base density distribution maps to produce a training dataset; a learning process in which a learning unit trains a crowd density distribution map model on the training dataset received from the dataset preprocessing unit, the model using an end-to-end convolutional neural network that includes depthwise separable convolution layers and dilated convolution layers to learn a crowd density distribution map of the head count in a crowd scene image; a density distribution map generation process in which a density distribution map generation unit applies the trained model to generate and output a crowd density distribution map for a crowd scene image input in real time; and a counting process in which a counting unit receives the crowd density distribution map and counts the number of people in the crowd scene image input in real time.

The dataset preprocessing process further comprises: a base density distribution map generation step in which the base density distribution map generation unit generates and outputs a base density distribution map by applying, to each crowd scene image of the dataset, an adaptive geometry kernel expressed by the following equation; and a data augmentation step in which the data augmentation unit increases the amount of data for the annotated crowd scene images.

[Equation]

In the data augmentation step, the data augmentation unit augments the data by dividing the crowd scene image into a fixed number of patches, each reduced by a fixed ratio relative to the size of the crowd scene image.

The learning process comprises: a front-end processing step in which the front-end processing unit extracts and outputs 2D features from the crowd scene images of differing resolutions in the preprocessed training dataset; and a back-end processing step in which the back-end processing unit performs dilated convolution on the 2D features output from the front-end processing unit to generate and output a crowd density distribution map.

The present invention blurs and normalizes the annotated heads of the people in the crowd scene images of the acquired dataset, applies an adaptive geometry kernel to generate base density distribution maps, and adds the generated maps to the dataset, thereby raising the quality of the dataset and improving the accuracy of counting the people in the crowd scene images.

The present invention also augments the data by cropping the crowd scene images of the dataset into overlapping patches and adding pseudo-noise, which improves the accuracy of counting the people in the crowd scene images.

The present invention also applies a front end including depthwise separable convolution layers to the crowd density distribution map model, which removes a large amount of computation from the convolution process and reduces the size of the model.

The present invention also minimizes the max and average pooling layers in the crowd density distribution map model, which prevents some spatial information of the map from being lost; because the pooling layers are minimized, deconvolution layers need not be used, which reduces complexity and execution time.

FIG. 1 shows real images containing similar numbers of people and their base density distribution maps in a typical image analysis device.
FIG. 2 shows the configuration of an apparatus for real-time analysis of crowd scene images using a dilated convolutional neural network according to the present invention.
FIG. 3 conceptually shows each component of the apparatus for real-time analysis of crowd scene images using a dilated convolutional neural network according to the present invention.
FIG. 4 illustrates the data augmentation concept of the dataset preprocessing unit of the apparatus for real-time analysis of crowd scene images using a dilated convolutional neural network according to the present invention.
FIG. 5 shows the configuration of a standard 2D convolution process layer for single and multiple filters according to the present invention.
FIG. 6 shows the configuration of a depthwise separable convolution layer according to the present invention.
FIG. 7 shows how the receptive field expands as the dilation rate increases according to the present invention.

Hereinafter, the configuration and operation of the apparatus for real-time analysis of crowd scene images using a dilated convolutional neural network according to the present invention are described with reference to the accompanying drawings, followed by the method for real-time analysis of crowd scene images performed in the apparatus.

FIG. 2 shows the configuration of the apparatus; FIG. 3 conceptually shows each of its components; FIG. 4 illustrates the data augmentation concept of the dataset preprocessing unit; FIG. 5 shows the configuration of a standard 2D convolution process layer for single and multiple filters; FIG. 6 shows the configuration of a depthwise separable convolution layer; and FIG. 7 shows how the receptive field expands as the dilation rate increases. The description below refers to FIGS. 2 to 7.

Referring first to FIG. 2, the apparatus for real-time analysis of crowd scene images using a dilated convolutional neural network according to the present invention includes a dataset generation unit 100, a crowd image acquisition unit 200, a dataset preprocessing unit 300, a learning unit 400, a density distribution map generation unit 500, and a crowd counting unit 600.

The dataset generation unit 100 collects crowd scene images, that is, images of dense crowds gathered through CCTV over a certain period or uploaded to online servers, obtains from an administrator annotations of the heads of the people in the collected images, and generates a dataset of the annotated crowd scene images. A crowd scene image may be a still image or an individual frame of a video.

Alternatively, the dataset generation unit 100 may obtain the dataset by loading a previously acquired dataset, stored in a storage means, of crowd scene images with head annotations, such as those listed in Table 1 below.

In Table 1, the crowd scene images of ShanghaiTech Part A and UCF_CC_50 contain many people and are highly congested, whereas ShanghaiTech Part B, UCSD, and WorldExpo'10 are relatively sparse and consist of surveillance-view samples (crowd scene images).

In Table 1, T_n denotes the total number of samples, and M_n and M_x denote the minimum and maximum number of people in an image, respectively.

A_v denotes the average number of people per image, R the image resolution, and T_h the total number of annotated heads; ROI is the region of interest, and its presence is indicated by Yes or No.

To improve the quality of the data in the dataset, that is, the crowd scene images, the dataset generation unit 100 may resize each frame using, for example, bilinear interpolation. For instance, the dataset generation unit 100 may resize each frame of a dense crowd video to a resolution of 900*600.
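The resizing step can be pictured with a small pure-Python sketch of bilinear interpolation on a single-channel image. It assumes an "align corners" coordinate mapping, which is only one of several common conventions; the function name `resize_bilinear` is hypothetical and the patent does not specify an implementation.

```python
def resize_bilinear(image, new_h, new_w):
    """Resize a 2D grayscale image (list of rows) with bilinear interpolation.

    Assumption: output corners map exactly onto input corners.
    """
    old_h, old_w = len(image), len(image[0])
    out = []
    for r in range(new_h):
        # Map the output row back to a fractional source row.
        src_y = r * (old_h - 1) / (new_h - 1) if new_h > 1 else 0.0
        y0 = int(src_y)
        y1 = min(y0 + 1, old_h - 1)
        wy = src_y - y0
        row = []
        for c in range(new_w):
            src_x = c * (old_w - 1) / (new_w - 1) if new_w > 1 else 0.0
            x0 = int(src_x)
            x1 = min(x0 + 1, old_w - 1)
            wx = src_x - x0
            # Blend horizontally on the two source rows, then vertically.
            top = image[y0][x0] * (1 - wx) + image[y0][x1] * wx
            bottom = image[y1][x0] * (1 - wx) + image[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


small = [[0.0, 10.0],
         [20.0, 30.0]]
resized = resize_bilinear(small, 3, 3)
```

In practice an image library would normally be used for this; the sketch only shows the arithmetic behind each interpolated pixel.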

Once the dataset has been generated or acquired, the dataset generation unit 100 provides it to the dataset preprocessing unit 300.

The crowd image acquisition unit 200 acquires crowd scene images in real time from CCTV, servers, and the like, or acquires crowd scene images to be analyzed, and either provides them directly to the density distribution map generation unit 500 or transmits them to the density distribution map generation unit 500 after data preprocessing in the dataset preprocessing unit 300.

The dataset preprocessing unit 300 includes a base density distribution map generation module 310 and a data augmentation module 320, and performs preprocessing that generates new data derived from the existing data and augments the existing data.

Specifically, the base density distribution map generation module 310 applies a geometry kernel to each crowd scene image, annotated with head positions, in the dataset received from the dataset generation unit 100, and generates a base density distribution map for the real photograph (crowd scene image), as shown at 311 in FIG. 3.

The base density distribution map generation module 310 applies a Gaussian kernel to blur every annotated head in the crowd scene image, with the kernel normalized to 1, and generates the base density distribution map taking into account the spatial distribution of all images.

The geometry-adaptive filter for applying the Gaussian kernel can be expressed as Equation 1 below.

Here, s_i denotes each target object, δ is the crowd density distribution map, d_i is the average distance to the k nearest neighbors, and s denotes the position of a pixel in the input image.

To generate the crowd density distribution map, δ(s - s_i) is convolved with a Gaussian filter whose parameter is the standard deviation σ_i.
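Equation 1 itself is not reproduced in this text. One form consistent with the symbol definitions here, and with the β and k values the description gives, is the geometry-adaptive kernel widely used to build crowd density ground truth. It is offered as an assumption rather than as the patent's verbatim equation; the names F, G, and N are introduced only for illustration.

```latex
% Assumed reconstruction, not the patent's verbatim Equation 1.
% F(s): density value at pixel position s
% N: number of annotated heads; *: convolution
% G_{\sigma_i}: Gaussian kernel with standard deviation \sigma_i
F(s) = \sum_{i=1}^{N} \delta(s - s_i) * G_{\sigma_i}(s),
\qquad \sigma_i = \beta \, d_i
```

Under this reading, each annotated head s_i contributes one unit of mass, spread by a Gaussian whose width grows with the average distance d_i to its k nearest neighbors.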

For datasets of highly congested crowd scene images, such as ShanghaiTech Part A and UCF_CC_50 in Table 1, the geometry-adaptive kernel may be used.

ShanghaiTech Part B, UCSD, and WorldExpo'10 are datasets of relatively sparse crowd scene images, to which a fixed kernel may be applied: δ=15 for ShanghaiTech Part B and δ=3 for WorldExpo'10.

Depending on the configuration used to generate the base density distribution map, β=0.3 and k=3 may be applied.

A Gaussian kernel with a fixed standard deviation (σ_i) is used to blur the head annotations and average the head size in sparse crowd scene images.
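The two kernel choices described above can be sketched in pure Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: it assumes σ_i = β·d_i for the adaptive case, treats the fixed kernel value as a standard deviation, floors σ at 1 pixel for isolated heads, and normalizes each kernel inside the image so that every head contributes exactly 1. Summing the map to obtain a count is likewise the usual convention rather than something the text spells out. All function names are hypothetical.

```python
import math


def knn_mean_distance(points, i, k=3):
    """Mean distance from points[i] to its k nearest neighbours (d_i)."""
    xi, yi = points[i]
    dists = sorted(
        math.hypot(xi - x, yi - y)
        for j, (x, y) in enumerate(points) if j != i
    )
    nearest = dists[:k]
    return sum(nearest) / len(nearest) if nearest else 0.0


def density_map(height, width, heads, beta=0.3, k=3, fixed_sigma=None):
    """Build a density map whose values sum to len(heads).

    heads: (x, y) head annotations in pixel coordinates, inside the image.
    fixed_sigma: if given, every head uses this standard deviation
    (sparse scenes); otherwise sigma_i = beta * d_i (assumed form).
    """
    dmap = [[0.0] * width for _ in range(height)]
    for i, (hx, hy) in enumerate(heads):
        if fixed_sigma is not None:
            sigma = fixed_sigma
        else:
            # Assumption: floor sigma at 1 pixel so isolated heads still blur.
            sigma = max(beta * knn_mean_distance(heads, i, k), 1.0)
        kernel = [
            [math.exp(-((x - hx) ** 2 + (y - hy) ** 2) / (2 * sigma ** 2))
             for x in range(width)]
            for y in range(height)
        ]
        total = sum(map(sum, kernel))  # normalise so each head sums to 1
        for y in range(height):
            for x in range(width):
                dmap[y][x] += kernel[y][x] / total
    return dmap


heads = [(5, 5), (8, 6), (20, 15), (22, 18)]
adaptive = density_map(24, 32, heads)
fixed = density_map(24, 32, heads, fixed_sigma=3)
estimated_count = sum(map(sum, adaptive))
```

Because each kernel is normalized, the sum over either map recovers the number of annotated heads, which is what lets a counting stage read a head count off a density map.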

The data augmentation module 320 applies data augmentation to increase the amount of crowd scene image data in the dataset and thereby improve the training efficiency of the model.

Applicable data augmentation methods include a cropping (patch division) method and a pseudo-noise insertion method.

Taking FIG. 4 as an example, the data augmentation module 320 divides the crowd scene image of (a) into four equal parts to generate the first to fourth divided crowd scene images (11, 12, 13, 14); generates a fifth divided crowd scene image (15) of the same size as the divided images, centered on the center of the crowd scene image; and generates sixth to ninth divided crowd scene images (16, 17, 18, 19) of the same size, cropped at random positions, thereby augmenting the data as shown in (b). A divided crowd scene image is also called a patch, and each patch is 1/4 the size of the original crowd scene image.

The data augmentation module 320 may also flip the generated patches horizontally and add a small amount of pseudo-noise.
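The nine-patch scheme with flipping and noise can be sketched as below. This is an illustrative reading, not the patent's code: it interprets "1/4 the size" as half the height and half the width, applies the same crop and flip to the image and its density map so the pair stays aligned, adds Gaussian noise to the image only, and uses a noise scale of 0.01 chosen arbitrarily. All names are hypothetical.

```python
import random


def patch_origins(height, width, n_random=4, rng=None):
    """Top-left corners for 4 quadrant, 1 centre, and n_random random patches."""
    if rng is None:
        rng = random.Random(0)
    ph, pw = height // 2, width // 2  # assumed: quarter area = half each side
    origins = [(0, 0), (0, width - pw), (height - ph, 0), (height - ph, width - pw)]
    origins.append(((height - ph) // 2, (width - pw) // 2))
    origins += [
        (rng.randint(0, height - ph), rng.randint(0, width - pw))
        for _ in range(n_random)
    ]
    return origins, (ph, pw)


def crop(grid, top, left, ph, pw):
    return [row[left:left + pw] for row in grid[top:top + ph]]


def hflip(grid):
    return [row[::-1] for row in grid]


def add_noise(grid, scale, rng):
    return [[v + rng.gauss(0.0, scale) for v in row] for row in grid]


def augment(image, density, rng=None):
    """Return (image patch, density patch) pairs: each patch plus a flipped, noisy copy."""
    if rng is None:
        rng = random.Random(0)
    origins, (ph, pw) = patch_origins(len(image), len(image[0]), rng=rng)
    samples = []
    for top, left in origins:
        img_p = crop(image, top, left, ph, pw)
        den_p = crop(density, top, left, ph, pw)
        samples.append((img_p, den_p))
        # Flip both so heads stay aligned; noise goes on the image only.
        samples.append((add_noise(hflip(img_p), 0.01, rng), hflip(den_p)))
    return samples


image = [[float(r * 12 + c) for c in range(12)] for r in range(8)]
density = [[0.0] * 12 for _ in range(8)]
density[2][3] = 1.0  # one annotated head
samples = augment(image, density)
```

Each source image yields nine patches and nine flipped, noisy copies, and the count for any patch is still the sum of its density patch.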

The learning unit 400 holds a crowd density distribution map model comprising a front-end processing unit 410 and a back-end processing unit 420. It trains the model on the preprocessed dataset received from the dataset preprocessing unit 300 and then provides the trained model to the density distribution map generation unit 500.

Specifically, the front-end processing unit 410 extracts and outputs 2D features from the crowd scene images of differing resolutions in the preprocessed training dataset.

상기 프런트 엔드 처리부(410)는 3개의 깊이별 분리 가능 합성곱 계층을 포함하는 5개의 깊이별 분리 가능 합성곱 계층 모듈(411, 412, 414, 415, 417), 3개의 맥스 풀링 계층(413, 416, 418) 및 두 개의 표준 합성곱 계층을 포함하는 표준 합성곱 계층 모듈(419)을 포함한다.The front-end processing unit 410 includes five separable convolution layer modules for each depth including three separable convolution layers for each depth (411, 412, 414, 415, 417), three max pooling layers (413, 416, 418) and a standard convolution layer module 419 including two standard convolution layers.

상기 5개의 깊이별 분리 가능 합성곱 계층 모듈(411, 412, 414, 415, 417)은 크기가 3*3인 필터를 구비하며, 각각 16개, 32개, 64개, 128개, 256개의 필터를 포함하여 구성된다.The five separable convolutional layer modules 411, 412, 414, 415, and 417 for each depth have filters with a size of 3*3, and 16, 32, 64, 128, and 256 filters, respectively. It is composed of.

또한, 상기 깊이별 분리 가능 합성곱 계층 모듈(411, 412, 414, 415, 417)은 활성화 함수로 ReLU 함수를 적용한다.In addition, the depth-separable convolution layer modules 411, 412, 414, 415, and 417 apply a ReLU function as an activation function.

맥스 풀링 계층(413)은 순서적으로 두 번째인 제2 깊이별 분리 가능 합성곱 계층 모듈(412)의 출력단에 연결되고, 그 크기는 2*2이다.The max pooling layer 413 is connected to the output terminal of the second separable convolution layer module 412, which is second in order, and has a size of 2*2.

The max pooling layer 416 is connected to the output of the fourth depthwise separable convolution layer module 415 and has a size of 2*2.

The max pooling layer 418 is connected to the output of the fifth depthwise separable convolution layer module 417 and has a size of 2*2.

The standard convolution layer module 419 is placed at the final stage, contains 512 filters of size 3*3, and uses ReLU as the activation function.
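
The module ordering described above can be written down as a small configuration table. The following is a minimal sketch, assuming "same" padding in every convolution and a pooling stride of 2, neither of which is stated in the text; the names `FRONT_END` and `trace_shape` are hypothetical.

```python
# Front-end layout as described: five depthwise separable modules,
# three 2*2 max pooling layers, and a final standard convolution module.
# Assumptions (not stated in the text): "same" padding, pooling stride 2.
FRONT_END = [
    ("sepconv", 411, 16),
    ("sepconv", 412, 32),
    ("maxpool", 413, None),
    ("sepconv", 414, 64),
    ("sepconv", 415, 128),
    ("maxpool", 416, None),
    ("sepconv", 417, 256),
    ("maxpool", 418, None),
    ("conv", 419, 512),
]

def trace_shape(height, width, channels=3):
    """Return the (height, width, channels) shape after each module."""
    shapes = []
    for kind, ref, filters in FRONT_END:
        if kind == "maxpool":
            height, width = height // 2, width // 2  # 2*2 window, stride 2
        else:
            channels = filters  # "same" padding keeps height and width
        shapes.append((ref, (height, width, channels)))
    return shapes

for ref, shape in trace_shape(256, 256):
    print(ref, shape)
```

Under these assumptions the three pooling layers reduce each spatial dimension by a factor of 8, so the back end would operate on feature maps at one eighth of the input resolution.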

Here, the 2D convolution is split into a depthwise convolution and a pointwise convolution, the latter being a 1*1 convolution.

The front-end processing unit 410 of the present invention includes five depthwise convolution layer modules (411, 412, 414, 415, 417) that perform depthwise convolution and one standard convolution layer module 419 that performs pointwise convolution.

In the depthwise convolution layer modules (411, 412, 414, 415, 417), a single filter is applied to each input channel as a depthwise convolution, producing a feature map for every channel.

The standard convolution layer module 419 performs a pointwise operation, that is, a 1*1 convolution, to combine the outputs of the depthwise convolution layer modules, and outputs the resulting new dataset.

The front-end processing unit 410 thus factorizes a single convolution into two operations: one convolution is used for filtering and a separate convolution is used for combining. This factorization greatly reduces the amount of computation in the convolution process and markedly reduces the model size.

A typical 2D convolution can be distinguished from a depthwise separable convolution as follows.

Referring to FIG. 5, as shown at 20, when a standard 2D convolution layer has an input of size 7*7*3 (height*width*channels) and a filter of size 3*3*3, the output of the layer has size 5*5*1.

With 64 filters, the output layer consists of 64 output maps, each of size 5*5*1.

However, as shown at 30 in FIG. 5, the layer may instead be viewed as applying 64 filters of size 3*3*3 to produce a single output of size 5*5*64. As 30 in FIG. 5 shows, the height and width decrease while the number of channels increases.
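
The sizes quoted above follow from the usual output-size rule for an unpadded convolution. A minimal sketch, assuming stride 1 and no padding (which is what the 7 to 5 reduction implies); the function name `conv_output_shape` is hypothetical.

```python
def conv_output_shape(in_h, in_w, k_h, k_w, num_filters, stride=1):
    """Output (height, width, channels) of an unpadded ("valid") convolution."""
    out_h = (in_h - k_h) // stride + 1
    out_w = (in_w - k_w) // stride + 1
    return out_h, out_w, num_filters

print(conv_output_shape(7, 7, 3, 3, num_filters=1))   # (5, 5, 1)
print(conv_output_shape(7, 7, 3, 3, num_filters=64))  # (5, 5, 64)
```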

The operation of the front-end processing unit 410, which performs the depthwise separable convolution process to achieve the above transformation, is described next.

In the first step the front-end processing unit 410 performs a depthwise convolution, and in the second step it performs a pointwise convolution.

In the first step, the depthwise convolution is applied to the input layer. Rather than applying a single filter of shape 3*3*3 as in standard 2D convolution, three filters of size 3*3*1 each are applied, as shown at 40 in FIG. 6.

Each filter is convolved with one channel of the input layer to produce a map of size 5*5*1, and these individual maps are stacked to form a map of size 5*5*3.

40 in FIG. 6 shows that the spatial dimensions of the input layer are reduced while the number of channels remains the same as before.

In the second step, as shown at 50 in FIG. 6, the pointwise convolution convolves a filter of size 1*1*3 with the 5*5*3 input, yielding a map of size 5*5*1.

Accordingly, as shown at 60 in FIG. 6, the pointwise convolution layer convolves 64 filters of size 1*1*3 with the 5*5*3 input to produce an output map of size 5*5*64.

Through these two steps (depthwise convolution and pointwise convolution), the front-end processing unit 410 transforms an input layer of size 7*7*3 into an output layer of size 5*5*64.
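
The two steps can be traced on the 7*7*3 example with plain Python lists. This is an illustrative sketch only: it assumes square inputs, stride 1, and no padding, stores data channel-first (channels, height, width) rather than in the height*width*channel order used in the text, and the function names are hypothetical.

```python
import random

def depthwise_conv(x, filters):
    """x[c][i][j]; filters[c][u][v]: one k*k filter per input channel, no padding."""
    k = len(filters[0])
    out_size = len(x[0]) - k + 1
    return [
        [[sum(x[c][i + u][j + v] * filters[c][u][v]
              for u in range(k) for v in range(k))
          for j in range(out_size)]
         for i in range(out_size)]
        for c in range(len(x))
    ]

def pointwise_conv(x, weights):
    """weights[o][c]: each output channel is a 1*1 mix of the input channels."""
    size = len(x[0])
    return [
        [[sum(w[c] * x[c][i][j] for c in range(len(x)))
          for j in range(size)]
         for i in range(size)]
        for w in weights
    ]

random.seed(0)
image = [[[random.random() for _ in range(7)] for _ in range(7)] for _ in range(3)]
dw_filters = [[[random.random() for _ in range(3)] for _ in range(3)] for _ in range(3)]
pw_weights = [[random.random() for _ in range(3)] for _ in range(64)]

step1 = depthwise_conv(image, dw_filters)  # 3 channels of 5*5
step2 = pointwise_conv(step1, pw_weights)  # 64 channels of 5*5
print(len(step1), len(step1[0]), len(step1[0][0]))
print(len(step2), len(step2[0]), len(step2[0][0]))
```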

Because depthwise separable convolution is used in this way, the computation performed by the front-end processing unit 410 is reduced, and because fewer parameters are used than in standard 2D convolution, the model size is reduced as well.
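
The saving can be made concrete for the 7*7*3 to 5*5*64 example. The sketch below counts weights and multiplications for both variants; it ignores bias terms, which is an assumption, and the function names are hypothetical.

```python
def standard_conv_cost(k, c_in, c_out, out_h, out_w):
    """Return (weights, multiplications) for a standard convolution."""
    params = k * k * c_in * c_out
    return params, params * out_h * out_w

def separable_conv_cost(k, c_in, c_out, out_h, out_w):
    """Return (weights, multiplications) for depthwise + pointwise convolution."""
    depthwise = k * k * c_in   # one k*k filter per input channel
    pointwise = c_in * c_out   # c_out filters of size 1*1*c_in
    params = depthwise + pointwise
    return params, params * out_h * out_w

std = standard_conv_cost(3, 3, 64, 5, 5)
sep = separable_conv_cost(3, 3, 64, 5, 5)
print(std)  # (1728, 43200)
print(sep)  # (219, 5475)
```

For this toy case the separable form needs roughly one eighth of the weights and multiplications of the standard form.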

The back-end processing unit 420 includes: three first dilated convolution layer modules (421, 422, 423), each containing two dilated convolution layers with filters of size 3*3 and having, in order, 512, 256, and 128 filters; a second dilated convolution layer module 424 containing one dilated convolution layer with 64 filters of size 3*3; and a standard convolution layer module 425 containing one standard convolution layer with a single filter of size 1*1 and using ReLU as the activation function.
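
The back-end layout can likewise be summarized as a table. The sketch below counts convolution weights under two assumptions that are not stated in the text: the back end receives 512 input channels (the filter count of module 419), and bias terms are ignored. The dilation rate is left out because it does not change the number of weights; `BACK_END` and `count_weights` are hypothetical names.

```python
# (module reference, number of layers, kernel size, number of filters)
BACK_END = [
    (421, 2, 3, 512),
    (422, 2, 3, 256),
    (423, 2, 3, 128),
    (424, 1, 3, 64),
    (425, 1, 1, 1),  # standard 1*1 convolution producing the density map
]

def count_weights(in_channels=512):
    """Count convolution weights (biases ignored) layer by layer."""
    total = 0
    for _ref, layers, k, filters in BACK_END:
        for _ in range(layers):
            total += k * k * in_channels * filters
            in_channels = filters
    return total, in_channels

total, out_channels = count_weights()
print(total, out_channels)
```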

The dilated convolution layer modules (421, 422, 423, 424) use dilated convolution, a technique applied to segmentation problems that achieves very high accuracy and can be used in place of pooling layers.

The dilated convolution layers (421, 422, 423, 424) can replace pooling layers and remove the need for deconvolution layers, which shortens execution time and reduces complexity.

These dilated convolution layer modules (421, 422, 423, 424) use sparse kernels that enlarge the receptive field without requiring additional parameters or operations.

For example, a plain kernel of size m*m can be enlarged to m+(m-1)(d-1), where d is the dilation rate.

By adjusting the dilation rate, the receptive field of a kernel of size 3*3 can be enlarged as shown in FIG. 7.

As shown in FIG. 7, with a dilation rate of d=1 the dilated convolution layer has a 3*3 receptive field, just like an ordinary convolution layer.

With a dilation rate of d=2 the dilated convolution layer has a 5*5 receptive field, with d=3 a 7*7 receptive field, and with d=4 a 9*9 receptive field.
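
The receptive field sizes listed above agree with the relation m+(m-1)(d-1) for a kernel of side m and dilation rate d. A short check, with the hypothetical helper name `effective_kernel_size`:

```python
def effective_kernel_size(m, d):
    """Side length covered by an m*m kernel with dilation rate d."""
    return m + (m - 1) * (d - 1)

for d in (1, 2, 3, 4):
    size = effective_kernel_size(3, d)
    print(f"d={d}: {size}*{size}")
```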

Compared with the conventional mechanism built from convolution, pooling, and deconvolution layers, the dilated convolution layer has a considerable advantage in preserving feature map resolution.

This advantage is illustrated with an example.

Consider a densely crowded scene image that is processed independently by two different techniques, with the requirement that in both cases the output have the same shape as the original input image.

The conventional mechanism consists of three main operations: downsampling, a convolution layer, and upsampling.

The conventional mechanism applies max pooling to shrink the initial input image; with a window size of 2, the image is reduced to half its original size.

After the pooling operation, a 3*3 Sobel filter is convolved with the processed image, and finally the upsampling operation applies bilinear interpolation to the result so that it is restored to approximately the size of the original image.

The 2D dilated convolution layer can be expressed by Equation 2 below.

Here, f(i, j) is a filter of height H and width B, x(h, b) is the input, g(h, b) is the output of the dilated convolution layer, and the parameter d is the dilation rate.

When d is 1, the dilated convolution layer behaves like a standard convolution layer; as d increases, the receptive field of the filter grows without any additional parameters (weights).
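
Equation 2 itself does not survive in this text. A common formulation of 2D dilated convolution that is consistent with the symbols defined above is shown here as an assumption, not as the patent's exact expression:

```latex
g(h, b) = \sum_{i=1}^{H} \sum_{j=1}^{B} x(h + d \cdot i,\; b + d \cdot j)\, f(i, j)
```

Setting d = 1 in this form gives an ordinary convolution over adjacent input positions, which matches the behaviour described for d = 1.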

The crowd density distribution map model comprising the front-end processing unit 410 and the back-end processing unit 420 described above is trained using stochastic gradient descent (SGD).

Stochastic gradient descent is used as the optimizer to train the proposed architecture at a constant learning rate.

The loss function is the Euclidean distance between the predicted density distribution and the actual crowd density distribution map. It can be expressed by Equation 3 below.

Here, F(X_i; θ) is the output of the proposed model, M is the training batch size, θ denotes the model parameters, X_i denotes the input image, and D_i^GT is the basic (ground-truth) density distribution map.
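
Equation 3 is not reproduced in this text. A Euclidean loss consistent with the symbols defined above is sketched below; the 1/(2M) normalization is an assumption based on the form commonly used for density map regression, not a detail confirmed by the source:

```latex
L(\theta) = \frac{1}{2M} \sum_{i=1}^{M} \left\lVert F(X_i; \theta) - D_i^{GT} \right\rVert_2^2
```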

The density distribution map generation unit 500 receives the trained crowd density distribution map model from the learning unit 400 and runs it. It receives the crowd scene image output by the crowd image acquisition unit 200, either directly or via the dataset pre-processing unit 300, applies the image to the model, and outputs the resulting crowd density distribution map according to the present invention to the crowd counting unit 600.

Upon receiving the crowd density distribution map, the crowd counting unit 600 counts the distribution points corresponding to people's heads in the map to obtain the number of people in the crowd.
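
In density map approaches the count is commonly obtained by summing the map, because each annotated head contributes a total mass of 1 when the kernel is normalized. The sketch below works under that assumption; the text itself only states that the points corresponding to heads are counted, and `count_people` is a hypothetical name.

```python
def count_people(density_map):
    """Estimate the head count as the sum of all density values."""
    return sum(sum(row) for row in density_map)

# Toy 4*4 map: two heads, each spread over a few pixels with total mass 1.
density = [
    [0.25, 0.25, 0.0, 0.0],
    [0.25, 0.25, 0.0, 0.0],
    [0.0,  0.0,  0.5, 0.5],
    [0.0,  0.0,  0.0, 0.0],
]
print(round(count_people(density)))
```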

In contrast, the dilated convolution layer according to the present invention applies the dilated convolution concept by convolving the same filter, a 3*3 Sobel filter, with the original input image at a dilation rate of d=2.

The output image has the same shape as the initial input image, without any pooling or upsampling operations.

In addition, the dilated convolution output retains more comprehensive information.
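
The shape-preserving behaviour can be illustrated with a small pure-Python dilated convolution. This is a sketch under stated assumptions: zero padding at the borders and a horizontal Sobel kernel, neither of which is specified in the text; `dilated_conv2d` is a hypothetical name.

```python
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]

def dilated_conv2d(image, kernel, d):
    """Zero-padded 2D dilated convolution that keeps the input height and width."""
    k = len(kernel)
    pad = d * (k // 2)  # padding needed to preserve the shape
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0
            for i in range(k):
                for j in range(k):
                    rr = r + d * i - pad
                    cc = c + d * j - pad
                    if 0 <= rr < h and 0 <= cc < w:  # outside counts as zero
                        acc += image[rr][cc] * kernel[i][j]
            out[r][c] = acc
    return out

# 6*6 image with a vertical edge between columns 2 and 3.
image = [[0, 0, 0, 9, 9, 9] for _ in range(6)]
result = dilated_conv2d(image, SOBEL_X, d=2)
print(len(result), len(result[0]))
```

With d=2 the 3*3 kernel samples every other pixel, so it responds to the edge over a 5*5 neighbourhood while the output stays 6*6.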

Meanwhile, those of ordinary skill in the art will readily understand that the present invention is not limited to the typical preferred embodiments described above and may be practiced with various improvements, modifications, substitutions, or additions without departing from the gist of the invention. Where such practice falls within the scope of the appended claims, its technical idea should likewise be regarded as belonging to the present invention.

100: dataset generation unit 200: crowd image acquisition unit
300: dataset pre-processing unit 310: basic density distribution map generation module
320: data augmentation module 400: learning unit
410: front-end processing unit 420: back-end processing unit
500: density distribution map generation unit 600: crowd counting unit

Claims

a dataset generation unit that creates a dataset including a plurality of crowd scene images annotated with the number of people in the crowd;
a dataset pre-processing unit that generates a basic density distribution map for the crowd scene images of the dataset and generates a training dataset by augmenting data including the crowd scene images and the basic density distribution maps;
a learning unit that trains a crowd density distribution map model on the training dataset input from the dataset pre-processing unit, the model learning a crowd density distribution map for the number of heads in a crowd scene image by applying an end-to-end convolutional neural network including depthwise separable convolution layers and dilated convolution layers;
a density distribution map generation unit that applies the trained crowd density distribution map model to generate and output a crowd density distribution map for a crowd scene image input in real time; and
a counting unit that receives the crowd density distribution map and counts the number of people in the crowd scene image input in real time;
The dataset pre-processing unit,
a basic density distribution map generation unit that generates and outputs a basic density distribution map by applying a geometry-adaptive kernel expressed by the following equation to each of the crowd scene images of the dataset; and
An apparatus for real-time analysis of crowd scene images using a dilated convolutional neural network, characterized by further comprising a data augmentation unit that increases the amount of data for the annotated crowd scene images.
[mathematical expression]

Here, s_i denotes each target object, δ denotes the basic density distribution map, d_i is the average distance to the k nearest neighbors, and s is the position of a pixel in the input image.

delete

According to claim 1,
The data augmentation unit,
An apparatus for real-time analysis of crowd scene images using a dilated convolutional neural network, characterized in that the data is augmented by dividing the crowd scene image into a predetermined number of patches, each reduced by a predetermined ratio relative to the size of the crowd scene image.

According to claim 1,
The crowd density distribution map model,
a front-end processing unit that extracts and outputs 2D features from the crowd scene images of differing resolutions in the preprocessed training dataset; and
An apparatus for real-time analysis of crowd scene images using a dilated convolutional neural network, characterized by comprising a back-end processing unit that generates and outputs a crowd density distribution map by performing dilated convolution on the 2D features output from the front-end processing unit.

According to claim 4,
The front-end processing unit,
five depthwise separable convolution layer modules, each including three depthwise separable convolution layers with filters of size 3*3, having in order 16, 32, 64, 128, and 256 filters, and applying ReLU as the activation function;
max pooling layers of size 2*2 placed after the second, the fourth, and the fifth of the depthwise separable convolution layer modules; and
An apparatus for real-time analysis of crowd scene images using a dilated convolutional neural network, characterized by comprising a standard convolution layer module that includes two standard convolution layers with 512 filters of size 3*3 and applies ReLU as the activation function.

According to claim 4,
The back-end processing unit,
three first dilated convolution layer modules, each including two dilated convolution layers with filters of size 3*3, having in order 512, 256, and 128 filters;
a second dilated convolution layer module including one dilated convolution layer with 64 filters of size 3*3; and
An apparatus for real-time analysis of crowd scene images using a dilated convolutional neural network, characterized by comprising a standard convolution layer module that includes one standard convolution layer with one filter of size 1*1 and applies ReLU as the activation function.

According to claim 6,
The dilated convolution layer,
An apparatus for real-time analysis of crowd scene images using a dilated convolutional neural network, characterized in that the receptive field is enlarged using sparse kernels.

a dataset creation process in which the dataset generation unit creates a dataset including a plurality of crowd scene images annotated with the number of people in the crowd;
a dataset pre-processing process in which the dataset pre-processing unit generates a basic density distribution map for the crowd scene images of the dataset and creates a training dataset by augmenting data including the crowd scene images and the basic density distribution maps;
a learning process in which the learning unit trains a crowd density distribution map model on the training dataset input from the dataset pre-processing unit, the model learning a crowd density distribution map for the number of heads in a crowd scene image by applying an end-to-end convolutional neural network including depthwise separable convolution layers and dilated convolution layers;
a density distribution map generation process in which the density distribution map generation unit applies the trained crowd density distribution map model to generate and output a crowd density distribution map for a crowd scene image input in real time; and
a counting process in which the counting unit receives the crowd density distribution map and counts the number of people in the crowd scene image input in real time;
The dataset pre-processing process,
a basic density distribution map generation step in which the basic density distribution map generation unit generates and outputs a basic density distribution map by applying a geometry-adaptive kernel expressed by the following equation to each of the crowd scene images of the dataset; and
A method for real-time analysis of crowd scene images using a dilated convolutional neural network, characterized by further comprising a data augmentation step in which the data augmentation unit increases the amount of data for the annotated crowd scene images.
[mathematical expression]

delete

According to claim 8,
A method for real-time analysis of crowd scene images using a dilated convolutional neural network, characterized in that, in the data augmentation step, the data augmentation unit augments the data by dividing the crowd scene image into a predetermined number of patches, each reduced by a predetermined ratio relative to the size of the crowd scene image.

According to claim 8,
The learning process,
a front-end processing step in which the front-end processing unit extracts and outputs 2D features from the crowd scene images of differing resolutions in the preprocessed training dataset; and
A method for real-time analysis of crowd scene images using a dilated convolutional neural network, characterized by comprising a back-end processing step in which the back-end processing unit performs dilated convolution on the 2D features output from the front-end processing unit to generate and output a crowd density distribution map.