KR20220056399A - Crowded scenes image real-time analysis apparatus using dilated convolutional neural network and method thereof

Info

Publication number: KR20220056399A
Application number: KR1020200140950A
Authority: KR
Inventors: 백성욱; 이미영; 울라 아민; 칸 노만; 울 하크 이자즈
Original assignee: 세종대학교산학협력단
Priority date: 2020-10-28
Filing date: 2020-10-28
Publication date: 2022-05-06
Also published as: KR102486083B1

Abstract

The present invention relates to an apparatus and method for analyzing a crowded scene image in real time and, more specifically, to an apparatus and method for analyzing a crowded scene image in real time using a dilated convolutional neural network (CNN), which train a crowd density distribution map model to which an end-to-end dilated CNN is applied, with training datasets including crowd density distribution map information for people included in a crowd scene image, generate a high-quality real-time crowd density distribution map by applying a crowd scene image to be analyzed to the crowd density distribution map model, and count the number of people in the crowd scene image on the basis of the generated real-time crowd density distribution map. According to the present invention, the apparatus comprises a dataset generation unit, a dataset preprocessing unit, a training unit, a density distribution map generation unit, and a count unit.

Description

Crowded scenes image real-time analysis apparatus using dilated convolutional neural network and method thereof

The present invention relates to an apparatus and method for real-time analysis of crowd scene images. More particularly, it relates to an apparatus and method for real-time analysis of crowd scene images using a dilated convolutional neural network (CNN), which train a crowd density distribution map model, to which an end-to-end dilated CNN is applied, on a training dataset containing crowd density distribution map information for the people in crowd scene images; apply a crowd scene image submitted for analysis to the model to generate a high-quality real-time crowd density distribution map; and count the number of people in the crowd scene image on the basis of the generated map.

In general, closed-circuit television (CCTV) cameras are installed on roads, in squares, and in key spaces inside buildings to keep citizens safe and to track and monitor offenders.

Various image analysis apparatuses that collect, store, and analyze the images obtained through such CCTV have also been developed and deployed.

A typical image analysis apparatus is used to detect people in images collected through CCTV, to analyze and estimate the behavior of the detected people, or to count a large number of people, that is, a crowd.

In particular, image analysis apparatuses that calculate or estimate the number of people in still images and videos showing a gathered crowd (hereinafter, "crowd scene images") have recently been developed and applied for security purposes.

Conventional image analysis apparatuses count a crowd using one of two approaches: a direct detection approach, which detects each person's whole body or a specific body part (usually the face) in the acquired image by means of a feature pattern for that body part and counts the detections; and a density distribution map approach, which generates a crowd density distribution map reflecting the crowd distribution and counts the crowd from it.

An image analysis apparatus that counts a crowd by direct detection of whole bodies has the problem that, when many people overlap in the image, detecting each person's whole body is impossible.

An image analysis apparatus that counts a crowd by direct detection of a body part likewise cannot count accurately, because the relevant part is often not detected when many people overlap in the image.

FIG. 1 shows real images containing similar numbers of people and their base (ground-truth) density distribution maps in a general image analysis apparatus.

Referring to FIG. 1, in an image analysis apparatus using the density distribution map approach, the first density distribution map (3) and the second density distribution map (4), which correspond to the first image (1) in (a) and the second image (2) in (b), are not similar but entirely different, even though the two images contain similar numbers of people. That is, an image analysis apparatus using the general density distribution map approach cannot accurately count the people in a crowd.

Accordingly, there is a need for an image analysis apparatus that can accurately count, in real time, the number of people in a crowd scene image.

Korean Registered Patent No. 10-1467307 (published 2014.12.01)

Accordingly, an object of the present invention is to provide an apparatus and method for real-time analysis of crowd scene images using a dilated convolutional neural network (CNN), which train a crowd density distribution map model, to which an end-to-end dilated CNN is applied, on a training dataset containing crowd density distribution map information for the people in crowd scene images; apply a crowd scene image submitted for analysis to the model to generate a high-quality real-time crowd density distribution map; and count the number of people in the crowd scene image on the basis of the generated map.

To achieve the above object, an apparatus for real-time analysis of crowd scene images using a dilated convolutional neural network according to the present invention comprises: a dataset generation unit that generates a dataset containing a plurality of crowd scene images annotated with the number of people; a dataset preprocessing unit that generates base density distribution maps for the crowd scene images of the dataset and augments the data comprising the crowd scene images and base density distribution maps to generate a training dataset; a training unit that trains a crowd density distribution map model by applying to it the training dataset input from the dataset preprocessing unit, the model learning crowd density distribution maps of the head count in crowd scene images through an end-to-end convolutional neural network that includes depthwise separable convolutional layers and dilated convolutional layers; a density distribution map generation unit to which the trained crowd density distribution map model is applied and which generates and outputs a crowd density distribution map for a crowd scene image input in real time; and a count unit that receives the crowd density distribution map and counts the number of people in the crowd scene image input in real time.

The dataset preprocessing unit further comprises: a base density distribution map generation unit that generates and outputs a base density distribution map by applying, to each crowd scene image of the dataset, the geometry-adaptive kernel expressed by the following equation; and a data augmentation unit that increases the amount of data for the annotated crowd scene images.

[Equation]

Here, s_i ranges over all target objects, δ denotes the base density distribution map, d_i is the mean distance to the k nearest neighbors, and s denotes the position of a pixel in the input image.

The data augmentation unit augments the data by dividing the crowd scene image into a fixed number of patches, each reduced by a fixed ratio relative to the size of the crowd scene image.

The crowd density distribution map model comprises: a front-end processing unit that extracts and outputs 2D features from the crowd scene images of differing resolutions in the preprocessed training dataset; and a back-end processing unit that generates and outputs a crowd density distribution map by performing dilated convolution on the 2D features output from the front-end processing unit.

The front-end processing unit comprises: five depthwise separable convolutional layer modules, each including three depthwise separable convolutional layers with 3×3 filters, the modules having 16, 32, 64, 128, and 256 filters in order and applying ReLU as the activation function; 2×2 max pooling layers arranged after the second depthwise separable convolutional layer module and after each of the fourth and fifth depthwise separable convolutional layer modules; and a standard convolutional layer module including two standard convolutional layers with 512 filters of size 3×3 and applying ReLU as the activation function.

The back-end processing unit comprises: three first dilated convolutional layer modules, each including two dilated convolutional layers with 3×3 filters, the modules having 512, 256, and 128 filters in order; a second dilated convolutional layer module including one dilated convolutional layer with 64 filters of size 3×3; and a standard convolutional layer module having one standard convolutional layer with a single 1×1 filter and applying ReLU as the activation function.

The dilated convolutional layers expand the receptive field by using sparse kernels.
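As a rough illustration of how dilation widens the receptive field, the sketch below computes the side length covered by a k×k kernel at dilation rate r using the standard relation k + (k − 1)(r − 1). The function name and the sample rates are assumptions made for illustration; this part of the text does not state which dilation rates the apparatus uses.

```python
def effective_kernel_size(kernel_size: int, dilation_rate: int) -> int:
    """Side length of the area covered by a dilated kernel.

    A dilation rate of r leaves (r - 1) gaps between adjacent kernel taps,
    so the number of weights stays at kernel_size**2 while coverage grows.
    """
    if kernel_size < 1 or dilation_rate < 1:
        raise ValueError("kernel_size and dilation_rate must be >= 1")
    return kernel_size + (kernel_size - 1) * (dilation_rate - 1)


# Illustrative rates only (assumed, not taken from the patent text).
for rate in (1, 2, 3):
    side = effective_kernel_size(3, rate)
    print(f"3x3 kernel, rate {rate}: covers {side}x{side} pixels with 9 weights")
```

The point of the design is that coverage grows without adding weights or pooling, so the feature map keeps its resolution.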

To achieve the above object, a method for real-time analysis of crowd scene images using a dilated convolutional neural network according to the present invention comprises: a dataset generation process in which a dataset generation unit generates a dataset containing a plurality of crowd scene images annotated with the number of people; a dataset preprocessing process in which a dataset preprocessing unit generates base density distribution maps for the crowd scene images of the dataset and augments the data comprising the crowd scene images and base density distribution maps to generate a training dataset; a training process in which a training unit trains a crowd density distribution map model by applying to it the training dataset input from the dataset preprocessing unit, the model learning crowd density distribution maps of the head count in crowd scene images through an end-to-end convolutional neural network that includes depthwise separable convolutional layers and dilated convolutional layers; a density distribution map generation process in which a density distribution map generation unit, to which the trained crowd density distribution map model is applied, generates and outputs a crowd density distribution map for a crowd scene image input in real time; and a counting process in which a count unit receives the crowd density distribution map and counts the number of people in the crowd scene image input in real time.

The dataset preprocessing process further comprises: a base density distribution map generation step in which a base density distribution map generation unit generates and outputs a base density distribution map by applying, to each crowd scene image of the dataset, the geometry-adaptive kernel expressed by the following equation; and a data augmentation step in which a data augmentation unit increases the amount of data for the annotated crowd scene images.

[Equation]

In the data augmentation step, the data augmentation unit augments the data by dividing the crowd scene image into a fixed number of patches, each reduced by a fixed ratio relative to the size of the crowd scene image.

The training process comprises: a front-end processing step in which a front-end processing unit extracts and outputs 2D features from the crowd scene images of differing resolutions in the preprocessed training dataset; and a back-end processing step in which a back-end processing unit generates and outputs a crowd density distribution map by performing dilated convolution on the 2D features output from the front-end processing unit.

The present invention blurs and normalizes the annotated heads in the crowd scene images of the acquired dataset, applies an adaptive geometry kernel to generate base density distribution maps, and adds the generated maps to the dataset. This raises the quality of the dataset and thereby improves the accuracy of counting the people in a crowd scene image.

The present invention also augments the data by splitting the crowd scene images of the dataset into overlapping patches and applying pseudo-noise, which improves the accuracy of counting the people in a crowd scene image.

In addition, by applying a front end that includes depthwise separable convolutional layers to the crowd density distribution map model, the present invention substantially reduces the amount of computation in the convolution process and also reduces the size of the model.

In addition, the present invention minimizes the max and average pooling layers in the crowd density distribution map model, which prevents some of the map's spatial information from being lost. Because minimizing the pooling layers removes the need for deconvolution layers, complexity and execution time are also reduced.

FIG. 1 shows real images containing similar numbers of people and their base density distribution maps in a general image analysis apparatus.
FIG. 2 shows the configuration of the apparatus for real-time analysis of crowd scene images using a dilated convolutional neural network according to the present invention.
FIG. 3 conceptually illustrates each component of the apparatus for real-time analysis of crowd scene images using a dilated convolutional neural network according to the present invention.
FIG. 4 illustrates the data augmentation concept of the dataset preprocessing unit of the apparatus for real-time analysis of crowd scene images using a dilated convolutional neural network according to the present invention.
FIG. 5 shows the configuration of the standard 2D convolution process layer for single and multiple filters according to the present invention.
FIG. 6 shows the configuration of the depthwise separable convolutional layer according to the present invention.
FIG. 7 shows how the receptive field expands as the dilation rate increases according to the present invention.

Hereinafter, the configuration and operation of the apparatus for real-time analysis of crowd scene images using a dilated convolutional neural network according to the present invention are described with reference to the accompanying drawings, followed by the method of real-time analysis of crowd scene images performed in the apparatus.

FIG. 2 shows the configuration of the apparatus for real-time analysis of crowd scene images using a dilated convolutional neural network according to the present invention; FIG. 3 conceptually illustrates each component of the apparatus; FIG. 4 illustrates the data augmentation concept of the dataset preprocessing unit of the apparatus; FIG. 5 shows the configuration of the standard 2D convolution process layer for single and multiple filters according to the present invention; FIG. 6 shows the configuration of the depthwise separable convolutional layer according to the present invention; and FIG. 7 shows how the receptive field expands as the dilation rate increases according to the present invention. The following description refers to FIGS. 2 to 7.

Referring first to FIG. 2, the apparatus for real-time analysis of crowd scene images using a dilated convolutional neural network according to the present invention includes a dataset generation unit (100), a crowd image acquisition unit (200), a dataset preprocessing unit (300), a training unit (400), a density distribution map generation unit (500), and a crowd counting unit (600).

The dataset generation unit (100) collects crowd scene images, that is, images of dense crowds gathered together, either captured through CCTV over a certain period or uploaded to online servers. It obtains annotations of the human heads in the collected images from an administrator and generates a dataset of the annotated crowd scene images. A crowd scene image may be a still image or an individual frame of a video.

Alternatively, the dataset generation unit (100) may load a previously acquired dataset held in storage, such as those listed in Table 1 below, that contains a plurality of crowd scene images with head annotations.

In Table 1, the crowd scene images of ShanghaiTech Part A and UCF_CC_50 are highly congested with large crowds, whereas ShanghaiTech Part B, UCSD, and WorldExpo'10 contain relatively sparse crowds and consist of surveillance-view samples (crowd scene images).

In Table 1, T_n denotes the total number of samples, and M_n and M_x denote the minimum and maximum number of people in an image, respectively.

A_v denotes the average number of people per image, R the image resolution, T_h the total number of annotated heads, and ROI the region of interest, whose presence or absence is indicated by Yes or No.

To improve the quality of the data in the dataset, that is, the crowd scene images, the dataset generation unit (100) may resize each frame using, for example, bilinear interpolation. For instance, the dataset generation unit (100) may resize each frame of a dense crowd video to a resolution of 900×600.
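The resizing step can be sketched in plain Python as follows. This is a minimal illustration of bilinear interpolation on a single-channel image stored as a list of rows, assuming a convention in which the corner pixels of the input and output coincide; the function name is hypothetical, and the patent does not say which library or sampling convention the apparatus actually uses.

```python
def resize_bilinear(image, new_height, new_width):
    """Resize a 2D grayscale image (list of rows) with bilinear interpolation."""
    old_height, old_width = len(image), len(image[0])
    # Map output indices onto the input grid so that corner pixels coincide.
    row_scale = (old_height - 1) / (new_height - 1) if new_height > 1 else 0.0
    col_scale = (old_width - 1) / (new_width - 1) if new_width > 1 else 0.0
    resized = []
    for r in range(new_height):
        src_r = r * row_scale
        r0 = int(src_r)
        r1 = min(r0 + 1, old_height - 1)
        fr = src_r - r0  # vertical blend weight
        row = []
        for c in range(new_width):
            src_c = c * col_scale
            c0 = int(src_c)
            c1 = min(c0 + 1, old_width - 1)
            fc = src_c - c0  # horizontal blend weight
            top = image[r0][c0] * (1 - fc) + image[r0][c1] * fc
            bottom = image[r1][c0] * (1 - fc) + image[r1][c1] * fc
            row.append(top * (1 - fr) + bottom * fr)
        resized.append(row)
    return resized


tiny = [[0.0, 10.0], [20.0, 30.0]]
for row in resize_bilinear(tiny, 3, 3):
    print(row)
```

A production pipeline would normally call an image library for this; the loop form is shown only to make the weighting of the four neighboring pixels explicit.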

Once the dataset is generated or acquired, the dataset generation unit (100) provides it to the dataset preprocessing unit (300).

The crowd image acquisition unit (200) acquires crowd scene images in real time from CCTV, a server, or the like, or acquires crowd scene images to be analyzed, and either provides them directly to the density distribution map generation unit (500) or sends them to the density distribution map generation unit (500) after the dataset preprocessing unit (300) has preprocessed them.

The dataset preprocessing unit (300) includes a base density distribution map generation module (310) and a data augmentation module (320), and performs preprocessing that generates new data derived from the existing data and increases the amount of existing data.

Specifically, the base density distribution map generation module (310) applies a geometry kernel to each crowd scene image with head annotations in the dataset input from the dataset generation unit (100), and generates a base density distribution map for the real photograph (crowd scene image), as shown at 311 in FIG. 3.

The base density distribution map generation module (310) applies a Gaussian kernel to blur every head in the annotated crowd scene image so that each is normalized to one, and generates the base density distribution map taking the spatial distribution of every image into account.

The geometry-adaptive filter used to apply the Gaussian kernel can be expressed as Equation 1 below.

Here, s_i ranges over all target objects, δ is the crowd density distribution map, d_i is the mean distance to the k nearest neighbors, and s denotes the position of a pixel in the input image.

To generate the crowd density distribution map, δ(s − s_i) is convolved with a Gaussian filter whose standard deviation is σ_i.

For datasets whose crowd scene images are densely populated, such as ShanghaiTech Part A and UCF_CC_50 in Table 1, the geometry-adaptive kernel may be used.

ShanghaiTech Part B, UCSD, and WorldExpo'10, on the other hand, are datasets whose crowd scene images contain relatively few people, so a fixed kernel may be applied: σ = 15 may be used for ShanghaiTech Part B and σ = 3 for WorldExpo'10.

Depending on the configuration used when generating the base density distribution map, β = 0.3 and k = 3 may be applied.
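The equation image itself did not survive in this copy of the text. Going only by the symbols defined above (s, s_i, δ, d_i, σ_i, β, k), it appears to have the form of the geometry-adaptive Gaussian kernel commonly used to build crowd density maps. The following is a reconstruction under that assumption, not the patent's own typesetting; F (the resulting map), G (the Gaussian kernel), and N (the number of annotated heads) are symbols introduced here for illustration.

```latex
F(s) = \sum_{i=1}^{N} \delta(s - s_i) * G_{\sigma_i}(s),
\qquad \sigma_i = \beta \, d_i
```

Under this reading, each annotated head s_i contributes a Gaussian whose spread grows with the mean distance d_i to its k nearest neighbors, so isolated heads are blurred more widely than heads in a dense cluster.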

A Gaussian kernel with a fixed standard deviation (σ_i) is used in sparse crowd scene images to blur the head annotations and average out head size.
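The two kernel choices described above can be sketched together. The code below is a minimal, unoptimized illustration that assumes σ_i = β · d_i for the geometry-adaptive case and a single constant σ for the fixed case; the function name, the fallback for a lone annotation, and the choice to normalize each Gaussian on the image grid are assumptions of this sketch rather than details given in the text.

```python
import math


def density_map(height, width, heads, k=3, beta=0.3, fixed_sigma=None):
    """Build a density map whose values sum to the number of annotated heads.

    heads       -- list of (row, col) head annotations inside the image
    k, beta     -- geometry-adaptive settings (sigma_i = beta * mean k-NN distance)
    fixed_sigma -- if given, use this standard deviation for every head instead
    """
    density = [[0.0] * width for _ in range(height)]
    for i, (hr, hc) in enumerate(heads):
        if fixed_sigma is not None:
            sigma = fixed_sigma
        else:
            # Mean distance from this head to its k nearest annotated neighbors.
            distances = sorted(
                math.hypot(hr - r, hc - c)
                for j, (r, c) in enumerate(heads) if j != i
            )[:k]
            # Fallback of 1.0 for a lone head is an assumption of this sketch.
            sigma = beta * sum(distances) / len(distances) if distances else 1.0
        sigma = max(sigma, 1e-3)  # guard against coincident annotations
        # Evaluate the Gaussian on the grid, then normalize so each head adds 1.
        weights = [
            [math.exp(-((r - hr) ** 2 + (c - hc) ** 2) / (2 * sigma ** 2))
             for c in range(width)]
            for r in range(height)
        ]
        total = sum(map(sum, weights))
        if total == 0.0:
            continue  # head lies too far outside the grid to contribute
        for r in range(height):
            for c in range(width):
                density[r][c] += weights[r][c] / total
    return density


heads = [(5, 5), (6, 9), (12, 14), (20, 20)]
dmap = density_map(32, 32, heads)
print(round(sum(map(sum, dmap)), 6))  # total mass equals the number of heads
```

Because every head is normalized to one, summing the map recovers the head count, which is the property a density-based counter relies on.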

The data augmentation module (320) applies a data augmentation scheme to increase the amount of crowd scene image data in the dataset and thereby improve the training efficiency of the model.

Applicable data augmentation schemes include patch splitting and pseudo-noise insertion.

Taking FIG. 4 as an example, the data augmentation module (320) divides the crowd scene image in (a) into four equal parts to generate first to fourth split crowd scene images (11, 12, 13, 14); generates a fifth split crowd scene image (15) of the same size, centered on the middle of the crowd scene image; and generates sixth to ninth split crowd scene images (16, 17, 18, 19) by cropping regions of the same size at random positions in the crowd scene image, thereby increasing the data as shown in (b). A split crowd scene image is also called a patch, and each patch is one quarter the size of the original crowd scene image.

The data augmentation module (320) may also flip the generated split crowd scene images (patches) horizontally and add a small amount of pseudo-noise.
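The nine-patch scheme and the flip and noise steps can be sketched as below. This is an illustrative outline on a single-channel image stored as a list of rows; the function names, the fixed random seed, and the Gaussian noise model are assumptions, since the text does not specify how the pseudo-noise is generated.

```python
import random


def crop(image, top, left, height, width):
    """Return a height x width patch of a 2D image (list of rows)."""
    return [row[left:left + width] for row in image[top:top + height]]


def make_patches(image, num_random=4, rng=None):
    """Nine quarter-size patches: 4 quadrants, 1 center crop, 4 random crops."""
    rng = rng or random.Random(0)
    full_h, full_w = len(image), len(image[0])
    h, w = full_h // 2, full_w // 2  # each patch covers 1/4 of the image area
    corners = [(0, 0), (0, full_w - w), (full_h - h, 0), (full_h - h, full_w - w)]
    center = [((full_h - h) // 2, (full_w - w) // 2)]
    randoms = [(rng.randint(0, full_h - h), rng.randint(0, full_w - w))
               for _ in range(num_random)]
    return [crop(image, top, left, h, w) for top, left in corners + center + randoms]


def flip_horizontal(patch):
    """Mirror a patch left to right."""
    return [row[::-1] for row in patch]


def add_noise(patch, scale=0.01, rng=None):
    """Add small Gaussian noise to every pixel (noise model assumed)."""
    rng = rng or random.Random(0)
    return [[value + rng.gauss(0.0, scale) for value in row] for row in patch]


image = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
patches = make_patches(image)
augmented = patches + [add_noise(flip_horizontal(p)) for p in patches]
print(len(patches), len(augmented))
```

In training, the same crop and flip would also have to be applied to each patch's density map so that image and target stay aligned, while noise would be added to the image only.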

The training unit (400) holds a crowd density distribution map model that includes a front-end processing unit (410) and a back-end processing unit (420). It trains the model on the preprocessed dataset input from the dataset preprocessing unit (300) and then provides the trained model to the density distribution map generation unit (500).

More specifically, the front-end processing unit (410) extracts and outputs 2D features from the crowd scene images of differing resolutions in the preprocessed training dataset.

상기 프런트 엔드 처리부(410)는 3개의 깊이별 분리 가능 합성곱 계층을 포함하는 5개의 깊이별 분리 가능 합성곱 계층 모듈(411, 412, 414, 415, 417), 3개의 맥스 풀링 계층(413, 416, 418) 및 두 개의 표준 합성곱 계층을 포함하는 표준 합성곱 계층 모듈(419)을 포함한다.The front-end processing unit 410 includes five depth-by-depth separable convolutional layer modules 411 , 412 , 414 , 415 , 417 , three max pooling layers 413 , including three depth-wise separable convolutional layers. 416, 418) and a standard convolutional layer module 419 comprising two standard convolutional layers.

The five depthwise separable convolution layer modules (411, 412, 414, 415, 417) use filters of size 3*3 and have 16, 32, 64, 128, and 256 filters, respectively.

In addition, the depthwise separable convolution layer modules (411, 412, 414, 415, 417) apply the ReLU function as their activation function.

The max pooling layer (413) is connected to the output of the second depthwise separable convolution layer module (412) and has a size of 2*2.

The max pooling layer (416) is connected to the output of the fourth depthwise separable convolution layer module (415) and has a size of 2*2.

The max pooling layer (418) is connected to the output of the fifth depthwise separable convolution layer module (417) and has a size of 2*2.

The standard convolution layer module (419) is placed at the final stage, includes 512 filters of size 3*3, and applies ReLU as its activation function.
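One consequence of the three 2*2 max pooling layers (413, 416, 418) can be sketched with simple arithmetic. The sketch assumes that each pooling layer uses a stride equal to its size and that the convolution layers preserve spatial size; neither is stated in the text, and the function name and the 768*1024 input are hypothetical.

```python
def front_end_output_size(height, width, num_pools=3, pool_size=2):
    """Spatial size after the max pooling layers.

    Assumes stride == pool_size and size-preserving convolutions (both assumptions).
    """
    for _ in range(num_pools):
        height //= pool_size
        width //= pool_size
    return height, width

# Filter counts of the five depthwise separable modules and the final standard module (419).
FRONT_END_FILTERS = [16, 32, 64, 128, 256, 512]

size = front_end_output_size(768, 1024)  # each side is reduced by a factor of 8
```

Under these assumptions, the front end would hand a feature map of 1/8 the input height and width, with 512 channels, to the back-end processing unit (420).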

The 2D convolution is split into a depthwise convolution and a pointwise convolution, the latter being a 1*1 convolution.

The front-end processing unit (410) of the present invention includes the five depthwise convolution layer modules (411, 412, 414, 415, 417), which perform depthwise convolution, and the one standard convolution layer module (419), which performs pointwise convolution.

In the depthwise convolution layer modules (411, 412, 414, 415, 417), a single filter is applied to each input channel by depthwise convolution, producing a feature map for every channel.

The standard convolution layer module (419) performs a pointwise operation, that is, a 1*1 convolution, to combine the outputs of the depthwise convolution layer modules, and generates and outputs a new set of outputs.

The depthwise separable convolution process performed by the front-end processing unit (410) splits a single convolution into two operations: one convolution is used for filtering, and a separate convolution is used for combining. This factorization greatly reduces the amount of computation in the convolution process and also significantly reduces the model size.

A typical 2D convolution can be distinguished from a depthwise separable convolution as follows.

Referring to FIG. 5, as shown at 20 in FIG. 5, when a standard 2D convolution layer has an input of size 7*7*3 (height*width*channels) and a filter of size 3*3*3, the output of its output layer has a size of 5*5*1.

With 64 filters, the output layer consists of 64 output maps, each of size 5*5*1.

However, as shown at 30 in FIG. 5, the layer may instead be configured to apply 64 filters of size 3*3*3 and produce a single output layer of size 5*5*64. As 30 in FIG. 5 shows, the height and width decrease while the number of channels increases.

The operation of the front-end processing unit (410), which performs the depthwise separable convolution process to achieve the transformation described above, is now described.

The front-end processing unit (410) performs a depthwise convolution in the first step and a pointwise convolution in the second step.

In the first step, the depthwise convolution is applied to the input layer. Rather than applying one filter of shape 3*3*3 as in a 2D convolution, three filters, each of size 3*3*1, are preferably applied, as shown at 40 in FIG. 6.

Each filter is convolved with one channel of the input layer to produce a map of size 5*5*1, and these individual maps are stacked to produce a map of shape 5*5*3.

As 40 in FIG. 6 shows, the spatial dimensions of the input layer are reduced, but the number of channels remains the same as before.

In the second step, as shown at 50 in FIG. 6, the pointwise convolution convolves a filter of size 1*1*3 with the input of size 5*5*3, yielding a map of size 5*5*1.

Accordingly, as shown at 60 in FIG. 6, the pointwise convolution layer convolves 64 filters of size 1*1*3 with the input of size 5*5*3 and generates an output map of size 5*5*64.

Through the two steps above (depthwise convolution and pointwise convolution), the front-end processing unit (410) converts an input layer of size 7*7*3 into an output layer of size 5*5*64.

Because depthwise separable convolution is used in this way, the amount of computation performed by the front-end processing unit (410) can be reduced, and because fewer parameters are used than in a standard 2D convolution, the size of the model can be reduced.
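The 7*7*3 to 5*5*64 example above can be traced numerically. The following is a minimal NumPy sketch of the two steps, written as unpadded cross-correlation without bias terms; the function names and the random inputs are illustrative assumptions, not part of the described apparatus.

```python
import numpy as np

def depthwise_conv(x, filters):
    """Unpadded depthwise convolution: filters[:, :, c] is applied to channel c only."""
    h, w, c = x.shape
    k = filters.shape[0]
    out = np.zeros((h - k + 1, w - k + 1, c))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = x[i:i + k, j:j + k, :]                    # k x k x c
            out[i, j, :] = np.sum(window * filters, axis=(0, 1))
    return out

def pointwise_conv(x, filters):
    """1*1 convolution: filters has shape (in_channels, out_channels)."""
    return x @ filters

rng = np.random.default_rng(0)
x = rng.random((7, 7, 3))                          # 7*7*3 input layer
dw = depthwise_conv(x, rng.random((3, 3, 3)))      # three 3*3*1 filters -> 5*5*3
pw = pointwise_conv(dw, rng.random((3, 64)))       # 64 filters of 1*1*3 -> 5*5*64

# Weight counts for the same 7*7*3 -> 5*5*64 transformation (biases ignored).
standard_weights = 3 * 3 * 3 * 64                  # 64 filters of size 3*3*3
separable_weights = 3 * 3 * 3 + 1 * 1 * 3 * 64     # depthwise plus pointwise
```

For this example the standard convolution needs 1728 weights and the separable version 219, which illustrates the parameter reduction described above.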

The back-end processing unit (420) includes three first dilated convolution layer modules (421, 422, 423), each comprising two dilated convolution layers with filters of size 3*3 and having, in order, 512, 256, and 128 filters; a second dilated convolution layer module (424) comprising one dilated convolution layer with 64 filters of size 3*3; and a standard convolution layer module (425) comprising one standard convolution layer with a single filter of size 1*1 and applying ReLU as its activation function.

The dilated convolution used in the dilated convolution layer modules (421, 422, 423, 424) has been applied to segmentation problems, where it achieves very high accuracy, and it can be used in place of pooling layers.

The dilated convolution layers (421, 422, 423, 424) can replace pooling layers and make deconvolution layers unnecessary, which shortens execution time and reduces complexity.

These dilated convolution layer modules (421, 422, 423, 424) use extra kernels that expand the receptive field without additional parameters or computation.

For example, a simple kernel of size m*m can be enlarged to m + (m-1)(d-1) per side, where d is the dilation rate.

By adjusting the dilation rate, the receptive field of a kernel of size 3*3 can be expanded, as shown in FIG. 7.

As shown in FIG. 7, when the dilation rate d=1, the dilated convolution layer has a 3*3 receptive field, like a normal convolution layer.

With a dilation rate of d=2, the dilated convolution layer obtains a 5*5 receptive field; with d=3, a 7*7 receptive field; and with d=4, a 9*9 receptive field.
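The receptive field sizes listed for FIG. 7 agree with an effective kernel side of m + (m-1)(d-1). A short check, with a hypothetical helper name:

```python
def effective_kernel_size(m, d):
    """Side length covered by an m*m kernel at dilation rate d."""
    return m + (m - 1) * (d - 1)

# Receptive field side of a 3*3 kernel for dilation rates 1 to 4.
sizes = {d: effective_kernel_size(3, d) for d in (1, 2, 3, 4)}
```

Here `sizes` maps d = 1, 2, 3, 4 to 3, 5, 7, 9, while the kernel keeps its nine weights in every case.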

Compared with the conventional mechanism that uses convolution, pooling, and deconvolution layers, the dilated convolution layer has a considerable advantage in preserving feature map resolution.

This advantage is illustrated with an example.

Suppose a densely crowded scene image is processed independently by two different techniques; after either technique is applied, the output must have the same shape as the original input image.

The conventional mechanism consists of three main operations: downsampling, a convolution layer, and upsampling.

The conventional mechanism applies max pooling to shrink the initial input image; with a window size of 2 for this operation, the image is reduced to half the size of the initial image.

After the pooling operation, a 3*3 Sobel filter is convolved with the processed image, and finally, in the upsampling operation, bilinear interpolation is applied to the processed image so that its size becomes comparable to that of the original image.

The 2D dilated convolution layer can be expressed by Equation 2 below.
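The image of Equation 2 did not survive in this text. A form consistent with the symbols defined for it (filter f of height H and width B, input x, output g, dilation rate d) is the usual definition of a 2D dilated convolution; the index ranges shown are an assumption:

```latex
g(h, b) = \sum_{i=1}^{H} \sum_{j=1}^{B} x(h + d \cdot i,\; b + d \cdot j)\, f(i, j)
```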

Here, f(i, j) is a filter of height H and width B, x(h, b) is the input, g(h, b) is the output of the dilated convolution layer, and the parameter d is the dilation rate.

When d is 1, the dilated convolution layer behaves like a standard convolution layer; as d increases, the receptive field of the filter grows without any additional parameters (weights).
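The behavior described for the dilation rate d can be checked with a small NumPy sketch. It is a single-channel, unpadded illustration using zero-based offsets and cross-correlation, as is common in deep learning frameworks; the function name and the test input are assumptions.

```python
import numpy as np

def dilated_conv2d(x, f, d=1):
    """Unpadded 2D dilated convolution (cross-correlation form) of x with filter f."""
    fh, fw = f.shape
    span_h = fh + (fh - 1) * (d - 1)      # effective filter height
    span_w = fw + (fw - 1) * (d - 1)      # effective filter width
    out_h = x.shape[0] - span_h + 1
    out_w = x.shape[1] - span_w + 1
    g = np.zeros((out_h, out_w))
    for i in range(fh):
        for j in range(fw):
            # Filter tap (i, j) reads the input at offset (i*d, j*d).
            g += f[i, j] * x[i * d:i * d + out_h, j * d:j * d + out_w]
    return g

x = np.arange(81, dtype=float).reshape(9, 9)
f = np.ones((3, 3))                       # nine weights regardless of d
g1 = dilated_conv2d(x, f, d=1)            # behaves like a standard 3*3 convolution
g2 = dilated_conv2d(x, f, d=2)            # same nine weights, 5*5 receptive field
```

With d=1 each output sums a contiguous 3*3 window; with d=2 it sums every other sample of a 5*5 window, so the receptive field grows while `f.size` stays at 9.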

The crowd density distribution map model described above, which includes the front-end processing unit (410) and the back-end processing unit (420), applies stochastic gradient descent (SGD).

Stochastic gradient descent is used as the optimizer to train the proposed architecture at a constant learning rate.

As the loss function, the Euclidean distance is used to measure the distance between the predicted density distribution map and the actual crowd density distribution map. The loss function can be expressed by Equation 3 below.
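The image of Equation 3 did not survive in this text. A Euclidean loss consistent with the symbols defined for it (model output F(X_i; θ), batch size M, basic density distribution map D_i^GT) would take the following form; the 1/(2M) normalization is an assumption based on common convention:

```latex
L(\theta) = \frac{1}{2M} \sum_{i=1}^{M} \left\| F(X_i; \theta) - D_i^{GT} \right\|_2^2
```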

Here, F(X_i; θ) is the output of the proposed model, M is the training batch size, θ denotes the parameters of the model, X_i denotes the input image, and D_i^GT is the basic (ground-truth) density distribution map.
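A Euclidean loss over a batch of density maps can be sketched in NumPy as below. The 1/(2M) scaling and the function name are assumptions for the example; the text only states that the Euclidean distance between predicted and actual maps is used.

```python
import numpy as np

def euclidean_loss(predicted, ground_truth):
    """Squared Euclidean distance between density maps, averaged over the batch.

    predicted, ground_truth: arrays of shape (M, height, width).
    The 1 / (2 * M) scaling is an assumed convention.
    """
    m = predicted.shape[0]
    diff = (predicted - ground_truth).reshape(m, -1)
    return float(np.sum(diff ** 2) / (2 * m))

predicted = np.zeros((2, 4, 4))            # model outputs F(X_i; theta), M = 2
ground_truth = np.ones((2, 4, 4))          # basic density distribution maps D_i^GT
loss = euclidean_loss(predicted, ground_truth)
```

In this toy case every one of the 32 pixels differs by 1, so the loss is 32 / (2 * 2) = 8.0.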

The density distribution map generation unit (500) receives the trained crowd density distribution map model from the training unit (400) and runs it. It receives the crowd scene image output by the crowd image acquisition unit (200), either directly or via the dataset preprocessing unit (300), applies the image to the crowd density distribution map model, and outputs the resulting crowd density distribution map according to the present invention to the crowd counting unit (600).

On receiving the crowd density distribution map, the crowd counting unit (600) counts the number of people in the crowd by counting the distribution points in the map that correspond to people's heads.
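One common way to obtain a count from a density map is to sum its values, which works when each annotated head contributes a total mass of one to the map. Whether the crowd counting unit (600) works this way is not stated in the text, so the sketch below is an assumption, as are the function name and the toy map.

```python
import numpy as np

def count_people(density_map):
    """Estimate the head count as the sum of the density map (assumes unit mass per head)."""
    return float(density_map.sum())

# Toy density map with three heads, each spread over a different number of pixels.
density = np.zeros((6, 6))
density[1, 1] = 1.0            # one head concentrated in a single pixel
density[2:4, 3:5] = 0.25       # one head spread over four pixels
density[4, 0:2] = 0.5          # one head spread over two pixels
count = count_people(density)
```

The three heads each contribute a mass of one, so the estimated count is 3.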

In contrast to the conventional mechanism, the dilated convolution layer according to the present invention applies the dilated convolution concept by convolving the same filter, namely a 3*3 Sobel filter, with the original input image at a dilation rate of d=2.

The output image has the same shape as the initial input image, with no pooling or upsampling operations.

In addition, the dilated convolution output contains more comprehensive information.

Meanwhile, those of ordinary skill in the art will readily understand that the present invention is not limited to the preferred embodiments described above and may be practiced with various improvements, modifications, substitutions, or additions without departing from the gist of the present invention. If an implementation made by such improvement, modification, substitution, or addition falls within the scope of the appended claims, its technical idea should also be regarded as belonging to the present invention.

100: dataset generation unit 200: crowd image acquisition unit
300: dataset preprocessing unit 310: basic density distribution map generation module
320: data augmentation module 400: training unit
410: front-end processing unit 420: back-end processing unit
500: density distribution map generation unit 600: crowd counting unit

Claims

1. A real-time crowd scene image analysis apparatus using a dilated convolutional neural network, comprising:
a dataset generation unit that generates a dataset including a plurality of crowd scene images annotated with the number of people in the crowd;
a dataset preprocessing unit that generates a basic density distribution map for each crowd scene image of the dataset and augments data including the crowd scene images and the basic density distribution maps to generate a training dataset;
a training unit that trains a crowd density distribution map model, to which an end-to-end convolutional neural network including depthwise separable convolution layers and dilated convolution layers is applied so as to learn a crowd density distribution map for the number of heads in a crowd scene image, by applying the training dataset received from the dataset preprocessing unit;
a density distribution map generation unit to which the trained crowd density distribution map model is applied and which generates and outputs a crowd density distribution map for a crowd scene image input in real time; and
a counting unit that receives the crowd density distribution map and counts the number of people in the crowd scene image input in real time.

2. The apparatus of claim 1,
wherein the dataset preprocessing unit further comprises:
a basic density distribution map generation unit that generates and outputs a basic density distribution map by applying an adaptive geometry kernel, expressed by the following equation, to each crowd scene image of the dataset; and
a data augmentation unit that increases the amount of data for the annotated crowd scene images.
[Equation]

Here, s_i denotes each target object, δ denotes the basic density distribution map, d_i denotes the average distance to the k nearest neighbors, and s denotes the position of a pixel in the input image.

3. The apparatus of claim 2,
wherein the data augmentation unit augments the data by dividing the crowd scene image into a predetermined number of patches, each reduced by a predetermined ratio relative to the size of the crowd scene image.

4. The apparatus of claim 1,
wherein the crowd density distribution map model comprises:
a front-end processing unit that extracts and outputs 2D features from crowd scene images of the preprocessed training dataset having different resolutions; and
a back-end processing unit that generates and outputs a crowd density distribution map by performing dilated convolution on the 2D features output from the front-end processing unit.

5. The apparatus of claim 4,
wherein the front-end processing unit comprises:
five depthwise separable convolution layer modules, each including three depthwise separable convolution layers with filters of size 3*3, having 16, 32, 64, 128, and 256 filters in order, and applying ReLU as the activation function;
max pooling layers of size 2*2 placed after the second, the fourth, and the fifth of the depthwise separable convolution layer modules, respectively; and
a standard convolution layer module including two standard convolution layers with 512 filters of size 3*3 and applying ReLU as the activation function.

6. The apparatus of claim 4,
wherein the back-end processing unit comprises:
three first dilated convolution layer modules, each including two dilated convolution layers with filters of size 3*3, having 512, 256, and 128 filters in order;
a second dilated convolution layer module including one dilated convolution layer with 64 filters of size 3*3; and
a standard convolution layer module including one standard convolution layer with one filter of size 1*1 and applying ReLU as the activation function.

7. The apparatus of claim 6,
wherein the dilated convolution layers expand the receptive field using extra kernels.

8. A real-time crowd scene image analysis method using a dilated convolutional neural network, comprising:
a dataset generation process in which a dataset generation unit generates a dataset including a plurality of crowd scene images annotated with the number of people in the crowd;
a dataset preprocessing process in which a dataset preprocessing unit generates a basic density distribution map for each crowd scene image of the dataset and augments data including the crowd scene images and the basic density distribution maps to generate a training dataset;
a training process in which a training unit trains a crowd density distribution map model, to which an end-to-end convolutional neural network including depthwise separable convolution layers and dilated convolution layers is applied so as to learn a crowd density distribution map for the number of heads in a crowd scene image, by applying the training dataset received from the dataset preprocessing unit;
a density distribution map generation process in which a density distribution map generation unit, to which the trained crowd density distribution map model is applied, generates and outputs a crowd density distribution map for a crowd scene image input in real time; and
a counting process in which a counting unit receives the crowd density distribution map and counts the number of people in the crowd scene image input in real time.

9. The method of claim 8,
wherein the dataset preprocessing process further comprises:
a basic density distribution map generation step in which a basic density distribution map generation unit generates and outputs a basic density distribution map by applying an adaptive geometry kernel, expressed by the following equation, to each crowd scene image of the dataset; and
a data augmentation step in which a data augmentation unit increases the amount of data for the annotated crowd scene images.
[Equation]

10. The method of claim 9,
wherein, in the data augmentation step, the data augmentation unit augments the data by dividing the crowd scene image into a predetermined number of patches, each reduced by a predetermined ratio relative to the size of the crowd scene image.

11. The method of claim 8,
wherein the training process comprises:
a front-end processing step in which the front-end processing unit extracts and outputs 2D features from crowd scene images of the preprocessed training dataset having different resolutions; and
a back-end processing step in which the back-end processing unit generates and outputs a crowd density distribution map by performing dilated convolution on the 2D features output from the front-end processing unit.