KR101983684B1

KR101983684B1 - A People Counting Method on Embedded Platform by using Convolutional Neural Network

Info

Publication number: KR101983684B1
Application number: KR1020170108015A
Authority: KR
Inventors: 안하은; 유지상
Original assignee: Kwangwoon University Industry-Academic Collaboration Foundation (광운대학교 산학협력단)
Priority date: 2017-08-25
Filing date: 2017-08-25
Publication date: 2019-05-30
Also published as: KR20190022126A

Abstract

The present invention relates to a people counting method on an embedded platform using a convolutional neural network, which counts pedestrians in a video composed of consecutive frames. The method comprises: (a) initializing and generating a background model using at least two consecutive previous frames; (b) receiving the current frame of the video; (c) updating the upper line and lower line of the band of the background model, using the deviation of the pixels of the background model; (d) extracting a pedestrian candidate region map using the background model and the upper and lower lines of the band, wherein, for each pixel of the current frame, pixels that fall outside the updated upper and lower band lines of the background model are extracted as the pedestrian candidate region map; (e) updating the background model using the current frame of the video, wherein the background model of the previous frame is kept unchanged in the pedestrian candidate region and only the remaining region is updated; (f) inputting the pedestrian candidate region map into a convolutional neural network (CNN) pedestrian classifier for classification; and (g) counting pedestrians according to the classification result. With this configuration, the performance of the system is robust to installation conditions such as camera shooting distance and angle, and the method can run in real time on currently commercially available embedded boards.

Description

A People Counting Method on Embedded Platform by using Convolutional Neural Network

The present invention relates to a people counting method that operates in real time in an embedded environment, and more particularly to a people counting method on an embedded platform using a convolutional neural network that generates a background image reflecting the brightness variation characteristics of the video, without heavy training or parameter tuning.

The key requirements of a people counting method are fast running time, accuracy, and an unconstrained camera installation environment. In general, a background model generation method is used to classify pedestrians quickly. The brightness values of image pixels exhibit varying characteristics depending on the light source at the shooting location and the type of camera lens used.

The present invention also relates to a people counting method on an embedded platform using a convolutional neural network that proposes a CNN (convolutional neural network)-based pedestrian classification model whose input is a set of pedestrian candidates more reliable than those produced by existing region proposal methods.

People counting is a video analysis technique that determines how many people pass a specific point, such as a store entrance, in a video. It is one of the component technologies used in video-based surveillance systems and site management, and it is used actively in marketing, for example to track customer movement paths in a store or the number of visitors at a given time.

People counting methods are deployed in a variety of installation environments that differ in camera angle, distance, and so on, and they are named mainly according to the installation angle. When the camera is mounted on the ceiling, the method is called the top view method; when the camera is mounted at an oblique angle to capture pedestrians, it is called the bird's eye view method.

The top view method has a relatively restricted installation environment because the camera must be mounted on the ceiling of a building. Because the camera looks down from above, the captured image is mostly limited to pedestrians' heads and shoulders. In addition, because the distance between the camera and the pedestrian is relatively short, semantic information is easy to extract from the image. Methods that use semantic information are simple to implement and well suited to real-time processing, so techniques based on semantic information have been studied actively for top view people counting. Because these top view people counting systems acquire images under constrained conditions, they suffer little loss of recognition rate from external lighting, occluded regions, or overlapping pedestrians. For this reason, they achieve relatively satisfactory recognition rates using only human semantic information, without building a classification model through machine learning.

In contrast, the bird's eye view method captures video the way a bird flying in the sky would see the ground. Because pedestrian images with many variants are captured as the shooting angle and distance change, solving the people counting problem requires high-level algorithms such as machine learning. Bird's eye view people counting has traditionally been studied as a classification task.

This includes methods that extract representative image features such as HOG (histogram of oriented gradients) [Non-Patent Document 1] and ICF (integral channel features) [Non-Patent Document 2], and then build an object classification model using an SVM (support vector machine) or a cascade learner. Bird's eye view people counting remains an active research area, and recent studies use CNN (convolutional neural network)-based object recognition models [Non-Patent Documents 3, 4], which perform well on object detection tasks. [Non-Patent Document 5] proposed a method that performs people counting using R-CNN [Non-Patent Document 6].

Deep learning methods, which build artificial neural networks and train image recognition models on big data, have shown successful results in many areas of image analysis. [Non-Patent Documents 6-8] proposed methods that perform the object detection task, predicting the class and location of objects in an image, using a CNN model together with a region proposal method. Typically 2000 proposal regions are extracted from a single image, mainly with methods such as edge boxes [Non-Patent Document 9] or selective search [Non-Patent Document 10].

[Non-Patent Document 1] Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on (Vol. 1, pp. 886-893). IEEE.
[Non-Patent Document 2] Dollár, P., Tu, Z., Perona, P., & Belongie, S. (2009). Integral channel features. In BMVC (pp. 91-1).
[Non-Patent Document 3] Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[Non-Patent Document 4] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ... & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1-9).
[Non-Patent Document 5] Gao, C., Li, P., Zhang, Y., Liu, J., & Wang, L. (2016). People counting based on head detection combining Adaboost and CNN in crowded surveillance environment. Neurocomputing, 208, 108-116.
[Non-Patent Document 6] Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 580-587).
[Non-Patent Document 7] Girshick, R. (2015). Fast r-cnn. In Proceedings of the IEEE international conference on computer vision (pp. 1440-1448).
[Non-Patent Document 8] Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems (pp. 91-99).
[Non-Patent Document 9] Zitnick, C. L., & Dollár, P. (2014, September). Edge boxes: Locating object proposals from edges. In European Conference on Computer Vision (pp. 391-405). Springer, Cham.
[Non-Patent Document 10] Uijlings, J. R., Van De Sande, K. E., Gevers, T., & Smeulders, A. W. (2013). Selective search for object recognition. International journal of computer vision, 104(2), 154-171.
[Non-Patent Document 11] Li, G., Ren, P., Lyu, X., & Zhang, H. (2016, December). Real-Time Top-View People Counting Based on a Kinect and NVIDIA Jetson TK1 Integrated Platform. In Data Mining Workshops (ICDMW), 2016 IEEE 16th International Conference on (pp. 468-473). IEEE.
[Non-Patent Document 12] García, J., Gardel, A., Bravo, I., Lázaro, J. L., Martínez, M., & Rodríguez, D. (2013). Directional people counter based on head tracking. IEEE Transactions on Industrial Electronics, 60(9), 3991-4000.
[Non-Patent Document 13] Hsieh, J. W., Peng, C. S., & Fan, K. C. (2007, October). Grid-based template matching for people counting. In Multimedia Signal Processing, 2007. MMSP 2007. IEEE 9th Workshop on (pp. 316-319). IEEE.
[Non-Patent Document 14] Chen, T. H., Chen, T. Y., & Chen, Z. X. (2006, June). An intelligent people-flow counting method for passing through a gate. In Robotics, Automation and Mechatronics, 2006 IEEE Conference on (pp. 1-6). IEEE.
[Non-Patent Document 15] Ma, H., Lu, H., & Zhang, M. (2008, June). A real-time effective system for tracking passing people using a single camera. In Intelligent Control and Automation, 2008. WCICA 2008. 7th World Congress on (pp. 6173-6177). IEEE.
[Non-Patent Document 16] Hou, Y. L., & Pang, G. K. (2011). People counting and human detection in a challenging situation. IEEE transactions on systems, man, and cybernetics-part a: systems and humans, 41(1), 24-33.
[Non-Patent Document 17] Chan, A. B., Liang, Z. S. J., & Vasconcelos, N. (2008, June). Privacy preserving crowd monitoring: Counting people without people models or tracking. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on (pp. 1-7). IEEE.
[Non-Patent Document 18] Zeng, C., & Ma, H. (2010, August). Robust head-shoulder detection by pca-based multilevel hog-lbp detector for people counting. In Pattern Recognition (ICPR), 2010 20th International Conference on (pp. 2069-2072). IEEE.
[Non-Patent Document 19] Subburaman, V. B., Descamps, A., & Carincotte, C. (2012, September). Counting people in the crowd using a generic head detector. In Advanced Video and Signal-Based Surveillance (AVSS), 2012 IEEE Ninth International Conference on (pp. 470-475). IEEE.
[Non-Patent Document 20] Topkaya, I. S., Erdogan, H., & Porikli, F. (2014, August). Counting people by clustering person detector outputs. In Advanced Video and Signal Based Surveillance (AVSS), 2014 11th IEEE International Conference on (pp. 313-318). IEEE.
[Non-Patent Document 21] Gao, L., Wang, Y., & Wang, J. (2016, October). People counting with block histogram features and network flow constraints. In Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), International Congress on (pp. 515-520). IEEE.
[Non-Patent Document 22] Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097-1105).
[Non-Patent Document 23] Lin, M., Chen, Q., & Yan, S. (2013). Network in network. arXiv preprint arXiv:1312.4400.
[Non-Patent Document 24] Felzenszwalb, P. F., Girshick, R. B., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part-based models. IEEE transactions on pattern analysis and machine intelligence, 32(9), 1627-1645.
[Non-Patent Document 25] Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929-1958.
[Non-Patent Document 26] C920. Available online: http://www.logitech.com/en-roeu/product/hd-pro-webcam-c920 (accessed on 19 July 2017).
[Non-Patent Document 27] NVIDIA Jetson TX2. Available online: http://www.nvidia.com/object/embedded-systems-dev-kits-modules.html (accessed on 19 July 2017).

An object of the present invention is to solve the problems described above by providing a people counting method that operates in real time in an embedded environment, specifically a people counting method on an embedded platform using a convolutional neural network that generates a background image reflecting the brightness variation characteristics of the video, without heavy training or parameter tuning.

In particular, the present invention provides a people counting method on an embedded platform using a convolutional neural network that generates pedestrian candidate regions using a background model and performs CNN-based pedestrian detection with those regions as input.

The present invention further provides a people counting method on an embedded platform using a convolutional neural network that, instead of an existing region proposal method, generates highly reliable pedestrian candidates through a background model and uses a CNN (convolutional neural network)-based pedestrian classification model that takes them as input, with an optimized CNN structure.

To achieve the above object, the present invention relates to a people counting method on an embedded platform using a convolutional neural network, which counts pedestrians in a video composed of consecutive frames, the method comprising: (a) initializing and generating a background model using at least two consecutive previous frames; (b) receiving the current frame of the video; (c) updating the upper line and lower line of the band of the background model, using the deviation of the pixels of the background model; (d) extracting a pedestrian candidate region map using the background model and the upper and lower lines of the band, wherein, for each pixel of the current frame, pixels that fall outside the updated upper and lower band lines of the background model are extracted as the pedestrian candidate region map; (e) updating the background model using the current frame of the video, wherein the background model of the previous frame is kept unchanged in the pedestrian candidate region and only the remaining region is updated; (f) inputting the pedestrian candidate region map into a convolutional neural network (CNN) pedestrian classifier for classification; and (g) counting pedestrians according to the classification result.

Further, in the people counting method on an embedded platform using a convolutional neural network according to the present invention, in step (a), the background model is initialized by averaging N background frames (N being a natural number of 2 or more), and in step (c), the background model is updated by averaging N frames, counting back from and including the current frame.

Further, in the people counting method on an embedded platform using a convolutional neural network according to the present invention, in step (c), the background model is updated according to Equation 1 below.

[Equation 1]

Here, B_m(i,j) denotes pixel (i,j) of the background model for the m-th frame, which is the current frame; I_k(i,j) is the pixel value of pixel (i,j) in the k-th frame; F_m(i,j) is the pedestrian candidate region map indicating whether pixel (i,j) of the m-th frame belongs to a pedestrian candidate region; and N is a natural number of 2 or more.
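
The body of Equation 1 does not appear in this text. As a minimal sketch of the update that the surrounding definitions describe, the following assumes that each background pixel is the mean of the N most recent frames where F_m(i,j) is 0, and is carried over from the previous background model where F_m(i,j) is 1. The function names, the list-of-lists frame layout, and the 0/1 encoding of F_m are illustrative assumptions, not taken from the patent.

```python
from collections import deque


def init_background(frames):
    """Step (a): average N >= 2 frames pixel by pixel."""
    n = len(frames)
    if n < 2:
        raise ValueError("need at least two frames")
    rows, cols = len(frames[0]), len(frames[0][0])
    return [[sum(f[i][j] for f in frames) / n for j in range(cols)]
            for i in range(rows)]


def update_background(prev_bg, recent_frames, candidate_map):
    """ASSUMED form of Equation 1 (the patent text omits the formula).

    prev_bg:        B_{m-1}, background model of the previous frame
    recent_frames:  the N most recent frames, ending with the current one
    candidate_map:  F_m, 1 where the pixel is a pedestrian candidate
    """
    mean_bg = init_background(list(recent_frames))
    return [[prev_bg[i][j] if candidate_map[i][j] else mean_bg[i][j]
             for j in range(len(prev_bg[0]))]
            for i in range(len(prev_bg))]


# Tiny 1x3 example with N = 2.
window = deque([[[10, 10, 10]], [[12, 10, 10]]], maxlen=2)
background = init_background(list(window))    # [[11.0, 10.0, 10.0]]

window.append([[14, 200, 10]])                # middle pixel covered by a pedestrian
mask = [[0, 1, 0]]
background = update_background(background, window, mask)
print(background)                             # [[13.0, 10.0, 10.0]]
```

The middle pixel keeps its previous value of 10.0 because the mask marks it as a candidate, so the bright foreground value 200 never leaks into the background model.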

Further, in the people counting method on an embedded platform using a convolutional neural network according to the present invention, in step (d), the upper line U_band and the lower line L_band of the background model are updated according to Equation 2 below.

[Equation 2]

Here, α is a weight, a constant determined in advance.
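
The body of Equation 2 does not appear in this text, so its exact form is unknown. One plausible reading, offered purely as an assumption, is a band centred on the background model whose half-width is α times a per-pixel deviation. In the sketch below the deviation is taken to be the mean absolute difference between recent frames and the background model; that choice, the function name `update_band`, and the frame layout are all hypothetical.

```python
def update_band(background, recent_frames, alpha):
    """Return (u_band, l_band) under the ASSUMED rule B +/- alpha * deviation.

    The deviation used here (mean absolute difference between the recent
    frames and the background model) is an illustrative choice; the patent
    text does not define it.
    """
    n = len(recent_frames)
    rows, cols = len(background), len(background[0])
    u_band = [[0.0] * cols for _ in range(rows)]
    l_band = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            b = background[i][j]
            dev = sum(abs(f[i][j] - b) for f in recent_frames) / n
            u_band[i][j] = b + alpha * dev
            l_band[i][j] = b - alpha * dev
    return u_band, l_band


frames = [[[98, 50]], [[102, 50]]]
background = [[100.0, 50.0]]
upper, lower = update_band(background, frames, alpha=2.0)
print(upper)   # [[104.0, 50.0]]
print(lower)   # [[96.0, 50.0]]
```

A perfectly flat pixel gets a zero-width band in this sketch, so a practical version would probably enforce a minimum band width to tolerate sensor noise.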

Further, in the people counting method on an embedded platform using a convolutional neural network according to the present invention, in step (e), the pedestrian candidate region map is obtained according to Equation 3 below.

[Equation 3]

Here, F_m(i,j) denotes the pixel value at pixel (i,j) of the pedestrian candidate region map, and I_m(i,j) is the pixel value of pixel (i,j) in the current frame m.
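
The body of Equation 3 does not appear in this text, but the method steps state that pixels of the current frame falling outside the upper and lower band lines form the pedestrian candidate region map. A minimal sketch of that thresholding follows; the 0/1 encoding and the treatment of values exactly on a band line as background are assumptions.

```python
def candidate_map(frame, u_band, l_band):
    """F_m(i,j) = 1 if I_m(i,j) lies outside [l_band, u_band], else 0."""
    return [[1 if (p > u or p < l) else 0
             for p, u, l in zip(row, u_row, l_row)]
            for row, u_row, l_row in zip(frame, u_band, l_band)]


frame = [[100, 180, 20, 96]]
upper = [[104, 104, 104, 104]]
lower = [[96, 96, 96, 96]]
result = candidate_map(frame, upper, lower)
print(result)   # [[0, 1, 1, 0]]
```

Both a much brighter pixel (180) and a much darker pixel (20) are flagged, which is the point of using a two-sided band rather than a single threshold.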

Further, in the people counting method on an embedded platform using a convolutional neural network according to the present invention, the CNN pedestrian classifier has inception module layers added, and the inception module is configured with an NIN (network in network) structure.
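
In GoogLeNet-style inception modules, the network-in-network idea shows up as 1×1 convolutions, which mix channels independently at each pixel and are commonly used to reduce the channel count before larger convolutions. The sketch below illustrates only that building block, not the classifier of this patent; the weights, shapes, and the function name `conv1x1` are made up for illustration.

```python
def conv1x1(feature_map, weights, biases):
    """Apply a 1x1 convolution followed by ReLU.

    feature_map: H x W x C_in nested lists
    weights:     C_out x C_in
    biases:      C_out
    """
    return [[[max(0.0, sum(w * x for w, x in zip(w_row, pixel)) + b)
              for w_row, b in zip(weights, biases)]
             for pixel in row]
            for row in feature_map]


# A 1 x 2 feature map with 4 channels, reduced to 2 channels.
fmap = [[[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]]
w = [[0.25, 0.25, 0.25, 0.25],    # channel average
     [1.0, -1.0, 0.0, 0.0]]       # difference of the first two channels
b = [0.0, 0.0]
reduced = conv1x1(fmap, w, b)
print(reduced)   # [[[2.5, 0.0], [0.5, 0.0]]]
```

Halving the channel count this way reduces the cost of any 3×3 or 5×5 convolution that follows, which matters on an embedded board.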

Further, in the people counting method on an embedded platform using a convolutional neural network according to the present invention, when the CNN pedestrian classifier is trained, translation and rotation operations are applied to generate multiple new training images from a given training image, and these are used for training.

Further, in the people counting method on an embedded platform using a convolutional neural network according to the present invention, multiple new training images are generated from the given training image (x,y) according to Equation 4 below, by varying the rotation parameters and the translation parameters.

[Equation 4]

Here, (x,y) denotes the given training image, (x',y') denotes the newly generated training image, a and b are rotation parameters, and c and d are translation parameters.
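
The body of Equation 4 does not appear in this text. A common form consistent with the parameter description is x' = a·x − b·y + c and y' = b·x + a·y + d with a = cos θ and b = sin θ; the sketch below assumes exactly that, and the angle and shift values are arbitrary examples. It transforms pixel coordinates only; a full augmentation step would also resample the image at the new coordinates.

```python
import math


def transform_point(x, y, a, b, c, d):
    """ASSUMED form of Equation 4: rotate by (a, b), then translate by (c, d)."""
    return a * x - b * y + c, b * x + a * y + d


def augmentation_params(angles_deg, shifts):
    """Yield (a, b, c, d) for every combination of rotation angle and shift."""
    for angle in angles_deg:
        theta = math.radians(angle)
        for c, d in shifts:
            yield math.cos(theta), math.sin(theta), c, d


params = list(augmentation_params([-10, 0, 10], [(0, 0), (2, 0), (0, 2)]))
print(len(params))   # 9 parameter sets, so 9 variants of one source image

# A 90-degree rotation plus a shift of (5, 0) maps (1, 0) to roughly (5, 1).
a, b = math.cos(math.pi / 2), math.sin(math.pi / 2)
x_new, y_new = transform_point(1, 0, a, b, 5, 0)
print(round(x_new, 6), round(y_new, 6))
```

Enumerating the cross product of angles and shifts is one simple way to vary the rotation and translation parameters; random sampling would serve equally well.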

Further, in the people counting method on an embedded platform using a convolutional neural network according to the present invention, a pruning operation is performed on the inception module.

The present invention also relates to a computer-readable recording medium storing a program that performs the people counting method on an embedded platform using a convolutional neural network.

As described above, the people counting method on an embedded platform using a convolutional neural network according to the present invention has the effect that the performance of the system is robust to installation conditions such as camera shooting distance and angle, and that the method can run in real time on currently commercially available embedded boards.

In particular, experiments demonstrated that the people counting method according to the present invention runs in real time on NVIDIA's TX1 and TX2 embedded boards and is robust to installation conditions such as camera shooting distance and angle.

FIG. 1 is a diagram showing the configuration of the overall system for carrying out the present invention.
FIG. 2 is a table comparing people counting methods.
FIG. 3 is a flowchart explaining a people counting method on an embedded platform using a convolutional neural network according to an embodiment of the present invention.
FIG. 4 is a flowchart showing the structure of the people counting method according to an embodiment of the present invention.
FIG. 5 is a graph showing how image brightness values vary with the lighting environment, according to an embodiment of the present invention.
FIG. 6 is a graph showing the upper line (U_band) and lower line (L_band) of the brightness variation band of an image, according to an embodiment of the present invention.
FIG. 7 is a table showing the structure of the GoogleNet convolutional neural network used in the present invention.
FIG. 8 is a diagram showing the inception module of GoogleNet used in the present invention.
FIG. 9 is a table showing the structure of the CNN for pedestrian classification according to an embodiment of the present invention.
FIG. 10 is a diagram showing an image and the specifications of the C920 camera used in the experiments of the present invention.
FIG. 11 is a table showing the specifications of the NVIDIA Jetson TX2 used in the experiments of the present invention.
FIG. 12 is a diagram showing the installation environment of the people counting method in the experiments of the present invention.
FIG. 13 shows pedestrian images acquired in various environments in the experiments of the present invention.
FIG. 14 shows pedestrian movement scenario images from the experiments of the present invention: (a) carrying a bag, (b) wearing a hat, (c) holding a box, (d) walking with arms around each other's shoulders, (e) walking arm in arm, (f) running, (g) crossing paths, and (h) moving as a crowd.
FIG. 15 is a table showing the people counting accuracy on the experimental data in the experiments of the present invention.
FIG. 16 is a table showing the people counting confusion matrix on the experimental data in the experiments of the present invention.
FIG. 17 is a graph of people counting errors for each pedestrian movement scenario in the experiments of the present invention.

Hereinafter, specific details for carrying out the present invention will be described with reference to the drawings.

In describing the present invention, identical parts are given identical reference numerals, and repeated description of them is omitted.

First, examples of the overall system configuration for carrying out the present invention will be described with reference to FIG. 1.

As shown in FIG. 1, the people counting method on an embedded platform using a convolutional neural network according to the present invention may be implemented as a program system on a computer terminal 20 that receives a video (or image) 10 capturing pedestrians, recognizes the pedestrians in the video (or image), and counts them. That is, the people counting method may be configured as a program and installed and executed on the computer terminal 20. The program installed on the computer terminal 20 may operate as a single program system 30.

As another embodiment, besides being configured as a program running on a general-purpose computer, the people counting method according to the present invention may be implemented as a single electronic circuit such as an ASIC (application-specific integrated circuit). Alternatively, it may be developed as a dedicated computer terminal 20 that handles only the task of recognizing pedestrians in video and counting them. This is referred to as a people counting device or people counting system 30. Other forms are also possible.

The video 10 consists of frames that are consecutive in time. Each frame holds one image. The video 10 may also have only a single frame (or image); that is, the video 10 may be a single image.

Next, the people counting methods relevant to the present invention will be described in more detail with reference to FIG. 2.

People counting methods take a variety of algorithmic approaches depending on the installation angle and distance of the camera, and are broadly divided into methods that use semantic information and methods that use machine learning.

In the top-view approach, the camera must be installed on the ceiling of a building, so the installation environment is relatively restricted. For this reason, methods that use semantic information [Non-Patent Documents 11-16] are mainly used, for ease of implementation and real-time operation.

The bird's-eye-view approach has many variants depending on the shooting angle and shooting distance. Because the image of the pedestrian is captured, the mainstream of research on the people counting problem has been methods that learn handcrafted features to build a pedestrian recognition model [Non-Patent Documents 17-21]. FIG. 2 compares several methods of performing people counting.

Methods that use semantic information perform people counting based on depth maps, template matching, histogram analysis, and the like. [Non-Patent Document 15] solves the problem by analyzing the horizontal and vertical histograms of the pedestrian candidate area. [Non-Patent Documents 12, 13] use template matching after preprocessing such as a distance transform or edge extraction. [Non-Patent Document 11] uses depth information to find people's heads. Because depth information is hard to acquire at relatively long distances, installing such a people counting system is difficult.

Machine learning techniques include methods that use hand-crafted features and methods that use a convolutional neural network. Some use well-known hand-crafted features such as HOG (histogram of oriented gradients), color features, and Canny edges, and methods that use linear features such as area, perimeter, and perimeter edge [Non-Patent Document 17] have also been studied. More recently, studies have used the CNN (convolutional neural network) architecture, which performs well in object detection. [Non-Patent Document 5] proposed a method of performing people counting using a region-based CNN [Non-Patent Document 6].

Meanwhile, a background model is mainly used to generate pedestrian candidate regions in the input image. A pedestrian candidate region is expressed as an ROI (region of interest). Instead of processing every region of the input image, the pedestrian classifier processes only the ROIs, which removes redundancy from the inference pipeline.
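As a minimal sketch of how an ROI could be cut out for the classifier, the helper below finds the bounding box of the marked pixels in a binary candidate mask. The function name `bounding_box`, the list-of-lists mask, and the choice of one box around all marked pixels are assumptions made for illustration; the source does not specify how ROIs are delimited.

```python
def bounding_box(mask):
    """Return (left, top, right, bottom) of the nonzero pixels, or None.

    `mask` is a list of rows; nonzero entries mark candidate pixels.
    A single box around all candidates is an illustrative simplification.
    """
    rows = [j for j, row in enumerate(mask) if any(row)]
    if not rows:
        return None  # no candidate pixel in this frame
    cols = [i for i in range(len(mask[0])) if any(row[i] for row in mask)]
    return cols[0], rows[0], cols[-1], rows[-1]


mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
roi = bounding_box(mask)
```

Here only the 3×2 box inside the 5×4 frame would need to be passed on, which is the kind of saving the paragraph above refers to.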

Methods of generating a background model include using a Gaussian mixture model and thresholding the brightness values of the image. With a Gaussian mixture model, the background model must be retrained whenever the type of input image changes. Thresholding the brightness values depends heavily on hyper-parameters chosen by the user. Because the appropriate threshold changes with the brightness variation characteristics of the input image, finding it often relies on experimentation.

The present invention generates the background model by observing the variation characteristics of the image's inherent brightness values. The background model is updated locally and pedestrian candidate areas are generated. The invention also presents a CNN (convolutional neural network) based pedestrian classifier structure that takes the pedestrian candidate areas as input.

Next, a people counting method on an embedded platform using a convolutional neural network according to an embodiment of the present invention will be described with reference to FIGS. 3 and 4. FIG. 3 shows the overall flowchart of the people counting method according to the present invention, and FIG. 4 shows its structure.

The people counting method according to the present invention consists of a preprocessing part, which builds a background model and generates pedestrian candidate regions, and a pedestrian classification part, which performs the people counting. The preprocessing part generates and updates the background model and detects pedestrian candidate areas. The pedestrian classification part takes the detected candidate areas as input, classifies pedestrians, and finally performs people counting based on the result.

As shown in FIG. 3, the people counting method on an embedded platform using a convolutional neural network according to the present invention consists of generating an initial background model (S10), receiving a frame of the video (S20), updating the upper and lower lines of the band (S30), extracting a pedestrian candidate area map (S40), updating the background model (S50), feeding the map to the CNN classifier (S60), and counting the number of pedestrians according to the classification result (S70). Steps S20 to S70 are repeated until the last frame.

Before describing each step, the overall process of generating the background model is described first. Background model generation is a classical image processing technique used mainly to find moving objects in video. Methods that generate candidate regions for moving pedestrians from the difference image between the input frame and the background model [Non-Patent Documents 12, 13, 15, 17, 22] have been widely studied. Methods of generating a background model are divided into those that use a GMM (Gaussian mixture model) and those that threshold the brightness values of image pixels. The GMM approach is rarely used in real systems because the background model must be retrained when the type of input image changes. In the thresholding approach, the key is finding a suitable threshold that separates background from objects. Existing methods analyzed the brightness variation characteristics of the image heuristically to obtain good performance.

The brightness variation characteristics of a video differ according to the light source at the place where it was acquired and the type of camera lens used. For example, the brightness variation of video acquired under artificial lighting alone is very different from that of video acquired under natural and artificial light together.

FIG. 5 shows the variation of brightness values for different videos. FIG. 5(a) shows the variation characteristics of video acquired where artificial lighting and natural light are both present. FIG. 5(b) is for video acquired under dim lighting, and FIG. 5(c) shows the brightness variation characteristics of video acquired under artificial lighting. The present invention generates the background model by observing these brightness variation characteristics.

Assuming that the brightness variation of the background image follows a standard normal distribution, a brightness band is formed that follows the changes in background brightness with equal width above and below. In FIG. 6, the solid blue line is the brightness variation of FIG. 5(a) and the solid red line is that of FIG. 5(b). As shown by the dotted lines in FIG. 6, the band edge above each brightness value is defined as the upper line (U_band) and the edge below it as the lower line (L_band). Pixels whose brightness falls outside this band are judged to belong to a pedestrian candidate area. Pixels inside the band are used to update the background image.

First, an initial background model is generated (S10).

The initial background model B_init is determined as the average of N background frames, as in Equation 1. The background frames are arbitrary grayscale images that contain no pedestrians.

[Equation 1]

Here, I_k is the k-th input frame, and X and Y are the width and height of the image, respectively.
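The averaging described for Equation 1 can be sketched as follows. This is a minimal illustration under assumptions, not the patented implementation: frames are plain lists of rows of grayscale values, and the function name `initial_background` is hypothetical.

```python
def initial_background(frames):
    """Average N pedestrian-free grayscale frames pixel by pixel (B_init)."""
    n = len(frames)
    if n < 2:
        # The method initializes from at least two consecutive frames.
        raise ValueError("need at least two background frames")
    height, width = len(frames[0]), len(frames[0][0])
    return [
        [sum(frame[j][i] for frame in frames) / n for i in range(width)]
        for j in range(height)
    ]


# Three tiny 2x2 frames standing in for an empty scene.
frames = [
    [[10, 20], [30, 40]],
    [[12, 22], [30, 44]],
    [[14, 18], [30, 42]],
]
b_init = initial_background(frames)
```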

Next, a frame of the video on which people counting is to be performed is received (S20). The video consists of consecutive frames, and people counting is performed on them sequentially: one frame is processed, then the next. The background model is updated each time a frame is processed.

Next, the upper and lower band lines (U_band, L_band) of the background model are updated (S30).

L_band(i,j) and U_band(i,j) are defined as the pixel value of the previous frame's background model image minus and plus, respectively, the weighted standard deviation.

[Equation 4]

Here, α is a weight that determines the width of the brightness band. The above process is repeated for each new input image.
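The band update of step S30 can be sketched as below. The text does not say how the standard deviation is estimated, so `sigma` is passed in here as a hypothetical per-pixel input; `update_band` and the sample values are likewise illustrative assumptions.

```python
def update_band(prev_background, sigma, alpha):
    """Return (L_band, U_band) as background -/+ alpha * sigma per pixel.

    `sigma` is an assumed per-pixel standard deviation of the background
    brightness; its estimation is not described in the text.
    """
    lower = [
        [b - alpha * s for b, s in zip(b_row, s_row)]
        for b_row, s_row in zip(prev_background, sigma)
    ]
    upper = [
        [b + alpha * s for b, s in zip(b_row, s_row)]
        for b_row, s_row in zip(prev_background, sigma)
    ]
    return lower, upper


l_band, u_band = update_band([[100.0, 50.0]], [[2.0, 4.0]], alpha=2.5)
```

A larger α widens the band, so fewer pixels are flagged as candidates; a smaller α makes the detector more sensitive to brightness changes.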

Next, the pedestrian candidate area map is extracted using the background model and its upper and lower band lines (S40).

Using the m-th input I_m of the system and the background model image B_m, the pedestrian candidate area map F_m is generated as in Equation 3. The map F_m is given as input to the pedestrian classifier.

[Equation 3]

Here, L_band(i,j) and U_band(i,j) are the lower and upper brightness band values for each pixel of the given image, as in FIG. 6.
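A minimal sketch of the candidate extraction in step S40 follows. It assumes that F_m is a binary map (1 for a candidate pixel, 0 for background) and that a pixel exactly on a band line counts as background; neither detail is confirmed by the text, and `candidate_map` is a hypothetical name.

```python
def candidate_map(frame, lower, upper):
    """Mark pixels whose brightness falls outside [lower, upper] with 1."""
    return [
        [1 if (p < lo or p > hi) else 0 for p, lo, hi in zip(p_row, lo_row, hi_row)]
        for p_row, lo_row, hi_row in zip(frame, lower, upper)
    ]


frame = [[100, 130], [60, 97]]
lower = [[95, 95], [95, 95]]
upper = [[105, 105], [105, 105]]
f_m = candidate_map(frame, lower, upper)
```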

Next, the background model is updated using the input frame (S50).

As described above, once the initial background model B_init is determined by Equation 1, the background model B_m of the m-th frame is updated, as in Equation 2, only for the pixels of the image that satisfy the condition. That is, in the pedestrian candidate area the background model of the previous frame is kept unchanged, and the average is taken only over the background area, which excludes the pedestrian candidate area.

[Equation 2]

Here, B_m(i,j) is the background model value at pixel position (i,j) of the m-th frame, and I_k is the k-th input frame.

F_m(i,j) is obtained from B_{m-1}(i,j) and the band lines (L_band, U_band). Initially B_{m-1}(i,j) is not defined, so the initial background model B_init(i,j) is first obtained from Equation 1 and used to compute F_m(i,j). Once F_m(i,j) has been obtained, it is used to update B_init(i,j) into B_m(i,j). B_m(i,j) is then used to compute F_{m+1}(i,j).
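The selective update of step S50 can be sketched as below. The exact form of Equation 2 is not available in this text, so the sketch assumes an incremental mean with a per-pixel count of how many background observations have been averaged so far. The `counts` bookkeeping and the name `update_background` are illustrative assumptions.

```python
def update_background(prev_background, counts, frame, candidates):
    """Update the background only where the pixel is not a candidate.

    Candidate pixels keep the previous model value; background pixels are
    folded into a running mean (an assumed form of the averaging).
    """
    new_background, new_counts = [], []
    for b_row, n_row, p_row, f_row in zip(prev_background, counts, frame, candidates):
        b_out, n_out = [], []
        for b, n, p, f in zip(b_row, n_row, p_row, f_row):
            if f:  # pedestrian candidate: leave the model untouched
                b_out.append(b)
                n_out.append(n)
            else:  # background: incremental mean over n + 1 samples
                b_out.append((b * n + p) / (n + 1))
                n_out.append(n + 1)
        new_background.append(b_out)
        new_counts.append(n_out)
    return new_background, new_counts


b_m, n_m = update_background(
    prev_background=[[100.0, 100.0]],
    counts=[[4, 4]],
    frame=[[110, 200]],
    candidates=[[0, 1]],
)
```

Keeping candidate pixels frozen prevents a pedestrian who lingers in the scene from being absorbed into the background model.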

Next, the pedestrian candidate area map is fed to the CNN pedestrian classifier for classification (S60). The CNN-based pedestrian classifier is described in detail below.

Next, the number of pedestrians is counted according to the classification result (S70). Steps S20 to S70 are then repeated: the next frame is received and people counting is performed on it. The repetition continues until the last frame (S80).

Once classification is complete, the position of each pedestrian in the image is available. Based on this position information, a pedestrian is counted when he or she passes a specific reference line or reference area.
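A minimal sketch of reference-line counting follows. It assumes that the vertical position of one pedestrian has already been associated across frames into a list `track_y`; how detections are linked from frame to frame is not described in the text, so that input and the name `count_crossings` are hypothetical.

```python
def count_crossings(track_y, line_y):
    """Count how often consecutive positions fall on opposite sides of the line."""
    crossings = 0
    for prev_y, curr_y in zip(track_y, track_y[1:]):
        if (prev_y < line_y) != (curr_y < line_y):
            crossings += 1
    return crossings


# A pedestrian moving down the image past a reference line at y = 100.
passed = count_crossings([80, 95, 104, 120], line_y=100)
```

This version counts a crossing in either direction; a deployment that distinguishes entries from exits would compare the sign of the movement as well.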

Next, the method of performing pedestrian classification using a CNN according to an embodiment of the present invention will be described in more detail.

Pedestrian classification has been studied extensively. [Non-Patent Document 1] greatly improved pedestrian recognition performance by proposing HOG (histogram of oriented gradients), a hand-crafted feature with high repeatability for people. DPM (deformable part models) [Non-Patent Document 25], which applies HOG separately to parts such as a person's arms, legs, and head, was proposed later and improved performance further. Since then, deep learning techniques, which perform well in image recognition, have also come to be used for pedestrian classification.

Deep learning builds an artificial neural network and trains an image recognition model on big data, and is used in many areas of image analysis. [Non-Patent Documents 6-8] proposed object detection techniques that predict the class and location of objects in an image using a CNN and a region proposal method. Typically 2000 object candidate regions are extracted from one image, mainly using methods such as Edgeboxes [Non-Patent Document 14] or Selective Search [Non-Patent Document 13].

In the method of the present invention, pedestrian candidate areas are generated using the background model described above, and pedestrians are classified using a CNN. Specifically, the present invention uses a CNN structure obtained by pruning GoogleNet [Non-Patent Document 4]. GoogleNet is a representative CNN-based object classifier that performed well in the classification task of the 2014 ILSVRC (ImageNet Large Scale Visual Recognition Challenge). GoogleNet consists of 22 layers in total. Unlike VGG-Net [Non-Patent Document 3] or Alex-Net [Non-Patent Document 23], which simply stack convolutional layers deeply, it adds a new layer structure called the inception module. The table in FIG. 7 shows the CNN structure of GoogleNet.

The inception module of GoogleNet is a kind of NIN (network in network) [Non-Patent Document 24] structure. It uses convolutional filters of different sizes within the same layer, so that features at different scales can be obtained. FIG. 8 shows the structure of the inception module of GoogleNet.

As described above, the NIN structure has the advantage of producing features at various scales. However, it requires more computation than an ordinary CNN structure, so inference takes a long time as the number of layers grows. GoogleNet proposed reducing the number of feature maps through 1×1 convolutions, which spreads the computation evenly and makes it possible to build a deep network.
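The saving from a 1×1 reduction can be checked with simple arithmetic. The sketch below counts multiply-accumulate operations for a 5×5 convolution applied directly, and for the same convolution preceded by a 1×1 convolution that reduces the channel count. The layer sizes (28×28 map, 192 input channels, reduction to 16, 32 outputs) are illustrative values chosen for this example, and stride 1 with output size equal to input size is assumed.

```python
def conv_macs(height, width, kernel, in_channels, out_channels):
    """Multiply-accumulates for one convolution, assuming stride 1 and
    an output map of the same spatial size as the input."""
    return height * width * kernel * kernel * in_channels * out_channels


H = W = 28
direct = conv_macs(H, W, 5, 192, 32)
reduced = conv_macs(H, W, 1, 192, 16) + conv_macs(H, W, 5, 16, 32)
ratio = direct / reduced
```

With these sizes the reduced branch needs roughly a tenth of the operations of the direct 5×5 convolution, which is the effect the paragraph above attributes to the 1×1 convolution.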

The inception module is a kind of NIN structure; it can be regarded as a subset of NIN structures. In an ordinary CNN structure the filters are linear, which makes it difficult to extract non-linear features. The NIN structure addresses this by embedding another CNN inside the CNN, and performs well at extracting non-linear features.

The present invention proposes a CNN structure that classifies pedestrians using the inception module. The input is a 3-channel (RGB) color image of size 96×180. Because classification is performed for a single class, pruning is applied to the basic GoogleNet structure to reduce the depth of the network and the number of feature maps. FIG. 9 shows the proposed CNN structure.

In general, the recognition rate of a CNN improves as the network depth and number of feature maps increase, but so does the running time. Real-time operation requires a CNN structure with an appropriate depth and number of feature maps. Pruning means reducing the number of nodes in each layer, or deleting layers altogether, in order to shrink an existing CNN structure. Because pruning deletes layers or reduces the number of feature maps, the network depth and the number of feature maps decrease as a result.

In addition to determining whether an object is a pedestrian, the method according to the present invention includes a regression head that determines the pedestrian's location. The regression head predicts the position of the top-left corner of the bounding box along with its width and height. Because the classification and regression heads share the weights of the CNN, object classification and bounding box regression are both performed in a single feedforward pass without additional computation.

A deep neural network model generally has a large number of parameters. The number of parameters grows with the number of feature maps used in the CNN structure and with the depth of the network. In practice, the amount of training data is limited. As a result, training a deep neural network model can lead to overfitting, in which the model parameters are fitted too closely to the training data. An overfitted model works well on the training data, but its error on real data increases.

Dropout [Non-Patent Document 27] and data augmentation are the methods mainly used to address overfitting. Dropout uses only part of the FCL (fully connected layer) of the CNN structure during training; which parts are used and which are not is decided at random. The deep learning model of the method according to the present invention contains no FCL, so overfitting is addressed through data augmentation.

For data augmentation, the present invention applies translation and rotation to the training images. The original image is translated up, down, left, and right, and rotation is then applied to each resulting image. Rotation is applied about the center of the image by 0, 5, and 10 degrees in the left and right directions. As a result, 20 new images are produced from each input image. Equation 5 gives the rotation and translation.

[Equation 5]

Here, a and b are the rotation parameters, and c and d are the translation parameters.
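One reading of the augmentation that reproduces the stated count of 20 is four translations (up, down, left, right) combined with five rotation angles (-10, -5, 0, 5, 10 degrees). The sketch below enumerates those combinations and applies translate-then-rotate to a single pixel coordinate. The shift of 8 pixels, the interpretation a = cos θ and b = sin θ, the sign convention of the rotation, and both function names are assumptions made for illustration.

```python
import math


def augmentation_params(shift=8, angles=(-10, -5, 0, 5, 10)):
    """List (c, d, angle) triples: 4 shifts x 5 angles = 20 variants."""
    shifts = [(0, -shift), (0, shift), (-shift, 0), (shift, 0)]  # up, down, left, right
    return [(c, d, angle) for (c, d) in shifts for angle in angles]


def transform_point(x, y, angle_deg, c, d, cx, cy):
    """Translate (x, y) by (c, d), then rotate about the center (cx, cy)."""
    a = math.cos(math.radians(angle_deg))  # assumed meaning of parameter a
    b = math.sin(math.radians(angle_deg))  # assumed meaning of parameter b
    dx, dy = x + c - cx, y + d - cy
    return a * dx - b * dy + cx, b * dx + a * dy + cy


params = augmentation_params()
# Center of a 96x180 input image; zero rotation leaves only the shift.
moved = transform_point(10, 20, 0, 3, -4, cx=48, cy=90)
```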

Next, the effects of the present invention will be described in more detail through experiments.

First, the experimental environment is described.

A Logitech C920 camera was used in the experiment. FIG. 10 shows the product specifications and a photograph of the C920 used [Non-Patent Document 27]. The resolution of the input video is 1280 × 720 and the frame rate is 20 fps. The camera is connected via USB 2.0 to the embedded board on which the people counting method is implemented.

The experiment uses an Nvidia Jetson TX2 embedded board. The Jetson TX2 includes a GPU, so the NVIDIA CUDA platform can be used. FIG. 11 shows the product specifications of the Jetson TX2 [Non-Patent Document 27].

FIG. 12 shows the installation environment of the people counting method used in the experiment. A single camera was installed in the bird's-eye-view configuration.

Next, the experimental results are described.

The experimental videos were shot in various scenarios under several lighting conditions. They include scenarios considered plausible in everyday life, such as pedestrians walking arm in arm, pedestrians holding hands, jogging or running fast, and carrying a bag.

The distance between the camera and the pedestrians is a very important factor when installing a people counting system. As the shooting distance increases, the amount of pedestrian information that is lost varies with the shooting angle and strongly affects the recognition rate. In the present invention, images were acquired at various shooting distances. FIG. 13 shows pedestrian images acquired in various environments.

FIG. 13(a) was captured indoors. In this case the camera was installed facing the entrance of the building, at a distance of 3 m from the pedestrians. FIG. 13(b) was captured outdoors, and the distance between the pedestrians and the camera is 15 m, a relatively long-range case. FIGS. 13(c) and (d) were captured at the same indoor location with different shooting angles and shooting distances. In this case there is no dedicated lighting and backlighting occurs at the building entrance, which is a very unfavorable environment for pedestrian classification.

FIG. 14 shows the various pedestrian movement scenarios. Several situations that can occur in everyday life were assumed: a pedestrian holding a box or carrying a bag, pedestrians crossing paths or walking side by side, pedestrians moving as a group, and so on. The videos captured in FIGS. 14(a), (b), (c), and (d) include all of the pedestrian movement scenarios shown in FIG. 14.

FIG. 15 shows the people counting results of the method according to the present invention for each of the test videos described above. The test video shown in FIG. 13(a) contains a total of 304 pedestrians, (b) contains 317, and (c) and (d) each contain 520. In the table of FIG. 15, "total number" is the number of pedestrians in each scenario, "Detection" is the number of pedestrians that the proposed people counting method detected successfully, and the last entry gives the people counting accuracy as a percentage.
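The accuracy figure described here is the ratio of detected pedestrians to total pedestrians. The short Python sketch below shows that arithmetic, along with two ways of averaging it over several videos: the mean of the per-video percentages and the pooled percentage over all pedestrians. The function names and the counts in the example are hypothetical and are not the values of FIG. 15. The text does not state which kind of average is reported.

```python
def counting_accuracy(detected, total):
    """Percentage of pedestrians that were detected."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * detected / total

def mean_accuracy(results):
    """Mean of the per-video accuracies; results is a list of (detected, total)."""
    return sum(counting_accuracy(d, t) for d, t in results) / len(results)

def pooled_accuracy(results):
    """Accuracy over all pedestrians of all videos taken together."""
    return counting_accuracy(sum(d for d, _ in results), sum(t for _, t in results))

# Illustrative counts only (not taken from the experiment).
example = [(9, 10), (40, 50)]
per_video_mean = mean_accuracy(example)
pooled = pooled_accuracy(example)
```

The two averages differ whenever the videos contain different numbers of pedestrians, because the pooled figure weights each video by its pedestrian count.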

The method according to the present invention achieves an average accuracy of 89.9%. Test video (b) shows a lower recognition rate than the other videos because the pedestrians were filmed from the side. Filming pedestrians from the side loses information about their shape, which lowers the recognition rate. In particular, in video acquired from the side, occluded regions occur frequently as pedestrians pass the people counting line.

The table of FIG. 16 shows the people counting accuracy for each scenario. It covers all of the pedestrian movements defined in FIG. 14. For each scenario it lists the number of correctly detected pedestrians against the total number of pedestrians, together with the pedestrian detection accuracy as a percentage. FIG. 17 plots the error rate of pedestrian detection.

The people counting method proposed in the present invention produces detection results that are robust to changes in pedestrian shape. It performs well when pedestrians carry objects, as in scenarios (a) and (b), and when the pedestrian shape changes considerably, as in scenarios (c) and (d). When a pedestrian is running, blur is observed in consecutive frames of the video, but it has no significant effect on the pedestrian detection rate.

In contrast, scenarios (g) and (h) show relatively low recognition rates. In scenario (g) pedestrians cross paths at the doorway, and in scenario (h) they move as a group. When pedestrians cross paths or move in a dense group, they occlude one another. The CNN-based pedestrian classifier according to the present invention recognizes the pedestrians in an occluded region as a single person, which results in an error. Occlusion caused by overlapping pedestrians can occur regardless of the camera installation angle or distance, and in that case a counting error is difficult to avoid. However, when the occlusion lasts for only a short span of frames, the method operates normally.

The present invention has described a people counting method for a bird's-eye view setup. The method generates a background model by exploiting the fact that the fluctuation of pixel brightness values differs with the lighting environment, and it updates the background model only in part of the given image. The pedestrian candidate areas produced by the method are given as input to a CNN-based pedestrian classifier, which determines whether each input candidate area is a pedestrian.
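The band-based candidate extraction and the partial background update summarized above can be sketched in Python with NumPy as follows. This is a minimal sketch under stated assumptions, not the equations of the specification: the band is assumed to be the background value plus or minus a weighted per-pixel deviation, the update is assumed to be a mean over recent frames, and all function names, the `weight` value, and the example data are hypothetical.

```python
import numpy as np

def update_band(background, deviation, weight=2.0):
    """Upper and lower band lines around the background model (assumed form)."""
    return background + weight * deviation, background - weight * deviation

def candidate_map(frame, upper, lower):
    """1 where the current pixel falls outside the band, 0 otherwise."""
    return ((frame > upper) | (frame < lower)).astype(np.uint8)

def update_background(prev_background, recent_frames, candidates):
    """Keep the previous model inside candidate areas; elsewhere use the
    mean of the recent frames (assumed update rule)."""
    mean_recent = np.mean(recent_frames, axis=0)
    return np.where(candidates == 1, prev_background, mean_recent)

# Tiny 2x2 example: two background frames, then a frame with one bright pixel.
history = [np.full((2, 2), 100.0), np.full((2, 2), 102.0)]
background = np.mean(history, axis=0)          # initial model: 101 everywhere
deviation = np.full((2, 2), 1.0)               # hypothetical per-pixel deviation
current = background.copy()
current[0, 0] = 160.0                          # a pedestrian-like change

upper, lower = update_band(background, deviation)
candidates = candidate_map(current, upper, lower)   # [[1, 0], [0, 0]]
background = update_background(background, [history[-1], current], candidates)
```

Freezing the model inside candidate areas keeps a pedestrian who lingers in view from being absorbed into the background, while the rest of the image continues to follow gradual lighting changes.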

The method according to the present invention is robust to the lighting environment, the camera installation distance, and similar factors, and it performs well across a variety of pedestrian scenarios. When pedestrians overlap or carry large objects, occlusion occurs and detection performance is relatively lower, but overall the method is confirmed to operate robustly with respect to the installation environment.

Although the invention made by the present inventors has been described in detail with reference to the above embodiments, the present invention is not limited to those embodiments, and various modifications may of course be made without departing from the gist of the invention.

10: image 20: computer terminal
30: program system

Claims

1. A people counting method on an embedded platform using a convolutional neural network, which counts pedestrians in a video composed of consecutive frames, the method comprising:
(a) initializing and generating a background model using at least two consecutive previous frames;
(b) receiving a current frame of the video;
(c) updating an upper band line and a lower band line of the background model using the deviation of the pixels of the background model;
(d) extracting a pedestrian candidate area map using the background model and the upper and lower band lines, wherein, for each pixel of the current frame, pixels that fall outside the upper and lower band lines of the updated background model are extracted as the pedestrian candidate area map;
(e) updating the background model using the current frame of the video, wherein the background model of the previous frame is kept unchanged in the pedestrian candidate area and only the remaining area is updated;
(f) inputting the pedestrian candidate area map into a convolutional neural network (CNN) pedestrian classifier for classification; and
(g) counting pedestrians according to the classification result.

2. The method according to claim 1,
wherein, in step (a), the background model is initialized by averaging N background frames (N being a natural number of 2 or more), and
in step (e), the background model is updated by averaging N frames, from previous frames up to and including the current frame.

3. The method according to claim 1,
wherein, in step (e), the background model is updated by the following Equation 1.
[Equation 1]

Here, B_m(i, j) represents the value at pixel (i, j) of the background model for the m-th frame, which is the current frame; I_k(i, j) is the value of pixel (i, j) in the k-th frame; F_m(i, j) is the pedestrian candidate area map, indicating whether pixel (i, j) of the m-th frame belongs to a pedestrian candidate area; and N is a natural number of at least 2.

4. The method according to claim 3,
wherein, in step (c), the upper band line U_band and the lower band line L_band of the background model are updated using the following Equation 2.
[Equation 2]

Here, ? is a predetermined constant used as a weight.

5. The method according to claim 4,
wherein, in step (d), the pedestrian candidate area map is obtained by the following Equation 3.
[Equation 3]

Here, F_m(i, j) represents the value of pixel (i, j) of the pedestrian candidate area map, and I_m(i, j) is the value of pixel (i, j) of the current frame m.

6. The method according to claim 1,
wherein the CNN pedestrian classifier comprises an inception module layer, and the inception module has a network-in-network (NIN) structure.

7. The method according to claim 1,
wherein, when the CNN pedestrian classifier is trained, a plurality of new training images are generated from each training image by applying translation and rotation operations, and training is then performed.

8. The method according to claim 7,
wherein, for a given training image (x, y), a plurality of new training images are generated according to the following Equation 4 by varying the rotation parameters and the translation parameters.
[Equation 4]

Here, (x', y') represents the newly generated training image, a and b are rotation parameters, and c and d are translation parameters.

9. The method according to claim 6,
wherein a pruning operation is performed on the inception module.
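For illustration only, one common form of pruning is magnitude pruning, in which the weights with the smallest absolute values are set to zero so that the layer needs less computation on an embedded board. The claim text does not say which pruning criterion is used, so the criterion, the function name, and the `sparsity` parameter in the Python sketch below are assumptions.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero the fraction `sparsity` of weights with the smallest magnitude.

    Returns the pruned weights and the boolean keep-mask. If several weights
    tie at the threshold, all of them are pruned.
    """
    w = np.asarray(weights, dtype=float)
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    k = int(round(sparsity * w.size))          # number of weights to remove
    if k == 0:
        return w.copy(), np.ones(w.shape, dtype=bool)
    threshold = np.sort(np.abs(w), axis=None)[k - 1]
    mask = np.abs(w) > threshold
    return np.where(mask, w, 0.0), mask

# Hypothetical 2x2 weight matrix standing in for one filter of a layer.
pruned, kept = magnitude_prune([[0.5, -0.1], [0.05, -2.0]], sparsity=0.5)
```

In practice a pruned network is usually fine-tuned afterwards to recover accuracy; that step is outside this sketch.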

10. A computer-readable recording medium having recorded thereon a program for performing the people counting method on an embedded platform using a convolutional neural network according to any one of claims 1 to 9.