KR102387357B1

KR102387357B1 - A method and apparatus for detecting an object in an image by matching a bounding box on a space-time basis

Info

Publication number: KR102387357B1
Application number: KR1020180157430A
Authority: KR
Inventors: 유석봉; 한미경
Original assignee: 한국전자통신연구원 (Electronics and Telecommunications Research Institute)
Priority date: 2018-12-07
Filing date: 2018-12-07
Publication date: 2022-04-15
Also published as: KR20200075072A

Abstract

Disclosed are a method and an apparatus for detecting an object in an image by spatio-temporally matching bounding boxes. The method includes: detecting at least one initial bounding box for each of a plurality of temporally successive frames; matching the detected initial bounding boxes with one another in space-time; detecting at least one final bounding box by adding bounding boxes to, or deleting bounding boxes from, the initial bounding boxes based on the correspondences determined by the matching; and detecting an object included in the plurality of frames by using the detected final bounding box.

Description

A method and apparatus for detecting an object in an image by spatio-temporal matching of a bounding box

The present invention relates to a method and apparatus for detecting an object in an image by spatio-temporally matching bounding boxes, and more particularly to a technique for correcting errors in initially detected bounding boxes by matching the bounding boxes detected in each temporally successive frame in the forward direction or bidirectionally.

Object detection is a core technology widely used in applications such as robotics, video surveillance, and vehicle safety. Since methods that apply artificial neural networks, in particular convolutional neural networks (CNNs), to object detection became known, object detection from a single image has advanced rapidly.

Object detection identifies objects at particular locations in an image. Unlike object classification, it must estimate both the location and the identity of each object, and it must identify every object of interest in the image.

CNN-based object detection methods can be divided into techniques based on region extraction (RoI pooling, region-of-interest pooling) and techniques based on grid cells. Region-extraction techniques estimate the size and location of bounding boxes so that objects of different sizes can be detected from a single feature map, and then detect objects using the estimated bounding boxes.

However, such bounding boxes are usually detected independently in each frame, without regard to the temporal continuity of video. As a result, some bounding boxes are missed, or bounding boxes are wrongly detected in regions that contain no object of interest.

An object of the present invention, devised to solve the above problems, is to provide a method of detecting an object in an image by spatio-temporally matching bounding boxes.

Another object of the present invention, devised to solve the above problems, is to provide an apparatus for detecting an object in an image by spatio-temporally matching bounding boxes.

Still another object of the present invention, devised to solve the above problems, is to provide an image analysis server that uses bounding-box-based object identification.

One aspect of the present invention for achieving the above objects provides a method of detecting an object in an image by spatio-temporally matching bounding boxes.

The method may include: detecting at least one initial bounding box for each of a plurality of temporally successive frames; matching the detected initial bounding boxes with one another in space-time; detecting at least one final bounding box by adding or deleting bounding boxes among the initial bounding boxes based on the correspondences determined by the matching; and detecting an object included in the plurality of frames by using the detected final bounding box.

Detecting the at least one initial bounding box may include: preprocessing each of the plurality of frames; inputting each preprocessed frame into a convolutional neural network (CNN) to obtain bounding box candidates; and selecting bounding boxes from among the obtained candidates as the at least one initial bounding box.

The preprocessing may include converting the resolution of the plurality of frames to the input resolution of the CNN.

Matching the initial bounding boxes in space-time may include forward matching, in which the initial bounding boxes of the temporally earliest frame among the plurality of frames are matched against the initial bounding boxes of the remaining frames.

Detecting the at least one final bounding box may include adding a bounding box to a first frame when, as a result of the forward matching, an initial bounding box of the earliest frame has no correspondence with any initial bounding box of the first frame but does correspond to an initial bounding box of another of the remaining frames.

Matching the initial bounding boxes in space-time may include bidirectional matching, in which the initial bounding boxes of the temporally middle frame among the plurality of frames are matched against the initial bounding boxes of the remaining frames.

Detecting the at least one final bounding box may include deleting a first initial bounding box of the middle frame when, as a result of the bidirectional matching, it has no correspondence with any initial bounding box of the remaining frames.

Each initial bounding box may be expressed by the coordinates of its center pixel and its size.

Matching the initial bounding boxes in space-time may include determining correspondences by using a cost function computed from the overlapping area of the two bounding boxes being matched.

The cost function may be defined as the overlapping area divided by the sum of the areas of the two bounding boxes.
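The cost function stated above can be sketched as follows. This is a minimal illustration, assuming boxes in the center format (x, y, w, h) that this document describes; the names `Box`, `overlap_area`, and `matching_cost` are hypothetical. Because the overlap is divided by the sum of both areas rather than by their union, the value can never exceed 0.5, which identical boxes reach.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Bounding box in center format: center pixel (x, y), width w, height h."""
    x: float
    y: float
    w: float
    h: float


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two boxes (0 when they do not overlap)."""
    # Convert center/size to left, right, top and bottom edges, then intersect.
    left = max(a.x - a.w / 2, b.x - b.w / 2)
    right = min(a.x + a.w / 2, b.x + b.w / 2)
    top = max(a.y - a.h / 2, b.y - b.h / 2)
    bottom = min(a.y + a.h / 2, b.y + b.h / 2)
    return max(0.0, right - left) * max(0.0, bottom - top)


def matching_cost(a: Box, b: Box) -> float:
    """Overlapping area divided by the sum of the two box areas."""
    total = a.w * a.h + b.w * b.h
    return overlap_area(a, b) / total if total > 0 else 0.0


a = Box(10, 10, 4, 4)
print(matching_cost(a, a))                  # 0.5 (identical boxes)
print(matching_cost(a, Box(12, 10, 4, 4)))  # 0.25 (half of each box overlaps)
```

Since this ratio tops out at 0.5, any threshold used to declare a correspondence has to be chosen on that scale; a threshold tuned for intersection-over-union (range 0 to 1) would not carry over directly.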

Another aspect of the present invention for achieving the above objects provides an apparatus for detecting an object in an image by spatio-temporally matching bounding boxes.

The apparatus may include at least one processor and a memory storing instructions that direct the at least one processor to perform at least one step.

The at least one step may include: detecting at least one initial bounding box for each of a plurality of temporally successive frames; matching the detected initial bounding boxes with one another in space-time; detecting at least one final bounding box by adding or deleting bounding boxes among the initial bounding boxes based on the correspondences determined by the matching; and detecting an object included in the plurality of frames by using the detected final bounding box.

Still another aspect of the present invention for achieving the above objects provides an image analysis server that uses bounding-box-based object identification.

The image analysis server may include at least one processor and a memory storing instructions that direct the at least one processor to perform at least one step.

The at least one step may include: receiving CCTV video from an integrated control server that manages an urban environment, or from a database linked to the integrated control server; obtaining a plurality of temporally successive frames from the received CCTV video; detecting at least one initial bounding box for each of the obtained frames; matching the detected initial bounding boxes with one another in space-time; detecting at least one final bounding box by adding or deleting bounding boxes among the initial bounding boxes based on the correspondences determined by the matching; detecting an object included in the plurality of frames by using the detected final bounding box; and transmitting information on the detected object to the integrated control server.

With the method and apparatus for detecting an object in an image by spatio-temporally matching bounding boxes according to the present invention, false detections and missed detections among the initial bounding boxes produced by the CNN can be compensated for, so that accurate bounding boxes are detected.

In addition, bounding boxes can be detected even for objects located in regions partially occluded by other objects.

FIG. 1 is an exemplary diagram illustrating errors that occur when detecting bounding boxes in temporally continuous frames.
FIG. 2 is a flowchart illustrating a method of detecting initial bounding boxes according to an embodiment of the present invention.
FIG. 3 is a conceptual diagram illustrating the configuration of the initial bounding boxes detected in each of temporally continuous frames according to an embodiment of the present invention.
FIG. 4 is a conceptual diagram illustrating a method of matching bounding boxes in the forward direction in space-time according to an embodiment of the present invention.
FIG. 5 is a conceptual diagram illustrating a method of matching bounding boxes bidirectionally in space-time according to an embodiment of the present invention.
FIG. 6 is a conceptual diagram of a cost function for determining the similarity between bounding boxes matched in space-time according to an embodiment of the present invention.
FIG. 7 is a flowchart illustrating a method of detecting an object in an image by spatio-temporally matching bounding boxes according to an embodiment of the present invention.
FIG. 8 is a block diagram of an apparatus for detecting an object in an image by spatio-temporally matching bounding boxes according to an embodiment of the present invention.
FIG. 9 is a block diagram of an integrated control system to which the method and apparatus for detecting an object in an image by spatio-temporally matching bounding boxes according to an embodiment of the present invention can be applied.
FIG. 10 is a conceptual diagram illustrating the effect of the method and apparatus for detecting an object in an image by spatio-temporally matching bounding boxes according to an embodiment of the present invention.

The present invention may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail in the detailed description. However, this is not intended to limit the present invention to those specific embodiments, and the invention should be understood to include all modifications, equivalents, and substitutes within its spirit and scope. Like reference numerals are used for like elements throughout the description of the drawings.

Terms such as first, second, A, and B may be used to describe various elements, but the elements are not limited by these terms. The terms are used only to distinguish one element from another. For example, without departing from the scope of the present invention, a first element may be referred to as a second element, and similarly a second element may be referred to as a first element. The term "and/or" includes any combination of a plurality of related listed items, or any one of them.

When an element is referred to as being "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, but it should be understood that intervening elements may also be present. In contrast, when an element is referred to as being "directly connected" or "directly coupled" to another element, it should be understood that no intervening element is present.

The terms used in the present application are used only to describe specific embodiments and are not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In the present application, terms such as "comprise" or "have" specify the presence of the features, numbers, steps, operations, elements, parts, or combinations thereof described in the specification, and do not preclude the presence or possible addition of one or more other features, numbers, steps, operations, elements, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by one of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or overly formal sense unless explicitly so defined in the present application.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is an exemplary diagram illustrating errors that occur when detecting bounding boxes in temporally continuous frames.

In general, video used for object detection consists of a plurality of temporally continuous frames or pictures. When bounding boxes for object detection are detected separately in each frame, then depending on the detector's performance, an incorrect bounding box may be detected in a particular frame, or a bounding box that should be detected may be missed.

Referring to FIG. 1, in the frame preceding the current frame (frame(t)), that is, the previous frame (frame(t-1)), a first bounding box 10, a second bounding box 11, and a third bounding box 12 are detected for the objects of interest.

When bounding boxes are then detected in the current frame (frame(t)), the first bounding box 10 and the third bounding box 12 may be detected again, as shown in FIG. 1, while the second bounding box 11 is not detected. This is a detection failure (false negative).

In addition, although not shown in the drawing, a false detection (false positive) may occur, in which a bounding box that was not detected in the previous frame (frame(t-1)) is detected in the current frame (frame(t)) even though no new object of interest has appeared.

Thus, when objects are detected by finding bounding boxes independently in each frame, the temporal continuity inherent in video is ignored, and false detections or detection failures can occur depending on the object detection algorithm.

To solve these problems, according to an embodiment of the present invention, the bounding boxes detected in each frame are set as initial bounding boxes, and the initial bounding boxes are corrected by taking their mutual spatio-temporal correlation into account, either by detecting additional bounding boxes or by deleting wrongly detected ones.

First, the process of determining the initial bounding boxes will be described with reference to the drawings.

FIG. 2 is a flowchart illustrating a method of detecting initial bounding boxes according to an embodiment of the present invention.

Referring to FIG. 2, the method of detecting initial bounding boxes according to an embodiment of the present invention may include preprocessing an input image (S100), inputting the preprocessed input image into a convolutional neural network (CNN) to obtain bounding box candidates (S110), and determining bounding boxes selected from among the candidates as the initial bounding boxes (S120).

Preprocessing (S100) may include converting an input image of a given resolution to the resolution expected as CNN input. For example, if the input video is high-resolution (e.g., FHD or QHD) and the CNN takes low-resolution (e.g., SD) input, the input image may be downscaled to the CNN's input resolution. The input images may be consecutive frames of the video (e.g., when the current time is t, the three frames at times t-1, t, and t+1).
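A minimal sketch of the resolution conversion in step S100 follows, assuming nearest-neighbour downscaling with NumPy. The text does not name an interpolation method; `resize_nearest` and the 640×480 target size are hypothetical stand-ins for the CNN's actual input resolution.

```python
import numpy as np


def resize_nearest(frame: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Nearest-neighbour resize of an (H, W) or (H, W, C) frame to (out_h, out_w)."""
    in_h, in_w = frame.shape[:2]
    # Source row/column index that each output row/column samples from.
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    # Broadcasting (out_h, 1) against (out_w,) selects an out_h x out_w grid;
    # any trailing channel axis is carried along unchanged.
    return frame[rows[:, None], cols]


# Three consecutive FHD frames (t-1, t, t+1), downscaled to a hypothetical
# 640x480 CNN input before detection.
frames = [np.zeros((1080, 1920, 3), dtype=np.uint8) for _ in range(3)]
cnn_inputs = [resize_nearest(f, 480, 640) for f in frames]
```

Nearest-neighbour sampling is used here only because it is short and dependency-light; a production pipeline would more likely use an area-averaging resize to reduce aliasing when downscaling.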

In obtaining bounding box candidates (S110), the preprocessed input image is passed through a pre-trained CNN. The CNN outputs bounding box candidates at various pixel locations, together with a confidence for each candidate that represents the probability that an object of interest is located in it.

Here, the CNN may use feature maps of various sizes to estimate the possible locations and sizes of bounding boxes, and output bounding box candidates consisting of the estimated boxes, together with their confidences. The coefficient values of the CNN may be determined in advance by training on big data. Since the configuration and operation of a CNN for object detection will be readily understood by those skilled in the art to which the present invention belongs, a detailed description is omitted.

In determining the initial bounding boxes from among the candidates (S120), a non-maximum suppression (NMS) algorithm may be performed to select candidates whose confidence is higher than a preset threshold; the selected candidates become the initial bounding boxes.
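Step S120 can be sketched as a confidence threshold followed by greedy non-maximum suppression. This illustration rests on assumptions: the text does not say how NMS measures overlap, so the sketch uses the conventional intersection-over-union (IoU), and the threshold values and the name `select_initial_boxes` are hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h): center pixel, width, height


def iou(a: Box, b: Box) -> float:
    """Conventional intersection-over-union of two center-format boxes."""
    ix = max(0.0, min(a[0] + a[2] / 2, b[0] + b[2] / 2) - max(a[0] - a[2] / 2, b[0] - b[2] / 2))
    iy = max(0.0, min(a[1] + a[3] / 2, b[1] + b[3] / 2) - max(a[1] - a[3] / 2, b[1] - b[3] / 2))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def select_initial_boxes(candidates: Sequence[Box], scores: Sequence[float],
                         score_thresh: float = 0.5, iou_thresh: float = 0.5) -> List[Box]:
    """Keep confident candidates, then greedily suppress overlapping lower-scored ones."""
    # Discard candidates at or below the confidence threshold; visit the rest best-first.
    order = sorted((i for i, s in enumerate(scores) if s > score_thresh),
                   key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        # A candidate survives only if it does not overlap any kept box too much.
        if all(iou(candidates[i], candidates[j]) <= iou_thresh for j in kept):
            kept.append(i)
    return [candidates[i] for i in kept]


cands = [(10, 10, 4, 4), (11, 10, 4, 4), (30, 30, 4, 4), (50, 50, 4, 4)]
print(select_initial_boxes(cands, [0.9, 0.8, 0.7, 0.3]))
# [(10, 10, 4, 4), (30, 30, 4, 4)]
```

In the example, the second candidate overlaps the first with an IoU of 0.6 and is suppressed, and the last is dropped by the confidence threshold.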

Steps S100 to S120 may be performed separately for each consecutive frame of the video, so that initial bounding boxes are determined for every frame.

FIG. 3 is a conceptual diagram illustrating the configuration of the initial bounding boxes detected in each of temporally continuous frames according to an embodiment of the present invention.

When the initial bounding box detection of FIG. 2 is performed on the current frame (frame(t)) and on the previous frame (frame(t-1)) and next frame (frame(t+1)) temporally adjacent to it, initial bounding boxes are obtained for three temporally consecutive frames, as shown in FIG. 3.

Specifically, referring to reference numeral 31 of FIG. 3, initial bounding boxes are detected in the previous frame (frame(t-1)), yielding M bounding boxes, $B_1^{t-1}$ to $B_M^{t-1}$.

Referring to reference numeral 32 of FIG. 3, initial bounding boxes are detected in the current frame (frame(t)), yielding N bounding boxes, $B_1^{t}$ to $B_N^{t}$.

Referring to reference numeral 33 of FIG. 3, initial bounding boxes are detected in the next frame (frame(t+1)), yielding K bounding boxes, $B_1^{t+1}$ to $B_K^{t+1}$.

Each of the initial bounding boxes 31 to 33 may include location information and size information. The location information may be expressed as the coordinates (x, y) of the bounding box's center pixel, and the size information as its width (w) and height (h).
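The box representation described above, together with the per-frame sets of FIG. 3, could be held in a small data structure such as the following sketch. The class name `BBox`, the corner-conversion helpers, and the frame-offset keys are hypothetical additions; the corner form is included only because overlap computations are simpler on edge coordinates.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class BBox:
    x: float  # center pixel column
    y: float  # center pixel row
    w: float  # width in pixels
    h: float  # height in pixels

    def corners(self) -> Tuple[float, float, float, float]:
        """Return (left, top, right, bottom)."""
        return (self.x - self.w / 2, self.y - self.h / 2,
                self.x + self.w / 2, self.y + self.h / 2)

    @staticmethod
    def from_corners(left: float, top: float, right: float, bottom: float) -> "BBox":
        """Build a center-format box from edge coordinates."""
        return BBox((left + right) / 2, (top + bottom) / 2, right - left, bottom - top)


# Initial boxes keyed by frame offset relative to t (M, N and K boxes respectively).
initial: Dict[int, List[BBox]] = {
    -1: [BBox(10, 20, 4, 6)],
    0: [BBox(11, 20, 4, 6), BBox(40, 40, 8, 8)],
    1: [BBox(12, 20, 4, 6)],
}
```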

FIG. 4 is a conceptual diagram illustrating a method of matching bounding boxes in the forward direction in space-time according to an embodiment of the present invention. FIG. 5 is a conceptual diagram illustrating a method of matching bounding boxes bidirectionally in space-time according to an embodiment of the present invention.

According to an embodiment of the present invention, the initial bounding boxes derived for each consecutive frame, as in FIG. 3, are compared with one another in space-time to determine whether corresponding bounding boxes exist. Depending on whether a correspondence exists, wrongly detected bounding boxes may be deleted or missed bounding boxes may be added.

First, in forward spatio-temporal matching, as shown in FIG. 4, the initial bounding boxes of the previous frame (frame(t-1)) are compared with the initial bounding boxes of the current frame (frame(t)) and the next frame (frame(t+1)) to determine whether correspondences exist. Here, "forward" means that the initial bounding boxes are compared in the temporal order of the frames.

Forward matching as in FIG. 4 can recover bounding boxes that were missed among the initial bounding boxes. For example, if a particular initial bounding box of the previous frame (frame(t-1)) corresponds to none of the initial bounding boxes of the current frame (frame(t)) but does correspond to an initial bounding box of the next frame (frame(t+1)), the corresponding initial bounding box of the current frame can be interpreted as missing. An initial bounding box may therefore be added to the current frame (frame(t)). The added box may have location and size information equal to the average of the two initial bounding boxes that correspond to each other only between the previous frame (frame(t-1)) and the next frame (frame(t+1)).
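The forward-matching rule above can be sketched as follows. The matching cost follows the definition given earlier in this document (overlap divided by the sum of both areas); the correspondence threshold of 0.2, the choice of the best-matching t+1 box when several qualify, and the function names are hypothetical assumptions, since the text fixes none of them.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h), center format


def matching_cost(a: Box, b: Box) -> float:
    """Overlap area divided by the sum of both box areas (the cost defined in the text)."""
    ix = max(0.0, min(a[0] + a[2] / 2, b[0] + b[2] / 2) - max(a[0] - a[2] / 2, b[0] - b[2] / 2))
    iy = max(0.0, min(a[1] + a[3] / 2, b[1] + b[3] / 2) - max(a[1] - a[3] / 2, b[1] - b[3] / 2))
    total = a[2] * a[3] + b[2] * b[3]
    return ix * iy / total if total > 0 else 0.0


def forward_fill(prev: Sequence[Box], cur: Sequence[Box], nxt: Sequence[Box],
                 thresh: float = 0.2) -> List[Box]:
    """Return cur plus a box for every object seen in t-1 and t+1 but missed in t."""
    added: List[Box] = []
    for p in prev:
        # p already has a partner in frame t: nothing is missing.
        if any(matching_cost(p, c) >= thresh for c in cur):
            continue
        partners = [n for n in nxt if matching_cost(p, n) >= thresh]
        if partners:
            # Assumption: when several t+1 boxes qualify, use the best-matching one.
            n = max(partners, key=lambda q: matching_cost(p, q))
            # The added box averages the position and size of the t-1 / t+1 pair.
            added.append(tuple((u + v) / 2 for u, v in zip(p, n)))
    return list(cur) + added


print(forward_fill(prev=[(10, 10, 4, 4)], cur=[], nxt=[(12, 10, 4, 4)]))
# [(11.0, 10.0, 4.0, 4.0)]
```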

Bounding boxes can also be matched bidirectionally in space-time. As shown in FIG. 5, the initial bounding boxes for the current frame (frame(t)) can be compared with the initial bounding boxes for the previous frame (frame(t-1)) and the subsequent frame (frame(t+1)) to calculate their mutual similarity. Whether a correspondence exists can then be determined from the calculated similarity. Here, "bidirectional" may mean that the current frame is compared both with the previous frame, which temporally precedes it, and with the subsequent frame, which temporally follows it.

Bidirectional matching as shown in FIG. 5 makes it possible to delete erroneously detected bounding boxes from among the initial bounding boxes. For example, a particular initial bounding box of the current frame (frame(t)) may have no correspondence with any initial bounding box of either the previous frame (frame(t-1)) or the subsequent frame (frame(t+1)). Such a bounding box is likely to have been erroneously detected. The particular initial bounding box of the current frame (frame(t)) identified as erroneously detected can therefore be deleted.
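The deletion rule can be sketched in the same style. The (cx, cy, w, h) box format and the 0.25 threshold are assumptions, not values from the text. A box in frame(t) survives if it corresponds to at least one box in frame(t-1) or frame(t+1). It is dropped only when it has no partner in either.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed (cx, cy, w, h)


def overlap_ratio(a: Box, b: Box) -> float:
    """Overlap area of a and b divided by the sum of their areas."""
    ix = max(0.0, min(a[0] + a[2] / 2, b[0] + b[2] / 2) - max(a[0] - a[2] / 2, b[0] - b[2] / 2))
    iy = max(0.0, min(a[1] + a[3] / 2, b[1] + b[3] / 2) - max(a[1] - a[3] / 2, b[1] - b[3] / 2))
    return ix * iy / (a[2] * a[3] + b[2] * b[3])


def remove_unmatched(prev: List[Box], cur: List[Box], nxt: List[Box],
                     threshold: float = 0.25) -> List[Box]:
    """Bidirectional matching over frame(t-1), frame(t), frame(t+1).

    Keep a box in frame(t) only if it corresponds to some box in frame(t-1)
    or frame(t+1). Otherwise treat it as a false detection and drop it.
    `threshold` is a hypothetical value.
    """
    neighbours = list(prev) + list(nxt)
    return [c for c in cur
            if any(overlap_ratio(c, o) > threshold for o in neighbours)]
```

Here the isolated box `(300, 300, 10, 10)` in frame(t) has no partner in either neighbouring frame, so it would be removed.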

Correspondences are compared here in the forward or bidirectional direction over three consecutive frames, but the present invention is not limited thereto. For example, erroneously detected bounding boxes may also be found by comparing the bounding boxes of the current frame with those of the two frames that temporally precede it and the two frames that temporally follow it. Likewise, the bounding boxes of five consecutive frames may be compared in the forward direction. If a corresponding bounding box is missing only in one particular frame in the middle, the bounding box can be judged undetected (or missing) in that frame. The missing bounding box can then be additionally generated from the mutually corresponding bounding boxes of the two frames temporally adjacent to that frame.
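The five-frame forward variant can be sketched as below. It uses the same assumed box format and hypothetical threshold as the three-frame sketches. Each box in the first frame is linked greedily to its best-overlapping box in each later frame. This linking rule is a simplification, since the text does not specify how boxes are chained across frames. A box is interpolated only when exactly one interior frame lacks a partner.

```python
from typing import List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed (cx, cy, w, h)


def overlap_ratio(a: Box, b: Box) -> float:
    """Overlap area of a and b divided by the sum of their areas."""
    ix = max(0.0, min(a[0] + a[2] / 2, b[0] + b[2] / 2) - max(a[0] - a[2] / 2, b[0] - b[2] / 2))
    iy = max(0.0, min(a[1] + a[3] / 2, b[1] + b[3] / 2) - max(a[1] - a[3] / 2, b[1] - b[3] / 2))
    return ix * iy / (a[2] * a[3] + b[2] * b[3])


def fill_single_gap(frames: Sequence[List[Box]],
                    threshold: float = 0.25) -> List[List[Box]]:
    """Forward matching over a window of consecutive frames (e.g. five).

    If a chain starting in frames[0] has no partner in exactly one interior
    frame, add there the average of its boxes in the two neighbouring frames.
    """
    out = [list(f) for f in frames]  # do not modify the input lists
    for seed in frames[0]:
        chain: List[Optional[Box]] = [seed]
        last = seed
        for boxes in frames[1:]:
            best = max(boxes, key=lambda b: overlap_ratio(last, b), default=None)
            if best is not None and overlap_ratio(last, best) > threshold:
                chain.append(best)
                last = best  # keep matching from the newest box
            else:
                chain.append(None)  # no partner in this frame
        gaps = [k for k, b in enumerate(chain) if b is None]
        if len(gaps) == 1 and 0 < gaps[0] < len(chain) - 1:
            k = gaps[0]
            before, after = chain[k - 1], chain[k + 1]
            out[k].append(tuple((u + v) / 2 for u, v in zip(before, after)))
    return out
```

Because `last` only advances on a match, a box in frame k+1 is compared directly with the box in frame k-1 across the gap.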

Hereinafter, a method for determining whether bounding boxes correspond to each other will be described.

FIG. 6 is a conceptual diagram of a cost function for determining the similarity between bounding boxes matched in space-time according to an embodiment of the present invention.

According to an embodiment of the present invention, a cost function may be defined to determine whether bounding boxes correspond to each other.

For example, referring to FIG. 6, consider determining whether the bounding box 71 for the previous frame (frame(t-1)) corresponds to the bounding box 72 for the current frame (frame(t)). The two bounding boxes 71 and 72 can be drawn in the same space (or frame), as shown at reference numeral 74, according to the position and area information of each box within its own frame.

The cost function may then be defined as the area 73 over which the two bounding boxes 71 and 72 overlap, divided by the sum of the areas of the individual bounding boxes 71 and 72.

The cost function defined here may take a value between 0 and 1. If the two bounding boxes are spatially close and overlap over a large area, the cost function has a value close to 1. If they are spatially far apart and the overlapping area is zero or small, it has a value close to 0.
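The cost function of FIG. 6 can be sketched as follows, assuming boxes are given as (cx, cy, w, h) tuples. Taken literally, the ratio defined above (overlap divided by the sum of the two areas) is at most 0.5, which it reaches for identical boxes. Doubling it gives the Dice coefficient, which spans 0 to 1 as described. The text does not say which normalization is intended, so both are shown, and any threshold should be set on the same scale as the cost actually used.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # assumed (cx, cy, w, h)


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two boxes (region 73 in FIG. 6)."""
    ix = max(0.0, min(a[0] + a[2] / 2, b[0] + b[2] / 2) - max(a[0] - a[2] / 2, b[0] - b[2] / 2))
    iy = max(0.0, min(a[1] + a[3] / 2, b[1] + b[3] / 2) - max(a[1] - a[3] / 2, b[1] - b[3] / 2))
    return ix * iy


def cost(a: Box, b: Box) -> float:
    """Overlap area divided by the sum of the two box areas (range 0 to 0.5)."""
    return overlap_area(a, b) / (a[2] * a[3] + b[2] * b[3])


def dice_cost(a: Box, b: Box) -> float:
    """Twice the ratio above, i.e. the Dice coefficient (range 0 to 1)."""
    return 2 * cost(a, b)
```

Two 10x10 boxes offset by half their width overlap by 50 square pixels. This gives `cost` = 50 / 200 = 0.25 and `dice_cost` = 0.5.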

Accordingly, the cost function defined according to an embodiment of the present invention is calculated for the two bounding boxes being compared. If the calculated value is greater than a preset first threshold, the two bounding boxes can be determined to correspond to each other.

Conversely, if the calculated cost function is smaller than a preset second threshold, the two bounding boxes can be determined not to correspond to each other.

The first and second thresholds may differ, with the first threshold greater than the second, but this is not required. For example, the first and second thresholds may be set to the same value.
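The two-threshold decision can be sketched as follows. The text does not say what happens when the cost falls between the second and first thresholds. Returning `None` (undecided) there is an assumption of this sketch, as are the example values used below.

```python
from typing import Optional


def corresponds(cost: float, first_threshold: float,
                second_threshold: float) -> Optional[bool]:
    """Decide correspondence from a precomputed cost value.

    True  if cost > first_threshold  (the boxes correspond)
    False if cost < second_threshold (the boxes do not correspond)
    None  otherwise (assumed here: left undecided)
    """
    if cost > first_threshold:
        return True
    if cost < second_threshold:
        return False
    return None
```

With a first threshold of 0.3 and a second of 0.1, a cost of 0.2 falls in the undecided band. With equal thresholds, only a cost exactly at the threshold is left undecided.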

FIG. 7 is a flowchart of a method of detecting an object in an image by spatio-temporal matching of bounding boxes according to an embodiment of the present invention.

Referring to FIG. 7, the method of detecting an object in an image by spatio-temporal matching of bounding boxes may include the following steps:

- detecting at least one initial bounding box for each of a plurality of temporally consecutive frames (S200);
- matching the detected initial bounding boxes with each other in space-time (S210);
- detecting at least one final bounding box by adding bounding boxes to, or deleting them from, the initial bounding boxes based on the correspondences determined by the matching (S220);
- detecting an object included in the plurality of frames using the detected final bounding boxes (S230).
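Steps S200 to S230 can be tied together roughly as follows. This is a sketch, not the patented implementation, and it rests on several assumptions:

- `detect_boxes` is a hypothetical stand-in for the CNN detector of S200.
- Boxes are (cx, cy, w, h) tuples, and the threshold value is hypothetical.
- Both rules are applied to the initial boxes over a sliding three-frame window.
- The first and last frames are left unchanged because they lack a neighbour on one side.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed (cx, cy, w, h)


def overlap_ratio(a: Box, b: Box) -> float:
    """Overlap area divided by the sum of the two areas (FIG. 6)."""
    ix = max(0.0, min(a[0] + a[2] / 2, b[0] + b[2] / 2) - max(a[0] - a[2] / 2, b[0] - b[2] / 2))
    iy = max(0.0, min(a[1] + a[3] / 2, b[1] + b[3] / 2) - max(a[1] - a[3] / 2, b[1] - b[3] / 2))
    return ix * iy / (a[2] * a[3] + b[2] * b[3])


def detect_with_matching(frames: Sequence[object],
                         detect_boxes: Callable[[object], List[Box]],
                         threshold: float = 0.25) -> List[List[Box]]:
    """S200 to S230 over a sliding three-frame window (sketch)."""

    def matches(a: Box, b: Box) -> bool:
        return overlap_ratio(a, b) > threshold

    # S200: initial bounding boxes for every frame.
    initial = [detect_boxes(f) for f in frames]
    final = [list(boxes) for boxes in initial]
    for t in range(1, len(initial) - 1):
        prev, cur, nxt = initial[t - 1], initial[t], initial[t + 1]
        # S210/S220, bidirectional: drop boxes with no partner in t-1 or t+1.
        kept = [c for c in cur
                if any(matches(c, o) for o in list(prev) + list(nxt))]
        # S210/S220, forward: add boxes seen in t-1 and t+1 but missed in t.
        for p in prev:
            if any(matches(p, c) for c in cur):
                continue
            partners = [n for n in nxt if matches(p, n)]
            if partners:
                n = max(partners, key=lambda b: overlap_ratio(p, b))
                kept.append(tuple((u + v) / 2 for u, v in zip(p, n)))
        final[t] = kept
    # S230: object detection is based on these final boxes per frame.
    return final
```

Matching is always done against the initial boxes. As a result, a box added or removed in frame t does not influence the decisions for frame t+1 in this sketch.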

Detecting the at least one initial bounding box (S200) may include three steps. First, preprocessing is performed on each of the plurality of frames. Next, each preprocessed frame is input into a convolutional neural network (CNN) to obtain bounding box candidates. Finally, bounding boxes selected from among the obtained candidates are detected as the at least one initial bounding box.
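The claims state that preprocessing converts each frame to the CNN's input resolution. Candidate boxes therefore come out in network-input pixels. They may need to be scaled back if the initial boxes are to be expressed in the original frame's pixels. The sketch below shows only that coordinate bookkeeping and a candidate-selection rule. The text does not say how candidates are selected, so the confidence threshold here is hypothetical; practical detectors commonly also apply non-maximum suppression.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]               # assumed (cx, cy, w, h)
Candidate = Tuple[float, float, float, float, float]  # assumed (cx, cy, w, h, score)


def select_initial_boxes(candidates: List[Candidate],
                         frame_size: Tuple[int, int],
                         net_size: Tuple[int, int],
                         score_threshold: float = 0.5) -> List[Box]:
    """Keep confident candidates and map them from CNN-input pixels to frame pixels.

    frame_size and net_size are (width, height); score_threshold is hypothetical.
    """
    sx = frame_size[0] / net_size[0]
    sy = frame_size[1] / net_size[1]
    return [(cx * sx, cy * sy, w * sx, h * sy)
            for cx, cy, w, h, score in candidates
            if score >= score_threshold]
```

For a 1920x1080 frame resized to a 640x360 network input, both scale factors are 3. A candidate at (100, 50) with size 20x10 and score 0.9 maps to (300, 150) with size 60x30.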

Matching the initial bounding boxes with each other in space-time (S210) may include forward matching the initial bounding boxes for the temporally earliest of the plurality of frames with the initial bounding boxes for the remaining frames.

Detecting the at least one final bounding box (S220) may include adding a bounding box to a first frame under the following condition. As a result of the forward matching, an initial bounding box for the earliest frame has no correspondence with the initial bounding boxes for the first frame. It does, however, correspond to initial bounding boxes for the remaining frames.

Matching the initial bounding boxes with each other in space-time (S210) may include bidirectionally matching the initial bounding boxes for the temporally middle frame of the plurality of frames with the initial bounding boxes for the remaining frames.

Detecting the at least one final bounding box (S220) may include deleting a first initial bounding box for the middle frame when, as a result of the bidirectional matching, it has no correspondence with the initial bounding boxes for the remaining frames.

Matching the initial bounding boxes with each other in space-time (S210) may include determining correspondences using a cost function calculated from the overlapping area of the two bounding boxes to be matched.

FIG. 8 is a block diagram of an apparatus for detecting an object in an image by spatio-temporal matching of bounding boxes according to an embodiment of the present invention.

Referring to FIG. 8, the apparatus 100 for detecting an object in an image by spatio-temporal matching of bounding boxes may include at least one processor 110 and a memory 120. The memory 120 stores instructions directing the at least one processor 110 to perform at least one step.

The at least one processor 110 may be a central processing unit (CPU), a graphics processing unit (GPU), or a dedicated processor on which the methods according to embodiments of the present invention are performed. The memory 120 and the storage device 160 may each consist of at least one of a volatile storage medium and a non-volatile storage medium. For example, the memory 120 may consist of at least one of a read-only memory (ROM) and a random access memory (RAM).

The apparatus 100 may also include a transceiver 130 that communicates over a wireless network. It may further include an input interface device 140, an output interface device 150, a storage device 160, and the like. The components of the apparatus 100 may be connected by a bus 170 and communicate with each other over it.

Detecting the at least one final bounding box may include adding a bounding box to a first frame under the following condition. As a result of the forward matching, an initial bounding box for the earliest frame has no correspondence with the initial bounding boxes for the first frame. It does, however, correspond to initial bounding boxes for the remaining frames.

FIG. 9 is a block diagram of an integrated control system to which the method and apparatus for detecting an object in an image by spatio-temporal matching of bounding boxes according to an embodiment of the present invention can be applied.

The method and apparatus described with reference to FIGS. 1 to 8 may be applied to an integrated control system for operating a smart city, in domains such as traffic, public facilities, tourism, and residential environments.

Here, the integrated control system for operating a smart city may include:

- various terminal devices 90 to 92, installed at points throughout the city or controlled by users, that generate image data and sensing data about their surroundings;
- a database 93 that collects and stores the images and sensing data from the terminal devices 90 to 92 and transmits the stored data to an image analysis server 94 or an integrated control server 95;
- the image analysis server 94, which analyzes image data received from the database 93 to identify objects in the images and transmits the identification results to the integrated control server 95; and/or
- the integrated control server 95, which monitors residential and traffic environments in the city, including parking, disasters, emergency rescue, and traffic, and provides various response services. It does so based on the images or sensing data received from the database 93 and the object identification results received from the image analysis server 94.

Here, the terminal devices 90 to 92 may be CCTV cameras 90 installed at points throughout the city, user terminals 91 used by citizens, cameras 92, and the like. Other examples of the terminal devices 90 to 92 include communication-capable desktop computers, laptop computers, notebooks, smartphones, tablet PCs, mobile phones, smart watches, smart glasses, e-book readers, portable multimedia players (PMPs), portable game consoles, navigation devices, and digital cameras. Further examples are digital multimedia broadcasting (DMB) players, digital audio recorders, digital audio players, digital video recorders, digital video players, and personal digital assistants (PDAs).

Based on the object identification results received from the image analysis server 94, the integrated control server 95 may determine whether a vehicle is illegally parked. It may also determine whether a disaster or traffic disruption has occurred at a particular location. As a response service suited to the determined situation, it may send a notification message to enforcement personnel, issue a disaster alert message, or report the situation to a disaster control headquarters.

The image analysis server 94 may receive image data from the database 93, or it may receive an image analysis request message together with image data from the integrated control server 95. The image analysis server 94 may be a server that detects objects in images using the apparatus of FIG. 8 for detecting an object in an image by spatio-temporal matching of bounding boxes.

For example, the image analysis server 94 of FIG. 9 is an image analysis server using bounding box-based object identification. It may include at least one processor and a memory storing instructions that direct the at least one processor to perform at least one step.

The at least one step may include:

- receiving CCTV video from the integrated control server 95 that manages the urban environment, or from the database 93 linked with the integrated control server 95;
- obtaining a plurality of temporally consecutive frames from the received CCTV video;
- detecting at least one initial bounding box for each of the obtained frames;
- matching the detected initial bounding boxes with each other in space-time;
- detecting at least one final bounding box by adding bounding boxes to, or deleting them from, the initial bounding boxes based on the correspondences determined by the matching;
- detecting an object included in the plurality of frames using the detected final bounding boxes;
- transmitting information on the detected object to the integrated control server.

The remaining configuration and operation of the image analysis server 94 follow the descriptions of FIGS. 1 to 8, so a detailed description is omitted.

FIG. 10 is a conceptual diagram illustrating the effect of the method and apparatus for detecting an object in an image by spatio-temporal matching of bounding boxes according to an embodiment of the present invention.

FIG. 10 shows the bounding boxes detected in each of three temporally consecutive frames: the previous frame (frame(t-1)), the current frame (frame(t)), and the subsequent frame (frame(t+1)).

First, a first bounding box 96 and a second bounding box 97 are detected in the previous frame (frame(t-1)). In the current frame (frame(t)), the object of the first bounding box 96 is hidden behind a vehicle and becomes an occluded region. As a result, a conventional bounding box detection method does not detect the first bounding box 96 there. In the subsequent frame (frame(t+1)), the first bounding box 96 and the second bounding box 97 are detected again.

With the method and apparatus according to an embodiment of the present invention, however, the missed box can be recovered. The first bounding box 96 and the second bounding box 97 were detected in both the previous frame (frame(t-1)) and the subsequent frame (frame(t+1)). This can be used to additionally detect, in the current frame (frame(t)), the first bounding box 96 that was missed in the occluded region.

Therefore, the method and apparatus according to the present invention can detect accurate bounding boxes. They do so by compensating for false detections and missed detections among the initial bounding boxes produced by the CNN.

The methods according to the present invention may be implemented as program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known and available to those skilled in the art of computer software.

Examples of computer-readable media include hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include machine code such as that produced by a compiler, as well as high-level language code that a computer can execute using an interpreter or the like. The hardware devices described above may be configured to operate as at least one software module to perform the operations of the present invention, and vice versa.

In addition, the method or apparatus described above may be implemented with all or part of its components or functions combined, or implemented separately.

The present invention has been described above with reference to preferred embodiments. Those skilled in the art will understand that it can be variously modified and changed without departing from the spirit and scope of the present invention as set forth in the claims below.

Claims

A method for detecting an object in an image by spatio-temporal matching of a bounding box, comprising:
detecting at least one initial bounding box for each of a plurality of temporally consecutive frames;
matching the at least one initial bounding box for each of the plurality of frames in space-time;
determining at least one final bounding box by deleting the at least one initial bounding box or adding a new bounding box based on a matching result; and
detecting the object included in the plurality of frames by using the final bounding box,
wherein matching the at least one initial bounding box with each other in space-time comprises at least one of: determining whether there is an undetected bounding box by determining, using a predetermined cost function, whether forward matching holds between the initial bounding box for the temporally earliest frame among the plurality of frames and the initial bounding boxes for the remaining frames; and determining whether there is an erroneously detected bounding box by determining, using the cost function, whether bidirectional matching holds.

In claim 1,
The step of detecting the at least one initial bounding box comprises:
performing pre-processing on each of the plurality of frames;
obtaining a bounding box candidate by inputting the plurality of frames on which preprocessing has been performed into a Convolutional Neural Network (CNN), respectively;
A method of detecting an object in an image by spatio-temporal matching of a bounding box, comprising the step of detecting a bounding box selected from among the obtained bounding box candidates as the at least one initial bounding box.

In claim 2,
The pre-processing step is
A method of detecting an object in an image by spatio-temporal matching of a bounding box, comprising converting the resolution of the plurality of frames to the input resolution of the CNN.

delete

In claim 1,
The step of detecting the at least one final bounding box comprises:
A method of detecting an object in an image by spatio-temporal matching of a bounding box, comprising adding a bounding box to a first frame when, as a result of the forward matching, the initial bounding box for the earliest frame among the plurality of frames has no correspondence with the initial bounding box for the first frame but has a correspondence with the initial bounding boxes for the remaining frames.

delete

In claim 1,
The step of detecting the at least one final bounding box comprises:
A method of detecting an object in an image by spatio-temporal matching of a bounding box, comprising deleting the first initial bounding box when, as a result of the bidirectional matching, a first initial bounding box for the intermediate frame has no correspondence with the initial bounding boxes for the remaining frames.

In claim 1,
The initial bounding box is,
A method of detecting an object in an image by spatio-temporal matching of a bounding box, which is expressed by a center pixel position coordinate value and a size.

delete

In claim 1,
The cost function is
A method of detecting an object in an image by spatio-temporal matching of bounding boxes, which is defined as a value obtained by dividing an overlapping area of matching bounding boxes by the sum of areas occupied by each of the matching bounding boxes.

An apparatus for detecting an object in an image by spatio-temporal matching of a bounding box, comprising:
at least one processor; and
a memory for storing instructions instructing the at least one processor to perform at least one step;
The at least one step is
detecting at least one initial bounding box for each of a plurality of temporally consecutive frames;
matching the at least one initial bounding box for each of the plurality of frames in space-time;
determining at least one final bounding box by deleting the at least one initial bounding box or adding a new bounding box based on a matching result; and
Detecting the object included in the plurality of frames by using the final bounding box,
wherein matching the at least one initial bounding box with each other in space-time comprises at least one of: determining whether there is an undetected bounding box by determining, using a predetermined cost function, whether forward matching holds between the initial bounding box for the temporally earliest frame among the plurality of frames and the initial bounding boxes for the remaining frames; and determining whether there is an erroneously detected bounding box by determining, using the cost function, whether bidirectional matching holds.

In claim 11,
The step of detecting the at least one initial bounding box comprises:
performing pre-processing on each of the plurality of frames;
obtaining a bounding box candidate by inputting the plurality of frames on which preprocessing has been performed into CNN;
An apparatus for detecting an object in an image by spatio-temporal matching of a bounding box, comprising the step of detecting a bounding box selected from among the obtained bounding box candidates as the at least one initial bounding box.

In claim 12,
The pre-processing step is
An apparatus for detecting an object in an image by spatio-temporal matching of a bounding box, comprising converting the resolution of the plurality of frames to the input resolution of the CNN.

delete

In claim 11,
The step of detecting the at least one final bounding box comprises:
An apparatus for detecting an object in an image by spatio-temporal matching of a bounding box, comprising adding a bounding box to a first frame when, as a result of the forward matching, the initial bounding box for the earliest frame among the plurality of frames has no correspondence with the initial bounding box for the first frame but has a correspondence with the initial bounding boxes for the remaining frames.

delete

In claim 11,
The step of detecting the at least one final bounding box comprises:
An apparatus for detecting an object in an image by spatio-temporal matching of a bounding box, comprising deleting the first initial bounding box when, as a result of the bidirectional matching, a first initial bounding box for the intermediate frame has no correspondence with the initial bounding boxes for the remaining frames.

delete

In claim 11,
The cost function is
An apparatus for detecting an object in an image by spatio-temporal matching of bounding boxes, wherein the cost function is defined as the overlapping area of the matched bounding boxes divided by the sum of the areas occupied by each of the matched bounding boxes.

An image analysis server using bounding box-based object identification, comprising:
at least one processor; and
a memory storing instructions instructing the at least one processor to perform at least one step;
The at least one step is
Receiving a CCTV image from an integrated control server that manages an urban environment or a database linked with the integrated control server;
acquiring a plurality of temporally consecutive frames from the acquired CCTV image;
detecting at least one initial bounding box for each of the plurality of acquired frames;
matching the at least one initial bounding box for each of the plurality of frames in space-time;
determining at least one final bounding box by deleting the at least one initial bounding box or adding a new bounding box based on a matching result;
detecting an object included in the plurality of frames by using the final bounding box; and
Transmitting information on the detected object to the integrated control server,
wherein matching the at least one initial bounding box with each other in space-time comprises at least one of: determining whether there is an undetected bounding box by determining, using a predetermined cost function, whether forward matching holds between the initial bounding box for the temporally earliest frame among the plurality of frames and the initial bounding boxes for the remaining frames; and determining whether there is an erroneously detected bounding box by determining, using the cost function, whether bidirectional matching holds.