KR102375541B1

KR102375541B1 - Apparatus for Providing Artificial Intelligence Service with structured consistency loss and Driving Method Thereof

Info

Publication number: KR102375541B1
Application number: KR1020210148554A
Authority: KR
Inventors: 김종목; 나종근
Original assignee: 주식회사 스누아이랩
Priority date: 2021-11-02
Filing date: 2021-11-02
Publication date: 2022-03-17

Abstract

The present invention relates to an artificial intelligence service apparatus having a structural consistency loss and a driving method thereof. The artificial intelligence service apparatus according to an embodiment of the present invention may comprise: a communication interface unit that receives an image consisting of a series of video frames; and a control unit that, for the received series of video frames, performs deep learning analysis after dropping video frames at mutually different positions in a first deep learning network and a second deep learning network on different paths, the analysis fetching the video data of the video frame corresponding to each dropped position from the adjacent path, and that outputs a first deep learning analysis result and a second deep learning analysis result. The present invention can thereby increase the generalization ability of a deep learning network.

Description

Apparatus for Providing Artificial Intelligence Service with structured consistency loss and Driving Method Thereof

The present invention relates to an artificial intelligence service apparatus having a structural consistency loss and a method of driving the apparatus. More particularly, it relates to an apparatus and driving method that enforce consistent relationships between, for example, the video frames of an image, so that the relationships between frame features, which are important in video classification, remain consistent, and that thereby improve learning performance through the effect of the structural consistency loss added to semi-supervised learning in semantic segmentation.

A conventional deep learning network is trained in the direction that minimizes an objective function called the loss function. The essence of a neural network is that it learns from data, and the loss function is the answer to how it learns from that data. The difference between the behavior desired of the network and its current behavior is defined as a function, the value of that difference is computed, and the value is used to improve the network (by gradient descent, for example).
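As an illustration of the preceding paragraph only, the following is a minimal sketch of loss minimization by gradient descent on a one-parameter model. The model `w * x`, the squared-error loss, the learning rate, and all function names are assumptions made for this example; the patent does not specify them.

```python
def squared_error(w: float, x: float, target: float) -> float:
    """Loss: squared gap between the model output w * x and the desired output."""
    return (w * x - target) ** 2


def gradient(w: float, x: float, target: float) -> float:
    """Derivative of the squared error with respect to the parameter w."""
    return 2.0 * x * (w * x - target)


def train(w: float, x: float, target: float, lr: float = 0.1, steps: int = 50) -> float:
    """Repeatedly move w against the gradient so that the loss shrinks."""
    for _ in range(steps):
        w -= lr * gradient(w, x, target)
    return w


# Hypothetical toy problem: find w such that w * 1.0 is close to 3.0.
w_final = train(w=0.0, x=1.0, target=3.0)
```

Each step shrinks the gap between the current output and the desired output, which is the sense in which the loss value "is used to improve the network".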

A consistency loss rests on the following idea: even when a plurality of images is generated from one image through data augmentation, the images all derive from the same original image and can therefore be regarded as carrying the same semantic information. The loss makes a deep learning network predict consistently across these images and thereby strengthens its generalization ability. Data augmentation is a technique for increasing the amount of data by adding slightly modified copies of existing data or synthetic data newly generated from existing data. It acts as a regularizer and helps reduce overfitting when training a machine learning model.

Conventional consistency losses have been used in the field of image classification: one prediction vector is generated per image, and the loss is computed by measuring the distance between the vectors (for example, by cosine similarity).
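The classification-style consistency loss described above can be sketched as follows. This is a hedged illustration, not the patent's implementation: turning cosine similarity into a loss as `1 - similarity` is an assumption, and the two prediction vectors are made-up values standing in for a network's outputs on two augmented views of one image.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def consistency_loss(pred_a: Sequence[float], pred_b: Sequence[float]) -> float:
    """Assumed form: 0 when the predictions point the same way, larger as they diverge."""
    return 1.0 - cosine_similarity(pred_a, pred_b)


# Hypothetical class-score vectors for two augmented views of the same image.
view_a = [0.7, 0.2, 0.1]
view_b = [0.6, 0.3, 0.1]
loss = consistency_loss(view_a, view_b)
```

Minimizing this value pushes the network toward giving the same prediction for every augmented copy of an image.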

In dense prediction (e.g., object detection, semantic segmentation), however, multiple predictions must be made per image, and in segmentation in particular a prediction vector is generated at every pixel of the image. In this case the concept of consistency loss from image classification can be extended to the concept of a pixel-wise consistency loss.
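A pixel-wise consistency loss can be sketched as the per-pixel loss averaged over the image. The sketch below assumes a squared Euclidean distance per pixel and nested Python lists as the prediction-map layout; both are choices made for illustration only.

```python
from typing import List

Vector = List[float]
PredictionMap = List[List[Vector]]  # indexed as [row][column] -> prediction vector


def pixelwise_consistency_loss(pred_a: PredictionMap, pred_b: PredictionMap) -> float:
    """Mean squared Euclidean distance between matching pixels of two prediction maps."""
    total = 0.0
    count = 0
    for row_a, row_b in zip(pred_a, pred_b):
        for vec_a, vec_b in zip(row_a, row_b):
            total += sum((x - y) ** 2 for x, y in zip(vec_a, vec_b))
            count += 1
    if count == 0:
        raise ValueError("prediction maps must contain at least one pixel")
    return total / count


# Two hypothetical 1x2 prediction maps with two class scores per pixel.
map_a = [[[1.0, 0.0], [0.0, 1.0]]]
map_b = [[[1.0, 0.0], [1.0, 0.0]]]
loss = pixelwise_consistency_loss(map_a, map_b)
```

Note that each pixel is compared only with the pixel at the same position in the other map, so a loss of this form carries no information about how different pixels relate to one another.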

However, a simple extension of the existing concept cannot train a network on the relationships between pixels, and extending the current concept to video (or image sequence) classification likewise cannot train it on the relationships between frames.

Korean Registered Patent Publication No. 10-2219561 (2021.02.18)
Korean Laid-open Patent Publication No. 10-2021-0029089 (2021.03.15)

An object of an embodiment of the present invention is to provide an artificial intelligence service apparatus having a structural consistency loss, and a method of driving the apparatus, that enforce consistent relationships between, for example, the video frames of an image, so that the relationships between frame features, which are important in video classification, remain consistent, and that thereby improve learning performance through the effect of the structural consistency loss added to semi-supervised learning in semantic segmentation.

An artificial intelligence service apparatus having a structural consistency loss according to an embodiment of the present invention includes: a communication interface unit that receives an image composed of a series of video frames; and a control unit that, for the received series of video frames, drops video frames at mutually different positions in a first deep learning network and a second deep learning network on different paths and then performs deep learning analysis, fetching the video data of the video frame corresponding to each dropped position from the adjacent path for the analysis, and outputs a first deep learning analysis result and a second deep learning analysis result, respectively.

When the control unit drops the N-th video frame (where N is a positive integer) of the received series of video frames in the first deep learning network, it may retain the N-th video frame in the second deep learning network.

During the deep learning analysis operation of the first deep learning network, the control unit may fetch and analyze the video data of the N-th video frame from the second deep learning network.

The control unit may process the same video frames in parallel in the first deep learning network and the second deep learning network and output a deep learning analysis result from each.

The control unit may process the video data of a video frame in the form of a feature vector for a designated object in the video frame.

The control unit may drop one or more video frames when the same content in the video frames persists for a designated time.
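One way to picture the content-dependent drop described above is the sketch below. It is an assumption-laden illustration: each frame is represented by a hypothetical content signature (any comparable value), the "designated time" is expressed as a minimum run length in frames, and the rule of keeping one frame at each end of an unchanged run is a choice made for this example.

```python
from typing import Hashable, Sequence, Set


def middle_drop_positions(
    signatures: Sequence[Hashable], min_run: int = 5, keep_each_side: int = 1
) -> Set[int]:
    """Return 0-based positions to drop: the interior of every run of
    identical content that lasts at least `min_run` frames."""
    drops: Set[int] = set()
    start = 0
    n = len(signatures)
    while start < n:
        end = start
        # Extend the run while the next frame has the same content signature.
        while end + 1 < n and signatures[end + 1] == signatures[start]:
            end += 1
        if end - start + 1 >= min_run:
            drops.update(range(start + keep_each_side, end - keep_each_side + 1))
        start = end + 1
    return drops


# Hypothetical signatures: five frames of one scene followed by a scene change.
positions = middle_drop_positions(["a", "a", "a", "a", "a", "b"])
```

With these defaults, a scene that stays unchanged for five frames keeps its first and last frame and offers the middle frames as drop candidates.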

In addition, a method of driving an artificial intelligence service apparatus having a structural consistency loss according to an embodiment of the present invention includes: receiving, by a communication interface unit, an image composed of a series of video frames; and, by a control unit, for the received series of video frames, dropping video frames at mutually different positions in a first deep learning network and a second deep learning network on different paths and then performing deep learning analysis, fetching the video data of the video frame corresponding to each dropped position from the adjacent path for the analysis, and outputting a first deep learning analysis result and a second deep learning analysis result, respectively.

According to an embodiment of the present invention, whereas video frame segmentation involves pixel-wise prediction, video classification can produce a feature vector for each frame in the video, and enforcing consistent relationships between these feature vectors can strengthen the generalization ability of the deep learning network.

In addition, according to an embodiment of the present invention, the effect of the structural consistency loss added to semi-supervised learning in semantic segmentation can thereby improve learning performance.

FIG. 1 is a diagram showing an artificial intelligence service system having a structural consistency loss according to an embodiment of the present invention;
FIG. 2 is a diagram for explaining the operation method of the artificial intelligence service apparatus having a structural consistency loss of FIG. 1;
FIGS. 3 and 4 are diagrams for explaining a method of deep learning analysis of an intoxicated person in actual video;
FIG. 5 is a block diagram illustrating the detailed structure of the artificial intelligence service apparatus having a structural consistency loss of FIG. 1; and
FIG. 6 is a flowchart showing the driving process of the artificial intelligence service apparatus having a structural consistency loss of FIG. 1.

Hereinafter, an embodiment of the present invention will be described in detail with reference to the drawings.

FIG. 1 is a diagram showing an artificial intelligence service system having a structural consistency loss according to an embodiment of the present invention, and FIG. 2 is a diagram for explaining the operation method of the artificial intelligence service apparatus having a structural consistency loss of FIG. 1.

As shown in FIG. 1, the artificial intelligence service system 90 according to an embodiment of the present invention includes some or all of a photographing apparatus 100, an image providing apparatus 110, a communication network 120, and an artificial intelligence service apparatus 130.

Here, "includes some or all" means, for example, that the artificial intelligence service system 90 may be configured with some components, such as the photographing apparatus 100 or the image providing apparatus 110, omitted, or that part or all of the artificial intelligence service apparatus 130 may be integrated into a network device (e.g., a wireless switching device) constituting the communication network 120. To aid a full understanding of the invention, the system is described as including all of the components.

The photographing apparatus 100 includes a camera installed in an arbitrary area or space, such as a road or a subway space, to photograph a specific area in order to determine whether an incident or accident has occurred. The camera includes a CCTV or IP camera and may be a fixed camera or a PTZ (pan, tilt, zoom) camera. The photographing apparatus 100 may photograph the area and transmit the captured images in real time or periodically.

The image providing apparatus 110 includes devices other than the photographing apparatus 100 that provide captured images, for example images captured through a black box, a camcorder, or a smartphone, or previously captured images provided through a desktop computer, a laptop computer, a tablet PC, a smartphone, or a wearable device worn on the wrist or elsewhere that the user carries. For example, a digital video recorder (DVR) in the communication network 120 that stores images captured by a specific device may be included. When the artificial intelligence service apparatus 130 provides an artificial intelligence service to users in the form of a platform, it may, for example, receive a customer's images, analyze them, and then provide the deep learning analysis result.

The communication network 120 includes both wired and wireless communication networks. For example, a wired or wireless Internet network may be used as, or interworked with, the communication network 120. Here, the wired network includes an Internet network such as a cable network or a public switched telephone network (PSTN), and the wireless communication network includes CDMA, WCDMA, GSM, Evolved Packet Core (EPC), Long Term Evolution (LTE), and WiBro networks. The communication network 120 according to an embodiment of the present invention is of course not limited to these and may also be, for example, a cloud computing network in a cloud computing environment or a 5G network. When the communication network 120 is a wired network, an access point in the communication network 120 may connect to a telephone company's switching office or the like; when it is a wireless network, the access point may process data by connecting to an SGSN or a GGSN (Gateway GPRS Support Node) operated by the carrier, or by connecting to various relays such as a BTS (Base Transceiver Station), NodeB, or e-NodeB.

The communication network 120 may include an access point. The access point here includes a small base station, such as a femto or pico base station, of the kind often installed in buildings. Within the classification of small base stations, femto and pico base stations are distinguished by the maximum number of devices, such as the photographing apparatus 100 or the image providing apparatus 110, that can connect to them. The access point may of course include a short-range communication module for short-range communication, such as Zigbee or Wi-Fi, with the photographing apparatus 100 or the image providing apparatus 110. The access point may use TCP/IP or RTSP (Real-Time Streaming Protocol) for wireless communication. Besides Wi-Fi, short-range communication may be performed under various standards, including Bluetooth, Zigbee, infrared, radio frequency (RF) such as ultra high frequency (UHF) and very high frequency (VHF), and ultra-wideband (UWB) communication. Accordingly, the access point can extract the location of a data packet, designate the best communication path for the extracted location, and forward the data packet along the designated path to the next device, for example the artificial intelligence service apparatus 130. The access point can share multiple lines in a general network environment and includes, for example, a router, a repeater, and a relay.

The artificial intelligence service apparatus 130 may operate, for example, as a server; it is equipped with an artificial intelligence program and systematically classifies and stores image analysis results in the DB 130a. The artificial intelligence service apparatus 130 according to an embodiment of the present invention trains the deep learning network in the direction that minimizes an objective function called the loss function. That is, the artificial intelligence service apparatus 130 performs operations related to the structural consistency loss for video classification. To this end, it enforces consistent relationships between frames, so that the relationships between frame features, which are important in video classification, are kept consistent. In this way, the effect of the structural consistency loss added to semi-supervised learning in semantic segmentation improves learning performance.

More specifically, as shown in FIG. 2, to perform deep learning with the artificial intelligence program, the artificial intelligence service apparatus 130 runs the deep learning operation on the same series of video frames in each of a first deep learning network (or first deep learning unit) and a second deep learning network (or second deep learning unit). Here, the first and second deep learning networks may be a first deep learning module and a second deep learning module, and they preferably have the same deep learning program, although the invention is not particularly limited thereto. The plurality of deep learning networks, that is, modules, can be regarded as processing the same video frames in parallel for deep learning. For example, if the first through seventh video frames are received during a designated time, those frames may be provided to both the first and the second deep learning network. Although the embodiment illustrates two deep learning networks, in practice it is preferable to configure a plurality of deep learning networks. The deep learning operation related to the loss function has been sufficiently described above, and that description applies here.

As shown in FIG. 2, the artificial intelligence service apparatus 130 can produce a feature vector for each frame of a video in video classification, and by enforcing consistent relationships between these feature vectors it can be regarded as strengthening the generalization ability of the deep learning network. A typical case is determining an object's movement when tracking it across video frames by processing vector values for the tracked object. For example, when an object in a video frame has moved a distance L in the X direction, the object can be tracked by processing the data as vector values based on the object's position (e.g., its (x, y) value within the video frame). Processing images for deep learning analysis, tracking the position of an object in the process, and determining an incident or accident event involving the object can of course be done in various ways, so the embodiment is not particularly limited to any one form. FIG. 2 shows the per-frame feature vectors in the different deep learning networks and the cosine similarity between feature vectors, and through them the structural consistency loss (e.g., an L2 distance loss) for video classification. Cosine similarity, used for example to measure the similarity of text data, measures how similar two vectors are by computing the cosine of the angle between them.
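The combination of per-frame feature vectors, pairwise cosine similarity, and an L2 distance can be sketched as below. This is one possible reading of the description, not the patent's stated formula: each network's frame features are turned into a frame-by-frame cosine similarity matrix, and the loss is the mean squared difference between the two matrices. The function names and the averaging are assumptions.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def similarity_matrix(features: Sequence[Vector]) -> List[List[float]]:
    """Entry [i][j] is the cosine similarity between frame i and frame j."""
    return [[cosine_similarity(fi, fj) for fj in features] for fi in features]


def structured_consistency_loss(
    features_1: Sequence[Vector], features_2: Sequence[Vector]
) -> float:
    """Assumed form: mean squared (L2) difference between the two networks'
    inter-frame similarity matrices."""
    if len(features_1) != len(features_2) or not features_1:
        raise ValueError("both networks must supply the same non-zero number of frames")
    s1 = similarity_matrix(features_1)
    s2 = similarity_matrix(features_2)
    n = len(s1)
    return sum((s1[i][j] - s2[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)


# Hypothetical two-frame features from network 1 and network 2.
net_1 = [[1.0, 0.0], [1.0, 0.0]]  # network 1 sees the two frames as identical
net_2 = [[1.0, 0.0], [0.0, 1.0]]  # network 2 sees them as unrelated
loss = structured_consistency_loss(net_1, net_2)
```

Because the loss compares relationships between frames rather than the feature vectors themselves, two networks whose features differ by a common rotation still incur zero loss under this form.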

More specifically, the artificial intelligence service apparatus 130 according to an embodiment of the present invention drops video frames at different time positions from the series of video frames with identical content that is input to the different networks. For example, if the first deep learning network drops the (n-1)-th of five input video frames (n = 5), the second deep learning network, for the same five video frames, drops at least one video frame from the remaining frames, excluding the (n-1)-th. This process can be regarded as reducing the amount of computation on the data of identical video frames during the deep learning operation so that computation proceeds quickly. The drop operation may be preset, but it may also vary with the content of the video frames. In other words, if the same scene persists over five video frames, the middle two or three frames may well be dropped.
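The drop rule in this paragraph can be sketched as follows, under stated assumptions: frames are stand-in strings, positions are 0-based, a dropped frame is marked with `None`, and the helper names are hypothetical. The check for overlapping drop positions reflects the requirement that the two paths drop different positions.

```python
from typing import List, Optional, Sequence, Set, Tuple


def drop_frames(frames: Sequence[str], drop_positions: Set[int]) -> List[Optional[str]]:
    """Replace the frames at the given 0-based positions with None."""
    return [None if i in drop_positions else frame for i, frame in enumerate(frames)]


def make_two_paths(
    frames: Sequence[str], drop_1: Set[int], drop_2: Set[int]
) -> Tuple[List[Optional[str]], List[Optional[str]]]:
    """Build the inputs of the two network paths, refusing overlapping drops."""
    overlap = drop_1 & drop_2
    if overlap:
        raise ValueError(f"positions dropped in both paths: {sorted(overlap)}")
    return drop_frames(frames, drop_1), drop_frames(frames, drop_2)


# Five frames (n = 5): path 1 drops the (n-1)-th frame, i.e. 0-based index 3;
# path 2 drops a different frame, here the fifth (index 4).
frames = ["f1", "f2", "f3", "f4", "f5"]
path_1, path_2 = make_two_paths(frames, drop_1={3}, drop_2={4})
```

Every position remains available in at least one path, so a frame missing from one path can still be obtained from the other.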

Above all, the video data of a dropped video frame may, as mentioned above, be data in the form of a feature vector. Accordingly, after the plurality of deep learning networks drop video frames at mutually different positions, the artificial intelligence service apparatus 130 according to an embodiment of the present invention fetches, for each dropped video frame (e.g., the fourth), the video data, that is, the feature vector, of the corresponding video frame from the adjacent deep learning network, performs the deep learning operation, and then outputs the analysis result.
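Fetching a dropped frame's feature vector from the adjacent path can be sketched as below. The representation is an assumption: each path holds one feature vector per frame position, with `None` at positions it dropped, and the hypothetical helper `fill_from_adjacent` fills the gaps from the other path.

```python
from typing import List, Optional, Sequence

Vector = List[float]


def fill_from_adjacent(
    own: Sequence[Optional[Vector]], adjacent: Sequence[Optional[Vector]]
) -> List[Vector]:
    """Return this path's per-frame features, taking the feature vector from
    the adjacent path wherever this path dropped the frame."""
    if len(own) != len(adjacent):
        raise ValueError("both paths must cover the same frame positions")
    filled: List[Vector] = []
    for position, (own_vec, adj_vec) in enumerate(zip(own, adjacent)):
        if own_vec is not None:
            filled.append(own_vec)
        elif adj_vec is not None:
            filled.append(adj_vec)
        else:
            raise ValueError(f"frame {position} was dropped in both paths")
    return filled


# Hypothetical features: path 1 dropped the fourth frame, path 2 the fifth.
path_1 = [[0.1, 0.9], [0.2, 0.8], [0.3, 0.7], None, [0.5, 0.5]]
path_2 = [[0.1, 0.9], [0.2, 0.8], [0.3, 0.7], [0.4, 0.6], None]
filled_1 = fill_from_adjacent(path_1, path_2)
```

The error raised when a position is missing from both paths mirrors why the paths are expected to drop non-overlapping positions.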

FIGS. 3 and 4 are diagrams for explaining a method of deep learning analysis of an intoxicated person in actual video.

For convenience of description, referring to FIGS. 3 and 4 together with FIG. 1, the artificial intelligence service apparatus 130 of FIG. 1 according to an embodiment of the present invention may receive (captured) images consisting of video frames, as in (a) of FIG. 3, from the photographing apparatus 100 or the image providing apparatus 110 of FIG. 1. Here, the image providing apparatus 110 includes a device that provides video captured through a black box, a user's smartphone, or the like.

The artificial intelligence service apparatus 130 inputs the series of video frames shown in (a) of FIG. 3 into the first deep learning network and the second deep learning network, as shown in (b) and (c) of FIG. 3, respectively. Here, the first and second deep learning networks may mean deep learning modules, that is, IC chips or the like on which a deep learning program is mounted. For convenience of description FIG. 3 illustrates two deep learning networks, but further deep learning networks may be configured, so the embodiment is not particularly limited to any one form.

From the series of video frames input to the first deep learning network, for example the first through fifth video frames, the artificial intelligence service apparatus 130 drops the second and third video frames and performs deep learning analysis in that network. From the same series of video frames, as in (a) of FIG. 3, input to the second deep learning network, it drops the fifth video frame and performs deep learning analysis in that network. When configured with a plurality of deep learning networks, the artificial intelligence service apparatus 130 according to an embodiment of the present invention preferably does not drop video frames at overlapping positions. This is required so that the data of a video frame dropped in one deep learning network can be fetched from an adjacent deep learning network, and it can therefore be regarded as essential to the deep learning.

In this way, as can be seen in (a) of FIG. 4, the artificial intelligence service apparatus 130 of FIG. 1 has a feature vector for image 1 and a feature vector for image 4. Consider the relationship between features 1 and 4 in deep learning network 1, shown in (b) of FIG. 4, and the relationship between features 1 and 4 in deep learning network 2, shown in (c) of FIG. 4. Network 1 does not see images 2 and 3; because they were dropped, it cannot know what those video frames are, and it has difficulty grasping the relationship between frames 1 and 4. Network 2 sees both images 2 and 3 and can recognize what those video frames are, so it grasps the relationship easily. A structured consistency loss is therefore applied so that network 1 can learn the relationship between the feature vectors (1 to 4) of network 2. Without video frames 2 and 3, network 1 would have difficulty predicting the label of this video, an intoxicated person. By learning the inter-frame relationship from the prediction of network 2, however, it can confirm that the same person across frames 1 to 4 takes off an outer garment and is about to lie down, and can thus predict the intoxicated-person class, that is, the category.

As described above, the artificial intelligence service apparatus 130 according to an embodiment of the present invention performs learning by fetching, from an adjacent deep learning network, the data for a video frame dropped in one deep learning network, so that computation in each deep learning network is fast while learning accuracy also increases. As a result, the deep learning network is trained in the direction that minimizes the objective function called the loss function.

FIG. 5 is a block diagram illustrating the detailed structure of the artificial intelligence service device with structured consistency loss of FIG. 1.

As shown in FIG. 5, the artificial intelligence service device 130 according to an embodiment of the present invention includes some or all of a communication interface unit 500, a control unit 510, a feature vector forcing unit 520, and a storage unit 530.

Here, "includes some or all" means that the artificial intelligence service device 130 may be configured with some components, such as the storage unit 530, omitted, or that some components, such as the feature vector forcing unit 520, may be integrated into another component such as the control unit 510. To aid a full understanding of the invention, the description below assumes that all components are included.

Although the artificial intelligence service device 130 according to an embodiment of the present invention was described with reference to FIG. 1 as operating in the form of a server, it may also be an individual device configured in stand-alone form, such as a computer. In that case the operations described above can equally be performed by connecting a storage medium (e.g., a USB device) to a port, so embodiments of the present invention are not limited to either form.

The communication interface unit 500 receives the video frames of a moving image from the photographing device 100 or the image providing device 110 via the communication network 120 of FIG. 1, and provides the received video frames to the control unit 510. In this process the communication interface unit 500 may perform operations such as modulation/demodulation, multiplexing/demultiplexing, and encoding/decoding. These are obvious to those skilled in the art, so further description is omitted.

The control unit 510 is responsible for overall control of the communication interface unit 500, the feature vector forcing unit 520, and the storage unit 530 of FIG. 5, which constitute the artificial intelligence service device 130 of FIG. 1. Specifically, the control unit 510 may store the series of video frames received by the communication interface unit 500 in the storage unit 530 in order, then read them back in the stored order and provide them to the feature vector forcing unit 520. The control unit 510 may also control the feature vector forcing unit 520 to execute the artificial intelligence program installed in it for forcing feature vectors, and may control the communication interface unit 500 so that the execution results are systematically classified and stored in the DB 130a of FIG. 1.

The feature vector forcing unit 520 may comprise a plurality of deep learning network modules. These modules may operate in parallel to perform deep learning. Each deep learning module drops video frames at different temporal positions, and obtains the video data for its dropped frames, for example data in the form of feature vectors, from an adjacent deep learning module. Here, "adjacent" does not mean only the second or fourth deep learning module in the case of the third deep learning module. Data for a dropped video frame may also be fetched by skipping over modules, for example from the first or fifth deep learning module. More precisely, it is preferable to fetch the video data, that is, the feature vector, from a deep learning module that retains the video data for the dropped video frame.
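The borrowing scheme described above can be sketched as follows, under stated assumptions. The sketch assumes each module's drop positions are known to the others, represents them as a list of sets, and picks the retaining module that is closest by module index; the names `find_holder` and `assemble_features` and the tie-breaking rule are hypothetical and are not taken from the patent.

```python
def find_holder(frame_idx, drop_sets, requester):
    """Return the index of the nearest module that retains frame_idx, or None.

    drop_sets[m] is the set of frame positions dropped by module m.
    Nearness is measured by module index distance (an assumed rule);
    ties go to the lower module index.
    """
    candidates = [m for m, dropped in enumerate(drop_sets)
                  if m != requester and frame_idx not in dropped]
    if not candidates:
        return None
    return min(candidates, key=lambda m: (abs(m - requester), m))

def assemble_features(requester, drop_sets, features):
    """Build one module's full per-frame feature list, borrowing dropped frames.

    features[m][t] is the feature module m computed for frame t,
    or None where module m dropped frame t.
    """
    own = features[requester]
    out = []
    for t in range(len(own)):
        if t in drop_sets[requester]:
            holder = find_holder(t, drop_sets, requester)
            out.append(features[holder][t] if holder is not None else None)
        else:
            out.append(own[t])
    return out

# Three modules, five frames (0-indexed); string labels stand in for feature vectors.
drop_sets = [{1, 2}, {3}, {1}]
features = [[None if t in drop_sets[m] else f"m{m}f{t}" for t in range(5)]
            for m in range(3)]
module0 = assemble_features(0, drop_sets, features)
```

In this example module 0 fills frames 1 and 2 from module 1. If module 1 had also dropped frame 2, the lookup would skip over it to module 2, which matches the "skipping" case in the text.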

In this way, the feature vector forcing unit 520 forces the analysis result of the dropped video frames, that is, the feature vectors. In other words, it operates so that the network acquires generalization ability. To that end, the feature vector forcing unit 520 can be regarded as dropping video frames in a preset manner and training in the direction that reduces the loss function. For example, the number of video frames dropped may depend on the content of the received video frames. When an object in the video frames moves quickly, it is preferable to drop fewer frames; when the object moves slowly, dropping more frames reduces the computational load and speeds up processing. The dropped video frames may thus vary with the characteristics of the received video frames. In embodiments of the present invention a fixed scheme is preferable to a variable one, but the invention is not limited to either form.
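A variable drop count of this kind could be chosen as in the sketch below. The patent does not say how motion is measured or how it maps to a drop count, so the mean absolute pixel difference between consecutive frames, the linear mapping, and the parameters `max_drop` and `fast_threshold` are all illustrative assumptions.

```python
def mean_abs_diff(frame_a, frame_b):
    """Mean absolute difference between two frames given as flat pixel lists."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)

def motion_score(frames):
    """Average frame-to-frame change across a clip (assumed proxy for object speed)."""
    diffs = [mean_abs_diff(a, b) for a, b in zip(frames, frames[1:])]
    return sum(diffs) / len(diffs) if diffs else 0.0

def choose_drop_count(frames, max_drop=3, fast_threshold=20.0):
    """Drop fewer frames when motion is fast, more when it is slow.

    A score of 0 gives max_drop; a score at or above fast_threshold gives 0.
    """
    scale = max(0.0, 1.0 - motion_score(frames) / fast_threshold)
    return int(max_drop * scale)

static_clip = [[0, 0], [0, 0], [0, 0]]        # no motion
slow_clip = [[0, 0], [10, 10]]                # moderate change
fast_clip = [[0, 0], [100, 100], [0, 0]]      # large change every frame

drops = [choose_drop_count(c) for c in (static_clip, slow_clip, fast_clip)]
```

With the default parameters the static clip allows the maximum number of drops and the fast clip allows none, which follows the preference stated in the text.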

The storage unit 530 may temporarily store, and then output, the video frame data received under the control of the control unit 510. For video data received in data packets, the header typically carries additional information about the video data, such as how it was encoded and, for video, subtitle information, while the payload carries the actual data for the pixels constituting each video frame, that is, the pixel values. Accordingly, as mentioned above, the compressed (encoded) video data is decoded to restore the original video frames, with the encoding information and the like consulted as additional information during restoration. The series of restored video frames is stored in the storage unit 530.

In addition to the above, the communication interface unit 500, the control unit 510, the feature vector forcing unit 520, and the storage unit 530 of FIG. 5 may perform various other operations. The details have been sufficiently described above and are not repeated here.

The communication interface unit 500, the control unit 510, the feature vector forcing unit 520, and the storage unit 530 of FIG. 5 according to an embodiment of the present invention are configured as hardware modules physically separated from one another, and each module may internally store and execute software for performing the operations described above. However, the software is a set of software modules, and each of those modules could equally be formed in hardware, so the invention is not limited to a software or hardware configuration. For example, the storage unit 530 may be hardware storage or memory. Information may, however, equally be stored in software (a repository), so the invention is not limited to the above.

Meanwhile, in another embodiment of the present invention, the control unit 510 may include a CPU and a memory, and may be formed as a single chip. The CPU includes a control circuit, an arithmetic logic unit (ALU), an instruction interpretation unit, a registry, and the like, and the memory may include RAM. The control circuit performs control operations, and the ALU performs operations on binary bit information. The instruction interpretation unit, which may include an interpreter or a compiler, converts high-level language into machine language and machine language into high-level language. The registry may be involved in software-based data storage. With this configuration, for example, the program stored in the feature vector forcing unit 520 may be copied and loaded into the memory, that is, RAM, at the start of operation of the artificial intelligence service device 130 and then executed, which speeds up data processing.

FIG. 6 is a flowchart illustrating the driving process of the artificial intelligence service device with structured consistency loss of FIG. 1.

For convenience of explanation, referring to FIG. 6 together with FIG. 1, the artificial intelligence service device 130 according to an embodiment of the present invention receives a (moving) image composed of a series of video frames (S600). Since the artificial intelligence service device 130 of FIG. 1 may be a server or a computer operating in stand-alone form, the moving image may also be received through a storage medium or the like.

In addition, for the received series of video frames, the artificial intelligence service device 130 drops video frames at different positions in a first deep learning network and a second deep learning network on different paths and then performs deep learning analysis. In doing so, it fetches from the adjacent path the video data of the video frame corresponding to each dropped position, performs the deep learning analysis, and outputs a first deep learning analysis result and a second deep learning analysis result, respectively (S610).

For example, when the first deep learning network drops the second and third video frames from the received first to fifth video frames, the artificial intelligence service device 130 finds or identifies an adjacent deep learning network that holds the data of the dropped second and third video frames, for example their feature vector information. If, say, the second deep learning network holds that feature vector information, the device fetches it, performs the deep learning operation, and outputs the analysis result. If the third deep learning network holds the feature vector information for the dropped video frames, the information may instead be fetched from the third deep learning network, so embodiments of the present invention are not limited to either form.

For example, when the positions of the video frames dropped by each deep learning network are fixed, the other deep learning networks can operate by sharing that information. When the positions vary, as mentioned above, a network may query or search the adjacent deep learning networks and fetch the feature vector information when one of them responds. Here, a feature vector may mean, for example, tracking the movement of an object based on vector information about a feature of that object. When an object moves over time, which object it is can be determined from the feature information, so only its position needs to be determined from the vector information, and data analysis is possible with compact vector information alone. Feature vector information is given as an example in embodiments of the present invention, but the invention is not limited to it.
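The variable case, where a network must ask its neighbors whether they hold the features of a dropped frame, can be sketched as a simple query loop. The class `NetworkPath`, its methods, and the rule of asking neighbors in list order are hypothetical stand-ins chosen for illustration; the patent does not specify an interface.

```python
class NetworkPath:
    """Hypothetical stand-in for one deep learning path and the features it retains."""

    def __init__(self, name):
        self.name = name
        self._cache = {}  # frame index -> feature vector

    def store(self, frame_idx, feature):
        """Record the feature computed for a frame this path did not drop."""
        self._cache[frame_idx] = feature

    def query(self, frame_idx):
        """Answer a neighbor's inquiry: the feature if retained, otherwise None."""
        return self._cache.get(frame_idx)

def fetch_dropped(frame_idx, neighbors):
    """Ask neighbors in order and return (name, feature) from the first that answers."""
    for path in neighbors:
        feature = path.query(frame_idx)
        if feature is not None:
            return path.name, feature
    return None, None

# The first network dropped frames 2 and 3; the second kept frame 2, the third kept frame 3.
net2 = NetworkPath("net2")
net3 = NetworkPath("net3")
net2.store(2, [0.1, 0.9])
net3.store(3, [0.4, 0.6])

source_for_2 = fetch_dropped(2, [net2, net3])
source_for_3 = fetch_dropped(3, [net2, net3])
```

When drop positions are fixed and shared in advance, the query step is unnecessary because each network already knows which neighbor retains a given frame.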

Meanwhile, although all the components constituting an embodiment of the present invention have been described as being combined into one or operating in combination, the present invention is not necessarily limited to such an embodiment. That is, within the scope of the object of the present invention, the components may operate by being selectively combined into one or more units. In addition, although each component may be implemented as a single independent piece of hardware, some or all of the components may be selectively combined and implemented as a computer program having program modules that perform some or all of the combined functions on one or more pieces of hardware. The codes and code segments constituting the computer program can be readily inferred by those skilled in the art. Such a computer program may be stored on non-transitory computer-readable media and read and executed by a computer to implement an embodiment of the present invention.

Here, a non-transitory readable recording medium means a medium that stores data semi-permanently and can be read by a device, rather than a medium that stores data for a short time, such as a register, cache, or memory. Specifically, the programs described above may be stored and provided on a non-transitory readable recording medium such as a CD, DVD, hard disk, Blu-ray disc, USB device, memory card, or ROM.

Although preferred embodiments of the present invention have been illustrated and described above, the present invention is not limited to the specific embodiments described. Various modifications can of course be made by those of ordinary skill in the art to which the invention pertains without departing from the gist of the invention as claimed in the claims, and such modifications should not be understood separately from the technical spirit or scope of the present invention.

100: photographing device 110: image providing device
120: communication network 130: artificial intelligence service device
500: communication interface unit 510: control unit
520: feature vector forcing unit 530: storage unit

Claims

An artificial intelligence service device with structured consistency loss, comprising:
a communication interface unit that receives an image composed of a series of video frames; and
a control unit that, for the received series of video frames, drops video frames at different positions in a first deep learning network and a second deep learning network on different paths and then performs deep learning analysis, fetching from an adjacent path the video data of the video frame corresponding to each dropped position and performing deep learning analysis to output a first deep learning analysis result and a second deep learning analysis result, respectively,
wherein the video data of the video frame corresponding to each dropped position is data obtained by predetermined processing of the video data of the dropped video frame in a deep learning network that did not drop it, and
wherein the adjacent path is the second deep learning network in the case of the first deep learning network, and the first deep learning network in the case of the second deep learning network.

The device of claim 1,
wherein the control unit, when dropping an N-th video frame (N being a positive integer) of the received series of video frames in the first deep learning network, retains the N-th video frame in the second deep learning network.

3. The device of claim 2,
wherein the control unit, during the deep learning analysis operation of the first deep learning network, fetches and analyzes the video data of the N-th video frame from the second deep learning network.

The device of claim 1,
wherein the control unit processes the same video frames in parallel in the first deep learning network and the second deep learning network and outputs a deep learning analysis result from each.

The device of claim 1,
wherein the control unit processes the video data of the video frame in the form of a feature vector for a specified object in the video frame.

The device of claim 1,
wherein the control unit drops one or more video frames when the same content in the video frames continues for a specified time.

A method of driving an artificial intelligence service device with structured consistency loss, the method comprising:
receiving, by a communication interface unit, an image composed of a series of video frames; and
dropping, by a control unit, for the received series of video frames, video frames at different positions in a first deep learning network and a second deep learning network on different paths and then performing deep learning analysis, fetching from an adjacent path the video data of the video frame corresponding to each dropped position and performing deep learning analysis to output a first deep learning analysis result and a second deep learning analysis result, respectively,
wherein the video data of the video frame corresponding to each dropped position is data obtained by predetermined processing of the video data of the dropped video frame in a deep learning network that did not drop it, and
wherein the adjacent path is the second deep learning network in the case of the first deep learning network, and the first deep learning network in the case of the second deep learning network.