KR20210056899A - Methods and systems for real-time data reduction

Info

Publication number: KR20210056899A
Application number: KR1020200136038A
Authority: KR
Inventors: 알리레자 알리아미리; 이첸 션
Original assignee: 삼성전자주식회사
Priority date: 2019-11-11
Filing date: 2020-10-20
Publication date: 2021-05-20

Abstract

A computing system for decimating video data includes: a processor; a persistent storage system coupled to the processor; and a memory storing instructions, wherein the instructions, when executed by the processor, cause the processor to decimate a batch of frames of video data by: receiving the batch of frames of video data; mapping, by a feature extractor, the frames of the batch to corresponding feature vectors in a feature space, wherein each of the feature vectors has a lower dimension than a corresponding one of the frames of the batch; selecting a set of dissimilar frames from the plurality of frames of video data; and storing the selected set of dissimilar frames in the persistent storage system, wherein the size of the selected set of dissimilar frames is smaller than the number of frames in the batch of frames of video data.

Description

Systems and methods for real-time data reduction {METHODS AND SYSTEMS FOR REAL-TIME DATA REDUCTION}

The present disclosure relates to systems and methods for real-time data reduction, including in the field of collecting video data for training machine learning systems such as computer vision systems.

Machine learning systems generally compute predictions or answers based on underlying statistical models that are trained on a large collection of training data representative of the range of inputs that such systems will encounter.

As one example, some autonomous vehicles use cameras to detect surrounding objects. One technique for object detection is supervised machine learning that leverages deep learning (e.g., deep neural networks (DNNs)). In such approaches, cameras mounted on an autonomous vehicle capture images of a street scene in real time and supply the images to one or more statistical models (e.g., deep neural networks) that are trained to automatically detect and classify objects within the scene. This provides autonomous driving algorithms with substantially real-time semantic representations of the vehicle's surroundings, such as the locations of pedestrians, road hazards, curbs, pavement markings, and other vehicles in the vicinity of the autonomous vehicle. In turn, the autonomous driving algorithms may use the information about the detected objects to control the vehicle, such as by slowing and stopping the vehicle to avoid colliding with a pedestrian crossing the vehicle's path.

To generate training data for training the statistical models of computer vision systems, a large collection (e.g., thousands) of images is captured from video cameras while driving a vehicle through a variety of conditions (e.g., different types of traffic, weather, road types, parking lots, parking structures, and the like). The collected images may then be manually annotated by human annotators (e.g., by manually specifying bounding boxes or other masks labeled with the objects depicted within the masked area).

It is an object of the present disclosure to provide methods and systems for reducing, in real time, data captured for the purpose of training machine learning systems.

According to one embodiment of the present disclosure, a computing system for decimating video data includes: a processor; a persistent storage system coupled to the processor; and a memory storing instructions that, when executed by the processor, cause the processor to decimate a batch of frames of video data by: receiving the batch of frames of video data; mapping, by a feature extractor, the frames of the batch to corresponding feature vectors in a feature space, each of the feature vectors having a lower dimension than a corresponding one of the frames of the batch; selecting a set of dissimilar frames from the plurality of frames of video data based on dissimilarities between corresponding ones of the feature vectors; and storing the selected set of dissimilar frames in the persistent storage system, the size of the selected set of dissimilar frames being smaller than the number of frames in the batch of frames of video data.

The instructions that cause the processor to select the set of frames may include instructions that, when executed by the processor, cause the processor to: randomly select a first reference frame from the batch of frames of video data; discard a first set of frames from the batch of frames, the first set of frames having corresponding feature vectors that are within a similarity threshold distance of a first feature vector corresponding to the first reference frame; select a second reference frame from the plurality of frames of video data, the second reference frame having a second feature vector, wherein a distance between the first feature vector and the second feature vector is greater than the similarity threshold distance; and discard a second set of frames from the batch of frames, the second set of frames having corresponding feature vectors that are within the similarity threshold distance of the second feature vector, wherein the selected set of dissimilar frames includes the first and second reference frames and excludes the first and second sets of frames.
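The reference-frame procedure described above can be read as a greedy loop: pick a reference frame, drop everything within the similarity threshold distance of it, and repeat on what remains. The following is a minimal sketch under stated assumptions, not the disclosed implementation: feature vectors are plain tuples of floats, distance is Euclidean (the text does not fix a metric), and the names `select_dissimilar`, `features`, and `threshold` are hypothetical.

```python
import math
import random


def select_dissimilar(features, threshold, rng=None):
    """Greedily pick reference frames whose feature vectors are mutually
    farther apart than `threshold`; return their indices in sorted order.

    features:  list of equal-length tuples of floats (one per frame)
    threshold: similarity threshold distance (assumed Euclidean here)
    rng:       optional random.Random, used for the random reference pick
    """
    rng = rng or random.Random()
    remaining = list(range(len(features)))
    kept = []
    while remaining:
        # Randomly choose the next reference frame from the survivors.
        ref = rng.choice(remaining)
        kept.append(ref)
        # Discard the reference itself and every frame within the threshold.
        remaining = [
            i for i in remaining
            if i != ref and math.dist(features[i], features[ref]) > threshold
        ]
    return sorted(kept)


if __name__ == "__main__":
    # Two tight clusters plus one isolated frame: three frames survive.
    feats = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
    print(select_dissimilar(feats, threshold=1.0, rng=random.Random(0)))
```

Because every later reference is drawn only from frames that survived earlier discards, any two kept frames are farther apart than the threshold, which matches the condition stated for the first and second reference frames.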

The feature extractor may include a neural network. The neural network may include a convolutional neural network (CNN).

The computing system and the persistent storage system may be on board a vehicle, and the batch of frames of video data may be a portion of a stream of video data captured by a video camera mounted on the vehicle, the video camera being configured to capture images of the surroundings of the vehicle.

The batch of frames of video data may be captured during a first time period having a length corresponding to a capture interval, the stream of video data may include a second batch of frames of video data corresponding to video data captured during a second time period having a length corresponding to the capture interval, the second time period immediately following the first time period, and the computing system may be configured to decimate the batch of frames of video data within an amount of time corresponding to the capture interval.

The memory may further store instructions to remove frames from the selected set of dissimilar frames based on an uncertainty metric, including instructions that, when executed by the processor, cause the processor to: supply each frame of the selected set of dissimilar frames to an object detector including a convolutional neural network (CNN) to compute a plurality of sets of bounding boxes identifying portions of the frame that depict instances of each object class of a plurality of object classes, each bounding box of each set of bounding boxes having an associated confidence score; compute an uncertainty score for each frame of the selected set of dissimilar frames based on the sets of bounding boxes and the associated confidence scores; and remove, from the selected set of dissimilar frames, a set of frames having uncertainty scores that fail to satisfy an uncertainty threshold, wherein the selected set of dissimilar frames stored in the persistent storage system excludes the set of frames having uncertainty scores that fail to satisfy the uncertainty threshold.

The object detector may further include a long short-term memory (LSTM) neural network.

The instructions to compute the uncertainty score may include instructions that, when executed by the processor, cause the processor to, for each frame: identify the two highest associated confidence scores that correspond to a same portion of the frame and to different object classes; and compare the two highest associated confidence scores, wherein the uncertainty score is high when the difference between the two highest associated confidence scores is small, and the uncertainty score is low when the difference between the two highest associated confidence scores is large.
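The comparison of the two highest confidence scores can be illustrated with a small margin-based score. This is only a sketch under assumptions that the text does not state: each frame portion is represented as a dictionary mapping a class name to a confidence in [0, 1], the score is taken as one minus the gap between the top two confidences, and the per-frame score is the maximum over portions. The function names are hypothetical.

```python
def region_uncertainty(class_scores):
    """Uncertainty for one frame portion.

    class_scores: dict mapping object class -> confidence in [0, 1].
    Returns a value near 1 when the two best classes are nearly tied
    and near 0 when one class clearly dominates.
    """
    top = sorted(class_scores.values(), reverse=True)
    if len(top) < 2:
        # Assumption: a single candidate class means no class ambiguity.
        return 0.0
    return 1.0 - (top[0] - top[1])


def frame_uncertainty(regions):
    """Assumed aggregation: the most ambiguous portion defines the frame."""
    return max((region_uncertainty(r) for r in regions), default=0.0)


if __name__ == "__main__":
    confident = [{"car": 0.9, "truck": 0.1}]
    ambiguous = [{"car": 0.5, "truck": 0.45}, {"sign": 0.95, "tree": 0.05}]
    print(frame_uncertainty(confident), frame_uncertainty(ambiguous))
```

With this convention, a frame would be retained when its score meets the uncertainty threshold; other aggregations (for example, a mean over portions) would fit the text equally well.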

The feature extractor may include the convolutional neural network (CNN) of the object detector.

According to one embodiment of the present disclosure, a computing system for decimating video data includes: a processor; a persistent storage system coupled to the processor; and a memory storing instructions that, when executed by the processor, cause the processor to decimate a batch of frames of video data by: receiving the batch of frames of video data; supplying each frame of the batch of frames of video data to an object detector including a convolutional neural network (CNN) to compute a plurality of sets of bounding boxes identifying portions of the frame that depict instances of each object class of a plurality of object classes, each bounding box of each set of bounding boxes having an associated confidence score; computing an uncertainty score for each frame of the batch of frames of video data based on the sets of bounding boxes and the associated confidence scores; selecting a set of uncertain frames from the batch of frames of video data, the uncertainty score of each frame of the set of uncertain frames satisfying an uncertainty threshold; and storing the selected set of uncertain frames in the persistent storage system, the number of selected uncertain frames being smaller than the number of frames in the batch of frames of video data.

The instructions to compute the uncertainty score may include instructions that, when executed by the processor, cause the processor to, for each frame: identify the two highest associated confidence scores that correspond to a same portion of the frame and to different object classes; and compare the two highest associated confidence scores, wherein the uncertainty score is high when the difference between the two highest associated confidence scores is small, and the uncertainty score is low when the difference between the two highest associated confidence scores is large.

The computing system and the persistent storage system may be on board a vehicle, and the batch of video data may be a portion of a stream of video data captured by a video camera mounted on the vehicle, the video camera being configured to capture images of the surroundings of the vehicle.

The batch of frames of video data may be captured during a first time period having a length corresponding to a capture interval, the stream of video data may include a second batch of frames of video data corresponding to video data captured during a second time period having a length corresponding to the capture interval, the second time period immediately following the first time period, and the computing system may be configured to decimate the batch of frames of video data within an amount of time corresponding to the capture interval.

According to one embodiment of the present disclosure, a computing system for decimating video data includes: a processor; a persistent storage system coupled to the processor; and a memory storing instructions that, when executed by the processor, cause the processor to decimate a batch of frames of video data by: receiving the batch of frames of video data; mapping, by a feature extractor, the frames of the batch to corresponding feature vectors in a feature space, each of the feature vectors having a lower dimension than a corresponding one of the frames of the batch; computing a plurality of dissimilarity scores based on the feature vectors, each of the dissimilarity scores corresponding to one of the frames of the batch; supplying each frame of the batch of frames of video data to an object detector including a convolutional neural network (CNN) to compute a plurality of sets of bounding boxes identifying portions of the frame that depict instances of each object class of a plurality of object classes, each bounding box of each set of bounding boxes having an associated confidence score; computing a plurality of uncertainty scores based on the sets of bounding boxes and the associated confidence scores, each of the uncertainty scores corresponding to one of the frames of the batch; computing a total score for each frame by aggregating the associated dissimilarity score and the associated uncertainty score; selecting, from the batch of frames of video data, a set of frames in which the total score of each selected frame satisfies a total frame threshold; and storing the selected set of frames in the persistent storage system, the number of frames in the selected set being smaller than the number of frames in the batch of frames of video data.
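The aggregation step in this embodiment combines a per-frame dissimilarity score with a per-frame uncertainty score and keeps frames whose total satisfies a threshold. A minimal sketch follows, assuming the aggregate is a weighted sum and that "satisfies" means greater than or equal to; neither choice is specified in the text, and the names `select_by_total_score` and `alpha` are hypothetical.

```python
def select_by_total_score(dissimilarity, uncertainty, threshold, alpha=0.5):
    """Return indices of frames whose aggregated score meets the threshold.

    dissimilarity, uncertainty: equal-length lists of per-frame scores
    threshold: total frame threshold (assumed: keep when total >= threshold)
    alpha:     assumed weight on dissimilarity; (1 - alpha) on uncertainty
    """
    if len(dissimilarity) != len(uncertainty):
        raise ValueError("score lists must have one entry per frame")
    totals = [
        alpha * d + (1.0 - alpha) * u
        for d, u in zip(dissimilarity, uncertainty)
    ]
    return [i for i, total in enumerate(totals) if total >= threshold]


if __name__ == "__main__":
    dissim = [0.9, 0.1, 0.5]
    uncert = [0.8, 0.2, 0.5]
    print(select_by_total_score(dissim, uncert, threshold=0.5))
```

Both score lists should be on comparable scales before they are combined; otherwise one criterion dominates regardless of the weight.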

A computing system according to one embodiment of the present disclosure may select a subset of frames of video data based on at least one of dissimilarity between the frames in a feature space or uncertainty in object classification. Accordingly, redundant data may be removed when training a machine learning model, and the performance of the computing system may therefore be improved.

The accompanying drawings, together with the specification, illustrate exemplary embodiments of the present disclosure and, together with the description, serve to explain the principles of the present disclosure.
FIG. 1 is a block diagram depicting a video capture and data reduction system according to one embodiment of the present disclosure.
FIG. 2 is a flowchart depicting a method for reducing captured data according to one embodiment of the present disclosure.
FIG. 3 is a schematic block diagram depicting a feature space similarity-based frame selection module that discards frames based on similarity according to one embodiment of the present disclosure.
FIG. 4 is a flowchart depicting a method for selecting frames by discarding frames based on similarity according to one embodiment of the present disclosure.
FIG. 5 is a schematic block diagram depicting an uncertainty-based frame selection module that discards frames based on uncertainty according to one embodiment of the present disclosure.
FIG. 6 is a flowchart depicting a method for selecting frames by discarding frames based on an uncertainty score according to one embodiment of the present disclosure.
FIG. 7 is a flowchart of a method for selecting frames based on both similarity and uncertainty, in series, according to one embodiment of the present disclosure.
FIG. 8 is a flowchart of a method for selecting frames based on both similarity and uncertainty, in parallel, according to one embodiment of the present disclosure.
FIG. 9 is a block diagram illustrating a computing system according to one embodiment of the present disclosure.

In the following detailed description, only certain embodiments of the present disclosure are shown and described, by way of illustration. As those skilled in the art would recognize, the present disclosure may be embodied in many different forms and should not be construed as being limited to the embodiments set forth herein. In the drawings and the discussion below, like reference numerals refer to like elements.

In general, supervised training of a machine learning model involves a large collection of labeled training data. In comparative methodologies for collecting training data for computer vision systems for autonomous vehicles, cameras may be mounted on a vehicle and may capture video data (e.g., at a rate of 30 or 60 frames per second (fps), where each frame generally corresponds to a single still image) while driving through various regions and under various driving conditions. The captured video may then be reviewed by human annotators, who may label the video data, such as by drawing boxes around instances of particular preselected classes of objects (e.g., cars, trucks, pedestrians, walls, trees, signs, traffic signals, and the like).

In some circumstances, all video captured during the drive is stored. However, storing all of the video data captured during a drive requires a large amount of in-vehicle data storage, which can be expensive. In addition, manually labeling a large amount of data is very costly because human annotators may need to view and label millions of frames.

In some circumstances, the driver (or other human operator) may manually activate or deactivate the camera to selectively record video clips of portions of the drive. However, this may be inefficient because a delay between the driver recognizing a scene worth recording and activating the camera may cause important data to be missed. In addition, scenes manually selected by the driver are subject to the driver's biases, and may not actually reflect the types of scenes that would be most useful for providing additional data to improve the performance of a trained machine learning model.

In general, it is beneficial to collect data that is dissimilar to data that has already been collected (e.g., data that is similar to previously captured data may be redundant). Continuing the above example, video clips and images that depict a diversity of objects and that are dissimilar to previously collected video clips may be more beneficial than video clips that depict substantially the same objects and arrangements as previously seen. Such more "salient" clips generally have a greater impact on improving the performance of trained machine learning models.

Accordingly, aspects of embodiments of the present disclosure relate to performing data reduction or decimation by selecting the more salient portions of the data to retain for training a machine learning model. Some aspects of embodiments of the present disclosure enable intelligent, automatic, and continuous real-time data capture within resource-constrained environments (e.g., environments with limited data storage capacity or data transmission bandwidth) by selectively retaining portions of a data stream and discarding other portions of the data stream (e.g., decimating the data stream). Reducing or decimating the data by selecting portions of the data stream reduces storage costs and human annotation costs while retaining the benefit of obtaining additional data to improve the performance of trained algorithms.

While embodiments of the present disclosure are described herein primarily with respect to the reduction or decimation of video data in the context of video cameras mounted on vehicles and capturing data regarding the environment in the vicinity of the vehicle, embodiments of the present disclosure are not limited thereto and may also be applied to reducing the data stored in other data collection contexts.

According to one embodiment of the present disclosure, an automated image and/or video capture methodology continuously monitors the scene captured by a video camera and selects and stores only the important clips and images (e.g., frames of the video) that are predicted by the system to improve the performance of a machine learning model when the model is re-trained using the selected important frames.

FIG. 1 is a block diagram depicting a video capture and data reduction system 100, depicted as being on board a vehicle 105, according to one embodiment of the present disclosure. The video capture and data reduction system 100 of FIG. 1 includes one or more video cameras 110. In some embodiments, the video cameras 110 capture video (which includes a stream or sequence of two-dimensional still frames) at a frame rate such as 30 frames per second (fps). The video cameras 110 generally capture images continuously during operation, that is, while the vehicle 105 is being driven. A computing system 130 includes a buffer memory 150 (e.g., dynamic random-access memory or DRAM) that stores the video data (including the stream of frames) received from the video cameras 110.

In some embodiments, the video capture and data reduction system 100 operates in conjunction with an object detection system (e.g., an object detection and/or scene segmentation system running on the same computing system 130 or on a different computing system on board the same vehicle). The object detection system may be used to detect and classify objects visible to the video cameras 110, and the detections made by the object detection system may be used as input to an autonomous driving system of the vehicle.

In some embodiments, the computing system 130 applies an active learning and novelty detection process to the frames of video data stored in the buffer memory 150 to select salient portions of the data stream (e.g., salient frames) that, after being annotated or labeled, are predicted to improve the performance of the object detection system when the object detection system is re-trained with a training set that includes these selected and labeled salient portions. The computing system 130 writes the selected subset of the frames to a persistent storage system 170. Examples of a suitable persistent storage system 170 include a mass storage system, which may include one or more flash memory units and/or one or more hard disk drives, and which may be operated by a controller such as a redundant array of independent disks (RAID) controller, a network-attached storage controller, a storage area network controller, or the like. As another example, the persistent storage system 170 may include a computer networking device (e.g., cellular, WiFi, and/or Ethernet) for transmitting the received data to persistent storage that is not on board the vehicle 105 (e.g., transmitting all of the captured data over a network may not be practical, but transmitting the decimated data may be feasible). Frames that are not selected (e.g., not written to the persistent storage system 170) are considered redundant and may be discarded (e.g., overwritten in the buffer memory).

FIG. 2 is a flowchart depicting a method for reducing captured video data according to one embodiment of the present disclosure. In operation 210, the computing system 130 receives a batch of data. Continuing the example above, in some embodiments the batch of data corresponds to a batch of video frames (e.g., still images) captured by a video camera during a capture interval I. The video cameras 110 may capture data at a constant rate f_s (e.g., 30 to 60 frames per second), so that each capture interval I contains f_s*I frames. The length of the capture interval I may depend on the size T of the buffer memory 150 (e.g., the amount of space allocated to store one batch of f_s*I frames of video data) and on the processing speed of the techniques for selecting frames for storage, such that processing of one batch can be completed before the next batch of data is received from the video cameras 110 in operation 210. Accordingly, the specific parameters for the length of the capture interval I and the size of the buffer memory 150 may depend on details including the techniques for selecting frames (e.g., the algorithmic time complexity of the frame selection processes may constrain the batch size) and the performance of the computing system 130 in carrying out the method 200 (e.g., in selecting image frames for storage).
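
The sizing relationship described here can be illustrated with a short calculation. The following is only a sketch: the function names, the frame resolution, the three bytes per pixel, and the assumption of uncompressed frames are illustrative choices that do not come from the disclosure.

```python
def frames_per_batch(frame_rate_fps: float, capture_interval_s: float) -> int:
    """Number of frames captured in one capture interval I at rate f_s (f_s * I)."""
    return int(round(frame_rate_fps * capture_interval_s))


def buffer_bytes(num_frames: int, width: int, height: int,
                 bytes_per_pixel: int = 3, num_batches: int = 1) -> int:
    """Space needed to hold `num_batches` batches of frames.

    Assumes raw, uncompressed frames (an illustrative simplification).
    """
    return num_frames * width * height * bytes_per_pixel * num_batches


# Hypothetical values: a 30 fps camera and a 10-second capture interval.
n = frames_per_batch(30, 10)
# Room for two batches, so one can be processed while the next is received.
size = buffer_bytes(n, width=2048, height=2048, num_batches=2)
```

A longer capture interval or a higher frame rate increases both the buffer requirement and the amount of work the selection step must finish before the next batch arrives.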

In operation 230, the computing system 130 stores the current batch of frames in the buffer memory 150. The computing system 130 may store these frames as they are received from the one or more video cameras 110 (e.g., storing a stream of video comprising individual still frames or still images). In some embodiments, the buffer memory 150 includes sufficient space to store multiple batches (e.g., at least two batches of data), so that the system can continue to receive batches of data and store them in the buffer memory 150 while processing the current frames in real time.

In operation 250, the computing system 130 selects a subset of the batch of frames of video data based on a prediction that the selected subset is more salient (e.g., more likely to improve the performance of a machine learning model if the model is trained on the selected data). This may also be referred to herein as reducing, filtering, or decimating the data. The subset will be referred to herein as a collection of selected frames, where the number of selected frames is smaller than the number of frames in the input batch (e.g., the cardinality of the set of selected frames is smaller than the cardinality of the input batch of frames). Embodiments of the present disclosure for performing the prediction (e.g., determining the saliency of the data) are described in more detail below.

In operation 270, the computing system 130 stores or writes the selected frames to the persistent storage system 170. The persistent storage system 170 stores the selected frames persistently or in a non-transitory manner (e.g., for the entire duration of a drive during which the video cameras 110 of the vehicle 105 collect data, and after the drive ends).

In operation 290, the computing system 130 determines whether to continue capturing data. This determination may be based, for example, on whether the computing system 130 has received a command to pause or stop capturing data. If the computing system 130 is to continue capturing data, it returns to operation 210 to receive the next batch of frames of video data from the video cameras 110, where the next batch corresponds to the time period (having a length equal to the capture interval I) immediately following the previous batch. If the computing system 130 is to stop capturing data, the method ends.

According to some embodiments of the present disclosure, the video cameras 110 capture images continuously. Accordingly, video data is received continuously, and every batch of video data is captured over a capture interval I. As such, each iteration of the method 200, including the selection of frames from the batch for storage in operation 250, is completed within a time no longer than the capture interval I, so that the computational resources of the computing system 130 are available to process the next batch of frames of video data, corresponding to the frames captured during a second time period having the same length as the capture interval I.
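
The overall loop of method 200, together with the requirement that each iteration finish within one capture interval, might be sketched as follows. This is an illustration only: the function `run_data_reduction`, the `Frame` placeholder type, and the `select` and `store` callables are hypothetical names, and the sketch processes batches sequentially rather than overlapping capture and processing as a real-time system would.

```python
import time
from typing import Callable, Iterable, List, Sequence

Frame = bytes  # placeholder type for one video frame (assumption)


def run_data_reduction(
    batches: Iterable[Sequence[Frame]],
    select: Callable[[Sequence[Frame]], List[Frame]],
    store: Callable[[List[Frame]], None],
    capture_interval_s: float,
) -> int:
    """Process batches one at a time; return how many overran the interval."""
    overruns = 0
    for batch in batches:                 # operation 210: receive a batch
        buffered = list(batch)            # operation 230: hold it in a buffer
        start = time.monotonic()
        selected = select(buffered)       # operation 250: pick salient frames
        store(selected)                   # operation 270: persist the selection
        elapsed = time.monotonic() - start
        if elapsed > capture_interval_s:  # real-time budget exceeded
            overruns += 1
        # operation 290: the loop continues while `batches` yields data
    return overruns
```

A nonzero return value would indicate that the selection technique is too slow for the chosen capture interval, which is the situation the sizing of I and of the buffer is meant to avoid.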

Frame selection based on frame similarity in feature space

According to some embodiments of the present disclosure, the computing system 130 selects frames (e.g., in operation 250 of FIG. 2) by selecting frames that are dissimilar from one another in a feature space. In some embodiments, this is equivalent to discarding frames that are similar to 'reference' frames in the feature space. In general, presenting many training examples that are substantially the same as one another does not yield a significant improvement in the performance of the trained machine learning system, because the similar examples are redundant. Accordingly, aspects of embodiments of the present disclosure relate to increasing the efficiency of training data collection by removing redundant frames and keeping only dissimilar frames.

FIG. 3 is a schematic block diagram of a feature space similarity-based frame selection module 300 that discards frames based on similarity, according to one embodiment of the present disclosure. FIG. 4 is a flowchart depicting a method for frame selection that discards frames based on similarity in a feature space, according to one embodiment of the present disclosure.

Referring to FIGS. 3 and 4, in operation 410, the computing system 130 receives a batch 302 of N frames of video data (e.g., stored in the buffer memory 150). In operation 430, the computing system 130 supplies the batch 302 of frames to a feature extractor 310 of the feature space similarity-based frame selection module 300, and the feature extractor 310 extracts feature vectors 330 from the frames, each feature vector corresponding to one of the frames. This may also be referred to as mapping the frames into a feature space (or latent space), where the feature space may have a lower rank than the original image space of the frames (or the feature vectors may have a lower dimension than the input images). For example, each frame may have an image resolution of 2048*2048 pixels, whereas each feature vector may have a dimension of 256*256. In such an arrangement, the feature vector is 64 times smaller than the corresponding frame.
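
As a quick check of the size ratio quoted in this example, treating both the frame and the feature vector as two-dimensional arrays of scalar values (an assumption, since channel counts are not given in the text), the ratio works out as:

$$\frac{2048 \times 2048}{256 \times 256} = \frac{4{,}194{,}304}{65{,}536} = 8 \times 8 = 64$$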

As such, the feature vectors can be regarded as compressing or summarizing the content of each frame, such that two frames that are substantially similar in content map to similar feature vectors, even if the contents of corresponding pixels in the frames differ significantly. For example, an image of a scene at one moment may be substantially, semantically similar to an image of the scene one second later, because the same other vehicles, pedestrians, and stationary objects may be visible in both images. Compared at the pixel level, however, the two images may be very different, because the position of each object within the images may have shifted (e.g., all of the objects common to the two images may be translated in accordance with the motion of the vehicle 105 along the road as well as the motions of the objects themselves). In addition, comparing two frames of data in a feature space of lower rank than the input image space is less computationally intensive (e.g., because there are fewer values to compare), which makes real-time data selection more tractable (e.g., by tuning the size of the feature vectors).

In some embodiments of the present disclosure, the feature extractor 310 running on the computing system 130 extracts the feature vectors 330 from the frames by supplying the frames to a neural network. The feature extractor 310 may correspond to the initial stages of an object detector neural network trained to identify instances of objects in an image and to compute bounding boxes identifying the locations of those objects. The object detector neural network may include a convolutional neural network (CNN) combined with a long short-term memory (LSTM). In such a case, the extracted features correspond to the activations (or output) of an intermediate layer of the network, prior to computing the final result (e.g., prior to computing the locations of the bounding boxes). For example, continuing the above example of a computer vision system that includes a deep CNN trained to compute labeled bounding boxes containing instances of objects of various classes in an input image, the feature extractor 310 may correspond to a suitable 'backbone' CNN. The backbone CNN may be pretrained on a standard training corpus such as ImageNet (see, e.g., Deng, Jia, et al. "ImageNet: A large-scale hierarchical image database." 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009.). Examples of backbone neural networks include MobileNetV2 (see, e.g., Sandler, Mark, et al. "MobileNetV2: Inverted residuals and linear bottlenecks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.), MobileNetV3 (see, e.g., Howard, Andrew, et al. "Searching for MobileNetV3." arXiv preprint arXiv:1905.02244 (2019).), MnasNet (see, e.g., Tan, Mingxing, et al. "MnasNet: Platform-aware neural architecture search for mobile." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.), and Xception (see, e.g., Chollet, François. "Xception: Deep learning with depthwise separable convolutions." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.). In some embodiments, the feature extractor 310 further includes a residual neural network comprising a plurality of trained residual modules (or residual layers), where the residual neural network is supplied with the output of the backbone neural network, and the output of the final residual module is taken as the extracted features or extracted feature vector of the input frame.

In some embodiments, the feature extractor 310 corresponds to a portion of a neural network of an object detection system associated with the video capture and data reduction system 100. For example, if the object detection system includes a CNN and a backbone neural network such as MobileNetV3, the feature extractor 310 may include the same backbone neural network.

In other embodiments of the present disclosure applied to other contexts, such as speech processing or natural language processing, other feature extractors may be applied to perform feature extraction in a manner appropriate to the domain. For example, in speech processing, a recurrent neural network operating on spectral features (e.g., audio signals in the frequency domain) may be used as a component of the feature extractor.

In operation 450, the computing system 130 uses a similarity ranking module 350 to select one or more reference frames from the batch of frames based on dissimilarities between the feature vectors. In some embodiments, the computing system 130 randomly selects a seed frame from among the frames 302 in the buffer memory 150 and then, based on the feature vectors corresponding to the frames, identifies a set of frames that are dissimilar or distant from one another in the feature space (e.g., that satisfy a dissimilarity threshold). In some embodiments, an iterative algorithm such as the k-Center-Greedy technique (see, e.g., Sener, Ozan, and Silvio Savarese. "Active learning for convolutional neural networks: A core-set approach." arXiv preprint arXiv:1708.00489 (2017).) is used to identify the set of reference frames. For example, in one embodiment using the k-Center-Greedy approach, an anchor frame F is selected at random and added as a seed to the set of selected frames, S={F}. Additional frames are then selected based on having the lowest similarity scores with respect to the frames already selected. For example, a second reference frame F1 may be selected by finding the frame that is least similar to the seed frame F. After the second frame is added, the selected set contains both F and F1 (S={F, F1}). A third frame F2 may be selected by choosing the frame having the lowest average similarity to the members of S (e.g., the lowest average similarity to both F and F1). Frames continue to be added in this manner until K frames have been selected. According to another embodiment, a seed frame is selected at random, similarity distances to the seed frame are computed for all of the remaining frames in its neighborhood, and the remaining frames are scored based on the computed similarity distances. The reference frame has the highest score (e.g., it is most similar to itself).
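
The greedy procedure described in this paragraph can be sketched in a few lines. The sketch below follows the variant stated in the text, in which each new frame is the one with the lowest average similarity to the frames already selected, with similarity modeled here as Euclidean distance (larger distance meaning less similar). The function names and the choice of distance are assumptions for illustration. Note that the k-Center-Greedy algorithm of Sener and Savarese selects by the largest minimum distance to the selected set rather than the largest average; replacing the running sum with a running minimum would give that behavior.

```python
import math
import random
from typing import List, Optional, Sequence

Vector = Sequence[float]


def l2_distance(a: Vector, b: Vector) -> float:
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_reference_frames(features: Sequence[Vector], k: int,
                            seed_index: Optional[int] = None) -> List[int]:
    """Greedily pick up to k frame indices that are mutually dissimilar."""
    n = len(features)
    if n == 0 or k <= 0:
        return []
    if seed_index is None:
        seed_index = random.randrange(n)  # seed (anchor) frame chosen at random
    selected = [seed_index]
    # Running sum of distances from every frame to the selected set. The
    # average is this sum divided by len(selected), which is the same for all
    # candidates, so maximizing the sum also maximizes the average.
    dist_sum = [l2_distance(f, features[seed_index]) for f in features]
    while len(selected) < min(k, n):
        chosen = set(selected)
        best = max((i for i in range(n) if i not in chosen),
                   key=lambda i: dist_sum[i])
        selected.append(best)
        for i in range(n):
            dist_sum[i] += l2_distance(features[i], features[best])
    return selected
```

Each iteration costs one pass over the batch, so selecting K of N frames takes on the order of K*N distance computations on the low-dimensional feature vectors.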

For example, in some embodiments of the present disclosure, during each iteration of the process of selecting reference frames, one of the frames of the batch of data is selected at random. The feature vector of that frame is compared with the feature vectors of the other reference frames (the set of reference frames may be initialized by randomly selecting a first reference frame) to compute one or more similarity scores. In some embodiments, the similarity score from the comparison of two feature vectors is computed using, for example, the L1 norm (or Manhattan distance) or the L2 norm (or Euclidean distance) between the two feature vectors. If the similarity score of the feature vector of the current frame fails to satisfy a dissimilarity threshold (e.g., its feature vector is too similar to that of one of the current reference frames), the frame is discarded. Otherwise, the current frame is added to the collection of selected reference frames. As a result, all of the reference frames are dissimilar in that their similarity scores indicate that they are separated from one another in the feature space by at least a threshold distance (or score), and the computing system 130 discards the sets of frames that are substantially similar to the reference frames (e.g., having similarity scores indicating that they lie within the threshold distance of at least one reference frame in the feature space), such that the selected reference frames exclude the discarded frames.
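
The threshold-based variant can be sketched as follows. This is a minimal illustration, assuming that the similarity score is a plain distance between feature vectors and that a frame satisfies the dissimilarity threshold when its distance to every current reference frame is at least `threshold`; the function names and the `rng` parameter are hypothetical.

```python
import math
import random
from typing import Callable, List, Optional, Sequence

Vector = Sequence[float]


def l1_distance(a: Vector, b: Vector) -> float:
    """L1 norm of the difference (Manhattan distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def l2_distance(a: Vector, b: Vector) -> float:
    """L2 norm of the difference (Euclidean distance)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_by_threshold(
    features: Sequence[Vector],
    threshold: float,
    distance: Callable[[Vector, Vector], float] = l2_distance,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Return indices of reference frames at least `threshold` apart."""
    rng = rng or random.Random()
    order = list(range(len(features)))
    rng.shuffle(order)  # visit the frames of the batch in random order
    references: List[int] = []
    for i in order:
        # Keep the frame only if it is far enough from every reference frame;
        # the first frame visited is always kept, which initializes the set.
        if all(distance(features[i], features[r]) >= threshold
               for r in references):
            references.append(i)
        # Otherwise the frame is too similar to a reference and is discarded.
    return references
```

Unlike the fixed-count greedy selection, the number of frames kept here depends on the threshold and on how varied the batch is.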

The number of reference frames may be configured based on the desired degree of data reduction. For example, if an interval includes 300 frames and a 90% reduction in data is desired (e.g., keeping 30 frames), then the number of reference frames may be set to 30. In operation 470, the computing system 130 discards the remaining frames (e.g., the frames not selected as reference frames, or 'non-reference' frames). Accordingly, the reference frames are output by the feature space similarity-based frame selection module 300 as the selected frames 352.

Accordingly, in one embodiment of the present disclosure, the feature space similarity-based frame selection module 300 relates to a method 400 for reducing captured data by selecting representative or reference data that are different (dissimilar) from other data in the batch, and discarding redundant data that are similar to the selected representative data.

Frame selection based on frame uncertainty

According to one embodiment of the present disclosure, the computing system 130 selects frames (e.g., in operation 250 of FIG. 2) by selecting frames for which a trained object detector outputs high uncertainty, and for which additional training examples would therefore help train the object detector to distinguish between competing possible detections (e.g., between two classes such as 'pedestrian' versus 'cyclist').

FIG. 5 is a schematic block diagram of an uncertainty-based frame selection module 500 that discards frames based on uncertainty, according to one embodiment of the present disclosure. FIG. 6 is a flowchart depicting a frame selection method that discards frames based on an uncertainty score, according to one embodiment of the present disclosure.

Referring to FIGS. 5 and 6, in operation 610, the computing system 130 receives a batch 502 of M input frames (e.g., stored in the buffer memory 150). In operation 630, the computing system 130 supplies each of the M frames to a K-class object detector 510, which is configured to output K sets of bounding boxes 530 for each frame, where each set of bounding boxes corresponds to one of the K different classes that the K-class object detector 510 is trained to detect. (For convenience, FIG. 5 depicts frames i through i+M being supplied sequentially to the K-class object detector 510 to compute a set of bounding boxes 531 for frame i, a set of bounding boxes 539 for frame i+M, and sets of bounding boxes for all frames between frame i and frame i+M.) Within each set of bounding boxes, e.g., for a j-th class of the K classes, each bounding box corresponds to the location of a detected instance of the j-th class. Each bounding box is also associated with a probability or confidence score that the bounding box depicts an instance of the j-th class (see, e.g., Aghdam, Hamed H., et al. "Active Learning for Deep Detection Neural Networks." Proceedings of the IEEE International Conference on Computer Vision. 2019.).

In operation 650, the computing system 130 uses an uncertainty measurement module 550 to compute an uncertainty score for each frame based on the sets of bounding boxes for each of the K classes and the associated probabilities. In some embodiments, a pixel-wise approach is used to compare the two highest-confidence detections for each pixel of a segmentation map. In some embodiments, uncertainty scores are computed only for selected sets of bounding boxes, and the sets are aggregated together (one box proposal will have one associated class probability). Techniques for computing uncertainty metrics are described, for example, in Brust, Clemens-Alexander, Christoph Käding, and Joachim Denzler. "Active learning for deep object detection." arXiv preprint arXiv:1809.09875 (2018), the entire disclosure of which is incorporated herein by reference. For example, the '1-vs-2' or 'margin sampling' metric may be used, based on the two highest-scoring classes c_1 and c_2:

$$v_{1vs2}(x) = \left(1 - \left(\max_{c_1} \hat{p}(c_1 \mid x) - \max_{c_2 \neq c_1} \hat{p}(c_2 \mid x)\right)\right)^2 \qquad (1)$$

In Equation 1, $\hat{p}(c_i \mid x)$ denotes the probability or confidence score that the bounding box (or image) x depicts an instance of class c_i. If the difference between the confidence scores $\hat{p}(c_1 \mid x)$ and $\hat{p}(c_2 \mid x)$ of the two highest-scoring classes c_1 and c_2 for a particular bounding box (or image) x (or for overlapping bounding boxes of two different classes, or for a particular pixel) is small, then the example is likely close to a decision boundary (e.g., more uncertain) and may therefore have a higher uncertainty score, which corresponds to a more salient or useful example for training (e.g., the example can be manually annotated with the correct class of the bounding box x, and the labeled example can then be used to retrain the machine learning system). These detection metrics may be aggregated according to various techniques, such as computing the sum of all v_1vs2 values over all bounding boxes (or detections) D in frame x, across the sets for the various classes:

$$v_{Sum}(x) = \sum_{i \in D} v_{1vs2}(x_i)$$

Another option is to compute the average over the detections in frame x, for example:

$$v_{Avg}(x) = \frac{1}{|D|} \sum_{i \in D} v_{1vs2}(x_i)$$

Another option is to compute the maximum over the detections in frame x, for example:

$$v_{Max}(x) = \max_{i \in D} v_{1vs2}(x_i)$$

In operation 670, if the uncertainty score satisfies an uncertainty threshold (e.g., the output of the K-class object detector 510 exhibits sufficiently high uncertainty for the frame), the computing system 130 selects the frame for storage as a useful example for further training. If the uncertainty score does not satisfy the uncertainty threshold, the frame is discarded.
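
Operations 650 and 670 might be sketched as below. This assumes the 1-vs-2 score takes the form (1 - (p1 - p2))^2, where p1 and p2 are the two highest class confidences for a detection, as in the cited Brust et al. paper, and that a frame "satisfies" the threshold when its aggregated score is greater than or equal to it. The function names, the input layout (one list of per-class confidences per detection), and the zero score for frames without detections are illustrative assumptions.

```python
from typing import List, Sequence


def one_vs_two(class_probs: Sequence[float]) -> float:
    """1-vs-2 (margin sampling) score for one detection.

    `class_probs` holds the detector's confidence for each class and must
    contain at least two entries. The score approaches 1 as the two
    highest-scoring classes become harder to tell apart.
    """
    top1, top2 = sorted(class_probs, reverse=True)[:2]
    return (1.0 - (top1 - top2)) ** 2


def frame_uncertainty(detections: Sequence[Sequence[float]],
                      mode: str = "sum") -> float:
    """Aggregate per-detection scores over one frame (sum, avg, or max)."""
    scores = [one_vs_two(p) for p in detections]
    if not scores:
        return 0.0  # assumption: a frame with no detections scores zero
    if mode == "sum":
        return sum(scores)
    if mode == "avg":
        return sum(scores) / len(scores)
    if mode == "max":
        return max(scores)
    raise ValueError(f"unknown aggregation mode: {mode!r}")


def select_uncertain_frames(batch: Sequence[Sequence[Sequence[float]]],
                            threshold: float, mode: str = "sum") -> List[int]:
    """Indices of frames whose uncertainty score meets the threshold."""
    return [i for i, dets in enumerate(batch)
            if frame_uncertainty(dets, mode) >= threshold]
```

The choice of aggregation matters: a sum favors frames with many detections, whereas an average or maximum is less sensitive to how crowded the scene is.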

Accordingly, according to an embodiment of the present disclosure, the uncertainty-based frame selection module 500 relates to a method 600 for reducing captured data by selecting data for which the K-class object detector 510 has high uncertainty (e.g., data on which the K-class object detector 510 does not perform well) and discarding data for which the K-class object detector 510 has low uncertainty (e.g., inputs on which the trained K-class object detector 510 already performs well).

According to some embodiments of the present disclosure, the computing system 130 selects frames from a batch of frames (e.g., in operation 250 of FIG. 2) using the feature space similarity-based frame selection module 300 and the uncertainty-based frame selection module 500, and the corresponding methods, either separately (as described above) or in various combinations, such as in series (e.g., sequentially) or in parallel.

For example, in some embodiments, the feature space similarity-based frame selection module 300 and the uncertainty-based frame selection module 500 may be used in series. In such embodiments, the feature space similarity-based frame selection module 300 removes redundant frames from the batch, which reduces the number of frames provided to the uncertainty-based frame selection module 500 and therefore also reduces the computational load on the uncertainty-based frame selection module 500. In some circumstances, such as when the processes used by the uncertainty-based frame selection module 500 have super-linear time and/or space complexity, reducing the size of the set of input frames 502 has a significant impact on whether the system can operate in real time. In some embodiments, the similarity threshold may be configured such that the number of selected frames allows the uncertainty-based frame selection module 500 to complete frame selection within a limited time (e.g., within the period of a capture interval I).

In some embodiments, the selected frames 352 output by the feature space similarity-based frame selection module 300 are provided to the uncertainty-based frame selection module 500 as the input batch 502 of M frames, which further reduces the number of frames (e.g., the selected frames 552) to be stored in the persistent storage system 170. More explicitly, FIG. 7 is a flowchart of a method 700 for selecting frames based on both similarity and uncertainty, in sequence, according to an embodiment of the present disclosure. As shown in FIG. 7, in operation 7300 a batch of frames (e.g., a batch of raw video data captured by the video cameras 110 during a capture interval I) is processed by the computing system 130 to select frames based on similarity in the feature space (e.g., in accordance with the feature space similarity-based frame selection module 300 shown in FIG. 3 and the method 400 shown in FIG. 4). The computing system 130 then processes the resulting frames (those selected according to their (dis)similarity in the feature space) in operation 7600 by selecting a subset of the frames based on uncertainty (e.g., further reducing the data), so that frames are selected based on both a similarity metric (e.g., based on dissimilarity) and an uncertainty metric.
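As a non-limiting illustration, the serial arrangement of operations 7300 and 7600 might be sketched in Python as follows. The function names, the use of the Euclidean distance, the in-order (rather than random) choice of reference frames, and the convention that a score "satisfies" a threshold when it is greater than or equal to it are assumptions of this sketch. The uncertainty scorer is passed in as a hypothetical callable so that it is invoked only on frames that survive the similarity stage.

```python
import math
from typing import Callable, List, Sequence


def select_dissimilar(
    features: Sequence[Sequence[float]], similarity_threshold: float
) -> List[int]:
    """Greedy selection in feature space (stand-in for operation 7300).

    A frame is kept as a new reference frame only if its feature vector
    is farther than `similarity_threshold` from every reference kept so far.
    """
    kept: List[int] = []
    for i, f in enumerate(features):
        if all(math.dist(f, features[j]) > similarity_threshold for j in kept):
            kept.append(i)
    return kept


def serial_select(
    features: Sequence[Sequence[float]],
    uncertainty_of: Callable[[int], float],  # hypothetical per-frame scorer
    similarity_threshold: float,
    uncertainty_threshold: float,
) -> List[int]:
    """Similarity stage first, then uncertainty stage on the survivors."""
    survivors = select_dissimilar(features, similarity_threshold)
    # Stand-in for operation 7600: score only the surviving frames.
    return [i for i in survivors if uncertainty_of(i) >= uncertainty_threshold]
```

Because the uncertainty scorer typically runs an object detector, calling it only on the survivors of the first stage is what reduces the load on the second stage.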

In some embodiments, rather than selecting frames, the feature space similarity-based frame selection module 300 and the uncertainty-based frame selection module 500 compute similarity scores and uncertainty scores, respectively, for every frame of data received from the video cameras 110. The similarity score of each frame is combined or aggregated with the corresponding uncertainty score of the frame to compute a total score for each frame. Frames whose total scores satisfy a threshold are selected and stored in the persistent storage system 170, and frames whose total scores do not satisfy the threshold are discarded.

FIG. 8 is a flowchart of a method 800 for selecting frames based on both dissimilarity and uncertainty, in parallel, according to an embodiment of the present disclosure. As shown in FIG. 8, in operation 8300, the computing system 130 scores a batch of frames of data based on similarity in the feature space, in substantially the same manner as described above with reference to FIGS. 3 and 4, to compute per-frame scores based on such dissimilarity metrics (e.g., how dissimilar each frame is to the selected reference frames in the batch). In various embodiments, the Euclidean (L2) distance, the Manhattan (L1) distance, or the cosine distance (or cosine similarity) is used to compute the similarity score in the feature space. In addition, in operation 8600, the computing system 130 scores the batch of frames of data (the same batch that is scored based on dissimilarity in the feature space) based on uncertainty, in substantially the same manner as described above with reference to FIGS. 5 and 6, to compute per-frame scores based on uncertainty.

In operation 8700, the computing system 130 aggregates the dissimilarity score d(x_i) and the uncertainty score u(x_i) of each frame to compute a total score o(x_i) for the frame. In some embodiments, this may be a linear combination of the two scores, in which a parameter α controls the relative weights of the dissimilarity scores and the uncertainty scores. However, embodiments of the present disclosure are not limited thereto, and other techniques for aggregating the two scores may be used instead.

In operation 8800, the computing system 130 determines whether the total score for the frame satisfies a total frame threshold. If the total frame threshold is satisfied, the frame is selected for storage in the persistent storage system 170. If the total frame threshold is not satisfied, the frame is discarded.
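As a non-limiting illustration, the parallel scoring of operations 8300 through 8800 might be sketched in Python as follows. The sketch assumes that the linear combination takes the convex form o(x_i) = α·d(x_i) + (1 - α)·u(x_i), that the dissimilarity score of a frame is its distance to the nearest reference frame, and that a total score satisfies the threshold when it is greater than or equal to it. None of these choices, nor the function names, are taken from the disclosure.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 1.0  # assumption: a zero vector is maximally dissimilar
    return 1.0 - dot / (na * nb)


DISTANCES = {
    "l2": lambda a, b: math.dist(a, b),                          # Euclidean
    "l1": lambda a, b: sum(abs(x - y) for x, y in zip(a, b)),    # Manhattan
    "cosine": cosine_distance,
}


def dissimilarity_scores(
    features: Sequence[Sequence[float]],
    reference_indices: Sequence[int],
    metric: str = "l2",
) -> List[float]:
    """d(x_i): distance from each frame to its nearest reference frame."""
    if not reference_indices:
        raise ValueError("at least one reference frame is required")
    dist = DISTANCES[metric]
    return [min(dist(f, features[r]) for r in reference_indices) for f in features]


def select_by_total_score(
    d: Sequence[float], u: Sequence[float], alpha: float, total_threshold: float
) -> List[int]:
    """Indices of frames whose total score o(x_i) meets the threshold."""
    if len(d) != len(u):
        raise ValueError("need one dissimilarity and one uncertainty score per frame")
    totals = [alpha * di + (1.0 - alpha) * ui for di, ui in zip(d, u)]
    return [i for i, o in enumerate(totals) if o >= total_threshold]
```

Because an L1 or L2 distance is unbounded while a margin-style uncertainty score usually lies in a fixed range, a practical system would likely rescale the two scores to comparable ranges before weighting them with α.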

In some embodiments, the feature extractor 310 of the feature space similarity-based frame selection module 300 may correspond to a portion of the neural network of the object detector of the uncertainty-based frame selection module 500. For example, if the K-class object detector 510 includes a backbone neural network such as a CNN (for example, MobileNetV3), the feature extractor 310 of the feature space similarity-based frame selection module 300 may include the same backbone neural network.

FIG. 9 is a block diagram illustrating a computing system 130 according to an embodiment of the present disclosure. As shown in FIG. 9, the computing system 130 according to an embodiment includes a processor 132 and a memory 134, where the processor 132 may be: a general-purpose central processing unit (CPU) (e.g., having one or more cores); a graphics processing unit (GPU); a field programmable gate array (FPGA); a neural processing unit (NPU) or neural network processor (NNP) (e.g., a processor having an architecture tailored to performing inference using neural networks); or a neuromorphic processor. For example, the parameters of a neural network (e.g., weights and biases) and the neural network architecture may be stored in a non-transitory memory connected to the processor, where the processor performs inference using the network by loading the parameters and the network architecture from the memory. As another example, in the case of an FPGA, the FPGA may be configured in a non-transitory manner with the network architecture and weights using a bitfile. The memory 134 may include a buffer memory 150 for storing a batch of data (e.g., a batch of frames of video data received from one or more video cameras 110). The computing system 130 may further include a co-processor 136, which may include: a general-purpose CPU (having one or more cores); a GPU; an FPGA; an NPU or NNP; or a neuromorphic processor. For example, the co-processor 136 may be configured to independently perform (and accelerate or offload) computations associated with components such as the feature extractor 310 and the K-class object detector 510. As described above, the computing system 130 may be configured to store the data selected by the computing system 130 in the persistent storage system 170. In addition, the memory 134 stores instructions that, when executed by the processor 132 and/or the co-processor 136, implement the modules and methods described above in accordance with embodiments of the present disclosure.

Accordingly, aspects of embodiments of the present disclosure relate to systems and methods for data reduction. While the present disclosure has been described in connection with certain exemplary embodiments, it is to be understood that the present disclosure is not limited to the disclosed embodiments but, on the contrary, is intended to cover various modifications and equivalent arrangements included within the scope of the appended claims and their equivalents.

100: video capture and data reduction system

Claims

A computing system for decimating video data, the computing system comprising:
a processor;
a persistent storage system coupled to the processor; and
a memory storing instructions that, when executed by the processor, cause the processor to decimate a batch of frames of video data by:
receiving the batch of frames of video data;
mapping, by a feature extractor, the frames of the batch to corresponding feature vectors in a feature space, each of the feature vectors having a lower dimension than a corresponding one of the frames of the batch;
selecting a set of dissimilar frames from the frames of the video data based on dissimilarities between corresponding ones of the feature vectors; and
storing the selected set of dissimilar frames in the persistent storage system, the size of the selected set of dissimilar frames being smaller than the number of frames in the batch of frames of video data.

The computing system of claim 1,
wherein the instructions that cause the processor to select the set of frames comprise instructions that, when executed by the processor, cause the processor to:
randomly select a first reference frame from the batch of frames of video data;
discard a first set of frames from the frames of the batch, the first set of frames having corresponding feature vectors within a similarity threshold distance of a first feature vector corresponding to the first reference frame;
select a second reference frame from the frames of the video data, the second reference frame having a second feature vector, wherein a distance between the first feature vector and the second feature vector is greater than the similarity threshold distance; and
discard a second set of frames from the frames of the batch, the second set of frames having corresponding feature vectors within the similarity threshold distance of the second feature vector,
wherein the selected set of dissimilar frames includes the first reference frame and the second reference frame, and excludes the first set of frames and the second set of frames.

The computing system of claim 1,
wherein the feature extractor comprises a neural network.

The computing system of claim 3,
wherein the neural network comprises a convolutional neural network (CNN).

The computing system of claim 1,
wherein the computing system and the persistent storage system are on board a vehicle, and
wherein the batch of frames of video data is a portion of a stream of video data captured by a video camera mounted on the vehicle, the video camera being configured to capture images of the surroundings of the vehicle.

The computing system of claim 5,
wherein the batch of frames of video data is captured over a first time period having a length corresponding to a capture interval,
wherein the stream of video data includes a second batch of frames of video data corresponding to video data captured during a second time period having the length corresponding to the capture interval, the second time period immediately following the first time period, and
wherein the computing system is configured to decimate the batch of frames of video data within a time corresponding to the capture interval.

The computing system of claim 1,
wherein the memory further stores instructions for removing frames from the selected set of dissimilar frames based on an uncertainty metric, the instructions, when executed by the processor, causing the processor to:
supply each frame of the selected set of dissimilar frames to an object detector comprising a CNN to compute sets of a plurality of bounding boxes identifying portions of the frame that depict instances of each object class of a plurality of object classes, each bounding box of each set of bounding boxes having an associated confidence score;
compute an uncertainty score for each frame of the selected set of dissimilar frames based on the sets of bounding boxes and the associated confidence scores; and
remove, from the selected set of dissimilar frames, a set of frames having uncertainty scores that fail to satisfy an uncertainty threshold,
wherein the selected set of dissimilar frames stored in the persistent storage system excludes the set of frames having uncertainty scores that fail to satisfy the uncertainty threshold.

The computing system of claim 7,
wherein the object detector further comprises a long short-term memory (LSTM) neural network.

The computing system of claim 7, wherein the instructions for computing the uncertainty score comprise instructions that, when executed by the processor, cause the processor to:
identify the two highest associated confidence scores that correspond to the same portion of the frame and to different object classes; and
compare the two highest associated confidence scores,
wherein the uncertainty score is high when the difference between the two highest associated confidence scores is small, and
wherein the uncertainty score is low when the difference between the two highest associated confidence scores is large.

The computing system of claim 7,
wherein the feature extractor comprises the CNN of the object detector.

A computing system for decimating video data, the computing system comprising:
a processor;
a persistent storage system coupled to the processor; and
a memory storing instructions that, when executed by the processor, cause the processor to decimate a batch of frames of video data by:
receiving the batch of frames of video data;
supplying each frame of the batch of frames of video data to an object detector comprising a convolutional neural network (CNN) to compute sets of a plurality of bounding boxes identifying portions of the frame that depict instances of each object class of a plurality of object classes, each bounding box of each set of bounding boxes having an associated confidence score;
computing an uncertainty score for each frame of the batch of frames of video data based on the sets of bounding boxes and the associated confidence scores;
selecting a set of uncertain frames from the batch of frames of video data, wherein the uncertainty score of each frame of the set of uncertain frames satisfies an uncertainty threshold; and
storing the selected set of uncertain frames in the persistent storage system, the number of selected uncertain frames being smaller than the number of frames in the batch of frames of video data.

The computing system of claim 11,
wherein the object detector further comprises a long short-term memory (LSTM) neural network.

The computing system of claim 11, wherein the instructions for computing the uncertainty score comprise instructions that, when executed by the processor, cause the processor to:
identify the two highest associated confidence scores that correspond to the same portion of the frame and to different object classes; and
compare the two highest associated confidence scores,
wherein the uncertainty score is high when the difference between the two highest associated confidence scores is small, and
wherein the uncertainty score is low when the difference between the two highest associated confidence scores is large.

The computing system of claim 11,
wherein the computing system and the persistent storage system are on board a vehicle, and
wherein the batch of frames of video data is a portion of a stream of video data captured by a video camera mounted on the vehicle, the video camera being configured to capture images of the surroundings of the vehicle.

The computing system of claim 14,
wherein the batch of frames of video data is captured over a first time period having a length corresponding to a capture interval,
wherein the stream of video data includes a second batch of frames of video data corresponding to video data captured during a second time period having the length corresponding to the capture interval, the second time period immediately following the first time period, and
wherein the computing system is configured to decimate the batch of frames of video data within a time corresponding to the capture interval.

A computing system for decimating video data, the computing system comprising:
a processor;
a persistent storage system coupled to the processor; and
a memory storing instructions that, when executed by the processor, cause the processor to decimate a batch of frames of video data by:
receiving the batch of frames of video data;
mapping, by a feature extractor, the frames of the batch of frames of video data to corresponding feature vectors in a feature space, each of the feature vectors having a lower dimension than a corresponding one of the frames of the batch;
computing a plurality of dissimilarity scores based on the feature vectors, each of the dissimilarity scores corresponding to one of the frames of the batch;
supplying each frame of the batch of frames of video data to an object detector comprising a convolutional neural network (CNN) to compute sets of a plurality of bounding boxes identifying portions of the frame that depict instances of each object class of a plurality of object classes, each bounding box of each set of bounding boxes having an associated confidence score;
computing a plurality of uncertainty scores based on the sets of bounding boxes and the associated confidence scores, each of the uncertainty scores corresponding to one of the frames of the batch;
computing a total score for each frame by aggregating an associated dissimilarity score among the dissimilarity scores and an associated uncertainty score among the uncertainty scores;
selecting a set of frames from the batch of frames of video data, wherein the total score of each of the selected frames satisfies a total frame threshold; and
storing the selected set of frames in the persistent storage system, the number of selected frames being smaller than the number of frames in the batch of frames of video data.

The computing system of claim 16,
wherein the instructions for computing the uncertainty scores comprise instructions that, when executed by the processor, cause the processor to:
identify the two highest associated confidence scores that correspond to the same portion of the frame and to different object classes; and
compare the two highest associated confidence scores,
wherein the uncertainty score is high when the difference between the two highest associated confidence scores is small, and
wherein the uncertainty score is low when the difference between the two highest associated confidence scores is large.

The computing system of claim 16,
wherein the computing system and the persistent storage system are on board a vehicle, and
wherein the batch of frames of video data is a portion of a stream of video data captured by a video camera mounted on the vehicle, the video camera being configured to capture images of the surroundings of the vehicle.

The computing system of claim 18,
wherein the batch of frames of video data is captured over a first time period having a length corresponding to a capture interval,
wherein the stream of video data includes a second batch of frames of video data corresponding to video data captured during a second time period having the length corresponding to the capture interval, the second time period immediately following the first time period, and
wherein the computing system is configured to decimate the batch of frames of video data within a time corresponding to the capture interval.

The computing system of claim 16,
wherein the feature extractor comprises the CNN of the object detector.