RU2783632C1

RU2783632C1 - Method and system for video scenes segmentation

Info

Publication number: RU2783632C1
Application number: RU2022116268A
Authority: RU
Inventors: Роман Валерьевич Лексутин; Евгений Юрьевич Жилин
Original assignee: Публичное Акционерное Общество "Сбербанк России" (Пао Сбербанк)
Filing date: 2022-06-16
Publication date: 2022-11-15

Abstract

FIELD: computer technology.

SUBSTANCE: the invention relates to computer technology for segmenting a video sequence into scenes. The effect is achieved by: determining features for each frame of the video sequence in each of the data streams; vectorizing said features in each of said data streams; normalizing the vector representations obtained in each stream and then concatenating them into a single vector that holds the common set of features for each frame; determining a distance metric for each data stream, namely a cosine distance between vectors for face images in the video and a Euclidean distance between vectors for text data and content images; calculating, based on said metrics, a common metric for said single vector characterizing each frame of the video sequence; and segmenting the video sequence into contextually related scenes by comparing the single vector representations of frames in vector space, a division being made where the common metric between the single vectors of frames exceeds a threshold value.

EFFECT: improving the accuracy of determining contextual scenes for video segmentation, due to the parallel analysis of the data streams that form the video.

7 cl, 4 dwg

Description

FIELD OF TECHNOLOGY

[0001] The present solution relates to the field of computer technology, in particular to a method and system for segmenting scenes of a video sequence.

BACKGROUND OF THE INVENTION

[0002] Many company processes (for example, sales and product development) depend on communication with counterparties (customers, employees of other departments, etc.). These communications create and transmit large volumes of unstructured or semi-structured information, which is used to adjust how the relevant processes are carried out.

[0003] One of the most common forms of communication is the online meeting, which uses several data channels: participants see each other (video), talk over audio (possibly a telephone line), and can show content (presentations, screen sharing, etc.).

[0004] For example, in sales, communication with a customer must reveal the customer's need so that matching items from the company's range of goods and services can then be offered. It is therefore necessary to identify, within the communication, the time interval and the artifacts or objects that relate to the customer's needs (for example, a description of, or request for, a commercial proposal). Such a time interval is called a scene. Scene detection is the task of segmenting video and other unstructured information that the parties exchange while communicating. Based on the analysis of the relevant scenes, certain actions can be taken (in effect solving the problem of classifying a scene against a given set of actions or classes): sending the buyer matching offers from the available assortment, calculating and offering a discount on additional goods, etc.

[0005] A well-known and readily available approach to video segmentation (included in many open-source video libraries, for example opencv) splits the video sequence into scenes at transitions between frames, by evaluating the difference between the characteristics of successive frames. This approach ignores the contextual component of the communication (the semantic content of the video images, audio, presentation slides, etc.) and does not allow scenes to be classified in order to obtain practical results or actions that depend on the context (i.e. the meaning or content) of the scenes.

[0006] Patent RU 2628192 C2 (Joint-Stock Company "Creative and Production Association "Central Film Studio for Children's and Youth Films named after M. Gorky", 15.08.2017) describes a means of video segmentation and classification, but it uses only one data channel, the video image, as the source of input features. This approach is unsuitable for online communications, in which the video image may stay static for a long time while the audio discusses several topics belonging to different contextual scenes.

[0007] The article A Local-to-Global Approach to Multi-modal Movie Scene Segmentation (https://arxiv.org/abs/2004.02678) describes a framework for extracting contextual scenes in films using multimodal characteristics of each frame (place, cast, action and audio). The feature extraction method of this framework is the closest solution, but it differs in that scene segmentation and classification rely on a supervised learning approach using the BNet network, designed specifically for feature films. A supervised approach cannot be used when online communications vary in style (depending on their purpose, the speakers' style, and the demonstration material used).

[0008] To overcome the shortcomings of the prior art solutions, the claimed solution proposes an approach that segments scenes by classification over the three data channels that form the video sequence, using machine learning models.

SUMMARY OF THE INVENTION

[0009] The claimed invention is directed to solving the technical problem of creating an efficient method for segmenting a video that contains a demonstration of content.

[0010] The technical result is an increase in the accuracy of determining contextual scenes for video segmentation, achieved through parallel analysis of the data streams that form the video.

[0011] The claimed result is achieved by a method for segmenting scenes of a video sequence, performed by a computing device and comprising the steps of:

receiving an input video sequence containing video and speech data;

separating the input video sequence into three data streams: face images, text data based on transcribed speech information, and images of the content presented in the video sequence;

determining features for each frame of the video sequence in each of said data streams;

performing vectorization of said features in each of said data streams;

normalizing the vector representations obtained in each stream and then concatenating the normalized vector representations to obtain a common set of features for each frame of the video sequence as a single vector;

determining a distance metric for each data stream: a cosine distance between vectors for face images in the video, and a Euclidean distance between vectors for text data and content images;

calculating, based on said metrics, a common metric for said single vector characterizing each frame of the video sequence;

segmenting the video sequence into contextually related scenes by comparing the resulting single vector representations of frames in vector space, wherein a division is made where the common metric between the single vectors of frames of the video sequence exceeds a threshold value.

[0012] In one particular embodiment of the method, vector representations are formed from the face images that characterize at least one of: facial characteristics, gender, age, gaze direction, emotions.

[0013] In another particular embodiment of the method, gestures displayed in the video sequence are additionally recognized.

[0014] In another particular embodiment of the method, audio characteristics of the voices in the video sequence are additionally extracted from the speech data.

[0015] In another particular embodiment of the method, the audio characteristics of the voices include at least one of: tonality, intensity, formants.

[0016] In another particular embodiment of the method, the demonstrated content is additionally subjected to OCR processing to recognize the information presented.

[0017] The claimed invention is also implemented by a system for segmenting scenes of a video sequence, comprising at least one processor and a memory storing machine-readable instructions that, when executed by the processor, implement the above method for segmenting scenes of a video sequence.

BRIEF DESCRIPTION OF THE DRAWINGS

[0018] FIG. 1 illustrates a block diagram of the claimed method.

[0019] FIG. 2 illustrates an example of the data streams of a video sequence.

[0020] FIG. 3 illustrates an example of forming an averaging normalized vector for the video stream.

[0021] FIG. 4 illustrates the general layout of a computing device.

IMPLEMENTATION OF THE INVENTION

[0022] FIG. 1 shows a block diagram of the claimed method (100) for segmenting a video sequence. At the first step (101), the device executing the method (100) receives input data in the form of video data, in particular a video sequence containing both video images and a demonstration of accompanying content, for example a video presentation. Any suitable type of computing equipment can serve as the executing device, for example a computer or a server. The information can be transferred by any suitable means of communication, for example over the Internet, by loading the data directly into the computing device, or by any other known way of transferring digital information.

[0023] As shown in FIG. 2, at step (102) three data streams are extracted from the received video sequence, and the corresponding features are computed in each of them:

- video data (201);

- images of the content presented in the video sequence (202);

- an audio stream (203), the text data (2031) obtained from the transcribed speech information, and their subsequent vector representations (2032).

[0024] The streams can be extracted using approaches known in the art for extracting, from a frame of the video stream, the content displayed in the video. For example, a region of interest containing the demonstrated content (defined as a specific, fixed area of the frame, for example using OpenCV algorithms) can be extracted from the frames of the video sequence. Images of people's faces are extracted from the video data stream, for example using face recognition algorithms. The audio stream is transcribed into text for subsequent analysis.

[0025] At step (103), the data in each stream are processed to determine the features at each point in time (frame of the video sequence). In particular, a time window in which the information is processed (T1, T2, T3) can be set for each stream (201-203). A frame rate can also be set for analyzing frames of video data (F1) and images of the content presented in the video sequence (F3). The frame rate is a configurable parameter that determines the balance between accuracy and system performance (the recommended value is 2 frames per second, and no lower than 1 frame every 2 seconds).
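The frame-rate setting in [0025] can be illustrated with a minimal sketch that picks which source frames to analyze. The function name `sampled_frame_indices` and the parameter `source_fps` are assumptions made for illustration; the source only specifies the recommended analysis rate.

```python
def sampled_frame_indices(total_frames, source_fps, analysis_fps=2.0):
    """Return indices of source frames to analyse at roughly analysis_fps.

    analysis_fps plays the role of F1 or F3; 2.0 is the recommended value
    from the text. source_fps is the frame rate of the recording (assumed).
    """
    if source_fps <= 0 or analysis_fps <= 0:
        raise ValueError("frame rates must be positive")
    # Never sample more often than the source provides frames.
    step = max(1, round(source_fps / analysis_fps))
    return list(range(0, total_frames, step))


# Three seconds of 30 fps video analysed at 2 frames per second.
print(sampled_frame_indices(90, 30.0))
```

Lowering `analysis_fps` toward 0.5 (one frame every 2 seconds) reduces the processing load at the cost of coarser scene boundaries.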

[0026] The processing window in the data streams serves two main functions:

- correcting errors and artifacts in the video channels (by subsequently discarding anomalous values in the window and averaging the images): anomalies are detected from the forecast-versus-actual error (using the Upper Control Limit method) relative to the forecast of a VAR (Vector Auto-Regression) model whose max lag parameter is set equal to the size of the processing window;

- making it possible to work with speech features of the audio stream: speech-to-text algorithms cannot be applied "at an instant" (speech is always processed over some interval of time), so relating the vector representation of the meaning of the spoken speech to a frame of the video sequence requires processing the audio stream with a moving window T3.
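The first function of the window can be sketched as follows. This is a simplified illustration, not the patented procedure: the VAR model is not implemented, its forecasts are passed in as ready-made vectors, and the Upper Control Limit is taken as the mean error plus `k` standard deviations (the conventional control-chart choice `k = 3` is an assumption). All function names are hypothetical.

```python
import math
import statistics


def upper_control_limit(errors, k=3.0):
    """Mean plus k population standard deviations of the forecast errors."""
    return statistics.fmean(errors) + k * statistics.pstdev(errors)


def robust_window_mean(vectors, forecasts, k=3.0):
    """Drop vectors whose forecast error exceeds the UCL, average the rest.

    vectors   -- observed feature vectors inside one processing window
    forecasts -- forecast vectors for the same frames (e.g. from a VAR
                 model, which this sketch does not implement)
    """
    # Forecast-versus-actual error measured as Euclidean distance (assumed).
    errors = [math.dist(v, f) for v, f in zip(vectors, forecasts)]
    ucl = upper_control_limit(errors, k)
    kept = [v for v, e in zip(vectors, errors) if e <= ucl]
    dim = len(vectors[0])
    return [statistics.fmean(v[i] for v in kept) for i in range(dim)]


# Twenty frames: nineteen match the forecast, one is a glitch.
observed = [[1.0, 1.0]] * 19 + [[9.0, 9.0]]
predicted = [[1.0, 1.0]] * 20
print(robust_window_mean(observed, predicted))
```

With very short windows a single outlier cannot exceed a three-sigma limit, which is one reason the window has to cover enough frames for the check to be meaningful.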

[0027] The processing windows (T1, T2, T3) are configurable parameters that determine the robustness of the algorithm (recommended values: T1 = 100/F1 seconds for the video stream, T2 = 100/F3 seconds for the content stream, and T3 = 30 seconds for the audio stream).
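As a worked example of these recommendations, the sketch below computes the window lengths from the analysis frame rates and derives the audio interval associated with a frame. The helper names are hypothetical, and treating T3 as a trailing window that ends at the frame time is an assumption; the source does not say how the moving window is aligned.

```python
def window_seconds(analysis_fps, frames_per_window=100):
    """Window length in seconds, following T = 100 / F from the text."""
    if analysis_fps <= 0:
        raise ValueError("analysis_fps must be positive")
    return frames_per_window / analysis_fps


def audio_window(t, t3=30.0):
    """(start, end) of the audio interval tied to the frame at time t.

    Assumes a trailing window of length T3 clipped at the start of the video.
    """
    return (max(0.0, t - t3), t)


print(window_seconds(2.0))   # T1 at the recommended 2 frames per second
print(audio_window(45.0))    # audio interval for the frame at t = 45 s
```

At the recommended rate of 2 frames per second, T1 and T2 each span 50 seconds, that is, 100 analyzed frames.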

[0028] For each data stream, features are determined and then converted into vector form (embeddings) at step (104) for subsequent processing by a machine learning model.

[0029] At each point in time t (each second of video, audio, etc.), a feature vector is determined in the metric space (P, d), where P is the set of vectors characterizing the context (including the semantic content of the video, the demonstrated content, the audio, etc.) at that point in time, and d is a metric that defines the distance between vectors of the set P.

[0030] For example, a feature vector is generated for a frame:

And for the content broadcast in this frame:

[0031] The described solution can work with any numerical features, but the following are chosen as the reference list:

- video: facial embeddings (including the facial characteristics encoded in them, such as gaze direction, gender, age and emotions), gesture detection, HSV values;

- audio: speech-to-text conversion followed by a language model that produces sentence embeddings, plus audio characteristics of the voices of all people present in the analyzed time window (tonality, intensity, formants);

- video of the demonstrated content: HSV values, an image embedding, and detection (using OCR) and embedding of the text describing the content.

[0032] Next, at step (105), the feature vectors obtained in each of the streams (201-203) are normalized and then concatenated to obtain an averaging feature vector for each data stream. FIG. 3 shows an example of obtaining the averaging vector (2011) for the video stream (201).

[0033] For the video image stream (201), anomalous vectors within the window are discarded and the remaining ones are averaged. Similarly, for the content (202) and audio (203) streams, the feature vectors are normalized by min-max normalization (Max/Min Normalization) within feature groups and concatenated into a single vector; normalization is needed because concatenating individual embeddings without it would affect the subsequent distance calculation.
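A minimal sketch of min-max normalization within feature groups followed by concatenation is given below. The source does not state over which values the minimum and maximum are taken; this sketch assumes they are computed over all values of one feature group within the processing window. The function names and the sample numbers are hypothetical.

```python
def normalize_group(frames):
    """Min-max scale one feature group to [0, 1].

    frames -- per-frame vectors of a single feature group (e.g. HSV values);
    min and max are taken over every value of the group in the window.
    """
    values = [x for vec in frames for x in vec]
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A constant group carries no information; map it to zeros.
        return [[0.0] * len(vec) for vec in frames]
    return [[(x - lo) / span for x in vec] for vec in frames]


def concat_groups(groups):
    """Normalize each group, then concatenate the groups frame by frame."""
    normalized = [normalize_group(g) for g in groups]
    n_frames = len(groups[0])
    return [[x for g in normalized for x in g[t]] for t in range(n_frames)]


# Two frames, two feature groups on very different scales.
hsv = [[0.0, 50.0], [100.0, 25.0]]
embedding = [[-1.0, 1.0], [0.0, 1.0]]
print(concat_groups([hsv, embedding]))
```

Without the scaling step, the HSV group in this example would dominate any Euclidean distance computed on the concatenated vector simply because its raw values are larger.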

[0034] For example, the resulting single feature vector for frame No. 17 has the following form:

[0035] Similarly, an averaged vector is also formed for each frame of the video sequence in the corresponding streams (202-203).

[0036] Based on the set of features obtained for each data stream (201-203), a distance metric is determined at step (106). It defines a metric space in which each vector describes the current state at a point in time. A distance metric is a numerical function that satisfies the axioms of a distance in that metric space. Examples include the Hamming distance, the Euclidean distance and the cosine distance. Because vectors in different data streams may be compared by different metrics, the metric d is formally a set of metrics (d1, d2, d3): vectors are compared by applying the individual metrics to the respective streams (201-203) and then taking a weighted average of the results (with different weights for the component metrics d1, d2, d3).

[0037] For example, the metric d may be composed of the set of metrics (d1, d2, d3) as follows:

- metric d1, for the video channel with face images (embeddings obtained with a pre-trained ResNet50 architecture), is the cosine coefficient (cosine similarity) of two vectors;

- metric d2, for the video of the demonstrated content (using a pre-trained ResNet50-based model to determine the scene context and a GPT3 model to obtain sentence embeddings), is the Euclidean distance between vectors;

- metric d3, for the audio channel (using a pre-trained embedding model based on the BERT architecture), is the Euclidean distance between vectors.
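The combination of the three per-stream metrics into a common metric can be sketched as a weighted average. This is an illustration under stated assumptions: d1 is expressed as a cosine distance (one minus the cosine similarity) so that larger values mean less similar frames in all three streams, the equal default weights are hypothetical, and the dictionary keys `faces`, `content` and `speech` are invented names for the three streams.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def combined_distance(frame_a, frame_b, weights=(1.0, 1.0, 1.0)):
    """Weighted average of d1 (faces), d2 (content) and d3 (speech text)."""
    d1 = cosine_distance(frame_a["faces"], frame_b["faces"])
    d2 = math.dist(frame_a["content"], frame_b["content"])  # Euclidean
    d3 = math.dist(frame_a["speech"], frame_b["speech"])    # Euclidean
    w1, w2, w3 = weights
    return (w1 * d1 + w2 * d2 + w3 * d3) / (w1 + w2 + w3)


a = {"faces": [1.0, 0.0], "content": [0.0, 0.0], "speech": [0.0, 0.0]}
b = {"faces": [0.0, 1.0], "content": [3.0, 4.0], "speech": [0.0, 0.0]}
print(combined_distance(a, b))
```

Raising one weight makes the segmentation more sensitive to changes in that stream, for instance to topic changes in speech while the picture stays static.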

[0038] As a result of the preceding steps, a vector in the metric space (P, (d1, d2, d3)) is determined for each time point t. The feature vector changes over the course of the video sequence (since a vector is determined for each frame shown at time t). That is, each moment of the video sequence is represented by a set of vectors tied to time.

[0039] Next, at step (107), the input video sequence is segmented into scenes by comparing the resulting single vector representations of frames. To identify the data and context of scenes for subsequent segmentation, the method uses the metric space described above (the feature set and the corresponding distance metric) and a dataset with examples of how information in communication channels is segmented (in effect, a mathematical description of individual regions of the metric space).

[0040] Since the feature vectors for the frames of the video sequence are not independent and must be treated as a sequence, sequence clustering by Optimal Sequential Grouping is applied rather than plain clustering methods.

[0041] To identify sequences in which individual frames may differ greatly yet belong contextually to the same scene, the algorithm proceeds in two stages.

[0042] At the first stage, each pair of consecutive frames is compared using the metric d; where the sensitivity threshold L is exceeded, a preliminary division into s segments is made, defined by the sequence of time marks segms = [t⁽¹⁾, t⁽²⁾, …, t⁽ˢ⁾].

An example of the results of comparing consecutive frames by the metric d:

[…, 0.31, 0.25, 0.79, 0.12, 0.17, …, 0.47, 0.85, 0.14]

Accordingly, in this example the scene boundaries correspond to the frames whose distances from the preceding frame are 0.79 and 0.85.
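The first-stage thresholding can be illustrated with a short sketch. The threshold value 0.5 is an assumption chosen only so that the example reproduces the two boundaries named above; the text does not state the value of L. The function name is hypothetical, and only the values visible in the example are used, so the returned indices are positions in this shortened list, not frame numbers.

```python
def preliminary_boundaries(distances, threshold):
    """Return the positions whose distance to the preceding frame exceeds the threshold."""
    return [i for i, d in enumerate(distances) if d > threshold]

# Visible values from the example; the elided entries are left out, not guessed.
visible = [0.31, 0.25, 0.79, 0.12, 0.17, 0.47, 0.85, 0.14]
assumed_L = 0.5  # hypothetical sensitivity threshold

boundaries = preliminary_boundaries(visible, assumed_L)
boundary_values = [visible[i] for i in boundaries]
```

With these inputs, `boundary_values` contains 0.79 and 0.85, matching the boundaries identified in the example.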

[0043] The resulting segments segms are then clustered with Optimal Sequential Grouping by solving an optimization problem that minimizes the distance between the centroids of the vectors in the segments segms, using the constructed matrix of pairwise distances between segments. Clustering methods are proposed as the segmentation algorithm, with the elbow method used to determine the number of clusters.
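A sequential grouping of this kind can be sketched with dynamic programming. The sketch below is an illustration under stated assumptions, not the method as claimed: it uses an additive objective (the sum of pairwise distances inside each contiguous group), which is one common formulation of Optimal Sequential Grouping and may differ from the centroid-based objective described above. The function name and the fixed number of groups `k` are hypothetical; in practice `k` could be chosen by running the function for several values and applying the elbow method to the resulting costs.

```python
def sequential_grouping(dist, k):
    """Split n ordered items into k contiguous groups (requires 1 <= k <= n),
    minimizing the sum of within-group pairwise distances.
    Returns (total cost, list of group start indices)."""
    n = len(dist)
    # block[i][j]: sum of pairwise distances among items i..j inclusive
    block = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            block[i][j] = block[i][j - 1] + sum(dist[a][j] for a in range(i, j))
    inf = float("inf")
    # best[m][j]: minimal cost of splitting the first j items into m groups
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                # the last group covers items i..j-1
                cost = best[m - 1][i] + block[i][j - 1]
                if cost < best[m][j]:
                    best[m][j] = cost
                    back[m][j] = i
    # walk the back-pointers to recover where each group starts
    starts, j = [], n
    for m in range(k, 0, -1):
        j = back[m][j]
        starts.append(j)
    return best[k][n], starts[::-1]
```

Because groups are required to be contiguous, the ordering of the preliminary segments in time is preserved, which is what distinguishes this from plain clustering.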

[0044] A calibration sample can be used to calibrate and select the parameters used for segmentation in this approach. A calibration sample is a labeled dataset of communications in video, with scenes (segments) marked by scene start and end labels. Unlike the supervised approach, calibrating and selecting parameters for movie segmentation requires only 200 scenes rather than the 21,000 scenes used in A Local-to-Global Approach to Multi-modal Movie Scene Segmentation.

[0045] The calibration sample makes it possible to optimize and select the parameters used for segmentation. The calibrated parameters are:

- the weight coefficients w1, w2, w3 of the metrics d1, d2, d3 used to calculate the metric d;

- the sensitivity threshold L.

[0046] Calibration is performed by adjusting the cost function, which is the sum of the errors on the calibration sample (i.e., a classical error minimization problem).
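One possible way to carry out such a calibration is an exhaustive grid search, sketched below. Everything specific here is an assumption made for illustration: the error is taken to be the number of missed plus spurious boundaries, the metric d is taken to be a weighted sum of d1, d2, d3, and the function names and grids are hypothetical. The text does not specify the error definition or the optimizer.

```python
from itertools import product

def predict_boundaries(stream_distances, weights, threshold):
    """stream_distances: list of (d1, d2, d3) tuples for consecutive frame pairs.
    Returns the set of positions where the weighted sum exceeds the threshold."""
    return {i for i, ds in enumerate(stream_distances)
            if sum(w * d for w, d in zip(weights, ds)) > threshold}

def cost(sample, weights, threshold):
    """Sum over calibration videos of missed plus spurious boundaries (assumed error)."""
    return sum(len(predict_boundaries(dists, weights, threshold) ^ truth)
               for dists, truth in sample)

def calibrate(sample, weight_grid, threshold_grid):
    """Try every (w1, w2, w3) and L on the grids and return the lowest-cost pair."""
    candidates = product(product(weight_grid, repeat=3), threshold_grid)
    return min(candidates, key=lambda c: cost(sample, c[0], c[1]))
```

Here `sample` is a list of `(stream_distances, true_boundaries)` pairs, one per labeled video, and `calibrate` returns a pair `((w1, w2, w3), L)`.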

[0047] To debug the machine learning model used for scene classification, classification by tags can be performed on the basis of the same single vectors obtained at step (105) and the scenes obtained by segmentation at step (107). A training sample of objects extracted from scenes (tags) is used for classification. Each scene can have several tags, i.e., a multilabel classification problem is solved.

[0048] The scene representation used for classification is the averaged vector over the entire scene (here, the averaged vector means the vector from the scene that has the smallest Euclidean distance to the mean vector). Since the feature vector has already been prepared at step (105), classification does not require complex end-to-end deep learning approaches; training on the training sample of tags can instead use classical machine learning (ML) methods, for example gradient boosting.
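The scene representation described above, i.e. the actual frame vector nearest to the mean, can be sketched in a few lines. The function name is hypothetical, and the sketch assumes every frame vector in the scene has the same length.

```python
import math

def scene_representative(vectors):
    """Return the scene vector with the smallest Euclidean distance to the element-wise mean."""
    n = len(vectors)
    mean = [sum(column) / n for column in zip(*vectors)]
    return min(vectors, key=lambda v: math.dist(v, mean))
```

Choosing an existing vector rather than the mean itself keeps the representation inside the set of vectors that were actually observed in the scene.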

[0049] The proposed approach can be widely applied to efficient automated segmentation of video sequences using machine learning technologies and algorithms which, through training on appropriate datasets, can classify contextually related scenes with high probability in order to extract them from the overall data stream. For example, it can be useful for separating blocks of presentations or conferences by analyzing the displayed content and segmenting it into contextually unrelated blocks, which can then be passed as segments to external content display systems, for example video on demand systems or the like.

[0050] FIG. 4 shows a general view of a computing system based on a computing device (300) suitable for performing the method (100). The device (300) may be, for example, a server or another type of computing device that can be used to implement the claimed method.

[0051] In general, the computing device (300) contains, connected by a common data exchange bus, one or more processors (301), memory means such as RAM (302) and ROM (303), input/output interfaces (304), input/output devices (305), and a network communication device (306).

[0052] The processor (301) (or several processors, or a multi-core processor) may be selected from the range of devices widely used at present, for example from Intel™, AMD™, Apple™, Samsung Exynos™, MediaTEK™, Qualcomm Snapdragon™, and the like. A graphics processor, for example from Nvidia, AMD, Graphcore, etc., may also be used as the processor (301).

[0053] The RAM (302) is random access memory intended for storing machine-readable instructions executed by the processor (301) to perform the necessary logical data processing operations. The RAM (302) typically contains executable instructions of the operating system and of the relevant software components (applications, program modules, etc.).

[0054] The ROM (303) is one or more persistent data storage devices, for example a hard disk drive (HDD), a solid state drive (SSD), flash memory (EEPROM, NAND, etc.), optical storage media (CD-R/RW, DVD-R/RW, Blu-ray Disc, MD), etc.

[0055] Various types of I/O interfaces (304) are used to organize the operation of the components of the device (300) and of externally connected devices. The choice of interfaces depends on the specific implementation of the computing device; they may include, without limitation: PCI, AGP, PS/2, IrDA, FireWire, LPT, COM, SATA, IDE, Lightning, USB (2.0, 3.0, 3.1, micro, mini, Type-C), TRS/audio jack (2.5, 3.5, 6.35), HDMI, DVI, VGA, DisplayPort, RJ45, RS232, etc.

[0056] Various I/O means (305) are used for user interaction with the computing device (300), for example a keyboard, a display (monitor), a touch display, a touchpad, a joystick, a mouse, a light pen, a stylus, a touch panel, a trackball, speakers, a microphone, augmented reality means, optical sensors, a tablet, indicator lights, a projector, a camera, biometric identification means (a retinal scanner, a fingerprint scanner, a voice recognition module), etc.

[0057] The network communication means (306) enables the device (300) to transfer data over an internal or external computer network, for example an intranet, the Internet, a LAN, and the like. The one or more means (306) may include, without limitation: an Ethernet card, a GSM modem, a GPRS modem, an LTE modem, a 5G modem, a satellite communication module, an NFC module, a Bluetooth and/or BLE module, a Wi-Fi module, etc.

[0058] In addition, satellite navigation means, for example GPS, GLONASS, BeiDou, Galileo, may also be used as part of the device (300).

[0059] The submitted application materials disclose preferred examples of implementing the technical solution and should not be construed as limiting other, particular embodiments that do not go beyond the scope of the legal protection sought and that are obvious to specialists in the relevant field of technology.

Claims

1. A method for segmenting scenes of a video sequence, performed using a computing device and comprising the steps of:

receiving an input video sequence containing video and speech data;

separating the input video sequence into three data streams: face images, text data based on the transcribed speech information, and content images presented in the video sequence;

determining features for each frame of the video sequence in each of said data streams;

performing vectorization of said features in each of said data streams;

normalizing the vector representations obtained in each stream and then concatenating the normalized vector representations to obtain a common set of features for each frame of the video sequence in the form of a single vector;

determining a distance metric for each data stream as a cosine distance between vectors for video face images and as a Euclidean distance between vectors for text data and content images;

calculating, based on said metrics, a common metric for said single vector characterizing each frame of the video sequence;

segmenting the video sequence into context-related scenes based on a comparison of the obtained single vector representations of frames in vector space, wherein the division is performed based on the common metric of the single vector representations of frames of the video sequence exceeding a threshold value.

2. The method according to claim 1, in which, based on the images of faces, vector representations are formed that characterize at least one of: facial characteristics, gender, age, gaze direction, emotions.

3. The method of claim 2, further comprising recognizing gestures displayed in the video.

4. The method according to claim 1, in which, additionally, the audio characteristics of voices in the video sequence are extracted from the speech data.

5. The method according to claim 4, wherein the audio characteristics of the voices include at least one of: tonality, intensity, formants.

6. The method according to claim 1, in which the displayed content is additionally subjected to OCR processing to recognize the presented information.

7. A video sequence scene segmentation system, comprising at least one processor and a memory storing machine-readable instructions that, when executed by the processor, implement the video sequence scene segmentation method according to any one of claims 1-6.