RU2742602C1

RU2742602C1 - Recognition of events on photographs with automatic selection of albums

Info

Publication number: RU2742602C1
Application number: RU2020112953A
Authority: RU
Inventors: Андрей Владимирович Савченко
Original assignee: Самсунг Электроникс Ко., Лтд.
Priority date: 2020-04-06
Filing date: 2020-04-06
Publication date: 2021-02-09

Abstract

FIELD: physics.

SUBSTANCE: the invention relates to image processing and analysis methods and can be used for organizing photo galleries in mobile systems. Disclosed is a method for recognizing events in photographs with automatic album selection. According to the method, album boundaries are assigned automatically by estimating the similarity between the confidence scores of each pair of consecutive photos in the gallery. The similarity between photographs is calculated either as the distance between the confidence scores for each event type, estimated by classifying each photograph, or as the sum of the similarity between those confidence scores and the similarity between the photographs' locations, if the EXIF (Exchangeable Image File Format) data of the photographs contain location information.

EFFECT: the technical result is improved event recognition accuracy, achieved by combining consecutively taken photographs with similar content into albums and then classifying each album using a neural attention mechanism.

11 cl, 5 dwg, 7 tbl

Description

Technical field of the invention

The invention relates to the field of computer technology, in particular to methods for processing and analyzing images, and can be used to organize a photo gallery in mobile systems.

Description of the prior art

Recently, owing to the rapid growth of social networks, cloud services and mobile technologies, people have been taking far more photos than ever before [13]. To organize a personal collection, photographs are usually assigned to albums according to some event. Photo organization systems (Apple iPhoto, Google Photos, etc.) allow the user to find the desired photos quickly and make working with the gallery more efficient [27]. Today, these systems usually rely on content analysis of photos and automatic association of each photo with various tags (scene description, people, objects, places, etc.). Such analysis can be used not only to retrieve photographs selectively by a specific tag in order to preserve pleasant memories of certain episodes of a user's life [32], but also to provide personalized recommendations that help users find relevant items in large collections. The development of such systems requires a careful analysis of the user modeling approach [26]. A large gallery of photos on a mobile device can be used to understand user interests such as sports, gadgets, fitness, clothing, cars, food, travel, pets, etc. [12, 20].

The present invention focuses on one of the most difficult parts of a photo organization engine, namely the task of recognizing events in photographs [1] in order to extract events such as holidays, sporting events, weddings, various activities, etc. The term "event" can be defined as a category that captures "the complex behavior of a group of people interacting with a variety of objects, taking place in a certain setting" [32]. There are two different tasks in event recognition. The first focuses on processing individual photographs, i.e. the event is treated as a complex scene with large variations in appearance and structure [32]. The second aims to predict the event categories of a group of photographs (an album) [4]. In the latter case, it is assumed that all photographs in the album are weakly labeled [2], while each photograph may have a different importance [33]. In practice, however, only a gallery of photos is available, so the second approach requires the user to select albums manually. Alternatively, albums can be created based on location if GPS tags are enabled. In both cases, the use of album-based event recognition is limited or even impossible.

Summary of the invention

The present invention studies a new formulation of the event recognition problem: the event categories must be predicted for a gallery of photos in which the albums (groups of photos corresponding to a single event) are unknown. A new two-stage approach is proposed. First, vector representations (features) are extracted from each photo using a pretrained convolutional neural network. These representations are classified individually. The classifier scores are used to group consecutive photographs into several clusters. Finally, the vector representations of the photographs in each group are aggregated into a single descriptor using a neural attention mechanism. This mechanism is a weighting scheme for a linear combination of all vector representations in the input set. The weights are computed adaptively by a feedforward neural network. Owing to the attention mechanism, the aggregation is invariant to the order of the images and does not depend on the number of images in the input set. This algorithm can optionally be extended to improve the classification accuracy of each photo in a group. In contrast to the usual fine-tuning of convolutional neural networks (CNNs), it is proposed to use image captioning, that is, a generative model that converts photographs into textual descriptions. The descriptions are one-hot encoded and converted into a sparse vector representation suitable for training an arbitrary classifier. Experiments with the Photo Event Collection and the Multi-Label Curation Flickr Events dataset showed that the accuracy of the proposed approach is 9-20% higher than that of event recognition on individual photographs. In addition, the proposed method has a 13-16% lower error rate than classification of groups of photographs obtained by hierarchical clustering. It has been shown experimentally that photo captions produced by a model trained on the Conceptual Captions dataset can be classified more accurately than vector representations from an object detector, although both appear to be less informative than CNN-based vector representations. Nevertheless, the proposed approach can be combined with conventional CNNs in an ensemble to obtain the highest accuracy on several event datasets.
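
The attention-based aggregation described above can be illustrated with a minimal sketch. It assumes the simplest possible feedforward scorer, a single linear layer with a parameter vector `q` followed by a softmax. The function name `attention_aggregate`, the vector `q` and the toy feature values are hypothetical and are not taken from the invention, where the weights would be learned and the features would come from a CNN.

```python
import math


def attention_aggregate(features, q):
    """Aggregate per-photo feature vectors into a single album descriptor.

    features: list of K feature vectors, each a list of D floats.
    q: list of D floats, a (hypothetical) learned attention parameter.
    Returns (descriptor, weights): a D-dimensional weighted sum of the
    feature vectors and the K attention weights, which sum to 1.
    """
    # One scalar score per photo: a single linear layer (assumption).
    scores = [sum(f_d * q_d for f_d, q_d in zip(f, q)) for f in features]
    # Numerically stable softmax turns the scores into weights.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Linear combination of all feature vectors in the input set.
    dim = len(features[0])
    descriptor = [
        sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)
    ]
    return descriptor, weights


# Toy album of three photos with 3-dimensional features (illustrative values).
album = [[1.0, 0.0, 2.0], [0.5, 1.5, 0.0], [2.0, 1.0, 1.0]]
q = [0.3, -0.2, 0.1]

descriptor, weights = attention_aggregate(album, q)
# Feeding the same photos in reverse order yields the same descriptor.
shuffled_descriptor, _ = attention_aggregate(album[::-1], q)
```

Because the weights are normalized over whatever set is passed in, the same function accepts albums of any size, and reordering the input leaves the descriptor unchanged up to floating-point rounding.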

Thus, the present invention is based on a new event recognition task in which a gallery of photos is given and it is known that the gallery contains ordered albums with unknown boundaries. It is proposed to assign these boundaries automatically by analyzing the visual content of consecutive photos in the gallery. Consecutive photographs are then grouped, and the descriptor of each group is computed using the attention mechanism of the neural aggregation module [37]. Finally, this approach is extended as follows. Although CNNs are traditionally used as discriminative models within a classifier framework, it is proposed to borrow generative models to represent the input photograph in another domain. In particular, known image captioning methods [14] are used to generate textual descriptions of photographs. The main contribution of the invention is the demonstration that the generated descriptions can be fed to a classifier in an ensemble to increase event recognition accuracy compared with traditional methods. Although the proposed visual representation is not as informative as the vector representations extracted by fine-tuned CNNs, it is better than the outputs of object detectors [20].

The task of recognizing events in personal photo collections is to recognize an event not in a single photograph but in an entire album [29]. In [8], events and sub-events in photo sub-sequences are identified by integrating optimized linear programming with the color descriptor of the image signature. In [5], Stopwatch Hidden Markov models were applied by treating the photos in an album as sequential data. Detectors of event-relevant objects were trained on a holidays dataset [29]; the holidays were then classified based on the object detector outputs. The paper [2] addresses the presence of irrelevant images in an album using multiple-instance learning methods. In [33], an iterative update procedure is presented for predicting the event type and assessing image importance in a Siamese network. The authors of that paper used a CNN that recognizes the event type, together with a long short-term memory (LSTM) based classifier of events in a sequence over the whole album. In addition, they successfully applied this method to learn representative deep features for image set analysis [34]. The latter approach focuses on estimating feature co-occurrences and frequencies, so temporal coherence of the photographs in an album is not required. In [13], a coarse-to-fine hierarchical event recognition model is proposed that uses multi-granular features [24] and is based on an attention network that learns representations of photo albums. The efficiency of re-finding expected photos on mobile phones was improved by a method that classifies personal photographs based on the relationship of the time and place of shooting to specific events [10].

Information about album boundaries is not always available, because a gallery contains an unstructured list of photos ordered by creation time. In this case, existing methods for recognizing events in individual photographs can be used [1]. As in other areas of computer vision, the most common approach relies predominantly on CNN-based architectures. For example, four different layers of a fine-tuned CNN were used to extract features and perform linear discriminant analysis in order to win the ChaLearn LAP 20 cultural event recognition competition [9]. To improve event recognition accuracy, the bounding boxes of detected objects are projected onto multiscale spatial maps [35]. In [32], a new iterative selection method was introduced to identify the subset of classes most relevant for transferring deep representations learned on object (ImageNet) and scene (Places2) datasets.

A method is proposed for recognizing events in photographs with automatic album selection, which comprises: automatically assigning album boundaries based on the visual content of consecutive photographs in a gallery; grouping the consecutive photographs from the gallery into albums; computing a descriptor of each album using the attention mechanism of a neural aggregation module; recognizing the event type tag of each album by feeding its descriptor to a classifier; and recognizing the event in the photographs in the gallery by assigning the corresponding event type tag to all photographs in each album. When the album boundaries are assigned automatically, the similarity between the confidence scores of each pair of consecutive photographs in the gallery is estimated; if this similarity, measured as a distance, does not exceed a predetermined threshold, both photographs are included in the same album. The similarity between photographs is calculated as the distance between the confidence scores for each event type, estimated by classifying each photograph. Alternatively, the similarity between photographs is calculated as the sum of the similarity between the confidence scores for each event type, estimated by classifying each photograph, and the similarity between their locations, if the EXIF (Exchangeable Image File Format) data of the photographs contain location information. The predetermined threshold is computed automatically during a preliminary training procedure using a provided training set of albums.
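
The boundary assignment can be sketched as a single pass over the gallery. The sketch below assumes a Euclidean distance between the per-photo confidence vectors and a hand-picked threshold of 0.3; the text above specifies neither the distance nor the threshold value (the threshold is learned from training albums), so both are illustrative assumptions, as are the function names. The optional location term from EXIF data is omitted.

```python
import math


def confidence_distance(p, q):
    """Euclidean distance between two confidence vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def split_into_albums(confidences, threshold):
    """Group consecutive photos into albums.

    confidences: per-photo confidence vectors, in gallery (time) order.
    threshold: maximum distance at which two consecutive photos are
        still placed in the same album.
    Returns a list of albums, each a list of photo indices.
    """
    if not confidences:
        return []
    albums = [[0]]
    for i in range(1, len(confidences)):
        distance = confidence_distance(confidences[i - 1], confidences[i])
        if distance <= threshold:
            albums[-1].append(i)   # similar enough: extend current album
        else:
            albums.append([i])     # boundary found: start a new album
    return albums


# Toy gallery: confidence scores over three hypothetical event types.
gallery = [
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [0.10, 0.85, 0.05],
    [0.15, 0.80, 0.05],
    [0.10, 0.10, 0.80],
]
albums = split_into_albums(gallery, threshold=0.3)
print(albums)  # [[0, 1], [2, 3], [4]]
```

Only neighbouring photos are compared, so the pass is linear in the gallery size and preserves the time ordering of the photos within each album.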

Also proposed is a method for recognizing an event in a photograph by representing it in the text domain, which comprises: computing vector representations of the photograph by feeding its RGB (red-green-blue) representation to a convolutional neural network; using an automatic image captioning method to generate textual descriptions of the photograph from its vector representations; encoding the generated caption; feeding the generated textual descriptions to an event classifier; and recognizing the event by assigning the classifier output to the event tag of the corresponding photograph. The generated caption is encoded as a sparse vector by one-hot encoding the textual description of the photograph: the v-th component of the vector equals 1 only if the generated caption contains at least one occurrence of the v-th word of the vocabulary. According to the proposed method, the outputs of the classifier of generated textual descriptions and of a traditional classifier of vector representations are additionally combined into an ensemble to improve event recognition accuracy. The classifier outputs are obtained as follows: confidence scores for all event types are computed for the photograph by feeding its vector representations to an arbitrary classifier; confidence scores for all event types are computed for the photograph by feeding the sparse vector of the one-hot encoded textual description to an arbitrary classifier; and the outputs of the vector-representation classifier and the text classifier are combined in an ensemble by simple soft voting. The textual description of the photograph is generated as follows: the extracted vector representations and the sequence of previously generated words are fed to a recurrent neural network (RNN) to predict the next word of the textual description; the indices generated by the RNN are mapped to actual words of the vocabulary; and a subset of the vocabulary is selected by choosing the most frequent words in the training data, optionally excluding stop words. In the proposed method, the predicted word is additionally appended to the sequence of previously generated words, and the extracted visual vector representations and this sequence are fed to the same RNN. The training set is a set of photographs with known event types. The traditional classifier should be trained beforehand to recognize events on the training set, in which each photograph is represented by its own extracted visual vector representations. In particular, the aggregated confidence scores are computed as a weighted sum of the confidence scores, and the decision is made in favor of the class with the maximum aggregated confidence score.
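
Two steps of this method, the sparse caption encoding and the soft-voting ensemble, can be sketched in a few lines. The vocabulary, the caption, the event types, the confidence values and the ensemble weight of 0.5 are all made up for illustration, and the function names are hypothetical. In the method itself the caption would come from the RNN and the scores from trained classifiers.

```python
def encode_caption(caption, vocabulary):
    """Encode a caption as a sparse binary vector over the vocabulary.

    The v-th component is 1 only if the caption contains the v-th
    vocabulary word at least once.
    """
    words = set(caption.lower().split())
    return [1 if word in words else 0 for word in vocabulary]


def soft_vote(visual_scores, text_scores, weight=0.5):
    """Combine two classifiers by a weighted sum of confidence scores.

    weight: share given to the visual classifier (assumed value);
        the text classifier receives 1 - weight.
    Returns (index of the winning class, aggregated scores).
    """
    combined = [
        weight * v + (1.0 - weight) * t
        for v, t in zip(visual_scores, text_scores)
    ]
    best = max(range(len(combined)), key=combined.__getitem__)
    return best, combined


# Hypothetical vocabulary of frequent caption words (stop words excluded).
vocabulary = ["cake", "people", "beach", "table", "ball"]
vector = encode_caption("People sitting around a table with a cake", vocabulary)
print(vector)  # [1, 1, 0, 1, 0]

# Hypothetical confidence scores from the two classifiers for one photo.
event_types = ["birthday", "beach trip", "sports"]
visual = [0.40, 0.45, 0.15]   # classifier of CNN vector representations
text = [0.70, 0.10, 0.20]     # classifier of encoded captions
best, combined = soft_vote(visual, text, weight=0.5)
print(event_types[best])  # birthday
```

In this toy case the visual classifier alone would pick "beach trip", while the aggregated scores favour "birthday", which shows how a second, textual view of the photo can change the ensemble decision.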

BRIEF DESCRIPTION OF DRAWINGS

The above and/or other aspects will become more apparent from the description of exemplary embodiments of the invention with reference to the accompanying drawings.

FIG. 1 depicts the proposed approach to gallery-based event recognition.

FIG. 2 depicts an attention-based neural network for vector representations from MobileNet v2.

FIG. 3 depicts a demonstration graphical user interface for mobile devices.

FIG. 4 depicts the proposed pipeline for event recognition based on image captioning.

FIG. 5 depicts exemplary event recognition results.

DETAILED DESCRIPTION

The invention can be used in photo organization software, which can optionally be implemented, for example, on a computer-readable medium, to automatically select albums from an unlabeled set of photographs and associate each album with specific tags (types of events or scenes). The present invention can be implemented in software and/or hardware in any device, such as a smartphone, mobile phone, tablet PC, laptop, computer, and the like.

The technical effect is a significant improvement in event recognition accuracy, achieved by combining consecutively taken photographs into albums (i.e., groups of consecutive photographs with similar content) and then classifying each album using a neural attention mechanism. To find similar photographs, the confidence scores of event classifiers are compared instead of the usual matching of visual vector representations. In addition to traditional visual vector representations, automatically generated textual descriptions (captions) of the photographs are used to improve the accuracy of the classifier ensemble by increasing the diversity of the visual vector representations.

The proposed solution is as follows:

Each photo in the gallery is processed as follows. Its embeddings are computed by feeding its RGB (red-green-blue) representation to the input of a convolutional neural network. To increase the diversity of the visual embeddings, an additional representation of the photo is obtained with automatic image captioning methods. In particular, a caption (text description) of the photo is generated by feeding its embeddings into a specially trained recurrent neural network, and the generated caption is then one-hot encoded. The confidence scores for each event type are estimated by classifying the photo with an ensemble that consists of a traditional classifier of embeddings and a classifier of one-hot encoded captions. The two members of the ensemble work as follows: in the traditional approach, the embeddings extracted by a convolutional neural network are classified directly; in the proposed approach, a text description of the photo is generated first, this text is converted into numerical form by one-hot encoding, and the resulting vector is classified by known methods.
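The caption branch of the ensemble can be illustrated with a minimal sketch. It assumes whitespace tokenization, a binary word-presence encoding, and a weighted average as the fusion rule; none of these details are specified in the text, and the names `build_vocabulary`, `one_hot_caption` and `ensemble_scores` are hypothetical.

```python
from typing import Dict, List, Sequence


def build_vocabulary(captions: Sequence[str]) -> Dict[str, int]:
    """Map every distinct lower-cased word to a column index."""
    vocab: Dict[str, int] = {}
    for caption in captions:
        for word in caption.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def one_hot_caption(caption: str, vocab: Dict[str, int]) -> List[int]:
    """Binary vector: 1 where a vocabulary word occurs in the caption."""
    vector = [0] * len(vocab)
    for word in caption.lower().split():
        index = vocab.get(word)
        if index is not None:  # words unseen during training are ignored
            vector[index] = 1
    return vector


def ensemble_scores(visual: Sequence[float], textual: Sequence[float],
                    weight: float = 0.5) -> List[float]:
    """Assumed fusion rule: weighted average of two confidence vectors."""
    return [weight * v + (1.0 - weight) * t for v, t in zip(visual, textual)]


# Toy captions standing in for the output of a captioning network.
captions = ["a group of people at a table", "a man riding a wave"]
vocab = build_vocabulary(captions)
encoded = one_hot_caption("people riding a wave", vocab)
# Confidence vectors of the embedding classifier and the caption classifier.
fused = ensemble_scores([0.7, 0.2, 0.1], [0.5, 0.4, 0.1])
```

The encoded caption vector would be the input of the second classifier; the fused vector plays the role of the per-photo confidence scores used in the following steps.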

The similarity between each pair of consecutive photos in the gallery is then estimated. Instead of the traditional similarity between the embeddings extracted by a convolutional neural network, the similarity is computed between the normalized confidence scores described above. To normalize the confidence scores, their L2 norm is computed as the square root of the sum of squares of all confidence scores, and each score is divided by this norm. If the EXIF (Exchangeable Image File Format) data of the photos contain location information, the similarity between their locations can be added to the similarity between the normalized confidence scores. Consecutive photos in the gallery are automatically grouped into albums according to the following rule: if the similarity measure (computed as a distance between the confidence vectors) does not exceed a given threshold, both photos are assumed to belong to the same album. The threshold is determined automatically during the preliminary training procedure using the provided training set of albums. The training albums are taken from an available training set, for example the Photo Event Collection (PEC) or the Multi-Label Curation of Flickr Events (ML-CUFED) dataset; they are needed to train the classifier.
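The normalization and comparison described above can be sketched as follows. The Euclidean distance is one of the admissible choices rather than a prescribed one, and the additive `location_term` argument is a hypothetical placeholder for whatever location measure is derived from the EXIF data.

```python
import math
from typing import List, Sequence


def l2_normalize(scores: Sequence[float]) -> List[float]:
    """Divide every confidence score by the L2 norm of the whole vector."""
    norm = math.sqrt(sum(s * s for s in scores))
    if norm == 0.0:  # all-zero vector: nothing to scale
        return list(scores)
    return [s / norm for s in scores]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def photo_dissimilarity(scores_a: Sequence[float], scores_b: Sequence[float],
                        location_term: float = 0.0) -> float:
    """Distance between normalized confidences plus an optional location term.

    location_term is an assumption: a precomputed distance between the EXIF
    locations of the two photos, or 0.0 when no location is available.
    """
    return euclidean(l2_normalize(scores_a), l2_normalize(scores_b)) + location_term


normalized = l2_normalize([3.0, 4.0])
same_direction = photo_dissimilarity([3.0, 4.0], [6.0, 8.0])
```

Because of the normalization, two score vectors that differ only in scale are treated as identical, so the grouping rule reacts to which events are likely rather than to how confident the classifier is overall.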

In contrast to existing methods for detecting events in photo collections, an event type tag is assigned to all photos of each album by a special neural network with an attention mechanism, trained to classify events in a set of photos from one album. The invention improves accuracy over conventional photo tagging because events are recognized jointly for all photos of a selected album, so errors on individual photos can be corrected automatically.

A recent trend in photo organization services is the annotation of personal photo albums [8]. In [17], a method is proposed for the hierarchical organization of photos on a smartphone by topics and thematic categories, based on integrating a convolutional neural network (CNN) with topic modeling for photo classification. In [16], automatic hierarchical clustering and a best-photo selection solution are proposed for modeling user decisions when organizing or clustering similar photos in albums. In [24], the organization of photo albums is considered for predicting user preferences on a mobile device.

Unfortunately, the accuracy of event classification on single photographs [32] is usually much lower than the accuracy of album-based recognition [33]. The present invention therefore proposes to concentrate on other suitable visual representations extracted with generative models, in particular on image captioning methods. Image captioning has a wide range of applications, from automatically generating descriptions of photos posted on social networks to retrieving photos from databases using the generated text descriptions [30]. Image captioning methods are typically based on an encoder-decoder neural network that first encodes a photo into a fixed-length embedding using a pretrained CNN and then decodes this embedding into a caption (a natural-language description). While the decoder (generator) is trained, the input photo and its ground-truth text description are fed into the network as input, and the desired network output is a one-hot encoded description. The description is encoded using text embeddings in the "Embedding (lookup table)" layer [11]. The generated embeddings of the photo and of the text are combined by concatenation or summation and form the input of the decoder part of the network. The decoder usually contains a recurrent neural network (RNN) layer followed by a fully connected layer with a softmax output.

One of the first successful models, Show and Tell [31], won the first MS COCO Image Captioning Challenge in 2015. Its decoder is an RNN with long short-term memory (LSTM) units. Its improved version, Show, Attend and Tell [36], adds a soft attention mechanism to improve the quality of caption generation. The Neural Baby Talk image captioning model [18] generates a template with slot locations explicitly tied to specific image regions. These slots are then filled with visual concepts identified by object detectors. The foreground regions are obtained with a Faster-RCNN network [21], and the decoder is an LSTM with an attention mechanism. The multimodal recurrent neural network (mRNN) [19] uses an Inception network to extract photo embeddings and a deep RNN to generate sentences. One of the best models at present is the Auto-Reconstructor Network (ARNet) [6], whose encoder is the Inception-V4 network [28] and whose decoder is based on an LSTM. Two pretrained models are available, with greedy search (ARNet-g) and with beam search of size 3 (ARNet-b), for generating the final caption of each input photo.

MATERIALS AND METHODS

Problem statement

As noted above, the task is to recognize the event in each photo of a gallery. At the first stage, albums (groups of consecutive photos of the same event taken by the user) are selected automatically by analyzing the content of the photos. The event corresponding to each selected album is then recognized, and the event recognized for an album is assigned to every photo of that album. As a result, events are recognized more accurately than by classifying each photo individually.

This subsection describes a technique that solves this problem by processing photographs sequentially, similarly to cluster analysis with the Basic Sequential Algorithmic Scheme (BSAS) [3]. The main task can be formulated as follows: each photo $X_t$, $t \in \{1,\dots,T\}$, from the gallery of an input user must be assigned to one of $C > 1$ event categories (classes). Here $T \ge 1$ is the total number of photos in the gallery. A training set of $N \ge 1$ albums is available for training the event classifier. The $n$-th training (reference) album is a collection of $L_n$ photos $\{X_n(l)\}$, $l \in \{1,\dots,L_n\}$. The class label $c_n \in \{1,\dots,C\}$ of each $n$-th album is assumed to be given, i.e., each album is associated with one particular type of event.

Conventional event recognition on single photographs [32] is a special case of the problem formulated above with $T = 1$. Consequently, the problem can be solved by classifying the event separately in each $t$-th photo ($t = 1, 2, \dots, T$). However, the gallery $\{X_t\}$ in this problem is not a random collection of photos; it can be represented as a sequence of disjoint albums, and every photo in an album is associated with the same event. In contrast to event recognition in a given album, the boundaries of each album are unknown, i.e., the indices $t_1$ and $t_2$ such that all photos $X_t$, $t = t_1+1,\dots,t_2$, are guaranteed to belong to the same album. Several characteristics make this problem considerably harder than previously studied ones. One of them is the presence of irrelevant or unimportant photos that can, in principle, be associated with any event [1]. Such photos are easily detected by attention-based models [13, 37], but they can significantly affect the accuracy of automatic album boundary detection.

The baseline approach here is to classify all $T$ photos independently, so that the decision for one photo does not affect the decision for any other. In this case the training albums are usually unfolded into a set $X = \{X_1(1),\dots,X_1(L_1), X_2(1),\dots,X_2(L_2),\dots,X_N(1),\dots,X_N(L_N)\}$ of $L = L_1 + \dots + L_N$ photos, so that the collection-level label $c_n$ of the $n$-th album is assigned to each of its photos $X_n(l)$, $l \in \{1,\dots,L_n\}$. The labels $c_n$ are the event numbers corresponding to the whole album (collection of photos). An arbitrary known classifier can then be trained, provided it returns a vector of confidence scores for all classes rather than only a predicted class label. If $L$ is too small to train a deep CNN (convolutional neural network) from scratch, transfer learning or domain adaptation can be applied [11]. These methods use a large external dataset, such as ImageNet-1000 or Places2 [38], to pretrain the deep CNN. Since the focus is on offline recognition on mobile devices, CNNs such as MobileNet v1/v2 [15, 22] are appropriate. The final step of transfer learning is fine-tuning this network on the dataset $X$. This step replaces the last layer of the pretrained CNN with a new layer with softmax activations and $C$ outputs. During classification, each input photo $X_t$ is fed into the fine-tuned CNN to compute the $C$-dimensional vector of confidence scores $p_t$ (the predictions at the last layer).

This procedure can be modified by replacing the $C$ logistic regressions in the last layer with a more complex classifier, such as a random forest (RF), a support vector machine (SVM), or gradient boosting. In this case the embeddings [25] are extracted from the outputs of one of the last layers of the pretrained CNN. In particular, the photos $X_t$ and $X_n(l)$ are fed into the CNN, and the outputs of the penultimate layer are used as the $D$-dimensional embeddings $x_t = [x_{1;t},\dots,x_{D;t}]$ and $x_n(l) = [x_{n;1}(l),\dots,x_{n;D}(l)]$, respectively. Such deep-learning-based feature extractors make it possible to train an arbitrary classifier. The $t$-th photo is fed into this classifier to obtain the $C$-dimensional confidence scores $p_t$.

Finally, the confidence scores $p_t$, computed in any of the ways described above, are used to decide in favor of the most likely class:

$$c^*(t) = \arg\max_{c \in \{1,\dots,C\}} p_{c;t},$$

where $p_{c;t}$ denotes the $c$-th component of $p_t$.
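The baseline above amounts to two small operations: unfolding the labeled albums into a photo-level training set, and picking the class with the largest confidence. The sketch below illustrates both; the function names are hypothetical, and the classifier that produces the confidence scores is deliberately left out.

```python
from typing import List, Sequence, Tuple, TypeVar

Photo = TypeVar("Photo")


def unfold_albums(albums: Sequence[Sequence[Photo]],
                  labels: Sequence[int]) -> List[Tuple[Photo, int]]:
    """Give every photo the label of the album it belongs to."""
    return [(photo, label)
            for album, label in zip(albums, labels)
            for photo in album]


def predict_class(confidences: Sequence[float]) -> int:
    """Index of the largest confidence score (first one wins on ties)."""
    return max(range(len(confidences)), key=lambda c: confidences[c])


# Two toy albums with album-level event labels 2 and 0.
training_set = unfold_albums([["a.jpg", "b.jpg"], ["c.jpg"]], [2, 0])
decision = predict_class([0.1, 0.7, 0.2])
```

Classes are 0-based in this sketch, whereas the text numbers them from 1 to C.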

Event recognition in a photo gallery

Fig. 1 shows the proposed approach to gallery-based event recognition. All of the described blocks can be implemented in software and/or hardware.

The "Embedding extraction (CNN)" block extracts the visual embeddings by feeding the RGB representation of each photo from the gallery to the input of a convolutional neural network and computing the outputs of one of its last layers (usually the penultimate one).

The "Classifier" block computes the confidence scores for all event types for each photo by feeding its embeddings into an arbitrary classifier. This classifier must be trained beforehand to recognize events on the training set. The vector of confidence scores is normalized by its L2 norm.

The "Sequential cluster analysis" block computes, for each photo in the gallery, the similarity between its normalized confidence scores and the normalized confidence scores of the next photo. If the EXIF (Exchangeable Image File Format) data of the photos contain location information, the similarity between their locations can be added to the similarity between the normalized confidence scores. If the resulting value exceeds a given threshold, the two photos are placed in different albums, and an album boundary is set between these two consecutive photos.

The "Neural attention model" block takes, for each pair of consecutive album boundaries found at the previous stage, the visual embeddings of all photos between these boundaries. This set of embeddings is fed into the neural attention model, which predicts a single event class for the whole set of photos. The predicted event class is assigned to all photos between these boundaries.

Here the "Embedding extraction" block first computes the embeddings $x_t$ of each individual $t$-th photo, as described above. The "Classifier" block estimates the classifier confidence scores $p_t$. Sequential analysis, analogous to BSAS clustering [3], is then applied in the "Sequential cluster analysis" block to the sequence of confidence scores $\{p_t\}$ to obtain the album boundaries. In particular, the similarities $\rho(p_t, p_{t+1})$ between the confidence scores of all consecutive photos are computed. Any suitable measure can be used, for example the Euclidean, Minkowski, or chi-squared distance, or the Kullback-Leibler or Jensen-Shannon divergence. If the EXIF (Exchangeable Image File Format) data of the photos contain location information, the similarity between their locations can be added to $\rho(p_t, p_{t+1})$ to obtain the final value that is compared with the threshold. If this value does not exceed a given threshold $\rho_0$, both photos are assumed to belong to the same album. Otherwise, a boundary between two albums is set at the $t$-th position. As a result, the boundaries $t_1, \dots, t_K$ are obtained, so that the $k$-th album contains the photos $\{X_t\}$, $t \in \{t_{k-1}+1,\dots,t_k\}$, where $t_0 = 0$. See Fig. 2.
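The boundary search described above can be sketched in a few lines. This is a minimal illustration that assumes the confidence vectors are already normalized and uses the Euclidean distance as the comparison measure; `album_boundaries` and `split_into_albums` are hypothetical names, and indices are 0-based, so an album is the half-open slice between two neighboring boundaries.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance, one of the admissible comparison measures."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def album_boundaries(scores: Sequence[Vector], threshold: float,
                     distance: Callable[[Vector, Vector], float] = euclidean
                     ) -> List[int]:
    """Return [t_0 = 0, t_1, ..., t_K = T] for a sequence of T score vectors."""
    boundaries = [0]
    for t in range(1, len(scores)):
        # A large distance between neighbors starts a new album at photo t.
        if distance(scores[t - 1], scores[t]) > threshold:
            boundaries.append(t)
    if scores:
        boundaries.append(len(scores))
    return boundaries


def split_into_albums(items: Sequence, boundaries: Sequence[int]) -> List[list]:
    """Cut the gallery into consecutive albums at the given boundaries."""
    return [list(items[a:b]) for a, b in zip(boundaries, boundaries[1:])]


scores = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
bounds = album_boundaries(scores, threshold=0.5)
albums = split_into_albums(["p1", "p2", "p3", "p4"], bounds)
```

Only neighboring photos are compared, so the scan is linear in the gallery size and can run as photos arrive.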

At the second stage, the final descriptor $x(k)$ of the $k$-th album is computed as a weighted sum of the individual embeddings $x_t$:

$$x(k) = \sum_{t=t_{k-1}+1}^{t_k} w(x_t)\, x_t, \tag{2}$$

where the weights $w$ may depend on the embeddings $x_t$. Average pooling (AvgPool) with equal weights is commonly used, which reduces to computing the mean embedding.

Algorithm 1. Proposed gallery-based event recognition

Input: input gallery $\{X_t\}$, $t \in \{1,\dots,T\}$, classifier C, threshold $\rho_0$.

Output: event labels $c^*(t)$ of all input images.

1: Assign K := 0, initialize the list of boundaries B := []

2: for each input image $t \in \{1,\dots,T\}$ do

3: Feed the t-th image into the CNN and compute the embeddings $x_t$

4: Compute the confidence scores $p_t$ using classifier C

5: if t = 1 or $\rho(p_{t-1}, p_t) > \rho_0$ then

6: Assign K := K + 1, append t-1 to list B

7: end if

8: end for

9: Append T to list B

10: for each extracted album $k \in \{1,\dots,K\}$ do

11: Feed the input images $\{X_t\}$, $t \in \{t_{k-1}+1,\dots,t_k\}$, into the attention network (2)-(3) and obtain the event class $c^*$

12: Assign $c^*(t) := c^*$ for all $t \in \{t_{k-1}+1,\dots,t_k\}$

13: end for

14: return labels $c^*(t)$, $t \in \{1,\dots,T\}$
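Algorithm 1 can be transcribed into a short runnable sketch in which every learned component is passed in as a callable. The CNN, the classifier, and the attention network are stand-ins here: in the toy usage the "photos" are already confidence vectors, and the album classifier is plain mean pooling followed by an argmax, which is an assumption made only to keep the example self-contained.

```python
from typing import Callable, List, Sequence, TypeVar

Photo = TypeVar("Photo")
Vector = Sequence[float]


def recognize_events(gallery: Sequence[Photo],
                     embed: Callable[[Photo], Vector],
                     classify: Callable[[Vector], Vector],
                     distance: Callable[[Vector, Vector], float],
                     classify_album: Callable[[List[Vector]], int],
                     threshold: float) -> List[int]:
    """Event label for every photo of the gallery (indices are 0-based)."""
    boundaries: List[int] = []                # step 1: K equals len(boundaries)
    embeddings: List[Vector] = []
    previous = None
    for t, photo in enumerate(gallery):       # step 2
        x_t = embed(photo)                    # step 3
        p_t = classify(x_t)                   # step 4
        embeddings.append(x_t)
        if previous is None or distance(previous, p_t) > threshold:  # step 5
            boundaries.append(t)              # step 6: a new album starts at t
        previous = p_t
    boundaries.append(len(gallery))           # step 9
    labels = [0] * len(gallery)
    for start, end in zip(boundaries, boundaries[1:]):   # step 10
        event = classify_album(embeddings[start:end])    # step 11
        for t in range(start, end):                      # step 12
            labels[t] = event
    return labels                             # step 14


def l1(a: Vector, b: Vector) -> float:
    """Sum of absolute differences between two vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def mean_pool_argmax(vectors: List[Vector]) -> int:
    """Stand-in album classifier: average the vectors, return the top index."""
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    return max(range(dim), key=lambda c: mean[c])


gallery = [[0.9, 0.1], [0.45, 0.55], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
labels = recognize_events(gallery, embed=lambda p: p, classify=lambda x: x,
                          distance=l1, classify_album=mean_pool_argmax,
                          threshold=1.0)
```

In this toy gallery the second photo would be labeled 1 if classified alone, but it falls into the first album and receives label 0, which illustrates how album-level decisions can correct errors on individual photos.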

Однако в настоящем изобретении предлагается обучать веса w(x_t), в частности, с помощью механизма внимания из нейронного агрегационного модуля, использовавшегося ранее только для распознавания видео [37]:However, the present invention proposes to train the weights w (x _t ), in particular, using the attention mechanism from the neural aggregation module, which was previously used only for video recognition [37]:

Here q is a trainable D-dimensional vector of weights. A dense (fully connected) layer is attached to the resulting descriptor x(k), and the entire neural network (Fig. 2) is trained end to end on the provided training set of N ≥ 1 albums. The event class predicted by this network in the "Neural attention model" block (Fig. 1) is assigned to all photos of the corresponding album.
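By way of illustration only, the aggregation of the photo embeddings of one album into a single descriptor can be sketched as below. The softmax form of the weights is an assumption based on the standard neural aggregation attention; in practice q is learned together with the dense layer, whereas here it is fixed by hand, and the name `attention_pool` is hypothetical.

```python
import math

def attention_pool(embeddings, q):
    """Aggregate the embeddings x_t of one album into one descriptor.

    embeddings: list of D-dimensional vectors of the album photos.
    q: D-dimensional weight vector (learned in practice, fixed here).
    Returns the descriptor and the attention weights.
    """
    logits = [sum(qi * xi for qi, xi in zip(q, x)) for x in embeddings]
    shift = max(logits)                          # numerical stability
    exps = [math.exp(e - shift) for e in logits]
    total = sum(exps)
    weights = [e / total for e in exps]          # softmax over the album
    descriptor = [sum(w * x[d] for w, x in zip(weights, embeddings))
                  for d in range(len(q))]
    return descriptor, weights

album = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# With q = 0 all weights are equal, so the result coincides with AvgPool.
descriptor, weights = attention_pool(album, q=[0.0, 0.0])
# A non-zero q emphasizes photos whose embedding is aligned with q.
_, focused = attention_pool(album, q=[5.0, 0.0])
```

With a zero q the sketch reduces to average pooling, which makes the relation between the two aggregation schemes explicit.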

The complete classification and training procedures are given in Algorithm 1 and Algorithm 2, respectively. For simplicity, step 17 of Algorithm 2 calls the event classification Algorithm 1. To speed up the computations, however, it is recommended to precompute the matrix of pairwise similarities between the confidence scores of all training photos, so that neither the extraction of vector representations (steps 3-4 of Algorithm 1) nor the similarity computation has to be repeated during model training.

The entire pipeline (Fig. 1) is implemented in a publicly available demo application for Android (https://drive.google.com/open?id=laYN0ZwU90T8ZruacvND01hbIaJS4EZLI) (Fig. 3), which was developed earlier to extract user preferences by processing all photos of the gallery in a background thread [26]. Similar events found in photos taken on the same day are merged into high-level entries for the most important events. Only those scenes/events are displayed for which at least two photos exist and for which the average prediction score of the scene/event over all photos of that day exceeds a certain threshold. This threshold is set automatically during the training procedure (steps 16-22 of Algorithm 2). Fig. 3a shows an example screenshot of the main user interface. Tapping any bar of this histogram opens a new form with detailed categories (Fig. 3b). Tapping a specific category opens a "display" form that lists all gallery photos of that category (Fig. 3c). In this list, events are grouped by date, and a specific day can be selected.

Algorithm 2. Training procedure of the proposed approach

1: for each album n ∈ {1, …, N} do
2:   for each image l ∈ {1, …, L_n} do
3:     Feed image X_n(l) to the CNN and compute the vector representation x_n(l)
4:   end for
5: end for
6: Train classifier C using the unfolded training set X of vector representations
7: Train the attention network (2)-(3) using subsets of fixed size S of all training feature sets {x_n(l)}
8: for each album n ∈ {1, …, N} do
9:   for each image l ∈ {1, …, L_n} do
10:     Predict the confidence scores p_n(l) for the vector representation x_n(l) using classifier C
11:   end for
12: end for
13: Perform a random permutation of all indices {1, …, N} to obtain the sequence (n_1, …, n_N)
14: Unfold all training vector representations using this permutation into the set {X_{n_1}(1), …, X_{n_1}(L_{n_1}), …, X_{n_N}(1), …, X_{n_N}(L_{n_N})}
15: Assign ρ := 0, α* := 0
16: for each candidate threshold ρ do
17:   Call Algorithm 1 with the unfolded set from step 14, classifier C and threshold ρ
18:   Compute the accuracy α using the predictions for all training images
19:   if α* < α then
20:     Assign α* := α, ρ₀ := ρ
21:   end if
22: end for
23: return classifier C, the attention network, threshold ρ₀
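The threshold search of steps 15-22 can be illustrated by the following sketch. It works on precomputed confidence scores, uses the Euclidean distance, and replaces the attention network by a simple stand-in (argmax of the mean scores of an album); all of these choices and all function names are assumptions made for this example only.

```python
import math

def split_into_albums(scores, threshold):
    """Open a new album wherever consecutive scores differ by more than threshold."""
    albums = []
    for t, p in enumerate(scores):
        if t > 0:
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, scores[t - 1])))
        if t == 0 or dist > threshold:
            albums.append([])
        albums[-1].append(t)
    return albums

def classify_album(scores, album):
    """Stand-in for the attention network: argmax of the mean scores."""
    n_classes = len(scores[0])
    mean = [sum(scores[t][c] for t in album) / len(album) for c in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__)

def select_threshold(scores, labels, candidates):
    """Return the candidate threshold with the highest per-photo accuracy."""
    best_acc, best_thr = 0.0, 0.0
    for thr in candidates:
        predicted = [None] * len(scores)
        for album in split_into_albums(scores, thr):
            label = classify_album(scores, album)
            for t in album:
                predicted[t] = label
        acc = sum(p == y for p, y in zip(predicted, labels)) / len(labels)
        if acc > best_acc:               # strict comparison, as in step 19
            best_acc, best_thr = acc, thr
    return best_thr, best_acc

# The second photo is ambiguous on its own but belongs to the first event.
scores = [[0.9, 0.1], [0.45, 0.55], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
labels = [0, 0, 0, 1, 1]
best_thr, best_acc = select_threshold(scores, labels, candidates=[0.1, 0.7, 2.0])
```

In this toy gallery a threshold that is too small classifies every photo separately, one that is too large merges both events, and the intermediate value recovers both albums.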

Recognition of events in individual photos

The task of recognizing an event in an individual photo can be formulated as a conventional image recognition problem: each photo X of the gallery must be assigned to one of C > 1 event categories (classes). A training set of N ≥ 1 photos X = {X_n | n ∈ {1, …, N}} with known event labels c_n ∈ {1, …, C} is available for training the classifier. Sometimes the training photos of the same event are associated with an album [5, 33]. In this case the training albums are unfolded into the set X, so that the collection-level label of an album is assigned to each photo of that album. This task has several characteristics that make it much harder than album-based event recognition. One of them is the presence of irrelevant or unimportant photos that could be associated with any event [1]. Such photos can be detected with attention-based models when the whole album is available [13], but the quality of event recognition in individual photos may suffer significantly.
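The unfolding of training albums into a per-photo training set, as described above, amounts to the following minimal sketch. The data layout (pairs of a photo list and an album label) and the name `unfold_albums` are hypothetical.

```python
def unfold_albums(albums):
    """Flatten (photos, album_label) pairs into per-photo training pairs."""
    photos, labels = [], []
    for album_photos, album_label in albums:
        for photo in album_photos:
            photos.append(photo)
            labels.append(album_label)   # each photo inherits the album label
    return photos, labels

albums = [(["a1.jpg", "a2.jpg"], "wedding"), (["b1.jpg"], "birthday")]
photos, labels = unfold_albums(albums)
```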

Since N is usually rather small, transfer learning can be applied [11]. First, a deep CNN is trained on a large dataset, e.g., ImageNet or Places [38]. This CNN is then fine-tuned on the training set X, i.e., the last layer is replaced with a new layer with Softmax activations and C outputs. Each input photo X is classified by feeding it to the fine-tuned CNN to compute the C outputs of the output layer, i.e., the estimates of the posterior probabilities of all event categories. This procedure can be modified by extracting deep vector representations of the photos from the outputs of one of the last layers of the pretrained CNN. The photos X and X_n are fed to the CNN, and the outputs of the penultimate layer are used as the D-dimensional vector representations x = [x_1, …, x_D] and x_n = [x_{n;1}, …, x_{n;D}], respectively. Such deep-learning-based feature extractors make it possible to train a general classifier C_emb, e.g., k-nearest neighbors, random forest (RF), support vector machine (SVM) or gradient boosting. In both cases, i.e., the fine-tuned network whose last Softmax layer acts as the classifier C_emb and the extraction of vector representations followed by a general classifier, a C-dimensional vector of confidence scores p_emb = C_emb(x) is predicted for the input photo. See Fig. 4. The final decision can be made in favor of the class with the maximum confidence.

The present invention uses a different approach to event recognition, based on generative models and image captioning. The proposed approach to event recognition based on image captioning is shown in Fig. 4. It should be noted that all the described blocks can be implemented in software and/or hardware.

In the "Extraction of vector representations (CNN)" block, visual vector representations are extracted by feeding the RGB representation of a gallery photo to the input of a pretrained convolutional neural network and computing the outputs of one of its last layers (usually the penultimate layer).

In the "Caption generation" block, the extracted visual vector representations and the sequence of previously generated words are fed to an RNN (recurrent neural network) to predict the next word of the textual description of the photo. The sequence of previously generated words initially contains only the special word <START>. As long as the predicted word differs from the special word <END>, it is appended to the sequence of previously generated words, and the extracted visual vector representations together with this sequence are fed again to the same RNN.
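The word-by-word generation loop described above can be sketched as follows. The RNN is replaced by a hypothetical stand-in function `toy_rnn` that replays a fixed caption, and the length cap `max_len` is a safety measure added for this sketch; neither is part of the description.

```python
START, END = "<START>", "<END>"

def generate_caption(features, predict_next_word, max_len=20):
    """Greedy word-by-word decoding.

    predict_next_word(features, words) stands in for the RNN: it returns
    the next word given the visual features and the words generated so far.
    """
    words = [START]
    while len(words) <= max_len:
        word = predict_next_word(features, words)
        if word == END:
            break
        words.append(word)               # extend the sequence and repeat
    return words[1:]                     # caption without the <START> marker

def toy_rnn(features, words):
    """Hypothetical stand-in that ignores the features and replays a caption."""
    script = ["a", "group", "of", "people", END]
    return script[len(words) - 1]

caption = generate_caption(features=[0.1, 0.2], predict_next_word=toy_rnn)
```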

In the "Vocabulary" block, since the above RNN operates on words represented by numbers, the numbers generated by the RNN are mapped to actual words of the vocabulary.

In the "Caption preprocessing" block, a subset of the vocabulary is formed from the words that occur most frequently in the training set, optionally excluding stop words. Each photo is then represented as a sparse vector using one-hot encoding: the v-th component of the vector equals 1 only if at least one word of the generated caption equals the v-th word of the vocabulary.
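As an illustration, the selection of the most frequent words with optional exclusion of stop words could look as follows. The vocabulary size, the stop-word list and the name `build_vocabulary` are arbitrary choices made for this sketch.

```python
from collections import Counter

def build_vocabulary(captions, size, stop_words=frozenset()):
    """Keep the `size` most frequent words of the training captions."""
    counts = Counter(word
                     for caption in captions
                     for word in caption
                     if word not in stop_words)
    return [word for word, _ in counts.most_common(size)]

captions = [["a", "man", "on", "a", "beach"],
            ["a", "dog", "on", "a", "beach"],
            ["a", "man", "with", "a", "dog"]]
vocabulary = build_vocabulary(captions, size=3, stop_words={"a", "on", "with"})
```

Without the stop-word filter the article "a" would dominate the reduced vocabulary although it carries no information about the event.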

The "Training set" block contains a set of training photos with known event types.

In the "Classification of vector representations" block, confidence scores of all event types are computed for a photo by feeding its vector representations to an arbitrary classifier. This classifier must be trained in advance to recognize events in the training set, where each training photo is represented by its own visual vector representations extracted by the procedure of the "Extraction of vector representations (CNN)" block.

In the "Text classification" block, confidence scores of all event types are computed for a photo by feeding the sparse vector of its one-hot encoded textual description to an arbitrary classifier. This classifier must be trained in advance to recognize events in the training set, where each training photo is represented by its own vector obtained by the procedure of the "Caption preprocessing" block.

In the "Ensemble" block, the outputs of the classifiers of vector representations and of texts are combined in an ensemble by simple voting with soft aggregation. Specifically, the aggregated confidence scores are computed as a weighted sum of the confidence scores computed in the "Classification of vector representations" and "Text classification" blocks. The decision is made in favor of the class with the maximum aggregated confidence.
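A minimal sketch of such soft voting is given below. It assumes that the weight w is applied to the classifier of vector representations and 1 - w to the text classifier; this assignment, the example scores and the name `soft_vote` are assumptions of the sketch.

```python
def soft_vote(p_emb, p_txt, w):
    """Weighted sum of two per-class score vectors followed by argmax."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    combined = [w * a + (1.0 - w) * b for a, b in zip(p_emb, p_txt)]
    best = max(range(len(combined)), key=combined.__getitem__)
    return best, combined

p_emb = [0.50, 0.40, 0.10]   # scores of the classifier of vector representations
p_txt = [0.20, 0.70, 0.10]   # scores of the caption classifier
label, combined = soft_vote(p_emb, p_txt, w=0.5)
```

Here the two classifiers disagree, and the ensemble follows the more confident one.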

First, the vector representations x are extracted in the usual way with a pretrained CNN. These visual vector representations and the vocabulary V are then fed to a special RNN-based neural network (generator) that produces a caption describing each input photo. The caption is represented as a sequence of L > 0 words from the vocabulary (t_l ∈ V, l ∈ {0, …, L}). It is generated sequentially, word by word, starting from the special word t_0 = <START> until the special word t_L = <END> is produced [6]. See Fig. 4.

The generated caption t is fed to an event classifier. To train its parameters, every n-th photo of the training set is fed to the same image captioning network to produce the caption t_n = {t_{n;0}, t_{n;1}, …, t_{n;L_n+1}}. Since the number of words L_n differs from photo to photo, it is necessary either to train a sequential RNN-based classifier or to convert all captions into vectors of the same dimension. Because the number of training examples N is not very large, it was found experimentally that the latter approach is as accurate as the former while requiring much less training time. It was therefore decided to use one-hot encoding of the sequences t and {t_n} into vectors of zeros and ones, as described in [7]. Specifically, a subset of the vocabulary (a reduced vocabulary) is formed from the words that occur most frequently in the training data {t_n}, optionally excluding stop words. Each input photo is then represented by a sparse vector whose dimension equals the size of the reduced vocabulary and whose v-th component equals 1 only if at least one of the L words of the caption t equals the v-th word of the reduced vocabulary. This means, for example, that the sequence {1, 5, 10, 2} is turned into a sparse vector of that dimension consisting of all zeros except for indices 1, 2, 5 and 10, whose values are ones [7]. The same procedure is used to describe every n-th training photo by a sparse vector of the same dimension. After that, an arbitrary classifier C_txt of such text representations that is suitable for sparse data can be used to predict the C confidence scores p_txt by applying C_txt to the sparse vector of the input photo. It was demonstrated in [7] that this approach is even more accurate than conventional RNN-based classifiers (including a single LSTM layer) on the IMDB dataset.
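The numerical example from the text can be reproduced with the short sketch below. The dimension 12 is an arbitrary value chosen for the example, and the name `multi_hot` is hypothetical.

```python
def multi_hot(indices, dimension):
    """Binary vector with ones at the positions of the given word indices."""
    vector = [0] * dimension
    for i in indices:
        vector[i] = 1
    return vector

# The sequence from the text, encoded over a reduced vocabulary of 12 words.
vector = multi_hot([1, 5, 10, 2], dimension=12)
print(vector)
```

Word order and repetitions are discarded by this encoding, which is what makes captions of different lengths comparable.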

In general, the classification of short textual descriptions is not expected to be more accurate than conventional image recognition methods. Nevertheless, it is believed that including photo captions in an ensemble of classifiers can significantly improve its diversity. Moreover, since the captions are generated from the extracted vector representation x, only a single CNN inference is required when the caption classifier is combined with the conventional general classifier of vector representations by simple voting with soft aggregation.

Specifically, the aggregated confidence scores are computed as a weighted sum of the outputs of the individual classifiers:

The decision is made in favor of the class with the maximum confidence:

The weight w ∈ [0, 1] in (4) can be chosen on a dedicated validation subset so as to maximize the accuracy of criterion (5).
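Formulas (4) and (5) themselves are not legible in this text. Based on the surrounding definitions (a weighted sum of p_emb and p_txt with a single weight w ∈ [0, 1], followed by a maximum-confidence decision), they can be expected to have the following form; the symbol p_ens and the assignment of w to the classifier of vector representations are assumptions of this sketch:

```latex
\mathbf{p}_{ens} = w\,\mathbf{p}_{emb} + (1 - w)\,\mathbf{p}_{txt} \qquad (4)

c^{*} = \arg\max_{c \in \{1,\dots,C\}} p_{ens;c} \qquad (5)
```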

Consider qualitative examples of using the described approach (Fig. 4). Fig. 5 shows results of (correct) event recognition using the ensemble. The first line of each title contains the generated caption of the photo. The title also shows the event recognized from the caption t (second line), from the vector representations x_emb (third line) and by the whole ensemble (last line). As can be seen, classification of the captions alone is not always correct. The ensemble, however, is able to produce a reliable decision even when individual classifiers make wrong decisions.

EXPERIMENTAL STUDY

Recognition of events in a photo gallery

Only a limited number of datasets is available for event recognition in personal photo collections [1]. Therefore, the two main datasets of this field are considered, namely:

1. PEC [5], which contains 61,364 photos from 807 collections of 14 classes of social events (birthday, wedding, graduation, etc.). The split provided by its authors was used: a training set of 667 albums (50,279 photos) and a test set of 140 albums (11,085 photos).

2. ML-CUFED [33], which contains 23 common event types. Each album is associated with several events, i.e., a multi-label classification problem is solved. The conventional split into a training set (75,377 photos, 1,507 albums) and a test set (376 albums with 19,420 photos) was used.

Vector representations were extracted with scene recognition models (Inception v3 and MobileNet v2 with α = 1 and α = 1.4) pretrained on the Places2 dataset [38]. Two techniques were used to obtain the final descriptor of a set of photos: (1) simple averaging of the vector representations of the individual photos of the set (AvgPool), and (2) the neural attention mechanism (2)-(3) applied to L₂-normalized vector representations. In the first case, the linear SVM classifier from the scikit-learn library was used as C, because it is more accurate than RF, k-NN and RBF SVM. In the second case, the weights of the attention-based network (Fig. 2) were trained on sets of S = 10 randomly selected photos from each of the albums, so that all input tensors have an identical shape. As a result, 667 training subsets and 1,507 training subsets of S = 10 photos were obtained for PEC and ML-CUFED, respectively. Since ML-CUFED has multiple labels per album, sigmoid activations and the binary cross-entropy loss are used. For PEC, the conventional Softmax activations and the categorical cross-entropy are applied. The model was trained with the ADAM optimizer (learning rate 0.001) for 10 epochs with early stopping in Keras 2.3 with the TensorFlow 1.15 backend.
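The construction of fixed-size training subsets can be illustrated as follows. Sampling with replacement for albums that contain fewer than S photos is an assumption of this sketch, since the description does not state how such albums are handled; the toy embeddings and the name `sample_fixed_size` are likewise hypothetical.

```python
import random

def sample_fixed_size(album, size, rng):
    """Pick `size` photo embeddings from one album.

    Sampling is without replacement when the album is large enough and
    with replacement otherwise (the fallback is an assumption).
    """
    if len(album) >= size:
        return rng.sample(album, size)
    return [rng.choice(album) for _ in range(size)]

rng = random.Random(0)
# Three toy albums with 25, 4 and 10 two-dimensional embeddings.
albums = [[[float(i), float(n)] for i in range(n)] for n in (25, 4, 10)]
batch = [sample_fixed_size(album, size=10, rng=rng) for album in albums]
```

Every element of `batch` now holds exactly ten embeddings, so the albums can be stacked into one input tensor of identical shape.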

Table 1: Accuracy (%) of event recognition in a set of images (album)

CNN                 | Aggregation                                | PEC   | ML-CUFED
MobileNet2, α = 1.0 | AvgPool                                    | 86.42 | 81.38
MobileNet2, α = 1.0 | Attention                                  | 89.29 | 84.04
MobileNet2, α = 1.4 | AvgPool                                    | 87.14 | 81.91
MobileNet2, α = 1.4 | Attention                                  | 87.36 | 84.31
Inception v3        | AvgPool                                    | 86.43 | 82.45
Inception v3        | Attention                                  | 87.86 | 84.84
AlexNet             | CNN-LSTM-Iterative [33]                    | 84.5  | 79.3
AlexNet             | Aggregation of representative features [34] | 87.9  | 84.5
ResNet-101          | CNN-LSTM-Iterative [33]                    | 84.5  | 71.7
ResNet-101          | Aggregation of representative features [34] | 89.1  | 83.4

Table 1 shows the recognition accuracy obtained with the pretrained CNNs. Multi-label accuracy is computed for ML-CUFED, i.e., a predicted event is considered correct if it matches any of the labels associated with the album. The table also lists the best-known results for these datasets [33, 34].
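The accuracy criterion used for ML-CUFED, as described above, can be written down as the following sketch. The event names and the function name `any_label_accuracy` are invented for the example.

```python
def any_label_accuracy(predictions, label_sets):
    """Share of albums whose predicted event is among the album's labels."""
    hits = sum(1 for pred, labels in zip(predictions, label_sets)
               if pred in labels)
    return hits / len(predictions)

predictions = ["wedding", "birthday", "hiking"]
label_sets = [{"wedding", "party"}, {"graduation"}, {"hiking", "nature trip"}]
accuracy = any_label_accuracy(predictions, label_sets)
```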

In all cases, attention-based aggregation is 1-3% more accurate than classification of averaged vector representations. The proposed implementation of the attention mechanism achieves the best results to date, even though it uses much faster convolutional networks (MobileNet and Inception rather than AlexNet and ResNet-101) and the attention-based network (Fig. 2) does not take the sequential order of the photos in an album into account. Most notably, the best results for PEC are obtained with the simplest model (MobileNet v2, α = 1.0), which can be explained by the lack of training data for this particular dataset.
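
The two aggregation schemes compared in Table 1 can be contrasted with a small sketch. It is a simplification under stated assumptions: the `query` vector stands in for learned attention parameters, and the network of Fig. 2 is not reproduced here.

```python
import math

def avg_pool(features):
    """AvgPool: plain mean of the S per-photo feature vectors."""
    s = len(features)
    return [sum(column) / s for column in zip(*features)]

def attention_pool(features, query):
    """Weighted mean; weights are a softmax over dot products with `query`.

    `query` is a hypothetical stand-in for learned attention parameters.
    """
    scores = [sum(q * x for q, x in zip(query, f)) for f in features]
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [sum(w * f[d] for w, f in zip(weights, features))
              for d in range(len(query))]
    return pooled, weights

# Hypothetical album of S = 3 photos with 2-dimensional features.
album = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(avg_pool(album))
pooled, weights = attention_pool(album, query=[2.0, 0.0])
print(pooled, weights)
```

With an all-zero query the attention weights become uniform and the result coincides with AvgPool, so average pooling can be seen as a special case of the weighted scheme.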

As noted above, a gallery generally contains no information about albums, so an event has to be assigned to each photo individually. In the next experiment, each photo in both datasets is directly assigned the first collection-level label, and only the photo itself, without any meta-information, is used for event recognition. In addition to the baseline approach (subsection 3.1), hierarchical agglomerative clustering of the whole test gallery is used. Only the results of average-linkage clustering are shown, applied to the vector representations x_t extracted by the pretrained CNN and to the confidence scores p_t. For the vector representations, both the Euclidean (L2) distance and the chi-square (χ²) distance are used. Because the confidence scores returned by the LinearSVC decision function are not always non-negative, only the Euclidean distance is used for them. The results are shown in Table 2.
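
The two distances and the average-linkage rule can be sketched as follows. The chi-square form used here is one common definition and is an assumption, since the exact formula is not restated in this passage; all names and toy values are illustrative.

```python
import math

def euclidean(a, b):
    """Euclidean (L2) distance; valid for vectors of any sign."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def chi_square(a, b, eps=1e-12):
    """One common chi-square distance; defined for non-negative vectors only."""
    if any(v < 0 for v in list(a) + list(b)):
        raise ValueError("chi-square distance needs non-negative vectors")
    return sum((x - y) ** 2 / (x + y + eps) for x, y in zip(a, b))

def average_linkage(cluster_a, cluster_b, dist):
    """Average-linkage distance: mean of all pairwise distances."""
    pairs = [(u, v) for u in cluster_a for v in cluster_b]
    return sum(dist(u, v) for u, v in pairs) / len(pairs)

# Hypothetical non-negative feature vectors: both distances apply.
x1, x2 = [0.2, 0.8], [0.5, 0.5]
print(euclidean(x1, x2), chi_square(x1, x2))

# Hypothetical SVM decision scores may be negative: only L2 applies.
p1, p2 = [-1.2, 0.4], [0.3, -0.1]
print(euclidean(p1, p2))
try:
    chi_square(p1, p2)
except ValueError as error:
    print("chi-square rejected:", error)
```

The denominator x + y can vanish or change sign for negative inputs, which is the practical reason the text restricts the confidence scores to the Euclidean distance.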

Table 2: Accuracy (%) of event recognition in a single image

| Dataset | CNN | Baseline approach | Vector representations, L₂ | Vector representations, χ² | Scores, L₂ |
|---|---|---|---|---|---|
| PEC | MobileNet v2, α = 1.0 | 58.32 | 60.42 | 60.69 | 58.44 |
| PEC | MobileNet v2, α = 1.4 | 60.34 | 61.25 | 61.92 | 60.58 |
| PEC | Inception v3 | 61.82 | 64.19 | 64.22 | 61.97 |
| ML-CUFED | MobileNet v2, α = 1.0 | 54.41 | 57.03 | 57.45 | 54.56 |
| ML-CUFED | MobileNet v2, α = 1.4 | 53.54 | 54.97 | 55.98 | 54.03 |
| ML-CUFED | Inception v3 | 57.26 | 59.19 | 60.12 | 57.87 |

First, the accuracy of event recognition on individual photos is 25-30% lower than the accuracy of album-level classification (Table 1). Second, clustering the confidence scores at the output of the best classifier does not noticeably affect the overall accuracy. Third, hierarchical clustering with the chi-square distance gives slightly more accurate results than the usual Euclidean metric. Finally, preliminary clustering of the vector representations reduces the error rate of the baseline approach by only 1.2-2%, even when the similarity threshold is carefully chosen.

Let us demonstrate how the assumption that the photos in an album are sequentially ordered can improve the accuracy of event recognition. To make the task harder, the following transformation of the test photo order was performed 10 times: the sequence of albums was randomly shuffled, and the photos within each album were shuffled as well. In addition to matching the confidence scores from the LinearSVC decision function directly, their L2-normalized versions were also matched. Furthermore, the CNNs were fine-tuned on the expanded training set X as follows. First, the weights of the CNN base were frozen and a new head (a fully connected layer with C outputs and softmax activation) was trained for 10 epochs. Then the weights of the whole CNN were trained for 3 epochs with a 10 times lower learning rate.
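
The reordering of the test gallery and the L2-normalization of the scores can be sketched as below. The album contents, the per-run seeding and the function names are assumptions for illustration; only the two shuffling steps and the ten repetitions come from the text.

```python
import math
import random

def shuffled_gallery(albums, rng):
    """Shuffle the album order and the photo order inside each album.

    Returns one flat photo sequence in which the photos of an album
    stay contiguous, so the album boundaries are hidden but still exist.
    """
    order = list(albums)
    rng.shuffle(order)
    gallery = []
    for album in order:
        photos = list(album)
        rng.shuffle(photos)
        gallery.extend(photos)
    return gallery

def l2_normalize(scores):
    """Scale a score vector to unit Euclidean length (zero vectors unchanged)."""
    norm = math.sqrt(sum(s * s for s in scores))
    return list(scores) if norm == 0.0 else [s / norm for s in scores]

# Hypothetical gallery of three albums.
albums = [["a1", "a2", "a3"], ["b1", "b2"], ["c1"]]
for run in range(10):  # ten repetitions, as in the text
    rng = random.Random(run)  # seeding per run is an assumption
    gallery = shuffled_gallery(albums, rng)
    print(run, gallery)

print(l2_normalize([3.0, -4.0]))
```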

Tables 3 and 4 present the results (mean accuracy ± standard deviation) of the proposed Algorithms 1 and 2 for PEC and ML-CUFED, respectively. In most cases the attention mechanism reduces the error rate by up to 8%. Notably, matching the L2-normalized confidence scores significantly improves the overall accuracy of the attention model for PEC (Table 3), although the experiments showed no such improvement for the conventional clustering of the previous experiment (Table 2). As expected, the fine-tuned CNNs give the most accurate solution, but the difference (0.1-1.6%) from the best results of the pretrained models is rather small. The pretrained models, however, require no additional inference on top of existing scene recognition models, so event recognition in an album will be very fast if the scenes have to be classified anyway, for example for more detailed user modeling [26]. Surprisingly, computing the similarity between the classifier confidence scores (c(P_t, P_t-1)) reduces the error rate of conventional matching of vector representations (c(X_t, X_t-1)) by 2-7%. Recall that conventional clustering of vector representations was 1-2% more accurate than clustering of the classifier scores (Table 2). Apparently, in this particular case the threshold c₀ can be estimated (Algorithm 2) more reliably when most photos of the same event are matched in the prediction procedure (Algorithm 1). Finally, the most important conclusion is that the proposed approach is 9-20% more accurate than the baseline approach. Moreover, the algorithm is 13-16% more accurate than classification of groups of photos obtained by hierarchical clustering (Table 2).
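
The idea of matching the scores of consecutive photos against a threshold can be illustrated by the simplified sketch below. It is not Algorithm 1 or Algorithm 2 as defined in this description: the threshold value is hypothetical, groups are aggregated by a plain mean instead of the attention network, and all names are assumptions.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def split_into_albums(scores, threshold, dist=euclidean):
    """Group consecutive photos; open a new group when the distance between
    neighbouring score vectors exceeds `threshold`."""
    if not scores:
        return []
    groups = [[0]]
    for t in range(1, len(scores)):
        if dist(scores[t], scores[t - 1]) > threshold:
            groups.append([t])
        else:
            groups[-1].append(t)
    return groups

def label_groups(scores, groups):
    """Average the scores of each group and give every photo the group's argmax."""
    labels = [None] * len(scores)
    for group in groups:
        mean = [sum(scores[i][c] for i in group) / len(group)
                for c in range(len(scores[0]))]
        best = max(range(len(mean)), key=mean.__getitem__)
        for i in group:
            labels[i] = best
    return labels

# Hypothetical per-photo confidence scores for C = 2 events.
# Photo 1 would be misclassified on its own (argmax is class 1).
scores = [[0.9, 0.1], [0.4, 0.6], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
groups = split_into_albums(scores, threshold=0.8)
labels = label_groups(scores, groups)
print(groups, labels)
```

In this toy gallery the second photo inherits the label of its group, which shows how grouping sequential photos can correct individual errors.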

Table 3: Accuracy (%) of the proposed approach, PEC

| CNN | Aggregation | Baseline | Vector representations, L₂ | Vector representations, χ² | Scores, L₂ | Scores, χ² | Scores (L₂-normalized), L₂ |
|---|---|---|---|---|---|---|---|
| MobileNet v2, α = 1.0 (pretrained), vector representations | AvgPool | 58.32 | 66.85 ± 0.59 | 68.52 ± 0.89 | 71.08 ± 0.59 | - | 72.68 ± 0.56 |
| MobileNet v2, α = 1.0 (pretrained), vector representations | Attention | 54.43 | 68.51 ± 0.41 | 70.65 ± 1.20 | 74.49 ± 0.70 | - | 80.48 ± 1.01 |
| MobileNet v2, α = 1.4 (pretrained), vector representations | AvgPool | 60.34 | 68.85 ± 0.50 | 63.57 ± 0.57 | 72.50 ± 1.49 | - | 73.49 ± 0.86 |
| MobileNet v2, α = 1.4 (pretrained), vector representations | Attention | 55.36 | 70.53 ± 0.79 | 71.16 ± 0.72 | 78.20 ± 1.47 | - | 81.27 ± 0.81 |
| MobileNet v2, α = 1.4 (fine-tuned), scores | AvgPool | 61.89 | - | - | 75.66 ± 0.55 | 76.96 ± 0.97 | - |
| MobileNet v2, α = 1.4 (fine-tuned), scores | Attention | 61.55 | - | - | 78.77 ± 0.49 | 81.33 ± 0.69 | - |
| Inception v3 (pretrained), vector representations | AvgPool | 61.82 | 72.29 ± 1.28 | 72.32 ± 1.54 | 74.54 ± 1.04 | - | 76.48 ± 0.47 |
| Inception v3 (pretrained), vector representations | Attention | 56.94 | 72.38 ± 1.13 | 71.36 ± 0.67 | 76.76 ± 0.70 | - | 80.17 ± 1.14 |
| Inception v3 (fine-tuned), scores | AvgPool | 63.56 | - | - | 78.87 ± 0.67 | 73.92 ± 0.65 | - |
| Inception v3 (fine-tuned), scores | Attention | 62.91 | - | - | 81.03 ± 0.77 | 81.95 ± 1.11 | - |

Table 4: Accuracy (%) of the proposed approach, ML-CUFED

| CNN | Aggregation | Baseline | Vector representations, L₂ | Vector representations, χ² | Scores, L₂ | Scores, χ² | Scores (L₂-normalized), L₂ |
|---|---|---|---|---|---|---|---|
| MobileNet v2, α = 1.0 (pretrained), vector representations | AvgPool | 54.41 | 67.54 ± 0.76 | 67.42 ± 0.93 | 69.83 ± 0.74 | - | 70.42 ± 0.41 |
| MobileNet v2, α = 1.0 (pretrained), vector representations | Attention | 51.05 | 68.71 ± 0.71 | 68.55 ± 0.61 | 71.44 ± 0.82 | - | 71.61 ± 0.69 |
| MobileNet v2, α = 1.4 (pretrained), vector representations | AvgPool | 53.54 | 66.93 ± 0.60 | 67.21 ± 0.55 | 68.56 ± 0.73 | - | 69.47 ± 0.36 |
| MobileNet v2, α = 1.4 (pretrained), vector representations | Attention | 51.12 | 68.34 ± 0.68 | 68.62 ± 0.50 | 70.79 ± 0.75 | - | 71.78 ± 0.74 |
| MobileNet v2, α = 1.4 (fine-tuned), scores | AvgPool | 56.01 | - | - | 70.57 ± 0.48 | 71.61 ± 0.28 | - |
| MobileNet v2, α = 1.4 (fine-tuned), scores | Attention | 56.09 | - | - | 72.90 ± 0.59 | 73.46 ± 0.58 | - |
| Inception v3 (pretrained), vector representations | AvgPool | 57.26 | 69.91 ± 0.58 | 70.01 ± 0.62 | 72.25 ± 0.61 | - | 72.78 ± 0.71 |
| Inception v3 (pretrained), vector representations | Attention | 50.89 | 69.30 ± 0.47 | 68.52 ± 0.89 | 72.73 ± 0.72 | - | 73.00 ± 0.65 |
| Inception v3 (fine-tuned), scores | AvgPool | 57.12 | - | - | 72.18 ± 0.63 | 73.20 ± 0.74 | - |
| Inception v3 (fine-tuned), scores | Attention | 57.29 | - | - | 73.06 ± 0.74 | 73.92 ± 0.81 | - |

In addition to PEC and ML-CUFED, the Web Image Dataset for Event Recognition (WIDER) [35] was examined; it contains 50,574 photos and C = 61 event categories (parade, dancing, meeting, press conference, etc.). The standard training/testing split proposed by the authors of each dataset was used. In PEC and ML-CUFED, the collection-level label was assigned directly to each photo in the collection. Any metadata, such as temporal information, was completely ignored and only the photo itself was used, just as in [32].

To focus on the feasibility of offline event recognition on mobile devices [26], and to compare the proposed approach with conventional classifiers, the MobileNet v2 (α = 1) [23] and Inception v4 [28] CNNs were used. They were first trained on the Places2 dataset [38] to extract vector representations. A linear SVM classifier from the scikit-learn library was used, because it is more accurate than the other classifiers from that library (RF, k-NN and RBF SVM). In addition, these CNNs were fine-tuned on the given training set as follows. First, the weights of the CNN base were frozen and a new classifier (a fully connected layer with C outputs and softmax activation) was trained with the ADAM optimizer (learning rate 0.001) over epochs with early stopping in Keras 2.2 with the TensorFlow 1.15 backend. Then the weights of the whole CNN were trained for 5 epochs using ADAM. Finally, the CNN was trained using SGD for 3 epochs with a 10 times lower learning rate.

In addition, vector representations from object detection models, which are typical for event recognition [35, 26], were used. Because many photos of the same event sometimes contain identical objects (for example, a ball in football), these objects can be detected with modern CNN-based methods such as SSDLite [23] or Faster R-CNN [21]. These methods locate several objects in the input photo and predict a score for each class from a predefined set of K > 1 types. A sparse K-dimensional vector of scores is extracted, with one entry per object type. If several objects of the same type are present, the maximum score is stored in this vector [20]. The vector is either classified by a linear SVM or used to train a feed-forward neural network with two hidden layers of 32 neurons each. Both classifiers were trained on the training set of each event dataset. This study considers SSD with the MobileNet architecture and Faster R-CNN with the InceptionResNet architecture. Models pretrained on the Open Images v4 dataset (K = 601 objects) were taken from the TensorFlow Object Detection Model Zoo.
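
The construction of the sparse K-dimensional score vector can be sketched as follows. The detection format, a list of (class index, score) pairs, is a hypothetical simplification of a detector's output, and the function name is an assumption.

```python
def object_score_vector(detections, num_classes):
    """Build a K-dimensional vector holding the maximum score per object class.

    `detections` is a list of (class_index, score) pairs; classes that were
    not detected keep a zero entry, which makes the vector sparse.
    """
    vector = [0.0] * num_classes
    for class_index, score in detections:
        if not 0 <= class_index < num_classes:
            raise ValueError(f"class index {class_index} out of range")
        vector[class_index] = max(vector[class_index], score)
    return vector

# Hypothetical detections: class 3 is found twice, class 7 once.
detections = [(3, 0.91), (3, 0.55), (7, 0.62)]
vector = object_score_vector(detections, num_classes=601)  # K = 601
print([(i, v) for i, v in enumerate(vector) if v > 0.0])
```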

A preliminary experimental study with the pretrained image captioning models discussed in Section 2 showed that the best quality on the MS COCO captioning dataset is achieved by the ARNet model [6]. The ARNet encoder/decoder model was therefore used in this experiment. It can, however, be replaced by any other image captioning method without changing the event recognition algorithm. ARNet was trained on the Conceptual Captions dataset, which contains more than 3.3 million pairs of image URLs and captions in the training set and about 15 thousand pairs in the test set. Feature extraction in the encoder is implemented not only with the same CNNs (Inception and MobileNet v2). The 5000 most frequent words were extracted, excluding the special words <START> and <END>. They were classified either by a linear SVM or by a feed-forward neural network with the same architecture as in the object detection case. These classifiers were likewise trained from scratch on each given event training set. The same set was used to estimate the weight w in the ensemble (Equation 1).
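
Selecting the most frequent caption words can be sketched with the standard library. The whitespace tokenization and the binary bag-of-words encoding are assumptions made for illustration; this passage does not specify how the generated captions are encoded before classification.

```python
from collections import Counter

SPECIAL = {"<START>", "<END>"}  # special words excluded from the vocabulary

def build_vocabulary(captions, size=5000):
    """Map the `size` most frequent words to consecutive indices."""
    counts = Counter(word for caption in captions
                     for word in caption.split() if word not in SPECIAL)
    return {word: i for i, (word, _) in enumerate(counts.most_common(size))}

def encode(caption, vocabulary):
    """Binary bag-of-words vector (an assumed encoding)."""
    vector = [0] * len(vocabulary)
    for word in caption.split():
        index = vocabulary.get(word)
        if index is not None:
            vector[index] = 1
    return vector

# Hypothetical generated captions.
captions = [
    "<START> a group of people at a wedding <END>",
    "<START> a man riding a bike on a road <END>",
]
vocabulary = build_vocabulary(captions, size=5000)
print(len(vocabulary), encode("<START> people on a bike <END>", vocabulary))
```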

Tables 5, 6 and 7 present the results of the lightweight mobile models (MobileNet and the SSD object detector) and the deep models (Inception and Faster R-CNN) for PEC, WIDER and ML-CUFED, respectively. The best previously known results for the same experimental protocols are added to them.

Table 5: Accuracy of event recognition (%), PEC

| Classifier | Features | Lightweight models | Deep models |
|---|---|---|---|
| SVM | Vector representations | 59.72 | 61.82 |
| SVM | Objects | 42.18 | 47.83 |
| SVM | Texts | 43.77 | 47.24 |
| SVM | Proposed ensemble (4), (5) | 60.56 | 62.87 |
| Fine-tuned CNN | Vector representations | 62.33 | 63.56 |
| Fine-tuned CNN | Objects | 40.17 | 47.42 |
| Fine-tuned CNN | Texts | 43.52 | 46.84 |
| Fine-tuned CNN | Proposed ensemble (4), (5) | 63.38 | 65.12 |

| Previously published method | Accuracy |
|---|---|
| Aggregated SVM [5] | 41.4 |
| Bag of sub-events [5] | 51.4 |
| SHMM [5] | 55.7 |
| Initialization-based transfer learning [32] | 60.6 |
| Transfer learning of data and knowledge [32] | 62.2 |

Table 6: Accuracy of event recognition (%), WIDER

| Classifier | Features | Lightweight models | Deep models |
|---|---|---|---|
| SVM | Vector representations | 48.31 | 50.48 |
| SVM | Objects | 19.91 | 28.66 |
| SVM | Texts | 26.38 | 31.89 |
| SVM | Proposed ensemble (4), (5) | 48.91 | 51.59 |
| Fine-tuned CNN | Vector representations | 49.11 | 50.97 |
| Fine-tuned CNN | Objects | 12.91 | 21.27 |
| Fine-tuned CNN | Texts | 23.93 | 30.91 |
| Fine-tuned CNN | Proposed ensemble (4), (5) | 49.80 | 51.84 |

| Previously published method | Accuracy |
|---|---|
| Baseline CNN [35] | 39.7 |
| Deep channel fusion [35] | 42.4 |
| Initialization-based transfer learning [32] | 50.8 |
| Transfer learning of data and knowledge [32] | 53.0 |

Of course, the proposed recognition from photo captions is not as accurate as recognition with conventional CNN-based vector representations. However, classification of the textual descriptions is much better than random guessing, whose accuracy is 100%/14 ≈ 7.14%, 100%/61 ≈ 1.64% and 100%/23 ≈ 4.35% for PEC, WIDER and ML-CUFED, respectively. It is important to emphasize that in most cases this approach has a lower error rate than classification of vector representations based on object detection. The improvement is especially noticeable for the lightweight SSD models, which are 1.5-13% less accurate than the proposed classification of photo captions because SSD-based models are limited in detecting small objects (food, pets, fashion accessories, etc.). Vector representations obtained with Faster R-CNN can be classified more accurately, but Faster R-CNN with the InceptionResNet architecture has several times slower inference than decoding in ARNet (6-10 seconds versus 0.5-2 seconds on a MacBook Pro 2015).
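
The random-guess figures quoted above follow directly from the number of event classes in each dataset. The short check below also relates them to the accuracies of the lightweight text-based SVM taken from Tables 5, 6 and 7; the variable names are illustrative.

```python
# Number of event classes per dataset, as used in the text.
num_classes = {"PEC": 14, "WIDER": 61, "ML-CUFED": 23}

# Accuracy (%) of the SVM on caption texts, lightweight models (Tables 5-7).
text_accuracy = {"PEC": 43.77, "WIDER": 26.38, "ML-CUFED": 37.24}

# Uniform random guessing is right once in every C attempts on average.
chance = {name: 100.0 / c for name, c in num_classes.items()}

for name in num_classes:
    ratio = text_accuracy[name] / chance[name]
    print(f"{name}: chance {chance[name]:.2f}%, "
          f"texts {text_accuracy[name]:.2f}% ({ratio:.1f} times chance)")
```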

Table 7: Accuracy of event recognition (%), ML-CUFED

| Classifier | Features | Lightweight models | Deep models |
|---|---|---|---|
| SVM | Vector representations | 53.54 | 57.27 |
| SVM | Objects | 34.21 | 40.94 |
| SVM | Texts | 37.24 | 41.52 |
| SVM | Proposed ensemble (4), (5) | 55.26 | 58.86 |
| Fine-tuned CNN | Vector representations | 56.01 | 57.12 |
| Fine-tuned CNN | Objects | 32.05 | 40.12 |
| Fine-tuned CNN | Texts | 36.74 | 41.35 |
| Fine-tuned CNN | Proposed ensemble (4), (5) | 57.94 | 60.01 |

Finally, the most appropriate way to use image captioning in event classification is to fuse it with conventional CNNs. In that case, the previously best known accuracy for PEC, 62.2% [32], is exceeded even by the lightweight models (63.38%) when fine-tuned CNNs are used in the ensemble. The presented Inception-based model is even better (65.12% accuracy). The best accuracy to date for the WIDER dataset, 53% [32], is still not reached, although the best accuracy obtained here (51.84%) is 9% higher than the best result (42.4%) of the original paper [35]. The experimental protocol presented for the ML-CUFED dataset is studied here for the first time, since this dataset was designed mainly for album-based event recognition.
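
The fusion of a CNN-based classifier with a caption-based classifier can be illustrated by a weighted sum of their score vectors. The convex combination below is an assumed stand-in for the ensemble equation referenced in this description, and the grid search over w on a labelled set is a hypothetical selection rule; all data are toy values.

```python
def fuse(cnn_scores, text_scores, w):
    """Weighted sum of two score vectors (assumed form of the ensemble)."""
    return [w * a + (1.0 - w) * b for a, b in zip(cnn_scores, text_scores)]

def accuracy(samples, w):
    """Fraction of samples whose fused argmax equals the true label."""
    hits = 0
    for cnn_scores, text_scores, label in samples:
        fused = fuse(cnn_scores, text_scores, w)
        if max(range(len(fused)), key=fused.__getitem__) == label:
            hits += 1
    return hits / len(samples)

def pick_weight(samples, steps=10):
    """Grid search for w in [0, 1] on a labelled set (hypothetical rule)."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda w: accuracy(samples, w))

# Hypothetical (CNN scores, caption scores, true label) triples:
# the CNN is right on the first photo, the caption model on the second.
samples = [
    ([0.60, 0.40], [0.2, 0.8], 0),
    ([0.55, 0.45], [0.1, 0.9], 1),
]
w = pick_weight(samples)
print(w, accuracy(samples, w), accuracy(samples, 1.0), accuracy(samples, 0.0))
```

In this toy case neither classifier alone labels both photos correctly, while a suitable weight does, which mirrors the gain of the ensemble rows over the single-feature rows in Tables 5 to 7.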

In practice it is preferable to use a pretrained CNN as the feature extractor, in order to avoid an additional forward pass through the fine-tuned CNN when it differs from the encoder of the image captioning model. Unfortunately, the accuracy of the SVM on vector representations from the pretrained CNN is 1.5-3% lower than that of the fine-tuned models for PEC and ML-CUFED. In that case the additional inference may be acceptable. For the WIDER dataset, however, the difference in error rates between the pretrained and fine-tuned models is insignificant, so pretrained CNNs are definitely worth using there.

The embodiments of the invention described above are examples and should not be construed as limiting. Furthermore, the description of the exemplary embodiments is intended to be illustrative and not to limit the scope of the claims, and many alternatives, modifications and variations will be apparent to those skilled in the art.

ЛИТЕРАТУРАLITERATURE

[1] Kashif Ahmad and Nicola Conci, 'How deep features have improved event recognition in multimedia: A survey', ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 15(2), 39(2019).[1] Kashif Ahmad and Nicola Conci, 'How deep features have improved event recognition in multimedia: A survey', ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 15 (2), 39 (2019).

[2] Kashif Ahmad, Nicola Conci, Giulia Boato, and Francesco GB De Natale, 'Event recognition in personal photo collections via multiple instance learning-based classification of multiple images', Journal of Electronic Imaging, 26(6), 060502, (2017).[2] Kashif Ahmad, Nicola Conci, Giulia Boato, and Francesco GB De Natale, 'Event recognition in personal photo collections via multiple instance learning-based classification of multiple images', Journal of Electronic Imaging, 26 (6), 060502, ( 2017).

[3] Wesam M Ashour, Riham Z Muqat, Alaaeddin B AlQazzaz, and Saeb R AbdElnabi, 'Improve basic sequential algorithm scheme using ant colony algorithm', in Proceedings of the 7th Palestinian International Conference on Electrical and Computer Engineering (PICECE), pp. 1-6. IEEE, (2019).

[4] Siham Bacha, Mohand Said Allili, and Nadjia Benblidia, 'Event recognition in photo albums using probabilistic graphical models and feature relevance', Journal of Visual Communication and Image Representation, 40, 546-558, (2016).

[5] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool, 'Event recognition in photo collections with a stopwatch HMM', in Proceedings of the International Conference on Computer Vision (ICCV), pp. 1193-1200. IEEE, (2013).

[6] Xinpeng Chen, Lin Ma, Wenhao Jiang, Jian Yao, and Wei Liu, 'Regularizing RNNs for caption generation by reconstructing the past with the present', in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2018).

[7] Francois Chollet, Deep learning with Python, Manning Publications Company, 2017.

[8] Minh-Son Dao, Duc-Tien Dang-Nguyen, and Francesco GB De Natale, 'Signature-image-based event analysis for personal photo albums', in Proceedings of the 19th International Conference on Multimedia (ACM MM), pp. 1481-1484. ACM, (2011).

[9] Sergio Escalera, Junior Fabian, Pablo Pardo, Xavier Baro, Jordi Gonzalez, Hugo J Escalante, Dusan Misevic, Ulrich Steiner, and Isabelle Guyon, 'Chalearn looking at people 2015: Apparent age and cultural event recognition datasets and results', in Proceedings of the International Conference on Computer Vision Workshops (ICCVW), pp. 1-9, (2015).

[10] Ming Geng, Yukun Li, and Fenglian Liu, 'Classifying personal photo collections: an event-based approach', in Proceedings of the Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint International Conference on Web and Big Data, pp. 201-215. Springer, (2018).

[11] Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep learning, MIT Press, 2016.

[12] Ivan Grechikhin and Andrey V Savchenko, 'User modeling on mobile device based on facial clustering and object detection in photos and videos', in Proceedings of the Iberian Conference on Pattern Recognition and Image Analysis, pp. 429-440. Springer, (2019).

[13] Cong Guo, Xinmei Tian, and Tao Mei, 'Multigranular event recognition of personal photo albums', IEEE Transactions on Multimedia, 20(7), 1837-1847, (2017).

[14] MD Hossain, Ferdous Sohel, Mohd Fairuz Shiratuddin, and Hamid Laga, 'A comprehensive survey of deep learning for image captioning', ACM Computing Surveys (CSUR), 51(6), 118, (2019).

[15] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam, 'MobileNets: Efficient convolutional neural networks for mobile vision applications', arXiv preprint arXiv:1704.04861, (2017).

[16] Dmitry Kuzovkin, Tania Pouli, Olivier Le Meur, Remi Cozot, Jonathan Kervec, and Kadi Bouatouch, 'Context in photo albums: Understanding and modeling user behavior in clustering and selection', ACM Transactions on Applied Perception (TAP), 16(2), 11, (2019).

[17] Stefan Lonn, Petia Radeva, and Mariella Dimiccoli, 'Smartphone picture organization: A hierarchical approach', Computer Vision and Image Understanding, 187, 102789, (2019).

[18] Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh, 'Neural baby talk', in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2018).

[19] Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, and Alan L. Yuille, 'Deep captioning with multimodal recurrent neural networks (m-RNN)', in Proceedings of the International Conference on Learning Representations (ICLR), (2015).

[20] Alexandr Rassadin and Andrey Savchenko, 'Scene recognition in user preference prediction based on classification of deep embeddings and object detection', in Proceedings of the International Symposium on Neural Networks (ISNN), volume 11555, pp. 422-430. Springer, (2019).

[21] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, 'Faster R-CNN: Towards real-time object detection with region proposal networks', in Advances in Neural Information Processing Systems (NIPS), pp. 91-99, (2015).

[22] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen, 'MobileNetV2: Inverted residuals and linear bottlenecks', in Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4510-4520, (2018).

[23] Mark Sandler, Andrew G. Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen, 'MobileNetV2: Inverted residuals and linear bottlenecks', in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 4510-4520, (2018).

[24] Andrey V. Savchenko, 'Efficient facial representations for age, gender and identity recognition in organizing photo albums using multi-output ConvNet', PeerJ Computer Science, 5, e197, (2019).

[25] Andrey V. Savchenko, 'Sequential three-way decisions in multi-category image recognition with deep features based on distance factor', Information Sciences, 489, 18-36, (2019).

[26] Andrey V Savchenko, Kirill V Demochkin, and Ivan S Grechikhin, 'User preference prediction in visual data on mobile devices', arXiv preprint arXiv:1907.04519, (2019).

[27] Anastasiia D Sokolova, Angelina S Kharchevnikova, and Andrey V Savchenko, 'Organizing multimedia data in video surveillance systems based on face verification with convolutional neural networks', in Proceedings of the International Conference on Analysis of Images, Social Networks and Texts (AIST), pp. 223-230. Springer, (2017).

[28] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alex A. Alemi, 'Inception-v4, Inception-ResNet and the impact of residual connections on learning', in Proceedings of the International Conference on Learning Representations (ICLR) Workshop, (2016).

[29] Shen-Fu Tsai, Thomas S Huang, and Feng Tang, 'Album-based object-centric event recognition', in Proceedings of the International Conference on Multimedia and Expo, pp. 1-6. IEEE, (2011).

[30] Nivetha Vijayaraju, 'Image retrieval using image captioning', Master's Projects, 687, (2019).

[31] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, 'Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge', IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4), 652-663, (2017).

[32] Limin Wang, Zhe Wang, Yu Qiao, and Luc Van Gool, 'Transferring deep object and scene representations for event recognition in still images', International Journal of Computer Vision, 126(2-4), 390-409, (2018).

[33] Yufei Wang, Zhe Lin, Xiaohui Shen, Radomir Mech, Gavin Miller, and Garrison W Cottrell, 'Recognizing and curating photo albums via event-specific image importance', in Proceedings of the British Machine Vision Conference (BMVC), (2017).

[34] Zifeng Wu, Yongzhen Huang, and Liang Wang, 'Learning representative deep features for image set analysis', IEEE Transactions on Multimedia, 17(11), 1960-1968, (2015).

[35] Yuanjun Xiong, Kai Zhu, Dahua Lin, and Xiaoou Tang, 'Recognize complex events from static images by fusing deep channels', in Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1600-1609, (2015).

[36] Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio, 'Show, attend and tell: Neural image caption generation with visual attention', in Proceedings of the International Conference on Machine Learning (ICML), pp. 2048-2057, (2015).

[37] J. Yang, P. Ren, D. Zhang, D. Chen, F. Wen, H. Li, and G. Hua, 'Neural aggregation network for video face recognition', in Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5216-5225. IEEE, (2017).

[38] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba, 'Places: A million image database for scene recognition', IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6), 1452-1464, (2018).

Claims

1. A method of recognizing events in photographs with automatic selection of albums, the method comprising:

automatically assigning album boundaries by evaluating the similarity between the degrees of confidence for each pair of consecutive photographs in the gallery, wherein the similarity between photographs is calculated as the distance between the degrees of confidence for each type of event, estimated by classifying each photograph, or, if the EXIF (Exchangeable Image File Format) data of said photographs contain location information, as the sum of the similarity between the degrees of confidence for each type of event, estimated by classifying each photograph, and the similarity between their locations;

grouping consecutive photographs from the gallery into albums;

calculating a descriptor of each album using a neural attention mechanism that predicts one event class for a set of photographs, wherein the predicted class is assigned to all photographs between said boundaries;

recognizing the event type tag of each album by feeding its descriptor to the input of a classifier;

recognizing the event in the photographs of the gallery by assigning the corresponding event type tag to all photographs in each album.
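As an illustration of the album descriptor step, the following minimal sketch aggregates per-photograph vector representations into a single descriptor as an attention-weighted sum. The dot-product scoring against a single query vector, the softmax normalization, and the names `attention_pool`, `features` and `query` are assumptions made for this example; in a real system the query vector would be learned during training, and the claim does not fix these details.

```python
import math

def attention_pool(features, query):
    """Aggregate per-photo feature vectors into one album descriptor.

    features: list of equal-length lists of floats (one per photograph).
    query: hypothetical attention query vector of the same length
           (assumed to be learned elsewhere).
    Returns (descriptor, weights).
    """
    # Relevance score of each photo: dot product with the query vector.
    scores = [sum(f_i * q_i for f_i, q_i in zip(f, query)) for f in features]

    # Numerically stable softmax turns the scores into attention weights.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # The album descriptor is the attention-weighted sum of photo features.
    dim = len(features[0])
    descriptor = [
        sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)
    ]
    return descriptor, weights
```

The resulting descriptor has the same dimensionality as a single photograph's representation, so it can be passed to an ordinary classifier to obtain the event tag of the whole album.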

2. The method according to claim 1, wherein, if the similarity does not exceed a predetermined threshold, both photographs are included in the same album.

3. The method according to claim 2, wherein the predetermined threshold is calculated automatically during a preliminary training procedure using a provided training set of albums.
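The boundary rule of claims 1 and 2 can be sketched as a single sequential pass over the gallery. This is only an illustration: the Euclidean distance, the function names and the threshold value are assumptions, the optional location term from the EXIF data is omitted, and in the method the threshold would come from the training procedure of claim 3.

```python
import math

def euclidean(p, q):
    """Distance between two confidence vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def split_into_albums(confidences, threshold, distance=euclidean):
    """Group consecutive photos into albums.

    confidences: per-photo event confidence vectors, in gallery order.
    threshold: hypothetical value; a boundary is placed between two
               consecutive photos whose distance exceeds it.
    Returns a list of albums, each a list of photo indices.
    """
    albums = []
    current = []
    for i, conf in enumerate(confidences):
        # Start a new album when the previous photo is too dissimilar.
        if current and distance(confidences[i - 1], conf) > threshold:
            albums.append(current)
            current = []
        current.append(i)
    if current:
        albums.append(current)
    return albums

# Two beach-like photos followed by two party-like photos.
gallery = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
print(split_into_albums(gallery, threshold=0.5))  # [[0, 1], [2, 3]]
```

Because only neighbouring photographs are compared, the pass is linear in the gallery size and needs no knowledge of the number of albums in advance.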

4. A method of recognizing an event in a photograph using its representation in the text domain, the method comprising:

calculating vector representations of the photograph by feeding its RGB (red-green-blue) representation to the input of a convolutional neural network;

automatically captioning the image to generate a textual description of the photograph from its vector representations, wherein the textual description is generated as follows: the extracted vector representations and the sequence of previously generated words are fed to a recurrent neural network (RNN) to predict the next word of the textual description;

selecting a subset of the vocabulary by choosing the most frequent words in the training data, optionally excluding stop words;

encoding the generated caption;

feeding the generated textual description to the input of an event classifier;

recognizing the event by assigning the output of the event classifier as the tag of the corresponding photograph.

5. The method according to claim 4, wherein the generated caption is encoded as a sparse vector using one-hot encoding of the textual description of the photograph: the v-th component of the vector is equal to 1 only if the generated caption contains at least one occurrence of the v-th word of the vocabulary.
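The vocabulary selection of claim 4 and the encoding of claim 5 can be illustrated with a short sketch. Whitespace tokenization, lower-casing, and the names `build_vocabulary` and `encode_caption` are assumptions made for this example only.

```python
from collections import Counter

def build_vocabulary(captions, size, stop_words=frozenset()):
    """Keep the `size` most frequent words of the training captions,
    optionally excluding stop words."""
    counts = Counter(
        word
        for caption in captions
        for word in caption.lower().split()
        if word not in stop_words
    )
    return [word for word, _ in counts.most_common(size)]

def encode_caption(caption, vocabulary):
    """Binary vector: component v is 1 only if the caption contains
    the v-th vocabulary word at least once."""
    words = set(caption.lower().split())
    return [1 if word in words else 0 for word in vocabulary]

# Encoding against an explicit three-word vocabulary.
print(encode_caption("a man riding a bike", ["man", "riding", "wave"]))  # [1, 1, 0]
```

Repeated words do not change the encoding, which matches the "at least one occurrence" condition of claim 5; with a realistic vocabulary most components are zero, so a sparse representation is natural.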

6. The method according to claim 5, further comprising combining the outputs of the classifier for the generated textual descriptions and of a conventional classifier of vector representations into an ensemble to improve the accuracy of event recognition.

7. The method according to claim 6, wherein the classifier outputs are obtained as follows:

calculating confidence scores for all types of events for the photograph by feeding its vector representations to an arbitrary classifier;

calculating confidence scores for all types of events for the photograph by feeding the sparse vector of the one-hot encoded textual description to an arbitrary classifier;

combining the outputs of the classifiers for the vector representations and the texts into an ensemble based on simple voting with soft aggregation.

8. The method according to claim 4, further comprising appending the predicted word to the sequence of previously generated words and feeding the extracted visual vector representations and said sequence to the same RNN.
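The word-by-word generation loop of claims 4 and 8 can be sketched as greedy decoding. The callable `predict_next_word` stands in for the trained RNN and is purely hypothetical, as are the `<start>` and `<end>` markers and the length limit.

```python
def generate_caption(features, predict_next_word, max_len=20,
                     start="<start>", end="<end>"):
    """Greedy caption generation.

    features: vector representations extracted from the photograph.
    predict_next_word: hypothetical callable (features, words) -> str
                       standing in for the RNN decoder.
    """
    words = [start]
    for _ in range(max_len):
        # The image features and all previously generated words are
        # fed to the same model to obtain the next word.
        next_word = predict_next_word(features, words)
        if next_word == end:
            break
        words.append(next_word)
    return words[1:]  # drop the start marker

# Toy stand-in for the RNN: replays a fixed sentence.
def toy_model(features, words):
    sentence = ["a", "man", "riding", "a", "wave", "<end>"]
    return sentence[len(words) - 1]

print(" ".join(generate_caption([0.1, 0.2], toy_model)))  # a man riding a wave
```

A production decoder would typically replace the greedy choice with beam search; the loop structure stays the same.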

9. The method of claim 4, wherein the training sample is a collection of photographs with a known type of event.

10. The method according to claim 6, wherein the conventional classifier is pre-trained to recognize events on a training sample in which each photograph is presented together with its extracted visual vector representations.

11. The method according to claim 7, wherein the aggregated degrees of confidence are calculated as a weighted sum of the confidence scores, and the decision is made in favor of the class with the maximum aggregated degree of confidence.
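The soft aggregation of claims 7 and 11 amounts to a weighted sum of two confidence vectors followed by an arg-max. In the sketch below the single mixing weight, its default value of 0.5, and the function name `ensemble_predict` are assumptions for illustration; the claims do not specify how the weights are chosen.

```python
def ensemble_predict(visual_scores, text_scores, weight=0.5):
    """Combine two classifiers' confidence scores.

    visual_scores: per-class scores from the classifier of vector
                   representations.
    text_scores: per-class scores from the classifier of encoded captions.
    weight: hypothetical weight of the visual classifier in [0, 1].
    Returns (index of the winning class, aggregated scores).
    """
    combined = [
        weight * v + (1.0 - weight) * t
        for v, t in zip(visual_scores, text_scores)
    ]
    # Decide in favor of the class with the maximum aggregated confidence.
    best = max(range(len(combined)), key=combined.__getitem__)
    return best, combined

best, scores = ensemble_predict([0.6, 0.3, 0.1], [0.2, 0.7, 0.1])
print(best)  # 1
```

In this example the visual classifier alone would choose class 0, while the text classifier shifts the aggregated decision to class 1.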