WO2021101231A1 - Event recognition on photos with automatic album detection - Google Patents

Event recognition on photos with automatic album detection

Info

Publication number
WO2021101231A1
Authority
WO
WIPO (PCT)
Prior art keywords
photos
photo
event
embeddings
classifier
Application number
PCT/KR2020/016234
Other languages
French (fr)
Inventor
Andrey Vladimirovich Savchenko
Original Assignee
Samsung Electronics Co., Ltd.
Priority claimed from RU2020112953A external-priority patent/RU2742602C1/en
Application filed by Samsung Electronics Co., Ltd. filed Critical Samsung Electronics Co., Ltd.
Publication of WO2021101231A1 publication Critical patent/WO2021101231A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/30 - Scenes; Scene-specific elements in albums, collections or shared content, e.g. social network photos or video
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2413 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F 18/24133 - Distances to prototypes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 - Machine learning
    • G06N 20/10 - Machine learning using kernel methods, e.g. support vector machines [SVM]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 - Machine learning
    • G06N 20/20 - Ensemble learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/048 - Activation functions
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 - Computing arrangements using knowledge-based models
    • G06N 5/01 - Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound


Abstract

Demonstrated is that grouping of consecutive photos and attention-based recognition of the resulting photo sets can drastically improve the recognition accuracy. It has been shown that the most important parameter, namely, the similarity threshold ρ_0, can be automatically estimated in the learning procedure. It has been experimentally demonstrated that consecutive photos from the same album are better discovered if the confidence scores of a classifier learned on the unfolded training set X are matched. In addition, it is proposed to apply generative models, namely, image captioning, to the classical discriminative task of event recognition in still photos. Presented is a novel pipeline of visual preference prediction using image captioning with classification of generated captions and retrieval of photos based on their textual descriptions.

Description

EVENT RECOGNITION ON PHOTOS WITH AUTOMATIC ALBUM DETECTION
The invention relates to the field of computer technology, namely, methods for image processing and analysis, and can be used for organizing a photo gallery in mobile systems.
People are taking more photos than ever before in recent years [13] due to the rapid growth of social networks, cloud services and mobile technologies. To organize a personal collection, the photos are usually assigned to albums according to some events. The photo organizing systems (Apple iPhoto, Google Photos, etc.) allow the user to rapidly search for a required photo, and also to increase the efficiency of work with a gallery [27]. Nowadays, these systems usually include content-based photo analysis and automatic association of each photo with different tags (scene description, persons, objects, locations, etc.). Such analysis can be used not only to selectively retrieve photos for a particular tag in order to keep nice memories of some episodes of the user's life [32], but also to make personalized recommendations that assist customers in finding relevant items within large collections. The design of such systems requires careful consideration of the user modeling approach [26]. A large gallery of photos on a mobile device can be used for understanding such user interests as sport, gadgets, fitness, clothes, cars, food, travelling, pets, etc. [12, 20].
The present solution focuses on one of the most challenging parts of a photo organizing engine, namely, photo-based event recognition [1], in order to extract such events as holidays, sport events, weddings, various activities, etc. An event can be defined as a category that captures the "complex behavior of a group of people, interacting with multiple objects, and taking place in a specific environment" [32]. There exist two different tasks of event recognition. The first one is focused on the processing of single photos, i.e., an event is considered as a complex scene with large variations in visual appearance and structure [32]. The second task aims at predicting the event categories of a group of photos (album) [4]. In the latter case it is assumed that all photos in an album are weakly labeled [2], though the importance of each photo may differ [33]. However, in practice only a gallery of photos is available, so the latter approach requires a user to manually choose the albums. Another option includes location-based album creation if the GPS tags are switched on. In both cases the usage of album-based event recognition is limited or even impossible.
In this invention a new formulation of the event recognition task is examined: it is required to predict event categories in a gallery of photos for which albums (groups of photos corresponding to a single event) are unknown. A novel two-stage approach is proposed. At first, embeddings (features) are extracted from each photo using a pre-trained convolutional neural network. These embeddings are classified individually. The scores of the classifier are used to group sequential photos into several clusters. Finally, the embeddings of the photos in each group are aggregated into a single descriptor using a neural attention mechanism. The latter is a weighting scheme to linearly combine all embeddings in the input set. The weights are adaptively calculated using a feed-forward neural network. Due to the attention mechanism, the aggregation is invariant to the image order and does not depend on the number of images in the input set. This algorithm is optionally extended to improve the accuracy of classification of each photo in a group. In contrast to conventional fine-tuning of convolutional neural networks (CNN), it is proposed to use image captioning, i.e., a generative model that converts photos to textual descriptions. These descriptions are one-hot encoded and summarized into a sparse embedding vector suitable for learning of an arbitrary classifier. An experimental study with the Photo Event Collection and the Multi-Label Curation of Flickr Events Dataset demonstrates that the proposed approach is 9-20% more accurate than event recognition on single photos. Moreover, the proposed method has a 13-16% lower error rate than classification of groups of photos obtained with hierarchical clustering. It is experimentally shown that the photo captions trained on the Conceptual Captions dataset can be classified more accurately than the embeddings from an object detector, though both are obviously not as rich as the CNN-based embeddings. However, it is possible to combine the present approach with conventional CNNs in an ensemble to provide state-of-the-art results for several event datasets.
Thus, in this invention considered is the new task of event recognition, in which a gallery of photos is given and it is known that it contains ordered albums with unknown borders. Proposed is to automatically assign these borders based on the visual content of consecutive photos in a gallery. Next, consecutive photos are grouped, and the descriptor of each group is computed with an attention mechanism from the neural aggregation module [37]. Finally, this approach is extended as follows. Despite the conventional usage of CNNs as discriminative models in classifier design, proposed is to borrow generative models to represent an input photo in another domain. In particular, existing methods of image captioning [14] that generate textual descriptions of photos are used. The main contribution is a demonstration that the generated descriptions can be fed to the input of a classifier in an ensemble in order to improve the event recognition accuracy of traditional methods. Though the proposed visual representations are not as rich as embeddings extracted by fine-tuned CNNs, they are better than the outputs of object detectors [20].
The task of event recognition in personal photo collections is not to recognize the event in an individual photo but in the whole album [29]. The events and sub-events of sub-sequences of photos are identified in [8] by integrating optimized linear programming with the color descriptor of the signature image. The Stopwatch Hidden Markov Models were applied in [5] by treating the photos in an album as sequential data. The detectors for objects relevant to the events were trained on the holiday dataset [29]. Next, these holidays are classified based on the outputs of the object detector. The paper [2] tackles the presence of irrelevant images in an album with multiple instance learning techniques. An iterative updating procedure for event type and image importance score prediction in a siamese network is presented in [33]. The authors of this paper used a CNN that recognizes the event type, and a Long Short-Term Memory (LSTM)-based sequence-level event recognizer for a whole album. Moreover, they successfully applied the method for learning representative deep features for image set analysis [34]. The latter approach focuses on capturing the co-occurrences and frequencies of features, so that the temporal coherence of photos in an album is not required. A model to recognize events from coarse to fine hierarchical levels using multi-granular features [24] is proposed in [13] based on an attention network that learns the representations of photo albums. The efficiency of re-finding expected photos on mobile phones was improved by a method to classify personal photos based on the relationship of shooting time and shooting location to specific events [10].
The album information is not always available, so a gallery contains an unstructured list of photos ordered by their creation time. In such a case it is possible to use existing methods of event recognition on single photos [1]. Similar to other computer vision domains, the mainstream approach tends toward applications of CNN-based architectures. For example, four different layers of a fine-tuned CNN were used to extract features and perform Linear Discriminant Analysis in order to obtain the top entry in the ChaLearn LAP 2015 cultural event recognition challenge [9]. The bounding boxes of detected objects are projected onto multi-scale spatial maps for increasing the accuracy of event recognition [35]. A novel iterative selection method is introduced in [32] to identify a subset of classes that are most relevant for transferring deep representations learned from object (ImageNet) and scene (Places2) datasets.
Proposed is a method for event recognition on photos with automatic album detection, comprising: automatically assigning borders of the albums based on the visual content of consecutive photos in the gallery; grouping consecutive photos in a gallery into albums; computing a descriptor of each album with an attention mechanism from the neural aggregation module; recognizing the tag of the event type of each album by feeding its descriptor into a classifier; recognizing events on photos in the gallery by assigning the corresponding tag of the event type to all photos in each album. Wherein the automatic assigning of borders of the albums includes the following steps: estimating the similarity between confidences of each pair of consecutive photos in a gallery; if the similarity does not exceed a certain threshold, then it is assumed that both photos are included in the same album. Wherein the similarity between photos is computed as a distance between confidences for each type of event estimated based on classifying each photo. Wherein the similarity between photos is computed as a sum of similarities between confidences for each type of event estimated based on classifying each photo and the similarity between their locations, if location information is available in the EXIF (Exchangeable Image File Format) data of these photos. Wherein the certain threshold is automatically computed during the preliminary training procedure by using an available training set of albums.
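The following Python sketch illustrates this claimed flow under simplified assumptions; extract_embeddings, classify_confidences, album_descriptor and album_classifier are hypothetical helpers standing in for the blocks described above, and the Euclidean distance between confidence vectors is used as the dissimilarity measure.

```python
# Sketch only: hypothetical helpers stand in for the units described in the text.
import numpy as np

def recognize_events_in_gallery(photos, rho_0, extract_embeddings,
                                classify_confidences, album_descriptor,
                                album_classifier):
    """Tag every photo with an event by first detecting album borders."""
    x = [extract_embeddings(photo) for photo in photos]        # visual embeddings
    p = [classify_confidences(xi) for xi in x]                 # per-photo confidence scores
    # A new album starts when consecutive confidences differ too much.
    borders = [0]
    for t in range(1, len(photos)):
        if np.linalg.norm(p[t] - p[t - 1]) > rho_0:
            borders.append(t)
    borders.append(len(photos))
    # Classify each detected album and propagate its tag to all of its photos.
    tags = [None] * len(photos)
    for k in range(len(borders) - 1):
        event = album_classifier(album_descriptor(x[borders[k]:borders[k + 1]]))
        for t in range(borders[k], borders[k + 1]):
            tags[t] = event
    return tags
```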
Also proposed is a method for event recognition on a photo with its representation in the text domain, comprising: computing embeddings of a photo by feeding its RGB (Red-Green-Blue) representation into a convolutional neural network; using image captioning to generate textual descriptions of the photo based on its embeddings; encoding the generated caption; feeding the generated textual descriptions to the input of an event classifier; recognizing the event by assigning the output of the event classifier to the corresponding photo. Wherein the generated caption is encoded as a sparse vector of one-hot encoding of the textual description of the photo: the v-th component of the vector is equal to 1 only if at least one word in the generated caption is equal to the v-th word from the vocabulary. The proposed method further comprises combining the classifier output for generated textual descriptions and a traditional classifier of embeddings in an ensemble to improve event recognition accuracy. Wherein the outputs of the classifiers are obtained as follows: computing confidence scores of all event types for a photo by feeding its embeddings into any classifier; computing confidence scores of all event types for a photo by feeding a sparse vector of the one-hot encoded textual description into any classifier; combining the outputs of the classifiers for embeddings and texts in a simple voting with soft aggregation. Wherein the textual description of the photo is generated by: feeding extracted embeddings and a sequence of previously generated words into an RNN (Recurrent Neural Network) in order to predict the next word in the textual description of a photo; mapping the numbers generated by the RNN into real words from a vocabulary; selecting a subset of the vocabulary by choosing the top most frequently occurring words in the training data with optional exclusion of stop words. The proposed method further comprises appending the predicted word to this sequence of previously generated words and feeding the extracted visual embeddings and this sequence into the same RNN. Wherein the training set is a set of training photos with a known type of event. Wherein the traditional classifier should be preliminarily trained to recognize events in a training set, where each photo from the training set is represented with its own extracted visual embeddings. Wherein, in particular, the aggregated confidences are computed as the weighted sum of confidence scores, and the decision is taken in favor of the class with the maximal aggregated confidence.
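A compact sketch of the claimed text-domain branch for a single photo is given below; cnn_embed, generate_caption, text_classifier and embedding_classifier are hypothetical helpers, and equal ensemble weights are assumed.

```python
# Sketch only: caption-based and embedding-based classifiers combined by soft voting.
import numpy as np

def recognize_event_with_caption(photo, cnn_embed, generate_caption, vocabulary,
                                 text_classifier, embedding_classifier, w=0.5):
    x = cnn_embed(photo)                                   # CNN embeddings of the RGB photo
    caption_words = set(generate_caption(x))               # RNN-generated textual description
    # One-hot (multi-hot) encoding of the caption over the reduced vocabulary.
    t_sparse = np.array([1.0 if word in caption_words else 0.0 for word in vocabulary],
                        dtype=np.float32)
    # Soft voting: weighted sum of the confidence scores of both classifiers.
    p = w * embedding_classifier(x) + (1.0 - w) * text_classifier(t_sparse)
    return int(np.argmax(p))
```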
The above and/or other aspects will be more apparent by describing exemplary embodiments with reference to the accompanying drawing.
Fig. 1 illustrates proposed gallery-based event recognition pipeline.
Fig. 2 illustrates attention-based neural network for embeddings from MobileNet v2.
Fig. 3 illustrates Mobile demo GUI.
Fig. 4 illustrates proposed event recognition pipeline based on image captioning.
Fig. 5 illustrates sample results of event recognition.
The invention can be used in photo organizing software, which can optionally be implemented on, for example, a computer-readable medium, to automatically select albums from an unlabeled set of photos and associate each album with specific tags (types of events or scenes). The present invention can be implemented by software and/or hardware in any device: smartphone, mobile phone, tablet PC, laptop, computer, etc.
The technical effect is a significant improvement in event recognition accuracy achieved by combining sequentially taken photos into albums (i.e., groups of sequential photos with similar content) and then classifying each album based on the neural attention mechanism. The confidence scores of event classifiers are used to find similar photos instead of conventional matching of visual embeddings. In addition to the traditional visual embeddings, their automatically generated textual descriptions (captions) are used to increase the accuracy of the ensemble of classifiers by increasing the variety of visual representations.
Proposed solution consists in the following:
Perform the following processing for each photo in a gallery. Compute embeddings (features) of a photo by feeding its RGB (Red-Green-Blue) representation into a convolutional neural network. It is proposed to use an additional representation of a photo by using image captioning techniques to increase the variety of visual embeddings. In particular, generate a caption (textual description) of a photo by feeding its embeddings into a specially trained recurrent neural network. Next, one-hot encode the generated captions. Estimate confidences for each type of event by classifying a photo using an ensemble of a traditional classifier of embeddings and one-hot encoded photo captions. This ensemble consists in the following: the traditional approach classifies the embeddings of photos extracted using convolutional neural networks; in the proposed approach, a text description of the photo is first generated, then this text is converted to a numerical form using the one-hot encoding method, and the resulting vector is classified using known methods.
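A minimal embedding extractor is sketched below, assuming a TF2/Keras environment and an ImageNet-pretrained MobileNet v2 as a stand-in for the Places2-pretrained scene models mentioned in the text.

```python
# Sketch only: penultimate-layer (globally pooled) features from a pre-trained CNN.
import tensorflow as tf

encoder = tf.keras.applications.MobileNetV2(include_top=False, pooling="avg",
                                             weights="imagenet")

def extract_embeddings(rgb_image):
    """rgb_image: HxWx3 uint8 array; returns a D-dimensional embedding vector."""
    img = tf.image.resize(rgb_image, (224, 224))
    img = tf.keras.applications.mobilenet_v2.preprocess_input(tf.cast(img, tf.float32))
    return encoder(img[tf.newaxis, ...]).numpy()[0]   # shape (1280,) for width multiplier 1.0
```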
Estimate the similarity between each pair of consecutive photos in a gallery. It is proposed to compute the similarity between the normalized confidence scores described above instead of the conventional computation of the similarity between embeddings extracted by a convolutional neural network. In order to normalize the confidence scores, their L2 norm is computed as the square root of the sum of squares of all confidence scores, and each score is divided by this L2 norm. If location information is available in the EXIF (Exchangeable Image File Format) data of these photos, the similarity between their locations can be added to the similarity between normalized confidence scores. Sequential photos in a gallery are automatically grouped into albums by using the following rule: if the similarity does not exceed a certain threshold, then it is assumed that both photos are included in the same album. The threshold is automatically computed during the preliminary training procedure by using an available training set of albums. Training albums are taken from the available training set, e.g., the Photo Event Collection (PEC) or the Multi-Label Curation of Flickr Events Dataset (ML-CUFED); they are needed to train the classifier.
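The normalization and grouping rule described above can be sketched as follows; the threshold rho_0 is assumed to be learned beforehand, and location_distances is an optional pre-computed array of EXIF-based location dissimilarities between consecutive photos.

```python
# Sketch of the sequential grouping rule: L2-normalize confidence vectors and start a
# new album whenever the distance between consecutive photos exceeds the threshold.
import numpy as np

def group_gallery(confidences, rho_0, location_distances=None):
    """confidences: (T, C) array of classifier scores, one row per photo in shooting order."""
    p = np.asarray(confidences, dtype=np.float64)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)    # divide each score vector by its L2 norm
    albums, current = [], [0]
    for t in range(1, len(p)):
        dist = np.linalg.norm(p[t] - p[t - 1])          # Euclidean distance between confidences
        if location_distances is not None:              # optionally add location dissimilarity
            dist += location_distances[t - 1]
        if dist <= rho_0:                               # similar enough: same album
            current.append(t)
        else:                                           # border between albums at position t
            albums.append(current)
            current = [t]
    albums.append(current)
    return albums                                       # list of lists of photo indices
```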
In contrast to existing methods of detecting events in photo collections, the tag of the event type is assigned to all photos in each album by using a special neural network with an attention mechanism that is trained to classify the event in a set of photos from the same album. Here the accuracy is improved over the conventional photo tagging technique because events are recognized simultaneously for all photos in the selected albums, so that mistakes for individual photos may be automatically corrected.
Annotating personal photo albums is an emerging trend in photo organizing services [8]. A method for hierarchical photo organization into topics and topic-related categories on a smartphone is proposed in [17] based on integration of convolutional neural network (CNN) and topic modeling for photo classification. An automatic hierarchical clustering and best photo selection solution is introduced in [16] for modeling user decisions in organizing or clustering similar photos in albums. Organizing photo albums for user preference prediction on mobile device is considered in [24].
Unfortunately, the accuracy of event classification on still photos [32] is in general much lower than the accuracy of album-based recognition [33]. That is why in the present invention it is proposed to concentrate on other suitable visual embeddings extracted with generative models and, in particular, image captioning techniques. There is a wide range of applications of image captioning: from automatic generation of descriptions for photos posted in social networks to photo retrieval from databases using generated text descriptions [30]. The image captioning methods are usually based on an encoder-decoder neural network, which first encodes a photo into a fixed-length vector representation using a pre-trained CNN, and then decodes the representation into a caption (a natural language description). During the training of the decoder (generator), the input photo and its ground-truth textual description are fed as inputs to the neural network, while the one-hot encoded description presents the desired network output. The description is encoded using text embeddings in the Embedding (lookup) layer [11]. The generated photo and text embeddings are merged using concatenation or summation and form the input to the decoder part of the network. It is typical to include a recurrent neural network (RNN) layer followed by a fully connected layer with a Softmax output layer.
One of the first successful models, "Show and Tell" [31], won the first MS COCO Image Captioning Challenge in 2015. It uses an RNN with long short-term memory (LSTM) units in the decoder part. Its enhancement "Show, Attend and Tell" [36] incorporates a soft attention mechanism to improve the quality of caption generation. The "Neural Baby Talk" image captioning model [18] is based on generating a template with slot locations explicitly tied to specific image regions. These slots are then filled in by visual concepts identified by object detectors. The foreground regions are obtained using the Faster R-CNN network [21], and the LSTM with an attention mechanism serves as a decoder. The "Multimodal Recurrent Neural Network" (mRNN) [19] is based on the Inception network for photo embedding extraction and a deep RNN for sentence generation. One of the best models nowadays is the Auto-Reconstructor Network (ARNet) [6], which uses the Inception-V4 network [28] in the encoder, while the decoder is based on an LSTM. There exist two pre-trained models, with greedy search (ARNet-g) and with beam search (ARNet-b) of size 3, to generate the final caption for each input photo.
MATERIALS AND METHODS
Problem formulation
As noted above, the task of recognizing events for each photo in the gallery is being solved. At the first stage, albums (groups of the user's consecutive photographs of the same event) are automatically selected based on the analysis of the photo content. Next, the events corresponding to each selected album are recognized. After that, the event recognized for the album is mapped to each photo in this album. As a result, it is possible to recognize events more accurately than by simply classifying the events in each individual photograph.
In this subsection a technological engine is described that can solve this task by using sequential processing of photos similarly to cluster analysis with the Basic Sequential Algorithmic Scheme (BSAS) [3]. The main task can be formulated as follows. It is required to assign each photo X_t, t ∈ {1, ..., T}, from a gallery of an input user to one of C > 1 event categories (classes). Here T ≥ 1 is the total number of photos in a gallery. The training set of N ≥ 1 albums is available for learning of the event classifier. The n-th training (reference) album is defined by a collection of L_n photos {X_n(l)}, l ∈ {1, ..., L_n}. The class label c_n ∈ {1, ..., C} of each n-th album is supposed to be given, i.e., it is assumed that an album is associated with exactly one event type.
Conventional event recognition on single photos [32] is the special case of the above-formulated problem if T = 1. Hence, it is possible to solve our task by classifying events in each t-th photo (t = 1, 2, ..., T) independently. However, it is possible to take into account that the gallery {X_t} in our task is not a random collection of photos but can be represented as a sequence of disjoint albums. Each photo in an album is associated with the same event. In contrast to the album-based event recognition, we do not know the borders of each album, i.e., the numbers of the first (t_1) and the last (t_2) photo from a gallery for which it is guaranteed that all photos X_t, t = t_1+1, ..., t_2, correspond to this album. This task possesses several characteristics that make it extremely challenging compared to previously studied problems. One of these characteristics is the presence of irrelevant or unimportant photos that can in principle be associated with any event [1]. These photos are easily detected in attention-based models [13, 37], but may have a significant impact on the accuracy of automatic detection of the album's borders.
The baseline approach here is to classify all T photos independently of each other, so that the decision for every photo does not influence the decision for any other photo. In such a case it is typical to unfold the training albums into a set X = {X_n(l)} of L = L_1 + ... + L_N photos, so that the collection-level label c_n of the n-th album is assigned to the labels of each l-th photo X_n(l). Labels c_n are event numbers corresponding to the entire album (photo collection). Next, it is possible to train any known classifier that can return a vector of confidence scores for each class rather than predict the class label only. If L is rather small to train a deep CNN (convolutional neural network) from scratch, transfer learning or domain adaptation can be applied [11]. In these methods a large external dataset, e.g. ImageNet-1000 or Places2 [38], is used to pre-train a deep CNN. As special attention is paid to offline recognition on mobile devices, it is reasonable to use such CNNs as MobileNet v1/v2 [15, 22]. The final step in transfer learning is fine-tuning of this neural network on X. This step includes replacement of the last layer of the pre-trained CNN with a new layer with Softmax activations and C outputs. During the classification process, each input photo X_t is fed to the fine-tuned CNN to compute the scores (predictions at the last layer) p_t = [p_{t,1}, ..., p_{t,C}]. This procedure can be modified by replacing the C logistic regressions in the last layer with a more complex classifier, e.g., random forest (RF), support vector machine (SVM) or gradient boosting. In this case the embeddings [25] are extracted using the outputs of one of the last layers of the pre-trained CNN. Namely, the photos X_t and X_n(l) are fed to the CNN, and the outputs of the one-but-last layer are used as the D-dimensional embedding vectors x_t and x_n(l), respectively. Such deep learning-based embedding extractors allow training of a general classifier C. The t-th photo is fed into this classifier to obtain the C-dimensional confidence scores p_t = [p_{t,1}, ..., p_{t,C}].
Finally, the confidences p_t computed in any of the above-mentioned ways are used to make a decision in favor of the most probable class:

c*(t) = argmax_{c ∈ {1,...,C}} p_{t,c}.    (1)
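A baseline sketch of this single-photo classification is shown below, assuming embeddings have already been extracted; the linear SVM from scikit-learn is used as an example of a classifier returning confidence scores, and the decision follows Eq. (1).

```python
# Sketch of the baseline: a linear SVM over pre-extracted embeddings.
import numpy as np
from sklearn.svm import LinearSVC

def train_baseline(train_embeddings, train_labels):
    clf = LinearSVC(C=1.0)
    clf.fit(train_embeddings, train_labels)                # unfolded training set X
    return clf

def predict_single_photo(clf, x_t):
    scores = clf.decision_function(x_t.reshape(1, -1))[0]  # C-dimensional confidence scores p_t
    return int(np.argmax(scores))                          # c*(t) = argmax_c p_{t,c}, Eq. (1)
```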
Event recognition in a gallery of photos.
Fig. 1 illustrates the proposed gallery-based event recognition pipeline. It is necessary to note that all described units can be implemented by software and/or hardware.
In the unit "Embedding extraction (CNN)" the following is implemented: extracting visual embeddings (features) by feeding the RGB representation of each photo from a gallery into a convolutional neural network and computing the outputs of one of its last layers (usually, the penultimate layer).
In the unit "Classifier" the following is implemented: computing confidence scores of all event types for each photo by feeding its embeddings into any classifier. This classifier should be preliminarily trained to recognize events in a training set. The vector of confidence scores is normalized using its L2 norm.
In the unit "Sequential cluster analysis" the following is implemented: for each photo from the gallery, computing the similarity between its normalized confidence scores and the normalized confidence scores of the next photo. If location information is available in the EXIF (Exchangeable Image File Format) data of these photos, the similarity between their locations can be added to the similarity between normalized confidence scores. If this similarity exceeds a certain threshold, then the photos will be included in different albums and a border between the albums of these two consecutive photos will be established.
In the unit "Neural attention model" the following is implemented: for each consecutive pair of album borders extracted at the previous step, obtaining the visual embeddings of all photos between these borders. The set of embeddings is fed into the neural attention model that predicts a single event class for a set of photos. The event class predicted in this way is assigned to all photos between these borders.
Here, firstly, the module "Embedding extractor" computes the embeddings x_t of every t-th individual photo as described above. The classifier confidences p_t are estimated in the "Classifier" block. Next, sequential analysis from BSAS clustering [3] is used in the "Sequential cluster analysis" module for the sequence of confidences {p_t} in order to obtain the borders of albums. Namely, the similarities between confidences of all subsequent photos ρ(p_t, p_{t-1}) are computed. Here any appropriate similarity can be used, e.g., Euclidean, Minkowski, chi-squared, Kullback-Leibler and Jensen-Shannon divergences, etc. If a similarity does not exceed a certain threshold ρ_0, then it is assumed that both photos are included in the same album. If location information is available in the EXIF (Exchangeable Image File Format) data of these photos, the similarity between their locations can be added to ρ(p_t, p_{t-1}) in order to obtain the final similarity to be matched with the threshold. Otherwise, the border between two albums is established at the t-th position. As a result, the borders 1 = t_0 < t_1 < ... < t_K = T+1 are obtained, so that the k-th album contains the photos X_t, t ∈ {t_{k-1}, ..., t_k - 1}. See Fig. 2.
At the second stage, the final descriptor x(k) of the k-th album is produced as a weighted sum of the individual embeddings x_t:

x(k) = sum_{t = t_{k-1}}^{t_k - 1} w(x_t) * x_t,    (2)

where the weights w may depend on the embeddings x_t. It is typical to use here average pooling (AvgPool) with equal weights, so that the conventional computation of the mean embedding vector is implemented.
Algorithm 1 Proposed gallery-based event recognition
Figure PCTKR2020016234-appb-img-000019
However, in this invention it is proposed to learn the weights w(x_t), particularly, with an attention mechanism from the neural aggregation module used previously only for video recognition [37]:

w(x_t) = exp(q^T x_t) / sum_{j = t_{k-1}}^{t_k - 1} exp(q^T x_j).    (3)

Here q is the learnable D-dimensional vector of weights. The dense (fully connected) layer is attached to the resulting descriptor x(k), and the whole neural network (Fig. 2) is trained in an end-to-end manner using the given training set of N ≥ 1 albums. The event class predicted by this network in the "Neural attention model" block (Fig. 1) is assigned to all photos X_t, t ∈ {t_{k-1}, ..., t_k - 1}.
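A compact Keras sketch of the attention-based aggregation of Eq. (2)-(3) is given below: a learnable query vector q scores every embedding, a softmax turns the scores into weights, and the weighted sum is fed into a dense classification head. Layer and function names are illustrative, not taken from the patent.

```python
import tensorflow as tf

class NeuralAttentionPooling(tf.keras.layers.Layer):
    """Weighted sum of a set of embeddings with weights softmax(q^T x_t), Eq. (2)-(3)."""
    def build(self, input_shape):
        self.q = self.add_weight(name="q", shape=(int(input_shape[-1]),),
                                 initializer="glorot_uniform")

    def call(self, x):                                              # x: (batch, S, D)
        scores = tf.reduce_sum(x * self.q, axis=-1, keepdims=True)  # q^T x_t for every photo
        weights = tf.nn.softmax(scores, axis=1)                     # attention weights, Eq. (3)
        return tf.reduce_sum(weights * x, axis=1)                   # album descriptor x(k), Eq. (2)

def build_attention_model(num_photos, dim, num_classes, multi_label=False):
    inputs = tf.keras.Input(shape=(num_photos, dim))
    descriptor = NeuralAttentionPooling()(inputs)
    activation = "sigmoid" if multi_label else "softmax"            # multi-label vs single-label setup
    outputs = tf.keras.layers.Dense(num_classes, activation=activation)(descriptor)
    return tf.keras.Model(inputs, outputs)
```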
Complete classification and learning procedures are presented in Algorithm 1 and Algorithm 2, respectively. For simplicity, it is mentioned that the latter calls the event classification in step 17. However, to speed up computations it is recommended to pre-compute the pairwise similarity matrix between the confidence scores of all training photos, so that embedding extraction (steps 3-4 in Algorithm 1) and similarity calculation are not needed during the learning of the model.
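The recommended speed-up can be sketched as follows: the pairwise distances between the (normalized) confidence vectors of all training photos are computed once, so that the threshold search only reads this matrix instead of repeating embedding extraction and similarity computation.

```python
# Sketch of the pre-computation step used during learning of the threshold.
import numpy as np
from scipy.spatial.distance import cdist

def precompute_similarity_matrix(train_confidences):
    p = np.asarray(train_confidences, dtype=np.float64)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)   # same L2 normalization as at inference
    return cdist(p, p, metric="euclidean")             # (L, L) matrix of pairwise distances
```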
In the present invention the whole pipeline (Fig. 1) is implemented in the publicly available demo application for Android (http://drive.google.com/open?id=1aYN0ZwU90T8ZruacvND01hbIaJS4EZLI) (Fig. 3), which was previously developed to extract user preferences by processing all photos from the gallery in a background thread [26]. The similar events found in photos made in one day were united into high-level logs for the most important events. Only those scenes/events are displayed for which there exist at least 2 photos and the average score of scene/event predictions for all photos of the day exceeds a certain threshold. This threshold is automatically obtained during the learning procedure (steps 16-22 in Algorithm 2). The sample screenshot of the main user interface is shown in Fig. 3a. It is possible to tap any bar in this histogram to show a new form with detailed categories (Fig. 3b). If a concrete category is tapped, a "display" form appears, which contains a list of all photos from the gallery with this category (Fig. 3c). Here events are grouped by date and a possibility to choose a concrete day is provided.
Algorithm 2 Learning procedure in the proposed approach
Figure PCTKR2020016234-appb-img-000022
Event recognition in single photos
The event recognition in single photos task can be formulated as a typical image recognition problem. It is required to assign each photo X from a gallery to one of C > 1 event categories (classes). The training set of N ≥ 1 photos {X_n}, n ∈ {1, ..., N}, with known event labels c_n ∈ {1, ..., C} is available for classifier learning. Sometimes the training photos of the same event are associated with an album [5, 33]. In such a case the training albums are unfolded into a set X so that the collection-level label of the album is assigned to the labels of each photo from this album. This task possesses several characteristics that make it extremely challenging compared to album-based event recognition. One of these characteristics is the presence of irrelevant or unimportant photos that can be associated with any event [1]. These photos can be detected by attention-based models when the whole album is available [13], but may have a significant impact on the quality of event recognition in single photos.
As N is usually rather small, transfer learning may be applied [11]. A deep CNN is firstly pre-trained on a large dataset, e.g. ImageNet or Places [38]. Secondly, this CNN is fine-tuned on X, i.e., the last layer is replaced with a new layer with Softmax activations and C outputs. Each input photo X is classified by feeding it to the fine-tuned CNN to compute C scores from the output layer, i.e., estimates of the posterior probabilities for all event categories. This procedure can be modified by extraction of deep photo embeddings using the outputs of one of the last layers of the pre-trained CNN. The photos X and X_n are fed to the input of the CNN, and the outputs of the one-but-last layer are used as the D-dimensional embedding vectors x and x_n, respectively. Such deep learning-based embedding extractors allow training of a general classifier C, e.g., k-nearest neighbor, random forest (RF), support vector machine (SVM) or gradient boosting. The C-dimensional vector of confidence scores p = [p_1, ..., p_C] is predicted given the input photo in both cases: fine-tuning with the last Softmax layer in the role of classifier C, and embedding extraction with a general classifier. See Fig. 3. The final decision can be made in favor of the class with the maximal confidence.
In this invention another approach to event recognition based on generative models and image captioning is used. The proposed event recognition pipeline based on image captioning is presented in Fig. 4. It is necessary to note that all described units can be implemented by software and/or hardware.
In unit "Embedding extraction (CNN)" implemented is: extracting visual embeddings (embeddings) by feeding RGB representation of a photo from a gallery into preliminarily trained convolutional neural network compute the outputs of one of its last layers (usually, penultimate layer).
In unit "Caption generation" implemented is: feeding extracted visual embeddings and a sequence of previously generated words into RNN (Recurrent Neural Network) in order to predict the next word in the textual description of a photo. This sequence of previously generated words initially contains only one special word <START>. While the predicted word is not equal to special word <END>, append this predicted word into this sequence of previously generated words and feed extracted visual embeddings and this sequence into the same RNN.
In unit "Vocabulary" implemented is: as above-mentioned RNN operates with words represented by numbers, it is necessary to map the numbers generated by the RNN into real words from a vocabulary.
In unit "Caption preprocessing" implemented is: selecting a subset of vocabulary by choosing top most frequently occuring words in the training data with optional exclusion of stop words. Next, each photo is represented as the sparse vector using the one-hot encoding: the v-th component of vector is equal to 1 only if at least one word in the generated caption is equal to the v-th word from vocabulary.
Unit "Training set" contains: a set of training photos with known type of event.
In unit "Classification of embeddings" implemented is: computing confidence scores of all event types for a photo by feeding its embeddings into any classifier. This classifier should be preliminary trained to recognize events in a training set, where each photo from the training set is represented with its own visual embeddings extracted with a procedure from block "Embedding extraction (CNN)".
In unit "Text classification" implemented is: computing confidence scores of all event types for a photo by feeding a sparse vector of one-hot encoded textual description into any classifier. This classifier should be preliminary trained to recognize events in a training set, where each photo from the training set is represented with its own visual embeddings extracted with a procedure from block "Caption preprocessing".
In unit "Ensemble" : the outputs of classifiers for embeddings and texts are combined in a simple voting with soft aggregation. In particular, aggregated confidences are computed as the weighted sum of confidence scores computed in blocks "Classification of embeddings" and "Text classification". The decision is taken in favor of the class with the maximal aggregated confidence.
At first, the conventional extraction of embeddings x is implemented using a pre-trained CNN. Next, these visual embeddings and a vocabulary V are fed to a special RNN-based neural network (generator) that produces the caption which describes each input photo. The caption is represented as a sequence of L > 0 words from the vocabulary V: t = (t_1, ..., t_L), t_l ∈ V. It is generated sequentially, word by word, starting from the special word <START> until the special word <END> is produced [6]. See Fig. 4.
The generated caption t is fed into an event classifier. In order to learn its parameters, every n-th photo from the training set is fed to the same image captioning network to produce the caption t_n = (t_{n,1}, ..., t_{n,L_n}). Since the number of words L_n is not the same for all photos, it is necessary to either train a sequential RNN-based classifier or transform all captions into embedding vectors with the same dimensionality. As the number of training instances N is not very large, it was experimentally noticed that the latter approach is as accurate as the former, though the training time is significantly lower. Hence, it was decided to use the one-hot encoding of the sequences t and {t_n} into vectors of 0s and 1s, as described in [7]. In particular, a subset of the vocabulary V' ⊂ V is selected by choosing the top most frequently occurring words in the training data {t_n} with optional exclusion of stop words. Next, each input photo is represented as the |V'|-dimensional sparse vector t', where |V'| is the size of the reduced vocabulary V' and the v-th component of the vector t' is equal to 1 only if at least one of the L words in the caption t is equal to the v-th word from the vocabulary V'. This would mean, for instance, turning the sequence {1, 5, 10, 2} into a |V'|-dimensional sparse vector that would be all 0s except for indices 1, 2, 5 and 10, which would be 1s [7]. The same procedure is used to describe each n-th training photo with the |V'|-dimensional sparse vector t'_n. After that an arbitrary classifier of such textual representations suitable for sparse data can be used to predict the C confidence scores p(t') = [p_1(t'), ..., p_C(t')]. It was demonstrated in [7] that such an approach is even more accurate than conventional RNN-based classifiers (including one layer of LSTMs) for the IMDB dataset.
In general, it is not expected that the classification of short textual descriptions is more accurate than conventional image recognition methods. Nevertheless, it is believed that the presence of photo captions in an ensemble of classifiers can significantly improve its diversity. Moreover, as the captions are generated based on the extracted embedding vector x, only one inference in the CNN is required if the conventional general classifier of embeddings and the classifier of captions are combined in a simple voting with soft aggregation. See Fig. 5.
In particular, the aggregated confidences are computed as the weighted sum of the outputs of the individual classifiers:

p_c = w * p_c(x) + (1 - w) * p_c(t'), c ∈ {1, ..., C}.    (4)

The decision is taken in favor of the class with the maximal confidence:

c* = argmax_{c ∈ {1,...,C}} p_c.    (5)

The weight w in (4) can be chosen using a special validation subset in order to obtain the highest accuracy of criterion (5).
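A simple way to choose the weight w on a validation subset is a grid search maximizing the accuracy of criterion (5); the sketch below assumes that validation confidence scores of both classifiers and the ground-truth labels are available as arrays.

```python
# Sketch of tuning the ensemble weight w of Eq. (4) on a validation subset.
import numpy as np

def tune_ensemble_weight(val_p_embeddings, val_p_text, val_labels,
                         grid=np.linspace(0.0, 1.0, 21)):
    best_w, best_acc = 0.5, -1.0
    for w in grid:
        preds = np.argmax(w * val_p_embeddings + (1.0 - w) * val_p_text, axis=1)
        acc = float(np.mean(preds == val_labels))     # accuracy of criterion (5)
        if acc > best_acc:
            best_w, best_acc = float(w), acc
    return best_w, best_acc
```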
Let us provide qualitative examples of the usage of the pipeline (Fig. 4). The results of (correct) event recognition using the ensemble are presented in Fig. 5. Here the first line of the title contains the generated caption of the photo. In addition, the title displays the result of event recognition using the captions t (second line), the embeddings x (third line), and the whole ensemble (last line). As one can notice, the single classification of captions is not always correct. However, the ensemble is able to obtain a reliable solution even when individual classifiers make wrong decisions.
EXPERIMENTAL STUDY
Event recognition in a gallery of photos
Only a limited number of datasets is available for event recognition in personal photo-collections [1]. Hence, examined are two main datasets in this field, namely:
1. PEC [5] with 61,364 photos from 807 collections of 14 social event classes (birthday, wedding, graduation, etc.). Used is its split provided by authors: the training set with 667 albums (50,279 photos) and testing set with 140 albums (11,085 photos).
2. ML-CUFED [33] contains 23 common event types. Each album is associated with several events, i.e., it is a multi-label classification task. Conventional split into the training set (75,377 photos, 1507 albums) and test set (376 albums with 19,420 photos) was used.
The embeddings were extracted using the scene recognition models (Inception v3 and MobileNet v2 with α = 1 and α = 1.4) pre-trained on the Places2 dataset [38]. Two techniques are used in order to obtain a final descriptor of a set of photos, namely: 1) simple averaging of the embeddings of individual photos in a set (AvgPool); and 2) the implementation of the neural attention mechanism (2)-(3) for L2-normed embeddings. In the former case the linear SVM classifier from the scikit-learn library was used as the classifier C, because it has higher accuracy than RF, k-NN and RBF SVM. In the latter case the weights of the attention-based network (Fig. 2) are learned using sets with S = 10 randomly chosen photos from all albums in order to make the shape of the input tensors identical. As a result, 667 training subsets and 1507 subsets with S = 10 photos were obtained for PEC and ML-CUFED, respectively. As ML-CUFED contains multiple labels per album, sigmoid activations and the binary cross-entropy loss are used. Conventional Softmax activations and categorical cross-entropy are applied for PEC. The model was learned using the ADAM optimizer (learning rate 0.001) for 10 epochs with early stopping in the Keras 2.3 framework with the TensorFlow 1.15 backend.
Table 1: Accuracy (%) of event recognition in a set of images (album).
Figure PCTKR2020016234-appb-img-000051
The recognition accuracies of the pre-trained CNNs are presented in Table 1. Here the multi-label accuracy is computed for ML-CUFED, so that a prediction is assumed to be correct if it corresponds to any label associated with an album. In this table the best-known results for these datasets [33, 34] are also provided.
Here in all cases the attention-based aggregation is 1-3% more accurate when compared to the classification of average embeddings. As one can notice, the proposed implementation of the attention mechanism achieves state-of-the-art results, though much faster CNNs are used (MobileNet and Inception rather than AlexNet and ResNet-101) and the sequential nature of photos in an album is not considered in the attention-based network (Fig. 2). The most remarkable fact here is that the best results for PEC are achieved by the simplest model (MobileNet v2, α = 1.0), which can be explained by the lack of training data for this particular dataset.
As claimed above, in general there is no information about albums in a gallery. Hence, an event should be assigned to all photos individually. In the next experiment the collection-level first label is directly assigned to each photo contained in both datasets, and the photo itself is simply used for event recognition, without any meta information. In addition to the baseline approach (Subsection 3.1), hierarchical agglomerative clustering of the entire testing gallery is used. Only the best results achieved by average linkage clustering of the embeddings x_t extracted by the pre-trained CNN and of the confidence scores p_t are reported. In the former case both the Euclidean (L2) and chi-squared (χ2) distances are used. As the confidence scores returned by decision_function for LinearSVC are not always non-negative, only the Euclidean distance is implemented for the confidence scores. The results are shown in Table 2.
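The clustering baseline can be sketched with scikit-learn as follows; the Euclidean metric is used here for simplicity, and the distance threshold is assumed to be tuned separately.

```python
# Sketch of the hierarchical-clustering baseline: average-linkage agglomerative clustering
# of the embeddings of the whole testing gallery with a fixed distance threshold.
from sklearn.cluster import AgglomerativeClustering

def cluster_gallery(embeddings, distance_threshold):
    clustering = AgglomerativeClustering(n_clusters=None, linkage="average",
                                         distance_threshold=distance_threshold)
    return clustering.fit_predict(embeddings)   # cluster label per photo; each cluster gets one event
```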
Table 2: Accuracy (%) of event recognition in a single image.
Figure PCTKR2020016234-appb-img-000053
Here, firstly, the accuracy of event recognition in single photos is 25-30% lower than the accuracy of the album-based classification (Table 1). Secondly, clustering of the confidence scores at the output of the best classifier does not significantly influence the overall accuracy. Thirdly, hierarchical clustering with the chi-squared distance leads to slightly more accurate results than the conventional Euclidean metric. Finally, preliminary clustering of embeddings decreases the error rate of the baseline by only 1.2-2%, even if the similarity threshold in clustering is carefully chosen.
Let us demonstrate how the assumption about sequentially ordered photos in an album can increase the accuracy of event recognition. In order to make the task more complex, the following transformation of the order of the testing photos was performed 10 times. The sequence of albums is randomly shuffled, and the photos in each album are also shuffled. In addition to the matching of confidences from decision_function of LinearSVC, their L2 normalization is performed. Moreover, the CNNs are fine-tuned using the unfolded training set X as follows. At first, the weights in the base part of the CNN were frozen and the new head (fully connected layer with C outputs and Softmax activation) was learned during 10 epochs. Next, the weights in the whole CNN were learned during 3 epochs with a 10-times lower learning rate.
The results (mean accuracy ± its standard deviation) of the proposed Algorithms 1 and 2 for PEC and ML-CUFED are presented in Table 3 and Table 4, respectively. Here the attention mechanism provides up to 8% lower error rates in most cases. It is remarkable that the matching of similarity between L2-normed confidences significantly improves the overall accuracy of the attention model for PEC (Table 3), though the present experiments did not show any improvements in conventional clustering from the previous experiment (Table 2). The fine-tuned CNNs obviously lead to the most accurate decision, but the difference (0.1-1.6%) with the best results of the pre-trained models is rather small. However, the latter do not require additional inference in existing scene recognition models, so the implementation of event recognition in an album will be very fast if the scenes should be additionally classified, e.g., for more detailed user modeling [26]. Surprisingly, computing the similarity between the confidence scores p_t of the classifiers reduces the error rate of conventional matching of the embeddings x_t by 2-7%. Let us recall that conventional clustering of embeddings was 1-2% more accurate when compared to the classifier's scores (Table 2). It seems that the threshold ρ_0 can be estimated (Algorithm 2) more reliably in this particular case, when most photos from the same event are matched in the prediction procedure (Algorithm 1). Finally, the most important conclusion is that the proposed approach has 9-20% higher accuracy when compared to the baseline. Moreover, the algorithm is 13-16% more accurate than the classification of groups of photos obtained with hierarchical clustering (Table 2).
Table 3: Accuracy (%) of the proposed approach, PEC.
Figure PCTKR2020016234-appb-img-000056
Table 4: Accuracy (%) of the proposed approach, ML-CUFED.
Figure PCTKR2020016234-appb-img-000057
Event recognition in single photos
In addition to PEC and ML-CUFED, WIDER (Web Image Dataset for Event Recognition) [35] with 50,574 photos and C = 61 event categories (parade, dancing, meeting, press conference, etc.) is examined. The standard train/test split proposed by the creators of each dataset is used. In PEC and ML-CUFED the collection-level label is directly assigned to each photo contained in the collection. Any metadata, e.g., temporal information, are completely ignored; only the photo itself is used, similarly to the paper [32].
To focus on the possibility of implementing offline event recognition on mobile devices [26] and to compare the proposed approach with conventional classifiers, the MobileNet v2 with width multiplier α = 1 [23] and Inception v4 [28] CNNs are used. At first, they are pre-trained on the Places2 dataset [38] for embedding extraction. The linear SVM classifier from the scikit-learn library was used, because it has higher accuracy than other classifiers from this library (RF, k-NN and RBF SVM). Moreover, these CNNs are fine-tuned using the given training set as follows. At first, the weights in the base part of the CNN were frozen and the new head (a fully connected layer with C outputs and Softmax activation) was trained using the ADAM optimizer (learning rate 0.001) for 10 epochs with early stopping in the Keras 2.2 framework with the TensorFlow 1.15 backend. Next, the weights of the whole CNN were trained for 5 epochs using ADAM. Finally, the CNN was trained using SGD for 3 epochs with a 10-times lower learning rate.
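The single-photo baseline described above reduces to training a linear SVM over the extracted embeddings; a minimal scikit-learn sketch follows, where the embedding matrices are assumed to be precomputed by the pre-trained CNN, and the regularization constant and iteration limit are assumptions of the example.

# Sketch of the single-photo baseline: a LinearSVC over precomputed CNN
# embeddings; hyperparameters are illustrative assumptions.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.preprocessing import normalize

def train_event_svm(train_embeddings, train_labels):
    clf = LinearSVC(C=1.0, max_iter=10000)
    clf.fit(train_embeddings, train_labels)
    return clf

def predict_events(clf, embeddings):
    # decision_function yields per-event confidence scores; their L2-normed
    # version is what the album-level matching above operates on.
    scores = clf.decision_function(embeddings)
    return scores.argmax(axis=1), normalize(scores, norm='l2')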
In addition, embeddings from object detection models that are typical for event recognition [35, 26] are used. As many photos from the same event sometimes contain identical objects (e.g., a ball in football), these objects can be detected by contemporary CNN-based methods, i.e., SSDLite [23] or Faster R-CNN [21]. These methods detect the positions of several objects in the input photo and predict the scores of each class from a predefined set of K > 1 types. A sparse K-dimensional vector containing a score for each type of object is extracted. If there are several objects of the same type, the maximal score is stored in this embedding vector [20]. This embedding vector is either classified by the linear SVM or used to train a feed-forward neural network with two hidden layers containing 32 units each. Both classifiers were trained using the training set of each event dataset. In this study, SSD with the MobileNet backbone and Faster R-CNN with the InceptionResNet backbone are examined. The models pre-trained on the Open Images Dataset v4 (K = 601 objects) were taken from the TensorFlow Object Detection Model Zoo.
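The detection-based embedding described above can be sketched as follows; the detector output format (a list of class/score pairs per photo) and the MLP hyperparameters other than the two 32-unit hidden layers are assumptions of the illustration.

# Sketch of the sparse K-dimensional detection embedding: the maximal score
# over all detected objects of each type is kept [20]. The (class_id, score)
# tuple format of the detector output is an assumed simplification.
import numpy as np
from sklearn.neural_network import MLPClassifier

def detection_embedding(detections, num_types=601):
    vec = np.zeros(num_types, dtype=np.float32)
    for class_id, score in detections:
        vec[class_id] = max(vec[class_id], score)   # best score per object type
    return vec

# Feed-forward classifier with two hidden layers of 32 units, as above.
mlp = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=500)

# Example: two persons (id 0) and a ball (id 3) found by a hypothetical detector.
x = detection_embedding([(0, 0.9), (0, 0.7), (3, 0.6)])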
The preliminary experimental study with the pre-trained image captioning models discussed in Section 2 demonstrated that the best quality on the MS COCO captioning dataset is achieved by the ARNet model [6]. Thus, in this experiment ARNet's encoder-decoder model is used. However, it can be replaced by any other image captioning technique without modification of the event recognition algorithm. The ARNet was trained on the Conceptual Captions dataset, which contains more than 3.3M photo-URL and caption pairs in the training set and about 15 thousand pairs in the validation set. The embedding extraction in the encoder is implemented with the same CNNs (Inception and MobileNet v2). The most frequent words, except the special words <START> and <END>, are extracted (the exact vocabulary size is given as a formula image in the original publication). They are classified by either the linear SVM or a feed-forward neural network with the same architecture as in the object detection case. Again, these classifiers are trained from scratch on each event training set. The weight w in the ensemble (Eq. 1) was estimated using the same set.
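Below is a minimal sketch of how a generated caption can be turned into the sparse one-hot vector classified by the textual branch (cf. claim 7); the toy vocabulary and the whitespace tokenization are assumptions of the example.

# Sketch of the sparse one-hot encoding of a generated caption: the v-th
# component equals 1 if the v-th vocabulary word occurs in the caption.
import numpy as np

def encode_caption(caption, vocabulary, special=('<START>', '<END>')):
    index = {word: i for i, word in enumerate(vocabulary)}
    vec = np.zeros(len(vocabulary), dtype=np.float32)
    for token in caption.split():
        if token in special:
            continue                      # skip the special decoder tokens
        i = index.get(token.lower())
        if i is not None:
            vec[i] = 1.0
    return vec

vocab = ['wedding', 'cake', 'people', 'beach', 'parade']   # toy vocabulary
x_txt = encode_caption('<START> people cutting a wedding cake <END>', vocab)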
The results of the lightweight mobile models (MobileNet and the SSD object detector) and the deep models (Inception and Faster R-CNN) for PEC, WIDER and ML-CUFED are presented in Tables 5, 6 and 7, respectively. The best-known results for the same experimental setups are also included.
Table 5: Event recognition accuracy (%), PEC
(The values of Table 5 are provided as an image in the original publication.)
Table 6: Event recognition accuracy (%), WIDER
(The values of Table 6 are provided as an image in the original publication.)
Certainly, the proposed recognition of photo captions is not as accurate as conventional CNN-based embeddings. However, classification of the textual descriptions is much better than random guessing, whose accuracy is 1/C for PEC, WIDER and ML-CUFED (the exact values are given as images in the original publication). It is important to emphasize that this approach has a lower error rate than classification of the embeddings based on object detection in most cases. This gain is especially noticeable for the lightweight SSD models, which are 1.5-13% less accurate than the proposed classification of photo captions due to the limited ability of SSD-based models to detect small objects (food, pets, fashion accessories, etc.). The Faster R-CNN-based detection embeddings can be classified more accurately, but the inference in the Faster R-CNN with the InceptionResNet backbone is several times slower than decoding in the ARNet (6-10 seconds vs 0.5-2 seconds on a MacBook Pro 2015).
Table 7: Event recognition accuracy (%), ML-CUFED
(The values of Table 7 are provided as an image in the original publication.)
Finally, the most appropriate way to use image captioning in event classification is to fuse it with conventional CNNs. In this case, the previous state of the art for PEC (62.2% [32]) is improved even by the lightweight models (63.38%) if the fine-tuned CNNs are used in the ensemble. The present Inception-based model is even better (accuracy 65.12%). The state-of-the-art accuracy of 53% [32] for the WIDER dataset is still not reached, although the best accuracy here (51.84%) is up to 9% higher than the best results (42.4%) from the original paper [35]. The present experimental setup for the ML-CUFED dataset is studied for the first time here, because this dataset was developed mostly for album-based event recognition.
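The fusion used for these best results is a simple weighted soft voting over the confidence scores of the visual and textual classifiers (cf. claim 14); a sketch is shown below, where the particular weight value is an assumption to be tuned on the training set.

# Sketch of the late fusion of the visual and textual branches: a weighted
# sum of their confidence scores, decided in favor of the maximal aggregated
# confidence. The default weight w is an illustrative assumption.
import numpy as np

def fuse_confidences(visual_scores, caption_scores, w=0.7):
    """Both arrays have shape (n_photos, C); w weights the visual branch."""
    aggregated = w * visual_scores + (1.0 - w) * caption_scores
    return aggregated.argmax(axis=1)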
In practice it is preferable to use a pre-trained CNN as an embedding extractor in order to avoid additional inference in a fine-tuned CNN when it differs from the encoder of the image captioning model. Unfortunately, the accuracies of the SVM for pre-trained CNN embeddings are 1.5-3% lower than those of the fine-tuned models for PEC and ML-CUFED. In this case the additional inference may be acceptable. However, the difference in error rates between pre-trained and fine-tuned models for the WIDER dataset is not significant, so the pre-trained CNNs are definitely worth using here.
The foregoing exemplary embodiments are examples and are not to be construed as limiting. In addition, the description of the exemplary embodiments is intended to be illustrative, and not to limit the scope of the claims, and many alternatives, modifications, and variations will be apparent to those skilled in the art.
REFERENCES
[1] Kashif Ahmad and Nicola Conci, 'How deep features have improved event recognition in multimedia: A survey', ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 15(2), 39 (2019).
[2] Kashif Ahmad, Nicola Conci, Giulia Boato, and Francesco GB De Natale, 'Event recognition in personal photo collections via multiple instance learning-based classification of multiple images', Journal of Electronic Imaging, 26(6), 060502, (2017).
[3] Wesam M Ashour, Riham Z Muqat, Alaaeddin B AlQazzaz, and Saeb R AbdElnabi, 'Improve basic sequential algorithm scheme using ant colony algorithm', in Proceedings of the 7th Palestinian International Conference on Electrical and Computer Engineering (PICECE), pp. 1-6. IEEE, (2019).
[4] Siham Bacha, Mohand Said Allili, and Nadjia Benblidia, 'Event recognition in photo albums using probabilistic graphical models and feature relevance', Journal of Visual Communication and Image Representation, 40, 546-558, (2016).
[5] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool, 'Event recognition in photo collections with a stopwatch HMM', in Proceedings of the International Conference on Computer Vision (ICCV), pp. 1193-1200. IEEE, (2013).
[6] Xinpeng Chen, Lin Ma, Wenhao Jiang, Jian Yao, and Wei Liu, 'Regularizing rnns for caption generation by reconstructing the past with the present', in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2018).
[7] Francois Chollet, Deep learning with Python, Manning Publications Company, 2017.
[8] Minh-Son Dao, Duc-Tien Dang-Nguyen, and Francesco GB De Natale, 'Signature-image-based event analysis for personal photo albums', in Proceedings of the 19th International Conference on Multimedia (ACM MM), pp. 1481-1484. ACM, (2011).
[9] Sergio Escalera, Junior Fabian, Pablo Pardo, Xavier Baro, Jordi Gonzalez, Hugo J Escalante, Dusan Misevic, Ulrich Steiner, and Isabelle Guyon, 'Chalearn looking at people 2015: Apparent age and cultural event recognition datasets and results', in Proceedings of the International Conference on Computer Vision Workshops (ICCVW), pp. 1-9, (2015).
[10] Ming Geng, Yukun Li, and Fenglian Liu, 'Classifying personal photo collections: an event-based approach', in Proceedings of the Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint International Conference on Web and Big Data, pp. 201-215. Springer, (2018).
[11] Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep learning, MIT Press, 2016.
[12] Ivan Grechikhin and Andrey V Savchenko, 'User modeling on mobile device based on facial clustering and object detection in photos and videos', in Proceedings of the Iberian Conference on Pattern Recognition and Image Analysis, pp. 429-440. Springer, (2019).
[13] Cong Guo, Xinmei Tian, and Tao Mei, 'Multigranular event recognition of personal photo albums', IEEE Transactions on Multimedia, 20(7), 1837-1847, (2017).
[14] MD Hossain, Ferdous Sohel, Mohd Fairuz Shiratuddin, and Hamid Laga, 'A comprehensive survey of deep learning for image captioning', ACM Computing Surveys (CSUR), 51(6), 118, (2019).
[15] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam, 'MobileNets: Efficient convolutional neural networks for mobile vision applications', arXiv preprint arXiv:1704.04861, (2017).
[16] Dmitry Kuzovkin, Tania Pouli, Olivier Le Meur, Remi Cozot, Jonathan Kervec, and Kadi Bouatouch, 'Context in photo albums: Understanding and modeling user behavior in clustering and selection', ACM Transactions on Applied Perception (TAP), 16(2), 11, (2019).
[17] Stefan Lonn, Petia Radeva, and Mariella Dimiccoli, 'Smartphone picture organization: A hierarchical approach', Computer Vision and Image Understanding, 187, 102789, (2019).
[18] Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh, 'Neural baby talk', in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2018).
[19] Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, and Alan L. Yuille, 'Deep captioning with multimodal recurrent neural networks (m-RNN)', in Proceedings of the International Conference on Learning Representations (ICLR), (2015).
[20] Alexandr Rassadin and Andrey Savchenko, 'Scene recognition in user preference prediction based on classification of deep embeddings and object detection', in Proceedings of the International Symposium on Neural Networks (ISNN), volume 11555, pp. 422-430. Springer, (2019).
[21] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, 'Faster R-CNN: Towards real-time object detection with region proposal networks', in Advances in Neural Information Processing Systems (NIPS), pp. 91-99, (2015).
[22] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen, 'MobilenetV2: Inverted residuals and linear bottlenecks', in Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4510-4520, (2018).
[23] Mark Sandler, Andrew G. Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen, 'Mobilenetv2: Inverted residuals and linear bottlenecks', in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 4510-4520, (2018).
[24] Andrey V. Savchenko, 'Efficient facial representations for age, gender and identity recognition in organizing photo albums using multi-output ConvNet', PeerJ Computer Science, 5(e197), (2019).
[25] Andrey V. Savchenko, 'Sequential three-way decisions in multi-category image recognition with deep features based on distance factor', Information Sciences, 489, 18-36, (2019).
[26] Andrey V Savchenko, Kirill V Demochkin, and Ivan S Grechikhin, 'User preference prediction in visual data on mobile devices', arXiv preprint arXiv:1907.04519, (2019).
[27] Anastasiia D Sokolova, Angelina S Kharchevnikova, and Andrey V Savchenko, 'Organizing multimedia data in video surveillance systems based on face verification with convolutional neural networks', in Proceedings of the International Conference on Analysis of Images, Social Networks and Texts (AIST), pp. 223-230. Springer, (2017).
[28] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alex A. Alemi, 'Inception-v4, Inception-ResNet and the impact of residual connections on learning', in Proceedings of the International Conference on Learning Representations (ICLR) Workshop, (2016).
[29] Shen-Fu Tsai, Thomas S Huang, and Feng Tang, 'Album-based object-centric event recognition', in Proceedings of the International Conference on Multimedia and Expo, pp. 1-6. IEEE, (2011).
[30] Nivetha Vijayaraju, 'Image retrieval using image captioning', Master's Projects, 687, (2019).
[31] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, 'Show and tell: Lessons learned from the 2015 mscoco image captioning challenge', IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4), 652-663, (2017).
[32] Limin Wang, Zhe Wang, Yu Qiao, and Luc Van Gool, 'Transferring deep object and scene representations for event recognition in still images', International Journal of Computer Vision, 126(2-4), 390-409, (2018).
[33] Yufei Wang, Zhe Lin, Xiaohui Shen, Radomir Mech, Gavin Miller, and Garrison W Cottrell, 'Recognizing and curating photo albums via event-specific image importance', in Proceedings of British Conference on Machine Vision (BMVC), (2017).
[34] Zifeng Wu, Yongzhen Huang, and Liang Wang, 'Learning representative deep features for image set analysis', IEEE Transactions on Multimedia, 17(11), 1960-1968, (2015).
[35] Yuanjun Xiong, Kai Zhu, Dahua Lin, and Xiaoou Tang, 'Recognize complex events from static images by fusing deep channels', in Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1600-1609, (2015).
[36] Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio, 'Show, attend and tell: Neural image caption generation with visual attention', in Proceedings of the International Conference on International Conference on Machine Learning (ICML), pp. 2048-2057, (2015).
[37] J. Yang, P. Ren, D. Zhang, D. Chen, F. Wen, H. Li, and G. Hua, 'Neural aggregation network for video face recognition', in Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5216-5225. IEEE, (2017).
[38] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba, 'Places: A 10 million image database for scene recognition', IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6), 1452-1464, (2018).

Claims (14)

  1. Method for event recognition on photos with automatic album detection, comprising:
    automatically assigning borders of the albums based on the visual content of consecutive photos in the gallery;
    grouping consecutive photos from a gallery into albums;
    computing a descriptor of each album with an attention mechanism from the neural aggregation module;
    recognizing tag of event type of each album by feeding its descriptor into a classifier;
    recognizing event on photos in the gallery by assigning corresponding tag of event type to all photos in each album.
  2. Method according to claim 1, wherein automatic assigning of borders of the albums includes the following steps:
    estimating similarity between confidences of each pair of consecutive photos in a gallery;
    if the similarity does not exceed a certain threshold, then both photos are included in the same album.
  3. Method according to claim 2, wherein the similarity between photos is computed as a distance between confidences for each type of event estimated based on the classifying each photo.
  4. Method according to claim 2, wherein the similarity between photos is computed as a sum of similarities between confidences for each type of event estimated based on the classifying each photo and the similarity between their locations, if location information is available in EXIF (Exchangeable Image File Format) data of these photos.
  5. Method according to claim 2, wherein the certain threshold is automatically computed during the preliminary training procedure by using available training set of albums.
  6. Method for event recognition on photo with its representation in text domain, comprising:
    computing embeddings of a photo by feeding its RGB (Red-Green-Blue) representation into a convolutional neural network;
    using image captioning to generate textual descriptions of the photo based on its embeddings;
    encoding the generated caption;
    feeding the generated textual descriptions to the input of event classifier;
    recognizing event by assigning the output of event classifier to the corresponding photo.
  7. Method according to claim 6, wherein the generated caption is encoded as a sparse vector of one-hot encoding of textual description of the photo: the v-th component of the vector is equal to 1 only if at least one word in the generated caption is equal to the v-th word from the vocabulary.
  8. Method according to claim 7, further comprising combining of the classifier output for generated textual descriptions and traditional classifier of embeddings in an ensemble to improve event recognition accuracy.
  9. Method according to claim 8, wherein the outputs of classifiers are obtained as follows:
    computing confidence scores of all event types for a photo by feeding its embeddings into any classifier;
    computing confidence scores of all event types for a photo by feeding a sparse vector of one-hot encoded textual description into any classifier;
    combining the outputs of classifiers for embeddings and texts in a simple voting with soft aggregation.
  10. Method according to claim 6, wherein textual description of the photo is generated by:
    feeding extracted embeddings and a sequence of previously generated words into RNN (Recurrent Neural Network) in order to predict the next word in the textual description of a photo;
    mapping the numbers generated by the RNN into real words from a vocabulary;
    selecting a subset of vocabulary by choosing top most frequently occurring words in the training data with optional exclusion of stop words.
  11. Method according to claim 10, further comprising appending predicted word into this sequence of previously generated words and feeding extracted visual embeddings and this sequence into the same RNN.
  12. Method according to claim 10, wherein training set is a set of photos with known type of event.
  13. Method according to claim 8, wherein the traditional classifier should be preliminarily trained to recognize events in a training set, where each photo from the training set is represented with its own extracted visual embeddings.
  14. Method according to claim 9, wherein the aggregated confidences are computed as a weighted sum of the confidence scores, and the decision is taken in favor of the class with the maximal aggregated confidence.
PCT/KR2020/016234 2019-11-22 2020-11-18 Event recognition on photos with automatic album detection WO2021101231A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
RU2019137585 2019-11-22
RU2019137585 2019-11-22
RU2020112953 2020-04-06
RU2020112953A RU2742602C1 (en) 2020-04-06 2020-04-06 Recognition of events on photographs with automatic selection of albums

Publications (1)

Publication Number Publication Date
WO2021101231A1 true WO2021101231A1 (en) 2021-05-27

Family

ID=75980886

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2020/016234 WO2021101231A1 (en) 2019-11-22 2020-11-18 Event recognition on photos with automatic album detection

Country Status (1)

Country Link
WO (1) WO2021101231A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140153837A1 (en) * 2011-02-18 2014-06-05 Google Inc. Automatic event recognition and cross-user photo clustering
KR20170015408A (en) * 2014-03-21 2017-02-08 삼성전자주식회사 Method for collection multimedia information and device thereof
US20170085774A1 (en) * 2015-09-17 2017-03-23 Qualcomm Incorporated Managing crowd sourced photography in a wireless network
US20180157743A1 (en) * 2016-12-07 2018-06-07 Mitsubishi Electric Research Laboratories, Inc. Method and System for Multi-Label Classification
US20190340469A1 (en) * 2017-03-20 2019-11-07 Intel Corporation Topic-guided model for image captioning system

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114041800A (en) * 2021-10-21 2022-02-15 吉林大学 Electrocardiosignal real-time classification method and device and readable storage medium
CN114041800B (en) * 2021-10-21 2024-01-30 吉林大学 Electrocardiosignal real-time classification method and device and readable storage medium
CN115049844A (en) * 2022-06-29 2022-09-13 厦门大学 Image description generation method for enhancing visual information flow
CN116431855A (en) * 2023-06-13 2023-07-14 荣耀终端有限公司 Image retrieval method and related equipment
CN116431855B (en) * 2023-06-13 2023-10-20 荣耀终端有限公司 Image retrieval method and related equipment
CN117392483A (en) * 2023-12-06 2024-01-12 山东大学 Album classification model training acceleration method, system and medium based on reinforcement learning
CN117392483B (en) * 2023-12-06 2024-02-23 山东大学 Album classification model training acceleration method, system and medium based on reinforcement learning

Similar Documents

Publication Publication Date Title
WO2021101231A1 (en) Event recognition on photos with automatic album detection
CN109117777B (en) Method and device for generating information
US9542419B1 (en) Computer-implemented method for performing similarity searches
US9171013B2 (en) System and method for providing objectified image renderings using recognition information from images
US8897505B2 (en) System and method for enabling the use of captured images through recognition
US20110085739A1 (en) System and method for similarity search of images
US20060251292A1 (en) System and method for recognizing objects from images and identifying relevancy amongst images and information
Mehta et al. Face detection and tagging using deep learning
Wang et al. CLARE: A joint approach to label classification and tag recommendation
CN113836992B (en) Label identification method, label identification model training method, device and equipment
Spampinato et al. Fine-grained object recognition in underwater visual data
JP2014197412A (en) System and method for similarity search of images
Tang et al. Weakly-supervised part-attention and mentored networks for vehicle re-identification
Savchenko Event recognition with automatic album detection based on sequential grouping of confidence scores and neural attention
Savchenko Event recognition with automatic album detection based on sequential processing, neural attention and image captioning
CN112215252B (en) Weak supervision target detection method based on-line difficult sample mining
RU2742602C1 (en) Recognition of events on photographs with automatic selection of albums
Ali et al. A hybrid deep neural network for Urdu text recognition in natural images
Abed et al. Face retrieval in videos using face quality assessment and convolution neural networks
Shambharkar et al. Automatic face recognition and finding occurrence of actors in movies
Pinge et al. A novel video retrieval method based on object detection using deep learning
Rajasree et al. Deep Learning Approach for COVID-19 Meme Categorization
Kesharwani et al. Automated Attendance System Using Computer Vision
Freytag et al. Interactive image retrieval for biodiversity research
DATA AN INTELLIGENT APPROACH FOR EFFECTIVE RETRIEVAL OF CONTENT FROM LARGE DATA SETS BASED ON BIVARIATE GENERALIZED GAMMA MIXTURE MODEL

Legal Events

121: EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 20889483; Country of ref document: EP; Kind code of ref document: A1)
NENP: Non-entry into the national phase (Ref country code: DE)
122: EP: PCT application non-entry in European phase (Ref document number: 20889483; Country of ref document: EP; Kind code of ref document: A1)