WO2022074017A1 - A computer-implemented method for estimating the popularity of an input image - Google Patents


Info

Publication number
WO2022074017A1
WO2022074017A1 · PCT/EP2021/077473 · EP2021077473W
Authority
WO
WIPO (PCT)
Prior art keywords
image
training
details
input
neural network
Prior art date
Application number
PCT/EP2021/077473
Other languages
French (fr)
Inventor
Fabrizio MALFANTI
Gabriele TORRE
Original Assignee
Kellify S.P.A.
Priority date
Filing date
Publication date
Application filed by Kellify S.P.A. filed Critical Kellify S.P.A.
Publication of WO2022074017A1 publication Critical patent/WO2022074017A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]

Definitions

  • the present invention relates to a computer-implemented method for estimating the popularity of digital images.
  • a method for determining the aesthetic quality of an image is described in US 20120269425.
  • the method includes extracting a set of local features from the image, such as gradient and/or color features and generating an image representation which describes the distribution of the local features.
  • a classifier system is used for determining an aesthetic quality of the image based on the computed image representation.
  • the problem underlying the present invention is that of providing a computer-implemented method which is functionally designed so as to at least partly remedy at least one of the disadvantages encountered with reference to the cited prior art.
  • an aim of the invention is to provide a method which is capable of providing a meaningful assessment of the popularity of an image.
  • a further aim of the invention is to provide a method which can be easily implemented by a computer.
  • a computer-implemented method comprising: training a neural network based classifier system; receiving an input image; extracting from the input image a set of numerical features; and, with the neural network based classifier system, determining the input image popularity based on the numerical features.
  • the computer-implemented method of the invention provides a suitable solution to the problem of estimating the popularity of an input image.
  • training the classifier system is the first step of the method, followed by a run-time step comprising the remaining operations mentioned above, which are executed when the method is in use once the classifier has been trained.
  • neural network based classifier system indicates an artificial neural network which is trained to determine the popularity of the input image based on a set of learned numerical features extracted from the input image.
  • a numerical feature refers to an identifiable trait or aesthetical feature of a digital image.
  • a numerical feature can refer to an image attribute including, but not limited to, detected corners, detected edges, color distribution, average relative luminance, details density along predetermined directions, and so on.
  • the input image popularity is closely related to these image attributes.
  • the neural network based classifier system can be trained so as to identify an image popularity using an image representation, rather than the actual image to be classified.
  • the method according to the invention can focus on selected specific features of the image which have been identified as most representative for the assessment of the popularity of an image.
  • the set of numerical features used for training the classifier system and for determining the input image popularity comprises one or more numerical features selected from:
  • Preferably, the set of numerical features used for training the classifier system and for determining the input image popularity includes at least all of the above-listed features.
  • training the classifier system comprises receiving a set of training images, wherein each of said training images has a known image popularity value, said image popularity value indicating the likelihood of an image being popular among viewers of the training images.
  • the term "social network system” includes both web based social networks and physical social networks, i.e. groups of individuals or organizations that are connected by shared interests, activities, etc.
  • the training further comprises for each training image: extracting a set of numerical features from the training image; and generating an image representation according to the numerical features extracted from the training image.
  • the expression "generating an image representation according to the numerical features extracted from the training image” refers preferably to image annotation reflecting low- and/or high-level recognition results, such as local descriptors or recognized objects.
  • the numerical features extracted from the training images correspond to the features extracted from the input image.
  • the training further comprises training the classifier system with the training image representations and corresponding training image popularity values.
  • an "image representation” refers to a synthetic image which describes the distribution of a set of numerical features extracted from a training image.
  • image popularity value preferably indicates a binarized value which is assigned to a training image based on at least one popularity score which has been manually-assigned to said training image. Further preferably, in the event of a plurality of manually-assigned image popularity scores for the training image, the popularity value is computed by averaging the plurality of manually-assigned image popularity scores. As will be appreciated, the image popularity score indicates the likelihood of an image being popular among its viewers. It should be noted that, in some embodiments, the popularity score may be assigned to the training image by a machine rather than manually.
  • extracting the set of numerical features from the training image may comprise defining a set of regions of the training image and, for each of said regions, generating a local descriptor based on low level features of pixels in the region.
  • the image representation comprises an aggregation of the local descriptors.
  • generating the image representation comprises one or more operations chosen from the following operations.
  • generating the image representation may comprise the operation of assigning to the image representation an average color of the training image.
  • generating the image representation may comprise the operation of partitioning the image representation into a set of areas which have lightness levels proportional to an average details density of the training image regions which correspond to said areas.
  • the set of areas comprises exactly two areas, one framed with respect to the other.
  • the lightness levels are proportional to the average details density along the horizontal and/or vertical and/or diagonal direction(s) of the training image regions which correspond to said areas.
  • generating the image representation may comprise the operation of representing each of a set of objects of the training image by superimposing on a background of the image representation a circle with an area proportional to the object area and with a color corresponding to the object predominant color.
  • the set of objects consists of the four largest objects of the training image.
  • generating the image representation may comprise the operation of representing the least and the most detailed spots of the training image by superimposing on a background of the image representation a first circle and a second circle having two different colors, respectively.
  • the circles may be replaced by squares or by any other shapes which serve to the purpose of representing an object or a spot on the background of an image.
  • the average details density extracted from each training or input image comprises one or more percentages chosen from : a percentage of details of the image versus a smoothed background of the image; a percentage of details along a horizontal direction of the image versus total details in the image; a percentage of details along a vertical direction of the image versus total details in the image; a percentage of details along a diagonal direction of the image versus total details in the image; a percentage of details in a center of the image versus details in a border of the image; a percentage of details along the horizontal direction in the center versus details along the horizontal direction in the border; a percentage of details along the vertical direction in the center versus details along the vertical direction in the border; and a percentage of details along the diagonal direction in the center versus details along the diagonal direction in the border.
  • the classifier system comprises a Siamese neural network.
  • a Siamese neural network refers to an artificial neural network that uses the same weights while working in tandem on two different input vectors to compute comparable output vectors. It should be noted that feeding the Siamese neural network with the image representations and popularity values of the training images makes it possible to train the classifier to accurately estimate the public appreciation of a given image by understanding the driver features that make the image popular.
  • FIG. 1 is a functional block diagram of inputs to an exemplary classifier system during training and in use;
  • FIG. 2 is a flow diagram illustrating a method for image popularity assessment in accordance with one aspect of the exemplary embodiment.
  • a method for estimating the popularity of an input image according to the present invention is based on a neural network based classifier system.
  • the classifier system is trained in a first step of the method so that the trained classifier system can be used for assessing the popularity of the input image.
  • the classifier system of the invention does not apply neural networks directly on the relevant image, i.e. on the image in the form that will be published in the social network, but it uses an image representation which is representative of the published image, as will be described in the following.
  • Each of said training images has a known image popularity value which represents the likelihood of being liked by, or more generally being popular among, the users of a social network system.
  • Such popularity value can be either defined manually, e.g. by counting the actual "likes” received from users, or determined in another manner, including by other known automatic image popularity assessment methods.
  • the collected images are subjected to feature extraction and, preferably, to image processing.
  • image corner detection is performed. This processing step is aimed at computing the ratio between the number of detected corners and the number of pixels of a given image. This process is preferably based on the algorithm proposed by Chris Harris & Mike Stephens in their 1988 paper and implemented in the OpenCV function cv2.cornerHarris().
  • the process preferably identifies the image coordinates which correspond to an intensity variation for a displacement in all the image directions. It will be appreciated that the Harris-Stephens corner detection method depends directly on two functions: the window function which locally weights the pixel intensity and the Sobel Derivative which is needed for computing the image intensity derivatives.
  • the Harris detector free parameter can be set to 0.04.
  • corners may be identified as pixels with a value greater than 1e-4.
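The corner-density feature described above can be sketched in pure Python. This is an illustrative, self-contained Harris-Stephens response computation, not the patent's actual implementation (which relies on OpenCV's cv2.cornerHarris()); the flat 3x3 averaging window, the Sobel kernels, the normalization of the response to its maximum before thresholding, and the synthetic test image are assumptions made for the example.

```python
def harris_corner_ratio(img, k=0.04, threshold=1e-4):
    """Return (#corner pixels) / (#pixels) for a 2D grayscale image."""
    h, w = len(img), len(img[0])
    sx = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # Sobel x kernel
    sy = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # Sobel y kernel

    def conv(kernel, y, x):
        return sum(kernel[dy + 1][dx + 1] * img[y + dy][x + dx]
                   for dy in (-1, 0, 1) for dx in (-1, 0, 1))

    # Image intensity derivatives on the interior pixels
    ix = [[conv(sx, y, x) for x in range(1, w - 1)] for y in range(1, h - 1)]
    iy = [[conv(sy, y, x) for x in range(1, w - 1)] for y in range(1, h - 1)]

    hh, ww = len(ix), len(ix[0])
    responses = []
    for y in range(1, hh - 1):
        for x in range(1, ww - 1):
            # Structure tensor summed over a flat 3x3 window
            sxx = syy = sxy = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    gx, gy = ix[y + dy][x + dx], iy[y + dy][x + dx]
                    sxx += gx * gx
                    syy += gy * gy
                    sxy += gx * gy
            # Harris response R = det(M) - k * trace(M)^2, free parameter k
            responses.append(sxx * syy - sxy * sxy - k * (sxx + syy) ** 2)

    # Corners: response above 1e-4 of the peak response (an assumption
    # about how the patent's threshold is applied)
    peak = max(responses) or 1.0
    corners = sum(1 for r in responses if r / peak > threshold)
    return corners / (h * w)

# A white square on a black background has four strong corners
img = [[1.0 if 4 <= y < 12 and 4 <= x < 12 else 0.0 for x in range(16)]
       for y in range(16)]
ratio = harris_corner_ratio(img)
```

The returned ratio is the corners-to-pixels feature fed to the classifier.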
  • Extraction of features may also comprise an edge detection processing.
  • This processing is aimed at estimating the image area associated with object edges.
  • the ratio between the total number of pixels defined as edges by the edge detection algorithm and the total number of image pixels can be accordingly calculated.
  • the process preferably includes application of the edge detection algorithm proposed by John F. Canny in 1986 and implemented in the OpenCV function cv2.Canny().
  • This implementation takes two input parameters, minVal and maxVal, which are in preferred embodiments respectively set to 50 and 100.
  • a further parameter of the above algorithm is the aperture_size of the Sobel kernel. This parameter can be kept equal to its default value.
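The edge-density feature (edge pixels over total pixels) can be sketched as follows. Since the patent relies on OpenCV's cv2.Canny() with minVal=50 and maxVal=100, a simplified Sobel gradient-magnitude threshold stands in for Canny here so that the sketch stays self-contained; the threshold value and the synthetic test image are assumptions.

```python
def edge_pixel_ratio(img, threshold=1.0):
    """Ratio of edge pixels to total pixels, using a simple Sobel
    gradient-magnitude detector as a stand-in for Canny."""
    h, w = len(img), len(img[0])
    sx = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
    sy = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
    edges = 0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = sum(sx[dy + 1][dx + 1] * img[y + dy][x + dx]
                     for dy in (-1, 0, 1) for dx in (-1, 0, 1))
            gy = sum(sy[dy + 1][dx + 1] * img[y + dy][x + dx]
                     for dy in (-1, 0, 1) for dx in (-1, 0, 1))
            if (gx * gx + gy * gy) ** 0.5 > threshold:
                edges += 1
    # Denominator: total image pixels, as described in the text
    return edges / (h * w)

# A vertical step edge: two columns of pixels register as edges
img = [[1.0 if x >= 8 else 0.0 for x in range(16)] for y in range(16)]
ratio = edge_pixel_ratio(img)
```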
  • the features are extracted by applying a process of estimating the color distribution of the image.
  • the color distribution is preferably estimated on a palette based on only 8 color channels, i.e. a Web palette.
  • the feature obtained according to this processing includes the percentage of the total pixel number associated with every channel.
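The color-distribution feature can be sketched by quantizing each pixel to the nearest of 8 palette colors and reporting the percentage of pixels per channel. Reading "Web palette" with 8 channels as the corners of the RGB cube is an assumption made for this example.

```python
# Assumed 8-colour palette: the corners of the RGB cube
PALETTE = [(r, g, b) for r in (0, 255) for g in (0, 255) for b in (0, 255)]

def color_distribution(pixels):
    """Percentage of pixels assigned to each of the 8 palette channels."""
    counts = {c: 0 for c in PALETTE}
    for (r, g, b) in pixels:
        # Nearest palette colour by squared Euclidean distance in RGB
        nearest = min(PALETTE,
                      key=lambda c: (c[0] - r) ** 2 + (c[1] - g) ** 2
                                    + (c[2] - b) ** 2)
        counts[nearest] += 1
    n = len(pixels)
    return {c: 100.0 * counts[c] / n for c in PALETTE}

# Three reddish pixels and one bluish pixel
pixels = [(250, 10, 5), (240, 20, 10), (10, 10, 250), (200, 0, 0)]
dist = color_distribution(pixels)
```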
  • an average relative luminance value of the image can be determined and included among the extracted features.
  • An average luminance level of the image can therefore be computed as the mean of the per-pixel relative luminance values.
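The luminance formula referenced above did not survive in this text; the sketch below assumes the standard ITU-R BT.709 relative luminance weighting, which is one common choice, and averages it over all pixels.

```python
def average_relative_luminance(pixels):
    """Mean of Y = 0.2126 R + 0.7152 G + 0.0722 B over all pixels
    (ITU-R BT.709 weights — an assumption about the formula used)."""
    total = sum(0.2126 * r + 0.7152 * g + 0.0722 * b
                for (r, g, b) in pixels)
    return total / len(pixels)

# One white and one black pixel average to mid luminance
lum = average_relative_luminance([(255, 255, 255), (0, 0, 0)])
```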
  • extraction processes include applying a 2D Discrete Wavelets Transform (DWT) to the image.
  • DWT Discrete Wavelets Transform
  • a single step of DWT makes it possible to assess the quantity of details at a small scale in the horizontal, vertical and diagonal directions.
  • the 2D Discrete Wavelets Transform (DWT) is applied on a grayscale image.
  • the process may include splitting the image into a central core and an external border, making it possible to recognize where details are distributed, in the horizontal, vertical or diagonal direction.
  • the external border is defined as 28/64 of the image area, i.e. it is a border with a width of 1/8 of the picture dimension.
  • the 2D Discrete Wavelet Transform makes it possible to extract one or more of the following features: percentage of details versus smoothed background, percentage of horizontal details vs total details, percentage of vertical details vs total details, percentage of diagonal details vs total details, percentage of details in the center versus details in the border, percentage of horizontal details in the center versus horizontal details in the border, percentage of vertical details in the center versus vertical details in the border, and percentage of diagonal details in the center versus diagonal details in the border.
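The detail-density percentages listed above can be illustrated with a single level of a 2D Haar wavelet transform. The wavelet family is not named in this text, so Haar is an assumption, as is treating sub-band energy as the "quantity of details" and variation along x as horizontal-direction details.

```python
def haar_dwt2(img):
    """One step of a 2D Haar DWT: return (LL, LH, HL, HH) sub-bands
    for an even-sized grayscale image (nested lists)."""
    h, w = len(img), len(img[0])
    ll, lh, hl, hh = [], [], [], []
    for y in range(0, h, 2):
        rll, rlh, rhl, rhh = [], [], [], []
        for x in range(0, w, 2):
            a, b = img[y][x], img[y][x + 1]
            c, d = img[y + 1][x], img[y + 1][x + 1]
            rll.append((a + b + c + d) / 4)   # approximation (background)
            rlh.append((a - b + c - d) / 4)   # differences along x
            rhl.append((a + b - c - d) / 4)   # differences along y
            rhh.append((a - b - c + d) / 4)   # diagonal differences
        ll.append(rll); lh.append(rlh); hl.append(rhl); hh.append(rhh)
    return ll, lh, hl, hh

def energy(band):
    return sum(v * v for row in band for v in row)

# Alternating columns: all detail energy lies in the along-x sub-band
img = [[float(x % 2) for x in range(8)] for y in range(8)]
ll, lh, hl, hh = haar_dwt2(img)
e_lh, e_hl, e_hh = energy(lh), energy(hl), energy(hh)
details = e_lh + e_hl + e_hh
pct_horizontal = 100.0 * e_lh / details  # % of horizontal-direction details
```

Splitting each sub-band into the central core and the 1/8-wide border before summing energies yields the center-versus-border percentages in the same way.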
  • a further process of features extraction includes transforming the image to encode the objects' color distribution within the image field of view.
  • this transformation is based on the Satoshi Suzuki et al. (1985) technique, which identifies every foreground object by its closed border and its predominant color. See Satoshi Suzuki et al., "Topological structural analysis of digitized binary images by border following", Computer Vision, Graphics, and Image Processing, Volume 30, Issue 1, April 1985, Pages 32-46.
  • each object is represented by a circle with the same area as the object and centered in its center of gravity.
  • circles characterized by similar color properties and location are aggregated.
  • two circles are aggregated when their distance is smaller than twice the radius of the largest circle and when a color distance between the two circles is smaller than 33.
  • a "color distance” refers to a distance in the color domain which is defined as Dc = √((R₁ − R₂)² + (G₁ − G₂)² + (B₁ − B₂)²), wherein R, G and B indicate the levels of, respectively, red, green and blue which are the closest to the original levels in the web safe color coding.
  • this process makes it possible to extract one or more of the following features: one or more circles, the radius of each circle, the coordinates of the center of each circle, and the color of each circle.
  • the circles of different colors may be ordered by radius.
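The aggregation rule stated above can be written directly: two circles merge when their center distance is below twice the larger radius and their color distance Dc is below 33. The circle representation as ((x, y), radius, (r, g, b)) tuples and the example values are illustrative.

```python
import math

def color_distance(c1, c2):
    """Dc = sqrt((R1-R2)^2 + (G1-G2)^2 + (B1-B2)^2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))

def should_aggregate(circle1, circle2):
    """Each circle is ((x, y), radius, (r, g, b))."""
    (x1, y1), r1, col1 = circle1
    (x2, y2), r2, col2 = circle2
    dist = math.hypot(x1 - x2, y1 - y2)
    # Merge when close in space AND similar in colour
    return dist < 2 * max(r1, r2) and color_distance(col1, col2) < 33

a = ((10, 10), 8, (204, 0, 0))
b = ((20, 12), 5, (204, 0, 20))   # nearby, similar red: aggregated
c = ((90, 90), 5, (0, 0, 204))    # far away, blue: kept separate
```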
  • extraction processes include using a Wavelet Transform to perform a two level DWT.
  • the process may include summing the three layers of details (horizontal, vertical and diagonal) and convolving the resulting image with a 2D step function with an area which may be equal to substantially 1% of the entire image.
  • the two spots associated to the maximum and minimum detail presence are then identified.
  • the features extracted by this process include two spots with the maximum and the minimum number of details.
  • the two spots are rectangular in shape.
  • the spots are exactly two and, in the event that more than one spot is identified in one of the two categories, the process maintains the spot which has been detected first.
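The spot-detection step above can be sketched as a box-filter scan: the summed detail layers are convolved with a 2D step (box) function covering roughly 1% of the image, and the positions of the maximum and minimum responses are kept, with ties resolved by keeping the first spot found, as stated above. The box size and the toy detail map are illustrative.

```python
def detail_spots(detail, box):
    """Return ((y, x) of max, (y, x) of min) box-summed detail positions.
    `detail` is the sum of the horizontal, vertical and diagonal layers."""
    h, w = len(detail), len(detail[0])
    best_max = best_min = None
    for y in range(h - box + 1):
        for x in range(w - box + 1):
            s = sum(detail[y + dy][x + dx]
                    for dy in range(box) for dx in range(box))
            # Strict comparisons keep the first spot found on ties
            if best_max is None or s > best_max[0]:
                best_max = (s, (y, x))
            if best_min is None or s < best_min[0]:
                best_min = (s, (y, x))
    return best_max[1], best_min[1]

# Detail map with all detail concentrated in the bottom-right corner
detail = [[1.0 if y >= 8 and x >= 8 else 0.0 for x in range(10)]
          for y in range(10)]
max_spot, min_spot = detail_spots(detail, box=2)
```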
  • Extraction of features may also comprise a 2D Fourier Transform of the image.
  • this process detects frequencies in the horizontal and/or vertical and/or diagonal directions.
  • this process makes it possible to extract the first eight frequencies in the horizontal and vertical directions. Preferably, these frequencies are ordered descending by amplitude.
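The frequency features can be illustrated with a small discrete Fourier transform. Reading horizontal frequencies along the first row of the 2D spectrum and vertical ones along the first column, and keeping the eight largest non-DC amplitudes in descending order, are assumptions about how the description above is realized; the tiny input is illustrative.

```python
import cmath

def dft2(img):
    """Naive 2D DFT of a small grayscale image (nested lists)."""
    h, w = len(img), len(img[0])
    return [[sum(img[y][x] *
                 cmath.exp(-2j * cmath.pi * (u * y / h + v * x / w))
                 for y in range(h) for x in range(w))
             for v in range(w)] for u in range(h)]

def top_frequencies(spectrum, axis, n=8):
    """Amplitudes of non-DC frequencies along one axis, largest first."""
    if axis == "horizontal":
        amps = [abs(spectrum[0][v]) for v in range(1, len(spectrum[0]))]
    else:
        amps = [abs(spectrum[u][0]) for u in range(1, len(spectrum))]
    return sorted(amps, reverse=True)[:n]

# Alternating columns: all energy at the horizontal Nyquist frequency
img = [[float(x % 2) for x in range(8)] for y in range(8)]
spec = dft2(img)
horiz = top_frequencies(spec, "horizontal")
vert = top_frequencies(spec, "vertical")
```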
  • a further process includes extracting from the image a set of hidden numerical features related to the content of the image and its arrangement within the field of view.
  • ResNet50 is a neural network pre-trained for object recognition on the ImageNet dataset.
  • the features computed during this image processing step are organized in four sets of numerical values (preferably 2048 numerical values), respectively computed as the output of the -35, -55, -75 and the last layers of the network. Those features are combined by a fully connected layer (preferably a fully connected layer of 2048 output values) with a ReLU activation function.
  • training images can be subjected to further processing directed to determine intrinsic features thereof.
  • processing of the training image include an assessment of the image aesthetic properties.
  • the aesthetic properties of an image may be advantageously determined in terms of the numeric features.
  • such aesthetic properties are computed by means of the previously discussed ResNet50 Network.
  • the features computed with the ResNet50 Network are given as input to a neural network.
  • the neural network is composed of two layers: the first made of 64 neural units equipped with a ReLU activation function and a drop-out ratio of 0.2, and the second made of a single neural unit with linear activation.
  • This network may be trained on further images, as pre-collected and annotated image datasets are available in the art.
  • training may occur by means of the AVA dataset, which provides a set of more than 250000 images together with a distribution of scores for each image that synthesizes the aesthetic judgements of hundreds of amateur and professional photographers.
  • training image processing may include an assessment of the image quality. Evaluation of the visual quality of the training images is preferably obtained by means of a Convolutional Neural Network (CNN).
  • the inputs of the network may consist of non-overlapping 32 x 32 patches from a grayscale image, to which contrast normalization is applied.
  • the Convolutional Neural Network may predict the quality score for each patch and average these scores to obtain a quality estimation for the overall image.
  • the used network consists of five layers.
  • the network includes a first layer which is a convolutional layer which filters the input with 50 kernels each of size 7 x 7 with a stride of 1 pixel.
  • the convolutional layer produces 50 feature maps each of size 26 x 26, followed by a pooling operation that reduces each feature map to one max and one min.
  • the network may also include two fully connected layers of 800 nodes each coming after the pooling, equipped with a ReLU activation function and a drop-out ratio of 0.5.
  • the last layer is a simple linear regression with a one dimensional output that gives the score.
  • training of the network may occur by using images available in the art.
  • the network may be trained by using the TID2013 dataset which contains 25 reference images and 3000 distorted images.
  • each image may be associated with a Mean Opinion Score (MOS) in the range [0, 9], where higher MOS indicates higher quality.
  • processing of the training images can be also used for determining an intrinsic popularity of the training image.
  • since the training images are supposed to be used for training the neural network based classifier system, it is possible to use a set of images which has been subjected to assessment of its popularity according to other methods known in the art.
  • processing of the images for determining their intrinsic popularity includes using a Siamese neural network.
  • An example of such a Siamese network is disclosed in Ding et al., "Intrinsic Image Popularity Assessment”, published for Conference'19, October 2019, Nice, France.
  • a pairwise learning-to-rank approach is used in the Siamese neural network.
  • the inputs include two RGB images with high and low intrinsic popularity scores respectively.
  • the Siamese network includes two network streams.
  • the architectures of the two streams are the same, and their weights are shared during the training phase.
  • the network may be a modified version of the previously discussed ResNet50 network by replacing the last layer with a fully connected layer of one output, which represents the predicted intrinsic popularity score.
  • the predicted score difference of the two images is converted to a probability using a logistic function, which is compared with the ground-truth binary label (1 if the first image is more popular than the second one, 0 otherwise).
  • the images used in the Siamese network are rescaled to 256, from which a 224 x 224 x 3 sub-image is randomly cropped.
  • the training of the Siamese network may be carried out on a custom collection of social media posts, by optimizing the cross entropy function.
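The pairwise objective described above can be written out directly: the two streams' predicted scores are differenced, squashed with a logistic function into the probability that the first image is more popular, and compared with the binary ground-truth label via cross entropy. The toy scores below are illustrative, not network outputs.

```python
import math

def pairwise_probability(score_a, score_b):
    """Logistic of the score difference: P(image A more popular than B)."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

def cross_entropy(prob, label):
    """label is 1 if the first image is more popular, 0 otherwise."""
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))

p = pairwise_probability(2.0, -1.0)   # stream A scores higher
loss = cross_entropy(p, 1)            # small loss: prediction agrees
```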
  • training of the classifier system is obtained by using respective image representations of the training images, which are obtained by using one or more of the above image processing operations.
  • processing is directed to obtain at least an average color of the image, an average details density of the image, an area and/or a predominant color of an object in the image; and the spot of the image including the minimum number of details and the spot of the image including the maximum number of details.
  • the image representation is an image having a color equal to the average original image color.
  • the image representation comprises two separate sections forming a respective area in the image.
  • Those areas may have lightness levels proportional to the average details density along the horizontal, vertical and diagonal directions.
  • the four largest color spots of the original image are superimposed on the background together with two further circles.
  • the area of each of those two circles equals 1% of the image.
  • the positions and colors of the circles correspond respectively to the spots with the minimum and maximum number of details determined in the previously described processing of the image.
  • white is used for the less detailed spot and black is for the more detailed one.
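The assembly of the synthetic image representation described in the bullets above can be sketched as follows. The canvas size, the lightness-scaling rule, the border width and the drawing order are illustrative assumptions, not the patent's exact construction.

```python
def build_representation(size, avg_color, center_density, objects,
                         least_spot, most_spot):
    """objects: list of (cx, cy, radius, color); spots: (cx, cy, radius).
    Returns a size x size canvas of [r, g, b] pixels."""
    # Background: the average colour of the original image
    canvas = [[list(avg_color) for _ in range(size)] for _ in range(size)]

    # Framed central area: lightness scaled by the detail density
    border = size // 8
    for y in range(border, size - border):
        for x in range(border, size - border):
            canvas[y][x] = [int(c * center_density) for c in avg_color]

    def draw_circle(cx, cy, radius, color):
        for y in range(size):
            for x in range(size):
                if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                    canvas[y][x] = list(color)

    for (cx, cy, radius, color) in objects:       # largest objects
        draw_circle(cx, cy, radius, color)
    draw_circle(*least_spot, (255, 255, 255))     # least detailed: white
    draw_circle(*most_spot, (0, 0, 0))            # most detailed: black
    return canvas

rep = build_representation(
    size=32, avg_color=(120, 80, 40), center_density=0.5,
    objects=[(8, 8, 3, (200, 0, 0))],
    least_spot=(24, 8, 2), most_spot=(24, 24, 2))
```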
  • the image representations, also called synthetic images, are used to feed the network of the classifier system.
  • the classifier system comprises a Siamese network and the image representations are fed in pairs to the Siamese network.
  • the same architecture properties as those described in connection with the assessment of the intrinsic image popularity of the training images can be used.
  • the classifier system can make use of the image representation, i.e. of a synthetic image, instead of the image to be published and of which the popularity is to be determined, in order to be trained.
  • categorization of the input image may include processing of the image similar to that of the training images.
  • An image representation of the input image may be thus generated according to the numerical features extracted from the input image.
  • the trained classifier system can be used for determining the popularity of the input image.
  • Preprocessing of the input image can also be performed in order to provide a preliminary classification of the content of the input image.
  • the method of the invention comprises:
  • Examples of high-level features and corresponding image categories include recognized objects such as landscapes, still lives or female figures, just to name a few.
  • high-level features further include information about the author or price of a work of art captured in an image.
  • high-level features may refer to descriptors derived from an image and containing information about the semantics of its contents.
  • the recognition of high-level features may be based on the recognition of low-level features.
  • the assignment of the input image to the relevant image category/categories may reflect high-level recognition results and may be performed either manually or automatically. In some embodiments, the assignment is performed automatically by a classifier system such as the trained neural network based classifier system.
  • the assignment of input images to predefined image categories speeds up the process of evaluating the aesthetics of input images because it greatly reduces the number of images to be processed by only selecting images that are relevant to a given category, and further avoids meaningless comparisons among images belonging to unrelated categories which would only pollute the results of the assessment.
  • categories may be broader or narrower according to the specific requirements of the circumstances.
  • one category may refer to female figures in general or only to profile views of female figures.
  • the trained neural network based classifier system is preferably applied only to input images that have been previously assigned to the same image category or categories.
  • the output of the previous image processing steps is given as input to a two layer neural network. It should be noted that this is a further step with respect to the previous steps mentioned above.
  • the two layer neural network is a separate network with respect to the Siamese neural network and it combines the output results from both the Siamese neural network and the other indicators mentioned above.
  • the indicators can be represented by the previously mentioned operations on the images.
  • This two layer neural network is aimed at estimating the expected popularity (e.g. the amount of feedback in terms of the number of likes) associated with a given image, based on the input features computed during the above described processing of the image.
  • the network is preferably made of 64 and 1 neurons with ReLU and linear activation functions respectively.
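The final two-layer network can be sketched as a plain forward pass: a ReLU layer followed by a single linear neuron, as stated above. The weights below are illustrative toy values, and the hidden layer is shrunk from 64 to 4 units for brevity; they are not trained parameters.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def dense(inputs, weights, biases):
    """weights[j][i] multiplies input i for output unit j."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def predict_popularity(features, w1, b1, w2, b2):
    hidden = relu(dense(features, w1, b1))   # hidden units with ReLU
    return dense(hidden, w2, b2)[0]          # 1 unit, linear activation

# Toy dimensions: 3 input features, 4 hidden units instead of 64
w1 = [[0.5, -0.2, 0.1], [0.0, 0.3, 0.0], [-0.1, 0.0, 0.2], [0.2, 0.2, 0.2]]
b1 = [0.0, 0.1, 0.0, -0.5]
w2 = [[1.0, 0.5, -1.0, 2.0]]
b2 = [0.1]
score = predict_popularity([1.0, 2.0, 3.0], w1, b1, w2, b2)
```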
  • the invention thus solves the proposed problem, achieving numerous advantages including that of providing a method for estimating the popularity of an input image which is reliable without requiring excessive computational resources.
  • the method is also particularly flexible since it allows proper estimation of the popularity of images having different content and directed to different groups of users.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

A computer-implemented method for estimating the popularity of an input image comprises training a neural network based classifier system, receiving said input image, extracting a set of numerical features from said input image, and determining an input image popularity based on the numerical features extracted from said input image with the trained neural network based classifier system. The neural network based classifier system is trained by extracting a set of numerical features from each training image, comprising one or more numerical features selected from: an average color of the image; an average details density of the image; an area and/or a predominant color of an object in the image; and a least detailed spot of the image including the minimum number of details and a most detailed spot including the maximum number of details.

Description

A computer-implemented method for estimating the popularity of an input image
DESCRIPTION
Technical field
The present invention relates to a computer-implemented method for estimating the popularity of digital images.
Background art
Recent years have witnessed an accelerated proliferation of digital photographic images being uploaded to various online photo sharing communities such as Instagram, Flickr, and Reddit. Such images are made accessible through the public websites of the social platforms where they can be rated for aesthetic quality, originality and other characteristics by viewers of the websites.
In this context, some photos turn out to be extremely popular and gain millions of likes and comments, while some are completely ignored. Even for images uploaded by the same user at the same time, their popularity may be substantially different. See Keyan Ding, et al., "Intrinsic Image Popularity Assessment", Proceedings of ACM Conference, 2019, New York.
Thus, to assist in processing of images, there is a need for a system that can accurately predict the potential of a social image to go viral on the Internet.
There has been considerable effort in the field of image quality assessment to design quality metrics that can predict the image popularity automatically. Indeed, it is proven that the popularity of digital images is strongly related to their aesthetic quality which - to some extent - is ruled by well-known photographic practices concerning for example the use of lighting and colors and the image composition.
Some attempts have been made to evaluate aesthetic qualities of images by computer-implemented methods which focus on extracting descriptors from the digital image with good correlation with human preference.
A method for determining the aesthetic quality of an image is described in US 20120269425. The method includes extracting a set of local features from the image, such as gradient and/or color features and generating an image representation which describes the distribution of the local features. A classifier system is used for determining an aesthetic quality of the image based on the computed image representation.
It is evident that the classifier system needs to be trained properly. Despite the proliferation of annotated image data available through photo sharing platforms, which could be used as training data, challenges for image popularity assessment remain.
First, such data is annotated with an intrinsic noise because when dealing with human preference, unanimous consensus is rare. For example, because users often have different opinions about image quality and other metrics, conventional machine learning systems experience difficulty accounting for inconsistencies in labelled data indicating aesthetic quality. While the amount of data used to train an automated system could be increased, this does not always solve the problem. As a result, conventional systems for rating or otherwise classifying images often produce inconsistent or inaccurate results that fail to reflect the subjective appreciation of some users.
Another challenge concerns the design of features to capture human preference. The features currently in use do not always correlate well with human perception. In other words, they are not powerful enough to capture all the visual information required for assessment.
Therefore, there remains a need for a method which can improve image popularity assessment.
Such a need is particularly felt for the purpose of evaluating the aesthetics of the content of a work of art, where a number of works of art exist that represent a similar content, are offered for a similar price and/or were made by the same author as the work of art under evaluation.
Description of the invention
The problem underlying the present invention is that of providing a computer-implemented method which is functionally designed so as to at least partly remedy at least one of the disadvantages encountered with reference to the cited prior art.
Within this problem, an aim of the invention is to provide a method which is capable of providing a meaningful assessment of the popularity of an image.
A further aim of the invention is to provide a method which can be easily implemented by a computer.
This problem is solved and these aims are achieved by the invention by means of a computer-implemented method comprising: training a neural network based classifier system; receiving an input image; extracting from the input image a set of numerical features; and, with the neural network based classifier system, determining the input image popularity based on the numerical features.
It will be appreciated that the computer-implemented method of the invention provides a suitable solution to the problem of estimating the popularity of an input image.
Advantageously, training the classifier system is the first step of the method, followed by a run-time step comprising the remaining operations mentioned above, which are executed when the method is in use once the classifier has been trained.
It should be noted that, in this context, the expression "neural network based classifier system" indicates an artificial neural network which is trained to determine the popularity of the input image based on a set of learned numerical features extracted from the input image.
As used herein a "numerical feature" refers to an identifiable trait or aesthetical feature of a digital image. For example, as detailed further below, a numerical feature can refer to an image attribute including, but not limited to, detected corners, detected edges, colors distribution, average relative luminance, details density along predetermined directions, and so on. As will be appreciated, the input image popularity is closely related to these image attributes.
It will be appreciated that the neural network based classifier system can be trained so as to identify an image popularity using an image representation, rather than the actual image to be classified.
In this manner, the method according to the invention can focus on selected specific features of the image which have been identified as most representative for the assessment of the popularity of an image.
This makes it possible to achieve a more precise classification of the image while requiring fewer computational resources for the computer implementation of the method.
Preferably, the set of numerical features used for training the classifier system and for determining the input image popularity comprises one or more numerical features selected from:
• an average color of the image;
• an average details density of the image;
• an area and/or a predominant color of an object in the image; and
• a least detailed spot and a most detailed spot of the image.
Preferably, at least all the above listed features are included in the set of numerical features used for training the classifier system and for determining the input image popularity.
Preferably, training the classifier system comprises receiving a set of training images, wherein each of said training images has a known image popularity value, said image popularity value indicating the likelihood of an image being popular among viewers of the training images.
This could include the likelihood of the image being popular among the users of a social network system.
It should be noted that, as used herein, the term "social network system" includes both web based social networks and physical social networks, i.e. groups of individuals or organizations that are connected by their interests, activities, etc. Preferably, the training further comprises for each training image: extracting a set of numerical features from the training image; and generating an image representation according to the numerical features extracted from the training image.
As used herein, the expression "generating an image representation according to the numerical features extracted from the training image" refers preferably to image annotation reflecting low- and/or high-level recognition results, such as local descriptors or recognized objects. Advantageously, the numerical features extracted from the training images correspond to the features extracted from the input image. Preferably, the training further comprises training the classifier system with the training image representations and corresponding training image popularity values.
It should be noted that, as used herein, an "image representation" refers to a synthetic image which describes the distribution of a set of numerical features extracted from a training image.
It should be noted that, in this context, the expression "image popularity value" preferably indicates a binarized value which is assigned to a training image based on at least one popularity score which has been manually-assigned to said training image. Further preferably, in the event of a plurality of manually-assigned image popularity scores for the training image, the popularity value is computed by averaging the plurality of manually-assigned image popularity scores. As will be appreciated, the image popularity score indicates the likelihood of an image being popular among its viewers. It should be noted that, in some embodiments, the popularity score may be assigned to the training image by a machine rather than manually.
As will be appreciated, extracting the set of numerical features from the training image may comprise defining a set of regions of the training image and, for each of said regions, generating a local descriptor based on low level features of pixels in the region. In this case, the image representation comprises an aggregation of the local descriptors.
In a preferred exemplary embodiment, generating the image representation comprises one or more operations chosen from the following operations.
In particular, generating the image representation may comprise the operation of assigning to the image representation an average color of the training image.
In alternative or in addition, generating the image representation may comprise the operation of partitioning the image representation into a set of areas which have lightness levels proportional to an average details density of the training image regions which correspond to said areas. Preferably, the set of areas comprises exactly two areas, one framed with respect to the other. Further preferably, the lightness levels are proportional to the average details density along the horizontal and/or vertical and/or diagonal direction(s) of the training image regions which correspond to said areas.
In alternative or in addition, generating the image representation may comprise the operation of representing each of a set of objects of the training image by superimposing on a background of the image representation a circle with an area proportional to the object area and with a color corresponding to the object predominant color. Preferably, the set of objects consists of the four largest objects of the training image.
In alternative or in addition, generating the image representation may comprise the operation of representing the least and the most detailed spots of the training image by superimposing on a background of the image representation a first circle and a second circle having two different colors, respectively.
As will be appreciated, in some embodiments the circles may be replaced by squares or by any other shapes which serve to the purpose of representing an object or a spot on the background of an image.
In a preferred exemplary embodiment, the average details density extracted from each training or input image comprises one or more percentages chosen from:
• a percentage of details of the image versus a smoothed background of the image;
• a percentage of details along a horizontal direction of the image versus total details in the image;
• a percentage of details along a vertical direction of the image versus total details in the image;
• a percentage of details along a diagonal direction of the image versus total details in the image;
• a percentage of details in a center of the image versus details in a border of the image;
• a percentage of details along the horizontal direction in the center versus details along the horizontal direction in the border;
• a percentage of details along the vertical direction in the center versus details along the vertical direction in the border; and
• a percentage of details along the diagonal direction in the center versus details along the diagonal direction in the border.
Advantageously, the classifier system comprises a Siamese neural network. As used herein, a "Siamese neural network" refers to an artificial neural network that uses the same weights while working in tandem on two different input vectors to compute comparable output vectors. It should be noted that feeding the Siamese neural network with the image representations and popularity values of the training images makes it possible to train the classifier so as to accurately estimate the public appreciation of a given image by understanding the driver features that make the image popular.
It will be appreciated that the above listed features can be used in any combination both for generating the training image representations and for generating an input image representation.
Brief description of the drawings
The features and advantages of the invention will be made clearer by the following detailed description of a preferred, but not exclusive, exemplary embodiment thereof, illustrated by way of non-limiting example with reference to the appended drawings, in which:
- Figure 1 is a functional block diagram of inputs to an exemplary classifier system during training and in use;
- Figure 2 is a flow diagram illustrating a method for image popularity assessment in accordance with one aspect of the exemplary embodiment.
Preferred embodiment of the invention
With reference to the Figures, a method for estimating the popularity of an input image according to the present invention is based on a neural network based classifier system.
The classifier system is trained in a first step of the method so that the trained classifier system can be used for assessing the popularity of the input image.
As previously mentioned, the classifier system of the invention does not apply neural networks directly on the relevant image, i.e. on the image in the form that will be published in the social network, but it uses an image representation which is representative of the published image, as will be described in the following.
In order to train the classifier system a plurality of training images are collected during data gathering.
Each of said training images has a known image popularity value which represents the likelihood of being liked by, or more in general being popular among, the users of a social network system. Such a popularity value can be either defined manually, e.g. by counting the actual "likes" received from the users, or determined in another manner, including by other known automatic image popularity assessment methods.
In order to be used as training images, the collected images are subjected to a features extraction and, preferably, to an image processing.
As shown in Figure 1, several processes can be applied on the images in order to extract information contents and numerical features characterizing the image.
In some preferred embodiments, image corners detection is performed. This processing step is aimed at computing the ratio between the number of detected corners in relation to the number of pixels of a given image. This process is preferably based on the algorithm proposed by Chris Harris & Mike Stephens in their 1988 paper and implemented in the OpenCV function cv2.cornerHarris().
The process preferably identifies the image coordinates which correspond to an intensity variation for a displacement in all the image directions. It will be appreciated that the Harris-Stephens corner detection method depends directly on two functions: the window function which locally weights the pixel intensity and the Sobel Derivative which is needed for computing the image intensity derivatives.
These two functions require a set of input parameters that, in preferred embodiments, are respectively set to 2 for the window size and 3 for the Sobel derivative aperture.
In case the Harris-Stephens algorithm is used, the Harris detector free parameter can be set to 0.04.
According to these parameters, corners may be identified as pixels with a value greater than 1e-4.
Extraction of features may also comprise an edge detection processing.
This processing is aimed at estimating the image area associated with object edges.
The ratio between the total number of pixels defined as edges by the edge detection algorithm and the total number of image pixels can be accordingly calculated.
In preferred embodiments the process includes application of the edge detection algorithm proposed by John F. Canny in 1986 and implemented in the OpenCV function cv2.Canny().
This implementation takes two input parameters, minVal and maxVal, which are in preferred embodiments respectively set to 50 and 100.
A further parameter of the above algorithm is the aperture_size of the Sobel kernel. This parameter can be kept equal to its default value.
In some embodiments, the features are extracted by applying a processing of estimating the color distribution of the image.
Advantageously, a color palette based on only 8 color channels (i.e. the Web palette) can be used.
In fact, using 8 color channels makes it possible to keep the dimensionality of the image color representation low.
The feature obtained according to this processing includes the percentage of the total pixel number associated with every channel.
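A sketch of this 8-channel distribution; mapping a pixel to a palette channel by thresholding each RGB channel at 128 is an assumption, since the text does not specify the quantization rule:

```python
def quantize_to_web8(pixel):
    """Map an RGB pixel (0-255 per channel) to one of the 8 palette
    channels (the corner colors of the RGB cube); the threshold of 128
    is an assumption."""
    r, g, b = pixel
    return (r >= 128, g >= 128, b >= 128)

def color_distribution(pixels):
    """Percentage of the total pixel number associated with every channel."""
    counts = {}
    for p in pixels:
        key = quantize_to_web8(p)
        counts[key] = counts.get(key, 0) + 1
    return {k: v / len(pixels) for k, v in counts.items()}

# three red pixels and one blue pixel
dist = color_distribution([(255, 0, 0)] * 3 + [(0, 0, 255)])
```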
In some embodiments, an average relative luminance value of the image can be determined and included among the extracted features.
The relative luminance of the image is preferably calculated by combining the intensities associated with the RGB channels of the image according to L = 0.2126 * R + 0.7152 * G + 0.0722 * B.
An average luminance level of the image can therefore be computed according to the above formula.
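The luminance computation can be sketched directly from the formula above (channel intensities are assumed normalized to [0, 1]):

```python
def relative_luminance(r, g, b):
    """Relative luminance of one RGB pixel, per the formula above."""
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def average_luminance(pixels):
    """Mean relative luminance over an iterable of (R, G, B) tuples."""
    values = [relative_luminance(r, g, b) for r, g, b in pixels]
    return sum(values) / len(values)

# pure white and pure black average to 0.5
avg = average_luminance([(1.0, 1.0, 1.0), (0.0, 0.0, 0.0)])
```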
In some preferred embodiments, extraction processes includes applying a 2D Discrete Wavelets Transform (DWT) to the image.
A single step of DWT makes it possible to assess the quantity of details at a small scale in the horizontal, vertical and diagonal directions.
Preferably, the 2D Discrete Wavelets Transform (DWT) is applied on a grayscale image.
The process may include splitting the image into a central core and an external border, making it possible to recognize where details are distributed: in the horizontal, vertical or diagonal direction.
Preferably, the external border is defined as 28/64 of the image area, i.e. it is a border with a width of 1/8 of the picture dimension.
It will be appreciated that, more in general, the 2D Discrete Wavelets Transform makes it possible to extract one or more of the following features: percentage of details versus smoothed background, percentage of horizontal details vs total details, percentage of vertical details vs total details, percentage of diagonal details vs total details, percentage of details in the center versus details in the border, percentage of horizontal details in the center versus horizontal details in the border, percentage of vertical details in the center versus vertical details in the border, and percentage of diagonal details in the center versus diagonal details in the border.
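A minimal sketch of the single-level decomposition and of the first four percentages listed above; the Haar wavelet is used as the concrete basis and the subband-to-direction naming follows the common LH/HL convention, both of which are assumptions not fixed by the text:

```python
import numpy as np

def haar_dwt2(img):
    """One level of a 2-D Haar DWT on an even-sized grayscale array,
    returning the smoothed background and three detail subbands."""
    a, b = img[0::2, 0::2], img[0::2, 1::2]
    c, d = img[1::2, 0::2], img[1::2, 1::2]
    ll = (a + b + c + d) / 4.0   # smoothed background
    lh = (a + b - c - d) / 4.0   # horizontal details (vertical variation)
    hl = (a - b + c - d) / 4.0   # vertical details (horizontal variation)
    hh = (a - b - c + d) / 4.0   # diagonal details
    return ll, lh, hl, hh

def detail_percentages(img):
    """Detail-density percentages of a non-flat grayscale image."""
    ll, lh, hl, hh = haar_dwt2(np.asarray(img, dtype=float))
    h, v, d = (np.abs(s).sum() for s in (lh, hl, hh))
    total = h + v + d
    smooth = np.abs(ll).sum()
    return {
        "details_vs_background": total / (total + smooth),
        "h_vs_total": h / total,
        "v_vs_total": v / total,
        "d_vs_total": d / total,
    }

# horizontal stripes: all detail energy falls in the horizontal subband
stripes = np.zeros((8, 8))
stripes[0::2, :] = 1.0
pct = detail_percentages(stripes)
```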
A further process of features extraction includes transforming the image to encode the objects' color distribution within the image field of view.
In a preferred exemplary embodiment, this transformation is based on the Satoshi Suzuki et al. (1985) technique, which identifies every foreground object by its closed border and its predominant color. See Satoshi Suzuki et al., "Topological structural analysis of digitized binary images by border following", Computer Vision, Graphics, and Image Processing, Volume 30, Issue 1, April 1985, Pages 32-46. Preferably, each object is represented by a circle with the same object area and centered in its center of gravity.
Further preferably, circles characterized by similar color properties and location are aggregated. Advantageously, two circles are aggregated when their distance is smaller than twice the radius of the largest circle and when a color distance between the two circles is smaller than 33.
As used herein, a "color distance" refers to a distance in the color domain which is defined as
Dc = √((R1 - R2)² + (G1 - G2)² + (B1 - B2)²), wherein R, G, and B indicate the levels of - respectively - red, green and blue, which are the closest to the original levels in the web-safe color coding.
As will be appreciated, this process makes it possible to extract one or more of the following features: one or more circles, the radius of each circle, the coordinates of the center of each circle, and the color of each circle. In the event of a plurality of circles, the circles of different colors may be ordered by radius.
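A sketch of the aggregation test combining the two conditions above (distance smaller than twice the radius of the largest circle, and color distance smaller than 33); representing a circle as an (x, y, radius, color) tuple is an assumption:

```python
import math

def color_distance(c1, c2):
    """Dc = sqrt((R1-R2)^2 + (G1-G2)^2 + (B1-B2)^2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))

def should_aggregate(circle1, circle2):
    """True when two circles are close enough in space and in color
    to be aggregated; a circle is (x, y, radius, (R, G, B))."""
    x1, y1, r1, col1 = circle1
    x2, y2, r2, col2 = circle2
    close = math.hypot(x1 - x2, y1 - y2) < 2 * max(r1, r2)
    return close and color_distance(col1, col2) < 33

near = should_aggregate((0, 0, 10, (255, 0, 0)), (5, 0, 8, (250, 0, 0)))
far = should_aggregate((0, 0, 10, (255, 0, 0)), (100, 0, 8, (255, 0, 0)))
```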
In some preferred embodiments, extraction processes include using a Wavelet Transform to perform a two level DWT.
At the second level of the two level DWT, the process may include summing the three layers of details (horizontal, vertical and diagonal) and convolving the resulting image with a 2D step function with an area which may be equal to substantially 1% of the entire image. The two spots associated with the maximum and minimum detail presence are then identified. It will be appreciated that the features extracted by this process include two spots with the maximum and the minimum number of details. Preferably, the two spots are rectangular in shape. Further preferably, the spots are exactly two and, in the event that more than one spot is identified in one of the two categories, the process maintains the spot which has been detected first.
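A sketch of locating the two spots; an integral image is used here as an efficient equivalent of convolving with a 2D box (step) function, and square spots stand in for the rectangular ones mentioned above - both are implementation assumptions:

```python
import numpy as np

def detail_spots(detail_map, frac=0.01):
    """Locate the most and least detailed spots by sliding a box
    covering ~1% of the image area over the summed detail map."""
    h, w = detail_map.shape
    k = max(1, int(round((frac * h * w) ** 0.5)))  # box side length
    # sliding-window sums via a cumulative sum (integral image)
    ii = np.pad(np.abs(detail_map), ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    sums = ii[k:, k:] - ii[:-k, k:] - ii[k:, :-k] + ii[:-k, :-k]
    most = np.unravel_index(np.argmax(sums), sums.shape)
    least = np.unravel_index(np.argmin(sums), sums.shape)
    return least, most  # top-left corners of the two spots

# all the detail energy is concentrated in one small patch
dm = np.zeros((20, 20))
dm[2:4, 2:4] = 5.0
least, most = detail_spots(dm)
```

Note that `np.argmin` keeps the first candidate when several windows tie, matching the "spot detected first" rule above.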
Extraction of features may also comprise a 2D Fourier Transform of the image.
Advantageously, this process detects frequencies in the horizontal direction and/or in the vertical direction and/or in the diagonal direction.
In a preferred exemplary embodiment, this process makes it possible to extract the first eight frequencies in the horizontal and vertical directions. Preferably, these frequencies are ordered by descending amplitude.
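A possible sketch of this extraction with NumPy's FFT; reading purely horizontal frequencies from the first row of the spectrum (and purely vertical ones from the first column) and excluding the DC term are interpretation choices not fixed by the text:

```python
import numpy as np

def dominant_frequencies(img, n=8):
    """First n horizontal and vertical frequency indices of a grayscale
    image, ordered by descending spectral amplitude (DC excluded)."""
    spec = np.abs(np.fft.fft2(img))
    horiz = spec[0, 1:img.shape[1] // 2]  # purely horizontal frequencies
    vert = spec[1:img.shape[0] // 2, 0]   # purely vertical frequencies
    top_h = (np.argsort(horiz)[::-1][:n] + 1).tolist()
    top_v = (np.argsort(vert)[::-1][:n] + 1).tolist()
    return top_h, top_v

# a sinusoid with 4 periods along the horizontal direction
x = np.arange(32)
img = np.tile(np.sin(2 * np.pi * 4 * x / 32), (32, 1))
top_h, top_v = dominant_frequencies(img)
```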
A further process includes extracting from the image a set of hidden numerical features related to the content of the image and its arrangement within the field of view.
Preferably, those features are computed in terms of ResNet50, which is a neural network pre-trained for object recognition on the ImageNet dataset.
In a preferred exemplary embodiment, the features computed during this image processing step are organized in four sets of numerical values (preferably 2048 numerical values each), respectively computed as the outputs of the layers at positions -35, -55 and -75 and of the last layer of the network. Those features are combined by a fully connected layer (preferably a fully connected layer with a 2048-value output) and with a ReLU activation function. In some embodiments, training images can be subjected to further processing directed to determine intrinsic features thereof.
These features may define further data characterizing the images in connection to possible parameters useful for the assessment of their popularity. Also, the following processing could be useful in order to determine the image popularity value of the training images in case this is not calculated manually.
In some embodiments, processing of the training image includes an assessment of the image aesthetic properties.
The aesthetic properties of an image may be advantageously determined in terms of the numeric features. Preferably, such aesthetic properties are computed by means of the previously discussed ResNet50 Network. In preferred embodiments, the features computed with the ResNet50 Network are given as input to a neural network. Preferably, the neural network is composed of two layers: the former made of 64 neural units equipped with a ReLU activation function and a drop-out ratio of 0.2, while the latter made of only one neural unit with linear activation.
This network may be trained on further images, as collections of pre-scored images are available in the art.
For example, training may occur by means of the AVA dataset, which provides a set of more than 250000 images together with a distribution of scores for each image that synthesizes the aesthetic judgements of hundreds of amateur and professional photographers.
In some embodiments, training image processing may include an assessment of the image quality. Evaluation of the visual quality of the training images is preferably obtained by means of a Convolutional Neural Network (CNN).
The inputs of the network may consist of nonoverlapping 32 x 32 patches from a grayscale image, to which contrast normalization is applied. The Convolutional Neural Network may predict the quality score for each patch and average these scores to obtain a quality estimation for the overall image.
In some embodiments the used network consists of five layers.
Preferably, the network includes a first layer which is a convolutional layer which filters the input with 50 kernels each of size 7 x 7 with a stride of 1 pixel. The convolutional layer produces 50 feature maps each of size 26 x 26, followed by a pooling operation that reduces each feature map to one max and one min.
The network may also include two fully connected layers of 800 nodes each coming after the pooling, equipped with a ReLU activation function and a drop-out ratio of 0.5.
The last layer is a simple linear regression with a one-dimensional output that gives the score.
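The layer bookkeeping implied by this architecture can be checked in a few lines; the helper name and its defaults simply restate the figures given above (32 x 32 patches, 50 kernels of size 7 x 7, stride 1, min/max pooling):

```python
def quality_cnn_shapes(patch=32, kernel=7, stride=1, n_kernels=50):
    """Feature-map bookkeeping for the patch-based quality CNN: a
    32 x 32 patch filtered by 7 x 7 kernels at stride 1 yields
    26 x 26 maps; min/max pooling keeps 2 values per map."""
    fmap = (patch - kernel) // stride + 1
    pooled_features = 2 * n_kernels
    return fmap, pooled_features
```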
As explained in connection to the aesthetic assessment, training of the network may occur by using images available in the art.
For example the network may be trained by using the TID2013 dataset which contains 25 reference images and 3000 distorted images.
In such case and in similar cases, each image may be associated with a Mean Opinion Score (MOS) in the range [0, 9], where higher MOS indicates higher quality. As previously mentioned, processing of the training images can be also used for determining an intrinsic popularity of the training image.
Since the training images are supposed to be used for training the neural network based classifier system, it could be possible to use a set of images which has been subjected to assessment of their popularity according to other methods known in the art.
In some embodiments, processing of the images for determining their intrinsic popularity includes using a Siamese neural network.
An example of such a Siamese network is disclosed in Ding et al., "Intrinsic Image Popularity Assessment", published for Conference'19, October 2019, Nice, France.
Preferably, a pairwise learning-to-rank approach is used in the Siamese neural network.
In preferred embodiments, according to the Siamese neural network approach, the inputs include two RGB images with high and low intrinsic popularity scores respectively.
The Siamese network includes two network streams. Preferably, the two streams have the same architecture and share their weights during the training phase.
The network may be a modified version of the previously discussed ResNet50 network by replacing the last layer with a fully connected layer of one output, which represents the predicted intrinsic popularity score. Preferably, the predicted score difference of the two images is converted to a probability using a logistic function, which is compared with the ground-truth binary label (1 if the first image is more popular than the second one, 0 otherwise).
In some embodiments, the images used in the Siamese network are rescaled to 256, from which a 224 x 224 x 3 sub-image is randomly cropped.
As previously explained, the training of the Siamese network may be carried out on a custom collection of social media posts, by optimizing the cross entropy function.
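The pairwise training signal described above - the predicted score difference squashed by a logistic function and compared with the binary ground-truth label via cross entropy - can be sketched as:

```python
import math

def pairwise_popularity_loss(score1, score2, first_is_more_popular):
    """Cross-entropy loss on the logistic-squashed score difference
    of the two Siamese streams."""
    p = 1.0 / (1.0 + math.exp(-(score1 - score2)))  # P(image1 > image2)
    label = 1.0 if first_is_more_popular else 0.0   # ground-truth binary label
    eps = 1e-12  # numerical guard against log(0)
    return -(label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps))

well_ranked = pairwise_popularity_loss(10.0, 0.0, True)   # near zero
mis_ranked = pairwise_popularity_loss(0.0, 10.0, True)    # large penalty
```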
With reference again to Figure 2, training of the classifier system is obtained by using respective image representations of the training images, which are obtained by using one or more of the above image processing operations. In a preferred embodiment, processing is directed to obtain at least an average color of the image, an average details density of the image, an area and/or a predominant color of an object in the image, and the spot of the image including the minimum number of details and the spot of the image including the maximum number of details.
In preferred embodiments, the image representation is an image having a color equal to the average original image color.
Preferably, the image representation comprises two separate sections forming a respective area in the image.
Those areas, preferably one framed with respect to the other, may have lightness levels proportional to the average details density along the horizontal, vertical and diagonal directions.
In some embodiments the four largest color spots of the original image are superimposed on the background together with two other circles.
Preferably, the area of each of those two circles equals 1% of the image. In some embodiments, the positions and colors of the circles correspond respectively to the spots with the minimum and maximum number of details determined in the previously described processing of the image.
Preferably, white is used for the least detailed spot and black for the most detailed one.
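The construction of such a synthetic image can be sketched as follows; the canvas size, the circle-drawing helper and the example colors are illustrative assumptions:

```python
import numpy as np

def draw_circle(canvas, center, radius, color):
    """Paint a filled circle onto an H x W x 3 canvas (in place)."""
    h, w, _ = canvas.shape
    yy, xx = np.ogrid[:h, :w]
    mask = (yy - center[0]) ** 2 + (xx - center[1]) ** 2 <= radius ** 2
    canvas[mask] = color

def make_representation(size, avg_color, spots):
    """Background = average color of the original image; each spot is
    drawn as a circle whose area is ~1% of the representation."""
    h, w = size
    canvas = np.full((h, w, 3), avg_color, dtype=np.uint8)
    radius = int(round((0.01 * h * w / np.pi) ** 0.5))
    for center, color in spots:
        draw_circle(canvas, center, radius, color)
    return canvas

rep = make_representation(
    (100, 100), (120, 90, 60),
    [((25, 25), (255, 255, 255)),   # white = least detailed spot
     ((75, 75), (0, 0, 0))])        # black = most detailed spot
```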
The image representations, also called synthetic images, are used to feed the network of the classifier system.
Preferably, the classifier system comprises a Siamese network and the image representations are fed in pairs to the Siamese network.
According to a preferred embodiment, the same architecture properties as those described in connection with the assessment of the Intrinsic Image Popularity of the training images can be used.
As a matter of fact, according to an aspect of the invention the classifier system can make use of the image representation, i.e. of a synthetic image, instead of the image to be published and of which the popularity is to be determined, in order to be trained.
According to a further aspect of the invention categorization of the input image may include similar processing of the image as of the training images.
In other words, also for the input image extraction of numerical features is performed. Preferably, the same features of the training image are also extracted for the input image.
An image representation of the input image may be thus generated according to the numerical features extracted from the input image.
In this manner the trained classifier system can be used for determining the input image popularity.
Preprocessing of the input image can also be performed in order to provide a preliminary classification of the content of the input image.
This makes it possible to verify that the input image content is appropriate and not out-of-context with respect to the current application of the method.
In some embodiments, the method of the invention comprises:
- defining a plurality of image categories reflecting high-level features of a set of input images;
- assigning the input image to the relevant image category or categories based on high-level recognition results extracted from the input image; and
- with the trained neural network based classifier system, determining the input image popularity relative to other input images assigned to the same image category or categories.
Examples of high-level features and corresponding image categories include recognized objects such as landscapes, still lives or female figures, just to name a few.
In some embodiments, high-level features further include information about the author or price of a work of art captured in an image.
More generally, the term "high-level features" may refer to descriptors derived from an image and containing information about the semantic of its contents.
For a given image, the recognition of high-level features may be based on the recognition of low-level features.
The assignment of the input image to the relevant image category/categories may reflect high-level recognition results and may be performed either manually or automatically. In some embodiments, the assignment is performed automatically by a classifier system such as the trained neural network based classifier system.
It will be appreciated that such a method is specially adapted for evaluating the aesthetics of the content of a work of art, by comparing digital images of different works of art that belong to the same image category, e.g. the landscapes category.
As a matter of fact, the assignment of input images to predefined image categories speeds up the process of evaluating the aesthetics of input images because it greatly reduces the number of images to be processed by only selecting images that are relevant to a given category, and further avoids meaningless comparisons among images belonging to unrelated categories which would only pollute the results of the assessment.
Such a problem is neither addressed nor solved by the known image popularity assessment techniques, which rather focus on identifying the most popular images overall, i.e. regardless of the type of content they represent. This inevitably results in the comparison of images belonging to different content categories, e.g. landscapes with still lives, which is clearly meaningless for the purpose of evaluating the aesthetics of a landscape image relative to other landscape images, for example.
In the context of the invention, categories may be broader or narrower according to the specific requirements of the circumstances. For instance, one category may refer to female figures in general or only to profile views of female figures.
It should be noted that, for achieving a meaningful result with computational efficiency, the trained neural network based classifier system is preferably applied only to input images that have been previously assigned to the same image category or categories.
In some embodiments, the output of the previous image processing steps is given as input to a two-layer neural network. It should be noted that this is a further step with respect to the steps mentioned above. In particular, in a preferred embodiment, the two-layer neural network is separate from the Siamese neural network and combines the output of the Siamese neural network with the other indicators mentioned above. Preferably, the indicators are represented by the previously mentioned operations on the images.
This two-layer neural network estimates the expected popularity (e.g. the amount of feedback in terms of the number of likes) associated with a given image, based on the input features computed during the above-described processing of the image.
The network preferably consists of a hidden layer of 64 neurons with a ReLU activation function and an output layer of a single neuron with a linear activation function.
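A minimal sketch of such a 64-plus-1 head as a plain NumPy forward pass is given below. The document does not fix the input dimensionality, so the 48-feature input (assumed here to be a 32-dimensional Siamese output concatenated with 16 indicator values) and the random weights are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def popularity_head(features, w1, b1, w2, b2):
    """Two-layer head: 64 ReLU units followed by a single linear output
    neuron, mapping the combined feature vector of an image to a scalar
    popularity estimate."""
    hidden = relu(features @ w1 + b1)  # (n, 64)
    return hidden @ w2 + b2            # (n, 1)

# Hypothetical sizes: 32-dim Siamese embedding + 16 indicator features.
n_in = 32 + 16
w1 = rng.normal(scale=0.1, size=(n_in, 64))
b1 = np.zeros(64)
w2 = rng.normal(scale=0.1, size=(64, 1))
b2 = np.zeros(1)

x = rng.normal(size=(5, n_in))  # feature vectors for five images
scores = popularity_head(x, w1, b1, w2, b2)
print(scores.shape)  # (5, 1)
```

In practice the weights would of course be learned from the training image representations rather than drawn at random; the sketch only shows the shape of the computation.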
The invention thus solves the proposed problem, achieving numerous advantages including that of providing a method for estimating the popularity of an input image which is reliable without requiring excessive computational resources.
The method is also particularly flexible, since it allows proper estimation of the popularity of images having different content and directed to different groups of users.

Claims

1. A computer-implemented method for estimating the popularity of an input image, comprising:
• training a neural network based classifier system;
• receiving said input image;
• extracting a set of numerical features from said input image;
• with the trained neural network based classifier system, determining an input image popularity based on the numerical features extracted from said input image; wherein training said neural network based classifier system comprises:
• receiving a plurality of training images, wherein each of said training images has a known image popularity value, said image popularity value indicating the likelihood of an image being popular among viewers of the training images; and
• for each training image:
- extracting said set of numerical features from the training image,
- generating a training image representation according to the numerical features extracted from said training image;
• training the neural network based classifier system with the training image representations and corresponding training image popularity values of said training images; wherein the set of numerical features extracted from each training or input image comprises one or more numerical features selected from:
• an average color of the image;
• an average details density of the image;
• an area and/or a predominant color of an object in the image; and
• a least detailed spot including the minimum number of details and a most detailed spot of the image including the maximum number of details.
2. The method according to claim 1, wherein extracting the set of numerical features from the training image comprises defining a set of regions of said training image and, for each of said regions, generating a local descriptor based on low-level features of pixels in the region, and wherein the training image representation comprises an aggregation of said local descriptors.
3. The method according to claim 1 or 2, wherein generating said training image representation comprises one or more operations chosen from:
• assigning to said training image representation an average color of said training image;
• partitioning said training image representation into a set of areas which have lightness levels proportional to an average details density of the training image regions which correspond to said areas;
• representing each of a set of objects of said training image by superimposing on a background of said training image representation a circle with an area proportional to the object area and with a color corresponding to the object predominant color; and
• representing the least and the most detailed spots of said training image by superimposing on a background of said training image representation a first circle and a second circle having two different colors, respectively.
4. The method according to claim 3, wherein said set of areas comprises exactly two areas, one framed with respect to the other.
5. The method according to claim 3 or 4, wherein said set of objects consists of the four largest objects of said training image.
6. The method according to any of the preceding claims, wherein the average details density extracted from said training or input image comprises one or more percentages chosen from:
• a percentage of details of the image versus a smoothed background of the image;
• a percentage of details along a horizontal direction of the image versus total details in the image;
• a percentage of details along a vertical direction of the image versus total details in the image;
• a percentage of details along a diagonal direction of the image versus total details in the image;
• a percentage of details in a center of the image versus details in a border of the image;
• a percentage of details along the horizontal direction in the center versus details along the horizontal direction in the border;
• a percentage of details along the vertical direction in the center versus details along the vertical direction in the border; and
• a percentage of details along the diagonal direction in the center versus details along the diagonal direction in the border.
7. The method according to any of the preceding claims, wherein said neural network based classifier system comprises a Siamese neural network, the training images being grouped in pairs, so as to achieve a pairwise learning-to-rank approach.
8. The method according to claim 7, when depending on claim 3, wherein the output of the Siamese neural network is given as input to a two-layer neural network combining the output results from both the Siamese neural network and the other results provided by said one or more operations.
9. The method according to any of the preceding claims, further comprising filtering a set of input images to identify said input image.
10. The method according to any of the preceding claims, wherein extracting said set of numerical features from the training image and/or from the input image includes applying a 2D Discrete Wavelet Transform (DWT) to the training image and/or to the input image, respectively.
11. The method according to any of the preceding claims, wherein image representation refers to a synthetic image which describes the distribution of a set of numerical features extracted from a training image.
12. The method according to any of the preceding claims, wherein generating the training image representation according to the numerical features extracted from the training image entails image annotation reflecting low- and/or high-level recognition results.
13. The method according to any of the preceding claims, further comprising:
• defining a plurality of image categories reflecting high-level features of a set of input images;
• assigning said input image to the relevant image category or categories based on high-level recognition results extracted from said input image; and
• with the trained neural network based classifier system, determining the input image popularity relative to other input images assigned to the same image category or categories.
14. The method according to claim 13, wherein the high-level features and corresponding image categories refer to recognized objects contained in the input images.
15. The method according to claim 13 or 14, wherein the trained neural network based classifier system is applied only to input images that have been previously assigned to the same image category or categories.
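Claim 10 above recites a 2D Discrete Wavelet Transform and claim 6 the detail-density percentages derived from it. The sketch below, which is illustrative and not the claimed implementation, computes a one-level 2D Haar DWT in plain NumPy and derives some of the recited percentages from the absolute energies of the detail bands; the mapping of bands to horizontal/vertical/diagonal details follows one common convention and may be labelled differently in other implementations.

```python
import numpy as np

def haar_dwt2(img):
    """One-level 2D Haar DWT: average/difference along rows, then along
    columns, yielding the approximation band LL and three detail bands.
    Assumes an image with even height and width."""
    a = (img[0::2, :] + img[1::2, :]) / 2.0   # row lowpass
    d = (img[0::2, :] - img[1::2, :]) / 2.0   # row highpass
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0      # smooth background
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0      # variation across columns
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0      # variation across rows
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0      # diagonal variation
    return ll, lh, hl, hh

def detail_percentages(img):
    """A few of the percentages listed in claim 6, computed here (as an
    assumption) from the summed absolute values of the DWT bands."""
    ll, lh, hl, hh = haar_dwt2(img.astype(float))
    e_h, e_v, e_d = (np.abs(b).sum() for b in (lh, hl, hh))
    total = e_h + e_v + e_d + np.finfo(float).eps  # eps guards flat images
    smooth = np.abs(ll).sum()
    return {
        "details_vs_background": total / (total + smooth),
        "horizontal_vs_total": e_h / total,
        "vertical_vs_total": e_v / total,
        "diagonal_vs_total": e_d / total,
    }

# A vertically striped image varies only along the horizontal direction,
# so all detail energy ends up in the corresponding band.
img = np.tile(np.array([0.0, 1.0]), (8, 4))   # 8x8, alternating columns
print(round(detail_percentages(img)["horizontal_vs_total"], 6))  # 1.0
```

The center-versus-border percentages of claim 6 would follow the same pattern, restricting the band sums to a central window and its complement before taking the ratios.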
PCT/EP2021/077473 2020-10-05 2021-10-05 A computer-implemented method for estimating the popularity of an input image WO2022074017A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IT102020000023413 2020-10-05
IT202000023413 2020-10-05

Publications (1)

Publication Number Publication Date
WO2022074017A1 true WO2022074017A1 (en) 2022-04-14

Family

ID=74046065

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2021/077473 WO2022074017A1 (en) 2020-10-05 2021-10-05 A computer-implemented method for estimating the popularity of an input image

Country Status (1)

Country Link
WO (1) WO2022074017A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120269441A1 (en) * 2011-04-19 2012-10-25 Xerox Corporation Image quality assessment
US20120269425A1 (en) 2011-04-19 2012-10-25 Xerox Corporation Predicting the aesthetic value of an image

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DING ET AL., INTRINSIC IMAGE POPULARITY ASSESSMENT, 19 October 2019 (2019-10-19)
KEYAN DING ET AL.: "Intrinsic Image Popularity Assessment", PROCEEDINGS OF ACM CONFERENCE, 2019
SATOSHI SUZUKI ET AL.: "Topological structural analysis of digitized binary images by border following", COMPUTER VISION, GRAPHICS, AND IMAGE PROCESSING, vol. 30, 1 April 1985 (1985-04-01), pages 32 - 46, XP001376400

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21786957

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 11.07.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21786957

Country of ref document: EP

Kind code of ref document: A1